CN112101208A - Feature series fusion gesture recognition method and device for elderly people - Google Patents


Info

Publication number
CN112101208A
CN112101208A (application CN202010965987.6A)
Authority
CN
China
Prior art keywords
gesture
image
points
point
feature
Prior art date
Legal status
Pending
Application number
CN202010965987.6A
Other languages
Chinese (zh)
Inventor
罗晓君
杨金水
罗湘喜
孙瑜
Current Assignee
Jiangsu Huiming Science And Technology Co ltd
Original Assignee
Jiangsu Huiming Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Jiangsu Huiming Science And Technology Co ltd filed Critical Jiangsu Huiming Science And Technology Co ltd
Priority to CN202010965987.6A
Publication of CN112101208A
Legal status: Pending

Classifications

    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/253 Fusion techniques of extracted features
    • G06V10/56 Extraction of image or video features relating to colour
    • G06V40/113 Recognition of static hand signs
    • G06V40/161 Human faces: detection; localisation; normalisation
    • G06V40/168 Human faces: feature extraction; face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a feature series fusion gesture recognition method and device for elderly people. A feature series fusion gesture recognition method for elderly people comprises the following steps: converting the RGB image into a YCbCr image by using a color space model, and segmenting the image by using an ellipse model in a skin color model to obtain a gesture part; performing series fusion on HOG and LBP characteristics by using a series characteristic fusion method, describing gesture characteristics from two angles of edges and textures, and adopting gesture classification recognition based on an SVM; and judging whether the recognized gesture is effective or not by adopting a face verification method and a head posture estimation method, if so, recognizing the user requirement by utilizing the gesture, and otherwise, judging that the recognized gesture is ineffective. The invention judges the current demand of the old through the hand action of the old, converts the care demand into different hand actions, indirectly solves the problem that the old cannot clearly express the care demand by language, and simultaneously provides a simple and easy expression mode for the old.

Description

Feature series fusion gesture recognition method and device for elderly people
Technical Field
The invention relates to the technical field of gesture recognition, and in particular to a feature series fusion gesture recognition method and device for elderly people.
Background
As people age, their physical functions degrade, and because of slurred speech elderly people often cannot clearly and intuitively express daily living care needs such as defecation, eating and taking medication.
The gesture is an important way for people to communicate information, and rich semantic information can be expressed through hand motions. Gesture recognition is the process of tracking and recognizing performed gestures and converting them into words or sentences that express semantic information; it is mainly divided into static gesture recognition and dynamic gesture recognition. In gesture recognition research at home and abroad, many early works depended on various hardware devices to acquire specific gesture information.
With breakthroughs in computer hardware and image processing theory, gesture recognition based on computer vision has gradually become mainstream: the whole gesture recognition process can be completed with only a camera, a common hardware device. For example, one approach combines the distance between the fingertips and the palm centroid with a linear ANN to recognize several typical gestures. Another applies an Extreme Learning Machine (ELM) pattern recognition algorithm to recognize monocular-vision-based LIBRAS signs and obtains a high recognition rate, although its accuracy may change under the influence of illumination. Another is a gesture learning system based on two-dimensional image sampling and splicing, which classifies and identifies gestures by sampling and concatenating gesture demonstration videos to construct training data. Yet another is a gesture recognition system designed and studied on the Kinect, which recognizes 12 letters using the length and direction vector of the fingers and the vector angle between the fingers and the palm; although the system achieves a high recognition rate, only open gesture shapes were studied.
Disclosure of Invention
The invention aims to provide a feature series fusion gesture recognition method for elderly people that establishes a gesture recognition algorithm based on elliptical-model gesture segmentation and fused HOG and LBP feature extraction, introduces a face verification solution to judge the validity of a gesture, and judges the care requirement of the elderly by using the finally obtained valid gesture. Based on this purpose, the technical scheme adopted by the invention is as follows:
a feature series fusion gesture recognition method for elderly people is characterized by comprising the following steps:
step S1, gesture image segmentation: converting the RGB image into a YCbCr image by using a color space model, and then segmenting the image by using an ellipse model in a skin color model to obtain a gesture part;
step S2, extracting gesture features: performing series fusion on HOG and LBP characteristics by using a series characteristic fusion method, describing gesture characteristics from two angles of edges and textures, and adopting gesture classification recognition based on an SVM;
step S3, effective gesture recognition: and judging whether the recognized gesture is effective or not by adopting a face verification method and a head posture estimation method, if so, recognizing the user requirement by utilizing the gesture, and otherwise, judging that the recognized gesture is ineffective.
Further, the step S1 includes a step S11, in which a monocular camera is used to collect images for face detection;
step S12, firstly, analyzing the color space and mapping to the proper color space; then, analyzing a skin color model, and segmenting a skin color area; then, analyzing noise interference, and excluding skin color and skin color-like connected domains outside the hand region; finally, the hand area is intercepted.
Further, in step S12, after converting the RGB color space into the YCbCr color space, the Y luminance component and the CbCr color component are separately processed, and the expression for converting from the RGB color space into the YCbCr color space is as follows:
$$\begin{bmatrix} Y\\ C_b\\ C_r \end{bmatrix}=\begin{bmatrix} 0.299 & 0.587 & 0.114\\ -0.169 & -0.331 & 0.500\\ 0.500 & -0.419 & -0.081 \end{bmatrix}\begin{bmatrix} R\\ G\\ B \end{bmatrix}+\begin{bmatrix} 0\\ 128\\ 128 \end{bmatrix}$$
modeling the skin color by adopting the elliptical model to obtain a segmented binary image, wherein the elliptical model is described by the following formula:
$$\frac{\big[(x-c_x)\cos\theta+(y-c_y)\sin\theta\big]^2}{a^2}+\frac{\big[-(x-c_x)\sin\theta+(y-c_y)\cos\theta\big]^2}{b^2}=1$$
where (x, y) is a boundary point of the ellipse, (c_x, c_y) is the center of the ellipse, a is the major axis of the ellipse, b is the minor axis of the ellipse, and θ is the rotation angle of the ellipse;
searching the binary image to obtain the outline of each skin color and skin color-like connected domain, and eliminating the interference of the skin color-like region by using an area operator; and finally, removing the interference of the face region by filtering a connected domain containing a face rectangular frame, and only leaving a pure hand region.
Further, in step S2, the serial feature fusion method is as follows: suppose a gesture image img_i generates a feature vector a_i after HOG feature extraction, and a_i is mapped by PCA dimensionality reduction to a feature vector A_i; after circular-neighbourhood, rotation-invariant, uniform LBP feature extraction, img_i generates a feature vector B_i; the final fused feature vector of the image is then represented as:
C_i = [A_i  B_i]   (2.1).
further, the face verification technology adopts face recognition verification based on ResNet; the face recognition firstly needs to extract features, namely, images are collected through a camera, then a face area is positioned through an SSD face detection network, then face key points are searched through CLM feature point positioning, then the face area is aligned, and finally the feature description of the face is obtained through a feature extraction network; the face alignment adopts affine transformation and a bilinear interpolation method, the affine transformation is linear transformation from two-dimensional coordinates to two-dimensional coordinates, and the mathematical model is as follows:
$$\begin{cases} x' = a_{00}x + a_{01}y + b_{00} \\ y' = a_{10}x + a_{11}y + b_{01} \end{cases} \qquad (3.1)$$
where (x ', y') is the mapped point of (x, y) after affine transformation, the homogeneous coordinate representation of the transformation is:
$$\begin{bmatrix} x'\\ y'\\ 1 \end{bmatrix}=M\begin{bmatrix} x\\ y\\ 1 \end{bmatrix},\qquad M=\begin{bmatrix} a_{00} & a_{01} & b_{00}\\ a_{10} & a_{11} & b_{01}\\ 0 & 0 & 1 \end{bmatrix} \qquad (3.2)$$
where M is the affine transformation matrix and contains 6 unknown variables: (a_00, a_01, a_10, a_11) are the linear transformation parameters and (b_00, b_01) are the translation parameters; the affine transformation matrix formula is X' = XH, where X' is the known matrix formed by the reference points of the standard frontal image, X is the known matrix formed by the reference points of the image to be aligned, and H is the unknown affine transformation parameter matrix, which can be solved as:
$$H = (X^{T}X)^{-1}X^{T}X' \qquad (3.4)$$
after H is solved, carrying out affine transformation on each pixel point of the whole image to be aligned, and combining to obtain a corrected result image;
the bilinear interpolation method mainly aims to solve the problem of deformation and distortion caused by image size conversion, i.e. enlargement or reduction; it calculates the pixel value of the point to be solved by finding the four integer pixel points closest to the corresponding coordinate (i, j) and then carrying out linear interpolation in each of the two directions;
suppose the pixel values of the points Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2) in the source image are known, and the pixel value of a point P(x, y) is to be found; first, linear interpolation is carried out in the x direction between Q11(x1, y1) and Q21(x2, y1) to obtain the pixel value I(R_1), and between Q12(x1, y2) and Q22(x2, y2) to obtain the pixel value I(R_2), namely:
$$I(R_1)\approx\frac{x_2-x}{x_2-x_1}I(Q_{11})+\frac{x-x_1}{x_2-x_1}I(Q_{21}),\qquad I(R_2)\approx\frac{x_2-x}{x_2-x_1}I(Q_{12})+\frac{x-x_1}{x_2-x_1}I(Q_{22}) \qquad (3.5)$$
where point R_1 has coordinates (x, y1) and point R_2 has coordinates (x, y2);
then linear interpolation is carried out in the y direction between R_1 and R_2 to calculate the pixel value I(P) at position (x, y), namely:
$$I(P)\approx\frac{y_2-y}{y_2-y_1}I(R_1)+\frac{y-y_1}{y_2-y_1}I(R_2) \qquad (3.6)$$
after the pixel values of the 4 integer coordinate points adjacent to any point (x, y) of the image are known, the pixel value at the coordinate (x, y) can be obtained by bilinear interpolation as follows:
$$I(x,y)\approx\frac{1}{(x_2-x_1)(y_2-y_1)}\Big[I(Q_{11})(x_2-x)(y_2-y)+I(Q_{21})(x-x_1)(y_2-y)+I(Q_{12})(x_2-x)(y-y_1)+I(Q_{22})(x-x_1)(y-y_1)\Big] \qquad (3.7)$$
the pixel value of each point of the target image is equal to the pixel value of the corresponding position in the source image, and the image size transformation can be completed;
after the human faces are aligned, processing the images by using a ResNet residual convolution neural network, outputting 128-dimensional characteristic vectors, measuring the similarity between the characteristic vectors of the human face images to be recognized and the standard characteristic vectors by using cosine distances, and judging whether the personnel in the images to be recognized are the designated users or not according to the size of the similarity.
Further, the head pose estimation method is EPnP-based head pose estimation. In the EPnP algorithm, the three-dimensional coordinates of all feature points in the world coordinate system are represented by a weighted sum of the coordinates of 4 virtual control points, which must not be coplanar; by solving for the coordinates of these 4 control points in the camera coordinate system, the transformation between the two coordinate systems can be obtained, and the pose information of the head is then calculated from this transformation, specifically as follows:
the coordinates of the n feature points in the world coordinate system are denoted $p_i^w,\ i=1,\dots,n$, and the coordinates of the 4 virtual control points are denoted $c_j^w,\ j=1,\dots,4$; projected into the camera coordinate system, the feature-point coordinates become $p_i^c$ and the virtual control points become $c_j^c$; each feature point in the two coordinate systems is represented by a weighted sum of the corresponding 4 virtual control points, namely:
$$p_i^w=\sum_{j=1}^{4}\alpha_{ij}c_j^w,\qquad \sum_{j=1}^{4}\alpha_{ij}=1 \qquad (3.8a)$$
$$p_i^c=\sum_{j=1}^{4}\alpha_{ij}c_j^c \qquad (3.8b)$$
According to equation (3.8a), on the premise that $p_i^w$ and $c_j^w$ are known, the weight parameters $\alpha_{ij}$ can be solved; the coordinates of the 4 virtual control points in the camera coordinate system then still need to be obtained. According to the projection imaging model from 3D points to 2D points, the known image points $U_i$ and the spatial points $p_i^c=\sum_{j=1}^{4}\alpha_{ij}c_j^c$ are substituted and expanded to obtain:
$$\lambda_i\begin{bmatrix}u_i\\ v_i\\ 1\end{bmatrix}=K\sum_{j=1}^{4}\alpha_{ij}c_j^c \qquad (3.9)$$
where $\lambda_i$ is the scale factor to be determined; the pixel coordinates $U_i=(u_i,v_i)$ of the feature points, the intrinsic matrix $K$ of the camera and the weight parameters $\alpha_{ij}$ are known, and the coordinates $c_j^c$ of the control points in the camera coordinate system are to be solved. Writing the control points in the camera coordinate system as $c_j^c=[x_j^c,\ y_j^c,\ z_j^c]^T$ and further expanding equation (3.9) gives:
$$\lambda_i\begin{bmatrix}u_i\\ v_i\\ 1\end{bmatrix}=\begin{bmatrix}f_u & 0 & c_u\\ 0 & f_v & c_v\\ 0 & 0 & 1\end{bmatrix}\sum_{j=1}^{4}\alpha_{ij}\begin{bmatrix}x_j^c\\ y_j^c\\ z_j^c\end{bmatrix} \qquad (3.10)$$
Two linear equations can be obtained from the above equation:
$$\sum_{j=1}^{4}\alpha_{ij}\big(f_u x_j^c+(c_u-u_i)z_j^c\big)=0,\qquad \sum_{j=1}^{4}\alpha_{ij}\big(f_v y_j^c+(c_v-v_i)z_j^c\big)=0 \qquad (3.11)$$
On the premise that the weights $\alpha_{ij}$, the 2D feature points $(u_i,v_i)$ and the intrinsic parameters $(f_u,f_v)$, $(c_u,c_v)$ are known, the specific values of $c_j^c$ can be solved. After the coordinates of the 4 control points in the camera coordinate system are obtained through the above steps, substituting them into equation (3.8b), i.e. $p_i^c=\sum_{j=1}^{4}\alpha_{ij}c_j^c$, gives the coordinates of the 3D feature points in the camera coordinate system. With the coordinates of the feature points in the camera coordinate system solved, the rotation matrix and translation matrix are then calculated from the correspondence between the points in the world coordinate system and those in the camera coordinate system.
Further, the rules of the valid gesture in step S3 include: the rules of "valid gesture must look at person" based on identity information, "valid gesture must be in place" based on location information, "valid gesture must be focused" based on gesture information, and "valid gesture must last" based on statistical information.
Furthermore, (1) the realization of "an effective gesture must see the person" depends on the face verification technology: only when the specified user is present in the monitored picture is the system allowed to detect the gesture area in the image img, extract gesture features and recognize the gesture category;
(2) "an effective gesture must be in place" means that only a gesture specified in advance is valid, and no other gesture will be responded to by the system;
(3) "an effective gesture must be focused" means that the user must gaze at the poster bearing the gesture marks when making the corresponding gesture, with the yaw angle θ_y of the user's head pose while gazing greater than 30 degrees; only a gesture satisfying this head rotation angle is considered effective;
(4) "an effective gesture must last" means that a gesture, once made, must last at least 3 s, i.e. 90 frames;
if one of the above four items is not satisfied, the gesture is considered to be invalid.
The invention also aims to provide a feature series fusion gesture recognition device for elderly people, which comprises a gesture image segmentation module, a gesture recognition module and an effective gesture recognition module;
the gesture image segmentation module comprises video acquisition equipment, a face detection module and a gesture image segmentation submodule; the video acquisition equipment acquires images; the face detection module detects a face in an image acquired by the video acquisition equipment; the gesture image segmentation submodule segments a hand area in an image acquired by the video acquisition equipment;
the gesture recognition module comprises a gesture feature extraction module and a gesture recognition sub-module; the gesture feature extraction module extracts gestures; the gesture recognition sub-module compares and recognizes gestures according to preset gesture classification and extracted gestures;
the effective gesture recognition module recognizes an effective gesture according to a preset rule.
Compared with the prior art, the invention has the following beneficial effects: (1) The invention introduces face verification and head pose estimation technologies to judge whether a recognized gesture is effective; if so, the user requirement is recognized from the gesture, otherwise the recognized gesture is judged invalid. The feature series fusion gesture recognition technology for the elderly can quickly and accurately capture the user's hand movements, so that the care needs of the elderly are recognized in real time; the added gesture-validity techniques greatly improve the safety of gesture recognition and reduce the probability of misrecognition. (2) The serial HOG and LBP fusion method provided by the invention better captures the key characteristics that describe a gesture and greatly improves the quality of the gesture description operator, so that the subsequently trained SVM classifier achieves better efficiency and higher precision.
The characteristic series fusion gesture recognition technology for the elderly provided by the invention can judge the current demand of the elderly through the hand action of the elderly, convert the care demand into different hand actions, indirectly solve the problem that the elderly cannot clearly express the care demand by language, and simultaneously provide a simple and easy expression mode for the elderly.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the SSD network architecture;
FIG. 3 is a diagram of SSD face detection results;
FIG. 4 is a schematic diagram of color space conversion, (4a) RGB image; (4b) a YCbCr image;
FIG. 5 is a schematic diagram of skin color segmentation of an elliptical model, (5a) YCbCr image, and (5b) elliptical model segmentation;
fig. 6 is a schematic diagram of a noise interference elimination process, (6a) an area operator filter image, (6b) a face filter image, (6c) a hand binarization image, and (6d) a hand RGB original image;
FIG. 7 shows gesture image acquisition, (7a) image without margin, (7b) image with margin, (7c) RGB image with margin, (7d) grayscale image, and (7e) size-normalized image;
FIG. 8 is a schematic diagram of tandem fusion features;
FIG. 9 is a feature fusion gesture diagram;
FIG. 10 is a sample diagram of a gesture;
FIG. 11 is a schematic diagram of the positioning result of the CLM, (11a) the original image, and (11b) the image of the positioning result of the CLM key points;
FIG. 12 is a schematic diagram of bilinear interpolation;
fig. 13 is a result diagram of a face verification section, (13a) a face verification image 1, and (13b) a face verification image 2;
FIG. 14 is a diagram showing the results of head pose estimation, (14a) head right tilt, (14b) head left tilt;
Detailed Description
The invention is further described below with reference to examples and figures.
Example 1
As shown in fig. 1, a feature series fusion gesture recognition method for elderly people is characterized by comprising the following steps:
step S1, gesture image segmentation: converting the RGB image into a YCbCr image by using a color space model, and then segmenting the image by using an ellipse model in a skin color model to obtain a gesture part;
step S2, extracting gesture features: performing series fusion on HOG and LBP characteristics by using a series characteristic fusion method, describing gesture characteristics from two angles of edges and textures, and adopting gesture classification recognition based on an SVM;
step S3, effective gesture recognition: and judging whether the recognized gesture is effective or not by adopting a face verification method and a head posture estimation method, if so, recognizing the user requirement by utilizing the gesture, and otherwise, judging that the recognized gesture is ineffective.
Specifically, the step S1 includes a step S11, in which a monocular camera is used to perform image acquisition for face detection;
An ordinary monocular camera is used as the video acquisition device; its parameters are: frame rate 30 FPS, image size 640 × 480. After the image is obtained, face detection is performed, laying the foundation for subsequent face verification. The invention adopts an SSD (Single Shot MultiBox Detector) network to detect faces. The SSD network retains the end-to-end characteristic of YOLO and encapsulates the whole computation in a single network; as a one-stage model, it is one of the target detection networks with the best overall performance at present, simple and convenient to train, and accurate and fast in prediction.
The general architecture of the SSD network is shown in fig. 2, which uses the basic network to mine the feature information of the input image, generates the position coordinates and the confidence score of the face region through the prediction network, and finally sets the confidence score threshold by itself, and only the face region larger than the threshold is retained. The face detection results based on the SSD network are shown in fig. 3.
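A minimal sketch of this detection step is given below, using OpenCV's DNN module with a Caffe SSD face model; the model file names, input size and confidence threshold are illustrative assumptions rather than values taken from this patent.

```python
# Hedged sketch of SSD-based face detection with OpenCV's DNN module.
# "deploy.prototxt" / "res10_300x300_ssd.caffemodel" and the 0.6 threshold are assumptions.
import cv2
import numpy as np

net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "res10_300x300_ssd.caffemodel")

def detect_faces(frame, conf_threshold=0.6):
    h, w = frame.shape[:2]
    # SSD expects a fixed-size input blob; 300x300 with mean subtraction is typical
    blob = cv2.dnn.blobFromImage(frame, 1.0, (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()                     # shape: (1, 1, N, 7)
    boxes = []
    for i in range(detections.shape[2]):
        score = float(detections[0, 0, i, 2])
        if score >= conf_threshold:                # keep only confident face regions
            x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
            boxes.append((x1, y1, x2, y2, score))
    return boxes
```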
Step S12, firstly, analyzing the color space and mapping to the proper color space; then, analyzing a skin color model, and segmenting a skin color area; then, analyzing noise interference, and excluding skin color and skin color-like connected domains outside the hand region; finally, the hand area is intercepted.
Step S12 is explained in detail as follows:
at present, the gesture image segmentation algorithm mainly comprises a texture segmentation method based on a static image, a matching method, a threshold value method, a difference method based on a dynamic sequence, an optical flow method and the like. The invention uses a gesture segmentation method based on a skin color model: firstly, analyzing a color space and mapping the color space to a proper color space; then, analyzing a skin color model, and segmenting a skin color area; then, analyzing noise interference, and excluding skin color and skin color-like connected domains outside the hand region; finally intercepting the hand region color space is a mathematical description of the intuitive visual perception of the image. The skin color shows good clustering performance in a YCbCr color space, and is more suitable for a gesture image segmentation task. After converting the RGB color space into the YCbCr color space, the Y luminance component and the CbCr color component may be processed separately, or even the Y luminance component may be completely discarded. The expression for converting from RGB color space to YCbCr color space is as follows:
$$\begin{bmatrix} Y\\ C_b\\ C_r \end{bmatrix}=\begin{bmatrix} 0.299 & 0.587 & 0.114\\ -0.169 & -0.331 & 0.500\\ 0.500 & -0.419 & -0.081 \end{bmatrix}\begin{bmatrix} R\\ G\\ B \end{bmatrix}+\begin{bmatrix} 0\\ 128\\ 128 \end{bmatrix}$$
as shown in fig. 4 below, the RGB image is converted into a YCbCr image, and the skin color region appears more prominent and compact. Fig. 4a shows an RGB image, and fig. 4b shows a YCbCr image.
The skin color model mainly comprises the following components: single gaussian model, mixed gaussian model, elliptical model, etc. By projecting the samples from the YCbCr color space to the CrCb color plane, the skin color points can be found to be gathered in an elliptical area, the calculation loss of the elliptical model is very low, and the model is quite visual. Therefore, the invention adopts the ellipse model to model the skin color. The elliptical model can be described by the following equation:
$$\frac{\big[(x-c_x)\cos\theta+(y-c_y)\sin\theta\big]^2}{a^2}+\frac{\big[-(x-c_x)\sin\theta+(y-c_y)\cos\theta\big]^2}{b^2}=1$$
where (x, y) is a boundary point of the ellipse, (c_x, c_y) is the center of the ellipse, a is the major axis of the ellipse, b is the minor axis of the ellipse, and θ is the rotation angle of the ellipse.
The skin color segmentation operation of the elliptical model is performed on the image in the YCbCr color space, and the result is shown in fig. 5, where fig. 5a shows the YCbCr image and fig. 5b the ellipse-model segmentation. As seen from fig. 5, after skin color segmentation with the elliptical model the image still contains noise regions such as holes, burrs and faces; this noise mainly comes from interference by skin-color-like regions and by the face region. To remove these interferences, the invention provides an improved interference elimination strategy, which operates as follows: the collected RGB original image img_rgb is input into the SSD face detection network introduced in section 1.1 to obtain the coordinates (x1, y1) of the top-left vertex and (x2, y2) of the bottom-right vertex of the face rectangle; then the original image img_rgb is converted from the RGB color space to the YCbCr color space to obtain the image img_YCbCr, and the skin-color elliptical model is used to segment img_YCbCr into a binary image; the binary image is searched to obtain the contour of each skin-color and skin-color-like connected domain, and an area operator is used to eliminate the interference of skin-color-like regions; finally, the connected domain containing the face rectangle is filtered out, eliminating the interference of the face region and leaving only a pure hand region. The effect is shown in fig. 6: (6a) area-operator filtered image, (6b) face-filtered image, (6c) hand binary image, (6d) hand RGB original image.
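The segmentation and interference-elimination strategy described above can be sketched as follows; the ellipse parameters and the area threshold are assumptions chosen for illustration, since the patent does not list numeric values.

```python
# Hedged sketch of the elliptical skin-colour segmentation and interference elimination.
# The ellipse parameters and the 1500-pixel area threshold are illustrative assumptions.
import cv2
import numpy as np

def segment_hand(img_rgb, face_box=None, area_thresh=1500):
    ycrcb = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2YCrCb)       # channels: Y, Cr, Cb
    cr = ycrcb[:, :, 1].astype(np.float32)
    cb = ycrcb[:, :, 2].astype(np.float32)

    # Elliptical skin model in the CbCr plane (assumed parameters: centre, axes, rotation)
    cx, cy, a, b, theta = 109.38, 152.02, 25.39, 14.03, 2.53
    u = (cb - cx) * np.cos(theta) + (cr - cy) * np.sin(theta)
    v = -(cb - cx) * np.sin(theta) + (cr - cy) * np.cos(theta)
    mask = (((u / a) ** 2 + (v / b) ** 2) <= 1.0).astype(np.uint8) * 255

    # Area operator: drop small skin-like blobs; then drop the blob containing the face box
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    clean = np.zeros_like(mask)
    for c in contours:
        if cv2.contourArea(c) < area_thresh:
            continue
        bx, by, bw, bh = cv2.boundingRect(c)
        if face_box is not None:
            fx1, fy1, fx2, fy2 = face_box
            if bx <= fx1 and by <= fy1 and bx + bw >= fx2 and by + bh >= fy2:
                continue                                      # this blob is the face region
        cv2.drawContours(clean, [c], -1, 255, thickness=cv2.FILLED)
    return clean                                              # binary image of the hand region
```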
After the two-step interference elimination of the noise analysis stage, the contour of the hand region is obtained. Cropping the hand image exactly along this contour would be too aggressive, so a margin is appropriately left around the region of interest during segmentation to avoid unexpected cases: when the circumscribed rectangle of the hand is drawn, its side lengths are extended outwards by 15 pixels, and when the rectangular area is cropped out, the side lengths are extended by 10 pixels. The resulting gesture image is shown in fig. 7b, and cropping the rectangular area from the RGB original image gives fig. 7c. For computational convenience the image is converted to grayscale, and the gesture image size is then normalized to 128 × 96 using bilinear interpolation, giving the "final" gesture image shown in fig. 7e.
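A small sketch of this cropping and normalization step; the 10-pixel margin and the 128 × 96 target size come from the text, while the helper name and array conventions are assumptions.

```python
# Hedged sketch of cropping the hand rectangle with a 10-pixel margin, converting to
# grayscale and resizing to 128 x 96 with bilinear interpolation. crop_hand is hypothetical.
import cv2

def crop_hand(img_rgb, hand_contour, margin=10, out_size=(128, 96)):
    x, y, w, h = cv2.boundingRect(hand_contour)
    H, W = img_rgb.shape[:2]
    x1, y1 = max(x - margin, 0), max(y - margin, 0)
    x2, y2 = min(x + w + margin, W), min(y + h + margin, H)
    roi = cv2.cvtColor(img_rgb[y1:y2, x1:x2], cv2.COLOR_RGB2GRAY)
    # cv2.resize takes (width, height); INTER_LINEAR is bilinear interpolation
    return cv2.resize(roi, out_size, interpolation=cv2.INTER_LINEAR)
```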
Specifically, the gesture recognition in step S2 includes gesture feature extraction and gesture classification recognition.
Histogram of Oriented Gradients (HOG) describes the Gradient direction and Gradient intensity distribution characteristics of a local part of an image. The HOG is realized mainly by dividing the image into a plurality of regions with the same size, then respectively calculating a directional gradient histogram in each region, and finally connecting and combining all the histograms, namely, a HOG feature description operator of the detected image, wherein the small-size region is called a cell unit (cell).
Local Binary Patterns (LBP) [13] describe the local texture of an image. LBP is computed as follows: to obtain the LBP response of a point, a neighborhood is drawn around it according to a given rule and the center pixel value is compared one by one with the pixel values of the other points in the neighborhood; a position is marked 1 when its pixel value is greater than that of the center point and 0 when it is smaller than or equal to it. The marks around the center position are collected in order and concatenated into a binary number, which is taken as the LBP response value of the center position.
Therefore, the HOG and the LBP can both capture the characteristics with identification degree of the key of the gesture, wherein the HOG focuses more on the edge information of the gesture, the extracted characteristics have excellent effect under the condition of simple gesture and single background, but the HOG has poor effect when the background is complicated or the hierarchy information of the gesture cannot be ignored; LBP focuses more on texture information of gestures, and can capture complex gestures such as overlapping between fingers or between fingers and palm, but the extracted features have a moderate effect. There are inevitable limitations to the individual features described.
According to the method, HOG and LBP features are fused so that the gesture features are described from the two angles of edge and texture respectively, improving the quality of the final gesture description operator and laying the foundation for the subsequent high-precision gesture classification and recognition. The invention uses a serial feature fusion method: suppose a gesture image img_i generates a feature vector a_i after HOG feature extraction, and a_i is mapped by PCA dimensionality reduction to a feature vector A_i; after circular-neighbourhood, rotation-invariant, uniform LBP feature extraction, img_i generates a feature vector B_i. The final fused feature vector of the image is then represented as:
C_i = [A_i  B_i]   (2.1)
in the invention, a 396-dimensional HOG + PCA feature description operator is connected in series with a 108-dimensional circular neighborhood + rotation invariance + unified LBP feature description operator to obtain a 504-dimensional final fusion feature vector. A schematic diagram of a feature fusion process based on the tandem approach is shown in fig. 8.
HOG and improved-LBP feature extraction with the parameters designed herein is carried out on the gesture image of fig. 7e, and the resulting LBP feature vector is appended to the PCA-reduced HOG feature vector to obtain the fused feature vector. This vector serves as the mathematical description of the image for the subsequent classification and recognition.
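The serial fusion of Eq. (2.1) might be sketched as follows with scikit-image and scikit-learn; the HOG cell/block sizes, the LBP block layout and the PCA dimensionality are assumptions and will not necessarily reproduce the 396- and 108-dimensional operators mentioned above.

```python
# Hedged sketch of the serial HOG + LBP fusion of Eq. (2.1) using scikit-image/scikit-learn.
# Cell/block sizes, the LBP block grid and the PCA dimensionality are assumptions.
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.decomposition import PCA

def lbp_histogram(block, P=8, R=1):
    # 'uniform' in scikit-image is the rotation-invariant uniform LBP (P + 2 bins)
    lbp = local_binary_pattern(block, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)
    return hist

def fused_feature(img_gray, pca):
    """img_gray: 128x96 gesture image; pca: a PCA instance already fit on training HOG vectors."""
    a_i = hog(img_gray, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    A_i = pca.transform(a_i.reshape(1, -1))[0]          # HOG descriptor reduced by PCA
    blocks = [img_gray[r:r + 32, c:c + 32]              # block-wise LBP histograms
              for r in range(0, img_gray.shape[0], 32)
              for c in range(0, img_gray.shape[1], 32)]
    B_i = np.concatenate([lbp_histogram(b) for b in blocks])
    return np.concatenate([A_i, B_i])                   # C_i = [A_i  B_i]
```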
SVM-based gesture classification recognition
According to the method, a Support Vector Machine (SVM) which is excellent in performance under a small sample data set is selected as a classifier, and the classifier can be put into use after an SVM model parameter file is generated by training of the established gesture data set, so that the final purpose of gesture prediction is achieved.
To train the SVM classifier and verify the effectiveness of gesture recognition, a gesture image data set is established herein. By way of example, the invention designs 8 different gestures with a certain degree of distinctiveness. 5 volunteers were invited to make the 8 gestures in front of the camera, and for each gesture 100 frames of images were collected per person under different scenes and different illumination. This large amount of image data helps ensure that the subsequently trained SVM model file has a degree of generalization and robustness, and can meet the practical application of accurately recognizing gestures so as to control the posture of the bed body.
As described above, each gesture therefore has 5 × 100 = 500 images, giving 8 × 500 = 4000 gesture images for the 8 gestures in total. The original 640 × 480 panoramic images acquired by the camera contain too much background noise, so the gesture area is extracted with the gesture segmentation algorithm, yielding 4000 gesture images of different sizes, and the original 4000 panoramic images are deleted. The 500 images of each gesture are further divided into a training set (400) and a test set (100) at a ratio of 8:2; some gesture sample images from the data set are shown in fig. 10.
The multi-classification strategy of the SVM is mainly implemented in two ways: one-vs-rest and one-vs-one. The one-vs-one method has higher precision than the one-vs-rest method, and its training complexity and test-time real-time performance remain acceptable, so the one-vs-one implementation is used to solve the 8-gesture multi-classification problem. The SVM class provided by the machine learning module (ml) of the OpenCV 3.4.0 library is used here, with the positive sample label set to 1 and the negative sample label set to 0. When creating the SVM classifier, the SVM type (the default C_SVC is used here), the kernel function type and the related hyper-parameters need to be set; the RBF kernel is selected here, and its hyper-parameters are gamma and C. Because the choice of the SVM hyper-parameters is very important and directly affects the generalization and robustness of the SVM model, the invention uses five-fold cross validation to select the hyper-parameters, so that a relatively good SVM model can be obtained and the gesture classification effect reaches a high level.
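A sketch of this training stage is shown below using scikit-learn's SVC and GridSearchCV in place of the OpenCV ml SVM class mentioned above; the parameter grids and split seed are assumptions.

```python
# Hedged sketch of the classifier stage: RBF-kernel SVM with C and gamma chosen by
# five-fold cross validation (scikit-learn used in place of OpenCV's ml::SVM; grids assumed).
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def train_gesture_svm(features, labels):
    # 8:2 split into training and test sets, as in the data-set description above
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2,
                                              stratify=labels, random_state=0)
    grid = GridSearchCV(SVC(kernel="rbf", decision_function_shape="ovo"),
                        param_grid={"C": [0.1, 1, 10, 100],
                                    "gamma": [1e-3, 1e-2, 1e-1, 1]},
                        cv=5)                            # five-fold cross validation
    grid.fit(X_tr, y_tr)
    print("best hyper-parameters:", grid.best_params_)
    print("held-out test accuracy:", grid.best_estimator_.score(X_te, y_te))
    return grid.best_estimator_
```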
Because gesture motion is random, unconscious movements are sometimes recognized by the machine as valid gestures, which affects the user experience. Relatively mature face detection, face verification and head pose estimation are therefore introduced as auxiliary technologies of the system; they help realize the judgment of gesture validity, and this validity constraint guarantees the accuracy of recognizing the care needs of the elderly.
In this embodiment, face recognition verification based on ResNet is adopted.
The face recognition needs to extract features firstly, namely, images are collected through a camera, then a face area is located through an SSD face detection network, face key points are located and searched through CLM feature point locating, then the face area is aligned, and finally feature description of a face is obtained through a feature extraction network.
The CLM can quite accurately search out key feature points of a human face, such as eyes, mouth and the like, and mainly comprises three parts: shape model, local model, and fitting optimization. In the embodiment, a 68 characteristic point calibration method of a 300-W face data set is adopted to describe the structural shape of a face, and after a face rectangular region is obtained through an SSD face detection algorithm, the exact position of a characteristic point is found by utilizing a fitting optimization strategy to search in the characteristic point field in the face region in combination with the constraints of a shape model and a local model. The CLM positioning result is shown in fig. 11, (11a) the original image (11b) the CLM key point positioning result image.
Face alignment is mainly to solve two problems: firstly, the angles of the face images caused by the rotation of the head are different, and secondly, the sizes of the face images caused by the shooting distance are different. The embodiment adopts affine transformation and bilinear interpolation to realize face alignment. The affine transformation is a linear transformation from two-dimensional coordinates to two-dimensional coordinates, and the mathematical model of the affine transformation is as follows:
$$\begin{cases} x' = a_{00}x + a_{01}y + b_{00} \\ y' = a_{10}x + a_{11}y + b_{01} \end{cases} \qquad (3.1)$$
where (x ', y') is the mapped point of (x, y) after affine transformation, the homogeneous coordinate representation of the transformation is:
$$\begin{bmatrix} x'\\ y'\\ 1 \end{bmatrix}=M\begin{bmatrix} x\\ y\\ 1 \end{bmatrix},\qquad M=\begin{bmatrix} a_{00} & a_{01} & b_{00}\\ a_{10} & a_{11} & b_{01}\\ 0 & 0 & 1 \end{bmatrix} \qquad (3.2)$$
where M is the affine transformation matrix and contains 6 unknown variables: (a_00, a_01, a_10, a_11) are the linear transformation parameters and (b_00, b_01) are the translation parameters. At least 3 pairs of points are required to obtain the affine transformation matrix; here the 2 eye points and 1 nose point are selected. Suppose the 3 points of the image to be aligned are (x_1, y_1), (x_2, y_2), (x_3, y_3) and the corresponding 3 points of the standard frontal face image are (x_1', y_1'), (x_2', y_2'), (x_3', y_3'); then equation (3.2) can be transformed into:
$$\begin{bmatrix} x_1' & y_1' & 1\\ x_2' & y_2' & 1\\ x_3' & y_3' & 1 \end{bmatrix}=\begin{bmatrix} x_1 & y_1 & 1\\ x_2 & y_2 & 1\\ x_3 & y_3 & 1 \end{bmatrix}H \qquad (3.3)$$
Formula (3.3) can be abbreviated as X' = XH, where X' is the known matrix formed by the reference points of the standard frontal image, X is the known matrix formed by the reference points of the image to be aligned, and H is the unknown affine transformation parameter matrix, which can be solved as:
$$H = (X^{T}X)^{-1}X^{T}X' \qquad (3.4)$$
and after H is solved, carrying out affine transformation on each pixel point of the whole image to be aligned, and combining to obtain a corrected result image.
The bilinear interpolation method mainly aims to solve the problem of deformation and distortion caused by image size conversion, i.e. enlargement or reduction; it calculates the pixel value of the point to be solved by finding the four integer pixel points closest to the corresponding coordinate (i, j) and then carrying out linear interpolation in each of the two directions.
As shown in fig. 12, suppose the pixel values of the points Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2) in the source image are known, and the pixel value of a point P(x, y) is to be found. First, linear interpolation is carried out in the x direction between Q11(x1, y1) and Q21(x2, y1) to obtain the pixel value I(R_1), and between Q12(x1, y2) and Q22(x2, y2) to obtain the pixel value I(R_2), namely:
$$I(R_1)\approx\frac{x_2-x}{x_2-x_1}I(Q_{11})+\frac{x-x_1}{x_2-x_1}I(Q_{21}),\qquad I(R_2)\approx\frac{x_2-x}{x_2-x_1}I(Q_{12})+\frac{x-x_1}{x_2-x_1}I(Q_{22}) \qquad (3.5)$$
where point R_1 has coordinates (x, y1) and point R_2 has coordinates (x, y2).
Then linear interpolation is carried out in the y direction between R_1 and R_2 to calculate the pixel value I(P) at position (x, y), namely:
$$I(P)\approx\frac{y_2-y}{y_2-y_1}I(R_1)+\frac{y-y_1}{y_2-y_1}I(R_2) \qquad (3.6)$$
After the pixel values of the 4 integer coordinate points adjacent to any point (x, y) of the image are known, the pixel value at the coordinate (x, y) can be obtained by bilinear interpolation as follows:
$$I(x,y)\approx\frac{1}{(x_2-x_1)(y_2-y_1)}\Big[I(Q_{11})(x_2-x)(y_2-y)+I(Q_{21})(x-x_1)(y_2-y)+I(Q_{12})(x_2-x)(y-y_1)+I(Q_{22})(x-x_1)(y-y_1)\Big] \qquad (3.7)$$
Making the pixel value of each point of the target image equal to the pixel value of the corresponding position in the source image completes the image size transformation.
After the human faces are aligned, processing the images by using a ResNet residual convolution neural network, outputting 128-dimensional characteristic vectors, measuring the similarity between the characteristic vectors of the human face images to be recognized and the standard characteristic vectors by using cosine distances, and judging whether the personnel in the images to be recognized are the designated users or not according to the size of the similarity. The final face verification result graph is shown in fig. 13.
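The verification decision itself reduces to a cosine-similarity comparison of 128-dimensional embeddings; a sketch, where embed_face stands in for the ResNet feature-extraction network and the threshold value is an assumption:

```python
# Hedged sketch of the verification decision via cosine similarity of 128-D embeddings.
# embed_face stands in for the ResNet feature extractor; the 0.75 threshold is an assumption.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_designated_user(aligned_face, user_embedding, embed_face, threshold=0.75):
    probe = embed_face(aligned_face)            # 128-dimensional feature vector
    return cosine_similarity(probe, user_embedding) >= threshold
```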
The valid gesture rules are determined as follows:
the embodiment defines rule definitions of 'effective gesture must look at people' based on identity information, 'effective gesture must be in place' based on position information, 'effective gesture must be concentrated' based on posture information and 'effective gesture must be continuous' based on statistical information, and designs an effective gesture judgment module. The judgment rules of the 4 valid gestures are as follows:
(1) The realization that "an effective gesture must see the person" depends on the face verification technology: only when the designated user is present in the monitored picture is the system allowed to detect the gesture area in the image img, extract gesture features and recognize the gesture category;
(2) "an effective gesture must be in place" means that only the specifically defined gestures are valid, and no other gesture will be responded to by the system;
(3) "an effective gesture must be focused" means that the user must gaze at the poster bearing the gesture marks when making the corresponding gesture, with the yaw angle θ_y of the user's head pose while gazing greater than 30 degrees; only a gesture satisfying this head rotation angle is considered effective;
(4) "an effective gesture must last" means that a gesture, once made, must last at least 3 s, i.e. 90 frames.
If one of the above four items is not satisfied, the gesture is regarded as an invalid gesture, so that the probability of false recognition is greatly reduced.
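The four rules can be combined into a simple per-frame checker, sketched below; the 30-degree and 90-frame constants follow the text, while the class name and gesture labels are assumptions.

```python
# Hedged sketch of the four validity rules as a per-frame checker.
# The 30-degree and 90-frame constants follow the text; names and gesture labels are assumptions.
ALLOWED_GESTURES = {"a", "b", "c", "d", "e", "f", "g", "h"}   # the 8 predefined gestures
REQUIRED_FRAMES = 90                                           # ~3 s at 30 FPS

class ValidGestureChecker:
    def __init__(self):
        self.count, self.last = 0, None

    def update(self, user_verified, gesture, yaw_deg):
        # (1) designated user present, (2) predefined gesture, (3) head yaw greater than 30 deg
        if not (user_verified and gesture in ALLOWED_GESTURES and yaw_deg > 30):
            self.count, self.last = 0, None
            return None
        # (4) the same gesture must persist for at least 90 consecutive frames
        self.count = self.count + 1 if gesture == self.last else 1
        self.last = gesture
        return gesture if self.count >= REQUIRED_FRAMES else None
```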
Example 2
The present embodiment is different from embodiment 1 in that the head pose estimation method in the present embodiment is based on EPnP head pose estimation.
In the EPnP algorithm, the three-dimensional coordinates of all feature points in the world coordinate system are represented by a weighted sum of the coordinates of 4 virtual control points, which must not be coplanar; by solving for the coordinates of these 4 control points in the camera coordinate system, the transformation between the two coordinate systems can be obtained, and the pose information of the head is then calculated from this transformation.
The coordinates of the n feature points in the world coordinate system are denoted $p_i^w,\ i=1,\dots,n$, and the coordinates of the 4 virtual control points are denoted $c_j^w,\ j=1,\dots,4$. Projected into the camera coordinate system, the feature-point coordinates become $p_i^c$ and the virtual control points become $c_j^c$. Each feature point in the two coordinate systems is represented by a weighted sum of the corresponding 4 virtual control points, namely:
$$p_i^w=\sum_{j=1}^{4}\alpha_{ij}c_j^w,\qquad \sum_{j=1}^{4}\alpha_{ij}=1 \qquad (3.8a)$$
$$p_i^c=\sum_{j=1}^{4}\alpha_{ij}c_j^c \qquad (3.8b)$$
According to equation (3.8a), on the premise that $p_i^w$ and $c_j^w$ are known, the weight parameters $\alpha_{ij}$ can be solved; the coordinates of the 4 virtual control points in the camera coordinate system then still need to be obtained. According to the projection imaging model from 3D points to 2D points, the known image points $U_i$ and the spatial points $p_i^c=\sum_{j=1}^{4}\alpha_{ij}c_j^c$ are substituted and expanded to obtain:
$$\lambda_i\begin{bmatrix}u_i\\ v_i\\ 1\end{bmatrix}=K\sum_{j=1}^{4}\alpha_{ij}c_j^c \qquad (3.9)$$
where $\lambda_i$ is the scale factor to be determined; the pixel coordinates $U_i=(u_i,v_i)$ of the feature points, the intrinsic matrix $K$ of the camera and the weight parameters $\alpha_{ij}$ are known, and the coordinates $c_j^c$ of the control points in the camera coordinate system are to be solved. Writing the control points in the camera coordinate system as $c_j^c=[x_j^c,\ y_j^c,\ z_j^c]^T$ and further expanding equation (3.9) gives:
$$\lambda_i\begin{bmatrix}u_i\\ v_i\\ 1\end{bmatrix}=\begin{bmatrix}f_u & 0 & c_u\\ 0 & f_v & c_v\\ 0 & 0 & 1\end{bmatrix}\sum_{j=1}^{4}\alpha_{ij}\begin{bmatrix}x_j^c\\ y_j^c\\ z_j^c\end{bmatrix} \qquad (3.10)$$
Two linear equations can be obtained from the above equation:
$$\sum_{j=1}^{4}\alpha_{ij}\big(f_u x_j^c+(c_u-u_i)z_j^c\big)=0,\qquad \sum_{j=1}^{4}\alpha_{ij}\big(f_v y_j^c+(c_v-v_i)z_j^c\big)=0 \qquad (3.11)$$
On the premise that the weights $\alpha_{ij}$, the 2D feature points $(u_i,v_i)$ and the intrinsic parameters $(f_u,f_v)$, $(c_u,c_v)$ are known, the specific values of $c_j^c$ can be solved. After the coordinates of the 4 control points in the camera coordinate system are obtained through the above steps, substituting them into equation (3.8b), i.e. $p_i^c=\sum_{j=1}^{4}\alpha_{ij}c_j^c$, gives the coordinates of the 3D feature points in the camera coordinate system. With the coordinates of the feature points in the camera coordinate system solved, the rotation matrix and translation matrix are then calculated from the correspondence between the points in the world coordinate system and those in the camera coordinate system.
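In practice this EPnP step can be carried out with OpenCV's solvePnP; the sketch below is one possible implementation under stated assumptions (rough focal-length guess, one particular Euler-angle convention, hypothetical variable names).

```python
# Hedged sketch of EPnP head-pose estimation with OpenCV; the focal-length guess and the
# Euler-angle convention are assumptions, and model/image points are numpy arrays (N x 3, N x 2).
import cv2
import numpy as np

def head_pose(model_points_3d, image_points_2d, frame_size):
    h, w = frame_size
    K = np.array([[w, 0, w / 2.0],            # rough intrinsic matrix (focal length ~ width)
                  [0, w, h / 2.0],
                  [0, 0, 1.0]])
    ok, rvec, tvec = cv2.solvePnP(model_points_3d.astype(np.float64),
                                  image_points_2d.astype(np.float64),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)                # rotation matrix of the head
    # yaw / pitch / roll under one common decomposition of R
    yaw = np.degrees(np.arctan2(-R[2, 0], np.sqrt(R[2, 1] ** 2 + R[2, 2] ** 2)))
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return yaw, pitch, roll, rvec, tvec
```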
The standard 3D model used in this embodiment is derived from the frontal-head three-dimensional data model of the Institute of Systems and Robotics of the University of Coimbra. In theory, the more face feature points are selected, the more accurate the pose calculation; in practice only a few key feature points are required. This embodiment uses 14 points, whose numbers on the feature-point calibration chart of the standard 3D model are 33, 29, 34, 38, 13, 17, 25, 21, 55, 49, 43, 39, 45 and 6 respectively. The final head pose estimation result is shown in fig. 14.
In order to verify the gesture recognition effect, fusion features are extracted from the test images of the 8 gestures (100 test images per gesture) to test the classifier. Each gesture image yields a 504-dimensional serial feature vector; the class of each test image is determined from the SVM multi-classifier model parameters obtained in the training stage and compared with its true class. The decisions of the SVM multi-classifier are counted and the confusion matrix of gesture recognition is drawn, as shown in Table 1 below.
TABLE 1 test sample gesture recognition results
It can be seen that, except for gesture h, which is never misclassified into another category, the remaining gestures all produce a small amount of misrecognition; the samples of gesture f are the hardest to classify, because that gesture is relatively complex and has low distinctiveness. Overall, most of the test samples are correctly classified and the gesture recognition effect is good.
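A sketch of how such a confusion matrix could be tabulated with scikit-learn (variable names are assumptions):

```python
# Hedged sketch of tabulating the confusion matrix over the 8 x 100 test images.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(svm_model, test_features, test_labels, class_names):
    pred = svm_model.predict(test_features)
    cm = confusion_matrix(test_labels, pred, labels=class_names)   # rows: true, cols: predicted
    print("overall accuracy:", accuracy_score(test_labels, pred))
    print(np.array2string(cm))
    return cm
```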
Example 3
A feature series fusion gesture recognition device for elderly people comprises a gesture image segmentation module, a gesture recognition module and an effective gesture recognition module;
the gesture image segmentation module comprises video acquisition equipment, a face detection module and a gesture image segmentation submodule; the video acquisition equipment acquires images; the face detection module detects a face in an image acquired by the video acquisition equipment; the gesture image segmentation submodule segments a hand area in an image acquired by the video acquisition equipment;
the gesture recognition module comprises a gesture feature extraction module and a gesture recognition sub-module; the gesture feature extraction module extracts gestures; the gesture recognition sub-module compares and recognizes gestures according to preset gesture classification and extracted gestures;
the effective gesture recognition module recognizes an effective gesture according to a preset rule.
Finally, it should be noted that: the above embodiments are only used to illustrate the present invention and do not limit the technical solutions described in the present invention; thus, while the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted; all such modifications and variations are intended to be included herein within the scope of this disclosure and the present invention and protected by the following claims.

Claims (9)

1. A feature series fusion gesture recognition method for elderly people is characterized by comprising the following steps:
step S1, gesture image segmentation: converting the RGB image into a YCbCr image by using a color space model, and then segmenting the image by using an ellipse model in a skin color model to obtain a gesture part;
step S2, extracting gesture features: performing series fusion on HOG and LBP characteristics by using a series characteristic fusion method, describing gesture characteristics from two angles of edges and textures, and adopting gesture classification recognition based on an SVM;
step S3, effective gesture recognition: and judging whether the recognized gesture is effective or not by adopting a face verification method and a head posture estimation method, if so, recognizing the user requirement by utilizing the gesture, and otherwise, judging that the recognized gesture is ineffective.
2. The method for recognizing the characteristic series-connection fusion gesture of the elderly according to claim 1, wherein the step S1 comprises a step S11 of performing image acquisition by using a monocular camera to perform face detection;
step S12, firstly, analyzing the color space and mapping to the proper color space; then, analyzing a skin color model, and segmenting a skin color area; then, analyzing noise interference, and excluding skin color and skin color-like connected domains outside the hand region; finally, the hand area is intercepted.
3. The feature series fusion gesture recognition method for elderly people according to claim 2, wherein in step S12, after converting the RGB color space into the YCbCr color space, the Y luminance component and the CbCr color components are processed separately, and the expression for converting from the RGB color space to the YCbCr color space is as follows:
$$\begin{bmatrix} Y\\ C_b\\ C_r \end{bmatrix}=\begin{bmatrix} 0.299 & 0.587 & 0.114\\ -0.169 & -0.331 & 0.500\\ 0.500 & -0.419 & -0.081 \end{bmatrix}\begin{bmatrix} R\\ G\\ B \end{bmatrix}+\begin{bmatrix} 0\\ 128\\ 128 \end{bmatrix}$$
modeling the skin color by adopting the elliptical model to obtain a segmented binary image, wherein the elliptical model is described by the following formula:
$$\frac{\big[(x-c_x)\cos\theta+(y-c_y)\sin\theta\big]^2}{a^2}+\frac{\big[-(x-c_x)\sin\theta+(y-c_y)\cos\theta\big]^2}{b^2}=1$$
where (x, y) is a boundary point of the ellipse, (c_x, c_y) is the center of the ellipse, a is the major axis of the ellipse, b is the minor axis of the ellipse, and θ is the rotation angle of the ellipse;
searching the binary image to obtain the outline of each skin color and skin color-like connected domain, and eliminating the interference of the skin color-like region by using an area operator; and finally, removing the interference of the face region by filtering a connected domain containing a face rectangular frame, and only leaving a pure hand region.
4. The feature series fusion gesture recognition method for elderly people according to claim 3, wherein in step S2 the serial feature fusion method is as follows: suppose a gesture image img_i generates a feature vector a_i after HOG feature extraction, and a_i is mapped by PCA dimensionality reduction to a feature vector A_i; after circular-neighbourhood, rotation-invariant, uniform LBP feature extraction, img_i generates a feature vector B_i; the final fused feature vector of the image is then represented as:
C_i = [A_i  B_i]   (2.1).
5. the method for recognizing the serially-connected feature fusion gesture of the elderly according to claim 1, wherein the face verification technology adopts face recognition verification based on ResNet; the face recognition firstly needs to extract features, namely, images are collected through a camera, then a face area is positioned through an SSD face detection network, then face key points are searched through CLM feature point positioning, then the face area is aligned, and finally the feature description of the face is obtained through a feature extraction network; the face alignment adopts affine transformation and a bilinear interpolation method, the affine transformation is linear transformation from two-dimensional coordinates to two-dimensional coordinates, and the mathematical model is as follows:
$$\begin{cases} x' = a_{00}x + a_{01}y + b_{00} \\ y' = a_{10}x + a_{11}y + b_{01} \end{cases} \qquad (3.1)$$
where (x ', y') is the mapped point of (x, y) after affine transformation, the homogeneous coordinate representation of the transformation is:
Figure FDA0002682331100000022
wherein M is represented as an affine transformation matrix and comprises 6 unknown variables, (a)00,a01,a10,a11) Represents linear transformation parameters, (b)00,b01) Representing a translation parameter; the affine transformation matrix formula is: x '═ XH, where X' is constructed from standard frontal image reference pointsAnd forming a known matrix, wherein X is a known matrix formed by reference points of the images to be aligned, H is an unknown affine transformation parameter matrix, and the known matrix can be obtained by the following steps:
H = (X^T X)^(-1) X^T X′    (3.4)
after H is solved, the affine transformation is applied to every pixel of the image to be aligned, and the results are combined to obtain the corrected image;
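A sketch of this alignment step using the normal-equation solution (3.4), assuming NumPy and OpenCV; solve_affine and align_face are hypothetical helper names, and the landmark correspondences are assumed to come from the preceding CLM step:

```python
import numpy as np
import cv2

def solve_affine(src_pts, dst_pts):
    """Least-squares solution of X' = X H for the 6 affine parameters (equation 3.4)."""
    n = src_pts.shape[0]
    X = np.hstack([src_pts, np.ones((n, 1))])      # reference points of the image to be aligned
    Xp = np.hstack([dst_pts, np.ones((n, 1))])     # reference points of the standard frontal image
    return np.linalg.inv(X.T @ X) @ X.T @ Xp       # H = (X^T X)^(-1) X^T X'

def align_face(img, src_pts, dst_pts, out_size=(112, 112)):
    """Warp the face with the solved affine matrix; OpenCV applies bilinear interpolation."""
    H = solve_affine(np.asarray(src_pts, float), np.asarray(dst_pts, float))
    M = H.T[:2, :]   # repack the row-vector solution as the 2x3 matrix cv2.warpAffine expects
    return cv2.warpAffine(img, M, out_size, flags=cv2.INTER_LINEAR)
```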
the bilinear interpolation method mainly solves the deformation and distortion caused by image size conversion, i.e. enlargement or reduction; it calculates the pixel value of the point to be solved by finding the four integer pixel points closest to the corresponding coordinate (i, j) and then performing linear interpolation in each of the two directions;
suppose the pixel values of the source image points Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2) are known and the pixel value of a point P(x, y) is to be obtained; first, linear interpolation in the x direction is performed between the two points Q11(x1, y1) and Q21(x2, y1) to obtain the pixel value I(R1), and between the two points Q12(x1, y2) and Q22(x2, y2) to obtain the pixel value I(R2), namely:
I(R1) ≈ ((x2 - x)/(x2 - x1)) I(Q11) + ((x - x1)/(x2 - x1)) I(Q21)
I(R2) ≈ ((x2 - x)/(x2 - x1)) I(Q12) + ((x - x1)/(x2 - x1)) I(Q22)
where the coordinates of point R1 are (x, y1) and the coordinates of point R2 are (x, y2);
then linear interpolation is performed in the y direction between R1 and R2 to calculate the pixel value I(P) at the position (x, y), namely:
I(P) ≈ ((y2 - y)/(y2 - y1)) I(R1) + ((y - y1)/(y2 - y1)) I(R2)
after the pixel values of the 4 integer coordinate points adjacent to any point (x, y) of the image are known, the pixel value at the (x, y) coordinate can be obtained through the bilinear interpolation method as follows:
I(x, y) ≈ [I(Q11)(x2 - x)(y2 - y) + I(Q21)(x - x1)(y2 - y) + I(Q12)(x2 - x)(y - y1) + I(Q22)(x - x1)(y - y1)] / ((x2 - x1)(y2 - y1))
the pixel value of each point of the target image is set to the interpolated pixel value at the corresponding position in the source image, and the image size transformation is thereby completed;
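A sketch of the resizing step written directly from the two-direction interpolation formulas above; in practice a call such as cv2.resize would normally be used, so this explicit loop is only illustrative:

```python
import numpy as np

def bilinear_resize(src, out_h, out_w):
    """Resize a grayscale image: each target pixel takes the value interpolated
    from the four nearest integer neighbours in the source image."""
    in_h, in_w = src.shape
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for j in range(out_h):
        for i in range(out_w):
            # Corresponding (possibly fractional) coordinate in the source image.
            x = i * (in_w - 1) / max(out_w - 1, 1)
            y = j * (in_h - 1) / max(out_h - 1, 1)
            x1, y1 = int(np.floor(x)), int(np.floor(y))
            x2, y2 = min(x1 + 1, in_w - 1), min(y1 + 1, in_h - 1)
            dx, dy = x - x1, y - y1
            # Interpolate in x between the row neighbours, then in y between the results.
            r1 = (1 - dx) * src[y1, x1] + dx * src[y1, x2]   # I(R1)
            r2 = (1 - dx) * src[y2, x1] + dx * src[y2, x2]   # I(R2)
            out[j, i] = (1 - dy) * r1 + dy * r2              # I(P)
    return out
```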
after the faces are aligned, the images are processed by a ResNet residual convolutional neural network, which outputs 128-dimensional feature vectors; the cosine distance is used to measure the similarity between the feature vector of the face image to be recognized and the standard feature vector, and whether the person in the image to be recognized is the designated user is judged according to the magnitude of the similarity.
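A sketch of the final verification decision, assuming the 128-dimensional embeddings have already been produced by the ResNet network; the 0.6 acceptance threshold is an assumed placeholder, not a value stated in the patent:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_designated_user(query_embedding, enrolled_embedding, threshold=0.6):
    """Accept the face if its 128-D embedding is close enough to the enrolled user's embedding.
    The threshold value is an illustrative assumption."""
    return cosine_similarity(query_embedding, enrolled_embedding) >= threshold
```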
6. The elderly feature series fusion gesture recognition method according to claim 1, wherein the head pose estimation method is EPnP-based head pose estimation; the EPnP algorithm represents the three-dimensional coordinates of all feature points in the world coordinate system as a weighted sum of the coordinates of 4 virtual control points, which must not be coplanar; by solving the coordinates of the 4 control points in the camera coordinate system, the transformation relationship between the two coordinate systems can be obtained, and the head pose information is then calculated from this transformation, specifically as follows:
the coordinates of the n feature points in the world coordinate system are recorded as p_i^w (i = 1, ..., n), and the coordinates of the 4 virtual control points are recorded as c_j^w (j = 1, ..., 4); projected into the camera coordinate system, the feature point coordinates become p_i^c and the virtual control points become c_j^c; each feature point in the two coordinate systems is represented by the weighted sum of the 4 corresponding virtual control points, namely:

p_i^w = Σ_{j=1}^{4} α_ij c_j^w    (3.8a)

p_i^c = Σ_{j=1}^{4} α_ij c_j^c,  with  Σ_{j=1}^{4} α_ij = 1    (3.8b)

according to equation (3.8a), on the premise that p_i^w and c_j^w are known, the weight parameters α_ij can be obtained by solving; next, the coordinates of the 4 virtual control points in the camera coordinate system need to be obtained; according to the projection imaging model from 3D points to 2D points, substituting the known image points u_i and the spatial points p_i^c and expanding gives:

λ_i [u_i, v_i, 1]^T = K p_i^c = K Σ_{j=1}^{4} α_ij c_j^c    (3.9)

where λ_i is the scale factor to be solved; the pixel coordinates u_i of the feature points, the camera intrinsic matrix K and the weight parameters α_ij are known, and the coordinates of the control points in the camera coordinate system are to be obtained; setting the control points to be solved in the camera coordinate system as c_j^c = [x_j^c, y_j^c, z_j^c]^T and further expanding equation (3.9) gives:

λ_i [u_i, v_i, 1]^T = [f_u, 0, c_u; 0, f_v, c_v; 0, 0, 1] Σ_{j=1}^{4} α_ij [x_j^c, y_j^c, z_j^c]^T

two linear equations can be obtained from the above equation:

Σ_{j=1}^{4} α_ij f_u x_j^c + α_ij (c_u - u_i) z_j^c = 0

Σ_{j=1}^{4} α_ij f_v y_j^c + α_ij (c_v - v_i) z_j^c = 0

on the premise that the weights α_ij, the 2D feature points (u_i, v_i) and the intrinsic parameters (f_u, f_v), (c_u, c_v) are known, the specific values of the control point coordinates c_j^c can be solved;

after the coordinates of the 4 control points in the camera coordinate system are obtained through the above steps, they are substituted into formula (3.8b) to obtain the coordinates of the 3D feature points in the camera coordinate system; once the feature point coordinates in the camera coordinate system are solved, the rotation matrix and the translation matrix are calculated from the correspondence between the points in the world coordinate system and the camera coordinate system.
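A sketch of EPnP-based head pose estimation using OpenCV's built-in solver rather than a hand-written EPnP implementation; the rough focal-length guess, the zero-distortion assumption and the Euler-angle convention used to extract yaw are all assumptions for illustration:

```python
import cv2
import numpy as np

def head_pose_yaw(landmarks_2d, model_points_3d, frame_size):
    """Estimate head pose with OpenCV's EPnP solver and return the yaw angle in degrees.
    model_points_3d: 3D facial landmarks in a head/world frame; landmarks_2d: their 2D image positions."""
    h, w = frame_size
    focal = w   # rough focal-length guess; a calibrated intrinsic matrix would normally be used
    K = np.array([[focal, 0, w / 2],
                  [0, focal, h / 2],
                  [0, 0, 1]], dtype=np.float64)
    dist = np.zeros(4)   # assume no lens distortion

    ok, rvec, tvec = cv2.solvePnP(model_points_3d, landmarks_2d, K, dist,
                                  flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None

    R, _ = cv2.Rodrigues(rvec)   # rotation matrix between world and camera frames
    # Rotation about the Y axis, interpreted here as head yaw (axis conventions may differ per rig).
    yaw = np.degrees(np.arctan2(-R[2, 0], np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)))
    return yaw
```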
7. The elderly feature series fusion gesture recognition method according to claim 1, wherein the rules for valid gestures in step S3 include: the rule "a valid gesture must see the person", based on identity information; the rule "a valid gesture must be in place", based on location information; the rule "a valid gesture must be focused", based on pose information; and the rule "a valid gesture must last", based on statistical information.
8. The elderly feature series fusion gesture recognition method according to claim 7, wherein:
(1) the rule that a valid gesture must "see the person" relies on the face verification technology: only when the designated user appears in the monitoring picture is the system allowed to detect the gesture region on img, extract gesture features, and identify the gesture category;
(2) "a valid gesture must be in place" means that only gestures specified in advance are valid, and the system does not respond to any other gesture;
(3) "a valid gesture must be focused" means that the user must watch the poster bearing the gesture mark while making the corresponding gesture; the yaw angle θ_y of the user's head pose while watching must be greater than 30 degrees, and only a gesture satisfying this head rotation angle is considered valid;
(4) "a valid gesture must last" means that the gesture must be held for at least 3 s, i.e. for 90 frames;
if any one of the above four conditions is not satisfied, the gesture is considered invalid (see the sketch below).
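A minimal sketch of the four validity rules, assuming per-frame observations produced by the earlier modules; the gesture set is a hypothetical placeholder, while the 30-degree yaw and 90-frame duration thresholds come from the claim text:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FrameObservation:
    designated_user_present: bool   # result of the face verification step
    gesture_label: Optional[str]    # recognized gesture category, or None
    head_yaw_deg: float             # yaw angle from the head pose estimation

ALLOWED_GESTURES = {"palm", "fist", "ok"}   # hypothetical pre-specified gesture set
MIN_YAW_DEG = 30.0                          # rule (3): head rotation threshold
MIN_DURATION_FRAMES = 90                    # rule (4): about 3 s at 30 fps

def frame_is_valid(obs: FrameObservation) -> bool:
    """Per-frame check of rules (1)-(3)."""
    return (obs.designated_user_present                 # (1) designated user must be in the picture
            and obs.gesture_label in ALLOWED_GESTURES   # (2) only pre-specified gestures count
            and abs(obs.head_yaw_deg) > MIN_YAW_DEG)    # (3) head turned toward the gesture poster

def gesture_is_valid(history: List[FrameObservation]) -> bool:
    """Rule (4): the same valid gesture must persist for at least 90 consecutive frames."""
    if len(history) < MIN_DURATION_FRAMES:
        return False
    window = history[-MIN_DURATION_FRAMES:]
    return (len({obs.gesture_label for obs in window}) == 1
            and all(frame_is_valid(obs) for obs in window))
```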
9. A feature series fusion gesture recognition device for elderly people, characterized by comprising a gesture image segmentation module, a gesture recognition module and a valid gesture recognition module;
the gesture image segmentation module comprises video acquisition equipment, a face detection module and a gesture image segmentation submodule; the video acquisition equipment acquires images; the face detection module detects a face in an image acquired by the video acquisition equipment; the gesture image segmentation submodule segments a hand area in an image acquired by the video acquisition equipment;
the gesture recognition module comprises a gesture feature extraction module and a gesture recognition sub-module; the gesture feature extraction module extracts gesture features; the gesture recognition sub-module compares the extracted features against the preset gesture categories and recognizes the gesture;
the valid gesture recognition module recognizes a valid gesture according to the preset rules.
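A skeleton showing how the claimed modules could be wired together; every class and method name here is a hypothetical placeholder used only to illustrate the data flow between the three modules:

```python
class GestureRecognitionDevice:
    """Hypothetical wiring of the three-module structure described in claim 9."""

    def __init__(self, capture, face_detector, hand_segmenter,
                 feature_extractor, gesture_classifier, rule_checker):
        self.capture = capture                        # video acquisition equipment
        self.face_detector = face_detector            # face detection module
        self.hand_segmenter = hand_segmenter          # gesture image segmentation submodule
        self.feature_extractor = feature_extractor    # gesture feature extraction module
        self.gesture_classifier = gesture_classifier  # gesture recognition sub-module
        self.rule_checker = rule_checker              # valid gesture recognition module

    def process_frame(self):
        img = self.capture.read()
        face = self.face_detector.detect(img)
        hand = self.hand_segmenter.segment(img, face)
        features = self.feature_extractor.extract(hand)
        gesture = self.gesture_classifier.classify(features)
        return self.rule_checker.check(gesture, face)
```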
CN202010965987.6A 2020-09-15 2020-09-15 Feature series fusion gesture recognition method and device for elderly people Pending CN112101208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010965987.6A CN112101208A (en) 2020-09-15 2020-09-15 Feature series fusion gesture recognition method and device for elderly people

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010965987.6A CN112101208A (en) 2020-09-15 2020-09-15 Feature series fusion gesture recognition method and device for elderly people

Publications (1)

Publication Number Publication Date
CN112101208A true CN112101208A (en) 2020-12-18

Family

ID=73758595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010965987.6A Pending CN112101208A (en) 2020-09-15 2020-09-15 Feature series fusion gesture recognition method and device for elderly people

Country Status (1)

Country Link
CN (1) CN112101208A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378773A (en) * 2021-06-29 2021-09-10 北京百度网讯科技有限公司 Gesture recognition method, device, equipment, storage medium and program product
CN113657346A (en) * 2021-08-31 2021-11-16 深圳市比一比网络科技有限公司 Driver action recognition method based on combination of target detection and key point detection
CN113899675A (en) * 2021-10-13 2022-01-07 淮阴工学院 Automatic concrete impermeability detection method and device based on machine vision
CN114564100A (en) * 2021-11-05 2022-05-31 南京大学 Free stereoscopic display hand-eye interaction method based on infrared guidance
CN115713998A (en) * 2023-01-10 2023-02-24 华南师范大学 Intelligent medicine box
CN116484035A (en) * 2023-05-23 2023-07-25 武汉威克睿特科技有限公司 Resume index system and method based on face recognition figure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846359A (en) * 2018-06-13 2018-11-20 新疆大学科学技术学院 It is a kind of to divide the gesture identification method blended with machine learning algorithm and its application based on skin-coloured regions
CN109086589A (en) * 2018-08-02 2018-12-25 东北大学 A kind of intelligent terminal face unlocking method of combination gesture identification
CN109614922A (en) * 2018-12-07 2019-04-12 南京富士通南大软件技术有限公司 A kind of dynamic static gesture identification method and system
CN110689889A (en) * 2019-10-11 2020-01-14 深圳追一科技有限公司 Man-machine interaction method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846359A (en) * 2018-06-13 2018-11-20 新疆大学科学技术学院 It is a kind of to divide the gesture identification method blended with machine learning algorithm and its application based on skin-coloured regions
CN109086589A (en) * 2018-08-02 2018-12-25 东北大学 A kind of intelligent terminal face unlocking method of combination gesture identification
CN109614922A (en) * 2018-12-07 2019-04-12 南京富士通南大软件技术有限公司 A kind of dynamic static gesture identification method and system
CN110689889A (en) * 2019-10-11 2020-01-14 深圳追一科技有限公司 Man-machine interaction method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
缑新科 (GOU Xinke): "Static Gesture Recognition Based on Feature Fusion" (《基于特征融合的静态手势识别》), Computer and Digital Engineering (《计算机与数字工程》), pages 1336-1340 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378773A (en) * 2021-06-29 2021-09-10 北京百度网讯科技有限公司 Gesture recognition method, device, equipment, storage medium and program product
CN113378773B (en) * 2021-06-29 2023-08-08 北京百度网讯科技有限公司 Gesture recognition method, gesture recognition device, gesture recognition apparatus, gesture recognition storage medium, and gesture recognition program product
CN113657346A (en) * 2021-08-31 2021-11-16 深圳市比一比网络科技有限公司 Driver action recognition method based on combination of target detection and key point detection
CN113899675A (en) * 2021-10-13 2022-01-07 淮阴工学院 Automatic concrete impermeability detection method and device based on machine vision
CN114564100A (en) * 2021-11-05 2022-05-31 南京大学 Free stereoscopic display hand-eye interaction method based on infrared guidance
CN114564100B (en) * 2021-11-05 2023-12-12 南京大学 Infrared guiding-based hand-eye interaction method for auto-stereoscopic display
CN115713998A (en) * 2023-01-10 2023-02-24 华南师范大学 Intelligent medicine box
CN116484035A (en) * 2023-05-23 2023-07-25 武汉威克睿特科技有限公司 Resume index system and method based on face recognition figure
CN116484035B (en) * 2023-05-23 2023-12-01 武汉威克睿特科技有限公司 Resume index system and method based on face recognition figure

Similar Documents

Publication Publication Date Title
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
CN112101208A (en) Feature series fusion gesture recognition method and device for elderly people
Sirohey et al. Eye detection in a face image using linear and nonlinear filters
Huang et al. Unsupervised joint alignment of complex images
Zhang et al. Adaptive facial point detection and emotion recognition for a humanoid robot
Metaxas et al. A review of motion analysis methods for human nonverbal communication computing
Pandey et al. Hand gesture recognition for sign language recognition: A review
Kheirkhah et al. A hybrid face detection approach in color images with complex background
Galiyawala et al. Person retrieval in surveillance video using height, color and gender
Gürel Development of a face recognition system
Dalka et al. Human-Computer Interface Based on Visual Lip Movement and Gesture Recognition.
Hoque et al. Computer vision based gesture recognition for desktop object manipulation
Chang et al. Automatic hand-pose trajectory tracking system using video sequences
Thabet et al. Algorithm of local features fusion and modified covariance-matrix technique for hand motion position estimation and hand gesture trajectory tracking approach
Vezzetti et al. Application of geometry to rgb images for facial landmark localisation-a preliminary approach
CN112149598A (en) Side face evaluation method and device, electronic equipment and storage medium
Rabba et al. Discriminative robust gaze estimation using kernel-dmcca fusion
Naji et al. Detecting faces in colored images using multi-skin color models and neural network with texture analysis
Bottino et al. A fast and robust method for the identification of face landmarks in profile images
Paul et al. Extraction of facial feature points using cumulative distribution function by varying single threshold group
Kapse et al. Eye-referenced dynamic bounding box for face recognition using light convolutional neural network
Kathura et al. Hand gesture recognition by using logical heuristics
Hatem et al. Human facial features detection and tracking in images and video
Dhote et al. Overview and an Approach to Real Time Face Detection and Recognition
Zhao et al. The Development of an Identification Photo Booth System based on a Deep Learning Automatic Image Capturing Method.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination