CN110796018A - Hand motion recognition method based on depth image and color image - Google Patents

Hand motion recognition method based on depth image and color image

Info

Publication number
CN110796018A
Authority
CN
China
Prior art keywords
detection
color
depth
iou
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910945063.7A
Other languages
Chinese (zh)
Other versions
CN110796018B (en)
Inventor
刘玉婷
李公法
李蔚
田泉
蒋国璋
陶波
江都
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN201910945063.7A priority Critical patent/CN110796018B/en
Publication of CN110796018A publication Critical patent/CN110796018A/en
Application granted granted Critical
Publication of CN110796018B publication Critical patent/CN110796018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a hand motion recognition method based on a depth image and a color image. The 36 gestures in the ASL (American Sign Language) gesture library are used as templates, gesture data are acquired through a Kinect sensor, and gesture databases under depth and color backgrounds are established. Taking the regression-based target detection algorithm SSD as the research basis, transfer learning is performed on the selected target detection model under the TensorFlow deep learning framework with the self-built color-background and depth-background gesture databases respectively, obtaining two network models that can recognize and detect hand movement under the depth and color backgrounds. A hand motion recognition and detection network framework that fuses the detection results under the depth and color backgrounds is adopted, the non-maximum suppression algorithm is improved, and the effectiveness of the proposed network framework for hand motion recognition and detection is finally verified. The invention avoids missed detection and false detection of the target, improves the gesture recognition rate, and can realize both one-hand and two-hand recognition.

Description

Hand motion recognition method based on depth image and color image
Technical Field
The invention relates to the technical field of image processing and intelligent interaction, in particular to a hand motion recognition method based on a depth image and a color image.
Background
With the development of the related disciplines of machine vision and artificial intelligence, human-computer interaction technology has gradually become an important research direction. The randomness and variability of hand morphology mean that current gesture recognition technology cannot yet replace traditional interactive devices in practical human-computer interaction, so gesture recognition and detection, like other computer-vision-based recognition technologies, still needs continued exploration. Compared with color data, which provides appearance and texture information but is sensitive to light changes, depth data provides more shape information and sharper edges and is robust to changes in lighting conditions. Therefore, using additional depth information together with color information for joint hand motion recognition and detection has become a focus of research.
Proposing a method that combines color and depth data in an optimal way remains a key problem for hand motion recognition and detection.
Disclosure of Invention
In order to solve the technical problems, the invention provides a hand motion recognition method based on a depth image and a color image, which can read the color information and the depth information of a gesture in a complex scene and interact with a display by accurately recognizing the gesture.
The technical scheme adopted by the invention is as follows: a hand motion recognition method based on a depth image and a color image is characterized by comprising the following steps:
step 1: self-building a hand movement database;
the 36 gestures in the ASL (American Sign Language) gesture library are read through a Kinect sensor, and a depth hand motion database and a color hand motion database are established;
step 2: preprocessing hand data, and dividing the preprocessed data into a training set and a test set;
step 3: training a depth hand motion detection model and a color hand motion detection model;
training an SSD_MobileNet-based target detection model in the depth environment with the depth training set, and training an SSD_MobileNet-based target detection model in the color environment with the color training set; obtaining two network models that can respectively recognize and detect hand movement under the depth and color backgrounds;
step 4: fusing the detection results of the two hand motion recognition and detection network models under the color and depth backgrounds by using an SSD_MobileNet dual-stream network framework;
step 5: improving the non-maximum suppression of the fused network model by introducing a non-maximum suppression algorithm with a proportional penalty coefficient, and testing the fused network framework with the data in the test set to obtain a high-precision hand motion recognition model;
step 6: and identifying the hand motion by using a hand motion high-precision identification model.
The method has the advantage that, under the TensorFlow deep learning framework and the SSD_MobileNet network model, the color and depth information of each level is extracted by a pair of convolutional neural networks to obtain feature kernels of different layers, all the feature kernels are fused at the detection stage, and the non-maximum suppression algorithm is improved, so that missed detection and false detection of the target are avoided and the gesture recognition rate is improved. Gestures can be recognized both in static pictures and in dynamic video, and both one-hand and two-hand recognition can be realized.
The color image fully captures information such as the color and texture of the object, while the depth image captures the depth information of the object, which is insensitive to factors such as illumination and environment. By fully combining these two kinds of information and training the model with a convolutional neural network, gesture recognition can meet the requirements of practical applications.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network model for recognizing and detecting hand movements in a depth and color background, respectively, according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a network framework in which detection results of hand motion recognition based on depth images and color images are fused according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
Referring to fig. 1, the hand motion recognition method based on depth images and color images provided by the invention includes the following steps:
step 1: self-building a hand movement database;
a depth hand motion database and a color hand motion database are established by using a Kinect sensor to read the 36 gestures in the ASL (American Sign Language) gesture library;
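As a purely illustrative, non-limiting sketch of how such a database might be organized, the snippet below stores paired color and depth frames (assumed to have already been acquired from the Kinect driver) under one folder per gesture class; the folder layout and file naming are assumptions made for this example only.

```python
import os
import cv2
import numpy as np

# The 36 gesture classes of the ASL gesture library: letters a-z and digits 0-9 (assumed naming).
ASL_GESTURES = [chr(c) for c in range(ord("a"), ord("z") + 1)] + [str(d) for d in range(10)]

def save_gesture_sample(color_frame, depth_frame, gesture_label, index, root="gesture_db"):
    """Store one paired color/depth sample under <root>/<gesture_label>/.

    color_frame: 8-bit BGR image from the Kinect color stream.
    depth_frame: 16-bit depth map (millimetres) from the Kinect depth stream.
    """
    out_dir = os.path.join(root, gesture_label)
    os.makedirs(out_dir, exist_ok=True)
    cv2.imwrite(os.path.join(out_dir, f"color_{index:05d}.png"), color_frame)
    # A 16-bit PNG keeps the raw depth values without loss.
    cv2.imwrite(os.path.join(out_dir, f"depth_{index:05d}.png"), depth_frame.astype(np.uint16))
```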
step 2: preprocessing hand data, and dividing the preprocessed data into a training set and a test set;
In this embodiment, under the TensorFlow deep learning framework, the color image data and the depth image data are respectively preprocessed by manually labeling the color images and depth images and performing format conversion.
Step 3: training a depth hand motion detection model and a color hand motion detection model;
training an SSD_MobileNet-based target detection model in the depth environment with the depth training set, and training an SSD_MobileNet-based target detection model in the color environment with the color training set; obtaining two network models that can respectively recognize and detect hand movement under the depth and color backgrounds;
the network model in this embodiment is used to identify and output gestures in the color background and the color video, and the depth background and the depth video, respectively, and the output result includes position information of the gesture in the image and a gesture category confidence.
Step 4: fusing the detection results of the two hand motion recognition and detection network models under the color and depth backgrounds by using an SSD_MobileNet dual-stream network framework;
referring to fig. 2 and 3, in this embodiment, based on a fusion model of an SSD _ mobilene dual-flow network framework, a color and a corresponding depth image are simultaneously input into two SSD _ mobilene networks having the same structure, and features are simultaneously extracted from six feature maps with different scales based on a color network channel and a depth network channel respectively for detection, and detection results of the color network channel and the depth network channel are output after non-maximum suppression is performed simultaneously;
wherein the j-th neuron of the l-th layer of the convolutional network outputs

x_j^l = f_c( Σ_{i∈M} x_i^(l-1) * w^l + b^l )

where x_i^(l-1) is the upper-layer output, f_c(·) is the activation function of the convolutional layer, M is the set of selected input feature maps, w^l is the weight of layer l of the convolutional network, b^l is the bias of network layer l, and x_(i,j)^l denotes the output of the j-th neuron of the i-th channel of the l-th layer;

when the l-th layer is a pooling layer, the output of the j-th neuron of the layer is

x_j^l = f_p( p(x_j^(l-1)) + b^l )

wherein f_p(·) is the activation function of the convolutional layer and p(·) is the pooling function;

when the l-th layer is a fully connected layer, the output of the j-th neuron of the layer is

x_j^l = f_F( w^F · x^(l-1) + b^F )

wherein f_F(·) is the activation function of the fully connected layer, w^F is the weight of the fully connected layer, and b^F is the bias of the fully connected layer (the bias b^l of network layer l);
after extracting the characteristic values for a plurality of times, putting all the characteristic values into a detection layer for fusion, and then carrying out non-maximum suppression operation.
Step 5: improving the non-maximum suppression of the fused network model by introducing a non-maximum suppression algorithm with a proportional penalty coefficient, and testing the fused network framework with the data in the test set to obtain a high-precision hand motion recognition model;
the method improves the non-maximum value inhibition operation, introduces a non-maximum value inhibition algorithm of a proportional penalty coefficient, endows a corresponding penalty coefficient to the prediction box according to the IoU value of the prediction box, reduces the confidence score of the prediction box by turns through the penalty coefficient, and removes the prediction box with lower confidence score through multiple iterations.
In the embodiment of this patent, the non-maximum suppression process is as follows: let B be the set of original prediction boxes (set during image preprocessing), S the set of confidence scores corresponding to the prediction boxes, and F the set of prediction boxes finally kept for output. The prediction boxes in B are first sorted from large to small according to their confidence scores in S; the first prediction box is taken as the suppression box, put into the output set F, and removed from B; the IoU values of the remaining prediction boxes with this first box are then calculated, and the prediction boxes whose IoU is greater than a given threshold T are removed from B; these steps are repeated until the number of prediction boxes in B equals 0. Here IoU is the detection accuracy.
Detection accuracy IoU:

IoU = area(d ∩ p) / area(d ∪ p)
wherein d represents a detection box and p represents a preselected (labelled) box; IoU is the degree of coincidence between the box d predicted by the network framework and the box p labelled in the original picture, and is calculated as the intersection of the detection result d and the labelled sample p divided by their union, i.e. the detection accuracy IoU.
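The detection accuracy IoU defined above can be computed directly from the box coordinates; the sketch below assumes boxes given as [x1, y1, x2, y2] corner coordinates.

```python
def iou(box_d, box_p):
    """Intersection over union of detection box d and labelled box p, both [x1, y1, x2, y2]."""
    x1, y1 = max(box_d[0], box_p[0]), max(box_d[1], box_p[1])
    x2, y2 = min(box_d[2], box_p[2]), min(box_d[3], box_p[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_d = (box_d[2] - box_d[0]) * (box_d[3] - box_d[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    union = area_d + area_p - inter
    return inter / union if union > 0 else 0.0
```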
By introducing the non-maximum suppression algorithm with a proportional penalty coefficient, a corresponding penalty coefficient is assigned to a prediction box according to its IoU value, the confidence score of the prediction box is reduced round by round through the penalty coefficient, and prediction boxes with low confidence scores are removed over multiple iterations. When the IoU value of a detection box with the preselected detection box is greater than or equal to the given threshold T, the corresponding penalty coefficient α is calculated from the IoU value; when the IoU is less than the given threshold, the penalty coefficient is 1. The proportional penalty coefficient is calculated from the IoU between a prediction box B_j and the prediction box B_max with the highest original confidence score, specifically:

α = 1, when IoU(B_max, B_j) < T;

α = 1 − lg(IoU(B_max, B_j) + 1), when IoU(B_max, B_j) ≥ T;

wherein IoU(B_max, B_j) is the IoU value of the detection box with the box having the highest confidence score. After the penalty coefficient is introduced, if IoU(B_max, B_j) < T, the penalty coefficient is 1 and the confidence is the same as that of the corresponding detection box under the original NMS; if IoU(B_max, B_j) ≥ T, the confidence of the corresponding detection box under the NMS with the penalty coefficient becomes G·(1 − lg(IoU(B_max, B_j) + 1)), where G is the original confidence score of the corresponding detection box.
Because the original non-maximum suppression algorithm directly removes prediction boxes whose IoU with the preselected detection box exceeds the given threshold T, introducing the penalty coefficient restricts the condition under which prediction boxes are removed, so that missed detection and false detection of the target are avoided.
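To make the procedure concrete, the following non-limiting sketch implements the non-maximum suppression with a proportional penalty coefficient described above: boxes whose IoU with the current highest-scoring box reaches the threshold T are not discarded outright; their confidence is instead multiplied by α = 1 − lg(IoU + 1) and they are reconsidered in later rounds. The iou() helper is the one sketched earlier, and the final score cut-off value is an assumption made for this example.

```python
import numpy as np

def penalty_nms(boxes, scores, iou_threshold=0.5, score_cutoff=0.001):
    """Non-maximum suppression with a proportional penalty coefficient.

    Returns the indices of the kept prediction boxes, highest confidence first.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        # Suppression box: the remaining box with the highest (possibly penalised) score.
        best = max(remaining, key=lambda i: scores[i])
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap >= iou_threshold:
                # Penalty coefficient alpha = 1 - lg(IoU + 1); alpha stays 1 below the threshold.
                scores[i] *= 1.0 - np.log10(overlap + 1.0)
        # Prediction boxes whose penalised confidence has dropped too low are removed.
        remaining = [i for i in remaining if scores[i] >= score_cutoff]
    return keep
```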
Step 6: and identifying the hand motion by using a hand motion high-precision identification model.
The hand motion recognition method based on the depth image and the color image solves the problem of combining color data and depth data in an optimal way; it can accurately read one-hand and two-hand gesture information in complex environments and realizes recognition-based human-computer interaction.
It should be understood that parts of the specification not set forth in detail are prior art; the above description of the preferred embodiments is intended to be illustrative, and not to be construed as limiting the scope of the invention, which is defined by the appended claims, and all changes and modifications that fall within the metes and bounds of the claims, or equivalences of such metes and bounds are therefore intended to be embraced by the appended claims.

Claims (6)

1. A hand motion recognition method based on a depth image and a color image is characterized by comprising the following steps:
step 1: self-building a hand movement database;
the 36 gestures in the ASL (American Sign Language) gesture library are read through a Kinect sensor, and a depth hand motion database and a color hand motion database are established;
step 2: preprocessing hand data, and dividing the preprocessed data into a training set and a test set;
step 3: training a depth hand motion detection model and a color hand motion detection model;
training an SSD_MobileNet-based target detection model in the depth environment with the depth training set, and training an SSD_MobileNet-based target detection model in the color environment with the color training set; obtaining two network models that can respectively recognize and detect hand movement under the depth and color backgrounds;
step 4: fusing the detection results of the two hand motion recognition and detection network models under the color and depth backgrounds by using an SSD_MobileNet dual-stream network framework;
step 5: improving the non-maximum suppression of the fused network model by introducing a non-maximum suppression algorithm with a proportional penalty coefficient, and testing the fused network framework with the data in the test set to obtain a high-precision hand motion recognition model;
step 6: and identifying the hand motion by using a hand motion high-precision identification model.
2. The hand motion recognition method based on the depth image and the color image as claimed in claim 1, wherein: in step 2, under the TensorFlow deep learning framework, the color image data and the depth image data are respectively preprocessed by manually labeling the color images and depth images and performing format conversion.
3. The hand motion recognition method based on the depth image and the color image as claimed in claim 1, wherein: in step 3, the network model can be used for recognizing and outputting gestures in a color background, a color video, a depth background and a depth video respectively, and output results comprise position information of the gestures in the image and gesture category confidence.
4. The hand motion recognition method based on the depth image and the color image as claimed in claim 1, wherein: in step 4, in the fusion model based on the SSD_MobileNet dual-stream network framework, a color image and the corresponding depth image are input simultaneously into two SSD_MobileNet networks with the same structure; features are extracted for detection from six feature maps of different scales in the color network channel and the depth network channel respectively, and the detection results of the two channels are output after non-maximum suppression;
wherein the j-th neuron of the l-th layer of the convolutional network outputs

x_j^l = f_c( Σ_{i∈M} x_i^(l-1) * w^l + b^l )

where x_i^(l-1) is the upper-layer output, f_c(·) is the activation function of the convolutional layer, M is the set of selected input feature maps, w^l is the weight of layer l of the convolutional network, b^l is the bias of network layer l, and x_(i,j)^l denotes the output of the j-th neuron of the i-th channel of the l-th layer;

when the l-th layer is a pooling layer, the output of the j-th neuron of the layer is

x_j^l = f_p( p(x_j^(l-1)) + b^l )

wherein f_p(·) is the activation function of the convolutional layer and p(·) is the pooling function;

when the l-th layer is a fully connected layer, the output of the j-th neuron of the layer is

x_j^l = f_F( w^F · x^(l-1) + b^F )

wherein f_F(·) is the activation function of the fully connected layer, w^F is the weight of the fully connected layer, and b^F is the bias of the fully connected layer (the bias b^l of network layer l);
after extracting the characteristic values for a plurality of times, putting all the characteristic values into a detection layer for fusion, and then carrying out non-maximum suppression operation.
5. The hand motion recognition method based on the depth image and the color image as claimed in claim 1, wherein: in step 5, the non-maximum suppression process is as follows: let B be the set of prediction boxes set during image preprocessing, S the set of confidence scores corresponding to the prediction boxes, and F the set of prediction boxes finally kept for output; the prediction boxes in B are first sorted from large to small according to their confidence scores in S; the first prediction box is taken as the suppression box, put into the output set F, and removed from B; the IoU values of the remaining prediction boxes with this first box are then calculated, and prediction boxes whose IoU is greater than a given threshold T are removed from B; these steps are repeated until the number of prediction boxes in B equals 0;
wherein:

IoU = area(d ∩ p) / area(d ∪ p)
wherein d represents a detection box and p represents a preselected (labelled) box; IoU is the degree of coincidence between the box d predicted by the network framework and the box p labelled in the original picture, and is calculated as the intersection of the detection result d and the labelled sample p divided by their union, i.e. the detection accuracy IoU;
by introducing the non-maximum suppression algorithm with a proportional penalty coefficient, a corresponding penalty coefficient is assigned to a prediction box according to its IoU value, the confidence score of the prediction box is reduced round by round through the penalty coefficient, and prediction boxes with low confidence scores are removed over multiple iterations; when the IoU value of a detection box with the preselected detection box is greater than or equal to the given threshold T, the corresponding penalty coefficient α is calculated from the IoU value, and when the IoU is less than the given threshold, the penalty coefficient is 1; the proportional penalty coefficient is calculated from the IoU between a prediction box B_j and the prediction box B_max with the highest original confidence score, specifically:

α = 1, when IoU(B_max, B_j) < T;

α = 1 − lg(IoU(B_max, B_j) + 1), when IoU(B_max, B_j) ≥ T;

wherein IoU(B_max, B_j) is the IoU value of the detection box with the box having the highest confidence score; after the penalty coefficient is introduced, if IoU(B_max, B_j) < T, the penalty coefficient is 1 and the confidence is the same as that of the corresponding detection box under the original NMS; if IoU(B_max, B_j) ≥ T, the confidence of the corresponding detection box under the NMS with the penalty coefficient becomes G·(1 − lg(IoU(B_max, B_j) + 1)), where G is the original confidence score of the corresponding detection box;
because the original non-maximum suppression algorithm directly removes prediction boxes whose IoU with the preselected detection box exceeds the given threshold T, introducing the penalty coefficient restricts the condition under which prediction boxes are removed, so that missed detection and false detection of the target are avoided.
6. The hand motion recognition method based on the depth image and the color image according to any one of claims 1 to 5, wherein: in step 5, the paired depth and color hand test sets are input into the fused network framework for testing, and the results are compared with the test results of the single-modality hand recognition detection models to verify the effectiveness of the fused framework.
CN201910945063.7A 2019-09-30 2019-09-30 Hand motion recognition method based on depth image and color image Active CN110796018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910945063.7A CN110796018B (en) 2019-09-30 2019-09-30 Hand motion recognition method based on depth image and color image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910945063.7A CN110796018B (en) 2019-09-30 2019-09-30 Hand motion recognition method based on depth image and color image

Publications (2)

Publication Number Publication Date
CN110796018A true CN110796018A (en) 2020-02-14
CN110796018B CN110796018B (en) 2023-04-28

Family

ID=69438923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910945063.7A Active CN110796018B (en) 2019-09-30 2019-09-30 Hand motion recognition method based on depth image and color image

Country Status (1)

Country Link
CN (1) CN110796018B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339985A (en) * 2020-03-06 2020-06-26 南京理工大学 Gesture detection method based on mixed convolution
CN111652110A (en) * 2020-05-28 2020-09-11 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN111860448A (en) * 2020-07-30 2020-10-30 北京华捷艾米科技有限公司 Hand washing action recognition method and system
CN111968087A (en) * 2020-08-13 2020-11-20 中国农业科学院农业信息研究所 Plant disease area detection method
CN112784810A (en) * 2021-02-08 2021-05-11 风变科技(深圳)有限公司 Gesture recognition method and device, computer equipment and storage medium
CN113221744A (en) * 2021-05-12 2021-08-06 天津大学 Monocular image 3D object detection method based on deep learning
CN115035119A (en) * 2022-08-12 2022-09-09 山东省计算中心(国家超级计算济南中心) Glass bottle bottom flaw image detection and removal device, system and method
CN115138059A (en) * 2022-09-06 2022-10-04 南京市觉醒智能装备有限公司 Pull-up standard counting method, pull-up standard counting system and storage medium of pull-up standard counting system


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
CN107038424A (en) * 2017-04-20 2017-08-11 华中师范大学 A kind of gesture identification method
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks
CN107808143A (en) * 2017-11-10 2018-03-16 西安电子科技大学 Dynamic gesture identification method based on computer vision
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN109409246A (en) * 2018-09-30 2019-03-01 中国地质大学(武汉) Acceleration robust features bimodal gesture based on sparse coding is intended to understanding method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Zhuang et al.: "Application of Dual-Channel Faster R-CNN in RGB-D Hand Detection" *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339985A (en) * 2020-03-06 2020-06-26 南京理工大学 Gesture detection method based on mixed convolution
CN111652110A (en) * 2020-05-28 2020-09-11 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN111860448A (en) * 2020-07-30 2020-10-30 北京华捷艾米科技有限公司 Hand washing action recognition method and system
CN111968087A (en) * 2020-08-13 2020-11-20 中国农业科学院农业信息研究所 Plant disease area detection method
CN111968087B (en) * 2020-08-13 2023-11-07 中国农业科学院农业信息研究所 Plant disease area detection method
CN112784810A (en) * 2021-02-08 2021-05-11 风变科技(深圳)有限公司 Gesture recognition method and device, computer equipment and storage medium
CN113221744A (en) * 2021-05-12 2021-08-06 天津大学 Monocular image 3D object detection method based on deep learning
CN113221744B (en) * 2021-05-12 2022-10-04 天津大学 Monocular image 3D object detection method based on deep learning
CN115035119A (en) * 2022-08-12 2022-09-09 山东省计算中心(国家超级计算济南中心) Glass bottle bottom flaw image detection and removal device, system and method
CN115138059A (en) * 2022-09-06 2022-10-04 南京市觉醒智能装备有限公司 Pull-up standard counting method, pull-up standard counting system and storage medium of pull-up standard counting system
CN115138059B (en) * 2022-09-06 2022-12-02 南京市觉醒智能装备有限公司 Pull-up standard counting method, pull-up standard counting system and storage medium of pull-up standard counting system

Also Published As

Publication number Publication date
CN110796018B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN110796018A (en) Hand motion recognition method based on depth image and color image
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN109325454B (en) Static gesture real-time recognition method based on YOLOv3
WO2020182121A1 (en) Expression recognition method and related device
US10445602B2 (en) Apparatus and method for recognizing traffic signs
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN107247952B (en) Deep supervision-based visual saliency detection method for cyclic convolution neural network
CN111259940A (en) Target detection method based on space attention map
CN108073851B (en) Grabbing gesture recognition method and device and electronic equipment
CN110674741A (en) Machine vision gesture recognition method based on dual-channel feature fusion
CN110705412A (en) Video target detection method based on motion history image
CN115328319B (en) Intelligent control method and device based on light-weight gesture recognition
CN112163508A (en) Character recognition method and system based on real scene and OCR terminal
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN113297956A (en) Gesture recognition method and system based on vision
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
CN114445853A (en) Visual gesture recognition system recognition method
CN110472673B (en) Parameter adjustment method, fundus image processing device, fundus image processing medium and fundus image processing apparatus
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN115223239A (en) Gesture recognition method and system, computer equipment and readable storage medium
CN111368637B (en) Transfer robot target identification method based on multi-mask convolutional neural network
CN111078008B (en) Control method of early education robot

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200214

Assignee: JINGMEN WUSAN MACHINERY EQUIPMENT MANUFACTURING Co.,Ltd.

Assignor: WUHAN University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2023420000176

Denomination of invention: A Method for Hand Motion Recognition Based on Depth and Color Images

Granted publication date: 20230428

License type: Common License

Record date: 20230614