CN111539288A

CN111539288A - Real-time detection method for gestures of both hands

Info

Publication number: CN111539288A
Application number: CN202010301111.1A
Authority: CN
Inventors: 高成英; 李文盛
Original assignee: National Sun Yat Sen University
Current assignee: National Sun Yat Sen University
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2020-08-14
Anticipated expiration: 2040-04-16
Also published as: CN111539288B

Abstract

The invention discloses a real-time detection method of two-hand postures, which can reconstruct skeleton models of two hands by reconstructing the two-hand postures by adopting 2d joint point positions and 3d joint point positions, can clearly construct even the two-hand postures of complex interaction, solves the problem that the two-hand postures of complex interaction can not be detected in the prior art, and simultaneously can reduce the operation difficulty of reconstructing the two-hand skeleton models and improve the speed of reconstructing the two-hand skeleton models by adopting a mode of fitting the 2d joint point positions and the 3d joint point positions, thereby ensuring the real-time property of detecting the two-hand postures and solving the problem that the real-time property is difficult to realize in the prior art.

Description

Real-time detection method for gestures of both hands

Technical Field

The invention relates to the technical field of gesture detection, in particular to a real-time detection method for gestures of two hands.

Background

The hand plays a very critical role in human daily life, hand gestures contain a large amount of non-language communication information, tracking and reconstruction of hand gestures become more and more important, prediction of 3D hand gestures is a long-term research direction in computer vision, and a large number of applications are applied in the fields of virtual/augmented reality (VR/AR), human-computer interaction, human motion tracking and control, and the like, wherein real-time and accurate detection of hand gestures is required in all of the applications.

However, the conventional method for detecting the hand posture has the following disadvantages: 1. only two hands with simple gestures can be detected, and two-hand gestures with complex interaction cannot be detected; 2. when the mesh of the hand posture is reconstructed, a large amount of calculation and more hardware resources are needed, and the real-time performance is difficult to meet.

Disclosure of Invention

The invention aims to provide a real-time detection method for the posture of two hands, which solves the problems that the posture of two hands with complex interaction cannot be detected and the real-time performance is difficult to realize in the prior art.

The invention is realized by the following technical scheme:

a real-time detection method of double-hand posture is based on a monocular camera and specifically comprises the following steps:

step S1, capturing single-frame images of both hands by a monocular camera, inputting the single-frame images into an image segmentation network for segmentation, and segmenting into segmentation results of three categories including a left hand, a right hand and a background;

step S2, extracting a left-hand heat map comprising the position of the left-hand 2d joint point and a right-hand heat map comprising the position of the right-hand 2d joint point according to the segmentation result;

step S3, calculating the position of a left-hand 3d joint point and the position of a right-hand 3d joint point according to a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points;

and step S4, fitting the positions of the left-hand 2d joint points and the positions of the left-hand 3d joint points with the left-hand skeleton model, and fitting the positions of the right-hand 2d joint points and the positions of the right-hand 3d joint points with the right-hand skeleton model to obtain parameters of the left-hand skeleton model and the right-hand skeleton model, so as to obtain the postures of the two hands.

As a further alternative to the real-time detection method of the two-hand posture, the step S1 includes the steps of:

step S11, extracting image features according to the input double-hand single-frame image;

step S12, performing up-sampling operation on the image features to obtain probability graphs of three categories including a left hand, a right hand and a background;

and step S13, obtaining segmentation results of three categories including the left hand, the right hand and the background according to the probability graph including the left hand, the right hand and the background.

As a further alternative to the real-time detection method of two-handed gestures, the image segmentation network comprises a first convolutional layer, a second convolutional layer, and a transposed convolutional layer.

As a further alternative to the real-time detection method of the two-hand posture, the step S11 includes the steps of:

step S111, inputting the two-hand single-frame image into a first convolution layer for down-sampling processing;

in step S112, the downsampled image is input to the second convolution layer and image feature extraction is performed.

As a further alternative to the real-time detection method of the two-hand posture, the step S2 includes the steps of:

step S21, overlapping the segmentation results of the three categories including the left hand, the right hand and the background with the original single-frame image, inputting the result into a two-dimensional joint point extraction network after overlapping, and performing down-sampling processing to obtain posture characteristics;

step S22, performing upsampling processing on the posture features to obtain a left-hand heat map including the position of the left-hand 2d joint point and a right-hand heat map including the position of the right-hand 2d joint point.

As a further alternative to the real-time detection method of two-hand gestures, the two-dimensional joint point extraction network includes a network of Hourglass structures and a third convolutional layer.

As a further alternative to the real-time detection method of the two-hand posture, the step S3 includes the steps of:

step S31, extracting the confidence coefficient of the left-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point according to the left-hand heat map and the right-hand heat map;

step S32, the left hand 2d joint point position and the confidence coefficient of the left hand 2d joint point, and the right hand 2d joint point position and the confidence coefficient of the right hand 2d joint point are input into a three-dimensional joint point extraction network, and the left hand 3d joint point position and the right hand 3d joint point position are obtained.

As a further alternative to the real-time detection method of two-hand gestures, the three-dimensional joint point extraction network comprises a first fully-connected layer, a dual linear module, and a second fully-connected layer.

As a further alternative to the real-time detection method of two-hand gestures, the dual linear module comprises a first dual linear module and a second dual linear module, the first dual linear module and the second dual linear module each comprising two fully-connected layers.

As a further alternative of the real-time detection method of the two-hand posture, the fitting in step S4 is performed by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.

The invention has the beneficial effects that:

by using the method, the two-hand posture reconstruction is carried out by adopting the 2d joint point position and the 3d joint point position, so that the skeleton models of two hands can be reconstructed, even the two-hand posture with complex interaction can be clearly constructed, the problem that the two-hand posture with complex interaction can not be detected in the prior art is solved, meanwhile, the calculation difficulty of reconstructing the two-hand skeleton models can be reduced by adopting the mode of fitting the 2d joint point position and the 3d joint point position, the speed of reconstructing the two-hand skeleton models is improved, the real-time property of detecting the two-hand posture is ensured, and the problem that the real-time property is difficult to realize in the prior art is solved.

Drawings

FIG. 1 is a schematic flow chart of a real-time detection method of two-hand gestures according to the present invention;

FIG. 2 is a schematic diagram illustrating the components of an image segmentation network in the real-time detection method for two-hand gestures according to the present invention;

FIG. 3 is a schematic diagram illustrating a two-dimensional joint extraction network in the real-time detection method of two-hand gestures according to the present invention;

FIG. 4 is a schematic diagram illustrating a three-dimensional joint extraction network in a real-time detection method for two-hand gestures according to the present invention;

description of reference numerals: 1. a first winding layer; 2. a second convolutional layer; 3. transposing the convolution layer; 4. a network of the Hourglass architecture; 5. a third convolutional layer; 6. a first fully-connected layer; 7. a first dual linear module; 8. a second dual linear module; 9. a second fully connected layer.

Detailed Description

The invention will be described in detail with reference to the drawings and specific embodiments, which are illustrative of the invention and are not to be construed as limiting the invention.

As shown in fig. 1 to 4, a real-time detection method for a two-hand gesture, which is based on a monocular camera, specifically includes the following steps:

In this embodiment, carry out both hands posture through adopting 2d joint point position and 3d joint point position and rebuild, can rebuild out the skeleton model of both hands, even complicated interactive both hands posture also can clearly be found, the problem that can't detect complicated interactive both hands posture that prior art exists has been solved, and simultaneously, carry out the mode of fitting through adopting 2d joint point position and 3d joint point position, can reduce the operation degree of difficulty of rebuilding both hands skeleton model, promote the speed of rebuilding both hands skeleton model, thereby the real-time of detecting both hands posture has been guaranteed, thereby the problem that the real-time nature is difficult to realize that prior art exists has been solved.

It should be noted that, the skeleton model of two hands, each hand includes 21 joint points 2d and 21 joint points 3d, wherein the joint point at the wrist is used as the root joint point, there are four joint points on each finger, and the skeleton of each hand has 26 degrees of freedom, wherein there are 6 degrees of freedom at the root joint point at the wrist, and there are 4 degrees of freedom in each finger.

Preferably, the step S1 includes the steps of:

In the embodiment, a captured single-frame image is input into an image segmentation network to obtain a segmentation image comprising three categories of a left hand, a right hand and a background; the method specifically comprises the following steps: the image segmentation network firstly extracts image characteristics through downsampling, then restores the image characteristics to original pixels through upsampling, and adds the characteristics which are the same as the pixels in the downsampling process during the upsampling process to be used as input of the next upsampling process, so that the characteristics in the original image can be guaranteed not to be lost.

Preferably, the image segmentation network comprises a first convolutional layer 1, a second convolutional layer 2 and a transposed convolutional layer 3.

In this embodiment, the first convolutional layer 1 is an encoder, the transposed convolutional layer 3 is a decoder, and the resolution of the image is reduced by the encoder and restored by the decoder.

Preferably, the step S11 includes the steps of:

In this embodiment, the first convolutional layer 1 includes five convolutional layers having a convolutional kernel size of 3 and a step size of 2, and is capable of reducing the resolution of an input image to half of the original resolution, reducing the resolution to thirty-half of the original image after five successive dimensionalities reduction, the convolutional kernel size of the second convolutional layer 2 is 3 and a step size of 1, and is capable of extracting image features, and the transposed convolutional layer 3 includes five convolutional layers having a convolutional kernel size of 3 and a step size of 2, and is capable of increasing the resolution of the input features to twice of the original resolution.

Preferably, the step S2 includes the steps of:

In this embodiment, the original single frame image and the segmentation result are superimposed together, and the superimposed single frame image and the segmentation result are input into a two-dimensional joint point extraction network, the network firstly performs down-sampling to extract posture features, and then performs up-sampling to obtain 42H × W probability maps, where H is the height of the original image, W is the width of the original image, each probability map represents the position of one joint point, the corresponding position with the maximum value in the probability map is the position of the corresponding two-dimensional joint point, and the corresponding 42 joint points can be extracted from the probability map, where 21 joint points exist on the left hand and 21 joint points exist on the right hand.

Preferably, the two-dimensional joint point extraction network includes a network 4 of a Hourglass structure and a third convolution layer 5.

Preferably, the step S3 includes the steps of:

In this embodiment, the position of the point with the largest value in each heat map is the position of the 2d joint point, and this value is the confidence of the 2d joint point prediction, so the confidence of the left-hand 2d joint point position and the left-hand 2d joint point, and the confidence of the right-hand 2d joint point position and the right-hand 2d joint point can be extracted through the heat maps.

Preferably, the three-dimensional joint point extraction network comprises a first fully-connected layer 6, a dyadic linear module and a second fully-connected layer 9.

Preferably, the dual linear module includes a first dual linear module 7 and a second dual linear module 8, and the first dual linear module 7 and the second dual linear module 8 respectively include two fully connected layers.

Preferably, the fitting in step S4 is performed by a minimization energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.

Example (b):

step S1, shooting single-frame images of both hands by a monocular camera, and inputting the shot single-frame images into an image segmentation network to obtain three types of segmentation maps, each of which is: left hand, right hand, and background; specifically, the image segmentation network firstly extracts image features through downsampling, then restores the image features to original pixels through upsampling, and adds the features which are the same as the pixels during the downsampling during the upsampling to be used as the input of the next upsampling, so that the features in the original image can be ensured not to be lost, the output of the network is a probability graph of H W3, wherein H is the height of the original image, W is the width of the original image, in the result of H W, values of three channels corresponding to each point are used as the probabilities of three categories, and the result of H W1 is extracted from the probability graph, in the result, the value of a background part is 0, the value of a left hand part is 1, and the value of a right hand part is 2; it should be noted that, the segmentation result corresponds to the original image one to one, the position where the point with the median value of 1 in the segmentation result is located corresponds to the original image, that is, the pixel point where the left hand is located, and the point with the median value of 2 in the segmentation result corresponds to the pixel point where the right hand is located in the original image; when training the image segmentation network, calculating the cross entropy of the predicted value and the true value by using the following loss function:

wherein M represents three categories, 3, S in the present invention_iAnd

respectively representing the real value and the predicted value of the ith class segmentation result.

And step S2, superposing the original single-frame image and the segmentation result together to obtain one H W4 feature as input into a two-dimensional joint point extraction network, firstly down-sampling the extracted feature by the network, and then up-sampling the extracted feature to obtain 42H W probability maps, wherein each probability map represents the position of one joint point, the position corresponding to the point with the maximum value in the probability maps is the position of the corresponding two-dimensional joint point, and the corresponding 42 joint points can be extracted from the probability maps, wherein the left hand has 21 joint points, and the right hand has 21 joint points.

Step S3 is a process of representing the position of each joint point by using a heat map, and extracting the position of the point with the maximum value from the heat map, that is, 2 obtained by predictiond position of joint point, maximum value c in ith thermogram_i∈[0，1]For predicting the confidence of the ith joint point, batch normalization operation and sigmoid activation operation are required to be performed after each layer, and the following loss function is adopted in the training of the step:

wherein N is the number of 2d joint points, and 42 u is taken in the invention_iAnd

respectively representing the true value and the predicted value of the ith key point;

combining the position of the left-hand 2d joint point and the confidence coefficient of the left-hand 2d joint point, combining the position of the right-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point, inputting a combined result into a three-dimensional joint point extraction network to obtain the position of a left-hand 3d joint point and the position of a right-hand 3d joint point, specifically, firstly expanding an input vector to 1024 dimensions through a full connecting layer, then passing through two dual linear modules, and finally converting the input vector to 42 x 3 through the full connecting layer to obtain the global positions of 42 left-hand and right-hand joint points;

it should be noted that the following loss function is used for training:

wherein, J_iIs the true value of the position of the joint point,

the predicted value of the joint point position is shown, and N is the number of the joint points.

Step S4, using a moving skeleton model to fit to the predicted 2d/3d joint points, wherein the skeleton model of each hand comprises 26 degrees of freedom, t ∈ R³And R ∈ SO (3) respectively represent the global position and rotation angle of the root joint point, θ ∈ R²⁰Representing fingersTake Θ ═ t, R, θ } as a parameter of the skeletal model, by transforming M (Θ) ∈ R^21×3Obtaining the global position of the hand joint point, and recording the parameters of the left and right hand skeletons as theta_LAnd Θ_R，Θ_H＝{Θ_L,Θ_RDenotes the skeletal parameters of both hands, fitting the skeletal model to 3d joint points by minimizing the following, where J_iRepresents the global position of the ith 3d joint point:

in addition, the 2d joint points are used as additional constraints to make the predicted result more fit to the features of the hand in the original image. Fitting the skeleton to the 2d joint point by minimizing the following formula, where u_iDenotes the position of the ith 2d joint, pi is used to project the 3d joint onto the 2d plane:

in order to keep the posture of the hand skeleton model normal, it is necessary to ensure that the hand joints do not have large-angle bending, and therefore, limitation needs to be added to the joint angles. Here we only constrain the parameters predicted from the first frame, let

And

the upper limit and the lower limit of the ith joint angle are respectively, and the joint angle is monitored by the following formula:

in order to avoid the excessive change of the hand posture amplitude obtained by reconstruction between adjacent frames, the change rate of the parameters obtained by prediction of two adjacent frames needs to be constrained, as shown in the following formula:

constraining the skeleton fitting process through the four formulas, and fitting by minimizing the following energy equation to obtain theta_HWhere w is the weight of each term, w is the time when predicting the parameters of the first frame₃Not 0, in the subsequent prediction, w₃Is 0:

E＝ω₁E_3D+ω₂E_2D+ω₃E

during training, the left-hand segmentation, the 2d joint point prediction and the 3d joint point prediction tasks are pre-trained respectively, and then the prediction of the 2d/3d joint points is trained end to end.

The technical solutions provided by the embodiments of the present invention are described in detail above, and the principles and embodiments of the present invention are explained herein by using specific examples, and the descriptions of the embodiments are only used to help understanding the principles of the embodiments of the present invention; meanwhile, for a person skilled in the art, according to the embodiments of the present invention, there may be variations in the specific implementation manners and application ranges, and in summary, the content of the present description should not be construed as a limitation to the present invention.

Claims

1. A real-time detection method of bimanual gestures, said method based on a monocular camera, characterized in that: the method specifically comprises the following steps:

step S1, capturing a double-hand single-frame image through a monocular camera, inputting the single-frame image into an image segmentation network for segmentation, and segmenting into segmentation results of three categories including a left hand, a right hand and a background;

2. A method of real-time detection of bimanual gestures as claimed in claim 1, further comprising: the step S1 includes the steps of:

3. A method of real-time detection of bimanual gestures as claimed in claim 2, further comprising: the image segmentation network includes a first convolutional layer, a second convolutional layer, and a transposed convolutional layer.

4. A method of real-time detection of bimanual gestures as claimed in claim 3, further comprising: the step S11 includes the steps of:

5. A method for real-time detection of bimanual gestures according to claim 1 or 4, further comprising: the step S2 includes the steps of:

6. A method of real-time detection of bimanual gestures according to claim 5, further comprising: the two-dimensional joint point extraction network comprises a network of a Hourglass structure and a third convolution layer.

7. The method of claim 6, wherein the real-time detection of the two-hand gesture comprises: the step S3 includes the steps of:

8. A method of real-time detection of bimanual gestures according to claim 7, further comprising: the three-dimensional joint point extraction network comprises a first fully-connected layer, a dual linear module and a second fully-connected layer.

9. A method of real-time detection of bimanual gestures as claimed in claim 8, further comprising: the dual linear module comprises a first dual linear module and a second dual linear module, wherein the first dual linear module and the second dual linear module respectively comprise two full connection layers.

10. A method of real-time detection of bimanual gestures according to claim 9, further comprising: the fitting in step S4 is a fitting by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.