CN111539288B - Real-time detection method for gestures of both hands - Google Patents

Real-time detection method for gestures of both hands

Info

Publication number
CN111539288B
CN111539288B (application CN202010301111.1A)
Authority
CN
China
Prior art keywords
hand
joint
joint point
real
time detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010301111.1A
Other languages
Chinese (zh)
Other versions
CN111539288A (en)
Inventor
高成英
李文盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010301111.1A priority Critical patent/CN111539288B/en
Publication of CN111539288A publication Critical patent/CN111539288A/en
Application granted granted Critical
Publication of CN111539288B publication Critical patent/CN111539288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

The invention discloses a real-time detection method for two-hand posture. By reconstructing the two-hand posture from 2d joint point positions and 3d joint point positions, the method can reconstruct the skeleton models of both hands and can clearly reconstruct even two-hand postures with complex interaction, solving the problem in the prior art that complexly interacting two-hand postures cannot be detected. At the same time, fitting the 2d joint point positions and the 3d joint point positions reduces the computational difficulty of reconstructing the two-hand skeleton models and increases the reconstruction speed, thereby ensuring real-time detection of the two-hand posture and solving the problem that real-time performance is difficult to achieve in the prior art.

Description

Real-time detection method for gestures of both hands
Technical Field
The invention relates to the technical field of gesture detection, in particular to a real-time detection method for gestures of two hands.
Background
The hand plays a critical role in daily human life, and hand gestures carry a large amount of non-verbal communication information, so tracking and reconstruction of hand gestures have become increasingly important. Prediction of 3D hand pose is a long-standing research direction in computer vision and has many applications in virtual/augmented reality (VR/AR), human-computer interaction, human motion tracking and control, and related fields, all of which require real-time and accurate detection of hand gestures.
However, conventional methods for detecting hand posture have the following disadvantages: 1. they can only detect the two hands in simple postures and cannot detect two-hand postures with complex interaction; 2. reconstructing a mesh of the hand posture requires a large amount of computation and hardware resources, and real-time performance is difficult to achieve.
Disclosure of Invention
The invention aims to provide a real-time detection method for gestures of two hands, which solves the problems that the gestures of two hands with complex interaction cannot be detected and the real-time property is difficult to realize in the prior art.
The invention is realized by the following technical scheme:
a real-time detection method of two-hand posture is based on a monocular camera and specifically comprises the following steps:
Step S1, a single-frame image of both hands is captured by the monocular camera and input into an image segmentation network for segmentation, obtaining segmentation results comprising a left hand, a right hand and a background;
S2, a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points are extracted according to the segmentation results;
S3, the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points are calculated according to the left-hand heat map comprising the positions of the left-hand 2d joint points and the right-hand heat map comprising the positions of the right-hand 2d joint points;
and S4, the positions of the left-hand 2d joint points and the positions of the left-hand 3d joint points are fitted with a left-hand skeleton model, and the positions of the right-hand 2d joint points and the positions of the right-hand 3d joint points are fitted with a right-hand skeleton model, so as to obtain the parameters of the left-hand skeleton model and the right-hand skeleton model and thereby the posture of both hands.
As a further alternative of the real-time detection method of two-hand gestures, the step S1 includes the steps of:
s11, extracting image features according to the input double-hand single-frame image;
s12, performing up-sampling operation on the image features to obtain a probability map comprising three categories of a left hand, a right hand and a background;
and S13, obtaining segmentation results of three categories including the left hand, the right hand and the background according to the probability graph including the left hand, the right hand and the background.
As a further alternative to the real-time detection method of two-handed gestures, the image segmentation network comprises a first convolutional layer, a second convolutional layer, and a transposed convolutional layer.
As a further alternative to the real-time detection method of two-hand gestures, the step S11 includes the steps of:
step S111, inputting the two-hand single-frame image into a first convolution layer for down-sampling processing;
in step S112, the downsampled image is input to the second convolution layer and image feature extraction is performed.
As a further alternative to the real-time detection method of two-hand gestures, the step S2 includes the steps of:
s21, overlapping segmentation results of three categories including a left hand, a right hand and a background with an original single-frame image, and inputting the overlapped segmentation results into a two-dimensional joint point extraction network for down-sampling processing to obtain posture characteristics;
and S22, performing up-sampling processing on the posture characteristics to obtain a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points.
As a further alternative to the real-time detection method of two-hand gestures, the two-dimensional joint point extraction network includes a network of Hourglass structures and a third convolutional layer.
As a further alternative to the real-time detection method of two-hand gestures, the step S3 includes the steps of:
step S31, extracting the confidence coefficient of the left-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point according to the left-hand heat map and the right-hand heat map;
and S32, inputting the positions and confidences of the left-hand 2d joint points and the positions and confidences of the right-hand 2d joint points into a three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points.
As a further alternative to the real-time detection method of two-hand gestures, the three-dimensional joint point extraction network comprises a first fully-connected layer, a dual linear module, and a second fully-connected layer.
As a further alternative to the real-time detection method of two-hand gestures, the dual linear module comprises a first dual linear module and a second dual linear module, the first dual linear module and the second dual linear module each comprising two fully-connected layers.
As a further alternative of the real-time detection method of the two-hand posture, the fitting in step S4 is a fitting by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.
The invention has the beneficial effects that:
By using the method, the two-hand posture is reconstructed from the 2d joint point positions and the 3d joint point positions, so the skeleton models of both hands can be reconstructed and even complexly interacting two-hand postures can be clearly reconstructed, solving the problem in the prior art that complexly interacting two-hand postures cannot be detected. At the same time, fitting both the 2d joint point positions and the 3d joint point positions reduces the computational difficulty of reconstructing the two-hand skeleton models and increases the reconstruction speed, ensuring real-time detection of the two-hand posture and solving the prior-art problem that real-time performance is difficult to achieve.
Drawings
FIG. 1 is a schematic flow chart of a real-time detection method of two-hand gestures according to the present invention;
FIG. 2 is a schematic diagram illustrating the components of an image segmentation network in the real-time detection method for two-hand gestures according to the present invention;
FIG. 3 is a schematic diagram illustrating a two-dimensional joint extraction network in the real-time detection method of two-hand gestures according to the present invention;
FIG. 4 is a schematic diagram illustrating a three-dimensional joint extraction network in a real-time detection method for two-hand gestures according to the present invention;
description of reference numerals: 1. a first convolution layer; 2. a second convolution layer; 3. a transposed convolution layer; 4. a network of the Hourglass structure; 5. a third convolution layer; 6. a first fully-connected layer; 7. a first dual linear module; 8. a second dual linear module; 9. a second fully-connected layer.
Detailed Description
The invention will be described in detail with reference to the drawings and specific embodiments, which are illustrative of the invention and are not to be construed as limiting the invention.
As shown in fig. 1 to 4, a real-time detection method for a two-hand gesture, which is based on a monocular camera, specifically includes the following steps:
Step S1, a single-frame image of both hands is captured by the monocular camera and input into an image segmentation network for segmentation, obtaining segmentation results of three categories including a left hand, a right hand and a background;
S2, a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points are extracted according to the segmentation results;
S3, the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points are calculated according to the left-hand heat map comprising the positions of the left-hand 2d joint points and the right-hand heat map comprising the positions of the right-hand 2d joint points;
and S4, the positions of the left-hand 2d joint points and the positions of the left-hand 3d joint points are fitted with a left-hand skeleton model, and the positions of the right-hand 2d joint points and the positions of the right-hand 3d joint points are fitted with a right-hand skeleton model, so as to obtain the parameters of the left-hand skeleton model and the right-hand skeleton model and thereby the posture of both hands.
In this embodiment, by reconstructing the two-hand posture from the 2d joint point positions and the 3d joint point positions, the skeleton models of both hands can be reconstructed, and even complexly interacting two-hand postures can be clearly reconstructed, which solves the problem in the prior art that complexly interacting two-hand postures cannot be detected. At the same time, fitting both the 2d joint point positions and the 3d joint point positions reduces the computational difficulty of reconstructing the two-hand skeleton models and increases the reconstruction speed, thereby ensuring real-time detection of the two-hand posture and solving the prior-art problem that real-time performance is difficult to achieve.
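To make the flow of steps S1 to S4 concrete, a minimal per-frame sketch follows; the function names, array shapes and the helper callables (seg_net, joint2d_net, joint3d_net, fit_skeletons) are illustrative assumptions rather than the patent's actual implementation.

```python
# Illustrative per-frame pipeline for steps S1-S4 (assumed names and shapes).
import numpy as np

def detect_two_hand_pose(frame, seg_net, joint2d_net, joint3d_net, fit_skeletons):
    """frame: H x W x 3 image from the monocular camera; the four callables are assumed
    to wrap the segmentation network, the 2d joint network, the 3d joint network and
    the skeleton fitting of steps S1-S4."""
    # S1: segment the frame into background (0), left hand (1) and right hand (2).
    seg = seg_net(frame)                                         # H x W label map

    # S2: stack image and segmentation, then predict 42 heat maps (21 per hand).
    stacked = np.concatenate([frame, seg[..., None]], axis=-1)   # H x W x 4
    heatmaps = joint2d_net(stacked)                              # 42 x H x W

    # 2d joint positions and confidences from the heat-map maxima.
    joints_2d = np.array([np.unravel_index(h.argmax(), h.shape) for h in heatmaps])
    conf_2d = np.array([h.max() for h in heatmaps])

    # S3: lift 2d joints plus confidences to 3d joint positions.
    joints_3d = joint3d_net(np.concatenate([joints_2d.ravel(), conf_2d]))   # 42 x 3

    # S4: fit the left-hand and right-hand skeleton models to the 2d/3d joints.
    theta_left, theta_right = fit_skeletons(joints_2d, conf_2d, joints_3d)
    return theta_left, theta_right
```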
It should be noted that, in the two-hand skeleton model, each hand comprises 21 2d joint points and 21 3d joint points, where the joint point at the wrist serves as the root joint point and each finger has four joint points. The skeleton of each hand has 26 degrees of freedom: 6 degrees of freedom at the wrist root joint point and 4 degrees of freedom in each finger.
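As an illustration of this parameterization, a small sketch follows; the class and field names are assumptions, while the 6 + 5 × 4 = 26 degrees of freedom and the 21 joint points per hand follow the description above.

```python
# Sketch of the per-hand skeleton parameterization described above (names are assumptions).
from dataclasses import dataclass, field
import numpy as np

NUM_JOINTS_PER_HAND = 21   # each hand has 21 2d and 21 3d joint points

@dataclass
class HandSkeletonParams:
    # Root joint point at the wrist: 6 degrees of freedom (translation + rotation).
    root_translation: np.ndarray = field(default_factory=lambda: np.zeros(3))
    root_rotation: np.ndarray = field(default_factory=lambda: np.zeros(3))  # e.g. axis-angle
    # Five fingers with 4 degrees of freedom each: 20 joint angles.
    finger_angles: np.ndarray = field(default_factory=lambda: np.zeros(20))

    @property
    def dof(self) -> int:
        # 3 + 3 + 20 = 26 degrees of freedom per hand
        return self.root_translation.size + self.root_rotation.size + self.finger_angles.size
```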
Preferably, the step S1 includes the steps of:
s11, extracting image features according to the input double-hand single-frame image;
s12, performing up-sampling operation on the image features to obtain a probability map comprising three categories of a left hand, a right hand and a background;
and S13, obtaining segmentation results of three categories including the left hand, the right hand and the background according to the probability graph including the left hand, the right hand and the background.
In this embodiment, the captured single-frame image is input into the image segmentation network to obtain a segmentation image comprising three categories: left hand, right hand and background. Specifically, the image segmentation network first extracts image features through downsampling and then restores them to the original resolution through upsampling; during upsampling, the features of the same resolution from the downsampling stage are added as input to the next upsampling step, which ensures that features of the original image are not lost.
Preferably, the image segmentation network comprises a first convolution layer 1, a second convolution layer 2 and a transposed convolution layer 3.
In this embodiment, the first convolution layer 1 is an encoder, the transposed convolution layer 3 is a decoder, and the resolution of the image is reduced by the encoder and restored by the decoder.
Preferably, the step S11 includes the steps of:
step S111, inputting the two-hand single-frame image into a first convolution layer for down-sampling processing;
in step S112, the downsampled image is input to the second convolution layer and image feature extraction is performed.
In this embodiment, the first convolution layer 1 comprises five convolution layers with kernel size 3 and stride 2, each of which reduces the resolution of its input to half; after five successive reductions the resolution is one thirty-second of the original image. The second convolution layer 2 has kernel size 3 and stride 1 and extracts the image features. The transposed convolution layer 3 comprises five convolution layers with kernel size 3 and stride 2, each of which increases the resolution of the input features to twice the original.
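A minimal PyTorch sketch of such an encoder-decoder is given below, assuming the layer counts described above; the channel widths, the activation functions and the kernel size of 4 in the transposed convolutions (chosen here for an exact ×2 upsampling) are assumptions.

```python
import torch
import torch.nn as nn

class HandSegNet(nn.Module):
    """Sketch of the encoder-decoder segmentation network: five stride-2 convolutions
    reduce the resolution to 1/32, a stride-1 convolution extracts features, and five
    stride-2 transposed convolutions restore the resolution, adding the same-resolution
    encoder features (skip connections) at each step. Assumes H and W divisible by 32."""

    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        chs = [32, 64, 128, 256, 512]                    # assumed channel widths
        self.down = nn.ModuleList()
        prev = in_ch
        for c in chs:                                    # kernel 3, stride 2: halves resolution
            self.down.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.feat = nn.Sequential(                       # kernel 3, stride 1: feature extraction
            nn.Conv2d(prev, prev, 3, stride=1, padding=1), nn.ReLU(inplace=True))
        up_chs = [256, 128, 64, 32, 32]
        self.up = nn.ModuleList()
        for c in up_chs:                                 # each step doubles the resolution
            self.up.append(nn.Sequential(
                nn.ConvTranspose2d(prev, c, 4, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.head = nn.Conv2d(prev, num_classes, 1)      # per-pixel class scores (bg/left/right)

    def forward(self, x):
        skips = []
        for d in self.down:
            x = d(x)
            skips.append(x)                              # features at 1/2 ... 1/32 resolution
        x = self.feat(x)
        # Add same-resolution encoder features during upsampling (last step has no skip).
        for u, skip in zip(self.up, [skips[3], skips[2], skips[1], skips[0], None]):
            x = u(x)
            if skip is not None:
                x = x + skip
        return self.head(x)   # a softmax over the class channel gives the H x W x 3 probability map
```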
Preferably, the step S2 includes the steps of:
s21, overlapping segmentation results of three categories including a left hand, a right hand and a background with an original single-frame image, and inputting the overlapped segmentation results into a two-dimensional joint point extraction network for down-sampling processing to obtain posture characteristics;
and S22, performing up-sampling processing on the posture characteristics to obtain a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points.
In this embodiment, the original single-frame image and the segmentation result are superimposed, and the superimposed result is input into the two-dimensional joint point extraction network. The network first downsamples to extract posture features and then upsamples to obtain 42 probability maps of size H × W, where H is the height and W is the width of the original image. Each probability map represents the position of one joint point: the position of the maximum value in a probability map is the position of the corresponding two-dimensional joint point, so the corresponding 42 joint points can be extracted, 21 on the left hand and 21 on the right hand.
Preferably, the two-dimensional joint point extraction network includes a network 4 of a Hourglass structure and a third convolution layer 5.
Preferably, the step S3 includes the steps of:
step S31, extracting the confidence coefficient of the left-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point according to the left-hand heat map and the right-hand heat map;
and S32, inputting the positions and confidences of the left-hand 2d joint points and the positions and confidences of the right-hand 2d joint points into a three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points.
In this embodiment, the position of the point with the largest value in each heat map is the position of the 2d joint point, and that value is the confidence of the 2d joint point prediction; therefore the left-hand 2d joint point positions with their confidences and the right-hand 2d joint point positions with their confidences can all be extracted from the heat maps.
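A short sketch of reading the 2d joint point positions and confidences out of the heat maps, as just described; the array layout (42 maps, the first 21 assumed to be the left hand) is an assumption consistent with the description.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """heatmaps: array of shape (42, H, W). Returns per-hand (positions, confidences),
    where positions are (row, col) locations of the maxima and confidences are in [0, 1]."""
    num_joints = heatmaps.shape[0]
    positions = np.zeros((num_joints, 2), dtype=np.int64)
    confidences = np.zeros(num_joints)
    for i, hm in enumerate(heatmaps):
        idx = np.argmax(hm)                            # location of the maximum value
        positions[i] = np.unravel_index(idx, hm.shape)
        confidences[i] = hm[tuple(positions[i])]       # the maximum value is the confidence
    left = (positions[:21], confidences[:21])          # assumed ordering: left hand first
    right = (positions[21:], confidences[21:])
    return left, right
```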
Preferably, the three-dimensional joint point extraction network comprises a first fully-connected layer 6, a dual linear module and a second fully-connected layer 9.
Preferably, the dual linear module includes a first dual linear module 7 and a second dual linear module 8, and the first dual linear module 7 and the second dual linear module 8 respectively include two fully connected layers.
Preferably, the fitting in step S4 is a fitting by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.
The embodiment is as follows:
step S1, shooting single-frame images of both hands by a monocular camera, and inputting the shot single-frame images into an image segmentation network to obtain three types of segmentation images, wherein the three types of segmentation images are respectively as follows: left hand, right hand, and background; specifically, the image segmentation network firstly extracts image features through downsampling, then restores the image features to original pixels through upsampling, and adds the features which are the same as the pixels during the downsampling during the upsampling to be used as the input of the next upsampling, so that the features in the original image can be ensured not to be lost, the output of the network is a probability graph of H W3, wherein H is the height of the original image, W is the width of the original image, in the result of H W, values of three channels corresponding to each point are used as the probabilities of three categories, and the result of H W1 is extracted from the probability graph, in the result, the value of a background part is 0, the value of a left hand part is 1, and the value of a right hand part is 2; it should be noted that, the segmentation result corresponds to the original image one to one, the position where the point with the median value of 1 in the segmentation result is located corresponds to the original image, that is, the pixel point where the left hand is located, and the point with the median value of 2 in the segmentation result corresponds to the pixel point where the right hand is located in the original image; when training the image segmentation network, calculating the cross entropy of the predicted value and the true value by using the following loss function:
L_{seg} = -\sum_{i=1}^{M} S_i \log \hat{S}_i

where M denotes the number of categories (3 in the present invention), and S_i and Ŝ_i denote the true value and the predicted value of the i-th class segmentation result, respectively.
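A small sketch of extracting the H × W × 1 label map from the H × W × 3 probability map and of computing this cross-entropy loss; the PyTorch formulation and tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F

def probs_to_labels(prob_map: torch.Tensor) -> torch.Tensor:
    """prob_map: (H, W, 3) class probabilities. Returns an (H, W) label map with
    0 = background, 1 = left hand, 2 = right hand."""
    return prob_map.argmax(dim=-1)

def segmentation_loss(pred_probs: torch.Tensor, target_labels: torch.Tensor) -> torch.Tensor:
    """Cross entropy -sum_i S_i * log(S_hat_i), averaged over pixels.
    pred_probs: (H, W, 3) predicted probabilities; target_labels: (H, W) integer labels."""
    log_p = pred_probs.clamp_min(1e-8).log()                  # avoid log(0)
    return F.nll_loss(log_p.permute(2, 0, 1).unsqueeze(0),    # (1, 3, H, W)
                      target_labels.unsqueeze(0))             # (1, H, W)
```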
Step S2: the original single-frame image and the segmentation result are superimposed to obtain an H × W × 4 feature, which is input into the two-dimensional joint point extraction network. The network first downsamples to extract features and then upsamples to obtain 42 probability maps of size H × W, where each probability map represents the position of one joint point; the position of the maximum value in a probability map is the position of the corresponding two-dimensional joint point, so the corresponding 42 joint points can be extracted, 21 on the left hand and 21 on the right hand.
Step S3: the position of each joint point is represented by a heat map; the position of the maximum value extracted from a heat map is the predicted position of the 2d joint point, and the maximum value c_i ∈ [0, 1] in the i-th heat map is the confidence of the prediction for the i-th joint point. A batch normalization operation and a sigmoid activation operation are performed after each layer. The following loss function is adopted for the training of this step:

L_{2D} = \sum_{i=1}^{N} \| u_i - \hat{u}_i \|^2

where N is the number of 2d joint points (42 in the present invention), and u_i and û_i denote the true value and the predicted value of the i-th key point, respectively;
The positions and confidences of the left-hand 2d joint points are combined with the positions and confidences of the right-hand 2d joint points, and the combined result is input into the three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the right-hand 3d joint points. Specifically, the input vector is first expanded to 1024 dimensions by a fully connected layer, then passed through two dual linear modules, and finally converted to 42 × 3 by a fully connected layer, giving the global positions of the 42 left-hand and right-hand joint points;
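A minimal PyTorch sketch of such a lifting network follows; the input layout (42 × (x, y, confidence) = 126 values), the residual connections inside the dual linear modules, and the placement of batch normalization and sigmoid after each fully connected layer are assumptions based on the description.

```python
import torch
import torch.nn as nn

class DualLinearModule(nn.Module):
    """Two fully connected layers; batch normalization and a sigmoid activation after
    each layer follow the description, the residual connection is an assumption."""
    def __init__(self, dim=1024):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.Sigmoid(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return x + self.block(x)

class Joint3DNet(nn.Module):
    """2d joint positions + confidences (assumed 42 x (x, y, conf) = 126 input values)
    mapped to 42 x 3 global 3d joint positions."""
    def __init__(self, in_dim=126, hidden=1024, num_joints=42):
        super().__init__()
        self.fc_in = nn.Linear(in_dim, hidden)           # expand the input vector to 1024 dims
        self.dual1 = DualLinearModule(hidden)            # first dual linear module
        self.dual2 = DualLinearModule(hidden)            # second dual linear module
        self.fc_out = nn.Linear(hidden, num_joints * 3)  # convert to 42 x 3
        self.num_joints = num_joints

    def forward(self, x):
        x = self.fc_in(x)
        x = self.dual2(self.dual1(x))
        return self.fc_out(x).view(-1, self.num_joints, 3)
```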
It should be noted that the following loss function is used for training:

L_{3D} = \sum_{i=1}^{N} \| J_i - \hat{J}_i \|^2

where J_i is the true value of the joint point position, Ĵ_i is the predicted value of the joint point position, and N is the number of joint points.
Step S4: the predicted 2d/3d joint points are fitted with a kinematic skeleton model. The skeleton model of each hand consists of 26 degrees of freedom: t ∈ R^3 and R ∈ SO(3) represent the global position and the rotation angle of the root joint point, respectively, and θ ∈ R^20 represents the finger joint angles. Taking Θ = {t, R, θ} as the parameters of a skeleton model, the transformation M(Θ) ∈ R^(21×3) gives the global positions of the hand joint points. The parameters of the left-hand and right-hand skeletons are denoted Θ_L and Θ_R, respectively, and Θ_H = {Θ_L, Θ_R} denotes the skeleton parameters of both hands. The skeleton model is fitted to the 3d joint points by minimizing the following, where J_i denotes the global position of the i-th 3d joint point:

E_{3D} = \sum_{i} \| M(\Theta_H)_i - J_i \|^2
In addition, the 2d joint points are used as an additional constraint so that the predicted result better fits the features of the hand in the original image. The skeleton is fitted to the 2d joint points by minimizing the following, where u_i denotes the position of the i-th 2d joint point and π projects a 3d joint point onto the 2d plane:

E_{2D} = \sum_{i} \| \pi(M(\Theta_H)_i) - u_i \|^2
In order to keep the posture of the hand skeleton model normal, it is necessary to ensure that the hand joints do not bend at excessive angles, so limits need to be placed on the joint angles. Here only the parameters predicted from the first frame are constrained. Letting \bar{\theta}_i and \underline{\theta}_i be the upper limit and the lower limit of the i-th joint angle, respectively, the joint angles are supervised by the following formula:

E_{a} = \sum_{i} \left( \max(\theta_i - \bar{\theta}_i, 0) + \max(\underline{\theta}_i - \theta_i, 0) \right)^2
In order to avoid excessive changes in the reconstructed hand posture between adjacent frames, the rate of change of the parameters predicted for two adjacent frames needs to be constrained, as shown in the following formula, where Θ_H^t denotes the parameters predicted for frame t:

E_{t} = \| \Theta_H^{t} - \Theta_H^{t-1} \|^2
The skeleton fitting process is constrained by the above four formulas, and Θ_H is obtained by fitting through minimization of the following energy equation, where ω is the weight of each term; when predicting the parameters of the first frame ω_3 is not 0, and in the subsequent predictions ω_3 is 0:

E = \omega_1 E_{3D} + \omega_2 E_{2D} + \omega_3 E_{a} + \omega_4 E_{t}
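As an illustration of the fitting in step S4, the sketch below evaluates the four constraint terms and the total energy for one set of skeleton parameters; the forward-kinematics and projection callables, the parameter layout and the weights are assumptions supplied by the caller.

```python
import numpy as np

def fitting_energy(theta, prev_theta, joints_3d, joints_2d, forward_kin, project,
                   angle_limits, weights, first_frame):
    """Total fitting energy E = w1*E3D + w2*E2D + w3*Ea + w4*Et for one hand.
    theta: skeleton parameters; forward_kin(theta) -> (21, 3) joint positions;
    project(p3d) -> (21, 2) image-plane positions. All callables are assumptions."""
    w1, w2, w3, w4 = weights
    p3d = forward_kin(theta)

    e_3d = np.sum((p3d - joints_3d) ** 2)                 # 3d joint constraint
    e_2d = np.sum((project(p3d) - joints_2d) ** 2)        # 2d joint (projection) constraint

    lo, hi = angle_limits                                 # lower / upper joint-angle limits
    angles = theta[6:]                                    # 20 finger joint angles (assumed layout)
    e_angle = np.sum(np.maximum(lo - angles, 0.0) ** 2 + np.maximum(angles - hi, 0.0) ** 2)

    e_time = np.sum((theta - prev_theta) ** 2)            # temporal smoothness between frames

    w3 = w3 if first_frame else 0.0                       # angle term active only on the first frame
    return w1 * e_3d + w2 * e_2d + w3 * e_angle + w4 * e_time
```

In practice such an energy would be minimized per frame with a gradient-based optimizer, initialized from the parameters of the previous frame.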
during training, the left-hand segmentation, the 2d joint point prediction and the 3d joint point prediction tasks are pre-trained respectively, and then the prediction of the 2d/3d joint points is trained end to end.
The technical solutions provided by the embodiments of the present invention are described in detail above, and the principles and embodiments of the present invention are explained herein by using specific examples, and the descriptions of the embodiments are only used to help understanding the principles of the embodiments of the present invention; meanwhile, for a person skilled in the art, according to the embodiments of the present invention, the specific implementation manners and the application ranges may be changed, and in conclusion, the content of the present specification should not be construed as limiting the invention.

Claims (10)

1. A real-time detection method of bimanual gestures, said method based on a monocular camera, characterized in that: the method specifically comprises the following steps:
the method comprises the following steps that S1, a monocular camera captures a double-hand single-frame image, the single-frame image is input into an image segmentation network for segmentation, and segmentation results of three categories including a left hand, a right hand and a background are segmented;
s2, extracting a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points according to the segmentation result;
s3, calculating the position of a left-hand 3d joint point and the position of a right-hand 3d joint point according to a left-hand heat map comprising the position of the left-hand 2d joint point and a right-hand heat map comprising the position of the right-hand 2d joint point;
s4, fitting the positions of the left-hand 2d joint points and the positions of the left-hand 3d joint points with a left-hand skeleton model, and fitting the positions of the right-hand 2d joint points and the positions of the right-hand 3d joint points with a right-hand skeleton model to obtain parameters of the left-hand skeleton model and the right-hand skeleton model so as to obtain the postures of the two hands;
wherein, step S4 specifically includes:
fitting the predicted 2d/3d joint points with a kinematic skeleton model; the skeleton model of each hand includes 26 degrees of freedom, where t ∈ R^3 and R ∈ SO(3) represent the global position and the rotation angle of the root joint point, respectively, and θ ∈ R^20 represents the finger joint angles; taking Θ = {t, R, θ} as the parameters of a skeleton model, the transformation M(Θ) ∈ R^(21×3) gives the global positions of the hand joint points; the parameters of the left-hand and right-hand skeletons are denoted Θ_L and Θ_R, and Θ_H = {Θ_L, Θ_R} denotes the skeleton parameters of both hands; the skeleton model is fitted to the 3d joint points by minimizing the following, where J_i denotes the global position of the i-th 3d joint point:

E_{3D} = \sum_{i} \| M(\Theta_H)_i - J_i \|^2
in addition, the 2d joint points are used as an additional constraint so that the predicted result better fits the features of the hand in the original image; the skeleton is fitted to the 2d joint points by minimizing the following, where u_i denotes the position of the i-th 2d joint point and π projects a 3d joint point onto the 2d plane:

E_{2D} = \sum_{i} \| \pi(M(\Theta_H)_i) - u_i \|^2
in order to keep the posture of the hand skeleton model normal, the hand joints must not bend at excessive angles, so limits need to be placed on the joint angles; only the parameters predicted from the first frame are constrained here; letting \bar{\theta}_i and \underline{\theta}_i be the upper limit and the lower limit of the i-th joint angle, respectively, the joint angles are supervised by the following formula:

E_{a} = \sum_{i} \left( \max(\theta_i - \bar{\theta}_i, 0) + \max(\underline{\theta}_i - \theta_i, 0) \right)^2
in order to avoid excessive changes in the reconstructed hand posture between adjacent frames, the rate of change of the parameters predicted for two adjacent frames needs to be constrained, as shown in the following formula, where Θ_H^t denotes the parameters predicted for frame t:

E_{t} = \| \Theta_H^{t} - \Theta_H^{t-1} \|^2
the skeleton fitting process is constrained by the above four formulas, and Θ_H is obtained by fitting through minimization of the following energy equation, where ω is the weight of each term; when predicting the parameters of the first frame ω_3 is not 0, and in the subsequent predictions ω_3 is 0:

E = \omega_1 E_{3D} + \omega_2 E_{2D} + \omega_3 E_{a} + \omega_4 E_{t}
during training, the left-hand segmentation, the 2d joint point prediction and the 3d joint point prediction tasks are pre-trained respectively, and then the prediction of the 2d/3d joint points is trained end to end.
2. A method of real-time detection of bimanual gestures as claimed in claim 1, further comprising: the step S1 includes the steps of:
s11, extracting image features according to the input double-hand single-frame image;
s12, performing up-sampling operation on the image features to obtain probability graphs of three categories including a left hand, a right hand and a background;
and S13, obtaining segmentation results of three categories including the left hand, the right hand and the background according to the probability graph including the left hand, the right hand and the background.
3. The method of claim 2, wherein the real-time detection of bimanual gestures comprises: the image segmentation network includes a first convolution layer, a second convolution layer, and a transposed convolution layer.
4. A method of real-time detection of bimanual gestures as claimed in claim 3, further comprising: the step S11 includes the steps of:
step S111, inputting the two-hand single-frame image into the first convolution layer for down-sampling processing;
in step S112, the downsampled image is input to the second convolution layer and image feature extraction is performed.
5. A method for real-time detection of bimanual gestures according to claim 1 or 4, further comprising: the step S2 includes the steps of:
s21, overlapping segmentation results of three categories including a left hand, a right hand and a background with an original single-frame image, and inputting the overlapped segmentation results into a two-dimensional joint point extraction network for down-sampling processing to obtain posture characteristics;
and S22, performing up-sampling processing on the posture characteristics to obtain a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points.
6. A method of real-time detection of bimanual gestures according to claim 5, further comprising: the two-dimensional joint point extraction network comprises a network of a Hourglass structure and a third convolution layer.
7. The method of claim 6, wherein the real-time detection of bimanual gestures comprises: the step S3 includes the steps of:
step S31, extracting the confidence coefficient of the left-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point according to the left-hand heat map and the right-hand heat map;
and step S32, inputting the positions and confidences of the left-hand 2d joint points and the positions and confidences of the right-hand 2d joint points into a three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points.
8. The method of claim 7, wherein the real-time detection of bimanual gestures comprises: the three-dimensional joint point extraction network comprises a first fully-connected layer, a dual linear module and a second fully-connected layer.
9. A method of real-time detection of bimanual gestures as claimed in claim 8, further comprising: the dual linear module comprises a first dual linear module and a second dual linear module, and the first dual linear module and the second dual linear module respectively comprise two full connection layers.
10. A method of real-time detection of bimanual gestures according to claim 9, further comprising: the fitting in step S4 is a fitting by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.
CN202010301111.1A 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands Active CN111539288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010301111.1A CN111539288B (en) 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010301111.1A CN111539288B (en) 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands

Publications (2)

Publication Number Publication Date
CN111539288A CN111539288A (en) 2020-08-14
CN111539288B true CN111539288B (en) 2023-04-07

Family

ID=71976803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010301111.1A Active CN111539288B (en) 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands

Country Status (1)

Country Link
CN (1) CN111539288B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233222A (en) * 2020-09-29 2021-01-15 深圳市易尚展示股份有限公司 Human body parametric three-dimensional model deformation method based on neural network joint point estimation
CN113158774B (en) * 2021-03-05 2023-12-29 北京华捷艾米科技有限公司 Hand segmentation method, device, storage medium and equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992858A (en) * 2017-12-25 2018-05-04 深圳市唯特视科技有限公司 A kind of real-time three-dimensional gesture method of estimation based on single RGB frame
CN109635630A (en) * 2018-10-23 2019-04-16 百度在线网络技术(北京)有限公司 Hand joint point detecting method, device and storage medium
CN109800676A (en) * 2018-12-29 2019-05-24 上海易维视科技股份有限公司 Gesture identification method and system based on depth information
CN110287844A (en) * 2019-06-19 2019-09-27 北京工业大学 Traffic police's gesture identification method based on convolution posture machine and long memory network in short-term
CN110741385A (en) * 2019-06-26 2020-01-31 Oppo广东移动通信有限公司 Gesture recognition method and device and location tracking method and device
CN110837778A (en) * 2019-10-12 2020-02-25 南京信息工程大学 Traffic police command gesture recognition method based on skeleton joint point sequence

Also Published As

Publication number Publication date
CN111539288A (en) 2020-08-14

Similar Documents

Publication Publication Date Title
CN111311729B (en) Natural scene three-dimensional human body posture reconstruction method based on bidirectional projection network
CN108537754B (en) Face image restoration system based on deformation guide picture
CN111626159B (en) Human body key point detection method based on attention residual error module and branch fusion
CN112330729A (en) Image depth prediction method and device, terminal device and readable storage medium
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN113283525B (en) Image matching method based on deep learning
CN111539288B (en) Real-time detection method for gestures of both hands
CN112837215B (en) Image shape transformation method based on generation countermeasure network
CN113628348A (en) Method and equipment for determining viewpoint path in three-dimensional scene
CN113554039B (en) Method and system for generating optical flow graph of dynamic image based on multi-attention machine system
CN113221726A (en) Hand posture estimation method and system based on visual and inertial information fusion
CN111709270B (en) Three-dimensional shape recovery and attitude estimation method and device based on depth image
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
US20220319055A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
WO2022208440A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
Lin et al. Efficient and high-quality monocular depth estimation via gated multi-scale network
Al Ismaeil et al. Real-time enhancement of dynamic depth videos with non-rigid deformations
CN112417991A (en) Double-attention face alignment method based on hourglass capsule network
CN112215140A (en) 3-dimensional signal processing method based on space-time countermeasure
CN116740290A (en) Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN116266336A (en) Video super-resolution reconstruction method, device, computing equipment and storage medium
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
Fang et al. Hand pose estimation on hybrid CNN-AE model
CN114240999A (en) Motion prediction method based on enhanced graph attention and time convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant