CN111539288B - Real-time detection method for gestures of both hands - Google Patents

Real-time detection method for gestures of both hands

Info

Publication number
CN111539288B
CN111539288B (application CN202010301111.1A)
Authority
CN
China
Prior art keywords
hand
joint
joint point
real
time detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010301111.1A
Other languages
Chinese (zh)
Other versions
CN111539288A (en)
Inventor
高成英
李文盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010301111.1A priority Critical patent/CN111539288B/en
Publication of CN111539288A publication Critical patent/CN111539288A/en
Application granted granted Critical
Publication of CN111539288B publication Critical patent/CN111539288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

The invention discloses a real-time detection method for two-hand posture. By reconstructing the two-hand posture from 2d joint point positions and 3d joint point positions, the method can reconstruct the skeleton models of both hands and can clearly reconstruct even two-hand postures with complex interaction, solving the problem in the prior art that complexly interacting two-hand postures cannot be detected. At the same time, fitting the 2d joint point positions and the 3d joint point positions reduces the computational difficulty of reconstructing the two-hand skeleton models and increases the reconstruction speed, thereby ensuring real-time detection of the two-hand posture and solving the problem that real-time performance is difficult to achieve in the prior art.

Description

Real-time detection method for gestures of both hands
Technical Field
The invention relates to the technical field of gesture detection, in particular to a real-time detection method for gestures of two hands.
Background
The hand plays a critical role in daily human life, and hand gestures carry a large amount of non-verbal communication information, so tracking and reconstruction of hand gestures have become increasingly important. Prediction of 3D hand pose is a long-standing research direction in computer vision and has many applications in virtual/augmented reality (VR/AR), human-computer interaction, human motion tracking and control, and related fields, all of which require real-time and accurate detection of hand gestures.
However, conventional methods for detecting hand posture have the following disadvantages: 1. they can only detect the two hands in simple postures and cannot detect two-hand postures with complex interaction; 2. reconstructing a mesh of the hand posture requires a large amount of computation and hardware resources, and real-time performance is difficult to achieve.
Disclosure of Invention
The invention aims to provide a real-time detection method for gestures of two hands, which solves the problems that the gestures of two hands with complex interaction cannot be detected and the real-time property is difficult to realize in the prior art.
The invention is realized by the following technical scheme:
a real-time detection method of two-hand posture is based on a monocular camera and specifically comprises the following steps:
Step S1, a single-frame image of both hands is captured by the monocular camera and input into an image segmentation network for segmentation, obtaining segmentation results comprising a left hand, a right hand and a background;
S2, a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points are extracted according to the segmentation results;
S3, the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points are calculated according to the left-hand heat map comprising the positions of the left-hand 2d joint points and the right-hand heat map comprising the positions of the right-hand 2d joint points;
and S4, the positions of the left-hand 2d joint points and the positions of the left-hand 3d joint points are fitted with a left-hand skeleton model, and the positions of the right-hand 2d joint points and the positions of the right-hand 3d joint points are fitted with a right-hand skeleton model, so as to obtain the parameters of the left-hand skeleton model and the right-hand skeleton model and thereby the posture of both hands.
As a further alternative of the real-time detection method of two-hand gestures, the step S1 includes the steps of:
s11, extracting image features according to the input double-hand single-frame image;
s12, performing up-sampling operation on the image features to obtain a probability map comprising three categories of a left hand, a right hand and a background;
and S13, obtaining segmentation results of three categories including the left hand, the right hand and the background according to the probability graph including the left hand, the right hand and the background.
As a further alternative to the real-time detection method of two-handed gestures, the image segmentation network comprises a first convolutional layer, a second convolutional layer, and a transposed convolutional layer.
As a further alternative to the real-time detection method of two-hand gestures, the step S11 includes the steps of:
step S111, inputting the two-hand single-frame image into a first convolution layer for down-sampling processing;
in step S112, the downsampled image is input to the second convolution layer and image feature extraction is performed.
As a further alternative to the real-time detection method of two-hand gestures, the step S2 includes the steps of:
s21, overlapping segmentation results of three categories including a left hand, a right hand and a background with an original single-frame image, and inputting the overlapped segmentation results into a two-dimensional joint point extraction network for down-sampling processing to obtain posture characteristics;
and S22, performing up-sampling processing on the posture characteristics to obtain a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points.
As a further alternative to the real-time detection method of two-hand gestures, the two-dimensional joint point extraction network includes a network of Hourglass structures and a third convolutional layer.
As a further alternative to the real-time detection method of two-hand gestures, the step S3 includes the steps of:
step S31, extracting the confidence coefficient of the left-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point according to the left-hand heat map and the right-hand heat map;
and S32, inputting the positions and confidences of the left-hand 2d joint points and the positions and confidences of the right-hand 2d joint points into a three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points.
As a further alternative to the real-time detection method of two-hand gestures, the three-dimensional joint point extraction network comprises a first fully-connected layer, a dual linear module, and a second fully-connected layer.
As a further alternative to the real-time detection method of two-hand gestures, the dual linear module comprises a first dual linear module and a second dual linear module, the first dual linear module and the second dual linear module each comprising two fully-connected layers.
As a further alternative of the real-time detection method of the two-hand posture, the fitting in step S4 is a fitting by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.
The invention has the beneficial effects that:
By using the method, the two-hand posture is reconstructed from the 2d joint point positions and the 3d joint point positions, so the skeleton models of both hands can be reconstructed and even complexly interacting two-hand postures can be clearly reconstructed, solving the problem in the prior art that complexly interacting two-hand postures cannot be detected. At the same time, fitting both the 2d joint point positions and the 3d joint point positions reduces the computational difficulty of reconstructing the two-hand skeleton models and increases the reconstruction speed, ensuring real-time detection of the two-hand posture and solving the prior-art problem that real-time performance is difficult to achieve.
Drawings
FIG. 1 is a schematic flow chart of a real-time detection method of two-hand gestures according to the present invention;
FIG. 2 is a schematic diagram illustrating the components of an image segmentation network in the real-time detection method for two-hand gestures according to the present invention;
FIG. 3 is a schematic diagram illustrating a two-dimensional joint extraction network in the real-time detection method of two-hand gestures according to the present invention;
FIG. 4 is a schematic diagram illustrating a three-dimensional joint extraction network in a real-time detection method for two-hand gestures according to the present invention;
description of reference numerals: 1. a first convolution layer; 2. a second convolution layer; 3. a transposed convolution layer; 4. a network of the Hourglass structure; 5. a third convolution layer; 6. a first fully-connected layer; 7. a first dual linear module; 8. a second dual linear module; 9. a second fully-connected layer.
Detailed Description
The invention will be described in detail with reference to the drawings and specific embodiments, which are illustrative of the invention and are not to be construed as limiting the invention.
As shown in fig. 1 to 4, a real-time detection method for a two-hand gesture, which is based on a monocular camera, specifically includes the following steps:
Step S1, a single-frame image of both hands is captured by the monocular camera and input into an image segmentation network for segmentation, obtaining segmentation results of three categories including a left hand, a right hand and a background;
S2, a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points are extracted according to the segmentation results;
S3, the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points are calculated according to the left-hand heat map comprising the positions of the left-hand 2d joint points and the right-hand heat map comprising the positions of the right-hand 2d joint points;
and S4, the positions of the left-hand 2d joint points and the positions of the left-hand 3d joint points are fitted with a left-hand skeleton model, and the positions of the right-hand 2d joint points and the positions of the right-hand 3d joint points are fitted with a right-hand skeleton model, so as to obtain the parameters of the left-hand skeleton model and the right-hand skeleton model and thereby the posture of both hands.
In this embodiment, by reconstructing the two-hand posture from the 2d joint point positions and the 3d joint point positions, the skeleton models of both hands can be reconstructed, and even complexly interacting two-hand postures can be clearly reconstructed, which solves the problem in the prior art that complexly interacting two-hand postures cannot be detected. At the same time, fitting both the 2d joint point positions and the 3d joint point positions reduces the computational difficulty of reconstructing the two-hand skeleton models and increases the reconstruction speed, thereby ensuring real-time detection of the two-hand posture and solving the prior-art problem that real-time performance is difficult to achieve.
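To make the flow of steps S1 to S4 concrete, a minimal per-frame sketch follows; the function names, array shapes and the helper callables (seg_net, joint2d_net, joint3d_net, fit_skeletons) are illustrative assumptions rather than the patent's actual implementation.

```python
# Illustrative per-frame pipeline for steps S1-S4 (assumed names and shapes).
import numpy as np

def detect_two_hand_pose(frame, seg_net, joint2d_net, joint3d_net, fit_skeletons):
    """frame: H x W x 3 image from the monocular camera; the four callables are assumed
    to wrap the segmentation network, the 2d joint network, the 3d joint network and
    the skeleton fitting of steps S1-S4."""
    # S1: segment the frame into background (0), left hand (1) and right hand (2).
    seg = seg_net(frame)                                         # H x W label map

    # S2: stack image and segmentation, then predict 42 heat maps (21 per hand).
    stacked = np.concatenate([frame, seg[..., None]], axis=-1)   # H x W x 4
    heatmaps = joint2d_net(stacked)                              # 42 x H x W

    # 2d joint positions and confidences from the heat-map maxima.
    joints_2d = np.array([np.unravel_index(h.argmax(), h.shape) for h in heatmaps])
    conf_2d = np.array([h.max() for h in heatmaps])

    # S3: lift 2d joints plus confidences to 3d joint positions.
    joints_3d = joint3d_net(np.concatenate([joints_2d.ravel(), conf_2d]))   # 42 x 3

    # S4: fit the left-hand and right-hand skeleton models to the 2d/3d joints.
    theta_left, theta_right = fit_skeletons(joints_2d, conf_2d, joints_3d)
    return theta_left, theta_right
```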
It should be noted that, in the two-hand skeleton model, each hand comprises 21 2d joint points and 21 3d joint points, where the joint point at the wrist serves as the root joint point and each finger has four joint points. The skeleton of each hand has 26 degrees of freedom: 6 degrees of freedom at the wrist root joint point and 4 degrees of freedom in each finger.
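As an illustration of this parameterization, a small sketch follows; the class and field names are assumptions, while the 6 + 5 × 4 = 26 degrees of freedom and the 21 joint points per hand follow the description above.

```python
# Sketch of the per-hand skeleton parameterization described above (names are assumptions).
from dataclasses import dataclass, field
import numpy as np

NUM_JOINTS_PER_HAND = 21   # each hand has 21 2d and 21 3d joint points

@dataclass
class HandSkeletonParams:
    # Root joint point at the wrist: 6 degrees of freedom (translation + rotation).
    root_translation: np.ndarray = field(default_factory=lambda: np.zeros(3))
    root_rotation: np.ndarray = field(default_factory=lambda: np.zeros(3))  # e.g. axis-angle
    # Five fingers with 4 degrees of freedom each: 20 joint angles.
    finger_angles: np.ndarray = field(default_factory=lambda: np.zeros(20))

    @property
    def dof(self) -> int:
        # 3 + 3 + 20 = 26 degrees of freedom per hand
        return self.root_translation.size + self.root_rotation.size + self.finger_angles.size
```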
Preferably, the step S1 includes the steps of:
s11, extracting image features according to the input double-hand single-frame image;
s12, performing up-sampling operation on the image features to obtain a probability map comprising three categories of a left hand, a right hand and a background;
and S13, obtaining segmentation results of three categories including the left hand, the right hand and the background according to the probability graph including the left hand, the right hand and the background.
In this embodiment, the captured single-frame image is input into the image segmentation network to obtain a segmentation image comprising three categories: left hand, right hand and background. Specifically, the image segmentation network first extracts image features through downsampling and then restores them to the original resolution through upsampling; during upsampling, the features of the same resolution from the downsampling stage are added as input to the next upsampling step, which ensures that features of the original image are not lost.
Preferably, the image segmentation network comprises a first convolution layer 1, a second convolution layer 2 and a transposed convolution layer 3.
In this embodiment, the first convolution layer 1 is an encoder, the transposed convolution layer 3 is a decoder, and the resolution of the image is reduced by the encoder and restored by the decoder.
Preferably, the step S11 includes the steps of:
step S111, inputting the two-hand single-frame image into a first convolution layer for down-sampling processing;
in step S112, the downsampled image is input to the second convolution layer and image feature extraction is performed.
In this embodiment, the first convolution layer 1 comprises five convolution layers with kernel size 3 and stride 2, each of which reduces the resolution of its input to half; after five successive reductions the resolution is one thirty-second of the original image. The second convolution layer 2 has kernel size 3 and stride 1 and extracts the image features. The transposed convolution layer 3 comprises five convolution layers with kernel size 3 and stride 2, each of which increases the resolution of the input features to twice the original.
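A minimal PyTorch sketch of such an encoder-decoder is given below, assuming the layer counts described above; the channel widths, the activation functions and the kernel size of 4 in the transposed convolutions (chosen here for an exact ×2 upsampling) are assumptions.

```python
import torch
import torch.nn as nn

class HandSegNet(nn.Module):
    """Sketch of the encoder-decoder segmentation network: five stride-2 convolutions
    reduce the resolution to 1/32, a stride-1 convolution extracts features, and five
    stride-2 transposed convolutions restore the resolution, adding the same-resolution
    encoder features (skip connections) at each step. Assumes H and W divisible by 32."""

    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        chs = [32, 64, 128, 256, 512]                    # assumed channel widths
        self.down = nn.ModuleList()
        prev = in_ch
        for c in chs:                                    # kernel 3, stride 2: halves resolution
            self.down.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.feat = nn.Sequential(                       # kernel 3, stride 1: feature extraction
            nn.Conv2d(prev, prev, 3, stride=1, padding=1), nn.ReLU(inplace=True))
        up_chs = [256, 128, 64, 32, 32]
        self.up = nn.ModuleList()
        for c in up_chs:                                 # each step doubles the resolution
            self.up.append(nn.Sequential(
                nn.ConvTranspose2d(prev, c, 4, stride=2, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.head = nn.Conv2d(prev, num_classes, 1)      # per-pixel class scores (bg/left/right)

    def forward(self, x):
        skips = []
        for d in self.down:
            x = d(x)
            skips.append(x)                              # features at 1/2 ... 1/32 resolution
        x = self.feat(x)
        # Add same-resolution encoder features during upsampling (last step has no skip).
        for u, skip in zip(self.up, [skips[3], skips[2], skips[1], skips[0], None]):
            x = u(x)
            if skip is not None:
                x = x + skip
        return self.head(x)   # a softmax over the class channel gives the H x W x 3 probability map
```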
Preferably, the step S2 includes the steps of:
s21, overlapping segmentation results of three categories including a left hand, a right hand and a background with an original single-frame image, and inputting the overlapped segmentation results into a two-dimensional joint point extraction network for down-sampling processing to obtain posture characteristics;
and S22, performing up-sampling processing on the posture characteristics to obtain a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points.
In this embodiment, the original single-frame image and the segmentation result are superimposed, and the superimposed result is input into the two-dimensional joint point extraction network. The network first downsamples to extract posture features and then upsamples to obtain 42 probability maps of size H × W, where H is the height and W is the width of the original image. Each probability map represents the position of one joint point: the position of the maximum value in a probability map is the position of the corresponding two-dimensional joint point, so the corresponding 42 joint points can be extracted, 21 on the left hand and 21 on the right hand.
Preferably, the two-dimensional joint point extraction network includes a network 4 of a Hourglass structure and a third convolution layer 5.
Preferably, the step S3 includes the steps of:
step S31, extracting the confidence coefficient of the left-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point according to the left-hand heat map and the right-hand heat map;
and S32, inputting the positions and confidences of the left-hand 2d joint points and the positions and confidences of the right-hand 2d joint points into a three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points.
In this embodiment, the position of the point with the largest value in each heat map is the position of the 2d joint point, and that value is the confidence of the 2d joint point prediction; therefore the left-hand 2d joint point positions with their confidences and the right-hand 2d joint point positions with their confidences can all be extracted from the heat maps.
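A short sketch of reading the 2d joint point positions and confidences out of the heat maps, as just described; the array layout (42 maps, the first 21 assumed to be the left hand) is an assumption consistent with the description.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """heatmaps: array of shape (42, H, W). Returns per-hand (positions, confidences),
    where positions are (row, col) locations of the maxima and confidences are in [0, 1]."""
    num_joints = heatmaps.shape[0]
    positions = np.zeros((num_joints, 2), dtype=np.int64)
    confidences = np.zeros(num_joints)
    for i, hm in enumerate(heatmaps):
        idx = np.argmax(hm)                            # location of the maximum value
        positions[i] = np.unravel_index(idx, hm.shape)
        confidences[i] = hm[tuple(positions[i])]       # the maximum value is the confidence
    left = (positions[:21], confidences[:21])          # assumed ordering: left hand first
    right = (positions[21:], confidences[21:])
    return left, right
```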
Preferably, the three-dimensional joint point extraction network comprises a first fully-connected layer 6, a dual linear module and a second fully-connected layer 9.
Preferably, the dual linear module includes a first dual linear module 7 and a second dual linear module 8, and the first dual linear module 7 and the second dual linear module 8 respectively include two fully connected layers.
Preferably, the fitting in step S4 is a fitting by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.
The embodiment is as follows:
step S1, shooting single-frame images of both hands by a monocular camera, and inputting the shot single-frame images into an image segmentation network to obtain three types of segmentation images, wherein the three types of segmentation images are respectively as follows: left hand, right hand, and background; specifically, the image segmentation network firstly extracts image features through downsampling, then restores the image features to original pixels through upsampling, and adds the features which are the same as the pixels during the downsampling during the upsampling to be used as the input of the next upsampling, so that the features in the original image can be ensured not to be lost, the output of the network is a probability graph of H W3, wherein H is the height of the original image, W is the width of the original image, in the result of H W, values of three channels corresponding to each point are used as the probabilities of three categories, and the result of H W1 is extracted from the probability graph, in the result, the value of a background part is 0, the value of a left hand part is 1, and the value of a right hand part is 2; it should be noted that, the segmentation result corresponds to the original image one to one, the position where the point with the median value of 1 in the segmentation result is located corresponds to the original image, that is, the pixel point where the left hand is located, and the point with the median value of 2 in the segmentation result corresponds to the pixel point where the right hand is located in the original image; when training the image segmentation network, calculating the cross entropy of the predicted value and the true value by using the following loss function:
L_{seg} = -\sum_{i=1}^{M} S_i \log \hat{S}_i

where M denotes the number of categories (3 in the present invention), and S_i and Ŝ_i denote the true value and the predicted value of the i-th class segmentation result, respectively.
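A small sketch of extracting the H × W × 1 label map from the H × W × 3 probability map and of computing this cross-entropy loss; the PyTorch formulation and tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F

def probs_to_labels(prob_map: torch.Tensor) -> torch.Tensor:
    """prob_map: (H, W, 3) class probabilities. Returns an (H, W) label map with
    0 = background, 1 = left hand, 2 = right hand."""
    return prob_map.argmax(dim=-1)

def segmentation_loss(pred_probs: torch.Tensor, target_labels: torch.Tensor) -> torch.Tensor:
    """Cross entropy -sum_i S_i * log(S_hat_i), averaged over pixels.
    pred_probs: (H, W, 3) predicted probabilities; target_labels: (H, W) integer labels."""
    log_p = pred_probs.clamp_min(1e-8).log()                  # avoid log(0)
    return F.nll_loss(log_p.permute(2, 0, 1).unsqueeze(0),    # (1, 3, H, W)
                      target_labels.unsqueeze(0))             # (1, H, W)
```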
Step S2: the original single-frame image and the segmentation result are superimposed to obtain an H × W × 4 feature, which is input into the two-dimensional joint point extraction network. The network first downsamples to extract features and then upsamples to obtain 42 probability maps of size H × W, where each probability map represents the position of one joint point; the position of the maximum value in a probability map is the position of the corresponding two-dimensional joint point, so the corresponding 42 joint points can be extracted, 21 on the left hand and 21 on the right hand.
Step S3: the position of each joint point is represented by a heat map; the position of the maximum value extracted from a heat map is the predicted position of the 2d joint point, and the maximum value c_i ∈ [0, 1] in the i-th heat map is the confidence of the prediction for the i-th joint point. A batch normalization operation and a sigmoid activation operation are performed after each layer. The following loss function is adopted for the training of this step:

L_{2D} = \sum_{i=1}^{N} \| u_i - \hat{u}_i \|^2

where N is the number of 2d joint points (42 in the present invention), and u_i and û_i denote the true value and the predicted value of the i-th key point, respectively;
The positions and confidences of the left-hand 2d joint points are combined with the positions and confidences of the right-hand 2d joint points, and the combined result is input into the three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the right-hand 3d joint points. Specifically, the input vector is first expanded to 1024 dimensions by a fully connected layer, then passed through two dual linear modules, and finally converted to 42 × 3 by a fully connected layer, giving the global positions of the 42 left-hand and right-hand joint points;
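A minimal PyTorch sketch of such a lifting network follows; the input layout (42 × (x, y, confidence) = 126 values), the residual connections inside the dual linear modules, and the placement of batch normalization and sigmoid after each fully connected layer are assumptions based on the description.

```python
import torch
import torch.nn as nn

class DualLinearModule(nn.Module):
    """Two fully connected layers; batch normalization and a sigmoid activation after
    each layer follow the description, the residual connection is an assumption."""
    def __init__(self, dim=1024):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.Sigmoid(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return x + self.block(x)

class Joint3DNet(nn.Module):
    """2d joint positions + confidences (assumed 42 x (x, y, conf) = 126 input values)
    mapped to 42 x 3 global 3d joint positions."""
    def __init__(self, in_dim=126, hidden=1024, num_joints=42):
        super().__init__()
        self.fc_in = nn.Linear(in_dim, hidden)           # expand the input vector to 1024 dims
        self.dual1 = DualLinearModule(hidden)            # first dual linear module
        self.dual2 = DualLinearModule(hidden)            # second dual linear module
        self.fc_out = nn.Linear(hidden, num_joints * 3)  # convert to 42 x 3
        self.num_joints = num_joints

    def forward(self, x):
        x = self.fc_in(x)
        x = self.dual2(self.dual1(x))
        return self.fc_out(x).view(-1, self.num_joints, 3)
```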
It should be noted that the following loss function is used for training:

L_{3D} = \sum_{i=1}^{N} \| J_i - \hat{J}_i \|^2

where J_i is the true value of the joint point position, Ĵ_i is the predicted value of the joint point position, and N is the number of joint points.
Step S4: the predicted 2d/3d joint points are fitted with a kinematic skeleton model. The skeleton model of each hand consists of 26 degrees of freedom: t ∈ R^3 and R ∈ SO(3) represent the global position and the rotation angle of the root joint point, respectively, and θ ∈ R^20 represents the finger joint angles. Taking Θ = {t, R, θ} as the parameters of a skeleton model, the transformation M(Θ) ∈ R^(21×3) gives the global positions of the hand joint points. The parameters of the left-hand and right-hand skeletons are denoted Θ_L and Θ_R, respectively, and Θ_H = {Θ_L, Θ_R} denotes the skeleton parameters of both hands. The skeleton model is fitted to the 3d joint points by minimizing the following, where J_i denotes the global position of the i-th 3d joint point:

E_{3D} = \sum_{i} \| M(\Theta_H)_i - J_i \|^2
In addition, the 2d joint points are used as an additional constraint so that the predicted result better fits the features of the hand in the original image. The skeleton is fitted to the 2d joint points by minimizing the following, where u_i denotes the position of the i-th 2d joint point and π projects a 3d joint point onto the 2d plane:

E_{2D} = \sum_{i} \| \pi(M(\Theta_H)_i) - u_i \|^2
In order to keep the posture of the hand skeleton model normal, it is necessary to ensure that the hand joints do not bend at excessive angles, so limits need to be placed on the joint angles. Here only the parameters predicted from the first frame are constrained. Letting \bar{\theta}_i and \underline{\theta}_i be the upper limit and the lower limit of the i-th joint angle, respectively, the joint angles are supervised by the following formula:

E_{a} = \sum_{i} \left( \max(\theta_i - \bar{\theta}_i, 0) + \max(\underline{\theta}_i - \theta_i, 0) \right)^2
In order to avoid excessive changes in the reconstructed hand posture between adjacent frames, the rate of change of the parameters predicted for two adjacent frames needs to be constrained, as shown in the following formula, where Θ_H^t denotes the parameters predicted for frame t:

E_{t} = \| \Theta_H^{t} - \Theta_H^{t-1} \|^2
The skeleton fitting process is constrained by the above four formulas, and Θ_H is obtained by fitting through minimization of the following energy equation, where ω is the weight of each term; when predicting the parameters of the first frame ω_3 is not 0, and in the subsequent predictions ω_3 is 0:

E = \omega_1 E_{3D} + \omega_2 E_{2D} + \omega_3 E_{a} + \omega_4 E_{t}
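As an illustration of the fitting in step S4, the sketch below evaluates the four constraint terms and the total energy for one set of skeleton parameters; the forward-kinematics and projection callables, the parameter layout and the weights are assumptions supplied by the caller.

```python
import numpy as np

def fitting_energy(theta, prev_theta, joints_3d, joints_2d, forward_kin, project,
                   angle_limits, weights, first_frame):
    """Total fitting energy E = w1*E3D + w2*E2D + w3*Ea + w4*Et for one hand.
    theta: skeleton parameters; forward_kin(theta) -> (21, 3) joint positions;
    project(p3d) -> (21, 2) image-plane positions. All callables are assumptions."""
    w1, w2, w3, w4 = weights
    p3d = forward_kin(theta)

    e_3d = np.sum((p3d - joints_3d) ** 2)                 # 3d joint constraint
    e_2d = np.sum((project(p3d) - joints_2d) ** 2)        # 2d joint (projection) constraint

    lo, hi = angle_limits                                 # lower / upper joint-angle limits
    angles = theta[6:]                                    # 20 finger joint angles (assumed layout)
    e_angle = np.sum(np.maximum(lo - angles, 0.0) ** 2 + np.maximum(angles - hi, 0.0) ** 2)

    e_time = np.sum((theta - prev_theta) ** 2)            # temporal smoothness between frames

    w3 = w3 if first_frame else 0.0                       # angle term active only on the first frame
    return w1 * e_3d + w2 * e_2d + w3 * e_angle + w4 * e_time
```

In practice such an energy would be minimized per frame with a gradient-based optimizer, initialized from the parameters of the previous frame.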
during training, the left-hand segmentation, the 2d joint point prediction and the 3d joint point prediction tasks are pre-trained respectively, and then the prediction of the 2d/3d joint points is trained end to end.
The technical solutions provided by the embodiments of the present invention are described in detail above, and the principles and embodiments of the present invention are explained herein by using specific examples, and the descriptions of the embodiments are only used to help understanding the principles of the embodiments of the present invention; meanwhile, for a person skilled in the art, according to the embodiments of the present invention, the specific implementation manners and the application ranges may be changed, and in conclusion, the content of the present specification should not be construed as limiting the invention.

Claims (10)

1. A real-time detection method of bimanual gestures, said method based on a monocular camera, characterized in that: the method specifically comprises the following steps:
the method comprises the following steps that S1, a monocular camera captures a double-hand single-frame image, the single-frame image is input into an image segmentation network for segmentation, and segmentation results of three categories including a left hand, a right hand and a background are segmented;
s2, extracting a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points according to the segmentation result;
s3, calculating the position of a left-hand 3d joint point and the position of a right-hand 3d joint point according to a left-hand heat map comprising the position of the left-hand 2d joint point and a right-hand heat map comprising the position of the right-hand 2d joint point;
s4, fitting the positions of the left-hand 2d joint points and the positions of the left-hand 3d joint points with a left-hand skeleton model, and fitting the positions of the right-hand 2d joint points and the positions of the right-hand 3d joint points with a right-hand skeleton model to obtain parameters of the left-hand skeleton model and the right-hand skeleton model so as to obtain the postures of the two hands;
wherein, step S4 specifically includes:
fitting the predicted 2d/3d joint points with a kinematic skeleton model; the skeleton model of each hand includes 26 degrees of freedom, where t ∈ R^3 and R ∈ SO(3) represent the global position and the rotation angle of the root joint point, respectively, and θ ∈ R^20 represents the finger joint angles; taking Θ = {t, R, θ} as the parameters of a skeleton model, the transformation M(Θ) ∈ R^(21×3) gives the global positions of the hand joint points; the parameters of the left-hand and right-hand skeletons are denoted Θ_L and Θ_R, and Θ_H = {Θ_L, Θ_R} denotes the skeleton parameters of both hands; the skeleton model is fitted to the 3d joint points by minimizing the following, where J_i denotes the global position of the i-th 3d joint point:

E_{3D} = \sum_{i} \| M(\Theta_H)_i - J_i \|^2
in addition, the 2d joint points are used as an additional constraint so that the predicted result better fits the features of the hand in the original image; the skeleton is fitted to the 2d joint points by minimizing the following, where u_i denotes the position of the i-th 2d joint point and π projects a 3d joint point onto the 2d plane:

E_{2D} = \sum_{i} \| \pi(M(\Theta_H)_i) - u_i \|^2
in order to keep the posture of the hand skeleton model normal, the hand joints must not bend at excessive angles, so limits need to be placed on the joint angles; only the parameters predicted from the first frame are constrained here; letting \bar{\theta}_i and \underline{\theta}_i be the upper limit and the lower limit of the i-th joint angle, respectively, the joint angles are supervised by the following formula:

E_{a} = \sum_{i} \left( \max(\theta_i - \bar{\theta}_i, 0) + \max(\underline{\theta}_i - \theta_i, 0) \right)^2
in order to avoid excessive changes in the reconstructed hand posture between adjacent frames, the rate of change of the parameters predicted for two adjacent frames needs to be constrained, as shown in the following formula, where Θ_H^t denotes the parameters predicted for frame t:

E_{t} = \| \Theta_H^{t} - \Theta_H^{t-1} \|^2
the skeleton fitting process is constrained by the above four formulas, and Θ_H is obtained by fitting through minimization of the following energy equation, where ω is the weight of each term; when predicting the parameters of the first frame ω_3 is not 0, and in the subsequent predictions ω_3 is 0:

E = \omega_1 E_{3D} + \omega_2 E_{2D} + \omega_3 E_{a} + \omega_4 E_{t}
during training, the left-hand segmentation, the 2d joint point prediction and the 3d joint point prediction tasks are pre-trained respectively, and then the prediction of the 2d/3d joint points is trained end to end.
2. A method of real-time detection of bimanual gestures as claimed in claim 1, further comprising: the step S1 includes the steps of:
s11, extracting image features according to the input double-hand single-frame image;
s12, performing up-sampling operation on the image features to obtain probability graphs of three categories including a left hand, a right hand and a background;
and S13, obtaining segmentation results of three categories including the left hand, the right hand and the background according to the probability graph including the left hand, the right hand and the background.
3. The method of claim 2, wherein the real-time detection of bimanual gestures comprises: the image segmentation network includes a first convolution layer, a second convolution layer, and a transposed convolution layer.
4. A method of real-time detection of bimanual gestures as claimed in claim 3, further comprising: the step S11 includes the steps of:
step S111, inputting the two-hand single-frame image into the first convolution layer for down-sampling processing;
in step S112, the downsampled image is input to the second convolution layer and image feature extraction is performed.
5. A method for real-time detection of bimanual gestures according to claim 1 or 4, further comprising: the step S2 includes the steps of:
s21, overlapping segmentation results of three categories including a left hand, a right hand and a background with an original single-frame image, and inputting the overlapped segmentation results into a two-dimensional joint point extraction network for down-sampling processing to obtain posture characteristics;
and S22, performing up-sampling processing on the posture characteristics to obtain a left-hand heat map comprising the positions of the left-hand 2d joint points and a right-hand heat map comprising the positions of the right-hand 2d joint points.
6. A method of real-time detection of bimanual gestures according to claim 5, further comprising: the two-dimensional joint point extraction network comprises a network of a Hourglass structure and a third convolution layer.
7. The method of claim 6, wherein the real-time detection of bimanual gestures comprises: the step S3 includes the steps of:
step S31, extracting the confidence coefficient of the left-hand 2d joint point and the confidence coefficient of the right-hand 2d joint point according to the left-hand heat map and the right-hand heat map;
and step S32, inputting the positions and confidences of the left-hand 2d joint points and the positions and confidences of the right-hand 2d joint points into a three-dimensional joint point extraction network to obtain the positions of the left-hand 3d joint points and the positions of the right-hand 3d joint points.
8. The method of claim 7, wherein the real-time detection of bimanual gestures comprises: the three-dimensional joint point extraction network comprises a first fully-connected layer, a dual linear module and a second fully-connected layer.
9. A method of real-time detection of bimanual gestures as claimed in claim 8, further comprising: the dual linear module comprises a first dual linear module and a second dual linear module, and the first dual linear module and the second dual linear module respectively comprise two full connection layers.
10. A method of real-time detection of bimanual gestures according to claim 9, further comprising: the fitting in step S4 is a fitting by a minimized energy equation including a 2d joint point constraint term, a 3d joint point constraint term, a joint angle constraint term, and a time constraint term.
CN202010301111.1A 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands Active CN111539288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010301111.1A CN111539288B (en) 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010301111.1A CN111539288B (en) 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands

Publications (2)

Publication Number Publication Date
CN111539288A CN111539288A (en) 2020-08-14
CN111539288B true CN111539288B (en) 2023-04-07

Family

ID=71976803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010301111.1A Active CN111539288B (en) 2020-04-16 2020-04-16 Real-time detection method for gestures of both hands

Country Status (1)

Country Link
CN (1) CN111539288B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233222A (en) * 2020-09-29 2021-01-15 深圳市易尚展示股份有限公司 Human body parametric three-dimensional model deformation method based on neural network joint point estimation
CN113158774B (en) * 2021-03-05 2023-12-29 北京华捷艾米科技有限公司 Hand segmentation method, device, storage medium and equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992858A (en) * 2017-12-25 2018-05-04 深圳市唯特视科技有限公司 A kind of real-time three-dimensional gesture method of estimation based on single RGB frame
CN109635630A (en) * 2018-10-23 2019-04-16 百度在线网络技术(北京)有限公司 Hand joint point detecting method, device and storage medium
CN109800676A (en) * 2018-12-29 2019-05-24 上海易维视科技股份有限公司 Gesture identification method and system based on depth information
CN110287844A (en) * 2019-06-19 2019-09-27 北京工业大学 Traffic police's gesture identification method based on convolution posture machine and long memory network in short-term
CN110741385A (en) * 2019-06-26 2020-01-31 Oppo广东移动通信有限公司 Gesture recognition method and device and location tracking method and device
CN110837778A (en) * 2019-10-12 2020-02-25 南京信息工程大学 Traffic police command gesture recognition method based on skeleton joint point sequence

Also Published As

Publication number Publication date
CN111539288A (en) 2020-08-14

Similar Documents

Publication Publication Date Title
CN111311729B (en) Natural scene three-dimensional human body posture reconstruction method based on bidirectional projection network
CN108537754B (en) Face image restoration system based on deformation guide picture
CN111626159B (en) Human body key point detection method based on attention residual error module and branch fusion
CN112330729A (en) Image depth prediction method and device, terminal device and readable storage medium
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN113283525B (en) Image matching method based on deep learning
CN111539288B (en) Real-time detection method for gestures of both hands
CN112837215B (en) Image shape transformation method based on generation countermeasure network
CN113628348A (en) Method and equipment for determining viewpoint path in three-dimensional scene
CN113554039B (en) Method and system for generating optical flow graph of dynamic image based on multi-attention machine system
CN113221726A (en) Hand posture estimation method and system based on visual and inertial information fusion
CN111709270B (en) Three-dimensional shape recovery and attitude estimation method and device based on depth image
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
US20220319055A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
WO2022208440A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
Lin et al. Efficient and high-quality monocular depth estimation via gated multi-scale network
Al Ismaeil et al. Real-time enhancement of dynamic depth videos with non-rigid deformations
CN112417991A (en) Double-attention face alignment method based on hourglass capsule network
CN112215140A (en) 3-dimensional signal processing method based on space-time countermeasure
CN116740290A (en) Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN116266336A (en) Video super-resolution reconstruction method, device, computing equipment and storage medium
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
Fang et al. Hand pose estimation on hybrid CNN-AE model
CN114240999A (en) Motion prediction method based on enhanced graph attention and time convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant