Disclosure of Invention
The objects of the invention are as follows: one object is to provide a method for hand pose estimation using a 3D network incorporating kinematic constraints, so as to solve the above problems in the prior art; a further object is to provide a system implementing the above method.
The technical scheme is as follows: a method for hand pose estimation by a 3D network introducing kinematic constraints comprises the following steps:
step 1, converting a hand area positioned in an original depth map into voxelized input;
step 2, introducing a kinematic constraint 3D hand posture estimation network;
step 3, evaluating the accuracy of the predicted joint point positions and the rationality of the hand postures formed by the predicted joint points.
In a further embodiment, the step 1 further comprises:
step 1-1, positioning a hand region from a depth map by using a hand region positioning network;
step 1-2, projecting the hand area positioned in the depth map to a 3D space;
step 1-3, discretizing data projected to a 3D space according to a preset voxel size;
step 1-4, determining the value of each voxel position according to whether the position is covered by discrete points: the value is set to 1 when the position is covered, and to 0 when it is not.
In a further embodiment, the step 2 further comprises:
the network takes a voxelized hand area as input, and 3D heatmap representing joint point probability distribution is predicted through a 3D convolutional neural network; the coordinates of each joint point and the corresponding bone length are obtained by processing the 3D heatmap, so that the kinematic constraint can be added to the predicted result by modifying the loss function.
In a further embodiment, adding kinematic constraints to the predicted result requires processing the 3D heatmap, where each 3D heatmap predicted by the network represents the probability distribution of a single joint point, and the position of the maximum value in the heatmap is the position of the joint point; the positions of the joint points are obtained from the 3D heatmaps, and the bone lengths are then calculated according to the correspondence between joint points; a Soft-argmax function is proposed to obtain the coordinates of the joint points from the 3D heatmap in a differentiable way:

\hat{p}_n = \mathrm{Soft\mbox{-}argmax}(X_n) = \sum_{p} \Phi(X_n)_p \cdot p

wherein X_n represents a 3D heatmap of size D \times H \times W for joint n; \Phi(\cdot) represents the soft-max function; and \hat{p}_n represents (an approximation of) the position of the 3D heatmap maximum.
In a further embodiment, kinematic constraints are added to the prediction results after obtaining the coordinates of each joint point corresponding to the 3D heatmap:
step 3-1, setting a standard range for each bone length according to training data, namely the maximum length and the minimum length of the bone;
step 3-2, comparing the bone lengths obtained from the 3D heatmap with the set standard range, and penalizing any bone length that is above the maximum length or below the minimum length, thereby adding the kinematic constraint to the prediction result, wherein the loss function of the hand posture estimation network with the added kinematic constraint is as follows:
L = \sum_{n=1}^{N} \lVert \hat{H}_n - H_n^{*} \rVert^2 + \alpha \sum_{b=1}^{N-1} \max(0,\, l_b - l_b^{\max}) + \beta \sum_{b=1}^{N-1} \max(0,\, l_b^{\min} - l_b)

wherein L is the overall loss function, comprising three parts that respectively represent the 3D heatmap constraint, the constraint that a bone length exceeds the maximum length, and the constraint that a bone length falls below the minimum length; N represents the number of joint points and N-1 represents the number of bones; \hat{H}_n and H_n^{*} respectively represent the predicted and ground-truth 3D heatmaps; l_b is the bone length calculated from the predicted joint coordinates; l_b^{\max} and l_b^{\min} represent the preset longest and shortest bone lengths; and \alpha and \beta represent the weights of the individual components of the loss function.
A system for hand pose estimation incorporating a kinematically constrained 3D network, comprising a first module for converting hand regions in an original depth map into voxelized input; a second module for introducing a kinematically constrained 3D hand pose estimation network; and a third module for evaluating the accuracy of the predicted joint positions and evaluating the rationality of the hand gesture formed by the predicted joint positions.
In a further embodiment, the first module is further used for locating a hand region from the depth map using a hand region positioning network and projecting the located hand region into 3D space; discretizing the data projected into the 3D space according to a preset voxel size; and determining the value of each voxel position according to whether the position is covered by discrete points: the value is set to 1 when the position is covered, and to 0 when it is not.
The second module is further used for obtaining a prediction result of the joint point by using a hand posture estimation network introducing kinematic constraint and taking a voxelized hand area as input; the network takes a voxelized hand area as input, and 3D heatmap representing joint point probability distribution is predicted through a 3D convolutional neural network; the coordinates of each joint point and the corresponding bone length are obtained by processing the 3D heatmap, so that the kinematic constraint can be added to the predicted result by modifying the loss function.
The second module is further used for processing the 3D heatmap so as to add the kinematic constraint to the predicted result; each 3D heatmap predicted by the network represents the probability distribution of a single joint point, and the position of the maximum value in the heatmap is the position of the joint point; the positions of the joint points are obtained from the 3D heatmaps, and the bone lengths are then calculated according to the correspondence between joint points; a Soft-argmax function is proposed to obtain the coordinates of the joint points from the 3D heatmap in a differentiable way:

\hat{p}_n = \mathrm{Soft\mbox{-}argmax}(X_n) = \sum_{p} \Phi(X_n)_p \cdot p

wherein X_n represents a 3D heatmap of size D \times H \times W for joint n; \Phi(\cdot) represents the soft-max function; and \hat{p}_n represents (an approximation of) the position of the 3D heatmap maximum.
The third module is further used for setting a standard range for each bone length according to the training data, namely the maximum length and the minimum length of the bone; comparing the bone lengths obtained from the 3D heatmap with the set standard range, and penalizing any bone length that is above the maximum length or below the minimum length, thereby adding the kinematic constraint to the prediction result, wherein the loss function of the hand posture estimation network with the added kinematic constraint is as follows:
L = \sum_{n=1}^{N} \lVert \hat{H}_n - H_n^{*} \rVert^2 + \alpha \sum_{b=1}^{N-1} \max(0,\, l_b - l_b^{\max}) + \beta \sum_{b=1}^{N-1} \max(0,\, l_b^{\min} - l_b)

wherein L is the overall loss function, comprising three parts that respectively represent the 3D heatmap constraint, the constraint that a bone length exceeds the maximum length, and the constraint that a bone length falls below the minimum length; N represents the number of joint points and N-1 represents the number of bones; \hat{H}_n and H_n^{*} respectively represent the predicted and ground-truth 3D heatmaps; l_b is the bone length calculated from the predicted joint coordinates; l_b^{\max} and l_b^{\min} represent the preset longest and shortest bone lengths; and \alpha and \beta represent the weights of the individual components of the loss function.
Has the advantages that: the invention provides a method and a system for hand pose estimation by a 3D network introducing kinematic constraints. The hand posture estimation network introducing the kinematic constraints takes a voxelized hand area as input and obtains a prediction result for each joint point: a 3D heatmap representing the joint point probability distribution is predicted through a 3D convolutional neural network, and the coordinates of each joint point and the corresponding bone lengths are obtained by processing the 3D heatmap, so that the kinematic constraint can be added to the predicted result by modifying the loss function. The positions of the joint points are obtained from the 3D heatmap, and the bone lengths are then calculated according to the correspondence between joint points; a Soft-argmax function is proposed to obtain the coordinates of the joint points from the 3D heatmap in a differentiable way. Through these operations, the invention can better judge the rationality of the bone lengths of the prediction result and the rationality of the predicted hand posture.
Detailed Description
The task of hand posture estimation is divided into two steps: the first is to locate the hand area from a depth map, and the second is to predict the coordinates of each joint point from the hand area. The best-performing current method for locating the hand area comprises two steps. In the first step, based on the property that in a depth map the depth value is lower the closer the hand is to the camera, and on the assumption that the hand is the object closest to the camera, a rough hand positioning result is obtained by setting a depth threshold. In the second step, the rough hand positioning area is input into the hand region positioning network (Com-RefineNet) shown in fig. 1 to predict the coordinates of the palm center, and the final hand positioning result is then obtained according to a preset hand size.
The test metric adopted for the Com-RefineNet network is the average Euclidean distance between the predicted palm-center position and the ground-truth position.
The traditional convolutional-neural-network-based method for joint point prediction is to input the depth map as 2D data into a 2D convolutional neural network and directly regress the positions of the joint points. However, a depth map actually represents 2.5D data; directly treating it as 2D data and processing it with a 2D convolutional neural network cannot extract the key information in the depth map well. In addition, directly regressing the joint point positions from the depth map is a highly nonlinear process, which increases the difficulty of network learning.
The applicant believes that, with respect to the problems of traditional hand posture estimation methods, the currently effective approach is to adopt a voxelized input form, use a 3D convolutional neural network to extract features, and predict a 3D heatmap representing the joint point position probability distribution, so that joint point prediction achieves higher precision. Converting the hand region in the original depth map into a voxelized input requires four steps: first, the hand region is located from the depth map; second, the hand region in the depth map is projected into 3D space; third, the data projected into the 3D space are discretized according to the preset voxel size; fourth, the value of each voxel position is set to either 1 or 0 depending on whether the position is covered by a discrete point. The voxelized hand area is shown in fig. 2, where the blue dots indicate positions whose value is 1.
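The voxelization steps above can be sketched as follows. This is a minimal illustration, not the exact implementation: the pinhole intrinsics fx, fy, cx, cy, the crop offset bbox, and the cube size of 250 mm are all assumptions introduced for the example.

```python
import numpy as np

def voxelize_hand(depth_crop, bbox, fx, fy, cx, cy, grid=32, cube_mm=250.0):
    """Convert a cropped hand depth map into a binary voxel grid.

    depth_crop : 2D array of depth values (mm), 0 where invalid
    bbox       : (u0, v0) pixel offset of the crop in the full image
    fx, fy, cx, cy : camera intrinsics (assumed pinhole model)
    """
    v, u = np.nonzero(depth_crop)                 # pixels covered by the hand
    z = depth_crop[v, u].astype(np.float64)
    # step 2: back-project each covered pixel to a 3D point (mm)
    x = (u + bbox[0] - cx) * z / fx
    y = (v + bbox[1] - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    center = pts.mean(axis=0)                     # center the cube on the hand
    # step 3: discretize into a grid x grid x grid cube of side cube_mm
    idx = np.floor((pts - center + cube_mm / 2) / (cube_mm / grid)).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < grid), axis=1)]
    # step 4: 1 where a position is covered by a discrete point, 0 elsewhere
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vox
```

Centering the cube on the point cloud's mean is one simple choice; centering on the predicted palm position, as the pipeline here suggests, would work the same way.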
The hand is a structured object, and a hand gesture formed by connecting the joint points through bones must satisfy physical constraints. Therefore, in the hand posture estimation task, in addition to the accuracy of the predicted joint point positions, it is necessary to consider whether the hand posture formed by the predicted joint points is reasonable. As shown in FIG. 3, the joint point positions of the thumb in the right drawing are very close to those in the left drawing, but the degree of bending of the thumb in the right drawing is physically unreasonable. Therefore, a bone length constraint is added to the predicted hand gesture to ensure its plausibility.
In this application, we propose a 3D network that introduces kinematic constraints for accurate and reasonable hand pose estimation. The prediction process of the whole network can be divided into two steps, the first step is to locate the hand region from the depth map by using a hand region positioning network (Com-RefineNet) as shown in fig. 1, the second step is to obtain the prediction result of the joint points by using a hand posture estimation network (RVHE) introducing kinematic constraints and taking the voxelized hand region as input, and the overall structure diagram is shown in fig. 4.
The test metric adopted for the RVHE network is the average Euclidean distance between the predicted position and the ground-truth position of each joint point.
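This metric is a mean per-joint Euclidean distance; a minimal sketch (the function name and array layout are assumptions for illustration):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Average Euclidean distance (mm) between predicted and ground-truth
    joint positions. pred, gt: arrays of shape (num_samples, num_joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```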
The structure of the hand posture estimation network introducing the kinematic constraint is consistent with the structure in the referenced article; the whole flow is shown in fig. 5. The network takes a voxelized hand area as input, and a 3D heatmap representing the joint point probability distribution is predicted through a 3D convolutional neural network. By processing the 3D heatmap, the coordinates of each joint point and the corresponding bone lengths can be obtained, so that the kinematic constraint can be added to the predicted result by modifying the loss function.
Adding kinematic constraints to the predicted result requires processing the 3D heatmap; each 3D heatmap predicted by the network represents the probability distribution of a single joint point, and the position of the maximum value in the heatmap is the position of the joint point. The bone lengths can be calculated by obtaining the joint point positions from the 3D heatmaps and then using the correspondence between joint points. However, directly using the argmax function to obtain coordinates is non-differentiable, which breaks the end-to-end back-propagation chain during network training, so we propose a Soft-argmax function to obtain the coordinates of the joint points from the 3D heatmap in a differentiable way.
The premise of the Soft-argmax function is that, for a sufficiently sharp distribution, the position of the maximum can be approximated by the expectation of that distribution. Considering that each 3D heatmap corresponds to one joint point, the distribution of each heatmap is close to a sharply peaked (leptokurtic) distribution, and passing the values of the heatmap through soft-max makes the distribution even sharper. The Soft-argmax function can thus be used to obtain the coordinates of the joint point in each heatmap in a differentiable way. The Soft-argmax function is shown in equation (1):
\hat{p}_n = \mathrm{Soft\mbox{-}argmax}(X_n) = \sum_{p} \Phi(X_n)_p \cdot p    (1)

wherein X_n represents a 3D heatmap of size D \times H \times W for joint n; \Phi(\cdot) represents the soft-max function; and \hat{p}_n represents (an approximation of) the position of the 3D heatmap maximum.
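The Soft-argmax operation can be sketched in plain numpy as below: the expectation of the voxel coordinates under the soft-max of the heatmap. This is an illustrative sketch; the exact formulation in the network (e.g. any temperature scaling) is an assumption.

```python
import numpy as np

def soft_argmax_3d(heatmap):
    """Differentiable stand-in for argmax: the expected position under the
    soft-max of a 3D heatmap. heatmap: array of shape (D, H, W).
    Returns a length-3 array of (z, y, x) voxel coordinates."""
    flat = heatmap.reshape(-1)
    prob = np.exp(flat - flat.max())              # numerically stable soft-max
    prob /= prob.sum()
    coords = np.stack(np.meshgrid(*[np.arange(s) for s in heatmap.shape],
                                  indexing="ij"), axis=-1).reshape(-1, 3)
    return prob @ coords                          # expectation of position
```

For a sharply peaked heatmap the expectation is very close to the true argmax, which is exactly the premise stated above.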
Through the operation of equation (1), we can obtain the coordinates of the joint point corresponding to each 3D heatmap, and kinematic constraints can then be added to the prediction result. First, a standard range, namely the maximum and minimum length of each bone, is set according to the training data; the bone lengths obtained from the 3D heatmap are then compared with the set standard range, and any length above the maximum or below the minimum is penalized, adding the kinematic constraint to the prediction. The loss function of the hand pose estimation network with added kinematic constraints is shown in equation (2).
L = \sum_{n=1}^{N} \lVert \hat{H}_n - H_n^{*} \rVert^2 + \alpha \sum_{b=1}^{N-1} \max(0,\, l_b - l_b^{\max}) + \beta \sum_{b=1}^{N-1} \max(0,\, l_b^{\min} - l_b)    (2)

wherein L is the overall loss function, comprising three parts that respectively represent the 3D heatmap constraint, the constraint that a bone length exceeds the maximum length, and the constraint that a bone length falls below the minimum length; N represents the number of joint points and N-1 represents the number of bones; \hat{H}_n and H_n^{*} respectively represent the predicted and ground-truth 3D heatmaps; l_b is the bone length calculated from the predicted joint coordinates; l_b^{\max} and l_b^{\min} represent the preset longest and shortest bone lengths; and \alpha and \beta represent the weights of the individual components of the loss function.
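The loss can be sketched as follows in plain numpy for clarity (in training it would be written in a differentiable framework); the weight names alpha and beta, the bone list as (parent, child) index pairs, and the array layouts are assumptions for the example.

```python
import numpy as np

def rvhe_loss(pred_hm, gt_hm, joints, bones, l_max, l_min, alpha=1.0, beta=1.0):
    """Heatmap loss plus bone-length range penalties.

    pred_hm, gt_hm : (N, D, H, W) predicted / ground-truth 3D heatmaps
    joints         : (N, 3) joint coordinates decoded from pred_hm
    bones          : list of (parent, child) joint index pairs, length N-1
    l_max, l_min   : (N-1,) preset longest / shortest bone lengths
    """
    l_heat = np.sum((pred_hm - gt_hm) ** 2)               # 3D heatmap constraint
    lengths = np.array([np.linalg.norm(joints[a] - joints[b]) for a, b in bones])
    l_long = np.sum(np.maximum(0.0, lengths - l_max))     # bone exceeds maximum
    l_short = np.sum(np.maximum(0.0, l_min - lengths))    # bone below minimum
    return l_heat + alpha * l_long + beta * l_short
```

Bones whose lengths stay inside the preset [min, max] range contribute nothing to the two penalty terms, so the constraint only activates for implausible poses.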
There are three existing hand pose data sets of relatively high quality: the NYU, ICVL, and MSRA data sets. The NYU data set is used for training and testing because it covers the most gestures, annotates the most joint points, and has the most accurate annotation information. The network is trained to predict the positions of 14 joint points in total, including the palm center, two wrist joint points, the carpometacarpal joint of the thumb, and the metacarpophalangeal joint and fingertip of each of the five fingers; the joint distribution diagram is shown in fig. 6.
To test the performance of the network, a total of three experiments were performed in the NYU test set.
(1) Hand region correction
In order to test the correction capability of Com-RefineNet for the palm center position, the rough hand positioning mode (Com) is first used to obtain rough coordinates of the palm center, and the positioning error is calculated; the rough hand area is then sent to Com-RefineNet to obtain the corrected palm center position, and the positioning error is calculated again. The two results are shown in table 1, where Ground Truth represents the real palm center position.
TABLE 1 errors of different positioning methods
As can be seen from Table 1, using Com-RefineNet can provide more accurate hand region positioning results for subsequent predicted networks.
(2) Ablation study
To further verify Com-RefineNet and the added physical constraints as improvements over V2V-PoseNet, an ablation experiment was performed with five different prediction structures. The first is the base structure, V2V-PoseNet, which uses coarse hand positioning and does not add physical constraints. The second structure adds the bone length constraint to V2V-PoseNet on the basis of the first structure. The third structure replaces the coarse hand positioning with Com-RefineNet on the basis of the first structure. The fourth structure, RVHE, both replaces the coarse localization with Com-RefineNet and adds the physical constraint. The fifth structure takes the hand area cropped around the true palm position and adds the physical constraint to the prediction network. The average prediction error over all joints is calculated for the five structures above, and the results are shown in table 2.
TABLE 2 prediction error for different configurations
In table 2, V2V represents the joint location prediction network without added physical constraints, Com represents the coarse hand positioning, ComRN represents the use of Com-RefineNet, Cs denotes the added physical constraints, and Gt denotes the true hand area position. As can be seen from table 2, the prediction error of the first structure is as high as 20 mm. Comparing the methods using rough positioning with those using Com-RefineNet, i.e. the first structure with the third and the second with the fourth, the error is reduced by 6 mm and 5 mm, respectively, when the corrected hand area is used. Comparing the methods without and with physical constraints, i.e. the first structure with the second and the third with the fourth, the addition of the physical constraint reduces the error by 1 mm and 0.7 mm, respectively. The fifth structure, which uses unbiased hand region positioning, achieves the best prediction error of 9.23 mm. Therefore, accurate hand region positioning is very important for the final prediction of the network, and adding suitable physical constraints on the basis of an accurately positioned hand region makes the prediction result more reasonable and accurate.
(3) Comprehensive experiment
Com-RefineNet and V2V-PoseNet with the added physical constraints are connected in series to form RVHE, which is compared with other advanced methods in the field of hand pose estimation on two test metrics. A total of five methods were selected, including DeepPrior++, FeedBack, REN, and DeepModel.
The average prediction error for all joints was obtained by testing on the NYU data set (see table 3).
TABLE 3 positioning error for different methods
It can be seen from table 3 that, compared with other advanced methods, the proposed method reaches a more accurate level: the error is reduced by nearly 6 mm compared with DeepPrior, and the accuracy is similar to that of the REN prediction.
(4) Qualitative analysis
The output of the RVHE is converted into joint point positions and plotted on the depth map, with the predicted joints indicated in red. The joint positions are connected by bones; green represents the ground-truth result. Partial prediction results are shown in FIG. 7. It can be seen that RVHE provides accurate hand pose estimation results.