CN110135340A - Point-cloud-based 3D hand pose estimation method - Google Patents

Point-cloud-based 3D hand pose estimation method

Info

Publication number
CN110135340A
CN110135340A CN201910402435.1A
Authority
CN
China
Prior art keywords
hand
point
hand pose
point cloud
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910402435.1A
Other languages
Chinese (zh)
Inventor
邹露
黄章进
张智森
温泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201910402435.1A priority Critical patent/CN110135340A/en
Publication of CN110135340A publication Critical patent/CN110135340A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/11Hand-related biometrics; Hand pose recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-precision 3D hand pose estimation method, comprising: using an end-to-end deep neural network architecture that directly processes the hand point cloud data captured by a depth sensor, thereby avoiding the space waste and computational redundancy caused by converting depth data into voxels or multi-view images; an innovatively designed spatial transformation algorithm that makes the method invariant to rotation and translation of the input point cloud data; and, to further reduce the global error introduced by errors in fingertip location estimation, an innovatively designed fingertip location refinement algorithm. The invention can therefore capture complex hand pose variations and estimate an accurate low-dimensional hand pose representation.

Description

Point-cloud-based 3D hand pose estimation method
Technical field
The present invention relates to the fields of computer vision and pose estimation, and in particular to a point-cloud-based 3D hand pose estimation method.
Background technique
In recent years, with the wide availability of depth sensors, depth-camera-based 3D hand pose estimation has made significant progress. Meanwhile, benefiting from the great success of deep neural networks in computer vision tasks, convolutional neural networks (CNNs) have achieved impressive results on hand pose estimation from depth images. However, CNNs usually take 2D images as input and therefore cannot directly exploit the 3D information contained in depth images.
Ge et al. proposed encoding the depth image as 3D point cloud data and performing 3D pose estimation on it with a 3D convolutional neural network (Ge L, Liang H, Yuan J, et al. 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1991-2000). However, the memory footprint and parameter count of a 3D CNN grow exponentially with the scale of the point cloud data, so both the accuracy and the real-time performance of such methods are unsatisfactory. At the same time, because 3D point clouds are sparse, the representation usually contains a large amount of empty space (regions with no data points), wasting memory. Although Ge et al. convert the sparse 3D point cloud into a dense one before processing it, this not only adds unnecessary computation but also changes the spatial distribution of the original point cloud data, degrading accuracy.
In addition, Ge et al. also proposed a multi-view-CNN-based 3D hand pose regression method (Ge L, Liang H, Yuan J, et al. Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 3593-3601). This method requires complicated data preprocessing: the 3D point cloud must first be projected onto 2D images from different viewpoints, and only then can a convolutional neural network perform pose regression.
Summary of the invention
The object of the present invention is to provide a point-cloud-based 3D hand pose estimation method with low computational cost and high accuracy.
The object of the present invention is achieved through the following technical solution:
A point-cloud-based 3D hand pose estimation method, comprising:
converting a depth image of a hand into a 3D point cloud, and downsampling it;
training a spatial transformation network to normalize the downsampled 3D point cloud;
feeding the normalized 3D point cloud, together with its surface normals, into a trained hand pose regression network to predict the hand joint positions, and applying the inverse spatial transformation to obtain a preliminary hand pose prediction;
refining the preliminary hand pose prediction with a fingertip correction network to obtain the final hand pose.
As can be seen from the above technical solution provided by the invention, taking point cloud data directly as input avoids the spatial redundancy and computational overhead incurred by first converting the point cloud into voxels; the spatial transformation network makes the method invariant to rotation and translation of the input point cloud data; and the fingertip correction network refines the fingertip positions, making the prediction more accurate.
Detailed description of the invention
To describe the technical solution of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a point-cloud-based 3D hand pose estimation method provided by an embodiment of the present invention;
Fig. 2 shows the test results, provided by an embodiment of the present invention, on the ICVL, MSRA and NYU datasets.
Specific embodiment
The technical solution in the embodiments of the present invention is described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art, without creative effort, based on the embodiments of the invention, fall within the protection scope of the present invention.
An embodiment of the present invention provides a point-cloud-based 3D hand pose estimation method, as shown in Fig. 1, which mainly comprises the following steps:
Step 1: convert the depth image of the hand into a 3D point cloud, and downsample it.
In the embodiment of the present invention, the depth image may be an image acquired by a depth camera, a depth sensor, or another relevant device.
After the depth image is converted into a 3D point cloud, it is downsampled to N points; the downsampled 3D point cloud can be expressed as $\{p_i\}_{i=1}^{N}$, where $p_i$ denotes the i-th point, corresponding to one 3D coordinate. The value of N can be set as required, for example N = 1024.
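The depth-to-point-cloud conversion and the downsampling of Step 1 can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the pinhole intrinsics fx, fy, cx, cy and the random-sampling strategy are assumptions, since the patent does not specify them.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, n_points=1024):
    """Back-project a depth image into a 3D point cloud and randomly
    downsample it to a fixed size of n_points points (N = 1024 in the text)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                       # keep only pixels with a depth reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx            # pinhole camera back-projection
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)    # shape (M, 3)
    # Sample with replacement if fewer than n_points pixels are valid.
    idx = np.random.choice(len(points), n_points, replace=len(points) < n_points)
    return points[idx]
```

Any other fixed-size sampling (e.g. farthest-point sampling) would serve the same purpose of producing the N-point cloud the later networks expect.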
Step 2: train a spatial transformation network to normalize the downsampled 3D point cloud.
In the embodiment of the present invention, the network structure disclosed in the paper (Jaderberg M, Simonyan K, Zisserman A. Spatial transformer networks [C]// Advances in Neural Information Processing Systems. 2015: 2017-2025) can be used for reference. The spatial transformation network normalizes the point cloud through an affine transformation, whose principle can be expressed as:

$$\begin{pmatrix} x_i^t \\ y_i^t \\ z_i^t \end{pmatrix} = A_\theta \begin{pmatrix} x_i^s \\ y_i^s \\ z_i^s \\ 1 \end{pmatrix}, \qquad A_\theta = \begin{pmatrix} a_1 & a_2 & a_3 & a_{10} \\ a_4 & a_5 & a_6 & a_{11} \\ a_7 & a_8 & a_9 & a_{12} \end{pmatrix}$$

In the formula above, $(x_i^s, y_i^s, z_i^s)$ is the 3D coordinate before the transformation and $(x_i^t, y_i^t, z_i^t)$ the 3D coordinate after it; the input point is written in homogeneous form. $A_\theta$ is the affine transformation matrix, corresponding to the prediction of the spatial transformation network, and $A_\theta^{-1}$ denotes the inverse of the affine transformation. The parameters $a_1$ to $a_9$ correspond to the 3D rotation and $a_{10}$ to $a_{12}$ to the 3D translation. Multiplying each point of the 3D point cloud by the affine transformation matrix $A_\theta$ normalizes the point cloud data:

$$\hat{p}_i = A_\theta \, p_i, \qquad i = 1, \dots, N$$

In the formula above, $\hat{p}_i$ is the normalization result of point $p_i$.
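Applying the predicted affine matrix to the whole cloud reduces to one matrix product in homogeneous coordinates; a minimal sketch (here `A_theta` is an arbitrary 3×4 matrix standing in for the network's prediction):

```python
import numpy as np

def normalize_point_cloud(points, A_theta):
    """Normalize an (N, 3) point cloud with a predicted 3x4 affine matrix
    A_theta, applied in homogeneous coordinates as in the formula above."""
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])  # (N, 4) homogeneous coords
    return homo @ A_theta.T                      # (N, 3) normalized points
```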
Step 3: to make full use of the depth image information collected by the depth sensor, feed the normalized 3D point cloud, together with its surface normals, into the trained hand pose regression network to predict the hand joint positions, and apply the inverse spatial transformation to obtain the preliminary hand pose prediction.
In the embodiment of the present invention, the hand pose regression network is a PointNet++ network, used to perform layer-by-layer feature abstraction and pose regression on the 3D point cloud data.
Here, PointNet++ refers to a PointNet network with three set-abstraction levels; for PointNet++ see the paper (Qi C R, Yi L, Su H, et al. PointNet++: Deep hierarchical feature learning on point sets in a metric space [C]// Advances in Neural Information Processing Systems. 2017: 5099-5108), and for PointNet see the paper (Qi C R, Su H, Mo K, et al. PointNet: Deep learning on point sets for 3D classification and segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 652-660).
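The regression network takes per-point surface normals as an extra input, but the patent does not state how the normals are obtained. One common choice, shown here as an illustrative sketch only, is PCA over each point's k nearest neighbours: the eigenvector of the local covariance with the smallest eigenvalue approximates the surface normal.

```python
import numpy as np

def estimate_normals(points, k=30):
    """Estimate a unit surface normal per point via PCA over its k nearest
    neighbours (an assumed preprocessing step; the patent leaves it open)."""
    # Pairwise distances; fine for small clouds, use a KD-tree for large N.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    knn = np.argsort(dists, axis=1)[:, :k]       # indices of k nearest points
    normals = np.empty_like(points)
    for i, idx in enumerate(knn):
        neigh = points[idx] - points[idx].mean(axis=0)
        cov = neigh.T @ neigh                    # local 3x3 covariance
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
        normals[i] = eigvecs[:, 0]               # smallest-eigenvalue direction
    return normals
```

Note that PCA normals are only defined up to sign; a consistent orientation (e.g. towards the camera) would have to be enforced separately.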
In the training stage, suppose there are T normalized training samples $\{P_t\}_{t=1}^{T}$; each training sample $P_t$ contains the point coordinates $P_t^{xyz}$ and the surface normals $P_t^{nor}$, and $P_t^{GT}$ denotes the ground-truth hand joint coordinates. The optimization objective of the training stage is then defined as:

$$\omega^* = \arg\min_{\omega} \sum_{t=1}^{T} \left\| \mathcal{F}(P_t;\, \omega) - P_t^{GT} \right\|_2^2$$

where $\omega$ denotes the network parameters to be optimized, $\omega^*$ the optimized network parameters, and $\mathcal{F}$ the hand pose regression network PointNet++, whose output is a 3*M matrix, i.e. the 3D coordinates of the M normalized hand joints.
Since the output of the hand pose regression network has passed through the normalization of the spatial transformation network, the coordinate values must be transformed back to the original space in the training stage. This inverse transformation uses the inverse of the affine transformation matrix obtained in Step 2.
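The inverse spatial transformation and the squared-error objective above can be sketched as follows; extending the 3×4 matrix $A_\theta$ to a 4×4 homogeneous matrix before inverting it is an implementation assumption, not something the patent prescribes.

```python
import numpy as np

def inverse_transform(joints_norm, A_theta):
    """Map predicted joints (M, 3) from the normalized space back to the
    original space using the inverse of the homogeneous form of A_theta."""
    A4 = np.vstack([A_theta, [0.0, 0.0, 0.0, 1.0]])  # 4x4 homogeneous form
    A_inv = np.linalg.inv(A4)
    homo = np.hstack([joints_norm, np.ones((joints_norm.shape[0], 1))])
    return (homo @ A_inv.T)[:, :3]

def pose_loss(pred_joints, gt_joints):
    """Squared-error term for one sample, matching the objective above."""
    return np.sum((pred_joints - gt_joints) ** 2)
```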
Step 4: refine the preliminary hand pose prediction with the fingertip correction network to obtain the final hand pose.
Step 3 above yields only a preliminary prediction. To improve the precision of the hand pose estimate, a PointNet-based fingertip correction network is designed; it takes as input the K points around the hand joint positions predicted by the hand pose regression network (i.e. the result of Step 3), which can be found by a k-nearest-neighbour search, and uses them to refine the fingertip coordinates.
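Gathering the K input points for the fingertip correction network reduces to a nearest-neighbour query around each predicted fingertip; a minimal sketch (K = 64 is an illustrative value, the patent leaves K unspecified):

```python
import numpy as np

def fingertip_neighbourhood(points, fingertip, k=64):
    """Return the k points of the (N, 3) cloud nearest to a predicted
    fingertip position; these form the correction network's input."""
    dists = np.linalg.norm(points - fingertip, axis=1)
    idx = np.argsort(dists)[:k]
    return points[idx]
```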
The embodiment of the present invention achieves the following beneficial effects:
1) Point cloud data is used directly as the network input, so no complicated data preprocessing is required, and the space waste and computational redundancy of first converting the data into voxels or multi-view images are effectively avoided.
2) The depth image information collected by the depth sensor is fully exploited to train an end-to-end hand pose regression network.
3) Spatial transformation is used to normalize the hand point cloud, making the method invariant to rotation and translation of the input point cloud data.
4) The fingertip correction network refines the fingertip positions, making the prediction more accurate.
To verify the performance of the above scheme of the embodiment of the present invention, tests were carried out on the ICVL, MSRA and NYU datasets. The experimental results are shown in Fig. 2; the proposed method achieves good performance on all datasets.
Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented in software, or in software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes instructions causing a computing device (such as a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by anyone skilled in the art within the technical scope of the present disclosure shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (4)

1. A point-cloud-based 3D hand pose estimation method, characterized by comprising:
converting a depth image of a hand into a 3D point cloud, and downsampling it;
training a spatial transformation network to normalize the downsampled 3D point cloud;
feeding the normalized 3D point cloud, together with its surface normals, into a trained hand pose regression network to predict the hand joint positions, and applying the inverse spatial transformation to obtain a preliminary hand pose prediction;
refining the preliminary hand pose prediction with a fingertip correction network to obtain the final hand pose.
2. The point-cloud-based 3D hand pose estimation method according to claim 1, characterized in that the trained spatial transformation network normalizes the point cloud through an affine transformation, with the formula:

$$\hat{p}_i = A_\theta \, p_i, \qquad i = 1, \dots, N$$

where $p_i$ denotes the i-th point, N is the number of points in the downsampled 3D point cloud, $\hat{p}_i$ is the normalization result of point $p_i$, and $A_\theta$ is the prediction of the spatial transformation network.
3. The point-cloud-based 3D hand pose estimation method according to claim 2, characterized in that the hand pose regression network is a PointNet++ network, used for layer-by-layer feature abstraction and pose regression;
in the training stage, suppose there are T normalized training samples $\{P_t\}_{t=1}^{T}$; each training sample $P_t$ contains the point coordinates $P_t^{xyz}$ and the surface normals $P_t^{nor}$, and $P_t^{GT}$ denotes the ground-truth hand joint coordinates; the optimization objective of the training stage is then:

$$\omega^* = \arg\min_{\omega} \sum_{t=1}^{T} \left\| \mathcal{F}(P_t;\, \omega) - P_t^{GT} \right\|_2^2$$

where $\omega$ denotes the network parameters to be optimized, $\omega^*$ the optimized network parameters, and $\mathcal{F}$ the hand pose regression network PointNet++, whose output is a 3*M matrix, i.e. the 3D coordinates of the M normalized hand joints;
the predicted hand joint position coordinates are then transformed back to the original space through the inverse spatial transformation.
4. The point-cloud-based 3D hand pose estimation method according to claim 3, characterized in that the fingertip correction network takes as input the K points around the hand joint positions and refines the fingertip positions to obtain the final hand joint coordinates.
CN201910402435.1A 2019-05-15 2019-05-15 Point-cloud-based 3D hand pose estimation method Pending CN110135340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910402435.1A CN110135340A (en) 2019-05-15 2019-05-15 Point-cloud-based 3D hand pose estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910402435.1A CN110135340A (en) 2019-05-15 2019-05-15 Point-cloud-based 3D hand pose estimation method

Publications (1)

Publication Number Publication Date
CN110135340A true CN110135340A (en) 2019-08-16

Family

ID=67574135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910402435.1A Pending CN110135340A (en) 2019-05-15 2019-05-15 Point-cloud-based 3D hand pose estimation method

Country Status (1)

Country Link
CN (1) CN110135340A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705487A (en) * 2019-10-08 2020-01-17 清华大学深圳国际研究生院 Palm print acquisition equipment and method and image acquisition device thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015053896A1 (en) * 2013-10-11 2015-04-16 Intel Corporation 3d object tracking
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN108491752A (en) * 2018-01-16 2018-09-04 北京航空航天大学 A kind of hand gestures method of estimation based on hand Segmentation convolutional network
CN109086683A (en) * 2018-07-11 2018-12-25 清华大学 A kind of manpower posture homing method and system based on cloud semantically enhancement

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015053896A1 (en) * 2013-10-11 2015-04-16 Intel Corporation 3d object tracking
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN108491752A (en) * 2018-01-16 2018-09-04 北京航空航天大学 A kind of hand gestures method of estimation based on hand Segmentation convolutional network
CN109086683A (en) * 2018-07-11 2018-12-25 清华大学 A kind of manpower posture homing method and system based on cloud semantically enhancement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JADERBERG M 等: ""Spatial transformer networks"", 《NEURAL INFORMATION PROCESSING SYSTEMS》 *
LIUHAO GE 等: ""Hand PointNet: 3D Hand Pose Estimation using Point Sets"", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705487A (en) * 2019-10-08 2020-01-17 清华大学深圳国际研究生院 Palm print acquisition equipment and method and image acquisition device thereof
CN110705487B (en) * 2019-10-08 2022-07-29 清华大学深圳国际研究生院 Palm print acquisition equipment and method and image acquisition device thereof

Similar Documents

Publication Publication Date Title
CN108764048B (en) Face key point detection method and device
WO2021103648A1 (en) Hand key point detection method, gesture recognition method, and related devices
CN109919984A (en) A point cloud automatic registration method based on local feature descriptors
CN111951384B (en) Three-dimensional face reconstruction method and system based on single face picture
US20240046557A1 (en) Method, device, and non-transitory computer-readable storage medium for reconstructing a three-dimensional model
CN112819971B (en) Method, device, equipment and medium for generating virtual image
CN112328715B (en) Visual positioning method, training method of related model, related device and equipment
CN111951381B (en) Three-dimensional face reconstruction system based on single face picture
CN111831844A (en) Image retrieval method, image retrieval device, image retrieval apparatus, and medium
WO2021051526A1 (en) Multi-view 3d human pose estimation method and related apparatus
CN105654483A (en) Three-dimensional point cloud full-automatic registration method
Xu et al. GraspCNN: Real-time grasp detection using a new oriented diameter circle representation
CN111797692B (en) Depth image gesture estimation method based on semi-supervised learning
CN111709268B (en) Human hand posture estimation method and device based on human hand structure guidance in depth image
CN110838122A (en) Point cloud segmentation method and device and computer storage medium
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN116958420A (en) High-precision modeling method for three-dimensional face of digital human teacher
Qin et al. PointSkelCNN: Deep Learning‐Based 3D Human Skeleton Extraction from Point Clouds
CN110135340A (en) Point-cloud-based 3D hand pose estimation method
CN113920267B (en) Three-dimensional scene model construction method, device, equipment and storage medium
CN112991445B (en) Model training method, gesture prediction method, device, equipment and storage medium
CN113487713B (en) Point cloud feature extraction method and device and electronic equipment
CN114820899A (en) Attitude estimation method and device based on multi-view rendering
CN110942007B (en) Method and device for determining hand skeleton parameters, electronic equipment and storage medium
Pei et al. Loop closure in 2d lidar and rgb-d slam

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190816