CN109086683B - Human hand posture regression method and system based on point cloud semantic enhancement - Google Patents

Human hand posture regression method and system based on point cloud semantic enhancement

Info

Publication number
CN109086683B
Authority
CN
China
Prior art keywords
point cloud
hand
point
cloud data
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810758545.7A
Other languages
Chinese (zh)
Other versions
CN109086683A (en)
Inventor
王贵锦
陈醒濠
杨华中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201810758545.7A priority Critical patent/CN109086683B/en
Publication of CN109086683A publication Critical patent/CN109086683A/en
Application granted Critical
Publication of CN109086683B publication Critical patent/CN109086683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a hand posture regression method and system based on point cloud semantic enhancement. Point cloud features of the hand point cloud data are extracted and classified point by point to obtain semantic segmentation information of the hand point cloud data; the hand point cloud data are semantically enhanced based on the semantic segmentation information; a hand posture prediction result is obtained from the semantically enhanced hand point cloud data; and the hand posture prediction result is geometrically transformed to obtain the hand posture regression result. The network learns geometric transformations of both the input data and the output.

Description

Human hand posture regression method and system based on point cloud semantic enhancement
Technical Field
The invention relates to the technical field of computers, in particular to a human hand posture regression method and system based on point cloud semantic enhancement.
Background
In vision-based human-computer interaction, human hand posture estimation refers to accurately predicting the three-dimensional coordinates of the skeleton joints of the human hand, and it has wide application prospects in virtual reality, augmented reality, human-computer interaction and other fields. Human hand posture estimation has been an active research topic in computer vision for decades.
Vision-based human hand posture estimation methods fall into two categories. The first is appearance-based: machine learning is used to build a mapping from a two-dimensional image feature space to the three-dimensional hand posture space. Such methods make real-time tracking easy to achieve, but they need dense training samples to guarantee accuracy and an efficient learning and search algorithm over a huge image database. The second is model-based: a three-dimensional hand model is projected into the two-dimensional image space, and the pose parameters estimated for the three-dimensional model are corrected through feature comparison and data estimation.
Model-based methods produce more accurate estimates, but their performance depends on the chosen model. Typically the depth image is fed into a two-dimensional convolutional neural network (CNN) as a single-channel image and the hand pose is then predicted. However, mapping from two-dimensional images to three-dimensional joint coordinates is a highly non-linear problem, and the mismatch between the input and output spaces makes network learning very difficult. More recently, methods based on three-dimensional convolutional neural networks (3D CNNs) first convert the depth image into a voxel representation and then regress the pose with a 3D CNN. However, 3D voxels require quantizing continuous coordinate information, which introduces quantization errors that are detrimental to accurate hand pose estimation. The 3D CNN approach also occupies a large amount of memory, especially when the 3D voxel resolution is high. In addition, the trained networks are not robust to geometric transformations of the input and their accuracy is limited; moreover, most existing methods are based on heat-map prediction or direct regression, so their posture regression performance is low.
Disclosure of Invention
The present invention provides a human hand pose regression method and system based on point cloud semantic enhancement that overcome or at least partially solve the above-mentioned problems.
According to the first aspect of the invention, a human hand posture regression method based on point cloud semantic enhancement is provided, and comprises the following steps:
extracting point cloud characteristics of the hand point cloud data, and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand posture prediction result based on the hand point cloud data after semantic enhancement, and performing geometric transformation on the hand posture prediction result to obtain a hand posture regression result.
According to a second aspect of the invention, a human hand posture regression device based on point cloud semantic enhancement is provided, which comprises:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions capable of performing the human hand pose regression method based on point cloud semantic enhancement as described above.
According to a third aspect of the present invention, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above-described human hand pose regression method based on point cloud semantic enhancement.
The invention provides a hand posture regression method and system based on point cloud semantic enhancement. Point cloud features of the hand point cloud data are extracted and classified point by point to obtain semantic segmentation information of the hand point cloud data; the hand point cloud data are semantically enhanced based on the semantic segmentation information; a hand posture prediction result is obtained from the semantically enhanced hand point cloud data; and the hand posture prediction result is geometrically transformed to obtain the hand posture regression result. Because the network learns geometric transformations of both the input data and the output, the hand posture estimation method is more robust to geometric transformations of the input data, and the semantic information of the point cloud point-by-point classification subnetwork is effectively fused with the posture regression subnetwork, which further improves the performance of hand posture estimation.
Drawings
FIG. 1 is a schematic diagram of a human hand posture regression method based on point cloud semantic enhancement according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network system of a human hand posture regression method based on point cloud semantic enhancement according to an embodiment of the invention;
FIG. 3 is a block diagram of a point cloud point-by-point classification subnetwork in accordance with an embodiment of the present invention;
FIG. 4 is a diagram of a pose regression subnetwork according to an embodiment of the present invention;
FIG. 5 is a diagram of a transformation learning subnetwork in accordance with an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of human hand posture regression equipment based on point cloud semantic enhancement according to an embodiment of the invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Obtaining human hand information through computer vision during virtual-operation human-computer interaction has great advantages in naturalness and cost, and is a main trend of future development. Because a computer cannot directly acquire hand information, estimating the hand posture parameters is a fundamental task. Only with accurate hand posture parameters can the virtual hand in the scene be driven and the consistency between the real and virtual sides of the human-computer interaction be guaranteed.
Existing human hand pose estimation methods convert the depth image to a voxel representation and then regress the pose with a 3D CNN. However, 3D voxels require quantizing continuous coordinate information, which introduces quantization errors that are detrimental to accurate hand pose estimation. The 3D CNN approach also occupies a large amount of memory, especially when the resolution of the 3D voxels is high.
The prior art gives little consideration to spatial transformations of the input data, so the trained networks are not robust to geometric transformations of the input and their accuracy is limited; and because they are based on heat-map prediction or direct regression, they do not consider how the semantic segmentation information of the input data could be combined to improve posture regression. To solve these problems, the embodiment of the invention estimates the human hand posture from point clouds and provides a human hand posture regression method based on point cloud semantic enhancement. As shown in Fig. 1, the method comprises the following steps:
extracting point cloud characteristics of the hand point cloud data, and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand posture prediction result based on the hand point cloud data after semantic enhancement, and performing geometric transformation on the hand posture prediction result to obtain a hand posture regression result.
Specifically, in this embodiment, the point cloud point-by-point classification subnetwork takes the hand point cloud data as input, extracts point cloud features with a PointNet++ network, and finally performs point-by-point classification to obtain the semantic segmentation information of the point cloud. The posture regression subnetwork also takes the hand point cloud data as input and outputs the final hand posture estimation result. The network system built by the method of the embodiment of the invention is shown in Fig. 2.
In this embodiment, learning the geometric transformations of the input data and of the output with the network makes the hand posture estimation method more robust to geometric transformations of the input data, and the semantic segmentation information from the point-by-point classification of the input point cloud is effectively fused with the posture regression subnetwork, further improving the performance of hand posture estimation.
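The following is a minimal sketch of how the two stages could be wired together. It assumes a PyTorch-style interface; the function name and the seg_net / tnet_in / tnet_out / pose_net callables are placeholders for illustration, not the patent's implementation.

```python
# Hypothetical end-to-end wiring of the two subnetworks; seg_net, tnet_in,
# tnet_out and pose_net are placeholder callables, not the patent's modules.
import torch

def estimate_hand_pose(points, seg_net, tnet_in, tnet_out, pose_net):
    """points: (B, N, 3) hand point cloud."""
    semantics = seg_net(points)                        # per-point semantic labels, (B, N, C)
    t_in = tnet_in(points)                             # learned 3x3 input transform, (B, 3, 3)
    aligned = torch.bmm(points, t_in)                  # align the raw point cloud
    enhanced = torch.cat([aligned, semantics], dim=2)  # semantic enhancement by concatenation
    joints = pose_net(enhanced, semantics)             # pose prediction, (B, J, 3)
    t_out = tnet_out(points)                           # learned 3x3 output transform
    return torch.bmm(joints, t_out)                    # geometric transform of the prediction
```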
Specifically, on the basis of the above embodiment, extracting point cloud features of the hand point cloud data, and performing point-by-point classification specifically includes:
constructing a point cloud point-by-point classification subnetwork from a plurality of point cloud abstraction layers, a plurality of point feature propagation layers and a multilayer perceptron, wherein the point cloud abstraction layers sample and group the hand point cloud data, extract point cloud features from the grouped hand point cloud data through a PointNet layer, and pass the point cloud features to the corresponding point feature propagation layers; each point feature propagation layer interpolates the input point cloud features and concatenates and fuses them with the corresponding lower-level per-point features; and the multilayer perceptron generates the point-by-point classification labels from the concatenated and fused per-point features.
In this embodiment, Fig. 3 shows the structure of the point cloud point-by-point classification subnetwork. The structure is based on a PointNet++ network and comprises three point cloud abstraction layers and three point feature propagation layers in total. Each point cloud abstraction layer performs point cloud sampling and grouping, and extracts point cloud features from the grouped points with a PointNet layer. Each point feature propagation layer first interpolates the input point cloud features and then concatenates and fuses them with the corresponding lower-level per-point features. Finally, the network generates the point-by-point classification labels with a multilayer perceptron.
The data structure of a point cloud is a set of point coordinates in three-dimensional space; a point cloud is essentially an n × 3 matrix, where n is the number of points. Geometrically, the order of the points does not affect the overall shape they represent in space; for example, the same point cloud can be represented by two completely different matrices.
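A small illustration of this order-independence, under the assumption that features are aggregated with a symmetric operation such as the max pooling used in PointNet:

```python
# Permuting the rows of the n x 3 matrix leaves the represented shape unchanged,
# so a per-point network must be permutation-invariant. A symmetric aggregation
# (here a coordinate-wise max, as in PointNet) gives the same result either way.
import numpy as np

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))        # n = 1024 points, one (x, y, z) row each
shuffled = cloud[rng.permutation(1024)]       # same points, different row order

print(np.allclose(cloud.max(axis=0), shuffled.max(axis=0)))  # True
```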
In this embodiment, the three-dimensional hand point cloud data are fed directly into the network for training, and the data volume is small. Fig. 2 is a schematic diagram of the network structure of the human hand posture regression method based on point cloud semantic enhancement in the embodiment of the invention. Hand point cloud data (n × 3) containing n points are input, and a 3D spatial transformation learning subnetwork (T-Net) estimates a 3 × 3 input transformation matrix T_in from the raw data; T_in acts on the original hand point cloud data to align the data.
Fig. 3 is a structural diagram of the point cloud point-by-point classification subnetwork. The structure is based on a PointNet++ network; the backbone comprises three point cloud abstraction layers and three point feature propagation layers. Each point cloud abstraction layer performs point cloud sampling and grouping, and extracts features from the grouped points with a PointNet layer. Each point feature propagation layer first interpolates the input features and then concatenates and fuses them with the corresponding lower-level per-point features. Finally, the network generates the point-by-point classification labels with a multilayer perceptron.
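A hedged sketch of this subnetwork is given below. It keeps the encoder-decoder shape (three abstraction stages, three propagation stages, a per-point classification head) but replaces the farthest-point sampling, ball-query grouping and distance-weighted interpolation of real PointNet++ layers with crude stride-2 subsampling and nearest-neighbour upsampling; the class count of 6 and all layer widths are assumptions, not values from the patent.

```python
# Simplified stand-in for the point-wise classification subnetwork: three
# abstraction stages that subsample and encode the cloud, three propagation
# stages that interpolate features back and fuse them with the corresponding
# lower-level per-point features, and a shared head producing per-point labels.
import torch
import torch.nn as nn

def nearest_interpolate(dst_xyz, src_xyz, src_feat):
    """Propagate src features to dst points via the nearest source point."""
    idx = torch.cdist(dst_xyz, src_xyz).argmin(dim=2)            # (B, Nd)
    return torch.gather(src_feat, 1,
                        idx.unsqueeze(-1).expand(-1, -1, src_feat.size(2)))

class SharedMLP(nn.Module):
    def __init__(self, dims):
        super().__init__()
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        self.net = nn.Sequential(*layers)
    def forward(self, x):                                        # (B, N, C) -> (B, N, C')
        return self.net(x)

class SegSubnet(nn.Module):
    def __init__(self, num_classes=6):                           # class count is assumed
        super().__init__()
        self.enc = nn.ModuleList([SharedMLP([3, 64]),
                                  SharedMLP([64, 128]),
                                  SharedMLP([128, 256])])
        self.dec = nn.ModuleList([SharedMLP([256 + 128, 128]),
                                  SharedMLP([128 + 64, 64]),
                                  SharedMLP([64 + 3, 64])])
        self.head = nn.Linear(64, num_classes)                   # point-wise labels

    def forward(self, xyz):                                      # (B, N, 3)
        pts, feats = [xyz], [xyz]
        for mlp in self.enc:                                     # three abstraction stages
            f = mlp(feats[-1])
            pts.append(pts[-1][:, ::2])                          # crude stand-in for FPS
            feats.append(f[:, ::2])
        f = feats[-1]
        for i, mlp in enumerate(self.dec):                       # three propagation stages
            up = nearest_interpolate(pts[-2 - i], pts[-1 - i], f)
            f = mlp(torch.cat([up, feats[-2 - i]], dim=2))       # skip fusion
        return self.head(f)                                      # (B, N, num_classes)

print(SegSubnet()(torch.randn(2, 1024, 3)).shape)                # torch.Size([2, 1024, 6])
```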
Fig. 4 is a structural diagram of the posture regression subnetwork. The structure is based on a PointNet++ network; the backbone comprises three point cloud abstraction layers, and the classification label information obtained by the point-by-point classification network is fused with the posture regression network at both the input layer and the output layer.
On the basis of the above embodiments, performing semantic enhancement on the hand point cloud data based on the semantic segmentation information further includes:
on the basis of a transformation learning subnetwork, the hand point cloud data are used as input, point cloud features are extracted through three PointNet layers, and the input transformation matrix and the output transformation matrix of the point cloud features are learned through three fully connected layers.
As shown in Fig. 5, in this embodiment a transformation learning subnetwork (T-Net) first learns a 3 × 3 input transformation matrix T_in from the input point cloud, and another transformation learning subnetwork (T-Net) learns a 3 × 3 output transformation matrix T_out from the input point cloud.
T-Net is a subnetwork that predicts a feature-space transformation matrix: it learns from the input data a transformation matrix whose size matches the feature-space dimension, and the original data are multiplied by this matrix to transform the input feature space, so that every subsequent point is related to every point in the input data. Through this data fusion, the original point cloud data and their features are progressively abstracted.
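A minimal sketch of such a T-Net under the description above (three shared point-wise layers followed by three fully connected layers that regress a 3 × 3 matrix); the layer widths and the identity initialization of the last layer follow common PointNet practice and are assumptions rather than values given in the patent.

```python
# A minimal T-Net sketch: three shared point-wise layers, max pooling, and three
# fully connected layers regressing a 3x3 transform. The layer widths and the
# identity initialization follow common PointNet practice and are assumptions.
import torch
import torch.nn as nn

class TNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(            # three shared PointNet-style layers
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024), nn.ReLU())
        self.fc = nn.Sequential(                   # three fully connected layers
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 9))
        nn.init.zeros_(self.fc[-1].weight)         # start from the identity transform
        self.fc[-1].bias.data = torch.eye(3).flatten()

    def forward(self, points):                     # points: (B, N, 3)
        feat = self.point_mlp(points).max(dim=1).values   # symmetric global pooling
        return self.fc(feat).view(-1, 3, 3)        # (B, 3, 3) transformation matrix

print(TNet()(torch.randn(2, 1024, 3)).shape)       # torch.Size([2, 3, 3])
```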
On the basis of the above embodiments, performing semantic enhancement on the hand point cloud data based on the semantic segmentation information specifically includes:
multiplying the hand point cloud data by the input transformation matrix to obtain transformed hand point cloud data, and performing first concatenation and fusion on the transformed hand point cloud data and the semantic segmentation information; and extracting the point cloud characteristics of the hand point cloud data after the first concatenation and fusion, and performing second concatenation and fusion on the extracted point cloud characteristics and the semantic segmentation information.
Specifically, using the input transformation matrix T_in, the hand point cloud data are matrix-multiplied by T_in to obtain the transformed hand point cloud data, which are concatenated and fused for the first time with the semantic segmentation information learned by the point cloud point-by-point classification subnetwork; point cloud features are then extracted from the hand point cloud data after this first fusion, and the extracted point cloud features are concatenated and fused a second time with the semantic segmentation information to obtain the hand posture prediction.
Another transformation learning subnetwork learns a 3 × 3 output transformation matrix T_out from the input point cloud, and T_out geometrically transforms the hand posture prediction to obtain the final hand posture prediction result.
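The following is a hedged sketch of this semantic enhancement step: the input transform aligns the cloud, the semantic probabilities are concatenated twice (first with the transformed points, then with the extracted feature), and the predicted joints are mapped through the output transform. The single PointNet-style encoder stands in for the three abstraction layers of the posture regression subnetwork, and the class count of 6 and joint count of 21 are assumptions, not values from the patent.

```python
# Sketch of the semantically enhanced pose regression: apply T_in, concatenate
# the semantic probabilities with the transformed points (first fusion), extract
# a global feature, concatenate the semantics again (second fusion), predict the
# joints, and apply T_out. A single PointNet-style encoder replaces the three
# abstraction layers; 6 classes and 21 joints are assumed values.
import torch
import torch.nn as nn

NUM_CLASSES, NUM_JOINTS = 6, 21

class PoseRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3 + NUM_CLASSES, 128), nn.ReLU(),
                                     nn.Linear(128, 512), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(512 + NUM_CLASSES, 256), nn.ReLU(),
                                  nn.Linear(256, NUM_JOINTS * 3))

    def forward(self, points, semantics, t_in, t_out):
        aligned = torch.bmm(points, t_in)                      # apply the input transform
        fused1 = torch.cat([aligned, semantics], dim=2)        # first concatenation fusion
        global_feat = self.encoder(fused1).max(dim=1).values   # point cloud feature
        sem_summary = semantics.mean(dim=1)                    # pooled semantic cue
        fused2 = torch.cat([global_feat, sem_summary], dim=1)  # second concatenation fusion
        joints = self.head(fused2).view(-1, NUM_JOINTS, 3)     # hand posture prediction
        return torch.bmm(joints, t_out)                        # apply the output transform

out = PoseRegressor()(torch.randn(2, 1024, 3), torch.rand(2, 1024, NUM_CLASSES),
                      torch.eye(3).repeat(2, 1, 1), torch.eye(3).repeat(2, 1, 1))
print(out.shape)                                               # torch.Size([2, 21, 3])
```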
On the basis of the foregoing embodiments, in this embodiment the point cloud point-by-point classification subnetwork and the posture regression subnetwork are further trained, and during training three loss functions are optimized simultaneously: the point-by-point classification loss, the posture regression loss, and the matrix reciprocity loss. The point-by-point classification loss is a cross-entropy loss, the posture regression loss uses a smooth L1 loss, and the matrix reciprocity loss is defined as follows:
L_im = ||T_in T_out - I||^2
the loss function is used to limit the output transformation matrix ToutTransforming a matrix T for inputinThe inverse matrix I is an identity matrix, so that the network can keep the consistency of the geometric transformation of the input data and the output posture, and is insensitive to the geometric transformation of the input data, and the learning difficulty of the network is reduced.
In this embodiment, a human hand posture regression system based on point cloud semantic enhancement is provided based on the human hand posture regression method based on point cloud semantic enhancement of the above embodiments, as shown in fig. 2, including a point cloud point-by-point classification subnetwork and a posture regression subnetwork;
the point cloud point-by-point classification sub-network is used for extracting point cloud characteristics of the hand point cloud data and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
the gesture regression subnetwork is used for performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand gesture prediction result based on the hand point cloud data after the semantic enhancement, and performing geometric transformation on the hand gesture prediction result to obtain a hand gesture regression result.
The point cloud point-by-point classification subnetwork takes the hand point cloud data as input, extracts point cloud features with a PointNet++ network, and finally performs point-by-point classification to obtain the semantic segmentation information of the point cloud. In the posture regression subnetwork, a transformation learning subnetwork (T-Net) learns a 3 × 3 input transformation matrix T_in from the input point cloud; T_in is matrix-multiplied with the point cloud data to obtain the transformed point cloud, which is concatenated and fused with the semantic information learned by the point cloud point-by-point classification subnetwork; point cloud features are then extracted and concatenated and fused again with the semantic segmentation information of the point cloud to predict the hand posture; another transformation learning subnetwork (T-Net) learns a 3 × 3 output transformation matrix T_out from the input point cloud, and T_out geometrically transforms the hand posture estimate to obtain the final hand posture estimation result.
In this embodiment, Fig. 3 shows the structure of the point cloud point-by-point classification subnetwork. The structure is based on a PointNet++ network and comprises three point cloud abstraction layers and three point feature propagation layers in total. Each point cloud abstraction layer performs point cloud sampling and grouping, and extracts point cloud features from the grouped points with a PointNet layer. Each point feature propagation layer first interpolates the input point cloud features and then concatenates and fuses them with the corresponding lower-level per-point features. Finally, the network generates the point-by-point classification labels with a multilayer perceptron.
In this embodiment, Fig. 4 is a structural diagram of the posture regression subnetwork. The structure is based on a PointNet++ network; the backbone comprises three point cloud abstraction layers, and the classification label information obtained by the point cloud point-by-point classification subnetwork is fused with the posture regression network at both the input layer and the output layer. T-Net is a subnetwork that predicts a feature-space transformation matrix: it learns from the input data a transformation matrix whose size matches the feature-space dimension, and the original data are multiplied by this matrix to transform the input feature space, so that every subsequent point is related to every point in the input data. Through this data fusion, the original point cloud data and their features are progressively abstracted.
Fig. 6 is a block diagram illustrating a structure of a human hand posture regression device based on point cloud semantic enhancement according to an embodiment of the present application.
Referring to fig. 6, the human hand posture regression device based on point cloud semantic enhancement includes: a processor (processor)810, a memory (memory)830, a communication interface (communications interface)820, and a bus 840;
wherein:
the processor 810, the memory 830 and the communication interface 820 complete communication with each other through the bus 840;
the communication interface 820 is used for information transmission between the test equipment and the communication equipment of the display device;
the processor 810 is configured to call program instructions in the memory 830 to perform the human hand pose regression method based on point cloud semantic enhancement provided by the above embodiments of the method, for example, including:
extracting point cloud characteristics of the hand point cloud data, and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand posture prediction result based on the hand point cloud data after semantic enhancement, and performing geometric transformation on the hand posture prediction result to obtain a hand posture regression result.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the human hand pose regression method based on point cloud semantic enhancement described above, for example comprising:
extracting point cloud characteristics of the hand point cloud data, and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand posture prediction result based on the hand point cloud data after semantic enhancement, and performing geometric transformation on the hand posture prediction result to obtain a hand posture regression result.
Also provided in this embodiment is a non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the above-described point cloud semantic enhancement-based human hand pose regression method, for example, including:
extracting point cloud characteristics of the hand point cloud data, and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand posture prediction result based on the hand point cloud data after semantic enhancement, and performing geometric transformation on the hand posture prediction result to obtain a hand posture regression result.
In summary, the embodiments of the present invention provide a hand posture regression method and system based on point cloud semantic enhancement. Point cloud features of the hand point cloud data are extracted and classified point by point to obtain semantic segmentation information of the hand point cloud data; the hand point cloud data are semantically enhanced based on the semantic segmentation information; a hand posture prediction result is obtained from the semantically enhanced hand point cloud data and geometrically transformed to obtain the hand posture regression result; and the network learns geometric transformations of both the input data and the output, making the method more robust and improving the regression performance.
The above-described device embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A human hand posture regression method based on point cloud semantic enhancement is characterized by comprising the following steps:
extracting point cloud characteristics of the hand point cloud data, and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand posture prediction result based on the hand point cloud data after semantic enhancement, and performing geometric transformation on the hand posture prediction result to obtain a hand posture regression result;
the semantic enhancement of the hand point cloud data based on the semantic segmentation information further comprises:
on the basis of a transformation learning subnetwork, the hand point cloud data are used as input, point cloud features are extracted through three PointNet layers, and the input transformation matrix and the output transformation matrix of the point cloud features are learned through three fully connected layers.
2. The human hand posture regression method based on point cloud semantic enhancement as claimed in claim 1, wherein the point cloud features of the hand point cloud data are extracted and classified point by point, and the method specifically comprises the following steps:
constructing a point cloud point-by-point classification subnetwork from a plurality of point cloud abstraction layers, a plurality of point feature propagation layers and a multilayer perceptron, wherein the point cloud abstraction layers sample and group the hand point cloud data, extract point cloud features from the grouped hand point cloud data through a PointNet layer, and pass the point cloud features to the corresponding point feature propagation layers; each point feature propagation layer interpolates the input point cloud features and concatenates and fuses them with the corresponding lower-level per-point features; and the multilayer perceptron generates the point-by-point classification labels from the concatenated and fused per-point features.
3. The human hand posture regression method based on point cloud semantic enhancement as claimed in claim 1, wherein the semantic enhancement is performed on the hand point cloud data based on the semantic segmentation information, specifically comprising:
multiplying the hand point cloud data by the input transformation matrix to obtain transformed hand point cloud data, and performing first concatenation and fusion on the transformed hand point cloud data and the semantic segmentation information; and extracting the point cloud characteristics of the hand point cloud data after the first concatenation and fusion, and performing second concatenation and fusion on the extracted point cloud characteristics and the semantic segmentation information.
4. The point cloud semantic enhancement-based human hand posture regression method according to claim 3, wherein the geometric transformation is performed on the hand posture prediction result, and the obtaining of the hand posture regression result specifically comprises:
and performing geometric transformation on the hand posture prediction result based on the output transformation matrix to obtain a hand posture regression result.
5. The human hand posture regression method based on point cloud semantic enhancement as claimed in claim 4, wherein after learning based on three full connection layers to obtain an output transformation matrix of point cloud features, the method further comprises:
and optimizing the input transformation matrix and the output transformation matrix based on a matrix reciprocity loss function so that the output transformation matrix is the inverse matrix of the input transformation matrix.
6. The human hand posture regression method based on point cloud semantic enhancement of claim 5, wherein the matrix reciprocity loss function is:
L_im = ||T_in T_out - I||^2
where T_in is the input transformation matrix, T_out is the output transformation matrix, and I is the identity matrix.
7. A human hand posture regression system based on point cloud semantic enhancement is characterized by comprising a point cloud point-by-point classification sub-network and a posture regression sub-network;
the point cloud point-by-point classification sub-network is used for extracting point cloud characteristics of the hand point cloud data and performing point-by-point classification to obtain semantic segmentation information of the hand point cloud data;
the gesture regression subnetwork is used for performing semantic enhancement on the hand point cloud data based on the semantic segmentation information, obtaining a hand gesture prediction result based on the hand point cloud data after the semantic enhancement, and performing geometric transformation on the hand gesture prediction result to obtain a hand gesture regression result; the semantic enhancement of the hand point cloud data based on the semantic segmentation information further comprises: on the basis of a transformation learning subnetwork, taking the hand point cloud data as input, extracting point cloud features through three PointNet layers, and learning the input transformation matrix and the output transformation matrix of the point cloud features through three fully connected layers.
8. A human hand posture regression device based on point cloud semantic enhancement is characterized by comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 6.
9. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 6.
CN201810758545.7A 2018-07-11 2018-07-11 Human hand posture regression method and system based on point cloud semantic enhancement Active CN109086683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810758545.7A CN109086683B (en) 2018-07-11 2018-07-11 Human hand posture regression method and system based on point cloud semantic enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810758545.7A CN109086683B (en) 2018-07-11 2018-07-11 Human hand posture regression method and system based on point cloud semantic enhancement

Publications (2)

Publication Number Publication Date
CN109086683A CN109086683A (en) 2018-12-25
CN109086683B true CN109086683B (en) 2020-09-15

Family

ID=64837459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810758545.7A Active CN109086683B (en) 2018-07-11 2018-07-11 Human hand posture regression method and system based on point cloud semantic enhancement

Country Status (1)

Country Link
CN (1) CN109086683B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111771141B (en) 2019-01-30 2024-04-09 百度时代网络技术(北京)有限公司 LIDAR positioning for solution inference using 3D CNN network in autonomous vehicles
US11308639B2 (en) * 2019-03-12 2022-04-19 Volvo Car Corporation Tool and method for annotating a human pose in 3D point cloud data
CN110120047B (en) * 2019-04-04 2023-08-08 平安科技(深圳)有限公司 Image segmentation model training method, image segmentation method, device, equipment and medium
CN110059608B (en) * 2019-04-11 2021-07-06 腾讯科技(深圳)有限公司 Object detection method and device, electronic equipment and storage medium
CN111832358A (en) * 2019-04-19 2020-10-27 北京京东叁佰陆拾度电子商务有限公司 Point cloud semantic analysis method and device
CN110262939B (en) * 2019-05-14 2023-07-21 苏宁金融服务(上海)有限公司 Algorithm model operation monitoring method, device, computer equipment and storage medium
CN110135340A (en) * 2019-05-15 2019-08-16 中国科学技术大学 3D hand gestures estimation method based on cloud
CN110210431B (en) * 2019-06-06 2021-05-11 上海黑塞智能科技有限公司 Point cloud semantic labeling and optimization-based point cloud classification method
CN110555412B (en) * 2019-09-05 2023-05-16 深圳龙岗智能视听研究院 End-to-end human body gesture recognition method based on combination of RGB and point cloud
EP4035255A4 (en) 2019-09-23 2023-10-11 Canoo Technologies Inc. Fractional slot electric motors with coil elements having rectangular cross-sections
CN111161364B (en) * 2019-12-24 2022-11-18 东南大学 Real-time shape completion and attitude estimation method for single-view depth map
CN111325757B (en) * 2020-02-18 2022-12-23 西北工业大学 Point cloud identification and segmentation method based on Bayesian neural network
CN111368733B (en) * 2020-03-04 2022-12-06 电子科技大学 Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN111428619B (en) * 2020-03-20 2022-08-05 电子科技大学 Three-dimensional point cloud head attitude estimation system and method based on ordered regression and soft labels
CN112396655B (en) * 2020-11-18 2023-01-03 哈尔滨工程大学 Point cloud data-based ship target 6D pose estimation method
CN113095251B (en) * 2021-04-20 2022-05-27 清华大学深圳国际研究生院 Human body posture estimation method and system
CN113205531B (en) * 2021-04-30 2024-03-08 北京云圣智能科技有限责任公司 Three-dimensional point cloud segmentation method, device and server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060078172A1 (en) * 2004-06-03 2006-04-13 Arizona Board Of Regents, A Body Corporate Of The State Of Arizona 3D face authentication and recognition based on bilateral symmetry analysis
US9934590B1 (en) * 2015-06-25 2018-04-03 The United States Of America As Represented By The Secretary Of The Air Force Tchebichef moment shape descriptor for partial point cloud characterization
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network
CN108268878A (en) * 2016-12-30 2018-07-10 乐视汽车(北京)有限公司 Three-dimensional full convolutional network realizes equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060078172A1 (en) * 2004-06-03 2006-04-13 Arizona Board Of Regents, A Body Corporate Of The State Of Arizona 3D face authentication and recognition based on bilateral symmetry analysis
US9934590B1 (en) * 2015-06-25 2018-04-03 The United States Of America As Represented By The Secretary Of The Air Force Tchebichef moment shape descriptor for partial point cloud characterization
CN108268878A (en) * 2016-12-30 2018-07-10 乐视汽车(北京)有限公司 Three-dimensional full convolutional network realizes equipment
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hand PointNet: 3D Hand Pose Estimation using Point Sets; Liuhao Ge et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-06-23; pp. 8417-8426 *
Hand pose estimation through semi-supervised and weakly-supervised learning; Natalia Neverova et al.; Computer Vision and Image Understanding; 2017-11-30; vol. 164; pp. 56-67 *
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation; Charles R. Qi et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-07-26; pp. 77-85 *

Also Published As

Publication number Publication date
CN109086683A (en) 2018-12-25

Similar Documents

Publication Publication Date Title
CN109086683B (en) Human hand posture regression method and system based on point cloud semantic enhancement
JP6745328B2 (en) Method and apparatus for recovering point cloud data
JP7373554B2 (en) Cross-domain image transformation
WO2021143264A1 (en) Image processing method and apparatus, server and storage medium
CN111739005B (en) Image detection method, device, electronic equipment and storage medium
CN111462324A (en) Online spatiotemporal semantic fusion method and system
WO2022052782A1 (en) Image processing method and related device
CN115841534A (en) Method and device for controlling motion of virtual object
CN111539897A (en) Method and apparatus for generating image conversion model
Su et al. Monocular depth estimation using information exchange network
CN114897039A (en) Data processing method and related equipment
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN116342782A (en) Method and apparatus for generating avatar rendering model
CN115953468A (en) Method, device and equipment for estimating depth and self-movement track and storage medium
Xu Fast modelling algorithm for realistic three-dimensional human face for film and television animation
CN114792355A (en) Virtual image generation method and device, electronic equipment and storage medium
CN117745944A (en) Pre-training model determining method, device, equipment and storage medium
CN117094362A (en) Task processing method and related device
CN110197226B (en) Unsupervised image translation method and system
CN116797850A (en) Class increment image classification method based on knowledge distillation and consistency regularization
CN116402914A (en) Method, device and product for determining stylized image generation model
CN116977547A (en) Three-dimensional face reconstruction method and device, electronic equipment and storage medium
CN113240780B (en) Method and device for generating animation
Bhattacharyya et al. Efficient unsupervised monocular depth estimation using attention guided generative adversarial network
CN114913330A (en) Point cloud component segmentation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant