CN110555412B - End-to-end human body gesture recognition method based on combination of RGB and point cloud - Google Patents


Info

Publication number
CN110555412B
CN110555412B (application CN201910836867.3A)
Authority
CN
China
Prior art keywords
human body
information
point cloud
rgb
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910836867.3A
Other languages
Chinese (zh)
Other versions
CN110555412A (en)
Inventor
张世雄
李楠楠
赵翼飞
李若尘
李革
安欣赏
张伟民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Instritute Of Intelligent Video Audio Technology Longgang Shenzhen
Original Assignee
Instritute Of Intelligent Video Audio Technology Longgang Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Instritute Of Intelligent Video Audio Technology Longgang Shenzhen filed Critical Instritute Of Intelligent Video Audio Technology Longgang Shenzhen
Priority to CN201910836867.3A priority Critical patent/CN110555412B/en
Publication of CN110555412A publication Critical patent/CN110555412A/en
Application granted granted Critical
Publication of CN110555412B publication Critical patent/CN110555412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An end-to-end human body gesture recognition method based on the combination of RGB and point cloud comprises the following steps: 1) preprocessing RGB information and point cloud information; 2) extracting two-dimensional (2D) human skeleton information with a front-end network; and 3) extracting three-dimensional (3D) human skeleton information with a 3D network. The method can effectively extract an accurate 3D model of the human body from data acquired by an RGB-D device. It addresses a series of problems in gesture recognition, namely the strong ambiguity of 2D gestures, insufficient 3D gesture precision, and the scarcity of 3D data sets, which are caused by changes in gesture appearance, the many degrees of freedom of gestures, similar-looking gestures, and self-occlusion.

Description

End-to-end human body gesture recognition method based on combination of RGB and point cloud
Technical Field
The invention relates to a method for recognizing human body key-point gestures with an RGB-D camera, and in particular to an end-to-end human body gesture recognition method based on the combination of RGB and point cloud.
Background
Key-point detection of human body gestures is an important field of computer vision research, and its results feed a series of intelligent applications such as new-generation human-computer interaction, interaction in Virtual Reality (VR) and Augmented Reality (AR), and behavior recognition and analysis. Traditional gesture recognition algorithms generally use wearable acceleration sensors to recognize and detect human gestures; this is costly, cumbersome to wear, and requires the user's active cooperation. Early video-based human gesture detection relied mainly on template matching with hand-crafted features, whose design is complex, whose reliability is low, which is easily disturbed by the environment, and which recognizes complex actions poorly. Meanwhile, interference factors such as camera viewing angle, illumination, and occlusion are common in real scenes, and conventional methods often recognize inaccurately, or fail to recognize at all, in such scenes. As the application of deep learning in computer vision matures, human gesture recognition increasingly adopts deep learning methods. On the hardware side, acquisition devices keep evolving and more three-dimensional (3D) capture devices are available; they compensate well for the shortcomings of two-dimensional (2D) projection, including rotation, occlusion, and similarity of human gestures. Three main schemes are currently on the market: structured light (Structured Light), time of flight (ToF, Time of Flight), and binocular stereo imaging (Stereo System).
All three acquisition schemes can capture point cloud images with depth information. According to the data acquired, human gesture recognition can be divided into recognition with depth data, namely gesture recognition based on three-dimensional (3D) point clouds, and recognition based on ordinary image data (RGB data), namely gesture recognition based on 2D images.
Three-dimensional (3D) point clouds have low accuracy and contain more noise; the data volume grows sharply and, with one more dimension than a two-dimensional image, the computation is complex and heavy. The sparsity of the point cloud must also be considered: voxel-based reconstruction should therefore improve computational efficiency, avoid wasting memory on unoccupied space, raise the reconstruction resolution, and improve the network structure so that more detail can be recovered.
A 2D image contains rich color information, is sharp, and carries more detail; its noise is low and the acquisition devices are mature, but it lacks spatial depth information and easily causes ambiguity. Estimating human pose from a 2D image is therefore somewhat ill-posed: in the traditional 2D-to-3D mapping, one image pose may correspond to several different three-dimensional body poses. From a statistical point of view, the reasonable predictions for an input image form a distribution; reflected in the training set, two body poses that look similar in the image may in fact be quite different.
Disclosure of Invention
The invention provides an end-to-end human body gesture recognition method based on the combination of RGB and point cloud, which can effectively extract an accurate 3D model of the human body from data acquired by an RGB-D device. The method addresses a series of problems in gesture recognition, namely the strong ambiguity of 2D gestures, insufficient 3D gesture precision, and the scarcity of 3D data sets, which are caused by changes in gesture appearance, the many degrees of freedom of gestures, similar-looking gestures, and self-occlusion.
The technical scheme provided by the invention is as follows:
The invention discloses an end-to-end human body gesture recognition method based on the combination of RGB and point cloud, comprising the following steps: step 1): preprocessing the RGB information and the point cloud information; step 2): extracting two-dimensional (2D) skeleton information of the human body using a front-end network; and step 3): extracting three-dimensional (3D) skeleton information of the human body using a three-dimensional (3D) network.
In the end-to-end human body gesture recognition method based on the combination of RGB and point cloud, before preprocessing, an RGB-D camera is used as the signal acquisition input, and the acquired signal is separated into RGB information and point cloud information.
In the method, in step 1), the RGB information and the point cloud information are each subjected to filtering and denoising preprocessing and are then aligned.
In the method, in step 1), a contour feature mapping method is used: with the point cloud image as the coordinate reference, the salient features of each edge are extracted, the feature points are mapped one by one, the offsets {p1, p2, p3, ...} of the feature points are calculated, the average offset p of all the feature points is then calculated, and the RGB image is projected into an affine space for transformation and alignment.
In the method, in step 2), the preprocessed RGB information is input into a pre-trained front network to extract two-dimensional (2D) human skeleton information; the extracted 2D skeleton information (the 2D gesture) and the point cloud information are input together into a point cloud cropping module, where the extracted 2D skeleton information is used to crop the point cloud image and remove useless background information.
In the method, in step 2), the front network adopts a bottom-up human body detection model: a network pre-trained on a large amount of data detects the two-dimensional (2D) key nodes of the human body, that is, it first detects the joint coordinates of all human bodies in an image and then clusters the coordinates to form the key-point coordinates of each body.
In the method, in step 3), the point cloud information and the two-dimensional (2D) human skeleton information are fused and then input simultaneously into a three-dimensional (3D) network; the trained 3D network extracts accurate three-dimensional (3D) human skeleton information from the point cloud information.
In the method, in step 3), the three-dimensional (3D) network adopts a convolutional neural network with three layers in total, where each of the first two layers is followed by a pooling layer, and the output is finally produced through a fully connected layer.
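The coordinate clustering in step 2) can be illustrated with a minimal sketch. The patent does not name a specific clustering algorithm, so the greedy distance-based grouping below (the function name, radius parameter, and centroid rule are all assumptions) only shows the idea of grouping joints detected across the whole image into per-person key-point sets:

```python
import numpy as np

def cluster_joints_to_people(joints, max_person_radius=120.0):
    """Greedily group detected joint coordinates into per-person clusters.

    joints: iterable of (x, y) joint coordinates detected in the image.
    A joint joins the first existing cluster whose centroid lies within
    max_person_radius pixels; otherwise it starts a new person cluster.
    """
    people = []  # each entry is a list of (x, y) joints for one person
    for j in np.asarray(joints, dtype=float):
        for person in people:
            centroid = np.mean(person, axis=0)
            if np.linalg.norm(j - centroid) <= max_person_radius:
                person.append(j)
                break
        else:
            people.append([j])  # no nearby cluster: start a new person
    return [np.array(p) for p in people]
```

In a real bottom-up pipeline the grouping would also use limb-affinity cues rather than distance alone; this sketch captures only the "detect all joints, then cluster per body" structure described in the text.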
Compared with the prior art, the invention has the beneficial effects that:
1. The method designs a unique way of merging two networks, fusing the RGB and RGB-D data streams at the same time. Adopting a mid-pipeline data fusion strategy effectively reduces the problems of early-stage data noise and the excessive data volume of early fusion, providing a novel human body gesture detection method for RGB-D cameras.
2. The model in the recognition method can provide effective, real 3D gesture information of the human body. Previous models could only output 2D skeleton information, whereas this model outputs a 3D skeleton model with real-world coordinate values, provides detailed data on the height and each trunk dimension of the body, and can reach centimeter-level precision when the RGB-D camera is sufficiently precise.
3. The 3D gesture can be estimated from the current frame alone; compared with previous methods that require several consecutive image or video frames to estimate the 3D gesture, this is a great improvement.
Drawings
The invention is further illustrated by way of example with reference to the accompanying drawings in which:
fig. 1 is a flow chart of an end-to-end human body gesture recognition method based on the combination of RGB and point cloud of the present invention.
The specific embodiment is as follows:
according to the invention, the gesture recognition of the 3D point cloud is effectively combined with the gesture recognition based on RGB image data, and a front-back network combined deep learning method is provided, namely, an end-to-end human body gesture recognition method based on the combination of RGB and point cloud is provided, and the advantages of the point cloud image and the RGB image are fused.
The end-to-end human body gesture recognition method based on the combination of RGB and point cloud comprises the following main steps:
1. Preprocess the RGB information and the point cloud information. Before preprocessing, an RGB-D device (RGB-D camera) is used as the signal acquisition input, and the acquired signal is separated into RGB information and point cloud information. The RGB and point cloud information are then each filtered and denoised, and aligned. Specifically, because the RGB information and the point cloud information are collected by different sensors, the images acquired by the two devices cannot coincide completely, and a positional offset p exists between them. The contour feature mapping method first takes the point cloud image as the coordinate reference, extracts the salient features of each edge, and maps the feature points one by one; the offsets {p1, p2, p3, ...} of the feature points are calculated, the average offset p of all the feature points is then calculated, and the RGB image is projected into an affine space for transformation and alignment. The advantage of this method is that the aligned point cloud information and RGB information match in planar spatial information, which facilitates cropping the point cloud and fusing it with the RGB information.
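The alignment step just described can be sketched as follows. This is an illustrative simplification that reduces the affine projection to a pure translation by the average offset p; the function name and array shapes are assumptions, not part of the patent:

```python
import numpy as np

def align_rgb_to_cloud(rgb_pts, cloud_pts):
    """Shift RGB feature coordinates onto the point-cloud frame.

    rgb_pts, cloud_pts: (N, 2) arrays of matched edge-feature
    coordinates (the one-by-one mapped contour features).
    """
    offsets = cloud_pts - rgb_pts   # per-feature offsets {p1, p2, p3, ...}
    p = offsets.mean(axis=0)        # average offset p over all features
    return rgb_pts + p              # RGB coordinates translated by p
```

A fuller version would solve for a 2x3 affine matrix (e.g. by least squares over the matched features) rather than a single translation, which is closer to "projecting RGB into an affine space".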
2. Extract the 2D skeleton information of the human body with the front-end network. In this step, the preprocessed RGB information is input into a pre-trained front network to extract the 2D human skeleton information, and the extracted 2D skeleton information is used to crop the point cloud image and remove useless background information. The invention provides a front-end network for 2D gesture extraction based on a convolutional neural network structure; the activation function used in all network layers is the ReLU function, and the output gesture information comprises 25 skeleton key points, such as the nose, head, and shoulders. The front-end network adopts a bottom-up human body detection model: a network pre-trained on a large amount of data first detects the 2D key nodes, that is, it first detects the joint coordinates of all human bodies in an image and then clusters the coordinates to form the key-point coordinates of each body. The feature extraction network uses a VGG-19 convolutional neural network to extract features, then three 3x3 convolutional layers predict the confidence regions of the 25 nodes, and the 2D human skeleton information, namely the 2D skeleton diagram, is obtained from these confidence regions. RGB information contains rich color and contextual information, the extracted skeleton has high precision, and RGB data is relatively easy to collect, so more training data can be gathered and the trained model becomes more accurate.
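The output stage of the front network, reading one key point per confidence map, can be sketched as follows. This is a hedged illustration: the peak-picking rule, threshold, and map shapes are assumptions, and the real network predicts and clusters joints for all bodies in the image rather than one peak per map:

```python
import numpy as np

NUM_KEYPOINTS = 25  # nose, head, shoulders, ... as listed for the front network

def keypoints_from_confidence_maps(maps, threshold=0.1):
    """Extract one (x, y, score) per skeleton key point.

    maps: (25, H, W) array of per-keypoint confidence maps, as would be
    produced by the 3x3 convolutional head after the VGG-19 backbone.
    Returns (None, None, score) for maps whose peak is below threshold.
    """
    kps = []
    for m in maps:
        idx = np.unravel_index(np.argmax(m), m.shape)  # (row, col) of peak
        score = m[idx]
        kps.append((idx[1], idx[0], score) if score >= threshold
                   else (None, None, score))
    return kps
```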
3. Extract the 3D skeleton information of the human body with the 3D network. Specifically, the point cloud information and the 2D human skeleton information are fused and then input simultaneously into the 3D network; the trained 3D network extracts accurate 3D human skeleton information from the point cloud information. The invention provides a 3D gesture estimation network built as a convolutional neural network with 3D convolution kernels; the network has three layers in total, each of the first two layers is followed by a pooling layer, and the output is finally produced through a fully connected layer. During input, as shown in fig. 1, one stream of data is the 2D human skeleton information output by the front network, namely the 2D gesture, and the other is the cropped human point cloud. The point cloud and the 2D gesture information are normalized so that their values fall in the (-1, 1) interval; the 2D gesture information (X, Y) is then merged layer by layer into the 3D point cloud information (X, Y, Z), where the weight of the skeleton's (X, Y) relative to the (X, Y) in the point cloud is set to 10:1. A 3x3 kernel extracts the confidence region of the 3D skeleton, and the 3D skeleton information is finally output through the fully connected layer.
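The normalization and 10:1 weighting described above can be sketched as follows. The patent does not specify how the weighted (X, Y) channels are merged layer by layer, so this sketch (the function names and the per-channel scaling scheme are assumptions) only prepares the two normalized, weighted inputs that would be fed jointly to the 3D network:

```python
import numpy as np

def normalize_to_unit(a):
    """Rescale each column of `a` into the (-1, 1) interval."""
    lo, hi = a.min(axis=0), a.max(axis=0)
    return 2.0 * (a - lo) / np.maximum(hi - lo, 1e-8) - 1.0

def fuse_inputs(pose2d, cloud, pose_weight=10.0, cloud_weight=1.0):
    """Normalize the 2D gesture (J, 2) and the cropped cloud (N, 3),
    then weight the skeleton (x, y) 10:1 against the cloud's (x, y),
    matching the ratio stated in the description.
    """
    pose_n = normalize_to_unit(pose2d) * pose_weight
    cloud_n = normalize_to_unit(cloud)
    cloud_n[:, :2] *= cloud_weight  # cloud (x, y) kept at weight 1; z untouched
    return pose_n, cloud_n
```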
The advantage of this design is that the point cloud information and the RGB information complement each other effectively: the point cloud has good spatial position information but is sparse, so skeleton information cannot be extracted from it accurately on its own, while the RGB information is rich but lacks spatial position information. A trained network can fuse the two effectively.
The invention provides a network that can accurately extract the 3D gesture information of the human skeleton end to end. In the early stage of the work, the network is first trained on the Human3.6M dataset and then fine-tuned on a real-world human body dataset collected for it. After training, only forward inference is needed when the invention is applied.
The flow chart of the end-to-end human body gesture recognition method based on the combination of RGB and point cloud of the present invention is shown in fig. 1; the specific implementation flow is as follows:
1. firstly, an RGB-D camera is used as acquisition input of signals to acquire RGB-D data;
2. dividing the acquired signals into RGB information and point cloud information;
3. inputting the RGB information and the point cloud information respectively into a preprocessing module for filtering and denoising preprocessing, and performing alignment processing;
4. inputting the preprocessed RGB information into a pre-trained pre-network (Pose-net) to extract 2D skeleton information of the human body, namely 2D gesture;
5. inputting the extracted 2D skeleton information (2D gesture) of the human body and the point cloud information into a point cloud cutting module, cutting the point cloud by using the extracted 2D skeleton information of the human body, and eliminating the useless background information;
6. then, the point cloud information and the human body 2D skeleton information (the 2D gesture) are fused and input simultaneously into the 3D network, and the trained 3D network extracts accurate human body 3D skeleton information from the point cloud: on one hand, this effectively exploits the high accuracy and precise key-point localization of 2D skeleton extraction; on the other hand, the point cloud information imposes an effective geometric constraint on the finally generated 3D human gesture;
7. the 3D network outputs the accurate model of the human body 3D skeleton.
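Step 5 above, cropping the point cloud with the 2D skeleton, can be sketched as follows. This is an illustrative assumption: background removal is approximated by keeping cloud points inside the skeleton's padded 2D bounding box, and the margin value and names are not from the patent:

```python
import numpy as np

def crop_cloud_by_skeleton(cloud_uvz, keypoints, margin=0.15):
    """Discard background points outside the 2D skeleton's region.

    cloud_uvz: (N, 3) points whose first two columns are image-plane
    coordinates aligned to the RGB frame (third column is depth).
    keypoints: (J, 2) detected 2D skeleton key points.
    """
    lo = keypoints.min(axis=0)
    hi = keypoints.max(axis=0)
    pad = (hi - lo) * margin            # expand the box to keep body edges
    lo, hi = lo - pad, hi + pad
    mask = ((cloud_uvz[:, 0] >= lo[0]) & (cloud_uvz[:, 0] <= hi[0]) &
            (cloud_uvz[:, 1] >= lo[1]) & (cloud_uvz[:, 1] <= hi[1]))
    return cloud_uvz[mask]
```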
An RGB-D camera is an acquisition device that can capture both point cloud images and RGB color images. The method adopts an end-to-end deep neural network together with a scheme in which RGB images and point cloud images are fused with each other, overcoming the limitation of previous gesture recognition that relied solely on either RGB images or point cloud images. This human gesture extraction scheme takes both ordinary 2D image features and 3D depth features into account, improves recognition precision, and eliminates the angular ambiguity of single-image gesture recognition.
In summary, the invention provides an effective fully supervised deep learning network model with two levels of extraction: a front network (Pose-net) that extracts the skeletal gesture of the human body, and a 3D network that combines the skeleton information with the 3D gesture information extracted from the point cloud. The proposed deep learning network model can effectively extract an accurate 3D model of the human body from data acquired by an RGB-D device. Unlike conventional 3D conversion models, the 3D information here is real 3D data of the human body; a conventional 3D body converted from 2D is usually obtained by model matching, so the resulting 3D data is not real and becomes ambiguous with the camera angle and the distance to the camera. In view of this, the invention combines 2D-to-3D model conversion with depth point cloud information to obtain an accurate 3D human skeleton.
The above examples are only specific embodiments of the present invention for illustrating the technical solution of the present invention, but not for limiting the scope of the present invention, and although the present invention has been described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that the present invention is not limited thereto: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (6)

1. An end-to-end human body gesture recognition method based on combination of RGB and point cloud is characterized by comprising the following steps:
step 1): preprocessing RGB information and point cloud information;
step 2): extracting two-dimensional (2D) skeleton information of the human body using a front network: inputting the preprocessed RGB information into the pre-trained front network to extract the human body 2D skeleton information, inputting the extracted 2D skeleton information (the 2D gesture) and the point cloud information into a point cloud cropping module, cropping the point cloud image with the extracted 2D skeleton information, and eliminating useless background information; and
step 3): extracting human body three-dimensional (3D) skeleton information by using a three-dimensional (3D) network, merging the point cloud information and the human body two-dimensional (2D) skeleton information, and then inputting the merged point cloud information and the human body two-dimensional (2D) skeleton information into the three-dimensional (3D) network, wherein the trained three-dimensional (3D) network extracts accurate human body three-dimensional (3D) skeleton information from the point cloud information.
2. The end-to-end human body gesture recognition method based on the combination of RGB and point cloud according to claim 1, in step 1), before the preprocessing, the collected signals are first separated into RGB information and point cloud information by using an RGB-D camera as a collection input of the signals.
3. The end-to-end human body gesture recognition method based on the combination of RGB and point cloud according to claim 1, in step 1), the RGB information and the point cloud information are subjected to filtering and denoising preprocessing, and alignment processing is performed.
4. The end-to-end human body gesture recognition method based on the combination of RGB and point cloud as claimed in claim 1, wherein in step 1), a contour feature mapping method is used: with the point cloud image as the coordinate reference, the salient features of each edge are extracted, the feature points are mapped one by one, the offsets {p1, p2, p3, ...} of the feature points are calculated, the average offset p of all the feature points is then calculated, and RGB is projected into an affine space for transformation and alignment.
5. The end-to-end human body gesture recognition method based on the combination of RGB and point cloud as claimed in claim 1, in the step 2), the front network adopts a single human body detection model from bottom to top, and a network pre-trained by a large amount of data firstly detects two-dimensional (2D) key nodes of human bodies, namely, firstly detects joint point coordinates of all human bodies in an image, and then performs coordinate clustering to form key point coordinates corresponding to the human bodies.
6. The end-to-end human body gesture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 3), the three-dimensional (3D) network adopts a convolutional neural network, the convolutional neural network is divided into three layers in total, wherein the first two layers of networks are both connected into a layer pooling layer, and finally output through a fully connected layer.
CN201910836867.3A 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud Active CN110555412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910836867.3A CN110555412B (en) 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910836867.3A CN110555412B (en) 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud

Publications (2)

Publication Number Publication Date
CN110555412A CN110555412A (en) 2019-12-10
CN110555412B true CN110555412B (en) 2023-05-16

Family

ID=68739207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910836867.3A Active CN110555412B (en) 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud

Country Status (1)

Country Link
CN (1) CN110555412B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597974B (en) * 2020-05-14 2023-05-12 哈工大机器人(合肥)国际创新研究院 Monitoring method and system for personnel activities in carriage based on TOF camera
CN111723688B (en) * 2020-06-02 2024-03-12 合肥的卢深视科技有限公司 Human body action recognition result evaluation method and device and electronic equipment
CN111723687A (en) * 2020-06-02 2020-09-29 北京的卢深视科技有限公司 Human body action recognition method and device based on neural network
CN112070835B (en) * 2020-08-21 2024-06-25 达闼机器人股份有限公司 Mechanical arm pose prediction method and device, storage medium and electronic equipment
CN113238650B (en) 2021-04-15 2023-04-07 青岛小鸟看看科技有限公司 Gesture recognition and control method and device and virtual reality equipment
CN112907672B (en) * 2021-05-07 2021-10-08 上海擎朗智能科技有限公司 Robot avoidance method and device, electronic equipment and storage medium
CN114091601B (en) * 2021-11-18 2023-05-05 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
TWI789267B (en) * 2022-03-10 2023-01-01 國立臺中科技大學 Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet
CN114694263B (en) * 2022-05-30 2022-09-02 深圳智华科技发展有限公司 Action recognition method, device, equipment and storage medium
CN115471561A (en) * 2022-11-14 2022-12-13 科大讯飞股份有限公司 Object key point positioning method, cleaning robot control method and related equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086683A (en) * 2018-07-11 2018-12-25 清华大学 A kind of manpower posture homing method and system based on cloud semantically enhancement

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2013106357A (en) * 2013-02-13 2014-08-20 ЭлЭсАй Корпорейшн THREE-DIMENSIONAL TRACKING OF AREA OF INTEREST, BASED ON COMPARISON OF KEY FRAMES
CN104715493B (en) * 2015-03-23 2018-01-19 北京工业大学 A kind of method of movement human Attitude estimation
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN108830150B (en) * 2018-05-07 2019-05-28 山东师范大学 One kind being based on 3 D human body Attitude estimation method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086683A (en) * 2018-07-11 2018-12-25 清华大学 A kind of manpower posture homing method and system based on cloud semantically enhancement

Also Published As

Publication number Publication date
CN110555412A (en) 2019-12-10

Similar Documents

Publication Publication Date Title
CN110555412B (en) End-to-end human body gesture recognition method based on combination of RGB and point cloud
US10109055B2 (en) Multiple hypotheses segmentation-guided 3D object detection and pose estimation
Cui et al. SOF-SLAM: A semantic visual SLAM for dynamic environments
CN107808131B (en) Dynamic gesture recognition method based on dual-channel deep convolutional neural network
Ye et al. Accurate 3d pose estimation from a single depth image
CN109934848B (en) Method for accurately positioning moving object based on deep learning
Martin et al. Real time head model creation and head pose estimation on consumer depth cameras
CN108734194B (en) Virtual reality-oriented single-depth-map-based human body joint point identification method
CN106251399A (en) A kind of outdoor scene three-dimensional rebuilding method based on lsd slam
Kogler et al. Event-based stereo matching approaches for frameless address event stereo data
Medioni et al. Identifying noncooperative subjects at a distance using face images and inferred three-dimensional face models
CN105843386A (en) Virtual fitting system in shopping mall
CN109359514B (en) DeskVR-oriented gesture tracking and recognition combined strategy method
CN110008913A (en) The pedestrian's recognition methods again merged based on Attitude estimation with viewpoint mechanism
CN110852182A (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
WO2010135617A1 (en) Gesture recognition systems and related methods
CN111160291A (en) Human eye detection method based on depth information and CNN
CN112379773B (en) Multi-person three-dimensional motion capturing method, storage medium and electronic equipment
CN106815855A (en) Based on the human body motion tracking method that production and discriminate combine
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN111476089B (en) Pedestrian detection method, system and terminal for multi-mode information fusion in image
CN110334607B (en) Video human interaction behavior identification method and system
CN112668550B (en) Double interaction behavior recognition method based on joint point-depth joint attention RGB modal data
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
Li et al. Deep learning based monocular depth prediction: Datasets, methods and applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant