CN110555412A - End-to-end human body posture identification method based on combination of RGB and point cloud - Google Patents


Info

Publication number
CN110555412A
CN110555412A
Authority
CN
China
Prior art keywords
human body
point cloud
information
rgb
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910836867.3A
Other languages
Chinese (zh)
Other versions
CN110555412B (en)
Inventor
张世雄
李楠楠
赵翼飞
李若尘
李革
安欣赏
张伟民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Longgang Intelligent Audiovisual Research Institute
Original Assignee
Shenzhen Longgang Intelligent Audiovisual Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Longgang Intelligent Audiovisual Research Institute filed Critical Shenzhen Longgang Intelligent Audiovisual Research Institute
Priority to CN201910836867.3A priority Critical patent/CN110555412B/en
Publication of CN110555412A publication Critical patent/CN110555412A/en
Application granted granted Critical
Publication of CN110555412B publication Critical patent/CN110555412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Abstract

An end-to-end human body posture recognition method based on the combination of RGB and point cloud comprises the following steps: 1) preprocessing the RGB information and the point cloud information; 2) extracting two-dimensional (2D) human skeleton information using a front-end network; and 3) extracting three-dimensional (3D) human skeleton information using a 3D network. The method can effectively extract an accurate 3D model of the human body from data acquired by an RGB-D device. It addresses a series of problems in posture recognition, such as variation in posture appearance, the many degrees of freedom of posture, similar postures, self-occlusion, the strong ambiguity of 2D poses, insufficient 3D pose precision, and the scarcity of 3D data sets.

Description

End-to-end human body posture recognition method based on the combination of RGB and point cloud
Technical Field
The invention relates to a method for recognizing human body posture from key points using an RGB-D camera, and in particular to an end-to-end human body posture recognition method based on the combination of RGB and point cloud data.
Background
Detection of human body posture key points is an important area of computer vision research, with results used mainly in a range of intelligent applications such as next-generation human-computer interaction, Virtual Reality (VR) and Augmented Reality (AR) interaction, and behavior recognition and analysis. Traditional posture recognition algorithms generally rely on wearable acceleration sensors to recognize and detect human posture, which is costly, cumbersome to wear, and requires the active cooperation of the subject. Early video-based human posture detection relied mainly on template matching with hand-crafted features; designing such features is complex, their reliability is low, they are easily disturbed by external factors, and they perform poorly on complex actions. Moreover, real scenes generally contain interference factors such as camera viewing angle, illumination, and occlusion, so traditional methods often suffer from low recognition accuracy, or outright failure, in real scenes. As the application of deep learning in computer vision has matured, deep learning methods have been adopted more and more for human posture recognition. At the same time, more and more three-dimensional (3D) acquisition devices are being developed, which can well compensate for the defects of two-dimensional (2D) projection, including the rotation, occlusion, and similarity present in human postures. For RGB-D cameras, three mainstream schemes are currently on the market: 3D structured light, Time-of-Flight (ToF) ranging, and binocular stereo imaging.
All three acquisition schemes produce a point cloud image with depth information. Depending on the data acquired, human posture recognition can be divided into recognition with depth data, i.e. posture recognition based on three-dimensional (3D) point clouds, and recognition based on ordinary image data (i.e. RGB data), i.e. posture recognition based on 2D images.
Three-dimensional (3D) point clouds have low precision and contain considerable noise; the data volume is much larger and the dimensionality higher, with one more dimension than a two-dimensional image, so the computation is complex and the computational load heavy. The sparsity of the 3D point cloud must also be considered: voxel-based reconstruction improves computational efficiency and avoids wasting memory on unoccupied regions, while raising the reconstruction resolution and improving the network structure improves the reconstruction quality and recovers more detail.
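The memory point above — avoiding waste on unoccupied voxels — can be illustrated with a dictionary-backed sparse occupancy grid. This is a generic sketch for illustration, not code from the patent; the `voxel_size` value and the sample points are hypothetical:

```python
def voxelize_sparse(points, voxel_size=0.25):
    """Map each 3D point to its voxel index, storing only occupied voxels.

    A dense grid over the bounding volume would allocate memory for every
    cell; a dict keyed by voxel index stores only cells that contain points,
    which is what makes sparse point clouds cheap to handle.
    """
    grid = {}
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid.setdefault(key, []).append((x, y, z))
    return grid

# Three sample points: two close together, one far away.
cloud = [(0.01, 0.02, 0.03), (0.02, 0.01, 0.04), (1.0, 1.0, 1.0)]
grid = voxelize_sparse(cloud, voxel_size=0.25)
# Only two voxels are stored, however large the surrounding empty volume is.
```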
A 2D image, by contrast, contains rich color information, is sharp, and carries more detail; its noise is low and acquisition devices are mature. However, it lacks spatial depth information, which easily leads to ambiguity. The human posture in a 2D image is inherently under-determined: in a conventional 2D-to-3D mapping, one image posture may correspond to multiple different three-dimensional body postures. From a statistical point of view, the reasonable predictions for an input image form a distribution. Reflected in the training set, two human poses that appear similar in the image may be quite different in actual pose.
Disclosure of Invention
The invention provides an end-to-end human body posture recognition method based on the combination of RGB and point cloud, which can effectively extract an accurate 3D model of the human body from data acquired by an RGB-D device. The method addresses a series of problems in posture recognition, such as variation in posture appearance, the many degrees of freedom of posture, similar postures, self-occlusion, the strong ambiguity of 2D poses, insufficient 3D pose precision, and the scarcity of 3D data sets.
The technical scheme provided by the invention is as follows:
the invention discloses an end-to-end human body posture identification method based on RGB and point cloud combination, which comprises the following steps: step 1): preprocessing the RGB information and the point cloud information; step 2): extracting human two-dimensional (2D) skeleton information using a pre-network; and step 3): three-dimensional (3D) skeleton information of a human body is extracted using a three-dimensional (3D) network.
In the method for recognizing the end-to-end human body posture based on the combination of RGB and point cloud, before preprocessing, an RGB-D camera is used as the acquisition input of signals, and the acquired signals are divided into RGB information and point cloud information.
In the method for recognizing the end-to-end human body posture based on the combination of RGB and point cloud, in the step 1), filtering and denoising preprocessing are respectively carried out on RGB information and point cloud information, and alignment processing is carried out.
In the method for recognizing the end-to-end human body posture based on the combination of RGB and point cloud, in step 1), a contour feature mapping method is used, a point cloud graph is taken as a coordinate reference, each edge salient feature is extracted, the feature points are mapped one by one, the offset { p1, p2, p3, · · · · · · · · · · · · · · · · · · · · } of each feature point is calculated, finally, the average offset p of all the feature points is calculated, and then RGB is projected to an affine space for conversion alignment.
In the method for recognizing the end-to-end human body posture based on the combination of RGB and point cloud, in the step 2), the preprocessed RGB information is input into a pre-network trained in advance to extract human body two-dimensional (2D) skeleton information, the extracted human body two-dimensional (2D) skeleton information (2D posture) and the point cloud information are input into a point cloud cutting module together, the extracted human body two-dimensional (2D) skeleton information is used for cutting a point cloud picture, and useless background information is removed.
In the method for recognizing the end-to-end human body posture based on the combination of RGB and point cloud, in the step 2), a single human body detection model from bottom to top is adopted by a front network, and the network pre-trained by mass data firstly detects two-dimensional (2D) key nodes of a human body, namely, the coordinates of all the joint nodes of the human body in one image are firstly detected, and then coordinate clustering is carried out to form key point coordinates corresponding to the human body.
In the method for recognizing the end-to-end human body posture based on the combination of RGB and point cloud, in the step 3), the point cloud information and the human body two-dimensional (2D) skeleton information are fused and then are simultaneously input into a three-dimensional (3D) network, and the trained three-dimensional (3D) network can extract accurate human body three-dimensional (3D) skeleton information from the point cloud information.
in the method for recognizing the end-to-end human body posture based on the combination of RGB and point cloud, in the step 3), a convolutional neural network is adopted in a three-dimensional (3D) network, the convolutional neural network is divided into three layers in total, the first two layers of networks are both connected to a pooling layer, and finally the three layers of networks are output through a full link layer.
Compared with the prior art, the invention has the beneficial effects that:
1. A unique scheme for fusing two networks is designed: RGB and RGB-D data are fused simultaneously using a mid-level fusion strategy, which effectively avoids the problems of early noise and an excessive volume of early-fused data, providing a novel method for human posture detection with an RGB-D camera.
2. The model in the recognition method provides effective, real 3D posture information of the human body. Unlike previous models that could only output 2D skeleton information, this model outputs a 3D skeleton with real-world coordinate values; it can provide detailed measurements of the height and trunk dimensions of the human body, and can reach centimeter-level accuracy when the RGB-D camera itself is sufficiently accurate.
3. The 3D pose can be estimated from the current frame alone, a great improvement over conventional approaches that estimate the 3D pose only from several consecutive image or video frames.
Drawings
The invention is further illustrated by way of example in the following with reference to the accompanying drawings:
FIG. 1 is a flow chart of the end-to-end human body posture recognition method based on the combination of RGB and point cloud.
Detailed Description
The invention effectively combines posture recognition on 3D point clouds with posture recognition based on RGB image data, providing a deep learning method that joins a front-end and a back-end network, i.e. an end-to-end human body posture recognition method based on the combination of RGB and point cloud, and at the same time uniting the respective advantages of the point cloud image and the RGB image.
The end-to-end human body posture recognition method based on the combination of RGB and point cloud comprises the following main steps:
1. Preprocess the RGB information and the point cloud information. Before preprocessing, an RGB-D device (RGB-D camera) serves as the signal acquisition input, and the acquired signal is divided into RGB information and point cloud information. The RGB information and the point cloud information are then each filtered and denoised, and aligned. Specifically, since the RGB information and the point cloud information are acquired by different sensors, the image information from the two devices does not coincide exactly; there is a positional offset p between them, which the invention corrects with a contour feature mapping method. The method first takes the point cloud image as the coordinate reference, extracts the salient edge features, maps the feature points one to one, calculates the offset of each feature point {p1, p2, p3, …}, then calculates the average offset p of all the feature points, and finally projects the RGB image into an affine space for alignment. The advantage is that the aligned point cloud and RGB information match in planar spatial information, which facilitates cutting the point cloud and fusing it with the RGB information.
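The averaging step of the contour feature mapping can be sketched as follows. This is an illustrative reading of the description, not the patent's implementation: a pure translation by the mean offset stands in for the full affine projection, which the text does not spell out, and the matched feature points are made-up sample data:

```python
def mean_offset(ref_points, rgb_points):
    """Average offset {p1, p2, p3, ...} between matched feature points,
    with the point cloud image as the coordinate reference."""
    offsets = [(rx - px, ry - py)
               for (px, py), (rx, ry) in zip(ref_points, rgb_points)]
    n = len(offsets)
    return (sum(dx for dx, _ in offsets) / n,
            sum(dy for _, dy in offsets) / n)

def align_rgb(rgb_points, offset):
    """Shift RGB coordinates by the mean offset (a translation only; a full
    affine warp would also correct rotation and scale)."""
    dx, dy = offset
    return [(x - dx, y - dy) for x, y in rgb_points]

# Hypothetical matched edge features from the point cloud and RGB images.
cloud_feats = [(10, 10), (20, 20), (30, 10)]
rgb_feats = [(12, 13), (22, 23), (32, 13)]
p = mean_offset(cloud_feats, rgb_feats)   # average offset (2.0, 3.0)
aligned = align_rgb(rgb_feats, p)
```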
2. Extract the 2D human skeleton information using the front-end network. In this step the preprocessed RGB information is input into a pre-trained front-end network to extract the 2D human skeleton information, which is then used to cut the point cloud image and remove useless background information. The invention provides a front-end network for 2D pose extraction based on a convolutional neural network structure; every network layer uses the ReLU activation function, and the output pose information comprises 25 skeleton key points, e.g. the nose, head, and shoulders. The front-end network adopts a bottom-up single-person detection model: the network, pre-trained on a large amount of data, first detects the 2D key nodes of the human body, i.e. the coordinates of all the human joint points in the image, and then clusters the coordinates into the key point coordinates of the corresponding person. Feature extraction uses a VGG-19 convolutional neural network; three 3 × 3 convolutional layers are attached to predict the confidence regions of the 25 joint points, from which the 2D human skeleton information, i.e. the 2D skeleton map, is obtained. The advantage here is that RGB information contains rich color and contextual correlation information, so the extracted skeleton is more precise; moreover, RGB data is relatively easy to collect, so more training data is available and the trained model is more accurate.
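The confidence-region step can be illustrated with a toy heatmap decoder: each joint's confidence map is reduced to the coordinate of its maximum response. The maps below are made-up data standing in for the network's output over the 25 joints; the real maps would come from the VGG-19-based front-end network:

```python
def heatmaps_to_keypoints(heatmaps):
    """Extract one (row, col) coordinate per joint confidence map by argmax."""
    keypoints = []
    for hm in heatmaps:
        best, best_rc = float("-inf"), (0, 0)
        for r, row in enumerate(hm):
            for c, v in enumerate(row):
                if v > best:
                    best, best_rc = v, (r, c)
        keypoints.append(best_rc)
    return keypoints

# Two toy 3x3 confidence maps standing in for 2 of the 25 joint maps.
maps = [
    [[0.0, 0.1, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]],
    [[0.8, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.2]],
]
kps = heatmaps_to_keypoints(maps)  # one 2D keypoint per joint map
```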
3. Extract the 3D human skeleton information using the 3D network. Specifically, the point cloud information and the 2D human skeleton information are fused and input simultaneously into the 3D network; the trained 3D network can extract accurate 3D human skeleton information from the point cloud information. The invention provides a 3D pose estimation network built with 3-dimensional convolution kernels; the convolutional neural network has three layers in total, each of the first two layers is followed by a pooling layer, and the output is produced through a fully connected layer. As shown in FIG. 1, the input consists of two kinds of data: the 2D human skeleton information (the 2D pose) output by the front-end network, and the cut human body point cloud. The point cloud and the 2D pose information are normalized simultaneously so that the values of both are mapped into the (-1, 1) interval; the 2D pose information (X, Y) is then gradually merged into the 3D point cloud information (X, Y, Z), with the weight ratio between the (X, Y) of the 2D skeleton information and the (X, Y) of the point cloud set to 10:1. Confidence regions of the 3D skeleton are extracted with 3 × 3 kernels, and the 3D skeleton information is finally output through the fully connected layer. The point cloud and RGB information are complementary: the point cloud carries good spatial position information but is sparse, so skeleton information cannot be extracted from it accurately, while the RGB information is rich but lacks spatial position information; a trained network can fuse the two effectively.
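The normalization to the (-1, 1) interval and the 10:1 weighted merge of the 2D pose (X, Y) with the point cloud (X, Y) can be sketched as below. The explicit per-point pairing of skeleton joints with cloud points is an assumption for illustration; in the patent the merge happens inside the trained network:

```python
def normalize(values):
    """Linearly rescale a list of scalars to the (-1, 1) interval."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [2.0 * (v - lo) / span - 1.0 for v in values]

def fuse_xy(pose_xy, cloud_xy, w_pose=10.0, w_cloud=1.0):
    """Weighted merge of the 2D-pose (X, Y) with point cloud (X, Y) values,
    using the 10:1 ratio described in the text."""
    total = w_pose + w_cloud
    return [((w_pose * px + w_cloud * cx) / total,
             (w_pose * py + w_cloud * cy) / total)
            for (px, py), (cx, cy) in zip(pose_xy, cloud_xy)]

norm = normalize([0, 5, 10])                      # maps to [-1.0, 0.0, 1.0]
fused = fuse_xy([(1.0, 0.0)], [(0.0, 0.0)])       # pose dominates 10:1
```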
The invention provides a network that can accurately extract the 3D pose information of the human skeleton end to end. In earlier work, training is first performed on the Human3.6M dataset, followed by fine-tuning on a real-world human body dataset collected in-house. Once training is complete, only forward inference is needed in application.
A flow chart of the end-to-end human body posture recognition method based on the combination of RGB and point cloud is shown in FIG. 1; the specific implementation flow is as follows:
1. First, an RGB-D camera is used as the signal acquisition input to acquire RGB-D data;
2. The acquired signal is divided into RGB information and point cloud information;
3. The RGB information and the point cloud information are each input into a preprocessing module for filtering and denoising, and aligned;
4. The preprocessed RGB information is input into a pre-trained front-end network (Pose-net) to extract the 2D human skeleton information, i.e. the 2D pose;
5. The extracted 2D human skeleton information (2D pose) and the point cloud information are input into a point cloud cutting module, which cuts the point cloud with the 2D skeleton information and removes useless background information;
6. The point cloud information and the 2D human skeleton information (2D pose) are then fused and input simultaneously into the 3D network; the trained 3D network can extract accurate 3D human skeleton information from the point cloud, which on the one hand exploits the high extraction accuracy and precise key point localization of the 2D skeleton, and on the other hand uses the point cloud information to place an effective geometric constraint on the final 3D human posture;
7. The 3D network outputs an accurate 3D human skeleton model.
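Step 5 above, cutting the point cloud with the 2D skeleton, can be sketched as a bounding-box filter over the aligned coordinates. The padding parameter and the box heuristic are assumptions for illustration; the patent does not specify the cutting rule:

```python
def crop_cloud(points, skeleton_xy, pad=0.1):
    """Keep point cloud points whose (x, y) lie inside the 2D skeleton's
    bounding box, expanded by `pad`, discarding background points."""
    xs = [x for x, _ in skeleton_xy]
    ys = [y for _, y in skeleton_xy]
    x0, x1 = min(xs) - pad, max(xs) + pad
    y0, y1 = min(ys) - pad, max(ys) + pad
    return [(x, y, z) for x, y, z in points
            if x0 <= x <= x1 and y0 <= y <= y1]

skeleton = [(0.4, 0.2), (0.6, 0.8)]                 # two joints of a 2D pose
cloud = [(0.5, 0.5, 1.2), (2.0, 2.0, 3.0), (0.45, 0.75, 1.1)]
body = crop_cloud(cloud, skeleton)   # the far-away background point is dropped
```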
An RGB-D camera is an acquisition device that can simultaneously capture a point cloud image and an RGB color image. The method adopts an end-to-end deep neural network together with a scheme in which the RGB image and the point cloud image are fused with each other, overcoming the limitation of previous posture recognition that relies solely on either the RGB image or the point cloud image. By taking both the ordinary 2D spatial image features and the deep 3D spatial features into account, the scheme improves recognition precision and eliminates the angular ambiguity of single-image posture recognition.
In summary, the invention provides an effective fully supervised deep learning network model with two levels of extraction: a front-end network (Pose-net) that extracts the skeletal pose of the human body, and a 3D network that combines the skeleton information with the point cloud to extract the 3D pose information. The proposed deep learning network model can effectively extract an accurate 3D model of the human body from data acquired by an RGB-D device. Unlike conventional 3D conversion models, the 3D information here contains the real 3D data of the human body; in conventional 2D-to-3D conversion, the 3D human body is often obtained by model matching, so the resulting 3D data is not real and is ambiguous with respect to the camera angle and the distance to the camera. The invention addresses this problem by combining 2D-3D model conversion with depth point cloud information to obtain an accurate 3D human skeleton.
Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall be covered by it. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. An end-to-end human body posture recognition method based on the combination of RGB and point cloud, characterized by comprising the following steps:
Step 1): preprocessing the RGB information and the point cloud information;
Step 2): extracting two-dimensional (2D) human skeleton information using a front-end network; and
Step 3): extracting three-dimensional (3D) human skeleton information using a three-dimensional (3D) network.
2. The end-to-end human body posture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 1), before the preprocessing, an RGB-D camera is first used as the signal acquisition input, and the acquired signal is divided into RGB information and point cloud information.
3. The end-to-end human body posture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 1), the RGB information and the point cloud information are each subjected to filtering and denoising preprocessing and then aligned.
4. The end-to-end human body posture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 1), a contour feature mapping method is used: with the point cloud image as the coordinate reference, the salient edge features are extracted, the feature points are mapped one to one, the offset of each feature point {p1, p2, p3, …} is calculated, the average offset p of all the feature points is then calculated, and the RGB image is projected into an affine space for alignment.
5. The end-to-end human body posture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 2), the preprocessed RGB information is input into a pre-trained front-end network to extract the two-dimensional (2D) human skeleton information; the extracted 2D skeleton information (the 2D pose) and the point cloud information are input together into a point cloud cutting module, and the 2D skeleton information is used to cut the point cloud image and remove useless background information.
6. The end-to-end human body posture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 2), the front-end network adopts a bottom-up single-person detection model: the network, pre-trained on a large amount of data, first detects the two-dimensional (2D) key nodes of the human body, i.e. detects the coordinates of all the human joint points in an image, and then clusters the coordinates into the key point coordinates of the corresponding person.
7. The end-to-end human body posture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 3), the point cloud information and the two-dimensional (2D) human skeleton information are fused and input into a three-dimensional (3D) network, and the trained 3D network extracts accurate three-dimensional (3D) human skeleton information from the point cloud information.
8. The end-to-end human body posture recognition method based on the combination of RGB and point cloud according to claim 1, wherein in step 3), the three-dimensional (3D) network adopts a convolutional neural network with three layers in total, in which each of the first two layers is followed by a pooling layer, and the output is finally produced through a fully connected layer.
CN201910836867.3A 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud Active CN110555412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910836867.3A CN110555412B (en) 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910836867.3A CN110555412B (en) 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud

Publications (2)

Publication Number Publication Date
CN110555412A true CN110555412A (en) 2019-12-10
CN110555412B CN110555412B (en) 2023-05-16

Family

ID=68739207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910836867.3A Active CN110555412B (en) 2019-09-05 2019-09-05 End-to-end human body gesture recognition method based on combination of RGB and point cloud

Country Status (1)

Country Link
CN (1) CN110555412B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140226854A1 (en) * 2013-02-13 2014-08-14 Lsi Corporation Three-Dimensional Region of Interest Tracking Based on Key Frame Matching
CN104715493A (en) * 2015-03-23 2015-06-17 北京工业大学 Moving body posture estimating method
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN108830150A (en) * 2018-05-07 2018-11-16 山东师范大学 One kind being based on 3 D human body Attitude estimation method and device
CN109086683A (en) * 2018-07-11 2018-12-25 清华大学 A kind of manpower posture homing method and system based on cloud semantically enhancement

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597974B (en) * 2020-05-14 2023-05-12 哈工大机器人(合肥)国际创新研究院 Monitoring method and system for personnel activities in carriage based on TOF camera
CN111597974A (en) * 2020-05-14 2020-08-28 哈工大机器人(合肥)国际创新研究院 Monitoring method and system for personnel activities in carriage based on TOF camera
CN111723687A (en) * 2020-06-02 2020-09-29 北京的卢深视科技有限公司 Human body action recognition method and device based on neural network
CN111723688A (en) * 2020-06-02 2020-09-29 北京的卢深视科技有限公司 Human body action recognition result evaluation method and device and electronic equipment
CN111723688B (en) * 2020-06-02 2024-03-12 合肥的卢深视科技有限公司 Human body action recognition result evaluation method and device and electronic equipment
CN112070835A (en) * 2020-08-21 2020-12-11 达闼机器人有限公司 Mechanical arm pose prediction method and device, storage medium and electronic equipment
US11947729B2 (en) 2021-04-15 2024-04-02 Qingdao Pico Technology Co., Ltd. Gesture recognition method and device, gesture control method and device and virtual reality apparatus
WO2022217828A1 (en) * 2021-04-15 2022-10-20 青岛小鸟看看科技有限公司 Gesture recognition and control method and apparatus, and virtual reality device
CN112907672A (en) * 2021-05-07 2021-06-04 上海擎朗智能科技有限公司 Robot avoidance method and device, electronic equipment and storage medium
CN114091601A (en) * 2021-11-18 2022-02-25 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
CN114091601B (en) * 2021-11-18 2023-05-05 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
TWI789267B (en) * 2022-03-10 2023-01-01 國立臺中科技大學 Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet
CN114694263A (en) * 2022-05-30 2022-07-01 深圳智华科技发展有限公司 Action recognition method, device, equipment and storage medium
CN114694263B (en) * 2022-05-30 2022-09-02 深圳智华科技发展有限公司 Action recognition method, device, equipment and storage medium
CN115471561A (en) * 2022-11-14 2022-12-13 科大讯飞股份有限公司 Object key point positioning method, cleaning robot control method and related equipment

Also Published As

Publication number Publication date
CN110555412B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
CN110555412B (en) End-to-end human body gesture recognition method based on combination of RGB and point cloud
US11703951B1 (en) Gesture recognition systems
US10109055B2 (en) Multiple hypotheses segmentation-guided 3D object detection and pose estimation
CN106648103B (en) A gesture tracking method for a VR headset, and VR headset
CN109934848B (en) Method for accurately positioning moving object based on deep learning
Ye et al. Accurate 3d pose estimation from a single depth image
CN108734194B (en) Virtual reality-oriented single-depth-map-based human body joint point identification method
Boisvert et al. Three-dimensional human shape inference from silhouettes: reconstruction and validation
CN108256504A (en) A three-dimensional dynamic gesture recognition method based on deep learning
CN105759967B (en) A global hand pose detection method based on depth data
CN110852182A (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN103839277A (en) Mobile augmented reality registration method of outdoor wide-range natural scene
CN111160291A (en) Human eye detection method based on depth information and CNN
CN111476089B (en) Pedestrian detection method, system and terminal for multi-mode information fusion in image
CN106815855A (en) A human body motion tracking method combining generative and discriminative models
CN112668550B (en) Double interaction behavior recognition method based on joint point-depth joint attention RGB modal data
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN109670401A (en) An action recognition method based on skeleton motion graphs
Li et al. Deep learning based monocular depth prediction: Datasets, methods and applications
Chai et al. Human gait recognition: approaches, datasets and challenges
Xu et al. 3D joints estimation of the human body in single-frame point cloud
Swadzba et al. Tracking objects in 6D for reconstructing static scenes
Chang et al. Multi-view 3d human pose estimation with self-supervised learning
CN116129051A (en) Three-dimensional human body posture estimation method and system based on graph and attention interleaving
CN108694348B (en) Tracking registration method and device based on natural features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant