CN111597976A

CN111597976A - Multi-person three-dimensional attitude estimation method based on RGBD camera

Info

Publication number: CN111597976A
Application number: CN202010408082.9A
Authority: CN
Inventors: 秦昊; 李冬平; 杨颢
Original assignee: Hangzhou Faceunity Technology Co ltd
Current assignee: Hangzhou Faceunity Technology Co ltd
Priority date: 2020-05-14
Filing date: 2020-05-14
Publication date: 2020-08-28

Abstract

The invention discloses a multi-person three-dimensional attitude estimation method based on RGBD cameras, which comprises the steps of firstly, training a real human body data set to obtain a deep convolutional network supporting human body position detection and semantic segmentation; then, a virtually synthesized human body depth map-three-dimensional characteristic point data set is constructed, and a depth convolution network capable of estimating human body joint points from the depth map is obtained through training of the data set; and finally, the user inputs RGBD pictures or videos to obtain world coordinates of all human body three-dimensional joint points. The invention provides a robust algorithm for recovering multi-person three-dimensional postures from a single RGBD camera; in the network pre-training stage, only the pre-labeled RGB pictures are needed, and the depth map can be automatically obtained by using a virtual synthesis method, so that the pre-training has low requirement on data labeling; in the actual operation stage, single-frame attitude estimation and multi-frame attitude estimation are considered at the same time, and accurate and stable multi-person three-dimensional attitudes can be output.

Description

Multi-person three-dimensional attitude estimation method based on RGBD camera

Technical Field

The invention belongs to the technical field of machine vision and deep learning, and particularly relates to a multi-person three-dimensional attitude estimation method based on an RGBD camera.

Background

The purpose of the human body posture estimation is to obtain the coordinates of human body joint points from an input image, so that the information of the human body joint direction, rotation and the like can be analyzed. People can further consider the information of the time sequence, observe the position change condition of the human body joint points in a period of time and carry out semantic understanding at a more abstract level, thereby realizing complex tasks such as action recognition, tracking, prediction and the like. The human body posture estimation is widely applied and is often applied to the fields of games, entertainment, security, medical rehabilitation and the like. By using the result of human body posture estimation, people can feel the interest of the motion sensing game and man-machine interaction without any motion sensing equipment; a movie manufacturer can drive the animation model without additional auxiliary equipment to complete convenient action sequence generation; the children do not need to worry about the accident that the old people at home fall down, but the old people cannot timely give out information to miss precious time for medical treatment and treatment.

The problem of human posture estimation has been studied for a long time. Most of the early methods identify each part of the human body for matching on the basis of geometric prior, so as to calculate the human body posture. In recent years, with the rapid development of deep learning, the convolutional neural network has made a breakthrough in many fields of computer vision direction, such as object classification, object detection, semantic segmentation, and other tasks. People also get a great breakthrough in the field of human posture estimation by using a deep learning method. Many methods for estimating the human body posture based on the convolutional neural network are proposed, such as deep pose, Stacked Hourglass Networks, OpenPose, and the like. Compared with the traditional visual methods, the methods are usually trained by a large amount of data, and rich prior information contained in the data can be utilized, so that the precision and the stability are greatly improved.

Since most of the presently disclosed data sets are input as color RGB images, most of the present research on pose estimation is limited to estimation of only two-dimensional joint points. However, the two-dimensional joint has a great limitation in the application field. For example, the two-dimensional joint points are difficult to calculate translation and rotation information of each joint of a human body, and cannot be competent for a plurality of scenes related to three-dimensional correlation. Therefore, there is a very urgent need for the research of three-dimensional human posture estimation.

It is very difficult to obtain an estimate of the three-dimensional joint points from a conventional RGB image. The birth of the depth camera provides a new idea for solving the problem, and the depth camera can obtain the depth value of the object, so that the distance information of the object is perceived. In 2009, the depth camera Kinect popular in the first public was introduced by microsoft, and has functions of dynamically capturing human body posture and the like. Kinect has assisted corresponding Xbox360 game platform, has further expanded the operational mode of recreation, has fully shown the interactive concept of man-machine. In 2017, as the first mobile phone IPhone X equipped with a depth camera was released, the integration of the depth camera into the mobile phone will slowly become a trend. Therefore, the human body posture estimation method based on the RGBD camera has better convenience and popularity.

Disclosure of Invention

The invention aims to provide a multi-person three-dimensional posture estimation method based on an RGBD camera, aiming at the defects of the prior art. The method is used for solving the problem that the user automatically obtains the coordinates of the human body joint points from the RGBD picture input.

The purpose of the invention is realized by the following technical scheme: a multi-person three-dimensional attitude estimation method based on RGBD cameras comprises the following steps:

(1) pre-training a human body detection segmentation network: training to obtain a deep convolutional network supporting human body position detection and semantic segmentation according to a real human body RGB picture data set and corresponding labeling information;

(2) pre-training a three-dimensional human body posture estimation network: a three-dimensional human body posture estimation network of a depth convolution network capable of estimating human body joint points from a depth map is obtained by constructing a synthesized human body depth map-three-dimensional characteristic point data set and then training through the data set;

(3) the actual use process of the user is as follows: when an RGBD picture to be processed is input, operating the human body detection segmentation network obtained by training in the step (1) and extracting a corresponding depth map, and operating the three-dimensional human body posture estimation network obtained by training in the step (2) to estimate three-dimensional joint points of a human body to obtain world coordinates of all the human body three-dimensional joint points; when a continuous video scene is input, the prediction result of the world coordinates of the three-dimensional joint points of the human body is improved by using the correlation of multi-frame image information and using a Bayesian method and an exponential smoothing mode.

Further, the step (1) is specifically: according to the input picture and the corresponding labeling information, a deep convolution network supporting human body position detection and semantic segmentation is trained, wherein the input is an RGB picture, and the output is a bounding box of the position of the human body and a human body region mask.

Further, the multitask deep convolution network is composed of three sub-networks, specifically: the first sub-network is a characteristic pyramid network, and through inputting RGB pictures, multi-level and multi-scale convolution related operation is carried out to extract abstract characteristics of the pictures; the second sub-network is a regional candidate network, the abstract features output by the first sub-network are input, and a candidate frame of the human body position is generated through convolution correlation operation; the third sub-network is a full convolution neural network, abstract features in the human body position candidate box output by the second sub-network are input, and a human body region mask is generated through convolution correlation operation.

Further, the step (2) includes the sub-steps of:

(2.1) constructing a synthesized human body depth map-three-dimensional feature point data set specifically comprises the following steps: automatically synthesizing a plurality of three-dimensional human body models, binding the three-dimensional human body models with human body action skeleton data, obtaining the three-dimensional human body models with different human body actions through skin covering operation, and finally drawing a depth map of all the three-dimensional human body models to obtain a human body depth map-three-dimensional characteristic point data set;

(2.2) training through a human body depth map-three-dimensional feature point data set to obtain a depth convolution network capable of estimating human body joint points from a depth map specifically as follows: training a three-dimensional human body posture estimation network according to the labeling information of the human body depth map-three-dimensional feature point data set, inputting a single-channel depth picture, and outputting an xy heat map and a z distance response map which comprise human body three-dimensional joint points; the basic structure of the three-dimensional human body posture estimation network is a stacked hourglass type network, and features are repeatedly extracted by a convolution module through operations of down sampling and up sampling for multiple times, and finally two output graphs are output.

Further, the step (3) includes two cases:

(3.1) when an RGBD picture to be processed is input, operating the human body detection segmentation network obtained by training in the step (1) and extracting a corresponding depth map, and then operating the three-dimensional human body posture estimation network obtained by training in the step (2) to estimate three-dimensional joint points of the human body to obtain coordinates of all the human body three-dimensional joint points, wherein the specific steps are as follows: a user inputs a single RGBD image, firstly, RGB images in the single RGBD image are extracted, and the human body detection segmentation network is operated to obtain the position of a human body and a human body segmentation mask; estimating local coordinates of the three-dimensional joint points corresponding to each person by using the extracted single depth image, and obtaining world coordinates of the three-dimensional joint points of the human body in the image according to camera parameters and the association among the local coordinates;

and (3.2) when a continuous video scene is input, obtaining the coordinates of the human body three-dimensional joint points in the previous frame according to the step (3.1), then constructing prior probability distribution of the coordinates of the human body three-dimensional joint points of the current frame according to the coordinates of the human body three-dimensional joint points of the previous frame, optimizing an xy heat map of the human body three-dimensional joint points of the current frame by using a Bayesian formula, smoothly optimizing a z distance response map of the human body three-dimensional joint points of the current frame by using indexes, and finally obtaining the optimized coordinates of the human body three-dimensional joint points of the current frame.

The invention has the beneficial effects that: the invention provides a robust algorithm for recovering multi-person three-dimensional postures from a single RGBD camera; in the network pre-training stage, only pre-labeled RGB pictures (which can be easily obtained in a public data set) are needed, and a depth map can be automatically obtained by using a virtual synthesis method, so that the pre-training has low requirement on data labeling; in the actual operation stage, single-frame attitude estimation and multi-frame attitude estimation are considered at the same time, and accurate and stable multi-person three-dimensional attitudes can be output.

Drawings

FIG. 1 is a flow chart of a multi-person three-dimensional attitude estimation method based on an RGBD camera;

FIG. 2 is a schematic diagram of a virtually synthesized three-dimensional mannequin with various motions;

FIG. 3 is a schematic diagram of a stacked hourglass network configuration;

FIG. 4 is a schematic diagram of a four-step hourglass module configuration;

FIG. 5 is a schematic diagram of an attitude estimation network output; wherein (a) is a depth input map, (b) is an xy heat map, and (c) is a z-distance response map;

FIG. 6 is a visualization of the operational flow from inputting RGBD pictures to outputting three-dimensional poses;

FIG. 7 is a graph of results for an embodiment of the present invention; the method comprises the following steps of (a) inputting an RGB image schematic diagram and a human body bounding box, (b) superposing a two-dimensional posture estimation result schematic diagram on an input depth image, and (c) outputting a three-dimensional posture skeleton result schematic diagram.

Detailed Description

The real human body RGB picture data set adopted by the embodiment is an open data set COCO (http:// codataset. org/# home), which is widely used in image detection and segmentation and comprises more than 25 million RGB pictures and corresponding human body detection and segmentation labeling information; the human body detection and segmentation marking information is a human body detection frame and a segmentation mask image. The action skeleton data set adopted by the embodiment is from a CMU Mocap database (http:// Mocap. cs. CMU. edu /), can also be supplemented by self, has a stored format of bvh, and comprises about 2000 action sequences and 31 human body joint points; the action sequence comprises common actions of walking, jumping, climbing, running, basketball, football, boxing and the like.

The invention discloses a multi-person three-dimensional attitude estimation method based on an RGBD camera, which comprises the following steps as shown in figure 1:

(1) pre-training a human body detection segmentation network: training a multitask depth convolution network supporting human body position detection and semantic segmentation according to a real human body RGB picture data set to obtain a human body detection segmentation network, wherein the input is an RGB picture, and the output is a bounding box [ x ] of the position of a human body_min,y_min,x_max,y_max]And a body region mask. The human body region mask is a binary image, the value of each pixel point represents the probability of belonging to a human body or a background, the pixel value of the human body region is 1, and the pixel value of the background region is 0; (x)_min,y_min)、(x_max,y_max) The coordinates of the upper left corner and the lower right corner of the position of the human body are respectively. The concept of bounding boxes and area masks has been widely used in the task of pattern detection and segmentation.

The multitask depth convolution network supporting human body position detection and semantic segmentation is composed of three sub-networks, and specifically comprises the following steps:

(1.1) the first sub-network is a Feature Pyramid Network (FPN), and by inputting the RGB picture, multi-level and multi-scale convolution related operation is carried out, abstract features of the RGB picture are extracted, and a Feature map is obtained.

The pyramid feature extraction Network uses a standard Residual Neural Network (ResNet) as an integral Network backbone and comprises 5 standard Residual down-sampling modules; each module comprising a down-sampling layer L_i(i is 1-5), the step length is 2, and the ResNet of the feature extraction backbone network can feel the scale range of 1, 1/2, 1/4, 1/8 and 1/16; assuming that the input image resolution is 512 × 512, after 5 times of downsampling, the input image resolution is reduced, and then the image size corresponding to the minimum feature scale is 32 × 32. In order to further enhance the feature extraction capability of the network, a fusion branch is added after each down-sampling layer, wherein the fusion branch comprises an up-sampling layer with the step size of 2 and a convolution layer with the step size of 1x1, and each fusion branch is used for connecting the corresponding down-sampling layer L_iAfter the output feature is amplified by 2 times of resolution, the feature is compared with a down-sampling layer L by using a standard neural network layer concat operation_i-1The output features are spliced, different features are effectively fused, the network expression capacity is improved, and a final feature map is obtained.

(1.2) the second sub-network is a Region candidate network (RPN), the input is the feature map extracted by the first sub-network in step (1.1), and a candidate frame of the human body position is generated through convolution correlation operation, specifically:

(1.2.1) setting 5 fixed frames for each pixel point in the input feature map, the centers of the fixed frames being set on each pixel point of the feature map and having a change in aspect ratio, similarly to the classic target detection method, fast-RCNN; the length-width ratios of the 5 fixed frames are 1:1, 1:2, 2:1, 1:3 and 3:1 respectively.

(1.2.2) carrying out convolution operation twice on the input feature map, and extracting the intermediate layer features of the feature map by using convolution of 3x3 for the first time; processing the middle layer characteristics by two groups of 1 × 1 convolutions with the output channel number being twice and four times of the number of the fixed frames respectively for the second time, wherein the convolution output characteristics with the output channel number being twice of the number of the fixed frames are fixed frame score values belonging to probability p of a human body and fixed frame score values belonging to probability 1-p of a background, and the convolution output characteristics with the output channel number being four times of the number of the fixed frames are fixed frame correction values [ delta x, delta y, delta w, delta h ] for correcting the position and the size of the fixed frames; where Δ x is a correction value for the abscissa of the center of the frame, Δ y is a correction value for the ordinate of the center of the frame, Δ w is a correction value for the width of the frame, and Δ h is a correction value for the height of the frame.

(1.2.3) using a fixed frame with probability p > 0.5 belonging to a human body as a candidate frame, and merging the candidate frames by adopting a standard non-maximum suppression (NMS) algorithm, wherein the specific steps are as follows: firstly, sorting the candidate frames from large to small according to the probability p of belonging to the human body; then, the intersection ratio IoU of the candidate frame with the maximum p value and other candidate frames is calculated, and the candidate frame IoU is deleted>The candidate frames of 0.7 traverse the rest of the candidate frames from large to small according to the p value, and repeat the operations until all the candidate frames are processed, so as to obtain a plurality of non-overlapping candidate frames; finally, obtaining the bounding box of the position of each human body according to the reserved candidate frames

Wherein the superscript (i) denotes the ith detected human,

the left boundary coordinate, the right boundary coordinate, the upper boundary coordinate and the lower boundary coordinate in the human body candidate frame are respectively.

(1.3) the third sub-network is a full Convolutional neural network (FCN), the input is a feature map abstract feature in a bounding box at the position of the human body output by the second sub-network, and a human body region mask is generated through convolution correlation operation, specifically: firstly, intercepting an area of a bounding box of the position of a human body in a feature map to obtain a feature subgraph, and adjusting the feature subgraph to be the input size of a full convolution neural network in a bilinear interpolation mode; then, carrying out semantic segmentation on the characteristic subgraph by using a full convolution neural network to obtain a human body region mask; the full convolution neural network firstly carries out full convolution layer processing of 3x3,256 channels twice, each full convolution layer is followed by a nonlinear transformation Relu layer, then an up-sampling layer of 2 times is carried out once, and finally a probability graph belonging to a human body and a probability graph not belonging to the human body are output through a 1x1, 2-channel full convolution layer, when the probability that a pixel point belongs to the human body is more than or equal to the probability not belonging to the human body, the pixel value of the point is 1, otherwise, the pixel value is 0, and finally a human body region mask is obtained. In this embodiment, the input resolution of the fully convolutional neural network is 14 × 14, and the output resolution is 28 × 28.

(2) Pre-training a three-dimensional human body posture estimation network: and constructing a synthesized human body depth map-three-dimensional characteristic point data set, and then training a depth convolution network capable of estimating human body joint points from the human body depth map through the data set to obtain a three-dimensional human body posture estimation network.

(2.1) constructing a human body depth map-three-dimensional feature point data set:

(2.1.1) first, three-dimensional character modeling software Maya is used to generate three-dimensional human body models containing different genders, different ages, different sizes, different decorations (clothes, hairstyles, hats, etc.).

(2.1.2) obtaining different body posture data including 31 body joint points using the motion skeleton data set; binding and skinning the three-dimensional human body model and the action skeleton data set, and matching with an action sequence in the action skeleton data set to obtain a three-dimensional human body model (figure 2) with different human body actions;

(2.1.3) randomly generating camera parameters comprising camera height, camera horizontal angle and human body rendering position, rendering the three-dimensional human body model obtained in the step (2.1.2), drawing human body depth maps of all the three-dimensional human body models, and obtaining a human body depth map-three-dimensional characteristic point data set corresponding to 31 human body joint point coordinate positions; the height range of the camera is 1.6-1 m, the horizontal angle range of the camera is 0-15 degrees, and the human body rendering position range is 2-6 m. The total data amount of the human body depth map-three-dimensional feature point data set in this embodiment is 300000 pieces.

(2.2) training a deep convolution network for estimating human body joint points, which specifically comprises the following steps: and (3) training a three-dimensional human body posture estimation network according to the human body depth map-three-dimensional feature point data set obtained in the step (2.1), wherein the three-dimensional human body posture estimation network is input into a single-channel human body depth map and output into an xy heat map and a z distance response map of 62 channels of human body three-dimensional joint points corresponding to each human body joint point. The network is similar to the existing standard RGB image human body 2D joint point heat map prediction network method, except that the RGB image only has 2D supervisory information, so that only an xy heat map is output, and the depth map simultaneously contains 2D and 3D supervisory information, so that the CNN can be trained to output a plurality of channel information containing the xy heat map and a z-distance response map.

As shown in fig. 3, the basic structure of the three-dimensional body posture estimation network is a stacked hourglass network (stackedHourglass Networks) formed by transversely splicing and stacking a plurality of four-order hourglass modules, as shown in fig. 3, intermediate supervision is arranged among the four-order hourglass modules, an xy heat map and a z distance response map of each key point are respectively subjected to L2 loss functions, and a total loss function is obtained by summing, so as to supervise effective convergence of the four-order hourglass modules; as shown in fig. 4, the four-step hourglass module adopts a residual module as a basic structural unit, and a first-step hourglass module is constructed on the basis of the residual module; the first order hourglass module is split into two branches: the first branch is characteristic extraction on the original scale of the human body depth image and is composed of a residual error module; the second branch adopts the strategy of down-sampling and up-sampling, firstly down-sampling to 1/2 of the original scale in a maximum pooling mode, then connecting three residual modules, and up-sampling 2 times to the original scale in a nearest neighbor interpolation mode to extract features; and adding the features extracted by the two branches to obtain the output of the first-order hourglass module. On the basis of the first-order hourglass module, replacing a second residual error module of a second branch of the first-order hourglass module with the first-order hourglass module to obtain a second-order hourglass module, and performing feature extraction on the original scale, the 1/2 scale and the 1/4 scale by the second-order hourglass module; similarly nesting to obtain a fourth-order hourglass module, performing feature extraction on the original scale, 1/2 scale, 1/4 scale, 1/8 scale and 1/16 scale, wherein the input scale and the output scale of the hourglass module are the same, and outputting xy heat maps and z distance response maps of the three-dimensional joint points of the human body; further, a multi-step hourglass module can be obtained, which can gradually improve the accuracy of the output. Compared with a convolution-correction linear activation function module or a convolution-batch standardization-correction linear activation function module which is commonly used in a convolution neural network, the residual module is additionally provided with a bypass addition structure on the basis of a plurality of convolution modules.

As shown in fig. 5, the xy heat map represents a probability estimate of the presence of a joint point at each location of the body depth map; the z-distance response graph is an estimated value of predicting a z-direction distance for each position of the human body depth graph, wherein the z direction refers to the depth direction, and the z distance refers to a value in the depth direction; the position with the maximum probability estimation of the joint point in the xy heat map is the joint point predicted by the model, and the depth of the joint point is the value of the corresponding position of the z-distance response map. Although the human depth map is known, the image acquired by the depth camera in real time operation cannot be directly used to obtain the z-distance value of the joint point due to factors such as occlusion and noise.

(3) In practical use, RGBD pictures or videos can be input, including two cases:

and (3.1) when the RGBD picture is input, operating the human body detection and segmentation network trained in the step (1) to obtain a bounding box at the position of the human body and a human body region mask, extracting a single depth map, inputting the three-dimensional human body posture estimation network trained in the step (2), obtaining xy heat maps and z distance response maps of all human body three-dimensional joint points, and finally calculating the world coordinates of the human body three-dimensional joint points.

(3.1.1) operating the human body detection segmentation network and extracting a depth map, specifically: inputting a single RGBD picture to be processed, operating a human body detection and segmentation network, and predicting a bounding box at the position of an ith (i is 1-N) human body in an RGB channel according to the RGBD picture

And corresponding body region mask

Wherein N is the number of detected human bodies. Then, extracting bounding boxes of the position of the ith personal body in a D channel depth map of the RGBD picture by using an image segmentation Graphcut algorithm

And corresponding body region mask

Each human body part in the depth map is obtained. The main purpose of this step is to eliminate the interference of background and other human bodies on the prediction of human joint points.

(3.1.2) operating a three-dimensional human body posture estimation network to estimate human body three-dimensional joint points, which specifically comprises the following steps: according to the bounding box of the position of the ith person in the depth map obtained in the step (3.1.1)

And corresponding body region mask

Extracting a single person depth image in an RGBD image D channel, inputting the single person depth image into the three-dimensional human body posture estimation network trained in the step (2), and predicting to obtain an xy heat map of the kth joint point of the ith human body

And z distance response map

(3.1.3) calculating the relative coordinates of the human body three-dimensional joint point of the kth joint point of the ith human body by the following formula

Wherein the content of the first and second substances,

representing the probability that the pixel point (x, y) is the kth joint point of the ith individual,

is to make

The maximum coordinate.

(3.1.4) calculating the screen coordinates of the human body three-dimensional joint point of the kth joint point of the ith human body by the following formula

The calculation procedure of step (3.1.4) is denoted as (x ', y') ═ f (x, y).

(3.1.5) calculating a proportional relation scale of the xy coordinate and the z coordinate according to the internal reference and the geometric relation of the RGBD camera by the following formula:

where w and h denote the width and height of the RGBD picture, fov_xRepresenting the horizontal projection angle of the camera, fov_yRepresenting the camera vertical projection angle.

(3.1.6) obtaining world coordinates of the human body three-dimensional joint point of the kth joint point of the ith human body with the RGBD camera as the origin by the following formula

(3.2) when the RGBD video is input, further optimization is needed on the basis of the step (3.1), the correlation among multiple frames of pictures in the video is utilized, and the Bayesian method and the exponential smoothing mode are used for improving the prediction result of the world coordinate of the human body three-dimensional joint point

The accuracy and robustness of the overall prediction are improved, and the flow is shown in fig. 6:

(3.2.1) calculating the relative coordinates of the human body three-dimensional joint points of the ith human body k-th joint point in the t-1 frame (previous frame) by the following formula

(3.2.2) obtaining the screen coordinates of the human body three-dimensional joint points of the ith human body kth joint point in the t-1 th frame according to the calculation process of the step (3.1.3)

(3.2.3) Screen coordinates according to t-1 th frame

Prior probability distribution of corresponding screen coordinates when constructing t frame (current frame)

The probability distribution is the mean value

Variance of

Neglecting the constants in front of the two-dimensional gaussian distribution:

wherein, a is the size of the variance distribution range, the resolution of the xy heat map in this embodiment is 64 × 64, and a is 2; definition of

As coordinates of the screen

The reliability of (2).

The larger the confidence level of the prediction result, the higher the reliability of the result of the previous frame

The larger the variance Σ of the prior probabilities, the more probability the current frame result is concentrated at the predicted result of the previous frame.

(3.2.4) obtaining probability chart of posterior distribution by using Bayesian formula

Optimization

(3.2.5) calculating the relative xy-direction coordinates of the current frame by the following equation

Comprises the following steps:

the present invention also takes into account the information of the previous frame for the estimation of the z-direction distance. Because the bounding boxes of the positions of the human bodies of the previous frame and the current frame are different, and the partial position of the current frame cannot obtain the corresponding z-distance response image information, the z-distance response image of the previous frame needs to be expanded by adopting a nearest neighbor interpolation mode. The fluctuation of the distance prediction in the z direction is relieved by adopting an exponential smoothing mode; exponential smoothing is actually a special weighted moving average method, which gives unequal weights to the values of different frames and increases the weights of the next few frames, so that the predicted values can quickly reflect new changes.

(3.2.6) obtaining the optimized current frame by the following calculationz distance response map

Wherein the content of the first and second substances,

a z-distance response graph representing the ith individual's kth joint point at frame t-1,

a z-distance response graph representing the ith individual's kth joint point in the tth frame; setting up

This setting takes into account the reliability of the prediction result of the previous frame, and can dynamically control the change of the weight, and select an appropriate weight of 0.8 for smoothing.

(3.2.7) calculating the z-direction distance result of the current frame of the ith individual's k-th joint point by the following formula

(3.2.8) obtaining the world coordinates of the ith individual joint point in the ith frame according to the relative coordinates obtained in the steps (3.2.5) and (3.2.7) and the steps (3.1.4) - (3.1.6).

The specific result of this embodiment is shown in fig. 7, and it can be seen from fig. 7 that, given an input RGBD image, the method of the present invention can effectively calculate bounding boxes of human body regions of different people, extract a human depth map in each bounding box, and further calculate the three-dimensional joint point position of each person, and as can be seen from the rendering result of the joint points, the obtained three-dimensional joint point position matches the position of a real human body.

Claims

1. A multi-person three-dimensional attitude estimation method based on an RGBD camera is characterized by comprising the following steps:

(1) pre-training a human body detection segmentation network: and training to obtain a deep convolutional network supporting human body position detection and semantic segmentation according to the real human body RGB picture data set and the corresponding labeling information.

(2) Pre-training a three-dimensional human body posture estimation network: a three-dimensional human body posture estimation network of a depth convolution network capable of estimating human body joint points from a depth map is obtained by constructing a synthesized human body depth map-three-dimensional characteristic point data set and then training through the data set.

2. The method for estimating the three-dimensional pose of the plurality of people based on the RGBD camera according to claim 1, wherein the step (1) is specifically as follows: according to the input picture and the corresponding labeling information, a deep convolution network supporting human body position detection and semantic segmentation is trained, wherein the input is an RGB picture, and the output is a bounding box at the position of a human body, a human body region mask and the like.

3. The RGBD-camera-based multi-person three-dimensional pose estimation method according to claim 2, wherein the multitask depth convolution network is composed of three sub-networks, specifically: the first sub-network is a characteristic pyramid network, and through inputting RGB pictures, multi-level and multi-scale convolution related operation is carried out to extract abstract characteristics of the pictures; the second sub-network is a regional candidate network, the abstract features output by the first sub-network are input, and a candidate frame of the human body position is generated through convolution correlation operation; the third sub-network is a full convolution neural network, abstract features in the human body position candidate box output by the second sub-network are input, and a human body region mask is generated through convolution correlation operation.

4. The RGBD camera-based multi-person three-dimensional pose estimation method of claim 3, wherein the step (2) comprises the sub-steps of:

5. The RGBD-camera-based multi-person three-dimensional pose estimation method according to claim 4, wherein the step (3) comprises two cases: