CN111597976A - Multi-person three-dimensional attitude estimation method based on RGBD camera - Google Patents


Info

Publication number
CN111597976A
CN111597976A
Authority
CN
China
Prior art keywords
human body
dimensional
network
joint points
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010408082.9A
Other languages
Chinese (zh)
Inventor
秦昊
李冬平
杨颢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Faceunity Technology Co ltd
Original Assignee
Hangzhou Faceunity Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Faceunity Technology Co ltd filed Critical Hangzhou Faceunity Technology Co ltd
Priority to CN202010408082.9A priority Critical patent/CN111597976A/en
Publication of CN111597976A publication Critical patent/CN111597976A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-person three-dimensional pose estimation method based on an RGBD camera. First, a deep convolutional network supporting human body position detection and semantic segmentation is trained on a real human body data set; next, a virtually synthesized human depth map-three-dimensional feature point data set is constructed and used to train a deep convolutional network that estimates human joint points from depth maps; finally, the user inputs RGBD pictures or videos and obtains the world coordinates of all human three-dimensional joint points. The invention provides a robust algorithm for recovering multi-person three-dimensional poses from a single RGBD camera. In the network pre-training stage, only pre-labeled RGB pictures are needed and the depth maps are obtained automatically by virtual synthesis, so pre-training places low demands on data labeling. In the actual operation stage, single-frame and multi-frame pose estimation are both handled, and accurate, stable multi-person three-dimensional poses can be output.

Description

Multi-person three-dimensional pose estimation method based on an RGBD camera
Technical Field
The invention belongs to the technical field of machine vision and deep learning, and particularly relates to a multi-person three-dimensional pose estimation method based on an RGBD camera.
Background
The purpose of human body pose estimation is to obtain the coordinates of human joint points from an input image, from which information such as joint orientation and rotation can be analyzed. Taking temporal information into account as well, one can observe how the joint positions change over a period of time and perform semantic understanding at a more abstract level, enabling complex tasks such as action recognition, tracking, and prediction. Human pose estimation is widely applied, particularly in fields such as games, entertainment, security, and medical rehabilitation. Using pose estimation results, people can enjoy motion-sensing games and human-computer interaction without any motion-sensing equipment; film producers can drive animated models without additional auxiliary equipment, generating action sequences conveniently; and families need not worry that an elderly person at home may fall and be unable to call for help in time, missing the precious window for medical treatment.
The problem of human pose estimation has been studied for a long time. Most early methods identified the individual parts of the human body and matched them on the basis of geometric priors, thereby computing the human pose. In recent years, with the rapid development of deep learning, convolutional neural networks have made breakthroughs in many computer vision tasks, such as object classification, object detection, and semantic segmentation. Deep learning has likewise brought great progress to human pose estimation, and many convolutional-network-based methods have been proposed, such as DeepPose, Stacked Hourglass Networks, and OpenPose. Compared with traditional vision methods, these methods are usually trained on large amounts of data and can exploit the rich prior information the data contains, greatly improving accuracy and stability.
Since most currently public data sets take color RGB images as input, most existing pose estimation research is limited to estimating two-dimensional joint points. However, two-dimensional joints are severely limited in application: for example, it is difficult to compute the translation and rotation of each human joint from them, so they cannot handle the many scenarios that involve three-dimensional reasoning. Research on three-dimensional human pose estimation is therefore urgently needed.
Obtaining three-dimensional joint estimates from a conventional RGB image is very difficult. The advent of the depth camera offers a new approach: a depth camera measures depth values and thus perceives object distance. In 2009 Microsoft introduced the Kinect, the first widely popular consumer depth camera, with functions such as dynamic human pose capture. Paired with the Xbox 360 game platform, the Kinect expanded how games are played and fully demonstrated the concept of human-computer interaction. In 2017, with the release of the iPhone X, the first mobile phone equipped with a depth camera, integrating depth cameras into phones began to become a trend. A human pose estimation method based on an RGBD camera therefore offers good convenience and broad reach.
Disclosure of Invention
The invention aims to provide a multi-person three-dimensional pose estimation method based on an RGBD camera, addressing the defects of the prior art. The method solves the problem of automatically obtaining human joint point coordinates from RGBD picture input.
The purpose of the invention is realized by the following technical scheme: a multi-person three-dimensional pose estimation method based on an RGBD camera, comprising the following steps:
(1) pre-training a human body detection segmentation network: training to obtain a deep convolutional network supporting human body position detection and semantic segmentation according to a real human body RGB picture data set and corresponding labeling information;
(2) pre-training a three-dimensional human pose estimation network: constructing a synthesized human depth map-three-dimensional feature point data set and training on it a deep convolutional network that estimates human joint points from a depth map, yielding the three-dimensional human pose estimation network;
(3) actual use by the user: when an RGBD picture to be processed is input, running the human body detection segmentation network trained in step (1), extracting the corresponding depth map, and running the three-dimensional human pose estimation network trained in step (2) to estimate the human three-dimensional joint points, obtaining the world coordinates of all human three-dimensional joint points; when a continuous video scene is input, improving the predicted world coordinates of the human three-dimensional joint points by exploiting the correlation of multi-frame image information, using a Bayesian method and exponential smoothing.
Further, the step (1) is specifically: according to the input picture and the corresponding labeling information, a deep convolution network supporting human body position detection and semantic segmentation is trained, wherein the input is an RGB picture, and the output is a bounding box of the position of the human body and a human body region mask.
Further, the multitask deep convolutional network consists of three sub-networks, specifically: the first sub-network is a feature pyramid network that takes an RGB picture as input and performs multi-level, multi-scale convolution operations to extract abstract features of the picture; the second sub-network is a region proposal network that takes the abstract features output by the first sub-network as input and generates candidate boxes for human body positions through convolution operations; the third sub-network is a fully convolutional neural network that takes the abstract features inside the human body position candidate boxes output by the second sub-network as input and generates a human body region mask through convolution operations.
Further, the step (2) includes the sub-steps of:
(2.1) constructing a synthesized human depth map-three-dimensional feature point data set, specifically: automatically synthesizing a number of three-dimensional human body models, binding them to human action skeleton data, obtaining three-dimensional human body models in different actions through skinning operations, and finally drawing the depth maps of all the three-dimensional human body models to obtain the human depth map-three-dimensional feature point data set;
(2.2) training, on the human depth map-three-dimensional feature point data set, a deep convolutional network that estimates human joint points from a depth map, specifically: training a three-dimensional human pose estimation network on the labeling information of the data set, where the input is a single-channel depth picture and the outputs are an xy heat map and a z-distance response map of the human three-dimensional joint points; the basic structure of the three-dimensional human pose estimation network is a stacked hourglass network, whose convolution modules repeatedly extract features through multiple down-sampling and up-sampling operations and finally produce the two output maps.
Further, the step (3) includes two cases:
(3.1) when an RGBD picture to be processed is input, running the human body detection segmentation network trained in step (1), extracting the corresponding depth map, and then running the three-dimensional human pose estimation network trained in step (2) to estimate the human three-dimensional joint points, obtaining the coordinates of all human three-dimensional joint points, specifically: the user inputs a single RGBD picture; first, the RGB picture in it is extracted and the human body detection segmentation network is run to obtain the human body positions and human body segmentation masks; then, local coordinates of the three-dimensional joint points of each person are estimated from the extracted single-person depth images, and the world coordinates of the human three-dimensional joint points in the picture are obtained from the camera parameters and the relation between the local coordinates;
(3.2) when a continuous video scene is input, obtaining the coordinates of the human three-dimensional joint points in the previous frame according to step (3.1), then constructing a prior probability distribution of the current frame's human three-dimensional joint point coordinates from the previous frame's coordinates, optimizing the current frame's xy heat map of the human three-dimensional joint points with the Bayesian formula, optimizing the current frame's z-distance response map by exponential smoothing, and finally obtaining the optimized coordinates of the current frame's human three-dimensional joint points.
The invention has the following beneficial effects: it provides a robust algorithm for recovering multi-person three-dimensional poses from a single RGBD camera. In the network pre-training stage, only pre-labeled RGB pictures are needed (these are readily available in public data sets) and the depth maps are obtained automatically by virtual synthesis, so pre-training places low demands on data labeling. In the actual operation stage, single-frame and multi-frame pose estimation are both handled, and accurate, stable multi-person three-dimensional poses can be output.
Drawings
FIG. 1 is a flow chart of the multi-person three-dimensional pose estimation method based on an RGBD camera;
FIG. 2 is a schematic diagram of virtually synthesized three-dimensional human models with various actions;
FIG. 3 is a schematic diagram of the stacked hourglass network structure;
FIG. 4 is a schematic diagram of the fourth-order hourglass module structure;
FIG. 5 is a schematic diagram of the pose estimation network output, where (a) is the depth input map, (b) the xy heat map, and (c) the z-distance response map;
FIG. 6 is a visualization of the operational flow from inputting RGBD pictures to outputting three-dimensional poses;
FIG. 7 shows results for an embodiment of the invention, where (a) is the input RGB picture with human bounding boxes, (b) the two-dimensional pose estimation result superposed on the input depth map, and (c) the output three-dimensional pose skeleton result.
Detailed Description
The real human RGB picture data set adopted by this embodiment is the public data set COCO (http://cocodataset.org/#home), which is widely used in image detection and segmentation and comprises more than 250,000 RGB pictures with corresponding human detection and segmentation annotations; the annotations are human detection boxes and segmentation mask images. The action skeleton data set adopted by this embodiment comes from the CMU Mocap database (http://mocap.cs.cmu.edu/) and can be supplemented with additional captures; it is stored in bvh format and comprises about 2000 action sequences over 31 human joint points, covering common actions such as walking, jumping, climbing, running, basketball, football, and boxing.
The invention discloses a multi-person three-dimensional pose estimation method based on an RGBD camera, comprising the following steps, as shown in FIG. 1:
(1) pre-training a human body detection segmentation network: train a multitask deep convolutional network supporting human body position detection and semantic segmentation on a real human RGB picture data set, obtaining the human body detection segmentation network; the input is an RGB picture, and the outputs are a bounding box $[x_{\min}, y_{\min}, x_{\max}, y_{\max}]$ of the human body position and a human body region mask. The human body region mask is a binary image in which each pixel value represents the probability of belonging to a human body or to the background: pixels in the human body region have value 1 and pixels in the background region have value 0. $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$ are the coordinates of the upper-left and lower-right corners of the human body position, respectively. The concepts of bounding boxes and region masks are widely used in detection and segmentation tasks.
The multitask deep convolutional network supporting human body position detection and semantic segmentation consists of the following three sub-networks:
(1.1) the first sub-network is a Feature Pyramid Network (FPN): given an input RGB picture, it performs multi-level, multi-scale convolution operations and extracts the abstract features of the RGB picture, producing a feature map.
The pyramid feature extraction network uses a standard Residual Neural Network (ResNet) as the overall backbone and comprises 5 standard residual down-sampling modules. Each module contains a down-sampling layer $L_i$ ($i = 1 \sim 5$) with stride 2, so the feature extraction backbone perceives the scale range 1, 1/2, 1/4, 1/8, 1/16; assuming an input image resolution of 512 × 512, the image size corresponding to the minimum feature scale after down-sampling is 32 × 32. To further strengthen the network's feature extraction ability, a fusion branch is added after each down-sampling layer, comprising an up-sampling layer with stride 2 and a 1x1 convolution layer; each fusion branch magnifies the features output by the corresponding down-sampling layer $L_i$ to twice the resolution and then splices them with the features output by down-sampling layer $L_{i-1}$ using the standard neural network concat operation. This effectively fuses different features, improves the network's expressive capacity, and yields the final feature map.
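A fusion branch of this kind fits in a few lines; the following PyTorch-style module is an illustrative sketch of the description above (the module and channel names are assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFusionBranch(nn.Module):
    """Fuses a coarse feature map L_i with the next finer map L_{i-1}:
    2x up-sampling, a 1x1 convolution, then a channel-wise concat."""
    def __init__(self, coarse_channels, fine_channels):
        super().__init__()
        # 1x1 convolution applied to the up-sampled coarse features
        self.lateral = nn.Conv2d(coarse_channels, fine_channels, kernel_size=1)

    def forward(self, coarse, fine):
        up = F.interpolate(coarse, scale_factor=2, mode="nearest")  # stride-2 up-sampling
        up = self.lateral(up)
        return torch.cat([up, fine], dim=1)  # standard concat fusion
```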
(1.2) the second sub-network is a Region Proposal Network (RPN); its input is the feature map extracted by the first sub-network in step (1.1), and it generates candidate frames for human body positions through convolution operations, specifically:
(1.2.1) following the classic target detection method Faster-RCNN, 5 fixed frames (anchors) are set for each pixel of the input feature map; the fixed frames are centred on each feature map pixel and vary in aspect ratio, with ratios 1:1, 1:2, 2:1, 1:3, and 3:1.
(1.2.2) two rounds of convolution are applied to the input feature map. The first uses a 3x3 convolution to extract intermediate-layer features. The second processes those intermediate features with two groups of 1x1 convolutions whose output channel counts are twice and four times the number of fixed frames, respectively: the output with twice the channel count gives each fixed frame's score, i.e. the probability p of belonging to a human body and the probability 1-p of belonging to the background; the output with four times the channel count gives each fixed frame's correction values [Δx, Δy, Δw, Δh] for adjusting its position and size, where Δx and Δy correct the abscissa and ordinate of the frame centre, and Δw and Δh correct the frame width and height.
(1.2.3) fixed frames with probability p > 0.5 of belonging to a human body are taken as candidate frames, and the candidate frames are merged with the standard non-maximum suppression (NMS) algorithm, specifically: first, sort the candidate frames by the probability p of belonging to a human body, from large to small; then compute the intersection-over-union IoU between the candidate frame with the largest p and every other candidate frame, and delete the candidate frames with IoU > 0.7; traverse the remaining candidate frames in decreasing order of p and repeat these operations until all candidate frames are processed, obtaining several non-overlapping candidate frames; finally, from the retained candidate frames obtain the bounding box $[x_{\min}^{(i)}, x_{\max}^{(i)}, y_{\min}^{(i)}, y_{\max}^{(i)}]$ of each human body position, where the superscript $(i)$ denotes the $i$-th detected human and $x_{\min}^{(i)}$, $x_{\max}^{(i)}$, $y_{\min}^{(i)}$, $y_{\max}^{(i)}$ are the left, right, upper, and lower boundary coordinates of the human body candidate frame, respectively.
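The merging procedure above is standard greedy NMS; a plain-NumPy sketch of it, assuming boxes are stored as [x_min, y_min, x_max, y_max] rows:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes ([xmin, ymin, xmax, ymax])."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, p_thresh=0.5, iou_thresh=0.7):
    """Keep frames with p > p_thresh, then greedily drop overlaps with IoU > iou_thresh."""
    keep_mask = scores > p_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)          # sort by probability p, descending
    kept = []
    while order.size > 0:
        best = order[0]                  # candidate with the largest p
        kept.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return boxes[kept], scores[kept]
```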
(1.3) the third sub-network is a Fully Convolutional Network (FCN); its input is the abstract features of the feature map inside the bounding box of the human body position output by the second sub-network, and it generates the human body region mask through convolution operations, specifically: first, crop the region of the human body bounding box from the feature map to obtain a feature sub-map, and resize it to the input size of the fully convolutional network by bilinear interpolation; then, semantically segment the feature sub-map with the fully convolutional network to obtain the human body region mask. The fully convolutional network first applies two 3x3, 256-channel convolution layers, each followed by a nonlinear ReLU layer, then one 2x up-sampling layer, and finally a 1x1, 2-channel convolution layer that outputs a probability map of belonging to a human body and a probability map of not belonging; when a pixel's probability of belonging to a human body is greater than or equal to its probability of not belonging, the pixel value is 1, otherwise 0, yielding the final human body region mask. In this embodiment, the input resolution of the fully convolutional network is 14 × 14 and the output resolution is 28 × 28.
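A minimal sketch of this mask head, assuming a PyTorch implementation and that channel 1 holds the human-body probability (both are assumptions, not stated in the patent):

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Two 3x3/256 conv+ReLU layers, one 2x up-sampling, then a 1x1
    2-channel conv giving per-pixel human / background scores."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 2, 1),   # assumed: channel 0 background, channel 1 human
        )

    def forward(self, roi_features):        # (N, 256, 14, 14) feature sub-maps
        logits = self.net(roi_features)     # (N, 2, 28, 28)
        # pixel is human (1) when its human score >= its background score
        return (logits[:, 1] >= logits[:, 0]).float()
```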
(2) Pre-training a three-dimensional human pose estimation network: construct a synthesized human depth map-three-dimensional feature point data set, then train on it a deep convolutional network that estimates human joint points from human depth maps, obtaining the three-dimensional human pose estimation network.
(2.1) constructing a human body depth map-three-dimensional feature point data set:
(2.1.1) first, the three-dimensional character modeling software Maya is used to generate three-dimensional human body models of different genders, ages, body sizes, and decorations (clothes, hairstyles, hats, etc.).
(2.1.2) different human pose data comprising 31 human joint points are obtained from the action skeleton data set; the three-dimensional human body models are bound and skinned to the action skeletons and matched with the action sequences in the data set, yielding three-dimensional human body models in different actions (FIG. 2);
(2.1.3) camera parameters, comprising camera height, camera horizontal angle, and human rendering position, are generated randomly; the three-dimensional human body models obtained in step (2.1.2) are rendered and the human depth maps of all models drawn, giving a human depth map-three-dimensional feature point data set with the corresponding coordinate positions of the 31 human joint points. The camera height ranges over 1 to 1.6 m, the camera horizontal angle over 0 to 15 degrees, and the human rendering position over 2 to 6 m. The total size of the human depth map-three-dimensional feature point data set in this embodiment is 300,000 samples.
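The random camera parameters can be drawn by uniform sampling over the stated ranges; a small sketch (the renderer call itself is left abstract, and the 1 to 1.6 m height range is the reading adopted above):

```python
import random

def sample_render_params():
    """Sample one synthetic-camera configuration in the ranges used above."""
    return {
        "camera_height_m": random.uniform(1.0, 1.6),    # camera height
        "camera_angle_deg": random.uniform(0.0, 15.0),  # camera horizontal angle
        "subject_distance_m": random.uniform(2.0, 6.0), # human rendering position
    }
```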
(2.2) training the deep convolutional network that estimates human joint points, specifically: train the three-dimensional human pose estimation network on the human depth map-three-dimensional feature point data set obtained in step (2.1); the input is a single-channel human depth map, and the output is 62 channels comprising an xy heat map and a z-distance response map for each of the 31 human joint points. The network resembles existing standard heat-map prediction networks for human 2D joint points on RGB images, except that an RGB image carries only 2D supervisory information and therefore only an xy heat map is output, whereas a depth map carries both 2D and 3D supervisory information, so the CNN can be trained to output the multiple channels comprising both the xy heat maps and the z-distance response maps.
As shown in FIG. 3, the basic structure of the three-dimensional human pose estimation network is a Stacked Hourglass Network formed by splicing and stacking several fourth-order hourglass modules transversely. Intermediate supervision is applied between the fourth-order hourglass modules: an L2 loss is computed on the xy heat map and z-distance response map of each key point and summed into a total loss, supervising the effective convergence of each module. As shown in FIG. 4, the fourth-order hourglass module adopts the residual module as its basic structural unit, and a first-order hourglass module is constructed on this basis. The first-order hourglass module splits into two branches: the first branch performs feature extraction at the original scale of the human depth image and consists of one residual module; the second branch adopts a down-sample-then-up-sample strategy, first down-sampling to 1/2 of the original scale by max pooling, then passing through three residual modules, and finally up-sampling back to the original scale by nearest-neighbour interpolation. The features extracted by the two branches are added to give the output of the first-order hourglass module. On this basis, replacing the second residual module of the second branch with a first-order hourglass module yields a second-order hourglass module, which extracts features at the original, 1/2, and 1/4 scales; nesting in the same way yields the fourth-order hourglass module, which extracts features at the original, 1/2, 1/4, 1/8, and 1/16 scales. The input and output scales of an hourglass module are the same, and the outputs are the xy heat maps and z-distance response maps of the human three-dimensional joint points; stacking further modules gradually improves the accuracy of the output. Compared with the convolution-ReLU or convolution-batch-normalization-ReLU modules commonly used in convolutional neural networks, the residual module adds a bypass addition structure on top of several convolution modules.
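Because each higher-order module is obtained by substituting an hourglass for the middle residual module of the lower branch, the structure is naturally recursive; a compact PyTorch sketch under that reading (the internal layout of the residual block is an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    """Convolution blocks with a bypass (skip) addition, as described above."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
    def forward(self, x):
        return F.relu(self.body(x) + x)

class Hourglass(nn.Module):
    """Order-n hourglass: an order-1 module whose middle residual block is
    replaced, recursively, by an order-(n-1) hourglass."""
    def __init__(self, order, ch):
        super().__init__()
        self.skip = Residual(ch)        # branch 1: original scale
        self.down1 = Residual(ch)       # branch 2: first of three modules
        self.inner = Hourglass(order - 1, ch) if order > 1 else Residual(ch)
        self.down2 = Residual(ch)       # branch 2: third of three modules
    def forward(self, x):
        lower = F.max_pool2d(x, 2)      # down-sample to 1/2 scale (max pooling)
        lower = self.down2(self.inner(self.down1(lower)))
        lower = F.interpolate(lower, scale_factor=2, mode="nearest")  # back up
        return self.skip(x) + lower     # add the two branches
```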
As shown in FIG. 5, the xy heat map gives, for each position of the human depth map, a probability estimate that a joint point is present there; the z-distance response map gives, for each position, an estimate of the z-direction distance, where the z direction is the depth direction and the z distance is the value along it. The position with the maximum joint probability in the xy heat map is the joint point predicted by the model, and the joint's depth is the value at the corresponding position of the z-distance response map. Even though the human depth map is known, images acquired by a depth camera in real operation cannot be used directly to read off a joint's z-distance value, because of factors such as occlusion and noise.
(3) In practical use, RGBD pictures or videos can be input, including two cases:
(3.1) when an RGBD picture is input, run the human body detection and segmentation network trained in step (1) to obtain the bounding box of each human body position and the human body region mask, extract the single-person depth maps, input them to the three-dimensional human pose estimation network trained in step (2) to obtain the xy heat maps and z-distance response maps of all human three-dimensional joint points, and finally compute the world coordinates of the human three-dimensional joint points.
(3.1.1) running the human body detection segmentation network and extracting depth maps, specifically: input the single RGBD picture to be processed and run the human body detection segmentation network, predicting from the RGB channels the bounding box of the $i$-th ($i = 1 \sim N$) human body position, denoted $B_{rgb}^{(i)}$, and the corresponding human body region mask $M_{rgb}^{(i)}$, where $N$ is the number of detected human bodies. Then, using the image segmentation GraphCut algorithm, extract from the D-channel depth map of the RGBD picture the bounding box $B_d^{(i)}$ of the $i$-th human body position and the corresponding human body region mask $M_d^{(i)}$, obtaining each human body part of the depth map. The main purpose of this step is to eliminate the interference of the background and of other human bodies on the prediction of the human joint points.
(3.1.2) running the three-dimensional human pose estimation network to estimate the human three-dimensional joint points, specifically: according to the bounding box $B_d^{(i)}$ of the $i$-th person's position in the depth map and the corresponding human body region mask $M_d^{(i)}$ obtained in step (3.1.1), extract the single-person depth image from the D channel of the RGBD picture and input it to the three-dimensional human pose estimation network trained in step (2), predicting the xy heat map $H_{xy}^{(i,k)}$ and the z-distance response map $H_{z}^{(i,k)}$ of the $k$-th joint point of the $i$-th human body.
(3.1.3) compute the relative coordinates $(x^{(i,k)}, y^{(i,k)}, z^{(i,k)})$ of the $k$-th joint point of the $i$-th human body by:

$$(x^{(i,k)}, y^{(i,k)}) = \arg\max_{(x,y)} H_{xy}^{(i,k)}(x, y)$$

$$z^{(i,k)} = H_{z}^{(i,k)}(x^{(i,k)}, y^{(i,k)})$$

where $H_{xy}^{(i,k)}(x, y)$ represents the probability that pixel $(x, y)$ is the $k$-th joint point of the $i$-th person, so $(x^{(i,k)}, y^{(i,k)})$ is the coordinate that maximizes $H_{xy}^{(i,k)}(x, y)$.
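Reading a joint off the two maps then reduces to an argmax followed by a lookup; a NumPy sketch of step (3.1.3), which also returns the heat-map confidence that is reused later as the reliability term in step (3.2.3):

```python
import numpy as np

def read_joint(heatmap_xy, response_z):
    """heatmap_xy, response_z: (H, W) maps for one joint of one person.
    Returns the relative coordinates (x, y, z) and the heat-map confidence."""
    y, x = np.unravel_index(np.argmax(heatmap_xy), heatmap_xy.shape)
    z = response_z[y, x]           # depth read at the most probable pixel
    confidence = heatmap_xy[y, x]  # reliability, reused by the temporal prior
    return (int(x), int(y), float(z)), float(confidence)
```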
(3.1.4) map the relative coordinates to the screen coordinates $(x'^{(i,k)}, y'^{(i,k)})$ of the $k$-th joint point of the $i$-th human body in the full picture, by rescaling the heat-map coordinates to the size of the person's bounding box and offsetting by the box position. The calculation procedure of step (3.1.4) is denoted $(x', y') = f(x, y)$.
(3.1.5) compute the proportional relation scale between the xy coordinates and the z coordinate from the intrinsics and geometry of the RGBD camera:

$$\mathrm{scale}_x = \frac{2\tan(\mathrm{fov}_x / 2)}{w}, \qquad \mathrm{scale}_y = \frac{2\tan(\mathrm{fov}_y / 2)}{h}$$

where $w$ and $h$ denote the width and height of the RGBD picture, $\mathrm{fov}_x$ the horizontal projection angle of the camera, and $\mathrm{fov}_y$ the vertical projection angle.
(3.1.6) obtain the world coordinates $(X^{(i,k)}, Y^{(i,k)}, Z^{(i,k)})$ of the $k$-th joint point of the $i$-th human body, with the RGBD camera as the origin:

$$X^{(i,k)} = (x'^{(i,k)} - w/2) \cdot \mathrm{scale}_x \cdot z^{(i,k)}$$

$$Y^{(i,k)} = (y'^{(i,k)} - h/2) \cdot \mathrm{scale}_y \cdot z^{(i,k)}$$

$$Z^{(i,k)} = z^{(i,k)}$$
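Steps (3.1.5)-(3.1.6) together are a standard pinhole back-projection; a sketch under that assumption (fov_x and fov_y in radians):

```python
import math

def backproject(x_screen, y_screen, z, w, h, fov_x, fov_y):
    """Screen coordinates + depth -> world coordinates with the camera at the
    origin (a sketch of steps (3.1.5)-(3.1.6), assuming a pinhole model)."""
    scale_x = 2.0 * math.tan(fov_x / 2.0) / w   # metres per pixel per unit depth
    scale_y = 2.0 * math.tan(fov_y / 2.0) / h
    X = (x_screen - w / 2.0) * scale_x * z
    Y = (y_screen - h / 2.0) * scale_y * z
    return X, Y, z                              # Z is the measured depth itself
```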
(3.2) when an RGBD video is input, further optimization is performed on the basis of step (3.1): the correlation among the multiple frames of the video is exploited, and a Bayesian method and exponential smoothing are used to improve the predicted world coordinates $(X^{(i,k)}, Y^{(i,k)}, Z^{(i,k)})$ of the human three-dimensional joint points, improving the accuracy and robustness of the overall prediction. The flow is shown in FIG. 6:
(3.2.1) compute the relative coordinates of the $k$-th joint point of the $i$-th human body in frame $t-1$ (the previous frame):

$$(x_{t-1}^{(i,k)}, y_{t-1}^{(i,k)}) = \arg\max_{(x,y)} H_{xy,t-1}^{(i,k)}(x, y)$$

$$z_{t-1}^{(i,k)} = H_{z,t-1}^{(i,k)}(x_{t-1}^{(i,k)}, y_{t-1}^{(i,k)})$$
(3.2.2) obtain the screen coordinates $(x'^{(i,k)}_{t-1}, y'^{(i,k)}_{t-1})$ of the $k$-th joint point of the $i$-th human body in frame $t-1$ following the calculation procedure of step (3.1.4).
(3.2.3) from the screen coordinates $(x'^{(i,k)}_{t-1}, y'^{(i,k)}_{t-1})$ of frame $t-1$, construct a prior probability distribution $P_{\mathrm{prior}}(x, y)$ for the corresponding screen coordinates in frame $t$ (the current frame). The distribution is a two-dimensional Gaussian with mean $(x'^{(i,k)}_{t-1}, y'^{(i,k)}_{t-1})$ and spread controlled by the variance term $\Sigma$; neglecting the constants in front of the Gaussian:

$$P_{\mathrm{prior}}(x, y) = \exp\!\left(-\frac{\Sigma}{2a^2}\left[(x - x'^{(i,k)}_{t-1})^2 + (y - y'^{(i,k)}_{t-1})^2\right]\right)$$

where $a$ is the size of the variance distribution range (the xy heat map resolution in this embodiment is 64 × 64 and $a = 2$), and $\Sigma = H_{xy,t-1}^{(i,k)}(x_{t-1}^{(i,k)}, y_{t-1}^{(i,k)})$ is defined as the reliability of the screen coordinates $(x'^{(i,k)}_{t-1}, y'^{(i,k)}_{t-1})$. The larger the confidence of the previous frame's prediction, the higher its reliability and the larger $\Sigma$, and the more the prior probability of the current frame's result concentrates at the previous frame's prediction.
(3.2.4) obtain the posterior distribution probability map with the Bayesian formula and use it to optimize the current frame's xy heat map:

$$H_{\mathrm{post}}^{(i,k)}(x, y) \propto H_{xy,t}^{(i,k)}(x, y) \cdot P_{\mathrm{prior}}(x, y)$$
(3.2.5) compute the relative xy coordinates of the current frame as:

$$(x_t^{(i,k)}, y_t^{(i,k)}) = \arg\max_{(x,y)} H_{\mathrm{post}}^{(i,k)}(x, y)$$
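A NumPy sketch of steps (3.2.3)-(3.2.5) under the reconstruction above, where the prior tightens as the previous frame's confidence grows (the exact placement of the confidence term in the exponent is an interpretation of the original):

```python
import numpy as np

def bayesian_refine(heatmap_t, prev_xy, prev_conf, a=2.0):
    """Sharpen the current xy heat map with a Gaussian prior centred on the
    previous frame's joint position; prev_conf is the previous prediction's
    heat-map confidence (the reliability term Sigma)."""
    h, w = heatmap_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - prev_xy[0]) ** 2 + (ys - prev_xy[1]) ** 2
    prior = np.exp(-prev_conf * d2 / (2.0 * a ** 2))  # unnormalised Gaussian prior
    posterior = heatmap_t * prior                     # Bayes: likelihood x prior
    y, x = np.unravel_index(np.argmax(posterior), posterior.shape)
    return (int(x), int(y)), posterior
```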
The invention also takes the previous frame's information into account when estimating the z-direction distance. Because the bounding boxes of the human body positions differ between the previous and current frames, part of the current frame has no corresponding z-distance response information, so the previous frame's z-distance response map must be expanded by nearest-neighbour interpolation. Fluctuation of the z-direction distance prediction is relieved by exponential smoothing; exponential smoothing is in fact a special weighted moving average that gives unequal weights to the values of different frames, increasing the weights of the most recent frames so that the predicted values quickly reflect new changes.
(3.2.6) obtain the optimized z-distance response map of the current frame by:

$$\tilde{H}_{z,t}^{(i,k)} = \alpha\, H_{z,t}^{(i,k)} + (1 - \alpha)\, \tilde{H}_{z,t-1}^{(i,k)}$$

where $H_{z,t-1}^{(i,k)}$ and $H_{z,t}^{(i,k)}$ are the z-distance response maps of the $k$-th joint point of the $i$-th person in frames $t-1$ and $t$, respectively. The weight $\alpha$ is set taking into account the reliability of the previous frame's prediction, so that it can change dynamically; an appropriate smoothing weight of 0.8 is selected.
(3.2.7) compute the z-direction distance of the $k$-th joint point of the $i$-th person in the current frame as:

$$z_t^{(i,k)} = \tilde{H}_{z,t}^{(i,k)}(x_t^{(i,k)}, y_t^{(i,k)})$$
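A sketch of steps (3.2.6)-(3.2.7), assuming the 0.8 weight applies to the current frame (consistent with the statement that recent frames receive larger weights):

```python
import numpy as np

def smooth_and_read_z(z_map_t, z_map_prev_smoothed, joint_xy, alpha=0.8):
    """Exponentially smooth the z-distance response map across frames and read
    the z distance at the refined joint position from step (3.2.5)."""
    z_map_smoothed = alpha * z_map_t + (1.0 - alpha) * z_map_prev_smoothed
    x, y = joint_xy
    return float(z_map_smoothed[y, x]), z_map_smoothed  # z value + map for next frame
```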
(3.2.8) obtain the world coordinates of the $k$-th joint point of the $i$-th person in frame $t$ from the relative coordinates obtained in steps (3.2.5) and (3.2.7), following steps (3.1.4)-(3.1.6).
The specific result of this embodiment is shown in FIG. 7. Given an input RGBD picture, the method of the invention effectively computes the bounding boxes of the human body regions of different people, extracts the human depth map inside each bounding box, and then computes each person's three-dimensional joint point positions; as the rendering of the joint points shows, the obtained three-dimensional joint positions match the positions of the real human bodies.

Claims (5)

1. A multi-person three-dimensional pose estimation method based on an RGBD camera, characterized by comprising the following steps:
(1) pre-training a human body detection segmentation network: training to obtain a deep convolutional network supporting human body position detection and semantic segmentation according to a real human RGB picture data set and the corresponding labeling information;
(2) pre-training a three-dimensional human pose estimation network: constructing a synthesized human depth map-three-dimensional feature point data set and training on it a deep convolutional network that estimates human joint points from a depth map, obtaining the three-dimensional human pose estimation network;
(3) actual use by the user: when an RGBD picture to be processed is input, running the human body detection segmentation network trained in step (1), extracting the corresponding depth map, and running the three-dimensional human pose estimation network trained in step (2) to estimate the human three-dimensional joint points, obtaining the world coordinates of all human three-dimensional joint points; when a continuous video scene is input, improving the predicted world coordinates of the human three-dimensional joint points by exploiting the correlation of multi-frame image information, using a Bayesian method and exponential smoothing.
2. The multi-person three-dimensional pose estimation method based on an RGBD camera according to claim 1, characterized in that step (1) is specifically: according to the input pictures and the corresponding labeling information, training a deep convolutional network supporting human body position detection and semantic segmentation, wherein the input is an RGB picture and the outputs are a bounding box of the human body position and a human body region mask.
3. The multi-person three-dimensional pose estimation method based on an RGBD camera according to claim 2, characterized in that the multitask deep convolutional network consists of three sub-networks, specifically: the first sub-network is a feature pyramid network that takes an RGB picture as input and performs multi-level, multi-scale convolution operations to extract abstract features of the picture; the second sub-network is a region proposal network that takes the abstract features output by the first sub-network as input and generates candidate boxes for human body positions through convolution operations; the third sub-network is a fully convolutional neural network that takes the abstract features inside the human body position candidate boxes output by the second sub-network as input and generates a human body region mask through convolution operations.
4. The multi-person three-dimensional pose estimation method based on an RGBD camera according to claim 3, characterized in that step (2) comprises the following sub-steps:
(2.1) constructing a synthesized human depth map-three-dimensional feature point data set, specifically: automatically synthesizing a number of three-dimensional human body models, binding them to human action skeleton data, obtaining three-dimensional human body models in different actions through skinning operations, and finally drawing the depth maps of all the three-dimensional human body models to obtain the human depth map-three-dimensional feature point data set;
(2.2) training, on the human depth map-three-dimensional feature point data set, a deep convolutional network that estimates human joint points from a depth map, specifically: training a three-dimensional human pose estimation network on the labeling information of the data set, where the input is a single-channel depth picture and the outputs are an xy heat map and a z-distance response map of the human three-dimensional joint points; the basic structure of the three-dimensional human pose estimation network is a stacked hourglass network, whose convolution modules repeatedly extract features through multiple down-sampling and up-sampling operations and finally produce the two output maps.
5. The multi-person three-dimensional pose estimation method based on an RGBD camera according to claim 4, characterized in that step (3) comprises two cases:
(3.1) when an RGBD picture to be processed is input, running the human body detection segmentation network trained in step (1), extracting the corresponding depth map, and then running the three-dimensional human pose estimation network trained in step (2) to estimate the human three-dimensional joint points, obtaining the coordinates of all human three-dimensional joint points, specifically: the user inputs a single RGBD picture; first, the RGB picture in it is extracted and the human body detection segmentation network is run to obtain the human body positions and human body segmentation masks; then, local coordinates of the three-dimensional joint points of each person are estimated from the extracted single-person depth images, and the world coordinates of the human three-dimensional joint points in the picture are obtained from the camera parameters and the relation between the local coordinates;
(3.2) when a continuous video scene is input, obtaining the coordinates of the human three-dimensional joint points in the previous frame according to step (3.1), then constructing a prior probability distribution of the current frame's human three-dimensional joint point coordinates from the previous frame's coordinates, optimizing the current frame's xy heat map of the human three-dimensional joint points with the Bayesian formula, optimizing the current frame's z-distance response map by exponential smoothing, and finally obtaining the optimized coordinates of the current frame's human three-dimensional joint points.
CN202010408082.9A 2020-05-14 2020-05-14 Multi-person three-dimensional attitude estimation method based on RGBD camera Pending CN111597976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010408082.9A CN111597976A (en) 2020-05-14 2020-05-14 Multi-person three-dimensional attitude estimation method based on RGBD camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010408082.9A CN111597976A (en) 2020-05-14 2020-05-14 Multi-person three-dimensional attitude estimation method based on RGBD camera

Publications (1)

Publication Number Publication Date
CN111597976A true CN111597976A (en) 2020-08-28

Family

ID=72190853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010408082.9A Pending CN111597976A (en) 2020-05-14 2020-05-14 Multi-person three-dimensional attitude estimation method based on RGBD camera

Country Status (1)

Country Link
CN (1) CN111597976A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101259A (en) * 2020-09-21 2020-12-18 中国农业大学 Single pig body posture recognition system and method based on stacked hourglass network
CN112258555A (en) * 2020-10-15 2021-01-22 佛山科学技术学院 Real-time attitude estimation motion analysis method, system, computer equipment and storage medium
CN112487974A (en) * 2020-11-30 2021-03-12 叠境数字科技(上海)有限公司 Video stream multi-person segmentation method, system, chip and medium
CN112560618A (en) * 2020-12-06 2021-03-26 复旦大学 Behavior classification method based on skeleton and video feature fusion
CN112651316A (en) * 2020-12-18 2021-04-13 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN112785692A (en) * 2021-01-29 2021-05-11 东南大学 Single-view-angle multi-person human body reconstruction method based on depth UV prior
CN112800905A (en) * 2021-01-19 2021-05-14 浙江光珀智能科技有限公司 Pull-up counting method based on RGBD camera attitude estimation
CN112836652A (en) * 2021-02-05 2021-05-25 浙江工业大学 Multi-stage human body posture estimation method based on event camera
CN113191243A (en) * 2021-04-25 2021-07-30 华中科技大学 Human hand three-dimensional attitude estimation model establishment method based on camera distance and application thereof
CN113221626A (en) * 2021-03-04 2021-08-06 北京联合大学 Human body posture estimation method based on Non-local high-resolution network
CN113313720A (en) * 2021-06-30 2021-08-27 上海商汤科技开发有限公司 Object segmentation method and device
CN113379904A (en) * 2021-07-05 2021-09-10 东南大学 Hidden space motion coding-based multi-person human body model reconstruction method
CN113421328A (en) * 2021-05-27 2021-09-21 中国人民解放军军事科学院国防科技创新研究院 Three-dimensional human body virtual reconstruction method and device
CN114529605A (en) * 2022-02-16 2022-05-24 青岛联合创智科技有限公司 Human body three-dimensional attitude estimation method based on multi-view fusion
CN116957919A (en) * 2023-07-12 2023-10-27 珠海凌烟阁芯片科技有限公司 RGBD image-based 3D human body model generation method and system
CN117372628A (en) * 2023-12-01 2024-01-09 北京渲光科技有限公司 Single-view indoor scene three-dimensional reconstruction method, system and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846403A (en) * 2017-01-04 2017-06-13 北京未动科技有限公司 The method of hand positioning, device and smart machine in a kind of three dimensions
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN107423698A (en) * 2017-07-14 2017-12-01 华中科技大学 A kind of gesture method of estimation based on convolutional neural networks in parallel
CN109003301A (en) * 2018-07-06 2018-12-14 东南大学 A kind of estimation method of human posture and rehabilitation training system based on OpenPose and Kinect
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
CN110188598A (en) * 2019-04-13 2019-08-30 大连理工大学 A kind of real-time hand Attitude estimation method based on MobileNet-v2
CN110490171A (en) * 2019-08-26 2019-11-22 睿云联(厦门)网络通讯技术有限公司 A kind of danger gesture recognition method, device, computer equipment and storage medium
CN110516670A (en) * 2019-08-26 2019-11-29 广西师范大学 Suggested based on scene grade and region from the object detection method for paying attention to module

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846403A (en) * 2017-01-04 2017-06-13 北京未动科技有限公司 The method of hand positioning, device and smart machine in a kind of three dimensions
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN107423698A (en) * 2017-07-14 2017-12-01 华中科技大学 A kind of gesture method of estimation based on convolutional neural networks in parallel
CN109003301A (en) * 2018-07-06 2018-12-14 东南大学 A kind of estimation method of human posture and rehabilitation training system based on OpenPose and Kinect
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
CN110188598A (en) * 2019-04-13 2019-08-30 大连理工大学 A kind of real-time hand Attitude estimation method based on MobileNet-v2
CN110490171A (en) * 2019-08-26 2019-11-22 睿云联(厦门)网络通讯技术有限公司 A kind of danger gesture recognition method, device, computer equipment and storage medium
CN110516670A (en) * 2019-08-26 2019-11-29 广西师范大学 Suggested based on scene grade and region from the object detection method for paying attention to module

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DUSHYANT MEHTA et al.: "VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera", https://arxiv.org/abs/1705.01583 *
CHEN Guojun et al.: "Real-time head pose estimation based on RGBD", Journal of Graphics *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101259A (en) * 2020-09-21 2020-12-18 中国农业大学 Single pig body posture recognition system and method based on stacked hourglass network
CN112258555A (en) * 2020-10-15 2021-01-22 佛山科学技术学院 Real-time attitude estimation motion analysis method, system, computer equipment and storage medium
CN112487974A (en) * 2020-11-30 2021-03-12 叠境数字科技(上海)有限公司 Video stream multi-person segmentation method, system, chip and medium
CN112560618A (en) * 2020-12-06 2021-03-26 复旦大学 Behavior classification method based on skeleton and video feature fusion
CN112560618B (en) * 2020-12-06 2022-09-16 复旦大学 Behavior classification method based on skeleton and video feature fusion
CN112651316B (en) * 2020-12-18 2022-07-15 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN112651316A (en) * 2020-12-18 2021-04-13 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN112800905A (en) * 2021-01-19 2021-05-14 浙江光珀智能科技有限公司 Pull-up counting method based on RGBD camera attitude estimation
CN112785692A (en) * 2021-01-29 2021-05-11 东南大学 Single-view-angle multi-person human body reconstruction method based on depth UV prior
CN112836652B (en) * 2021-02-05 2024-04-19 浙江工业大学 Multi-stage human body posture estimation method based on event camera
CN112836652A (en) * 2021-02-05 2021-05-25 浙江工业大学 Multi-stage human body posture estimation method based on event camera
CN113221626B (en) * 2021-03-04 2023-10-20 北京联合大学 Human body posture estimation method based on Non-local high-resolution network
CN113221626A (en) * 2021-03-04 2021-08-06 北京联合大学 Human body posture estimation method based on Non-local high-resolution network
CN113191243A (en) * 2021-04-25 2021-07-30 华中科技大学 Human hand three-dimensional attitude estimation model establishment method based on camera distance and application thereof
CN113421328A (en) * 2021-05-27 2021-09-21 中国人民解放军军事科学院国防科技创新研究院 Three-dimensional human body virtual reconstruction method and device
CN113313720B (en) * 2021-06-30 2024-03-29 上海商汤科技开发有限公司 Object segmentation method and device
CN113313720A (en) * 2021-06-30 2021-08-27 上海商汤科技开发有限公司 Object segmentation method and device
CN113379904A (en) * 2021-07-05 2021-09-10 东南大学 Hidden space motion coding-based multi-person human body model reconstruction method
CN114529605A (en) * 2022-02-16 2022-05-24 青岛联合创智科技有限公司 Human body three-dimensional attitude estimation method based on multi-view fusion
CN114529605B (en) * 2022-02-16 2024-05-24 青岛联合创智科技有限公司 Human body three-dimensional posture estimation method based on multi-view fusion
CN116957919A (en) * 2023-07-12 2023-10-27 珠海凌烟阁芯片科技有限公司 RGBD image-based 3D human body model generation method and system
CN117372628A (en) * 2023-12-01 2024-01-09 北京渲光科技有限公司 Single-view indoor scene three-dimensional reconstruction method, system and equipment
CN117372628B (en) * 2023-12-01 2024-02-23 北京渲光科技有限公司 Single-view indoor scene three-dimensional reconstruction method, system and equipment

Similar Documents

Publication Publication Date Title
CN111597976A (en) Multi-person three-dimensional attitude estimation method based on RGBD camera
US11727596B1 (en) Controllable video characters with natural motions extracted from real-world videos
Zhou et al. Dance dance generation: Motion transfer for internet videos
CN113706699B (en) Data processing method and device, electronic equipment and computer readable storage medium
KR100483806B1 (en) Motion Reconstruction Method from Inter-Frame Feature Correspondences of a Single Video Stream Using a Motion Library
CN110033505A (en) A kind of human action capture based on deep learning and virtual animation producing method
CN106896925A (en) The device that a kind of virtual reality is merged with real scene
CN106997618A (en) A kind of method that virtual reality is merged with real scene
US11648477B2 (en) Systems and methods for generating a model of a character from one or more images
CN112258555A (en) Real-time attitude estimation motion analysis method, system, computer equipment and storage medium
CN111402412A (en) Data acquisition method and device, equipment and storage medium
CN114339030B (en) Network live video image stabilizing method based on self-adaptive separable convolution
CN111462274A (en) Human body image synthesis method and system based on SMP L model
CN112819951A (en) Three-dimensional human body reconstruction method with shielding function based on depth map restoration
CN113538667A (en) Dynamic scene light field reconstruction method and device
CN107016730A (en) The device that a kind of virtual reality is merged with real scene
CN113724155A (en) Self-boosting learning method, device and equipment for self-supervision monocular depth estimation
CN107018400B (en) It is a kind of by 2D Video Quality Metrics into the method for 3D videos
CN110415322B (en) Method and device for generating action command of virtual object model
CN112308977A (en) Video processing method, video processing apparatus, and storage medium
US11138743B2 (en) Method and apparatus for a synchronous motion of a human body model
CN115331265A (en) Training method of posture detection model and driving method and device of digital person
CN106981100A (en) The device that a kind of virtual reality is merged with real scene
CN113989928A (en) Motion capturing and redirecting method
CN113920270B (en) Layout reconstruction method and system based on multi-view panorama

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200828