CN105631861A

CN105631861A - Method of restoring three-dimensional human body posture from unmarked monocular image in combination with height map

Info

Publication number: CN105631861A
Application number: CN201510970682.3A
Authority: CN
Inventors: 耿卫东; 杜宇; 刘永豪; 韩菲琳; 桂义林; 王镇
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2015-12-21
Filing date: 2015-12-21
Publication date: 2016-06-01
Anticipated expiration: 2035-12-21
Also published as: CN105631861B

Abstract

The invention discloses a method of restoring a three-dimensional human body posture from an unmarked monocular image in combination with a height map. The method comprises the following steps: 1) a color image and a height image are used for training to obtain a deep convolutional network-based two-dimensional joint point recognition model; 2) a video frame image sequence and a camera parameter are inputted, and a height map corresponding to each frame of image is calculated; 3) the video frame image and the height map obtained in the second step are inputted, and the two-dimensional joint point recognition model obtained through training the first step is used to obtain two-dimensional joint point coordinates of a human body in each frame of image; and 4) the two-dimensional joint point coordinates obtained in the third step are inputted, and the human body three-dimensional posture is restored according to an optimization model. During the two-dimensional joint point recognition process, the color image and the height image are used integrally, and the two-dimensional joint point recognition accuracy is improved; and time sequence consistency constraints are added to the optimization model which can restore the three-dimensional human body posture from the two-dimensional joint point, and thus the restored three-dimensional human body posture is closer to the real human body posture.

Description

The method recovering 3 D human body attitude in conjunction with height map from unmarked monocular image

Technical field

The present invention relates to 3 D human body pose recovery method, particularly relate to a kind of method recovering 3 D human body attitude in conjunction with height map from unmarked monocular image sequence.

Background technology

Human body three-dimensional Attitude estimation, because it is widely applied prospect, receives the concern of a lot of researcher. The method of existing 3 D human body Attitude estimation can be largely classified into the method based on monocular camera and based on the multi views method as sequence. At present, the method for monocular is just being subject to industry and is more and more paying close attention to. Because while many vision methods provide more viewdata, and then more rich information can be provided for Attitude estimation, but these data not always can obtain in reality, particularly in video monitoring and sanatorium etc. apply.

Recovering 3 D human body attitude from monocular image sequence is an intrinsic ill-conditioning problem, because when 3 d pose derived by our view from the single-view image that monocular camera obtains time, the projection for observing on same two dimensional image can obtain the combination of multiple 3 d pose and camera position. Add under real conditions due to environmental factors or the impact blocked etc., feature (the such as human body contour outline of image, extremity or two dimension articulare) can not be accurately detected, so that recovering 3 D human body attitude based on monocular image to become more challenging. But, human observer but can pass through eyes and accurately estimate the attitude of human body. In most cases, the mankind can also organize the anatomical landmarks in three dimensions and then the relative position of prediction camera easily. Why the mankind can accomplish this point, it is likely that be because in the brain of people storing the configuration information of substantial amounts of 3 D human body attitude, and then can eliminate the ambiguousness of 2 d-to-3 d. Therefore, it can obtain a rational agency of such ability by substantial amounts of three-dimensional motion data in study motion capture database.

Summary of the invention

Present invention aims to the deficiencies in the prior art, improve a kind of method recovering 3 D human body attitude in conjunction with height map from unmarked monocular image.

It is an object of the invention to be achieved through the following technical solutions: a kind of method recovering 3 D human body attitude in conjunction with height map from unmarked monocular image, the step of the method is as follows:

1) color image data collection and the training of height image data set is used to obtain the two-dimentional articulare model of cognition based on degree of depth convolutional network;

2) reading in color video frame image sequence and camera parameter, estimation obtains the height image that every color image frame is corresponding;

3) by step 2) the coloured image input step 1 of the height map that obtains and correspondence thereof) the two-dimentional articulare model of cognition based on degree of depth convolutional network that training obtains, identify and obtain the two-dimentional body joint point coordinate of human body in every two field picture;

4) input step 3) the two-dimentional body joint point coordinate of multiple image that obtains, recover the 3 d pose of human body in every two field picture according to Optimized model.

Further, described height map calculates in the following ways and obtains:

Height map is the middle mark of a kind of new human body that the present invention proposes. In anatomical structure, there is an empirical scalar in the height of each position of major joint point of human body and the length of skeleton and human body. Therefore elevation information contains the spatial relationship between each articulare of framing structure. For each pixel in human body contour outline, using the Height Estimation method that Park and partners thereof mention in the paper being entitled as " RobustEstimationofHeightofMovingPeopleUsingaSingleCamera " delivered for 2012 to carry out Height Estimation, the method is by the height calculating human body in two dimensional character (i.e. head point and the foot point) back projection on the plane of delineation to three-dimensional scenic. What the value of each pixel in height map represented is the height (namely relative to the height on ground in world coordinate system) of this point. But owing to the height of different people is different, in order to unrelated with human height, the value H of each pixel in height map, (x y) is normalized to relative altitudeThat is:

\hat{H} (x, y) = k \cdot \frac{H (x, y)}{h_{i}}

Wherein x and y is pixel coordinate, h_iRepresent the height of i-th people. K be one for relative altitude figure is mapped to required interval convergent-divergent constant, be empirically set to 255.

Further, step 1) described in the two-dimentional articulare model of cognition mentioned in the paper being entitled as " ArticulatedPoseEstimationbyaGraphicalModelwithImageDepen dentPairwiseRelations " delivered for 2014 based on Chen and partners thereof of the two-dimentional articulare model of cognition based on degree of depth convolutional network, paper is X.ChenandA.L.Yuille, " ArticulatedPoseEstimationbyaGraphicalModelwithImageDepen dentPairwiseRelations ", InNIPS, pages1736 1744, 2014. here this model has been improved, increase a new data stream (i.e. height map), (structure originally only uses RGB image to make its degree of depth convolutional network become double-flow design, use now RGB image and the height image estimated) simultaneously, and before last graph model, add a fused layer, merge the output of double fluid convolutional network.

Further, described step 1) in based on the training process of the two-dimentional articulare model of cognition of degree of depth convolutional network be:

1.1) use public data collection " LeedsSportsPoses " (LSP) (coloured image) and synthesis height image data set the two-dimentional articulare model of cognition based on degree of depth convolutional network is trained, the height image data set of synthesis by exercise data driving anthropometric dummy obtain.

1.2) model of cognition obtained is finely tuned by the height map using the RGB image in real video and correspondence thereof.

Further, described step 2) in height map generalization process corresponding to colored video frame images as follows:

2.1) the color video two field picture of input is carried out foreground extraction, obtain the prospect bianry image of every two field picture.

2.2) read in the prospect bianry image of camera internal reference, outer ginseng and every two field picture, generate the height map that every two field picture is corresponding.

Further, described step 3) in identify that the two-dimentional articulare obtained has 14, respectively: left/right ankle, left/right knee, left/right buttocks, left/right wrist, left and right/ancon, left/right shoulder, cervical region and head. The coordinate of two dimension articulare obtains by optimizing scoring function F based on the graph model at position (l, t | I):

F (l, t | I) = \underset{i &Element; V}{Σ} U (l_{i} | I) + \underset{(i, j) &Element; ϵ}{Σ} R (l_{i}, l_{j}, t_{i j}, t_{j i} | I) + ω_{0}

Wherein l={l_i| i �� V} is the set of articulare position, t={t_ij| (i, j) �� } it is paired relationship type, ��₀It it is a bias term. V and �� is the set on the summit of graph model and limit respectively. U and R is obtained by the Joint Distribution of the convolutional network that marginalisation training obtains, and comprises position type and the mixing of paired relationship type. The input of convolutional network is an image block, and exporting is a position (i.e. articulare) probability that drops on this image block.

Further, described step 4) particularly as follows:

It is represented by t 3 D human body state:

X_{t} = {[P_{t}^{T}, V_{t}^{T}]}^{T}

Wherein P_tRepresent t 3 D human body attitude, V_tRepresent the articulare speed of t.

3 D human body state can be expressed as a series of main constituent B={b₁..., b_kAnd the linear combination of an average vector ��:

X_{t} = μ + B_{t}^{*} ω_{t}

{b_{i}}_{i &Element; I_{B_{t}^{*}}} &Element; B_{t}^{*} &Subset; B

Wherein ��_tIt is the coefficient of main constituent,It is a majorized subset of B. B obtains by different types of exercise data carries out principal component analysis (PCA).

The two-dimentional body joint point coordinate of a given image sequenceCorresponding 3 d poseCan be obtained by the following object function of optimization:

\underset{Ω}{m i n} | | p - {CA}_{0} (M + B * Ω) | |^{2} + α | | D (M + B^{*} Ω) | |^{2}

WhereinM and n represents the number of frame number that the image sequence of input comprises and three-dimensional articulare respectively; Weak perspective camera modelFor camera projection matrix;

M = I_{m} &CircleTimes; μ, Ω = {[ω_{1}^{T} ... ω_{m}^{T}]}^{T}, A_{0} = I_{m} &CircleTimes; [I_{n} 0_{n}], B^{*} = Σ_{t = 1}^{m} B_{t}^{*};

�� is used for balancing back projection's error and temporal consistency,

The invention has the beneficial effects as follows: the present invention is comprehensive in the identification process of two dimension articulare uses coloured image and height image, makes the recognition accuracy of two dimension articulare be improved. On the other hand, in the Optimized model recovering 3 D human body attitude from two dimension articulare, add temporal consistency constraint, make the 3 D human body attitude recovered be more nearly real human body attitude.

Accompanying drawing explanation

Fig. 1 is the overview flow chart of the present invention;

Fig. 2 is based on the skeleton of height and dissects exploded view;

Fig. 3 (a)-(d) is four different synthesis height image of two dimension articulare model of cognition training in embodiment;

Fig. 4 is the true altitude map generalization process that the present invention uses;

Fig. 5 (a)-(c) is three real video images that embodiment uses;

Fig. 6 (a)-(c) is the height map that real video images that embodiment estimates is corresponding;

Fig. 7 (a)-(c) is the two-dimension human body skeleton that embodiment only uses coloured image to estimate;

Fig. 8 (a)-(c) is that embodiment uses colour to add the two-dimension human body skeleton that height image estimates;

Fig. 9 (a)-(c) is the 3 D human body attitude that embodiment recovers.

Detailed description of the invention

The core of the present invention is to increase elevation information (namely using height map) in the process of two dimension articulare identification, then adds temporal consistency constraint in the process recovering 3 d pose according to two dimension articulare.

Following with an embodiment, the embodiment of idiographic flow is described, step following (see Fig. 1):

1) color image data collection and the training of height image data set is used to obtain the two-dimentional articulare model of cognition based on degree of depth convolutional network; A given image sequenceWherein w and h is width and the height of image respectively, and d is number of active lanes. The target of two dimension articulare identification is exactly utilize RGB image with the height image estimated to calculate the position of two dimension articulare on every two field picture. Two dimension articulare identify the two-dimentional articulare model of cognition that the model used is mentioned in the paper being entitled as " ArticulatedPoseEstimationbyaGraphicalModelwithImageDepen dentPairwiseRelations " delivered for 2014 based on Chen and partners thereof, paper is X.ChenandA.L.Yuille, " ArticulatedPoseEstimationbyaGraphicalModelwithImageDepen dentPairwiseRelations ", InNIPS, pages1736 1744,2014. Here carried out improving (see Fig. 1) to this model, increase a new data stream (i.e. height map), its degree of depth convolutional network is made to become double-flow design (structure originally only uses RGB image, uses now RGB image and the height image estimated simultaneously). And before last graph model, add a fused layer, merge the output of double fluid convolutional network. The training process of two dimension articulare model of cognition is:

1.1) use public data collection " LeedsSportsPoses " (LSP) (coloured image) and synthesize height image data set the two-dimentional articulare model of cognition based on degree of depth convolutional network is trained. The data set that LSP data collection is Johnson and partners thereof to be announced in the paper being entitled as " ClusteredPoseandNonlinearAppearanceModelsforHumanPoseEst imation " delivered for 2010, paper is S.JohnsonandM.Everingham, " ClusteredPoseandNonlinearAppearanceModelsforHumanPoseEst imation ", InBMVC, pages1 11,2010. Synthesis altitude information, by using CMU and the action data that gathers voluntarily to drive 9 models to obtain, including the position of height map (see Fig. 3) and the two and three dimensions of predefined 14 articulares, synthesizes nearly 180,000 of height image.

1.2) convolutional network obtained is finely tuned by the height map simultaneously using the RGB image in real video and correspondence thereof. Here pick totally 3811 frame data walked and run of 10 people in real video, the position of its two dimension articulare has been carried out manual mark, and has estimated the height image that every two field picture is corresponding.

2) reading in real video frame image sequence (see Fig. 5) and camera parameter, estimation obtains height image corresponding to every two field picture (see Fig. 6). Detailed process is:

2.1) video frame images of input is carried out foreground extraction, obtain the prospect bianry image of every two field picture.

2.2) read in the prospect bianry image of camera internal reference, outer ginseng and every two field picture, according to above-mentioned height map generalization method, calculate and obtain the height map that every two field picture is corresponding. Height map calculates in the following ways and obtains (see Fig. 4):

Height map is the middle mark of a kind of new human body that the present invention proposes. In anatomical structure, there is an empirical scalar (as shown in Figure 2) in the height of each position of major joint point of human body and the length of skeleton and human body. Therefore elevation information contains the spatial relationship between each articulare of framing structure. For each pixel in human body contour outline, using the Height Estimation method that Park and partners thereof mention in the paper being entitled as " RobustEstimationofHeightofMovingPeopleUsingaSingleCamera " delivered for 2012 to carry out Height Estimation, the method is by the height calculating human body in two dimensional character (i.e. head point and the foot point) back projection on the plane of delineation to three-dimensional scenic. What the value of each pixel in height map represented is the height (namely relative to the height on ground in world coordinate system) of this point. But owing to the height of different people is different, in order to unrelated with human height, the value H of each pixel in height map, (x y) is normalized to relative altitudeThat is:

\hat{H} (x, y) = k \cdot \frac{H (x, y)}{h_{i}}

3) height map that input real video images and estimation obtain, the two-dimentional articulare model of cognition that training obtains is used to be identified, obtain 14 two-dimentional body joint point coordinate of human body in every two field picture, the two-dimension human body skeleton (see Fig. 8) that the two-dimentional articulare of correspondence that is linked in sequence obtains. Here the real video images inputted does not appear in training data.

Also the two-dimentional articulare model of cognition only using coloured image is tested on real video images (the two-dimentional articulare model of cognition that namely Chen and partners thereof mention in the paper being entitled as " ArticulatedPoseEstimationbyaGraphicalModelwithImageDepen dentPairwiseRelations " delivered for 2014), partial test result such as Fig. 7 in an embodiment). After can be seen that addition elevation information according to comparing result, the recognition accuracy of two dimension articulare is improved.

4) input step 3) the two-dimentional body joint point coordinate of multiple image that obtains, the 3 d pose (see Fig. 9) of human body in every two field picture is recovered according to Optimized model. The Optimized model that the Optimized model of 3 D human body attitude proposes in the paper being entitled as " Reconstructing3DHumanPosefrom2DImageLandmarks " delivered for 2012 is recovered based on Ramakrishna and partners thereof according to two dimension articulare. Here this model is improved, adds joint angle speed and temporal consistency constraint, be described in detail below:

It is represented by t 3 D human body state:

X_{t} = {[P_{t}^{T}, V_{t}^{T}]}^{T}

X_{t} = μ + B_{t}^{*} ω_{t}

{b_{i}}_{i &Element; I_{B_{t}^{*}}} &Element; B_{t}^{*} &Subset; B

Wherein ��_tIt is the coefficient of main constituent,It is a majorized subset of B. B obtains by different types of motion carries out principal component analysis (PCA).

\underset{Ω}{m i n} | | p - {CA}_{0} (M + B^{*} Ω) | |^{2} + α | | D (M + B^{*} Ω) | |^{2}

WhereinM and n represents the number of frame number that the image sequence of input comprises and three-dimensional articulare respectively; Weak perspective camera modelI is unit matrix,For Kronecker product,For camera projection matrix;

M = I_{m} &CircleTimes; μ, Ω = {[ω_{1}^{T} ... ω_{m}^{T}]}^{T},

A_{0} = I_{m} &CircleTimes; [I_{n} 0_{n}], B^{*} = Σ_{t = 1}^{m} B_{t}^{*};

�� is used for balancing back projection's error and temporal consistency,

Claims

1. the method recovering 3 D human body attitude in conjunction with height map from unmarked monocular image, it is characterised in that the method comprises the steps:

2. the method recovering 3 D human body attitude as claimed in claim 1 in conjunction with height map from unmarked monocular image, it is characterised in that described height map calculates in the following ways and obtains:

For each pixel in human body contour outline, using Height Estimation method to carry out Height Estimation, the method is by the height calculating human body in two dimensional character (i.e. head point and the foot point) back projection on the plane of delineation to three-dimensional scenic. What the value of each pixel in height map represented is the height (namely relative to the height on ground in world coordinate system) of this point. The value H of each pixel in height map, (x y) is normalized to relative altitudeThat is:

\hat{H} (x, y) = k \cdot \frac{H (x, y)}{h_{i}}

Wherein x and y is pixel coordinate, h_iRepresent the height of i-th people. K is a convergent-divergent constant for relative altitude figure being mapped to required interval.

3. the method recovering 3 D human body attitude as claimed in claim 1 in conjunction with height map from unmarked monocular image, it is characterized in that, described step 1) in based on the two-dimentional articulare model of cognition of degree of depth convolutional network traditional based on the model of cognition of RGB image increases height map, its degree of depth convolutional network is made to become double-flow design, and before last graph model, add a fused layer, merge the output of double fluid convolutional network.

4. the method recovering 3 D human body attitude as claimed in claim 1 in conjunction with height map from unmarked monocular image, it is characterised in that step 1) in based on the training process of the two-dimentional articulare model of cognition of degree of depth convolutional network be:

5. the method recovering 3 D human body attitude as claimed in claim 1 in conjunction with height map from unmarked monocular image, it is characterised in that described step 2) in height map generalization process corresponding to colored video frame images as follows:

6. the method recovering 3 D human body attitude as claimed in claim 1 in conjunction with height map from unmarked monocular image, it is characterized in that, described step 3) in identify that the two-dimentional articulare that obtains has 14, respectively: left/right ankle, left/right knee, left/right buttocks, left/right wrist, left and right/ancon, left/right shoulder, cervical region and head. The coordinate of two dimension articulare obtains by optimizing scoring function F based on the graph model at position (l, t | I):

F (l, t | I) = \underset{i &Element; V}{Σ} U (l_{i} | I) + \underset{(i, j) &Element; ϵ}{Σ} R (l_{i}, l_{j}, t_{i j}, t_{j i} | I) + ω_{0}

7. the method recovering 3 D human body attitude as claimed in claim 1 in conjunction with height map from unmarked monocular image, it is characterised in that described step 4) particularly as follows:

It is represented by t 3 D human body state:

X_{t} = {[P_{t}^{T}, V_{t}^{T}]}^{T}

3 D human body state can be expressed as a series of main constituent B={b₁,��,b_kAnd the linear combination of an average vector ��:

X_{t} = μ + B_{t}^{*} ω_{t}

{b_{i}}_{i &Element; I_{B_{t}^{*}}} &Element; B_{t}^{*} &Subset; B

\min_{Ω} | | p - {CA}_{0} (M + B^{*} Ω) | |^{2} + α | | D (M + B^{*} Ω) | |^{2}

M = I_{m} &CircleTimes; μ, Ω = {[\begin{matrix} ω_{1}^{T} & ... & ω_{m}^{T} \end{matrix}]}^{T},

0_nFor the null matrix of n �� n,�� is used for balancing back projection's error and temporal consistency,