CN109271933A - Method for three-dimensional human pose estimation based on a video stream - Google Patents
Method for three-dimensional human pose estimation based on a video stream
- Publication number: CN109271933A (application CN201811080931.1)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data › G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V20/00—Scenes; Scene-specific elements › G06V20/60—Type of objects › G06V20/64—Three-dimensional objects › G06V20/647—Three-dimensional objects by matching two-dimensional images to three-dimensional objects
Abstract
The present method performs three-dimensional (3D) human pose estimation on a video stream using deep learning. It avoids many of the defects caused by errors in two-dimensional visual analysis, makes full use of the temporal relationship between video frames, and improves the accuracy and real-time performance of the video-stream 3D pose inference results. For the n-th frame of the video (n ≥ 2): 1) input the current-frame two-dimensional image and generate a shallow feature map with a shallow neural network module; 2) input the human 2D joint heatmap generated for frame (n−1), together with the shallow feature map generated for the current frame, into an LSTM module to generate a deep feature map; 3) pass the deep feature map of the current frame to a residual module, which outputs the human 2D joint heatmap of the current frame; 4) pass that heatmap to a 3D joint inference module, which performs the 2D-to-3D spatial mapping. Superimposing the human 3D joint heatmaps generated for every frame produces the video stream of 3D human pose estimates.
Description
Technical field
The present invention relates to a method for performing three-dimensional human pose estimation on a two-dimensional image video stream, and belongs to the field of virtual reality technology.
Background art
3D human pose estimation is the accurate estimation of the 3D positions of several joints of the human body (e.g., head, shoulders, elbows). Because depth information is lost, estimating the positions of the human 3D joints from a two-dimensional RGB video stream is one of the great challenges in the field of computer vision.
With the development of deep neural networks, more and more technical innovation focuses on end-to-end three-dimensional human skeleton detection based on deep neural networks. Existing, relatively common 3D human pose estimation methods follow two main technical routes:
Two-stage 3D joint inference: as shown in Fig. 1, the method is divided into two stages. In the first stage, an existing 2D joint inference model accurately estimates the positions of the human 2D joints, generally represented as 2D joint heatmaps. In the second stage, the mathematical expression of the human 3D joints is generated from the 2D joint heatmaps and the intermediate-layer feature maps produced in the first stage.
End-to-end 3D joint inference: as shown in Fig. 2, the input of the inference model is an RGB image, and the output is the 3D mathematical expression of the human body.
As described above, existing 3D human pose estimation has the following technical deficiencies. A. It generally outputs the 3D coordinates of the human joints directly, which is very difficult for the network to learn, because the mapping from feature space to 3D configuration space is a highly nonlinear learning task. B. When performing 3D joint inference, the intermediate features of the neural network are under-utilized, making it difficult to combine feature information of different scales and dimensions, so the inference results are poor. C. During video-stream-based 3D pose inference, the amount of computation grows substantially, so the final inference cannot meet real-time requirements and the practical effect is poor. D. During video-stream-based 3D pose inference, the spatio-temporal relationship between frames is not exploited, so the problem of joints being occluded or disappearing cannot be solved.
In view of this, the present patent application is specially proposed.
Summary of the invention
The object of the present method for 3D human pose estimation based on a video stream is to solve the above problems of the prior art by performing 3D human pose estimation on a video stream with deep learning. The method mainly comprises generation of the 3D human pose model, establishment of the spatial relationships of the joints, and capture of the temporal correlation between video frames, so as to avoid the many defects caused by errors in two-dimensional visual analysis, make full use of the temporal relationships between video frames, and improve the accuracy and real-time performance of the video-stream 3D pose inference results.
To achieve the above object, the method for 3D human pose estimation based on a video stream comprises the following implementation steps:
For the first frame of the video: 1) input the current-frame 2D image and extract the human 2D pose with the hourglass network module, generating the human 2D joint heatmap of the first frame; 2) output the human 2D joint heatmap of the current frame to the 3D joint inference module, which performs the 2D-to-3D spatial mapping to generate the human 3D joint heatmap.
For the second frame of the video: 1) input the current-frame 2D image and generate a shallow feature map with the shallow neural network module; 2) input the human 2D joint heatmap generated for the first frame, together with the shallow feature map generated for the current frame, into the LSTM module to generate a deep feature map; 3) output the deep feature map generated for the current frame to the residual module, which generates the human 2D joint heatmap of the current frame; 4) output the human 2D joint heatmap of the current frame to the 3D joint inference module, which performs the 2D-to-3D spatial mapping to generate the human 3D joint heatmap.
For the n-th frame of the video (n ≥ 2): 1) input the current-frame 2D image and generate a shallow feature map with the shallow neural network module; 2) input the human 2D joint heatmap generated for frame (n−1), together with the shallow feature map generated for the current frame, into the LSTM module to generate a deep feature map; 3) output the deep feature map generated for the current frame to the residual module, which generates the human 2D joint heatmap of the current frame; 4) output the human 2D joint heatmap of the current frame to the 3D joint inference module, which performs the 2D-to-3D spatial mapping to generate the human 3D joint heatmap.
Superimposing the human 3D joint heatmaps generated for every frame produces the video stream of 3D human pose estimates.
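The per-frame routing just described can be sketched in plain Python (module names here are illustrative placeholders, not the patent's actual implementation):

```python
def frame_modules(n: int) -> list[str]:
    """Return the processing modules applied to frame n (1-indexed),
    following the per-frame steps described above."""
    if n == 1:
        # First frame: full hourglass network, then 2D-to-3D inference.
        return ["hourglass", "3d_inference"]
    # Frames n >= 2: shallow CNN, LSTM (fed the previous frame's 2D joint
    # heatmap), residual module, then 2D-to-3D inference.
    return ["shallow_cnn", "lstm", "residual", "3d_inference"]
```

The key design point is that the expensive hourglass network runs only on the first frame; every later frame reuses its heatmap through the LSTM chain.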
As described above, in order to make full use of the spatio-temporal relationships between frames, the method mainly combines an hourglass network, a shallow neural network, an LSTM (Long Short-Term Memory) module, a residual module and a 3D joint inference module to perform 3D human pose estimation. Wherein:
the hourglass module accurately predicts the human 2D pose and generates the human 2D joint heatmap;
the shallow neural network outputs the feature map of a single-frame image;
the LSTM module takes as input the human 2D joint heatmap generated by the hourglass module and the image feature map generated by the shallow neural network, and generates the deep feature map of the current frame;
the residual module takes as input the current-frame deep feature map generated by the LSTM module and generates the human 2D joints;
the 3D joint inference module uses the 2D joints extracted by the hourglass module and the estimated depth to perform the 2D-to-3D spatial mapping, ultimately generating the human 3D joint coordinates.
To further optimize the hourglass network, the first-order hourglass network comprises the following parallel structure:
an upper half-way with several primary modules having M input channels and N output channels;
a lower half-way with, in series, a 1/2 down-sampling pooling layer, several primary modules, and a nearest-neighbor interpolation up-sampling module.
An n-th order hourglass network (n ≥ 2) has the following structure: any primary module in the lower half-way of the (n−1)-th order hourglass network is replaced with an (n−1)-th order hourglass network; the rest of the upper and lower half-way structure is identical to the (n−1)-th order hourglass network.
Specifically, the upper half-way extracts the data of M channels to obtain data of N channels. Among the several primary modules in series, for two adjacent primary modules, the number of input channels of the latter is always equal to the number of output channels of the former.
The lower half-way likewise extracts the data of M channels to obtain data of N channels, the difference being that this is carried out at half the original input size; that is, the 1/2 down-sampling pooling layer, the primary modules and the nearest-neighbor interpolation up-sampling module are connected in series.
The n-th order hourglass network is thus obtained from the (n−1)-th order hourglass network by replacing one primary module in its lower half-way with a new (n−1)-th order hourglass network, expanding the (n−1)-th order network into the n-th order network.
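Since each additional order nests one more down-sampling into the lower half-way, the scales visited by an n-th order hourglass can be sketched as (pure Python, illustrative):

```python
def hourglass_scales(order: int) -> list[float]:
    """Scales (relative to the input) visited by an n-th order hourglass:
    each nesting level halves the resolution once more, down to 1/2^n."""
    return [1 / 2**k for k in range(order + 1)]

# A second-order hourglass works at full, 1/2 and 1/4 resolution,
# matching the "at most 1/4 down-sampling" described below for order 2.
print(hourglass_scales(2))  # -> [1.0, 0.5, 0.25]
```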
In summary, the method for 3D human pose estimation based on a video stream has the following advantages:
1. It makes full use of the temporal relationship between video frames, improving the accuracy and real-time performance of the video-stream 3D pose inference results.
2. It significantly reduces the degree of nonlinearity of the learning task from "feature space" to "3D configuration space", realizing a sound representation and learning method.
3. It realizes an "end-to-end" deep learning network for human 3D pose estimation, avoiding the accumulation of errors during 3D inference of the human joints.
4. It maximally utilizes the intermediate features of the neural network, combining features of different scales and dimensions to produce the best inference results.
5. It directly reduces the amount of computation, so the final inference meets real-time requirements and is more practical.
Brief description of the drawings
Fig. 1 is a schematic diagram of the two-stage estimation method in the prior art;
Fig. 2 is a schematic diagram of the end-to-end estimation method in the prior art;
Fig. 3 is a flow chart of the described method for 3D human pose estimation based on a video stream;
Fig. 4 is a structural schematic diagram of the primary module (Residual);
Fig. 5 is a structural schematic diagram of the first-order hourglass module;
Fig. 6 is a structural schematic diagram of the second-order hourglass module;
Fig. 7 is a structural schematic diagram of the shallow neural network;
Fig. 8 is a flow chart of the 3D joint inference module.
Specific embodiments
The present invention is described in further detail below with reference to the drawings and an implementation example.
Embodiment 1. As shown in Fig. 3, the method for 3D human pose estimation based on a video stream is as follows:
For the first frame of the video: 1) input the current-frame 2D image and extract the human 2D pose with the hourglass network module, generating the human 2D joint heatmap of the first frame; 2) output the human 2D joint heatmap of the current frame to the 3D joint inference module, which performs the 2D-to-3D spatial mapping to generate the human 3D joint heatmap.
For the second frame of the video: 1) input the current-frame 2D image and generate a shallow feature map with the shallow neural network module; 2) input the human 2D joint heatmap generated for the first frame, together with the shallow feature map generated for the current frame, into the LSTM module to generate a deep feature map; 3) output the deep feature map generated for the current frame to the residual module, which generates the human 2D joint heatmap of the current frame; 4) output the human 2D joint heatmap of the current frame to the 3D joint inference module, which performs the 2D-to-3D spatial mapping to generate the human 3D joint heatmap.
For the third frame of the video: 1) input the current-frame 2D image and generate a shallow feature map with the shallow neural network module; 2) input the human 2D joint heatmap generated for the second frame, together with the shallow feature map generated for the current frame, into the LSTM module to generate a deep feature map; 3) output the deep feature map generated for the current frame to the residual module, which generates the human 2D joint heatmap of the current frame; 4) output the human 2D joint heatmap of the current frame to the 3D joint inference module, which performs the 2D-to-3D spatial mapping to generate the human 3D joint heatmap.
Superimposing the human 3D joint heatmaps generated for every frame produces the video stream of 3D human pose estimates.
For the first frame of the video, the hourglass module extracts the human 2D pose and generates the accurately predicted human 2D joint heatmap, taking 100 ms.
For the second and third frames of the video: the shallow neural network outputs the feature map of the single-frame image, taking 20 ms per frame; the LSTM module generates the deep feature map of the current frame from the human 2D joint heatmap generated by the hourglass network and the image feature map generated by the shallow neural network, taking 10 ms per frame; the residual module takes the current-frame deep feature map generated by the LSTM module as input and generates the human 2D joints, taking 10 ms per frame; the 3D joint inference module uses the 2D joints extracted by the hourglass module and the estimated depth to perform the 2D-to-3D mapping, taking 10 ms per frame.
That is, the 3D joint inference for the first frame of the video needs 120 ms, and every subsequent frame needs only 60 ms, so that the real-time efficiency of the estimation method is ensured while the precision of the 3D human pose estimation is guaranteed.
For human 2D pose estimation, the output structure of the neural network is processed iteratively, producing predictions at multiple processing stages. These intermediate predictions can be improved step by step to yield more accurate estimates. The "hourglass module" is exactly such a design: it cascades multiple predictions and gradually corrects the result.
The "hourglass module" described here is composed of primary modules (Residual Modules).
As shown in Fig. 4, the input of the primary module (Residual Module) is a feature map with M channels, and the output is a feature map with N channels.
The first row is the convolution road, consisting of three convolutional layers with different kernel scales. Each rounded rectangle represents one convolution operation whose parameters are written inside, in three rows: the number of input channels, the size of the convolution kernel, and the number of output channels.
The second row is the skip road, containing only one convolutional layer with kernel size 1. When the numbers of input and output channels of the skip road are the same, this road is an identity mapping.
The stride of all convolutional layers is 1, and the padding is chosen so that the height and width of the data are unchanged; only the depth (number of channels) of the data is changed.
The above primary module (Residual Module) is controlled by two parameters, the input depth M and the output depth N, and can operate on images of arbitrary size.
The primary module (Residual Module) extracts features of a higher level (the convolution road) while retaining the information of the original level (the skip road); it can be regarded as an advanced, size-preserving "convolution" layer.
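A minimal NumPy sketch of the two-road idea follows (1×1 convolutions only, so that shapes stay simple; the patent's module uses three kernel scales on its convolution road):

```python
import numpy as np

def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """1x1 convolution: mixes channels, leaves H and W untouched.
    x: (C_in, H, W); w: (C_out, C_in)."""
    c_in, h, wd = x.shape
    return (w @ x.reshape(c_in, h * wd)).reshape(w.shape[0], h, wd)

def primary_module(x: np.ndarray, w_conv: np.ndarray,
                   w_skip: np.ndarray) -> np.ndarray:
    """Two parallel roads: a convolution road extracting new features
    and a skip road preserving the original-level information."""
    return conv1x1(x, w_conv) + conv1x1(x, w_skip)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))         # M = 4 input channels
y = primary_module(x, rng.normal(size=(6, 4)), rng.normal(size=(6, 4)))
print(y.shape)  # -> (6, 8, 8): N = 6 channels, spatial size unchanged
```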
As shown in Fig. 5, the input of the first-order hourglass module is a feature map with M channels and the output is a feature map with N channels. Its upper half-way contains 3 primary modules (Residual) in series; between two adjacent primary modules, the number of input channels of the latter is always equal to the number of output channels of the former, so as to extract features of progressively deeper levels.
The lower half-way likewise extracts the data of M channels to obtain data of N channels, the difference being that this is carried out at half the original input size, with a 1/2 down-sampling pooling layer, 5 primary modules, and a nearest-neighbor interpolation up-sampling module in series.
Specifically, the upper half-way operates at the original scale, while the lower half-way first undergoes down-sampling (the rectangle marked "/2") and then up-sampling (the rectangle marked "*2").
The down-sampling module uses max pooling, and the up-sampling module uses nearest-neighbor interpolation.
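The two resolution operations on the lower half-way can be sketched in NumPy (a minimal illustration, not the patent's implementation):

```python
import numpy as np

def max_pool_half(x: np.ndarray) -> np.ndarray:
    """1/2 down-sampling via 2x2 max pooling. x: (C, H, W), H and W even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample_nearest(x: np.ndarray) -> np.ndarray:
    """x2 up-sampling via nearest-neighbor interpolation."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.arange(16, dtype=float).reshape(1, 4, 4)
down = max_pool_half(x)
up = upsample_nearest(down)
print(down.shape, up.shape)  # -> (1, 2, 2) (1, 4, 4)
```

Note that `upsample_nearest(max_pool_half(x))` restores the original spatial size, which is what lets the lower half-way's output be merged with the full-resolution upper half-way.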
The first-order hourglass network splits the input feature map of M channels into two roads: one branch is processed at the original scale, and the other at a lower scale; after processing on the respective branches is completed, the results are merged. This gives the neural network stronger recognition and expression abilities: it can better select among feature information of different scales and thus extract the essential features that influence the final result.
As shown in Fig. 6, the second-order hourglass network is obtained by replacing the dashed-box part of the first-order hourglass network with a first-order hourglass network (input channels 256, output channels N).
That is, the second-order hourglass network replaces the fourth primary module in the lower half-way of the first-order hourglass network with a first-order hourglass network.
In the second-order hourglass network, the lower half-way thus performs two down-samplings followed by two up-samplings.
On its down-sampling branch, the second-order hourglass network performs down-sampling of at most 1/4 relative to the original data size, highlighting the differences in scale information more than the first-order hourglass network does.
To further integrate information of different scales, the application may adopt an n-th order hourglass network, which undergoes at most n down-samplings; before each down-sampling, a separate half-way retains the information at the current scale; after each up-sampling, the data are added to the data of the previous scale; between two down-samplings, features are extracted with three primary modules; between two additions, features are extracted with one primary module (Residual). That is, an n-th order hourglass network can extract intermediate features from the original scale down to the 1/2^n scale.
An n-th order hourglass network (n ≥ 2) is obtained by replacing one primary module in the lower half-way of the (n−1)-th order hourglass network with an (n−1)-th order hourglass network; the other upper and lower half-way structures are identical to those of the (n−1)-th order hourglass network.
For the n-th order and (n−1)-th order hourglass networks, the position of the replaced primary module in the lower half-way may be the same or different. In the present embodiment, the replaced primary module in the lower half-way of both the n-th and the (n−1)-th order networks is the fourth one.
As shown in Fig. 7, the shallow neural network processes a single-frame image to extract image features. In this application, the shallow neural network uses VGG16 with the final fully connected layers and Soft-max layer removed.
The LSTM module is a particular form of RNN (recurrent neural network), RNN being the general name for a family of neural networks capable of processing sequence data.
In this application, the LSTM module is used to connect frame to frame: its inputs are the heatmap of the previous frame and the feature output by the shallow neural network for the current frame, and its output is the deep feature of the current frame.
As shown in the following formulas,
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t * C_{t−1} + i_t * C̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
f_t denotes the forget gate, which decides what information the LSTM module discards from the cell state. The forget gate reads h_{t−1} and x_t and outputs, for each number in the cell state C_{t−1}, a value between 0 and 1; 1 means "keep completely" and 0 means "discard completely".
i_t denotes the input gate, which decides what new information is stored in the cell state. It comprises two parts: first, a sigmoid layer called the "input gate layer" decides which values are to be updated; second, a tanh layer creates a vector of new candidate values C̃_t that can be added to the state.
o_t denotes the output gate. C_{t−1} is updated to C_t as follows: the old state is multiplied by f_t, discarding the information decided to be discarded, and then i_t * C̃_t is added, i.e., the new candidate values scaled by how much each state value is to be updated. Finally, h_t = o_t * tanh(C_t) gives the output.
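The gate formulas above can be stepped through in a minimal NumPy sketch (a single cell with random weights; shapes and names are illustrative, not the patent's implementation):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the gate formulas above.
    W: dict of (hidden, hidden+input) matrices; b: dict of bias vectors."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate f_t
    i = sigmoid(W["i"] @ z + b["i"])         # input gate i_t
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate values C~_t
    c = f * c_prev + i * c_tilde             # C_t = f*C_{t-1} + i*C~_t
    o = sigmoid(W["o"] @ z + b["o"])         # output gate o_t
    h = o * np.tanh(c)                       # h_t = o * tanh(C_t)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h.shape)  # -> (4,)
```

Because o_t lies in (0, 1) and tanh(C_t) in (−1, 1), every component of h_t is bounded in (−1, 1), which keeps the frame-to-frame deep feature numerically stable.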
The residual module is a kind of deep convolutional network; it is easier to optimize and can improve accuracy by adding considerable depth.
The residual module described here is the residual module commonly used in the prior art with its fully connected layers and Soft-max layer removed; the remaining modules carry out the learning of feature combinations.
The input of the residual module described here is the current-frame deep feature map that the LSTM module supplements from the preceding frames, and its output is the mathematical expression of the human 2D joints; it can therefore improve the overall operating efficiency of the estimation method while keeping the precision of the hourglass module.
As shown in Fig. 8, the human 3D joint inference module described here takes as input the 2D heatmap generated by the hourglass module and the intermediate-layer image features extracted by the shallow neural network, and predicts the joint depths. Its output is a P×1 vector representing the predicted depth of each joint; the P×P joint heatmap and the P×1 joint depth map are then combined to form the mathematical expression of the 3D human pose.
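A minimal NumPy sketch of that final combination step (heatmap peaks give (x, y), the depth vector supplies z; array shapes are illustrative assumptions):

```python
import numpy as np

def joints_3d(heatmaps: np.ndarray, depths: np.ndarray) -> np.ndarray:
    """heatmaps: (J, H, W), one 2D joint heatmap per joint;
    depths: (J,), predicted depth per joint.
    Returns a (J, 3) array of (x, y, z) joint coordinates."""
    n_joints, h, w = heatmaps.shape
    flat = heatmaps.reshape(n_joints, -1).argmax(axis=1)
    ys, xs = np.divmod(flat, w)        # peak location in each heatmap
    return np.stack([xs, ys, depths], axis=1).astype(float)

hm = np.zeros((2, 4, 4))
hm[0, 1, 2] = 1.0                      # joint 0 peaks at (x=2, y=1)
hm[1, 3, 0] = 1.0                      # joint 1 peaks at (x=0, y=3)
print(joints_3d(hm, np.array([0.5, 1.5])))
# -> [[2.  1.  0.5]
#     [0.  3.  1.5]]
```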
3D joint inference can obtain depth information from a single RGB picture by deep learning. Such methods are built on large-scale object databases, e.g., face databases and scene databases. First, feature extraction (including brightness, depth, texture, geometry and mutual positions) is carried out by learning on each object in the database; then a probability function is established over the features; finally, the degree of similarity between the object to be reconstructed and similar objects in the database is expressed as a probability, the object depth with the maximum probability is taken to reconstruct the depth of the object, and 3D reconstruction is carried out in combination with texture mapping or interpolation.
The 3D joint inference used in this application takes the features extracted by the preceding modules, predicts the human joint depth information of the 2D picture with a deep learning model, and combines it with the human 2D joints generated by the previous stage to generate the human 3D joints.
Unlike the prior art, the described method for 3D human pose estimation based on a video stream uses deep learning to perform human 3D pose estimation on a video stream. The method mainly comprises the following parts:
1. Generation of the 3D human pose model
Using the hourglass module and the human 3D joint inference module, a 3D human pose estimation model is established. The model is divided into two parts: the first part is a Generator network that generates the 3D pose of the human body; the second part is a Discriminator network that judges the quality of the pose generated by the Generator. Through the interaction of the two networks, their performance is mutually promoted, finally obtaining a highly accurate 3D human pose.
2. Establishment of the spatial relationships of the joints
Using the shallow neural network and the residual module, the above 3D human pose model is optimized through the establishment of spatial relationships, so as to learn the spatial configuration information of the joints.
A Dropout Autoencoder (DAE) component, based on a denoising autoencoder, can be used to learn representations robust to noisy data, extending the architecture to infer the spatial configuration of the human skeleton more clearly. A dropout layer is introduced directly after the input layer; its effect is to remove completely random joints from the skeleton, rather than simply perturbing their positions and angles. The only way to recover the complete pose is then to reconstruct the missing joint angle information by inference from the adjacent joints.
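The joint-dropping behaviour of such a dropout layer can be sketched as follows (NumPy; the NaN-masking scheme here is an illustrative assumption, not the patent's implementation):

```python
import numpy as np

def drop_joints(skeleton: np.ndarray, drop_prob: float,
                rng: np.random.Generator) -> np.ndarray:
    """Remove completely random joints (set to NaN) from a skeleton of
    shape (J, 3); the autoencoder must reconstruct them from neighbours."""
    mask = rng.random(skeleton.shape[0]) < drop_prob
    out = skeleton.astype(float).copy()
    out[mask] = np.nan
    return out

rng = np.random.default_rng(42)
skel = np.arange(15, dtype=float).reshape(5, 3)   # 5 joints, (x, y, z)
noisy = drop_joints(skel, 0.4, rng)
print(np.isnan(noisy).any(axis=1))  # which joints were removed
```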
3. Capture of the temporal correlation between video frames
The LSTM module is used to learn the continuity between video frames, thereby learning the information of the temporal dimension.
Human pose estimation on a single image can be achieved with multi-stage convolutional neural networks (CNNs). Although these models perform excellently on static images, applying them to video is not only computation-intensive but also affected by performance degradation and flicker.
In this application, a new recurrent network is proposed to solve the above problems. By imposing weight sharing, the multi-stage CNN can be rewritten as a recurrent neural network (RNN), dramatically speeding up inference on video. Between video frames, long short-term memory (LSTM) units are used; they are highly effective in enforcing geometric consistency between frames, handle input-quality degradation in video well, and successfully stabilize the sequential output.
It should be understood that those of ordinary skill in the art can make modifications or changes according to the above description, and all such modifications and changes shall fall within the protection scope of the appended claims of the present invention.
Claims (3)
1. A method for three-dimensional human body pose estimation based on a video stream, characterized in that it comprises the following steps:
For the first video frame: 1) input the two-dimensional image of the current frame and extract the two-dimensional human pose with an hourglass network module, generating the two-dimensional human joint heat map of the first frame; 2) output the two-dimensional human joint heat map of the current frame to a three-dimensional joint inference module, which performs a two-dimensional-to-three-dimensional spatial mapping to generate a three-dimensional human joint heat map;
For the second video frame: 1) input the two-dimensional image of the current frame and generate a shallow image feature map with a shallow neural network module; 2) input the two-dimensional human joint heat map generated for the first frame, together with the shallow image feature map of the current frame, into an LSTM module to generate a deep feature map; 3) output the deep image feature map of the current frame to a residual module, generating the two-dimensional human joint heat map of the current frame; 4) output the two-dimensional human joint heat map of the current frame to the three-dimensional joint inference module, which performs the two-dimensional-to-three-dimensional spatial mapping to generate a three-dimensional human joint heat map;
For the n-th video frame (n ≥ 2): 1) input the two-dimensional image of the current frame and generate a shallow image feature map with the shallow neural network module; 2) input the two-dimensional human joint heat map generated for the (n-1)-th frame, together with the shallow image feature map of the current frame, into the LSTM module to generate a deep feature map; 3) output the deep image feature map of the current frame to the residual module, generating the two-dimensional human joint heat map of the current frame; 4) output the two-dimensional human joint heat map of the current frame to the three-dimensional joint inference module, which performs the two-dimensional-to-three-dimensional spatial mapping to generate a three-dimensional human joint heat map;
The three-dimensional human joint heat maps generated for each frame above are stacked in sequence to produce the video stream of the three-dimensional human pose estimate.
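The per-frame control flow recited in claim 1 can be sketched minimally as follows, with the five networks passed in as opaque callables. The function and parameter names (`estimate_pose_stream`, `hourglass`, `shallow_net`, `lstm_cell`, `residual`, `lift_2d_to_3d`) are illustrative placeholders, not names from the patent:

```python
def estimate_pose_stream(frames, hourglass, shallow_net, lstm_cell,
                         residual, lift_2d_to_3d):
    """Run the claimed pipeline over a video stream.

    frames        -- iterable of 2-D images
    hourglass     -- full 2-D pose extractor, used only on the first frame
    shallow_net   -- shallow feature extractor, used on frames n >= 2
    lstm_cell     -- fuses the previous frame's 2-D joint heat map with the
                     current frame's shallow features into a deep feature map
    residual      -- turns the deep feature map into a 2-D joint heat map
    lift_2d_to_3d -- maps a 2-D joint heat map to a 3-D joint heat map
    """
    poses_3d = []
    heatmap_2d = None  # 2-D joint heat map of the previous frame
    for n, frame in enumerate(frames, start=1):
        if n == 1:
            # First frame: the hourglass network extracts the 2-D pose.
            heatmap_2d = hourglass(frame)
        else:
            # Frame n >= 2: shallow features + previous heat map -> LSTM.
            shallow = shallow_net(frame)
            deep = lstm_cell(heatmap_2d, shallow)
            heatmap_2d = residual(deep)
        # Every frame ends with the 2-D -> 3-D spatial mapping.
        poses_3d.append(lift_2d_to_3d(heatmap_2d))
    return poses_3d
```

Note how the hourglass network is paid for only once: from the second frame on, the cheaper shallow network plus the LSTM's memory of the previous heat map replace it, which is what makes the method suitable for streaming input.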
2. The method for three-dimensional human body pose estimation based on a video stream according to claim 1, characterized in that:
the first-order hourglass network comprises the following parallel structure:
an upper half-path comprising several basic modules with M input channels and N output channels;
a lower half-path comprising, in series, a 1/2 down-sampling pooling layer, several basic modules, and a nearest-neighbor interpolation up-sampling module;
the n-th-order hourglass network (n ≥ 2) is obtained by replacing any basic module on the lower half-path of the (n-1)-th-order hourglass network with an (n-1)-th-order hourglass network.
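The recursive construction in claim 2 can be sketched structurally as below. Nested dicts stand in for network modules purely to show the wiring; the names `basic_module` and `hourglass` are illustrative, not the patent's code:

```python
def basic_module(m_in, n_out):
    """A basic module with M input channels and N output channels."""
    return {"type": "basic", "in": m_in, "out": n_out}

def hourglass(order, m_in, n_out):
    """Build an order-n hourglass per the recursive definition of claim 2."""
    # Upper half-path: basic modules kept at full resolution (parallel branch).
    upper = [basic_module(m_in, n_out)]
    # Lower half-path: 1/2 pooling -> inner modules -> nearest-neighbour upsampling.
    if order == 1:
        inner = basic_module(m_in, n_out)
    else:
        # Order n (n >= 2): a basic module on the (n-1)-order lower half-path
        # is replaced by an order-(n-1) hourglass, giving the nesting.
        inner = hourglass(order - 1, m_in, n_out)
    lower = [{"type": "pool_1/2"}, inner, {"type": "upsample_nn"}]
    return {"type": "hourglass", "order": order, "upper": upper, "lower": lower}
```

Unrolling `hourglass(n, ...)` therefore yields n nested pool/upsample pairs, i.e. features are processed at n successively coarser resolutions while the upper half-paths preserve detail at each scale.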
3. The method for three-dimensional human body pose estimation based on a video stream according to claim 2, characterized in that:
the basic module (Residual) has an M-channel input and an N-channel output;
the basic module (Residual) comprises the following parallel structure:
the first branch is a convolution path, formed of three convolutional layers with different kernel scales connected in series;
the second branch is a skip path, comprising a single convolutional layer with a kernel scale of 1 that matches the input channels to the output channel number.
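A structural sketch of the basic (Residual) module of claim 3. The claim fixes only the shape of the two parallel branches; the specific 1-3-1 kernel sizes and the halved bottleneck width in the convolution path are assumptions borrowed from the standard hourglass residual block, not stated in the claim:

```python
def conv(kernel, c_in, c_out):
    return {"type": "conv", "kernel": kernel, "in": c_in, "out": c_out}

def residual_module(m_in, n_out):
    """Basic (Residual) module: M-channel input, N-channel output."""
    mid = n_out // 2  # assumed bottleneck width, as in the standard block
    # Convolution path: three serial convolutions with different kernel scales.
    conv_path = [conv(1, m_in, mid), conv(3, mid, mid), conv(1, mid, n_out)]
    # Skip path: one kernel-scale-1 convolution aligning the channel counts
    # (reading assumed from the standard hourglass residual; the claim's
    # wording on the skip path's channels is ambiguous).
    skip_path = [conv(1, m_in, n_out)]
    # Parallel structure: the module's output is the element-wise sum of the
    # two paths, so both must end with N output channels.
    return {"type": "residual", "in": m_in, "out": n_out,
            "paths": [conv_path, skip_path]}
```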
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811080931.1A CN109271933B (en) | 2018-09-17 | 2018-09-17 | Method for estimating three-dimensional human body posture based on video stream |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109271933A true CN109271933A (en) | 2019-01-25 |
CN109271933B CN109271933B (en) | 2021-11-16 |
Family
ID=65189536
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811080931.1A Active CN109271933B (en) | 2018-09-17 | 2018-09-17 | Method for estimating three-dimensional human body posture based on video stream |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109271933B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107392097A (en) * | 2017-06-15 | 2017-11-24 | 中山大学 | Method for locating three-dimensional human body joint points in a monocular color video |
CN108197547A (en) * | 2017-12-26 | 2018-06-22 | 深圳云天励飞技术有限公司 | Face pose estimation method, device, terminal and storage medium |
US20180204111A1 (en) * | 2013-02-28 | 2018-07-19 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
2018-09-17: CN application CN201811080931.1A granted as patent CN109271933B (status: Active)
Non-Patent Citations (2)
Title |
---|
Yue Luo et al.: "LSTM Pose Machines", https://arxiv.org/abs/1712.06316 * |
Li Runshun: "Research on Three-Dimensional Object Recognition Algorithms Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology series * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109821239B (en) * | 2019-02-20 | 2024-05-28 | 网易(杭州)网络有限公司 | Method, device, equipment and storage medium for realizing somatosensory game |
CN109821239A (en) * | 2019-02-20 | 2019-05-31 | 网易(杭州)网络有限公司 | Method, device, equipment and storage medium for implementing a somatosensory game |
CN109949368A (en) * | 2019-03-14 | 2019-06-28 | 郑州大学 | Human body three-dimensional pose estimation method based on image retrieval |
CN109949368B (en) * | 2019-03-14 | 2020-11-06 | 郑州大学 | Human body three-dimensional attitude estimation method based on image retrieval |
CN110472532A (en) * | 2019-07-30 | 2019-11-19 | 中国科学院深圳先进技术研究院 | Video object behavior recognition method and apparatus |
CN110427877A (en) * | 2019-08-01 | 2019-11-08 | 大连海事大学 | Method for human body three-dimensional posture estimation based on structural information |
CN110427877B (en) * | 2019-08-01 | 2022-10-25 | 大连海事大学 | Human body three-dimensional posture estimation method based on structural information |
CN110751039B (en) * | 2019-09-18 | 2023-07-25 | 平安科技(深圳)有限公司 | Multi-view 3D human body posture estimation method and related device |
CN110751039A (en) * | 2019-09-18 | 2020-02-04 | 平安科技(深圳)有限公司 | Multi-view 3D human body posture estimation method and related device |
CN110619310A (en) * | 2019-09-19 | 2019-12-27 | 北京达佳互联信息技术有限公司 | Human skeleton key point detection method, device, equipment and medium |
CN110826459A (en) * | 2019-10-31 | 2020-02-21 | 上海交通大学 | Migratable campus violent behavior video identification method based on attitude estimation |
CN110826459B (en) * | 2019-10-31 | 2022-09-30 | 上海交通大学 | Migratable campus violent behavior video identification method based on attitude estimation |
CN110991319A (en) * | 2019-11-29 | 2020-04-10 | 广州市百果园信息技术有限公司 | Hand key point detection method, gesture recognition method and related device |
CN110991319B (en) * | 2019-11-29 | 2021-10-19 | 广州市百果园信息技术有限公司 | Hand key point detection method, gesture recognition method and related device |
US20230126178A1 (en) * | 2020-02-13 | 2023-04-27 | Northeastern University | Light-Weight Pose Estimation Network With Multi-Scale Heatmap Fusion |
WO2021163103A1 (en) * | 2020-02-13 | 2021-08-19 | Northeastern University | Light-weight pose estimation network with multi-scale heatmap fusion |
CN111401230A (en) * | 2020-03-13 | 2020-07-10 | 深圳市商汤科技有限公司 | Attitude estimation method and apparatus, electronic device, and storage medium |
CN111401230B (en) * | 2020-03-13 | 2023-11-28 | 深圳市商汤科技有限公司 | Gesture estimation method and device, electronic equipment and storage medium |
CN111695457A (en) * | 2020-05-28 | 2020-09-22 | 浙江工商大学 | Human body posture estimation method based on weak supervision mechanism |
CN111695457B (en) * | 2020-05-28 | 2023-05-09 | 浙江工商大学 | Human body posture estimation method based on weak supervision mechanism |
CN111767847A (en) * | 2020-06-29 | 2020-10-13 | 佛山市南海区广工大数控装备协同创新研究院 | Pedestrian multi-target tracking method integrating target detection and association |
CN111898566B (en) * | 2020-08-04 | 2023-02-03 | 成都井之丽科技有限公司 | Attitude estimation method, attitude estimation device, electronic equipment and storage medium |
CN111898566A (en) * | 2020-08-04 | 2020-11-06 | 成都井之丽科技有限公司 | Attitude estimation method, attitude estimation device, electronic equipment and storage medium |
US11380121B2 (en) | 2020-08-25 | 2022-07-05 | Sony Group Corporation | Full skeletal 3D pose recovery from monocular camera |
CN112215160A (en) * | 2020-10-13 | 2021-01-12 | 厦门大学 | Video three-dimensional human body posture estimation algorithm using long-term and short-term information fusion |
CN112215160B (en) * | 2020-10-13 | 2023-11-24 | 厦门大学 | Video three-dimensional human body posture estimation algorithm utilizing long-short period information fusion |
CN112509123A (en) * | 2020-12-09 | 2021-03-16 | 北京达佳互联信息技术有限公司 | Three-dimensional reconstruction method and device, electronic equipment and storage medium |
CN112767534A (en) * | 2020-12-31 | 2021-05-07 | 北京达佳互联信息技术有限公司 | Video image processing method and device, electronic equipment and storage medium |
CN112767534B (en) * | 2020-12-31 | 2024-02-09 | 北京达佳互联信息技术有限公司 | Video image processing method, device, electronic equipment and storage medium |
CN113469136B (en) * | 2021-07-28 | 2024-05-14 | 大连海事大学 | Method for identifying turbine employee monitoring based on improved LSTM-VGG16 deep neural network structure |
CN113469136A (en) * | 2021-07-28 | 2021-10-01 | 大连海事大学 | Method for identifying work monitoring of turbine crew based on improved LSTM-VGG16 deep neural network structure |
CN114926860A (en) * | 2022-05-12 | 2022-08-19 | 哈尔滨工业大学 | Three-dimensional human body attitude estimation method based on millimeter wave radar |
CN114926860B (en) * | 2022-05-12 | 2024-08-09 | 哈尔滨工业大学 | Three-dimensional human body posture estimation method based on millimeter wave radar |
WO2024083100A1 (en) * | 2022-10-17 | 2024-04-25 | Alibaba Damo (Hangzhou) Technology Co., Ltd. | Method and apparatus for talking face video compression |
Also Published As
Publication number | Publication date |
---|---|
CN109271933B (en) | 2021-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109271933A (en) | Method for estimating three-dimensional human body posture based on video stream | |
CN111369681B (en) | Three-dimensional model reconstruction method, device, equipment and storage medium | |
CN110443842B (en) | Depth map prediction method based on visual angle fusion | |
CN105654492B (en) | Robust real-time three-dimensional reconstruction method based on consumer-grade cameras | |
CN111047548B (en) | Attitude transformation data processing method and device, computer equipment and storage medium | |
CN111311685B (en) | Motion scene reconstruction unsupervised method based on IMU and monocular image | |
CN110728219B (en) | 3D face generation method based on multi-column multi-scale graph convolution neural network | |
CN110889343B (en) | Crowd density estimation method and device based on attention type deep neural network | |
CN111275518A (en) | Video virtual fitting method and device based on mixed optical flow | |
CN110140147A (en) | Video frame synthesis with deep learning | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN109299685A (en) | Inference network and method for estimating 3D coordinates of human joints | |
CN114339409B (en) | Video processing method, device, computer equipment and storage medium | |
CN114663509B (en) | Self-supervision monocular vision odometer method guided by key point thermodynamic diagram | |
CN107977930A (en) | Image super-resolution method and system | |
JP2023545189A (en) | Image processing methods, devices, and electronic equipment | |
CN111462274A (en) | Human body image synthesis method and system based on the SMPL model | |
CN111738092B (en) | Method for recovering occluded human body posture sequence based on deep learning | |
Zhang et al. | Unsupervised multi-view constrained convolutional network for accurate depth estimation | |
CN113706670A (en) | Method and device for generating dynamic three-dimensional human body mesh model sequence | |
CN116363308A (en) | Human body three-dimensional reconstruction model training method, human body three-dimensional reconstruction method and equipment | |
CN112380764A (en) | End-to-end rapid reconstruction method for gas scene under limited view | |
CN117274446A (en) | Scene video processing method, device, equipment and storage medium | |
CN111311732A (en) | 3D human body grid obtaining method and device | |
CA3177593A1 (en) | Transformer-based shape models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||