CN110826500B - Method for estimating 3D human body posture based on antagonistic network of motion link space - Google Patents


Publication number
CN110826500B
CN110826500B · CN201911085729.2A
Authority
CN
China
Prior art keywords
human body
human
coordinates
skeleton
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911085729.2A
Other languages
Chinese (zh)
Other versions
CN110826500A (en)
Inventor
薛裕明
谢军伟
李�根
罗鸣
童同
高钦泉
Current Assignee
Fujian Imperial Vision Information Technology Co ltd
Original Assignee
Fujian Imperial Vision Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Fujian Imperial Vision Information Technology Co ltd filed Critical Fujian Imperial Vision Information Technology Co ltd
Priority to CN201911085729.2A priority Critical patent/CN110826500B/en
Publication of CN110826500A publication Critical patent/CN110826500A/en
Application granted granted Critical
Publication of CN110826500B publication Critical patent/CN110826500B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a method for estimating 3D human body pose based on an adversarial network in kinematic chain space. A convolutional neural network is used to estimate the three-dimensional coordinates of human key nodes from images collected by a monocular device: taking a monocular RGB image as input and adopting kinematic chain space and adversarial network techniques, the method alleviates over-fitting and improves the accuracy and precision of 3D human pose estimation.

Description

Method for estimating 3D human body posture based on antagonistic network of motion link space
Technical Field
The invention relates to image content understanding, and in particular to a method for estimating 3D human body pose based on an adversarial network in kinematic chain space.
Background
The current artificial intelligence technology brings huge breakthroughs in the fields of image content understanding, video enhancement, voice recognition and the like. Especially in the image content understanding, the 3D human body posture recognition technology has high application value in the fields of rehabilitation, video monitoring, advanced human-computer interaction and the like.
3D human body pose estimation refers to techniques for predicting the three-dimensional coordinates of a human pose from monocular or multi-view images. It can be roughly classified into the following three methods:
The first method uses mathematical computation or machine learning to build a spatial coordinate system from information such as the relative positions and shooting angles of multi-view cameras, predicts the corresponding depth map, and estimates a 2D image from an arbitrary angle. Its disadvantages are that it requires images collected by multi-view cameras and that the placement of the acquisition devices cannot be changed.
The second method uses only a single acquisition device: 2D human pose coordinates are first computed from a single image, and the corresponding 3D pose is then estimated by simple matrix multiplication or a lightweight learned network. However, without the original image as input, spatial information may be lost, so the accuracy of the 3D coordinates is poor; moreover, because this method relies only on the 2D pose input, its errors are amplified during 3D estimation.
The third method learns the end-to-end mapping from monocular RGB images to 3D coordinates with deep learning. Compared with the former two methods, it clearly improves both efficiency and performance.
Although 3D human pose estimation has made progress, additional acquisition-device information is still required, and deep neural networks are prone to over-fitting.
Therefore, the invention takes only a monocular RGB image as input and adopts kinematic chain space together with adversarial network techniques, which not only alleviates over-fitting but also improves the precision and accuracy of 3D human pose estimation.
Disclosure of Invention
The invention aims to provide a method for estimating 3D human body pose based on an adversarial network in kinematic chain space, which uses a convolutional neural network to estimate the three-dimensional coordinates of human key nodes from images collected by a monocular device, improving the accuracy and precision of 3D human pose estimation.
To achieve this purpose, the technical scheme of the invention is as follows. A method for estimating a 3D human body pose based on an adversarial network of kinematic chain space comprises the following steps:
s1, collecting a human body color image I by adopting monocular equipment, then carrying out image normalization, and labeling by utilizing 2D and 3D human body data sets to respectively obtainTaking 2D human body skeleton coordinate P and 3D human body skeleton coordinate M epsilon R 3×n (ii) a Adopting the original image and the human skeleton coordinate to carry out mirror image and cutting, and carrying out image data augmentation;
s2, generating a network by the 3D human body skeleton coordinates: weak supervision generation is adopted to resist network learning to solve the problem of data overfitting, wherein the following calculation formula is adopted in the feature extraction stage:
F=R(BN(W 1 *I g +B 1 )) (1)
wherein R represents a nonlinear activation function LeakyRelu, W 1 ,B 1 Respectively representing the weights and offsets of the convolutional layers in the feature extraction stage, BN representing the normalization function, I g Representing an input picture, and F representing an output result obtained in the characteristic extraction stage; then, the 3D human skeleton coordinates are obtained through the convolution block, the remodeling module and the two full-connection layers respectively;
s3, estimating a camera coordinate parameter K epsilon R by adopting a convolutional neural network 2×3 To assist in back projecting the layers;
s4, generating a 3D human body skeleton coordinate generated by a network based on the 3D human body skeleton coordinate obtained by labeling in the S1 and the 3D human body skeleton coordinate generated in the S2, calculating a link angle and a link length of a human body skeleton by adopting a Wassertein GAN discriminator of a motion link space, and simultaneously fusing and inputting the input image and the 3D human body skeleton coordinate into a convolutional neural network so as to improve the accuracy of a human body structure, namely the generation of the 3D human body skeleton coordinate;
s5, through a back projection layer, based on the camera coordinate parameter K belonging to R calculated in the step S3 2×3 Converting the 3D human skeleton coordinates into 2D human skeleton coordinates;
P'=KM (2)
wherein P' is the predicted 2D human skeletal coordinates;
s6, predicting a loss function of the key nodes of the 3D human body posture, wherein M belongs to R 3×n Representing 3D human skeleton coordinates, i.e. 3D human pose key node position, coordinate m i (x, y, z) represents one of the key node positions of the human body, i =1, \8230N, and carrying out reshape operation on the last output layer so as to obtain the 3D human body coordinate;
s7, a gradual training strategy: dividing the training process into a plurality of preset sub-training periods, and adopting a stepping increasing strategy to train the sub-training periods in sequence; when training is started, the original image is zoomed into a small picture and training is started with a large learning rate, and after each sub-training period is completed, the color original image is gradually increased and the learning rate is gradually reduced; when the 3D human skeleton coordinate generated after completing one sub-training period and the corresponding calibration data have large entries, continuing to perform backward propagation, updating the convolution weight parameter and the bias parameter by using a gradient descent optimization algorithm, and then executing the step S2; and when the 3D human body bone coordinates generated after one sub-training period is finished reach the expected times or all preset sub-training periods are finished, obtaining the final result.
In an embodiment of the present invention, the loss function of the 3D human pose key nodes is equal to:

W(P_r, P_g) + λ·L_cam

W(P_r, P_g) = sup_{‖f‖_L ≤ 1} E_{M∼P_r}[f(M)] − E_{M∼P_g}[f(M)]

where W(P_r, P_g) denotes the WGAN loss, whose input has two parts: P_g denotes a batch of generated data (containing images and the correspondingly generated 3D human skeleton coordinates) and P_r denotes a batch of real data (containing images and the corresponding real labeled 3D human skeleton coordinates); E_{M∼P_r}[f(M)] denotes the loss value of samples discriminated as real 3D human skeletons and E_{M∼P_g}[f(M)] the loss value of samples discriminated as generated; ‖f‖_L ≤ 1 means that the Lipschitz constant of the function f must not exceed 1, and the supremum of E_{M∼P_r}[f(M)] − E_{M∼P_g}[f(M)] is taken over all f satisfying this condition; L_cam denotes the loss function of the camera estimation network, λ is taken in [0, 1], trace(·) computes the trace of the corresponding matrix, ‖·‖_F is the Frobenius norm, K ∈ R^(2×3), and I_2 is the 2×2 identity matrix.
Compared with the prior art, the invention has the following beneficial effects:
The innovation of the method is mainly embodied in two aspects. First, a deep neural network model generates the 3D human skeleton in a weakly supervised manner; the generation is accurate and effective and can satisfy most human motion analysis requirements. Second, 3D coordinates are for the first time fused with the images, and a KCS network layer is introduced into the discriminator network, upgrading the discriminator and greatly assisting the generation of the 3D structure. With the generative adversarial network, the KCS network layer, and the camera back-projection network as auxiliary means, the invention provides an accurate and reliable method for estimating 3D human body pose.
Drawings
FIG. 1 is the network structure of the 3D human skeleton coordinate generation part of the method of the present invention;
FIG. 2 is the camera estimation network structure of the method of the present invention;
FIG. 3 is the discriminator part of the method of the present invention;
FIG. 4 is the basic flow chart of the method of the present invention;
FIG. 5 is an effect diagram of the method of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to FIGS. 1-5.
As shown in fig. 4, the method of the present invention aims to estimate the three-dimensional coordinates of human key nodes from images collected by a monocular device using a convolutional neural network, and specifically comprises the following steps:
step 1:
To train the model, a large number of color human body images are selected as the input I, image normalization is performed, and the 2D and 3D human body data sets are used for labeling, giving the 2D and 3D coordinates P and M for each human body. The color originals and their labels are mirror-flipped, and brightness, hue, and saturation are randomly perturbed, yielding a large amount of augmented image data that is stored as matched data pairs to serve as the training data set for deep learning. At the same time, the 2D coordinates P = (p_1, p_2, …, p_n) and the 3D coordinates M = (m_1, m_2, …, m_n), M ∈ R^(3×n), on the training set are normalized, which further improves the convergence rate and precision of the model and prevents gradient explosion.
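For illustration, the mirror part of this augmentation can be sketched in numpy; the image size and key-node coordinates below are invented for the example, and a full implementation would also swap left/right joint labels after flipping:

```python
import numpy as np

def mirror_augment(img, P):
    # Horizontally mirror a colour image and the matching 2D skeleton
    # coordinates P (2 x n, pixel units). Brightness/saturation jitter and
    # the left/right joint-label swap are omitted for brevity.
    H, W = img.shape[:2]
    img_m = img[:, ::-1].copy()          # flip columns
    P_m = P.copy()
    P_m[0] = (W - 1) - P_m[0]            # flip x-coordinates only
    return img_m, P_m

img = np.zeros((4, 6, 3))                # toy 4x6 "colour image"
P = np.array([[0.0, 5.0],                # x-coordinates of two key nodes
              [1.0, 2.0]])               # y-coordinates
img_m, P_m = mirror_augment(img, P)
print(P_m[0])                            # → [5. 0.]
```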
Step 2:
The generator, part 1: the 3D human skeleton coordinate generation network. Compared with traditional methods, the invention adopts weakly-supervised generative adversarial network learning to solve the problem of data over-fitting; the specific steps are as follows:
The feature extraction stage consists of a convolutional layer, a batch normalization layer, and a LeakyReLU activation function, computed as:

F = R(BN(W1 * Ig + B1))   (1)

where R denotes the nonlinear activation function LeakyReLU, W1 and B1 denote the weights and biases of the convolutional layers in the feature extraction stage, BN denotes the normalization function, Ig denotes the input picture, and F denotes the output of the feature extraction stage; the corresponding 3D human skeleton coordinates are then obtained through a convolution block, a reshaping (flatten) module, and two fully connected layers;
Step 3:
The generator, part 2: to improve the accuracy of the estimated human pose, the invention uses a convolutional neural network to estimate the camera parameter matrix K ∈ R^(2×3). Its purpose is to assist the back-projection layer: the 3D human skeleton coordinates are back-projected to the corresponding 2D skeleton coordinates and compared with the 2D coordinates of the original input image, and the back-projection loss is computed, which prevents over-fitting. As a projection, K must have the following property:

K K^T = s^2 I_2   (2)

where s is the scaling factor of the projection and I_2 is the 2×2 identity matrix. Since s is an unknown quantity, the invention assigns to it the largest singular value of the K matrix:

s = σ_max(K)   (3)

The loss function of the camera estimation network is:

L_cam = ‖ (1/s^2) K K^T − I_2 ‖_F   (4)

where trace(·) computes the trace of the corresponding matrix (when (2) holds exactly, s^2 = trace(K K^T)/2), ‖·‖_F is the Frobenius norm, and K ∈ R^(2×3).
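Under this reading (s taken as the largest singular value of K), the camera loss can be sketched in a few lines of numpy; the example matrices are invented for illustration:

```python
import numpy as np

def camera_loss(K):
    # L_cam = || K K^T / s^2 - I_2 ||_F, with s the largest singular value of K
    s = np.linalg.svd(K, compute_uv=False)[0]
    return np.linalg.norm(K @ K.T / s**2 - np.eye(2), ord='fro')

# a perfect scaled projection (K K^T = s^2 I_2) gives a near-zero loss
K_good = 2.5 * np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
# a skewed matrix violates the property and is penalised
K_bad = np.array([[1.0, 0.2, 0.0],
                  [0.7, 1.0, 0.0]])
print(camera_loss(K_good))   # near 0
print(camera_loss(K_bad))    # clearly positive
```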
By training the network shown in fig. 2, i.e. obtaining the back-projection matrix K, the 3D human skeleton coordinates are converted to 2D skeleton coordinates:

P' = KM   (5)
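The back-projection itself is a single matrix product; a small numpy sketch with an invented camera matrix and joint count:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16                                   # assumed number of skeleton key nodes
M = rng.standard_normal((3, n))          # 3D skeleton coordinates, M in R^(3×n)
K = np.array([[1.0, 0.0, 0.1],           # invented camera matrix K in R^(2×3)
              [0.0, 1.0, 0.2]])
P_proj = K @ M                           # Eq. (5): P' = KM
print(P_proj.shape)                      # → (2, 16)
```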
Step 4:
The discriminator: as shown in fig. 3, to judge the accuracy of the generated human structure, the invention uses a Wasserstein GAN [1] discriminator in kinematic chain space [2] (KCS: kinematic chain space) for a more reasonable calculation of link angles and lengths. At the same time, the input image and the 3D human skeleton are fused and fed into the convolutional neural network, adding the feature of whether the 3D skeleton fits the original image.
The KCS layer is a network layer introduced by the invention that improves the representation of the human pose. The KCS matrix is an important representation of human pose, containing joint links and bone lengths. A bone b_k can be represented as the link between the r-th and t-th nodes:

b_k = p_r − p_t = Mc   (6)

c = (0, …, 0, 1, 0, …, 0, −1, 0, …, 0)^T   (7)

where the entry at position r is 1 and the entry at position t is −1. The complete human skeleton is then defined as:

B = (b_1, b_2, …, b_n)   (8)

Concatenating the corresponding c vectors into a matrix C, B can be written as:

B = MC   (9)

The KCS matrix is computed as:

Ψ = B^T B   (10)

By adding the Ψ matrix to the network layer, the squared length of each bone appears on the diagonal, while the off-diagonal entries encode the angles between pairs of bones. Compared with the Euclidean-distance matrices used in other methods, this algorithm is a pure matrix operation, which effectively improves the computation speed; this part is mainly used to extract bone features and to judge implausibly constructed bones as fast as possible.
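Eqs. (6)-(10) can be sketched directly in numpy; the 4-joint chain and bone indices below are invented for the example:

```python
import numpy as np

def kcs_matrix(M, bones):
    # Build C column by column (Eqs. 6-7): entry r is +1, entry t is -1,
    # so B = M C stacks the bone vectors b_k = p_r - p_t (Eqs. 8-9).
    n = M.shape[1]
    C = np.zeros((n, len(bones)))
    for k, (r, t) in enumerate(bones):
        C[r, k] = 1.0
        C[t, k] = -1.0
    B = M @ C
    return B.T @ B                       # Eq. (10): Psi = B^T B

# toy 4-joint chain, three bones of unit length
M = np.array([[0.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 2.0, 2.0],
              [0.0, 0.0, 0.0, 0.0]])    # columns are joint positions p_i
bones = [(0, 1), (1, 2), (2, 3)]
Psi = kcs_matrix(M, bones)
print(np.diag(Psi))                     # squared bone lengths → [1. 1. 1.]
```

The diagonal of Ψ holds the squared bone lengths, and an off-diagonal entry Ψ[i, j] is the inner product of bones i and j, from which their angle follows.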
To add the feature of whether the 3D skeleton fits the original image, the invention adds a second input: the original image and the 3D skeleton are combined as input, and features are extracted through a convolutional neural network. Specifically, the newly added 3D part is initialized as a width × height × depth floating-point matrix with all initial values 0.5, where the width and height equal those of the original image and the depth is the maximum depth value of the 3D human body; each point of the input 3D human body is then assigned the value 1.0, as shown in fig. 3.
The invention concatenates the features extracted by the two parts and adds two fully connected layers, each containing 90 neurons, to the following network. Finally, a judgment is made as to whether the 3D skeleton coordinates are real or generated.
Step 5:
Loss function: the loss function for predicting the 3D human pose key nodes is W(P_r, P_g) + λ·L_cam, where M ∈ R^(3×n) denotes the 3D pose key node positions, the coordinate m_i = (x, y, z) denotes one key node of the human body, and a reshape operation is performed at the last output layer to obtain the 3D human coordinates. The discriminator part adopts the Wasserstein loss [1]:

W(P_r, P_g) = sup_{‖f‖_L ≤ 1} E_{M∼P_r}[f(M)] − E_{M∼P_g}[f(M)]

where W(P_r, P_g) denotes the WGAN loss, whose input has two parts: P_g denotes a batch of generated data and P_r denotes a batch of real data; E_{M∼P_r}[f(M)] denotes the loss value of samples discriminated as real 3D human skeletons and E_{M∼P_g}[f(M)] the loss value of samples discriminated as generated; ‖f‖_L ≤ 1 means that the Lipschitz constant of the function f must not exceed 1, and the supremum of E_{M∼P_r}[f(M)] − E_{M∼P_g}[f(M)] is taken over all f satisfying this condition.

The loss function of the camera estimation network is:

L_cam = ‖ (1/s^2) K K^T − I_2 ‖_F

where trace(·) computes the trace of the corresponding matrix (when (2) holds exactly, s^2 = trace(K K^T)/2), ‖·‖_F is the Frobenius norm, K ∈ R^(2×3), and I_2 is the 2×2 identity matrix.
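The empirical form of the Wasserstein term for one batch can be sketched as follows; the critic scores are invented for the example:

```python
import numpy as np

def wgan_critic_objective(f_real, f_fake):
    # Empirical form of W(P_r, P_g) for one batch:
    # E_{M~P_r}[f(M)] - E_{M~P_g}[f(M)], where f is the (1-Lipschitz) critic.
    # The critic is trained to maximise this value; the generator to fool it.
    return np.mean(f_real) - np.mean(f_fake)

f_real = np.array([0.9, 1.1, 1.0])       # hypothetical critic scores on real skeletons
f_fake = np.array([-0.5, 0.1, 0.0])      # hypothetical scores on generated skeletons
print(wgan_critic_objective(f_real, f_fake))   # ≈ 1.133
```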
Step 6:
A progressive training strategy. The training process is divided into several preset sub-training periods, trained in sequence with a step-wise growth strategy: at the start of training, the original image is scaled down to a small picture and training begins with a large learning rate; after each sub-training period, the color image size is gradually increased while the learning rate is gradually decreased.
When the 3D human skeleton coordinates generated after a sub-training period differ significantly from the corresponding calibration data, back-propagation continues, the convolution weight and bias parameters are updated with a gradient-descent optimization algorithm, and step 2 is executed again; when the generated 3D human skeleton coordinates meet expectations or all preset sub-training periods are finished, the final result is obtained. The rationale is that training starts from the original picture scaled down to a small size, aided by a large learning rate; after each training period, the input picture is enlarged, the learning rate is reduced, and training resumes. In this way, precision on higher-resolution pictures is built on top of the low-resolution pictures, increasing the robustness of the network.
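Such a schedule might look like the following sketch; all concrete values (initial size, learning rate, growth and decay factors) are assumptions, not taken from the patent:

```python
def progressive_schedule(periods, size0=64, lr0=1e-3, growth=2, decay=0.1):
    # Illustrative step-wise schedule: each sub-training period enlarges
    # the input picture and shrinks the learning rate.
    sched = []
    size, lr = size0, lr0
    for _ in range(periods):
        sched.append((size, lr))
        size *= growth                   # gradually enlarge the input picture
        lr *= decay                      # gradually reduce the learning rate
    return sched

for size, lr in progressive_schedule(3):
    print(size, lr)
```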
FIG. 5 is a diagram illustrating the effect of the method for estimating 3D human body posture based on the antagonistic network of the motion link space.
Reference documents:
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214-223, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.
[2] B. Wandt, H. Ackermann, and B. Rosenhahn. A kinematic chain space for monocular motion capture. In ECCV Workshops, Sept. 2018.
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the present invention that produce functional effects, without exceeding the scope of the technical scheme, belong to the protection scope of the present invention.

Claims (2)

1. A method for estimating a 3D human body posture based on an adversarial network of kinematic chain space, characterized by comprising the following steps:
S1, collecting human body color images I with a monocular device, then performing image normalization, and labeling with 2D and 3D human body data sets to obtain the 2D human skeleton coordinates P and the 3D human skeleton coordinates M ∈ R^(3×n); the original images and the human skeleton coordinates are mirrored and cropped to augment the image data;
S2, the 3D human skeleton coordinate generation network: weakly-supervised generative adversarial network learning is adopted to solve the problem of data over-fitting, where the feature extraction stage uses the following formula:

F = R(BN(W1 * Ig + B1))   (1)

where R denotes the nonlinear activation function LeakyReLU, W1 and B1 denote the weights and biases of the convolutional layers in the feature extraction stage, BN denotes the normalization function, Ig denotes the input picture, and F denotes the output of the feature extraction stage; the 3D human skeleton coordinates are then obtained through a convolution block, a reshaping module, and two fully connected layers;
S3, estimating the camera parameter matrix K ∈ R^(2×3) with a convolutional neural network to assist the back-projection layer;
S4, based on the labeled 3D human skeleton coordinates from step S1 and the 3D human skeleton coordinates generated in step S2, computing the link angles and link lengths of the human skeleton with a Wasserstein GAN discriminator in kinematic chain space, while fusing the input image and the 3D human skeleton coordinates into a convolutional neural network to improve the accuracy of the generated 3D human skeleton coordinates;
S5, through the back-projection layer, converting the 3D human skeleton coordinates into 2D human skeleton coordinates based on the camera parameter matrix K ∈ R^(2×3) computed in step S3:

P' = KM   (2)

where P' is the predicted 2D human skeleton coordinates;
S6, predicting the loss function of the 3D human pose key nodes, where M ∈ R^(3×n) denotes the 3D human skeleton coordinates, i.e. the 3D pose key node positions, the coordinate m_i = (x, y, z) denotes one key node of the human body, i = 1, …, n, and a reshape operation is performed at the last output layer to obtain the 3D human coordinates;
S7, a progressive training strategy: the training process is divided into several preset sub-training periods, trained in sequence with a step-wise growth strategy; at the start of training the original image is scaled down to a small picture and training begins with a large learning rate, and after each sub-training period the color image size is gradually increased while the learning rate is gradually decreased; when the 3D human skeleton coordinates generated after a sub-training period differ significantly from the corresponding calibration data, back-propagation continues, the convolution weight and bias parameters are updated with a gradient-descent optimization algorithm, and step S2 is executed again; when the generated 3D human skeleton coordinates meet expectations or all preset sub-training periods are finished, the final result is obtained.
2. The method for estimating the 3D human body posture based on the adversarial network of kinematic chain space of claim 1, wherein the loss function of the 3D human pose key nodes is equal to:

W(P_r, P_g) + λ·L_cam

W(P_r, P_g) = sup_{‖f‖_L ≤ 1} E_{M∼P_r}[f(M)] − E_{M∼P_g}[f(M)]

wherein W(P_r, P_g) denotes the WGAN loss, whose input has two parts: P_g denotes a batch of generated data and P_r denotes a batch of real data; E_{M∼P_r}[f(M)] denotes the loss value of samples discriminated as real 3D human skeletons and E_{M∼P_g}[f(M)] the loss value of samples discriminated as generated; ‖f‖_L ≤ 1 means that the Lipschitz constant of the function f must not exceed 1, and the supremum of E_{M∼P_r}[f(M)] − E_{M∼P_g}[f(M)] is taken over all f satisfying this condition; L_cam denotes the loss function of the camera estimation network, λ is taken in [0, 1], trace(·) computes the trace of the corresponding matrix, ‖·‖_F is the Frobenius norm, K ∈ R^(2×3), and I_2 is the 2×2 identity matrix.
CN201911085729.2A 2019-11-08 2019-11-08 Method for estimating 3D human body posture based on antagonistic network of motion link space Active CN110826500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911085729.2A CN110826500B (en) 2019-11-08 2019-11-08 Method for estimating 3D human body posture based on antagonistic network of motion link space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911085729.2A CN110826500B (en) 2019-11-08 2019-11-08 Method for estimating 3D human body posture based on antagonistic network of motion link space

Publications (2)

Publication Number Publication Date
CN110826500A CN110826500A (en) 2020-02-21
CN110826500B true CN110826500B (en) 2023-04-14

Family

ID=69553460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911085729.2A Active CN110826500B (en) 2019-11-08 2019-11-08 Method for estimating 3D human body posture based on antagonistic network of motion link space

Country Status (1)

Country Link
CN (1) CN110826500B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598954A (en) * 2020-04-21 2020-08-28 哈尔滨拓博科技有限公司 Rapid high-precision camera parameter calculation method
CN111462274A (en) * 2020-05-18 2020-07-28 南京大学 Human body image synthesis method and system based on SMP L model
CN111914618B (en) * 2020-06-10 2024-05-24 华南理工大学 Three-dimensional human body posture estimation method based on countermeasure type relative depth constraint network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549876A (en) * 2018-04-20 2018-09-18 重庆邮电大学 Sitting posture detection method based on object detection and human pose estimation
CN109949368A (en) * 2019-03-14 2019-06-28 郑州大学 3D human pose estimation method based on image retrieval
CN110135375A (en) * 2019-05-20 2019-08-16 中国科学院宁波材料技术与工程研究所 Multi-person pose estimation method based on global information integration

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8971597B2 (en) * 2005-05-16 2015-03-03 Intuitive Surgical Operations, Inc. Efficient vision and kinematic data fusion for robotic surgical instruments and other applications
US8994790B2 (en) * 2010-02-25 2015-03-31 The Board Of Trustees Of The Leland Stanford Junior University Motion capture with low input data constraints

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549876A (en) * 2018-04-20 2018-09-18 重庆邮电大学 Sitting posture detection method based on object detection and human pose estimation
CN109949368A (en) * 2019-03-14 2019-06-28 郑州大学 3D human pose estimation method based on image retrieval
CN110135375A (en) * 2019-05-20 2019-08-16 中国科学院宁波材料技术与工程研究所 Multi-person pose estimation method based on global information integration

Also Published As

Publication number Publication date
CN110826500A (en) 2020-02-21

Similar Documents

Publication Publication Date Title
Li et al. Deepim: Deep iterative matching for 6d pose estimation
CN110826500B (en) Method for estimating 3D human body posture based on antagonistic network of motion link space
CN111968217B (en) SMPL parameter prediction and human body model generation method based on picture
CN109978021B (en) Double-flow video generation method based on different feature spaces of text
CN108154104A (en) A kind of estimation method of human posture based on depth image super-pixel union feature
CN111695523B (en) Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN110135277B (en) Human behavior recognition method based on convolutional neural network
Yin et al. Bridging the gap between semantic segmentation and instance segmentation
CN113221647A (en) 6D pose estimation method fusing point cloud local features
CN112183675B (en) Tracking method for low-resolution target based on twin network
CN112819951A (en) Three-dimensional human body reconstruction method with shielding function based on depth map restoration
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN115063717B (en) Video target detection and tracking method based on real scene modeling of key area
CN114463492A (en) Adaptive channel attention three-dimensional reconstruction method based on deep learning
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN114743273A (en) Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN117252928B (en) Visual image positioning system for modular intelligent assembly of electronic products
Peng et al. RGB-D human matting: A real-world benchmark dataset and a baseline method
CN117711066A (en) Three-dimensional human body posture estimation method, device, equipment and medium
CN111507276B (en) Construction site safety helmet detection method based on hidden layer enhanced features
CN112288812A (en) Mobile robot real-time positioning method based on visual features
CN117152829A (en) Industrial boxing action recognition method of multi-view self-adaptive skeleton network
CN116524601A (en) Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN116152926A (en) Sign language identification method, device and system based on vision and skeleton information fusion
Gao et al. Study of improved Yolov5 algorithms for gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant