CN110826500A - Method for estimating 3D human body posture based on an adversarial network in kinematic chain space - Google Patents


Info

Publication number
CN110826500A
Authority
CN
China
Prior art keywords
human body
human
coordinates
network
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911085729.2A
Other languages
Chinese (zh)
Other versions
CN110826500B (en)
Inventor
薛裕明
谢军伟
李�根
罗鸣
童同
高钦泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Timor View Mdt Infotech Ltd
Original Assignee
Fujian Timor View Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Timor View Mdt Infotech Ltd filed Critical Fujian Timor View Mdt Infotech Ltd
Priority to CN201911085729.2A priority Critical patent/CN110826500B/en
Publication of CN110826500A publication Critical patent/CN110826500A/en
Application granted granted Critical
Publication of CN110826500B publication Critical patent/CN110826500B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a method for estimating 3D human body posture based on an adversarial network in kinematic chain space. A convolutional neural network is used to estimate the three-dimensional coordinates of key human body nodes from an image captured by a monocular device. Specifically, with a monocular RGB image as input, kinematic-chain-space and adversarial-network techniques are adopted, which mitigates overfitting and improves the accuracy and precision of 3D human pose estimation.

Description

Method for estimating 3D human body posture based on an adversarial network in kinematic chain space
Technical Field
The invention relates to image content understanding, and in particular to a method for estimating 3D human body posture based on an adversarial network in kinematic chain space.
Background
Current artificial intelligence technology has brought major breakthroughs in fields such as image content understanding, video enhancement, and speech recognition. Within image content understanding in particular, 3D human pose recognition has high application value in rehabilitation medicine, video surveillance, advanced human-computer interaction, and other fields.
3D human pose estimation refers to techniques that predict the three-dimensional coordinates of a human pose from monocular or multi-view images. It can be roughly divided into the following three classes of methods:
The first class uses mathematical computation or machine learning to build a spatial coordinate system from information such as the relative positions and shooting angles of multi-view cameras, predicts a corresponding depth map, and can estimate a 2D image from any angle. Its drawbacks are that it requires images captured by multi-view cameras and that the placement of the capture devices cannot be changed.
The second class uses only a single capture device to compute 2D human pose coordinates directly from a single image, and then estimates the corresponding 3D pose through simple matrix multiplication or a lightweight learned network. However, lacking the original image as input, spatial information may be lost, degrading the accuracy of the 3D coordinates; moreover, since this approach relies only on 2D pose input, its errors may be amplified during 3D estimation.
The third class uses deep learning to learn an end-to-end mapping from monocular RGB images to three-dimensional coordinates. Compared with the former two, it brings clear improvements in efficiency and performance.
Although 3D human pose estimation has made progress, additional capture-device information is still required, and the deep neural networks involved are prone to overfitting.
Therefore, the invention takes only a monocular RGB image as input and adopts kinematic-chain-space and adversarial-network techniques, which both mitigates overfitting and improves the precision and accuracy of 3D human pose estimation.
Disclosure of Invention
The invention aims to provide a method for estimating 3D human body posture based on an adversarial network in kinematic chain space, which uses a convolutional neural network to estimate the three-dimensional coordinates of key human body nodes from an image captured by a monocular device, improving the accuracy and precision of 3D human pose estimation.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a method for estimating 3D human body posture based on an adversarial network in kinematic chain space, comprising the following steps:
Step S1, a human body color image I is captured with a monocular device; the image is then normalized and labeled using 2D and 3D human body data sets, yielding 2D human skeleton coordinates P and 3D human skeleton coordinates M ∈ R^{3×n}; the original image and the human skeleton coordinates are mirrored and cropped to augment the image data;
Step S2, the 3D human skeleton coordinate generation network: weakly supervised generative adversarial learning is adopted to alleviate data overfitting, with the feature extraction stage using the following formula:
F=R(BN(W1*Ig+B1)) (1)
where R denotes the nonlinear activation function LeakyReLU; W1 and B1 denote the weights and biases of the convolutional layers in the feature extraction stage; BN denotes the batch normalization function; Ig denotes the input image; and F denotes the output of the feature extraction stage. The 3D human skeleton coordinates are then obtained by passing the features through a convolution block, a reshaping module, and two fully connected layers;
Step S3, a convolutional neural network estimates the camera parameter matrix K ∈ R^{2×3} to assist the back-projection layer;
Step S4, based on the 3D human skeleton coordinates labeled in step S1 and those generated by the network in step S2, a Wasserstein GAN discriminator operating in kinematic chain space computes the link angles and lengths of the human skeleton; meanwhile, the input image and the 3D skeleton coordinates are fused and fed into a convolutional neural network, improving the structural accuracy of the generated 3D human skeleton coordinates;
Step S5, through the back-projection layer and based on the camera parameter K ∈ R^{2×3} estimated in step S3, the 3D human skeleton coordinates are converted into 2D human skeleton coordinates;
P'=KM (2)
wherein P' is the predicted 2D human skeletal coordinates;
Step S6, a loss function for the 3D human pose key nodes is predicted, where M ∈ R^{3×n} denotes the 3D human skeleton coordinates, i.e. the positions of the 3D pose key nodes; each coordinate m_i = (x, y, z), i = 1, …, n, denotes one key node of the human body; a reshape operation on the last output layer yields the 3D human body coordinates;
Step S7, progressive training strategy: the training process is divided into several preset sub-training periods trained in sequence with a step-wise growth strategy; at the start of training the original image is scaled down to a small picture and a large learning rate is used, and after each sub-training period the color image size is gradually increased while the learning rate is gradually decreased; if the 3D human skeleton coordinates generated after a sub-training period deviate greatly from the corresponding calibration data, back-propagation continues, a gradient descent optimization algorithm updates the convolution weight and bias parameters, and step S2 is executed again; when the 3D human skeleton coordinates generated after a sub-training period meet the expected accuracy, or all preset sub-training periods are finished, the final result is obtained.
In an embodiment of the present invention, the loss function of the 3D human pose key nodes equals:

W(Pr, Pg) + λ·Lcam

W(Pr, Pg) = sup_{||f||_L ≤ 1} E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)]

where W(Pr, Pg) denotes the WGAN loss, whose input comprises two parts: Pg denotes a batch of generated data (images and the correspondingly generated 3D human skeleton coordinates), and Pr denotes a batch of real data (images and the corresponding labeled real 3D human skeleton coordinates); E_{x~Pr}[f(x)] denotes the loss value for samples discriminated as real 3D human skeletons, and E_{x~Pg}[f(x)] the loss value for skeletons discriminated as generated; ||f||_L ≤ 1 means the Lipschitz constant of f does not exceed 1, i.e. the supremum of E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)] is taken over all f whose Lipschitz constant does not exceed 1; Lcam denotes the loss function of the camera estimation network; λ is taken between 0 and 1; trace computes the trace of the corresponding matrix; ||·||_F is the Frobenius norm; K ∈ R^{2×3}; and I2 is the 2×2 identity matrix.
Compared with the prior art, the invention has the following beneficial effects:
The innovations of the method for estimating 3D human body posture based on an adversarial network in kinematic chain space are mainly embodied in two aspects. First, a deep neural network model generates the 3D human skeleton in a weakly supervised fashion; the generation is accurate and effective and can satisfy most human action analysis needs. Second, 3D coordinates are for the first time fused with the image, and a KCS network layer is introduced into the discrimination network, upgrading the discriminator and greatly assisting the generation of the 3D structure. By using a generative adversarial network assisted by the KCS network layer and a camera back-projection network, the generated 3D human poses are accurate and reliable.
Drawings
FIG. 1 is a network structure diagram of the 3D human skeleton coordinate generation part of the method for estimating 3D human body posture based on an adversarial network in kinematic chain space according to the present invention;
FIG. 2 is the camera estimation network structure of the method for estimating 3D human body posture based on an adversarial network in kinematic chain space;
FIG. 3 is the discriminator part of the method of the present invention for estimating 3D human body posture based on an adversarial network in kinematic chain space;
FIG. 4 is the basic flow chart of the method for estimating 3D human body posture based on an adversarial network in kinematic chain space;
FIG. 5 is a diagram illustrating the effect of the method for estimating 3D human body posture based on an adversarial network in kinematic chain space.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to FIGS. 1-5.
As shown in FIG. 4, the method for estimating 3D human body posture based on an adversarial network in kinematic chain space aims to estimate the three-dimensional coordinates of key human body nodes from an image captured by a monocular device using a convolutional neural network. The specific steps are as follows:
step 1:
To train the model, a set of color human body images is selected as input I; the images are then normalized and labeled with 2D and 3D human body data sets, yielding for each body the 2D coordinates P and 3D coordinates M. The color images and their annotations are mirror-flipped, and brightness, hue, and saturation are randomly perturbed, producing a large amount of augmented image data that is stored as matched data pairs to serve as the deep learning training set. At the same time, the 2D coordinates P = (p1, p2, …, pn) and the 3D coordinates M = (m1, m2, …, mn), M ∈ R^{3×n}, on the training set are normalized; this further improves the model's convergence rate and precision and prevents gradient explosion.
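As an illustration of the data preparation in step 1, the normalization and mirror augmentation can be sketched as follows (a minimal example outside the patent text; the function names and the (n, 2) keypoint layout are illustrative assumptions):

```python
import numpy as np

def normalize_image(img):
    # Scale pixel values from [0, 255] to [0, 1]; per step 1, normalization
    # speeds convergence and helps prevent gradient explosion.
    return img.astype(np.float32) / 255.0

def mirror_augment(img, kpts_2d):
    # Horizontally flip the image and mirror the 2D keypoints to match.
    # kpts_2d: array of shape (n, 2) holding (x, y) pixel coordinates.
    h, w = img.shape[:2]
    flipped = img[:, ::-1].copy()
    mirrored = kpts_2d.astype(np.float32).copy()
    mirrored[:, 0] = (w - 1) - mirrored[:, 0]  # reflect x about the image centre
    return flipped, mirrored
```

In practice the same reflection would also be applied to the x-coordinates of the 3D labels, and left/right joint indices would be swapped.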
Step 2:
Generator part 1: the 3D human skeleton coordinate generation network. Compared with traditional methods, weakly supervised generative adversarial learning is adopted to alleviate data overfitting. The specific steps are as follows:
The feature extraction stage consists of a convolutional layer, a batch normalization layer, and a LeakyReLU activation function, computed as:
F=R(BN(W1*Ig+B1)) (1)
where R denotes the nonlinear activation function LeakyReLU; W1 and B1 denote the weights and biases of the convolutional layers in the feature extraction stage; BN denotes the batch normalization function; Ig denotes the input image; and F denotes the output of the feature extraction stage. The features are then passed through a convolution block, a reshaping module (flatten), and two fully connected layers to obtain the corresponding 3D human skeleton coordinates;
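The feature extraction of Eq. (1) — convolution, batch normalization, then LeakyReLU — can be sketched numerically as follows (a simplified illustration, not the patent's implementation: the naive valid-mode convolution and parameter-free batch normalization are assumptions made for clarity):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # The nonlinear activation R in Eq. (1).
    return np.where(x > 0, x, alpha * x)

def batch_norm(x, eps=1e-5):
    # Simplified BN: normalize each channel over the batch and spatial axes
    # (no learned scale/shift parameters).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv2d(x, w, b):
    # Naive valid cross-correlation: x (N, Cin, H, W), w (Cout, Cin, kh, kw).
    n, cin, hh, ww = x.shape
    cout, _, kh, kw = w.shape
    oh, ow = hh - kh + 1, ww - kw + 1
    out = np.zeros((n, cout, oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = x[:, :, i:i + kh, j:j + kw]  # (N, Cin, kh, kw)
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3])) + b
    return out

def feature_extract(img, w1, b1):
    # F = R(BN(W1 * Ig + B1)), Eq. (1) of the patent.
    return leaky_relu(batch_norm(conv2d(img, w1, b1)))
```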
Step 3:
Generator part 2: to improve the accuracy of the estimated human pose, the invention uses a convolutional neural network to estimate the camera parameter matrix K ∈ R^{2×3}. Its purpose is to assist the back-projection layer: the 3D human skeleton coordinates are back-projected onto the corresponding 2D skeleton coordinates and compared with the 2D coordinates of the original input image, and the back-projection loss is computed, which prevents overfitting. As a matrix transformation, K must satisfy the following property:
KK^T = s^2 · I2   (2)
where s is the scaling factor of the projection and I2 is the 2×2 identity matrix. Since s is an unknown quantity, the invention assigns to it the largest singular value of the K matrix. The computation is as follows:
s = σmax(K)   (3)

The loss function of the camera estimation network is:

Lcam = || (2 / trace(KK^T)) · KK^T − I2 ||_F   (4)

where trace computes the trace of the corresponding matrix, ||·||_F is the Frobenius norm, and K ∈ R^{2×3}.
By training the network shown in FIG. 2, the back-projection matrix K is obtained as output, and the 3D human skeleton coordinates are converted to 2D skeleton coordinates:
P'=KM (5)
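The back projection P' = KM and the camera consistency check can be sketched as follows (an illustrative example; the trace-normalized form of the camera loss is an assumption reconstructed from the surrounding text, since the original formula image is missing):

```python
import numpy as np

def camera_loss(K):
    # Penalize deviation of K from a scaled orthographic projection:
    # per Eq. (2), KK^T should equal s^2 * I2. The trace-normalized form
    # || 2/trace(KK^T) * KK^T - I2 ||_F used here is an assumption
    # consistent with the trace and Frobenius-norm terms in the text.
    KKt = K @ K.T
    return np.linalg.norm(2.0 / np.trace(KKt) * KKt - np.eye(2), ord="fro")

def back_project(K, M):
    # P' = K M (Eq. (5)): map the 3D skeleton M (3 x n) to 2D coordinates (2 x n).
    return K @ M
```

A perfectly scaled orthographic K (e.g. 2·[I2 | 0]) yields zero camera loss, while a skewed K is penalized.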
Step 4:
Discriminator part: as shown in FIG. 3, to judge the accuracy of the generated human structure, the invention uses a Wasserstein GAN [1] discriminator with a kinematic chain space [2] (KCS) layer, so that link angles and bone lengths can be computed more reasonably. Meanwhile, the input image and the 3D human skeleton are fused and fed into the convolutional neural network, adding a feature that indicates whether the 3D skeleton fits the original image.
The KCS layer is a network layer, introduced by the invention, that improves the representation of the human pose. The KCS matrix is an important representation of human pose, containing joint links and bone lengths. A bone bk can be represented as the link between the r-th and t-th joints:
bk = pr − pt = Mc   (6)
c = (0, …, 0, 1, 0, …, 0, −1, 0, …, 0)^T   (7)
where the entry at position r is 1 and the entry at position t is −1. The overall human skeleton is then defined as:
B = (b1, b2, …, bn)   (8)
The matrix C is obtained by concatenating several c vectors, so B can be written as:
B = MC   (9)
The KCS matrix is computed as:
Ψ = B^T B   (10)
By adding the Ψ matrix as a network layer, the squared length of each bone appears on the diagonal, while the off-diagonal entries encode the angle between any two bones. Compared with the Euclidean-distance matrices used in other methods, this matrix-operation form effectively improves computation speed; this part mainly extracts bone features and judges fabricated skeletons as quickly as possible.
To add the feature indicating whether the 3D skeleton fits the original image, the invention adds a second input: the original image combined with the 3D skeleton, whose features are extracted by a convolutional neural network. Specifically, the newly added 3D part is initialized as a floating-point tensor of width × height × depth, where width and height equal those of the original image and depth is the maximum depth value of the 3D human body; all values are initialized to 0.5, and each point occupied by the input 3D human body is set to 1.0, as shown in FIG. 3.
The invention concatenates the two sets of extracted features and appends two fully connected layers, each containing 90 neurons. The network finally judges where the 3D skeleton coordinates came from (the real data or the generator).
Step 5:
Loss function: the loss function for predicting the 3D human pose key nodes is W(Pr, Pg) + λ·Lcam, where M ∈ R^{3×n} denotes the 3D pose key node positions, each coordinate m_i = (x, y, z) denotes one key node of the human body, and a reshape operation on the last output layer yields the 3D human body coordinates. The discriminator part adopts the Wasserstein loss [1]:

W(Pr, Pg) = sup_{||f||_L ≤ 1} E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)]

where W(Pr, Pg) denotes the WGAN loss, whose input comprises two parts: Pg denotes a batch of generated data and Pr a batch of real data; E_{x~Pr}[f(x)] denotes the loss value for samples discriminated as real 3D human skeletons, and E_{x~Pg}[f(x)] the loss value for skeletons discriminated as generated; ||f||_L ≤ 1 means the Lipschitz constant of f does not exceed 1, i.e. the supremum of E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)] is taken over all f whose Lipschitz constant does not exceed 1.

The loss function of the camera estimation network is:

Lcam = || (2 / trace(KK^T)) · KK^T − I2 ||_F

where trace computes the trace of the corresponding matrix, ||·||_F is the Frobenius norm, K ∈ R^{2×3}, and I2 is the 2×2 identity matrix.
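The combined objective W(Pr, Pg) + λ·Lcam can be sketched as follows (an illustrative computation on precomputed critic outputs; the function names and λ = 0.5 are assumptions, and the Lipschitz constraint on the critic is not shown):

```python
import numpy as np

def wgan_critic_objective(f_real, f_fake):
    # Empirical Wasserstein critic objective: E_{x~Pr}[f(x)] - E_{x~Pg}[f(x)].
    # f_real / f_fake are the critic's scalar outputs on a real batch and a
    # generated batch; the critic maximizes this quantity.
    return np.mean(f_real) - np.mean(f_fake)

def total_loss(f_real, f_fake, l_cam, lam=0.5):
    # W(Pr, Pg) + lambda * L_cam, with lambda taken in (0, 1] as in the
    # patent; lam=0.5 is an arbitrary illustrative choice.
    return wgan_critic_objective(f_real, f_fake) + lam * l_cam
```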
Step 6:
Progressive training strategy. The training process is divided into several preset sub-training periods trained in sequence with a step-wise growth strategy: at the start of training the original image is scaled down to a small picture and a large learning rate is used, and after each sub-training period the color image size is gradually increased while the learning rate is gradually decreased.
If the 3D human skeleton coordinates generated after a sub-training period deviate greatly from the corresponding calibration data, back-propagation continues, the convolution weight and bias parameters are updated with a gradient descent optimization algorithm, and step 2 is executed again; when the 3D human skeleton coordinates generated after a sub-training period meet the expected accuracy, or all preset sub-training periods are finished, the final result is obtained. The rationale is that training starts from the original picture scaled to a small size, aided by a large learning rate; after each training period the input picture is enlarged, the learning rate is reduced, and training resumes. In this way, accuracy at higher resolution is built on top of the low-resolution result, increasing the robustness of the network.
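The step-wise schedule described above can be sketched as follows (an illustrative plan generator; the growth and decay factors are assumptions, since the patent does not fix their values):

```python
def progressive_schedule(base_size, base_lr, periods, growth=2, decay=0.5):
    # Step-wise training plan (step 6): start with small images and a large
    # learning rate, then enlarge the input and shrink the learning rate
    # after each sub-training period. Returns (period, image_size, lr) tuples.
    plan = []
    size, lr = base_size, base_lr
    for p in range(periods):
        plan.append((p, size, lr))
        size *= growth
        lr *= decay
    return plan
```

A training loop would iterate over the plan, resizing the inputs and resetting the optimizer's learning rate at each period boundary.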
FIG. 5 is a diagram illustrating the effect of the method for estimating 3D human body posture based on an adversarial network in kinematic chain space.
Reference documents:
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214-223, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.
[2] B. Wandt, H. Ackermann, and B. Rosenhahn. A kinematic chain space for monocular motion capture. In ECCV Workshops, Sept. 2018.
the above are preferred embodiments of the present invention, and all changes made according to the technical scheme of the present invention that produce functional effects do not exceed the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (2)

1. A method for estimating 3D human body posture based on an adversarial network in kinematic chain space, characterized by comprising the following steps:
Step S1, a human body color image I is captured with a monocular device; the image is then normalized and labeled using 2D and 3D human body data sets, yielding 2D human skeleton coordinates P and 3D human skeleton coordinates M ∈ R^{3×n}; the original image and the human skeleton coordinates are mirrored and cropped to augment the image data;
Step S2, the 3D human skeleton coordinate generation network: weakly supervised generative adversarial learning is adopted to alleviate data overfitting, with the feature extraction stage using the following formula:
F=R(BN(W1*Ig+B1)) (1)
where R denotes the nonlinear activation function LeakyReLU; W1 and B1 denote the weights and biases of the convolutional layers in the feature extraction stage; BN denotes the batch normalization function; Ig denotes the input image; and F denotes the output of the feature extraction stage. The 3D human skeleton coordinates are then obtained by passing the features through a convolution block, a reshaping module, and two fully connected layers;
Step S3, a convolutional neural network estimates the camera parameter matrix K ∈ R^{2×3} to assist the back-projection layer;
Step S4, based on the 3D human skeleton coordinates labeled in step S1 and those generated by the network in step S2, a Wasserstein GAN discriminator operating in kinematic chain space computes the link angles and lengths of the human skeleton; meanwhile, the input image and the 3D skeleton coordinates are fused and fed into a convolutional neural network, improving the accuracy of the generated 3D human skeleton coordinates;
Step S5, through the back-projection layer and based on the camera parameter K ∈ R^{2×3} estimated in step S3, the 3D human skeleton coordinates are converted into 2D human skeleton coordinates;
P'=KM (2)
wherein P' is the predicted 2D human skeletal coordinates;
Step S6, a loss function for the 3D human pose key nodes is predicted, where M ∈ R^{3×n} denotes the 3D human skeleton coordinates, i.e. the positions of the 3D pose key nodes; each coordinate m_i = (x, y, z), i = 1, …, n, denotes one key node of the human body; a reshape operation on the last output layer yields the 3D human body coordinates;
Step S7, progressive training strategy: the training process is divided into several preset sub-training periods trained in sequence with a step-wise growth strategy; at the start of training the original image is scaled down to a small picture and a large learning rate is used, and after each sub-training period the color image size is gradually increased while the learning rate is gradually decreased; if the 3D human skeleton coordinates generated after a sub-training period deviate greatly from the corresponding calibration data, back-propagation continues, a gradient descent optimization algorithm updates the convolution weight and bias parameters, and step S2 is executed again; when the 3D human skeleton coordinates generated after a sub-training period meet the expected accuracy, or all preset sub-training periods are finished, the final result is obtained.
2. The method for estimating 3D human body posture based on an adversarial network in kinematic chain space of claim 1, wherein the loss function of the 3D human pose key nodes equals:
W(Pr, Pg) + λ·Lcam

W(Pr, Pg) = sup_{||f||_L ≤ 1} E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)]

Lcam = || (2 / trace(KK^T)) · KK^T − I2 ||_F

where W(Pr, Pg) denotes the WGAN loss, whose input comprises two parts: Pg denotes a batch of generated data and Pr a batch of real data; E_{x~Pr}[f(x)] denotes the loss value for samples discriminated as real 3D human skeletons, and E_{x~Pg}[f(x)] the loss value for skeletons discriminated as generated; ||f||_L ≤ 1 means the Lipschitz constant of f does not exceed 1, i.e. the supremum of E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)] is taken over all f whose Lipschitz constant does not exceed 1; Lcam denotes the loss function of the camera estimation network; λ is taken between 0 and 1; trace computes the trace of the corresponding matrix; ||·||_F is the Frobenius norm; K ∈ R^{2×3}; and I2 is the 2×2 identity matrix.
CN201911085729.2A 2019-11-08 2019-11-08 Method for estimating 3D human body posture based on an adversarial network in kinematic chain space Active CN110826500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911085729.2A CN110826500B (en) 2019-11-08 2019-11-08 Method for estimating 3D human body posture based on an adversarial network in kinematic chain space


Publications (2)

Publication Number Publication Date
CN110826500A true CN110826500A (en) 2020-02-21
CN110826500B CN110826500B (en) 2023-04-14

Family

ID=69553460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911085729.2A Active CN110826500B (en) 2019-11-08 2019-11-08 Method for estimating 3D human body posture based on an adversarial network in kinematic chain space

Country Status (1)

Country Link
CN (1) CN110826500B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100331855A1 (en) * 2005-05-16 2010-12-30 Intuitive Surgical, Inc. Efficient Vision and Kinematic Data Fusion For Robotic Surgical Instruments and Other Applications
US20110205337A1 (en) * 2010-02-25 2011-08-25 Hariraam Varun Ganapathi Motion Capture with Low Input Data Constraints
CN108549876A (en) * 2018-04-20 2018-09-18 Chongqing University of Posts and Telecommunications Sitting posture detection method based on object detection and human pose estimation
CN109949368A (en) * 2019-03-14 2019-06-28 Zhengzhou University Human body three-dimensional pose estimation method based on image retrieval
CN110135375A (en) * 2019-05-20 2019-08-16 Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences Multi-person pose estimation method based on global information fusion

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598954A (en) * 2020-04-21 2020-08-28 Harbin Tuobo Technology Co., Ltd. Rapid high-precision camera parameter calculation method
CN111462274A (en) * 2020-05-18 2020-07-28 Nanjing University Human body image synthesis method and system based on SMPL model
CN111914618A (en) * 2020-06-10 2020-11-10 South China University of Technology Three-dimensional human body pose estimation method based on adversarial relative depth constraint network
CN111914618B (en) * 2020-06-10 2024-05-24 South China University of Technology Three-dimensional human body pose estimation method based on adversarial relative depth constraint network

Also Published As

Publication number Publication date
CN110826500B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN110826500B (en) Method for estimating 3D human body posture based on antagonistic network of motion link space
CN106778604B (en) Pedestrian re-identification method based on matching convolutional neural network
CN109658445A Network training method, incremental mapping method, localization method, device and equipment
CN107767419A Human body skeleton key point detection method and device
CN109978021B (en) Double-flow video generation method based on different feature spaces of text
CN110276768B (en) Image segmentation method, image segmentation device, image segmentation apparatus, and medium
CN102256065A (en) Automatic video condensing method based on video monitoring network
CN111695523B (en) Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN112288627A (en) Recognition-oriented low-resolution face image super-resolution method
CN110351548B (en) Stereo image quality evaluation method guided by deep learning and disparity map weighting
WO2022052782A1 (en) Image processing method and related device
CN109977827A Multi-person 3D pose estimation method using multi-view matching
CN114743273A (en) Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN117252928B (en) Visual image positioning system for modular intelligent assembly of electronic products
CN111507276B (en) Construction site safety helmet detection method based on hidden layer enhanced features
CN116524601B Adaptive multi-stage human behavior recognition model for assisted monitoring by an eldercare robot
CN112288812A (en) Mobile robot real-time positioning method based on visual features
CN115063717B (en) Video target detection and tracking method based on real scene modeling of key area
CN117152829A Industrial box-packing action recognition method using a multi-view adaptive skeleton network
Kang et al. An improved 3D human pose estimation model based on temporal convolution with gaussian error linear units
CN116092189A (en) Bimodal human behavior recognition method based on RGB data and bone data
Gao et al. Study of improved Yolov5 algorithms for gesture recognition
CN115641644A (en) Twin MViT-based multi-view gait recognition method
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
Benhamida et al. Theater Aid System for the Visually Impaired Through Transfer Learning of Spatio-Temporal Graph Convolution Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant