CN111914618A - Three-dimensional human body posture estimation method based on adversarial relative depth constraint network - Google Patents

Three-dimensional human body posture estimation method based on adversarial relative depth constraint network

Info

Publication number
CN111914618A
Authority
CN
China
Prior art keywords
human body
dimensional
depth
body posture
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010521352.7A
Other languages
Chinese (zh)
Other versions
CN111914618B (en)
Inventor
刘阳温
李桂清
韦国栋
聂勇伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010521352.7A priority Critical patent/CN111914618B/en
Publication of CN111914618A publication Critical patent/CN111914618A/en
Application granted granted Critical
Publication of CN111914618B publication Critical patent/CN111914618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional human body posture estimation method based on an adversarial relative depth constraint network, which comprises the following steps: 1) inputting the two-dimensional pixel coordinates of 16 joint points of a human body and performing normalization preprocessing; 2) inputting the two-dimensional pixel coordinates into a depth prediction network and outputting the depth values of the 16 joint points; 3) reconstructing the three-dimensional coordinates of the joint points from the depth values and the two-dimensional pixel coordinates; 4) inputting the reconstructed three-dimensional human body posture into the discriminator of a generative adversarial network to calculate an authenticity error, and calculating a relative depth error from the reconstructed posture and the relative depth information between the joint points in the corresponding image; 5) adding the authenticity error calculated by the discriminator and the relative depth error to obtain a total error, and feeding the total error back to the depth prediction network so as to obtain a more accurate three-dimensional human body posture. The invention alleviates the shortage of outdoor three-dimensional human body posture data and solves the problem that the results of existing generative adversarial network methods are inconsistent with the relative depth relationships between the joint points in the picture.

Description

Three-dimensional human body posture estimation method based on adversarial relative depth constraint network
Technical Field
The invention relates to the technical field of three-dimensional human body posture estimation, in particular to a three-dimensional human body posture estimation method based on an adversarial relative depth constraint network.
Background
Three-dimensional human body posture estimation refers to the process of estimating the three-dimensional coordinates of the main joint points of a human body from an image and using them to represent the three-dimensional posture of the body in that image. In recent years, as technological progress has opened up new application scenarios, three-dimensional human body posture estimation has found wide application in human-computer interaction, motion estimation, animation, virtual reality and other fields, and has become a fundamental and challenging research topic.
Owing to the development of deep learning and the easy acquisition of two-dimensional human body posture data, the field of two-dimensional human body posture estimation has made great progress. In three-dimensional human body posture estimation, however, data acquisition is difficult and costly, so little three-dimensional posture data is available for network learning. Most existing three-dimensional human body posture data are collected indoors with precise instruments. Consequently, existing three-dimensional estimation methods perform poorly on outdoor images because abundant outdoor three-dimensional posture data are lacking.
Because two-dimensional posture estimation is mature while three-dimensional human body posture data are difficult to acquire, existing three-dimensional estimation methods tend to estimate the three-dimensional posture from the two-dimensional posture in a weakly supervised manner. Weak supervision aims to make the neural network learn prior attributes of the three-dimensional posture, such as bone lengths and the angles between bones, without requiring full supervision by three-dimensional posture data paired with each picture, thereby alleviating the shortage of outdoor three-dimensional posture data. To make the weakly supervised network generate more reasonable three-dimensional postures, existing methods adopt a generative adversarial network for weakly supervised learning. The generative adversarial approach uses the already collected three-dimensional posture data so that, under weak supervision, a neural network called the generator produces three-dimensional postures that conform to the distribution of the collected data. In this way the generator learns reasonable postures: for example, the left and right arms are of equal, symmetrical length, the angles between bones are plausible, and the reprojection coincides with the two-dimensional posture. However, existing generative adversarial methods focus on constraining the distribution of the collected three-dimensional posture data and neglect the constraint of the relative depth between the joint points of the human body in the image, so the estimated posture may match the collected data distribution while violating the relative depth relationships between the corresponding joint points in the image. Relative depth refers to the relative ordering of the distances from the camera to the joint points of the human body in the image. It can be obtained by human observation of the image and is easy to acquire compared with capturing true three-dimensional coordinates, so it can serve as a form of weakly supervised information.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a three-dimensional human body posture estimation method based on an adversarial relative depth constraint network. A weakly supervised approach addresses the difficulty of acquiring three-dimensional human body posture data, and the combination of a generative adversarial network with a relative depth constraint remedies the defect that the three-dimensional postures estimated by existing generative adversarial methods do not conform to the relative depth relationships in the corresponding image.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: a three-dimensional human body posture estimation method based on an adversarial relative depth constraint network comprises the following steps:
1) inputting the two-dimensional pixel coordinates of 16 joint points of a human body and performing normalization preprocessing;
2) inputting the normalized two-dimensional pixel coordinates of the 16 joint points into a depth prediction network and outputting the depth values of the 16 joint points;
3) reconstructing the three-dimensional coordinates of the joint points from the depth values and the two-dimensional pixel coordinates of the 16 joint points to obtain a reconstructed three-dimensional human body posture;
4) inputting the reconstructed three-dimensional human body posture into the discriminator of a generative adversarial network to calculate an authenticity error, and simultaneously calculating a relative depth error from the reconstructed posture and the relative depth information between the joint points in the corresponding image;
5) adding the authenticity error calculated by the discriminator of the generative adversarial network and the relative depth error to obtain a total error, feeding the total error back to the depth prediction network, and constraining the depth prediction network to predict more accurate depth values so as to reconstruct a more accurate three-dimensional human body posture.
In step 1), for each human body, the mean of the two-dimensional pixel coordinates of its 16 joint points is subtracted from the two-dimensional pixel coordinates of each joint point, and the result is divided by the standard deviation of the two-dimensional pixel coordinates of the 16 joint points, yielding the normalized two-dimensional pixel coordinates.
In step 2), the normalized two-dimensional pixel coordinates of the joint points obtained in the previous step are input into a depth prediction network composed of three modules to predict the depth values of the 16 joint points of the human body, comprising the following steps:
2.1) inputting the normalized two-dimensional pixel coordinates of the joint points into a feature extraction module, which consists of a fully connected layer containing 1024 neurons and a linear rectification activation function layer, to extract features;
2.2) inputting the features extracted by the feature extraction module into a residual network module for feature learning, wherein the residual network module consists of two residual blocks; each residual block passes the output of the previous layer through a fully connected layer containing 1024 neurons and a linear rectification activation function layer to obtain a preliminary feature value, passes the preliminary feature value through another fully connected layer containing 1024 neurons to obtain a further feature value, adds the further feature value to the block input, and finally passes the sum through a linear rectification activation function layer, outputting the block feature value to the next layer of the network;
2.3) inputting the output features of the residual network module into a depth value regression module, which consists of a fully connected layer containing 16 neurons and takes the output features of the residual network module to output the depth values of the 16 joint points of the human body.
In step 3), the three-dimensional coordinates of the joint points are reconstructed from the depth values and the two-dimensional pixel coordinates of the 16 joint points, as follows:
assume that the two-dimensional pixel coordinate of a certain joint point of the human body is (u, v), where u is the horizontal coordinate and v the vertical coordinate of the joint point in the image; assume that the depth value predicted for the joint point in the previous step is H and the focal length corresponding to the image is f; the three-dimensional coordinate of the joint point is then
\left( \frac{uH}{f},\; \frac{vH}{f},\; H \right)
The three-dimensional coordinates of each joint point are reconstructed in this way, giving the three-dimensional coordinates of the 16 joint points of the human body, which together constitute the three-dimensional posture of the human body.
In step 4), the reconstructed three-dimensional human body posture is input into the discriminator of a generative adversarial network for authenticity error calculation, and the relative depth error is simultaneously calculated from the reconstructed posture and the relative depth information between the joint points in the corresponding image, comprising the following steps:
4.1) the three-dimensional human body posture reconstructed in the previous step is taken as a fake sample and the already collected three-dimensional human body posture data as real samples, and the samples are input into the discriminator of the generative adversarial network, so that the reconstructed posture conforms to the distribution of the collected real data and a more reasonable three-dimensional posture is obtained; the discriminator consists of an upper and a lower fully connected feature extraction module and a fully connected real/fake prediction module; the three-dimensional posture sample is first input into the upper and lower fully connected feature extraction modules for feature extraction, the features extracted by the two modules are concatenated into a merged feature, the merged feature is input into the fully connected real/fake prediction module to judge whether the sample is real or fake, a judgment value is output, and the authenticity error of the three-dimensional posture is calculated from the judgment value with the loss function of the generative adversarial network; the upper and lower fully connected feature extraction modules have the same structure, each consisting of the feature extraction module of the depth prediction network and a residual network module composed of one residual block, and the fully connected real/fake prediction module consists of a fully connected layer containing 1024 neurons, a linear rectification activation function layer and a fully connected layer containing 1 neuron;
4.2) the relative depth error is calculated from the reconstructed three-dimensional posture and the relative depth information between the joint points in the corresponding image; the relative depth information between the joint points of the human body in the image is obtained by human observation of the image and stored in a matrix of 16 rows and 16 columns, specifically: if, from the image, the i-th joint point of the human body is closer to the camera than the j-th joint point, the element r(i, j) in row i and column j of the matrix is 1; if the i-th joint point is farther from the camera than the j-th joint point, r(i, j) is -1; if the difference between the distances of the i-th and j-th joint points from the camera is within a set range, r(i, j) is 0; here i and j are integers in the interval [1, 16], r is the matrix storing the relative depth information between the joint points, and r(i, j) is the element in row i and column j representing the relative depth relationship between the i-th and j-th joint points;
the relative depth error between each pair of joint points in the three-dimensional human body posture reconstructed in step 3) is calculated with the obtained relative depth matrix, specifically:
L_{i,j} = \left| r(i,j) \right| \, \log\!\left( 1 + \exp\!\left( r(i,j)\,(H_i - H_j) \right) \right) + \left( 1 - \left| r(i,j) \right| \right) (H_i - H_j)^2
where L_{i,j} denotes the relative depth error of the point pair formed by the i-th and j-th joint points in the three-dimensional human body posture; r(i, j) denotes the relative depth relationship between the i-th and j-th joint points, taking a value in {1, -1, 0}; |r(i, j)| denotes the absolute value of r(i, j); H_i and H_j denote the depth values of the i-th and j-th joint points obtained from the depth prediction network; finally, the sum of the relative depth errors of the 256 point pairs formed pairwise by the 16 joint points of the human body is calculated from the relative depth errors of the individual pairs, specifically:
L_{rank} = \sum_{(i,j) \in B} L_{i,j}
where L_rank denotes the sum of the relative depth errors of the 256 point pairs formed pairwise by the 16 joint points of the human body, (i, j) denotes the point pair formed by the i-th and j-th joint points, and B denotes the set of the 256 point pairs formed pairwise by the 16 joint points; the calculated sum of the relative depth errors of the 256 point pairs is taken as the relative depth error of the three-dimensional posture of the human body.
In step 5), the authenticity error calculated by the discriminator of the generative adversarial network is added to the relative depth error to obtain the total error of the reconstructed three-dimensional posture in terms of authenticity and relative depth; the error is fed back to the depth prediction network by back-propagation with gradient descent and the parameters of the depth prediction network are updated, so that the network learns both the authenticity of the three-dimensional posture and the relative depth information between the joint points in the picture, predicts more accurate joint depths, and reconstructs a more accurate three-dimensional human body posture.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method uses a generative adversarial network for weak supervision and only needs the already collected three-dimensional human body posture data for training, without collecting three-dimensional posture data paired with each image for full supervision, thereby alleviating the difficulty of acquiring three-dimensional posture data and broadening the range of application.
2. The invention combines the generative adversarial network with a relative depth constraint; on the basis of the more reasonable three-dimensional postures obtained through the adversarial network, it makes full use of the relative depth information between the joint points in the picture, so that the estimated three-dimensional posture better matches the true posture of the human body in the image and higher accuracy is obtained.
3. The network uses simple fully connected layers, has a simple structure and computes quickly and efficiently, so real-time performance can be achieved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of 16 joint points of a human body.
FIG. 3 is a structural diagram of the depth prediction network; in the figure, Linear denotes a fully connected layer, the number below it denotes the number of neurons in that layer, RELU denotes a linear rectification activation function layer, the contents of the large box show the structure of a residual block, and the ×2 at the upper right corner of the box indicates that there are two residual blocks.
FIG. 4 is a structural diagram of the discriminator of the generative adversarial network; in the figure, Linear denotes a fully connected layer, the number below it denotes the number of neurons in that layer, RELU denotes a linear rectification activation function layer, FCnet denotes a fully connected feature extraction module network, and Concat denotes the concatenation of the features extracted by the upper and lower fully connected feature extraction modules.
FIG. 5 is a network structure diagram of the fully connected feature extraction module in the discriminator of the generative adversarial network. In the figure, Linear denotes a fully connected layer, the number below it denotes the number of neurons in that layer, and RELU denotes a linear rectification activation function layer.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The complete flow of the three-dimensional human body posture estimation method based on the adversarial relative depth constraint network provided by this embodiment is shown in FIG. 1. First, the two-dimensional pixel coordinates of 16 joint points of a human body are input and normalized; second, the normalized two-dimensional pixel coordinates are input into the depth prediction network, which outputs the depth values of the 16 joint points; then the three-dimensional coordinates of the joint points are reconstructed from the depth values and the two-dimensional pixel coordinates; next, the reconstructed three-dimensional posture is input into the discriminator of the generative adversarial network to calculate the authenticity error, while the relative depth error is calculated from the reconstructed posture and the relative depth information between the joint points in the corresponding image; finally, the authenticity error calculated by the discriminator and the relative depth error are added to obtain a total error, which is fed back to the depth prediction network and constrains it to predict depth values with a smaller total error, so that a more accurate three-dimensional posture is reconstructed. The details are as follows:
1) The two-dimensional pixel coordinates of the human body joint points are input and normalized, specifically: for each human body, the mean of the two-dimensional pixel coordinates of its 16 joint points is subtracted from the two-dimensional pixel coordinates of each joint point, and the result is divided by the standard deviation of the two-dimensional pixel coordinates of the 16 joint points, yielding the normalized two-dimensional pixel coordinates. The 16 joint points of the human body are shown in Fig. 2. A sketch of this preprocessing is given below.
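As an illustration only, a minimal NumPy sketch of this normalization follows; it assumes the coordinates of one person are stored as a 16×2 array and that a single standard deviation over all coordinates is used, since the text does not specify whether the deviation is taken per axis:

```python
import numpy as np

def normalize_joints_2d(joints_2d):
    """Normalize the 2D pixel coordinates of the 16 joint points of one person:
    subtract the per-person mean and divide by the per-person standard deviation."""
    joints_2d = np.asarray(joints_2d, dtype=np.float32)  # shape (16, 2): one (u, v) per joint
    mean = joints_2d.mean(axis=0, keepdims=True)         # mean over the 16 joints
    std = joints_2d.std()                                 # assumption: one std over all coordinates
    return (joints_2d - mean) / (std + 1e-8)              # epsilon guards against division by zero
```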
2) The structure of the depth prediction network is shown in FIG. 3. In the figure, Linear denotes a fully connected layer and the number below it denotes the number of neurons in that layer; RELU denotes a linear rectification activation function layer; the contents of the large box show the structure of a residual block, and the ×2 at its upper right corner indicates that there are two residual blocks. The normalized two-dimensional pixel coordinates of the 16 joint points are input into the depth prediction network, which outputs the depth values of the 16 joint points. The normalized coordinates obtained in the previous step are input into a depth prediction network composed of three modules to predict the depth values of the 16 joint points, comprising the following steps (a sketch of the network follows the module descriptions below):
2.1) The normalized two-dimensional pixel coordinates of the joint points are input into the feature extraction module to extract features. The feature extraction module consists of a fully connected layer containing 1024 neurons and a linear rectification activation function layer.
2.2) The features extracted by the feature extraction module are input into the residual network module for feature learning. The residual network module consists of two residual blocks. Each residual block passes the output of the previous layer through a fully connected layer containing 1024 neurons and a linear rectification activation function layer to obtain a preliminary feature value, passes the preliminary feature value through another fully connected layer containing 1024 neurons to obtain a further feature value, adds the further feature value to the block input, and finally passes the sum through a linear rectification activation function layer, outputting the block feature value to the next layer of the network.
2.3) The output features of the residual network module are input into the depth value regression module, which consists of a fully connected layer containing 16 neurons; it takes the output features of the residual network module and outputs the depth values of the 16 joint points of the human body.
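A minimal PyTorch sketch of the depth prediction network described in steps 2.1)–2.3) follows; the class names, the flattening of the 16 (u, v) pairs into a 32-dimensional input vector, and the batch-first layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of FIG. 3: Linear(1024)+ReLU, Linear(1024), skip connection, final ReLU."""
    def __init__(self, width=1024):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.fc1(x))   # preliminary feature value
        y = self.fc2(y)              # further feature value
        return self.relu(x + y)      # add the block input, then activate

class DepthPredictionNet(nn.Module):
    """Feature extraction module, two residual blocks, and a 16-neuron depth regression layer."""
    def __init__(self, num_joints=16, width=1024):
        super().__init__()
        self.extract = nn.Sequential(nn.Linear(num_joints * 2, width), nn.ReLU())
        self.residual = nn.Sequential(ResidualBlock(width), ResidualBlock(width))
        self.regress = nn.Linear(width, num_joints)

    def forward(self, joints_2d_norm):        # (batch, 32) flattened normalized coordinates
        h = self.extract(joints_2d_norm)
        h = self.residual(h)
        return self.regress(h)                # (batch, 16) predicted depth values
```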
3) The three-dimensional coordinates of the joint points are reconstructed from the depth values and the two-dimensional pixel coordinates of the 16 joint points, as follows:
Assume that the two-dimensional pixel coordinate of a certain joint point of the human body is (u, v), where u is the horizontal coordinate and v the vertical coordinate of the joint point in the image. Assume that the depth value predicted for the joint point in the previous step is H and the focal length corresponding to the image is f; the three-dimensional coordinate of the joint point is then
\left( \frac{uH}{f},\; \frac{vH}{f},\; H \right)
The three-dimensional coordinates of each joint point are reconstructed in this way, giving the three-dimensional coordinates of the 16 joint points of the human body. The three-dimensional coordinates of the 16 joint points constitute the three-dimensional posture of the human body.
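A short sketch of this back-projection is given below; it assumes a pinhole camera whose principal point coincides with the origin of the pixel coordinates used here, i.e. the reconstructed coordinate is taken to be (uH/f, vH/f, H):

```python
import torch

def reconstruct_3d(joints_2d_px, depths, focal_length):
    """Back-project each joint: pixel coordinates (u, v), depth H and focal length f
    give the assumed 3D coordinate (u*H/f, v*H/f, H)."""
    u = joints_2d_px[..., 0]
    v = joints_2d_px[..., 1]
    x = u * depths / focal_length
    y = v * depths / focal_length
    return torch.stack([x, y, depths], dim=-1)   # shape (..., 16, 3)
```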
4) The structure of the discriminator of the generative adversarial network is shown in FIG. 4. In the figure, Linear denotes a fully connected layer and the number below it denotes the number of neurons in that layer; RELU denotes a linear rectification activation function layer; FCnet denotes a fully connected feature extraction module network; Concat denotes the concatenation of the features extracted by the upper and lower fully connected feature extraction modules. The network structure of the fully connected feature extraction module in the discriminator is shown in FIG. 5. Error calculation is performed with the discriminator of the generative adversarial network and the relative depth information: the reconstructed three-dimensional human body posture is input into the discriminator for authenticity error calculation, while the relative depth error is calculated from the reconstructed posture and the relative depth information between the joint points in the corresponding image, comprising the following steps:
4.1) The three-dimensional human body posture reconstructed in the previous step is taken as a fake sample and the already collected three-dimensional human body posture data as real samples, and the samples are input into the discriminator of the generative adversarial network, so that the reconstructed posture conforms to the distribution of the collected real data and a more reasonable three-dimensional posture is obtained. The discriminator consists of an upper and a lower fully connected feature extraction module and a fully connected real/fake prediction module. The three-dimensional posture sample is first input into the upper and lower fully connected feature extraction modules for feature extraction; the features extracted by the two modules are then concatenated into a merged feature, which is input into the fully connected real/fake prediction module to judge whether the sample is real or fake; a judgment value is output, and the authenticity error of the three-dimensional posture is calculated from the judgment value with the loss function of the generative adversarial network. The upper and lower fully connected feature extraction modules have the same structure, each consisting of the feature extraction module of the depth prediction network and a residual network module composed of one residual block. The fully connected real/fake prediction module consists of a fully connected layer containing 1024 neurons, a linear rectification activation function layer and a fully connected layer containing 1 neuron. A sketch of this discriminator is given below.
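The following is a minimal PyTorch sketch of the two-branch discriminator; feeding the same flattened 48-dimensional posture to both branches, and reading the single output neuron as a real/fake logit, are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FCFeatureNet(nn.Module):
    """Fully connected feature extraction module of FIG. 5:
    Linear(1024)+ReLU followed by one residual block (two Linear(1024) layers with a skip)."""
    def __init__(self, in_dim, width=1024):
        super().__init__()
        self.extract = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU())
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.extract(x)
        y = self.fc2(self.relu(self.fc1(h)))
        return self.relu(h + y)                  # residual connection, then activation

class PoseDiscriminator(nn.Module):
    """Upper and lower FCnet branches, feature concatenation (Concat), and the real/fake head."""
    def __init__(self, num_joints=16, width=1024):
        super().__init__()
        self.upper = FCFeatureNet(num_joints * 3, width)
        self.lower = FCFeatureNet(num_joints * 3, width)
        self.head = nn.Sequential(nn.Linear(2 * width, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, pose_3d_flat):             # (batch, 48) flattened 3D posture
        feat = torch.cat([self.upper(pose_3d_flat), self.lower(pose_3d_flat)], dim=1)
        return self.head(feat)                    # real/fake judgment value (logit)
```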
4.2) The relative depth error is calculated from the reconstructed three-dimensional posture and the relative depth information between the joint points in the corresponding image. The relative depth information between the joint points of the human body in the image can be obtained by human observation of the image. The invention stores this relative depth information in a matrix of 16 rows and 16 columns, specifically: if, from the image, the i-th joint point of the human body is clearly closer to the camera than the j-th joint point, the element r(i, j) in row i and column j of the matrix is 1; if the i-th joint point is clearly farther from the camera than the j-th joint point, r(i, j) is -1; if the distances of the i-th and j-th joint points from the camera do not differ by a large margin, r(i, j) is 0. Here i and j are integers in the interval [1, 16], r is the matrix storing the relative depth information between the joint points, and r(i, j) is the element in row i and column j representing the relative depth relationship between the i-th and j-th joint points. A sketch of how such a matrix can be built is given below.
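As a minimal sketch, the matrix r can also be filled automatically whenever annotated joint depths are available (the invention obtains it by human observation of the image); the threshold value and the millimetre unit below are assumptions:

```python
import numpy as np

def relative_depth_matrix(joint_depths, margin=50.0):
    """Build the 16x16 relative depth matrix r (0-based indices here):
    r[i, j] = 1 if joint i is closer to the camera than joint j,
             -1 if it is farther, 0 if the difference is within `margin` (assumed in mm)."""
    n = len(joint_depths)
    r = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        for j in range(n):
            diff = joint_depths[j] - joint_depths[i]   # positive when joint i is closer
            if diff > margin:
                r[i, j] = 1
            elif diff < -margin:
                r[i, j] = -1
    return r
```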
The relative depth error between each pair of joint points in the three-dimensional human body posture reconstructed in step 3) is calculated with the obtained relative depth matrix, specifically:
L_{i,j} = \left| r(i,j) \right| \, \log\!\left( 1 + \exp\!\left( r(i,j)\,(H_i - H_j) \right) \right) + \left( 1 - \left| r(i,j) \right| \right) (H_i - H_j)^2
where i and j are integers in the range [1, 16]; L_{i,j} denotes the relative depth error of the point pair formed by the i-th and j-th joint points in the three-dimensional human body posture; r(i, j) denotes the relative depth relationship between the i-th and j-th joint points, taking a value in {1, -1, 0}; |r(i, j)| denotes the absolute value of r(i, j); H_i and H_j denote the depth values of the i-th and j-th joint points obtained from the depth prediction network. Finally, the sum of the relative depth errors of the 256 point pairs formed pairwise by the 16 joint points of the human body is calculated from the relative depth errors of the individual pairs, specifically:
L_{rank} = \sum_{(i,j) \in B} L_{i,j}
where i and j are integers in the range [1, 16]; L_{i,j} denotes the relative depth error of the point pair formed by the i-th and j-th joint points; L_rank denotes the sum of the relative depth errors of the 256 point pairs formed pairwise by the 16 joint points of the human body; (i, j) denotes the point pair formed by the i-th and j-th joint points, and B denotes the set of the 256 point pairs formed pairwise by the 16 joint points. The calculated sum is taken as the relative depth error of the three-dimensional posture of the human body. A sketch of this loss is given below.
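A minimal PyTorch sketch of this relative depth error follows. The softplus ranking penalty for ordered pairs and the squared penalty for pairs labelled 0 follow common ordinal-depth losses and are assumptions, since the formula images of the original document are not reproduced here:

```python
import torch

def relative_depth_loss(pred_depths, r):
    """Relative depth error summed over all 16x16 joint pairs.
    pred_depths: (batch, 16) depths from the depth prediction network.
    r: (16, 16) tensor of relative depth labels in {1, -1, 0}."""
    H_i = pred_depths.unsqueeze(2)                 # (batch, 16, 1)
    H_j = pred_depths.unsqueeze(1)                 # (batch, 1, 16)
    diff = H_i - H_j
    r = r.to(pred_depths.dtype)
    ordered = torch.log1p(torch.exp(r * diff))     # penalize violating the annotated ordering
    equal = diff ** 2                              # penalize any gap when r(i, j) == 0
    loss = torch.abs(r) * ordered + (1 - torch.abs(r)) * equal
    return loss.sum(dim=(1, 2)).mean()             # sum over the 256 pairs, average over the batch
```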
5) The authenticity error calculated by the discriminator of the generative adversarial network and the relative depth error are added to obtain a total error, which is fed back to the depth prediction network, constraining it to predict more accurate depth values so that a more accurate three-dimensional human body posture is reconstructed, specifically:
and adding the authenticity error and the relative depth error calculated by a discriminator of the generating countermeasure network to obtain the total error of the reconstructed three-dimensional human body posture in the two aspects of authenticity and relative depth, feeding the error back to the depth prediction network through the backward gradient descent propagation of the neural network, and updating parameters in the depth prediction network, so that the neural network can learn the authenticity of the three-dimensional human body posture and the relative depth information between the joint points corresponding to the picture, predict more accurate joint point depth, and reconstruct to obtain more accurate three-dimensional human body posture.
In conclusion, the invention provides a new weakly supervised method for three-dimensional human body posture estimation. By combining a generative adversarial network with a relative depth constraint, and by using the relative depth relationship information between the joint points in the picture on top of the more reasonable three-dimensional postures obtained through the adversarial network, the estimated three-dimensional posture better matches the true posture of the human body in the image, so higher accuracy is obtained; the method has practical application value and is worth popularizing.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; changes made according to the shape and principle of the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. A three-dimensional human body posture estimation method based on an adversarial relative depth constraint network, characterized by comprising the following steps:
1) inputting the two-dimensional pixel coordinates of 16 joint points of a human body and performing normalization preprocessing;
2) inputting the normalized two-dimensional pixel coordinates of the 16 joint points into a depth prediction network and outputting the depth values of the 16 joint points;
3) reconstructing the three-dimensional coordinates of the joint points from the depth values and the two-dimensional pixel coordinates of the 16 joint points to obtain a reconstructed three-dimensional human body posture;
4) inputting the reconstructed three-dimensional human body posture into the discriminator of a generative adversarial network to calculate an authenticity error, and simultaneously calculating a relative depth error from the reconstructed posture and the relative depth information between the joint points in the corresponding image;
5) adding the authenticity error calculated by the discriminator of the generative adversarial network and the relative depth error to obtain a total error, feeding the total error back to the depth prediction network, and constraining the depth prediction network to predict more accurate depth values so as to reconstruct a more accurate three-dimensional human body posture.
2. The three-dimensional human body posture estimation method based on the adversarial relative depth constraint network according to claim 1, characterized in that: in step 1), for each human body, the mean of the two-dimensional pixel coordinates of its 16 joint points is subtracted from the two-dimensional pixel coordinates of each joint point, and the result is divided by the standard deviation of the two-dimensional pixel coordinates of the 16 joint points, yielding the normalized two-dimensional pixel coordinates.
3. The three-dimensional human body posture estimation method based on the adversarial relative depth constraint network according to claim 1, characterized in that: in step 2), the normalized two-dimensional pixel coordinates of the joint points obtained in the previous step are input into a depth prediction network composed of three modules to predict the depth values of the 16 joint points of the human body, comprising the following steps:
2.1) inputting the normalized two-dimensional pixel coordinates of the joint points into a feature extraction module, which consists of a fully connected layer containing 1024 neurons and a linear rectification activation function layer, to extract features;
2.2) inputting the features extracted by the feature extraction module into a residual network module for feature learning, wherein the residual network module consists of two residual blocks; each residual block passes the output of the previous layer through a fully connected layer containing 1024 neurons and a linear rectification activation function layer to obtain a preliminary feature value, passes the preliminary feature value through another fully connected layer containing 1024 neurons to obtain a further feature value, adds the further feature value to the block input, and finally passes the sum through a linear rectification activation function layer, outputting the block feature value to the next layer of the network;
2.3) inputting the output features of the residual network module into a depth value regression module, which consists of a fully connected layer containing 16 neurons and takes the output features of the residual network module to output the depth values of the 16 joint points of the human body.
4. The three-dimensional human body posture estimation method based on the adversarial relative depth constraint network according to claim 1, characterized in that: in step 3), the three-dimensional coordinates of the joint points are reconstructed from the depth values and the two-dimensional pixel coordinates of the 16 joint points, as follows:
assume that the two-dimensional pixel coordinate of a certain joint point of the human body is (u, v), where u is the horizontal coordinate and v the vertical coordinate of the joint point in the image; assume that the depth value predicted for the joint point in the previous step is H and the focal length corresponding to the image is f; the three-dimensional coordinate of the joint point is then
\left( \frac{uH}{f},\; \frac{vH}{f},\; H \right)
The three-dimensional coordinates of each joint point are reconstructed in this way, giving the three-dimensional coordinates of the 16 joint points of the human body, which together constitute the three-dimensional posture of the human body.
5. The three-dimensional human body posture estimation method based on the adversarial relative depth constraint network according to claim 1, characterized in that: in step 4), the reconstructed three-dimensional human body posture is input into the discriminator of the generative adversarial network for authenticity error calculation, and the relative depth error is simultaneously calculated from the reconstructed posture and the relative depth information between the joint points in the corresponding image, comprising the following steps:
4.1) the three-dimensional human body posture reconstructed in the previous step is taken as a fake sample and the already collected three-dimensional human body posture data as real samples, and the samples are input into the discriminator of the generative adversarial network, so that the reconstructed posture conforms to the distribution of the collected real data and a more reasonable three-dimensional posture is obtained; the discriminator consists of an upper and a lower fully connected feature extraction module and a fully connected real/fake prediction module; the three-dimensional posture sample is first input into the upper and lower fully connected feature extraction modules for feature extraction, the features extracted by the two modules are concatenated into a merged feature, the merged feature is input into the fully connected real/fake prediction module to judge whether the sample is real or fake, a judgment value is output, and the authenticity error of the three-dimensional posture is calculated from the judgment value with the loss function of the generative adversarial network; the upper and lower fully connected feature extraction modules have the same structure, each consisting of the feature extraction module of the depth prediction network and a residual network module composed of one residual block, and the fully connected real/fake prediction module consists of a fully connected layer containing 1024 neurons, a linear rectification activation function layer and a fully connected layer containing 1 neuron;
4.2) the relative depth error is calculated from the reconstructed three-dimensional posture and the relative depth information between the joint points in the corresponding image; the relative depth information between the joint points of the human body in the image is obtained by human observation of the image and stored in a matrix of 16 rows and 16 columns, specifically: if, from the image, the i-th joint point of the human body is closer to the camera than the j-th joint point, the element r(i, j) in row i and column j of the matrix is 1; if the i-th joint point is farther from the camera than the j-th joint point, r(i, j) is -1; if the difference between the distances of the i-th and j-th joint points from the camera is within a set range, r(i, j) is 0; here i and j are integers in the interval [1, 16], r is the matrix storing the relative depth information between the joint points, and r(i, j) is the element in row i and column j representing the relative depth relationship between the i-th and j-th joint points;
the relative depth error between each pair of joint points in the three-dimensional human body posture reconstructed in step 3) is calculated with the obtained relative depth matrix, specifically:
L_{i,j} = \left| r(i,j) \right| \, \log\!\left( 1 + \exp\!\left( r(i,j)\,(H_i - H_j) \right) \right) + \left( 1 - \left| r(i,j) \right| \right) (H_i - H_j)^2
where L_{i,j} denotes the relative depth error of the point pair formed by the i-th and j-th joint points in the three-dimensional human body posture; r(i, j) denotes the relative depth relationship between the i-th and j-th joint points, taking a value in {1, -1, 0}; |r(i, j)| denotes the absolute value of r(i, j); H_i and H_j denote the depth values of the i-th and j-th joint points obtained from the depth prediction network; finally, the sum of the relative depth errors of the 256 point pairs formed pairwise by the 16 joint points of the human body is calculated from the relative depth errors of the individual pairs, specifically:
L_{rank} = \sum_{(i,j) \in B} L_{i,j}
where L_rank denotes the sum of the relative depth errors of the 256 point pairs formed pairwise by the 16 joint points of the human body, (i, j) denotes the point pair formed by the i-th and j-th joint points, and B denotes the set of the 256 point pairs formed pairwise by the 16 joint points; the calculated sum of the relative depth errors of the 256 point pairs is taken as the relative depth error of the three-dimensional posture of the human body.
6. The three-dimensional human body posture estimation method based on the adversarial relative depth constraint network according to claim 1, characterized in that: in step 5), the authenticity error calculated by the discriminator of the generative adversarial network is added to the relative depth error to obtain the total error of the reconstructed three-dimensional posture in terms of authenticity and relative depth; the error is fed back to the depth prediction network by back-propagation with gradient descent and the parameters of the depth prediction network are updated, so that the network learns both the authenticity of the three-dimensional posture and the relative depth information between the joint points in the picture, predicts more accurate joint depths, and reconstructs a more accurate three-dimensional human body posture.
CN202010521352.7A 2020-06-10 2020-06-10 Three-dimensional human body posture estimation method based on adversarial relative depth constraint network Active CN111914618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010521352.7A CN111914618B (en) 2020-06-10 2020-06-10 Three-dimensional human body posture estimation method based on adversarial relative depth constraint network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010521352.7A CN111914618B (en) 2020-06-10 2020-06-10 Three-dimensional human body posture estimation method based on adversarial relative depth constraint network

Publications (2)

Publication Number Publication Date
CN111914618A true CN111914618A (en) 2020-11-10
CN111914618B CN111914618B (en) 2024-05-24

Family

ID=73237497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010521352.7A Active CN111914618B (en) Three-dimensional human body posture estimation method based on adversarial relative depth constraint network

Country Status (1)

Country Link
CN (1) CN111914618B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066169A (en) * 2021-04-14 2021-07-02 湘潭大学 Human body three-dimensional posture reconstruction method and system based on skeleton length constraint
CN113239892A (en) * 2021-06-10 2021-08-10 青岛联合创智科技有限公司 Monocular human body three-dimensional attitude estimation method based on data enhancement architecture
CN113506131A (en) * 2021-06-29 2021-10-15 安徽农业大学 Personalized recommendation method based on generative confrontation network
CN117456612A (en) * 2023-12-26 2024-01-26 西安龙南铭科技有限公司 Cloud computing-based body posture automatic assessment method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427877A (en) * 2019-08-01 2019-11-08 大连海事大学 A method of the human body three-dimensional posture estimation based on structural information
CN110647991A (en) * 2019-09-19 2020-01-03 浙江大学 Three-dimensional human body posture estimation method based on unsupervised field self-adaption
CN110826500A (en) * 2019-11-08 2020-02-21 福建帝视信息科技有限公司 Method for estimating 3D human body posture based on antagonistic network of motion link space

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427877A (en) * 2019-08-01 2019-11-08 大连海事大学 A method of the human body three-dimensional posture estimation based on structural information
CN110647991A (en) * 2019-09-19 2020-01-03 浙江大学 Three-dimensional human body posture estimation method based on unsupervised field self-adaption
CN110826500A (en) * 2019-11-08 2020-02-21 福建帝视信息科技有限公司 Method for estimating 3D human body posture based on antagonistic network of motion link space

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG Bin et al.: "Video-based three-dimensional human pose estimation", Journal of Beijing University of Aeronautics and Astronautics, vol. 45, no. 12, 31 December 2019 (2019-12-31), pages 2463 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066169A (en) * 2021-04-14 2021-07-02 湘潭大学 Human body three-dimensional posture reconstruction method and system based on skeleton length constraint
CN113066169B (en) * 2021-04-14 2022-06-07 湘潭大学 Human body three-dimensional posture reconstruction method and system based on skeleton length constraint
CN113239892A (en) * 2021-06-10 2021-08-10 青岛联合创智科技有限公司 Monocular human body three-dimensional attitude estimation method based on data enhancement architecture
CN113506131A (en) * 2021-06-29 2021-10-15 安徽农业大学 Personalized recommendation method based on generative confrontation network
CN113506131B (en) * 2021-06-29 2023-07-25 安徽农业大学 Personalized recommendation method based on generated type countermeasure network
CN117456612A (en) * 2023-12-26 2024-01-26 西安龙南铭科技有限公司 Cloud computing-based body posture automatic assessment method and system
CN117456612B (en) * 2023-12-26 2024-03-12 西安龙南铭科技有限公司 Cloud computing-based body posture automatic assessment method and system

Also Published As

Publication number Publication date
CN111914618B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN111914618B (en) Three-dimensional human body posture estimation method based on adversarial relative depth constraint network
Gomez-Donoso et al. Lonchanet: A sliced-based cnn architecture for real-time 3d object recognition
CN110310317A (en) A method of the monocular vision scene depth estimation based on deep learning
CN111652966A (en) Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
CN112580515B (en) Lightweight face key point detection method based on Gaussian heat map regression
CN101877143A (en) Three-dimensional scene reconstruction method of two-dimensional image group
CN112598775B (en) Multi-view generation method based on contrast learning
CN109544666A (en) A kind of full automatic model deformation transmission method and system
CN111062326A (en) Self-supervision human body 3D posture estimation network training method based on geometric drive
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN114666564A (en) Method for synthesizing virtual viewpoint image based on implicit neural scene representation
CN112489198A (en) Three-dimensional reconstruction system and method based on counterstudy
CN112819951A (en) Three-dimensional human body reconstruction method with shielding function based on depth map restoration
Lutz et al. Jointformer: Single-frame lifting transformer with error prediction and refinement for 3d human pose estimation
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
Zhang et al. Fchp: Exploring the discriminative feature and feature correlation of feature maps for hierarchical dnn pruning and compression
Peng et al. Attention-guided fusion network of point cloud and multiple views for 3D shape recognition
WO2023214093A1 (en) Accurate 3d body shape regression using metric and/or semantic attributes
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN116091762A (en) Three-dimensional target detection method based on RGBD data and view cone
CN110543845A (en) Face cascade regression model training method and reconstruction method for three-dimensional face
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN113192186B (en) 3D human body posture estimation model establishing method based on single-frame image and application thereof
CN110517307A (en) The solid matching method based on laser specklegram is realized using convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant