CN116030498A - Three-dimensional human body posture estimation method for virtual garment runway shows - Google Patents

Three-dimensional human body posture estimation method for virtual garment runway shows

Info

Publication number
CN116030498A
CN116030498A (application number CN202310079683.3A)
Authority
CN
China
Prior art keywords
human body
dimensional
body posture
posture estimation
dimensional human
Prior art date
Legal status
Pending
Application number
CN202310079683.3A
Other languages
Chinese (zh)
Inventor
Li Geng (李耿)
Zhang Peng (张朋)
Yuan Kexin (袁可欣)
Ding Pengfei (丁鹏飞)
Zhang Jie (张洁)
Current Assignee
Donghua University
Original Assignee
Donghua University
Priority date
Filing date
Publication date
Application filed by Donghua University
Priority to CN202310079683.3A
Publication of CN116030498A
Legal status: pending

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a three-dimensional human body posture estimation method for virtual garment runway shows. An improved Kalman filter is used to preprocess the input images. A high-resolution two-dimensional human body posture estimation network is designed based on HRNet-W32, trained on a standard two-dimensional data set, and used to estimate the two-dimensional human body postures corresponding to the RGB images in a three-dimensional data set. A graph-convolution three-dimensional regression network based on residual connections and an attention mechanism is constructed and, combining semantic information, trained with the postures estimated by the two-dimensional network as input and the corresponding ground-truth three-dimensional postures as labels. The two trained networks are then connected in series to obtain the final three-dimensional human body posture estimation model. Compared with other three-dimensional human body posture estimation methods, the invention realizes more accurate three-dimensional human body posture estimation.

Description

Three-dimensional human body posture estimation method for virtual garment runway shows
Technical Field
The invention relates to three-dimensional human body posture estimation technology, and in particular to a three-dimensional human body posture estimation method for virtual garment runway shows.
Background
Garment shows are gradually becoming virtual. Traditional virtual runway-show methods fall mainly into two categories, animation simulation and special-effects production: animation simulation requires professional artists to design the runway motions, with long production cycles and monotonous results; special-effects production requires real models to participate, making truly virtual garment shows difficult to realize, while also consuming large amounts of manpower, material resources, and funds.
As an important component of computer vision, three-dimensional human body posture estimation has, with the development of deep learning, been widely applied in fields such as augmented reality and human-computer interaction. If three-dimensional human body posture estimation could be applied to virtual garment runway shows, driving a virtual character model with videos of models walking the runway, the diversification and virtualization of runway motions could be realized while saving substantial resources.
Current image-based three-dimensional human body posture estimation methods are mainly divided into single-stage and two-stage approaches. The single-stage approach takes an RGB image directly as input, extracts the main joint information of the human body from the image, and regresses the three-dimensional posture. Lacking supervised training, however, it is easily affected by factors such as background and illumination, and its accuracy and generalization are poor. The two-stage approach first applies a two-dimensional human body posture estimation network to the input RGB picture to obtain the two-dimensional coordinates of the main joints, and then feeds the two-dimensional skeleton into a constructed neural network for three-dimensional posture regression. Because the two-dimensional network is trained with supervision, the overall generalization of the two-stage approach is good, but the accuracy of the three-dimensional regression depends heavily on the accuracy of the two-dimensional estimation. Given the complexity of runway-show motion, existing two-stage methods cannot adequately realize virtual garment shows based on three-dimensional human body posture estimation.
Disclosure of Invention
Aiming at the insufficient accuracy of three-dimensional human body posture estimation in virtual runway-show scenes, a three-dimensional human body posture estimation method for virtual garment runway shows is provided, which estimates the three-dimensional human body posture accurately and meets the requirements of virtual runway scenes.
The technical scheme of the invention is as follows: a three-dimensional human body posture estimation method for virtual garment runway shows, comprising the following steps:
1) Image data acquisition: a model runway-show video in a single scene is decomposed into a sequence of frames, acquiring the image data of the input video;
2) Image preprocessing: an improved Kalman filter is used to optimize the human motion state, reducing the joint-point prediction deviation caused by clothing occlusion and self-occlusion in the video images;
3) Building and training a two-dimensional human body posture estimation network model, and feeding the preprocessed images into the trained model to obtain two-dimensional posture estimates;
wherein the two-dimensional human body posture estimation network structure comprises the first three stages of a high-resolution network based on HRNet-W32, in which the convolution kernels of the bottleneck modules in the first stage and of the basic block modules in the second and third stages are replaced by a pyramid segmentation attention module, compensating for the accuracy loss caused by removing the fourth stage of HRNet-W32;
wherein training of the two-dimensional network: the model is trained on the public COCO2017 data set; before training, the pictures in the data set are preprocessed and resized to a fixed 256×192, the learning rate is set to 0.001, the number of training epochs to 210, and mean average precision is used as the evaluation standard for two-dimensional posture estimation;
4) Constructing and training a three-dimensional human body posture regression network model, and feeding the two-dimensional posture estimates into the trained model for three-dimensional posture regression;
the three-dimensional regression model is a graph convolution network based on residual connections and an attention mechanism; by combining graph convolution with semantic information, channel weightings of prior edges hidden in the two-dimensional posture are learned and combined with the kernel matrix, improving the capability of the graph convolution; the human skeleton is treated as graph-structured data, and residual connections are used to eliminate over-smoothing as graph convolutions are stacked; an attention mechanism acquires local and global context information among different key points through global context, alleviating the occlusion and depth-ambiguity problems of three-dimensional human body posture estimation;
wherein training of the three-dimensional regression model: subjects S1, S5, S6, S7 and S8 of the Human3.6M data set are used as the training set, and S9 and S11 as the validation set; the Euclidean distance between the predicted three-dimensional joint coordinates and the ground-truth joint coordinates is used as the evaluation standard of the final three-dimensional posture estimation result.
Further, the improved Kalman filtering in step 2) optimizes the human motion state as follows: the motion of the human body along each axis of three-dimensional space is a Bézier curve, so the motion along each axis is approximately uniformly accelerated (or decelerated); the prediction of the current position is therefore obtained from the states of the previous three positions combined with the change of acceleration;
kalman filter optimization: first, the first three states x are acquired k-1 、x k-2 、x k-3 Predicting the current state from the first three states:
Figure SMS_1
wherein->
Figure SMS_2
For predicting position->
Figure SMS_3
For observing position, K k Is the kalman gain.
Further, the pyramid segmentation attention module in step 3) consists of an SPC module and an SE Weight module: the SPC module splits the channels and performs multi-scale feature extraction on the spatial information of each channel feature map; the SE Weight module extracts channel attention from the feature maps at different scales, obtaining a channel attention vector for each scale; the module finally outputs a feature map with richer multi-scale information representation, realizing finer-grained extraction and fusion of multi-scale features and improving network accuracy.
Further, in step 3) the last stage of the two-dimensional posture estimation network outputs 3 feature maps of different sizes; an adaptive spatial feature fusion algorithm fuses these multi-scale features at the final stage of the network, with a size and channel number selected as the fusion standard, and a 1×1 convolution of the fused output yields the final output.
Further, in step 4), to avoid losing feature information of the joint heat maps output by the two-dimensional network during three-dimensional regression, which would degrade the model's performance, the joint points of the two-dimensional posture are estimated as the integral over all positions in the heat map, normalized as a probability-weighted sum; the specific calculation is:

M̃_k(q) = exp(M_k(q)) / Σ_{q∈Ω} exp(M_k(q))

N_k = Σ_{p∈Ω} p · M̃_k(p)

where p is a position in the domain; q is a pixel point associated with a position; N_k is the transformed joint point; M_k is the heat map; M̃_k is the regularized heat map; Ω is the domain of M_k.
The beneficial effects of the invention are as follows: starting from the original two-dimensional posture estimation network HRNet-W32, the fourth stage, whose information is heavily redundant, is removed, while a pyramid segmentation attention module replaces the 3×3 convolutions of the bottleneck and basic block modules of the original network, achieving more efficient multi-scale information extraction; an adaptive spatial feature fusion strategy then fuses the features of each layer at the output stage, exploiting the semantic information of the high-level features more fully, thereby remedying the deficient low-resolution feature extraction caused by removing the fourth stage and finally achieving accurate computation of the two-dimensional posture. For the regression from the two-dimensional posture to three dimensions, the invention first performs integral regression on the joint heat maps output by the two-dimensional network, avoiding the loss of feature information when regressing from heat maps; by combining graph convolution with semantic information, it addresses the limitation that a graph-convolution filter operates only in the one-step neighborhood of each node, so that the receptive field stays at 1 and information exchange in the network is severely inefficient; finally, a non-local layer is introduced in the three-dimensional regression part to capture local and global relations between nodes, improving three-dimensional posture regression performance.
Drawings
FIG. 1 is a flow chart of the three-dimensional human body posture estimation model construction method for virtual garment runway shows;
FIG. 2 is a schematic diagram of a pyramid segmentation attention module architecture;
FIG. 3a is a block diagram of the bottleneck module of HRNet-W32 after replacement by the pyramid segmentation attention module;
FIG. 3b is a block diagram of the basic block module of HRNet-W32 after replacement by the pyramid segmentation attention module;
fig. 4 is a schematic diagram of the overall network structure of the three-dimensional human body posture estimation model for virtual garment runway shows.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The construction of the three-dimensional human body posture estimation model for virtual garment runway shows, shown in the flow chart of fig. 1, comprises the following steps:
1. Image data acquisition: the model runway-show video in a single scene is decomposed into a sequence of frames, acquiring the image data of the input video.
2. Image preprocessing: since clothing occlusion and self-occlusion in the video images easily cause large deviations in joint-point prediction, an improved Kalman filter is used to optimize the human motion state and reduce this deviation.
Considering that the motion of the human body along each axis of three-dimensional space is a Bézier curve, the motion along each axis can be approximated as uniformly accelerated (or decelerated), so the prediction of the current position can be obtained from the states of the previous three positions combined with the change of acceleration.
Kalman filter optimization: first, the previous three states x_{k-1}, x_{k-2}, x_{k-3} are acquired and the current state is predicted from them, then the prediction is corrected by the observation:

x̂_k = x̂_k⁻ + K_k(z_k − x̂_k⁻)

where x̂_k⁻ is the predicted position, z_k is the observed position, and K_k is the Kalman gain.
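The three-state prediction and Kalman correction above can be sketched in a few lines (a minimal one-dimensional sketch: the second-order extrapolation x_k ≈ 3x_{k-1} − 3x_{k-2} + x_{k-3} follows from the uniform-acceleration assumption, while the fixed gain value is purely illustrative, not the patent's tuned filter):

```python
def predict_from_three(x1, x2, x3):
    """Predict the current position from the previous three states
    x_{k-1}, x_{k-2}, x_{k-3} under (nearly) uniform acceleration:
    treating velocity and acceleration as finite differences gives
    x_k ≈ 3*x_{k-1} - 3*x_{k-2} + x_{k-3}."""
    return 3 * x1 - 3 * x2 + x3

def kalman_update(x_pred, z_obs, gain):
    """Standard Kalman correction: blend the predicted position with
    the observed position using the Kalman gain K_k."""
    return x_pred + gain * (z_obs - x_pred)

# A joint moving with constant acceleration along one axis: x(t) = t**2.
xs = [t ** 2 for t in range(4)]                    # positions at t = 0..3
x_pred = predict_from_three(xs[2], xs[1], xs[0])   # predict t = 3 from t = 0..2
print(x_pred)                                      # 9, matching xs[3] exactly
x_est = kalman_update(float(x_pred), 9.4, 0.5)     # blend with a noisy observation
```

Under exact constant acceleration the extrapolation is error-free, which is why occlusion-induced jumps in observed joint positions can be damped by trusting the prediction (a small gain).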
3. Building the two-dimensional human body posture estimation network model: a high-resolution network is established based on HRNet-W32. The conventional HRNet-W32 consists of four stages: the first stage comprises four residual units, each a bottleneck module of width 64 followed by a 3×3 convolution layer; the second, third and fourth stages each contain several basic block multi-resolution modules. Network improvement: after appropriate network pruning and analysis of ablation experiments on the original high-resolution model, the first three stages are retained, and a pyramid segmentation attention module replaces the convolution kernels of the original bottleneck and basic block modules, enlarging the receptive field; deep and shallow features are extracted, the extraction of multi-scale features is guaranteed, and the two-dimensional joint prediction accuracy is improved while parameters and computational complexity are reduced.
4. Training the two-dimensional human body posture estimation network model: the model is trained on the public COCO2017 data set; before training, the pictures in the data set are preprocessed and resized to a fixed 256×192, the learning rate is set to 0.001, the number of training epochs to 210, and mean average precision (mAP) is used as the evaluation standard for two-dimensional posture estimation.
5. Building the three-dimensional human body posture regression network model: a graph convolution network based on residual connections and an attention mechanism is constructed, combined with semantic information, to realize the regression from the two-dimensional posture to three-dimensional space. Since the human posture model is skeleton-based, the skeleton can be treated as graph-structured data, and residual connections are used to eliminate over-smoothing as graph convolutions are stacked; to further explore the relations between human key points and capture the semantic information hidden between them, an attention mechanism acquires local and global context information among different key points through global context, alleviating the occlusion and depth-ambiguity problems of three-dimensional posture estimation.
Integral regression is performed on the joint heat maps output by the two-dimensional network, avoiding the loss of feature information when regressing from heat maps; combining graph convolution with semantic information addresses the limitation that a graph-convolution filter operates only in the one-step neighborhood of each node, so that the receptive field stays at 1 and information exchange is severely inefficient; finally, a non-local layer is introduced in the three-dimensional regression part to capture local and global relations between nodes, improving regression performance.
6. Training the three-dimensional human body posture regression network model: subjects S1, S5, S6, S7 and S8 of the Human3.6M data set are used as the training set, and S9 and S11 as the validation set. The Euclidean distance between the predicted three-dimensional joint coordinates and the ground-truth joint coordinates is used as the evaluation standard of the final three-dimensional posture estimation result.
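The Euclidean-distance evaluation standard above is commonly reported as the mean per-joint position error; a minimal sketch (the 17-joint layout and the 0.05 offset are illustrative, not values from the patent):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: the Euclidean distance between each
    predicted and ground-truth 3D joint (arrays of shape (J, 3)),
    averaged over all J joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

gt = np.zeros((17, 3))          # 17 ground-truth joints at the origin
pred = gt.copy()
pred[:, 0] = 0.05               # every joint predicted 0.05 off along x
err = mpjpe(pred, gt)
print(err)                      # ≈ 0.05
```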
As shown in tables 1 and 2, the fourth stage of the HRNet-W32 network accounts for 72.53% of the total parameters and 38.15% of the floating-point operations, yet improves human body posture estimation accuracy on the MPII data set by only 0.4%. This is mainly because feature extraction becomes less efficient as the network deepens, and the receptive field exceeding the original image size during feature extraction causes information redundancy. To reduce the floating-point operations and parameter count, the invention adjusts the feature receptive field of the HRNet-W32 network and removes the highly redundant fourth stage.
TABLE 1
[table rendered as an image in the original]
TABLE 2
[table rendered as an image in the original]
As shown in fig. 2, to compensate for the accuracy loss caused by removing the fourth stage of the HRNet-W32 network, the invention introduces a pyramid segmentation attention module to replace the 3×3 convolutions in the bottleneck and basic block modules of HRNet-W32; the two replaced modules are shown in figs. 3a and 3b.
The pyramid segmentation attention module consists mainly of an SPC module and an SE Weight module. The SPC module splits the channels and performs multi-scale feature extraction on the spatial information of each channel feature map; the SE Weight module extracts channel attention from the feature maps at different scales, obtaining a channel attention vector for each scale; the module finally outputs a feature map with richer multi-scale information representation, realizing finer-grained extraction and fusion of multi-scale features and improving network accuracy.
The mathematical expression is as follows:
for the input feature map X, it is divided into S parts by [ X ] 0 ,X 1 ,...,X S-1 ]Representing the number of channels per segment
Figure SMS_12
C represents the total number of channels, and the feature map X after segmentation i ∈R C′×W×H I=0, 1..s-1, w represents image width and H represents image height. And for each segmented channel feature map, multi-scale convolution kernel group convolution is used, and the spatial information of different scale feature maps is extracted while the parameter number is reduced. Feature map F of different scales i The specific calculation mode is as follows: f (F) i =Conv(K i ×K i ,G i )(X i ),i=0,1,...,S-1
Here the number of the elements is the number,
Figure SMS_13
thus, a characteristic diagram after multi-scale fusion is obtained:
F=Cat([F 0 ,F 1 ,...,F S-1 ])
here F.epsilon.R C×W×H After extracting the multi-scale feature map, the feature map F with different scales i And extracting the attention weight of the channel, wherein the calculation formula is as follows:
Z i =SEWeight(F i ),i=0,1,...,S-1
where Z is i ∈R C′×1×1 The whole multi-scale channel attention weight vector is:
Figure SMS_14
and then, further carrying out weight calibration on the channel attention by using Softmax, so as to realize information interaction among the multi-scale channel attention.
Figure SMS_15
The feature map F_i of each scale is then multiplied channel-wise by the recalibrated attention vector:

Y_i = F_i ⊙ att_i, i = 0, 1, ..., S−1

Finally, the attention-weighted multi-scale feature maps are concatenated, outputting a feature map with richer multi-scale information:

Out = Cat([Y_0, Y_1, ..., Y_{S−1}])
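The scale-wise softmax recalibration and channel-wise weighting above can be sketched as follows (NumPy only; the SE-derived attention logits are replaced by random stand-ins, and S, C′, H, W are chosen purely for illustration):

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def psa_recalibrate(feats, logits):
    """feats: list of S feature maps, each of shape (C', H, W);
    logits: (S, C') raw channel-attention values, one row per scale
    (in the module these come from SE Weight). Softmax across the S
    scales, then channel-wise reweighting Y_i = F_i * att_i, then
    concatenation along the channel axis."""
    att = softmax(logits, axis=0)                       # columns sum to 1
    out = [f * att[i][:, None, None] for i, f in enumerate(feats)]
    return np.concatenate(out, axis=0)                  # (S*C', H, W)

S, C, H, W = 4, 8, 6, 6
feats = [np.ones((C, H, W)) for _ in range(S)]          # all-ones toy features
out = psa_recalibrate(feats, np.random.randn(S, C))
print(out.shape)                                        # (32, 6, 6)
```

Because the softmax runs across scales, the per-channel weights of the S branches sum to one, which is what realizes the information interaction among the multi-scale channel attentions.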
as shown in fig. 4, in order to fully utilize the three-dimensional human body posture to estimate the final layer 3 feature graphs with different sizes of the network, the invention uses an adaptive spatial feature fusion Algorithm (ASFF) to fuse multi-scale features at the final stage of the network, so as to realize more accurate key point detection. Considering that the sizes of 3 feature graphs are 1/4, 1/8 and 1/16 of the size of the original image respectively, the size and the channel number of the feature graph with the size of 1/4 of the original image are selected as feature fusion standards. Firstly, carrying out 1X 1 convolution on other 2 feature images with different sizes to ensure that the channel number of the feature images is consistent with that of the feature images with 1/4 size of the original image; secondly, up-sampling is carried out on the characteristic images with the sizes of 1/8 and 1/16 of the original image by 2 times and 4 times respectively so that the sizes of 3 characteristic images are kept consistent. Finally, 3 feature patterns X 1 m,n 、X 2 m,n 、X 3 m,n And carrying out self-adaptive spatial feature fusion, and carrying out 1×1 convolution on the fused output to obtain a final output, so that the network always maintains high-resolution representation.
The 3 adjusted feature maps, of equal size and channel number, contain different local detail features; ASFF fuses them according to the per-layer weight parameters a^{m,n}, b^{m,n}, c^{m,n} with the strategy:

Y^{m,n} = a^{m,n} X_1^{m,n} + b^{m,n} X_2^{m,n} + c^{m,n} X_3^{m,n}

where Y^{m,n} is the fused feature map, a^{m,n}, b^{m,n}, c^{m,n} ∈ [0, 1], and

a^{m,n} + b^{m,n} + c^{m,n} = 1

The weight parameters are obtained by applying 1×1 convolutions to the three feature maps X_1^{m,n}, X_2^{m,n}, X_3^{m,n} to produce λ_a^{m,n}, λ_b^{m,n}, λ_c^{m,n}, which after concatenation are constrained to [0, 1] with unit sum via softmax:

a^{m,n} = e^{λ_a^{m,n}} / (e^{λ_a^{m,n}} + e^{λ_b^{m,n}} + e^{λ_c^{m,n}})

b^{m,n} = e^{λ_b^{m,n}} / (e^{λ_a^{m,n}} + e^{λ_b^{m,n}} + e^{λ_c^{m,n}})

c^{m,n} = e^{λ_c^{m,n}} / (e^{λ_a^{m,n}} + e^{λ_b^{m,n}} + e^{λ_c^{m,n}})
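The ASFF fusion rule, with the softmax guaranteeing a + b + c = 1 at every spatial position, can be sketched as below (the weight logits are supplied directly here rather than produced by 1×1 convolutions, and the map sizes are illustrative):

```python
import numpy as np

def asff_fuse(x1, x2, x3, lam):
    """x1..x3: (H, W) feature maps already resized to a common scale.
    lam: (3, H, W) raw weight logits; a per-pixel softmax over the
    first axis yields weights a, b, c in [0, 1] summing to 1."""
    e = np.exp(lam - lam.max(axis=0, keepdims=True))
    a, b, c = e / e.sum(axis=0, keepdims=True)
    return a * x1 + b * x2 + c * x3

H, W = 4, 4
y = asff_fuse(np.full((H, W), 1.0),
              np.full((H, W), 2.0),
              np.full((H, W), 3.0),
              np.zeros((3, H, W)))      # zero logits -> equal weights 1/3
print(y[0, 0])                          # 2.0
```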
the joint coordinate conversion is carried out on the heat map output by the two-dimensional human body posture estimation network, so that the method has great significance in improving the performance of the network model, and the network does not need to always keep high resolution after the joint coordinate conversion is carried out on the heat map, so that the advantages of the two methods of heat map representation and regression are fully utilized, and the operation complexity of the subsequent network is greatly reduced. The integral regression of the two-dimensional heat map is to estimate joints as the integral of all positions in the heat map and normalize according to probability weighted summation. Because the integral is without parameters, the integral regression has little effect on the performance of the network model in terms of calculation and storage. The specific integral formula is as follows:
Figure SMS_19
Figure SMS_20
where p is the position present in the domain; q is a pixel point associated with a position; m is M k Is a heat map;
Figure SMS_21
is a regularized heat map; omega is M k Is a domain of (c).
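The integral (soft-argmax) regression of a single-joint heat map can be sketched as follows (a NumPy sketch with synthetic heat-map values; the 8×8 size is illustrative):

```python
import numpy as np

def integral_joint(heatmap):
    """Soft-argmax: softmax-normalize the heat map M_k into the
    regularized map, then take the probability-weighted expectation
    of the pixel coordinates, N_k = sum_p p * M~_k(p)."""
    h = np.exp(heatmap - heatmap.max())
    h /= h.sum()                                   # regularized heat map
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    return float((xs * h).sum()), float((ys * h).sum())

hm = np.full((8, 8), -10.0)
hm[5, 2] = 10.0                                    # sharp peak at (x=2, y=5)
x, y = integral_joint(hm)
print(round(x), round(y))                          # 2 5
```

Unlike a hard argmax, the expectation is differentiable, which is what lets the heat map feed the three-dimensional regression network without losing feature information.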
First, a two-dimensional posture estimation network estimates the two-dimensional posture of the input picture or video, obtaining the coordinates of the two-dimensional joints, which are then taken as input for three-dimensional posture regression. Since human joints are correlated, the human posture built from the two-dimensional joints can be treated as a graph structure, and graph convolution is used to realize the regression of the three-dimensional posture. Graph convolution, however, handles the varying neighborhoods of graph nodes by letting the convolution filter share the same weight matrix for all nodes; moreover, the filter is limited to the one-step neighborhood of each node, so the convolution-kernel receptive field stays at 1, severely hindering information exchange as the network deepens. To solve these problems, the invention combines graph convolution with semantic information to learn channel weightings of the prior edges hidden in the two-dimensional posture and combines them with the kernel matrix, improving the capability of the graph convolution; to alleviate the over-smoothing caused by stacking graph convolutions, a residual-connected graph convolution network model is constructed, and a non-local layer is introduced to capture local and global relations between nodes, improving the performance of three-dimensional posture regression.
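A residual graph-convolution layer over a skeleton graph, of the kind described above, can be sketched as follows (a minimal NumPy sketch with a hypothetical five-joint chain skeleton; the row normalization and feature sizes are illustrative, not the patent's exact network):

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer with a residual connection:
    X' = ReLU(A_hat @ X @ W) + X, where A_hat is the row-normalized
    adjacency matrix with self-loops. The residual term is what
    counters over-smoothing when such layers are stacked."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)
    return np.maximum(A_hat @ X @ W, 0.0) + X

# Toy 5-joint chain (e.g. hip - spine - chest - neck - head).
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
X = np.random.randn(5, 2)                          # 2D joint coordinates as features
W = np.eye(2)                                      # identity weights for the sketch
out = gcn_layer(X, A, W)
print(out.shape)                                   # (5, 2)
```

Stacking several such layers (with learned W and, per the invention, semantic channel weightings and a non-local layer) maps the 2D skeleton to 3D joint positions.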
As shown in Table 3, the experimental results of the first-stage two-dimensional posture estimation network on the COCO2017 validation set are compared with other methods. The results show that the proposed method performs better in two-dimensional human posture estimation than the other networks: compared with the original HRNet-W32, AP_50 improves by 3.1%, AP_75 by 2.7%, and mean average precision mAP by 2.2%. The accuracy of the proposed method for two-dimensional human posture estimation is thus higher than that of the other networks.
TABLE 3 Table 3
[table rendered as an image in the original]
As shown in table 4, two-dimensional key-point detection accuracy is verified on 384×384 pictures from the COCO2017 validation set by computing the percentage of correct keypoints (PCK) and comparing it with other network models. Here Head denotes the average correct proportion of the 5 joint estimates associated with the head; Shoulder, of the 2 joints associated with the shoulders; Elbow, of the 2 joints associated with the elbows; Wrist, of the 2 joints associated with the wrists; Hip, of the 2 joints associated with the hips; Knee, of the 2 joints associated with the knees; Ankle, of the 2 joints associated with the ankles; Average denotes the average correct proportion over all joint estimates. The comparison in Table 4 shows that the joint estimation accuracy of the proposed first-stage two-dimensional network improves to a certain extent, and the average estimation accuracy also reaches a high value.
Table 4 (reproduced as an image in the original publication)
As shown in Table 5, the experimental results of the two-stage three-dimensional human body posture estimation network model on the Human3.6M data set are compared with those of other methods. The results show that, in three-dimensional human body posture estimation, the proposed method achieves a smaller Euclidean distance between the predicted three-dimensional human joint coordinates and the ground-truth human joint coordinates, improving the precision of three-dimensional human body posture estimation.
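The evaluation standard used here, the mean Euclidean distance between predicted and ground-truth 3-D joints, is commonly called MPJPE. A minimal sketch (illustrative names and shapes, not the patent's code):

```python
# Illustrative MPJPE: mean Euclidean distance over all joints and samples.
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (N, J, 3) 3-D joint coordinates (e.g. in mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.zeros((1, 17, 3))
gt = np.zeros((1, 17, 3))
gt[..., 0] = 10.0  # shift every ground-truth joint 10 mm along x
print(mpjpe(pred, gt))  # -> 10.0
```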
Table 5 (reproduced as images in the original publication)
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (5)

1. A three-dimensional human body posture estimation method for virtual clothing walk-show is characterized by comprising the following steps:
1) Image data acquisition: processing a model walk-show video in a single scene into frame-by-frame pictures, and acquiring the image data of the input video;
2) Image preprocessing: the improved Kalman filtering is used for optimizing the human motion state, reducing the prediction deviation of human joint points caused by human clothes shielding and self shielding in the video image, and realizing image preprocessing;
3) Building and training a two-dimensional human body posture estimation network model, and sending the preprocessed image into the trained two-dimensional human body posture estimation network model to obtain two-dimensional posture estimation;
wherein the two-dimensional human body posture estimation network model structure comprises the first three layers of the HRNet-W32 high-resolution two-dimensional human body posture estimation network, and the convolution kernels of the bottleneck module in the first layer and of the basicblock modules in the second and third layers are replaced by pyramid segmentation attention modules, so as to compensate the precision loss caused by removing the fourth layer of the HRNet-W32 network;
wherein the two-dimensional human body posture estimation network model is trained as follows: the network model is trained with the public COCO2017 data set; before training, the pictures in the COCO2017 data set are preprocessed and their size is fixed to 256×192; the learning rate is set to 0.001 and the training period to 210; the average precision index is used as the evaluation standard for two-dimensional posture estimation;
4) Constructing and training a three-dimensional human body posture regression network model, and inputting two-dimensional posture estimation into the trained three-dimensional human body posture regression network model to carry out three-dimensional posture regression;
the three-dimensional human body posture regression network model is a graph convolution network based on residual connections and attention mechanisms; graph convolution is combined with semantic information to learn the channel weighting of the implicit prior edges in the two-dimensional human body posture, and the model is combined with a kernel matrix to improve the capability of the graph convolution; the human skeleton is regarded as graph-structured data, and residual connections are used during graph convolution stacking to mitigate the over-smoothing problem; an attention mechanism acquires local and global context information among different key points through the global context, while alleviating the occlusion and depth ambiguity problems in three-dimensional human body posture estimation;
wherein the three-dimensional human body posture regression network model is trained as follows: the three-dimensional human body posture regression network is trained with S1, S5, S6, S7 and S8 of the Human3.6M data set as the training set, and its effect is verified with S9 and S11 as the verification set; the Euclidean distance between the network-predicted three-dimensional joint point coordinates and the ground-truth human joint point coordinates is used as the evaluation standard of the final three-dimensional human body posture estimation result.
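The residual graph-convolution idea in step 4) above can be sketched as follows. This is an illustrative toy, not the patent's network: the 3-joint chain, the row-normalised adjacency, and the identity weight matrices are all assumptions.

```python
# Toy residual graph convolution over a skeleton graph: each layer
# propagates joint features through a normalised adjacency matrix, and a
# skip connection around stacked layers counteracts over-smoothing.
import numpy as np

def gcn_layer(X, A_hat, W):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    return np.maximum(A_hat @ X @ W, 0.0)

def residual_block(X, A_hat, W1, W2):
    """Two stacked graph convolutions with a skip connection."""
    return X + gcn_layer(gcn_layer(X, A_hat, W1), A_hat, W2)

# 3-joint chain 0-1-2 with self-loops, row-normalised.
A = np.array([[1., 1., 0.], [1., 1., 1.], [0., 1., 1.]])
A_hat = A / A.sum(axis=1, keepdims=True)
X = np.eye(3)        # one-hot feature per joint
W1 = W2 = np.eye(3)
out = residual_block(X, A_hat, W1, W2)
print(out.shape)  # -> (3, 3)
```

The skip connection keeps each joint's own features in the output even as repeated neighbourhood averaging would otherwise blur them together.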
2. The virtual garment walk-show-oriented three-dimensional human body posture estimation method according to claim 1, wherein the improved Kalman filtering of step 2) optimizes the human body motion state as follows: the motion of the human body along each axis in three-dimensional space is a Bezier curve and is approximately uniformly accelerated or decelerated motion along each axis, and the prediction of the current position is obtained by combining the states of the previous three positions with the change of acceleration;
kalman filter optimization: first, the first three states x are acquired k-1 、x k-2 、x k-3 Predicting the current state from the first three states:
Figure FDA0004067080300000021
wherein->
Figure FDA0004067080300000022
For predicting position->
Figure FDA0004067080300000023
For observing position, K k Is the kalman gain.
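The constant-acceleration prediction from three previous states, followed by the gain-weighted correction, can be sketched as below. This is an illustrative reconstruction: the finite-difference predictor follows from the "uniform acceleration" assumption in the claim, and the scalar gain stands in for the full Kalman gain computation, which the claim does not spell out.

```python
# Predict the current position from the previous three states under a
# constant-acceleration model, then correct it with the observation.
def predict_update(x1, x2, x3, z, gain):
    """x1, x2, x3: states at k-1, k-2, k-3; z: observed position;
    gain: stand-in for the Kalman gain K_k."""
    v = x1 - x2                  # latest velocity by first difference
    a = (x1 - x2) - (x2 - x3)    # acceleration by second difference
    x_pred = x1 + v + a          # equals 3*x1 - 3*x2 + x3
    return x_pred + gain * (z - x_pred)

# Uniformly accelerating track 0, 1, 4 (a = 2): the predictor gives 9,
# and a gain of 0.5 pulls it halfway towards the observation 9.5.
print(predict_update(4.0, 1.0, 0.0, 9.5, 0.5))  # -> 9.25
```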
3. The virtual garment walk-show-oriented three-dimensional human body posture estimation method according to claim 1, wherein the pyramid segmentation attention module of step 3) consists of an SPC module and an SE Weight module; the SPC module splits the attention channels and performs multi-scale feature extraction on the spatial information of each channel's feature map; the SE Weight module extracts channel attention from the feature maps of different scales to obtain channel attention vectors for the feature maps of different scales; the module finally outputs a feature map with a richer multi-scale information representation capability, realizing finer-grained extraction and fusion of multi-scale feature information and improving network precision.
4. The three-dimensional human body posture estimation method for virtual clothing walk-show according to claim 3, wherein in step 3) the last level of the two-dimensional human body posture estimation network outputs 3 feature maps of different sizes; an adaptive spatial feature fusion algorithm is used to fuse the multi-scale features at the last stage of the network, with size and channel number selected as the standards for adaptive spatial feature fusion; the fused output is passed through a 1×1 convolution to obtain the final output.
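The fusion step of claim 4 can be sketched as follows. This is a hedged toy version, not the patent's implementation: it assumes the three maps have already been resized to a common resolution and channel count, and the function names, shapes and softmax weighting are illustrative.

```python
# Illustrative adaptive fusion of three same-sized feature maps with
# softmax weights, followed by a 1x1 convolution (a per-pixel linear
# map over channels).
import numpy as np

def adaptive_fusion(feats, logits):
    """feats: list of (C, H, W) maps at a common size; logits: (3,)."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()                                 # softmax fusion weights
    return sum(wi * f for wi, f in zip(w, feats))   # weighted sum

def conv1x1(x, kernel):
    """x: (C_in, H, W); kernel: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', kernel, x)

feats = [np.ones((4, 8, 8)) * v for v in (1.0, 2.0, 3.0)]
fused = adaptive_fusion(feats, np.zeros(3))   # equal weights -> mean of maps
out = conv1x1(fused, np.eye(4))
print(out.shape)
```

In the full algorithm the weights would be predicted per spatial position by the network rather than fixed per level.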
5. The virtual garment walk-show-oriented three-dimensional human body posture estimation method according to claim 1, wherein in step 4), to avoid losing the feature information of the human joint point heat maps output by the two-dimensional posture estimation network during three-dimensional regression, which would reduce the performance of the network model, the joint points of the two-dimensional posture are estimated as the integral over all positions in the heat map, normalized as a probability-weighted sum, according to the specific calculation formulas:

N_k = ∫_{p∈Ω} p · M̃_k(p) dp

M̃_k(q) = M_k(q) / ∫_{q∈Ω} M_k(q) dq

where p is a position in the domain; q is a pixel point associated with a position; N_k is the transformed joint point; M_k is the heat map; M̃_k is the regularized heat map; and Ω is the domain of M_k.
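The integral (soft-argmax) regression of claim 5 can be sketched as follows, with a discrete sum standing in for the integral. The toy heat map and function names are illustrative assumptions, not the patent's code.

```python
# Discrete version of the integral joint estimate: normalise the heat
# map into a probability map, then take the probability-weighted sum of
# all pixel positions.
import numpy as np

def integral_joint(heatmap):
    """heatmap: (H, W) non-negative scores for one joint.
    Returns (x, y) as the expectation over the normalised map."""
    m = heatmap / heatmap.sum()                 # regularised heat map M~_k
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    return (m * xs).sum(), (m * ys).sum()       # N_k = sum_p p * M~_k(p)

hm = np.zeros((5, 5))
hm[2, 3] = 1.0  # all probability mass at column 3, row 2
print(integral_joint(hm))  # -> (3.0, 2.0)
```

Unlike a hard argmax, this expectation is differentiable, which is what lets the heat maps feed the three-dimensional regression stage without discarding their feature information.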
CN202310079683.3A 2023-01-17 2023-01-17 Virtual garment running and showing oriented three-dimensional human body posture estimation method Pending CN116030498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310079683.3A CN116030498A (en) 2023-01-17 2023-01-17 Virtual garment running and showing oriented three-dimensional human body posture estimation method


Publications (1)

Publication Number Publication Date
CN116030498A true CN116030498A (en) 2023-04-28

Family

ID=86075796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310079683.3A Pending CN116030498A (en) 2023-01-17 2023-01-17 Virtual garment running and showing oriented three-dimensional human body posture estimation method

Country Status (1)

Country Link
CN (1) CN116030498A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704552A (en) * 2023-06-13 2023-09-05 中国电子科技集团公司第五十四研究所 Human body posture estimation method based on main and secondary features
CN116704552B (en) * 2023-06-13 2024-03-12 中国电子科技集团公司第五十四研究所 Human body posture estimation method based on main and secondary features
CN116805423A (en) * 2023-08-23 2023-09-26 江苏源驶科技有限公司 Lightweight human body posture estimation algorithm based on structural heavy parameterization
CN116805423B (en) * 2023-08-23 2023-11-17 江苏源驶科技有限公司 Lightweight human body posture estimation algorithm based on structural heavy parameterization
CN117611675A (en) * 2024-01-22 2024-02-27 南京信息工程大学 Three-dimensional human body posture estimation method, device, storage medium and equipment
CN117611675B (en) * 2024-01-22 2024-04-16 南京信息工程大学 Three-dimensional human body posture estimation method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN110427877B (en) Human body three-dimensional posture estimation method based on structural information
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
CN107492121B (en) Two-dimensional human body bone point positioning method of monocular depth video
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN111696137B (en) Target tracking method based on multilayer feature mixing and attention mechanism
CN111626159B (en) Human body key point detection method based on attention residual error module and branch fusion
TWI226193B (en) Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113313810B (en) 6D attitude parameter calculation method for transparent object
CN110363068B (en) High-resolution pedestrian image generation method based on multiscale circulation generation type countermeasure network
CN108764244B (en) Potential target area detection method based on convolutional neural network and conditional random field
CN112347861A (en) Human body posture estimation method based on motion characteristic constraint
Lu et al. Rethinking prior-guided face super-resolution: A new paradigm with facial component prior
CN113011288A (en) Mask RCNN algorithm-based remote sensing building detection method
CN112001859A (en) Method and system for repairing face image
CN111882546A (en) Weak supervised learning-based three-branch convolutional network fabric defect detection method
Zheng et al. T-net: Deep stacked scale-iteration network for image dehazing
Liang et al. Video super-resolution reconstruction based on deep learning and spatio-temporal feature self-similarity
CN113989928A (en) Motion capturing and redirecting method
CN112417991B (en) Double-attention face alignment method based on hourglass capsule network
Peng et al. RGB-D human matting: A real-world benchmark dataset and a baseline method
CN117576753A (en) Micro-expression recognition method based on attention feature fusion of facial key points
CN111311698A (en) Image compression method and system for multi-scale target

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination