CN112686097A - Human body image key point posture estimation method - Google Patents


Info

Publication number
CN112686097A
CN112686097A (application CN202011433083.5A)
Authority
CN
China
Prior art keywords
human body
image
network
loss
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011433083.5A
Other languages
Chinese (zh)
Inventor
侯峦轩 (Hou Luanxuan)
马鑫 (Ma Xin)
赫然 (He Ran)
孙哲南 (Sun Zhenan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Original Assignee
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd filed Critical Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority to CN202011433083.5A priority Critical patent/CN112686097A/en
Publication of CN112686097A publication Critical patent/CN112686097A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a human body image keypoint pose estimation method, comprising the following steps: preprocess an input training image and detect it with a pedestrian detection network built on a large-receptive-field feature pyramid with dilated (hole) convolution; crop the image to the bounding box of the detected human body, keeping only the region inside the box; and feed the cropped image into the designed model to estimate the human pose keypoints. The invention generates keypoints for an input image containing a human body; after the estimation processing, the generated keypoints have higher precision and better preserve the geometric information of the human skeleton.

Description

Human body image key point posture estimation method
Technical Field
The invention relates to the technical field of image processing, and in particular to a method for estimating the poses of key points in a human body image.
Background
Human body image keypoint pose estimation refers to modeling and estimating the keypoints of the human skeleton from an image containing a human body. The human keypoints are generally defined as: right ankle, right knee, right hip, left hip, left knee, left ankle, upper neck, head top, right wrist, right elbow, right shoulder, left shoulder, left elbow and left wrist. A trained pose estimation model finally performs pose estimation on the input image and outputs an image containing the human skeleton keypoints.
Because the human body is highly flexible, it can assume a great variety of poses and shapes, and a slight change of any body part produces a new pose. The visibility of keypoints is strongly affected by clothing, pose, viewing angle and so on, as well as by environmental factors such as occlusion, illumination and fog. In addition, 2D and 3D human keypoints differ markedly in appearance, and different body parts exhibit visual foreshortening, so detecting human skeleton keypoints is a very challenging problem in the field of computer vision.
Existing human skeleton keypoint detection algorithms for this problem are basically built on geometric priors following a template-matching idea. The core question is how to represent the whole human body structure with templates, including the representation of keypoints, of limb structures, and of the relationships between different limb structures. A good template-matching scheme can cover a wider range of poses and thus better match and detect the corresponding human poses.
Deep learning based methods such as G-RMI, PAF, RMPE and Mask R-CNN have also been proposed. The invention designs a dedicated pedestrian detection network structure for this specific detection task. A human body image is fed into the network and passed through a series of nonlinear operations (which fit a complex mapping function) to obtain a generated human skeleton keypoint pose image. The generated image and the real annotated human skeleton keypoint image are taken as inputs of a loss function; the loss value is computed and its gradient solved to minimize it, the gradient is back-propagated with the back-propagation function and the network weights are updated, and the iterations repeat until the loss function no longer changes.
With the further development of the technology, high-quality and highly accurate human skeleton keypoint maps are of great significance for user experience and market competition. The generation quality of existing human body image keypoint pose estimation cannot meet this requirement, and its uncertainty is large. It is therefore necessary to further improve the human body image keypoint pose estimation method.
Disclosure of Invention
Aiming at the technical defects in the prior art, the invention first provides a dedicated detection network, DetectionNet, and further provides a human body image keypoint pose estimation method using a deep cascaded pyramid neural network fused with dilated (hole) convolution, so as to improve the quality of human body image keypoint pose estimation and correction and reduce its uncertainty.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows:
a method for estimating the pose of key points of a human body image comprises the following steps:
s1, preprocessing image data in an image database:
firstly, the original image, of input size 224 x 224, is sent into the trained detection network DetectionNet, which outputs only human body images marked with bounding boxes; then the output human body image is cropped to a preset size;
s2, obtaining a depth network model which can carry out posture estimation on the human body image to obtain a human body firmware key point image through training:
using the cut human body image in the step S1 as the input of the network, using json files in a training set as the human body key point marking information image as GroudTruth, training a global network and a correction network in a deep neural network model, and obtaining a trained deep neural network model for finishing the posture estimation from the human body image to the human body key point image;
the input human body image is processed by the global network to obtain feature maps of different sizes, and a bottom-up U-Shape structure is adopted to compute an L2 loss between these feature maps and the real annotated skeleton keypoint image; the feature map outputs of different scales from the global network are then upsampled through Bottleneck and attention mechanism modules, the feature maps of different scales are concatenated (concat), and an L2 loss is computed again; after multiple iterations, once the model is stable, its training is finished;
and S3, carrying out attitude estimation processing on the images in the test data set by using the trained deep neural network model.
The invention uses two networks, a global network and a correction network, to locate and correct the keypoints respectively, and adopts an L2 loss function to improve the precision of the generated keypoints and reduce uncertainty; the correction network structure based on Bottleneck and the attention mechanism improves the correction performance across different scales.
The global network of the invention uses the residual network ResNet101 structure as the backbone network, which increases model capacity and accelerates training.
The invention provides a dedicated detection network, solves the problem that generic correction networks neglect the channel weight distribution among the feature maps of different scales, and improves detection and correction by adopting an attention mechanism module. With the proposed human body image keypoint pose estimation model, a deep neural network based on the attention mechanism module, built on a residual network and combined with the cascaded pyramid structure, the model achieves better correction performance and stronger generalization ability.
Drawings
FIG. 1 shows the test results of the present invention on a human body image in a test data set, with the input human body image on the left, the output image corrected by the attention mechanism module in the middle, and the output image corrected without the attention mechanism module on the right.
Fig. 2 is a block diagram of the proprietary detection network DetectionNet of the present invention.
Fig. 3 shows block diagrams of the different types of Bottleneck in the design of Fig. 2.
FIG. 4 shows the operational connections among P4, P5 and P6.
Fig. 5 is a process diagram of the ResNet50 network.
Fig. 6 is a diagram showing a global network architecture.
Fig. 7 is a partial schematic diagram illustrating the summing operation in the detection network and the global network.
Fig. 8 is a diagram showing the overall network structure of the present invention.
FIG. 9 is a view showing the structure of Bottleneck.
Fig. 10 is an overall structural view of the present invention after the correction network is added.
FIG. 11 is a schematic diagram of the dilated (hole) convolution used in the present invention.
FIG. 12 is a schematic diagram of an attention mechanism module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention discloses a human body image key point posture estimation method, which comprises the following steps:
in step S1, specific data enhancement is first performed on the image training set data, and first we define all possible data enhancements that can be applied to the image, as shown in the following table (the parameters all correspond to the parameters of the TensorFlow corresponding function):
[Table: operation name, description, and parameter magnitude range for each candidate enhancement operation; rendered as an image in the original.]
We employ the following specific operations:
[Table: the specific enhancement operations adopted; rendered as an image in the original.]
we define the enhancement policy as an unordered set of K sub-policies (policies one-two). During training, one of the K sub-strategies will be randomly selected and then applied to the current image. Each sub-strategy has 2 image enhancement operations, where P is the probability value (between the range 0-1) for each operation, M is the parameter magnitude, and each parameter magnitude is normalized to be within the interval 0-10.
Then, target detection is performed on the images in the training data set with the detection network DetectionNet; among all category boxes only human bounding boxes are kept, and a cropping operation produces human body images of the corresponding size 384 x 288. The human pose keypoint annotation json file in the data set is then used as the annotation information for the corresponding human body, and the COCO api is called to accelerate I/O reading.
The target detection network adopts the detection network DetectionNet, trained on all 80 classes of the COCO data set, and finally only the human class is selected for output (the output image has the human body marked with a bounding box). The specific structure is shown in Fig. 2; the design of DetectionNet and the modules in the figure are explained as follows:
adopting Resnet50 as a backbone network to extract features, and randomly initializing a ResNet50 network by using standard Gaussian distribution;
from the features extracted by ResNet50, four scales of feature maps are retained, named P2, P3, P4 and P5, and a further stage is added by appending a convolution kernel of size 1x1, whose feature map is P6;
and at staAfter ge4 we keep the spatial resolution of the feature map unchanged, i.e. it is
Figure BDA0002827294920000061
Figure BDA0002827294920000062
The conversion is done by 3x3 convolutions or pooling layers with step size 2, where S ispxRepresenting the spatial resolution, i is the original picture size, where the original picture size is 224 x 224, x e [ i,2,3,4,5,6]At P4,P5,P6And connecting convolution kernels with the convolution kernel size of 1x1 to keep the channel number consistent (256 channels).
The transformations among P4, P5 and P6 are realized by two types of Bottleneck (types A and B): each consists of a 1x1 convolution and a 3x3 convolution with dilation coefficient 2, together with a ReLU layer (see Figs. 3 and 4).
Finally, the feature maps of stages 4-6 are summed following the pyramid framework, with lateral connections as shown in FIG. 8, to form an FPN feature pyramid; target detection is performed with the Fast RCNN method, constrained by a regression loss and a classification loss. The multi-loss fusion (fusing classification and regression losses) is the predict operation in FIG. 3: the classification loss adopts the log loss (the negative log of the probability of the true class, with a (K+1)-dimensional classification output), and the regression loss is the same as in R-CNN (smooth L1 loss). The overall loss function is:
L(p, u, t^u, v) = L_cls(p, u) + λ[u >= 1] L_loc(t^u, v)
Two branches are attached to the last fully connected layer of the detection network. One is a softmax used to classify each ROI region: if there are K classes to distinguish (K+1 in total including the background), the output is p = (p_0, ..., p_K). The other is a bounding-box regressor for more precise ROI regions, whose output is

t^k = (t^k_x, t^k_y, t^k_w, t^k_h),

the bounding-box coordinates for class k, with (x, y) the coordinates of the top-left corner of the box and (x + w, y + h) those of the bottom-right corner. u is the GroundTruth class of each ROI region and v is the GroundTruth bounding-box regression target. λ is a hyperparameter controlling the balance between the two task losses; here λ = 1. [u >= 1] equals 1 when u >= 1.
The classification loss is specifically:

L_cls(p, u) = -log p_u,

a loss function in log form.
The regression loss is specifically:

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i - v_i),

where v = (v_x, v_y, v_w, v_h) is the position of the real box of class u, t^u = (t^u_x, t^u_y, t^u_w, t^u_h) is the predicted box position for class u, and

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
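The multi-task detection loss above can be sketched numerically as follows. This is a minimal sketch of the standard Fast R-CNN loss the text describes, for a single ROI; `detection_loss` and its argument layout are names chosen here for illustration.

```python
import numpy as np

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def detection_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * sum_i smooth_L1(t^u_i - v_i).

    p   : (K+1,) class posteriors from the softmax branch
    u   : GroundTruth class index (0 is background)
    t_u : predicted box (t_x, t_y, t_w, t_h) for class u
    v   : GroundTruth regression target (v_x, v_y, v_w, v_h)
    """
    l_cls = -np.log(p[u])
    # The regression term is active only for non-background ROIs ([u >= 1]).
    l_loc = smooth_l1(np.asarray(t_u, float) - np.asarray(v, float)).sum() if u >= 1 else 0.0
    return float(l_cls + lam * l_loc)
```

With λ = 1, as in the text, the classification and regression terms contribute equally.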
In addition, the cropping operation expands the bounding box to a fixed aspect ratio before cropping, and then applies data enhancement to the bounding-box region of the image, such as random flipping, random rotation and random scaling.
Furthermore, in all training steps the data set is the MSCOCO training data set (57K images containing 150K human instances). After detection by the detector network (FPN + RoIAlign) in step S1, among all detected bounding boxes only the human ones are taken (i.e. in all experiments only the bounding boxes of the human category among the top 100 boxes of all classes are used) and expanded to a fixed aspect ratio, height:width = 384:288; the corresponding cropped image is resized to the default height of 384 pixels and width of 288 pixels. The data enhancement strategy for the cropped image then uses random rotation (angles from -45 to +45 degrees) and random scaling (0.7 to 1.35), and the annotation information of the corresponding picture (the json file contains the human bounding box and keypoint positions) is taken as the GroundTruth.
Wherein the overall DetectionNet flow block diagram is shown in fig. 3.
Step S2: with the training input data above, train the human body image keypoint pose estimation model, a neural network fused with dilated (hole) convolution, to complete the keypoint pose estimation task for human body images.
In step S2, the cropped images containing human bodies and the corresponding human skeleton keypoint annotations from step S1 are used as the network input; the annotated human skeleton keypoints (json files, with the 17 keypoints each labeled by xy coordinates) serve as the GroundTruth. The human keypoint estimation network in the deep model is trained to complete the task from an input human body image to an output human skeleton keypoint image. Specifically, after the human body image detected by the detection network is cropped, feature maps are extracted with ResNet101 as the backbone; the conv features extracted at the different stages are denoted C2, C3, C4, C5. The feature maps of each layer are then added from bottom to top with a U-shaped structure, and the multi-scale heatmaps generated by each addition are constrained with an L2 loss function to obtain the human keypoints.
In the global network, the convolutional neural network structure ResNet101 first extracts features, and the U-Shape structure upsamples and sums the feature maps, keeping the size of each generated feature map the same as the dimension of the feature map formed by the matching residual layer.
In this example, the global network contains 4 residual blocks. Each residual block is a convolutional neural network comprising a normalization layer, an activation layer and convolutional layers; the convolutional filters have size 3 x 3, stride 1 and padding 1, and the input and output of the residual layer are joined by a skip connection. The number of convolutional layers and the number and size of filters in each can be chosen according to the actual situation; convolutions with filter size 3 x 3, stride 1 and padding 1 are used to generate the corresponding heatmap from the feature map.
Similarly, the number of residual blocks can be chosen according to the actual situation. The global network takes as input the real human body image x and the GroundTruth real human skeleton pose keypoint image y; the network structure is ResNet-101 pre-trained on the ImageNet data set.
In this step, the cropped human body image (384 x 288) is the model input, fed to the ResNet101 backbone. A 7 x 7 convolution with 64 channels, padding 3 and stride 2 outputs a 192 x 144 x 64 feature map; max pooling with kernel size 3 x 3 and stride 2 then outputs a 96 x 72 x 64 feature map.
The generated 96 x 72 x 64 feature map passes sequentially through the 4 residual blocks C2, C3, C4, C5, whose outputs are 96 x 72 x 256, 48 x 36 x 512, 24 x 18 x 1024 and 12 x 9 x 2048 respectively, as shown in Fig. 5.
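The stem arithmetic above can be checked with the standard convolution output-size formula; note that reaching the stated 96 x 72 requires the max pool to use padding 1 (the standard ResNet setting, assumed here).

```python
def conv_out(size, kernel, stride, padding):
    """Output spatial size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# 384 x 288 input -> 7x7 conv, stride 2, padding 3 -> 192 x 144
stem_h = conv_out(384, 7, 2, 3)
stem_w = conv_out(288, 7, 2, 3)
# -> 3x3 max pool, stride 2, padding 1 (assumed) -> 96 x 72
pool_h = conv_out(stem_h, 3, 2, 1)
pool_w = conv_out(stem_w, 3, 2, 1)
```

Each subsequent residual stage then halves the spatial size again (96 x 72 for C2 down to 12 x 9 for C5) while growing the channel count from 256 to 2048.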
Each residual-block output C_i (i = 5, 4, 3) undergoes a 1x1 convolution, is upsampled, and is added to the previous (shallower) layer C_{i-1}; after each addition a predict operation is performed, constrained with an L2 loss. The flow is:

C_i -> 1x1 conv -> upsample -> add to C_{i-1} -> predict,

and the L2 loss is computed against the heatmaps of the really annotated human skeleton pose keypoint image.
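The top-down U-shape flow above can be sketched with plain arrays. This is a structural sketch only: channel counts are assumed to match already (the real model uses the 1x1 convolutions for that), and nearest-neighbor repetition stands in for the model's upsampling.

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def top_down(features):
    """U-shape pathway: start from the deepest map C5, repeatedly upsample 2x
    and add to the next shallower map, mirroring the
    C_i -> upsample -> add-to-C_{i-1} flow described above.
    `features` is [C2, C3, C4, C5], each half the spatial size of the last."""
    merged = [features[-1]]
    for f in reversed(features[:-1]):
        merged.append(f + upsample2x(merged[-1]))
    return merged  # ordered deepest -> shallowest
```

In the real network a predict head (1x1 then 3x3 conv) would be applied to every element of `merged` to emit the 17 keypoint heatmaps at each scale.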
In the present invention, the predict operation convolves each layer's summed feature map with a 1 x 1 conv and then a 3 x 3 conv to generate 17 feature maps (heatmaps of the 17 keypoints; the MSCOCO data set defines 17 human keypoints), which participate in training as the predictions.
The loss function of the generating (global) network is:

L_global = Σ_j || ŷ_j - y_j ||_2^2,

where x is the input image, y is the heatmap corresponding to the GroundTruth, and the output of the global network is ŷ = F_generator(x), F_generator producing 17 feature maps (keypoint heatmaps) at each residual block of the global network.
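The global-network loss above reduces to a sum of squared differences between predicted and GroundTruth heatmap stacks; a minimal numeric sketch (function name chosen here):

```python
import numpy as np

def global_l2_loss(pred, gt):
    """Sum of squared differences between predicted heatmaps y_hat = F(x) and
    GroundTruth heatmaps y, both shaped (17, H, W) -- one map per keypoint,
    matching the L_global formula above."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(((pred - gt) ** 2).sum())
```

Minimizing this loss drives every predicted heatmap toward the Gaussian-peaked GroundTruth map of its keypoint.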
The output of the global network is then taken as the input of the correction network: the 4 scales of feature maps 96 x 72 x 256, 48 x 36 x 512, 24 x 18 x 1024 and 12 x 9 x 2048 produced by the four residual stages C2, C3, C4, C5 of the global network are processed by varying numbers of the Bottleneck of Fig. 9, as follows:

C5 + 3 * Bottleneck + upsample * 8
C4 + 2 * Bottleneck + upsample * 4
C3 + 1 * Bottleneck + upsample * 2
the above processing of the rectification network is specifically shown in fig. 10, and the characteristic diagram obtained by summing each layer in the global network is passed through Bottleneck, and then passed through the attention mechanism module designed by us, as shown in fig. 12, where:
1. The generated feature map F is sent to global average pooling; for the feature map of the k-th channel this operation can be expressed as:

T_k = (1 / (H * W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_k(i, j),

where F_k denotes the feature map of channel k, C is the number of channels, H is the height of the feature map, W is its width, and T is the output.
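The pooling formula above in code form, for a (C, H, W) map:

```python
import numpy as np

def global_avg_pool(F):
    """T_k = (1 / (H * W)) * sum over i, j of F_k(i, j), for F shaped (C, H, W);
    returns the (C,) vector T of per-channel spatial means."""
    C, H, W = F.shape
    return F.reshape(C, H * W).mean(axis=1)
```

Each channel is collapsed to a single scalar, so T summarizes the global response strength of every channel.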
2. And (5) performing convolution on the output T by 1x1 to enable the number of the characteristic diagram channels of each channel to be consistent.
3. A sigmoid operation is then applied, and the result is combined with the original feature map, which can be expressed as:

F' = F ⊗ σ(f_{1x1}(T)),

where ⊗ is the outer product, expressible in linear algebra (for two vectors u, v) as (u v^T)_{ij} = u_i v_j; σ is the sigmoid function, σ(z) = 1 / (1 + e^{-z}) for input z; and f_{1x1} is a 1x1 convolution operation.
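Putting steps 1-3 together, the attention module can be sketched as below. The 1x1 convolution is modeled as a simple per-channel weight `w`, an assumed simplification; `channel_attention` is a name chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, w):
    """Attention sketch: global average pool -> "1x1 conv" (per-channel weight
    w here) -> sigmoid -> rescale each channel of the original map,
    i.e. F' = F (x) sigma(f_1x1(T)) from the formula above."""
    T = F.reshape(F.shape[0], -1).mean(axis=1)   # step 1: global average pooling
    a = sigmoid(w * T)                           # steps 2-3: 1x1 conv + sigmoid
    return F * a[:, None, None]                  # channel-wise rescaling of F
```

Channels whose pooled response maps to a large pre-sigmoid value keep nearly full weight, while weak channels are suppressed, which is exactly the per-scale channel re-weighting the correction network needs.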
Finally, the maps are separately upsampled, concatenated (concat), and passed through a Bottleneck, constrained with an L2* loss, where L2* is the correction-network loss over its N output keypoints: the per-keypoint L2 losses are computed for the N (17) keypoints, the largest M (set to 9) are selected, and only these M keypoint losses are kept and included in the correction network's loss function (an L2 loss); heatmaps are then generated by a 3x3 conv.
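The hard-keypoint selection in L2* can be sketched as follows (function name chosen here; averaging the kept M losses is an assumed reduction, since the text only says they are "retained and included"):

```python
import numpy as np

def hard_keypoint_loss(pred, gt, m=9):
    """L2* sketch: compute the per-keypoint L2 loss over the N = 17 heatmaps,
    keep only the m largest ("hardest") losses, and average them."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    per_kp = ((pred - gt) ** 2).reshape(pred.shape[0], -1).sum(axis=1)
    hardest = np.sort(per_kp)[-m:]   # the m largest per-keypoint losses
    return float(hardest.mean())
```

Gradients then flow only through the hardest keypoints, focusing the correction network on occluded or otherwise hard-to-detect joints rather than those the global network already localizes well.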
In the present invention, each scale's feature map generated in step S2 is sent to the correction network; the completed heatmaps are summed per scale, and the L2 loss function is finally used for the computation, yielding more accurate human keypoints.
And then, using the trained deep neural network model to estimate the key points of the human body in the images containing the human body in the test data set.
The dilated (hole) convolution is described below. Referring to Fig. 11, the left graph is an ordinary 3 x 3 conv, the middle graph a dilated conv with dilation coefficient 2, and the right graph a dilated conv with dilation coefficient 4. The actual kernel size is still 3 x 3, but with one hole inserted between taps: for a 7 x 7 image patch, only 9 points are convolved with the 3 x 3 kernel and the remaining points are skipped.
Equivalently, the kernel can be seen as 7 x 7 with only the 9 marked weights nonzero and the rest zero. Although the kernel size is only 3 x 3, the receptive field of this convolution grows to 7 x 7 (if the layer before this 2-dilated conv is a 1-dilated conv, each of its points is already the output of a 3 x 3 convolution, so the 1-dilated and 2-dilated layers together achieve a 7 x 7 receptive field). The right graph is a 4-dilated conv which, following the 1-dilated and 2-dilated convs, achieves a 15 x 15 receptive field. By comparison, stacking 3 ordinary 3 x 3 convolutions with stride 1 only reaches a receptive field of (kernel - 1) x layers + 1 = 7; i.e. the ordinary receptive field grows linearly with the number of layers, while that of dilated conv grows exponentially.
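The receptive-field arithmetic above can be verified directly; for stride-1 layers each (kernel, dilation) layer adds dilation x (kernel - 1) to the field:

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions, each given as
    (kernel, dilation); a dilated kernel spans dilation * (kernel - 1) + 1."""
    rf = 1
    for kernel, dilation in layers:
        rf += dilation * (kernel - 1)
    return rf
```

Three plain 3x3 layers give (kernel - 1) x layers + 1 = 7, while the 1-, 2-, 4-dilated stack reaches 15 with the same number of parameters, matching the exponential growth described above.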
Exploiting the strong nonlinear fitting capability of convolutional neural networks, the invention constructs a neural network that takes a human body image as input for the human pose estimation task. In particular, through the additional attention mechanism module, the network selectively focuses on the weight distribution among feature maps of different scales. In this way, a correction network based on the attention mechanism module can be used to train a human skeleton keypoint pose estimation model with good perceptual effect. In the testing phase, the images in the test set serve as model input, producing the generated effect maps shown in Fig. 1.
It should be noted that the human body image keypoint pose estimation model of the attention-fused neural network provided by the invention comprises two sub-networks, a global network and a correction network, and the objective function of the whole model is the L2 loss. When the human body image pose estimation is completed, the final objective function of the whole model, the L2 loss, can be reduced to its minimum and kept stable.
To describe the specific implementation of the invention in detail and verify its effectiveness, the proposed method was trained on an open data set. The database contains photographs of natural scenes, such as flowers and trees. All images of the data set were selected as the training data set. First, target detection is run on all images in the training data set with the trained feature pyramid network (FPN), outputting only human-class bounding boxes and generating the corresponding cropped human body images; the annotated human keypoint coordinate json file of the data set serves as model input, and the global network and the correction network are trained by gradient back-propagation until the network converges, yielding the human skeleton keypoint pose estimation model.
To test the validity of the model, an input image is processed and the visualization is shown in Fig. 1. In the experiment, the result is compared with the GroundTruth real image, as shown in Fig. 1. This embodiment effectively demonstrates the effectiveness of the proposed method for human body image keypoint pose estimation.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and these modifications and improvements should also fall within the protection scope of the invention.

Claims (6)

1. A method for estimating the pose of key points in human body images is characterized in that,
the method comprises the following steps:
s1, preprocessing image data in an image database:
firstly, the original image is sent into the trained feature pyramid network DetectionNet based on dilated (hole) convolution for detection, which outputs only human body images marked with bounding boxes; then the output human body image is cropped to a preset size, and automatic data enhancement is applied;
s2, obtaining a deep neural network model capable of carrying out posture estimation on the human body image to obtain a human body firmware key point image through training:
using the cut human body image in the step S1 as the input of the network, using json files marked out in the form of xy-axis coordinates in a training set as the human body key point marking information image as the GroudTruth, training the global network and the correction network in the deep neural network model, and obtaining the trained deep neural network model for finishing the posture estimation from the human body image to the human body key point image;
the input human body image is processed through the ResNet101 network of the global network to obtain feature map outputs of different sizes; adopting a bottom-up U-Shape structure, each layer's feature map is upsampled and added from bottom to top, followed by a predict operation in which an L2 loss function, computed against the GroundTruth, applies the loss constraint; after predetermined convolution steps in the predict operation, heatmaps of the different keypoints are generated, yielding the initial human keypoints;
features to sum each layer in a global networkAfter passing through the Bottleneck, the graph passes through a designed attention mechanism module, then up-sampling is respectively carried out, then concat operation is carried out, and then the graph passes through the Bottleneck, and L based on key points difficult to detect is adopted2 *Loss constraint, then thermodynamic diagram generation is carried out through convolution, initial human body key points are corrected, and final human body key points are obtained; after the model is iterated for multiple times and is stabilized, the training of the model is completed;
and S3, carrying out posture estimation processing on the images containing the human body in the test data set by using the trained deep neural network model.
2. The human body image key point posture estimation method according to claim 1, wherein the feature pyramid network FPN processes the image with a specific data enhancement method, the last two stages of the FPN are modified specifically for the detection task, and the detected human body image is cut out and then input, specifically:
adopting Resnet50 as a backbone network to extract features, and randomly initializing a ResNet50 network by using standard Gaussian distribution;
according to the features extracted by ResNet50, the feature maps of the four stages are retained and named P2, P3, P4, P5, and one further stage is added by connecting a convolution with kernel size 1x1, giving the feature map P6;
and after stage 4 the spatial resolution of the feature map is kept unchanged, i.e.

r(P_x) = i / 2^x for x ≤ 4, and r(P_x) = i / 2^4 for x > 4,

wherein r(P_x) represents the spatial resolution, i is the original map size, and x ∈ [2, 3, 4, 5, 6]; at P4, P5, P6, convolutions with kernel size 1x1 are connected to keep the number of channels consistent (256 channels);
finally, adding the feature maps of stages 4-6 according to the pyramid framework to form the FPN feature pyramid; performing target detection with the Fast RCNN method, applying constraints through a regression loss and a classification loss with multi-loss fusion, wherein the classification loss adopts the log loss and the regression loss is the same as in R-CNN;
the overall loss function is:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v);
at the last fully connected layer of the detection network, two branches are attached: one is a softmax for classifying each ROI region, where with K classes to be classified plus the background there are K+1 classes in total, and the output result is p = (p_0, ..., p_K); the other is a bounding-box regressor for a more accurate ROI region, whose output t^k = (t_x^k, t_y^k, t_w^k, t_h^k) represents the coordinates of the bounding box of class k, wherein (x, y) is the coordinate of the upper left corner of the bounding box and (x + w, y + h) is the coordinate of the lower right corner; u is the GroundTruth class of each ROI region and v is the bounding-box regression target of the GroundTruth; λ is a hyper-parameter controlling the balance between the two task losses, here λ = 1; [u ≥ 1] equals 1 when u ≥ 1;
the classification loss is specifically:

L_cls(p, u) = -log p_u,

which is a log loss function;
the regression loss is specifically:

L_loc(t^u, v) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i^u - v_i),

wherein v = (v_x, v_y, v_w, v_h) is the position of the real box of class u, t^u = (t_x^u, t_y^u, t_w^u, t_h^u) is the predicted box position of class u, and

smooth_L1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise.
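The multi-loss fusion of claim 2 (log classification loss plus a smooth-L1 regression loss, gated by [u ≥ 1] and weighted by λ) can be sketched numerically as follows; the toy softmax vector and zero regression targets are illustrative:

```python
import numpy as np

def smooth_l1(t, v):
    """Smooth-L1 regression loss used by Fast R-CNN (same as R-CNN here)."""
    d = np.abs(t - v)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def fast_rcnn_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * L_loc(t^u, v).
    Background ROIs (u == 0) contribute no regression loss."""
    cls = -np.log(p[u])
    loc = lam * smooth_l1(t_u, v) if u >= 1 else 0.0
    return cls + loc

p = np.array([0.1, 0.7, 0.2])            # softmax over background + K classes
loss_fg = fast_rcnn_loss(p, 1, np.zeros(4), np.zeros(4))   # foreground ROI
loss_bg = fast_rcnn_loss(p, 0, np.zeros(4), np.zeros(4))   # background ROI
```

Note how the indicator [u ≥ 1] switches the regression term off for background proposals, which is why λ only trades off the two losses on positive ROIs.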
3. The human body image key point posture estimation method according to claim 1, wherein the data enhancement process includes random flipping, random rotation and random scaling with specific parameters.
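A minimal sketch of the random flipping / rotation / scaling pipeline; the rotation is restricted to 90-degree steps and the scale factors are integers purely to stay dependency-free, so these exact parameters are assumptions rather than the claim's "specific parameters":

```python
import random
import numpy as np

def augment(img, rng=random):
    """Apply random flip, random rotation and random scaling to a 2D image."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                    # random horizontal flip
    img = np.rot90(img, k=rng.randint(0, 3))  # random rotation (90-deg steps)
    scale = rng.choice((1, 2))                # random (integer) scaling factor
    return img.repeat(scale, axis=0).repeat(scale, axis=1)

out = augment(np.zeros((4, 6)))
```

In practice the keypoint annotations must be transformed with the same parameters so image and GroundTruth stay aligned.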
4. The human image keypoint pose estimation method of claim 1,
the step of processing the input human body image through the ResNet50 network of the global network to obtain feature map outputs with different sizes comprises the following steps:
randomly initializing a ResNet50 network using a standard Gaussian distribution;
inputting the human body image into the ResNet50 network comprising four residual blocks, and taking the conv features output by the different blocks, denoted C2, C3, C4, C5;
wherein C2 has 64 channels, C3 has 128 channels, C4 has 256 channels and C5 has 512 channels; at each residual block C2, C3, C4, C5, a convolution with kernel 1x1 is added, followed by a BN layer and a ReLU, so that the number of feature channels becomes 256; the residual blocks of the different layers thus obtained yield feature map outputs of different sizes.
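The per-block 1x1 convolution + BN + ReLU that projects C2-C5 to a common channel count can be sketched as a per-pixel channel matmul; the random weights, the toy spatial size and the affine-free batch norm are illustrative assumptions:

```python
import numpy as np

def lateral_1x1(fmap, weight):
    """1x1 convolution over a (C_in, H, W) map: a per-pixel matmul
    mapping C_in channels to weight.shape[0] output channels."""
    c, h, w = fmap.shape
    return (weight @ fmap.reshape(c, -1)).reshape(-1, h, w)

def bn_relu(x, eps=1e-5):
    """Per-channel normalization (no learned affine here) followed by ReLU."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return np.maximum((x - mean) / np.sqrt(var + eps), 0.0)

# C3 has 128 channels; the lateral branch projects it to 256 channels
rng = np.random.default_rng(0)
c3 = rng.standard_normal((128, 8, 8))
out = bn_relu(lateral_1x1(c3, rng.standard_normal((256, 128))))
```

With all four blocks projected to 256 channels, the pyramid levels can later be added or concatenated without shape mismatches.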
5. The human image keypoint pose estimation method of claim 1,
the objective function of the deep neural network model adopts the loss function L_2:

L_2 = E_{x,y~P(X,Y)} ||F_generate(x) - y||^2,

wherein x is the input real human body image, y is the thermodynamic diagram corresponding to the GroundTruth, and F_generate outputs the thermodynamic diagrams of the plurality of key points corresponding to each residual block of the global network; E represents the mathematical expectation of the L2 norm under the P(X,Y) distribution, and P(X,Y) is a probability density function; L_2* is the loss of the N hardest key points of the correction network's output (each key point's loss computed by L_2), and only those key point losses are retained in the correction network's loss function.
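A minimal numerical sketch of the per-keypoint L_2 loss and of the L_2* rule that keeps only the hardest key points; the number of retained key points (`keep`) is a hyper-parameter the claim leaves open:

```python
import numpy as np

def l2_loss(pred, gt):
    """Per-keypoint squared-L2 between predicted and GroundTruth heatmaps."""
    return ((pred - gt) ** 2).reshape(pred.shape[0], -1).sum(axis=1)

def hard_keypoint_loss(pred, gt, keep):
    """L2*: retain only the `keep` largest per-keypoint losses, i.e. the
    key points that are hardest to detect, and average them."""
    per_kp = l2_loss(pred, gt)
    hardest = np.sort(per_kp)[::-1][:keep]
    return hardest.mean()

gt = np.zeros((4, 8, 8))
pred = gt.copy()
pred[0] += 1.0                            # keypoint 0 is "hard"
loss = hard_keypoint_loss(pred, gt, keep=2)
```

Discarding the easy key points focuses the correction network's gradient on occluded or ambiguous joints.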
6. The human image keypoint pose estimation method of claim 1,
wherein the designed attention mechanism module operates as follows:
1. the generated feature map F ∈ R^{C×H×W} is sent to global average pooling; for the feature map of the k-th channel this operation can be represented as

T_k = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_k(i, j),

wherein F_k denotes the feature map of channel k, C is the number of channels, H is the height of the feature map, W is the width of the feature map, and T is the output;
2. a convolution with kernel size 1x1 is performed on the output T so that the number of feature map channels remains consistent;
3. a sigmoid operation is then performed, and the result is combined with the original feature map, which can be expressed as

F' = F ⊗ σ(f_{1x1}(T)),

wherein ⊗ is the outer product, which for two matrices u, v can be written in linear algebra as (u ⊗ v)_{ij} = u_i v_j; σ is the sigmoid function σ(z) = 1/(1 + e^{-z}), z being the function input; and f_{1x1} is the 1x1 convolution operation.
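Claim 6's module (global average pooling, 1x1 convolution, sigmoid, recombination with the original feature map) can be sketched as channel-wise gating; interpreting ⊗ as a per-channel rescaling and using a fixed stand-in weight matrix are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(fmap, weight):
    """Global average pool per channel, a 1x1 convolution (here a C x C
    matrix on the pooled vector), sigmoid, then rescale the original
    (C, H, W) feature map channel-wise by the resulting gate."""
    t = fmap.mean(axis=(1, 2))            # T_k = (1/(H*W)) sum of channel k
    gate = sigmoid(weight @ t)            # 1x1 conv + sigmoid
    return fmap * gate[:, None, None]     # channel-wise reweighting of F

f = np.ones((8, 4, 4))
out = channel_attention(f, np.eye(8) * 10.0)   # strong gate ~ sigmoid(10)
```

This is the same squeeze-and-excite pattern: channels the gate scores near 1 pass through almost unchanged, weak channels are suppressed.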
CN202011433083.5A 2020-12-10 2020-12-10 Human body image key point posture estimation method Pending CN112686097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011433083.5A CN112686097A (en) 2020-12-10 2020-12-10 Human body image key point posture estimation method

Publications (1)

Publication Number Publication Date
CN112686097A true CN112686097A (en) 2021-04-20

Family

ID=75446585

Country Status (1)

Country Link
CN (1) CN112686097A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210420