CN112686097A - Human body image key point posture estimation method - Google Patents


Info

Publication number
CN112686097A
CN112686097A (application CN202011433083.5A)
Authority
CN
China
Prior art keywords
human body
image
network
loss
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011433083.5A
Other languages
Chinese (zh)
Inventor
侯峦轩 (Hou Luanxuan)
马鑫 (Ma Xin)
赫然 (He Ran)
孙哲南 (Sun Zhenan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Original Assignee
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd filed Critical Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority to CN202011433083.5A priority Critical patent/CN112686097A/en
Publication of CN112686097A publication Critical patent/CN112686097A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a human body image keypoint pose estimation method, comprising the following steps: preprocess an input training image and detect it with a pedestrian detection network built on a large-receptive-field feature pyramid with dilated (hole) convolution; crop the image to the bounding box of the detected human body, keeping only the region inside the box; and feed the cropped image into the designed model to estimate the human pose keypoints. The invention generates keypoints for an input image containing a human body; after the estimation processing, the generated keypoints have higher precision and better preserve the geometric information of the human skeleton.

Description

Human body image key point posture estimation method
Technical Field
The invention relates to the technical field of image processing, and in particular to a method for estimating the poses of key points in a human body image.
Background
Human body image keypoint pose estimation refers to modeling and estimating the keypoints of the human skeleton from an image containing a human body. The human keypoints are generally defined as: right ankle, right knee, right hip, left hip, left knee, left ankle, upper neck, head top, right wrist, right elbow, right shoulder, left shoulder, left elbow and left wrist. A trained pose estimation model finally performs pose estimation on the input image and outputs an image containing the human skeleton keypoints.
Because the human body is highly flexible, it can assume a great variety of poses and shapes, and a slight change of any body part produces a new pose. The visibility of keypoints is strongly affected by clothing, pose, viewing angle and so on, as well as by environmental factors such as occlusion, illumination and fog. In addition, 2D and 3D human keypoints differ markedly in appearance, and different body parts exhibit visual foreshortening, so detecting human skeleton keypoints is a very challenging problem in the field of computer vision.
Existing human skeleton keypoint detection algorithms for this problem are basically built on geometric priors following a template-matching idea. The core question is how to represent the whole human body structure with templates, including the representation of keypoints, of limb structures, and of the relationships between different limb structures. A good template-matching scheme can cover a wider range of poses and thus better match and detect the corresponding human poses.
Deep learning based methods such as G-RMI, PAF, RMPE and Mask R-CNN have also been proposed. The invention designs a dedicated pedestrian detection network structure for this specific detection task. A human body image is fed into the network and passed through a series of nonlinear operations (which fit a complex mapping function) to obtain a generated human skeleton keypoint pose image. The generated image and the real annotated human skeleton keypoint image are taken as inputs of a loss function; the loss value is computed and its gradient solved to minimize it, the gradient is back-propagated with the back-propagation function and the network weights are updated, and the iterations repeat until the loss function no longer changes.
With the further development of the technology, high-quality and highly accurate human skeleton keypoint maps are of great significance for user experience and market competition. The generation quality of existing human body image keypoint pose estimation cannot meet this requirement, and its uncertainty is large. It is therefore necessary to further improve the human body image keypoint pose estimation method.
Disclosure of Invention
Aiming at the technical defects in the prior art, the invention first provides a dedicated detection network, DetectionNet, and further provides a human body image keypoint pose estimation method using a deep cascaded pyramid neural network fused with dilated (hole) convolution, so as to improve the quality of human body image keypoint pose estimation and correction and reduce its uncertainty.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows:
a method for estimating the pose of key points of a human body image comprises the following steps:
s1, preprocessing image data in an image database:
firstly, the original image, of input size 224 x 224, is sent into the trained detection network DetectionNet, which outputs only human body images marked with bounding boxes; then the output human body image is cropped to a preset size;
s2, obtaining a depth network model which can carry out posture estimation on the human body image to obtain a human body firmware key point image through training:
using the cut human body image in the step S1 as the input of the network, using json files in a training set as the human body key point marking information image as GroudTruth, training a global network and a correction network in a deep neural network model, and obtaining a trained deep neural network model for finishing the posture estimation from the human body image to the human body key point image;
the input human body image is processed by the global network to obtain feature maps of different sizes, and a bottom-up U-Shape structure is adopted to compute an L2 loss between these feature maps and the real annotated skeleton keypoint image; the feature map outputs of different scales from the global network are then upsampled through Bottleneck and attention mechanism modules, the feature maps of different scales are concatenated (concat), and an L2 loss is computed again; after multiple iterations, once the model is stable, its training is finished;
and S3, carrying out attitude estimation processing on the images in the test data set by using the trained deep neural network model.
The invention uses two networks, a global network and a correction network, to locate and correct the keypoints respectively, and adopts an L2 loss function to improve the precision of the generated keypoints and reduce uncertainty; the correction network structure based on Bottleneck and the attention mechanism improves the correction performance across different scales.
The global network of the invention uses the residual network ResNet101 structure as the backbone network, which increases model capacity and accelerates training.
The invention provides a dedicated detection network, solves the problem that generic correction networks neglect the channel weight distribution among the feature maps of different scales, and improves detection and correction by adopting an attention mechanism module. With the proposed human body image keypoint pose estimation model, a deep neural network based on the attention mechanism module, built on a residual network and combined with the cascaded pyramid structure, the model achieves better correction performance and stronger generalization ability.
Drawings
FIG. 1 shows the test results of the present invention on a human body image in a test data set, with the input human body image on the left, the output image corrected by the attention mechanism module in the middle, and the output image corrected without the attention mechanism module on the right.
Fig. 2 is a block diagram of the proprietary detection network DetectionNet of the present invention.
Fig. 3 shows block diagrams of the different types of Bottleneck in the design of Fig. 2.
FIG. 4 shows the operational connections among P4, P5 and P6.
Fig. 5 is a process diagram of the ResNet50 network.
Fig. 6 is a diagram showing a global network architecture.
Fig. 7 is a partial schematic diagram illustrating the summing operation in the detection network and the global network.
Fig. 8 is a diagram showing the overall network structure of the present invention.
FIG. 9 is a view showing the structure of Bottleneck.
Fig. 10 is an overall structural view of the present invention after the correction network is added.
FIG. 11 is a schematic diagram of the dilated (hole) convolution used in the present invention.
FIG. 12 is a schematic diagram of an attention mechanism module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention discloses a human body image key point posture estimation method, which comprises the following steps:
in step S1, specific data enhancement is first performed on the image training set data, and first we define all possible data enhancements that can be applied to the image, as shown in the following table (the parameters all correspond to the parameters of the TensorFlow corresponding function):
[Table: operation name, description, and parameter magnitude range for each candidate enhancement operation; rendered as an image in the original.]
We employ the following specific operations:
[Table: the specific enhancement operations adopted; rendered as an image in the original.]
we define the enhancement policy as an unordered set of K sub-policies (policies one-two). During training, one of the K sub-strategies will be randomly selected and then applied to the current image. Each sub-strategy has 2 image enhancement operations, where P is the probability value (between the range 0-1) for each operation, M is the parameter magnitude, and each parameter magnitude is normalized to be within the interval 0-10.
Then, target detection is performed on the images in the training data set with the detection network DetectionNet; among all category boxes only human bounding boxes are kept, and a cropping operation produces human body images of the corresponding size 384 x 288. The human pose keypoint annotation json file in the data set is then used as the annotation information for the corresponding human body, and the COCO api is called to accelerate I/O reading.
The target detection network adopts the detection network DetectionNet, trained on all 80 classes of the COCO data set, and finally only the human class is selected for output (the output image has the human body marked with a bounding box). The specific structure is shown in Fig. 2; the design of DetectionNet and the modules in the figure are explained as follows:
adopting Resnet50 as a backbone network to extract features, and randomly initializing a ResNet50 network by using standard Gaussian distribution;
from the features extracted by ResNet50, four scales of feature maps are retained, named P2, P3, P4 and P5, and a further stage is added by appending a convolution kernel of size 1x1, whose feature map is P6;
and at staAfter ge4 we keep the spatial resolution of the feature map unchanged, i.e. it is
Figure BDA0002827294920000061
Figure BDA0002827294920000062
The conversion is done by 3x3 convolutions or pooling layers with step size 2, where S ispxRepresenting the spatial resolution, i is the original picture size, where the original picture size is 224 x 224, x e [ i,2,3,4,5,6]At P4,P5,P6And connecting convolution kernels with the convolution kernel size of 1x1 to keep the channel number consistent (256 channels).
The transformations among P4, P5 and P6 are realized by two types of Bottleneck (types A and B): each consists of a 1x1 convolution and a 3x3 convolution with dilation coefficient 2, together with a ReLU layer (see Figs. 3 and 4).
Finally, the feature maps of stages 4-6 are summed following the pyramid framework, with lateral connections as shown in FIG. 8, to form an FPN feature pyramid; target detection is performed with the Fast RCNN method, constrained by a regression loss and a classification loss. The multi-loss fusion (fusing classification and regression losses) is the predict operation in FIG. 3: the classification loss adopts the log loss (the negative log of the probability of the true class, with a (K+1)-dimensional classification output), and the regression loss is the same as in R-CNN (smooth L1 loss). The overall loss function is:
L(p, u, t^u, v) = L_cls(p, u) + λ[u >= 1] L_loc(t^u, v)
Two branches are attached to the last fully connected layer of the detection network. One is a softmax used to classify each ROI region: if there are K classes to distinguish (K+1 in total including the background), the output is p = (p_0, ..., p_K). The other is a bounding-box regressor for more precise ROI regions, whose output is

t^k = (t^k_x, t^k_y, t^k_w, t^k_h),

the bounding-box coordinates for class k, with (x, y) the coordinates of the top-left corner of the box and (x + w, y + h) those of the bottom-right corner. u is the GroundTruth class of each ROI region and v is the GroundTruth bounding-box regression target. λ is a hyperparameter controlling the balance between the two task losses; here λ = 1. [u >= 1] equals 1 when u >= 1.
The classification loss is specifically:

L_cls(p, u) = -log p_u,

a loss function in log form.
The regression loss is specifically:

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i - v_i),

where v = (v_x, v_y, v_w, v_h) is the position of the real box of class u, t^u = (t^u_x, t^u_y, t^u_w, t^u_h) is the predicted box position for class u, and

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
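The multi-task detection loss above can be sketched numerically as follows. This is a minimal sketch of the standard Fast R-CNN loss the text describes, for a single ROI; `detection_loss` and its argument layout are names chosen here for illustration.

```python
import numpy as np

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def detection_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * sum_i smooth_L1(t^u_i - v_i).

    p   : (K+1,) class posteriors from the softmax branch
    u   : GroundTruth class index (0 is background)
    t_u : predicted box (t_x, t_y, t_w, t_h) for class u
    v   : GroundTruth regression target (v_x, v_y, v_w, v_h)
    """
    l_cls = -np.log(p[u])
    # The regression term is active only for non-background ROIs ([u >= 1]).
    l_loc = smooth_l1(np.asarray(t_u, float) - np.asarray(v, float)).sum() if u >= 1 else 0.0
    return float(l_cls + lam * l_loc)
```

With λ = 1, as in the text, the classification and regression terms contribute equally.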
In addition, the cropping operation expands the bounding box to a fixed aspect ratio before cropping, and then applies data enhancement to the bounding-box region of the image, such as random flipping, random rotation and random scaling.
Furthermore, in all training steps the data set is the MSCOCO training data set (57K images containing 150K human instances). After detection by the detector network (FPN + RoIAlign) in step S1, among all detected bounding boxes only the human ones are taken (i.e. in all experiments only the bounding boxes of the human category among the top 100 boxes of all classes are used) and expanded to a fixed aspect ratio, height:width = 384:288; the corresponding cropped image is resized to the default height of 384 pixels and width of 288 pixels. The data enhancement strategy for the cropped image then uses random rotation (angles from -45 to +45 degrees) and random scaling (0.7 to 1.35), and the annotation information of the corresponding picture (the json file contains the human bounding box and keypoint positions) is taken as the GroundTruth.
Wherein the overall DetectionNet flow block diagram is shown in fig. 3.
Step S2: with the training input data above, train the human body image keypoint pose estimation model, a neural network fused with dilated (hole) convolution, to complete the keypoint pose estimation task for human body images.
In step S2, the cropped images containing human bodies and the corresponding human skeleton keypoint annotations from step S1 are used as the network input; the annotated human skeleton keypoints (json files, with the 17 keypoints each labeled by xy coordinates) serve as the GroundTruth. The human keypoint estimation network in the deep model is trained to complete the task from an input human body image to an output human skeleton keypoint image. Specifically, after the human body image detected by the detection network is cropped, feature maps are extracted with ResNet101 as the backbone; the conv features extracted at the different stages are denoted C2, C3, C4, C5. The feature maps of each layer are then added from bottom to top with a U-shaped structure, and the multi-scale heatmaps generated by each addition are constrained with an L2 loss function to obtain the human keypoints.
In the global network, the convolutional neural network structure ResNet101 first extracts features, and the U-Shape structure upsamples and sums the feature maps, keeping the size of each generated feature map the same as the dimension of the feature map formed by the matching residual layer.
In this example, the global network contains 4 residual blocks. Each residual block is a convolutional neural network comprising a normalization layer, an activation layer and convolutional layers; the convolutional filters have size 3 x 3, stride 1 and padding 1, and the input and output of the residual layer are joined by a skip connection. The number of convolutional layers and the number and size of filters in each can be chosen according to the actual situation; convolutions with filter size 3 x 3, stride 1 and padding 1 are used to generate the corresponding heatmap from the feature map.
Similarly, the number of residual blocks can be chosen according to the actual situation. The global network takes as input the real human body image x and the GroundTruth real human skeleton pose keypoint image y; the network structure is ResNet-101 pre-trained on the ImageNet data set.
In this step, the cropped human body image (384 x 288) is the model input, fed to the ResNet101 backbone. A 7 x 7 convolution with 64 channels, padding 3 and stride 2 outputs a 192 x 144 x 64 feature map; max pooling with kernel size 3 x 3 and stride 2 then outputs a 96 x 72 x 64 feature map.
The generated 96 x 72 x 64 feature map passes sequentially through the 4 residual blocks C2, C3, C4, C5, whose outputs are 96 x 72 x 256, 48 x 36 x 512, 24 x 18 x 1024 and 12 x 9 x 2048 respectively, as shown in Fig. 5.
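The stem arithmetic above can be checked with the standard convolution output-size formula; note that reaching the stated 96 x 72 requires the max pool to use padding 1 (the standard ResNet setting, assumed here).

```python
def conv_out(size, kernel, stride, padding):
    """Output spatial size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# 384 x 288 input -> 7x7 conv, stride 2, padding 3 -> 192 x 144
stem_h = conv_out(384, 7, 2, 3)
stem_w = conv_out(288, 7, 2, 3)
# -> 3x3 max pool, stride 2, padding 1 (assumed) -> 96 x 72
pool_h = conv_out(stem_h, 3, 2, 1)
pool_w = conv_out(stem_w, 3, 2, 1)
```

Each subsequent residual stage then halves the spatial size again (96 x 72 for C2 down to 12 x 9 for C5) while growing the channel count from 256 to 2048.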
Each residual-block output C_i (i = 5, 4, 3) undergoes a 1x1 convolution, is upsampled, and is added to the previous (shallower) layer C_{i-1}; after each addition a predict operation is performed, constrained with an L2 loss. The flow is:

C_i -> 1x1 conv -> upsample -> add to C_{i-1} -> predict,

and the L2 loss is computed against the heatmaps of the really annotated human skeleton pose keypoint image.
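The top-down U-shape flow above can be sketched with plain arrays. This is a structural sketch only: channel counts are assumed to match already (the real model uses the 1x1 convolutions for that), and nearest-neighbor repetition stands in for the model's upsampling.

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def top_down(features):
    """U-shape pathway: start from the deepest map C5, repeatedly upsample 2x
    and add to the next shallower map, mirroring the
    C_i -> upsample -> add-to-C_{i-1} flow described above.
    `features` is [C2, C3, C4, C5], each half the spatial size of the last."""
    merged = [features[-1]]
    for f in reversed(features[:-1]):
        merged.append(f + upsample2x(merged[-1]))
    return merged  # ordered deepest -> shallowest
```

In the real network a predict head (1x1 then 3x3 conv) would be applied to every element of `merged` to emit the 17 keypoint heatmaps at each scale.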
In the present invention, the predict operation convolves each layer's summed feature map with a 1 x 1 conv and then a 3 x 3 conv to generate 17 feature maps (heatmaps of the 17 keypoints; the MSCOCO data set defines 17 human keypoints), which participate in training as the predictions.
The loss function of the generating (global) network is:

L_global = Σ_j || ŷ_j - y_j ||_2^2,

where x is the input image, y is the heatmap corresponding to the GroundTruth, and the output of the global network is ŷ = F_generator(x), F_generator producing 17 feature maps (keypoint heatmaps) at each residual block of the global network.
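The global-network loss above reduces to a sum of squared differences between predicted and GroundTruth heatmap stacks; a minimal numeric sketch (function name chosen here):

```python
import numpy as np

def global_l2_loss(pred, gt):
    """Sum of squared differences between predicted heatmaps y_hat = F(x) and
    GroundTruth heatmaps y, both shaped (17, H, W) -- one map per keypoint,
    matching the L_global formula above."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(((pred - gt) ** 2).sum())
```

Minimizing this loss drives every predicted heatmap toward the Gaussian-peaked GroundTruth map of its keypoint.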
The output of the global network is then taken as the input of the correction network: the 4 scales of feature maps 96 x 72 x 256, 48 x 36 x 512, 24 x 18 x 1024 and 12 x 9 x 2048 produced by the four residual stages C2, C3, C4, C5 of the global network are processed by varying numbers of the Bottleneck of Fig. 9, as follows:

C5 + 3 * Bottleneck + upsample * 8
C4 + 2 * Bottleneck + upsample * 4
C3 + 1 * Bottleneck + upsample * 2
the above processing of the rectification network is specifically shown in fig. 10, and the characteristic diagram obtained by summing each layer in the global network is passed through Bottleneck, and then passed through the attention mechanism module designed by us, as shown in fig. 12, where:
1. The generated feature map F is sent to global average pooling; for the feature map of the k-th channel this operation can be expressed as:

T_k = (1 / (H * W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_k(i, j),

where F_k denotes the feature map of channel k, C is the number of channels, H is the height of the feature map, W is its width, and T is the output.
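The pooling formula above in code form, for a (C, H, W) map:

```python
import numpy as np

def global_avg_pool(F):
    """T_k = (1 / (H * W)) * sum over i, j of F_k(i, j), for F shaped (C, H, W);
    returns the (C,) vector T of per-channel spatial means."""
    C, H, W = F.shape
    return F.reshape(C, H * W).mean(axis=1)
```

Each channel is collapsed to a single scalar, so T summarizes the global response strength of every channel.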
2. And (5) performing convolution on the output T by 1x1 to enable the number of the characteristic diagram channels of each channel to be consistent.
3. A sigmoid operation is then applied, and the result is combined with the original feature map, which can be expressed as:

F' = F ⊗ σ(f_{1x1}(T)),

where ⊗ is the outer product, expressible in linear algebra (for two vectors u, v) as (u v^T)_{ij} = u_i v_j; σ is the sigmoid function, σ(z) = 1 / (1 + e^{-z}) for input z; and f_{1x1} is a 1x1 convolution operation.
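Putting steps 1-3 together, the attention module can be sketched as below. The 1x1 convolution is modeled as a simple per-channel weight `w`, an assumed simplification; `channel_attention` is a name chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, w):
    """Attention sketch: global average pool -> "1x1 conv" (per-channel weight
    w here) -> sigmoid -> rescale each channel of the original map,
    i.e. F' = F (x) sigma(f_1x1(T)) from the formula above."""
    T = F.reshape(F.shape[0], -1).mean(axis=1)   # step 1: global average pooling
    a = sigmoid(w * T)                           # steps 2-3: 1x1 conv + sigmoid
    return F * a[:, None, None]                  # channel-wise rescaling of F
```

Channels whose pooled response maps to a large pre-sigmoid value keep nearly full weight, while weak channels are suppressed, which is exactly the per-scale channel re-weighting the correction network needs.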
Finally, the maps are separately upsampled, concatenated (concat), and passed through a Bottleneck, constrained with an L2* loss, where L2* is the correction-network loss over its N output keypoints: the per-keypoint L2 losses are computed for the N (17) keypoints, the largest M (set to 9) are selected, and only these M keypoint losses are kept and included in the correction network's loss function (an L2 loss); heatmaps are then generated by a 3x3 conv.
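The hard-keypoint selection in L2* can be sketched as follows (function name chosen here; averaging the kept M losses is an assumed reduction, since the text only says they are "retained and included"):

```python
import numpy as np

def hard_keypoint_loss(pred, gt, m=9):
    """L2* sketch: compute the per-keypoint L2 loss over the N = 17 heatmaps,
    keep only the m largest ("hardest") losses, and average them."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    per_kp = ((pred - gt) ** 2).reshape(pred.shape[0], -1).sum(axis=1)
    hardest = np.sort(per_kp)[-m:]   # the m largest per-keypoint losses
    return float(hardest.mean())
```

Gradients then flow only through the hardest keypoints, focusing the correction network on occluded or otherwise hard-to-detect joints rather than those the global network already localizes well.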
In the present invention, each scale's feature map generated in step S2 is sent to the correction network; the completed heatmaps are summed per scale, and the L2 loss function is finally used for the computation, yielding more accurate human keypoints.
And then, using the trained deep neural network model to estimate the key points of the human body in the images containing the human body in the test data set.
The dilated (hole) convolution is described below. Referring to Fig. 11, the left graph is an ordinary 3 x 3 conv, the middle graph a dilated conv with dilation coefficient 2, and the right graph a dilated conv with dilation coefficient 4. The actual kernel size is still 3 x 3, but with one hole inserted between taps: for a 7 x 7 image patch, only 9 points are convolved with the 3 x 3 kernel and the remaining points are skipped.
Equivalently, the kernel can be seen as 7 x 7 with only the 9 marked weights nonzero and the rest zero. Although the kernel size is only 3 x 3, the receptive field of this convolution grows to 7 x 7 (if the layer before this 2-dilated conv is a 1-dilated conv, each of its points is already the output of a 3 x 3 convolution, so the 1-dilated and 2-dilated layers together achieve a 7 x 7 receptive field). The right graph is a 4-dilated conv which, following the 1-dilated and 2-dilated convs, achieves a 15 x 15 receptive field. By comparison, stacking 3 ordinary 3 x 3 convolutions with stride 1 only reaches a receptive field of (kernel - 1) x layers + 1 = 7; i.e. the ordinary receptive field grows linearly with the number of layers, while that of dilated conv grows exponentially.
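The receptive-field arithmetic above can be verified directly; for stride-1 layers each (kernel, dilation) layer adds dilation x (kernel - 1) to the field:

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions, each given as
    (kernel, dilation); a dilated kernel spans dilation * (kernel - 1) + 1."""
    rf = 1
    for kernel, dilation in layers:
        rf += dilation * (kernel - 1)
    return rf
```

Three plain 3x3 layers give (kernel - 1) x layers + 1 = 7, while the 1-, 2-, 4-dilated stack reaches 15 with the same number of parameters, matching the exponential growth described above.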
Exploiting the strong nonlinear fitting capability of convolutional neural networks, the invention constructs a neural network that takes a human body image as input for the human pose estimation task. In particular, through the additional attention mechanism module, the network selectively focuses on the weight distribution among feature maps of different scales. In this way, a correction network based on the attention mechanism module can be used to train a human skeleton keypoint pose estimation model with good perceptual effect. In the testing phase, the images in the test set serve as model input, producing the generated effect maps shown in Fig. 1.
It should be noted that the human body image keypoint pose estimation model of the attention-fused neural network provided by the invention comprises two sub-networks, a global network and a correction network, and the objective function of the whole model is the L2 loss. When the human body image pose estimation is completed, the final objective function of the whole model, the L2 loss, can be reduced to its minimum and kept stable.
To describe the specific implementation of the invention in detail and verify its effectiveness, the proposed method was trained on an open data set. The database contains photographs of natural scenes, such as flowers and trees. All images of the data set were selected as the training data set. First, target detection is run on all images in the training data set with the trained feature pyramid network (FPN), outputting only human-class bounding boxes and generating the corresponding cropped human body images; the annotated human keypoint coordinate json file of the data set serves as model input, and the global network and the correction network are trained by gradient back-propagation until the network converges, yielding the human skeleton keypoint pose estimation model.
To test the validity of the model, an input image is processed and the visualization is shown in Fig. 1. In the experiment, the result is compared with the GroundTruth real image, as shown in Fig. 1. This embodiment effectively demonstrates the effectiveness of the proposed method for human body image keypoint pose estimation.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and these modifications and improvements should also fall within the protection scope of the invention.

Claims (6)

1. A method for estimating the pose of key points in human body images is characterized in that,
the method comprises the following steps:
s1, preprocessing image data in an image database:
firstly, the original image is sent into the trained feature pyramid network DetectionNet based on dilated (hole) convolution for detection, which outputs only human body images marked with bounding boxes; then the output human body image is cropped to a preset size, and automatic data enhancement is applied;
s2, obtaining a deep neural network model capable of carrying out posture estimation on the human body image to obtain a human body firmware key point image through training:
using the cut human body image in the step S1 as the input of the network, using json files marked out in the form of xy-axis coordinates in a training set as the human body key point marking information image as the GroudTruth, training the global network and the correction network in the deep neural network model, and obtaining the trained deep neural network model for finishing the posture estimation from the human body image to the human body key point image;
the input human body image is processed through the ResNet101 network of the global network to obtain feature map outputs of different sizes; adopting a bottom-up U-Shape structure, each layer's feature map is upsampled and added from bottom to top, followed by a predict operation in which an L2 loss function, computed against the GroundTruth, applies the loss constraint; after predetermined convolution steps in the predict operation, heatmaps of the different keypoints are generated, yielding the initial human keypoints;
features to sum each layer in a global networkAfter passing through the Bottleneck, the graph passes through a designed attention mechanism module, then up-sampling is respectively carried out, then concat operation is carried out, and then the graph passes through the Bottleneck, and L based on key points difficult to detect is adopted2 *Loss constraint, then thermodynamic diagram generation is carried out through convolution, initial human body key points are corrected, and final human body key points are obtained; after the model is iterated for multiple times and is stabilized, the training of the model is completed;
and S3, carrying out posture estimation processing on the images containing the human body in the test data set by using the trained deep neural network model.
2. The human body image key point posture estimation method according to claim 1, wherein the feature pyramid network FPN processes the image with a specific data enhancement method, the last two stages of the FPN are modified specifically for the detection task, and the detected human body image is cut out and then input, specifically:
adopting Resnet50 as a backbone network to extract features, and randomly initializing a ResNet50 network by using standard Gaussian distribution;
according to the features extracted by ResNet50, the feature maps of the four stages are retained and named P2, P3, P4, P5, and one further stage is added by connecting a convolution with kernel size 1x1, giving the feature map P6;
and after stage 4 the spatial resolution of the feature map is kept unchanged, i.e.

r(P_x) = i / 2^x for x ≤ 4, and r(P_x) = i / 2^4 for x > 4,

wherein r(P_x) represents the spatial resolution, i is the original map size, and x ∈ [2, 3, 4, 5, 6]; at P4, P5, P6, convolutions with kernel size 1x1 are connected to keep the number of channels consistent (256 channels);
finally, adding the feature maps of stages 4-6 according to the pyramid framework to form the FPN feature pyramid; performing target detection with the Fast RCNN method, applying constraints through a regression loss and a classification loss with multi-loss fusion, wherein the classification loss adopts the log loss and the regression loss is the same as in R-CNN;
the overall loss function is:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v);
at the last fully connected layer of the detection network, two branches are attached: one is a softmax for classifying each ROI region, where with K classes to be classified plus the background there are K+1 classes in total, and the output result is p = (p_0, ..., p_K); the other is a bounding-box regressor for a more accurate ROI region, whose output t^k = (t_x^k, t_y^k, t_w^k, t_h^k) represents the coordinates of the bounding box of class k, wherein (x, y) is the coordinate of the upper left corner of the bounding box and (x + w, y + h) is the coordinate of the lower right corner; u is the GroundTruth class of each ROI region and v is the bounding-box regression target of the GroundTruth; λ is a hyper-parameter controlling the balance between the two task losses, here λ = 1; [u ≥ 1] equals 1 when u ≥ 1;
the classification loss is specifically:

L_cls(p, u) = -log p_u,

which is a log loss function;
the regression loss is specifically:

L_loc(t^u, v) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i^u - v_i),

wherein v = (v_x, v_y, v_w, v_h) is the position of the real box of class u, t^u = (t_x^u, t_y^u, t_w^u, t_h^u) is the predicted box position of class u, and

smooth_L1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise.
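The multi-loss fusion of claim 2 (log classification loss plus a smooth-L1 regression loss, gated by [u ≥ 1] and weighted by λ) can be sketched numerically as follows; the toy softmax vector and zero regression targets are illustrative:

```python
import numpy as np

def smooth_l1(t, v):
    """Smooth-L1 regression loss used by Fast R-CNN (same as R-CNN here)."""
    d = np.abs(t - v)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def fast_rcnn_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * L_loc(t^u, v).
    Background ROIs (u == 0) contribute no regression loss."""
    cls = -np.log(p[u])
    loc = lam * smooth_l1(t_u, v) if u >= 1 else 0.0
    return cls + loc

p = np.array([0.1, 0.7, 0.2])            # softmax over background + K classes
loss_fg = fast_rcnn_loss(p, 1, np.zeros(4), np.zeros(4))   # foreground ROI
loss_bg = fast_rcnn_loss(p, 0, np.zeros(4), np.zeros(4))   # background ROI
```

Note how the indicator [u ≥ 1] switches the regression term off for background proposals, which is why λ only trades off the two losses on positive ROIs.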
3. The human body image key point posture estimation method according to claim 1, wherein the data enhancement process includes random flipping, random rotation and random scaling with specific parameters.
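A minimal sketch of the random flipping / rotation / scaling pipeline; the rotation is restricted to 90-degree steps and the scale factors are integers purely to stay dependency-free, so these exact parameters are assumptions rather than the claim's "specific parameters":

```python
import random
import numpy as np

def augment(img, rng=random):
    """Apply random flip, random rotation and random scaling to a 2D image."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                    # random horizontal flip
    img = np.rot90(img, k=rng.randint(0, 3))  # random rotation (90-deg steps)
    scale = rng.choice((1, 2))                # random (integer) scaling factor
    return img.repeat(scale, axis=0).repeat(scale, axis=1)

out = augment(np.zeros((4, 6)))
```

In practice the keypoint annotations must be transformed with the same parameters so image and GroundTruth stay aligned.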
4. The human image keypoint pose estimation method of claim 1,
the step of processing the input human body image through the ResNet50 network of the global network to obtain feature map outputs with different sizes comprises the following steps:
randomly initializing a ResNet50 network using a standard Gaussian distribution;
inputting the human body image into the ResNet50 network comprising four residual blocks, and taking the conv features output by the different blocks, denoted C2, C3, C4, C5;
wherein C2 has 64 channels, C3 has 128 channels, C4 has 256 channels and C5 has 512 channels; at each residual block C2, C3, C4, C5, a convolution with kernel 1x1 is added, followed by a BN layer and a ReLU, so that the number of feature channels becomes 256; the residual blocks of the different layers thus obtained yield feature map outputs of different sizes.
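The per-block 1x1 convolution + BN + ReLU that projects C2-C5 to a common channel count can be sketched as a per-pixel channel matmul; the random weights, the toy spatial size and the affine-free batch norm are illustrative assumptions:

```python
import numpy as np

def lateral_1x1(fmap, weight):
    """1x1 convolution over a (C_in, H, W) map: a per-pixel matmul
    mapping C_in channels to weight.shape[0] output channels."""
    c, h, w = fmap.shape
    return (weight @ fmap.reshape(c, -1)).reshape(-1, h, w)

def bn_relu(x, eps=1e-5):
    """Per-channel normalization (no learned affine here) followed by ReLU."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return np.maximum((x - mean) / np.sqrt(var + eps), 0.0)

# C3 has 128 channels; the lateral branch projects it to 256 channels
rng = np.random.default_rng(0)
c3 = rng.standard_normal((128, 8, 8))
out = bn_relu(lateral_1x1(c3, rng.standard_normal((256, 128))))
```

With all four blocks projected to 256 channels, the pyramid levels can later be added or concatenated without shape mismatches.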
5. The human image keypoint pose estimation method of claim 1,
the objective function of the deep neural network model adopts the loss function L_2:

L_2 = E_{x,y~P(X,Y)} ||F_generate(x) - y||^2,

wherein x is the input real human body image, y is the thermodynamic diagram corresponding to the GroundTruth, and F_generate outputs the thermodynamic diagrams of the plurality of key points corresponding to each residual block of the global network; E represents the mathematical expectation of the L2 norm under the P(X,Y) distribution, and P(X,Y) is a probability density function; L_2* is the loss of the N hardest key points of the correction network's output (each key point's loss computed by L_2), and only those key point losses are retained in the correction network's loss function.
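A minimal numerical sketch of the per-keypoint L_2 loss and of the L_2* rule that keeps only the hardest key points; the number of retained key points (`keep`) is a hyper-parameter the claim leaves open:

```python
import numpy as np

def l2_loss(pred, gt):
    """Per-keypoint squared-L2 between predicted and GroundTruth heatmaps."""
    return ((pred - gt) ** 2).reshape(pred.shape[0], -1).sum(axis=1)

def hard_keypoint_loss(pred, gt, keep):
    """L2*: retain only the `keep` largest per-keypoint losses, i.e. the
    key points that are hardest to detect, and average them."""
    per_kp = l2_loss(pred, gt)
    hardest = np.sort(per_kp)[::-1][:keep]
    return hardest.mean()

gt = np.zeros((4, 8, 8))
pred = gt.copy()
pred[0] += 1.0                            # keypoint 0 is "hard"
loss = hard_keypoint_loss(pred, gt, keep=2)
```

Discarding the easy key points focuses the correction network's gradient on occluded or ambiguous joints.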
6. The human image keypoint pose estimation method of claim 1,
wherein the designed attention mechanism module operates as follows:
1. the generated feature map F ∈ R^{C×H×W} is sent to global average pooling; for the feature map of the k-th channel this operation can be represented as

T_k = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_k(i, j),

wherein F_k denotes the feature map of channel k, C is the number of channels, H is the height of the feature map, W is the width of the feature map, and T is the output;
2. a convolution with kernel size 1x1 is performed on the output T so that the number of feature map channels remains consistent;
3. a sigmoid operation is then performed, and the result is combined with the original feature map, which can be expressed as

F' = F ⊗ σ(f_{1x1}(T)),

wherein ⊗ is the outer product, which for two matrices u, v can be written in linear algebra as (u ⊗ v)_{ij} = u_i v_j; σ is the sigmoid function σ(z) = 1/(1 + e^{-z}), z being the function input; and f_{1x1} is the 1x1 convolution operation.
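Claim 6's module (global average pooling, 1x1 convolution, sigmoid, recombination with the original feature map) can be sketched as channel-wise gating; interpreting ⊗ as a per-channel rescaling and using a fixed stand-in weight matrix are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(fmap, weight):
    """Global average pool per channel, a 1x1 convolution (here a C x C
    matrix on the pooled vector), sigmoid, then rescale the original
    (C, H, W) feature map channel-wise by the resulting gate."""
    t = fmap.mean(axis=(1, 2))            # T_k = (1/(H*W)) sum of channel k
    gate = sigmoid(weight @ t)            # 1x1 conv + sigmoid
    return fmap * gate[:, None, None]     # channel-wise reweighting of F

f = np.ones((8, 4, 4))
out = channel_attention(f, np.eye(8) * 10.0)   # strong gate ~ sigmoid(10)
```

This is the same squeeze-and-excite pattern: channels the gate scores near 1 pass through almost unchanged, weak channels are suppressed.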
CN202011433083.5A 2020-12-10 2020-12-10 Human body image key point posture estimation method Pending CN112686097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011433083.5A CN112686097A (en) 2020-12-10 2020-12-10 Human body image key point posture estimation method

Publications (1)

Publication Number Publication Date
CN112686097A true CN112686097A (en) 2021-04-20

Family

ID=75446585

Country Status (1)

Country Link
CN (1) CN112686097A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210420