CN112131959A - 2D human body posture estimation method based on multi-scale feature reinforcement - Google Patents
2D human body posture estimation method based on multi-scale feature reinforcement
- Publication number
- CN112131959A (application CN202010883889.8A)
- Authority
- CN
- China
- Prior art keywords
- network
- features
- feature
- human body
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A 2D human body posture estimation method based on multi-scale feature reinforcement comprises the following steps: 1) extracting features with high representation capability from the input picture, and performing cross-channel interaction on features of different scales through a split-attention module; 2) constructing a multi-stage prediction network on the obtained feature maps of different scales, and fusing the features of each stage through lateral and top-down propagation, so that more spatial-resolution information is preserved while semantic information is retained; 3) constructing a high-resolution adjustment network to fine-tune the localization results of the multi-stage prediction network: the multi-stage features are upsampled to the maximum resolution through transposed convolution and then concatenated, so as to re-localize keypoints with large loss; 4) after the whole network structure is constructed, processing the input data and setting the training parameters. The invention improves the detection capability of the whole network for keypoints of different scales.
Description
Technical Field
The invention relates to a human body posture estimation task in computer vision, in particular to a 2D human body posture estimation method based on multi-scale feature reinforcement.
Background
As the basis of many visual tasks such as action recognition, pose tracking and human-computer interaction, human posture estimation is currently a popular research field, with broad application prospects in virtual reality, intelligent surveillance, robotics and other areas. With the development of deep convolutional neural networks, many excellent solutions to the human posture estimation task have emerged. However, the scenes in which human bodies appear are complex and variable, the number of people in a picture varies, and mutual occlusion and self-occlusion occur easily. The distance between the camera and the human body and different viewing angles lead to different sizes of people in the picture, and picture quality is easily affected by environmental factors such as illumination. Human posture estimation therefore remains a significant challenge.
Early research modeled the human body mainly with manually selected features and suitable models, mostly tree models and random forests; such traditional methods place high demands on image processing and have certain limitations in practical applications. With the application of deep architectures to human posture estimation, performance has improved greatly. The current research focus is multi-person posture estimation, which faces more challenges and is closer to real scenes; mainstream solutions are divided into top-down and bottom-up methods.
Bottom-up methods first detect all keypoints in the image and then assign the obtained keypoints to different individuals by clustering. The advantage of this approach is that processing time does not grow linearly with the number of people in the picture, at the cost of lower accuracy than top-down methods. Some researchers proposed part affinity fields, modeling the relationship between two keypoints as a two-dimensional vector, which effectively avoids wrongly connecting keypoints across different human bodies. Top-down methods first detect the human bodies in the picture and then predict keypoints for each detected body; they must therefore solve the challenges of single-person posture estimation while also coping with inaccurate and duplicate human-body proposals. Some methods divide keypoints into two classes processed separately: keypoints that are easy to detect are first localized by a global positioning network, and hard keypoints are then localized by a cascaded network. However, because much semantic and resolution information is lost during propagation, current networks still cannot localize human bodies of different sizes well.
Disclosure of Invention
To solve the problems of existing human body posture estimation methods, the invention provides a 2D human body posture estimation method based on multi-scale feature reinforcement. Since the network loses considerable spatial information during propagation, the invention upsamples features of different scales through transposed convolution, concatenates the features of the four stages, fine-tunes the localization results of the multi-stage prediction network, fuses the results of the two stages, and outputs the final localization result.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A 2D human body posture estimation method based on multi-scale feature reinforcement comprises the following steps:
1) obtaining abstract features with high representation capability:
inputting the preprocessed pictures into a ResNeSt backbone network, performing cross-channel interaction on features of different scales through a split-attention module, removing the final classification layer, and outputting the features of four stages;
2) constructing a multi-stage prediction network:
four features with different resolutions are obtained through step 1); a feature-enhanced feature pyramid is constructed on the features of the four stages, and because the top-level features lose more semantic information during propagation, a feature-enhancement strategy is used to fuse and enhance the high-level features;
3) constructing a high-resolution adjustment network:
a high-resolution adjustment network is constructed to adjust the positions of keypoints with large prediction loss in the previous stage; the features in the multi-stage prediction network are upsampled through transposed convolution, which combines the upsampling and convolution operations, and the enlarged features are concatenated, introducing richer spatial details for small-scale keypoints;
4) training settings of the whole network:
all input pictures are set to a 4:3 aspect ratio, a human detector is then used to obtain the human body instances in each picture, the size of each input instance is set to 384 × 288, and an MSE loss function is used to back-propagate the errors during training; the initial learning rate of the network is set to 5e-4 with weight decay 1e-5, an Adam optimizer is used, the learning rate is halved every 6 epochs, and training runs for 20 epochs.
Further, in step 1), considering that features with strong representation capability are crucial to the final localization result, the feature extraction network ResNeSt designed for pixel-level visual tasks is used, and cross-channel interaction is performed on features of different scales through a split-attention module;
the feature map is first divided into $K$ cardinal groups, and each cardinal group is further divided into $R$ splits, so the total number of feature splits is $G = KR$. A transformation $F_i$ is applied to each split individually, and the intermediate representation of each split is:

$$U_i = F_i(X), \quad i \in \{1, 2, \ldots, G\}$$

where $F_i$ denotes the transformation function of the $i$-th split and $G$ denotes the total number of splits. The input to the $k$-th cardinal group is the sum of its splits:

$$\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j, \qquad \hat{U}^k \in \mathbb{R}^{H \times W \times C/K}, \quad k \in \{1, 2, \ldots, K\}$$

where $H$, $W$ and $C$ denote the height, width and number of channels of the feature map, respectively.
Furthermore, in step 2), after four features with different resolutions are obtained through the backbone network, a multi-stage prediction network with a pyramid structure is constructed to maintain semantic information and spatial-resolution information at different scales; because the top-level features are reduced in dimension by a 1 × 1 convolution kernel, more semantic information is lost there, which in turn degrades the semantic information of every layer below; a feature-enhancement module is used to enhance the top-level features, effectively improving the representation capability of the whole multi-stage prediction network;
prediction is then performed on each level of the feature network: a 1 × 1 convolution first eliminates the aliasing effect produced by feature superposition, a BN (batch normalization) layer then normalizes the features, a ReLU activation function is applied, a 3 × 3 convolution reduces the 256-dimensional features to the finally required 17 dimensions, and the resulting heatmaps are upsampled to the output size and normalized again, effectively improving the generalization capability of the model.
Furthermore, in step 3), after global localization according to step 2), some small or occluded keypoints still have large detection errors; a high-resolution fine-tuning network is constructed to integrate features of different scales: the feature maps in the multi-stage prediction network are refined through several bottleneck modules and then upsampled to the output size by transposed convolution layers applied a different number of times;
four high-resolution features of the same size are obtained through transposed convolution; each feature is processed by normalization and a ReLU function, the features are concatenated along the first dimension, a 3 × 3 convolution kernel then produces the final prediction, and the output is normalized before being emitted; to prevent the positions of keypoints of larger human bodies from being disturbed while smaller targets are corrected, only the positions of keypoints with larger loss values are modified during back-propagation in network training.
The technical concept of the invention is as follows: a backbone network is used to obtain features with high representation capability; on these features, a feature-enhanced multi-stage prediction network is constructed to perform initial localization of all keypoints; a high-resolution adjustment network is then constructed, which introduces more spatial context information into the feature maps through transposed convolution and concatenation, and adjusts the positions of keypoints with larger errors. Finally, the outputs of the two stages are fused to obtain the final localization result.
The invention has the following beneficial effects: the ResNeSt backbone network is applied to the human body posture estimation task; a multi-stage prediction network is constructed on the obtained features, and a feature-reinforcement strategy is used to counter the loss during feature propagation, effectively ensuring the strong performance of the multi-stage prediction network; a high-resolution fine-tuning network is constructed for keypoints with large errors, and transposed convolution effectively combines the upsampling and convolution operations, improving the detection capability of the whole network for keypoints of different scales. The prediction results of the two stages are integrated, giving better performance and a degree of robustness for keypoint prediction in different scenes.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a diagram of a network architecture according to an aspect of the present invention;
FIG. 3 is a block diagram of a feature extraction network;
fig. 4 is a schematic diagram of a feature enhancement policy flow.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 4, a 2D human body posture estimation method based on multi-scale feature reinforcement includes the following steps:
1) extracting features with high representation capability from the input picture:
in the invention, the final positioning result of the features with stronger representation capability is considered to be important, so that a feature extraction network ResNeSt aiming at a pixel-level visual task is used, and as shown in FIG. 3, cross-channel interaction is carried out on the features with different scales through a separation attention module;
the feature map is first divided into $K$ cardinal groups, and each cardinal group is further divided into $R$ splits, so the total number of feature splits is $G = KR$. A transformation $F_i$ is applied to each split individually, and the intermediate representation of each split is:

$$U_i = F_i(X), \quad i \in \{1, 2, \ldots, G\}$$

where $F_i$ denotes the transformation function of the $i$-th split and $G$ denotes the total number of splits. The input to the $k$-th cardinal group is the sum of its splits:

$$\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j, \qquad \hat{U}^k \in \mathbb{R}^{H \times W \times C/K}, \quad k \in \{1, 2, \ldots, K\}$$

where $H$, $W$ and $C$ denote the height, width and number of channels of the feature map, respectively. The channel-wise global context statistics $s^k \in \mathbb{R}^{C/K}$, which carry the weights for the channels, are obtained by global average pooling over the spatial dimensions:

$$s^k_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \hat{U}^k_c(i, j)$$

After the split-attention module, each channel of the feature map is generated by a weighted combination over the splits; the $c$-th channel is computed as:

$$V^k_c = \sum_{i=1}^{R} a^k_i(c)\, U_{R(k-1)+i}$$

where $s^k$ represents the global spatial information and a mapping $\mathcal{G}$ determines the weight $a^k_i(c)$ of each channel from it. The outputs of the cardinal groups are then concatenated along the channel dimension, i.e. $V = \mathrm{Concat}\{V^1, V^2, \ldots, V^K\}$, and the output $Y$ of each module can be expressed as:

$$Y = V + T(X)$$

where $V$ denotes the concatenated output of the cardinal groups and $T(X)$ denotes the output of the shortcut connection;
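The split-attention computation above can be sketched numerically. The following is a minimal NumPy illustration under stated assumptions: the per-split transformations are taken as identity, and the attention mapping $\mathcal{G}$ is replaced by a simplified context-similarity score (the real ResNeSt uses two fully connected layers followed by an r-softmax). It only demonstrates the grouping, global pooling, softmax weighting, and channel-wise concatenation:

```python
import numpy as np

def split_attention(x, K=2, R=2):
    """Illustrative split-attention sketch over G = K*R feature splits.

    x: array of shape (G, H, W, C) holding the G split feature maps.
    Returns the concatenated cardinal-group outputs, shape (H, W, K*C).
    NOTE: the attention logits below are a hypothetical simplification,
    not the FC-layer mapping used in the actual ResNeSt module.
    """
    G, H, W, C = x.shape
    assert G == K * R
    outputs = []
    for k in range(K):
        splits = x[k * R:(k + 1) * R]          # the R splits of cardinal group k
        u_hat = splits.sum(axis=0)             # fused representation U^k
        s = u_hat.mean(axis=(0, 1))            # global average pooling -> (C,)
        # per-split, per-channel attention logits from the global context
        logits = np.stack([(splits[r] * s).mean(axis=(0, 1)) for r in range(R)])
        a = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # r-softmax
        v = sum(a[r] * splits[r] for r in range(R))  # weighted combination V^k
        outputs.append(v)
    return np.concatenate(outputs, axis=-1)    # channel-wise concat over groups
```

With K = 2 cardinal groups of R = 2 splits, four (H, W, C) splits are reduced to one (H, W, 2C) output, matching the channel bookkeeping in the formulas above.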
2) constructing a multi-stage prediction network and performing initial localization of all keypoints, as follows:
after 4 features with different resolutions are obtained through a backbone network, a pyramid-structured multi-level prediction network is constructed to maintain semantic information and spatial resolution information with different scales. Because the dimension of the top-layer features is reduced through the convolution kernel with the size of 1 multiplied by 1, more semantic information is lost, and the loss of each layer of semantic information is directly caused. The invention uses a feature enhancement module, as shown in fig. 4, to effectively enhance the top-level features and effectively improve the characterization capability of the whole multi-level prediction network. The top-level features are firstly subjected to spatial adaptive pooling, are features with three resolutions and have 256 dimensions, then the three feature maps are sampled to the original size in a weighted fusion mode for fusion, so that a feature with reduced dimensionality and unchanged resolution is obtained, and finally the feature is fused with the original feature, as shown in a network structure in FIG. 2;
Prediction is then performed on each level of the feature network: a 1 × 1 convolution first eliminates the aliasing effect produced by feature superposition, a BN (batch normalization) layer then normalizes the features, a ReLU activation function is applied, a 3 × 3 convolution reduces the 256-dimensional features to the finally required 17 dimensions (the number of human-body keypoints), and the resulting heatmaps are upsampled to the output size and normalized again, effectively improving the generalization capability of the model;
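The final upsample-and-renormalize step of the prediction head can be illustrated with a small NumPy sketch. Nearest-neighbor upsampling and per-joint min-max normalization are assumptions here; the text does not name the interpolation mode or the exact normalization used:

```python
import numpy as np

def upsample_nearest(hm, factor):
    """Nearest-neighbor upsampling of a (J, H, W) stack of joint heatmaps."""
    return hm.repeat(factor, axis=1).repeat(factor, axis=2)

def normalize_heatmaps(hm, eps=1e-8):
    """Per-joint min-max normalization to [0, 1] (assumed form of the
    'normalization processing' applied after upsampling)."""
    mn = hm.min(axis=(1, 2), keepdims=True)
    mx = hm.max(axis=(1, 2), keepdims=True)
    return (hm - mn) / (mx - mn + eps)
```

For example, 17 heatmaps at the 96 × 72 backbone resolution could be brought to a 384 × 288 output with `normalize_heatmaps(upsample_nearest(hm, 4))`.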
3) constructing a high-resolution fine-tuning network and further adjusting the positions of small-scale keypoints, as follows:
after global positioning is carried out according to the method in the step 2, some small key points still exist, and the detection error of the blocked key points is large. As shown in fig. 2, the present invention constructs a high resolution fine tuning network that integrates features of different dimensions. Feature thinning is carried out on a feature map in the multistage prediction network through a plurality of bottleneck modules, and then the feature map is up-sampled to the output size through the transposition convolutional layers of different times. The input channel and the output channel of the transposed convolution are 256-dimensional, the size of a convolution kernel is set to be 4 multiplied by 4, the step length is 2, and the filling is 1;
four high-resolution features of the same size are obtained through transposed convolution; each feature is processed by normalization and a ReLU function, the features are concatenated along the first dimension, a 3 × 3 convolution kernel then produces the final prediction, and the output is normalized before being emitted. To prevent the positions of keypoints of larger human bodies from being disturbed while smaller targets are corrected, only the positions of keypoints with larger loss values are modified during back-propagation in network training;
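With the transposed-convolution hyperparameters given above (4 × 4 kernel, stride 2, padding 1), each layer exactly doubles the spatial size, which is why applying it a different number of times brings every pyramid stage to the same resolution. The standard output-size formula makes this concrete:

```python
def transposed_conv_out(size: int, kernel: int = 4, stride: int = 2,
                        padding: int = 1) -> int:
    """Output spatial size of a transposed convolution (no output padding):
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size - 1) * stride - 2 * padding + kernel
```

For a 384 × 288 input instance, backbone stages at strides 4, 8, 16 and 32 have resolutions 96 × 72, 48 × 36, 24 × 18 and 12 × 9; applying the layer 0, 1, 2 and 3 times respectively maps them all to 96 × 72 (these stage resolutions are inferred from the stated input size, not listed explicitly in the text).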
4) after the whole network structure is constructed, the input data of the network need to be processed and the parameters need to be set, as follows:
all input pictures are set to a 4:3 aspect ratio, a human detector is then used to obtain the human body instances in each picture, the size of each input instance is set to 384 × 288, and an MSE loss function is used to back-propagate the errors during training. The initial learning rate of the network is set to 5e-4 with weight decay 1e-5, an Adam optimizer is used, the learning rate is halved every 6 epochs, and training runs for 20 epochs;
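The schedule described above (initial rate 5e-4, halved every 6 epochs, 20 epochs total) is a plain step decay, which can be written as a one-line function; `step` and `gamma` mirror the stated settings:

```python
def learning_rate(epoch: int, base_lr: float = 5e-4, step: int = 6,
                  gamma: float = 0.5) -> float:
    """Step-decay schedule: the rate is multiplied by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

Over the 20-epoch run this yields 5e-4 for epochs 0-5, 2.5e-4 for 6-11, 1.25e-4 for 12-17, and 6.25e-5 for 18-19.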
through the operation of the steps, the 2D human body posture estimation with strengthened characteristics can be realized.
The objects, technical solutions and advantages of the present invention are further described in detail with reference to the detailed description illustrated in the drawings, it should be understood that the above description is only an exemplary embodiment of the present invention, and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (4)
1. A 2D human body posture estimation method based on multi-scale feature reinforcement, characterized by comprising the following steps:
1) obtaining abstract features with high representation capability:
inputting the preprocessed pictures into a ResNeSt backbone network, performing cross-channel interaction on features of different scales through a split-attention module, removing the final classification layer, and outputting the features of four stages;
2) constructing a multi-stage prediction network:
four features with different resolutions are obtained through step 1); a feature-enhanced feature pyramid is constructed on the features of the four stages, and because the top-level features lose more semantic information during propagation, a feature-enhancement strategy is used to fuse and enhance the high-level features;
3) constructing a high-resolution adjustment network:
a high-resolution adjustment network is constructed to adjust the positions of keypoints with large prediction loss in the previous stage; the features in the multi-stage prediction network are upsampled through transposed convolution, which combines the upsampling and convolution operations, and the enlarged features are concatenated, introducing richer spatial details for small-scale keypoints;
4) training settings of the whole network:
all input pictures are set to a 4:3 aspect ratio, a human detector is then used to obtain the human body instances in each picture, the size of each input instance is set to 384 × 288, and an MSE loss function is used to back-propagate the errors during training; the initial learning rate of the network is set to 5e-4 with weight decay 1e-5, an Adam optimizer is used, the learning rate is halved every 6 epochs, and training runs for 20 epochs.
2. The 2D human body posture estimation method based on multi-scale feature reinforcement according to claim 1, characterized in that in step 1), considering that features with strong representation capability are crucial to the final localization result, the feature extraction network ResNeSt designed for pixel-level visual tasks is used, and cross-channel interaction is performed on features of different scales through a split-attention module;
the feature map is first divided into $K$ cardinal groups, and each cardinal group is further divided into $R$ splits, so the total number of feature splits is $G = KR$. A transformation $F_i$ is applied to each split individually, and the intermediate representation of each split is:

$$U_i = F_i(X), \quad i \in \{1, 2, \ldots, G\}$$

where $F_i$ denotes the transformation function of the $i$-th split and $G$ denotes the total number of splits. The input to the $k$-th cardinal group is the sum of its splits:

$$\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j, \qquad k \in \{1, 2, \ldots, K\}$$
3. The 2D human body posture estimation method based on multi-scale feature reinforcement according to claim 1 or 2, characterized in that in step 2), after four features with different resolutions are obtained through the backbone network, a multi-stage prediction network with a pyramid structure is constructed to maintain semantic information and spatial-resolution information at different scales; because the top-level features are reduced in dimension by a 1 × 1 convolution kernel, more semantic information is lost there, which in turn degrades the semantic information of every layer below; a feature-enhancement module is used to enhance the top-level features, effectively improving the representation capability of the whole multi-stage prediction network;
prediction is then performed on each level of the feature network: a 1 × 1 convolution first eliminates the aliasing effect produced by feature superposition, a BN (batch normalization) layer then normalizes the features, a ReLU activation function is applied, a 3 × 3 convolution reduces the 256-dimensional features to the finally required 17 dimensions, and the resulting heatmaps are upsampled to the output size and normalized again, effectively improving the generalization capability of the model.
4. The 2D human body posture estimation method based on multi-scale feature reinforcement according to claim 1 or 2, characterized in that in step 3), after global localization is performed according to the method in step 2), some small or occluded keypoints still have large detection errors; a high-resolution fine-tuning network is constructed to integrate features of different scales: the feature maps in the multi-stage prediction network are refined through several bottleneck modules and then upsampled to the output size by transposed convolution layers applied a different number of times;
four high-resolution features of the same size are obtained through transposed convolution; each feature is processed by normalization and a ReLU function, the features are concatenated along the first dimension, a 3 × 3 convolution kernel then produces the final prediction, and the output is normalized before being emitted; to prevent the positions of keypoints of larger human bodies from being disturbed while smaller targets are corrected, only the positions of keypoints with larger loss values are modified during back-propagation in network training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010883889.8A CN112131959B (en) | 2020-08-28 | 2020-08-28 | 2D human body posture estimation method based on multi-scale feature reinforcement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010883889.8A CN112131959B (en) | 2020-08-28 | 2020-08-28 | 2D human body posture estimation method based on multi-scale feature reinforcement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112131959A (en) | 2020-12-25
CN112131959B (en) | 2024-03-22
Family
ID=73847628
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010883889.8A Active CN112131959B (en) | 2020-08-28 | 2020-08-28 | 2D human body posture estimation method based on multi-scale feature reinforcement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112131959B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229445A (en) * | 2018-02-09 | 2018-06-29 | 深圳市唯特视科技有限公司 | Multi-person pose estimation method based on a cascaded pyramid network
CN108710830A (en) * | 2018-04-20 | 2018-10-26 | 浙江工商大学 | Dense human body 3D pose estimation method combining a connected attention pyramid residual network with equidistant constraints
CN110135375A (en) * | 2019-05-20 | 2019-08-16 | 中国科学院宁波材料技术与工程研究所 | Multi-person pose estimation method based on global information integration
CN110276316A (en) * | 2019-06-26 | 2019-09-24 | 电子科技大学 | Human body key point detection method based on deep learning
CN110659565A (en) * | 2019-08-15 | 2020-01-07 | 电子科技大学 | 3D multi-person pose estimation method based on atrous (dilated) convolution
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112837367B (en) * | 2021-01-27 | 2022-11-25 | 清华大学 | Semantic decomposition type object pose estimation method and system |
CN112837367A (en) * | 2021-01-27 | 2021-05-25 | 清华大学 | Semantic decomposition type object pose estimation method and system |
CN113221626A (en) * | 2021-03-04 | 2021-08-06 | 北京联合大学 | Human body posture estimation method based on Non-local high-resolution network |
CN113221626B (en) * | 2021-03-04 | 2023-10-20 | 北京联合大学 | Human body posture estimation method based on Non-local high-resolution network |
CN113221787A (en) * | 2021-05-18 | 2021-08-06 | 西安电子科技大学 | Pedestrian multi-target tracking method based on multivariate difference fusion |
CN113221787B (en) * | 2021-05-18 | 2023-09-29 | 西安电子科技大学 | Pedestrian multi-target tracking method based on multi-element difference fusion |
CN113284146A (en) * | 2021-07-23 | 2021-08-20 | 天津御锦人工智能医疗科技有限公司 | Colorectal polyp image recognition method and device and storage medium |
CN113284146B (en) * | 2021-07-23 | 2021-10-22 | 天津御锦人工智能医疗科技有限公司 | Colorectal polyp image recognition method and device and storage medium |
CN113792641A (en) * | 2021-09-08 | 2021-12-14 | 南京航空航天大学 | High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism |
CN113792641B (en) * | 2021-09-08 | 2024-05-03 | 南京航空航天大学 | High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism |
CN114155560A (en) * | 2022-02-08 | 2022-03-08 | 成都考拉悠然科技有限公司 | Light weight method of high-resolution human body posture estimation model based on space dimension reduction |
CN117456562A (en) * | 2023-12-25 | 2024-01-26 | 深圳须弥云图空间科技有限公司 | Attitude estimation method and device |
CN117456562B (en) * | 2023-12-25 | 2024-04-12 | 深圳须弥云图空间科技有限公司 | Attitude estimation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112131959A (en) | 2D human body posture estimation method based on multi-scale feature reinforcement | |
CN113033570B (en) | Image semantic segmentation method with improved dilated convolution and multi-level feature fusion | |
CN114202672A (en) | Small target detection method based on attention mechanism | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN112288011B (en) | Image matching method based on self-attention deep neural network | |
CN110210551A (en) | Visual target tracking method based on adaptive subject sensitivity | |
CN110929578A (en) | Occlusion-resistant pedestrian detection method based on an attention mechanism | |
CN108427921A (en) | Face recognition method based on convolutional neural networks | |
CN109558862A (en) | Crowd counting method and system with a spatially-aware attention refinement framework | |
CN116452937A (en) | Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism | |
CN114119975A (en) | Language-guided cross-modal instance segmentation method | |
CN112784756B (en) | Human body identification tracking method | |
CN113792641A (en) | High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism | |
CN116863539A (en) | Fall figure target detection method based on optimized YOLOv8s network structure | |
CN115222998B (en) | Image classification method | |
CN111860124A (en) | Remote sensing image classification method based on space spectrum capsule generation countermeasure network | |
CN116863194A (en) | Foot ulcer image classification method, system, equipment and medium | |
CN113344110A (en) | Fuzzy image classification method based on super-resolution reconstruction | |
CN116309632A (en) | Three-dimensional liver semantic segmentation method based on multi-scale cascade feature attention strategy | |
CN116092190A (en) | Human body posture estimation method based on self-attention high-resolution network | |
CN116486080A (en) | Lightweight image semantic segmentation method based on deep learning | |
CN113850182B (en) | DAMR _ DNet-based action recognition method | |
CN113780140A (en) | Gesture image segmentation and recognition method and device based on deep learning | |
CN117011655A (en) | Adaptive region selection feature fusion based method, target tracking method and system | |
CN114863133A (en) | Flotation froth image feature point extraction method based on multitask unsupervised algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||