CN111476184A - Human body key point detection method based on dual-attention mechanism - Google Patents
Human body key point detection method based on dual-attention mechanism
- Publication number
- CN111476184A CN111476184A CN202010284037.7A CN202010284037A CN111476184A CN 111476184 A CN111476184 A CN 111476184A CN 202010284037 A CN202010284037 A CN 202010284037A CN 111476184 A CN111476184 A CN 111476184A
- Authority
- CN
- China
- Prior art keywords
- human body
- data set
- key point
- attention
- point detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 66
- 238000012549 training Methods 0.000 claims abstract description 35
- 238000012360 testing method Methods 0.000 claims abstract description 25
- 230000007246 mechanism Effects 0.000 claims abstract description 12
- 238000007781 pre-processing Methods 0.000 claims abstract description 5
- 238000011176 pooling Methods 0.000 claims description 8
- 238000000605 extraction Methods 0.000 claims description 6
- 230000004927 fusion Effects 0.000 claims description 6
- 230000036544 posture Effects 0.000 claims description 5
- 238000004364 calculation method Methods 0.000 claims description 4
- 230000008859 change Effects 0.000 claims description 2
- 238000012795 verification Methods 0.000 claims description 2
- 238000005070 sampling Methods 0.000 claims 1
- 238000000034 method Methods 0.000 abstract description 10
- 238000004422 calculation algorithm Methods 0.000 description 5
- 238000013135 deep learning Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 4
- 230000006872 improvement Effects 0.000 description 3
- 230000006399 behavior Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 210000003423 ankle Anatomy 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 210000000887 face Anatomy 0.000 description 1
- 230000005021 gait Effects 0.000 description 1
- 210000003127 knee Anatomy 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 210000000707 wrist Anatomy 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a human body key point detection method based on a dual-attention mechanism, comprising the following steps: obtaining a human body key point detection data set comprising a training data set and a test data set; preprocessing the training and test data sets; building a human body key point detection network in which channel attention and spatial attention modules are added to the feature-extraction residual blocks; training the network on the preprocessed training data set for a specified number of epochs; evaluating and saving the trained network model; and testing the trained model on the test data set. Compared with existing methods, the proposed method detects human body key points, especially difficult key points, more accurately.
Description
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to a human body key point detection method based on deep learning.
Background
Human body key point detection underlies many computer vision tasks and plays a fundamental role in related fields such as behavior recognition, person tracking and gait recognition. It detects key points of the human body in an image (joint points such as wrists, knees, ankles and the face), and is important for describing human postures and predicting human behavior. Researchers have therefore attempted to solve this problem over the last decade with different approaches, from the early pictorial structures and graph models to later depth-map-based methods. Although these conventional methods made some progress, their accuracy was low and they were difficult to put into practical use. In 2014, DeepPose applied deep neural networks to the human key point detection problem for the first time, predicting human key points through cascaded convolutional neural networks. In 2016, with the rise of deep learning, a series of deep-learning-based algorithms appeared, such as Hourglass, CPM, OpenPose, G-RMI, RMPE, CPN and HRNet [1], and the detection accuracy of human key points improved continuously. Although these algorithms raised the overall accuracy of human key point detection, it still needs improvement in complex scenes, particularly for the detection of difficult key points.
In recent years, attention mechanisms have been used successfully in image processing, speech recognition and natural language processing. The attention mechanism in computer vision mimics a signal-processing mechanism specific to human vision: the eye rapidly scans the global image to find the region needing attention, then devotes more processing resources to that region to obtain the detailed information needed and suppress useless information. Current attention mechanisms mainly comprise channel attention (Channel Attention), which focuses on different features of a picture, and spatial attention (Spatial Attention), which focuses on different regions of the picture. A representative example is the channel attention module proposed by SENet [2], which improves the model's sensitivity to different channel features and thereby its performance. CBAM [3] and BAM [4] introduce both attention mechanisms at once; compared with SENet [2], which attends only to channels, they add attention to different regions of the picture and further improve overall network performance. Adding both channel attention and spatial attention to a human key point detection algorithm therefore helps the model assign different weights to each part of the input features, extract the more critical and important information, and make more accurate judgements.
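The channel and spatial attention computations described above can be sketched in a few lines of numpy. This is a minimal illustration, not the patented implementation: the shared MLP weights `w1`/`w2` are passed in as toy parameters, and the learned 7 × 7 convolution of spatial attention is replaced by a simple average over the two pooled maps for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))).
    feat: (C, H, W); w1: (C, C//r) and w2: (C//r, C) are the shared MLP weights."""
    avg = feat.mean(axis=(1, 2))                  # (C,) global average pooling
    mx = feat.max(axis=(1, 2))                    # (C,) global max pooling
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2    # shared two-layer perceptron
    return sigmoid(mlp(avg) + mlp(mx))            # (C,) per-channel weights

def spatial_attention(feat):
    """Ms(F) = sigmoid(conv7x7([AvgPool(F); MaxPool(F)])) along the channel axis.
    The learned 7x7 convolution is replaced here by a plain average of the
    two pooled maps, purely for illustration."""
    avg = feat.mean(axis=0)                       # (H, W) channel-wise average pool
    mx = feat.max(axis=0)                         # (H, W) channel-wise max pool
    stacked = np.stack([avg, mx])                 # concatenation along channels
    return sigmoid(stacked.mean(axis=0))          # (H, W) per-position weights
```

The per-channel and per-position weights are then multiplied back onto the feature map, which is how both CBAM and BAM reweight features.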
Reference documents:
1. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep High-Resolution Representation Learning for Human Pose Estimation. In: Proc. of Computer Vision and Pattern Recognition (CVPR). 2019.
2. J. Hu, L. Shen, and G. Sun. Squeeze-and-Excitation Networks. In: Proc. of Computer Vision and Pattern Recognition (CVPR). 2018.
3. S. Woo, J. Park, J. Y. Lee, and I. So Kweon. CBAM: Convolutional Block Attention Module. In: Proc. of European Conf. on Computer Vision (ECCV). 2018.
4. J. Park, S. Woo, J. Y. Lee, and I. So Kweon. BAM: Bottleneck Attention Module. In: Proc. of British Machine Vision Conference (BMVC). 2018.
Disclosure of Invention
Aiming at the problem that conventional human body key point detection methods have a high error rate on difficult key points in complex scenes, the invention designs a network structure based on a dual-attention mechanism to detect human body key points, mainly comprising the following steps:
step S1: acquiring a human body key point detection data set, which comprises a training data set and a test data set;
step S2: training a human body key point detection network;
step S21: preprocessing the training data set and the testing data set acquired in the step S1;
step S22: building a human body key point detection network, and adding a channel attention and space attention module in the residual block for extracting the characteristics;
step S23: performing model training on the training data set processed in step S21 using the network of step S22 for a specified number of epochs;
step S24: evaluating and storing the model trained in step S23;
step S3: the keypoint model trained in step S2 is tested on the test data set acquired in step S1.
Aiming at the problem that conventional human body key point detection methods have a high error rate, especially on difficult key points in complex scenes, the human body key point detection method based on a dual-attention mechanism provided by the invention adds a channel attention module and a spatial attention module both to the feature-extracting BasicBlock of the high-resolution network (HRNet) and to the parallel-improved Bottleneck residual block, improving the accuracy of key point detection. Compared with existing methods, the proposed method detects human body key points, especially difficult key points, more accurately.
Drawings
FIG. 1 is a flow chart of the human body key point detection method based on a dual-attention mechanism.
Fig. 2 is a block diagram of a high resolution network used in the present invention.
Fig. 3 is a structural diagram of the parallel-modified Bottleneck block with the attention modules added, according to the present invention.
FIG. 4 is a structural diagram of the BasicBlock with the attention modules added, according to the present invention.
Detailed Description
Fig. 1 shows a flowchart of the human body key point detection method based on a dual-attention mechanism provided by the invention. The method mainly comprises: obtaining a human body key point detection data set comprising a training data set and a test data set; preprocessing both; building a human body key point detection network in which channel attention and spatial attention modules are added to the feature-extraction residual blocks; training the network on the preprocessed training data for a specified number of epochs; evaluating and saving the trained model; and testing the model on the test data set. The specific implementation details of each step are as follows:
step S1: acquiring a human body key point detection data set comprising a training data set and a test data set. The data consist of pictures containing different human postures and annotation files with the ground-truth joint positions. The public data sets MPII and COCO2017 are used. The MPII human pose data set contains 25k pictures and 40k human instances with 16 key points; its training and test sets contain 28k and 12k instances respectively. The COCO2017 keypoint detection data set contains 200k pictures and 250k human instances with 17 key points; the training set train2017 contains 58k pictures and 150k human instances, and the validation set val2017 and the test set test2017 contain 5k and 20k pictures respectively.
Step S2: training a human body key point detection network, wherein the specific mode comprises the following steps of S21, S22, S23 and S24:
step S21, preprocessing the training data set and the test data set obtained in the step S1, wherein the human body key point detection network only detects key points of a human body, for detecting the human body in the picture, the MPII data set uses a provided human body frame, the COCO data set uses a faster-RCNN to carry out human body detection to obtain a human body detection frame, the height-width ratio of the human body detection frames of the MPII and COCO training data sets is fixed to be 4:3, then the human body detection frame is cut out from the picture, the sizes of the human body detection frame are respectively adjusted to be 256 × 256 and 256 × 192, meanwhile, data enhancement is carried out to the data, the data enhancement comprises random rotation (-45 degrees, 45 degrees), random scaling (0.65, 1.35) and turning.
Step S22: building the human body key point detection network and adding channel attention and spatial attention modules to the feature-extraction residual blocks. The network uses the high-resolution network (HRNet) as its overall framework: it connects high-to-low-resolution subnetworks in parallel and produces a reliable high-resolution representation by repeatedly fusing them. The HRNet structure is divided into five stages, as shown in Fig. 2.
In the first stage, the input image passes through two 3 × 3 convolutions with stride 2, so the height (H) and width (W) become H/4 and W/4 and the number of channels becomes 64. Feature extraction is then performed by 4 improved Bottleneck residual blocks: the Bottleneck residual block of ResNet is improved in parallel by connecting the 3 × 3 convolution layer of ResNet in parallel with the 3 × 3, group-32 convolution layer of ResNeXt, after which the channel attention and spatial attention modules are added. Channel attention compresses the convolved feature map over the spatial dimensions using max pooling and average pooling, giving two spatial context descriptions Fc_avg and Fc_max; these are passed through a shared network of multi-layer perceptrons, Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))), yielding the channel attention feature map Mc(F). Spatial attention then applies max pooling and average pooling along the channel dimension of the channel-refined feature map, giving two feature descriptions Fs_avg and Fs_max; these are merged by a concatenation operation, and a convolution produces the spatial attention feature map Ms(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])), where f^(7×7) denotes a 7 × 7 convolution. The improved module is called PRAB (Parallel Residual Attention Block); its structure is shown in Fig. 3. After the first stage the feature map size is [H/4, W/4] and the number of channels becomes 256.
The second stage first changes the number of channels of the feature map to 32 by a 3 × 3 convolution with stride 1, and generates a low-resolution branch from the previous stage by a 3 × 3 convolution with stride 2, giving a feature map of size [H/8, W/8] with the channels changed from 256 to 64. The two branches are then processed by 4 BasicBlocks each, with channel and spatial attention added to the BasicBlocks as in the first stage (structure shown in Fig. 4). Repeated multi-scale fusion follows: the high-resolution branch is downsampled to the low resolution by a 3 × 3 convolution with stride 2 and added to it to form the low-resolution output, while the low-resolution branch is upsampled to the high resolution by nearest-neighbour interpolation and added to it to form the high-resolution output. The two branches finally output feature maps of sizes [H/4, W/4, 32] and [H/8, W/8, 64].
The third stage takes the branches obtained by the multi-scale fusion of the second stage as input and generates a new low-resolution branch [H/16, W/16, 128] from the [H/8, W/8, 64] branch. Each branch is processed by 4 attention-augmented BasicBlocks and, as in the second stage, multi-scale fusion yields 3 branches: [H/4, W/4, 32], [H/8, W/8, 64], [H/16, W/16, 128]. The fourth stage proceeds in the same way and yields 4 branches: [H/4, W/4, 32], [H/8, W/8, 64], [H/16, W/16, 128], [H/32, W/32, 256]. The fifth stage upsamples the 3 low-resolution branches, merges them with the [H/4, W/4, 32] branch, and applies a 1 × 1 convolution to obtain the final output: a heat map for the key points. For the 17 key points of the COCO data set, the final heat map has size [H/4, W/4, 17].
Step S23: performing model training on the training data set processed in step S21 with the network of step S22 for a specified number of epochs. Specifically, an Adam optimizer is used with an initial learning rate of 1e-3; the learning rate is reduced to 1e-4 at epoch 170 and to 1e-5 at epoch 200, and training stops at epoch 310.
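The step schedule of step S23 can be written as a small function. One assumption is made: the drops are taken as inclusive at epochs 170 and 200 (the text does not specify boundary behaviour).

```python
def learning_rate(epoch):
    """Step learning-rate schedule described in step S23: start at 1e-3,
    drop to 1e-4 at epoch 170 and to 1e-5 at epoch 200; training itself
    stops at epoch 310 (boundary inclusivity is an assumption here)."""
    if epoch < 170:
        return 1e-3
    if epoch < 200:
        return 1e-4
    return 1e-5
```

Such a function plugs directly into most training loops as a per-epoch lookup before the optimizer step.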
Step S24: evaluating and saving the model trained in step S23. Specifically, the MPII data set is evaluated with PCKh (head-normalized Probability of Correct Keypoints) and the COCO data set with OKS (Object Keypoint Similarity); the final model is saved after the network has trained for the specified number of epochs.
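The OKS metric mentioned in step S24 has a standard form in COCO evaluation, which can be sketched as follows. The per-keypoint constants `k` are published by COCO; here they are passed in as a parameter rather than hard-coded.

```python
import numpy as np

def oks(pred, gt, vis, scale, k):
    """Object Keypoint Similarity as used for COCO keypoint evaluation.
    pred, gt: (N, 2) predicted / ground-truth keypoint coordinates,
    vis: (N,) visibility flags (> 0 means labeled),
    scale: object scale s, k: (N,) per-keypoint constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)            # squared Euclidean distances
    e = np.exp(-d2 / (2.0 * scale ** 2 * k ** 2))  # per-keypoint similarity
    labeled = vis > 0
    return float(e[labeled].mean()) if labeled.any() else 0.0
```

A perfect prediction scores 1.0, and the score decays with distance relative to the object scale, which is what makes OKS robust across person sizes.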
Step S3: testing the key point model trained in step S2 on the test data set acquired in step S1. Specifically, the test data are fed into the trained model, the heat maps of the original image and the horizontally flipped image are averaged to obtain the final predicted heat map, and the predicted position of each key point is taken at a 1/4 offset from the highest-response location toward the second-highest one.
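The quarter-offset decoding described in step S3 can be sketched for a single keypoint heatmap. This reads the text literally (shift 1/4 of a pixel from the maximum toward the second maximum); the function name is illustrative.

```python
import numpy as np

def decode_keypoint(heatmap):
    """Decode one keypoint from a 2-D heatmap: take the arg-max position and
    shift it by 1/4 pixel toward the second-highest position, as described
    for the test phase. Returns (x, y) in heatmap coordinates."""
    order = np.argsort(heatmap.ravel())
    y1, x1 = np.unravel_index(order[-1], heatmap.shape)  # highest response
    y2, x2 = np.unravel_index(order[-2], heatmap.shape)  # second highest
    d = np.array([x2 - x1, y2 - y1], dtype=float)
    n = np.linalg.norm(d)
    if n > 0:
        d /= n                                           # unit direction
    return np.array([x1, y1], dtype=float) + 0.25 * d
```

The resulting sub-pixel coordinate would then be scaled by 4 (the heat map is at 1/4 resolution) and mapped back through the crop transform to image coordinates.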
Aiming at the problem that conventional human body key point detection methods have a high error rate, especially on difficult key points in complex scenes, the human body key point detection method based on a dual-attention mechanism provided by the invention adds a channel attention module and a spatial attention module both to the feature-extracting BasicBlock of the high-resolution network (HRNet) and to the parallel-improved Bottleneck residual block, improving the accuracy of key point detection. Compared with existing methods, the proposed method detects human body key points, especially difficult key points, more accurately.
Claims (1)
1. A human body key point detection method based on a dual-attention mechanism, characterized by comprising the following steps:
step S1: acquiring a human body key point detection data set comprising a training data set and a test data set, the data consisting of pictures containing different human postures and annotation files with the ground-truth joint positions; the public data sets MPII and COCO2017 are used, wherein the MPII human pose data set contains 25k pictures and 40k human instances with 16 key points, its training and test sets containing 28k and 12k instances respectively, and the COCO2017 keypoint detection data set contains 200k pictures and 250k human instances with 17 key points, wherein the training set train2017 contains 58k pictures and 150k human instances, and the validation set val2017 and the test set test2017 contain 5k and 20k pictures respectively;
step S2: training a human body key point detection network, wherein the specific mode comprises the following steps of S21, S22, S23 and S24:
step S21: preprocessing the training and test data sets acquired in step S1, specifically: the human body key point detection network only detects the key points of a single human body, so for detecting human bodies in a picture the MPII data set uses the provided person boxes while the COCO data set uses Faster R-CNN to obtain person detection boxes; the height-to-width ratio of the boxes of the MPII and COCO training data is fixed to 4:3, the boxes are cropped from the picture and resized to 256 × 256 and 256 × 192 respectively, and data enhancement is applied, including random rotation (−45°, 45°), random scaling (0.65, 1.35) and flipping;
step S22: building the human body key point detection network with channel attention and spatial attention modules added to the feature-extraction residual blocks, specifically: the network uses the high-resolution network (HRNet) as its overall framework, connecting high-to-low-resolution subnetworks in parallel and producing a reliable high-resolution representation by repeatedly fusing them, the HRNet structure being divided into five stages; in the first stage, the input image passes through two 3 × 3 convolutions with stride 2, so the height (H) and width (W) become H/4 and W/4 and the number of channels becomes 64, and feature extraction is then performed by 4 improved Bottleneck residual blocks, namely the Bottleneck residual block of ResNet is improved in parallel by connecting the 3 × 3 convolution layer of ResNet in parallel with the 3 × 3, group-32 convolution layer of ResNeXt, after which channel attention and spatial attention are added; channel attention compresses the convolved feature map over the spatial dimensions using max pooling and average pooling, giving two spatial context descriptions Fc_avg and Fc_max, which are passed through a shared network of multi-layer perceptrons, Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))), yielding the channel attention feature map Mc(F); spatial attention then applies max pooling and average pooling along the channel dimension of the channel-refined feature map, giving two feature descriptions Fs_avg and Fs_max, which are merged by a concatenation operation, and a convolution produces the spatial attention feature map Ms(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])), where f^(7×7) denotes a 7 × 7 convolution; the improved module is called the Parallel Residual Attention Block (PRAB); after the first stage the feature map size is [H/4, W/4] and the number of channels becomes 256; the second stage first changes the number of channels of the feature map to 32 by a 3 × 3 convolution with stride 1, and generates a low-resolution branch from the previous stage by a 3 × 3 convolution with stride 2, giving a feature map of size [H/8, W/8] with the channels changed from 256 to 64; the two branches are then processed by 4 BasicBlocks each, with channel and spatial attention added to the BasicBlocks as in the first stage, followed by repeated multi-scale fusion: the high-resolution branch is downsampled to the low resolution by a 3 × 3 convolution with stride 2 and added to it to form the low-resolution output, while the low-resolution branch is upsampled to the high resolution by nearest-neighbour interpolation and added to it to form the high-resolution output, the two branches finally outputting feature maps of sizes [H/4, W/4, 32] and [H/8, W/8, 64]; the third stage takes the branches obtained by the multi-scale fusion of the second stage as input and generates a new low-resolution branch [H/16, W/16, 128] from the [H/8, W/8, 64] branch, each branch is processed by 4 attention-augmented BasicBlocks and, as in the second stage, multi-scale fusion yields 3 branches [H/4, W/4, 32], [H/8, W/8, 64], [H/16, W/16, 128]; the fourth stage proceeds in the same way and yields 4 branches [H/4, W/4, 32], [H/8, W/8, 64], [H/16, W/16, 128], [H/32, W/32, 256]; the fifth stage upsamples the 3 low-resolution branches, merges them with the [H/4, W/4, 32] branch, and applies a 1 × 1 convolution to obtain the final output, namely a heat map of the key points; for the 17 key points of the COCO data set, the final heat map has size [H/4, W/4, 17];
step S23: performing model training on the training data set processed in step S21 with the network of step S22 for a specified number of epochs, specifically using an Adam optimizer with an initial learning rate of 1e-3, reducing the learning rate to 1e-4 at epoch 170 and to 1e-5 at epoch 200, and stopping training at epoch 310;
step S24: evaluating and saving the model trained in step S23, specifically evaluating the MPII data set with PCKh (head-normalized Probability of Correct Keypoints) and the COCO data set with OKS (Object Keypoint Similarity), and saving the final model after training for the specified number of epochs;
step S3: testing the key point model trained in step S2 on the test data set acquired in step S1, specifically feeding the test data into the trained model, averaging the heat maps of the original image and the horizontally flipped image to obtain the final predicted heat map, and taking the predicted position of each key point at a 1/4 offset from the highest-response location toward the second-highest one.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010284037.7A CN111476184B (en) | 2020-04-13 | 2020-04-13 | Human body key point detection method based on double-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010284037.7A CN111476184B (en) | 2020-04-13 | 2020-04-13 | Human body key point detection method based on double-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111476184A true CN111476184A (en) | 2020-07-31 |
CN111476184B CN111476184B (en) | 2023-12-22 |
Family
ID=71752170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010284037.7A Active CN111476184B (en) | 2020-04-13 | 2020-04-13 | Human body key point detection method based on double-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111476184B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180349359A1 (en) * | 2017-05-19 | 2018-12-06 | salesforce.com,inc. | Natural language processing using a neural network |
CN110678881A (en) * | 2017-05-19 | 2020-01-10 | 易享信息技术有限公司 | Natural language processing using context-specific word vectors |
CN110276316A (en) * | 2019-06-26 | 2019-09-24 | 电子科技大学 | A kind of human body critical point detection method based on deep learning |
Non-Patent Citations (1)
Title |
---|
WANG Ziniu; WANG Hongjie; GAO Jianling: "Text Classification Based on Semantic Reinforcement and Feature Fusion", Software * |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112084911A (en) * | 2020-08-28 | 2020-12-15 | 安徽清新互联信息科技有限公司 | Human face feature point positioning method and system based on global attention |
CN112084911B (en) * | 2020-08-28 | 2023-03-07 | 安徽清新互联信息科技有限公司 | Human face feature point positioning method and system based on global attention |
CN112084928A (en) * | 2020-09-04 | 2020-12-15 | 东南大学 | Road traffic accident detection method based on visual attention mechanism and ConvLSTM network |
CN112149558A (en) * | 2020-09-22 | 2020-12-29 | 驭势科技(南京)有限公司 | Image processing method, network and electronic equipment for key point detection |
CN112149613A (en) * | 2020-10-12 | 2020-12-29 | 萱闱(北京)生物科技有限公司 | Motion estimation evaluation method based on improved LSTM model |
CN112270213A (en) * | 2020-10-12 | 2021-01-26 | 萱闱(北京)生物科技有限公司 | Improved HRnet based on attention mechanism |
CN112149613B (en) * | 2020-10-12 | 2024-01-05 | 萱闱(北京)生物科技有限公司 | Action pre-estimation evaluation method based on improved LSTM model |
CN114373091A (en) * | 2020-10-14 | 2022-04-19 | 南京工业大学 | Gait recognition method based on deep learning fusion SVM |
CN112347865A (en) * | 2020-10-21 | 2021-02-09 | 四川长虹电器股份有限公司 | Bill correction method based on key point detection |
CN112541409B (en) * | 2020-11-30 | 2021-09-14 | 北京建筑大学 | Attention-integrated residual network expression recognition method |
CN112541409A (en) * | 2020-11-30 | 2021-03-23 | 北京建筑大学 | Attention-integrated residual network expression recognition method |
CN112712015B (en) * | 2020-12-28 | 2024-05-28 | 康佳集团股份有限公司 | Human body key point identification method and device, intelligent terminal and storage medium |
CN112712015A (en) * | 2020-12-28 | 2021-04-27 | 康佳集团股份有限公司 | Human body key point identification method and device, intelligent terminal and storage medium |
CN113011304A (en) * | 2021-03-12 | 2021-06-22 | 山东大学 | Human body posture estimation method and system based on attention multi-resolution network |
CN113034545A (en) * | 2021-03-26 | 2021-06-25 | 河海大学 | Vehicle tracking method based on CenterNet multi-target tracking algorithm |
CN113469193A (en) * | 2021-06-16 | 2021-10-01 | 中国科学院合肥物质科学研究院 | Low-power-consumption pest image identification method based on addition multiplication mixed convolution |
CN113469193B (en) * | 2021-06-16 | 2023-08-22 | 中国科学院合肥物质科学研究院 | Low-power consumption pest image identification method based on addition multiplication mixed convolution |
CN113420641A (en) * | 2021-06-21 | 2021-09-21 | 梅卡曼德(北京)机器人科技有限公司 | Image data processing method, image data processing device, electronic equipment and storage medium |
CN113420641B (en) * | 2021-06-21 | 2024-06-14 | 梅卡曼德(北京)机器人科技有限公司 | Image data processing method, device, electronic equipment and storage medium |
CN113920535B (en) * | 2021-10-12 | 2023-11-17 | 广东电网有限责任公司广州供电局 | Electronic region detection method based on YOLOv5 |
CN113920535A (en) * | 2021-10-12 | 2022-01-11 | 广东电网有限责任公司广州供电局 | Electronic region detection method based on YOLOv5 |
CN113947814A (en) * | 2021-10-28 | 2022-01-18 | 山东大学 | Cross-visual angle gait recognition method based on space-time information enhancement and multi-scale saliency feature extraction |
CN113947814B (en) * | 2021-10-28 | 2024-05-28 | 山东大学 | Cross-view gait recognition method based on space-time information enhancement and multi-scale saliency feature extraction |
CN114373226A (en) * | 2021-12-31 | 2022-04-19 | 华南理工大学 | Human body posture estimation method based on improved HRNet network in operating room scene |
CN114373226B (en) * | 2021-12-31 | 2024-09-06 | 华南理工大学 | Human body posture estimation method based on improved HRNet network in operating room scene |
CN115019338B (en) * | 2022-04-27 | 2023-09-22 | 淮阴工学院 | Multi-person gesture estimation method and system based on GAMHR-Net |
CN115019338A (en) * | 2022-04-27 | 2022-09-06 | 淮阴工学院 | Multi-person posture estimation method and system based on GAMIHR-Net |
CN114998453A (en) * | 2022-08-08 | 2022-09-02 | 国网浙江省电力有限公司宁波供电公司 | Stereo matching model based on high-scale unit and application method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111476184B (en) | Human body key point detection method based on double-attention mechanism | |
CN110782462B (en) | Semantic segmentation method based on double-flow feature fusion | |
CN110188768B (en) | Real-time image semantic segmentation method and system | |
CN108062754B (en) | Segmentation and identification method and device based on dense network image | |
CN111291739B (en) | Face detection and image detection neural network training method, device and equipment | |
CN110929736A (en) | Multi-feature cascade RGB-D significance target detection method | |
CN112884073B (en) | Image rain removing method, system, terminal and storage medium | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN113989283B (en) | 3D human body posture estimation method and device, electronic equipment and storage medium | |
CN114529982A (en) | Lightweight human body posture estimation method and system based on stream attention | |
CN116030537A (en) | Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution | |
CN108229432A (en) | Face calibration method and device | |
CN114255514A (en) | Human body tracking system and method based on Transformer and camera device | |
CN113066089A (en) | Real-time image semantic segmentation network based on attention guide mechanism | |
CN116092190A (en) | Human body posture estimation method based on self-attention high-resolution network | |
CN115588116A (en) | Pedestrian action identification method based on double-channel attention mechanism | |
CN103208109A (en) | Local restriction iteration neighborhood embedding-based face hallucination method | |
Zhou et al. | Towards locality similarity preserving to 3D human pose estimation | |
CN117115855A (en) | Human body posture estimation method and system based on multi-scale transducer learning rich visual features | |
CN111401335A (en) | Key point detection method and device and storage medium | |
WO2022252519A1 (en) | Image processing method and apparatus, terminal, medium, and program | |
WO2021176985A1 (en) | Signal processing device, signal processing method, and program | |
Huo et al. | Deep high-resolution network with double attention residual blocks for human pose estimation | |
CN112528899B (en) | Image salient object detection method and system based on implicit depth information recovery | |
CN113744255A (en) | Skin mirror image segmentation method, segmentation network and segmentation network construction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||