CN112949498A

CN112949498A - Target key point detection method based on heterogeneous convolutional neural network

Info

Publication number: CN112949498A
Application number: CN202110242260.XA
Authority: CN
Inventors: 何宁; 尹晓杰; 于海港
Original assignee: Beijing Union University
Current assignee: Beijing Union University
Priority date: 2021-03-04
Filing date: 2021-03-04
Publication date: 2021-06-11
Anticipated expiration: 2041-03-04
Also published as: CN112949498B

Abstract

The invention discloses a target key point detection method based on a heterogeneous convolutional neural network. ResNet-50 is used as the backbone network, and the standard 3 × 3 convolution in each bottleneck block is replaced by a heterogeneous convolution. An atrous spatial pyramid pooling (ASPP) layer is added after the last layer of the backbone network. Finally, feature pyramid fusion is performed on the feature maps with resolutions of 8 × 8, 16 × 16, 32 × 32, and 64 × 64; a feature map with a resolution of 64 × 64 and 16 channels is output; a detection heat map is generated using a Gaussian kernel function; and the pose estimation result is output. The model uses an atrous spatial pyramid pooling layer and a feature pyramid fusion module. A novel lightweight target key point detection algorithm is thus constructed, which can perform multi-target pose estimation on pictures of any size.

Description

Target key point detection method based on heterogeneous convolutional neural network

Technical Field

The invention belongs to the technical field of computer vision and digital image processing, and particularly relates to a target key point detection method based on a heterogeneous convolutional neural network.

Background

Human pose estimation is a fundamental problem in human behavior recognition, and the resulting skeleton structure provides high-level semantics for human action recognition. Human pose estimation itself has many real-world applications, for example: standardizing sports movements, posture correction, virtual reality games, video surveillance, and robot motion control.

Existing human pose estimation methods fall into two types: bottom-up and top-down. The present method adopts the top-down approach, which achieves relatively high recall and accuracy. However, maximizing model accuracy has increased the parameter count and the floating-point operation count of these models. Human pose estimation is deployed in many practical applications, some of them on mobile phones and microcomputers. Because the storage capacity and computing power of such devices are limited, reducing the parameter count and floating-point operation count of the model is an important requirement for improving human pose estimation.

To address the large parameter count and floating-point operation count of such models, the method combines standard convolution and grouped convolution to propose a heterogeneous convolution, which reduces the parameters and floating-point operations of the model while preserving accuracy and the receptive field.

Disclosure of Invention

The invention relates to a target key point detection algorithm based on a heterogeneous convolutional neural network, in which a heterogeneous convolution is proposed on the basis of standard convolution and grouped convolution. The network model of the invention is divided into three parts: a backbone network, an atrous spatial pyramid pooling (ASPP) part, and a feature pyramid module. ResNet-50 is used as the backbone network, and the standard 3 × 3 convolution in each bottleneck block is replaced by a heterogeneous convolution. An atrous spatial pyramid pooling layer is added after the last layer of the backbone network. Finally, feature pyramid fusion is performed on the feature maps with resolutions of 8 × 8, 16 × 16, 32 × 32, and 64 × 64; a feature map with a resolution of 64 × 64 and 16 channels is output; a detection heat map is generated using a Gaussian kernel function; and the pose estimation result is output.
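The heat map generation with a Gaussian kernel function can be illustrated with a minimal sketch. It assumes an unnormalized 2D Gaussian centered on the key point on a 64 × 64 grid; the function names `gaussian_heatmap` and `argmax_2d` and the standard deviation `sigma = 2.0` are illustrative assumptions that the text does not specify.

```python
import math


def gaussian_heatmap(cx, cy, size=64, sigma=2.0):
    """Return a size x size heat map (list of rows) peaking at (cx, cy).

    sigma is an assumed value; the source text does not specify it.
    """
    heatmap = []
    for y in range(size):
        row = []
        for x in range(size):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap


def argmax_2d(heatmap):
    """Return (x, y) of the largest value, i.e. the decoded key point."""
    best = (0, 0)
    best_val = heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_val = value
                best = (x, y)
    return best


hm = gaussian_heatmap(20, 40)
print(argmax_2d(hm))  # (20, 40)
```

Decoding by taking the location of the maximum is the simplest choice; it is used here only to show that the heat map encodes the key point position.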

The invention mainly proposes a heterogeneous convolution that combines standard convolution and grouped convolution, and uses an atrous spatial pyramid pooling layer and a feature pyramid fusion module in the model. A new lightweight target key point detection algorithm is constructed, which can perform multi-target pose estimation (human key point detection) on pictures of any size. The method mainly comprises the following steps:

Step 1: A picture is input, the targets in the picture are detected using a trained Faster R-CNN target detector, and the coordinates of each target box are obtained and stored in a data structure.

Step 2: Using the coordinates obtained by the target detector in step 1, each detected target box is widened by 20% and cropped out separately.

Step 3: The single target crop obtained in step 2 is input into the heterogeneous convolutional neural network for target key point detection.

Step 3-1: The single target image is resized to a resolution of 256 × 256.

Step 3-2: Feature extraction is performed using a ResNet50 network with a dilation rate of 2 as the backbone, so that the resolution of the feature map is 8 × 8. In this network, a grouped convolution with 4 groups replaces the 3 × 3 standard convolution with a stride of 1, and a grouped convolution with 16 groups replaces the 1 × 1 standard convolution with a stride of 1, so that the parameter count and floating-point operation count of the model are reduced while a large receptive field is retained.

The number of parameters of the conventional (standard) convolution is:

S_cp = N × C × K × K

The number of parameters of the grouped convolution is:

G_cp = (N × C × K × K) / G

The number of parameters of the heterogeneous convolution is denoted SH_p.

Here N is the number of channels of the input feature map, C is the number of channels of the output feature map, G is the number of groups of the grouped convolution, and K, K₁, and K₂ are convolution kernel sizes. The three parameter counts satisfy:

G_cp ≤ SH_p ≤ S_cp

Compared with grouped convolution, heterogeneous convolution effectively integrates the channels of the feature map; compared with standard convolution, heterogeneous convolution improves the receptive field of the model.
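The parameter counts of standard and grouped convolution can be checked numerically with a short sketch. It assumes bias-free convolutions and channel counts divisible by the number of groups; the function names and the example channel count of 64 are illustrative assumptions. The heterogeneous count SH_p is not computed, because the sketch makes no assumption about how the two grouped branches are combined.

```python
def standard_conv_params(n_in, c_out, k):
    """Weights of a bias-free standard convolution: N * C * K * K."""
    return n_in * c_out * k * k


def grouped_conv_params(n_in, c_out, k, groups):
    """Weights of a bias-free grouped convolution: N * C * K * K / G."""
    if n_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps n_in/groups input channels to c_out/groups outputs.
    return (n_in // groups) * (c_out // groups) * k * k * groups


# Example with 64 input and 64 output channels (assumed values).
print(standard_conv_params(64, 64, 3))           # 36864
print(grouped_conv_params(64, 64, 3, groups=4))  # 9216
print(grouped_conv_params(64, 64, 1, groups=16)) # 256
```

With these assumed sizes, the 3 × 3 grouped convolution with 4 groups has one quarter of the weights of the standard 3 × 3 convolution.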

Step 3-3: Atrous spatial pyramid pooling is applied to the 8 × 8 feature map obtained by convolution in step 3-2.

Step 3-4: The feature map obtained in step 3-3 is upsampled three times to obtain a feature map with a resolution of 64 × 64.

Step 3-4: Using skip connections, the 16 × 16, 32 × 32, and 64 × 64 feature maps obtained by the backbone network in step 3-2 are concatenated with the feature maps of the corresponding resolutions obtained by upsampling the output of step 3-3.

Step 3-5: A 1 × 1 convolution adjusts the number of channels of the 64 × 64 feature map obtained in step 3-4 to the number of key points in the dataset, thereby outputting the coordinates of the corresponding key points.

During training, the network is optimized iteratively using stochastic gradient descent. The loss function used is the mean squared error loss function:

where m is the number of key points, y_i is the coordinates of the annotated ground-truth key point, which is compared with the coordinates of the key point predicted by the model, n is the number of training samples in each batch, and i is the index of the current key point.
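A mean squared error over key point coordinates can be sketched as follows. The normalization is an assumption: this sketch averages the squared Euclidean distance over all n samples in the batch and all m key points, which may differ from the exact scaling used in the original formula. The function name `mse_keypoint_loss` is hypothetical.

```python
def mse_keypoint_loss(pred, target):
    """Mean squared error between predicted and ground-truth key points.

    pred and target are lists of n samples, each a list of m (x, y) pairs.
    Assumed normalization: divide by n * m.
    """
    n = len(target)
    m = len(target[0])
    total = 0.0
    for pred_sample, target_sample in zip(pred, target):
        for (px, py), (tx, ty) in zip(pred_sample, target_sample):
            total += (px - tx) ** 2 + (py - ty) ** 2
    return total / (n * m)


target = [[(0.0, 0.0), (1.0, 1.0)]]
pred = [[(0.0, 0.0), (3.0, 1.0)]]
print(mse_keypoint_loss(pred, target))  # 2.0
```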

Step 4: The key points detected for each single human target are mapped back to the picture from step 1, thereby obtaining the multi-person pose estimation result.
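The box widening of step 2 and the mapping of step 4 can be sketched together. The sketch assumes that "widening by 20%" means expanding both the width and the height of the box about its center and clipping to the image bounds, and that key points are expressed in the 256 × 256 coordinate frame of the resized crop. The function names `widen_box` and `to_image_coords` are hypothetical.

```python
def widen_box(x1, y1, x2, y2, img_w, img_h, ratio=0.2):
    """Expand a box about its center by `ratio`, clipped to the image.

    Assumption: both width and height grow by the same ratio.
    """
    dx = (x2 - x1) * ratio / 2.0
    dy = (y2 - y1) * ratio / 2.0
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(float(img_w), x2 + dx), min(float(img_h), y2 + dy))


def to_image_coords(keypoints, box, crop_size=256):
    """Map (x, y) key points from the resized crop back to the picture."""
    x1, y1, x2, y2 = box
    scale_x = (x2 - x1) / crop_size
    scale_y = (y2 - y1) / crop_size
    return [(x1 + kx * scale_x, y1 + ky * scale_y) for kx, ky in keypoints]


# A detected box in a 640 x 480 picture (assumed values).
box = widen_box(100, 100, 200, 300, img_w=640, img_h=480)
print(box)
# The center of the crop maps to the center of the widened box.
print(to_image_coords([(128, 128)], box))
```

Because the crop is resized independently in each direction, separate horizontal and vertical scale factors are used when mapping back.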

The invention provides a heterogeneous convolution based on standard convolution and grouped convolution. Compared with standard convolution, it significantly reduces the number of floating-point operations while retaining a receptive field of the same size. Compared with grouped convolution, heterogeneous convolution effectively integrates the convolution channels and improves the accuracy of the model. The proposed method is verified on the MPII dataset. The experimental results show that, using a backbone network whose standard convolutions are replaced by heterogeneous convolutions and adding an atrous spatial pyramid pooling layer and a feature pyramid fusion module, the accuracy of the method is 1.2% higher than that of the original ResNet-50 method and the floating-point operation count is reduced by 72.18%.

Drawings

FIG. 1 is a diagram of the heterogeneous convolutional neural network model based on ResNet50.

FIG. 2 is a diagram of grouped convolution, standard convolution, and heterogeneous convolution.

FIG. 3 is a diagram of human pose estimation detection results.

Detailed Description

The superiority of the invention over other algorithms is examined below with reference to examples.

We trained the model on the training set of the MPII dataset and tested the validity of the algorithm on its validation set. The experimental environment was Ubuntu 18.04.3 LTS, an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz × 32, 64 GB of memory, and an RTX 2080 Ti graphics card; the software platform was CUDA 10.0.130, cuDNN 7.5, PyTorch 1.4, and Python 3.6.

During training, the batch size is set to 64 and the image resolution is set to 256 × 256. The initial learning rate is 0.001; the learning rate is reduced by 10% at the 170th and 200th epochs, and training runs for 210 epochs in total.

To verify the accuracy and efficiency of the improved algorithm, we compared pose estimation networks built on ResNet18 and ResNet50. The experimental results show that the method reduces model parameters and floating-point operations while maintaining accuracy. The results are shown in Table 1.

Table 1. Comparison of results on the MPII dataset

where the threshold coefficient is a constant and l is 60% of the diagonal of the head bounding box in the ground truth. PCKh@0.5 means that this threshold coefficient is set to 0.5.
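The PCKh@0.5 metric can be sketched as follows: a predicted key point counts as correct when its distance to the ground-truth key point is at most 0.5 × l, where l is 60% of the diagonal of the ground-truth head box. The function name `pckh` and the data layout are illustrative assumptions, and the sketch ignores key point visibility flags for simplicity.

```python
import math


def pckh(pred, gt, head_boxes, alpha=0.5, head_scale=0.6):
    """Fraction of key points within alpha * l of the ground truth.

    pred, gt: lists of samples, each a list of (x, y) key points.
    head_boxes: one (x1, y1, x2, y2) ground-truth head box per sample.
    l = head_scale * diagonal of the head box.
    """
    correct = 0
    total = 0
    for pred_s, gt_s, (hx1, hy1, hx2, hy2) in zip(pred, gt, head_boxes):
        l = head_scale * math.hypot(hx2 - hx1, hy2 - hy1)
        for (px, py), (gx, gy) in zip(pred_s, gt_s):
            total += 1
            if math.hypot(px - gx, py - gy) <= alpha * l:
                correct += 1
    return correct / total if total else 0.0


# Head box diagonal is 50, so l = 30 and the threshold is 15 pixels.
gt = [[(0.0, 0.0), (0.0, 0.0)]]
pred = [[(10.0, 0.0), (0.0, 20.0)]]
print(pckh(pred, gt, head_boxes=[(0, 0, 30, 40)]))  # 0.5
```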

Claims

1. A target key point detection method based on a heterogeneous convolutional neural network, characterized by comprising the following steps:

step 1: inputting a picture, detecting the targets in the picture using a trained Faster R-CNN target detector, obtaining the coordinates of each target box, and storing the coordinates in a data structure;

step 2: using the coordinates obtained by the target detector in step 1, widening each detected target box by 20%, and separately cropping out the widened target box;

step 3: inputting the single target crop obtained in step 2 into the heterogeneous convolutional neural network for target key point detection;

2. The target key point detection method based on a heterogeneous convolutional neural network according to claim 1, characterized in that step 3 specifically comprises the following steps: step 3-1: resizing the single target image to a resolution of 256 × 256;

step 3-2: performing feature extraction using a ResNet50 network with a dilation rate of 2 as the backbone, so that the resolution of the feature map is 8 × 8, replacing the 3 × 3 standard convolution with a stride of 1 by a grouped convolution with 4 groups, and replacing the 1 × 1 standard convolution with a stride of 1 by a grouped convolution with 16 groups;

the number of parameters of the conventional (standard) convolution is:

S_cp = N × C × K × K

the number of parameters of the grouped convolution is:

G_cp = (N × C × K × K) / G

the number of parameters of the heterogeneous convolution is denoted SH_p;

where N is the number of channels of the input feature map, C is the number of channels of the output feature map, G is the number of groups of the grouped convolution, and K, K₁, and K₂ are convolution kernel sizes;

G_cp ≤ SH_p ≤ S_cp

compared with grouped convolution, heterogeneous convolution effectively integrates the channels of the feature map, and compared with standard convolution, heterogeneous convolution improves the receptive field of the model;

step 3-3: applying atrous spatial pyramid pooling to the 8 × 8 feature map obtained by convolution in step 3-2;

step 3-4: upsampling the feature map obtained in step 3-3 three times to obtain a feature map with a resolution of 64 × 64;

step 3-4: using skip connections, concatenating the 16 × 16, 32 × 32, and 64 × 64 feature maps obtained by the backbone network in step 3-2 with the feature maps of the corresponding resolutions obtained by upsampling the output of step 3-3;

3. The target key point detection method based on a heterogeneous convolutional neural network according to claim 2, characterized in that the network is optimized iteratively using stochastic gradient descent during training; the loss function used is the mean squared error loss function: