CN112949498B

CN112949498B - Target key point detection method based on heterogeneous convolutional neural network

Info

Publication number: CN112949498B
Application number: CN202110242260.XA
Authority: CN
Inventors: 何宁; 尹晓杰; 于海港
Original assignee: Beijing Union University
Current assignee: Beijing Union University
Priority date: 2021-03-04
Filing date: 2021-03-04
Publication date: 2023-11-14
Anticipated expiration: 2041-03-04
Also published as: CN112949498A

Abstract

The invention discloses a target key point detection method based on a heterogeneous convolutional neural network. ResNet-50 is used as the skeleton of the backbone network, and the standard convolutions with a 3×3 kernel in each bottleneck block are replaced with heterogeneous convolutions. An atrous spatial pyramid pooling (ASPP) layer is added after the last layer of the backbone network. Finally, feature pyramid fusion is performed on the feature maps with resolutions of 8×8, 16×16, 32×32 and 64×64; a feature map with a resolution of 64×64 and 16 channels is output, a detection heat map is generated using a Gaussian kernel function, and the pose estimation result is output. The model thus combines an atrous spatial pyramid pooling layer with a feature pyramid fusion module, yielding a novel lightweight target key point detection algorithm that can perform multi-target pose estimation on pictures of any size.

Description

Target key point detection method based on heterogeneous convolutional neural network

Technical Field

The invention belongs to the technical field of computer vision and digital image processing, and particularly relates to a target key point detection method based on a convolutional neural network.

Background

Human body posture estimation is a fundamental problem of human body behavior recognition, and the obtained skeleton structure can provide high-level semantics for human body motion recognition. Human body pose estimation itself has many applications in reality, for example: sports action specification, body correction, virtual reality games, video monitoring, robot motion control, and the like.

Existing human pose estimation methods fall into two categories: bottom-up and top-down. The invention adopts the top-down approach, which achieves relatively high recall and accuracy. However, existing top-down methods pursue maximum accuracy at the cost of a larger parameter count and more floating-point operations. Human pose estimation is deployed in many practical applications, some of them on mobile phones and microcomputers; because such devices have limited storage and computing capacity, reducing the model's parameter count and floating-point operations is an important requirement for improving human pose estimation.

To address the large parameter count and floating-point operation count of existing models, the invention combines standard convolution and group convolution into a heterogeneous convolution, which reduces the model's parameters and floating-point operations while preserving accuracy and receptive field.

Disclosure of Invention

The invention discloses a target key point detection algorithm based on a heterogeneous convolutional neural network, characterized by a heterogeneous convolution built from standard convolution and group convolution. The network model is divided into three parts: a backbone network, an atrous spatial pyramid pooling part, and a feature pyramid module. ResNet-50 is used as the skeleton of the backbone network, and the standard convolutions with a 3×3 kernel in each bottleneck block are replaced with heterogeneous convolutions. An atrous spatial pyramid pooling layer is added after the last layer of the backbone network. Finally, feature pyramid fusion is performed on the feature maps with resolutions of 8×8, 16×16, 32×32 and 64×64; a feature map with a resolution of 64×64 and 16 channels is output, a detection heat map is generated using a Gaussian kernel function, and the pose estimation result is output.

The invention mainly provides a heterogeneous convolution that combines standard convolution and group convolution, and uses an atrous spatial pyramid pooling layer and a feature pyramid fusion module in the model. This yields a novel lightweight target key point detection algorithm that can perform multi-target pose estimation (human body key point detection) on pictures of any size. The method mainly comprises the following steps:
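As an illustration of how a detection heat map can be generated with a Gaussian kernel function, the following is a minimal sketch in plain Python. The function name `gaussian_heatmap` and the standard deviation `sigma=2.0` are assumptions made for illustration; the text does not state the kernel width or how the heat map is implemented.

```python
import math


def gaussian_heatmap(cx, cy, size=64, sigma=2.0):
    """Return a size x size heat map (list of rows) peaked at key point (cx, cy).

    sigma is an assumed value; the source text does not specify it.
    """
    heatmap = []
    for y in range(size):
        row = []
        for x in range(size):
            # Squared distance from this pixel to the key point.
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap


# One heat map per key point; here a single key point at x=20, y=40.
hm = gaussian_heatmap(cx=20, cy=40)
peak = max((v, x, y) for y, row in enumerate(hm) for x, v in enumerate(row))
print(peak)  # (value, x, y) of the strongest response
```

The 64×64 size matches the output resolution named in the text; one such map would be produced for each of the output channels.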

Step 1: Input a picture, detect the targets in it using a trained Faster R-CNN target detector, obtain the coordinates of each target box, and store them in a data structure.

Step 2: Widen each target box detected in step 1 by 20% and crop it out separately.

Step 3: Input each single-target crop from step 2 into the heterogeneous convolutional neural network to detect the target key points.

Step 3-1: Resize the single-target image to a resolution of 256×256.

Step 3-2: Feature extraction is performed using a ResNet50 network with a dilation rate of 2 as the skeleton network, giving a feature map with a resolution of 8×8. In this network, each 3×3 standard convolution with a stride of 1 is replaced by a group convolution with 4 groups, and each 1×1 standard convolution with a stride of 1 is replaced by a group convolution with 16 groups, which reduces the parameter count and floating-point operations of the model while keeping a large receptive field.

The parameter count of the standard convolution is:

S_cp = N × C × K × K

The parameter count of the group convolution, G_cp, is:

[equation missing from the source text]

The parameter count of the heterogeneous convolution, SH_p, is:

[equation missing from the source text]

where N is the number of channels of the input feature map, C is the number of channels of the output feature map, G is the number of groups of the group convolution, and K, K₁ and K₂ are all convolution kernel sizes.

G_cp ≤ SH_p ≤ S_cp

Compared with the grouping convolution, the heterogeneous convolution effectively integrates the channels of the feature map, and compared with the standard convolution, the heterogeneous convolution improves the receptive field of the model.
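The saving from grouping can be checked with a short calculation. The sketch below uses the standard-convolution count N × C × K × K given above and the commonly used count for a group convolution, in which each of the G groups connects N/G input channels to C/G output channels. The function names and the channel sizes in the example are hypothetical, bias terms are ignored, and the heterogeneous-convolution count is not computed because its expression is not given in the text.

```python
def standard_conv_params(n_in, c_out, k):
    """Weights of a standard K x K convolution: N * C * K * K (bias ignored)."""
    return n_in * c_out * k * k


def grouped_conv_params(n_in, c_out, k, groups):
    """Weights of a group convolution with `groups` groups (bias ignored).

    Each group maps n_in/groups input channels to c_out/groups output channels.
    """
    if n_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    return (n_in // groups) * (c_out // groups) * k * k * groups


# Hypothetical bottleneck-like layer sizes, chosen only for illustration.
print(standard_conv_params(64, 64, 3), grouped_conv_params(64, 64, 3, groups=4))
# 36864 9216
print(standard_conv_params(64, 256, 1), grouped_conv_params(64, 256, 1, groups=16))
# 16384 1024
```

With 4 groups the 3×3 layer needs a quarter of the weights, and with 16 groups the 1×1 layer needs one sixteenth, which is the lower bound G_cp in the inequality above.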

Step 3-3: The 8×8 feature map obtained in step 3-2 is fed into the atrous spatial pyramid pooling layer.

Step 3-4: The feature map obtained in step 3-3 is up-sampled three times to obtain a feature map with a resolution of 64×64.

Step 3-4: Using skip connections, the 16×16, 32×32 and 64×64 feature maps obtained from the backbone network in step 3-2 are concatenated with the feature maps of the corresponding resolutions obtained by up-sampling.

Step 3-5: A 1×1 convolution adjusts the number of channels of the 64×64 feature map obtained in step 3-4 to the number of key points in the data set, so that the coordinates of the corresponding key points are output.

During training, the network is optimized iteratively using stochastic gradient descent. The loss function used is the mean square error loss function:

[equation missing from the source text]

where m is the number of key points, y_i is the annotated ground-truth coordinate of the i-th key point, the corresponding predicted coordinate is output by the model, n is the number of training samples in each batch, and i is the index of the current key point.
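A mean square error over key point coordinates can be sketched as follows. This is one plausible form under stated assumptions, not the exact formula of the invention: the squared Euclidean distance between predicted and ground-truth coordinates is averaged over the n samples in a batch and the m key points per sample. The function name `mse_loss` and the normalisation by n × m are assumptions.

```python
def mse_loss(pred, target):
    """Mean squared error over a batch of key point coordinates.

    pred, target: lists of n samples, each a list of m (x, y) tuples.
    The 1 / (n * m) normalisation is an assumption.
    """
    n = len(target)
    m = len(target[0])
    total = 0.0
    for sample_pred, sample_true in zip(pred, target):
        for (px, py), (tx, ty) in zip(sample_pred, sample_true):
            total += (px - tx) ** 2 + (py - ty) ** 2
    return total / (n * m)


# One sample with two key points; only the first prediction is off.
target = [[(0.0, 0.0), (1.0, 1.0)]]
pred = [[(3.0, 4.0), (1.0, 1.0)]]
print(mse_loss(pred, target))
```

The same averaging can be applied to heat map values instead of coordinates if the network is trained against Gaussian heat maps.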

Step 4: The key points detected for each single target are mapped back to the picture from step 1, yielding the multi-person pose estimation result.
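Steps 2 and 4 can be sketched together: the detected box is widened by 20% before cropping, and key points found on the 64×64 output are later mapped back into the coordinates of the original picture. The sketch assumes that the 20% is added symmetrically around the box centre, that the widened box is clipped to the image, and that the crop is scaled linearly to the output grid; the function names are hypothetical.

```python
def expand_box(box, image_w, image_h, ratio=0.20):
    """Widen (x0, y0, x1, y1) by `ratio` around its centre, clipped to the image.

    Symmetric widening and clipping are assumptions.
    """
    x0, y0, x1, y1 = box
    dw = (x1 - x0) * ratio / 2.0
    dh = (y1 - y0) * ratio / 2.0
    return (max(0.0, x0 - dw), max(0.0, y0 - dh),
            min(float(image_w), x1 + dw), min(float(image_h), y1 + dh))


def heatmap_to_image(point, box, heatmap_size=64):
    """Map a key point from heat map coordinates back to the original picture."""
    hx, hy = point
    x0, y0, x1, y1 = box
    return (x0 + hx * (x1 - x0) / heatmap_size,
            y0 + hy * (y1 - y0) / heatmap_size)


# A detector box in a 640x480 picture and a key point found at (32, 16).
crop = expand_box((100, 50, 200, 250), image_w=640, image_h=480)
print(crop)
print(heatmap_to_image((32, 16), crop))
```

Repeating the mapping for every detected target and every key point gives the multi-person result on the original picture.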

The invention provides a heterogeneous convolution based on standard convolution and group convolution. Compared with standard convolution, it significantly reduces the number of floating-point operations while keeping a receptive field of the same size. Compared with group convolution, it effectively integrates information across channels and improves the accuracy of the model. The proposed method is verified on the MPII dataset. Experimental results show that the model whose backbone uses heterogeneous convolution in place of standard convolution, together with the atrous spatial pyramid pooling layer and the feature pyramid fusion module, improves accuracy by 1.2% and reduces floating-point operations by 72.18% compared with the original ResNet-50 method.

Drawings

FIG. 1 is a diagram of the heterogeneous convolutional neural network model based on ResNet50

FIG. 2 is a structural diagram of group convolution, standard convolution, and heterogeneous convolution

FIG. 3 is a diagram of human pose estimation detection results

Detailed Description

The advantages of the invention over other algorithms are demonstrated by the following example.

We train the model on the training set of the MPII dataset and use the MPII validation set to test the validity of the algorithm. The experimental environment is Ubuntu 18.04.3 LTS with an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz × 32, 64 GB of memory and an RTX 2080 Ti graphics card; the software platform is CUDA 10.0.130, cuDNN 7.5, PyTorch 1.4 and Python 3.6.

During training, the batch size is set to 64 and the image resolution is set to 256×256. The initial learning rate is 0.001 and is decreased by 10% at the 170th and 200th epochs; training runs for a total of 210 epochs.

To verify the accuracy and efficiency of the improved algorithm, we compare it against pose estimation networks based on ResNet18 and ResNet50. The experimental results show that the method reduces the model's parameter count and floating-point operations while maintaining accuracy. The results are shown in Table 1.
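The step-wise learning rate schedule described above can be sketched as a small function. The multiplicative `factor` is left as a required argument because the text only says the rate is decreased by 10%; the example call uses 0.9 as one literal reading of that wording, which is an assumption, as are the function name and the convention that the change takes effect from the milestone epoch onward.

```python
def learning_rate(epoch, factor, base_lr=0.001, milestones=(170, 200)):
    """Learning rate at `epoch`, multiplied by `factor` at each milestone passed.

    `factor` is an assumed interpretation of the decrease described in the text.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Inspect the rate before, between and after the two milestones.
for epoch in (0, 169, 170, 200, 209):
    print(epoch, learning_rate(epoch, factor=0.9))
```

Training for 210 epochs with this schedule passes through three constant-rate phases separated by the two milestones.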

Table 1. Comparison of results on the MPII dataset

The evaluation metric is PCKh@0.5. Here the threshold coefficient is a constant, equal to 0.5 for PCKh@0.5, and the reference length is defined as 60% of the diagonal of the ground-truth head bounding box.
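PCKh@0.5, the metric used for the MPII comparison, counts a predicted key point as correct when its distance to the ground truth is within a fraction of a head-size reference length. The sketch below follows the common definition, with the reference length taken as 60% of the head box diagonal and the fraction set to 0.5. The function name and data layout are hypothetical, and joint visibility flags are ignored for brevity.

```python
import math


def pckh(pred, gt, head_boxes, alpha=0.5):
    """Fraction of key points within alpha * (0.6 * head diagonal) of ground truth.

    pred, gt: lists of samples, each a list of (x, y) key points.
    head_boxes: one (x0, y0, x1, y1) ground-truth head box per sample.
    """
    correct = 0
    total = 0
    for sample_pred, sample_gt, (hx0, hy0, hx1, hy1) in zip(pred, gt, head_boxes):
        ref_len = 0.6 * math.hypot(hx1 - hx0, hy1 - hy0)
        for (px, py), (gx, gy) in zip(sample_pred, sample_gt):
            total += 1
            if math.hypot(px - gx, py - gy) <= alpha * ref_len:
                correct += 1
    return correct / total if total else 0.0


# Head diagonal 50 gives a reference length of 30 and a threshold of 15 pixels.
gt = [[(0.0, 0.0), (100.0, 100.0)]]
pred = [[(3.0, 4.0), (100.0, 120.0)]]
print(pckh(pred, gt, head_boxes=[(0.0, 0.0, 30.0, 40.0)]))
```

In the example the first key point is 5 pixels off and counts as correct, while the second is 20 pixels off and does not.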

Claims

1. A target key point detection method based on a heterogeneous convolutional neural network, characterized by comprising the following steps:

step 1: inputting a picture, detecting the targets in the picture using a trained Faster R-CNN target detector, acquiring the coordinates of each target box, and storing the coordinates in a data structure;

step 2: widening the target box by 20% based on the coordinates acquired by the target detector in step 1, and cropping out the widened target box separately;

step 3: inputting the single-target crop from step 2 into the heterogeneous convolutional neural network to detect the target key points;

step 4: mapping the detected single-target human body key points back to the picture in step 1, thereby obtaining a multi-person pose estimation result;

wherein step 3 specifically comprises the following steps: step 3-1: resizing the single-target image to a resolution of 256×256;

step 3-2: performing feature extraction using a ResNet50 network with a dilation rate of 2 as the skeleton network, so that the resolution of the feature map is 8×8, wherein each 3×3 standard convolution with a stride of 1 is replaced by a group convolution with 4 groups, and each 1×1 standard convolution with a stride of 1 is replaced by a group convolution with 16 groups;

the parameter count of the standard convolution is:

S_cp = N × C × K × K

the parameter count of the group convolution, G_cp, is:

[equation missing from the source text]

the parameter count of the heterogeneous convolution, SH_p, is:

[equation missing from the source text]

wherein N is the number of channels of the input feature map, C is the number of channels of the output feature map, G is the number of groups of the group convolution, and K, K₁ and K₂ are all convolution kernel sizes;

G_cp ≤ SH_p ≤ S_cp

compared with the grouping convolution, the heterogeneous convolution effectively integrates the channels of the feature map, and compared with the standard convolution, the heterogeneous convolution improves the receptive field of the model;

step 3-3: feeding the 8×8 feature map obtained in step 3-2 into the atrous spatial pyramid pooling layer;

step 3-4: up-sampling the feature map obtained in step 3-3 three times to obtain a feature map with a resolution of 64×64;

step 3-4: using skip connections, concatenating the 16×16, 32×32 and 64×64 feature maps obtained from the backbone network in step 3-2 with the feature maps of the corresponding resolutions obtained by up-sampling;

step 3-5: adjusting the number of channels of the 64×64 feature map obtained in step 3-4 to the number of key points in the data set by a 1×1 convolution, so that the coordinates of the corresponding key points are output.

2. The target key point detection method based on the heterogeneous convolutional neural network according to claim 1, characterized in that: during training, the network is optimized iteratively using stochastic gradient descent; the loss function used is the mean square error loss function: