CN111626159B - Human body key point detection method based on attention residual error module and branch fusion - Google Patents


Info

Publication number
CN111626159B
Authority
CN
China
Prior art keywords
branch
convolution
layer
attention
feature
Prior art date
Legal status
Active
Application number
CN202010410104.5A
Other languages
Chinese (zh)
Other versions
CN111626159A (en
Inventor
刘峰
龙芳芳
干宗良
崔子冠
赵峥来
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010410104.5A priority Critical patent/CN111626159B/en
Publication of CN111626159A publication Critical patent/CN111626159A/en
Application granted granted Critical
Publication of CN111626159B publication Critical patent/CN111626159B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]


Abstract

The invention discloses a human body key point detection method based on an attention residual module and branch fusion. The method belongs to the technical field of computer vision and comprises the following steps: performing feature processing on the input picture with a feature extraction network to obtain a feature map; inputting the feature map into a region proposal network to obtain target proposal boxes; performing region pooling to obtain a region-of-interest feature map; inputting this into the convolutional layers for feature extraction to obtain feature map I; performing feature extraction and fusion with branch one and branch two; superposing the results of the two branches, first restoring the resolution with deconvolution and then upsampling with bilinear interpolation; the key point locations are modeled as one-hot binary masks for training. The invention improves the diversity of the information output by the network and better captures different fields of view; it not only effectively resolves key point confusion in simple scenes but also improves accuracy and efficiency, and adapts well to complex scenes.

Description

Human body key point detection method based on attention residual module and branch fusion
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a human body key point detection method based on an attention residual module and branch fusion.
Background
Human body posture detection has attracted wide attention from scholars at home and abroad and is an important subject in the field of computer vision. Its core task is to detect human targets in pictures through techniques such as image processing and analysis, machine learning and pattern recognition, distinguish the human body parts, and further detect the human joint points. According to how the raw data describing the posture are acquired, recent research at home and abroad divides posture detection into methods based on wearable sensors and methods based on computer vision. The former are mostly contact-type posture analysis systems: they analyse the human body well, but the sensors that collect motion parameters must be attached to the body or to accessories, which is inconvenient to wear and feels unnatural; such systems are also expensive, hard to operate, unsuited to remote use and difficult to popularize. Although the best current human detection algorithms perform well, errors remain, and these errors keep the accuracy of the detection task low. Representing the human posture by optical flow, silhouette, contour, skeleton or joint points in the image avoids solving for the parameters of a full human model and thus simplifies posture estimation. Deep learning algorithms provide a new idea for posture detection: they generally match and analyse the global features of the image, which effectively avoids the feature-matching ambiguity that local-feature methods suffer under complex postures and occlusion, and ensures better robustness.
Disclosure of Invention
In view of the above problems, the present invention provides a human body key point detection method based on an attention residual module and branch fusion, which addresses the poor detection effect and low accuracy of the prior art.
The technical scheme of the invention is as follows: a human body key point detection method based on an attention residual module and branch fusion specifically comprises the following steps:
step (1.1), performing feature processing on an input picture with a feature extraction network to obtain a feature map; inputting the feature map into a region proposal network to obtain target proposal boxes, and then performing region pooling with the feature map to obtain a region-of-interest feature map;
step (1.2), inputting the obtained region-of-interest feature map into the convolutional layers for feature extraction, and recording the result as feature map I;
step (1.3), inputting feature map I into branch one and branch two respectively for feature processing;
the specific steps of branch one for processing feature map I are as follows:
(1.3.1) two identical attention residual modules are designed at the input of branch one; data bypasses connect the front and rear layers of the network, the two attention residual modules are connected pairwise and superposed at the pixel level, and in this cascading way every module in the network receives feature maps from all preceding modules;
(1.3.2) the result is reduced in dimension by a convolutional layer and input into a fully connected layer; finally it is reshaped to obtain an output of the same size as that of branch two;
the operation of branch two on feature map I is as follows:
the first, second and third dilated convolutional layers arranged in branch two, which have different dilation rates, are taken as one combination; through the combination different receptive fields are obtained, and thereby multi-scale information;
step (1.4), superposing the results of feature map I processed in branch one and branch two, recording the result as feature map II, deconvolving feature map II and then upsampling, and finally obtaining the joint point information through a one-hot binary mask.
Further, in step (1.3.1), the attention residual module consists of a residual small module of dilated convolution cooperating with an attention mechanism:
the residual small module of dilated convolution comprises three convolutional layers: a dimension-reduction convolutional layer, a dilated convolutional layer and a dimension-raising convolutional layer; the convolution weight obtained through the convolution operations of these three layers is denoted V;
the attention mechanism proceeds as follows: after a convolution operation on V, global weighted pooling, a pointwise (1×1) convolution and a Sigmoid activation are applied in turn, and the network produces the spatial attention weight; finally the spatial attention weight re-weights V to realize the channel attention output, giving the spatially attention-weighted feature.
Further, the output parameters of the two branches are superposed to obtain feature map II; resolution is restored on feature map II with a deconvolution layer, bilinear interpolation upsampling generates the high-resolution output, and finally the human joint point positions are modeled as one-hot binary masks to obtain the joint point information.
The invention has the following beneficial effects. The invention is a top-down human posture detection method in the field of computer vision; it fuses the features of an attention residual module with a data bypass, and has stronger robustness and practicality as well as higher accuracy. (1) The attention residual module in branch one assigns a weight to each channel feature and highlights the information of the feature map adaptively in both space and channel; at the same time, two cross-layer connections between the attention residual modules link the front and rear layers of the network, so that signals can flow at high speed between the input and output layers. This design improves the information flow between layers, enriches the information, and lays the foundation for the high accuracy and efficiency of the subsequent detection. (2) Branch two adopts a fully convolutional network (FCN) combined with dilated convolution, so the results before and after each group of convolutions interleave and depend on each other, which enlarges the receptive field and alleviates the local-information loss of dilated convolution (the gridding problem); it can also capture multi-scale context information and obtain local-information dependence, effectively avoiding the joint-point confusion caused by a single receptive field that gathers too little context and cannot "see" the whole picture. (3) The two branches are fused by addition, which yields more diverse information and better captures the different views of each target region; by combining the predictions of the two fields of view, the diversity of the network output is improved, joint-point confusion is effectively resolved in simple scenes, accuracy and efficiency are improved, and the method adapts well to complex scenes.
Drawings
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a schematic diagram of an attention residual module according to the present invention;
FIG. 3 is a diagram illustrating an exemplary structure of a grid problem in the present invention;
FIG. 4 is a schematic representation of the human joint of the present invention.
Detailed Description
In order to illustrate the technical solution of the present invention more clearly, the invention is further described below; obviously, the following describes only some of the embodiments, and a person skilled in the art can apply the technical solution of the invention to other similar situations without creative effort; the technical solution is described in detail below with reference to the accompanying drawings:
a human body key point detection method based on attention residual error module and branch fusion comprises the steps of using a feature extraction network to perform feature processing on an input picture to obtain a feature map; inputting the feature map into the area to generate a network to obtain a target suggestion box; performing region pooling operation by combining the characteristic map to obtain a characteristic map of the region of interest; inputting the obtained characteristic diagram of the region of interest into the convolution layer for characteristic extraction operation to obtain a characteristic diagram I; carrying out deeper feature extraction and fusion by using a brand new neural network; after the results of the two branches are superposed, resolution restoration is carried out by deconvolution, and then twice linear interpolation upsampling is carried out; the positions of the joint points are modeled as one-hot binary masks for training.
As shown in fig. 1, the detection method specifically includes the following steps:
step (1.1), performing feature processing on an input picture with a feature extraction network to obtain a feature map; inputting the feature map into a region proposal network to obtain target proposal boxes, and then performing region pooling with the feature map to obtain a region-of-interest feature map;
step (1.2), inputting the obtained region-of-interest feature map into the convolutional layers for feature extraction, and recording the result as feature map I;
step (1.3), inputting feature map I into branch one and branch two respectively for feature processing;
step (1.4), superposing the results of feature map I processed in branch one and branch two, recording the result as feature map II, deconvolving feature map II and then upsampling, and finally obtaining the joint point information through a one-hot binary mask.
Further, in the step (1.2), the convolutional layers are three identical convolutional layers;
For convenience of description, the parameters of a convolutional layer are defined here: the width, height and channel number of the input feature map are W, H and C respectively, i.e. the map lies in R^(W×H×C); the convolution kernel size is k, written k×k; the stride is s; the padding is p. The width of the output feature map after the convolution is:
W_out = (W - k + 2p)/s + 1 (1)
and the height is obtained in the same way;
each of the three convolutional layers uses a 3×3 kernel with stride and padding both 1, so by formula (1) feature map I obtained after the convolutional layers has the same size R^(W×H×C) as the region-of-interest feature map.
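The size bookkeeping of formula (1) can be checked with a short helper; `conv_out_size` is a hypothetical name introduced here for illustration, not part of the patent:

```python
def conv_out_size(w: int, k: int, s: int = 1, p: int = 0) -> int:
    """Output width of a convolution per formula (1): (W - k + 2p)/s + 1."""
    return (w - k + 2 * p) // s + 1

# A 3x3 layer with stride 1 and padding 1 preserves the spatial size,
# so feature map I keeps the W x H x C shape of the region-of-interest map.
print(conv_out_size(14, k=3, s=1, p=1))  # -> 14
```

The same call with stride 2 shows the usual halving behaviour, e.g. `conv_out_size(28, 3, 2, 1)` gives 14.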
Further, in step (1.3), the specific steps of branch one for processing feature map I are as follows:
(1.3.1) two identical attention residual modules are designed at the input of branch one; data bypasses connect the front and rear layers of the network, the two attention residual modules are connected pairwise and superposed at the pixel level, and in this cascading way every module in the network receives feature maps from all preceding modules; wherein:
(1) the residual small module of dilated convolution comprises three convolutional layers: a dimension-reduction convolutional layer, a dilated convolutional layer and a dimension-raising convolutional layer; the convolution weight obtained through their convolution operations is denoted V;
(2) the attention mechanism: after a convolution operation on V, global weighted pooling, a pointwise convolution and a Sigmoid activation are applied in turn to produce the spatial attention weight; finally the spatial attention weight re-weights V to realize the channel attention output, giving the spatially attention-weighted feature;
Specifically, 1), the residual small module of dilated convolution: a dilated convolution has a settable dilation rate d, meaning that (d-1) zeros are inserted between the elements of the convolution kernel, i.e. (d-1) pixels are skipped; setting different dilation rates therefore yields different receptive fields, i.e. multi-scale information is obtained; continuing the parameter definitions above, the effective kernel size of the dilated convolution is:
n = k + (k-1)*(d-1) (2)
so the width of the output feature map is:
W_out = (W - n + 2p)/s + 1 (3)
and the height is obtained in the same way;
dilated convolution can enlarge the receptive field freely without introducing extra parameters, but the overall computation of the algorithm grows with the resolution, so the rate cannot be increased blindly; dilated convolution also suffers from the gridding problem, namely information loss, in which remotely sampled information lacks correlation (particularly evident for small targets);
to enlarge the receptive field while reducing computation, the parameters of the dimension-reduction, dilated and dimension-raising convolutional layers are set respectively as: input dimension C, output dimension C/4, k = 1, s = 1; input dimension C/4, output dimension C/4, k = 3, s = 1, p = 2, d = 2; input dimension C/4, output dimension C, k = 1, s = 1;
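Formulas (2) and (3) can be sketched together; the helper names are hypothetical, and the numbers below use the module's middle layer (k = 3, d = 2, p = 2, s = 1):

```python
def dilated_kernel_size(k: int, d: int) -> int:
    """Effective kernel size of a dilated convolution, formula (2)."""
    return k + (k - 1) * (d - 1)

def dilated_out_size(w: int, k: int, d: int, s: int = 1, p: int = 0) -> int:
    """Output width of a dilated convolution, formula (3)."""
    n = dilated_kernel_size(k, d)
    return (w - n + 2 * p) // s + 1

# Middle layer of the residual small module: k=3, d=2 gives n=5,
# and padding 2 with stride 1 preserves the spatial size.
print(dilated_kernel_size(3, 2))          # -> 5
print(dilated_out_size(14, 3, 2, 1, 2))   # -> 14
```

With d = 1 the formulas fall back to the plain convolution case of formula (1).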
2) The attention mechanism proceeds as follows: let the input of the attention residual module be V ∈ R^(H×W×C), the learned residual mapping be V' ∈ R^(H×W×C), and the dimension-reduction factor be r; the output of the attention residual module is U ∈ R^(H×W×C); then:
U = V + β ⊙ V' (4)
where ⊙ denotes element-wise multiplication over the spatial domain; the spatial attention weight β ∈ R^(H×W) is produced as follows. First a convolution operation yields the convolution weight W1 ∈ R^(H×W×C/r); then global weighted pooling (GDC) is applied to the obtained feature map: if the number of groups in the convolution is G and the number of output feature maps is N, the effect of GDC is achieved when formulas (5) and (6) are satisfied:
G = N = C/r (5)
k = H = W (6)
that is, the number of groups and the number of output feature maps both equal the number of input feature maps, and the convolution kernel has the same spatial size as the input feature map; the learned convolution weight is W2 ∈ R^(1×1×C/r). GDC gives each position a learnable weight and at the same time regularizes the whole network structure over the spatial range, preventing overfitting. A pointwise convolution with kernel size 1×1×C/r is then applied to the output; this step combines W2 with weights along the depth direction to generate W3 ∈ R^(1×1×C). Finally the spatial attention weight is obtained through a Sigmoid activation, β = Sigmoid(W3 ∗ V), where W3 is the convolution weight and Sigmoid is the S-shaped growth curve. Finally β re-weights the input V of the attention residual module to realize the channel attention output, giving the spatially attention-weighted feature at the (i, j)-th element of the spatial domain:
Ũ_(i,j) = β_(i,j) ⊙ V_(i,j) (7)
where β_(i,j) and V_(i,j) denote the values of β and V at the (i, j)-th spatial element, and ⊙ denotes the element-wise multiplication between the (i, j)-th elements;
as a concrete embodiment, W = 14, C = 512 and r = 4 may be chosen.
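A minimal numpy sketch of formulas (4) and (7), under the embodiment's sizes (H = W = 14, C = 512): the learned conv + GDC + pointwise-convolution stack is collapsed into a single hypothetical weight vector `w3`, and all weights are random stand-ins rather than trained parameters of the patented network.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 14, 14, 512          # sizes from the embodiment (r = 4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V = rng.standard_normal((H, W, C))      # module input
V_res = rng.standard_normal((H, W, C))  # stands in for the learned residual V'

# Hypothetical stand-in for the conv + GDC + pointwise-conv stack:
# collapse the channel axis to one score per spatial position, then
# Sigmoid, giving the spatial attention weight beta in R^(H x W).
w3 = rng.standard_normal(C) / np.sqrt(C)
beta = sigmoid(V @ w3)                  # shape (H, W), values in (0, 1)

# Formula (4): re-weight the residual branch by beta per formula (7)
# and add it back onto the module input V.
U = V + beta[:, :, None] * V_res

print(U.shape, beta.shape)
```

The broadcast `beta[:, :, None] * V_res` is exactly the element-wise spatial multiplication ⊙ of formula (7), applied across all C channels.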
(1.3.2) The result is reduced in dimension by a convolutional layer and input into a fully connected layer; so that it can be superposed with the final result of branch two, it is finally reshaped to the same size as the output of branch two; specifically:
1) the features obtained in step (1.3.1) are reduced in dimension by a dimension-reduction convolution; as an example, the parameters are set to: input dimension C, output dimension C/2, k = 3, s = 1;
2) the features obtained in 1) are fed into a fully connected (FC) layer. The FC layer has different properties from an FCN: the FCN predicts each pixel from a local receptive field and shares parameters across spatial positions, whereas the FC layer is location-sensitive and realizes predictions at different spatial positions through different parameter sets. It therefore has the capacity to adapt to different spatial positions, which helps to predict at each spatial position using the global information of the whole proposal and to distinguish and identify the individual joint parts belonging to the same object; this is not only efficient but also allows more samples to train the parameters of the FC layer, avoiding overfitting and improving generality. As a concrete embodiment, the feature size used is 14×14, so the FC layer produces a 196×1×1 vector; to fuse this result with the output of branch two, the dimensions must be kept consistent, so the obtained vector is reshaped so that its dimensions match those of the output of branch two.
Further, the specific steps of inputting feature map I into branch two in step (1.3) are as follows: the first, second and third dilated convolutional layers arranged in branch two, which have different dilation rates, are taken as one combination; through the combination different receptive fields are obtained, and thereby multi-scale information; the specific parameter calculation is as follows:
let the size of the receptive field of the j-th layer be rf_j; then:
rf_j = (n-1)*j + 1 (8)
where rf_0 = 1; as shown in fig. 3, the layers from left to right are in a top-to-bottom relationship (convolution proceeds from left to right); the three convolution kernels all have k = 3 and d = 2, so n = 5 by formula (2), and the receptive field of the central pixel of the third (rightmost) layer is 13 by formula (8); however, only 75% of the pixels in that field actually take part in the computation. To prevent this problem, the design groups three convolutional layers together, each group uses continuously increasing dilation rates, and the other groups repeat the pattern; the goal is that the final receptive field fully covers the whole area (without any holes or missing edges); this requires:
M_i = Max[M_(i+1) - 2r_i, 2r_i - M_(i+1), r_i] (9)
where Max[a, b, c] takes the maximum of its arguments, M_i is the maximum dilation rate allowed at layer i, M_(i+1) is the maximum dilation rate at layer i+1, and r_i is the dilation rate of the i-th layer, with M_n = r_n for the last layer; the design goal is M_2 ≤ k;
assuming k = 3 and taking r = [1, 2, 5], for the second layer formula (9) gives:
M_2 = Max[M_3 - 2r_2, 2r_2 - M_3, r_2]
= Max[1, -1, 2] = 2 < 3
so the condition is satisfied; therefore, as described above, r = [1, 2, 5] may be selected as a group.
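Formulas (8) and (9) can be exercised with two small helpers (hypothetical names, written here for illustration); the rate sequence [1, 2, 5] from the example passes the no-gridding check, while a common-factor sequence such as [2, 4, 8] does not:

```python
def receptive_field(n: int, j: int) -> int:
    """Formula (8): receptive field of layer j when every layer has
    effective kernel size n (rf_0 = 1)."""
    return (n - 1) * j + 1

def hdc_ok(rates, k):
    """No-gridding condition of formula (9): compute M_i downward from
    M_n = r_n via M_i = max(M_{i+1} - 2 r_i, 2 r_i - M_{i+1}, r_i)
    and require M_2 <= k."""
    M = rates[-1]
    for r in reversed(rates[1:-1]):
        M = max(M - 2 * r, 2 * r - M, r)
    return M <= k

# Three k=3, d=2 layers: n = 5, so the third-layer field is 13.
print(receptive_field(5, 3))    # -> 13
print(hdc_ok([1, 2, 5], k=3))   # -> True: rates 1,2,5 leave no holes
print(hdc_ok([2, 4, 8], k=3))   # -> False: common-factor rates grid
```

For [1, 2, 5] the loop reproduces the example: M_3 = 5, M_2 = max(1, -1, 2) = 2 ≤ 3.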
Further, the output parameters of the two branches are superposed to obtain feature map II; resolution is restored on feature map II with a deconvolution layer, bilinear interpolation upsampling generates the high-resolution output, and the human joint point positions are modeled as one-hot binary masks to obtain the joint point information;
specifically, 1), the two branches are fused by addition, further fusing the features; combining the predictions of the two fields of view improves the diversity of the network output and the quality of the output mask, giving better joint point predictions;
2) the fused features described in 1) first undergo resolution restoration by deconvolution, yielding a size written as width × height × dimension, i.e. W×H×K (for example 28×28×17), and then bilinear interpolation upsampling generates a high-resolution output of 2W×2H×K (for example 56×56×17);
3) the joint point positions are modeled as one-hot binary masks: using the 2W×2H×K high-resolution output described in 2), a one-hot M×M (for example 56×56) binary mask is made for each of the K joint points of the instance, with only one pixel in each mask labeled as foreground; training then yields the K joint points;
in addition, during training, for each labeled ground-truth joint point, the cross-entropy loss over the softmax of the M^2 locations is minimized (this encourages the detection of a single point); the K joint points are still treated independently, each corresponding to one joint type (e.g. eye, left shoulder).
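The one-hot mask target and the cross-entropy over all M^2 softmax locations can be sketched for a single joint point; the helper names and the random logits are illustrative stand-ins, with M = 56 as in the embodiment:

```python
import numpy as np

M = 56                              # mask resolution (56 x 56 in the embodiment)

def one_hot_mask(y: int, x: int, m: int = M) -> np.ndarray:
    """Binary mask with exactly one foreground pixel at the keypoint."""
    mask = np.zeros((m, m))
    mask[y, x] = 1.0
    return mask

def keypoint_loss(logits: np.ndarray, y: int, x: int) -> float:
    """Cross-entropy over the softmax of all M^2 locations for one keypoint."""
    z = logits.reshape(-1)
    z = z - z.max()                 # numerical stability before exp
    log_softmax = z - np.log(np.exp(z).sum())
    return float(-log_softmax[y * logits.shape[1] + x])

rng = np.random.default_rng(2)
logits = rng.standard_normal((M, M))    # one keypoint's predicted mask logits
gt_y, gt_x = 20, 33                     # hypothetical ground-truth location

loss = keypoint_loss(logits, gt_y, gt_x)
pred = np.unravel_index(logits.argmax(), logits.shape)  # predicted keypoint
print(loss, pred)
```

Because the softmax spans all M^2 positions jointly, minimizing this loss pushes exactly one location toward probability 1, matching the single-foreground-pixel target; each of the K joint types gets its own independent mask and loss.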
Through the above steps, the K (for example 17) joint points to be detected can finally be calibrated unambiguously, so that the problem of joint point detection confusion is effectively resolved in simple scenes, the accuracy and efficiency are improved, and the method adapts well to complex scenes.
Finally, it should be understood that the embodiments described herein merely illustrate the principles of the invention; other variations are possible within its scope; thus, by way of example and not limitation, alternative configurations of the embodiments may be regarded as consistent with the teachings of the invention; accordingly, the invention is not limited to the embodiments explicitly described and depicted.

Claims (3)

1. A human body key point detection method based on an attention residual module and branch fusion, characterized by comprising the following steps:
step (1.1), performing feature extraction on an input picture with a feature extraction network to obtain a feature map; inputting the feature map into a region proposal network to obtain target proposal boxes, and then performing a region pooling operation in combination with the feature map to obtain a region-of-interest feature map;
step (1.2), inputting the obtained region-of-interest feature map into a convolutional layer for a feature extraction operation, and recording the result as feature map one;
step (1.3), inputting feature map one into branch one and branch two respectively for feature processing;
the specific steps of branch one for processing feature map one are as follows:
(1.3.1) arranging two identical attention residual modules at the input of branch one, connecting front and rear layers of the network through a data bypass (skip connection), connecting the two attention residual modules pairwise and superposing them at the pixel level, so that each module in the network receives, in a cascaded manner, the feature maps of all preceding modules;
(1.3.2), after dimensionality reduction by a convolutional layer, inputting the result into a fully connected layer; finally, reshaping the output to obtain a feature map of the same size as feature map one;
branch two performs feature processing on feature map one as follows:
taking the first, second and third dilated convolution layers arranged in branch two, which have different dilation rates, as a combination; different receptive fields are obtained through this combination, thereby capturing multi-scale information;
and step (1.4), superposing the results of feature map one processed by branch one and branch two, and recording the result as feature map two; performing deconvolution on feature map two for upsampling, and finally obtaining joint point information through a one-hot binary mask.
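As context for the dilated convolutions of branch two in claim 1, the receptive-field growth they provide can be sketched with standard convolution arithmetic. This is an illustrative aid, not part of the claims: the 3×3 kernel size and the dilation rates 1, 2 and 4 below are assumptions, since the claims only require that the three rates differ.

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return d * (k - 1) + 1

def stacked_receptive_field(kernels, dilations):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k, d in zip(kernels, dilations):
        rf += effective_kernel(k, d) - 1
    return rf

# Three plain 3x3 convolutions versus three dilated 3x3 convolutions
# with (assumed) dilation rates 1, 2, 4: the dilated stack sees a much
# larger context at the same parameter count.
plain = stacked_receptive_field([3, 3, 3], [1, 1, 1])    # -> 7
dilated = stacked_receptive_field([3, 3, 3], [1, 2, 4])  # -> 15
```

Because each layer in the stack has a different receptive field, their combined responses carry the multi-scale information that branch two contributes before the two branches are superposed.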
2. The human body key point detection method based on an attention residual module and branch fusion according to claim 1, wherein in step (1.3.1) the attention residual module is composed of a dilated-convolution residual sub-module combined with an attention mechanism:
wherein the dilated-convolution residual sub-module comprises three convolutional layers, namely a dimension-reduction convolutional layer, a dilated convolutional layer and a dimension-raising convolutional layer; the feature obtained through the convolution operations of these three layers is denoted V;
the attention mechanism comprises the following specific steps: after a convolution operation is performed on V through a convolutional layer, global weighted pooling, a 1×1 (dot-product) convolution and a sigmoid function are applied in sequence, and the spatial attention weights are obtained through the network; finally, the spatial attention weights are applied to V to realize the channel-attention output, obtaining the spatially attention-weighted feature.
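The weighting step of claim 2 can be illustrated with a deliberately simplified, single-channel analogue: the global-pooling and 1×1-convolution stages are collapsed into a subtraction of the global mean (a hypothetical stand-in), and only the sigmoid gating and pixel-level multiplication of the claims are kept.

```python
import math

def sigmoid(x):
    """S-shaped growth curve (sigmoid) used to squash attention scores to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(v):
    """Toy spatial attention over a 2D feature map `v` (list of rows).

    Each position's score is its value minus the global mean (a stand-in
    for the claimed global-pooling + 1x1-convolution steps); the sigmoid
    of the score becomes that position's attention weight, and the output
    is the element-wise (pixel-level) weighted feature V * weight.
    """
    flat = [x for row in v for x in row]
    mean = sum(flat) / len(flat)
    return [[sigmoid(x - mean) * x for x in row] for row in v]

v = [[0.0, 2.0],
     [4.0, 6.0]]
out = spatial_attention(v)
# Strong activations keep weights near 1; weak ones are suppressed toward 0.
```

In the actual module the weights would be learned by the network rather than derived from the mean; the sketch only shows how sigmoid gating re-scales each spatial position of V.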
3. The human body key point detection method based on an attention residual module and branch fusion according to any one of claims 1-2, wherein the output parameters of the two branches are superposed to obtain feature map two; finally, the positions of the human body joint points are modeled as one-hot binary masks to obtain the joint point information.
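The one-hot binary mask modelling of claims 1 and 3 can be sketched as follows: for each joint, the network's response map is reduced to a mask that is 1 at exactly one position (the predicted joint location) and 0 elsewhere. The heatmap values below are made up for illustration.

```python
def one_hot_mask(heatmap):
    """Convert a 2D response map into a one-hot binary mask.

    The mask is 1 at the single highest-response position (the predicted
    joint location) and 0 everywhere else, matching the claims' modelling
    of each human body joint as a one-hot binary mask.
    """
    h, w = len(heatmap), len(heatmap[0])
    best = max(range(h * w), key=lambda i: heatmap[i // w][i % w])
    by, bx = best // w, best % w
    return [[1 if (r, c) == (by, bx) else 0 for c in range(w)]
            for r in range(h)]

heat = [[0.1, 0.3, 0.2],
        [0.4, 0.9, 0.5],
        [0.2, 0.1, 0.0]]
mask = one_hot_mask(heat)  # 1 only at row 1, column 1
```

Reading off the coordinates of the single 1 in each mask then yields the joint point information for every detected keypoint.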
CN202010410104.5A 2020-05-15 2020-05-15 Human body key point detection method based on attention residual error module and branch fusion Active CN111626159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010410104.5A CN111626159B (en) 2020-05-15 2020-05-15 Human body key point detection method based on attention residual error module and branch fusion


Publications (2)

Publication Number Publication Date
CN111626159A CN111626159A (en) 2020-09-04
CN111626159B true CN111626159B (en) 2022-07-26

Family

ID=72271858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010410104.5A Active CN111626159B (en) 2020-05-15 2020-05-15 Human body key point detection method based on attention residual error module and branch fusion

Country Status (1)

Country Link
CN (1) CN111626159B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112653899B (en) * 2020-12-18 2022-07-12 北京工业大学 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN112733672A (en) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 Monocular camera-based three-dimensional target detection method and device and computer equipment
CN112784856A (en) * 2021-01-29 2021-05-11 长沙理工大学 Channel attention feature extraction method and identification method of chest X-ray image
CN113012229A (en) * 2021-03-26 2021-06-22 北京华捷艾米科技有限公司 Method and device for positioning human body joint points
CN113269077B (en) * 2021-05-19 2023-04-07 青岛科技大学 Underwater acoustic communication signal modulation mode identification method based on improved gating network and residual error network
CN115019338B (en) * 2022-04-27 2023-09-22 淮阴工学院 Multi-person gesture estimation method and system based on GAMHR-Net
CN114783065B (en) * 2022-05-12 2024-03-29 大连大学 Parkinsonism early warning method based on human body posture estimation
CN115546779B (en) * 2022-11-26 2023-02-07 成都运荔枝科技有限公司 Logistics truck license plate recognition method and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516670A (en) * 2019-08-26 2019-11-29 广西师范大学 Suggested based on scene grade and region from the object detection method for paying attention to module
CN111047515A (en) * 2019-12-29 2020-04-21 兰州理工大学 Cavity convolution neural network image super-resolution reconstruction method based on attention mechanism


Also Published As

Publication number Publication date
CN111626159A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN111626159B (en) Human body key point detection method based on attention residual error module and branch fusion
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN111582483B (en) Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN111145131A (en) Infrared and visible light image fusion method based on multi-scale generation type countermeasure network
CN113283525B (en) Image matching method based on deep learning
CN107767419A (en) A kind of skeleton critical point detection method and device
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN106886986B (en) Image interfusion method based on adaptive group structure sparse dictionary study
CN109993103A Human behavior recognition method based on point cloud data
CN113077505B (en) Monocular depth estimation network optimization method based on contrast learning
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN113792641B (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN110660020A (en) Image super-resolution method of countermeasure generation network based on fusion mutual information
CN111833400B (en) Camera pose positioning method
CN112084934A (en) Behavior identification method based on two-channel depth separable convolution of skeletal data
CN114663509A (en) Self-supervision monocular vision odometer method guided by key point thermodynamic diagram
CN116934592A (en) Image stitching method, system, equipment and medium based on deep learning
CN114170290A (en) Image processing method and related equipment
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
CN111539288B (en) Real-time detection method for gestures of both hands
CN112419387B (en) Unsupervised depth estimation method for solar greenhouse tomato plant image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant