CN112597955B - Single-stage multi-person pose estimation method based on feature pyramid network - Google Patents

Single-stage multi-person pose estimation method based on feature pyramid network

Info

Publication number
CN112597955B
CN112597955B (application CN202011607963.XA)
Authority
CN
China
Prior art keywords
heat map
joint
feature
map
predicted
Prior art date
Legal status
Active
Application number
CN202011607963.XA
Other languages
Chinese (zh)
Other versions
CN112597955A (en)
Inventor
骆炎民
张智谦
林躬耕
Current Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Original Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Fujian Gongtian Software Co ltd, Huaqiao University filed Critical Fujian Gongtian Software Co ltd
Priority to CN202011607963.XA
Publication of CN112597955A
Application granted
Publication of CN112597955B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a single-stage multi-person pose estimation method based on a feature pyramid network, which relates to the technical field of human pose estimation and comprises the following steps: step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially reduced resolution and then carrying out inter-channel information fusion; step 20, constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from a multi-person pose estimation data set as training labels, and training the feature pyramid network; and step 30, inputting the image to be tested into the trained feature pyramid network, calculating the joint positions and forming the complete multi-person poses. The embodiment of the invention enables information to flow efficiently through the network and improves the accuracy of human pose estimation; at the same time, a fast post-processing matching procedure further increases the processing speed of the multi-person pose estimation algorithm.

Description

Single-stage multi-person pose estimation method based on feature pyramid network
Technical Field
The invention relates to the technical field of human pose estimation, and in particular to a single-stage multi-person pose estimation method based on a feature pyramid network.
Background
Human pose estimation is a key step toward computer vision understanding of human behavior: from a single RGB image, all joints of each human body are predicted and assembled into a correct pose. Accurate pose prediction is of great significance for higher-level computer vision tasks such as human behavior recognition, human-computer interaction, pedestrian re-identification and abnormal behavior detection.
Although the field of human pose estimation is developing rapidly, current top-down and bottom-up approaches to multi-person pose estimation are multi-stage methods; they are time-consuming and cannot exploit the end-to-end trainability of a CNN. Traditional pose estimation methods uniformly pursue accuracy while neglecting network parameter count and inference speed, which makes the algorithms hard to deploy in practice and greatly reduces their economic benefit.
In terms of network architecture design, Howard A, Zhmoginov A, Chen L C, et al., in the paper "MobileNetV2: Inverted Residuals and Linear Bottlenecks" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018), propose a lightweight network architecture named MobileNet, which compresses the computation of an ordinary 3x3 convolution by replacing it with a 3x3 depthwise separable convolution plus a 1x1 pointwise convolution. An inverted residual unit expands the dimension of the input feature map, i.e., a 1x1 convolution is applied first, then a 3x3 depthwise separable convolution performs the convolution operation, and finally a 1x1 convolution reduces the feature map dimension again, so that more feature information is retained and the expressive power of the model is preserved. However, for human pose estimation this network lacks the fusion and application of multi-scale features; multi-scale features have produced excellent results in tasks such as segmentation and detection, and in human pose estimation they markedly improve the accuracy of detecting people and joint points at different scales in a picture.
In pose estimation from RGB images, Nie, Xuecheng, et al. (Proceedings of the IEEE International Conference on Computer Vision, 2019) propose a single-stage pose estimation network in the paper "Single-Stage Multi-Person Pose Machines", which organizes the human joint points into a hierarchy, placing a central joint point at the first level. The joint points of the second level are the trunk joints, including the neck, shoulders and hips. The third level includes the head, elbows and knees, and the fourth level includes the wrists and ankles. In this way the prediction burden on the network is relieved: each keypoint depends only on the joints adjacent to it. However, when a joint point at an upper level is occluded or invisible, the prediction of the joint points below it may fail, and a long-distance offset problem remains, which limits the accuracy of human pose estimation.
In the patent "A multi-person pose estimation method based on a cascaded pyramid network" (publication number CN108229445A), a Shenzhen technology company discloses a method that locates the keypoints within each person's bounding box using a cascaded pyramid network: a global network locates the easy keypoints, and a refinement network handles the difficult keypoints by integrating the feature representations from all levels of the global network. The method exploits multi-scale features, but because the network is more complex and the method is multi-stage, its efficiency is lower than that of a single-stage method.
Therefore, how to provide a single-stage multi-person pose estimation method based on a feature pyramid network that achieves faster, high-precision single-stage multi-person pose estimation has become an urgent problem to solve.
Disclosure of Invention
The invention aims to solve the technical problem of providing a single-stage multi-person pose estimation method based on a feature pyramid network, so as to improve the speed and efficiency of human pose estimation.
In order to solve the technical problems, the embodiment of the invention adopts the following technical scheme:
a single-stage multi-person gesture estimation method based on a feature pyramid network comprises the following steps:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature graphs with sequentially reduced resolution, then carrying out information fusion among channels, then carrying out up-sampling and feature addition operation on all primary feature graphs between feature branches by taking the primary feature graph with the lowest resolution as a starting point, and finally carrying out prediction output;
step 20, acquiring a multi-person gesture estimation data set, wherein the multi-person gesture estimation data set comprises multi-person gesture pictures and ground truth labeling of joint points; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map by using the multi-person gesture estimation data set as training labels, and training the characteristic pyramid network;
and 30, inputting the image to be tested into the trained characteristic pyramid network, calculating joint positions according to the output central point heat map, the upper offset heat map, the lower offset heat map and the joint refinement heat map, and forming complete human body gestures according to the joint positions.
Further, the step 10 specifically includes:
step 11, creating a plurality of first convolution kernels for extracting the primary features of the image and changing the number of feature channels;
step 12, sequentially cascading convolution modules formed from a plurality of inverted residual units at the outputs of the first convolution kernels, to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature maps extracted by the inverted residual unit modules, setting a group of second convolution kernels for carrying out inter-channel information fusion on the original feature map of the current layer, to obtain the corresponding fused feature map;
step 14, sequentially cascading a plurality of deconvolution modules starting from the lowest-resolution feature map layer, wherein each deconvolution module enlarges the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and then performs a position-wise element summation of the enlarged feature map with the fused feature map of the next layer to obtain an enhanced feature map;
and step 15, producing the prediction output from the highest-resolution enhanced feature map using four groups of parallel third convolution kernels.
Further, in the step 20, the training of the feature pyramid network is specifically: calculating the loss values of the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network against the training labels, together with the total loss, and training the feature pyramid network according to these loss values;
the formula for calculating the center point heat map loss value is as follows:
Figure BDA0002872330120000031
wherein P (P) j ) Representing the predicted center point heat map with position p j Predicted value at G (p j ) Representing a center point heat map constructed from training labels at position p j True value at;
the formula for calculating the offset heat map loss value is:
Figure BDA0002872330120000032
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure BDA0002872330120000041
an upper offset heat map representing a predicted joint type i,>
Figure BDA0002872330120000042
an upper offset heat map true value representing a joint type i in the training label;
the formula for calculating the lower offset heat map loss value is:
Figure BDA0002872330120000043
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure BDA0002872330120000044
lower offset heat map representing predicted joint type i,/->
Figure BDA0002872330120000045
Representing a true value of a lower offset heat map with a joint type i in the training label;
the formula for calculating the joint refinement heat map loss value is as follows:
Figure BDA0002872330120000046
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure BDA0002872330120000047
a refined heat map representing a predicted joint type i,/->
Figure BDA0002872330120000048
A refinement heat map true value with a joint type i in the training label is represented;
the formula for calculating the total loss is: l=αm+βl u +γL d +δL o
Where α, β, γ, δ represent the weight of each loss.
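A compact sketch of these four loss terms and their weighted sum, assuming PyTorch tensors; the dictionary keys and the default weight values are assumptions of this sketch, not values fixed by the patent:

```python
import torch
import torch.nn.functional as F

def total_loss(pred, gt, alpha=1.0, beta=0.1, gamma=0.1, delta=0.1):
    """L = alpha*M + beta*L_u + gamma*L_d + delta*L_o, each term a mean square error.

    pred and gt are dicts of heat map tensors; the keys and the default
    weight values are illustrative assumptions."""
    m   = F.mse_loss(pred["center"], gt["center"])  # center point heat map loss M
    l_u = F.mse_loss(pred["upper"],  gt["upper"])   # upper offset heat map loss L_u
    l_d = F.mse_loss(pred["lower"],  gt["lower"])   # lower offset heat map loss L_d
    l_o = F.mse_loss(pred["refine"], gt["refine"])  # joint refinement heat map loss L_o
    return alpha * m + beta * l_u + gamma * l_d + delta * l_o
```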
Further, in the step 20, the center point heat map is constructed according to the positions of the center joints, the upper offset heat map is constructed according to the offsets of the center joints to each type of upper body joint and the hip joint, the lower offset heat map is constructed according to the offsets of the hip joints to each type of lower limb joint, and the joint refinement heat map is constructed according to the positions of the joints other than the center joints.
Further, the step 30 specifically includes:
step 31, obtaining the image to be tested, preprocessing it, and then inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one central joint point position by using a non-maximum suppression algorithm in the predicted central point heat map;
step 33, according to the central joint point position, in a predicted upper offset heat map, response values corresponding to each type of upper body joint and hip joint are obtained, and according to the response values, fuzzy positions of each type of upper body joint and hip joint are calculated;
step 34, calculating the accurate positions of the upper body joints and the hip joints according to the fuzzy positions of the upper body joints and the hip joints of each type and the response values of the corresponding joints in the predicted joint refinement heat map;
step 35, calculating the fuzzy position of each type of lower body joint according to the accurate position of the hip joint and the lower offset heat map of each type of lower body joint;
step 36, calculating to obtain the accurate position of each type of lower body joint according to the fuzzy position of each type of lower body joint and the response value of the corresponding joint in the predicted joint refinement heat map;
step 37, according to the accurate positions of the joints of the whole body, all the joints are sequentially connected to form a complete human body posture based on the preset joint sequence.
One or more technical solutions provided in the embodiments of the present invention at least have the following technical effects or advantages:
1. The feature pyramid network is built on the MobileNet network, and up-sampling and feature addition operations are then performed across the feature branches; this effectively reduces the parameter count of the deep convolutional neural network, lets information flow efficiently through the network, and fuses joint point information with spatial and semantic information, greatly improving the accuracy of human pose estimation;
2. With the single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions, which overcomes the slow training and long inference times faced by traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network: the estimation of the human pose is completed within a single network, no other post-processing operations are needed, and the efficiency of human pose estimation is greatly improved.
The foregoing is only an overview of the technical solution of the present invention; so that the technical means of the invention may be understood more clearly and implemented in accordance with this description, and so that the above and other objects, features and advantages of the invention may become more readily apparent, the detailed description of the invention follows.
Drawings
The invention is further described below through example embodiments with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature pyramid network in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of an inverted residual unit according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the feature pyramid network output in an embodiment of the present invention;
fig. 5 is a schematic diagram of a result of human body posture estimation according to an embodiment of the present invention.
Detailed Description
The overall idea of the technical solution in the embodiments of this application is as follows:
First, a feature pyramid network is built on MobileNet, which greatly reduces the parameter count of the deep convolutional neural network while keeping the loss of accuracy within an acceptable range. Second, several 3x3 deconvolution layers are placed between the feature maps of different levels in the feature pyramid network to recover resolution, and a feature addition operation is performed with the feature map of the previous level; this further improves the flow and fusion of information between features, so that the network can effectively exploit and fuse spatial and semantic information. Then, in addition to the joint point prediction and the offset prediction, a joint refinement heat map is added: because network predictions of long-distance offsets are often inaccurate, an additional refinement heat map is predicted at each joint point to refine the joint positions, so that each joint of the human body is located more precisely. This greatly improves the accuracy of pose estimation and provides an accurate pose reference for higher-level tasks such as behavior recognition, pedestrian re-identification and abnormal behavior detection.
The embodiment of this application provides a single-stage multi-person pose estimation method based on a feature pyramid network; referring to fig. 1, it includes:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially reduced resolution, then carrying out inter-channel information fusion, then carrying out up-sampling and feature addition operations on all primary feature maps across the feature branches, starting from the primary feature map with the lowest resolution, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the data set comprises multi-person pose pictures and ground truth annotations of the joint points; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the data set as training labels, and training the feature pyramid network;
that is, a large number of sample images (multi-person pose pictures) are obtained in advance, the joint points of each sample image are annotated, and the images are divided into a training set, a validation set and a test set; the training set is input into the deep convolutional neural network for training, the validation set and the test set are used to verify the trained network, and whether the loss value reaches a preset threshold is judged;
and step 30, inputting the image to be tested into the trained feature pyramid network, calculating the joint point positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and then forming the complete human poses from the joint point positions.
With this single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions; this overcomes the slow training and long inference times faced by traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network: the estimation of the human pose is completed within a single network, no other post-processing operations are needed, and the efficiency of human pose estimation is greatly improved.
Referring to fig. 2, in a possible implementation manner, in step 10, a feature pyramid network is built based on a MobileNet network, which specifically includes:
step 11, creating a plurality of first convolution kernels (for example, convolution kernels of size 3x3) for extracting the primary features of the image and changing the number of feature channels;
step 12, sequentially cascading convolution modules formed from a plurality of inverted residual units at the outputs of the first convolution kernels, to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
wherein, as shown in fig. 3, a single inverted residual unit is constructed as follows:
step 121, first performing feature extraction with a number of convolution kernels of spatial size 1x1 (a pointwise convolution module), while increasing the number of feature channels;
step 122, adding a ReLU6 activation function after this convolution;
step 123, adding a 3x3 depthwise separable convolution (a depthwise convolution module) after the ReLU6 activation function to extract features;
step 124, adding a ReLU6 activation function after the separable convolution;
step 125, adding a 1x1 convolution kernel (a pointwise convolution module) after the ReLU6 activation function to reduce the number of feature channels;
step 126, adding a linear activation function after the 1x1 convolution;
step 127, mapping the feature map input to step 121 onto the feature map generated in step 126 via an identity mapping, and performing an element-wise feature addition operation, as in the sketch below;
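A minimal PyTorch sketch of this inverted residual unit follows; the fixed channel count and the expansion factor of 6 are illustrative assumptions rather than values taken from the patent:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual unit: 1x1 expand -> 3x3 depthwise -> 1x1 project -> add."""

    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # steps 121-122: 1x1 pointwise convolution raises the channel count, then ReLU6
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU6(inplace=True),
            # steps 123-124: 3x3 depthwise convolution extracts features, then ReLU6
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.ReLU6(inplace=True),
            # steps 125-126: 1x1 pointwise convolution lowers the channel count, linear activation
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # step 127: identity mapping plus element-wise feature addition
        return x + self.block(x)
```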
the steps 11 and 12 are set up of a MobileNet network structure, and the embodiment further performs information fusion and feature reinforcement on the feature branches through subsequent steps on the basis of the MobileNet network:
step 13, after each layer of original feature images extracted by the reverse residual error unit module, a group of second convolution kernels (for example, convolution kernels with the size of 1×1) are set for carrying out channel information fusion on the original feature images of the current layer to obtain corresponding fusion feature images;
step 14, sequentially cascading a plurality of deconvolution modules with the feature map layer with the lowest resolution as a starting point, wherein the deconvolution modules are used for amplifying the resolution of the fusion feature map of the current layer into the resolution of the fusion feature map of the next layer to obtain an amplified feature map, and then carrying out element summation operation on the amplified feature map and the fusion feature map of the next layer from position to further carry out feature fusion to obtain an enhanced feature map;
and 15, performing prediction output of the final four heat maps on the intensified characteristic map with the maximum resolution by using four groups of parallel third convolution kernels (for example, convolution kernels with the size of 1×1).
In one possible embodiment, the upper body joints are divided into 9 classes: the head joint, neck joint, chest joint, and the left and right shoulder, elbow and wrist joints, wherein the chest joint is defined as the center joint point; the lower body joints are divided into 7 classes: the hip joint and 6 lower limb joints (the left and right hips, knees and ankles). The center point heat map is constructed from the positions of the center joint points; the upper offset heat map is constructed from the offsets of the center joint point to each of the other 8 classes of upper body joints and to the hip joint; the lower offset heat map is constructed from the offsets of the hip joint to each of the other 6 classes of lower limb joints; and the joint refinement heat map is constructed from the positions of all joint points other than the center joint point. The numbers of heat map channels predicted by the pyramid network are 1, 18, 12 and 30 respectively, corresponding to 1 center point heat map, 18 upper offset heat maps (each joint class has offset heat maps for the X and Y channels), 12 lower offset heat maps (likewise X and Y channels per joint class) and 30 joint refinement heat maps (for the joints other than the center point, also split into X and Y channels); the output heat maps are shown in fig. 4. In the four groups of parallel third convolution kernels, the number of kernels in each group matches the corresponding number of heat map channels, i.e., 1, 18, 12 and 30 respectively (see the sketch below).
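For illustration, a simplified PyTorch sketch of steps 13-15: lateral 1x1 fusion of each level, top-down deconvolution with position-wise addition, and the four parallel prediction heads. The backbone stage widths and the deconvolution hyperparameters are assumptions; only the head channel counts 1, 18, 12 and 30 come from the description above:

```python
import torch.nn as nn

class PyramidFusionHead(nn.Module):
    """Lateral fusion, top-down enhancement, and four parallel heat map heads."""

    def __init__(self, in_channels=(32, 64, 128, 256), mid=64):
        super().__init__()
        # step 13: one group of 1x1 "second convolution kernels" per pyramid level
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid, kernel_size=1) for c in in_channels)
        # step 14: deconvolution modules, each doubling the spatial resolution
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(mid, mid, kernel_size=4, stride=2, padding=1)
            for _ in in_channels[:-1]
        )
        # step 15: four parallel groups of 1x1 "third convolution kernels"
        self.heads = nn.ModuleList(nn.Conv2d(mid, k, kernel_size=1) for k in (1, 18, 12, 30))

    def forward(self, feats):
        # feats: primary feature maps ordered from highest to lowest resolution
        fused = [lat(f) for lat, f in zip(self.lateral, feats)]
        x = fused[-1]  # start from the lowest-resolution fused feature map
        for i in range(len(fused) - 2, -1, -1):
            x = self.up[i](x) + fused[i]  # enlarge, then position-wise element summation
        # center point / upper offset / lower offset / joint refinement heat maps
        return [head(x) for head in self.heads]
```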
By building the feature pyramid network on the MobileNet network and then performing up-sampling and feature addition operations across the feature branches, the parameter count of the deep convolutional neural network can be effectively reduced, information can flow efficiently through the network, and joint point information can be fused with spatial and semantic information, greatly improving the accuracy of human pose estimation.
In one possible implementation, the step 20 specifically includes:
step 21, acquiring a sample image from the data set, and resizing it to an RGB image of size 256×256;
step 22, inputting the sample image into the feature pyramid network for a single forward pass, and obtaining the network-predicted center point heat maps, upper offset heat maps, lower offset heat maps and joint refinement heat maps corresponding to the multiple human bodies in the image;
step 23, constructing the truth labels of the heat maps from the ground truth joint annotations of the sample image (a construction sketch is given after step 28 below): the chest joint points of the multiple human bodies are used as the human body center points and processed with a Gaussian kernel to obtain a single two-dimensional center point heat map, in which the positions of the Gaussian peaks are the positions of the multiple center points; the upper offset heat maps are constructed from the offsets from each person's chest joint to the joint points of the upper body (e.g., the upper limbs and head) and of the hip, where the offset heat map of a single joint comprises two heat maps for the x and y coordinates respectively; the lower offset heat maps are constructed in a similar manner using the offsets from the hip joint point to the lower limb joint points; the joint refinement heat maps further refine each joint position predicted from the center joint position and the offset values, with each joint class again corresponding to x and y channels, where the response value at a point points to the position of the nearest joint point, and the response range is a circle of radius R around each joint point;
step 24, computing the loss value of the center point heat map from the network-predicted heat map and the truth label heat map using a mean square error loss function, and using this loss value to train the network's center point heat map prediction:

$$M = \sum_{p_j} \left( P(p_j) - G(p_j) \right)^2$$

where $P(p_j)$ denotes the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ denotes the true value at position $p_j$ in the center point heat map constructed from the training labels;

step 25, computing the loss value of the upper offset heat maps using a mean square error loss function, and using this loss value to train the network's upper offset heat map prediction:

$$L_u = \sum_i \sum_{p_j} \left( P_i^u(p_j) - G_i^u(p_j) \right)^2$$

where $i$ indexes the heat maps corresponding to the different joint types, $p_j$ denotes a position on the heat map, $P_i^u$ denotes the predicted upper offset heat map for joint type $i$, and $G_i^u$ denotes the true upper offset heat map for joint type $i$ in the training labels;

step 26, computing the loss value of the lower offset heat maps using a mean square error loss function, and using this loss value to train the network's lower offset heat map prediction:

$$L_d = \sum_i \sum_{p_j} \left( P_i^d(p_j) - G_i^d(p_j) \right)^2$$

where $P_i^d$ denotes the predicted lower offset heat map for joint type $i$, and $G_i^d$ denotes the true lower offset heat map for joint type $i$ in the training labels;

step 27, computing the loss value of the joint refinement heat maps using a mean square error loss function, and using this loss value to train the network's joint refinement heat map prediction:

$$L_o = \sum_i \sum_{p_j} \left( P_i^o(p_j) - G_i^o(p_j) \right)^2$$

where $P_i^o$ denotes the predicted refinement heat map for joint type $i$, and $G_i^o$ denotes the true refinement heat map for joint type $i$ in the training labels;

step 28, the final loss function of the network is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ denote the weights of the respective losses.
By setting these weight parameters, the different subtasks can exert different influences on the parameters of the backbone network, so the backbone network can be optimized as needed.
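Returning to the label construction of step 23, the sketch below shows how the center point heat map and one pair of offset channels might be built with NumPy; the Gaussian sigma, the response radius and the integer pixel coordinates are illustrative assumptions:

```python
import numpy as np

def center_heatmap(chest_points, h, w, sigma=2.0):
    """Single 2D heat map with a Gaussian peak at every chest (center) joint."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for cx, cy in chest_points:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)  # keep the strongest response per pixel
    return heat

def offset_maps(center, joint, h, w, radius=3):
    """x/y offset channel pair: pixels around the center joint store the
    displacement from the center joint to the target joint."""
    off_x = np.zeros((h, w), dtype=np.float32)
    off_y = np.zeros((h, w), dtype=np.float32)
    cx, cy = int(center[0]), int(center[1])
    dx, dy = joint[0] - center[0], joint[1] - center[1]
    y0, y1 = max(0, cy - radius), min(h, cy + radius + 1)
    x0, x1 = max(0, cx - radius), min(w, cx + radius + 1)
    off_x[y0:y1, x0:x1] = dx
    off_y[y0:y1, x0:x1] = dy
    return off_x, off_y
```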
In one possible implementation manner, the step 30 specifically includes:
step 31, obtaining the image to be tested, preprocessing it (for example, resizing it to an RGB image of size 256×256), and then inputting it into the trained feature pyramid network; a single forward pass through the network yields the center point heat map, upper offset heat maps, lower offset heat maps and joint refinement heat maps predicted for the test image;
step 32, using a non-maximum suppression algorithm to find, in the predicted center point heat map, the maximum pixel value positions of the center joint points (e.g., the chest joints) of the multiple human bodies, taking these as the center joint point positions;
step 33, for each center joint point position, reading the response values at that position in the predicted upper offset heat maps of each joint class (the upper body joints and the hip joint), and computing the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, computing the precise positions of the upper body joints and the hip joint from their fuzzy positions and the response values of the corresponding joints in the predicted joint refinement heat maps;
step 35, computing the fuzzy position of each class of lower body joint from the precise position of the hip joint and the response values at the corresponding position in the lower offset heat map of each class of lower body joint;
step 36, computing the precise position of each class of lower body joint from its fuzzy position and the response values at the corresponding positions in the joint refinement heat maps;
step 37, according to the computed final precise positions of the whole-body joints, connecting the joints in order, based on the preset joint order, to form the complete multi-person poses, as shown in fig. 5.
Through the center point and upper offset heat maps, the positions of the other upper body joints and the hip joint are inferred from the center joint point position; through the lower offset heat maps, the positions of the other lower body joints are inferred from the hip joint position; and the refinement heat maps further pinpoint these positions, avoiding the error caused by long-distance offsets and yielding the precise positions of the whole-body joints, as the sketch below illustrates.
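A condensed sketch of this decoding for the upper-body joints of a single person; taking the global heat map maximum instead of full non-maximum suppression, the confidence threshold, and the refinement channel layout are all simplifying assumptions of the sketch:

```python
import numpy as np

def decode_upper_body(center_heat, upper_off, refine, num_joints=9, thresh=0.3):
    """Recover one person's upper-body joints (and hip) from predicted heat maps.

    upper_off and refine are arrays of shape (2*num_joints, H, W) holding
    x/y channel pairs; this layout is an assumption of the sketch."""
    h, w = center_heat.shape
    # step 32: take the heat map maximum as the center (chest) joint position
    cy, cx = np.unravel_index(np.argmax(center_heat), center_heat.shape)
    if center_heat[cy, cx] < thresh:
        return None
    joints = []
    for i in range(num_joints):
        # step 33: fuzzy position = center position + offset read at the center
        fx = cx + upper_off[2 * i, cy, cx]
        fy = cy + upper_off[2 * i + 1, cy, cx]
        ix = int(np.clip(round(fx), 0, w - 1))
        iy = int(np.clip(round(fy), 0, h - 1))
        # step 34: precise position = fuzzy position + refinement response there
        joints.append((fx + refine[2 * i, iy, ix], fy + refine[2 * i + 1, iy, ix]))
    return joints
```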
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that the described embodiments are illustrative only and are not intended to limit the scope of the invention; equivalent modifications and variations made in accordance with the spirit of the invention are intended to be covered by the claims of the present invention.

Claims (2)

1. A single-stage multi-person pose estimation method based on a feature pyramid network, characterized by comprising the following steps:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially reduced resolution, then carrying out inter-channel information fusion, then carrying out up-sampling and feature addition operations on all primary feature maps across the feature branches, starting from the primary feature map with the lowest resolution, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the data set comprises multi-person pose pictures and ground truth annotations of the joint points; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the data set as training labels, and training the feature pyramid network;
step 30, inputting the image to be tested into the trained feature pyramid network, calculating the joint positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and forming the complete human poses from the joint positions;
the step 10 specifically includes:
step 11, creating a plurality of first convolution kernels for extracting the primary features of the image and changing the number of feature channels;
step 12, sequentially cascading convolution modules formed from a plurality of inverted residual units at the outputs of the first convolution kernels, to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature maps extracted by the inverted residual unit modules, setting a group of second convolution kernels for carrying out inter-channel information fusion on the original feature map of the current layer, to obtain the corresponding fused feature map;
step 14, sequentially cascading a plurality of deconvolution modules starting from the lowest-resolution feature map layer, wherein each deconvolution module enlarges the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and then performs a position-wise element summation of the enlarged feature map with the fused feature map of the next layer to obtain an enhanced feature map;
step 15, producing the prediction output from the highest-resolution enhanced feature map using four groups of parallel third convolution kernels;
in the step 20, the training of the feature pyramid network is specifically: calculating the loss values of the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network against the training labels, together with the total loss, and training the feature pyramid network according to these loss values;
the formula for calculating the center point heat map loss value is as follows:
Figure FDA0004151291920000021
wherein P (P) j ) Representing the predicted center point heat map with position p j Predicted value at G (p j ) Representing a center point heat map constructed from training labels at position p j True value at;
the formula for calculating the offset heat map loss value is:
Figure FDA0004151291920000022
wherein i represents heat maps corresponding to different joint types, p j Representing a position on the heat map, P i u An up-shift heat map representing a predicted joint type i,
Figure FDA0004151291920000023
an upper offset heat map true value representing a joint type i in the training label;
the formula for calculating the lower offset heat map loss value is:
Figure FDA0004151291920000024
wherein i represents heat maps corresponding to different joint types, p j Representing a position on the heat map, P i d A lower offset heat map representing a predicted joint type i,
Figure FDA0004151291920000025
representing a true value of a lower offset heat map with a joint type i in the training label;
the formula for calculating the joint refinement heat map loss value is as follows:
Figure FDA0004151291920000026
/>
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure FDA0004151291920000027
a refined heat map representing a predicted joint type i,/->
Figure FDA0004151291920000028
A refinement heat map true value with a joint type i in the training label is represented;
the formula for calculating the total loss is: l=αm+βl u +γL d +δL o
Wherein α, β, γ, δ represent the weight of each loss;
the step 30 specifically includes:
step 31, obtaining the image to be tested, preprocessing it, and then inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one center joint point position in the predicted center point heat map using a non-maximum suppression algorithm;
step 33, for each center joint point position, reading the corresponding response values in the predicted upper offset heat map for each class of upper body joint and the hip joint, and computing the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, computing the precise positions of the upper body joints and the hip joint from their fuzzy positions and the response values of the corresponding joints in the predicted joint refinement heat map;
step 35, computing the fuzzy position of each class of lower body joint from the precise position of the hip joint and the lower offset heat map of each class of lower body joint;
step 36, computing the precise position of each class of lower body joint from its fuzzy position and the response values of the corresponding joints in the predicted joint refinement heat map;
step 37, according to the precise positions of the whole-body joints, connecting the joints in order, based on the preset joint order, to form the complete human poses.
2. The method according to claim 1, characterized in that: in the step 20, the center point heat map is constructed according to the positions of the center joint points, the upper offset heat map is constructed according to the offsets of the center joint points to each type of upper body joint and the hip joint, the lower offset heat map is constructed according to the offsets of the hip joint to each type of lower limb joint, and the joint refinement heat map is constructed according to the positions of the joint points except the center joint points.
CN202011607963.XA 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network Active CN112597955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network

Publications (2)

Publication Number Publication Date
CN112597955A CN112597955A (en) 2021-04-02
CN112597955B true CN112597955B (en) 2023-06-02

Family

ID=75206178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607963.XA Active CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network

Country Status (1)

Country Link
CN (1) CN112597955B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011402B (en) * 2021-04-30 2023-04-25 中国科学院自动化研究所 Primate gesture estimation system and method based on convolutional neural network
CN113343762B (en) * 2021-05-07 2022-03-29 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113420604B (en) * 2021-05-28 2023-04-18 沈春华 Multi-person posture estimation method and device and electronic equipment
CN113297995B (en) * 2021-05-31 2024-01-16 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113673354B (en) * 2021-07-23 2024-02-20 湖南大学 Human body key point detection method based on context information and joint embedding
CN113610015B (en) * 2021-08-11 2023-05-30 华侨大学 Attitude estimation method, device and medium based on end-to-end fast ladder network
CN114529605B (en) * 2022-02-16 2024-05-24 青岛联合创智科技有限公司 Human body three-dimensional posture estimation method based on multi-view fusion


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965719B2 (en) * 2015-11-04 2018-05-08 Nec Corporation Subcategory-aware convolutional neural networks for object detection
US10733431B2 (en) * 2017-12-03 2020-08-04 Facebook, Inc. Systems and methods for optimizing pose estimation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
CN110427890A (en) * 2019-08-05 2019-11-08 华侨大学 More people's Attitude estimation methods based on depth cascade network and mass center differentiation coding
CN111191622A (en) * 2020-01-03 2020-05-22 华南师范大学 Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
CN111832383A (en) * 2020-05-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on 2D human pose estimation with a high-resolution convolutional neural network based on ASPP; Shen Xiaofeng; Wang Chunjia; Modern Computer (No. 13); full text *
Pedestrian detection based on statistical structural gradient features; Ding Zhifeng; China Master's Theses Full-text Database, Information Science and Technology; full text *

Also Published As

Publication number Publication date
CN112597955A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network
CN109325547A (en) Non-motor vehicle image multi-tag classification method, system, equipment and storage medium
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN113807355A (en) Image semantic segmentation method based on coding and decoding structure
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN111476315A (en) Image multi-label identification method based on statistical correlation and graph convolution technology
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN110543890A (en) Deep neural network image matching method based on characteristic pyramid
CN107247952B (en) Deep supervision-based visual saliency detection method for cyclic convolution neural network
CN111612051A (en) Weak supervision target detection method based on graph convolution neural network
CN113011386B (en) Expression recognition method and system based on equally divided characteristic graphs
CN109785409B (en) Image-text data fusion method and system based on attention mechanism
CN113516133A (en) Multi-modal image classification method and system
CN112669343A (en) Zhuang minority nationality clothing segmentation method based on deep learning
CN114764941A (en) Expression recognition method and device and electronic equipment
Tian et al. Real-time semantic segmentation network based on lite reduced atrous spatial pyramid pooling module group
CN112597956B (en) Multi-person pose estimation method based on human body anchor point set and perception enhancement network
CN112418070B (en) Attitude estimation method based on decoupling ladder network
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
Wang et al. Single shot multibox detector with deconvolutional region magnification procedure
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
Sharma et al. Real-Time Word Level Sign Language Recognition Using YOLOv4
CN109871835B (en) Face recognition method based on mutual exclusion regularization technology
CN112818982A (en) Agricultural pest image detection method based on depth feature autocorrelation activation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant