CN112597955A - Single-stage multi-person pose estimation method based on feature pyramid network

Single-stage multi-person pose estimation method based on feature pyramid network

Info

Publication number
CN112597955A
Authority
CN
China
Prior art keywords
joint
heat map
map
feature
pyramid network
Prior art date
Legal status
Granted
Application number
CN202011607963.XA
Other languages
Chinese (zh)
Other versions
CN112597955B (en)
Inventor
骆炎民
张智谦
林躬耕
Current Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Original Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Fujian Gongtian Software Co ltd and Huaqiao University
Priority to CN202011607963.XA
Publication of CN112597955A
Application granted
Publication of CN112597955B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The embodiment of the invention discloses a single-stage multi-person pose estimation method based on a feature pyramid network, relating to the technical field of human pose estimation and comprising the following steps: step 10, building a feature pyramid network based on a MobileNet network, the pyramid network being used for extracting a plurality of primary feature maps with sequentially decreasing resolutions and then performing inter-channel information fusion; step 20, constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from a multi-person pose estimation data set as training labels, and training the feature pyramid network; and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint positions and assembling complete multi-person human poses. According to the embodiment of the invention, information can flow efficiently through the network, which improves the accuracy of human pose estimation; meanwhile, a fast post-processing matching procedure further increases the speed of the multi-person pose estimation algorithm.

Description

Single-stage multi-person pose estimation method based on feature pyramid network
Technical Field
The invention relates to the technical field of human pose estimation, and in particular to a single-stage multi-person pose estimation method based on a feature pyramid network.
Background
Human pose estimation is a key step in understanding human behaviour through computer vision: all joint points of a human body are predicted from a single RGB image and assembled into a correct pose. Accurate prediction of the human pose is important for higher-level computer vision tasks such as human behaviour recognition, human-computer interaction, pedestrian re-identification and abnormal behaviour detection.
Although the field of human pose estimation has developed rapidly, for the multi-person pose estimation task both the top-down and the bottom-up methods are currently multi-stage methods. One problem with these methods is that they are time-consuming and cannot exploit the end-to-end trainability of a CNN. Traditional pose estimation methods pursue accuracy while neglecting the number of network parameters and the inference speed; as a result, pose estimation algorithms are difficult to deploy in practice and their economic benefit is greatly reduced.
In terms of network architecture design, Howard A, Zhmoginov A, Chen L C, et al. proposed a lightweight network architecture named MobileNet in the paper "MobileNetV2: Inverted Residuals and Linear Bottlenecks" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018), which compresses the computation of the ordinary 3x3 convolution by replacing it with a 3x3 depthwise separable convolution followed by a 1x1 pointwise convolution. By using an inverted residual unit, i.e. first a 1x1 convolution to expand the dimension of the input feature map, then a 3x3 depthwise separable convolution, and finally a 1x1 convolution to reduce the dimension, more feature information can be retained and the expressive power of the model is preserved. However, for human pose estimation this network lacks the fusion and application of multi-scale features; multi-scale features have shown excellent results in tasks such as segmentation and detection, and in human pose estimation they also markedly improve the detection accuracy for people and joint points of different scales in a picture.
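As a rough illustration of the computation savings described above (not part of the patent), the following PyTorch sketch compares the parameter count of an ordinary 3x3 convolution with that of the depthwise separable factorization; the channel sizes are arbitrary assumptions:

```python
import torch
import torch.nn as nn

cin, cout = 64, 128  # illustrative channel counts

# ordinary 3x3 convolution
standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

# MobileNet-style factorization: 3x3 depthwise + 1x1 pointwise
separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # depthwise: one filter per channel
    nn.Conv2d(cin, cout, kernel_size=1),                        # pointwise: mixes channels
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(standard), n_params(separable))  # about 73.9k vs 9.0k parameters

x = torch.randn(1, cin, 56, 56)
assert standard(x).shape == separable(x).shape  # same output shape
```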
In RGB-image-based pose estimation, Nie, Xuecheng, et al. proposed a single-stage network for pose estimation in the paper "Single-Stage Multi-Person Pose Machines" (Proceedings of the IEEE International Conference on Computer Vision, 2019), which processes the joint points of a human body hierarchically and places a central joint point in the first hierarchy. The second-level joints are the trunk joints, including the neck, shoulders and hips. The third-level joints include the head, elbows and knees, and the fourth-level joints include the wrists and ankles. In this way the prediction pressure on the network is relieved, since each key point depends only on its adjacent joints. However, when a joint point of the previous level is occluded or invisible, prediction of the next level of joint points may fail, and there is also a long-distance offset problem; these two problems limit the accuracy of human pose estimation.
In patent publication No. CN108229445A, Shenzhen Wei Tei Science and Technology Limited discloses a multi-person pose estimation method based on a cascaded pyramid network, which locates key points within the bounding box of each person using the cascaded pyramid network: a global network locates the simple key points, and a refinement network handles the difficult key points by integrating the feature representations from all levels of the global network. The method exploits multi-scale features, but because the network is complex and multi-stage its efficiency is lower than that of a single-stage method.
Therefore, how to provide a single-stage multi-person pose estimation method based on a feature pyramid network that achieves faster and more accurate single-stage multi-person pose estimation has become a problem to be solved urgently.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a single-stage multi-person pose estimation method based on a feature pyramid network, thereby improving the speed and efficiency of human pose estimation.
In order to solve the technical problem, the embodiment of the invention adopts the following technical scheme:
a single-stage multi-person posture estimation method based on a feature pyramid network comprises the following steps:
step 10, building a characteristic pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary characteristic graphs with sequentially reduced resolution, then carrying out information fusion among channels, then taking the primary characteristic graph with the lowest resolution as a starting point, carrying out up-sampling and characteristic addition operation on all the primary characteristic graphs among characteristic branches, and finally carrying out prediction output;
step 20, acquiring a multi-person posture estimation data set, wherein the multi-person posture estimation data set comprises multi-person posture pictures and joint point ground truth value labels; constructing a central point heat map, an upper offset heat map, a lower offset heat map and a joint thinning heat map by using the multi-person posture estimation data set as training labels, and training the characteristic pyramid network;
and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint position according to the output central point heat map, the upper offset heat map, the lower offset heat map and the joint thinning heat map, and forming a complete human body posture according to the joint position.
Further, the step 10 specifically comprises:
step 11, creating a plurality of first convolution kernels for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading, at the output ends of the plurality of first convolution kernels, convolution modules formed by a plurality of inverted residual units, so as to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature map extracted by the inverted residual unit modules, setting a group of second convolution kernels for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, taking the feature map level with the lowest resolution as a starting point, sequentially cascading a plurality of deconvolution modules, which are used for enlarging the resolution of the fused feature map of the current layer to that of the fused feature map of the next layer to obtain an enlarged feature map, and then performing a position-wise element summation of the enlarged feature map and the fused feature map of the next layer to obtain an enhanced feature map;
and step 15, producing the prediction output from the enhanced feature map with the highest resolution using four groups of parallel third convolution kernels.
Further, in the step 20, training the feature pyramid network specifically comprises: calculating the loss values between the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network and the corresponding training labels, as well as the total loss, and then training the feature pyramid network according to the loss values;
the formula for calculating the center point heat map loss value is:

$$M = \sum_{p_j} \left(P(p_j) - G(p_j)\right)^2$$

where $P(p_j)$ represents the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ represents the ground-truth value at position $p_j$ in the center point heat map constructed from the training labels;
the formula for calculating the upper offset heat map loss value is:

$$L_u = \sum_i \sum_{p_j} \left(P_i^u(p_j) - G_i^u(p_j)\right)^2$$

where $i$ indexes the heat maps of the different joint types, $p_j$ represents a position on the heat map, $P_i^u$ represents the predicted upper offset heat map of joint type $i$, and $G_i^u$ represents the ground-truth upper offset heat map of joint type $i$ in the training labels;
the formula for calculating the lower offset heat map loss value is:

$$L_d = \sum_i \sum_{p_j} \left(P_i^d(p_j) - G_i^d(p_j)\right)^2$$

where $P_i^d$ represents the predicted lower offset heat map of joint type $i$, and $G_i^d$ represents the ground-truth lower offset heat map of joint type $i$ in the training labels;
the formula for calculating the joint refinement heat map loss value is:

$$L_o = \sum_i \sum_{p_j} \left(P_i^o(p_j) - G_i^o(p_j)\right)^2$$

where $P_i^o$ represents the predicted joint refinement heat map of joint type $i$, and $G_i^o$ represents the ground-truth joint refinement heat map of joint type $i$ in the training labels;
the formula for calculating the total loss is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ represent the weights of the respective losses.
Further, in step 20, the center point heat map is constructed from the positions of the central joint points, the upper offset heat map is constructed from the offsets from the central joint point to each upper body joint and to the hip joint, the lower offset heat map is constructed from the offsets from the hip joint to each lower limb joint, and the joint refinement heat map is constructed from the positions of all joint points other than the central joint point.
Further, the step 30 specifically comprises:
step 31, acquiring an image to be detected, preprocessing it, and inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one central joint point position in the predicted center point heat map using a non-maximum suppression algorithm;
step 33, according to the central joint point position, reading the response value of each class of upper body joint and of the hip joint in the predicted upper offset heat map, and calculating the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, calculating the precise position of each class of upper body joint and of the hip joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, calculating the fuzzy position of each class of lower body joint from the precise position of the hip joint and the lower offset heat map of each class of lower body joint;
step 36, calculating the precise position of each class of lower body joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
and step 37, connecting all joints in a preset joint order according to the precise positions of the whole-body joints to form a complete human pose.
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
1. by building a feature pyramid network based on a MobileNet network and then performing up-sampling and feature addition operations between the feature branches, the number of parameters of the deep convolutional neural network can be effectively reduced, information can flow efficiently through the network, and joint point information can be fused with spatial and semantic information, which greatly improves the accuracy of human pose estimation;
2. through the single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions, which solves the problems of slow training and long inference time of traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network; that is, the human pose can be estimated within one network without further post-processing operations, which greatly improves the efficiency of human pose estimation.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly, and that the above and other objects, features and advantages of the present invention may become more readily apparent, embodiments of the invention are described below.
Drawings
The invention will be further described with reference to the following examples and the accompanying drawings.
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature pyramid network in an embodiment of the invention;
FIG. 3 is a schematic diagram of an inverse residual unit according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the output of a feature pyramid network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a result of human body posture estimation according to an embodiment of the present invention.
Detailed Description
The technical scheme in the embodiment of the application has the following general idea:
Firstly, a feature pyramid network is built based on MobileNet, so that the number of parameters of the deep convolutional neural network is greatly reduced while the loss of accuracy remains within an acceptable range. Secondly, several 3x3 deconvolutions are arranged between feature maps of different levels in the feature pyramid network to recover the features, and a feature addition operation is performed with the feature map of the previous level, which further improves the flow and fusion of information between features and allows the network to effectively use and fuse spatial and semantic information. Then, in addition to the joint point prediction and offset prediction, a joint refinement heat map prediction is added: since a network is often inaccurate for long-distance offset prediction, an additional refinement heat map is predicted at each joint point to refine the joint positions, so that the position of each joint of the human body becomes more precise, the accuracy of pose estimation is greatly improved, and an accurate pose reference is provided for higher-level tasks such as behaviour recognition, pedestrian re-identification and abnormal behaviour detection.
The embodiment of the present application provides a single-stage multi-person pose estimation method based on a feature pyramid network; please refer to fig. 1, the method comprises:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially decreasing resolutions, then performing inter-channel information fusion, then, taking the primary feature map with the lowest resolution as a starting point, performing up-sampling and feature addition operations on all the primary feature maps across the feature branches, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the multi-person pose estimation data set comprises multi-person pose pictures and ground-truth joint point labels; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the multi-person pose estimation data set as training labels, and training the feature pyramid network;
A large number of sample images (multi-person pose pictures) are acquired in advance; after the joint points of the sample images are annotated, the images are divided into a training set, a validation set and a test set, the training set is input into the deep convolutional neural network for training, the trained network is verified with the validation set and the test set, and whether the loss value reaches a preset threshold is used to judge whether training is complete.
step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint point positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and assembling complete human poses from the joint point positions.
Through the single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions, which solves the problems of slow training and long inference time of traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network; that is, the human pose can be estimated within one network without further post-processing operations, which greatly improves the efficiency of human pose estimation.
Referring to fig. 2, in a possible implementation, in the step 10, building the feature pyramid network based on a MobileNet network specifically comprises:
step 11, creating a plurality of first convolution kernels (for example, convolution kernels of size 3 × 3 × 3) for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading, at the output ends of the plurality of first convolution kernels, convolution modules formed by a plurality of inverted residual units, so as to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
As shown in fig. 3, a single inverted residual unit is constructed as follows:
step 121, first, features are extracted by a number of convolution kernels of spatial size 1x1 (a pointwise convolution module), which simultaneously increases the number of feature channels;
step 122, a ReLU6 activation function is added after this convolution kernel;
step 123, a depthwise separable convolution of size 3x3 (a depthwise convolution module) is added after the ReLU6 activation function to extract features;
step 124, a ReLU6 activation function is added after the separable convolution;
step 125, a 1x1 convolution kernel (a pointwise convolution module) is added after the ReLU6 activation function to reduce the number of feature channels;
step 126, a linear activation function is added after the 1x1 convolution kernel;
step 127, the feature map input in step 121 is carried to the feature map generated in step 126 through an identity mapping, and an element-wise feature addition is performed, as in the sketch below;
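The following PyTorch sketch assembles steps 121 to 127 into a module; it is an illustrative reading of the unit, and the expansion factor, channel width and the omission of batch normalization are assumptions, not specified by this embodiment:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Steps 121-127 as one module (expansion factor and the absence of
    batch normalization are illustrative assumptions)."""
    def __init__(self, channels: int, expand: int = 6):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),  # step 121: 1x1 pointwise, more channels
            nn.ReLU6(inplace=True),                                  # step 122
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                    # step 123: 3x3 depthwise
            nn.ReLU6(inplace=True),                                  # step 124
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),  # step 125: 1x1 pointwise, fewer channels
            # step 126: linear activation, i.e. no non-linearity after the projection
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # step 127: identity mapping plus element-wise feature addition
        return x + self.block(x)

# quick check: the shape is preserved, so the skip connection is valid
y = InvertedResidual(32)(torch.randn(1, 32, 64, 64))
assert y.shape == (1, 32, 64, 64)
```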
The above steps 11 and 12 build the MobileNet network structure; on this basis, the present embodiment further performs information fusion and feature enhancement on the feature branches through the following steps:
step 13, after each layer of original feature map extracted by the inverted residual unit modules, a group of second convolution kernels (for example, convolution kernels of size 1 × 1) is set for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, taking the feature map level with the lowest resolution as a starting point, a plurality of deconvolution modules are sequentially cascaded; each enlarges the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and a position-wise element summation of the enlarged feature map and the fused feature map of the next layer then performs further feature fusion to obtain an enhanced feature map (see the sketch after this list);
and step 15, the final prediction output of the four types of heat maps is produced from the enhanced feature map with the highest resolution using four groups of parallel third convolution kernels.
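A minimal sketch of the inter-channel fusion and top-down enhancement of steps 13 and 14, assuming three pyramid levels and illustrative channel widths (none of these numbers are fixed by the embodiment; the step 15 heads are sketched after the channel layout below):

```python
import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    """Steps 13-14: 1x1 lateral convolutions fuse the channels of each level,
    then deconvolutions walk from the lowest resolution upward, adding
    element-wise into the next level (channel widths are assumptions)."""
    def __init__(self, in_channels=(32, 64, 160), width=64):
        super().__init__()
        # step 13: one group of second convolution kernels per pyramid level
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        # step 14: deconvolution modules, one per upward transition
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1)
            for _ in in_channels[:-1])

    def forward(self, feats):
        # feats: primary feature maps ordered from high to low resolution
        fused = [l(f) for l, f in zip(self.lateral, feats)]
        out = fused[-1]                       # start at the lowest resolution
        for i in range(len(fused) - 2, -1, -1):
            out = self.up[i](out) + fused[i]  # enlarge, then position-wise sum
        return out                            # enhanced map at the highest resolution
```

A ConvTranspose2d with kernel 4, stride 2 and padding 1 exactly doubles the spatial resolution, which matches the requirement of enlarging each fused map to the resolution of the next level when adjacent levels differ by a factor of 2.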
In one possible embodiment, the upper body joints are divided into 9 classes: the head, neck and chest joints and the left and right shoulder, elbow and wrist joints, wherein the chest joint is defined as the central joint point; the lower body joints are divided into 7 classes: the hip joint and 6 lower limb joints (the left and right hip, knee and ankle joints). The center point heat map is constructed from the positions of the central joint points; the upper offset heat map is constructed from the offsets from the central joint point to each of the other 8 classes of upper body joints and to the hip joint; the lower offset heat map is constructed from the offsets from the hip joint to each of the other 6 classes of lower body joints; the joint refinement heat map is constructed from the positions of all joint points other than the central joint point. The numbers of heat map channels predicted by the pyramid network are 1, 18, 12 and 30, corresponding respectively to 1 center point heat map, 18 upper offset heat maps (x- and y-channel offset heat maps for each joint class), 12 lower offset heat maps (x- and y-channel offset heat maps for each joint class) and 30 joint refinement heat maps (x and y channels for every joint other than the center point); the output heat maps are shown in fig. 4. In the four parallel groups of third convolution kernels, the number of kernels in each group corresponds to the number of heat map channels, i.e. 1, 18, 12 and 30 respectively.
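The four parallel groups of step 15 can be sketched as follows; the input width of the enhanced feature map and the key names are assumed values:

```python
import torch.nn as nn

width = 64  # assumed channel width of the highest-resolution enhanced feature map
heads = nn.ModuleDict({
    "center":    nn.Conv2d(width, 1, 1),   # 1 center point heat map
    "upper_off": nn.Conv2d(width, 18, 1),  # (8 upper body joints + hip) x (x, y)
    "lower_off": nn.Conv2d(width, 12, 1),  # 6 lower limb joints x (x, y)
    "refine":    nn.Conv2d(width, 30, 1),  # 15 non-center joints x (x, y)
})
```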
By building the feature pyramid network based on a MobileNet network and then performing up-sampling and feature addition operations between the feature branches, the number of parameters of the deep convolutional neural network can be effectively reduced, information can flow efficiently through the network, and joint point information can be fused with spatial and semantic information, which greatly improves the accuracy of human pose estimation.
In a possible implementation, the step 20 specifically comprises:
step 21, obtaining a sample image from the data set and resizing it into an RGB image of size 256 × 256;
step 22, inputting the sample image into the feature pyramid network for a single forward pass, so as to obtain the center point heat map, upper offset heat maps, lower offset heat maps and refinement heat maps predicted by the network for the multiple human bodies in the image;
step 23, constructing the ground-truth label of each heat map from the ground-truth annotations of the human body joint points of the sample image (a label-construction sketch follows this list of steps): the chest joint points of the several human bodies are taken as the center points of the human bodies and processed with a Gaussian kernel to obtain a single two-dimensional center point heat map, in which the positions of the Gaussian peaks are the positions of the several center points; the upper offset heat map is constructed with the offsets from the chest joint to each class of upper body joint point (e.g., the upper limbs and head) and to the hip of each person, the offset heat map of a single joint class comprising two heat maps for the x and y coordinates respectively; the lower offset heat map is constructed in a similar manner, using the offsets from the hip joint points to the lower limb joint points; the joint refinement heat map is used to further refine the joint point positions predicted from the central joint point position and the offset values; each class of joint refinement heat map likewise corresponds to x and y channels, the response value at a point points to the position of the nearest joint point, and the region with response is a circle of radius R around each joint point;
step 24, the loss value of the center point heat map is obtained by applying the mean squared error loss function to the network-predicted heat map and the ground-truth label heat map, and the network-predicted center point heat map is trained with this loss value:

$$M = \sum_{p_j} \left(P(p_j) - G(p_j)\right)^2$$

where $P(p_j)$ represents the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ represents the ground-truth value at position $p_j$ in the center point heat map constructed from the training labels;
step 25, the loss value of the upper offset heat map is calculated with the mean squared error loss function, and the network-predicted upper offset heat map is trained with this loss value:

$$L_u = \sum_i \sum_{p_j} \left(P_i^u(p_j) - G_i^u(p_j)\right)^2$$

where $i$ indexes the heat maps of the different joint types, $p_j$ represents a position on the heat map, $P_i^u$ represents the predicted upper offset heat map of joint type $i$, and $G_i^u$ represents the ground-truth upper offset heat map of joint type $i$ in the training labels;
step 26, the loss value of the lower offset heat map is calculated with the mean squared error loss function, and the network-predicted lower offset heat map is trained with this loss value:

$$L_d = \sum_i \sum_{p_j} \left(P_i^d(p_j) - G_i^d(p_j)\right)^2$$

where $P_i^d$ represents the predicted lower offset heat map of joint type $i$, and $G_i^d$ represents the ground-truth lower offset heat map of joint type $i$ in the training labels;
step 27, the loss value of the joint refinement heat map is calculated with the mean squared error loss function, and the network-predicted joint refinement heat map is trained with this loss value:

$$L_o = \sum_i \sum_{p_j} \left(P_i^o(p_j) - G_i^o(p_j)\right)^2$$

where $P_i^o$ represents the predicted joint refinement heat map of joint type $i$, and $G_i^o$ represents the ground-truth joint refinement heat map of joint type $i$ in the training labels;
step 28, the final loss function of the network is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ represent the weights of the respective losses.
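Looking back at step 23, the center point label construction can be sketched as follows, assuming a standard two-dimensional Gaussian (the value of sigma and the helper name are assumptions, not from the patent):

```python
import numpy as np

def center_point_heatmap(centers, height, width, sigma=2.0):
    """One 2D heat map whose Gaussian peaks sit at the chest (center)
    joint of every person; `centers` holds ground-truth (x, y) pixels."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the strongest response per pixel
    return heatmap

# example: two people with centers at (40, 30) and (90, 70) on a 128x128 map
labels = center_point_heatmap([(40, 30), (90, 70)], 128, 128)
```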
Through the setting of the weight parameters α, β, γ, δ, the different subtasks can exert different influences on the parameters of the backbone network, so that the backbone network can be optimized according to the requirements; a sketch of the combined loss follows.
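A minimal sketch of steps 24 to 28, assuming the predicted and ground-truth heat maps are packed into dictionaries keyed as in the earlier sketches (the key names and default weights are assumptions):

```python
import torch.nn.functional as F

def total_loss(pred, gt, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """L = alpha*M + beta*L_u + gamma*L_d + delta*L_o with sum-of-squares terms."""
    M  = F.mse_loss(pred["center"],    gt["center"],    reduction="sum")  # step 24
    Lu = F.mse_loss(pred["upper_off"], gt["upper_off"], reduction="sum")  # step 25
    Ld = F.mse_loss(pred["lower_off"], gt["lower_off"], reduction="sum")  # step 26
    Lo = F.mse_loss(pred["refine"],    gt["refine"],    reduction="sum")  # step 27
    return alpha * M + beta * Lu + gamma * Ld + delta * Lo                # step 28
```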
In a possible implementation, the step 30 specifically comprises:
step 31, an image to be detected is acquired and preprocessed (for example, resized into an RGB image of size 256 × 256), then input into the trained feature pyramid network; a single forward pass through the feature pyramid network yields the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted for the RGB image to be detected;
step 32, in the predicted center point heat map, the maximum pixel value positions of the central joint points (the chest joints) of the several human bodies are found with a non-maximum suppression algorithm and taken as the central joint point positions;
step 33, according to each central joint point position, the response value at the corresponding position in the predicted upper offset heat map of each joint class (the upper body joints and the hip joint) is read, and the fuzzy position of each class of upper body joint and of the hip joint is calculated from these response values;
step 34, the precise position of each class of upper body joint and of the hip joint is calculated from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, the fuzzy position of each class of lower body joint is calculated from the precise position of the hip joint and the response value at the corresponding position in the lower offset heat map of that joint class;
step 36, the precise position of each class of lower body joint is calculated from its fuzzy position and the response value at the corresponding position in the joint refinement heat map;
and step 37, according to the final precise positions of the whole-body joints thus calculated, all joints are connected in a preset joint order to form complete multi-person human poses, as shown in fig. 5.
Through the central joint point and the upper offset heat map, the positions of the other upper body joints and the hip joint are inferred from the central joint point position; through the lower offset heat map, the positions of the other lower body joints are inferred from the hip joint position; meanwhile, the refinement heat map localizes these positions precisely and avoids the errors caused by long-distance offsets, so that the precise positions of all joints of the whole body are obtained, as sketched below.
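A compact sketch of the decoding in steps 31 to 37, assuming batch-first tensors with the channel layout of the earlier head sketch; the peak threshold, the hip channel index, the helper name and the alignment between offset and refinement channels are all assumptions:

```python
import torch
import torch.nn.functional as F

def _read(m, j, x, y):
    """(x, y)-channel pair of joint type j in heat map m, sampled at pixel (x, y),
    with coordinates rounded and clamped to the map bounds."""
    h, w = m.shape[-2:]
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return float(m[0, 2 * j, yi, xi]), float(m[0, 2 * j + 1, yi, xi])

def decode_poses(center, upper_off, lower_off, refine, thresh=0.3, hip=8):
    # step 32: non-maximum suppression via 3x3 max pooling keeps local peaks
    peaks = (center == F.max_pool2d(center, 3, 1, 1)) & (center > thresh)
    poses = []
    for cy, cx in torch.nonzero(peaks[0, 0]).tolist():
        joints = []
        n_upper = upper_off.shape[1] // 2
        for j in range(n_upper):                   # steps 33-34: upper body + hip
            dx, dy = _read(upper_off, j, cx, cy)   # fuzzy position from the offset
            bx, by = cx + dx, cy + dy
            rx, ry = _read(refine, j, bx, by)      # refinement residual
            joints.append((bx + rx, by + ry))      # precise position
        hx, hy = joints[hip]                       # refined hip seeds the lower body
        for j in range(lower_off.shape[1] // 2):   # steps 35-36: lower limb joints
            dx, dy = _read(lower_off, j, hx, hy)
            bx, by = hx + dx, hy + dy
            rx, ry = _read(refine, n_upper + j, bx, by)
            joints.append((bx + rx, by + ry))
        poses.append(joints)                       # step 37: connect in a preset order
    return poses
```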
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (5)

1. A single-stage multi-person pose estimation method based on a feature pyramid network, characterized by comprising the following steps:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially decreasing resolutions, then performing inter-channel information fusion, then, taking the primary feature map with the lowest resolution as a starting point, performing up-sampling and feature addition operations on all the primary feature maps across the feature branches, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the multi-person pose estimation data set comprises multi-person pose pictures and ground-truth joint point labels; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the multi-person pose estimation data set as training labels, and training the feature pyramid network;
and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and assembling complete human poses from the joint positions.
2. The method according to claim 1, characterized in that the step 10 specifically comprises:
step 11, creating a plurality of first convolution kernels for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading, at the output ends of the plurality of first convolution kernels, convolution modules formed by a plurality of inverted residual units, so as to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature map extracted by the inverted residual unit modules, setting a group of second convolution kernels for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, taking the feature map level with the lowest resolution as a starting point, sequentially cascading a plurality of deconvolution modules, which are used for enlarging the resolution of the fused feature map of the current layer to that of the fused feature map of the next layer to obtain an enlarged feature map, and then performing a position-wise element summation of the enlarged feature map and the fused feature map of the next layer to obtain an enhanced feature map;
and step 15, producing the prediction output from the enhanced feature map with the highest resolution using four groups of parallel third convolution kernels.
3. The method of claim 1, wherein: in the step 20, training the feature pyramid network specifically comprises: calculating the loss values between the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network and the corresponding training labels, as well as the total loss, and then training the feature pyramid network according to the loss values;
the formula for calculating the center point heat map loss value is:

$$M = \sum_{p_j} \left(P(p_j) - G(p_j)\right)^2$$

wherein $P(p_j)$ represents the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ represents the ground-truth value at position $p_j$ in the center point heat map constructed from the training labels;
the formula for calculating the upper offset heat map loss value is:

$$L_u = \sum_i \sum_{p_j} \left(P_i^u(p_j) - G_i^u(p_j)\right)^2$$

wherein $i$ indexes the heat maps of the different joint types, $p_j$ represents a position on the heat map, $P_i^u$ represents the predicted upper offset heat map of joint type $i$, and $G_i^u$ represents the ground-truth upper offset heat map of joint type $i$ in the training labels;
the formula for calculating the lower offset heat map loss value is:

$$L_d = \sum_i \sum_{p_j} \left(P_i^d(p_j) - G_i^d(p_j)\right)^2$$

wherein $P_i^d$ represents the predicted lower offset heat map of joint type $i$, and $G_i^d$ represents the ground-truth lower offset heat map of joint type $i$ in the training labels;
the formula for calculating the joint refinement heat map loss value is:

$$L_o = \sum_i \sum_{p_j} \left(P_i^o(p_j) - G_i^o(p_j)\right)^2$$

wherein $P_i^o$ represents the predicted joint refinement heat map of joint type $i$, and $G_i^o$ represents the ground-truth joint refinement heat map of joint type $i$ in the training labels;
the formula for calculating the total loss is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

wherein $\alpha$, $\beta$, $\gamma$, $\delta$ represent the weights of the respective losses.
4. The method of claim 1, wherein: in step 20, the center point heat map is constructed from the positions of the central joint points, the upper offset heat map is constructed from the offsets from the central joint point to each upper body joint and to the hip joint, the lower offset heat map is constructed from the offsets from the hip joint to each lower limb joint, and the joint refinement heat map is constructed from the positions of all joint points other than the central joint point.
5. The method according to claim 4, wherein the step 30 specifically comprises:
step 31, acquiring an image to be detected, preprocessing it, and inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one central joint point position in the predicted center point heat map using a non-maximum suppression algorithm;
step 33, according to the central joint point position, reading the response value of each class of upper body joint and of the hip joint in the predicted upper offset heat map, and calculating the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, calculating the precise position of each class of upper body joint and of the hip joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, calculating the fuzzy position of each class of lower body joint from the precise position of the hip joint and the lower offset heat map of each class of lower body joint;
step 36, calculating the precise position of each class of lower body joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
and step 37, connecting all joints in a preset joint order according to the precise positions of the whole-body joints to form a complete human pose.
CN202011607963.XA 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network Active CN112597955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network

Publications (2)

Publication Number Publication Date
CN112597955A true CN112597955A (en) 2021-04-02
CN112597955B CN112597955B (en) 2023-06-02

Family

ID=75206178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607963.XA Active CN112597955B (en) 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network

Country Status (1)

Country Link
CN (1) CN112597955B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011402A (en) * 2021-04-30 2021-06-22 中国科学院自动化研究所 System and method for estimating postures of primates based on convolutional neural network
CN113297995A (en) * 2021-05-31 2021-08-24 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113343762A (en) * 2021-05-07 2021-09-03 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113420604A (en) * 2021-05-28 2021-09-21 沈春华 Multi-person posture estimation method and device and electronic equipment
CN113610015A (en) * 2021-08-11 2021-11-05 华侨大学 Attitude estimation method, device and medium based on end-to-end rapid ladder network
CN113673354A (en) * 2021-07-23 2021-11-19 湖南大学 Human body key point detection method based on context information and combined embedding

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124415A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Subcategory-aware convolutional neural networks for object detection
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
US20190171871A1 (en) * 2017-12-03 2019-06-06 Facebook, Inc. Systems and Methods for Optimizing Pose Estimation
CN110427890A (en) * 2019-08-05 2019-11-08 华侨大学 More people's Attitude estimation methods based on depth cascade network and mass center differentiation coding
CN111191622A (en) * 2020-01-03 2020-05-22 华南师范大学 Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
CN111832383A (en) * 2020-05-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124415A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Subcategory-aware convolutional neural networks for object detection
US20190171871A1 (en) * 2017-12-03 2019-06-06 Facebook, Inc. Systems and Methods for Optimizing Pose Estimation
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
CN110427890A (en) * 2019-08-05 2019-11-08 华侨大学 More people's Attitude estimation methods based on depth cascade network and mass center differentiation coding
CN111191622A (en) * 2020-01-03 2020-05-22 华南师范大学 Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
CN111832383A (en) * 2020-05-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
定志锋: "Pedestrian detection based on statistical structural gradient features", China Master's Theses Full-text Database, Information Science and Technology Series *
申小凤; 王春佳: "Research on 2D human pose estimation with a high-resolution convolutional neural network based on ASPP", Modern Computer

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011402A (en) * 2021-04-30 2021-06-22 中国科学院自动化研究所 System and method for estimating postures of primates based on convolutional neural network
CN113343762A (en) * 2021-05-07 2021-09-03 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113343762B (en) * 2021-05-07 2022-03-29 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113420604A (en) * 2021-05-28 2021-09-21 沈春华 Multi-person posture estimation method and device and electronic equipment
CN113297995A (en) * 2021-05-31 2021-08-24 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113297995B (en) * 2021-05-31 2024-01-16 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113673354A (en) * 2021-07-23 2021-11-19 湖南大学 Human body key point detection method based on context information and combined embedding
CN113673354B (en) * 2021-07-23 2024-02-20 湖南大学 Human body key point detection method based on context information and joint embedding
CN113610015A (en) * 2021-08-11 2021-11-05 华侨大学 Attitude estimation method, device and medium based on end-to-end rapid ladder network
CN113610015B (en) * 2021-08-11 2023-05-30 华侨大学 Attitude estimation method, device and medium based on end-to-end fast ladder network

Also Published As

Publication number Publication date
CN112597955B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN112597955A (en) Single-stage multi-person attitude estimation method based on feature pyramid network
CN108170816B (en) Intelligent visual question-answering method based on deep neural network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN109919085B (en) Human-human interaction behavior identification method based on light-weight convolutional neural network
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN110427890B (en) Multi-person attitude estimation method based on deep cascade network and centroid differentiation coding
CN110334584B (en) Gesture recognition method based on regional full convolution network
CN109785409B (en) Image-text data fusion method and system based on attention mechanism
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
CN111709268A (en) Human hand posture estimation method and device based on human hand structure guidance in depth image
CN111507359A (en) Self-adaptive weighting fusion method of image feature pyramid
CN112597956B (en) Multi-person gesture estimation method based on human body anchor point set and perception enhancement network
CN111914595B (en) Human hand three-dimensional attitude estimation method and device based on color image
Gheitasi et al. Estimation of hand skeletal postures by using deep convolutional neural networks
CN111368637B (en) Transfer robot target identification method based on multi-mask convolutional neural network
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
CN116460851A (en) Mechanical arm assembly control method for visual migration
CN113610015B (en) Attitude estimation method, device and medium based on end-to-end fast ladder network
CN115424012A (en) Lightweight image semantic segmentation method based on context information
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
Sharma et al. Real-Time Word Level Sign Language Recognition Using YOLOv4
CN113673540A (en) Target detection method based on positioning information guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant