CN112597955A - Single-stage multi-person posture estimation method based on feature pyramid network - Google Patents
Single-stage multi-person posture estimation method based on feature pyramid network
- Publication number
- CN112597955A (application number CN202011607963.XA)
- Authority
- CN
- China
- Prior art keywords
- joint
- heat map
- map
- feature
- pyramid network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The embodiment of the invention discloses a single-stage multi-person posture estimation method based on a feature pyramid network, relating to the technical field of human posture estimation and comprising the following steps: step 10, building a feature pyramid network based on a MobileNet network, the pyramid network extracting a plurality of primary feature maps with successively reduced resolutions and then performing inter-channel information fusion; step 20, constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from a multi-person posture estimation data set as training labels, and training the feature pyramid network; and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint positions and forming complete multi-person human body postures. In the embodiment of the invention, information flows efficiently through the network, which improves the accuracy of human body posture estimation; meanwhile, a fast post-processing matching procedure further increases the processing speed of the multi-person posture estimation algorithm.
Description
Technical Field
The invention relates to the technical field of human body posture estimation, in particular to a single-stage multi-person posture estimation method based on a feature pyramid network.
Background
Human body posture estimation is a key step toward understanding human behavior through computer vision: all joint points of a human body are predicted from a single RGB image and assembled into a correct posture. Accurate prediction of human posture is important for higher-level computer vision tasks such as human behavior recognition, human-computer interaction, pedestrian re-identification and abnormal behavior detection.
Although the field of human body posture estimation has developed rapidly, for the multi-person posture estimation task both top-down and bottom-up methods are currently multi-stage. One problem with such methods is that they are time-consuming and cannot exploit the end-to-end trainability of a CNN. Moreover, traditional posture estimation methods pursue accuracy while neglecting network parameter count and inference speed, which makes the algorithms hard to deploy in practice and greatly reduces their economic benefit.
In terms of network architecture design, Sandler M, Howard A, Zhmoginov A, Chen L C, et al. proposed a lightweight network architecture named MobileNet in the paper "MobileNetV2: Inverted Residuals and Linear Bottlenecks" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018), which compresses the computation of an ordinary 3x3 convolution by replacing it with a 3x3 depthwise separable convolution followed by a 1x1 pointwise convolution. By using an inverted residual unit, that is, first enlarging the channel dimension of the input feature map with a 1x1 convolution, then applying a 3x3 depthwise separable convolution, and finally reducing the dimension with another 1x1 convolution, more feature information can be retained and the expressive capability of the model is preserved. However, for human body posture estimation this network lacks the fusion and application of multi-scale features; multi-scale features have shown excellent results in tasks such as segmentation and detection, and in posture estimation they also markedly improve the detection accuracy for people and joint points of different scales in a picture.
In work on RGB-image-based pose estimation, Nie, Xuecheng, et al. (Proceedings of the IEEE International Conference on Computer Vision, 2019) proposed a single-stage network for pose estimation in the paper "Single-Stage Multi-Person Pose Machines", which processes the joint points of a human body hierarchically, placing a central joint point in the first hierarchy. The second-level joints are the trunk joints, including the neck, shoulders and hips. The third-level joints include the head, elbows and knees, and the fourth-level joints include the wrists and ankles. In this way the prediction pressure on the network is relieved, since each keypoint depends only on its adjacent joints. However, when a joint point in the previous level is occluded or invisible, prediction of the joint points in the next level may fail, and there is also a long-distance offset problem; these two problems limit the accuracy of human body posture estimation.
In the patent with publication No. CN108229445A, Shenzhen Wei Tei Science and Technology Limited discloses a multi-person pose estimation method based on a cascaded pyramid network, which localizes the key points within the bounding box of each person using a cascaded pyramid network: a global network localizes the simple key points, while a refinement network handles the difficult key points by integrating feature representations from all levels of the global network. The method exploits multi-scale features, but because the network is complex and multi-stage, its efficiency is lower than that of a single-stage method.
Therefore, how to provide a single-stage multi-person posture estimation method based on a feature pyramid network that achieves faster and higher-precision single-stage multi-person posture estimation has become a problem to be solved urgently.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a single-stage multi-person posture estimation method based on a feature pyramid network, so as to improve the speed and efficiency of human body posture estimation.
In order to solve the technical problem, the embodiment of the invention adopts the following technical scheme:
a single-stage multi-person posture estimation method based on a feature pyramid network comprises the following steps:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with successively reduced resolutions, then performing inter-channel information fusion, then, starting from the primary feature map with the lowest resolution, performing up-sampling and feature addition operations on all the primary feature maps across the feature branches, and finally producing the prediction output;
step 20, acquiring a multi-person posture estimation data set, wherein the data set comprises multi-person posture pictures and ground-truth joint point labels; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the multi-person posture estimation data set as training labels, and training the feature pyramid network;
and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint positions according to the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and forming complete human body postures from the joint positions.
Further, the step 10 specifically comprises:
step 11, creating a plurality of first convolution kernels for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading convolution modules formed by a plurality of inverted residual units at the output ends of the plurality of first convolution kernels to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease successively;
step 13, after each layer of original feature map extracted by the inverted residual unit modules, setting a group of second convolution kernels for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, starting from the feature map level with the lowest resolution, sequentially cascading a plurality of deconvolution modules that enlarge the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and then performing a position-wise element summation of the enlarged feature map and the fused feature map of the next layer to obtain an enhanced feature map;
and step 15, using four groups of parallel third convolution kernels to produce the prediction output from the enhanced feature map with the highest resolution.
Further, in the step 20, training the feature pyramid network specifically comprises: computing the loss values between the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network and the corresponding training labels, together with the total loss, and then training the feature pyramid network according to the loss values;
the center point heat map loss is M = Σj (P(pj) − G(pj))², where P(pj) denotes the predicted value at position pj of the predicted center point heat map, and G(pj) denotes the true value at position pj of the center point heat map constructed from the training labels;
the upper offset heat map loss is Lu = Σi Σj (Ui(pj) − Ui*(pj))², where i indexes the heat maps of the different joint types, pj denotes a position on the heat map, Ui denotes the predicted upper offset heat map for joint type i, and Ui* denotes the true upper offset heat map for joint type i in the training labels;
the lower offset heat map loss is Ld = Σi Σj (Di(pj) − Di*(pj))², where Di denotes the predicted lower offset heat map for joint type i, and Di* denotes the true lower offset heat map for joint type i in the training labels;
the joint refinement heat map loss is Lo = Σi Σj (Ri(pj) − Ri*(pj))², where Ri denotes the predicted refinement heat map for joint type i, and Ri* denotes the true refinement heat map for joint type i in the training labels;
the formula for calculating the total loss is: L = αM + βLu + γLd + δLo,
where α, β, γ and δ are the weights of the respective losses.
Further, in step 20, the center point heat map is constructed from the positions of the central joint points, the upper offset heat map is constructed from the offsets from the central joint point to each upper body joint and to the hip joints, the lower offset heat map is constructed from the offsets from the hip joints to each lower limb joint, and the joint refinement heat map is constructed from the positions of all joint points other than the central joint point.
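As a minimal sketch, the center point heat map described above can be built with a Gaussian kernel per person; the map size, the sigma value and the `centers` coordinates below are illustrative assumptions, not values given in the patent:

```python
import numpy as np

def center_heatmap(shape, centers, sigma=2.0):
    """One 2D heat map with a Gaussian peak at each person's central (chest) joint."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros(shape)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        hm = np.maximum(hm, g)  # where peaks overlap, keep the stronger response
    return hm

hm = center_heatmap((64, 64), centers=[(20, 15), (45, 50)])
```

The same construction, applied per joint type and per coordinate channel, also yields offset and refinement labels.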
Further, the step 30 specifically includes:
step 31, acquiring an image to be detected, preprocessing it, and inputting the preprocessed image into the trained feature pyramid network to obtain a predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one central joint point position in the predicted central point heat map by using a non-maximum suppression algorithm;
step 33, according to the position of the central joint point, acquiring a response value corresponding to each type of upper body joint and hip joint in the predicted upper offset heat map, and according to the response value, calculating to obtain fuzzy positions of each type of upper body joint and hip joint;
step 34, calculating to obtain the accurate position of each type of upper body joint and hip joint according to the fuzzy position of each type of upper body joint and hip joint and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, calculating to obtain a fuzzy position of each type of lower body joint according to the accurate position of the hip joint and the lower offset heat map of each type of lower body joint;
step 36, calculating to obtain the accurate position of each type of lower body joint according to the fuzzy position of each type of lower body joint and the predicted response value of the corresponding joint in the joint refinement heat map;
and step 37, sequentially connecting all joints to form a complete human body posture based on preset joint sequencing according to the accurate positions of joints of the whole body.
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
1. By constructing a feature pyramid network based on a MobileNet network and then performing up-sampling and feature addition operations between feature branches, the parameter count of the deep convolutional neural network can be effectively reduced, information can flow through the network efficiently, joint point information is fused with spatial and semantic information, and the accuracy of human posture estimation is greatly improved;
2. Through the single-stage human body posture representation, the human body posture can be inferred directly from the predicted joint positions, which solves the problems of slow training and long inference time of traditional posture algorithms and fully exploits the end-to-end training advantage of a convolutional neural network; that is, the human body posture is estimated within one network without additional post-processing operations, greatly improving the efficiency of human body posture estimation.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
The invention will be further described with reference to the following examples with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature pyramid network in an embodiment of the invention;
FIG. 3 is a schematic diagram of an inverse residual unit according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the output of a feature pyramid network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a result of human body posture estimation according to an embodiment of the present invention.
Detailed Description
The technical scheme in the embodiment of the application has the following general idea:
Firstly, a feature pyramid network is constructed based on MobileNet, which greatly reduces the parameter count of the deep convolutional neural network while keeping the loss of precision within an acceptable range. Secondly, several 3x3 deconvolution modules are arranged between feature maps of different levels in the feature pyramid network to restore resolution, and a feature addition operation is performed with the feature map of the previous level, further improving the flow and fusion of information between features so that the network can effectively exploit and fuse spatial and semantic information. Then, in addition to joint point prediction and offset prediction, joint refinement heat map prediction is added: since the network is often inaccurate for long-distance offset prediction, an additional refinement heat map is predicted at each joint point to refine the joint positions, making the position of each human joint more accurate, greatly improving the precision of posture estimation, and providing an accurate posture reference for higher-level tasks such as behavior recognition, pedestrian re-identification and abnormal behavior detection.
The embodiment of the present application provides a single-stage multi-user posture estimation method based on a feature pyramid network, please refer to fig. 1, which includes:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with successively reduced resolutions, then performing inter-channel information fusion, then, starting from the primary feature map with the lowest resolution, performing up-sampling and feature addition operations on all the primary feature maps across the feature branches, and finally producing the prediction output;
step 20, acquiring a multi-person posture estimation data set, wherein the data set comprises multi-person posture pictures and ground-truth joint point labels; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the multi-person posture estimation data set as training labels, and training the feature pyramid network;
the method comprises the steps of obtaining a large number of sample images (multi-person posture images) in advance, dividing the sample images into a training set, a verification set and a test set after marking joint points of the sample images, inputting the training set into a deep convolutional neural network for training, verifying the trained deep convolutional neural network by utilizing the verification set and the test set, and judging whether a loss value reaches a preset threshold value or not
And step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint point positions according to the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and forming complete human body postures from the joint point positions.
Through the single-stage human body posture representation, the human body posture can be inferred directly from the predicted joint positions, which solves the problems of slow training and long inference time of traditional posture algorithms and fully exploits the end-to-end training advantage of a convolutional neural network; that is, the human body posture is estimated within one network without additional post-processing operations, greatly improving the efficiency of human body posture estimation.
Referring to fig. 2, in a possible implementation manner, in step 10, building a feature pyramid network based on a MobileNet network specifically includes:
step 11, creating a plurality of first convolution kernels (for example, convolution kernels with the size of 3 × 3 × 3) for extracting primary features of an image and changing the number of channels of the image features;
step 12, sequentially cascading convolution modules formed by a plurality of inverted residual units at the output ends of the plurality of first convolution kernels to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease successively;
as shown in fig. 3, a single inverse residual unit is specifically constructed as follows:
step 121, firstly, extracting features by a plurality of convolution kernels (PointWise convolution modules) with the space size of 1x1, and simultaneously increasing the number of feature channels;
step 122, adding a RELU6 activation function after the convolution kernel;
step 123, adding a depth separable convolution (DepthWise convolution module) with the size of 3x3 after the RELU6 activation function, wherein the depth separable convolution module is used for extracting features;
step 124, adding a RELU6 activation function after the separable convolution;
step 125, adding a 1x1 convolution kernel (Pointwise convolution module) for reducing the number of characteristic channels after the RELU6 activating function;
step 126, adding a Linear activation function after the 1x1 convolution kernel;
step 127, using identity mapping to map the identity of the feature map input in the step 121 to the feature map generated in the step 126, and performing feature element addition operation;
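The data flow of steps 121-127 can be sketched in plain NumPy; this is only an illustration with random weights and assumed channel counts (16 expanded to 96 and projected back), not the patent's trained network:

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def pointwise(x, w):
    # a 1x1 convolution is per-pixel channel mixing; x: (H, W, C_in), w: (C_in, C_out)
    return x @ w

def depthwise3x3(x):
    # toy 3x3 depthwise convolution (uniform kernel, stride 1, zero padding),
    # applied to each channel independently
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out

def inverted_residual(x, w_expand, w_project):
    y = relu6(pointwise(x, w_expand))   # steps 121-122: expand channels, ReLU6
    y = relu6(depthwise3x3(y))          # steps 123-124: depthwise 3x3, ReLU6
    y = pointwise(y, w_project)         # steps 125-126: project back, linear
    return x + y                        # step 127: identity shortcut, element add

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
out = inverted_residual(x, rng.standard_normal((16, 96)) * 0.1,
                        rng.standard_normal((96, 16)) * 0.1)
```

Because the projection is linear, the shortcut addition preserves the input's channel count, which is what makes the stacking in step 12 possible.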
the above steps 11 and 12 are the building of a MobileNet network structure, and the embodiment further performs information fusion and feature enhancement on the feature branches through the subsequent steps on the basis of the MobileNet network:
step 13, after each layer of original feature map extracted by the inverted residual unit modules, setting a group of second convolution kernels (for example, convolution kernels of size 1x1) for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, starting from the feature map level with the lowest resolution, sequentially cascading a plurality of deconvolution modules that enlarge the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and then performing a position-wise element summation of the enlarged feature map and the fused feature map of the next layer for further feature fusion, obtaining an enhanced feature map;
and step 15, performing the final prediction output of the four types of heat maps on the enhanced feature map with the highest resolution using four groups of parallel third convolution kernels (for example, convolution kernels of size 1x1).
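The top-down path of step 14 amounts to "upsample, then add". A minimal sketch, with nearest-neighbour upsampling standing in for the learned deconvolution module and hypothetical map sizes and channel count:

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour stand-in for the 2x deconvolution module of step 14
    return x.repeat(2, axis=0).repeat(2, axis=1)

def top_down_fuse(pyramid):
    """pyramid: fused feature maps ordered from highest to lowest resolution."""
    out = pyramid[-1]
    enhanced = [out]
    for feat in reversed(pyramid[:-1]):
        out = upsample2x(out) + feat  # position-wise element summation
        enhanced.append(out)
    enhanced.reverse()
    return enhanced  # enhanced[0] is the highest-resolution map fed to step 15

pyramid = [np.ones((32, 32, 8)), np.ones((16, 16, 8)), np.ones((8, 8, 8))]
enhanced = top_down_fuse(pyramid)
```

With these all-ones inputs, each added level accumulates one more unit, which makes the summation easy to verify by hand.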
In one possible embodiment, the upper body joints are divided into 9 classes: the head, neck and chest joints, and the left and right shoulder, elbow and wrist joints, wherein the chest joint is defined as the central joint point. The lower body joints are divided into 7 classes: the hip joint point and 6 lower limb joints (the left and right hip, knee and ankle joints). The center point heat map is constructed from the positions of the central joint points; the upper offset heat map is constructed from the offsets from the central joint point to the other 8 classes of upper body joints and to the hip joints; the lower offset heat map is constructed from the offsets from the hip joints to the other 6 classes of lower body joints; and the joint refinement heat map is constructed from the positions of all joint points other than the central joint point. The numbers of heat map channels output by the pyramid network are 1, 18, 12 and 30, corresponding respectively to 1 center point heat map, 18 upper offset heat maps (x and y channel offset heat maps for each joint type), 12 lower offset heat maps (x and y channel offset heat maps for each joint type) and 30 joint refinement heat maps (x and y channels for each joint other than the center point); the output heat maps are shown in fig. 4. In the four parallel groups of third convolution kernels, the number of kernels in each group matches the corresponding number of heat map channels, i.e. 1, 18, 12 and 30, respectively.
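The 61 output channels (1 + 18 + 12 + 30) of this embodiment can be separated into the four heads with a plain split; a small sketch, where the 64x64 map size is an assumed output resolution:

```python
import numpy as np

pred = np.zeros((61, 64, 64))  # stacked network output: 61 heat map channels
center, upper, lower, refine = np.split(pred, [1, 1 + 18, 1 + 18 + 12], axis=0)
# center: 1 center point map; upper: 18 = 9 joint types x (x, y) offsets;
# lower: 12 = 6 joint types x (x, y); refine: 30 = 15 joints x (x, y)
```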
By constructing the characteristic pyramid network based on the MobileNet network and then performing up-sampling and characteristic addition operation among the characteristic branches, the parameter quantity of the deep convolutional neural network can be effectively reduced, the network can efficiently perform information flow, joint point information can be favorably fused into spatial information and semantic information, and the accuracy of human posture estimation is greatly improved.
In a possible implementation, the step 20 specifically includes:
step 21, obtaining a sample image from the data set and resizing it into a 256 x 256 RGB image;
step 22, inputting the sample image into the feature pyramid network and performing a single forward pass to obtain the center point heat maps, upper offset heat maps, lower offset heat maps and refinement heat maps predicted by the network for the several human bodies in the image;
step 23, constructing the truth labels of the heat maps from the ground-truth joint point annotations of the sample images: the chest joint points of the several human bodies are taken as the center points of the human bodies and processed with a Gaussian kernel to obtain a single two-dimensional center point heat map, in which the positions of the Gaussian peaks are the positions of the several center points; the upper offset heat map is constructed with the offsets from the chest joint of each person to the joint points of the upper body (e.g., upper limbs, head) and hip, the offset heat map for a single joint type comprising two heat maps for the x and y coordinates respectively; the lower offset heat map is constructed in a similar manner, using the offsets from the hip joint points to the lower limb joint points; the joint refinement heat map is used to further refine the positions of all joint points predicted from the position of the central joint point and the offset values, each joint type likewise corresponding to x and y channels, where the response value at a point indicates the position of the nearest joint point of that type, and the region with response is a circle of radius R around each joint point;
step 24, computing the loss value of the center point heat map from the network-predicted heat map and the truth-label heat map using a mean square error loss function, and training the network-predicted center point heat map with this loss:
M = Σj (P(pj) − G(pj))²
where P(pj) denotes the predicted value at position pj of the predicted center point heat map, and G(pj) denotes the true value at position pj of the center point heat map constructed from the training labels;
step 25, computing the loss value of the upper offset heat map using a mean square error loss function, and training the network-predicted upper offset heat map with this loss:
Lu = Σi Σj (Ui(pj) − Ui*(pj))²
where i indexes the heat maps of the different joint types, pj denotes a position on the heat map, Ui denotes the predicted upper offset heat map for joint type i, and Ui* denotes the true upper offset heat map for joint type i in the training labels;
step 26, computing the loss value of the lower offset heat map using a mean square error loss function, and training the network-predicted lower offset heat map with this loss:
Ld = Σi Σj (Di(pj) − Di*(pj))²
where Di denotes the predicted lower offset heat map for joint type i, and Di* denotes the true lower offset heat map for joint type i in the training labels;
step 27, computing the loss value of the joint refinement heat map using a mean square error loss function, and training the network-predicted joint refinement heat map with this loss:
Lo = Σi Σj (Ri(pj) − Ri*(pj))²
where Ri denotes the predicted refinement heat map for joint type i, and Ri* denotes the true refinement heat map for joint type i in the training labels;
step 28, the final loss function of the network is:
L = αM + βLu + γLd + δLo
where α, β, γ and δ are the weights of the respective losses.
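Steps 24-28 condense into a weighted sum of mean square errors; a minimal NumPy sketch, where the array shapes and the unit default weights are placeholders rather than values from the patent:

```python
import numpy as np

def mse(pred, gt):
    # mean square error between a predicted heat map stack and its truth label
    return float(np.mean((pred - gt) ** 2))

def total_loss(pred, gt, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    # L = alpha*M + beta*Lu + gamma*Ld + delta*Lo (step 28)
    return (alpha * mse(pred['center'], gt['center'])
            + beta * mse(pred['upper'], gt['upper'])
            + gamma * mse(pred['lower'], gt['lower'])
            + delta * mse(pred['refine'], gt['refine']))

gt = {k: np.zeros((c, 64, 64)) for k, c in
      [('center', 1), ('upper', 18), ('lower', 12), ('refine', 30)]}
loss_zero = total_loss(gt, gt)  # a perfect prediction gives zero loss
```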
By setting these weight parameters, different subtasks can exert different influences on the parameters of the backbone network, so that the backbone network is optimized as required.
In a possible implementation manner, the step 30 specifically includes:
step 31, acquiring an image to be detected, preprocessing it (for example, resizing it into a 256 x 256 RGB image), inputting it into the trained feature pyramid network, and performing a single forward pass through the network to obtain the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted for the RGB image to be detected;
step 32, in the predicted central point heat map, using a non-maximum suppression algorithm to find, for each human body in the image, the maximum pixel value position of the corresponding central joint point (for example, the chest joint of each person), and taking these maximum-value positions as the central joint point positions;
step 33, according to each central joint point position, reading the response value at the corresponding position in the predicted upper offset heat map of each type of joint (including the upper body joints and the hip joints), and calculating the fuzzy position of each type of upper body joint and hip joint from these response values;
step 34, calculating to obtain the accurate position of each type of upper body joint and hip joint according to the fuzzy position of each type of upper body joint and hip joint and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, calculating the fuzzy position of each type of lower body joint according to the accurate position of the hip joint and the response value of the corresponding position in the lower offset heat map of each type of lower body joint;
step 36, calculating to obtain the accurate position of each type of lower body joint according to the fuzzy positions of the lower body joints and the response values at the corresponding positions in the refinement heat map of the lower body joints;
and step 37, according to the calculated accurate positions of the whole-body joints, sequentially connecting all joints in a preset joint order to form the complete multi-person human body postures, as shown in fig. 5.
Through the central joint point and the upper offset heat map, the positions of the other upper body joints and the hip joints are deduced from the position of the central joint point; through the lower offset heat map, the positions of the other lower body joints are deduced from the positions of the hip joints. Meanwhile, these positions are precisely localized through the refinement heat map, which avoids the error caused by long-distance offsets, so that the accurate positions of the whole-body joints are obtained.
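As a rough NumPy sketch of the decoding in steps 31 to 37 (the 3×3 non-maximum suppression, the response threshold and the refinement window radius here are illustrative assumptions, not details taken from the method itself):

```python
import numpy as np

def nms_peaks(center_map, threshold=0.3):
    """Step 32: keep local maxima of the center-point heat map (simple
    3x3 non-maximum suppression) as candidate central joint positions."""
    H, W = center_map.shape
    padded = np.pad(center_map, 1, constant_values=-np.inf)
    # A pixel survives if it is >= all pixels in its 3x3 neighbourhood
    # (including itself) and exceeds the response threshold.
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    keep = (center_map >= neigh.max(axis=0)) & (center_map > threshold)
    return list(zip(*np.nonzero(keep)))  # (y, x) coordinates

def decode_joint(anchor_yx, offset_maps, refine_map):
    """Steps 33-34 (and, reusing the hip as anchor, steps 35-36):
    fuzzy position = anchor + offset read at the anchor position,
    then snapped to the strongest nearby refinement-heat-map response."""
    y, x = anchor_yx
    dy, dx = offset_maps[0][y, x], offset_maps[1][y, x]  # (dy_map, dx_map)
    cy = int(round(y + dy))                              # fuzzy position
    cx = int(round(x + dx))
    r = 2                                                # assumed window radius
    H, W = refine_map.shape
    y0, y1 = max(0, cy - r), min(H, cy + r + 1)
    x0, x1 = max(0, cx - r), min(W, cx + r + 1)
    win = refine_map[y0:y1, x0:x1]
    ry, rx = np.unravel_index(np.argmax(win), win.shape)
    return (y0 + ry, x0 + rx)                            # accurate position
```

The same deduce-then-refine helper serves both halves of the body: anchored at the central joint with the upper offset maps for steps 33–34, then anchored at the recovered hip with the lower offset maps for steps 35–36.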
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.
Claims (5)
1. A single-stage multi-person posture estimation method based on a feature pyramid network is characterized by comprising the following steps:
step 10, building a characteristic pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary characteristic graphs with sequentially reduced resolution, then carrying out information fusion among channels, then taking the primary characteristic graph with the lowest resolution as a starting point, carrying out up-sampling and characteristic addition operation on all the primary characteristic graphs among characteristic branches, and finally carrying out prediction output;
step 20, acquiring a multi-person posture estimation data set, wherein the multi-person posture estimation data set comprises multi-person posture pictures and joint point ground truth value labels; constructing a central point heat map, an upper offset heat map, a lower offset heat map and a joint thinning heat map by using the multi-person posture estimation data set as training labels, and training the characteristic pyramid network;
and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint position according to the output central point heat map, the upper offset heat map, the lower offset heat map and the joint thinning heat map, and forming a complete human body posture according to the joint position.
2. The method according to claim 1, characterized in that said step 10 comprises in particular:
step 11, creating a plurality of first convolution kernels for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading convolution modules formed by a plurality of reverse residual error units at the output ends of the plurality of first convolution kernels to complete the construction of a multi-layer feature extraction main branch, wherein the resolution of a multi-layer original feature map output by the feature extraction main branch is sequentially reduced;
step 13, after each layer of original feature map extracted by the reverse residual error unit module, setting a group of second convolution kernels for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fusion feature map;
step 14, with the feature map hierarchy with the lowest resolution as a starting point, sequentially cascading a plurality of deconvolution modules, which are used for amplifying the resolution of the fusion feature map of the current layer to the resolution of the fusion feature map of the next layer to obtain an amplified feature map, and then performing position-by-position element summation operation on the amplified feature map and the fusion feature map of the next layer to obtain an enhanced feature map;
and step 15, utilizing four groups of parallel third convolution kernels to predict and output the enhanced feature map with the maximum resolution.
3. The method of claim 1, wherein: in the step 20, training the feature pyramid network specifically comprises: respectively calculating the loss values between the central point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network and the corresponding training labels, together with the total loss, and then training the feature pyramid network according to the loss values;
M = (1/N) Σ_j (P(p_j) − G(p_j))²
wherein P(p_j) denotes the predicted value at position p_j in the predicted central point heat map, G(p_j) denotes the true value at position p_j in the central point heat map constructed from the training label, and N denotes the number of positions on the heat map;
L_u = (1/N) Σ_i Σ_j (P_i^u(p_j) − G_i^u(p_j))²
where i indexes the different joint types, p_j denotes a position on the heat map, P_i^u denotes the upper offset heat map predicted for joint type i, and G_i^u denotes the ground-truth upper offset heat map of joint type i in the training label;
L_d = (1/N) Σ_i Σ_j (P_i^d(p_j) − G_i^d(p_j))²
where i indexes the different joint types, p_j denotes a position on the heat map, P_i^d denotes the lower offset heat map predicted for joint type i, and G_i^d denotes the ground-truth lower offset heat map of joint type i in the training label;
L_o = (1/N) Σ_i Σ_j (P_i^o(p_j) − G_i^o(p_j))²
where i indexes the different joint types, p_j denotes a position on the heat map, P_i^o denotes the refinement heat map predicted for joint type i, and G_i^o denotes the ground-truth refinement heat map of joint type i in the training label;
the formula for calculating the total loss is: L = αM + βL_u + γL_d + δL_o
where α, β, γ and δ denote the weights of the respective losses.
4. The method of claim 1, wherein: in step 20, the central point heat map is constructed from the locations of the central joint points, the upper offset heat map is constructed from the offsets of the central joint points to each of the upper body joints and hip joints, respectively, the lower offset heat map is constructed from the offsets of the hip joints to each of the lower limb joints, respectively, and the joint refinement heat map is constructed from the locations of the other joint points except the central joint point.
5. The method according to claim 4, wherein the step 30 specifically comprises:
step 31, acquiring an image to be detected, preprocessing the image to be detected, and inputting the preprocessed image into the trained feature pyramid network to obtain a predicted central point heat map, an upper offset heat map, a lower offset heat map and a joint thinning heat map;
step 32, obtaining at least one central joint point position in the predicted central point heat map by using a non-maximum suppression algorithm;
step 33, according to the position of the central joint point, acquiring a response value corresponding to each type of upper body joint and hip joint in the predicted upper offset heat map, and according to the response value, calculating to obtain fuzzy positions of each type of upper body joint and hip joint;
step 34, calculating to obtain the accurate position of each type of upper body joint and hip joint according to the fuzzy position of each type of upper body joint and hip joint and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, calculating to obtain a fuzzy position of each type of lower body joint according to the accurate position of the hip joint and the lower offset heat map of each type of lower body joint;
step 36, calculating to obtain the accurate position of each type of lower body joint according to the fuzzy position of each type of lower body joint and the predicted response value of the corresponding joint in the joint refinement heat map;
and step 37, sequentially connecting all joints to form a complete human body posture based on preset joint sequencing according to the accurate positions of joints of the whole body.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011607963.XA CN112597955B (en) | 2020-12-30 | 2020-12-30 | Single-stage multi-person gesture estimation method based on feature pyramid network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112597955A true CN112597955A (en) | 2021-04-02 |
CN112597955B CN112597955B (en) | 2023-06-02 |
Family
ID=75206178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011607963.XA Active CN112597955B (en) | 2020-12-30 | 2020-12-30 | Single-stage multi-person gesture estimation method based on feature pyramid network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112597955B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124415A1 (en) * | 2015-11-04 | 2017-05-04 | Nec Laboratories America, Inc. | Subcategory-aware convolutional neural networks for object detection |
CN108229445A (en) * | 2018-02-09 | 2018-06-29 | 深圳市唯特视科技有限公司 | A kind of more people's Attitude estimation methods based on cascade pyramid network |
US20190171871A1 (en) * | 2017-12-03 | 2019-06-06 | Facebook, Inc. | Systems and Methods for Optimizing Pose Estimation |
CN110427890A (en) * | 2019-08-05 | 2019-11-08 | 华侨大学 | More people's Attitude estimation methods based on depth cascade network and mass center differentiation coding |
CN111191622A (en) * | 2020-01-03 | 2020-05-22 | 华南师范大学 | Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium |
CN111832383A (en) * | 2020-05-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Training method of gesture key point recognition model, gesture recognition method and device |
Non-Patent Citations (2)
Title |
---|
DING ZHIFENG: "Pedestrian Detection Based on Statistical Structural Gradient Features", China Master's Theses Full-text Database, Information Science and Technology Series * |
SHEN XIAOFENG; WANG CHUNJIA: "Research on 2D Human Pose Estimation with ASPP-Based High-Resolution Convolutional Neural Networks", Modern Computer * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011402A (en) * | 2021-04-30 | 2021-06-22 | 中国科学院自动化研究所 | System and method for estimating postures of primates based on convolutional neural network |
CN113343762A (en) * | 2021-05-07 | 2021-09-03 | 北京邮电大学 | Human body posture estimation grouping model training method, posture estimation method and device |
CN113343762B (en) * | 2021-05-07 | 2022-03-29 | 北京邮电大学 | Human body posture estimation grouping model training method, posture estimation method and device |
CN113420604A (en) * | 2021-05-28 | 2021-09-21 | 沈春华 | Multi-person posture estimation method and device and electronic equipment |
CN113297995A (en) * | 2021-05-31 | 2021-08-24 | 深圳市优必选科技股份有限公司 | Human body posture estimation method and terminal equipment |
CN113297995B (en) * | 2021-05-31 | 2024-01-16 | 深圳市优必选科技股份有限公司 | Human body posture estimation method and terminal equipment |
CN113673354A (en) * | 2021-07-23 | 2021-11-19 | 湖南大学 | Human body key point detection method based on context information and combined embedding |
CN113673354B (en) * | 2021-07-23 | 2024-02-20 | 湖南大学 | Human body key point detection method based on context information and joint embedding |
CN113610015A (en) * | 2021-08-11 | 2021-11-05 | 华侨大学 | Attitude estimation method, device and medium based on end-to-end rapid ladder network |
CN113610015B (en) * | 2021-08-11 | 2023-05-30 | 华侨大学 | Attitude estimation method, device and medium based on end-to-end fast ladder network |
Also Published As
Publication number | Publication date |
---|---|
CN112597955B (en) | 2023-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112597955A (en) | Single-stage multi-person attitude estimation method based on feature pyramid network | |
CN108170816B (en) | Intelligent visual question-answering method based on deep neural network | |
CN109948475B (en) | Human body action recognition method based on skeleton features and deep learning | |
CN111476806B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN109919085B (en) | Human-human interaction behavior identification method based on light-weight convolutional neural network | |
CN111259735B (en) | Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network | |
CN110427890B (en) | Multi-person attitude estimation method based on deep cascade network and centroid differentiation coding | |
CN110334584B (en) | Gesture recognition method based on regional full convolution network | |
CN109785409B (en) | Image-text data fusion method and system based on attention mechanism | |
CN113128424A (en) | Attention mechanism-based graph convolution neural network action identification method | |
CN116229056A (en) | Semantic segmentation method, device and equipment based on double-branch feature fusion | |
CN111709268A (en) | Human hand posture estimation method and device based on human hand structure guidance in depth image | |
CN111507359A (en) | Self-adaptive weighting fusion method of image feature pyramid | |
CN112597956B (en) | Multi-person gesture estimation method based on human body anchor point set and perception enhancement network | |
CN111914595B (en) | Human hand three-dimensional attitude estimation method and device based on color image | |
Gheitasi et al. | Estimation of hand skeletal postures by using deep convolutional neural networks | |
CN111368637B (en) | Transfer robot target identification method based on multi-mask convolutional neural network | |
CN113822134A (en) | Instance tracking method, device, equipment and storage medium based on video | |
CN112651294A (en) | Method for recognizing human body shielding posture based on multi-scale fusion | |
CN116460851A (en) | Mechanical arm assembly control method for visual migration | |
CN113610015B (en) | Attitude estimation method, device and medium based on end-to-end fast ladder network | |
CN115424012A (en) | Lightweight image semantic segmentation method based on context information | |
CN114821631A (en) | Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion | |
Sharma et al. | Real-Time Word Level Sign Language Recognition Using YOLOv4 | |
CN113673540A (en) | Target detection method based on positioning information guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||