CN112597955A - Single-stage multi-person pose estimation method based on feature pyramid network

Single-stage multi-person pose estimation method based on feature pyramid network

Info

Publication number
CN112597955A
Authority
CN
China
Prior art keywords
joint
heat map
map
feature
pyramid network
Prior art date
Legal status
Granted
Application number
CN202011607963.XA
Other languages
Chinese (zh)
Other versions
CN112597955B (en)
Inventor
骆炎民
张智谦
林躬耕
Current Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Original Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Fujian Gongtian Software Co ltd and Huaqiao University
Priority to CN202011607963.XA
Publication of CN112597955A
Application granted
Publication of CN112597955B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The embodiment of the invention discloses a single-stage multi-person pose estimation method based on a feature pyramid network, relating to the technical field of human pose estimation and comprising the following steps: step 10, building a feature pyramid network based on a MobileNet network, the pyramid network being used for extracting a plurality of primary feature maps with sequentially decreasing resolutions and then performing inter-channel information fusion; step 20, constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from a multi-person pose estimation data set as training labels, and training the feature pyramid network; and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint positions and assembling complete multi-person human poses. According to the embodiment of the invention, information can flow efficiently through the network, which improves the accuracy of human pose estimation; meanwhile, a fast post-processing matching procedure further increases the speed of the multi-person pose estimation algorithm.

Description

Single-stage multi-person pose estimation method based on feature pyramid network
Technical Field
The invention relates to the technical field of human pose estimation, and in particular to a single-stage multi-person pose estimation method based on a feature pyramid network.
Background
Human pose estimation is a key step in understanding human behaviour through computer vision: all joint points of a human body are predicted from a single RGB image and assembled into a correct pose. Accurate prediction of the human pose is important for higher-level computer vision tasks such as human behaviour recognition, human-computer interaction, pedestrian re-identification and abnormal behaviour detection.
Although the field of human pose estimation has developed rapidly, for the multi-person pose estimation task both the top-down and the bottom-up methods are currently multi-stage methods. One problem with these methods is that they are time-consuming and cannot exploit the end-to-end trainability of a CNN. Traditional pose estimation methods pursue accuracy while neglecting the number of network parameters and the inference speed; as a result, pose estimation algorithms are difficult to deploy in practice and their economic benefit is greatly reduced.
In terms of network architecture design, Howard A, Zhmoginov A, Chen L C, et al. proposed a lightweight network architecture named MobileNet in the paper "MobileNetV2: Inverted Residuals and Linear Bottlenecks" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018), which compresses the computation of the ordinary 3x3 convolution by replacing it with a 3x3 depthwise separable convolution followed by a 1x1 pointwise convolution. By using an inverted residual unit, i.e. first a 1x1 convolution to expand the dimension of the input feature map, then a 3x3 depthwise separable convolution, and finally a 1x1 convolution to reduce the dimension, more feature information can be retained and the expressive power of the model is preserved. However, for human pose estimation this network lacks the fusion and application of multi-scale features; multi-scale features have shown excellent results in tasks such as segmentation and detection, and in human pose estimation they also markedly improve the detection accuracy for people and joint points of different scales in a picture.
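As a rough illustration of the computation savings described above (not part of the patent), the following PyTorch sketch compares the parameter count of an ordinary 3x3 convolution with that of the depthwise separable factorization; the channel sizes are arbitrary assumptions:

```python
import torch
import torch.nn as nn

cin, cout = 64, 128  # illustrative channel counts

# ordinary 3x3 convolution
standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

# MobileNet-style factorization: 3x3 depthwise + 1x1 pointwise
separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # depthwise: one filter per channel
    nn.Conv2d(cin, cout, kernel_size=1),                        # pointwise: mixes channels
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(standard), n_params(separable))  # about 73.9k vs 9.0k parameters

x = torch.randn(1, cin, 56, 56)
assert standard(x).shape == separable(x).shape  # same output shape
```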
In RGB-image-based pose estimation, Nie, Xuecheng, et al. proposed a single-stage network for pose estimation in the paper "Single-Stage Multi-Person Pose Machines" (Proceedings of the IEEE International Conference on Computer Vision, 2019), which processes the joint points of a human body hierarchically and places a central joint point in the first hierarchy. The second-level joints are the trunk joints, including the neck, shoulders and hips. The third-level joints include the head, elbows and knees, and the fourth-level joints include the wrists and ankles. In this way the prediction pressure on the network is relieved, since each key point depends only on its adjacent joints. However, when a joint point of the previous level is occluded or invisible, prediction of the next level of joint points may fail, and there is also a long-distance offset problem; these two problems limit the accuracy of human pose estimation.
In patent publication No. CN108229445A, Shenzhen Wei Tei Science and Technology Limited discloses a multi-person pose estimation method based on a cascaded pyramid network, which locates key points within the bounding box of each person using the cascaded pyramid network: a global network locates the simple key points, and a refinement network handles the difficult key points by integrating the feature representations from all levels of the global network. The method exploits multi-scale features, but because the network is complex and multi-stage its efficiency is lower than that of a single-stage method.
Therefore, how to provide a single-stage multi-person pose estimation method based on a feature pyramid network that achieves faster and more accurate single-stage multi-person pose estimation has become a problem to be solved urgently.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a single-stage multi-person pose estimation method based on a feature pyramid network, thereby improving the speed and efficiency of human pose estimation.
In order to solve the technical problem, the embodiment of the invention adopts the following technical scheme:
a single-stage multi-person posture estimation method based on a feature pyramid network comprises the following steps:
step 10, building a characteristic pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary characteristic graphs with sequentially reduced resolution, then carrying out information fusion among channels, then taking the primary characteristic graph with the lowest resolution as a starting point, carrying out up-sampling and characteristic addition operation on all the primary characteristic graphs among characteristic branches, and finally carrying out prediction output;
step 20, acquiring a multi-person posture estimation data set, wherein the multi-person posture estimation data set comprises multi-person posture pictures and joint point ground truth value labels; constructing a central point heat map, an upper offset heat map, a lower offset heat map and a joint thinning heat map by using the multi-person posture estimation data set as training labels, and training the characteristic pyramid network;
and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint position according to the output central point heat map, the upper offset heat map, the lower offset heat map and the joint thinning heat map, and forming a complete human body posture according to the joint position.
Further, the step 10 specifically comprises:
step 11, creating a plurality of first convolution kernels for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading, at the output ends of the plurality of first convolution kernels, convolution modules formed by a plurality of inverted residual units, so as to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature map extracted by the inverted residual unit modules, setting a group of second convolution kernels for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, taking the feature map level with the lowest resolution as a starting point, sequentially cascading a plurality of deconvolution modules, which are used for enlarging the resolution of the fused feature map of the current layer to that of the fused feature map of the next layer to obtain an enlarged feature map, and then performing a position-wise element summation of the enlarged feature map and the fused feature map of the next layer to obtain an enhanced feature map;
and step 15, producing the prediction output from the enhanced feature map with the highest resolution using four groups of parallel third convolution kernels.
Further, in the step 20, training the feature pyramid network specifically comprises: calculating the loss values between the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network and the corresponding training labels, as well as the total loss, and then training the feature pyramid network according to the loss values;
the formula for calculating the center point heat map loss value is:

$$M = \sum_{p_j} \left(P(p_j) - G(p_j)\right)^2$$

where $P(p_j)$ represents the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ represents the ground-truth value at position $p_j$ in the center point heat map constructed from the training labels;
the formula for calculating the upper offset heat map loss value is:

$$L_u = \sum_i \sum_{p_j} \left(P_i^u(p_j) - G_i^u(p_j)\right)^2$$

where $i$ indexes the heat maps of the different joint types, $p_j$ represents a position on the heat map, $P_i^u$ represents the predicted upper offset heat map of joint type $i$, and $G_i^u$ represents the ground-truth upper offset heat map of joint type $i$ in the training labels;
the formula for calculating the lower offset heat map loss value is:

$$L_d = \sum_i \sum_{p_j} \left(P_i^d(p_j) - G_i^d(p_j)\right)^2$$

where $P_i^d$ represents the predicted lower offset heat map of joint type $i$, and $G_i^d$ represents the ground-truth lower offset heat map of joint type $i$ in the training labels;
the formula for calculating the joint refinement heat map loss value is:

$$L_o = \sum_i \sum_{p_j} \left(P_i^o(p_j) - G_i^o(p_j)\right)^2$$

where $P_i^o$ represents the predicted joint refinement heat map of joint type $i$, and $G_i^o$ represents the ground-truth joint refinement heat map of joint type $i$ in the training labels;
the formula for calculating the total loss is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ represent the weights of the respective losses.
Further, in step 20, the center point heat map is constructed from the positions of the central joint points, the upper offset heat map is constructed from the offsets from the central joint point to each upper body joint and to the hip joint, the lower offset heat map is constructed from the offsets from the hip joint to each lower limb joint, and the joint refinement heat map is constructed from the positions of all joint points other than the central joint point.
Further, the step 30 specifically comprises:
step 31, acquiring an image to be detected, preprocessing it, and inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one central joint point position in the predicted center point heat map using a non-maximum suppression algorithm;
step 33, according to the central joint point position, reading the response value of each class of upper body joint and of the hip joint in the predicted upper offset heat map, and calculating the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, calculating the precise position of each class of upper body joint and of the hip joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, calculating the fuzzy position of each class of lower body joint from the precise position of the hip joint and the lower offset heat map of each class of lower body joint;
step 36, calculating the precise position of each class of lower body joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
and step 37, connecting all joints in a preset joint order according to the precise positions of the whole-body joints to form a complete human pose.
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
1. by building a feature pyramid network based on a MobileNet network and then performing up-sampling and feature addition operations between the feature branches, the number of parameters of the deep convolutional neural network can be effectively reduced, information can flow efficiently through the network, and joint point information can be fused with spatial and semantic information, which greatly improves the accuracy of human pose estimation;
2. through the single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions, which solves the problems of slow training and long inference time of traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network; that is, the human pose can be estimated within one network without further post-processing operations, which greatly improves the efficiency of human pose estimation.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly, and that the above and other objects, features and advantages of the present invention may become more readily apparent, embodiments of the invention are described below.
Drawings
The invention will be further described with reference to the following examples and the accompanying drawings.
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature pyramid network in an embodiment of the invention;
FIG. 3 is a schematic diagram of an inverse residual unit according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the output of a feature pyramid network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a result of human body posture estimation according to an embodiment of the present invention.
Detailed Description
The technical scheme in the embodiment of the application has the following general idea:
Firstly, a feature pyramid network is built based on MobileNet, so that the number of parameters of the deep convolutional neural network is greatly reduced while the loss of accuracy remains within an acceptable range. Secondly, several 3x3 deconvolutions are arranged between feature maps of different levels in the feature pyramid network to recover the features, and a feature addition operation is performed with the feature map of the previous level, which further improves the flow and fusion of information between features and allows the network to effectively use and fuse spatial and semantic information. Then, in addition to the joint point prediction and offset prediction, a joint refinement heat map prediction is added: since a network is often inaccurate for long-distance offset prediction, an additional refinement heat map is predicted at each joint point to refine the joint positions, so that the position of each joint of the human body becomes more precise, the accuracy of pose estimation is greatly improved, and an accurate pose reference is provided for higher-level tasks such as behaviour recognition, pedestrian re-identification and abnormal behaviour detection.
The embodiment of the present application provides a single-stage multi-person pose estimation method based on a feature pyramid network; please refer to fig. 1, the method comprises:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially decreasing resolutions, then performing inter-channel information fusion, then, taking the primary feature map with the lowest resolution as a starting point, performing up-sampling and feature addition operations on all the primary feature maps across the feature branches, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the multi-person pose estimation data set comprises multi-person pose pictures and ground-truth joint point labels; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the multi-person pose estimation data set as training labels, and training the feature pyramid network;
A large number of sample images (multi-person pose pictures) are acquired in advance; after the joint points of the sample images are annotated, the images are divided into a training set, a validation set and a test set, the training set is input into the deep convolutional neural network for training, the trained network is verified with the validation set and the test set, and whether the loss value reaches a preset threshold is used to judge whether training is complete.
step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint point positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and assembling complete human poses from the joint point positions.
Through the single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions, which solves the problems of slow training and long inference time of traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network; that is, the human pose can be estimated within one network without further post-processing operations, which greatly improves the efficiency of human pose estimation.
Referring to fig. 2, in a possible implementation, in the step 10, building the feature pyramid network based on a MobileNet network specifically comprises:
step 11, creating a plurality of first convolution kernels (for example, convolution kernels of size 3 × 3 × 3) for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading, at the output ends of the plurality of first convolution kernels, convolution modules formed by a plurality of inverted residual units, so as to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
As shown in fig. 3, a single inverted residual unit is constructed as follows:
step 121, first, features are extracted by a number of convolution kernels of spatial size 1x1 (a pointwise convolution module), which simultaneously increases the number of feature channels;
step 122, a ReLU6 activation function is added after this convolution kernel;
step 123, a depthwise separable convolution of size 3x3 (a depthwise convolution module) is added after the ReLU6 activation function to extract features;
step 124, a ReLU6 activation function is added after the separable convolution;
step 125, a 1x1 convolution kernel (a pointwise convolution module) is added after the ReLU6 activation function to reduce the number of feature channels;
step 126, a linear activation function is added after the 1x1 convolution kernel;
step 127, the feature map input in step 121 is carried to the feature map generated in step 126 through an identity mapping, and an element-wise feature addition is performed, as in the sketch below;
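The following PyTorch sketch assembles steps 121 to 127 into a module; it is an illustrative reading of the unit, and the expansion factor, channel width and the omission of batch normalization are assumptions, not specified by this embodiment:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Steps 121-127 as one module (expansion factor and the absence of
    batch normalization are illustrative assumptions)."""
    def __init__(self, channels: int, expand: int = 6):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),  # step 121: 1x1 pointwise, more channels
            nn.ReLU6(inplace=True),                                  # step 122
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                    # step 123: 3x3 depthwise
            nn.ReLU6(inplace=True),                                  # step 124
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),  # step 125: 1x1 pointwise, fewer channels
            # step 126: linear activation, i.e. no non-linearity after the projection
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # step 127: identity mapping plus element-wise feature addition
        return x + self.block(x)

# quick check: the shape is preserved, so the skip connection is valid
y = InvertedResidual(32)(torch.randn(1, 32, 64, 64))
assert y.shape == (1, 32, 64, 64)
```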
The above steps 11 and 12 build the MobileNet network structure; on this basis, the present embodiment further performs information fusion and feature enhancement on the feature branches through the following steps:
step 13, after each layer of original feature map extracted by the inverted residual unit modules, a group of second convolution kernels (for example, convolution kernels of size 1 × 1) is set for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, taking the feature map level with the lowest resolution as a starting point, a plurality of deconvolution modules are sequentially cascaded; each enlarges the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and a position-wise element summation of the enlarged feature map and the fused feature map of the next layer then performs further feature fusion to obtain an enhanced feature map (see the sketch after this list);
and step 15, the final prediction output of the four types of heat maps is produced from the enhanced feature map with the highest resolution using four groups of parallel third convolution kernels.
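A minimal sketch of the inter-channel fusion and top-down enhancement of steps 13 and 14, assuming three pyramid levels and illustrative channel widths (none of these numbers are fixed by the embodiment; the step 15 heads are sketched after the channel layout below):

```python
import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    """Steps 13-14: 1x1 lateral convolutions fuse the channels of each level,
    then deconvolutions walk from the lowest resolution upward, adding
    element-wise into the next level (channel widths are assumptions)."""
    def __init__(self, in_channels=(32, 64, 160), width=64):
        super().__init__()
        # step 13: one group of second convolution kernels per pyramid level
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        # step 14: deconvolution modules, one per upward transition
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1)
            for _ in in_channels[:-1])

    def forward(self, feats):
        # feats: primary feature maps ordered from high to low resolution
        fused = [l(f) for l, f in zip(self.lateral, feats)]
        out = fused[-1]                       # start at the lowest resolution
        for i in range(len(fused) - 2, -1, -1):
            out = self.up[i](out) + fused[i]  # enlarge, then position-wise sum
        return out                            # enhanced map at the highest resolution
```

A ConvTranspose2d with kernel 4, stride 2 and padding 1 exactly doubles the spatial resolution, which matches the requirement of enlarging each fused map to the resolution of the next level when adjacent levels differ by a factor of 2.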
In one possible embodiment, the upper body joints are divided into 9 classes: the head, neck and chest joints and the left and right shoulder, elbow and wrist joints, wherein the chest joint is defined as the central joint point; the lower body joints are divided into 7 classes: the hip joint and 6 lower limb joints (the left and right hip, knee and ankle joints). The center point heat map is constructed from the positions of the central joint points; the upper offset heat map is constructed from the offsets from the central joint point to each of the other 8 classes of upper body joints and to the hip joint; the lower offset heat map is constructed from the offsets from the hip joint to each of the other 6 classes of lower body joints; the joint refinement heat map is constructed from the positions of all joint points other than the central joint point. The numbers of heat map channels predicted by the pyramid network are 1, 18, 12 and 30, corresponding respectively to 1 center point heat map, 18 upper offset heat maps (x- and y-channel offset heat maps for each joint class), 12 lower offset heat maps (x- and y-channel offset heat maps for each joint class) and 30 joint refinement heat maps (x and y channels for every joint other than the center point); the output heat maps are shown in fig. 4. In the four parallel groups of third convolution kernels, the number of kernels in each group corresponds to the number of heat map channels, i.e. 1, 18, 12 and 30 respectively.
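The four parallel groups of step 15 can be sketched as follows; the input width of the enhanced feature map and the key names are assumed values:

```python
import torch.nn as nn

width = 64  # assumed channel width of the highest-resolution enhanced feature map
heads = nn.ModuleDict({
    "center":    nn.Conv2d(width, 1, 1),   # 1 center point heat map
    "upper_off": nn.Conv2d(width, 18, 1),  # (8 upper body joints + hip) x (x, y)
    "lower_off": nn.Conv2d(width, 12, 1),  # 6 lower limb joints x (x, y)
    "refine":    nn.Conv2d(width, 30, 1),  # 15 non-center joints x (x, y)
})
```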
By building the feature pyramid network based on a MobileNet network and then performing up-sampling and feature addition operations between the feature branches, the number of parameters of the deep convolutional neural network can be effectively reduced, information can flow efficiently through the network, and joint point information can be fused with spatial and semantic information, which greatly improves the accuracy of human pose estimation.
In a possible implementation, the step 20 specifically comprises:
step 21, obtaining a sample image from the data set and resizing it into an RGB image of size 256 × 256;
step 22, inputting the sample image into the feature pyramid network for a single forward pass, so as to obtain the center point heat map, upper offset heat maps, lower offset heat maps and refinement heat maps predicted by the network for the multiple human bodies in the image;
step 23, constructing the ground-truth label of each heat map from the ground-truth annotations of the human body joint points of the sample image (a label-construction sketch follows this list of steps): the chest joint points of the several human bodies are taken as the center points of the human bodies and processed with a Gaussian kernel to obtain a single two-dimensional center point heat map, in which the positions of the Gaussian peaks are the positions of the several center points; the upper offset heat map is constructed with the offsets from the chest joint to each class of upper body joint point (e.g., the upper limbs and head) and to the hip of each person, the offset heat map of a single joint class comprising two heat maps for the x and y coordinates respectively; the lower offset heat map is constructed in a similar manner, using the offsets from the hip joint points to the lower limb joint points; the joint refinement heat map is used to further refine the joint point positions predicted from the central joint point position and the offset values; each class of joint refinement heat map likewise corresponds to x and y channels, the response value at a point points to the position of the nearest joint point, and the region with response is a circle of radius R around each joint point;
step 24, the loss value of the center point heat map is obtained by applying the mean squared error loss function to the network-predicted heat map and the ground-truth label heat map, and the network-predicted center point heat map is trained with this loss value:

$$M = \sum_{p_j} \left(P(p_j) - G(p_j)\right)^2$$

where $P(p_j)$ represents the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ represents the ground-truth value at position $p_j$ in the center point heat map constructed from the training labels;
step 25, the loss value of the upper offset heat map is calculated with the mean squared error loss function, and the network-predicted upper offset heat map is trained with this loss value:

$$L_u = \sum_i \sum_{p_j} \left(P_i^u(p_j) - G_i^u(p_j)\right)^2$$

where $i$ indexes the heat maps of the different joint types, $p_j$ represents a position on the heat map, $P_i^u$ represents the predicted upper offset heat map of joint type $i$, and $G_i^u$ represents the ground-truth upper offset heat map of joint type $i$ in the training labels;
step 26, the loss value of the lower offset heat map is calculated with the mean squared error loss function, and the network-predicted lower offset heat map is trained with this loss value:

$$L_d = \sum_i \sum_{p_j} \left(P_i^d(p_j) - G_i^d(p_j)\right)^2$$

where $P_i^d$ represents the predicted lower offset heat map of joint type $i$, and $G_i^d$ represents the ground-truth lower offset heat map of joint type $i$ in the training labels;
step 27, the loss value of the joint refinement heat map is calculated with the mean squared error loss function, and the network-predicted joint refinement heat map is trained with this loss value:

$$L_o = \sum_i \sum_{p_j} \left(P_i^o(p_j) - G_i^o(p_j)\right)^2$$

where $P_i^o$ represents the predicted joint refinement heat map of joint type $i$, and $G_i^o$ represents the ground-truth joint refinement heat map of joint type $i$ in the training labels;
step 28, the final loss function of the network is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ represent the weights of the respective losses.
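Looking back at step 23, the center point label construction can be sketched as follows, assuming a standard two-dimensional Gaussian (the value of sigma and the helper name are assumptions, not from the patent):

```python
import numpy as np

def center_point_heatmap(centers, height, width, sigma=2.0):
    """One 2D heat map whose Gaussian peaks sit at the chest (center)
    joint of every person; `centers` holds ground-truth (x, y) pixels."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the strongest response per pixel
    return heatmap

# example: two people with centers at (40, 30) and (90, 70) on a 128x128 map
labels = center_point_heatmap([(40, 30), (90, 70)], 128, 128)
```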
Through the setting of the weight parameters α, β, γ, δ, the different subtasks can exert different influences on the parameters of the backbone network, so that the backbone network can be optimized according to the requirements; a sketch of the combined loss follows.
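A minimal sketch of steps 24 to 28, assuming the predicted and ground-truth heat maps are packed into dictionaries keyed as in the earlier sketches (the key names and default weights are assumptions):

```python
import torch.nn.functional as F

def total_loss(pred, gt, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """L = alpha*M + beta*L_u + gamma*L_d + delta*L_o with sum-of-squares terms."""
    M  = F.mse_loss(pred["center"],    gt["center"],    reduction="sum")  # step 24
    Lu = F.mse_loss(pred["upper_off"], gt["upper_off"], reduction="sum")  # step 25
    Ld = F.mse_loss(pred["lower_off"], gt["lower_off"], reduction="sum")  # step 26
    Lo = F.mse_loss(pred["refine"],    gt["refine"],    reduction="sum")  # step 27
    return alpha * M + beta * Lu + gamma * Ld + delta * Lo                # step 28
```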
In a possible implementation, the step 30 specifically comprises:
step 31, an image to be detected is acquired and preprocessed (for example, resized into an RGB image of size 256 × 256), then input into the trained feature pyramid network; a single forward pass through the feature pyramid network yields the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted for the RGB image to be detected;
step 32, in the predicted center point heat map, the maximum pixel value positions of the central joint points (the chest joints) of the several human bodies are found with a non-maximum suppression algorithm and taken as the central joint point positions;
step 33, according to each central joint point position, the response value at the corresponding position in the predicted upper offset heat map of each joint class (the upper body joints and the hip joint) is read, and the fuzzy position of each class of upper body joint and of the hip joint is calculated from these response values;
step 34, the precise position of each class of upper body joint and of the hip joint is calculated from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, the fuzzy position of each class of lower body joint is calculated from the precise position of the hip joint and the response value at the corresponding position in the lower offset heat map of that joint class;
step 36, the precise position of each class of lower body joint is calculated from its fuzzy position and the response value at the corresponding position in the joint refinement heat map;
and step 37, according to the final precise positions of the whole-body joints thus calculated, all joints are connected in a preset joint order to form complete multi-person human poses, as shown in fig. 5.
Through the central joint point and the upper offset heat map, the positions of the other upper body joints and the hip joint are inferred from the central joint point position; through the lower offset heat map, the positions of the other lower body joints are inferred from the hip joint position; meanwhile, the refinement heat map localizes these positions precisely and avoids the errors caused by long-distance offsets, so that the precise positions of all joints of the whole body are obtained, as sketched below.
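A compact sketch of the decoding in steps 31 to 37, assuming batch-first tensors with the channel layout of the earlier head sketch; the peak threshold, the hip channel index, the helper name and the alignment between offset and refinement channels are all assumptions:

```python
import torch
import torch.nn.functional as F

def _read(m, j, x, y):
    """(x, y)-channel pair of joint type j in heat map m, sampled at pixel (x, y),
    with coordinates rounded and clamped to the map bounds."""
    h, w = m.shape[-2:]
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return float(m[0, 2 * j, yi, xi]), float(m[0, 2 * j + 1, yi, xi])

def decode_poses(center, upper_off, lower_off, refine, thresh=0.3, hip=8):
    # step 32: non-maximum suppression via 3x3 max pooling keeps local peaks
    peaks = (center == F.max_pool2d(center, 3, 1, 1)) & (center > thresh)
    poses = []
    for cy, cx in torch.nonzero(peaks[0, 0]).tolist():
        joints = []
        n_upper = upper_off.shape[1] // 2
        for j in range(n_upper):                   # steps 33-34: upper body + hip
            dx, dy = _read(upper_off, j, cx, cy)   # fuzzy position from the offset
            bx, by = cx + dx, cy + dy
            rx, ry = _read(refine, j, bx, by)      # refinement residual
            joints.append((bx + rx, by + ry))      # precise position
        hx, hy = joints[hip]                       # refined hip seeds the lower body
        for j in range(lower_off.shape[1] // 2):   # steps 35-36: lower limb joints
            dx, dy = _read(lower_off, j, hx, hy)
            bx, by = hx + dx, hy + dy
            rx, ry = _read(refine, n_upper + j, bx, by)
            joints.append((bx + rx, by + ry))
        poses.append(joints)                       # step 37: connect in a preset order
    return poses
```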
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (5)

1. A single-stage multi-person pose estimation method based on a feature pyramid network, characterized by comprising the following steps:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially decreasing resolutions, then performing inter-channel information fusion, then, taking the primary feature map with the lowest resolution as a starting point, performing up-sampling and feature addition operations on all the primary feature maps across the feature branches, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the multi-person pose estimation data set comprises multi-person pose pictures and ground-truth joint point labels; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the multi-person pose estimation data set as training labels, and training the feature pyramid network;
and step 30, inputting the image to be detected into the trained feature pyramid network, calculating the joint positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and assembling complete human poses from the joint positions.
2. The method according to claim 1, characterized in that the step 10 specifically comprises:
step 11, creating a plurality of first convolution kernels for extracting primary features of the image and changing the number of channels of the image features;
step 12, sequentially cascading, at the output ends of the plurality of first convolution kernels, convolution modules formed by a plurality of inverted residual units, so as to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature map extracted by the inverted residual unit modules, setting a group of second convolution kernels for performing inter-channel information fusion on the original feature map of the current layer to obtain a corresponding fused feature map;
step 14, taking the feature map level with the lowest resolution as a starting point, sequentially cascading a plurality of deconvolution modules, which are used for enlarging the resolution of the fused feature map of the current layer to that of the fused feature map of the next layer to obtain an enlarged feature map, and then performing a position-wise element summation of the enlarged feature map and the fused feature map of the next layer to obtain an enhanced feature map;
and step 15, producing the prediction output from the enhanced feature map with the highest resolution using four groups of parallel third convolution kernels.
3. The method of claim 1, wherein: in the step 20, training the feature pyramid network specifically comprises: calculating the loss values between the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network and the corresponding training labels, as well as the total loss, and then training the feature pyramid network according to the loss values;
the formula for calculating the center point heat map loss value is:

$$M = \sum_{p_j} \left(P(p_j) - G(p_j)\right)^2$$

wherein $P(p_j)$ represents the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ represents the ground-truth value at position $p_j$ in the center point heat map constructed from the training labels;
the formula for calculating the upper offset heat map loss value is:

$$L_u = \sum_i \sum_{p_j} \left(P_i^u(p_j) - G_i^u(p_j)\right)^2$$

wherein $i$ indexes the heat maps of the different joint types, $p_j$ represents a position on the heat map, $P_i^u$ represents the predicted upper offset heat map of joint type $i$, and $G_i^u$ represents the ground-truth upper offset heat map of joint type $i$ in the training labels;
the formula for calculating the lower offset heat map loss value is:

$$L_d = \sum_i \sum_{p_j} \left(P_i^d(p_j) - G_i^d(p_j)\right)^2$$

wherein $P_i^d$ represents the predicted lower offset heat map of joint type $i$, and $G_i^d$ represents the ground-truth lower offset heat map of joint type $i$ in the training labels;
the formula for calculating the joint refinement heat map loss value is:

$$L_o = \sum_i \sum_{p_j} \left(P_i^o(p_j) - G_i^o(p_j)\right)^2$$

wherein $P_i^o$ represents the predicted joint refinement heat map of joint type $i$, and $G_i^o$ represents the ground-truth joint refinement heat map of joint type $i$ in the training labels;
the formula for calculating the total loss is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

wherein $\alpha$, $\beta$, $\gamma$, $\delta$ represent the weights of the respective losses.
4. The method of claim 1, wherein: in step 20, the center point heat map is constructed from the positions of the central joint points, the upper offset heat map is constructed from the offsets from the central joint point to each upper body joint and to the hip joint, the lower offset heat map is constructed from the offsets from the hip joint to each lower limb joint, and the joint refinement heat map is constructed from the positions of all joint points other than the central joint point.
5. The method according to claim 4, wherein the step 30 specifically comprises:
step 31, acquiring an image to be detected, preprocessing it, and inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one central joint point position in the predicted center point heat map using a non-maximum suppression algorithm;
step 33, according to the central joint point position, reading the response value of each class of upper body joint and of the hip joint in the predicted upper offset heat map, and calculating the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, calculating the precise position of each class of upper body joint and of the hip joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
step 35, calculating the fuzzy position of each class of lower body joint from the precise position of the hip joint and the lower offset heat map of each class of lower body joint;
step 36, calculating the precise position of each class of lower body joint from its fuzzy position and the response value of the corresponding joint in the predicted joint refinement heat map;
and step 37, connecting all joints in a preset joint order according to the precise positions of the whole-body joints to form a complete human pose.
CN202011607963.XA 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network Active CN112597955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network

Publications (2)

Publication Number Publication Date
CN112597955A true CN112597955A (en) 2021-04-02
CN112597955B CN112597955B (en) 2023-06-02

Family

ID=75206178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607963.XA Active CN112597955B (en) 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network

Country Status (1)

Country Link
CN (1) CN112597955B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011402A (en) * 2021-04-30 2021-06-22 中国科学院自动化研究所 System and method for estimating postures of primates based on convolutional neural network
CN113297995A (en) * 2021-05-31 2021-08-24 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113343762A (en) * 2021-05-07 2021-09-03 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113420604A (en) * 2021-05-28 2021-09-21 沈春华 Multi-person posture estimation method and device and electronic equipment
CN113610015A (en) * 2021-08-11 2021-11-05 华侨大学 Attitude estimation method, device and medium based on end-to-end rapid ladder network
CN113673354A (en) * 2021-07-23 2021-11-19 湖南大学 Human body key point detection method based on context information and combined embedding

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124415A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Subcategory-aware convolutional neural networks for object detection
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
US20190171871A1 (en) * 2017-12-03 2019-06-06 Facebook, Inc. Systems and Methods for Optimizing Pose Estimation
CN110427890A (en) * 2019-08-05 2019-11-08 华侨大学 More people's Attitude estimation methods based on depth cascade network and mass center differentiation coding
CN111191622A (en) * 2020-01-03 2020-05-22 华南师范大学 Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
CN111832383A (en) * 2020-05-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124415A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Subcategory-aware convolutional neural networks for object detection
US20190171871A1 (en) * 2017-12-03 2019-06-06 Facebook, Inc. Systems and Methods for Optimizing Pose Estimation
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
CN110427890A (en) * 2019-08-05 2019-11-08 华侨大学 More people's Attitude estimation methods based on depth cascade network and mass center differentiation coding
CN111191622A (en) * 2020-01-03 2020-05-22 华南师范大学 Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
CN111832383A (en) * 2020-05-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
定志锋: "Pedestrian detection based on statistical structural gradient features", China Master's Theses Full-text Database, Information Science and Technology Series *
申小凤; 王春佳: "Research on 2D human pose estimation with a high-resolution convolutional neural network based on ASPP", Modern Computer

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011402A (en) * 2021-04-30 2021-06-22 中国科学院自动化研究所 System and method for estimating postures of primates based on convolutional neural network
CN113343762A (en) * 2021-05-07 2021-09-03 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113343762B (en) * 2021-05-07 2022-03-29 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113420604A (en) * 2021-05-28 2021-09-21 沈春华 Multi-person posture estimation method and device and electronic equipment
CN113297995A (en) * 2021-05-31 2021-08-24 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113297995B (en) * 2021-05-31 2024-01-16 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113673354A (en) * 2021-07-23 2021-11-19 湖南大学 Human body key point detection method based on context information and combined embedding
CN113673354B (en) * 2021-07-23 2024-02-20 湖南大学 Human body key point detection method based on context information and joint embedding
CN113610015A (en) * 2021-08-11 2021-11-05 华侨大学 Attitude estimation method, device and medium based on end-to-end rapid ladder network
CN113610015B (en) * 2021-08-11 2023-05-30 华侨大学 Attitude estimation method, device and medium based on end-to-end fast ladder network

Also Published As

Publication number Publication date
CN112597955B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN112597955A (en) Single-stage multi-person attitude estimation method based on feature pyramid network
CN108170816B (en) Intelligent visual question-answering method based on deep neural network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN109919085B (en) Human-human interaction behavior identification method based on light-weight convolutional neural network
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN110427890B (en) Multi-person attitude estimation method based on deep cascade network and centroid differentiation coding
CN110334584B (en) Gesture recognition method based on regional full convolution network
CN109785409B (en) Image-text data fusion method and system based on attention mechanism
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
CN111709268A (en) Human hand posture estimation method and device based on human hand structure guidance in depth image
CN111507359A (en) Self-adaptive weighting fusion method of image feature pyramid
CN112597956B (en) Multi-person gesture estimation method based on human body anchor point set and perception enhancement network
CN111914595B (en) Human hand three-dimensional attitude estimation method and device based on color image
Gheitasi et al. Estimation of hand skeletal postures by using deep convolutional neural networks
CN111368637B (en) Transfer robot target identification method based on multi-mask convolutional neural network
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
CN116460851A (en) Mechanical arm assembly control method for visual migration
CN113610015B (en) Attitude estimation method, device and medium based on end-to-end fast ladder network
CN115424012A (en) Lightweight image semantic segmentation method based on context information
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
Sharma et al. Real-Time Word Level Sign Language Recognition Using YOLOv4
CN113673540A (en) Target detection method based on positioning information guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant