CN112597955B - Single-stage multi-person pose estimation method based on feature pyramid network - Google Patents

Single-stage multi-person pose estimation method based on feature pyramid network

Info

Publication number
CN112597955B
CN112597955B (application CN202011607963.XA)
Authority
CN
China
Prior art keywords
heat map
joint
feature
map
predicted
Prior art date
Legal status
Active
Application number
CN202011607963.XA
Other languages
Chinese (zh)
Other versions
CN112597955A (en)
Inventor
骆炎民
张智谦
林躬耕
Current Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Original Assignee
Fujian Gongtian Software Co ltd
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Fujian Gongtian Software Co ltd, Huaqiao University filed Critical Fujian Gongtian Software Co ltd
Priority to CN202011607963.XA
Publication of CN112597955A
Application granted
Publication of CN112597955B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a single-stage multi-person pose estimation method based on a feature pyramid network, which relates to the technical field of human pose estimation and comprises the following steps: step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially reduced resolution and then carrying out inter-channel information fusion; step 20, constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from a multi-person pose estimation data set as training labels, and training the feature pyramid network; and step 30, inputting the image to be tested into the trained feature pyramid network, calculating the joint positions and forming the complete multi-person poses. The embodiment of the invention enables information to flow efficiently through the network and improves the accuracy of human pose estimation; at the same time, a fast post-processing matching procedure further increases the processing speed of the multi-person pose estimation algorithm.

Description

Single-stage multi-person pose estimation method based on feature pyramid network
Technical Field
The invention relates to the technical field of human pose estimation, and in particular to a single-stage multi-person pose estimation method based on a feature pyramid network.
Background
Human pose estimation is a key step toward computer vision understanding of human behavior: from a single RGB image, all joints of each human body are predicted and assembled into a correct pose. Accurate pose prediction is of great significance for higher-level computer vision tasks such as human behavior recognition, human-computer interaction, pedestrian re-identification and abnormal behavior detection.
Although the field of human pose estimation is developing rapidly, current top-down and bottom-up approaches to multi-person pose estimation are multi-stage methods; they are time-consuming and cannot exploit the end-to-end trainability of a CNN. Traditional pose estimation methods uniformly pursue accuracy while neglecting network parameter count and inference speed, which makes the algorithms hard to deploy in practice and greatly reduces their economic benefit.
In terms of network architecture design, Howard A, Zhmoginov A, Chen L C, et al., in the paper "MobileNetV2: Inverted Residuals and Linear Bottlenecks" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018), propose a lightweight network architecture named MobileNet, which compresses the computation of an ordinary 3x3 convolution by replacing it with a 3x3 depthwise separable convolution plus a 1x1 pointwise convolution. An inverted residual unit expands the dimension of the input feature map, i.e., a 1x1 convolution is applied first, then a 3x3 depthwise separable convolution performs the convolution operation, and finally a 1x1 convolution reduces the feature map dimension again, so that more feature information is retained and the expressive power of the model is preserved. However, for human pose estimation this network lacks the fusion and application of multi-scale features; multi-scale features have produced excellent results in tasks such as segmentation and detection, and in human pose estimation they markedly improve the accuracy of detecting people and joint points at different scales in a picture.
In pose estimation from RGB images, Nie, Xuecheng, et al. (Proceedings of the IEEE International Conference on Computer Vision, 2019) propose a single-stage pose estimation network in the paper "Single-Stage Multi-Person Pose Machines", which organizes the human joint points into a hierarchy, placing a central joint point at the first level. The joint points of the second level are the trunk joints, including the neck, shoulders and hips. The third level includes the head, elbows and knees, and the fourth level includes the wrists and ankles. In this way the prediction burden on the network is relieved: each keypoint depends only on the joints adjacent to it. However, when a joint point at an upper level is occluded or invisible, the prediction of the joint points below it may fail, and a long-distance offset problem remains, which limits the accuracy of human pose estimation.
In the patent "A multi-person pose estimation method based on a cascaded pyramid network" (publication number CN108229445A), a Shenzhen technology company discloses a method that locates the keypoints within each person's bounding box using a cascaded pyramid network: a global network locates the easy keypoints, and a refinement network handles the difficult keypoints by integrating the feature representations from all levels of the global network. The method exploits multi-scale features, but because the network is more complex and the method is multi-stage, its efficiency is lower than that of a single-stage method.
Therefore, how to provide a single-stage multi-person pose estimation method based on a feature pyramid network that achieves faster, high-precision single-stage multi-person pose estimation has become an urgent problem to solve.
Disclosure of Invention
The invention aims to solve the technical problem of providing a single-stage multi-person pose estimation method based on a feature pyramid network, so as to improve the speed and efficiency of human pose estimation.
In order to solve the technical problems, the embodiment of the invention adopts the following technical scheme:
a single-stage multi-person gesture estimation method based on a feature pyramid network comprises the following steps:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature graphs with sequentially reduced resolution, then carrying out information fusion among channels, then carrying out up-sampling and feature addition operation on all primary feature graphs between feature branches by taking the primary feature graph with the lowest resolution as a starting point, and finally carrying out prediction output;
step 20, acquiring a multi-person gesture estimation data set, wherein the multi-person gesture estimation data set comprises multi-person gesture pictures and ground truth labeling of joint points; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map by using the multi-person gesture estimation data set as training labels, and training the characteristic pyramid network;
and 30, inputting the image to be tested into the trained characteristic pyramid network, calculating joint positions according to the output central point heat map, the upper offset heat map, the lower offset heat map and the joint refinement heat map, and forming complete human body gestures according to the joint positions.
Further, the step 10 specifically includes:
step 11, creating a plurality of first convolution kernels for extracting the primary features of the image and changing the number of feature channels;
step 12, sequentially cascading convolution modules formed from a plurality of inverted residual units at the outputs of the first convolution kernels, to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature maps extracted by the inverted residual unit modules, setting a group of second convolution kernels for carrying out inter-channel information fusion on the original feature map of the current layer, to obtain the corresponding fused feature map;
step 14, sequentially cascading a plurality of deconvolution modules starting from the lowest-resolution feature map layer, wherein each deconvolution module enlarges the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and then performs a position-wise element summation of the enlarged feature map with the fused feature map of the next layer to obtain an enhanced feature map;
and step 15, producing the prediction output from the highest-resolution enhanced feature map using four groups of parallel third convolution kernels.
Further, in the step 20, the training of the feature pyramid network is specifically: calculating the loss values of the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network against the training labels, together with the total loss, and training the feature pyramid network according to these loss values;
the formula for calculating the center point heat map loss value is as follows:
Figure BDA0002872330120000031
wherein P (P) j ) Representing the predicted center point heat map with position p j Predicted value at G (p j ) Representing a center point heat map constructed from training labels at position p j True value at;
the formula for calculating the offset heat map loss value is:
Figure BDA0002872330120000032
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure BDA0002872330120000041
an upper offset heat map representing a predicted joint type i,>
Figure BDA0002872330120000042
an upper offset heat map true value representing a joint type i in the training label;
the formula for calculating the lower offset heat map loss value is:
Figure BDA0002872330120000043
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure BDA0002872330120000044
lower offset heat map representing predicted joint type i,/->
Figure BDA0002872330120000045
Representing a true value of a lower offset heat map with a joint type i in the training label;
the formula for calculating the joint refinement heat map loss value is as follows:
Figure BDA0002872330120000046
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure BDA0002872330120000047
a refined heat map representing a predicted joint type i,/->
Figure BDA0002872330120000048
A refinement heat map true value with a joint type i in the training label is represented;
the formula for calculating the total loss is: l=αm+βl u +γL d +δL o
Where α, β, γ, δ represent the weight of each loss.
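A compact sketch of these four loss terms and their weighted sum, assuming PyTorch tensors; the dictionary keys and the default weight values are assumptions of this sketch, not values fixed by the patent:

```python
import torch
import torch.nn.functional as F

def total_loss(pred, gt, alpha=1.0, beta=0.1, gamma=0.1, delta=0.1):
    """L = alpha*M + beta*L_u + gamma*L_d + delta*L_o, each term a mean square error.

    pred and gt are dicts of heat map tensors; the keys and the default
    weight values are illustrative assumptions."""
    m   = F.mse_loss(pred["center"], gt["center"])  # center point heat map loss M
    l_u = F.mse_loss(pred["upper"],  gt["upper"])   # upper offset heat map loss L_u
    l_d = F.mse_loss(pred["lower"],  gt["lower"])   # lower offset heat map loss L_d
    l_o = F.mse_loss(pred["refine"], gt["refine"])  # joint refinement heat map loss L_o
    return alpha * m + beta * l_u + gamma * l_d + delta * l_o
```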
Further, in the step 20, the center point heat map is constructed according to the positions of the center joints, the upper offset heat map is constructed according to the offsets of the center joints to each type of upper body joint and the hip joint, the lower offset heat map is constructed according to the offsets of the hip joints to each type of lower limb joint, and the joint refinement heat map is constructed according to the positions of the joints other than the center joints.
Further, the step 30 specifically includes:
step 31, obtaining the image to be tested, preprocessing it, and then inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one central joint point position by using a non-maximum suppression algorithm in the predicted central point heat map;
step 33, according to the central joint point position, in a predicted upper offset heat map, response values corresponding to each type of upper body joint and hip joint are obtained, and according to the response values, fuzzy positions of each type of upper body joint and hip joint are calculated;
step 34, calculating the accurate positions of the upper body joints and the hip joints according to the fuzzy positions of the upper body joints and the hip joints of each type and the response values of the corresponding joints in the predicted joint refinement heat map;
step 35, calculating the fuzzy position of each type of lower body joint according to the accurate position of the hip joint and the lower offset heat map of each type of lower body joint;
step 36, calculating to obtain the accurate position of each type of lower body joint according to the fuzzy position of each type of lower body joint and the response value of the corresponding joint in the predicted joint refinement heat map;
step 37, according to the accurate positions of the joints of the whole body, all the joints are sequentially connected to form a complete human body posture based on the preset joint sequence.
One or more technical solutions provided in the embodiments of the present invention at least have the following technical effects or advantages:
1. The feature pyramid network is built on the MobileNet network, and up-sampling and feature addition operations are then performed across the feature branches; this effectively reduces the parameter count of the deep convolutional neural network, lets information flow efficiently through the network, and fuses joint point information with spatial and semantic information, greatly improving the accuracy of human pose estimation;
2. With the single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions, which overcomes the slow training and long inference times faced by traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network: the estimation of the human pose is completed within a single network, no other post-processing operations are needed, and the efficiency of human pose estimation is greatly improved.
The foregoing is only an overview of the technical solution of the present invention; so that the technical means of the invention may be understood more clearly and implemented in accordance with this description, and so that the above and other objects, features and advantages of the invention may become more readily apparent, the detailed description of the invention follows.
Drawings
The invention is further described below through example embodiments with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature pyramid network in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of an inverted residual unit according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the feature pyramid network output in an embodiment of the present invention;
fig. 5 is a schematic diagram of a result of human body posture estimation according to an embodiment of the present invention.
Detailed Description
The overall idea of the technical solution in the embodiments of this application is as follows:
First, a feature pyramid network is built on MobileNet, which greatly reduces the parameter count of the deep convolutional neural network while keeping the loss of accuracy within an acceptable range. Second, several 3x3 deconvolution layers are placed between the feature maps of different levels in the feature pyramid network to recover resolution, and a feature addition operation is performed with the feature map of the previous level; this further improves the flow and fusion of information between features, so that the network can effectively exploit and fuse spatial and semantic information. Then, in addition to the joint point prediction and the offset prediction, a joint refinement heat map is added: because network predictions of long-distance offsets are often inaccurate, an additional refinement heat map is predicted at each joint point to refine the joint positions, so that each joint of the human body is located more precisely. This greatly improves the accuracy of pose estimation and provides an accurate pose reference for higher-level tasks such as behavior recognition, pedestrian re-identification and abnormal behavior detection.
The embodiment of this application provides a single-stage multi-person pose estimation method based on a feature pyramid network; referring to fig. 1, it includes:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially reduced resolution, then carrying out inter-channel information fusion, then carrying out up-sampling and feature addition operations on all primary feature maps across the feature branches, starting from the primary feature map with the lowest resolution, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the data set comprises multi-person pose pictures and ground truth annotations of the joint points; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the data set as training labels, and training the feature pyramid network;
that is, a large number of sample images (multi-person pose pictures) are obtained in advance, the joint points of each sample image are annotated, and the images are divided into a training set, a validation set and a test set; the training set is input into the deep convolutional neural network for training, the validation set and the test set are used to verify the trained network, and whether the loss value reaches a preset threshold is judged;
and step 30, inputting the image to be tested into the trained feature pyramid network, calculating the joint point positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and then forming the complete human poses from the joint point positions.
With this single-stage human pose representation, the human pose can be inferred directly from the predicted joint positions; this overcomes the slow training and long inference times faced by traditional pose algorithms and fully exploits the end-to-end training advantage of a convolutional neural network: the estimation of the human pose is completed within a single network, no other post-processing operations are needed, and the efficiency of human pose estimation is greatly improved.
Referring to fig. 2, in a possible implementation manner, in step 10, a feature pyramid network is built based on a MobileNet network, which specifically includes:
step 11, creating a plurality of first convolution kernels (for example, convolution kernels of size 3x3) for extracting the primary features of the image and changing the number of feature channels;
step 12, sequentially cascading convolution modules formed from a plurality of inverted residual units at the outputs of the first convolution kernels, to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
wherein, as shown in fig. 3, a single inverted residual unit is constructed as follows:
step 121, first performing feature extraction with a number of convolution kernels of spatial size 1x1 (a pointwise convolution module), while increasing the number of feature channels;
step 122, adding a ReLU6 activation function after this convolution;
step 123, adding a 3x3 depthwise separable convolution (a depthwise convolution module) after the ReLU6 activation function to extract features;
step 124, adding a ReLU6 activation function after the separable convolution;
step 125, adding a 1x1 convolution kernel (a pointwise convolution module) after the ReLU6 activation function to reduce the number of feature channels;
step 126, adding a linear activation function after the 1x1 convolution;
step 127, mapping the feature map input to step 121 onto the feature map generated in step 126 via an identity mapping, and performing an element-wise feature addition operation, as in the sketch below;
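A minimal PyTorch sketch of this inverted residual unit follows; the fixed channel count and the expansion factor of 6 are illustrative assumptions rather than values taken from the patent:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual unit: 1x1 expand -> 3x3 depthwise -> 1x1 project -> add."""

    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # steps 121-122: 1x1 pointwise convolution raises the channel count, then ReLU6
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU6(inplace=True),
            # steps 123-124: 3x3 depthwise convolution extracts features, then ReLU6
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.ReLU6(inplace=True),
            # steps 125-126: 1x1 pointwise convolution lowers the channel count, linear activation
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # step 127: identity mapping plus element-wise feature addition
        return x + self.block(x)
```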
the steps 11 and 12 are set up of a MobileNet network structure, and the embodiment further performs information fusion and feature reinforcement on the feature branches through subsequent steps on the basis of the MobileNet network:
step 13, after each layer of original feature images extracted by the reverse residual error unit module, a group of second convolution kernels (for example, convolution kernels with the size of 1×1) are set for carrying out channel information fusion on the original feature images of the current layer to obtain corresponding fusion feature images;
step 14, sequentially cascading a plurality of deconvolution modules with the feature map layer with the lowest resolution as a starting point, wherein the deconvolution modules are used for amplifying the resolution of the fusion feature map of the current layer into the resolution of the fusion feature map of the next layer to obtain an amplified feature map, and then carrying out element summation operation on the amplified feature map and the fusion feature map of the next layer from position to further carry out feature fusion to obtain an enhanced feature map;
and 15, performing prediction output of the final four heat maps on the intensified characteristic map with the maximum resolution by using four groups of parallel third convolution kernels (for example, convolution kernels with the size of 1×1).
In one possible embodiment, the upper body joints are divided into 9 classes: the head joint, neck joint, chest joint, and the left and right shoulder, elbow and wrist joints, wherein the chest joint is defined as the center joint point; the lower body joints are divided into 7 classes: the hip joint and 6 lower limb joints (the left and right hips, knees and ankles). The center point heat map is constructed from the positions of the center joint points; the upper offset heat map is constructed from the offsets of the center joint point to each of the other 8 classes of upper body joints and to the hip joint; the lower offset heat map is constructed from the offsets of the hip joint to each of the other 6 classes of lower limb joints; and the joint refinement heat map is constructed from the positions of all joint points other than the center joint point. The numbers of heat map channels predicted by the pyramid network are 1, 18, 12 and 30 respectively, corresponding to 1 center point heat map, 18 upper offset heat maps (each joint class has offset heat maps for the X and Y channels), 12 lower offset heat maps (likewise X and Y channels per joint class) and 30 joint refinement heat maps (for the joints other than the center point, also split into X and Y channels); the output heat maps are shown in fig. 4. In the four groups of parallel third convolution kernels, the number of kernels in each group matches the corresponding number of heat map channels, i.e., 1, 18, 12 and 30 respectively (see the sketch below).
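For illustration, a simplified PyTorch sketch of steps 13-15: lateral 1x1 fusion of each level, top-down deconvolution with position-wise addition, and the four parallel prediction heads. The backbone stage widths and the deconvolution hyperparameters are assumptions; only the head channel counts 1, 18, 12 and 30 come from the description above:

```python
import torch.nn as nn

class PyramidFusionHead(nn.Module):
    """Lateral fusion, top-down enhancement, and four parallel heat map heads."""

    def __init__(self, in_channels=(32, 64, 128, 256), mid=64):
        super().__init__()
        # step 13: one group of 1x1 "second convolution kernels" per pyramid level
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid, kernel_size=1) for c in in_channels)
        # step 14: deconvolution modules, each doubling the spatial resolution
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(mid, mid, kernel_size=4, stride=2, padding=1)
            for _ in in_channels[:-1]
        )
        # step 15: four parallel groups of 1x1 "third convolution kernels"
        self.heads = nn.ModuleList(nn.Conv2d(mid, k, kernel_size=1) for k in (1, 18, 12, 30))

    def forward(self, feats):
        # feats: primary feature maps ordered from highest to lowest resolution
        fused = [lat(f) for lat, f in zip(self.lateral, feats)]
        x = fused[-1]  # start from the lowest-resolution fused feature map
        for i in range(len(fused) - 2, -1, -1):
            x = self.up[i](x) + fused[i]  # enlarge, then position-wise element summation
        # center point / upper offset / lower offset / joint refinement heat maps
        return [head(x) for head in self.heads]
```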
By building the feature pyramid network on the MobileNet network and then performing up-sampling and feature addition operations across the feature branches, the parameter count of the deep convolutional neural network can be effectively reduced, information can flow efficiently through the network, and joint point information can be fused with spatial and semantic information, greatly improving the accuracy of human pose estimation.
In one possible implementation, the step 20 specifically includes:
step 21, acquiring a sample image from the data set, and resizing it to an RGB image of size 256×256;
step 22, inputting the sample image into the feature pyramid network for a single forward pass, and obtaining the network-predicted center point heat maps, upper offset heat maps, lower offset heat maps and joint refinement heat maps corresponding to the multiple human bodies in the image;
step 23, constructing the truth labels of the heat maps from the ground truth joint annotations of the sample image (a construction sketch is given after step 28 below): the chest joint points of the multiple human bodies are used as the human body center points and processed with a Gaussian kernel to obtain a single two-dimensional center point heat map, in which the positions of the Gaussian peaks are the positions of the multiple center points; the upper offset heat maps are constructed from the offsets from each person's chest joint to the joint points of the upper body (e.g., the upper limbs and head) and of the hip, where the offset heat map of a single joint comprises two heat maps for the x and y coordinates respectively; the lower offset heat maps are constructed in a similar manner using the offsets from the hip joint point to the lower limb joint points; the joint refinement heat maps further refine each joint position predicted from the center joint position and the offset values, with each joint class again corresponding to x and y channels, where the response value at a point points to the position of the nearest joint point, and the response range is a circle of radius R around each joint point;
step 24, computing the loss value of the center point heat map from the network-predicted heat map and the truth label heat map using a mean square error loss function, and using this loss value to train the network's center point heat map prediction:

$$M = \sum_{p_j} \left( P(p_j) - G(p_j) \right)^2$$

where $P(p_j)$ denotes the predicted value at position $p_j$ in the predicted center point heat map, and $G(p_j)$ denotes the true value at position $p_j$ in the center point heat map constructed from the training labels;

step 25, computing the loss value of the upper offset heat maps using a mean square error loss function, and using this loss value to train the network's upper offset heat map prediction:

$$L_u = \sum_i \sum_{p_j} \left( P_i^u(p_j) - G_i^u(p_j) \right)^2$$

where $i$ indexes the heat maps corresponding to the different joint types, $p_j$ denotes a position on the heat map, $P_i^u$ denotes the predicted upper offset heat map for joint type $i$, and $G_i^u$ denotes the true upper offset heat map for joint type $i$ in the training labels;

step 26, computing the loss value of the lower offset heat maps using a mean square error loss function, and using this loss value to train the network's lower offset heat map prediction:

$$L_d = \sum_i \sum_{p_j} \left( P_i^d(p_j) - G_i^d(p_j) \right)^2$$

where $P_i^d$ denotes the predicted lower offset heat map for joint type $i$, and $G_i^d$ denotes the true lower offset heat map for joint type $i$ in the training labels;

step 27, computing the loss value of the joint refinement heat maps using a mean square error loss function, and using this loss value to train the network's joint refinement heat map prediction:

$$L_o = \sum_i \sum_{p_j} \left( P_i^o(p_j) - G_i^o(p_j) \right)^2$$

where $P_i^o$ denotes the predicted refinement heat map for joint type $i$, and $G_i^o$ denotes the true refinement heat map for joint type $i$ in the training labels;

step 28, the final loss function of the network is:

$$L = \alpha M + \beta L_u + \gamma L_d + \delta L_o$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ denote the weights of the respective losses.
By setting these weight parameters, the different subtasks can exert different influences on the parameters of the backbone network, so the backbone network can be optimized as needed.
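Returning to the label construction of step 23, the sketch below shows how the center point heat map and one pair of offset channels might be built with NumPy; the Gaussian sigma, the response radius and the integer pixel coordinates are illustrative assumptions:

```python
import numpy as np

def center_heatmap(chest_points, h, w, sigma=2.0):
    """Single 2D heat map with a Gaussian peak at every chest (center) joint."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for cx, cy in chest_points:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)  # keep the strongest response per pixel
    return heat

def offset_maps(center, joint, h, w, radius=3):
    """x/y offset channel pair: pixels around the center joint store the
    displacement from the center joint to the target joint."""
    off_x = np.zeros((h, w), dtype=np.float32)
    off_y = np.zeros((h, w), dtype=np.float32)
    cx, cy = int(center[0]), int(center[1])
    dx, dy = joint[0] - center[0], joint[1] - center[1]
    y0, y1 = max(0, cy - radius), min(h, cy + radius + 1)
    x0, x1 = max(0, cx - radius), min(w, cx + radius + 1)
    off_x[y0:y1, x0:x1] = dx
    off_y[y0:y1, x0:x1] = dy
    return off_x, off_y
```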
In one possible implementation manner, the step 30 specifically includes:
step 31, obtaining the image to be tested, preprocessing it (for example, resizing it to an RGB image of size 256×256), and then inputting it into the trained feature pyramid network; a single forward pass through the network yields the center point heat map, upper offset heat maps, lower offset heat maps and joint refinement heat maps predicted for the test image;
step 32, using a non-maximum suppression algorithm to find, in the predicted center point heat map, the maximum pixel value positions of the center joint points (e.g., the chest joints) of the multiple human bodies, taking these as the center joint point positions;
step 33, for each center joint point position, reading the response values at that position in the predicted upper offset heat maps of each joint class (the upper body joints and the hip joint), and computing the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, computing the precise positions of the upper body joints and the hip joint from their fuzzy positions and the response values of the corresponding joints in the predicted joint refinement heat maps;
step 35, computing the fuzzy position of each class of lower body joint from the precise position of the hip joint and the response values at the corresponding position in the lower offset heat map of each class of lower body joint;
step 36, computing the precise position of each class of lower body joint from its fuzzy position and the response values at the corresponding positions in the joint refinement heat maps;
step 37, according to the computed final precise positions of the whole-body joints, connecting the joints in order, based on the preset joint order, to form the complete multi-person poses, as shown in fig. 5.
Through the center point and upper offset heat maps, the positions of the other upper body joints and the hip joint are inferred from the center joint point position; through the lower offset heat maps, the positions of the other lower body joints are inferred from the hip joint position; and the refinement heat maps further pinpoint these positions, avoiding the error caused by long-distance offsets and yielding the precise positions of the whole-body joints, as the sketch below illustrates.
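A condensed sketch of this decoding for the upper-body joints of a single person; taking the global heat map maximum instead of full non-maximum suppression, the confidence threshold, and the refinement channel layout are all simplifying assumptions of the sketch:

```python
import numpy as np

def decode_upper_body(center_heat, upper_off, refine, num_joints=9, thresh=0.3):
    """Recover one person's upper-body joints (and hip) from predicted heat maps.

    upper_off and refine are arrays of shape (2*num_joints, H, W) holding
    x/y channel pairs; this layout is an assumption of the sketch."""
    h, w = center_heat.shape
    # step 32: take the heat map maximum as the center (chest) joint position
    cy, cx = np.unravel_index(np.argmax(center_heat), center_heat.shape)
    if center_heat[cy, cx] < thresh:
        return None
    joints = []
    for i in range(num_joints):
        # step 33: fuzzy position = center position + offset read at the center
        fx = cx + upper_off[2 * i, cy, cx]
        fy = cy + upper_off[2 * i + 1, cy, cx]
        ix = int(np.clip(round(fx), 0, w - 1))
        iy = int(np.clip(round(fy), 0, h - 1))
        # step 34: precise position = fuzzy position + refinement response there
        joints.append((fx + refine[2 * i, iy, ix], fy + refine[2 * i + 1, iy, ix]))
    return joints
```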
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that the described embodiments are illustrative only and are not intended to limit the scope of the invention; equivalent modifications and variations made in accordance with the spirit of the invention are intended to be covered by the claims of the present invention.

Claims (2)

1. A single-stage multi-person pose estimation method based on a feature pyramid network, characterized by comprising the following steps:
step 10, building a feature pyramid network based on a MobileNet network, wherein the pyramid network is used for extracting a plurality of primary feature maps with sequentially reduced resolution, then carrying out inter-channel information fusion, then carrying out up-sampling and feature addition operations on all primary feature maps across the feature branches, starting from the primary feature map with the lowest resolution, and finally producing the prediction output;
step 20, acquiring a multi-person pose estimation data set, wherein the data set comprises multi-person pose pictures and ground truth annotations of the joint points; constructing a center point heat map, an upper offset heat map, a lower offset heat map and a joint refinement heat map from the data set as training labels, and training the feature pyramid network;
step 30, inputting the image to be tested into the trained feature pyramid network, calculating the joint positions from the output center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map, and forming the complete human poses from the joint positions;
the step 10 specifically includes:
step 11, creating a plurality of first convolution kernels for extracting the primary features of the image and changing the number of feature channels;
step 12, sequentially cascading convolution modules formed from a plurality of inverted residual units at the outputs of the first convolution kernels, to complete the construction of a multi-layer feature extraction main branch, wherein the resolutions of the multi-layer original feature maps output by the feature extraction main branch decrease layer by layer;
step 13, after each layer of original feature maps extracted by the inverted residual unit modules, setting a group of second convolution kernels for carrying out inter-channel information fusion on the original feature map of the current layer, to obtain the corresponding fused feature map;
step 14, sequentially cascading a plurality of deconvolution modules starting from the lowest-resolution feature map layer, wherein each deconvolution module enlarges the resolution of the fused feature map of the current layer to that of the next layer to obtain an enlarged feature map, and then performs a position-wise element summation of the enlarged feature map with the fused feature map of the next layer to obtain an enhanced feature map;
step 15, producing the prediction output from the highest-resolution enhanced feature map using four groups of parallel third convolution kernels;
in the step 20, the training of the feature pyramid network is specifically: calculating the loss values of the center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map predicted by the feature pyramid network against the training labels, together with the total loss, and training the feature pyramid network according to these loss values;
the formula for calculating the center point heat map loss value is as follows:
Figure FDA0004151291920000021
wherein P (P) j ) Representing the predicted center point heat map with position p j Predicted value at G (p j ) Representing a center point heat map constructed from training labels at position p j True value at;
the formula for calculating the offset heat map loss value is:
Figure FDA0004151291920000022
wherein i represents heat maps corresponding to different joint types, p j Representing a position on the heat map, P i u An up-shift heat map representing a predicted joint type i,
Figure FDA0004151291920000023
an upper offset heat map true value representing a joint type i in the training label;
the formula for calculating the lower offset heat map loss value is:
Figure FDA0004151291920000024
wherein i represents heat maps corresponding to different joint types, p j Representing a position on the heat map, P i d A lower offset heat map representing a predicted joint type i,
Figure FDA0004151291920000025
representing a true value of a lower offset heat map with a joint type i in the training label;
the formula for calculating the joint refinement heat map loss value is as follows:
Figure FDA0004151291920000026
/>
wherein i represents heat maps corresponding to different joint types, p j A certain position on the heat map is indicated,
Figure FDA0004151291920000027
a refined heat map representing a predicted joint type i,/->
Figure FDA0004151291920000028
A refinement heat map true value with a joint type i in the training label is represented;
the formula for calculating the total loss is: l=αm+βl u +γL d +δL o
Wherein α, β, γ, δ represent the weight of each loss;
the step 30 specifically includes:
step 31, obtaining the image to be tested, preprocessing it, and then inputting it into the trained feature pyramid network to obtain the predicted center point heat map, upper offset heat map, lower offset heat map and joint refinement heat map;
step 32, obtaining at least one center joint point position in the predicted center point heat map using a non-maximum suppression algorithm;
step 33, for each center joint point position, reading the corresponding response values in the predicted upper offset heat map for each class of upper body joint and the hip joint, and computing the fuzzy position of each class of upper body joint and of the hip joint from these response values;
step 34, computing the precise positions of the upper body joints and the hip joint from their fuzzy positions and the response values of the corresponding joints in the predicted joint refinement heat map;
step 35, computing the fuzzy position of each class of lower body joint from the precise position of the hip joint and the lower offset heat map of each class of lower body joint;
step 36, computing the precise position of each class of lower body joint from its fuzzy position and the response values of the corresponding joints in the predicted joint refinement heat map;
step 37, according to the precise positions of the whole-body joints, connecting the joints in order, based on the preset joint order, to form the complete human poses.
2. The method according to claim 1, characterized in that: in the step 20, the center point heat map is constructed according to the positions of the center joint points, the upper offset heat map is constructed according to the offsets of the center joint points to each type of upper body joint and the hip joint, the lower offset heat map is constructed according to the offsets of the hip joint to each type of lower limb joint, and the joint refinement heat map is constructed according to the positions of the joint points except the center joint points.
CN202011607963.XA 2020-12-30 2020-12-30 Single-stage multi-person pose estimation method based on feature pyramid network Active CN112597955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607963.XA CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network

Publications (2)

Publication Number Publication Date
CN112597955A CN112597955A (en) 2021-04-02
CN112597955B true CN112597955B (en) 2023-06-02

Family

ID=75206178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607963.XA Active CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network

Country Status (1)

Country Link
CN (1) CN112597955B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011402B (en) * 2021-04-30 2023-04-25 中国科学院自动化研究所 Primate gesture estimation system and method based on convolutional neural network
CN113343762B (en) * 2021-05-07 2022-03-29 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113420604B (en) * 2021-05-28 2023-04-18 沈春华 Multi-person posture estimation method and device and electronic equipment
CN113297995B (en) * 2021-05-31 2024-01-16 深圳市优必选科技股份有限公司 Human body posture estimation method and terminal equipment
CN113673354B (en) * 2021-07-23 2024-02-20 湖南大学 Human body key point detection method based on context information and joint embedding
CN113610015B (en) * 2021-08-11 2023-05-30 华侨大学 Attitude estimation method, device and medium based on end-to-end fast ladder network
CN114529605B (en) * 2022-02-16 2024-05-24 青岛联合创智科技有限公司 Human body three-dimensional posture estimation method based on multi-view fusion


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965719B2 (en) * 2015-11-04 2018-05-08 Nec Corporation Subcategory-aware convolutional neural networks for object detection
US10733431B2 (en) * 2017-12-03 2020-08-04 Facebook, Inc. Systems and methods for optimizing pose estimation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
CN110427890A (en) * 2019-08-05 2019-11-08 华侨大学 More people's Attitude estimation methods based on depth cascade network and mass center differentiation coding
CN111191622A (en) * 2020-01-03 2020-05-22 华南师范大学 Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
CN111832383A (en) * 2020-05-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on 2D human pose estimation with a high-resolution convolutional neural network based on ASPP; Shen Xiaofeng; Wang Chunjia; Modern Computer (No. 13); full text *
Pedestrian detection based on statistical structural gradient features; Ding Zhifeng; China Master's Theses Full-text Database, Information Science and Technology; full text *

Also Published As

Publication number Publication date
CN112597955A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN112597955B (en) Single-stage multi-person pose estimation method based on feature pyramid network
CN109325547A (en) Non-motor vehicle image multi-tag classification method, system, equipment and storage medium
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN113807355A (en) Image semantic segmentation method based on coding and decoding structure
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN111476315A (en) Image multi-label identification method based on statistical correlation and graph convolution technology
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN110543890A (en) Deep neural network image matching method based on characteristic pyramid
CN107247952B (en) Deep supervision-based visual saliency detection method for cyclic convolution neural network
CN111612051A (en) Weak supervision target detection method based on graph convolution neural network
CN113011386B (en) Expression recognition method and system based on equally divided characteristic graphs
CN109785409B (en) Image-text data fusion method and system based on attention mechanism
CN113516133A (en) Multi-modal image classification method and system
CN112669343A (en) Zhuang minority nationality clothing segmentation method based on deep learning
CN114764941A (en) Expression recognition method and device and electronic equipment
Tian et al. Real-time semantic segmentation network based on lite reduced atrous spatial pyramid pooling module group
CN112597956B (en) Multi-person pose estimation method based on human body anchor point set and perception enhancement network
CN112418070B (en) Attitude estimation method based on decoupling ladder network
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
Wang et al. Single shot multibox detector with deconvolutional region magnification procedure
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
Sharma et al. Real-Time Word Level Sign Language Recognition Using YOLOv4
CN109871835B (en) Face recognition method based on mutual exclusion regularization technology
CN112818982A (en) Agricultural pest image detection method based on depth feature autocorrelation activation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant