CN115187660A - Knowledge distillation-based multi-person human body posture estimation method and system - Google Patents

Info

Publication number
CN115187660A
Authority
CN
China
Prior art keywords
joint point
heat map
network
representing
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210714617.4A
Other languages
Chinese (zh)
Inventor
欧卫华
蒋永红
高国强
犹津
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Siso Electronics Co ltd
Original Assignee
Guizhou Siso Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Siso Electronics Co ltd
Priority to CN202210714617.4A
Publication of CN115187660A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a knowledge distillation-based multi-person human body posture estimation method and system, belonging to human body posture estimation in the technical field of computer vision. It aims to solve two problems of existing knowledge distillation techniques: a student network built from stacked Hourglass networks is difficult to make very small, and reducing the complexity of the network model degrades model performance. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1×1 kernel. Data are input into a teacher network and the student network to obtain the corresponding heat maps, and joint point biases generated from the student heat map and the target joint point heat map of the data label dynamically adjust the knowledge transferred from the teacher network to the student network. Acquired real-time human body posture data are then input into the student network, which outputs a real-time joint point heat map that is converted into real joint point coordinates.

Description

Knowledge distillation-based multi-person human body posture estimation method and system
Technical Field
The invention belongs to the technical field of computer vision, relates to a human body posture estimation method, and particularly relates to a human body posture estimation method based on a knowledge distillation method.
Background
Human Pose Estimation (HPE) aims at obtaining Human pose joint coordinates from image input, and is one of the hot problems in the field of computer vision. In recent years, the human posture estimation technology is rapidly developed and widely applied to the fields of human body tracking, motion recognition, motion detection, human-computer interaction and the like.
The success of AlexNet in image classification and recognition pushed computer vision research into the deep learning era. In 2014, A. Toshev et al. proposed DeepPose, a deep learning-based human posture estimation model that uses a deep convolutional neural network to predict the human posture globally, marking the entry of human posture estimation into the deep learning era. Thanks to the strong feature extraction capability of deep convolutional networks, human posture estimation has advanced considerably. Compared with methods based on hand-crafted features, deep learning-based human posture estimation algorithms are more robust. Since then, the release of the MPII and MSCOCO datasets has spawned a great deal of excellent work on deep learning-based human pose estimation.
With the progress of science and technology, more and more intelligent devices appear in people's lives, such as self-driving cars, intelligent surveillance cameras, intelligent motion assistors and elderly-care robots. All of them need functions such as motion recognition, motion detection and human-computer interaction, and these functions depend on a human posture estimation algorithm. Deploying large human posture estimation algorithms on such devices, whose computing resources are limited, is very difficult. Although a large human posture estimation algorithm can currently be deployed in the cloud so that intelligent devices obtain these functions through technologies such as cloud computing, this easily leaks information, and people's privacy cannot be guaranteed. Researchers have therefore begun to study lighter and more efficient human posture estimation algorithms that do not need large GPUs or similar hardware and can run in real time on mobile phones, wearable devices, surveillance cameras and embedded devices.
2D human pose estimation can be divided into two categories: Single-Person Pose Estimation (SPPE) and Multi-Person Pose Estimation (MPPE). For multi-person pose estimation, the mainstream methods are generally classified as top-down or bottom-up according to the starting point (abstraction level) of the prediction. The top-down approach starts from the high-level abstraction: it first detects people and produces a bounding box for each person, and then estimates the pose of each person. The top-down method is more intuitive than the bottom-up method and is also more accurate. To detect people, most current mainstream multi-person posture estimation algorithms use common object detection algorithms such as Faster R-CNN, Mask R-CNN and YOLO as human body detectors, and then generate the multi-person postures by single-person posture estimation. To achieve lightweight multi-person posture estimation, more and more researchers are studying how to make human posture estimation models lightweight. Feng et al. use knowledge distillation to transfer knowledge from a complex teacher hourglass network into a simple student hourglass network. Much other work on lightweight human posture estimation focuses on improving the network structure. Tang et al. achieve high-precision keypoint localization with a densely connected U-Nets network similar to an hourglass network. Debnath et al., inspired by the hourglass network, introduce a novel split architecture in the last two layers of MobileNet, which reduces the model parameters, alleviates overfitting and improves accuracy. Zhang et al. introduce global attention and propose a lightweight bottleneck block to replace the bottleneck block in ResNet, constructing LPN, which is similar in structure to SimpleBaseline.
Yu et al. propose replacing the pointwise convolution in the ShuffleNet shuffle block with a channel weighting operation and construct the Lite-HRNet network, which maintains high-resolution feature representations. Ding et al. and Zhang et al. use Neural Architecture Search (NAS) to construct, respectively, HR-NAS, a multi-scale feature fusion network similar to HRNet, and EfficientPose, which is similar to SimpleBaseline.
Human posture estimation knowledge distillation methods such as FPD and OKDHM are constrained by factors such as the complexity gap between the teacher and student networks. They build both networks from stacked Hourglass networks, using a network with many stacked layers and a large number of parameters as the teacher and a network with few stacked layers and few parameters as the student, so the student network is difficult to make very small.
Disclosure of Invention
Purpose of the invention: to solve the problem in the prior art that a student network built from stacked Hourglass networks is difficult to make very small, and to avoid the drop in model performance caused by reducing the complexity of the network model, the invention provides a knowledge distillation-based multi-person human body posture estimation method and system.
The invention specifically adopts the following technical scheme for realizing the purpose:
a multi-person human posture estimation method based on knowledge distillation comprises the following steps:
acquiring human body posture sample data and a human body posture sample data label;
constructing a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1×1 kernel;
inputting human body posture sample data into a pre-trained teacher network to obtain a first output joint point heat map, inputting human body posture sample data into a student network to obtain a second output joint point heat map, generating joint point bias by using the second output joint point heat map and a target joint point heat map of a human body posture sample data label, and guiding the learning of the student network by combining the joint point bias and the first output joint point heat map;
and inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.
Preferably, the human body posture sample data label is converted by a two-dimensional Gaussian function G_m(x, y) into a target joint point heat map H_m ∈ R^(h×w), where the two-dimensional Gaussian function G_m(x, y) is:
Figure RE-GDA0003799472960000031
H_m = G_m(x, y)
where σ is the standard deviation of the Gaussian distribution, w and h are the width and length of the generated heat map, and x_m and y_m are respectively the horizontal and vertical coordinates of the joint point.
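The formula image for G_m(x, y) did not survive extraction. For orientation only, a commonly used form of a two-dimensional Gaussian heat map that is consistent with the symbols defined above (σ, x_m, y_m) is written out below. The absence of a normalization factor is an assumption, not something the source confirms.

```latex
G_m(x, y) = \exp\!\left( -\frac{(x - x_m)^2 + (y - y_m)^2}{2\sigma^2} \right),
\qquad 0 \le x < w,\; 0 \le y < h
```

With this form the heat map peaks at value 1 at the labeled joint position (x_m, y_m) and decays with distance at a rate set by σ.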
Preferably, the formula for generating the joint point bias is:
Figure RE-GDA0003799472960000041
where P = {p_1, p_2, p_3, ..., p_k} is a hyperparameter, k is the number of human joints, and p_i is the bias of the corresponding joint point i; the symbol in Figure RE-GDA0003799472960000042 represents the heat map generated from the label of the i-th joint point, and the symbol in Figure RE-GDA0003799472960000043 represents the heat map predicted by the student network for the i-th joint point; j represents a hyperparameter coefficient, h_tar represents the joint point heat map generated from the label, and h_stu represents the joint point heat map generated by the student network. The term
Figure RE-GDA0003799472960000044
is defined as:
Figure RE-GDA0003799472960000045
where γ represents a hyperparameter.
Preferably, the teacher network uses an MSE loss function, specifically:
Figure RE-GDA0003799472960000046
where the symbol in Figure RE-GDA0003799472960000047 represents the heat map output by the teacher network for the i-th joint point, the symbol in Figure RE-GDA0003799472960000048 represents the heat map predicted by the student network for the i-th joint point, h_tea represents the heat map output by the teacher network, h_stu represents the heat map output by the student network, and n represents the number of joint points;
the overall loss function for the teacher network and the student network is:
L_total = MSE(h_tea, h_stu) + λ D(h_tar, h_stu)
where λ is a hyperparameter coefficient balancing the weights of the two losses, and h_tar represents the heat map generated from the label.
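Because the MSE formula image is missing from this text, the block below writes out one conventional per-joint form together with the overall loss stated above. The superscript notation h^i for the i-th joint heat map and the 1/n normalization are assumptions chosen to match the surrounding definitions. The bias-weighted term D is left abstract because its formula is not recoverable here.

```latex
\mathrm{MSE}(h_{tea}, h_{stu}) = \frac{1}{n} \sum_{i=1}^{n} \left\lVert h_{tea}^{\,i} - h_{stu}^{\,i} \right\rVert_2^2,
\qquad
L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda \, D(h_{tar}, h_{stu})
```

Setting λ = 0 reduces the objective to pure teacher-student distillation, while a larger λ gives more weight to the label-based term.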
A knowledge-distillation-based multi-person body pose estimation system, comprising:
the data acquisition module is used for acquiring human body posture sample data and a human body posture sample data label;
the network model building module is used for building a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1×1 kernel;
the network model training module is used for inputting human body posture sample data into a pre-trained teacher network to obtain a first output joint point heat map, inputting human body posture sample data into a student network to obtain a second output joint point heat map, generating joint point offset by utilizing the second output joint point heat map and a target joint point heat map of a human body posture sample data label, and guiding the learning of the student network by combining the joint point offset and the first output joint point heat map;
and the heat map real-time module is used for inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map and converting the joint point heat map into a real joint point coordinate.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the above method.
A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the above-mentioned method.
The invention has the following beneficial effects:
1. The invention adopts a lightweight human posture estimation network, GhostNet, to generate joint point heat maps, and uses an online joint point knowledge distillation method in which the label-generated heat map and the teacher network output serve as supervision information to guide GhostNet's learning. The student network therefore does not have to be built by stacking Hourglass networks: the GhostNet student network can be made very small without the model performance dropping because of the reduced complexity of the network model. The student network can thus be deployed on intelligent devices such as self-driving cars, intelligent surveillance cameras, intelligent motion assistors and elderly-care robots, which greatly broadens its range and fields of application.
2. To improve model performance, the invention provides a knowledge distillation-based online joint point optimization strategy: joint point biases are generated from the output of the GhostNet network and the label-generated heat map, and a pre-trained large human posture estimation network, HRNet, serves as the teacher network to soften the label information and guide the learning of the student network GhostNet, which makes model training more efficient.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic structural view of the present invention;
FIG. 3 is a schematic diagram of the structure of a student network model of the present invention;
FIG. 4 is a schematic structural diagram of the three-layer LPB of the present invention.
Detailed Description
Example 1
The embodiment provides a knowledge distillation-based multi-person human body posture estimation method, which is used for estimating the postures of a plurality of persons. It comprises the following steps:
Step S1, obtaining sample data
Human body posture sample data and the corresponding labels are acquired. To verify the superiority of the method, the public human body posture estimation dataset MSCOCO is selected for the experiments, and the evaluation metrics are mAP, AP_0.5, AP_0.75, AP_m, AP_l and AR.
Step S2, building a network model
The network model is a lightweight human posture network model. It comprises a pre-trained teacher network model, an untrained student network model and the matching loss function. The teacher network model is HRNet-W32. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1×1 kernel. Data pass through the encoder, the decoder and the joint point regressor in sequence. The decoder preferably stacks three LPB layers, each consisting of three parts: a depthwise separable transposed convolution, a pointwise convolution and a channel attention module.
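The saving obtained by factoring an upsampling layer into a depthwise transposed convolution followed by a pointwise convolution can be illustrated with a rough parameter count. The sketch below is not the patent's LPB: the channel counts (256), the 4×4 kernel and the helper names are hypothetical, biases are ignored, and the channel attention module is not counted.

```python
def transposed_conv_params(c_in, c_out, k):
    """Weight count of a standard k x k transposed convolution (bias ignored)."""
    return c_in * c_out * k * k


def separable_transposed_conv_params(c_in, c_out, k):
    """Weight count of a depthwise k x k transposed convolution
    followed by a 1 x 1 pointwise convolution (bias ignored)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 256, 256, 4
standard = transposed_conv_params(c_in, c_out, k)
separable = separable_transposed_conv_params(c_in, c_out, k)
print(f"standard: {standard}, separable: {separable}")
```

Under these assumed sizes the factored layer needs roughly one fifteenth of the weights, which is the kind of reduction that makes a stacked upsampling decoder affordable on small devices.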
Step S3, training the network model
The human body posture sample data are input into the pre-trained teacher network to obtain a first output joint point heat map and into the student network to obtain a second output joint point heat map, and the human body posture sample data labels are converted by Gaussian calculation into a target joint point heat map. The joint point biases are generated from the second output joint point heat map and the target joint point heat map. The heat map output by the teacher network serves as part of the supervision information: the joint point biases and the first output joint point heat map jointly guide the learning of the student network until the training of the network model is complete.
When the target joint point heat map is obtained through Gaussian calculation, the human body posture sample data label is converted by a two-dimensional Gaussian function G_m(x, y) into a target joint point heat map H_m ∈ R^(h×w), where the two-dimensional Gaussian function G_m(x, y) is:
Figure RE-GDA0003799472960000071
H_m = G_m(x, y)
where σ is the standard deviation of the Gaussian distribution, w and h are the width and length of the generated heat map, and x_m and y_m are respectively the horizontal and vertical coordinates of the joint point.
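A minimal NumPy sketch of generating such a target heat map follows. It assumes the common unnormalized Gaussian exp(-((x - x_m)^2 + (y - y_m)^2) / (2σ^2)), since the formula image itself is not available. The function name, the 64×48 heat map size and σ = 2 are illustrative choices, not values from the source.

```python
import numpy as np


def gaussian_heatmap(x_m, y_m, h, w, sigma=2.0):
    """Target heat map of shape (h, w) for one joint at (x_m, y_m).

    Assumption: unnormalized Gaussian, so the peak value is 1.
    """
    ys, xs = np.mgrid[0:h, 0:w]  # ys: row index, xs: column index
    squared_dist = (xs - x_m) ** 2 + (ys - y_m) ** 2
    return np.exp(-squared_dist / (2.0 * sigma ** 2))


# Hypothetical joint label at column 20, row 30 of a 64 x 48 heat map.
H_m = gaussian_heatmap(x_m=20, y_m=30, h=64, w=48)
peak_row, peak_col = np.unravel_index(H_m.argmax(), H_m.shape)
```

One such map is produced per joint, so a person with k labeled joints yields a target tensor of shape (k, h, w).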
When joint biases are generated by using the second output joint heat map and the target joint heat map, the formula for generating the joint biases is as follows:
Figure RE-GDA0003799472960000072
where P = {p_1, p_2, p_3, ..., p_k} is a hyperparameter, k is the number of human joints, and p_i is the bias of the corresponding joint point i; the symbol in Figure RE-GDA0003799472960000073 represents the heat map generated from the label of the i-th joint point, and the symbol in Figure RE-GDA0003799472960000074 represents the heat map predicted by the student network for the i-th joint point; j represents a hyperparameter, h_tar represents the heat map generated from the label, and h_stu represents the heat map output by the student network. The term
Figure RE-GDA0003799472960000075
is defined as:
Figure RE-GDA0003799472960000076
where γ represents a hyperparameter.
When training, the teacher network adopts an MSE loss function, which specifically comprises the following steps:
Figure RE-GDA0003799472960000077
where the symbol in Figure RE-GDA0003799472960000078 represents the heat map output by the teacher network for the i-th joint point, the symbol in Figure RE-GDA0003799472960000079 represents the heat map predicted by the student network for the i-th joint point, h_tea represents the heat map output by the teacher network, h_stu represents the heat map output by the student network, and n represents the number of joint points;
during training, the overall loss function of the teacher network and the student network is as follows:
L_total = MSE(h_tea, h_stu) + λ D(h_tar, h_stu)
where λ is a hyperparameter coefficient balancing the weights of the two losses, and h_tar represents the heat map generated from the label.
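The structure of the overall loss can be sketched in NumPy as below. This is an illustration under stated assumptions, not the patented loss: the formula for the bias-weighted term D is in an image that is not recoverable here, so the sketch takes D as a caller-supplied function and uses a plain MSE against the label heat map as a hypothetical stand-in. The function names, λ = 0.5 and the random data are likewise illustrative.

```python
import numpy as np


def mse_heatmaps(h_a, h_b):
    """Mean over joints of the squared L2 distance between heat maps
    of shape (n, h, w). The 1/n normalization is an assumption."""
    diff = (h_a - h_b).reshape(h_a.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def total_loss(h_tea, h_stu, h_tar, bias_term, lam=0.5):
    """L_total = MSE(h_tea, h_stu) + lam * D(h_tar, h_stu).

    bias_term plays the role of D; its real definition is not
    available in this text, so it is passed in by the caller.
    """
    return mse_heatmaps(h_tea, h_stu) + lam * bias_term(h_tar, h_stu)


def plain_mse_stand_in(h_tar, h_stu):
    """Hypothetical placeholder for D: unweighted MSE to the label heat map."""
    return mse_heatmaps(h_tar, h_stu)


rng = np.random.default_rng(0)
h_tar = rng.random((17, 64, 48))                        # 17 joints, as in COCO
h_tea = h_tar + 0.01 * rng.standard_normal(h_tar.shape)  # teacher close to label
h_stu = h_tar + 0.10 * rng.standard_normal(h_tar.shape)  # student further away
loss = total_loss(h_tea, h_stu, h_tar, plain_mse_stand_in, lam=0.5)
```

During training only the student's parameters would be updated from this loss, since the teacher is pre-trained and serves purely as a source of supervision.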
After the network model is trained, only the student network is needed at test time.
Step S4, acquiring the joint point coordinates in real time
The acquired real-time human body posture data are input into the student network, which outputs a real-time joint point heat map that is converted into real joint point coordinates.
In various heat maps, the larger the number in the heat map, the greater the probability that the joint is present at that location. Therefore, when the joint point heatmap is converted into joint point coordinates, it is calculated by the Soft-argmax function. The Soft-argmax function is calculated as follows:
Figure RE-GDA0003799472960000081
where β is a hyperparameter that prevents numerical overflow, h denotes the length of the heat map, w denotes the width of the heat map, x_m and y_m are respectively the horizontal and vertical coordinates of joint point m, H_m represents the heat map corresponding to joint point m, and the symbol in Figure RE-GDA0003799472960000082 represents the heat value at position <h, w> of the heat map for joint point m (which can be understood as the probability that the joint point is present there).
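A common way to implement Soft-argmax is sketched below in NumPy. Because the formula image is missing, the exact role of β in the patent's expression cannot be confirmed: here β is assumed to scale the heat map before a softmax, and overflow is avoided separately by subtracting the maximum before exponentiation. The function name and β = 100 are illustrative.

```python
import numpy as np


def soft_argmax(heatmap, beta=100.0):
    """Return (x_m, y_m) as the softmax-weighted mean position of a heat map.

    Assumption: beta scales the heat map before the softmax; the maximum
    is subtracted first so that np.exp cannot overflow.
    """
    h, w = heatmap.shape
    logits = beta * heatmap
    logits = logits - logits.max()      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()            # probabilities over all h * w cells
    ys, xs = np.mgrid[0:h, 0:w]
    x_m = float((weights * xs).sum())   # expected column index
    y_m = float((weights * ys).sum())   # expected row index
    return x_m, y_m


# Synthetic heat map with a Gaussian peak at column 20, row 30.
ys, xs = np.mgrid[0:64, 0:48]
H_m = np.exp(-((xs - 20) ** 2 + (ys - 30) ** 2) / (2.0 * 2.0 ** 2))
x_m, y_m = soft_argmax(H_m)
```

Unlike a hard argmax, the weighted mean is differentiable and can return sub-pixel positions, which is why it is often preferred for low-resolution heat maps.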
Example 2
The embodiment also provides a knowledge distillation-based multi-person human body posture estimation system, which comprises:
the data acquisition module is used for acquiring human body posture sample data and a human body posture sample data label;
To verify the superiority of the method, the human body posture sample data and their labels are taken from the public human body posture estimation dataset MSCOCO, and the evaluation metrics are mAP, AP_0.5, AP_0.75, AP_m, AP_l and AR.
The network building module is used for building a pre-trained teacher network model and an untrained student network model. The teacher network model is HRNet-W32. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1×1 kernel. Data pass through the encoder, the decoder and the joint point regressor in sequence. The decoder preferably stacks three LPB layers, each consisting of three parts: a depthwise separable transposed convolution, a pointwise convolution and a channel attention module.
The network training module is used for inputting the human body posture sample data into the pre-trained teacher network to obtain a first output joint point heat map and into the student network to obtain a second output joint point heat map, and for converting the human body posture sample data labels by Gaussian calculation into a target joint point heat map. The joint point biases are generated from the second output joint point heat map and the target joint point heat map. The heat map output by the teacher network serves as part of the supervision information: the joint point biases and the first output joint point heat map jointly guide the learning of the student network until the training of the network model is complete.
When the target joint point heat map is obtained through Gaussian calculation, the human body posture sample data label is converted by a two-dimensional Gaussian function G_m(x, y) into a target joint point heat map H_m ∈ R^(h×w), where the two-dimensional Gaussian function G_m(x, y) is:
Figure RE-GDA0003799472960000091
H_m = G_m(x, y)
where σ is the standard deviation of the Gaussian distribution, w and h are the width and length of the generated heat map, and x_m and y_m are respectively the horizontal and vertical coordinates of the joint point.
When joint offsets are generated using the second output joint heat map and the target joint heat map, the formula for generating the joint offsets is:
Figure RE-GDA0003799472960000092
where P = {p_1, p_2, p_3, ..., p_k} is a hyperparameter, k is the number of human joints, and p_i is the bias of the corresponding joint point i; the symbol in Figure RE-GDA0003799472960000093 represents the heat map generated from the label of the i-th joint point, and the symbol in Figure RE-GDA0003799472960000094 represents the heat map predicted by the student network for the i-th joint point; j is a hyperparameter coefficient (indicating that the first j values of p_i are selected), h_tar represents the heat map generated from the label, and h_stu represents the heat map output by the student network. The term
Figure RE-GDA0003799472960000095
is defined as:
Figure RE-GDA0003799472960000101
where γ represents a joint point penalty hyperparameter.
When training, the teacher network adopts an MSE loss function, which specifically comprises the following steps:
Figure RE-GDA0003799472960000102
where the symbol in Figure RE-GDA0003799472960000103 represents the heat map output by the teacher network for the i-th joint point, the symbol in Figure RE-GDA0003799472960000104 represents the heat map predicted by the student network for the i-th joint point, h_tea represents the teacher network heat map, h_stu represents the student network heat map, and n represents the number of joint points;
during training, the overall loss function of the teacher network and the student network is as follows:
L_total = MSE(h_tea, h_stu) + λ D(h_tar, h_stu)
where λ is a hyperparameter coefficient that balances the weights of the two losses, and h_tar represents the heat map generated from the label.
After the network model is trained, only the student network is needed at test time.
And the heat map real-time module is used for inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map and converting the joint point heat map into a real joint point coordinate.
In various heat maps, the larger the number in the heat map, the greater the probability that the joint is present at that location. Therefore, when the joint point heat map is converted into joint point coordinates, it is calculated by the Soft-argmax function. The Soft-argmax function is calculated as follows:
Figure RE-GDA0003799472960000105
where β is a hyperparameter that prevents numerical overflow, h denotes the length of the heat map, w denotes the width of the heat map, x_m and y_m are respectively the horizontal and vertical coordinates of joint point m, H_m represents the heat map of joint point m, and the symbol in Figure RE-GDA0003799472960000106 represents the value of the heat map for joint point m at position <h, w>.
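The final conversion to "real joint point coordinates" is not spelled out in the text. In a top-down pipeline it usually means rescaling heat-map coordinates back into the detected person box in the original image. The sketch below assumes such a pipeline: the box format (x0, y0, width, height), the function name and the sizes are hypothetical, and half-pixel alignment corrections are ignored.

```python
def heatmap_to_image_coords(x_hm, y_hm, box, heatmap_size):
    """Map heat-map coordinates to image coordinates.

    box: hypothetical person box (x0, y0, box_w, box_h) in image pixels.
    heatmap_size: (hm_w, hm_h), width and height of the heat map.
    """
    x0, y0, box_w, box_h = box
    hm_w, hm_h = heatmap_size
    x_img = x0 + x_hm * box_w / hm_w   # scale horizontally, then shift
    y_img = y0 + y_hm * box_h / hm_h   # scale vertically, then shift
    return x_img, y_img


# A joint at the heat-map centre of a 192 x 256 person box at (100, 50).
x_img, y_img = heatmap_to_image_coords(
    24.0, 32.0, box=(100.0, 50.0, 192.0, 256.0), heatmap_size=(48, 64)
)
```

Applying this to every joint of every detected person produces the multi-person posture in image coordinates.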
Example 3
The embodiment also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the above method for estimating the posture of the human body of multiple persons based on knowledge distillation.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can be in man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or internal memory of the computer device. In other embodiments, the memory may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card provided on the computer device. The memory may also include both internal and external storage units of the computer device. In this embodiment, the memory is generally used for storing the operating system and the various application software installed on the computer device, such as the program code of the knowledge distillation-based multi-person human body posture estimation method. In addition, the memory may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or to process data, for example, to execute the program code of the knowledge distillation-based multi-person human body posture estimation method.
Example 4
The present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above knowledge distillation-based multi-person human body posture estimation method.
Wherein the computer readable storage medium stores an interface display program executable by at least one processor to cause the at least one processor to perform the steps of the knowledge-distillation based multi-person body pose estimation method.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be essentially or partially embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server or a network device, etc.) to execute the method according to the embodiments of the present application.
Experimental example
In this experimental example, the network model described above was built and the estimation method was applied to the public human body posture estimation dataset MSCOCO. The comparative examples use other human body detectors with the same multi-person human body posture estimation method.
Through measurement and analysis, the method of the present application is compared with existing human body posture estimation methods, namely Hourglass, SimpleBaseline, LPN, Lite-HRNet, ShuffleNetV2, DARK and MobilePoseNet. All methods were evaluated on the MSCOCO data. As Table 1 shows, the method of the present application achieves a higher AP50 than every other compared method, which demonstrates the superiority of the model. In addition, the method reaches an mAP of 68.7 with a computational cost of only 0.6 GFLOPs, which is higher than large models such as Hourglass and CPN. Compared with the lightweight human body posture estimation model LPN, the method reduces the computational cost by 40% at a similar accuracy. Compared with other lightweight networks such as MobileNetV2, ShuffleNetV2 and Lite-HRNet, the method of the present application leads by a larger margin in accuracy, parameters and other respects. See Table 1 for details:
Table 1: Estimation results on the MSCOCO data

| Method | Backbone | Input size | #Params | GFLOPs | AP | AP50 | AP75 | AP_M | AP_L | AR |
|---|---|---|---|---|---|---|---|---|---|---|
| 8-stage Hourglass | Hourglass | 256×192 | 25.6M | 26.2 | 66.9 | - | - | - | - | - |
| CPN | ResNet-50 | 256×192 | 27.0M | 6.2 | 68.4 | - | - | - | - | - |
| SimpleBaseline | ResNet-50 | 256×192 | 34.0M | 8.9 | 70.4 | 88.6 | 78.3 | 67.1 | 77.2 | 76.3 |
| HRNet-W32 | ResNet-50 | 256×192 | 28.5M | 12.4 | 73.4 | 89.5 | 80.7 | 70.2 | 80.1 | 79.8 |
| DARK | HRNetV1-W48 | 128×96 | 63.6M | 3.6 | 71.9 | 89.1 | 79.6 | 69.2 | 78 | 77.9 |
| MobileNetV2 | MobileNetV2 | 256×192 | 9.6M | 1.48 | 64.6 | 87.4 | 72.3 | 61.1 | 71.2 | 70.7 |
| MobileNetV2 1× | MobileNetV2 | 384×288 | 9.6M | 3.33 | 67.3 | 87.9 | 74.3 | 62.8 | 74.7 | 72.9 |
| ShuffleNetV2 | ShuffleNetV2 | 256×192 | 7.6M | 1.28 | 59.9 | 85.4 | 66.3 | 56.6 | 66.2 | 66.4 |
| ShuffleNetV2 1× | ShuffleNetV2 | 384×288 | 7.6M | 2.87 | 63.6 | 86.5 | 70.5 | 59.5 | 70.7 | 69.7 |
| Small HRNet | HRNet-W16 | 256×192 | 1.3M | 0.54 | 55.2 | 83.7 | 62.4 | 52.3 | 61 | 62.1 |
| Small HRNet | HRNet-W16 | 384×288 | 1.3M | 1.21 | 56 | 83.8 | 63 | 52.4 | 62.6 | 62.6 |
| Lite-HRNet | Lite-HRNet-18 | 256×192 | 1.1M | 0.20 | 64.8 | 86.7 | 73 | 62.1 | 70.5 | 71.2 |
| Lite-HRNet | Lite-HRNet-18 | 384×288 | 1.1M | 0.45 | 67.6 | 87.8 | 75 | 64.5 | 73.7 | 73.7 |
| LPN | ResNet-50 | 256×192 | 2.9M | 1.0 | 69.1 | 88.1 | 76.6 | 65.9 | 75.7 | 74.9 |
| MobilePoseNet | MobileNetV3 | 256×192 | 1.5M | 0.55 | 66.2 | 87.3 | 74.2 | 63.1 | 72.5 | 72.4 |
| MobilePoseNet | MobileNetV3 | 384×288 | 1.5M | 1.23 | 69 | 88.2 | 75.9 | 65.5 | 75.5 | 74.9 |
| GhostPoseNet | GhostNet | 256×192 | 3.1M | 0.26 | 63.8 | 88.3 | 71.6 | 61.6 | 67 | 67.3 |
| GhostPoseNet* | GhostNet | 384×288 | 3.1M | 0.60 | 67.9 | 89.4 | 74.4 | 65.3 | 72.1 | 71.6 |
| GhostPoseNet | GhostNet | 384×288 | 3.1M | 0.60 | 68.7 | 90.4 | 76.8 | 65.7 | 73.0 | 71.8 |
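The 40% reduction quoted against LPN can be checked from the GFLOPs column of Table 1. The following is a minimal sketch of that arithmetic; the helper name `relative_reduction` is illustrative and does not come from the application.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline


# GFLOPs values copied from Table 1 (LPN vs. GhostPoseNet at 384x288).
lpn_gflops = 1.0
ghostposenet_gflops = 0.60

reduction = relative_reduction(lpn_gflops, ghostposenet_gflops)
print(f"GFLOPs reduction vs. LPN: {reduction:.0%}")
```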
The method of the present application is also compared in speed with existing lightweight and large human body posture estimation methods, both in an environment with a GPU and in an environment without a GPU. In the environment without a GPU, the GhostPoseNet model runs faster than networks such as MobileNetV2 and Lite-HRNet. Considering that the model has a low computational cost and a simple structure, the GhostPoseNet network is better suited to deployment on edge devices. See Table 2 for details:
Table 2: Speed comparison on the MSCOCO data

| Method | Backbone | #Params | GFLOPs | Input size | AP | Speed* | Speed |
|---|---|---|---|---|---|---|---|
| HRNet | HRNetV1-W32 | 28.5M | 7.1 | 256×192 | 74.4 | 7.5 | 19.2 |
| HRNet | HRNetV1-W32 | 28.5M | 16 | 384×288 | 75.8 | 4 | 18.8 |
| NLite-HRNet-18 | HRNet-W16 | 0.7M | 0.19 | 256×192 | 62.8 | 11 | 18.9 |
| WNLite-HRNet-18 | HRNet-W16 | 1.3M | 0.3 | 256×192 | 66 | 12 | 18.6 |
| ShuffleNetV2 | ShuffleNetV2 | 7.6M | 1.28 | 256×192 | 59.9 | 17 | 71.3 |
| ShuffleNetV2 | ShuffleNetV2 | 7.6M | 2.87 | 384×288 | 63.6 | 10 | 64.1 |
| MobileNetV2 | MobileNetV2 | 9.6M | 1.48 | 256×192 | 64.6 | 6.8 | 83.1 |
| MobileNetV2 | MobileNetV2 | 9.6M | 3.33 | 384×288 | 67.3 | 4.5 | 73.1 |
| Lite-HRNet | Lite-HRNet-18 | 1.1M | 0.2 | 256×192 | 64.8 | 12 | 17.4 |
| Lite-HRNet | Lite-HRNet-18 | 1.1M | 0.45 | 384×288 | 67.6 | 7.1 | 16.3 |
| MobilePoseNet | MobileNetV3 | 1.5M | 0.55 | 256×192 | 66.2 | 7.8 | 54.8 |
| MobilePoseNet | MobileNetV3 | 1.5M | 1.23 | 384×288 | 69.0 | 5.1 | 50.8 |
| GhostPoseNet | GhostNet | 3.1M | 0.26 | 256×192 | 63.8 | 9.2 | 62.0 |
| GhostPoseNet | GhostNet | 3.1M | 0.60 | 384×288 | 68.7 | 6.4 | 59.4 |

Claims (10)

1. A knowledge distillation-based multi-person human body posture estimation method, characterized by comprising the following steps:
acquiring human body posture sample data and human body posture sample data labels;
constructing a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder is a GhostNet convolutional layer, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a convolution with a 1×1 kernel;
inputting the human body posture sample data into the pre-trained teacher network to obtain a first output joint point heat map, inputting the human body posture sample data into the student network to obtain a second output joint point heat map, generating a joint point offset from the second output joint point heat map and the target joint point heat map of the human body posture sample data labels, and guiding the learning of the student network by combining the joint point offset and the first output joint point heat map;
and inputting acquired real-time human body posture data into the student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.
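The final step of claim 1 converts each joint point heat map into coordinates, but the claim does not state the decoding rule. A common choice, assumed here purely for illustration, is to take the position of the heat map maximum and rescale it to the input image size. The function name `heatmap_to_coords` and its parameters are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # h rows, each holding w values


def heatmap_to_coords(heatmap: Heatmap, image_w: int, image_h: int) -> Tuple[float, float, float]:
    """Return (x, y, score) in image pixels for one joint point heat map.

    Assumption: the joint lies at the heat map maximum (argmax decoding);
    the application itself does not specify how the conversion is done.
    """
    h, w = len(heatmap), len(heatmap[0])
    best_y, best_x, best = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best:
                best_y, best_x, best = y, x, value
    # Rescale from the heat map grid to image pixels.
    return best_x * image_w / w, best_y * image_h / h, best


# Toy 3 x 4 heat map whose peak sits at column 1, row 1.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.0],
]
x, y, score = heatmap_to_coords(heatmap, image_w=16, image_h=12)
print(x, y, score)
```

Argmax decoding is limited to the resolution of the heat map grid; sub-pixel refinements exist but are outside what the claim describes.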
2. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the target joint point heat map $H_m \in \mathbb{R}^{h \times w}$ is computed from the human body posture sample data label by a two-dimensional Gaussian function $G_m(x, y)$, where the two-dimensional Gaussian function $G_m(x, y)$ is:
Figure RE-FDA0003799472950000011
$H_m = G_m(x, y)$
where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and height of the generated heat map, and $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively.
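The Gaussian in claim 2 is given only as a figure. As an illustration, the sketch below assumes the standard unnormalized form $G_m(x, y) = \exp\left(-\frac{(x - x_m)^2 + (y - y_m)^2}{2\sigma^2}\right)$, which may differ from the filed formula, for example by a normalizing constant. The function name `gaussian_heatmap` is hypothetical.

```python
import math
from typing import List


def gaussian_heatmap(x_m: float, y_m: float, w: int, h: int, sigma: float) -> List[List[float]]:
    """Build an h x w target heat map H_m centred on the joint (x_m, y_m).

    Assumed form: exp(-((x - x_m)^2 + (y - y_m)^2) / (2 * sigma^2)).
    """
    return [
        [math.exp(-((x - x_m) ** 2 + (y - y_m) ** 2) / (2.0 * sigma ** 2)) for x in range(w)]
        for y in range(h)
    ]


# Joint at column 3, row 2 of an 8 x 6 (width x height) heat map.
H = gaussian_heatmap(x_m=3, y_m=2, w=8, h=6, sigma=1.0)
print(H[2][3])            # value at the joint location
print(round(H[2][4], 4))  # value one pixel to the right
```

With this form the heat map peaks at the labelled joint and decays with distance at a rate set by $\sigma$.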
3. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the formula for generating the joint point offset is:
Figure RE-FDA0003799472950000012
wherein $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joint points, $p_i$ is the offset of the corresponding joint point $i$,
Figure RE-FDA0003799472950000021
represents the heat map generated from the label of the i-th joint point,
Figure RE-FDA0003799472950000022
represents the heat map of the i-th joint point predicted by the student network; $j$ represents a hyper-parameter coefficient, $h_{tar}$ represents the heat map generated from the label, and $h_{stu}$ represents the heat map output by the student network;
Figure RE-FDA0003799472950000023
is:
Figure RE-FDA0003799472950000024
wherein $\gamma$ represents a hyper-parameter coefficient.
4. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the teacher network adopts an MSE loss function, specifically:
Figure RE-FDA0003799472950000025
wherein,
Figure RE-FDA0003799472950000026
represents the heat map output by the teacher network for the i-th joint point,
Figure RE-FDA0003799472950000027
represents the heat map of the i-th joint point predicted by the student network, $h_{tea}$ represents the heat map output by the teacher network, $h_{stu}$ represents the heat map output by the student network, and $n$ represents the number of joint points;
the overall loss function for the teacher network and the student network is:
$L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda D(h_{tar}, h_{stu})$
wherein $\lambda$ is a hyper-parameter coefficient balancing the two loss terms, and $h_{tar}$ represents the heat map generated from the label.
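The structure of the overall loss in claim 4 can be sketched as follows. Two points are assumptions rather than the filed formulas, because both are given only as figures: the MSE here averages over all joints and pixels, and the term $D$ is passed in as a callable so that any definition can be plugged in. In the toy run, plain MSE stands in for $D$ only so that the example executes. The names `mse` and `total_loss` are hypothetical.

```python
from typing import Callable, List

Heatmaps = List[List[List[float]]]  # n joints, each an h x w heat map


def mse(a: Heatmaps, b: Heatmaps) -> float:
    """Mean squared error over all joints and pixels (assumed normalization)."""
    total, count = 0.0, 0
    for hm_a, hm_b in zip(a, b):
        for row_a, row_b in zip(hm_a, hm_b):
            for va, vb in zip(row_a, row_b):
                total += (va - vb) ** 2
                count += 1
    return total / count


def total_loss(h_tea: Heatmaps, h_stu: Heatmaps, h_tar: Heatmaps,
               lam: float, d: Callable[[Heatmaps, Heatmaps], float]) -> float:
    """L_total = MSE(h_tea, h_stu) + lam * D(h_tar, h_stu)."""
    return mse(h_tea, h_stu) + lam * d(h_tar, h_stu)


# Toy data: one joint, 1 x 2 heat maps.
h_tea = [[[0.8, 0.2]]]
h_stu = [[[0.6, 0.2]]]
h_tar = [[[1.0, 0.0]]]

# Stand-in for D: plain MSE, NOT the offset-based term defined in the figures.
loss = total_loss(h_tea, h_stu, h_tar, lam=0.5, d=mse)
print(loss)
```

Keeping $D$ as a parameter separates the distillation term (teacher versus student) from the label-driven term, which mirrors the two-part form of $L_{total}$.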
5. A knowledge distillation-based multi-person human body posture estimation system, comprising:
the data acquisition module is used for acquiring human body posture sample data and human body posture sample data labels;
the network model building module is used for building a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder is a GhostNet convolutional layer, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a convolution with a 1×1 kernel;
the network model training module is used for inputting the human body posture sample data into the pre-trained teacher network to obtain a first output joint point heat map, inputting the human body posture sample data into the student network to obtain a second output joint point heat map, generating a joint point offset from the second output joint point heat map and the target joint point heat map of the human body posture sample data labels, and guiding the learning of the student network by combining the joint point offset and the first output joint point heat map;
and the heat map real-time module is used for inputting acquired real-time human body posture data into the student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.
6. The system of claim 5, wherein: the target joint point heat map $H_m \in \mathbb{R}^{h \times w}$ is computed from the human body posture sample data label by a two-dimensional Gaussian function $G_m(x, y)$, where the two-dimensional Gaussian function $G_m(x, y)$ is:
Figure RE-FDA0003799472950000031
$H_m = G_m(x, y)$
where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and height of the generated heat map, and $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively.
7. The system of claim 5, wherein: the formula for generating the joint point offset is:
Figure RE-FDA0003799472950000032
wherein $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joint points, $p_i$ is the offset of the corresponding joint point $i$,
Figure RE-FDA0003799472950000033
represents the heat map generated from the label of the i-th joint point,
Figure RE-FDA0003799472950000034
represents the heat map of the i-th joint point predicted by the student network; $j$ represents a hyper-parameter coefficient, $h_{tar}$ represents the heat map generated from the label, and $h_{stu}$ represents the heat map output by the student network;
Figure RE-FDA0003799472950000035
is:
Figure RE-FDA0003799472950000041
where $\gamma$ represents a hyper-parameter coefficient.
8. The system of claim 1, wherein: the teacher network adopts an MSE loss function, specifically:
Figure RE-FDA0003799472950000042
wherein,
Figure RE-FDA0003799472950000043
represents the heat map output by the teacher network for the i-th joint point,
Figure RE-FDA0003799472950000044
represents the heat map of the i-th joint point predicted by the student network, $h_{tea}$ represents the heat map output by the teacher network, $h_{stu}$ represents the heat map output by the student network, and $n$ represents the number of joint points;
the overall loss function for the teacher network and the student network is:
$L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda D(h_{tar}, h_{stu})$
where $\lambda$ is a hyper-parameter coefficient balancing the two loss terms, and $h_{tar}$ represents the joint point heat map generated from the label.
9. A computer device, characterized by: comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that: stored with a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 4.
CN202210714617.4A 2022-06-22 2022-06-22 Knowledge distillation-based multi-person human body posture estimation method and system Pending CN115187660A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210714617.4A CN115187660A (en) 2022-06-22 2022-06-22 Knowledge distillation-based multi-person human body posture estimation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210714617.4A CN115187660A (en) 2022-06-22 2022-06-22 Knowledge distillation-based multi-person human body posture estimation method and system

Publications (1)

Publication Number Publication Date
CN115187660A true CN115187660A (en) 2022-10-14

Family

ID=83514461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210714617.4A Pending CN115187660A (en) 2022-06-22 2022-06-22 Knowledge distillation-based multi-person human body posture estimation method and system

Country Status (1)

Country Link
CN (1) CN115187660A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091849A (en) * 2023-04-11 2023-05-09 山东建筑大学 Tire pattern classification method, system, medium and equipment based on grouping decoder
CN116091849B (en) * 2023-04-11 2023-07-25 山东建筑大学 Tire pattern classification method, system, medium and equipment based on grouping decoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination