CN115187660A - Knowledge distillation-based multi-person human body posture estimation method and system - Google Patents
- Publication number
- CN115187660A (application CN202210714617.4A)
- Authority
- CN
- China
- Prior art keywords
- joint point
- heat map
- network
- representing
- human body
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Abstract
The invention discloses a knowledge distillation-based multi-person human body posture estimation method and system, belonging to human body posture estimation in the technical field of computer vision. It aims to solve two problems of existing knowledge distillation techniques: a student network built from stacked Hourglass networks is difficult to make very small, and reducing the complexity of the network model degrades model performance. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1 × 1 kernel. Data are input into a teacher network and a student network to obtain the corresponding heat maps, and joint point biases generated from the student heat maps and the target joint point heat maps of the data labels dynamically adjust the knowledge transferred from the teacher network to the student network. Acquired real-time human body posture data are input into the student network, which outputs real-time joint point heat maps that are converted into real joint point coordinates.
Description
Technical Field
The invention belongs to the technical field of computer vision and relates to human body posture estimation, in particular to a human body posture estimation method based on knowledge distillation.
Background
Human Pose Estimation (HPE) aims to obtain the coordinates of human body joints from an input image and is one of the active problems in computer vision. In recent years, human posture estimation technology has developed rapidly and is widely applied in human body tracking, action recognition, motion detection, human-computer interaction and other fields.
The success of AlexNet in image classification and recognition pushed computer vision research into the deep learning era. In 2014, A. Toshev et al. proposed DeepPose, a deep learning-based human posture estimation model that uses a deep convolutional neural network to predict the human pose globally, marking the entry of human posture estimation into the deep learning era. Thanks to the strong feature extraction capability of deep convolutional networks, human posture estimation has advanced considerably. Compared with methods based on hand-crafted features, deep learning-based human posture estimation algorithms are more robust. Since then, the release of the MPII and MSCOCO datasets has spawned a great deal of excellent work on deep learning-based human pose estimation.
With the progress of science and technology, more and more smart devices appear in daily life, such as self-driving cars, smart surveillance cameras, smart exercise assistants and elderly-care robots. All of them need functions such as action recognition, motion detection and human-computer interaction, and these functions depend on a human posture estimation algorithm. Deploying large human posture estimation algorithms on these devices, whose computing resources are limited, is very difficult. Although a large algorithm can be deployed in the cloud so that smart devices obtain these functions through cloud computing, this easily leads to information leakage and cannot guarantee the privacy of users. Researchers have therefore begun to study lighter and more efficient human posture estimation algorithms that do not require large GPUs or similar hardware and can run in real time on mobile phones, wearable devices, surveillance cameras and embedded devices.
2D human pose estimation can be divided into two categories: Single-Person Pose Estimation (SPPE) and Multi-Person Pose Estimation (MPPE). For multi-person pose estimation, the mainstream methods are generally classified as top-down or bottom-up according to the starting point (abstraction level) of prediction. The top-down approach starts from the high-level abstraction: it first detects people and generates their positions as bounding boxes, and then estimates the pose of each person. The top-down method is more intuitive than the bottom-up method and also more accurate. To detect people, most current mainstream multi-person posture estimation algorithms use common object detection algorithms as human body detectors, such as Faster R-CNN, Mask R-CNN and YOLO, and then generate multi-person poses using single-person posture estimation. To achieve lightweight multi-person posture estimation, more and more researchers are studying lightweight human posture estimation models. Feng et al. use knowledge distillation to transfer knowledge from a complex teacher hourglass network into a simple student hourglass network. Much other work on lightweight human posture estimation focuses on improving the network structure. Tang et al. achieve high-precision keypoint localization with densely connected U-Nets, a network similar to the hourglass network. Debnath et al., inspired by the hourglass network, introduced a novel split-stream architecture in the last two layers of MobileNet, which reduced the model parameters, alleviated overfitting and improved accuracy. Zhang et al. introduced global attention and proposed a lightweight bottleneck block to replace the bottleneck block in ResNet, constructing LPN, which is similar in structure to SimpleBaseline.
Yu et al. propose replacing the pointwise convolution of the channel shuffle module in ShuffleNet with a channel weighting operation and construct Lite-HRNet, a network with high-resolution feature representations. Ding et al. and Zhang et al. used neural architecture search (NAS) to construct HR-NAS, a multi-scale feature fusion network similar to HRNet, and EfficientPose, which is similar to SimpleBaseline, respectively.
Human posture estimation knowledge distillation methods such as FPD and OKDHM are limited by factors such as the complexity gap between teacher and student networks in knowledge distillation. They mainly build both the student network and the teacher network from stacked Hourglass networks, using a network with many stacked layers and a large parameter count as the teacher and a network with few stacked layers and a small parameter count as the student, so the student network is difficult to make very small.
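The top-down procedure described in the background can be sketched as follows. This is a minimal illustration only: `detect_people` and `estimate_single` are hypothetical callables standing in for a human body detector and a single-person pose estimator, and the image is represented as a plain nested list so that the sketch has no dependencies.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in image pixels
Point = Tuple[float, float]       # (x, y) joint coordinate


def top_down_pose(image,
                  detect_people: Callable[[object], Sequence[Box]],
                  estimate_single: Callable[[object], List[Point]]) -> List[List[Point]]:
    """Detect each person, estimate the pose inside the crop, map joints back to the image."""
    poses = []
    for (x0, y0, x1, y1) in detect_people(image):
        # Crop the person region (rows first, then columns).
        crop = [row[x0:x1] for row in image[y0:y1]]
        joints = estimate_single(crop)
        # Joint coordinates are relative to the crop; shift them into image coordinates.
        poses.append([(x + x0, y + y0) for (x, y) in joints])
    return poses


# Hypothetical stand-ins: one fixed bounding box, and a "pose" made of the crop centre.
image = [[0] * 10 for _ in range(10)]
fake_detector = lambda img: [(2, 3, 6, 8)]
fake_estimator = lambda crop: [(len(crop[0]) / 2, len(crop) / 2)]

poses = top_down_pose(image, fake_detector, fake_estimator)
```

In a real system the two callables would be an object detector and the student network plus heat map decoding described later in this document.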
Disclosure of Invention
Purpose of the invention: to solve the problems in the prior art that a student network built by stacking Hourglass networks is difficult to make very small and that reducing the complexity of the network model degrades model performance, the invention provides a knowledge distillation-based multi-person human body posture estimation method and system.
To achieve this purpose, the invention adopts the following technical scheme:
A knowledge distillation-based multi-person human body posture estimation method comprises the following steps:
acquiring human body posture sample data and human body posture sample data labels;
constructing a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1 × 1 kernel;
inputting the human body posture sample data into the pre-trained teacher network to obtain a first output joint point heat map, inputting the human body posture sample data into the student network to obtain a second output joint point heat map, generating joint point biases from the second output joint point heat map and the target joint point heat map of the human body posture sample data label, and guiding the learning of the student network by combining the joint point biases and the first output joint point heat map;
and inputting acquired real-time human body posture data into the student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.
Preferably, the human body posture sample data label is passed through a two-dimensional Gaussian function $G_m(x, y)$ to compute the target joint point heat map $H_m \in \mathbb{R}^{h \times w}$, wherein:
$H_m = G_m(x, y)$
where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and length of the generated heat map, and $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively.
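The explicit expression of the Gaussian is not reproduced in this text. A minimal sketch of generating a target heat map, assuming the common unnormalized form $G_m(x, y) = \exp\left(-\frac{(x - x_m)^2 + (y - y_m)^2}{2\sigma^2}\right)$ (peak value 1 at the joint location), is given below. The function name `gaussian_heatmap`, the value of `sigma` and the heat map size are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_heatmap(x_m, y_m, h, w, sigma=2.0):
    """Return an (h, w) heat map peaked at the joint location (x_m, y_m).

    Assumes an unnormalized 2D Gaussian; the patent's exact formula may differ.
    """
    # ys[r, c] = r (vertical coordinate), xs[r, c] = c (horizontal coordinate)
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - x_m) ** 2 + (ys - y_m) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


# One joint at x = 20, y = 12 on a 64 x 48 heat map (sizes are illustrative).
heatmap = gaussian_heatmap(x_m=20, y_m=12, h=64, w=48, sigma=2.0)
peak_row, peak_col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

The maximum of the heat map falls on the labelled joint, and the values decay smoothly around it, which gives the network a softer target than a single hot pixel.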
Preferably, the formula for generating the joint point bias is:
wherein $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joint points, $p_i$ is the bias of the corresponding joint point $i$, $h^{tar}_i$ represents the heat map generated from the label of the $i$-th joint point, and $h^{stu}_i$ represents the heat map of the $i$-th joint point predicted by the student network; $j$ represents a hyperparameter coefficient, $h_{tar}$ represents the joint point heat map generated from the label, and $h_{stu}$ represents the joint point heat map generated by the student network;
wherein $\gamma$ represents a hyperparameter.
Preferably, the teacher network uses an MSE loss function, specifically:
wherein $h^{tea}_i$ represents the heat map of the teacher network at the $i$-th joint point, $h^{stu}_i$ represents the heat map of the $i$-th joint point predicted by the student network, $h_{tea}$ represents the heat map output by the teacher network, $h_{stu}$ represents the heat map output by the student network, and $n$ represents the number of joint points;
the overall loss function for the teacher network and the student network is:
$L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda D(h_{tar}, h_{stu})$
wherein $\lambda$ is a hyperparameter coefficient balancing the two loss terms, and $h_{tar}$ represents the heat map generated from the label.
A knowledge-distillation-based multi-person body pose estimation system, comprising:
a data acquisition module, used for acquiring human body posture sample data and human body posture sample data labels;
a network model building module, used for building a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1 × 1 kernel;
a network model training module, used for inputting the human body posture sample data into the pre-trained teacher network to obtain a first output joint point heat map, inputting the human body posture sample data into the student network to obtain a second output joint point heat map, generating joint point biases from the second output joint point heat map and the target joint point heat map of the human body posture sample data label, and guiding the learning of the student network by combining the joint point biases and the first output joint point heat map;
and a heat map real-time module, used for inputting acquired real-time human body posture data into the student network, outputting a real-time joint point heat map and converting the joint point heat map into real joint point coordinates.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the above method.
A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the above-mentioned method.
The invention has the following beneficial effects:
1. In the invention, a lightweight human posture estimation network based on GhostNet is used to generate the joint point heat maps, and an online joint point knowledge distillation method uses the heat maps generated from the labels and the output of the teacher network as supervision information to guide the learning of the GhostNet student network. The student network therefore does not need to be built by stacking Hourglass networks, so the GhostNet student network can be made very small without the model performance dropping as the complexity of the network model is reduced. The student network can thus be deployed on smart devices such as self-driving cars, smart surveillance cameras, smart exercise assistants and elderly-care robots, which greatly broadens its range and fields of application.
2. In the invention, to improve the performance of the model, a knowledge distillation-based online joint point optimization strategy is provided: joint point biases are generated from the output of the GhostNet network and the heat maps generated from the labels, and a pre-trained large human posture estimation network, HRNet, serves as the teacher network, softening the label information and guiding the learning of the GhostNet student network, so that training of the model is more efficient.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic structural view of the present invention;
FIG. 3 is a schematic diagram of the structure of a student network model of the present invention;
FIG. 4 is a schematic structural diagram of the three-layer LPB of the present invention.
Detailed Description
Example 1
The embodiment provides a knowledge distillation-based multi-person human body posture estimation method, which is used for estimating the postures of a plurality of persons. It comprises the following steps:
Step S1, obtaining sample data
Human body posture sample data and human body posture sample data labels are acquired. To verify the superiority of the proposed human body posture estimation, the public human pose estimation dataset MSCOCO is selected for the experiments, and the evaluation metrics are mAP, $AP^{0.5}$, $AP^{0.75}$, $AP^{m}$, $AP^{l}$ and AR.
S2, building a network model
The network model is a lightweight human posture network model comprising a pre-trained teacher network model, an untrained student network model and a matching loss function. The teacher network model is HRNet-W32. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1 × 1 kernel. Data pass through the encoder, the decoder and the joint point regressor in sequence. The decoder preferably stacks three LPB layers, each consisting of three parts: a depthwise separable transposed convolution, a pointwise convolution and a channel attention module.
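The joint point regressor is the simplest part of the student network: a 1 × 1 convolution is a per-pixel linear map across channels, turning the decoder's feature channels into one heat map per joint. The NumPy sketch below illustrates only that operation. The feature shape (24 channels, 64 × 48) and the random weights are assumptions for illustration; 17 is the number of keypoints annotated in MSCOCO.

```python
import numpy as np


def pointwise_conv(features, weight, bias):
    """Apply a 1x1 convolution.

    features: (C, H, W) decoder output, weight: (K, C), bias: (K,)
    returns:  (K, H, W), one heat map per joint.
    """
    # Each output pixel is a weighted sum over the C input channels at the same position.
    return np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]


rng = np.random.default_rng(0)
features = rng.standard_normal((24, 64, 48))   # hypothetical decoder output
weight = rng.standard_normal((17, 24)) * 0.1   # 17 joints, as in MSCOCO
bias = np.zeros(17)

heatmaps = pointwise_conv(features, weight, bias)
```

Because the kernel has no spatial extent, the regressor adds only `K * C + K` parameters, which suits a student network that is meant to stay small.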
S3, training the network model
Human body posture sample data are input into the pre-trained teacher network to obtain a first output joint point heat map and into the student network to obtain a second output joint point heat map, and the human body posture sample data label is converted by Gaussian calculation into a target joint point heat map. Joint point biases are generated from the second output joint point heat map output by the student network and the target joint point heat map obtained by the Gaussian calculation. The heat map output by the teacher network serves as part of the supervision information: the joint point biases and the first output joint point heat map together guide the learning of the student network until training of the network model is complete.
When the target joint point heat map is obtained through Gaussian calculation, the human body posture sample data label is passed through a two-dimensional Gaussian function $G_m(x, y)$ to compute the target joint point heat map $H_m \in \mathbb{R}^{h \times w}$, wherein:
$H_m = G_m(x, y)$
where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and length of the generated heat map, and $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively.
When joint point biases are generated from the second output joint point heat map and the target joint point heat map, the formula for generating the joint point bias is as follows:
wherein $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joint points, $p_i$ is the bias of the corresponding joint point $i$, $h^{tar}_i$ represents the heat map generated from the label of the $i$-th joint point, and $h^{stu}_i$ represents the heat map of the $i$-th joint point predicted by the student network; $j$ represents a hyperparameter, $h_{tar}$ represents the heat map generated from the label, and $h_{stu}$ represents the heat map output by the student network;
wherein $\gamma$ represents a hyperparameter.
During training, the teacher network uses an MSE loss function, specifically:
wherein $h^{tea}_i$ represents the heat map of the teacher network at the $i$-th joint point, $h^{stu}_i$ represents the heat map of the $i$-th joint point predicted by the student network, $h_{tea}$ represents the heat map output by the teacher network, $h_{stu}$ represents the heat map output by the student network, and $n$ represents the number of joint points;
during training, the overall loss function of the teacher network and the student network is as follows:
$L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda D(h_{tar}, h_{stu})$
wherein $\lambda$ is a hyperparameter coefficient balancing the two loss terms, and $h_{tar}$ represents the heat map generated from the label.
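How the two terms combine can be sketched in a few lines. The sketch assumes that MSE averages the squared difference over all heat map elements, and it leaves the label term $D$ as a pluggable function `d_fn` because its explicit formula is not reproduced in this text; plain MSE is used as a stand-in, not as the patent's definition of $D$. The names `mse`, `total_loss` and `d_fn` are hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error over all elements (normalization is an assumption)."""
    return float(np.mean((a - b) ** 2))


def total_loss(h_tea, h_stu, h_tar, lam, d_fn=mse):
    """L_total = MSE(h_tea, h_stu) + lam * D(h_tar, h_stu).

    d_fn stands in for D; the default (plain MSE) is only a placeholder.
    """
    return mse(h_tea, h_stu) + lam * d_fn(h_tar, h_stu)


# Toy heat maps with 2 joints on a 4 x 4 grid.
h_stu = np.zeros((2, 4, 4))
h_tea = np.ones((2, 4, 4))        # teacher term: mean of 1^2 = 1.0
h_tar = np.full((2, 4, 4), 2.0)   # label term:   mean of 2^2 = 4.0

loss = total_loss(h_tea, h_stu, h_tar, lam=0.5)   # 1.0 + 0.5 * 4.0
```

Only the student's parameters would receive gradients from this loss, since the teacher network is pre-trained and is not needed at test time.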
After the network model is trained, only the student network needs to work when testing.
S4, acquiring the coordinates of the joint points in real time
Inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.
In various heat maps, the larger the number in the heat map, the greater the probability that the joint is present at that location. Therefore, when the joint point heatmap is converted into joint point coordinates, it is calculated by the Soft-argmax function. The Soft-argmax function is calculated as follows:
where $\beta$ is a hyperparameter that prevents numerical overflow, $h$ denotes the length of the heat map, $w$ denotes the width of the heat map, $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively, and $H_m$ represents the heat map of joint point $m$, whose value at coordinates $\langle h, w \rangle$ is the heat value (understood as the probability that the joint point is present at that location).
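The Soft-argmax expression itself is not reproduced in this text. A minimal sketch of the commonly used form is shown below: the heat map is turned into a probability distribution with a softmax and the coordinates are the expectation of the pixel positions under that distribution. Treating $\beta$ as a scaling factor applied before the exponential, and subtracting the maximum for numerical stability, are assumptions of this sketch rather than details confirmed by the source.

```python
import numpy as np


def soft_argmax(heatmap, beta=100.0):
    """Return (x_m, y_m): the softmax-weighted mean position of an (h, w) heat map."""
    h, w = heatmap.shape
    logits = beta * heatmap
    logits = logits - logits.max()      # keep the exponentials from overflowing
    probs = np.exp(logits)
    probs /= probs.sum()                # probabilities now sum to 1
    ys, xs = np.mgrid[0:h, 0:w]         # ys = row index, xs = column index
    x_m = float((probs * xs).sum())
    y_m = float((probs * ys).sum())
    return x_m, y_m


# Illustrative heat map: a Gaussian bump centred at x = 20, y = 12.
ys, xs = np.mgrid[0:64, 0:48]
heatmap = np.exp(-((xs - 20) ** 2 + (ys - 12) ** 2) / 8.0)

x_m, y_m = soft_argmax(heatmap)
```

Unlike a hard argmax, this estimate is differentiable and can fall between pixels, which is useful when the heat map resolution is lower than the input image resolution.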
Example 2
The embodiment also provides a knowledge distillation-based multi-person human body posture estimation system, which comprises:
The data acquisition module is used for acquiring human body posture sample data and human body posture sample data labels.
To verify the superiority of the proposed human body posture estimation, the human body posture sample data and labels are taken from the public human pose estimation dataset MSCOCO, and the evaluation metrics are mAP, $AP^{0.5}$, $AP^{0.75}$, $AP^{m}$, $AP^{l}$ and AR.
The network building module is used for building a pre-trained teacher network model and an untrained student network model. The teacher network model is HRNet-W32. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolutional layers, the decoder is a stack of lightweight upsampling modules, and the joint point regressor is a convolution with a 1 × 1 kernel. Data pass through the encoder, the decoder and the joint point regressor in sequence. The decoder preferably stacks three LPB layers, each consisting of three parts: a depthwise separable transposed convolution, a pointwise convolution and a channel attention module.
The network training module is used for inputting human body posture sample data into the pre-trained teacher network to obtain a first output joint point heat map and into the student network to obtain a second output joint point heat map, and for converting the human body posture sample data label by Gaussian calculation into a target joint point heat map. Joint point biases are generated from the second output joint point heat map output by the student network and the target joint point heat map obtained by the Gaussian calculation. The heat map output by the teacher network serves as part of the supervision information: the joint point biases and the first output joint point heat map together guide the learning of the student network until training of the network model is complete.
When the target joint point heat map is obtained through Gaussian calculation, the human body posture sample data label is passed through a two-dimensional Gaussian function $G_m(x, y)$ to compute the target joint point heat map $H_m \in \mathbb{R}^{h \times w}$, wherein:
$H_m = G_m(x, y)$
where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and length of the generated heat map, and $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively.
When joint point biases are generated from the second output joint point heat map and the target joint point heat map, the formula for generating the joint point bias is:
wherein $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joint points, $p_i$ is the bias of the corresponding joint point $i$, $h^{tar}_i$ represents the heat map generated from the label of the $i$-th joint point, and $h^{stu}_i$ represents the heat map of the $i$-th joint point predicted by the student network; $j$ is a hyperparameter coefficient (indicating that the first $j$ of the $p_i$ are selected), $h_{tar}$ represents the heat map generated from the label, and $h_{stu}$ represents the heat map output by the student network;
wherein $\gamma$ represents a joint point penalty hyperparameter.
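The explicit joint point bias formula is not reproduced in this text, so the following is only one plausible reading of the symbol definitions, not the patent's formula. It assumes that $p_i$ is the mean squared difference between the label heat map and the student heat map of joint $i$, that "the first $j$" means the $j$ joints with the largest bias, and that $\gamma$ scales their averaged penalty. The function names are hypothetical.

```python
import numpy as np


def joint_bias(h_tar, h_stu):
    """Per-joint bias p_i, assumed here to be a mean squared heat map difference.

    h_tar, h_stu: arrays of shape (k, h, w); returns an array of shape (k,).
    """
    return np.mean((h_tar - h_stu) ** 2, axis=(1, 2))


def hard_joint_penalty(h_tar, h_stu, j, gamma):
    """Average the j largest biases and scale by gamma (an assumed reading of j and gamma)."""
    p = joint_bias(h_tar, h_stu)
    top_j = np.sort(p)[::-1][:j]    # biases in descending order, keep the first j
    return float(gamma * top_j.mean())


# Three toy joints on a 2 x 2 grid with biases 1.0, 4.0 and 0.0.
h_tar = np.zeros((3, 2, 2))
h_stu = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.zeros((2, 2))])

p = joint_bias(h_tar, h_stu)
penalty = hard_joint_penalty(h_tar, h_stu, j=2, gamma=0.5)   # 0.5 * (4.0 + 1.0) / 2
```

Under this reading, the joints the student currently predicts worst receive the extra penalty, so the emphasis shifts between joints as training progresses.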
During training, the teacher network uses an MSE loss function, specifically:
wherein $h^{tea}_i$ represents the heat map of the teacher network at the $i$-th joint point, $h^{stu}_i$ represents the heat map of the $i$-th joint point predicted by the student network, $h_{tea}$ represents the teacher network heat map, $h_{stu}$ represents the student network heat map, and $n$ represents the number of joint points;
during training, the overall loss function of the teacher network and the student network is as follows:
$L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda D(h_{tar}, h_{stu})$
where $\lambda$ is the hyperparameter coefficient balancing the two loss terms, and $h_{tar}$ represents the heat map generated from the label.
After the network model is trained, only the student network needs to work during testing.
And the heat map real-time module is used for inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map and converting the joint point heat map into a real joint point coordinate.
In various heat maps, the larger the number in the heat map, the greater the probability that the joint is present at that location. Therefore, when the joint point heat map is converted into joint point coordinates, it is calculated by the Soft-argmax function. The Soft-argmax function is calculated as follows:
where $\beta$ is a hyperparameter that prevents numerical overflow, $h$ denotes the length of the heat map, $w$ denotes the width of the heat map, $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively, and $H_m$ represents the heat map of joint point $m$, whose value at coordinates $\langle h, w \rangle$ is the corresponding heat map value.
Example 3
The embodiment also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the above method for estimating the posture of the human body of multiple persons based on knowledge distillation.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device or the like.
The memory includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as its hard disk or internal memory. In other embodiments, the memory may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card provided on the computer device. The memory may also include both internal and external storage units of the computer device. In this embodiment, the memory is generally used for storing the operating system and the application software installed on the computer device, such as the program code of the knowledge distillation-based multi-person human body posture estimation method. In addition, the memory may be used to temporarily store data that have been output or are to be output.
In some embodiments, the processor may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run the program code stored in the memory or to process data, for example to run the program code of the knowledge distillation-based multi-person human body posture estimation method.
Example 4
The present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above knowledge distillation-based multi-person human body posture estimation method.
The computer-readable storage medium stores an interface display program executable by at least one processor, so that the at least one processor performs the steps of the knowledge distillation-based multi-person human body posture estimation method.
From the description of the foregoing embodiments, it is clear to those skilled in the art that the methods of the foregoing embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although the former is the preferred implementation in many cases. Based on this understanding, the technical solutions of the present application may be embodied, in whole or in part, in the form of a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of the present application.
Experimental example
In this experimental example, the network model described above is built and the estimation method is applied to the public human body posture estimation dataset MSCOCO. The comparative examples apply the same multi-person human body posture estimation procedure with other human body detectors.
Through measurement and analysis, the method of the present application is compared with the existing human body posture estimation methods Hourglass, SimpleBaseline, LPN, Lite-HRNet, ShuffleNetV2, DARK and MobilePoseNet. All methods in the comparative experiments were evaluated on the MSCOCO data. As Table 1 shows, the method of the present application achieves a higher AP50 than every other method compared, which demonstrates the advantage of the model. In addition, the method reaches an mAP of 68.7 with a computation of only 0.6 GFLOPs, which is higher than large models such as Hourglass and CPN. Compared with the lightweight human body posture estimation model LPN, the method reduces computation by 40% at similar precision. Compared with other lightweight networks such as MobileNetV2, ShuffleNetV2 and Lite-HRNet, the method of the present application leads in precision, parameter count and other respects. See Table 1 for details:
Table 1: Estimation results on MSCOCO data
Method | Backbone | Input | #Params | GFLOPs | AP | AP50 | AP75 | APM | APL | AR |
---|---|---|---|---|---|---|---|---|---|---|
8-stage Hourglass | Hourglass | 256×192 | 25.6M | 26.2 | 66.9 | - | - | - | - | - |
CPN | ResNet-50 | 256×192 | 27.0M | 6.2 | 68.4 | - | - | - | - | - |
SimpleBaseline | ResNet-50 | 256×192 | 34.0M | 8.9 | 70.4 | 88.6 | 78.3 | 67.1 | 77.2 | 76.3 |
HRNet-W32 | ResNet-50 | 256×192 | 28.5M | 12.4 | 73.4 | 89.5 | 80.7 | 70.2 | 80.1 | 79.8 |
DARK | HRNetV1-W48 | 128×96 | 63.6M | 3.6 | 71.9 | 89.1 | 79.6 | 69.2 | 78 | 77.9 |
MobileNetV2 | MobileNetV2 | 256×192 | 9.6M | 1.48 | 64.6 | 87.4 | 72.3 | 61.1 | 71.2 | 70.7 |
MobileNetV2 1× | MobileNetV2 | 384×288 | 9.6M | 3.33 | 67.3 | 87.9 | 74.3 | 62.8 | 74.7 | 72.9 |
ShuffleNetV2 | ShuffleNetV2 | 256×192 | 7.6M | 1.28 | 59.9 | 85.4 | 66.3 | 56.6 | 66.2 | 66.4 |
ShuffleNetV2 1× | ShuffleNetV2 | 384×288 | 7.6M | 2.87 | 63.6 | 86.5 | 70.5 | 59.5 | 70.7 | 69.7 |
Small HRNet | HRNet-W16 | 256×192 | 1.3M | 0.54 | 55.2 | 83.7 | 62.4 | 52.3 | 61 | 62.1 |
Small HRNet | HRNet-W16 | 384×288 | 1.3M | 1.21 | 56 | 83.8 | 63 | 52.4 | 62.6 | 62.6 |
Lite-HRNet | Lite-HRNet-18 | 256×192 | 1.1M | 0.20 | 64.8 | 86.7 | 73 | 62.1 | 70.5 | 71.2 |
Lite-HRNet | Lite-HRNet-18 | 384×288 | 1.1M | 0.45 | 67.6 | 87.8 | 75 | 64.5 | 73.7 | 73.7 |
LPN | ResNet-50 | 256×192 | 2.9M | 1.0 | 69.1 | 88.1 | 76.6 | 65.9 | 75.7 | 74.9 |
MobilePoseNet | MobileNetV3 | 256×192 | 1.5M | 0.55 | 66.2 | 87.3 | 74.2 | 63.1 | 72.5 | 72.4 |
MobilePoseNet | MobileNetV3 | 384×288 | 1.5M | 1.23 | 69 | 88.2 | 75.9 | 65.5 | 75.5 | 74.9 |
GhostPoseNet | GhostNet | 256×192 | 3.1M | 0.26 | 63.8 | 88.3 | 71.6 | 61.6 | 67 | 67.3 |
GhostPoseNet* | GhostNet | 384×288 | 3.1M | 0.60 | 67.9 | 89.4 | 74.4 | 65.3 | 72.1 | 71.6 |
GhostPoseNet | GhostNet | 384×288 | 3.1M | 0.60 | 68.7 | 90.4 | 76.8 | 65.7 | 73.0 | 71.8 |
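The 40% reduction in computation relative to LPN can be checked directly from the GFLOPs column of the table. The short sketch below only restates that arithmetic; the helper name `relative_reduction` is hypothetical, and the two values are copied from the LPN row (256×192) and the final GhostPoseNet row (384×288).

```python
# GFLOPs values copied from the results table above.
lpn_gflops = 1.0            # LPN, ResNet-50 backbone, 256x192 input
ghostposenet_gflops = 0.60  # GhostPoseNet, GhostNet backbone, 384x288 input


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline


reduction = relative_reduction(lpn_gflops, ghostposenet_gflops)
print(f"Computation reduced by {reduction:.0%} relative to LPN")
```

Note that the two rows use different input resolutions, so the comparison is between the configurations as reported, not between networks at equal input size.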
The method of the present application is also compared in speed with existing lightweight and large human body posture estimation methods, both with and without a GPU. Without a GPU, the GhostPoseNet model runs faster than networks such as MobileNetV2 and Lite-HRNet. Given its low computation and simple structure, the GhostPoseNet network is well suited to deployment on edge devices. See Table 2 for details:
Table 2: Speed comparison on MSCOCO data
Method | Backbone | #Params | GFLOPs | Input Size | AP | Speed* | Speed |
---|---|---|---|---|---|---|---|
HRNet | HRNetV1-W32 | 28.5M | 7.1 | 256×192 | 74.4 | 7.5 | 19.2 |
HRNet | HRNetV1-W32 | 28.5M | 16 | 384×288 | 75.8 | 4 | 18.8 |
NLite-HRNet-18 | HRNet-W16 | 0.7M | 0.19 | 256×192 | 62.8 | 11 | 18.9 |
WNLite-HRNet-18 | HRNet-W16 | 1.3M | 0.3 | 256×192 | 66 | 12 | 18.6 |
ShuffleNetV2 | ShuffleNetV2 | 7.6M | 1.28 | 256×192 | 59.9 | 17 | 71.3 |
ShuffleNetV2 | ShuffleNetV2 | 7.6M | 2.87 | 384×288 | 63.6 | 10 | 64.1 |
MobileNetV2 | MobileNetV2 | 9.6M | 1.48 | 256×192 | 64.6 | 6.8 | 83.1 |
MobileNetV2 | MobileNetV2 | 9.6M | 3.33 | 384×288 | 67.3 | 4.5 | 73.1 |
Lite-HRNet | Lite-HRNet-18 | 1.1M | 0.2 | 256×192 | 64.8 | 12 | 17.4 |
Lite-HRNet | Lite-HRNet-18 | 1.1M | 0.45 | 384×288 | 67.6 | 7.1 | 16.3 |
MobilePoseNet | MobileNetV3 | 1.5M | 0.55 | 256×192 | 66.2 | 7.8 | 54.8 |
MobilePoseNet | MobileNetV3 | 1.5M | 1.23 | 384×288 | 69.0 | 5.1 | 50.8 |
GhostPoseNet | GhostNet | 3.1M | 0.26 | 256×192 | 63.8 | 9.2 | 62.0 |
GhostPoseNet | GhostNet | 3.1M | 0.60 | 384×288 | 68.7 | 6.4 | 59.4 |
Claims (10)
1. A multi-person human body posture estimation method based on knowledge distillation is characterized by comprising the following steps:
acquiring human body posture sample data and a human body posture sample data label;
constructing a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder is a GhostNet convolution layer, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a convolution with a 1×1 convolution kernel;
inputting human posture sample data into a pre-trained teacher network to obtain a first output joint point heat map, inputting human posture sample data into a student network to obtain a second output joint point heat map, generating joint point offset by using the second output joint point heat map and a target joint point heat map of the human posture sample data labels, and guiding the learning of the student network by combining the joint point offset and the first output joint point heat map;
and inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.
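The final step of claim 1, converting a joint point heat map into joint point coordinates, can be illustrated with a minimal sketch. It assumes plain argmax decoding followed by rescaling to the input image size; the claim does not state the decoding rule, so this is only one common choice. The names `heatmap_to_coordinate` and `decode_pose` are hypothetical, and heat maps are represented as nested Python lists.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows (height) x columns (width)


def heatmap_to_coordinate(
    heatmap: Heatmap, image_w: int, image_h: int
) -> Tuple[float, float, float]:
    """Return (x, y, score) in image pixels for one joint point heat map.

    Assumption: the joint is located at the heat map maximum (argmax).
    """
    h = len(heatmap)
    w = len(heatmap[0])
    best_score = float("-inf")
    best_x = best_y = 0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_score:
                best_score, best_x, best_y = value, x, y
    # Rescale from the heat map grid to input image pixels.
    return best_x * image_w / w, best_y * image_h / h, best_score


def decode_pose(
    heatmaps: List[Heatmap], image_w: int, image_h: int
) -> List[Tuple[float, float, float]]:
    """Decode one (x, y, score) triple per joint point heat map."""
    return [heatmap_to_coordinate(hm, image_w, image_h) for hm in heatmaps]


# Toy 3x4 heat map with its peak at column 2, row 1.
toy = [
    [0.0, 0.1, 0.2, 0.0],
    [0.1, 0.3, 0.9, 0.2],
    [0.0, 0.1, 0.2, 0.0],
]
pose = decode_pose([toy], image_w=8, image_h=6)
```

In practice a sub-pixel refinement around the maximum is often added, because the heat map is coarser than the input image.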
2. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the human body posture sample data label is converted, through a two-dimensional Gaussian function $G_m(x, y)$, into a target joint point heat map $H_m \in \mathbb{R}^{h \times w}$, wherein the two-dimensional Gaussian function $G_m(x, y)$ is:
$H_m = G_m(x, y)$
where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and height of the generated heat map, and $x_m$ and $y_m$ are respectively the horizontal and vertical coordinates of the joint point.
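As an illustration of the target heat map in claim 2, the sketch below assumes the common unnormalized form $G_m(x, y) = \exp\left(-\frac{(x - x_m)^2 + (y - y_m)^2}{2\sigma^2}\right)$. The explicit expression is not reproduced in this text, so the exact form, including any normalization constant, is an assumption, as is the function name `gaussian_heatmap`.

```python
import math
from typing import List


def gaussian_heatmap(
    x_m: float, y_m: float, w: int, h: int, sigma: float
) -> List[List[float]]:
    """Build an h x w heat map with a 2D Gaussian centred on (x_m, y_m).

    Assumption: unnormalized Gaussian, so the peak value is 1.0 when the
    joint lies exactly on a grid point.
    """
    two_sigma_sq = 2.0 * sigma ** 2
    return [
        [
            math.exp(-((x - x_m) ** 2 + (y - y_m) ** 2) / two_sigma_sq)
            for x in range(w)
        ]
        for y in range(h)
    ]


# Heat map of width 8 and height 6 for a joint at x_m = 3, y_m = 2.
H = gaussian_heatmap(x_m=3, y_m=2, w=8, h=6, sigma=1.0)
```

A larger $\sigma$ spreads the supervision over more pixels, while a smaller $\sigma$ gives a sharper target around the labelled joint.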
3. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the formula for generating the joint point bias is:
wherein $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joint points, $p_i$ is the bias of the corresponding joint point $i$, the two per-joint heat map terms represent, respectively, the heat map generated from the label of the $i$-th joint point and the heat map of the $i$-th joint point predicted by the student network; $j$ represents a hyperparameter coefficient, $h_{tar}$ represents the heat map generated from the label, and $h_{stu}$ represents the heat map output by the student network;
wherein $\gamma$ represents a hyperparameter coefficient.
4. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the teacher network uses an MSE loss function, specifically:
wherein the two per-joint heat map terms represent, respectively, the heat map of the teacher network at the $i$-th joint point and the heat map of the $i$-th joint point predicted by the student network, $h_{tea}$ represents the heat map output by the teacher network, $h_{stu}$ represents the heat map output by the student network, and $n$ represents the number of joint points;
the overall loss function for the teacher network and the student network is:
$L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda D(h_{tar}, h_{stu})$
wherein $\lambda$ is a hyperparameter coefficient balancing the weights of the two losses, and $h_{tar}$ represents the heat map generated from the label.
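The overall loss of claim 4 can be sketched as follows. Two assumptions are made: the MSE is taken as the mean squared difference over all heat map elements, and the term $D$ is passed in as a callable, because its explicit expression is not reproduced in this text. The function names `mse` and `total_loss`, and the use of `mse` as a stand-in for $D$ in the usage lines, are hypothetical.

```python
from typing import Callable, List

HeatmapStack = List[List[List[float]]]  # indexed as [joint][row][column]


def mse(a: HeatmapStack, b: HeatmapStack) -> float:
    """Mean squared difference over every element of two heat map stacks."""
    total = 0.0
    count = 0
    for map_a, map_b in zip(a, b):
        for row_a, row_b in zip(map_a, map_b):
            for va, vb in zip(row_a, row_b):
                total += (va - vb) ** 2
                count += 1
    return total / count


def total_loss(
    h_tea: HeatmapStack,
    h_stu: HeatmapStack,
    h_tar: HeatmapStack,
    lam: float,
    distance: Callable[[HeatmapStack, HeatmapStack], float],
) -> float:
    """L_total = MSE(h_tea, h_stu) + lam * D(h_tar, h_stu)."""
    return mse(h_tea, h_stu) + lam * distance(h_tar, h_stu)


# One joint, 1x2 heat maps; plain MSE stands in for D (an assumption).
h_tea = [[[1.0, 0.0]]]
h_stu = [[[0.0, 0.0]]]
h_tar = [[[0.0, 2.0]]]
loss = total_loss(h_tea, h_stu, h_tar, lam=0.5, distance=mse)
```

Keeping $D$ as a parameter lets the bias-weighted term of claim 3 be substituted without changing the structure of the loss.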
5. A knowledge-distillation-based multi-person body pose estimation system, comprising:
the data acquisition module is used for acquiring human body posture sample data and a human body posture sample data label;
the network model building module is used for building a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder is a GhostNet convolution layer, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a convolution with a 1×1 convolution kernel;
the network model training module is used for inputting human body posture sample data into a pre-trained teacher network to obtain a first output joint point heat map, inputting human body posture sample data into a student network to obtain a second output joint point heat map, generating joint point offset by utilizing the second output joint point heat map and a target joint point heat map of a human body posture sample data label, and guiding the learning of the student network by combining the joint point offset and the first output joint point heat map;
and the heat map real-time module is used for inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map and converting the joint point heat map into a real joint point coordinate.
6. The knowledge distillation-based multi-person human body posture estimation system as claimed in claim 5, wherein: the human body posture sample data label is converted, through a two-dimensional Gaussian function $G_m(x, y)$, into a target joint point heat map $H_m \in \mathbb{R}^{h \times w}$, wherein the two-dimensional Gaussian function $G_m(x, y)$ is:
$H_m = G_m(x, y)$
where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and height of the generated heat map, and $x_m$ and $y_m$ are respectively the horizontal and vertical coordinates of the joint point.
7. The knowledge distillation-based multi-person human body posture estimation system as claimed in claim 5, wherein: the formula for generating the joint point bias is:
wherein $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joint points, $p_i$ is the bias of the corresponding joint point $i$, the two per-joint heat map terms represent, respectively, the heat map generated from the label of the $i$-th joint point and the heat map of the $i$-th joint point predicted by the student network; $j$ represents a hyperparameter coefficient, $h_{tar}$ represents the heat map generated from the label, and $h_{stu}$ represents the heat map output by the student network;
where $\gamma$ represents a hyperparameter coefficient.
8. The knowledge distillation-based multi-person human body posture estimation system as claimed in claim 5, wherein: the teacher network uses an MSE loss function, specifically:
wherein the two per-joint heat map terms represent, respectively, the heat map of the teacher network at the $i$-th joint point and the heat map of the $i$-th joint point predicted by the student network, $h_{tea}$ represents the heat map output by the teacher network, $h_{stu}$ represents the heat map output by the student network, and $n$ represents the number of joint points;
the overall loss function for the teacher network and the student network is:
$L_{total} = \mathrm{MSE}(h_{tea}, h_{stu}) + \lambda D(h_{tar}, h_{stu})$
where $\lambda$ is a hyperparameter coefficient balancing the weights of the two losses, and $h_{tar}$ represents the joint point heat map generated from the label.
9. A computer device, characterized by: comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that: stored with a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210714617.4A CN115187660A (en) | 2022-06-22 | 2022-06-22 | Knowledge distillation-based multi-person human body posture estimation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115187660A true CN115187660A (en) | 2022-10-14 |
Family
ID=83514461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210714617.4A Pending CN115187660A (en) | 2022-06-22 | 2022-06-22 | Knowledge distillation-based multi-person human body posture estimation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115187660A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116091849A (en) * | 2023-04-11 | 2023-05-09 | 山东建筑大学 | Tire pattern classification method, system, medium and equipment based on grouping decoder |
CN116091849B (en) * | 2023-04-11 | 2023-07-25 | 山东建筑大学 | Tire pattern classification method, system, medium and equipment based on grouping decoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||