CN115187660A - Knowledge distillation-based multi-person human body posture estimation method and system

Info

Publication number: CN115187660A
Application number: CN202210714617.4A
Authority: CN
Inventors: 欧卫华; 蒋永红; 高国强; 犹津
Original assignee: Guizhou Siso Electronics Co ltd
Current assignee: Guizhou Siso Electronics Co ltd
Priority date: 2022-06-22
Filing date: 2022-06-22
Publication date: 2022-10-14

Abstract

The invention discloses a knowledge distillation-based multi-person human body posture estimation method and system, belonging to human body posture estimation in the technical field of computer vision. It aims to solve two problems in prior knowledge distillation techniques: a student network built from stacked Hourglass networks is difficult to make very small, and reducing the complexity of the network model tends to reduce model performance. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolution layers, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a 1×1 convolution. Data are input into a teacher network and a student network to obtain the corresponding heat maps, and joint point biases are generated from the student heat maps and the target joint point heat maps of the data labels to dynamically adjust the knowledge transfer from the teacher network to the student network. The acquired real-time human body posture data are then input into the student network, which outputs a real-time joint point heat map that is converted into real joint point coordinates.

Description

Knowledge distillation-based multi-person human body posture estimation method and system

Technical Field

The invention belongs to the technical field of computer vision, relates to a human body posture estimation method, and particularly relates to a human body posture estimation method based on a knowledge distillation method.

Background

Human Pose Estimation (HPE) aims to obtain the coordinates of human joints from image input and is one of the hot topics in computer vision. In recent years, human pose estimation technology has developed rapidly and is widely applied in fields such as human body tracking, action recognition, motion detection and human-computer interaction.

The success of AlexNet in image classification and recognition pushed computer vision research into the deep learning era. In 2014, A. Toshev et al. proposed DeepPose, a deep learning-based human pose estimation model that uses a deep convolutional neural network to predict the human pose globally, marking the entry of human pose estimation into the deep learning era. Thanks to the strong feature extraction capability of deep convolutional networks, human pose estimation has advanced considerably. Compared with methods based on hand-crafted features, deep learning-based human pose estimation algorithms are more robust. Since then, the release of the MPII and MSCOCO datasets has spawned a great deal of excellent work on deep learning-based human pose estimation.

With the progress of science and technology, more and more intelligent devices are appearing in people's lives, such as self-driving cars, intelligent surveillance cameras, intelligent motion assistors and elderly-care robots. All of them need functions such as action recognition, motion detection and human-computer interaction, and these functions depend on human pose estimation algorithms. Deploying large human pose estimation algorithms on such devices, whose computing resources are limited, is very difficult. Large human pose estimation algorithms can currently be deployed in the cloud so that intelligent devices obtain these functions through technologies such as cloud computing, but this easily leads to information leakage and cannot guarantee people's privacy. Researchers have therefore begun to study lighter and more efficient human pose estimation algorithms that do not need large GPUs or similar hardware and can run in real time on mobile phones, wearable devices, surveillance cameras and embedded devices.

2D human pose estimation can be divided into two categories: Single-Person Pose Estimation (SPPE) and Multi-Person Pose Estimation (MPPE). For multi-person pose estimation, the mainstream methods are generally classified as top-down or bottom-up according to the starting point (abstraction level) of the prediction. The top-down approach starts from the high-level abstraction: it first detects people and generates their positions as bounding boxes, and then estimates the pose of each person. The top-down approach is more intuitive than the bottom-up approach and also more accurate. To detect people, most current mainstream multi-person pose estimation algorithms adopt common object detection algorithms as human detectors, such as Faster R-CNN, Mask R-CNN and YOLO, and then produce the multi-person poses by single-person pose estimation.

To realize lightweight multi-person pose estimation, more and more researchers are working on lightweight human pose estimation models. Feng et al. use knowledge distillation to transfer knowledge from a complex teacher Hourglass network into a simple student Hourglass network. Much other work on lightweight human pose estimation focuses on improving the network structure. Tang et al. achieve high-precision keypoint localization with densely connected U-Nets, which resemble the Hourglass network. Debnath et al., inspired by the Hourglass network, introduce a novel split-stream architecture in the last two layers of MobileNet, which reduces the model parameters, alleviates overfitting and improves accuracy. Zhang et al. introduce global attention and propose a lightweight bottleneck block to replace the bottleneck block in ResNet, constructing LPN, whose structure is similar to SimpleBaseline. Yu et al. propose replacing the pointwise convolution in the channel shuffle module of ShuffleNet with channel weighting, and construct Lite-HRNet, a network with high-resolution feature representations. Ding et al. and Zhang et al. use Neural Architecture Search (NAS) to construct, respectively, HR-NAS, a multi-scale feature fusion network similar to HRNet, and EfficientPose, which is similar to SimpleBaseline.
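The top-down pipeline described above ends with joint locations expressed on each person's own heat map, which then have to be mapped back into the full image. The following is a minimal sketch of that last step, assuming axis-aligned bounding boxes given as (left, top, width, height); the function name, the box values and the heat map size are hypothetical and chosen only for illustration.

```python
def heatmap_to_image_coords(x_hm, y_hm, box, hm_w, hm_h):
    """Map a joint location on a person's heat map back to image pixels.

    box is (left, top, width, height) of the detected person in the image.
    """
    left, top, box_w, box_h = box
    x_img = left + x_hm * box_w / hm_w   # rescale, then shift by the box origin
    y_img = top + y_hm * box_h / hm_h
    return x_img, y_img

# Two hypothetical detections from a person detector.
boxes = [(100.0, 50.0, 96.0, 128.0), (300.0, 40.0, 48.0, 64.0)]
joint_on_heatmap = (24.0, 32.0)          # centre of a 48x64 (width x height) heat map
joints = [heatmap_to_image_coords(*joint_on_heatmap, box, 48, 64) for box in boxes]
```

Each detected person is processed independently, which is why a single-person estimator is enough once the detector has produced the boxes.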

Human pose estimation knowledge distillation methods such as FPD and OKDHM are constrained by factors such as the complexity gap between the teacher and student networks in knowledge distillation. They mainly build both the student network and the teacher network from stacked Hourglass networks, using a network with many stacks and a large number of parameters as the teacher and a network with fewer stacks and fewer parameters as the student, which makes it difficult to make the student network very small.

Disclosure of Invention

The invention aims to provide a knowledge distillation-based multi-person human body posture estimation method and system that solve the problem in the prior art that a student network built by stacking Hourglass networks is difficult to make very small, while avoiding the drop in model performance caused by reducing the complexity of the network model.

The invention specifically adopts the following technical scheme for realizing the purpose:

a multi-person human posture estimation method based on knowledge distillation comprises the following steps:

acquiring human body posture sample data and a human body posture sample data label;

constructing a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder consists of GhostNet convolution layers, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a 1×1 convolution;

inputting human body posture sample data into a pre-trained teacher network to obtain a first output joint point heat map, inputting human body posture sample data into a student network to obtain a second output joint point heat map, generating joint point bias by using the second output joint point heat map and a target joint point heat map of a human body posture sample data label, and guiding the learning of the student network by combining the joint point bias and the first output joint point heat map;

and inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.

Preferably, the human body posture sample data label is passed through a two-dimensional Gaussian function $G_m(x, y)$ to compute the target joint point heat map $H^m \in \mathbb{R}^{h \times w}$, where the two-dimensional Gaussian function $G_m(x, y)$ is:

$H^m = G_m(x, y)$

where $\sigma$ is the standard deviation of the Gaussian distribution, $w$ and $h$ are the width and height of the generated heat map, and $x_m$ and $y_m$ are respectively the horizontal and vertical coordinates of joint point $m$.
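The closed form of $G_m$ is not spelled out in this text, so the sketch below assumes the commonly used unnormalized Gaussian with peak value 1 centred on the joint. The function name `gaussian_heatmap` and the default `sigma=2.0` are illustrative assumptions rather than values from the source.

```python
import numpy as np

def gaussian_heatmap(x_m, y_m, h, w, sigma=2.0):
    """Return an (h, w) heat map peaking at joint location (x_m, y_m).

    Assumption: unnormalized Gaussian with peak value 1; sigma=2.0 is an
    illustrative default, not a value taken from the source.
    """
    ys, xs = np.mgrid[0:h, 0:w]           # ys: row index, xs: column index
    sq_dist = (xs - x_m) ** 2 + (ys - y_m) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

heatmap = gaussian_heatmap(x_m=20, y_m=30, h=64, w=48)
peak_row, peak_col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

One such map would be generated per joint, so a person with $k$ labelled joints yields a target of shape $(k, h, w)$.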

Preferably, the formula for generating the joint point bias is:

where $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joints, and $p_i$ is the bias of the corresponding joint point $i$; $h^{tar}$ represents the joint point heat map generated from the label and $h^{stu}$ represents the joint point heat map predicted by the student network, each indexed by joint point $i$; and $j$ represents a hyper-parameter coefficient;

comprises the following steps:

wherein γ represents a hyper-parameter.

Preferably, the teacher network uses an MSE loss function, specifically:

where $h^{tea}$ represents the heat map output by the teacher network and $h^{stu}$ represents the heat map output by the student network, each indexed by joint point $i$, and $n$ represents the number of joint points;

the overall loss function for the teacher network and the student network is:

$L_{total} = \mathrm{MSE}(h^{tea}, h^{stu}) + \lambda D(h^{tar}, h^{stu})$

where $\lambda$ is a hyper-parameter coefficient balancing the two loss terms and $h^{tar}$ represents the heat map generated from the label.

A knowledge-distillation-based multi-person body pose estimation system, comprising:

the data acquisition module is used for acquiring human body posture sample data and a human body posture sample data label;

the network model building module is used for building a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder consists of GhostNet convolution layers, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a 1×1 convolution;

the network model training module is used for inputting human body posture sample data into a pre-trained teacher network to obtain a first output joint point heat map, inputting human body posture sample data into a student network to obtain a second output joint point heat map, generating joint point offset by utilizing the second output joint point heat map and a target joint point heat map of a human body posture sample data label, and guiding the learning of the student network by combining the joint point offset and the first output joint point heat map;

and the heat map real-time module is used for inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map and converting the joint point heat map into a real joint point coordinate.

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the above method.

A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the above-mentioned method.

The invention has the following beneficial effects:

1. In the invention, the lightweight network GhostNet is used for human posture estimation to generate joint point heat maps, and a knowledge distillation-based online joint point distillation method uses the heat maps generated from the labels together with the teacher network output as supervision information to guide GhostNet learning. The student network therefore does not need to be built by stacking Hourglass networks; the GhostNet student network can be made very small without a drop in model performance caused by the reduced complexity of the network model. The student network can thus be deployed on intelligent devices such as self-driving cars, intelligent surveillance cameras, intelligent motion assistors and elderly-care robots, which greatly widens its range and fields of application.

2. In the invention, to improve model performance, a knowledge distillation-based joint point online optimization strategy is proposed: joint point biases are generated from the output of the GhostNet network and the heat maps generated from the labels, and the pre-trained large human posture estimation network HRNet serves as the teacher network, softening the label information and guiding the learning of the student network GhostNet, which makes the training of the model more efficient.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic structural view of the present invention;

FIG. 3 is a schematic diagram of the structure of a student network model of the present invention;

FIG. 4 is a schematic structural diagram of a three-layer LPB of the present invention.

Detailed Description

Example 1

The embodiment provides a knowledge distillation-based multi-person human body posture estimation method, which is used for estimating the postures of a plurality of persons. It comprises the following steps:

Step S1, obtaining sample data

Human body posture sample data and the corresponding human body posture sample data labels are acquired. To verify the superiority of the proposed human body posture estimation method, the public human body posture estimation dataset MSCOCO is selected for the experiments, and the evaluation uses mAP, $AP^{0.5}$, $AP^{0.75}$, $AP^{m}$, $AP^{l}$ and AR.

Step S2, building a network model

The network model is a lightweight human posture network model and comprises a pre-trained teacher network model, an untrained student network model and a matching loss function. The teacher network model adopts HRNet-W32. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolution layers, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a 1×1 convolution. Data pass through the encoder, the decoder and the joint point regressor in sequence. The decoder preferably stacks three layers of LPBs, each of which consists of three parts: a depthwise separable transposed convolution, a pointwise convolution and a channel attention module.
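To make the last stage of the student network concrete, the sketch below shows how a 1×1 convolution turns decoder features into one heat map per joint. It is a NumPy illustration, not the patented implementation: the channel count `c = 32`, the random weights and the function name `pointwise_conv` are assumptions, while the 17 joints follow the MSCOCO keypoint annotation.

```python
import numpy as np

def pointwise_conv(features, weight, bias):
    """Apply a 1x1 convolution: features (c, h, w) -> heat maps (k, h, w).

    weight has shape (k, c) and bias has shape (k,); every pixel is mapped
    independently, so the spatial size is unchanged.
    """
    return np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]

rng = np.random.default_rng(0)
c, k, h, w = 32, 17, 64, 48               # hypothetical channel count; 17 MSCOCO joints
features = rng.standard_normal((c, h, w)) # stand-in for the decoder output
weight = rng.standard_normal((k, c))
bias = np.zeros(k)
heatmaps = pointwise_conv(features, weight, bias)
```

Because a 1×1 kernel mixes channels without looking at neighbouring pixels, the regressor adds only $k \times c$ weights plus $k$ biases.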

Step S3, training the network model

Human body posture sample data are input into the pre-trained teacher network to obtain a first output joint point heat map, and into the student network to obtain a second output joint point heat map. The human body posture sample data label is passed through a Gaussian calculation to obtain the target joint point heat map. Joint point biases are generated from the second output joint point heat map and the target joint point heat map, while the heat map output by the teacher network serves as partial supervision information. The learning of the student network is guided by combining the joint point biases with the first output joint point heat map, which completes the training of the network model.

When the target joint point heat map is obtained through Gaussian calculation, the human body posture sample data label is passed through a two-dimensional Gaussian function $G_m(x, y)$ to compute the target joint point heat map $H^m \in \mathbb{R}^{h \times w}$, where the two-dimensional Gaussian function $G_m(x, y)$ is:

$H^m = G_m(x, y)$

When joint biases are generated by using the second output joint heat map and the target joint heat map, the formula for generating the joint biases is as follows:

where $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joints, and $p_i$ is the bias of the corresponding joint point $i$; $h^{tar}$ represents the heat map generated from the label and $h^{stu}$ represents the heat map output by the student network, each indexed by joint point $i$; and $j$ represents a hyperparameter;

comprises the following steps:

wherein γ represents a hyper-parameter.

During training, the teacher network adopts an MSE loss function, specifically:

where $h^{tea}$ represents the heat map output by the teacher network, indexed by joint point $i$.

During training, the overall loss function of the teacher network and the student network is:

$L_{total} = \mathrm{MSE}(h^{tea}, h^{stu}) + \lambda D(h^{tar}, h^{stu})$

where $\lambda$ is a hyper-parameter coefficient balancing the two loss terms and $h^{tar}$ represents the heat map generated from the label.
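The structure of the overall loss can be sketched as follows. The formula for $D$ and for the biases $p_i$ is not reproduced in this text, so `weighted_joint_distance` is a stand-in that weights a per-joint squared error by $p_i$; that choice, the function names and the value `lam=0.5` are assumptions made only to show how the two terms combine.

```python
import numpy as np

def mse(a, b):
    """Mean squared error over all joints and pixels."""
    return float(np.mean((a - b) ** 2))

def weighted_joint_distance(h_tar, h_stu, p):
    """Stand-in for D: per-joint squared error weighted by biases p (shape (k,)).

    Assumption: this weighting is illustrative, not the source's formula.
    """
    per_joint = np.mean((h_tar - h_stu) ** 2, axis=(1, 2))   # shape (k,)
    return float(np.sum(p * per_joint))

def total_loss(h_tea, h_stu, h_tar, p, lam=0.5):
    """L_total = MSE(h_tea, h_stu) + lam * D(h_tar, h_stu)."""
    return mse(h_tea, h_stu) + lam * weighted_joint_distance(h_tar, h_stu, p)

rng = np.random.default_rng(0)
k, h, w = 17, 64, 48                      # 17 joints as in MSCOCO keypoints
h_tea, h_stu, h_tar = (rng.random((k, h, w)) for _ in range(3))
p = np.ones(k)                            # uniform biases for the demo
loss = total_loss(h_tea, h_stu, h_tar, p)
```

In this arrangement only the student heat map `h_stu` would carry gradients; the teacher output and the label heat map act as fixed targets.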

After the network model is trained, only the student network needs to work when testing.

Step S4, acquiring the coordinates of the joint points in real time

Inputting the acquired real-time human body posture data into a student network, outputting a real-time joint point heat map, and converting the joint point heat map into real joint point coordinates.

In each heat map, the larger the value at a location, the greater the probability that the joint is present at that location. Therefore, the joint point heat map is converted into joint point coordinates by the Soft-argmax function, which is calculated as follows:

where $\beta$ is a hyper-parameter that prevents numerical overflow, $h$ denotes the height of the heat map, $w$ denotes the width of the heat map, $x_m$ and $y_m$ are respectively the horizontal and vertical coordinates of joint point $m$, $H^m$ represents the heat map corresponding to joint point $m$, and the heat value of that heat map at coordinates $\langle h, w \rangle$ can be understood as the probability that the joint is present there.
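Since the Soft-argmax formula itself is not reproduced in this text, the sketch below assumes the common form: a softmax over $\beta H^m$ followed by the expected pixel coordinates. The value `beta=100.0` and the function name are illustrative assumptions, and the max-subtraction is a standard numerical safeguard rather than something stated in the source.

```python
import numpy as np

def soft_argmax(heatmap, beta=100.0):
    """Return the expected (x, y) location under softmax(beta * heatmap).

    Assumption: this is the common soft-argmax form; beta=100.0 is an
    illustrative value, not one given in the source.
    """
    h, w = heatmap.shape
    logits = beta * heatmap
    logits = logits - logits.max()        # subtract max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()              # normalized probabilities, sum to 1
    ys, xs = np.mgrid[0:h, 0:w]
    x_m = float(np.sum(weights * xs))     # expected column index
    y_m = float(np.sum(weights * ys))     # expected row index
    return x_m, y_m

# Demo heat map: a single Gaussian bump centred at x=20, y=30.
ys, xs = np.mgrid[0:64, 0:48]
demo = np.exp(-((xs - 20) ** 2 + (ys - 30) ** 2) / (2.0 * 2.0 ** 2))
x_hat, y_hat = soft_argmax(demo)
```

Unlike a hard argmax, this weighted average is differentiable and can return sub-pixel coordinates.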

Example 2

The embodiment also provides a knowledge distillation-based multi-person human body posture estimation system, which comprises:

The data acquisition module is used for acquiring human body posture sample data and human body posture sample data labels. To verify the superiority of the proposed human body posture estimation method, the public human body posture estimation dataset MSCOCO is selected as the sample data and labels for the experiments, and the evaluation uses mAP, $AP^{0.5}$, $AP^{0.75}$, $AP^{m}$, $AP^{l}$ and AR.

The network building module is used for building a pre-trained teacher network model and an untrained student network model. The teacher network model adopts HRNet-W32. The student network model comprises an encoder, a decoder and a joint point regressor: the encoder consists of GhostNet convolution layers, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a 1×1 convolution. Data pass through the encoder, the decoder and the joint point regressor in sequence. The decoder preferably stacks three layers of LPBs, each of which consists of three parts: a depthwise separable transposed convolution, a pointwise convolution and a channel attention module.

The network training module is used for inputting human body posture sample data into the pre-trained teacher network to obtain a first output joint point heat map and into the student network to obtain a second output joint point heat map. The human body posture sample data label is passed through a Gaussian calculation to obtain the target joint point heat map. Joint point biases are generated from the second output joint point heat map and the target joint point heat map, while the heat map output by the teacher network serves as partial supervision information. The learning of the student network is guided by combining the joint point biases with the first output joint point heat map, which completes the training of the network model.

When the target joint point heat map is obtained through Gaussian calculation, the human body posture sample data label is passed through a two-dimensional Gaussian function $G_m(x, y)$ to compute the target joint point heat map $H^m \in \mathbb{R}^{h \times w}$, where the two-dimensional Gaussian function $G_m(x, y)$ is:

$H^m = G_m(x, y)$

When joint point biases are generated using the second output joint point heat map and the target joint point heat map, the formula for generating the joint point biases is:

where $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, $k$ is the number of human joints, and $p_i$ is the bias of the corresponding joint point $i$; $h^{tar}$ represents the heat map generated from the label and $h^{stu}$ represents the heat map output by the student network, each indexed by joint point $i$; and $j$ is a hyperparameter coefficient indicating that the first $j$ values of $p_i$ are selected;

comprises the following steps:

where $\gamma$ represents a joint point penalty hyperparameter.

During training, the teacher network adopts an MSE loss function, where $h^{tea}$ represents the heat map output by the teacher network and $h^{stu}$ represents the heat map predicted by the student network, each indexed by joint point $i$, and $n$ represents the number of joint points.

The overall loss function of the teacher network and the student network is:

$L_{total} = \mathrm{MSE}(h^{tea}, h^{stu}) + \lambda D(h^{tar}, h^{stu})$

where $\lambda$ is the hyperparameter coefficient that balances the two loss terms and $h^{tar}$ represents the heat map generated from the label.

After the network model is trained, only the student network needs to work during testing.

In each heat map, the larger the value at a location, the greater the probability that the joint is present at that location. Therefore, the joint point heat map is converted into joint point coordinates by the Soft-argmax function, which is calculated as follows:

where $\beta$ is a hyper-parameter that prevents numerical overflow, $h$ denotes the height of the heat map, $w$ denotes the width of the heat map, $x_m$ and $y_m$ are respectively the horizontal and vertical coordinates of joint point $m$, $H^m$ represents the heat map of joint point $m$, and the heat value of that heat map at coordinates $\langle h, w \rangle$ is the value being processed.

Example 3

The embodiment also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the above method for estimating the posture of the human body of multiple persons based on knowledge distillation.

The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can be in man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

The memory includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card provided on the computer device. Of course, the memory may also include both internal and external storage units of the computer device. In this embodiment, the memory is generally used for storing the operating system and the various application software installed on the computer device, such as the program code for executing the knowledge distillation-based multi-person human body posture estimation method. In addition, the memory may be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or to process data, for example to execute the program code of the knowledge distillation-based multi-person human body posture estimation method.

Example 4

The present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above knowledge distillation-based multi-person human body posture estimation method.

Wherein the computer readable storage medium stores an interface display program executable by at least one processor to cause the at least one processor to perform the steps of the knowledge-distillation based multi-person body pose estimation method.

Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be essentially or partially embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server or a network device, etc.) to execute the method according to the embodiments of the present application.

Experimental Example

In this experimental example, the network model described above is built and the estimation method is applied to the public human body posture estimation dataset MSCOCO, while the comparative examples adopt other human body detectors with the same multi-person human body posture estimation method.

Through actual measurement and analysis, the method of the application is compared with existing human body posture estimation methods, namely Hourglass, SimpleBaseline, LPN, Lite-HRNet, ShuffleNetV2, DARK and MobilePoseNet. All methods were evaluated on MSCOCO data. As shown in Table 1, the method of the application scores higher on the $AP^{0.5}$ index than all the other compared methods, which shows the superiority of the model. In addition, the method achieves an mAP of 68.7 with a computational cost of only 0.6 GFLOPs, which is higher than larger models such as Hourglass and CPN. Compared with the lightweight human body posture estimation model LPN, the computational cost is reduced by 40% at a similar accuracy. Compared with other lightweight networks such as MobileNetV2, ShuffleNetV2 and Lite-HRNet, the method of the application has a larger lead in aspects such as accuracy and parameters. See Table 1 for details:

Table 1: Estimation results on MSCOCO data

Method	Backbone	Input size	#Params	GFLOPs	AP	AP^50	AP^75	AP^M	AP^L	AR
8-stage Hourglass	Hourglass	256×192	25.6M	26.2	66.9	-	-	-	-	-
CPN	ResNet-50	256×192	27.0M	6.2	68.4	-	-	-	-	-
SimpleBaseline	ResNet-50	256×192	34.0M	8.9	70.4	88.6	78.3	67.1	77.2	76.3
HRNet-W32	ResNet-50	256×192	28.5M	12.4	73.4	89.5	80.7	70.2	80.1	79.8
DARK	HRNetV1-W48	128×96	63.6M	3.6	71.9	89.1	79.6	69.2	78	77.9
MobileNetV2	MobileNetV2	256×192	9.6M	1.48	64.6	87.4	72.3	61.1	71.2	70.7
MobileNetV2 1×	MobileNetV2	384×288	9.6M	3.33	67.3	87.9	74.3	62.8	74.7	72.9
ShuffleNetV2	ShuffleNetV2	256×192	7.6M	1.28	59.9	85.4	66.3	56.6	66.2	66.4
ShuffleNetV2 1×	ShuffleNetV2	384×288	7.6M	2.87	63.6	86.5	70.5	59.5	70.7	69.7
Small HRNet	HRNet-W16	256×192	1.3M	0.54	55.2	83.7	62.4	52.3	61	62.1
Small HRNet	HRNet-W16	384×288	1.3M	1.21	56	83.8	63	52.4	62.6	62.6
Lite-HRNet	Lite-HRNet-18	256×192	1.1M	0.20	64.8	86.7	73	62.1	70.5	71.2
Lite-HRNet	Lite-HRNet-18	384×288	1.1M	0.45	67.6	87.8	75	64.5	73.7	73.7
LPN	ResNet-50	256×192	2.9M	1.0	69.1	88.1	76.6	65.9	75.7	74.9
MobilePoseNet	MobileNetV3	256×192	1.5M	0.55	66.2	87.3	74.2	63.1	72.5	72.4
MobilePoseNet	MobileNetV3	384×288	1.5M	1.23	69	88.2	75.9	65.5	75.5	74.9
GhostPoseNet	GhostNet	256×192	3.1M	0.26	63.8	88.3	71.6	61.6	67	67.3
GhostPoseNet*	GhostNet	384×288	3.1M	0.60	67.9	89.4	74.4	65.3	72.1	71.6
GhostPoseNet	GhostNet	384×288	3.1M	0.60	68.7	90.4	76.8	65.7	73.0	71.8
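
As a quick check of the 40% figure quoted in the text, the reduction can be computed from the GFLOPs column, taking the LPN row (1.0 GFLOPs) and the GhostPoseNet row at 384×288 (0.60 GFLOPs). Pairing these two rows is an assumption, since the text does not say which rows it compares:

$$\frac{1.0 - 0.60}{1.0} = 0.40 = 40\%$$

The two rows use different input sizes (256×192 for LPN, 384×288 for GhostPoseNet), and their AP values are 69.1 and 68.7 respectively.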

The method of the application is also compared in speed with existing lightweight and large human body posture estimation methods, both with and without a GPU. Without a GPU, the GhostPoseNet model runs faster than networks such as MobileNetV2 and Lite-HRNet; given its low computational cost and simple structure, the GhostPoseNet network is better suited to deployment on edge devices. See Table 2 for details:

Table 2: Speed comparison on MSCOCO data

Method	Backbone	#Params	GFLOPs	Input size	AP	Speed*	Speed
HRNet	HRNetV1-W32	28.5M	7.1	256×192	74.4	7.5	19.2
HRNet	HRNetV1-W32	28.5M	16	384×288	75.8	4	18.8
NLite-HRNet-18	HRNet-W16	0.7M	0.19	256×192	62.8	11	18.9
WNLite-HRNet-18	HRNet-W16	1.3M	0.3	256×192	66	12	18.6
ShuffleNetV2 1×	ShuffleNetV2	7.6M	1.28	256×192	59.9	17	71.3
ShuffleNetV2 1×	ShuffleNetV2	7.6M	2.87	384×288	63.6	10	64.1
MobileNetV2 1×	MobileNetV2	9.6M	1.48	256×192	64.6	6.8	83.1
MobileNetV2 1×	MobileNetV2	9.6M	3.33	384×288	67.3	4.5	73.1
Lite-HRNet	Lite-HRNet-18	1.1M	0.2	256×192	64.8	12	17.4
Lite-HRNet	Lite-HRNet-18	1.1M	0.45	384×288	67.6	7.1	16.3
MobilePoseNet	MobileNetV3	1.5M	0.55	256×192	66.2	7.8	54.8
MobilePoseNet	MobileNetV3	1.5M	1.23	384×288	69.0	5.1	50.8
GhostPoseNet	GhostNet	3.1M	0.26	256×192	63.8	9.2	62.0
GhostPoseNet	GhostNet	3.1M	0.60	384×288	68.7	6.4	59.4

Claims

1. A multi-person human body posture estimation method based on knowledge distillation is characterized by comprising the following steps:

constructing a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder is a GhostNet convolution layer, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a convolution with a 1×1 kernel;

inputting human posture sample data into a pre-trained teacher network to obtain a first output joint point heat map, inputting human posture sample data into a student network to obtain a second output joint point heat map, generating joint point offset by using the second output joint point heat map and a target joint point heat map of the human posture sample data labels, and guiding the learning of the student network by combining the joint point offset and the first output joint point heat map;
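As an illustration of the joint point regressor described above, a convolution with a 1×1 kernel is a per-pixel linear map from decoder feature channels to one heat map per joint. The following is a minimal sketch in plain Python, not the patented implementation; the function name `conv1x1`, the nested-list tensor layout, and the toy weights are assumptions introduced only for this example.

```python
def conv1x1(features, weights, bias):
    """Apply a 1x1 convolution to a feature map.

    features: nested list with shape [C][H][W] (decoder output, hypothetical layout)
    weights:  nested list with shape [K][C], one row per joint heat map
    bias:     list of length K
    returns:  nested list with shape [K][H][W]
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    return [
        [
            [
                # Each output pixel mixes the C input channels at the same location.
                bias[k] + sum(weights[k][c] * features[c][y][x] for c in range(channels))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for k in range(len(weights))
    ]


# Toy input: C=2 channels, H=1, W=2; K=3 output heat maps (illustrative values only).
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]
bias = [0.0, 0.0, 1.0]
heatmaps = conv1x1(features, weights, bias)
```

Because the kernel covers a single pixel, the spatial size of the output equals that of the input, and only the channel count changes from C to the number of joints K.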

2. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the target joint point heat map $H^m \in \mathbb{R}^{h \times w}$ is computed from the human body posture sample data label by a two-dimensional Gaussian function $G_m(x, y)$, wherein:

$$H^m = G_m(x, y)$$

where σ is the standard deviation of the Gaussian distribution, w and h are the width and height of the generated heat map, and $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively.
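The heat map generation in this claim can be sketched as follows. The text here does not reproduce the explicit expression for the Gaussian, so the sketch assumes the common unnormalized form exp(-((x - x_m)² + (y - y_m)²) / (2σ²)), whose peak value is 1 at the joint location. The function name `gaussian_heatmap`, the default σ, and the 48×64 map size are assumptions chosen only for illustration.

```python
import math


def gaussian_heatmap(x_m, y_m, w, h, sigma=2.0):
    """Return an h-by-w heat map (list of rows) peaked at joint (x_m, y_m).

    Assumption: unnormalized 2D Gaussian, so the value at the joint is 1.0.
    """
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [
            math.exp(-((x - x_m) ** 2 + (y - y_m) ** 2) / two_sigma_sq)
            for x in range(w)
        ]
        for y in range(h)
    ]


# One joint at column 20, row 30 of a 48-wide, 64-high map (illustrative sizes).
heatmap = gaussian_heatmap(x_m=20, y_m=30, w=48, h=64, sigma=2.0)
```

Rows index the vertical coordinate and columns the horizontal one, so the peak sits at `heatmap[y_m][x_m]`. A full label would stack one such map per joint.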

3. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the joint point offset is generated by a formula in which $P = \{p_1, p_2, p_3, \ldots, p_k\}$ is a hyperparameter, k is the number of human joints, and $p_i$ is the offset of the corresponding joint point i; the formula is defined in terms of the heat map generated from the label of the i-th joint point and the heat map of the i-th joint point predicted by the student network; j represents a hyper-parameter coefficient, $h^{tar}$ represents the heat map generated from the label, and $h^{stu}$ represents the heat map output by the student network; and γ represents a hyper-parameter coefficient.

4. The knowledge distillation-based multi-person human body posture estimation method as claimed in claim 1, wherein: the teacher network adopts an MSE loss function defined over the heat map of the teacher network at the i-th joint point and the heat map of the i-th joint point predicted by the student network, where $h^{tea}$ represents the heat map output by the teacher network, $h^{stu}$ represents the heat map output by the student network, and n represents the number of joint points;

the overall loss function for the teacher network and the student network is:

$$L_{total} = \mathrm{MSE}(h^{tea}, h^{stu}) + \lambda D(h^{tar}, h^{stu})$$

5. A knowledge-distillation-based multi-person body pose estimation system, comprising:

the network model building module is used for building a pre-trained teacher network model and an untrained student network model, wherein the teacher network model is HRNet-W32, the student network model comprises an encoder, a decoder and a joint point regressor, the encoder is a GhostNet convolution layer, the decoder is a stack of lightweight up-sampling modules, and the joint point regressor is a convolution with a 1×1 kernel;

6. The system of claim 5, wherein: the target joint point heat map $H^m \in \mathbb{R}^{h \times w}$ is computed from the human body posture sample data label by a two-dimensional Gaussian function $G_m(x, y)$, wherein:

$$H^m = G_m(x, y)$$

where σ is the standard deviation of the Gaussian distribution, w and h are the width and height of the generated heat map, and $x_m$ and $y_m$ are the horizontal and vertical coordinates of the joint point, respectively.

7. The system of claim 5, wherein: the joint point offset is generated by a formula defined in terms of the heat map generated from the label of the i-th joint point and the heat map of the i-th joint point predicted by the student network; j represents a hyper-parameter coefficient, $h^{tar}$ represents the heat map generated from the label, and $h^{stu}$ represents the heat map output by the student network; and γ represents a hyper-parameter coefficient.

8. The system of claim 1, wherein: the teacher network adopts an MSE loss function defined over the heat map of the teacher network at the i-th joint point and the heat map of the i-th joint point predicted by the student network, where $h^{tea}$ represents the heat map output by the teacher network, $h^{stu}$ represents the heat map output by the student network, and n represents the number of joint points;

the overall loss function for the teacher network and the student network is:

$$L_{total} = \mathrm{MSE}(h^{tea}, h^{stu}) + \lambda D(h^{tar}, h^{stu})$$

where λ is the hyperparameter coefficient that balances the two loss terms, and $h^{tar}$ represents the joint point heat map generated from the label.
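The overall loss $L_{total} = \mathrm{MSE}(h^{tea}, h^{stu}) + \lambda D(h^{tar}, h^{stu})$ can be sketched as below. The exact form of D and of the joint point offset is not reproduced in this text, so D is passed in as a caller-supplied function `d_fn`; the example call uses plain MSE as a stand-in, which is only a placeholder and not the patented term. The MSE here averages squared differences over pixels and then over the n joints, which is also an assumption about the normalization.

```python
def mse(h_a, h_b):
    """Mean squared error between two stacks of n joint heat maps.

    Each stack is a nested list with shape [n][H][W]. Squared differences are
    averaged per heat map, then averaged over the n joints (assumed normalization).
    """
    n = len(h_a)
    total = 0.0
    for map_a, map_b in zip(h_a, h_b):
        squared = [
            (a - b) ** 2
            for row_a, row_b in zip(map_a, map_b)
            for a, b in zip(row_a, row_b)
        ]
        total += sum(squared) / len(squared)
    return total / n


def total_loss(h_tea, h_stu, h_tar, lam, d_fn):
    """L_total = MSE(h_tea, h_stu) + lam * D(h_tar, h_stu), with D given as d_fn."""
    return mse(h_tea, h_stu) + lam * d_fn(h_tar, h_stu)


# Toy data: one joint, 2x2 heat maps (illustrative values only).
h_tea = [[[1.0, 0.0], [0.0, 0.0]]]
h_stu = [[[0.5, 0.0], [0.0, 0.0]]]
h_tar = [[[1.0, 0.0], [0.0, 0.0]]]

# Placeholder: plain MSE stands in for D, whose real definition is not shown here.
loss = total_loss(h_tea, h_stu, h_tar, lam=0.5, d_fn=mse)
```

Keeping D as a parameter separates the distillation term from the label term, so an offset-weighted distance can be substituted without changing the rest of the computation.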

9. A computer device, characterized by: comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 4.

10. A computer-readable storage medium, characterized in that: it stores a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 4.