CN117333937A - Human body posture estimation method and device based on classification and distillation and electronic equipment


Publication number
CN117333937A
Authority
CN
China
Prior art keywords
model
module
distillation
human body
lightweight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311227441.0A
Other languages
Chinese (zh)
Inventor
李观喜
苏鹏
张磊
梁倬华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ziweiyun Technology Co ltd
Original Assignee
Guangzhou Ziweiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Ziweiyun Technology Co ltd
Priority to CN202311227441.0A
Publication of CN117333937A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Psychiatry (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human body posture estimation method and device based on classification and distillation, and electronic equipment, and relates to the technical field of neural networks. The method comprises the following steps: acquiring an image to be identified; convolving the features of each pixel of the image to be identified with a lightweight neural network algorithm to obtain an output value; performing key point coordinate classification on the output value to obtain horizontal and vertical coordinates; improving the accuracy of the lightweight model by regression distillation; and identifying the horizontal and vertical coordinates with the accuracy-improved model to obtain the human body posture estimate. With regression distillation, the capability of a complex model can be transferred to a lightweight model, which improves the performance of the lightweight model, its detection of difficult actions and in harsh environments, and its robustness and generalization capability. Unlike feature distillation, this approach requires fewer modifications to the model structure, which shortens the time needed to put the model into production and makes model deployment and product application more convenient.

Description

Human body posture estimation method and device based on classification and distillation and electronic equipment
Technical Field
The invention relates to the technical field of neural networks, and in particular to a human body posture estimation method and device based on classification and distillation, and electronic equipment.
Background
2D human body posture estimation is one of the fundamental tasks of computer vision and an important research field. It refers to techniques that automatically identify and locate key points of the human body (e.g., head, arms, legs) in a 2D image, with the aim of inferring the posture and motion state of the human body from a still image. 2D human body posture estimation can be applied in many directions. In human-computer interaction, recognizing and tracking human posture enables more natural and intuitive interaction modes such as gesture recognition and body-motion control. In fitness training, it can automatically evaluate and correct the exerciser's posture, improving training effectiveness and safety. In augmented reality, it can track human posture and motion state in real time to provide a more realistic and natural experience. In short, 2D human body posture estimation has broad application prospects and commercial value, with many possible extensions and application scenarios in the future.
Recent studies on 2D pose estimation have achieved excellent performance on common benchmarks, but the techniques still perform poorly in complex environments. For example, under occlusion, insufficient illumination, or complex backgrounds, a 2D pose estimation model is prone to false detections and missed detections. In practical applications the models also have large parameter counts and high latency, which is unacceptable in practice.
Second, 2D pose estimation techniques still have problems with accuracy and efficiency. For example, in scenes that require real-time processing, the running speed and the accuracy of a 2D pose estimation algorithm conflict to some degree, and a trade-off must be made.
Practical products therefore often place very high requirements on both the speed and the accuracy of the model. To reach a wider market and more users, the 2D pose estimation technology often has to be deployed on embedded devices with limited computing resources while remaining both real-time and accurate.
Disclosure of Invention
Embodiments of the invention provide a human body posture estimation method and device based on classification and distillation, and electronic equipment. With regression distillation, the capability of a complex model can be transferred to a lightweight model, which improves the performance of the lightweight model, its detection of difficult actions and in harsh environments, and its robustness and generalization capability. Unlike feature distillation, this approach requires fewer modifications to the model structure, which shortens the time needed to put the model into production and makes model deployment and product application more convenient.
To achieve the above purpose, the invention adopts the following technical solutions:
In a first aspect, a human body posture estimation method based on classification and distillation is provided, applied to a human body posture estimation system. The method includes: acquiring an image to be identified; convolving the features of each pixel of the image to be identified with a lightweight neural network algorithm to obtain an output value; performing key point coordinate classification on the output value to obtain horizontal and vertical coordinates; improving the accuracy of the lightweight model by regression distillation; and identifying the horizontal and vertical coordinates with the accuracy-improved model to obtain the human body posture estimate.
With reference to the first aspect, in one possible design, before convolving the features of each pixel of the image to be identified with the lightweight neural network algorithm to obtain the output value, the method further includes: using MobileOne as the basic backbone network, wherein the MobileOneBlock uses structural re-parameterization, so that a high-accuracy model is obtained with a model structure that has a large number of parameters, and a model with a small number of parameters is then obtained through structural re-parameterization.
With reference to the first aspect, in one possible design, the MobileOneBlock using structural re-parameterization to obtain a high-accuracy model with a large-parameter model structure includes: during model training, the MobileOneBlock has two basic modules, an RDW module and an RPW module; the input value passes through the RDW module, then a ReLU activation function, and then the RPW module to obtain the output value.
With reference to the first aspect, in one possible design, the above training-time structure of the MobileOneBlock includes: the RDW module consists of a 1x1 pointwise convolution (PW), a 1x3 channel-by-channel convolution (DW), and a BN module; the input value passes through each of the three modules and the results are combined by a shortcut operation to obtain the output value. The RPW module consists of one 1x1 pointwise convolution (PW) and a BN module; the input value passes through each of them and the results are combined by a shortcut operation to obtain the output value.
With reference to the first aspect, in one possible implementation, convolving the features of each pixel of the image to be identified with the lightweight neural network algorithm to obtain an output value includes: at inference time, the RDW module and the RPW module are re-parameterized into a DW module and a PW module, respectively; pointwise convolution (PW) is a convolution with a 1x1 kernel; channel-by-channel convolution (DW) is a convolution performed on each input channel separately.
With reference to the first aspect, in one possible design, performing key point coordinate classification on the output value to obtain horizontal and vertical coordinates includes: the key point coordinate classification module has two FC layers; Y passes through the FC1 layer and the FC2 layer respectively; FC1 outputs the horizontal-axis coordinate classification information $O_x$ of the n key points, and FC2 outputs the vertical-axis coordinate classification information $O_y$ of the n key points.
With reference to the first aspect, in one possible design, the method further includes: calculating the predicted coordinates of the i-th key point from the coordinate classification information $O_x$ and $O_y$.
With reference to the first aspect, in one possible design, improving the accuracy of the lightweight model by regression distillation includes: taking the model based on the lightweight neural network algorithm module as the student model and a model based on a ResNet101 backbone module as the teacher model; and performing regression distillation on the student model using the teacher model.
In a second aspect, a human body posture estimation device based on classification and distillation is provided, applied to a human body posture estimation system. The device comprises: an image acquisition module, used for acquiring an image to be identified; a lightweight neural network algorithm module, used for convolving the features of each pixel of the image to be identified with the lightweight neural network algorithm to obtain an output value; a key point coordinate classification module, used for performing key point coordinate classification on the output value to obtain horizontal and vertical coordinates; and a knowledge distillation training strategy module, used for improving the accuracy of the lightweight model by regression distillation and for identifying the horizontal and vertical coordinates with the accuracy-improved model to obtain the human body posture estimate.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method of the first aspect.
According to the human body posture estimation method based on classification and distillation, regression distillation can be used to transfer the capability of a complex model to a lightweight model, which improves the performance of the lightweight model, its detection of difficult actions and in harsh environments, and its robustness and generalization capability. Unlike feature distillation, this approach requires fewer modifications to the model structure, which shortens the time needed to put the model into production and makes model deployment and product application more convenient.
Drawings
FIG. 1 shows a flowchart of the classification and distillation-based human body posture estimation method provided by embodiments of the present application;
FIG. 2 shows a block diagram of the lightweight neural network algorithm module and the key point coordinate classification module provided by embodiments of the present application;
FIG. 3 illustrates the re-parameterization process at inference time in an embodiment of the present application;
FIG. 4 shows a block diagram of a human body posture estimation device based on classification and distillation according to an embodiment of the present application;
FIG. 5 shows a block diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 400, human body posture estimation device based on classification and distillation; 2000, electronic device; 2001, processor; 2002, memory.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In embodiments of the invention, words such as "exemplary" and "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, such words are intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, "and/or" may mean both, or either one of the two.
In the embodiments of the present invention, "image" and "picture" may sometimes be used interchangeably; when the distinction is not emphasized, their meanings are consistent. Likewise, "of", "relevant", and "corresponding" may sometimes be used interchangeably; when the distinction is not emphasized, their meanings are consistent.
In the embodiments of the invention, a subscripted form such as $W_1$ may sometimes be written in a non-subscripted form such as W1; when the distinction is not emphasized, the intended meaning is the same.
According to recent research on 2D pose estimation, 2D human body posture estimation techniques fall into two main categories: deep learning-based methods and traditional computer vision methods. Traditional 2D human body posture estimation methods are generally based on hand-designed feature extraction algorithms and machine learning models; they are simpler and more efficient, but generally far less accurate and robust than deep learning-based methods.
Among deep learning-based methods, Heatmap and regression are two common approaches to 2D human body posture estimation, each with advantages and disadvantages. The Heatmap method converts human key point detection into a pixel-level classification problem and has better interpretability. It is also more robust than the regression method in complex scenarios such as occlusion and illumination changes. Its drawbacks are equally clear: the Heatmap method performs poorly on low-resolution pictures, and small, low-resolution key points are hard to detect accurately unless the resolution is higher. To improve accuracy, several upsampling layers are needed to restore the feature map from low to high resolution. Upsampling generally uses transposed convolution for better performance, but the computation is correspondingly larger; since the feature map output by the backbone network already has many channels, the cost of upsampling is very high. Additional post-processing is also required to reduce the quantization error caused by downscaling. As a result, the method may be unusable on embedded devices with limited computing resources.
The regression method can locate each key point accurately and handles small, low-resolution key points well. Because it regresses the key point coordinates directly, it can obtain more accurate coordinates than the Heatmap method. However, the regression method detects poorly in complex scenes, such as illumination changes and occlusion, and is far less stable than the Heatmap method.
To address the conflict between accuracy and speed, more efficient network structures and algorithms can be introduced. For example, lightweight network structures, attention mechanisms, and pruning reduce the computation and parameter count of the model and improve its running speed and real-time performance. In addition, the accuracy of the model may be improved by ensembling several different models or by methods such as transfer learning.
Based on the problems above, the invention is mainly directed to a real-time 2D human body posture estimation method based on knowledge distillation and coordinate classification for mobile and embedded devices. The traditional Heatmap method generates a Gaussian heat map from a 2D Gaussian distribution as the label, supervises the model output with it, and optimizes with an L2 loss. The Heatmap obtained this way is usually smaller than the original picture, so the coordinates obtained by argmax must be scaled back to the original picture, which introduces an unavoidable quantization error.
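The quantization error described above can be illustrated with a small one-dimensional sketch. It is not taken from the patent: the strides (1, 4, 8), the sample coordinate 101.3, and the function name `heatmap_roundtrip` are assumptions chosen only for illustration.

```python
def heatmap_roundtrip(x_true: float, stride: int) -> float:
    """Map a pixel coordinate to the nearest heatmap cell, then scale it back.

    `stride` is the (assumed) ratio between image width and heatmap width.
    """
    cell = round(x_true / stride)   # argmax location on the downsampled heatmap
    return float(cell * stride)     # coordinate recovered in the original image


# Recovered coordinate error for a few hypothetical downsampling strides.
errors = {s: abs(heatmap_roundtrip(101.3, s) - 101.3) for s in (1, 4, 8)}
for stride, error in errors.items():
    print(f"stride={stride}: error={error:.1f} px")
```

In this sketch the error is bounded by half the stride, so a smaller heatmap gives a coarser recovered coordinate.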
Therefore, the key point coordinate classification module converts human key point detection into a coordinate classification problem, namely classifying each pixel point and determining whether it belongs to a given key point. This approach significantly reduces error propagation and strengthens the detection of small targets. Compared with the conventional Heatmap method, error propagation is avoided because classification and localization are performed per key point, rather than learning and regressing all key points jointly in the same Heatmap. In actual tests, adding the key point coordinate classification module improves the model's ability to detect human key points, matching the accuracy of the Heatmap approach while running faster, so real-time detection can be achieved on embedded devices and the like.
For model compression, a lightweight backbone network with structural re-parameterization is used. This preserves model accuracy while improving inference speed; the re-parameterized model is significantly less complex, and its generalization capability is improved. For the training strategy, knowledge distillation is adopted. Because the key point coordinate classification module converts human key point detection into a coordinate classification problem, regression distillation can be used to transfer the capability of a complex model (such as a large deep neural network) to a lightweight model (such as a small neural network), which improves the performance of the lightweight model, its detection of difficult actions and in harsh environments, and its robustness and generalization capability. Unlike feature distillation, this approach requires fewer modifications to the model structure, which shortens the time needed to put the model into production and makes model deployment and product application more convenient.
Referring to fig. 1, fig. 1 shows a flowchart of a classification and distillation-based human body posture estimation method, specifically including steps S110 to S150.
Step S110: acquiring an image to be identified;
Step S120: convolving the features of each pixel of the image to be identified with a lightweight neural network algorithm to obtain an output value;
Step S130: performing key point coordinate classification on the output value to obtain horizontal and vertical coordinates;
Step S140: improving the accuracy of the lightweight model by regression distillation;
Step S150: identifying the horizontal and vertical coordinates with the accuracy-improved model to obtain the human body posture estimate.
In some embodiments, the basic architecture is human body posture estimation based on key point coordinate classification. It is combined with a lightweight neural network algorithm module to obtain a lightweight human body posture estimation model, and a knowledge distillation structure is built to further improve the performance of the model.
The image to be identified may be acquired by an image acquisition module, which can be any monocular camera. A lightweight model then convolves the features of each pixel of the image to be identified with the lightweight neural network algorithm to obtain an output value. The core of the lightweight model consists of the lightweight neural network algorithm module and the key point coordinate classification module; its structure is shown in Fig. 2.
Before step S110 is executed, the method further comprises using MobileOne as the basic backbone network. The MobileOneBlock uses structural re-parameterization, so that a high-accuracy model is obtained with a model structure that has a large number of parameters, and a model with a small number of parameters is then obtained through structural re-parameterization.
Specifically, referring to Fig. 2, the lightweight neural network algorithm module uses MobileOne as the basic backbone network. MobileOne has 4 MobileOneBlocks, which form the lightweight neural network algorithm module shown in Fig. 2. Because the MobileOneBlock uses structural re-parameterization, a high-accuracy model is trained with a large-parameter model structure and then converted into a small-parameter model, which greatly improves inference speed while maintaining the original accuracy.
In some embodiments, obtaining a high-accuracy model with a large-parameter model structure through structural re-parameterization of the MobileOneBlock may further include: during model training, the MobileOneBlock has two basic modules, an RDW module and an RPW module; the input value passes through the RDW module, then a ReLU activation function, and then the RPW module to obtain the output value.
The MobileOneBlock has two basic modules, an RDW module and an RPW module; the input value passes through the RDW module, then a ReLU activation function, and then the RPW module. The RDW module consists of a 1x1 pointwise convolution (PW), a 1x3 channel-by-channel convolution (DW), and a BN module; the input value passes through each of the three modules and the results are combined by a shortcut operation to obtain the output value. The RPW module consists of one 1x1 pointwise convolution (PW) and a BN module; the input value passes through each of them and the results are combined by a shortcut operation to obtain the output value. At inference time, the RDW module and the RPW module are re-parameterized into a DW module and a PW module, respectively.
In other words, the training-time structure of the MobileOneBlock may further include the following: the RDW module combines the outputs of its 1x1 pointwise convolution (PW), 1x3 channel-by-channel convolution (DW), and BN module through a shortcut operation, and the RPW module combines the outputs of its 1x1 pointwise convolution (PW) and BN module through a shortcut operation.
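The following minimal sketch shows how a training-time RPW module could be folded into a single pointwise convolution. It rests on assumptions not stated in the source: the branch outputs are summed, BN uses fixed inference-time statistics, and the computation is shown at one pixel so the 1x1 convolution reduces to a matrix-vector product. All function names and numeric values are hypothetical.

```python
import math


def bn_affine(gamma, beta, mean, var, eps=1e-5):
    """Express inference-time BN as a per-channel scale and shift."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift


def matvec(weight, x, bias):
    """1x1 convolution at a single pixel: weight is [C_out][C_in]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def rpw_train(x, weight, bias, scale, shift):
    """Training-time form (assumed): PW branch plus BN branch, summed."""
    pw = matvec(weight, x, bias)
    bn = [s * xi + t for s, xi, t in zip(scale, x, shift)]
    return [a + b for a, b in zip(pw, bn)]


def fuse_rpw(weight, bias, scale, shift):
    """Fold the BN branch into the PW weights: W' = W + diag(scale), b' = b + shift."""
    fused_w = [[w + (scale[i] if i == j else 0.0) for j, w in enumerate(row)]
               for i, row in enumerate(weight)]
    fused_b = [b + t for b, t in zip(bias, shift)]
    return fused_w, fused_b


# Toy two-channel example with made-up parameters.
weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
scale, shift = bn_affine(gamma=[1.5, 0.8], beta=[0.0, 0.3], mean=[0.2, -0.1], var=[1.0, 4.0])
fused_w, fused_b = fuse_rpw(weight, bias, scale, shift)

x = [1.0, -2.0]
y_train = rpw_train(x, weight, bias, scale, shift)   # two branches
y_infer = matvec(fused_w, x, fused_b)                # one fused PW convolution
```

The two outputs agree up to floating-point rounding, which is why the multi-branch structure can be used for training and the single fused convolution for inference.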
In some embodiments, step S120 further includes: at inference time, re-parameterizing the RDW module and the RPW module into a DW module and a PW module, respectively, where pointwise convolution is a convolution with a 1×1 kernel and channel-by-channel convolution is a convolution performed on each input channel separately.
Specifically, referring to Fig. 3, Fig. 3 illustrates the re-parameterization process at inference time in the embodiment of the present application. Pointwise convolution (PW) is a convolution with a 1×1 kernel: the features of each pixel are convolved on their own, which amounts to a fully connected operation across channels at each pixel. This convolution is used to adjust the number of feature channels; adjusting the channel count controls the complexity and computation of the model and, together with the activation that follows, adds nonlinearity. Channel-by-channel convolution (DW), instead of convolving the entire input with one shared kernel, convolves each input channel separately. It can be seen as using a separate kernel for each input channel, with the output of each channel independent of the others. Channel-by-channel convolution therefore does not change the number of channels; it only transforms the features within each channel.
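The two operations just described can be sketched in plain Python on a tiny feature map. This is an illustrative sketch only: the nested-list layout `[C][H][W]`, the absence of padding and bias, and the toy values are assumptions, and a real model would use a tensor library.

```python
def pointwise_conv(feature, weight):
    """1x1 convolution. feature: [C_in][H][W]; weight: [C_out][C_in].

    Every output pixel is a weighted mix of the input channels at that pixel.
    """
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    return [[[sum(weight[o][c] * feature[c][i][j] for c in range(c_in))
              for j in range(w)] for i in range(h)]
            for o in range(len(weight))]


def depthwise_conv(feature, kernels):
    """Channel-by-channel convolution. feature: [C][H][W]; kernels: [C][kh][kw].

    Each channel is filtered with its own kernel (no padding, stride 1).
    """
    out = []
    for channel, kernel in zip(feature, kernels):
        kh, kw = len(kernel), len(kernel[0])
        h, w = len(channel), len(channel[0])
        out.append([[sum(kernel[a][b] * channel[i + a][j + b]
                         for a in range(kh) for b in range(kw))
                     for j in range(w - kw + 1)] for i in range(h - kh + 1)])
    return out


feature = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # channel 0
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],   # channel 1
]
# PW: 2 input channels -> 3 output channels; spatial size unchanged.
pw = pointwise_conv(feature, weight=[[1, 0], [0, 1], [1, 1]])
# DW: one 3x3 all-ones kernel per channel; channel count unchanged.
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
dw = depthwise_conv(feature, kernels=[ones, ones])
```

Here `pw` has three channels of size 3x3 (the third is the sum of the two inputs), while `dw` keeps two channels, each reduced to a single value because no padding is used.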
In some embodiments, step S130 may further include: the key point coordinate classification module has two FC layers; Y passes through the FC1 layer and the FC2 layer respectively; FC1 outputs the horizontal-axis coordinate classification information $O_x$ of the n key points, and FC2 outputs the vertical-axis coordinate classification information $O_y$ of the n key points.
Specifically, the core idea of the key point coordinate classification module is to treat the human body posture estimation task as two classification tasks, one for the horizontal coordinate and one for the vertical coordinate, and to reduce quantization error by dividing each pixel into several bins. A feature map Z is obtained from the MobileOne backbone network. A rearrange layer first flattens Z from size (n, H, W) to (n, H×W), giving a new feature map Y, where n is the number of key points, H is the height of Z, and W is the width of Z. The key point coordinate classification module has two FC layers; Y passes through the FC1 layer and the FC2 layer respectively; FC1 outputs the horizontal-axis coordinate classification information $O_x$ of the n key points, and FC2 outputs the vertical-axis coordinate classification information $O_y$ of the n key points.
The predicted coordinates of the i-th key point are calculated according to Equation 1.
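The classification head and the decoding step can be sketched as follows. Equation 1 itself is not reproduced in this text, so the decoding rule used here (take the highest-scoring bin on each axis and divide by the split factor k) is an assumption in the style of coordinate-classification methods, not a statement of the patent's formula. The FC weights are toy values rather than trained parameters, and the bias terms are omitted.

```python
def flatten(z):
    """Rearrange step: (n, H, W) -> (n, H*W)."""
    return [[v for row in keypoint_map for v in row] for keypoint_map in z]


def linear(y, weight):
    """One FC layer without bias, applied per key point: (n, H*W) -> (n, bins)."""
    return [[sum(w * v for w, v in zip(w_row, row)) for w_row in weight] for row in y]


def decode(o_x, o_y, k):
    """Assumed decoding: index of the best bin on each axis, divided by k."""
    coords = []
    for row_x, row_y in zip(o_x, o_y):
        bin_x = max(range(len(row_x)), key=row_x.__getitem__)
        bin_y = max(range(len(row_y)), key=row_y.__getitem__)
        coords.append((bin_x / k, bin_y / k))
    return coords


# One key point (n=1), H=W=2, split factor k=2, so each axis has 4 bins.
k = 2
z = [[[0.1, 0.7], [0.2, 0.0]]]
y = flatten(z)                                   # shape (1, 4)
fc1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # toy FC1 weights
fc2 = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]   # toy FC2 weights
o_x = linear(y, fc1)                             # horizontal-axis scores
o_y = linear(y, fc2)                             # vertical-axis scores
coords = decode(o_x, o_y, k)
```

With these toy weights the best horizontal bin is 1 and the best vertical bin is 2, which decode to the sub-pixel position (0.5, 1.0).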
To make classification possible, the data must be processed before model training: each continuous coordinate value is uniformly discretized into an integer that serves as the class label for training, so that the loss against the predicted coordinates can be computed. Here $N_x = W \cdot k$ and $N_y = H \cdot k$ are the numbers of bins on the horizontal and vertical axes, respectively; k is the split factor, set to $k \ge 1$ to reduce quantization error and thereby achieve sub-pixel localization accuracy; and $c_x$ and $c_y$ are the label coordinates of the key point. KL divergence is used as the loss function $L_{hard}$, computed between the target label and the predicted output.
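A minimal sketch of the label preparation and the hard loss follows. The discretization formula is not reproduced in this text, so rounding `c * k` to the nearest bin (clamped to the valid range) is an assumption, as are the one-hot target, the softmax over bin scores, and all names and values used below.

```python
import math


def discretize(c_x, c_y, width, height, k):
    """Assumed discretization: nearest of N_x = W*k (or N_y = H*k) bins."""
    label_x = min(max(round(c_x * k), 0), width * k - 1)
    label_y = min(max(round(c_y * k), 0), height * k - 1)
    return label_x, label_y


def one_hot(index, num_bins):
    return [1.0 if i == index else 0.0 for i in range(num_bins)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target probability contribute 0."""
    return sum(t * math.log(t / max(p, eps)) for t, p in zip(target, predicted) if t > 0.0)


# W=4, H=3, k=2 gives N_x=8 and N_y=6 bins.
label_x, label_y = discretize(1.3, 2.2, width=4, height=3, k=2)
target_x = one_hot(label_x, 8)

good_logits = [5.0 if i == label_x else 0.0 for i in range(8)]   # peak on the label
bad_logits = [5.0 if i == 0 else 0.0 for i in range(8)]          # peak elsewhere
loss_good = kl_divergence(target_x, softmax(good_logits))
loss_bad = kl_divergence(target_x, softmax(bad_logits))
```

With k=2 the coordinate 1.3 is stored as bin 3 (that is, 1.5 pixels), so the label error is at most a quarter pixel; a prediction that peaks on the labelled bin yields the smaller loss.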
In some embodiments, when step S140 is executed, the model based on the lightweight neural network algorithm module may be used as the student model, and a model based on a general ResNet101 backbone module may be selected as the teacher model; regression distillation is then performed on the student model using the teacher model.
Specifically, because the core idea of the decoupled coordinate representation module is to treat the human body posture estimation task as two classification tasks for the horizontal and vertical coordinates, regression distillation can be used to improve the accuracy of the lightweight model. Therefore, the model based on the lightweight neural network algorithm module is used as the student model, and a model based on the ResNet101 backbone module is selected as the teacher model. The teacher model must first be trained to obtain a high-precision model. The teacher model is then used to perform regression distillation on the student model; the structure is shown in FIG. 3.
The teacher model output is P_T and the student model output is P_S. So that the student model can fit the output distribution of the teacher model, the teacher model output is used as a soft label in a loss calculation against the student model output, as shown in Equation 2.
Here P_T^j is the j-th category output of P_T, P_S^j is the j-th category output of P_S, and N is the total number of labels.
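Equation 2 is not reproduced in this text. One common way to use a teacher output as a soft label is the KL divergence between the teacher and student distributions summed over the N categories; the sketch below assumes that form and should not be read as the applicant's exact formula. The function names and the sample scores are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw bin scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_scores, student_scores, tiny=1e-12):
    """Assumed distillation loss: sum over j of P_T^j * log(P_T^j / P_S^j)."""
    p_t = softmax(teacher_scores)
    p_s = softmax(student_scores)
    return sum(t * math.log((t + tiny) / (s + tiny)) for t, s in zip(p_t, p_s))

teacher = [2.0, 1.0, 0.0]   # teacher favours the first bin
student = [0.0, 1.0, 2.0]   # student favours the last bin
l_soft = soft_label_loss(teacher, student)
l_soft_matched = soft_label_loss(teacher, teacher)
```

The loss is zero when the student reproduces the teacher distribution and grows as the two distributions diverge, which is the fitting behaviour the description asks for.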
Because knowledge distillation is used, the training loss of the student model combines the knowledge distillation loss with the loss of the original detection model, as shown in Equation 3.
L_loss = α·L_soft + (1 − α)·L_hard (Equation 3)
Here L_soft is the distillation loss and L_hard is the key point coordinate classification loss for the human body key points.
In the implementation provided by the present application, using the MobileOne-based lightweight neural network algorithm module allows the model to meet the speed and accuracy requirements of deployment and of products. Using the key point coordinate classification module gives the model higher accuracy, faster inference and lower computational complexity. Finally, combining a knowledge distillation training strategy further improves the accuracy of the lightweight human body posture estimation model, improves detection performance for difficult actions or in harsh environments, and improves the robustness and generalization ability of the model.
The classification and distillation-based human body posture estimation method provided by the embodiments of the present application has been described in detail above with reference to FIGS. 1, 2 and 3; the classification and distillation-based human body posture estimation apparatus provided by the embodiments of the present application is described in detail below.
As shown in FIG. 4, the classification and distillation-based human body posture estimation apparatus 400 includes: an image acquisition module 410, a lightweight neural network algorithm module 420, a key point coordinate classification module 430, and a knowledge distillation training strategy module 440.
For ease of illustration, FIG. 4 shows only the main components of the classification and distillation-based human body posture estimation apparatus 400.
The image acquisition module 410 is configured to acquire an image to be identified.
The lightweight neural network algorithm module 420 is configured to convolve the feature of each pixel of the image to be identified based on the lightweight neural network algorithm to obtain an output value.
Optionally, the lightweight neural network algorithm module 420 may also be configured to employ MobileOne as the basic backbone network; the MobileOneBlock uses a structural re-parameterization technique, whereby a high-precision model is obtained with a model structure having a large number of parameters, and a model with a small number of parameters is then obtained through structural re-parameterization.
Optionally, in the lightweight neural network algorithm module 420, during model training the MobileOneBlock has two basic modules, namely an RDW module and an RPW module; the input value passes through the RDW module, then a ReLU activation function, and then the RPW module to obtain the output value.
Optionally, in the lightweight neural network algorithm module 420, the RDW module is composed of a 1x1 point-by-point convolution (PW), a 1x 3 channel-by-channel convolution (DW), and a BN module; the input value passes through each of the three modules and a shortcut operation is performed to obtain the output value. The RPW module is composed of one 1x1 point-by-point convolution (PW) and a BN module; the input value passes through each of them and a shortcut operation is performed to obtain the output value.
Optionally, the lightweight neural network algorithm module 420 may be further configured to re-parameterize the RDW module and the RPW module during model inference, so that they correspond to a DW module and a PW module, respectively; point-by-point convolution (PW) is a convolution operation with a convolution kernel size of 1x1; channel-by-channel convolution (DW) is a convolution operation performed on each input channel separately.
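The two convolution types named here can be illustrated with a minimal pure-Python sketch. It is only meant to show how the operations differ, not how the modules are implemented; the 3x3 kernel size, the zero padding, and the function names are assumptions.

```python
def pointwise_conv(x, weight):
    """1x1 convolution: mixes channels independently at every pixel.

    x: (C_in, H, W); weight: (C_out, C_in). Returns (C_out, H, W).
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][c] * x[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(weight))]

def depthwise_conv(x, kernels):
    """Channel-by-channel convolution with zero padding and stride 1.

    x: (C, H, W); kernels: (C, K, K) with odd K. Returns (C, H, W).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w, pad = len(channel), len(channel[0]), len(kernel) // 2
        plane = []
        for i in range(h):
            row = []
            for j in range(w):
                acc = 0.0
                for di, k_row in enumerate(kernel):
                    for dj, k_val in enumerate(k_row):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:  # skip padded positions
                            acc += k_val * channel[ii][jj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

x = [[[1.0, 2.0], [3.0, 4.0]],        # channel 0
     [[10.0, 20.0], [30.0, 40.0]]]    # channel 1
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box_kernel = [[1.0] * 3 for _ in range(3)]

dw_out = depthwise_conv(x, [identity_kernel, box_kernel])  # one kernel per channel
pw_out = pointwise_conv(x, [[1.0, 1.0]])                   # sums the two channels
```

The depthwise call never mixes channel 0 with channel 1, while the pointwise call mixes channels but never looks at neighbouring pixels; stacking the two gives the separable structure that the RDW and RPW modules build on.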
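Structural re-parameterization relies on the fact that a convolution, a BN layer with fixed statistics, and a shortcut are all linear and can be merged into one convolution. The sketch below shows this for a simplified pointwise branch at a single pixel: a bias-free 1x1 convolution followed by BN, plus an identity shortcut. The branch layout, the parameter values, and the function names are assumptions for illustration and are not taken from the application.

```python
import math

def matvec(weight, bias, x):
    """1x1 convolution at one pixel: a matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN using fixed per-channel statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(x, gamma, beta, mean, var)]

def fold_bn(weight, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into a bias-free 1x1 convolution."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    fused_w = [[sc * w for w in row] for sc, row in zip(scale, weight)]
    fused_b = [b - sc * m for b, sc, m in zip(beta, scale, mean)]
    return fused_w, fused_b

def add_identity(weight):
    """Merge an identity shortcut into a square 1x1 kernel."""
    return [[w + (1.0 if i == j else 0.0) for j, w in enumerate(row)]
            for i, row in enumerate(weight)]

# Hypothetical 2-channel example at a single pixel.
weight = [[0.5, -1.0], [2.0, 0.25]]
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0]

# Multi-branch form: convolution, then BN, plus the identity shortcut.
branch = batch_norm(matvec(weight, [0.0, 0.0], x), gamma, beta, mean, var)
train_out = [b + v for b, v in zip(branch, x)]

# Re-parameterized form: a single fused 1x1 convolution.
fused_w, fused_b = fold_bn(weight, gamma, beta, mean, var)
infer_out = matvec(add_identity(fused_w), fused_b, x)
```

Both forms produce the same output, so the multi-branch structure can be used for training and the single fused convolution for inference. The fold is only valid with fixed BN statistics, which is why it is applied after training rather than during it.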
The key point coordinate classification module 430 is configured to classify the key point coordinates of the output values, and obtain horizontal and vertical coordinates.
Optionally, the key point coordinate classification module 430 may have two FC layers: Y passes through the FC1 layer and the FC2 layer separately, FC1 outputs the horizontal-axis coordinate classification information O_x of the n key points, and FC2 outputs the vertical-axis coordinate classification information O_y of the n key points.
Optionally, the key point coordinate classification module 430 may also be configured to calculate the predicted coordinates of the i-th key point, as given by Equation 1 of the method embodiment.
A knowledge distillation training strategy module 440 for improving the accuracy of the lightweight model using a regression distillation approach; and identifying the horizontal and vertical coordinates based on the model with improved accuracy to obtain human body posture estimation.
Optionally, the knowledge distillation training strategy module 440 may be further configured to use the model based on the lightweight neural network algorithm module as the student model, and to select a model based on a general ResNet101 backbone module as the teacher model; regression distillation is performed on the student model using the teacher model.
In addition, for the technical effects of the classification and distillation-based human body posture estimation apparatus 400, reference may be made to the technical effects of any of the foregoing methods, which are not described again here.
Optionally, the embodiment of the present invention further provides a computer readable storage medium, which comprises a computer program or instructions which, when run on a computer, cause the method provided by any embodiment of the present invention to be performed.
Optionally, the embodiment of the invention further provides an electronic device, which is used for executing the method provided by any embodiment of the invention.
As shown in fig. 5, the electronic device 2000 may include a processor 2001.
Optionally, the electronic device 2000 may also include memory 2002.
The processor 2001 is coupled to the memory 2002, for example, by a communication bus.
The following describes the respective constituent elements of the electronic device 2000 in detail with reference to fig. 5:
The processor 2001 is the control center of the electronic device 2000, and may be a single processor or a collective term for a plurality of processing elements. For example, the processor 2001 may be one or more central processing units (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, such as one or more digital signal processors (DSP) or one or more field programmable gate arrays (FPGA).
Alternatively, the processor 2001 may perform various functions of the electronic device 2000 by running or executing software programs stored in the memory 2002, and invoking data stored in the memory 2002.
In a particular implementation, the processor 2001 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 5, as an example.
The memory 2002 is used for storing a software program for executing the solution of the present invention, and is controlled by the processor 2001 to execute the solution, and the specific implementation may refer to the above method embodiment, which is not described herein again.
Optionally, the memory 2002 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 2002 may be integrated with the processor 2001, or may exist separately and be coupled to the processor 2001 through interface circuitry of the electronic device 2000 (not shown in FIG. 5); the embodiments of the present invention do not specifically limit this.
It should be noted that the structure of the electronic device 2000 illustrated in FIG. 5 does not constitute a limitation on the electronic device; an actual electronic device may include more or fewer components than illustrated, may combine some components, or may use a different arrangement of components.
In addition, the technical effects of the electronic device 2000 may refer to the technical effects of the method described in the above method embodiments, which are not described herein.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely a specific embodiment of the present invention, and the protection scope of the present invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A classification and distillation-based human body posture estimation method, characterized by being applied to a human body posture estimation system, the method comprising:
acquiring an image to be identified;
convolving the characteristics of each pixel point of the image to be identified based on a lightweight neural network algorithm to obtain an output value;
carrying out key point coordinate classification on the output value to obtain horizontal and vertical coordinates;
improving the accuracy of the lightweight model by using regression distillation;
and identifying the horizontal and vertical coordinates based on the model with improved accuracy to obtain human body posture estimation.
2. The method according to claim 1, wherein, before convolving the features of each pixel point of the image to be identified based on the lightweight neural network algorithm to obtain the output value, the method further comprises:
using MobileOne as the basic backbone network;
the MobileOneBlock uses a structural re-parameterization technique, and a high-precision model is obtained with a model structure having a large number of parameters;
and obtaining a model with a small number of parameters through the structural re-parameterization technique.
3. The method according to claim 2, wherein the MobileOneBlock using a structural re-parameterization technique to obtain a high-precision model with a model structure having a large number of parameters comprises:
during model training, the MobileOneBlock has two basic modules, namely an RDW module and an RPW module, and the input value passes through the RDW module, then a ReLU activation function, and then the RPW module to obtain the output value.
4. The method according to claim 3, wherein the MobileOneBlock having, during model training, two basic modules, namely an RDW module and an RPW module, with the input value passing through the RDW module, then a ReLU activation function, and then the RPW module to obtain the output value, comprises:
the RDW module is composed of a 1x1 point-by-point convolution (PW), a 1x 3 channel-by-channel convolution (DW) and a BN module, and the input value passes through each of the three modules and a shortcut operation is performed to obtain the output value;
the RPW module is composed of one 1x1 point-by-point convolution (PW) and a BN module, and the input value passes through each of them and a shortcut operation is performed to obtain the output value.
5. The method according to claim 1, wherein the convolving the features of each pixel of the image to be identified based on the lightweight neural network algorithm to obtain an output value comprises:
during model inference, the RDW module and the RPW module are re-parameterized so that they correspond to a DW module and a PW module, respectively;
the point-by-point convolution is a convolution operation with a convolution kernel size of 1×1;
the channel-by-channel convolution is a convolution operation performed on each input channel separately.
6. The method of claim 1, wherein said classifying the output values for keypoint coordinates to obtain horizontal and vertical coordinates comprises:
the key point coordinate classification module has two FC layers, Y passes through the FC1 layer and the FC2 layer separately, FC1 outputs the horizontal-axis coordinate classification information O_x of the n key points, and FC2 outputs the vertical-axis coordinate classification information O_y of the n key points.
7. The method of claim 6, wherein the method further comprises:
the predicted coordinates of the i-th key point are calculated as follows:
8. The method of claim 1, wherein the improving the accuracy of the lightweight model by using regression distillation comprises:
the model based on the lightweight neural network algorithm module is used as the student model, and a model based on a ResNet101 backbone module is selected as the teacher model;
regression distillation is performed on the student model using the teacher model.
9. A classification and distillation-based human body posture estimation device, characterized by being applied to a human body posture estimation system, the device comprising:
the image acquisition module is used for acquiring an image to be identified;
the lightweight neural network algorithm module is used for convoluting the characteristics of each pixel point of the image to be identified based on the lightweight neural network algorithm to obtain an output value;
the key point coordinate classification module is used for classifying the key point coordinates of the output value to obtain horizontal and vertical coordinates;
the knowledge distillation training strategy module is used for improving the precision of the lightweight model by using a regression distillation mode; and identifying the horizontal and vertical coordinates based on the model with improved accuracy to obtain human body posture estimation.
10. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-8.
CN202311227441.0A 2023-09-21 2023-09-21 Human body posture estimation method and device based on classification and distillation and electronic equipment Pending CN117333937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311227441.0A CN117333937A (en) 2023-09-21 2023-09-21 Human body posture estimation method and device based on classification and distillation and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311227441.0A CN117333937A (en) 2023-09-21 2023-09-21 Human body posture estimation method and device based on classification and distillation and electronic equipment

Publications (1)

Publication Number Publication Date
CN117333937A true CN117333937A (en) 2024-01-02

Family

ID=89274741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311227441.0A Pending CN117333937A (en) 2023-09-21 2023-09-21 Human body posture estimation method and device based on classification and distillation and electronic equipment

Country Status (1)

Country Link
CN (1) CN117333937A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination