CN116543417A - Human body posture estimation method, device, equipment and storage medium - Google Patents

Human body posture estimation method, device, equipment and storage medium

Info

Publication number
CN116543417A
Authority
CN
China
Prior art keywords
human body
body posture
image
frame
estimated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310492398.4A
Other languages
Chinese (zh)
Inventor
张睿
董志学
邹游
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Telian Qizhi Technology Co ltd
Original Assignee
Chongqing Telian Qizhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Telian Qizhi Technology Co ltd filed Critical Chongqing Telian Qizhi Technology Co ltd
Priority to CN202310492398.4A priority Critical patent/CN116543417A/en
Publication of CN116543417A publication Critical patent/CN116543417A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the disclosure provide a human body posture estimation method, device, equipment and storage medium, applied to the technical field of artificial intelligence. The method comprises: obtaining a human body posture image to be estimated and a corresponding image number; performing a modulo operation on the image number and a preset interval step number to obtain a residual value (remainder); if the residual value is not zero, adopting a ByteTrack tracker prediction frame as the target cutting frame; if the residual value is zero, inputting the human body posture image to be estimated to a pedestrian target detector and updating the ByteTrack tracker prediction frame to obtain the target cutting frame; cutting the human body posture image to be estimated according to the target cutting frame to obtain a pedestrian target image set; inputting the pedestrian target image set to a pre-trained human body posture estimation model and outputting a corresponding human body key point prediction result; and clustering the human body key point prediction results to obtain a human body posture estimation result. In this way, a large amount of computation can be saved, and human body posture estimation accuracy can be improved while overall estimation efficiency is improved.

Description

Human body posture estimation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a human body posture estimation method, apparatus, device, and storage medium.
Background
Human body posture estimation is the process of obtaining the specific positions of the joints of a human body from image or video information. It is an important task in computer vision and an essential step for a computer to understand human actions and behaviors. Human body posture estimation can be classified into 2D posture estimation, which predicts a two-dimensional coordinate for each key point, and 3D posture estimation, which predicts a three-dimensional coordinate for each key point. It can be applied in fields such as human-computer interaction, video surveillance, and virtual reality. Human body posture estimation is currently divided mainly into top-down and bottom-up paradigms. In the top-down paradigm, a human body target detector detects pedestrians in an image to obtain pedestrian target frames, the pedestrians are cut out of the image based on those frames, and each cut-out pedestrian is input into a single-target human body posture estimation model for posture estimation. In the bottom-up paradigm, the image is input directly into a human body posture estimation model to obtain the predicted positions of all human body key points in the image, and the key points are then assigned to individual pedestrian targets according to a key point grouping strategy (one person per group). Their respective advantages are as follows: the top-down approach handles human bodies of different sizes better and is more suitable for single-person or non-crowded scenes, while the bottom-up approach handles occlusion in dense scenes better and is more suitable for multi-person or large-scale scenes. Their respective disadvantages are also evident: the top-down approach depends on the performance of the target detector, and its amount of computation rises as the number of people increases.
The bottom-up approach does not rely on a target detector and has a relatively stable computational cost, but it requires an efficient key point grouping strategy to avoid incorrectly combining key points of different people, which may require additional computation or post-processing.
On the basis of the two mainstream paradigms, methods can be further divided into heatmap-based (thermodynamic-diagram-based) learning methods and regression-based learning methods. A model trained with the heatmap-based method learns a Gaussian probability distribution map; once the map is obtained, the position of its maximum point is taken as the estimation result (a discrete value). Because learning a Gaussian probability distribution map can be regarded as learning a filter, and convolutional networks, whose convolutions can themselves be regarded as filtering, are widely used in computer vision tasks, the heatmap-based method matches convolutional networks well and reduces the learning difficulty of the model. As a result, heatmap-based models are more robust and more accurate than regression-based ones. However, the final prediction of such a model is not the human key point position but a Gaussian probability distribution map, from which the key point positions can only be obtained indirectly. The model must therefore maintain a relatively high-resolution heatmap (typically 64x64); otherwise, a large quantization error arises when the key point positions obtained from the map are mapped back to original image coordinates.
The regression-based learning method has the regression model learn a continuous target distribution, from which the human key point positions can be predicted directly. No Gaussian probability distribution map needs to be computed, and neither argmax (arguments of the maxima) nor Soft-argmax (Integral Pose Regression) is needed to obtain position information indirectly. Compared with the heatmap-based learning method, the regression-based learning method therefore has a simpler calculation flow, faster computation, and lower computing resource consumption; and because its output is a variable of a continuous target distribution, it avoids the quantization error problem.
In general, the regression-based learning method has more desirable properties than the heatmap-based learning method, but its accuracy and robustness are inferior. The main reasons are that effective supervision constraints are lacking during model learning, the target distribution fitting process is unstable and easily affected by the training data, and the model is harder to train than with the heatmap-based learning method. Therefore, in both the top-down and bottom-up paradigms, the heatmap-based learning method is usually preferred, and it is used within the framework of the paradigm to guide model learning and obtain a final model that solves the human body posture estimation problem.
Among currently known schemes, there have been attempts to apply some characteristics of the regression-based learning method to the heatmap-based learning method. These schemes adopt the Soft-argmax method, which converts the discrete probabilities into an expectation so that continuous spatial coordinates can be regressed; the output changes from discrete to continuous, which alleviates the quantization error problem. Compared with earlier schemes, they can therefore further reduce the resolution of the Gaussian probability distribution map, reducing computing resources and improving estimation accuracy. However, the Soft-argmax method still requires a Gaussian probability distribution map, cannot predict key point positions directly, and does not fundamentally solve the quantization error problem. In addition, these schemes do not propose a systematic framework to address the drawbacks of the top-down paradigm, such as excessive dependence on detector accuracy, slow processing speed, and difficulty of deployment on devices with limited computing resources.
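The difference between reading a key point position with argmax and with Soft-argmax can be illustrated with a one-dimensional toy example. The following is a minimal sketch, not code from the disclosure: the function names, the temperature parameter beta, and the synthetic Gaussian map are all assumptions made for illustration.

```python
import math

def argmax_1d(heatmap):
    """Index of the largest value: a discrete coordinate on the map grid."""
    return max(range(len(heatmap)), key=lambda i: heatmap[i])

def soft_argmax_1d(heatmap, beta=1.0):
    """Expected index under softmax(beta * heatmap): a continuous coordinate."""
    peak = max(heatmap)
    # Subtracting the peak before exponentiating keeps the softmax numerically stable.
    weights = [math.exp(beta * (value - peak)) for value in heatmap]
    total = sum(weights)
    return sum(index * weight for index, weight in enumerate(weights)) / total

def gaussian_map(center, size, sigma=1.0):
    """Synthetic 1D Gaussian map whose true peak may lie between grid cells."""
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(size)]

# The true key point sits at 3.4, between grid cells 3 and 4 (illustrative value).
true_position = 3.4
heatmap = gaussian_map(center=true_position, size=8)

discrete = argmax_1d(heatmap)                     # snaps to a grid cell
continuous = soft_argmax_1d(heatmap, beta=10.0)   # lands between grid cells
```

Here the argmax estimate snaps to grid cell 3, while the Soft-argmax estimate falls between cells 3 and 4 and is closer to the true position. Both estimates are still read from a map, which is the limitation noted above.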
Disclosure of Invention
The present disclosure provides a human body posture estimation method, apparatus, device, and storage medium.
According to a first aspect of the present disclosure, a human body posture estimation method is provided. The method comprises the following steps:
acquiring a human body posture image to be estimated and a corresponding image number; the human body posture image to be estimated is any frame of image in the human body posture video stream to be estimated;
performing a modulo operation on the image number and a preset interval step number to obtain a residual value;
if the residual value is not zero, adopting a ByteTrack tracker prediction frame as a target cutting frame;
if the residual value is zero, inputting the human body posture image to be estimated to a pedestrian target detector to obtain a pedestrian target frame, and updating a ByteTrack tracker prediction frame to obtain a target cutting frame;
cutting the human body posture image to be estimated according to the target cutting frame to obtain a pedestrian target image set; wherein the pedestrian target image set includes a single or multiple pedestrian targets;
inputting the pedestrian target image set into a pre-trained human body posture estimation model, and outputting a corresponding human body key point prediction result;
and clustering the human body key point prediction results to obtain human body posture estimation results.
Further, the human body posture estimation model comprises a backbone network, a regression model and a flow model, and the training process of the human body posture estimation model comprises the following steps:
inputting the image data set into a backbone network to finish image feature extraction and obtain a feature map;
converting the feature map into the input of a regression model through global average pooling, and inputting the converted feature map into the regression model to complete regression calculation so as to obtain a regression value;
calculating a labeling offset based on a probability density distribution function according to the regression value, and inputting the labeling offset into the flow model to complete probability distribution calculation, so as to obtain a residual log-likelihood distribution function and a standard Gaussian distribution function;
calculating a loss function of the human body posture estimation model according to the regression value, the residual error log-likelihood distribution function and the standard Gaussian distribution function, and updating the human body posture estimation model according to a back propagation mechanism;
and repeating the above steps until a preset number of training rounds is reached, thereby completing training of the human body posture estimation model.
Further, the generating process of the image dataset includes:
acquiring a historical human body posture image set;
labeling key points of the human body on the historical human body posture image set;
and taking the marked human body posture image set as an image data set.
Further, after the training of the human body posture estimation model is completed, the method further comprises:
and carrying out structural re-parameterization on the backbone network after training, and updating the human body posture estimation model.
Further, the calculating the labeling offset based on the probability density distribution function according to the regression value includes:
and calculating the offset between the labeled human body key points and the real human body key points according to the labeled human body key points and the regression value to obtain the labeling offset.
Further, the calculation formula of the loss function of the human body posture estimation model is as follows:
[Formula image not reproduced in the source text.]
wherein the symbols in the formula denote, respectively, the standard Gaussian distribution function, the residual log-likelihood distribution function, the labeling offset, and the regression value, and C denotes a constant.
According to a second aspect of the present disclosure, a human body posture estimation apparatus is provided. The device comprises:
the image acquisition module is used for acquiring a human body posture image to be estimated and a corresponding image number; the human body posture image to be estimated is any frame of image in the human body posture video stream to be estimated;
the modulo operation module is used for performing a modulo operation on the image number and the preset interval step number to obtain a residual value;
the first judging module is used for adopting a ByteTrack tracker prediction frame as a target cutting frame if the residual value is not zero;
the second judging module is used for inputting the human body posture image to be estimated to a pedestrian target detector if the residual value is zero to obtain a pedestrian target frame, and updating the predicted frame of the ByteTrack tracker to obtain a target cutting frame;
the image clipping module is used for clipping the human body posture image to be estimated according to the target clipping frame to obtain a pedestrian target image set; wherein the pedestrian target image set includes a single or multiple pedestrian targets;
the key point prediction module is used for inputting the pedestrian target image set into a pre-trained human body posture estimation model and outputting a corresponding human body key point prediction result;
and the posture estimation module is used for clustering the human body key point prediction results to obtain human body posture estimation results.
According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method as described above when executing the program.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method according to the first aspect of the present disclosure.
According to the human body posture estimation method, device, equipment and storage medium of the present disclosure, a residual value is obtained by performing a modulo operation on the image number and the preset interval step number, and whether to update the ByteTrack tracker prediction frame (the target cutting frame) is determined from that residual value. When the residual value is not zero, the ByteTrack tracker prediction frame is used directly to cut the image, which improves human body posture estimation efficiency. Moreover, introducing the ByteTrack tracker alleviates the excessive dependence on detector accuracy during image cutting without adding extra computation. In addition, the specific design of the framework and loss function of the human body posture estimation model saves a large amount of computation, improving human body posture estimation accuracy while improving overall estimation efficiency.
It should be understood that what is described in this summary is not intended to identify key or essential features of the embodiments of the disclosure, nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. For a better understanding of the present disclosure, and without limiting the disclosure thereto, the same or similar reference numerals denote the same or similar elements, wherein:
FIG. 1 illustrates a flow chart of a human body pose estimation method according to an embodiment of the present disclosure;
FIG. 2 shows a flowchart of a human body posture estimation method according to a further embodiment of the present disclosure;
FIG. 3 shows a flowchart of a human body posture estimation method according to a further embodiment of the present disclosure;
FIG. 4 shows a block diagram of a human body posture estimation apparatus according to an embodiment of the present disclosure;
FIG. 5 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments in this disclosure without inventive faculty, are intended to be within the scope of this disclosure.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects before and after it.
In the present disclosure, human body posture is estimated in real time based on an offset-distribution regression model. In the training stage, a flow model and residual log-likelihood estimation are introduced, and fitting the target distribution is split into two tasks: learning "the distribution of the offset between the labeled human body key points and the real human body key points" and learning "the translation and scaling of the offset distribution". This reduces the learning difficulty of the regression model and allows it to replace a heatmap-based model, which preserves algorithm accuracy while reducing computing resource consumption and improving calculation efficiency during top-down paradigm inference. In the inference stage, structural re-parameterization and a ByteTrack tracker are introduced, which further improves the calculation efficiency of the top-down paradigm and reduces resource consumption, so that the scheme can replace lower-precision bottom-up human body posture estimation schemes in application scenes that are not crowded with multiple people.
Fig. 1 shows a flowchart of a human body pose estimation method 100 according to an embodiment of the present disclosure. The method 100 comprises the following steps:
step 110, acquiring a human body posture image to be estimated and a corresponding image number.
The human body posture image to be estimated is any frame of image in the human body posture video stream to be estimated.
In some embodiments, when human body posture estimation is required, a human body posture video stream to be estimated is acquired, any one frame of image is selected from the video stream, and the number of the selected frame is acquired; for example, the selected frame is the i-th frame image in the video stream. The method can estimate human body posture in a recorded video stream and can also estimate it in real time for a live video stream.
Step 120, performing a modulo operation on the image number and the preset interval step number to obtain a residual value.
In some embodiments, the residual value is obtained by performing a modulo operation, i.e., mod(i, s), on the image number i obtained in step 110 and the preset interval step number s. For example, if i is 4 (the 4th frame image in the video stream) and s is 2, the residual value is zero; if i is 5, the residual value is 1; and so on.
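The worked example above can be reproduced in a few lines. This is a minimal sketch; the function name needs_detection and the variable names are hypothetical and do not come from the disclosure.

```python
def needs_detection(frame_number: int, interval: int) -> bool:
    """True when mod(frame_number, interval) is zero, i.e. the detector should run."""
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    return frame_number % interval == 0

# Frames 4..7 with a preset interval step number s = 2, as in the example above.
schedule = {
    i: ("detector" if needs_detection(i, 2) else "tracker")
    for i in range(4, 8)
}
```

With s = 2 the detector runs on every second frame; more generally it runs on one frame in every s, and the remaining frames rely on the tracker prediction.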
Step 130, if the residual value is not zero, adopting a ByteTrack tracker prediction frame as a target cutting frame.
In some embodiments, it is determined whether the residual value calculated in step 120 is zero. If the residual value is not zero, that is, mod(i, s) ≠ 0, the prediction frame of the ByteTrack tracker is used directly as the target cutting frame and the pedestrian target detector computation is skipped. This saves a large amount of computation and improves overall calculation efficiency, while the reliability of the ByteTrack tracker keeps the probability of losing the target cutting frame at a low level.
Step 140, if the residual value is zero, inputting the human body posture image to be estimated to a pedestrian target detector to obtain a pedestrian target frame, and updating the ByteTrack tracker prediction frame to obtain a target cutting frame.
In some embodiments, it is determined whether the residual value calculated in step 120 is zero. If the residual value is zero, that is, mod(i, s) = 0, the pedestrian target detector is started (for example, a lightweight target detector such as YOLOv7 (You Only Look Once version 7, a target detection algorithm)), the i-th frame image is input to the pedestrian target detector, and a pedestrian target frame is calculated. The prediction frame of the ByteTrack tracker is then updated according to the pedestrian target frame, and the new prediction frame is used as the target cutting frame, so that the reliability of the ByteTrack tracker keeps the probability of losing the target cutting frame low.
Step 150, cutting the human body posture image to be estimated according to the target cutting frame to obtain a pedestrian target image set.
Wherein the pedestrian target image set includes a single or multiple pedestrian targets.
In some embodiments, using the target cutting frame obtained in step 130 or step 140, a single or multiple pedestrian targets are cut from the i-th frame image to obtain a pedestrian target image set.
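Steps 120 to 150 can be sketched end to end as follows. This is a minimal illustration under stated assumptions: StubTracker is a hypothetical stand-in that simply remembers the last boxes and is not the ByteTrack API, toy_detector is a hypothetical stand-in for a pedestrian target detector, and the image is a toy nested list rather than a real frame.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels, x2/y2 exclusive
Image = List[List[int]]           # toy grayscale image: rows of pixel values

class StubTracker:
    """Hypothetical tracker stand-in: it only remembers the most recent boxes."""

    def __init__(self) -> None:
        self.boxes: List[Box] = []

    def update(self, detections: Sequence[Box]) -> List[Box]:
        # A real tracker would associate detections with tracks; this stub just stores them.
        self.boxes = list(detections)
        return list(self.boxes)

    def predict(self) -> List[Box]:
        # A real tracker would extrapolate motion; this stub returns the stored boxes.
        return list(self.boxes)

def crop(image: Image, box: Box) -> Image:
    """Cut one box out of the image, clamping the box to the image bounds."""
    x1, y1, x2, y2 = box
    height, width = len(image), len(image[0])
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]

def pedestrian_crops(frame_number: int, image: Image, interval: int,
                     detector: Callable[[Image], Sequence[Box]],
                     tracker: StubTracker) -> List[Image]:
    """Run the detector only when mod(frame_number, interval) == 0, then crop."""
    if frame_number % interval == 0:
        boxes = tracker.update(detector(image))   # step 140: detect, then update tracker
    else:
        boxes = tracker.predict()                 # step 130: reuse tracker prediction
    return [crop(image, box) for box in boxes]    # step 150: pedestrian target image set

detector_calls: List[int] = []

def toy_detector(image: Image) -> List[Box]:
    """Hypothetical detector that always reports one fixed box and counts its calls."""
    detector_calls.append(1)
    return [(1, 0, 3, 2)]

image = [[10 * r + c for c in range(6)] for r in range(4)]
tracker = StubTracker()
crops_by_frame = {i: pedestrian_crops(i, image, 2, toy_detector, tracker)
                  for i in range(4, 8)}
```

In this run the detector is invoked on frames 4 and 6 only, while frames 5 and 7 are cropped from the tracker's boxes.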
Step 160, inputting the pedestrian target image set to a pre-trained human body posture estimation model, and outputting a corresponding human body key point prediction result.
In some embodiments, the pedestrian target image set obtained in step 150 is input to the pre-trained human body posture estimation model, and the corresponding human body key point prediction results are output, so that the human body posture estimate is obtained from the set of human body key points. The model adopts a regression-based learning method and introduces a flow model and residual log-likelihood estimation: "fitting the target distribution" is split into "the distribution of the offset between the labeled human body key points and the real human body key points" and "the translation and scaling of the offset distribution". The offset distribution is handed to the flow model, so the regression model only needs to learn the translation and scaling parameters of the offset distribution, which reduces its learning difficulty. In addition, because the regression-based learning method replaces the heatmap-based learning method, no Gaussian probability distribution map needs to be maintained, and the trained regression model can predict the key point positions directly.
In some embodiments, the human body posture estimation model comprises a backbone network, a regression model and a flow model, and its training process, shown schematically in FIG. 2, comprises the following steps:
Step 210, inputting the image data set into a backbone network to finish image feature extraction, and obtaining a feature map.
In some embodiments, the image dataset X is input to the backbone network G to complete image feature extraction, resulting in a feature map. The backbone network adopts a RepVGG (Re-Parameterization Visual Geometry Group, re-parameterized convolutional neural network) model.
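The idea behind structural re-parameterization of a RepVGG-style block can be shown on a single channel. The sketch below is a simplification made for illustration and not the procedure of the disclosure: it ignores batch normalization and multiple channels, and all function names are hypothetical. It merges a 3x3 branch, a 1x1 branch, and an identity branch into one 3x3 kernel and checks that the output is unchanged.

```python
def conv2d_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding (output keeps input size)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out

def multi_branch(image, k3, k1_weight):
    """Training-time form: 3x3 branch + 1x1 branch + identity branch, summed."""
    y3 = conv2d_same(image, k3)
    return [[y3[r][c] + k1_weight * image[r][c] + image[r][c]
             for c in range(len(image[0]))]
            for r in range(len(image))]

def fuse_branches(k3, k1_weight):
    """Inference-time form: one 3x3 kernel equivalent to the three branches."""
    fused = [row[:] for row in k3]   # copy so the original kernel is left untouched
    # The 1x1 weight and the identity both act only at the kernel center.
    fused[1][1] += k1_weight + 1.0
    return fused

image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.5, 0.0, -0.5], [1.0, 0.25, -1.0], [0.5, 0.0, -0.5]]
k1_weight = 2.0

branch_out = multi_branch(image, k3, k1_weight)
fused_out = conv2d_same(image, fuse_branches(k3, k1_weight))
max_gap = max(abs(a - b)
              for row_a, row_b in zip(branch_out, fused_out)
              for a, b in zip(row_a, row_b))
```

Because the fused kernel gives the same output with a single convolution, the multi-branch structure is needed only during training, which is what makes the re-parameterized backbone cheaper at inference time.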
In some embodiments, the generation process of the image dataset X, shown schematically in FIG. 3, comprises the following steps:
Step 310, acquiring a historical human body posture image set;
Step 320, labeling key points of the human body on the historical human body posture image set;
Step 330, taking the labeled human body posture image set as an image data set.
In some embodiments, historical human body posture images are obtained and the key points of the human body are labeled, yielding a labeled human body posture image set. The historical human body posture images can be obtained from a video stream and processed as in steps 120-150; that is, the images are cut and then labeled, which provides sample data for model training.
Step 220, converting the feature map into an input of a regression model through global average pooling, and inputting the converted feature map into the regression model to complete regression calculation, so as to obtain a regression value.
In some embodiments, the feature map obtained in step 210 is converted into the input of a regression model R through global average pooling, and the regression model R calculates and outputs the regression values μ and σ. The regression model is a fully connected layer model. Global average pooling is more native to the convolution structure, since it establishes a direct relationship between feature maps and categories. The global average pooling layer has no parameters, so overfitting is avoided at this layer; it also sums out spatial information, which makes it more robust to spatial variations of the input.
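The path from feature map to regression values can be sketched with plain lists. This is a minimal illustration, not the model of the disclosure: the function names and toy weights are hypothetical, and passing σ through a sigmoid to keep it positive is an assumption that the source does not state.

```python
import math

def global_average_pool(feature_map):
    """[channels][height][width] -> [channels]: one mean value per channel."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]

def linear(vector, weights, bias):
    """Fully connected layer: weights is [outputs][inputs], bias is [outputs]."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def regression_head(feature_map, w_mu, b_mu, w_sigma, b_sigma):
    """Pool the feature map, then predict mu and sigma with fully connected layers."""
    pooled = global_average_pool(feature_map)
    mu = linear(pooled, w_mu, b_mu)
    # Assumption: a sigmoid keeps sigma strictly positive so it can act as a scale.
    sigma = [1.0 / (1.0 + math.exp(-v)) for v in linear(pooled, w_sigma, b_sigma)]
    return mu, sigma

# Toy feature map with 2 channels of size 2x2 (illustrative values only).
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
pooled = global_average_pool(feature_map)
mu, sigma = regression_head(
    feature_map,
    w_mu=[[1.0, 0.0], [0.0, 1.0]], b_mu=[0.0, 0.0],
    w_sigma=[[0.0, 0.0], [0.0, 0.0]], b_sigma=[0.0, 0.0],
)
```

The pooling step has no parameters of its own, so all learnable weights of the head sit in the fully connected layers. In a real model the head would output one (μ, σ) pair per key point coordinate rather than the two toy outputs used here.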
And 230, calculating a labeling offset based on the probability density distribution function according to the regression value, and completing probability distribution calculation of the labeling offset input stream model to obtain a residual log-likelihood distribution function and a standard Gaussian distribution function.
In some embodiments, the offset between the labeled human key points and the real human key points is first calculated from the labeled key points and the regression values, giving the labeling offset. Specifically, the calculation formula of the labeling offset is as follows:
[formula image not reproduced in this text]
where the result is the offset between the human key point label and the real human key point, namely the labeling offset, Y_g is the labeled sample used during training, and μ and σ are the regression values. The labeling offset is then input into the flow model to obtain the residual log-likelihood distribution function, and a Gaussian distribution calculation is performed on the labeling offset to obtain the standard Gaussian distribution function. A flow model is a generative model that maps a simple probability distribution (e.g., a Gaussian distribution) to a complex data distribution (e.g., an image distribution) through a series of invertible nonlinear transformations. Its advantages are that the likelihood of the data can be computed exactly and that samples can be drawn from the model efficiently, which improves the accuracy and efficiency of model training.
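As an illustration only, the sketch below assumes that the labeling offset is the label normalized by the regression values, (Y_g - μ) / σ, which is the form commonly used in residual log-likelihood estimation. The source names Y_g, μ and σ but its formula is not available here, so this form and both function names are assumptions.

```python
import numpy as np

def labeling_offset(y_g, mu, sigma):
    """Assumed form of the labeling offset: the label normalized by mu and sigma."""
    return (np.asarray(y_g) - np.asarray(mu)) / np.asarray(sigma)

def standard_gaussian_log_density(x):
    """Log-density of N(0, 1), evaluated element-wise on the offset."""
    x = np.asarray(x)
    return -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)

if __name__ == "__main__":
    offset = labeling_offset([1.0, 2.0], [0.5, 2.0], [0.5, 1.0])
    print(offset, standard_gaussian_log_density(offset))
```

Under this assumption, an offset of zero means the regressed μ coincides with the label, and the Gaussian log-density is largest there.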
Step 240, calculating a loss function of the human body posture estimation model according to the regression value, the residual log-likelihood distribution function and the standard Gaussian distribution function, and updating the human body posture estimation model according to a back propagation mechanism.
In some embodiments, the loss function of the human body posture estimation model is calculated from the regression values obtained in step 220, and from the residual log-likelihood distribution function and the standard Gaussian distribution function obtained in step 230, based on the following formula:
[formula image not reproduced in this text]
where the terms of the formula represent the standard Gaussian distribution function, the residual log-likelihood distribution function, the labeling offset, and the regression values, and C represents a constant. The human body posture estimation model is then updated by back propagation, which adjusts its training parameters and accelerates its convergence. This design of the loss function can improve both the training efficiency of the human body posture estimation model and the accuracy of human body posture estimation.
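One plausible shape for such a loss is a negative log-likelihood that sums a standard Gaussian term, a residual term from the flow model, a log σ term, and a constant. The sketch below follows that pattern; the exact terms and signs, the function name `pose_loss`, and the `flow_log_density` callable (a stand-in for the trained flow model) are assumptions, since the source formula is not available in this text.

```python
import numpy as np

def pose_loss(y_g, mu, sigma, flow_log_density, const=0.0):
    """Assumed negative log-likelihood: -log Q(x) - log G(x) + log(sigma) + C.

    Q is the standard Gaussian density, G is the residual density supplied
    by a flow model (passed in here as a callable returning log G), and
    x is the assumed labeling offset (y_g - mu) / sigma.
    """
    y_g, mu, sigma = (np.asarray(a, dtype=float) for a in (y_g, mu, sigma))
    x = (y_g - mu) / sigma
    log_q = -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)
    log_g = flow_log_density(x)
    nll = -log_q - log_g + np.log(sigma) + const
    return float(nll.mean())

if __name__ == "__main__":
    # Hypothetical flow that contributes nothing (log G = 0 everywhere).
    flat_flow = lambda x: np.zeros_like(x)
    print(pose_loss([1.0, 2.0], [1.0, 2.0], [1.0, 1.0], flat_flow))
```

With a perfect prediction, unit σ, and the flat stand-in flow, only the Gaussian normalizing term remains, which gives a quick sanity check on the implementation.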
Step 250, repeating the above until a preset number of training rounds is reached, thereby completing training of the human body posture estimation model.
In some embodiments, a threshold on the number of training rounds is set for the human body posture estimation model. When the threshold is reached, the model is considered to have converged or to have reached the required training accuracy, and training is complete. The threshold may be preset by the user based on experience and may also be adjusted dynamically during training.
In some embodiments, after the human body posture estimation model training is completed, further comprising:
performing structural re-parameterization on the trained backbone network, and updating the human body posture estimation model. Structural re-parameterization is a method of fusing structural modules: during training, a training-friendly model structure is used (e.g., one that converges easily, is stable, and trains well); then, at inference or deployment time, the trained structure and its parameters are equivalently converted into another, inference-friendly structure, which speeds up model inference. For example, the backbone network is equivalently converted (equivalent meaning that the computed result is unchanged by the conversion) into a plain, single-path model structure (i.e., only a main line of convolutions with no extra branches, such as a VGG structure) that is better suited to accelerated computing devices (e.g., GPU, VPU). This reduces the use of computing resources, takes full advantage of convolution on such devices, improves computing efficiency, and eases deployment on devices with limited computing resources.
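The equivalence claimed for structural re-parameterization can be checked numerically. The sketch below fuses a 3x3 branch, a 1x1 branch, and an identity branch into a single 3x3 kernel and compares outputs. It is a simplified illustration: batch normalization folding, which RepVGG also performs, is omitted, bias terms are left out, and the helper names are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Stride-1, zero-padded cross-correlation.

    x: (C_in, H, W); kernel: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) across all channels.
            out += np.einsum("oc,chw->ohw", kernel[:, :, i, j],
                             xp[:, i:i + h, j:j + w])
    return out

def fuse_branches(k3, k1):
    """Merge 3x3, 1x1 and identity branches into one 3x3 kernel."""
    c_out, c_in = k3.shape[:2]
    assert c_out == c_in, "the identity branch needs matching channel counts"
    fused = k3.copy()
    fused[:, :, 1, 1] += k1[:, :, 0, 0]        # 1x1 kernel sits at the centre tap
    idx = np.arange(c_out)
    fused[idx, idx, 1, 1] += 1.0               # identity as a centred delta
    return fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 6, 6))
    k3 = rng.normal(size=(4, 4, 3, 3))
    k1 = rng.normal(size=(4, 4, 1, 1))
    multi_branch = conv2d_same(x, k3) + conv2d_same(x, k1) + x
    single_path = conv2d_same(x, fuse_branches(k3, k1))
    print(np.allclose(multi_branch, single_path))
```

Because convolution is linear, the three branches add up to one kernel, so the single-path model needs one convolution per layer at inference time instead of three parallel operations.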
Step 170, clustering the human body key point prediction results to obtain human body posture estimation results.
In some embodiments, the human key point prediction results obtained in step 160 are clustered to obtain the final human body posture estimation result. That is, the key points belonging to each pedestrian are clustered together, which avoids wrongly combining key points of different pedestrians and keeps the human body posture estimation accurate.
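The source does not specify the clustering procedure. One simple way to realize the stated goal is to group each predicted key point by the pedestrian crop it came from, as in the sketch below. The tuple layout, the use of a tracker ID as the grouping key, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def group_keypoints_by_pedestrian(predictions):
    """Group key points per pedestrian.

    predictions: iterable of (pedestrian_id, keypoint_name, x, y) tuples,
    where pedestrian_id is assumed to identify the crop (e.g. a tracker ID).
    Returns {pedestrian_id: {keypoint_name: (x, y)}}.
    """
    poses = defaultdict(dict)
    for pedestrian_id, name, x, y in predictions:
        poses[pedestrian_id][name] = (x, y)
    return dict(poses)

if __name__ == "__main__":
    preds = [
        (1, "nose", 120.0, 40.0),
        (2, "nose", 300.0, 55.0),
        (1, "left_wrist", 98.0, 160.0),
    ]
    print(group_keypoints_by_pedestrian(preds))
```

Keeping the pedestrian identifier attached to every key point means two people standing close together cannot exchange joints, which is the failure mode the paragraph above describes.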
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
Fig. 4 shows a block diagram of a human body posture estimation apparatus 400 according to an embodiment of the present disclosure.
As shown in fig. 4, the apparatus 400 includes:
an image acquisition module 410, configured to acquire a human body posture image to be estimated and a corresponding image number; the human body posture image to be estimated is any frame of image in the human body posture video stream to be estimated;
a modulo operation module 420, configured to perform a modulo operation on the image number and a preset interval step number to obtain a remainder value;
a first judging module 430, configured to use a ByteTrack tracker prediction frame as a target cropping frame if the remainder value is not zero;
a second judging module 440, configured to input the human body posture image to be estimated to a pedestrian target detector if the remainder value is zero, obtain a pedestrian target frame, and update the ByteTrack tracker prediction frame to obtain a target cropping frame;
an image cropping module 450, configured to crop the human body posture image to be estimated according to the target cropping frame to obtain a pedestrian target image set; wherein the pedestrian target image set includes a single or multiple pedestrian targets;
a key point prediction module 460, configured to input the pedestrian target image set to a pre-trained human body posture estimation model, and output a corresponding human body key point prediction result;
a posture estimation module 470, configured to cluster the human body key point prediction results to obtain a human body posture estimation result.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
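The interplay of the modulo operation, the two judging modules, and the cropping module can be sketched as follows. `StubTracker` is a hypothetical stand-in and does not reproduce the real ByteTrack interface; the detector is any callable returning boxes, and boxes are assumed to be (x1, y1, x2, y2) pixel tuples.

```python
class StubTracker:
    """Hypothetical tracker stand-in (not the ByteTrack API)."""

    def __init__(self):
        self.boxes = []

    def update(self, detections):
        # A real tracker would associate detections with existing tracks.
        self.boxes = list(detections)
        return self.boxes

    def predict(self):
        # A real tracker would extrapolate motion; the stub reuses old boxes.
        return self.boxes

def select_crop_boxes(image_number, interval, frame, detector, tracker):
    """Run the detector only when image_number % interval == 0."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if image_number % interval == 0:
        return tracker.update(detector(frame))
    return tracker.predict()

def crop(frame, box):
    """Crop a frame given as a list of pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]

if __name__ == "__main__":
    frame = [[0] * 8 for _ in range(8)]
    calls = []

    def detector(f):
        calls.append(1)
        return [(1, 1, 5, 6)]

    tracker = StubTracker()
    for n in range(10):
        boxes = select_crop_boxes(n, 5, frame, detector, tracker)
        crops = [crop(frame, b) for b in boxes]
    print(len(calls), len(crops))
```

With an interval of 5, the detector runs on frames 0 and 5 only, and the remaining frames rely on the tracker, which is how the scheme trades detector cost against tracking drift.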
According to an embodiment of the disclosure, the disclosure further provides an electronic device and a readable storage medium.
Fig. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
The electronic device 500 includes a computing unit 501 that can perform various appropriate actions and processes according to a computer program stored in a ROM502 or a computer program loaded from a storage unit 508 into a RAM 503. In the RAM503, various programs and data required for the operation of the electronic device 500 may also be stored. The computing unit 501, ROM502, and RAM503 are connected to each other by a bus 504. I/O interface 505 is also connected to bus 504.
A number of components in electronic device 500 are connected to I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as method 100. For example, in some embodiments, the method 100 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM502 and/or the communication unit 509. When the computer program is loaded into RAM503 and executed by computing unit 501, one or more steps of method 100 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method 100 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed technical solutions are achieved, and no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (9)

1. A human body posture estimation method, characterized by comprising:
acquiring a human body posture image to be estimated and a corresponding image number; the human body posture image to be estimated is any frame of image in the human body posture video stream to be estimated;
performing a modulo operation on the image number and a preset interval step number to obtain a remainder value;
if the remainder value is not zero, adopting a ByteTrack tracker prediction frame as a target cropping frame;
if the remainder value is zero, inputting the human body posture image to be estimated to a pedestrian target detector to obtain a pedestrian target frame, and updating the ByteTrack tracker prediction frame to obtain a target cropping frame;
cropping the human body posture image to be estimated according to the target cropping frame to obtain a pedestrian target image set; wherein the pedestrian target image set includes a single or multiple pedestrian targets;
inputting the pedestrian target image set into a pre-trained human body posture estimation model, and outputting a corresponding human body key point prediction result;
and clustering the human body key point prediction results to obtain human body posture estimation results.
2. The method of claim 1, wherein the body posture estimation model comprises a backbone network, a regression model, and a flow model, and wherein the training process of the body posture estimation model comprises:
inputting the image data set into a backbone network to finish image feature extraction and obtain a feature map;
converting the feature map into the input of a regression model through global average pooling, and inputting the converted feature map into the regression model to complete regression calculation so as to obtain a regression value;
calculating a labeling offset based on a probability density distribution function according to the regression value, and inputting the labeling offset into the flow model to complete probability distribution calculation, so as to obtain a residual log-likelihood distribution function and a standard Gaussian distribution function;
calculating a loss function of the human body posture estimation model according to the regression value, the residual log-likelihood distribution function and the standard Gaussian distribution function, and updating the human body posture estimation model according to a back propagation mechanism;
and completing training of the human body posture estimation model when a preset number of training rounds is reached.
3. The method of claim 2, wherein the image dataset generation process comprises:
acquiring a historical human body posture image set;
labeling key points of the human body on the historical human body posture image set;
and taking the marked human body posture image set as an image data set.
4. The method of claim 2, further comprising, after the human body pose estimation model training is completed:
and carrying out structural re-parameterization on the backbone network after training, and updating the human body posture estimation model.
5. A method according to claim 3, wherein said calculating a labeling offset based on a probability density distribution function from said regression values comprises:
and calculating the offset of the marked human body key points and the real human body key points according to the marked human body key points and the regression value to obtain the marked offset.
6. The method according to claim 2 or 5, characterized in that the loss function of the human body posture estimation model is calculated as follows:
[formula image not reproduced in this text]
where the terms of the formula represent the standard Gaussian distribution function, the residual log-likelihood distribution function, the labeling offset, and the regression values, and C represents a constant.
7. A human body posture estimation apparatus, characterized by comprising:
the image acquisition module is used for acquiring a human body posture image to be estimated and a corresponding image number; the human body posture image to be estimated is any frame of image in the human body posture video stream to be estimated;
the modulo operation module is used for performing a modulo operation on the image number and a preset interval step number to obtain a remainder value;
the first judging module is used for adopting a ByteTrack tracker prediction frame as a target cropping frame if the remainder value is not zero;
the second judging module is used for inputting the human body posture image to be estimated to a pedestrian target detector if the remainder value is zero to obtain a pedestrian target frame, and updating the ByteTrack tracker prediction frame to obtain a target cropping frame;
the image cropping module is used for cropping the human body posture image to be estimated according to the target cropping frame to obtain a pedestrian target image set; wherein the pedestrian target image set includes a single or multiple pedestrian targets;
the key point prediction module is used for inputting the pedestrian target image set into a pre-trained human body posture estimation model and outputting a corresponding human body key point prediction result;
and the posture estimation module is used for clustering the human body key point prediction results to obtain human body posture estimation results.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
9. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202310492398.4A 2023-05-04 2023-05-04 Human body posture estimation method, device, equipment and storage medium Pending CN116543417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310492398.4A CN116543417A (en) 2023-05-04 2023-05-04 Human body posture estimation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310492398.4A CN116543417A (en) 2023-05-04 2023-05-04 Human body posture estimation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116543417A true CN116543417A (en) 2023-08-04

Family

ID=87442946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310492398.4A Pending CN116543417A (en) 2023-05-04 2023-05-04 Human body posture estimation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116543417A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115595A (en) * 2023-10-23 2023-11-24 腾讯科技(深圳)有限公司 Training method and device of attitude estimation model, electronic equipment and storage medium
CN117115595B (en) * 2023-10-23 2024-02-02 腾讯科技(深圳)有限公司 Training method and device of attitude estimation model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112562069B (en) Method, device, equipment and storage medium for constructing three-dimensional model
CN115409933B (en) Multi-style texture mapping generation method and device
CN111739167B (en) 3D human head reconstruction method, device, equipment and medium
CN115880435B (en) Image reconstruction method, model training method, device, electronic equipment and medium
CN113409430B (en) Drivable three-dimensional character generation method, drivable three-dimensional character generation device, electronic equipment and storage medium
CN114186632A (en) Method, device, equipment and storage medium for training key point detection model
CN112967315B (en) Target tracking method and device and electronic equipment
CN112528858A (en) Training method, device, equipment, medium and product of human body posture estimation model
CN111899159B (en) Method, device, apparatus and storage medium for changing hairstyle
CN116309983B (en) Training method and generating method and device of virtual character model and electronic equipment
CN116543417A (en) Human body posture estimation method, device, equipment and storage medium
CN114708374A (en) Virtual image generation method and device, electronic equipment and storage medium
KR102612354B1 (en) Method for detecting face synthetic image, electronic device, and storage medium
CN113766117B (en) Video de-jitter method and device
CN113255511A (en) Method, apparatus, device and storage medium for living body identification
CN116168132B (en) Street view reconstruction model acquisition method, device, equipment and medium
CN113409340A (en) Semantic segmentation model training method, semantic segmentation device and electronic equipment
CN116596750A (en) Point cloud processing method and device, electronic equipment and storage medium
CN115272705B (en) Training method, device and equipment for saliency object detection model
CN115222895A (en) Image generation method, device, equipment and storage medium
CN113920273A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113870428A (en) Scene map generation method, related device and computer program product
CN114078184A (en) Data processing method, device, electronic equipment and medium
CN111382834B (en) Confidence degree comparison method and device
CN116206035B (en) Face reconstruction method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination