CN118470805A - Dancing action key point capturing method based on transformer fusion openpose - Google Patents
Dancing action key point capturing method based on transformer fusion openpose
- Publication number
- CN118470805A (application CN202410932523.3A)
- Authority
- CN
- China
- Prior art keywords
- openpose
- transducer
- key point
- key
- optimization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Image Processing (AREA)
Abstract
The invention discloses a dance motion key point capturing method based on a Transformer fused with OpenPose, comprising the following steps: 1) Obtaining dance video frames: collect dance motion video in real time and decompose it frame by frame to obtain dance video frames; 2) Initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points; 3) Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction; 4) Key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error and output smoothed key point data. By combining the strong predictive capability of the Transformer model with the flexibility of OpenPose, the method markedly improves the accuracy and real-time performance of dance motion capture and has broad application prospects.
Description
Technical Field
The invention belongs to the technical field of motion capture, and particularly relates to a dance motion key point capturing method based on a Transformer fused with OpenPose.
Background
With the rapid development of virtual reality and augmented reality technologies, motion capture is increasingly used in fields such as entertainment, medicine, and sports. Existing motion capture techniques rely primarily on deep learning and computer vision, among which OpenPose is a widely used open-source pose estimation framework. However, OpenPose still has room for improvement in real-time performance and key point prediction accuracy. Regarding real-time performance, OpenPose is based on a multi-stage convolutional neural network; although the complex network architecture improves detection precision, it increases the computational burden and slows real-time processing, and when hardware resources are limited, the allocation and utilization of computing resources are neither flexible nor efficient. Regarding key point prediction, OpenPose is easily disturbed by noise against complex backgrounds or during rapid motion, making key point detection inaccurate. Moreover, although OpenPose can extract human key points, its prediction accuracy may degrade under complex articulation or occlusion, and for fast-changing or complex human actions it may struggle to capture all details accurately and in real time.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a dance motion key point capturing method based on a Transformer fused with OpenPose, which markedly improves the accuracy and real-time performance of dance motion capture by combining the strong predictive capability of a Transformer model with the flexibility of OpenPose, and has broad application prospects.
In order to achieve the technical purpose, the invention adopts the following technical scheme.
A dance motion key point capturing method based on a Transformer fused with OpenPose comprises the following steps:
Step S1, acquiring dance video frames: using high-frame-rate (60 fps and above) video acquisition equipment, collect dance motion video in real time in a well-lit environment, and decompose the video frame by frame to obtain dance video frames;
Step S2, initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points;
Step S3, Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction;
Step S4, key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error, and output smoothed key point data (an end-to-end sketch of these four steps follows this list).
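The four steps can be wired together in a minimal end-to-end sketch in Python. The callables `run_openpose`, `transformer_refine`, and `fuzzy_smooth` are hypothetical placeholders for the OpenPose inference, Transformer refinement, and fuzzy post-processing stages described below; only the frame decomposition uses a real library call (OpenCV).

```python
import cv2  # OpenCV, used here only for frame-by-frame decomposition (step S1)

def capture_dance_keypoints(video_path, run_openpose, transformer_refine, fuzzy_smooth):
    """Sketch of steps S1-S4; the three callables are assumed, not defined by the patent."""
    cap = cv2.VideoCapture(video_path)
    smoothed = []
    while True:
        ok, frame = cap.read()                    # S1: next dance video frame
        if not ok:
            break
        keypoints = run_openpose(frame)           # S2: body + hand key points
        refined = transformer_refine(keypoints)   # S3: Transformer prediction/adjustment
        smoothed.append(fuzzy_smooth(refined))    # S4: fuzzy smoothing
    cap.release()
    return smoothed
```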
Specifically, the extraction of human body key points and hand key points by OpenPose in step S2 is as follows:
OpenPose is composed of a multi-stage convolutional neural network and is used to extract human body key points; its key point loss function is expressed as:

$L = L_{\text{heatmap}} + L_{\text{PAF}}$

where $L_{\text{heatmap}}$ is the loss on the key point heat maps and $L_{\text{PAF}}$ is the loss on the part affinity fields; PAF refers to the part affinity field, a vector field describing the association between different body parts, in which each vector represents the direction and intensity from one key point to another and is used to determine whether two key points belong to the same limb.
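As a concrete reading of this loss, a minimal NumPy sketch; collapsing the stage-wise sums of the original OpenPose formulation into single mean-squared-error terms is an assumption made for brevity:

```python
import numpy as np

def openpose_loss(heatmap_pred, heatmap_gt, paf_pred, paf_gt):
    # L = L_heatmap + L_PAF, each term taken as a mean squared error
    l_heatmap = np.mean((heatmap_pred - heatmap_gt) ** 2)
    l_paf = np.mean((paf_pred - paf_gt) ** 2)
    return l_heatmap + l_paf
```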
Specifically, in step S3 the key points extracted by OpenPose are input into a pre-trained Transformer model for prediction and parameter adjustment, as follows:
The Transformer model is used for sequence prediction and parameter optimization and adopts a self-attention mechanism, given by:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$

where $Q$, $K$, and $V$ denote the query, key, and value matrices respectively, and $d_k$ is the dimension of the key vectors.
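This is the standard scaled dot-product attention; a self-contained NumPy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V
```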
Specifically, in step S3 the OpenPose convolution layers and learning rate, the number of attention heads, and the rotation angle of the video frames are dynamically adjusted based on the Transformer output so as to optimize key point extraction; the optimization comprises:
Step S31, parameter adjustment optimization
A. Convolution layer parameter optimization: the key point data extracted by OpenPose are input into the Transformer model, and the hyperparameters of the OpenPose convolution layers are dynamically adjusted according to the Transformer model output, with the specific formula:

$k_{\text{new}} = W \cdot z + b$

where $W$ and $b$ are a learnable weight and bias parameter respectively, $k$ is the convolution kernel size, and $z$ is the Transformer output;

the key point data comprise the heat maps and the PAFs;
b. Loss function adjustment: based on the Transformer output, the weight parameters in the OpenPose loss function are adjusted to optimize key point extraction accuracy;
For the heat map loss:

$L_{\text{heatmap}} = \lambda_{h}\,\lVert \hat{H} - H^{*} \rVert_2^2$

where $\lambda_{h}$ is the weight predicted by the Transformer model, $\hat{H}$ is the predicted heat map, and $H^{*}$ is the ground-truth heat map;
For the PAF loss:

$L_{\text{PAF}} = \lambda_{p}\,\lVert \hat{P} - P^{*} \rVert_2^2$

where $\lambda_{p}$ is the weight predicted by the Transformer model, $\hat{P}$ is the predicted PAF, and $P^{*}$ is the ground-truth PAF;
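Taken together, the two adjusted loss terms can be sketched as follows; treating the Transformer-predicted weights as scalars is an assumption:

```python
import numpy as np

def adjusted_losses(lam_h, lam_p, H_pred, H_gt, P_pred, P_gt):
    # lam_h, lam_p: weights predicted by the Transformer (assumed scalar here)
    L_heatmap = lam_h * np.sum((H_pred - H_gt) ** 2)
    L_paf = lam_p * np.sum((P_pred - P_gt) ** 2)
    return L_heatmap + L_paf
```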
Step S32, multi-head attention mechanism optimization

The key point data extracted by OpenPose are input into the optimized Transformer model, the attention weights between different key points are computed with a multi-head attention mechanism, and an adaptive weight adjustment mechanism is introduced, with the calculation formulas:

$\mathrm{head}_i = \alpha_i \cdot \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

where $\alpha_i$ is a weight parameter dynamically adjusted according to the key point data; $W_i^{Q}$ is the weight matrix of the query vector $Q$, converting it into a new query vector; $W_i^{K}$ is the weight matrix of the key vector $K$, converting it into a new key vector; and $W_i^{V}$ is the weight matrix of the value vector $V$, converting it into a new value vector.
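A sketch of the adaptive multi-head computation, reusing `attention()` from the sketch above; scaling each head's output by its adaptive weight $\alpha_i$ is one plausible reading of the mechanism, not a form the patent fixes:

```python
import numpy as np

def adaptive_multihead(Q, K, V, Wq, Wk, Wv, Wo, alphas):
    # Wq, Wk, Wv: per-head projection matrices; alphas: per-head adaptive
    # weights derived from the key point data (assumed given here).
    heads = [a * attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv, a in zip(Wq, Wk, Wv, alphas)]
    return np.concatenate(heads, axis=-1) @ Wo
```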
Step S33, position coding optimization

Dynamic position coding optimization: dynamic position coding (DPE) is introduced into the Transformer model to address position sensitivity in complex sequence data processing; the optimization comprises:
S331, environment perception initialization phase

Assume the original input sequence is $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ represents the word embedding at the $i$-th position. The environment perception network preprocesses $X$ and outputs a context vector:

$C = f_{\text{env}}(X)$
S332, dynamic coding generation network

For each position $i$ in the sequence, the dynamic coding generation network generates a dynamic position code $DPE_i$ from the position index $i$ and the context vector $C$, expressed as:

$DPE_i = \mathrm{MLP}\big([\,PE_i \,;\, C\,];\, \theta\big)$

where $PE_i$ is the basic position code of position index $i$, e.g. traditional static coding generated by $\sin/\cos$ functions; $[\cdot\,;\,\cdot]$ denotes vector concatenation; $\mathrm{MLP}$ denotes a multi-layer perceptron; and the parameters $\theta$ are responsible for learning how to adjust the code of each position based on the context $C$;
S333, fusion strategy

The dynamic position code $DPE_i$ and the word embedding $x_i$ are fused by an addition operation to form the enhanced input representation $x_i' = x_i + DPE_i$;
S334, end-to-end training loss function

The overall goal of the Transformer model is to minimize the loss function $L$, averaged over all samples, thereby optimizing all parameters including those of the dynamic position coding:

$L(\theta) = \mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell\big(f_{\theta}(X'),\, Y\big)\big]$

where $(X, Y)$ denotes an input-output pair sampled from the dataset $\mathcal{D}$; $f_{\theta}$ is the Transformer model with dynamic position coding; $\theta$ contains all learnable parameters; and $X'$ denotes the processed form of the original input sequence $X$, i.e. the enhanced input representation obtained by the fusion strategy of step S333;
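Steps S331 to S334 can be condensed into a short sketch; the environment perception network and the MLP are passed in as plain callables, which is an assumption standing in for the patent's unspecified architectures:

```python
import numpy as np

def base_pe(i, d):
    # PE_i: traditional sin/cos static encoding for position index i
    j = np.arange(d)
    angles = i / np.power(10000.0, (2 * (j // 2)) / d)
    return np.where(j % 2 == 0, np.sin(angles), np.cos(angles))

def dynamic_position_encode(X, env_net, mlp):
    # X: (n, d) word embeddings; env_net and mlp are assumed callables
    C = env_net(X)                                   # S331: context vector
    dpe = np.stack([mlp(np.concatenate([base_pe(i, X.shape[1]), C]))
                    for i in range(X.shape[0])])     # S332: DPE_i
    return X + dpe                                   # S333: x'_i = x_i + DPE_i
```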
Multidimensional position coding optimization: by introducing multidimensional position coding over the spatial dimensions, the Transformer model can simultaneously account for the position changes of the key points in space and time; the optimization is as follows:
For a two-dimensional structure, the position of each pixel or image block in the image can be encoded using two-dimensional sine and cosine functions. The original position coding in the Transformer model is extended to two dimensions, and for each position $(x, y)$ the code is defined as:

$PE_{(x, y, 2i)} = \sin\!\big(x / 10000^{2i/d}\big)$

$PE_{(x, y, 2i+1)} = \cos\!\big(y / 10000^{2i/d}\big)$

where $d$ is the number of dimensions of the position code, usually the embedding dimension of the Transformer model divided by 2, so that the position code dimension matches the model embedding dimension; $t$ denotes the current time step.
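A sketch of this two-dimensional extension; placing x in the first half of the channels and y in the second half (each with sin/cos pairs) is one common layout and an assumption here:

```python
import numpy as np

def positional_encoding_2d(x, y, d):
    # d: channels per axis (model embedding dim / 2, assumed even);
    # returns a vector of length 2*d covering both axes
    i = np.arange(d // 2)
    div = np.power(10000.0, (2 * i) / d)
    pe = np.zeros(2 * d)
    pe[0:d:2] = np.sin(x / div)      # x axis, even channels
    pe[1:d:2] = np.cos(x / div)      # x axis, odd channels
    pe[d::2] = np.sin(y / div)       # y axis, even channels
    pe[d + 1::2] = np.cos(y / div)   # y axis, odd channels
    return pe
```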
Step S34, dynamically adjusting OpenPose parameter policy optimization
Heat map loss function weight adjustment policy optimization:
The heat map is a core means for representing the positions of key points of human bodies in OpenPose, and is used for weighting loss functions Adopts a gradient descent method for dynamic adjustment of (1) according to the current heat map lossFor weightUpdating:
;
In the above-mentioned method, the step of, Representing the learning rate, controlling the updating step length, and ensuring that the weight can be adjusted towards the direction of reducing the heat map loss in each iteration; representing the current time step number;
PAF loss function weight adjustment strategy optimization:

The PAF is a vector field connecting different key points of the human body; the dynamic adjustment strategy for its loss function weight $\lambda_{p}$ is:

$\lambda_{p}^{(t+1)} = \lambda_{p}^{(t)} - \eta\, \frac{\partial L_{\text{PAF}}}{\partial \lambda_{p}}$

By monitoring changes in $L_{\text{PAF}}$ and adjusting $\lambda_{p}$ accordingly, the OpenPose model can focus on learning the correct connection relations between joints, reducing incorrect limb-structure configurations and improving the continuity and stability of pose estimation;
Adjustment and optimization of the convolution layer parameters:

The convolution layer parameters $\theta_{\text{conv}}$ are adjusted and optimized with a global-loss-guided method:

$\theta_{\text{conv}}^{(t+1)} = \theta_{\text{conv}}^{(t)} - \eta_{\text{conv}}\, \frac{\partial L}{\partial \theta_{\text{conv}}}$

where $\eta_{\text{conv}}$ is a learning rate set specifically for the convolution layer parameters.
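The three update rules of step S34 share the same gradient-descent form; a compact sketch (how the partial derivatives are obtained, e.g. by automatic differentiation, is left open by the description, and the learning-rate values are illustrative assumptions):

```python
def s34_update(lam_h, lam_p, theta_conv, g_lam_h, g_lam_p, g_theta,
               eta=1e-3, eta_conv=1e-4):
    # One iteration of w <- w - eta * dL/dw for each adjustable quantity
    lam_h = lam_h - eta * g_lam_h                 # heat map loss weight
    lam_p = lam_p - eta * g_lam_p                 # PAF loss weight
    theta_conv = theta_conv - eta_conv * g_theta  # convolution layer parameters
    return lam_h, lam_p, theta_conv
```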
Specifically, in step S4 the Transformer output is processed with a fuzzy algorithm as follows:

To reduce noise and prediction error, a fuzzy algorithm is applied to the Transformer model output to generate smoothed key point data:

$K_{\text{smooth}} = \mathrm{Fuzzy}(K)$

where $K$ is the raw Transformer output and $K_{\text{smooth}}$ denotes the key point data after fuzzy processing.
Compared with the prior art, the invention has the following beneficial effects:
1. The method of the invention combines the strong predictive capability of the Transformer model with the flexibility of OpenPose, markedly improving the accuracy of key point detection in dance motion capture and reducing pose estimation error.
2. By combining the efficient processing capability of the Transformer with the rapid pose detection of OpenPose, the invention improves real-time responsiveness at high frame rates.
3. By introducing dynamic position coding and multidimensional position coding, the method improves the robustness of the model in complex scenes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present disclosure, and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow chart of the method of the present invention based on a Transformer fused with OpenPose.
Detailed Description
In order to facilitate the understanding and practice of the present application, a detailed description of the various steps of the method presented herein will follow, with the understanding that these examples are intended to illustrate the application and are not intended to limit the scope of the application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
Examples
As shown in FIG. 1, a dance motion key point capturing method based on a Transformer fused with OpenPose comprises the following steps:

Step S1, acquiring dance video frames: using high-frame-rate (60 fps and above) video acquisition equipment, collect dance motion video in real time in a well-lit environment, and decompose the video frame by frame to obtain dance video frames;

Step S2, initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points;

Step S3, Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction;

Step S4, key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error, and output smoothed key point data.
Specifically, the extraction of human body key points and hand key points by OpenPose in step S2 is as follows:

OpenPose is composed of a multi-stage convolutional neural network and is used to extract human body key points; its key point loss function is expressed as:

$L = L_{\text{heatmap}} + L_{\text{PAF}}$

where $L_{\text{heatmap}}$ is the loss on the key point heat maps and $L_{\text{PAF}}$ is the loss on the part affinity fields; PAF refers to the part affinity field, a vector field describing the association between different body parts, in which each vector represents the direction and intensity from one key point to another and is used to determine whether two key points belong to the same limb.
Specifically, in step S3 the key points extracted by OpenPose are input into a pre-trained Transformer model for prediction and parameter adjustment, as follows:

The Transformer model is used for sequence prediction and parameter optimization and adopts a self-attention mechanism, given by:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$

where $Q$, $K$, and $V$ denote the query, key, and value matrices respectively, and $d_k$ is the dimension of the key vectors.
Specifically, in step S3 the OpenPose convolution layers and learning rate, the number of attention heads, and the rotation angle of the video frames are dynamically adjusted based on the Transformer output so as to optimize key point extraction; the optimization comprises:
Step S31, parameter adjustment optimization
A. Convolution layer parameter optimization: the key point data extracted by OpenPose are input into the Transformer model, and the hyperparameters of the OpenPose convolution layers are dynamically adjusted according to the Transformer model output, with the specific formula:

$k_{\text{new}} = W \cdot z + b$

where $W$ and $b$ are a learnable weight and bias parameter respectively, $k$ is the convolution kernel size, and $z$ is the Transformer output;

the key point data comprise the heat maps and the PAFs;
b. Loss function adjustment: based on the Transformer output, the weight parameters in the OpenPose loss function are adjusted to optimize key point extraction accuracy;
For the heat map loss:

$L_{\text{heatmap}} = \lambda_{h}\,\lVert \hat{H} - H^{*} \rVert_2^2$

where $\lambda_{h}$ is the weight predicted by the Transformer model, $\hat{H}$ is the predicted heat map, and $H^{*}$ is the ground-truth heat map;
For the PAF loss:

$L_{\text{PAF}} = \lambda_{p}\,\lVert \hat{P} - P^{*} \rVert_2^2$

where $\lambda_{p}$ is the weight predicted by the Transformer model, $\hat{P}$ is the predicted PAF, and $P^{*}$ is the ground-truth PAF;
Step S32, multi-head attention mechanism optimization

The key point data extracted by OpenPose are input into the optimized Transformer model, the attention weights between different key points are computed with a multi-head attention mechanism, and an adaptive weight adjustment mechanism is introduced, with the calculation formulas:

$\mathrm{head}_i = \alpha_i \cdot \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

where $\alpha_i$ is a weight parameter dynamically adjusted according to the key point data; $W_i^{Q}$ is the weight matrix of the query vector $Q$, converting it into a new query vector; $W_i^{K}$ is the weight matrix of the key vector $K$, converting it into a new key vector; and $W_i^{V}$ is the weight matrix of the value vector $V$, converting it into a new value vector.
Step S33, position coding optimization

Dynamic position coding optimization: DPE adaptively generates position coding vectors according to the real-time characteristics of the input sequence, strengthening the model's ability to capture positional relations among sequence elements; it is especially suitable for variable-length sequences and tasks involving complex structure. The specific optimization comprises:
S331, environment perception initialization phase

Assume the original input sequence is $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ represents the word embedding at the $i$-th position. The environment perception network preprocesses $X$ and outputs a context vector:

$C = f_{\text{env}}(X)$

This vector guides the subsequent dynamic position code generation;
S332, dynamic coding generation network

For each position $i$ in the sequence, the dynamic coding generation network generates a dynamic position code $DPE_i$ from the position index $i$ and the context vector $C$, expressed as:

$DPE_i = \mathrm{MLP}\big([\,PE_i \,;\, C\,];\, \theta\big)$

where $PE_i$ is the basic position code of position index $i$, e.g. traditional static coding generated by $\sin/\cos$ functions; $[\cdot\,;\,\cdot]$ denotes vector concatenation; $\mathrm{MLP}$ denotes a multi-layer perceptron; and the parameters $\theta$ are responsible for learning how to adjust the code of each position based on the context $C$;
S333, fusion strategy

The dynamic position code $DPE_i$ and the word embedding $x_i$ are fused by an addition operation to form the enhanced input representation $x_i' = x_i + DPE_i$;
S334, end-to-end training loss function

The overall goal of the Transformer model is to minimize the loss function $L$, averaged over all samples, thereby optimizing all parameters including those of the dynamic position coding:

$L(\theta) = \mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell\big(f_{\theta}(X'),\, Y\big)\big]$

where $(X, Y)$ denotes an input-output pair sampled from the dataset $\mathcal{D}$; $f_{\theta}$ is the Transformer model with dynamic position coding; $\theta$ contains all learnable parameters; and $X'$ denotes the processed form of the original input sequence $X$, i.e. the enhanced input representation obtained by the fusion strategy of step S333;
Multidimensional position coding optimization: by introducing multidimensional position coding over the spatial dimensions, the Transformer model can simultaneously account for the position changes of the key points in space and time; the optimization is as follows:
For a two-dimensional structure, the position of each pixel or image block in the image can be encoded using two-dimensional sine and cosine functions. The original position coding in the Transformer model is extended to two dimensions, and for each position $(x, y)$ the code is defined as:

$PE_{(x, y, 2i)} = \sin\!\big(x / 10000^{2i/d}\big)$

$PE_{(x, y, 2i+1)} = \cos\!\big(y / 10000^{2i/d}\big)$

where $d$ is the number of dimensions of the position code, usually the embedding dimension of the Transformer model divided by 2, so that the position code dimension matches the model embedding dimension; $t$ denotes the current time step;
In this embodiment, different scaling factors are generally applied to the $x$ and $y$ terms to distinguish the importance of the two dimensions, which helps the model learn more complex spatial correlations;
Step S34, dynamic OpenPose parameter adjustment strategy optimization

In optimizing OpenPose, a human pose estimation framework, dynamic adjustment of the key parameters is central to improving model performance and keeping the learning process efficient. In this embodiment, therefore, fine-grained parameter tuning, namely adaptive dynamic adjustment of the heat map loss function weight, the PAF (Part Affinity Fields) loss function weight, and the convolution layer parameters, further enhances the model's prediction accuracy for human joints and limb structure. Specifically:
Heat map loss function weight adjustment strategy optimization:

The heat map is the core means by which OpenPose represents the positions of human key points. The loss function weight $\lambda_{h}$ is dynamically adjusted by gradient descent, updating $\lambda_{h}$ according to the current heat map loss $L_{\text{heatmap}}$:

$\lambda_{h}^{(t+1)} = \lambda_{h}^{(t)} - \eta\, \frac{\partial L_{\text{heatmap}}}{\partial \lambda_{h}}$

where $\eta$ is the learning rate, which controls the update step size and ensures that in each iteration the weight is adjusted in the direction that reduces the heat map loss, and $t$ is the current time step.

This dynamic adjustment mechanism lets the model focus its optimization on the difficult cases in the training data, improving the accuracy of key point localization.
PAF loss function weight adjustment strategy optimization:

The PAF is a vector field connecting different key points of the human body; the dynamic adjustment strategy for its loss function weight $\lambda_{p}$ is:

$\lambda_{p}^{(t+1)} = \lambda_{p}^{(t)} - \eta\, \frac{\partial L_{\text{PAF}}}{\partial \lambda_{p}}$

By monitoring changes in $L_{\text{PAF}}$ and adjusting $\lambda_{p}$ accordingly, the adjusted OpenPose model can focus on learning the correct connection relations between joints, reducing incorrect limb-structure configurations and improving the continuity and stability of pose estimation;
Adjustment and optimization of the convolution layer parameters:

The convolution layer parameters $\theta_{\text{conv}}$ are adjusted and optimized with a global-loss-guided method:

$\theta_{\text{conv}}^{(t+1)} = \theta_{\text{conv}}^{(t)} - \eta_{\text{conv}}\, \frac{\partial L}{\partial \theta_{\text{conv}}}$

where $\eta_{\text{conv}}$ is a learning rate set specifically for the convolution layer parameters.

This adjustment aims to balance the global optimization objective against the fineness of local feature extraction. The convolution kernel parameters are continuously fine-tuned through gradient feedback from the global loss $L$, so that the model captures high-level semantic features while preserving low-level detail and can accurately recognize individual poses against complex backgrounds.
Specifically, in step S4 the Transformer output is processed with a fuzzy algorithm as follows:

To reduce noise and prediction error, a fuzzy algorithm is applied to the Transformer model output to generate smoothed key point data:

$K_{\text{smooth}} = \mathrm{Fuzzy}(K)$

where $K$ is the raw Transformer output and $K_{\text{smooth}}$ denotes the key point data after fuzzy processing.
Specifically, the fuzzy algorithm in the above step is implemented as follows:
1) Define a fuzzy set $F$ containing all possible output values and their corresponding membership functions.
2) Construct a rule base used to evaluate and adjust the output of the Transformer model. For example, if an output value is too high or too low, a rule triggers a corresponding adjustment.
3) After obtaining the raw output $K$ of the Transformer, the designed fuzzy controller evaluates and adjusts $K$ according to the rules and outputs the adjusted key point data $K_{\text{smooth}}$.
4) Parameter selection and optimization: select appropriate membership functions and rule parameters, and determine the optimal controller parameters through experiments.
The algorithm pseudocode is as follows:
FOR each output K from the Transformer model DO
Evaluate K using the fuzzy rules R
IF K is too high THEN
Decrease K based on the corresponding fuzzy rule
ELSIF K is too low THEN
Increase K based on the corresponding fuzzy rule
ENDIF
Update K_smooth with the adjusted value of K
END FOR
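A runnable NumPy rendering of the pseudocode above; the membership thresholds `low` and `high` and the adjustment factor `step` are illustrative assumptions, since their values are left to experiment in step 4:

```python
import numpy as np

def fuzzy_smooth(K, low=0.2, high=0.8, step=0.5):
    # K: array of raw Transformer outputs; returns K_smooth
    K_smooth = np.asarray(K, dtype=float).copy()
    too_high = K_smooth > high   # rule: output too high -> decrease
    too_low = K_smooth < low     # rule: output too low  -> increase
    K_smooth[too_high] -= step * (K_smooth[too_high] - high)
    K_smooth[too_low] += step * (low - K_smooth[too_low])
    return K_smooth
```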
The experimental results in the following table show that the accuracy of the model is 84% before the algorithm processing and reaches 92% after it.

Table 1. Accuracy comparison of the model before and after algorithm processing

Model | Accuracy
---|---
Before algorithm processing | 84%
After algorithm processing | 92%
In conclusion, by combining the strong predictive capability of the Transformer model with the flexibility of OpenPose, the method of the invention markedly improves the accuracy of key point detection in dance motion capture, reduces pose estimation error, and improves real-time responsiveness at high frame rates; and by introducing dynamic position coding and multidimensional position coding, it improves the robustness of the model in complex scenes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any person skilled in the art may modify or adapt the disclosed technical content into equivalent embodiments. However, any simple modification or equivalent variation of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.
Claims (10)
1. A dance motion key point capturing method based on a Transformer fused with OpenPose, characterized by comprising the following steps:
Step S1, acquiring dance video frames: using high-frame-rate video acquisition equipment, collect dance motion video in real time in a well-lit environment, and decompose it frame by frame to obtain dance video frames;
Step S2, initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points;
Step S3, Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction;
Step S4, key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error, and output smoothed key point data.
2. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that the extraction of human body key points and hand key points by OpenPose in step S2 is as follows:

OpenPose is composed of a multi-stage convolutional neural network and is used to extract human body key points; its key point loss function is expressed as:

$L = L_{\text{heatmap}} + L_{\text{PAF}}$

where $L_{\text{heatmap}}$ is the loss on the key point heat maps and $L_{\text{PAF}}$ is the loss on the part affinity fields; PAF refers to the part affinity field, a vector field describing the association between different body parts, in which each vector represents the direction and intensity from one key point to another and is used to determine whether two key points belong to the same limb.
3. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that in step S3 the key points extracted by OpenPose are input into a pre-trained Transformer model for prediction and parameter adjustment, as follows:

The Transformer model is used for sequence prediction and parameter optimization and adopts a self-attention mechanism, given by:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$

where $Q$, $K$, and $V$ denote the query, key, and value matrices respectively, and $d_k$ is the dimension of the key vectors.
4. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that dynamically adjusting the OpenPose convolution layers and learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output in step S3, so as to optimize key point extraction, comprises:
Step S31, parameter adjustment optimization
A. Convolution layer parameter optimization: the key point data extracted by OpenPose are input into the Transformer model, and the hyperparameters of the OpenPose convolution layers are dynamically adjusted according to the Transformer model output, with the specific formula:

$k_{\text{new}} = W \cdot z + b$

where $W$ and $b$ are a learnable weight and bias parameter respectively, $k$ is the convolution kernel size, and $z$ is the Transformer output;

the key point data comprise the heat maps and the PAFs;
b. Loss function adjustment: based on the Transformer output, the weight parameters in the OpenPose loss function are adjusted to optimize key point extraction accuracy;

For the heat map loss:

$L_{\text{heatmap}} = \lambda_{h}\,\lVert \hat{H} - H^{*} \rVert_2^2$

where $\lambda_{h}$ is the weight predicted by the Transformer model, $\hat{H}$ is the predicted heat map, and $H^{*}$ is the ground-truth heat map;

For the PAF loss:

$L_{\text{PAF}} = \lambda_{p}\,\lVert \hat{P} - P^{*} \rVert_2^2$

where $\lambda_{p}$ is the weight predicted by the Transformer model, $\hat{P}$ is the predicted PAF, and $P^{*}$ is the ground-truth PAF;
Step S32, optimization of Multi-head attention mechanism
The key point data extracted by OpenPose are input into the optimized Transformer model, the attention weights between different key points are computed with a multi-head attention mechanism, and an adaptive weight adjustment mechanism is introduced, with the calculation formulas:

$\mathrm{head}_i = \alpha_i \cdot \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

where $\alpha_i$ is a weight parameter dynamically adjusted according to the key point data; $W_i^{Q}$ is the weight matrix of the query vector $Q$, converting it into a new query vector; $W_i^{K}$ is the weight matrix of the key vector $K$, converting it into a new key vector; and $W_i^{V}$ is the weight matrix of the value vector $V$, converting it into a new value vector;
Step S33, position coding optimization
Including dynamic position coding optimization and multidimensional position coding optimization;
Step S34, dynamic OpenPose parameter adjustment strategy optimization

comprising heat map loss function weight adjustment strategy optimization, PAF loss function weight adjustment strategy optimization, and adjustment and optimization of the convolution layer parameters.
5. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the dynamic position coding optimization in step S33 introduces dynamic position coding (DPE) into the Transformer model to address position sensitivity in complex sequence data processing, specifically comprising:

S331, environment perception initialization phase

Assume the original input sequence is $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ represents the word embedding at the $i$-th position. The environment perception network preprocesses $X$ and outputs a context vector:

$C = f_{\text{env}}(X)$

S332, dynamic coding generation network

For each position $i$ in the sequence, the dynamic coding generation network generates a dynamic position code $DPE_i$ from the position index $i$ and the context vector $C$, expressed as:

$DPE_i = \mathrm{MLP}\big([\,PE_i \,;\, C\,];\, \theta\big)$

where $PE_i$ is the basic position code of position index $i$, e.g. traditional static coding generated by $\sin/\cos$ functions; $[\cdot\,;\,\cdot]$ denotes vector concatenation; $\mathrm{MLP}$ denotes a multi-layer perceptron; and the parameters $\theta$ are responsible for learning how to adjust the code of each position based on the context $C$;

S333, fusion strategy

The dynamic position code $DPE_i$ and the word embedding $x_i$ are fused by an addition operation to form the enhanced input representation $x_i' = x_i + DPE_i$;

S334, end-to-end training loss function

The overall goal of the Transformer model is to minimize the loss function $L$, averaged over all samples, thereby optimizing all parameters including those of the dynamic position coding:

$L(\theta) = \mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell\big(f_{\theta}(X'),\, Y\big)\big]$

where $(X, Y)$ denotes an input-output pair sampled from the dataset $\mathcal{D}$; $f_{\theta}$ is the Transformer model with dynamic position coding; $\theta$ contains all learnable parameters; and $X'$ denotes the processed form of the original input sequence $X$, i.e. the enhanced input representation obtained by the fusion strategy of step S333.
6. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the multidimensional position coding optimization in step S33 introduces multidimensional position coding over the spatial dimensions so that the Transformer model can simultaneously account for the position changes of the key points in space and time, specifically as follows:

For a two-dimensional structure, the position of each pixel or image block in the image can be encoded using two-dimensional sine and cosine functions. The original position coding in the Transformer model is extended to two dimensions, and for each position $(x, y)$ the code is defined as:

$PE_{(x, y, 2i)} = \sin\!\big(x / 10000^{2i/d}\big)$

$PE_{(x, y, 2i+1)} = \cos\!\big(y / 10000^{2i/d}\big)$

where $d$ is the number of dimensions of the position code, usually the embedding dimension of the Transformer model divided by 2, so that the position code dimension matches the model embedding dimension; $t$ denotes the current time step.
7. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the heat map loss function weight adjustment strategy optimization in step S34 proceeds as follows:

The heat map is the core means by which OpenPose represents the positions of human key points. The loss function weight $\lambda_{h}$ is dynamically adjusted by gradient descent, updating $\lambda_{h}$ according to the current heat map loss $L_{\text{heatmap}}$:

$\lambda_{h}^{(t+1)} = \lambda_{h}^{(t)} - \eta\, \frac{\partial L_{\text{heatmap}}}{\partial \lambda_{h}}$

where $\eta$ is the learning rate, which controls the update step size and ensures that in each iteration the weight is adjusted in the direction that reduces the heat map loss, and $t$ is the current time step.
8. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that, for the PAF loss function weight adjustment strategy optimization in step S34, the PAF is a vector field connecting different key points of the human body and the dynamic adjustment strategy for its loss function weight $\lambda_{p}$ is:

$\lambda_{p}^{(t+1)} = \lambda_{p}^{(t)} - \eta\, \frac{\partial L_{\text{PAF}}}{\partial \lambda_{p}}$

By monitoring changes in $L_{\text{PAF}}$ and adjusting $\lambda_{p}$ accordingly, the OpenPose model can focus on learning the correct connection relations between joints, reducing incorrect limb-structure configurations and improving the continuity and stability of pose estimation.
9. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the adjustment and optimization of the convolution layer parameters in step S34 adjusts the convolution layer parameters $\theta_{\text{conv}}$ with a global-loss-guided method:

$\theta_{\text{conv}}^{(t+1)} = \theta_{\text{conv}}^{(t)} - \eta_{\text{conv}}\, \frac{\partial L}{\partial \theta_{\text{conv}}}$

where $\eta_{\text{conv}}$ is a learning rate set specifically for the convolution layer parameters.
10. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that the processing of the Transformer output with the fuzzy algorithm in step S4 is as follows:

To reduce noise and prediction error, a fuzzy algorithm is applied to the Transformer model output to generate smoothed key point data:

$K_{\text{smooth}} = \mathrm{Fuzzy}(K)$

where $K$ is the raw Transformer output and $K_{\text{smooth}}$ denotes the key point data after fuzzy processing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410932523.3A CN118470805B (en) | 2024-07-12 | 2024-07-12 | Dancing action key point capturing method based on transformer fusion openpose |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410932523.3A CN118470805B (en) | 2024-07-12 | 2024-07-12 | Dancing action key point capturing method based on transformer fusion openpose |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118470805A true CN118470805A (en) | 2024-08-09 |
CN118470805B CN118470805B (en) | 2024-09-17 |
Family
ID=92150338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410932523.3A Active CN118470805B (en) | 2024-07-12 | 2024-07-12 | Dancing action key point capturing method based on transformer fusion openpose |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118470805B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220358310A1 (en) * | 2021-05-06 | 2022-11-10 | Kuo-Yi Lin | Professional dance evaluation method for implementing human pose estimation based on deep transfer learning |
KR20230035770A (en) * | 2021-09-06 | 2023-03-14 | 주식회사 이엠피이모션캡쳐 | System and method for providing dance learning based on artificial intelligence |
US11763485B1 (en) * | 2022-04-20 | 2023-09-19 | Anhui University of Engineering | Deep learning based robot target recognition and motion detection method, storage medium and apparatus |
US11810366B1 (en) * | 2022-09-22 | 2023-11-07 | Zhejiang Lab | Joint modeling method and apparatus for enhancing local features of pedestrians |
WO2024060321A1 (en) * | 2022-09-22 | 2024-03-28 | 之江实验室 | Joint modeling method and apparatus for enhancing local features of pedestrians |
CN115798045A (en) * | 2022-11-30 | 2023-03-14 | 国网浙江省电力有限公司嘉兴供电公司 | Depth-adaptive Transformer-based human body posture identification method and system |
CN116246338A (en) * | 2022-12-20 | 2023-06-09 | 西南交通大学 | Behavior recognition method based on graph convolution and transducer composite neural network |
CN117316129A (en) * | 2023-10-25 | 2023-12-29 | 广东外语外贸大学 | Method, equipment and storage medium for generating dance gesture based on multi-modal feature fusion |
CN117765564A (en) * | 2023-11-27 | 2024-03-26 | 中国电信股份有限公司 | User gesture recognition method and device, electronic equipment and storage medium |
Non-Patent Citations (4)
Title |
---|
B. VERSTEEG;A. BEGEN;CISCO;T. VANCAENEGEM;ALCATEL-LUCENT;Z. VAX; MICROSOFT CORPORATION;: "Unicast-Based Rapid Acquisition of Multicast RTP Sessions draft-ietf-avt-rapid-acquisition-for-rtp-16", IETF, 16 October 2010 (2010-10-16) * |
Zeng Jialu: "Research on the dance performance forms in Dejiang Nuotang opera", Master's thesis, Jiangxi Normal University, 15 March 2020 (2020-03-15) *
Yang Aolei; Zhou Yinghong; Yang Banghua; Xu Yulin: "Transformer-based 3D human pose estimation and action attainment evaluation", Chinese Journal of Scientific Instrument, 15 April 2024 (2024-04-15) *
Bi Xuechao: "Research on dance action recognition technology based on multi-feature fusion", Electronic Design Engineering, no. 18, 18 September 2020 (2020-09-18) *
Also Published As
Publication number | Publication date |
---|---|
CN118470805B (en) | 2024-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Song et al. | Constructing stronger and faster baselines for skeleton-based action recognition | |
Mao et al. | History repeats itself: Human motion prediction via motion attention | |
Mao et al. | Multi-level motion attention for human motion prediction | |
CN113496507B (en) | Human body three-dimensional model reconstruction method | |
CN112037310A (en) | Game character action recognition generation method based on neural network | |
CN111899320A (en) | Data processing method, and training method and device of dynamic capture denoising model | |
CN110807380B (en) | Human body key point detection method and device | |
CN112258555A (en) | Real-time attitude estimation motion analysis method, system, computer equipment and storage medium | |
Liu | Aerobics posture recognition based on neural network and sensors | |
CN113989928A (en) | Motion capturing and redirecting method | |
CN111862278A (en) | Animation obtaining method and device, electronic equipment and storage medium | |
CN114036969A (en) | 3D human body action recognition algorithm under multi-view condition | |
CN114399829B (en) | Posture migration method based on generative countermeasure network, electronic device and medium | |
CN117218246A (en) | Training method and device for image generation model, electronic equipment and storage medium | |
CN118429459A (en) | Multimode nuclear magnetic resonance image reconstruction method based on deformable convolution feature alignment | |
CN116469175B (en) | Visual interaction method and system for infant education | |
CN118470805B (en) | Dancing action key point capturing method based on transformer fusion openpose | |
CN113240714A (en) | Human motion intention prediction method based on context-aware network | |
CN117437467A (en) | Model training method and device, electronic equipment and storage medium | |
CN117011357A (en) | Human body depth estimation method and system based on 3D motion flow and normal map constraint | |
CN116543104A (en) | Human body three-dimensional model construction method, electronic equipment and storage medium | |
KR20230083212A (en) | Apparatus and method for estimating object posture | |
Molnár et al. | Variational autoencoders for 3D data processing | |
CN112907456A (en) | Deep neural network image denoising method based on global smooth constraint prior model | |
CN109190474A (en) | Human body animation extraction method of key frame based on posture conspicuousness |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |