CN118470805A - Dancing action key point capturing method based on transformer fusion openpose - Google Patents
Dancing action key point capturing method based on transformer fusion openpose
- Publication number
- CN118470805A (application CN202410932523.3A)
- Authority
- CN
- China
- Prior art keywords
- openpose
- transducer
- key point
- key
- optimization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Image Processing (AREA)
Abstract
The invention discloses a dance motion key point capturing method based on a Transformer fused with OpenPose, comprising the following steps: 1) Obtaining dance video frames: collect dance motion video in real time and decompose it frame by frame to obtain dance video frames; 2) Initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points; 3) Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction; 4) Key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error and output smoothed key point data. By combining the strong predictive capability of the Transformer model with the flexibility of OpenPose, the method markedly improves the accuracy and real-time performance of dance motion capture and has broad application prospects.
Description
Technical Field
The invention belongs to the technical field of motion capture, and particularly relates to a dance motion key point capturing method based on a Transformer fused with OpenPose.
Background
With the rapid development of virtual reality and augmented reality technologies, motion capture is increasingly used in fields such as entertainment, medicine, and sports. Existing motion capture techniques rely primarily on deep learning and computer vision, among which OpenPose is a widely used open-source pose estimation framework. However, OpenPose still has room for improvement in real-time performance and key point prediction accuracy. Regarding real-time performance, OpenPose is based on a multi-stage convolutional neural network; although the complex network architecture improves detection precision, it increases the computational burden and slows real-time processing, and when hardware resources are limited, the allocation and utilization of computing resources are neither flexible nor efficient. Regarding key point prediction, OpenPose is easily disturbed by noise against complex backgrounds or during rapid motion, making key point detection inaccurate. Moreover, although OpenPose can extract human key points, its prediction accuracy may degrade under complex articulation or occlusion, and for fast-changing or complex human actions it may struggle to capture all details accurately and in real time.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a dance motion key point capturing method based on a Transformer fused with OpenPose, which markedly improves the accuracy and real-time performance of dance motion capture by combining the strong predictive capability of a Transformer model with the flexibility of OpenPose, and has broad application prospects.
In order to achieve the technical purpose, the invention adopts the following technical scheme.
A dance motion key point capturing method based on a Transformer fused with OpenPose comprises the following steps:
Step S1, acquiring dance video frames: using high-frame-rate (60 fps and above) video acquisition equipment, collect dance motion video in real time in a well-lit environment, and decompose the video frame by frame to obtain dance video frames;
Step S2, initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points;
Step S3, Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction;
Step S4, key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error, and output smoothed key point data (an end-to-end sketch of these four steps follows this list).
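The four steps can be wired together in a minimal end-to-end sketch in Python. The callables `run_openpose`, `transformer_refine`, and `fuzzy_smooth` are hypothetical placeholders for the OpenPose inference, Transformer refinement, and fuzzy post-processing stages described below; only the frame decomposition uses a real library call (OpenCV).

```python
import cv2  # OpenCV, used here only for frame-by-frame decomposition (step S1)

def capture_dance_keypoints(video_path, run_openpose, transformer_refine, fuzzy_smooth):
    """Sketch of steps S1-S4; the three callables are assumed, not defined by the patent."""
    cap = cv2.VideoCapture(video_path)
    smoothed = []
    while True:
        ok, frame = cap.read()                    # S1: next dance video frame
        if not ok:
            break
        keypoints = run_openpose(frame)           # S2: body + hand key points
        refined = transformer_refine(keypoints)   # S3: Transformer prediction/adjustment
        smoothed.append(fuzzy_smooth(refined))    # S4: fuzzy smoothing
    cap.release()
    return smoothed
```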
Specifically, the extraction of human body key points and hand key points by OpenPose in step S2 is as follows:
OpenPose is composed of a multi-stage convolutional neural network and is used to extract human body key points; its key point loss function is expressed as:

$L = L_{\text{heatmap}} + L_{\text{PAF}}$

where $L_{\text{heatmap}}$ is the loss on the key point heat maps and $L_{\text{PAF}}$ is the loss on the part affinity fields; PAF refers to the part affinity field, a vector field describing the association between different body parts, in which each vector represents the direction and intensity from one key point to another and is used to determine whether two key points belong to the same limb.
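As a concrete reading of this loss, a minimal NumPy sketch; collapsing the stage-wise sums of the original OpenPose formulation into single mean-squared-error terms is an assumption made for brevity:

```python
import numpy as np

def openpose_loss(heatmap_pred, heatmap_gt, paf_pred, paf_gt):
    # L = L_heatmap + L_PAF, each term taken as a mean squared error
    l_heatmap = np.mean((heatmap_pred - heatmap_gt) ** 2)
    l_paf = np.mean((paf_pred - paf_gt) ** 2)
    return l_heatmap + l_paf
```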
Specifically, in step S3 the key points extracted by OpenPose are input into a pre-trained Transformer model for prediction and parameter adjustment, as follows:
The Transformer model is used for sequence prediction and parameter optimization and adopts a self-attention mechanism, given by:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$

where $Q$, $K$, and $V$ denote the query, key, and value matrices respectively, and $d_k$ is the dimension of the key vectors.
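This is the standard scaled dot-product attention; a self-contained NumPy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V
```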
Specifically, in step S3 the OpenPose convolution layers and learning rate, the number of attention heads, and the rotation angle of the video frames are dynamically adjusted based on the Transformer output so as to optimize key point extraction; the optimization comprises:
Step S31, parameter adjustment optimization
A. Convolution layer parameter optimization: the key point data extracted by OpenPose are input into the Transformer model, and the hyperparameters of the OpenPose convolution layers are dynamically adjusted according to the Transformer model output, with the specific formula:

$k_{\text{new}} = W \cdot z + b$

where $W$ and $b$ are a learnable weight and bias parameter respectively, $k$ is the convolution kernel size, and $z$ is the Transformer output;

the key point data comprise the heat maps and the PAFs;
b. Loss function adjustment: based on the Transformer output, the weight parameters in the OpenPose loss function are adjusted to optimize key point extraction accuracy;
For the heat map loss:

$L_{\text{heatmap}} = \lambda_{h}\,\lVert \hat{H} - H^{*} \rVert_2^2$

where $\lambda_{h}$ is the weight predicted by the Transformer model, $\hat{H}$ is the predicted heat map, and $H^{*}$ is the ground-truth heat map;
For the PAF loss:

$L_{\text{PAF}} = \lambda_{p}\,\lVert \hat{P} - P^{*} \rVert_2^2$

where $\lambda_{p}$ is the weight predicted by the Transformer model, $\hat{P}$ is the predicted PAF, and $P^{*}$ is the ground-truth PAF;
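Taken together, the two adjusted loss terms can be sketched as follows; treating the Transformer-predicted weights as scalars is an assumption:

```python
import numpy as np

def adjusted_losses(lam_h, lam_p, H_pred, H_gt, P_pred, P_gt):
    # lam_h, lam_p: weights predicted by the Transformer (assumed scalar here)
    L_heatmap = lam_h * np.sum((H_pred - H_gt) ** 2)
    L_paf = lam_p * np.sum((P_pred - P_gt) ** 2)
    return L_heatmap + L_paf
```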
Step S32, multi-head attention mechanism optimization

The key point data extracted by OpenPose are input into the optimized Transformer model, the attention weights between different key points are computed with a multi-head attention mechanism, and an adaptive weight adjustment mechanism is introduced, with the calculation formulas:

$\mathrm{head}_i = \alpha_i \cdot \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

where $\alpha_i$ is a weight parameter dynamically adjusted according to the key point data; $W_i^{Q}$ is the weight matrix of the query vector $Q$, converting it into a new query vector; $W_i^{K}$ is the weight matrix of the key vector $K$, converting it into a new key vector; and $W_i^{V}$ is the weight matrix of the value vector $V$, converting it into a new value vector.
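A sketch of the adaptive multi-head computation, reusing `attention()` from the sketch above; scaling each head's output by its adaptive weight $\alpha_i$ is one plausible reading of the mechanism, not a form the patent fixes:

```python
import numpy as np

def adaptive_multihead(Q, K, V, Wq, Wk, Wv, Wo, alphas):
    # Wq, Wk, Wv: per-head projection matrices; alphas: per-head adaptive
    # weights derived from the key point data (assumed given here).
    heads = [a * attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv, a in zip(Wq, Wk, Wv, alphas)]
    return np.concatenate(heads, axis=-1) @ Wo
```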
Step S33, position coding optimization

Dynamic position coding optimization: dynamic position coding (DPE) is introduced into the Transformer model to address position sensitivity in complex sequence data processing; the optimization comprises:
S331, environment perception initialization phase

Assume the original input sequence is $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ represents the word embedding at the $i$-th position. The environment perception network preprocesses $X$ and outputs a context vector:

$C = f_{\text{env}}(X)$
S332, dynamic coding generation network

For each position $i$ in the sequence, the dynamic coding generation network generates a dynamic position code $DPE_i$ from the position index $i$ and the context vector $C$, expressed as:

$DPE_i = \mathrm{MLP}\big([\,PE_i \,;\, C\,];\, \theta\big)$

where $PE_i$ is the basic position code of position index $i$, e.g. traditional static coding generated by $\sin/\cos$ functions; $[\cdot\,;\,\cdot]$ denotes vector concatenation; $\mathrm{MLP}$ denotes a multi-layer perceptron; and the parameters $\theta$ are responsible for learning how to adjust the code of each position based on the context $C$;
S333, fusion strategy

The dynamic position code $DPE_i$ and the word embedding $x_i$ are fused by an addition operation to form the enhanced input representation $x_i' = x_i + DPE_i$;
S334, end-to-end training loss function

The overall goal of the Transformer model is to minimize the loss function $L$, averaged over all samples, thereby optimizing all parameters including those of the dynamic position coding:

$L(\theta) = \mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell\big(f_{\theta}(X'),\, Y\big)\big]$

where $(X, Y)$ denotes an input-output pair sampled from the dataset $\mathcal{D}$; $f_{\theta}$ is the Transformer model with dynamic position coding; $\theta$ contains all learnable parameters; and $X'$ denotes the processed form of the original input sequence $X$, i.e. the enhanced input representation obtained by the fusion strategy of step S333;
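Steps S331 to S334 can be condensed into a short sketch; the environment perception network and the MLP are passed in as plain callables, which is an assumption standing in for the patent's unspecified architectures:

```python
import numpy as np

def base_pe(i, d):
    # PE_i: traditional sin/cos static encoding for position index i
    j = np.arange(d)
    angles = i / np.power(10000.0, (2 * (j // 2)) / d)
    return np.where(j % 2 == 0, np.sin(angles), np.cos(angles))

def dynamic_position_encode(X, env_net, mlp):
    # X: (n, d) word embeddings; env_net and mlp are assumed callables
    C = env_net(X)                                   # S331: context vector
    dpe = np.stack([mlp(np.concatenate([base_pe(i, X.shape[1]), C]))
                    for i in range(X.shape[0])])     # S332: DPE_i
    return X + dpe                                   # S333: x'_i = x_i + DPE_i
```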
Multidimensional position coding optimization: by introducing multidimensional position coding over the spatial dimensions, the Transformer model can simultaneously account for the position changes of the key points in space and time; the optimization is as follows:
For a two-dimensional structure, the position of each pixel or image block in the image can be encoded using two-dimensional sine and cosine functions. The original position coding in the Transformer model is extended to two dimensions, and for each position $(x, y)$ the code is defined as:

$PE_{(x, y, 2i)} = \sin\!\big(x / 10000^{2i/d}\big)$

$PE_{(x, y, 2i+1)} = \cos\!\big(y / 10000^{2i/d}\big)$

where $d$ is the number of dimensions of the position code, usually the embedding dimension of the Transformer model divided by 2, so that the position code dimension matches the model embedding dimension; $t$ denotes the current time step.
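A sketch of this two-dimensional extension; placing x in the first half of the channels and y in the second half (each with sin/cos pairs) is one common layout and an assumption here:

```python
import numpy as np

def positional_encoding_2d(x, y, d):
    # d: channels per axis (model embedding dim / 2, assumed even);
    # returns a vector of length 2*d covering both axes
    i = np.arange(d // 2)
    div = np.power(10000.0, (2 * i) / d)
    pe = np.zeros(2 * d)
    pe[0:d:2] = np.sin(x / div)      # x axis, even channels
    pe[1:d:2] = np.cos(x / div)      # x axis, odd channels
    pe[d::2] = np.sin(y / div)       # y axis, even channels
    pe[d + 1::2] = np.cos(y / div)   # y axis, odd channels
    return pe
```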
Step S34, dynamically adjusting OpenPose parameter policy optimization
Heat map loss function weight adjustment policy optimization:
The heat map is a core means for representing the positions of key points of human bodies in OpenPose, and is used for weighting loss functions Adopts a gradient descent method for dynamic adjustment of (1) according to the current heat map lossFor weightUpdating:
;
In the above-mentioned method, the step of, Representing the learning rate, controlling the updating step length, and ensuring that the weight can be adjusted towards the direction of reducing the heat map loss in each iteration; representing the current time step number;
PAF loss function weight adjustment strategy optimization:

The PAF is a vector field connecting different key points of the human body; the dynamic adjustment strategy for its loss function weight $\lambda_{p}$ is:

$\lambda_{p}^{(t+1)} = \lambda_{p}^{(t)} - \eta\, \frac{\partial L_{\text{PAF}}}{\partial \lambda_{p}}$

By monitoring changes in $L_{\text{PAF}}$ and adjusting $\lambda_{p}$ accordingly, the OpenPose model can focus on learning the correct connection relations between joints, reducing incorrect limb-structure configurations and improving the continuity and stability of pose estimation;
Adjustment and optimization of the convolution layer parameters:

The convolution layer parameters $\theta_{\text{conv}}$ are adjusted and optimized with a global-loss-guided method:

$\theta_{\text{conv}}^{(t+1)} = \theta_{\text{conv}}^{(t)} - \eta_{\text{conv}}\, \frac{\partial L}{\partial \theta_{\text{conv}}}$

where $\eta_{\text{conv}}$ is a learning rate set specifically for the convolution layer parameters.
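The three update rules of step S34 share the same gradient-descent form; a compact sketch (how the partial derivatives are obtained, e.g. by automatic differentiation, is left open by the description, and the learning-rate values are illustrative assumptions):

```python
def s34_update(lam_h, lam_p, theta_conv, g_lam_h, g_lam_p, g_theta,
               eta=1e-3, eta_conv=1e-4):
    # One iteration of w <- w - eta * dL/dw for each adjustable quantity
    lam_h = lam_h - eta * g_lam_h                 # heat map loss weight
    lam_p = lam_p - eta * g_lam_p                 # PAF loss weight
    theta_conv = theta_conv - eta_conv * g_theta  # convolution layer parameters
    return lam_h, lam_p, theta_conv
```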
Specifically, in step S4 the Transformer output is processed with a fuzzy algorithm as follows:

To reduce noise and prediction error, a fuzzy algorithm is applied to the Transformer model output to generate smoothed key point data:

$K_{\text{smooth}} = \mathrm{Fuzzy}(K)$

where $K$ is the raw Transformer output and $K_{\text{smooth}}$ denotes the key point data after fuzzy processing.
Compared with the prior art, the invention has the following beneficial effects:
1. The method of the invention combines the strong predictive capability of the Transformer model with the flexibility of OpenPose, markedly improving the accuracy of key point detection in dance motion capture and reducing pose estimation error.
2. By combining the efficient processing capability of the Transformer with the rapid pose detection of OpenPose, the invention improves real-time responsiveness at high frame rates.
3. By introducing dynamic position coding and multidimensional position coding, the method improves the robustness of the model in complex scenes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present disclosure, and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow chart of the method of the present invention based on a Transformer fused with OpenPose.
Detailed Description
In order to facilitate the understanding and practice of the present application, a detailed description of the various steps of the method presented herein will follow, with the understanding that these examples are intended to illustrate the application and are not intended to limit the scope of the application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
Examples
As shown in FIG. 1, a dance motion key point capturing method based on a Transformer fused with OpenPose comprises the following steps:

Step S1, acquiring dance video frames: using high-frame-rate (60 fps and above) video acquisition equipment, collect dance motion video in real time in a well-lit environment, and decompose the video frame by frame to obtain dance video frames;

Step S2, initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points;

Step S3, Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction;

Step S4, key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error, and output smoothed key point data.
Specifically, the extraction of human body key points and hand key points by OpenPose in step S2 is as follows:

OpenPose is composed of a multi-stage convolutional neural network and is used to extract human body key points; its key point loss function is expressed as:

$L = L_{\text{heatmap}} + L_{\text{PAF}}$

where $L_{\text{heatmap}}$ is the loss on the key point heat maps and $L_{\text{PAF}}$ is the loss on the part affinity fields; PAF refers to the part affinity field, a vector field describing the association between different body parts, in which each vector represents the direction and intensity from one key point to another and is used to determine whether two key points belong to the same limb.
Specifically, in step S3 the key points extracted by OpenPose are input into a pre-trained Transformer model for prediction and parameter adjustment, as follows:

The Transformer model is used for sequence prediction and parameter optimization and adopts a self-attention mechanism, given by:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$

where $Q$, $K$, and $V$ denote the query, key, and value matrices respectively, and $d_k$ is the dimension of the key vectors.
Specifically, in step S3 the OpenPose convolution layers and learning rate, the number of attention heads, and the rotation angle of the video frames are dynamically adjusted based on the Transformer output so as to optimize key point extraction; the optimization comprises:
Step S31, parameter adjustment optimization
A. Convolution layer parameter optimization: the key point data extracted by OpenPose are input into the Transformer model, and the hyperparameters of the OpenPose convolution layers are dynamically adjusted according to the Transformer model output, with the specific formula:

$k_{\text{new}} = W \cdot z + b$

where $W$ and $b$ are a learnable weight and bias parameter respectively, $k$ is the convolution kernel size, and $z$ is the Transformer output;

the key point data comprise the heat maps and the PAFs;
b. Loss function adjustment: based on the Transformer output, the weight parameters in the OpenPose loss function are adjusted to optimize key point extraction accuracy;
For the heat map loss:

$L_{\text{heatmap}} = \lambda_{h}\,\lVert \hat{H} - H^{*} \rVert_2^2$

where $\lambda_{h}$ is the weight predicted by the Transformer model, $\hat{H}$ is the predicted heat map, and $H^{*}$ is the ground-truth heat map;
For the PAF loss:

$L_{\text{PAF}} = \lambda_{p}\,\lVert \hat{P} - P^{*} \rVert_2^2$

where $\lambda_{p}$ is the weight predicted by the Transformer model, $\hat{P}$ is the predicted PAF, and $P^{*}$ is the ground-truth PAF;
Step S32, multi-head attention mechanism optimization

The key point data extracted by OpenPose are input into the optimized Transformer model, the attention weights between different key points are computed with a multi-head attention mechanism, and an adaptive weight adjustment mechanism is introduced, with the calculation formulas:

$\mathrm{head}_i = \alpha_i \cdot \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

where $\alpha_i$ is a weight parameter dynamically adjusted according to the key point data; $W_i^{Q}$ is the weight matrix of the query vector $Q$, converting it into a new query vector; $W_i^{K}$ is the weight matrix of the key vector $K$, converting it into a new key vector; and $W_i^{V}$ is the weight matrix of the value vector $V$, converting it into a new value vector.
Step S33, position coding optimization

Dynamic position coding optimization: DPE adaptively generates position coding vectors according to the real-time characteristics of the input sequence, strengthening the model's ability to capture positional relations among sequence elements; it is especially suitable for variable-length sequences and tasks involving complex structure. The specific optimization comprises:
S331, environment perception initialization phase

Assume the original input sequence is $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ represents the word embedding at the $i$-th position. The environment perception network preprocesses $X$ and outputs a context vector:

$C = f_{\text{env}}(X)$

This vector guides the subsequent dynamic position code generation;
S332, dynamic coding generation network

For each position $i$ in the sequence, the dynamic coding generation network generates a dynamic position code $DPE_i$ from the position index $i$ and the context vector $C$, expressed as:

$DPE_i = \mathrm{MLP}\big([\,PE_i \,;\, C\,];\, \theta\big)$

where $PE_i$ is the basic position code of position index $i$, e.g. traditional static coding generated by $\sin/\cos$ functions; $[\cdot\,;\,\cdot]$ denotes vector concatenation; $\mathrm{MLP}$ denotes a multi-layer perceptron; and the parameters $\theta$ are responsible for learning how to adjust the code of each position based on the context $C$;
S333, fusion strategy

The dynamic position code $DPE_i$ and the word embedding $x_i$ are fused by an addition operation to form the enhanced input representation $x_i' = x_i + DPE_i$;
S334, end-to-end training loss function

The overall goal of the Transformer model is to minimize the loss function $L$, averaged over all samples, thereby optimizing all parameters including those of the dynamic position coding:

$L(\theta) = \mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell\big(f_{\theta}(X'),\, Y\big)\big]$

where $(X, Y)$ denotes an input-output pair sampled from the dataset $\mathcal{D}$; $f_{\theta}$ is the Transformer model with dynamic position coding; $\theta$ contains all learnable parameters; and $X'$ denotes the processed form of the original input sequence $X$, i.e. the enhanced input representation obtained by the fusion strategy of step S333;
Multidimensional position coding optimization: by introducing multidimensional position coding over the spatial dimensions, the Transformer model can simultaneously account for the position changes of the key points in space and time; the optimization is as follows:
For a two-dimensional structure, the position of each pixel or image block in the image can be encoded using two-dimensional sine and cosine functions. The original position coding in the Transformer model is extended to two dimensions, and for each position $(x, y)$ the code is defined as:

$PE_{(x, y, 2i)} = \sin\!\big(x / 10000^{2i/d}\big)$

$PE_{(x, y, 2i+1)} = \cos\!\big(y / 10000^{2i/d}\big)$

where $d$ is the number of dimensions of the position code, usually the embedding dimension of the Transformer model divided by 2, so that the position code dimension matches the model embedding dimension; $t$ denotes the current time step;
In this embodiment, different scaling factors are generally applied to the $x$ and $y$ terms to distinguish the importance of the two dimensions, which helps the model learn more complex spatial correlations;
Step S34, dynamic OpenPose parameter adjustment strategy optimization

In optimizing OpenPose, a human pose estimation framework, dynamic adjustment of the key parameters is central to improving model performance and keeping the learning process efficient. In this embodiment, therefore, fine-grained parameter tuning, namely adaptive dynamic adjustment of the heat map loss function weight, the PAF (Part Affinity Fields) loss function weight, and the convolution layer parameters, further enhances the model's prediction accuracy for human joints and limb structure. Specifically:
Heat map loss function weight adjustment strategy optimization:

The heat map is the core means by which OpenPose represents the positions of human key points. The loss function weight $\lambda_{h}$ is dynamically adjusted by gradient descent, updating $\lambda_{h}$ according to the current heat map loss $L_{\text{heatmap}}$:

$\lambda_{h}^{(t+1)} = \lambda_{h}^{(t)} - \eta\, \frac{\partial L_{\text{heatmap}}}{\partial \lambda_{h}}$

where $\eta$ is the learning rate, which controls the update step size and ensures that in each iteration the weight is adjusted in the direction that reduces the heat map loss, and $t$ is the current time step.

This dynamic adjustment mechanism lets the model focus its optimization on the difficult cases in the training data, improving the accuracy of key point localization.
PAF loss function weight adjustment strategy optimization:

The PAF is a vector field connecting different key points of the human body; the dynamic adjustment strategy for its loss function weight $\lambda_{p}$ is:

$\lambda_{p}^{(t+1)} = \lambda_{p}^{(t)} - \eta\, \frac{\partial L_{\text{PAF}}}{\partial \lambda_{p}}$

By monitoring changes in $L_{\text{PAF}}$ and adjusting $\lambda_{p}$ accordingly, the adjusted OpenPose model can focus on learning the correct connection relations between joints, reducing incorrect limb-structure configurations and improving the continuity and stability of pose estimation;
Adjustment and optimization of the convolution layer parameters:

The convolution layer parameters $\theta_{\text{conv}}$ are adjusted and optimized with a global-loss-guided method:

$\theta_{\text{conv}}^{(t+1)} = \theta_{\text{conv}}^{(t)} - \eta_{\text{conv}}\, \frac{\partial L}{\partial \theta_{\text{conv}}}$

where $\eta_{\text{conv}}$ is a learning rate set specifically for the convolution layer parameters.

This adjustment aims to balance the global optimization objective against the fineness of local feature extraction. The convolution kernel parameters are continuously fine-tuned through gradient feedback from the global loss $L$, so that the model captures high-level semantic features while preserving low-level detail and can accurately recognize individual poses against complex backgrounds.
Specifically, in step S4 the Transformer output is processed with a fuzzy algorithm as follows:

To reduce noise and prediction error, a fuzzy algorithm is applied to the Transformer model output to generate smoothed key point data:

$K_{\text{smooth}} = \mathrm{Fuzzy}(K)$

where $K$ is the raw Transformer output and $K_{\text{smooth}}$ denotes the key point data after fuzzy processing.
Specifically, the fuzzy algorithm in the above step is implemented as follows:
1) Define a fuzzy set $F$ containing all possible output values and their corresponding membership functions.
2) Construct a rule base used to evaluate and adjust the output of the Transformer model. For example, if an output value is too high or too low, a rule triggers a corresponding adjustment.
3) After obtaining the raw output $K$ of the Transformer, the designed fuzzy controller evaluates and adjusts $K$ according to the rules and outputs the adjusted key point data $K_{\text{smooth}}$.
4) Parameter selection and optimization: select appropriate membership functions and rule parameters, and determine the optimal controller parameters through experiments.
The algorithm pseudocode is as follows:
FOR each output K from the Transformer model DO
Evaluate K using the fuzzy rules R
IF K is too high THEN
Decrease K based on the corresponding fuzzy rule
ELSIF K is too low THEN
Increase K based on the corresponding fuzzy rule
ENDIF
Update K_smooth with the adjusted value of K
END FOR
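A runnable NumPy rendering of the pseudocode above; the membership thresholds `low` and `high` and the adjustment factor `step` are illustrative assumptions, since their values are left to experiment in step 4:

```python
import numpy as np

def fuzzy_smooth(K, low=0.2, high=0.8, step=0.5):
    # K: array of raw Transformer outputs; returns K_smooth
    K_smooth = np.asarray(K, dtype=float).copy()
    too_high = K_smooth > high   # rule: output too high -> decrease
    too_low = K_smooth < low     # rule: output too low  -> increase
    K_smooth[too_high] -= step * (K_smooth[too_high] - high)
    K_smooth[too_low] += step * (low - K_smooth[too_low])
    return K_smooth
```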
The experimental results in the following table show that the accuracy of the model is 84% before the algorithm processing and reaches 92% after it.

Table 1. Accuracy comparison of the model before and after algorithm processing

Model | Accuracy
---|---
Before algorithm processing | 84%
After algorithm processing | 92%
In conclusion, by combining the strong predictive capability of the Transformer model with the flexibility of OpenPose, the method of the invention markedly improves the accuracy of key point detection in dance motion capture, reduces pose estimation error, and improves real-time responsiveness at high frame rates; and by introducing dynamic position coding and multidimensional position coding, it improves the robustness of the model in complex scenes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any person skilled in the art may modify or adapt the disclosed technical content into equivalent embodiments. However, any simple modification or equivalent variation of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.
Claims (10)
1. A dance motion key point capturing method based on a Transformer fused with OpenPose, characterized by comprising the following steps:
Step S1, acquiring dance video frames: using high-frame-rate video acquisition equipment, collect dance motion video in real time in a well-lit environment, and decompose it frame by frame to obtain dance video frames;
Step S2, initializing OpenPose: configure and start OpenPose, process the dance video frames, and extract human body key points and hand key points;
Step S3, Transformer model integration and parameter optimization: input the key points extracted by OpenPose into a pre-trained Transformer model for prediction and parameter adjustment, and dynamically adjust the OpenPose convolution layers, the learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output, so as to optimize key point extraction;
Step S4, key point fuzzy processing: process the Transformer output with a fuzzy algorithm to reduce noise and prediction error, and output smoothed key point data.
2. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that the extraction of human body key points and hand key points by OpenPose in step S2 is as follows:

OpenPose is composed of a multi-stage convolutional neural network and is used to extract human body key points; its key point loss function is expressed as:

$L = L_{\text{heatmap}} + L_{\text{PAF}}$

where $L_{\text{heatmap}}$ is the loss on the key point heat maps and $L_{\text{PAF}}$ is the loss on the part affinity fields; PAF refers to the part affinity field, a vector field describing the association between different body parts, in which each vector represents the direction and intensity from one key point to another and is used to determine whether two key points belong to the same limb.
3. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that in step S3 the key points extracted by OpenPose are input into a pre-trained Transformer model for prediction and parameter adjustment, as follows:

The Transformer model is used for sequence prediction and parameter optimization and adopts a self-attention mechanism, given by:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$

where $Q$, $K$, and $V$ denote the query, key, and value matrices respectively, and $d_k$ is the dimension of the key vectors.
4. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that dynamically adjusting the OpenPose convolution layers and learning rate, the number of attention heads, and the rotation angle of the video frames based on the Transformer output in step S3, so as to optimize key point extraction, comprises:
Step S31, parameter adjustment optimization
A. Convolution layer parameter optimization: the key point data extracted by OpenPose are input into the Transformer model, and the hyperparameters of the OpenPose convolution layers are dynamically adjusted according to the Transformer model output, with the specific formula:

$k_{\text{new}} = W \cdot z + b$

where $W$ and $b$ are a learnable weight and bias parameter respectively, $k$ is the convolution kernel size, and $z$ is the Transformer output;

the key point data comprise the heat maps and the PAFs;
b. Loss function adjustment: based on the Transformer output, the weight parameters in the OpenPose loss function are adjusted to optimize key point extraction accuracy;

For the heat map loss:

$L_{\text{heatmap}} = \lambda_{h}\,\lVert \hat{H} - H^{*} \rVert_2^2$

where $\lambda_{h}$ is the weight predicted by the Transformer model, $\hat{H}$ is the predicted heat map, and $H^{*}$ is the ground-truth heat map;

For the PAF loss:

$L_{\text{PAF}} = \lambda_{p}\,\lVert \hat{P} - P^{*} \rVert_2^2$

where $\lambda_{p}$ is the weight predicted by the Transformer model, $\hat{P}$ is the predicted PAF, and $P^{*}$ is the ground-truth PAF;
Step S32, optimization of Multi-head attention mechanism
The key point data extracted by OpenPose are input into the optimized Transformer model, the attention weights between different key points are computed with a multi-head attention mechanism, and an adaptive weight adjustment mechanism is introduced, with the calculation formulas:

$\mathrm{head}_i = \alpha_i \cdot \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

where $\alpha_i$ is a weight parameter dynamically adjusted according to the key point data; $W_i^{Q}$ is the weight matrix of the query vector $Q$, converting it into a new query vector; $W_i^{K}$ is the weight matrix of the key vector $K$, converting it into a new key vector; and $W_i^{V}$ is the weight matrix of the value vector $V$, converting it into a new value vector;
Step S33, position coding optimization
Including dynamic position coding optimization and multidimensional position coding optimization;
Step S34, dynamic OpenPose parameter adjustment strategy optimization

comprising heat map loss function weight adjustment strategy optimization, PAF loss function weight adjustment strategy optimization, and adjustment and optimization of the convolution layer parameters.
5. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the dynamic position coding optimization in step S33 introduces dynamic position coding (DPE) into the Transformer model to address position sensitivity in complex sequence data processing, specifically comprising:

S331, environment perception initialization phase

Assume the original input sequence is $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ represents the word embedding at the $i$-th position. The environment perception network preprocesses $X$ and outputs a context vector:

$C = f_{\text{env}}(X)$

S332, dynamic coding generation network

For each position $i$ in the sequence, the dynamic coding generation network generates a dynamic position code $DPE_i$ from the position index $i$ and the context vector $C$, expressed as:

$DPE_i = \mathrm{MLP}\big([\,PE_i \,;\, C\,];\, \theta\big)$

where $PE_i$ is the basic position code of position index $i$, e.g. traditional static coding generated by $\sin/\cos$ functions; $[\cdot\,;\,\cdot]$ denotes vector concatenation; $\mathrm{MLP}$ denotes a multi-layer perceptron; and the parameters $\theta$ are responsible for learning how to adjust the code of each position based on the context $C$;

S333, fusion strategy

The dynamic position code $DPE_i$ and the word embedding $x_i$ are fused by an addition operation to form the enhanced input representation $x_i' = x_i + DPE_i$;

S334, end-to-end training loss function

The overall goal of the Transformer model is to minimize the loss function $L$, averaged over all samples, thereby optimizing all parameters including those of the dynamic position coding:

$L(\theta) = \mathbb{E}_{(X, Y) \sim \mathcal{D}}\big[\ell\big(f_{\theta}(X'),\, Y\big)\big]$

where $(X, Y)$ denotes an input-output pair sampled from the dataset $\mathcal{D}$; $f_{\theta}$ is the Transformer model with dynamic position coding; $\theta$ contains all learnable parameters; and $X'$ denotes the processed form of the original input sequence $X$, i.e. the enhanced input representation obtained by the fusion strategy of step S333.
6. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the multidimensional position coding optimization in step S33 introduces multidimensional position coding over the spatial dimensions so that the Transformer model can simultaneously account for the position changes of the key points in space and time, specifically as follows:

For a two-dimensional structure, the position of each pixel or image block in the image can be encoded using two-dimensional sine and cosine functions. The original position coding in the Transformer model is extended to two dimensions, and for each position $(x, y)$ the code is defined as:

$PE_{(x, y, 2i)} = \sin\!\big(x / 10000^{2i/d}\big)$

$PE_{(x, y, 2i+1)} = \cos\!\big(y / 10000^{2i/d}\big)$

where $d$ is the number of dimensions of the position code, usually the embedding dimension of the Transformer model divided by 2, so that the position code dimension matches the model embedding dimension; $t$ denotes the current time step.
7. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the heat map loss function weight adjustment strategy optimization in step S34 proceeds as follows:

The heat map is the core means by which OpenPose represents the positions of human key points. The loss function weight $\lambda_{h}$ is dynamically adjusted by gradient descent, updating $\lambda_{h}$ according to the current heat map loss $L_{\text{heatmap}}$:

$\lambda_{h}^{(t+1)} = \lambda_{h}^{(t)} - \eta\, \frac{\partial L_{\text{heatmap}}}{\partial \lambda_{h}}$

where $\eta$ is the learning rate, which controls the update step size and ensures that in each iteration the weight is adjusted in the direction that reduces the heat map loss, and $t$ is the current time step.
8. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that, for the PAF loss function weight adjustment strategy optimization in step S34, the PAF is a vector field connecting different key points of the human body and the dynamic adjustment strategy for its loss function weight $\lambda_{p}$ is:

$\lambda_{p}^{(t+1)} = \lambda_{p}^{(t)} - \eta\, \frac{\partial L_{\text{PAF}}}{\partial \lambda_{p}}$

By monitoring changes in $L_{\text{PAF}}$ and adjusting $\lambda_{p}$ accordingly, the OpenPose model can focus on learning the correct connection relations between joints, reducing incorrect limb-structure configurations and improving the continuity and stability of pose estimation.
9. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 4, characterized in that the adjustment and optimization of the convolution layer parameters in step S34 adjusts the convolution layer parameters $\theta_{\text{conv}}$ with a global-loss-guided method:

$\theta_{\text{conv}}^{(t+1)} = \theta_{\text{conv}}^{(t)} - \eta_{\text{conv}}\, \frac{\partial L}{\partial \theta_{\text{conv}}}$

where $\eta_{\text{conv}}$ is a learning rate set specifically for the convolution layer parameters.
10. The dance motion key point capturing method based on a Transformer fused with OpenPose according to claim 1, characterized in that the processing of the Transformer output with the fuzzy algorithm in step S4 is as follows:

To reduce noise and prediction error, a fuzzy algorithm is applied to the Transformer model output to generate smoothed key point data:

$K_{\text{smooth}} = \mathrm{Fuzzy}(K)$

where $K$ is the raw Transformer output and $K_{\text{smooth}}$ denotes the key point data after fuzzy processing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410932523.3A CN118470805B (en) | 2024-07-12 | 2024-07-12 | Dancing action key point capturing method based on transformer fusion openpose |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410932523.3A CN118470805B (en) | 2024-07-12 | 2024-07-12 | Dancing action key point capturing method based on transformer fusion openpose |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118470805A true CN118470805A (en) | 2024-08-09 |
CN118470805B CN118470805B (en) | 2024-09-17 |
Family
ID=92150338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410932523.3A Active CN118470805B (en) | 2024-07-12 | 2024-07-12 | Dancing action key point capturing method based on transformer fusion openpose |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118470805B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220358310A1 (en) * | 2021-05-06 | 2022-11-10 | Kuo-Yi Lin | Professional dance evaluation method for implementing human pose estimation based on deep transfer learning |
KR20230035770A (en) * | 2021-09-06 | 2023-03-14 | 주식회사 이엠피이모션캡쳐 | System and method for providing dance learning based on artificial intelligence |
US11763485B1 (en) * | 2022-04-20 | 2023-09-19 | Anhui University of Engineering | Deep learning based robot target recognition and motion detection method, storage medium and apparatus |
US11810366B1 (en) * | 2022-09-22 | 2023-11-07 | Zhejiang Lab | Joint modeling method and apparatus for enhancing local features of pedestrians |
WO2024060321A1 (en) * | 2022-09-22 | 2024-03-28 | 之江实验室 | Joint modeling method and apparatus for enhancing local features of pedestrians |
CN115798045A (en) * | 2022-11-30 | 2023-03-14 | 国网浙江省电力有限公司嘉兴供电公司 | Depth-adaptive Transformer-based human body posture identification method and system |
CN116246338A (en) * | 2022-12-20 | 2023-06-09 | 西南交通大学 | Behavior recognition method based on graph convolution and transducer composite neural network |
CN117316129A (en) * | 2023-10-25 | 2023-12-29 | 广东外语外贸大学 | Method, equipment and storage medium for generating dance gesture based on multi-modal feature fusion |
CN117765564A (en) * | 2023-11-27 | 2024-03-26 | 中国电信股份有限公司 | User gesture recognition method and device, electronic equipment and storage medium |
Non-Patent Citations (4)
Title |
---|
B. VERSTEEG;A. BEGEN;CISCO;T. VANCAENEGEM;ALCATEL-LUCENT;Z. VAX; MICROSOFT CORPORATION;: "Unicast-Based Rapid Acquisition of Multicast RTP Sessions draft-ietf-avt-rapid-acquisition-for-rtp-16", IETF, 16 October 2010 (2010-10-16) * |
Zeng Jialu: "Research on the dance performance forms in Dejiang Nuotang opera", Master's thesis, Jiangxi Normal University, 15 March 2020 (2020-03-15) *
Yang Aolei; Zhou Yinghong; Yang Banghua; Xu Yulin: "Transformer-based 3D human pose estimation and action attainment evaluation", Chinese Journal of Scientific Instrument, 15 April 2024 (2024-04-15) *
Bi Xuechao: "Research on dance action recognition technology based on multi-feature fusion", Electronic Design Engineering, no. 18, 18 September 2020 (2020-09-18) *
Also Published As
Publication number | Publication date |
---|---|
CN118470805B (en) | 2024-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Song et al. | Constructing stronger and faster baselines for skeleton-based action recognition | |
Mao et al. | History repeats itself: Human motion prediction via motion attention | |
Mao et al. | Multi-level motion attention for human motion prediction | |
CN113496507B (en) | Human body three-dimensional model reconstruction method | |
CN112037310A (en) | Game character action recognition generation method based on neural network | |
CN111899320A (en) | Data processing method, and training method and device of dynamic capture denoising model | |
CN110807380B (en) | Human body key point detection method and device | |
CN112258555A (en) | Real-time attitude estimation motion analysis method, system, computer equipment and storage medium | |
Liu | Aerobics posture recognition based on neural network and sensors | |
CN113989928A (en) | Motion capturing and redirecting method | |
CN111862278A (en) | Animation obtaining method and device, electronic equipment and storage medium | |
CN114036969A (en) | 3D human body action recognition algorithm under multi-view condition | |
CN114399829B (en) | Posture migration method based on generative countermeasure network, electronic device and medium | |
CN117218246A (en) | Training method and device for image generation model, electronic equipment and storage medium | |
CN118429459A (en) | Multimode nuclear magnetic resonance image reconstruction method based on deformable convolution feature alignment | |
CN116469175B (en) | Visual interaction method and system for infant education | |
CN118470805B (en) | Dancing action key point capturing method based on transformer fusion openpose | |
CN113240714A (en) | Human motion intention prediction method based on context-aware network | |
CN117437467A (en) | Model training method and device, electronic equipment and storage medium | |
CN117011357A (en) | Human body depth estimation method and system based on 3D motion flow and normal map constraint | |
CN116543104A (en) | Human body three-dimensional model construction method, electronic equipment and storage medium | |
KR20230083212A (en) | Apparatus and method for estimating object posture | |
Molnár et al. | Variational autoencoders for 3D data processing | |
CN112907456A (en) | Deep neural network image denoising method based on global smooth constraint prior model | |
CN109190474A (en) | Human body animation extraction method of key frame based on posture conspicuousness |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |