CN117710533B - Music conditional dance animation generation method based on diffusion model - Google Patents

Music conditional dance animation generation method based on diffusion model

Info

Publication number
CN117710533B
CN117710533B
Authority
CN
China
Prior art keywords
representing
potential
dance
image
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410146031.1A
Other languages
Chinese (zh)
Other versions
CN117710533A (en)
Inventor
刘长红
蔡娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University filed Critical Jiangxi Normal University
Priority to CN202410146031.1A priority Critical patent/CN117710533B/en
Publication of CN117710533A publication Critical patent/CN117710533A/en
Application granted granted Critical
Publication of CN117710533B publication Critical patent/CN117710533B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a music-conditioned dance animation generation method based on a diffusion model. The method obtains a dataset and constructs text prompts matching its dance videos; segments each dance video into music clips and video clips; takes the first frame and the last frame of each video clip as the performer image and the source dance animation clip, respectively; encodes the text prompt, music clip and performer image to obtain different latent features; adds noise drawn from a standard normal distribution to the source dance animation clip for a given number of time steps; uses the latent features to predict the added noise and obtain the denoised latent-space features of the target dance animation clip; and decodes those latent-space features with a pre-trained VAE model to obtain the target dance animation clip. The invention directly generates a stylized dance image from prior conditions given by the user, such as music, a text prompt and a performer image, and therefore has better practicality and generalization.

Description

Music conditional dance animation generation method based on diffusion model
Technical Field
The invention relates to the technical fields of games, film, social media and education, and in particular to a music-conditioned dance animation generation method based on a diffusion model.
Background
Existing music-conditioned dance motion generation methods fall mainly into two categories: similarity-retrieval methods and deep generative methods. Similarity-retrieval methods directly retrieve matching dance motions for a piece of music; they can produce motions that fit the musical rhythm, but they can only return motions already stored in a database, cannot create new motions, and therefore lack diversity and creativity. Deep generative methods build a mapping between music and dance motion using models such as recurrent neural networks, Transformers or generative adversarial networks, and generate a sequence of dance motion joint points from the input music, with better diversity and creativity. However, these methods focus on generating a joint-point sequence from the music and then transferring it onto a specified character through a rendering model to obtain a dance image; they cannot generate the dance image directly. In recent years, diffusion models have achieved impressive results in high-quality image synthesis, and pre-trained diffusion models have also been very successful in cross-modal generation tasks. Unlike the earlier methods, they do not need a two-step pipeline and can generate dance video images directly from the input conditions. At present, however, such work mainly realizes text-conditioned dance video generation on top of a pre-trained diffusion model and cannot be used directly for music-conditioned dance video generation.
(1) Existing music-conditioned dance motion generation essentially produces a sequence of dance motion joint points from the music and then renders a dance image with a character model. The invention instead generates a stylized dance image directly from the prior conditions given by the user, such as music, a text prompt and a performer image, and therefore has better practicality and generalization;
(2) How to fuse the input music with the given prior conditions across modalities, so that the generated dance image matches both the music and the user-supplied priors. Existing dance generation methods mainly control generation from a single condition such as music or text; the invention considers personalized dance generation, fusing the music with the other prior conditions so that the generated dance image respects the given condition constraints.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a music-conditioned dance animation generation method based on a diffusion model. It uses a pre-trained diffusion model together with a multi-modal fusion mechanism to generate dance images under the guidance of prior conditions such as music, text prompt, dance style and performer image, producing dance animation that is consistent and continuous with the music.
To achieve this, the invention provides the following technical solution: a music-conditioned dance animation generation method based on a diffusion model, comprising the following steps:
Step S1: construct the dance animation generation model, which comprises a pre-trained diffusion model, the pre-trained model Wav2CLIP, the contrastive language-image pre-training model CLIP, a pre-trained VAE model and a multi-modal control network;
the pre-trained VAE model consists of an image encoder and an image decoder;
the multi-modal control network consists of several zero-convolution layers, a dedicated fully connected layer, an encoder part and an intermediate-layer part;
the encoder part consists of a multi-condition encoder module and the second, third and fourth diffusion-model encoder modules;
the contrastive language-image pre-training model CLIP consists of a CLIP text encoder and a CLIP image encoder;
the multi-condition encoder module contains a condition normalization module;
Step S2: acquire the dance videos of a dataset and construct text prompts matching them;
Step S2.1: split each dance video of the dataset into music clips and video clips of fixed duration, and take the first frame and the last frame of each video clip as the performer image and the source dance animation clip, respectively;
Step S3: encode the text prompt, music clip and performer image with the contrastive language-image pre-training model CLIP, the pre-trained model Wav2CLIP and the pre-trained VAE model to obtain different latent features; the latent features comprise the latent feature of the text prompt, the latent feature of the audio, and the latent-space feature of the performer image;
Step S4: encode the source dance animation clip into the latent space to obtain its latent-space feature, add noise at a randomly chosen time step to that feature, predict the added noise under the control of the text-prompt latent feature, the audio latent feature and the performer-image latent-space feature, and thereby obtain the denoised latent-space feature of the target dance animation clip;
Step S5: decode the denoised latent-space feature of the target dance animation clip with the image decoder of the pre-trained VAE model to obtain the target dance animation clip.
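As an illustration only, the following sketch shows how steps S2 to S5 could be wired together for one training example; clip_text_encoder, wav2clip_encoder, vae, control_net and unet are hypothetical placeholders for the pre-trained components named above, and forward_noise refers to the noising helper sketched after formula (6) further below.

```python
import torch

def training_step(text_prompt, music_clip, performer_img, source_clip,
                  clip_text_encoder, wav2clip_encoder, vae, control_net, unet,
                  alphas_cumprod):
    """One training iteration of the dance animation generation model (sketch)."""
    # Step S3: encode the three prior conditions into latent features.
    f_text = clip_text_encoder(text_prompt)    # latent feature of the text prompt
    f_audio = wav2clip_encoder(music_clip)     # latent feature of the audio
    z_img = vae.encode(performer_img)          # latent-space feature of the performer image

    # Step S4: encode the source dance clip and add noise at a random time step.
    z0 = vae.encode(source_clip)
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    zt = forward_noise(z0, t, noise, alphas_cumprod)   # formula (6); helper sketched below

    # The multi-modal control network fuses the conditions and guides the frozen,
    # pre-trained diffusion U-Net, which predicts the added noise.
    control = control_net(zt, t, f_text, f_audio, z_img)
    noise_pred = unet(zt, t, f_text, control=control)
    return noise, noise_pred
```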
In step S2, music clips, performer images and source dance animation clips are obtained from the dance videos of the dataset, and text prompts are constructed, as follows:
suppose the dataset contains n groups of data;
the music clips are denoted M = {m1, m2, ..., mn};
the text prompts are denoted P = {p1, p2, ..., pn};
the source dance animation clips are denoted X = {x1, x2, ..., xn};
the performer images are denoted I = {I1, I2, ..., In};
where m1, m2 and mn denote the feature vectors of the first, second and n-th music clips, p1, p2 and pi denote the first, second and i-th text prompts, x1, x2 and xn denote the first, second and n-th source dance animation clips, and I1, I2 and In denote the first, second and n-th performer images;
each text prompt is constructed as
pi = (name, style), i = 1, 2, ..., n;
where pi denotes the i-th text prompt, name denotes the name of the person in each dance video, and style denotes the dance style.
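The prompt construction can be illustrated with a small helper; the sentence template follows the example given later in the detailed description ("Zhang San is performing street jazz dance") and is otherwise an assumption.

```python
def build_text_prompt(name: str, style: str) -> str:
    """Build the text prompt p_i = (name, style) for one dance video.

    All source dance animation clips cut from the same video share this prompt.
    """
    return f"{name} is performing {style}"

# example: build_text_prompt("Zhang San", "street jazz dance")
# -> "Zhang San is performing street jazz dance"
```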
In step S3, the different latent features are obtained by encoding as follows:
the input text prompt p is encoded with the text encoder of the contrastive language-image pre-training model CLIP to obtain the latent feature of the text prompt f_p, expressed by formula (1):
f_p = E_text(p)    (1);
where f_p denotes the latent feature of the text prompt, E_text denotes the text encoder of CLIP, and p denotes the text prompt vector;
the input performer image I is encoded with the image encoder of the pre-trained VAE model to obtain the latent-space feature of the performer image z_I, expressed by formula (2):
z_I = E_vae(I)    (2);
where z_I denotes the latent-space feature of the performer image, E_vae denotes the image encoder of the pre-trained VAE model, and I denotes the performer image vector;
the music clip is encoded with the pre-trained model Wav2CLIP, which can learn the aligned semantic relation between a music clip and the dance images corresponding to it, giving the latent feature of the audio related to the dance images f_m, expressed by formula (3):
f_m = Wav2CLIP(m)    (3);
where f_m denotes the latent feature of the audio, Wav2CLIP denotes the pre-trained Wav2CLIP model, and m denotes the music clip vector;
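For the audio branch, a minimal usage sketch is shown below, assuming the interface documented by the open-source wav2clip package; the file name and loading choices are hypothetical.

```python
# pip install wav2clip librosa   (assumed dependencies)
import librosa
import wav2clip

audio, sr = librosa.load("music_segment.wav", sr=None)   # hypothetical music clip file
model = wav2clip.get_model()
f_audio = wav2clip.embed_audio(audio, model)              # audio embedding in the shared CLIP space
```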
In addition, a time step t is introduced to guide the denoising process of the diffusion model; it is encoded by a 2-layer fully connected network with a SiLU activation to obtain the time-step feature f_t, expressed by formula (4):
f_t = FC(SiLU(FC(t)))    (4);
where f_t denotes the time-step feature, FC denotes a fully connected layer, SiLU denotes the SiLU activation function, and t denotes the time step.
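A minimal module matching formula (4) might look as follows; the embedding dimensions are assumptions (the defaults below follow a common Stable Diffusion convention), and the sinusoidal encoding of the scalar time step is assumed to be computed beforehand.

```python
import torch
import torch.nn as nn

class TimestepEmbedding(nn.Module):
    """Two fully connected layers with a SiLU activation, as in formula (4)."""
    def __init__(self, in_dim: int = 320, out_dim: int = 1280):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.act = nn.SiLU()
        self.fc2 = nn.Linear(out_dim, out_dim)

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        # t_emb: sinusoidal encoding of the time step t, shape (batch, in_dim)
        return self.fc2(self.act(self.fc1(t_emb)))
```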
In step S4, the denoised latent-space feature of the target dance animation clip is obtained as follows:
the image encoder E_vae of the pre-trained VAE model encodes the source dance animation clip x0 into the latent space, giving the latent-space feature z0 of the source dance animation clip, expressed by formula (5):
z0 = E_vae(x0)    (5);
where z0 denotes the latent-space feature of the source dance animation clip x0;
noise ε is added to z0 for time step t, giving the noisy latent-space feature z_t of the source dance animation clip, expressed by formula (6):
z_t = sqrt(ᾱ_t)·z0 + sqrt(1 − ᾱ_t)·ε,  ε ~ N(0, I)    (6);
where z_t denotes the latent-space feature of the source dance animation clip after adding noise, ᾱ_t denotes the cumulative product of the α values for time steps 1 to t, N(0, I) denotes the standard normal distribution, β_t denotes the variance of the noise added at time step t, and α_t = 1 − β_t denotes the corresponding α value within 1 to time step T;
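A sketch of the forward noising of formula (6); the linear beta schedule given in the comment is an assumption.

```python
import torch

def forward_noise(z0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor,
                  alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Add noise to the source-clip latent at time step t (formula (6)):
    z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * noise."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # broadcast over the latent dimensions
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

# assumed schedule:
# betas = torch.linspace(1e-4, 0.02, 1000)
# alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
```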
The multi-condition encoder module fuses the encoded latent feature of the text prompt, the latent feature of the audio and the latent-space feature of the performer image to guide the denoising process of the diffusion as follows:
the latent-space feature z_I of the performer image is passed through a learnable zero-convolution network and summed with the noisy latent-space feature z_t of the source dance animation clip, giving the noisy performer-image latent-space feature z'_t, expressed by formula (7):
z'_t = z_t + ZeroConv_I(z_I)    (7);
where z'_t denotes the noisy performer-image latent-space feature and ZeroConv_I denotes the zero-convolution network that processes the latent-space feature of the performer image;
a condition normalization module (CBN) is designed: the latent feature of the audio f_m is processed by a dedicated fully connected layer and used as the condition, and the noisy latent-space feature passed through the condition normalization module is distribution-adjusted so as to better guide the denoising process, expressed by formula (8):
h = CBN(z'_t, FC_m(f_m))    (8);
where CBN denotes the condition normalization module, Norm denotes its normalization layer, and FC_m denotes the dedicated fully connected layer that processes the latent feature of the audio;
the latent-space noise processed by the condition normalization module is normalized and then passed through an activation function and a convolution layer, giving the latent-space feature h1 after the first condition normalization, expressed by formula (9):
h1 = Conv(SiLU(Norm(h)))    (9);
where h1 denotes the latent-space feature and Conv denotes the convolution layer;
the acquired time-step feature f_t, processed by an activation function and a fully connected layer, is added to the latent-space noise h1 processed by the condition normalization module, the result is fed to the normalization layer of a second condition normalization module, a second condition normalization is performed conditioned on the latent feature of the audio f_m, and after the normalization layer, activation function, Dropout regularization and a zero-convolution layer, the feature h2 is obtained, expressed by formula (10):
h2 = ZeroConv(Dropout(SiLU(CBN(h1 + FC(SiLU(f_t)), FC_m(f_m)))))    (10);
where h2 denotes the latent-space feature obtained after the second condition normalization, ZeroConv denotes the zero-convolution layer, and Dropout denotes the regularization;
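As one possible reading of formulas (7) to (10), the sketch below uses a group-norm-based condition normalization and ControlNet-style zero convolutions; the channel counts, group count, dropout rate and the exact operator ordering are assumptions, and the performer-image latent is assumed to share the channel layout of the noisy clip latent.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialised to zero (ControlNet-style zero convolution)."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ConditionalNorm(nn.Module):
    """Condition normalization (CBN): a normalization layer whose scale and shift
    are predicted from the conditioning feature by a fully connected layer."""
    def __init__(self, channels: int, cond_dim: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)  # the dedicated FC layer

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[..., None, None]) + shift[..., None, None]

class MultiConditionEncoder(nn.Module):
    """Sketch of formulas (7)-(10): fuse the performer-image, audio and time-step
    conditions into the noisy source-clip latent before the control-network encoder."""
    def __init__(self, channels: int, audio_dim: int, time_dim: int, dropout: float = 0.1):
        super().__init__()
        self.img_zero_conv = zero_conv(channels)
        self.cbn1 = ConditionalNorm(channels, audio_dim)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.time_fc = nn.Linear(time_dim, channels)
        self.cbn2 = ConditionalNorm(channels, audio_dim)
        self.act = nn.SiLU()
        self.dropout = nn.Dropout(dropout)
        self.out_zero_conv = zero_conv(channels)

    def forward(self, z_noisy, z_img, f_audio, f_time):
        x = z_noisy + self.img_zero_conv(z_img)               # formula (7)
        h = self.cbn1(x, f_audio)                             # formula (8)
        h = self.conv(self.act(h))                            # formula (9)
        h = h + self.act(self.time_fc(f_time))[..., None, None]
        h = self.cbn2(h, f_audio)                             # second condition normalization
        return self.out_zero_conv(self.dropout(self.act(h)))  # formula (10)
```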
The noisy latent-space feature z_t of the source dance animation clip is used for noise prediction and a noise prediction loss is constructed;
the obtained h2 is passed through learnable zero-convolution layers and added to the outputs of the corresponding encoder modules and intermediate-layer module of each layer, then fed to the intermediate-layer module and decoder modules of the diffusion model, to predict the noise added at time step t;
during this process the parameters of the pre-trained diffusion model are locked, and the weights copied to the control-network part are further trained with the different conditional feature vectors; the different conditional feature vectors include the music and the performer image; the loss between the predicted noise ε_θ and the real noise ε initially added is constructed with a cross-entropy loss function, giving the noise prediction loss L_noise, expressed by formula (11):
L_noise = E_{z_t, t, ε~N(0,I)} [ ℓ(ε, ε_θ(z_t, t, f_p, C(f_m, z_I))) ]    (11);
where L_noise denotes the noise prediction loss, ℓ denotes the loss function, N(0, I) denotes the normal distribution, ε_θ denotes the predicted noise, f_p denotes the text control over the whole model, and C denotes the multi-modal control network;
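The text above describes the loss between the added and predicted noise as a cross-entropy loss; the sketch below instead uses the mean-squared-error form that is standard for epsilon-prediction diffusion training, which should be treated as an assumption.

```python
import torch.nn.functional as F

def noise_prediction_loss(noise, noise_pred):
    """Noise prediction loss of formula (11): penalise the difference between the
    noise added at time step t and the noise predicted by the frozen diffusion
    U-Net under control of the multi-modal control network."""
    return F.mse_loss(noise_pred, noise)
```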
Through a contrastive learning mechanism, the distance between the latent feature of the audio and the latent feature of the source dance animation clip encoded by the CLIP image encoder is minimized, and a contrastive loss function L_con between them is constructed to optimize the Wav2CLIP parameters, so that the latent feature of the audio better guides the denoising process, expressed by formula (12):
L_con = ℓ_con(f_m, E_img(x0))    (12);
where L_con denotes the contrastive loss function over the distance between the latent feature of the audio and the feature of the dance image, ℓ_con denotes the contrastive distance between the two features, and E_img denotes the image encoder of CLIP;
a joint loss function L is constructed to train the model;
the joint loss function L is the sum of the noise prediction loss L_noise and the contrastive loss L_con;
the joint loss function L is expressed by formula (13):
L = L_noise + L_con    (13).
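A possible concrete form of formulas (12) and (13) is sketched below, assuming a symmetric InfoNCE contrastive loss between the Wav2CLIP audio features and the CLIP image features of the matching source clip; the temperature and the equal weighting of the two terms are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_audio, f_image, temperature: float = 0.07):
    """Contrastive loss of formula (12): pull each audio feature towards the CLIP
    image feature of its own source dance clip and away from the other clips in
    the batch (symmetric InfoNCE form, assumed)."""
    f_audio = F.normalize(f_audio, dim=-1)
    f_image = F.normalize(f_image, dim=-1)
    logits = f_audio @ f_image.t() / temperature
    targets = torch.arange(len(f_audio), device=f_audio.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def joint_loss(noise, noise_pred, f_audio, f_image):
    """Joint objective of formula (13): noise prediction loss plus contrastive loss."""
    return F.mse_loss(noise_pred, noise) + contrastive_loss(f_audio, f_image)
```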
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention generates dance video images from given music and prior conditions; by writing a text prompt the user can customize the dance style and other conditions, and by specifying the performer image and music the user obtains a dance animation matching their preferences.
(2) By building the multi-modal control network module, the invention can fuse prior conditions such as music, text prompt and performer image to guide the diffusion model in generating the dance images.
(3) User preferences are supplied as a text prompt, which guides the pre-trained diffusion model to generate images that match those preferences.
(4) The invention introduces the pre-trained model Wav2CLIP to encode the input music clips; the Wav2CLIP model captures the correlation between music and dance images, so the generated dance images are aligned with the music semantics and form a coherent dance animation consistent with the music. It also introduces the contrastive language-image pre-training model CLIP to learn the semantic relation between the text prompt and the generated image, controlling the model to generate the dance images the user prefers.
Drawings
FIG. 1 is a schematic diagram of the overall scheme of the dance animation training stage of the present invention.
FIG. 2 is a schematic diagram of the training process of the dance animation generation model of the present invention.
FIG. 3 is a schematic diagram of the multi-condition encoder module of the present invention.
Detailed Description
As shown in FIGS. 1-3, the present invention provides the following technical solution: a music-conditioned dance animation generation method based on a diffusion model, comprising the following steps:
Step S1: construct the dance animation generation model, which comprises a pre-trained diffusion model, the pre-trained model Wav2CLIP, the contrastive language-image pre-training model CLIP, a pre-trained VAE model and a multi-modal control network;
the pre-trained VAE model consists of an image encoder and an image decoder;
the multi-modal control network consists of several zero-convolution layers, a dedicated fully connected layer, an encoder part and an intermediate-layer part;
the encoder part consists of a multi-condition encoder module and the second, third and fourth diffusion-model encoder modules;
the contrastive language-image pre-training model CLIP consists of a CLIP text encoder and a CLIP image encoder;
the multi-condition encoder module contains a condition normalization module;
Step S2: acquire the dance videos of a dataset and construct text prompts matching them;
Step S2.1: split each dance video of the dataset into music clips and video clips of fixed duration, take the first frame and the last frame of each video clip as the performer image and the source dance animation clip, respectively, and construct a text prompt matching the dance video;
Step S3: encode the text prompt, music clip and performer image with the contrastive language-image pre-training model CLIP, the pre-trained model Wav2CLIP and the pre-trained VAE model to obtain different latent features; the latent features comprise the latent feature of the text prompt, the latent feature of the audio, and the latent-space feature of the performer image;
Step S4: encode the source dance animation clip into the latent space to obtain its latent-space feature, add noise at a randomly chosen time step to that feature, predict the added noise under the control of the text-prompt latent feature, the audio latent feature and the performer-image latent-space feature, and thereby obtain the denoised latent-space feature of the target dance animation clip;
Step S5: decode the denoised latent-space feature of the target dance animation clip with the image decoder of the pre-trained VAE model to obtain the target dance animation clip.
In step S2, music clips, performer images and source dance animation clips are obtained from the dance videos of the dataset, and text prompts are constructed, as follows:
suppose the dataset contains n groups of data;
the music clips are denoted M = {m1, m2, ..., mn};
the text prompts are denoted P = {p1, p2, ..., pn};
the source dance animation clips are denoted X = {x1, x2, ..., xn};
the performer images are denoted I = {I1, I2, ..., In};
where m1, m2 and mn denote the feature vectors of the first, second and n-th music clips, p1, p2 and pi denote the first, second and i-th text prompts, x1, x2 and xn denote the first, second and n-th source dance animation clips, and I1, I2 and In denote the first, second and n-th performer images;
each text prompt is constructed as
pi = (name, style), i = 1, 2, ..., n;
where pi denotes the i-th text prompt, name denotes the name of the person in each dance video, and style denotes the dance style; the text prompts corresponding to all source dance animation clips cut from the same dance video are identical.
In step S3, the different latent features are obtained by encoding as follows:
the input text prompt p is encoded with the text encoder of the contrastive language-image pre-training model CLIP to obtain the latent feature of the text prompt f_p, so that the semantic relation between the text prompt and the generated image can be learned and the model can be controlled to generate the dance images the user prefers; the latent feature of the text prompt is expressed by formula (1):
f_p = E_text(p)    (1);
where f_p denotes the latent feature of the text prompt, E_text denotes the text encoder of CLIP, and p denotes the text prompt vector;
the input performer image I is encoded with the image encoder of the pre-trained VAE model to obtain the latent-space feature of the performer image z_I, expressed by formula (2):
z_I = E_vae(I)    (2);
where z_I denotes the latent-space feature of the performer image, E_vae denotes the image encoder of the pre-trained VAE model, and I denotes the performer image vector;
the input music clip m is first encoded with the pre-trained model Wav2CLIP, which can learn the aligned semantic relation between a music clip and the dance images corresponding to it, giving the latent feature of the audio related to the dance images f_m, expressed by formula (3):
f_m = Wav2CLIP(m)    (3);
where f_m denotes the latent feature of the audio, Wav2CLIP denotes the pre-trained Wav2CLIP model, and m denotes the music clip vector;
in addition, a time step t is introduced to guide the denoising process of the diffusion model; it is encoded by a 2-layer fully connected network with a SiLU activation to obtain the time-step feature f_t, expressed by formula (4):
f_t = FC(SiLU(FC(t)))    (4);
where f_t denotes the time-step feature, FC denotes a fully connected layer, SiLU denotes the SiLU activation function, and t denotes the time-step vector.
In step S4, the denoised latent-space feature of the target dance animation clip is obtained as follows:
the image encoder E_vae of the pre-trained VAE model encodes the source dance animation clip x0 into the latent space, giving the latent-space feature z0 of the source dance animation clip, expressed by formula (5):
z0 = E_vae(x0)    (5);
where z0 denotes the latent-space feature of the source dance animation clip x0;
noise ε is added to z0 for time step t, giving the noisy latent-space feature z_t of the source dance animation clip, expressed by formula (6):
z_t = sqrt(ᾱ_t)·z0 + sqrt(1 − ᾱ_t)·ε,  ε ~ N(0, I)    (6);
where z_t denotes the latent-space feature of the source dance animation clip after adding noise, ᾱ_t denotes the cumulative product of the α values for time steps 1 to t, N(0, I) denotes the standard normal distribution, β_t denotes the variance of the noise added at time step t, and α_t = 1 − β_t denotes the corresponding α value within 1 to time step T;
the multi-condition encoder module fuses the encoded latent feature of the text prompt, the latent feature of the audio and the latent-space feature of the performer image to guide the denoising process of the diffusion as follows:
the latent-space feature z_I of the performer image is passed through a learnable zero-convolution network and summed with the noisy latent-space feature z_t of the source dance animation clip, giving the noisy performer-image latent-space feature z'_t, expressed by formula (7):
z'_t = z_t + ZeroConv_I(z_I)    (7);
where z'_t denotes the noisy performer-image latent-space feature and ZeroConv_I denotes the zero-convolution network that processes the latent-space feature of the performer image;
a condition normalization module (CBN) is designed: the latent feature of the audio f_m is processed by a dedicated fully connected layer and used as the condition, and the noisy latent-space feature passed through the condition normalization module is distribution-adjusted so as to better guide the denoising process, expressed by formula (8):
h = CBN(z'_t, FC_m(f_m))    (8);
where CBN denotes the condition normalization module, Norm denotes its normalization layer, and FC_m denotes the dedicated fully connected layer that processes the latent feature of the audio;
the latent-space noise processed by the condition normalization module is normalized and then passed through an activation function and a convolution layer, giving the latent-space feature h1 after the first condition normalization, expressed by formula (9):
h1 = Conv(SiLU(Norm(h)))    (9);
where h1 denotes the latent-space feature and Conv denotes the convolution layer;
the acquired time-step feature f_t, processed by an activation function and a fully connected layer, is added to the latent-space noise h1 processed by the condition normalization module, the result is fed to the normalization layer of a second condition normalization module, a second condition normalization is performed conditioned on the latent feature of the audio f_m, and after the normalization layer, activation function, Dropout regularization and a zero-convolution layer, the feature h2 is obtained, expressed by formula (10):
h2 = ZeroConv(Dropout(SiLU(CBN(h1 + FC(SiLU(f_t)), FC_m(f_m)))))    (10);
where h2 denotes the latent-space feature obtained after the second condition normalization, ZeroConv denotes the zero-convolution layer, and Dropout denotes the regularization;
The noisy latent-space feature z_t of the source dance animation clip is used for noise prediction and a noise prediction loss is constructed;
the obtained h2 is passed through learnable zero-convolution layers and added to the outputs of the corresponding encoder modules and intermediate-layer module of each layer, then fed to the intermediate-layer module and decoder modules of the diffusion model, to predict the noise added at time step t;
during this process the parameters of the pre-trained diffusion model are locked, and the weights copied to the control-network part are further trained with the different conditional feature vectors (music, performer image); the loss between the predicted noise ε_θ and the real noise ε initially added is constructed with a cross-entropy loss function, giving the noise prediction loss L_noise, expressed by formula (11):
L_noise = E_{z_t, t, ε~N(0,I)} [ ℓ(ε, ε_θ(z_t, t, f_p, C(f_m, z_I))) ]    (11);
where L_noise denotes the noise prediction loss, ℓ denotes the loss function, N(0, I) denotes the normal distribution, ε_θ denotes the noise predicted by the model, f_p denotes the text control over the whole model, and C denotes the multi-modal control network;
through a contrastive learning mechanism, the distance between the latent feature of the audio and the latent feature of the source dance animation clip encoded by the CLIP image encoder is minimized, and a contrastive loss function L_con between them is constructed;
the contrastive loss function between the latent feature of the music and the latent feature of the source dance animation clip is used to optimize the Wav2CLIP parameters, so that the latent feature of the audio better guides the denoising process, expressed by formula (12):
L_con = ℓ_con(f_m, E_img(x0))    (12);
where L_con denotes the contrastive loss function over the distance between the latent feature of the audio and the feature of the dance image, ℓ_con denotes the contrastive distance between the two features, and E_img denotes the image encoder of CLIP;
a joint loss function L is constructed to train the model;
the joint loss function L is the sum of the noise prediction loss L_noise and the contrastive loss L_con;
the joint loss function L is expressed by formula (13):
L = L_noise + L_con    (13);
The test process of the dance animation generation model is as follows:
Step S1: select a piece of music; it may be music from the dataset or any other music, cut into clips of fixed duration (for example, 1 second) with a given sliding-window size (for example, 4 frames).
Step S2: construct a text prompt description, for example of the form "Zhang San is performing street jazz dance", where "Zhang San" is a specific person name and "street jazz dance" is the designated dance style; this text prompt description can be extended with further conditions, for example words such as "high quality, detailed, high level", and the finally determined description is translated into an English expression to obtain the text prompt.
Step S3: select the performer image corresponding to the person name.
Step S4: the difference from the training flow is that the noisy latent feature of the dance animation clip is replaced by noise randomly sampled from a standard normal distribution and the number of denoising time steps is set; the first music clip obtained in step S1, the text prompt of step S2 and the performer image of step S3 are fed into the denoising network for noise prediction, giving the latent-space feature of the final target dance animation clip.
Step S5: decode the denoised latent-space feature of the dance clip obtained in step S4 with the image decoder of the pre-trained VAE model to obtain the final target dance animation clip.
Step S6: take the final target dance animation clip obtained in step S5 as the performer image for generating the next dance clip, take the second music clip obtained in step S1 and the text prompt of step S2 as input, and repeat steps S4 and S5 to obtain the second target dance animation clip.
Step S7: repeat step S6 until the last music clip has guided the generation of the last target dance animation clip.
Step S8: set the frame rate of the dance animation and connect the generated target dance animation clips by frame interpolation to obtain the complete dance animation.
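The test procedure above can be summarised in a short loop; everything below (the sampler object, the encoder placeholders, the tensor layout used to take the last generated frame) is a hypothetical sketch rather than the patented implementation.

```python
import torch

@torch.no_grad()
def generate_dance(music_clips, text_prompt, performer_img,
                   clip_text_encoder, wav2clip_encoder, vae, control_net, unet,
                   sampler, num_denoise_steps: int = 50):
    """Generate one dance clip per music clip, feeding the last generated frame
    back as the performer image for the next clip (steps S1-S7)."""
    f_text = clip_text_encoder(text_prompt)
    clips, current_img = [], performer_img
    for music_clip in music_clips:
        f_audio = wav2clip_encoder(music_clip)
        z_img = vae.encode(current_img)
        z = torch.randn_like(z_img)                      # start from pure Gaussian noise
        for t in sampler.timesteps(num_denoise_steps):   # e.g. a DDIM-style sampler (assumed)
            control = control_net(z, t, f_text, f_audio, z_img)
            noise_pred = unet(z, t, f_text, control=control)
            z = sampler.step(noise_pred, t, z)
        clip = vae.decode(z)                             # step S5
        clips.append(clip)
        current_img = clip[:, :, -1]                     # step S6: last frame; layout (B, C, T, H, W) assumed
    return clips                                         # step S8: interpolate and concatenate downstream
```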
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (2)

1. A music-conditioned dance animation generation method based on a diffusion model, characterized by comprising the following steps:
Step S1: construct a pre-trained diffusion model, the pre-trained model Wav2CLIP, the contrastive language-image pre-training model CLIP, a pre-trained VAE model and a multi-modal control network;
the pre-trained VAE model consists of an image encoder and an image decoder;
the multi-modal control network consists of several zero-convolution layers, a dedicated fully connected layer, an encoder part and an intermediate-layer part;
the encoder part consists of a multi-condition encoder module and the second, third and fourth diffusion-model encoder modules;
the contrastive language-image pre-training model CLIP consists of a CLIP text encoder and a CLIP image encoder;
the multi-condition encoder module contains a condition normalization module;
Step S2: acquire the dance videos of a dataset and construct text prompts matching them;
Step S2.1: split each dance video of the dataset into music clips and video clips of fixed duration, and take the first frame and the last frame of each video clip as the performer image and the source dance animation clip, respectively;
Step S3: encode the text prompt, music clip and performer image with the contrastive language-image pre-training model CLIP, the pre-trained model Wav2CLIP and the pre-trained VAE model to obtain different latent features; the latent features comprise the latent feature of the text prompt, the audio latent feature and the latent feature of the performer image;
in step S3, the different latent features are obtained by encoding as follows:
the input text prompt p is encoded with the text encoder of the contrastive language-image pre-training model CLIP to obtain the latent feature of the text prompt f_p, expressed by formula (1):
f_p = E_text(p)    (1);
where f_p denotes the latent feature of the text prompt, E_text denotes the text encoder of CLIP, and p denotes the text prompt vector;
the input performer image I is encoded with the image encoder of the pre-trained VAE model to obtain the latent feature of the performer image z_I, expressed by formula (2):
z_I = E_vae(I)    (2);
where z_I denotes the latent feature of the performer image, E_vae denotes the encoder of the pre-trained VAE model, and I denotes the performer image vector;
the music clip is encoded with the pre-trained model Wav2CLIP, which can learn the aligned semantic relation between a music clip and the dance images corresponding to it, giving the audio latent feature related to the dance images f_m, expressed by formula (3):
f_m = Wav2CLIP(m)    (3);
where f_m denotes the audio latent feature, Wav2CLIP denotes the pre-trained Wav2CLIP model, and m denotes the music clip vector;
in addition, a time step t is introduced to guide the denoising process of the diffusion model; it is encoded by a 2-layer fully connected network with a SiLU activation to obtain the time-step feature f_t, expressed by formula (4):
f_t = FC(SiLU(FC(t)))    (4);
where f_t denotes the time-step feature, FC denotes a fully connected layer, SiLU denotes the SiLU activation function, and t denotes the time step;
Step S4: encoding the dance animation segments into a potential space to obtain potential space characteristics of the source dance animation segments, randomly adding noise of time steps to the potential space characteristics of the source dance animation segments, predicting the noise added to the potential space characteristics of the source dance animation segments through control of the potential characteristics of text prompts, the potential characteristics of audio and the potential space characteristics of performer images, and further obtaining the potential space characteristics of the source dance animation segments, from which the predicted noise is removed;
in step S4, the process of obtaining the potential spatial features of the dance animation segment from which the prediction noise is removed is as follows:
encoder employing pre-trained VAE model Segment dance animation/>Coding to the potential space characteristics to obtain the dance animation fragment/>Potential features of/>The expression is expressed by formula (5):
(5);
in the method, in the process of the invention, Representing dance animation segment/>Potential features of/>Representing a dance animation segment;
For a pair of Time step/>Adding noise/>Obtaining potential characteristics/>, of the noisy dancing animation segmentsThe expression is expressed by formula (6):
(6);
in the method, in the process of the invention, Representing potential characteristics of noisy dance animation segments,/>Representing 1 to the corresponding time step in time step T >Cumulative result of value,/>Representing a standard normal distribution,/>Representation/>Variance of time step added noise,/>Representing 1 to the corresponding time step in time step T >A value;
the denoising process for guiding diffusion by fusing the potential features of the coded text prompt, the potential audio features and the potential spatial features of the performer image through the multi-condition encoder module is as follows:
Potential features of the actor image to be acquired Dance animation segment potential characteristics/>, after learning zero convolution network and after adding noiseAdding to obtain potential spatial features/>, of the actor imagesThe expression is expressed by formula (7):
(7);
in the method, in the process of the invention, Representing potential spatial features,/>A zero convolution network representing potential spatial features for processing the actor's image;
Design condition normalization module, to obtain audio potential characteristics After special full-connection layer processing, audio potential features are obtained as conditions, and distribution adjustment is carried out on noise-containing potential spatial features after a condition normalization module, so that a denoising process is better guided, and an expression is represented by a formula (8):(8);
wherein CBN represents a condition normalization module, Representing normalization layer,/>Representing a special fully connected layer for processing audio potential features;
after the normalization processing is carried out on the potential spatial noise processed by the condition normalization module, the potential spatial feature after the first condition normalization processing is obtained through an activation function and a convolution layer The expression is expressed by formula (9):
(9);
in the method, in the process of the invention, Representing potential spatial features,/>Representing a convolution layer;
will acquire a time step feature The potential spatial noise/>, which is treated by the activation function and the full connection layer and is treated by the condition and condition normalization moduleAfter addition, the audio potential feature pair/>, after being input to a normalization layer of a second condition normalization modulePerforming secondary condition normalization processing, and performing normalization layer, activation function, dropout regularization method and zero convolution layer processing to obtain/>The expression is expressed by formula (10):
(10);
in the method, in the process of the invention, Representing potential spatial features obtained after a second conditional normalization process,/>Representing a zero convolution layer,/>Representation regularization;
the noisy latent feature z_t of the dance animation clip is used for noise prediction and a noise prediction loss is constructed;
the obtained h2 is passed through learnable zero-convolution layers and added to the outputs of the corresponding encoder modules and intermediate-layer module of each layer, then fed to the intermediate-layer module and decoder modules of the diffusion model, to predict the noise added at time step t;
during this process the parameters of the pre-trained diffusion model are locked, and the weights copied to the control-network part are further trained with the different conditional feature vectors; the different conditional feature vectors include the music and the performer image; the loss between the predicted noise ε_θ and the real noise ε initially added is constructed with a cross-entropy loss function, giving the noise prediction loss L_noise, expressed by formula (11):
L_noise = E_{z_t, t, ε~N(0,I)} [ ℓ(ε, ε_θ(z_t, t, f_p, C(f_m, z_I))) ]    (11);
where L_noise denotes the noise prediction loss, ℓ denotes the loss function, N(0, I) denotes the normal distribution, ε_θ denotes the predicted noise, f_p denotes the text control over the model, and C denotes the multi-modal control network;
through a contrastive learning mechanism, the distance between the audio latent feature and the latent feature of the source dance animation clip encoded by the CLIP image encoder is minimized, and a contrastive loss function L_con between them is constructed to optimize the Wav2CLIP parameters, so that the audio latent feature better guides the denoising process, expressed by formula (12):
L_con = ℓ_con(f_m, E_img(x0))    (12);
where L_con denotes the contrastive loss function over the distance between the audio latent feature and the dance image feature, ℓ_con denotes the contrastive distance between the two features, and E_img denotes the image encoder of CLIP;
a joint loss function L is constructed to train the model;
the joint loss function L is the sum of the noise prediction loss L_noise and the contrastive loss L_con;
the joint loss function L is expressed by formula (13):
L = L_noise + L_con    (13);
Step S5: decode the denoised latent feature of the source dance animation clip with the decoder of the pre-trained VAE model to obtain the dance clip.
2. The music-conditioned dance animation generation method based on a diffusion model according to claim 1, characterized in that: in step S2, music clips, performer images and source dance animation clips are obtained from the dance videos of the dataset, and text prompts are constructed, as follows:
suppose the dataset contains n groups of data;
the music clips are denoted M = {m1, m2, ..., mn};
the text prompts are denoted P = {p1, p2, ..., pn};
the dance animation clips are denoted X = {x1, x2, ..., xn};
the performer images are denoted I = {I1, I2, ..., In};
where m1, m2 and mn denote the feature vectors of the first, second and n-th music clips, p1, p2 and pi denote the first, second and i-th text prompts, x1, x2 and xn denote the first, second and n-th dance animation clips, and I1, I2 and In denote the first, second and n-th performer images;
each text prompt is constructed as
pi = (name, style), i = 1, 2, ..., n;
where pi denotes the i-th text prompt, name denotes the name of the person in each dance video, and style denotes the dance style.
CN202410146031.1A 2024-02-02 2024-02-02 Music conditional dance animation generation method based on diffusion model Active CN117710533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410146031.1A CN117710533B (en) 2024-02-02 2024-02-02 Music conditional dance animation generation method based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410146031.1A CN117710533B (en) 2024-02-02 2024-02-02 Music conditional dance animation generation method based on diffusion model

Publications (2)

Publication Number Publication Date
CN117710533A CN117710533A (en) 2024-03-15
CN117710533B true CN117710533B (en) 2024-04-30

Family

ID=90144608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410146031.1A Active CN117710533B (en) 2024-02-02 2024-02-02 Music conditional dance animation generation method based on diffusion model

Country Status (1)

Country Link
CN (1) CN117710533B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033733A (en) * 2022-06-10 2022-09-09 思必驰科技股份有限公司 Audio text pair generation method, electronic device and storage medium
CN115861494A (en) * 2023-02-20 2023-03-28 青岛大学 Cross-mode converter model type automatic dance generation method
CN116681810A (en) * 2023-08-03 2023-09-01 腾讯科技(深圳)有限公司 Virtual object action generation method, device, computer equipment and storage medium
CN116959492A (en) * 2023-07-19 2023-10-27 深圳市欢太科技有限公司 Dance motion determination method and device, electronic equipment and storage medium
CN117316129A (en) * 2023-10-25 2023-12-29 广东外语外贸大学 Method, equipment and storage medium for generating dance gesture based on multi-modal feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230356094A1 (en) * 2020-11-18 2023-11-09 Travertine Design Engine Llc Control mechanisms & graphical user interface features for a competitive video game


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration; Kehong Gong; arXiv; 2023-04-05; full text *
Rhythm feature matching model between motion and music; 樊儒昆; 傅晶; 程司雷; 张翔; 耿卫东; Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报); 2010-06-15 (06); full text *

Also Published As

Publication number Publication date
CN117710533A (en) 2024-03-15

Similar Documents

Publication Publication Date Title
Ma et al. Visual speech recognition for multiple languages in the wild
Gan et al. Foley music: Learning to generate music from videos
Ao et al. Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings
CN112071329B (en) Multi-person voice separation method and device, electronic equipment and storage medium
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
Liu et al. Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability
Luo et al. Emotional voice conversion using dual supervised adversarial networks with continuous wavelet transform f0 features
CN111382257A (en) Method and system for generating dialog context
CN117349675B (en) Multi-mode large model construction system for multiple information sources
CN115858726A (en) Multi-stage multi-modal emotion analysis method based on mutual information method representation
Fan et al. Joint audio-text model for expressive speech-driven 3d facial animation
CN117216234A (en) Artificial intelligence-based speaking operation rewriting method, device, equipment and storage medium
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
Li et al. Towards noise-tolerant speech-referring video object segmentation: Bridging speech and text
Huang et al. Fine-grained talking face generation with video reinterpretation
van Rijn et al. VoiceMe: Personalized voice generation in TTS
CN117710533B (en) Music conditional dance animation generation method based on diffusion model
Hong et al. When hearing the voice, who will come to your mind
CN116701580A (en) Conversation emotion intensity consistency control method
Liu et al. Sounding video generator: A unified framework for text-guided sounding video generation
CN117980915A (en) Contrast learning and masking modeling for end-to-end self-supervised pre-training
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Zhao et al. Generating diverse gestures from speech using memory networks as dynamic dictionaries
TW202238532A (en) Three-dimensional face animation from speech
Liu et al. TACFN: transformer-based adaptive cross-modal fusion network for multimodal emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant