CN117292704A - Voice-driven gesture action generation method and device based on diffusion model - Google Patents
- Publication number
- CN117292704A (application CN202311007818.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- noise
- diffusion model
- gesture
- diffusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a method and a device for generating voice-driven gesture actions based on a diffusion model. The method comprises the following steps: acquiring a human gesture action dataset with voice annotations and preprocessing it to obtain training data consisting of gesture action sequence segments labeled with voice information; adding noise to the gesture action sequences in the training data to obtain noisy gesture action sequence samples for training a diffusion model; constructing and training a diffusion model for voice-driven gesture action generation, wherein the diffusion model treats the gesture action generation task as a denoising process applied to a noisy gesture action sequence; and, using the trained diffusion model, performing iterative denoising from randomly sampled Gaussian noise, conditioned on a given voice input of arbitrary length, to generate a gesture action sequence. By using the diffusion model to model the voice-conditioned distribution of gesture action sequences, the invention can generate gesture action sequences with greater realism and diversity.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for generating voice-driven gesture actions based on a diffusion model.
Background
With the development of artificial intelligence, virtual digital humans are widely used in games, media, film and other fields, and gesture action generation is one of the key technologies in the virtual digital human production process. In the real world, people naturally make gesture actions, such as hand waving, while speaking; these gestures serve as non-verbal communication signals that help the speaker express himself better. Gesture actions produced during speaking play an important role in interpersonal communication, and they play a non-negligible role in virtual digital human production: appropriately adding natural gesture actions to a virtual digital human can enhance the visual effect, improve the user experience, and bring warmer emotional interaction. Therefore, research on voice-driven gesture action generation is of practical significance.
Voice-driven gesture action generation means generating, given a piece of voice input, a gesture action sequence corresponding to that input. Existing voice-driven gesture action generation methods can generally be classified into rule-based methods and learning-based methods. (1) Rule-based methods require explicitly formulated rules that map voice content to gesture actions; they can establish a correspondence between voice content and gesture actions, but their expressive capability is directly limited by the number of rules formulated, they cannot generate gesture actions beyond the defined rules, and they therefore have great limitations. (2) Learning-based methods learn the mapping from voice to gesture actions from data; they generally rely on generative adversarial networks, which are difficult to train and prone to mode collapse, making it hard to produce realistic and diverse results.
To solve the above problems, the invention provides a voice-driven gesture action generation method based on a diffusion model, which exploits the strong generative capability of diffusion models to generate gesture action sequences with greater realism and diversity.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provide a voice-driven gesture motion generation method and device based on a diffusion model.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for generating a speech driven gesture motion based on a diffusion model, including the steps of:
acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the gesture action sequence in the training data is subjected to noise adding, so that a noise added gesture action sequence sample is obtained, and the process is also called a diffusion process;
constructing a diffusion model for voice-driven gesture action generation, wherein the diffusion model treats the gesture action generation task as a denoising process applied to a noisy gesture action sequence; taking the noisy gesture action sequence samples as the input of the diffusion model and training it with a mean-square-error loss function until convergence, the loss function measuring the difference between the real noise added in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network extracts semantic feature information from the input voice to obtain voice features, and the voice features serve as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network predicts the target data distribution from the noise distribution conditioned on specific voice features, i.e., it learns the inversion of the diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, connected in the order: first one-dimensional convolution block, three downsampling modules, three conditional upsampling modules, second one-dimensional convolution block;
and (3) using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
As a preferred technical solution, the preprocessing of the human gesture action data set includes the following steps:
resampling the human gesture action dataset at a set frame rate, extracting the upper-body gesture action data, and sampling each segment of the data with a set stride and length to obtain a plurality of gesture action sequence segments with voice information labels as the training data.
As a preferred technical solution, the step of adding noise to the gesture motion sequence in the training data to obtain a noise added gesture motion sequence sample specifically includes:
for a gesture action sequence x = {f_1, f_2, …, f_N} with N frames, where f_i = {p_{1,1}, p_{1,2}, p_{1,3}, …, p_{J,1}, p_{J,2}, p_{J,3}} denotes the three-dimensional skeletal joint rotation angles of the gesture action in the i-th frame and J is the total number of skeletal joints: given the original gesture action sequence x_0 ∈ R^{N×3J}, Gaussian noise is gradually added to x_0; after T noising steps pure Gaussian noise x_T is obtained, yielding the noisy gesture action sequence samples. This process is also known as the diffusion process.
As a preferred technical solution, the voice feature extraction network is specified as follows:
first, a pre-trained deep model is used to extract a multidimensional feature for each voice segment, and the features of adjacent voice segments are used to represent the current state of the voice modality for a frame, i.e., they form the raw voice features corresponding to a single frame of gesture action;
then the raw voice features are fed into a temporal filtering module to compute voice smoothing features, i.e., adjacent voice features are fused to obtain the voice smoothing features;
when training the noise prediction network, a random mask is applied to the voice smoothing features, combining the conditionally trained and the unconditionally trained diffusion model to realize classifier-free guided conditional diffusion model learning.
As a preferred technical solution, conditioned on specific voice features, the noise prediction network needs to learn the following probability distribution:

p_θ(x_{t−1} | x_t, t, a)

where θ denotes the network parameters, t is the noise step, a is the voice smoothing feature, and x_t is the noisy sample obtained after adding noise to the original gesture action sequence x_0 for t steps.
As a preferred technical solution, the loss function is formulated as:

L(θ) = E_{x_0, ε, t} [ ‖ε − ε_θ(x_t, t, a)‖² ]

where θ denotes the network parameters, t is the noise step, ε ~ N(0, I) is the Gaussian noise added at noise step t of the diffusion process, ε_θ is the noise predicted by the noise prediction network, a is the voice smoothing feature, and x_t is the noisy sample obtained after adding noise to the original gesture action sequence x_0 for t steps.
As a preferred technical solution, the noise prediction network specifically includes:
the first one-dimensional convolution block converts the input noisy gesture action sequence into a high-dimensional representation, then flattens it and applies a further one-dimensional convolution operation;
the three downsampling modules downsample the data three times, and each downsampling module has four one-dimensional convolution layers;
the three conditional upsampling modules each comprise a convolution kernel predictor and a time-aware location-variable convolution block, wherein the convolution kernel predictor outputs a predicted convolution kernel conditioned on the noise step and the voice smoothing features, and the predicted convolution kernel participates in the computation inside the time-aware location-variable convolution block;
the second one-dimensional convolution block first produces a reduced-dimension representation of the data, then reshapes it and applies a further one-dimensional convolution operation.
In a second aspect, the invention provides a voice-driven gesture action generation system based on a diffusion model, applied to the above voice-driven gesture action generation method, and comprising a data preprocessing module, a diffusion module, a diffusion model module and a back-diffusion module;
the data preprocessing module is used for acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the diffusion module is used for adding noise to the gesture action sequence in the training data to obtain a noise added gesture action sequence sample for training a diffusion model, and the process is also called a diffusion process;
the diffusion model module is used for constructing a diffusion model for voice-driven gesture action generation; the diffusion model treats the gesture action generation task as a denoising process applied to a noisy gesture action sequence, takes the noisy gesture action sequence samples as input, and is trained with a mean-square-error loss function until convergence, the loss function measuring the difference between the real noise added in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network extracts semantic feature information from the input voice to obtain voice features, and the voice features serve as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network predicts the target data distribution from the noise distribution conditioned on specific voice features, i.e., it learns the inversion of the diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, connected in the order: first one-dimensional convolution block, three downsampling modules, three conditional upsampling modules, second one-dimensional convolution block;
the back diffusion module is used for performing iterative denoising from random sampling Gaussian noise according to given voice input with any length by using a trained diffusion model, and generating a gesture action sequence.
In a third aspect, the present invention provides an electronic device, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the diffusion model-based speech driven gesture motion generation method.
In a fourth aspect, the present invention provides a computer readable storage medium storing a program, wherein the program, when executed by a processor, implements the method for generating a speech driven gesture based on a diffusion model.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a voice-driven gesture motion generation method based on a diffusion model, which utilizes the advantages of strong generation capacity and easy training of the diffusion model to model gesture motion sequence distribution under specific voice conditions, solves the problems of difficult training and mode collapse in the method based on generating an countermeasure network, and generates a gesture motion sequence with more authenticity and diversity. Experiments performed on the challenging baseline dataset Body-Expression-Audio-Text (BEAT) show that the diffusion model-based speech-driven gesture motion generation method of the present invention is superior to the baseline method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of generating speech driven gesture actions based on a diffusion model in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a noise prediction network in an embodiment of the invention;
FIG. 3 is a schematic diagram of a conditional upsampling module in a noise prediction network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of extracting speech smoothing features in an embodiment of the invention;
FIG. 5 is a schematic diagram of a diffusion module and a back diffusion module in an embodiment of the invention;
FIG. 6 is an example of generating gesture motion results corresponding to speech input using a trained diffusion model to denoise randomly sampled Gaussian noise in an embodiment of the invention;
FIG. 7 is a schematic diagram of a speech driven gesture motion generation system based on a diffusion model according to an embodiment of the present invention;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1, the method for generating a voice-driven gesture motion based on a diffusion model according to the present embodiment includes the following steps:
s1, acquiring a human body gesture action data set with voice labels, and preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels.
Further, the pretreatment specifically comprises:
resampling the dataset at a rate of 15 frames per second, extracting the upper-body gesture actions, and sampling each segment of the training data with a stride of 40 frames and a length of 128 frames to obtain a plurality of gesture action sequence segments with voice information labels as the training data.
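The segment sampling described above can be sketched as a simple sliding-window computation. The function below is an illustrative sketch only; its name and return format are not from the patent.

```python
def segment_sequence(num_frames, window=128, stride=40):
    """Return (start, end) frame indices of the fixed-length training windows
    cut from a clip that has already been resampled to 15 fps."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

# a 20-second clip at 15 fps has 300 frames and yields five 128-frame windows
windows = segment_sequence(300)
```

Trailing frames that do not fill a complete 128-frame window are simply dropped in this sketch; the patent does not specify how remainders are handled.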
S2, the gesture action sequence in the training data is subjected to noise adding, and a noise added gesture action sequence sample is obtained and is used for training a diffusion model.
Further, for a gesture action sequence x = {f_1, f_2, …, f_N} with N frames, where f_i = {p_{1,1}, p_{1,2}, p_{1,3}, …, p_{J,1}, p_{J,2}, p_{J,3}} denotes the three-dimensional skeletal joint rotation angles of the gesture action in the i-th frame and J is the total number of skeletal joints: given the original gesture action sequence x_0 ∈ R^{N×3J}, Gaussian noise is gradually added to x_0; after T noising steps pure Gaussian noise x_T is obtained. This process, also called the diffusion process, is defined as follows:
x_1, x_2, …, x_T have the same dimensions as x_0; T is set to 500, and the variance β_t increases linearly from β_1 = 0.0001 to β_T = 0.02. The noising step from step t−1 to step t can be expressed as:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) · x_{t−1}, β_t I)

where N(·; μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ. Owing to the properties of the Gaussian distribution, the sample x_t after t noising steps can also be obtained by sampling directly from x_0:

x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε, ε ~ N(0, I)

where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s.
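The closed-form noising above can be checked numerically. The scalar sketch below applies the linear β schedule (β_1 = 0.0001 up to β_T = 0.02, T = 500) to a single pose coordinate; in practice x_0 is the full N×3J sequence.

```python
import math

T = 500
# linear variance schedule: beta_1 = 0.0001 up to beta_T = 0.02
betas = [0.0001 + (0.02 - 0.0001) * t / (T - 1) for t in range(T)]

def alpha_bar(t):
    """Cumulative product of alpha_s = 1 - beta_s for s = 1..t."""
    prod = 1.0
    for s in range(t):
        prod *= 1.0 - betas[s]
    return prod

def q_sample(x0, t, eps):
    """x_t sampled directly from x_0 for a single pose coordinate."""
    ab = alpha_bar(t)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
```

At t = T the signal coefficient √(ᾱ_T) is close to zero, so x_T is essentially pure Gaussian noise, as the definition of the diffusion process requires.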
S3, constructing and training a diffusion model for generating voice-driven gesture actions, wherein the diffusion model regards a gesture action generating task as a denoising process for a noisy gesture action sequence.
Further, the diffusion model is used for learning the reversal of the diffusion process to generate a new sample, namely, the gesture motion generation task is regarded as a denoising process for the noisy gesture motion sequence; the diffusion model comprises a voice characteristic extraction network and a noise prediction network, and is specifically as follows:
s31, extracting voice characteristics of a network extraction sample based on the voice characteristics, wherein the voice characteristics are as follows:
s311, the speech feature extraction network firstly uses a pre-trained deep model to extract a 29-dimensional feature for each 33-millisecond speech segment, uses the features of 16 adjacent speech segments to represent the current state of the frame speech mode, namely for the gesture action of a single frame, the corresponding speech original feature is A E R 16×29 Then inputting the original voice characteristics into a time filtering module to calculate voice smooth characteristics, namely fusing adjacent voice characteristics, wherein the obtained voice smooth characteristics are a epsilon R 64×1 。
S312, in order to realize conditional generation more accurately, when training the noise prediction network a random mask is applied to the voice smoothing features, so that the conditionally trained diffusion model and the unconditionally trained diffusion model are combined, realizing classifier-free guided conditional diffusion model learning.
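The random masking step can be sketched as follows; the 10% drop probability and the None placeholder for the unconditional branch are illustrative assumptions, not values given by the patent.

```python
import random

NULL_COND = None  # placeholder standing in for the "no condition" input

def mask_condition(voice_feat, p_uncond=0.1, rng=random):
    """With probability p_uncond, drop the voice smoothing feature so the same
    network also learns the unconditional distribution; this joint training is
    what enables classifier-free guidance at sampling time."""
    return NULL_COND if rng.random() < p_uncond else voice_feat
```

At sampling time, the same network is then queried twice per step, once with the voice feature and once with the placeholder, and the two noise predictions are blended with a guidance weight.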
S32, predicting noise based on the extracted voice features using the noise prediction network;
further, referring to fig. 2, the noise prediction network in this embodiment includes a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, where the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected, specifically as follows:
the first one-dimensional convolution block (upper one-dimensional convolution block) firstly converts the input gesture motion sequence with noise into a representation with higher dimension, the used convolution kernel is 7, the step length is 1, the number of input channels is three times of the total number of skeleton joints, namely 141, the number of output channels is 256, then one-dimensional convolution operation is carried out after flattening the input gesture motion sequence with noise, the used convolution kernel is 7, the step length is 1, the number of input channels is 1, and the number of output channels is 32.
The three downsampling modules downsample the data three times, and each downsampling module has four one-dimensional convolution layers: the first layer has a kernel size of 1 and a stride of 1 with 32 input and 32 output channels; the other three layers have a kernel size of 3 and a stride of 1 with 32 input and 32 output channels, and their dilation coefficients are 1, 2 and 4, respectively.
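With kernel size 3 and dilations 1, 2 and 4, the receptive field of one downsampling module can be computed directly; the helper below is an illustrative check, not part of the patent.

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 1-D convolutions,
    where each layer is given as (kernel_size, dilation)."""
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

# one downsampling module: a 1x1 conv followed by three dilated 3x1 convs
module = [(1, 1), (3, 1), (3, 2), (3, 4)]
```

Each module therefore sees 15 frames (one second at 15 fps), and stacking the three modules widens the view to 43 frames.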
The three conditional upsampling modules mainly comprise a convolution kernel predictor and a time-aware location-variable convolution block; the convolution kernel predictor outputs a predicted convolution kernel based on the noise step and the voice smoothing features, and the predicted kernel participates in the computation of the time-aware location-variable convolution block.
The second one-dimensional convolution block (the lower one-dimensional convolution block) first produces a reduced-dimension representation of the data, using a kernel size of 7, a stride of 1, 32 input channels and 1 output channel; the data are then reshaped and a further one-dimensional convolution is applied, with a kernel size of 7, a stride of 1, 256 input channels and 141 output channels.
Still further, referring to fig. 3, a conditional upsampling module in a noise prediction network according to an embodiment of the present invention is shown, specifically:
the noise step number t is encoded by using position encoding, and then added with the voice smoothing characteristic a after passing through a full connection layer, and the result is taken as the input of a convolution kernel predictor, and the predicted convolution kernel participates in the calculation of time perception position variable convolution.
Further, referring to fig. 4, the extraction of the speech smoothing feature in the embodiment of the present invention is shown, specifically:
feature encoding of speech input using deep model to obtain speech raw features A E R 16×29 Then, the speech is input into a time filtering module shown in FIG. 3 to calculate speech smoothing characteristics, i.e. fusion between adjacent speech characteristics is carried out, and the dimension of the input speech characteristics is gradually reduced to a E R 64×1 . The time filtering module comprises a convolution block and a linear layer block, wherein the convolution block is provided with four convolution layers, the used convolution kernel is 3 in size, the step length is 2, and the linear layer block comprises two full-connection layers.
S33, training the diffusion model;
S331, conditioned on specific voice features, the noise prediction network needs to predict the target data distribution from the noise distribution, learning the inversion of the diffusion process to generate new samples, i.e., it learns the following probability distribution:

p_θ(x_{t−1} | x_t, t, a)

where θ denotes the network parameters, t is the noise step, and a is the voice smoothing feature. The noise prediction network mainly comprises three downsampling modules and three conditional upsampling modules; each conditional upsampling module mainly comprises a convolution kernel predictor and a time-aware location-variable convolution block. The convolution kernel predictor outputs a predicted convolution kernel based on the noise step and the voice smoothing features, and the predicted kernel participates in the computation inside the time-aware location-variable convolution block.
S332, training the diffusion model, with the following objective function:

L(θ) = E_{t, x_0, ε} [ ‖ε − ε_θ(x_t, t, a)‖² ]

wherein θ is a network parameter, t is the noise step number, ε is the Gaussian noise added at noise step t in the diffusion process, ε_θ(x_t, t, a) is the noise predicted by the network, x_0 is the original gesture action sequence, a is the speech smoothing feature, and x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε is the noisy sample obtained by adding noise to x_0 for t times.
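This objective can be sketched as a standard DDPM training step: sample a noise step t and Gaussian noise ε, form x_t in closed form, and regress the network's prediction onto ε with mean squared error. Here `eps_model` stands in for the noise prediction network; the sketch illustrates the technique and is not the patent's code.

```python
import torch

def diffusion_loss(eps_model, x0, a, alpha_bar):
    """One training step of the simplified DDPM objective:
    E_{t, x0, eps} || eps - eps_theta(x_t, t, a) ||^2."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,))          # random noise steps
    eps = torch.randn_like(x0)                          # true added noise
    ab = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))  # broadcastable shape
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps        # closed-form noising
    return torch.mean((eps - eps_model(x_t, t, a)) ** 2)

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
# Dummy network predicting zero noise; gesture sequences of shape (N, 3J).
loss = diffusion_loss(lambda x, t, a: torch.zeros_like(x),
                      torch.randn(4, 34, 120), None, alpha_bar)
print(float(loss) > 0)  # True
```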
S4, using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
Further, referring to FIG. 5, the process of generating a gesture action sequence is essentially a back-diffusion process: a noise sample x_T obeying a Gaussian distribution is randomly sampled, and the trained diffusion model is used to denoise x_T for T times to obtain x_0. The denoising step from step t to step t−1 can be expressed as:

x_{t−1} = (1/√α_t)·(x_t − (β_t/√(1 − ᾱ_t))·ε̂) + √(β_t)·z,  where ε̂ = (1 + γ)·ε_θ(x_t, t, a) − γ·ε_θ(x_t, t, ∅)

wherein γ is a weight coefficient balancing the conditional and unconditional noise predictions, z is standard Gaussian noise, α_t = 1 − β_t, and ᾱ_t is the cumulative product of α_1 … α_t.
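Assuming the weight coefficient γ acts as a classifier-free guidance scale (consistent with the classifier-free training described earlier, with the unconditional case ∅ represented as None below), the back-diffusion loop can be sketched as:

```python
import torch

@torch.no_grad()
def sample(eps_model, a, shape, betas, gamma=1.0):
    """Iterative denoising from x_T ~ N(0, I) down to x_0, blending the
    conditional and unconditional noise predictions with weight gamma."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                    # x_T: pure Gaussian noise
    for t in reversed(range(len(betas))):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        # Guided noise estimate from conditional and unconditional passes.
        eps = (1 + gamma) * eps_model(x, tt, a) - gamma * eps_model(x, tt, None)
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x

# Dummy noise model; real use would pass the trained prediction network.
gestures = sample(lambda x, t, a: torch.zeros_like(x), None,
                  (2, 34, 10), torch.linspace(1e-4, 0.02, 50))
print(gestures.shape)  # torch.Size([2, 34, 10])
```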
According to the diffusion model-based voice-driven gesture action generation method, a diffusion model for voice-driven gesture action generation is constructed, and the gesture action generation task is regarded as a denoising process of a noisy gesture action sequence. The diffusion model is trained with the noisy gesture action sequence samples as input and the mean square error as the loss function until convergence, where the loss function measures the difference between the real noise in the diffusion process and the noise predicted by the diffusion model. The diffusion model comprises a voice feature extraction network and a noise prediction network: the voice feature extraction network extracts semantic feature information from the input voice to obtain voice features, which serve as one of the inputs of the noise prediction network to realize voice-driven gesture action generation; the noise prediction network predicts the target data distribution from the noise distribution under the condition of specific voice features, i.e. it learns to invert the diffusion process to generate new samples, and comprises a first one-dimensional convolution block, three downsampling modules, three conditional upsampling modules, and a second one-dimensional convolution block connected in sequence. By using a diffusion model to model the voice-driven gesture action sequence distribution, the invention can generate more realistic and diverse gesture action sequences.
Referring to fig. 6, an example is shown of gesture results generated for a voice input by denoising randomly sampled Gaussian noise with a trained diffusion model in an embodiment of the present invention; the six gesture actions illustrated in the figure are only six frames of a complete gesture action sequence.
Objective evaluation results of the diffusion model-based voice-driven gesture action generation method of the present invention are shown in Table 1. The present invention is evaluated on the Body-Expression-Audio-Text (BEAT) dataset using FGD (Fréchet Gesture Distance), SRGR (Semantic-Relevant Gesture Recall) and BeatAlign (Beat Alignment Score) as evaluation indexes. FGD evaluates the distribution distance between the generated and real gesture action sequences (lower is better); SRGR evaluates the semantic relevance of gesture actions (higher is better); BeatAlign evaluates the similarity between the generated gesture action sequence and the speech beat (higher is better).
TABLE 1
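For reference, FGD is computed analogously to FID: the Fréchet distance between Gaussians fitted to latent features of real and generated gesture sequences. The sketch below assumes pre-extracted (n, d) feature arrays and omits the learned feature extractor the metric normally relies on.

```python
import numpy as np

def frechet_gesture_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}). Lower is better."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # Matrix square root of s1 @ s2 via eigendecomposition.
    vals, vecs = np.linalg.eig(s1 @ s2)
    covmean = (vecs * np.sqrt(vals.astype(complex))) @ np.linalg.inv(vecs)
    diff = mu1 - mu2
    return float((diff @ diff + np.trace(s1 + s2 - 2 * covmean)).real)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
print(abs(frechet_gesture_distance(real, real)) < 1e-6)  # True
```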
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the order of actions described, as some steps may be performed in another order or simultaneously in accordance with the present invention.
Based on the same idea as the diffusion model-based speech-driven gesture motion generation method in the above embodiment, the present invention also provides a diffusion model-based speech-driven gesture motion generation system that can be used to perform the above method. For ease of illustration, only the portions of the structure diagram of the system embodiment that are relevant to embodiments of the present invention are shown; those skilled in the art will appreciate that the illustrated structure does not limit the apparatus, which may include more or fewer components than illustrated, combine certain components, or arrange components differently.
Referring to fig. 7, in another embodiment of the present application, a diffusion model-based speech driven gesture motion generation system 100 is provided, which includes a data preprocessing module 101, a diffusion module 102, a diffusion model module 103, and a back diffusion module 104;
the data preprocessing module 101 is configured to obtain a human body gesture motion data set with voice label, perform preprocessing on the human body gesture motion data set to obtain training data of gesture motion sequence segments with voice information label,
the diffusion module 102 is configured to denoise the gesture motion sequence in the training data to obtain a denoised gesture motion sequence sample, which is used for training a diffusion model;
the diffusion model module 103 is configured to construct a diffusion model for speech-driven gesture motion generation, where the diffusion model regards a gesture motion generation task as a denoising process for a noisy gesture motion sequence; taking the noisy gesture motion sequence sample as the input of a diffusion model, training the diffusion model by taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
the inverse diffusion module 104 is configured to use a trained diffusion model to perform iterative denoising from random sampling gaussian noise according to a given speech input with any length, so as to generate a gesture motion sequence.
It should be noted that the diffusion model-based voice-driven gesture action generation system and the diffusion model-based voice-driven gesture action generation method of the present invention correspond one to one, and the technical features and beneficial effects described in the embodiment of the method are equally applicable to the embodiment of the system; for specific content, reference may be made to the description of the method embodiment, which is not repeated herein.
In addition, in the implementation of the diffusion model-based speech driven gesture motion generating system of the above embodiment, the logic division of each program module is merely illustrative, and in practical application, the above-mentioned function allocation may be performed by different program modules according to needs, for example, in view of configuration requirements of corresponding hardware or convenience of implementation of software, that is, the internal structure of the diffusion model-based speech driven gesture motion generating system is divided into different program modules, so as to perform all or part of the functions described above.
Referring to fig. 8, in one embodiment, an electronic device implementing the diffusion model-based speech-driven gesture motion generation method is provided. The electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may further include a computer program, such as a diffusion model-based speech driven gesture motion generation program 203, stored in the first memory 202 and executable on the first processor 201.
The first memory 202 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a removable hard disk of the electronic device 200. The first memory 202 may also be an external storage device of the electronic device 200 in other embodiments, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit and an external storage device of the electronic device 200. The first memory 202 may be used not only to store application software installed in the electronic device 200 and various types of data, such as the code of the diffusion model-based speech driven gesture motion generation program 203, but also to temporarily store data that has been output or is to be output.
The first processor 201 may be formed by an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same function or different functions, including one or more central processing units (Central Processing unit, CPU), a microprocessor, a digital processing chip, a graphics processor, a combination of various control chips, and so on. The first processor 201 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 200 and processes data by running or executing programs or modules stored in the first memory 202 and calling data stored in the first memory 202.
Fig. 8 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 8 is not limiting of the electronic device 200 and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
The diffusion model-based speech driven gesture motion generation program 203 stored in the first memory 202 in the electronic device 200 is a combination of instructions that, when executed in the first processor 201, may implement:
acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the gesture action sequence in the training data is subjected to noise adding, so that a noise added gesture action sequence sample is obtained and is used for training a diffusion model, and the process is also called a diffusion process;
constructing a diffusion model for voice-driven gesture motion generation, wherein the diffusion model regards a gesture motion generation task as a denoising process for a noisy gesture motion sequence; taking the noisy gesture motion sequence sample as the input of a diffusion model, training the diffusion model by taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
and (3) using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
Further, the modules/units integrated with the electronic device 200 may be stored in a non-volatile computer readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.
Claims (10)
1. The voice-driven gesture motion generation method based on the diffusion model is characterized by comprising the following steps of:
acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the gesture action sequence in the training data is subjected to noise adding, so that a noise added gesture action sequence sample is obtained, and the process is also called a diffusion process;
constructing a diffusion model for voice-driven gesture motion generation, wherein the diffusion model regards a gesture motion generation task as a denoising process for a noisy gesture motion sequence; taking the noisy gesture motion sequence sample as the input of a diffusion model, training the diffusion model by taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
and (3) using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
2. The method for generating a voice-driven gesture motion based on a diffusion model according to claim 1, wherein the preprocessing of the human gesture motion data set comprises the steps of:
resampling the human body gesture action data set at a set speed, extracting upper body gesture action data, sampling each segment in the training data at a set stride and length, and obtaining a plurality of gesture action sequence segments with voice information marks as the training data.
3. The method for generating a voice-driven gesture motion based on a diffusion model according to claim 1, wherein the step of adding noise to the gesture motion sequence in the training data to obtain a noise-added gesture motion sequence sample is specifically:
for a gesture action sequence x = {f_1, f_2, …, f_N} with N frames, where f_i = {p_{1,1}, p_{1,2}, p_{1,3}, …, p_{J,1}, p_{J,2}, p_{J,3}} denotes the rotation angles of the three-dimensional skeletal joints of the i-th frame gesture action and J is the total number of skeletal joints; given the original gesture action sequence x_0 ∈ R^{N×3J}, Gaussian noise is gradually added to x_0, and after T additions pure Gaussian noise x_T is obtained, yielding the noisy gesture action sequence samples; this process is also known as the diffusion process.
4. The method for generating a speech driven gesture motion based on a diffusion model according to claim 1, wherein the speech feature extraction network specifically comprises:
firstly, extracting a multidimensional feature for each voice segment by using a pre-trained deep model, and representing the current state of the voice mode of the frame by using the features of adjacent voice segments, namely, for the gesture action of a single frame, the corresponding voice original features;
then inputting the original voice characteristics into a time filtering module to calculate voice smooth characteristics, namely fusing adjacent voice characteristics to obtain the voice smooth characteristics;
when the noise prediction network is trained, a random mask is generated for the voice smooth characteristic, and the diffusion model generated under the training condition and the diffusion model generated under the unconditional condition are combined to realize the conditional diffusion model learning without classifier guidance.
5. The method of claim 1, wherein the noise prediction network learns the following probability distribution under the condition of specific speech features:

p_θ(x_{t−1} | x_t, a)

wherein θ is a network parameter, t is the noise step number, a is the speech smoothing feature, and x_t is the noisy sample obtained by adding noise to the original gesture action sequence x_0 for t times.
6. The method for generating a speech driven gesture motion based on a diffusion model according to claim 1, wherein the formula of the loss function is as follows:

L(θ) = E_{t, x_0, ε} [ ‖ε − ε_θ(x_t, t, a)‖² ]

wherein θ is a network parameter, t is the noise step number, ε is the Gaussian noise added at noise step t in the diffusion process, ε_θ(x_t, t, a) is the noise predicted by the network, a is the speech smoothing feature, and x_t is the noisy sample obtained by adding noise to the original gesture action sequence x_0 for t times.
7. The method for generating a speech driven gesture motion based on a diffusion model according to claim 1, wherein the noise prediction network is specifically as follows:
the first one-dimensional convolution block converts the input gesture action sequence with noise into high-dimensional representation, and then flattening the high-dimensional representation and then carrying out one-dimensional convolution operation again;
the three downsampling modules realize three downsampling of the data, and each downsampling module has four layers of one-dimensional convolutions;
the three condition up-sampling modules respectively comprise a convolution kernel predictor and a time perception position variable convolution block, wherein the convolution kernel predictor outputs a predicted convolution kernel under the condition of noise step number and voice smoothing characteristics, and the predicted convolution kernel participates in calculation in the time perception position variable convolution block;
the second one-dimensional convolution block firstly carries out dimension reduction representation on the data, and carries out one-dimensional convolution operation after dimension reshaping of the data of the dimension reduction representation.
8. A diffusion model-based speech driven gesture motion generation system, characterized in that it is applied to the diffusion model-based speech driven gesture motion generation method of any one of claims 1 to 7, and comprises a data preprocessing module, a diffusion module, a diffusion model module, and a back diffusion module;
the data preprocessing module is used for acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the diffusion module is used for adding noise to the gesture action sequence in the training data to obtain a noise added gesture action sequence sample for training a diffusion model, and the process is also called a diffusion process;
the diffusion model module is used for constructing a diffusion model for generating voice-driven gesture actions, the diffusion model regards a gesture action generation task as a denoising process of a noisy gesture action sequence, a denoised gesture action sequence sample is used as an input of the diffusion model, the diffusion model is trained by using a mean square error loss function until convergence, and the loss function is used for measuring the difference between real noise in the diffusion process and noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
the back diffusion module is used for performing iterative denoising from random sampling Gaussian noise according to given voice input with any length by using a trained diffusion model, and generating a gesture action sequence.
9. An electronic device, the electronic device comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the diffusion model-based speech driven gesture motion generation method of any one of claims 1-7.
10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the diffusion model-based speech-driven gesture motion generation method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311007818.1A CN117292704A (en) | 2023-08-11 | 2023-08-11 | Voice-driven gesture action generation method and device based on diffusion model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117292704A true CN117292704A (en) | 2023-12-26 |
Family
ID=89252525
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117577121A (en) * | 2024-01-17 | 2024-02-20 | 清华大学 | Diffusion model-based audio encoding and decoding method and device, storage medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113077471B (en) | Medical image segmentation method based on U-shaped network | |
CN111932444B (en) | Face attribute editing method based on generation countermeasure network and information processing terminal | |
US10931976B1 (en) | Face-speech bridging by cycle video/audio reconstruction | |
CN111984772B (en) | Medical image question-answering method and system based on deep learning | |
CN117292704A (en) | Voice-driven gesture action generation method and device based on diffusion model | |
CN112949707A (en) | Cross-mode face image generation method based on multi-scale semantic information supervision | |
CN116229531A (en) | Face front image synthesis method for collaborative progressive generation countermeasure network | |
Uddin et al. | A perceptually inspired new blind image denoising method using $ L_ {1} $ and perceptual loss | |
CN113436224B (en) | Intelligent image clipping method and device based on explicit composition rule modeling | |
CN115526223A (en) | Score-based generative modeling in a potential space | |
Luo et al. | Dualg-gan, a dual-channel generator based generative adversarial network for text-to-face synthesis | |
Li et al. | CorrDiff: Corrective Diffusion Model for Accurate MRI Brain Tumor Segmentation | |
Atkale et al. | Multi-scale feature fusion model followed by residual network for generation of face aging and de-aging | |
CN116912268A (en) | Skin lesion image segmentation method, device, equipment and storage medium | |
CN116978057A (en) | Human body posture migration method and device in image, computer equipment and storage medium | |
CN116167015A (en) | Dimension emotion analysis method based on joint cross attention mechanism | |
CN113609330B (en) | Video question-answering system, method, computer and storage medium based on text attention and fine-grained information | |
CN116977343A (en) | Image processing method, apparatus, device, storage medium, and program product | |
Yauri-Lozano et al. | Generative Adversarial Networks for text-to-face synthesis & generation: A quantitative–qualitative analysis of Natural Language Processing encoders for Spanish | |
Sun et al. | Silp-autoencoder for face de-occlusion | |
CN113609355A (en) | Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning | |
CN115984911A (en) | Attribute generation countermeasure network and face image continuous transformation method based on same | |
CN113222100A (en) | Training method and device of neural network model | |
CN116542292B (en) | Training method, device, equipment and storage medium of image generation model | |
CN117788629B (en) | Image generation method, device and storage medium with style personalization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |