CN117292704A - Voice-driven gesture action generation method and device based on diffusion model - Google Patents

Voice-driven gesture action generation method and device based on diffusion model

Info

Publication number
CN117292704A
CN117292704A
Authority
CN
China
Prior art keywords
voice
noise
diffusion model
gesture
diffusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311007818.1A
Other languages
Chinese (zh)
Inventor
梁云
叶翠
陈熠金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University filed Critical South China Agricultural University
Priority to CN202311007818.1A priority Critical patent/CN117292704A/en
Publication of CN117292704A publication Critical patent/CN117292704A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/055Time compression or expansion for synchronising with other signals, e.g. video signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for voice-driven gesture motion generation based on a diffusion model. The method comprises the following steps: acquiring a human gesture motion data set with voice annotations, and preprocessing it to obtain training data consisting of gesture motion sequence segments labeled with voice information; adding noise to the gesture motion sequences in the training data to obtain noisy gesture motion sequence samples for training a diffusion model; constructing and training a diffusion model for voice-driven gesture motion generation, wherein the diffusion model treats the gesture motion generation task as a denoising process applied to a noisy gesture motion sequence; and using the trained diffusion model to perform iterative denoising from randomly sampled Gaussian noise according to a given voice input of arbitrary length, thereby generating a gesture motion sequence. By using the diffusion model to model the distribution of voice-driven gesture motion sequences, the invention can generate gesture motion sequences with greater realism and diversity.

Description

Voice-driven gesture action generation method and device based on diffusion model
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for generating voice-driven gesture actions based on a diffusion model.
Background
With the development of artificial intelligence, virtual digital humans are widely used in games, media, film and other fields, and gesture motion generation is one of the key technologies in producing them. In the real world, people quite naturally make gestures while speaking, such as waving a hand; these gestures serve as non-verbal communication signals and help the speaker express himself or herself better. The gestures made while speaking play an important role in interpersonal communication, and they play a non-negligible role in producing virtual digital humans as well: appropriately adding natural gestures to a virtual digital human can enhance the visual effect, improve the user experience, and bring warmer, more emotional interaction. Research on voice-driven gesture motion generation is therefore of practical significance.
Voice-driven gesture motion generation means generating, given a piece of voice input, the gesture motion sequence corresponding to that input. Existing methods can generally be divided into rule-based methods and learning-based methods: (1) Rule-based methods require explicitly formulating rules that map voice content to gesture motions. They can establish correspondences between voice content and gestures, but their expressive power is directly limited by the number of rules formulated, and they cannot generate gestures beyond the defined rules, which is a major limitation. (2) Learning-based methods learn the mapping from voice to gesture motion from data. They generally rely on generative adversarial networks (GANs), which are difficult to train and prone to mode collapse, and therefore struggle to produce realistic and diverse results.
To solve the above problems, the invention proposes a diffusion-model-based voice-driven gesture motion generation method that exploits the strong generative capacity of diffusion models to generate gesture motion sequences with greater realism and diversity.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provide a voice-driven gesture motion generation method and device based on a diffusion model.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for generating a speech driven gesture motion based on a diffusion model, including the steps of:
acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the gesture action sequence in the training data is subjected to noise adding, so that a noise added gesture action sequence sample is obtained, and the process is also called a diffusion process;
constructing a diffusion model for voice-driven gesture motion generation, wherein the diffusion model regards a gesture motion generation task as a denoising process for a noisy gesture motion sequence; taking the noisy gesture motion sequence sample as the input of a diffusion model, training the diffusion model by taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
and (3) using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
As a preferred technical solution, the preprocessing of the human gesture action data set includes the following steps:
resampling the human body gesture action data set at a set speed, extracting upper body gesture action data, sampling each segment in the training data at a set stride and length, and obtaining a plurality of gesture action sequence segments with voice information marks as the training data.
As a preferred technical solution, the step of adding noise to the gesture motion sequence in the training data to obtain a noise added gesture motion sequence sample specifically includes:
for a gesture motion sequence $x = \{f_1, f_2, \dots, f_N\}$ with $N$ frames, where $f_i = \{p_{1,1}, p_{1,2}, p_{1,3}, \dots, p_{J,1}, p_{J,2}, p_{J,3}\}$ denotes the three-dimensional skeletal joint rotation angles of the gesture motion in the $i$-th frame and $J$ is the total number of skeletal joints: given the original gesture motion sequence $x_0 \in \mathbb{R}^{N \times 3J}$, Gaussian noise is gradually added to $x_0$, and after $T$ additions pure Gaussian noise $x_T$ is obtained, yielding the noisy gesture motion sequence samples; this process is also known as the diffusion process.
As a preferred technical solution, the speech feature extraction network operates specifically as follows:
first, a pre-trained deep model is used to extract a multidimensional feature for each speech segment, and the features of adjacent speech segments are used to represent the current state of the speech modality for the frame, i.e., the raw speech feature corresponding to a single frame of gesture motion;
then the raw speech features are input into a time filtering module to compute speech smoothing features, i.e., adjacent speech features are fused to obtain the speech smoothing features;
when the noise prediction network is trained, a random mask is generated for the speech smoothing features, and the conditionally trained diffusion model is combined with the unconditionally trained one to realize classifier-free guided conditional diffusion model learning.
As a preferred technical solution, the noise prediction network needs to learn the following probability distribution under the condition of specific speech features:

$$p_\theta(x_{t-1} \mid x_t, a)$$

wherein $\theta$ denotes the network parameters, $t$ is the noise step number, $a$ is the speech smoothing feature, and $x_t$ is the noisy sample obtained by adding noise to the original gesture motion sequence $x_0$ for $t$ times.
As a preferred technical solution, the loss function is formulated as follows:

$$L(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t, a) \rVert^2\,\right]$$

wherein $\theta$ denotes the network parameters, $t$ is the noise step number, $\epsilon$ is the Gaussian noise added at noise step $t$ in the diffusion process with $\epsilon \sim \mathcal{N}(0, I)$, $a$ is the speech smoothing feature, and $x_t$ is the noisy sample obtained by adding noise to the original gesture motion sequence $x_0$ for $t$ times.
As a preferred technical solution, the noise prediction network specifically includes:
the first one-dimensional convolution block converts the input noisy gesture motion sequence into a high-dimensional representation, which is then flattened and subjected to a further one-dimensional convolution operation;
the three downsampling modules perform three stages of downsampling on the data, and each downsampling module has four layers of one-dimensional convolution;
the three conditional upsampling modules each comprise a convolution kernel predictor and a time-aware position-variable convolution block, wherein the convolution kernel predictor outputs a predicted convolution kernel conditioned on the noise step number and the speech smoothing feature, and the predicted kernel participates in the computation in the time-aware position-variable convolution block;
the second one-dimensional convolution block first produces a dimension-reduced representation of the data, reshapes the dimensions of the dimension-reduced data, and then performs a one-dimensional convolution operation.
In a second aspect, the invention provides a voice-driven gesture motion generation system based on a diffusion model, which applies the above voice-driven gesture motion generation method based on a diffusion model and comprises a data preprocessing module, a diffusion module, a diffusion model module and a back-diffusion module;
the data preprocessing module is used for acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the diffusion module is used for adding noise to the gesture action sequence in the training data to obtain a noise added gesture action sequence sample for training a diffusion model, and the process is also called a diffusion process;
the diffusion model module is used for constructing a diffusion model for voice-driven gesture motion generation, wherein the diffusion model regards the gesture motion generation task as a denoising process applied to a noisy gesture motion sequence; the noisy gesture motion sequence samples are taken as the input of the diffusion model, and the diffusion model is trained with a mean square error loss function until convergence, the loss function being used to measure the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture motion generation; the noise prediction network is used for predicting the target data distribution from the noise distribution under the condition of specific voice features, namely learning the inversion of the diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
the back diffusion module is used for performing iterative denoising from random sampling Gaussian noise according to given voice input with any length by using a trained diffusion model, and generating a gesture action sequence.
In a third aspect, the present invention provides an electronic device, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the diffusion model-based speech driven gesture motion generation method.
In a fourth aspect, the present invention provides a computer readable storage medium storing a program, wherein the program, when executed by a processor, implements the method for generating a speech driven gesture based on a diffusion model.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a voice-driven gesture motion generation method based on a diffusion model, which utilizes the advantages of strong generation capacity and easy training of the diffusion model to model gesture motion sequence distribution under specific voice conditions, solves the problems of difficult training and mode collapse in the method based on generating an countermeasure network, and generates a gesture motion sequence with more authenticity and diversity. Experiments performed on the challenging baseline dataset Body-Expression-Audio-Text (BEAT) show that the diffusion model-based speech-driven gesture motion generation method of the present invention is superior to the baseline method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of generating speech driven gesture actions based on a diffusion model in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a noise prediction network in an embodiment of the invention;
FIG. 3 is a schematic diagram of a conditional upsampling module in a noise prediction network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of extracting speech smoothing features in an embodiment of the invention;
FIG. 5 is a schematic diagram of a diffusion module and a back diffusion module in an embodiment of the invention;
FIG. 6 is an example of generating gesture motion results corresponding to speech input using a trained diffusion model to denoise randomly sampled Gaussian noise in an embodiment of the invention;
FIG. 7 is a schematic diagram of a speech driven gesture motion generation system based on a diffusion model according to an embodiment of the present invention;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1, the method for generating a voice-driven gesture motion based on a diffusion model according to the present embodiment includes the following steps:
s1, acquiring a human body gesture action data set with voice labels, and preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels.
Further, the pretreatment specifically comprises:
resampling the data set at a rate of 15 frames per second, extracting the upper-body gesture motions, and sampling each segment in the training data with a stride of 40 frames and a length of 128 frames, to obtain a plurality of gesture motion sequence segments with voice information labels as the training data.
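For illustration only, the windowing step described above can be sketched in Python as follows. This is a minimal sketch, not the patent's code; the function name and array layout are assumptions, and J = 47 joints (141 = 3 × 47 channels) is taken from the network description below.

```python
# Slice a resampled pose stream (15 fps, upper body) into overlapping
# fixed-length training windows: stride 40 frames, length 128 frames.
import numpy as np

def make_training_windows(poses, stride=40, length=128):
    """poses: (num_frames, 3*J) joint rotation angles. Returns an array
    of shape (num_windows, length, 3*J) of overlapping clips."""
    windows = [poses[s:s + length]
               for s in range(0, poses.shape[0] - length + 1, stride)]
    return (np.stack(windows) if windows
            else np.empty((0, length, poses.shape[1]), poses.dtype))

# Example: a 10-second clip at 15 fps with J = 47 joints (141 channels).
clip = np.random.randn(150, 141).astype(np.float32)
print(make_training_windows(clip).shape)  # -> (1, 128, 141)
```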
S2, the gesture action sequence in the training data is subjected to noise adding, and a noise added gesture action sequence sample is obtained and is used for training a diffusion model.
Further, for a gesture motion sequence $x = \{f_1, f_2, \dots, f_N\}$ with $N$ frames, where $f_i = \{p_{1,1}, p_{1,2}, p_{1,3}, \dots, p_{J,1}, p_{J,2}, p_{J,3}\}$ denotes the three-dimensional skeletal joint rotation angles of the gesture motion in the $i$-th frame and $J$ is the total number of skeletal joints: given the original gesture motion sequence $x_0 \in \mathbb{R}^{N \times 3J}$, Gaussian noise is gradually added to $x_0$; after $T$ additions pure Gaussian noise $x_T$ is obtained. This process, also called the diffusion process, is defined as follows:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

$x_1, x_2, \dots, x_T$ have the same dimensions as $x_0$; $T$ is set to 500, and the variance $\beta_t$ increases linearly from $\beta_1 = 0.0001$ to $\beta_T = 0.02$. The noise-adding step from step $t-1$ to step $t$ can be expressed as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\big)$$

Owing to the properties of the Gaussian distribution, $x_t$, the result of adding noise $t$ times, can also be sampled directly from $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
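For illustration only, the forward process just defined can be sketched in PyTorch as follows; it is a minimal sketch with our own variable names, using the linear schedule and T = 500 stated above.

```python
# Forward (diffusion) process: linear beta schedule and closed-form
# sampling of x_t from x_0 via x_t = sqrt(abar_t) x_0 + sqrt(1-abar_t) eps.
import torch

T = 500
betas = torch.linspace(1e-4, 0.02, T)        # beta_1 ... beta_T
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) for a batch of integer steps t."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast shape
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

x0 = torch.randn(8, 128, 141)                # a batch of gesture windows
t = torch.randint(0, T, (8,))                # random noise step per sample
xt = q_sample(x0, t)                         # noisy training samples
```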
S3, constructing and training a diffusion model for generating voice-driven gesture actions, wherein the diffusion model regards a gesture action generating task as a denoising process for a noisy gesture action sequence.
Further, the diffusion model is used for learning the reversal of the diffusion process to generate new samples, i.e., the gesture motion generation task is regarded as a denoising process applied to the noisy gesture motion sequence; the diffusion model comprises a speech feature extraction network and a noise prediction network, specifically as follows:
S31, extracting the speech features of the samples with the speech feature extraction network, specifically:
S311, the speech feature extraction network first uses a pre-trained deep model to extract a 29-dimensional feature for each 33-millisecond speech segment, and uses the features of 16 adjacent speech segments to represent the current state of the speech modality for the frame; that is, for a single frame of gesture motion, the corresponding raw speech feature is $A \in \mathbb{R}^{16 \times 29}$. The raw speech features are then input into a time filtering module to compute speech smoothing features, i.e., adjacent speech features are fused, giving the speech smoothing feature $a \in \mathbb{R}^{64 \times 1}$.
S312, to realize conditional generation more accurately, a random mask is generated for the speech smoothing features when training the noise prediction network, and the conditionally trained diffusion model is combined with the unconditionally trained one to realize classifier-free guided conditional diffusion model learning.
S32, predicting gesture actions based on the extracted features and a prediction network;
further, referring to fig. 2, the noise prediction network in this embodiment includes a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, where the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected, specifically as follows:
the first one-dimensional convolution block (upper one-dimensional convolution block) firstly converts the input gesture motion sequence with noise into a representation with higher dimension, the used convolution kernel is 7, the step length is 1, the number of input channels is three times of the total number of skeleton joints, namely 141, the number of output channels is 256, then one-dimensional convolution operation is carried out after flattening the input gesture motion sequence with noise, the used convolution kernel is 7, the step length is 1, the number of input channels is 1, and the number of output channels is 32.
Three downsampling modules perform three stages of downsampling on the data; each downsampling module has four layers of one-dimensional convolution: the first layer has kernel size 1 and stride 1 with 32 input and output channels, and the other three layers have kernel size 3, stride 1, 32 input and output channels, and dilation coefficients of 1, 2 and 4 respectively.
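For illustration only, one such downsampling module can be sketched in PyTorch as follows; the kernel sizes, strides, channel counts and dilations follow the description above, while the padding values are our assumption, chosen to keep the temporal length unchanged.

```python
import torch
import torch.nn as nn

class DownsamplingModule(nn.Module):
    """Four 1-D convolutions: kernel 1, then kernel 3 with dilations
    1, 2, 4; 32 input and output channels throughout, stride 1."""
    def __init__(self, channels=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1, stride=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=1,
                      dilation=1, padding=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=1,
                      dilation=2, padding=2),
            nn.Conv1d(channels, channels, kernel_size=3, stride=1,
                      dilation=4, padding=4),
        )

    def forward(self, x):  # x: (batch, 32, time)
        return self.convs(x)

# Three such modules are stacked in sequence, as described above.
```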
The three conditional upsampling modules mainly comprise a convolution kernel predictor and a time-aware position-variable convolution block; the convolution kernel predictor outputs a predicted convolution kernel conditioned on the noise step number and the speech smoothing feature, and the predicted kernel participates in the computation of the time-aware position-variable convolution block.
The second one-dimensional convolution block (the lower one-dimensional convolution block) first produces a dimension-reduced representation of the data, with kernel size 7, stride 1, 32 input channels, and 1 output channel; after reshaping its dimensions, a further one-dimensional convolution is applied, with kernel size 7, stride 1, 256 input channels, and 141 output channels.
Still further, referring to fig. 3, a conditional upsampling module in a noise prediction network according to an embodiment of the present invention is shown, specifically:
the noise step number t is encoded by using position encoding, and then added with the voice smoothing characteristic a after passing through a full connection layer, and the result is taken as the input of a convolution kernel predictor, and the predicted convolution kernel participates in the calculation of time perception position variable convolution.
Further, referring to fig. 4, the extraction of the speech smoothing feature in the embodiment of the present invention is shown, specifically:
feature encoding of speech input using deep model to obtain speech raw features A E R 16×29 Then, the speech is input into a time filtering module shown in FIG. 3 to calculate speech smoothing characteristics, i.e. fusion between adjacent speech characteristics is carried out, and the dimension of the input speech characteristics is gradually reduced to a E R 64×1 . The time filtering module comprises a convolution block and a linear layer block, wherein the convolution block is provided with four convolution layers, the used convolution kernel is 3 in size, the step length is 2, and the linear layer block comprises two full-connection layers.
S33, training the diffusion model;
s331, the noise prediction network needs to predict target data distribution from noise distribution under the condition of specific voice characteristics, and learns inversion of diffusion process to generate new samples, namely learns the following probability distribution:
$$p_\theta(x_{t-1} \mid x_t, a)$$

where $\theta$ denotes the network parameters, $t$ is the noise step number, and $a$ is the speech smoothing feature. The noise prediction network mainly comprises three downsampling modules and three conditional upsampling modules, wherein each conditional upsampling module mainly comprises a convolution kernel predictor and a time-aware position-variable convolution block. The convolution kernel predictor outputs a predicted convolution kernel conditioned on the noise step number and the speech smoothing feature, and the predicted kernel participates in the computation in the time-aware position-variable convolution block.
S332, training a diffusion model, wherein the target function is as follows:
$$L(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t, a) \rVert^2\,\right]$$

wherein $\theta$ denotes the network parameters, $t$ is the noise step number, $\epsilon$ is the Gaussian noise added at noise step $t$ in the diffusion process with $\epsilon \sim \mathcal{N}(0, I)$, $x_0$ is the original gesture motion sequence, $a$ is the speech smoothing feature, and $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$.
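For illustration only, one training step implementing this objective, including the random masking of the speech condition described in S312, can be sketched as follows. It reuses T and q_sample from the forward-process sketch above; the masking probability and the model signature model(x_t, t, a) are assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, a, p_uncond=0.1):
    """One optimization step of the noise prediction objective."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # random noise steps
    eps = torch.randn_like(x0)                        # true added noise
    xt = q_sample(x0, t, eps)                         # forward diffusion
    # Randomly zero out the speech condition so the same network also
    # learns the unconditional distribution (classifier-free guidance).
    mask = (torch.rand(b, device=x0.device) < p_uncond).float().view(b, 1)
    eps_pred = model(xt, t, a * (1.0 - mask))         # predicted noise
    return F.mse_loss(eps_pred, eps)                  # mean square error loss
```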
S4, using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
Further, referring to FIG. 5, the process of generating a gesture motion sequence is essentially a back-diffusion process: a noise sample obeying the Gaussian distribution, $x_T \sim \mathcal{N}(0, I)$, is randomly sampled, and the trained diffusion model is used to denoise $x_T$ for $T$ steps to obtain $x_0$. The denoising step from step $t$ to step $t-1$ can be expressed as:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_\theta(x_t, t, a)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

where $\gamma$ is a weight coefficient and $\hat{\epsilon}_\theta(x_t, t, a) = (1+\gamma)\,\epsilon_\theta(x_t, t, a) - \gamma\,\epsilon_\theta(x_t, t, \varnothing)$ is the classifier-free guided noise prediction.
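For illustration only, the back-diffusion sampling loop can be sketched as follows. It reuses T, betas, alphas and alpha_bars from the forward-process sketch; the guidance combination and the variance choice $\sigma_t^2 = \beta_t$ are standard DDPM and classifier-free-guidance assumptions, not necessarily the exact update of the embodiment.

```python
import torch

@torch.no_grad()
def generate(model, a, shape, gamma=1.0):
    """Iteratively denoise x_T ~ N(0, I) into a gesture sequence x_0."""
    x = torch.randn(shape)                           # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        eps_c = model(x, tt, a)                      # conditional prediction
        eps_u = model(x, tt, torch.zeros_like(a))    # unconditional prediction
        eps = (1.0 + gamma) * eps_c - gamma * eps_u  # guided noise estimate
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()      # posterior mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                          # generated x_0
```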
according to the voice driving gesture motion generating method based on the diffusion model, a diffusion model for generating voice driving gesture motion is constructed, and a gesture motion generating task is regarded as a denoising process of a noisy gesture motion sequence; training a diffusion model by taking a noisy gesture action sequence sample as an input of the diffusion model and taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between real noise in the diffusion process and noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, the voice features are used as one of inputs of the noise prediction network to realize voice-driven gesture motion generation, the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice features, namely learning inversion of a diffusion process to generate new samples, and the noise prediction network comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three condition upsampling modules, and the first one-dimensional convolution block, the three downsampling modules, the three condition upsampling modules and the second one-dimensional convolution block are sequentially connected. According to the invention, the diffusion model is utilized to model the gesture action sequence distribution based on voice driving, so that the gesture action sequence with more authenticity and diversity can be generated.
Referring to fig. 6, an example of gesture motion results generated for a voice input by using the trained diffusion model to denoise randomly sampled Gaussian noise in an embodiment of the present invention is shown; the six gesture motions illustrated are only six frames of a complete gesture motion sequence.
The objective evaluation results of the diffusion-model-based voice-driven gesture motion generation method of the present invention are shown in Table 1. The present invention is evaluated on the Body-Expression-Audio-Text (BEAT) data set using FGD (Frechet Gesture Distance), SRGR (Semantic-Relevant Gesture Recall) and BeatAlign (Beat Alignment Score) as evaluation indexes. FGD evaluates the distribution distance between the generated and the real gesture motion sequences, where lower is better; SRGR evaluates the semantic relevance of gesture motions, where higher is better; BeatAlign evaluates the similarity between the generated gesture motion sequence and the speech beat, where higher is better.
TABLE 1
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present invention.
Based on the same ideas of the diffusion model-based speech-driven gesture motion generation method in the above embodiment, the present invention also provides a diffusion model-based speech-driven gesture motion generation system that can be used to perform the above diffusion model-based speech-driven gesture motion generation method. For ease of illustration, only those portions of the structure diagram of the embodiment of the diffusion model-based speech-driven gesture motion generation system that are relevant to embodiments of the present invention are shown, and those skilled in the art will appreciate that the illustrated structure is not limiting of the apparatus and may include more or fewer components than illustrated, or may combine certain components, or a different arrangement of components.
Referring to fig. 7, in another embodiment of the present application, a diffusion model-based speech driven gesture motion generation system 100 is provided, which includes a data preprocessing module 101, a diffusion module 102, a diffusion model module 103, and a back diffusion module 104;
the data preprocessing module 101 is configured to obtain a human body gesture motion data set with voice label, perform preprocessing on the human body gesture motion data set to obtain training data of gesture motion sequence segments with voice information label,
the diffusion module 102 is configured to add noise to the gesture motion sequences in the training data to obtain noisy gesture motion sequence samples, which are used for training a diffusion model;
the diffusion model module 103 is configured to construct a diffusion model for speech-driven gesture motion generation, where the diffusion model regards a gesture motion generation task as a denoising process for a noisy gesture motion sequence; taking the noisy gesture motion sequence sample as the input of a diffusion model, training the diffusion model by taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
the inverse diffusion module 104 is configured to use a trained diffusion model to perform iterative denoising from random sampling gaussian noise according to a given speech input with any length, so as to generate a gesture motion sequence.
It should be noted that the diffusion-model-based voice-driven gesture motion generation system and the diffusion-model-based voice-driven gesture motion generation method of the present invention correspond one-to-one; the technical features and beneficial effects described in the embodiment of the method also apply to the embodiment of the system, and the specific content can be found in the description of the method embodiment, which is not repeated here.
In addition, in the implementation of the diffusion model-based speech driven gesture motion generating system of the above embodiment, the logic division of each program module is merely illustrative, and in practical application, the above-mentioned function allocation may be performed by different program modules according to needs, for example, in view of configuration requirements of corresponding hardware or convenience of implementation of software, that is, the internal structure of the diffusion model-based speech driven gesture motion generating system is divided into different program modules, so as to perform all or part of the functions described above.
Referring to fig. 8, in one embodiment, an electronic device implementing the diffusion-model-based voice-driven gesture motion generation method is provided. The electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may further include a computer program stored in the first memory 202 and executable on the first processor 201, such as a diffusion-model-based voice-driven gesture motion generation program 203.
The first memory 202 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a mobile hard disk of the electronic device 200. The first memory 202 may also, in other embodiments, be an external storage device of the electronic device 200, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit and an external storage device of the electronic device 200. The first memory 202 may be used not only to store application software installed in the electronic device 200 and various types of data, such as the code of the diffusion-model-based voice-driven gesture motion generation program 203, but also to temporarily store data that has been output or is to be output.
The first processor 201 may in some embodiments be formed by an integrated circuit, for example a single packaged integrated circuit, or by a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and so on. The first processor 201 is the control unit of the electronic device; it connects the various components of the entire electronic device using various interfaces and lines, and executes the various functions of the electronic device 200 and processes data by running or executing programs or modules stored in the first memory 202 and calling data stored in the first memory 202.
Fig. 8 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 8 is not limiting of the electronic device 200 and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
The diffusion model-based speech driven gesture motion generation program 203 stored in the first memory 202 in the electronic device 200 is a combination of instructions that, when executed in the first processor 201, may implement:
acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the gesture action sequence in the training data is subjected to noise adding, so that a noise added gesture action sequence sample is obtained and is used for training a diffusion model, and the process is also called a diffusion process;
constructing a diffusion model for voice-driven gesture motion generation, wherein the diffusion model regards a gesture motion generation task as a denoising process for a noisy gesture motion sequence; taking the noisy gesture motion sequence sample as the input of a diffusion model, training the diffusion model by taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
and (3) using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
Further, the modules/units integrated with the electronic device 200 may be stored in a non-volatile computer readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.

Claims (10)

1. The voice-driven gesture motion generation method based on the diffusion model is characterized by comprising the following steps of:
acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the gesture action sequence in the training data is subjected to noise adding, so that a noise added gesture action sequence sample is obtained, and the process is also called a diffusion process;
constructing a diffusion model for voice-driven gesture motion generation, wherein the diffusion model regards a gesture motion generation task as a denoising process for a noisy gesture motion sequence; taking the noisy gesture motion sequence sample as the input of a diffusion model, training the diffusion model by taking a mean square error as a loss function until convergence, wherein the loss function is used for measuring the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture action generation; the noise prediction network is used for predicting target data distribution from noise distribution under the condition of specific voice characteristics, namely learning inversion of a diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
and (3) using a trained diffusion model, and performing iterative denoising from random sampling Gaussian noise according to given voice input with any length to generate a gesture action sequence.
2. The method for generating a voice-driven gesture motion based on a diffusion model according to claim 1, wherein the preprocessing of the human gesture motion data set comprises the steps of:
resampling the human body gesture action data set at a set speed, extracting upper body gesture action data, sampling each segment in the training data at a set stride and length, and obtaining a plurality of gesture action sequence segments with voice information marks as the training data.
3. The method for generating a voice-driven gesture motion based on a diffusion model according to claim 1, wherein the step of adding noise to the gesture motion sequence in the training data to obtain a noise-added gesture motion sequence sample is specifically:
for a gesture motion sequence $x = \{f_1, f_2, \dots, f_N\}$ with $N$ frames, where $f_i = \{p_{1,1}, p_{1,2}, p_{1,3}, \dots, p_{J,1}, p_{J,2}, p_{J,3}\}$ denotes the three-dimensional skeletal joint rotation angles of the gesture motion in the $i$-th frame and $J$ is the total number of skeletal joints: given the original gesture motion sequence $x_0 \in \mathbb{R}^{N \times 3J}$, Gaussian noise is gradually added to $x_0$, and after $T$ additions pure Gaussian noise $x_T$ is obtained, yielding the noisy gesture motion sequence samples; this process is also known as the diffusion process.
4. The method for generating a speech driven gesture motion based on a diffusion model according to claim 1, wherein the speech feature extraction network specifically comprises:
first, a pre-trained deep model is used to extract a multidimensional feature for each speech segment, and the features of adjacent speech segments are used to represent the current state of the speech modality for the frame, i.e., the raw speech feature corresponding to a single frame of gesture motion;
then the raw speech features are input into a time filtering module to compute speech smoothing features, i.e., adjacent speech features are fused to obtain the speech smoothing features;
when the noise prediction network is trained, a random mask is generated for the speech smoothing features, and the conditionally trained diffusion model is combined with the unconditionally trained one to realize classifier-free guided conditional diffusion model learning.
5. The method of claim 1, wherein the noise prediction network learns the following probability distribution under the condition of specific speech characteristics:
$$p_\theta(x_{t-1} \mid x_t, a)$$

wherein $\theta$ denotes the network parameters, $t$ is the noise step number, $a$ is the speech smoothing feature, and $x_t$ is the noisy sample obtained by adding noise to the original gesture motion sequence $x_0$ for $t$ times.
6. The method for generating a speech driven gesture motion based on a diffusion model according to claim 1, wherein the formula of the loss function is as follows:
$$L(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t, a) \rVert^2\,\right]$$

wherein $\theta$ denotes the network parameters, $t$ is the noise step number, $\epsilon$ is the Gaussian noise added at noise step $t$ in the diffusion process with $\epsilon \sim \mathcal{N}(0, I)$, $a$ is the speech smoothing feature, and $x_t$ is the noisy sample obtained by adding noise to the original gesture motion sequence $x_0$ for $t$ times.
7. The method for generating a speech driven gesture motion based on a diffusion model according to claim 1, wherein the noise prediction network is specifically as follows:
the first one-dimensional convolution block converts the input noisy gesture motion sequence into a high-dimensional representation, which is then flattened and subjected to a further one-dimensional convolution operation;
the three downsampling modules perform three stages of downsampling on the data, and each downsampling module has four layers of one-dimensional convolution;
the three conditional upsampling modules each comprise a convolution kernel predictor and a time-aware position-variable convolution block, wherein the convolution kernel predictor outputs a predicted convolution kernel conditioned on the noise step number and the speech smoothing feature, and the predicted kernel participates in the computation in the time-aware position-variable convolution block;
the second one-dimensional convolution block first produces a dimension-reduced representation of the data, reshapes the dimensions of the dimension-reduced data, and then performs a one-dimensional convolution operation.
8. A diffusion-model-based voice-driven gesture motion generation system, characterized in that it applies the diffusion-model-based voice-driven gesture motion generation method of any one of claims 1 to 7, and comprises a data preprocessing module, a diffusion module, a diffusion model module, and a back-diffusion module;
the data preprocessing module is used for acquiring a human body gesture action data set with voice labels, preprocessing the human body gesture action data set to obtain training data of gesture action sequence fragments with voice information labels,
the diffusion module is used for adding noise to the gesture action sequence in the training data to obtain a noise added gesture action sequence sample for training a diffusion model, and the process is also called a diffusion process;
the diffusion model module is used for constructing a diffusion model for voice-driven gesture motion generation, wherein the diffusion model regards the gesture motion generation task as a denoising process applied to a noisy gesture motion sequence; the noisy gesture motion sequence samples are taken as the input of the diffusion model, and the diffusion model is trained with a mean square error loss function until convergence, the loss function being used to measure the difference between the real noise in the diffusion process and the noise predicted by the diffusion model; the diffusion model comprises a voice feature extraction network and a noise prediction network, wherein the voice feature extraction network is used for extracting semantic feature information of input voice to obtain voice features, and the voice features are used as one of the inputs of the noise prediction network so as to realize voice-driven gesture motion generation; the noise prediction network is used for predicting the target data distribution from the noise distribution under the condition of specific voice features, namely learning the inversion of the diffusion process to generate new samples, and comprises a first one-dimensional convolution block, a second one-dimensional convolution block, three downsampling modules and three conditional upsampling modules, wherein the first one-dimensional convolution block, the three downsampling modules, the three conditional upsampling modules and the second one-dimensional convolution block are sequentially connected;
the back diffusion module is used for performing iterative denoising from random sampling Gaussian noise according to given voice input with any length by using a trained diffusion model, and generating a gesture action sequence.
9. An electronic device, the electronic device comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the diffusion model-based speech driven gesture motion generation method of any one of claims 1-7.
10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the diffusion model-based speech-driven gesture motion generation method of any one of claims 1 to 7.
CN202311007818.1A 2023-08-11 2023-08-11 Voice-driven gesture action generation method and device based on diffusion model Pending CN117292704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311007818.1A CN117292704A (en) 2023-08-11 2023-08-11 Voice-driven gesture action generation method and device based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311007818.1A CN117292704A (en) 2023-08-11 2023-08-11 Voice-driven gesture action generation method and device based on diffusion model

Publications (1)

Publication Number Publication Date
CN117292704A true CN117292704A (en) 2023-12-26

Family

ID=89252525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311007818.1A Pending CN117292704A (en) 2023-08-11 2023-08-11 Voice-driven gesture action generation method and device based on diffusion model

Country Status (1)

Country Link
CN (1) CN117292704A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117577121A (en) * 2024-01-17 2024-02-20 清华大学 Diffusion model-based audio encoding and decoding method and device, storage medium and equipment
CN117577121B (en) * 2024-01-17 2024-04-05 清华大学 Diffusion model-based audio encoding and decoding method and device, storage medium and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination