CN115984933A - Training method of facial animation model, and voice data processing method and device

Training method of facial animation model, and voice data processing method and device

Info

Publication number
CN115984933A
Authority
CN
China
Prior art keywords
voice data
face
voice
animation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211703181.5A
Other languages
Chinese (zh)
Inventor
胡俊佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Geely Holding Group Co Ltd
Zhejiang Zeekr Intelligent Technology Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Zhejiang Zeekr Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Geely Holding Group Co Ltd and Zhejiang Zeekr Intelligent Technology Co Ltd
Priority to CN202211703181.5A
Publication of CN115984933A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application provides a training method for a facial animation model, and a voice data processing method and device. The method comprises: performing framing processing on collected voice to obtain the voice data of each frame; obtaining an expression database and a training database; and performing time-series generative adversarial network training on a pre-constructed facial animation model based on the voice data of each frame, the expression database, and the training database, to obtain a trained facial animation model, where the facial animation model is used to generate a facial animation sequence corresponding to input voice data. The generated facial animation therefore carries expressions rich in emotional representation, overcoming the technical bottleneck that current voice-driven facial animation models lack expression animation.

Description

Training method of facial animation model, and voice data processing method and device
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a training method for a facial animation model, and a voice data processing method and device.
Background
Voice-driven facial animation is a technology that constructs a model based on algorithms such as deep learning and generates facial expression coefficients by taking the voice input by a user, or text generated from that voice, as the model input (i.e., the driving source), thereby driving the facial animation. With continuous technological progress, voice-driven facial animation is being applied more and more widely in the film and education industries.
In the prior art, driving facial animation falls into two approaches: an end-to-end method that predicts the mesh sequence directly from the voice, and a method that predicts expression coefficients or blendshape coefficients from the voice and then synthesizes the mesh sequence. Mouth-shape driving can also first generate text through speech recognition and then a phoneme sequence, and then use the phoneme sequence to generate mouth-shape coefficients or to generate the mesh sequence directly.
However, most research on these deep-learning-based methods focuses on the mouth-shape driving part and lacks emotional expression driving.
Disclosure of Invention
The application provides a training method for a facial animation model, and a voice data processing method and device, which are used to solve the problem that facial animation technology cannot realize expression driving rich in emotional representation.
In a first aspect, the present application provides a method for training a facial animation model, including:
performing framing processing on the collected voice to obtain the voice data of each frame;
obtaining an expression database and a training database, where the expression database comprises expression data and corresponding expression labels, and the training database comprises a calibrated real facial animation sequence corresponding to the voice;
and performing time-series generative adversarial network training on a pre-constructed facial animation model based on the voice data of each frame, the expression database, and the training database, to obtain a trained facial animation model, where the facial animation model is used to generate a facial animation sequence corresponding to input voice data.
With reference to the first aspect, in some embodiments, the facial animation model includes a speech coding module, a speech emotion recognition module, and a face generation module;
correspondingly, performing time-series generative adversarial network training on the pre-constructed facial animation model based on the voice data of each frame, the expression database, and the training database, to obtain the trained facial animation model, comprises the following steps:
step a, for the voice data of each frame, encoding the voice data through the speech coding module of the facial animation model to obtain encoded information corresponding to the voice data;
step b, for the voice data of each frame, performing emotion recognition processing on the voice data through the speech emotion recognition module of the facial animation model to obtain expression features corresponding to the voice data;
step c, for the voice data of each frame, inputting the expression features corresponding to the voice data, the pre-acquired encoded information of facial motion over a duration, and the encoded information corresponding to the voice data into the face generation module of the facial animation model to be trained, to obtain the facial animation sequence corresponding to the voice data;
and step d, performing time-series generative adversarial network training with a constructed facial animation loss function, based on the calibrated real facial animation sequence of the voice in the training database and the facial animation sequence corresponding to each frame of voice data output by the facial animation model, optimizing the parameters of the facial animation model, and repeating steps a to d until the facial animation loss function converges, to obtain the final facial animation model.
With reference to the first aspect, in some embodiments, encoding the voice data of each frame through the speech coding module of the facial animation model to obtain the encoded information corresponding to the voice data includes:
for the voice data of each frame, performing feature processing on the voice data with a temporal convolutional neural network through a feature extraction layer of the speech coding module to obtain feature information of the voice data;
performing linear interpolation processing on the feature information of the voice data through a linear interpolation layer of the speech coding module to obtain interpolated feature information;
and encoding the interpolated feature information through a coding layer of the speech coding module, and performing linear mapping processing through a linear mapping layer to obtain the encoded information of the voice data.
With reference to the first aspect, in some embodiments, performing emotion recognition processing on the voice data of each frame through the speech emotion recognition module of the facial animation model to obtain the expression features corresponding to the voice data includes:
for the voice data of each frame, performing DeepSpeech feature extraction through an emotion feature extraction layer of the speech emotion recognition module to obtain voice features corresponding to the voice data;
performing emotion classification on the voice features through a speech emotion recognition layer of the emotion recognition module to obtain an emotion probability distribution;
and querying the expression database based on the emotion probability distribution through an emotion expression query layer of the emotion recognition module to obtain the expression features corresponding to the voice data.
With reference to the first aspect, in some embodiments, for the voice data of each frame, inputting the expression features corresponding to the voice data, the pre-acquired encoded information of facial motion over a duration, and the encoded information corresponding to the voice data into the face generation module of the facial animation model to be trained, to obtain the facial animation sequence corresponding to the voice data, includes:
for the voice data of each frame, performing face generation on the pre-acquired encoded information of facial motion over a duration and the encoded information corresponding to the voice data through a decoding layer of the face generation module to obtain face information;
and migrating the expression features into the face information through an expression migration layer of the face generation module to obtain the facial animation sequence with emotion features.
With reference to the first aspect, in some embodiments, performing time-series generative adversarial network training based on the calibrated real facial animation sequence of the voice in the training database, the constructed facial animation loss function, and the facial animation sequence corresponding to the voice output by the facial animation model, and optimizing the parameters of the facial animation model, includes:
constructing a facial animation loss function in the discriminator:

$$\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda\,\mathcal{L}_{\mathrm{rec}}$$

where $\lambda$ controls the contribution of the reconstruction loss to the total loss, $\mathcal{L}_{\mathrm{adv}}$ denotes the classification loss of the discriminator, and $\mathcal{L}_{\mathrm{rec}}$ denotes the face reconstruction loss term;
for the voice data of each frame, inputting the facial animation sequence corresponding to the voice data output by the facial animation model and the calibrated real facial animation sequence of the voice in the training database into the discriminator for classification, to obtain the value of the facial animation loss function;
and optimizing the parameters of the facial animation model according to the value of the facial animation loss function.
In a second aspect, the present application provides a method for processing voice data, including:
performing framing processing on the voice to be processed to obtain the voice data of each frame;
and, for the voice data of each frame, inputting the voice data into a facial animation model for processing to obtain the facial animation sequence corresponding to the voice data,
where the facial animation model is pre-trained and used to generate the facial animation sequence corresponding to the input voice data.
In a third aspect, the present application provides a training apparatus for a facial animation model, comprising:
the acquisition processing module is used for performing framing processing on the collected voice to obtain the voice data of each frame;
the information acquisition module is used for obtaining an expression database and a training database, wherein the expression database comprises expression data and corresponding expression labels, and the training database comprises a calibrated real facial animation sequence corresponding to the voice;
and the model training module is used for performing time-series generative adversarial network training on a pre-constructed facial animation model based on the voice data of each frame, the expression database, and the training database, to obtain a trained facial animation model, wherein the facial animation model is used for generating a facial animation sequence corresponding to the input voice data.
With reference to the third aspect, in some embodiments, the facial animation model includes a speech coding module, a speech emotion recognition module, and a face generation module;
correspondingly, the model training module performs time-series generative adversarial network training on the pre-constructed facial animation model based on the voice data of each frame, the expression database, and the training database to obtain the trained facial animation model, wherein:
the speech coding module is used for encoding the voice data of each frame to obtain encoded information corresponding to the voice data;
the speech emotion recognition module is used for performing emotion recognition processing on the voice data of each frame to obtain expression features corresponding to the voice data;
the face generation module is used for, for the voice data of each frame, obtaining the facial animation sequence corresponding to the voice data from the expression features corresponding to the voice data, the pre-acquired encoded information of facial motion over a duration, and the encoded information corresponding to the voice data;
and the model training module is used for performing time-series generative adversarial network training with a constructed facial animation loss function, based on the calibrated real facial animation sequence of the voice in the training database and the facial animation sequence corresponding to the voice output by the facial animation model, optimizing the parameters of the facial animation model, and repeating model training until the facial animation loss function converges to obtain the final facial animation model.
With reference to the third aspect, in some embodiments, the speech coding module includes:
the feature extraction unit is used for performing feature processing on the voice data of each frame with a temporal convolutional neural network to obtain feature information of the voice data;
the linear interpolation unit is used for performing linear interpolation processing on the feature information of the voice data to obtain interpolated feature information;
and the encoding unit is used for encoding the interpolated feature information and performing linear mapping processing through a linear mapping layer to obtain the encoded information of the voice data.
With reference to the third aspect, in some embodiments, the speech emotion recognition module includes:
the emotion feature extraction unit is used for performing DeepSpeech feature extraction on the voice data of each frame to obtain the voice features corresponding to the voice data;
the speech emotion recognition unit is used for performing emotion classification on the voice features to obtain an emotion probability distribution;
and the emotion expression query unit is used for querying the expression database based on the emotion probability distribution to obtain the expression features corresponding to the voice data.
With reference to the third aspect, in some embodiments, the face generation module includes:
the decoding unit is used for, for the voice data of each frame, performing face generation on the pre-acquired encoded information of facial motion over a duration and the encoded information corresponding to the voice data to obtain face information;
and the expression migration unit is used for migrating the expression features into the face information to obtain the facial animation sequence with emotion features.
With reference to the third aspect, in some embodiments, the model training module includes:
the function construction unit is used for constructing a facial animation loss function in the discriminator:

$$\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda\,\mathcal{L}_{\mathrm{rec}}$$

where $\lambda$ controls the contribution of the reconstruction loss to the total loss, $\mathcal{L}_{\mathrm{adv}}$ denotes the classification loss of the discriminator, and $\mathcal{L}_{\mathrm{rec}}$ denotes the face reconstruction loss term;
the model training unit is used for, for the voice data of each frame, inputting the facial animation sequence corresponding to the voice data output by the facial animation model and the calibrated real facial animation sequence of the voice data in the training database into the discriminator for classification, to obtain the value of the facial animation loss function;
and the model optimization unit is used for optimizing the parameters of the facial animation model according to the value of the facial animation loss function.
In a fourth aspect, the present application provides a processing apparatus for voice data, including:
the voice framing module is used for framing the voice to be processed to obtain voice data of each frame;
and the model inference module is used for, for the voice data of each frame, inputting the voice data into the facial animation model for processing to obtain the facial animation sequence corresponding to the voice data,
where the facial animation model is pre-trained and used to generate the facial animation sequence corresponding to the input voice data.
In a fifth aspect, the present application further provides an electronic device, including: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the methods of the first and second aspects.
In a sixth aspect, the present application further provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the method of the first and second aspects when executed by a processor.
The application provides a training method for a facial animation model, and a voice data processing method and device. In the voice-driven facial animation technology, this scheme introduces an emotion recognition module and an expression migration module based on the voice data, so that the voice-driven facial expression animation is rich in emotional representation, and trains the facial animation model with a time-series generative adversarial network. Voice-driven facial animation is thus no longer limited to mouth-shape driving; expression driving and mouth-shape driving are fused, so that the facial animation is rich in emotional representation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is an application scenario diagram of a training method for a human face animation model according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a first embodiment of a training method for a human face animation model according to the embodiment of the present application;
fig. 3 is a schematic flow chart of a second embodiment of a training method for a human face animation model according to the present application;
fig. 4 is a schematic flow chart of a third embodiment of a training method for a human face animation model according to the present application;
fig. 5 is a schematic flowchart of a fourth embodiment of a training method for a human face animation model according to the present application;
fig. 6 is a schematic flow chart of a fifth embodiment of a training method for a human face animation model according to the embodiment of the present application;
fig. 7 is a schematic flowchart of a sixth embodiment of a training method for a human face animation model according to an embodiment of the present application;
fig. 8 is a schematic flowchart of an embodiment of a method for processing voice data according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an inference architecture of a face animation model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an internal structure of an encoding layer according to an embodiment of the present application;
fig. 11 is a schematic diagram of an internal structure of a decoding layer according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a network architecture for training a face animation model according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a network structure of a discriminator;
fig. 14 is a schematic structural diagram of a first training device for a human face animation model according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a second training apparatus for a human face animation model according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a third embodiment of a training apparatus for a human face animation model according to the present application;
fig. 17 is a schematic structural diagram of a fourth embodiment of a training apparatus for a human face animation model according to the embodiment of the present application;
fig. 18 is a schematic structural diagram of a fifth example of the training apparatus for a human face animation model according to the embodiment of the present application;
fig. 19 is a schematic structural diagram of a sixth embodiment of a training apparatus for a human face animation model according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of an embodiment of a speech data processing apparatus according to an embodiment of the present application;
fig. 21 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
With the development of three-dimensional digital virtual humans, voice-driven facial animation has become an important research hotspot of virtual human interaction. Voice-driven facial animation generally predicts the mesh sequence directly from the voice through an end-to-end method, or predicts expression coefficients or blendshape coefficients from the voice and then synthesizes the mesh sequence. Mouth-shape driving can also first generate text through speech recognition and then a phoneme sequence, and then use the phoneme sequence to generate mouth-shape coefficients or to generate the mesh sequence directly. However, existing research focuses on the mouth-shape driving part and rarely touches on expression driving, especially expression driving rich in emotional characteristics.
To solve these problems, the application provides a training method for a facial animation model that realizes expression-driven facial animation rich in emotional representation. Specifically, when driving facial animation with voice, mouth-shape driving is usually adopted. The inventor found in the research process that the MeshTalk network structure takes not only voice but also expression features as input. Similarly, the FaceFormer network structure encodes the voice features through a temporal convolutional network and a Transformer encoding layer, while a neutral face is used as a style embedding together with the facial motion over a duration; after encoding, these are fused with the voice features to predict the facial motion and drive the facial animation. However, these two network structures focus only on mouth shape and eye movement, and their ability to drive expressions, especially expression animation rich in emotion, is still insufficient. In view of these problems, the inventor investigated whether an emotion network could be added to the network structure, realizing facial animation rich in emotional representation through the fusion of expression driving and mouth-shape driving.
Fig. 1 is an application scenario diagram of a training method for a facial animation model according to an embodiment of the present application. As shown in Fig. 1, the training method of the facial animation model provided in the embodiment of the present application is applied to a practical scenario of simulating facial animation, which involves at least the voice data, the facial animation model, and the output facial animation. The voice data can be any data a user wants to simulate, and the output facial animation is the animation corresponding to the voice data input by the user. The voice data is not specifically limited in this scheme.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a schematic flowchart of a first embodiment of a training method for a facial animation model provided in an embodiment of the present application. As shown in Fig. 2, the method specifically includes the following steps:
s101: and performing frame processing on the collected voice to obtain voice data of each frame.
In this step, when inference is performed with the pre-constructed facial animation model, the voice data of each frame of the pre-collected voice needs to be input. Therefore, after the voice is collected in advance, it must be framed, that is, segmented according to a specified length, such as a time period or a number of samples, and structured into the data structure used by the program. The collected voice comprises voice data in wav format and the corresponding voice-emotion label data pairs.
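As an illustration only, the following is a minimal framing sketch; the frame length, hop size, and the soundfile reading utility are assumptions chosen for the example and are not prescribed by this application.

```python
import numpy as np
import soundfile as sf  # assumed wav-reading utility, not mandated by this application


def frame_speech(wav_path, frame_len_s=0.04, hop_len_s=0.04):
    """Split a wav file into fixed-length frames of raw samples.

    frame_len_s / hop_len_s are illustrative values (e.g. 40 ms per
    animation frame at 25 fps); the application does not fix them.
    """
    samples, sr = sf.read(wav_path)        # waveform and sample rate
    if samples.ndim > 1:                   # mix down multi-channel audio to mono
        samples = samples.mean(axis=1)
    frame_len = int(frame_len_s * sr)
    hop_len = int(hop_len_s * sr)
    frames = [samples[start:start + frame_len]
              for start in range(0, len(samples) - frame_len + 1, hop_len)]
    return np.stack(frames), sr            # shape: (num_frames, frame_len)
```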
S102: and acquiring an expression database and a training database.
In this step, in the training method of the facial animation model, training data is required to train the model, so that an expression database and a training database need to be obtained in advance. Specifically, different voices are collected and integrated to form an expression database, wherein the expression database comprises expression data and corresponding expression labels, and the training database comprises a calibrated real face animation sequence corresponding to the voices.
S103: and performing time sequence generation countermeasure network training on the pre-constructed facial animation model based on the voice data of each frame, the expression database and the training database to obtain the trained facial animation model.
In the step, a pre-constructed human face animation model is trained according to the voice data of each frame, the expression database and the training database obtained in the step, and a time sequence generation countermeasure network is adopted for training because the sequence of the human face animation output by the human face animation model is time sequence data.
In a specific embodiment, the time sequence generation countermeasure network includes a generator and a discriminator, where the generator is a pre-constructed human face animation model, the discriminator is used to judge whether a sequence of human face animation corresponding to the voice data output by the generator is true, and the purpose of the time sequence generation countermeasure network training is to make the discriminator unable to judge whether a sequence of human face animation corresponding to the voice data output by the generator is false.
Specifically, the facial animation sequence corresponding to the voice data output by the facial animation model, the calibrated real facial animation sequence of the voice in the training database, and the user information are input into the discriminator, and the loss function of the discriminator may be expressed as:

$$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{z}\big[\log D_{Seq}(z, s)\big] + \mathbb{E}_{x}\big[\log\big(1 - D_{Seq}(x, s)\big)\big] \tag{1}$$

where $D_{Seq}$ performs classification based on the whole voice and facial animation sequence, $x$ is the facial animation sequence segment output by the generator, $s$ is the voice data sequence segment, and $z$ is the real facial animation sequence.
To improve the effect of mouth-shape synthesis, a pixel-level face reconstruction loss term is added on the basis of Equation (1). Since the mouth shape is only related to the lower half of the face, only the sum of the per-pixel losses over the reconstructed lower half of the face needs to be computed. Assuming the face reconstructed at time $t$ is an image of $W \times H$ pixels, the face reconstruction loss at time $t$ is expressed as:

$$\mathcal{L}_{\mathrm{rec}}^{(t)} = \sum_{p \in \Omega_{\mathrm{lower}}} \big\| F_p - G_p \big\|^2 \tag{2}$$

where $\Omega_{\mathrm{lower}}$ is the set of pixels in the lower half of the $W \times H$ face image, $F_p$ is a real face pixel, and $G_p$ is the corresponding face pixel generated by the generator.
The final facial animation loss function is obtained as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda\,\mathcal{L}_{\mathrm{rec}} \tag{3}$$

where $\lambda$ controls the contribution of the reconstruction loss to the total loss, $\mathcal{L}_{\mathrm{adv}}$ denotes the classification loss of the discriminator, and $\mathcal{L}_{\mathrm{rec}}$ denotes the face reconstruction loss term.
During training, the facial animation loss function is driven to convergence, that is, to its minimum value. If the facial animation loss function has not converged after a training pass, the parameters of the facial animation model are adjusted until the loss function converges and reaches its minimum, at which point training ends and the trained facial animation model is obtained.
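For illustration, a minimal sketch of one time-series adversarial training step is given below. It assumes PyTorch-style generator and discriminator modules whose interfaces (a generator mapping a voice sequence to an animation sequence, a discriminator scoring an animation sequence conditioned on the voice and returning sigmoid probabilities) are assumptions of the example rather than the application's fixed implementation; the weight λ = 10.0 is also purely illustrative.

```python
import torch
import torch.nn.functional as F


def adversarial_train_step(generator, discriminator, g_opt, d_opt,
                           voice_seq, real_anim_seq, lam=10.0):
    """One sketched update of the time-series GAN: the generator is the facial
    animation model, the discriminator classifies whole animation sequences
    conditioned on the voice sequence; lam weights the reconstruction term."""
    # --- discriminator update: real sequences -> 1, generated sequences -> 0 ---
    fake_anim = generator(voice_seq).detach()
    d_real = discriminator(real_anim_seq, voice_seq)
    d_fake = discriminator(fake_anim, voice_seq)
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update: adversarial term + lower-half-face reconstruction ---
    fake_anim = generator(voice_seq)                       # (batch, T, H, W) face images assumed
    d_out = discriminator(fake_anim, voice_seq)
    g_adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    lower = fake_anim.shape[-2] // 2                       # pixels of the lower half of the face
    g_rec = ((fake_anim[..., lower:, :] - real_anim_seq[..., lower:, :]) ** 2).sum(dim=(-1, -2)).mean()
    g_loss = g_adv + lam * g_rec
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

In such a sketch, steps of this kind would be alternated until the total loss stops decreasing, matching the convergence criterion described above.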
In the training method for the facial animation model provided by this embodiment, voice is collected and framed, the resulting voice data of each frame is input into a pre-constructed facial animation model, and the facial animation model is trained with the expression database, the training database, and a time-series generative adversarial network. The fusion of mouth-shape driving and expression driving is thus realized, so that the facial animation output by the model is rich in emotional characteristics.
Fig. 3 is a schematic flowchart of a second embodiment of a training method for a facial animation model provided in an embodiment of the present application. As shown in Fig. 3, on the basis of the first embodiment, the facial animation model includes a speech coding module, a speech emotion recognition module, and a face generation module. Correspondingly, step S103 of performing time-series generative adversarial network training on the pre-constructed facial animation model based on the voice data of each frame, the expression database, and the training database to obtain the trained facial animation model specifically includes the following steps:
s1031: and acquiring coding information corresponding to the voice data.
In this step, for the voice data of each frame, a feature extraction layer of a voice coding module is used to perform feature processing on the voice data by using a time sequence convolutional neural network to obtain feature information of the voice data, then a linear interpolation layer is used to perform linear interpolation processing on the feature information of the voice data to obtain interpolated feature information, and finally the interpolated feature information is coded by a coding layer and is subjected to linear mapping processing by a linear mapping layer to obtain coded information of the voice data.
In a specific embodiment, the feature extraction layer is composed of a convolutional neural network, and the convolutional neural network includes a convolutional layer, a pooling layer, and a full-link layer, where a specific structure of the convolutional neural network is not specifically limited in this scheme. Specifically, after the voice data of each frame is input into the convolutional layer, the features of the local region are extracted through the convolutional layer, then the features are selected through the pooling layer, and finally the feature information of the voice data is obtained through the output of the full-connection layer. Then, interpolation processing is carried out through an interpolation function, the interpolation function is not specifically limited in the scheme, a user can select a linear interpolation method according to specific data, finally, the characteristic information after interpolation is input into a coding layer, the coding layer is formed by stacking 6 encoders, the internal structure of the coding layer is a multi-head attention layer and a feedforward neural network layer, and a residual error is connected before input and output of the feedforward neural network layer and then a normalized operation is carried out, so that the coding information of the voice data is obtained.
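As a hedged sketch of this speech coding path (temporal convolution, linear interpolation to the animation frame count, a 6-layer Transformer encoder, and a final linear mapping), the following PyTorch module can be assumed; all dimensions, kernel sizes, and the 80-dimensional acoustic input are illustrative choices, not values fixed by this scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechEncoder(nn.Module):
    """Sketch of the speech coding module: temporal convolutional feature
    extraction -> linear interpolation -> 6-layer Transformer encoder ->
    linear mapping.  Dimensions are illustrative assumptions."""

    def __init__(self, in_dim=80, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        self.tcn = nn.Sequential(                      # feature extraction layer
            nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(        # multi-head attention + feed-forward,
            d_model, n_heads, dim_feedforward=1024,    # each followed by residual Add & Norm
            batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, d_model)        # final linear mapping

    def forward(self, acoustic_feats, target_len):
        # acoustic_feats: (batch, time, in_dim); target_len: number of animation frames
        x = self.tcn(acoustic_feats.transpose(1, 2))                 # (batch, d_model, time)
        x = F.interpolate(x, size=target_len, mode="linear",         # linear interpolation layer
                          align_corners=True)
        x = x.transpose(1, 2)                                        # (batch, target_len, d_model)
        return self.proj(self.encoder(x))                            # encoded information of the voice data
```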
S1032: and obtaining the expression characteristics corresponding to the voice data.
In the step, for the voice data of each frame, deep speech extraction is carried out through an emotion feature extraction layer of a voice emotion recognition module to obtain voice features corresponding to the voice data, emotion classification is carried out on the voice features through a voice emotion recognition layer of the emotion recognition module to obtain emotion probability distribution, and finally, through an emotion expression query layer of the emotion recognition module, query is carried out in an expression database based on the emotion probability distribution to obtain expression features corresponding to the voice data.
In a specific embodiment, deep speech is an end-to-end speech recognition technology based on Long-Short Term Memory-connection timing Classification, long-Short Term Memory (LSTM) modeling and Connection Timing Classification (CTC) training in the machine learning field are introduced into a traditional speech recognition framework to realize feature extraction processing of speech data of each frame, and then a speech emotion recognition layer needs to obtain a speech emotion recognition model capable of correctly outputting emotion expression distribution through pre-training, and the model technology has three implementation modes: and obtaining emotion probability distribution based on the manually extracted speech features and a classification model of a traditional machine learning algorithm, a spectrogram or manual features and a classification model of a convolution cyclic neural network and based on pre-training and fine tuning. The above three modes can be selected at will, and are not limited in the present solution. And finally, inputting the emotion probability distribution into an expression database of the expression query layer, wherein expressions corresponding to various emotions are stored in the expression database and are called expression bases. The speech emotion recognition layer outputs emotion probability distribution which is expression base weight, the emotion probability distribution can be mapped to a blendshape coefficient of an expression, the corresponding expression base is matched in an expression database through the blendshape coefficient, then the emotion expression is obtained through weighted average of the corresponding expression base, and an expression calculation formula is represented as follows:
Figure BDA0004025272610000121
wherein, B = [ B = 0 ,......,b N ]Expressed as expression base model, E = [ E = [ ] 0 ,......,e n ]Is the blenshape coefficient, b 0 Is a base expression without any expression.
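A minimal sketch of this expression query step is shown below; it assumes, for illustration, that the emotion probability distribution is used directly as the blendshape coefficients and that the expression bases are stored as vertex arrays.

```python
import numpy as np


def synthesize_expression(emotion_probs, expression_bases):
    """Combine expression bases with blendshape weights (a sketch of the
    emotion expression query layer).

    emotion_probs:    length-N emotion probability distribution, taken here
                      as the blendshape coefficients e_1..e_N (an assumption).
    expression_bases: array of shape (N + 1, V, 3); index 0 is the neutral
                      base b_0, indices 1..N are the emotion expression bases.
    """
    b0 = expression_bases[0]
    deltas = expression_bases[1:] - b0                  # offsets b_i - b_0
    e = np.asarray(emotion_probs).reshape(-1, 1, 1)     # coefficients e_i
    return b0 + (e * deltas).sum(axis=0)                # b_0 + sum_i e_i * (b_i - b_0)
```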
S1033: and acquiring a sequence of the face animation corresponding to the voice data.
In the step, for the voice data of each frame, the coding information of the pre-acquired face motion with duration and the coding information corresponding to the voice data are subjected to face generation through a decoding layer of a face generation module to obtain face information, and then the expression characteristics are migrated into the face information through an expression migration layer of the face generation module to obtain a sequence of the face animation with the emotion characteristics.
In one embodiment, the decoding layer of the face generation module is composed of 6 decoder stacks, and the internal structure comprises a multi-head attention layer and a feedforward neural network layer. The method comprises the steps that pre-acquired coded information of face motion after duration passes through a first multi-head attention layer of a decoding layer, then is subjected to input addition after being connected with a residual error, then is subjected to normalization once, after two modes of the coded information of a second multi-head attention layer and voice data are fused, the normalization operation is carried out again after the addition, then the normalization operation is carried out once after the addition after the calculation of a feedforward neural network layer, and finally face information is output. In order to enable the voice-driven facial animation to be rich in emotional representation, after the face information is obtained, the face information and the expression features corresponding to the voice data are input into an expression migration layer of a voice recognition module, and the expressions corresponding to the voice data are migrated to the face information through the processing of the expression migration layer, so that a sequence of the facial animation with the emotional features is obtained.
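A hedged sketch of the face generation module is given below; the 6-layer Transformer decoder, the additive form of the expression migration, and the vertex count are assumptions made for the example.

```python
import torch
import torch.nn as nn


class FaceGenerator(nn.Module):
    """Sketch of the face generation module: a 6-layer Transformer decoder fuses
    the encoded facial motion over a duration (queries) with the encoded voice
    data (memory); an expression migration layer then applies the emotion
    expression features.  Sizes and the additive migration are illustrative."""

    def __init__(self, d_model=256, n_heads=4, n_layers=6, n_vertices=5023):
        super().__init__()
        dec_layer = nn.TransformerDecoderLayer(           # two attention blocks + feed-forward,
            d_model, n_heads, dim_feedforward=1024,       # each with residual Add & Norm
            batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.to_face = nn.Linear(d_model, n_vertices * 3)          # face information per frame
        self.migrate = nn.Linear(n_vertices * 3, n_vertices * 3)   # expression migration layer

    def forward(self, motion_enc, voice_enc, expr_feat):
        # motion_enc: (batch, T, d_model) encoded facial motion over a duration
        # voice_enc:  (batch, T, d_model) encoded information of the voice data
        # expr_feat:  (batch, T, n_vertices * 3) expression features from the emotion module
        h = self.decoder(tgt=motion_enc, memory=voice_enc)
        face = self.to_face(h)                            # face information
        return face + self.migrate(expr_feat)             # facial animation with emotion features
```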
S1034: and acquiring a final human face animation model.
In the specific implementation of this step, a face animation loss function in the discriminator is constructed:
Figure BDA0004025272610000122
wherein λ is the contribution of the control reconstruction loss function to the total loss function,
Figure BDA0004025272610000123
a loss function representing a classification by a discriminator>
Figure BDA0004025272610000124
Representing the face reconstruction loss function term.
And aiming at the voice data of each frame, inputting the training of the face animation corresponding to the voice data output by the face animation model and the real face animation sequence of the voice data calibrated in the training database into a discriminator for classification discrimination to obtain the value of a face animation loss function, and then optimizing the parameters of the face animation model according to the value of the face animation loss function to obtain the final face animation model.
In the training method of the facial animation model provided by this embodiment, the voice data of each frame is processed by the speech coding module to obtain the encoded information corresponding to the voice data, and by the speech emotion recognition module to obtain the expression features corresponding to the voice data; the encoded information and the expression features are then input into the face generation module to generate the facial animation sequence corresponding to the voice data, and the facial animation model is trained with a time-series generative adversarial network to obtain the final facial animation model. The generated facial animation therefore carries expressions rich in emotional representation and is no longer limited to mouth-shape driving.
Fig. 4 is a schematic flow diagram of a third embodiment of a training method for a face animation model provided in an embodiment of the present application, and as shown in fig. 4, step S1031 in the second embodiment of the method specifically includes:
s201: and acquiring characteristic information of the voice data.
In a specific implementation manner of this step, after the speech coding module inputs speech data of each frame, feature extraction is performed in a feature extraction layer in the speech coding module, where the feature extraction layer is composed of a convolutional neural network, the convolutional neural network includes a convolutional layer, a pooling layer, and a full-link layer, and a specific structure of the convolutional neural network is not specifically limited in this scheme.
Specifically, after the voice data of each frame is input into the convolutional layer, the features of the local region are extracted through the convolutional layer, then feature selection is performed through the pooling layer, and finally feature information of the voice data is obtained through output of the full connection layer.
S202: and acquiring the characteristic information after interpolation.
In the specific embodiment of this step, after the feature information of the voice data is acquired, the linear interpolation layer performs linear interpolation processing.
Specifically, the interpolation function is used for interpolation processing, the interpolation function is not specifically limited in the scheme, and a user can select a linear interpolation method according to specific data.
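A minimal example of the interpolation step is shown below; it assumes the common purpose of such interpolation, namely aligning the length of the audio feature sequence with the number of animation frames, and uses linear interpolation as one possible choice.

```python
import torch
import torch.nn.functional as F

feats = torch.randn(1, 256, 190)      # (batch, channels, audio feature steps), values assumed
num_anim_frames = 100                 # target number of animation frames, assumed for the example
aligned = F.interpolate(feats, size=num_anim_frames, mode="linear", align_corners=True)
print(aligned.shape)                  # torch.Size([1, 256, 100])
```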
S203: encoding information of voice data is acquired.
In a specific embodiment of this step, the obtained feature information after interpolation is input to the coding layer, and the coding processing is performed on the feature information after interpolation.
Specifically, the coding layer is formed by stacking 6 encoders, the internal structure of the coding layer is a multi-head attention layer and a feedforward neural network layer, and a residual error is connected before the input and the output of the feedforward neural network layer, and then a normalization operation is performed, so that the coding information of the voice data is obtained.
In the training method for the facial animation model provided in this embodiment, the voice data of each frame is processed through the feature extraction layer, the linear interpolation layer, and the coding layer of the speech coding module to obtain the encoded information of the voice data. This realizes more specific processing of the voice data of each frame and accurately yields the mouth-shape-focused facial animation sequence, which by itself lacks expression.
Fig. 5 is a schematic flowchart of a fourth embodiment of a training method for a facial animation model provided in an embodiment of the present application. As shown in Fig. 5, step S1032 in the second embodiment of the method specifically includes:
s301: and acquiring voice characteristics corresponding to the voice data.
In a specific implementation manner of this step, for each frame of voice data, deep speech extraction is performed through an emotion feature extraction layer of a voice emotion recognition module to obtain a voice feature corresponding to the voice data. Specifically, deep speech is an end-to-end speech recognition technology based on Long Short Term Memory-connection time sequence Classification, and Long Short Term Memory (LSTM) modeling and connection time sequence Classification (CTC) training in the field of machine learning are introduced into a traditional speech recognition framework to realize feature extraction processing of speech data of each frame.
S302: and acquiring emotion probability distribution.
In the specific implementation manner of this step, the speech emotion recognition layer processes the speech features corresponding to the speech data obtained in the above step to obtain emotion probability distribution. Specifically, the speech emotion recognition layer needs to obtain a speech emotion recognition model capable of correctly outputting emotion expression distribution through pre-training, and the model has three implementation modes technically: the method comprises the steps of adding a classification model of a traditional machine learning algorithm, a spectrogram or a manual feature and convolution cyclic neural network based on a voice feature extracted manually and adding a classification model of a fine tuning based on pre-training. The above three modes can be selected at will, and are not limited in the present solution.
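Of the three approaches, the sketch below assumes the simplest one: a small classifier on top of DeepSpeech-extracted features that outputs an emotion probability distribution via softmax. The 29-dimensional feature size and the seven emotion classes are assumptions of the example.

```python
import torch
import torch.nn as nn


class SpeechEmotionClassifier(nn.Module):
    """Sketch of the speech emotion recognition layer: maps per-frame
    DeepSpeech features to an emotion probability distribution.  The
    feature and class sizes are illustrative assumptions."""

    def __init__(self, feat_dim=29, hidden=128, n_emotions=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, deepspeech_feats):
        # deepspeech_feats: (batch, feat_dim) features for one voice frame
        return torch.softmax(self.net(deepspeech_feats), dim=-1)   # emotion probability distribution
```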
S303: and obtaining the expression characteristics corresponding to the voice data.
In the specific implementation manner of this step, the expression features corresponding to the voice data are obtained by processing the obtained emotion probability distribution. Specifically, the emotion probability distribution is input into an expression database of the expression query layer, and expressions corresponding to various emotions are stored in the expression database and are called expression bases. The speech emotion recognition layer outputs emotion probability distribution which is expression base weight, the emotion probability distribution can be mapped to a blendshape coefficient of an expression, the corresponding expression base is matched in an expression database through the blendshape coefficient, then the emotion expression is obtained through weighted average of the corresponding expression base, and an expression calculation formula is represented as follows:
Figure BDA0004025272610000141
wherein, B = [ B = 0 ,…,b N ]Expressed as expression base model, E = [ E = [ ] 0 ,……,e n ]Is the blenshape coefficient, b 0 Is a base expression without any expression.
In the training method of the face animation model provided by this embodiment, the speech data of each frame is processed through the speech emotion feature extraction layer, the speech emotion recognition layer and the emotion expression query layer in the speech emotion recognition module, so as to obtain the expression features corresponding to the speech data. The emotion driving of the voice-driven face animation is realized.
Fig. 6 is a schematic flowchart of a fifth embodiment of a training method for a facial animation model provided in an embodiment of the present application. As shown in Fig. 6, step S1033 in the second embodiment of the method specifically includes:
s401: and acquiring the face information.
In a specific embodiment of this step, the coding information of the voice data and the coding information of the face motion with a duration acquired in advance are input into a decoding layer of the face generation module, and the face information is obtained through decoding processing.
Specifically, the decoding layer is composed of 6 decoder stacks, and the internal structure comprises a multi-head attention layer and a feedforward neural network layer. The method comprises the steps that pre-acquired coded information of face movement with duration passes through a first multi-head attention layer of a decoding layer, then is subjected to addition with input connected with residual errors, then is subjected to normalization once, after two modes of the coded information of a second multi-head attention layer and voice data are fused, the operation of addition with normalization is carried out again, then, the operation of addition with normalization is carried out once after calculation is carried out through a feedforward neural network layer, and finally face information is output.
S402: and acquiring a sequence of the human face animation with the emotional characteristics.
In a specific implementation manner of this step, for the voice-driven facial animation to be rich in emotional representations, after the facial information is obtained, the facial information and the expression features corresponding to the voice data are input into an expression migration layer of the voice recognition module, and the expressions corresponding to the voice data are migrated to the facial information through the processing of the expression migration layer, so that a sequence of the facial animation with the emotional features is obtained.
According to the training method of the facial animation model provided by this embodiment, the face information is obtained through the decoding layer of the face generation module, and the expression features output from the speech emotion recognition module are then migrated into the face information through the expression migration layer, so that the facial animation sequence with emotion features is obtained. Expression driving rich in emotional characteristics is realized.
Fig. 7 is a schematic flowchart of a sixth embodiment of a training method for a facial animation model provided in an embodiment of the present application. As shown in Fig. 7, step S1034 in the second embodiment of the method specifically includes:
s501: and constructing a face animation loss function in the discriminator.
In this step, in order to make the sequence of the face animation output by the face animation model closer to the sequence of the real face animation, the sequence for generating the face animation is time sequence data, so that the face animation model is trained by adopting a time sequence generation countermeasure network. In order to improve the effect of mouth shape synthesis, the loss function is trained in an improved way.
In a specific embodiment, the time-series generative adversarial network includes a generator and a discriminator, where the generator is the facial animation model, and the loss function of the discriminator is:

$$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{z}\big[\log D_{Seq}(z, s)\big] + \mathbb{E}_{x}\big[\log\big(1 - D_{Seq}(x, s)\big)\big]$$

where $D_{Seq}$ performs classification based on the whole voice and facial animation sequence, $x$ is the facial animation sequence segment output by the generator, $s$ is the voice data sequence segment, and $z$ is the real facial animation sequence.
To improve the mouth-shape synthesis effect, a pixel-level face reconstruction loss term is added on the basis of the above formula. Since the mouth shape is only related to the lower half of the face, only the sum of the per-pixel losses over the reconstructed lower half of the face needs to be computed. Assuming the face reconstructed at time $t$ is an image of $W \times H$ pixels, the face reconstruction loss at time $t$ is expressed as:

$$\mathcal{L}_{\mathrm{rec}}^{(t)} = \sum_{p \in \Omega_{\mathrm{lower}}} \big\| F_p - G_p \big\|^2$$

where $\Omega_{\mathrm{lower}}$ is the set of pixels in the lower half of the $W \times H$ face image, $F_p$ is a real face pixel, and $G_p$ is the corresponding face pixel generated by the generator.
The final facial animation loss function is obtained as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda\,\mathcal{L}_{\mathrm{rec}}$$

where $\lambda$ controls the contribution of the reconstruction loss to the total loss, $\mathcal{L}_{\mathrm{adv}}$ denotes the classification loss of the discriminator, and $\mathcal{L}_{\mathrm{rec}}$ denotes the face reconstruction loss term.
S502: the value of the face animation loss function is calculated.
In a specific implementation of this step, the real facial animation sequence and the voice data are input into the generator, which outputs the facial animation sequence. The generated facial animation sequence, the real facial animation sequence, and the user information are then input into the discriminator, and the value of the facial animation loss function is calculated through the facial animation loss function.
S503: and optimizing parameters of the human face animation model.
In the specific implementation manner of this step, after the value of the face animation loss function is obtained, the performance of the face animation model is determined by determining whether the value of the face animation loss function is minimum. If the face animation loss function is not converged, the value of the face animation loss function is not the minimum value, the training process needs to be repeated until the value of the loss function is the minimum value, and the training is finished. If the loss function of the face animation is not converged, the aim of training is fulfilled by randomly adjusting the parameters of the face animation model.
In the training method for the facial animation model provided by this embodiment, the facial animation model is trained with a time-series generative adversarial network, where the training objective of the generative adversarial network is that the discriminator cannot tell whether an input facial animation sequence was produced by the generator or is a real sequence. The resulting facial animation model therefore performs better.
An embodiment of the present application further provides a method for processing voice data. Fig. 8 is a schematic flowchart of an embodiment of the method for processing voice data provided in an embodiment of the present application. As shown in Fig. 8, the method specifically includes the following steps:
s601: voice data for each frame is acquired.
In a specific implementation manner of this step, when reasoning is performed on a pre-constructed face animation model, voice data of each frame of pre-collected voice needs to be input, and therefore after the voice is pre-collected, the collected voice needs to be framed, specifically, segmented according to a specified length, such as a time period or a sampling number, and structured into a data structure programmed by a user. The collected voice comprises voice data in wav format and emotion tag data pairs corresponding to the voice.
S602: and acquiring a sequence of the face animation corresponding to the voice data.
In this step, the voice data of each frame is input into the face animation model, and the sequence of the face animation corresponding to the voice data is obtained after the voice data is processed by the voice coding module, the voice emotion recognition module and the face generation module of the face animation model.
In a specific implementation mode, the voice data of each frame is input into a voice coding module, the voice data is subjected to feature processing by adopting a time sequence convolutional neural network through a feature extraction layer of the voice coding module to obtain feature information of the voice data, then the feature information of the voice data is subjected to linear interpolation processing through a linear interpolation layer to obtain interpolated feature information, and finally the interpolated feature information is coded through a coding layer and subjected to linear mapping processing through a linear mapping layer to obtain coded information of the voice data. And then performing DeepSpeech extraction through an emotion feature extraction layer of the voice emotion recognition module to obtain voice features corresponding to the voice data, performing emotion classification on the voice features through a voice emotion recognition layer of the emotion recognition module to obtain emotion probability distribution, and finally performing query in an expression database based on the emotion probability distribution through an emotion expression query layer of the emotion recognition module to obtain expression features corresponding to the voice data. And finally, carrying out face generation on the coding information of the face movement with duration acquired in advance and the coding information corresponding to the voice data through a decoding layer of a face generation module to obtain face information, and then migrating the expression features into the face information through an expression migration layer of the face generation module to obtain a sequence of the face animation corresponding to the voice data.
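Tying the sketches above together, a hedged end-to-end inference outline could look as follows; `deepspeech_features` and `acoustic_features` stand in for the DeepSpeech front-end and the acoustic feature extractor, and the neutral (zero) facial-motion encoding is an assumption of the example.

```python
import torch


@torch.no_grad()
def generate_face_animation(wav_path, speech_encoder, emotion_classifier,
                            expression_bases, face_generator,
                            deepspeech_features, acoustic_features):
    """End-to-end inference sketch: frame the voice, encode it, recognize the
    emotion, query the expression features, and generate the facial animation
    sequence.  All module and helper names refer to the sketches above and are
    illustrative assumptions."""
    frames, sr = frame_speech(wav_path)                    # per-frame voice data (framing sketch)
    T = frames.shape[0]

    acoustic = acoustic_features(frames, sr)               # (1, time, in_dim), assumed helper
    voice_enc = speech_encoder(acoustic, target_len=T)     # encoded information of the voice data

    ds_feats = deepspeech_features(frames, sr)             # (T, feat_dim), assumed helper
    emotion_probs = emotion_classifier(ds_feats)           # (T, n_emotions)
    expr = torch.stack([                                   # per-frame expression features
        torch.as_tensor(synthesize_expression(p.numpy(), expression_bases),
                        dtype=torch.float32)
        for p in emotion_probs
    ]).reshape(1, T, -1)

    motion_enc = torch.zeros_like(voice_enc)               # neutral facial-motion encoding (assumed)
    return face_generator(motion_enc, voice_enc, expr)     # facial animation sequence
```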
In the voice data processing method provided by this embodiment, the voice data of each frame is input into the face animation model and processed by its speech coding module, speech emotion recognition module and face generation module to finally obtain the sequence of the face animation corresponding to the voice data. The speech coding module and the speech emotion recognition module together realize expression driving in the voice-driven face animation.
The training method and the voice data processing method of the face animation model provided in the embodiment of the present application are described in detail below through a specific model inference architecture diagram and a model training architecture diagram. Fig. 9 is a schematic view of the face animation model inference architecture provided in the embodiment of the present application; as shown in fig. 9, the face animation model includes a speech coding module, a face generation module, and a speech emotion recognition module. Specifically:
The speech coding module is responsible for coding the input voice data of each frame. As shown in fig. 9, before the speech coding module and the speech emotion recognition module are applied, the speech is first framed to obtain the voice data of each frame, and the voice data of each frame has a corresponding emotion label in the expression database. A time sequence convolutional neural network then performs feature processing on the voice data of each frame, linear interpolation is applied to obtain the interpolated feature information, and finally the coding layer encodes it and one linear mapping is performed to obtain the coding information $a_n$ of the voice data, n = 1, 2, …, N. The coding information of the voice data can then be input, at the corresponding time position, to the decoding layer of the face generation module.
The coding layer in the speech coding module is formed by stacking 6 encoders. As shown in fig. 10, the internal structure of each encoder is a Multi-Head Attention layer and a Feed-Forward neural network (Feed Forward) layer, with a residual connection between the input and output of the feed-forward neural network layer, followed by an Add & Norm operation.
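A minimal sketch of such a speech coding module is given below, built from standard PyTorch blocks. The feature dimension, model width, interpolation target length, and the use of nn.TransformerEncoder to stand in for the 6 stacked encoders are assumptions made for illustration, not parameters taken from this embodiment.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, in_dim=29, d_model=128, n_heads=4, n_layers=6, target_len=30):
        super().__init__()
        # Temporal convolution acting as the feature extraction layer.
        self.tcn = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1)
        self.target_len = target_len
        # 6 stacked encoders: multi-head attention + feed-forward, each with Add & Norm.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.linear_map = nn.Linear(d_model, d_model)    # final linear mapping layer

    def forward(self, x):                     # x: (batch, time, in_dim) per-frame features
        h = self.tcn(x.transpose(1, 2))       # (batch, d_model, time)
        h = F.interpolate(h, size=self.target_len, mode='linear',
                          align_corners=False)           # linear interpolation layer
        h = self.encoder(h.transpose(1, 2))   # coding layer (attention + feed-forward)
        return self.linear_map(h)             # coding information of the voice data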
The speech emotion recognition module is adopted because the speech coding module has difficulty capturing the information needed to generate expressions; it enhances the expression feature information and thereby enriches the expressions of the generated face animation. As with the speech coding module, after the speech is framed, the voice data of each frame is processed in the following steps:
(1) Extract the voice features using DeepSpeech;
(2) Input the voice features extracted by DeepSpeech into the emotion classifier and output the emotion probability distribution;
(3) Synthesize the expression features corresponding to the voice data from the expression database according to the emotion probability distribution.
The speech emotion recognition layer requires a speech emotion recognition model, pre-trained so that it can correctly output the emotion probability distribution. Possible implementations include a classification model built on manually extracted speech features with a traditional machine learning algorithm, a spectrogram or hand-crafted features fed into a convolutional recurrent neural network, or a pre-trained model with fine-tuning. This scheme does not specifically limit the implementation method, and a suitable method can be selected according to the specific situation.
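For illustration, a minimal sketch of one such classifier is given below. The GRU-based architecture, the 29-dimensional DeepSpeech-style feature size, and the seven-emotion output are assumptions, since this embodiment deliberately leaves the concrete classifier open.

import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, feat_dim=29, hidden=128, n_emotions=7):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)   # temporal aggregation
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, speech_feats):          # (batch, time, feat_dim) DeepSpeech-style features
        _, h = self.rnn(speech_feats)         # h: (1, batch, hidden)
        return torch.softmax(self.head(h[-1]), dim=-1)   # emotion probability distribution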
Specifically, synthesizing the expression features corresponding to the voice data from the expression database according to the emotion probability distribution works as follows. The expression database stores the expressions corresponding to the various emotions, called expression bases. The emotion probability distribution output by the speech emotion recognition layer can be mapped to the blendshape coefficients of an expression; the corresponding expression bases are queried in the expression database through the blendshape coefficients, and the emotional expression is then obtained as a weighted average of the corresponding expression bases, computed as:
$F = \sum_{i=0}^{N} e_i \, b_i$
wherein $B = [b_0, \ldots, b_N]$ is the expression base model, $E = [e_0, \ldots, e_N]$ is the vector of blendshape coefficients, and $b_0$ is the neutral base expression without any expression. The expression synthesized from the expression database is input into the expression migration layer of the face generation module, which migrates it onto the face information generated by the face generation module, thereby driving the expression.
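The query-and-blend step can be sketched as follows. The storage of the expression bases as vertex arrays and the treatment of the coefficient vector as normalized weights are assumptions consistent with the formula above, not a prescribed data layout.

import numpy as np

def synthesize_expression(blendshape_coeffs, expression_bases):
    # expression_bases: (N+1, V, 3) array holding the bases b0..bN (b0 = neutral face)
    # blendshape_coeffs: (N+1,) array holding the weights e0..eN mapped from the
    #                    emotion probability distribution
    # Weighted combination of the expression bases, matching the formula above.
    return np.tensordot(blendshape_coeffs, expression_bases, axes=(0, 0))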
The face generation module mainly has two processing layers: a decoding layer and an expression migration layer. The coding information of the voice data and the coding information of the face motion over a preset duration are input to the decoding layer, which decodes them to generate the face information. After the face information is generated, the user information (Speaker Identity) is input and, after style embedding (Style Embedding), the expression migration layer migrates the expression features corresponding to the voice data onto the face information, generating the sequence of the face animation with emotional features.
The internal structure of the decoding layer is shown in fig. 11; it is formed by stacking 6 decoders. The coding information of the face motion over a duration acquired in advance passes through the first Multi-Head Attention layer of the decoding layer, followed by a residual connection and an Add & Norm operation; the second Multi-Head Attention layer fuses the two modalities with the expression features corresponding to the voice data, followed by another Add & Norm operation; a feed-forward neural network layer then performs the computation, and after one more Add & Norm operation the decoding layer outputs the sequence of the face animation with emotional features.
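A minimal sketch of such a decoding layer plus expression migration is given below. nn.TransformerDecoder stands in for the 6 stacked decoders, and the vertex count and the additive form of the expression migration are illustrative assumptions.

import torch
import torch.nn as nn

class FaceGenerator(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=6, n_vertices=5023):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=512,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)  # 6 stacked decoders
        self.to_face = nn.Linear(d_model, n_vertices * 3)                 # face information

    def forward(self, face_motion_codes, speech_codes, expression_offsets):
        # face_motion_codes: pre-acquired coding information of the face motion
        # speech_codes:      coding information output by the speech coding module
        h = self.decoder(tgt=face_motion_codes, memory=speech_codes)
        face = self.to_face(h).reshape(h.shape[0], h.shape[1], -1, 3)
        return face + expression_offsets      # expression migration (additive offsets)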
Fig. 12 is a schematic diagram of the face animation model training network architecture provided in the embodiment of the present application. As shown in fig. 12, the face animation model training network is a time sequence generation confrontation network, which includes a Generator and a Discriminator. The generator is the face animation model.
The training goal of the training framework based on the time sequence generation confrontation network is that the discriminator cannot tell whether an input face animation sequence is the face animation sequence corresponding to the voice data generated by the generator or a real face animation sequence. This is realized by attaching a binary classifier at the end of the discriminator, whose output classification label indicates whether the face animation sequence is a real one. Fig. 13 is a schematic diagram of the network structure of the discriminator. As shown in fig. 13, the discriminator makes two classification decisions: (1) classification of the voice data of each frame together with the sequence of the face animation corresponding to the voice data; (2) classification of the real face animation sequence against the sequence of the face animation corresponding to the voice data. Fig. 13 only shows the classification of the voice data of each frame with the corresponding face animation sequence, and the corresponding classification loss is:
$L_{seq} = \mathbb{E}_{(z, s)}\big[\log D_{seq}(z, s)\big] + \mathbb{E}_{(x, s)}\big[\log\big(1 - D_{seq}(x, s)\big)\big]$
wherein $D_{seq}$ denotes the sequence-level classification over the whole voice and face animation, x is a sequence segment of the face animation output by the generator, s is the corresponding voice data sequence segment, and z is a sequence segment of the real face animation.
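A minimal sketch of such a sequence discriminator is shown below. The GRU aggregation, the input dimensions, and the sigmoid output are assumptions used only to illustrate a two-input binary classifier over (animation, speech) pairs.

import torch
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    def __init__(self, face_dim=5023 * 3, speech_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(face_dim + speech_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())  # binary classifier

    def forward(self, face_seq, speech_seq):
        # face_seq:   (batch, time, face_dim) animation sequence segment (real or generated)
        # speech_seq: (batch, time, speech_dim) corresponding voice data segment
        h, _ = self.rnn(torch.cat([face_seq, speech_seq], dim=-1))
        return self.classifier(h[:, -1])      # probability that the pair is real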
In order to improve the effect of mouth-shape synthesis, a pixel-level face reconstruction loss term is added on top of the above. Since the mouth shape is only related to the lower half of the face, only the sum of the per-pixel losses over the lower half of the reconstructed face needs to be calculated. Assuming that the face reconstructed at time t is an image of W × H pixels, the face reconstruction loss at time t, $L_{rec}^{(t)}$, is expressed as:

$L_{rec}^{(t)} = \sum_{p} \big\| F_p - G_p \big\|^2$
wherein the sum runs over the pixels p of the lower half of the reconstructed face, $F_p$ is a pixel of the real face, and $G_p$ is the corresponding face pixel generated by the generator.
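The lower-half pixel loss can be written as the following sketch. The squared-error form and the (batch, H, W, 3) tensor layout are assumptions consistent with the definitions above.

import torch

def lower_half_reconstruction_loss(real_face, gen_face):
    # real_face, gen_face: (batch, H, W, 3) images of the face at one time step t
    h = real_face.shape[1]
    real_lower, gen_lower = real_face[:, h // 2:], gen_face[:, h // 2:]   # lower half only
    return ((real_lower - gen_lower) ** 2).sum(dim=(1, 2, 3)).mean()      # per-pixel squared error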
The final face animation loss function is obtained as follows:
$L = L_{seq} + \lambda \, L_{rec}$

wherein λ controls the contribution of the reconstruction loss function to the total loss function, $L_{seq}$ represents the classification loss function of the discriminator, and $L_{rec}$ represents the face reconstruction loss function term.
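Putting the two terms together, a sketch of the total objective could look like the following. It reuses the hypothetical lower_half_reconstruction_loss from the sketch above; the binary cross-entropy form of the adversarial term and the default value of λ are assumptions for illustration.

import torch
import torch.nn.functional as F

def face_animation_loss(d_fake, real_faces, gen_faces, lam=10.0):
    # d_fake: discriminator outputs for the generated (animation, speech) pairs
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))    # adversarial term
    rec = torch.stack([lower_half_reconstruction_loss(r, g)          # reconstruction term
                       for r, g in zip(real_faces, gen_faces)]).mean()
    return adv + lam * rec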
The application provides a training method for a face animation model and a voice data processing method. A speech emotion recognition module is added to a generative face animation model that originally predicted the mesh sequence directly, and an expression migration method is used to embody the emotion information contained in the voice data of each frame in the expressions of the face animation. The generated face animation therefore has expressions with rich emotional representations, which resolves the technical bottleneck that existing voice-driven face animation models concentrate on mouth-shape synthesis and lack expression animation because they cannot obtain enough expression feature information. Meanwhile, because the expression migration method is used and the generated face animation is time-series data, a time sequence generation confrontation network is adopted in the training stage to train the generative model.
Fig. 14 is a schematic structural diagram of a first embodiment of the training apparatus for a face animation model provided in the embodiment of the present application, and as shown in fig. 14, the training apparatus 700 for a face animation model includes:
the collecting and processing module 701 is configured to perform framing processing on the collected voice to obtain voice data of each frame.
An information obtaining module 702, configured to obtain an expression database and a training database, where the expression database comprises the expression bases corresponding to various emotions, and the training database comprises the real face animation sequence corresponding to the calibrated voice.
The model training module 703 is configured to perform time sequence generation confrontation network training on a pre-constructed facial animation model based on the speech data of each frame, the expression database, and the training database to obtain a trained facial animation model, where the facial animation model is used to generate a sequence of facial animation corresponding to the input speech data.
Fig. 15 is a schematic structural diagram of a second embodiment of a training apparatus for a face animation model according to an embodiment of the present application, and as shown in fig. 15, a face animation model 800 includes a speech coding module 801, a speech emotion recognition module 802, and a face generation module 803;
correspondingly, the model training module 703 performs time sequence generation confrontation network training on a pre-constructed facial animation model based on the voice data of each frame, the expression database and the training database to obtain a trained facial animation model, and includes:
the speech coding module 801 is configured to code the speech data for each frame of speech data to obtain coding information corresponding to the speech data.
And a speech emotion recognition module 802, configured to perform emotion recognition processing on the speech data of each frame to obtain an expression feature corresponding to the speech data.
The face generation module 803 is configured to, for each frame of voice data, obtain the sequence of the face animation corresponding to the voice data according to the expression features corresponding to the voice data, the coding information of face motion over a duration acquired in advance, and the coding information corresponding to the voice data.
The model training module 703 is configured to perform time sequence generation confrontation network training based on a real face animation sequence of the speech calibrated in the training database, the constructed face animation loss function, and a face animation sequence corresponding to each frame of speech data output by the face animation model, optimize parameters of the face animation model, and repeat the model training until the face animation loss function converges to obtain a final face animation model.
Fig. 16 is a schematic structural diagram of a third embodiment of a training apparatus for a human face animation model according to an embodiment of the present application, and as shown in fig. 16, a speech coding module 801 includes:
the feature extraction unit 8011 is configured to perform feature processing on the voice data by using a time-series convolutional neural network according to the voice data of each frame, so as to obtain feature information of the voice data.
The linear interpolation unit 8012 is configured to perform linear interpolation processing on the feature information of the voice data to obtain feature information after interpolation.
And an encoding unit 8013 configured to encode the interpolated feature information and perform linear mapping processing on the linear mapping layer to obtain encoded information of the voice data.
Fig. 17 is a schematic structural diagram of a fourth embodiment of a training apparatus for a human face animation model provided in an embodiment of the present application, and as shown in fig. 17, a speech emotion recognition module 802 includes:
the emotion feature extraction unit 8021 is configured to perform, for the voice data of each frame, deepSpeech extraction to obtain a voice feature corresponding to the voice data.
And the voice emotion recognition unit 8022 is used for performing emotion classification on the voice features to obtain emotion probability distribution.
The emotional expression query unit 8023 is configured to query in the expression database based on the emotional probability distribution to obtain an expression feature corresponding to the voice data.
Fig. 18 is a schematic structural diagram of a fifth embodiment of a training apparatus for a face animation model according to the embodiment of the present application, and as shown in fig. 18, the face generation module 803 includes:
the decoding unit 8031 is configured to perform, for each frame of voice data, face generation on coding information of face motion for duration acquired in advance and coding information corresponding to the voice data, so as to obtain face information.
And the expression migration unit 8032 is configured to migrate the expression features into the face information to obtain a sequence of the face animation with the emotion features.
Fig. 19 is a schematic structural diagram of a sixth embodiment of a training apparatus for a human face animation model according to an embodiment of the present application, and as shown in fig. 19, a model training module 703 includes:
a function constructing unit 7031, configured to construct a face animation loss function in the discriminator:
$L = L_{seq} + \lambda \, L_{rec}$

wherein λ controls the contribution of the reconstruction loss function to the total loss function, $L_{seq}$ represents the classification loss function of the discriminator, and $L_{rec}$ represents the face reconstruction loss function term;
a model training unit 7032, configured to, for each frame of voice data, input the sequence of the face animation corresponding to the voice data output by the face animation model and the real face animation sequence of the calibrated voice data in the expression database into the discriminator for classification discrimination, to obtain the value of the face animation loss function;
And the model optimizing unit 7033 is configured to optimize parameters of the face animation model according to the value of the face animation loss function.
Fig. 20 is a schematic structural diagram of an embodiment of the apparatus for processing voice data provided in the embodiment of the present application, and as shown in fig. 20, the apparatus 900 for processing voice data includes:
the speech framing module 901 is configured to perform framing processing on the speech to be processed to obtain speech data of each frame.
And the model reasoning module 902 is configured to input the voice data into the face animation model for processing, so as to obtain a sequence of the face animation corresponding to the voice data.
The face animation model is trained in advance and is used for generating the sequence of the face animation corresponding to the input voice data.
An embodiment of the present application further provides an electronic device, fig. 21 is a schematic structural diagram of the electronic device provided in the embodiment of the present application, and as shown in fig. 21, the electronic device 110 includes: a processor 111, and a memory 112 communicatively coupled to the processor;
the memory 111 stores computer-executable instructions.
The processor 112 executes computer-executable instructions stored by the memory to implement the method in any of the embodiments.
The embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored; when the computer-executable instructions are executed by a processor, they are used to implement the method in any one of the embodiments.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (16)

1. A training method of a human face animation model is characterized by comprising the following steps:
performing framing processing on the collected voice to obtain voice data of each frame;
obtaining an expression database and a training database, wherein the expression database comprises expression bases corresponding to various emotions, and the training database comprises a real face animation sequence corresponding to the calibrated voice;
and performing time sequence generation confrontation network training on a pre-constructed human face animation model based on the voice data of each frame, the expression database and the training database to obtain a trained human face animation model, wherein the human face animation model is used for generating a human face animation sequence corresponding to the input voice data.
2. The method of claim 1, wherein the face animation model comprises a speech coding module, a speech emotion recognition module, a face generation module;
correspondingly, the performing time sequence generation confrontation network training on the pre-constructed facial animation model based on the voice data of each frame, the expression database and the training database to obtain the trained facial animation model comprises:
step a, aiming at voice data of each frame, coding the voice data through a voice coding module of a human face animation model to obtain coding information corresponding to the voice data;
b, aiming at the voice data of each frame, performing emotion recognition processing on the voice data through a voice emotion recognition module of the human face animation model to obtain expression characteristics corresponding to the voice data;
step c, aiming at the voice data of each frame, inputting the expression characteristics corresponding to the voice data, the coding information of the face movement with duration obtained in advance and the coding information corresponding to the voice data into a face generation module of the face animation model to be trained to obtain a sequence of the face animation corresponding to the voice data;
and d, performing time sequence generation confrontation network training based on the real face animation sequence of the calibrated voice in the training database, the constructed face animation loss function, and the face animation sequence corresponding to each frame of voice data output by the face animation model, optimizing parameters of the face animation model, and repeating the steps a-d until the face animation loss function converges to obtain a final face animation model.
3. The method of claim 2, wherein the encoding the voice data of each frame by a voice encoding module of a human face animation model to obtain encoded information corresponding to the voice data comprises:
aiming at the voice data of each frame, performing feature processing on the voice data by adopting a time sequence convolution neural network through a feature extraction layer of a voice coding module to obtain feature information of the voice data;
performing linear interpolation processing on the feature information of the voice data through a linear interpolation layer of the voice coding module to obtain interpolated feature information;
and coding the characteristic information after interpolation through a coding layer of the voice coding module, and performing linear mapping processing through a linear mapping layer to obtain coding information of the voice data.
4. The method of claim 2, wherein performing emotion recognition processing on the voice data through a voice emotion recognition module of the facial animation model for the voice data of each frame to obtain an expression feature corresponding to the voice data comprises:
aiming at the voice data of each frame, performing DeepSpeech extraction through an emotion feature extraction layer of the voice emotion recognition module to obtain voice features corresponding to the voice data;
performing emotion classification on the voice features through a voice emotion recognition layer of the emotion recognition module to obtain emotion probability distribution;
and inquiring in the expression database based on the emotion probability distribution through an emotion expression inquiry layer of the emotion recognition module to obtain expression characteristics corresponding to the voice data.
5. The method according to claim 2, wherein the inputting, for the voice data of each frame, the expression feature corresponding to the voice data, the pre-obtained coded information of the face motion for duration, and the coded information corresponding to the voice data into the face generation module of the face animation model to be trained to obtain a sequence of the face animation corresponding to the voice data comprises:
aiming at the voice data of each frame, carrying out face generation on coding information of face motion with duration acquired in advance and coding information corresponding to the voice data through a decoding layer of the face generation module to obtain face information;
and migrating the expression characteristics to the face information through an expression migration layer of the face generation module to obtain a sequence of the face animation with the emotion characteristics.
6. The method of any one of claims 2 to 5, wherein the performing of the time-series generation countermeasure network training based on the real facial animation sequence of the speech calibrated in the training database, the constructed facial animation loss function, and the sequence of the facial animation corresponding to the speech output by the facial animation model, and the optimizing of the parameters of the facial animation model comprises:
constructing a face animation loss function in a discriminator:
$L = L_{seq} + \lambda \, L_{rec}$

wherein λ controls the contribution of the reconstruction loss function to the total loss function, $L_{seq}$ represents the classification loss function of the discriminator, and $L_{rec}$ represents the face reconstruction loss function term;
inputting a sequence of a face animation corresponding to the voice data output by the face animation model and a real face animation sequence of the voice calibrated in the training database into a discriminator for classification discrimination aiming at the voice data of each frame to obtain a value of a face animation loss function;
and optimizing the parameters of the human face animation model according to the value of the human face animation loss function.
7. A method for processing voice data, comprising:
performing frame processing on the voice to be processed to obtain voice data of each frame;
inputting the voice data into a human face animation model for processing aiming at the voice data of each frame to obtain a human face animation sequence corresponding to the voice data;
wherein the human face animation model is trained in advance and is used for generating the human face animation sequence corresponding to the input voice data.
8. An apparatus for training a human face animation model, comprising:
the acquisition processing module is used for performing framing processing on the acquired voice to obtain voice data of each frame;
the information acquisition module is used for acquiring an expression database and a training database, wherein the expression database comprises expression bases corresponding to various emotions, and the training database comprises a real face animation sequence corresponding to the calibrated voice;
and the model training module is used for carrying out time sequence generation confrontation network training on a pre-constructed facial animation model based on the voice data of each frame, the expression database and the training database to obtain a trained facial animation model, and the facial animation model is used for generating a sequence of facial animation corresponding to the input voice data.
9. The apparatus of claim 8, wherein the face animation model comprises a speech coding module, a speech emotion recognition module, a face generation module;
correspondingly, the model training module performing time sequence generation confrontation network training on the pre-constructed facial animation model based on the voice data of each frame, the expression database and the training database to obtain the trained facial animation model comprises:
the voice coding module is used for coding the voice data aiming at the voice data of each frame to obtain coding information corresponding to the voice data;
the voice emotion recognition module is used for carrying out emotion recognition processing on the voice data aiming at the voice data of each frame to obtain expression characteristics corresponding to the voice data;
the face generation module is used for acquiring coding information of face motion for duration and coding information corresponding to the voice data in advance according to expression characteristics corresponding to the voice data aiming at the voice data of each frame to obtain a sequence of face animation corresponding to the voice data;
and the model training module is used for performing time sequence generation confrontation network training based on the real face animation sequence of the calibrated voice in the training database, the constructed face animation loss function, and the face animation sequence corresponding to each frame of voice data output by the face animation model, optimizing parameters of the face animation model, and repeating the model training until the face animation loss function converges to obtain a final face animation model.
10. The apparatus of claim 9, wherein the speech coding module comprises:
the feature extraction unit is used for performing feature processing on the voice data by adopting a time sequence convolution neural network aiming at the voice data of each frame to obtain feature information of the voice data;
the linear interpolation unit is used for carrying out linear interpolation processing on the characteristic information of the voice data to obtain interpolated characteristic information;
and the coding unit is used for coding the characteristic information after interpolation and carrying out linear mapping processing through a linear mapping layer to obtain the coding information of the voice data.
11. The apparatus of claim 9, wherein the speech emotion recognition module comprises:
the emotion feature extraction unit is used for extracting the DeepSpeech of the voice data of each frame to obtain the voice features corresponding to the voice data;
the voice emotion recognition unit is used for carrying out emotion classification on the voice characteristics to obtain emotion probability distribution;
and the emotion expression query unit is used for querying in the expression database based on the emotion probability distribution to obtain the expression characteristics corresponding to the voice data.
12. The apparatus of claim 9, wherein the face generation module comprises:
the decoding unit is used for carrying out face generation on coding information which is obtained in advance and used for face motion in a duration and coding information corresponding to the voice data aiming at the voice data of each frame to obtain face information;
and the expression transfer unit is used for transferring the expression characteristics to the face information to obtain a sequence of the face animation with the emotion characteristics.
13. The apparatus of any of claims 9 to 12, wherein the model training module comprises:
the function construction unit is used for constructing a face animation loss function in the discriminator:
$L = L_{seq} + \lambda \, L_{rec}$

wherein λ controls the contribution of the reconstruction loss function to the total loss function, $L_{seq}$ represents the classification loss function of the discriminator, and $L_{rec}$ represents the face reconstruction loss function term;
The model training unit is used for inputting a sequence of the facial animation corresponding to the voice data output by the facial animation model and a real facial animation sequence of the voice data calibrated in the expression database into a discriminator for classification discrimination aiming at the voice data of each frame to obtain a value of a facial animation loss function;
and the model optimization unit is used for optimizing the parameters of the human face animation model according to the value of the human face animation loss function.
14. An apparatus for processing voice data, comprising:
the voice framing module is used for framing the voice to be processed to obtain voice data of each frame;
the model reasoning module is used for inputting the voice data into a human face animation model for processing aiming at the voice data of each frame to obtain a human face animation sequence corresponding to the voice data;
wherein the human face animation model is trained in advance and is used for generating the human face animation sequence corresponding to the input voice data.
15. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any of claims 1 to 7.
16. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the method of any one of claims 1 to 7.
CN202211703181.5A 2022-12-29 2022-12-29 Training method of human face animation model, and voice data processing method and device Pending CN115984933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211703181.5A CN115984933A (en) 2022-12-29 2022-12-29 Training method of human face animation model, and voice data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211703181.5A CN115984933A (en) 2022-12-29 2022-12-29 Training method of human face animation model, and voice data processing method and device

Publications (1)

Publication Number Publication Date
CN115984933A true CN115984933A (en) 2023-04-18

Family

ID=85973634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211703181.5A Pending CN115984933A (en) 2022-12-29 2022-12-29 Training method of human face animation model, and voice data processing method and device

Country Status (1)

Country Link
CN (1) CN115984933A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188649A (en) * 2023-04-27 2023-05-30 科大讯飞股份有限公司 Three-dimensional face model driving method based on voice and related device
CN116188649B (en) * 2023-04-27 2023-10-13 科大讯飞股份有限公司 Three-dimensional face model driving method based on voice and related device
CN116664731A (en) * 2023-06-21 2023-08-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN116664731B (en) * 2023-06-21 2024-03-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN117540282A (en) * 2024-01-10 2024-02-09 青岛科技大学 High-precision prediction method for shelf life of aquatic product in variable temperature environment
CN117540282B (en) * 2024-01-10 2024-03-22 青岛科技大学 High-precision prediction method for shelf life of aquatic product in variable temperature environment

Similar Documents

Publication Publication Date Title
Klushyn et al. Learning hierarchical priors in vaes
CN115984933A (en) Training method of human face animation model, and voice data processing method and device
CN110164476B (en) BLSTM voice emotion recognition method based on multi-output feature fusion
CN111897933B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
WO2020251681A1 (en) Robustness against manipulations in machine learning
CN114610935B (en) Method and system for synthesizing semantic image of text control image style
CN113901894A (en) Video generation method, device, server and storage medium
CN112331183B (en) Non-parallel corpus voice conversion method and system based on autoregressive network
CN113792177B (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN111966800A (en) Emotional dialogue generation method and device and emotional dialogue model training method and device
CN115457169A (en) Voice-driven human face animation generation method and system
CN115330912A (en) Training method for generating face speaking video based on audio and image driving
CN113704419A (en) Conversation processing method and device
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN116311483A (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
CN115858726A (en) Multi-stage multi-modal emotion analysis method based on mutual information method representation
CN116484217A (en) Intelligent decision method and system based on multi-mode pre-training large model
CN116129013A (en) Method, device and storage medium for generating virtual person animation video
Shankar et al. Multi-speaker emotion conversion via latent variable regularization and a chained encoder-decoder-predictor network
Huang et al. Fine-grained talking face generation with video reinterpretation
CN113160032A (en) Unsupervised multi-mode image conversion method based on generation countermeasure network
CN117238019A (en) Video facial expression category identification method and system based on space-time relative transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination