CN115100329A - Multi-mode driving-based emotion controllable facial animation generation method - Google Patents


Info

Publication number
CN115100329A
Authority
CN
China
Prior art keywords
coordinate
emotion
facial
face
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210744504.9A
Other languages
Chinese (zh)
Other versions
CN115100329B (en)
Inventor
李瑶
赵子康
李峰
郭浩
杨艳丽
程忱
曹锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202210744504.9A priority Critical patent/CN115100329B/en
Publication of CN115100329A publication Critical patent/CN115100329A/en
Application granted granted Critical
Publication of CN115100329B publication Critical patent/CN115100329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to image processing technology, and in particular to a multi-modally driven, emotion-controllable facial animation generation method. Step S1: preprocess the images of a portrait video to obtain a sequence of 3D facial feature coordinates. Step S2: preprocess the audio of the portrait video and decouple it into an audio content vector and an audio style vector. Step S3: train a facial lip-sync coordinate animation generation network, composed of a multilayer perceptron and a long short-term memory network, on the 3D facial feature coordinate sequence and the audio content vector. The invention introduces an emotion portrait as the emotion source and reshapes the emotion of the target portrait through the joint driving of the emotion-source portrait and the audio, providing diverse emotional facial animation. Under multi-modal driving, the method avoids the low robustness of a single audio driving source, removes the dependence of emotion generation on emotional speech recognition, strengthens the complementarity between data, and achieves more realistic emotional expression in facial animation.

Description

Multi-mode driving-based emotion controllable facial animation generation method
Technical Field
The invention relates to image processing technology, and in particular to an emotion-controllable facial animation generation method based on multi-modal driving.
Background
Facial animation generation is a popular research area among generative models in computer vision. Its purpose is to transform a still portrait into a realistic facial animation driven by arbitrary audio. It has broad application prospects in assisted therapy systems for the hearing and speech impaired, virtual anchors, games with customizable characters, and related fields. However, owing to limitations in their principles and characteristics, existing facial animation generation methods still produce portrait animations whose emotional expression is immature, which seriously reduces their application value.
In recent years, many studies in facial animation generation have achieved realistic lip movement and head-pose swing, but have neglected an important factor: portrait emotion. Emotional information in the portrait has a strong influence on the expressiveness of the synthesized facial animation; different facial expressions often give the same sentence different emotional color, and perceiving emotional information in the visual modality is one of the important channels of human audiovisual speech communication. Most facial animation generation methods, however, use audio as a single-modality driving source. This works well for the lip movements of syllables but relatively poorly for facial expressions, because direct audio driving is affected by the complexity of audio emotion and by noise, so the generated expressions often exhibit ghosting and distortion, leading to poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are then constrained by the accuracy of emotional speech recognition, so their efficiency is too low and the emotion of the generated facial video lacks diversity and naturalness.
Disclosure of Invention
The invention provides an emotion-controllable facial animation generation method based on multi-modal driving, aiming to solve the problem that existing facial animation generation methods lack the ability to regulate emotion.
The invention is realized by adopting the following technical scheme:
the method for generating the emotion controllable facial animation based on multi-mode driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used to obtain a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
Step S3: based on the face coordinate sequence obtained in step S1 and the audio content vector obtained in step S2, a face lip sound coordinate animation generation network composed of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network is trained.
Step S4: based on the face coordinate sequence obtained in step S1 and the audio content vector and the audio style vector obtained in step S2, a face emotion coordinate animation generation network composed of MLP, LSTM, Self-attention mechanism (Self-attention) and generation countermeasure network (GAN) is trained.
Step S5: based on the face coordinate sequence obtained in step S1, a coordinate-to-video network composed of GANs is trained.
Step S6: based on the facial lip tone coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, any two portrait pictures (one representing an identity source and one representing an emotion source) and any one section of audio are input, and a lip tone synchronous video of a target portrait with emotion corresponding to the emotion source is generated.
The method for generating emotion-controllable facial animation based on multi-modal driving is supported by computer vision generative models and deep neural network models, which together realize the emotion-controllable facial animation generation network described here.
The beneficial effects of the invention are as follows: compared with existing facial animation generation methods, the invention addresses the facial expression ghosting and distortion caused by relying on audio features alone and by low emotional speech recognition accuracy. It introduces an emotion portrait as the emotion source, reshapes the emotion of the target portrait through the multi-modal driving of emotion-source portrait features and audio features, and generates facial animation with controllable emotion. The dual driving by the emotion image and the audio removes the dependence of emotion generation on speech information alone, so the generated video has controllable emotion while satisfying lip synchronization and spontaneous head swing; that is, the diversity and naturalness of the facial animation are guaranteed and more realistic emotional expression is achieved.
The method effectively resolves the low efficiency of existing facial animation generation methods, whose facial expressions are limited by the accuracy of speech emotion recognition, and can be used in assisted therapy systems for the hearing and speech impaired, virtual anchors, games with customizable characters, and other fields.
Drawings
FIG. 1 is a schematic diagram of a multi-modal driven emotion controllable facial animation generation structure according to an embodiment of the invention.
Fig. 2 is a schematic diagram comparing the present invention with a conventional facial animation method.
Fig. 3 is a sample video schematic of an embodiment of the invention.
Detailed Description
In this embodiment, the portrait video data set used is derived from a public Multi-view Emotional Audio-visual data set (MEAD).
As shown in FIG. 1, the method for generating emotion controllable facial animation based on multi-modal driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used to obtain a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
Step S3: based on the face coordinate sequence obtained in step S1 and the audio content vector obtained in step S2, a face lip sound coordinate animation generation network composed of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network is trained.
Step S4: based on the face coordinate sequence obtained in step S1 and the audio content vector and style vector obtained in step S2, a face emotion coordinate animation generation network composed of MLP, LSTM, Self-attention mechanism (Self-attention) and generation countermeasure network (GAN) is trained.
Step S5: based on the face coordinate sequence obtained in step S1, a coordinate-to-video network composed of GANs is trained. During this step of training, a loss function is used to calculate the minimum distance in pixels between the reconstructed face and the training target face.
Step S6: based on the facial lip tone coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, any two portrait pictures (one representing an identity source and one representing an emotion source) and any one section of audio are input, and a lip tone synchronous video of a target portrait with emotion corresponding to the emotion source is generated.
In step S1, the images of the portrait video are preprocessed; the preprocessing comprises frame-rate conversion, image resampling and facial coordinate extraction.
First, the video frame rate is converted to 62.5 frames per second. The frames are then resampled and cropped to 256 × 256 video containing the face. Finally, facial coordinates are extracted with the face-alignment face recognition algorithm, and the 3D coordinates of the face in each frame (of dimension 68 × 3) are obtained to form the sequence of 3D facial feature coordinates.
In addition, the 3D facial feature coordinate sequences are saved as an emotion-source portrait coordinate sequence (the emotion-source facial coordinates) and an identity-source portrait coordinate sequence (the identity-source facial coordinates). Compared with the pixels of the portrait, facial coordinates provide a natural low-dimensional representation of the portrait and a high-quality bridge for the downstream emotion reenactment task.
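For illustration, the step S1 preprocessing can be sketched as follows. This is a minimal example assuming the ffmpeg command-line tool and the Python face_alignment and OpenCV libraries are available; the function name, file paths and the exact FaceAlignment constructor arguments are placeholders (the landmark-type enum differs between library versions), not the patent's reference implementation.

```python
import subprocess
import cv2
import face_alignment
import numpy as np

def extract_coordinate_sequence(video_path: str, out_path: str = "tmp_62fps.mp4") -> np.ndarray:
    # Frame-rate conversion to 62.5 fps and resizing to 256 x 256 with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-r", "62.5", "-vf", "scale=256:256", out_path],
        check=True,
    )

    # 3D facial landmark detector: 68 x 3 coordinates per frame.
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.THREE_D, device="cpu")

    coords = []
    cap = cv2.VideoCapture(out_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        landmarks = fa.get_landmarks(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if landmarks:                       # keep the first detected face
            coords.append(landmarks[0])     # shape (68, 3)
    cap.release()
    return np.stack(coords)                 # shape (T, 68, 3)

# The same routine is run on the identity-source video and on the emotion-source
# video to obtain the identity-source and emotion-source coordinate sequences.
```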
In step S2, the audio of the portrait video is preprocessed; the preprocessing comprises sampling-rate conversion, audio vector extraction and audio vector decoupling.
The audio is first converted to a sampling rate of 16,000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group). Audio vector extraction is then performed, and the audio vector is obtained with the Python Resemblyzer library. Finally, the audio vector is input to the voice conversion model AutoVC, which decouples it into an audio content vector independent of the speaker and an audio style vector specific to the speaker.
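A minimal sketch of this preprocessing is shown below, assuming FFmpeg and the Resemblyzer library. Because AutoVC is a research model without a standard package interface, its content encoder is passed in as a callable and that call signature is an assumption.

```python
import subprocess
from resemblyzer import VoiceEncoder, preprocess_wav

def preprocess_audio(audio_path: str, content_encoder, wav_path: str = "tmp_16k.wav"):
    # Resample the audio to 16,000 Hz mono with FFmpeg.
    subprocess.run(
        ["ffmpeg", "-y", "-i", audio_path, "-ar", "16000", "-ac", "1", wav_path],
        check=True,
    )

    # Speaker-specific style vector: a 256-dim d-vector from Resemblyzer.
    wav = preprocess_wav(wav_path)
    style_vector = VoiceEncoder().embed_utterance(wav)      # shape (256,)

    # Speaker-independent content vectors, decoupled by an AutoVC-style content
    # encoder supplied by the caller (hypothetical signature).
    content_vectors = content_encoder(wav, style_vector)     # e.g. shape (T, C)
    return content_vectors, style_vector
```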
In step S3, training of the facial lip-sync coordinate animation generation network is completed.
The network adopts a custom encoder-decoder structure: the encoder comprises a facial coordinate encoder consisting of a two-layer MLP and a speech content encoder consisting of a three-layer LSTM, and the decoder is a facial lip-sync coordinate decoder consisting of a three-layer MLP. To generate an optimal sequence of facial lip-sync coordinate offsets, the network sets a loss function that continuously adjusts its weights and biases until the error between the predicted coordinates and the reference coordinates is minimized.
The custom encoder-decoder network structure is as follows:
First, a two-layer MLP extracts the identity feature from the facial 3D feature coordinates of the first frame of the video obtained in step S1 (i.e. the first time point of the 3D facial feature coordinate sequence). Then, based on this identity feature and the audio content vector obtained in step S2, the two are linearly fused and a three-layer LSTM extracts the coordinate dependency between consecutive audio syllables and the lips. Finally, based on the encoder output, a decoder consisting of a three-layer MLP predicts the facial lip-sync coordinate offset sequence; the specific formula is:
ΔP_t = MLP_c(LSTM_c(Ec_{t→t+λ}, MLP_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c})    (1)
In formula (1), ΔP_t denotes the predicted facial lip-sync coordinate offset of frame t, where t is the current frame of the portrait video; MLP_L denotes the facial coordinate encoder, L the facial coordinates of the first frame of the portrait video, and W_{mlp,l} the learnable parameters of the facial coordinate encoder; LSTM_c denotes the speech content encoder, Ec the audio content vector, t→t+λ indicates that, for each frame t, the audio content vectors are fed to the speech content encoder in windows of λ = 18, and W_{lstm} the learnable parameters of the speech content encoder; MLP_c denotes the facial lip-sync coordinate decoder and W_{mlp,c} its learnable parameters.
The coordinates of the first frame of the portrait video are corrected by the predicted facial lip-sync coordinate offset sequence to obtain the lip-synchronized coordinate sequence; the specific formula is:
P_t = L + ΔP_t    (2)
In formula (2), P_t denotes the lip-synchronized facial coordinates of frame t, where t is the current frame of the portrait video; L is the facial coordinates of the first frame of the portrait video, and ΔP_t is the predicted facial lip-sync coordinate offset of frame t.
To generate an optimal facial lip-sync coordinate offset sequence, a loss function is set on the encoder-decoder structure of the facial lip-sync coordinate animation generation network to adjust the weights and biases of the network. The objective of the loss function is to minimize the error between the predicted coordinates and the coordinates obtained in step S1; the specific formula is:
L_lip = (1/(T·N)) Σ_{t=1}^{T} Σ_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖₂²    (3)
In formula (3), L_lip denotes the loss function of the facial lip-sync coordinate animation generation network, T the total number of frames of the video, t the current frame of the portrait video, N = 68 the total number of facial coordinates, and i the index of the current facial coordinate; P_{i,t} denotes the i-th predicted facial coordinate of frame t, P̂_{i,t} the i-th facial coordinate of frame t obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖₂² the squared Euclidean norm of their difference.
When the loss function levels off, i.e. L_lip reaches its minimum, training of the facial lip-sync coordinate animation generation network is complete.
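The following PyTorch sketch illustrates one plausible reading of formulas (1)-(3); the layer widths, the handling of the λ = 18 window and the training details are assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class LipSyncCoordinateNet(nn.Module):
    def __init__(self, content_dim=256, hidden=256, n_points=68 * 3):
        super().__init__()
        # Facial coordinate encoder: two-layer MLP on the first-frame coordinates.
        self.face_enc = nn.Sequential(
            nn.Linear(n_points, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # Speech content encoder: three-layer LSTM over the fused content window.
        self.content_enc = nn.LSTM(content_dim + hidden, hidden,
                                   num_layers=3, batch_first=True)
        # Facial lip-sync coordinate decoder: three-layer MLP.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_points))

    def forward(self, first_frame_coords, content_window):
        # first_frame_coords: (B, 68*3); content_window: (B, lambda, content_dim)
        identity = self.face_enc(first_frame_coords)                       # MLP_L(L)
        identity_seq = identity.unsqueeze(1).expand(-1, content_window.size(1), -1)
        fused = torch.cat([content_window, identity_seq], dim=-1)          # linear fusion
        out, _ = self.content_enc(fused)                                   # LSTM_c
        delta_p = self.decoder(out[:, -1])                                 # formula (1)
        return first_frame_coords + delta_p                                # formula (2)

# Training objective, formula (3): mean squared error between predicted and
# reference coordinates, averaged over the frames and the 68 landmarks.
loss_fn = nn.MSELoss()
```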
In step S4, training of the facial emotion coordinate animation generation network is completed, adding rich visual emotional expression to the generated video.
Humans rely on visual information when interpreting emotion, and rich visual emotional expression gives a stronger sense of realism and greater practical value. Most existing facial animation generation algorithms are devoted to expressing the lip movement and head-pose swing of the facial animation from the audio modality alone. Audio single-modality driving works well for the lip movements of syllables but relatively poorly for facial expressions, because direct audio driving is affected by the complexity of audio emotion and by noise, so the generated expressions often exhibit ghosting and distortion, leading to poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are then constrained by the accuracy of emotional speech recognition, so their efficiency is too low and the emotion of the generated facial video lacks diversity and naturalness.
This patent therefore provides a multi-modally driven facial emotion coordinate animation generation network: an emotion portrait is introduced as the emotion source and is combined with the audio features for multi-modal driving, so that the emotion of the target portrait is reshaped more accurately.
The network is a custom encoder-decoder structure; the encoder comprises an audio encoder and facial coordinate encoders, and the decoder comprises a coordinate decoder. The encoder obtains the audio features, the portrait identity features and the portrait emotion features. The decoder processes the multi-modal features and, driven jointly by the audio features and the portrait emotion features, generates a coordinate offset sequence that reshapes the emotion of the target portrait, thereby adding rich visual emotional expression to the video. Under this multi-modal driving, the method avoids the low robustness of a single audio driving source, removes the dependence of emotion generation on emotional speech recognition, strengthens the complementarity between data and achieves more realistic emotional expression in the facial animation.
To generate an optimal facial emotion coordinate offset sequence, three different loss functions are set on the encoder-decoder structure of the facial emotion coordinate animation generation network to adjust its weights and biases. The first calculates the distance between the predicted 3D facial feature coordinate sequence and the 3D facial feature coordinate sequence obtained in step S1. The second and third are discriminator loss functions, used to judge whether the generated facial coordinates are real and to assess the similarity between interval frames of the facial coordinates.
The custom encoder-decoder structure of the facial emotion coordinate animation generation network is as follows:
the encoder consists of an audio encoder, an identity source face coordinate encoder and an emotion source face coordinate encoder. The audio encoder captures audio features through a three-layered LSTM, a three-layered MLP, and a self-attention mechanism.
Specifically, firstly, the LSTM is used to extract the features of the audio content vector obtained in step S2; then, using MLP to extract the features of the audio style vector obtained in step S2; then, linear fusion is carried out on the audio content vector characteristics and the audio style vector characteristics; and finally, capturing a longer-time structural dependency relationship between the audio content vector and the audio style vector by using a self-attention mechanism to obtain an audio feature with stronger time dependency, wherein a specific calculation formula is as follows:
S_t = Attn(LSTM_{c'}(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})    (4)
In formula (4), S_t denotes the processed audio feature of frame t, where t is the current frame of the portrait video; MLP_s denotes the audio style vector encoder, Es the audio style vector, and W_{mlp,s} the learnable parameters of the audio style vector encoder; LSTM_{c'} denotes the audio content vector encoder, Ec the audio content vector, t→t+λ indicates that, for each frame t, the audio content vectors are fed to the audio content vector encoder in windows of λ = 18, and W'_{lstm} the learnable parameters of the audio content vector encoder; Attn denotes the self-attention mechanism and W_{attn} its learnable parameters.
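A possible PyTorch rendering of formula (4) is sketched below; the hidden sizes, the number of attention heads and the use of an element-wise sum for the linear fusion are assumptions (concatenation followed by a projection would be equally plausible).

```python
import torch
import torch.nn as nn

class EmotionAudioEncoder(nn.Module):
    def __init__(self, content_dim=256, style_dim=256, hidden=256, heads=4):
        super().__init__()
        self.content_lstm = nn.LSTM(content_dim, hidden, num_layers=3,
                                    batch_first=True)            # LSTM_c'
        self.style_mlp = nn.Sequential(                           # MLP_s
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, content_window, style_vector):
        # content_window: (B, lambda, content_dim); style_vector: (B, style_dim)
        c, _ = self.content_lstm(content_window)
        s = self.style_mlp(style_vector).unsqueeze(1).expand(-1, c.size(1), -1)
        fused = c + s                                              # linear fusion
        attended, _ = self.attn(fused, fused, fused)               # Attn(...)
        return attended[:, -1]                                     # S_t for frame t
```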
The two facial coordinate encoders are both lightweight neural networks consisting of seven-layer MLPs. They are similar in structure but different in function: one extracts the geometric information of identity and the other extracts the geometric information of facial emotion.
Based on the two different facial coordinate sets obtained in step S1 (one regarded as the identity-source facial coordinate sequence and the other as the emotion-source facial coordinate sequence), the portrait identity features of the identity source are first extracted by an identity-source facial coordinate encoder consisting of a seven-layer MLP; the portrait emotion features of the emotion source are then extracted by an emotion-source facial coordinate encoder consisting of a seven-layer MLP; finally, the portrait identity features, the portrait emotion features and the audio features obtained by formula (4) are linearly fused to obtain the fused feature; the specific formula is:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)    (5)
In formula (5), F_t denotes the fused feature of frame t after linear fusion and concat denotes linear fusion; MLP_{LA} denotes the identity-source facial coordinate encoder, L_a the facial coordinates of the first frame of the identity-source portrait video, and W_{mlp,la} the learnable parameters of the identity-source facial coordinate encoder; MLP_{LB} denotes the emotion-source facial coordinate encoder, L_b the facial coordinates of the first frame of the emotion-source portrait video, and W_{mlp,lb} the learnable parameters of the emotion-source facial coordinate encoder; S_t is the frame-t audio feature of formula (4).
Based on the fused feature of the portrait identity features, the portrait emotion features and the audio features obtained by formula (5), a coordinate decoder consisting of a three-layer MLP predicts the facial emotion coordinate offset sequence; the specific formula is:
ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})    (6)
In formula (6), ΔQ_t denotes the predicted facial emotion coordinate offset of frame t, where t is the current frame of the portrait video; MLP_{LD} denotes the decoder of the facial emotion coordinate animation generation network, F_t the fused feature of frame t from formula (5), and W_{mlp,ld} the learnable parameters of the decoder.
The coordinates of the first frame of the identity-source portrait video are corrected by the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence; the specific formula is:
Q_t = L_a + ΔQ_t    (7)
In formula (7), Q_t denotes the facial emotion coordinates of frame t, where t is the current frame of the portrait video; L_a is the facial coordinates of the first frame of the identity-source portrait video, and ΔQ_t is the predicted facial emotion coordinate offset of frame t.
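The encoders and decoder of formulas (5)-(7) could be sketched as follows; the layer widths are assumptions, and the audio feature is taken from the EmotionAudioEncoder sketched above.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, layers):
    # Helper that builds an MLP with the requested number of linear layers.
    blocks, d = [], in_dim
    for _ in range(layers - 1):
        blocks += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    blocks.append(nn.Linear(d, out_dim))
    return nn.Sequential(*blocks)

class EmotionCoordinateGenerator(nn.Module):
    def __init__(self, n_points=68 * 3, hidden=256, audio_dim=256):
        super().__init__()
        self.identity_enc = mlp(n_points, hidden, hidden, 7)    # MLP_LA, 7 layers
        self.emotion_enc = mlp(n_points, hidden, hidden, 7)     # MLP_LB, 7 layers
        self.decoder = mlp(hidden * 2 + audio_dim, hidden, n_points, 3)  # MLP_LD

    def forward(self, identity_coords, emotion_coords, audio_feature):
        # identity_coords, emotion_coords: (B, 68*3); audio_feature: (B, audio_dim)
        f_t = torch.cat([self.identity_enc(identity_coords),
                         self.emotion_enc(emotion_coords),
                         audio_feature], dim=-1)                 # formula (5)
        delta_q = self.decoder(f_t)                              # formula (6)
        return identity_coords + delta_q                         # formula (7)
```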
To generate an optimal facial emotion coordinate offset sequence, three different loss functions are set on the encoder-decoder structure of the facial emotion coordinate animation generation network to adjust its weights and biases; the specific formula is:
L_total = λ_1·L_emo + λ_2·L_{D_L} + λ_3·L_{D_T}    (8)
In formula (8), L_total denotes the total loss of the facial emotion coordinate animation generation network, L_emo the coordinate loss of the facial emotion coordinate animation generation network, L_{D_L} the loss of the facial coordinate discriminator D_L, and L_{D_T} the loss of the facial coordinate interval-frame similarity discriminator D_T; λ_1, λ_2 and λ_3 are weight parameters.
The facial coordinate loss calculates the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1 (an identity-source coordinate sequence whose emotion is the same as that of the emotion source); the specific formula is:
L_emo = (1/(T·N)) Σ_{t=1}^{T} Σ_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖₂²    (9)
In formula (9), L_emo denotes the coordinate loss of the facial emotion coordinate animation generation network, T the total number of frames of the video, t the current frame of the portrait video, N = 68 the total number of facial coordinates, and i the index of the current facial coordinate; Q_{i,t} denotes the i-th predicted facial coordinate of frame t, Q̂_{i,t} the i-th facial coordinate of frame t obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖₂² the squared Euclidean norm of their difference.
During training of the facial emotion coordinate animation generation network, the discriminator loss L_{D_L} is used to judge whether the generated facial coordinates are real or fake, and the discriminator loss L_{D_T} is used to estimate the similarity between interval frames of the facial coordinates; the formulas are as follows:
L_{D_L} = E_t[ log D_L(Q̂_t) + log(1 − D_L(Q_t)) ]    (10)
L_{D_T} = E_t[ log D_T(Q̂_t, Q̂_{t−1}) + log(1 − D_T(Q_t, Q̂_{t−1})) ]    (11)
In formulas (10) and (11), t denotes the current frame of the portrait video; D_L denotes the discriminator that judges whether the facial coordinates are real or fake and L_{D_L} its loss function; D_T denotes the facial coordinate interval-frame similarity discriminator and L_{D_T} its loss function; Q_t denotes the predicted facial emotion coordinates of frame t, Q̂_t the facial coordinates of frame t obtained in step S1, and Q̂_{t−1} the facial coordinates of the preceding frame.
When the loss functions level off, training of the facial emotion coordinate animation generation network is complete.
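One plausible way to combine the three losses of formula (8) on the generator side is sketched below; the discriminator architectures, the binary cross-entropy formulation and the weight values are assumptions rather than the patent's specification.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
lambda1, lambda2, lambda3 = 1.0, 0.01, 0.01       # assumed loss weights

def generator_loss(pred_q, ref_q, prev_ref_q, d_l, d_t):
    # pred_q, ref_q, prev_ref_q: (B, 68*3) predicted / reference / previous-frame coords
    rec = ((pred_q - ref_q) ** 2).mean()                        # formula (9)
    logit_l = d_l(pred_q)                                       # realism of coordinates
    logit_t = d_t(torch.cat([pred_q, prev_ref_q], dim=-1))      # interval-frame pair
    adv_l = bce(logit_l, torch.ones_like(logit_l))              # try to fool D_L
    adv_t = bce(logit_t, torch.ones_like(logit_t))              # try to fool D_T
    return lambda1 * rec + lambda2 * adv_l + lambda3 * adv_t    # formula (8)
```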
In step S5, training of the coordinate-to-video network is completed.
Based on the facial coordinate sequence obtained in step S1, the discrete coordinates are connected in index order and rendered as colored line segments to create a three-channel face sketch sequence of size 256 × 256. This sequence is channel-concatenated with the original picture of the first frame of the corresponding video to create a six-channel picture sequence of size 256 × 256. Taking this sequence as input, the coordinate-to-video network generates the reconstructed face video.
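The sketch rendering and channel concatenation can be illustrated as follows; the landmark contour grouping and the assumption that the coordinates are already in pixel units are illustrative choices, not taken from the patent.

```python
import cv2
import numpy as np

# (start, end) index ranges of a common 68-landmark convention: jaw, brows,
# nose bridge, nostrils, eyes, outer lips, inner lips.
CONTOURS = [(0, 17), (17, 22), (22, 27), (27, 31), (31, 36),
            (36, 42), (42, 48), (48, 60), (60, 68)]

def render_sketch(coords_3d: np.ndarray, size: int = 256) -> np.ndarray:
    # coords_3d: (68, 3); only the x, y components are drawn.
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    pts = coords_3d[:, :2].astype(np.int32)
    for start, end in CONTOURS:
        for i in range(start, end - 1):
            p1 = tuple(int(v) for v in pts[i])
            p2 = tuple(int(v) for v in pts[i + 1])
            cv2.line(canvas, p1, p2, (0, 255, 0), 1)
    return canvas

def make_network_input(coords_3d: np.ndarray, first_frame: np.ndarray) -> np.ndarray:
    # Channel-concatenate the sketch with the first original frame: (256, 256, 6).
    return np.concatenate([render_sketch(coords_3d), first_frame], axis=2)
```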
To generate an optimal face video, an L1 (L1-norm) loss function is set on the image translation network to adjust its weights and biases. The goal of this loss is to minimize the pixel distance between the reconstructed face video and the training target face video.
In step S6, based on the facial lip-sync coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in steps S3, S4 and S5, any two portrait pictures (one representing the identity source and the other the emotion source) and any segment of audio are input to generate the target video.
The face-alignment face recognition algorithm is used to obtain the corresponding identity-source and emotion-source portrait coordinates, and the voice conversion method is used to obtain the audio content vector and audio style vector of the audio. The audio content vector and the identity-source coordinates are passed through the facial lip-sync coordinate animation generation network obtained in step S3 to generate the lip-synchronized facial coordinate offset sequence. The audio content vector, the audio style vector, the identity-source coordinates and the emotion-source coordinates are passed through the facial emotion coordinate animation generation network obtained in step S4 to generate the facial emotion coordinate offset sequence. The identity-source coordinates are corrected by the two offset sequences to obtain the final coordinate sequence, which is input to the coordinate-to-video network obtained in step S5 to generate a lip-synchronized video of the target portrait carrying the emotion of the emotion source.
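An end-to-end inference pass corresponding to step S6 might look like the following sketch, built on the networks and helpers sketched above; every function name, the window-length handling and the way the two offsets are combined are illustrative assumptions.

```python
def generate_emotional_talking_face(identity_coords, emotion_coords,
                                    content_vectors, style_vector,
                                    lip_net, emo_net, audio_enc, coord2video):
    # identity_coords / emotion_coords: first-frame coordinates of the two portraits.
    frames = []
    for t in range(len(content_vectors)):
        window = content_vectors[t:t + 18]                        # lambda = 18 window
        p_t = lip_net(identity_coords, window)                    # lip-synced coordinates
        s_t = audio_enc(window, style_vector)                     # fused audio feature S_t
        q_t = emo_net(identity_coords, emotion_coords, s_t)       # emotion coordinates
        # Both predicted offsets correct the identity-source coordinates.
        final_coords = identity_coords + (p_t - identity_coords) + (q_t - identity_coords)
        frames.append(coord2video(final_coords))                  # render one frame
    return frames
```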
The multi-modally driven, emotion-controllable facial animation generation method is realized through a voice conversion method, multilayer perceptrons, long short-term memory networks, a self-attention mechanism and generative adversarial networks. As shown in FIGS. 2-3, the invention can generate videos with different emotions by adjusting the emotion-source portrait, which gives it high application value and overcomes the lack of emotion and poor robustness of existing facial animation generation methods.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. The method for generating emotion-controllable facial animation based on multi-modal driving is characterized by comprising the following steps:
step S1: preprocessing the images of a portrait video, and extracting a sequence of 3D facial feature coordinates from the preprocessed images with a face recognition algorithm;
step S2: preprocessing the audio of the portrait video, and then decoupling the preprocessed audio with a voice conversion method into an audio content vector independent of the speaker and an audio style vector specific to the speaker;
step S3: training a facial lip-sync coordinate animation generation network composed of a multilayer perceptron and a long short-term memory network, based on the 3D facial feature coordinate sequence and the audio content vector;
step S4: training a facial emotion coordinate animation generation network composed of a multilayer perceptron, a long short-term memory network, a self-attention mechanism and a generative adversarial network, based on the 3D facial feature coordinate sequence, the audio content vector and the audio style vector;
step S5: training a coordinate-to-video network composed of generative adversarial networks, based on the 3D facial feature coordinate sequence;
step S6: based on the trained facial lip-sync coordinate animation generation network, facial emotion coordinate animation generation network and coordinate-to-video network, inputting any two portrait pictures, one representing an identity source and the other an emotion source, and any segment of audio, and generating a lip-synchronized video of the target portrait carrying the emotion of the emotion source.
2. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein step S1 specifically comprises:
first, performing frame-rate conversion on the video, converting it to 62.5 frames per second;
then, resampling the images and cropping them to 256 × 256 video containing the face;
extracting facial coordinates with the face recognition algorithm, obtaining the 3D coordinates of the face in each frame with dimension 68 × 3, and forming the sequence of 3D facial feature coordinates;
and saving the 3D facial feature coordinate sequences as an emotion-source portrait coordinate sequence and an identity-source portrait coordinate sequence, namely the emotion-source facial coordinates and the identity-source facial coordinates.
3. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein step S2 specifically comprises:
performing sampling-rate conversion on the audio, converting the sampling rate to 16,000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group);
then performing audio vector extraction, obtaining the audio vector with the Python Resemblyzer library;
and finally inputting the audio vector into the voice conversion model AutoVC to obtain the decoupled audio content vector independent of the speaker and the audio style vector specific to the speaker.
4. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S3 the facial lip-sync coordinate animation generation network adopts a custom encoder-decoder structure, the encoder comprising a facial coordinate encoder consisting of a two-layer MLP and a speech content encoder consisting of a three-layer LSTM, and the decoder being a facial lip-sync coordinate decoder consisting of a three-layer MLP; the facial lip-sync coordinate animation generation network is provided with a loss function for continuously adjusting the weights and biases of the network until the error between the predicted coordinates and the reference coordinates is minimized.
5. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 4, wherein in step S3 the training process of the facial lip-sync coordinate animation generation network is as follows:
first, a two-layer MLP extracts the identity feature from the facial 3D feature coordinates of the first frame of the video obtained in step S1, i.e. the identity feature at the first time point of the 3D facial feature coordinate sequence;
then, based on this identity feature and the audio content vector obtained in step S2, the two are linearly fused and a three-layer LSTM extracts the coordinate dependency between consecutive audio syllables and the lips;
then, based on the encoder output, a decoder consisting of a three-layer MLP predicts the facial lip-sync coordinate offset sequence; the specific formula is:
ΔP_t = MLP_c(LSTM_c(Ec_{t→t+λ}, MLP_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c})
where ΔP_t denotes the predicted facial lip-sync coordinate offset of frame t, t being the current frame of the portrait video; MLP_L denotes the facial coordinate encoder, L the facial coordinates of the first frame of the portrait video, and W_{mlp,l} the learnable parameters of the facial coordinate encoder; LSTM_c denotes the speech content encoder, Ec the audio content vector, t→t+λ indicates that, for each frame, the audio content vectors are fed to the speech content encoder in windows of λ = 18, and W_{lstm} the learnable parameters of the speech content encoder; MLP_c denotes the facial lip-sync coordinate decoder and W_{mlp,c} its learnable parameters;
the coordinates of the first frame of the portrait video are corrected by the predicted facial lip-sync coordinate offset sequence to obtain the lip-synchronized coordinate sequence; the specific formula is:
P_t = L + ΔP_t
where P_t denotes the lip-synchronized facial coordinates of frame t, t being the current frame of the portrait video; L is the facial coordinates of the first frame of the portrait video, and ΔP_t is the predicted facial lip-sync coordinate offset of frame t;
to generate an optimal facial lip-sync coordinate offset sequence, a loss function is set on the encoder-decoder structure of the facial lip-sync coordinate animation generation network to adjust the weights and biases of the network; the specific formula of the loss function is:
L_lip = (1/(T·N)) Σ_{t=1}^{T} Σ_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖₂²
where L_lip denotes the loss function of the facial lip-sync coordinate animation generation network, T the total number of frames of the video, t the current frame of the portrait video, N = 68 the total number of facial coordinates, and i the index of the current facial coordinate; P_{i,t} denotes the i-th predicted coordinate of frame t, P̂_{i,t} the i-th coordinate of frame t obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖₂² the squared Euclidean norm of their difference;
when the loss function levels off, i.e. L_lip reaches its minimum, training of the facial lip-sync coordinate animation generation network is complete.
6. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S4 the facial emotion coordinate animation generation network adopts a custom encoder-decoder structure:
the encoder comprises an audio encoder and facial coordinate encoders, the facial coordinate encoders comprising an identity-source facial coordinate encoder and an emotion-source facial coordinate encoder, and the audio encoder capturing audio features through a three-layer LSTM, a three-layer MLP and a self-attention mechanism;
the decoder comprises a coordinate decoder;
the encoder is used to obtain the audio features, portrait identity features and portrait emotion features, and the decoder is used to process the multi-modal features and, driven jointly by them, to generate a coordinate offset sequence that reshapes the emotion of the target portrait;
the facial emotion coordinate animation generation network is provided with three different loss functions that adjust the weights and biases of the network, the first of which calculates the distance between the predicted 3D facial feature coordinate sequence and the 3D facial feature coordinate sequence obtained in step S1, while the second and third are discriminator loss functions used to judge whether the generated facial coordinates are real and to assess the similarity between interval frames of the facial coordinates.
7. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 6, wherein in step S4 the training process of the facial emotion coordinate animation generation network is as follows:
firstly, LSTM is used for extracting the characteristics of the audio content vector obtained in the step S2;
then, using MLP to extract the features of the audio style vector obtained in step S2;
then, carrying out linear fusion on the audio content vector characteristics and the audio style vector characteristics;
and finally, capturing a longer-time structural dependency relationship between the audio content vector and the audio style vector by using a self-attention mechanism to obtain an audio feature with stronger time dependency, wherein a specific calculation formula is as follows:
S_t = Attn(LSTM_{c'}(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})
where S_t denotes the processed audio feature of frame t, t being the current frame of the portrait video; MLP_s denotes the audio style vector encoder, Es the audio style vector, and W_{mlp,s} the learnable parameters of the audio style vector encoder; LSTM_{c'} denotes the audio content vector encoder, Ec the audio content vector, t→t+λ indicates that, for each frame t, the audio content vectors are fed to the audio content vector encoder in windows of λ = 18, and W'_{lstm} the learnable parameters of the audio content vector encoder; Attn denotes the self-attention mechanism and W_{attn} its learnable parameters;
the two facial coordinate encoders are both light neural networks consisting of seven layers of MLPs, wherein one is used for extracting geometric information of identity, and the other is used for extracting geometric information of facial emotion;
based on the two different face coordinates obtained in step S1, one is regarded as an identity source face coordinate sequence and the other is regarded as an emotion source face coordinate sequence, firstly, portrait identity features of an identity source are extracted by using an identity source face coordinate encoder composed of seven layers of MLPs; then, extracting portrait emotional characteristics of an emotion source by using an emotion source face coordinate encoder consisting of seven layers of MLPs; and finally, performing linear fusion on the portrait identity characteristic, the portrait emotion characteristic and the obtained audio characteristic to obtain a fusion characteristic, wherein the specific calculation formula is as follows:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)
where F_t denotes the fused feature of frame t after linear fusion and concat denotes linear fusion; MLP_{LA} denotes the identity-source facial coordinate encoder, L_a the facial coordinates of the first frame of the identity-source portrait video, and W_{mlp,la} the learnable parameters of the identity-source facial coordinate encoder; MLP_{LB} denotes the emotion-source facial coordinate encoder, L_b the facial coordinates of the first frame of the emotion-source portrait video, and W_{mlp,lb} the learnable parameters of the emotion-source facial coordinate encoder; S_t is the frame-t audio feature of step S4;
based on the fused feature of the portrait identity features, the portrait emotion features and the audio features, a coordinate decoder consisting of a three-layer MLP predicts the facial emotion coordinate offset sequence; the specific formula is:
ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})
where ΔQ_t denotes the predicted facial emotion coordinate offset of frame t, t being the current frame of the portrait video; MLP_{LD} denotes the decoder of the facial emotion coordinate animation generation network, F_t the fused feature of frame t after the linear fusion above, and W_{mlp,ld} the learnable parameters of the decoder;
the coordinates of the first frame of the identity-source portrait video are corrected by the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence; the specific formula is:
Q_t = L_a + ΔQ_t
where Q_t denotes the facial emotion coordinates of frame t, t being the current frame of the portrait video; L_a is the facial coordinates of the first frame of the identity-source portrait video, and ΔQ_t is the predicted facial emotion coordinate offset of frame t;
to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set on the encoder-decoder structure of the facial emotion coordinate animation generation network to adjust the weights and biases of the network; the specific formula is:
L_total = λ_1·L_emo + λ_2·L_{D_L} + λ_3·L_{D_T}
where L_total denotes the total loss of the facial emotion coordinate animation generation network, L_emo the coordinate loss of the facial emotion coordinate animation generation network, L_{D_L} the loss of the facial coordinate discriminator D_L, and L_{D_T} the loss of the facial coordinate interval-frame similarity discriminator D_T; λ_1, λ_2 and λ_3 are weight parameters;
wherein the coordinate loss of the facial emotion coordinate animation generation network calculates the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1; the specific formula is:
L_emo = (1/(T·N)) Σ_{t=1}^{T} Σ_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖₂²
where L_emo denotes the coordinate loss of the facial emotion coordinate animation generation network, T the total number of frames of the video, t the current frame of the portrait video, N = 68 the total number of facial coordinates, and i the index of the current facial coordinate; Q_{i,t} denotes the i-th predicted facial coordinate of frame t, Q̂_{i,t} the i-th facial coordinate of frame t obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖₂² the squared Euclidean norm of their difference;
during training of the facial emotion coordinate animation generation network, the discriminator loss L_{D_L} is used to judge whether the generated facial coordinates are real or fake, and the discriminator loss L_{D_T} is used to estimate the similarity between interval frames of the facial coordinates; the formulas are as follows:
L_{D_L} = E_t[ log D_L(Q̂_t) + log(1 − D_L(Q_t)) ]
L_{D_T} = E_t[ log D_T(Q̂_t, Q̂_{t−1}) + log(1 − D_T(Q_t, Q̂_{t−1})) ]
where t denotes the current frame of the portrait video; D_L denotes the discriminator that judges whether the facial coordinates are real or fake and L_{D_L} its loss function; D_T denotes the facial coordinate interval-frame similarity discriminator and L_{D_T} its loss function; Q_t denotes the predicted facial emotion coordinates of frame t, Q̂_t the facial coordinates of frame t obtained in step S1, and Q̂_{t−1} the facial coordinates of the preceding frame;
when the loss functions level off, training of the facial emotion coordinate animation generation network is complete.
8. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S5 the training process of the coordinate-to-video network is as follows:
based on the facial coordinate sequence obtained in step S1, connecting the discrete coordinates in index order and rendering them as colored line segments to create a three-channel face sketch sequence of size 256 × 256;
channel-concatenating this sequence with the original picture of the first frame of the corresponding video to create a six-channel picture sequence of size 256 × 256;
taking this sequence as input and generating the reconstructed face video with the coordinate-to-video network;
to generate an optimal face video, an L1 loss function is set on the image translation network to adjust the weights and biases of the network.
9. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S6 a lip-synchronized video of the target portrait carrying the emotion of the emotion source is generated with the three trained network models, specifically comprising:
inputting any two portrait pictures and any segment of audio, obtaining the identity-source portrait coordinates and the emotion-source portrait coordinates respectively with the face recognition algorithm, and obtaining the audio content vector and audio style vector of the audio with the voice conversion method;
passing the audio content vector and the identity-source coordinates through the facial lip-sync coordinate animation generation network obtained in step S3 to generate the lip-synchronized facial coordinate offset sequence;
passing the audio content vector, the audio style vector, the identity-source coordinates and the emotion-source coordinates through the facial emotion coordinate animation generation network obtained in step S4 to generate the facial emotion coordinate offset sequence;
and correcting the identity-source coordinates by the two offset sequences to obtain the final coordinate sequence, which is input to the coordinate-to-video network obtained in step S5 to generate a lip-synchronized video of the target portrait carrying the emotion of the emotion source.
CN202210744504.9A 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method Active CN115100329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210744504.9A CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Publications (2)

Publication Number Publication Date
CN115100329A true CN115100329A (en) 2022-09-23
CN115100329B CN115100329B (en) 2023-04-07

Family

ID=83295794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210744504.9A Active CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Country Status (1)

Country Link
CN (1) CN115100329B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1466104A (en) * 2002-07-03 2004-01-07 中国科学院计算技术研究所 Statistics and rule combination based phonetic driving human face carton method
US20120280974A1 (en) * 2011-05-03 2012-11-08 Microsoft Corporation Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN111783658A (en) * 2020-07-01 2020-10-16 河北工业大学 Two-stage expression animation generation method based on double generation countermeasure network
CN113158727A (en) * 2020-12-31 2021-07-23 长春理工大学 Bimodal fusion emotion recognition method based on video and voice information
CN113408449A (en) * 2021-06-25 2021-09-17 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
CN114202604A (en) * 2021-11-30 2022-03-18 长城信息股份有限公司 Voice-driven target person video generation method and device and storage medium
CN114663539A (en) * 2022-03-09 2022-06-24 东南大学 2D face restoration technology under mask based on audio drive

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
范懿文 (Fan Yiwen) et al.: "Speech-driven facial animation supporting expression details" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631275A (en) * 2022-11-18 2023-01-20 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN116433807A (en) * 2023-04-21 2023-07-14 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116433807B (en) * 2023-04-21 2024-08-23 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116843798A (en) * 2023-07-03 2023-10-03 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device

Also Published As

Publication number Publication date
CN115100329B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN115100329B (en) Multi-mode driving-based emotion controllable facial animation generation method
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
Wang et al. Seeing what you said: Talking face generation guided by a lip reading expert
US11551393B2 (en) Systems and methods for animation generation
US7027054B1 (en) Do-it-yourself photo realistic talking head creation system and method
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
CN115004236A (en) Photo-level realistic talking face from audio
CN115588224B (en) Virtual digital person generation method and device based on face key point prediction
CN114202604A (en) Voice-driven target person video generation method and device and storage medium
CN115116109A (en) Virtual character speaking video synthesis method, device, equipment and storage medium
CN113470170B (en) Real-time video face region space-time consistent synthesis method utilizing voice information
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
EP4010899A1 (en) Audio-driven speech animation using recurrent neutral network
CN117171392A (en) Virtual anchor generation method and system based on nerve radiation field and hidden attribute
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
EP0710929A2 (en) Acoustic-assisted image processing
Zhua et al. Audio-driven talking head video generation with diffusion model
Wen et al. 3D Face Processing: Modeling, Analysis and Synthesis
CN117557695A (en) Method and device for generating video by driving single photo through audio
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
CN115937375A (en) Digital body-separating synthesis method, device, computer equipment and storage medium
Ji et al. RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Han et al. A Keypoint Based Enhancement Method for Audio Driven Free View Talking Head Synthesis
CN114494930A (en) Training method and device for voice and image synchronism measurement model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant