CN113450436A - Face animation generation method and system based on multi-mode correlation - Google Patents

Face animation generation method and system based on multi-mode correlation

Info

Publication number
CN113450436A
CN113450436A (application CN202110718414.8A)
Authority
CN
China
Prior art keywords
sequence
voice
expression
face
modal
Prior art date
Legal status
Granted
Application number
CN202110718414.8A
Other languages
Chinese (zh)
Other versions
CN113450436B (en)
Inventor
熊盛武
马宜祯
陈燚雷
邓梦涵
曾瑞
Current Assignee
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202110718414.8A
Publication of CN113450436A
Application granted
Publication of CN113450436B
Legal status: Active

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 13/00 Animation
                    • G06T 13/20 3D [Three Dimensional] animation
                        • G06T 13/205 3D [Three Dimensional] animation driven by audio data
                        • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
                    • G06T 13/80 2D [Two Dimensional] animation, e.g. using sprites
                • G06T 15/00 3D [Three Dimensional] image rendering
                    • G06T 15/005 General purpose rendering architectures
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods
                            • G06N 3/084 Backpropagation, e.g. using gradient descent
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face animation generation method and system based on multi-modal correlation. First, a neural network framework for face animation generation based on multi-modal correlation is provided. At the level of the network structure, a multi-Transformer structure is proposed to model the correlations of the voice modality, the image modality and the voice/image cross-modality in the voice-driven face animation generation task, and the facial expression features corresponding to the driving voice are computed from these correlations, which improves the realism and fluency of the generated face animation. On top of the network structure, a source video is innovatively introduced into the training process, which yields a more accurate expression prediction for the driving voice. In addition, methods driven by the voice modality alone can only produce good animation results around the lips; introducing the source video improves the realism of the face regions other than the lip region, so that a more lifelike face animation is generated.

Description

Face animation generation method and system based on multi-mode correlation
Technical Field
The invention relates to the intersection of digital image processing and artificial intelligence, and in particular to a face animation generation method and system based on multi-modal correlation.
Background
Multi-modal-driven face animation generation aims to use information from different driving modalities to generate highly natural and smooth face animations. This technology has strong practical significance and application value in scenarios such as human-computer interaction and virtual reality. Current human-computer interaction is usually carried out through voice or text, but compared with human-to-human interaction, this mode of interaction gives people a completely different experience. In fact, facial expressions, body movements, intonation and the like strongly influence the experience of human-to-human interaction, and the same content can feel completely different depending on how it is expressed. At present, human-computer interaction based on face animation generation technology is already applied in many real scenarios, such as artificial-intelligence news anchors and virtual customer service.
As one of the face animation generation technologies, voice-driven face animation generation can improve user experience in industries such as film and games. Compared with video-driven face animation generation, the voice-driven approach can greatly improve the productivity of the related industries because voice data is much easier to obtain. Face animation technology is generally applied in scenes that interact with humans, which places extremely high demands on the real-time performance and generation quality of the results. For the voice-driven face animation generation task, which lacks the rich facial information available in the video-driven setting, generating realistic, natural and non-mechanical results remains an urgent problem. Voice-driven face animation generation integrates techniques from several research fields, such as speech recognition, sequence models, facial feature representation and generative models. Most previous work improves the accuracy of the mapping from voice to facial expression and the quality of the generated results: sequence models such as LSTM are used to improve the accuracy of intermediate voice-to-face representations such as facial landmarks and 3D face coefficients, and generative models are used to produce video results from these intermediate representations. However, existing methods do not effectively exploit the multi-modal information present in the input video data when mapping voice to facial expression, so the information about the speaker's speaking style contained in the input video is lost. Moreover, the input video data contains an accurate mapping from voice to facial expression, and existing methods do not consider the connection between the driving voice and the voice in the input video to generate more accurate facial expressions.
Disclosure of Invention
In order to solve this technical problem, the invention provides a face animation generation method and system based on multi-modal correlation.
The method adopts the technical scheme that: a face animation generation method based on multi-modal correlation comprises the following steps:
step 1: building a voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on multi-modal correlation to train by using a loss function to obtain a trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioEmotion transducer networkTexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpThe facial expression self-correlation representation of the image modality is extracted and input as the expression parameters of the source video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
step 2: acquiring a source video of a target character speaking and a section of driving voice input;
and step 3: preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video, and obtaining a driving voice characteristic sequence from the driving voice;
and 4, step 4: splicing the driving voice characteristic sequence and the source voice characteristic sequence in the sequence dimension to obtain a recombined voice characteristic sequence, and inputting the recombined voice characteristic sequence into a voice Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
and 5: inputting the expression characteristic sequence obtained in the step 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
step 6: inputting the 2 kinds of autocorrelation expressions obtained in the step 4 and the step 5 into a cross-modal Transformer network T after the sequence dimensions are splicedcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
and 7: sequentially replacing the expression feature part of the facial 3D parameters extracted from the video in the step 3 by the predicted expression feature sequence to obtain a new recombined 3D facial feature representation, and calculating a 3D facial grid according to the 3D facial parameters;
and 8: rendering the obtained 3D face mesh to obtain a primary 2D rendering result, and intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain a 2D face image of the next half face;
and step 9: refining the 2D face image of the next half face in the step 8 by using a neural rendering network to obtain a 2D face image sequence of the next half face; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the step 1 to obtain a 2D face image sequence;
step 10: and (3) splicing the 2D face image sequence and the driving voice input in the step (2) by using ffmpeg to obtain a video output result.
The technical scheme adopted by the system of the invention is as follows: a face animation generation system based on multi-modal relevance comprises the following modules:
the system comprises a module 1, a voice-to-expression mapping network and a voice-to-expression mapping network, wherein the module 1 is used for building the voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on the multi-modal correlation to train by using a loss function to obtain the trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioExpression Transformer network TexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpThe facial expression self-correlation representation of the image modality is extracted and input as the expression parameters of the source video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
the module 2 is used for acquiring a source video of a target person speaking and a section of driving voice input;
the module 3 is used for preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video and obtaining a driving voice characteristic sequence from the driving voice;
a module 4, configured to splice the driving speech feature sequence and the source speech feature sequence in sequence dimensions to obtain a recombined speech feature sequence, and input the recombined speech feature sequence into the speech Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
a module 5, configured to input the expression feature sequence obtained by the module 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
a module 6, configured to input the 2 kinds of autocorrelation representations obtained in the modules 4 and 5 into the cross-modal Transformer network T after sequence dimension splicingcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
a module 7, configured to sequentially replace, in the module 3, the expression feature part of the face 3D parameter extracted from the video with the predicted expression feature sequence, to obtain a new recombined 3D face feature representation, and calculate a 3D face mesh according to the 3D face parameter;
a module 8, configured to render the obtained 3D face mesh to obtain a preliminary 2D rendering result; intercepting the 2D face image according to the corresponding relation between the 3D face model and the face key points to obtain the 2D face image of the next half face
A module 9, configured to refine the 2D face image of the module 8 by using a neural rendering network, so as to obtain a 2D face image sequence; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the module 1 to obtain a 2D face image sequence;
and the module 10 is used for splicing the 2D face image sequence and the driving voice input by the module 2 by using ffmpeg to obtain a video output result.
The invention provides a brand-new Transformer-based framework for voice-driven face animation generation. The framework provides temporal stability in the feature mapping stage and ensures the diversity and smoothness of the generated results. In addition, owing to the parallel computation of the Transformer architecture, the method achieves real-time voice-to-expression mapping, in contrast to previous methods.
The present invention proposes to explicitly consider the connection between the driving audio and the source video, i.e. the multi-modal correlation. Unlike previous methods that fine-tune the network with the source video, the proposed method exploits the inherent synchrony of the source audio and the source video, so that a more accurate result is obtained when predicting expressions for the driving voice. In addition, methods driven by the voice modality alone can only produce good animation results around the lips; introducing the source video improves the realism of the whole facial expression beyond the lips, so that animation results more consistent with the original speaker's speaking style are generated.
Drawings
FIG. 1 is a schematic diagram of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a voice-to-expression mapping network structure of multi-modal correlations according to an embodiment of the present invention.
Fig. 3 is a first result diagram of face animation generation according to the embodiment of the present invention.
Fig. 4 is a second result diagram of face animation generation according to the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and all of these fall within the scope of the present invention.
As shown in fig. 1, the method for generating a face animation based on multi-modal correlation according to the present invention includes the following steps:
step 1: building a voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on multi-modal correlation to train by using a loss function to obtain a trained voice-to-expression mapping network;
referring to fig. 2, the voice-to-emotion mapping network architecture with multi-modal correlation provided in this embodiment includes a voice Transformer network TaudioAn expressive Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network T_audio is used to extract the autocorrelation representation of the voice modality; its input is the combined feature obtained by splicing the source voice features and the driving voice features in the sequence dimension. The expression Transformer network T_exp is used to extract the facial-expression autocorrelation representation of the image modality; its input is the expression parameters of the source video. The cross-modal Transformer network T_cross is used to extract the autocorrelation representation of the composite features of the voice and image modalities; its input is the composite feature obtained by splicing the autocorrelation features produced by T_audio and T_exp in the sequence dimension. The Transformer structures used in this example are identical and differ only in their input/output dimensions. Specifically, each Transformer network is composed of N self-attention layers, where N = 10 in this example; each self-attention layer comprises a multi-head self-attention sub-layer and a feed-forward sub-layer. In a self-attention layer, the correlation attn of the input query and key is calculated and then matrix-multiplied (mat-mul) with the value to obtain the correlation representation. Formally, for an input sequence X = (x_1, x_2, ..., x_n) ∈ R^(n×m), where n is the sequence length and m is the feature dimension of each sample, the general formula of the correlation representation is
X' = attn(q, k) × v = softmax(q × k^T / √m) × v, with X' ∈ R^(n×m).
For the autocorrelation calculation employed in this example, q, k and v are all the input sequence X, and × denotes matrix multiplication.
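For illustration only, a minimal PyTorch sketch of one such self-attention layer is given below; the module name, the width d_model = 256, the head count and the feed-forward size are assumptions chosen for the sketch, not values specified in this embodiment.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """One layer of the Transformers used here: multi-head self-attention + feed-forward (sketch)."""

    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, n, d_model)
        # Self-attention: q, k and v are all the same input sequence X.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x

# Each of T_audio, T_exp and T_cross stacks N = 10 such layers (only input/output dims differ).
encoder = nn.Sequential(*[SelfAttentionLayer() for _ in range(10)])
```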
Data preprocessing in this embodiment: the input source video is denoted V. First, the open-source audio/video processing tool ffmpeg is used to extract video frames at a frame rate of 25 fps, giving the video frame sequence V = (v_1, v_2, ..., v_n), where n is the number of video frames, and the source audio A is extracted from the video. For each frame of the video frame sequence, face detection is performed with the open-source face detection framework mtcnn, and videos containing frames with zero faces or more than one face are deleted from the data set. For each video frame in which exactly one face is detected, a 256 × 256 pixel region centered on the face detection box is cropped, giving the cropped video frame sequence V̂ = (v̂_1, v̂_2, ..., v̂_n), and the face key points detected by mtcnn are stored. The 3D face reconstruction method of "Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set" is then used to compute, for each cropped video frame v̂_i, the corresponding 3D face parameters (α_i, β_i, δ_i, γ_i, p_i), where α_i, β_i, δ_i, γ_i and p_i respectively denote the geometry, expression, illumination, texture and transformation parameters of the 3D face parametric model. The expression component β_i of the 3D face parameters is extracted from each frame to form the expression sequence β = (β_1, β_2, ..., β_n). For the voice part, the Mel-frequency cepstral coefficients (MFCC) of audio A are extracted with the open-source speech processing library python_speech_features; the MFCC features are then sliced with a sliding window of length 8 and step 4, and the MFCC frames of each slice are spliced along the channel dimension, yielding the speech slice sequence A = (a_1, a_2, ..., a_n) whose length equals the number of video frames n.
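As an illustration, a sketch of the audio part of this preprocessing is given below (the face detection, cropping and 3D reconstruction steps are omitted); the file paths, the 16 kHz mono resampling and the use of the default 10 ms MFCC step of python_speech_features are assumptions of the sketch. With a 10 ms step, a window of length 8 with step 4 produces 25 slices per second, matching the 25 fps video frames.

```python
import os
import subprocess

import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

# Extract 25 fps frames and a mono 16 kHz wav track with ffmpeg (illustrative paths).
os.makedirs("frames", exist_ok=True)
subprocess.run(["ffmpeg", "-i", "source.mp4", "-vf", "fps=25", "frames/%05d.png"], check=True)
subprocess.run(["ffmpeg", "-i", "source.mp4", "-vn", "-ac", "1", "-ar", "16000", "source.wav"], check=True)

def audio_to_slices(wav_path, win=8, step=4):
    """Slice MFCC features so that one slice corresponds to one 25 fps video frame."""
    rate, signal = wavfile.read(wav_path)
    feats = mfcc(signal, samplerate=rate)                 # (num_mfcc_frames, 13), 10 ms step by default
    slices = [feats[s:s + win].reshape(-1)                # splice the 8 MFCC frames along the channel dim
              for s in range(0, len(feats) - win + 1, step)]
    return np.stack(slices)                               # (n, win * 13): speech slice sequence A

A = audio_to_slices("source.wav")
```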
The network forward computation of this embodiment: because a self-supervised training scheme is used, a length n_position (n_position < n) must be chosen to split the training and supervision data. The subsequence of length n_position of the expression sequence β, β_train = (β_1, β_2, ..., β_{n_position}), is used as the input of the expression Transformer T_exp, and β_pred = (β_{n_position+1}, ..., β_n) is used as the regression ground truth. The speech slice sequence A is used as the input of the voice Transformer T_audio, where A_train = (a_1, a_2, ..., a_{n_position}) is the speech slice sequence matching the expression sequence, equivalent to the source voice input at test time, and A_pred = (a_{n_position+1}, a_{n_position+2}, ..., a_n) is the speech slice sequence whose expressions need to be predicted, equivalent to the driving voice input at test time. The two sequences β_train and A are then spliced in the sequence dimension to obtain a composite sequence of length n_position + n, which serves as the input of the cross-modal Transformer T_cross; the autocorrelation computation yields a result sequence of length n_position + n, R = (r_1, r_2, ..., r_{n_position+n}). According to the composition of the input sequence and the computation of the cross-modal Transformer, the subsequence of R at the positions corresponding to A_pred is the predicted expression sequence β̂_pred.
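A minimal sketch of this forward composition is shown below, following the step 4-6 formulation in which the two autocorrelation representations are spliced before the cross-modal Transformer; the argument names are illustrative, the three encoders are assumed to share a feature width, and the linear projections that would map MFCC slices and expression parameters onto that width are omitted.

```python
import torch

def predict_expressions(t_audio, t_exp, t_cross, beta_src, a_src, a_drive):
    """Cross-modal forward pass (sketch).

    beta_src : (1, n_src, d)   expression features of the source video
    a_src    : (1, n_src, d)   speech slices of the source audio
    a_drive  : (1, n_drv, d)   speech slices of the driving audio
    """
    speech = torch.cat([a_src, a_drive], dim=1)       # recombined speech sequence
    h_audio = t_audio(speech)                         # autocorrelation of the recombined speech
    h_exp = t_exp(beta_src)                           # autocorrelation of the expression sequence
    composite = torch.cat([h_exp, h_audio], dim=1)    # splice in the sequence dimension
    h_cross = t_cross(composite)                      # multi-modal correlation computation
    # Keep only the positions that correspond to the driving-speech slices.
    return h_cross[:, -a_drive.shape[1]:, :]
```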
the calculation of the loss function of this embodiment uses backward propagationThe broadcasting method trains a voice-to-expression mapping network based on multi-mode correlation; calculating predicted expression sequences
Figure BDA0003135948300000072
And betapredMaking mean square error loss (MSE), i.e.
Figure BDA0003135948300000073
Training a voice Transformer, an expression Transformer and a trans-modal Transformer through a back propagation algorithm, observing a loss function of a training result, and stopping training when loss values are not reduced in 5 continuous rounds;
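A sketch of such a training loop is given below; the optimizer, learning rate and epoch limit are illustrative assumptions, the wrapper `model` is assumed to implement the forward composition sketched above, while the MSE objective and the stop-after-5-stale-rounds rule follow the description.

```python
import torch

def train(model, loader, lr=1e-4, patience=5, max_epochs=200):
    """Self-supervised training sketch for the voice-to-expression mapping network."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        total = 0.0
        for beta_src, a_src, a_drive, beta_true in loader:
            beta_hat = model(beta_src, a_src, a_drive)            # predicted expression sequence
            loss = torch.nn.functional.mse_loss(beta_hat, beta_true)
            opt.zero_grad()
            loss.backward()                                       # back-propagate through all three Transformers
            opt.step()
            total += loss.item()
        if total < best:
            best, stale = total, 0
        else:
            stale += 1
            if stale >= patience:                                 # loss has not decreased for 5 rounds
                break
```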
step 2: acquiring a speaking video containing a target person and a segment of driving voice input;
step 3: using the same data preprocessing method as in step 1 to obtain an expression sequence of length n_seed and a source-voice MFCC feature sequence of length n_seed from the speaking video, and to obtain a driving-voice MFCC feature sequence of length n_position from the driving voice;
steps 4-6 constitute the expression prediction stage; for the specific process, refer to the diagram in fig. 2;
step 4: splicing the driving-voice MFCC feature sequence and the source-voice MFCC feature sequence in the sequence dimension to obtain the recombined MFCC, and inputting the recombined MFCC into the voice Transformer to obtain an autocorrelation representation of the recombined voice;
step 4 of this embodiment includes the following operations:
step 4.1: splicing the driving voice mfcc characteristic sequence (the sequence length is 3 in the figure) obtained in the step 3 and the source voice mfcc characteristic sequence (the sequence length is 6 in the figure) in the sequence dimension to obtain a composite mfcc characteristic (the sequence length is 9 in the figure);
step 4.2: inputting the composite mfcc into a voice Transformer, and obtaining the voice characteristics with autocorrelation after passing through a multi-head attention layer and a feedforward layer in a Transformer encoder;
and 5: inputting the expression sequence obtained in the step 3 into an expression Transformer TexpCalculating an autocorrelation representation of the expression;
step 6: splicing the two autocorrelation representations obtained in steps 4 and 5 in the sequence dimension, inputting the spliced sequence into the cross-modal Transformer network, performing the multi-modal correlation computation, and taking the subsequence at specific sequence positions to obtain the predicted expression feature sequence;
step 6 of this embodiment comprises the following operations:
step 6.1: splicing the voice autocorrelation representation obtained in the step 4 and the expression autocorrelation representation obtained in the step 5 on a sequence dimension to obtain a composite characteristic sequence (the sequence length in the figure is 15);
step 6.2: inputting the spliced composite features into the cross-modal Transformer and calculating the autocorrelation representation of the composite features;
step 6.3: in the obtained composite-feature autocorrelation representation, the part that computes the correlation between the driving voice and the source expression coefficients is taken as the result; the sequence length of this result is the same as the number of driving-voice sliding windows computed in the preprocessing stage (sequence length 3 in the figure);
all the Transformer model frameworks in the step 4-6 are based on the attribute Is All You Need, and the specific details are shown in FIG. 2; the voice Transformer consists of N self-attention layers, wherein N is 10 in the example; each self-attention layer comprises a multi-head self-attention layer and a feedforward layer; performing correlation calculation on the input sequence in a self-attribute layer to obtain a correlation representation; specifically, for the input sequence X ═ X (X)1,x2,...,xn)∈Rn ×mWhere n is the sequence length, m is the characteristic dimension of each sample, and the correlation expression is a general calculation formula of
Figure BDA0003135948300000081
X′∈Rn×mFor the autocorrelation calculation method employed in this example, q, k, v are all input sequences X, where X represents a matrix multiplication;
and 7: sequentially replacing the expression feature part of the face 3D parameters in the step 3 with the predicted expression feature sequence to obtain new 3D face parameters, and calculating a 3D face grid according to the 3D face parameters;
and 8: rendering the obtained 3D face mesh by using a rasterization method to obtain a primary 2D rendering result; intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain a 2D face image of the next half face;
step 8 of this embodiment comprises the following operations:
step 8.1: rendering the recombined face parameter sequence obtained in step 7 by rasterization to obtain a preliminary rendering result;
step 8.2: computing the positions of the face key points in the rendering of the recombined face parameters according to the correspondence between the 3D face mesh and the 2D face key points, and cropping the 2D image of the lower half of the face according to these face key points;
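As an illustration of the cropping in step 8.2, the sketch below cuts the lower-half-face region out of a rendered frame using projected 2D key points; the 68-point landmark layout and the "below the nose tip (point 30)" rule are assumptions of the example, not the exact cropping rule of this embodiment.

```python
import numpy as np

def crop_lower_face(rendered, landmarks2d):
    """Crop the lower half of the face from a rendered frame using 2D key points.

    rendered    : (H, W, 3) rendered image
    landmarks2d : (68, 2) projected face key points in pixel coordinates
    """
    nose_y = int(landmarks2d[30, 1])               # y of the nose tip: upper edge of the crop
    x0, x1 = int(landmarks2d[:, 0].min()), int(landmarks2d[:, 0].max())
    y1 = int(landmarks2d[:, 1].max())              # chin: lower edge of the crop
    return rendered[nose_y:y1, x0:x1]
```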
and step 9: refining the 2D face image obtained in the step 8 by using a neural rendering network to obtain a 2D face image sequence of the next half face; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the step 1 to obtain a 2D face image sequence;
step 9 of this embodiment includes the following operations:
step 9.1: inputting the lower-half-face pictures into the neural rendering network of "Photorealistic Audio-driven Video Portraits" to obtain the refined lower-half-face picture sequence;
step 9.2: pasting the refined lower-half-face picture sequence back onto the video frames extracted with ffmpeg in the preprocessing of step 3, according to the key-point positions detected with mtcnn in that preprocessing, to obtain the 2D face image sequence;
step 10: splicing the 2D face image sequence and the driving voice input from step 2 with ffmpeg to obtain the output video.
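For illustration, this splicing can be done with a single ffmpeg call such as the one below; the frame pattern and file names are assumptions.

```python
import subprocess

# Mux the generated 25 fps frame sequence with the driving audio (illustrative paths).
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "25", "-i", "result/%05d.png",   # generated 2D face image sequence
    "-i", "drive_audio.wav",                        # driving voice from step 2
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "output.mp4",
], check=True)
```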
Please refer to fig. 3 and fig. 4, which show results of the face animation generation method of the present invention. This example uses two test audio clips, both extracted from other speaking videos. For each test audio clip, three source videos are used for testing. Within each figure the same driving voice is used, and different rows show the results obtained with different source videos. It can be seen that the lip movements produced by the same driving audio with different source videos are similar, and the resulting talking videos are natural and fluent.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A face animation generation method based on multi-modal correlation is characterized by comprising the following steps:
step 1: building a voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on multi-modal correlation to train by using a loss function to obtain a trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioExpression Transformer network TexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpFor extracting the facial expression autocorrelation representation of the image modality, the input is the sourceExpression parameters of the video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
step 2: acquiring a source video of a target character speaking and a section of driving voice input;
and step 3: preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video, and obtaining a driving voice characteristic sequence from the driving voice;
and 4, step 4: splicing the driving voice characteristic sequence and the source voice characteristic sequence in the sequence dimension to obtain a recombined voice characteristic sequence, and inputting the recombined voice characteristic sequence into a voice Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
and 5: inputting the expression characteristic sequence obtained in the step 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
step 6: inputting the 2 kinds of autocorrelation expressions obtained in the step 4 and the step 5 into a cross-modal Transformer network T after the sequence dimensions are splicedcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
and 7: sequentially replacing the expression feature part of the facial 3D parameters extracted from the video in the step 3 by the predicted expression feature sequence to obtain a new recombined 3D facial feature representation, and calculating a 3D facial grid according to the 3D facial parameters;
and 8: rendering the obtained 3D face mesh to obtain a primary 2D rendering result, and intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain a 2D face image of the next half face;
and step 9: refining the 2D face image of the next half face in the step 8 by using a neural rendering network to obtain a 2D face image sequence of the next half face; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the step 1 to obtain a 2D face image sequence;
step 10: and (3) splicing the 2D face image sequence and the driving voice input in the step (2) by using ffmpeg to obtain a video output result.
2. The multi-modal correlation based face animation generation method of claim 1, wherein: the preprocessing of the 2D speaker video data set in step 1 comprises: for an input source video V, using the open-source audio/video processing tool ffmpeg to extract video frames at a frame rate of 25 fps, obtaining the video frame sequence V = (v_1, v_2, ..., v_n), where n is the number of video frames, and extracting the source audio A from the video; performing face detection on each video frame with the open-source face detection framework mtcnn, and deleting from the data set those videos containing frames with zero faces or more than one face; for each video frame in which exactly one face is detected, cropping a 256 × 256 pixel region centered on the face detection box to obtain the cropped video frame sequence V̂ = (v̂_1, v̂_2, ..., v̂_n), and storing the face key points detected by mtcnn; computing, with a 3D face reconstruction method, the 3D face parameters (α_i, β_i, δ_i, γ_i, p_i) corresponding to each cropped video frame v̂_i, where α_i, β_i, δ_i, γ_i and p_i respectively denote the geometry, expression, illumination, texture and transformation parameters of the 3D face parametric model; extracting the expression component β_i of the 3D face parameters from each frame to form the expression sequence β = (β_1, β_2, ..., β_n); and, for the voice part, extracting the Mel-frequency cepstral coefficients (MFCC) of audio A with the open-source speech processing library python_speech_features, slicing the MFCC features with a sliding window of length 8 and step 4, and splicing the MFCC frames of each slice along the channel dimension to obtain the speech slice sequence A = (a_1, a_2, ..., a_n) whose length equals the number of video frames n.
3. The multi-modal correlation based face animation generation method of claim 1, wherein: in step 1, the voice-to-expression mapping network is trained with a self-supervised method: a subsequence of length n_position (n_position < n) of the expression sequence β, β_train = (β_1, β_2, ..., β_{n_position}), is selected as the input of the expression Transformer network T_exp, and β_pred = (β_{n_position+1}, ..., β_n) is used as the regression ground truth; the speech slice sequence A is used as the input of the voice Transformer network T_audio, where A_train = (a_1, a_2, ..., a_{n_position}) is the speech slice sequence matching the expression sequence, equivalent to the source voice input at test time, and A_pred = (a_{n_position+1}, a_{n_position+2}, ..., a_n) is the speech slice sequence whose expressions need to be predicted, equivalent to the driving voice input at test time; the two sequences β_train and A are then spliced in the sequence dimension to obtain a composite sequence of length n_position + n, which serves as the input of the cross-modal Transformer network T_cross; the autocorrelation computation yields a result sequence R = (r_1, r_2, ..., r_{n_position+n}) of length n_position + n; and, according to the composition of the input sequence and the computation of the cross-modal Transformer, the subsequence of R at the positions corresponding to A_pred is the predicted expression sequence β̂_pred.
4. The multi-modal correlation based face animation generation method of claim 1, wherein: using the loss function in step 1 to guide the training of the voice-to-expression mapping network based on multi-modal correlation means computing the mean square error (MSE) loss between the predicted expression sequence β̂_pred and the real expression sequence β_pred, i.e.
L_MSE = MSE(β̂_pred, β_pred) = (1 / (n − n_position)) Σ_{i=n_position+1}^{n} ||β̂_i − β_i||²,
training the voice Transformer network, the expression Transformer network and the cross-modal Transformer network through the back-propagation algorithm, observing the loss of the training results, and stopping training when the loss value has not decreased for 5 consecutive rounds.
5. The method for generating a human face animation based on multi-modal correlation according to claim 1, wherein the step 4 comprises the following sub-steps:
step 4.1: splicing the driving-voice MFCC feature sequence obtained in step 3 and the source-voice MFCC feature sequence in the sequence dimension to obtain a composite MFCC feature;
step 4.2: inputting the composite MFCC into the voice Transformer network T_audio, and obtaining speech features with autocorrelation after the multi-head attention layer and the feed-forward layer in the Transformer encoder.
6. The method for generating a human face animation based on multi-modal correlation according to claim 1, wherein the step 6 comprises the following sub-steps:
step 6.1: splicing the voice autocorrelation representation obtained in the step 4 and the expression autocorrelation representation obtained in the step 5 on a sequence dimension to obtain a composite characteristic sequence;
step 6.2: inputting the spliced composite features into the cross-modal Transformer network T_cross and calculating the autocorrelation representation of the composite features;
step 6.3: in the obtained composite feature autocorrelation expression, a part of calculating the correlation between the driving voice and the source expression coefficient is taken as a result, and the sequence length of the result is the same as the number of the driving voice sliding windows calculated by the preprocessing part in the step 1.
7. A face animation generation system based on multi-modal correlation is characterized by comprising the following modules:
the system comprises a module 1, a voice-to-expression mapping network and a voice-to-expression mapping network, wherein the module 1 is used for building the voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on the multi-modal correlation to train by using a loss function to obtain the trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioExpression Transformer network TexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpThe facial expression self-correlation representation of the image modality is extracted and input as the expression parameters of the source video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
the module 2 is used for acquiring a source video of a target person speaking and a section of driving voice input;
the module 3 is used for preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video and obtaining a driving voice characteristic sequence from the driving voice;
a module 4, configured to splice the driving speech feature sequence and the source speech feature sequence in sequence dimensions to obtain a recombined speech feature sequence, and input the recombined speech feature sequence into the speech Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
a module 5, configured to input the expression feature sequence obtained by the module 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
a module 6, configured to input the 2 kinds of autocorrelation representations obtained in the modules 4 and 5 into the cross-modal Transformer network T after sequence dimension splicingcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
a module 7, configured to sequentially replace, in the module 3, the expression feature part of the face 3D parameter extracted from the video with the predicted expression feature sequence, to obtain a new recombined 3D face feature representation, and calculate a 3D face mesh according to the 3D face parameter;
a module 8, configured to render the obtained 3D face mesh to obtain a preliminary 2D rendering result; intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain the 2D face image of the next half face
A module 9, configured to refine the 2D face image of the module 8 by using a neural rendering network, so as to obtain a 2D face image sequence; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the module 1 to obtain a 2D face image sequence;
and the module 10 is used for splicing the 2D face image sequence and the driving voice input by the module 2 by using ffmpeg to obtain a video output result.
CN202110718414.8A 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation Active CN113450436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718414.8A CN113450436B (en) 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718414.8A CN113450436B (en) 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation

Publications (2)

Publication Number Publication Date
CN113450436A true CN113450436A (en) 2021-09-28
CN113450436B CN113450436B (en) 2022-04-15

Family

ID=77813260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718414.8A Active CN113450436B (en) 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation

Country Status (1)

Country Link
CN (1) CN113450436B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120280974A1 (en) * 2011-05-03 2012-11-08 Microsoft Corporation Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN106462255A (en) * 2016-06-29 2017-02-22 深圳狗尾草智能科技有限公司 A method, system and robot for generating interactive content of robot
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112465935A (en) * 2020-11-19 2021-03-09 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium
CN112581569A (en) * 2020-12-11 2021-03-30 中国科学院软件研究所 Adaptive emotion expression speaker facial animation generation method and electronic device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
THIES J et al.: "Neural Voice Puppetry: Audio-driven Facial Reenactment", European Conference on Computer Vision, Springer *
XIAO LEI: "Speech-driven high-naturalness face animation", China Excellent Master's Theses Full-text Database (Information Science and Technology Series) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155321A (en) * 2021-11-26 2022-03-08 天津大学 Face animation generation method based on self-supervision and mixed density network
CN116664731A (en) * 2023-06-21 2023-08-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN116664731B (en) * 2023-06-21 2024-03-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN116993948B (en) * 2023-09-26 2024-03-26 粤港澳大湾区数字经济研究院(福田) Face three-dimensional reconstruction method, system and intelligent terminal
CN116993948A (en) * 2023-09-26 2023-11-03 粤港澳大湾区数字经济研究院(福田) Face three-dimensional reconstruction method, system and intelligent terminal
CN117115312A (en) * 2023-10-17 2023-11-24 天度(厦门)科技股份有限公司 Voice-driven facial animation method, device, equipment and medium
CN117115312B (en) * 2023-10-17 2023-12-19 天度(厦门)科技股份有限公司 Voice-driven facial animation method, device, equipment and medium
CN117237495A (en) * 2023-11-06 2023-12-15 浙江同花顺智能科技有限公司 Three-dimensional face animation generation method and system
CN117237495B (en) * 2023-11-06 2024-02-23 浙江同花顺智能科技有限公司 Three-dimensional face animation generation method and system
CN117315552A (en) * 2023-11-30 2023-12-29 山东森普信息技术有限公司 Large-scale crop inspection method, device and storage medium
CN117315552B (en) * 2023-11-30 2024-01-26 山东森普信息技术有限公司 Large-scale crop inspection method, device and storage medium
CN117635784A (en) * 2023-12-19 2024-03-01 世优(北京)科技有限公司 Automatic three-dimensional digital human face animation generation system
CN117635784B (en) * 2023-12-19 2024-04-19 世优(北京)科技有限公司 Automatic three-dimensional digital human face animation generation system

Also Published As

Publication number Publication date
CN113450436B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN113450436B (en) Face animation generation method and system based on multi-mode correlation
Xie et al. Realistic mouth-synching for speech-driven talking face using articulatory modelling
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
Xie et al. A coupled HMM approach to video-realistic speech animation
JP2003529861A (en) A method for animating a synthetic model of a human face driven by acoustic signals
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
Yao et al. Iterative text-based editing of talking-heads using neural retargeting
Bigioi et al. Speech driven video editing via an audio-conditioned diffusion model
Sadoughi et al. Expressive speech-driven lip movements with multitask learning
Chen et al. Transformer-s2a: Robust and efficient speech-to-animation
CN115376482A (en) Face motion video generation method and device, readable medium and electronic equipment
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN116051692A (en) Three-dimensional digital human face animation generation method based on voice driving
Medina et al. Speech driven tongue animation
Hussen Abdelaziz et al. Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models
Li et al. Speech driven facial animation generation based on GAN
Jha et al. Cross-language speech dependent lip-synchronization
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
Agarwal et al. Realistic Lip Animation from Speech for Unseen Subjects using Few-shot Cross-modal Learning
Krejsa et al. A novel lip synchronization approach for games and virtual environments
Aggarwal et al. Comprehensive overview of various lip synchronization techniques
CN117528197B (en) High-frame-rate playback type quick virtual film making system
Chen et al. Text to avatar in multimodal human computer interface
Thikekar et al. Generative Adversarial Networks based Viable Solution on Dubbing Videos With Lips Synchronization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant