CN113450436A - Face animation generation method and system based on multi-mode correlation - Google Patents

Face animation generation method and system based on multi-mode correlation

Info

Publication number
CN113450436A
CN113450436A (application CN202110718414.8A)
Authority
CN
China
Prior art keywords
sequence
voice
expression
face
modal
Prior art date
Legal status
Granted
Application number
CN202110718414.8A
Other languages
Chinese (zh)
Other versions
CN113450436B (en)
Inventor
熊盛武
马宜祯
陈燚雷
邓梦涵
曾瑞
Current Assignee
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202110718414.8A
Publication of CN113450436A
Application granted
Publication of CN113450436B
Legal status: Active

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 13/00 Animation
                    • G06T 13/20 3D [Three Dimensional] animation
                        • G06T 13/205 3D [Three Dimensional] animation driven by audio data
                        • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
                    • G06T 13/80 2D [Two Dimensional] animation, e.g. using sprites
                • G06T 15/00 3D [Three Dimensional] image rendering
                    • G06T 15/005 General purpose rendering architectures
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods
                            • G06N 3/084 Backpropagation, e.g. using gradient descent
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face animation generation method and system based on multi-modal correlation. First, a neural network framework for face animation generation based on multi-modal correlation is provided. At the level of the network structure, a multi-Transformer structure is proposed to model the correlations of the voice modality, the image modality and the voice/image cross-modality in the voice-driven face animation generation task, and the facial expression features corresponding to the driving voice are computed from these correlations, which improves the realism and fluency of the generated face animation. On top of the network structure, a source video is innovatively introduced into the training process, which yields a more accurate expression prediction for the driving voice. In addition, methods driven by the voice modality alone can only produce good animation results around the lips; introducing the source video improves the realism of the face regions other than the lip region, so that a more lifelike face animation is generated.

Description

Face animation generation method and system based on multi-mode correlation
Technical Field
The invention relates to the intersection of digital image processing and artificial intelligence, and in particular to a face animation generation method and system based on multi-modal correlation.
Background
Multi-modal-driven face animation generation aims to use information from different driving modalities to generate highly natural and smooth face animations. This technology has strong practical significance and application value in scenarios such as human-computer interaction and virtual reality. Current human-computer interaction is usually carried out through voice or text, but compared with human-to-human interaction, this mode of interaction gives people a completely different experience. In fact, facial expressions, body movements, intonation and the like strongly influence the experience of human-to-human interaction, and the same content can feel completely different depending on how it is expressed. At present, human-computer interaction based on face animation generation technology is already applied in many real scenarios, such as artificial-intelligence news anchors and virtual customer service.
As one of the face animation generation technologies, voice-driven face animation generation can improve user experience in industries such as film and games. Compared with video-driven face animation generation, the voice-driven approach can greatly improve the productivity of the related industries because voice data is much easier to obtain. Face animation technology is generally applied in scenes that interact with humans, which places extremely high demands on the real-time performance and generation quality of the results. For the voice-driven face animation generation task, which lacks the rich facial information available in the video-driven setting, generating realistic, natural and non-mechanical results remains an urgent problem. Voice-driven face animation generation integrates techniques from several research fields, such as speech recognition, sequence models, facial feature representation and generative models. Most previous work improves the accuracy of the mapping from voice to facial expression and the quality of the generated results: sequence models such as LSTM are used to improve the accuracy of intermediate voice-to-face representations such as facial landmarks and 3D face coefficients, and generative models are used to produce video results from these intermediate representations. However, existing methods do not effectively exploit the multi-modal information present in the input video data when mapping voice to facial expression, so the information about the speaker's speaking style contained in the input video is lost. Moreover, the input video data contains an accurate mapping from voice to facial expression, and existing methods do not consider the connection between the driving voice and the voice in the input video to generate more accurate facial expressions.
Disclosure of Invention
In order to solve this technical problem, the invention provides a face animation generation method and system based on multi-modal correlation.
The method adopts the technical scheme that: a face animation generation method based on multi-modal correlation comprises the following steps:
step 1: building a voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on multi-modal correlation to train by using a loss function to obtain a trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioEmotion transducer networkTexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpThe facial expression self-correlation representation of the image modality is extracted and input as the expression parameters of the source video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
step 2: acquiring a source video of a target character speaking and a section of driving voice input;
and step 3: preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video, and obtaining a driving voice characteristic sequence from the driving voice;
and 4, step 4: splicing the driving voice characteristic sequence and the source voice characteristic sequence in the sequence dimension to obtain a recombined voice characteristic sequence, and inputting the recombined voice characteristic sequence into a voice Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
and 5: inputting the expression characteristic sequence obtained in the step 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
step 6: inputting the 2 kinds of autocorrelation expressions obtained in the step 4 and the step 5 into a cross-modal Transformer network T after the sequence dimensions are splicedcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
and 7: sequentially replacing the expression feature part of the facial 3D parameters extracted from the video in the step 3 by the predicted expression feature sequence to obtain a new recombined 3D facial feature representation, and calculating a 3D facial grid according to the 3D facial parameters;
and 8: rendering the obtained 3D face mesh to obtain a primary 2D rendering result, and intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain a 2D face image of the next half face;
and step 9: refining the 2D face image of the next half face in the step 8 by using a neural rendering network to obtain a 2D face image sequence of the next half face; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the step 1 to obtain a 2D face image sequence;
step 10: and (3) splicing the 2D face image sequence and the driving voice input in the step (2) by using ffmpeg to obtain a video output result.
The technical scheme adopted by the system of the invention is as follows: a face animation generation system based on multi-modal relevance comprises the following modules:
the system comprises a module 1, a voice-to-expression mapping network and a voice-to-expression mapping network, wherein the module 1 is used for building the voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on the multi-modal correlation to train by using a loss function to obtain the trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioExpression Transformer network TexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpThe facial expression self-correlation representation of the image modality is extracted and input as the expression parameters of the source video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
the module 2 is used for acquiring a source video of a target person speaking and a section of driving voice input;
the module 3 is used for preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video and obtaining a driving voice characteristic sequence from the driving voice;
a module 4, configured to splice the driving speech feature sequence and the source speech feature sequence in sequence dimensions to obtain a recombined speech feature sequence, and input the recombined speech feature sequence into the speech Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
a module 5, configured to input the expression feature sequence obtained by the module 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
a module 6, configured to input the 2 kinds of autocorrelation representations obtained in the modules 4 and 5 into the cross-modal Transformer network T after sequence dimension splicingcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
a module 7, configured to sequentially replace, in the module 3, the expression feature part of the face 3D parameter extracted from the video with the predicted expression feature sequence, to obtain a new recombined 3D face feature representation, and calculate a 3D face mesh according to the 3D face parameter;
a module 8, configured to render the obtained 3D face mesh to obtain a preliminary 2D rendering result; intercepting the 2D face image according to the corresponding relation between the 3D face model and the face key points to obtain the 2D face image of the next half face
A module 9, configured to refine the 2D face image of the module 8 by using a neural rendering network, so as to obtain a 2D face image sequence; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the module 1 to obtain a 2D face image sequence;
and the module 10 is used for splicing the 2D face image sequence and the driving voice input by the module 2 by using ffmpeg to obtain a video output result.
The invention provides a brand-new Transformer-based framework for voice-driven face animation generation. The framework provides temporal stability in the feature mapping stage and ensures the diversity and smoothness of the generated results. In addition, owing to the parallel computation of the Transformer architecture, the method achieves real-time voice-to-expression mapping, in contrast to previous methods.
The present invention proposes to explicitly consider the connection between the driving audio and the source video, i.e. the multi-modal correlation. Unlike previous methods that fine-tune the network with the source video, the proposed method exploits the inherent synchrony of the source audio and the source video, so that a more accurate result is obtained when predicting expressions for the driving voice. In addition, methods driven by the voice modality alone can only produce good animation results around the lips; introducing the source video improves the realism of the whole facial expression beyond the lips, so that animation results more consistent with the original speaker's speaking style are generated.
Drawings
FIG. 1 is a schematic diagram of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a voice-to-expression mapping network structure of multi-modal correlations according to an embodiment of the present invention.
Fig. 3 is a first result diagram of face animation generation according to the embodiment of the present invention.
Fig. 4 is a second result diagram of face animation generation according to the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and all of these fall within the scope of the present invention.
As shown in fig. 1, the method for generating a face animation based on multi-modal correlation according to the present invention includes the following steps:
step 1: building a voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on multi-modal correlation to train by using a loss function to obtain a trained voice-to-expression mapping network;
referring to fig. 2, the voice-to-emotion mapping network architecture with multi-modal correlation provided in this embodiment includes a voice Transformer network TaudioAn expressive Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network T_audio is used to extract the autocorrelation representation of the voice modality; its input is the combined feature obtained by splicing the source voice features and the driving voice features in the sequence dimension. The expression Transformer network T_exp is used to extract the facial-expression autocorrelation representation of the image modality; its input is the expression parameters of the source video. The cross-modal Transformer network T_cross is used to extract the autocorrelation representation of the composite features of the voice and image modalities; its input is the composite feature obtained by splicing the autocorrelation features produced by T_audio and T_exp in the sequence dimension. The Transformer structures used in this example are identical and differ only in their input/output dimensions. Specifically, each Transformer network is composed of N self-attention layers, where N = 10 in this example; each self-attention layer comprises a multi-head self-attention sub-layer and a feed-forward sub-layer. In a self-attention layer, the correlation attn of the input query and key is calculated and then matrix-multiplied (mat-mul) with the value to obtain the correlation representation. Formally, for an input sequence X = (x_1, x_2, ..., x_n) ∈ R^(n×m), where n is the sequence length and m is the feature dimension of each sample, the general formula of the correlation representation is
X' = attn(q, k) × v = softmax(q × k^T / √m) × v, with X' ∈ R^(n×m).
For the autocorrelation calculation employed in this example, q, k and v are all the input sequence X, and × denotes matrix multiplication.
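For illustration only, a minimal PyTorch sketch of one such self-attention layer is given below; the module name, the width d_model = 256, the head count and the feed-forward size are assumptions chosen for the sketch, not values specified in this embodiment.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """One layer of the Transformers used here: multi-head self-attention + feed-forward (sketch)."""

    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, n, d_model)
        # Self-attention: q, k and v are all the same input sequence X.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x

# Each of T_audio, T_exp and T_cross stacks N = 10 such layers (only input/output dims differ).
encoder = nn.Sequential(*[SelfAttentionLayer() for _ in range(10)])
```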
Data preprocessing in this embodiment: the input source video is denoted V. First, the open-source audio/video processing tool ffmpeg is used to extract video frames at a frame rate of 25 fps, giving the video frame sequence V = (v_1, v_2, ..., v_n), where n is the number of video frames, and the source audio A is extracted from the video. For each frame of the video frame sequence, face detection is performed with the open-source face detection framework mtcnn, and videos containing frames with zero faces or more than one face are deleted from the data set. For each video frame in which exactly one face is detected, a 256 × 256 pixel region centered on the face detection box is cropped, giving the cropped video frame sequence V̂ = (v̂_1, v̂_2, ..., v̂_n), and the face key points detected by mtcnn are stored. The 3D face reconstruction method of "Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set" is then used to compute, for each cropped video frame v̂_i, the corresponding 3D face parameters (α_i, β_i, δ_i, γ_i, p_i), where α_i, β_i, δ_i, γ_i and p_i respectively denote the geometry, expression, illumination, texture and transformation parameters of the 3D face parametric model. The expression component β_i of the 3D face parameters is extracted from each frame to form the expression sequence β = (β_1, β_2, ..., β_n). For the voice part, the Mel-frequency cepstral coefficients (MFCC) of audio A are extracted with the open-source speech processing library python_speech_features; the MFCC features are then sliced with a sliding window of length 8 and step 4, and the MFCC frames of each slice are spliced along the channel dimension, yielding the speech slice sequence A = (a_1, a_2, ..., a_n) whose length equals the number of video frames n.
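As an illustration, a sketch of the audio part of this preprocessing is given below (the face detection, cropping and 3D reconstruction steps are omitted); the file paths, the 16 kHz mono resampling and the use of the default 10 ms MFCC step of python_speech_features are assumptions of the sketch. With a 10 ms step, a window of length 8 with step 4 produces 25 slices per second, matching the 25 fps video frames.

```python
import os
import subprocess

import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

# Extract 25 fps frames and a mono 16 kHz wav track with ffmpeg (illustrative paths).
os.makedirs("frames", exist_ok=True)
subprocess.run(["ffmpeg", "-i", "source.mp4", "-vf", "fps=25", "frames/%05d.png"], check=True)
subprocess.run(["ffmpeg", "-i", "source.mp4", "-vn", "-ac", "1", "-ar", "16000", "source.wav"], check=True)

def audio_to_slices(wav_path, win=8, step=4):
    """Slice MFCC features so that one slice corresponds to one 25 fps video frame."""
    rate, signal = wavfile.read(wav_path)
    feats = mfcc(signal, samplerate=rate)                 # (num_mfcc_frames, 13), 10 ms step by default
    slices = [feats[s:s + win].reshape(-1)                # splice the 8 MFCC frames along the channel dim
              for s in range(0, len(feats) - win + 1, step)]
    return np.stack(slices)                               # (n, win * 13): speech slice sequence A

A = audio_to_slices("source.wav")
```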
The network forward computation of this embodiment: because a self-supervised training scheme is used, a length n_position (n_position < n) must be chosen to split the training and supervision data. The subsequence of length n_position of the expression sequence β, β_train = (β_1, β_2, ..., β_{n_position}), is used as the input of the expression Transformer T_exp, and β_pred = (β_{n_position+1}, ..., β_n) is used as the regression ground truth. The speech slice sequence A is used as the input of the voice Transformer T_audio, where A_train = (a_1, a_2, ..., a_{n_position}) is the speech slice sequence matching the expression sequence, equivalent to the source voice input at test time, and A_pred = (a_{n_position+1}, a_{n_position+2}, ..., a_n) is the speech slice sequence whose expressions need to be predicted, equivalent to the driving voice input at test time. The two sequences β_train and A are then spliced in the sequence dimension to obtain a composite sequence of length n_position + n, which serves as the input of the cross-modal Transformer T_cross; the autocorrelation computation yields a result sequence of length n_position + n, R = (r_1, r_2, ..., r_{n_position+n}). According to the composition of the input sequence and the computation of the cross-modal Transformer, the subsequence of R at the positions corresponding to A_pred is the predicted expression sequence β̂_pred.
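A minimal sketch of this forward composition is shown below, following the step 4-6 formulation in which the two autocorrelation representations are spliced before the cross-modal Transformer; the argument names are illustrative, the three encoders are assumed to share a feature width, and the linear projections that would map MFCC slices and expression parameters onto that width are omitted.

```python
import torch

def predict_expressions(t_audio, t_exp, t_cross, beta_src, a_src, a_drive):
    """Cross-modal forward pass (sketch).

    beta_src : (1, n_src, d)   expression features of the source video
    a_src    : (1, n_src, d)   speech slices of the source audio
    a_drive  : (1, n_drv, d)   speech slices of the driving audio
    """
    speech = torch.cat([a_src, a_drive], dim=1)       # recombined speech sequence
    h_audio = t_audio(speech)                         # autocorrelation of the recombined speech
    h_exp = t_exp(beta_src)                           # autocorrelation of the expression sequence
    composite = torch.cat([h_exp, h_audio], dim=1)    # splice in the sequence dimension
    h_cross = t_cross(composite)                      # multi-modal correlation computation
    # Keep only the positions that correspond to the driving-speech slices.
    return h_cross[:, -a_drive.shape[1]:, :]
```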
the calculation of the loss function of this embodiment uses backward propagationThe broadcasting method trains a voice-to-expression mapping network based on multi-mode correlation; calculating predicted expression sequences
Figure BDA0003135948300000072
And betapredMaking mean square error loss (MSE), i.e.
Figure BDA0003135948300000073
Training a voice Transformer, an expression Transformer and a trans-modal Transformer through a back propagation algorithm, observing a loss function of a training result, and stopping training when loss values are not reduced in 5 continuous rounds;
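A sketch of such a training loop is given below; the optimizer, learning rate and epoch limit are illustrative assumptions, the wrapper `model` is assumed to implement the forward composition sketched above, while the MSE objective and the stop-after-5-stale-rounds rule follow the description.

```python
import torch

def train(model, loader, lr=1e-4, patience=5, max_epochs=200):
    """Self-supervised training sketch for the voice-to-expression mapping network."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        total = 0.0
        for beta_src, a_src, a_drive, beta_true in loader:
            beta_hat = model(beta_src, a_src, a_drive)            # predicted expression sequence
            loss = torch.nn.functional.mse_loss(beta_hat, beta_true)
            opt.zero_grad()
            loss.backward()                                       # back-propagate through all three Transformers
            opt.step()
            total += loss.item()
        if total < best:
            best, stale = total, 0
        else:
            stale += 1
            if stale >= patience:                                 # loss has not decreased for 5 rounds
                break
```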
step 2: acquiring a speaking video containing a target person and a segment of driving voice input;
step 3: using the same data preprocessing method as in step 1 to obtain an expression sequence of length n_seed and a source-voice MFCC feature sequence of length n_seed from the speaking video, and to obtain a driving-voice MFCC feature sequence of length n_position from the driving voice;
steps 4-6 constitute the expression prediction stage; for the specific process, refer to the diagram in fig. 2;
step 4: splicing the driving-voice MFCC feature sequence and the source-voice MFCC feature sequence in the sequence dimension to obtain the recombined MFCC, and inputting the recombined MFCC into the voice Transformer to obtain an autocorrelation representation of the recombined voice;
step 4 of this embodiment includes the following operations:
step 4.1: splicing the driving voice mfcc characteristic sequence (the sequence length is 3 in the figure) obtained in the step 3 and the source voice mfcc characteristic sequence (the sequence length is 6 in the figure) in the sequence dimension to obtain a composite mfcc characteristic (the sequence length is 9 in the figure);
step 4.2: inputting the composite mfcc into a voice Transformer, and obtaining the voice characteristics with autocorrelation after passing through a multi-head attention layer and a feedforward layer in a Transformer encoder;
and 5: inputting the expression sequence obtained in the step 3 into an expression Transformer TexpCalculating an autocorrelation representation of the expression;
step 6: splicing the two autocorrelation representations obtained in steps 4 and 5 in the sequence dimension, inputting the spliced sequence into the cross-modal Transformer network, performing the multi-modal correlation computation, and taking the subsequence at specific sequence positions to obtain the predicted expression feature sequence;
step 6 of this embodiment comprises the following operations:
step 6.1: splicing the voice autocorrelation representation obtained in the step 4 and the expression autocorrelation representation obtained in the step 5 on a sequence dimension to obtain a composite characteristic sequence (the sequence length in the figure is 15);
step 6.2: inputting the spliced composite features into the cross-modal Transformer and calculating the autocorrelation representation of the composite features;
step 6.3: in the obtained composite-feature autocorrelation representation, the part that computes the correlation between the driving voice and the source expression coefficients is taken as the result; the sequence length of this result is the same as the number of driving-voice sliding windows computed in the preprocessing stage (sequence length 3 in the figure);
all the Transformer model frameworks in the step 4-6 are based on the attribute Is All You Need, and the specific details are shown in FIG. 2; the voice Transformer consists of N self-attention layers, wherein N is 10 in the example; each self-attention layer comprises a multi-head self-attention layer and a feedforward layer; performing correlation calculation on the input sequence in a self-attribute layer to obtain a correlation representation; specifically, for the input sequence X ═ X (X)1,x2,...,xn)∈Rn ×mWhere n is the sequence length, m is the characteristic dimension of each sample, and the correlation expression is a general calculation formula of
Figure BDA0003135948300000081
X′∈Rn×mFor the autocorrelation calculation method employed in this example, q, k, v are all input sequences X, where X represents a matrix multiplication;
and 7: sequentially replacing the expression feature part of the face 3D parameters in the step 3 with the predicted expression feature sequence to obtain new 3D face parameters, and calculating a 3D face grid according to the 3D face parameters;
and 8: rendering the obtained 3D face mesh by using a rasterization method to obtain a primary 2D rendering result; intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain a 2D face image of the next half face;
step 8 of this embodiment comprises the following operations:
step 8.1: rendering the recombined face parameter sequence obtained in step 7 by rasterization to obtain a preliminary rendering result;
step 8.2: computing the positions of the face key points in the rendering of the recombined face parameters according to the correspondence between the 3D face mesh and the 2D face key points, and cropping the 2D image of the lower half of the face according to these face key points;
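As an illustration of the cropping in step 8.2, the sketch below cuts the lower-half-face region out of a rendered frame using projected 2D key points; the 68-point landmark layout and the "below the nose tip (point 30)" rule are assumptions of the example, not the exact cropping rule of this embodiment.

```python
import numpy as np

def crop_lower_face(rendered, landmarks2d):
    """Crop the lower half of the face from a rendered frame using 2D key points.

    rendered    : (H, W, 3) rendered image
    landmarks2d : (68, 2) projected face key points in pixel coordinates
    """
    nose_y = int(landmarks2d[30, 1])               # y of the nose tip: upper edge of the crop
    x0, x1 = int(landmarks2d[:, 0].min()), int(landmarks2d[:, 0].max())
    y1 = int(landmarks2d[:, 1].max())              # chin: lower edge of the crop
    return rendered[nose_y:y1, x0:x1]
```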
and step 9: refining the 2D face image obtained in the step 8 by using a neural rendering network to obtain a 2D face image sequence of the next half face; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the step 1 to obtain a 2D face image sequence;
step 9 of this embodiment includes the following operations:
step 9.1: inputting the lower-half-face pictures into the neural rendering network of "Photorealistic Audio-driven Video Portraits" to obtain the refined lower-half-face picture sequence;
step 9.2: pasting the refined lower-half-face picture sequence back onto the video frames extracted with ffmpeg in the preprocessing of step 3, according to the key-point positions detected with mtcnn in that preprocessing, to obtain the 2D face image sequence;
step 10: splicing the 2D face image sequence and the driving voice input from step 2 with ffmpeg to obtain the output video.
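For illustration, this splicing can be done with a single ffmpeg call such as the one below; the frame pattern and file names are assumptions.

```python
import subprocess

# Mux the generated 25 fps frame sequence with the driving audio (illustrative paths).
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "25", "-i", "result/%05d.png",   # generated 2D face image sequence
    "-i", "drive_audio.wav",                        # driving voice from step 2
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "output.mp4",
], check=True)
```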
Please refer to fig. 3 and fig. 4, which show results of the face animation generation method of the present invention. This example uses two test audio clips, both extracted from other speaking videos. For each test audio clip, three source videos are used for testing. Within each figure the same driving voice is used, and different rows show the results obtained with different source videos. It can be seen that the lip movements produced by the same driving audio with different source videos are similar, and the resulting talking videos are natural and fluent.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A face animation generation method based on multi-modal correlation is characterized by comprising the following steps:
step 1: building a voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on multi-modal correlation to train by using a loss function to obtain a trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioExpression Transformer network TexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpFor extracting the facial expression autocorrelation representation of the image modality, the input is the sourceExpression parameters of the video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
step 2: acquiring a source video of a target character speaking and a section of driving voice input;
and step 3: preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video, and obtaining a driving voice characteristic sequence from the driving voice;
and 4, step 4: splicing the driving voice characteristic sequence and the source voice characteristic sequence in the sequence dimension to obtain a recombined voice characteristic sequence, and inputting the recombined voice characteristic sequence into a voice Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
and 5: inputting the expression characteristic sequence obtained in the step 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
step 6: inputting the 2 kinds of autocorrelation expressions obtained in the step 4 and the step 5 into a cross-modal Transformer network T after the sequence dimensions are splicedcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
and 7: sequentially replacing the expression feature part of the facial 3D parameters extracted from the video in the step 3 by the predicted expression feature sequence to obtain a new recombined 3D facial feature representation, and calculating a 3D facial grid according to the 3D facial parameters;
and 8: rendering the obtained 3D face mesh to obtain a primary 2D rendering result, and intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain a 2D face image of the next half face;
and step 9: refining the 2D face image of the next half face in the step 8 by using a neural rendering network to obtain a 2D face image sequence of the next half face; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the step 1 to obtain a 2D face image sequence;
step 10: and (3) splicing the 2D face image sequence and the driving voice input in the step (2) by using ffmpeg to obtain a video output result.
2. The multi-modal correlation based face animation generation method of claim 1, wherein: the preprocessing of the 2D speaker video data set in step 1 comprises: for an input source video V, using the open-source audio/video processing tool ffmpeg to extract video frames at a frame rate of 25 fps, obtaining the video frame sequence V = (v_1, v_2, ..., v_n), where n is the number of video frames, and extracting the source audio A from the video; performing face detection on each video frame with the open-source face detection framework mtcnn, and deleting from the data set those videos containing frames with zero faces or more than one face; for each video frame in which exactly one face is detected, cropping a 256 × 256 pixel region centered on the face detection box to obtain the cropped video frame sequence V̂ = (v̂_1, v̂_2, ..., v̂_n), and storing the face key points detected by mtcnn; computing, with a 3D face reconstruction method, the 3D face parameters (α_i, β_i, δ_i, γ_i, p_i) corresponding to each cropped video frame v̂_i, where α_i, β_i, δ_i, γ_i and p_i respectively denote the geometry, expression, illumination, texture and transformation parameters of the 3D face parametric model; extracting the expression component β_i of the 3D face parameters from each frame to form the expression sequence β = (β_1, β_2, ..., β_n); and, for the voice part, extracting the Mel-frequency cepstral coefficients (MFCC) of audio A with the open-source speech processing library python_speech_features, slicing the MFCC features with a sliding window of length 8 and step 4, and splicing the MFCC frames of each slice along the channel dimension to obtain the speech slice sequence A = (a_1, a_2, ..., a_n) whose length equals the number of video frames n.
3. The multi-modal correlation based face animation generation method of claim 1, wherein: in step 1, the voice-to-expression mapping network is trained with a self-supervised method: a subsequence of length n_position (n_position < n) of the expression sequence β, β_train = (β_1, β_2, ..., β_{n_position}), is selected as the input of the expression Transformer network T_exp, and β_pred = (β_{n_position+1}, ..., β_n) is used as the regression ground truth; the speech slice sequence A is used as the input of the voice Transformer network T_audio, where A_train = (a_1, a_2, ..., a_{n_position}) is the speech slice sequence matching the expression sequence, equivalent to the source voice input at test time, and A_pred = (a_{n_position+1}, a_{n_position+2}, ..., a_n) is the speech slice sequence whose expressions need to be predicted, equivalent to the driving voice input at test time; the two sequences β_train and A are then spliced in the sequence dimension to obtain a composite sequence of length n_position + n, which serves as the input of the cross-modal Transformer network T_cross; the autocorrelation computation yields a result sequence R = (r_1, r_2, ..., r_{n_position+n}) of length n_position + n; and, according to the composition of the input sequence and the computation of the cross-modal Transformer, the subsequence of R at the positions corresponding to A_pred is the predicted expression sequence β̂_pred.
4. The multi-modal correlation based face animation generation method of claim 1, wherein: using the loss function in step 1 to guide the training of the voice-to-expression mapping network based on multi-modal correlation means computing the mean square error (MSE) loss between the predicted expression sequence β̂_pred and the real expression sequence β_pred, i.e.
L_MSE = MSE(β̂_pred, β_pred) = (1 / (n − n_position)) Σ_{i=n_position+1}^{n} ||β̂_i − β_i||²,
training the voice Transformer network, the expression Transformer network and the cross-modal Transformer network through the back-propagation algorithm, observing the loss of the training results, and stopping training when the loss value has not decreased for 5 consecutive rounds.
5. The method for generating a human face animation based on multi-modal correlation according to claim 1, wherein the step 4 comprises the following sub-steps:
step 4.1: splicing the driving-voice MFCC feature sequence obtained in step 3 and the source-voice MFCC feature sequence in the sequence dimension to obtain a composite MFCC feature;
step 4.2: inputting the composite MFCC into the voice Transformer network T_audio, and obtaining speech features with autocorrelation after the multi-head attention layer and the feed-forward layer in the Transformer encoder.
6. The method for generating a human face animation based on multi-modal correlation according to claim 1, wherein the step 6 comprises the following sub-steps:
step 6.1: splicing the voice autocorrelation representation obtained in the step 4 and the expression autocorrelation representation obtained in the step 5 on a sequence dimension to obtain a composite characteristic sequence;
step 6.2: inputting the spliced composite features into the cross-modal Transformer network T_cross and calculating the autocorrelation representation of the composite features;
step 6.3: in the obtained composite feature autocorrelation expression, a part of calculating the correlation between the driving voice and the source expression coefficient is taken as a result, and the sequence length of the result is the same as the number of the driving voice sliding windows calculated by the preprocessing part in the step 1.
7. A face animation generation system based on multi-modal correlation is characterized by comprising the following modules:
the system comprises a module 1, a voice-to-expression mapping network and a voice-to-expression mapping network, wherein the module 1 is used for building the voice-to-expression mapping network based on multi-modal correlation, preprocessing a 2D speaker video data set, then performing voice-to-expression mapping network training by using a self-supervision method, and guiding the voice-to-expression mapping network based on the multi-modal correlation to train by using a loss function to obtain the trained voice-to-expression mapping network;
the voice-to-expression mapping network with multi-modal correlation comprises a voice Transformer network TaudioAn expression Transformer network TexpAnd a trans-modal Transformer network Tcross
The voice Transformer network TaudioExpression Transformer network TexpAnd cross-modal Transformer network TcrossThe structure is the same, and the self-attention device comprises N self-attention layers, wherein each self-attention layer comprises a multi-head self-attention layer and a feedforward layer;
the voice Transformer network TaudioThe method comprises the steps of extracting an autocorrelation expression of a voice mode, and inputting source voice features and combined features obtained by splicing the driving voice features in sequence dimensions; expression Transformer network TexpThe facial expression self-correlation representation of the image modality is extracted and input as the expression parameters of the source video; cross-modal Transformer network TcrossAn autocorrelation representation for extracting the composite features of a speech modality and an image modality, input as TaudioAnd TexpSplicing the obtained autocorrelation characteristics in sequence dimensions to obtain composite characteristics;
the module 2 is used for acquiring a source video of a target person speaking and a section of driving voice input;
the module 3 is used for preprocessing the speaking video and the driving voice, obtaining an expression characteristic sequence and a source voice characteristic sequence from the speaking video and obtaining a driving voice characteristic sequence from the driving voice;
a module 4, configured to splice the driving speech feature sequence and the source speech feature sequence in sequence dimensions to obtain a recombined speech feature sequence, and input the recombined speech feature sequence into the speech Transformer network TaudioObtaining an autocorrelation representation of the recombined speech;
a module 5, configured to input the expression feature sequence obtained by the module 3 into an expression Transformer network TexpCalculating an autocorrelation representation of the expression sequence;
a module 6, configured to input the 2 kinds of autocorrelation representations obtained in the modules 4 and 5 into the cross-modal Transformer network T after sequence dimension splicingcrossPerforming multi-mode correlation calculation, and taking a subsequence with a specific sequence number to obtain a predicted expression characteristic sequence;
a module 7, configured to sequentially replace, in the module 3, the expression feature part of the face 3D parameter extracted from the video with the predicted expression feature sequence, to obtain a new recombined 3D face feature representation, and calculate a 3D face mesh according to the 3D face parameter;
a module 8, configured to render the obtained 3D face mesh to obtain a preliminary 2D rendering result; intercepting the 2D rendering image according to the corresponding relation between the 3D face model and the face key points to obtain the 2D face image of the next half face
A module 9, configured to refine the 2D face image of the module 8 by using a neural rendering network, so as to obtain a 2D face image sequence; replacing the lower half face part in the source video frame by the generated lower half face image according to the face key point information obtained by preprocessing in the module 1 to obtain a 2D face image sequence;
and the module 10 is used for splicing the 2D face image sequence and the driving voice input by the module 2 by using ffmpeg to obtain a video output result.
CN202110718414.8A 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation Active CN113450436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718414.8A CN113450436B (en) 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718414.8A CN113450436B (en) 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation

Publications (2)

Publication Number Publication Date
CN113450436A true CN113450436A (en) 2021-09-28
CN113450436B CN113450436B (en) 2022-04-15

Family

ID=77813260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718414.8A Active CN113450436B (en) 2021-06-28 2021-06-28 Face animation generation method and system based on multi-mode correlation

Country Status (1)

Country Link
CN (1) CN113450436B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120280974A1 (en) * 2011-05-03 2012-11-08 Microsoft Corporation Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN106462255A (en) * 2016-06-29 2017-02-22 深圳狗尾草智能科技有限公司 A method, system and robot for generating interactive content of robot
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112465935A (en) * 2020-11-19 2021-03-09 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium
CN112581569A (en) * 2020-12-11 2021-03-30 中国科学院软件研究所 Adaptive emotion expression speaker facial animation generation method and electronic device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
THIES J et al.: "Neural Voice Puppetry: Audio-driven Facial Reenactment", European Conference on Computer Vision, Springer *
XIAO LEI: "Speech-driven high-naturalness face animation", China Excellent Master's Theses Full-text Database (Information Science and Technology Series) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155321A (en) * 2021-11-26 2022-03-08 天津大学 Face animation generation method based on self-supervision and mixed density network
CN116664731A (en) * 2023-06-21 2023-08-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN116664731B (en) * 2023-06-21 2024-03-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN116993948B (en) * 2023-09-26 2024-03-26 粤港澳大湾区数字经济研究院(福田) Face three-dimensional reconstruction method, system and intelligent terminal
CN116993948A (en) * 2023-09-26 2023-11-03 粤港澳大湾区数字经济研究院(福田) Face three-dimensional reconstruction method, system and intelligent terminal
CN117115312A (en) * 2023-10-17 2023-11-24 天度(厦门)科技股份有限公司 Voice-driven facial animation method, device, equipment and medium
CN117115312B (en) * 2023-10-17 2023-12-19 天度(厦门)科技股份有限公司 Voice-driven facial animation method, device, equipment and medium
CN117237495A (en) * 2023-11-06 2023-12-15 浙江同花顺智能科技有限公司 Three-dimensional face animation generation method and system
CN117237495B (en) * 2023-11-06 2024-02-23 浙江同花顺智能科技有限公司 Three-dimensional face animation generation method and system
CN117315552A (en) * 2023-11-30 2023-12-29 山东森普信息技术有限公司 Large-scale crop inspection method, device and storage medium
CN117315552B (en) * 2023-11-30 2024-01-26 山东森普信息技术有限公司 Large-scale crop inspection method, device and storage medium
CN117635784A (en) * 2023-12-19 2024-03-01 世优(北京)科技有限公司 Automatic three-dimensional digital human face animation generation system
CN117635784B (en) * 2023-12-19 2024-04-19 世优(北京)科技有限公司 Automatic three-dimensional digital human face animation generation system

Also Published As

Publication number Publication date
CN113450436B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN113450436B (en) Face animation generation method and system based on multi-mode correlation
Xie et al. Realistic mouth-synching for speech-driven talking face using articulatory modelling
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
Xie et al. A coupled HMM approach to video-realistic speech animation
JP2003529861A (en) A method for animating a synthetic model of a human face driven by acoustic signals
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
Yao et al. Iterative text-based editing of talking-heads using neural retargeting
Bigioi et al. Speech driven video editing via an audio-conditioned diffusion model
Sadoughi et al. Expressive speech-driven lip movements with multitask learning
Chen et al. Transformer-s2a: Robust and efficient speech-to-animation
CN115376482A (en) Face motion video generation method and device, readable medium and electronic equipment
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN116051692A (en) Three-dimensional digital human face animation generation method based on voice driving
Medina et al. Speech driven tongue animation
Hussen Abdelaziz et al. Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models
Li et al. Speech driven facial animation generation based on GAN
Jha et al. Cross-language speech dependent lip-synchronization
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
Agarwal et al. Realistic Lip Animation from Speech for Unseen Subjects using Few-shot Cross-modal Learning
Krejsa et al. A novel lip synchronization approach for games and virtual environments
Aggarwal et al. Comprehensive overview of various lip synchronization techniques
CN117528197B (en) High-frame-rate playback type quick virtual film making system
Chen et al. Text to avatar in multimodal human computer interface
Thikekar et al. Generative Adversarial Networks based Viable Solution on Dubbing Videos With Lips Synchronization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant