CN115100329A - Multi-mode driving-based emotion controllable facial animation generation method - Google Patents
Multi-mode driving-based emotion controllable facial animation generation method
- Publication number
- CN115100329A CN115100329A CN202210744504.9A CN202210744504A CN115100329A CN 115100329 A CN115100329 A CN 115100329A CN 202210744504 A CN202210744504 A CN 202210744504A CN 115100329 A CN115100329 A CN 115100329A
- Authority
- CN
- China
- Prior art keywords
- coordinate
- emotion
- facial
- face
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 230000008451 emotion Effects 0.000 title claims abstract description 156
- 230000001815 facial effect Effects 0.000 title claims abstract description 141
- 238000000034 method Methods 0.000 title claims abstract description 47
- 238000012549 training Methods 0.000 claims abstract description 19
- 238000007781 pre-processing Methods 0.000 claims abstract description 6
- 238000012545 processing Methods 0.000 claims abstract description 5
- 230000006870 function Effects 0.000 claims description 44
- 230000004927 fusion Effects 0.000 claims description 18
- 238000004364 calculation method Methods 0.000 claims description 16
- 238000006243 chemical reaction Methods 0.000 claims description 14
- 230000007246 mechanism Effects 0.000 claims description 12
- 230000001360 synchronised effect Effects 0.000 claims description 11
- 230000002996 emotional effect Effects 0.000 claims description 9
- 230000015572 biosynthetic process Effects 0.000 claims description 4
- 238000000605 extraction Methods 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 238000003786 synthesis reaction Methods 0.000 claims description 4
- 238000005070 sampling Methods 0.000 claims description 3
- 238000013528 artificial neural network Methods 0.000 claims description 2
- 238000009877 rendering Methods 0.000 claims 1
- 230000014509 gene expression Effects 0.000 abstract description 8
- 230000006403 short-term memory Effects 0.000 abstract description 4
- 238000005516 engineering process Methods 0.000 abstract description 2
- 230000007787 long-term memory Effects 0.000 abstract description 2
- 238000007634 remodeling Methods 0.000 abstract description 2
- 230000008921 facial expression Effects 0.000 description 7
- 230000000007 visual effect Effects 0.000 description 5
- 238000010586 diagram Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000012952 Resampling Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 230000008909 emotion recognition Effects 0.000 description 1
- 230000010482 emotional regulation Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000002269 spontaneous effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Databases & Information Systems (AREA)
- Child & Adolescent Psychology (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention relates to image processing technology, and in particular to a multi-modal-driven method for generating facial animation with controllable emotion. Step S1: preprocess the images of a portrait video to obtain a facial 3D feature coordinate sequence. Step S2: preprocess the audio of the portrait video and decouple it into an audio content vector and an audio style vector. Step S3: train a facial lip-sync coordinate animation generation network, composed of a multi-layer perceptron and a long short-term memory network, on the facial 3D feature coordinate sequence and the audio content vector. The invention introduces an emotion portrait as the emotion source and reshapes the emotion of the target portrait through the joint driving of the emotion-source portrait and the audio, providing diverse emotional facial animation. Under multi-modal driving, the method avoids the low robustness of a single audio driving source, removes the dependence of emotion generation on emotional speech recognition, strengthens the complementarity among the data, and achieves more realistic emotional expression in the generated facial animation.
Description
Technical Field
The invention relates to image processing technology, and in particular to a multi-modal-driven method for generating facial animation with controllable emotion.
Background
Facial animation generation is a popular research area in computer-vision generative models. Its purpose is to transform a still portrait into a realistic facial animation driven by arbitrary audio. It has broad application prospects in fields such as assistive therapy systems for the hearing- and speech-impaired, virtual anchors, and games with customizable characters. However, owing to the limitations of their principles and characteristics, existing facial animation generation methods produce portrait animations whose emotional expression remains immature, which seriously limits their application value.
In recent years, much research in facial animation generation has focused on realistic lip movement and head-pose swing, while portrait emotion is an equally important factor. Emotional information in the portrait strongly affects the expressiveness of the synthesized facial animation: different facial expressions often give the same sentence different emotional coloring, and perceiving emotional information in the visual modality is one of the important channels of human audio-visual speech communication. However, most facial animation generation methods use audio as a single-modality driving source, which performs well for the lip movement of syllables but relatively poorly for facial expressions. The reason is that direct audio driving is affected by the complexity of audio emotion and by noise, so the generated facial expressions often exhibit ghosting and distortion, leading to poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by the accuracy of emotional speech recognition, so their efficiency is low and the emotion of the generated facial video lacks diversity and naturalness.
Disclosure of Invention
To address the lack of emotion control in existing facial animation generation methods, the invention provides a multi-modal-driven method for generating facial animation with controllable emotion.
The invention is realized by adopting the following technical scheme:
the method for generating the emotion controllable facial animation based on multi-mode driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used to obtain a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
Step S3: based on the face coordinate sequence obtained in step S1 and the audio content vector obtained in step S2, a face lip sound coordinate animation generation network composed of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network is trained.
Step S4: based on the face coordinate sequence obtained in step S1 and the audio content vector and the audio style vector obtained in step S2, a face emotion coordinate animation generation network composed of MLP, LSTM, Self-attention mechanism (Self-attention) and generation countermeasure network (GAN) is trained.
Step S5: based on the face coordinate sequence obtained in step S1, a coordinate-to-video network composed of GANs is trained.
Step S6: based on the facial lip tone coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, any two portrait pictures (one representing an identity source and one representing an emotion source) and any one section of audio are input, and a lip tone synchronous video of a target portrait with emotion corresponding to the emotion source is generated.
The multi-modal-driven emotion-controllable facial animation generation method is technically supported by computer-vision generative models and deep neural networks, which together realize the emotion-controllable facial animation generation network.
The invention has the following beneficial effects: compared with existing facial animation generation methods, the method accounts for the facial-expression ghosting and distortion caused by relying on audio features alone and by the low accuracy of emotional speech recognition. It introduces an emotion portrait as the emotion source and uses the multi-modal driving of emotion-source portrait features and audio features to reshape the emotion of the target portrait, generating facial animation with controllable emotion. The dual driving by the emotion portrait and the audio avoids the dependence of emotion generation on speech information alone, so the generated video has controllable emotion while satisfying lip synchronization and spontaneous head swing; that is, it guarantees the diversity and naturalness of the facial animation and achieves more realistic emotional expression.
The method effectively solves the inefficiency of existing facial animation generation methods, whose facial expressions are limited by the accuracy of speech emotion recognition, and it can be applied in fields such as assistive therapy systems for the hearing- and speech-impaired, virtual anchors, and games with customizable characters.
Drawings
FIG. 1 is a schematic diagram of a multi-modal driven emotion controllable facial animation generation structure according to an embodiment of the invention.
Fig. 2 is a schematic diagram comparing the present invention with a conventional facial animation method.
Fig. 3 is a sample video schematic of an embodiment of the invention.
Detailed Description
In this embodiment, the portrait video data set used is derived from a public Multi-view Emotional Audio-visual data set (MEAD).
As shown in FIG. 1, the multi-modal-driven emotion-controllable facial animation generation method is realized with the following steps:
Step S1: preprocess the images of the portrait video, then use the face recognition algorithm face alignment to obtain a facial 3D feature coordinate sequence.
Step S2: preprocess the audio of the portrait video, then use a voice conversion method to decouple the preprocessed audio into a speaker-independent audio content vector and a speaker-dependent audio style vector.
Step S3: based on the facial coordinate sequence obtained in step S1 and the audio content vector obtained in step S2, train a facial lip-sync coordinate animation generation network composed of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network.
Step S4: based on the facial coordinate sequence obtained in step S1 and the audio content vector and style vector obtained in step S2, train a facial emotion coordinate animation generation network composed of MLPs, LSTMs, a self-attention mechanism and a generative adversarial network (GAN).
Step S5: based on the facial coordinate sequence obtained in step S1, train a coordinate-to-video network composed of GANs. During this training step, a loss function measures the pixel distance between the reconstructed face and the training target face, which is minimized.
Step S6: based on the facial lip-sync coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in steps S3, S4 and S5, input any two portrait pictures (one representing the identity source and one representing the emotion source) and any segment of audio, and generate a lip-synced video of the target portrait carrying the emotion of the emotion source.
In step S1, the images of the portrait video are preprocessed. The preprocessing consists of frame-rate conversion, image resampling and facial coordinate extraction.
First, the video frame rate is converted to 62.5 frames per second. The frames are then resampled and cropped into 256 x 256 video containing the face. Finally, facial coordinates are extracted with the face recognition algorithm face alignment, and the 3D coordinates of the face in each frame (with dimension 68 x 3) are collected to form the facial 3D feature coordinate sequence.
In addition, the facial 3D feature coordinate sequence is saved as an emotion-source portrait coordinate sequence (emotion-source face coordinates) and an identity-source portrait coordinate sequence (identity-source face coordinates). Compared with the raw pixels of the portrait, the facial coordinates provide a natural low-dimensional representation of the portrait and a high-quality bridge for the downstream emotion reenactment task.
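The following is a minimal sketch of this preprocessing stage, assuming OpenCV for frame handling and the open-source face_alignment package for 3D landmark detection; the LandmarksType enum name varies between package versions, and frame-rate conversion is assumed to have been done beforehand (e.g. with FFmpeg):

```python
import cv2
import numpy as np
import face_alignment

# 3D landmark detector (68 x 3 points per face); note the enum name differs
# across face_alignment releases (_3D vs THREE_D).
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._3D, device="cuda")

def extract_coordinate_sequence(video_path, size=256):
    """Return an (num_frames, 68, 3) array of facial 3D feature coordinates."""
    cap = cv2.VideoCapture(video_path)  # frame rate assumed already converted
    coords = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (size, size))                # 256 x 256 face video
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        landmarks = fa.get_landmarks(rgb)                      # list of (68, 3) arrays
        if landmarks:                                          # keep frames with a detected face
            coords.append(landmarks[0])
    cap.release()
    return np.stack(coords)

# The same routine is run on both input portraits, giving the identity-source
# and emotion-source coordinate sequences described above.
```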
In step S2, the audio of the portrait video is preprocessed. The preprocessing consists of sampling-rate conversion, audio vector extraction and audio vector decoupling.
The audio is first converted to a sampling rate of 16,000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group). Audio vector extraction is then performed, obtaining the audio vector with the Python Resemblyzer library. Finally, the audio vector is input into the voice conversion model AutoVC, which decouples it into a speaker-independent audio content vector and a speaker-dependent audio style vector.
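A hedged sketch of this audio pipeline, assuming the FFmpeg command-line tool and the Python Resemblyzer package; the AutoVC content encoder is represented only by a placeholder, since its exact interface depends on the released checkpoint:

```python
import subprocess
from resemblyzer import VoiceEncoder, preprocess_wav

def resample_to_16k(src, dst="audio_16k.wav"):
    # FFmpeg resamples the track to 16 kHz mono, matching the preprocessing above.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst], check=True)
    return dst

def autovc_content_encoder(wav, style_vector):
    """Placeholder for the pretrained AutoVC content encoder: it should return
    per-frame, speaker-independent content vectors. Wire in the released
    AutoVC checkpoint here; this stub is only illustrative."""
    raise NotImplementedError

wav = preprocess_wav(resample_to_16k("portrait_video.mp4"))

# Speaker-level style embedding from Resemblyzer (a 256-d d-vector).
encoder = VoiceEncoder()
style_vector = encoder.embed_utterance(wav)

# Speaker-independent content vectors from the AutoVC-style decoupling step.
content_vectors = autovc_content_encoder(wav, style_vector)
```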
In step S3, the training of the facial lip-sync coordinate animation generation network is completed.
The network adopts a custom encoder-decoder structure: the encoder comprises a facial coordinate encoder composed of two MLP layers and a speech content encoder composed of three LSTM layers, and the decoder is a facial lip-sync coordinate decoder composed of three MLP layers. To generate an optimal sequence of facial lip-sync coordinate offsets, the network sets a loss function that continuously adjusts the weights and biases of the network until the error between the predicted coordinates and the reference coordinates is minimized.
The custom encoder-decoder network structure is as follows:
First, a two-layer MLP extracts the identity feature from the facial 3D feature coordinates of the first frame of the video obtained in step S1 (i.e. the first time point of the facial 3D feature coordinate sequence). Then, based on this identity feature and the audio content vector obtained in step S2, the two are linearly fused and a three-layer LSTM extracts the coordinate dependency between consecutive audio syllables and the lips. Finally, based on the encoder output, a decoder composed of three MLP layers predicts the sequence of facial lip-sync coordinate offsets, calculated as follows:
$$\Delta P_t = \mathrm{MLP}_c\big(\mathrm{LSTM}_c(Ec_{t \to t+\lambda},\, \mathrm{MLP}_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c}\big) \qquad (1)$$
In formula (1), $\Delta P_t$ denotes the predicted facial lip-sync coordinate offset of frame $t$, where $t$ is the current frame of the portrait video; $\mathrm{MLP}_L$ denotes the facial coordinate encoder, $L$ the facial coordinates of the first frame of the portrait video, and $W_{mlp,l}$ the learnable parameters of the facial coordinate encoder; $\mathrm{LSTM}_c$ denotes the speech content encoder, $Ec$ the audio content vector, $t \to t+\lambda$ indicates that the audio content vector is fed to the speech content encoder with a batch size of $\lambda = 18$ per frame $t$, and $W_{lstm}$ denotes the learnable parameters of the speech content encoder; $\mathrm{MLP}_c$ denotes the facial lip-sync coordinate decoder and $W_{mlp,c}$ its learnable parameters.
The first-frame coordinates of the portrait video are corrected with the predicted facial lip-sync coordinate offset sequence to obtain the lip-synchronized coordinate sequence, calculated as follows:
$$P_t = L + \Delta P_t \qquad (2)$$
In formula (2), $P_t$ denotes the lip-synchronized facial coordinates of frame $t$, where $t$ is the current frame of the portrait video; $L$ is the facial coordinates of the first frame of the portrait video, and $\Delta P_t$ the predicted facial lip-sync coordinate offset of frame $t$.
To generate an optimal sequence of facial lip-sync coordinate offsets, a loss function is set on the encoder-decoder structure of the facial lip-sync coordinate animation generation network to adjust the network's weights and biases. The objective of the loss function is to minimize the error between the predicted coordinates and the coordinates obtained in step S1:
$$\mathcal{L}_{lip} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\left\|P_{i,t} - \hat{P}_{i,t}\right\|_2^2 \qquad (3)$$
In formula (3), $\mathcal{L}_{lip}$ denotes the loss function of the facial lip-sync coordinate animation generation network, $T$ the total number of video frames, $t$ the current frame of the portrait video, $N = 68$ the total number of facial coordinates, and $i$ the index of the current facial coordinate; $P_{i,t}$ denotes the $i$-th predicted facial coordinate of frame $t$, and $\hat{P}_{i,t}$ the corresponding facial coordinate obtained in step S1; $\|P_{i,t} - \hat{P}_{i,t}\|_2^2$ is the squared Euclidean norm of their difference.
When the loss function tends to be smooth, i.e. $\mathcal{L}_{lip}$ reaches its minimum, the training of the facial lip-sync coordinate animation synthesis network is complete.
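A minimal PyTorch sketch of the encoder-decoder of equations (1)-(3) is given below; the hidden sizes are illustrative assumptions, as the text only fixes the layer counts:

```python
import torch
import torch.nn as nn

class LipSyncCoordinateNet(nn.Module):
    """Facial lip-sync coordinate animation generation network (eqs. 1-2)."""
    def __init__(self, content_dim=80, hidden=256):
        super().__init__()
        # Face coordinate encoder MLP_L: two-layer MLP over the 68 x 3 first-frame landmarks.
        self.face_enc = nn.Sequential(nn.Linear(68 * 3, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))
        # Speech content encoder LSTM_c: three stacked LSTM layers over the fused input.
        self.content_enc = nn.LSTM(content_dim + hidden, hidden, num_layers=3, batch_first=True)
        # Facial lip-sync coordinate decoder MLP_c: three-layer MLP predicting offsets dP_t.
        self.decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 68 * 3))

    def forward(self, first_frame_coords, audio_content):
        # first_frame_coords: (B, 68, 3); audio_content: (B, T, content_dim)
        B, T, _ = audio_content.shape
        identity = self.face_enc(first_frame_coords.flatten(1))           # MLP_L(L)
        fused = torch.cat([audio_content,
                           identity.unsqueeze(1).expand(B, T, -1)], -1)   # linear fusion
        h, _ = self.content_enc(fused)                                    # LSTM_c(...)
        delta = self.decoder(h).view(B, T, 68, 3)                         # dP_t
        return first_frame_coords.unsqueeze(1) + delta                    # P_t = L + dP_t

# Training objective corresponding to eq. (3): mean squared Euclidean distance
# between predicted and reference coordinates,
# e.g. loss = ((pred - target) ** 2).sum(-1).mean().
```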
In step S4, the training of the facial emotion coordinate animation generation network is completed, adding rich visual emotional expression to the generated video.
Humans rely on visual information when interpreting emotion, and rich visual emotional expression gives a stronger sense of realism and greater practical value. Most existing facial animation generation algorithms express the lip movement and head-pose swing of the facial animation from the audio modality alone. Audio-only driving works well for the lip movement of syllables but relatively poorly for facial expressions, because direct audio driving is affected by the complexity of audio emotion and by noise, so the generated expressions often exhibit ghosting and distortion, leading to poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by its accuracy, so their efficiency is low and the emotion of the generated facial video lacks diversity and naturalness.
This patent therefore proposes a multi-modal-driven facial emotion coordinate animation generation network: an emotion portrait is introduced as the emotion source and, combined with the audio features under multi-modal driving, the emotion of the target portrait is reshaped more accurately.
The network is a custom encoder-decoder structure: the encoder comprises an audio encoder and facial coordinate encoders, and the decoder comprises a coordinate decoder. The encoder obtains the audio features, the portrait identity features and the portrait emotion features. The decoder processes these multi-modal features and, driven jointly by the audio features and the portrait emotion features, generates the coordinate offset sequence after the target portrait's emotion has been reshaped, adding rich visual emotional expression to the video. Under this multi-modal driving, the method avoids the low robustness of a single audio driving source, removes the dependence of emotion generation on emotional speech recognition, strengthens the complementarity among the data, and achieves more realistic emotional expression in the facial animation.
To generate an optimal facial emotion coordinate offset sequence, three different loss functions are set on the encoder-decoder structure of the facial emotion coordinate animation generation network to adjust its weights and biases. The first computes the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1. The second and third are discriminator loss functions that respectively judge whether the generated facial coordinates are real or fake and assess the similarity between interval frames of the facial coordinates.
The custom encoder-decoder structure of the facial emotion coordinate animation generation network is as follows:
The encoder consists of an audio encoder, an identity-source facial coordinate encoder and an emotion-source facial coordinate encoder. The audio encoder captures the audio features through a three-layer LSTM, a three-layer MLP and a self-attention mechanism.
Specifically, the LSTM first extracts features from the audio content vector obtained in step S2; the MLP then extracts features from the audio style vector obtained in step S2; next, the audio content features and the audio style features are linearly fused; finally, a self-attention mechanism captures the longer-range structural dependency between the audio content vector and the audio style vector, yielding audio features with stronger temporal dependence, calculated as follows:
$$S_t = \mathrm{Attn}\big(\mathrm{LSTM}_{c'}(Ec_{t \to t+\lambda}; W'_{lstm}),\ \mathrm{MLP}_s(Es; W_{mlp,s}); W_{attn}\big) \qquad (4)$$
In formula (4), $S_t$ denotes the processed audio features of frame $t$, where $t$ is the current frame of the portrait video; $\mathrm{MLP}_s$ denotes the audio style vector encoder, $Es$ the audio style vector, and $W_{mlp,s}$ the learnable parameters of the audio style vector encoder; $\mathrm{LSTM}_{c'}$ denotes the audio content vector encoder, $Ec$ the audio content vector, $t \to t+\lambda$ indicates that the audio content vector is fed to the audio content vector encoder with a batch size of $\lambda = 18$ per frame $t$, and $W'_{lstm}$ denotes the learnable parameters of the audio content vector encoder; $\mathrm{Attn}$ denotes the self-attention mechanism and $W_{attn}$ its learnable parameters.
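A PyTorch sketch of this audio branch (equation (4)) follows, using nn.MultiheadAttention as a stand-in for the self-attention mechanism; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AudioEmotionEncoder(nn.Module):
    """Audio branch of the facial emotion network (eq. 4); dimensions are illustrative."""
    def __init__(self, content_dim=80, style_dim=256, hidden=256, heads=4):
        super().__init__()
        self.content_lstm = nn.LSTM(content_dim, hidden, num_layers=3, batch_first=True)  # LSTM_c'
        self.style_mlp = nn.Sequential(nn.Linear(style_dim, hidden), nn.ReLU(),            # MLP_s
                                       nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)             # Attn

    def forward(self, content, style):
        # content: (B, T, content_dim); style: (B, style_dim)
        c, _ = self.content_lstm(content)                      # per-frame content features
        s = self.style_mlp(style).unsqueeze(1).expand_as(c)    # broadcast style over time
        fused = torch.cat([c, s], dim=-1)                      # linear fusion of the two
        out, _ = self.attn(fused, fused, fused)                # self-attention -> S_t
        return out                                             # (B, T, 2 * hidden)
```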
Both facial coordinate encoders are lightweight neural networks composed of seven MLP layers. They share the same structure but serve different functions: one extracts the geometric information of identity, the other the geometric information of facial emotion.
Based on the two different facial coordinate sequences obtained in step S1 (one treated as the identity-source facial coordinate sequence and the other as the emotion-source facial coordinate sequence), the identity-source facial coordinate encoder, composed of seven MLP layers, first extracts the portrait identity features of the identity source; the emotion-source facial coordinate encoder, also composed of seven MLP layers, then extracts the portrait emotion features of the emotion source; finally, the portrait identity features, the portrait emotion features and the audio features obtained from formula (4) are linearly fused to obtain the fusion features, calculated as follows:
$$F_t = \mathrm{concat}\big(\mathrm{MLP}_{LA}(L_a; W_{mlp,la}),\ \mathrm{MLP}_{LB}(L_b; W_{mlp,lb}),\ S_t\big) \qquad (5)$$
In formula (5), $F_t$ denotes the fusion features of frame $t$ after linear fusion, and concat denotes linear fusion; $\mathrm{MLP}_{LA}$ denotes the identity-source facial coordinate encoder, $L_a$ the facial coordinates of the first frame of the identity-source portrait video, and $W_{mlp,la}$ the learnable parameters of the identity-source facial coordinate encoder; $\mathrm{MLP}_{LB}$ denotes the emotion-source facial coordinate encoder, $L_b$ the facial coordinates of the first frame of the emotion-source portrait video, and $W_{mlp,lb}$ the learnable parameters of the emotion-source facial coordinate encoder; $S_t$ denotes the frame-$t$ audio features obtained from formula (4).
Based on the fusion of the portrait identity features, the portrait emotion features and the audio features obtained from formula (5), a coordinate decoder composed of three MLP layers predicts the facial emotion coordinate offset sequence, calculated as follows:
$$\Delta Q_t = \mathrm{MLP}_{LD}(F_t; W_{mlp,ld}) \qquad (6)$$
In formula (6), $\Delta Q_t$ denotes the predicted facial emotion coordinate offset of frame $t$, where $t$ is the current frame of the portrait video; $\mathrm{MLP}_{LD}$ denotes the decoder of the facial emotion coordinate animation generation network, $F_t$ the frame-$t$ fusion features after the linear fusion of formula (5), and $W_{mlp,ld}$ the learnable parameters of the decoder.
The first-frame coordinates of the identity-source portrait video are corrected with the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence, calculated as follows:
$$Q_t = L_a + \Delta Q_t \qquad (7)$$
In formula (7), $Q_t$ denotes the facial emotion coordinates of frame $t$, where $t$ is the current frame of the portrait video; $L_a$ is the facial coordinates of the first frame of the identity-source portrait video, and $\Delta Q_t$ the predicted facial emotion coordinate offset of frame $t$.
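The two seven-layer coordinate encoders and the three-layer decoder of equations (5)-(7) can be sketched in PyTorch as follows; layer widths are assumptions, and the audio feature dimension is chosen to match the audio-encoder sketch above:

```python
import torch
import torch.nn as nn

def seven_layer_mlp(in_dim, hidden=128, out_dim=128):
    """Lightweight seven-layer MLP used for both facial coordinate encoders."""
    layers, d = [], in_dim
    for _ in range(6):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class EmotionCoordinateDecoder(nn.Module):
    """Identity/emotion coordinate encoders plus the three-layer MLP decoder (eqs. 5-7)."""
    def __init__(self, audio_dim=512, feat=128):
        super().__init__()
        self.id_enc = seven_layer_mlp(68 * 3, out_dim=feat)     # MLP_LA (identity source)
        self.emo_enc = seven_layer_mlp(68 * 3, out_dim=feat)    # MLP_LB (emotion source)
        self.decoder = nn.Sequential(nn.Linear(2 * feat + audio_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 256), nn.ReLU(),
                                     nn.Linear(256, 68 * 3))    # MLP_LD

    def forward(self, id_coords, emo_coords, audio_feats):
        # id_coords, emo_coords: (B, 68, 3); audio_feats S_t: (B, T, audio_dim)
        B, T, _ = audio_feats.shape
        ida = self.id_enc(id_coords.flatten(1)).unsqueeze(1).expand(B, T, -1)
        emo = self.emo_enc(emo_coords.flatten(1)).unsqueeze(1).expand(B, T, -1)
        fused = torch.cat([ida, emo, audio_feats], dim=-1)      # F_t = concat(...), eq. 5
        delta_q = self.decoder(fused).view(B, T, 68, 3)         # dQ_t, eq. 6
        return id_coords.unsqueeze(1) + delta_q                 # Q_t = L_a + dQ_t, eq. 7
```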
To generate an optimal facial emotion coordinate offset sequence, three different loss functions are set on the encoder-decoder structure of the facial emotion coordinate animation generation network to adjust its weights and biases, as follows:
$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{emo} + \lambda_2 \mathcal{L}_{D_L} + \lambda_3 \mathcal{L}_{D_T} \qquad (8)$$
In formula (8), $\mathcal{L}_{total}$ denotes the total loss function of the facial emotion coordinate animation generation network, $\mathcal{L}_{emo}$ the coordinate loss function of the facial emotion coordinate animation generation network, $\mathcal{L}_{D_L}$ the loss function of the facial coordinate discriminator $D_L$, and $\mathcal{L}_{D_T}$ the loss function of the facial coordinate interval-frame similarity discriminator $D_T$; $\lambda_1, \lambda_2, \lambda_3$ are weight parameters.
The facial coordinate loss function computes the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1 (the identity-source coordinate sequence carrying the same emotion as the emotion source), calculated as follows:
$$\mathcal{L}_{emo} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\left\|Q_{i,t} - \hat{Q}_{i,t}\right\|_2^2 \qquad (9)$$
In formula (9), $\mathcal{L}_{emo}$ denotes the coordinate loss function of the facial emotion coordinate animation generation network, $T$ the total number of video frames, $t$ the current frame of the portrait video, $N = 68$ the total number of facial coordinates, and $i$ the index of the current facial coordinate; $Q_{i,t}$ denotes the $i$-th predicted facial coordinate of frame $t$, and $\hat{Q}_{i,t}$ the corresponding facial coordinate obtained in step S1; $\|Q_{i,t} - \hat{Q}_{i,t}\|_2^2$ is the squared Euclidean norm of their difference.
During training of the facial emotion coordinate animation generation network, the discriminator loss function $\mathcal{L}_{D_L}$ is used to judge whether the generated facial coordinates are real or fake, and the discriminator loss function $\mathcal{L}_{D_T}$ is used to assess the similarity between interval frames of the facial coordinates.
In formulas (10) and (11), $t$ denotes the current frame of the portrait video; $D_L$ denotes the discriminator that judges whether the facial coordinates are real or fake, and $\mathcal{L}_{D_L}$ its loss function; $D_T$ denotes the facial coordinate interval-frame similarity discriminator, and $\mathcal{L}_{D_T}$ its loss function; $Q_t$ denotes the predicted facial emotion coordinates of frame $t$, $\hat{Q}_t$ the frame-$t$ facial coordinates obtained in step S1, and $\hat{Q}_{t-1}$ the facial coordinates of the previous frame.
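One possible instantiation of the total objective of equations (8)-(11) is sketched below; the coordinate term follows equation (9), while the two adversarial terms use a standard binary cross-entropy GAN formulation, which is an assumption since the text only names the discriminators $D_L$ and $D_T$ and does not give their exact form:

```python
import torch
import torch.nn.functional as F

def emotion_network_loss(pred_q, target_q, d_l_real, d_l_fake, d_t_real, d_t_fake,
                         lam1=1.0, lam2=0.1, lam3=0.1):
    """Sketch of the total objective (eq. 8): coordinate regression plus two
    discriminator terms. The weights lam1-lam3 are illustrative, and the BCE
    adversarial form is an assumption; the generator/discriminator update split
    is omitted for brevity."""
    # Eq. (9): mean squared Euclidean distance between predicted and reference coordinates.
    l_coord = ((pred_q - target_q) ** 2).sum(-1).mean()
    # D_L: real/fake discriminator over generated facial coordinates (assumed BCE form).
    l_dl = F.binary_cross_entropy_with_logits(d_l_real, torch.ones_like(d_l_real)) + \
           F.binary_cross_entropy_with_logits(d_l_fake, torch.zeros_like(d_l_fake))
    # D_T: discriminator over interval-frame similarity of coordinate sequences.
    l_dt = F.binary_cross_entropy_with_logits(d_t_real, torch.ones_like(d_t_real)) + \
           F.binary_cross_entropy_with_logits(d_t_fake, torch.zeros_like(d_t_fake))
    return lam1 * l_coord + lam2 * l_dl + lam3 * l_dt
```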
When the loss functions tend to be smooth, the training of the facial emotion coordinate animation synthesis network is complete.
In step S5, the training of the coordinate-to-video network is completed.
Based on the facial coordinate sequence obtained in step S1, the discrete coordinates are connected by index and rendered as colored line segments to create a three-channel facial sketch sequence of size 256 x 256. This sequence is channel-concatenated with the original first-frame picture of the corresponding video to create a six-channel picture sequence of size 256 x 256. Taking this sequence as input, the coordinate-to-video network generates the reconstructed facial video.
To generate an optimal facial video, an L1 loss function (L1-norm loss) is set on the image translation network to adjust its weights and biases. The objective of this loss is to minimize the pixel distance between the reconstructed facial video and the training target facial video.
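A sketch of the sketch-rendering and channel-concatenation step, assuming OpenCV; the landmark grouping follows the standard 68-point layout and the colors are arbitrary choices:

```python
import cv2
import numpy as np

# Index ranges of the standard 68-point layout, used to connect the discrete
# coordinates "by index" into colored line segments (jaw, brows, nose, eyes, lips).
GROUPS = [range(0, 17), range(17, 22), range(22, 27), range(27, 36),
          range(36, 42), range(42, 48), range(48, 60), range(60, 68)]
COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
          (255, 0, 255), (0, 255, 255), (128, 255, 128), (255, 128, 128)]

def render_face_sketch(coords, size=256):
    """coords: (68, 3) facial coordinates -> three-channel 256 x 256 sketch image."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    for color, grp in zip(COLORS, GROUPS):
        pts = coords[list(grp), :2].astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(canvas, [pts], isClosed=False, color=color, thickness=2)
    return canvas

def make_six_channel_input(coords, first_frame_bgr):
    """Channel-concatenate the sketch with the first video frame -> (256, 256, 6)."""
    sketch = render_face_sketch(coords)
    return np.concatenate([sketch, first_frame_bgr], axis=-1)
```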
In step S6, based on the facial lip-sync coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in steps S3, S4 and S5, any two portrait pictures (one representing the identity source and the other the emotion source) and any segment of audio are input to generate the target video.
The face recognition algorithm face alignment is used to obtain the corresponding identity-source portrait coordinates and emotion-source portrait coordinates, and the voice conversion method is used to obtain the audio content vector and audio style vector of the audio. The audio content vector and the identity-source coordinates are passed through the facial lip-sync coordinate animation generation network obtained in step S3 to generate a lip-synchronized facial coordinate offset sequence. The audio content vector, audio style vector, identity-source coordinates and emotion-source coordinates are passed through the facial emotion coordinate animation generation network obtained in step S4 to generate a facial emotion coordinate offset sequence. The identity-source coordinates are corrected with these two offset sequences to obtain the final coordinate sequence, which is input to the coordinate-to-video network obtained in step S5 to generate a lip-synced video of the target portrait carrying the emotion of the emotion source.
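The inference pipeline of step S6 can be summarized by the following sketch, in which all callables stand for the trained networks and preprocessing routines described above; the names are illustrative, and the two coordinate networks are assumed, for this sketch, to return their offset sequences:

```python
def generate_emotional_talking_video(identity_img, emotion_img, audio_path,
                                     lip_net, emo_net, coord2video_net,
                                     extract_coords, extract_audio_vectors):
    """End-to-end inference sketch for step S6 under the naming assumptions above."""
    l_a = extract_coords(identity_img)             # identity-source face coordinates
    l_b = extract_coords(emotion_img)              # emotion-source face coordinates
    content, style = extract_audio_vectors(audio_path)

    delta_p = lip_net(l_a, content)                # lip-sync coordinate offsets (step S3 network)
    delta_q = emo_net(l_a, l_b, content, style)    # emotion coordinate offsets (step S4 network)

    final_coords = l_a + delta_p + delta_q         # correct the identity-source coordinates
    return coord2video_net(final_coords, identity_img)   # render the lip-synced emotional video
```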
The multi-modal-driven emotion-controllable facial animation generation method is thus realized through a voice conversion method, multi-layer perceptrons, a long short-term memory network, a self-attention mechanism and a generative adversarial network. As shown in FIGS. 2-3, the invention can generate videos with different emotions by adjusting the emotion-source portrait, which gives it high application value and overcomes the lack of emotion and poor robustness of existing facial animation generation methods.
The above description covers only preferred embodiments of the invention and is not intended to limit its scope. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its protection scope.
Claims (9)
1. A multi-modal-driven emotion-controllable facial animation generation method, characterized by comprising the following steps:
step S1: preprocessing the images of a portrait video, and extracting a facial 3D feature coordinate sequence from the preprocessed images with a face recognition algorithm;
step S2: preprocessing the audio of the portrait video, and then decoupling the preprocessed audio, using a voice conversion method, into an audio content vector that is independent of the speaker and an audio style vector that is speaker-dependent;
step S3: training a facial lip-sync coordinate animation generation network composed of a multi-layer perceptron and a long short-term memory network, based on the facial 3D feature coordinate sequence and the audio content vector;
step S4: training a facial emotion coordinate animation generation network composed of multi-layer perceptrons, long short-term memory networks, a self-attention mechanism and a generative adversarial network, based on the facial 3D feature coordinate sequence, the audio content vector and the audio style vector;
step S5: training a coordinate-to-video network composed of a generative adversarial network, based on the facial 3D feature coordinate sequence;
step S6: based on the trained facial lip-sync coordinate animation generation network, facial emotion coordinate animation generation network and coordinate-to-video network, inputting any two portrait pictures, one representing the identity source and the other the emotion source, and any segment of audio; and generating a lip-synced video of the target portrait carrying the emotion corresponding to the emotion source.
2. The multi-modal-driven emotion-controllable facial animation generation method according to claim 1, characterized in that step S1 specifically comprises:
first, performing frame-rate conversion on the video, converting it to 62.5 frames per second;
then, resampling the images and cropping them into 256 x 256 video containing the face;
extracting facial coordinates with the face recognition algorithm, obtaining the 3D coordinates of the face in each frame, with dimension 68 x 3, to form the facial 3D feature coordinate sequence;
and saving the facial 3D feature coordinate sequence as an emotion-source portrait coordinate sequence and an identity-source portrait coordinate sequence, i.e. emotion-source face coordinates and identity-source face coordinates.
3. The multi-modal-driven emotion-controllable facial animation generation method according to claim 1, characterized in that step S2 specifically comprises:
performing sampling-rate conversion on the audio, converting the sampling rate to 16,000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group);
then performing audio vector extraction on it, obtaining the audio vector with the Python Resemblyzer library;
and finally inputting the audio vector into the voice conversion model AutoVC to obtain the decoupled speaker-independent audio content vector and speaker-dependent audio style vector.
4. The multi-modal-driven emotion-controllable facial animation generation method according to claim 1, characterized in that in step S3 the facial lip-sync coordinate animation generation network adopts a custom encoder-decoder structure: the encoder comprises a facial coordinate encoder composed of two MLP layers and a speech content encoder composed of three LSTM layers, and the decoder is a facial lip-sync coordinate decoder composed of three MLP layers; the facial lip-sync coordinate animation generation network sets a loss function that continuously adjusts the weights and biases of the network until the error between the predicted coordinates and the reference coordinates is minimized.
5. The multi-modal-driven emotion-controllable facial animation generation method according to claim 4, characterized in that in step S3 the training process of the facial lip-sync coordinate animation generation network is as follows:
first, a two-layer MLP extracts the identity feature from the facial 3D feature coordinates of the first frame of the video obtained in step S1, i.e. the first time point of the facial 3D feature coordinate sequence;
then, based on this identity feature and the audio content vector obtained in step S2, the two are linearly fused and a three-layer LSTM extracts the coordinate dependency between consecutive audio syllables and the lips;
then, based on the encoder output, a decoder composed of three MLP layers predicts the sequence of facial lip-sync coordinate offsets, calculated as follows:
$$\Delta P_t = \mathrm{MLP}_c\big(\mathrm{LSTM}_c(Ec_{t \to t+\lambda},\, \mathrm{MLP}_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c}\big)$$
where $\Delta P_t$ denotes the predicted facial lip-sync coordinate offset of frame $t$, with $t$ the current frame of the portrait video; $\mathrm{MLP}_L$ denotes the facial coordinate encoder, $L$ the facial coordinates of the first frame of the portrait video, and $W_{mlp,l}$ the learnable parameters of the facial coordinate encoder; $\mathrm{LSTM}_c$ denotes the speech content encoder, $Ec$ the audio content vector, $t \to t+\lambda$ indicates that the audio content vector is fed to the speech content encoder with a batch size of $\lambda = 18$ per frame, and $W_{lstm}$ denotes the learnable parameters of the speech content encoder; $\mathrm{MLP}_c$ denotes the facial lip-sync coordinate decoder and $W_{mlp,c}$ its learnable parameters;
correcting the first-frame coordinates of the portrait video with the predicted facial lip-sync coordinate offset sequence to obtain the lip-synchronized coordinate sequence, calculated as follows:
$$P_t = L + \Delta P_t$$
where $P_t$ denotes the lip-synchronized facial coordinates of frame $t$, with $t$ the current frame of the portrait video; $L$ is the facial coordinates of the first frame of the portrait video, and $\Delta P_t$ the predicted facial lip-sync coordinate offset of frame $t$;
to generate an optimal sequence of facial lip-sync coordinate offsets, a loss function is set on the encoder-decoder structure of the facial lip-sync coordinate animation generation network to adjust the network's weights and biases, calculated as follows:
in the formula (I), the compound is shown in the specification,representing the loss function of the facial lip sound coordinate animation generation network, T representing the total frame rate of the video, T representing the current frame of the portrait video, N-68Represents the total number of facial coordinates, i represents the current facial coordinate number; p is i,t Coordinates representing the predicted ith frame,coordinates representing the ith frame obtained in step S1;is represented by P i,t And withThe square of the euclidean norm of (d);
6. The multi-modal-driven emotion-controllable facial animation generation method according to claim 1, characterized in that in step S4 the facial emotion coordinate animation generation network adopts a custom encoder-decoder structure:
the encoder comprises an audio encoder and facial coordinate encoders, where the facial coordinate encoders comprise an identity-source facial coordinate encoder and an emotion-source facial coordinate encoder, and the audio encoder captures the audio features through a three-layer LSTM, a three-layer MLP and a self-attention mechanism;
the decoder comprises a coordinate decoder;
the encoder is used to obtain the audio features, the portrait identity features and the portrait emotion features, and the decoder is used to process the multi-modal features and, driven jointly by the audio features and the portrait emotion features, to generate the coordinate offset sequence after the target portrait's emotion has been reshaped;
the facial emotion coordinate animation generation network sets three different loss functions to adjust the weights and biases of the network: the first computes the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1, and the second and third are discriminator loss functions used to judge whether the generated facial coordinates are real or fake and to assess the similarity between interval frames of the facial coordinates.
7. The multi-modal-driven emotion-controllable facial animation generation method according to claim 6, characterized in that in step S4 the training process of the facial emotion coordinate animation generation network is as follows:
first, an LSTM extracts features from the audio content vector obtained in step S2;
then, an MLP extracts features from the audio style vector obtained in step S2;
next, the audio content features and the audio style features are linearly fused;
finally, a self-attention mechanism captures the longer-range structural dependency between the audio content vector and the audio style vector, yielding audio features with stronger temporal dependence, calculated as follows:
$$S_t = \mathrm{Attn}\big(\mathrm{LSTM}_{c'}(Ec_{t \to t+\lambda}; W'_{lstm}),\ \mathrm{MLP}_s(Es; W_{mlp,s}); W_{attn}\big)$$
where $S_t$ denotes the processed audio features of frame $t$, with $t$ the current frame of the portrait video; $\mathrm{MLP}_s$ denotes the audio style vector encoder, $Es$ the audio style vector, and $W_{mlp,s}$ the learnable parameters of the audio style vector encoder; $\mathrm{LSTM}_{c'}$ denotes the audio content vector encoder, $Ec$ the audio content vector, $t \to t+\lambda$ indicates that the audio content vector is fed to the audio content vector encoder with a batch size of $\lambda = 18$ per frame $t$, and $W'_{lstm}$ denotes the learnable parameters of the audio content vector encoder; $\mathrm{Attn}$ denotes the self-attention mechanism and $W_{attn}$ its learnable parameters;
both facial coordinate encoders are lightweight neural networks composed of seven MLP layers, one used to extract the geometric information of identity and the other the geometric information of facial emotion;
based on the two different facial coordinate sequences obtained in step S1, one treated as the identity-source facial coordinate sequence and the other as the emotion-source facial coordinate sequence, the identity-source facial coordinate encoder composed of seven MLP layers first extracts the portrait identity features of the identity source; the emotion-source facial coordinate encoder composed of seven MLP layers then extracts the portrait emotion features of the emotion source; finally, the portrait identity features, the portrait emotion features and the obtained audio features are linearly fused to obtain the fusion features, calculated as follows:
$$F_t = \mathrm{concat}\big(\mathrm{MLP}_{LA}(L_a; W_{mlp,la}),\ \mathrm{MLP}_{LB}(L_b; W_{mlp,lb}),\ S_t\big)$$
where $F_t$ denotes the frame-$t$ fusion features after linear fusion and concat denotes linear fusion; $\mathrm{MLP}_{LA}$ denotes the identity-source facial coordinate encoder, $L_a$ the facial coordinates of the first frame of the identity-source portrait video, and $W_{mlp,la}$ the learnable parameters of the identity-source facial coordinate encoder; $\mathrm{MLP}_{LB}$ denotes the emotion-source facial coordinate encoder, $L_b$ the facial coordinates of the first frame of the emotion-source portrait video, and $W_{mlp,lb}$ the learnable parameters of the emotion-source facial coordinate encoder; $S_t$ denotes the frame-$t$ audio features obtained above;
based on the fusion of the portrait identity features, the portrait emotion features and the audio features, a coordinate decoder composed of three MLP layers predicts the facial emotion coordinate offset sequence, calculated as follows:
$$\Delta Q_t = \mathrm{MLP}_{LD}(F_t; W_{mlp,ld})$$
where $\Delta Q_t$ denotes the predicted facial emotion coordinate offset of frame $t$, with $t$ the current frame of the portrait video; $\mathrm{MLP}_{LD}$ denotes the decoder of the facial emotion coordinate animation generation network, $F_t$ the frame-$t$ fusion features after linear fusion, and $W_{mlp,ld}$ the learnable parameters of the decoder;
correcting the first-frame coordinates of the identity-source portrait video with the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence, calculated as follows:
$$Q_t = L_a + \Delta Q_t$$
where $Q_t$ denotes the facial emotion coordinates of frame $t$, with $t$ the current frame of the portrait video; $L_a$ is the facial coordinates of the first frame of the identity-source portrait video, and $\Delta Q_t$ the predicted facial emotion coordinate offset of frame $t$;
to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set on the encoder-decoder structure of the facial emotion coordinate animation generation network to adjust the weights and biases of the network, as follows:
$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{emo} + \lambda_2 \mathcal{L}_{D_L} + \lambda_3 \mathcal{L}_{D_T}$$
where $\mathcal{L}_{total}$ denotes the total loss function of the facial emotion coordinate animation generation network, $\mathcal{L}_{emo}$ the coordinate loss function of the facial emotion coordinate animation generation network, $\mathcal{L}_{D_L}$ the loss function of the facial coordinate discriminator $D_L$, and $\mathcal{L}_{D_T}$ the loss function of the facial coordinate interval-frame similarity discriminator $D_T$; $\lambda_1, \lambda_2, \lambda_3$ are weight parameters;
wherein the coordinate loss function of the facial emotion coordinate animation generation network computes the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1, calculated as follows:
$$\mathcal{L}_{emo} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\left\|Q_{i,t} - \hat{Q}_{i,t}\right\|_2^2$$
where $\mathcal{L}_{emo}$ denotes the coordinate loss function of the facial emotion coordinate animation generation network, $T$ the total number of video frames, $t$ the current frame of the portrait video, $N = 68$ the total number of facial coordinates, and $i$ the index of the current facial coordinate; $Q_{i,t}$ denotes the $i$-th predicted facial coordinate of frame $t$, and $\hat{Q}_{i,t}$ the corresponding facial coordinate obtained in step S1; $\|Q_{i,t} - \hat{Q}_{i,t}\|_2^2$ is the squared Euclidean norm of their difference;
during training of the facial emotion coordinate animation generation network, the discriminator loss function $\mathcal{L}_{D_L}$ is used to judge whether the generated facial coordinates are real or fake, and the discriminator loss function $\mathcal{L}_{D_T}$ is used to assess the similarity between interval frames of the facial coordinates;
in these loss functions, $t$ denotes the current frame of the portrait video; $D_L$ denotes the discriminator that judges whether the facial coordinates are real or fake, and $\mathcal{L}_{D_L}$ its loss function; $D_T$ denotes the facial coordinate interval-frame similarity discriminator, and $\mathcal{L}_{D_T}$ its loss function; $Q_t$ denotes the predicted facial emotion coordinates of frame $t$, $\hat{Q}_t$ the frame-$t$ facial coordinates obtained in step S1, and $\hat{Q}_{t-1}$ the facial coordinates of the previous frame;
when the loss functions tend to be smooth, the training of the facial emotion coordinate animation synthesis network is complete.
8. The multi-modal-driven emotion-controllable facial animation generation method according to claim 1, characterized in that in step S5 the training process of the coordinate-to-video network is as follows:
based on the facial coordinate sequence obtained in step S1, connecting the discrete coordinates by index and rendering them with colored line segments to create a three-channel facial sketch sequence of size 256 x 256;
channel-concatenating this sequence with the original first-frame picture of the corresponding video to create a six-channel picture sequence of size 256 x 256;
taking this sequence as input, generating the reconstructed facial video with the coordinate-to-video network;
and, to generate an optimal facial video, setting an L1 loss function on the image translation network to adjust the weights and biases of the network.
9. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S6, a lip-synchronized video of the target portrait carrying the emotion of the emotion source is generated using the three trained network models, specifically comprising:
inputting any two portrait pictures and any segment of audio, obtaining the identity-source portrait coordinates and the emotion-source portrait coordinates respectively with a face recognition algorithm, and obtaining the audio content vector and the audio style vector of the audio with a voice conversion method;
feeding the audio content vector and the identity-source coordinates into the facial lip-sync coordinate animation generation network obtained in step S3 to generate a lip-synchronized facial coordinate offset sequence;
feeding the audio content vector, the audio style vector, the identity-source coordinates and the emotion-source coordinates into the facial emotion coordinate animation generation network obtained in step S4 to generate a facial emotion coordinate offset sequence;
and applying the two offset sequences to correct the identity-source coordinates to obtain a final coordinate sequence, inputting the final coordinate sequence into the coordinate-to-video network obtained in step S5, and generating a lip-synchronized video of the target portrait carrying the emotion of the emotion source.
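To make the data flow of step S6 easier to follow, here is a hypothetical end-to-end sketch chaining the three trained networks. All module names (the landmark detector, the voice-conversion encoder and the three networks) and the tensor shapes are placeholders for whatever implementations the method actually uses.

```python
def generate_emotional_talking_video(identity_img, emotion_img, audio,
                                     landmark_detector, voice_converter,
                                     lip_sync_net, emotion_net, coord2video_net):
    """Chain the three trained networks as described in step S6 (illustrative)."""
    # 1. Facial coordinates of the identity source and the emotion source.
    id_coords = landmark_detector(identity_img)    # assumed shape (68, 2)
    emo_coords = landmark_detector(emotion_img)    # assumed shape (68, 2)

    # 2. Disentangle the driving audio into content and style vectors.
    content_vec, style_vec = voice_converter(audio)

    # 3. Lip-sync coordinate offsets from the content vector + identity coordinates.
    lip_offsets = lip_sync_net(content_vec, id_coords)               # (T, 68, 2)

    # 4. Emotion coordinate offsets from content, style, identity and emotion inputs.
    emo_offsets = emotion_net(content_vec, style_vec, id_coords, emo_coords)

    # 5. Apply both offset sequences to the identity coordinates and render the video.
    final_coords = id_coords[None] + lip_offsets + emo_offsets       # (T, 68, 2)
    return coord2video_net(final_coords, identity_img)
```

The point of the sketch is the ordering: both offset sequences are predicted independently from the identity-source coordinates and only summed onto them before the final coordinate-to-video rendering.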
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210744504.9A CN115100329B (en) | 2022-06-27 | 2022-06-27 | Multi-mode driving-based emotion controllable facial animation generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115100329A (en) | 2022-09-23
CN115100329B (en) | 2023-04-07
Family
ID=83295794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210744504.9A Active CN115100329B (en) | 2022-06-27 | 2022-06-27 | Multi-mode driving-based emotion controllable facial animation generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115100329B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1466104A (en) * | 2002-07-03 | 2004-01-07 | 中国科学院计算技术研究所 | Statistics and rule combination based phonetic driving human face carton method |
US20120280974A1 (en) * | 2011-05-03 | 2012-11-08 | Microsoft Corporation | Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech |
CN111783658A (en) * | 2020-07-01 | 2020-10-16 | 河北工业大学 | Two-stage expression animation generation method based on double generation countermeasure network |
CN113158727A (en) * | 2020-12-31 | 2021-07-23 | 长春理工大学 | Bimodal fusion emotion recognition method based on video and voice information |
CN113408449A (en) * | 2021-06-25 | 2021-09-17 | 达闼科技(北京)有限公司 | Face action synthesis method based on voice drive, electronic equipment and storage medium |
CN114202604A (en) * | 2021-11-30 | 2022-03-18 | 长城信息股份有限公司 | Voice-driven target person video generation method and device and storage medium |
CN114663539A (en) * | 2022-03-09 | 2022-06-24 | 东南大学 | 2D face restoration technology under mask based on audio drive |
Non-Patent Citations (1)
Title |
---|
FAN YIWEN et al.: "Speech-driven facial animation with expressive details" (支持表情细节的语音驱动人脸动画) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115631275A (en) * | 2022-11-18 | 2023-01-20 | 北京红棉小冰科技有限公司 | Multi-mode driven human body action sequence generation method and device |
CN116433807A (en) * | 2023-04-21 | 2023-07-14 | 北京百度网讯科技有限公司 | Animation synthesis method and device, and training method and device for animation synthesis model |
CN116433807B (en) * | 2023-04-21 | 2024-08-23 | 北京百度网讯科技有限公司 | Animation synthesis method and device, and training method and device for animation synthesis model |
CN116843798A (en) * | 2023-07-03 | 2023-10-03 | 支付宝(杭州)信息技术有限公司 | Animation generation method, model training method and device |
Also Published As
Publication number | Publication date |
---|---|
CN115100329B (en) | 2023-04-07 |
Similar Documents
Publication | Title |
---|---|
CN115100329B (en) | Multi-mode driving-based emotion controllable facial animation generation method | |
Cudeiro et al. | Capture, learning, and synthesis of 3D speaking styles | |
Wang et al. | Seeing what you said: Talking face generation guided by a lip reading expert | |
US11551393B2 (en) | Systems and methods for animation generation | |
US7027054B1 (en) | Do-it-yourself photo realistic talking head creation system and method | |
CN116250036A (en) | System and method for synthesizing photo-level realistic video of speech | |
CN115004236A (en) | Photo-level realistic talking face from audio | |
CN115588224B (en) | Virtual digital person generation method and device based on face key point prediction | |
CN114202604A (en) | Voice-driven target person video generation method and device and storage medium | |
CN115116109A (en) | Virtual character speaking video synthesis method, device, equipment and storage medium | |
CN113470170B (en) | Real-time video face region space-time consistent synthesis method utilizing voice information | |
CN117237521A (en) | Speech driving face generation model construction method and target person speaking video generation method | |
EP4010899A1 (en) | Audio-driven speech animation using recurrent neutral network | |
CN117171392A (en) | Virtual anchor generation method and system based on nerve radiation field and hidden attribute | |
Wang et al. | Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild | |
EP0710929A2 (en) | Acoustic-assisted image processing | |
Zhua et al. | Audio-driven talking head video generation with diffusion model | |
Wen et al. | 3D Face Processing: Modeling, Analysis and Synthesis | |
CN117557695A (en) | Method and device for generating video by driving single photo through audio | |
Tang et al. | Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar | |
CN115937375A (en) | Digital body-separating synthesis method, device, computer equipment and storage medium | |
Ji et al. | RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network | |
Wang et al. | Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head | |
Han et al. | A Keypoint Based Enhancement Method for Audio Driven Free View Talking Head Synthesis | |
CN114494930A (en) | Training method and device for voice and image synchronism measurement model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||