CN115100329A - Multi-mode driving-based emotion controllable facial animation generation method - Google Patents

Multi-mode driving-based emotion controllable facial animation generation method

Info

Publication number
CN115100329A
CN115100329A CN202210744504.9A CN202210744504A CN115100329A
Authority
CN
China
Prior art keywords
coordinate
emotion
facial
face
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210744504.9A
Other languages
Chinese (zh)
Other versions
CN115100329B (en)
Inventor
李瑶
赵子康
李峰
郭浩
杨艳丽
程忱
曹锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202210744504.9A priority Critical patent/CN115100329B/en
Publication of CN115100329A publication Critical patent/CN115100329A/en
Application granted granted Critical
Publication of CN115100329B publication Critical patent/CN115100329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to image processing technology, in particular to an emotion controllable facial animation generation method based on multi-mode driving. Step S1: preprocess the images of a portrait video to obtain a facial 3D feature coordinate sequence. Step S2: preprocess the audio of the portrait video and decouple it into an audio content vector and an audio style vector. Step S3: train a facial lip sound coordinate animation generation network consisting of a multi-layer perceptron and a long short-term memory network based on the facial 3D feature coordinate sequence and the audio content vector. The invention introduces an emotion portrait as the emotion source, reshapes the emotion of the target portrait through the joint driving of the emotion source portrait and the audio, and provides diverse emotional facial animation. Under multi-mode driving, the method avoids the low robustness of a single audio driving source, removes the dependence of emotion generation on emotional speech recognition, strengthens the complementarity between data, and achieves more realistic emotional expression in the facial animation.

Description

Multi-mode driving-based emotion controllable facial animation generation method
Technical Field
The invention relates to an image processing technology, in particular to an emotion controllable facial animation generation method based on multi-mode driving.
Background
Facial animation generation is a popular research area for generative models in computer vision. Its purpose is to transform a still portrait into a realistic facial animation driven by arbitrary audio. It has broad application prospects in fields such as audiovisual assistive therapy systems, virtual anchors, and games with customizable characters. However, owing to limitations of their principles and characteristics, existing facial animation generation methods produce portrait animation whose emotional expression lacks maturity, which seriously limits its application value.
In recent years, much work in facial animation generation has focused on realistic lip movement and head-pose swing, while portrait emotion, an equally important factor, has received little attention. Emotional information in the portrait strongly affects how the synthesized facial animation conveys emotion: different facial expressions can give the same sentence different emotional colours, and perceiving emotional information in the visual modality is one of the important channels of human audiovisual speech communication. However, most facial animation generation methods use audio as a single-modality driving source, which performs well for the lip movement of syllables but relatively poorly for facial expressions. The reason is that direct audio driving is affected by the complexity of audio emotion and by noise, so the generated facial expressions often show ghosting and distortion, leading to poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by the accuracy of emotional speech recognition, so their efficiency is too low and the emotion of the generated facial video lacks diversity and naturalness.
Disclosure of Invention
The invention provides an emotion controllable facial animation generation method based on multi-mode driving, aiming to solve the problem that existing facial animation generation methods lack the ability to regulate and control emotion.
The invention is realized by adopting the following technical scheme:
the method for generating the emotion controllable facial animation based on multi-mode driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used to obtain a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
Step S3: based on the face coordinate sequence obtained in step S1 and the audio content vector obtained in step S2, a face lip sound coordinate animation generation network composed of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network is trained.
Step S4: based on the face coordinate sequence obtained in step S1 and the audio content vector and audio style vector obtained in step S2, a facial emotion coordinate animation generation network composed of MLP, LSTM, a self-attention mechanism (Self-attention) and a Generative Adversarial Network (GAN) is trained.
Step S5: based on the face coordinate sequence obtained in step S1, a coordinate-to-video network composed of GANs is trained.
Step S6: based on the facial lip tone coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, any two portrait pictures (one representing an identity source and one representing an emotion source) and any one section of audio are input, and a lip tone synchronous video of a target portrait with emotion corresponding to the emotion source is generated.
The method for generating the emotion-controllable facial animation based on multi-mode driving uses computer-vision generative models and deep neural network models as technical support, and realizes the emotion-controllable facial animation generation network described below.
The invention has the beneficial effects that: compared with existing facial animation generation methods, the method takes into account the facial-expression ghosting and distortion caused by relying on a single audio feature and by low emotional speech recognition accuracy, introduces an emotion portrait as the emotion source, reshapes the emotion of the target portrait through multi-mode driving with the emotion source portrait features and the audio features, and generates facial animation with controllable emotion. The dual driving by the emotion portrait and the audio avoids the dependence of emotion generation on speech information alone, so the generated video has controllable emotion while meeting the requirements of lip-sound synchronization and spontaneous head swing; that is, the diversity and naturalness of the facial animation are guaranteed and more realistic emotional expression of the facial animation is achieved.
The method effectively solves the problem that existing facial animation generation methods are inefficient because the generated facial expressions are limited by the accuracy of speech emotion recognition, and it can be used in fields such as audiovisual assistive therapy systems, virtual anchors, and games with customizable characters.
Drawings
FIG. 1 is a schematic diagram of a multi-modal driven emotion controllable facial animation generation structure according to an embodiment of the invention.
Fig. 2 is a schematic diagram comparing the present invention with a conventional facial animation method.
Fig. 3 is a sample video schematic of an embodiment of the invention.
Detailed Description
In this embodiment, the portrait video data set used is derived from a public Multi-view Emotional Audio-visual data set (MEAD).
As shown in FIG. 1, the method for generating emotion controllable facial animation based on multi-modal driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used to obtain a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
Step S3: based on the face coordinate sequence obtained in step S1 and the audio content vector obtained in step S2, a face lip sound coordinate animation generation network composed of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network is trained.
Step S4: based on the face coordinate sequence obtained in step S1 and the audio content vector and audio style vector obtained in step S2, a facial emotion coordinate animation generation network composed of MLP, LSTM, a self-attention mechanism (Self-attention) and a Generative Adversarial Network (GAN) is trained.
Step S5: based on the face coordinate sequence obtained in step S1, a coordinate-to-video network composed of GANs is trained. During the training of this step, a loss function is used to minimize the pixel distance between the reconstructed face and the training target face.
Step S6: based on the facial lip tone coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, any two portrait pictures (one representing an identity source and one representing an emotion source) and any one section of audio are input, and a lip tone synchronous video of a target portrait with emotion corresponding to the emotion source is generated.
In step S1, the image of the portrait video is preprocessed, and the specific preprocessing process includes frame rate conversion, image resampling, and face coordinate extraction.
First, the video is frame-rate converted to 62.5 frames per second. It is then resampled and cropped into 256 × 256 video containing the face. Finally, face coordinates are extracted with the face recognition algorithm face alignment, and the 3D coordinates of the face in each frame (dimension 68 × 3) are obtained to form the facial 3D feature coordinate sequence.
In addition, the face 3D feature coordinate sequence is saved as an emotion source portrait coordinate sequence (emotion source face coordinates) and an identity source portrait coordinate sequence (identity source face coordinates). Compared with pixel points of the portrait, the face coordinates can provide natural low-dimensional representation for the portrait and provide a high-quality bridge for downstream emotion replay tasks.
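A minimal sketch of the step-S1 preprocessing described above, assuming ffmpeg on the PATH and the open-source face_alignment and OpenCV python packages; the function and file names are illustrative, only the 62.5 fps, 256 × 256 and 68 × 3 settings come from the description (the face_alignment enum name varies slightly between package versions):

import subprocess
import cv2
import numpy as np
import face_alignment

def extract_face_coordinates(video_path, tmp_path="tmp_62p5fps.mp4"):
    # Frame-rate conversion to 62.5 frames per second with ffmpeg.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-r", "62.5", tmp_path],
                   check=True)
    # 3D landmark detector: 68 landmarks x (x, y, z) per frame.
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.THREE_D,
                                      device="cpu")
    coords = []
    cap = cv2.VideoCapture(tmp_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Resample/crop to a 256 x 256 face image (a real pipeline would first
        # crop around the detected face box).
        frame = cv2.resize(frame, (256, 256))
        landmarks = fa.get_landmarks(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if landmarks:
            coords.append(landmarks[0])        # (68, 3) array for this frame
    cap.release()
    return np.stack(coords)                    # (n_frames, 68, 3)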
In step S2, the audio of the portrait video is preprocessed, where the preprocessing includes sampling rate conversion, audio vector extraction, and audio vector decoupling.
The audio is first sample-rate converted to 16000 Hz using Fast Forward Moving Picture Experts Group (FFmpeg). Then audio vector extraction is carried out: the audio vector is obtained with the python resemblyzer library. Finally, the audio vector is input into the voice conversion model AutoVC, and the decoupled audio content vector, which is independent of the audio speaker, and the audio style vector, which is related to the audio speaker, are obtained.
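A minimal sketch of the step-S2 preprocessing, assuming ffmpeg on the PATH and the python resemblyzer package. AutoVC is a research model without a packaged API, so autovc_encode below is a hypothetical placeholder; treating the resemblyzer embedding as the speaker-dependent style vector and the AutoVC encoder output as the speaker-independent content vector is likewise an assumption about the division of labour:

import subprocess
from resemblyzer import VoiceEncoder, preprocess_wav

def autovc_encode(wav, speaker_embedding):
    # Placeholder for the pretrained AutoVC content encoder; the real model
    # consumes a mel-spectrogram plus a speaker embedding and returns
    # speaker-independent content codes.
    raise NotImplementedError

def extract_audio_vectors(video_path, wav_path="tmp_16k.wav"):
    # ffmpeg: extract the audio track and resample it to 16000 Hz mono.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ar", "16000",
                    "-ac", "1", wav_path], check=True)
    wav = preprocess_wav(wav_path)
    style_vector = VoiceEncoder().embed_utterance(wav)   # speaker-dependent
    content_vector = autovc_encode(wav, style_vector)    # speaker-independent
    return content_vector, style_vector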
In step S3, training of the facial lip sound coordinate animation generation network is completed.
The network adopts a self-defined coder-decoder network structure, the coder comprises a facial coordinate coder consisting of two layers of MLPs and a voice content coder consisting of three layers of LSTMs, and the decoder is a facial lip sound coordinate decoder consisting of three layers of MLPs. In order to generate an optimal sequence of the offset of the facial lip voice coordinate, the facial lip voice coordinate animation generation network sets a loss function to continuously adjust the weight and the deviation of the network until the error between the predicted coordinate and the reference coordinate is minimized.
The custom encoder-decoder network structure is as follows:
firstly, the identity feature of the 3D feature coordinate sequence of the face in the first frame of the video (i.e. the first time point of the 3D feature coordinate sequence of the face) obtained in step S1 is extracted by using two-layer MLP. And then, based on the identity characteristics and the audio content vector obtained in the step S2, performing linear fusion and extracting the coordinate dependence relationship between the audio continuous syllables and the lips by using the LSTM of the three-layer unit. Then, based on the output of the encoder in the step, a decoder consisting of three layers of MLPs is used for predicting a facial lip sound coordinate offset sequence, and the specific calculation formula is as follows:
ΔP_t = MLP_c(LSTM_c(Ec_{t→t+λ}, MLP_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c})    (1)

In formula (1), ΔP_t denotes the predicted facial lip sound coordinate offset of the t-th frame, where t is the current frame of the portrait video; MLP_L denotes the face coordinate encoder, L the face coordinates of the first frame of the portrait video, and W_{mlp,l} the learnable parameters of the face coordinate encoder; LSTM_c denotes the speech content encoder, Ec the audio content vector, t→t+λ that the audio content vector is input to the speech content encoder in batches of λ = 18 per frame t, and W_{lstm} the learnable parameters of the speech content encoder; MLP_c denotes the facial lip sound coordinate decoder, and W_{mlp,c} its learnable parameters.
Correcting the first-frame coordinates of the portrait video with the predicted facial lip sound coordinate offset sequence yields the lip-sound-synchronized coordinate sequence:

P_t = L + ΔP_t    (2)

In formula (2), P_t denotes the lip-sound-synchronized face coordinates of the t-th frame, where t is the current frame of the portrait video; L denotes the face coordinates of the first frame of the portrait video, and ΔP_t the predicted facial lip sound coordinate offset of the t-th frame.
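A rough PyTorch sketch of such an encoder-decoder; the hidden sizes, the content-vector dimension and the class name are assumptions, while the two-layer MLP face encoder, three-layer LSTM content encoder, three-layer MLP decoder and the offset formulation follow formulas (1) and (2) above:

import torch
import torch.nn as nn

class LipSyncCoordinateNet(nn.Module):
    def __init__(self, content_dim=256, hidden=256, n_landmarks=68):
        super().__init__()
        coord_dim = n_landmarks * 3
        # MLP_L: two-layer face coordinate encoder.
        self.face_encoder = nn.Sequential(
            nn.Linear(coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        # LSTM_c: three-layer speech content encoder.
        self.content_encoder = nn.LSTM(content_dim + hidden, hidden,
                                       num_layers=3, batch_first=True)
        # MLP_c: three-layer facial lip sound coordinate decoder.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, coord_dim))

    def forward(self, first_frame_coords, audio_content):
        # first_frame_coords: (B, 68*3); audio_content: (B, T, content_dim)
        identity = self.face_encoder(first_frame_coords)            # (B, hidden)
        identity_seq = identity.unsqueeze(1).expand(-1, audio_content.size(1), -1)
        fused = torch.cat([audio_content, identity_seq], dim=-1)    # linear fusion
        hidden_seq, _ = self.content_encoder(fused)
        delta_p = self.decoder(hidden_seq)                          # ΔP_t, (B, T, 68*3)
        return first_frame_coords.unsqueeze(1) + delta_p            # P_t = L + ΔP_t

# Training minimizes the squared error of formula (3), e.g.
# loss = ((pred - target) ** 2).sum(-1).mean()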
In order to generate an optimal facial lip sound coordinate offset sequence, a loss function is set, based on the encoder-decoder structure of the facial lip sound coordinate animation generation network, to adjust the weights and biases of the network. The objective of the loss function is to minimize the error between the predicted coordinates and the coordinates obtained in step S1:

L_lip = Σ_{t=1}^{T} Σ_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖₂²    (3)

In formula (3), L_lip denotes the loss function of the facial lip sound coordinate animation generation network, T the total number of frames of the video, t the current frame of the portrait video, N = 68 the total number of face coordinates, and i the current face coordinate index; P_{i,t} denotes the i-th predicted face coordinate of the t-th frame, P̂_{i,t} the i-th face coordinate of the t-th frame obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖₂² the squared Euclidean norm of their difference.

When the loss function tends to be smooth, i.e. L_lip reaches its minimum value, the training of the facial lip sound coordinate animation synthesis network is completed.
In step S4, training of the facial emotion coordinate animation generation network is completed, adding rich visual emotional expression to the generated video.

Humans rely on visual information when interpreting emotion, and rich visual emotional expression gives a stronger sense of realism and greater practical value. Most existing facial animation generation algorithms are devoted to expressing the lip movement and head-pose swing of the facial animation from the audio modality alone. Audio single-modality driving works well for the lip movement of syllables but relatively poorly for facial expressions, because direct audio driving is affected by the complexity of audio emotion and by noise, so the generated facial expressions often show ghosting and distortion, leading to poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by the accuracy of emotional speech recognition, so their efficiency is too low and the emotion of the generated facial video lacks diversity and naturalness.

This patent therefore provides a multi-mode driven facial emotion coordinate animation generation network: an emotion portrait is introduced as the emotion source and, driven jointly with the audio features, the emotion of the target portrait is reshaped more accurately.
The network is a custom encoder-decoder network structure, the encoder comprises an audio encoder and a facial coordinate encoder, and the decoder comprises a coordinate decoder. The encoder can obtain audio features, portrait identity features and portrait affective features. The decoder is responsible for processing the multi-mode characteristics, and is driven by the audio characteristics and the portrait emotion characteristics together to generate a coordinate offset sequence after the target portrait emotion is remolded, so that rich visual emotion expression is added to the video. Under the driving of the multiple modes, the method avoids the over-low robustness of the audio single driving source, gets rid of the dependency of emotion generation on emotion voice recognition, enhances the complementarity among data and realizes the emotion expression of more real facial animation.
In order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network. The first calculates the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1. The second and third are discriminator loss functions, used respectively to distinguish real from generated face coordinates and to evaluate the similarity between interval frames of face coordinates.
The customized encoder-decoder network structure of the facial emotion coordinate animation generation network is as follows:
the encoder consists of an audio encoder, an identity source face coordinate encoder and an emotion source face coordinate encoder. The audio encoder captures audio features through a three-layered LSTM, a three-layered MLP, and a self-attention mechanism.
Specifically, firstly, the LSTM is used to extract the features of the audio content vector obtained in step S2; then, using MLP to extract the features of the audio style vector obtained in step S2; then, linear fusion is carried out on the audio content vector characteristics and the audio style vector characteristics; and finally, capturing a longer-time structural dependency relationship between the audio content vector and the audio style vector by using a self-attention mechanism to obtain an audio feature with stronger time dependency, wherein a specific calculation formula is as follows:
S_t = Attn(LSTM_{c'}(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})    (4)

In formula (4), S_t denotes the processed audio feature of the t-th frame, where t is the current frame of the portrait video; MLP_s denotes the audio style vector encoder, Es the audio style vector, and W_{mlp,s} the learnable parameters of the audio style vector encoder; LSTM_{c'} denotes the audio content vector encoder, Ec the audio content vector, t→t+λ that the audio content vector is input to the audio content vector encoder in batches of λ = 18 per frame t, and W'_{lstm} the learnable parameters of the audio content vector encoder; Attn denotes the self-attention mechanism and W_{attn} its learnable parameters.
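One possible PyTorch rendering of formula (4); the hidden widths, the number of attention heads and the use of nn.MultiheadAttention as the self-attention block are assumptions:

import torch
import torch.nn as nn

class AudioEmotionEncoder(nn.Module):
    def __init__(self, content_dim=256, style_dim=256, hidden=256, heads=4):
        super().__init__()
        self.content_lstm = nn.LSTM(content_dim, hidden, num_layers=3,
                                    batch_first=True)                 # LSTM_c'
        self.style_mlp = nn.Sequential(                               # MLP_s
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                          batch_first=True)           # Attn

    def forward(self, content, style):
        # content: (B, T, content_dim); style: (B, style_dim)
        c, _ = self.content_lstm(content)                             # (B, T, hidden)
        s = self.style_mlp(style).unsqueeze(1).expand(-1, c.size(1), -1)
        fused = torch.cat([c, s], dim=-1)                             # linear fusion
        # Self-attention captures the longer-range temporal dependencies (S_t).
        out, _ = self.attn(fused, fused, fused)
        return out                                                    # (B, T, 2*hidden)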
The two face coordinate encoders are both light neural networks composed of seven layers of MLPs. The two are similar in structure but different in function, one extracts geometric information of identity and one extracts geometric information of facial emotion.
Based on the two different face coordinates (one is regarded as an identity source face coordinate sequence and the other is regarded as an emotion source face coordinate sequence) obtained in step S1, firstly, using an identity source face coordinate encoder composed of seven layers of MLPs to extract portrait identity features of an identity source; secondly, extracting portrait emotional characteristics of an emotion source by using an emotion source face coordinate encoder consisting of seven layers of MLPs; and finally, performing linear fusion on the portrait identity characteristic, the portrait emotion characteristic and the audio characteristic obtained by the formula (4) to obtain a fusion characteristic, wherein the concrete calculation formula is as follows:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)    (5)

In formula (5), F_t denotes the fused feature of the t-th frame after linear fusion, and concat denotes linear fusion; MLP_{LA} denotes the identity source face coordinate encoder, L_a the face coordinates of the first frame of the identity source portrait video, and W_{mlp,la} the learnable parameters of the identity source face coordinate encoder; MLP_{LB} denotes the emotion source face coordinate encoder, L_b the face coordinates of the first frame of the emotion source portrait video, and W_{mlp,lb} the learnable parameters of the emotion source face coordinate encoder; S_t denotes the audio feature of the t-th frame obtained in step S4.
Based on the fused feature of the portrait identity feature, the portrait emotion feature and the audio feature obtained by formula (5), a coordinate decoder consisting of three MLP layers is used to predict the facial emotion coordinate offset sequence:

ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})    (6)

In formula (6), ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame, where t is the current frame of the portrait video; MLP_{LD} denotes the decoder of the facial emotion coordinate animation generation network, F_t the fused feature of the t-th frame after linear fusion, and W_{mlp,ld} the learnable parameters of the decoder.
The method comprises the following steps of correcting the first frame coordinate of the identity source portrait video through a predicted facial emotion coordinate offset sequence to obtain a facial emotion coordinate sequence, wherein the specific calculation formula is as follows:
Q_t = L_a + ΔQ_t    (7)

In formula (7), Q_t denotes the facial emotion coordinates of the t-th frame, where t is the current frame of the portrait video; L_a denotes the face coordinates of the first frame of the identity source portrait video, and ΔQ_t the predicted facial emotion coordinate offset of the t-th frame.
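A possible PyTorch sketch of the fusion and decoding of formulas (5)-(7); the layer widths and the audio feature dimension (matched to the sketch above) are assumptions, while the seven-layer MLP encoders and the three-layer MLP decoder follow the description:

import torch
import torch.nn as nn

def mlp(sizes):
    # Builds a plain MLP with ReLU between layers and no final activation.
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])

class EmotionCoordinateNet(nn.Module):
    def __init__(self, audio_dim=512, hidden=128, n_landmarks=68):
        super().__init__()
        coord_dim = n_landmarks * 3
        enc_sizes = [coord_dim] + [hidden] * 7        # seven linear layers
        self.identity_encoder = mlp(enc_sizes)        # MLP_LA
        self.emotion_encoder = mlp(enc_sizes)         # MLP_LB
        # MLP_LD: three-layer coordinate decoder.
        self.decoder = mlp([2 * hidden + audio_dim, hidden, hidden, coord_dim])

    def forward(self, identity_coords, emotion_coords, audio_feat):
        # identity_coords, emotion_coords: (B, 68*3); audio_feat: (B, T, audio_dim)
        ida = self.identity_encoder(identity_coords)   # identity geometry
        emo = self.emotion_encoder(emotion_coords)     # emotion geometry
        T = audio_feat.size(1)
        fused = torch.cat([ida.unsqueeze(1).expand(-1, T, -1),
                           emo.unsqueeze(1).expand(-1, T, -1),
                           audio_feat], dim=-1)        # F_t = concat(...)
        delta_q = self.decoder(fused)                  # ΔQ_t
        return identity_coords.unsqueeze(1) + delta_q  # Q_t = L_a + ΔQ_t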
In order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network:

L_total = λ₁·L_emo + λ₂·L_{D_L} + λ₃·L_{D_T}    (8)

In formula (8), L_total denotes the total loss function of the facial emotion coordinate animation generation network, L_emo the coordinate loss function of the facial emotion coordinate animation generation network, L_{D_L} the loss function of the face coordinate discriminator D_L, and L_{D_T} the loss function of the face coordinate interval-frame similarity discriminator D_T; λ₁, λ₂, λ₃ are the respective weight parameters.
The face coordinate loss function calculates the distance between the predicted facial emotion coordinate sequence and the face coordinates obtained in step S1 (the identity source coordinate sequence with the same emotion as the emotion source):

L_emo = Σ_{t=1}^{T} Σ_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖₂²    (9)

In formula (9), L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, T the total number of frames of the video, t the current frame of the portrait video, N = 68 the total number of face coordinates, and i the current face coordinate index; Q_{i,t} denotes the i-th predicted face coordinate of the t-th frame, Q̂_{i,t} the i-th face coordinate of the t-th frame obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖₂² the squared Euclidean norm of their difference.
Discriminator loss function during facial emotion coordinate animation generation network training
(denoted L_{D_L}) is used to discriminate whether the generated face coordinates are real or fake, and the discriminator loss function L_{D_T} is used to estimate the similarity between interval frames of face coordinates, as given by equations (10) and (11):

[Equations (10) and (11): loss functions of the discriminators D_L and D_T]

In equations (10) and (11), t represents the current frame of the portrait video; D_L represents the discriminator of real versus fake face coordinates, and L_{D_L} represents its loss function; D_T represents the face coordinate interval-frame similarity discriminator, and L_{D_T} represents its loss function; Q_t represents the predicted facial emotion coordinates of the t-th frame, Q̂_t represents the face coordinates of the t-th frame obtained in step S1, and Q̂_{t−1} represents the face coordinates of the frame preceding Q̂_t.

When the loss functions tend to be smooth, the training of the facial emotion coordinate animation synthesis network is completed.
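The exact forms of equations (10) and (11) appear only as images in the source, so the sketch below assumes a plain binary cross-entropy GAN objective purely for illustration; the discriminator architectures (small MLPs over flattened coordinates) are likewise assumptions:

import torch
import torch.nn as nn

class CoordDiscriminator(nn.Module):
    """D_L: judges whether a single frame of 68x3 coordinates is real."""
    def __init__(self, coord_dim=68 * 3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(coord_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 1))
    def forward(self, coords):
        return self.net(coords)

class IntervalDiscriminator(nn.Module):
    """D_T: judges the consistency of a pair of coordinate frames."""
    def __init__(self, coord_dim=68 * 3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * coord_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 1))
    def forward(self, coords_t, coords_prev):
        return self.net(torch.cat([coords_t, coords_prev], dim=-1))

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(d, real, fake):
    # Assumed vanilla GAN objective: real coordinates -> 1, generated -> 0.
    real_logit = d(real)
    fake_logit = d(fake.detach())
    return bce(real_logit, torch.ones_like(real_logit)) + \
           bce(fake_logit, torch.zeros_like(fake_logit))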
In step S5, the training of the coordinate-to-video network is completed.
Based on the face coordinate sequence obtained in step S1, the discrete coordinates are connected by index and rendered with colored line segments to create a three-channel face sketch sequence of size 256 × 256. This sequence is channel-concatenated with the original picture of the first frame of the corresponding video to create a six-channel picture sequence of size 256 × 256. Using this sequence as input, the coordinate-to-video network generates the reconstructed face video.
In order to generate an optimal face video, an L1-norm loss function is set, based on the image translation network, to adjust the weights and biases of the network. Its objective is to minimize the pixel distance between the reconstructed face video and the training target face video.
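A sketch of how the sketch rendering and channel concatenation described above could be assembled with OpenCV; the landmark grouping and colours are illustrative, and only the 256 × 256 size and the 3 + 3 channel layout come from the description:

import cv2
import numpy as np

# (start, end) index ranges of the standard 68-point groups: jaw, brows,
# nose bridge, lower nose, eyes, outer lip, inner lip.
GROUPS = [(0, 17), (17, 22), (22, 27), (27, 31), (31, 36),
          (36, 42), (42, 48), (48, 60), (60, 68)]

def render_sketch(coords_2d, size=256):
    # coords_2d: (68, 2) array of x, y landmark positions for one frame.
    sketch = np.zeros((size, size, 3), np.uint8)
    for gi, (a, b) in enumerate(GROUPS):
        colour = [(0, 255, 0), (255, 0, 0), (0, 0, 255)][gi % 3]
        pts = coords_2d[a:b].astype(int)
        for p, q in zip(pts[:-1], pts[1:]):
            cv2.line(sketch, tuple(int(v) for v in p),
                     tuple(int(v) for v in q), colour, 1)
    return sketch

def make_network_input(coords_2d, first_frame_bgr):
    # Six-channel input: 3 sketch channels + 3 image channels.
    sketch = render_sketch(coords_2d)
    return np.concatenate([sketch, first_frame_bgr], axis=2)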
Step S6 is to input any two portrait pictures (one representing the identity source and the other representing the emotion source) and any piece of audio to generate the target video based on the facial lip sound coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in step S3, step S4 and step S5.
The face recognition algorithm face alignment is used to obtain the corresponding identity source portrait coordinates and emotion source portrait coordinates, and the voice conversion method is used to obtain the audio content vector and audio style vector of the audio. The audio content vector and the identity source coordinates are passed through the facial lip sound coordinate animation generation network obtained in step S3 to generate the lip-sound-synchronized face coordinate offset sequence. The audio content vector, audio style vector, identity source coordinates and emotion source coordinates are passed through the facial emotion coordinate animation generation network obtained in step S4 to generate the facial emotion coordinate offset sequence. The identity source coordinates are corrected by the two offset sequences to obtain the final coordinate sequence, which is input to the coordinate-to-video network obtained in step S5 to generate the lip-sound-synchronized video of the target portrait with the emotion of the emotion source.
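Tying the pieces together, a compact inference sketch reusing the hypothetical modules from the earlier sketches (none of these names come from the patent itself):

import torch

def generate_video(identity_coords, emotion_coords, content, style,
                   lip_net, audio_enc, emo_net, coord_to_video, first_frame):
    # identity_coords / emotion_coords: (1, 68*3); content: (1, T, d); style: (1, d)
    with torch.no_grad():
        lip_seq = lip_net(identity_coords, content)              # lip-synced coordinates P_t
        s_t = audio_enc(content, style)                          # audio feature of formula (4)
        emo_seq = emo_net(identity_coords, emotion_coords, s_t)  # emotion coordinates Q_t
        # Apply both offset sequences on top of the identity-source coordinates.
        base = identity_coords.unsqueeze(1)
        final_seq = base + (lip_seq - base) + (emo_seq - base)
    # Each frame of final_seq is rendered to a 6-channel sketch+image input
    # (as in the step-S5 sketch) and passed through the coordinate-to-video GAN.
    return coord_to_video(final_seq, first_frame)                # hypothetical call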
The multi-mode driven emotion controllable facial animation generation method is realized through a voice conversion method, a multi-layer perceptron, a long short-term memory network, a self-attention mechanism and a generative adversarial network. As shown in FIGS. 2-3, the invention can generate videos with different emotions by adjusting the emotion source portrait, which gives it high application value and overcomes shortcomings of existing facial animation generation methods such as lack of emotion and poor robustness.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. The method for generating the emotion controllable facial animation based on multi-mode driving is characterized by comprising the following steps of:
step S1: preprocessing an image of a portrait video, and extracting a facial 3D feature coordinate sequence from the preprocessed image by using a facial recognition algorithm;
step S2: preprocessing the audio of the portrait video, and then decoupling the preprocessed audio into an audio content vector irrelevant to an audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method;
step S3: training a facial lip voice coordinate animation generation network consisting of a multilayer perceptron and a long-time and short-time memory network based on a facial 3D characteristic coordinate sequence and an audio content vector;
step S4: training a facial emotion coordinate animation generation network consisting of a multilayer perceptron, a long short-term memory network, a self-attention mechanism and a generative adversarial network based on the facial 3D feature coordinate sequence, the audio content vector and the audio style vector;
step S5: training a coordinate-to-video network consisting of a generated countermeasure network based on the facial 3D feature coordinate sequence;
step S6: inputting any two portrait pictures and a section of any audio based on a trained facial lip voice coordinate animation generation network, a facial emotion coordinate animation generation network and a coordinate-to-video network, wherein one of the two portrait pictures represents an identity source and the other represents an emotion source; and generating lip sound synchronous video with the target portrait of the emotion corresponding to the emotion source.
2. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S1, the method specifically comprises:
firstly, performing frame rate conversion on a video, and converting the video into 62.5 frames per second;
then, the image is resampled and is cut into 256 × 256 videos containing faces;
extracting facial coordinates by using a facial recognition algorithm, acquiring 3D coordinates of the face of each frame, wherein the dimensionality is 68 x 3, and forming a facial 3D feature coordinate sequence;
and storing the face 3D feature coordinate sequence into an emotion source portrait coordinate sequence and an identity source portrait coordinate sequence, namely, an emotion source face coordinate and an identity source face coordinate.
3. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein step S2 specifically includes:
carrying out sampling rate conversion on the audio, and converting the audio sampling rate into 16000 Hz by using Fast Forward Moving Picture Experts Group (FFmpeg);
then carrying out audio vector extraction on the audio, and obtaining the audio vector by using the python resemblyzer library;
and finally, inputting the audio vector into a voice conversion model AutoVC, and acquiring the decoupled audio content vector irrelevant to the audio speaker and the audio style vector relevant to the audio speaker.
4. The method for generating controllable emotion facial animation based on multi-modal driving as claimed in claim 1, wherein in step S3, said facial lip voice coordinate animation generation network adopts a self-defined encoder-decoder network structure, the encoder comprises a facial coordinate encoder composed of two layers of MLPs and a speech content encoder composed of three layers of LSTM, and the decoder is a facial lip voice coordinate decoder composed of three layers of MLPs; the facial lip voice coordinate animation generation network is provided with a loss function used for continuously adjusting the weight and deviation of the network until the error between the predicted coordinate and the reference coordinate is minimized.
5. The method for generating emotion controllable facial animation based on multi-modal driving according to claim 4, wherein in step S3, the network training process for generating facial lip sound coordinates animation is as follows:
firstly, extracting the identity feature of the face 3D feature coordinate sequence of the first frame of the video obtained in the step S1 by using a two-layer MLP (Multi-layer matching processing), namely the identity feature of the first time point of the face 3D feature coordinate sequence;
then, based on the identity characteristics and the audio content vector obtained in the step S2, after linear fusion, extracting the coordinate dependency relationship between audio continuous syllables and lips by using the LSTM of the three-layer unit;
then, based on the output of the encoder in the step, a decoder consisting of three layers of MLPs is used for predicting a facial lip sound coordinate offset sequence, and the specific calculation formula is as follows:
ΔP_t = MLP_c(LSTM_c(Ec_{t→t+λ}, MLP_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c})

wherein ΔP_t represents the predicted facial lip sound coordinate offset of the t-th frame, and t represents the current frame of the portrait video; MLP_L represents the face coordinate encoder, L represents the face coordinates of the first frame of the portrait video, and W_{mlp,l} represents the learnable parameters of the face coordinate encoder; LSTM_c represents the speech content encoder, Ec represents the audio content vector, t→t+λ represents that the audio content vector is input to the speech content encoder in batches of λ = 18 per frame, and W_{lstm} represents the learnable parameters of the speech content encoder; MLP_c represents the facial lip sound coordinate decoder, and W_{mlp,c} represents the learnable parameters of the facial lip sound coordinate decoder;
correcting the first frame coordinate of the portrait video through the predicted facial lip tone coordinate offset sequence to obtain a lip tone synchronous coordinate sequence, wherein a specific calculation formula is as follows:
P_t = L + ΔP_t

wherein P_t represents the lip-sound-synchronized face coordinates of the t-th frame, and t represents the current frame of the portrait video; L represents the face coordinates of the first frame of the portrait video, and ΔP_t represents the predicted facial lip sound coordinate offset of the t-th frame;
in order to generate an optimal sequence of the offset of the facial lip coordinates, based on the encoder-decoder structure of the facial lip coordinate animation generation network, the weight and the deviation of a loss function adjustment network are set, and a specific calculation formula of the loss function is as follows:
L_lip = Σ_{t=1}^{T} Σ_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖₂²

wherein L_lip represents the loss function of the facial lip sound coordinate animation generation network, T represents the total number of frames of the video, t represents the current frame of the portrait video, N = 68 represents the total number of facial coordinates, and i represents the current facial coordinate index; P_{i,t} represents the i-th predicted face coordinate of the t-th frame, P̂_{i,t} represents the i-th face coordinate of the t-th frame obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖₂² represents the square of the Euclidean norm of their difference;

when the loss function tends to be smooth, i.e. L_lip reaches its minimum value, the training of the facial lip sound coordinate animation synthesis network is completed.
6. The method for generating the emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S4, the facial emotion coordinate animation generation network adopts a customized encoder-decoder network structure:
the encoder comprises an audio encoder and a face coordinate encoder, wherein the face coordinate encoder comprises an identity source face coordinate encoder and an emotion source face coordinate encoder, and the audio encoder captures audio features through a three-layer LSTM, a three-layer MLP and a self-attention mechanism;
the decoder comprises a coordinate decoder;
the encoder is used for acquiring audio features, portrait identity features and portrait emotion features, the decoder is used for processing multi-modal features, and the multi-modal features and the portrait emotion features are jointly driven to generate a coordinate offset sequence after target portrait emotion is remolded;
the facial emotion coordinate animation generation network sets three different loss functions to adjust the weights and biases of the network, wherein the first is used for calculating the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1, and the second and third are discriminator loss functions used, respectively, for distinguishing real from generated facial coordinates and for evaluating the similarity of facial coordinate interval frames.
7. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 6, wherein in step S4, the network training process for generating facial emotion coordinate animation is as follows:
firstly, LSTM is used for extracting the characteristics of the audio content vector obtained in the step S2;
then, using MLP to extract the features of the audio style vector obtained in step S2;
then, carrying out linear fusion on the audio content vector characteristics and the audio style vector characteristics;
and finally, capturing a longer-time structural dependency relationship between the audio content vector and the audio style vector by using a self-attention mechanism to obtain an audio feature with stronger time dependency, wherein a specific calculation formula is as follows:
S_t = Attn(LSTM_{c'}(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})

wherein S_t represents the processed audio feature of the t-th frame, and t represents the current frame of the portrait video; MLP_s represents the audio style vector encoder, Es represents the audio style vector, and W_{mlp,s} represents the learnable parameters of the audio style vector encoder; LSTM_{c'} represents the audio content vector encoder, Ec represents the audio content vector, t→t+λ represents that the audio content vector is input to the audio content vector encoder in batches of λ = 18 per frame t, and W'_{lstm} represents the learnable parameters of the audio content vector encoder; Attn represents the self-attention mechanism, and W_{attn} represents the learnable parameters of the self-attention mechanism;
the two facial coordinate encoders are both light neural networks consisting of seven layers of MLPs, wherein one is used for extracting geometric information of identity, and the other is used for extracting geometric information of facial emotion;
based on the two different face coordinates obtained in step S1, one is regarded as an identity source face coordinate sequence and the other is regarded as an emotion source face coordinate sequence, firstly, portrait identity features of an identity source are extracted by using an identity source face coordinate encoder composed of seven layers of MLPs; then, extracting portrait emotional characteristics of an emotion source by using an emotion source face coordinate encoder consisting of seven layers of MLPs; and finally, performing linear fusion on the portrait identity characteristic, the portrait emotion characteristic and the obtained audio characteristic to obtain a fusion characteristic, wherein the specific calculation formula is as follows:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)

wherein F_t represents the fused feature of the t-th frame after linear fusion, and concat represents linear fusion; MLP_{LA} represents the identity source face coordinate encoder, L_a represents the face coordinates of the first frame of the identity source portrait video, and W_{mlp,la} represents the learnable parameters of the identity source face coordinate encoder; MLP_{LB} represents the emotion source face coordinate encoder, L_b represents the face coordinates of the first frame of the emotion source portrait video, and W_{mlp,lb} represents the learnable parameters of the emotion source face coordinate encoder; S_t represents the t-th frame audio feature of step S4;
based on the fusion characteristics of the portrait identity characteristics, the portrait emotion characteristics and the audio characteristics, a coordinate decoder consisting of three layers of MLPs is used for predicting a facial emotion coordinate offset sequence, and the specific calculation formula is as follows:
ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})

wherein ΔQ_t represents the predicted facial emotion coordinate offset of the t-th frame, and t represents the current frame of the portrait video; MLP_{LD} represents the decoder of the facial emotion coordinate animation generation network, F_t represents the fused feature of the t-th frame after linear fusion, and W_{mlp,ld} represents the learnable parameters of the decoder;
the method comprises the following steps of correcting the first frame coordinate of the identity source portrait video through a predicted facial emotion coordinate offset sequence to obtain a facial emotion coordinate sequence, wherein the specific calculation formula is as follows:
Q_t = L_a + ΔQ_t

wherein Q_t represents the facial emotion coordinates of the t-th frame, and t represents the current frame of the portrait video; L_a represents the face coordinates of the first frame of the identity source portrait video, and ΔQ_t represents the predicted facial emotion coordinate offset of the t-th frame;
in order to generate an optimal facial emotion coordinate offset sequence, a coder-decoder structure of a network is generated based on facial emotion coordinate animation, three different loss functions are set to adjust the weight and the deviation of the network, and the specific formula is as follows:
L_total = λ₁·L_emo + λ₂·L_{D_L} + λ₃·L_{D_T}

wherein L_total represents the total loss function of the facial emotion coordinate animation generation network, L_emo represents the coordinate loss function of the facial emotion coordinate animation generation network, L_{D_L} represents the loss function of the face coordinate discriminator D_L, L_{D_T} represents the loss function of the face coordinate interval-frame similarity discriminator D_T, and λ₁, λ₂, λ₃ are the respective weight parameters;
wherein, the loss function of the facial emotion coordinate animation generation network calculates the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1, and the specific calculation formula is as follows:
L_emo = Σ_{t=1}^{T} Σ_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖₂²

wherein L_emo represents the loss function of the facial emotion coordinate animation generation network, T represents the total number of frames of the video, t represents the current frame of the portrait video, N = 68 represents the total number of facial coordinates, and i represents the current facial coordinate index; Q_{i,t} represents the i-th predicted face coordinate of the t-th frame, Q̂_{i,t} represents the i-th face coordinate of the t-th frame obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖₂² represents the square of the Euclidean norm of their difference;
discriminator loss function during facial emotion coordinate animation generation network training
(denoted L_{D_L}) is used for discriminating whether the generated face coordinates are real or fake, and the discriminator loss function L_{D_T} is used for estimating the similarity between interval frames of face coordinates, according to the following formulas:

[Formulas: loss functions of the discriminators D_L and D_T]

wherein t represents the current frame of the portrait video; D_L represents the discriminator of real versus fake face coordinates, and L_{D_L} represents its loss function; D_T represents the face coordinate interval-frame similarity discriminator, and L_{D_T} represents its loss function; Q_t represents the predicted facial emotion coordinates of the t-th frame, Q̂_t represents the face coordinates of the t-th frame obtained in step S1, and Q̂_{t−1} represents the face coordinates of the frame preceding Q̂_t;

when the loss functions tend to be smooth, the training of the facial emotion coordinate animation generation network is completed.
8. The method for generating controllable emotion face animation based on multi-modal driving as claimed in claim 1, wherein in step S5, the training process of the coordinate-to-video network is as follows:
based on the face coordinate sequence obtained in step S1, connecting the discrete coordinates by number, and rendering with color line segments to create a three-channel face sketch sequence with a size of 256 × 256;
performing channel cascade on the sequence and the original picture of the first frame of the corresponding video to create a six-channel picture sequence with the size of 256 × 256;
generating a reconstructed face video by using a coordinate-to-video network by taking the sequence as input;
in order to generate an optimal face video, the weight and the deviation of the network are adjusted by setting an L1 loss function based on the image conversion network.
9. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S6, lip sound synchronization video of a target portrait with emotion source emotion is generated by using three trained network models, and specifically comprises:
inputting any two portrait pictures and any section of audio, respectively obtaining an identity source portrait coordinate and an emotion source portrait coordinate by using a face recognition algorithm, and obtaining an audio content vector and an audio style vector of the audio by using a voice conversion method;
generating a lip sound synchronous face coordinate offset sequence by the audio content vector and the identity source coordinate through the face lip sound coordinate animation generation network obtained in the step S3;
generating a network by the audio content vector, the audio style vector, the identity source coordinate and the emotion source coordinate through the facial emotion coordinate animation obtained in the step S4 to generate a facial emotion coordinate offset sequence;
and correcting the identity source coordinate through the two offset sequences to obtain a final coordinate sequence, inputting the final coordinate sequence to the coordinate-to-video network obtained in the step S5, and generating a lip sound synchronous video of the target portrait with emotion source emotion.
CN202210744504.9A 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method Active CN115100329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210744504.9A CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210744504.9A CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Publications (2)

Publication Number Publication Date
CN115100329A true CN115100329A (en) 2022-09-23
CN115100329B CN115100329B (en) 2023-04-07

Family

ID=83295794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210744504.9A Active CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Country Status (1)

Country Link
CN (1) CN115100329B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631275A (en) * 2022-11-18 2023-01-20 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN116433807A (en) * 2023-04-21 2023-07-14 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116843798A (en) * 2023-07-03 2023-10-03 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1466104A (en) * 2002-07-03 2004-01-07 Institute of Computing Technology, Chinese Academy of Sciences Statistics and rule combination based phonetic driving human face cartoon method
US20120280974A1 (en) * 2011-05-03 2012-11-08 Microsoft Corporation Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN111783658A (en) * 2020-07-01 2020-10-16 河北工业大学 Two-stage expression animation generation method based on double generation countermeasure network
CN113158727A (en) * 2020-12-31 2021-07-23 长春理工大学 Bimodal fusion emotion recognition method based on video and voice information
CN113408449A (en) * 2021-06-25 2021-09-17 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
CN114202604A (en) * 2021-11-30 2022-03-18 长城信息股份有限公司 Voice-driven target person video generation method and device and storage medium
CN114663539A (en) * 2022-03-09 2022-06-24 东南大学 2D face restoration technology under mask based on audio drive

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN YIWEN ET AL.: "Speech-driven facial animation supporting expression details" *

Also Published As

Publication number Publication date
CN115100329B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN115100329B (en) Multi-mode driving-based emotion controllable facial animation generation method
US11551393B2 (en) Systems and methods for animation generation
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
CN115116109B (en) Virtual character speaking video synthesizing method, device, equipment and storage medium
Wang et al. Seeing what you said: Talking face generation guided by a lip reading expert
EP2030171A1 (en) Do-it-yourself photo realistic talking head creation system and method
CN115588224B (en) Virtual digital person generation method and device based on face key point prediction
CN114202604A (en) Voice-driven target person video generation method and device and storage medium
CN115004236A (en) Photo-level realistic talking face from audio
CN112785671B (en) Virtual dummy face animation synthesis method
WO2021023869A1 (en) Audio-driven speech animation using recurrent neutral network
Si et al. Speech2video: Cross-modal distillation for speech to video generation
EP0710929A2 (en) Acoustic-assisted image processing
CN115393949A (en) Continuous sign language recognition method and device
Zhua et al. Audio-driven talking head video generation with diffusion model
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
Wang et al. CA-Wav2Lip: Coordinate Attention-based Speech To Lip Synthesis In The Wild
CN115937375A (en) Digital body-separating synthesis method, device, computer equipment and storage medium
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
CN114494930A (en) Training method and device for voice and image synchronism measurement model
CN114466178A (en) Method and device for measuring synchronism of voice and image
CN114466179A (en) Method and device for measuring synchronism of voice and image
Gowda et al. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Pan et al. Research on face video generation algorithm based on speech content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant