CN114155321B - Face animation generation method based on self-supervision and mixed density network - Google Patents

Face animation generation method based on self-supervision and mixed density network

Info

Publication number
CN114155321B
CN114155321B
Authority
CN
China
Prior art keywords
voice
image
face
network
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111424899.6A
Other languages
Chinese (zh)
Other versions
CN114155321A (en)
Inventor
王建荣
范洪凯
喻梅
李雪威
刘李
李森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202111424899.6A priority Critical patent/CN114155321B/en
Publication of CN114155321A publication Critical patent/CN114155321A/en
Application granted granted Critical
Publication of CN114155321B publication Critical patent/CN114155321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a face animation generation method based on a self-supervision and mixed density network. The method separates a voice content feature vector and an identity feature vector from voice fbank features, and introduces a memory module for extracting high-quality voice features: the memory module stores a plurality of different hypotheses, and the uncertainty that arises when voice is mapped to lip motions and head motions is assigned to the memory module, so that the feature extractor can concentrate on feature extraction. To handle the uncertainty that arises when voice is mapped to head motions, a mixed density network is introduced into the face key point regression task, and a face key point regression network based on the mixed density network is provided. Finally, the face key points and a reference face image are input into a picture-to-picture conversion network to obtain the final face image.

Description

Face animation generation method based on self-supervision and mixed density network
Technical Field
The invention belongs to the technical field of image feature extraction, and relates to a face animation generation method based on a self-supervision and mixed density network.
Background
Generally, face animation generation aims to drive a reference face image with a source voice sequence and thereby generate a speaker face animation corresponding to that voice sequence. Face animation generation has broad development prospects in industries such as film production, digital games, video conferencing and virtual anchoring, and is of great value for helping people with hearing impairment understand language.
Hearing and vision are important media for conveying information. When people communicate, the motions of the facial organs convey important information: lip motions convey the voice content, facial expressions reflect the emotions of the speaker, and even head motions can improve a listener's understanding of the language. Voice contains not only content information but also identity information; different people speak with different timbres, so people can often distinguish speakers by their voices. A face image also contains identity features, so the voice features and the face image features contain both overlapping and complementary information. Therefore, combining hearing and vision provides an important mode of human-machine interaction.
In a generated face animation, synchronization between lip motions and voice content is crucial; if the voice content and the lip motions are out of sync, viewers feel uncomfortable and may even doubt what they hear. Therefore, generating a face animation synchronized with the voice content is the first problem to be considered in a face animation generation task. However, generating only lip motions synchronized with the voice is far from enough: an animation in which only the lips move while the head and the other facial organs remain static appears very stiff, whereas facial organ motions help to increase the perceived realism of the generated result. Therefore, it is important for the face animation to include natural head motions.
Face animation generation is generally classified into two types: speech-driven and text-driven. Speech-driven face animation takes an original voice input, extracts mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficient, MFCC) or filter bank parameters (Filter Bank, fbank) from the voice, and uses a neural network or machine learning method to establish a mapping from voice parameters to face images from a large amount of training data. The voice and the visual information are not perfectly aligned in time: typically the lips change earlier than the sound. For example, when we say "bed", the upper and lower lips meet before the word is voiced. To address this, a neural network is usually trained to learn this delay, or to simply predict the video frame from the audio frame context, i.e. the frames before and after the current audio frame. The text-driven method converts text into phoneme information, establishes a mapping from the phoneme information to mouth shapes, and uses co-articulation rules to generate smooth, continuous mouth shapes. The text-driven and speech-driven methods are essentially equivalent: speech recognition methods convert speech to text, and speech synthesis (Text To Speech, TTS) methods convert text to speech.
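As a concrete illustration of the speech parameterization described above, the following is a minimal sketch of extracting fbank and MFCC features with torchaudio; the file name, the sampling rate and the feature dimensions (80 mel bins, 13 MFCCs) are placeholder assumptions rather than values specified by the invention.

```python
# Minimal sketch: extracting fbank / MFCC features from a waveform with torchaudio.
# The file path, sampling rate and feature dimensions are illustrative assumptions.
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sample_rate = torchaudio.load("speech.wav")  # shape: (channels, samples)

# Kaldi-style filter-bank (fbank) features: (num_frames, num_mel_bins)
fbank = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)

# MFCC features via the standard transform: (channels, n_mfcc, num_frames)
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)

print(fbank.shape, mfcc.shape)
```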
Face animation generation has broad application prospects in many industries. For a network conference with limited bandwidth, only the voice and a single face image need to be transmitted, and the face animation can be synthesized at the receiving end. For people with impaired hearing, synthesizing a face animation from voice improves language comprehension through the lip motions. The technique is also of great help to industries such as film dubbing and game animation, where it can effectively improve user experience. Face animation generation is a multidisciplinary research field, and the development of this technology can bring great convenience to our lives and promote the development of society.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a face animation generation method based on a self-supervision and mixed density network.
The invention solves the technical problems through the following technical solution:
A face animation generation method based on a self-supervision and mixed density network, characterized by comprising the following steps:
1) Inputting a group of face images and a section of voice fragments, performing self-supervised contrastive learning, and training a voice feature extraction module;
2) Applying the voice feature extraction module learned in self-supervised contrastive learning to the downstream task of face key point regression, and fine-tuning it on the downstream task;
3) Inputting the target face key points output by the face key point regression network, together with the reference face image, into an image-to-image conversion network to obtain the final target face image.
Moreover, the specific operation of step 1) is as follows:
(1) Given a group of input face images and a section of voice fragments, input them respectively into a picture feature extractor and a voice feature extractor for feature extraction;
(2) First extract image features from the input face images using a 2D-CNN, then extract temporal information between adjacent image frames using a 3D-CNN, and finally extract image content features with an image content encoder and image identity features with an image identity encoder;
(3) First perform a preliminary extraction of the voice features using a convolutional neural network to obtain a high-level feature representation, and then further learn the voice temporal information using a bidirectional GRU;
(4) Input the voice features extracted by the GRU into a memory module; the memory module stores a plurality of different hypotheses, and the uncertainty of mapping voice to lip motions and head motions is shifted onto the memory module, so that the voice feature extraction module can concentrate on voice feature extraction;
(5) Contrast the extracted voice content features with the face image content features, and contrast the extracted voice identity features with the face image identity features.
Moreover, the specific operation of step 2) is as follows:
1) Input the voice fragments into the voice feature extractor to obtain voice content feature vectors and voice identity feature vectors;
2) Input the reference face key points into the multi-layer perceptron to obtain face key point feature vectors, and input the voice content feature vectors, the voice identity feature vectors and the face key point feature vectors into the mixed density network to obtain the target face key points.
The invention has the advantages and beneficial effects that:
1. In the face animation generation method based on the self-supervision and mixed density network, the target face key points are generated from the voice fragment and the key points of the reference face image and serve as an intermediate representation for face animation generation, and the final target face image is then generated from the target face key points and the reference face image. Using face key points as an intermediate representation between speech and face images has several advantages. First, generating face key points avoids the appearance characteristics of low-level pixels, so head motions are easier to capture; at the same time, compared with millions of pixels, using 68 face key points makes the model more compact with fewer parameters, so the model can be trained with a small data set. Second, key points can easily drive different types of animation content, including face images and cartoon animations; in contrast, pixel-based face animation generation methods are limited to real faces and cannot easily be generalized to cartoon animation generation.
2. The face animation generation method based on the self-supervision and mixed density network exploits the fact that self-supervision requires no data labels: the feature extraction network is fully trained on a large amount of unlabeled data, and the voice features are separated into a content-related feature vector and an identity-related feature vector, so that the voice content feature vector focuses on lip motions and the voice identity feature vector focuses on head motions.
3. In the face animation generation method based on the self-supervision and mixed density network, a memory module is introduced into self-supervised contrastive learning to store a plurality of different hypotheses, and the uncertainty that arises when voice is mapped to lip motions and head motions is assigned to the memory module, so that the feature extractor can concentrate on feature extraction.
4. The face animation generation method based on the self-supervision and mixed density network uses the mixed density network to generate a plurality of different hypotheses for a speaker, further improving the naturalness of the generated head motions.
Drawings
FIG. 1 is a schematic diagram of a self-supervised contrast learning network architecture of the present invention;
FIG. 2 is a schematic diagram of a mixed density network-based facial animation regression structure;
FIG. 3 is a schematic diagram of an image-to-image conversion network architecture of the present invention;
FIG. 4 is a schematic diagram of the experimental results of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention relates to a face animation generation network structure based on a self-supervision and mixed density network, in which a voice feature extraction network based on memory-augmented self-supervised contrastive learning is used to obtain high-quality voice features. Exploiting the fact that self-supervision requires no data labels, the feature extraction network is fully trained on a large amount of unlabeled data, and the voice features are separated into a content-related feature vector and an identity-related feature vector. A memory module is introduced to store a plurality of different hypotheses, and the uncertainty that arises when voice is mapped to lip motions and head motions is assigned to the memory module, so that the feature extractor can concentrate on feature extraction; using a mixed density network to generate a plurality of different hypotheses for a speaker further improves the naturalness of the generated head motions. In contrast to most existing works, which generate a single set of face key points by minimizing the negative log-likelihood of a single Gaussian, the invention estimates multiple 2D face key point hypotheses by minimizing the negative log-likelihood of a Gaussian mixture.
As shown in fig. 1, a feature extraction model based on memory-augmented self-supervised contrastive learning is provided, which mainly comprises three modules:
(1) The voice feature extraction module, based on an AudioEncoder and a bidirectional GRU: a convolutional neural network first performs a preliminary extraction of the voice features to obtain a high-level feature representation, and the bidirectional GRU then further learns the voice temporal information, which is expressed as h_t;
(2) The image feature extraction module, based on ResNet-18: a 2D-CNN first extracts image features, a 3D-CNN then extracts temporal information between adjacent image frames, and finally an image content encoder U_c(·) extracts the image content features and an image identity encoder U_s(·) extracts the image identity features;
(3) The memory module. Voice does not map one-to-one to lip motions and head motions: the same voice segment can correspond to several different lip motions and head motions. The memory module stores a plurality of different hypotheses, and the uncertainty of these mappings is shifted onto the memory module so that the feature extraction module can concentrate on feature extraction.
The voice content feature vector and the voice identity feature vector are each obtained as a weighted combination of the memory slots, where p(i, t) is the contribution of the i-th memory slot to time step t and M is the memory module. The weight distribution function maps the context representation h_t to p(i, t), h_t being the voice feature extracted by the GRU at time t; the weight distribution function is a learnable multi-layer perceptron with a softmax applied over the slot dimension k.
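One plausible realization of the voice feature extraction module and the memory read described above is sketched below in PyTorch. The layer sizes, the number of memory slots and the use of two separate memory banks (one for the content branch, one for the identity branch) are assumptions made for illustration, not values specified by the invention.

```python
# Hedged sketch of the AudioEncoder + bidirectional GRU + memory module read.
# All hyperparameters (channel sizes, hidden size, number of slots) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Preliminary CNN extraction followed by a bidirectional GRU over time."""
    def __init__(self, n_fbank=80, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_fbank, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(256, hidden, batch_first=True, bidirectional=True)

    def forward(self, fbank):                # fbank: (B, T, n_fbank)
        x = self.cnn(fbank.transpose(1, 2))  # (B, 256, T)
        h, _ = self.gru(x.transpose(1, 2))   # h_t: (B, T, 2 * hidden)
        return h

class MemoryRead(nn.Module):
    """Softmax-weighted read over K memory slots: the weight MLP maps h_t to p(i, t)."""
    def __init__(self, feat_dim=512, n_slots=64, slot_dim=512):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_slots, slot_dim))        # memory M
        self.weight_mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                        nn.Linear(256, n_slots))

    def forward(self, h):                                   # h: (B, T, feat_dim)
        p = F.softmax(self.weight_mlp(h), dim=-1)           # p(i, t): (B, T, n_slots)
        return p @ self.memory                              # weighted combination of slots

# Usage sketch: one memory read each for the content branch and the identity branch.
encoder = AudioEncoder()
content_mem, identity_mem = MemoryRead(), MemoryRead()
h = encoder(torch.randn(2, 25, 80))           # 2 clips, 25 frames of fbank features
a_content, a_identity = content_mem(h), identity_mem(h)
print(a_content.shape, a_identity.shape)      # (2, 25, 512) each
```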
(4) The contrastive loss function. For any audio fragment, the image sequence corresponding to that audio fragment is the positive sample, and the remaining image sequences are negative samples. The contrastive loss function uses cosine similarity to compute the similarity between any two feature representations. Because the cosine distance is bounded in [-1, 1], combining it directly with the softmax loss function yields only small differences between the logits and hence a small cross entropy, so the combination of cosine distance and softmax cannot be learned efficiently. For this reason, learnable parameters w and b are introduced and trained jointly with the network.
The content contrastive loss is computed between the voice content features and the face image content features, and the identity contrastive loss is computed in the same way between the voice identity features and the face image identity features.
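The exact form of the two contrastive losses is not reproduced above; the following sketch shows one common InfoNCE-style formulation that is consistent with the description (cosine similarity rescaled by the learnable parameters w and b before a softmax cross entropy). It is an assumed formulation for illustration, not necessarily the invention's exact loss.

```python
# Hedged sketch: InfoNCE-style contrastive loss over cosine similarities,
# rescaled by learnable parameters w and b as described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineContrastiveLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))  # learnable scale
        self.b = nn.Parameter(torch.tensor(-5.0))  # learnable bias

    def forward(self, audio_feat, image_feat):
        # audio_feat, image_feat: (N, D); row i of each is a matched (positive) pair.
        a = F.normalize(audio_feat, dim=-1)
        v = F.normalize(image_feat, dim=-1)
        logits = self.w * (a @ v.t()) + self.b          # cosine similarity, rescaled
        targets = torch.arange(a.size(0), device=a.device)
        # Positive pairs lie on the diagonal; all other pairs are negatives.
        return F.cross_entropy(logits, targets)

# Usage: one instance for the content branch, one for the identity branch.
loss_fn = ScaledCosineContrastiveLoss()
loss = loss_fn(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```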
For the face key point regression network based on the mixed density network, as shown in fig. 2, the specific steps are as follows:
The inputs are the target face key points p_i of the i-th frame together with the two preceding and the two following frames, the voice segment A_i of the current frame (containing 5 frames of audio), and the reference face key points p_r. The aim is to learn a function F: {p_i, A_i} → Θ that maps the input {p_i, A_i} to the output parameters of the mixed density network, Θ = {μ, σ, α}, where μ, σ and α are respectively the means, variances and mixing coefficients of the mixed density network, and M is the number of Gaussian kernels. The mean of each Gaussian kernel represents a set of aligned 2D face key points, a rotation and an offset, and the number of Gaussian kernels M determines the number of hypotheses generated by the model.
The i-th frame voice content feature vector, the voice identity feature vector and the reference face key point feature vector p_r together form the conditioning input c_i, and the probability density of the target value can be expressed as a linear combination of Gaussian kernel functions:
p(w_i | c_i) = Σ_{m=1}^{M} α_m(c_i) · ψ_m(w_i | c_i)
where M is the number of Gaussian kernels, i.e. the number of components that make up the mixture model; α_m(c_i) is the mixing coefficient, namely the probability weight of the m-th component when w_i is generated from the input voice content feature vector, voice identity feature vector and reference face key point feature vector; w_i is the generated set of aligned 2D face key points, rotation and offset; and ψ_m is the probability density function of component m, used to compute the density of w_i given the input c_i. The invention uses Gaussian kernels as the probability density functions.
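A minimal sketch of such a mixed density output head and its negative log-likelihood loss is given below; the dimensionality of the conditioning vector, the number of Gaussian kernels and the use of a single isotropic variance per kernel are illustrative assumptions.

```python
# Hedged sketch: a mixed density network (MDN) head that predicts the mixture
# parameters Θ = {μ, σ, α} and the corresponding negative log-likelihood loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    def __init__(self, in_dim=512, out_dim=68 * 2, n_kernels=5):
        super().__init__()
        self.out_dim, self.n_kernels = out_dim, n_kernels
        self.alpha = nn.Linear(in_dim, n_kernels)            # mixing coefficients
        self.mu = nn.Linear(in_dim, n_kernels * out_dim)     # means
        self.log_sigma = nn.Linear(in_dim, n_kernels)        # one variance per kernel

    def forward(self, c):                                    # c: (B, in_dim)
        alpha = F.softmax(self.alpha(c), dim=-1)             # (B, M)
        mu = self.mu(c).view(-1, self.n_kernels, self.out_dim)
        sigma = torch.exp(self.log_sigma(c))                 # (B, M), positive
        return alpha, mu, sigma

def mdn_nll(alpha, mu, sigma, target):
    """Negative log-likelihood of the target under the Gaussian mixture."""
    diff = target.unsqueeze(1) - mu                          # (B, M, D)
    d = mu.size(-1)
    log_gauss = (-0.5 * (diff ** 2).sum(-1) / sigma ** 2
                 - d * torch.log(sigma)
                 - 0.5 * d * torch.log(torch.tensor(2 * torch.pi)))
    return -torch.logsumexp(torch.log(alpha + 1e-8) + log_gauss, dim=-1).mean()

# Usage: condition vector c_i -> multiple keypoint hypotheses; select or sample by alpha.
head = MDNHead()
alpha, mu, sigma = head(torch.randn(4, 512))
loss = mdn_nll(alpha, mu, sigma, torch.randn(4, 136))
print(loss.item())
```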
The invention trains a picture-to-picture conversion module that takes the target face key points and a reference face image as input and generates the final target face image. The picture-to-picture conversion module has an encoder-decoder structure: the target face key points are drawn as an RGB picture O_trg of size 256 × 256 × 3 and concatenated with the source face image H_src along the channel dimension to obtain an input of size 256 × 256 × 6. The input is passed through the encoder to obtain an intermediate feature representation, which is then fed to the decoder to reconstruct the target face image H_trg. The decoder is a CNN that uses deconvolution to recover the target face image from the intermediate feature representation. The encoder and decoder form a U-Net structure with skip connections so as to better preserve the identity information of the target speaker; the model structure is shown in fig. 3.
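The channel-wise concatenation and the skip-connected encoder-decoder can be sketched as follows; the depth and channel counts of this tiny U-Net are assumptions and are far smaller than a practical model.

```python
# Hedged sketch: concatenate the rendered keypoint image with the reference face
# image along the channel dimension and pass it through a tiny U-Net with skips.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def deconv_block(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1, self.e2, self.e3 = conv_block(6, 64), conv_block(64, 128), conv_block(128, 256)
        self.d3 = deconv_block(256, 128)
        self.d2 = deconv_block(128 + 128, 64)
        self.d1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, keypoint_img, ref_img):
        x = torch.cat([keypoint_img, ref_img], dim=1)     # (B, 6, 256, 256)
        e1 = self.e1(x)                                   # (B, 64, 128, 128)
        e2 = self.e2(e1)                                  # (B, 128, 64, 64)
        e3 = self.e3(e2)                                  # (B, 256, 32, 32)
        d3 = self.d3(e3)                                  # (B, 128, 64, 64)
        d2 = self.d2(torch.cat([d3, e2], dim=1))          # skip connection from e2
        return self.d1(torch.cat([d2, e1], dim=1))        # skip connection from e1

net = TinyUNet()
out = net(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(out.shape)  # (1, 3, 256, 256)
```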
Specifically, the VoxCeleb data set is used for training and testing in this example. After dividing the data into a training set and a test set, the key points in each face image are first extracted with a face key point extractor; the face key points and the voice fragments are then input into the face key point regression network to obtain the target face key points, and the target face key points and the reference face image are input into the image-to-image conversion network to obtain the final face image.
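Putting the pieces together, the inference pipeline of this embodiment can be sketched as below. All callable names (extract_keypoints, regressor, translator) are hypothetical placeholders standing in for the face key point extractor, the mixed-density regression network and the image-to-image conversion network described above.

```python
# Hedged end-to-end inference sketch for this embodiment. The callables passed in
# are hypothetical placeholders for the components described above.
import torch

def generate_talking_face(reference_image, fbank_frames,
                          extract_keypoints, regressor, translator):
    """reference_image: (3, 256, 256); fbank_frames: (T, 80) voice features."""
    ref_keypoints = extract_keypoints(reference_image)         # (68, 2)
    frames = []
    for t in range(fbank_frames.size(0)):
        window = fbank_frames[max(0, t - 2): t + 3]             # 5-frame audio context
        target_kp = regressor(window, ref_keypoints)            # predicted 68 key points
        frames.append(translator(target_kp, reference_image))   # (3, 256, 256)
    return torch.stack(frames)                                  # (T, 3, 256, 256)
```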
In practical applications, these data samples can be replaced with the user's own data samples, as long as the framework structure is kept the same. In addition, this embodiment only requires PyTorch (a Python machine learning framework), which makes practical application more convenient.
To verify the feasibility of the solution, experiments were performed on the data set, finally yielding the results in Table 1 and fig. 4.
In the experiments, two evaluation metrics are defined to evaluate the performance of the model proposed in this embodiment: the lip key point distance (Landmark Distance, LMD) and the rotation distance (Rotation Distance, RD). The results in Table 1 show that the method proposed by the invention achieves performance superior to the methods described above.
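The exact definitions of LMD and RD are not given above; the sketch below uses commonly assumed forms (mean Euclidean distance over the lip landmarks, and mean absolute difference of the head rotation angles), which are assumptions for illustration only.

```python
# Hedged sketch of the two metrics, under assumed definitions:
# LMD - mean Euclidean distance between predicted and ground-truth lip landmarks
# RD  - mean absolute difference between predicted and ground-truth rotation angles
import torch

def lip_landmark_distance(pred_kp, gt_kp, lip_indices=range(48, 68)):
    """pred_kp, gt_kp: (T, 68, 2) landmark sequences; indices 48-67 assumed to be the lips."""
    idx = torch.tensor(list(lip_indices))
    diff = pred_kp[:, idx] - gt_kp[:, idx]
    return diff.norm(dim=-1).mean().item()

def rotation_distance(pred_rot, gt_rot):
    """pred_rot, gt_rot: (T, 3) head rotation angles per frame."""
    return (pred_rot - gt_rot).abs().mean().item()

print(lip_landmark_distance(torch.randn(10, 68, 2), torch.randn(10, 68, 2)))
print(rotation_distance(torch.randn(10, 3), torch.randn(10, 3)))
```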
TABLE 1
Although the embodiments of the invention and the accompanying drawings have been disclosed for illustrative purposes, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims; therefore, the scope of the invention is not limited to the disclosure of the embodiments and the drawings.

Claims (2)

1. A face animation generation method based on a self-supervision and mixed density network, characterized by comprising the following steps:
1) Inputting a group of face images and a section of voice fragments, performing self-supervised contrastive learning, and training a voice feature extraction module;
2) Applying the voice feature extraction module learned in self-supervised contrastive learning to the downstream task of face key point regression, and fine-tuning it on the downstream task;
3) Inputting the target face key points output by the face key point regression network, together with the reference face image, into an image-to-image conversion network to obtain the final target face image;
the specific operation of step 1) is as follows:
(1) Given a group of input face images and a section of voice fragments, input them respectively into a picture feature extractor and a voice feature extractor for feature extraction;
(2) First extract image features from the input face images using a 2D-CNN, then extract temporal information between adjacent image frames using a 3D-CNN, and finally extract image content features with an image content encoder and image identity features with an image identity encoder;
(3) First perform a preliminary extraction of the voice features using a convolutional neural network to obtain a high-level feature representation, and then further learn the voice temporal information using a bidirectional GRU;
(4) Input the voice features extracted by the GRU into a memory module; the memory module stores a plurality of different hypotheses, and the uncertainty of mapping voice to lip motions and head motions is shifted onto the memory module, so that the voice feature extraction module can concentrate on voice feature extraction;
(5) Contrast the extracted voice content features with the face image content features, and contrast the extracted voice identity features with the face image identity features.
2. The face animation generation method based on a self-supervision and mixed density network according to claim 1, wherein the specific operation of step 2) is as follows:
1) Input the voice fragments into the voice feature extractor to obtain voice content feature vectors and voice identity feature vectors;
2) Input the reference face key points into the multi-layer perceptron to obtain face key point feature vectors, and input the voice content feature vectors, the voice identity feature vectors and the face key point feature vectors into the mixed density network to obtain the target face key points.
CN202111424899.6A 2021-11-26 Face animation generation method based on self-supervision and mixed density network Active CN114155321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111424899.6A CN114155321B (en) 2021-11-26 Face animation generation method based on self-supervision and mixed density network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111424899.6A CN114155321B (en) 2021-11-26 Face animation generation method based on self-supervision and mixed density network

Publications (2)

Publication Number Publication Date
CN114155321A (en) 2022-03-08
CN114155321B (en) 2024-06-07

Family


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160098581A (en) * 2015-02-09 2016-08-19 홍익대학교 산학협력단 Method for certification using face recognition and speaker verification
CN112001992A (en) * 2020-07-02 2020-11-27 超维视界(北京)传媒科技有限公司 Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
CN113450436A (en) * 2021-06-28 2021-09-28 武汉理工大学 Face animation generation method and system based on multi-mode correlation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160098581A (en) * 2015-02-09 2016-08-19 홍익대학교 산학협력단 Method for certification using face recognition and speaker verification
CN112001992A (en) * 2020-07-02 2020-11-27 超维视界(北京)传媒科技有限公司 Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
CN113450436A (en) * 2021-06-28 2021-09-28 武汉理工大学 Face animation generation method and system based on multi-mode correlation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Facial expression generation based on NURBS deformation and single-view images; Sun Si; Ge Weimin; Feng Zhiyong; Xu Chao; Peng Weilong; Computer Engineering; 2017-11-15 (No. 11); full text *
Speech-driven talking face video generation based on key point representation; Nian Fudong; Wang Wentao; Wang Yan; Zhang Jingjing; Hu Guiheng; Li Teng; Pattern Recognition and Artificial Intelligence; 2021-06-15; Vol. 34 (No. 6); full text *

Similar Documents

Publication Publication Date Title
Wang et al. One-shot talking face generation from single-speaker audio-visual correlation learning
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
Vougioukas et al. Video-driven speech reconstruction using generative adversarial networks
CN113554737A (en) Target object motion driving method, device, equipment and storage medium
JP2014519082A (en) Video generation based on text
WO2022106654A2 (en) Methods and systems for video translation
Fu et al. Audio/visual mapping with cross-modal hidden Markov models
CN111666831A (en) Decoupling representation learning-based speaking face video generation method
CN113077537A (en) Video generation method, storage medium and equipment
Ma et al. Unpaired image-to-speech synthesis with multimodal information bottleneck
CN115761075A (en) Face image generation method, device, equipment, medium and product
Hassid et al. More than words: In-the-wild visually-driven prosody for text-to-speech
Liz-Lopez et al. Generation and detection of manipulated multimodal audiovisual content: Advances, trends and open challenges
CN116977903A (en) AIGC method for intelligently generating short video through text
CN116705038A (en) 3D virtual speaker driving method based on voice analysis and related device
CN114155321B (en) Face animation generation method based on self-supervision and mixed density network
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
Sun et al. Pre-avatar: An automatic presentation generation framework leveraging talking avatar
Chen et al. VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer
Barve et al. Multi-language audio-visual content generation based on generative adversarial networks
CN114155321A (en) Face animation generation method based on self-supervision and mixed density network
Sadiq et al. Emotion dependent domain adaptation for speech driven affective facial feature synthesis
Zainkó et al. Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
Kolivand et al. Realistic lip syncing for virtual character using common viseme set
Preethi Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant