CN114155321B - Face animation generation method based on self-supervision and mixed density network - Google Patents

Face animation generation method based on self-supervision and mixed density network

Info

Publication number
CN114155321B
CN114155321B
Authority
CN
China
Prior art keywords
voice
image
face
network
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111424899.6A
Other languages
Chinese (zh)
Other versions
CN114155321A (en)
Inventor
王建荣
范洪凯
喻梅
李雪威
刘李
李森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202111424899.6A priority Critical patent/CN114155321B/en
Publication of CN114155321A publication Critical patent/CN114155321A/en
Application granted granted Critical
Publication of CN114155321B publication Critical patent/CN114155321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a face animation generation method based on a self-supervision and mixed density network. The method separates a voice content feature vector and an identity feature vector from voice fbank features, and introduces a memory module for extracting high-quality voice features: the memory module stores a plurality of different hypotheses, and the uncertainty that arises when voice is mapped to lip motions and head motions is assigned to the memory module, so that the feature extractor can concentrate on feature extraction. To handle the uncertainty that arises when voice is mapped to head motions, a mixed density network is introduced into the face key point regression task, and a face key point regression network based on the mixed density network is provided. Finally, the face key points and a reference face image are input into a picture-to-picture conversion network to obtain the final face image.

Description

Face animation generation method based on self-supervision and mixed density network
Technical Field
The invention belongs to the technical field of image feature extraction, and relates to a face animation generation method based on a self-supervision and mixed density network.
Background
Generally, face animation generation aims to drive a reference face image with a source voice sequence and thereby generate a speaker face animation corresponding to that voice sequence. Face animation generation has broad development prospects in industries such as film production, digital games, video conferencing and virtual anchoring, and is of great value for helping people with hearing impairment understand language.
Hearing and vision are important media for conveying information. When people communicate, the motions of the facial organs convey important information: lip motions convey the voice content, facial expressions reflect the emotions of the speaker, and even head motions can improve a listener's understanding of the language. Voice contains not only content information but also identity information; different people speak with different timbres, so people can often distinguish speakers by their voices. A face image also contains identity features, so the voice features and the face image features contain both overlapping and complementary information. Therefore, combining hearing and vision provides an important mode of human-machine interaction.
In a generated face animation, synchronization between lip motions and voice content is crucial; if the voice content and the lip motions are out of sync, viewers feel uncomfortable and may even doubt what they hear. Therefore, generating a face animation synchronized with the voice content is the first problem to be considered in a face animation generation task. However, generating only lip motions synchronized with the voice is far from enough: an animation in which only the lips move while the head and the other facial organs remain static appears very stiff, whereas facial organ motions help to increase the perceived realism of the generated result. Therefore, it is important for the face animation to include natural head motions.
Face animation generation is generally classified into two types: speech-driven and text-driven. Speech-driven face animation takes an original voice input, extracts mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficient, MFCC) or filter bank parameters (Filter Bank, fbank) from the voice, and uses a neural network or machine learning method to establish a mapping from voice parameters to face images from a large amount of training data. The voice and the visual information are not perfectly aligned in time: typically the lips change earlier than the sound. For example, when we say "bed", the upper and lower lips meet before the word is voiced. To address this, a neural network is usually trained to learn this delay, or to simply predict the video frame from the audio frame context, i.e. the frames before and after the current audio frame. The text-driven method converts text into phoneme information, establishes a mapping from the phoneme information to mouth shapes, and uses co-articulation rules to generate smooth, continuous mouth shapes. The text-driven and speech-driven methods are essentially equivalent: speech recognition methods convert speech to text, and speech synthesis (Text To Speech, TTS) methods convert text to speech.
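As a concrete illustration of the speech parameterization described above, the following is a minimal sketch of extracting fbank and MFCC features with torchaudio; the file name, the sampling rate and the feature dimensions (80 mel bins, 13 MFCCs) are placeholder assumptions rather than values specified by the invention.

```python
# Minimal sketch: extracting fbank / MFCC features from a waveform with torchaudio.
# The file path, sampling rate and feature dimensions are illustrative assumptions.
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sample_rate = torchaudio.load("speech.wav")  # shape: (channels, samples)

# Kaldi-style filter-bank (fbank) features: (num_frames, num_mel_bins)
fbank = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)

# MFCC features via the standard transform: (channels, n_mfcc, num_frames)
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)

print(fbank.shape, mfcc.shape)
```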
Face animation generation has broad application prospects in many industries. For a network conference with limited bandwidth, only the voice and a single face image need to be transmitted, and the face animation can be synthesized at the receiving end. For people with impaired hearing, synthesizing a face animation from voice improves language comprehension through the lip motions. The technique is also of great help to industries such as film dubbing and game animation, where it can effectively improve user experience. Face animation generation is a multidisciplinary research field, and the development of this technology can bring great convenience to our lives and promote the development of society.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a face animation generation method based on a self-supervision and mixed density network.
The invention solves the technical problems through the following technical solution:
A face animation generation method based on a self-supervision and mixed density network, characterized by comprising the following steps:
1) Inputting a group of face images and a section of voice fragments, performing self-supervised contrastive learning, and training a voice feature extraction module;
2) Applying the voice feature extraction module learned in self-supervised contrastive learning to the downstream task of face key point regression, and fine-tuning it on the downstream task;
3) Inputting the target face key points output by the face key point regression network, together with the reference face image, into an image-to-image conversion network to obtain the final target face image.
Moreover, the specific operation of step 1) is as follows:
(1) Given a group of input face images and a section of voice fragments, input them respectively into a picture feature extractor and a voice feature extractor for feature extraction;
(2) First extract image features from the input face images using a 2D-CNN, then extract temporal information between adjacent image frames using a 3D-CNN, and finally extract image content features with an image content encoder and image identity features with an image identity encoder;
(3) First perform a preliminary extraction of the voice features using a convolutional neural network to obtain a high-level feature representation, and then further learn the voice temporal information using a bidirectional GRU;
(4) Input the voice features extracted by the GRU into a memory module; the memory module stores a plurality of different hypotheses, and the uncertainty of mapping voice to lip motions and head motions is shifted onto the memory module, so that the voice feature extraction module can concentrate on voice feature extraction;
(5) Contrast the extracted voice content features with the face image content features, and contrast the extracted voice identity features with the face image identity features.
Moreover, the specific operation of step 2) is as follows:
1) Input the voice fragments into the voice feature extractor to obtain voice content feature vectors and voice identity feature vectors;
2) Input the reference face key points into the multi-layer perceptron to obtain face key point feature vectors, and input the voice content feature vectors, the voice identity feature vectors and the face key point feature vectors into the mixed density network to obtain the target face key points.
The invention has the advantages and beneficial effects that:
1. In the face animation generation method based on the self-supervision and mixed density network, the target face key points are generated from the voice fragment and the key points of the reference face image and serve as an intermediate representation for face animation generation, and the final target face image is then generated from the target face key points and the reference face image. Using face key points as an intermediate representation between speech and face images has several advantages. First, generating face key points avoids the appearance characteristics of low-level pixels, so head motions are easier to capture; at the same time, compared with millions of pixels, using 68 face key points makes the model more compact with fewer parameters, so the model can be trained with a small data set. Second, key points can easily drive different types of animation content, including face images and cartoon animations; in contrast, pixel-based face animation generation methods are limited to real faces and cannot easily be generalized to cartoon animation generation.
2. The face animation generation method based on the self-supervision and mixed density network exploits the fact that self-supervision requires no data labels: the feature extraction network is fully trained on a large amount of unlabeled data, and the voice features are separated into a content-related feature vector and an identity-related feature vector, so that the voice content feature vector focuses on lip motions and the voice identity feature vector focuses on head motions.
3. In the face animation generation method based on the self-supervision and mixed density network, a memory module is introduced into self-supervised contrastive learning to store a plurality of different hypotheses, and the uncertainty that arises when voice is mapped to lip motions and head motions is assigned to the memory module, so that the feature extractor can concentrate on feature extraction.
4. The face animation generation method based on the self-supervision and mixed density network uses the mixed density network to generate a plurality of different hypotheses for a speaker, further improving the naturalness of the generated head motions.
Drawings
FIG. 1 is a schematic diagram of a self-supervised contrast learning network architecture of the present invention;
FIG. 2 is a schematic diagram of a mixed density network-based facial animation regression structure;
FIG. 3 is a schematic diagram of an image-to-image conversion network architecture of the present invention;
FIG. 4 is a schematic diagram of the experimental results of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention relates to a face animation generation network structure based on a self-supervision and mixed density network, in which a voice feature extraction network based on memory-augmented self-supervised contrastive learning is used to obtain high-quality voice features. Exploiting the fact that self-supervision requires no data labels, the feature extraction network is fully trained on a large amount of unlabeled data, and the voice features are separated into a content-related feature vector and an identity-related feature vector. A memory module is introduced to store a plurality of different hypotheses, and the uncertainty that arises when voice is mapped to lip motions and head motions is assigned to the memory module, so that the feature extractor can concentrate on feature extraction; using a mixed density network to generate a plurality of different hypotheses for a speaker further improves the naturalness of the generated head motions. In contrast to most existing works, which generate a single set of face key points by minimizing the negative log-likelihood of a single Gaussian, the invention estimates multiple 2D face key point hypotheses by minimizing the negative log-likelihood of a Gaussian mixture.
As shown in fig. 1, a feature extraction model based on memory-augmented self-supervised contrastive learning is provided, which mainly comprises three modules:
(1) The voice feature extraction module, based on an AudioEncoder and a bidirectional GRU: a convolutional neural network first performs a preliminary extraction of the voice features to obtain a high-level feature representation, and the bidirectional GRU then further learns the voice temporal information, which is expressed as h_t;
(2) The image feature extraction module, based on ResNet-18: a 2D-CNN first extracts image features, a 3D-CNN then extracts temporal information between adjacent image frames, and finally an image content encoder U_c(·) extracts the image content features and an image identity encoder U_s(·) extracts the image identity features;
(3) The memory module. Voice does not map one-to-one to lip motions and head motions: the same voice segment can correspond to several different lip motions and head motions. The memory module stores a plurality of different hypotheses, and the uncertainty of these mappings is shifted onto the memory module so that the feature extraction module can concentrate on feature extraction.
The voice content feature vector and the voice identity feature vector are each obtained as a weighted combination of the memory slots, where p(i, t) is the contribution of the i-th memory slot to time step t and M is the memory module. The weight distribution function maps the context representation h_t to p(i, t), h_t being the voice feature extracted by the GRU at time t; the weight distribution function is a learnable multi-layer perceptron with a softmax applied over the slot dimension k.
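One plausible realization of the voice feature extraction module and the memory read described above is sketched below in PyTorch. The layer sizes, the number of memory slots and the use of two separate memory banks (one for the content branch, one for the identity branch) are assumptions made for illustration, not values specified by the invention.

```python
# Hedged sketch of the AudioEncoder + bidirectional GRU + memory module read.
# All hyperparameters (channel sizes, hidden size, number of slots) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Preliminary CNN extraction followed by a bidirectional GRU over time."""
    def __init__(self, n_fbank=80, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_fbank, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(256, hidden, batch_first=True, bidirectional=True)

    def forward(self, fbank):                # fbank: (B, T, n_fbank)
        x = self.cnn(fbank.transpose(1, 2))  # (B, 256, T)
        h, _ = self.gru(x.transpose(1, 2))   # h_t: (B, T, 2 * hidden)
        return h

class MemoryRead(nn.Module):
    """Softmax-weighted read over K memory slots: the weight MLP maps h_t to p(i, t)."""
    def __init__(self, feat_dim=512, n_slots=64, slot_dim=512):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_slots, slot_dim))        # memory M
        self.weight_mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                        nn.Linear(256, n_slots))

    def forward(self, h):                                   # h: (B, T, feat_dim)
        p = F.softmax(self.weight_mlp(h), dim=-1)           # p(i, t): (B, T, n_slots)
        return p @ self.memory                              # weighted combination of slots

# Usage sketch: one memory read each for the content branch and the identity branch.
encoder = AudioEncoder()
content_mem, identity_mem = MemoryRead(), MemoryRead()
h = encoder(torch.randn(2, 25, 80))           # 2 clips, 25 frames of fbank features
a_content, a_identity = content_mem(h), identity_mem(h)
print(a_content.shape, a_identity.shape)      # (2, 25, 512) each
```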
(4) The contrastive loss function. For any audio fragment, the image sequence corresponding to that audio fragment is the positive sample, and the remaining image sequences are negative samples. The contrastive loss function uses cosine similarity to compute the similarity between any two feature representations. Because the cosine distance is bounded in [-1, 1], combining it directly with the softmax loss function yields only small differences between the logits and hence a small cross entropy, so the combination of cosine distance and softmax cannot be learned efficiently. For this reason, learnable parameters w and b are introduced and trained jointly with the network.
The content contrastive loss is computed between the voice content features and the face image content features, and the identity contrastive loss is computed in the same way between the voice identity features and the face image identity features.
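The exact form of the two contrastive losses is not reproduced above; the following sketch shows one common InfoNCE-style formulation that is consistent with the description (cosine similarity rescaled by the learnable parameters w and b before a softmax cross entropy). It is an assumed formulation for illustration, not necessarily the invention's exact loss.

```python
# Hedged sketch: InfoNCE-style contrastive loss over cosine similarities,
# rescaled by learnable parameters w and b as described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineContrastiveLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))  # learnable scale
        self.b = nn.Parameter(torch.tensor(-5.0))  # learnable bias

    def forward(self, audio_feat, image_feat):
        # audio_feat, image_feat: (N, D); row i of each is a matched (positive) pair.
        a = F.normalize(audio_feat, dim=-1)
        v = F.normalize(image_feat, dim=-1)
        logits = self.w * (a @ v.t()) + self.b          # cosine similarity, rescaled
        targets = torch.arange(a.size(0), device=a.device)
        # Positive pairs lie on the diagonal; all other pairs are negatives.
        return F.cross_entropy(logits, targets)

# Usage: one instance for the content branch, one for the identity branch.
loss_fn = ScaledCosineContrastiveLoss()
loss = loss_fn(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```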
For the face key point regression network based on the mixed density network, as shown in fig. 2, the specific steps are as follows:
The inputs are the target face key points p_i of the i-th frame together with the two preceding and the two following frames, the voice segment A_i of the current frame (containing 5 frames of audio), and the reference face key points p_r. The aim is to learn a function F: {p_i, A_i} → Θ that maps the input {p_i, A_i} to the output parameters of the mixed density network, Θ = {μ, σ, α}, where μ, σ and α are respectively the means, variances and mixing coefficients of the mixed density network, and M is the number of Gaussian kernels. The mean of each Gaussian kernel represents a set of aligned 2D face key points, a rotation and an offset, and the number of Gaussian kernels M determines the number of hypotheses generated by the model.
The i-th frame voice content feature vector, the voice identity feature vector and the reference face key point feature vector p_r together form the conditioning input c_i, and the probability density of the target value can be expressed as a linear combination of Gaussian kernel functions:
p(w_i | c_i) = Σ_{m=1}^{M} α_m(c_i) · ψ_m(w_i | c_i)
where M is the number of Gaussian kernels, i.e. the number of components that make up the mixture model; α_m(c_i) is the mixing coefficient, namely the probability weight of the m-th component when w_i is generated from the input voice content feature vector, voice identity feature vector and reference face key point feature vector; w_i is the generated set of aligned 2D face key points, rotation and offset; and ψ_m is the probability density function of component m, used to compute the density of w_i given the input c_i. The invention uses Gaussian kernels as the probability density functions.
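A minimal sketch of such a mixed density output head and its negative log-likelihood loss is given below; the dimensionality of the conditioning vector, the number of Gaussian kernels and the use of a single isotropic variance per kernel are illustrative assumptions.

```python
# Hedged sketch: a mixed density network (MDN) head that predicts the mixture
# parameters Θ = {μ, σ, α} and the corresponding negative log-likelihood loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    def __init__(self, in_dim=512, out_dim=68 * 2, n_kernels=5):
        super().__init__()
        self.out_dim, self.n_kernels = out_dim, n_kernels
        self.alpha = nn.Linear(in_dim, n_kernels)            # mixing coefficients
        self.mu = nn.Linear(in_dim, n_kernels * out_dim)     # means
        self.log_sigma = nn.Linear(in_dim, n_kernels)        # one variance per kernel

    def forward(self, c):                                    # c: (B, in_dim)
        alpha = F.softmax(self.alpha(c), dim=-1)             # (B, M)
        mu = self.mu(c).view(-1, self.n_kernels, self.out_dim)
        sigma = torch.exp(self.log_sigma(c))                 # (B, M), positive
        return alpha, mu, sigma

def mdn_nll(alpha, mu, sigma, target):
    """Negative log-likelihood of the target under the Gaussian mixture."""
    diff = target.unsqueeze(1) - mu                          # (B, M, D)
    d = mu.size(-1)
    log_gauss = (-0.5 * (diff ** 2).sum(-1) / sigma ** 2
                 - d * torch.log(sigma)
                 - 0.5 * d * torch.log(torch.tensor(2 * torch.pi)))
    return -torch.logsumexp(torch.log(alpha + 1e-8) + log_gauss, dim=-1).mean()

# Usage: condition vector c_i -> multiple keypoint hypotheses; select or sample by alpha.
head = MDNHead()
alpha, mu, sigma = head(torch.randn(4, 512))
loss = mdn_nll(alpha, mu, sigma, torch.randn(4, 136))
print(loss.item())
```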
The invention trains a picture-to-picture conversion module that takes the target face key points and a reference face image as input and generates the final target face image. The picture-to-picture conversion module has an encoder-decoder structure: the target face key points are drawn as an RGB picture O_trg of size 256 × 256 × 3 and concatenated with the source face image H_src along the channel dimension to obtain an input of size 256 × 256 × 6. The input is passed through the encoder to obtain an intermediate feature representation, which is then fed to the decoder to reconstruct the target face image H_trg. The decoder is a CNN that uses deconvolution to recover the target face image from the intermediate feature representation. The encoder and decoder form a U-Net structure with skip connections so as to better preserve the identity information of the target speaker; the model structure is shown in fig. 3.
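The channel-wise concatenation and the skip-connected encoder-decoder can be sketched as follows; the depth and channel counts of this tiny U-Net are assumptions and are far smaller than a practical model.

```python
# Hedged sketch: concatenate the rendered keypoint image with the reference face
# image along the channel dimension and pass it through a tiny U-Net with skips.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def deconv_block(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1, self.e2, self.e3 = conv_block(6, 64), conv_block(64, 128), conv_block(128, 256)
        self.d3 = deconv_block(256, 128)
        self.d2 = deconv_block(128 + 128, 64)
        self.d1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, keypoint_img, ref_img):
        x = torch.cat([keypoint_img, ref_img], dim=1)     # (B, 6, 256, 256)
        e1 = self.e1(x)                                   # (B, 64, 128, 128)
        e2 = self.e2(e1)                                  # (B, 128, 64, 64)
        e3 = self.e3(e2)                                  # (B, 256, 32, 32)
        d3 = self.d3(e3)                                  # (B, 128, 64, 64)
        d2 = self.d2(torch.cat([d3, e2], dim=1))          # skip connection from e2
        return self.d1(torch.cat([d2, e1], dim=1))        # skip connection from e1

net = TinyUNet()
out = net(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(out.shape)  # (1, 3, 256, 256)
```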
Specifically, the VoxCeleb data set is used for training and testing in this example. After dividing the data into a training set and a test set, the key points in each face image are first extracted with a face key point extractor; the face key points and the voice fragments are then input into the face key point regression network to obtain the target face key points, and the target face key points and the reference face image are input into the image-to-image conversion network to obtain the final face image.
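Putting the pieces together, the inference pipeline of this embodiment can be sketched as below. All callable names (extract_keypoints, regressor, translator) are hypothetical placeholders standing in for the face key point extractor, the mixed-density regression network and the image-to-image conversion network described above.

```python
# Hedged end-to-end inference sketch for this embodiment. The callables passed in
# are hypothetical placeholders for the components described above.
import torch

def generate_talking_face(reference_image, fbank_frames,
                          extract_keypoints, regressor, translator):
    """reference_image: (3, 256, 256); fbank_frames: (T, 80) voice features."""
    ref_keypoints = extract_keypoints(reference_image)         # (68, 2)
    frames = []
    for t in range(fbank_frames.size(0)):
        window = fbank_frames[max(0, t - 2): t + 3]             # 5-frame audio context
        target_kp = regressor(window, ref_keypoints)            # predicted 68 key points
        frames.append(translator(target_kp, reference_image))   # (3, 256, 256)
    return torch.stack(frames)                                  # (T, 3, 256, 256)
```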
In practical applications, these data samples can be replaced with the user's own data samples, as long as the framework structure is kept the same. In addition, this embodiment only requires PyTorch (a Python machine learning framework), which makes practical application more convenient.
To verify the feasibility of the solution, experiments were performed on the data set, finally yielding the results in Table 1 and fig. 4.
In the experiments, two evaluation metrics are defined to evaluate the performance of the model proposed in this embodiment: the lip key point distance (Landmark Distance, LMD) and the rotation distance (Rotation Distance, RD). The results in Table 1 show that the method proposed by the invention achieves performance superior to the methods described above.
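The exact definitions of LMD and RD are not given above; the sketch below uses commonly assumed forms (mean Euclidean distance over the lip landmarks, and mean absolute difference of the head rotation angles), which are assumptions for illustration only.

```python
# Hedged sketch of the two metrics, under assumed definitions:
# LMD - mean Euclidean distance between predicted and ground-truth lip landmarks
# RD  - mean absolute difference between predicted and ground-truth rotation angles
import torch

def lip_landmark_distance(pred_kp, gt_kp, lip_indices=range(48, 68)):
    """pred_kp, gt_kp: (T, 68, 2) landmark sequences; indices 48-67 assumed to be the lips."""
    idx = torch.tensor(list(lip_indices))
    diff = pred_kp[:, idx] - gt_kp[:, idx]
    return diff.norm(dim=-1).mean().item()

def rotation_distance(pred_rot, gt_rot):
    """pred_rot, gt_rot: (T, 3) head rotation angles per frame."""
    return (pred_rot - gt_rot).abs().mean().item()

print(lip_landmark_distance(torch.randn(10, 68, 2), torch.randn(10, 68, 2)))
print(rotation_distance(torch.randn(10, 3), torch.randn(10, 3)))
```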
TABLE 1
Although the embodiments of the invention and the accompanying drawings have been disclosed for illustrative purposes, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims; therefore, the scope of the invention is not limited to the disclosure of the embodiments and the drawings.

Claims (2)

1. A face animation generation method based on a self-supervision and mixed density network, characterized by comprising the following steps:
1) Inputting a group of face images and a section of voice fragments, performing self-supervised contrastive learning, and training a voice feature extraction module;
2) Applying the voice feature extraction module learned in self-supervised contrastive learning to the downstream task of face key point regression, and fine-tuning it on the downstream task;
3) Inputting the target face key points output by the face key point regression network, together with the reference face image, into an image-to-image conversion network to obtain the final target face image;
the specific operation of step 1) is as follows:
(1) Given a group of input face images and a section of voice fragments, input them respectively into a picture feature extractor and a voice feature extractor for feature extraction;
(2) First extract image features from the input face images using a 2D-CNN, then extract temporal information between adjacent image frames using a 3D-CNN, and finally extract image content features with an image content encoder and image identity features with an image identity encoder;
(3) First perform a preliminary extraction of the voice features using a convolutional neural network to obtain a high-level feature representation, and then further learn the voice temporal information using a bidirectional GRU;
(4) Input the voice features extracted by the GRU into a memory module; the memory module stores a plurality of different hypotheses, and the uncertainty of mapping voice to lip motions and head motions is shifted onto the memory module, so that the voice feature extraction module can concentrate on voice feature extraction;
(5) Contrast the extracted voice content features with the face image content features, and contrast the extracted voice identity features with the face image identity features.
2. The face animation generation method based on a self-supervision and mixed density network according to claim 1, wherein the specific operation of step 2) is as follows:
1) Input the voice fragments into the voice feature extractor to obtain voice content feature vectors and voice identity feature vectors;
2) Input the reference face key points into the multi-layer perceptron to obtain face key point feature vectors, and input the voice content feature vectors, the voice identity feature vectors and the face key point feature vectors into the mixed density network to obtain the target face key points.
CN202111424899.6A 2021-11-26 Face animation generation method based on self-supervision and mixed density network Active CN114155321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111424899.6A CN114155321B (en) 2021-11-26 Face animation generation method based on self-supervision and mixed density network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111424899.6A CN114155321B (en) 2021-11-26 Face animation generation method based on self-supervision and mixed density network

Publications (2)

Publication Number Publication Date
CN114155321A (en) 2022-03-08
CN114155321B (en) 2024-06-07

Family


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160098581A (en) * 2015-02-09 2016-08-19 홍익대학교 산학협력단 Method for certification using face recognition and speaker verification
CN112001992A (en) * 2020-07-02 2020-11-27 超维视界(北京)传媒科技有限公司 Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
CN113450436A (en) * 2021-06-28 2021-09-28 武汉理工大学 Face animation generation method and system based on multi-mode correlation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160098581A (en) * 2015-02-09 2016-08-19 홍익대학교 산학협력단 Method for certification using face recognition and speaker verification
CN112001992A (en) * 2020-07-02 2020-11-27 超维视界(北京)传媒科技有限公司 Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
CN113450436A (en) * 2021-06-28 2021-09-28 武汉理工大学 Face animation generation method and system based on multi-mode correlation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Facial expression generation based on NURBS deformation and single-view images; Sun Si; Ge Weimin; Feng Zhiyong; Xu Chao; Peng Weilong; Computer Engineering; 2017-11-15 (No. 11); full text *
Speech-driven talking face video generation based on key point representation; Nian Fudong; Wang Wentao; Wang Yan; Zhang Jingjing; Hu Guiheng; Li Teng; Pattern Recognition and Artificial Intelligence; 2021-06-15; Vol. 34 (No. 6); full text *

Similar Documents

Publication Publication Date Title
Wang et al. One-shot talking face generation from single-speaker audio-visual correlation learning
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
Vougioukas et al. Video-driven speech reconstruction using generative adversarial networks
CN113554737A (en) Target object motion driving method, device, equipment and storage medium
JP2014519082A (en) Video generation based on text
WO2022106654A2 (en) Methods and systems for video translation
Fu et al. Audio/visual mapping with cross-modal hidden Markov models
CN111666831A (en) Decoupling representation learning-based speaking face video generation method
CN113077537A (en) Video generation method, storage medium and equipment
Ma et al. Unpaired image-to-speech synthesis with multimodal information bottleneck
CN115761075A (en) Face image generation method, device, equipment, medium and product
Hassid et al. More than words: In-the-wild visually-driven prosody for text-to-speech
Liz-Lopez et al. Generation and detection of manipulated multimodal audiovisual content: Advances, trends and open challenges
CN116977903A (en) AIGC method for intelligently generating short video through text
CN116705038A (en) 3D virtual speaker driving method based on voice analysis and related device
CN114155321B (en) Face animation generation method based on self-supervision and mixed density network
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
Sun et al. Pre-avatar: An automatic presentation generation framework leveraging talking avatar
Chen et al. VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer
Barve et al. Multi-language audio-visual content generation based on generative adversarial networks
CN114155321A (en) Face animation generation method based on self-supervision and mixed density network
Sadiq et al. Emotion dependent domain adaptation for speech driven affective facial feature synthesis
Zainkó et al. Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
Kolivand et al. Realistic lip syncing for virtual character using common viseme set
Preethi Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant