CN115908659A - Method and device for synthesizing a speaking face based on a generative adversarial network - Google Patents

Method and device for synthesizing a speaking face based on a generative adversarial network

Info

Publication number
CN115908659A
CN115908659A (application CN202211493192.5A)
Authority
CN
China
Prior art keywords
face
dimensional
audio
model
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211493192.5A
Other languages
Chinese (zh)
Inventor
杨新宇
宋怡馨
胡冠宇
张硕
魏洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202211493192.5A
Publication of CN115908659A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method and a device for synthesizing a speaking face based on a generative adversarial network. The method and device attend to implicit information beyond the mouth shape, such as head movement, expression and blinking, and can synthesize a high-quality, natural video of the target person speaking from a segment of speech and a reference video of the target face, addressing problems such as the anchor being unavailable to record and the synthesized video appearing distorted and rigid. First, the speech and the reference video are acquired, and the audio is pre-processed and its features are extracted to obtain audio features; meanwhile, the video is fed into a three-dimensional face reconstruction model to obtain three-dimensional face data. The audio features are then fed into a face depth generation model to predict face parameters, and the corresponding three-dimensional face image is obtained with the face three-dimensional rendering model. Next, an eye mask is extracted according to the blink parameters and the eyes are encoded separately. The encoding and the facial image are fed into a deep face rendering model to synthesize the face images frame by frame. Finally, all images are spliced together and combined with the audio to synthesize the speaking face video.

Description

Method and device for synthesizing a speaking face based on a generative adversarial network
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for synthesizing a speaking face based on a generative adversarial network.
Background
Information transmission is particularly important to the development of society in the information age, and the media industry is a main channel for transmitting information. Both news broadcasting and live network streaming require an anchor to record on site in a studio, and such recording is often difficult to arrange quickly when temporary demands arise. A virtual anchor formed by generating the speaking face video directly from audio information and a face video solves these problems well: it is not limited in time or space, can improve broadcasting accuracy on the basis of intelligent speech and avoid the errors that manual broadcasting may introduce, and can greatly reduce labour cost, cutting costs and improving efficiency for the industry.
As the role of the virtual anchor in the media industry grows, research on the synthesis of speaking face video has received wide attention from academia. In current research on speaking face synthesis, roughly 80% of the literature focuses on synchronizing the mouth with the speech, while the rest focuses on the naturalness of the generated result. The final measure of effectiveness in video generation research, however, is not accuracy alone but the impression the video leaves on its final audience.
Disclosure of Invention
The invention aims to provide a method and a device for synthesizing a speaking face based on a generative adversarial network. Given a segment of audio and a short video containing the target face, the method generates a speaking face video of the target person for the audio content. The logic is simple and the effect natural: the mouth shape of the generated result corresponds accurately to the audio, and the speaker shows natural facial expressions and head poses while retaining a certain amount of blinking.
The invention is realized by adopting the following technical scheme:
a method for synthesizing a speaking face based on a generation confrontation network comprises the following steps:
acquiring a voice audio clip and a reference video which is recorded by a real person and contains a target face;
carrying out noise reduction, resampling and feature extraction on an input voice audio to obtain audio features;
inputting a reference video of a target face into a pre-trained three-dimensional face reconstruction model to obtain face three-dimensional data;
inputting the audio features into a face depth generation model to obtain predicted head posture, expression, mouth shape and blink parameters;
rendering a three-dimensional face image corresponding to each frame of voice by utilizing a pre-trained face three-dimensional rendering model according to the predicted parameters;
extracting eye masks according to rendered human faces and blink parameters, and independently coding eyes;
inputting the eye mask code and the three-dimensional face image into a deep face rendering model together to synthesize each frame of speaking face image;
finally, all the speaking face images are spliced together to output a final speaking face video.
A further improvement of the invention is that obtaining the voice audio segment and the reference video of the target face specifically comprises:
audio acquisition, namely acquiring and screening standard broadcast audio clips by a network or recording the audio clips in a recording room through acquisition equipment;
and video acquisition, wherein a single human face video clip is acquired through a network or the video clip comprising the front face and the side face is recorded through video acquisition equipment.
A further improvement of the invention is that the feature extraction performed on the input voice audio comprises the following steps:
audio resampling, namely linearly resampling audio to align the audio with a video frame rate, encoding the resampled audio signal to form an array, and performing fast Fourier transform on the array;
the transformed code is convolved in a two-dimensional space consisting of a time domain and a frequency domain, and corresponding audio features are formed for each second of audio.
A further improvement of the invention is that inputting the reference video of the target face into the three-dimensional face reconstruction model to obtain the three-dimensional face data comprises the following steps:
processing an input reference video into an image file according to frames;
performing face recognition on each reference image by using a convolutional neural network, and cutting the face image according to a recognition result;
performing face landmark extraction on the image, covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, and finally obtaining the specific coordinates of 68 key points;
adopting a pre-trained face reconstruction method based on CNN to establish a face 3DMM model, wherein the process is as follows:
fitting the face image to a 3DMM model to obtain the target face three-dimensional parameters M = (F_i, F_t, F_e, F_l), where F_i, F_t, F_e and F_l respectively denote the identity, texture, facial expression and illumination feature vectors of the three-dimensional face coordinates;
and in the model training process, the obtained three-dimensional face model is projected to two dimensions, 68 key points are extracted again, the key points extracted from the original video image are used as supervision, and the landmark loss is formed so as to further train the three-dimensional face reconstruction model.
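For illustration only, a minimal PyTorch sketch of such a landmark loss is given below, assuming a simple mean squared error between the 68 reprojected key points and the key points detected on the original frame; the patent does not fix the exact loss form, and the optional per-point weights are an added assumption.

```python
# Hedged sketch of a landmark loss: projected 2D keypoints of the reconstructed
# 3D face are supervised by the keypoints detected on the original video frame.
import torch

def landmark_loss(projected_kpts, detected_kpts, weights=None):
    """projected_kpts, detected_kpts: (B, 68, 2) tensors in image coordinates."""
    diff = (projected_kpts - detected_kpts) ** 2      # squared 2D error per point
    per_point = diff.sum(dim=-1)                      # (B, 68)
    if weights is not None:                           # e.g. up-weight mouth points
        per_point = per_point * weights
    return per_point.mean()
```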
A further improvement of the invention is that inputting the audio features into the pre-trained face depth generation model to obtain the predicted head pose, mouth shape and blink parameters comprises the following steps:
inputting processed audio features and human face three-dimensional parameters into a model, wherein the audio features are sent into the model by a sliding window with the size of T frames each time, a mouth shape feature generator generates corresponding mouth shape feature parameters for each frame in the window, and an implicit feature generator learns an implicit feature array for each window by combining time sequence features;
respectively splicing the implicit characteristic arrays with the mouth shape characteristics of each frame, and sending the mouth shape characteristics into a full connection layer to obtain human face three-dimensional parameters predicted according to the audio characteristics;
a pre-trained face depth generation model comprising:
the face depth generation model consists of a mouth shape feature generator, an implicit feature generator and a discriminator, wherein the implicit feature comprises facial expression, head posture and blink;
and in the model training process, the three-dimensional parameters of the real human face and the predicted three-dimensional parameters are sent to a discriminator to be discriminated, and discrimination loss is determined.
A further improvement of the invention is that rendering the three-dimensional face image corresponding to each frame of speech with the face three-dimensional rendering model according to the obtained prediction parameters comprises the following steps:
adjusting the parameters of the face three-dimensional model according to the predicted face three-dimensional parameters to obtain recombined face three-dimensional parameters;
and rendering the recombined three-dimensional face parameters with the open-source PyTorch3D to obtain the rendered face image.
A further improvement of the invention is that extracting an eye mask from the rendered face and the blink parameters and encoding the eyes separately comprises the following steps:
determining the position of eyes according to the extracted key points of the face;
combining the predicted human face and facial features to obtain an eye blink AU value;
AU value normalization is applied to the pixel values in the eye attention map.
A further improvement of the invention is that inputting the eye mask encoding and the three-dimensional face image together into the deep face rendering model to synthesize each frame of the speaking face image comprises the following steps:
combining the eye coding with the rendered face image to form input information with four channels, wherein three channels are face images, and one channel is an eye mask;
placing the current frame at the center of a window of size 2N, inputting it into the deep face rendering model, which outputs temporally consistent image frames of the target person;
a deep face rendering model comprising:
the model consists of a generator G and a discriminator D, whose purpose is to ensure temporal consistency while preserving the character identity;
in the training process, the generated image and the real image are sent to a discriminator to be discriminated, so that the countermeasure loss is formed.
A further improvement of the invention is that all the speaking face images are spliced together to output the final speaking face video: all synthesized images are spliced in sequence with ffmpeg to form the final speaking face video result.
A speaking face synthesis apparatus based on a generative adversarial network, comprising:
the data acquisition module is used for acquiring voice audio content of a speaking face and a reference video containing a face picture of a target person;
the audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
the human face three-dimensional reconstruction module is used for reconstructing the target character video by adopting a three-dimensional human face reconstruction model to obtain target human face three-dimensional parameters;
the three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
the face three-dimensional rendering module is used for obtaining a rendered face image according to the predicted face parameters and the face three-dimensional parameters by adopting a face three-dimensional rendering model;
the eye coding module modifies the eye pixel value according to the predicted face parameters to obtain eye codes and realize blinking;
the human face image depth generation module is used for obtaining a human face image by adopting a depth human face rendering model and combining eye coding to synthesize each frame of speaking human face image;
and the video synthesis module is used for splicing all the images by adopting ffmpeg and synthesizing the final talking face video result together with the audio.
The invention has at least the following beneficial technical effects:
the method for synthesizing the speaking face based on the generation countermeasure network, provided by the invention, carries out three-dimensional modeling on the face through a three-dimensional face reconstruction model, so that the generated result is superior to the result of two-dimensional image modeling in visual effect; meanwhile, the method realizes the learning and generation of implicit characteristics such as head posture, expression and the like through a face depth generation model, so that the true degree of a generated result is greatly improved; the method extracts the eye mask from the rendered face and the blink parameters, and codes the eyes independently to realize the natural blink of the speaker, thereby further improving the continuity and the natural degree of the synthetic effect.
The invention provides a device for synthesizing a speaking face based on a generative adversarial network, which comprises eight modules: a data acquisition module, an audio processing module, a face three-dimensional reconstruction module, a three-dimensional face parameter prediction module, a face three-dimensional rendering module, an eye coding module, a face image depth generation module and a video synthesis module. Together they cover all tasks from data acquisition to result synthesis and form a closed-loop system.
Drawings
FIG. 1 is a flow chart of speaking face synthesis based on a generative adversarial network according to the present invention;
FIG. 2 is a schematic flow chart of a three-dimensional face reconstruction module according to the present invention;
FIG. 3 is a schematic diagram of two-dimensional and three-dimensional human face key points in the present invention;
FIG. 4 is a schematic flow chart of a three-dimensional face parameter prediction module according to the present invention;
FIG. 5 is a schematic flow chart of a three-dimensional rendering module for human faces;
FIG. 6 is a schematic flow diagram of an eye encoding module;
FIG. 7 is a schematic flow diagram of a face image depth generation module;
FIG. 8 is a functional block diagram of the speaking face synthesis apparatus based on a generative adversarial network according to the present invention;
FIG. 9 is a schematic structural diagram of an electronic device for implementing the method for synthesizing a speaking face based on a generative adversarial network provided by the invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings, without in any way limiting the scope of the invention.
Referring to FIG. 1, the invention provides a method for synthesizing a speaking face based on a generative adversarial network. First, the audio to be processed and a reference video of the target face are obtained; then features are extracted from the input audio while a three-dimensional face reconstruction is performed on the target face video. Next, a three-dimensional face parameter prediction model is constructed to predict face parameters such as mouth shape, facial expression, head pose and blinking, and the face three-dimensional rendering network renders the facial image corresponding to each frame of speech from these parameters. To ensure natural blinking in the generated result, the eye region of the rendered face is extracted and encoded separately so that its pixel values can be changed. Finally, the deep face rendering network combines the eye encoding with the face rendering result to synthesize the speaking face video. The method specifically comprises the following steps:
1. capturing audio and video
Audio acquisition, namely acquiring standard broadcast audio clips by a network and recording the audio clips in a recording studio by a voice acquisition device;
video acquisition, wherein a single human face video clip is acquired through a network, and the video clip comprising a front face and a side face is recorded through video acquisition equipment;
2. extracting audio features
Using the open-source librosa library, the audio is linearly resampled to align with the 30 fps video, the resampled signal is encoded into an array, and a Fast Fourier Transform (FFT) is applied to it. The transformed audio encoding is then convolved over a two-dimensional space consisting of the time and frequency domains, with ReLU as the activation function, finally forming a 30 × 29 feature array as the audio features for each second of audio.
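A minimal sketch of the framing and FFT stage of this pre-processing is given below, assuming librosa for loading/resampling and simple spectral pooling down to 29 bins per video frame; the 16 kHz sampling rate and the pooling scheme are assumptions, and the learned time-frequency convolution with ReLU is omitted, so this is not the patented feature extractor.

```python
# Hedged sketch: slice the resampled waveform into 30 windows per second
# (one per video frame), FFT each window, and pool to 29 bins per frame.
import numpy as np
import librosa

def audio_features_per_second(path, fps=30, n_bins=29, sr=16000):
    """Return an array of shape (seconds, fps, n_bins) of per-frame spectra."""
    wav, sr = librosa.load(path, sr=sr)              # load and resample the audio
    hop = sr // fps                                  # samples per video frame
    n_frames = len(wav) // hop
    feats = []
    for i in range(n_frames):
        window = wav[i * hop:(i + 1) * hop]
        spec = np.abs(np.fft.rfft(window))           # magnitude spectrum (FFT)
        trimmed = spec[:(len(spec) // n_bins) * n_bins]
        pooled = trimmed.reshape(n_bins, -1).mean(axis=1)   # pool to n_bins values
        feats.append(pooled)
    n_sec = n_frames // fps
    feats = np.stack(feats[:n_sec * fps])            # drop any trailing partial second
    return feats.reshape(n_sec, fps, n_bins)         # e.g. (seconds, 30, 29)
```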
3. Three-dimensional reconstruction of target face
As shown in fig. 2, the process of establishing a three-dimensional face model for a reference video includes face detection, face key point extraction, and three-dimensional face deformation statistical model construction, and specifically includes the following steps:
S3-1, detecting the face position: a Haar detector in OpenCV is used to detect the face region in each video frame, giving a bounding box (x, y, w, h), and the image is cropped accordingly;
S3-2, extracting the face key points: Dlib detects the facial feature points within the face bounding box, mainly covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, and finally obtains the specific coordinates (x_i, y_i), i = 0, 1, ..., 67, of each point; the feature point extraction result is shown in FIG. 3-a). A sketch of this detection and landmark extraction is given after step S3-4;
S3-3, three-dimensional face reconstruction model: a face 3DMM model M = (F_i, F_t, F_e) is established with a deep network over the multi-frame video, where F_i, F_t and F_e are the identity, texture and expression feature vectors of the three-dimensional face; unlike a conventional model, the expression features are added to further improve the naturalness of the character's expression;
S3-4, in the training process the constructed three-dimensional model is projected onto a two-dimensional plane, and the face key points are extracted again from the projection result using the method of step S3-2; this feature point extraction result is shown in FIG. 3-b). During training of the three-dimensional face model, the result of step S3-2 is taken as the ground truth to form the landmark loss, which further trains the three-dimensional face model.
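As referenced in step S3-2, a sketch of the face detection and 68-point landmark extraction is given below, assuming OpenCV's bundled Haar cascade and the publicly distributed Dlib 68-landmark predictor file; the file paths and the single-face assumption are illustrative rather than part of the patent.

```python
# Hedged sketch of steps S3-1/S3-2: Haar cascade face detection followed by
# Dlib 68-point landmark extraction on the detected face box.
import cv2
import dlib
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, None
    x, y, w, h = faces[0]                        # (x, y, w, h) of the face box
    crop = frame_bgr[y:y + h, x:x + w]           # cropped face image
    box = dlib.rectangle(int(x), int(y), int(x + w), int(y + h))
    shape = predictor(gray, box)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    return crop, pts                             # cropped face, 68 points (x_i, y_i)
```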
4. Constructing a face generation model
Referring to FIG. 4, the face generation model feeds the audio features into the pre-trained face depth generation model to obtain the predicted face parameters of head pose, mouth shape, blink, etc. It consists of a lip-sync generator G_lip, a temporal generator G_tem and a discriminator D.
S4-1, the basic audio features α extracted in step 2 are fed into the model as sliding windows α_t, α_{t+1}, …, α_{t+T} of T frames;
S4-2, G_lip learns the explicit features in the audio, i.e. the mouth-shape feature vectors corresponding to the audio: for each frame it takes the audio feature α_t as input and outputs the lip-shape feature code l_t of that frame;
S4-3, G_tem learns and outputs the implicit features of the audio. Its aim is to learn the context of the audio and find implicit emotional features that guide the generation of head motion, facial expression, etc. in the subsequent steps; it takes all audio features in each sliding window together with a preset initial state s_0 as input, and outputs the corresponding implicit feature code s_t for each window;
S4-4, s_t is concatenated with each of (l_t, l_{t+1}, …, l_{t+T}) and fed into a fully connected layer to obtain the three-dimensional face parameters predicted from the audio features;
S4-5, in the training process Y_t is used as the ground truth for supervision, and the predicted parameters are sent to the discriminator for discrimination to determine the discrimination loss.
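As a concrete illustration of steps S4-1 to S4-4, the sketch below shows how a window-level implicit feature could be broadcast to every frame, concatenated with the per-frame lip features and passed through a fully connected layer; the generators G_lip and G_tem themselves are not shown, and all class names and dimensions are assumptions rather than the patented architecture.

```python
# Hedged sketch of the fusion head: per-frame lip features + window implicit
# feature -> fully connected layer -> per-frame 3DMM parameters.
import torch
import torch.nn as nn

class ParamHead(nn.Module):
    def __init__(self, lip_dim=128, implicit_dim=64, param_dim=64):
        super().__init__()
        self.fc = nn.Linear(lip_dim + implicit_dim, param_dim)

    def forward(self, lip_feats, implicit_feat):
        """lip_feats: (T, lip_dim) for one window; implicit_feat: (implicit_dim,)."""
        T = lip_feats.shape[0]
        s = implicit_feat.unsqueeze(0).expand(T, -1)   # repeat s_t for every frame
        fused = torch.cat([lip_feats, s], dim=-1)      # (T, lip_dim + implicit_dim)
        return self.fc(fused)                          # predicted parameters per frame
```

For example, `ParamHead()(torch.randn(5, 128), torch.randn(64))` returns a (5, 64) tensor of per-frame predicted parameters for a 5-frame window.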
5. Rendering face images
As shown in fig. 5, the rendering facial image module renders a three-dimensional facial image corresponding to each frame of speech by using a three-dimensional rendering model of a human face according to the obtained prediction parameters, and specifically includes the following steps:
s5-1, adjusting the parameters of the face three-dimensional model according to the face three-dimensional parameters predicted for each frame to obtain recombined face three-dimensional parameters;
and S5-2, rendering the recombined human face three-dimensional parameters by adopting an open-source pytorech 3D to obtain a rendered human face image.
6. Eye mask coding
As shown in fig. 6, the eye coding part extracts an eye mask through rendered human faces and blinking parameters to realize individual eye coding, and specifically includes the following steps:
S6-1, an eye mask is generated by locating the eye-region pixels from the face key points of step 3). A threshold z is set to delimit the size of the eye region, which must satisfy (p_x - c_x)^2/4 + (p_y - c_y)^2 < z, where p and c are an eye vertex and the eye centre, respectively;
S6-2, the eye blink AU (Action Unit) value is obtained from the predicted eye motion information;
S6-3, AU value normalization is applied to the pixel values in the eye attention map to achieve the blinking behaviour.
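The following sketch illustrates steps S6-1 to S6-3 under stated assumptions: an elliptical mask built from the inequality (p_x - c_x)^2/4 + (p_y - c_y)^2 < z, with pixel values set from a normalized blink AU value. The landmark grouping, the AU scale au_max and the normalization are illustrative, not the patent's exact scheme.

```python
# Hedged sketch: elliptical eye mask from landmarks, pixel values scaled by a
# normalized blink AU value.
import numpy as np

def eye_mask(image_shape, eye_points, z, au_value, au_max=5.0):
    """eye_points: (N, 2) landmarks of one eye; au_value: predicted blink AU."""
    h, w = image_shape[:2]
    cx, cy = eye_points.mean(axis=0)                      # eye centre c
    ys, xs = np.mgrid[0:h, 0:w]
    inside = (xs - cx) ** 2 / 4.0 + (ys - cy) ** 2 < z    # elliptical eye region
    mask = np.zeros((h, w), dtype=np.float32)
    mask[inside] = np.clip(au_value / au_max, 0.0, 1.0)   # normalized AU as pixel value
    return mask
```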
7. Face image depth generation
As shown in fig. 7, the face image depth generating section inputs the eye mask code and the three-dimensional face image together into the depth face rendering model to synthesize each frame of speaking face image, and specifically includes the following steps:
S7-1, the eye mask from step 6) is combined with the face image rendered in step 5) to form input information of size W × H × 4, where three channels are the face image and one channel is the eye mask;
S7-2, the deep face rendering model is constructed. The model consists of a generator G_render and a discriminator D_render, whose purpose is to ensure temporal consistency while preserving the character identity. A window of size 2N_w is used with the current frame at its centre and is fed into the video rendering network. G_render takes a tensor X_t of size W × H × 4 × 2N_w as input and outputs the target person image frames G_render(X_t);
During training, G_render(X_t) and the real image I are fed into D_render for discrimination, forming the adversarial loss.
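A minimal sketch of assembling the W × H × 4 per-frame input and the 2N_w-frame window of steps S7-1 and S7-2 is shown below; boundary clamping and the array layout are assumptions made for illustration.

```python
# Hedged sketch: stack the rendered RGB face with the eye mask into 4 channels
# and gather 2*N_w neighbouring frames around the current frame t.
import numpy as np

def build_window_input(rgb_frames, eye_masks, t, n_w):
    """rgb_frames: list of (H, W, 3); eye_masks: list of (H, W); centre index t."""
    frames = []
    for k in range(t - n_w, t + n_w):                 # window of size 2*N_w
        k = min(max(k, 0), len(rgb_frames) - 1)       # clamp at sequence boundaries
        x = np.concatenate(
            [rgb_frames[k], eye_masks[k][..., None]], axis=-1)   # (H, W, 4)
        frames.append(x)
    return np.stack(frames)                           # (2*N_w, H, W, 4)
```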
8. Video result synthesis
All the speaking face images are spliced together to output the final speaking face video: the synthesized images are spliced in sequence with ffmpeg and combined with the audio to form the final speaking face video result.
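A hedged sketch of this muxing step with ffmpeg is shown below; the frame naming pattern, frame rate and codec flags are assumptions, as the patent only states that ffmpeg splices the images and combines them with the audio.

```python
# Hedged sketch: splice numbered frames into a video and attach the audio track.
import subprocess

def frames_to_video(frame_pattern="frames/%05d.png", audio_path="speech.wav",
                    out_path="speaking_face.mp4", fps=30):
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,   # ordered face images
        "-i", audio_path,                              # driving audio track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",
        out_path,
    ], check=True)
```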
Based on the method for synthesizing a speaking face based on a generative adversarial network provided in this application, the application further provides a speaking face synthesis device based on a generative adversarial network, which comprises a data acquisition module, an audio processing module, a face three-dimensional reconstruction module, a three-dimensional face parameter prediction module, a face three-dimensional rendering module, an eye coding module, a face image depth generation module and a video synthesis module:
the 8-1 data acquisition module is used for acquiring the voice audio content of the speaking face and a reference video containing a face picture of a target person;
the 8-2 audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
8-3, a human face three-dimensional reconstruction module, which reconstructs a target character video by adopting a three-dimensional human face reconstruction model to obtain a target human face three-dimensional parameter;
the 8-4 three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
8-5, a face three-dimensional rendering module, which adopts a face three-dimensional rendering model and obtains a rendered face image according to the predicted face parameters and the face three-dimensional parameters;
8-6 eye coding module, modifying the eye pixel value according to the predicted face parameter to obtain eye coding, and realizing blinking;
8-7, a face image depth generation module, which obtains the face image with the deep face rendering model and combines it with the eye coding to synthesize each frame of the speaking face image;
and the 8-8 video synthesis module is used for splicing all the images by using ffmpeg and synthesizing the final talking face video result together with the audio.
Based on the method for synthesizing a speaking face based on a generative adversarial network, the application also provides an electronic device, which comprises:
9-1 at least one processor;
9-2 and a memory communicatively coupled to the processor;
wherein the memory stores computer execution instructions;
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of any of the methods presented herein.
Based on the method for synthesizing a speaking face based on a generative adversarial network provided by this application, the application also provides a storage medium, which is a computer-readable storage medium, for example a memory containing a computer program; the computer program can be executed by a processor to complete the steps of the aforementioned method for synthesizing a speaking face based on a generative adversarial network.
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Therefore, the above description is only an exemplary embodiment of the present application, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principle of the present application belong to the protection scope of the present invention.

Claims (10)

1. A method for synthesizing a speaking face based on a generative adversarial network, characterized by comprising the following steps:
acquiring a voice audio clip and a reference video which is recorded by a real person and contains a target face;
carrying out noise reduction, resampling and feature extraction on an input voice audio to obtain audio features;
inputting a reference video of a target face into a pre-trained three-dimensional face reconstruction model to obtain face three-dimensional data;
inputting the audio features into a face depth generation model to obtain predicted head posture, expression, mouth shape and blink parameters;
rendering a three-dimensional face image corresponding to each frame of voice by utilizing a pre-trained face three-dimensional rendering model according to the predicted parameters;
extracting eye masks according to rendered human faces and blink parameters, and independently coding eyes;
inputting the eye mask code and the three-dimensional face image into a deep face rendering model together to synthesize each frame of speaking face image;
finally, all the speaking face images are spliced together to output the final speaking face video.
2. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein obtaining the voice audio segment and the reference video of the target face specifically comprises:
audio acquisition, namely acquiring and screening standard broadcast audio clips by a network or recording the audio clips in a recording studio through acquisition equipment;
and video acquisition, wherein a single human face video clip is acquired through a network or the video clip comprising the front face and the side face is recorded through video acquisition equipment.
3. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein the feature extraction of the input speech audio comprises:
audio resampling, namely, linearly resampling the audio to align the audio with the video frame rate, encoding the resampled audio signal to form an array, and performing fast Fourier transform on the array;
the transformed code is convolved in a two-dimensional space consisting of a time domain and a frequency domain, and corresponding audio features are formed for each second of audio.
4. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein inputting a reference video of the target face into the three-dimensional face reconstruction model to obtain three-dimensional face data comprises:
processing an input reference video into an image file by frames;
performing face recognition on each reference image by using a convolutional neural network, and cutting the face image according to a recognition result;
performing face landmark extraction on the image, covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, and finally obtaining the specific coordinates of 68 key points;
adopting a pre-trained face reconstruction method based on CNN to establish a face 3DMM model, wherein the process is as follows:
fitting the face image to a 3DMM model to obtain the target face three-dimensional parameters M = (F_i, F_t, F_e, F_l), where F_i, F_t, F_e and F_l respectively denote the identity, texture, facial expression and illumination feature vectors of the three-dimensional face coordinates;
and in the model training process, the obtained three-dimensional face model is projected to two dimensions, 68 key points are extracted again, the key points extracted from the original video image are used as supervision, and the landmark loss is formed so as to further train the three-dimensional face reconstruction model.
5. The method of claim 1, wherein inputting audio features into a pre-trained face depth generation model to obtain predicted head pose, mouth shape and blink parameters comprises:
inputting processed audio features and human face three-dimensional parameters into a model, wherein the audio features are sent into the model by a sliding window with the size of T frames each time, a mouth-shaped feature generator generates corresponding mouth-shaped feature parameters for each frame in the window, and an implicit feature generator learns an implicit feature array for each window by combining time sequence features;
splicing the implicit characteristic array with the mouth shape characteristic of each frame respectively, and sending the merged characteristic array into a full connection layer to obtain a human face three-dimensional parameter predicted according to the audio characteristic;
a pre-trained face depth generation model comprising:
the face depth generation model consists of a mouth shape feature generator, an implicit feature generator and a discriminator, wherein the implicit feature comprises facial expression, head posture and blink;
and in the model training process, the three-dimensional parameters of the real human face and the predicted three-dimensional parameters are sent to a discriminator to be discriminated, and discrimination loss is determined.
6. The method as claimed in claim 1, wherein rendering a three-dimensional face image corresponding to each frame of speech by using a three-dimensional face rendering model according to the obtained prediction parameters comprises:
adjusting the parameters of the face three-dimensional model according to the predicted face three-dimensional parameters to obtain recombined face three-dimensional parameters;
and rendering the recombined three-dimensional face parameters with the open-source PyTorch3D to obtain the rendered face image.
7. The method of claim 1, wherein the eye mask is extracted for the rendered face and the blink parameters, and the eye is encoded separately, comprising:
determining the eye position according to the extracted key points of the face;
combining the predicted face characteristics to obtain an eye blink AU value;
AU value normalization is applied to the pixel values in the eye attention map.
8. The method as claimed in claim 1, wherein the step of inputting the eye mask code and the three-dimensional face image into the deep face rendering model together to synthesize each frame of speaking face image comprises:
combining the eye coding with the rendered human face image to form input information with four channels, wherein three channels are human face images, and one channel is an eye mask;
placing the current frame at the center of a window of size 2N, inputting it into the deep face rendering model, which outputs temporally consistent image frames of the target person;
a depth face rendering model comprising:
the model consists of a generator G and a discriminator D, whose purpose is to ensure temporal consistency while preserving the character identity;
in the training process, the generated image and the real image are sent to a discriminator to be discriminated, so that the countermeasure loss is formed.
9. The method as claimed in claim 1, wherein all the speaking face images are spliced together to output the final speaking face video, and all the images spliced in sequence with ffmpeg form the final speaking face video result.
10. A speaking face synthesis apparatus based on a generative adversarial network, comprising:
the data acquisition module is used for acquiring voice audio content of a speaking face and a reference video containing a face picture of a target person;
the audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
the human face three-dimensional reconstruction module is used for reconstructing the target character video by adopting a three-dimensional human face reconstruction model to obtain target human face three-dimensional parameters;
the three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
the face three-dimensional rendering module is used for obtaining a rendered face image according to the predicted face parameters and the face three-dimensional parameters by adopting a face three-dimensional rendering model;
the eye coding module modifies the eye pixel value according to the predicted face parameters to obtain eye codes and realize blinking;
the human face image depth generation module is used for obtaining a human face image by adopting a depth human face rendering model and combining eye coding to synthesize each frame of speaking human face image;
and the video synthesis module is used for splicing all the images by using ffmpeg and synthesizing the spliced images and the audio into a final talking face video result.
CN202211493192.5A 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network Pending CN115908659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211493192.5A CN115908659A (en) 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211493192.5A CN115908659A (en) 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network

Publications (1)

Publication Number Publication Date
CN115908659A true CN115908659A (en) 2023-04-04

Family

ID=86474412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211493192.5A Pending CN115908659A (en) 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN115908659A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152447A (en) * 2023-04-21 2023-05-23 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116152447B (en) * 2023-04-21 2023-09-26 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116233567A (en) * 2023-05-05 2023-06-06 山东建筑大学 Speaker face video generation method and system based on audio emotion perception
CN117036555A (en) * 2023-05-18 2023-11-10 无锡捷通数智科技有限公司 Digital person generation method and device and digital person generation system
CN116402928A (en) * 2023-05-26 2023-07-07 南昌航空大学 Virtual talking digital person generating method
CN116402928B (en) * 2023-05-26 2023-08-25 南昌航空大学 Virtual talking digital person generating method
CN117292030A (en) * 2023-10-27 2023-12-26 海看网络科技(山东)股份有限公司 Method and system for generating three-dimensional digital human animation
CN117593442A (en) * 2023-11-28 2024-02-23 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering
CN117593442B (en) * 2023-11-28 2024-05-03 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering
CN117729298A (en) * 2023-12-15 2024-03-19 北京中科金财科技股份有限公司 Photo driving method based on action driving and mouth shape driving
CN117729298B (en) * 2023-12-15 2024-06-21 北京中科金财科技股份有限公司 Photo driving method based on action driving and mouth shape driving

Similar Documents

Publication Publication Date Title
CN115908659A (en) Method and device for synthesizing speaking face based on a generative adversarial network
Guo et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis
CN113378697A (en) Method and device for generating speaking face video based on convolutional neural network
US11393150B2 (en) Generating an animation rig for use in animating a computer-generated character based on facial scans of an actor and a muscle model
CN114419702B (en) Digital person generation model, training method of model, and digital person generation method
US20240212252A1 (en) Method and apparatus for training video generation model, storage medium, and computer device
CN115914505A (en) Video generation method and system based on voice-driven digital human model
Shen et al. Sd-nerf: Towards lifelike talking head animation via spatially-adaptive dual-driven nerfs
CN116993948B (en) Face three-dimensional reconstruction method, system and intelligent terminal
Peng et al. Synctalk: The devil is in the synchronization for talking head synthesis
CN117115331A (en) Virtual image synthesizing method, synthesizing device, equipment and medium
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
US20220076409A1 (en) Systems and Methods for Building a Skin-to-Muscle Transformation in Computer Animation
CN115984452A (en) Head three-dimensional reconstruction method and equipment
Zeng et al. Ultra-low bit rate facial coding hybrid model based on saliency detection
CN116402928B (en) Virtual talking digital person generating method
US11158103B1 (en) Systems and methods for data bundles in computer animation
US11875504B2 (en) Systems and methods for building a muscle-to-skin transformation in computer animation
CN117153195B (en) Method and system for generating speaker face video based on adaptive region shielding
Shen et al. Talking Head Generation Based on 3D Morphable Facial Model
CN117372585A (en) Face video generation method and device and electronic equipment
Yang et al. Deep learning-based 3D face reconstruction method for video stream
CN117557695A (en) Method and device for generating video by driving single photo through audio
CN114972874A (en) Three-dimensional human body classification and generation method and system for complex action sequence
WO2022055366A1 (en) Systems and methods for building a skin-to-muscle transformation in computer animation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination