CN115908659A - Method and device for synthesizing a speaking face based on a generative adversarial network - Google Patents

Method and device for synthesizing a speaking face based on a generative adversarial network

Info

Publication number
CN115908659A
CN115908659A (application CN202211493192.5A)
Authority
CN
China
Prior art keywords
face
dimensional
audio
model
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211493192.5A
Other languages
Chinese (zh)
Inventor
杨新宇
宋怡馨
胡冠宇
张硕
魏洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202211493192.5A
Publication of CN115908659A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method and a device for synthesizing a speaking face based on a generative adversarial network. The method and device attend to implicit information beyond the mouth shape, such as head movement, expression and blinking, and can synthesize a high-quality, natural video of the target person speaking from a segment of speech and a reference video of the target face, addressing problems such as the anchor being unavailable to record and the synthesized video appearing distorted and rigid. First, the speech and the reference video are acquired, and the audio is pre-processed and its features are extracted to obtain audio features; meanwhile, the video is fed into a three-dimensional face reconstruction model to obtain three-dimensional face data. The audio features are then fed into a face depth generation model to predict face parameters, and the corresponding three-dimensional face image is obtained with the face three-dimensional rendering model. Next, an eye mask is extracted according to the blink parameters and the eyes are encoded separately. The encoding and the facial image are fed into a deep face rendering model to synthesize the face images frame by frame. Finally, all images are spliced together and combined with the audio to synthesize the speaking face video.

Description

Method and device for synthesizing a speaking face based on a generative adversarial network
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for synthesizing a speaking face based on a generative adversarial network.
Background
Information transmission is particularly important to the development of society in the information age, and the media industry is a main channel for transmitting information. Both news broadcasting and live network streaming require an anchor to record on site in a studio, and such recording is often difficult to arrange quickly when temporary demands arise. A virtual anchor formed by generating the speaking face video directly from audio information and a face video solves these problems well: it is not limited in time or space, can improve broadcasting accuracy on the basis of intelligent speech and avoid the errors that manual broadcasting may introduce, and can greatly reduce labour cost, cutting costs and improving efficiency for the industry.
As the role of the virtual anchor in the media industry grows, research on the synthesis of speaking face video has received wide attention from academia. In current research on speaking face synthesis, roughly 80% of the literature focuses on synchronizing the mouth with the speech, while the rest focuses on the naturalness of the generated result. The final measure of effectiveness in video generation research, however, is not accuracy alone but the impression the video leaves on its final audience.
Disclosure of Invention
The invention aims to provide a method and a device for synthesizing a speaking face based on a generative adversarial network. Given a segment of audio and a short video containing the target face, the method generates a speaking face video of the target person for the audio content. The logic is simple and the effect natural: the mouth shape of the generated result corresponds accurately to the audio, and the speaker shows natural facial expressions and head poses while retaining a certain amount of blinking.
The invention is realized by adopting the following technical scheme:
a method for synthesizing a speaking face based on a generation confrontation network comprises the following steps:
acquiring a voice audio clip and a reference video which is recorded by a real person and contains a target face;
carrying out noise reduction, resampling and feature extraction on an input voice audio to obtain audio features;
inputting a reference video of a target face into a pre-trained three-dimensional face reconstruction model to obtain face three-dimensional data;
inputting the audio features into a face depth generation model to obtain predicted head posture, expression, mouth shape and blink parameters;
rendering a three-dimensional face image corresponding to each frame of voice by utilizing a pre-trained face three-dimensional rendering model according to the predicted parameters;
extracting eye masks according to rendered human faces and blink parameters, and independently coding eyes;
inputting the eye mask code and the three-dimensional face image into a deep face rendering model together to synthesize each frame of speaking face image;
finally, all the speaking face images are spliced together to output a final speaking face video.
A further improvement of the invention is that obtaining the voice audio segment and the reference video of the target face specifically comprises:
audio acquisition, namely acquiring and screening standard broadcast audio clips by a network or recording the audio clips in a recording room through acquisition equipment;
and video acquisition, wherein a single human face video clip is acquired through a network or the video clip comprising the front face and the side face is recorded through video acquisition equipment.
A further improvement of the invention is that the feature extraction performed on the input voice audio comprises the following steps:
audio resampling, namely linearly resampling audio to align the audio with a video frame rate, encoding the resampled audio signal to form an array, and performing fast Fourier transform on the array;
the transformed code is convolved in a two-dimensional space consisting of a time domain and a frequency domain, and corresponding audio features are formed for each second of audio.
A further improvement of the invention is that inputting the reference video of the target face into the three-dimensional face reconstruction model to obtain the three-dimensional face data comprises the following steps:
processing an input reference video into an image file according to frames;
performing face recognition on each reference image by using a convolutional neural network, and cutting the face image according to a recognition result;
performing face landmark extraction on the image, covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, and finally obtaining the specific coordinates of 68 key points;
adopting a pre-trained face reconstruction method based on CNN to establish a face 3DMM model, wherein the process is as follows:
fitting the face image to a 3DMM model to obtain the target face three-dimensional parameters M = (F_i, F_t, F_e, F_l), where F_i, F_t, F_e and F_l respectively denote the identity, texture, facial expression and illumination feature vectors of the three-dimensional face coordinates;
and in the model training process, the obtained three-dimensional face model is projected to two dimensions, 68 key points are extracted again, the key points extracted from the original video image are used as supervision, and the landmark loss is formed so as to further train the three-dimensional face reconstruction model.
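For illustration only, a minimal PyTorch sketch of such a landmark loss is given below, assuming a simple mean squared error between the 68 reprojected key points and the key points detected on the original frame; the patent does not fix the exact loss form, and the optional per-point weights are an added assumption.

```python
# Hedged sketch of a landmark loss: projected 2D keypoints of the reconstructed
# 3D face are supervised by the keypoints detected on the original video frame.
import torch

def landmark_loss(projected_kpts, detected_kpts, weights=None):
    """projected_kpts, detected_kpts: (B, 68, 2) tensors in image coordinates."""
    diff = (projected_kpts - detected_kpts) ** 2      # squared 2D error per point
    per_point = diff.sum(dim=-1)                      # (B, 68)
    if weights is not None:                           # e.g. up-weight mouth points
        per_point = per_point * weights
    return per_point.mean()
```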
A further improvement of the invention is that inputting the audio features into the pre-trained face depth generation model to obtain the predicted head pose, mouth shape and blink parameters comprises the following steps:
inputting processed audio features and human face three-dimensional parameters into a model, wherein the audio features are sent into the model by a sliding window with the size of T frames each time, a mouth shape feature generator generates corresponding mouth shape feature parameters for each frame in the window, and an implicit feature generator learns an implicit feature array for each window by combining time sequence features;
respectively splicing the implicit characteristic arrays with the mouth shape characteristics of each frame, and sending the mouth shape characteristics into a full connection layer to obtain human face three-dimensional parameters predicted according to the audio characteristics;
a pre-trained face depth generation model comprising:
the face depth generation model consists of a mouth shape feature generator, an implicit feature generator and a discriminator, wherein the implicit feature comprises facial expression, head posture and blink;
and in the model training process, the three-dimensional parameters of the real human face and the predicted three-dimensional parameters are sent to a discriminator to be discriminated, and discrimination loss is determined.
A further improvement of the invention is that rendering the three-dimensional face image corresponding to each frame of speech with the face three-dimensional rendering model according to the obtained prediction parameters comprises the following steps:
adjusting the parameters of the face three-dimensional model according to the predicted face three-dimensional parameters to obtain recombined face three-dimensional parameters;
and rendering the recombined three-dimensional face parameters with the open-source PyTorch3D to obtain the rendered face image.
A further improvement of the invention is that extracting an eye mask from the rendered face and the blink parameters and encoding the eyes separately comprises the following steps:
determining the position of eyes according to the extracted key points of the face;
combining the predicted human face and facial features to obtain an eye blink AU value;
AU value normalization is applied to the pixel values in the eye attention map.
A further improvement of the invention is that inputting the eye mask encoding and the three-dimensional face image together into the deep face rendering model to synthesize each frame of the speaking face image comprises the following steps:
combining the eye coding with the rendered face image to form input information with four channels, wherein three channels are face images, and one channel is an eye mask;
placing the current frame at the center of a window of size 2N, inputting it into the deep face rendering model, which outputs temporally consistent image frames of the target person;
a deep face rendering model comprising:
the model consists of a generator G and a discriminator D, whose purpose is to ensure temporal consistency while preserving the character identity;
in the training process, the generated image and the real image are sent to a discriminator to be discriminated, so that the countermeasure loss is formed.
A further improvement of the invention is that all the speaking face images are spliced together to output the final speaking face video: all synthesized images are spliced in sequence with ffmpeg to form the final speaking face video result.
A speaking face synthesis apparatus based on a generative adversarial network, comprising:
the data acquisition module is used for acquiring voice audio content of a speaking face and a reference video containing a face picture of a target person;
the audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
the human face three-dimensional reconstruction module is used for reconstructing the target character video by adopting a three-dimensional human face reconstruction model to obtain target human face three-dimensional parameters;
the three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
the face three-dimensional rendering module is used for obtaining a rendered face image according to the predicted face parameters and the face three-dimensional parameters by adopting a face three-dimensional rendering model;
the eye coding module modifies the eye pixel value according to the predicted face parameters to obtain eye codes and realize blinking;
the human face image depth generation module is used for obtaining a human face image by adopting a depth human face rendering model and combining eye coding to synthesize each frame of speaking human face image;
and the video synthesis module is used for splicing all the images by adopting ffmpeg and synthesizing the final talking face video result together with the audio.
The invention has at least the following beneficial technical effects:
the method for synthesizing the speaking face based on the generation countermeasure network, provided by the invention, carries out three-dimensional modeling on the face through a three-dimensional face reconstruction model, so that the generated result is superior to the result of two-dimensional image modeling in visual effect; meanwhile, the method realizes the learning and generation of implicit characteristics such as head posture, expression and the like through a face depth generation model, so that the true degree of a generated result is greatly improved; the method extracts the eye mask from the rendered face and the blink parameters, and codes the eyes independently to realize the natural blink of the speaker, thereby further improving the continuity and the natural degree of the synthetic effect.
The invention provides a device for synthesizing a speaking face based on a generative adversarial network, which comprises eight modules: a data acquisition module, an audio processing module, a face three-dimensional reconstruction module, a three-dimensional face parameter prediction module, a face three-dimensional rendering module, an eye coding module, a face image depth generation module and a video synthesis module. Together they cover all tasks from data acquisition to result synthesis and form a closed-loop system.
Drawings
FIG. 1 is a flow chart of speaking face synthesis based on a generative adversarial network according to the present invention;
FIG. 2 is a schematic flow chart of a three-dimensional face reconstruction module according to the present invention;
FIG. 3 is a schematic diagram of two-dimensional and three-dimensional human face key points in the present invention;
FIG. 4 is a schematic flow chart of a three-dimensional face parameter prediction module according to the present invention;
FIG. 5 is a schematic flow chart of a three-dimensional rendering module for human faces;
FIG. 6 is a schematic flow diagram of an eye encoding module;
FIG. 7 is a schematic flow diagram of a face image depth generation module;
FIG. 8 is a functional block diagram of the speaking face synthesis apparatus based on a generative adversarial network according to the present invention;
FIG. 9 is a schematic structural diagram of an electronic device for implementing the method for synthesizing a speaking face based on a generative adversarial network provided by the invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings, without in any way limiting the scope of the invention.
Referring to FIG. 1, the invention provides a method for synthesizing a speaking face based on a generative adversarial network. First, the audio to be processed and a reference video of the target face are obtained; then features are extracted from the input audio while a three-dimensional face reconstruction is performed on the target face video. Next, a three-dimensional face parameter prediction model is constructed to predict face parameters such as mouth shape, facial expression, head pose and blinking, and the face three-dimensional rendering network renders the facial image corresponding to each frame of speech from these parameters. To ensure natural blinking in the generated result, the eye region of the rendered face is extracted and encoded separately so that its pixel values can be changed. Finally, the deep face rendering network combines the eye encoding with the face rendering result to synthesize the speaking face video. The method specifically comprises the following steps:
1. capturing audio and video
Audio acquisition, namely acquiring standard broadcast audio clips by a network and recording the audio clips in a recording studio by a voice acquisition device;
video acquisition, wherein a single human face video clip is acquired through a network, and the video clip comprising a front face and a side face is recorded through video acquisition equipment;
2. extracting audio features
Using the open-source librosa library, the audio is linearly resampled to align with the 30 fps video, the resampled signal is encoded into an array, and a Fast Fourier Transform (FFT) is applied to it. The transformed audio encoding is then convolved over a two-dimensional space consisting of the time and frequency domains, with ReLU as the activation function, finally forming a 30 × 29 feature array as the audio features for each second of audio.
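A minimal sketch of the framing and FFT stage of this pre-processing is given below, assuming librosa for loading/resampling and simple spectral pooling down to 29 bins per video frame; the 16 kHz sampling rate and the pooling scheme are assumptions, and the learned time-frequency convolution with ReLU is omitted, so this is not the patented feature extractor.

```python
# Hedged sketch: slice the resampled waveform into 30 windows per second
# (one per video frame), FFT each window, and pool to 29 bins per frame.
import numpy as np
import librosa

def audio_features_per_second(path, fps=30, n_bins=29, sr=16000):
    """Return an array of shape (seconds, fps, n_bins) of per-frame spectra."""
    wav, sr = librosa.load(path, sr=sr)              # load and resample the audio
    hop = sr // fps                                  # samples per video frame
    n_frames = len(wav) // hop
    feats = []
    for i in range(n_frames):
        window = wav[i * hop:(i + 1) * hop]
        spec = np.abs(np.fft.rfft(window))           # magnitude spectrum (FFT)
        trimmed = spec[:(len(spec) // n_bins) * n_bins]
        pooled = trimmed.reshape(n_bins, -1).mean(axis=1)   # pool to n_bins values
        feats.append(pooled)
    n_sec = n_frames // fps
    feats = np.stack(feats[:n_sec * fps])            # drop any trailing partial second
    return feats.reshape(n_sec, fps, n_bins)         # e.g. (seconds, 30, 29)
```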
3. Three-dimensional reconstruction of target face
As shown in fig. 2, the process of establishing a three-dimensional face model for a reference video includes face detection, face key point extraction, and three-dimensional face deformation statistical model construction, and specifically includes the following steps:
S3-1, detecting the face position: a Haar detector in OpenCV is used to detect the face region in each video frame, giving a bounding box (x, y, w, h), and the image is cropped accordingly;
S3-2, extracting the face key points: Dlib detects the facial feature points within the face bounding box, mainly covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, and finally obtains the specific coordinates (x_i, y_i), i = 0, 1, ..., 67, of each point; the feature point extraction result is shown in FIG. 3-a). A sketch of this detection and landmark extraction is given after step S3-4;
S3-3, three-dimensional face reconstruction model: a face 3DMM model M = (F_i, F_t, F_e) is established with a deep network over the multi-frame video, where F_i, F_t and F_e are the identity, texture and expression feature vectors of the three-dimensional face; unlike a conventional model, the expression features are added to further improve the naturalness of the character's expression;
S3-4, in the training process the constructed three-dimensional model is projected onto a two-dimensional plane, and the face key points are extracted again from the projection result using the method of step S3-2; this feature point extraction result is shown in FIG. 3-b). During training of the three-dimensional face model, the result of step S3-2 is taken as the ground truth to form the landmark loss, which further trains the three-dimensional face model.
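As referenced in step S3-2, a sketch of the face detection and 68-point landmark extraction is given below, assuming OpenCV's bundled Haar cascade and the publicly distributed Dlib 68-landmark predictor file; the file paths and the single-face assumption are illustrative rather than part of the patent.

```python
# Hedged sketch of steps S3-1/S3-2: Haar cascade face detection followed by
# Dlib 68-point landmark extraction on the detected face box.
import cv2
import dlib
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, None
    x, y, w, h = faces[0]                        # (x, y, w, h) of the face box
    crop = frame_bgr[y:y + h, x:x + w]           # cropped face image
    box = dlib.rectangle(int(x), int(y), int(x + w), int(y + h))
    shape = predictor(gray, box)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    return crop, pts                             # cropped face, 68 points (x_i, y_i)
```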
4. Constructing a face generation model
Referring to FIG. 4, the face generation model feeds the audio features into the pre-trained face depth generation model to obtain the predicted face parameters of head pose, mouth shape, blink, etc. It consists of a lip-sync generator G_lip, a temporal generator G_tem and a discriminator D.
S4-1, the basic audio features α extracted in step 2 are fed into the model as sliding windows α_t, α_{t+1}, …, α_{t+T} of T frames;
S4-2, G_lip learns the explicit features in the audio, i.e. the mouth-shape feature vectors corresponding to the audio: for each frame it takes the audio feature α_t as input and outputs the lip-shape feature code l_t of that frame;
S4-3, G_tem learns and outputs the implicit features of the audio. Its aim is to learn the context of the audio and find implicit emotional features that guide the generation of head motion, facial expression, etc. in the subsequent steps; it takes all audio features in each sliding window together with a preset initial state s_0 as input, and outputs the corresponding implicit feature code s_t for each window;
S4-4, s_t is concatenated with each of (l_t, l_{t+1}, …, l_{t+T}) and fed into a fully connected layer to obtain the three-dimensional face parameters predicted from the audio features;
S4-5, in the training process Y_t is used as the ground truth for supervision, and the predicted parameters are sent to the discriminator for discrimination to determine the discrimination loss.
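As a concrete illustration of steps S4-1 to S4-4, the sketch below shows how a window-level implicit feature could be broadcast to every frame, concatenated with the per-frame lip features and passed through a fully connected layer; the generators G_lip and G_tem themselves are not shown, and all class names and dimensions are assumptions rather than the patented architecture.

```python
# Hedged sketch of the fusion head: per-frame lip features + window implicit
# feature -> fully connected layer -> per-frame 3DMM parameters.
import torch
import torch.nn as nn

class ParamHead(nn.Module):
    def __init__(self, lip_dim=128, implicit_dim=64, param_dim=64):
        super().__init__()
        self.fc = nn.Linear(lip_dim + implicit_dim, param_dim)

    def forward(self, lip_feats, implicit_feat):
        """lip_feats: (T, lip_dim) for one window; implicit_feat: (implicit_dim,)."""
        T = lip_feats.shape[0]
        s = implicit_feat.unsqueeze(0).expand(T, -1)   # repeat s_t for every frame
        fused = torch.cat([lip_feats, s], dim=-1)      # (T, lip_dim + implicit_dim)
        return self.fc(fused)                          # predicted parameters per frame
```

For example, `ParamHead()(torch.randn(5, 128), torch.randn(64))` returns a (5, 64) tensor of per-frame predicted parameters for a 5-frame window.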
5. Rendering face images
As shown in fig. 5, the rendering facial image module renders a three-dimensional facial image corresponding to each frame of speech by using a three-dimensional rendering model of a human face according to the obtained prediction parameters, and specifically includes the following steps:
s5-1, adjusting the parameters of the face three-dimensional model according to the face three-dimensional parameters predicted for each frame to obtain recombined face three-dimensional parameters;
and S5-2, rendering the recombined human face three-dimensional parameters by adopting an open-source pytorech 3D to obtain a rendered human face image.
6. Eye mask coding
As shown in fig. 6, the eye coding part extracts an eye mask through rendered human faces and blinking parameters to realize individual eye coding, and specifically includes the following steps:
S6-1, an eye mask is generated by locating the eye-region pixels from the face key points of step 3). A threshold z is set to delimit the size of the eye region, which must satisfy (p_x - c_x)^2/4 + (p_y - c_y)^2 < z, where p and c are an eye vertex and the eye centre, respectively;
S6-2, the eye blink AU (Action Unit) value is obtained from the predicted eye motion information;
S6-3, AU value normalization is applied to the pixel values in the eye attention map to achieve the blinking behaviour.
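The following sketch illustrates steps S6-1 to S6-3 under stated assumptions: an elliptical mask built from the inequality (p_x - c_x)^2/4 + (p_y - c_y)^2 < z, with pixel values set from a normalized blink AU value. The landmark grouping, the AU scale au_max and the normalization are illustrative, not the patent's exact scheme.

```python
# Hedged sketch: elliptical eye mask from landmarks, pixel values scaled by a
# normalized blink AU value.
import numpy as np

def eye_mask(image_shape, eye_points, z, au_value, au_max=5.0):
    """eye_points: (N, 2) landmarks of one eye; au_value: predicted blink AU."""
    h, w = image_shape[:2]
    cx, cy = eye_points.mean(axis=0)                      # eye centre c
    ys, xs = np.mgrid[0:h, 0:w]
    inside = (xs - cx) ** 2 / 4.0 + (ys - cy) ** 2 < z    # elliptical eye region
    mask = np.zeros((h, w), dtype=np.float32)
    mask[inside] = np.clip(au_value / au_max, 0.0, 1.0)   # normalized AU as pixel value
    return mask
```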
7. Face image depth generation
As shown in fig. 7, the face image depth generating section inputs the eye mask code and the three-dimensional face image together into the depth face rendering model to synthesize each frame of speaking face image, and specifically includes the following steps:
S7-1, the eye mask from step 6) is combined with the face image rendered in step 5) to form input information of size W × H × 4, where three channels are the face image and one channel is the eye mask;
S7-2, the deep face rendering model is constructed. The model consists of a generator G_render and a discriminator D_render, whose purpose is to ensure temporal consistency while preserving the character identity. A window of size 2N_w is used with the current frame at its centre and is fed into the video rendering network. G_render takes a tensor X_t of size W × H × 4 × 2N_w as input and outputs the target person image frames G_render(X_t);
During training, G_render(X_t) and the real image I are fed into D_render for discrimination, forming the adversarial loss.
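A minimal sketch of assembling the W × H × 4 per-frame input and the 2N_w-frame window of steps S7-1 and S7-2 is shown below; boundary clamping and the array layout are assumptions made for illustration.

```python
# Hedged sketch: stack the rendered RGB face with the eye mask into 4 channels
# and gather 2*N_w neighbouring frames around the current frame t.
import numpy as np

def build_window_input(rgb_frames, eye_masks, t, n_w):
    """rgb_frames: list of (H, W, 3); eye_masks: list of (H, W); centre index t."""
    frames = []
    for k in range(t - n_w, t + n_w):                 # window of size 2*N_w
        k = min(max(k, 0), len(rgb_frames) - 1)       # clamp at sequence boundaries
        x = np.concatenate(
            [rgb_frames[k], eye_masks[k][..., None]], axis=-1)   # (H, W, 4)
        frames.append(x)
    return np.stack(frames)                           # (2*N_w, H, W, 4)
```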
8. Video result synthesis
All the speaking face images are spliced together to output the final speaking face video: the synthesized images are spliced in sequence with ffmpeg and combined with the audio to form the final speaking face video result.
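A hedged sketch of this muxing step with ffmpeg is shown below; the frame naming pattern, frame rate and codec flags are assumptions, as the patent only states that ffmpeg splices the images and combines them with the audio.

```python
# Hedged sketch: splice numbered frames into a video and attach the audio track.
import subprocess

def frames_to_video(frame_pattern="frames/%05d.png", audio_path="speech.wav",
                    out_path="speaking_face.mp4", fps=30):
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,   # ordered face images
        "-i", audio_path,                              # driving audio track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",
        out_path,
    ], check=True)
```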
Based on the method for synthesizing a speaking face based on a generative adversarial network provided in this application, the application further provides a speaking face synthesis device based on a generative adversarial network, which comprises a data acquisition module, an audio processing module, a face three-dimensional reconstruction module, a three-dimensional face parameter prediction module, a face three-dimensional rendering module, an eye coding module, a face image depth generation module and a video synthesis module:
the 8-1 data acquisition module is used for acquiring the voice audio content of the speaking face and a reference video containing a face picture of a target person;
the 8-2 audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
8-3, a human face three-dimensional reconstruction module, which reconstructs a target character video by adopting a three-dimensional human face reconstruction model to obtain a target human face three-dimensional parameter;
the 8-4 three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
8-5, a face three-dimensional rendering module, which adopts a face three-dimensional rendering model and obtains a rendered face image according to the predicted face parameters and the face three-dimensional parameters;
8-6 eye coding module, modifying the eye pixel value according to the predicted face parameter to obtain eye coding, and realizing blinking;
8-7, a face image depth generation module, which obtains the face image with the deep face rendering model and combines it with the eye coding to synthesize each frame of the speaking face image;
and the 8-8 video synthesis module is used for splicing all the images by using ffmpeg and synthesizing the final talking face video result together with the audio.
Based on the method for synthesizing a speaking face based on a generative adversarial network, the application also provides an electronic device, which comprises:
9-1 at least one processor;
9-2 and a memory communicatively coupled to the processor;
wherein the memory stores computer execution instructions;
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of any of the methods presented herein.
Based on the method for synthesizing a speaking face based on a generative adversarial network provided by this application, the application also provides a storage medium, which is a computer-readable storage medium, for example a memory containing a computer program; the computer program can be executed by a processor to complete the steps of the aforementioned method for synthesizing a speaking face based on a generative adversarial network.
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Therefore, the above description is only an exemplary embodiment of the present application, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principle of the present application belong to the protection scope of the present invention.

Claims (10)

1. A method for synthesizing a speaking face based on a generative adversarial network, characterized by comprising the following steps:
acquiring a voice audio clip and a reference video which is recorded by a real person and contains a target face;
carrying out noise reduction, resampling and feature extraction on an input voice audio to obtain audio features;
inputting a reference video of a target face into a pre-trained three-dimensional face reconstruction model to obtain face three-dimensional data;
inputting the audio features into a face depth generation model to obtain predicted head posture, expression, mouth shape and blink parameters;
rendering a three-dimensional face image corresponding to each frame of voice by utilizing a pre-trained face three-dimensional rendering model according to the predicted parameters;
extracting eye masks according to rendered human faces and blink parameters, and independently coding eyes;
inputting the eye mask code and the three-dimensional face image into a deep face rendering model together to synthesize each frame of speaking face image;
finally, all the speaking face images are spliced together to output the final speaking face video.
2. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein obtaining the voice audio segment and the reference video of the target face specifically comprises:
audio acquisition, namely acquiring and screening standard broadcast audio clips by a network or recording the audio clips in a recording studio through acquisition equipment;
and video acquisition, wherein a single human face video clip is acquired through a network or the video clip comprising the front face and the side face is recorded through video acquisition equipment.
3. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein the feature extraction of the input speech audio comprises:
audio resampling, namely, linearly resampling the audio to align the audio with the video frame rate, encoding the resampled audio signal to form an array, and performing fast Fourier transform on the array;
the transformed code is convolved in a two-dimensional space consisting of a time domain and a frequency domain, and corresponding audio features are formed for each second of audio.
4. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein inputting a reference video of the target face into the three-dimensional face reconstruction model to obtain three-dimensional face data comprises:
processing an input reference video into an image file by frames;
performing face recognition on each reference image by using a convolutional neural network, and cutting the face image according to a recognition result;
performing face landmark extraction on the image, covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, and finally obtaining the specific coordinates of 68 key points;
adopting a pre-trained face reconstruction method based on CNN to establish a face 3DMM model, wherein the process is as follows:
fitting the face image to a 3DMM model to obtain the target face three-dimensional parameters M = (F_i, F_t, F_e, F_l), where F_i, F_t, F_e and F_l respectively denote the identity, texture, facial expression and illumination feature vectors of the three-dimensional face coordinates;
and in the model training process, the obtained three-dimensional face model is projected to two dimensions, 68 key points are extracted again, the key points extracted from the original video image are used as supervision, and the landmark loss is formed so as to further train the three-dimensional face reconstruction model.
5. The method of claim 1, wherein inputting audio features into a pre-trained face depth generation model to obtain predicted head pose, mouth shape and blink parameters comprises:
inputting processed audio features and human face three-dimensional parameters into a model, wherein the audio features are sent into the model by a sliding window with the size of T frames each time, a mouth-shaped feature generator generates corresponding mouth-shaped feature parameters for each frame in the window, and an implicit feature generator learns an implicit feature array for each window by combining time sequence features;
splicing the implicit characteristic array with the mouth shape characteristic of each frame respectively, and sending the merged characteristic array into a full connection layer to obtain a human face three-dimensional parameter predicted according to the audio characteristic;
a pre-trained face depth generation model comprising:
the face depth generation model consists of a mouth shape feature generator, an implicit feature generator and a discriminator, wherein the implicit feature comprises facial expression, head posture and blink;
and in the model training process, the three-dimensional parameters of the real human face and the predicted three-dimensional parameters are sent to a discriminator to be discriminated, and discrimination loss is determined.
6. The method as claimed in claim 1, wherein rendering a three-dimensional face image corresponding to each frame of speech by using a three-dimensional face rendering model according to the obtained prediction parameters comprises:
adjusting the parameters of the face three-dimensional model according to the predicted face three-dimensional parameters to obtain recombined face three-dimensional parameters;
and rendering the recombined three-dimensional face parameters with the open-source PyTorch3D to obtain the rendered face image.
7. The method of claim 1, wherein the eye mask is extracted for the rendered face and the blink parameters, and the eye is encoded separately, comprising:
determining the eye position according to the extracted key points of the face;
combining the predicted face characteristics to obtain an eye blink AU value;
AU value normalization is applied to the pixel values in the eye attention map.
8. The method as claimed in claim 1, wherein the step of inputting the eye mask code and the three-dimensional face image into the deep face rendering model together to synthesize each frame of speaking face image comprises:
combining the eye coding with the rendered human face image to form input information with four channels, wherein three channels are human face images, and one channel is an eye mask;
placing the current frame at the center of a window of size 2N, inputting it into the deep face rendering model, which outputs temporally consistent image frames of the target person;
a depth face rendering model comprising:
the model consists of a generator G and a discriminator D, whose purpose is to ensure temporal consistency while preserving the character identity;
in the training process, the generated image and the real image are sent to a discriminator to be discriminated, so that the countermeasure loss is formed.
9. The method as claimed in claim 1, wherein all the speaking face images are spliced together to output the final speaking face video, and all the images spliced in sequence with ffmpeg form the final speaking face video result.
10. A speaking face synthesis apparatus based on a generative adversarial network, comprising:
the data acquisition module is used for acquiring voice audio content of a speaking face and a reference video containing a face picture of a target person;
the audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
the human face three-dimensional reconstruction module is used for reconstructing the target character video by adopting a three-dimensional human face reconstruction model to obtain target human face three-dimensional parameters;
the three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
the face three-dimensional rendering module is used for obtaining a rendered face image according to the predicted face parameters and the face three-dimensional parameters by adopting a face three-dimensional rendering model;
the eye coding module modifies the eye pixel value according to the predicted face parameters to obtain eye codes and realize blinking;
the human face image depth generation module is used for obtaining a human face image by adopting a depth human face rendering model and combining eye coding to synthesize each frame of speaking human face image;
and the video synthesis module is used for splicing all the images by using ffmpeg and synthesizing the spliced images and the audio into a final talking face video result.
CN202211493192.5A 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network Pending CN115908659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211493192.5A CN115908659A (en) 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211493192.5A CN115908659A (en) 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network

Publications (1)

Publication Number Publication Date
CN115908659A true CN115908659A (en) 2023-04-04

Family

ID=86474412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211493192.5A Pending CN115908659A (en) 2022-11-25 2022-11-25 Method and device for synthesizing speaking face based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN115908659A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152447A (en) * 2023-04-21 2023-05-23 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116152447B (en) * 2023-04-21 2023-09-26 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116233567A (en) * 2023-05-05 2023-06-06 山东建筑大学 Speaker face video generation method and system based on audio emotion perception
CN117036555A (en) * 2023-05-18 2023-11-10 无锡捷通数智科技有限公司 Digital person generation method and device and digital person generation system
CN116402928A (en) * 2023-05-26 2023-07-07 南昌航空大学 Virtual talking digital person generating method
CN116402928B (en) * 2023-05-26 2023-08-25 南昌航空大学 Virtual talking digital person generating method
CN117292030A (en) * 2023-10-27 2023-12-26 海看网络科技(山东)股份有限公司 Method and system for generating three-dimensional digital human animation
CN117593442A (en) * 2023-11-28 2024-02-23 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering
CN117593442B (en) * 2023-11-28 2024-05-03 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering
CN117729298A (en) * 2023-12-15 2024-03-19 北京中科金财科技股份有限公司 Photo driving method based on action driving and mouth shape driving
CN117729298B (en) * 2023-12-15 2024-06-21 北京中科金财科技股份有限公司 Photo driving method based on action driving and mouth shape driving

Similar Documents

Publication Publication Date Title
CN115908659A (en) Method and device for synthesizing speaking face based on a generative adversarial network
Guo et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis
CN113378697A (en) Method and device for generating speaking face video based on convolutional neural network
US11393150B2 (en) Generating an animation rig for use in animating a computer-generated character based on facial scans of an actor and a muscle model
CN114419702B (en) Digital person generation model, training method of model, and digital person generation method
US20240212252A1 (en) Method and apparatus for training video generation model, storage medium, and computer device
CN115914505A (en) Video generation method and system based on voice-driven digital human model
Shen et al. Sd-nerf: Towards lifelike talking head animation via spatially-adaptive dual-driven nerfs
CN116993948B (en) Face three-dimensional reconstruction method, system and intelligent terminal
Peng et al. Synctalk: The devil is in the synchronization for talking head synthesis
CN117115331A (en) Virtual image synthesizing method, synthesizing device, equipment and medium
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
US20220076409A1 (en) Systems and Methods for Building a Skin-to-Muscle Transformation in Computer Animation
CN115984452A (en) Head three-dimensional reconstruction method and equipment
Zeng et al. Ultra-low bit rate facial coding hybrid model based on saliency detection
CN116402928B (en) Virtual talking digital person generating method
US11158103B1 (en) Systems and methods for data bundles in computer animation
US11875504B2 (en) Systems and methods for building a muscle-to-skin transformation in computer animation
CN117153195B (en) Method and system for generating speaker face video based on adaptive region shielding
Shen et al. Talking Head Generation Based on 3D Morphable Facial Model
CN117372585A (en) Face video generation method and device and electronic equipment
Yang et al. Deep learning-based 3D face reconstruction method for video stream
CN117557695A (en) Method and device for generating video by driving single photo through audio
CN114972874A (en) Three-dimensional human body classification and generation method and system for complex action sequence
WO2022055366A1 (en) Systems and methods for building a skin-to-muscle transformation in computer animation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination