CN115908659A - Method and device for synthesizing a speaking face based on a generative adversarial network - Google Patents
- Publication number
- CN115908659A (application CN202211493192.5A / CN202211493192A)
- Authority
- CN
- China
- Prior art keywords
- face
- dimensional
- audio
- model
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a method and a device for synthesizing a speaking face based on a generative adversarial network. The method attends to implicit information beyond the mouth shape, such as head movements, expressions and blinks, and can synthesize a high-quality, natural speaking video of a target character from a segment of speech and a reference video of the target face, solving problems such as an anchor being unavailable to record and synthesized video looking distorted and rigid. First, the speech and the reference video are acquired, and the audio is preprocessed and its features extracted to obtain audio features; at the same time, the video is input into a three-dimensional face reconstruction model to obtain three-dimensional face data. The audio features are input into a face depth generation model to predict face parameters, and a corresponding three-dimensional face image is obtained with the face three-dimensional rendering model. Then, an eye mask is extracted according to the blink parameters and the eyes are encoded; the encoding and the face image are input into a depth face rendering model to synthesize the speaking face frame by frame. Finally, all the images are stitched together and combined with the audio to synthesize the speaking face video.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for synthesizing a speaking face based on a generative adversarial network.
Background
In the information age, the transmission of information is particularly important to the development of society, and the media industry is one of its main channels. Both news broadcasting and live streaming require an anchor to record on site in a studio, and such recording often cannot be arranged quickly when a demand arises at short notice. A virtual anchor, formed by generating a speaking face video directly from audio and a face video, solves these problems well: it is not limited in time or space, can improve broadcasting accuracy on top of intelligent speech, avoids errors that manual broadcasting may introduce, and can greatly reduce labor cost, cutting costs and improving efficiency for the industry.
As the role of the virtual anchor in the media industry grows, research on synthesizing speaking face video has received wide attention from academia. In current research on speaking face synthesis, about 80% of the literature focuses on synchronizing the mouth with the speech information, while the rest focuses on the naturalness of the generated result. The final measure of effectiveness for video generation is not accuracy alone, but the impression the video leaves on its final audience.
Disclosure of Invention
The invention aims to provide a method and a device for synthesizing a speaking face based on a generative adversarial network. Given a segment of audio and a short video containing the target face, the method generates a speaking face video of the target character for the audio content. The logic is simple and the effect natural: the mouth shape of the generated result corresponds accurately to the audio, and the speaker shows natural facial expressions and head poses while keeping a certain amount of blinking.
The invention is realized by adopting the following technical scheme:
a method for synthesizing a speaking face based on a generative adversarial network comprises the following steps:
acquiring a speech audio clip and a reference video, recorded by a real person, that contains the target face;
carrying out noise reduction, resampling and feature extraction on the input speech audio to obtain audio features;
inputting the reference video of the target face into a pre-trained three-dimensional face reconstruction model to obtain three-dimensional face data;
inputting the audio features into a face depth generation model to obtain predicted head pose, expression, mouth shape and blink parameters;
rendering a three-dimensional face image corresponding to each frame of speech with a pre-trained face three-dimensional rendering model according to the predicted parameters;
extracting an eye mask according to the rendered face and the blink parameters, and encoding the eyes separately;
inputting the eye mask encoding and the three-dimensional face image together into a depth face rendering model to synthesize each frame of the speaking face image;
finally, stitching all the speaking face images together to output the final speaking face video.
The further improvement of the invention lies in that the obtaining of the reference video of the voice audio segment and the target face specifically comprises:
audio acquisition, namely acquiring and screening standard broadcast audio clips by a network or recording the audio clips in a recording room through acquisition equipment;
and video acquisition, wherein a single human face video clip is acquired through a network or the video clip comprising the front face and the side face is recorded through video acquisition equipment.
The invention is further improved in that the feature extraction is performed on the input voice audio, and comprises the following steps:
audio resampling, namely linearly resampling audio to align the audio with a video frame rate, encoding the resampled audio signal to form an array, and performing fast Fourier transform on the array;
the transformed code is convolved in a two-dimensional space consisting of a time domain and a frequency domain, and corresponding audio features are formed for each second of audio.
The further improvement of the invention is that the reference video of the target face is input into a three-dimensional face reconstruction model to obtain the three-dimensional data of the face, and the method comprises the following steps:
processing an input reference video into an image file according to frames;
performing face recognition on each reference image by using a convolutional neural network, and cutting the face image according to a recognition result;
performing face landmark extraction on the image, covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, to finally obtain the specific coordinates of 68 key points;
adopting a pre-trained face reconstruction method based on CNN to establish a face 3DMM model, wherein the process is as follows:
fitting the face image to a 3DMM model to obtain the target face three-dimensional parameters M = (F_i, F_t, F_e, F_l), where F_i, F_t, F_e and F_l respectively represent the identity, texture, expression and illumination feature vectors of the three-dimensional face;
and in the model training process, the obtained three-dimensional face model is projected to two dimensions, 68 key points are extracted again, the key points extracted from the original video image are used as supervision, and the landmark loss is formed so as to further train the three-dimensional face reconstruction model.
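The 3DMM fit described above treats the face geometry as a linear combination of basis vectors. A minimal numpy sketch of this linear model, with hypothetical basis dimensions chosen only for illustration (the patent does not specify them):

```python
import numpy as np

# Hypothetical sizes for illustration: V vertices, 80 identity and 64
# expression components (real 3DMMs such as BFM use similar orders).
V, N_ID, N_EXP = 1000, 80, 64

rng = np.random.default_rng(0)
mean_shape = rng.standard_normal(3 * V)          # mean face geometry
id_basis   = rng.standard_normal((3 * V, N_ID))  # identity basis (F_i)
exp_basis  = rng.standard_normal((3 * V, N_EXP)) # expression basis (F_e)

def reconstruct_shape(f_i, f_e):
    """Linear 3DMM: mean face plus identity and expression offsets."""
    return mean_shape + id_basis @ f_i + exp_basis @ f_e

shape = reconstruct_shape(np.zeros(N_ID), np.zeros(N_EXP))
# With all-zero coefficients the model returns the mean face.
assert np.allclose(shape, mean_shape)
```

Texture (F_t) and illumination (F_l) are handled analogously as separate linear bases in a full 3DMM.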
The invention further improves that inputting the audio features into the pre-trained human face depth generation model to obtain the predicted head pose, mouth shape and blink parameters, and comprises the following steps:
inputting processed audio features and human face three-dimensional parameters into a model, wherein the audio features are sent into the model by a sliding window with the size of T frames each time, a mouth shape feature generator generates corresponding mouth shape feature parameters for each frame in the window, and an implicit feature generator learns an implicit feature array for each window by combining time sequence features;
respectively splicing the implicit characteristic arrays with the mouth shape characteristics of each frame, and sending the mouth shape characteristics into a full connection layer to obtain human face three-dimensional parameters predicted according to the audio characteristics;
a pre-trained face depth generation model comprising:
the face depth generation model consists of a mouth shape feature generator, an implicit feature generator and a discriminator, wherein the implicit feature comprises facial expression, head posture and blink;
and in the model training process, the three-dimensional parameters of the real human face and the predicted three-dimensional parameters are sent to a discriminator to be discriminated, and discrimination loss is determined.
The invention further improves that the three-dimensional face image corresponding to each frame of the voice is rendered by utilizing a human face three-dimensional rendering model according to the obtained prediction parameters, and the method comprises the following steps:
adjusting the parameters of the face three-dimensional model according to the predicted face three-dimensional parameters to obtain recombined face three-dimensional parameters;
and rendering the recombined three-dimensional face parameters with the open-source PyTorch3D to obtain a rendered face image.
A further improvement of the invention is that an eye mask is extracted from the rendered face and the blink parameters, and the eyes are encoded separately, comprising:
determining the position of eyes according to the extracted key points of the face;
combining the predicted facial features to obtain an eye-blink AU value;
applying AU value normalization to the pixel values in the eye attention map.
The further improvement of the invention is that the eye mask coding and the three-dimensional face image are input into the depth face rendering model together to synthesize each frame of speaking face image, which comprises the following steps:
combining the eye coding with the rendered face image to form input information with four channels, wherein three channels are face images, and one channel is an eye mask;
placing the current frame in the center of a window with the size of 2N, inputting the current frame into a deep face rendering model, and outputting a target character image frame with time sequence by the model;
a deep face rendering model comprising:
the model consists of a generator G and a discriminator D, whose purpose is to ensure temporal consistency while preserving the character identity;
in the training process, the generated image and the real image are sent to a discriminator to be discriminated, so that the countermeasure loss is formed.
A further improvement of the invention is that all the speaking face images are stitched together to output the final speaking face video: all the synthesized images are stitched in sequence using ffmpeg to form the final speaking face video result.
A speaking face synthesis apparatus based on a generative adversarial network, comprising:
the data acquisition module is used for acquiring voice audio content of a speaking face and a reference video containing a face picture of a target person;
the audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
the human face three-dimensional reconstruction module is used for reconstructing the target character video by adopting a three-dimensional human face reconstruction model to obtain target human face three-dimensional parameters;
the three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
the face three-dimensional rendering module is used for obtaining a rendered face image according to the predicted face parameters and the face three-dimensional parameters by adopting a face three-dimensional rendering model;
the eye coding module modifies the eye pixel value according to the predicted face parameters to obtain eye codes and realize blinking;
the human face image depth generation module is used for obtaining a human face image by adopting a depth human face rendering model and combining eye coding to synthesize each frame of speaking human face image;
and the video synthesis module is used for splicing all the images by adopting ffmpeg and synthesizing the final talking face video result together with the audio.
The invention has at least the following beneficial technical effects:
the method for synthesizing the speaking face based on the generation countermeasure network, provided by the invention, carries out three-dimensional modeling on the face through a three-dimensional face reconstruction model, so that the generated result is superior to the result of two-dimensional image modeling in visual effect; meanwhile, the method realizes the learning and generation of implicit characteristics such as head posture, expression and the like through a face depth generation model, so that the true degree of a generated result is greatly improved; the method extracts the eye mask from the rendered face and the blink parameters, and codes the eyes independently to realize the natural blink of the speaker, thereby further improving the continuity and the natural degree of the synthetic effect.
The invention provides a device for synthesizing a speaking face based on a generative adversarial network, comprising eight modules: a data acquisition module, an audio processing module, a face three-dimensional reconstruction module, a three-dimensional face parameter prediction module, a face three-dimensional rendering module, an eye coding module, a face image depth generation module and a video synthesis module, which together complete all tasks from data acquisition to result synthesis and form a closed-loop system.
Drawings
FIG. 1 is a flow chart of synthesizing a speaking face based on a generative adversarial network according to the present invention;
FIG. 2 is a schematic flow chart of a three-dimensional face reconstruction module according to the present invention;
FIG. 3 is a schematic diagram of two-dimensional and three-dimensional human face key points in the present invention;
FIG. 4 is a schematic flow chart of a three-dimensional face parameter prediction module according to the present invention;
FIG. 5 is a schematic flow chart of a three-dimensional rendering module for human faces;
FIG. 6 is a schematic flow diagram of an eye encoding module;
FIG. 7 is a schematic flow diagram of a face image depth generation module;
FIG. 8 is a functional block diagram of the speaking face synthesis apparatus based on a generative adversarial network according to the present invention;
Fig. 9 is a schematic structural diagram of an electronic device for implementing the method for synthesizing a speaking face based on a generative adversarial network provided by the invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings, without in any way limiting the scope of the invention.
Referring to fig. 1, the invention provides a method for synthesizing a speaking face based on a generative adversarial network. First, the audio to be processed and a reference video of the target face are obtained; then features are extracted from the input audio; at the same time, three-dimensional face reconstruction is performed on the target face video; next, a three-dimensional face parameter prediction model is constructed to predict face parameters such as mouth shape, facial expression, head pose and blink; a face image corresponding to each frame of speech is rendered from these parameters by a face three-dimensional rendering network; to ensure natural blinking in the generated result, the eye region is extracted from the rendered face and encoded separately to change the eye pixel values; and a speaking face video is synthesized by a face image depth generation network combining the eye encoding and the face rendering result. The method specifically comprises the following steps:
1. capturing audio and video
Audio acquisition, namely acquiring standard broadcast audio clips by a network and recording the audio clips in a recording studio by a voice acquisition device;
video acquisition, wherein a single human face video clip is acquired through a network, and the video clip comprising a front face and a side face is recorded through video acquisition equipment;
2. extracting audio features
Linearly resample the audio to align with the 30 fps video using the open-source librosa library, encode the resampled audio signal to form an array, and perform a Fast Fourier Transform (FFT) on the array; on this basis, convolve the transformed audio encoding over the two-dimensional space formed by the time and frequency domains, using ReLU as the activation function, to finally form a 30 × 29 feature array as the audio features for each second of audio.
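The front end above can be sketched as follows. This is a simplified stand-in (plain numpy interpolation instead of librosa, an arbitrarily chosen resampling rate, and no learned convolution) that only illustrates how one 30 × 29 feature array per second of audio can arise:

```python
import numpy as np

def audio_features(signal, sr, fps=30, n_feat=29):
    """Toy audio front end: resample so each video frame owns one audio
    window, FFT each window, keep the first n_feat magnitude bins."""
    # Hypothetical target rate, chosen so windows divide evenly by fps.
    target_sr = fps * 2 * n_feat
    t_old = np.linspace(0, len(signal) / sr, len(signal), endpoint=False)
    n_new = int(len(signal) * target_sr / sr)
    t_new = np.linspace(0, len(signal) / sr, n_new, endpoint=False)
    resampled = np.interp(t_new, t_old, signal)   # linear resampling
    win = target_sr // fps                        # samples per video frame
    n_frames = len(resampled) // win
    windows = resampled[: n_frames * win].reshape(n_frames, win)
    # FFT magnitude per window, truncated to n_feat frequency bins.
    return np.abs(np.fft.rfft(windows, axis=1))[:, :n_feat]

sr = 16000
one_second = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 440 Hz tone, 1 s
feat = audio_features(one_second, sr)
assert feat.shape == (30, 29)  # 30 fps x 29-dim features per second
```

In the patent itself these magnitudes would additionally pass through time-frequency convolutions with ReLU activations before being used as features.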
3. Three-dimensional reconstruction of target face
As shown in fig. 2, the process of establishing a three-dimensional face model for a reference video includes face detection, face key point extraction, and three-dimensional face deformation statistical model construction, and specifically includes the following steps:
s3-1, detecting the face position: detect the face region in the video frame with the Haar cascade detector in OpenCV, obtain the four coordinates (x, y, w, h), and crop the image;
s3-2, extracting face key points: detect the facial feature points with Dlib from the face coordinates, mainly covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, to finally obtain the specific coordinates (x_i, y_i), i = 0, 1, …, 67, of each point; the feature point extraction result is shown in fig. 3-a);
s3-3, three-dimensional face reconstruction model: establish the face 3DMM model M = (F_i, F_t, F_e) with a deep network over multiple video frames, where F_i, F_t and F_e denote the identity, texture and expression feature vectors; the model differs from a conventional one in that expression features are added to further improve the naturalness of the character's expression;
s3-4, during training, project the constructed three-dimensional model onto a two-dimensional plane, then extract the face key points from the projection result with the method of step S3-2; the feature point extraction result is shown in fig. 3-b). In training the three-dimensional face model, the result of step S3-2 is used as ground truth to form a landmark loss and further train the model.
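The landmark loss of step S3-4 compares projected 3D keypoints against the 2D keypoints detected in the original frame. A minimal sketch under a simple pinhole-projection assumption (the patent does not specify the projection model):

```python
import numpy as np

def project(points3d, focal=1.0):
    """Pinhole projection of (N, 3) points onto the image plane."""
    return focal * points3d[:, :2] / points3d[:, 2:3]

def landmark_loss(pred3d, gt2d, focal=1.0):
    """Mean squared distance between projected 3D keypoints and the 2D
    keypoints extracted from the original video frame (the supervision)."""
    diff = project(pred3d, focal) - gt2d
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# 68 hypothetical keypoints placed at depth 2 in front of the camera.
rng = np.random.default_rng(1)
kp3d = np.concatenate([rng.standard_normal((68, 2)),
                       np.full((68, 1), 2.0)], axis=1)
gt = project(kp3d)
assert landmark_loss(kp3d, gt) == 0.0      # exact fit -> zero loss
assert landmark_loss(kp3d, gt + 0.1) > 0.0 # any offset is penalized
```

During training this scalar would be minimized jointly with the model's other losses so that the reconstructed mesh reprojects onto the detected landmarks.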
4. Constructing a face generation model
Referring to fig. 4, the face generation model inputs the audio features into the pre-trained face depth generation model to obtain the predicted face parameters (head pose, mouth shape, blink, etc.). It comprises a mouth-shape synchronization generator G_lip, a temporal generator G_tem and a discriminator D.
S4-1, feed the audio features α extracted in step 2 into the model through a sliding window of T frames at a time, (α_t, α_{t+1}, …, α_{t+T});
S4-2, G_lip learns the explicit features in the audio, i.e. the mouth-shape feature vectors corresponding to the audio: it takes the audio feature α_t of each frame as input and outputs the lip-shape feature code l_t of that frame;
S4-3, G_tem learns and outputs the implicit features of the audio. Its aim is to learn the context of the audio and find the implicit emotional features that guide the generation of head motion, facial expression and the like in subsequent steps; it takes all audio features in each sliding window and a set initial state s_0 as input, and outputs the corresponding implicit feature code s_t for each window;
S4-4, connect s_t with each of (l_t, l_{t+1}, …, l_{t+T}) and feed the result into a fully connected layer to obtain the face three-dimensional parameters Ŷ_t predicted from the audio features;
S4-5, during training, use Y_t as ground truth supervision and send Ŷ_t to the discriminator for discrimination to determine the discrimination loss.
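The window-level fusion of step S4-4 can be sketched as follows; the feature dimensions and the single untrained linear layer are hypothetical stand-ins for the generators' actual outputs:

```python
import numpy as np

T, D_LIP, D_IMP, D_OUT = 5, 32, 16, 64  # hypothetical sizes for illustration

rng = np.random.default_rng(0)
W = rng.standard_normal((D_LIP + D_IMP, D_OUT)) * 0.01  # FC layer weights
b = np.zeros(D_OUT)                                     # FC layer bias

def predict_window(lip_codes, implicit_code):
    """Concatenate the per-window implicit code s_t with each frame's lip
    code l_t, then apply the fully connected layer to obtain the per-frame
    predicted face parameters."""
    s = np.broadcast_to(implicit_code, (lip_codes.shape[0], D_IMP))
    joint = np.concatenate([lip_codes, s], axis=1)  # (T, D_LIP + D_IMP)
    return joint @ W + b                            # (T, D_OUT)

lips = rng.standard_normal((T, D_LIP))  # l_t ... l_{t+T} from G_lip
s_t = rng.standard_normal(D_IMP)        # one implicit code per window, G_tem
params = predict_window(lips, s_t)
assert params.shape == (T, D_OUT)
```

In the patent, the resulting Ŷ_t is then compared with the ground-truth Y_t and passed to the discriminator.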
5. Rendering face images
As shown in fig. 5, the rendering facial image module renders a three-dimensional facial image corresponding to each frame of speech by using a three-dimensional rendering model of a human face according to the obtained prediction parameters, and specifically includes the following steps:
s5-1, adjusting the parameters of the face three-dimensional model according to the face three-dimensional parameters predicted for each frame to obtain recombined face three-dimensional parameters;
and S5-2, rendering the recombined three-dimensional face parameters with the open-source PyTorch3D to obtain the rendered face image.
6. Eye mask coding
As shown in fig. 6, the eye coding part extracts an eye mask through rendered human faces and blinking parameters to realize individual eye coding, and specifically includes the following steps:
s6-1, locate the eye-region pixels from the face key points of step 3 and generate the eye mask. A threshold z is set to delimit the size of the eye region, which must satisfy (p_x − c_x)²/4 + (p_y − c_y)² < z, where p and c are the eye vertex and the eye center respectively;
s6-2, obtain the eye-blink AU (Action Unit) value from the predicted eye motion information;
s6-3, apply AU value normalization to the pixel values in the eye attention map to realize the blinking behavior.
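Steps S6-1 to S6-3 can be sketched in numpy. The mask inequality follows the formula above; the AU range and the linear pixel scaling are illustrative assumptions, since the patent does not specify the normalization:

```python
import numpy as np

def eye_mask(h, w, center, z):
    """Elliptical eye region: (p_x - c_x)^2 / 4 + (p_y - c_y)^2 < z."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return ((xs - cx) ** 2 / 4 + (ys - cy) ** 2) < z

def apply_blink(attention_map, mask, au_value, au_max=5.0):
    """Scale eye pixels by a normalized blink AU value: au near au_max
    keeps the eye fully open, au near 0 darkens (closes) the eye region."""
    out = attention_map.copy()
    out[mask] *= au_value / au_max
    return out

eye = eye_mask(64, 64, center=(32, 32), z=50.0)
attn = np.ones((64, 64))
closed = apply_blink(attn, eye, au_value=0.0)
assert closed[32, 32] == 0.0  # eye center darkened when blinking
assert closed[0, 0] == 1.0    # pixels outside the mask untouched
```

The modified attention map is what gets concatenated with the rendered face as the fourth input channel in the next section.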
7. Face image depth generation
As shown in fig. 7, the face image depth generating section inputs the eye mask code and the three-dimensional face image together into the depth face rendering model to synthesize each frame of speaking face image, and specifically includes the following steps:
s7-1, combine the eye mask from step 6 with the face image rendered in step 5 to form input information of size W × H × 4, where three channels are the face image and one channel is the eye mask;
s7-2, construct the depth face rendering model. The model consists of a generator G_render and a discriminator D_render, whose purpose is to ensure temporal consistency while preserving the character identity. A window of size 2N_w is used, with the current frame placed at its center, and fed into the video rendering network; G_render takes the tensor X_t of size W × H × 4 × 2N_w as input and outputs the target character image frame G_render(X_t);
during training, G_render(X_t) and the real image I are fed into D_render for discrimination, forming the adversarial loss.
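Assembling the W × H × 4 × 2N_w window tensor X_t can be sketched as follows; the frame size, window length and channel order are illustrative choices:

```python
import numpy as np

H, W, N_W = 64, 64, 3  # hypothetical frame size and half-window length

def make_window_input(frames_rgb, eye_masks, t):
    """Stack 2*N_W four-channel frames (RGB face render + eye mask) into
    the W x H x 4 x 2*N_W input tensor, centred on the current frame t."""
    idx = range(t - N_W, t + N_W)  # window of 2*N_W frames around t
    chans = [np.concatenate([frames_rgb[i], eye_masks[i][..., None]],
                            axis=-1)          # each frame -> (H, W, 4)
             for i in idx]
    return np.stack(chans, axis=-1)           # (H, W, 4, 2*N_W)

rng = np.random.default_rng(0)
frames = rng.random((10, H, W, 3))  # rendered face images
masks = rng.random((10, H, W))      # per-frame eye attention maps
x_t = make_window_input(frames, masks, t=5)
assert x_t.shape == (H, W, 4, 2 * N_W)
```

G_render would consume one such tensor per output frame, which is how temporal consistency across neighbouring frames is encouraged.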
8. Video result synthesis
All the speaking face images are stitched together to output the final speaking face video: the images synthesized in step 7 are stitched in sequence using ffmpeg and combined with the audio to form the final speaking face video result.
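A sketch of the ffmpeg invocation such a stitching step might use. The file names are placeholders, and the exact flags are an assumption; the patent only states that ffmpeg stitches the frames in sequence and muxes the audio:

```python
import subprocess

def mux_command(frame_pattern, audio_path, out_path, fps=30):
    """Build the ffmpeg command that stitches numbered frame images and
    muxes in the original audio (standard ffmpeg options)."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", frame_pattern,    # e.g. face_%04d.png, numbered in order
        "-i", audio_path,
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",  # widely compatible pixel format
        "-shortest",            # stop at the shorter of video/audio
        out_path,
    ]

cmd = mux_command("face_%04d.png", "speech.wav", "talking_face.mp4")
# To actually run it (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
assert cmd[0] == "ffmpeg" and cmd[-1] == "talking_face.mp4"
```

`-shortest` matches the requirement that the video end with the audio clip rather than padding with frozen frames.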
Based on the method for synthesizing a speaking face based on a generative adversarial network provided by the application, the application also provides a speaking face synthesis device based on a generative adversarial network, which comprises a data acquisition module, an audio processing module, a face three-dimensional reconstruction module, a three-dimensional face parameter prediction module, a face three-dimensional rendering module, an eye coding module, a face image depth generation module and a video synthesis module:
the 8-1 data acquisition module is used for acquiring the voice audio content of the speaking face and a reference video containing a face picture of a target person;
the 8-2 audio processing module is used for processing the acquired audio data, and comprises the steps of noise reduction, resampling and feature extraction of the audio;
8-3, a human face three-dimensional reconstruction module, which reconstructs a target character video by adopting a three-dimensional human face reconstruction model to obtain a target human face three-dimensional parameter;
the 8-4 three-dimensional face parameter prediction module is used for reading audio features and obtaining predicted face parameters by adopting a pre-trained face depth generation model;
8-5, a face three-dimensional rendering module, which adopts a face three-dimensional rendering model and obtains a rendered face image according to the predicted face parameters and the face three-dimensional parameters;
8-6 eye coding module, modifying the eye pixel value according to the predicted face parameter to obtain eye coding, and realizing blinking;
8-7, a face image depth generation module, which obtains a face image with the depth face rendering model and, combined with the eye encoding, synthesizes each frame of the speaking face image;
and the 8-8 video synthesis module is used for splicing all the images by using ffmpeg and synthesizing the final talking face video result together with the audio.
Based on the method for synthesizing a speaking face based on a generative adversarial network, the application also provides an electronic device, comprising:
9-1 at least one processor;
9-2 and a memory communicatively coupled to the processor;
wherein the memory stores computer execution instructions;
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of any of the methods presented herein.
Based on the method for synthesizing a speaking face based on a generative adversarial network provided by the application, the application also provides a storage medium, which is a computer-readable storage medium, for example a memory containing a computer program; the computer program can be executed by a processor to complete the steps in the aforementioned method for synthesizing a speaking face based on a generative adversarial network.
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Therefore, the above description is only an exemplary embodiment of the present application, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principle of the present application belong to the protection scope of the present invention.
Claims (10)
1. A method for synthesizing a speaking face based on a generative adversarial network, characterized by comprising the following steps:
acquiring a speech audio clip and a reference video, recorded by a real person, that contains the target face;
carrying out noise reduction, resampling and feature extraction on the input speech audio to obtain audio features;
inputting the reference video of the target face into a pre-trained three-dimensional face reconstruction model to obtain three-dimensional face data;
inputting the audio features into a face depth generation model to obtain predicted head pose, expression, mouth shape and blink parameters;
rendering a three-dimensional face image corresponding to each frame of speech with a pre-trained face three-dimensional rendering model according to the predicted parameters;
extracting an eye mask according to the rendered face and the blink parameters, and encoding the eyes separately;
inputting the eye mask encoding and the three-dimensional face image together into a depth face rendering model to synthesize each frame of the speaking face image;
finally, stitching all the speaking face images together to output the final speaking face video.
2. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein obtaining the speech audio segment and the reference video of the target face specifically comprises:
audio acquisition, namely acquiring and screening standard broadcast audio clips by a network or recording the audio clips in a recording studio through acquisition equipment;
and video acquisition, wherein a single human face video clip is acquired through a network or the video clip comprising the front face and the side face is recorded through video acquisition equipment.
3. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein the feature extraction of the input speech audio comprises:
audio resampling, namely, linearly resampling the audio to align the audio with the video frame rate, encoding the resampled audio signal to form an array, and performing fast Fourier transform on the array;
the transformed code is convolved in a two-dimensional space consisting of a time domain and a frequency domain, and corresponding audio features are formed for each second of audio.
4. The method for synthesizing a speaking face based on a generative adversarial network as claimed in claim 1, wherein inputting the reference video of the target face into a three-dimensional face reconstruction model to obtain three-dimensional face data comprises:
processing an input reference video into an image file by frames;
performing face recognition on each reference image by using a convolutional neural network, and cutting the face image according to a recognition result;
performing face landmark extraction on the image, covering the left eyebrow, right eyebrow, left eye, right eye, nose, mouth and chin, to finally obtain the specific coordinates of 68 key points;
adopting a pre-trained face reconstruction method based on CNN to establish a face 3DMM model, wherein the process is as follows:
fitting the face image to a 3DMM model to obtain the target face three-dimensional parameters M = (F_i, F_t, F_e, F_l), where F_i, F_t, F_e and F_l respectively represent the identity, texture, expression and illumination feature vectors of the three-dimensional face;
and in the model training process, the obtained three-dimensional face model is projected to two dimensions, 68 key points are extracted again, the key points extracted from the original video image are used as supervision, and the landmark loss is formed so as to further train the three-dimensional face reconstruction model.
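The landmark supervision described in claim 4 can be sketched as below. The orthographic camera model (scale plus 2-D offset) and the unweighted L2 distance are assumptions for illustration; the patent does not specify the projection or the exact loss form.

```python
import numpy as np

def landmark_loss(pred_3d_keypoints, camera, gt_2d, weights=None):
    """Sketch of the claim-4 supervision: project the 68 reconstructed 3-D
    keypoints to the image plane and penalise their squared distance to
    the 68 landmarks detected on the original video frame."""
    scale, tx, ty = camera
    # assumed orthographic projection: drop z, scale, translate -> (68, 2)
    proj = pred_3d_keypoints[:, :2] * scale + np.array([tx, ty])
    if weights is None:
        weights = np.ones(len(gt_2d))
    # mean weighted squared distance over the 68 keypoints
    return float(np.mean(weights * np.sum((proj - gt_2d) ** 2, axis=1)))

# toy check: predicted points at the origin, ground truth at (1, 1)
kp3d = np.zeros((68, 3))
loss = landmark_loss(kp3d, camera=(1.0, 0.0, 0.0), gt_2d=np.ones((68, 2)))
```

In practice the mouth and eye keypoints are often weighted more heavily than the jaw contour, which the `weights` argument would allow.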
5. The method of claim 1, wherein inputting the audio features into a pre-trained face depth generation model to obtain predicted head pose, mouth shape and blink parameters comprises:
feeding the processed audio features and the three-dimensional face parameters into the model, wherein the audio features are passed in through a sliding window of T frames: a mouth-shape feature generator produces mouth-shape feature parameters for each frame in the window, while an implicit feature generator, exploiting the temporal structure, learns one implicit feature array per window;
concatenating the implicit feature array with each frame's mouth-shape features, and passing the merged features through a fully connected layer to obtain the three-dimensional face parameters predicted from the audio features;
the pre-trained face depth generation model comprising:
a mouth-shape feature generator, an implicit feature generator and a discriminator, the implicit features covering facial expression, head pose and blinking;
and, during training, feeding the real and predicted three-dimensional face parameters to the discriminator and computing the discrimination loss.
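The per-window fusion of claim 5 can be sketched as below. The two generators are stubbed as trivial projections and all dimensions are assumed; only the data flow (per-frame mouth features, one implicit vector per window, concatenation, fully connected layer) follows the claim.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_window(audio_feats, W_fc):
    """Sketch of the claim-5 fusion: a per-window implicit feature is tiled
    and concatenated with each frame's mouth feature, then a fully
    connected layer maps the merged vector to face parameters."""
    T, d = audio_feats.shape
    mouth = audio_feats * 0.5                 # stub mouth-shape generator, (T, d)
    implicit = audio_feats.mean(axis=0)       # stub implicit generator: one vector per window
    # tile the window-level implicit vector onto every frame, then concatenate
    merged = np.concatenate([mouth, np.tile(implicit, (T, 1))], axis=1)  # (T, 2d)
    return merged @ W_fc                      # fully connected layer -> (T, n_params)

T, d, n_params = 5, 8, 64  # assumed sizes: window length, feature dim, 3DMM parameter dim
params = predict_window(rng.normal(size=(T, d)), rng.normal(size=(2 * d, n_params)))
```

The key design point is that the implicit vector is shared across the window, so head pose and blinks vary smoothly while the mouth shape tracks each audio frame.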
6. The method as claimed in claim 1, wherein rendering, with the three-dimensional face rendering model, a three-dimensional face image for each frame of speech from the obtained prediction parameters comprises:
adjusting the parameters of the three-dimensional face model according to the predicted three-dimensional face parameters to obtain recombined three-dimensional face parameters;
and rendering the recombined three-dimensional face parameters with the open-source PyTorch3D to obtain the rendered face image.
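The recombination step of claim 6 amounts to keeping the reconstructed identity, texture and lighting while overwriting the audio-predicted parameters. A minimal sketch, with assumed key names and with the PyTorch3D rendering call itself omitted:

```python
import numpy as np

def recombine_params(reconstructed, predicted, keys=("expression", "pose", "blink")):
    """Sketch of the claim-6 recombination: identity/texture/illumination come
    from the reconstruction, while expression, head pose and blink come from
    the audio-driven prediction.  The merged dict would then be passed to a
    PyTorch3D renderer (not shown)."""
    out = dict(reconstructed)          # start from the reconstructed parameters
    for k in keys:
        if k in predicted:
            out[k] = predicted[k]      # overwrite only the predicted components
    return out

recon = {"identity": np.zeros(80), "texture": np.zeros(80), "expression": np.zeros(64)}
pred = {"expression": np.ones(64), "pose": np.zeros(6)}
merged = recombine_params(recon, pred)
```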
7. The method of claim 1, wherein extracting an eye mask from the rendered face and the blink parameters, and encoding the eyes separately, comprises:
locating the eye region from the extracted facial keypoints;
deriving the blink AU (action unit) value from the predicted face features;
and normalizing the AU value and applying it to the pixel values of the eye attention map.
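The eye-coding step of claim 7 can be sketched as a single scaling of the eye-region mask by the normalized AU value. The AU range (0 to 5, as in FACS intensity scoring) is an assumption; the patent only states that the value is normalized.

```python
import numpy as np

def encode_eyes(mask, au_value, au_max=5.0):
    """Sketch of the claim-7 eye coding: the predicted blink AU value is
    normalised to [0, 1] and written into the pixels of the eye-region
    attention map, so one extra channel tells the renderer how far the
    eyes are closed."""
    norm = np.clip(au_value / au_max, 0.0, 1.0)
    return mask.astype(np.float32) * norm  # eye pixels carry the blink strength

mask = np.zeros((8, 8))
mask[2:4, 2:6] = 1.0                      # toy eye-region mask
eye_code = encode_eyes(mask, au_value=2.5)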
8. The method as claimed in claim 1, wherein inputting the eye-mask code together with the three-dimensional face image into the deep face rendering model to synthesize each frame of the talking face comprises:
combining the eye code with the rendered face image into a four-channel input, three channels being the face image and one channel the eye mask;
placing the current frame at the center of a window of size 2N, feeding it to the deep face rendering model, and having the model output temporally consistent image frames of the target person;
the deep face rendering model comprising:
a generator G and a discriminator D, designed to ensure temporal consistency while preserving the person's identity;
and, during training, feeding the generated and real images to the discriminator to form the adversarial loss.
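The input assembly of claim 8 can be sketched as below: each rendered RGB frame is stacked with its one-channel eye code, and the current frame is centred in a 2N-frame window. Edge clamping and N = 2 are assumptions for illustration.

```python
import numpy as np

def build_generator_input(face_frames, eye_codes, t, n=2):
    """Sketch of the claim-8 input: stack each (H, W, 3) rendered face with
    its (H, W) eye code into a 4-channel image, and gather a window of
    2N frames centred on the current frame t for temporal context."""
    T = len(face_frames)
    idx = np.clip(np.arange(t - n, t + n), 0, T - 1)  # 2N frame indices, clamped at edges
    stacked = [np.concatenate([face_frames[i], eye_codes[i][..., None]], axis=-1)
               for i in idx]                          # each (H, W, 4)
    return np.stack(stacked)                          # (2N, H, W, 4) generator input

frames = np.zeros((10, 32, 32, 3))  # toy rendered face sequence
eyes = np.zeros((10, 32, 32))       # toy eye codes
inp = build_generator_input(frames, eyes, t=5, n=2)
```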
9. The method as claimed in claim 1, wherein outputting the final face video comprises splicing all the synthesized face images together with ffmpeg and combining them with the audio to form the final talking face video result.
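The ffmpeg splicing step of claim 9 might look like the invocation built below. File names, frame rate and codec choices are placeholders; the patent only names ffmpeg as the splicing tool.

```python
def ffmpeg_mux_command(frame_pattern, audio_path, out_path, fps=25):
    """Sketch of the claim-9 assembly: an ffmpeg invocation that turns the
    numbered face images plus the input speech into the final video."""
    return ["ffmpeg", "-y",
            "-framerate", str(fps), "-i", frame_pattern,  # image sequence in
            "-i", audio_path,                             # speech track in
            "-c:v", "libx264", "-pix_fmt", "yuv420p",     # widely playable H.264 video
            "-c:a", "aac", "-shortest",                   # stop at the shorter stream
            out_path]

cmd = ffmpeg_mux_command("face_%04d.png", "speech.wav", "talking_face.mp4")
```

The command would typically be run with `subprocess.run(cmd, check=True)` once all frames have been written to disk.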
10. A talking-face synthesis apparatus based on a generative adversarial network, comprising:
a data acquisition module for acquiring the speech audio content of the talking face and a reference video containing the face of the target person;
an audio processing module for processing the acquired audio data, including noise reduction, resampling and feature extraction;
a three-dimensional face reconstruction module for reconstructing the target person's video with a three-dimensional face reconstruction model to obtain the target's three-dimensional face parameters;
a three-dimensional face parameter prediction module for reading the audio features and obtaining predicted face parameters with a pre-trained face depth generation model;
a three-dimensional face rendering module for obtaining a rendered face image from the predicted and reconstructed three-dimensional face parameters with a face rendering model;
an eye coding module for modifying eye pixel values according to the predicted face parameters to obtain the eye code and realize blinking;
a face image depth generation module for synthesizing each frame of the talking face with a deep face rendering model combined with the eye code;
and a video synthesis module for splicing all the images with ffmpeg and combining the spliced images with the audio into the final talking face video result.
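The modules of claim 10 chain into one pipeline. A minimal orchestration sketch, where every module is stubbed as a callable and all names are hypothetical:

```python
def synthesize_talking_face(audio, video, modules):
    """Sketch of the claim-10 data flow; `modules` maps hypothetical module
    names to callables standing in for the apparatus modules."""
    feats = modules["audio"](audio)            # noise reduction, resampling, features
    face3d = modules["reconstruct"](video)     # 3DMM parameters of the target face
    pred = modules["predict"](feats)           # predicted pose / mouth / blink parameters
    rendered = modules["render"](pred, face3d) # per-frame rendered 3-D faces
    eyes = modules["eyes"](pred)               # blink-coded eye masks
    frames = modules["generate"](rendered, eyes)  # GAN-refined talking-face frames
    return modules["mux"](frames, audio)       # spliced with the audio via ffmpeg

# identity stubs so the wiring can be exercised without any models
demo = {k: (lambda *a: a[0]) for k in
        ["audio", "reconstruct", "predict", "render", "eyes", "generate", "mux"]}
result = synthesize_talking_face("speech.wav", "ref.mp4", demo)
```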
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211493192.5A CN115908659A (en) | 2022-11-25 | 2022-11-25 | Method and device for synthesizing speaking face based on generation countermeasure network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115908659A true CN115908659A (en) | 2023-04-04 |
Family
ID=86474412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211493192.5A Pending CN115908659A (en) | 2022-11-25 | 2022-11-25 | Method and device for synthesizing speaking face based on generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115908659A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116152447A (en) * | 2023-04-21 | 2023-05-23 | 科大讯飞股份有限公司 | Face modeling method and device, electronic equipment and storage medium |
CN116152447B (en) * | 2023-04-21 | 2023-09-26 | 科大讯飞股份有限公司 | Face modeling method and device, electronic equipment and storage medium |
CN116233567A (en) * | 2023-05-05 | 2023-06-06 | 山东建筑大学 | Speaker face video generation method and system based on audio emotion perception |
CN117036555A (en) * | 2023-05-18 | 2023-11-10 | 无锡捷通数智科技有限公司 | Digital person generation method and device and digital person generation system |
CN116402928A (en) * | 2023-05-26 | 2023-07-07 | 南昌航空大学 | Virtual talking digital person generating method |
CN116402928B (en) * | 2023-05-26 | 2023-08-25 | 南昌航空大学 | Virtual talking digital person generating method |
CN117292030A (en) * | 2023-10-27 | 2023-12-26 | 海看网络科技(山东)股份有限公司 | Method and system for generating three-dimensional digital human animation |
CN117593442A (en) * | 2023-11-28 | 2024-02-23 | 拓元(广州)智慧科技有限公司 | Portrait generation method based on multi-stage fine grain rendering |
CN117593442B (en) * | 2023-11-28 | 2024-05-03 | 拓元(广州)智慧科技有限公司 | Portrait generation method based on multi-stage fine grain rendering |
CN117729298A (en) * | 2023-12-15 | 2024-03-19 | 北京中科金财科技股份有限公司 | Photo driving method based on action driving and mouth shape driving |
CN117729298B (en) * | 2023-12-15 | 2024-06-21 | 北京中科金财科技股份有限公司 | Photo driving method based on action driving and mouth shape driving |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115908659A (en) | Method and device for synthesizing speaking face based on generation countermeasure network | |
Guo et al. | Ad-nerf: Audio driven neural radiance fields for talking head synthesis | |
CN113378697A (en) | Method and device for generating speaking face video based on convolutional neural network | |
US11393150B2 (en) | Generating an animation rig for use in animating a computer-generated character based on facial scans of an actor and a muscle model | |
CN114419702B (en) | Digital person generation model, training method of model, and digital person generation method | |
US20240212252A1 (en) | Method and apparatus for training video generation model, storage medium, and computer device | |
CN115914505A (en) | Video generation method and system based on voice-driven digital human model | |
Shen et al. | Sd-nerf: Towards lifelike talking head animation via spatially-adaptive dual-driven nerfs | |
CN116993948B (en) | Face three-dimensional reconstruction method, system and intelligent terminal | |
Peng et al. | Synctalk: The devil is in the synchronization for talking head synthesis | |
CN117115331A (en) | Virtual image synthesizing method, synthesizing device, equipment and medium | |
CN117409121A (en) | Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving | |
US20220076409A1 (en) | Systems and Methods for Building a Skin-to-Muscle Transformation in Computer Animation | |
CN115984452A (en) | Head three-dimensional reconstruction method and equipment | |
Zeng et al. | Ultra-low bit rate facial coding hybrid model based on saliency detection | |
CN116402928B (en) | Virtual talking digital person generating method | |
US11158103B1 (en) | Systems and methods for data bundles in computer animation | |
US11875504B2 (en) | Systems and methods for building a muscle-to-skin transformation in computer animation | |
CN117153195B (en) | Method and system for generating speaker face video based on adaptive region shielding | |
Shen et al. | Talking Head Generation Based on 3D Morphable Facial Model | |
CN117372585A (en) | Face video generation method and device and electronic equipment | |
Yang et al. | Deep learning-based 3D face reconstruction method for video stream | |
CN117557695A (en) | Method and device for generating video by driving single photo through audio | |
CN114972874A (en) | Three-dimensional human body classification and generation method and system for complex action sequence | |
WO2022055366A1 (en) | Systems and methods for building a skin-to-muscle transformation in computer animation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||