CN113470170A - Real-time video face region space-time consistent synthesis method using voice information - Google Patents

Real-time video face region space-time consistent synthesis method using voice information

Info

Publication number
CN113470170A
Authority
CN
China
Prior art keywords
face
identity
network
information
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110750794.3A
Other languages
Chinese (zh)
Inventor
曾鸣 (Zeng Ming)
刘鹏飞 (Liu Pengfei)
邓文晋 (Deng Wenjin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110750794.3A
Publication of CN113470170A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique

Abstract

A real-time, spatio-temporally consistent synthesis method for the video face region using voice information, relating to deep learning and three-dimensional face reconstruction. A three-dimensional face reconstruction algorithm extracts face identity, face shape, face pose and face texture information from the visual features, while a deep learning model extracts facial expression information from the audio features; fusing the visual and auditory information enriches the facial expressions that the neural network can synthesize and allows a talking-face video consistent with the current speech content to be synthesized quickly and accurately. Reference face identity parameters are introduced so that the identity appearance of consecutive output video frames is constrained to remain consistent. Temporal context information and smoothness constraints are introduced to suppress texture jitter, so the face generation algorithm is suitable for video. By adopting a simplified neural network structure, the method can generate a talking-face video or remove face occlusions in real time, and can be applied to security monitoring, video conferencing, virtual avatars, animation driving and other fields.

Description

Real-time video face region space-time consistent synthesis method using voice information
Technical Field
The invention relates to the technical fields of deep learning, three-dimensional face reconstruction and face synthesis, and in particular to a real-time, spatio-temporally consistent synthesis method for the video face region using voice information.
Background
Traditional face region generation algorithms are limited to a single face image and have the following problems: (1) the input information is single-modality, so the expression of the face region in the image cannot be determined and a face video consistent with the speech cannot be synthesized; (2) identity information constraints are lacking, so it cannot be guaranteed that the same person still looks like the same person after the face region is synthesized under different poses and expressions, and the identity appearance is inconsistent across frames when the method is applied to video; (3) correlations and constraints between temporal frames are lacking, so texture jitter occurs when the method is applied to video and the results are poor; (4) the required network structure is complex, the consumption of computing resources is huge, and the inference time is too long to meet real-time requirements.
Disclosure of Invention
The invention aims to address the above problems in the prior art by providing a real-time, spatio-temporally consistent synthesis method for the video face region using voice information. The method extracts face identity, face shape, face pose and face texture information from the visual features with a three-dimensional face reconstruction algorithm, extracts facial expression information from the audio features with deep learning, and fuses the visual and auditory information, which enriches the facial expressions the neural network can synthesize and allows a talking-face video consistent with the current speech content to be synthesized quickly and accurately.
The invention comprises the following steps:
S1: manually selecting a face identity reference image, and extracting face identity parameters and face texture parameters of the face identity reference image using a three-dimensional face reconstruction technique;
S2: for each frame of the real-time video stream, extracting the corresponding face pose parameters and face shape parameters using the three-dimensional face reconstruction technique;
S3: extracting, from the real-time audio stream, the audio features corresponding to each video frame, inputting the audio features to a first network, and outputting the facial expression parameters corresponding to each frame of the video stream;
S4: inputting the face identity parameters, face pose parameters, face shape parameters, face texture parameters and facial expression parameters, and rendering the corresponding three-dimensional face model image using a three-dimensional face model rendering technique;
S5: inputting to a second network the original video frame, the three-dimensional face rendering, the face identity reference image and the previous frame synthesized by the second network, and outputting the image with the synthesized face region.
In step S1, the three-dimensional face reconstruction technique is a 3D morphable model (3DMM) of the face.
In step S3, the first network is a facial expression estimation network; the facial expression estimation network is divided into an audio feature extraction module and a facial expression parameter regression module.
In step S4, the three-dimensional face model rendering technique is driven by the voice.
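For illustration, the parameter-to-mesh step of a generic 3DMM renderer could be sketched as follows; the basis matrices, parameter dimensions and the use of a weak-perspective projection are assumptions made for this sketch and are not specified by the invention.

```python
import numpy as np

def reconstruct_and_project(mean_shape, id_basis, exp_basis,
                            id_params, exp_params,
                            rotation, translation, scale):
    """Sketch: 3DMM vertex reconstruction followed by a weak-perspective projection.

    mean_shape: (3N,) mean face; id_basis: (3N, K_id); exp_basis: (3N, K_exp).
    The basis sizes and the projection model are illustrative assumptions.
    """
    # Linear 3DMM: the mean face deformed by identity/shape and expression components.
    vertices = mean_shape + id_basis @ id_params + exp_basis @ exp_params
    vertices = vertices.reshape(-1, 3)                     # (N, 3)

    # Rigid pose from the face pose parameters: rotation, translation, global scale.
    posed = scale * (vertices @ rotation.T) + translation  # (N, 3)

    # Weak-perspective projection to image coordinates (drop the depth axis).
    return posed[:, :2]
```

The face texture parameters would then colour the projected mesh during rasterisation to produce the rendered image used as prior information in step S5.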
In step S5, the second network is a face texture synthesis network implemented as a generative adversarial network; the face texture synthesis network comprises a face identity encoding module, a face identity code projection module, a face texture synthesis module and a discriminator module.
The constraints used by the face texture synthesis network comprise a temporal consistency constraint, a discriminator constraint and a face identity consistency constraint.
The face identity consistency constraint ensures that the identity appearance of successively generated faces remains consistent; specifically, the face identity code is injected into the face texture synthesis module using adaptive instance normalization, and the synthesis result is constrained accordingly.
The temporal consistency constraint ensures that the generated face texture changes naturally from frame to frame; specifically, the synthesis result of the previous frame is fed into the face texture synthesis module, and the texture jitter between adjacent frames is constrained.
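For illustration only, a minimal PyTorch-style skeleton of such a texture synthesis generator, taking the inputs named above (original frame, 3DMM rendering, identity reference image, previous synthesized frame) and conditioned on a face identity code through adaptive instance normalization, could look as follows; the layer sizes, channel counts and module names are assumptions, not the actual network of the invention.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: scale/shift features using an identity code."""
    def __init__(self, channels, id_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(id_dim, channels * 2)   # predicts per-channel gamma, beta

    def forward(self, x, id_code):
        gamma, beta = self.affine(id_code).chunk(2, dim=1)
        x = self.norm(x)
        return x * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]

class TextureGenerator(nn.Module):
    def __init__(self, id_dim=80):
        super().__init__()
        # Inputs concatenated channel-wise: original frame, 3DMM render,
        # identity reference image, previous synthesized frame (4 x RGB = 12 channels).
        self.encode = nn.Sequential(nn.Conv2d(12, 64, 7, 1, 3), nn.ReLU(),
                                    nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        self.adain = AdaIN(128, id_dim)
        self.decode = nn.Sequential(nn.Upsample(scale_factor=2),
                                    nn.Conv2d(128, 64, 3, 1, 1), nn.ReLU(),
                                    nn.Conv2d(64, 3, 7, 1, 3), nn.Tanh())

    def forward(self, frame, render, id_ref, prev_out, id_code):
        x = torch.cat([frame, render, id_ref, prev_out], dim=1)
        x = self.encode(x)
        x = self.adain(x, id_code)          # identity consistency conditioning
        return self.decode(x)
```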
The invention fuses visual information (face shape and pose) with auditory information (audio features); this multi-modal fusion achieves information complementarity and allows the algorithm to synthesize face videos with rich expressions that are consistent with the speech content. Regarding identity consistency, the invention introduces reference face identity parameters so that the identity appearance of consecutive output video frames is constrained to remain consistent. Temporal context information and smoothness constraints are also introduced, which effectively suppress texture jitter and make the face generation algorithm suitable for video. Finally, the invention adopts a simplified neural network structure, so the algorithm can generate a talking-face video or remove face occlusions in real time; it can be used to remove face occlusions in video conferences, to synthesize virtual anchor videos and similar scenes, and has great practical value and good economic benefit in the fields of security monitoring, video conferencing, virtual avatars and animation driving.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of a three-dimensional face model rendering process according to an embodiment of the present invention.
Fig. 3 is a diagram of the face occlusion removal process according to an embodiment of the present invention.
Detailed Description
The following examples will further illustrate the present invention with reference to the accompanying drawings.
Examples
Referring to fig. 1 to 3, the present embodiment provides a method for synthesizing a real-time video face region in a time-space consistent manner by using voice information, including the following steps:
S1: manually selecting a face identity reference image, and extracting face identity parameters and face texture parameters of the face identity reference image using a three-dimensional face reconstruction technique;
S2: inputting a talking-face audio-video data stream, separating audio and video, and, for each frame of the real-time video stream, extracting the corresponding face-related parameters using the three-dimensional face reconstruction technique, where the face-related parameters comprise face pose parameters and face shape parameters;
S3: extracting, from the real-time audio stream, the audio features corresponding to each video frame, inputting the audio features to the first network, and outputting the facial expression parameters corresponding to each frame of the video stream;
S4: inputting the face identity parameters, face texture parameters, face pose parameters, face shape parameters and facial expression parameters, and rendering the corresponding three-dimensional face model image using a three-dimensional face model rendering technique;
S5: inputting to the second network the original video frame, the three-dimensional face rendering, the face identity reference image and the previous frame synthesized by the second network, and outputting the image with the synthesized face region;
S6: checking whether the video stream has been read completely; if not, returning to step S1; if so, the space-time consistent synthesis of the real-time video face region is finished (a pipeline sketch for steps S1 to S6 is given below).
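Purely as an illustrative sketch of how steps S1 to S6 could be organized in code, where every helper name is a hypothetical placeholder standing in for the modules described above:

```python
def synthesize_stream(video_stream, audio_stream, reference_image,
                      extract_identity_texture, extract_pose_shape,
                      expression_net, render_3dmm, texture_gan):
    # S1: identity and texture parameters from the manually chosen reference image.
    id_params, tex_params = extract_identity_texture(reference_image)

    prev_synth = None
    for frame, audio_window in zip(video_stream, audio_stream):
        # S2: per-frame pose and shape parameters via 3D face reconstruction.
        pose_params, shape_params = extract_pose_shape(frame)

        # S3: facial expression parameters regressed from audio features (first network).
        expr_params = expression_net(audio_window)

        # S4: render the 3D face model from the combined parameter set.
        render = render_3dmm(id_params, tex_params, pose_params,
                             shape_params, expr_params)

        # S5: face texture synthesis (second network), conditioned on the previous output.
        prev = prev_synth if prev_synth is not None else frame
        synth = texture_gan(frame, render, reference_image, prev)

        prev_synth = synth
        yield synth   # S6: repeat until the video stream is exhausted.
```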
The input of the invention is a talking-face video stream and the output is a real-time synthesized face region video stream; the invention can be used in scenarios such as removing occlusions from the face region.
The present embodiment includes two key components: (1) a voice-driven three-dimensional face model rendering algorithm, comprising steps S1 to S4; and (2) a face region synthesis algorithm, comprising step S5. The voice-driven three-dimensional face model rendering algorithm uses the voice as a constraint to generate a rendered image, which provides prior information such as the shape and texture of the three-dimensional model for face region synthesis. Face region synthesis uses a generative adversarial network to generate the de-occluded face texture, and the texture changes smoothly over time.
1) Voice-driven three-dimensional face model rendering algorithm
In order to generate a voice-driven talking-face video in which the mouth shape and expression of the face are consistent with the speech content, a voice-driven three-dimensional face model rendering is adopted as prior information to guide the generation of the face region image. The voice signal constrains the change of the expression parameters of the face model, while the texture parameters, deformation parameters and so on remain consistent with the original face region frame. The algorithm comprises steps S1 to S4, and its process diagram is shown in FIG. 2; its key part is the facial expression estimation network, which is divided into an audio feature extraction module and a facial expression parameter regression module. The specific steps are as follows:
(1) Audio features are extracted from the current real-time audio stream; this embodiment uses Mel-frequency cepstral coefficients.
(2) The current audio feature sequence is split into frames and windowed.
(3) The audio features of each window are input to the audio feature extraction module, which outputs deeper audio features.
(4) The deeper audio features obtained in step (3) are input to the facial expression parameter regression module, which outputs the facial expression parameters.
(5) Steps (1) to (4) are repeated in temporal order, outputting the facial expression parameters corresponding to each video frame, until the last frame of the sequence is reached (a sketch of these steps is given below).
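A minimal sketch of steps (1) to (4), assuming MFCC features computed with librosa and a small one-dimensional convolutional network for the two modules; the window length, feature sizes and layer choices are illustrative assumptions rather than the exact design of this embodiment:

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mfcc_windows(wav_path, n_mfcc=28, frames_per_window=16):
    # (1)-(2): MFCC features, then framing/windowing aligned to video frames.
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    mfcc = mfcc.T.astype(np.float32)                              # (T, n_mfcc)
    windows = [mfcc[i:i + frames_per_window]
               for i in range(0, len(mfcc) - frames_per_window + 1, frames_per_window)]
    return torch.from_numpy(np.stack(windows))                    # (W, frames, n_mfcc)

class ExpressionNet(nn.Module):
    def __init__(self, n_mfcc=28, n_expr=64):
        super().__init__()
        # (3): audio feature extraction module (1D convolutions over time).
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        # (4): facial expression parameter regression module.
        self.regress = nn.Linear(128, n_expr)

    def forward(self, window):                 # window: (batch, frames, n_mfcc)
        x = window.transpose(1, 2)             # (batch, n_mfcc, frames)
        x = self.features(x).squeeze(-1)       # (batch, 128)
        return self.regress(x)                 # expression parameters for the aligned video frame
```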
2) Face region synthesis algorithm:
Face region synthesis is the process that actually synthesizes the talking-face video consistent with the speech content; it comprises step S5 (the face occlusion removal process of this embodiment is shown in FIG. 3). In this embodiment, a generative adversarial network is used to implement the face texture synthesis network; the neural network comprises a face texture synthesis module and a discriminator module, and there are two key constraints: the face identity consistency constraint and the temporal consistency constraint.
The face identity consistency constraint ensures that the identity characteristics of the face remain consistent throughout the synthesized video and constrains them to stay unchanged across frames. In this embodiment, the identity information is encoded into the generator using adaptive instance normalization, so that the network can synthesize a video frame sequence with consistent identity characteristics. The constraint mainly involves a face identity code extraction module, a face identity code projection module and a face identity verification module; it is implemented as follows (see FIG. 3):
(1) The face identity code extraction module extracts the face identity code of the target speaker using a three-dimensional face reconstruction technique. This embodiment uses a face 3D morphable model as the identity code extraction module.
(2) The face identity code obtained in step (1) is input to the face identity code projection module, which outputs the projected face identity features.
(3) The face identity features obtained in step (2) are injected into the generator using adaptive instance normalization; the original image and the 3D face model rendering are input, and the de-occluded face image is output.
(4) The face identity verification module computes the identity feature difference between the original image and the synthesized image, which is used as guidance for optimizing the generator. This embodiment uses a face recognition model.
(5) Steps (1) to (4) are repeated until the generator converges (a sketch of the identity loss in step (4) is given below).
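As an illustration of step (4), the identity feature difference could be computed as a cosine distance between face recognition embeddings of the original and synthesized images; the specific distance and the generic pretrained embedding network are assumptions, not details fixed by this embodiment:

```python
import torch
import torch.nn.functional as F

def identity_consistency_loss(face_recognizer, original, synthesized):
    """Identity feature difference between original and synthesized frames.

    face_recognizer: any pretrained face recognition network returning an
    embedding per image (a hypothetical stand-in for the embodiment's model).
    original, synthesized: image tensors of shape (batch, 3, H, W).
    """
    with torch.no_grad():                       # the recognizer itself is not trained
        emb_real = face_recognizer(original)
    emb_fake = face_recognizer(synthesized)     # gradients flow into the generator

    emb_real = F.normalize(emb_real, dim=1)
    emb_fake = F.normalize(emb_fake, dim=1)

    # 1 - cosine similarity: zero when the identities match exactly.
    return (1.0 - (emb_real * emb_fake).sum(dim=1)).mean()
```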
The temporal consistency constraint ensures that the texture of the synthesized face changes smoothly and naturally between adjacent frames and prevents jitter. In this embodiment, the image generated for the previous frame is introduced into the network as prior information, so that the network synthesizes a talking-face video with naturally changing texture. The implementation is as follows (see FIG. 3):
(1) The face region video frame synthesized for the previous frame is added to the generator inputs, and the face video frame corresponding to the current frame is output.
(2) The degree of texture jitter between the face regions synthesized by the generator for consecutive frames is computed and used as guidance for optimizing the generator; this embodiment measures it by the rate of change of the colour differences between pixels.
(3) Steps (1) and (2) are repeated until the generator converges (a sketch of the jitter measure in step (2) is given below).
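A minimal sketch of one way the jitter measure in step (2) could be realized, penalizing colour changes between consecutive synthesized frames that exceed the colour changes between the corresponding real frames; this particular formulation is an assumption for illustration only:

```python
import torch

def temporal_consistency_loss(synth_prev, synth_curr, real_prev, real_curr):
    """Penalize texture jitter: the synthesized per-pixel colour change should
    not exceed the colour change observed in the corresponding real frames.

    All inputs: image tensors of shape (batch, 3, H, W) in the same value range.
    """
    synth_change = (synth_curr - synth_prev).abs()   # per-pixel colour difference rate
    real_change = (real_curr - real_prev).abs()

    # Only extra jitter (change beyond what the real video exhibits) is penalized.
    excess = torch.clamp(synth_change - real_change, min=0.0)
    return excess.mean()
```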
In this embodiment, the face identity consistency constraint and the temporal consistency constraint are used together with the discriminator loss to optimize the generator, as sketched below. The facial expression estimation network and the face texture synthesis network adopt, respectively, a simple convolutional architecture and a generative adversarial network with few layers, so the required computation is small and the requirement of real-time face region synthesis can be met; the method therefore has high practical value and good economic benefit in the fields of security monitoring, video conferencing, virtual avatars and animation driving.
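For illustration, the combined generator objective described above might be written as a weighted sum of the three terms; the weights and the specific non-saturating adversarial term are assumptions rather than values given by the invention:

```python
import torch
import torch.nn.functional as F

def generator_loss(discriminator, synth_curr, id_loss, temp_loss,
                   w_adv=1.0, w_id=1.0, w_temp=1.0):
    """Combine the discriminator (adversarial), identity and temporal terms.

    id_loss / temp_loss are the scalars produced by identity_consistency_loss
    and temporal_consistency_loss above; the weights are illustrative.
    """
    # Non-saturating adversarial term: the generator wants the discriminator
    # to classify the synthesized frame as real (label = 1).
    pred_fake = discriminator(synth_curr)
    adv_loss = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))

    return w_adv * adv_loss + w_id * id_loss + w_temp * temp_loss
```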
The above description is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification of the technical solution and inventive concept of the present invention made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A real-time video face region space-time consistent synthesis method using voice information, characterized by comprising the following steps:
S1: manually selecting a face identity reference image, and extracting face identity parameters and face texture parameters of the face identity reference image;
S2: for each frame of the real-time video stream, extracting the corresponding face pose parameters and face shape parameters using a three-dimensional face reconstruction technique;
S3: extracting, from the real-time audio stream, the audio features corresponding to each video frame, inputting the audio features to a first network, and outputting the facial expression parameters corresponding to each frame of the video stream;
S4: inputting the face identity parameters, face pose parameters, face shape parameters, face texture parameters and facial expression parameters, and rendering the corresponding three-dimensional face model image using a three-dimensional face model rendering technique;
S5: inputting to a second network the original video frame, the three-dimensional face rendering, the face identity reference image and the previous frame synthesized by the second network, and outputting the image with the synthesized face region.
2. The method as claimed in claim 1, wherein, in step S1, the three-dimensional face reconstruction technique is a 3D morphable model (3DMM) method.
3. The method as claimed in claim 1, wherein, in step S3, the first network is a facial expression estimation network.
4. The method as claimed in claim 3, wherein the facial expression estimation network is divided into an audio feature extraction module and a facial expression parameter regression module.
5. The method as claimed in claim 1, wherein, in step S4, the three-dimensional face model rendering technique is driven by the voice.
6. The method as claimed in claim 1, wherein, in step S5, the second network is a face texture synthesis network implemented as a generative adversarial network.
7. The method as claimed in claim 6, wherein the face texture synthesis network comprises a face identity encoding module, a face identity code projection module, a face texture synthesis module and a discriminator module.
8. The method as claimed in claim 1, wherein the constraints used by the face texture synthesis network comprise a temporal consistency constraint, a discriminator constraint and a face identity consistency constraint.
9. The method as claimed in claim 8, wherein the face identity consistency constraint is used to ensure that the identity appearance of successively generated faces remains consistent; specifically, the face identity code is introduced into the face texture synthesis module using adaptive instance normalization, and the synthesis result is constrained accordingly.
10. The method as claimed in claim 8, wherein the temporal consistency constraint is used to ensure that the generated face texture changes naturally from frame to frame; specifically, the synthesis result of the previous frame is introduced into the face texture synthesis module, and the texture jitter between adjacent frames is constrained.
Application CN202110750794.3A, priority and filing date 2021-07-02, published as CN113470170A (pending): Real-time video face region space-time consistent synthesis method using voice information (en)

Priority Applications (1)

Application CN202110750794.3A, priority date 2021-07-02, filing date 2021-07-02: Real-time video face region space-time consistent synthesis method using voice information

Applications Claiming Priority (1)

Application CN202110750794.3A, priority date 2021-07-02, filing date 2021-07-02: Real-time video face region space-time consistent synthesis method using voice information

Publications (1)

Publication CN113470170A, publication date 2021-10-01

Family

ID=77877558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110750794.3A Pending CN113470170A (en) 2021-07-02 2021-07-02 Real-time video face region space-time consistent synthesis method using voice information

Country Status (1)

Country Link
CN (1) CN113470170A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110868554A (en) * 2019-11-18 2020-03-06 广州华多网络科技有限公司 Method, device and equipment for changing faces in real time in live broadcast and storage medium
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049678A (en) * 2022-01-11 2022-02-15 之江实验室 Facial motion capturing method and system based on deep learning
WO2023195426A1 (en) * 2022-04-05 2023-10-12 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Decoding device, encoding device, decoding method, and encoding method


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination