CN113470170A - Real-time video face region space-time consistent synthesis method using voice information - Google Patents

Real-time video face region space-time consistent synthesis method using voice information

Info

Publication number
CN113470170A
Authority
CN
China
Prior art keywords
face
identity
network
information
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110750794.3A
Other languages
Chinese (zh)
Inventor
曾鸣 (Zeng Ming)
刘鹏飞 (Liu Pengfei)
邓文晋 (Deng Wenjin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110750794.3A
Publication of CN113470170A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique

Abstract

A real-time, spatio-temporally consistent synthesis method for the video face region using voice information, relating to deep learning and three-dimensional face reconstruction. A three-dimensional face reconstruction algorithm extracts face identity, face shape, face pose and face texture information from the visual features, while a deep learning model extracts facial expression information from the audio features; fusing the visual and auditory information enriches the facial expressions that the neural network can synthesize and allows a talking-face video consistent with the current speech content to be synthesized quickly and accurately. Reference face identity parameters are introduced so that the identity appearance of consecutive output video frames is constrained to remain consistent. Temporal context information and smoothness constraints are introduced to suppress texture jitter, so the face generation algorithm is suitable for video. By adopting a simplified neural network structure, the method can generate a talking-face video or remove face occlusions in real time, and can be applied to security monitoring, video conferencing, virtual avatars, animation driving and other fields.

Description

Real-time video face region space-time consistent synthesis method using voice information
Technical Field
The invention relates to the technical fields of deep learning, three-dimensional face reconstruction and face synthesis, and in particular to a real-time, spatio-temporally consistent synthesis method for the video face region using voice information.
Background
Traditional face region generation algorithms are limited to a single face image and have the following problems: (1) the input information is single-modality, so the expression of the face region in the image cannot be determined and a face video consistent with the speech cannot be synthesized; (2) identity information constraints are lacking, so it cannot be guaranteed that the same person still looks like the same person after the face region is synthesized under different poses and expressions, and the identity appearance is inconsistent across frames when the method is applied to video; (3) correlations and constraints between temporal frames are lacking, so texture jitter occurs when the method is applied to video and the results are poor; (4) the required network structure is complex, the consumption of computing resources is huge, and the inference time is too long to meet real-time requirements.
Disclosure of Invention
The invention aims to address the above problems in the prior art by providing a real-time, spatio-temporally consistent synthesis method for the video face region using voice information. The method extracts face identity, face shape, face pose and face texture information from the visual features with a three-dimensional face reconstruction algorithm, extracts facial expression information from the audio features with deep learning, and fuses the visual and auditory information, which enriches the facial expressions the neural network can synthesize and allows a talking-face video consistent with the current speech content to be synthesized quickly and accurately.
The invention comprises the following steps:
S1: manually selecting a face identity reference image, and extracting face identity parameters and face texture parameters of the face identity reference image using a three-dimensional face reconstruction technique;
S2: for each frame of the real-time video stream, extracting the corresponding face pose parameters and face shape parameters using the three-dimensional face reconstruction technique;
S3: extracting, from the real-time audio stream, the audio features corresponding to each video frame, inputting the audio features to a first network, and outputting the facial expression parameters corresponding to each frame of the video stream;
S4: inputting the face identity parameters, face pose parameters, face shape parameters, face texture parameters and facial expression parameters, and rendering the corresponding three-dimensional face model image using a three-dimensional face model rendering technique;
S5: inputting to a second network the original video frame, the three-dimensional face rendering, the face identity reference image and the previous frame synthesized by the second network, and outputting the image with the synthesized face region.
In step S1, the three-dimensional face reconstruction technique is a 3D morphable model (3DMM) of the face.
In step S3, the first network is a facial expression estimation network; the facial expression estimation network is divided into an audio feature extraction module and a facial expression parameter regression module.
In step S4, the three-dimensional face model rendering technique is driven by the voice.
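For illustration, the parameter-to-mesh step of a generic 3DMM renderer could be sketched as follows; the basis matrices, parameter dimensions and the use of a weak-perspective projection are assumptions made for this sketch and are not specified by the invention.

```python
import numpy as np

def reconstruct_and_project(mean_shape, id_basis, exp_basis,
                            id_params, exp_params,
                            rotation, translation, scale):
    """Sketch: 3DMM vertex reconstruction followed by a weak-perspective projection.

    mean_shape: (3N,) mean face; id_basis: (3N, K_id); exp_basis: (3N, K_exp).
    The basis sizes and the projection model are illustrative assumptions.
    """
    # Linear 3DMM: the mean face deformed by identity/shape and expression components.
    vertices = mean_shape + id_basis @ id_params + exp_basis @ exp_params
    vertices = vertices.reshape(-1, 3)                     # (N, 3)

    # Rigid pose from the face pose parameters: rotation, translation, global scale.
    posed = scale * (vertices @ rotation.T) + translation  # (N, 3)

    # Weak-perspective projection to image coordinates (drop the depth axis).
    return posed[:, :2]
```

The face texture parameters would then colour the projected mesh during rasterisation to produce the rendered image used as prior information in step S5.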
In step S5, the second network is a face texture synthesis network implemented as a generative adversarial network; the face texture synthesis network comprises a face identity encoding module, a face identity code projection module, a face texture synthesis module and a discriminator module.
The constraints used by the face texture synthesis network comprise a temporal consistency constraint, a discriminator constraint and a face identity consistency constraint.
The face identity consistency constraint ensures that the identity appearance of successively generated faces remains consistent; specifically, the face identity code is injected into the face texture synthesis module using adaptive instance normalization, and the synthesis result is constrained accordingly.
The temporal consistency constraint ensures that the generated face texture changes naturally from frame to frame; specifically, the synthesis result of the previous frame is fed into the face texture synthesis module, and the texture jitter between adjacent frames is constrained.
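For illustration only, a minimal PyTorch-style skeleton of such a texture synthesis generator, taking the inputs named above (original frame, 3DMM rendering, identity reference image, previous synthesized frame) and conditioned on a face identity code through adaptive instance normalization, could look as follows; the layer sizes, channel counts and module names are assumptions, not the actual network of the invention.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: scale/shift features using an identity code."""
    def __init__(self, channels, id_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(id_dim, channels * 2)   # predicts per-channel gamma, beta

    def forward(self, x, id_code):
        gamma, beta = self.affine(id_code).chunk(2, dim=1)
        x = self.norm(x)
        return x * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]

class TextureGenerator(nn.Module):
    def __init__(self, id_dim=80):
        super().__init__()
        # Inputs concatenated channel-wise: original frame, 3DMM render,
        # identity reference image, previous synthesized frame (4 x RGB = 12 channels).
        self.encode = nn.Sequential(nn.Conv2d(12, 64, 7, 1, 3), nn.ReLU(),
                                    nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        self.adain = AdaIN(128, id_dim)
        self.decode = nn.Sequential(nn.Upsample(scale_factor=2),
                                    nn.Conv2d(128, 64, 3, 1, 1), nn.ReLU(),
                                    nn.Conv2d(64, 3, 7, 1, 3), nn.Tanh())

    def forward(self, frame, render, id_ref, prev_out, id_code):
        x = torch.cat([frame, render, id_ref, prev_out], dim=1)
        x = self.encode(x)
        x = self.adain(x, id_code)          # identity consistency conditioning
        return self.decode(x)
```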
The invention fuses visual information (face shape and pose) with auditory information (audio features); this multi-modal fusion achieves information complementarity and allows the algorithm to synthesize face videos with rich expressions that are consistent with the speech content. Regarding identity consistency, the invention introduces reference face identity parameters so that the identity appearance of consecutive output video frames is constrained to remain consistent. Temporal context information and smoothness constraints are also introduced, which effectively suppress texture jitter and make the face generation algorithm suitable for video. Finally, the invention adopts a simplified neural network structure, so the algorithm can generate a talking-face video or remove face occlusions in real time; it can be used to remove face occlusions in video conferences, to synthesize virtual anchor videos and similar scenes, and has great practical value and good economic benefit in the fields of security monitoring, video conferencing, virtual avatars and animation driving.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of a three-dimensional face model rendering process according to an embodiment of the present invention.
Fig. 3 is a diagram of the face occlusion removal process according to an embodiment of the present invention.
Detailed Description
The following examples will further illustrate the present invention with reference to the accompanying drawings.
Examples
Referring to fig. 1 to 3, the present embodiment provides a method for synthesizing a real-time video face region in a time-space consistent manner by using voice information, including the following steps:
S1: manually selecting a face identity reference image, and extracting face identity parameters and face texture parameters of the face identity reference image using a three-dimensional face reconstruction technique;
S2: inputting a talking-face audio-video data stream, separating audio and video, and, for each frame of the real-time video stream, extracting the corresponding face-related parameters using the three-dimensional face reconstruction technique, where the face-related parameters comprise face pose parameters and face shape parameters;
S3: extracting, from the real-time audio stream, the audio features corresponding to each video frame, inputting the audio features to the first network, and outputting the facial expression parameters corresponding to each frame of the video stream;
S4: inputting the face identity parameters, face texture parameters, face pose parameters, face shape parameters and facial expression parameters, and rendering the corresponding three-dimensional face model image using a three-dimensional face model rendering technique;
S5: inputting to the second network the original video frame, the three-dimensional face rendering, the face identity reference image and the previous frame synthesized by the second network, and outputting the image with the synthesized face region;
S6: checking whether the video stream has been read completely; if not, returning to step S1; if so, the space-time consistent synthesis of the real-time video face region is finished (a pipeline sketch for steps S1 to S6 is given below).
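Purely as an illustrative sketch of how steps S1 to S6 could be organized in code, where every helper name is a hypothetical placeholder standing in for the modules described above:

```python
def synthesize_stream(video_stream, audio_stream, reference_image,
                      extract_identity_texture, extract_pose_shape,
                      expression_net, render_3dmm, texture_gan):
    # S1: identity and texture parameters from the manually chosen reference image.
    id_params, tex_params = extract_identity_texture(reference_image)

    prev_synth = None
    for frame, audio_window in zip(video_stream, audio_stream):
        # S2: per-frame pose and shape parameters via 3D face reconstruction.
        pose_params, shape_params = extract_pose_shape(frame)

        # S3: facial expression parameters regressed from audio features (first network).
        expr_params = expression_net(audio_window)

        # S4: render the 3D face model from the combined parameter set.
        render = render_3dmm(id_params, tex_params, pose_params,
                             shape_params, expr_params)

        # S5: face texture synthesis (second network), conditioned on the previous output.
        prev = prev_synth if prev_synth is not None else frame
        synth = texture_gan(frame, render, reference_image, prev)

        prev_synth = synth
        yield synth   # S6: repeat until the video stream is exhausted.
```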
The input of the invention is a talking-face video stream and the output is a real-time synthesized face region video stream; the invention can be used in scenarios such as removing occlusions from the face region.
The present embodiment includes two key components: (1) a voice-driven three-dimensional face model rendering algorithm, comprising steps S1 to S4; and (2) a face region synthesis algorithm, comprising step S5. The voice-driven three-dimensional face model rendering algorithm uses the voice as a constraint to generate a rendered image, which provides prior information such as the shape and texture of the three-dimensional model for face region synthesis. Face region synthesis uses a generative adversarial network to generate the de-occluded face texture, and the texture changes smoothly over time.
1) Voice-driven three-dimensional face model rendering algorithm
In order to generate a voice-driven talking-face video in which the mouth shape and expression of the face are consistent with the speech content, a voice-driven three-dimensional face model rendering is adopted as prior information to guide the generation of the face region image. The voice signal constrains the change of the expression parameters of the face model, while the texture parameters, deformation parameters and so on remain consistent with the original face region frame. The algorithm comprises steps S1 to S4, and its process diagram is shown in FIG. 2; its key part is the facial expression estimation network, which is divided into an audio feature extraction module and a facial expression parameter regression module. The specific steps are as follows:
(1) Audio features are extracted from the current real-time audio stream; this embodiment uses Mel-frequency cepstral coefficients.
(2) The current audio feature sequence is split into frames and windowed.
(3) The audio features of each window are input to the audio feature extraction module, which outputs deeper audio features.
(4) The deeper audio features obtained in step (3) are input to the facial expression parameter regression module, which outputs the facial expression parameters.
(5) Steps (1) to (4) are repeated in temporal order, outputting the facial expression parameters corresponding to each video frame, until the last frame of the sequence is reached (a sketch of these steps is given below).
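A minimal sketch of steps (1) to (4), assuming MFCC features computed with librosa and a small one-dimensional convolutional network for the two modules; the window length, feature sizes and layer choices are illustrative assumptions rather than the exact design of this embodiment:

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mfcc_windows(wav_path, n_mfcc=28, frames_per_window=16):
    # (1)-(2): MFCC features, then framing/windowing aligned to video frames.
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    mfcc = mfcc.T.astype(np.float32)                              # (T, n_mfcc)
    windows = [mfcc[i:i + frames_per_window]
               for i in range(0, len(mfcc) - frames_per_window + 1, frames_per_window)]
    return torch.from_numpy(np.stack(windows))                    # (W, frames, n_mfcc)

class ExpressionNet(nn.Module):
    def __init__(self, n_mfcc=28, n_expr=64):
        super().__init__()
        # (3): audio feature extraction module (1D convolutions over time).
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        # (4): facial expression parameter regression module.
        self.regress = nn.Linear(128, n_expr)

    def forward(self, window):                 # window: (batch, frames, n_mfcc)
        x = window.transpose(1, 2)             # (batch, n_mfcc, frames)
        x = self.features(x).squeeze(-1)       # (batch, 128)
        return self.regress(x)                 # expression parameters for the aligned video frame
```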
2) Face region synthesis algorithm:
Face region synthesis is the process that actually synthesizes the talking-face video consistent with the speech content; it comprises step S5 (the face occlusion removal process of this embodiment is shown in FIG. 3). In this embodiment, a generative adversarial network is used to implement the face texture synthesis network; the neural network comprises a face texture synthesis module and a discriminator module, and there are two key constraints: the face identity consistency constraint and the temporal consistency constraint.
The face identity consistency constraint ensures that the identity characteristics of the face remain consistent throughout the synthesized video and constrains them to stay unchanged across frames. In this embodiment, the identity information is encoded into the generator using adaptive instance normalization, so that the network can synthesize a video frame sequence with consistent identity characteristics. The constraint mainly involves a face identity code extraction module, a face identity code projection module and a face identity verification module; it is implemented as follows (see FIG. 3):
(1) The face identity code extraction module extracts the face identity code of the target speaker using a three-dimensional face reconstruction technique. This embodiment uses a face 3D morphable model as the identity code extraction module.
(2) The face identity code obtained in step (1) is input to the face identity code projection module, which outputs the projected face identity features.
(3) The face identity features obtained in step (2) are injected into the generator using adaptive instance normalization; the original image and the 3D face model rendering are input, and the de-occluded face image is output.
(4) The face identity verification module computes the identity feature difference between the original image and the synthesized image, which is used as guidance for optimizing the generator. This embodiment uses a face recognition model.
(5) Steps (1) to (4) are repeated until the generator converges (a sketch of the identity loss in step (4) is given below).
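As an illustration of step (4), the identity feature difference could be computed as a cosine distance between face recognition embeddings of the original and synthesized images; the specific distance and the generic pretrained embedding network are assumptions, not details fixed by this embodiment:

```python
import torch
import torch.nn.functional as F

def identity_consistency_loss(face_recognizer, original, synthesized):
    """Identity feature difference between original and synthesized frames.

    face_recognizer: any pretrained face recognition network returning an
    embedding per image (a hypothetical stand-in for the embodiment's model).
    original, synthesized: image tensors of shape (batch, 3, H, W).
    """
    with torch.no_grad():                       # the recognizer itself is not trained
        emb_real = face_recognizer(original)
    emb_fake = face_recognizer(synthesized)     # gradients flow into the generator

    emb_real = F.normalize(emb_real, dim=1)
    emb_fake = F.normalize(emb_fake, dim=1)

    # 1 - cosine similarity: zero when the identities match exactly.
    return (1.0 - (emb_real * emb_fake).sum(dim=1)).mean()
```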
The temporal consistency constraint ensures that the texture of the synthesized face changes smoothly and naturally between adjacent frames and prevents jitter. In this embodiment, the image generated for the previous frame is introduced into the network as prior information, so that the network synthesizes a talking-face video with naturally changing texture. The implementation is as follows (see FIG. 3):
(1) The face region video frame synthesized for the previous frame is added to the generator inputs, and the face video frame corresponding to the current frame is output.
(2) The degree of texture jitter between the face regions synthesized by the generator for consecutive frames is computed and used as guidance for optimizing the generator; this embodiment measures it by the rate of change of the colour differences between pixels.
(3) Steps (1) and (2) are repeated until the generator converges (a sketch of the jitter measure in step (2) is given below).
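A minimal sketch of one way the jitter measure in step (2) could be realized, penalizing colour changes between consecutive synthesized frames that exceed the colour changes between the corresponding real frames; this particular formulation is an assumption for illustration only:

```python
import torch

def temporal_consistency_loss(synth_prev, synth_curr, real_prev, real_curr):
    """Penalize texture jitter: the synthesized per-pixel colour change should
    not exceed the colour change observed in the corresponding real frames.

    All inputs: image tensors of shape (batch, 3, H, W) in the same value range.
    """
    synth_change = (synth_curr - synth_prev).abs()   # per-pixel colour difference rate
    real_change = (real_curr - real_prev).abs()

    # Only extra jitter (change beyond what the real video exhibits) is penalized.
    excess = torch.clamp(synth_change - real_change, min=0.0)
    return excess.mean()
```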
In this embodiment, the face identity consistency constraint and the temporal consistency constraint are used together with the discriminator loss to optimize the generator, as sketched below. The facial expression estimation network and the face texture synthesis network adopt, respectively, a simple convolutional architecture and a generative adversarial network with few layers, so the required computation is small and the requirement of real-time face region synthesis can be met; the method therefore has high practical value and good economic benefit in the fields of security monitoring, video conferencing, virtual avatars and animation driving.
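For illustration, the combined generator objective described above might be written as a weighted sum of the three terms; the weights and the specific non-saturating adversarial term are assumptions rather than values given by the invention:

```python
import torch
import torch.nn.functional as F

def generator_loss(discriminator, synth_curr, id_loss, temp_loss,
                   w_adv=1.0, w_id=1.0, w_temp=1.0):
    """Combine the discriminator (adversarial), identity and temporal terms.

    id_loss / temp_loss are the scalars produced by identity_consistency_loss
    and temporal_consistency_loss above; the weights are illustrative.
    """
    # Non-saturating adversarial term: the generator wants the discriminator
    # to classify the synthesized frame as real (label = 1).
    pred_fake = discriminator(synth_curr)
    adv_loss = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))

    return w_adv * adv_loss + w_id * id_loss + w_temp * temp_loss
```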
The above description is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification of the technical solution and inventive concept of the present invention made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A real-time video face region space-time consistent synthesis method using voice information, characterized by comprising the following steps:
S1: manually selecting a face identity reference image, and extracting face identity parameters and face texture parameters of the face identity reference image;
S2: for each frame of the real-time video stream, extracting the corresponding face pose parameters and face shape parameters using a three-dimensional face reconstruction technique;
S3: extracting, from the real-time audio stream, the audio features corresponding to each video frame, inputting the audio features to a first network, and outputting the facial expression parameters corresponding to each frame of the video stream;
S4: inputting the face identity parameters, face pose parameters, face shape parameters, face texture parameters and facial expression parameters, and rendering the corresponding three-dimensional face model image using a three-dimensional face model rendering technique;
S5: inputting to a second network the original video frame, the three-dimensional face rendering, the face identity reference image and the previous frame synthesized by the second network, and outputting the image with the synthesized face region.
2. The method as claimed in claim 1, wherein, in step S1, the three-dimensional face reconstruction technique is a 3D morphable model (3DMM) method.
3. The method as claimed in claim 1, wherein, in step S3, the first network is a facial expression estimation network.
4. The method as claimed in claim 3, wherein the facial expression estimation network is divided into an audio feature extraction module and a facial expression parameter regression module.
5. The method as claimed in claim 1, wherein, in step S4, the three-dimensional face model rendering technique is driven by the voice.
6. The method as claimed in claim 1, wherein, in step S5, the second network is a face texture synthesis network implemented as a generative adversarial network.
7. The method as claimed in claim 6, wherein the face texture synthesis network comprises a face identity encoding module, a face identity code projection module, a face texture synthesis module and a discriminator module.
8. The method as claimed in claim 1, wherein the constraints used by the face texture synthesis network comprise a temporal consistency constraint, a discriminator constraint and a face identity consistency constraint.
9. The method as claimed in claim 8, wherein the face identity consistency constraint is used to ensure that the identity appearance of successively generated faces remains consistent; specifically, the face identity code is introduced into the face texture synthesis module using adaptive instance normalization, and the synthesis result is constrained accordingly.
10. The method as claimed in claim 8, wherein the temporal consistency constraint is used to ensure that the generated face texture changes naturally from frame to frame; specifically, the synthesis result of the previous frame is introduced into the face texture synthesis module, and the texture jitter between adjacent frames is constrained.
Application CN202110750794.3A, priority and filing date 2021-07-02, published as CN113470170A (pending): Real-time video face region space-time consistent synthesis method using voice information (en)

Priority Applications (1)

Application CN202110750794.3A, priority date 2021-07-02, filing date 2021-07-02: Real-time video face region space-time consistent synthesis method using voice information

Applications Claiming Priority (1)

Application CN202110750794.3A, priority date 2021-07-02, filing date 2021-07-02: Real-time video face region space-time consistent synthesis method using voice information

Publications (1)

Publication CN113470170A, publication date 2021-10-01

Family

ID=77877558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110750794.3A Pending CN113470170A (en) 2021-07-02 2021-07-02 Real-time video face region space-time consistent synthesis method using voice information

Country Status (1)

Country Link
CN (1) CN113470170A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110868554A (en) * 2019-11-18 2020-03-06 广州华多网络科技有限公司 Method, device and equipment for changing faces in real time in live broadcast and storage medium
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049678A (en) * 2022-01-11 2022-02-15 之江实验室 Facial motion capturing method and system based on deep learning
WO2023195426A1 (en) * 2022-04-05 2023-10-12 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Decoding device, encoding device, decoding method, and encoding method


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination