CN116668611A - Virtual digital human lip synchronization method and system - Google Patents
Virtual digital human lip synchronization method and system
- Publication number
- CN116668611A (application CN202310928303.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- image
- face
- lip
- wav2lip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 42
- 238000012549 training Methods 0.000 claims abstract description 47
- 238000012545 processing Methods 0.000 claims abstract description 17
- 230000001815 facial effect Effects 0.000 claims abstract description 12
- 238000004364 calculation method Methods 0.000 claims abstract description 5
- 230000004927 fusion Effects 0.000 claims description 39
- 239000010410 layer Substances 0.000 claims description 15
- 238000007781 pre-processing Methods 0.000 claims description 11
- 230000008569 process Effects 0.000 claims description 9
- 230000000007 visual effect Effects 0.000 claims description 9
- 238000013507 mapping Methods 0.000 claims description 8
- 238000004422 calculation algorithm Methods 0.000 claims description 7
- 230000003044 adaptive effect Effects 0.000 claims description 6
- 239000013598 vector Substances 0.000 claims description 4
- 238000012937 correction Methods 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 230000001360 synchronised effect Effects 0.000 claims description 3
- 239000002355 dual-layer Substances 0.000 claims description 2
- 239000012634 fragment Substances 0.000 claims description 2
- 230000009466 transformation Effects 0.000 claims description 2
- 238000013135 deep learning Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 238000011426 transformation method Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
Abstract
The application discloses a virtual digital human lip synchronization method and system. The method first shoots video according to a pre-acquired target corpus to obtain a sample video, and extracts face images and audio information from it; the face images are then processed with the OpenFace tool to obtain face key point images, and the audio information is preprocessed and converted into a Mel spectrogram; finally, a Wav2Lip model is trained on the facial key point images and Mel spectrograms, and the trained Wav2Lip model processes a target video to obtain the corresponding lip video. The method effectively improves lip synchronization accuracy while reducing computational complexity and alleviating information redundancy.
Description
Technical Field
The application relates to the technical field of deep learning, in particular to a virtual digital human lip synchronization method and a virtual digital human lip synchronization system.
Background
With the development of deep learning technology, face image generation has made remarkable progress. However, many challenges remain in generating speaker face images in real scenes; for example, insufficient synchronization between the generated face image and the audio results in uncoordinated facial motion and speech.
The prior art generally uses the Wav2Lip model for lip synchronization; however, Wav2Lip cannot meet the demands of high-quality speaker video generation. When generating video for an arbitrary speaker at inference time, lip synchronization remains poor. Moreover, Wav2Lip fuses audio and video features with only a simple concatenation operation, which can introduce redundant information between feature vectors, reducing training efficiency and risking overfitting. Because the model must also handle a large number of parameters, training time increases and more computing resources may be required.
Disclosure of Invention
Based on the above, the embodiment of the application provides a virtual digital human lip synchronization method and a virtual digital human lip synchronization system, which are used for solving the problems of low lip synchronization efficiency, information redundancy caused by a simple connection mode in a feature fusion stage, difficulty in training and the like in the prior art.
In a first aspect, there is provided a virtual digital human lip synchronization method, the method comprising:
shooting a video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
training a Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video by the trained Wav2Lip model to obtain a corresponding Lip video.
Optionally, the training process of the Wav2Lip model specifically includes:
inputting the Mel spectrogram and the facial key point image into a Wav2Lip model, and extracting further Audio and Video features through an Audio encoder and a Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
Optionally, the generator of the Wav2Lip model includes a speech encoder, a face encoder and a face decoder; the face encoder and face decoder are 2D-CNNs built on residual structures. The face encoder encodes, frame by frame, 5 randomly selected reference frames together with the corresponding reference frames whose lower half face is masked; the speech encoder encodes the MFCC-processed speech segments, and the encoded face and audio features are simply concatenated and fed into the face decoder to generate face images frame by frame.
Optionally, the Wav2Lip model implements the generator reconstruction loss $L_{recon}$ by minimizing the L1 reconstruction loss between the generated frame $L_g$ and the reference frame $L_G$:

$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N} \left\| L_g^{(i)} - L_G^{(i)} \right\|_1$$

where $N$ denotes the number of images, $i$ the image index, $L_g$ the generated frame, and $L_G$ the reference frame.
Optionally, the Wav2Lip model uses a pre-trained lip-sync discriminator to correct the lip region of the person, so that more computing resources can be allocated to fine-grained lip region correction while ignoring background regions that are unimportant for image reconstruction or contain useless detail; specifically, by the formulas

$$P_{sync} = \frac{E_v \cdot E_s}{\max\left(\|E_v\|_2 \, \|E_s\|_2,\ \epsilon\right)}, \qquad L_{sync} = -\frac{1}{N}\sum_{i=1}^{N} \log\left(P_{sync}^{(i)}\right)$$

the lip-sync discriminator loss $L_{sync}$ is determined, where $P_{sync}$ denotes the synchronization probability of an input audio-video pair, and $E_v$ and $E_s$ denote the embedded vectors of the video and audio, respectively.
Optionally, the Wav2Lip model includes a visual discriminator $D$, constructed as a cascade of residual-structure 2D-CNN blocks, which receives the lower half of the face image generated by the generator as input; the visual discriminator is trained to maximize

$$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$$

where $L_g$ denotes the generated image produced by the generator and $L_G$ the true reference image.
Optionally, the extracted audio and video features are fused by a feature fusion module that outputs the fused features; the feature fusion module comprises a fast Fourier convolution layer and an adaptive instance normalization algorithm.
Optionally, the second fast Fourier convolution layer in the double-layer fast Fourier convolution residual stack is replaced with an AdaIN layer; AdaIN applies an adaptive affine transformation to fuse and convert feature information into any given feature, and can be expressed by the formula

$$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)$$

where $x$ and $y$ denote different features, and $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and variance calculations, respectively.
In a second aspect, there is provided a virtual digital human lip synchronization system, the system comprising:
the acquisition module is used for carrying out video shooting according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
the preprocessing module is used for processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
the generating module is used for training the Wav2Lip model based on the face key point image and the Mel spectrogram, and processing the target video by the Wav2Lip model after training to obtain a corresponding Lip video.
Optionally, the system further comprises:
the training module is used for inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting further Audio and Video features through the Audio encoder and the Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
According to the technical scheme provided by the embodiment of the application, video shooting is first carried out according to a pre-acquired target corpus to obtain a sample video, and face images and audio information are extracted from it; the face images are then processed with the OpenFace tool to obtain face key point images, and the audio information is preprocessed and converted into a Mel spectrogram; finally, a Wav2Lip model is trained on the facial key point images and Mel spectrograms, and the trained Wav2Lip model processes a target video to obtain the corresponding lip video. The beneficial effects of the application are as follows:
(1) Traditional methods can only accurately synchronize the lip shape of a static image or video of a specific person, and cannot accurately drive the lip movement of an arbitrary identity in dynamic, unconstrained speaker video; in contrast, the application realizes higher-quality lip synchronization and effectively improves lip synchronization accuracy.
(2) The application adopts a feature fusion module to help the fusion of video and audio features, reduces the problems of calculation complexity, information redundancy and the like, and improves the lip synchronization accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It will be apparent to those skilled in the art that the drawings described below are merely exemplary, and that other embodiments may be derived from them without undue effort.
FIG. 1 is a flow chart of a virtual digital human lip synchronization method provided by an embodiment of the application;
Fig. 2 is a processing flow chart of a virtual digital human lip synchronization method according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the description of the present application, the terms "comprises", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus comprising a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed, inherent to such process, method, article or apparatus, or added based on further optimization of the inventive concept.
The application aims to provide a lip synchronization method for improving lip synchronization accuracy, which aims to solve the problems of low lip synchronization efficiency, information redundancy caused by a simple connection mode in a feature fusion stage, difficult training and the like in the prior art. Specifically, please refer to fig. 1, which illustrates a flowchart of a virtual digital human lip synchronization method according to an embodiment of the present application, the method may include the following steps:
and step 101, performing video shooting according to a pre-acquired target corpus to obtain a sample video.
In the embodiment, data acquisition is needed first, and video shooting is carried out on a given related corpus; then processing the audio information and extracting the video face information.
In this step, a training sample is specifically determined, where the training sample includes an original image frame containing the speaker's face, a real image frame containing the speaker's true lip shape corresponding to the audio file, and the audio file itself.
Step 102, processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram.
In this step, specifically, audio preprocessing is included: the input audio is converted into a Mel spectrogram for subsequent speech feature extraction.
Video preprocessing: and carrying out face key point detection on the input video by using an Openface tool to obtain corresponding face image frames.
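The Mel conversion in the audio preprocessing step can be sketched as a triangular mel filterbank applied to STFT power bins; the HTK-style mel formula and the parameter values (16 kHz audio, 512-point FFT, 80 mel bands) are assumptions, as the patent does not specify them:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed variant; the patent does not specify one)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=80):
    """Triangular filterbank mapping n_fft//2 + 1 STFT bins to n_mels bands;
    multiplying it by a power spectrogram yields a Mel spectrogram."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                      # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                     # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb
```

In practice a library routine such as librosa's `melspectrogram` would typically be used; the sketch only shows the shape of the computation feeding the Wav2Lip audio encoder.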
And step 103, training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing the target video by the trained Wav2Lip model to obtain a corresponding Lip video.
Based on the training samples, the training process of the Wav2Lip model is performed. The Mel spectrogram and face image frames obtained in the preceding steps are fed into the proposed network, where the Audio encoder and Video encoder extract further audio and video features. The extracted audio and video features are then fused by the feature fusion module, which outputs the fused features. Finally, feature mapping and association are performed through a generative adversarial network to realize lip video generation. The generated image frames and real image frames are input into a multi-scale image quality discriminator, which discriminates at multiple scales whether the generated and real image frames are real images.
In the embodiment of the application, the training process of the Wav2Lip model specifically comprises the following steps:
inputting the Mel spectrogram and the facial key point image into a Wav2Lip model, and extracting further Audio and Video features through an Audio encoder and a Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
For the Wav2Lip model in the application, the generator consists of three parts: a speech encoder, a face encoder and a face decoder. The face encoder and face decoder are 2D-CNNs built on residual structures. The face encoder encodes, frame by frame, T = 5 randomly selected reference frames and the corresponding reference frames with the lower half of the face masked. The speech encoder encodes the MFCC-processed speech segment. The face image is then generated frame by frame by feeding the simple concatenation of the encoded face and audio features into the face decoder.
The Wav2Lip model implements the generator loss $L_{recon}$ by minimizing the L1 reconstruction loss between the generated frame $L_g$ and the reference frame $L_G$:

$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N} \left\| L_g^{(i)} - L_G^{(i)} \right\|_1 \qquad (1)$$

where $N$ denotes the number of images, $i$ the image index, $L_g$ the generated frame, and $L_G$ the reference frame.
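A minimal numpy sketch of the reconstruction loss in equation (1); the frame layout (N stacked frames) is an illustrative assumption:

```python
import numpy as np

def l1_reconstruction_loss(generated, reference):
    """Equation (1): mean over N frames of the L1 norm between the
    generated frame L_g and the reference frame L_G."""
    return float(np.mean([np.abs(g - r).sum() for g, r in zip(generated, reference)]))
```

For example, two 3x3 generated frames of all ones against all-zero references give a loss of 9.0 (nine unit pixel errors per frame).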
In this embodiment, wav2lip uses a pre-trained lip sync discriminator to correct the lip area of the person. This approach allocates more computing resources to fine-grained lip region correction while ignoring background regions that are not important for image reconstruction or contain more useless details.
$$P_{sync} = \frac{E_v \cdot E_s}{\max\left(\|E_v\|_2 \, \|E_s\|_2,\ \epsilon\right)} \qquad (2)$$

$$L_{sync} = -\frac{1}{N}\sum_{i=1}^{N} \log\left(P_{sync}^{(i)}\right) \qquad (3)$$

These determine the lip-sync discriminator loss $L_{sync}$, where $P_{sync}$ denotes the synchronization probability of an input audio-video pair, and $E_v$ and $E_s$ denote the embedded vectors of the video and audio, respectively.
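Equations (2) and (3) can be sketched directly in numpy; a perfectly synchronized pair (identical embeddings) yields $P_{sync} = 1$ and zero loss:

```python
import numpy as np

def sync_probability(e_v, e_s, eps=1e-8):
    """Equation (2): cosine similarity between the video embedding E_v and
    the audio embedding E_s, with the denominator clamped by eps."""
    num = float(np.dot(e_v, e_s))
    den = max(float(np.linalg.norm(e_v) * np.linalg.norm(e_s)), eps)
    return num / den

def sync_loss(pairs):
    """Equation (3): mean negative log of P_sync over a batch of
    (video, audio) embedding pairs. Assumes P_sync > 0 for every pair."""
    probs = [sync_probability(v, s) for v, s in pairs]
    return -float(np.mean(np.log(probs)))
```

Note the discriminator in practice maps embeddings so that similarities stay positive; negative cosine values would make the log undefined in this bare sketch.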
To mitigate the slight artifacts or mouth blurring that arise when lip synchronization is corrected by the lip-sync discriminator alone, this embodiment adds a visual discriminator. The visual discriminator $D$ is constructed as a cascade of residual-structure 2D-CNN blocks and receives the lower half of the face image generated by the generator as input. The visual discriminator is trained to maximize $L_{disc}$:

$$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right] \qquad (4)$$

while the generator minimizes the adversarial loss

$$L_{gen} = \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right] \qquad (5)$$

where $L_g$ denotes the generated image produced by the generator and $L_G$ the true reference image. For the generator, minimizing equation (6) refers to a weighted sum of the reconstruction loss $L_{recon}$, the synchronization loss $L_{sync}$ and the adversarial loss $L_{gen}$.
$$L_{total} = (1 - s_w - s_g)\,L_{recon} + s_w\,L_{sync} + s_g\,L_{gen} \qquad (6)$$

where $s_w$ and $s_g$ denote the setting parameters (weights).
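The weighted sum of equation (6) is trivial to express; the default weight values below follow the public Wav2Lip implementation and are assumptions here, since the patent only calls them "setting parameters":

```python
def total_generator_loss(l_recon, l_sync, l_gen, s_w=0.03, s_g=0.07):
    """Equation (6): convex combination of reconstruction, sync and
    adversarial losses. s_w and s_g are the two setting parameters."""
    return (1.0 - s_w - s_g) * l_recon + s_w * l_sync + s_g * l_gen
```

With the assumed defaults, a pure reconstruction error of 1.0 contributes 0.9 to the total, so reconstruction dominates early training while the sync and adversarial terms refine the lip region.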
The feature fusion module comprises a fast Fourier convolution layer (FFC) and an adaptive instance normalization algorithm (AdaIN).
The application improves the robustness of the model and its understanding of the global image structure through the proposed double-layer FFC residual stack. The AdaIN algorithm is introduced to better fuse the audio and video features: specifically, the second FFC layer in the double-layer FFC residual stack is replaced with an AdaIN layer. AdaIN uses an adaptive affine transformation to fuse and convert feature information into any given feature, and can be expressed by formula (7):

$$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y) \qquad (7)$$

where $x$ and $y$ denote different features, and $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and variance calculations, respectively.
The AdaIN algorithm aligns, in the channel dimension, the mean and variance of the last-processed Video Feature map with those of the encoded speech-spectrum Audio Feature map, so that the AdaIN-processed feature map exhibits channel activations similar to the audio features at output and better preserves the semantic information of the audio.
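Equation (7) and the channel-statistic alignment just described can be sketched as follows; the (C, H, W) layout and per-channel statistics are illustrative assumptions:

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Equation (7): normalize the video feature map x per channel, then
    rescale with the channel mean/std of the audio feature map y, so the
    output inherits y's channel statistics."""
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    sd_x = x.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    mu_y = y.mean(axis=(1, 2), keepdims=True)
    sd_y = y.std(axis=(1, 2), keepdims=True)
    return sd_y * (x - mu_x) / sd_x + mu_y
```

After this transform, each channel of the output has (up to eps) the mean and standard deviation of the corresponding audio-feature channel, which is the "alignment in the channel dimension" the paragraph above refers to.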
In summary, as shown in fig. 2, a process flow structure diagram of the virtual digital human lip synchronization method is provided. Optionally, the process may further include:
step one: the data acquisition is needed, and video shooting is carried out on a given related corpus;
step two: processing audio information and extracting video face information;
step three: further extracting Audio and Video features from the Audio and Video encodings;
step four: extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
step five: and performing feature mapping and association by generating an countermeasure network to realize the generation of lip-shaped video.
The embodiment of the application also provides a virtual digital human lip synchronization system. The system comprises:
the acquisition module is used for carrying out video shooting according to a pre-acquired target corpus to obtain a sample video; the sample video comprises a face image and audio;
the preprocessing module is used for processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
the generating module is used for training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing the target video by the Wav2Lip model after training to obtain a corresponding Lip video.
In an alternative embodiment of the application, the system further comprises:
training module for: inputting the Mel spectrogram and the facial key point image into a Wav2Lip model, and extracting further Audio and Video features through an Audio encoder and a Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
The virtual digital human lip synchronization system provided by the embodiment of the present application is used to implement the above virtual digital human lip synchronization method; for specific limitations of the system, reference may be made to the limitations of the method above, which are not repeated here. The various parts of the virtual digital human lip synchronization system may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules may be embedded in hardware, independent of a processor in the device, or stored as software in a memory in the device, so that the processor can call and execute the operations corresponding to the above modules.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as the combinations are not contradictory, they should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application; their description is detailed, but it is not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is determined by the appended claims.
Claims (10)
1. A virtual digital human lip synchronization method, the method comprising:
shooting a video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
processing the face image through the OpenFace tool to obtain a facial key point image, and preprocessing the audio information to convert it into a Mel spectrogram;
training a Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video with the trained Wav2Lip model to obtain the corresponding lip video.
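The Mel spectrogram conversion in the preprocessing step rests on the standard Hz-to-Mel frequency mapping. The following minimal sketch uses the common convention for that mapping; the formula is standard practice, not quoted from the patent:

```python
# A minimal sketch of the standard Hz-to-Mel mapping underlying the
# Mel spectrogram conversion (common convention, not from the patent).
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(0.0))                # 0.0
print(round(hz_to_mel(700.0), 2))    # 781.17
```

In a full pipeline, a Mel filterbank built from this mapping is applied to the short-time Fourier transform of the audio to produce the spectrogram fed to the Audio Encoder.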
2. The virtual digital human lip synchronization method according to claim 1, wherein the training process of the Wav2Lip model specifically comprises:
inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio Encoder and Video Encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
performing feature mapping and association through a generative adversarial network to generate a training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, so that the discriminator judges at multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination results until training is completed.
3. The virtual digital human lip synchronization method according to claim 2, wherein the generator of the Wav2Lip model comprises a speech encoder, a face encoder, and a face decoder; the face encoder and the face decoder are composed of 2D-CNNs based on a residual structure; the face encoder encodes, frame by frame, 5 randomly selected reference frames together with the corresponding target frames whose lower half face is masked; the speech encoder encodes the speech segments processed by MFCC, and the face embedding and audio embedding obtained by encoding are combined through a simple concatenation operation and input into the face decoder, which generates the face images frame by frame.
4. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2Lip model trains the generator by minimizing the L1 reconstruction loss L_recon between the generated frames L_g and the reference frames L_G:

L_recon = (1/N) · Σ_{i=1}^{N} ‖L_g(i) − L_G(i)‖₁

wherein N represents the number of images, i represents the image index, L_g represents a generated frame, and L_G represents a reference frame.
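The L1 reconstruction loss above can be sketched numerically as follows; the flat pixel lists standing in for the frames L_g and L_G are illustrative only:

```python
# A sketch of the L1 reconstruction loss of claim 4, with flat lists of
# pixel values standing in for generated frames L_g and reference frames L_G.

def l1_reconstruction_loss(generated, reference):
    """Mean absolute pixel difference, averaged over N frames."""
    n = len(generated)
    per_frame = [
        sum(abs(g - r) for g, r in zip(g_frame, r_frame)) / len(g_frame)
        for g_frame, r_frame in zip(generated, reference)
    ]
    return sum(per_frame) / n

L_g = [[0.0, 0.5], [1.0, 1.0]]   # two generated "frames"
L_G = [[0.0, 1.0], [1.0, 0.0]]   # two reference "frames"
print(l1_reconstruction_loss(L_g, L_G))  # (0.25 + 0.5) / 2 = 0.375
```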
5. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2Lip model uses a pre-trained lip synchronization discriminator to correct the lip region of the person, so that more computing resources can be allocated to fine-grained lip region correction while ignoring background regions that are unimportant for image reconstruction or contain more unwanted detail; specifically, the lip synchronization loss L_sync is determined by the formulas:

P_sync = (E_v · E_s) / max(‖E_v‖₂ · ‖E_s‖₂, ε)

L_sync = (1/N) · Σ_{i=1}^{N} −log(P_sync(i))

wherein P_sync represents the probability that an input audio-video pair is synchronized, and E_v and E_s represent the embedding vectors of the video and the audio, respectively.
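A numeric sketch of this loss follows, assuming P_sync is the cosine similarity between the two embeddings as in the formula above; the toy embedding vectors are illustrative only:

```python
# A sketch of the synchronization probability and sync loss of claim 5:
# P_sync is the cosine similarity between the video embedding E_v and the
# audio embedding E_s, and the loss averages -log(P_sync) over the batch.
import math

def p_sync(e_v, e_s, eps=1e-8):
    dot = sum(a * b for a, b in zip(e_v, e_s))
    norm = math.sqrt(sum(a * a for a in e_v)) * math.sqrt(sum(b * b for b in e_s))
    return dot / max(norm, eps)

def sync_loss(pairs):
    """Average -log(P_sync) over a batch of (E_v, E_s) pairs."""
    return sum(-math.log(p_sync(v, s)) for v, s in pairs) / len(pairs)

# A perfectly synchronized pair (identical embeddings) gives -log(1) = 0.
print(sync_loss([([1.0, 0.0], [1.0, 0.0])]))
```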
6. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2Lip model includes a visual discriminator D constructed from a cascade of 2D-CNNs with residual structure, which receives as input the lower half of the face images generated by the generator; the visual discriminator is trained to maximize the objective:

L_disc = E_{x∼L_G}[log D(x)] + E_{x∼L_g}[log(1 − D(x))]

wherein L_g represents the images generated by the generator and L_G represents the real reference images.
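A minimal numeric sketch of this objective follows, with hypothetical discriminator output scores in (0, 1) standing in for D(x) on real and generated images:

```python
# A sketch of the visual discriminator objective of claim 6: D maximizes
# E[log D(x)] on real images plus E[log(1 - D(x))] on generated images.
import math

def disc_objective(d_real_scores, d_fake_scores):
    """L_disc for batches of discriminator outputs in (0, 1)."""
    real_term = sum(math.log(d) for d in d_real_scores) / len(d_real_scores)
    fake_term = sum(math.log(1.0 - d) for d in d_fake_scores) / len(d_fake_scores)
    return real_term + fake_term

# A well-trained D scores real images near 1 and fakes near 0,
# pushing L_disc toward its maximum of 0.
print(disc_objective([0.9], [0.1]))
```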
7. The virtual digital human lip synchronization method according to claim 2, wherein, in fusing the extracted audio and video features and outputting the fused features, the feature fusion module comprises a fast Fourier convolution layer and an adaptive instance normalization (AdaIN) algorithm.
8. The virtual digital human lip synchronization method according to claim 7, wherein the second fast Fourier convolution layer in the double-layer fast Fourier convolution residual stack is replaced with an AdaIN layer; AdaIN adopts an adaptive affine transformation algorithm and is used to fuse and convert feature information into the style of any given feature; AdaIN can be expressed by the formula:

AdaIN(x, y) = σ(y) · ((x − μ(x)) / σ(x)) + μ(y)

wherein x and y each represent a different feature, and μ(·) and σ(·) represent the mean and standard deviation calculations, respectively.
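The AdaIN formula can be sketched on 1-D feature lists as follows (illustrative only; in the model it operates per channel on 2-D feature maps):

```python
# A sketch of the AdaIN operation of claim 8 on 1-D feature lists:
# the content feature x is renormalized to the mean and standard
# deviation of the style feature y.
from statistics import mean, pstdev

def adain(x, y):
    mu_x, sigma_x = mean(x), pstdev(x)
    mu_y, sigma_y = mean(y), pstdev(y)
    return [sigma_y * (v - mu_x) / sigma_x + mu_y for v in x]

out = adain([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(out)                      # x rescaled to y's statistics
print(mean(out), pstdev(out))   # matches mean(y) and pstdev(y)
```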
9. A virtual digital human lip synchronization system, the system comprising:
the acquisition module, for shooting video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
the preprocessing module, for processing the face image through the OpenFace tool to obtain a facial key point image, and preprocessing the audio information to convert it into a Mel spectrogram;
the generation module, for training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video with the trained Wav2Lip model to obtain the corresponding lip video.
10. The virtual digital human lip synchronization system of claim 9, wherein the system further comprises:
the training module, for inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio Encoder and Video Encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
performing feature mapping and association through a generative adversarial network to generate a training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, so that the discriminator judges at multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination results until training is completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310928303.9A CN116668611A (en) | 2023-07-27 | 2023-07-27 | Virtual digital human lip synchronization method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116668611A (en) | 2023-08-29
Family
ID=87715631
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310928303.9A Pending CN116668611A (en) | 2023-07-27 | 2023-07-27 | Virtual digital human lip synchronization method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116668611A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160343389A1 (en) * | 2015-05-19 | 2016-11-24 | Bxb Electronics Co., Ltd. | Voice Control System, Voice Control Method, Computer Program Product, and Computer Readable Medium |
CN113723385A (en) * | 2021-11-04 | 2021-11-30 | 新东方教育科技集团有限公司 | Video processing method and device and neural network training method and device |
CN113723457A (en) * | 2021-07-28 | 2021-11-30 | 浙江大华技术股份有限公司 | Image recognition method and device, storage medium and electronic device |
CN113901894A (en) * | 2021-09-22 | 2022-01-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Video generation method, device, server and storage medium |
CN114793300A (en) * | 2021-01-25 | 2022-07-26 | 天津大学 | Virtual video customer service robot synthesis method and system based on generation countermeasure network |
CN115223214A (en) * | 2021-04-15 | 2022-10-21 | 腾讯科技(深圳)有限公司 | Identification method of synthetic mouth-shaped face, model acquisition method, device and equipment |
CN115713579A (en) * | 2022-10-25 | 2023-02-24 | 贝壳找房(北京)科技有限公司 | Wav2Lip model training method, image frame generation method, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |