CN116668611A - Virtual digital human lip synchronization method and system - Google Patents

Virtual digital human lip synchronization method and system

Info

Publication number
CN116668611A
CN116668611A
Authority
CN
China
Prior art keywords
video
image
face
lip
wav2lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310928303.9A
Other languages
Chinese (zh)
Inventor
袁海杰
王鑫恒
解仑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaoduo Intelligent Technology Beijing Co ltd
Original Assignee
Xiaoduo Intelligent Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoduo Intelligent Technology Beijing Co ltd filed Critical Xiaoduo Intelligent Technology Beijing Co ltd
Priority to CN202310928303.9A priority Critical patent/CN116668611A/en
Publication of CN116668611A publication Critical patent/CN116668611A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Abstract

The application discloses a virtual digital human lip synchronization method and system. In the method, video is first shot according to a pre-acquired target corpus to obtain a sample video, and face images and audio information are extracted from the sample video; the face images are then processed with the Openface tool to obtain face key point images, and the audio information is preprocessed and converted into a Mel spectrogram; finally, a Wav2Lip model is trained based on the face key point images and the Mel spectrogram, and the trained Wav2Lip model processes a target video to obtain the corresponding lip video. The method can effectively improve the accuracy of lip synchronization while reducing computational complexity and alleviating information redundancy.

Description

Virtual digital human lip synchronization method and system
Technical Field
The application relates to the technical field of deep learning, in particular to a virtual digital human lip synchronization method and a virtual digital human lip synchronization system.
Background
With the development of deep learning technology, the face image generation technology has made remarkable progress. However, many challenges remain in generating face images of speakers in real scenes. For example, insufficient synchronicity between the generated face image and audio results in uncoordinated face motion and speech.
The prior art generally uses the Wav2Lip model for lip synchronization; however, Wav2Lip cannot meet the requirements of high-quality speaker video generation. When videos of arbitrary speakers are generated at inference time, the lip synchronization effect remains poor. Moreover, Wav2Lip fuses the audio features and video features with only a simple concatenation operation. This can introduce redundant information between the feature vectors, which reduces model training efficiency and may cause the model to overfit. Because the model must also handle a large number of parameters, training time increases and more computing resources are required.
Disclosure of Invention
Based on the above, the embodiments of the application provide a virtual digital human lip synchronization method and a virtual digital human lip synchronization system, which are used to solve the problems in the prior art of low lip synchronization efficiency, information redundancy caused by simple concatenation in the feature fusion stage, and difficult training.
In a first aspect, there is provided a virtual digital human lip synchronization method, the method comprising:
shooting a video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
training a Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video by the trained Wav2Lip model to obtain a corresponding Lip video.
Optionally, the training process of the Wav2Lip model specifically includes:
inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio encoder and the Video encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
carrying out feature mapping and association through a generative adversarial network to generate the training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, which discriminates on multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination result until training is completed.
Optionally, the generator of the Wav2Lip model includes a speech encoder, a face encoder, and a face decoder; the face encoder and the face decoder are composed of 2D-CNNs based on a residual structure; the face encoder encodes, frame by frame, 5 randomly selected reference frames together with the corresponding frames whose lower half face is masked; the speech encoder encodes the speech segments processed by MFCC, and the encoded face embedding and audio embedding are simply concatenated and input into the face decoder, which generates the face images frame by frame.
Optionally, the Wav2Lip model implements the generator reconstruction loss $L_{recon}$ by minimizing the L1 reconstruction loss between the generated frames $L_g$ and the reference frames $L_G$:

$L_{recon} = \frac{1}{N}\sum_{i=1}^{N}\left\| L_g^{i} - L_G^{i} \right\|_1$

wherein $N$ represents the number of images, $i$ represents the image index, $L_g$ represents a generated frame, and $L_G$ represents a reference frame.
Optionally, the Wav2lip model uses a pre-trained lip-sync discriminator to correct the lip region of the person, enabling more computing resources to be allocated to fine-grained lip-region correction while ignoring background regions that are unimportant for image reconstruction or contain useless detail, specifically by the formulas:

$P_{sync} = \frac{E_v \cdot E_s}{\max\left(\left\| E_v \right\|_2 \cdot \left\| E_s \right\|_2, \epsilon\right)}$, $L_{sync} = \frac{1}{N}\sum_{i=1}^{N} -\log\left(P_{sync}^{i}\right)$

determining the lip-sync discriminator loss $L_{sync}$, wherein $P_{sync}$ represents the synchronization probability of an input audio-video pair, and $E_v$ and $E_s$ represent the embedding vectors of the video and audio, respectively.
Optionally, the Wav2lip model includes a visual discriminator, where the visual discriminator is constructed as a cascade of residual 2D-CNN blocks and receives as input the lower half of the face image generated by the generator; the visual discriminator is trained to maximize $L_{disc}$ according to the formula:

$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$

wherein $L_g$ represents the images generated by the generator and $L_G$ represents the real reference images.
Optionally, the extracted audio and video features are fused by a feature fusion module that outputs the fused features, wherein the feature fusion module comprises a fast Fourier convolution layer and an adaptive instance normalization algorithm.
Optionally, the second fast Fourier convolution layer in the double-layer fast Fourier convolution residual stack structure is replaced with an AdaIN layer; AdaIN adopts an adaptive affine transformation algorithm and is used to fuse and convert feature information into any given feature; AdaIN can be expressed by the formula

$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)$

wherein $x$ and $y$ each represent a different feature, and $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and standard deviation calculations, respectively.
In a second aspect, there is provided a virtual digital human lip synchronization system, the system comprising:
the acquisition module is used for carrying out video shooting according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
the preprocessing module is used for processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
the generating module is used for training the Wav2Lip model based on the face key point image and the Mel spectrogram, and processing the target video by the Wav2Lip model after training to obtain a corresponding Lip video.
Optionally, the system further comprises:
the training module is used for inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio encoder and the Video encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
carrying out feature mapping and association through a generative adversarial network to generate the training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, which discriminates on multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination result until training is completed.
According to the technical scheme provided by the embodiment of the application, firstly, video shooting is carried out according to a target corpus obtained in advance to obtain a sample video, a face image and audio information are extracted from the sample video, then the face image is processed through an Openface tool to obtain a face key point image, and the audio information is preprocessed and converted into a Mel spectrogram; and finally training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing the target video by the trained Wav2Lip model to obtain a corresponding Lip video. It can be seen that the beneficial effects of the application are:
(1) Compared with traditional methods, which can only accurately lip-sync static images or videos of a specific person and cannot accurately alter the lip movements of arbitrary identities in dynamic, unconstrained speaker face videos, the application realizes higher-quality lip synchronization and can effectively improve lip synchronization accuracy.
(2) The application adopts a feature fusion module to assist the fusion of video and audio features, reducing computational complexity and information redundancy while improving lip synchronization accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those skilled in the art from this disclosure that the drawings described below are merely exemplary and that other embodiments may be derived from the drawings provided without undue effort.
FIG. 1 is a flow chart of a virtual digital human lip synchronization method provided by an embodiment of the application;
FIG. 2 is a processing flow chart of a virtual digital human lip synchronization method according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the description of the present application, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements but may include other steps or elements not expressly listed but inherent to such process, method, article, or apparatus or steps or elements added based on further optimization of the inventive concept.
The application aims to provide a lip synchronization method that improves lip synchronization accuracy, so as to solve the problems in the prior art of low lip synchronization efficiency, information redundancy caused by simple concatenation in the feature fusion stage, and difficult training. Specifically, please refer to FIG. 1, which illustrates a flowchart of a virtual digital human lip synchronization method according to an embodiment of the present application; the method may include the following steps:
and step 101, performing video shooting according to a pre-acquired target corpus to obtain a sample video.
In this embodiment, data acquisition is needed first, and video shooting is carried out on a given related corpus; the audio information is then processed and the video face information is extracted.
In this step, a training sample is specifically determined, where the training sample includes an original image frame, a real image frame, and an audio file; the original image frame contains the face of a speaker, and the real image frame contains the speaker's real lip shape corresponding to the audio file.
Step 102, processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram.
In this step, specifically, audio preprocessing is included: the input audio is converted into a Mel spectrogram for subsequent speech feature extraction.
Video preprocessing: face key point detection is performed on the input video using the Openface tool to obtain the corresponding face key point image frames.
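As an illustration of the audio preprocessing described above, the following is a minimal sketch that assumes librosa is used to compute the Mel spectrogram; the sampling rate and Mel-filter parameters are illustrative assumptions rather than values specified by the application, and the face key points themselves come from the external Openface tool.

```python
# Sketch of the audio preprocessing step. Assumptions: librosa is used for the Mel
# spectrogram; sr=16000, n_fft=800, hop_length=200 and n_mels=80 are illustrative values.
import librosa
import numpy as np

def audio_to_mel(wav_path, sr=16000, n_fft=800, hop_length=200, n_mels=80):
    """Convert an input audio file into a log-Mel spectrogram for speech feature extraction."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log-compress to decibels, a common choice before feeding a speech encoder.
    return librosa.power_to_db(mel, ref=np.max)
```

The resulting log-Mel spectrogram can then be sliced into windows aligned with the video frames before being fed to the Audio encoder.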
And step 103, training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing the target video by the trained Wav2Lip model to obtain a corresponding Lip video.
Based on the training samples, the training process of the Wav2Lip model is performed. The training process comprises: the Mel spectrogram and the face image frames obtained from the preprocessing steps are input into the proposed model, and audio and video features are further extracted by the Audio encoder and the Video encoder. The extracted audio and video features are then fused through the feature fusion module, which outputs the fused features. Finally, feature mapping and association are performed through the generative adversarial network to generate the lip video. The generated image frames and the real image frames are input into a multi-scale image quality discriminator, which discriminates on multiple scales whether the generated image frames and the real image frames are real images.
In the embodiment of the application, the training process of the Wav2Lip model specifically comprises the following steps:
inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio encoder and the Video encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
carrying out feature mapping and association through a generative adversarial network to generate the training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator (sketched below), which discriminates on multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination result until training is completed.
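The multi-scale image quality discriminator used in the last step can be illustrated as follows; this is a hedged PyTorch sketch in which the number of scales, channel widths, and layer counts are assumptions, not the application's exact design.

```python
# Hedged PyTorch sketch of a multi-scale image quality discriminator: the same small
# patch discriminator is applied to progressively downsampled versions of each frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleDiscriminator(nn.Module):
    """Convolutional discriminator applied at a single image scale."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1))  # patch-wise real/fake scores

    def forward(self, x):
        return self.net(x)

class MultiScaleImageQualityDiscriminator(nn.Module):
    """Judges whether frames are real or generated at several resolutions."""
    def __init__(self, num_scales=3):
        super().__init__()
        self.discriminators = nn.ModuleList(
            [ScaleDiscriminator() for _ in range(num_scales)])

    def forward(self, frames):
        outputs = []
        for d in self.discriminators:
            outputs.append(d(frames))
            # Downsample by 2 before the next, coarser-scale discriminator.
            frames = F.avg_pool2d(frames, kernel_size=2)
        return outputs
```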
For the Wav2Lip model in the application, the generator consists of three parts, namely a speech encoder, a face encoder, and a face decoder. The face encoder and the face decoder are composed of residual 2D-CNN blocks. The face encoder encodes, frame by frame, T=5 randomly selected reference frames together with the corresponding frame whose lower half face is masked. The speech encoder encodes the speech segment processed by MFCC. The face images are generated frame by frame by inputting the simple concatenation of the encoded face embedding and audio embedding into the face decoder.
The Wav2Lip model implements the generator reconstruction loss $L_{recon}$ by minimizing the L1 reconstruction loss between the generated frames $L_g$ and the reference frames $L_G$:

$L_{recon} = \frac{1}{N}\sum_{i=1}^{N}\left\| L_g^{i} - L_G^{i} \right\|_1$    (1)

wherein $N$ represents the number of images, $i$ represents the image index, $L_g$ represents a generated frame, and $L_G$ represents a reference frame.
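Equation (1) corresponds to a plain mean absolute error; a minimal PyTorch sketch:

```python
# L1 reconstruction loss between generated frames and reference frames, i.e. equation (1).
import torch

def reconstruction_loss(generated, reference):
    # Mean absolute error over all N images in the batch.
    return torch.mean(torch.abs(generated - reference))
```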
In this embodiment, wav2lip uses a pre-trained lip sync discriminator to correct the lip area of the person. This approach allocates more computing resources to fine-grained lip region correction while ignoring background regions that are not important for image reconstruction or contain more useless details.
$P_{sync} = \frac{E_v \cdot E_s}{\max\left(\left\| E_v \right\|_2 \cdot \left\| E_s \right\|_2, \epsilon\right)}$    (2)

$L_{sync} = \frac{1}{N}\sum_{i=1}^{N} -\log\left(P_{sync}^{i}\right)$    (3)

Equations (2) and (3) determine the lip-sync discriminator loss $L_{sync}$, wherein $P_{sync}$ represents the synchronization probability of an input audio-video pair, and $E_v$ and $E_s$ represent the embedding vectors of the video and audio, respectively.
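A hedged PyTorch sketch of equations (2) and (3), treating $E_v$ and $E_s$ as batched embedding vectors; the clamp is a numerical safeguard added here and is not part of the formulas:

```python
# Sketch of equations (2) and (3): cosine-similarity synchronization probability P_sync
# between the video and audio embeddings, and the resulting sync loss L_sync.
import torch
import torch.nn.functional as F

def sync_loss(e_v, e_s, eps=1e-8):
    # P_sync: cosine similarity of each (video, audio) embedding pair in the batch.
    p_sync = F.cosine_similarity(e_v, e_s, dim=1)
    # Clamp keeps the logarithm finite; this safeguard is an addition, not part of (2)/(3).
    return torch.mean(-torch.log(p_sync.clamp(min=eps)))
```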
To address the slight artifacts or mouth blurring that can occur when the lip-sync discriminator corrects lip synchronization, the present embodiment adds a visual discriminator. The visual discriminator is constructed as a cascade of residual 2D-CNN blocks and receives the lower half of the face image generated by the generator as input. The training objective of the visual discriminator is to maximize $L_{disc}$:

$L_{gen} = \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$    (4)

$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$    (5)

wherein $L_g$ represents the images generated by the generator and $L_G$ represents the real reference images. For the generator, the objective minimized in equation (6) is a weighted sum of the reconstruction loss $L_{recon}$, the synchronization loss $L_{sync}$, and the adversarial loss $L_{gen}$:

$L_{total} = \left(1 - s_w - s_g\right) L_{recon} + s_w\, L_{sync} + s_g\, L_{gen}$    (6)

wherein $s_w$ and $s_g$ denote the weighting parameters.
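Equation (6) can be sketched as a simple weighted sum; the default weights s_w and s_g below are assumptions following common practice for this kind of objective, not values stated in the application:

```python
# Sketch of equation (6): weighted sum of reconstruction, synchronization, and adversarial
# losses. The default weights s_w and s_g are illustrative assumptions.
def total_generator_loss(l_recon, l_sync, l_gen, s_w=0.03, s_g=0.07):
    return (1.0 - s_w - s_g) * l_recon + s_w * l_sync + s_g * l_gen
```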
The feature fusion module comprises a fast Fourier convolution (FFC) layer and an adaptive instance normalization (AdaIN) algorithm.
The application improves the robustness of the model and its understanding of the global image structure through the proposed double-layer FFC residual stack structure. The AdaIN algorithm is introduced to better fuse the audio and video features. Specifically, the second FFC layer in the double-layer FFC residual stack is replaced with an AdaIN layer. AdaIN uses an adaptive affine transformation to facilitate better fusion and conversion of feature information into any given feature. AdaIN can be expressed by formula (7):

$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)$    (7)

wherein $x$ and $y$ represent different features, and $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and standard deviation calculations, respectively.
The AdaIN algorithm aligns the channel-wise mean and variance of the final video feature map with those of the encoded speech-spectrum audio feature map, so that the AdaIN-processed feature map exhibits activation statistics similar to the audio features at the output, better preserving the semantic information of the audio.
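A minimal PyTorch sketch of AdaIN as used in the feature fusion module (equation (7)), re-normalizing the video feature map x to the channel-wise statistics of the audio feature map y; the epsilon term is a numerical safeguard added here:

```python
# Minimal PyTorch sketch of AdaIN (equation (7)): the video feature map x is re-normalized
# to the channel-wise mean and standard deviation of the audio feature map y.
import torch

def adain(x, y, eps=1e-5):
    # Channel-wise statistics over the spatial dimensions (B, C, H, W assumed).
    mu_x, sigma_x = x.mean(dim=(2, 3), keepdim=True), x.std(dim=(2, 3), keepdim=True) + eps
    mu_y, sigma_y = y.mean(dim=(2, 3), keepdim=True), y.std(dim=(2, 3), keepdim=True) + eps
    return sigma_y * (x - mu_x) / sigma_x + mu_y
```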
In summary, as shown in FIG. 2, a process flow structure diagram of the virtual digital human lip synchronization method is provided. Optionally, the process includes the following steps:
step one: the data acquisition is needed, and video shooting is carried out on a given related corpus;
step two: processing audio information and extracting video face information;
step three: further extracting audio and video features through the Audio encoder and the Video encoder;
step four: fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
step five: performing feature mapping and association through a generative adversarial network to realize the generation of the lip video.
The embodiment of the application also provides a virtual digital human lip synchronization system. The system comprises:
the acquisition module is used for carrying out video shooting according to a pre-acquired target corpus to obtain a sample video; the sample video comprises a face image and audio;
the preprocessing module is used for processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
the generating module is used for training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing the target video by the Wav2Lip model after training to obtain a corresponding Lip video.
In an alternative embodiment of the application, the system further comprises:
training module for: inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio encoder and the Video encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
carrying out feature mapping and association through a generative adversarial network to generate the training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, which discriminates on multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination result until training is completed.
The virtual digital human lip synchronization system provided by the embodiment of the present application is used to implement the above-mentioned virtual digital human lip synchronization method; for specific limitations of the system, reference may be made to the limitations of the method described above, which are not repeated herein. The various portions of the virtual digital human lip sync system described above may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules may be embedded in hardware or independent of a processor in the device, or may be stored as software in a memory in the device, so that the processor may call and execute the operations corresponding to the above modules.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the claims. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A virtual digital human lip synchronization method, the method comprising:
shooting a video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
training a Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video by the trained Wav2Lip model to obtain a corresponding Lip video.
2. The virtual digital human Lip synchronization method according to claim 1, wherein the training process of the Wav2Lip model specifically comprises:
inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio encoder and the Video encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
carrying out feature mapping and association through a generative adversarial network to generate the training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, which discriminates on multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination result until training is completed.
3. The virtual digital human lip synchronization method according to claim 2, wherein the generator of the Wav2Lip model comprises a speech encoder, a face encoder, and a face decoder; the face encoder and the face decoder are composed of 2D-CNNs based on a residual structure; the face encoder encodes, frame by frame, 5 randomly selected reference frames together with the corresponding frames whose lower half face is masked; the speech encoder encodes the speech segments processed by MFCC, and the encoded face embedding and audio embedding are simply concatenated and input into the face decoder, which generates the face images frame by frame.
4. The virtual digital human Lip synchronization method according to claim 2, wherein the Wav2Lip model implements the generator reconstruction loss $L_{recon}$ by minimizing the L1 reconstruction loss between the generated frames $L_g$ and the reference frames $L_G$:
$L_{recon} = \frac{1}{N}\sum_{i=1}^{N}\left\| L_g^{i} - L_G^{i} \right\|_1$
wherein $N$ represents the number of images, $i$ represents the image index, $L_g$ represents a generated frame, and $L_G$ represents a reference frame.
5. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2lip model uses a pre-trained lip synchronization discriminator to correct the lip region of the person, enabling more computing resources to be allocated to fine-grained lip region correction while ignoring background regions that are not important for image reconstruction or contain useless detail, specifically by the formulas:
$P_{sync} = \frac{E_v \cdot E_s}{\max\left(\left\| E_v \right\|_2 \cdot \left\| E_s \right\|_2, \epsilon\right)}$, $L_{sync} = \frac{1}{N}\sum_{i=1}^{N} -\log\left(P_{sync}^{i}\right)$
determining the lip synchronization discriminator loss $L_{sync}$, wherein $P_{sync}$ represents the synchronization probability of an input audio-video pair, and $E_v$ and $E_s$ represent the embedding vectors of the video and audio, respectively.
6. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2lip model includes a visual discriminator, the visual discriminator is constructed as a cascade of residual 2D-CNN blocks and receives as input the lower half of the face image generated by the generator, and the visual discriminator is trained to maximize $L_{disc}$ according to the formula:
$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$
wherein $L_g$ represents the images generated by the generator and $L_G$ represents the real reference images.
7. The virtual digital human lip synchronization method according to claim 2, wherein the extracted audio and video features are fused by a feature fusion module that outputs the fused features, and the feature fusion module comprises a fast Fourier convolution layer and an adaptive instance normalization algorithm.
8. The virtual digital human lip synchronization method of claim 7, wherein the second fast Fourier convolution layer in the double-layer fast Fourier convolution residual stack is replaced with an AdaIN layer; AdaIN adopts an adaptive affine transformation algorithm and is used to fuse and convert feature information into any given feature; AdaIN can be expressed by the formula
$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)$
wherein $x$ and $y$ each represent a different feature, and $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and standard deviation calculations, respectively.
9. A virtual digital human lip sync system, the system comprising:
the acquisition module is used for carrying out video shooting according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
the preprocessing module is used for processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
the generating module is used for training the Wav2Lip model based on the face key point image and the Mel spectrogram, and processing the target video by the Wav2Lip model after training to obtain a corresponding Lip video.
10. The virtual digital human lip synchronization system of claim 9, wherein the system further comprises:
the training module is used for inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio encoder and the Video encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
carrying out feature mapping and association through a generative adversarial network to generate the training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, which discriminates on multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination result until training is completed.
CN202310928303.9A 2023-07-27 2023-07-27 Virtual digital human lip synchronization method and system Pending CN116668611A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310928303.9A CN116668611A (en) 2023-07-27 2023-07-27 Virtual digital human lip synchronization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310928303.9A CN116668611A (en) 2023-07-27 2023-07-27 Virtual digital human lip synchronization method and system

Publications (1)

Publication Number Publication Date
CN116668611A true CN116668611A (en) 2023-08-29

Family

ID=87715631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310928303.9A Pending CN116668611A (en) 2023-07-27 2023-07-27 Virtual digital human lip synchronization method and system

Country Status (1)

Country Link
CN (1) CN116668611A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160343389A1 (en) * 2015-05-19 2016-11-24 Bxb Electronics Co., Ltd. Voice Control System, Voice Control Method, Computer Program Product, and Computer Readable Medium
CN113723385A (en) * 2021-11-04 2021-11-30 新东方教育科技集团有限公司 Video processing method and device and neural network training method and device
CN113723457A (en) * 2021-07-28 2021-11-30 浙江大华技术股份有限公司 Image recognition method and device, storage medium and electronic device
CN113901894A (en) * 2021-09-22 2022-01-07 腾讯音乐娱乐科技(深圳)有限公司 Video generation method, device, server and storage medium
CN114793300A (en) * 2021-01-25 2022-07-26 天津大学 Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN115223214A (en) * 2021-04-15 2022-10-21 腾讯科技(深圳)有限公司 Identification method of synthetic mouth-shaped face, model acquisition method, device and equipment
CN115713579A (en) * 2022-10-25 2023-02-24 贝壳找房(北京)科技有限公司 Wav2Lip model training method, image frame generation method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN111783566B (en) Video synthesis method based on lip synchronization and enhancement of mental adaptation effect
WO2022267641A1 (en) Image defogging method and system based on cyclic generative adversarial network
CN113487618B (en) Portrait segmentation method, portrait segmentation device, electronic equipment and storage medium
CN114187547A (en) Target video output method and device, storage medium and electronic device
CN115442543A (en) Method, device, equipment and storage medium for synthesizing virtual image speaking video
Borsos et al. Speechpainter: Text-conditioned speech inpainting
CN113077537A (en) Video generation method, storage medium and equipment
CN110599411A (en) Image restoration method and system based on condition generation countermeasure network
CN114723760B (en) Portrait segmentation model training method and device and portrait segmentation method and device
CN116012255A (en) Low-light image enhancement method for generating countermeasure network based on cyclic consistency
CN113570689B (en) Portrait cartoon method, device, medium and computing equipment
US20230335148A1 (en) Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium
US11526972B2 (en) Simultaneously correcting image degradations of multiple types in an image of a face
CN117440114A (en) Virtual image video generation method, device, equipment and medium
CN116668611A (en) Virtual digital human lip synchronization method and system
CN116362995A (en) Tooth image restoration method and system based on standard prior
CN113450824B (en) Voice lip reading method and system based on multi-scale video feature fusion
JP2024508568A (en) Image processing method, device, equipment, and computer program
CN113469292A (en) Training method, synthesizing method, device, medium and equipment for video synthesizing model
US11887403B1 (en) Mouth shape correction model, and model training and application method
Sheng et al. Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment
CN113722513B (en) Multimedia data processing method and equipment
CN115240106B (en) Task self-adaptive small sample behavior recognition method and system
KR102525594B1 (en) Apparatus and method for creating face image data for age recognition
CN116797877A (en) Training method and device for image generation model, and image generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination