CN116668611A - Virtual digital human lip synchronization method and system - Google Patents
Virtual digital human lip synchronization method and system
- Publication number
- CN116668611A (application CN202310928303.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- image
- face
- lip
- wav2lip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 42
- 238000012549 training Methods 0.000 claims abstract description 47
- 238000012545 processing Methods 0.000 claims abstract description 17
- 230000001815 facial effect Effects 0.000 claims abstract description 12
- 238000004364 calculation method Methods 0.000 claims abstract description 5
- 230000004927 fusion Effects 0.000 claims description 39
- 239000010410 layer Substances 0.000 claims description 15
- 238000007781 pre-processing Methods 0.000 claims description 11
- 230000008569 process Effects 0.000 claims description 9
- 230000000007 visual effect Effects 0.000 claims description 9
- 238000013507 mapping Methods 0.000 claims description 8
- 238000004422 calculation algorithm Methods 0.000 claims description 7
- 230000003044 adaptive effect Effects 0.000 claims description 6
- 239000013598 vector Substances 0.000 claims description 4
- 238000012937 correction Methods 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 230000001360 synchronised effect Effects 0.000 claims description 3
- 239000002355 dual-layer Substances 0.000 claims description 2
- 239000012634 fragment Substances 0.000 claims description 2
- 230000009466 transformation Effects 0.000 claims description 2
- 238000013135 deep learning Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 238000011426 transformation method Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
Abstract
The application discloses a virtual digital human lip synchronization method and system. The method first shoots video according to a pre-acquired target corpus to obtain a sample video, and extracts face images and audio information from it; the face images are then processed with the OpenFace tool to obtain face key point images, and the audio information is preprocessed and converted into a Mel spectrogram; finally, a Wav2Lip model is trained on the facial key point images and Mel spectrograms, and the trained Wav2Lip model processes a target video to obtain the corresponding lip video. The method effectively improves lip synchronization accuracy while reducing computational complexity and alleviating information redundancy.
Description
Technical Field
The application relates to the technical field of deep learning, in particular to a virtual digital human lip synchronization method and a virtual digital human lip synchronization system.
Background
With the development of deep learning technology, face image generation has made remarkable progress. However, many challenges remain in generating speaker face images in real scenes; for example, insufficient synchronization between the generated face image and the audio results in uncoordinated facial motion and speech.
The prior art generally uses the Wav2Lip model for lip synchronization; however, Wav2Lip cannot meet the demands of high-quality speaker video generation. When generating video for an arbitrary speaker at inference time, lip synchronization remains poor. Moreover, Wav2Lip fuses audio and video features with only a simple concatenation operation, which can introduce redundant information between feature vectors, reducing training efficiency and risking overfitting. Because the model must also handle a large number of parameters, training time increases and more computing resources may be required.
Disclosure of Invention
Based on the above, the embodiment of the application provides a virtual digital human lip synchronization method and a virtual digital human lip synchronization system, which are used for solving the problems of low lip synchronization efficiency, information redundancy caused by a simple connection mode in a feature fusion stage, difficulty in training and the like in the prior art.
In a first aspect, there is provided a virtual digital human lip synchronization method, the method comprising:
shooting a video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
training a Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video by the trained Wav2Lip model to obtain a corresponding Lip video.
Optionally, the training process of the Wav2Lip model specifically includes:
inputting the Mel spectrogram and the facial key point image into a Wav2Lip model, and extracting further Audio and Video features through an Audio encoder and a Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
Optionally, the generator of the Wav2Lip model includes a speech encoder, a face encoder and a face decoder; the face encoder and face decoder are 2D-CNNs built on residual structures. The face encoder encodes, frame by frame, 5 randomly selected reference frames together with the corresponding reference frames whose lower half face is masked; the speech encoder encodes the MFCC-processed speech segments, and the encoded face and audio features are simply concatenated and fed into the face decoder to generate face images frame by frame.
Optionally, the Wav2Lip model implements the generator reconstruction loss $L_{recon}$ by minimizing the L1 reconstruction loss between the generated frame $L_g$ and the reference frame $L_G$:

$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N} \left\| L_g^{(i)} - L_G^{(i)} \right\|_1$$

where $N$ denotes the number of images, $i$ the image index, $L_g$ the generated frame, and $L_G$ the reference frame.
Optionally, the Wav2Lip model uses a pre-trained lip-sync discriminator to correct the lip region of the person, so that more computing resources can be allocated to fine-grained lip region correction while ignoring background regions that are unimportant for image reconstruction or contain useless detail; specifically, by the formulas

$$P_{sync} = \frac{E_v \cdot E_s}{\max\left(\|E_v\|_2 \, \|E_s\|_2,\ \epsilon\right)}, \qquad L_{sync} = -\frac{1}{N}\sum_{i=1}^{N} \log\left(P_{sync}^{(i)}\right)$$

the lip-sync discriminator loss $L_{sync}$ is determined, where $P_{sync}$ denotes the synchronization probability of an input audio-video pair, and $E_v$ and $E_s$ denote the embedded vectors of the video and audio, respectively.
Optionally, the Wav2Lip model includes a visual discriminator $D$, constructed as a cascade of residual-structure 2D-CNN blocks, which receives the lower half of the face image generated by the generator as input; the visual discriminator is trained to maximize

$$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$$

where $L_g$ denotes the generated image produced by the generator and $L_G$ the true reference image.
Optionally, the extracted audio and video features are fused by a feature fusion module that outputs the fused features; the feature fusion module comprises a fast Fourier convolution layer and an adaptive instance normalization algorithm.
Optionally, the second fast Fourier convolution layer in the double-layer fast Fourier convolution residual stack is replaced with an AdaIN layer; AdaIN applies an adaptive affine transformation to fuse and convert feature information into any given feature, and can be expressed by the formula

$$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)$$

where $x$ and $y$ denote different features, and $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and variance calculations, respectively.
In a second aspect, there is provided a virtual digital human lip synchronization system, the system comprising:
the acquisition module is used for carrying out video shooting according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
the preprocessing module is used for processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
the generating module is used for training the Wav2Lip model based on the face key point image and the Mel spectrogram, and processing the target video by the Wav2Lip model after training to obtain a corresponding Lip video.
Optionally, the system further comprises:
the training module is used for inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting further Audio and Video features through the Audio encoder and the Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
According to the technical scheme provided by the embodiment of the application, video shooting is first carried out according to a pre-acquired target corpus to obtain a sample video, and face images and audio information are extracted from it; the face images are then processed with the OpenFace tool to obtain face key point images, and the audio information is preprocessed and converted into a Mel spectrogram; finally, a Wav2Lip model is trained on the facial key point images and Mel spectrograms, and the trained Wav2Lip model processes a target video to obtain the corresponding lip video. The beneficial effects of the application are as follows:
(1) Traditional methods can only accurately synchronize the lip shape of a static image or video of a specific person, and cannot accurately drive the lip movement of an arbitrary identity in dynamic, unconstrained speaker video; in contrast, the application realizes higher-quality lip synchronization and effectively improves lip synchronization accuracy.
(2) The application adopts a feature fusion module to help the fusion of video and audio features, reduces the problems of calculation complexity, information redundancy and the like, and improves the lip synchronization accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It will be apparent to those skilled in the art that the drawings described below are merely exemplary, and that other embodiments may be derived from them without undue effort.
FIG. 1 is a flow chart of a virtual digital human lip synchronization method provided by an embodiment of the application;
Fig. 2 is a processing flow chart of a virtual digital human lip synchronization method according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the description of the present application, the terms "comprises", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus comprising a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed, inherent to such process, method, article or apparatus, or added based on further optimization of the inventive concept.
The application aims to provide a lip synchronization method for improving lip synchronization accuracy, which aims to solve the problems of low lip synchronization efficiency, information redundancy caused by a simple connection mode in a feature fusion stage, difficult training and the like in the prior art. Specifically, please refer to fig. 1, which illustrates a flowchart of a virtual digital human lip synchronization method according to an embodiment of the present application, the method may include the following steps:
and step 101, performing video shooting according to a pre-acquired target corpus to obtain a sample video.
In the embodiment, data acquisition is needed first, and video shooting is carried out on a given related corpus; then processing the audio information and extracting the video face information.
In this step, a training sample is specifically determined, where the training sample includes an original image frame containing the speaker's face, a real image frame containing the speaker's true lip shape corresponding to the audio file, and the audio file itself.
Step 102, processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram.
In this step, specifically, audio preprocessing is included: the input audio is converted into a Mel spectrogram for subsequent speech feature extraction.
Video preprocessing: and carrying out face key point detection on the input video by using an Openface tool to obtain corresponding face image frames.
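The Mel conversion in the audio preprocessing step can be sketched as a triangular mel filterbank applied to STFT power bins; the HTK-style mel formula and the parameter values (16 kHz audio, 512-point FFT, 80 mel bands) are assumptions, as the patent does not specify them:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed variant; the patent does not specify one)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=80):
    """Triangular filterbank mapping n_fft//2 + 1 STFT bins to n_mels bands;
    multiplying it by a power spectrogram yields a Mel spectrogram."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                      # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                     # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb
```

In practice a library routine such as librosa's `melspectrogram` would typically be used; the sketch only shows the shape of the computation feeding the Wav2Lip audio encoder.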
And step 103, training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing the target video by the trained Wav2Lip model to obtain a corresponding Lip video.
Based on the training samples, the training process of the Wav2Lip model is performed. The Mel spectrogram and face image frames obtained in the preceding steps are fed into the proposed network, where the Audio encoder and Video encoder extract further audio and video features. The extracted audio and video features are then fused by the feature fusion module, which outputs the fused features. Finally, feature mapping and association are performed through a generative adversarial network to realize lip video generation. The generated image frames and real image frames are input into a multi-scale image quality discriminator, which discriminates at multiple scales whether the generated and real image frames are real images.
In the embodiment of the application, the training process of the Wav2Lip model specifically comprises the following steps:
inputting the Mel spectrogram and the facial key point image into a Wav2Lip model, and extracting further Audio and Video features through an Audio encoder and a Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
For the Wav2Lip model in the application, the generator consists of three parts: a speech encoder, a face encoder and a face decoder. The face encoder and face decoder are 2D-CNNs built on residual structures. The face encoder encodes, frame by frame, T = 5 randomly selected reference frames and the corresponding reference frames with the lower half of the face masked. The speech encoder encodes the MFCC-processed speech segment. The face image is then generated frame by frame by feeding the simple concatenation of the encoded face and audio features into the face decoder.
The Wav2Lip model implements the generator loss $L_{recon}$ by minimizing the L1 reconstruction loss between the generated frame $L_g$ and the reference frame $L_G$:

$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N} \left\| L_g^{(i)} - L_G^{(i)} \right\|_1 \qquad (1)$$

where $N$ denotes the number of images, $i$ the image index, $L_g$ the generated frame, and $L_G$ the reference frame.
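A minimal numpy sketch of the reconstruction loss in equation (1); the frame layout (N stacked frames) is an illustrative assumption:

```python
import numpy as np

def l1_reconstruction_loss(generated, reference):
    """Equation (1): mean over N frames of the L1 norm between the
    generated frame L_g and the reference frame L_G."""
    return float(np.mean([np.abs(g - r).sum() for g, r in zip(generated, reference)]))
```

For example, two 3x3 generated frames of all ones against all-zero references give a loss of 9.0 (nine unit pixel errors per frame).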
In this embodiment, wav2lip uses a pre-trained lip sync discriminator to correct the lip area of the person. This approach allocates more computing resources to fine-grained lip region correction while ignoring background regions that are not important for image reconstruction or contain more useless details.
$$P_{sync} = \frac{E_v \cdot E_s}{\max\left(\|E_v\|_2 \, \|E_s\|_2,\ \epsilon\right)} \qquad (2)$$

$$L_{sync} = -\frac{1}{N}\sum_{i=1}^{N} \log\left(P_{sync}^{(i)}\right) \qquad (3)$$

These determine the lip-sync discriminator loss $L_{sync}$, where $P_{sync}$ denotes the synchronization probability of an input audio-video pair, and $E_v$ and $E_s$ denote the embedded vectors of the video and audio, respectively.
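Equations (2) and (3) can be sketched directly in numpy; a perfectly synchronized pair (identical embeddings) yields $P_{sync} = 1$ and zero loss:

```python
import numpy as np

def sync_probability(e_v, e_s, eps=1e-8):
    """Equation (2): cosine similarity between the video embedding E_v and
    the audio embedding E_s, with the denominator clamped by eps."""
    num = float(np.dot(e_v, e_s))
    den = max(float(np.linalg.norm(e_v) * np.linalg.norm(e_s)), eps)
    return num / den

def sync_loss(pairs):
    """Equation (3): mean negative log of P_sync over a batch of
    (video, audio) embedding pairs. Assumes P_sync > 0 for every pair."""
    probs = [sync_probability(v, s) for v, s in pairs]
    return -float(np.mean(np.log(probs)))
```

Note the discriminator in practice maps embeddings so that similarities stay positive; negative cosine values would make the log undefined in this bare sketch.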
To mitigate the slight artifacts or mouth blurring that arise when lip synchronization is corrected by the lip-sync discriminator alone, this embodiment adds a visual discriminator. The visual discriminator $D$ is constructed as a cascade of residual-structure 2D-CNN blocks and receives the lower half of the face image generated by the generator as input. The visual discriminator is trained to maximize $L_{disc}$:

$$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right] \qquad (4)$$

while the generator minimizes the adversarial loss

$$L_{gen} = \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right] \qquad (5)$$

where $L_g$ denotes the generated image produced by the generator and $L_G$ the true reference image. For the generator, minimizing equation (6) refers to a weighted sum of the reconstruction loss $L_{recon}$, the synchronization loss $L_{sync}$ and the adversarial loss $L_{gen}$.
$$L_{total} = (1 - s_w - s_g)\,L_{recon} + s_w\,L_{sync} + s_g\,L_{gen} \qquad (6)$$

where $s_w$ and $s_g$ denote the setting parameters (weights).
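The weighted sum of equation (6) is trivial to express; the default weight values below follow the public Wav2Lip implementation and are assumptions here, since the patent only calls them "setting parameters":

```python
def total_generator_loss(l_recon, l_sync, l_gen, s_w=0.03, s_g=0.07):
    """Equation (6): convex combination of reconstruction, sync and
    adversarial losses. s_w and s_g are the two setting parameters."""
    return (1.0 - s_w - s_g) * l_recon + s_w * l_sync + s_g * l_gen
```

With the assumed defaults, a pure reconstruction error of 1.0 contributes 0.9 to the total, so reconstruction dominates early training while the sync and adversarial terms refine the lip region.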
The feature fusion module comprises a fast Fourier convolution layer (FFC) and an adaptive instance normalization algorithm (AdaIN).
The application improves the robustness of the model and its understanding of the global image structure through the proposed double-layer FFC residual stack. The AdaIN algorithm is introduced to better fuse the audio and video features: specifically, the second FFC layer in the double-layer FFC residual stack is replaced with an AdaIN layer. AdaIN uses an adaptive affine transformation to fuse and convert feature information into any given feature, and can be expressed by formula (7):

$$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y) \qquad (7)$$

where $x$ and $y$ denote different features, and $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and variance calculations, respectively.
The AdaIN algorithm aligns, in the channel dimension, the mean and variance of the last-processed Video Feature map with those of the encoded speech-spectrum Audio Feature map, so that the AdaIN-processed feature map exhibits channel activations similar to the audio features at output and better preserves the semantic information of the audio.
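Equation (7) and the channel-statistic alignment just described can be sketched as follows; the (C, H, W) layout and per-channel statistics are illustrative assumptions:

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Equation (7): normalize the video feature map x per channel, then
    rescale with the channel mean/std of the audio feature map y, so the
    output inherits y's channel statistics."""
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    sd_x = x.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    mu_y = y.mean(axis=(1, 2), keepdims=True)
    sd_y = y.std(axis=(1, 2), keepdims=True)
    return sd_y * (x - mu_x) / sd_x + mu_y
```

After this transform, each channel of the output has (up to eps) the mean and standard deviation of the corresponding audio-feature channel, which is the "alignment in the channel dimension" the paragraph above refers to.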
In summary, as shown in fig. 2, a process flow structure diagram of the virtual digital human lip synchronization method is provided. Optionally, the process may further include:
step one: the data acquisition is needed, and video shooting is carried out on a given related corpus;
step two: processing audio information and extracting video face information;
step three: further extracting Audio and Video features from the Audio and Video encodings;
step four: extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
step five: and performing feature mapping and association by generating an countermeasure network to realize the generation of lip-shaped video.
The embodiment of the application also provides a virtual digital human lip synchronization system. The system comprises:
the acquisition module is used for carrying out video shooting according to a pre-acquired target corpus to obtain a sample video; the sample video comprises a face image and audio;
the preprocessing module is used for processing the face image through an Openface tool to obtain a face key point image, and preprocessing the audio information to convert the audio information into a Mel spectrogram;
the generating module is used for training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing the target video by the Wav2Lip model after training to obtain a corresponding Lip video.
In an alternative embodiment of the application, the system further comprises:
training module for: inputting the Mel spectrogram and the facial key point image into a Wav2Lip model, and extracting further Audio and Video features through an Audio encoder and a Video encoder;
extracting the obtained audio and video features, carrying out information fusion through feature fusion, and outputting fusion features;
feature mapping and association are carried out through a generated countermeasure network, so that training lip-shaped video generation is realized;
inputting the image frames in the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator so as to discriminate whether the generated image frames and the real image frames are real images or not on a plurality of scales by the multi-scale image quality discriminator; and adjusting parameters of the Wav2Lip model according to the judging result until training is completed.
The virtual digital human lip synchronization system provided by the embodiment of the present application is used to implement the above virtual digital human lip synchronization method; for specific limitations of the system, reference may be made to the limitations of the method above, which are not repeated here. The various parts of the virtual digital human lip synchronization system may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules may be embedded in hardware, independent of a processor in the device, or stored as software in a memory in the device, so that the processor can call and execute the operations corresponding to the above modules.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as the combinations are not contradictory, they should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application; their description is detailed, but it is not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is determined by the appended claims.
Claims (10)
1. A virtual digital human lip synchronization method, the method comprising:
shooting a video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
processing the face image through the OpenFace tool to obtain a facial key point image, and preprocessing the audio information to convert it into a Mel spectrogram;
training a Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video with the trained Wav2Lip model to obtain the corresponding lip video.
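The Mel spectrogram conversion in the preprocessing step rests on the standard Hz-to-Mel frequency mapping. The following minimal sketch uses the common convention for that mapping; the formula is standard practice, not quoted from the patent:

```python
# A minimal sketch of the standard Hz-to-Mel mapping underlying the
# Mel spectrogram conversion (common convention, not from the patent).
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(0.0))                # 0.0
print(round(hz_to_mel(700.0), 2))    # 781.17
```

In a full pipeline, a Mel filterbank built from this mapping is applied to the short-time Fourier transform of the audio to produce the spectrogram fed to the Audio Encoder.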
2. The virtual digital human lip synchronization method according to claim 1, wherein the training process of the Wav2Lip model specifically comprises:
inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio Encoder and Video Encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
performing feature mapping and association through a generative adversarial network to generate a training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, so that the discriminator judges at multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination results until training is completed.
3. The virtual digital human lip synchronization method according to claim 2, wherein the generator of the Wav2Lip model comprises a speech encoder, a face encoder, and a face decoder; the face encoder and the face decoder are composed of 2D-CNNs based on a residual structure; the face encoder encodes, frame by frame, 5 randomly selected reference frames together with the corresponding target frames whose lower half face is masked; the speech encoder encodes the speech segments processed by MFCC, and the face embedding and audio embedding obtained by encoding are combined through a simple concatenation operation and input into the face decoder, which generates the face images frame by frame.
4. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2Lip model trains the generator by minimizing the L1 reconstruction loss L_recon between the generated frames L_g and the reference frames L_G:

L_recon = (1/N) · Σ_{i=1}^{N} ‖L_g(i) − L_G(i)‖₁

wherein N represents the number of images, i represents the image index, L_g represents a generated frame, and L_G represents a reference frame.
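The L1 reconstruction loss above can be sketched numerically as follows; the flat pixel lists standing in for the frames L_g and L_G are illustrative only:

```python
# A sketch of the L1 reconstruction loss of claim 4, with flat lists of
# pixel values standing in for generated frames L_g and reference frames L_G.

def l1_reconstruction_loss(generated, reference):
    """Mean absolute pixel difference, averaged over N frames."""
    n = len(generated)
    per_frame = [
        sum(abs(g - r) for g, r in zip(g_frame, r_frame)) / len(g_frame)
        for g_frame, r_frame in zip(generated, reference)
    ]
    return sum(per_frame) / n

L_g = [[0.0, 0.5], [1.0, 1.0]]   # two generated "frames"
L_G = [[0.0, 1.0], [1.0, 0.0]]   # two reference "frames"
print(l1_reconstruction_loss(L_g, L_G))  # (0.25 + 0.5) / 2 = 0.375
```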
5. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2Lip model uses a pre-trained lip synchronization discriminator to correct the lip region of the person, so that more computing resources can be allocated to fine-grained lip region correction while ignoring background regions that are unimportant for image reconstruction or contain more unwanted detail; specifically, the lip synchronization loss L_sync is determined by the formulas:

P_sync = (E_v · E_s) / max(‖E_v‖₂ · ‖E_s‖₂, ε)

L_sync = (1/N) · Σ_{i=1}^{N} −log(P_sync(i))

wherein P_sync represents the probability that an input audio-video pair is synchronized, and E_v and E_s represent the embedding vectors of the video and the audio, respectively.
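A numeric sketch of this loss follows, assuming P_sync is the cosine similarity between the two embeddings as in the formula above; the toy embedding vectors are illustrative only:

```python
# A sketch of the synchronization probability and sync loss of claim 5:
# P_sync is the cosine similarity between the video embedding E_v and the
# audio embedding E_s, and the loss averages -log(P_sync) over the batch.
import math

def p_sync(e_v, e_s, eps=1e-8):
    dot = sum(a * b for a, b in zip(e_v, e_s))
    norm = math.sqrt(sum(a * a for a in e_v)) * math.sqrt(sum(b * b for b in e_s))
    return dot / max(norm, eps)

def sync_loss(pairs):
    """Average -log(P_sync) over a batch of (E_v, E_s) pairs."""
    return sum(-math.log(p_sync(v, s)) for v, s in pairs) / len(pairs)

# A perfectly synchronized pair (identical embeddings) gives -log(1) = 0.
print(sync_loss([([1.0, 0.0], [1.0, 0.0])]))
```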
6. The virtual digital human lip synchronization method according to claim 2, wherein the Wav2Lip model includes a visual discriminator D constructed from a cascade of 2D-CNNs with residual structure, which receives as input the lower half of the face images generated by the generator; the visual discriminator is trained to maximize the objective:

L_disc = E_{x∼L_G}[log D(x)] + E_{x∼L_g}[log(1 − D(x))]

wherein L_g represents the images generated by the generator and L_G represents the real reference images.
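A minimal numeric sketch of this objective follows, with hypothetical discriminator output scores in (0, 1) standing in for D(x) on real and generated images:

```python
# A sketch of the visual discriminator objective of claim 6: D maximizes
# E[log D(x)] on real images plus E[log(1 - D(x))] on generated images.
import math

def disc_objective(d_real_scores, d_fake_scores):
    """L_disc for batches of discriminator outputs in (0, 1)."""
    real_term = sum(math.log(d) for d in d_real_scores) / len(d_real_scores)
    fake_term = sum(math.log(1.0 - d) for d in d_fake_scores) / len(d_fake_scores)
    return real_term + fake_term

# A well-trained D scores real images near 1 and fakes near 0,
# pushing L_disc toward its maximum of 0.
print(disc_objective([0.9], [0.1]))
```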
7. The virtual digital human lip synchronization method according to claim 2, wherein, in fusing the extracted audio and video features and outputting the fused features, the feature fusion module comprises a fast Fourier convolution layer and an adaptive instance normalization (AdaIN) algorithm.
8. The virtual digital human lip synchronization method according to claim 7, wherein the second fast Fourier convolution layer in the double-layer fast Fourier convolution residual stack is replaced with an AdaIN layer; AdaIN adopts an adaptive affine transformation algorithm and is used to fuse and convert feature information into the style of any given feature; AdaIN can be expressed by the formula:

AdaIN(x, y) = σ(y) · ((x − μ(x)) / σ(x)) + μ(y)

wherein x and y each represent a different feature, and μ(·) and σ(·) represent the mean and standard deviation calculations, respectively.
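The AdaIN formula can be sketched on 1-D feature lists as follows (illustrative only; in the model it operates per channel on 2-D feature maps):

```python
# A sketch of the AdaIN operation of claim 8 on 1-D feature lists:
# the content feature x is renormalized to the mean and standard
# deviation of the style feature y.
from statistics import mean, pstdev

def adain(x, y):
    mu_x, sigma_x = mean(x), pstdev(x)
    mu_y, sigma_y = mean(y), pstdev(y)
    return [sigma_y * (v - mu_x) / sigma_x + mu_y for v in x]

out = adain([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(out)                      # x rescaled to y's statistics
print(mean(out), pstdev(out))   # matches mean(y) and pstdev(y)
```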
9. A virtual digital human lip synchronization system, the system comprising:
the acquisition module, for shooting video according to a pre-acquired target corpus to obtain a sample video, and extracting face images and audio information from the sample video;
the preprocessing module, for processing the face image through the OpenFace tool to obtain a facial key point image, and preprocessing the audio information to convert it into a Mel spectrogram;
the generation module, for training the Wav2Lip model based on the facial key point image and the Mel spectrogram, and processing a target video with the trained Wav2Lip model to obtain the corresponding lip video.
10. The virtual digital human lip synchronization system of claim 9, wherein the system further comprises:
the training module, for inputting the Mel spectrogram and the facial key point image into the Wav2Lip model, and extracting audio and video features through the Audio Encoder and Video Encoder;
fusing the extracted audio and video features through the feature fusion module and outputting the fused features;
performing feature mapping and association through a generative adversarial network to generate a training lip video;
inputting the image frames of the generated training lip video and the corresponding real image frames into a multi-scale image quality discriminator, so that the discriminator judges at multiple scales whether the generated image frames and the real image frames are real images; and adjusting the parameters of the Wav2Lip model according to the discrimination results until training is completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310928303.9A CN116668611A (en) | 2023-07-27 | 2023-07-27 | Virtual digital human lip synchronization method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116668611A (en) | 2023-08-29
Family
ID=87715631
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310928303.9A Pending CN116668611A (en) | 2023-07-27 | 2023-07-27 | Virtual digital human lip synchronization method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116668611A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160343389A1 (en) * | 2015-05-19 | 2016-11-24 | Bxb Electronics Co., Ltd. | Voice Control System, Voice Control Method, Computer Program Product, and Computer Readable Medium |
CN113723385A (en) * | 2021-11-04 | 2021-11-30 | 新东方教育科技集团有限公司 | Video processing method and device and neural network training method and device |
CN113723457A (en) * | 2021-07-28 | 2021-11-30 | 浙江大华技术股份有限公司 | Image recognition method and device, storage medium and electronic device |
CN113901894A (en) * | 2021-09-22 | 2022-01-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Video generation method, device, server and storage medium |
CN114793300A (en) * | 2021-01-25 | 2022-07-26 | 天津大学 | Virtual video customer service robot synthesis method and system based on generation countermeasure network |
CN115223214A (en) * | 2021-04-15 | 2022-10-21 | 腾讯科技(深圳)有限公司 | Identification method of synthetic mouth-shaped face, model acquisition method, device and equipment |
CN115713579A (en) * | 2022-10-25 | 2023-02-24 | 贝壳找房(北京)科技有限公司 | Wav2Lip model training method, image frame generation method, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |