CN115578512A - Method, device and equipment for training and using generation model of voice broadcast video - Google Patents

Method, device and equipment for training and using generation model of voice broadcast video Download PDF

Info

Publication number
CN115578512A
CN115578512A (application CN202211204447.1A)
Authority
CN
China
Prior art keywords
face image
audio
training
network
coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211204447.1A
Other languages
Chinese (zh)
Inventor
严妍
李俊
杨春宇
刘嘉亮
齐慧杰
何强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cape Cloud Information Technology Co ltd
New Media Center Of Xinhua News Agency
Xinhua Fusion Media Technology Development Beijing Co ltd
Beijing Kaipuyun Information Technology Co ltd
Original Assignee
Cape Cloud Information Technology Co ltd
New Media Center Of Xinhua News Agency
Xinhua Fusion Media Technology Development Beijing Co ltd
Beijing Kaipuyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cape Cloud Information Technology Co ltd, New Media Center Of Xinhua News Agency, Xinhua Fusion Media Technology Development Beijing Co ltd, Beijing Kaipuyun Information Technology Co ltd filed Critical Cape Cloud Information Technology Co ltd
Priority to CN202211204447.1A priority Critical patent/CN115578512A/en
Publication of CN115578512A publication Critical patent/CN115578512A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 - Processing of audio elementary streams
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 - Processing of audio elementary streams
    • H04N21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

The application discloses a method, a device and equipment for training and using a generation model of a voice broadcast video, and belongs to the technical field of image processing. The method comprises the following steps: extracting face images and audio from a plurality of training videos, wherein each training video shows a single person broadcasting the audio; extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using a pre-trained R-Net network in the generation model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient; performing feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector; synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame; computing a loss for the synthesized video frame and the audio by using a pre-trained discrimination network; and training the generation model according to the loss. By generating a high-definition two-dimensional face image with the R-Net network, the method improves the picture quality of the face image in the synthesized voice broadcast video.

Description

Method, device and equipment for training and using generation model of voice broadcast video
Technical Field
The application relates to the technical field of image processing, in particular to a method, a device and equipment for training and using a generation model of a voice broadcast video.
Background
When a voice broadcast video is generated, a given face image needs to be processed according to audio, so that the mouth shape in the face image is matched with the audio content at each time point. For example, when the content of the audio is "me", it is necessary to adjust the mouth shape on the face image to the mouth shape when the sound of "me" is emitted.
Currently, neural networks are commonly employed to generate voice broadcast videos. Specifically, a generation network receives the audio features and an image in which the lower half of the face is masked; the image and the audio are processed separately and then fused to regenerate the lower half of the face of the original image. In the discrimination part, a lip-sync discriminator scores the generated image against the audio to compute the loss, and an image quality discriminator is introduced to judge and check the quality of the generated image.
If training is performed without the image quality discriminator, the generated images have poor picture quality. If the image quality discriminator is added, it is extremely difficult to train and training often collapses; for example, the image quality loss diverges toward +∞ while the original loss drops to 0 instead of decreasing in step.
Disclosure of Invention
The application provides a method, a device and equipment for training and using a generation model of a voice broadcast video, to solve the problems that the picture quality of the generated image is poor when training is performed without an image quality discriminator and that training is very difficult when an image quality discriminator is added. The technical scheme is as follows:
in one aspect, a method for training a generation model of a voice broadcast video is provided, where the method includes:
extracting face images and audio from a plurality of training videos, wherein each training video shows a single person broadcasting the audio;
extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network trained in advance in a generated model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient;
performing feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame;
computing a loss for the synthesized video frame and the audio by using a pre-trained discrimination network;
training the generative model according to the loss.
In a possible implementation manner, the extracting features of the face image and the audio by using a convolutional network in the generative model to obtain a feature vector includes:
extracting a first feature vector from the face image by using a first convolution network in the convolution network;
extracting Mel cepstrum coefficient features from the audio, and extracting second feature vectors from the Mel cepstrum coefficient features by using a second convolution network in the convolution networks;
splicing the first feature vector and the second feature vector;
and utilizing a third convolution network in the convolution networks to carry out up-sampling on the splicing vector to obtain the characteristic vector.
In one possible implementation, the method further includes:
acquiring first training data and second training data, wherein the mouth shape of each face image in the first training data corresponds to the audio content at each time point, and the mouth shape of each face image in the second training data is offset from the audio content at each time point by a preset time;
creating a pseudo-twin network;
and training the pseudo twin network according to the training data to obtain the discrimination network.
In a possible implementation manner, the three-dimensional face reconstruction coefficients include 80 face deformation combination coefficients, 64 expression combination coefficients and 80 texture combination coefficients;
the illumination rendering coefficients include 9 illumination combination coefficients and 6 face pose parameters.
In one aspect, a method for generating a voice broadcast video is provided, where the method includes:
acquiring a face image and an audio to be synthesized;
extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network in a generation model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, wherein the generation model is obtained by training with the above training method;
performing feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame;
and synthesizing the synthesized video frame and the audio to obtain the voice broadcast video.
In a possible implementation manner, the extracting features of the face image and the audio by using a convolutional network in the generative model to obtain a feature vector includes:
extracting a first feature vector from the face image by using a first convolution network in the convolution network;
extracting Mel cepstrum coefficient features from the audio, and extracting second feature vectors from the Mel cepstrum coefficient features by using a second convolution network in the convolution networks;
splicing the first feature vector and the second feature vector;
and utilizing a third convolution network in the convolution networks to carry out up-sampling on the splicing vector to obtain the characteristic vector.
In one aspect, a device for training a generative model of a voice broadcast video is provided, the device comprising:
the extraction module is used for extracting face images and audio from a plurality of training videos, wherein each training video shows a single person broadcasting the audio;
the extraction module is also used for extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by utilizing an R-Net network trained in advance in a generation model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient;
the extraction module is further configured to perform feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
the synthesis module is used for synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame;
a generation module for computing a loss for the synthesized video frame and the audio by using a pre-trained discrimination network;
and the training module is used for training the generating model according to the loss.
In one aspect, a voice broadcast video generating device is provided, the device comprising:
the acquisition module is used for acquiring a face image and an audio to be synthesized;
the extraction module is used for extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network in a generated model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, wherein the generated model is obtained by training by adopting the training method;
the extraction module is further configured to perform feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
the synthesis module is used for synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame;
and the synthesis module is also used for synthesizing the synthesized video frame and the audio to obtain a voice broadcast video.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the generation model training method for a voice broadcast video as described above, or the at least one instruction is loaded and executed by a processor to implement the voice broadcast video generation method as described above.
In one aspect, a computer device is provided and includes a processor and a memory, where the memory stores at least one instruction that is loaded and executed by the processor to implement the method for generating model training of a voice broadcast video as described above, or that is loaded and executed by the processor to implement the method for generating a voice broadcast video as described above.
The technical scheme provided by the application has the beneficial effects that:
a three-dimensional face reconstruction coefficient and an illumination rendering coefficient are extracted from a face image through an R-Net network, and a two-dimensional face image is generated according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, so that a three-dimensional face model can be generated according to the three-dimensional face reconstruction coefficient, then the two-dimensional face image is generated according to the three-dimensional face model and the illumination rendering coefficient, and then the two-dimensional face image and the extracted feature vector can be synthesized to obtain a synthesized video frame which is used for synthesizing a voice broadcast video, so that the picture quality of the face image in the voice broadcast video is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a method for training a generative model of a voice broadcast video according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a structure of a generative model provided in one embodiment of the present application;
fig. 3 is a flowchart of a method for generating a voice broadcast video according to an embodiment of the present application;
fig. 4 is a flowchart of a method for generating a voice broadcast video according to an embodiment of the present application;
fig. 5 is a block diagram illustrating a structure of a training apparatus for generating a model of a voice broadcast video according to an embodiment of the present application;
fig. 6 is a block diagram of a structure of a voice broadcast video generation apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
The terms referred to in the present application will be explained first.
(1) A 3DMM (3D Morphable Model) is a method for storing and computing a three-dimensional face model. A number of bases are first computed from a data set; then, assuming that every human face is a linear combination of these bases, a face can be recovered from the given bases simply by saving the coefficients of the linear combination. The same applies to the skin texture coefficients and their basis, and to the expression coefficients and their basis.
(2) Illumination spherical harmonics are orthogonal and rotation-invariant, so a linear combination of a group of illumination spherical harmonics can represent the illumination information; that is, the illumination information can be recorded by storing the coefficients.
(3) MFCC (Mel-Frequency Cepstral Coefficients) features are obtained by successively applying operations such as the short-time Fourier transform and a logarithmic transform to the audio, where the window length of the short-time Fourier transform is 0.0125 seconds (200 samples). MFCC features retain content that is semantically relevant and filter out irrelevant information such as background noise, making the cepstrum closer to the human nonlinear auditory system. MFCC features take the form of a matrix and can therefore be analyzed with image analysis methods.
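For illustration only, a minimal sketch of MFCC extraction with librosa is shown below. The 16 kHz sampling rate, 40 mel bands and 13 coefficients are assumptions chosen for the example; only the 200-sample (0.0125 s) window follows the text above.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Compute an MFCC matrix for an audio file (illustrative parameters)."""
    audio, sr = librosa.load(wav_path, sr=sr)
    # 0.0125 s window at 16 kHz = 200 samples, matching the window length above
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=200, win_length=200, hop_length=200,
                                n_mels=40)
    return mfcc.astype(np.float32)  # shape: (n_mfcc, number of analysis frames)
```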
Referring to fig. 1, a flowchart of a method for training a generative model of a voice broadcast video according to an embodiment of the present application is shown, where the method for training the generative model of the voice broadcast video can be applied to a computer device. The method for training the generation model of the voice broadcast video can comprise the following steps:
step 101, extracting face images and audio in a plurality of training videos, wherein the training videos are single broadcasting audio.
The training video in this embodiment may be a series of single person pure lecture videos, where the person may be a real person or a virtual digital person.
The computer device may segment the training video by a fixed duration. The fixed time duration can be set according to requirements, for example, the fixed time duration is set to a value within 1-3 seconds.
For each video segment, the computer device may extract the video frames and the audio in the video segment, crop a face image from each video frame, and align the face images and the audio by time point to obtain a set of training data; the generation model is then trained with each set of training data. To ensure the training effect of the generation model, more than 40,000 sets of training data may be prepared.
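A possible sketch of this segmentation step is given below, assuming ffmpeg and OpenCV are available. The 16 kHz mono audio, the output paths and the omission of the face-cropping step (which would use a separate face detector) are assumptions for illustration.

```python
import subprocess
import cv2

def split_clip(video_path, start, duration, frame_out_prefix, audio_out):
    """Extract the frames and the audio track of one fixed-duration segment."""
    # Audio segment via ffmpeg (assumes ffmpeg is installed on the server)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ss", str(start),
                    "-t", str(duration), "-vn", "-ac", "1", "-ar", "16000",
                    audio_out], check=True)
    # Video frames via OpenCV; face cropping would follow with a face detector
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(start * fps))
    for i in range(int(duration * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"{frame_out_prefix}_{i:04d}.jpg", frame)
    cap.release()
```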
And 102, extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using a pre-trained R-Net network in the generated model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient.
The R-Net Network (Refine Network) is used for extracting 224 three-dimensional face reconstruction coefficients and 15 illumination rendering coefficients from the face image in the training data. The three-dimensional face reconstruction coefficient comprises 80 face deformation combination coefficients alpha, 64 expression combination coefficients beta and 80 texture combination coefficients delta; the illumination rendering coefficients include 9 illumination combination coefficients γ and 6 face pose parameters p, as shown in fig. 2.
When the R-Net network is trained, the extracted two-dimensional face images are first input into the R-Net network to generate the 239 parameters; a face is reconstructed from the shape, expression and texture parameters with the 3DMM linear-superposition reconstruction formula, and another two-dimensional face image is then output through the preset rendering and 3D-to-2D projection method. The goal of training the R-Net network is to make the extracted face image and the output face image as close as possible, so the loss function is defined as: 1) a cosine loss between the one-dimensional features of the two images output by the same feature extraction network; 2) a feature point distance loss between the feature points of the two images extracted with the same algorithm; and 3) a pixel difference loss over the non-occluded regions of the faces. Finally, the weighted sum of the three losses is minimized by training the R-Net network. In use, if there is no special requirement, the pre-trained R-Net network and the corresponding rendering and projection method can be used directly for the 2D-3D-2D image conversion.
When the trained R-Net network is used, 224 three-dimensional face reconstruction coefficients are input into the 3DMM, and the 3DMM is used for linearly combining the 224 three-dimensional face reconstruction coefficients with a face substrate, an expression substrate and a texture substrate which are stored in advance, so that a three-dimensional face model corresponding to a single-frame face image can be generated. Then, the three-dimensional face model is subjected to illumination rendering by using 15 illumination rendering coefficients. Finally, the three-dimensional face is converted into a two-dimensional face image using an empirical camera model. Because the three-dimensional face model is generated by using a parameterization method, a meaningful face can be generated by giving any combined parameters, and therefore, the three-dimensional face model can be converted into a high-definition two-dimensional face image according to a set illumination rendering and projection formula.
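As a sketch of the linear combination performed by the 3DMM described above: the names of the basis matrices, their storage format and the per-vertex layout are assumptions; only the 80/64/80 coefficient split follows the text. The illumination rendering and the projection to the 2D image are not shown.

```python
import numpy as np

def reconstruct_3dmm(alpha, beta, delta, mean_shape, id_basis, exp_basis,
                     mean_tex, tex_basis):
    """Linearly combine pre-stored 3DMM bases with the predicted coefficients.

    alpha: (80,) face deformation (identity) coefficients
    beta:  (64,) expression coefficients
    delta: (80,) texture coefficients
    Each basis matrix maps its coefficients to a flattened (3N,) per-vertex vector.
    """
    shape = mean_shape + id_basis @ alpha + exp_basis @ beta   # (3N,) vertex positions
    texture = mean_tex + tex_basis @ delta                     # (3N,) vertex albedo
    return shape.reshape(-1, 3), texture.reshape(-1, 3)
```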
It should be noted that the R-Net network does not involve the problem of voice matching, and therefore, it is not necessary to retrain the R-Net network unless there is a special case.
And 103, extracting the features of the face image and the audio by using a convolution network in the generated model to obtain a feature vector.
The Convolutional network (CNN) in this embodiment includes a first Convolutional network (CNN 1), a second Convolutional network (CNN 2), and a third Convolutional network (CNN 3), which need to be trained.
Specifically, the feature extraction is performed on the face image and the audio by using a convolution network in the generative model to obtain a feature vector, which may include: extracting a first feature vector from the face image by using a first convolution network in the convolution network; extracting Mel cepstrum coefficient characteristics from the audio, and extracting a second feature vector from the Mel cepstrum coefficient characteristics by using a second convolution network in the convolution network; splicing the first feature vector and the second feature vector; and utilizing a third convolution network in the convolution network to carry out up-sampling on the splicing vector to obtain a characteristic vector.
The input of the first convolution network is a predetermined number of extracted face images, such as 5 frames, and the output first feature vector is a 512-dimensional vector. The input of the second convolution network is the Mel cepstrum coefficient feature of the audio of a preset duration, and the output second feature vector is a 512-dimensional vector. The input of the third convolution network is the concatenated vector of the first feature vector and the second feature vector.
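A minimal PyTorch sketch of the three convolution networks is given below. The layer configuration, image resolution and MFCC shape are assumptions; only the 5-frame face input, the two 512-dimensional feature vectors, their concatenation and the up-sampling of the concatenated vector follow the text.

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):          # CNN1: 5 stacked RGB face frames -> 512-d
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(5 * 3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(256, 512)

    def forward(self, x):               # x: (B, 15, H, W)
        return self.fc(self.conv(x).flatten(1))

class AudioEncoder(nn.Module):         # CNN2: MFCC matrix -> 512-d
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(128, 512)

    def forward(self, m):               # m: (B, 1, n_mfcc, T)
        return self.fc(self.conv(m).flatten(1))

class Decoder(nn.Module):              # CNN3: up-sample the concatenated vector
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1024, 256 * 4 * 4)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, face_vec, audio_vec):
        z = torch.cat([face_vec, audio_vec], dim=1)      # concatenation step
        # returns an up-sampled feature map, standing in for the feature vector
        return self.up(self.fc(z).view(-1, 256, 4, 4))
```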
And 104, synthesizing the two-dimensional face image and the characteristic vector to obtain a synthesized video frame.
And 105, computing a loss for the synthesized video frame and the audio by using the pre-trained discrimination network.
The discrimination network is a Sync network, which is used to judge whether the audio matches the degree of opening and closing of the lips. The structure of the Sync network is similar to that of a pseudo-twin network, and the training process of the discrimination network may include: acquiring first training data and second training data, wherein the mouth shape of each face image in the first training data corresponds to the audio content at each time point, and the mouth shape of each face image in the second training data is offset from the audio content at each time point by a preset time; creating a pseudo-twin network; and training the pseudo-twin network on the training data to obtain the discrimination network. The mouth shapes and the sound in the first training data correspond strictly in time, and their label is set to 1 to represent positive samples; the mouth shapes and the sound in the second training data are offset by 0.2-1 second, and their label is set to 0 to represent negative samples.
The Sync network has two branches: one branch extracts a one-dimensional feature vector from the two-dimensional face image, and the other branch extracts a one-dimensional feature vector from the audio. A cosine loss is computed between the two feature vectors; it represents a score of how well the sound fits the opening and closing of the lips, and this score is used as one of the losses for generating the voice broadcast video. The Sync network is trained until its final loss is less than 0.25, after which it can be used to supervise the training of the generation model.
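The scoring described above can be sketched as follows. The binary cross-entropy formulation and the mapping of the cosine similarity into [0, 1] are assumptions about how the cosine score is turned into a loss; only the two-branch cosine comparison and the 0/1 labels follow the text.

```python
import torch.nn.functional as F

def sync_loss(mouth_embed, audio_embed, label):
    """Binary cross-entropy on the cosine similarity of the two branches.

    mouth_embed, audio_embed: (B, D) one-dimensional feature vectors
    label: (B,) tensor, 1 for in-sync pairs, 0 for pairs offset by 0.2-1 s
    """
    score = (F.cosine_similarity(mouth_embed, audio_embed) + 1) / 2  # map to [0, 1]
    return F.binary_cross_entropy(score, label.float())
```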
It should be noted that the Sync network is sensitive to data and needs to be trained on customized discrimination data for a particular batch of data. When data is added or replaced, the discrimination network must be retrained until the training loss falls to 0.25; its model parameters can then be fixed and used.
And 106, training the generation model according to the loss.
In this embodiment, the R-Net network and the discrimination network are trained in advance, so that the R-Net network and the discrimination network can be fixed, and the convolutional network can be trained by using given training data and a label.
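A sketch of this arrangement is shown below, assuming PyTorch and that the model is split into an R-Net, a Sync network and the trainable convolution networks; the optimizer choice and learning rate are assumptions.

```python
import torch

def build_optimizer(r_net, sync_net, cnn_modules, lr=1e-4):
    """Freeze the pre-trained R-Net and Sync network; train only the CNNs."""
    for net in (r_net, sync_net):
        net.eval()
        for p in net.parameters():
            p.requires_grad_(False)        # fixed, as described above
    trainable = [p for m in cnn_modules for p in m.parameters()]
    return torch.optim.Adam(trainable, lr=lr)
```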
In one example, the computer device may be a server with the following hardware and software configuration: CPU: dual E5-2620 v4 with 16 cores and 32 threads in total; GPU: Tesla P100; memory: 32 GB; operating system: Ubuntu 18.04. When the generation model is used, libraries such as PyTorch, librosa and numpy are configured, an interface service is started with Flask, the face image and the audio to be synthesized are uploaded through the interface service, and the synthesized voice broadcast video is downloaded through the interface service.
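A minimal sketch of such an interface service is shown below; the route name, field names, temporary paths and the generate_video placeholder are all assumptions for illustration and are not defined by this application.

```python
from flask import Flask, request, send_file

app = Flask(__name__)

def generate_video(face_path, audio_path):
    """Placeholder: run inference with the trained generation model and
    return the path of the synthesized voice broadcast video."""
    raise NotImplementedError("plug the trained generation model in here")

@app.route("/synthesize", methods=["POST"])      # route name is an assumption
def synthesize():
    request.files["face_image"].save("/tmp/face.jpg")
    request.files["audio"].save("/tmp/audio.wav")
    out_path = generate_video("/tmp/face.jpg", "/tmp/audio.wav")
    return send_file(out_path, as_attachment=True)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```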
In this embodiment, the trained generative model can adjust the mouth shape of any user, and a customized model does not need to be trained for the user, thereby improving the universality of the generative model. In addition, the speed of generating the voice broadcast video by the generating model is 15-30 frames/second, and the standard of generating the video in real time is basically achieved.
To sum up, according to the method for training the generation model of the voice broadcast video provided by the embodiment of the application, the three-dimensional face reconstruction coefficient and the illumination rendering coefficient are extracted from the face image through the R-Net network, and the two-dimensional face image is generated according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, so that a three-dimensional face model can be generated according to the three-dimensional face reconstruction coefficient, the two-dimensional face image can be generated according to the three-dimensional face model and the illumination rendering coefficient, and the two-dimensional face image can then be synthesized with the extracted feature vector to obtain a synthesized video frame used to synthesize the voice broadcast video, thereby improving the picture quality of the face image in the voice broadcast video.
Referring to fig. 3, a method flowchart of a method for generating a voice broadcast video according to an embodiment of the present application is shown, where the method for generating a voice broadcast video is applicable to a computer device. The voice broadcast video generation method may include:
step 301, obtaining a face image and an audio to be synthesized.
The user can upload a preset number of face images, and the preset number is in positive correlation with the duration of the audio. The face image may be a human face of a real person or a human face of a virtual digital person.
After the face images are obtained, every 5 face images can be used as a group of images; after the audio is obtained, the audio can be cut according to the duration of 0.2 second to obtain a plurality of audio segments; then, each set of images and each audio clip are input into the generative model as a corresponding set for synthesis.
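One possible sketch of this grouping, assuming the face images are available as a list and the audio as a waveform array; the 16 kHz sampling rate is an assumption, while the 5-image groups and 0.2-second clips follow the text.

```python
def pair_inputs(face_images, audio, sample_rate=16000, group=5, clip_sec=0.2):
    """Pair every 5 face images with a 0.2-second audio clip for the model."""
    clip_len = int(sample_rate * clip_sec)
    n_pairs = min(len(face_images) // group, len(audio) // clip_len)
    pairs = []
    for i in range(n_pairs):
        imgs = face_images[i * group:(i + 1) * group]
        clip = audio[i * clip_len:(i + 1) * clip_len]
        pairs.append((imgs, clip))       # one input set for the generation model
    return pairs
```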
And 302, extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network in the generated model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient.
The R-Net Network (Refine Network) is used for extracting 224 three-dimensional face reconstruction coefficients and 15 illumination rendering coefficients from the face image in the training data. The three-dimensional face reconstruction coefficient comprises 80 face deformation combination coefficients alpha, 64 expression combination coefficients beta and 80 texture combination coefficients delta; the illumination rendering coefficient comprises 9 illumination combination coefficients gamma and 6 face pose parameters p.
When the trained R-Net network is used, 224 three-dimensional face reconstruction coefficients are input into the 3DMM, and the 3DMM is used for linearly combining the 224 three-dimensional face reconstruction coefficients with a face substrate, an expression substrate and a texture substrate which are stored in advance, so that a three-dimensional face model corresponding to a single-frame face image can be generated. Then, the three-dimensional face model is subjected to illumination rendering by using 15 illumination rendering coefficients. Finally, the three-dimensional face is converted into a two-dimensional face image using an empirical camera model. Because the three-dimensional face model is generated by using a parameterization method, a meaningful face can be generated by giving any combined parameters, and therefore, the three-dimensional face model can be converted into a high-definition two-dimensional face image according to a set illumination rendering and projection formula.
And 303, extracting the features of the face image and the audio by using a convolution network in the generated model to obtain a feature vector.
The convolutional network in the present embodiment includes a first convolutional network (CNN 1), a second convolutional network (CNN 2), and a third convolutional network (CNN 3), and needs to be trained.
Specifically, the extracting the features of the face image and the audio by using the convolution network in the generative model to obtain the feature vector may include: extracting a first feature vector from the face image by using a first convolution network in the convolution network; extracting Mel cepstrum coefficient characteristics from the audio, and extracting a second feature vector from the Mel cepstrum coefficient characteristics by using a second convolution network in the convolution network; splicing the first feature vector and the second feature vector; and utilizing a third convolution network in the convolution network to perform up-sampling on the splicing vector to obtain a characteristic vector.
The input of the first convolution network is the extracted face images of a predetermined number of frames, such as 5 frames, and the output first feature vector is a 512-dimensional vector. The input of the second convolution network is the Mel cepstrum coefficient feature of the audio of a preset duration, and the output second feature vector is a 512-dimensional vector. The input of the third convolution network is the concatenated vector of the first feature vector and the second feature vector.
And step 304, synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame.
And 305, synthesizing the synthesized video frame and the audio to obtain a voice broadcast video.
In this embodiment, the generated synthesized video frames may be concatenated in time order and then combined with the audio uploaded by the user to output a voice broadcast video with sound.
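A sketch of this final step, assuming OpenCV and ffmpeg are available; the codec, the temporary path and the 25 fps frame rate are assumptions for illustration.

```python
import subprocess
import cv2

def write_broadcast_video(frames, audio_path, out_path, fps=25):
    """Concatenate synthesized frames in time order and mux in the audio."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("/tmp/silent.mp4",
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:                 # frames are already in time order
        writer.write(frame)
    writer.release()
    # Mux the user-uploaded audio onto the silent video (assumes ffmpeg is installed)
    subprocess.run(["ffmpeg", "-y", "-i", "/tmp/silent.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
                   check=True)
```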
Referring to fig. 4, data flow 1 refers to uploading the user-specified image, namely the face image; data flow 2 refers to uploading the audio; and data flow 3 refers to synthesizing the face image and the audio with the generation model to obtain the voice broadcast video.
To sum up, according to the voice broadcast video generation method provided by the embodiment of the application, the three-dimensional face reconstruction coefficient and the illumination rendering coefficient are extracted from the face image through the R-Net network, and the two-dimensional face image is generated according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, so that the three-dimensional face model can be generated according to the three-dimensional face reconstruction coefficient, the two-dimensional face image can be generated according to the three-dimensional face model and the illumination rendering coefficient, then the two-dimensional face image and the extracted feature vector can be synthesized to obtain a synthesized video frame, and the synthesized video frame is used for synthesizing the voice broadcast video, so that the picture quality of the face image in the voice broadcast video is improved.
Referring to fig. 5, a block diagram of a training apparatus for generating a voice broadcast video according to an embodiment of the present application is shown, where the training apparatus for generating a voice broadcast video may be applied to a computer device. This voice broadcast video's generative model trainer can include:
the extracting module 510 is configured to extract face images and audio from a plurality of training videos, where each training video shows a single person broadcasting the audio;
the extracting module 510 is further configured to extract a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network trained in advance in the generation model, and generate a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient;
the extracting module 510 is further configured to perform feature extraction on the face image and the audio by using a convolution network in the generated model to obtain a feature vector;
a synthesizing module 520, configured to synthesize the two-dimensional face image and the feature vector to obtain a synthesized video frame;
a generating module 530, configured to compute a loss for the synthesized video frame and the audio by using a pre-trained discrimination network;
a training module 540, configured to train the generation model according to the loss.
In an optional embodiment, the extracting module 510 is further configured to:
extracting a first feature vector from the face image by using a first convolution network in the convolution network;
extracting Mel cepstrum coefficient characteristics from the audio, and extracting a second feature vector from the Mel cepstrum coefficient characteristics by using a second convolution network in the convolution network;
splicing the first feature vector and the second feature vector;
and utilizing a third convolution network in the convolution network to perform up-sampling on the splicing vector to obtain a characteristic vector.
In an alternative embodiment, the training module 540 is further configured to:
acquiring first training data and second training data, wherein the mouth shape of each face image in the first training data corresponds to the audio content at each time point, and the mouth shape of each face image in the second training data is offset from the audio content at each time point by a preset time;
creating a pseudo-twin network;
and training the pseudo twin network according to the training data to obtain a discrimination network.
In an alternative embodiment, the three-dimensional face reconstruction coefficients include 80 face deformation combination coefficients, 64 expression combination coefficients and 80 texture combination coefficients; the illumination rendering coefficients include 9 illumination combination coefficients and 6 face pose parameters.
To sum up, the device for training a generation model of a voice broadcast video extracts a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from a face image through an R-Net network, and generates a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, so that a three-dimensional face model can be generated according to the three-dimensional face reconstruction coefficient, and then a two-dimensional face image can be generated according to the three-dimensional face model and the illumination rendering coefficient, and then the two-dimensional face image and the extracted feature vector can be synthesized to obtain a synthesized video frame, wherein the synthesized video frame is used for synthesizing the voice broadcast video, thereby improving the picture quality of the face image in the voice broadcast video.
Referring to fig. 6, a block diagram of a device for generating a voice broadcast video according to an embodiment of the present application is shown, where the device for generating a voice broadcast video can be applied to a computer device. This voice broadcast video generation device can include:
an obtaining module 610, configured to obtain a face image and an audio to be synthesized;
an extracting module 620, configured to extract a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network in the generation model, and generate a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, where the generation model is obtained by training with the training method according to any one of claims 1 to 4;
the extracting module 620 is further configured to perform feature extraction on the face image and the audio by using a convolution network in the generated model to obtain a feature vector;
and a synthesizing module 630, configured to synthesize the two-dimensional face image and the feature vector to obtain a synthesized video frame.
And the synthesizing module 630 is further configured to synthesize the synthesized video frame and the audio to obtain the voice broadcast video.
In an alternative embodiment, the extracting module 620 is further configured to:
extracting a first feature vector from the face image by using a first convolution network in the convolution network;
extracting Mel cepstrum coefficient characteristics from the audio, and extracting a second feature vector from the Mel cepstrum coefficient characteristics by using a second convolution network in the convolution network;
splicing the first feature vector and the second feature vector;
and utilizing a third convolution network in the convolution network to perform up-sampling on the splicing vector to obtain a characteristic vector.
To sum up, the voice broadcast video generation device provided by the embodiment of the application extracts a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from a face image through an R-Net network, and generates a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, so that a three-dimensional face model can be generated according to the three-dimensional face reconstruction coefficient, then a two-dimensional face image can be generated according to the three-dimensional face model and the illumination rendering coefficient, and then the two-dimensional face image and the extracted feature vector can be synthesized to obtain a synthesized video frame, wherein the synthesized video frame is used for synthesizing a voice broadcast video, thereby improving the picture quality of the face image in the voice broadcast video.
One embodiment of the present application provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for training generation model of a voice broadcast video or the method for generating a voice broadcast video as described above.
One embodiment of the present application provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for training generation model of voice broadcast video or the method for generating voice broadcast video as described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is not intended to limit the embodiments of the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (10)

1. A method for training a generative model of a voice broadcast video, the method comprising:
extracting face images and audio from a plurality of training videos, wherein each training video shows a single person broadcasting the audio;
extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network trained in advance in a generated model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient;
performing feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame;
computing a loss for the synthesized video frame and the audio by using a pre-trained discrimination network;
training the generative model according to the loss.
2. The method of claim 1, wherein the extracting features of the face image and the audio using a convolutional network in the generative model to obtain a feature vector comprises:
extracting a first feature vector from the face image by using a first convolution network in the convolution network;
extracting Mel cepstrum coefficient characteristics from the audio, and extracting a second feature vector from the Mel cepstrum coefficient characteristics by using a second convolution network in the convolution network;
splicing the first feature vector and the second feature vector;
and utilizing a third convolution network in the convolution networks to carry out up-sampling on the spliced vector to obtain the characteristic vector.
3. The method of claim 1, further comprising:
acquiring first training data and second training data, wherein the mouth shape of each face image in the first training data corresponds to the audio content at each time point, and the mouth shape of each face image in the second training data is offset from the audio content at each time point by a preset time;
creating a pseudo-twin network;
and training the pseudo twin network according to the training data to obtain the discrimination network.
4. The generative model training method of voice broadcast video according to any one of claims 1 to 3,
the three-dimensional face reconstruction coefficients comprise 80 face deformation combination coefficients, 64 expression combination coefficients and 80 texture combination coefficients;
the illumination rendering coefficients include 9 illumination combination coefficients and 6 face pose parameters.
5. A method for generating a voice broadcast video, the method comprising:
acquiring a face image and an audio to be synthesized;
extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network in a generated model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, wherein the generated model is obtained by training by adopting the training method of any one of claims 1 to 4;
performing feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame;
and synthesizing the synthesized video frame and the audio to obtain the voice broadcast video.
6. The method of claim 5, wherein the extracting the features of the face image and the audio using a convolutional network in the generative model to obtain a feature vector comprises:
extracting a first feature vector from the face image by using a first convolution network in the convolution network;
extracting Mel cepstrum coefficient features from the audio, and extracting second feature vectors from the Mel cepstrum coefficient features by using a second convolution network in the convolution networks;
splicing the first feature vector and the second feature vector;
and utilizing a third convolution network in the convolution networks to carry out up-sampling on the spliced vector to obtain the characteristic vector.
7. A training device for generating models of voice broadcast videos is characterized by comprising:
the extraction module is used for extracting face images and audio from a plurality of training videos, wherein each training video shows a single person broadcasting the audio;
the extraction module is also used for extracting a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by utilizing an R-Net network trained in advance in a generation model, and generating a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient;
the extraction module is further configured to perform feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
the synthesis module is used for synthesizing the two-dimensional face image and the characteristic vector to obtain a synthesized video frame;
a generation module for computing a loss for the synthesized video frame and the audio by using a pre-trained discrimination network;
and the training module is used for training the generating model according to the loss.
8. A voice broadcast video generation device, the device comprising:
the acquisition module is used for acquiring a face image and an audio to be synthesized;
an extraction module, configured to extract a three-dimensional face reconstruction coefficient and an illumination rendering coefficient from the face image by using an R-Net network in a generated model, and generate a two-dimensional face image according to the three-dimensional face reconstruction coefficient and the illumination rendering coefficient, where the generated model is obtained by training according to the training method of any one of claims 1 to 4;
the extraction module is further configured to perform feature extraction on the face image and the audio by using a convolution network in the generation model to obtain a feature vector;
the synthesis module is used for synthesizing the two-dimensional face image and the feature vector to obtain a synthesized video frame;
and the synthesis module is also used for synthesizing the synthesized video frame and the audio to obtain the voice broadcast video.
9. A computer-readable storage medium, wherein at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the generation model training method for the voice broadcast video according to any one of claims 1 to 4, or the at least one instruction is loaded and executed by a processor to implement the voice broadcast video generation method according to claim 5 or 6.
10. A computer device comprising a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the generation model training method for the voice broadcast video according to any one of claims 1 to 4, or the at least one instruction is loaded and executed by the processor to implement the voice broadcast video generation method according to claim 5 or 6.
CN202211204447.1A 2022-09-29 2022-09-29 Method, device and equipment for training and using generation model of voice broadcast video Pending CN115578512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211204447.1A CN115578512A (en) 2022-09-29 2022-09-29 Method, device and equipment for training and using generation model of voice broadcast video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211204447.1A CN115578512A (en) 2022-09-29 2022-09-29 Method, device and equipment for training and using generation model of voice broadcast video

Publications (1)

Publication Number Publication Date
CN115578512A true CN115578512A (en) 2023-01-06

Family

ID=84582846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211204447.1A Pending CN115578512A (en) 2022-09-29 2022-09-29 Method, device and equipment for training and using generation model of voice broadcast video

Country Status (1)

Country Link
CN (1) CN115578512A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778040A (en) * 2023-08-17 2023-09-19 北京百度网讯科技有限公司 Face image generation method based on mouth shape, training method and device of model
CN116778040B (en) * 2023-08-17 2024-04-09 北京百度网讯科技有限公司 Face image generation method based on mouth shape, training method and device of model
CN117593462A (en) * 2023-11-30 2024-02-23 约翰休斯(宁波)视觉科技有限公司 Fusion method and system of three-dimensional space scene

Similar Documents

Publication Publication Date Title
CN109377539B (en) Method and apparatus for generating animation
CN113378697B (en) Method and device for generating speaking face video based on convolutional neural network
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
CA2375350C (en) Method of animating a synthesised model of a human face driven by an acoustic signal
CN112001992A (en) Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
CN114419702B (en) Digital person generation model, training method of model, and digital person generation method
CN114187547A (en) Target video output method and device, storage medium and electronic device
CN114900733B (en) Video generation method, related device and storage medium
CN113299312A (en) Image generation method, device, equipment and storage medium
CN115376482A (en) Face motion video generation method and device, readable medium and electronic equipment
CN116051692A (en) Three-dimensional digital human face animation generation method based on voice driving
EP4207195A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN113223555A (en) Video generation method and device, storage medium and electronic equipment
CN113395569A (en) Video generation method and device
CN115937375B (en) Digital split synthesis method, device, computer equipment and storage medium
CN112330579A (en) Video background replacing method and device, computer equipment and computer readable medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN115278293A (en) Virtual anchor generation method and device, storage medium and computer equipment
US20220020196A1 (en) System and method for voice driven lip syncing and head reenactment
CN113079328B (en) Video generation method and device, storage medium and electronic equipment
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination