CN113469292A - Training method, synthesis method, apparatus, medium and device for a video synthesis model

Info

Publication number
CN113469292A
Authority
CN
China
Prior art keywords
video
model
sub
sample
face
Prior art date
Legal status
Pending
Application number
CN202111023647.2A
Other languages
Chinese (zh)
Inventor
郎彦
高原
刘霄
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202111023647.2A
Publication of CN113469292A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L2013/021Overlap-add techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a training method for a video synthesis model, a video synthesis method, an apparatus, a storage medium, a program product, and an electronic device. The training method includes: acquiring a sample text and a sample video, the sample video being a video of a real person reading the sample text aloud; inputting the sample text into a speech synthesis sub-model to obtain a feature vector; inputting the feature vector into a speech-to-face reconstruction sub-model to obtain face feature parameters; inputting the face feature parameters and the sample video into a differentiable rendering sub-model to obtain a face feature map; inputting the face feature map into a generative adversarial network sub-model to obtain a virtual real-person video; and iteratively training the speech-to-face reconstruction sub-model, the differentiable rendering sub-model, and the generative adversarial network sub-model on the basis of the virtual real-person video and the sample video until the loss function value of the generative adversarial network sub-model satisfies a preset condition.

Description

Training method, synthesis method, apparatus, medium and device for a video synthesis model
Technical Field
The present disclosure relates to the field of computer technologies, and in particular to a training method and apparatus for a video synthesis model, a video synthesis method and apparatus, a non-transitory computer-readable storage medium storing a computer program, and an electronic device for implementing the methods.
Background
With the rapid development of deep learning, text-driven video generation has gradually become a research hotspot and can be applied to fields such as weather broadcasting, news broadcasting, and online education.
In the related art, text-driven video generation typically trains several models separately: text-to-speech synthesis uses a separately trained speech synthesis sub-model, speech-to-face reconstruction uses a separately trained speech-to-face reconstruction sub-model, and synthesizing the reconstructed face into a virtual real-person video uses yet another separately trained synthesis model.
This separate training of multiple models requires preparing separate training sample data for each model, which makes training inefficient, and the quality of the video generated by the resulting text-driven video synthesis model still needs to be improved.
Disclosure of Invention
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a video synthesis model, where the video synthesis model sequentially includes a speech synthesis sub-model, a speech-to-face reconstruction sub-model, a differentiable rendering sub-model, and a generative adversarial network sub-model, and the method includes:
acquiring a sample text and a sample video, wherein the sample video is a video of a real person reading the sample text aloud;
inputting the sample text into the speech synthesis sub-model to obtain a feature vector;
inputting the feature vector into the speech-to-face reconstruction sub-model to obtain face feature parameters;
inputting the face feature parameters and the sample video into the differentiable rendering sub-model to obtain a face feature map;
inputting the face feature map into the generative adversarial network sub-model to obtain a virtual real-person video, and iteratively training the speech-to-face reconstruction sub-model, the differentiable rendering sub-model, and the generative adversarial network sub-model on the basis of the virtual real-person video and the sample video until the loss function value of the generative adversarial network sub-model satisfies a preset condition.
According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a video synthesis model, where the video synthesis model sequentially includes a speech synthesis sub-model, a speech-to-face reconstruction sub-model, a differentiable rendering sub-model, and a generative adversarial network sub-model, and the apparatus includes:
a data acquisition module, configured to acquire a sample text and a sample video, wherein the sample video is a video of a real person reading the sample text aloud;
a speech synthesis module, configured to input the sample text into the speech synthesis sub-model to obtain a feature vector;
a face reconstruction module, configured to input the feature vector into the speech-to-face reconstruction sub-model to obtain face feature parameters;
a face rendering module, configured to input the face feature parameters and the sample video into the differentiable rendering sub-model to obtain a face feature map; and
a generation training module, configured to input the face feature map into the generative adversarial network sub-model to obtain a virtual real-person video, and to iteratively train the speech-to-face reconstruction sub-model, the differentiable rendering sub-model, and the generative adversarial network sub-model on the basis of the virtual real-person video and the sample video until the loss function value of the generative adversarial network sub-model satisfies a preset condition.
According to a third aspect of the embodiments of the present disclosure, there is provided a video synthesis method, including:
receiving a text to be synthesized; and
inputting the text to be synthesized into a video synthesis model to obtain a synthesized video,
wherein the video synthesis model is obtained by training according to the method described in any one of the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a video synthesis apparatus, including:
a receiving module, configured to receive a text to be synthesized; and
a generating module, configured to input the text to be synthesized into a video synthesis model to obtain a synthesized video,
wherein the video synthesis model is obtained by training according to the method described in any one of the first aspect of the embodiments of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any one of the first aspect of the embodiments of the present disclosure, or the method according to the third aspect of the embodiments of the present disclosure.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium storing a computer program, the computer program causing a computer to perform the method according to any one of the first aspect of the embodiments of the present disclosure, or the method according to the third aspect of the embodiments of the present disclosure.
According to one or more technical solutions provided in the embodiments of the present application, a sample text and a sample video are acquired, the sample video being a video of a real person reading the sample text aloud; the sample text is input into the speech synthesis sub-model to obtain a feature vector; the feature vector is input into the speech-to-face reconstruction sub-model to obtain face feature parameters; the face feature parameters and the sample video are input into the differentiable rendering sub-model to obtain a face feature map; the face feature map is input into the generative adversarial network sub-model to obtain a virtual real-person video; and the speech-to-face reconstruction sub-model, the differentiable rendering sub-model, and the generative adversarial network sub-model are iteratively trained on the basis of the virtual real-person video and the sample video, the training ending when the loss function value of the generative adversarial network sub-model satisfies a preset condition. In this way, the differentiable rendering sub-model joins the speech-to-face reconstruction sub-model, the generative adversarial network sub-model, and the speech synthesis sub-model into an integrated whole that is trained end to end. There is no need to prepare separate training sample data for each individually trained model, which improves training efficiency, strengthens the association between the sub-models, reduces the loss of feature information caused by training each model separately, and improves the quality, such as the clarity and/or fidelity, of the video generated by the trained video synthesis model.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a flowchart of a training method for a video synthesis model according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a structural block diagram of a video synthesis model according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a structural block diagram of a video synthesis model according to another exemplary embodiment of the present disclosure;
FIG. 4 shows a flowchart of a training method for a video synthesis model according to another exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic block diagram of a training apparatus for a video synthesis model according to an exemplary embodiment of the present disclosure;
FIG. 6 shows a flowchart of a video synthesis method according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a schematic block diagram of a video synthesis apparatus according to an exemplary embodiment of the present disclosure;
FIG. 8 shows a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the disclosure are for illustrative purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Aspects of the present disclosure are described below with reference to the drawings. FIG. 1 schematically shows a flowchart of a training method for a video synthesis model according to an embodiment of the present disclosure. The method may be applied to the video synthesis model shown in FIG. 2, which may sequentially include a speech synthesis sub-model 201, a speech-to-face reconstruction sub-model 202, a differentiable rendering sub-model 203, and a generative adversarial network sub-model 204. The training method may include the following steps:
step S101: and acquiring a sample text and a sample video, wherein the sample video is a video of the sample text read by a real person.
The sample text is, for example, but not limited to, a piece of chinese and/or english text. In this embodiment, a video of a sample text read aloud by a real person may be captured in advance by a smart phone or a camera such as a high definition camera, and the captured video may be used as the sample video.
It will be appreciated that sample video typically comprises multi-frame images, and that sample text may be processed into a plurality of text sequences, for example a long length of text divided into a plurality of sub-texts, one text sequence for each sub-text. Each text sequence in this embodiment corresponds to one image of the plurality of images. For example, the sample text includes a text sequence 1, a text sequence 2, and a text sequence 3, and the sample video includes, for example, a 1 st frame image, a 2 nd frame image, and a 3 rd frame image. It is assumed that text sequence 1 corresponds to frame 1 image, text sequence 2 corresponds to frame 2 image, and text sequence 3 corresponds to frame 3 image, which is merely an example and is not intended to limit the present embodiment.
Step S102: inputting the sample text into the speech synthesis sub-model to obtain a feature vector.
Illustratively, the speech synthesis sub-model is obtained by pre-training, and its model parameters are kept unchanged during the training of the video synthesis model of this embodiment. After a sample text, such as a segment of Chinese and/or English text, is obtained, the sample text may be input into the speech synthesis sub-model, which can synthesize the audio corresponding to the sample text. The feature vector corresponding to the sample text can be obtained through the speech synthesis sub-model and is used for face reconstruction.
In one example, when a sample text containing a plurality of text sequences is input into the speech synthesis sub-model, a plurality of corresponding feature vectors is obtained, and each feature vector corresponds to one frame of image. For example, text sequence 1 corresponds to feature vector 1, text sequence 2 to feature vector 2, and text sequence 3 to feature vector 3, so that feature vector 1 corresponds to the 1st frame image, feature vector 2 to the 2nd frame image, and feature vector 3 to the 3rd frame image.
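For illustration only, the following PyTorch sketch shows this step under stated assumptions: the class name, the GRU encoder, the feature dimension of 256, and the toy tokenization are all assumptions made for the example and are not the architecture of the disclosure; the point is that a frozen, pre-trained speech synthesis sub-model maps one text sequence per video frame to one feature vector.

```python
import torch
import torch.nn as nn


class SpeechSynthesisSubModel(nn.Module):
    """Toy stand-in for the pre-trained speech synthesis sub-model 201:
    maps one tokenized text sequence to one frame-level feature vector."""

    def __init__(self, vocab_size=256, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, tokens):                       # tokens: (num_sequences, seq_len)
        hidden, _ = self.encoder(self.embed(tokens))
        return hidden.mean(dim=1)                    # one feature vector per text sequence


tts = SpeechSynthesisSubModel()
for p in tts.parameters():                           # pre-trained weights stay fixed during joint training
    p.requires_grad = False

# three text sequences -> three feature vectors, aligned with frames 1..3 of the sample video
text_sequences = torch.randint(0, 256, (3, 20))      # toy token ids
feature_vectors = tts(text_sequences)                # shape: (3, 256)
```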
Step S103: inputting the feature vector into the speech-to-face reconstruction sub-model to obtain face feature parameters.
Illustratively, the speech-to-face reconstruction sub-model is configured to generate face feature parameters for face reconstruction based on the feature vectors, corresponding to the sample text, output by the speech synthesis sub-model. The face feature parameters may relate to facial expression, facial pose, and the like, but are not limited thereto.
In one example, after the plurality of feature vectors corresponding to the plurality of text sequences in the sample text are input into the speech-to-face reconstruction sub-model, a plurality of corresponding face feature parameters is obtained, and each face feature parameter corresponds to one frame of image. For example, feature vector 1 corresponds to face feature parameter 1, so face feature parameter 1 corresponds to the 1st frame image; feature vector 2 corresponds to face feature parameter 2, which corresponds to the 2nd frame image; and feature vector 3 corresponds to face feature parameter 3, which corresponds to the 3rd frame image.
Step S104: inputting the face feature parameters and the sample video into the differentiable rendering sub-model to obtain a face feature map.
Illustratively, the differentiable rendering sub-model 203 (e.g., DIB-R) can generate a three-dimensional face model through a differentiable renderer. In this embodiment, the differentiable rendering sub-model renders a face feature map, such as a three-dimensional face feature map, based on the face feature parameters and the sample video. In one example, the differentiable rendering sub-model may extract face texture features from the multiple frames of the sample video and perform rendering based on the face texture features and the face feature parameters to obtain the face feature map. The differentiable rendering sub-model 203 outputs the rendered face feature map, which is used by the subsequent part of the video synthesis model, namely the generative adversarial network sub-model 204.
In one example, where the sample video includes the 1st, 2nd, and 3rd frame images described above, the differentiable rendering sub-model may obtain face feature map 1 based on face feature parameter 1 and the corresponding 1st frame image, face feature map 2 based on face feature parameter 2 and the corresponding 2nd frame image, and face feature map 3 based on face feature parameter 3 and the corresponding 3rd frame image.
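For illustration only, the sketch below is a schematic stand-in, not the DIB-R API (which in practice operates on meshes, vertices, and textures); the module name, parameter dimension of 70, and channel count are assumptions. What it shows is the property the embodiment relies on: every operation is differentiable, so gradients can flow from the rendered face feature map back to the face feature parameters.

```python
import torch
import torch.nn as nn


class DifferentiableRenderingSubModel(nn.Module):
    """Schematic stand-in for a DIB-R-style differentiable renderer: combines face texture
    extracted from the sample frame with the face feature parameters, and is fully
    differentiable end to end."""

    def __init__(self, param_dim=70, channels=16):
        super().__init__()
        self.texture_encoder = nn.Conv2d(3, channels, 3, padding=1)  # face texture from the sample frame
        self.param_proj = nn.Linear(param_dim, channels)             # expression/pose parameters

    def forward(self, face_params, sample_frame):
        # face_params: (num_frames, param_dim); sample_frame: (num_frames, 3, H, W)
        texture = self.texture_encoder(sample_frame)                 # (num_frames, C, H, W)
        geometry = self.param_proj(face_params)[:, :, None, None]    # broadcast over the image plane
        return torch.sigmoid(texture + geometry)                     # rendered face feature map per frame


renderer = DifferentiableRenderingSubModel()
face_maps = renderer(torch.randn(3, 70), torch.rand(3, 3, 256, 256))  # one map per sample-video frame
```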
Step S105: inputting the face feature map into the generative adversarial network sub-model to obtain a virtual real-person video.
Illustratively, referring to FIG. 3, the generative adversarial network (GAN) sub-model 204 may generally include a generator 2041 and a discriminator 2042, each of which is a deep neural network. The specific structures and working principles of the generator 2041 and the discriminator 2042 can be understood with reference to the prior art and are not described here. In this embodiment, video synthesis training is performed based on the generative adversarial network sub-model 204.
In this embodiment, face feature map 1, face feature map 2, and face feature map 3 are input into the generator 2041 of the generative adversarial network sub-model 204, and the generator 2041 generates a corresponding new 1st frame image, new 2nd frame image, and new 3rd frame image, which together form the virtual real-person video.
Step S106: iteratively training the speech-to-face reconstruction sub-model, the differentiable rendering sub-model, and the generative adversarial network sub-model based on the virtual real-person video and the sample video until the loss function value of the generative adversarial network sub-model satisfies a preset condition.
Illustratively, the discriminator 2042 performs real-versus-fake discrimination between the new 1st, 2nd, and 3rd frame images (i.e., the generated fake images) and the corresponding 1st, 2nd, and 3rd frame images of the sample video (i.e., the original real images): the new 1st frame image is judged against the 1st frame image, the new 2nd frame image against the 2nd frame image, and the new 3rd frame image against the 3rd frame image. While real and fake can still be distinguished, iterative training of the speech-to-face reconstruction sub-model 202, the differentiable rendering sub-model 203, and the generative adversarial network sub-model 204 continues; when real and fake can no longer be distinguished, training can end.
In one example, "real and fake can no longer be distinguished" may mean that the loss function value of the generative adversarial network sub-model 204 satisfies a preset condition, for example, that the loss function value is smaller than a preset threshold, which may be set as needed, but is not limited thereto.
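For illustration only, a minimal joint training loop for steps S101 to S106 might look like the following sketch. The binary cross-entropy adversarial loss, learning rates, threshold value, and the assumption that `face_model` returns a single tensor of face feature parameters (e.g., expression and pose concatenated) and that `loader` yields (feature vectors, sample-video frames) pairs are all assumptions made for the example, not details of the disclosure.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()


def train_video_synthesis(face_model, renderer, generator, discriminator,
                          loader, threshold=0.05, max_epochs=100):
    """Hypothetical joint training loop; the frozen speech synthesis sub-model is assumed
    to have already produced the per-frame feature vectors yielded by `loader`, so only
    the other three sub-models are optimized."""
    g_params = (list(face_model.parameters()) + list(renderer.parameters())
                + list(generator.parameters()))
    opt_g = torch.optim.Adam(g_params, lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    for _ in range(max_epochs):
        for feature_vectors, real_frames in loader:
            face_maps = renderer(face_model(feature_vectors), real_frames)
            fake_frames = generator(face_maps)          # virtual real-person video frames

            # discriminator step: push real frames toward 1 and generated frames toward 0
            d_real = discriminator(real_frames)
            d_fake = discriminator(fake_frames.detach())
            d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()

            # generator step: gradients flow through the differentiable renderer
            # back into the speech-to-face reconstruction sub-model
            g_out = discriminator(fake_frames)
            g_loss = bce(g_out, torch.ones_like(g_out))
            opt_g.zero_grad()
            g_loss.backward()
            opt_g.step()

        if g_loss.item() < threshold:                   # preset condition: loss below a threshold
            return
```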
In this embodiment, the differentiable rendering sub-model connects the speech-to-face reconstruction sub-model with the generative adversarial network sub-model and the speech synthesis sub-model. Because the rendering part is differentiable, gradients can propagate between the speech-to-face reconstruction sub-model and the generative adversarial network sub-model, enabling end-to-end integrated training. The differentiable rendering sub-model allows all the sub-models to communicate with each other, strengthens the association between them, unifies the information flow, and reduces the loss of feature information that would be caused by training each model separately.
The training method for a video synthesis model of the embodiments of the present disclosure combines the speech-to-face reconstruction sub-model, the generative adversarial network sub-model, and the speech synthesis sub-model through the differentiable rendering sub-model to achieve integrated end-to-end training. It does not require preparing separate training sample data for each individually trained model, which improves training efficiency, strengthens the association between the sub-models, reduces the loss of feature information caused by training each model separately, and improves the quality, such as the clarity and/or fidelity, of the video generated by the trained video synthesis model.
Optionally, in an embodiment of the present disclosure, referring to FIG. 3, the speech-to-face reconstruction sub-model 202 may include a convolutional neural network 2021, and a first fully connected (FC) layer 2022 and a second fully connected layer 2023 connected to the convolutional neural network 2021. For example, the convolutional neural network 2021 may be a Visual Geometry Group (VGG) network, such as a VGG16 network, but is not limited thereto. Correspondingly, inputting the feature vector into the speech-to-face reconstruction sub-model to obtain the face feature parameters in step S103 may include the following steps:
Step a): inputting the feature vector into the convolutional neural network to obtain a first feature map.
Illustratively, the feature vectors output by the speech synthesis sub-model 201 are processed, for example through the VGG16 network, to obtain the first feature map.
Step b): inputting the first feature map into the first fully connected layer to obtain a first face feature parameter.
For example, the first face feature parameter may be a facial expression feature parameter. In this embodiment, within the speech-to-face reconstruction sub-model 202, the facial expression feature parameters can be obtained by regression through, for example, the VGG16 network and the first fully connected layer.
Step c): inputting the first feature map into the second fully connected layer to obtain a second face feature parameter, wherein the second face feature parameter is different from the first face feature parameter.
Illustratively, the second face feature parameter may be a face pose feature parameter. In this embodiment, within the speech-to-face reconstruction sub-model 202, the face pose feature parameters can be obtained by regression through, for example, the VGG16 network and the second fully connected layer.
That is, in this embodiment, the features extracted by the VGG network are fed into two fully connected layers, and the regression tasks for the different parameters are completed by the two different branches. The parameters obtained by these regressions, such as the face pose feature parameters and the facial expression feature parameters, are used for three-dimensional rendering in the differentiable rendering sub-model 203, so that a realistic face feature map can be obtained by rendering, improving the quality, such as the clarity and/or fidelity, of the video generated by the finally trained video synthesis model.
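For illustration only, a minimal sketch of this shared-backbone, two-branch structure is shown below; the small fully connected stack stands in for the VGG16 backbone of the disclosure, and the output dimensions of 64 expression and 6 pose parameters are assumptions made for the example.

```python
import torch
import torch.nn as nn


class SpeechToFaceReconstructionSubModel(nn.Module):
    """Sketch of the structure described above: a shared backbone followed by two
    fully connected regression branches for expression and pose parameters."""

    def __init__(self, feat_dim=256, expr_dim=64, pose_dim=6):
        super().__init__()
        self.backbone = nn.Sequential(                  # stand-in for the VGG16 feature extractor
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.fc_expression = nn.Linear(512, expr_dim)   # first fully connected branch (step b)
        self.fc_pose = nn.Linear(512, pose_dim)         # second fully connected branch (step c)

    def forward(self, feature_vector):
        first_feature_map = self.backbone(feature_vector)            # step a)
        return self.fc_expression(first_feature_map), self.fc_pose(first_feature_map)


expression, pose = SpeechToFaceReconstructionSubModel()(torch.randn(3, 256))  # one set per frame
```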
Optionally, on the basis of any one of the above embodiments, in an embodiment of the present disclosure, the method may further include: acquiring sample audio corresponding to the sample text, wherein the speech synthesis sub-model 201 is trained based on the sample text and the sample audio.
For example, the audio track may be extracted, as the sample audio, from the video in which the real person reads the sample text aloud, or the audio may be recorded as the sample audio while the real person reads the sample text, but the present disclosure is not limited thereto.
Illustratively, the original speech synthesis sub-model 201 can be understood with reference to the prior art and may include, for example, a codec module based on a convolutional neural network and a vocoder, which is not limited in this embodiment. In this embodiment, the original speech synthesis sub-model may be trained based on the sample text and the sample audio to obtain the speech synthesis sub-model 201. For example, the sample text is input into the original speech synthesis sub-model to obtain synthesized audio, the original speech synthesis sub-model is iteratively trained based on the comparison of the synthesized audio with the sample audio, and training ends when the loss function value of the original speech synthesis sub-model is less than or equal to a preset threshold, which may be set as needed but is not limited thereto.
In this embodiment, after the speech synthesis sub-model 201 is obtained by training on the sample text and the sample audio, the integrated overall training is performed with the speech synthesis sub-model 201, the speech-to-face reconstruction sub-model 202, the differentiable rendering sub-model 203, and the generative adversarial network sub-model 204, which can further improve the quality, such as the clarity and/or fidelity, of the video generated by the trained video synthesis model.
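For illustration only, a pre-training loop of this kind might look like the following sketch; the L1 reconstruction loss, the assumption that the model outputs a spectrogram-like tensor matching `target_audio`, the threshold value, and the epoch count are assumptions made for the example.

```python
import torch
import torch.nn as nn


def pretrain_speech_synthesis(tts_model, loader, threshold=0.01, max_epochs=50):
    """Hypothetical pre-training loop: the speech synthesis sub-model is fitted on
    (sample text, sample audio) pairs before the joint training, then frozen."""
    optimizer = torch.optim.Adam(tts_model.parameters(), lr=1e-3)
    criterion = nn.L1Loss()                       # e.g. a reconstruction loss on the synthesized audio
    for _ in range(max_epochs):
        for text_tokens, target_audio in loader:
            loss = criterion(tts_model(text_tokens), target_audio)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() <= threshold:              # training ends once the loss is small enough
            break
    for p in tts_model.parameters():              # keep the weights fixed during joint training
        p.requires_grad = False
    return tts_model
```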
Optionally, in an embodiment of the present disclosure, the generator 2041 in the generative adversarial network sub-model 204 may include a first feature extraction network and/or a second feature extraction network, and the network depths of the first and second feature extraction networks are different.
For example, the first and second feature extraction networks are feature extraction networks based on convolutional neural networks, and the network depth is, for example, the number of layers of the feature extraction network. A feature extraction network whose number of layers is greater than or equal to a preset value is suitable for difficult synthesis tasks, such as synthesis tasks with high requirements on fidelity and clarity; a feature extraction network whose number of layers is smaller than the preset value is suitable for easier synthesis tasks, such as synthesis tasks with lower requirements on fidelity and clarity.
In this embodiment, the generator 2041 may therefore select a feature extraction network according to the difficulty of the video synthesis task. In some embodiments, in order to ensure the clarity and realism of the generated video, the generator 2041 may include both the first and the second feature extraction network, i.e., it can generate features of different scales and then fuse them to generate the video frame image. This increases, to some extent, the stability of the generative adversarial network sub-model 204 and its generalization across different scales, thereby improving the quality, such as the clarity and fidelity, of the video generated by the finally trained video synthesis model.
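For illustration only, the following sketch shows a generator with a shallow and a deep feature extraction branch whose outputs are fused before a frame is produced; the specific depths, channel counts, and fusion by concatenation are assumptions made for the example.

```python
import torch
import torch.nn as nn


class MultiScaleGenerator(nn.Module):
    """Generator sketch: two feature extraction branches of different depths,
    fused to produce one synthesized video frame."""

    def __init__(self, in_channels=16, out_channels=3):
        super().__init__()
        self.shallow = nn.Sequential(                            # fewer layers: easier synthesis tasks
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
        )
        self.deep = nn.Sequential(                               # more layers: high-fidelity synthesis tasks
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(64, out_channels, 3, padding=1)    # fuse features of different scales

    def forward(self, face_feature_map):
        fused = torch.cat([self.shallow(face_feature_map), self.deep(face_feature_map)], dim=1)
        return torch.sigmoid(self.fuse(fused))                   # one synthesized video frame


frame = MultiScaleGenerator()(torch.rand(1, 16, 256, 256))       # (1, 3, 256, 256)
```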
Optionally, on the basis of any of the above embodiments, in an embodiment of the present disclosure, referring to FIG. 4, acquiring the sample video in step S101 may specifically include the following steps:
Step S401: acquiring a pre-recorded original video of a real person reading the sample text aloud, wherein the original video comprises multiple frames of images.
Step S402: performing image segmentation processing on each of the multiple frames of the original video to obtain the sample video, wherein the sample video comprises the segmented multiple frames of images, each of which contains the face region of the real person.
In this embodiment, the face region in each frame of the original video is segmented out, the resulting multiple frames of images are used as the sample video, and training then proceeds from step S102. This reduces the amount of sample data while producing effective sample data through the segmentation processing, which further improves training efficiency and further improves the quality of the video generated by the finally trained video synthesis model.
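For illustration only, the following sketch crops the face region from each frame of the recorded original video; the choice of an OpenCV Haar-cascade detector, the 256×256 output size, and the skipping of frames without a detected face are assumptions made for the example, not requirements of the disclosure.

```python
import cv2


def extract_face_frames(video_path, size=(256, 256)):
    """Crop the face region from every frame of the pre-recorded original video;
    the cropped frames form the sample video."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    faces = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            continue                                             # skip frames without a detected face
        x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])       # keep the largest detected face
        faces.append(cv2.resize(frame[y:y + h, x:x + w], size))
    cap.release()
    return faces                                                 # segmented frames, each containing the face region
```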
Optionally, in an embodiment of the present disclosure, in order to further improve the quality of the video generated by the finally trained video synthesis model, after the image segmentation processing of the multiple frames of the original video in step S402, the method may further include the following steps:
Step i): preprocessing the segmented multiple frames of images, wherein the preprocessing includes image texture processing and/or image rotation processing.
Illustratively, image texture processing is applied to the segmented frames so that the textures of at least some, or all, of the images differ. Additionally or alternatively, rotation processing is applied to some or all of the segmented frames so that the rotation angles of the frames differ, entirely or at least in part.
Step ii): determining the sample video based on the preprocessed multiple frames of images.
Illustratively, after the image texture processing and/or image rotation processing of the segmented frames, the sample video is determined from the processed frames, and training then proceeds from step S102.
In this embodiment, preprocessing the segmented frames, such as image texture processing and/or image rotation processing, increases the diversity of the sample data, and training on this more diverse sample data further improves the quality of the video generated by the finally trained video synthesis model.
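For illustration only, one way such preprocessing could be realized is sketched below; interpreting "image texture processing" as a mild random Gaussian blur, the 50% application probability, and the ±10° rotation range are assumptions made for the example.

```python
import random
import cv2


def preprocess_frames(face_frames, max_angle=10.0):
    """Illustrative preprocessing of the segmented frames: vary the image texture
    (here via a mild random blur) and rotate each frame by a small random angle
    to diversify the sample data."""
    processed = []
    for frame in face_frames:
        if random.random() < 0.5:                                # texture differs for part of the frames
            frame = cv2.GaussianBlur(frame, (3, 3), random.uniform(0.1, 1.0))
        h, w = frame.shape[:2]
        angle = random.uniform(-max_angle, max_angle)            # different rotation angle per frame
        rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        processed.append(cv2.warpAffine(frame, rotation, (w, h)))
    return processed                                             # preprocessed frames -> sample video
```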
The embodiments of the present disclosure further provide a training apparatus for a video synthesis model, where the video synthesis model may sequentially include a speech synthesis sub-model, a speech-to-face reconstruction sub-model, a differentiable rendering sub-model, and a generative adversarial network sub-model. As shown in FIG. 5, the training apparatus may include a data acquisition module 501, a speech synthesis module 502, a face reconstruction module 503, a face rendering module 504, and a generation training module 505:
the data acquisition module 501 is configured to acquire a sample text and a sample video, wherein the sample video is a video of a real person reading the sample text aloud;
the speech synthesis module 502 is configured to input the sample text into the speech synthesis sub-model to obtain a feature vector;
the face reconstruction module 503 is configured to input the feature vector into the speech-to-face reconstruction sub-model to obtain face feature parameters;
the face rendering module 504 is configured to input the face feature parameters and the sample video into the differentiable rendering sub-model to obtain a face feature map; and
the generation training module 505 is configured to input the face feature map into the generative adversarial network sub-model to obtain a virtual real-person video, and to iteratively train the speech-to-face reconstruction sub-model, the differentiable rendering sub-model, and the generative adversarial network sub-model on the basis of the virtual real-person video and the sample video until the loss function value of the generative adversarial network sub-model satisfies a preset condition.
The training apparatus for a video synthesis model of the embodiments of the present disclosure combines the speech-to-face reconstruction sub-model, the generative adversarial network sub-model, and the speech synthesis sub-model through the differentiable rendering sub-model to achieve integrated end-to-end training. It does not require preparing separate training sample data for each individually trained model, which improves training efficiency, strengthens the association between the sub-models, reduces the loss of feature information caused by training each model separately, and improves the quality, such as the clarity and/or fidelity, of the video generated by the trained video synthesis model.
Optionally, in an embodiment of the present disclosure, the speech-to-face reconstruction sub-model may include a convolutional neural network, and a first fully connected layer and a second fully connected layer connected to the convolutional neural network. The face reconstruction module 503 is configured to: input the feature vector into the convolutional neural network to obtain a first feature map; input the first feature map into the first fully connected layer to obtain a first face feature parameter; and input the first feature map into the second fully connected layer to obtain a second face feature parameter, wherein the second face feature parameter is different from the first face feature parameter.
Optionally, in an embodiment of the present disclosure, the first face feature parameter includes a facial expression feature parameter, and the second face feature parameter includes a face pose feature parameter.
Optionally, in an embodiment of the present disclosure, the apparatus may further include a speech synthesis sub-model training module configured to acquire sample audio corresponding to the sample text and to train the speech synthesis sub-model based on the sample text and the sample audio.
Optionally, in an embodiment of the present disclosure, the generator in the generative adversarial network sub-model includes a first feature extraction network and/or a second feature extraction network, and the network depths of the first and second feature extraction networks are different.
Optionally, in an embodiment of the present disclosure, the data acquisition module 501 is configured to: acquire a pre-recorded original video of a real person reading the sample text aloud, wherein the original video comprises multiple frames of images; and perform image segmentation processing on each of the multiple frames of the original video to obtain the sample video, wherein the sample video comprises the segmented multiple frames of images, each of which contains the face region of the real person.
Optionally, in an embodiment of the present disclosure, the data acquisition module 501 is further configured to, after the image segmentation processing of the multiple frames of the original video, preprocess the segmented multiple frames of images, the preprocessing including image texture processing and/or image rotation processing, and to determine the sample video based on the preprocessed multiple frames of images.
The specific manner in which the apparatus of the above embodiments and its modules perform operations, and the corresponding technical effects, have been described in detail in the method embodiments above and are not elaborated here.
It should be noted that although several modules or units of the apparatus are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided among a plurality of modules or units. The components shown as modules or units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the disclosed solution. Those of ordinary skill in the art can understand and implement this without inventive effort.
The embodiments of the present disclosure further provide a video synthesis method. As shown in FIG. 6, the video synthesis method includes the following steps:
Step S601: receiving a text to be synthesized.
Illustratively, the text to be synthesized is, for example, the sentence "The teacher is giving a lesson in the classroom."
Step S602: inputting the text to be synthesized into a video synthesis model to obtain a synthesized video. The video synthesis model is obtained by training with the video synthesis model training method of any one of the embodiments of the present disclosure; reference may be made to the description in the foregoing embodiments, which is not repeated here.
Illustratively, the synthesized video corresponding to the text to be synthesized is, for example, a video of a teacher giving a lesson in a classroom, which the user can play and view.
The video synthesis method of the embodiments of the present disclosure synthesizes video with the video synthesis model obtained by the above training method, which improves the quality, such as the clarity and/or fidelity, of the generated video.
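For illustration only, the inference path through the trained sub-models might be wired together as in the sketch below; the `tokenizer` helper, the `reference_frames` that supply the face texture (which the sample video provided during training), and the concatenation of expression and pose parameters are assumptions made for the example and require the dimensions of the earlier sketches to be made consistent.

```python
import torch


@torch.no_grad()
def synthesize_video(text_to_synthesize, tokenizer, tts, face_model, renderer, generator,
                     reference_frames):
    """Hypothetical inference path: text in, synthesized video frames out."""
    tokens = tokenizer(text_to_synthesize)            # split the text into per-frame sequences
    feature_vectors = tts(tokens)                     # one feature vector per frame
    expression, pose = face_model(feature_vectors)    # face feature parameters
    face_params = torch.cat([expression, pose], dim=-1)
    face_maps = renderer(face_params, reference_frames)
    return generator(face_maps)                       # synthesized video frames
```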
The embodiments of the present disclosure further provide a video synthesis apparatus. As shown in FIG. 7, the video synthesis apparatus may include a receiving module 701 and a generating module 702:
the receiving module 701 is configured to receive a text to be synthesized; and
the generating module 702 is configured to input the text to be synthesized into a video synthesis model to obtain a synthesized video, wherein the video synthesis model is obtained by training with the video synthesis model training method of any one of the above embodiments of the present disclosure.
The video synthesis apparatus of the embodiments of the present disclosure synthesizes video with the video synthesis model obtained by the above training method, which improves the quality, such as the clarity and/or fidelity, of the generated video.
An exemplary embodiment of the present disclosure also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the electronic device to perform a method according to an embodiment of the present disclosure.
An exemplary embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform a method according to an embodiment of the present disclosure.
An exemplary embodiment of the present disclosure also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform a method according to an embodiment of the present disclosure.
Referring to FIG. 8, a structural block diagram of an electronic device 800 will now be described. The electronic device 800 may be a server or a client of the present disclosure and is an example of a hardware device that can be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above. For example, in some embodiments, the training method for a video synthesis model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform the training method for a video synthesis model described above in any other suitable manner (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (12)

1. A training method for a video synthesis model, wherein the video synthesis model sequentially comprises a speech synthesis sub-model, a speech-to-face reconstruction sub-model, a differentiable rendering sub-model, and a generative adversarial network sub-model, and the method comprises:
acquiring a sample text and a sample video, wherein the sample video is a video of a real person reading the sample text aloud;
inputting the sample text into the speech synthesis sub-model to obtain a feature vector;
inputting the feature vector into the speech-to-face reconstruction sub-model to obtain face feature parameters;
inputting the face feature parameters and the sample video into the differentiable rendering sub-model to obtain a face feature map;
inputting the face feature map into the generative adversarial network sub-model to obtain a virtual real-person video; and
iteratively training the speech-to-face reconstruction sub-model, the differentiable rendering sub-model, and the generative adversarial network sub-model based on the virtual real-person video and the sample video until the loss function value of the generative adversarial network sub-model satisfies a preset condition.
2. The training method for a video synthesis model according to claim 1, wherein the speech-to-face reconstruction sub-model comprises a convolutional neural network, and a first fully connected layer and a second fully connected layer connected to the convolutional neural network;
and wherein inputting the feature vector into the speech-to-face reconstruction sub-model to obtain the face feature parameters comprises:
inputting the feature vector into the convolutional neural network to obtain a first feature map;
inputting the first feature map into the first fully connected layer to obtain a first face feature parameter; and
inputting the first feature map into the second fully connected layer to obtain a second face feature parameter, wherein the second face feature parameter is different from the first face feature parameter.
3. The training method for a video synthesis model according to claim 2, wherein the first face feature parameter comprises a facial expression feature parameter, and the second face feature parameter comprises a face pose feature parameter.
4. The training method for a video synthesis model according to any one of claims 1 to 3, wherein the method further comprises:
acquiring sample audio corresponding to the sample text;
and wherein the speech synthesis sub-model is trained based on the sample text and the sample audio.
5. The training method for a video synthesis model according to any one of claims 1 to 3, wherein a generator in the generative adversarial network sub-model comprises a first feature extraction network and/or a second feature extraction network, and the network depths of the first feature extraction network and the second feature extraction network are different.
6. The training method for a video synthesis model according to any one of claims 1 to 3, wherein acquiring the sample video comprises:
acquiring a pre-recorded original video of a real person reading the sample text, wherein the original video comprises a plurality of image frames;
and performing image segmentation on each of the image frames in the original video to obtain the sample video, wherein the sample video comprises the segmented image frames, and each of the segmented image frames contains a face region of the real person.
7. The training method for a video synthesis model according to claim 6, wherein, after the image segmentation is performed on each of the image frames in the original video, the method further comprises:
preprocessing the segmented image frames, wherein the preprocessing comprises image texture processing and/or image rotation;
and determining the sample video based on the preprocessed image frames.
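A sketch of the claim-6/7 data preparation under stated assumptions: frames are read from the pre-recorded original video, cropped to a detected face region (an off-the-shelf Haar cascade is used here purely for illustration; the claims do not mandate any particular detector), then optionally blurred or rotated as examples of the texture and rotation preprocessing.

```python
import cv2

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def prepare_sample_frames(video_path, size=(256, 256), blur=False, rotate=False):
    """Return face-region frames that stand in for the sample video (illustrative only)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue                                       # skip frames without a detectable face
        x, y, w, h = faces[0]
        face = cv2.resize(frame[y:y + h, x:x + w], size)   # "image segmentation" -> face region
        if blur:                                           # example stand-in for image texture processing
            face = cv2.GaussianBlur(face, (3, 3), 0)
        if rotate:                                         # example of the optional rotation preprocessing
            face = cv2.rotate(face, cv2.ROTATE_90_CLOCKWISE)
        frames.append(face)
    cap.release()
    return frames
```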
8. A training apparatus for a video synthesis model, wherein the video synthesis model comprises, in sequence, a speech synthesis sub-model, a speech-to-face reconstruction sub-model, a differentiable rendering sub-model and a generative adversarial network sub-model, the apparatus comprising:
a data acquisition module configured to acquire a sample text and a sample video, wherein the sample video is a video of a real person reading the sample text;
a speech synthesis module configured to input the sample text into the speech synthesis sub-model to obtain a feature vector;
a face reconstruction module configured to input the feature vector into the speech-to-face reconstruction sub-model to obtain face feature parameters;
a face rendering module configured to input the face feature parameters and the sample video into the differentiable rendering sub-model to obtain a face feature map;
and a generation training module configured to input the face feature map into the generative adversarial network sub-model to obtain a virtual human video, and to iteratively train the speech-to-face reconstruction sub-model, the differentiable rendering sub-model and the generative adversarial network sub-model based on the virtual human video and the sample video until a loss function value of the generative adversarial network sub-model satisfies a preset condition.
9. A video synthesis method, comprising:
receiving a text to be synthesized;
and inputting the text to be synthesized into a video synthesis model to obtain a synthesized video, wherein the video synthesis model is trained by the method of any one of claims 1 to 7.
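At inference time (claim 9), the trained sub-models are chained once without the discriminator. The sketch below reuses the toy module names from the training sketch after claim 1 and is likewise only illustrative.

```python
import torch

@torch.no_grad()
def synthesize(text_ids, reference_frames, tts, s2f, renderer, gen):
    """Text to be synthesized -> frames of the synthesized video (illustrative only)."""
    feats = tts(text_ids)                      # text -> speech feature vector
    params = s2f(feats)                        # feature vector -> face feature parameters
    # Assumption: the renderer still needs reference frames of the target person
    # as an identity/texture source, as it did during training.
    fmap = renderer(params, reference_frames)  # parameters -> face feature map
    return gen(fmap)                           # feature map -> virtual-human frames
```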
10. A video synthesis apparatus, comprising:
a receiving module configured to receive a text to be synthesized;
and a generating module configured to input the text to be synthesized into a video synthesis model to obtain a synthesized video;
wherein the video synthesis model is trained by the method of any one of claims 1 to 7.
11. An electronic device, comprising:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-7 or claim 9.
12. A non-transitory computer readable storage medium storing a computer program for causing a computer to perform the method according to any one of claims 1-7 or claim 9.
CN202111023647.2A 2021-09-02 2021-09-02 Training method, synthesizing method, device, medium and equipment for video synthesizing model Pending CN113469292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111023647.2A CN113469292A (en) 2021-09-02 2021-09-02 Training method, synthesizing method, device, medium and equipment for video synthesizing model

Publications (1)

Publication Number Publication Date
CN113469292A true CN113469292A (en) 2021-10-01

Family

ID=77867326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111023647.2A Pending CN113469292A (en) 2021-09-02 2021-09-02 Training method, synthesizing method, device, medium and equipment for video synthesizing model

Country Status (1)

Country Link
CN (1) CN113469292A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241558A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Model training method, video generation method, device, equipment and medium
CN114866807A (en) * 2022-05-12 2022-08-05 平安科技(深圳)有限公司 Avatar video generation method and device, electronic equipment and readable storage medium
CN117998166A (en) * 2024-04-02 2024-05-07 腾讯科技(深圳)有限公司 Training method, training device, training equipment, training storage medium and training product for video generation model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472858A (en) * 2017-09-06 2019-03-15 辉达公司 Differentiable rendering pipeline for reverse figure
CN110728971A (en) * 2019-09-25 2020-01-24 云知声智能科技股份有限公司 Audio and video synthesis method
CN111508064A (en) * 2020-04-14 2020-08-07 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
KR102658960B1 (en) System and method for face reenactment
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
CN111476871B (en) Method and device for generating video
CN113272870A (en) System and method for realistic real-time portrait animation
CN111212245B (en) Method and device for synthesizing video
WO2014194439A1 (en) Avatar-based video encoding
KR20210001859A (en) 3d virtual figure mouth shape control method and device
EP3991140A1 (en) Portrait editing and synthesis
US20230215068A1 (en) Method for outputting blend shape value, storage medium, and electronic device
KR20210119441A (en) Real-time face replay based on text and audio
CN113469292A (en) Training method, synthesizing method, device, medium and equipment for video synthesizing model
CN114245215B (en) Method, device, electronic equipment, medium and product for generating speaking video
CN112989935A (en) Video generation method, device, equipment and storage medium
CN113453027B (en) Live video and virtual make-up image processing method and device and electronic equipment
CN113395569B (en) Video generation method and device
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN113223125B (en) Face driving method, device, equipment and medium for virtual image
CN113269066B (en) Speaking video generation method and device and electronic equipment
CN112562045B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
US20240013464A1 (en) Multimodal disentanglement for generating virtual human avatars
CN115880400A (en) Cartoon digital human image generation method and device, electronic equipment and medium
CN113240780B (en) Method and device for generating animation
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN115170703A (en) Virtual image driving method, device, electronic equipment and storage medium
CN117808941A (en) Face driving method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211001)