CN115063713A - Training method of video generation model, video generation method and device, electronic equipment and readable storage medium - Google Patents

Training method of video generation model, video generation method and device, electronic equipment and readable storage medium

Info

Publication number
CN115063713A
CN115063713A (application number CN202210581262.6A)
Authority
CN
China
Prior art keywords
video
model
video frame
training
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210581262.6A
Other languages
Chinese (zh)
Inventor
丁苗高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202210581262.6A priority Critical patent/CN115063713A/en
Publication of CN115063713A publication Critical patent/CN115063713A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a training method of a video generation model, a video generation method and device, electronic equipment and a readable storage medium. The method comprises the following steps: acquiring a plurality of sample videos; constructing a generative adversarial network, wherein the generative adversarial network comprises a generation model and a discrimination model; inputting the sample video into the generation model to obtain a predicted video frame; inputting the predicted video frame and the sample video into the discrimination model to obtain a discrimination result, the discrimination model being used for discriminating whether the predicted video frame matches the sample video; and training the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain the video generation model. According to the embodiment of the invention, the accuracy of the trained video generation model can be improved.

Description

Training method of video generation model, video generation method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a training method for a video generation model, a video generation method, an apparatus, an electronic device, and a readable storage medium.
Background
Currently, motion migration technology migrates the motion in a source video onto a target image to generate a target video, so that the object in the target image appears to perform the motion in the source video. It can be applied to various scenes such as social entertainment and special-effect synthesis.
Because the poses of the objects in the source video and the target image may differ greatly, a target video generated with current motion migration technology may suffer from unrealistic individual video frames, blurred pictures and discontinuity between video frames; that is, the generation effect of the target video is poor.
Disclosure of Invention
Embodiments of the present invention provide a training method for a video generation model, a video generation method, an apparatus, an electronic device, and a readable storage medium, which can improve the quality of the target video generated by the trained video generation model. The specific technical scheme is as follows:
in a first aspect of the present invention, there is provided a method for training a video generative model, including:
acquiring a plurality of sample videos;
constructing a generative adversarial network, wherein the generative adversarial network comprises a generation model and a discrimination model;
inputting the sample video into the generation model to obtain a predicted video frame;
inputting the predicted video frame and the sample video into the discrimination model to obtain a discrimination result; the discrimination model is used for discriminating whether the predicted video frame matches the sample video;
and training the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain a video generation model.
In a second aspect of the present invention, there is provided a video generating method, including:
obtaining a sequence of video frames, the sequence of video frames comprising: video frames of a source video and a target image;
inputting the sequence of video frames into a video generation model, the video generation model comprising: an image generation model and an optical flow network model; extracting foreground characteristics of a video frame sequence through an image generation model, and extracting optical flow characteristics of a source video through an optical flow network model;
performing feature fusion on the foreground features and the optical flow features to generate a target video frame;
and generating a target video based on the target video frame.
In a third aspect of the present invention, there is also provided a training apparatus for a video generative model, including:
the first acquisition module is used for acquiring a plurality of sample videos;
the construction module is used for constructing a generative adversarial network, and the generative adversarial network comprises a generation model and a discrimination model;
the first input module is used for inputting the sample video into the generation model to obtain a predicted video frame;
the first input module is also used for inputting the predicted video frame and the sample video into the discrimination model to obtain a discrimination result; the discrimination model is used for discriminating whether the predicted video frame matches the sample video;
and the training module is used for training the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain a video generation model.
In a fourth aspect of the present invention, there is also provided a video generating apparatus, including:
a second obtaining module, configured to obtain a sequence of video frames, where the sequence of video frames includes: video frames of a source video and a target image;
a second input module for inputting the sequence of video frames to a video generation model, the video generation model comprising: an image generation model and an optical flow network model; extracting foreground characteristics of a video frame sequence through an image generation model, and extracting optical flow characteristics of a source video through an optical flow network model;
the fusion module is used for carrying out feature fusion on the foreground features and the optical flow features to generate a target video frame;
and the generating module is used for generating a target video based on the target video frame.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
According to the embodiment of the invention, a plurality of sample videos are obtained and a generative adversarial network comprising a generation model and a discrimination model is constructed. The sample videos are first input into the generation model to obtain predicted video frames, and then the predicted video frames and the sample videos are input into the discrimination model, which discriminates whether each predicted video frame matches its sample video; in this way, the realism of the predicted video frames and the smoothness of their connection with the sample videos can be continuously improved. Finally, the generative adversarial network is trained based on the discrimination result of each sample video until a training stop condition is met, yielding the video generation model. Therefore, a target video frame generated by the trained video generation model has realistic details, and the continuity of the target video generated from such target video frames can also be ensured.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below.
Fig. 1 is a schematic structural diagram of a video generation model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a video generation model according to an embodiment of the present invention;
fig. 3 is a flowchart of a video generation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-scale network structure provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training apparatus for video generative models according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
The video generation method provided by the embodiment of the invention can be applied to the following application scenarios, which are explained below.
Currently, motion migration technology migrates the motion in a source video onto a target image to generate a target video, so that the object in the target image appears to perform the motion in the source video. It can be applied to various scenes such as social entertainment and special-effect synthesis. For example, the motion of a professional dancer in the source video can be migrated onto the body of an amateur and rendered to generate the target video; in the target video generated by motion migration, the amateur appears to dance in different styles like the professional dancer. Dance video generation thus combines motion migration and video generation. How to keep the migrated motion stable, so that the generated target video frames are stable and the generated target video is continuous, is the problem to be solved.
Based on the application scenario, the following describes in detail a video generation method provided by an embodiment of the present invention.
First, a video generation model structure provided by the embodiment of the present invention is explained in an overall manner.
Fig. 1 is a schematic structural diagram of a video generation model according to an embodiment of the present invention. As shown in fig. 1, the video generation model includes a generation model and a discrimination model. The generation model comprises an image generation model and an optical flow network model; the discrimination model comprises an image discrimination model and a video discrimination model.
In the training process, a plurality of sample videos are obtained and a generative adversarial network comprising the generation model and the discrimination model is constructed. A sample video is input into the generation model to obtain a predicted video frame, where the sample video comprises a first video frame and a plurality of second video frames adjacent to the first video frame. Specifically, the plurality of second video frames and the sample pose information extracted from the sample video are input into the generation model: the foreground training features of the plurality of second video frames are extracted through the image generation model, and the optical flow training features are extracted through the optical flow network model. Finally, the foreground training features output by the image generation model and the optical flow training features output by the optical flow network model are fused to obtain the fused predicted video frame. The predicted video frame and the sample video are then input into the discrimination model, which discriminates whether the predicted video frame matches the sample video.
In this way, the realism of the predicted video frame and the smoothness of its connection with the sample video can be continuously improved. Finally, the generative adversarial network is trained based on the discrimination result of each sample video until a training stop condition is met, yielding the video generation model. Therefore, a target video frame generated by the trained video generation model has realistic details, and the continuity of the target video generated from such target video frames can also be ensured.
In the application process, a video frame sequence comprising the video frames of the source video and the target image is input into the video generation model, and the foreground features of the video frame sequence are extracted through the image generation model. Specifically, on the one hand, the target image in the video frame sequence (and, once target video frames have been generated, those generated target video frames as well) is taken as input A and fed to the image generation model, which extracts the foreground features of the video frame sequence; on the other hand, the pose information extracted from the video frames of the source video is taken as input B and fed to the optical flow network model, which extracts the optical flow features of the source video.
Here, realistic and stable foreground features can be extracted through the image generation model, and stable, continuous optical flow features of the source video can be extracted through the optical flow network model. Feature fusion of the foreground features and the optical flow features then generates a realistic and stable target video frame, so a realistic and stable target video can be generated from the target video frames, improving the generation effect of the target video.
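The data flow just described can be rendered as a minimal PyTorch-style sketch. The class and layer names (ImageGenerationBranch, OpticalFlowBranch, VideoGenerator), the layer widths and the concatenation-based fusion are illustrative assumptions, not the patent's concrete architecture.

import torch
import torch.nn as nn

class ImageGenerationBranch(nn.Module):
    """Extracts foreground features from the target image / previously generated frames (input A)."""
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
        )
    def forward(self, frames):            # frames: (B, C, H, W)
        return self.net(frames)           # foreground features

class OpticalFlowBranch(nn.Module):
    """Extracts optical-flow features from pose maps of the source video (input B)."""
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
        )
    def forward(self, pose_maps):         # pose_maps: (B, C, H, W)
        return self.net(pose_maps)        # optical-flow features

class VideoGenerator(nn.Module):
    """Fuses the two feature branches into one generated frame."""
    def __init__(self, feat_channels=64):
        super().__init__()
        self.image_branch = ImageGenerationBranch()
        self.flow_branch = OpticalFlowBranch()
        self.to_frame = nn.Conv2d(2 * feat_channels, 3, 3, padding=1)
    def forward(self, frames, pose_maps):
        fg = self.image_branch(frames)           # input A path
        flow = self.flow_branch(pose_maps)       # input B path
        fused = torch.cat([fg, flow], dim=1)     # feature fusion
        return torch.tanh(self.to_frame(fused))  # target video frame

# Example with one 256x256 frame batch and matching pose maps.
gen = VideoGenerator()
frame = gen(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(frame.shape)  # torch.Size([1, 3, 256, 256])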
Next, a training structure of the video generation model provided in the embodiment of the present invention is described.
FIG. 2 is a flowchart of a method for training a video generation model according to an embodiment of the present invention;
as shown in fig. 2, the training method of the video generative model may include steps 210 to 250, which are specifically as follows:
step 210, a plurality of sample videos are obtained.
Step 220, constructing a generative adversarial network, wherein the generative adversarial network comprises a generation model and a discrimination model.
Step 230, inputting the sample video into the generation model to obtain a predicted video frame.
Step 240, inputting the predicted video frame and the sample video into the discrimination model to obtain a discrimination result; the discrimination model is used for discriminating whether the predicted video frame matches the sample video.
Step 250, training the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain a video generation model.
In summary, according to the embodiments of the present invention, a plurality of sample videos are obtained and a generative adversarial network comprising a generation model and a discrimination model is constructed. The sample videos are first input into the generation model to obtain predicted video frames, and then the predicted video frames and the sample videos are input into the discrimination model, which discriminates whether each predicted video frame matches its sample video; in this way, the realism of the predicted video frames and the smoothness of their connection with the sample videos can be continuously improved. Finally, the generative adversarial network is trained based on the discrimination result of each sample video until a training stop condition is met, yielding the video generation model. Therefore, the target video frames generated by the trained video generation model have realistic details, and the continuity of the target video generated from those target video frames can also be ensured.
Specific implementations of the above steps are described below.
Step 210 is involved.
A plurality of sample videos are acquired.
For example, 3 (or any other number of) consecutive video frames x_{T-2}, x_{T-1}, x_T can be extracted frame by frame from the sample video.
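A minimal sketch of this frame extraction, assuming OpenCV and a local video file; the function name, file path and window size are illustrative.

import cv2

def consecutive_frame_triplets(video_path, window=3):
    cap = cv2.VideoCapture(video_path)
    buffer = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        buffer.append(frame)
        if len(buffer) == window:
            yield tuple(buffer)        # (x_{T-2}, x_{T-1}, x_T)
            buffer.pop(0)              # slide the window by one frame
    cap.release()

# for x_tm2, x_tm1, x_t in consecutive_frame_triplets("sample.mp4"):
#     ...  # feed the triplet into the generative adversarial network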
Step 220 is involved.
A generative adversarial network is constructed, the generative adversarial network comprising a generation model and a discrimination model.
A generative adversarial network (GAN) is a deep learning framework that obtains good outputs through the mutual game learning of at least two modules: a generative model and a discriminative model.
In the training process, the goal of the generation model is to generate pictures that are as realistic as possible in order to deceive the discrimination model, while the goal of the discrimination model is to distinguish the pictures generated by the generation model from real pictures as well as possible. The generation model and the discrimination model thus form a dynamic game process. At the end of this game, the generation model can generate pictures realistic enough to pass as genuine, and the discrimination model can hardly determine whether a picture generated by the generation model is real or not.
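For reference, this adversarial game is commonly written as the standard GAN minimax objective over a generator G and a discriminator D (a general formula, not one stated in this document):

\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]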
Step 230 is involved.
And inputting the sample video into the generation model to obtain a predicted video frame.
In order to guarantee the realism of details in the generated predicted video frame, the generation model of the invention is composed of an image generation model and an optical flow network model, and the output results of the two are finally fused to form the generated predicted video frame. In addition, considering that a video spans both the spatial and the temporal domain, the discrimination model of the invention is composed of an image discrimination model and a video discrimination model: the image discrimination model is used to identify the realism of a single predicted video frame, and the video discrimination model is used to identify the continuity of consecutive predicted video frames.
Specifically, the generation model comprises an image generation model and an optical flow network model; the discrimination model comprises an image discrimination model and a video discrimination model. The sample video comprises a first video frame and a plurality of second video frames adjacent to the first video frame; specifically, the sample video comprises a first video frame (x_T) and a plurality of second video frames (x_{T-2}, x_{T-1}) adjacent to the first video frame.
Step 230 may specifically include the following steps:
inputting the plurality of second video frames into the image generation model, and outputting the foreground training features;
inputting the sample pose information extracted from the sample video into the optical flow network model, and outputting the optical flow training features;
and fusing the foreground training features and the optical flow training features to obtain the predicted video frame.
The step of inputting the sample pose information extracted from the sample video into the optical flow network model and outputting the optical flow training features may specifically include the following steps:
extracting the sample pose information from each sample video frame of the sample video, inputting the sample pose information into the optical flow network model, and extracting the optical flow training features.
The step of obtaining the predicted video frame by fusing the foreground training feature and the optical flow training feature may specifically include the following steps:
generating a first pixel point according to the foreground training characteristics; generating a second pixel point according to the optical flow training characteristics; and synthesizing the first pixel points and the second pixel points to obtain a predicted video frame.
The sample video includes a target object, which may be a dynamic person, an animal, or a mobile device. Specifically, the first pixel point and the second pixel point can be synthesized according to the position of the target object in the video frame of the sample video, so as to obtain the predicted video frame.
First, the plurality of second video frames (x_{T-2}, x_{T-1}) are input into the image generation model, which outputs the foreground training features; then, the sample pose information (s_{T-2}, s_{T-1}, s_T) extracted from the sample video (x_{T-2}, x_{T-1}, x_T) is input into the optical flow network model, which outputs the optical flow training features; finally, the foreground training features output by the image generation model and the optical flow training features output by the optical flow network model are fused to obtain the fused predicted video frame (x_T*).
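A hedged sketch of the pixel synthesis described above: the "first pixel points" come from the foreground branch, while the "second pixel points" are obtained here by warping the previous frame with the predicted optical flow, and a soft mask blends the two. The warping step and the mask-based blending are assumptions for illustration; the patent only states that the synthesis follows the target object's position.

import torch
import torch.nn.functional as F

def warp_with_flow(prev_frame, flow):
    """Backward-warp prev_frame (B,3,H,W) with a dense flow field (B,2,H,W) in pixels."""
    b, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(b, -1, -1, -1)
    coords = grid + flow                                   # where each pixel is sampled from
    # normalize to [-1, 1] for grid_sample, which expects a (B, H, W, 2) grid
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(prev_frame, norm_grid, align_corners=True)

def synthesize_predicted_frame(foreground_rgb, prev_frame, flow, mask):
    """mask in [0,1]: 1 takes the foreground pixel, 0 takes the flow-warped pixel."""
    warped = warp_with_flow(prev_frame, flow)
    return mask * foreground_rgb + (1.0 - mask) * warped   # predicted frame x_T*

# Example with random tensors (B=1, 256x256 frames).
fg = torch.rand(1, 3, 256, 256)
prev = torch.rand(1, 3, 256, 256)
flow = torch.zeros(1, 2, 256, 256)
mask = torch.rand(1, 1, 256, 256)
pred = synthesize_predicted_frame(fg, prev, flow, mask)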
Step 240 is involved.
Inputting the predicted video frame and the sample video into a discrimination model to obtain a discrimination result; and the discrimination model is used for discriminating whether the predicted video frame is matched with the sample video.
Step 240 may specifically include the following steps:
inputting the predicted video frame and the first video frame into the image discrimination model to obtain a first loss value;
inputting the predicted video frame and the second video frames into the video discrimination model to obtain a second loss value;
correspondingly, step 250 may specifically include the following steps:
and training to generate a confrontation network according to the first loss value and the second loss value until a training stopping condition is met, and obtaining a video generation model.
Will predict the video frame (x) T X) and a first video frame (x) T ) Inputting the predicted image into image identification model, calculating first loss value of the image identification model, and predicting video frame (X) T-2 *,X T-1 *,X T X) and a second video frame (X) T-2 ,X T-1 ,X T ) And inputting the first loss value into the video identification model, calculating a second loss value of the video identification model, and determining a judgment result according to the first loss value and the second loss value.
In addition, in the case where the first frame prediction video frame is determined, X T-2 *,X T-1 Is absent, the corresponding predicted video frame (X) is input into the video discrimination model T-2 *,X T-1 May be represented by X) T-2 And X T-1 Instead.
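A hedged sketch of step 240 with the first-frame substitution just described. The hinge-style adversarial losses and the tiny convolutional discriminators are illustrative assumptions, not the patent's concrete networks.

import torch
import torch.nn as nn

image_disc = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                           nn.Conv2d(64, 1, 4, stride=2, padding=1))
video_disc = nn.Sequential(nn.Conv2d(9, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                           nn.Conv2d(64, 1, 4, stride=2, padding=1))

def discriminator_losses(pred_t, real_frames, past_preds=None):
    """real_frames: [x_{T-2}, x_{T-1}, x_T]; past_preds: [x_{T-2}*, x_{T-1}*] or None."""
    x_t = real_frames[-1]
    # first loss value: realism of the single predicted frame
    loss_image = torch.relu(1.0 - image_disc(x_t)).mean() + \
                 torch.relu(1.0 + image_disc(pred_t.detach())).mean()
    # second loss value: temporal continuity of the predicted triplet
    if past_preds is None:                      # first prediction: substitute the real frames
        past_preds = real_frames[:2]
    fake_clip = torch.cat(list(past_preds) + [pred_t.detach()], dim=1)
    real_clip = torch.cat(real_frames, dim=1)
    loss_video = torch.relu(1.0 - video_disc(real_clip)).mean() + \
                 torch.relu(1.0 + video_disc(fake_clip)).mean()
    return loss_image, loss_video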
The training of the generative adversarial network according to the discrimination result determined by the first loss value and the second loss value, until a training stop condition is met, to obtain the video generation model, specifically includes:
determining a loss function value of the generative adversarial network according to the discrimination result determined by the first loss value and the second loss value of each sample video, training the generative adversarial network, and obtaining the video generation model when the loss function value of the generative adversarial network obtained from the discrimination results meets a loss threshold.
Before step 240, the following steps may also be included:
calculating a third loss value of the image generation model according to the predicted video frame and the first video frame;
calculating a fourth loss value of the optical flow network model according to the optical flow training characteristics and the optical flow true value, wherein the optical flow true value is extracted from the sample video through a preset optical flow extraction algorithm;
determining a loss value of the generated model according to the third loss value and the fourth loss value;
Here, the loss value of the generation model is calculated. On the one hand, a third loss value of the image generation model is calculated from the predicted video frame and the first video frame; this third loss value is calculated on the same pair of frames as the first loss value mentioned above. On the other hand, an optical flow true value is extracted from the sample video by a preset optical flow extraction algorithm, and a fourth loss value of the optical flow network model is calculated from the optical flow training features and the optical flow true value. Finally, the loss value of the generation model is determined from the third loss value of the image generation model and the fourth loss value of the optical flow network model.
Correspondingly, step 250 may specifically include the following steps:
determining a loss value of the discriminant model according to the first loss value and the second loss value;
and performing back propagation training on the generative adversarial network according to the loss value of the generation model and the loss value of the discrimination model until the generative adversarial network meets a preset convergence condition, to obtain the trained video generation model.
Specifically, the sum of the third loss value and the fourth loss value is determined as the loss value of the generation model; the sum of the first loss value and the second loss value is determined as the loss value of the discrimination model; and back propagation training is performed on the generative adversarial network according to the loss value of the generation model and the loss value of the discrimination model.
Back propagation here refers to iteratively optimizing the loss value with a gradient descent method to obtain a minimum. The optimization process measures the output loss of the training samples with a suitable loss function and minimizes that loss; that is, the series of linear coefficient matrices of the deep neural network are iteratively optimized by gradient descent on the loss function until a minimum is reached, which is the back propagation algorithm. Various loss functions and activation functions can be used.
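A minimal sketch of one alternating back-propagation step under the loss decomposition above. The generator and discriminator call interfaces and the concrete L1 losses used for the third and fourth loss values are assumptions for illustration.

import torch
import torch.nn as nn

def train_step(generator, discriminators, optim_g, optim_d, batch):
    frames, poses, flow_gt = batch            # sample frames (B,T,C,H,W), pose maps, optical flow true value
    # ----- generator update (third + fourth loss values) -----
    pred_frame, pred_flow = generator(frames[:, :-1], poses)
    loss_image = nn.functional.l1_loss(pred_frame, frames[:, -1])   # third loss value
    loss_flow = nn.functional.l1_loss(pred_flow, flow_gt)           # fourth loss value
    loss_g = loss_image + loss_flow
    optim_g.zero_grad(); loss_g.backward(); optim_g.step()
    # ----- discriminator update (first + second loss values) -----
    loss_img_d, loss_vid_d = discriminators(pred_frame.detach(), frames)
    loss_d = loss_img_d + loss_vid_d
    optim_d.zero_grad(); loss_d.backward(); optim_d.step()
    return loss_g.item(), loss_d.item()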
Step 250 is involved.
The generative adversarial network is trained based on the discrimination result of each training sample until a training stop condition is met, to obtain a video generation model.
Here, in the trained generative adversarial network, the generation model can generate target video frames realistic enough to pass as genuine. For the discrimination model, on the one hand, it is difficult to determine whether a target video frame generated by the generation model is real; on the other hand, it is also difficult to determine whether the target video frame generated by the generation model matches the source video, that is, whether the target video frame originally existed in the source video. Therefore, for a target video frame generated by the trained generative adversarial network, the realism of its details can be guaranteed, and the continuity of the target video generated from such target video frames can also be guaranteed.
In a possible embodiment, step 250 may specifically include the following steps:
inputting the predicted video frame and the first video frame into a feature extraction network, wherein the feature extraction network comprises a plurality of scale layers, and each scale layer respectively outputs a sub-loss value of the predicted video frame and a sub-loss value of the first video frame;
determining a multi-scale loss value according to the plurality of sub-loss values;
and training the generative adversarial network according to the multi-scale loss value, the first loss value and the second loss value until the training stop condition is met, to obtain a video generation model.
The predicted video frame and the first video frame are respectively input into the different scale layers of a feature extraction network (such as a VGG network), and each scale layer outputs the sub-loss value of the predicted video frame and the first video frame at that scale; a multi-scale loss value is determined from the plurality of sub-loss values; correspondingly, the generative adversarial network is trained according to the multi-scale loss value, the first loss value and the second loss value until the training stop condition is met, and the video generation model is obtained.
Wherein, the multi-scale loss value can be specifically calculated by the following formula:
L_{feature} = \sum_{i=1}^{T} \left\| f_i(X) - f_i(Y) \right\|
wherein L_{feature} is the multi-scale loss value, T is the total number of scale layers, X is the predicted video frame, Y is the first video frame, f_i denotes feature extraction at the i-th scale layer, and f_i(X) - f_i(Y) gives the sub-loss value at that scale.
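A minimal sketch of this multi-scale feature loss. The patent points to a VGG-style extractor; the small frozen convolutional stack used here as f_i is a stand-in assumption so the example runs without pretrained weights.

import torch
import torch.nn as nn

class MultiScaleFeatureLoss(nn.Module):
    def __init__(self, channels=(3, 16, 32, 64)):
        super().__init__()
        # one "scale layer" f_i per block; each block halves the spatial resolution
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1),
                          nn.ReLU())
            for i in range(len(channels) - 1))
        for p in self.parameters():
            p.requires_grad_(False)          # the feature extractor stays fixed during training

    def forward(self, pred, target):         # pred = X (predicted frame), target = Y (first frame)
        loss = pred.new_zeros(())
        fx, fy = pred, target
        for block in self.blocks:
            fx, fy = block(fx), block(fy)
            loss = loss + torch.mean(torch.abs(fx - fy))   # sub-loss value at this scale
        return loss                           # L_feature

# Example:
criterion = MultiScaleFeatureLoss()
l_feature = criterion(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))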
As shown in fig. 4, in the multi-scale feature extraction network, encoding network 1 and decoding network 1 form the low-scale network: its input1 and output1 are both 256 x 256, and the network corresponding to this scale layer is used to generate the globally coherent low-resolution video. Encoding network 2 and decoding network 2 form the high-scale network: its input2 and output2 are 512 x 512, and the high-scale network is used to generate the locally refined high-resolution video.
Therefore, a multi-scale network is designed in the generation model so that a high-resolution video is generated and the generated video is more refined; in the discrimination model, by designing multi-scale feature loss values, the details of the generated predicted video frame can be kept realistic.
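A hedged sketch of the coarse-to-fine structure of fig. 4. The layer widths and the upsample-and-add merge of the two scales are illustrative assumptions; only the 256/512 resolutions come from the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class CoarseToFineGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.dec1 = conv_block(3, 64), conv_block(64, 3)   # low scale (256 x 256)
        self.enc2, self.dec2 = conv_block(3, 64), conv_block(64, 3)   # high scale (512 x 512)

    def forward(self, x_512):
        x_256 = F.interpolate(x_512, scale_factor=0.5, mode="bilinear", align_corners=False)
        coarse = self.dec1(self.enc1(x_256))                  # global low-resolution output
        coarse_up = F.interpolate(coarse, scale_factor=2.0, mode="bilinear", align_corners=False)
        fine = self.dec2(self.enc2(x_512))                    # local high-resolution refinement
        return coarse_up + fine                               # refined 512 x 512 output

out = CoarseToFineGenerator()(torch.rand(1, 3, 512, 512))     # -> (1, 3, 512, 512)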
To sum up, in the embodiment of the present invention, a plurality of sample videos are obtained and a generative adversarial network comprising a generation model and a discrimination model is constructed. The sample videos are first input into the generation model to obtain predicted video frames, and then the predicted video frames and the sample videos are input into the discrimination model, which discriminates whether each predicted video frame matches its sample video; in this way, the realism of the predicted video frames and the smoothness of their connection with the sample videos can be continuously improved. Finally, the generative adversarial network is trained based on the discrimination result of each sample video until a training stop condition is met, yielding the video generation model. Therefore, the target video frames generated by the trained video generation model have realistic details, and the continuity of the target video generated from those target video frames can also be ensured.
The following describes a video generation method provided by an embodiment of the present invention.
Fig. 3 is a flowchart of a video generation method according to an embodiment of the present invention.
As shown in fig. 3, the video generation method may include steps 310 to 340, and the method is applied to a video generation apparatus, and is specifically as follows:
step 310, obtaining a video frame sequence, the video frame sequence including: video frames of a source video and a target image.
Step 320, inputting the sequence of video frames to a video generation model, the video generation model comprising: an image generation model and an optical flow network model; foreground features of the video frame sequence are extracted through an image generation model, and optical flow features of the source video are extracted through an optical flow network model.
And step 330, performing feature fusion on the foreground features and the optical flow features to generate a target video frame.
Step 340, generating a target video based on the target video frame.
In the disclosed embodiment, a video frame sequence comprising the video frames of a source video and a target image is input into the video generation model: realistic and stable foreground features of the video frame sequence are extracted through the image generation model, and stable, continuous optical flow features of the source video are extracted through the optical flow network model. Feature fusion of the foreground features and the optical flow features then generates a realistic and stable target video frame, so a realistic and stable target video can be generated based on the target video frames, improving the generation effect of the target video.
Specific implementations of the above steps are described below.
Step 310 is involved.
Obtaining a sequence of video frames, the sequence of video frames comprising: video frames of a source video and a target image.
For example, the video frames of the source video may include video frames y_0 to y_T, and the target image may include images z_0 and z_1 of the target person. The video frame sequence is then {z_0, z_1, y_0, ..., y_T}.
Step 320 is involved.
Inputting the sequence of video frames into a video generation model, the video generation model comprising: an image generation model and an optical flow network model; foreground features of the sequence of video frames are extracted by an image generation model, and optical flow features of the source video are extracted by an optical flow network model.
On the one hand, (z_0, z_1) of the video frame sequence are input into the image generation model, and the foreground features of the video frame sequence are extracted through the image generation model; on the other hand, pose information (s_0, s_1, s_2) is extracted from the video frames (y_0 to y_T) of the source video in the video frame sequence, the pose information is input into the optical flow network model, and the optical flow features of the source video are extracted through the optical flow network model.
Wherein, the step of extracting foreground features of the video frame sequence through the image generation model may specifically include the following steps:
extracting the motion features of the source video and the appearance features of the target image through the image generation model;
and generating the foreground features based on the motion features of the source video and the appearance features of the target image.
Therefore, the movement of the object in the source video can be transferred to the object in the target image, and the foreground feature can be generated.
The structures of the image generation model and the optical flow network model are designed in the video generation model, so that the optical flow network model is used for modeling optical flow characteristics of a background, the image generation model is used for modeling foreground characteristics, and then the two model branches are fused, so that inter-frame smoothness of the generated predicted video frame is ensured, the frame skipping phenomenon is avoided, and meanwhile, the accuracy of transferring the action in the source video to the object in the target image is also ensured.
Step 330 is involved.
Feature fusion is performed on the foreground features and the optical flow features to generate a target video frame (y_0*).
Each video frame in the video frame sequence carries a time sequence identifier. The video frame sequence comprises a first sequence and a second sequence: the first sequence comprises the target image and the generated target video frames, and the second sequence comprises the video frames of the source video. Extracting the foreground features of the video frame sequence through the image generation model and extracting the optical flow features of the source video through the optical flow network model comprises the following steps:
acquiring a third video frame corresponding to a time sequence identifier adjacent to the target time sequence identifier from the first sequence according to the target time sequence identifier of the target video frame to be generated;
acquiring a fourth video frame corresponding to the time sequence identifier adjacent to the target time sequence identifier from the second sequence according to the target time sequence identifier;
and extracting foreground features of the third video frame through an image generation model, and extracting optical flow features of the fourth video frame through an optical flow network model.
In order to make the transition between the video frames of the generated target video smoother, once a preset number of target video frames (y_0*, ..., y_T*) have been generated, the generated target video frames may be added to the video frame sequence.
For example, a preset number of first target video frames have been generated; these first target video frames were generated by extracting features from the video frames of the source video and the target image through the trained image generation model, which can ensure the smoothness between the first target video frames and the video frame sequence.
Then, the video frame sequence including the first target video frames is input into the video generation model to predict a second target video frame. The trained image generation model can likewise ensure the smoothness between the second target video frame and the video frame sequence, and since the video frame sequence now includes the first target video frames, the smoothness between the first target video frames and the second target video frame is ensured as well.
Since the final target video is composed of the target video frames (the first target video frames, the second target video frame, and so on), ensuring the inter-frame smoothness of the first and second target video frames improves the inter-frame smoothness and continuity of the target video.
Specifically, after performing feature fusion on the foreground features and the optical flow features to generate a target video frame, the method may further include the following steps: and updating the video frame sequence based on the target video frame to obtain the updated video frame sequence.
Thus, the next time a target video frame (y_t*) is to be generated, a third video frame (y_{t-2}*, y_{t-1}*) corresponding to the time sequence identifiers adjacent to the target time sequence identifier (t) is acquired from the first sequence (z_0, z_1, y_0*, ..., y_{t-2}*, y_{t-1}*) according to the target time sequence identifier;
according to the target time sequence identifier, a fourth video frame (y_{t-2}, y_{t-1}, y_t) corresponding to the time sequence identifiers adjacent to the target time sequence identifier is acquired from the second sequence (y_0, ..., y_T), and pose information (s_{t-2}, s_{t-1}, s_t) is extracted from the fourth video frame;
specifically, the fourth video frame may be input into a preset pose recognition network to obtain the pose information.
The foreground features of the third video frame are extracted through the image generation model; and after the pose information (s_{t-2}, s_{t-1}, s_t) is extracted from the fourth video frame, the optical flow features of the pose information are extracted through the optical flow network model.
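A hedged sketch of this sliding-window generation loop. Here, generator and extract_pose are placeholders for the trained generation model and the preset pose recognition network, and the handling of the first two time steps is simplified to reuse whatever frames are available.

def generate_target_frames(generator, extract_pose, target_images, source_frames):
    first_sequence = list(target_images)           # [z_0, z_1], grows as frames are generated
    target_frames = []
    for t in range(len(source_frames)):
        third = first_sequence[-2:]                # (y_{t-2}*, y_{t-1}*) or the target images
        fourth = source_frames[max(0, t - 2): t + 1]   # (y_{t-2}, y_{t-1}, y_t)
        poses = [extract_pose(f) for f in fourth]  # (s_{t-2}, s_{t-1}, s_t)
        y_t = generator(third, poses)              # foreground + optical-flow feature fusion
        first_sequence.append(y_t)                 # update the video frame sequence
        target_frames.append(y_t)
    return target_frames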
In a possible embodiment, the step of performing feature fusion on the foreground feature and the optical flow feature to generate the target video frame may specifically include the following steps:
acquiring a weight image, wherein the weight image is used for representing a weight value corresponding to a pixel point in a target video frame;
based on the weight image, performing weighted fusion processing on each pixel point of the foreground features and each pixel point of the optical flow features to generate the target video frame, wherein the foreground features and the optical flow features can be images whose image sizes match.
Step 340 is involved.
And generating a target video based on the target video frame.
After the target video frames have been output, the target video is generated based on the target video frames (y_0*, ..., y_T*).
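A small sketch of step 340, assuming OpenCV and uint8 BGR frames of equal size; the codec, frame rate and output path are illustrative.

import cv2

def write_target_video(target_frames, out_path="target.mp4", fps=25):
    h, w = target_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in target_frames:
        writer.write(frame)
    writer.release()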
In summary, in the embodiment of the present invention, a video frame sequence comprising the video frames of a source video and a target image is input into the video generation model: realistic and stable foreground features of the video frame sequence are extracted through the image generation model, and stable, continuous optical flow features of the source video are extracted through the optical flow network model. Feature fusion of the foreground features and the optical flow features then generates a realistic and stable target video frame, so a realistic and stable target video can be generated based on the target video frames, improving the generation effect of the target video.
Based on the above training method for the video generative model shown in fig. 2, an embodiment of the present invention further provides a training apparatus for a video generative model, as shown in fig. 5, the training apparatus 500 for a video generative model may include:
a first obtaining module 510, configured to obtain a plurality of sample videos.
A building module 520, configured to construct a generative adversarial network, where the generative adversarial network includes a generation model and a discrimination model.
A first input module 530, configured to input the sample video into the generation model to obtain a predicted video frame.
The first input module 530 is further configured to input the predicted video frame and the sample video into the discrimination model to obtain a discrimination result; the discrimination model is used for discriminating whether the predicted video frame matches the sample video.
And the training module 540 is configured to train the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain a video generation model.
In one possible embodiment, generating the model includes: an image generation model and an optical flow network model; the discrimination model includes: an image discrimination model and a video discrimination model; the sample video comprises a first video frame and a plurality of second video frames adjacent to the first video frame; the first input module 530 is specifically configured to:
inputting the plurality of second video frames into a generating model, and extracting foreground training characteristics of the plurality of second video frames through the image generating model; extracting optical flow training characteristics of a plurality of second video frames through the optical flow network model;
and fusing the foreground training characteristic and the optical flow training characteristic to obtain a predicted video frame.
In a possible embodiment, the first input module 530 is specifically configured to:
inputting the predicted video frame and the first video frame into the image discrimination model to obtain a first loss value;
inputting the predicted video frame and the second video frames into the video discrimination model to obtain a second loss value;
the training module 540 is specifically configured to: train the generative adversarial network according to the first loss value and the second loss value until a training stop condition is met, to obtain a video generation model.
In a possible embodiment, the training apparatus 500 for video generative model may further include:
and the calculating module is used for calculating a third loss value of the image generation model according to the predicted video frame and the first video frame.
The calculation module is further configured to calculate a fourth loss value of the optical flow network model according to the optical flow training feature and the optical flow true value, where the optical flow true value is extracted from the sample video through a preset optical flow extraction algorithm.
And the determining module is used for determining the loss value of the generated model according to the third loss value and the fourth loss value.
The training module 540 is specifically configured to:
determining a loss value of the discrimination model according to the first loss value and the second loss value;
and performing back propagation training on the generative adversarial network according to the loss value of the generation model and the loss value of the discrimination model until the generative adversarial network meets a preset convergence condition, to obtain the trained video generation model.
In a possible embodiment, the training module 540 is specifically configured to:
inputting the predicted video frame and the first video frame into a feature extraction network, wherein the feature extraction network comprises a plurality of scale layers, and each scale layer respectively outputs a sub-loss value of the predicted video frame and a sub-loss value of the first video frame;
determining a multi-scale loss value according to the plurality of sub-loss values;
and training the generative adversarial network according to the multi-scale loss value, the first loss value and the second loss value until the training stop condition is met, to obtain a video generation model.
In summary, according to the embodiments of the present invention, a plurality of sample videos are obtained and a generative adversarial network comprising a generation model and a discrimination model is constructed. The sample videos are first input into the generation model to obtain predicted video frames, and then the predicted video frames and the sample videos are input into the discrimination model, which discriminates whether each predicted video frame matches its sample video; in this way, the realism of the predicted video frames and the smoothness of their connection with the sample videos can be continuously improved. Finally, the generative adversarial network is trained based on the discrimination result of each sample video until a training stop condition is met, yielding the video generation model. Therefore, the target video frames generated by the trained video generation model have realistic details, and the continuity of the target video generated from those target video frames can also be ensured.
Based on the video generating method shown in fig. 3, an embodiment of the present invention further provides a video generating apparatus, as shown in fig. 6, where the video generating apparatus 600 may include:
a second obtaining module 610, configured to obtain a sequence of video frames, where the sequence of video frames includes: video frames of a source video and a target image.
A second input module 620 for inputting the sequence of video frames to a video generation model, the video generation model comprising: an image generation model and an optical flow network model; foreground features of the sequence of video frames are extracted by an image generation model, and optical flow features of the source video are extracted by an optical flow network model.
And a fusion module 630, configured to perform feature fusion on the foreground features and the optical flow features, so as to generate a target video frame.
A generating module 640, configured to generate a target video based on the target video frame.
In one possible embodiment, each video frame in the video frame sequence carries a time sequence identifier. The video frame sequence comprises a first sequence and a second sequence: the first sequence comprises the target image and the generated target video frames, and the second sequence comprises the video frames of the source video. The second input module 620 is specifically configured to:
acquiring a third video frame corresponding to a time sequence identifier adjacent to the target time sequence identifier from the first sequence according to the target time sequence identifier of the target video frame to be generated;
acquiring a fourth video frame corresponding to the time sequence identifier adjacent to the target time sequence identifier from the second sequence according to the target time sequence identifier;
and extracting foreground features of the third video frame through an image generation model, and extracting optical flow features of the fourth video frame through an optical flow network model.
In summary, in the embodiment of the present invention, a video frame sequence comprising the video frames of a source video and a target image is input into the video generation model: realistic and stable foreground features of the video frame sequence are extracted through the image generation model, and stable, continuous optical flow features of the source video are extracted through the optical flow network model. Feature fusion of the foreground features and the optical flow features then generates a realistic and stable target video frame, so a realistic and stable target video can be generated based on the target video frames, improving the generation effect of the target video.
An embodiment of the present invention further provides an electronic device, as shown in fig. 7, including a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 complete mutual communication through the communication bus 704,
a memory 703 for storing a computer program;
the processor 701 is configured to implement the following steps when executing the program stored in the memory 703:
acquiring a plurality of sample videos; constructing a generative adversarial network, wherein the generative adversarial network comprises a generation model and a discrimination model; inputting the sample video into the generation model to obtain a predicted video frame; inputting the predicted video frame and the sample video into the discrimination model to obtain a discrimination result, the discrimination model being used for discriminating whether the predicted video frame matches the sample video; and training the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain the video generation model. Alternatively:
obtaining a video frame sequence, the video frame sequence comprising: video frames of a source video and a target image;
inputting the sequence of video frames into a video generation model, the video generation model comprising: an image generation model and an optical flow network model; extracting foreground characteristics of a video frame sequence through an image generation model, and extracting optical flow characteristics of a source video through an optical flow network model;
performing feature fusion on the foreground features and the optical flow features to generate a target video frame;
and generating a target video based on the target video frame.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM), and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In yet another embodiment, a computer-readable storage medium is provided, having stored thereon instructions, which, when executed on a computer, cause the computer to perform the method of any of the above embodiments.
In a further embodiment provided by the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method for training a video generative model, the method comprising:
acquiring a plurality of sample videos;
constructing a generative adversarial network, wherein the generative adversarial network comprises a generation model and a discrimination model;
inputting the sample video into the generation model to obtain a predicted video frame;
inputting the predicted video frame and the sample video into a discrimination model to obtain a discrimination result, wherein the discrimination model is configured to determine whether the predicted video frame matches the sample video; and
training the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain a video generation model.
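Purely as an illustration of the training loop in claim 1, the sketch below assumes PyTorch, a generation model that maps a sample video (a tensor of frames) to a predicted video frame, a discrimination model that scores how well a frame matches the sample video, and a fixed epoch count as the training stop condition; none of these signatures are specified by the claim.

```python
import torch

def train_generative_adversarial_network(generation_model, discrimination_model,
                                         sample_videos, g_opt, d_opt, num_epochs=1):
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(num_epochs):                 # stand-in for the training stop condition
        for sample_video in sample_videos:
            predicted_frame = generation_model(sample_video)
            real_frame = sample_video[-1]       # assumed ground-truth frame of the sample video

            # Discrimination result: real frames should score 1, predicted frames 0.
            d_real = discrimination_model(sample_video, real_frame)
            d_fake = discrimination_model(sample_video, predicted_frame.detach())
            d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
            d_opt.zero_grad()
            d_loss.backward()
            d_opt.step()

            # Generation-model update: the predicted frame should be judged as matching.
            g_score = discrimination_model(sample_video, predicted_frame)
            g_loss = bce(g_score, torch.ones_like(g_score))
            g_opt.zero_grad()
            g_loss.backward()
            g_opt.step()
```

The alternating update order (discriminator first, then generator) is a common convention and is likewise an assumption here, not a requirement of the claim.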
2. The method of claim 1, wherein the generation model comprises: an image generation model and an optical flow network model; the discrimination model comprises: an image discrimination model and a video discrimination model; and the sample video comprises a first video frame and a plurality of second video frames adjacent to the first video frame;
wherein inputting the sample video into the generation model to obtain the predicted video frame comprises:
inputting the plurality of second video frames into the generation model, and extracting foreground training features of the plurality of second video frames through the image generation model; and extracting optical flow training features of the plurality of second video frames through the optical flow network model;
and fusing the foreground training features and the optical flow training features to obtain the predicted video frame.
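As a non-authoritative illustration of the two-branch generation model in claim 2, the PyTorch sketch below stacks the second video frames along the channel axis, extracts foreground training features and optical flow training features with two small convolutional branches, and fuses them into a predicted video frame; all layer shapes and the channel-stacking scheme are assumptions made for this example.

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Illustrative generation model: image-generation branch + optical flow branch + fusion."""

    def __init__(self, num_second_frames: int = 2):
        super().__init__()
        in_ch = 3 * num_second_frames  # second frames stacked along the channel axis
        self.image_branch = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.flow_branch = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.fusion = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, second_frames: torch.Tensor) -> torch.Tensor:
        # second_frames: (batch, num_second_frames * 3, H, W)
        foreground = self.image_branch(second_frames)  # foreground training features
        flow = self.flow_branch(second_frames)         # optical flow training features
        return self.fusion(torch.cat([foreground, flow], dim=1))  # predicted video frame
```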
3. The method of claim 2, wherein inputting the predicted video frame and the sample video into the discrimination model to obtain the discrimination result, and training the generative adversarial network based on the discrimination result of each training sample until a training stop condition is satisfied to obtain the video generation model, comprises:
inputting the predicted video frame and the first video frame into the image discrimination model to obtain a first loss value;
inputting the predicted video frame and the second video frames into the video discrimination model to obtain a second loss value; and
training the generative adversarial network according to the first loss value and the second loss value until the training stop condition is satisfied, to obtain the video generation model.
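For illustration of claim 3's two discriminators only, the sketch below computes a first loss value with an image discrimination model and a second loss value with a video discrimination model; the hinge-style formulation and the channel-wise stacking of frames for the video discriminator are assumptions, since the claim does not fix a loss form.

```python
import torch

def discriminator_losses(image_discriminator, video_discriminator,
                         predicted_frame, first_frame, second_frames):
    # First loss value: per-image realism of the predicted frame vs. the first frame.
    real_img = image_discriminator(first_frame)
    fake_img = image_discriminator(predicted_frame.detach())
    first_loss = torch.relu(1.0 - real_img).mean() + torch.relu(1.0 + fake_img).mean()

    # Second loss value: temporal consistency of the predicted frame with the
    # adjacent second frames (frames stacked along the channel axis, an assumption).
    real_vid = video_discriminator(torch.cat([second_frames, first_frame], dim=1))
    fake_vid = video_discriminator(torch.cat([second_frames, predicted_frame.detach()], dim=1))
    second_loss = torch.relu(1.0 - real_vid).mean() + torch.relu(1.0 + fake_vid).mean()
    return first_loss, second_loss
```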
4. The method of claim 3, wherein before the inputting the predicted video frame and the sample video into the discrimination model to obtain the discrimination result, the method further comprises:
calculating a third loss value of the image generation model according to the predicted video frame and the first video frame;
calculating a fourth loss value of the optical flow network model according to the optical flow training features and an optical flow true value, wherein the optical flow true value is extracted from the sample video through a preset optical flow extraction algorithm; and
determining a loss value of the generation model according to the third loss value and the fourth loss value;
wherein training the generative adversarial network based on the discrimination result of each training sample until the training stop condition is satisfied to obtain the video generation model comprises:
determining a loss value of the discrimination model according to the first loss value and the second loss value; and
performing back-propagation training on the generative adversarial network according to the loss value of the generation model and the loss value of the discrimination model, until the generative adversarial network satisfies a preset convergence condition, to obtain the trained video generation model.
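A minimal sketch of the generator-side losses in claim 4 is shown below: a third loss value from the predicted frame versus the first (ground-truth) frame, a fourth loss value from the optical flow training features versus the optical flow true value, and their weighted sum as the loss value of the generation model. The L1 form and the weights are assumptions for illustration; the claim does not prescribe them.

```python
import torch.nn.functional as F

def generation_model_loss(predicted_frame, first_frame, flow_pred, flow_true,
                          w_recon: float = 1.0, w_flow: float = 1.0):
    third_loss = F.l1_loss(predicted_frame, first_frame)  # image reconstruction term
    fourth_loss = F.l1_loss(flow_pred, flow_true)         # optical flow term
    return w_recon * third_loss + w_flow * fourth_loss
```

In a back-propagation step, this generation-model loss and the discrimination-model loss built from the first and second loss values would then be minimized alternately until the preset convergence condition is reached.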
5. The method of claim 3 or 4, wherein training the generative adversarial network according to the first loss value and the second loss value until the training stop condition is satisfied, to obtain the video generation model, comprises:
inputting the predicted video frame and the first video frame into a feature extraction network, wherein the feature extraction network comprises a plurality of scale layers, and each scale layer outputs a sub-loss value between the predicted video frame and the first video frame;
determining a multi-scale loss value according to a plurality of the sub-loss values;
and training the generative adversarial network according to the multi-scale loss value, the first loss value and the second loss value until the training stop condition is satisfied, to obtain the video generation model.
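The multi-scale term of claim 5 can be sketched as follows, assuming the feature extraction network returns one feature map per scale layer and that the sub-loss values are simply summed; both choices are assumptions made for this example.

```python
import torch.nn.functional as F

def multiscale_loss(feature_extractor, predicted_frame, first_frame):
    pred_feats = feature_extractor(predicted_frame)  # list of feature maps, one per scale layer
    real_feats = feature_extractor(first_frame)
    sub_losses = [F.l1_loss(p, r) for p, r in zip(pred_feats, real_feats)]
    return sum(sub_losses)  # multi-scale loss value
```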
6. A method of video generation, the method comprising:
obtaining a sequence of video frames, the sequence of video frames comprising: a video frame and a target image of a source video;
inputting the sequence of video frames to a video generation model, the video generation model comprising: an image generation model and an optical flow network model; extracting foreground features of the sequence of video frames through the image generation model, and extracting optical flow features of the source video through the optical flow network model;
performing feature fusion on the foreground features and the optical flow features to generate a target video frame;
and generating a target video based on the target video frame.
7. The method of claim 6, wherein each video frame in the sequence of video frames comprises a time sequence identifier; the sequence of video frames comprises a first sequence and a second sequence, the first sequence comprises the target image and generated target video frames, and the second sequence comprises video frames of the source video; and the extracting foreground features of the sequence of video frames through the image generation model and the extracting optical flow features of the source video through the optical flow network model comprise:
acquiring a third video frame corresponding to a time sequence identifier adjacent to a target time sequence identifier from the first sequence according to the target time sequence identifier of a target video frame to be generated;
according to the target time sequence identification, acquiring a fourth video frame corresponding to the time sequence identification adjacent to the target time sequence identification from the second sequence;
and extracting foreground features of the third video frame through the image generation model, and extracting optical flow features of the fourth video frame through the optical flow network model.
8. An apparatus for training a video generative model, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of sample videos;
the system comprises a construction module, a judgment module and a control module, wherein the construction module is used for constructing a generated countermeasure network, and the generated countermeasure network comprises a generation model and a judgment model;
the first input module is used for inputting the sample video into the generation model to obtain a prediction video frame;
the first input module is further configured to input the predicted video frame and the sample video into the discrimination model to obtain a discrimination result, wherein the discrimination model is configured to determine whether the predicted video frame matches the sample video;
and the training module is used for training the generative adversarial network based on the discrimination result of each sample video until a training stop condition is met, to obtain a video generation model.
9. A video generation apparatus, characterized in that the apparatus comprises:
a second obtaining module, configured to obtain a sequence of video frames, where the sequence of video frames includes: a video frame and a target image of a source video;
a second input module for inputting the sequence of video frames to a video generation model, the video generation model comprising: an image generation model and an optical flow network model; extracting foreground features of the sequence of video frames through the image generation model, and extracting optical flow features of the source video through the optical flow network model;
the fusion module is used for carrying out feature fusion on the foreground feature and the optical flow feature to generate a target video frame;
and the generating module is used for generating a target video based on the target video frame.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202210581262.6A 2022-05-26 2022-05-26 Training method of video generation model, video generation method and device, electronic equipment and readable storage medium Pending CN115063713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210581262.6A CN115063713A (en) 2022-05-26 2022-05-26 Training method of video generation model, video generation method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210581262.6A CN115063713A (en) 2022-05-26 2022-05-26 Training method of video generation model, video generation method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115063713A true CN115063713A (en) 2022-09-16

Family

ID=83198998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210581262.6A Pending CN115063713A (en) 2022-05-26 2022-05-26 Training method of video generation model, video generation method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115063713A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117291252A (en) * 2023-11-27 2023-12-26 浙江华创视讯科技有限公司 Stable video generation model training method, generation method, equipment and storage medium
CN117291252B (en) * 2023-11-27 2024-02-20 浙江华创视讯科技有限公司 Stable video generation model training method, generation method, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109145781B (en) Method and apparatus for processing image
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
WO2022142450A1 (en) Methods and apparatuses for image segmentation model training and for image segmentation
CN111507343A (en) Training of semantic segmentation network and image processing method and device thereof
US11417095B2 (en) Image recognition method and apparatus, electronic device, and readable storage medium using an update on body extraction parameter and alignment parameter
CN111684490A (en) Optimization of dynamic object instance detection, segmentation and structure mapping
CN108280132B (en) Method and system for establishing personalized knowledge base for semantic image segmentation
CN113095346A (en) Data labeling method and data labeling device
CN113196289A (en) Human body action recognition method, human body action recognition system and device
CN111670457A (en) Optimization of dynamic object instance detection, segmentation and structure mapping
CN112200041B (en) Video motion recognition method and device, storage medium and electronic equipment
CN112016559A (en) Example segmentation model training method and device and image processing method and device
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
CN116664719B (en) Image redrawing model training method, image redrawing method and device
CN107545301B (en) Page display method and device
CN113792871A (en) Neural network training method, target identification method, device and electronic equipment
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
CN112242002B (en) Object identification and panoramic roaming method based on deep learning
CN114037046A (en) Distillation method and device of neural network model and electronic system
CN110704668B (en) Grid-based collaborative attention VQA method and device
CN115063713A (en) Training method of video generation model, video generation method and device, electronic equipment and readable storage medium
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
JP2023535047A (en) METHOD, APPARATUS AND COMPUTER-READABLE STORAGE MEDIA FOR MULTIMEDIA WORKS
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN116975347A (en) Image generation model training method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination