CN112308950A - Video generation method and device - Google Patents

Video generation method and device

Info

Publication number
CN112308950A
Authority
CN
China
Prior art keywords
video
frame
model
processed
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010864480.1A
Other languages
Chinese (zh)
Inventor
张炜
沙铜
梅涛
周伯文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010864480.1A
Publication of CN112308950A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/80 - 2D [Two Dimensional] animation, e.g. using sprites

Abstract

The application discloses a video generation method and device. One embodiment of the method comprises: acquiring a target image and a target continuous pose sequence; inputting the target image and the target continuous pose sequence into a pre-trained video generation model to generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence; and inputting the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video. In this way, on the basis of each high-quality video frame generated by the video generation model, the coherence of the video to be processed is improved by the coherence model, so that a high-quality video with good coherence is obtained, which improves both the quality and the coherence of the generated video.

Description

Video generation method and device
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a video generation method and device.
Background
Human image video synthesis is currently an important topic in the field of computer vision. It can serve as a data augmentation method for some video analysis tasks and is applied in many scenarios, such as film production and interactive applications.
Current portrait video synthesis techniques mainly fall into two categories. The first synthesizes, from a single portrait picture and an additional condition, a video that preserves the appearance of the portrait, where the motion in the video is obtained from the additional condition; the additional condition may be an action label of the person, continuous pose information of the person, and the like. The second synthesizes, from a portrait video and an additional condition, a video of a person performing the same motion, where an attribute of the person in the video is replaced according to the additional condition; the additional condition may be another portrait picture, a jacket picture, and the like.
Disclosure of Invention
The embodiment of the application provides a video generation method and device.
In a first aspect, an embodiment of the present application provides a video generation method, including: acquiring a target image and a target continuous pose sequence; inputting the target image and the target continuous pose sequence into a pre-trained video generation model to generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence, and the video generation model is used to characterize the correspondence among the target image, the target continuous pose sequence, and the video to be processed; and inputting the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video, wherein the coherence model is used to characterize the correspondence between the video to be processed and the coherent video.
In some embodiments, inputting the target image and the target continuous pose sequence into the pre-trained video generation model to generate the video to be processed includes: splitting the target continuous pose sequence into a plurality of pieces of single-frame pose information; for each piece of single-frame pose information among the plurality of pieces, generating, based on the target image, a single-frame image matching that piece of single-frame pose information; and synthesizing the video to be processed, which includes each single-frame image, according to the order in which the corresponding single-frame pose information appears in the target continuous pose sequence.
In some embodiments, inputting the video to be processed into the pre-trained coherence model to improve the coherence of the video to be processed and obtain the coherent video includes: inputting the video to be processed into the coherence model and obtaining each video frame of the coherent video in the following manner: determining, based on the target video frame in the video to be processed, a preset number of video frames preceding the target video frame in the video to be processed, and a preset number of video frames preceding the current video frame in the coherent video, the optical flow information between the previous video frame in the coherent video and the current video frame, as well as a predicted video frame of the current video frame, wherein the target video frame in the video to be processed corresponds to the current video frame in the coherent video; and obtaining the current video frame according to the predicted video frame and the optical flow information.
In some embodiments, the video generation model and the coherence model are trained as follows: obtaining a training sample set, wherein a training sample in the training sample set includes a sample image, a sample continuous pose sequence, and a sample video; obtaining an initial video model, wherein the initial video model includes a generation network and a discrimination network, the generation network includes an initial video generation model and an initial coherence model and is used to generate a video from the sample image and the sample continuous pose sequence, and the discrimination network is used to distinguish the video generated by the generation network from the sample video; and using a machine learning method, taking the sample image and the sample continuous pose sequence in a training sample as the input of the generation network, taking the video generated by the generation network and the sample video in the training sample as the input of the discrimination network, training the initial video model, determining the trained initial video generation model as the video generation model, and determining the trained initial coherence model as the coherence model.
In some embodiments, the discrimination network includes a video frame discrimination network and a video discrimination network, the video frame discrimination network is used to distinguish video frames of the video generated by the generation network from video frames of the sample video, and the video discrimination network is used to distinguish the video generated by the generation network from the sample video.
In a second aspect, an embodiment of the present application provides a video generation apparatus, including: an acquisition unit configured to acquire a target image and a target continuous pose sequence; a generating unit configured to input the target image and the target continuous pose sequence into a pre-trained video generation model and generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence, and the video generation model is used to characterize the correspondence among the target image, the target continuous pose sequence, and the video to be processed; and an obtaining unit configured to input the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video, wherein the coherence model is used to characterize the correspondence between the video to be processed and the coherent video.
In some embodiments, the generating unit is further configured to: split the target continuous pose sequence into a plurality of pieces of single-frame pose information; for each piece of single-frame pose information among the plurality of pieces, generate, based on the target image, a single-frame image matching that piece of single-frame pose information; and synthesize the video to be processed, which includes each single-frame image, according to the order in which the corresponding single-frame pose information appears in the target continuous pose sequence.
In some embodiments, the obtaining unit is further configured to: input the video to be processed into the coherence model and obtain each video frame of the coherent video in the following manner: determining, based on the target video frame in the video to be processed, a preset number of video frames preceding the target video frame in the video to be processed, and a preset number of video frames preceding the current video frame in the coherent video, the optical flow information between the previous video frame in the coherent video and the current video frame, as well as a predicted video frame of the current video frame, wherein the target video frame in the video to be processed corresponds to the current video frame in the coherent video; and obtaining the current video frame according to the predicted video frame and the optical flow information.
In some embodiments, the video generation model and the coherence model are trained as follows: obtaining a training sample set, wherein a training sample in the training sample set includes a sample image, a sample continuous pose sequence, and a sample video; obtaining an initial video model, wherein the initial video model includes a generation network and a discrimination network, the generation network includes an initial video generation model and an initial coherence model and is used to generate a video from the sample image and the sample continuous pose sequence, and the discrimination network is used to distinguish the video generated by the generation network from the sample video; and using a machine learning method, taking the sample image and the sample continuous pose sequence in a training sample as the input of the generation network, taking the video generated by the generation network and the sample video in the training sample as the input of the discrimination network, training the initial video model, determining the trained initial video generation model as the video generation model, and determining the trained initial coherence model as the coherence model.
In some embodiments, the discrimination network includes a video frame discrimination network and a video discrimination network, the video frame discrimination network is used to distinguish video frames of the video generated by the generation network from video frames of the sample video, and the video discrimination network is used to distinguish the video generated by the generation network from the sample video.
In a third aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement a method as described in any implementation of the first aspect.
According to the video generation method and device provided by the embodiments of the present application, a target image and a target continuous pose sequence are acquired; the target image and the target continuous pose sequence are input into a pre-trained video generation model to generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence; and the video to be processed is input into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video. In this way, on the basis of each high-quality video frame generated by the video generation model, the coherence of the video to be processed is improved by the coherence model, so that a high-quality video with good coherence is obtained, which improves both the quality and the coherence of the generated video.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a video generation method according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the video generation method according to the present embodiment;
FIG. 4 is a flow diagram of yet another embodiment of a video generation method according to the present application;
FIG. 5 is a block diagram of one embodiment of a video generation device according to the present application;
FIG. 6 is a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary architecture 100 to which the video generation method and apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 may be hardware devices or software that support network connection for data interaction and data processing. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information interaction, display, processing, and the like, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above and may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module, which is not specifically limited here.
The server 105 may be a server providing various services, for example, a background processing server that generates coherent videos based on the target image and the target continuous pose sequence sent by the terminal devices 101, 102, 103. The background processing server may input the target image and the target continuous pose sequence into a pre-trained video generation model to generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence, and may then input the video to be processed into a pre-trained coherence model to improve its coherence and obtain a coherent video. Optionally, the background processing server may further feed the coherent video back to the terminal device for display. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services), or as a single piece of software or software module, which is not specifically limited here.
It should be further noted that the video generation method provided by the embodiment of the present disclosure may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, and sub-module) included in the video generation apparatus may be entirely provided in the server, may be entirely provided in the terminal device, and may be provided in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the video generation method operates does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the video generation method operates.
With continued reference to FIG. 2, a flow 200 of one embodiment of a video generation method is shown, comprising the steps of:
Step 201, acquiring a target image and a target continuous pose sequence.
In this embodiment, the execution body of the video generation method (for example, a terminal device or the server in FIG. 1) may acquire the target image and the target continuous pose sequence from a remote location or locally through a wired or wireless connection.
The object included in the target image may be any of various objects, including but not limited to a human object, an animal object, a cartoon object, and the like. The target continuous pose sequence is used to characterize the motion pose information to be presented by the object included in the target image.
As an example, the object included in the target image is a human object, and the target continuous pose sequence characterizes the motion pose information of dancing. When a video is generated based on the target image and the target continuous pose sequence, the human object in the generated video presents the dancing motion pose information characterized by the target continuous pose sequence.
In this embodiment, the target continuous pose sequence may be a pose sequence extracted from a target video. As an example, for each video frame in the target video, the execution body extracts the pose information in that video frame; for all the pose information extracted from the target video, the execution body then generates a target continuous pose sequence including all the pose information according to the playing order of the video frames of the target video.
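As a concrete illustration of this extraction step, the following sketch reads the target video frame by frame and collects per-frame pose information into an ordered sequence. It is a minimal sketch under assumptions rather than the implementation described in the application: the estimate_pose function stands in for any single-image pose estimator (for example, a 2D keypoint detector), and only OpenCV is assumed for decoding the video.

```python
import cv2  # assumed dependency, used only to decode the target video


def estimate_pose(frame):
    """Hypothetical single-image pose estimator returning keypoint coordinates.

    Any off-the-shelf 2D keypoint detector could be substituted here; the
    application does not specify a particular one.
    """
    raise NotImplementedError


def extract_pose_sequence(target_video_path):
    """Build the target continuous pose sequence in video playing order."""
    capture = cv2.VideoCapture(target_video_path)
    pose_sequence = []
    while True:
        ok, frame = capture.read()
        if not ok:  # no more frames in the target video
            break
        pose_sequence.append(estimate_pose(frame))  # one pose entry per frame
    capture.release()
    return pose_sequence
```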
Step 202, inputting the target image and the target continuous pose sequence into a pre-trained video generation model to generate a video to be processed.
In this embodiment, the execution body may input the target image and the target continuous pose sequence obtained in step 201 into a pre-trained video generation model to generate a video to be processed. The video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence, and the video generation model is used to characterize the correspondence among the target image, the target continuous pose sequence, and the video to be processed.
The video generation model can be trained in the following way:
First, a training sample set is obtained, wherein a training sample in the training sample set includes a sample image, a sample continuous pose sequence, and a sample video.
Then, using a machine learning algorithm, the sample image and the sample continuous pose sequence are input into an initial video generation model, the sample video in the input training sample is taken as the expected output, and the video generation model is obtained by training.
In some optional implementations of this embodiment, the execution body may perform step 202 in the following manner:
First, the target continuous pose sequence is split into a plurality of pieces of single-frame pose information.
As an example, each piece of single-frame pose information after splitting may correspond to one video frame.
Then, for each piece of single-frame pose information among the plurality of pieces, a single-frame image matching that piece of single-frame pose information is generated based on the target image.
In each single-frame image, the included object presents the motion characterized by the corresponding single-frame pose information.
Finally, the video to be processed, which includes each single-frame image, is synthesized according to the order in which the corresponding single-frame pose information appears in the target continuous pose sequence.
Since, in each single-frame image, the object presents the motion of one piece of single-frame pose information, the video to be processed can be synthesized from all the single-frame images arranged according to the order in which their corresponding single-frame pose information appears in the target continuous pose sequence.
It can be understood that, because the video to be processed is synthesized from independently generated single-frame images, there may be a coherence problem, which needs to be addressed in the subsequent steps.
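To make this frame-by-frame construction concrete, the sketch below splits a pose sequence, generates one image per single-frame pose from the target image, and stacks the results in order. It is only an illustrative sketch under assumptions: frame_generator stands in for the per-frame generator inside the pre-trained video generation model, and the tensor shapes are hypothetical.

```python
import torch


def generate_video_to_be_processed(frame_generator, target_image, pose_sequence):
    """Sketch of step 202: one generated frame per single-frame pose, kept in order.

    frame_generator: assumed callable mapping (target_image, single_frame_pose) -> frame tensor
    target_image:    tensor of shape (3, H, W)
    pose_sequence:   iterable of single-frame pose tensors, already in playing order
    """
    frames = []
    for single_frame_pose in pose_sequence:                           # split into single-frame pose information
        frame = frame_generator(target_image, single_frame_pose)      # single-frame image matching this pose
        frames.append(frame)
    # synthesize the video to be processed by stacking frames in pose-sequence order
    return torch.stack(frames, dim=0)                                 # (T, 3, H, W)
```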
Step 203, inputting the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video.
In this embodiment, the execution body may input the video to be processed obtained in step 202 into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video. The coherence model is used to characterize the correspondence between the video to be processed and the coherent video.
The coherence model may be any network model capable of improving video coherence, such as a Video-to-Video Synthesis model.
In some optional implementations of this embodiment, the execution body may perform step 203 in the following manner:
The video to be processed is input into the coherence model, and each video frame of the coherent video is obtained in the following manner:
First, based on the target video frame in the video to be processed, a preset number of video frames preceding the target video frame in the video to be processed, and a preset number of video frames preceding the current video frame in the coherent video, the optical flow information between the previous video frame in the coherent video and the current video frame, as well as a predicted video frame of the current video frame, are determined, wherein the target video frame in the video to be processed corresponds to the current video frame in the coherent video.
The preset number may be set according to actual conditions; for example, the preset number may be set to 2.
Then, the current video frame is obtained from the predicted video frame and the optical flow information.
Specifically, the execution body may obtain, through the coherence model, the optical flow information between the previous video frame and the current video frame, the predicted video frame of the current video frame, and weight mask information, and may then obtain the current video frame through the following formula:
x_t = (1 - m_t) ⊙ w_{t-1}(x_{t-1}) + m_t ⊙ h_t
where x_t characterizes the current video frame, m_t characterizes the weight mask information, w_{t-1} characterizes the optical flow information between the previous video frame and the current video frame (applied here as a warping of x_{t-1}), x_{t-1} characterizes the previous video frame in the coherent video, h_t characterizes the predicted video frame, and ⊙ denotes element-wise multiplication.
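Read in plain terms, the formula warps the previous output frame with the estimated optical flow and blends it with the newly predicted frame using the weight mask. The sketch below illustrates that blend; it is a hedged illustration rather than the patent's implementation, and the warp_with_flow helper (a bilinear warp built on grid_sample), the tensor layout, and the pixel-unit flow convention are all assumptions.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(prev_frame, flow):
    """Warp the previous frame (1, 3, H, W) with a dense optical flow field (1, 2, H, W).

    Assumes the flow is expressed in pixel offsets; uses bilinear sampling.
    """
    _, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)          # (1, 2, H, W) pixel grid
    coords = base + flow                                              # displaced pixel coordinates
    # normalize coordinates to [-1, 1] for grid_sample, channel order (x, y)
    coords[:, 0] = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = coords.permute(0, 2, 3, 1)                                 # (1, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)


def compose_frame(prev_frame, flow, predicted_frame, mask):
    """x_t = (1 - m_t) ⊙ w_{t-1}(x_{t-1}) + m_t ⊙ h_t, with mask broadcast over channels."""
    warped = warp_with_flow(prev_frame, flow)
    return (1.0 - mask) * warped + mask * predicted_frame
```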
In some optional implementations of this embodiment, the video generation model and the coherence model are trained jointly. Specifically, the video generation model and the coherence model are obtained by training in the following manner:
First, a training sample set is obtained, wherein a training sample in the training sample set includes a sample image, a sample continuous pose sequence, and a sample video.
Then, an initial video model is obtained, wherein the initial video model includes a generation network and a discrimination network, the generation network includes an initial video generation model and an initial coherence model and is used to generate a video from the sample image and the sample continuous pose sequence, and the discrimination network is used to distinguish the video generated by the generation network from the sample video.
Finally, using a machine learning method, the sample image and the sample continuous pose sequence in a training sample are taken as the input of the generation network, the video generated by the generation network and the sample video in the training sample are taken as the input of the discrimination network, the initial video model is trained, the trained initial video generation model is determined as the video generation model, and the trained initial coherence model is determined as the coherence model.
In some optional implementations, the discrimination network includes a video frame discrimination network and a video discrimination network, the video frame discrimination network is used to distinguish video frames of the video generated by the generation network from video frames of the sample video, and the video discrimination network is used to distinguish the video generated by the generation network from the sample video. Based on this dual discrimination at the frame level and the video level, the coherence of the coherent video obtained by the coherence model can be further improved.
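As an informal illustration of this adversarial training setup, the sketch below couples the generation network (the video generation model followed by the coherence model) with a frame-level discriminator and a video-level discriminator. The module interfaces, tensor shapes, optimizer grouping, and the binary cross-entropy loss are assumptions made for illustration; the application itself does not specify them.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()


def train_step(video_gen_model, coherence_model, frame_disc, video_disc,
               gen_opt, disc_opt, sample_image, sample_pose_seq, sample_video):
    """One adversarial update; gen_opt covers both generator modules, disc_opt both discriminators."""
    # Generation network: video generation model, then coherence model.
    rough_video = video_gen_model(sample_image, sample_pose_seq)      # video to be processed
    fake_video = coherence_model(rough_video)                         # coherent video, (T, 3, H, W)

    # Discrimination network update: real sample video vs. generated video.
    disc_opt.zero_grad()
    real_frame_logits = frame_disc(sample_video)                      # per-frame judgments
    fake_frame_logits = frame_disc(fake_video.detach())
    real_video_logits = video_disc(sample_video.unsqueeze(0))         # whole-video judgment
    fake_video_logits = video_disc(fake_video.detach().unsqueeze(0))
    d_loss = (bce(real_frame_logits, torch.ones_like(real_frame_logits)) +
              bce(fake_frame_logits, torch.zeros_like(fake_frame_logits)) +
              bce(real_video_logits, torch.ones_like(real_video_logits)) +
              bce(fake_video_logits, torch.zeros_like(fake_video_logits)))
    d_loss.backward()
    disc_opt.step()

    # Generation network update: try to fool both discriminators.
    gen_opt.zero_grad()
    fake_frame_logits = frame_disc(fake_video)
    fake_video_logits = video_disc(fake_video.unsqueeze(0))
    g_loss = (bce(fake_frame_logits, torch.ones_like(fake_frame_logits)) +
              bce(fake_video_logits, torch.ones_like(fake_video_logits)))
    g_loss.backward()
    gen_opt.step()
    return d_loss.item(), g_loss.item()
```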
With continued reference to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the video generation method according to the present embodiment. In the application scenario of FIG. 3, during video processing, the user 301 selects a target image 303 and a target continuous pose sequence 304 on the terminal device 302, and the terminal device 302 sends the selected target image 303 and target continuous pose sequence 304 to the server 305. The server 305 inputs the target image 303 and the target continuous pose sequence 304 into a pre-trained video generation model 306 to generate a video to be processed 307. The video to be processed 307 shows the object included in the target image 303 presenting the pose information characterized by the target continuous pose sequence 304. Then, the video to be processed 307 is input into a pre-trained coherence model 308 to improve its coherence and obtain a coherent video 309, and the coherent video 309 is fed back to the terminal device 302.
According to the method provided by the above embodiment of the present disclosure, a target image and a target continuous pose sequence are acquired; the target image and the target continuous pose sequence are input into a pre-trained video generation model to generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence; and the video to be processed is input into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video. In this way, on the basis of each high-quality video frame generated by the video generation model, the coherence of the video to be processed is improved by the coherence model, so that a high-quality video with good coherence is obtained, which improves both the quality and the coherence of the generated video.
With continuing reference to FIG. 4, an illustrative flow 400 of another embodiment of a video generation method according to the present application is shown, comprising the following steps:
Step 401, acquiring a target image and a target continuous pose sequence.
Step 402, splitting the target continuous pose sequence into a plurality of pieces of single-frame pose information.
Step 403, for each piece of single-frame pose information among the plurality of pieces, generating, based on the target image, a single-frame image matching that piece of single-frame pose information.
Step 404, synthesizing the video to be processed, which includes each single-frame image, according to the order in which the corresponding single-frame pose information appears in the target continuous pose sequence.
Step 405, inputting the video to be processed into the coherence model, and obtaining each video frame of the coherent video in the following manner:
Step 4051, determining, based on the target video frame in the video to be processed, a preset number of video frames preceding the target video frame in the video to be processed, and a preset number of video frames preceding the current video frame in the coherent video, the optical flow information between the previous video frame in the coherent video and the current video frame, as well as a predicted video frame of the current video frame, wherein the target video frame in the video to be processed corresponds to the current video frame in the coherent video.
Step 4052, obtaining the current video frame from the predicted video frame and the optical flow information.
As can be seen from this embodiment, compared with the embodiment corresponding to FIG. 2, the flow 400 of the video generation method in this embodiment specifically illustrates the process of generating the video to be processed and the process of improving its coherence. The present embodiment thereby further improves the quality and coherence of the generated video.
With continuing reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a video generation apparatus. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in FIG. 5, the video generation apparatus includes: an acquisition unit 501 configured to acquire a target image and a target continuous pose sequence; a generating unit 502 configured to input the target image and the target continuous pose sequence into a pre-trained video generation model and generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence, and the video generation model is used to characterize the correspondence among the target image, the target continuous pose sequence, and the video to be processed; and an obtaining unit 503 configured to input the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video, wherein the coherence model is used to characterize the correspondence between the video to be processed and the coherent video.
In some embodiments, the generating unit 502 is further configured to: split the target continuous pose sequence into a plurality of pieces of single-frame pose information; for each piece of single-frame pose information among the plurality of pieces, generate, based on the target image, a single-frame image matching that piece of single-frame pose information; and synthesize the video to be processed, which includes each single-frame image, according to the order in which the corresponding single-frame pose information appears in the target continuous pose sequence.
In some embodiments, the obtaining unit 503 is further configured to: input the video to be processed into the coherence model and obtain each video frame of the coherent video in the following manner: determining, based on the target video frame in the video to be processed, a preset number of video frames preceding the target video frame in the video to be processed, and a preset number of video frames preceding the current video frame in the coherent video, the optical flow information between the previous video frame in the coherent video and the current video frame, as well as a predicted video frame of the current video frame, wherein the target video frame in the video to be processed corresponds to the current video frame in the coherent video; and obtaining the current video frame according to the predicted video frame and the optical flow information.
In some embodiments, the video generation model and the coherence model are trained as follows: obtaining a training sample set, wherein a training sample in the training sample set includes a sample image, a sample continuous pose sequence, and a sample video; obtaining an initial video model, wherein the initial video model includes a generation network and a discrimination network, the generation network includes an initial video generation model and an initial coherence model and is used to generate a video from the sample image and the sample continuous pose sequence, and the discrimination network is used to distinguish the video generated by the generation network from the sample video; and using a machine learning method, taking the sample image and the sample continuous pose sequence in a training sample as the input of the generation network, taking the video generated by the generation network and the sample video in the training sample as the input of the discrimination network, training the initial video model, determining the trained initial video generation model as the video generation model, and determining the trained initial coherence model as the coherence model.
In some embodiments, the discrimination network includes a video frame discrimination network and a video discrimination network, the video frame discrimination network is used to distinguish video frames of the video generated by the generation network from video frames of the sample video, and the video discrimination network is used to distinguish the video generated by the generation network from the sample video.
In this embodiment, the acquisition unit of the video generation apparatus acquires a target image and a target continuous pose sequence; the generating unit inputs the target image and the target continuous pose sequence into a pre-trained video generation model to generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence; and the obtaining unit inputs the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video. In this way, on the basis of each high-quality video frame generated by the video generation model, the coherence of the video to be processed is improved by the coherence model, so that a high-quality video with good coherence is obtained, which improves both the quality and the coherence of the generated video.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing devices of embodiments of the present application (e.g., devices 101, 102, 103, 105 shown in FIG. 1). The apparatus shown in fig. 6 is only an example, and should not bring any limitation to the function and use range of the embodiments of the present application.
As shown in FIG. 6, the computer system 600 includes a processor (e.g., a central processing unit, CPU) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program, when executed by the processor 601, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the client computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a generation unit, and an obtaining unit. The names of the units do not form a limitation on the units themselves under certain conditions, for example, the obtaining unit can also be described as a unit for inputting the video to be processed into a pre-trained coherence model to promote the coherence of the video to be processed and obtain coherent video.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the computer device to: acquiring a target image and a target continuous attitude sequence; inputting a target image and a target continuous attitude sequence into a pre-trained video generation model to generate a to-be-processed video, wherein the to-be-processed video represents attitude information represented by the target continuous attitude sequence of an object included in the target image, and the video generation model is used for representing the corresponding relation among the target image, the target continuous attitude sequence and the to-be-processed video; and inputting the video to be processed into a pre-trained consistency model, and improving the consistency of the video to be processed to obtain a coherent video, wherein the consistency model is used for representing the corresponding relation between the video to be processed and the coherent video.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A video generation method, comprising:
acquiring a target image and a target continuous pose sequence;
inputting the target image and the target continuous pose sequence into a pre-trained video generation model to generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence, and the video generation model is used to characterize the correspondence among the target image, the target continuous pose sequence and the video to be processed;
and inputting the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video, wherein the coherence model is used to characterize the correspondence between the video to be processed and the coherent video.
2. The method of claim 1, wherein the inputting the target image and the target continuous pose sequence into a pre-trained video generation model to generate a video to be processed comprises:
splitting the target continuous pose sequence into a plurality of pieces of single-frame pose information;
for each piece of single-frame pose information among the plurality of pieces, generating, based on the target image, a single-frame image matching that piece of single-frame pose information;
and synthesizing the video to be processed, which includes each single-frame image, according to the order in which the corresponding single-frame pose information appears in the target continuous pose sequence.
3. The method of claim 1, wherein the inputting the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video comprises:
inputting the video to be processed into the coherence model, and obtaining each video frame of the coherent video in the following manner:
determining, based on a target video frame in the video to be processed, a preset number of video frames preceding the target video frame in the video to be processed, and a preset number of video frames preceding the current video frame in the coherent video, the optical flow information between the previous video frame in the coherent video and the current video frame, as well as a predicted video frame of the current video frame, wherein the target video frame in the video to be processed corresponds to the current video frame in the coherent video;
and obtaining the current video frame according to the predicted video frame and the optical flow information.
4. The method of claim 1, wherein the video generation model and the coherence model are trained by:
obtaining a training sample set, wherein training samples in the training sample set comprise: sample images, sample continuous pose sequences and sample videos;
acquiring an initial video model, wherein the initial video model comprises a generation network and a discrimination network, the generation network comprises an initial video generation model and an initial coherence model and is used to generate a video from the sample image and the sample continuous pose sequence, and the discrimination network is used to distinguish the video generated by the generation network from the sample video;
and training the initial video model by using a machine learning method, taking the sample image and the sample continuous pose sequence in the training sample as the input of the generation network, taking the video generated by the generation network and the sample video in the training sample as the input of the discrimination network, determining the trained initial video generation model as the video generation model, and determining the trained initial coherence model as the coherence model.
5. The method of claim 4, wherein the discrimination network includes a video frame discrimination network for distinguishing video frames of the video generated by the generation network from video frames of the sample video and a video discrimination network for distinguishing video generated by the generation network from sample video.
6. A video generation apparatus comprising:
an acquisition unit configured to acquire a target image and a target continuous pose sequence;
a generating unit configured to input the target image and the target continuous pose sequence into a pre-trained video generation model and generate a video to be processed, wherein the video to be processed shows the object included in the target image presenting the pose information characterized by the target continuous pose sequence, and the video generation model is used to characterize the correspondence among the target image, the target continuous pose sequence and the video to be processed;
and an obtaining unit configured to input the video to be processed into a pre-trained coherence model to improve the coherence of the video to be processed and obtain a coherent video, wherein the coherence model is used to characterize the correspondence between the video to be processed and the coherent video.
7. The apparatus of claim 6, wherein the generating unit is further configured to:
splitting the target continuous pose sequence into a plurality of pieces of single-frame pose information; for each piece of single-frame pose information among the plurality of pieces, generating, based on the target image, a single-frame image matching that piece of single-frame pose information; and synthesizing the video to be processed, which includes each single-frame image, according to the order in which the corresponding single-frame pose information appears in the target continuous pose sequence.
8. The apparatus of claim 6, wherein the obtaining unit is further configured to:
inputting the video to be processed into the coherence model, and obtaining each video frame of the coherent video in the following manner: determining, based on a target video frame in the video to be processed, a preset number of video frames preceding the target video frame in the video to be processed, and a preset number of video frames preceding the current video frame in the coherent video, the optical flow information between the previous video frame in the coherent video and the current video frame, as well as a predicted video frame of the current video frame, wherein the target video frame in the video to be processed corresponds to the current video frame in the coherent video; and obtaining the current video frame according to the predicted video frame and the optical flow information.
9. The apparatus of claim 6, wherein the video generation model and the coherence model are trained by:
obtaining a training sample set, wherein training samples in the training sample set comprise: sample images, sample continuous pose sequences and sample videos; acquiring an initial video model, wherein the initial video model comprises a generation network and a discrimination network, the generation network comprises an initial video generation model and an initial coherence model and is used to generate a video from the sample image and the sample continuous pose sequence, and the discrimination network is used to distinguish the video generated by the generation network from the sample video; and training the initial video model by using a machine learning method, taking the sample image and the sample continuous pose sequence in the training sample as the input of the generation network, taking the video generated by the generation network and the sample video in the training sample as the input of the discrimination network, determining the trained initial video generation model as the video generation model, and determining the trained initial coherence model as the coherence model.
10. The apparatus of claim 9, wherein the discrimination network comprises a video frame discrimination network to distinguish video frames of the video generated by the generation network from video frames of the sample video and a video discrimination network to distinguish video generated by the generation network from sample video.
11. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-5.
12. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
CN202010864480.1A 2020-08-25 2020-08-25 Video generation method and device Pending CN112308950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010864480.1A CN112308950A (en) 2020-08-25 2020-08-25 Video generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010864480.1A CN112308950A (en) 2020-08-25 2020-08-25 Video generation method and device

Publications (1)

Publication Number Publication Date
CN112308950A true CN112308950A (en) 2021-02-02

Family

ID=74483857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010864480.1A Pending CN112308950A (en) 2020-08-25 2020-08-25 Video generation method and device

Country Status (1)

Country Link
CN (1) CN112308950A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022237688A1 (en) * 2021-05-12 2022-11-17 影石创新科技股份有限公司 Method and apparatus for pose estimation, computer device, and storage medium
CN114095754A (en) * 2021-11-17 2022-02-25 维沃移动通信有限公司 Video processing method and device and electronic equipment
CN114095754B (en) * 2021-11-17 2024-04-19 维沃移动通信有限公司 Video processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN107578017B (en) Method and apparatus for generating image
CN107633218B (en) Method and apparatus for generating image
CN110162670B (en) Method and device for generating expression package
CN111476871B (en) Method and device for generating video
US11151765B2 (en) Method and apparatus for generating information
CN108830235B (en) Method and apparatus for generating information
CN107609506B (en) Method and apparatus for generating image
CN109993150B (en) Method and device for identifying age
CN110298319B (en) Image synthesis method and device
CN111800671B (en) Method and apparatus for aligning paragraphs and video
CN109981787B (en) Method and device for displaying information
CN110059623B (en) Method and apparatus for generating information
US20190371023A1 (en) Method and apparatus for generating multimedia content, and device therefor
CN110349107B (en) Image enhancement method, device, electronic equipment and storage medium
CN112308950A (en) Video generation method and device
CN115937033A (en) Image generation method and device and electronic equipment
CN111897950A (en) Method and apparatus for generating information
CN112364144A (en) Interaction method, device, equipment and computer readable medium
CN106530377B (en) Method and apparatus for manipulating three-dimensional animated characters
CN109949213B (en) Method and apparatus for generating image
CN110084306B (en) Method and apparatus for generating dynamic image
CN111314627B (en) Method and apparatus for processing video frames
CN110188833B (en) Method and apparatus for training a model
CN111260756B (en) Method and device for transmitting information
CN110599437A (en) Method and apparatus for processing video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination