CN117750155A - Method and device for generating video based on image and electronic equipment - Google Patents

Method and device for generating video based on image and electronic equipment

Info

Publication number
CN117750155A
CN117750155A (application CN202311766533.6A)
Authority
CN
China
Prior art keywords
image
tensors
video
tensor
generation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311766533.6A
Other languages
Chinese (zh)
Inventor
熊玲欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Geely Holding Group Co Ltd, Geely Automobile Research Institute Ningbo Co Ltd filed Critical Zhejiang Geely Holding Group Co Ltd
Priority to CN202311766533.6A priority Critical patent/CN117750155A/en
Publication of CN117750155A publication Critical patent/CN117750155A/en
Pending legal-status Critical Current

Landscapes

  • Television Systems (AREA)

Abstract

The specification provides a method, an apparatus and an electronic device for generating video based on an image, wherein the method comprises the following steps: preprocessing an input image into an initial image tensor to match the structure of an initial random tensor of a video generation model, wherein the video generation model generates video based on the initial random tensor; loading the initial image tensor into the video generation model as the initial random tensor such that the video generation model generates a set of continuously varying potential image tensors based on the initial image tensor; and converting all the generated potential image tensors into the format of video frames and synthesizing the video frames into a video file.

Description

Method and device for generating video based on image and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for generating video based on an image, and an electronic device.
Background
With the development of artificial intelligence, AIGC (Artificial Intelligence Generated Content) technology generates content related to an input text by training a model on, and learning from, a large amount of data. For example, by inputting a piece of text describing the movement of an object, a matching moving image can be generated.
In existing AIGC technology, video content of a corresponding scene is generated from text description information input by a user. However, for complex scenes, the text description may not provide a complete account or accurate details, or the text description itself may be ambiguous or open to multiple interpretations. This introduces a certain randomness into the video generated from the text description, so that the generated video content may be inconsistent with, or a distortion of, the text description.
Disclosure of Invention
In order to overcome the problems in the related art, the present specification provides a method, an apparatus and an electronic device for generating a video based on an image.
According to a first aspect of embodiments of the present specification, there is provided a method of generating video based on an image, the method comprising:
preprocessing an input image into an initial image tensor to match the structure of an initial random tensor of a video generation model, wherein the video generation model generates a video based on the initial random tensor;
loading the initial image tensor into the video generation model as the initial random tensor such that the video generation model generates a set of continuously varying potential image tensors based on the initial image tensor;
converting all the generated potential image tensors into the format of video frames, and synthesizing the video frames into a video file.
According to a second aspect of embodiments of the present specification, there is provided an apparatus for generating video based on an image, comprising:
a preprocessing unit for preprocessing an input image into an initial image tensor to match a structure of an initial random tensor of a video generation model, wherein the video generation model generates a video based on the initial random tensor;
a generation unit configured to load the initial image tensor as the initial random tensor into the video generation model, so that the video generation model generates a set of continuously changing potential image tensors based on the initial image tensor;
and the synthesis unit is used for converting all the generated potential image tensors into the format of video frames and synthesizing the video frames into a video file.
According to a third aspect of embodiments of the present specification, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when the program is executed.
According to a fourth aspect of embodiments of the present description, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the specification can comprise the following beneficial effects:
in the embodiments of the present specification, the input image is preprocessed into an initial image tensor that matches the structure of the initial random tensor of the video generation model, and the initial image tensor is loaded into the video generation model as its initial random tensor to generate the corresponding video content. Because an input image freely selected by the user serves as the basis of the generated video content, it provides more specific visual information, which better guides the model's video generation process toward video content that matches the input image and ensures that the generated video meets the user's expectations.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart illustrating a method of generating video based on an image according to an exemplary embodiment of the present description.
FIG. 2 is a schematic diagram of the noise addition and denoising of an initial random tensor according to an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic diagram of denoising processing performed by the denoising module according to an exemplary embodiment of the present specification.
Fig. 4 is a schematic diagram of a text tensor control denoising process illustrated in this specification according to an exemplary embodiment.
Fig. 5 is a schematic diagram of a motion sub-module control denoising process according to an exemplary embodiment of the present disclosure.
Fig. 6 is a schematic structural view of an electronic device according to an exemplary embodiment of the present disclosure.
Fig. 7 is a block diagram of an apparatus for generating video based on an image according to an exemplary embodiment of the present specification.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description as detailed in the accompanying claims.
The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present description. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
With the development of artificial intelligence, AIGC (Artificial Intelligence Generated Content) technology generates content related to an input text by training a model on, and learning from, a large amount of data. For example, by inputting a piece of text describing the movement of an object, a matching moving image can be generated.
In existing AIGC technology, video content of a corresponding scene is generated from text description information input by a user. However, for complex scenes, the text description may not provide complete or accurate details. For example, for a text description of "sunset on the sea", the generated video content may not accurately capture details such as the colors of the sunset, light and shadow effects, or the motion of the waves. The text description itself may also be ambiguous or open to multiple interpretations; for example, for a text description of "city streets", different people may imagine completely different street scenes, leading to differences in the generated video content. These problems introduce a certain randomness into the video generated from the text description and can make the generated video inconsistent with, or a distortion of, the text description.
In view of the above, the present disclosure provides a method for generating video based on images, so as to avoid randomness of the generated video result and ensure that the generated video content meets the expectations of users.
Next, embodiments of the present specification will be described in detail.
As shown in fig. 1, fig. 1 is a flowchart illustrating a method of generating video based on an image according to an exemplary embodiment of the present specification, including the steps of:
step 101, preprocessing an input image into an initial image tensor to match the structure of the initial random tensor of the video generation model.
In one embodiment, the user may search on their own for an image with clear, sharp features to use as the input image. The image may contain different types of feature information, such as the desired scene or object elements, providing sufficient guidance for the video generation model and ensuring that the generated video content is associated with the input image. For example, if a video of a beach sunset is to be generated, it is more appropriate to select an image of a beach sunset as the input image, because it provides a visual reference for the desired scene. Alternatively, instead of searching for the desired image manually, the user may generate an image conforming to their text description with a text-to-image model and then use that image as the input image of the video generation model. The present description places no limitation on the source of the input image.
In one embodiment, the video generation model used as the base model for this method may be, but is not limited to, an SD (Stable Diffusion) model.
In one embodiment, the video generation model generates video based on an initial random tensor, which is a randomly generated tensor also referred to as a noise vector. In other words, the diffusion starting point of the video generation model is randomly generated noise, which makes the result of each generation random. The present method therefore replaces the initial random tensor of the video generation model with the input image, so that the video generation model can take the input image as its diffusion starting point when generating the video.
Specifically, the structure of the input image may be preprocessed to the same structure as the initial random tensor, and then the input image is loaded into the video generation model as the initial random tensor. Illustratively, first, the input image may be resized to ensure that the image size is a multiple of 8 for processing by the image processing algorithm; then, scaling can be performed on the pixel values of the input image, and the pixel values of the input image are normalized to be within a range required by the model; finally, the input image is dimensionally transformed, and the structure of the input image is processed into the same structure as the initial random tensor of the video generation model. For example, the input image has a structure of (Batch Size, H, W, C), where Batch Size represents a Batch Size, H represents a height of the image, W represents a width of the image, and C represents the number of channels of the image; while the structure of the initial random tensor of the video generation model may be (Batch Size, C, H, W); thus, the structure of the input image can be converted into (Batch Size, C, H, W) to maintain the same structure as the initial random tensor.
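For concreteness, the following is a minimal sketch of this preprocessing step, assuming PyTorch, NumPy and PIL; the function name, the target value range [-1, 1] and the batch handling are illustrative assumptions rather than details fixed by this specification.

```python
import numpy as np
import torch
from PIL import Image

def preprocess_image(path: str, batch_size: int = 1) -> torch.Tensor:
    """Preprocess an input image into an initial image tensor of shape
    (Batch Size, C, H, W), matching the structure of the initial random tensor."""
    img = Image.open(path).convert("RGB")

    # Resize so that height and width are multiples of 8.
    w, h = img.size
    img = img.resize(((w // 8) * 8, (h // 8) * 8))

    # Scale pixel values from [0, 255] into the range the model expects
    # (here [-1, 1], an assumption).
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = arr * 2.0 - 1.0

    # Dimension transform: (H, W, C) -> (C, H, W), then add the batch dimension.
    tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0)
    return tensor.repeat(batch_size, 1, 1, 1)
```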
Step 102, loading the initial image tensor into the video generation model as the initial random tensor, so that the video generation model generates a set of continuously changing potential image tensors based on the initial image tensor.
In one embodiment, as shown in FIG. 2, the process by which the video generation model generates a set of continuously varying potential image tensors based on the initial image tensor may be divided into two associated processing steps: a noise adding process 20 and a denoising process 21.
At the stage of the noise adding process 20, the specific steps are as follows:
first, a set of random noise 202 may be generated with a preset number of frames, and the user may specify the duration of the generated video by customizing this preset number of frames. For example, if the frame rate of the video is 30 frames/second and the preset number of frames is 300 frames, then the duration of the generated video will be 10 seconds. By customizing the preset number of frames, the user can flexibly control the duration of the generated video as needed.
Then, the information of the initial image tensor 201 may be added to each random noise 202 to obtain a set of noise image tensors 211 with the preset number of frames. Illustratively, assume that the set of random noise generated with the preset number of frames has dimensions (Batch Size, C, L, H, W), where L is the preset number of frames, and that the initial image tensor has dimensions (Batch Size, C, H, W); then, along the preset-frame-number dimension, the initial image tensor 201 may be added to the random noise 202 corresponding to each frame to obtain a set of noise image tensors 211 with the preset number of frames.
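A minimal sketch of this noise adding stage is shown below, assuming PyTorch; the broadcast-and-add formulation and the noise_scale parameter are illustrative assumptions, since the specification only states that the initial image tensor information is added to each frame's random noise.

```python
import torch

def add_initial_image_to_noise(init_image: torch.Tensor, num_frames: int,
                               noise_scale: float = 1.0) -> torch.Tensor:
    """Noise adding stage: broadcast the initial image tensor over the
    preset-frame-number dimension and add per-frame random noise.

    init_image: (Batch Size, C, H, W)
    returns:    (Batch Size, C, L, H, W), where L = num_frames."""
    b, c, h, w = init_image.shape
    # A set of random noise with the preset number of frames L.
    noise = torch.randn(b, c, num_frames, h, w) * noise_scale
    # Add the initial image information to the random noise of every frame.
    return init_image.unsqueeze(2) + noise
```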
In the stage of the denoising process 21, the specific steps are as follows:
In one embodiment, the denoising module may be a module within the video generation model; for example, it may be the U-Net module in the SD 1.5 model. Based on the noise residuals predicted by the denoising module, the input set of noise image tensors 211 is reconstructed in the time dimension and is gradually converted from the set of noise image tensors 211 into a set of continuously varying potential image tensors 212.
Specifically, as shown in fig. 3, a set of noise image tensors 211 is input into the denoising module 30 of the video generation model, and n rounds of iterative processing are performed on the set of noise image tensors 211 to generate a set of continuously varying potential image tensors 212, where n is the actual number of samples. The iterative processing proceeds as follows:
In the first sampling round, the set of noise image tensors 211 is input into the denoising module 30. In the preset-frame-number dimension, the denoising module 30 simulates a process in which noise disturbances of different intensities are continuously added over time, so as to predict a set of first noise data distributions 31 in one-to-one correspondence with the input set of noise image tensors 211; each first noise data distribution 31 is then used to update the corresponding tensor in the set of input image tensors to obtain a set of intermediate image tensors.
In the second sampling round, the set of intermediate image tensors obtained in the first round is input into the denoising module 30 again. The denoising module 30 again simulates, in the preset-frame-number dimension, the process of continuously adding noise disturbances of different intensities over time to predict first noise data distributions 31 in one-to-one correspondence with the input set of intermediate image tensors, and each first noise data distribution 31 is used to update the corresponding input image tensor to obtain a new set of intermediate image tensors. In general, the set of input image tensors in the first iteration is the set of noise image tensors, and the set of input image tensors in each subsequent iteration is the set of intermediate image tensors output by the previous iteration.
The set of intermediate image tensors output by the final round of iterative processing is taken as the set of continuously varying potential image tensors 212 generated by the denoising module 30.
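The following sketch illustrates the shape of this n-round iterative loop, assuming PyTorch; the denoiser callable stands in for the U-Net denoising module, and the simple residual update is a placeholder for whatever sampler or scheduler the actual video generation model uses.

```python
import torch

def iterative_denoise(noise_tensors: torch.Tensor, denoiser, num_samples: int,
                      text_tensor: torch.Tensor = None) -> torch.Tensor:
    """n-round iterative denoising. `denoiser` stands in for the U-Net
    denoising module and is assumed to predict one noise data distribution
    (noise residual) per input tensor.

    noise_tensors: (Batch Size, C, L, H, W) set of noise image tensors."""
    x = noise_tensors
    for step in reversed(range(num_samples)):
        t = torch.full((x.shape[0],), step, dtype=torch.long)
        # Predict a set of first noise data distributions for the current inputs.
        noise_pred = denoiser(x, t, text_tensor)
        # Update each input tensor with its predicted distribution; a real
        # sampler would apply a scheduler-specific update here.
        x = x - noise_pred / max(num_samples, 1)
    # The output of the final round is the set of continuously varying
    # potential image tensors.
    return x
```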
In the above embodiment, adding the feature information of the input image to the random noise during the noise adding process yields noise image tensors that contain the feature information of the input image. In the subsequent denoising process, the feature information of the input image retained in the noise images provides a reference for the denoising algorithm, which ensures that the finally generated video content is associated with the input image rather than being random.
In an illustrated embodiment, the user may set the actual number of samples to a fixed value, or may control the actual number of samples through a denoising strength parameter. For example, the relation between the denoising strength parameter and the actual number of samples can be expressed as follows:
actual number of samples = preset number of samples-preset number of samples × denoising strength parameter
The value range of the denoising strength parameter is [0, 1].
A higher actual number of samples yields a smaller difference between the generated video content and the features of the input image, while a lower actual number of samples yields a larger difference. From the above formula, the larger the user-defined denoising strength parameter, the lower the actual number of samples. Therefore, through this formula, the user can indirectly control the stylistic variation of the video content generated by the denoising module by customizing the denoising strength parameter, which adjusts the actual number of samples.
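In code, the relation between the denoising strength parameter and the actual number of samples is simply the following (rounding the result to an integer is an assumption; the specification does not fix how a fractional value is handled):

```python
def actual_num_samples(preset_num_samples: int, denoising_strength: float) -> int:
    """actual number of samples = preset number of samples
                                  - preset number of samples x denoising strength."""
    if not 0.0 <= denoising_strength <= 1.0:
        raise ValueError("the denoising strength parameter must lie in [0, 1]")
    return round(preset_num_samples - preset_num_samples * denoising_strength)
```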
In the above illustrated embodiment, the video generation model generates corresponding video content based on the feature information of the input image, and the generated video content and style are generally determined by the features and style of the input image. In addition, the present specification provides a manner of automatically generating video content based on a combination of feature information of an input image and a text description of the input image:
in one embodiment, a textual description associated with an input image may be obtained. For example, the content of the input image is "a beach with several people playing on it", and the text description may be "a beach with people chasing each other, sea waves beating the shore". The text tensor converted by the text description is then loaded into a video generation model, and the video generation model is controlled by the content of the text description to generate a set of potential image tensors conforming to the text description according to the initial image tensor. Since the text description includes the actions of sea wave beating the shore and people chasing each other, in the generated set of successive latent image tensors, characteristic information of the actions conforming to the above text description will appear. In conclusion, the personalized generation requirement of the user can be met on the basis of ensuring that controllable video content is automatically generated based on the input image.
In an embodiment, as shown in fig. 4, if there is a text description associated with the input image and the text tensor 40 obtained by converting the text description is loaded into the denoising module 30 of the video generation model, the set of first noise data distributions 31 may be updated according to the text tensor 40 to obtain a set of second noise data distributions 41 containing the feature information of the text description, and the set of input image tensors may then be updated with the set of second noise data distributions to obtain the set of intermediate image tensors.
Building on the embodiment of fig. 3, the embodiment of fig. 4 updates the set of first noise data distributions 31 with the feature information of the text description, so that the second noise data distributions 41 generated by the denoising module 30 include the semantic information of the text description. As a result, the set of continuously varying potential image tensors generated by the denoising module 30 can meet the requirements of the text description.
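As a sketch of how the text description might be converted into a text tensor 40, the snippet below uses a CLIP text encoder from the transformers library, the encoder commonly paired with SD 1.5; the specific model identifier and the use of the last hidden state are assumptions, not requirements of this specification.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

def encode_text_description(prompt: str) -> torch.Tensor:
    """Convert a text description into a text tensor for conditioning the
    denoising module (model id is an assumption)."""
    model_id = "openai/clip-vit-large-patch14"
    tokenizer = CLIPTokenizer.from_pretrained(model_id)
    encoder = CLIPTextModel.from_pretrained(model_id)
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    with torch.no_grad():
        # (1, sequence length, hidden size) text tensor.
        return encoder(**tokens).last_hidden_state
```

The resulting text tensor would then be passed to the denoising module (for example through cross-attention) so that the predicted noise data distributions carry the semantic information of the text description.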
Existing video generation models generally learn motion prior knowledge of different motion styles from large-scale video datasets by adding a time-aware structure, so that they can generate video with motion effects. However, to make the video generation model learn different motion styles, a large number of personalized motion videos must be collected and all parameters of the whole model must be fine-tuned, which is an enormous amount of engineering. In view of this, the present disclosure provides a motion submodule: by separately training a generalized motion submodule and loading the pre-trained motion submodule into the video generation model, personalized videos with different motion styles can be generated without fine-tuning the video generation model.
As shown in fig. 5, motion prior knowledge may be obtained by training a motion submodule 50 on a video dataset 51. The video dataset 51 may be a public dataset, such as WebVid-10M, or video data clipped from a video sharing platform. The present description places no limitation on the source of the video dataset.
The pre-trained motion submodule 50 is loaded into the denoising module 30 and can be used to assist the denoising module in generating motion information for controlling the motion style, because the motion submodule 50 can capture the time dependence between features at the same position of different noise image tensors 211 along the time axis. By injecting this motion information into each frame of image data, the motion information can be used to assist the denoising module 30 in generating a set of potential image tensors with motion changes.
In one embodiment, whereas the input data type of the denoising module 30 is a set of image tensors, the input data type of the motion submodule is a set of video tensors. A set of input image tensors can therefore be converted into a set of video tensors whose leading dimension is the video frame by separating the temporal and spatial dimensions, as shown in the sketch below. In the time dimension, diffusion generation is performed on the set of input images to obtain a set of coherent motion information, and this motion information is then injected into each frame of image data to assist the denoising module in generating a set of potential image tensors with motion changes.
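A minimal sketch of such a motion submodule is given below, assuming PyTorch: temporal self-attention over the frame axis at each spatial position, which is one common way to capture time dependence between frames. The class name, the residual injection and the attention design are illustrative assumptions, not the specification's definition of the motion submodule.

```python
import torch
import torch.nn as nn

class TemporalMotionModule(nn.Module):
    """Illustrative motion submodule: temporal self-attention over the frame
    axis at each spatial position of a set of video tensors."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (Batch Size, C, L, H, W). Separate the temporal and spatial
        # dimensions by folding spatial positions into the batch: (B*H*W, L, C).
        b, c, l, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, l, c)
        seq = self.norm(seq)
        out, _ = self.attn(seq, seq, seq)  # time dependence between frames
        out = out.reshape(b, h, w, l, c).permute(0, 4, 3, 1, 2)
        # Inject the motion information into each frame residually.
        return x + out
```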
In the above embodiment, by separately training a general motion submodule capable of controlling motion changes of images, the present disclosure makes it possible to extend the motion submodule to any video generation model, and the video content generated by the video generation model can achieve motion consistency without adjusting the parameters of the original video generation model.
Step 103, converting all generated potential image tensors into a format of video frames, and synthesizing the video frames into a video file.
In one embodiment, assume that the set of potential image tensors obtained from step 102 has dimensions (Batch Size, L, C, H, W). First, the dimensions other than (C, H, W) are merged to obtain a dimensionally transformed set of potential image tensors of shape (Batch Size × L, C, H, W). The dimensionally transformed set of potential image tensors is then decoded by a decoder, which remaps the potential image tensors back into image representations. Next, the dimensions of the resulting set of image representations are restored to the same (Batch Size, L, C, H, W) layout as the set of potential image tensors to obtain a set of video frames. Finally, the set of video frames is synthesized into a video file.
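The following sketch illustrates step 103, assuming PyTorch and imageio (with ffmpeg support for writing the video file); the decoder callable stands in for the model's decoder, and the output value range and frame-rate handling are assumptions.

```python
import imageio
import torch

def latents_to_video(latents: torch.Tensor, decoder, path: str, fps: int = 30) -> None:
    """Step 103 sketch: decode a set of potential image tensors into video
    frames and synthesize them into a video file.

    latents: (Batch Size, L, C, H, W); `decoder` stands in for the model's
    decoder (e.g. a VAE decoder mapping latents back to pixel space)."""
    b, l, c, h, w = latents.shape
    flat = latents.reshape(b * l, c, h, w)        # merge all dims except (C, H, W)
    with torch.no_grad():
        images = decoder(flat)                    # assumed output in [-1, 1]
    images = images.reshape(b, l, *images.shape[1:])

    frames = ((images[0].clamp(-1.0, 1.0) + 1.0) / 2.0 * 255.0).to(torch.uint8)
    frames = frames.permute(0, 2, 3, 1).cpu().numpy()   # (L, H', W', 3)
    imageio.mimwrite(path, list(frames), fps=fps)        # synthesize the video file
```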
Thus, a related description of a method of generating video based on images is completed.
Corresponding to the embodiments of the aforementioned method, the present specification also provides embodiments of the apparatus and the terminal to which it is applied.
As shown in fig. 6, fig. 6 is a schematic structural view of an electronic device according to an exemplary embodiment of the present specification. At the hardware level, the device includes a processor 602, an internal bus 604, a network interface 606, a memory 608 and a non-volatile storage 610, and may of course also include other hardware required by the service. One or more embodiments of the present description may be implemented in software, for example by the processor 602 reading the corresponding computer program from the non-volatile storage 610 into the memory 608 and then running it. Of course, in addition to a software implementation, one or more embodiments of the present disclosure do not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic modules, but may also be a hardware or logic device.
As shown in fig. 7, fig. 7 is a block diagram of an apparatus for generating video based on an image according to an exemplary embodiment of the present specification. The device can be applied to the electronic equipment shown in fig. 6 to realize the technical scheme of the specification. The device comprises:
a preprocessing unit 702 for preprocessing an input image into an initial image tensor to match the structure of an initial random tensor of a video generation model, wherein the video generation model generates a video based on the initial random tensor;
a generating unit 704 for loading the initial image tensor as the initial random tensor into the video generation model, so that the video generation model generates a set of continuously varying potential image tensors based on the initial image tensor;
a synthesizing unit 706, configured to convert all generated potential image tensors into a format of a video frame, and synthesize the video frame into a video file.
The generating unit 704 is specifically configured to obtain a text description associated with the input image and load the text tensor converted from the text description into the video generation model, so that, controlled by the text description, the video generation model generates from the initial image tensor a set of potential image tensors that conform to the text description.
The generating unit 704 is specifically configured to:
generating a group of random noise with the number of preset frames;
adding information of the initial image tensor to each random noise to obtain a group of noise image tensors with the number of preset frames;
the set of noisy image tensors is de-noised into a set of continuously varying potential image tensors.
The generating unit 704 is specifically configured to:
determining the actual sampling times n required by the denoising process;
performing n rounds of iterative processing on the set of noisy image tensors to generate a set of continuously varying potential image tensors, wherein the steps of each round of iterative processing include: determining a group of first noise data distributions corresponding to a group of input image tensors one by one through a denoising module of the video generation model, and updating the corresponding input image tensors in the group of input image tensors by utilizing each first noise data distribution respectively to obtain a group of intermediate image tensors; and the group of input image tensors in the first round of iterative processing is the group of noise image tensors, and the group of input image tensors in the non-first round of iterative processing is the group of intermediate image tensors output in the previous round of iterative processing.
The actual number of samples is determined by a denoising strength parameter, and the denoising strength parameter is user-defined.
The generating unit 704 is specifically configured to:
if there is a text description associated with the input image and a text tensor converted from the text description is loaded to a denoising module of the video generation model, updating the set of first noise data distributions according to the text tensor to obtain a set of second noise data distributions containing the text description characteristic information, and updating the set of input image tensors with the set of second noise data distributions to obtain the set of intermediate image tensors.
The denoising module comprises a pre-trained motion submodule which is independent of the denoising module, and the device further comprises a motion control unit (not shown in the figure) for:
loading the pre-trained motion sub-module to the denoising module to generate motion information for controlling a motion style;
utilizing the motion information to assist the denoising module in generating a set of motion-variant potential image tensors;
the pre-trained motion submodule is trained by video data to obtain motion priori knowledge.
The motion control unit is specifically configured to:
converting the set of noise image tensors into a set of video tensors by separating a time dimension and a space dimension;
in the time dimension, the motion information is added to the set of video tensors to assist the denoising module in generating a set of motion-variant potential image tensors.
The implementation process of the functions and roles of each module in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims (11)

1. A method of generating video based on an image, the method comprising:
preprocessing an input image into an initial image tensor to match the structure of an initial random tensor of a video generation model, wherein the video generation model generates a video based on the initial random tensor;
loading the initial image tensor into the video generation model as the initial random tensor such that the video generation model generates a set of continuously varying potential image tensors based on the initial image tensor;
all the generated potential image tensors are converted into the format of video frames, and the video frames are synthesized into a video file.
2. The method of claim 1, wherein the loading the initial image tensor into the video generation model as the initial random tensor such that the video generation model generates a set of continuously varying potential image tensors based on the initial image tensor comprises:
a text description associated with the input image is obtained and the text description converted into a text tensor is loaded into the video generation model, so that the video generation model is controlled to generate a set of potential image tensors conforming to the text description according to the initial image tensor based on the text description.
3. The method of claim 1, wherein the video generation model generates a set of continuously varying potential image tensors based on the initial image tensor, comprising:
generating a group of random noise with the number of preset frames;
adding information of the initial image tensor to each random noise to obtain a group of noise image tensors with the number of preset frames;
the set of noisy image tensors is de-noised into a set of continuously varying potential image tensors.
4. A method according to claim 3, wherein said denoising said set of noisy image tensors into a set of continuously varying potential image tensors comprises:
determining the actual sampling times n required by the denoising process;
performing n rounds of iterative processing on the set of noisy image tensors to generate a set of continuously varying potential image tensors, wherein the steps of each round of iterative processing include: determining a group of first noise data distributions corresponding to a group of input image tensors one by one through a denoising module of the video generation model, and updating the corresponding input image tensors in the group of input image tensors by utilizing each first noise data distribution respectively to obtain a group of intermediate image tensors; and the group of input image tensors in the first round of iterative processing is the group of noise image tensors, and the group of input image tensors in the non-first round of iterative processing is the group of intermediate image tensors output in the previous round of iterative processing.
5. The method of claim 4, wherein the actual number of samples is determined by a denoising intensity parameter, and wherein the denoising intensity parameter is user-defined.
6. The method of claim 4, wherein determining, by the denoising module of the video generation model, a set of first noise data distributions corresponding to a set of input image tensors, and updating the set of input image tensors with the set of first noise data distributions to obtain a set of intermediate image tensors comprises:
if there is a text description associated with the input image and a text tensor converted from the text description is loaded to a denoising module of the video generation model, updating the set of first noise data distributions according to the text tensor to obtain a set of second noise data distributions containing the text description characteristic information, and updating the set of input image tensors with the set of second noise data distributions to obtain the set of intermediate image tensors.
7. The method of claim 4, wherein the denoising module comprises a pre-trained motion submodule that is independent of the denoising module, the method further comprising:
loading the pre-trained motion sub-module to the denoising module to generate motion information for controlling a motion style;
utilizing the motion information to assist the denoising module in generating a set of motion-variant potential image tensors;
the pre-trained motion submodule is trained by video data to obtain motion priori knowledge.
8. The method of claim 7, wherein the utilizing the motion information to assist the denoising module in generating a set of motion-variant potential image tensors comprises:
converting the set of noise image tensors into a set of video tensors by separating a time dimension and a space dimension;
in the time dimension, the motion information is added to the set of video tensors to assist the denoising module in generating a set of motion-variant potential image tensors.
9. An apparatus for generating video based on an image, the apparatus comprising:
a preprocessing unit for preprocessing an input image into an initial image tensor to match a structure of an initial random tensor of a video generation model, wherein the video generation model generates a video based on the initial random tensor;
a generation unit configured to load the initial image tensor as the initial random tensor into the video generation model, so that the video generation model generates a set of continuously changing potential image tensors based on the initial image tensor;
and the synthesis unit is used for converting all the generated potential image tensors into the format of video frames and synthesizing the video frames into a video file.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 8 when the program is executed.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
CN202311766533.6A 2023-12-20 2023-12-20 Method and device for generating video based on image and electronic equipment Pending CN117750155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311766533.6A CN117750155A (en) 2023-12-20 2023-12-20 Method and device for generating video based on image and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311766533.6A CN117750155A (en) 2023-12-20 2023-12-20 Method and device for generating video based on image and electronic equipment

Publications (1)

Publication Number Publication Date
CN117750155A true CN117750155A (en) 2024-03-22

Family

ID=90277297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311766533.6A Pending CN117750155A (en) 2023-12-20 2023-12-20 Method and device for generating video based on image and electronic equipment

Country Status (1)

Country Link
CN (1) CN117750155A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination