CN117788656A - Video generation method, device and equipment - Google Patents


Info

Publication number
CN117788656A
Authority
CN
China
Prior art keywords
control information
lora
face
image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410217756.5A
Other languages
Chinese (zh)
Other versions
CN117788656B (en)
Inventor
张顺四
卢增
徐列
冯智毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Quwan Network Technology Co Ltd
Original Assignee
Guangzhou Quwan Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Quwan Network Technology Co Ltd filed Critical Guangzhou Quwan Network Technology Co Ltd
Priority to CN202410217756.5A priority Critical patent/CN117788656B/en
Publication of CN117788656A publication Critical patent/CN117788656A/en
Application granted granted Critical
Publication of CN117788656B publication Critical patent/CN117788656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The application discloses a video generation method, device and equipment. The method can acquire LoRA image weight parameters for generating a target image and fuse the LoRA image weight parameters into an animation diffusion model to obtain a target animation model; on this basis, a digital person with a specified image is generated in each frame of the video through the target animation model fused with the LoRA image weight parameters, so that the image stability of the video is ensured. A context window corresponding to each frame can be determined, where each context window comprises the corresponding current frame and a plurality of frames surrounding the current frame; the control information within the same context window is fused to obtain fusion control information; the fusion control information of each frame can then be input into the target animation model to obtain a continuous video driving the target image. Therefore, the application can ensure the continuity of the generated continuous video and improve the quality of the generated continuous video from the two aspects of the digital human image and the driving control information.

Description

Video generation method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a method, an apparatus, and a device for generating video.
Background
In the field of artificial intelligence, AI video generation is currently a hot technology. In the prior art, a diffusion model generates a new picture frame by frame based on the prompt words and the picture features of each frame of the original video, and the generated pictures are finally combined into an AI video.
However, because the pictures generated by the diffusion model are highly random, the consistency of the AI video is poor and the video quality is low.
Disclosure of Invention
In view of the foregoing, the present application provides a video generation method, apparatus and device to address the poor consistency of videos generated in the prior art.
In order to achieve the above object, the following solutions have been proposed:
a video generation method, comprising:
acquiring control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image;
fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
And inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Optionally, acquiring the LoRA image weight parameters for generating the target image includes:
acquiring a face LoRA weight parameter of a specified subject, wherein the face LoRA weight parameter is used for drawing the face of the specified subject;
acquiring a modeling LoRA weight parameter and a modeling face LoRA weight parameter of a specified modeling, wherein the modeling LoRA weight parameter is used for drawing a subject collocated with the specified modeling, and the modeling face LoRA weight parameter is used for drawing a face matched with the subject of the specified modeling;
calculating a weight parameter difference between the modeling LoRA weight parameter and the modeling face LoRA weight parameter;
and integrating the weight parameter difference with the face LoRA weight parameter to obtain the LoRA image weight parameter, wherein the target image is the specified subject collocated with the specified modeling.
Optionally, the acquiring the LoRA weight parameter of the face of the specified subject includes:
acquiring a plurality of subject images including a specified subject face;
extracting a face description text of each subject image and a subject face mask;
and training the LoRA level of the latent diffusion model by using the face description text of each subject image and the subject face mask, wherein the face LoRA weight parameter is the weight parameter of the trained LoRA level.
Optionally, the training the LoRA level of the latent diffusion model by using the face description text of each subject image and the subject face mask includes:
for each subject image, in the forward diffusion stage, inputting an input image added with real noise to the latent diffusion model, the input image being derived based on the subject image; predicting the random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the latent diffusion model, and denoising the prediction result of the previous stage with the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on the random noise of the region inside the subject face mask and the real noise of the region inside the subject face mask; and updating the weight parameters of the LoRA level of the latent diffusion model based on the loss value.
Optionally, calculating the loss value based on random noise of the subject face mask interior region and true noise of the subject face mask interior region includes:
calculating to obtain the loss value by using a preset loss value calculation function;
The loss value calculation function is as follows:
loss = (1/(h·w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} mask_ij · (pred_ij − gt_ij)²
wherein loss is the loss value; h is the height corresponding to the prediction result; w is the width corresponding to the prediction result; pred_ij is the random noise value at row i, column j; gt_ij is the true noise value at row i, column j; mask_ij is the feature value of the subject face mask at row i, column j.
Optionally, the fusing the control information corresponding to the context window to obtain corresponding fused control information includes:
based on normal distribution, controlling the weight corresponding to each frame in the context window, wherein the weight of the current frame of the context window is the maximum;
and calculating fusion control information of the current frame based on each piece of control information of the context window and the corresponding weight of the control information.
Optionally, the controlling the weight corresponding to each frame in the context window based on the normal distribution includes:
acquiring a preset weight calculation function constructed based on normal distribution, and calculating the weight of each frame in the context window based on the weight calculation function;
the weight calculation function is as follows:
f(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))
wherein μ is the sequence number of the current frame in the context window; σ is a preset standard deviation; x denotes the frame with sequence number x in the context window; f(x) is the weight of the x-th frame.
Optionally, the fusion control information of each frame includes torso fusion control information, hand fusion control information, and face fusion control information;
the step of inputting the fusion control information of each frame into the target animation model to obtain continuous video of the driving target image comprises the following steps:
and layering and injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model to obtain continuous video of the driving target image.
Optionally, the target animation model includes 8×8 Middle Block, 8×8 Decoder Block, 32×32 Decoder Block, 64×64 Decoder Block, and 16×16 Decoder Block;
layering and injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model, wherein the layering and injecting comprises the following steps:
injecting the trunk fusion control information into an 8×8 Middle Block and an 8×8 Decoder Block;
injecting the hand fusion control information and the face fusion control information into 32×32 Decoder blocks and 64×64 Decoder blocks;
the fusion control information is injected into a 16×16 Decoder Block.
Optionally, fusing the LoRA image weight parameter to a preset animation diffusion model to obtain a target animation model, including:
And fusing the LoRA image weight parameters to the LoRA level of the animation diffusion model, and obtaining the target animation model after fusing.
A video generating apparatus comprising:
the control information acquisition module is used for acquiring control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image;
the parameter fusion module is used for fusing the LoRA image weight parameter to a preset animation diffusion model to obtain a target animation model;
the window determining module is used for determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
the control information fusion module is used for fusing the control information corresponding to the context window for each frame to obtain corresponding fused control information;
and the continuous video generation module is used for inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
A video generating apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the video generating method.
According to the above technical solution, the video generation method provided by the application can first acquire the LoRA image weight parameters for generating the target image, and then fuse the LoRA image weight parameters with a preset animation diffusion model to obtain the target animation model; on this basis, a digital person with the specified image can be generated in each frame of the video through the target animation model fused with the LoRA image weight parameters, so that the image stability of the video is ensured. Furthermore, for the control information of each frame in the target continuous video, a context window corresponding to each frame is determined, wherein each context window comprises the corresponding current frame and a plurality of frames surrounding the current frame, and for each frame the control information of the corresponding context window is fused to obtain the corresponding fusion control information; in this way, when determining the fusion control information of each frame, the control information of a plurality of frames surrounding the current frame is referenced, which further ensures continuity between the pieces of fusion control information. Finally, the fusion control information of each frame can be input into the target animation model to obtain a continuous video driving the target image. Therefore, the application can ensure the continuity of the generated continuous video and improve the quality of the generated continuous video from the two aspects of the digital human image and the driving control information.
In addition, the application can ensure, through the LoRA image weight parameters, that the image appearing in the continuous video is the specified target image, and on this basis a continuous video driving the specified image can be generated.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from the provided drawings by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of a video generating method disclosed in an embodiment of the present application;
fig. 2 is a block diagram of a video generating apparatus according to an embodiment of the present application;
fig. 3 is a block diagram of a hardware structure of a video generating apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The subject application is operational with numerous general purpose or special purpose computing device environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor devices, distributed computing environments that include any of the above devices or systems, and the like.
The target image and the target continuous video are information authorized by the user, and privacy information is not involved.
The following describes the video generation method in detail with reference to fig. 1, including the following steps:
And S1, acquiring control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image.
Specifically, any one continuous video can be selected as the target continuous video based on the driving requirement of the target image.
The control information of each frame in the target continuous video can be extracted through the control network ControlNet.
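A purely illustrative sketch (not the patent's implementation) of collecting per-frame control information from the target continuous video; cv2 is only used to read frames, and `extract_control` is a hypothetical stand-in for a ControlNet-style annotator (e.g. torso/hand/face detectors):

```python
import cv2

def collect_control_info(video_path: str, extract_control):
    """Read the target continuous video frame by frame and compute control
    information (e.g. pose / hand / face maps) for each frame."""
    cap = cv2.VideoCapture(video_path)
    controls = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        controls.append(extract_control(frame))  # hypothetical annotator call
    cap.release()
    return controls
```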
LoRA image weight parameters for generating a specified subject collocated with a specified modeling may be obtained.
The specified subject may be an authorized user or an authorized avatar.
The specified modeling may include any one or more of a specified garment, a specified pose, a specified make-up, a specified hairstyle, and specified accessories.
And S2, fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model.
Specifically, the LoRA image weight parameter and the animation diffusion model can be fused, and each frame in the continuous video generated by the target animation model obtained after fusion is the target image.
The target animation model may be used to generate a video that drives the target avatar.
And step S3, determining a context window corresponding to each frame.
Specifically, each frame may be sequentially used as a current frame, and the current frame and a plurality of surrounding frames surrounding the current frame may be combined to form a context window of the current frame.
Based on this, each context window may include a corresponding current frame and a plurality of frames surrounding the current frame.
The number of frames per context window may not be uniform.
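A minimal sketch of building such context windows, assuming a symmetric window of `radius` frames on each side that is clipped at the sequence boundaries (the radius value is illustrative, not taken from the patent):

```python
def context_windows(num_frames: int, radius: int = 2):
    """For each frame t, build a window containing t and up to `radius`
    frames on each side; windows near the boundaries are naturally smaller."""
    windows = []
    for t in range(num_frames):
        start, end = max(0, t - radius), min(num_frames, t + radius + 1)
        windows.append({"current": t, "frames": list(range(start, end))})
    return windows
```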
And S4, fusing the control information corresponding to the context window for each frame to obtain corresponding fused control information.
Specifically, the control information of each frame in each context window can be fused by combining normal distribution, so as to obtain fused control information of each frame.
The fusion control information for each frame may be used to generate a picture for the corresponding frame.
The fused control information may include control information of a plurality of locations.
And S5, inputting the fusion control information of each frame into the target animation model to obtain a continuous video for driving the target image.
Specifically, based on the target animation model, the motion or expression of the target image can be driven by utilizing the fusion control information of each frame to obtain pictures, and the pictures are combined to form a continuous video for driving the target image.
According to the above technical solution, the video generation method provided by the application can first acquire the LoRA image weight parameters for generating the target image, and then fuse the LoRA image weight parameters with a preset animation diffusion model to obtain the target animation model; on this basis, a digital person with the specified image can be generated in each frame of the video through the target animation model fused with the LoRA image weight parameters, so that the image stability of the video is ensured. Furthermore, for the control information of each frame in the target continuous video, a context window corresponding to each frame is determined, wherein each context window comprises the corresponding current frame and a plurality of frames surrounding the current frame, and for each frame the control information of the corresponding context window is fused to obtain the corresponding fusion control information; in this way, when determining the fusion control information of each frame, the control information of a plurality of frames surrounding the current frame is referenced, which further ensures continuity between the pieces of fusion control information. Finally, the fusion control information of each frame can be input into the target animation model to obtain a continuous video driving the target image. Therefore, the application can ensure the continuity of the generated continuous video and improve the quality of the generated continuous video from the two aspects of the digital human image and the driving control information.
In addition, the application can ensure, through the LoRA image weight parameters, that the image appearing in the continuous video is the specified target image, and on this basis a continuous video driving the specified image can be generated.
In some embodiments of the present application, the procedure for acquiring the LoRA image weight parameters for generating the target image in step S1 is described in detail as follows:
s10, acquiring face LoRA weight parameters of the appointed subject.
Specifically, the face LoRA weight parameter is used to draw the face of a specified subject.
The face LoRA weight parameter may be a LoRA weight parameter of a first latent diffusion model, which may be trained from a plurality of specified subject face images.
The LoRA weight parameter of the first latent diffusion model can transform the specified subject face image into a low-dimensional latent space.
S11, acquiring the modeling LoRA weight parameter and the modeling face LoRA weight parameter of the specified modeling.
Specifically, the modeling LoRA weight parameter may be used to draw a subject collocated with the specified modeling, and the modeling face LoRA weight parameter may be used to draw a face matched with the subject of the specified modeling.
The subject collocated with the specified modeling can be any authorized subject other than the specified subject; for example, it can be an authorized user other than the specified subject, or an avatar.
The modeling LoRA weight parameter can be a LoRA weight parameter of a second latent diffusion model, wherein the second latent diffusion model can be trained from a plurality of specified modeling images.
The modeling face LoRA weight parameter may be a LoRA weight parameter of a third latent diffusion model, wherein the third latent diffusion model may be trained from a plurality of specified modeling face images.
Each specified modeling face image is the face region corresponding to a specified modeling image.
Each specified subject face image, each specified modeling image, and each specified modeling face image is authorized.
S12, calculating a weight parameter difference between the modeling LoRA weight parameter and the modeling face LoRA weight parameter.
Specifically, the difference between the modeling LoRA weight parameter and the modeling face LoRA weight parameter is calculated as the weight parameter difference.
And S13, integrating the weight parameter difference with the face LoRA weight parameter of the specified subject to obtain the LoRA image weight parameter, wherein the target image is the specified subject collocated with the specified modeling.
Specifically, because the modeling LoRA weight parameter can generate a subject matched with the specified modeling, and the modeling face LoRA weight parameter can generate a face matched with the subject of the specified modeling, the modeling LoRA weight parameter contains the features corresponding to the face of the specified modeling subject;
thus, the weight parameter difference does not include the facial region features of the subject collocated with the specified modeling.
After the weight parameter difference is superimposed on the face LoRA weight parameter of the specified subject, the LoRA image weight parameter can be obtained.
The LoRA image weight parameter can be used to draw the specified subject collocated with the specified modeling.
The LoRA image weight parameter may be calculated by the following weight calculation function:
W_image = W_face + (W_modeling − W_modeling_face)
wherein W_image is the LoRA image weight parameter; W_face is the face LoRA weight parameter of the specified subject; W_modeling is the modeling LoRA weight parameter; W_modeling_face is the face LoRA weight parameter of the specified modeling.
From the above technical solution, it can be seen that this embodiment provides an optional manner of obtaining the LoRA image weight parameter. In this manner, the application can decouple multiple LoRA weight parameters by constructing the face LoRA weight parameter, the modeling LoRA weight parameter and the modeling face LoRA weight parameter, and combining them to obtain the LoRA image weight parameter, thereby avoiding mutual influence and mutual pollution between the LoRA weight parameters.
In some embodiments of the present application, a process of acquiring the LoRA weight parameter of the face of the specified subject in step S10 is described in detail, and the steps are as follows:
S100, acquiring a plurality of subject images containing the face of the specified subject.
Specifically, a plurality of subject images each including a face region of a specified subject may be acquired.
S101, extracting face description text of each subject image and a subject face mask.
Specifically, a face region may be cropped from each subject image using a face detection algorithm to obtain a specified subject face image.
An image-to-text (captioning) algorithm may be employed to extract a face description text for each specified subject face image; a face parsing algorithm may be employed to extract a subject face mask for each specified subject face image.
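A rough sketch of this preprocessing under stated assumptions: the face is cropped with an OpenCV Haar-cascade detector, while `caption_model` and `face_parser` are hypothetical stand-ins for the image-to-text and face-parsing models:

```python
import cv2

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_subject_image(image, caption_model, face_parser):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    x, y, w, h = detector.detectMultiScale(gray)[0]   # take the first detected face
    face_img = image[y:y + h, x:x + w]
    description = caption_model(face_img)             # face description text
    face_mask = face_parser(face_img)                 # subject face mask (H x W, 0/1)
    return face_img, description, face_mask
```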
S102, training the LoRA level of the latent diffusion model by using the face description text and the subject face mask of each subject image, wherein the face LoRA weight parameter is the weight parameter of the trained LoRA level.
Specifically, the latent diffusion model may be trained using each specified subject face image together with its corresponding face description text and subject face mask, to obtain the first latent diffusion model.
The weight parameter of the LoRA level in the first latent diffusion model is taken as the face LoRA weight parameter.
As can be seen from the above technical solution, this embodiment provides an optional way to obtain the face LoRA weight parameter, by which multiple subject images may be used to train the latent diffusion model and the face LoRA weight parameter may be extracted from the trained latent diffusion model. Therefore, this embodiment can acquire, in a targeted manner, the face LoRA weight parameter for constructing the face of the specified subject, avoid the influence of other regions of the specified subject on the face region, and thereby improve both the training efficiency and the reliability of the face LoRA weight parameter.
Similarly, the modeling LoRA weight parameter and the modeling face LoRA weight parameter can be obtained in the manner described above. Specifically, in step S11, the procedure for obtaining the modeling LoRA weight parameter and the modeling face LoRA weight parameter of the specified modeling may be as follows:
a plurality of designated styling images may be acquired, each designated styling image including a subject collocated with the designated styling.
The face area of each modeling image can be intercepted, and a specified modeling face image is obtained; extracting a face description text and a face mask of each specified modeling face image;
extracting descriptive text of each modeling image;
Training the LoRA level of the potential diffusion model by using each appointed modeling image and the corresponding description text thereof to obtain a second potential diffusion model, wherein modeling LoRA weight parameters are weight parameters of the LoRA level in the second potential diffusion model;
training the LoRA level of the potential diffusion model by using each appointed modeling face image and the corresponding face description text and face mask to obtain a third potential diffusion model, wherein the face LoRA weight parameter is a weight parameter of the LoRA level in the third potential diffusion model.
In some embodiments of the present application, step S102, training the LoRA level of the latent diffusion model by using the face description text and the subject face mask of each subject image, where the face LoRA weight parameter is the weight parameter of the trained LoRA level, is described in detail as follows:
S1020, for each subject image, in the forward diffusion stage, inputting an input image added with real noise to the latent diffusion model, the input image being derived based on the subject image; predicting the random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the latent diffusion model, and denoising the prediction result of the previous stage with the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on the random noise of the region inside the subject face mask and the real noise of the region inside the subject face mask; and updating the weight parameters of the LoRA level of the latent diffusion model based on the loss value.
Specifically, the latent diffusion model involves two processes: a forward diffusion stage and a back propagation stage.
The input image is a specified subject face image to which real noise is added.
The specified subject face image may comprise 336-448 pixels.
In the forward diffusion stage, the token corresponding to each vocabulary tag in each face description text can be obtained through a tokenizer; each token is encoded by the encoder in FrozenCLIPEmbedder, converting each vocabulary tag into a semantic vector, so as to obtain a semantic matrix related to each face description text; the semantic matrix, the embedding of the current stage's timestep, and the prediction result (latent) of the previous stage are input into a UNet with cross-attention, and the random noise of the current stage is obtained through prediction.
The random noise predicted by each stage is an influencing factor of the prediction result of the corresponding stage, namely, the prediction result of each stage is formed by the random noise predicted by the corresponding stage.
In the back propagation stage, the loss value of the region outside the subject face mask is set to 0, so that the gradient of the region outside the subject face mask is prevented from being back-propagated and only the gradient of the region inside the subject face mask is back-propagated, which ensures that only the features of the region inside the subject face mask guide the weight update during back propagation.
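A hedged PyTorch sketch of one training step with the masked loss described above; `unet` is assumed to be a latent diffusion UNet that directly returns the predicted noise, and `optimizer` is assumed to contain only the LoRA-level parameters (all names are illustrative, not the patent's API):

```python
import torch

def masked_noise_loss(pred_noise, true_noise, face_mask):
    """Masked squared error: regions outside the subject face mask contribute zero loss."""
    h, w = pred_noise.shape[-2:]
    return torch.sum((pred_noise - true_noise) ** 2 * face_mask) / (h * w)

def train_step(unet, optimizer, noisy_latent, true_noise, timestep, text_emb, face_mask):
    pred_noise = unet(noisy_latent, timestep, text_emb)   # predict the added noise
    loss = masked_noise_loss(pred_noise, true_noise, face_mask)
    optimizer.zero_grad()
    loss.backward()      # gradients flow only into the LoRA-level parameters
    optimizer.step()
    return loss.item()
```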
According to the above technical solution, this embodiment provides an optional mode for training the LoRA level of the latent diffusion model. In this mode, only the region features inside the mask are considered during training, which avoids the influence of region features outside the mask on the weight parameters and further improves the reliability of the application.
In some embodiments of the present application, the process of calculating the loss value based on the random noise of the internal region of the mask of the face of the subject and the real noise of the internal region of the mask of the face of the subject in step S1020 is described in detail, and the steps are as follows:
s10200, calculating the loss value by using a preset loss value calculation function.
Specifically, the real noise and the random noise may be substituted into the loss value calculation function, and the first loss value may be calculated.
The loss value calculation function may be as follows:
loss = (1/(h·w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} mask_ij · (pred_ij − gt_ij)²
wherein loss is the loss value; h is the height corresponding to the prediction result; w is the width corresponding to the prediction result; pred_ij is the random noise value at row i, column j; gt_ij is the true noise value at row i, column j; mask_ij is the feature value of the subject face mask at row i, column j.
According to the technical scheme, the embodiment provides an optional mode for calculating the loss value, and the characteristic value of the outer area of the mask can be better removed through the mode, so that the reliability of the method is further improved.
Similarly, when training the second latent diffusion model, a specified modeling image added with second real noise can be input to the latent diffusion model in the forward diffusion stage; the second random noise added in the current stage is predicted based on the description text of the specified modeling image and the prediction result of the previous stage by using the latent diffusion model; in the back propagation stage, the second random noise and the second real noise are substituted into a modeling loss value calculation function to calculate a second loss value; and the weight parameters of the LoRA level of the latent diffusion model are updated based on the second loss value, wherein the latent diffusion model trained with each specified modeling image is the second latent diffusion model.
The modeling loss value calculation function may be as follows:
loss2 = (1/(h2·w2)) · Σ_{i=1}^{h2} Σ_{j=1}^{w2} (pred2_ij − gt2_ij)²
wherein loss2 is the second loss value; h2 is the height of the specified modeling image; w2 is the width of the specified modeling image; pred2_ij is the second random noise value at row i, column j; gt2_ij is the second true noise value at row i, column j.
When training the third latent diffusion model, a specified modeling face image added with real noise can be input to the latent diffusion model in the forward diffusion stage; the random noise added in the current stage is predicted based on the face description text of the specified modeling face image and the prediction result of the previous stage by using the latent diffusion model; in the back propagation stage, a third loss value is calculated based on the random noise of the region inside the face mask and the real noise of the region inside the face mask; and the weight parameters of the LoRA level of the latent diffusion model are updated based on the third loss value to obtain the third latent diffusion model.
In calculating the third loss value, the loss value calculation function may be used for calculation.
In some embodiments of the present application, the process of fusing the LoRA image weight parameter to a preset animation diffusion model to obtain the target animation model in step S2 is described in detail, and the steps are as follows:
s20, fusing the LoRA image weight parameters to the LoRA level of the animation diffusion model, and obtaining the target animation model after fusion.
Specifically, the animation diffusion model may include a LoRA level, and the LoRA image weight parameter and the LoRA level of the animation diffusion model are fused to obtain the target animation model after fusion.
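A minimal sketch of one possible fusion, assuming each LoRA level is folded into the corresponding base weight as W' = W + scale · (up · down); the dictionary layout and key alignment are assumptions rather than the patent's implementation:

```python
import torch

@torch.no_grad()
def fuse_lora_into_model(base_weights: dict, lora_down: dict, lora_up: dict, scale: float = 1.0) -> dict:
    fused = dict(base_weights)
    for name in lora_down:
        delta = lora_up[name] @ lora_down[name]           # low-rank update B @ A
        fused[name] = base_weights[name] + scale * delta  # fold into the base weight
    return fused
```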
According to the technical scheme, the LoRA image weight parameters of the target image and the animation diffusion model can be fused, and the target image is drawn by using the fused target animation model.
In some embodiments of the present application, the process of fusing each control information corresponding to the context window for each frame in step S4 to obtain corresponding fused control information is described in detail, and the steps are as follows:
and S40, controlling the weight corresponding to each frame in the context window based on normal distribution, wherein the weight of the current frame of the context window is the maximum.
Specifically, the weight of the current frame may be set to be maximum based on the current frame in combination with the normal distribution, and the weight of each frame may be set based on the distance between each frame and the current frame, where the weight corresponding to the frame far away is smaller than the weight corresponding to the frame near.
S41, calculating the fusion control information of the current frame based on each piece of control information of the context window and its corresponding weight.
Specifically, the fusion control information may be calculated using the following fusion control information calculation function:
c_t = ( Σ_{x=1}^{L} f(x) · c_x ) / ( Σ_{x=1}^{L} f(x) )
wherein c_t is the fusion control information of the current frame; f(x) is the weight of the x-th frame; c_x is the control information of the x-th frame; L is the total number of frames in the context window.
Each piece of control information may include torso control information, hand control information, and face control information.
Torso fusion control information may be calculated based on torso control information for each frame and its corresponding weights;
the hand fusion control information can be calculated based on the hand control information of each frame and the weight corresponding to the hand fusion control information;
the face fusion control information may be calculated based on the face control information of each frame and its corresponding weight.
The torso fusion control information, the hand fusion control information, and the face fusion control information may be calculated using the fusion control information calculation function, respectively.
The hand region mask and the face region mask of each frame may be extracted, and torso control information, hand control information, and face control information of each frame may be obtained based on each frame and the hand region mask and the face region mask thereof.
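A small sketch of the normal-distribution weighting and per-frame fusion described above, applied per region (torso, hand, face); the Gaussian density form, the normalization by the weight sum, and the default sigma are assumptions consistent with the text rather than the patent's exact formulas:

```python
import math

def frame_weight(x: int, mu: int, sigma: float) -> float:
    """Weight of frame x when the current frame has sequence number mu in the window."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fuse_control(controls, mu: int, sigma: float = 1.5):
    """controls: list of per-frame control maps (e.g. numpy arrays) in one context window."""
    weights = [frame_weight(x, mu, sigma) for x in range(len(controls))]
    fused = sum(w * c for w, c in zip(weights, controls)) / sum(weights)
    return fused
```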
As can be seen from the above technical solution, the present embodiment provides an optional manner of fusing each control information, by which each control information can be fused in combination with normal distribution, and under the condition that the control information of the current frame is ensured to be dominant, the control information of the current frame is updated by referring to the control information of a plurality of frames surrounding the current frame, so that the continuity of each fused control information is improved, and meanwhile, the fusion disorder of long-distance control information is effectively avoided.
In some embodiments of the present application, the step S40 of controlling the weight corresponding to each frame in the context window based on normal distribution, and the process of maximizing the weight of the current frame in the context window is described in detail, and the steps are as follows:
s400, acquiring a preset weight calculation function constructed based on normal distribution, and calculating the weight of each frame in the context window based on the weight calculation function.
Specifically, a weight calculation function constructed in advance may be acquired, and the weight of each frame may be calculated using the weight calculation function.
Wherein the weight calculation function may be constructed based on a normal distribution.
The weight calculation function is as follows:
f(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))
wherein μ is the sequence number of the current frame in the context window; σ is a preset standard deviation; x denotes the frame with sequence number x in the context window; f(x) is the weight of the x-th frame.
As can be seen from the above technical solutions, this embodiment provides an optional way to calculate weights of frames, by which weights of each frame can be calculated step by step, so as to better perform control information fusion.
In some embodiments of the present application, a process of inputting the fusion control information of each frame to the target animation model in step S5 to obtain a continuous video of the driving target image is described in detail, and the steps are as follows:
s50, layering and injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model to obtain continuous video of the driving target image.
Specifically, the torso fusion control information, the hand fusion control information and the face fusion control information of each frame can be hierarchically injected into the target animation model to obtain continuous videos formed by the expression and/or the action of the driving target image.
The torso fusion control information can be injected into blocks of the target animation model with a small feature-map scale (and thus a large receptive field), and the hand fusion control information and the face fusion control information can be injected into blocks with a large feature-map scale (and thus a small receptive field).
From the above technical solution, it can be seen that this embodiment provides an optional way of injecting the fusion control information. In this way, the spatial features of the torso can be extracted with a large convolution receptive field, and the spatial features of the hands and the face can be extracted with a small convolution receptive field, which is favorable for fully extracting the spatial features of the continuous video and improving the continuity of the continuous video generated by the application.
In some embodiments of the present application, the target animation model may include 8×8 Middle Block, 8×8 Decoder Block, 32×32 Decoder Block, 64×64 Decoder Block, and 16×16 Decoder Block, and the following steps are described in detail in the process of hierarchically injecting the torso fusion control information, the hand fusion control information, and the face fusion control information into the target animation model to obtain the continuous video of the driving target image, where the steps are as follows:
s500, injecting the trunk fusion control information into an 8×8 Middle Block and an 8×8 Decoder Block.
Specifically, the torso fusion control information for each frame may be input to an 8×8 Middle Block and an 8×8 Decoder Block.
S501, injecting the hand fusion control information and the face fusion control information into 32×32 Decoder blocks and 64×64 Decoder blocks.
Specifically, the hand fusion control information and the face fusion control information for each frame may be input to a 32×32 Decoder Block and a 64×64 Decoder Block.
S502, injecting the fusion control information into a 16×16 Decoder Block.
Specifically, torso fusion control information, hand fusion control information, and face fusion control information for each frame may be input to a 16×16 Decoder Block.
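A rough sketch of this layered routing; `model.inject` is a hypothetical helper standing in for whatever mechanism (e.g. ControlNet-style residuals) adds a control feature map to the named block:

```python
def inject_layered_controls(model, torso_ctrl, hand_ctrl, face_ctrl):
    model.inject("middle_block_8x8", torso_ctrl)
    model.inject("decoder_block_8x8", torso_ctrl)
    for block in ("decoder_block_32x32", "decoder_block_64x64"):
        model.inject(block, hand_ctrl)
        model.inject(block, face_ctrl)
    for ctrl in (torso_ctrl, hand_ctrl, face_ctrl):   # all fused controls go to the 16x16 block
        model.inject("decoder_block_16x16", ctrl)
```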
From the above technical solution, it can be seen that this embodiment provides an optional manner of injecting fusion control information in layers, and by using the above manner, spatial features of each information can be extracted by using different convolution layers, so as to ensure reliability of the present application.
Next, a detailed description will be given of the video generating apparatus provided in the present application with reference to fig. 2, and the video generating apparatus provided hereinafter may be cross-referenced with the video generating method provided above.
As can be seen with reference to fig. 2, the video generating apparatus may include:
A control information acquisition module 10, configured to acquire control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image;
the parameter fusion module 20 is configured to fuse the LoRA image weight parameter to a preset animation diffusion model, so as to obtain a target animation model;
a window determining module 30, configured to determine a context window corresponding to each frame, where each context window includes a corresponding current frame and a plurality of frames surrounding the current frame;
the control information fusion module 40 is configured to fuse, for each frame, each control information corresponding to the context window to obtain corresponding fused control information;
and the continuous video generation module 50 is used for inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Further, the control information acquisition module may include:
a face weight parameter acquisition unit configured to acquire a face LoRA weight parameter of a specified subject, the face LoRA weight parameter being used to draw the face of the specified subject;
a modeling parameter acquisition unit configured to acquire a modeling LoRA weight parameter and a modeling face LoRA weight parameter of a specified modeling, wherein the modeling LoRA weight parameter is used for drawing a subject collocated with the specified modeling, and the modeling face LoRA weight parameter is used for drawing a face matched with the subject of the specified modeling;
a parameter difference calculating unit, configured to calculate a weight parameter difference between the modeling LoRA weight parameter and the modeling face LoRA weight parameter;
and a parameter integration unit, configured to integrate the weight parameter difference with the face LoRA weight parameter to obtain the LoRA image weight parameter, the target image being the specified subject collocated with the specified modeling.
Further, the face weight parameter acquisition unit may include:
a subject image acquisition subunit operable to acquire a plurality of subject images including a specified subject face;
a description text extraction subunit for extracting a face description text of each subject image and a subject face mask;
and the model training subunit is used for training the LoRA level of the latent diffusion model by using the face description text and the subject face mask of each subject image, and the face LoRA weight parameter is the weight parameter of the trained LoRA level.
Further, the model training subunit may include:
a parameter updating component for, for each subject image, in the forward diffusion stage, inputting an input image added with real noise to the latent diffusion model, the input image being derived based on the subject image; predicting the random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the latent diffusion model, and denoising the prediction result of the previous stage with the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on the random noise of the region inside the subject face mask and the real noise of the region inside the subject face mask; and updating the weight parameters of the LoRA level of the latent diffusion model based on the loss value.
Further, the parameter updating component may include:
the loss value calculation sub-component is used for calculating the loss value by utilizing a preset loss value calculation function;
the loss value calculation function is as follows:
loss = (1/(h·w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} mask_ij · (pred_ij − gt_ij)²
wherein loss is the loss value; h is the height corresponding to the prediction result; w is the width corresponding to the prediction result; pred_ij is the random noise value at row i, column j; gt_ij is the true noise value at row i, column j; mask_ij is the feature value of the subject face mask at row i, column j.
Further, the control information fusion module may include:
the weight calculation unit is used for controlling the weight corresponding to each frame in the context window based on normal distribution, and the weight of the current frame of the context window is the maximum;
and the weight utilization unit is used for calculating fusion control information of the current frame based on each piece of control information of the context window and the corresponding weight.
Further, the weight calculation unit may include:
the first weight calculation subunit is used for acquiring a preset weight calculation function constructed based on normal distribution and calculating the weight of each frame in the context window based on the weight calculation function;
the weight calculation function is as follows:
f(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))
wherein μ is the sequence number of the current frame in the context window; σ is a preset standard deviation; x denotes the frame with sequence number x in the context window; f(x) is the weight of the x-th frame.
Further, the continuous video generation module may include:
and the information injection unit is used for injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model in a layered manner to obtain continuous videos for driving the target image.
Further, the information injection unit may include:
a first information injection subunit, configured to inject the torso fusion control information into an 8×8 Middle Block and an 8×8 Decoder Block;
a second information injection subunit, configured to inject the hand fusion control information and the face fusion control information into 32×32 Decoder blocks and 64×64 Decoder blocks;
and a third information injection subunit, configured to inject the fusion control information into a 16×16 Decoder Block.
Further, the parameter fusion module may include:
and the target animation model acquisition unit is used for fusing the LoRA image weight parameters to the LoRA level of the animation diffusion model, and obtaining the target animation model after fusion.
The video generating device provided by the embodiment of the application can be applied to video generating equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Alternatively, fig. 3 shows a block diagram of a hardware structure of the video generating apparatus, and referring to fig. 3, the hardware structure of the video generating apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
processor 1 may be a central processing unit CPU, or a specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory) or the like, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
Acquiring control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image;
fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
and inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to:
acquiring control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image;
fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
Fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
and inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Various embodiments of the present application may be combined with one another. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A video generation method, comprising:
acquiring control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image;
fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
Fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
and inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
2. The video generation method of claim 1, wherein acquiring the LoRA image weight parameters for generating the target image comprises:
acquiring a face LoRA weight parameter of a specified subject, wherein the face LoRA weight parameter is used for drawing the face of the specified subject;
acquiring a modeling LoRA weight parameter and a modeling face LoRA weight parameter of a specified modeling, wherein the modeling LoRA weight parameter is used for drawing a subject collocated with the specified modeling, and the modeling face LoRA weight parameter is used for drawing a face matched with the subject of the specified modeling;
calculating a weight parameter difference between the modeling LoRA weight parameter and the modeling face LoRA weight parameter;
and integrating the weight parameter difference with the face LoRA weight parameter to obtain the LoRA image weight parameter, wherein the target image is the specified subject collocated with the specified modeling.
3. The video generation method according to claim 2, wherein the acquiring the face LoRA weight parameter of the specified subject includes:
Acquiring a plurality of subject images including a specified subject face;
extracting a face description text of each subject image and a subject face mask;
and training the LoRA level of the latent diffusion model by using the face description text of each subject image and the subject face mask, wherein the face LoRA weight parameter is the weight parameter of the trained LoRA level.
4. A video generation method according to claim 3, wherein training the LoRA level of the latent diffusion model using the face description text of each subject image and the subject face mask comprises:
for each subject image, in the forward diffusion stage, inputting an input image added with real noise to the latent diffusion model, the input image being derived based on the subject image; predicting the random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the latent diffusion model, and denoising the prediction result of the previous stage with the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on the random noise of the region inside the subject face mask and the real noise of the region inside the subject face mask; and updating the weight parameters of the LoRA level of the latent diffusion model based on the loss value.
5. The video generation method according to claim 4, wherein calculating the loss value based on the predicted random noise within the region inside the subject face mask and the real noise within that region comprises:
calculating the loss value using a preset loss value calculation function;
the loss value calculation function is as follows:
loss = \frac{1}{h \cdot w} \sum_{i=1}^{h} \sum_{j=1}^{w} mask_{ij} \cdot \left( pred_{ij} - gt_{ij} \right)^{2}

wherein loss is the loss value; h is the height of the prediction result; w is the width of the prediction result; pred_ij is the predicted random noise value at row i, column j; gt_ij is the real noise value at row i, column j; and mask_ij is the feature value at row i, column j of the subject face mask.
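A minimal PyTorch sketch of the masked loss of claim 5, assuming the squared-difference form and the 1/(h·w) normalisation reconstructed above:

```python
import torch

def masked_noise_loss(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """pred, gt, mask are (h, w) tensors; mask is 1 inside the subject face region, 0 elsewhere.
    Only the region inside the subject face mask contributes to the loss."""
    h, w = pred.shape
    return (mask * (pred - gt) ** 2).sum() / (h * w)

# Typical use during the back-propagation stage of claim 4 (names are illustrative):
# loss = masked_noise_loss(predicted_noise, real_noise, subject_face_mask)
# loss.backward()  # per claim 4, only the LoRA layer weights are then updated
```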
6. The video generation method according to claim 1, wherein fusing, for each frame, each control information of the corresponding context window to obtain the corresponding fused control information comprises:
determining, based on a normal distribution, the weight corresponding to each frame in the context window, wherein the weight of the current frame of the context window is the maximum;
and calculating the fusion control information of the current frame based on each piece of control information in the context window and its corresponding weight.
7. The video generation method according to claim 6, wherein determining, based on a normal distribution, the weight corresponding to each frame in the context window comprises:
acquiring a preset weight calculation function constructed based on a normal distribution, and calculating the weight of each frame in the context window using the weight calculation function;
the weight calculation function is as follows:
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)

wherein \mu is the sequence number of the current frame in the context window; \sigma is a preset standard deviation; x is the sequence number of a frame in the context window; and f(x) is the weight of the x-th frame.
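The weight computation of claim 7 and the weighted fusion of claim 6 can be sketched together as below; normalising by the sum of the weights is an added assumption, since claim 6 only states that each control information is combined with its weight.

```python
import math

def window_weights(num_frames: int, current_idx: int, sigma: float) -> list:
    """Normal-distribution weights over the context window; the current frame gets the peak."""
    return [math.exp(-((x - current_idx) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
            for x in range(num_frames)]

def fuse_window(controls, current_idx: int, sigma: float = 1.5):
    """Weighted fusion of the control information in one context window.
    `controls` is a list of equally shaped arrays/tensors (e.g. pose maps)."""
    weights = window_weights(len(controls), current_idx, sigma)
    total = sum(weights)
    # Weighted average: each frame's control information scaled by its normalised weight.
    return sum(w * c for w, c in zip(weights, controls)) / total
```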
8. The video generation method according to claim 1, wherein the fusion control information of each frame includes torso fusion control information, hand fusion control information, and face fusion control information;
the step of inputting the fusion control information of each frame into the target animation model to obtain the continuous video driving the target image comprises:
injecting, in a layered manner, the torso fusion control information, the hand fusion control information and the face fusion control information into the target animation model to obtain the continuous video driving the target image.
9. The video generation method according to claim 8, wherein the target animation model includes an 8×8 Middle Block, an 8×8 Decoder Block, a 32×32 Decoder Block, a 64×64 Decoder Block, and a 16×16 Decoder Block;
injecting, in a layered manner, the torso fusion control information, the hand fusion control information and the face fusion control information into the target animation model comprises:
injecting the torso fusion control information into the 8×8 Middle Block and the 8×8 Decoder Block;
injecting the hand fusion control information and the face fusion control information into the 32×32 Decoder Block and the 64×64 Decoder Block;
and injecting the fusion control information into the 16×16 Decoder Block.
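For claim 9, the routing of the fused control signals to the network blocks can be captured as a small table. The block names, and the reading of the last step ("the fusion control information" taken to mean all three signals), are assumptions for illustration and are not fixed by the patent text.

```python
# Hypothetical routing table for the layered injection of claim 9.
INJECTION_MAP = {
    "middle_block_8x8":    ("torso",),
    "decoder_block_8x8":   ("torso",),
    "decoder_block_16x16": ("torso", "hand", "face"),
    "decoder_block_32x32": ("hand", "face"),
    "decoder_block_64x64": ("hand", "face"),
}

def controls_for_block(block_name: str, fused: dict) -> list:
    """Return the fused control tensors to inject into the named block.
    `fused` maps "torso" / "hand" / "face" to their fused control information."""
    return [fused[name] for name in INJECTION_MAP.get(block_name, ())]
```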
10. The video generation method according to claim 1, wherein fusing the LoRA image weight parameters into the preset animation diffusion model to obtain the target animation model comprises:
fusing the LoRA image weight parameters into the LoRA layers of the animation diffusion model to obtain the target animation model.
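One common way such a fusion could be carried out is to merge each low-rank update directly into the corresponding base weight; the key naming convention ("<module>.lora_up" / "<module>.lora_down") below is an assumption for illustration, not taken from the patent.

```python
import torch

@torch.no_grad()
def fuse_lora_into_model(model: torch.nn.Module, lora_weights: dict, alpha: float = 1.0):
    """Merge LoRA weights into the model in place: W <- W + alpha * (B @ A)."""
    modules = dict(model.named_modules())
    for key, up in lora_weights.items():          # `up` plays the role of B (out_features x rank)
        if not key.endswith(".lora_up"):
            continue
        base_name = key[: -len(".lora_up")]
        down = lora_weights[base_name + ".lora_down"]   # A (rank x in_features)
        modules[base_name].weight += alpha * (up @ down)
    return model
```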
11. A video generating apparatus, comprising:
the control information acquisition module is used for acquiring control information of each frame in a target continuous video and LoRA image weight parameters for generating a target image;
the parameter fusion module is used for fusing the LoRA image weight parameter to a preset animation diffusion model to obtain a target animation model;
The window determining module is used for determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
the control information fusion module is used for fusing the control information corresponding to the context window for each frame to obtain corresponding fused control information;
and the continuous video generation module is used for inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
12. A video generating apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor being configured to execute the program to implement the steps of the video generation method according to any one of claims 1 to 10.
CN202410217756.5A 2024-02-28 2024-02-28 Video generation method, device and equipment Active CN117788656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410217756.5A CN117788656B (en) 2024-02-28 2024-02-28 Video generation method, device and equipment


Publications (2)

Publication Number Publication Date
CN117788656A true CN117788656A (en) 2024-03-29
CN117788656B CN117788656B (en) 2024-04-26

Family

ID=90385721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410217756.5A Active CN117788656B (en) 2024-02-28 2024-02-28 Video generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN117788656B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241849A (en) * 2017-08-28 2018-07-03 北方工业大学 Human body interactive action recognition methods based on video
US20210375021A1 (en) * 2020-05-29 2021-12-02 Adobe Inc. Neural State Machine Digital Character Animation
CN116934924A (en) * 2023-08-01 2023-10-24 广州虎牙科技有限公司 Cartoon image generation method and device and computer equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU Yuanyuan et al.: "Key Frame Extraction Method Based on Moving Object Saliency in Surveillance Video", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), vol. 36, no. 01, 29 February 2016 (2016-02-29), pages 35-41 *

Also Published As

Publication number Publication date
CN117788656B (en) 2024-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant