CN117788656B - Video generation method, device and equipment - Google Patents

Video generation method, device and equipment

Publication number: CN117788656B
Authority: CN (China)
Prior art keywords: lora, control information, face, image, frame
Legal status: Active
Application number: CN202410217756.5A
Other languages: Chinese (zh)
Other versions: CN117788656A
Inventors: 张顺四, 卢增, 徐列, 冯智毅
Current Assignee: Guangzhou Quwan Network Technology Co Ltd
Original Assignee: Guangzhou Quwan Network Technology Co Ltd
Application filed by Guangzhou Quwan Network Technology Co Ltd
Priority to CN202410217756.5A
Publication of CN117788656A (application)
Application granted
Publication of CN117788656B (grant)

Landscapes: Processing Or Creating Images (AREA)

Abstract

The application discloses a video generation method, device and equipment. The method can acquire LoRA image weight parameters for generating a target image and fuse the LoRA image weight parameters into an animation diffusion model to obtain a target animation model; on this basis, a digital person with a specified image is generated in each frame of the video through the target animation model fused with the LoRA image weight parameters, so that the image stability of the video is ensured. A context window corresponding to each frame can be determined, wherein each context window comprises the corresponding current frame and a plurality of frames surrounding the current frame; the control information within the same context window is fused to obtain fused control information; the fused control information of each frame can then be input into the target animation model to obtain a continuous video driving the target image. Therefore, the application can ensure the continuity of the generated continuous video and improve its quality from the two aspects of the digital human image and the driving control information.

Description

Video generation method, device and equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, and a device for generating video.
Background
In the field of artificial intelligence, AI video generation is currently a hot technology. In the prior art, a diffusion model generates a new picture frame by frame based on the prompt words and the picture features of each frame of the original video, and the generated pictures are finally combined into an AI video.
However, because the pictures generated by the diffusion model are highly random, the consistency of the AI video is poor and the video quality is low.
Disclosure of Invention
In view of the above, the present application provides a method, apparatus and device for generating video, which are used for solving the disadvantage of poor consistency of the video generated in the prior art.
In order to achieve the above object, the following solutions have been proposed:
A video generation method, comprising:
Acquiring control information of each frame in the target continuous video and LoRA image weight parameters for generating a target image;
Fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
Fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
And inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Optionally, obtaining the LoRA image weight parameters for generating the target image includes:
Acquiring a face LoRA weight parameter of a specified subject, wherein the face LoRA weight parameter is used for drawing the face of the specified subject;
acquiring a modeling LoRA weight parameter of a specified modeling and a face LoRA weight parameter of the specified modeling, wherein the modeling LoRA weight parameter is used for drawing a subject collocated with the specified modeling, and the face LoRA weight parameter of the specified modeling is used for drawing the face of the subject collocated with the specified modeling;
calculating a weight parameter difference between the modeling LoRA weight parameter and the face LoRA weight parameter of the specified modeling;
and integrating the weight parameter difference with the face LoRA weight parameter of the specified subject to obtain the LoRA image weight parameter, wherein the target image is the specified subject collocated with the specified modeling.
Optionally, the acquiring the face LoRA weight parameter of the specified subject includes:
acquiring a plurality of subject images including a specified subject face;
Extracting a face description text of each subject image and a subject face mask;
training the LoRA level of the potential diffusion model by using the face description text and the subject face mask of each subject image, wherein the face LoRA weight parameters are the weight parameters of the trained LoRA level.
Optionally, the training the LoRA level of the potential diffusion model by using the face description text of each subject image and the subject face mask includes:
For each subject image, in a forward diffusion stage, inputting an input image added with real noise to the potential diffusion model, the input image being derived based on the subject image; predicting random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the potential diffusion model, and denoising the prediction result of the previous stage by using the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on random noise of the internal region of the subject face mask and real noise of the internal region of the subject face mask; and updating the weight parameters of the LoRA level of the potential diffusion model based on the loss value.
Optionally, calculating the loss value based on random noise of the subject face mask interior region and true noise of the subject face mask interior region includes:
calculating to obtain the loss value by using a preset loss value calculation function;
The loss value calculation function is as follows:

$$loss = \frac{1}{h\cdot w}\sum_{i=1}^{h}\sum_{j=1}^{w}\left(pred_{ij}-gt_{ij}\right)^{2}\cdot mask_{ij}$$

wherein loss is the loss value; h is the height corresponding to the prediction result; w is the width corresponding to the prediction result; pred_ij is the random noise value of the i-th row and j-th column; gt_ij is the real noise value of the i-th row and j-th column; mask_ij is the feature value of the i-th row and j-th column of the subject face mask.
Optionally, the fusing the control information corresponding to the context window to obtain corresponding fused control information includes:
based on normal distribution, controlling the weight corresponding to each frame in the context window, wherein the weight of the current frame of the context window is the maximum;
and calculating fusion control information of the current frame based on each piece of control information of the context window and the corresponding weight of the control information.
Optionally, the controlling the weight corresponding to each frame in the context window based on the normal distribution includes:
Acquiring a preset weight calculation function constructed based on normal distribution, and calculating the weight of each frame in the context window based on the weight calculation function;
the weight calculation function is as follows:

$$w_x = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$

wherein μ is the sequence number of the current frame in the context window; σ is a preset standard deviation; x represents the frame with sequence number x in the context window; w_x is the weight of the x-th frame.
Optionally, the fusion control information of each frame includes torso fusion control information, hand fusion control information, and face fusion control information;
the step of inputting the fusion control information of each frame into the target animation model to obtain continuous video of the driving target image comprises the following steps:
and layering and injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model to obtain continuous video of the driving target image.
Optionally, the target animation model includes 8×8 Middle Block, 8×8 Decoder Block, 32×32 Decoder Block, 64×64 Decoder Block, and 16×16 Decoder Block;
Layering and injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model, wherein the layering and injecting comprises the following steps:
injecting the trunk fusion control information into an 8×8 Middle Block and an 8×8 Decoder Block;
Injecting the hand fusion control information and the face fusion control information into 32×32 Decoder blocks and 64×64 Decoder blocks;
the fusion control information is injected into a 16×16 Decoder Block.
Optionally, fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model, including:
and fusing the LoRA image weight parameters to a LoRA level of the animation diffusion model, and obtaining a target animation model after fusing.
A video generating apparatus comprising:
The control information acquisition module is used for acquiring control information of each frame in the target continuous video and generating LoRA image weight parameters of the target image;
The parameter fusion module is used for fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
The window determining module is used for determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
The control information fusion module is used for fusing the control information corresponding to the context window for each frame to obtain corresponding fused control information;
And the continuous video generation module is used for inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
A video generating apparatus comprising a memory and a processor;
The memory is used for storing programs;
the processor is configured to execute the program to implement each step of the video generating method.
According to the above technical solution, the video generation method provided by the application first acquires the LoRA image weight parameters for generating the target image, and then fuses the LoRA image weight parameters into a preset animation diffusion model to obtain the target animation model; on this basis, a digital person with the specified image can be generated in each frame of the video through the target animation model fused with the LoRA image weight parameters, so that the image stability of the video is ensured. Furthermore, for the control information of each frame in the target continuous video, a context window corresponding to each frame is determined, wherein each context window comprises the corresponding current frame and a plurality of frames surrounding the current frame, and the control information of the corresponding context window is fused for each frame to obtain corresponding fused control information; because the fused control information of each frame is determined with reference to the control information of a plurality of frames surrounding the current frame, continuity between the pieces of fused control information is further ensured. Finally, the fused control information of each frame can be input into the target animation model to obtain a continuous video driving the target image. Therefore, the application can ensure the continuity of the generated continuous video and improve its quality from the two aspects of the digital human image and the driving control information.
In addition, through the LoRA image weight parameters, the application can ensure that the image in the continuous video is the specified target image, and on this basis a continuous video driving the specified image can be generated.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a video generating method according to an embodiment of the present application;
fig. 2 is a block diagram of a video generating apparatus according to an embodiment of the present application;
fig. 3 is a block diagram of a hardware structure of a video generating apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor devices, distributed computing environments that include any of the above devices or devices, and the like.
The target image and the target continuous video are information authorized by the user, and privacy information is not involved.
The following describes the video generation method of the present application in detail with reference to fig. 1, including the following steps:
Step S1, acquiring control information of each frame in the target continuous video, and LoRA image weight parameters for generating the target image.
Specifically, any one continuous video can be selected as the target continuous video based on the driving requirement of the target image.
The control information of each frame in the target continuous video can be extracted through a control network (ControlNet).
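The application does not name a specific extractor; purely as an illustration, per-frame control information could be obtained with an OpenPose-style annotator. The controlnet_aux package, the "lllyasviel/Annotators" checkpoint and the helper name below are assumptions, not part of the application.

```python
# Sketch (assumption): extract one pose control image per frame of the target
# continuous video with an OpenPose-style annotator from controlnet_aux.
import cv2
from PIL import Image
from controlnet_aux import OpenposeDetector

def extract_control_frames(video_path: str):
    """Return one pose control image per frame of the target continuous video."""
    detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
    capture = cv2.VideoCapture(video_path)
    control_frames = []
    while True:
        ok, frame_bgr = capture.read()
        if not ok:
            break
        frame = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        control_frames.append(detector(frame))  # hand/face keypoints can also be enabled
    capture.release()
    return control_frames
```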
LoRA image weight parameters for generating a specified subject collocated with a specified modeling may be obtained.
The designated subject may be an authorized user or an authorized avatar.
The specified modeling may include any one or more of a specified garment, a specified pose, a specified make-up, a specified hairstyle, and specified accessories.
Step S2, fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model.
Specifically, the LoRA image weight parameters can be fused with the animation diffusion model, and each frame in the continuous video generated by the target animation model obtained after fusion shows the target image.
The target animation model may be used to generate a video that drives the target avatar.
Step S3, determining a context window corresponding to each frame.
Specifically, each frame may be sequentially used as a current frame, and the current frame and a plurality of surrounding frames surrounding the current frame may be combined to form a context window of the current frame.
Based on this, each context window may include a corresponding current frame and a plurality of frames surrounding the current frame.
The number of frames per context window may not be uniform.
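A minimal sketch of one way such windows could be constructed; the symmetric radius and the function name are illustrative assumptions, not taken from the application. Windows near the start or end of the video are clipped, which is why their frame counts may differ.

```python
def build_context_windows(num_frames: int, radius: int = 4):
    """For each frame, collect the current frame plus up to `radius` frames on
    each side, clipped to the video boundaries, so windows near the ends may
    contain fewer frames than windows in the middle."""
    windows = []
    for current in range(num_frames):
        start = max(0, current - radius)
        end = min(num_frames, current + radius + 1)
        windows.append(list(range(start, end)))
    return windows

# Example: a 6-frame video with radius 2
# build_context_windows(6, 2)
# -> [[0,1,2], [0,1,2,3], [0,1,2,3,4], [1,2,3,4,5], [2,3,4,5], [3,4,5]]
```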
Step S4, fusing the control information corresponding to the context window for each frame to obtain corresponding fused control information.
Specifically, the control information of each frame in each context window can be fused by combining normal distribution, so as to obtain fused control information of each frame.
The fusion control information for each frame may be used to generate a picture for the corresponding frame.
The fused control information may include control information of a plurality of locations.
Step S5, inputting the fusion control information of each frame into the target animation model to obtain a continuous video for driving the target image.
Specifically, based on the target animation model, the motion or expression of the target image can be driven by utilizing the fusion control information of each frame to obtain pictures, and the pictures are combined to form a continuous video for driving the target image.
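Purely as an illustrative sketch of this step: the generate() call on the target animation model is a hypothetical placeholder, and imageio is only one of many ways to combine the per-frame pictures into a video.

```python
# Sketch (assumption): drive the target image frame by frame with the fused
# control information and assemble the pictures into a continuous video.
import numpy as np
import imageio.v2 as imageio

def render_driven_video(target_animation_model, fused_controls, out_path="driven.mp4", fps=25):
    frames = []
    for control in fused_controls:                          # one fused control per frame
        picture = target_animation_model.generate(control)  # hypothetical model API
        frames.append(np.asarray(picture))
    imageio.mimsave(out_path, frames, fps=fps)               # combine pictures into a video
    return out_path
```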
According to the above technical solution, the video generation method provided by the application first acquires the LoRA image weight parameters for generating the target image, and then fuses the LoRA image weight parameters into a preset animation diffusion model to obtain the target animation model; on this basis, a digital person with the specified image can be generated in each frame of the video through the target animation model fused with the LoRA image weight parameters, so that the image stability of the video is ensured. Furthermore, for the control information of each frame in the target continuous video, a context window corresponding to each frame is determined, wherein each context window comprises the corresponding current frame and a plurality of frames surrounding the current frame, and the control information of the corresponding context window is fused for each frame to obtain corresponding fused control information; because the fused control information of each frame is determined with reference to the control information of a plurality of frames surrounding the current frame, continuity between the pieces of fused control information is further ensured. Finally, the fused control information of each frame can be input into the target animation model to obtain a continuous video driving the target image. Therefore, the application can ensure the continuity of the generated continuous video and improve its quality from the two aspects of the digital human image and the driving control information.
In addition, through the LoRA image weight parameters, the application can ensure that the image in the continuous video is the specified target image, and on this basis a continuous video driving the specified image can be generated.
In some embodiments of the present application, a process of acquiring LoRA character weight parameters for generating a target character in step S1 is described in detail, and the steps are as follows:
S10, acquiring face LoRA weight parameters of the appointed subject.
Specifically, the face LoRA weight parameters are used to map the face of the specified subject.
The face LoRA weight parameters may be LoRA weight parameters of a first potential diffusion model, which may be trained from a plurality of specified subject face images.
The LoRA weight parameters of the first potential diffusion model may convert the specified subject face image into a low-dimensional potential space.
S11, acquiring the modeling LoRA weight parameters of the specified modeling and the face LoRA weight parameters of the specified modeling.
Specifically, the modeling LoRA weight parameters may be used to draw a subject collocated with the specified modeling, and the face LoRA weight parameters of the specified modeling may be used to draw the face of that subject.
The subject collocated with the specified modeling can be any authorized subject other than the specified subject, for example an authorized user other than the specified subject, or an avatar.
The modeling LoRA weight parameters may be the LoRA weight parameters of a second potential diffusion model, which may be trained on a plurality of specified modeling images.
The face LoRA weight parameters of the specified modeling may be the LoRA weight parameters of a third potential diffusion model, which may be trained on a plurality of specified modeling face images.
Each specified modeling face image is the face area cropped from a corresponding specified modeling image.
Each specified subject face image, each specified pose image, and each specified pose face image are authorized.
S12, calculating a weight parameter difference between the modeling LoRA weight parameters and the face LoRA weight parameters of the specified modeling.
Specifically, the difference between the modeling LoRA weight parameters and the face LoRA weight parameters of the specified modeling is calculated as the weight parameter difference.
S13, integrating the weight parameter difference with the face LoRA weight parameters of the specified subject to obtain the LoRA image weight parameters, wherein the target image is the specified subject collocated with the specified modeling.
Specifically, since the modeling LoRA weight parameters may generate a subject collocated with the specified modeling, and the face LoRA weight parameters of the specified modeling may generate the face of that subject, the modeling LoRA weight parameters include features corresponding to the face of the subject collocated with the specified modeling;
thus, the weight parameter difference does not include the facial region features of the subject collocated with the specified modeling.
After the weight parameter difference is superimposed on the face LoRA weight parameters of the specified subject, the LoRA image weight parameters are obtained.
The LoRA image weight parameters can be used to draw the specified subject collocated with the specified modeling.
The LoRA image weight parameters may be calculated by the following weight calculation function:

$$W_{image} = W_{face} + \left(W_{modeling} - W_{modeling\text{-}face}\right)$$

wherein W_image is the LoRA image weight parameter; W_face is the face LoRA weight parameter of the specified subject; W_modeling-face is the face LoRA weight parameter of the specified modeling; W_modeling is the modeling LoRA weight parameter.
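Read literally, steps S12 and S13 amount to per-tensor arithmetic on three sets of LoRA weights. The sketch below assumes each set is stored as a dictionary of tensors sharing the same keys; the dictionary layout and the function name are illustrative assumptions, not the application's code.

```python
import torch

def compose_image_lora(face_lora: dict, modeling_lora: dict, modeling_face_lora: dict) -> dict:
    """Per steps S12-S13: difference = modeling - modeling_face, then add the
    face LoRA of the specified subject. All three dicts are assumed to share keys."""
    image_lora = {}
    for name, w_face in face_lora.items():
        difference = modeling_lora[name] - modeling_face_lora[name]  # S12
        image_lora[name] = w_face + difference                       # S13
    return image_lora
```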
As can be seen from the above technical solution, this embodiment provides an optional manner of obtaining the LoRA image weight parameters. By constructing the face LoRA weight parameters of the specified subject, the face LoRA weight parameters of the specified modeling, and the modeling LoRA weight parameters separately, and processing them to obtain the LoRA image weight parameters, the application decouples the multiple LoRA weight parameters, thereby avoiding interference and mutual pollution between the LoRA weight parameters.
In some embodiments of the present application, the process of acquiring the weight parameter of the face LoRA of the specified subject in step S10 is described in detail as follows:
S100, acquiring a plurality of subject images containing the face of the specified subject.
Specifically, a plurality of subject images each including a face region of a specified subject may be acquired.
S101, extracting face description text of each subject image and a subject face mask.
Specifically, a face region may be truncated from each subject image using a face detection algorithm to obtain a specified subject face image.
An image-to-text (image captioning) algorithm may be employed to extract the face description text of each specified subject face image; a face parsing algorithm may be employed to extract the subject face mask of each specified subject face image.
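As a hedged illustration of this preparation step: the sketch below uses an OpenCV Haar cascade to stand in for the face detection algorithm and a BLIP captioning model to stand in for the image-to-text algorithm, while parse_face_mask() is a placeholder for an unspecified face-parsing model. None of these particular tools are named by the application.

```python
import cv2
import numpy as np
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def parse_face_mask(face_image: Image.Image) -> np.ndarray:
    """Placeholder for a real face-parsing model; returns an all-ones mask."""
    return np.ones((face_image.height, face_image.width), dtype=np.uint8)

def prepare_face_sample(image_path: str):
    """Crop the face region of one subject image, caption it, and build its mask."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    x, y, w, h = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)[0]
    face_crop = Image.fromarray(cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2RGB))
    inputs = blip_processor(face_crop, return_tensors="pt")
    description = blip_processor.decode(blip_model.generate(**inputs)[0], skip_special_tokens=True)
    return face_crop, description, parse_face_mask(face_crop)
```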
S102, training LoRA levels of the potential diffusion model by using the face description text of each subject image and the subject face mask, wherein the weight parameters of the face LoRA are weight parameters of LoRA levels after training.
Specifically, the potential diffusion model may be trained by using each specified subject face image, and its corresponding face description text and subject face mask, to obtain a first potential diffusion model.
The weight parameters of LoRA levels in the first potential diffusion model are taken as the weight parameters of the face LoRA.
As can be seen from the above technical solution, this embodiment provides an optional way to obtain the weight parameters of the face LoRA, by which multiple subject images can be used to train the potential diffusion model, and the weight parameters of the face LoRA can be extracted from the trained potential diffusion model. Therefore, the embodiment can pointedly acquire the face LoRA weight parameters constructing the face of the specified subject, avoid the influence of other areas of the specified subject on the face area, further improve the training efficiency of the face LoRA weight parameters and improve the reliability of the face LoRA weight parameters.
Similarly, the modeling LoRA weight parameters and the face LoRA weight parameters of the specified modeling may be obtained as described above. Specifically, the procedure of step S11, obtaining the modeling LoRA weight parameters of the specified modeling and the face LoRA weight parameters of the specified modeling, may be as follows:
A plurality of specified modeling images may be acquired, each specified modeling image including a subject collocated with the specified modeling.
The face area of each specified modeling image can be cropped to obtain a specified modeling face image, and a face description text and a face mask of each specified modeling face image can be extracted;
a description text of each specified modeling image can also be extracted;
Training LoRA levels of the potential diffusion model by using each designated modeling image and the corresponding description text thereof to obtain a second potential diffusion model, wherein modeling LoRA weight parameters are weight parameters of LoRA levels in the second potential diffusion model;
and training LoRA levels of the potential diffusion model by using each specified modeling face image and the corresponding face description text and face mask to obtain a third potential diffusion model, wherein the weight parameters of the face LoRA are the weight parameters of LoRA levels in the third potential diffusion model.
In some embodiments of the present application, step S102, training the LoRA level of the potential diffusion model by using the face description text of each subject image and the subject face mask, where the face LoRA weight parameter is a trained LoRA level weight parameter, is described in detail as follows:
S1020, inputting an input image added with real noise to the potential diffusion model in a forward diffusion stage for each subject image, wherein the input image is obtained based on the subject image; predicting random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the potential diffusion model, and denoising the prediction result of the previous stage by using the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on random noise of the internal region of the subject face mask and real noise of the internal region of the subject face mask; and updating the weight parameters of the LoRA level of the potential diffusion model based on the loss value.
Specifically, the potential diffusion model involves two processes: a forward diffusion phase and a reverse propagation phase.
The input image is a specified subject face image to which real noise is added.
The specified subject face image may comprise 336-448 pixels.
In the forward diffusion stage, a token corresponding to each vocabulary tag in each face description text can be obtained through a tokenizer; an encoder in FrozenCLIPEmbedder is adopted to encode each token, converting each vocabulary tag into a semantic vector and obtaining a semantic matrix for each face description text; the semantic matrix, the embedding of the current stage time, and the prediction result (latent) of the previous stage are input into a UNET with Cross-Attention to predict the random noise of the current stage.
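For orientation only, a sketch of one such text-conditioned noise-prediction step using a CLIP text encoder and a cross-attention UNet from the transformers/diffusers libraries; the model identifiers are illustrative assumptions and are not the models used in the application.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import UNet2DConditionModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

def predict_stage_noise(latents: torch.Tensor, timestep: int, face_description: str) -> torch.Tensor:
    """Encode the face description into a semantic matrix and predict the noise of
    the current stage from the previous-stage latents and the timestep embedding."""
    tokens = tokenizer(face_description, padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    with torch.no_grad():
        semantic_matrix = text_encoder(tokens.input_ids)[0]   # (1, 77, 768)
    return unet(latents, timestep, encoder_hidden_states=semantic_matrix).sample
```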
The random noise predicted by each stage is an influencing factor of the prediction result of the corresponding stage, namely, the prediction result of each stage is formed by the random noise predicted by the corresponding stage.
In the back propagation stage, the loss value of the area outside the subject face mask is set to 0, so that gradient values from the area outside the subject face mask are prevented from being back-propagated and only gradient values from the area inside the subject face mask are back-propagated, ensuring that the guidance of the features inside the subject face mask is considered when the weights are updated by back propagation.
According to the above technical solution, this embodiment provides an optional manner of training the LoRA level of the potential diffusion model; only the region features inside the mask are considered during training, the influence of region features outside the mask on the weight parameters is avoided, and the reliability of the application is further improved.
In some embodiments of the present application, the process of calculating the loss value based on the random noise of the internal region of the mask of the face of the subject and the real noise of the internal region of the mask of the face of the subject in step S1020 is described in detail, and the steps are as follows:
s10200, calculating the loss value by using a preset loss value calculation function.
Specifically, the real noise and the random noise may be substituted into the loss value calculation function, and the first loss value may be calculated.
The loss value calculation function may be as follows:

$$loss = \frac{1}{h\cdot w}\sum_{i=1}^{h}\sum_{j=1}^{w}\left(pred_{ij}-gt_{ij}\right)^{2}\cdot mask_{ij}$$

wherein loss is the loss value; h is the height corresponding to the prediction result; w is the width corresponding to the prediction result; pred_ij is the random noise value of the i-th row and j-th column; gt_ij is the real noise value of the i-th row and j-th column; mask_ij is the feature value of the i-th row and j-th column of the subject face mask.
As can be seen from the above technical solution, the present embodiment provides an alternative way of calculating the loss value, by which the feature value of the outer area of the mask can be better removed, thereby further improving the reliability of the present application.
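A minimal PyTorch sketch of the loss function above; the tensor shapes and function name are assumptions made for illustration.

```python
import torch

def masked_noise_loss(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked MSE between predicted and real noise: contributions outside the
    subject face mask are zeroed, so only gradients from the mask interior
    propagate back into the LoRA weights."""
    h, w = pred.shape[-2:]
    return ((pred - gt) ** 2 * mask).sum() / (h * w)
```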
Similarly, when the second potential diffusion model is trained, a specified modeling image added with second real noise can be input to the potential diffusion model in the forward diffusion stage; a second random noise added in the current stage is predicted based on the description text of the specified modeling image and the prediction noise of the previous stage by using the potential diffusion model; in the back propagation stage, the second random noise and the second real noise are substituted into a modeling loss value calculation function to calculate a second loss value; and the weight parameters of the LoRA level of the potential diffusion model are updated based on the second loss value, wherein the potential diffusion model trained by each specified modeling image is the second potential diffusion model.
The modeling loss value calculation function may be as follows:

$$loss_2 = \frac{1}{h\cdot w}\sum_{i=1}^{h}\sum_{j=1}^{w}\left(pred2_{ij}-gt2_{ij}\right)^{2}$$

wherein h is the height of the specified modeling image; loss_2 is the second loss value; w is the width of the specified modeling image; pred2_ij is the second random noise value of the i-th row and j-th column; gt2_ij is the second real noise value of the i-th row and j-th column.
When the third potential diffusion model is trained, a specified modeling image added with real noise can be input to the potential diffusion model in a forward diffusion stage; predicting random noise added in the current stage based on the face description text of the appointed modeling image and the prediction noise of the previous stage by utilizing the potential diffusion model; in the back propagation stage, calculating a third loss value based on random noise of the internal region of the face mask and real noise of the internal region of the face mask; and updating the weight parameters of the LoRA-level potential diffusion model based on the third loss value to obtain a third potential diffusion model.
In calculating the third loss value, the loss value calculation function may be used for calculation.
In some embodiments of the present application, the process of fusing the LoRA image weight parameters to a preset animation diffusion model to obtain the target animation model in step S2 is described in detail, and the steps are as follows:
and S20, fusing the LoRA image weight parameters to a LoRA level of the animation diffusion model, and obtaining a target animation model after fusion.
Specifically, the animation diffusion model may include LoRA levels, and the LoRA image weight parameter is fused with LoRA levels of the animation diffusion model, so as to obtain the target animation model after fusion.
According to the technical scheme, the LoRA image weight parameters of the target image and the animation diffusion model can be fused, and the target image is drawn by using the fused target animation model.
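Folding LoRA weights into a diffusion model's base layers is commonly done by adding the low-rank product to each matching base weight. The sketch below is an assumption about how the fusion of step S20 could be realised: it assumes the LoRA image weight parameters are stored as (down, up) matrix pairs keyed by module name, which is not necessarily the application's storage format.

```python
import torch

@torch.no_grad()
def merge_lora_into_model(model: torch.nn.Module, lora_weights: dict, alpha: float = 1.0):
    """Fold LoRA image weight parameters into the matching base layers:
    W <- W + alpha * up @ down for each named module with a LoRA pair."""
    modules = dict(model.named_modules())
    for name, (down, up) in lora_weights.items():
        layer = modules[name]                 # e.g. a torch.nn.Linear inside the animation model
        layer.weight += alpha * (up @ down)   # low-rank update folded into the base weight
    return model
```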
In some embodiments of the present application, the process of fusing each control information corresponding to the context window for each frame in step S4 to obtain corresponding fused control information is described in detail, and the steps are as follows:
And S40, controlling the weight corresponding to each frame in the context window based on normal distribution, wherein the weight of the current frame of the context window is the maximum.
Specifically, the weight of the current frame may be set to be maximum based on the current frame in combination with the normal distribution, and the weight of each frame may be set based on the distance between each frame and the current frame, where the weight corresponding to the frame far away is smaller than the weight corresponding to the frame near.
S41, calculating fusion control information of the current frame based on each piece of control information of the context window and the corresponding weight of the context window.
Specifically, the fusion control information may be calculated using the following fusion control information calculation function:

$$\bar{c}_{\mu} = \frac{\sum_{x=1}^{L} w_x \cdot c_x}{\sum_{x=1}^{L} w_x}$$

wherein \bar{c}_μ is the fusion control information of the current (μ-th) frame; w_x is the weight of the x-th frame; c_x is the control information of the x-th frame; L is the total number of frames in the context window.
Each piece of control information may include torso control information, hand control information, and face control information.
Torso fusion control information may be calculated based on torso control information for each frame and its corresponding weights;
the hand fusion control information can be calculated based on the hand control information of each frame and the weight corresponding to the hand fusion control information;
the face fusion control information may be calculated based on the face control information of each frame and its corresponding weight.
The torso fusion control information, the hand fusion control information, and the face fusion control information may be calculated using the fusion control information calculation function, respectively.
The hand region mask and the face region mask of each frame may be extracted, and torso control information, hand control information, and face control information of each frame may be obtained based on each frame and the hand region mask and the face region mask thereof.
As can be seen from the above technical solution, the present embodiment provides an optional manner of fusing each control information, by which each control information can be fused in combination with normal distribution, and under the condition that the control information of the current frame is ensured to be dominant, the control information of the current frame is updated by referring to the control information of a plurality of frames surrounding the current frame, so that the continuity of each fused control information is improved, and meanwhile, the fusion disorder of long-distance control information is effectively avoided.
In some embodiments of the present application, the step S40 of controlling the weight corresponding to each frame in the context window based on normal distribution, and the process of maximizing the weight of the current frame in the context window is described in detail, and the steps are as follows:
S400, acquiring a preset weight calculation function constructed based on normal distribution, and calculating the weight of each frame in the context window based on the weight calculation function.
Specifically, a weight calculation function constructed in advance may be acquired, and the weight of each frame may be calculated using the weight calculation function.
Wherein the weight calculation function may be constructed based on a normal distribution.
The weight calculation function is as follows:

$$w_x = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$

wherein μ is the sequence number of the current frame in the context window; σ is a preset standard deviation; x represents the frame with sequence number x in the context window; w_x is the weight of the x-th frame.
As can be seen from the above technical solutions, this embodiment provides an optional way to calculate weights of frames, by which weights of each frame can be calculated step by step, so as to better perform control information fusion.
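Combining the two formulas above, a small NumPy sketch of the Gaussian weighting and the weighted fusion; the normalization of the weights, the array layout of the control information, and the default sigma are assumptions made for illustration.

```python
import numpy as np

def gaussian_frame_weights(window_indices, current_index, sigma: float = 1.5):
    """Normal-distribution weights: the current frame (x == mu) gets the largest
    weight, and frames farther from it get progressively smaller weights."""
    x = np.asarray(window_indices, dtype=np.float64)
    return np.exp(-((x - current_index) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def fuse_window_controls(controls, window_indices, current_index, sigma: float = 1.5):
    """Weighted fusion of the control information inside one context window."""
    weights = gaussian_frame_weights(window_indices, current_index, sigma)
    weights = weights / weights.sum()                    # assumed normalization
    stacked = np.stack([controls[i] for i in window_indices]).astype(np.float64)
    return np.tensordot(weights, stacked, axes=1)        # weighted sum over the window frames
```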
In some embodiments of the present application, a process of inputting the fusion control information of each frame to the target animation model in step S5 to obtain a continuous video of the driving target image is described in detail, and the steps are as follows:
S50, layering and injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model to obtain continuous video of the driving target image.
Specifically, the torso fusion control information, the hand fusion control information and the face fusion control information of each frame can be hierarchically injected into the target animation model to obtain continuous videos formed by the expression and/or the action of the driving target image.
The torso fusion control information can be injected into the low-resolution (coarse-scale) blocks of the target animation model, and the hand fusion control information and the face fusion control information can be injected into the high-resolution (fine-scale) blocks of the target animation model.
From the above technical solution, it can be seen that this embodiment provides an optional way of inputting the fusion control information; in this way, the spatial features of the torso can be extracted with a large convolution receptive field, and the spatial features of the hands and the face can be extracted with a small convolution receptive field, which is favorable for fully extracting the spatial features of the continuous video and improves the continuity of the continuous video generated by the application.
In some embodiments of the present application, the target animation model may include 8×8 Middle Block, 8×8 Decoder Block, 32×32 Decoder Block, 64×64 Decoder Block, and 16×16 Decoder Block, and the steps of step S50, layering and injecting the torso fusion control information, the hand fusion control information, and the face fusion control information into the target animation model, and the following steps are described in detail to obtain a continuous video of the driving target image:
S500, injecting the trunk fusion control information into an 8×8 Middle Block and an 8×8 Decoder Block.
Specifically, the torso fusion control information for each frame may be input to an 8×8 Middle Block and an 8×8 Decoder Block.
S501, injecting the hand fusion control information and the face fusion control information into 32×32 Decoder blocks and 64×64 Decoder blocks.
Specifically, the hand fusion control information and the face fusion control information for each frame may be input to a 32×32 Decoder Block and a 64×64 Decoder Block.
S502, injecting the fusion control information into a 16×16 Decoder Block.
Specifically, torso fusion control information, hand fusion control information, and face fusion control information for each frame may be input to a 16×16 Decoder Block.
From the above technical solution, it can be seen that this embodiment provides an alternative way of injecting the fusion control information in layers, and by using the above way, spatial features of each information can be extracted by using different convolution layers, so as to ensure reliability of the present application.
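A sketch of how the three kinds of fused control features might be routed to blocks of the corresponding spatial sizes, mirroring steps S500–S502; the bilinear resizing, the block naming and the dictionary return value are assumptions made for illustration, not the application's actual injection mechanism.

```python
import torch
import torch.nn.functional as F

def route_fused_controls(torso_ctrl: torch.Tensor, hand_ctrl: torch.Tensor, face_ctrl: torch.Tensor):
    """Resize each fused control feature map (B, C, H, W) to the spatial size of
    the block it is injected into, following steps S500-S502."""
    def at(ctrl, size):
        return F.interpolate(ctrl, size=(size, size), mode="bilinear", align_corners=False)
    return {
        "middle_block_8x8":    [at(torso_ctrl, 8)],                                         # S500
        "decoder_block_8x8":   [at(torso_ctrl, 8)],                                         # S500
        "decoder_block_32x32": [at(hand_ctrl, 32), at(face_ctrl, 32)],                      # S501
        "decoder_block_64x64": [at(hand_ctrl, 64), at(face_ctrl, 64)],                      # S501
        "decoder_block_16x16": [at(torso_ctrl, 16), at(hand_ctrl, 16), at(face_ctrl, 16)],  # S502
    }
```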
Next, a detailed description will be given of a video generating apparatus provided in the present application with reference to fig. 2, which can be contrasted with the video generating method provided above.
As can be seen with reference to fig. 2, the video generating apparatus may include:
The control information acquisition module 10 is used for acquiring control information of each frame in the target continuous video and generating LoRA image weight parameters of the target image;
The parameter fusion module 20 is configured to fuse the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
A window determining module 30, configured to determine a context window corresponding to each frame, where each context window includes a corresponding current frame and a plurality of frames surrounding the current frame;
The control information fusion module 40 is configured to fuse, for each frame, each control information corresponding to the context window to obtain corresponding fused control information;
And the continuous video generation module 50 is used for inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Further, the control information acquisition module may include:
a face weight parameter obtaining unit configured to obtain a face LoRA weight parameter of the specified subject, the face LoRA weight parameter being used to draw the face of the specified subject;
a modeling parameter obtaining unit, configured to obtain a modeling LoRA weight parameter of a specified modeling and a face LoRA weight parameter, where the modeling LoRA weight parameter is used for drawing a subject collocated with the specified modeling, and the face LoRA weight parameter is used for drawing a face collocated with the subject collocated with the specified modeling;
A parameter difference calculating unit, configured to calculate a weight parameter difference between the model LoRA weight parameter and the face LoRA weight parameter;
And the parameter integration unit is used for integrating the weight parameter difference value with the face LoRA weight parameter to obtain LoRA image weight parameter after integration, wherein the target image is a specified main body collocated with a specified model.
Further, the face weight parameter acquisition unit may include:
a subject image acquisition subunit operable to acquire a plurality of subject images including a specified subject face;
a description text extraction subunit for extracting a face description text of each subject image and a subject face mask;
And the model training subunit is used for training LoRA levels of the potential diffusion model by using the face description text of each subject image and the subject face mask, and the weight parameters of the face LoRA are weight parameters of LoRA levels after training.
Further, the model training subunit may include:
A parameter updating component for inputting an input image added with real noise to the potential diffusion model in a forward diffusion stage for each subject image, the input image being derived based on the subject image; predicting random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the potential diffusion model, and denoising the prediction result of the previous stage by using the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on random noise of the internal region of the subject face mask and real noise of the internal region of the subject face mask; and updating the weight parameters of the LoRA level of the potential diffusion model based on the loss value.
Further, the parameter updating component may include:
the loss value calculation sub-component is used for calculating the loss value by utilizing a preset loss value calculation function;
The loss value calculation function is as follows:

$$loss = \frac{1}{h\cdot w}\sum_{i=1}^{h}\sum_{j=1}^{w}\left(pred_{ij}-gt_{ij}\right)^{2}\cdot mask_{ij}$$

wherein loss is the loss value; h is the height corresponding to the prediction result; w is the width corresponding to the prediction result; pred_ij is the random noise value of the i-th row and j-th column; gt_ij is the real noise value of the i-th row and j-th column; mask_ij is the feature value of the i-th row and j-th column of the subject face mask.
Further, the control information fusion module may include:
The weight calculation unit is used for controlling the weight corresponding to each frame in the context window based on normal distribution, and the weight of the current frame of the context window is the maximum;
and the weight utilization unit is used for calculating fusion control information of the current frame based on each piece of control information of the context window and the corresponding weight.
Further, the weight calculation unit may include:
The first weight calculation subunit is used for acquiring a preset weight calculation function constructed based on normal distribution and calculating the weight of each frame in the context window based on the weight calculation function;
the weight calculation function is as follows:

$$w_x = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$

wherein μ is the sequence number of the current frame in the context window; σ is a preset standard deviation; x represents the frame with sequence number x in the context window; w_x is the weight of the x-th frame.
Further, the continuous video generation module may include:
and the information injection unit is used for injecting the trunk fusion control information, the hand fusion control information and the face fusion control information into the target animation model in a layered manner to obtain continuous videos for driving the target image.
Further, the information injection unit may include:
a first information injection subunit, configured to inject the torso fusion control information into an 8×8 Middle Block and an 8×8 Decoder Block;
a second information injection subunit, configured to inject the hand fusion control information and the face fusion control information into 32×32 Decoder blocks and 64×64 Decoder blocks;
And a third information injection subunit, configured to inject the fusion control information into a 16×16 Decoder Block.
Further, the parameter fusion module may include:
And the target animation model acquisition unit is used for fusing the LoRA image weight parameters to the LoRA level of the animation diffusion model, and obtaining a target animation model after fusion.
The video generating device provided by the embodiment of the application can be applied to video generating equipment such as PC terminals, cloud platforms, servers, server clusters and the like. Alternatively, fig. 3 shows a block diagram of a hardware structure of the video generating apparatus, and referring to fig. 3, the hardware structure of the video generating apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
Acquiring control information of each frame in the target continuous video and LoRA image weight parameters for generating a target image;
Fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
Fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
And inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the present application also provides a readable storage medium storing a program adapted to be executed by a processor, the program being configured to:
Acquiring control information of each frame in the target continuous video and LoRA image weight parameters for generating a target image;
Fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
Fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
And inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Various embodiments of the present application may be combined with each other. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A video generation method, comprising:
Acquiring control information of each frame in a target continuous video;
Acquiring a face LoRA weight parameter of a specified subject, wherein the face LoRA weight parameter is used for drawing the face of the specified subject;
acquiring a modeling LoRA weight parameter of a specified modeling and a face LoRA weight parameter of the specified modeling, wherein the modeling LoRA weight parameter is used for drawing a subject collocated with the specified modeling, and the face LoRA weight parameter of the specified modeling is used for drawing the face of the subject collocated with the specified modeling;
calculating a weight parameter difference between the modeling LoRA weight parameter and the face LoRA weight parameter of the specified modeling;
integrating the weight parameter difference with the face LoRA weight parameter of the specified subject to obtain a LoRA image weight parameter, wherein the LoRA image weight parameter is used for generating a target image, and the target image is the specified subject collocated with the specified modeling;
Fusing the LoRA image weight parameters to a preset animation diffusion model to obtain a target animation model;
determining a context window corresponding to each frame, wherein each context window comprises a corresponding current frame and a plurality of frames surrounding the current frame;
Fusing each control information of the corresponding context window for each frame to obtain corresponding fused control information;
And inputting the fusion control information of each frame into the target animation model to obtain continuous video for driving the target image.
2. The method of claim 1, wherein the obtaining the face LoRA weight parameters of the specified subject comprises:
acquiring a plurality of subject images including a specified subject face;
Extracting a face description text of each subject image and a subject face mask;
training the LoRA level of the potential diffusion model by using the face description text and the subject face mask of each subject image, wherein the face LoRA weight parameters are the weight parameters of the trained LoRA level.
3. The method of claim 2, wherein training the LoRA level of the potential diffusion model using the face description text of each subject image and the subject face mask comprises:
For each subject image, in a forward diffusion stage, inputting an input image added with real noise to the potential diffusion model, the input image being derived based on the subject image; predicting random noise added in the current stage based on the face description text of the subject image and the prediction result of the previous stage by using the potential diffusion model, and denoising the prediction result of the previous stage by using the predicted random noise to form the prediction result of the current stage; in the back propagation stage, calculating a loss value based on random noise of the internal region of the subject face mask and real noise of the internal region of the subject face mask; and updating the weight parameters of the potential diffusion model LoRA level based on the loss value.
4. The video generation method according to claim 3, wherein the calculating of the loss value based on the random noise within the interior region of the subject face mask and the real noise within the interior region of the subject face mask comprises:
calculating the loss value using a preset loss value calculation function;
the loss value calculation function is as follows:

$$\mathrm{loss}=\frac{1}{h\times w}\sum_{i=1}^{h}\sum_{j=1}^{w}\left(\mathrm{pred}_{ij}-\mathrm{gt}_{ij}\right)^{2}\times\mathrm{mask}_{ij}$$

wherein loss is the loss value; h is the height of the prediction result; w is the width of the prediction result; pred_ij is the predicted random noise value at row i, column j; gt_ij is the real noise value at row i, column j; and mask_ij is the value of the subject face mask at row i, column j.
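Read as code, and under the same averaging assumption used in the formula above, the claim 4 loss is a masked squared error; the helper name and the NumPy representation below are illustrative only.

```python
import numpy as np

def masked_noise_loss(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """loss = (1/(h*w)) * sum_ij (pred_ij - gt_ij)^2 * mask_ij
    pred: predicted random noise, shape (h, w)
    gt:   real noise added during forward diffusion, shape (h, w)
    mask: subject face mask, 1 inside the face region and 0 outside."""
    h, w = pred.shape
    return float((((pred - gt) ** 2) * mask).sum() / (h * w))
```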
5. The video generation method according to claim 1, wherein the fusing, for each frame, the control information of the corresponding context window to obtain the corresponding fusion control information comprises:
determining, based on a normal distribution, a weight corresponding to each frame in the context window, wherein the weight of the current frame of the context window is the maximum; and
calculating the fusion control information of the current frame based on each piece of control information in the context window and the weight corresponding to that control information.
6. The video generation method according to claim 5, wherein the determining, based on a normal distribution, of the weight corresponding to each frame in the context window comprises:
acquiring a preset weight calculation function constructed based on a normal distribution, and calculating the weight of each frame in the context window using the weight calculation function;
the weight calculation function is as follows:

$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

wherein μ is the sequence number of the current frame in the context window; σ is a preset standard deviation; x represents the frame with sequence number x in the context window; and f(x) is the weight of the x-th frame.
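Claims 5 and 6 can be illustrated together as Gaussian-weighted averaging of the control maps in a context window. In the sketch below, the array shapes, the choice to normalize the weights to sum to one, and the function names are assumptions made for illustration rather than requirements of the claims.

```python
import numpy as np

def gaussian_frame_weights(window_size: int, current_idx: int, sigma: float) -> np.ndarray:
    """Normal-distribution weight for each frame, maximal at the current frame (claim 6)."""
    x = np.arange(window_size)
    w = np.exp(-((x - current_idx) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    return w / w.sum()  # normalization to unit sum is an added assumption

def fuse_control_info(window_controls: np.ndarray, current_idx: int,
                      sigma: float = 1.0) -> np.ndarray:
    """Weighted average of the control maps in one context window (claim 5).
    window_controls: array of shape (window_size, H, W, C)."""
    weights = gaussian_frame_weights(window_controls.shape[0], current_idx, sigma)
    return np.tensordot(weights, window_controls, axes=1)  # result shape (H, W, C)
```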
7. The video generation method according to claim 1, wherein the fusion control information of each frame includes torso fusion control information, hand fusion control information, and face fusion control information; and
the inputting of the fusion control information of each frame into the target animation model to obtain the continuous video driving the target image comprises:
injecting the torso fusion control information, the hand fusion control information, and the face fusion control information into the target animation model in a layered manner to obtain the continuous video driving the target image.
8. The video generation method according to claim 7, wherein the target animation model includes an 8×8 Middle Block, an 8×8 Decoder Block, a 32×32 Decoder Block, a 64×64 Decoder Block, and a 16×16 Decoder Block; and
the injecting of the torso fusion control information, the hand fusion control information, and the face fusion control information into the target animation model in a layered manner comprises:
injecting the torso fusion control information into the 8×8 Middle Block and the 8×8 Decoder Block;
injecting the hand fusion control information and the face fusion control information into the 32×32 Decoder Block and the 64×64 Decoder Block; and
injecting the fusion control information into the 16×16 Decoder Block.
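One plausible realization of the layered injection in claim 8 is a ControlNet-style addition of control features to the feature maps of the named blocks. The routing table, the resize-and-add operation, and the assumption that control and feature channel counts already match are choices made for this sketch, not the patent's internal architecture; the block names and dictionary keys are likewise hypothetical.

```python
from typing import Dict, List
import torch
import torch.nn.functional as F

# Assumed routing of the fusion control signals to resolution levels (claim 8).
INJECTION_PLAN: Dict[str, List[str]] = {
    "middle_8x8":    ["torso"],
    "decoder_8x8":   ["torso"],
    "decoder_16x16": ["fusion"],   # claim 8 injects "the fusion control information" here
    "decoder_32x32": ["hand", "face"],
    "decoder_64x64": ["hand", "face"],
}

def inject_controls(block_name: str,
                    features: torch.Tensor,
                    controls: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Add the routed control features to one block's feature map (sketch).
    features: (B, C, H, W); each control map is assumed to already have C channels."""
    h, w = features.shape[-2:]
    for key in INJECTION_PLAN.get(block_name, []):
        ctrl = F.interpolate(controls[key], size=(h, w),
                             mode="bilinear", align_corners=False)
        features = features + ctrl
    return features
```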
9. The video generation method according to claim 1, wherein the fusing of the LoRA image weight parameter into the preset animation diffusion model to obtain the target animation model comprises:
fusing the LoRA image weight parameter into LoRA layers of the animation diffusion model, and taking the fused model as the target animation model.
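Fusing LoRA weights into a diffusion model, as in claim 9, is commonly realized by merging the low-rank update into each base weight the LoRA layer is attached to. The sketch assumes the standard LoRA formulation W' = W + scale * (B @ A), which is a well-known technique and only an assumed concretization of what the claim calls fusing.

```python
import torch

def merge_lora_into_weight(base_weight: torch.Tensor,
                           lora_A: torch.Tensor,
                           lora_B: torch.Tensor,
                           scale: float = 1.0) -> torch.Tensor:
    """Standard LoRA merge for one attached layer: W' = W + scale * (B @ A).
    base_weight: (out_features, in_features)
    lora_A:      (rank, in_features)
    lora_B:      (out_features, rank)"""
    return base_weight + scale * (lora_B @ lora_A)
```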
10. A video generating apparatus, comprising:
The control information acquisition module is used for acquiring control information of each frame in a target continuous video; acquiring a subject face LoRA weight parameter of a specified subject, wherein the subject face LoRA weight parameter is used for drawing the face of the specified subject; acquiring a modeling LoRA weight parameter and a modeling face LoRA weight parameter of a specified modeling, wherein the modeling LoRA weight parameter is used for drawing a subject matching the specified modeling, and the modeling face LoRA weight parameter is used for drawing a face matching the subject of the specified modeling; calculating a weight parameter difference between the modeling LoRA weight parameter and the modeling face LoRA weight parameter; and integrating the weight parameter difference with the subject face LoRA weight parameter to obtain an integrated LoRA image weight parameter, wherein the LoRA image weight parameter is used for generating a target image, and the target image is the specified subject matched with the specified modeling;
The parameter fusion module is used for fusing the LoRA image weight parameter into a preset animation diffusion model to obtain a target animation model;
The window determining module is used for determining a context window corresponding to each frame, wherein each context window comprises the corresponding current frame and a plurality of frames surrounding the current frame;
The control information fusion module is used for fusing, for each frame, the control information of the corresponding context window to obtain corresponding fusion control information; and
The continuous video generation module is used for inputting the fusion control information of each frame into the target animation model to obtain a continuous video driving the target image.
11. A video generation device, comprising a memory and a processor;
the memory is used for storing a program; and
the processor is used for executing the program to implement the steps of the video generation method according to any one of claims 1 to 9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410217756.5A CN117788656B (en) 2024-02-28 2024-02-28 Video generation method, device and equipment

Publications (2)

Publication Number Publication Date
CN117788656A CN117788656A (en) 2024-03-29
CN117788656B true CN117788656B (en) 2024-04-26

Family

ID=90385721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410217756.5A Active CN117788656B (en) 2024-02-28 2024-02-28 Video generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN117788656B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11321895B2 (en) * 2020-05-29 2022-05-03 Adobe Inc. Neural state machine digital character animation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241849A (en) * 2017-08-28 2018-07-03 北方工业大学 Human body interactive action recognition methods based on video
CN116934924A (en) * 2023-08-01 2023-10-24 广州虎牙科技有限公司 Cartoon image generation method and device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Key frame extraction method based on moving object saliency in surveillance video; Hu Yuanyuan et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition); 2016-02-29; Vol. 36, No. 01; pp. 35-41 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant