CN117994708A - Human body video generation method based on temporally consistent latent-space-guided diffusion model - Google Patents

Human body video generation method based on temporally consistent latent-space-guided diffusion model Download PDF

Info

Publication number
CN117994708A
Authority
CN
China
Prior art keywords
video
diffusion model
sequence
time sequence
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410397545.4A
Other languages
Chinese (zh)
Other versions
CN117994708B (en)
Inventor
张盛平
王晨阳
吕晓倩
孟权令
柳青林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai filed Critical Harbin Institute of Technology Weihai
Priority to CN202410397545.4A priority Critical patent/CN117994708B/en
Publication of CN117994708A publication Critical patent/CN117994708A/en
Application granted granted Critical
Publication of CN117994708B publication Critical patent/CN117994708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a human body video generation method based on a temporally consistent latent-space-guided diffusion model, comprising the following steps: extracting a skeleton sequence from a video of a person's actions and composing it into a skeleton-sequence video; extracting features from the input video, the person image, and the skeleton sequence; inputting the features into a diffusion model for forward noising and noise prediction; defining a constraint supervised by the added noise and training the diffusion model; inputting the target pose sequence and the person image into the trained diffusion model; mapping spatio-temporal coordinates to pixel values through an implicit network; learning the implicit network's parameters with the target video and target pose sequence as supervision constraints; and extracting features from the implicit network's output video and feeding them into the trained diffusion model again to obtain the final video of the person's actions. The invention designs an iterative optimization strategy that improves the temporal continuity of the diffusion model's output through temporally consistent latent-space guidance, thereby improving the quality of pose-guided human body video generation.

Description

Human body video generation method based on temporally consistent latent-space-guided diffusion model
Technical Field
The invention relates to the technical field of image processing and pattern recognition, and in particular to a human body video generation method based on a temporally consistent latent-space-guided diffusion model.
Background
Pose-guided human body video generation aims to synthesize videos of a specific person performing specified actions, and has wide applications in human-computer interaction, motion analysis, virtual reality, and other fields. Most existing methods tackle the problem with generative adversarial networks, but such networks are difficult to train and their outputs are unstable. In recent years, diffusion models have injected new vitality into the field: they obtain high-fidelity human body images through noising and denoising processes. However, existing diffusion-based methods can only produce approximate images from text prompts and cannot generate videos of a specific person on demand. In addition, they consider only the per-frame generation quality and ignore the temporal relations between frames, so artifacts and flicker arise easily.
Disclosure of Invention
The invention aims to provide a human body video generation method based on a temporally consistent latent-space-guided diffusion model. It exploits the diffusion model's capability for high-fidelity human image generation to produce pose-guided human pictures, and exploits the temporal consistency of a video implicit network's reconstructions to provide temporally consistent latent-space guidance to the diffusion model through an iterative optimization strategy, so that the two components reinforce each other and improve the quality of pose-guided human body video generation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A human body video generation method based on a temporally consistent latent-space-guided diffusion model comprises the following steps:
extracting a skeleton sequence from a person action video and composing a skeleton-sequence video;
extracting features from the input person action video, person image, and skeleton sequence;
inputting the extracted features into a diffusion model for forward noising and noise prediction;
defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters;
inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence;
mapping spatio-temporal coordinates to pixel values through an implicit network;
training the implicit network by minimizing a constraint supervised by the target video and target pose sequence obtained in the preceding steps, and learning its parameters;
extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance, to obtain the final pose-guided person action video.
Further, the extracting a skeleton sequence from the person action video and composing a skeleton-sequence video includes:
acquiring 21 two-dimensional skeleton keypoints and the bones connecting them, taking the keypoints as nodes and the bones as edges between nodes, and constructing the skeleton-sequence video corresponding to the input video.
Further, the extracting features from the input person action video, person image, and skeleton sequence includes:
extracting features from the person image and the skeleton image sequence through a multi-layer convolutional neural network and an attention network, and concatenating them; the input person action video is encoded by a variational autoencoder.
Further, the inputting the features into the diffusion model for forward noising and noise prediction includes:
noising the encoding of the input video at the latent-space level according to the noising formula of the pre-trained diffusion model, and inputting the result to a diffusion model built on a U-shaped neural network for noise prediction.
Further, the defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters includes:
constraining the difference between the predicted noise and the actually added noise to be small, training the model by minimizing this constraint, and continuously updating the diffusion model's parameters.
Further, the inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence includes:
loading the diffusion model's parameters, taking the target pose sequence, the person image, and random noise as inputs, and obtaining the video of the specific person under the target pose through the diffusion model's denoising process.
Further, the mapping spatio-temporal coordinates to pixel values through the implicit network includes:
taking coordinate values $(x, y, t)$ as input, stored in the form of a hash table, and mapping them through an implicit network built from a multi-layer perceptron to pixel values $(r, g, b)$, obtaining a predicted video.
Further, the training the implicit network by minimizing a constraint supervised by the target video and target pose sequence obtained in the preceding steps, and learning its parameters, includes:
taking the person video under the target pose sequence obtained in the preceding steps as pixel supervision and minimizing that constraint; meanwhile, computing the inter-frame optical flow of the input pose sequence, taking it as temporal supervision, and reducing the difference between the optical flow of the predicted video and that of the pose sequence; the video implicit network's parameters are updated by minimizing this constraint.
Further, the extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance to obtain the final pose-guided person action video includes:
loading the implicit network and reconstructing a temporally consistent video; loading the diffusion model's parameters again; extracting the latent-space features of the temporally consistent video output by the implicit network and using them as temporally continuous latent features; and denoising in feature space conditioned on the target poses and the person image, obtaining a temporally consistent, pose-guided person action video.
The effects stated in this summary are merely those of embodiments, not all effects of the invention. The above technical solution has the following advantages or beneficial effects:
The human body video generation method based on a temporally consistent latent-space-guided diffusion model provided by the invention overcomes both the inability of existing methods to generate a specific person and the temporal inconsistency of single-image generation methods. It designs an iterative optimization strategy that fully exploits the diffusion model's high-fidelity human image generation and the video implicit network's temporally consistent reconstruction, and improves the continuity of the diffusion model's output through temporally consistent latent-space guidance, thereby improving the quality of pose-guided human body video generation.
Drawings
FIG. 1 is a flow chart of the human body video generation method based on a temporally consistent latent-space-guided diffusion model.
Detailed Description
As shown in FIG. 1, the human body video generation method based on a temporally consistent latent-space-guided diffusion model comprises the following steps:
S1, extracting a skeleton sequence from a person action video and composing a skeleton-sequence video;
S2, extracting features from the input person action video, person image, and skeleton sequence;
S3, inputting the extracted features into a diffusion model for forward noising and noise prediction;
S4, defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters;
S5, inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence;
S6, mapping spatio-temporal coordinates to pixel values through an implicit network;
S7, training the implicit network by minimizing a constraint supervised by the target video and target pose sequence obtained in the preceding steps, and learning its parameters;
S8, extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance, to obtain the final pose-guided person action video.
In step S1, 21 two-dimensional skeleton keypoints and the bones connecting them are obtained; the keypoints are taken as nodes and the bones as edges between nodes, and the skeleton-sequence video corresponding to the person action video is constructed. Concretely, for a given person action video, the 21 two-dimensional keypoints are obtained by manual annotation or by an existing human keypoint detection method; all keypoints are connected according to the connection relations of human joints; and from these nodes and edges, a skeleton sequence of the same size as the video frames is rendered. A sketch of this construction follows.
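As a concrete illustration, the rendering of a skeleton-sequence video can be sketched as follows in Python. The BONES connectivity list and the (num_frames, 21, 2) keypoint layout are illustrative assumptions, not the patent's exact joint topology or data format.

```python
import numpy as np
import cv2

# Hypothetical subset of the 21-joint connectivity; the actual bone list
# follows the connection relations of human joints described above.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6)]

def skeleton_video(keypoints, height, width):
    """keypoints: (num_frames, 21, 2) pixel coordinates -> list of skeleton frames."""
    frames = []
    for kps in keypoints.astype(int):
        canvas = np.zeros((height, width, 3), dtype=np.uint8)
        for a, b in BONES:  # bones drawn as edges between keypoint nodes
            cv2.line(canvas, tuple(map(int, kps[a])), tuple(map(int, kps[b])),
                     (255, 255, 255), 3)
        for x, y in kps:    # keypoints drawn as nodes
            cv2.circle(canvas, (int(x), int(y)), 4, (0, 255, 0), -1)
        frames.append(canvas)
    return frames

frames = skeleton_video(np.random.rand(8, 21, 2) * 255, height=256, width=256)
```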
In step S2, features of the person image and the skeleton image sequence are extracted through a multi-layer convolutional neural network and an attention network and concatenated, and the input video is encoded by a variational autoencoder. Concretely, the person image and the skeleton image sequence are resized to 512×512 and each encoded by a two-layer convolutional neural network followed by a temporal attention network; the input video is encoded by a variational autoencoder pre-trained on large-scale data; and the three sets of features are concatenated to form the input features of the video, as sketched below.
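A minimal sketch of this feature pipeline in PyTorch follows. The layer widths, the use of nn.MultiheadAttention as the temporal attention, and the random stand-in for the VAE latent are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Two-layer convolutional encoder followed by temporal attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.SiLU(),
            nn.Conv2d(dim, dim, 4, stride=4), nn.SiLU(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames):                  # (T, 3, 512, 512)
        f = self.conv(frames)                   # (T, C, 32, 32)
        t, c, h, w = f.shape
        tok = f.flatten(2).permute(2, 0, 1)     # (H*W, T, C): attend over time per location
        tok, _ = self.attn(tok, tok, tok)
        return tok.permute(1, 2, 0).reshape(t, c, h, w)

person = torch.randn(8, 3, 512, 512)    # person image tiled over 8 frames
skeleton = torch.randn(8, 3, 512, 512)  # skeleton-sequence frames
z = torch.randn(8, 64, 32, 32)          # stand-in for the VAE encoding of the input video
cond = torch.cat([z, ConditionEncoder()(person), ConditionEncoder()(skeleton)], dim=1)
```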
In step S3, according to the noising formula of the pre-trained diffusion model, the encoding of the input video is noised at the latent-space level and input to a diffusion model built on a U-shaped neural network for noise prediction. The input features undergo $T$ noising steps, and the noising process can be defined as:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $t$ is the noising step, $z_t$ is the latent feature at step $t$, $\epsilon$ is Gaussian noise, and $\bar{\alpha}_t$ is the noise coefficient at step $t$. Through this formula, noisy latent features are obtained; the noised features are input to a U-shaped neural network with a cross-attention mechanism, and the network predicts the noise added at each step. A sketch of this noising step follows.
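A minimal sketch of the forward noising under a standard DDPM formulation; T = 1000 and the linear β schedule are assumptions about the pre-trained model.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative coefficients ᾱ_t

def add_noise(z0, t):
    """Sample z_t = sqrt(ᾱ_t) z_0 + sqrt(1 - ᾱ_t) ε in closed form."""
    eps = torch.randn_like(z0)                 # Gaussian noise ε
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * z0 + (1 - a).sqrt() * eps, eps

z0 = torch.randn(8, 64, 32, 32)                # latent encoding of the input video
t = torch.randint(0, T, (8,))                  # a random noising step per sample
zt, eps = add_noise(z0, t)                     # eps becomes the supervision target (S4)
```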
In step S4, the difference between the predicted noise and the actually added noise is constrained to be small; the model is trained by minimizing this constraint, continuously updating the diffusion model's parameters. The objective function of the diffusion model is defined as:

$$\mathcal{L}_{diff} = \mathbb{E}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|^2 \right]$$

where $t$ is the denoising step, $\epsilon$ is the noise added in step S3, $\epsilon_\theta(z_t, t)$ is the noise predicted by the network, and $z_t$ is the latent feature at step $t$. The objective constrains the network's predicted noise to be as close as possible to the applied noise, so that the network can obtain high-realism images through denoising. Model parameters are continuously updated by minimizing this objective, yielding a diffusion model specific to the given person. One training step is sketched below.
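A minimal sketch of one optimization step of this objective; the single convolution stands in for the U-shaped network, an assumption made for brevity.

```python
import torch
import torch.nn.functional as F

unet = torch.nn.Conv2d(64, 64, 3, padding=1)   # stand-in for the U-shaped network ε_θ
opt = torch.optim.Adam(unet.parameters(), lr=1e-4)

zt = torch.randn(8, 64, 32, 32)                # noisy latent, as in the S3 sketch
eps = torch.randn(8, 64, 32, 32)               # the noise actually added in S3
loss = F.mse_loss(unet(zt), eps)               # ||ε − ε_θ(z_t, t)||²
opt.zero_grad(); loss.backward(); opt.step()   # update diffusion model parameters
```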
In step S5, the diffusion model's parameters are loaded; the target pose sequence, the person image, and random noise are taken as inputs; and the video of the specific person under the target pose is obtained through the diffusion model's denoising process. Concretely, the trained person-specific diffusion model is loaded, the person image, the skeleton sequence, and noise are taken as inputs, the noise is predicted by the trained model, and the video of the specific person under the target pose sequence is obtained through the DDIM denoising formula:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t)$$

where $t$ is the sampling step (20 steps in total), $x_t$ is the sample at step $t$, $\bar{\alpha}_t$ is a parameter that varies with the sampling step, and $\epsilon_\theta(x_t, t)$ is the noise estimated by the pre-trained diffusion model at step $t$. Equivalently, the update characterizes $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \mu_t(x_t, x_0),\, \sigma_t^2 I\right)$, the probability distribution of the step-$(t{-}1)$ sample $x_{t-1}$ given the clean sample $x_0$ and the step-$t$ sample $x_t$, where $\mu_t$ is the mean and $\sigma_t^2$ the variance of the Gaussian. The sampling loop is sketched below.
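A minimal sketch of the 20-step deterministic DDIM loop (σ_t = 0); the schedule and the zero-noise stand-in predictor are assumptions.

```python
import torch

T, S = 1000, 20
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)
eps_theta = lambda x, t: torch.zeros_like(x)   # stand-in for the trained predictor ε_θ

x = torch.randn(1, 64, 32, 32)                 # start from random noise
steps = torch.linspace(T - 1, 0, S).long()     # 20 sampling steps
for i, t in enumerate(steps):
    a_t = alpha_bar[t]
    a_prev = alpha_bar[steps[i + 1]] if i + 1 < S else torch.tensor(1.0)
    eps = eps_theta(x, t)
    x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
    x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
# the final latent is decoded by the VAE decoder into video frames (omitted)
```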
In step S6, coordinate values $(x, y, t)$ are taken as input, stored in the form of a hash table, and mapped through an implicit network built from a multi-layer perceptron to pixel values $(r, g, b)$, yielding the predicted video. Concretely, the pixel coordinates $(x, y, t)$ of each video frame are taken as input; high-frequency features are obtained through hash-grid encoding; the features are input to the video implicit network, where a multi-layer fully connected network predicts each frame's offset relative to a canonical-space frame; another fully connected network decodes the offset back to the image level; and each frame is reconstructed by combining it with the canonical-space image. A simplified sketch follows.
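A simplified sketch of the coordinate-to-pixel mapping. The single-resolution hash table and the direct (x, y, t) → (r, g, b) mapping are simplifications (the canonical-frame offset decomposition described above is omitted), so treat this as illustrative only.

```python
import torch
import torch.nn as nn

class HashMLP(nn.Module):
    def __init__(self, table_size=2**16, feat_dim=8, res=256):
        super().__init__()
        self.table = nn.Embedding(table_size, feat_dim)  # features stored in a hash table
        self.table_size, self.res = table_size, res
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 3))       # -> (r, g, b)

    def forward(self, coords):                           # (N, 3) in [0, 1]: (x, y, t)
        idx = (coords * (self.res - 1)).long()
        p = idx * torch.tensor([1, 2654435761, 805459861])  # spatial-hash primes
        h = (p[:, 0] ^ p[:, 1] ^ p[:, 2]) % self.table_size
        return torch.sigmoid(self.mlp(self.table(h)))

net = HashMLP()
rgb = net(torch.rand(4096, 3))   # predicted pixel values for sampled coordinates
```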
In step S7, the person video under the target pose sequence obtained in the preceding steps is taken as pixel supervision and the corresponding constraint is minimized; meanwhile, the inter-frame optical flow of the input pose sequence is computed and taken as temporal supervision, reducing the difference between the optical flow of the predicted video and that of the pose sequence; the video implicit network's parameters are updated by minimizing this constraint. The optimization function is defined as:

$$\mathcal{L} = \mathcal{L}_{pix} + \lambda\, \mathcal{L}_{flow}$$

where $\lambda$ is a weight parameter. The first term,

$$\mathcal{L}_{pix} = \sum_{i=1}^{N} \left\| D\big(M(h(x, y, i))\big) - V_i \right\|^2,$$

measures the difference between the reconstructed frame and the target video obtained from the diffusion model, where $h$ is the hash encoding, $M$ and $D$ are multi-layer fully connected networks, and $V_i$ is the $i$-th frame of the person video. The second term,

$$\mathcal{L}_{flow} = \sum_{i=1}^{N-1} \left\| F^{pred}_{i \to i+1}(x, y) - F^{pose}_{i \to i+1}(x, y) \right\|,$$

measures the difference between the optical flow of the predicted video and that of the pose sequence, where $i$ is the frame index, $N$ is the total number of frames, and $F^{pose}_{i \to i+1}(x, y)$ is the optical-flow value at coordinate $(x, y)$ of the target pose sequence from frame $i$ to frame $i{+}1$. The video implicit network's parameters are continuously updated by minimizing this function; a sketch follows.
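A minimal sketch of this combined objective; the finite-difference flow stub stands in for a real optical-flow estimator (e.g. RAFT), and λ = 0.1 is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def flow(video):
    """Stand-in for an optical-flow estimator; returns a per-pair motion proxy."""
    return video[1:] - video[:-1]                  # placeholder finite difference

lam = 0.1                                          # weight parameter λ
pred = torch.rand(8, 3, 64, 64, requires_grad=True)  # implicit-network reconstruction
target = torch.rand(8, 3, 64, 64)                  # diffusion-model video (step S5)
pose = torch.rand(8, 3, 64, 64)                    # target pose sequence

loss = F.mse_loss(pred, target) + lam * F.l1_loss(flow(pred), flow(pose))
loss.backward()                                    # gradients flow into the implicit network
```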
In step S8, the implicit network is loaded and a temporally consistent video is reconstructed; the diffusion model's parameters are loaded again; the latent-space features of the temporally consistent video output by the implicit network are extracted and used as temporally continuous latent features; and, conditioned on the target poses and the person image, denoising is performed in feature space to obtain a temporally consistent, pose-guided person action video. Concretely, the video implicit network is loaded and frozen, and per-frame coordinates are taken as input to obtain a temporally consistent video; this video is encoded into latent space by the variational autoencoder and noised by the diffusion model trained in step S4, serving as a temporally continuous latent-space guidance feature for the diffusion model; the target skeleton sequence and the person image are then encoded, and the feature processing of step S2 and the denoising process of step S5 are repeated to obtain the temporally consistent, pose-guided person action video. A high-level sketch follows.
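The iterative guidance can be sketched at a high level as follows; every component here is a stand-in (the real VAE, noising, and DDIM sampler appear in the S2, S3, and S5 sketches above), and the partial-noising depth t0 is an assumption about how the latent guide is injected.

```python
import torch

vae_encode = lambda v: torch.randn(8, 64, 32, 32)      # stand-in VAE encoder (see S2)
add_noise_to = lambda z, t0: z + torch.randn_like(z)   # stand-in forward noising (see S3)
ddim_from = lambda z, t0: z                            # stand-in DDIM sampler from step t0 (see S5)

consistent = torch.rand(8, 3, 512, 512)  # temporally consistent video from the frozen implicit network
z_guide = vae_encode(consistent)         # temporally continuous latent-space guide
t0 = 600                                 # partial-noising depth (a free choice)
z_final = ddim_from(add_noise_to(z_guide, t0), t0)  # denoised, conditioned on pose + person image
# decoding z_final yields the temporally consistent, pose-guided person action video
```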
While the embodiments of the invention have been described above in conjunction with the drawings, they are not intended to limit the scope of the invention; all modifications or variations falling within the scope defined by the claims of the invention are intended to be covered.

Claims (9)

1. A human body video generation method based on a temporally consistent latent-space-guided diffusion model, characterized by comprising the following steps:
step one, extracting a skeleton sequence from a person action video and composing a skeleton-sequence video;
step two, extracting features from the input person action video, person image, and skeleton sequence;
step three, inputting the features extracted in step two into a diffusion model for forward noising and noise prediction;
step four, defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters;
step five, inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence;
step six, mapping spatio-temporal coordinates to pixel values through an implicit network;
step seven, training the implicit network by minimizing a constraint supervised by the person action video under the target pose sequence obtained in step five, and learning its parameters;
step eight, extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance, to obtain the final pose-guided person action video.
2. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step one comprises:
acquiring 21 two-dimensional skeleton keypoints and the bones connecting them from the person action video, taking the keypoints as nodes and the bones as edges between nodes, and constructing the skeleton-sequence video corresponding to the person action video.
3. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step two comprises:
extracting features from the person image and the skeleton image sequence through a multi-layer convolutional neural network and an attention network, and concatenating them; the input person action video is encoded by a variational autoencoder.
4. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step three comprises:
noising the encoding of the input person action video at the latent-space level according to the noising formula of the pre-trained diffusion model, and inputting the result to a diffusion model built on a U-shaped neural network for noise prediction.
5. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step four comprises:
constraining the difference between the predicted noise and the actually added noise to be small, and training the diffusion model by minimizing this constraint, continuously updating the diffusion model's parameters.
6. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step five comprises:
loading the diffusion model's parameters, taking the target pose sequence, the person image, and random noise as inputs, and obtaining the video of the specific person under the target pose through the diffusion model's denoising process.
7. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step six comprises:
taking coordinate values as input, storing them in the form of a hash table, and mapping them through an implicit network built from a multi-layer perceptron to pixel values, obtaining a predicted video.
8. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step seven comprises:
taking the person action video under the target pose sequence obtained in step five as pixel supervision and minimizing that constraint; meanwhile, computing the inter-frame optical flow of the input pose sequence, taking it as temporal supervision, and reducing the difference between the optical flow of the predicted video and that of the pose sequence; the video implicit network's parameters are updated by minimizing this constraint.
9. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step eight comprises:
loading the implicit network and reconstructing a temporally consistent video; loading the diffusion model's parameters again; extracting the latent-space features of the temporally consistent video output by the implicit network and using them as temporally continuous latent features; and denoising in feature space conditioned on the target poses and the person image, obtaining a temporally consistent, pose-guided person action video.
CN202410397545.4A 2024-04-03 2024-04-03 Human body video generation method based on temporally consistent latent-space-guided diffusion model Active CN117994708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410397545.4A CN117994708B (en) 2024-04-03 2024-04-03 Human body video generation method based on temporally consistent latent-space-guided diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410397545.4A CN117994708B (en) 2024-04-03 2024-04-03 Human body video generation method based on temporally consistent latent-space-guided diffusion model

Publications (2)

Publication Number Publication Date
CN117994708A true CN117994708A (en) 2024-05-07
CN117994708B CN117994708B (en) 2024-05-31

Family

ID=90893626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410397545.4A Active CN117994708B (en) Human body video generation method based on temporally consistent latent-space-guided diffusion model

Country Status (1)

Country Link
CN (1) CN117994708B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113099146A (en) * 2019-12-19 2021-07-09 华为技术有限公司 Video generation method and device and related equipment
CN115244583A (en) * 2020-12-24 2022-10-25 辉达公司 Generating a three-dimensional model of motion using motion migration
WO2023098664A1 (en) * 2021-11-30 2023-06-08 北京字节跳动网络技术有限公司 Method, device and apparatus for generating special effect video, and storage medium
CN116883524A (en) * 2022-03-25 2023-10-13 腾讯科技(深圳)有限公司 Image generation model training, image generation method and device and computer equipment
CN114694261A (en) * 2022-04-14 2022-07-01 重庆邮电大学 Video three-dimensional human body posture estimation method and system based on multi-level supervision graph convolution
CN117710181A (en) * 2022-09-15 2024-03-15 辉达公司 Video generation techniques
US20240095989A1 (en) * 2022-09-15 2024-03-21 Nvidia Corporation Video generation techniques
CN115965836A (en) * 2023-01-12 2023-04-14 厦门大学 Human behavior posture video data amplification system and method with controllable semantics
CN116681838A (en) * 2023-07-07 2023-09-01 中南大学 Monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU HAITAO; YANG XIAOSHAN; XU CHANGSHENG: "Adversarial video generation method based on multi-modal input", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, no. 07, 7 July 2020 (2020-07-07) *

Also Published As

Publication number Publication date
CN117994708B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
Zhao et al. Learning to forecast and refine residual motion for image-to-video generation
CN111242844B (en) Image processing method, device, server and storage medium
CN112164067A (en) Medical image segmentation method and device based on multi-mode subspace clustering
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
Cheng et al. DDU-Net: A dual dense U-structure network for medical image segmentation
CN111462274A (en) Human body image synthesis method and system based on SMP L model
Xu et al. AutoSegNet: An automated neural network for image segmentation
CN115293986A (en) Multi-temporal remote sensing image cloud region reconstruction method
CN111738092B (en) Method for recovering occluded human body posture sequence based on deep learning
CN117593275A (en) Medical image segmentation system
CN117994708B (en) Human body video generation method based on time sequence consistent hidden space guiding diffusion model
CN117094365A (en) Training method and device for image-text generation model, electronic equipment and medium
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
CN115374854A (en) Multi-modal emotion recognition method and device and computer readable storage medium
CN114333069A (en) Object posture processing method, device, equipment and storage medium
CN113362281A (en) Infrared and visible light image fusion method based on WSN-LatLRR
Zhang et al. Scale-progressive multi-patch network for image dehazing
CN116958423B (en) Text-based three-dimensional modeling method, image rendering method and device
CN111401141B (en) 3D gesture estimation method based on skeleton
CN118037898B (en) Text generation video method based on image guided video editing
CN117576248B (en) Image generation method and device based on gesture guidance
CN116542292B (en) Training method, device, equipment and storage medium of image generation model
CN111724467B (en) Voxel model generation method and system for 3D printing
Qiu Image Reconstruction of Tang Sancai Figurines Based on Artificial Intelligence Image Extraction Technology Based on Ration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant