CN117994708A - Human body video generation method based on temporally consistent latent-space-guided diffusion model - Google Patents

Human body video generation method based on temporally consistent latent-space-guided diffusion model Download PDF

Info

Publication number
CN117994708A
Authority
CN
China
Prior art keywords
video
diffusion model
sequence
time sequence
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410397545.4A
Other languages
Chinese (zh)
Other versions
CN117994708B (en)
Inventor
张盛平
王晨阳
吕晓倩
孟权令
柳青林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai filed Critical Harbin Institute of Technology Weihai
Priority to CN202410397545.4A priority Critical patent/CN117994708B/en
Publication of CN117994708A publication Critical patent/CN117994708A/en
Application granted granted Critical
Publication of CN117994708B publication Critical patent/CN117994708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a human body video generation method based on a temporally consistent latent-space-guided diffusion model, comprising the following steps: extracting a skeleton sequence from a video of a person's actions and composing it into a skeleton-sequence video; extracting features from the input video, the person image, and the skeleton sequence; inputting the features into a diffusion model for forward noising and noise prediction; defining a constraint supervised by the added noise and training the diffusion model; inputting the target pose sequence and the person image into the trained diffusion model; mapping spatio-temporal coordinates to pixel values through an implicit network; learning the implicit network's parameters with the target video and target pose sequence as supervision constraints; and extracting features from the implicit network's output video and feeding them into the trained diffusion model again to obtain the final video of the person's actions. The invention designs an iterative optimization strategy that improves the temporal continuity of the diffusion model's output through temporally consistent latent-space guidance, thereby improving the quality of pose-guided human body video generation.

Description

Human body video generation method based on temporally consistent latent-space-guided diffusion model
Technical Field
The invention relates to the technical field of image processing and pattern recognition, and in particular to a human body video generation method based on a temporally consistent latent-space-guided diffusion model.
Background
Pose-guided human body video generation aims to synthesize videos of a specific person performing specified actions, and has wide applications in human-computer interaction, motion analysis, virtual reality, and other fields. Most existing methods tackle the problem with generative adversarial networks, but such networks are difficult to train and their outputs are unstable. In recent years, diffusion models have injected new vitality into the field: they obtain high-fidelity human body images through noising and denoising processes. However, existing diffusion-based methods can only produce approximate images from text prompts and cannot generate videos of a specific person on demand. In addition, they consider only the per-frame generation quality and ignore the temporal relations between frames, so artifacts and flicker arise easily.
Disclosure of Invention
The invention aims to provide a human body video generation method based on a temporally consistent latent-space-guided diffusion model. It exploits the diffusion model's capability for high-fidelity human image generation to produce pose-guided human pictures, and exploits the temporal consistency of a video implicit network's reconstructions to provide temporally consistent latent-space guidance to the diffusion model through an iterative optimization strategy, so that the two components reinforce each other and improve the quality of pose-guided human body video generation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A human body video generation method based on a temporally consistent latent-space-guided diffusion model comprises the following steps:
extracting a skeleton sequence from a person action video and composing a skeleton-sequence video;
extracting features from the input person action video, person image, and skeleton sequence;
inputting the extracted features into a diffusion model for forward noising and noise prediction;
defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters;
inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence;
mapping spatio-temporal coordinates to pixel values through an implicit network;
training the implicit network by minimizing a constraint supervised by the target video and target pose sequence obtained in the preceding steps, and learning its parameters;
extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance, to obtain the final pose-guided person action video.
Further, the extracting a skeleton sequence from the person action video and composing a skeleton-sequence video includes:
acquiring 21 two-dimensional skeleton keypoints and the bones connecting them, taking the keypoints as nodes and the bones as edges between nodes, and constructing the skeleton-sequence video corresponding to the input video.
Further, the extracting features from the input person action video, person image, and skeleton sequence includes:
extracting features from the person image and the skeleton image sequence through a multi-layer convolutional neural network and an attention network, and concatenating them; the input person action video is encoded by a variational autoencoder.
Further, the inputting the features into the diffusion model for forward noising and noise prediction includes:
noising the encoding of the input video at the latent-space level according to the noising formula of the pre-trained diffusion model, and inputting the result to a diffusion model built on a U-shaped neural network for noise prediction.
Further, the defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters includes:
constraining the difference between the predicted noise and the actually added noise to be small, training the model by minimizing this constraint, and continuously updating the diffusion model's parameters.
Further, the inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence includes:
loading the diffusion model's parameters, taking the target pose sequence, the person image, and random noise as inputs, and obtaining the video of the specific person under the target pose through the diffusion model's denoising process.
Further, the mapping spatio-temporal coordinates to pixel values through the implicit network includes:
taking coordinate values $(x, y, t)$ as input, stored in the form of a hash table, and mapping them through an implicit network built from a multi-layer perceptron to pixel values $(r, g, b)$, obtaining a predicted video.
Further, the training the implicit network by minimizing a constraint supervised by the target video and target pose sequence obtained in the preceding steps, and learning its parameters, includes:
taking the person video under the target pose sequence obtained in the preceding steps as pixel supervision and minimizing that constraint; meanwhile, computing the inter-frame optical flow of the input pose sequence, taking it as temporal supervision, and reducing the difference between the optical flow of the predicted video and that of the pose sequence; the video implicit network's parameters are updated by minimizing this constraint.
Further, the extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance to obtain the final pose-guided person action video includes:
loading the implicit network and reconstructing a temporally consistent video; loading the diffusion model's parameters again; extracting the latent-space features of the temporally consistent video output by the implicit network and using them as temporally continuous latent features; and denoising in feature space conditioned on the target poses and the person image, obtaining a temporally consistent, pose-guided person action video.
The effects stated in this summary are merely those of embodiments, not all effects of the invention. The above technical solution has the following advantages or beneficial effects:
The human body video generation method based on a temporally consistent latent-space-guided diffusion model provided by the invention overcomes both the inability of existing methods to generate a specific person and the temporal inconsistency of single-image generation methods. It designs an iterative optimization strategy that fully exploits the diffusion model's high-fidelity human image generation and the video implicit network's temporally consistent reconstruction, and improves the continuity of the diffusion model's output through temporally consistent latent-space guidance, thereby improving the quality of pose-guided human body video generation.
Drawings
FIG. 1 is a flow chart of the human body video generation method based on a temporally consistent latent-space-guided diffusion model.
Detailed Description
As shown in FIG. 1, the human body video generation method based on a temporally consistent latent-space-guided diffusion model comprises the following steps:
S1, extracting a skeleton sequence from a person action video and composing a skeleton-sequence video;
S2, extracting features from the input person action video, person image, and skeleton sequence;
S3, inputting the extracted features into a diffusion model for forward noising and noise prediction;
S4, defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters;
S5, inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence;
S6, mapping spatio-temporal coordinates to pixel values through an implicit network;
S7, training the implicit network by minimizing a constraint supervised by the target video and target pose sequence obtained in the preceding steps, and learning its parameters;
S8, extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance, to obtain the final pose-guided person action video.
In step S1, 21 two-dimensional skeleton keypoints and the bones connecting them are obtained; the keypoints are taken as nodes and the bones as edges between nodes, and the skeleton-sequence video corresponding to the person action video is constructed. Concretely, for a given person action video, the 21 two-dimensional keypoints are obtained by manual annotation or by an existing human keypoint detection method; all keypoints are connected according to the connection relations of human joints; and from these nodes and edges, a skeleton sequence of the same size as the video frames is rendered. A sketch of this construction follows.
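As a concrete illustration, the rendering of a skeleton-sequence video can be sketched as follows in Python. The BONES connectivity list and the (num_frames, 21, 2) keypoint layout are illustrative assumptions, not the patent's exact joint topology or data format.

```python
import numpy as np
import cv2

# Hypothetical subset of the 21-joint connectivity; the actual bone list
# follows the connection relations of human joints described above.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6)]

def skeleton_video(keypoints, height, width):
    """keypoints: (num_frames, 21, 2) pixel coordinates -> list of skeleton frames."""
    frames = []
    for kps in keypoints.astype(int):
        canvas = np.zeros((height, width, 3), dtype=np.uint8)
        for a, b in BONES:  # bones drawn as edges between keypoint nodes
            cv2.line(canvas, tuple(map(int, kps[a])), tuple(map(int, kps[b])),
                     (255, 255, 255), 3)
        for x, y in kps:    # keypoints drawn as nodes
            cv2.circle(canvas, (int(x), int(y)), 4, (0, 255, 0), -1)
        frames.append(canvas)
    return frames

frames = skeleton_video(np.random.rand(8, 21, 2) * 255, height=256, width=256)
```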
In step S2, features of the person image and the skeleton image sequence are extracted through a multi-layer convolutional neural network and an attention network and concatenated, and the input video is encoded by a variational autoencoder. Concretely, the person image and the skeleton image sequence are resized to 512×512 and each encoded by a two-layer convolutional neural network followed by a temporal attention network; the input video is encoded by a variational autoencoder pre-trained on large-scale data; and the three sets of features are concatenated to form the input features of the video, as sketched below.
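A minimal sketch of this feature pipeline in PyTorch follows. The layer widths, the use of nn.MultiheadAttention as the temporal attention, and the random stand-in for the VAE latent are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Two-layer convolutional encoder followed by temporal attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.SiLU(),
            nn.Conv2d(dim, dim, 4, stride=4), nn.SiLU(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames):                  # (T, 3, 512, 512)
        f = self.conv(frames)                   # (T, C, 32, 32)
        t, c, h, w = f.shape
        tok = f.flatten(2).permute(2, 0, 1)     # (H*W, T, C): attend over time per location
        tok, _ = self.attn(tok, tok, tok)
        return tok.permute(1, 2, 0).reshape(t, c, h, w)

person = torch.randn(8, 3, 512, 512)    # person image tiled over 8 frames
skeleton = torch.randn(8, 3, 512, 512)  # skeleton-sequence frames
z = torch.randn(8, 64, 32, 32)          # stand-in for the VAE encoding of the input video
cond = torch.cat([z, ConditionEncoder()(person), ConditionEncoder()(skeleton)], dim=1)
```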
In step S3, according to the noising formula of the pre-trained diffusion model, the encoding of the input video is noised at the latent-space level and input to a diffusion model built on a U-shaped neural network for noise prediction. The input features undergo $T$ noising steps, and the noising process can be defined as:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $t$ is the noising step, $z_t$ is the latent feature at step $t$, $\epsilon$ is Gaussian noise, and $\bar{\alpha}_t$ is the noise coefficient at step $t$. Through this formula, noisy latent features are obtained; the noised features are input to a U-shaped neural network with a cross-attention mechanism, and the network predicts the noise added at each step. A sketch of this noising step follows.
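A minimal sketch of the forward noising under a standard DDPM formulation; T = 1000 and the linear β schedule are assumptions about the pre-trained model.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative coefficients ᾱ_t

def add_noise(z0, t):
    """Sample z_t = sqrt(ᾱ_t) z_0 + sqrt(1 - ᾱ_t) ε in closed form."""
    eps = torch.randn_like(z0)                 # Gaussian noise ε
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * z0 + (1 - a).sqrt() * eps, eps

z0 = torch.randn(8, 64, 32, 32)                # latent encoding of the input video
t = torch.randint(0, T, (8,))                  # a random noising step per sample
zt, eps = add_noise(z0, t)                     # eps becomes the supervision target (S4)
```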
In step S4, the difference between the predicted noise and the actually added noise is constrained to be small; the model is trained by minimizing this constraint, continuously updating the diffusion model's parameters. The objective function of the diffusion model is defined as:

$$\mathcal{L}_{diff} = \mathbb{E}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|^2 \right]$$

where $t$ is the denoising step, $\epsilon$ is the noise added in step S3, $\epsilon_\theta(z_t, t)$ is the noise predicted by the network, and $z_t$ is the latent feature at step $t$. The objective constrains the network's predicted noise to be as close as possible to the applied noise, so that the network can obtain high-realism images through denoising. Model parameters are continuously updated by minimizing this objective, yielding a diffusion model specific to the given person. One training step is sketched below.
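A minimal sketch of one optimization step of this objective; the single convolution stands in for the U-shaped network, an assumption made for brevity.

```python
import torch
import torch.nn.functional as F

unet = torch.nn.Conv2d(64, 64, 3, padding=1)   # stand-in for the U-shaped network ε_θ
opt = torch.optim.Adam(unet.parameters(), lr=1e-4)

zt = torch.randn(8, 64, 32, 32)                # noisy latent, as in the S3 sketch
eps = torch.randn(8, 64, 32, 32)               # the noise actually added in S3
loss = F.mse_loss(unet(zt), eps)               # ||ε − ε_θ(z_t, t)||²
opt.zero_grad(); loss.backward(); opt.step()   # update diffusion model parameters
```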
In step S5, the diffusion model's parameters are loaded; the target pose sequence, the person image, and random noise are taken as inputs; and the video of the specific person under the target pose is obtained through the diffusion model's denoising process. Concretely, the trained person-specific diffusion model is loaded, the person image, the skeleton sequence, and noise are taken as inputs, the noise is predicted by the trained model, and the video of the specific person under the target pose sequence is obtained through the DDIM denoising formula:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t)$$

where $t$ is the sampling step (20 steps in total), $x_t$ is the sample at step $t$, $\bar{\alpha}_t$ is a parameter that varies with the sampling step, and $\epsilon_\theta(x_t, t)$ is the noise estimated by the pre-trained diffusion model at step $t$. Equivalently, the update characterizes $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \mu_t(x_t, x_0),\, \sigma_t^2 I\right)$, the probability distribution of the step-$(t{-}1)$ sample $x_{t-1}$ given the clean sample $x_0$ and the step-$t$ sample $x_t$, where $\mu_t$ is the mean and $\sigma_t^2$ the variance of the Gaussian. The sampling loop is sketched below.
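A minimal sketch of the 20-step deterministic DDIM loop (σ_t = 0); the schedule and the zero-noise stand-in predictor are assumptions.

```python
import torch

T, S = 1000, 20
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)
eps_theta = lambda x, t: torch.zeros_like(x)   # stand-in for the trained predictor ε_θ

x = torch.randn(1, 64, 32, 32)                 # start from random noise
steps = torch.linspace(T - 1, 0, S).long()     # 20 sampling steps
for i, t in enumerate(steps):
    a_t = alpha_bar[t]
    a_prev = alpha_bar[steps[i + 1]] if i + 1 < S else torch.tensor(1.0)
    eps = eps_theta(x, t)
    x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
    x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
# the final latent is decoded by the VAE decoder into video frames (omitted)
```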
In step S6, coordinate values $(x, y, t)$ are taken as input, stored in the form of a hash table, and mapped through an implicit network built from a multi-layer perceptron to pixel values $(r, g, b)$, yielding the predicted video. Concretely, the pixel coordinates $(x, y, t)$ of each video frame are taken as input; high-frequency features are obtained through hash-grid encoding; the features are input to the video implicit network, where a multi-layer fully connected network predicts each frame's offset relative to a canonical-space frame; another fully connected network decodes the offset back to the image level; and each frame is reconstructed by combining it with the canonical-space image. A simplified sketch follows.
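A simplified sketch of the coordinate-to-pixel mapping. The single-resolution hash table and the direct (x, y, t) → (r, g, b) mapping are simplifications (the canonical-frame offset decomposition described above is omitted), so treat this as illustrative only.

```python
import torch
import torch.nn as nn

class HashMLP(nn.Module):
    def __init__(self, table_size=2**16, feat_dim=8, res=256):
        super().__init__()
        self.table = nn.Embedding(table_size, feat_dim)  # features stored in a hash table
        self.table_size, self.res = table_size, res
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 3))       # -> (r, g, b)

    def forward(self, coords):                           # (N, 3) in [0, 1]: (x, y, t)
        idx = (coords * (self.res - 1)).long()
        p = idx * torch.tensor([1, 2654435761, 805459861])  # spatial-hash primes
        h = (p[:, 0] ^ p[:, 1] ^ p[:, 2]) % self.table_size
        return torch.sigmoid(self.mlp(self.table(h)))

net = HashMLP()
rgb = net(torch.rand(4096, 3))   # predicted pixel values for sampled coordinates
```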
In step S7, the person video under the target pose sequence obtained in the preceding steps is taken as pixel supervision and the corresponding constraint is minimized; meanwhile, the inter-frame optical flow of the input pose sequence is computed and taken as temporal supervision, reducing the difference between the optical flow of the predicted video and that of the pose sequence; the video implicit network's parameters are updated by minimizing this constraint. The optimization function is defined as:

$$\mathcal{L} = \mathcal{L}_{pix} + \lambda\, \mathcal{L}_{flow}$$

where $\lambda$ is a weight parameter. The first term,

$$\mathcal{L}_{pix} = \sum_{i=1}^{N} \left\| D\big(M(h(x, y, i))\big) - V_i \right\|^2,$$

measures the difference between the reconstructed frame and the target video obtained from the diffusion model, where $h$ is the hash encoding, $M$ and $D$ are multi-layer fully connected networks, and $V_i$ is the $i$-th frame of the person video. The second term,

$$\mathcal{L}_{flow} = \sum_{i=1}^{N-1} \left\| F^{pred}_{i \to i+1}(x, y) - F^{pose}_{i \to i+1}(x, y) \right\|,$$

measures the difference between the optical flow of the predicted video and that of the pose sequence, where $i$ is the frame index, $N$ is the total number of frames, and $F^{pose}_{i \to i+1}(x, y)$ is the optical-flow value at coordinate $(x, y)$ of the target pose sequence from frame $i$ to frame $i{+}1$. The video implicit network's parameters are continuously updated by minimizing this function; a sketch follows.
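A minimal sketch of this combined objective; the finite-difference flow stub stands in for a real optical-flow estimator (e.g. RAFT), and λ = 0.1 is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def flow(video):
    """Stand-in for an optical-flow estimator; returns a per-pair motion proxy."""
    return video[1:] - video[:-1]                  # placeholder finite difference

lam = 0.1                                          # weight parameter λ
pred = torch.rand(8, 3, 64, 64, requires_grad=True)  # implicit-network reconstruction
target = torch.rand(8, 3, 64, 64)                  # diffusion-model video (step S5)
pose = torch.rand(8, 3, 64, 64)                    # target pose sequence

loss = F.mse_loss(pred, target) + lam * F.l1_loss(flow(pred), flow(pose))
loss.backward()                                    # gradients flow into the implicit network
```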
In step S8, the implicit network is loaded and a temporally consistent video is reconstructed; the diffusion model's parameters are loaded again; the latent-space features of the temporally consistent video output by the implicit network are extracted and used as temporally continuous latent features; and, conditioned on the target poses and the person image, denoising is performed in feature space to obtain a temporally consistent, pose-guided person action video. Concretely, the video implicit network is loaded and frozen, and per-frame coordinates are taken as input to obtain a temporally consistent video; this video is encoded into latent space by the variational autoencoder and noised by the diffusion model trained in step S4, serving as a temporally continuous latent-space guidance feature for the diffusion model; the target skeleton sequence and the person image are then encoded, and the feature processing of step S2 and the denoising process of step S5 are repeated to obtain the temporally consistent, pose-guided person action video. A high-level sketch follows.
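The iterative guidance can be sketched at a high level as follows; every component here is a stand-in (the real VAE, noising, and DDIM sampler appear in the S2, S3, and S5 sketches above), and the partial-noising depth t0 is an assumption about how the latent guide is injected.

```python
import torch

vae_encode = lambda v: torch.randn(8, 64, 32, 32)      # stand-in VAE encoder (see S2)
add_noise_to = lambda z, t0: z + torch.randn_like(z)   # stand-in forward noising (see S3)
ddim_from = lambda z, t0: z                            # stand-in DDIM sampler from step t0 (see S5)

consistent = torch.rand(8, 3, 512, 512)  # temporally consistent video from the frozen implicit network
z_guide = vae_encode(consistent)         # temporally continuous latent-space guide
t0 = 600                                 # partial-noising depth (a free choice)
z_final = ddim_from(add_noise_to(z_guide, t0), t0)  # denoised, conditioned on pose + person image
# decoding z_final yields the temporally consistent, pose-guided person action video
```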
While the embodiments of the invention have been described above in conjunction with the drawings, they are not intended to limit the scope of the invention; all modifications or variations falling within the scope defined by the claims of the invention are intended to be covered.

Claims (9)

1. A human body video generation method based on a temporally consistent latent-space-guided diffusion model, characterized by comprising the following steps:
step one, extracting a skeleton sequence from a person action video and composing a skeleton-sequence video;
step two, extracting features from the input person action video, person image, and skeleton sequence;
step three, inputting the features extracted in step two into a diffusion model for forward noising and noise prediction;
step four, defining a constraint supervised by the added noise, training the diffusion model by minimizing the constraint, and learning its parameters;
step five, inputting the target pose sequence and the person image into the trained diffusion model to obtain the person action video under the target pose sequence;
step six, mapping spatio-temporal coordinates to pixel values through an implicit network;
step seven, training the implicit network by minimizing a constraint supervised by the person action video under the target pose sequence obtained in step five, and learning its parameters;
step eight, extracting features from the implicit network's output video and feeding them into the trained diffusion model again as guidance, to obtain the final pose-guided person action video.
2. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step one comprises:
acquiring 21 two-dimensional skeleton keypoints and the bones connecting them from the person action video, taking the keypoints as nodes and the bones as edges between nodes, and constructing the skeleton-sequence video corresponding to the person action video.
3. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step two comprises:
extracting features from the person image and the skeleton image sequence through a multi-layer convolutional neural network and an attention network, and concatenating them; the input person action video is encoded by a variational autoencoder.
4. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step three comprises:
noising the encoding of the input person action video at the latent-space level according to the noising formula of the pre-trained diffusion model, and inputting the result to a diffusion model built on a U-shaped neural network for noise prediction.
5. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step four comprises:
constraining the difference between the predicted noise and the actually added noise to be small, and training the diffusion model by minimizing this constraint, continuously updating the diffusion model's parameters.
6. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step five comprises:
loading the diffusion model's parameters, taking the target pose sequence, the person image, and random noise as inputs, and obtaining the video of the specific person under the target pose through the diffusion model's denoising process.
7. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step six comprises:
taking coordinate values as input, storing them in the form of a hash table, and mapping them through an implicit network built from a multi-layer perceptron to pixel values, obtaining a predicted video.
8. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step seven comprises:
taking the person action video under the target pose sequence obtained in step five as pixel supervision and minimizing that constraint; meanwhile, computing the inter-frame optical flow of the input pose sequence, taking it as temporal supervision, and reducing the difference between the optical flow of the predicted video and that of the pose sequence; the video implicit network's parameters are updated by minimizing this constraint.
9. The human body video generation method based on a temporally consistent latent-space-guided diffusion model according to claim 1, wherein step eight comprises:
loading the implicit network and reconstructing a temporally consistent video; loading the diffusion model's parameters again; extracting the latent-space features of the temporally consistent video output by the implicit network and using them as temporally continuous latent features; and denoising in feature space conditioned on the target poses and the person image, obtaining a temporally consistent, pose-guided person action video.
CN202410397545.4A 2024-04-03 2024-04-03 Human body video generation method based on temporally consistent latent-space-guided diffusion model Active CN117994708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410397545.4A CN117994708B (en) 2024-04-03 2024-04-03 Human body video generation method based on temporally consistent latent-space-guided diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410397545.4A CN117994708B (en) 2024-04-03 2024-04-03 Human body video generation method based on temporally consistent latent-space-guided diffusion model

Publications (2)

Publication Number Publication Date
CN117994708A true CN117994708A (en) 2024-05-07
CN117994708B CN117994708B (en) 2024-05-31

Family

ID=90893626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410397545.4A Active CN117994708B (en) Human body video generation method based on temporally consistent latent-space-guided diffusion model

Country Status (1)

Country Link
CN (1) CN117994708B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113099146A (en) * 2019-12-19 2021-07-09 华为技术有限公司 Video generation method and device and related equipment
CN115244583A (en) * 2020-12-24 2022-10-25 辉达公司 Generating a three-dimensional model of motion using motion migration
WO2023098664A1 (en) * 2021-11-30 2023-06-08 北京字节跳动网络技术有限公司 Method, device and apparatus for generating special effect video, and storage medium
CN116883524A (en) * 2022-03-25 2023-10-13 腾讯科技(深圳)有限公司 Image generation model training, image generation method and device and computer equipment
CN114694261A (en) * 2022-04-14 2022-07-01 重庆邮电大学 Video three-dimensional human body posture estimation method and system based on multi-level supervision graph convolution
CN117710181A (en) * 2022-09-15 2024-03-15 辉达公司 Video generation techniques
US20240095989A1 (en) * 2022-09-15 2024-03-21 Nvidia Corporation Video generation techniques
CN115965836A (en) * 2023-01-12 2023-04-14 厦门大学 Human behavior posture video data amplification system and method with controllable semantics
CN116681838A (en) * 2023-07-07 2023-09-01 中南大学 Monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU HAITAO; YANG XIAOSHAN; XU CHANGSHENG: "Adversarial video generation method based on multi-modal input", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, no. 07, 7 July 2020 (2020-07-07) *

Also Published As

Publication number Publication date
CN117994708B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
Zhao et al. Learning to forecast and refine residual motion for image-to-video generation
CN111242844B (en) Image processing method, device, server and storage medium
CN112164067A (en) Medical image segmentation method and device based on multi-mode subspace clustering
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
Cheng et al. DDU-Net: A dual dense U-structure network for medical image segmentation
CN111462274A (en) Human body image synthesis method and system based on SMP L model
Xu et al. AutoSegNet: An automated neural network for image segmentation
CN115293986A (en) Multi-temporal remote sensing image cloud region reconstruction method
CN111738092B (en) Method for recovering occluded human body posture sequence based on deep learning
CN117593275A (en) Medical image segmentation system
CN117994708B (en) Human body video generation method based on time sequence consistent hidden space guiding diffusion model
CN117094365A (en) Training method and device for image-text generation model, electronic equipment and medium
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
CN115374854A (en) Multi-modal emotion recognition method and device and computer readable storage medium
CN114333069A (en) Object posture processing method, device, equipment and storage medium
CN113362281A (en) Infrared and visible light image fusion method based on WSN-LatLRR
Zhang et al. Scale-progressive multi-patch network for image dehazing
CN116958423B (en) Text-based three-dimensional modeling method, image rendering method and device
CN111401141B (en) 3D gesture estimation method based on skeleton
CN118037898B (en) Text generation video method based on image guided video editing
CN117576248B (en) Image generation method and device based on gesture guidance
CN116542292B (en) Training method, device, equipment and storage medium of image generation model
CN111724467B (en) Voxel model generation method and system for 3D printing
Qiu Image Reconstruction of Tang Sancai Figurines Based on Artificial Intelligence Image Extraction Technology Based on Ration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant