CN117880444B - Human body rehabilitation exercise video data generation method guided by long-short time features - Google Patents

Human body rehabilitation exercise video data generation method guided by long-short time features

Info

Publication number
CN117880444B
Authority
CN
China
Prior art keywords
video
attention
segmented
module
result
Prior art date
Legal status
Active
Application number
CN202410281162.0A
Other languages
Chinese (zh)
Other versions
CN117880444A (en)
Inventor
王宏升
林峰
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410281162.0A priority Critical patent/CN117880444B/en
Publication of CN117880444A publication Critical patent/CN117880444A/en
Application granted granted Critical
Publication of CN117880444B publication Critical patent/CN117880444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/40 Extraction of image or video features
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/08 Learning methods
    • G06V 10/806 Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; cameras specially adapted for the electronic generation of special effects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses a human rehabilitation exercise video data generation method guided by long-short time features. Image reference features corresponding to a reference image are extracted through an image reference network in a video generation model, and the (k-1)-th segmented video sample is input into a video reference network in the video generation model to obtain video reference features. The k-th segmented video sample is noised with generated noise to obtain a noisy segmented video sample; the k-th segmented gesture sequence, the noisy segmented video sample, the video reference features and the image reference features are input into a stable diffusion network in the video generation model, and the noise added to the k-th segmented video sample is predicted through the stable diffusion network to obtain predicted noise. The video generation model is then trained with minimizing the difference between the predicted noise and the generated noise as the optimization target, thereby improving video generation quality.

Description

Human body rehabilitation exercise video data generation method guided by long-short time features
Technical Field
The specification relates to the technical field of neural networks and video generation, in particular to a human rehabilitation exercise video data generation method guided by long-short time features.
Background
Currently, in the field of rehabilitation video generation, traditional methods generally render videos frame by frame, so the generated video lacks temporal consistency; for example, rehabilitation actions appear discontinuous. To address this problem, many research efforts have proposed adding a temporal attention mechanism so that the entire rehabilitation exercise video is generated directly.
However, even with this approach, each video frame is still generated frame by frame when the rehabilitation exercise video is produced, so the generated video still lacks temporal consistency.
Therefore, how to improve the accuracy of video generation is an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a method for generating long-short time feature-guided human rehabilitation exercise video data, so as to partially solve the above-mentioned problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a human rehabilitation exercise video data generation method guided by long-short time features, which comprises the following steps:
Acquiring a reference image, a gesture sequence and a video sample;
Segmenting the gesture sequence and the video sample respectively to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and overlapping exists between adjacent segmented video samples;
Inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained so as to extract image reference characteristics corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the kth-1 segmented video sample into a video reference network in the video generation model so as to obtain video reference characteristics corresponding to the kth-1 segmented video sample;
Adding noise to the kth segmented video sample through the generated noise to obtain a noisy segmented video sample, inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise;
And training the video generation model by taking the difference between the minimized predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation motion video through a reference image and a gesture sequence given by a user.
Optionally, the video generation model further comprises an image semantic feature extraction model and a video semantic feature extraction model;
the method further comprises the steps of:
Inputting the reference image into the image semantic feature extraction model to obtain semantic features corresponding to the reference image, and inputting the kth-1 segmented video sample into the video semantic feature extraction model to obtain semantic features corresponding to the kth-1 segmented video sample.
Optionally, the image reference network comprises a plurality of sub-modules, and each sub-module comprises a spatial attention module and a composite cross attention module;
Extracting, through an image reference network in the video generation model to be trained, an image reference feature corresponding to the reference image, specifically including:
Coding the reference image through a variation self-coder to obtain an image code;
Inputting the image code into the image reference network, obtaining an attention result through a spatial attention module of a first sub-module, inputting semantic features corresponding to the attention result and the reference image into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into a next sub-module, and continuously obtaining the attention result through the spatial attention module and obtaining the cross attention result through the composite cross attention module;
And taking the attention result corresponding to each sub-module as the image reference characteristic.
Optionally, the video reference network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, and a temporal attention module;
Extracting video reference characteristics corresponding to the kth-1 segmented video sample through a video reference network in the video generation model, wherein the video reference characteristics comprise:
coding the (k-1) th segmented video sample through a variation self-coder to obtain video coding;
inputting the video code into the video reference network, obtaining an attention result through a space attention module of a first sub-module, inputting semantic features corresponding to the attention result and the k-1 th segmented video sample into the composite cross attention module, obtaining a cross attention result, inputting the cross attention result into a time attention module, obtaining a time attention result, inputting the time attention result into a next sub-module, continuously obtaining an attention result through the space attention module, obtaining a cross attention result through the composite cross attention module and obtaining a time attention result through the time attention module;
And taking the time attention result corresponding to each sub-module as the video reference characteristic.
Optionally, the stable diffusion network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module;
inputting the kth segment gesture sequence, the noisy segment video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segment video sample through the stable diffusion network to obtain prediction noise, wherein the method specifically comprises the following steps of:
coding the kth segment gesture sequence to obtain gesture codes;
Inputting the gesture codes, the noisy segmented video samples and the image reference features to a space attention module in a first sub-module to obtain a space attention result, removing parts belonging to the image reference features in the space attention result to obtain a removed result, inputting the removed result, semantic features corresponding to the reference images and semantic features corresponding to the k-1 segmented video samples to a composite cross attention module to obtain a cross attention result, inputting the cross attention result and the video reference features to a gating cross attention module to obtain a gating cross attention result, inputting the gating cross attention result to a time attention network to obtain a time attention result, and inputting the time attention result to a next sub-module;
The next submodule continues to obtain attention results through the space attention module, cross attention results are obtained through the compound cross attention module, gate cross attention results are obtained through the gate cross attention module, and time attention results are obtained through the time attention module;
And determining the prediction noise according to the time attention result determined by the last submodule.
Optionally, the compound cross attention module comprises a cross attention module and a gate control cross attention module;
Inputting the removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the k-1 th segmented video sample into a composite cross attention module to obtain a cross attention result, wherein the method specifically comprises the following steps of:
determining fusion semantic features for representing fusion features between 1 st to k-2 th segmented video samples;
inputting the removed result, the fused semantic features and the semantic features corresponding to the (k-1)-th segmented video sample to a cross attention module in the composite cross attention module to obtain a first attention result, wherein the first attention result comprises fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples, and the fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples are used for a training process of the (k+1)-th segmented video sample;
Inputting semantic features corresponding to the first attention result and the reference image into a gating cross attention module to obtain a second attention result;
And removing part of features in the second attention result except the removed result to obtain a cross attention result.
Optionally, inputting the gated cross attention result into a time attention network to obtain a time attention result, which specifically includes:
Sampling the part of the kth-1 segmented video sample before an overlapped frame to obtain a sampling result, wherein the overlapped frame is the overlapped part between the kth-1 segmented video sample and the kth segmented video sample;
Splicing codes corresponding to the sampling results in front of the characteristics corresponding to the kth segmented video sample in the gating cross attention results according to time sequence to obtain spliced results corresponding to the gating cross attention results;
And inputting the spliced result into a time attention network, determining an output result, and removing a part belonging to the sampling result in the output result to obtain a time attention result.
The specification provides a human rehabilitation exercise video data generating device guided by long-short time features, comprising:
the acquisition module is used for acquiring a reference image, a gesture sequence and a video sample;
The segmentation module is used for respectively segmenting the gesture sequence and the video sample to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap;
The input module is used for inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained so as to extract image reference characteristics corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the kth-1 segmented video sample into a video reference network in the video generation model so as to obtain video reference characteristics corresponding to the kth-1 segmented video sample;
the prediction module is used for adding noise to the kth segmented video sample through the generated noise to obtain a segmented video sample after adding noise, inputting the kth segmented gesture sequence, the segmented video sample after adding noise, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise;
the training module is used for training the video generation model by taking the difference between the minimized predicted noise and the generated noise as an optimization target, and the trained video generation model is used for generating a human body rehabilitation movement video through a reference image and a gesture sequence given by a user.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the human rehabilitation exercise video data generation method of long and short time feature guidance described above.
The present specification provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the human rehabilitation exercise video data generation method guided by the long-short time features when executing the program.
The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:
According to the human rehabilitation exercise video data generation method guided by long-short time features, a reference image, a gesture sequence and a video sample can be obtained, and the gesture sequence and the video sample are segmented respectively to obtain segmented gesture sequences and segmented video samples, where the segmented gesture sequences correspond one-to-one to the segmented video samples and adjacent segmented video samples overlap. The reference image, the segmented gesture sequences and the segmented video samples can then be input into a video generation model to be trained, so that image reference features corresponding to the reference image are extracted through an image reference network in the video generation model to be trained, and the (k-1)-th segmented video sample is input into a video reference network in the video generation model to obtain video reference features corresponding to the (k-1)-th segmented video sample. The k-th segmented video sample is noised with generated noise to obtain a noisy segmented video sample; the k-th segmented gesture sequence, the noisy segmented video sample, the video reference features and the image reference features are input into a stable diffusion network in the video generation model, and the noise added to the k-th segmented video sample is predicted through the stable diffusion network to obtain predicted noise. The video generation model is trained with minimizing the difference between the predicted noise and the generated noise as the optimization target, and the trained video generation model is used to generate a human rehabilitation exercise video from a reference image and a gesture sequence given by a user.
From the foregoing, it can be seen that the present invention provides the following advantages:
1. Compared with other methods that only generate the video frame by frame or generate the entire video directly, the method focuses on the consistency of high-frequency textures while maintaining temporal consistency, improving the quality of the generated video.
2. Compared with a simple autoregressive method, the method achieves better temporal continuity and high-frequency texture consistency by generating the video in segments.
3. The method fuses gated cross attention into the aligned modules of the stable diffusion network, which effectively avoids the information loss caused by feature-space differences between different sources of information, enables the model to make better use of spatial features and semantic information, handles temporal continuity better, and improves the stability and accuracy of the generation process.
4. The invention adopts a ReferenceNet (reference network) structure to extract the high-frequency information in each frame. Using this high-frequency information can effectively improve the quality and consistency of video generation, and can also improve the generalization capability of the model, so that the model adapts better to different data sets and scenes.
5. The invention designs an overlapped-frame part to pass temporal information and samples frames of the previous video segment, and adopts a temporal attention mechanism in each module of the stable diffusion network. By sampling frames of the previous video segment and encoding them into the latent space, the information of the previous segment can be used as a prior to guide the generation of the current video segment, enhancing temporal consistency in the video generation process and improving video quality and consistency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
Fig. 1 is a schematic flow chart of a method for generating long-short-time feature guided human rehabilitation exercise video data provided in the present specification;
FIG. 2 is a schematic diagram of a gesture sequence in the present specification;
FIG. 3 is a schematic diagram of a video generation model provided in the present specification;
FIG. 4 is a schematic diagram of a composite cross-attention module provided herein;
fig. 5 is a schematic diagram of a human rehabilitation exercise video data generating device guided by long-short time features provided in the present specification;
fig. 6 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a method for generating long-short time feature guided human rehabilitation exercise video data provided in the present specification, specifically including the following steps:
S100: a reference image, a pose sequence, and a video sample are acquired.
S102: and respectively segmenting the gesture sequence and the video samples to obtain each segmented gesture sequence and each segmented video sample, wherein one gesture sequence corresponds to one segmented video sample one by one, and overlapping exists between adjacent segmented video samples.
In this specification, a video generation model for generating human rehabilitation exercise videos needs to be trained, and realistic human rehabilitation exercise videos (i.e., videos of a real person doing rehabilitation exercise) can be generated with this video generation model. The generated videos have various uses: for example, they can serve as a source of training samples for neural network models used for other purposes related to human rehabilitation videos (such as building a three-dimensional model of the human body in the generated video), or the video generation model can be used to automatically generate rehabilitation exercise videos provided to medical staff or patients.
Based on this, the server may obtain reference images, gesture sequences, and video samples. The gesture sequence mentioned herein may be used to represent the gesture of the human body in the video sample at each moment, for example, the gesture sequence may refer to a video sequence that represents the gesture of the human body through the human body skeleton, as shown in fig. 2.
Fig. 2 is a schematic diagram of a gesture sequence in this specification.
The video sample is a video of a real person performing rehabilitation exercise, and the reference image may be an image containing the real person in the video sample; for example, the reference image may be a frame of the video sample.
Then, the gesture sequence and the video sample can be segmented respectively to obtain segmented gesture sequences and segmented video samples, where the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap.
Segmentation is performed separately on the gesture sequence and the video sample. Each segmented gesture sequence and segmented video sample may be set to contain K frames, with an overlap of s frames between every two adjacent segments (for example, s may be set to K/4). For an N-frame video, the segments then cover the frame ranges [1:K], [K-s+1:2K-s], [2K-2s+1:3K-2s], ..., [N-K+1:N].
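As an illustration of the segmentation scheme above, the following sketch computes the overlapping frame ranges; the function name and the assumption that frame indices are 1-based are illustrative, not part of the patent.

```python
def segment_ranges(n_frames: int, k: int, s: int):
    """Return 1-based (start, end) frame ranges of length k in which adjacent
    segments share s overlapping frames, i.e. [1:K], [K-s+1:2K-s], ..."""
    ranges = []
    start = 1
    while start + k - 1 <= n_frames:
        ranges.append((start, start + k - 1))
        if start + k - 1 == n_frames:
            break
        start += k - s  # advance by the stride so that s frames overlap
    if ranges and ranges[-1][1] < n_frames:
        # align a final segment to the last frame if the stride overshoots
        ranges.append((n_frames - k + 1, n_frames))
    return ranges

# Example: a 64-frame video with K = 16 and s = K // 4
print(segment_ranges(64, 16, 4))  # [(1, 16), (13, 28), (25, 40), (37, 52), (49, 64)]
```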
S104: inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained, extracting and obtaining image reference characteristics corresponding to the reference image through an image reference network in the video generation model to be trained, and extracting and obtaining video reference characteristics corresponding to the k-1 segmented video sample through a video reference network in the video generation model.
S106: and adding noise to the kth segmented video sample through the generated noise to obtain a noise-added segmented video sample, inputting the kth segmented gesture sequence, the noise-added segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise.
S108: and training the video generation model by taking the difference between the minimized predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation motion video through a reference image and a gesture sequence given by a user.
After the reference image, the segmented gesture sequences and the segmented video samples are determined, the reference image, the segmented gesture sequences and the segmented video samples can be input into a video generation model to be trained, and the video generation model to be trained is trained.
It should be noted that, when training the video generation model, if a complete video sample contains n segmented video samples, the video generation model needs to run its inference process n times, and from the second run onwards the (k-1)-th segmented video sample needs to be input to assist the training on the k-th segmented video sample.
Therefore, the image reference features corresponding to the reference image can be extracted through the image reference network in the video generation model to be trained, and the video reference features corresponding to the (k-1)-th segmented video sample can be extracted through the video reference network in the video generation model. The k-th segmented video sample is noised with the generated noise to obtain a noisy segmented video sample. The k-th segmented gesture sequence, the noisy segmented video sample, the video reference features and the image reference features are input into the stable diffusion network in the video generation model, and the noise added to the k-th segmented video sample is predicted through the stable diffusion network to obtain predicted noise. The video generation model is trained with minimizing the difference between the predicted noise and the generated noise as the optimization target, and the trained video generation model is used to generate a human rehabilitation exercise video from a reference image and a gesture sequence given by a user.
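A minimal sketch of one training step as described above, assuming a `model` object that exposes the three sub-networks and a noising helper under the attribute names used below (these names are assumptions for illustration, not the patent's API):

```python
import torch
import torch.nn.functional as F

def training_step(model, ref_image, pose_seg_k, video_seg_k, video_seg_prev):
    """Noise-prediction objective: noise the k-th segment, predict the added
    noise with the stable diffusion network, and minimise the difference."""
    img_ref_feats = model.image_reference_net(ref_image)        # image reference features
    vid_ref_feats = model.video_reference_net(video_seg_prev)   # features of segment k-1
    noise = torch.randn_like(video_seg_k)                       # the generated noise
    t = torch.randint(0, model.num_timesteps, (video_seg_k.size(0),))
    noisy_seg = model.q_sample(video_seg_k, t, noise)           # noisy segmented video sample
    pred_noise = model.stable_diffusion_net(
        pose_seg_k, noisy_seg, vid_ref_feats, img_ref_feats, t)
    return F.mse_loss(pred_noise, noise)  # difference between predicted and generated noise
```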
The foregoing briefly describes the training process corresponding to one segmented video sample (the k-th segmented video sample) within a complete video sample. When training the video generation model, noise is added to the video and the added noise is predicted by the model, so the training stage differs somewhat from the stage in which real videos are generated by the trained video generation model.
It should be noted that the above process of adding noise to the k-th segmented video sample (or to the features corresponding to the k-th segmented video sample) may add noise over multiple time steps, and the noise sequence may be predicted by the stable diffusion network.
Specifically, the number of time steps can be set to T and T noising steps are performed, with the Gaussian noise added at each step drawn according to a preset noise schedule $\beta_t$, until the segmented video sample (or the features corresponding to the segmented video sample) changes from its original state into pure Gaussian noise. The segmented video sample after T noising steps can be obtained directly by the following formula:
$$x_T = \sqrt{\bar{\alpha}_T}\, x_0 + \sqrt{1-\bar{\alpha}_T}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I}), \qquad \bar{\alpha}_T = \prod_{t=1}^{T}\left(1-\beta_t\right)$$
In the above formula, $x_0$ refers to the original segmented video sample and $x_T$ refers to the segmented video sample after T noising steps. This is the conventional formula in a stable diffusion model for adding noise to raw data over multiple time steps at once.
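The closed-form noising above can be written as the following sketch (a standard diffusion forward step; the tensor layout is an assumption):

```python
import torch

def q_sample(x0, t, noise, alphas_cumprod):
    """x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise, where a_bar_t is the
    cumulative product of (1 - beta) over the noise schedule."""
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # example linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(2, 4, 16, 32, 32)                 # (batch, latent channels, frames, h, w)
xt = q_sample(x0, torch.randint(0, T, (2,)), torch.randn_like(x0), alphas_cumprod)
```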
When generating a video, noise is first randomly generated; the randomly generated noise, a preset reference image and a preset gesture sequence are then input into the video generation model, and the noise output by the video generation model is used to denoise the randomly generated noise, thereby obtaining the generated video.
In the video generation model, the noise is output by the stable diffusion network, but the video generation model ultimately needs to output the generated video. A decoder (VAE decoder) may therefore be connected after the stable diffusion network, and the denoising result obtained by denoising the randomly generated noise is input into the decoder to obtain the generated video.
Video generation also proceeds segment by segment. Noise can be randomly generated, and the gesture sequence mentioned above is segmented to obtain segmented gesture sequences. The i-th segmented gesture sequence, the randomly generated noise, the reference image and the (i-1)-th generated video segment are input into the video generation model to output the i-th video segment, and all the video segments are spliced to obtain the complete video. (It should be noted that, whether in the training stage or in the stage of generating videos with the trained model, the training/generation process for the 1st segment does not involve inputting a previous segment.)
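A hedged sketch of the segment-by-segment generation and stitching just described; `model.denoise`, `model.vae_decoder` and `model.latent_shape` are placeholder names, and tensors are assumed to have their frame axis at dimension 2:

```python
import torch

@torch.no_grad()
def generate_video(model, ref_image, pose_segments, s):
    """Generate each segment from random noise, conditioning on the previously
    generated segment, then drop the s overlapping frames while stitching."""
    segments, prev_segment = [], None
    for pose_seg in pose_segments:
        noise = torch.randn(model.latent_shape(pose_seg))        # randomly generated noise
        latent = model.denoise(noise, pose_seg, ref_image, prev_segment)
        segment = model.vae_decoder(latent)                      # decode latents into frames
        # keep every frame of the first segment, skip the overlap afterwards
        segments.append(segment if prev_segment is None else segment[:, :, s:])
        prev_segment = segment
    return torch.cat(segments, dim=2)                            # concatenate along frames
```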
The structure of the video generation model is described in detail below, as shown in fig. 3.
Fig. 3 is a schematic structural diagram of a video generation model provided in the present specification.
The video generation model comprises three main modules: image reference network, stable diffusion network and video reference network, before the image reference network, stable diffusion network and video reference network, there are networks for encoding reference images, pose sequences and video.
Specifically, feature extraction is performed on the reference image by a variation self-encoder (VAE encoder) and an image semantic feature extraction model (CLIP network) respectively, on the gesture sequence by a gesture director (a convolutional neural network), and on the video by a variation self-encoder and a video semantic feature extraction model (CLIP network) respectively.
The difference between feature extraction by the variation self-encoder and by the semantic feature extraction model is that the variation self-encoder simply compresses the image or video to obtain image/video features, whereas the semantic feature extraction model can extract high-frequency texture features of the video and texture features of the image.
The reference image can be input into an image semantic feature extraction model to obtain semantic features corresponding to the reference image, and the kth-1 segmented video sample is input into a video semantic feature extraction model to obtain semantic features corresponding to the kth-1 segmented video sample. Where k may be a positive integer greater than 1.
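The encoder arrangement described above can be grouped as in the following sketch; the concrete backbones passed in (VAE encoder, CLIP image/video encoders, convolutional gesture director) are assumptions and the class is illustrative only:

```python
import torch.nn as nn

class ConditionEncoders(nn.Module):
    """Bundle the encoders that feed the three main networks."""
    def __init__(self, vae_encoder, clip_image, clip_video, gesture_director):
        super().__init__()
        self.vae_encoder = vae_encoder            # compresses frames into latent codes
        self.clip_image = clip_image              # semantic features of the reference image
        self.clip_video = clip_video              # semantic features of segment k-1
        self.gesture_director = gesture_director  # convolutional encoder for the gesture sequence

    def forward(self, ref_image, pose_seg_k, video_seg_prev):
        return {
            "image_code": self.vae_encoder(ref_image),
            "image_semantics": self.clip_image(ref_image),
            "video_code": self.vae_encoder(video_seg_prev),
            "video_semantics": self.clip_video(video_seg_prev),
            "gesture_code": self.gesture_director(pose_seg_k),
        }
```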
For the image reference network, several sub-modules may be included, each comprising a spatial attention module and a composite cross attention module (as shown in fig. 3).
The reference image can be encoded by the variation self-encoder to obtain an image code, and the image code is input into the image reference network. An attention result is obtained through the spatial attention module of the first sub-module; the attention result and the semantic features corresponding to the reference image are input into the composite cross attention module to obtain a cross attention result; the cross attention result is input into the next sub-module, which again obtains an attention result through its spatial attention module and a cross attention result through its composite cross attention module. The attention results corresponding to the sub-modules are used as the image reference features.
It should be noted that the spatial attention module spatially weights the image code of the reference image itself, while the composite cross attention module performs cross attention between the image code of the reference image and the semantic features. The composite cross attention module may consist of only a cross attention network, or of a cross attention network and a gated cross attention network, where the gated cross attention network consists of a cross attention network followed by a gating network.
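A minimal sketch of the gated cross-attention network described above (cross-attention first, gate after); the zero-initialised tanh gate and the residual connection are assumptions borrowed from common practice, not stated in the patent:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention over a context, followed by a learned scalar gate."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, x, context):
        out, _ = self.cross_attn(query=x, key=context, value=context)
        return x + torch.tanh(self.gate) * out    # gating network applied after cross-attention

# usage (illustrative): x is (batch, tokens, dim), context is (batch, ctx_len, dim)
# block = GatedCrossAttention(dim=320)
# y = block(x, context)
```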
For a video reference network, the video reference network comprises a plurality of sub-modules, and each sub-module comprises: a spatial attention module, a composite cross attention module, and a temporal attention module.
The (k-1)-th segmented video sample can be encoded by the variation self-encoder to obtain a video code, and the video code is input into the video reference network. An attention result is obtained through the spatial attention module of the first sub-module; the attention result and the semantic features corresponding to the (k-1)-th segmented video sample are input into the composite cross attention module to obtain a cross attention result; the cross attention result is input into the temporal attention module to obtain a temporal attention result; and the temporal attention result is input into the next sub-module, which again obtains an attention result through its spatial attention module, a cross attention result through its composite cross attention module and a temporal attention result through its temporal attention module. The temporal attention results corresponding to the sub-modules are then used as the video reference features.
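One sub-module of the video reference network could be sketched as follows; a plain cross-attention stands in for the composite cross-attention module, and the (batch, frames, tokens, dim) layout plus residual connections are assumptions:

```python
import torch
import torch.nn as nn

class VideoReferenceBlock(nn.Module):
    """Spatial attention within each frame, cross-attention against the semantics
    of segment k-1, then temporal attention across frames."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # stand-in for the composite module
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, semantics):
        # x: (batch, frames, tokens, dim); semantics: (batch, ctx_len, dim)
        b, f, n, d = x.shape
        xs = x.reshape(b * f, n, d)
        sa, _ = self.spatial_attn(xs, xs, xs)                      # attention result per frame
        x = (xs + sa).reshape(b, f * n, d)
        ca, _ = self.cross_attn(x, semantics, semantics)           # cross attention result
        x = (x + ca).reshape(b, f, n, d)
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)            # attend across frames
        ta, _ = self.temporal_attn(xt, xt, xt)                     # temporal attention result
        return (xt + ta).reshape(b, n, f, d).permute(0, 2, 1, 3)
```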
The image reference network and the video reference network mainly provide feature references, from the reference image and from the segment preceding the segment to be generated, for the stable diffusion network. The network mainly responsible for generating the video is therefore the stable diffusion network (of course, the stable diffusion network cannot directly generate the video; it is used to determine the noise added to the video, so when a video actually needs to be generated, the generated noise is denoised through the stable diffusion network to obtain the generated video).
Wherein the composite cross-attention module in the video reference network is identical to the composite cross-attention module in the image reference network and is not repeated here.
It should be noted that, since the stable diffusion network includes a large number of attention network layers, the image reference features, the video reference features, the semantic features corresponding to the reference image and the semantic features corresponding to the segmented video sample are fused, during model inference, into the features of the k-th segmented gesture sequence and the noisy segmented video sample; once the features are fused, the parts of the result corresponding to these conditioning features can be removed.
For the stable diffusion network, the stable diffusion network comprises a plurality of sub-modules, and each sub-module comprises: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module.
It should be noted that the numbers of sub-modules in the stable diffusion network, the video reference network and the image reference network are the same, i.e., their sub-modules correspond one-to-one. The internal outputs of each sub-module in the video reference network and the image reference network need to be provided to the corresponding sub-module of the stable diffusion network: the output of the spatial attention module of the image reference network is input to the spatial attention module of the corresponding sub-module of the stable diffusion network, and the output of the temporal attention module of the video reference network is input to the gated cross attention module of the corresponding sub-module of the stable diffusion network.
Specifically, the k-th segmented gesture sequence may be encoded (the segmented gesture sequence is encoded by the gesture director mentioned above) to obtain a gesture code. The gesture code, the noisy segmented video sample (which may likewise be obtained by encoding the segmented video sample through a feature-extraction network such as the variation self-encoder and then adding noise) and the image reference features are then input into the spatial attention module of the first sub-module to obtain a spatial attention result, and the part of the spatial attention result belonging to the image reference features is removed to obtain a removed result.
The removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the (k-1)-th segmented video sample can then be input into the composite cross attention module to obtain a cross attention result; the cross attention result and the video reference features are input into the gated cross attention module to obtain a gated cross attention result; the gated cross attention result is input into the temporal attention network to obtain a temporal attention result; and the temporal attention result is input into the next sub-module.
The next sub-module again obtains an attention result through its spatial attention module, a cross attention result through its composite cross attention module, a gated cross attention result through its gated cross attention module and a temporal attention result through its temporal attention module. The prediction noise can then be determined from the temporal attention result determined by the last sub-module.
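A hedged sketch of one stable-diffusion sub-module as described above. The points being illustrated are the concatenate-then-strip handling of the image reference features and the gated conditioning on the video reference features; the single-attention stand-ins, residual connections and additive gesture injection are assumptions:

```python
import torch
import torch.nn as nn

class DiffusionBlock(nn.Module):
    """Spatial attention (with image reference tokens appended then stripped),
    composite cross-attention stand-in, gated cross-attention over the video
    reference features, and temporal attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.composite_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gated_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, noisy_tokens, gesture_code, img_ref_feat, vid_ref_feat, semantics):
        # all inputs: (batch, *, dim); gesture_code matches noisy_tokens in shape
        h = noisy_tokens + gesture_code                       # inject the gesture encoding
        joint = torch.cat([h, img_ref_feat], dim=1)           # append image reference tokens
        sa, _ = self.spatial_attn(joint, joint, joint)
        h = h + sa[:, : h.size(1)]                            # drop the part belonging to the image reference features
        ca, _ = self.composite_cross_attn(h, semantics, semantics)
        h = h + ca                                            # cross attention result
        ga, _ = self.gated_cross_attn(h, vid_ref_feat, vid_ref_feat)
        h = h + torch.tanh(self.gate) * ga                    # gated cross attention result
        ta, _ = self.temporal_attn(h, h, h)                   # temporal attention result
        return h + ta
```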
The following detailed description is directed to the composite cross attention module. Consistent with the image reference network, it may be composed of a cross attention network and a gated cross attention network, where the gated cross attention network consists of a cross attention network followed by a gating network, as shown in fig. 4.
Fig. 4 is a schematic structural diagram of a composite cross-attention module provided in the present specification.
Fusion semantic features for characterizing the fused features of the 1st to (k-2)-th segmented video samples can be determined. The removed result, the fusion semantic features and the semantic features corresponding to the (k-1)-th segmented video sample are input into the cross attention module within the composite cross attention module to obtain a first attention result; the first attention result contains fusion semantic features characterizing the fused features of the 1st to (k-1)-th segmented video samples, and these fusion semantic features are used in the training process for the (k+1)-th segmented video sample.
Then, the first attention result and the semantic features corresponding to the reference image are input into the gated cross attention module to obtain a second attention result, and the features in the second attention result other than those corresponding to the removed result are discarded to obtain the cross attention result.
As can be seen from fig. 4 (for ease of explanation, only the input-output relationship of the fusion semantic features and the semantic features is drawn), the inputs of the cross attention module within the composite cross attention module include the semantic features corresponding to the (k-1)-th segmented video sample and the fusion semantic features characterizing the fused features of the 1st to (k-2)-th segmented video samples, and its output contains the fusion semantic features characterizing the fused features of the 1st to (k-1)-th segmented video samples.
It should be noted that the fusion semantic features characterizing the fused features of the 1st to (k-2)-th segmented video samples are obtained during the training of the video generation model on the (k-1)-th segmented video sample.
Specifically, the training process for the 1st segmented video sample does not involve this step, and the training process for the 2nd segmented video sample only needs the semantic features corresponding to the 1st segmented video sample as input.
In the training process for the 3rd segmented video sample, the input of the cross attention module within the composite cross attention module includes the semantic features corresponding to the 1st segmented video sample and the semantic features corresponding to the 2nd segmented video sample; fusing these two through the cross attention module yields the fusion semantic features characterizing the fused features of the 1st to 2nd segmented video samples.
In the training process for the 4th segmented video sample, the input of the cross attention module within the composite cross attention module includes the fusion semantic features characterizing the fused features of the 1st to 2nd segmented video samples and the semantic features corresponding to the 3rd segmented video sample; fusing these two through the cross attention module yields the fusion semantic features characterizing the fused features of the 1st to 3rd segmented video samples. By analogy, the fusion semantic features obtained while training on one segmented video sample are used while training on the next segmented video sample.
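The recursion over segments described above can be sketched as follows; abstracting the fusion out of the composite cross-attention module into a standalone helper is a simplification, and all names are illustrative:

```python
import torch
import torch.nn as nn

def update_fused_semantics(fused_prev, semantics_prev_segment, cross_attn):
    """Merge the fused semantics of segments 1..k-2 with the semantics of
    segment k-1, producing the fused semantics of segments 1..k-1."""
    if fused_prev is None:            # training on segment 2: only segment 1's semantics exist
        return semantics_prev_segment
    out, _ = cross_attn(query=fused_prev, key=semantics_prev_segment,
                        value=semantics_prev_segment)
    return fused_prev + out

# usage across the training runs of one video (illustrative):
# cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
# fused = None
# for k in range(2, num_segments + 1):
#     fused = update_fused_semantics(fused, clip_video(segments[k - 2]), cross_attn)
```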
It should be noted that, as can be seen from fig. 3, there is a branch that samples the (k-1)-th segmented video sequence. That is, the part of the (k-1)-th segmented video sample before the overlapped frames (the overlapped frames being the part where the (k-1)-th segmented video sample overlaps the k-th segmented video sample) may be sampled to obtain a sampling result. The codes corresponding to the sampling result are then spliced, in time order, in front of the features corresponding to the k-th segmented video sample in the gated cross attention result, to obtain a spliced result. The spliced result is input into the temporal attention network to determine an output result, and the part of the output result belonging to the sampling result is removed to obtain the temporal attention result.
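A minimal sketch of the overlap-frame handling just described, assuming a (batch, frames, dim) layout and any attention layer with a (query, key, value) interface:

```python
import torch
import torch.nn as nn

def temporal_attention_with_prefix(gated_result, sampled_prev_codes, temporal_attn):
    """Splice codes sampled from segment k-1 in front of segment k's features in
    time order, run temporal attention, then drop the prefix positions."""
    p = sampled_prev_codes.size(1)                       # number of sampled prefix frames
    spliced = torch.cat([sampled_prev_codes, gated_result], dim=1)
    out, _ = temporal_attn(spliced, spliced, spliced)
    return out[:, p:]                                    # remove the part belonging to the sampling result

# usage (illustrative):
# temporal_attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
# result = temporal_attention_with_prefix(gated_result, sampled_codes, temporal_attn)
```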
For convenience of description, the execution subject of the method is described as a server; in practice, the execution subject may be a computer, a controller, a server or another device, which is not limited here. The features of the following examples and embodiments may be combined with each other without conflict.
Based on the same concept as the above long-short time feature guided human rehabilitation exercise video data generation method, the present specification further provides a human rehabilitation exercise video data generating device guided by long-short time features, as shown in fig. 5.
Fig. 5 is a schematic diagram of a human rehabilitation exercise video data generating device guided by long-short time features provided in the present specification, including:
An acquisition module 501, configured to acquire a reference image, a gesture sequence, and a video sample;
The segmentation module 502 is configured to segment the gesture sequence and the video sample respectively to obtain segmented gesture sequences and segmented video samples, where the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap;
An input module 503, configured to input the reference image, the segmented gesture sequence, and the segmented video sample into a video generation model to be trained, so as to extract, through an image reference network in the video generation model to be trained, an image reference feature corresponding to the reference image, and input a kth-1 segmented video sample into a video reference network in the video generation model, so as to obtain a video reference feature corresponding to the kth-1 segmented video sample;
The prediction module 504 is configured to perform noise addition on a kth segmented video sample through the generated noise to obtain a noisy segmented video sample, and input a kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predict noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise;
The training module 505 is configured to train the video generating model with a difference between the minimized predicted noise and the generated noise as an optimization target, where the trained video generating model is used to generate a human rehabilitation motion video through a reference image and a gesture sequence given by a user.
Optionally, the video generation model further comprises an image semantic feature extraction model and a video semantic feature extraction model;
the input module 503 is further configured to input the reference image into the image semantic feature extraction model to obtain semantic features corresponding to the reference image, and input the kth-1 segmented video sample into the video semantic feature extraction model to obtain semantic features corresponding to the kth-1 segmented video sample.
Optionally, the image reference network comprises a plurality of sub-modules, and each sub-module comprises a spatial attention module and a composite cross attention module;
The input module 503 is specifically configured to encode the reference image by using a variation self-encoder to obtain an image code; inputting the image code into the image reference network, obtaining an attention result through a spatial attention module of a first sub-module, inputting semantic features corresponding to the attention result and the reference image into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into a next sub-module, and continuously obtaining the attention result through the spatial attention module and obtaining the cross attention result through the composite cross attention module; and taking the attention result corresponding to each sub-module as the image reference characteristic.
Optionally, the video reference network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, and a temporal attention module;
The input module 503 is specifically configured to encode the kth-1 segment video sample by using a variation self-encoder to obtain video encoding; inputting the video code into the video reference network, obtaining an attention result through a space attention module of a first sub-module, inputting semantic features corresponding to the attention result and the k-1 th segmented video sample into the composite cross attention module, obtaining a cross attention result, inputting the cross attention result into a time attention module, obtaining a time attention result, inputting the time attention result into a next sub-module, continuously obtaining an attention result through the space attention module, obtaining a cross attention result through the composite cross attention module and obtaining a time attention result through the time attention module; and taking the time attention result corresponding to each sub-module as the video reference characteristic.
Optionally, the stable diffusion network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module;
The prediction module 504 is specifically configured to encode the kth segment gesture sequence to obtain a gesture code; inputting the gesture codes, the noisy segmented video samples and the image reference features to a space attention module in a first sub-module to obtain a space attention result, removing parts belonging to the image reference features in the space attention result to obtain a removed result, inputting the removed result, semantic features corresponding to the reference images and semantic features corresponding to the k-1 segmented video samples to a composite cross attention module to obtain a cross attention result, inputting the cross attention result and the video reference features to a gating cross attention module to obtain a gating cross attention result, inputting the gating cross attention result to a time attention network to obtain a time attention result, and inputting the time attention result to a next sub-module; the next submodule continues to obtain attention results through the space attention module, cross attention results are obtained through the compound cross attention module, gate cross attention results are obtained through the gate cross attention module, and time attention results are obtained through the time attention module; and determining the prediction noise according to the time attention result determined by the last submodule.
Optionally, the compound cross attention module comprises a cross attention module and a gate control cross attention module;
The prediction module 504 is specifically configured to determine fusion semantic features for characterizing fusion features between the 1st to (k-2)-th segmented video samples; inputting the removed result, the fused semantic features and the semantic features corresponding to the (k-1)-th segmented video sample to a cross attention module in a composite cross attention module to obtain a first attention result, wherein the first attention result comprises fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples, and the fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples are used for a training process of the (k+1)-th segmented video sample; inputting semantic features corresponding to the first attention result and the reference image into a gating cross attention module to obtain a second attention result; and removing part of features in the second attention result except the removed result to obtain a cross attention result.
Optionally, the prediction module 504 is specifically configured to sample a portion of the kth-1 segmented video sample before an overlapping frame, where the overlapping frame is a portion of the kth-1 segmented video sample overlapping with the kth segmented video sample, to obtain a sampling result; splicing codes corresponding to the sampling results in front of the characteristics corresponding to the kth segmented video sample in the gating cross attention results according to time sequence to obtain spliced results corresponding to the gating cross attention results; and inputting the spliced result into a time attention network, determining an output result, and removing a part belonging to the sampling result in the output result to obtain a time attention result.
The present specification also provides a computer-readable storage medium storing a computer program which can be used to execute the long-short time feature guided human body rehabilitation exercise video data generation method described above.
The present specification also provides a schematic structural diagram of the electronic device shown in Fig. 6. As illustrated in Fig. 6, at the hardware level the electronic device includes a processor, an internal bus, a network interface, a memory and a non-volatile memory, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it, so as to implement the long-short time feature guided human body rehabilitation exercise video data generation method described above.
Of course, the present specification does not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the processing flows described above is not limited to logic units, and may also be hardware or a logic device.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs to "integrate" a digital system onto a PLD, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development: the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained by slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Or, even, the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing its functions into various units. Of course, when the present specification is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in the form of a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising that element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the corresponding description of the method embodiments.
The foregoing is merely an embodiment of the present specification and is not intended to limit the present specification. Various modifications and alterations of the present specification will be apparent to those skilled in the art. Any modification, equivalent substitution, improvement or the like made within the spirit and principles of the present specification shall be included within the scope of the claims of the present specification.

Claims (10)

1. A long-short time feature guided human body rehabilitation exercise video data generation method, characterized by comprising the following steps:
Acquiring a reference image, a gesture sequence and a video sample, wherein the reference image comprises a real human body image in the video sample;
Segmenting the gesture sequence and the video sample respectively to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap with each other;
Inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained, so as to extract an image reference feature corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the (k-1)th segmented video sample into a video reference network in the video generation model to obtain a video reference feature corresponding to the (k-1)th segmented video sample;
Adding noise to the kth segmented video sample through generated noise to obtain a noisy segmented video sample, inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting, through the stable diffusion network, the noise added to the kth segmented video sample to obtain predicted noise;
And training the video generation model with minimization of the difference between the predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation exercise video from a reference image and a gesture sequence given by a user.
2. The method of claim 1, wherein the video generation model further comprises an image semantic feature extraction model and a video semantic feature extraction model;
the method further comprises the steps of:
Inputting the reference image into the image semantic feature extraction model to obtain semantic features corresponding to the reference image, and inputting the (k-1)th segmented video sample into the video semantic feature extraction model to obtain semantic features corresponding to the (k-1)th segmented video sample.
3. The method of claim 2, wherein the image reference network includes a plurality of sub-modules, each sub-module including a spatial attention module and a composite cross attention module;
Extracting, through an image reference network in the video generation model to be trained, an image reference feature corresponding to the reference image, specifically including:
Encoding the reference image through a variational autoencoder to obtain an image code;
Inputting the image code into the image reference network, obtaining an attention result through the spatial attention module of the first sub-module, inputting the attention result and the semantic features corresponding to the reference image into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into the next sub-module, and continuing to obtain attention results through the spatial attention module and cross attention results through the composite cross attention module;
And taking the attention result corresponding to each sub-module as the image reference feature.
4. The method of claim 2, wherein the video reference network includes a plurality of sub-modules, each sub-module including: a spatial attention module, a composite cross attention module, and a temporal attention module;
Extracting, through the video reference network in the video generation model, the video reference feature corresponding to the (k-1)th segmented video sample, specifically including:
Encoding the (k-1)th segmented video sample through a variational autoencoder to obtain a video code;
Inputting the video code into the video reference network, obtaining an attention result through the spatial attention module of the first sub-module, inputting the attention result and the semantic features corresponding to the (k-1)th segmented video sample into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into the temporal attention module to obtain a temporal attention result, inputting the temporal attention result into the next sub-module, and continuing to obtain attention results through the spatial attention module, cross attention results through the composite cross attention module and temporal attention results through the temporal attention module;
And taking the temporal attention result corresponding to each sub-module as the video reference feature.
5. The method of claim 2, wherein the stable diffusion network includes a plurality of sub-modules therein, each sub-module including: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module;
Inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into the stable diffusion network in the video generation model, and predicting, through the stable diffusion network, the noise added to the kth segmented video sample to obtain the predicted noise, specifically including:
Encoding the kth segmented gesture sequence to obtain a gesture code;
Inputting the gesture code, the noisy segmented video sample and the image reference feature into the spatial attention module of the first sub-module to obtain a spatial attention result, removing the part of the spatial attention result that belongs to the image reference feature to obtain a removed result, inputting the removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the (k-1)th segmented video sample into the composite cross attention module to obtain a cross attention result, inputting the cross attention result and the video reference feature into the gated cross attention module to obtain a gated cross attention result, inputting the gated cross attention result into the temporal attention module to obtain a temporal attention result, and inputting the temporal attention result into the next sub-module;
The next sub-module continues to obtain an attention result through the spatial attention module, a cross attention result through the composite cross attention module, a gated cross attention result through the gated cross attention module, and a temporal attention result through the temporal attention module;
And determining the predicted noise according to the temporal attention result determined by the last sub-module.
6. The method of claim 5, wherein the composite cross attention module comprises a cross attention module and a gated cross attention module;
Inputting the removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the (k-1)th segmented video sample into the composite cross attention module to obtain a cross attention result, specifically including:
Determining fused semantic features for characterizing the fusion features between the 1st to (k-2)th segmented video samples;
Inputting the removed result, the fused semantic features and the semantic features corresponding to the (k-1)th segmented video sample into the cross attention module of the composite cross attention module to obtain a first attention result, wherein the first attention result comprises fused semantic features for characterizing the fusion features between the 1st to (k-1)th segmented video samples, and the fused semantic features for characterizing the fusion features between the 1st to (k-1)th segmented video samples are used in the training process of the (k+1)th segmented video sample;
Inputting the first attention result and the semantic features corresponding to the reference image into the gated cross attention module to obtain a second attention result;
And removing, from the second attention result, the features other than those corresponding to the removed result, to obtain the cross attention result.
7. The method according to claim 5, wherein inputting the gated cross attention result into the temporal attention module to obtain a temporal attention result specifically includes:
Sampling the part of the (k-1)th segmented video sample before the overlapping frames to obtain a sampling result, wherein the overlapping frames are the overlapping part between the (k-1)th segmented video sample and the kth segmented video sample;
Splicing the codes corresponding to the sampling result in front of the features corresponding to the kth segmented video sample in the gated cross attention result in temporal order, to obtain a spliced result corresponding to the gated cross attention result;
And inputting the spliced result into the temporal attention module, determining an output result, and removing the part of the output result that belongs to the sampling result, to obtain the temporal attention result.
8. A long-short time feature guided human body rehabilitation exercise video data generation device, comprising:
The acquisition module is used for acquiring a reference image, a gesture sequence and a video sample, wherein the reference image comprises a real human body image in the video sample;
The segmentation module is used for respectively segmenting the gesture sequence and the video sample to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap with each other;
The input module is used for inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained, so as to extract an image reference feature corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the (k-1)th segmented video sample into a video reference network in the video generation model to obtain a video reference feature corresponding to the (k-1)th segmented video sample;
The prediction module is used for adding noise to the kth segmented video sample through generated noise to obtain a noisy segmented video sample, inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting, through the stable diffusion network, the noise added to the kth segmented video sample to obtain predicted noise;
The training module is used for training the video generation model with minimization of the difference between the predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation exercise video from a reference image and a gesture sequence given by a user.
9. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-7 when executing the program.
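For illustration, the noise-prediction training objective described above can be sketched as follows; the model call signature, the omission of a diffusion noise schedule and all tensor shapes are assumptions, not the patented implementation.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, pose_seq_k, segment_k, vid_ref_feat, img_ref_feat):
    noise = torch.randn_like(segment_k)        # generated noise
    noisy_segment = segment_k + noise          # noised kth segment (noise schedule omitted)
    pred_noise = model(pose_seq_k, noisy_segment, vid_ref_feat, img_ref_feat)
    loss = F.mse_loss(pred_noise, noise)       # minimise predicted vs. added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()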
CN202410281162.0A 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features Active CN117880444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410281162.0A CN117880444B (en) 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410281162.0A CN117880444B (en) 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features

Publications (2)

Publication Number Publication Date
CN117880444A CN117880444A (en) 2024-04-12
CN117880444B true CN117880444B (en) 2024-05-24

Family

ID=90579570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410281162.0A Active CN117880444B (en) 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features

Country Status (1)

Country Link
CN (1) CN117880444B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118365510B (en) * 2024-06-19 2024-09-13 阿里巴巴达摩院(杭州)科技有限公司 Image processing method, training method of image processing model and image generating method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007133609A (en) * 2005-11-09 2007-05-31 Oki Electric Ind Co Ltd Video generating device and video generating method
JP2018032316A (en) * 2016-08-26 2018-03-01 日本電信電話株式会社 Video generation device, video generation model learning device, method for the same, and program
CN107506800A (en) * 2017-09-21 2017-12-22 深圳市唯特视科技有限公司 It is a kind of based on unsupervised domain adapt to without label video face identification method
WO2021190078A1 (en) * 2020-03-26 2021-09-30 华为技术有限公司 Method and apparatus for generating short video, and related device and medium
US11727618B1 (en) * 2022-08-25 2023-08-15 xNeurals Inc. Artificial intelligence-based system and method for generating animated videos from an audio segment
CN116392812A (en) * 2022-12-02 2023-07-07 阿里巴巴(中国)有限公司 Action generating method and virtual character animation generating method
CN116233491A (en) * 2023-05-04 2023-06-06 阿里巴巴达摩院(杭州)科技有限公司 Video generation method and server
CN117409121A (en) * 2023-10-17 2024-01-16 西安电子科技大学 Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN117499711A (en) * 2023-11-08 2024-02-02 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of video generation model
CN117593473A (en) * 2024-01-17 2024-02-23 淘宝(中国)软件有限公司 Method, apparatus and storage medium for generating motion image and video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Enabling the Encoder-Empowered GAN-based Video Generators for Long Video Generation; J. Yang and A. G. Bors; 2023 IEEE International Conference on Image Processing (ICIP); 2023-09-11; full text *
Adversarial video generation method based on multi-modal input; Yu Haitao; Yang Xiaoshan; Xu Changsheng; Journal of Computer Research and Development; 2020-07-07 (07); full text *

Also Published As

Publication number Publication date
CN117880444A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN117880444B (en) Human body rehabilitation exercise video data generation method guided by long-short time features
CN117372631B (en) Training method and application method of multi-view image generation model
CN108765334A (en) A kind of image de-noising method, device and electronic equipment
CN112784857B (en) Model training and image processing method and device
CN116977525B (en) Image rendering method and device, storage medium and electronic equipment
CN116343314B (en) Expression recognition method and device, storage medium and electronic equipment
CN117635822A (en) Model training method and device, storage medium and electronic equipment
CN117409466B (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN116030247B (en) Medical image sample generation method and device, storage medium and electronic equipment
CN117079777A (en) Medical image complement method and device, storage medium and electronic equipment
CN116524295A (en) Image processing method, device, equipment and readable storage medium
CN117726760B (en) Training method and device for three-dimensional human body reconstruction model of video
CN117808976B (en) Three-dimensional model construction method and device, storage medium and electronic equipment
CN117726907B (en) Training method of modeling model, three-dimensional human modeling method and device
CN117911630B (en) Three-dimensional human modeling method and device, storage medium and electronic equipment
CN117830564B (en) Three-dimensional virtual human model reconstruction method based on gesture distribution guidance
CN114528923B (en) Video target detection method, device, equipment and medium based on time domain context
CN113887326B (en) Face image processing method and device
CN116309924B (en) Model training method, image display method and device
CN116991388B (en) Graph optimization sequence generation method and device of deep learning compiler
CN117635912A (en) Method, device, medium and equipment for predicting growth of interested part
CN112950732B (en) Image generation method and device, storage medium and electronic equipment
CN116229218B (en) Model training and image registration method and device
CN116188469A (en) Focus detection method, focus detection device, readable storage medium and electronic equipment
CN116245773A (en) Face synthesis model training method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant