CN117880444B - Human body rehabilitation exercise video data generation method guided by long-short time features - Google Patents

Human body rehabilitation exercise video data generation method guided by long-short time features

Info

Publication number
CN117880444B
Authority
CN
China
Prior art keywords
video
attention
segmented
module
result
Prior art date
Legal status
Active
Application number
CN202410281162.0A
Other languages
Chinese (zh)
Other versions
CN117880444A (en)
Inventor
王宏升
林峰
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410281162.0A priority Critical patent/CN117880444B/en
Publication of CN117880444A publication Critical patent/CN117880444A/en
Application granted granted Critical
Publication of CN117880444B publication Critical patent/CN117880444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/40 Extraction of image or video features
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/08 Learning methods
    • G06V 10/806 Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; cameras specially adapted for the electronic generation of special effects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses a human rehabilitation exercise video data generation method guided by long-short time features. Image reference features corresponding to a reference image are extracted through an image reference network in a video generation model, and the (k-1)-th segmented video sample is input into a video reference network in the video generation model to obtain video reference features. The k-th segmented video sample is noised with generated noise to obtain a noisy segmented video sample; the k-th segmented gesture sequence, the noisy segmented video sample, the video reference features and the image reference features are input into a stable diffusion network in the video generation model, and the noise added to the k-th segmented video sample is predicted through the stable diffusion network to obtain predicted noise. The video generation model is then trained with minimizing the difference between the predicted noise and the generated noise as the optimization target, thereby improving video generation quality.

Description

Human body rehabilitation exercise video data generation method guided by long-short time features
Technical Field
The specification relates to the technical field of neural networks and video generation, in particular to a human rehabilitation exercise video data generation method guided by long-short time features.
Background
Currently, in the field of rehabilitation video generation, traditional methods generally render videos frame by frame, so the generated video lacks temporal consistency; for example, rehabilitation actions appear discontinuous. To address this problem, many research efforts have proposed adding a temporal attention mechanism so that the entire rehabilitation exercise video is generated directly.
However, even with this approach, each video frame is still generated frame by frame when the rehabilitation exercise video is produced, so the generated video still lacks temporal consistency.
Therefore, how to improve the accuracy of video generation is an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a method for generating long-short time feature-guided human rehabilitation exercise video data, so as to partially solve the above-mentioned problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a human rehabilitation exercise video data generation method guided by long-short time features, which comprises the following steps:
Acquiring a reference image, a gesture sequence and a video sample;
Segmenting the gesture sequence and the video sample respectively to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and overlapping exists between adjacent segmented video samples;
Inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained so as to extract image reference characteristics corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the kth-1 segmented video sample into a video reference network in the video generation model so as to obtain video reference characteristics corresponding to the kth-1 segmented video sample;
Adding noise to the kth segmented video sample through the generated noise to obtain a noisy segmented video sample, inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise;
And training the video generation model by taking the difference between the minimized predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation motion video through a reference image and a gesture sequence given by a user.
Optionally, the video generation model further comprises an image semantic feature extraction model and a video semantic feature extraction model;
the method further comprises the steps of:
Inputting the reference image into the image semantic feature extraction model to obtain semantic features corresponding to the reference image, and inputting the kth-1 segmented video sample into the video semantic feature extraction model to obtain semantic features corresponding to the kth-1 segmented video sample.
Optionally, the image reference network comprises a plurality of sub-modules, and each sub-module comprises a spatial attention module and a composite cross attention module;
Extracting, through an image reference network in the video generation model to be trained, an image reference feature corresponding to the reference image, specifically including:
Coding the reference image through a variation self-coder to obtain an image code;
Inputting the image code into the image reference network, obtaining an attention result through a spatial attention module of a first sub-module, inputting semantic features corresponding to the attention result and the reference image into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into a next sub-module, and continuously obtaining the attention result through the spatial attention module and obtaining the cross attention result through the composite cross attention module;
And taking the attention result corresponding to each sub-module as the image reference characteristic.
Optionally, the video reference network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, and a temporal attention module;
Extracting video reference characteristics corresponding to the kth-1 segmented video sample through a video reference network in the video generation model, wherein the video reference characteristics comprise:
coding the (k-1) th segmented video sample through a variation self-coder to obtain video coding;
inputting the video code into the video reference network, obtaining an attention result through a space attention module of a first sub-module, inputting semantic features corresponding to the attention result and the k-1 th segmented video sample into the composite cross attention module, obtaining a cross attention result, inputting the cross attention result into a time attention module, obtaining a time attention result, inputting the time attention result into a next sub-module, continuously obtaining an attention result through the space attention module, obtaining a cross attention result through the composite cross attention module and obtaining a time attention result through the time attention module;
And taking the time attention result corresponding to each sub-module as the video reference characteristic.
Optionally, the stable diffusion network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module;
inputting the kth segment gesture sequence, the noisy segment video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segment video sample through the stable diffusion network to obtain prediction noise, wherein the method specifically comprises the following steps of:
coding the kth segment gesture sequence to obtain gesture codes;
Inputting the gesture codes, the noisy segmented video samples and the image reference features to a space attention module in a first sub-module to obtain a space attention result, removing parts belonging to the image reference features in the space attention result to obtain a removed result, inputting the removed result, semantic features corresponding to the reference images and semantic features corresponding to the k-1 segmented video samples to a composite cross attention module to obtain a cross attention result, inputting the cross attention result and the video reference features to a gating cross attention module to obtain a gating cross attention result, inputting the gating cross attention result to a time attention network to obtain a time attention result, and inputting the time attention result to a next sub-module;
The next submodule continues to obtain attention results through the space attention module, cross attention results are obtained through the compound cross attention module, gate cross attention results are obtained through the gate cross attention module, and time attention results are obtained through the time attention module;
And determining the prediction noise according to the time attention result determined by the last submodule.
Optionally, the compound cross attention module comprises a cross attention module and a gate control cross attention module;
Inputting the removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the k-1 th segmented video sample into a composite cross attention module to obtain a cross attention result, wherein the method specifically comprises the following steps of:
determining fusion semantic features for representing fusion features between 1 st to k-2 th segmented video samples;
inputting the removed result, the fused semantic features and the semantic features corresponding to the (k-1)-th segmented video sample to a cross attention module in the composite cross attention module to obtain a first attention result, wherein the first attention result comprises fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples, and the fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples are used for a training process of the (k+1)-th segmented video sample;
Inputting semantic features corresponding to the first attention result and the reference image into a gating cross attention module to obtain a second attention result;
And removing part of features in the second attention result except the removed result to obtain a cross attention result.
Optionally, inputting the gated cross attention result into a time attention network to obtain a time attention result, which specifically includes:
Sampling the part of the kth-1 segmented video sample before an overlapped frame to obtain a sampling result, wherein the overlapped frame is the overlapped part between the kth-1 segmented video sample and the kth segmented video sample;
Splicing codes corresponding to the sampling results in front of the characteristics corresponding to the kth segmented video sample in the gating cross attention results according to time sequence to obtain spliced results corresponding to the gating cross attention results;
And inputting the spliced result into a time attention network, determining an output result, and removing a part belonging to the sampling result in the output result to obtain a time attention result.
The specification provides a human rehabilitation exercise video data generating device guided by long-short time features, comprising:
the acquisition module is used for acquiring a reference image, a gesture sequence and a video sample;
The segmentation module is used for respectively segmenting the gesture sequence and the video sample to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap;
The input module is used for inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained so as to extract image reference characteristics corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the kth-1 segmented video sample into a video reference network in the video generation model so as to obtain video reference characteristics corresponding to the kth-1 segmented video sample;
the prediction module is used for adding noise to the kth segmented video sample through the generated noise to obtain a segmented video sample after adding noise, inputting the kth segmented gesture sequence, the segmented video sample after adding noise, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise;
the training module is used for training the video generation model by taking the difference between the minimized predicted noise and the generated noise as an optimization target, and the trained video generation model is used for generating a human body rehabilitation movement video through a reference image and a gesture sequence given by a user.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the human rehabilitation exercise video data generation method of long and short time feature guidance described above.
The present specification provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the human rehabilitation exercise video data generation method guided by the long-short time features when executing the program.
The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:
According to the human rehabilitation exercise video data generation method guided by long-short time features, a reference image, a gesture sequence and a video sample can be obtained, and the gesture sequence and the video sample are segmented respectively to obtain segmented gesture sequences and segmented video samples, where the segmented gesture sequences correspond one-to-one to the segmented video samples and adjacent segmented video samples overlap. The reference image, the segmented gesture sequences and the segmented video samples can then be input into a video generation model to be trained, so that image reference features corresponding to the reference image are extracted through an image reference network in the video generation model to be trained, and the (k-1)-th segmented video sample is input into a video reference network in the video generation model to obtain video reference features corresponding to the (k-1)-th segmented video sample. The k-th segmented video sample is noised with generated noise to obtain a noisy segmented video sample; the k-th segmented gesture sequence, the noisy segmented video sample, the video reference features and the image reference features are input into a stable diffusion network in the video generation model, and the noise added to the k-th segmented video sample is predicted through the stable diffusion network to obtain predicted noise. The video generation model is trained with minimizing the difference between the predicted noise and the generated noise as the optimization target, and the trained video generation model is used to generate a human rehabilitation exercise video from a reference image and a gesture sequence given by a user.
From the foregoing, it can be seen that the present invention provides the following advantages:
1. Compared with other methods that only generate the video frame by frame or generate the entire video directly, the method focuses on the consistency of high-frequency textures while maintaining temporal consistency, improving the quality of the generated video.
2. Compared with a simple autoregressive method, the method achieves better temporal continuity and high-frequency texture consistency by generating the video in segments.
3. The method fuses gated cross attention into the aligned modules of the stable diffusion network, which effectively avoids the information loss caused by feature-space differences between different sources of information, enables the model to make better use of spatial features and semantic information, handles temporal continuity better, and improves the stability and accuracy of the generation process.
4. The invention adopts a ReferenceNet (reference network) structure to extract the high-frequency information in each frame. Using this high-frequency information can effectively improve the quality and consistency of video generation, and can also improve the generalization capability of the model, so that the model adapts better to different data sets and scenes.
5. The invention designs an overlapped-frame part to pass temporal information and samples frames of the previous video segment, and adopts a temporal attention mechanism in each module of the stable diffusion network. By sampling frames of the previous video segment and encoding them into the latent space, the information of the previous segment can be used as a prior to guide the generation of the current video segment, enhancing temporal consistency in the video generation process and improving video quality and consistency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
Fig. 1 is a schematic flow chart of a method for generating long-short-time feature guided human rehabilitation exercise video data provided in the present specification;
FIG. 2 is a schematic diagram of a gesture sequence in the present specification;
FIG. 3 is a schematic diagram of a video generation model provided in the present specification;
FIG. 4 is a schematic diagram of a composite cross-attention module provided herein;
fig. 5 is a schematic diagram of a human rehabilitation exercise video data generating device guided by long-short time features provided in the present specification;
fig. 6 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a method for generating long-short time feature guided human rehabilitation exercise video data provided in the present specification, specifically including the following steps:
S100: a reference image, a pose sequence, and a video sample are acquired.
S102: and respectively segmenting the gesture sequence and the video samples to obtain each segmented gesture sequence and each segmented video sample, wherein one gesture sequence corresponds to one segmented video sample one by one, and overlapping exists between adjacent segmented video samples.
In this specification, a video generation model for generating human rehabilitation exercise videos needs to be trained, and realistic human rehabilitation exercise videos (i.e., videos of a real person doing rehabilitation exercise) can be generated with this video generation model. The generated videos have various uses: for example, they can serve as a source of training samples for neural network models used for other purposes related to human rehabilitation videos (such as building a three-dimensional model of the human body in the generated video), or the video generation model can be used to automatically generate rehabilitation exercise videos provided to medical staff or patients.
Based on this, the server may obtain reference images, gesture sequences, and video samples. The gesture sequence mentioned herein may be used to represent the gesture of the human body in the video sample at each moment, for example, the gesture sequence may refer to a video sequence that represents the gesture of the human body through the human body skeleton, as shown in fig. 2.
Fig. 2 is a schematic diagram of a gesture sequence in this specification.
The video sample is a video of a real person performing rehabilitation exercise, and the reference image may be an image containing the real person in the video sample; for example, the reference image may be a frame of the video sample.
Then, the gesture sequence and the video sample can be segmented respectively to obtain segmented gesture sequences and segmented video samples, where the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap.
Segmentation is performed separately on the gesture sequence and the video sample. Each segmented gesture sequence and segmented video sample may be set to contain K frames, with an overlap of s frames between every two adjacent segments (for example, s may be set to K/4). For an N-frame video, the segments then cover the frame ranges [1:K], [K-s+1:2K-s], [2K-2s+1:3K-2s], ..., [N-K+1:N].
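As an illustration of the segmentation scheme above, the following sketch computes the overlapping frame ranges; the function name and the assumption that frame indices are 1-based are illustrative, not part of the patent.

```python
def segment_ranges(n_frames: int, k: int, s: int):
    """Return 1-based (start, end) frame ranges of length k in which adjacent
    segments share s overlapping frames, i.e. [1:K], [K-s+1:2K-s], ..."""
    ranges = []
    start = 1
    while start + k - 1 <= n_frames:
        ranges.append((start, start + k - 1))
        if start + k - 1 == n_frames:
            break
        start += k - s  # advance by the stride so that s frames overlap
    if ranges and ranges[-1][1] < n_frames:
        # align a final segment to the last frame if the stride overshoots
        ranges.append((n_frames - k + 1, n_frames))
    return ranges

# Example: a 64-frame video with K = 16 and s = K // 4
print(segment_ranges(64, 16, 4))  # [(1, 16), (13, 28), (25, 40), (37, 52), (49, 64)]
```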
S104: inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained, extracting and obtaining image reference characteristics corresponding to the reference image through an image reference network in the video generation model to be trained, and extracting and obtaining video reference characteristics corresponding to the k-1 segmented video sample through a video reference network in the video generation model.
S106: and adding noise to the kth segmented video sample through the generated noise to obtain a noise-added segmented video sample, inputting the kth segmented gesture sequence, the noise-added segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting the noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise.
S108: and training the video generation model by taking the difference between the minimized predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation motion video through a reference image and a gesture sequence given by a user.
After the reference image, the segmented gesture sequences and the segmented video samples are determined, the reference image, the segmented gesture sequences and the segmented video samples can be input into a video generation model to be trained, and the video generation model to be trained is trained.
It should be noted that, when training the video generation model, if a complete video sample contains n segmented video samples, the video generation model needs to run its inference process n times, and from the second run onwards the (k-1)-th segmented video sample needs to be input to assist the training on the k-th segmented video sample.
Therefore, the image reference features corresponding to the reference image can be extracted through the image reference network in the video generation model to be trained, and the video reference features corresponding to the (k-1)-th segmented video sample can be extracted through the video reference network in the video generation model. The k-th segmented video sample is noised with the generated noise to obtain a noisy segmented video sample. The k-th segmented gesture sequence, the noisy segmented video sample, the video reference features and the image reference features are input into the stable diffusion network in the video generation model, and the noise added to the k-th segmented video sample is predicted through the stable diffusion network to obtain predicted noise. The video generation model is trained with minimizing the difference between the predicted noise and the generated noise as the optimization target, and the trained video generation model is used to generate a human rehabilitation exercise video from a reference image and a gesture sequence given by a user.
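A minimal sketch of one training step as described above, assuming a `model` object that exposes the three sub-networks and a noising helper under the attribute names used below (these names are assumptions for illustration, not the patent's API):

```python
import torch
import torch.nn.functional as F

def training_step(model, ref_image, pose_seg_k, video_seg_k, video_seg_prev):
    """Noise-prediction objective: noise the k-th segment, predict the added
    noise with the stable diffusion network, and minimise the difference."""
    img_ref_feats = model.image_reference_net(ref_image)        # image reference features
    vid_ref_feats = model.video_reference_net(video_seg_prev)   # features of segment k-1
    noise = torch.randn_like(video_seg_k)                       # the generated noise
    t = torch.randint(0, model.num_timesteps, (video_seg_k.size(0),))
    noisy_seg = model.q_sample(video_seg_k, t, noise)           # noisy segmented video sample
    pred_noise = model.stable_diffusion_net(
        pose_seg_k, noisy_seg, vid_ref_feats, img_ref_feats, t)
    return F.mse_loss(pred_noise, noise)  # difference between predicted and generated noise
```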
The foregoing briefly describes the training process corresponding to one segmented video sample (the k-th segmented video sample) within a complete video sample. When training the video generation model, noise is added to the video and the added noise is predicted by the model, so the training stage differs somewhat from the stage in which real videos are generated by the trained video generation model.
It should be noted that the above process of adding noise to the k-th segmented video sample (or to the features corresponding to the k-th segmented video sample) may add noise over multiple time steps, and the noise sequence may be predicted by the stable diffusion network.
Specifically, the number of time steps can be set to T and T noising steps are performed, with the Gaussian noise added at each step drawn according to a preset noise schedule $\beta_t$, until the segmented video sample (or the features corresponding to the segmented video sample) changes from its original state into pure Gaussian noise. The segmented video sample after T noising steps can be obtained directly by the following formula:
$$x_T = \sqrt{\bar{\alpha}_T}\, x_0 + \sqrt{1-\bar{\alpha}_T}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I}), \qquad \bar{\alpha}_T = \prod_{t=1}^{T}\left(1-\beta_t\right)$$
In the above formula, $x_0$ refers to the original segmented video sample and $x_T$ refers to the segmented video sample after T noising steps. This is the conventional formula in a stable diffusion model for adding noise to raw data over multiple time steps at once.
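The closed-form noising above can be written as the following sketch (a standard diffusion forward step; the tensor layout is an assumption):

```python
import torch

def q_sample(x0, t, noise, alphas_cumprod):
    """x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise, where a_bar_t is the
    cumulative product of (1 - beta) over the noise schedule."""
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # example linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(2, 4, 16, 32, 32)                 # (batch, latent channels, frames, h, w)
xt = q_sample(x0, torch.randint(0, T, (2,)), torch.randn_like(x0), alphas_cumprod)
```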
When generating a video, noise is first randomly generated; the randomly generated noise, a preset reference image and a preset gesture sequence are then input into the video generation model, and the noise output by the video generation model is used to denoise the randomly generated noise, thereby obtaining the generated video.
In the video generation model, the noise is output by the stable diffusion network, but the video generation model ultimately needs to output the generated video. A decoder (VAE decoder) may therefore be connected after the stable diffusion network, and the denoising result obtained by denoising the randomly generated noise is input into the decoder to obtain the generated video.
Video generation also proceeds segment by segment. Noise can be randomly generated, and the gesture sequence mentioned above is segmented to obtain segmented gesture sequences. The i-th segmented gesture sequence, the randomly generated noise, the reference image and the (i-1)-th generated video segment are input into the video generation model to output the i-th video segment, and all the video segments are spliced to obtain the complete video. (It should be noted that, whether in the training stage or in the stage of generating videos with the trained model, the training/generation process for the 1st segment does not involve inputting a previous segment.)
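A hedged sketch of the segment-by-segment generation and stitching just described; `model.denoise`, `model.vae_decoder` and `model.latent_shape` are placeholder names, and tensors are assumed to have their frame axis at dimension 2:

```python
import torch

@torch.no_grad()
def generate_video(model, ref_image, pose_segments, s):
    """Generate each segment from random noise, conditioning on the previously
    generated segment, then drop the s overlapping frames while stitching."""
    segments, prev_segment = [], None
    for pose_seg in pose_segments:
        noise = torch.randn(model.latent_shape(pose_seg))        # randomly generated noise
        latent = model.denoise(noise, pose_seg, ref_image, prev_segment)
        segment = model.vae_decoder(latent)                      # decode latents into frames
        # keep every frame of the first segment, skip the overlap afterwards
        segments.append(segment if prev_segment is None else segment[:, :, s:])
        prev_segment = segment
    return torch.cat(segments, dim=2)                            # concatenate along frames
```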
The structure of the video generation model is described in detail below, as shown in fig. 3.
Fig. 3 is a schematic structural diagram of a video generation model provided in the present specification.
The video generation model comprises three main modules: image reference network, stable diffusion network and video reference network, before the image reference network, stable diffusion network and video reference network, there are networks for encoding reference images, pose sequences and video.
Specifically, feature extraction is performed on the reference image by a variation self-encoder (VAE encoder) and an image semantic feature extraction model (CLIP network) respectively, on the gesture sequence by a gesture director (a convolutional neural network), and on the video by a variation self-encoder and a video semantic feature extraction model (CLIP network) respectively.
The difference between feature extraction by the variation self-encoder and by the semantic feature extraction model is that the variation self-encoder simply compresses the image or video to obtain image/video features, whereas the semantic feature extraction model can extract high-frequency texture features of the video and texture features of the image.
The reference image can be input into an image semantic feature extraction model to obtain semantic features corresponding to the reference image, and the kth-1 segmented video sample is input into a video semantic feature extraction model to obtain semantic features corresponding to the kth-1 segmented video sample. Where k may be a positive integer greater than 1.
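The encoder arrangement described above can be grouped as in the following sketch; the concrete backbones passed in (VAE encoder, CLIP image/video encoders, convolutional gesture director) are assumptions and the class is illustrative only:

```python
import torch.nn as nn

class ConditionEncoders(nn.Module):
    """Bundle the encoders that feed the three main networks."""
    def __init__(self, vae_encoder, clip_image, clip_video, gesture_director):
        super().__init__()
        self.vae_encoder = vae_encoder            # compresses frames into latent codes
        self.clip_image = clip_image              # semantic features of the reference image
        self.clip_video = clip_video              # semantic features of segment k-1
        self.gesture_director = gesture_director  # convolutional encoder for the gesture sequence

    def forward(self, ref_image, pose_seg_k, video_seg_prev):
        return {
            "image_code": self.vae_encoder(ref_image),
            "image_semantics": self.clip_image(ref_image),
            "video_code": self.vae_encoder(video_seg_prev),
            "video_semantics": self.clip_video(video_seg_prev),
            "gesture_code": self.gesture_director(pose_seg_k),
        }
```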
For the image reference network, several sub-modules may be included, each comprising a spatial attention module and a composite cross attention module (as shown in fig. 3).
The reference image can be encoded by the variation self-encoder to obtain an image code, and the image code is input into the image reference network. An attention result is obtained through the spatial attention module of the first sub-module; the attention result and the semantic features corresponding to the reference image are input into the composite cross attention module to obtain a cross attention result; the cross attention result is input into the next sub-module, which again obtains an attention result through its spatial attention module and a cross attention result through its composite cross attention module. The attention results corresponding to the sub-modules are used as the image reference features.
It should be noted that the spatial attention module spatially weights the image code of the reference image itself, while the composite cross attention module performs cross attention between the image code of the reference image and the semantic features. The composite cross attention module may consist of only a cross attention network, or of a cross attention network and a gated cross attention network, where the gated cross attention network consists of a cross attention network followed by a gating network.
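A minimal sketch of the gated cross-attention network described above (cross-attention first, gate after); the zero-initialised tanh gate and the residual connection are assumptions borrowed from common practice, not stated in the patent:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention over a context, followed by a learned scalar gate."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, x, context):
        out, _ = self.cross_attn(query=x, key=context, value=context)
        return x + torch.tanh(self.gate) * out    # gating network applied after cross-attention

# usage (illustrative): x is (batch, tokens, dim), context is (batch, ctx_len, dim)
# block = GatedCrossAttention(dim=320)
# y = block(x, context)
```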
For a video reference network, the video reference network comprises a plurality of sub-modules, and each sub-module comprises: a spatial attention module, a composite cross attention module, and a temporal attention module.
The (k-1)-th segmented video sample can be encoded by the variation self-encoder to obtain a video code, and the video code is input into the video reference network. An attention result is obtained through the spatial attention module of the first sub-module; the attention result and the semantic features corresponding to the (k-1)-th segmented video sample are input into the composite cross attention module to obtain a cross attention result; the cross attention result is input into the temporal attention module to obtain a temporal attention result; and the temporal attention result is input into the next sub-module, which again obtains an attention result through its spatial attention module, a cross attention result through its composite cross attention module and a temporal attention result through its temporal attention module. The temporal attention results corresponding to the sub-modules are then used as the video reference features.
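One sub-module of the video reference network could be sketched as follows; a plain cross-attention stands in for the composite cross-attention module, and the (batch, frames, tokens, dim) layout plus residual connections are assumptions:

```python
import torch
import torch.nn as nn

class VideoReferenceBlock(nn.Module):
    """Spatial attention within each frame, cross-attention against the semantics
    of segment k-1, then temporal attention across frames."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # stand-in for the composite module
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, semantics):
        # x: (batch, frames, tokens, dim); semantics: (batch, ctx_len, dim)
        b, f, n, d = x.shape
        xs = x.reshape(b * f, n, d)
        sa, _ = self.spatial_attn(xs, xs, xs)                      # attention result per frame
        x = (xs + sa).reshape(b, f * n, d)
        ca, _ = self.cross_attn(x, semantics, semantics)           # cross attention result
        x = (x + ca).reshape(b, f, n, d)
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)            # attend across frames
        ta, _ = self.temporal_attn(xt, xt, xt)                     # temporal attention result
        return (xt + ta).reshape(b, n, f, d).permute(0, 2, 1, 3)
```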
The image reference network and the video reference network mainly provide feature references, from the reference image and from the segment preceding the segment to be generated, for the stable diffusion network. The network mainly responsible for generating the video is therefore the stable diffusion network (of course, the stable diffusion network cannot directly generate the video; it is used to determine the noise added to the video, so when a video actually needs to be generated, the generated noise is denoised through the stable diffusion network to obtain the generated video).
Wherein the composite cross-attention module in the video reference network is identical to the composite cross-attention module in the image reference network and is not repeated here.
It should be noted that, since the stable diffusion network includes a large number of attention network layers, the image reference features, the video reference features, the semantic features corresponding to the reference image and the semantic features corresponding to the segmented video sample are fused, during model inference, into the features of the k-th segmented gesture sequence and the noisy segmented video sample; once the features are fused, the parts of the result corresponding to these conditioning features can be removed.
For the stable diffusion network, the stable diffusion network comprises a plurality of sub-modules, and each sub-module comprises: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module.
It should be noted that the numbers of sub-modules in the stable diffusion network, the video reference network and the image reference network are the same, i.e., their sub-modules correspond one-to-one. The internal outputs of each sub-module in the video reference network and the image reference network need to be provided to the corresponding sub-module of the stable diffusion network: the output of the spatial attention module of the image reference network is input to the spatial attention module of the corresponding sub-module of the stable diffusion network, and the output of the temporal attention module of the video reference network is input to the gated cross attention module of the corresponding sub-module of the stable diffusion network.
Specifically, the k-th segmented gesture sequence may be encoded (the segmented gesture sequence is encoded by the gesture director mentioned above) to obtain a gesture code. The gesture code, the noisy segmented video sample (which may likewise be obtained by encoding the segmented video sample through a feature-extraction network such as the variation self-encoder and then adding noise) and the image reference features are then input into the spatial attention module of the first sub-module to obtain a spatial attention result, and the part of the spatial attention result belonging to the image reference features is removed to obtain a removed result.
The removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the (k-1)-th segmented video sample can then be input into the composite cross attention module to obtain a cross attention result; the cross attention result and the video reference features are input into the gated cross attention module to obtain a gated cross attention result; the gated cross attention result is input into the temporal attention network to obtain a temporal attention result; and the temporal attention result is input into the next sub-module.
The next sub-module again obtains an attention result through its spatial attention module, a cross attention result through its composite cross attention module, a gated cross attention result through its gated cross attention module and a temporal attention result through its temporal attention module. The prediction noise can then be determined from the temporal attention result determined by the last sub-module.
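A hedged sketch of one stable-diffusion sub-module as described above. The points being illustrated are the concatenate-then-strip handling of the image reference features and the gated conditioning on the video reference features; the single-attention stand-ins, residual connections and additive gesture injection are assumptions:

```python
import torch
import torch.nn as nn

class DiffusionBlock(nn.Module):
    """Spatial attention (with image reference tokens appended then stripped),
    composite cross-attention stand-in, gated cross-attention over the video
    reference features, and temporal attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.composite_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gated_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, noisy_tokens, gesture_code, img_ref_feat, vid_ref_feat, semantics):
        # all inputs: (batch, *, dim); gesture_code matches noisy_tokens in shape
        h = noisy_tokens + gesture_code                       # inject the gesture encoding
        joint = torch.cat([h, img_ref_feat], dim=1)           # append image reference tokens
        sa, _ = self.spatial_attn(joint, joint, joint)
        h = h + sa[:, : h.size(1)]                            # drop the part belonging to the image reference features
        ca, _ = self.composite_cross_attn(h, semantics, semantics)
        h = h + ca                                            # cross attention result
        ga, _ = self.gated_cross_attn(h, vid_ref_feat, vid_ref_feat)
        h = h + torch.tanh(self.gate) * ga                    # gated cross attention result
        ta, _ = self.temporal_attn(h, h, h)                   # temporal attention result
        return h + ta
```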
The following detailed description is directed to the composite cross attention module. Consistent with the image reference network, it may be composed of a cross attention network and a gated cross attention network, where the gated cross attention network consists of a cross attention network followed by a gating network, as shown in fig. 4.
Fig. 4 is a schematic structural diagram of a composite cross-attention module provided in the present specification.
Fusion semantic features for characterizing the fused features of the 1st to (k-2)-th segmented video samples can be determined. The removed result, the fusion semantic features and the semantic features corresponding to the (k-1)-th segmented video sample are input into the cross attention module within the composite cross attention module to obtain a first attention result; the first attention result contains fusion semantic features characterizing the fused features of the 1st to (k-1)-th segmented video samples, and these fusion semantic features are used in the training process for the (k+1)-th segmented video sample.
Then, the first attention result and the semantic features corresponding to the reference image are input into the gated cross attention module to obtain a second attention result, and the features in the second attention result other than those corresponding to the removed result are discarded to obtain the cross attention result.
As can be seen from fig. 4 (for ease of explanation, only the input-output relationship of the fusion semantic features and the semantic features is drawn), the inputs of the cross attention module within the composite cross attention module include the semantic features corresponding to the (k-1)-th segmented video sample and the fusion semantic features characterizing the fused features of the 1st to (k-2)-th segmented video samples, and its output contains the fusion semantic features characterizing the fused features of the 1st to (k-1)-th segmented video samples.
It should be noted that the fusion semantic features characterizing the fused features of the 1st to (k-2)-th segmented video samples are obtained during the training of the video generation model on the (k-1)-th segmented video sample.
Specifically, the training process for the 1st segmented video sample does not involve this step, and the training process for the 2nd segmented video sample only needs the semantic features corresponding to the 1st segmented video sample as input.
In the training process for the 3rd segmented video sample, the input of the cross attention module within the composite cross attention module includes the semantic features corresponding to the 1st segmented video sample and the semantic features corresponding to the 2nd segmented video sample; fusing these two through the cross attention module yields the fusion semantic features characterizing the fused features of the 1st to 2nd segmented video samples.
In the training process for the 4th segmented video sample, the input of the cross attention module within the composite cross attention module includes the fusion semantic features characterizing the fused features of the 1st to 2nd segmented video samples and the semantic features corresponding to the 3rd segmented video sample; fusing these two through the cross attention module yields the fusion semantic features characterizing the fused features of the 1st to 3rd segmented video samples. By analogy, the fusion semantic features obtained while training on one segmented video sample are used while training on the next segmented video sample.
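The recursion over segments described above can be sketched as follows; abstracting the fusion out of the composite cross-attention module into a standalone helper is a simplification, and all names are illustrative:

```python
import torch
import torch.nn as nn

def update_fused_semantics(fused_prev, semantics_prev_segment, cross_attn):
    """Merge the fused semantics of segments 1..k-2 with the semantics of
    segment k-1, producing the fused semantics of segments 1..k-1."""
    if fused_prev is None:            # training on segment 2: only segment 1's semantics exist
        return semantics_prev_segment
    out, _ = cross_attn(query=fused_prev, key=semantics_prev_segment,
                        value=semantics_prev_segment)
    return fused_prev + out

# usage across the training runs of one video (illustrative):
# cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
# fused = None
# for k in range(2, num_segments + 1):
#     fused = update_fused_semantics(fused, clip_video(segments[k - 2]), cross_attn)
```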
It should be noted that, as can be seen from fig. 3, there is a branch that samples the (k-1)-th segmented video sequence. That is, the part of the (k-1)-th segmented video sample before the overlapped frames (the overlapped frames being the part where the (k-1)-th segmented video sample overlaps the k-th segmented video sample) may be sampled to obtain a sampling result. The codes corresponding to the sampling result are then spliced, in time order, in front of the features corresponding to the k-th segmented video sample in the gated cross attention result, to obtain a spliced result. The spliced result is input into the temporal attention network to determine an output result, and the part of the output result belonging to the sampling result is removed to obtain the temporal attention result.
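A minimal sketch of the overlap-frame handling just described, assuming a (batch, frames, dim) layout and any attention layer with a (query, key, value) interface:

```python
import torch
import torch.nn as nn

def temporal_attention_with_prefix(gated_result, sampled_prev_codes, temporal_attn):
    """Splice codes sampled from segment k-1 in front of segment k's features in
    time order, run temporal attention, then drop the prefix positions."""
    p = sampled_prev_codes.size(1)                       # number of sampled prefix frames
    spliced = torch.cat([sampled_prev_codes, gated_result], dim=1)
    out, _ = temporal_attn(spliced, spliced, spliced)
    return out[:, p:]                                    # remove the part belonging to the sampling result

# usage (illustrative):
# temporal_attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
# result = temporal_attention_with_prefix(gated_result, sampled_codes, temporal_attn)
```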
For convenience of description, the execution subject of the method is described as a server; in practice, the execution subject may be a computer, a controller, a server or another device, which is not limited here. The features of the following examples and embodiments may be combined with each other without conflict.
Based on the same concept as the above long-short time feature guided human rehabilitation exercise video data generation method, the present specification further provides a human rehabilitation exercise video data generating device guided by long-short time features, as shown in fig. 5.
Fig. 5 is a schematic diagram of a human rehabilitation exercise video data generating device guided by long-short time features provided in the present specification, including:
An acquisition module 501, configured to acquire a reference image, a gesture sequence, and a video sample;
The segmentation module 502 is configured to segment the gesture sequence and the video sample respectively to obtain segmented gesture sequences and segmented video samples, where the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap;
An input module 503, configured to input the reference image, the segmented gesture sequence, and the segmented video sample into a video generation model to be trained, so as to extract, through an image reference network in the video generation model to be trained, an image reference feature corresponding to the reference image, and input a kth-1 segmented video sample into a video reference network in the video generation model, so as to obtain a video reference feature corresponding to the kth-1 segmented video sample;
The prediction module 504 is configured to perform noise addition on a kth segmented video sample through the generated noise to obtain a noisy segmented video sample, and input a kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predict noise added to the kth segmented video sample through the stable diffusion network to obtain predicted noise;
The training module 505 is configured to train the video generating model with a difference between the minimized predicted noise and the generated noise as an optimization target, where the trained video generating model is used to generate a human rehabilitation motion video through a reference image and a gesture sequence given by a user.
Optionally, the video generation model further comprises an image semantic feature extraction model and a video semantic feature extraction model;
the input module 503 is further configured to input the reference image into the image semantic feature extraction model to obtain semantic features corresponding to the reference image, and input the kth-1 segmented video sample into the video semantic feature extraction model to obtain semantic features corresponding to the kth-1 segmented video sample.
Optionally, the image reference network comprises a plurality of sub-modules, and each sub-module comprises a spatial attention module and a composite cross attention module;
The input module 503 is specifically configured to encode the reference image by using a variation self-encoder to obtain an image code; inputting the image code into the image reference network, obtaining an attention result through a spatial attention module of a first sub-module, inputting semantic features corresponding to the attention result and the reference image into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into a next sub-module, and continuously obtaining the attention result through the spatial attention module and obtaining the cross attention result through the composite cross attention module; and taking the attention result corresponding to each sub-module as the image reference characteristic.
Optionally, the video reference network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, and a temporal attention module;
The input module 503 is specifically configured to encode the kth-1 segment video sample by using a variation self-encoder to obtain video encoding; inputting the video code into the video reference network, obtaining an attention result through a space attention module of a first sub-module, inputting semantic features corresponding to the attention result and the k-1 th segmented video sample into the composite cross attention module, obtaining a cross attention result, inputting the cross attention result into a time attention module, obtaining a time attention result, inputting the time attention result into a next sub-module, continuously obtaining an attention result through the space attention module, obtaining a cross attention result through the composite cross attention module and obtaining a time attention result through the time attention module; and taking the time attention result corresponding to each sub-module as the video reference characteristic.
Optionally, the stable diffusion network includes a plurality of sub-modules, and each sub-module includes: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module;
The prediction module 504 is specifically configured to encode the kth segment gesture sequence to obtain a gesture code; inputting the gesture codes, the noisy segmented video samples and the image reference features to a space attention module in a first sub-module to obtain a space attention result, removing parts belonging to the image reference features in the space attention result to obtain a removed result, inputting the removed result, semantic features corresponding to the reference images and semantic features corresponding to the k-1 segmented video samples to a composite cross attention module to obtain a cross attention result, inputting the cross attention result and the video reference features to a gating cross attention module to obtain a gating cross attention result, inputting the gating cross attention result to a time attention network to obtain a time attention result, and inputting the time attention result to a next sub-module; the next submodule continues to obtain attention results through the space attention module, cross attention results are obtained through the compound cross attention module, gate cross attention results are obtained through the gate cross attention module, and time attention results are obtained through the time attention module; and determining the prediction noise according to the time attention result determined by the last submodule.
Optionally, the compound cross attention module comprises a cross attention module and a gate control cross attention module;
The prediction module 504 is specifically configured to determine fusion semantic features for characterizing fusion features between the 1st to (k-2)-th segmented video samples; inputting the removed result, the fused semantic features and the semantic features corresponding to the (k-1)-th segmented video sample to a cross attention module in a composite cross attention module to obtain a first attention result, wherein the first attention result comprises fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples, and the fused semantic features used for representing the fused features between the 1st to (k-1)-th segmented video samples are used for a training process of the (k+1)-th segmented video sample; inputting semantic features corresponding to the first attention result and the reference image into a gating cross attention module to obtain a second attention result; and removing part of features in the second attention result except the removed result to obtain a cross attention result.
Optionally, the prediction module 504 is specifically configured to sample a portion of the kth-1 segmented video sample before an overlapping frame, where the overlapping frame is a portion of the kth-1 segmented video sample overlapping with the kth segmented video sample, to obtain a sampling result; splicing codes corresponding to the sampling results in front of the characteristics corresponding to the kth segmented video sample in the gating cross attention results according to time sequence to obtain spliced results corresponding to the gating cross attention results; and inputting the spliced result into a time attention network, determining an output result, and removing a part belonging to the sampling result in the output result to obtain a time attention result.
The present specification also provides a computer-readable storage medium storing a computer program which can be used to execute the long-short time feature guided human body rehabilitation exercise video data generation method described above.
The present specification also provides a schematic structural diagram of the electronic device shown in Fig. 6. As illustrated in Fig. 6, at the hardware level the electronic device includes a processor, an internal bus, a network interface, a memory and a non-volatile memory, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it, so as to implement the long-short time feature guided human body rehabilitation exercise video data generation method described above.
Of course, the present specification does not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the processing flows described above is not limited to logic units, and may also be hardware or a logic device.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs to "integrate" a digital system onto a PLD, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development: the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained by slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Or, even, the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing its functions into various units. Of course, when the present specification is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in the form of a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising that element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the corresponding description of the method embodiments.
The foregoing is merely an embodiment of the present specification and is not intended to limit the present specification. Various modifications and alterations of the present specification will be apparent to those skilled in the art. Any modification, equivalent substitution, improvement or the like made within the spirit and principles of the present specification shall be included within the scope of the claims of the present specification.

Claims (10)

1. A long-short time feature guided human body rehabilitation exercise video data generation method, characterized by comprising the following steps:
Acquiring a reference image, a gesture sequence and a video sample, wherein the reference image comprises a real human body image in the video sample;
Segmenting the gesture sequence and the video sample respectively to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap with each other;
Inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained, so as to extract an image reference feature corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the (k-1)th segmented video sample into a video reference network in the video generation model to obtain a video reference feature corresponding to the (k-1)th segmented video sample;
Adding noise to the kth segmented video sample through generated noise to obtain a noisy segmented video sample, inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting, through the stable diffusion network, the noise added to the kth segmented video sample to obtain predicted noise;
And training the video generation model with minimization of the difference between the predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation exercise video from a reference image and a gesture sequence given by a user.
2. The method of claim 1, wherein the video generation model further comprises an image semantic feature extraction model and a video semantic feature extraction model;
the method further comprises the steps of:
Inputting the reference image into the image semantic feature extraction model to obtain semantic features corresponding to the reference image, and inputting the (k-1)th segmented video sample into the video semantic feature extraction model to obtain semantic features corresponding to the (k-1)th segmented video sample.
3. The method of claim 2, wherein the image reference network includes a plurality of sub-modules, each sub-module including a spatial attention module and a composite cross attention module;
Extracting, through an image reference network in the video generation model to be trained, an image reference feature corresponding to the reference image, specifically including:
Encoding the reference image through a variational autoencoder to obtain an image code;
Inputting the image code into the image reference network, obtaining an attention result through the spatial attention module of the first sub-module, inputting the attention result and the semantic features corresponding to the reference image into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into the next sub-module, and continuing to obtain attention results through the spatial attention module and cross attention results through the composite cross attention module;
And taking the attention result corresponding to each sub-module as the image reference feature.
4. The method of claim 2, wherein the video reference network includes a plurality of sub-modules, each sub-module including: a spatial attention module, a composite cross attention module, and a temporal attention module;
Extracting, through the video reference network in the video generation model, the video reference feature corresponding to the (k-1)th segmented video sample, specifically including:
Encoding the (k-1)th segmented video sample through a variational autoencoder to obtain a video code;
Inputting the video code into the video reference network, obtaining an attention result through the spatial attention module of the first sub-module, inputting the attention result and the semantic features corresponding to the (k-1)th segmented video sample into the composite cross attention module to obtain a cross attention result, inputting the cross attention result into the temporal attention module to obtain a temporal attention result, inputting the temporal attention result into the next sub-module, and continuing to obtain attention results through the spatial attention module, cross attention results through the composite cross attention module and temporal attention results through the temporal attention module;
And taking the temporal attention result corresponding to each sub-module as the video reference feature.
5. The method of claim 2, wherein the stable diffusion network includes a plurality of sub-modules therein, each sub-module including: a spatial attention module, a composite cross attention module, a gated cross attention module, and a temporal attention module;
Inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into the stable diffusion network in the video generation model, and predicting, through the stable diffusion network, the noise added to the kth segmented video sample to obtain the predicted noise, specifically including:
Encoding the kth segmented gesture sequence to obtain a gesture code;
Inputting the gesture code, the noisy segmented video sample and the image reference feature into the spatial attention module of the first sub-module to obtain a spatial attention result, removing the part of the spatial attention result that belongs to the image reference feature to obtain a removed result, inputting the removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the (k-1)th segmented video sample into the composite cross attention module to obtain a cross attention result, inputting the cross attention result and the video reference feature into the gated cross attention module to obtain a gated cross attention result, inputting the gated cross attention result into the temporal attention module to obtain a temporal attention result, and inputting the temporal attention result into the next sub-module;
The next sub-module continues to obtain an attention result through the spatial attention module, a cross attention result through the composite cross attention module, a gated cross attention result through the gated cross attention module, and a temporal attention result through the temporal attention module;
And determining the predicted noise according to the temporal attention result determined by the last sub-module.
6. The method of claim 5, wherein the composite cross attention module comprises a cross attention module and a gated cross attention module;
Inputting the removed result, the semantic features corresponding to the reference image and the semantic features corresponding to the (k-1)th segmented video sample into the composite cross attention module to obtain a cross attention result, specifically including:
Determining fused semantic features for characterizing the fusion features between the 1st to (k-2)th segmented video samples;
Inputting the removed result, the fused semantic features and the semantic features corresponding to the (k-1)th segmented video sample into the cross attention module of the composite cross attention module to obtain a first attention result, wherein the first attention result comprises fused semantic features for characterizing the fusion features between the 1st to (k-1)th segmented video samples, and the fused semantic features for characterizing the fusion features between the 1st to (k-1)th segmented video samples are used in the training process of the (k+1)th segmented video sample;
Inputting the first attention result and the semantic features corresponding to the reference image into the gated cross attention module to obtain a second attention result;
And removing, from the second attention result, the features other than those corresponding to the removed result, to obtain the cross attention result.
7. The method according to claim 5, wherein inputting the gated cross attention result into the temporal attention module to obtain a temporal attention result specifically includes:
Sampling the part of the (k-1)th segmented video sample before the overlapping frames to obtain a sampling result, wherein the overlapping frames are the overlapping part between the (k-1)th segmented video sample and the kth segmented video sample;
Splicing the codes corresponding to the sampling result in front of the features corresponding to the kth segmented video sample in the gated cross attention result in temporal order, to obtain a spliced result corresponding to the gated cross attention result;
And inputting the spliced result into the temporal attention module, determining an output result, and removing the part of the output result that belongs to the sampling result, to obtain the temporal attention result.
8. A long-short time feature guided human body rehabilitation exercise video data generation device, comprising:
The acquisition module is used for acquiring a reference image, a gesture sequence and a video sample, wherein the reference image comprises a real human body image in the video sample;
The segmentation module is used for respectively segmenting the gesture sequence and the video sample to obtain segmented gesture sequences and segmented video samples, wherein the segmented gesture sequences correspond one-to-one to the segmented video samples, and adjacent segmented video samples overlap with each other;
The input module is used for inputting the reference image, the segmented gesture sequence and the segmented video sample into a video generation model to be trained, so as to extract an image reference feature corresponding to the reference image through an image reference network in the video generation model to be trained, and inputting the (k-1)th segmented video sample into a video reference network in the video generation model to obtain a video reference feature corresponding to the (k-1)th segmented video sample;
The prediction module is used for adding noise to the kth segmented video sample through generated noise to obtain a noisy segmented video sample, inputting the kth segmented gesture sequence, the noisy segmented video sample, the video reference feature and the image reference feature into a stable diffusion network in the video generation model, and predicting, through the stable diffusion network, the noise added to the kth segmented video sample to obtain predicted noise;
The training module is used for training the video generation model with minimization of the difference between the predicted noise and the generated noise as an optimization target, wherein the trained video generation model is used for generating a human body rehabilitation exercise video from a reference image and a gesture sequence given by a user.
9. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-7 when executing the program.
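For illustration, the noise-prediction training objective described above can be sketched as follows; the model call signature, the omission of a diffusion noise schedule and all tensor shapes are assumptions, not the patented implementation.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, pose_seq_k, segment_k, vid_ref_feat, img_ref_feat):
    noise = torch.randn_like(segment_k)        # generated noise
    noisy_segment = segment_k + noise          # noised kth segment (noise schedule omitted)
    pred_noise = model(pose_seq_k, noisy_segment, vid_ref_feat, img_ref_feat)
    loss = F.mse_loss(pred_noise, noise)       # minimise predicted vs. added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()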
CN202410281162.0A 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features Active CN117880444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410281162.0A CN117880444B (en) 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410281162.0A CN117880444B (en) 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features

Publications (2)

Publication Number Publication Date
CN117880444A CN117880444A (en) 2024-04-12
CN117880444B true CN117880444B (en) 2024-05-24

Family

ID=90579570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410281162.0A Active CN117880444B (en) 2024-03-12 2024-03-12 Human body rehabilitation exercise video data generation method guided by long-short time features

Country Status (1)

Country Link
CN (1) CN117880444B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118365510B (en) * 2024-06-19 2024-09-13 阿里巴巴达摩院(杭州)科技有限公司 Image processing method, training method of image processing model and image generating method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007133609A (en) * 2005-11-09 2007-05-31 Oki Electric Ind Co Ltd Video generating device and video generating method
JP2018032316A (en) * 2016-08-26 2018-03-01 日本電信電話株式会社 Video generation device, video generation model learning device, method for the same, and program
CN107506800A (en) * 2017-09-21 2017-12-22 深圳市唯特视科技有限公司 It is a kind of based on unsupervised domain adapt to without label video face identification method
WO2021190078A1 (en) * 2020-03-26 2021-09-30 华为技术有限公司 Method and apparatus for generating short video, and related device and medium
US11727618B1 (en) * 2022-08-25 2023-08-15 xNeurals Inc. Artificial intelligence-based system and method for generating animated videos from an audio segment
CN116392812A (en) * 2022-12-02 2023-07-07 阿里巴巴(中国)有限公司 Action generating method and virtual character animation generating method
CN116233491A (en) * 2023-05-04 2023-06-06 阿里巴巴达摩院(杭州)科技有限公司 Video generation method and server
CN117409121A (en) * 2023-10-17 2024-01-16 西安电子科技大学 Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN117499711A (en) * 2023-11-08 2024-02-02 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of video generation model
CN117593473A (en) * 2024-01-17 2024-02-23 淘宝(中国)软件有限公司 Method, apparatus and storage medium for generating motion image and video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Enabling the Encoder-Empowered GAN-based Video Generators for Long Video Generation; J. Yang and A. G. Bors; 2023 IEEE International Conference on Image Processing (ICIP); 2023-09-11; full text *
Adversarial video generation method based on multi-modal input; Yu Haitao; Yang Xiaoshan; Xu Changsheng; Journal of Computer Research and Development; 2020-07-07 (07); full text *

Also Published As

Publication number Publication date
CN117880444A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN117880444B (en) Human body rehabilitation exercise video data generation method guided by long-short time features
CN117372631B (en) Training method and application method of multi-view image generation model
CN108765334A (en) A kind of image de-noising method, device and electronic equipment
CN112784857B (en) Model training and image processing method and device
CN116977525B (en) Image rendering method and device, storage medium and electronic equipment
CN116343314B (en) Expression recognition method and device, storage medium and electronic equipment
CN117635822A (en) Model training method and device, storage medium and electronic equipment
CN117409466B (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN116030247B (en) Medical image sample generation method and device, storage medium and electronic equipment
CN117079777A (en) Medical image complement method and device, storage medium and electronic equipment
CN116524295A (en) Image processing method, device, equipment and readable storage medium
CN117726760B (en) Training method and device for three-dimensional human body reconstruction model of video
CN117808976B (en) Three-dimensional model construction method and device, storage medium and electronic equipment
CN117726907B (en) Training method of modeling model, three-dimensional human modeling method and device
CN117911630B (en) Three-dimensional human modeling method and device, storage medium and electronic equipment
CN117830564B (en) Three-dimensional virtual human model reconstruction method based on gesture distribution guidance
CN114528923B (en) Video target detection method, device, equipment and medium based on time domain context
CN113887326B (en) Face image processing method and device
CN116309924B (en) Model training method, image display method and device
CN116991388B (en) Graph optimization sequence generation method and device of deep learning compiler
CN117635912A (en) Method, device, medium and equipment for predicting growth of interested part
CN112950732B (en) Image generation method and device, storage medium and electronic equipment
CN116229218B (en) Model training and image registration method and device
CN116188469A (en) Focus detection method, focus detection device, readable storage medium and electronic equipment
CN116245773A (en) Face synthesis model training method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant