CN116524419B - Video prediction method and system based on space-time decoupling and self-attention difference LSTM - Google Patents

Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Info

Publication number
CN116524419B
CN116524419B CN202310802044.5A CN202310802044A
Authority
CN
China
Prior art keywords
video
representing
decoupling
dynamic
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310802044.5A
Other languages
Chinese (zh)
Other versions
CN116524419A (en)
Inventor
陈苏婷
薄业雯
胡斌武
韩光勋
裴加明
杨宁
孙俊
高云勇
夏芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD
Nanjing University of Information Science and Technology
Original Assignee
NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD, Nanjing University of Information Science and Technology filed Critical NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD
Priority to CN202310802044.5A priority Critical patent/CN116524419B/en
Publication of CN116524419A publication Critical patent/CN116524419A/en
Application granted granted Critical
Publication of CN116524419B publication Critical patent/CN116524419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a video prediction method and system based on space-time decoupling and self-attention difference LSTM. The method comprises the following steps: introducing an adversarial loss constraint and a similarity constraint to construct a space-time decoupling network and obtain decoupled features of the video; designing a dynamic differential model with differential operations to replace the forget gate of the traditional LSTM unit; designing a gating mechanism on the basis of attention, deeply fusing long-term memory with the attended features, and constructing a new global self-attention model; combining the dynamic differential model and the global self-attention model to build a DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct a DISA-LSTM prediction network; and building an overall network architecture based on a convolutional auto-encoder and training the model with a combined loss function. The invention effectively improves the ability to capture high-dimensional, dynamically complex features and the accuracy of video prediction, and reduces the complexity that high-dimensional video data brings to the prediction task.

Description

Video prediction method and system based on space-time decoupling and self-attention difference LSTM
Technical Field
The invention belongs to the field of electronic communication and information engineering, and particularly relates to a video prediction method and a video prediction system based on space-time decoupling and self-attention difference LSTM.
Background
The high-dimensional nature of video data and the complexity of diverse temporal motion evolution pose great challenges to video prediction. Existing prediction methods suffer from loss of spatial detail and temporally inconsistent motion prediction; the generated results are over-smoothed and fail to retain high-frequency detail, leading to low prediction accuracy and blurred, unrealistic predicted frames.
The current mainstream video prediction methods are based on convolutional neural networks, recurrent neural networks and generative adversarial networks. Convolutional neural networks have strong feature-learning ability, but their receptive field is limited, they cannot model long-range spatial dependencies, and their ability to describe temporal information is limited; recurrent neural networks are designed specifically for data with a temporal dimension, but describe the internal spatial structure of the data insufficiently. Therefore, in recent years many researchers have combined convolutional and recurrent networks in spatio-temporal sequence prediction tasks so as to learn both spatial structure and temporal information. As a recent research hotspot, generative adversarial networks have been shown to yield more accurate video predictions than the L2 loss function alone, but mode collapse, i.e. convergence to a single mode during training, remains their major weakness. In addition, video sequences, as high-dimensional spatio-temporal data, tend to exhibit complex spatio-temporal coupling, which further increases the difficulty of predictive learning.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a video prediction method and system based on space-time decoupling and self-attention difference LSTM are provided. Based on a convolutional auto-encoder structure, a space-time decoupling network, a dynamic differential model and a global self-attention model are introduced to construct a network model for video frame prediction, which reduces the complexity that high-dimensional video data brings to the prediction task and improves the network's expression of short-term dependencies and long-term associations.
The invention adopts the following technical scheme for solving the technical problem:
The invention provides a video prediction method based on space-time decoupling and self-attention difference LSTM, which comprises the following steps:
S1, constructing a space-time decoupling network, and decoupling the temporal dynamic features and spatial static features of a video, so as to reduce the complexity that high-dimensional video data brings to the prediction task.
S2, designing a dynamic differential model comprising a forget gate, an input gate, an update gate and an output gate by using differential operations, to replace the forget gate of the LSTM unit and thereby improve the network's ability to capture high-dimensional, dynamically complex features.
S3, in order to improve the network's ability to fit global spatio-temporal correlations, designing a gating mechanism on the basis of attention, deeply fusing long-term memory with the attended features, and building a new global self-attention model.
S4, embedding the dynamic differential model and the global self-attention model into the LSTM unit to form a new DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct a DISA-LSTM prediction network.
S5, constructing the overall network architecture based on a convolutional auto-encoder, and training the video prediction model by combining an adversarial loss function, a similarity loss function, a reconstruction loss function and a prediction loss.
Further, in step S1, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are:
(1) Specific content of decoupling the temporal dynamic features of the video
Building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1.
Extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula as follows:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment.
The static features containing appearance information do not change with time within the same video segment but differ across different video segments; therefore, when the feature discriminator can no longer judge whether the dynamic features come from the same video segment, the task of completely decoupling the temporal dynamic features is accomplished.
The feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into the 0-1 interval.
(2) Specific content of decoupling the spatial static features of the video
The spatial static features represent the background, color and appearance details of the video. The static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula as follows:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence.
Further, in step S1, the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k.
Further, in step S2, the specific design contents of the dynamic differential model are:
The forget gate of the traditional LSTM unit is replaced by the dynamic differential model: the model receives the differential information of the hidden states of adjacent time steps and, after passing it through the forget gate, the input gate and the update gate, fuses it with the long-term memory cell of the previous time step to form the differential features. The output gate then generates the long-term memory cell of the current time step, which participates in the subsequent information updating. The specific formula is as follows:
wherein σ and tanh represent the Sigmoid and tanh activation functions, respectively; * represents the convolution operation and ⊙ represents the Hadamard product; W'_hf, W'_hi and W'_hg are two-dimensional convolution kernels, b'_f, b'_i and b'_g are biases, t represents the time step and l represents the layer; f'_t, i'_t and g'_t respectively represent the forget gate, input gate and update gate used to screen the differential information; the remaining symbols represent, respectively, the differential information of the hidden states of adjacent time steps, the long-term memory cell of the previous time step, and the differential features.
Further, in step S3, the specific steps of constructing the new global self-attention model are as follows:
S301, assigning three different 1×1 weights {W_q, W_k, W_v} to the hidden state of the current time step, mapping it into three different spaces of query vectors, key vectors and value vectors; the similarity score between the j-th key vector and the e-th query vector is computed, the scores are normalized with the softmax() activation function to obtain the similarity distribution of each position, and the distribution is multiplied by the corresponding value vectors to obtain the attention features, with the specific formula as follows:
wherein Q represents the query vectors, K represents the key vectors, V represents the value vectors, W_q, W_k and W_v represent the three different weights assigned to the hidden state, the hidden state has size C×H×W (number of channels × height × width) with N = H×W, e and j represent the position indices of the query and key vectors, Q_e represents the e-th query vector, K_j^T represents the transpose of the j-th key vector, d_k represents the dimension of the key vectors, Z represents the attention features, and T represents the transpose.
S302, deeply fusing the long-term memory with the attention features and the hidden state, with the specific formula as follows:
wherein i, g and o respectively represent the input gate, update gate and output gate in the global self-attention module, W_(i;h), W_(i;z), W_(g;h), W_(g;z), W_(o;h) and W_(o;z) represent two-dimensional convolution kernels, b_i, b_g and b_o represent biases, and the remaining symbols represent the updated long-term memory and hidden state and the hidden state of the current time step after passing through the global attention module.
Further, in step S4, the specific steps of constructing the DISA-LSTM prediction network are as follows:
S401, replacing the forget gate in the LSTM unit with the dynamic differential model, and feeding the hidden state of the current time step together with the long-term memory of the previous moment into the global self-attention model, forming a new DISA-LSTM unit.
S402, the DISA-LSTM prediction network is formed by stacking three layers of memory units; since the first layer receives no features from a preceding layer, an ST-LSTM unit is used there, and the other two layers are DISA-LSTM units. The DISA-LSTM prediction network adopts a diagonal recurrent structure, in which differential information is propagated diagonally across layers while the hidden states processed by the global self-attention model are propagated along the time dimension.
Further, in step S5, the specific content of the video prediction model is:
The overall network architecture constructed on the basis of the convolutional auto-encoder mainly comprises three parts: an encoder, the DISA-LSTM prediction network and a decoder, wherein:
The encoder consists of the space-time decoupling network, comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation.
The feature vector representations from the encoder are further fused and then fed into the DISA-LSTM prediction network, which generates future frames by learning the internal latent relationships.
The decoder consists of deconvolutions; to enhance the appearance detail of the predicted images, the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder.
The video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise the adversarial loss, the similarity loss and the reconstruction loss; the decoupling losses are used to encourage complete decoupling of the video data and to ensure that future frames can be reconstructed from the spatial static features, while the prediction loss is used to maximize the similarity between the predicted and actual results. The model is optimized with the back-propagation algorithm to improve the quality of the predicted images. The specific formula is as follows:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
Furthermore, the invention also provides a video prediction system based on space-time decoupling and self-attention difference LSTM, which comprises:
a video feature decoupling module, used to construct the space-time decoupling network and decouple the temporal dynamic features and spatial static features of the video;
a dynamic differential model design module, used to design, with differential operations, a dynamic differential model comprising a forget gate, an input gate and an update gate to replace the forget gate of the LSTM unit;
a depth fusion feature module, used to design a gating mechanism on the basis of attention and deeply fuse the long-term memory with the attended features to form a new global self-attention model;
and a video prediction model training module, used to build the overall network architecture based on the convolutional auto-encoder and train the video prediction model by combining the adversarial loss function, similarity loss function, reconstruction loss function and prediction loss.
Furthermore, the invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the video prediction method based on space-time decoupling and self-attention difference LSTM when executing the computer program.
Further, the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the above video prediction method based on space-time decoupling and self-attention difference LSTM.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. By introducing the space-time decoupling network, the invention decouples the temporal dynamic features from the spatial static features containing appearance information, thereby reducing the complexity that high-dimensional video data brings to the prediction task.
2. The DISA-LSTM prediction network, improved by combining dynamic differencing with global self-attention, is able to handle non-stationary sequences and effectively improves the ability to capture high-dimensional, dynamically complex features. The added global self-attention model can store history information with long-term dependencies and interact with the current state information to generate a new hidden state, improving the network's expression of short-term dependencies and long-term associations.
3. The invention is built as a whole on a convolutional auto-encoder: the future frames generated by the prediction network are fused with the decoupled static features, and the decoder network is used to generate future frames with clear appearance, finally alleviating the problems of high complexity of the video prediction task, blurred appearance of predicted images and low accuracy.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention.
Fig. 2 is a diagram of a space-time decoupling network according to the present invention.
FIG. 3 is a detailed view of the LSTM unit with the dynamic differential model embedded according to the present invention.
FIG. 4 is a diagram of the global self-attention model structure of the present invention.
Fig. 5 is a detailed view of the DISA-LSTM unit of the present invention.
Fig. 6 is a structural diagram of the DISA-LSTM prediction network of the present invention.
Fig. 7 is an overall construction diagram of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
The invention is described in further detail below with reference to the accompanying drawings.
The video prediction method based on space-time decoupling and self-attention difference LSTM provided by the invention, as shown in Fig. 1, comprises the following steps:
S1, constructing a space-time decoupling network, and decoupling the temporal dynamic features and spatial static features of a video.
In this embodiment, the public dataset UCF101 is selected; it contains 13,320 video clips in 101 categories and is a human action dataset captured in real scenes, where all videos have a frame rate of 25 fps and a resolution of 320×240 pixels. UCF101 is first divided into a training set and a test set, then the videos are extracted into image format, and all images are resized to 128×128 to fit the model input.
Ten consecutive frames are extracted from the training set as the network input, denoted X^m, where m represents the video index. The batch size used in this example is 10, and X is a [B, W, H, C] tensor, where B is the batch size used and W, H and C represent the width and height of the video images and the number of feature channels.
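For illustration only, a minimal data-preparation sketch of this step is given below (PyTorch/torchvision is assumed; the frame extraction to image files, the helper names and the tensor axis ordering are assumptions, since the text only specifies the 128×128 resizing, the 10-frame input and the batch size of 10):

```python
# Hedged sketch: load 10 consecutive UCF101 frames (already extracted as images),
# resize them to 128x128 and stack them into one clip tensor.
import torch
from torchvision import transforms
from PIL import Image

to_tensor = transforms.Compose([
    transforms.Resize((128, 128)),   # adjust all frames to 128x128 as described above
    transforms.ToTensor(),           # -> [C, H, W] in [0, 1]
])

def load_clip(frame_paths):
    """frame_paths: paths of 10 consecutive frames from one video (index m)."""
    frames = [to_tensor(Image.open(p).convert("RGB")) for p in frame_paths]
    return torch.stack(frames)       # [10, C, 128, 128]

# Stacking B = 10 such clips gives the batched input X used in this embodiment
# (the axis ordering [B, T, C, H, W] here is an assumption).
```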
As shown in Fig. 2, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are as follows:
(1) Specific content of decoupling the temporal dynamic features of the video
Building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1.
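A minimal PyTorch sketch of such an encoder is given below for illustration; the channel widths and padding are assumptions, since only the number of layers, stride, kernel size and activations are specified above:

```python
# Sketch of the dynamic encoder: six stride-2, 4x4 convolutions, BatchNorm +
# LeakyReLU after each of the first five, Tanh at the output layer.
import torch.nn as nn

class DynamicEncoder(nn.Module):
    def __init__(self, in_channels=3, widths=(64, 128, 256, 256, 256), out_dim=128):
        super().__init__()
        layers, c = [], in_channels
        for w in widths:                       # first 5 convolution blocks
            layers += [nn.Conv2d(c, w, kernel_size=4, stride=2, padding=1),
                       nn.BatchNorm2d(w),
                       nn.LeakyReLU(0.2, inplace=True)]
            c = w
        layers += [nn.Conv2d(c, out_dim, kernel_size=4, stride=2, padding=1),
                   nn.Tanh()]                  # output normalized to [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                      # x: [B, 3, 128, 128]
        return self.net(x)                     # temporal dynamic features
```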
Extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula as follows:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment.
The feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into the 0-1 interval.
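A sketch of such a discriminator is shown below for illustration; how the pair of dynamic features is combined (channel-wise concatenation here) and the hidden width are assumptions not stated above:

```python
# Sketch of the feature discriminator: three 1x1 convolutions with ReLU and a
# final Sigmoid, judging whether two dynamic feature maps come from the same clip.
import torch
import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_dim, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),                      # probability in (0, 1)
        )

    def forward(self, d_a, d_b):               # two temporal dynamic feature maps
        return self.net(torch.cat([d_a, d_b], dim=1))
```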
(2) Specific content of decoupling the spatial static features of the video
The static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula as follows:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence.
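For illustration, a minimal sketch of this similarity loss is given below; the mean reduction is an assumption, since the text only states that a squared error between the static features of adjacent time steps is used:

```python
# Sketch of the similarity loss: squared error between the static encoder's
# features of frames t and t+k, encouraging time-invariant spatial features.
import torch.nn.functional as F

def similarity_loss(static_encoder, x_t, x_tk):
    s_t = static_encoder(x_t)        # spatial static features of frame t
    s_tk = static_encoder(x_tk)      # spatial static features of frame t+k
    return F.mse_loss(s_t, s_tk)
```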
Finally, the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k.
S2, designing a dynamic differential model comprising a forget gate, an input gate, an update gate and an output gate by using differential operations to replace the forget gate of the LSTM unit, with the specific contents as follows:
The dynamic differential model, designed with differential operations, replaces the forget gate of the traditional LSTM unit, giving the network the ability to process non-stationary sequences and improving its ability to capture high-dimensional, dynamically complex features.
As shown in Fig. 3, the dynamic differential model is designed within the dashed box over the hidden states; the model replaces the forget gate of the traditional LSTM unit: it receives the differential information of the hidden states of adjacent time steps and, after passing it through the forget gate, the input gate and the update gate, fuses it with the long-term memory cell of the previous time step to form the differential features. The output gate then generates the long-term memory cell of the current time step, which participates in the subsequent information updating. The specific formula is as follows:
wherein σ and tanh represent the Sigmoid and tanh activation functions, respectively; * represents the convolution operation and ⊙ represents the Hadamard product; W'_co, W'_ho, W_hi, W_xi, W_hg, W''_hf, W''_mf, W''_hi, W''_mi, W''_hg, W''_mg, W_xo, W_ho, W_co and W_mo are two-dimensional convolution kernels; b'_o, b_i, b_g, b_o, b''_f, b''_i and b''_g are biases; t represents the time step, l represents the layer, and H is the hidden state; the remaining cell symbols represent, respectively, the differential information of the hidden states of adjacent time steps, the long-term memory cells of the previous and of the current time step, the differential features, the spatio-temporal memory cell M, and the hidden state output at the current time step; i_t and g_t respectively represent the input gate and update gate used to screen the hidden state H, f''_t, i''_t and g''_t respectively represent the forget gate, input gate and update gate used to screen the spatio-temporal memory M, and o'_t and o_t represent output gates.
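Purely for illustration, a minimal sketch of this differential gating is given below; the kernel size, the single shared gate convolution and the interface are assumptions, and the full DISA-LSTM cell of Fig. 3 and Fig. 5 contains further gates not shown here:

```python
# Hedged sketch of the dynamic differential gating: forget / input / update gates
# computed from the difference of hidden states at adjacent time steps are fused
# with the previous long-term memory cell to form the differential features.
import torch
import torch.nn as nn

class DynamicDifferentialGate(nn.Module):
    def __init__(self, hidden_dim, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_gates = nn.Conv2d(hidden_dim, 3 * hidden_dim, kernel_size, padding=pad)

    def forward(self, h_prev, h_prev2, c_prev):
        delta_h = h_prev - h_prev2                      # differential information of adjacent steps
        f, i, g = torch.chunk(self.conv_gates(delta_h), 3, dim=1)
        f, i, g = torch.sigmoid(f), torch.sigmoid(i), torch.tanh(g)
        return f * c_prev + i * g                       # differential features
```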
S3, in order to improve the network's ability to fit global spatio-temporal correlations, a gating mechanism is designed on the basis of attention and the long-term memory is deeply fused with the attended features to build a new global self-attention model, enabling the network to recall past information like a human brain. The specific steps are as follows:
S301, assigning three different 1×1 weights {W_q, W_k, W_v} to the hidden state of the current time step, mapping it into three different spaces of query vectors, key vectors and value vectors; the similarity score between the j-th key vector and the e-th query vector is computed, the scores are normalized with the softmax() activation function to obtain the similarity distribution of each position, and the distribution is multiplied by the corresponding value vectors to obtain the attention features, with the specific formula as follows:
wherein Q represents the query vectors, K represents the key vectors, V represents the value vectors, W_q, W_k and W_v represent the three different weights assigned to the hidden state, the hidden state has size C×H×W (number of channels × height × width) with N = H×W, e and j represent the position indices of the query and key vectors, Q_e represents the e-th query vector, K_j^T represents the transpose of the j-th key vector, d_k represents the dimension of the key vectors, Z represents the attention features, and T represents the transpose.
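A minimal sketch of this attention step is given below for illustration; 1×1 convolutions realize the weights {W_q, W_k, W_v}, and using the channel number as d_k is an assumption:

```python
# Sketch of step S301: map the hidden state to query / key / value spaces with
# 1x1 convolutions, compute scaled dot-product similarity scores, normalize them
# with softmax and aggregate the value vectors into the attention features Z.
import torch
import torch.nn as nn

class GlobalSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, h):                                # h: [B, C, H, W]
        B, C, H, W = h.shape
        q = self.w_q(h).flatten(2).transpose(1, 2)       # [B, N, C], N = H*W
        k = self.w_k(h).flatten(2)                       # [B, C, N]
        v = self.w_v(h).flatten(2).transpose(1, 2)       # [B, N, C]
        scores = torch.softmax(q @ k / (C ** 0.5), dim=-1)  # similarity distribution
        z = scores @ v                                   # attention features
        return z.transpose(1, 2).reshape(B, C, H, W)
```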
S302, deeply fusing the long-term memory with the attention features and the hidden state, with the specific formula as follows:
wherein i, g and o respectively represent the input gate, update gate and output gate in the global self-attention module, W_(i;h), W_(i;z), W_(g;h), W_(g;z), W_(o;h) and W_(o;z) represent two-dimensional convolution kernels, b_i, b_g and b_o represent biases, and the remaining symbols represent the updated long-term memory and hidden state and the hidden state of the current time step after passing through the global attention module.
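The gate structure of this fusion can be sketched as below for illustration; the exact update equations are given by the formulas above, and the particular memory-update rule and kernel size used here are assumptions:

```python
# Hedged sketch of step S302: input / update / output gates computed from the
# hidden state and the attention features Z update the long-term memory and
# produce the hidden state passed on along the time dimension.
import torch
import torch.nn as nn

class AttentionGatedMemory(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_h = nn.Conv2d(channels, 3 * channels, kernel_size, padding=pad)
        self.conv_z = nn.Conv2d(channels, 3 * channels, kernel_size, padding=pad)

    def forward(self, h, z, c_long):          # hidden state, attention features, long-term memory
        i_h, g_h, o_h = torch.chunk(self.conv_h(h), 3, dim=1)
        i_z, g_z, o_z = torch.chunk(self.conv_z(z), 3, dim=1)
        i = torch.sigmoid(i_h + i_z)
        g = torch.tanh(g_h + g_z)
        o = torch.sigmoid(o_h + o_z)
        c_new = i * g + (1 - i) * c_long       # assumed memory-update rule
        h_new = o * torch.tanh(c_new)          # hidden state after the attention module
        return h_new, c_new
```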
S4, embedding the dynamic differential model and the global self-attention model into the LSTM unit to form a new DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct the DISA-LSTM prediction network. The specific steps are as follows:
S401, replacing the forget gate in the LSTM unit with the dynamic differential model, and feeding the hidden state of the current time step together with the long-term memory of the previous moment into the global self-attention model, forming a new DISA-LSTM unit, as shown in Fig. 5.
S402, the DISA-LSTM prediction network is formed by stacking three layers of memory units; since the first layer receives no features from a preceding layer, an ST-LSTM unit is used there, and the other two layers are DISA-LSTM units. The DISA-LSTM prediction network adopts a diagonal recurrent structure, in which differential information is propagated diagonally across layers while the hidden states processed by the global self-attention model are propagated along the time dimension, as shown in Fig. 6.
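A heavily hedged structural sketch of this three-layer stack is given below; the cell interfaces, the indexing of the diagonal differential connection and the closed-loop reuse of the last prediction are assumptions made only to illustrate the data flow of Fig. 6:

```python
# Illustrative skeleton of the three-layer stack: layer 1 is an ST-LSTM cell,
# layers 2-3 are DISA-LSTM cells; hidden states flow along time, and the
# differential information for a cell is taken diagonally from the lower layer
# at the previous time step (cells are assumed to handle None initial states).
def predict_sequence(inputs, st_lstm, disa_cells, total_steps):
    """inputs: fused decoupled features H_t for the observed steps."""
    num_layers = 1 + len(disa_cells)
    h = [None] * num_layers                  # hidden states per layer
    c = [None] * num_layers                  # memory cells per layer
    outputs = []
    for t in range(total_steps):
        x = inputs[t] if t < len(inputs) else outputs[-1]   # reuse prediction after inputs end
        h_prev = list(h)                     # hidden states of the previous time step
        h[0], c[0] = st_lstm(x, h_prev[0], c[0])
        for l, cell in enumerate(disa_cells, start=1):
            # diagonal connection: differential info from layer l-1 of the previous step
            h[l], c[l] = cell(h[l - 1], h_prev[l - 1], h_prev[l], c[l])
        outputs.append(h[-1])
    return outputs
```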
S5, constructing the overall network architecture based on the convolutional auto-encoder, and training the video prediction model by combining the adversarial loss function, similarity loss function, reconstruction loss function and prediction loss, with the specific contents as follows:
As shown in Fig. 7, the overall network architecture constructed on the basis of the convolutional auto-encoder mainly comprises three parts: an encoder, the DISA-LSTM prediction network and a decoder, wherein:
the encoder consists of the space-time decoupling network, comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation;
after the feature vector representations from the encoder are further fused, they are fed into the DISA-LSTM prediction network, which generates future frames by learning the internal latent relationships;
the decoder consists of deconvolutions; to enhance the appearance detail of the predicted images, the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder;
the video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise the adversarial loss, the similarity loss and the reconstruction loss, with the specific formula:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
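For illustration, the combination of these losses can be sketched as follows; the λ values are placeholders, since the concrete hyper-parameter settings are not specified here:

```python
# Sketch of the combined training objective described above.
def total_loss(l_recon, l_sim, l_adv_encoder, l_adv_disc, l_mse,
               lam_sim=1.0, lam_adv=1.0, lam_mse=1.0):
    return (l_recon
            + lam_sim * l_sim
            + lam_adv * (l_adv_encoder + l_adv_disc)
            + lam_mse * l_mse)
```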
The embodiment of the invention also provides a video prediction system based on space-time decoupling and self-attention difference LSTM, which comprises a video feature decoupling module, a dynamic difference model design module, a depth fusion feature module, a video prediction model training module and a computer program capable of running on a processor. It should be noted that each module in the above system corresponds to a specific step of the method provided by the embodiment of the present invention, and has a corresponding functional module and beneficial effect of executing the method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
The embodiment of the invention also provides an electronic device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor. It should be noted that each module in the above system corresponds to a specific step of the method provided by the embodiment of the present invention, and has a corresponding functional module and beneficial effect of executing the method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
The embodiment of the invention also provides a computer readable storage medium, and the computer readable storage medium stores a computer program. It should be noted that each module in the above system corresponds to a specific step of the method provided by the embodiment of the present invention, and has a corresponding functional module and beneficial effect of executing the method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
While embodiments of the present invention have been shown and described, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the invention. Any other corresponding changes and modifications made in accordance with the technical idea of the present invention shall be included in the scope of the claims of the present invention.

Claims (7)

1. A video prediction method based on space-time decoupling and self-attention difference LSTM, characterized by comprising the following steps:
s1, constructing a space-time decoupling network, and decoupling the temporal dynamic features and spatial static features of a video;
s2, designing a dynamic differential model comprising a forget gate, an input gate and an update gate by using differential operations;
s3, designing a gating mechanism on the basis of attention, deeply fusing long-term memory with the attended features, and constructing a new global self-attention model;
s4, embedding the dynamic differential model and the global self-attention model into an LSTM unit to form a new DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct a DISA-LSTM prediction network;
s5, constructing an overall network architecture based on a convolutional auto-encoder, and training a video prediction model by combining an adversarial loss function, a similarity loss function, a reconstruction loss function and a prediction loss;
in step S1, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are:
(1) specific content of decoupling the temporal dynamic features of the video:
building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1;
extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment;
the feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into a specific interval;
(2) specific content of decoupling the spatial static features of the video:
the static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence;
the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k;
in step S5, the specific content of the video prediction model is:
the overall network architecture constructed on the basis of the convolutional auto-encoder is composed of an encoder, a DISA-LSTM prediction network and a decoder, wherein:
the encoder consists of the space-time decoupling network comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation;
after the feature vector representations from the encoder are further fused, they are fed into the DISA-LSTM prediction network, which generates future frames by learning internal latent relationships;
the decoder consists of deconvolutions; the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder;
the video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise an adversarial loss, a similarity loss and a reconstruction loss, with the specific formula:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
2. The video prediction method based on space-time decoupling and self-attention difference LSTM according to claim 1, wherein in step S2, the specific design contents of the dynamic differential model are:
the differential information of the hidden states of adjacent time steps is received and, after passing through the forget gate, the input gate and the update gate, is fused with the long-term memory cell of the previous time step to form the differential features, with the specific formula:
wherein σ represents the Sigmoid activation function, tanh represents the tanh activation function, * represents the convolution operation and ⊙ represents the Hadamard product; W'_hf, W'_hi and W'_hg are two-dimensional convolution kernels, b'_f, b'_i and b'_g are biases, t represents the time step and l represents the layer; f'_t, i'_t and g'_t are respectively the forget gate, input gate and update gate used to screen the differential information; the remaining symbols represent, respectively, the differential information of the hidden states of adjacent time steps, the long-term memory cell of the previous time step, and the differential features.
3. The video prediction method based on space-time decoupling and self-attention difference LSTM according to claim 1, wherein in step S3, the specific steps of constructing the new global self-attention model are as follows:
s301, assigning three different 1×1 weights {W_q, W_k, W_v} to the hidden state, mapping it into three different spaces of query vectors, key vectors and value vectors; the similarity score between the j-th key vector and the e-th query vector is computed, the scores are normalized with the softmax activation function to obtain the similarity distribution of each position, and the distribution is multiplied by the corresponding value vectors to obtain the attention features, with the specific formula:
wherein Q represents the query vectors, K represents the key vectors, V represents the value vectors, W_q, W_k and W_v represent the three different weights assigned to the hidden state, the hidden state has size C×H×W (number of channels × height × width) with N = H×W, e and j represent the position indices of the query and key vectors, Q_e represents the e-th query vector, K_j^T represents the transpose of the j-th key vector, d_k represents the dimension of the key vectors, Z represents the attention features, and T represents the transpose;
s302, deeply fusing the long-term memory with the attention features and the hidden state, with the specific formula:
wherein i, g and o respectively represent the input gate, update gate and output gate in the global self-attention module, W_(i;h), W_(i;z), W_(g;h), W_(g;z), W_(o;h) and W_(o;z) represent two-dimensional convolution kernels, b_i, b_g and b_o represent biases, and the remaining symbols represent the updated long-term memory and hidden state and the hidden state after passing through the global attention module.
4. The video prediction method based on space-time decoupling and self-attention difference LSTM according to claim 1, wherein in step S4, the specific steps of constructing the DISA-LSTM prediction network are as follows:
s401, replacing the forget gate in the LSTM unit with the dynamic differential model, and feeding the hidden state of the current time step together with the long-term memory of the previous moment into the global self-attention model to form a new DISA-LSTM unit;
s402, stacking three layers of memory units, wherein the first layer uses an ST-LSTM unit and the other two layers use DISA-LSTM units; and constructing the DISA-LSTM prediction network with a diagonal recurrent structure, in which differential information is propagated diagonally across layers while the hidden states processed by the global self-attention model are propagated along the time dimension.
5. A video prediction system based on spatiotemporal decoupling and self-attention differential LSTM, comprising:
a video feature decoupling module, used to construct a space-time decoupling network and decouple the temporal dynamic features and spatial static features of a video;
a dynamic differential model design module, used to design, with differential operations, a dynamic differential model comprising a forget gate, an input gate and an update gate to replace the forget gate of the LSTM unit;
a depth fusion feature module, used to design a gating mechanism on the basis of attention, deeply fuse long-term memory with the attended features, and construct a new global self-attention model;
a video prediction model training module, used to build an overall network architecture based on a convolutional auto-encoder and train a video prediction model by combining an adversarial loss function, a similarity loss function, a reconstruction loss function and a prediction loss;
in the video feature decoupling module, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are:
(1) specific content of decoupling the temporal dynamic features of the video:
building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1;
extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment;
the feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into a specific interval;
(2) specific content of decoupling the spatial static features of the video:
the static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence;
the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k;
in the video prediction model training module, the specific contents of the video prediction model are as follows:
the overall network architecture constructed on the basis of the convolutional auto-encoder is composed of an encoder, a DISA-LSTM prediction network and a decoder, wherein:
the encoder consists of the space-time decoupling network comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation;
after the feature vector representations from the encoder are further fused, they are fed into the DISA-LSTM prediction network, which generates future frames by learning internal latent relationships;
the decoder consists of deconvolutions; the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder;
the video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise an adversarial loss, a similarity loss and a reconstruction loss, with the specific formula:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 4 when the computer program is executed by the processor.
7. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when executed by a processor, performs the method of any of claims 1 to 4.
CN202310802044.5A 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM Active CN116524419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310802044.5A CN116524419B (en) 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310802044.5A CN116524419B (en) 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Publications (2)

Publication Number Publication Date
CN116524419A CN116524419A (en) 2023-08-01
CN116524419B true CN116524419B (en) 2023-11-07

Family

ID=87390650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310802044.5A Active CN116524419B (en) 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Country Status (1)

Country Link
CN (1) CN116524419B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881996B (en) * 2023-09-07 2023-12-01 华南理工大学 Modeling intention prediction method based on mouse operation
CN117421733A (en) * 2023-12-19 2024-01-19 浪潮电子信息产业股份有限公司 Leesvirus detection method, apparatus, electronic device and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN113449660A (en) * 2021-07-05 2021-09-28 西安交通大学 Abnormal event detection method of space-time variation self-coding network based on self-attention enhancement
CN115147275A (en) * 2022-06-24 2022-10-04 浙江大学 Video implicit representation method based on decoupled space and time sequence information
CN115311169A (en) * 2022-08-29 2022-11-08 上海大学 Damaged image repairing method and device, electronic equipment and storage medium
CN115346101A (en) * 2022-08-22 2022-11-15 南京信息工程大学 Method for realizing radar echo extrapolation model based on depth space-time fusion network
CN115830666A (en) * 2022-09-23 2023-03-21 合肥工业大学 Video expression recognition method based on spatio-temporal characteristic decoupling and application
WO2023061102A1 (en) * 2021-10-15 2023-04-20 腾讯科技(深圳)有限公司 Video behavior recognition method and apparatus, and computer device and storage medium
CN116091978A (en) * 2023-02-24 2023-05-09 北京工业大学 Video description method based on advanced semantic information feature coding
CN116307224A (en) * 2023-03-28 2023-06-23 南京信息工程大学 ENSO space-time prediction method based on recursive gating convolution and attention mechanism improvement

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN113449660A (en) * 2021-07-05 2021-09-28 西安交通大学 Abnormal event detection method of space-time variation self-coding network based on self-attention enhancement
WO2023061102A1 (en) * 2021-10-15 2023-04-20 腾讯科技(深圳)有限公司 Video behavior recognition method and apparatus, and computer device and storage medium
CN115147275A (en) * 2022-06-24 2022-10-04 浙江大学 Video implicit representation method based on decoupled space and time sequence information
CN115346101A (en) * 2022-08-22 2022-11-15 南京信息工程大学 Method for realizing radar echo extrapolation model based on depth space-time fusion network
CN115311169A (en) * 2022-08-29 2022-11-08 上海大学 Damaged image repairing method and device, electronic equipment and storage medium
CN115830666A (en) * 2022-09-23 2023-03-21 合肥工业大学 Video expression recognition method based on spatio-temporal characteristic decoupling and application
CN116091978A (en) * 2023-02-24 2023-05-09 北京工业大学 Video description method based on advanced semantic information feature coding
CN116307224A (en) * 2023-03-28 2023-06-23 南京信息工程大学 ENSO space-time prediction method based on recursive gating convolution and attention mechanism improvement

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics; Yunbo Wang et al.; CVPR 2019; 9154-9162 *
PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning; Yunbo Wang et al.; arXiv:2103.09504v4; 1-17 *
Self-attention eidetic 3D-LSTM: Video prediction models for traffic flow forecasting; Xiao Yan et al.; Neurocomputing; vol. 509 (2022); 167-176 *
Unsupervised Learning of Disentangled Representations from Video; Emily Denton et al.; 31st Conference on Neural Information Processing Systems (NIPS 2017); 4417-4426 *
A survey of deep predictive learning methods based on unlabeled video data; Pan Minting et al.; Acta Electronica Sinica; vol. 50, no. 04; 869-886 *
Video prediction based on attention spatio-temporally decoupled 3D convolutional LSTM; Huang Jingui et al.; Microelectronics & Computer; vol. 39, no. 09; 63-72 *
Research on video prediction based on self-supervised learning; Han Xin et al.; China Masters' Theses Full-text Database (Information Science and Technology), no. 2023(01); I138-1405; main text sections 3.1.2-3.1.3 and 3.2-3.3, figures 3-2, 3-4, 3-5 *

Also Published As

Publication number Publication date
CN116524419A (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN116524419B (en) Video prediction method and system based on space-time decoupling and self-attention difference LSTM
WO2021258967A1 (en) Neural network training method and device, and data acquisition method and device
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN112418409B (en) Improved convolution long-short-term memory network space-time sequence prediction method by using attention mechanism
CN111462324B (en) Online spatiotemporal semantic fusion method and system
CN110570035B (en) People flow prediction system for simultaneously modeling space-time dependency and daily flow dependency
CN116310667B (en) Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
CN116958534A (en) Image processing method, training method of image processing model and related device
CN114969298A (en) Video question-answering method based on cross-modal heterogeneous graph neural network
CN115719036A (en) Space-time process simulation method and system based on stacked space-time memory unit
CN115792913A (en) Radar echo extrapolation method and system based on time-space network
CN115115828A (en) Data processing method, apparatus, program product, computer device and medium
CN112016701B (en) Abnormal change detection method and system integrating time sequence and attribute behaviors
CN113689382A (en) Tumor postoperative life prediction method and system based on medical images and pathological images
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
CN116977661A (en) Data processing method, device, equipment, storage medium and program product
CN116307224A (en) ENSO space-time prediction method based on recursive gating convolution and attention mechanism improvement
CN115797557A (en) Self-supervision 3D scene flow estimation method based on graph attention network
CN114782538A (en) Visual positioning method compatible with different barrel shapes and applied to filling field
Sun et al. Cycle representation-disentangling network: learning to completely disentangle spatial-temporal features in video
CN113569867A (en) Image processing method and device, computer equipment and storage medium
Yang et al. A Lightweight Semantic Segmentation Algorithm Based on Deep Convolutional Neural Networks
CN111915701A (en) Button image generation method and device based on artificial intelligence
CN115965995B (en) Skeleton self-supervision method and model based on partial space-time data
Wang et al. Keymemoryrnn: A flexible prediction framework for spatiotemporal prediction networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant