CN116524419B - Video prediction method and system based on space-time decoupling and self-attention difference LSTM - Google Patents

Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Info

Publication number
CN116524419B
CN116524419B CN202310802044.5A CN202310802044A
Authority
CN
China
Prior art keywords
video
representing
decoupling
dynamic
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310802044.5A
Other languages
Chinese (zh)
Other versions
CN116524419A (en)
Inventor
陈苏婷
薄业雯
胡斌武
韩光勋
裴加明
杨宁
孙俊
高云勇
夏芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD
Nanjing University of Information Science and Technology
Original Assignee
NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD, Nanjing University of Information Science and Technology filed Critical NANJING CHINA-SPACENET SATELLITE TELECOM CO LTD
Priority to CN202310802044.5A priority Critical patent/CN116524419B/en
Publication of CN116524419A publication Critical patent/CN116524419A/en
Application granted granted Critical
Publication of CN116524419B publication Critical patent/CN116524419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a video prediction method and system based on space-time decoupling and self-attention difference LSTM. The method comprises the following steps: introducing an adversarial loss constraint and a similarity constraint to construct a space-time decoupling network and obtain decoupled features of the video; designing a dynamic differential model with differential operations to replace the forget gate of the traditional LSTM unit; designing a gating mechanism on the basis of attention, deeply fusing long-term memory with the attended features, and constructing a new global self-attention model; combining the dynamic differential model and the global self-attention model to build a DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct a DISA-LSTM prediction network; and building an overall network architecture based on a convolutional auto-encoder and training the model with a combined loss function. The invention effectively improves the ability to capture high-dimensional, dynamically complex features and the accuracy of video prediction, and reduces the complexity that high-dimensional video data brings to the prediction task.

Description

Video prediction method and system based on space-time decoupling and self-attention difference LSTM
Technical Field
The invention belongs to the field of electronic communication and information engineering, and particularly relates to a video prediction method and a video prediction system based on space-time decoupling and self-attention difference LSTM.
Background
The high-dimensional nature of video data and the complexity of diverse temporal motion evolution pose great challenges to video prediction. Existing prediction methods suffer from loss of spatial detail and temporally inconsistent motion prediction; the generated results are over-smoothed and fail to retain high-frequency detail, leading to low prediction accuracy and blurred, unrealistic predicted frames.
The current mainstream video prediction methods are based on convolutional neural networks, recurrent neural networks and generative adversarial networks. Convolutional neural networks have strong feature-learning ability, but their receptive field is limited, they cannot model long-range spatial dependencies, and their ability to describe temporal information is limited; recurrent neural networks are designed specifically for data with a temporal dimension, but describe the internal spatial structure of the data insufficiently. Therefore, in recent years many researchers have combined convolutional and recurrent networks in spatio-temporal sequence prediction tasks so as to learn both spatial structure and temporal information. As a recent research hotspot, generative adversarial networks have been shown to yield more accurate video predictions than the L2 loss function alone, but mode collapse, i.e. convergence to a single mode during training, remains their major weakness. In addition, video sequences, as high-dimensional spatio-temporal data, tend to exhibit complex spatio-temporal coupling, which further increases the difficulty of predictive learning.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a video prediction method and system based on space-time decoupling and self-attention difference LSTM are provided. Based on a convolutional auto-encoder structure, a space-time decoupling network, a dynamic differential model and a global self-attention model are introduced to construct a network model for video frame prediction, which reduces the complexity that high-dimensional video data brings to the prediction task and improves the network's expression of short-term dependencies and long-term associations.
The invention adopts the following technical scheme for solving the technical problem:
The invention provides a video prediction method based on space-time decoupling and self-attention difference LSTM, which comprises the following steps:
S1, constructing a space-time decoupling network, and decoupling the temporal dynamic features and spatial static features of a video, so as to reduce the complexity that high-dimensional video data brings to the prediction task.
S2, designing a dynamic differential model comprising a forget gate, an input gate, an update gate and an output gate by using differential operations, to replace the forget gate of the LSTM unit and thereby improve the network's ability to capture high-dimensional, dynamically complex features.
S3, in order to improve the network's ability to fit global spatio-temporal correlations, designing a gating mechanism on the basis of attention, deeply fusing long-term memory with the attended features, and building a new global self-attention model.
S4, embedding the dynamic differential model and the global self-attention model into the LSTM unit to form a new DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct a DISA-LSTM prediction network.
S5, constructing the overall network architecture based on a convolutional auto-encoder, and training the video prediction model by combining an adversarial loss function, a similarity loss function, a reconstruction loss function and a prediction loss.
Further, in step S1, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are:
(1) Specific content of decoupling the temporal dynamic features of the video
Building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1.
Extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula as follows:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment.
The static features containing appearance information do not change with time within the same video segment but differ across different video segments; therefore, when the feature discriminator can no longer judge whether the dynamic features come from the same video segment, the task of completely decoupling the temporal dynamic features is accomplished.
The feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into the 0-1 interval.
(2) Specific content of decoupling the spatial static features of the video
The spatial static features represent the background, color and appearance details of the video. The static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula as follows:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence.
Further, in step S1, the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k.
Further, in step S2, the specific design contents of the dynamic differential model are:
The forget gate of the traditional LSTM unit is replaced by the dynamic differential model: the model receives the differential information of the hidden states of adjacent time steps and, after passing it through the forget gate, the input gate and the update gate, fuses it with the long-term memory cell of the previous time step to form the differential features. The output gate then generates the long-term memory cell of the current time step, which participates in the subsequent information updating. The specific formula is as follows:
wherein σ and tanh represent the Sigmoid and tanh activation functions, respectively; * represents the convolution operation and ⊙ represents the Hadamard product; W'_hf, W'_hi and W'_hg are two-dimensional convolution kernels, b'_f, b'_i and b'_g are biases, t represents the time step and l represents the layer; f'_t, i'_t and g'_t respectively represent the forget gate, input gate and update gate used to screen the differential information; the remaining symbols represent, respectively, the differential information of the hidden states of adjacent time steps, the long-term memory cell of the previous time step, and the differential features.
Further, in step S3, the specific steps of constructing the new global self-attention model are as follows:
S301, assigning three different 1×1 weights {W_q, W_k, W_v} to the hidden state of the current time step, mapping it into three different spaces of query vectors, key vectors and value vectors; the similarity score between the j-th key vector and the e-th query vector is computed, the scores are normalized with the softmax() activation function to obtain the similarity distribution of each position, and the distribution is multiplied by the corresponding value vectors to obtain the attention features, with the specific formula as follows:
wherein Q represents the query vectors, K represents the key vectors, V represents the value vectors, W_q, W_k and W_v represent the three different weights assigned to the hidden state, the hidden state has size C×H×W (number of channels × height × width) with N = H×W, e and j represent the position indices of the query and key vectors, Q_e represents the e-th query vector, K_j^T represents the transpose of the j-th key vector, d_k represents the dimension of the key vectors, Z represents the attention features, and T represents the transpose.
S302, deeply fusing the long-term memory with the attention features and the hidden state, with the specific formula as follows:
wherein i, g and o respectively represent the input gate, update gate and output gate in the global self-attention module, W_(i;h), W_(i;z), W_(g;h), W_(g;z), W_(o;h) and W_(o;z) represent two-dimensional convolution kernels, b_i, b_g and b_o represent biases, and the remaining symbols represent the updated long-term memory and hidden state and the hidden state of the current time step after passing through the global attention module.
Further, in step S4, the specific steps of constructing the DISA-LSTM prediction network are as follows:
S401, replacing the forget gate in the LSTM unit with the dynamic differential model, and feeding the hidden state of the current time step together with the long-term memory of the previous moment into the global self-attention model, forming a new DISA-LSTM unit.
S402, the DISA-LSTM prediction network is formed by stacking three layers of memory units; since the first layer receives no features from a preceding layer, an ST-LSTM unit is used there, and the other two layers are DISA-LSTM units. The DISA-LSTM prediction network adopts a diagonal recurrent structure, in which differential information is propagated diagonally across layers while the hidden states processed by the global self-attention model are propagated along the time dimension.
Further, in step S5, the specific content of the video prediction model is:
The overall network architecture constructed on the basis of the convolutional auto-encoder mainly comprises three parts: an encoder, the DISA-LSTM prediction network and a decoder, wherein:
The encoder consists of the space-time decoupling network, comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation.
The feature vector representations from the encoder are further fused and then fed into the DISA-LSTM prediction network, which generates future frames by learning the internal latent relationships.
The decoder consists of deconvolutions; to enhance the appearance detail of the predicted images, the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder.
The video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise the adversarial loss, the similarity loss and the reconstruction loss; the decoupling losses are used to encourage complete decoupling of the video data and to ensure that future frames can be reconstructed from the spatial static features, while the prediction loss is used to maximize the similarity between the predicted and actual results. The model is optimized with the back-propagation algorithm to improve the quality of the predicted images. The specific formula is as follows:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
Furthermore, the invention also provides a video prediction system based on space-time decoupling and self-attention difference LSTM, which comprises:
a video feature decoupling module, used to construct the space-time decoupling network and decouple the temporal dynamic features and spatial static features of the video;
a dynamic differential model design module, used to design, with differential operations, a dynamic differential model comprising a forget gate, an input gate and an update gate to replace the forget gate of the LSTM unit;
a depth fusion feature module, used to design a gating mechanism on the basis of attention and deeply fuse the long-term memory with the attended features to form a new global self-attention model;
and a video prediction model training module, used to build the overall network architecture based on the convolutional auto-encoder and train the video prediction model by combining the adversarial loss function, similarity loss function, reconstruction loss function and prediction loss.
Furthermore, the invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the video prediction method based on space-time decoupling and self-attention difference LSTM when executing the computer program.
Further, the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the above video prediction method based on space-time decoupling and self-attention difference LSTM.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. By introducing the space-time decoupling network, the invention decouples the temporal dynamic features from the spatial static features containing appearance information, thereby reducing the complexity that high-dimensional video data brings to the prediction task.
2. The DISA-LSTM prediction network, improved by combining dynamic differencing with global self-attention, is able to handle non-stationary sequences and effectively improves the ability to capture high-dimensional, dynamically complex features. The added global self-attention model can store history information with long-term dependencies and interact with the current state information to generate a new hidden state, improving the network's expression of short-term dependencies and long-term associations.
3. The invention is built as a whole on a convolutional auto-encoder: the future frames generated by the prediction network are fused with the decoupled static features, and the decoder network is used to generate future frames with clear appearance, finally alleviating the problems of high complexity of the video prediction task, blurred appearance of predicted images and low accuracy.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention.
Fig. 2 is a diagram of a space-time decoupling network according to the present invention.
FIG. 3 is a detailed view of the LSTM unit with the dynamic differential model embedded according to the present invention.
FIG. 4 is a diagram of the global self-attention model structure of the present invention.
Fig. 5 is a detailed view of the DISA-LSTM unit of the present invention.
Fig. 6 is a structural diagram of the DISA-LSTM prediction network of the present invention.
Fig. 7 is an overall construction diagram of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
The invention is described in further detail below with reference to the accompanying drawings.
The video prediction method based on space-time decoupling and self-attention difference LSTM provided by the invention, as shown in Fig. 1, comprises the following steps:
S1, constructing a space-time decoupling network, and decoupling the temporal dynamic features and spatial static features of a video.
In this embodiment, the public dataset UCF101 is selected; it contains 13,320 video clips in 101 categories and is a human action dataset captured in real scenes, where all videos have a frame rate of 25 fps and a resolution of 320×240 pixels. UCF101 is first divided into a training set and a test set, then the videos are extracted into image format, and all images are resized to 128×128 to fit the model input.
Ten consecutive frames are extracted from the training set as the network input, denoted X^m, where m represents the video index. The batch size used in this example is 10, and X is a [B, W, H, C] tensor, where B is the batch size used and W, H and C represent the width and height of the video images and the number of feature channels.
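For illustration only, a minimal data-preparation sketch of this step is given below (PyTorch/torchvision is assumed; the frame extraction to image files, the helper names and the tensor axis ordering are assumptions, since the text only specifies the 128×128 resizing, the 10-frame input and the batch size of 10):

```python
# Hedged sketch: load 10 consecutive UCF101 frames (already extracted as images),
# resize them to 128x128 and stack them into one clip tensor.
import torch
from torchvision import transforms
from PIL import Image

to_tensor = transforms.Compose([
    transforms.Resize((128, 128)),   # adjust all frames to 128x128 as described above
    transforms.ToTensor(),           # -> [C, H, W] in [0, 1]
])

def load_clip(frame_paths):
    """frame_paths: paths of 10 consecutive frames from one video (index m)."""
    frames = [to_tensor(Image.open(p).convert("RGB")) for p in frame_paths]
    return torch.stack(frames)       # [10, C, 128, 128]

# Stacking B = 10 such clips gives the batched input X used in this embodiment
# (the axis ordering [B, T, C, H, W] here is an assumption).
```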
As shown in Fig. 2, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are as follows:
(1) Specific content of decoupling the temporal dynamic features of the video
Building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1.
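A minimal PyTorch sketch of such an encoder is given below for illustration; the channel widths and padding are assumptions, since only the number of layers, stride, kernel size and activations are specified above:

```python
# Sketch of the dynamic encoder: six stride-2, 4x4 convolutions, BatchNorm +
# LeakyReLU after each of the first five, Tanh at the output layer.
import torch.nn as nn

class DynamicEncoder(nn.Module):
    def __init__(self, in_channels=3, widths=(64, 128, 256, 256, 256), out_dim=128):
        super().__init__()
        layers, c = [], in_channels
        for w in widths:                       # first 5 convolution blocks
            layers += [nn.Conv2d(c, w, kernel_size=4, stride=2, padding=1),
                       nn.BatchNorm2d(w),
                       nn.LeakyReLU(0.2, inplace=True)]
            c = w
        layers += [nn.Conv2d(c, out_dim, kernel_size=4, stride=2, padding=1),
                   nn.Tanh()]                  # output normalized to [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                      # x: [B, 3, 128, 128]
        return self.net(x)                     # temporal dynamic features
```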
Extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula as follows:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment.
The feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into the 0-1 interval.
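A sketch of such a discriminator is shown below for illustration; how the pair of dynamic features is combined (channel-wise concatenation here) and the hidden width are assumptions not stated above:

```python
# Sketch of the feature discriminator: three 1x1 convolutions with ReLU and a
# final Sigmoid, judging whether two dynamic feature maps come from the same clip.
import torch
import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_dim, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),                      # probability in (0, 1)
        )

    def forward(self, d_a, d_b):               # two temporal dynamic feature maps
        return self.net(torch.cat([d_a, d_b], dim=1))
```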
(2) Specific content of decoupling the spatial static features of the video
The static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula as follows:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence.
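For illustration, a minimal sketch of this similarity loss is given below; the mean reduction is an assumption, since the text only states that a squared error between the static features of adjacent time steps is used:

```python
# Sketch of the similarity loss: squared error between the static encoder's
# features of frames t and t+k, encouraging time-invariant spatial features.
import torch.nn.functional as F

def similarity_loss(static_encoder, x_t, x_tk):
    s_t = static_encoder(x_t)        # spatial static features of frame t
    s_tk = static_encoder(x_tk)      # spatial static features of frame t+k
    return F.mse_loss(s_t, s_tk)
```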
Finally, the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k.
S2, designing a dynamic differential model comprising a forget gate, an input gate, an update gate and an output gate by using differential operations to replace the forget gate of the LSTM unit, with the specific contents as follows:
The dynamic differential model, designed with differential operations, replaces the forget gate of the traditional LSTM unit, giving the network the ability to process non-stationary sequences and improving its ability to capture high-dimensional, dynamically complex features.
As shown in Fig. 3, the dynamic differential model is designed within the dashed box over the hidden states; the model replaces the forget gate of the traditional LSTM unit: it receives the differential information of the hidden states of adjacent time steps and, after passing it through the forget gate, the input gate and the update gate, fuses it with the long-term memory cell of the previous time step to form the differential features. The output gate then generates the long-term memory cell of the current time step, which participates in the subsequent information updating. The specific formula is as follows:
wherein σ and tanh represent the Sigmoid and tanh activation functions, respectively; * represents the convolution operation and ⊙ represents the Hadamard product; W'_co, W'_ho, W_hi, W_xi, W_hg, W''_hf, W''_mf, W''_hi, W''_mi, W''_hg, W''_mg, W_xo, W_ho, W_co and W_mo are two-dimensional convolution kernels; b'_o, b_i, b_g, b_o, b''_f, b''_i and b''_g are biases; t represents the time step, l represents the layer, and H is the hidden state; the remaining cell symbols represent, respectively, the differential information of the hidden states of adjacent time steps, the long-term memory cells of the previous and of the current time step, the differential features, the spatio-temporal memory cell M, and the hidden state output at the current time step; i_t and g_t respectively represent the input gate and update gate used to screen the hidden state H, f''_t, i''_t and g''_t respectively represent the forget gate, input gate and update gate used to screen the spatio-temporal memory M, and o'_t and o_t represent output gates.
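Purely for illustration, a minimal sketch of this differential gating is given below; the kernel size, the single shared gate convolution and the interface are assumptions, and the full DISA-LSTM cell of Fig. 3 and Fig. 5 contains further gates not shown here:

```python
# Hedged sketch of the dynamic differential gating: forget / input / update gates
# computed from the difference of hidden states at adjacent time steps are fused
# with the previous long-term memory cell to form the differential features.
import torch
import torch.nn as nn

class DynamicDifferentialGate(nn.Module):
    def __init__(self, hidden_dim, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_gates = nn.Conv2d(hidden_dim, 3 * hidden_dim, kernel_size, padding=pad)

    def forward(self, h_prev, h_prev2, c_prev):
        delta_h = h_prev - h_prev2                      # differential information of adjacent steps
        f, i, g = torch.chunk(self.conv_gates(delta_h), 3, dim=1)
        f, i, g = torch.sigmoid(f), torch.sigmoid(i), torch.tanh(g)
        return f * c_prev + i * g                       # differential features
```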
S3, in order to improve the network's ability to fit global spatio-temporal correlations, a gating mechanism is designed on the basis of attention and the long-term memory is deeply fused with the attended features to build a new global self-attention model, enabling the network to recall past information like a human brain. The specific steps are as follows:
S301, assigning three different 1×1 weights {W_q, W_k, W_v} to the hidden state of the current time step, mapping it into three different spaces of query vectors, key vectors and value vectors; the similarity score between the j-th key vector and the e-th query vector is computed, the scores are normalized with the softmax() activation function to obtain the similarity distribution of each position, and the distribution is multiplied by the corresponding value vectors to obtain the attention features, with the specific formula as follows:
wherein Q represents the query vectors, K represents the key vectors, V represents the value vectors, W_q, W_k and W_v represent the three different weights assigned to the hidden state, the hidden state has size C×H×W (number of channels × height × width) with N = H×W, e and j represent the position indices of the query and key vectors, Q_e represents the e-th query vector, K_j^T represents the transpose of the j-th key vector, d_k represents the dimension of the key vectors, Z represents the attention features, and T represents the transpose.
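A minimal sketch of this attention step is given below for illustration; 1×1 convolutions realize the weights {W_q, W_k, W_v}, and using the channel number as d_k is an assumption:

```python
# Sketch of step S301: map the hidden state to query / key / value spaces with
# 1x1 convolutions, compute scaled dot-product similarity scores, normalize them
# with softmax and aggregate the value vectors into the attention features Z.
import torch
import torch.nn as nn

class GlobalSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, h):                                # h: [B, C, H, W]
        B, C, H, W = h.shape
        q = self.w_q(h).flatten(2).transpose(1, 2)       # [B, N, C], N = H*W
        k = self.w_k(h).flatten(2)                       # [B, C, N]
        v = self.w_v(h).flatten(2).transpose(1, 2)       # [B, N, C]
        scores = torch.softmax(q @ k / (C ** 0.5), dim=-1)  # similarity distribution
        z = scores @ v                                   # attention features
        return z.transpose(1, 2).reshape(B, C, H, W)
```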
S302, deeply fusing the long-term memory with the attention features and the hidden state, with the specific formula as follows:
wherein i, g and o respectively represent the input gate, update gate and output gate in the global self-attention module, W_(i;h), W_(i;z), W_(g;h), W_(g;z), W_(o;h) and W_(o;z) represent two-dimensional convolution kernels, b_i, b_g and b_o represent biases, and the remaining symbols represent the updated long-term memory and hidden state and the hidden state of the current time step after passing through the global attention module.
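The gate structure of this fusion can be sketched as below for illustration; the exact update equations are given by the formulas above, and the particular memory-update rule and kernel size used here are assumptions:

```python
# Hedged sketch of step S302: input / update / output gates computed from the
# hidden state and the attention features Z update the long-term memory and
# produce the hidden state passed on along the time dimension.
import torch
import torch.nn as nn

class AttentionGatedMemory(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_h = nn.Conv2d(channels, 3 * channels, kernel_size, padding=pad)
        self.conv_z = nn.Conv2d(channels, 3 * channels, kernel_size, padding=pad)

    def forward(self, h, z, c_long):          # hidden state, attention features, long-term memory
        i_h, g_h, o_h = torch.chunk(self.conv_h(h), 3, dim=1)
        i_z, g_z, o_z = torch.chunk(self.conv_z(z), 3, dim=1)
        i = torch.sigmoid(i_h + i_z)
        g = torch.tanh(g_h + g_z)
        o = torch.sigmoid(o_h + o_z)
        c_new = i * g + (1 - i) * c_long       # assumed memory-update rule
        h_new = o * torch.tanh(c_new)          # hidden state after the attention module
        return h_new, c_new
```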
S4, embedding the dynamic differential model and the global self-attention model into the LSTM unit to form a new DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct the DISA-LSTM prediction network. The specific steps are as follows:
S401, replacing the forget gate in the LSTM unit with the dynamic differential model, and feeding the hidden state of the current time step together with the long-term memory of the previous moment into the global self-attention model, forming a new DISA-LSTM unit, as shown in Fig. 5.
S402, the DISA-LSTM prediction network is formed by stacking three layers of memory units; since the first layer receives no features from a preceding layer, an ST-LSTM unit is used there, and the other two layers are DISA-LSTM units. The DISA-LSTM prediction network adopts a diagonal recurrent structure, in which differential information is propagated diagonally across layers while the hidden states processed by the global self-attention model are propagated along the time dimension, as shown in Fig. 6.
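A heavily hedged structural sketch of this three-layer stack is given below; the cell interfaces, the indexing of the diagonal differential connection and the closed-loop reuse of the last prediction are assumptions made only to illustrate the data flow of Fig. 6:

```python
# Illustrative skeleton of the three-layer stack: layer 1 is an ST-LSTM cell,
# layers 2-3 are DISA-LSTM cells; hidden states flow along time, and the
# differential information for a cell is taken diagonally from the lower layer
# at the previous time step (cells are assumed to handle None initial states).
def predict_sequence(inputs, st_lstm, disa_cells, total_steps):
    """inputs: fused decoupled features H_t for the observed steps."""
    num_layers = 1 + len(disa_cells)
    h = [None] * num_layers                  # hidden states per layer
    c = [None] * num_layers                  # memory cells per layer
    outputs = []
    for t in range(total_steps):
        x = inputs[t] if t < len(inputs) else outputs[-1]   # reuse prediction after inputs end
        h_prev = list(h)                     # hidden states of the previous time step
        h[0], c[0] = st_lstm(x, h_prev[0], c[0])
        for l, cell in enumerate(disa_cells, start=1):
            # diagonal connection: differential info from layer l-1 of the previous step
            h[l], c[l] = cell(h[l - 1], h_prev[l - 1], h_prev[l], c[l])
        outputs.append(h[-1])
    return outputs
```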
S5, constructing the overall network architecture based on the convolutional auto-encoder, and training the video prediction model by combining the adversarial loss function, similarity loss function, reconstruction loss function and prediction loss, with the specific contents as follows:
As shown in Fig. 7, the overall network architecture constructed on the basis of the convolutional auto-encoder mainly comprises three parts: an encoder, the DISA-LSTM prediction network and a decoder, wherein:
the encoder consists of the space-time decoupling network, comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation;
after the feature vector representations from the encoder are further fused, they are fed into the DISA-LSTM prediction network, which generates future frames by learning the internal latent relationships;
the decoder consists of deconvolutions; to enhance the appearance detail of the predicted images, the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder;
the video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise the adversarial loss, the similarity loss and the reconstruction loss, with the specific formula:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
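For illustration, the combination of these losses can be sketched as follows; the λ values are placeholders, since the concrete hyper-parameter settings are not specified here:

```python
# Sketch of the combined training objective described above.
def total_loss(l_recon, l_sim, l_adv_encoder, l_adv_disc, l_mse,
               lam_sim=1.0, lam_adv=1.0, lam_mse=1.0):
    return (l_recon
            + lam_sim * l_sim
            + lam_adv * (l_adv_encoder + l_adv_disc)
            + lam_mse * l_mse)
```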
The embodiment of the invention also provides a video prediction system based on space-time decoupling and self-attention difference LSTM, which comprises a video feature decoupling module, a dynamic difference model design module, a depth fusion feature module, a video prediction model training module and a computer program capable of running on a processor. It should be noted that each module in the above system corresponds to a specific step of the method provided by the embodiment of the present invention, and has a corresponding functional module and beneficial effect of executing the method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
The embodiment of the invention also provides an electronic device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor. It should be noted that each module in the above system corresponds to a specific step of the method provided by the embodiment of the present invention, and has a corresponding functional module and beneficial effect of executing the method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
The embodiment of the invention also provides a computer readable storage medium, and the computer readable storage medium stores a computer program. It should be noted that each module in the above system corresponds to a specific step of the method provided by the embodiment of the present invention, and has a corresponding functional module and beneficial effect of executing the method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
While embodiments of the present invention have been shown and described, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the invention. Any other corresponding changes and modifications made in accordance with the technical idea of the present invention shall be included in the scope of the claims of the present invention.

Claims (7)

1. A video prediction method based on space-time decoupling and self-attention difference LSTM, characterized by comprising the following steps:
s1, constructing a space-time decoupling network, and decoupling the temporal dynamic features and spatial static features of a video;
s2, designing a dynamic differential model comprising a forget gate, an input gate and an update gate by using differential operations;
s3, designing a gating mechanism on the basis of attention, deeply fusing long-term memory with the attended features, and constructing a new global self-attention model;
s4, embedding the dynamic differential model and the global self-attention model into an LSTM unit to form a new DISA-LSTM unit, and stacking the units with a diagonal recurrent architecture to construct a DISA-LSTM prediction network;
s5, constructing an overall network architecture based on a convolutional auto-encoder, and training a video prediction model by combining an adversarial loss function, a similarity loss function, a reconstruction loss function and a prediction loss;
in step S1, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are:
(1) specific content of decoupling the temporal dynamic features of the video:
building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1;
extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment;
the feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into a specific interval;
(2) specific content of decoupling the spatial static features of the video:
the static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence;
the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k;
in step S5, the specific content of the video prediction model is:
the overall network architecture constructed on the basis of the convolutional auto-encoder is composed of an encoder, a DISA-LSTM prediction network and a decoder, wherein:
the encoder consists of the space-time decoupling network comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation;
after the feature vector representations from the encoder are further fused, they are fed into the DISA-LSTM prediction network, which generates future frames by learning internal latent relationships;
the decoder consists of deconvolutions; the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder;
the video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise an adversarial loss, a similarity loss and a reconstruction loss, with the specific formula:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
2. The video prediction method based on space-time decoupling and self-attention difference LSTM according to claim 1, wherein in step S2, the specific design contents of the dynamic differential model are:
the differential information of the hidden states of adjacent time steps is received and, after passing through the forget gate, the input gate and the update gate, is fused with the long-term memory cell of the previous time step to form the differential features, with the specific formula:
wherein σ represents the Sigmoid activation function, tanh represents the tanh activation function, * represents the convolution operation and ⊙ represents the Hadamard product; W'_hf, W'_hi and W'_hg are two-dimensional convolution kernels, b'_f, b'_i and b'_g are biases, t represents the time step and l represents the layer; f'_t, i'_t and g'_t are respectively the forget gate, input gate and update gate used to screen the differential information; the remaining symbols represent, respectively, the differential information of the hidden states of adjacent time steps, the long-term memory cell of the previous time step, and the differential features.
3. The video prediction method based on space-time decoupling and self-attention difference LSTM according to claim 1, wherein in step S3, the specific steps of constructing the new global self-attention model are as follows:
s301, assigning three different 1×1 weights {W_q, W_k, W_v} to the hidden state, mapping it into three different spaces of query vectors, key vectors and value vectors; the similarity score between the j-th key vector and the e-th query vector is computed, the scores are normalized with the softmax activation function to obtain the similarity distribution of each position, and the distribution is multiplied by the corresponding value vectors to obtain the attention features, with the specific formula:
wherein Q represents the query vectors, K represents the key vectors, V represents the value vectors, W_q, W_k and W_v represent the three different weights assigned to the hidden state, the hidden state has size C×H×W (number of channels × height × width) with N = H×W, e and j represent the position indices of the query and key vectors, Q_e represents the e-th query vector, K_j^T represents the transpose of the j-th key vector, d_k represents the dimension of the key vectors, Z represents the attention features, and T represents the transpose;
s302, deeply fusing the long-term memory with the attention features and the hidden state, with the specific formula:
wherein i, g and o respectively represent the input gate, update gate and output gate in the global self-attention module, W_(i;h), W_(i;z), W_(g;h), W_(g;z), W_(o;h) and W_(o;z) represent two-dimensional convolution kernels, b_i, b_g and b_o represent biases, and the remaining symbols represent the updated long-term memory and hidden state and the hidden state after passing through the global attention module.
4. The video prediction method based on space-time decoupling and self-attention difference LSTM according to claim 1, wherein in step S4, the specific steps of constructing the DISA-LSTM prediction network are as follows:
s401, replacing the forget gate in the LSTM unit with the dynamic differential model, and feeding the hidden state of the current time step together with the long-term memory of the previous moment into the global self-attention model to form a new DISA-LSTM unit;
s402, stacking three layers of memory units, wherein the first layer uses an ST-LSTM unit and the other two layers use DISA-LSTM units; and constructing the DISA-LSTM prediction network with a diagonal recurrent structure, in which differential information is propagated diagonally across layers while the hidden states processed by the global self-attention model are propagated along the time dimension.
5. A video prediction system based on spatiotemporal decoupling and self-attention differential LSTM, comprising:
a video feature decoupling module, used to construct a space-time decoupling network and decouple the temporal dynamic features and spatial static features of a video;
a dynamic differential model design module, used to design, with differential operations, a dynamic differential model comprising a forget gate, an input gate and an update gate to replace the forget gate of the LSTM unit;
a depth fusion feature module, used to design a gating mechanism on the basis of attention, deeply fuse long-term memory with the attended features, and construct a new global self-attention model;
a video prediction model training module, used to build an overall network architecture based on a convolutional auto-encoder and train a video prediction model by combining an adversarial loss function, a similarity loss function, a reconstruction loss function and a prediction loss;
in the video feature decoupling module, the space-time decoupling network is composed of a dynamic encoder and a static encoder, and the specific contents of decoupling the temporal dynamic features and spatial static features of the video are:
(1) specific content of decoupling the temporal dynamic features of the video:
building the dynamic encoder: the dynamic encoder is constructed from 6 convolutions with stride 2 and kernel size 4×4; a batch normalization operation and a LeakyReLU activation function follow each of the first 5 convolution layers, and a Tanh activation function at the final output layer normalizes the output temporal dynamic feature vector to between -1 and 1;
extracting the temporal dynamic features: an adversarial loss function is introduced for the dynamic encoder, and the temporal dynamic features are completely decoupled from the spatial static features through adversarial training of the dynamic encoder against a feature discriminator, with the specific formula:
wherein L_adversarial represents the adversarial loss function, E_d represents the dynamic encoder, T represents the feature discriminator, X_t^m represents the t-th frame of the m-th video segment, X_(t+k)^m represents the (t+k)-th frame of the m-th video segment, X_t^n represents the t-th frame of the n-th video segment, and X_(t+k)^n represents the (t+k)-th frame of the n-th video segment;
the feature discriminator uses three 1×1 convolution layers with ReLU activation functions, and a Sigmoid function at the last layer maps the probability vector output by the discriminator into a specific interval;
(2) specific content of decoupling the spatial static features of the video:
the static encoder is constructed with the same architecture as the dynamic encoder; a similarity loss function is introduced, and a squared error is used to maximize the similarity of the spatial static features of adjacent time steps, with the specific formula:
wherein L_similarity represents the similarity loss function, E_s represents the static encoder, X_t represents the t-th frame of the video sequence, and X_(t+k) represents the (t+k)-th frame of the video sequence;
the temporal dynamic features and spatial static features of the video obtained by decoupling are fused, with the specific contents as follows:
wherein the first term represents the temporal dynamic features after decoupling the video sequence of frames t to t+k, the second term represents the spatial static features after decoupling the video sequence of frames t to t+k, X_(t:t+k) represents the video sequence of frames t to t+k, and H_(t:t+k) represents the decoupled features after fusing the temporal dynamic and spatial static features of the video sequence of frames t to t+k;
in the video prediction model training module, the specific contents of the video prediction model are as follows:
the overall network architecture constructed on the basis of the convolutional auto-encoder is composed of an encoder, a DISA-LSTM prediction network and a decoder, wherein:
the encoder consists of the space-time decoupling network comprising the dynamic encoder and the static encoder; it decouples the temporal dynamic and spatial static features of the video and encodes the video data into a lower-dimensional latent vector representation;
after the feature vector representations from the encoder are further fused, they are fed into the DISA-LSTM prediction network, which generates future frames by learning internal latent relationships;
the decoder consists of deconvolutions; the future frame sequence is fused with the decoupled spatial static features and reconstructed back into the original pixels by the decoder;
the video prediction model is trained with decoupling losses and a prediction loss, wherein the decoupling losses comprise an adversarial loss, a similarity loss and a reconstruction loss, with the specific formula:
L = L_reconstruction(E_s, E_d, D) + λ_sim·L_similarity(E_s) + λ_adv·(L_adversarial(E_d) + L_adversarial(T)) + λ_mse·L_MSE
wherein L_reconstruction is the reconstruction loss function; L_MSE is the prediction loss, which uses the mean squared error as the loss propagated back through the model; D represents the decoder; and λ_sim, λ_adv and λ_mse are hyper-parameters used to balance the convergence speed of the different loss functions.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 4 when the computer program is executed by the processor.
7. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when executed by a processor, performs the method of any of claims 1 to 4.
CN202310802044.5A 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM Active CN116524419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310802044.5A CN116524419B (en) 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310802044.5A CN116524419B (en) 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Publications (2)

Publication Number Publication Date
CN116524419A CN116524419A (en) 2023-08-01
CN116524419B true CN116524419B (en) 2023-11-07

Family

ID=87390650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310802044.5A Active CN116524419B (en) 2023-07-03 2023-07-03 Video prediction method and system based on space-time decoupling and self-attention difference LSTM

Country Status (1)

Country Link
CN (1) CN116524419B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881996B (en) * 2023-09-07 2023-12-01 华南理工大学 Modeling intention prediction method based on mouse operation
CN117421733A (en) * 2023-12-19 2024-01-19 浪潮电子信息产业股份有限公司 Leesvirus detection method, apparatus, electronic device and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN113449660A (en) * 2021-07-05 2021-09-28 西安交通大学 Abnormal event detection method of space-time variation self-coding network based on self-attention enhancement
CN115147275A (en) * 2022-06-24 2022-10-04 浙江大学 Video implicit representation method based on decoupled space and time sequence information
CN115311169A (en) * 2022-08-29 2022-11-08 上海大学 Damaged image repairing method and device, electronic equipment and storage medium
CN115346101A (en) * 2022-08-22 2022-11-15 南京信息工程大学 Method for realizing radar echo extrapolation model based on depth space-time fusion network
CN115830666A (en) * 2022-09-23 2023-03-21 合肥工业大学 Video expression recognition method based on spatio-temporal characteristic decoupling and application
WO2023061102A1 (en) * 2021-10-15 2023-04-20 腾讯科技(深圳)有限公司 Video behavior recognition method and apparatus, and computer device and storage medium
CN116091978A (en) * 2023-02-24 2023-05-09 北京工业大学 Video description method based on advanced semantic information feature coding
CN116307224A (en) * 2023-03-28 2023-06-23 南京信息工程大学 ENSO space-time prediction method based on recursive gating convolution and attention mechanism improvement

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN113449660A (en) * 2021-07-05 2021-09-28 西安交通大学 Abnormal event detection method of space-time variation self-coding network based on self-attention enhancement
WO2023061102A1 (en) * 2021-10-15 2023-04-20 腾讯科技(深圳)有限公司 Video behavior recognition method and apparatus, and computer device and storage medium
CN115147275A (en) * 2022-06-24 2022-10-04 浙江大学 Video implicit representation method based on decoupled space and time sequence information
CN115346101A (en) * 2022-08-22 2022-11-15 南京信息工程大学 Method for realizing radar echo extrapolation model based on depth space-time fusion network
CN115311169A (en) * 2022-08-29 2022-11-08 上海大学 Damaged image repairing method and device, electronic equipment and storage medium
CN115830666A (en) * 2022-09-23 2023-03-21 合肥工业大学 Video expression recognition method based on spatio-temporal characteristic decoupling and application
CN116091978A (en) * 2023-02-24 2023-05-09 北京工业大学 Video description method based on advanced semantic information feature coding
CN116307224A (en) * 2023-03-28 2023-06-23 南京信息工程大学 ENSO space-time prediction method based on recursive gating convolution and attention mechanism improvement

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics; Yunbo Wang et al.; CVPR 2019; 9154-9162 *
PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning; Yunbo Wang et al.; arXiv:2103.09504v4; 1-17 *
Self-attention eidetic 3D-LSTM: Video prediction models for traffic flow forecasting; Xiao Yan et al.; Neurocomputing; vol. 509 (2022); 167-176 *
Unsupervised Learning of Disentangled Representations from Video; Emily Denton et al.; 31st Conference on Neural Information Processing Systems (NIPS 2017); 4417-4426 *
A survey of deep predictive learning methods based on unlabeled video data; Pan Minting et al.; Acta Electronica Sinica; vol. 50, no. 04; 869-886 *
Video prediction based on attention spatio-temporally decoupled 3D convolutional LSTM; Huang Jingui et al.; Microelectronics & Computer; vol. 39, no. 09; 63-72 *
Research on video prediction based on self-supervised learning; Han Xin et al.; China Masters' Theses Full-text Database (Information Science and Technology), no. 2023(01); I138-1405; main text sections 3.1.2-3.1.3 and 3.2-3.3, figures 3-2, 3-4, 3-5 *

Also Published As

Publication number Publication date
CN116524419A (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN116524419B (en) Video prediction method and system based on space-time decoupling and self-attention difference LSTM
WO2021258967A1 (en) Neural network training method and device, and data acquisition method and device
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN112418409B (en) Improved convolution long-short-term memory network space-time sequence prediction method by using attention mechanism
CN111462324B (en) Online spatiotemporal semantic fusion method and system
CN110570035B (en) People flow prediction system for simultaneously modeling space-time dependency and daily flow dependency
CN116310667B (en) Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
CN116958534A (en) Image processing method, training method of image processing model and related device
CN114969298A (en) Video question-answering method based on cross-modal heterogeneous graph neural network
CN115719036A (en) Space-time process simulation method and system based on stacked space-time memory unit
CN115792913A (en) Radar echo extrapolation method and system based on time-space network
CN115115828A (en) Data processing method, apparatus, program product, computer device and medium
CN112016701B (en) Abnormal change detection method and system integrating time sequence and attribute behaviors
CN113689382A (en) Tumor postoperative life prediction method and system based on medical images and pathological images
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
CN116977661A (en) Data processing method, device, equipment, storage medium and program product
CN116307224A (en) ENSO space-time prediction method based on recursive gating convolution and attention mechanism improvement
CN115797557A (en) Self-supervision 3D scene flow estimation method based on graph attention network
CN114782538A (en) Visual positioning method compatible with different barrel shapes and applied to filling field
Sun et al. Cycle representation-disentangling network: learning to completely disentangle spatial-temporal features in video
CN113569867A (en) Image processing method and device, computer equipment and storage medium
Yang et al. A Lightweight Semantic Segmentation Algorithm Based on Deep Convolutional Neural Networks
CN111915701A (en) Button image generation method and device based on artificial intelligence
CN115965995B (en) Skeleton self-supervision method and model based on partial space-time data
Wang et al. Keymemoryrnn: A flexible prediction framework for spatiotemporal prediction networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant