CN111914731A - Multi-mode LSTM video motion prediction method based on self-attention mechanism - Google Patents

Multi-mode LSTM video motion prediction method based on self-attention mechanism

Info

Publication number
CN111914731A
Authority
CN
China
Prior art keywords
rgb
optical flow
features
data set
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010738071.7A
Other languages
Chinese (zh)
Other versions
CN111914731B (en)
Inventor
邵洁
莫晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Electric Power
Original Assignee
Shanghai University of Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Electric Power filed Critical Shanghai University of Electric Power
Priority to CN202010738071.7A priority Critical patent/CN111914731B/en
Publication of CN111914731A publication Critical patent/CN111914731A/en
Application granted granted Critical
Publication of CN111914731B publication Critical patent/CN111914731B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-modal LSTM video motion prediction method based on a self-attention mechanism, comprising the following steps. Step 1: prepare a training data set and preprocess the original video to obtain RGB pictures and optical flow pictures. Step 2: extract RGB features and optical flow features from the RGB pictures and optical flow pictures through a TSN (Temporal Segment Network), and obtain features related to object detection through a Faster R-CNN object detector trained on the training data set. Step 3: establish a multi-modal LSTM network model based on the self-attention mechanism, input the RGB features, optical flow features and object-detection features obtained in Step 2 into the network model for training, and output the corresponding action class distribution tensors. Step 4: establish a fusion network to assign weights to the action class distribution tensors and combine them, obtaining the final video motion prediction result. Compared with the prior art, the method achieves high accuracy and overcomes the drawback of poor performance at longer anticipation times.

Description

Multi-mode LSTM video motion prediction method based on self-attention mechanism
Technical Field
The invention relates to the technical field of video motion prediction, in particular to a multi-mode LSTM video motion prediction method based on a self-attention mechanism.
Background
Vision-based action recognition is one of the research hotspots and difficulties in the field of computer vision, spanning several disciplines including image processing, deep learning, and artificial intelligence. It has high academic research value, and with the vigorous development of the Internet industry in the 5G era it has a broad range of applications in video analysis and understanding. Current work on action recognition focuses on how to correctly recognize the complete action contained in a video. In practice, however, it is desirable that a monitoring system provide early warning of potential risks at the monitored location, so that dangerous activities can be prevented before they cause serious consequences, rather than merely recognizing actions that have already been completed or detecting their consequences. To achieve this, the monitoring system must be equipped with the visual capability to predict actions.
Action prediction refers to predicting the action category of a video as early as possible, before the action in the video is completed, by extracting and processing features from a continuously arriving video stream. The main difference between action prediction and action recognition lies in the completeness of the recognized object: the former operates on video clips taken before the action occurs, which do not yet contain the action to be predicted, whereas the latter operates on a complete video containing the action. Action prediction is therefore the more challenging task. First, some actions appear similar in their early stages; for example, both "shaking hands" and "waving" begin with lifting the hand, and such similar beginnings make the features extracted from the observed video stream hard to distinguish between the two different actions. Second, owing to the setting of the action prediction task, the time required to complete the entire action is unknown, so different actions cannot be distinguished by their duration. Consequently, neither features with the key semantics needed to distinguish actions with similar beginnings nor the complete temporal structure of the action can be obtained from the observed video portion. Third, since the selected video segment is taken before the action segment to be predicted, such input data is only weakly related to that segment.
Action prediction methods generally extract features from the video and model the mapping between those features and action categories to predict actions that will occur in the future. The quality of the prediction depends largely on how well the features describe incomplete actions and on whether the characteristic temporal motion pattern of the target action can be learned. Before the advent of deep learning, traditional machine learning methods such as bag-of-words models and support vector machines were used for the action prediction task. In recent years, deep learning has become the mainstream in computer vision: convolutional networks can extract high-level features with rich semantics, which can be used for recognition and detection, and these features can be further fused or encoded to improve the effectiveness of the model.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art and to provide a multi-modal LSTM video motion prediction method based on a self-attention mechanism.
The purpose of the invention can be realized by the following technical scheme:
a method for video motion prediction based on multi-modal LSTM of the self-attention mechanism, the method comprising the steps of:
step 1: preparing a training data set and preprocessing an original video to obtain an RGB picture and an optical flow picture;
step 2: extracting RGB features and optical flow features through a TSN (time delay network) based on the RGB pictures and the optical flow pictures, and obtaining features related to target detection through a fast-RCNN target detector based on a training data set;
and step 3: establishing a multi-mode LSTM network model based on a self-attention mechanism, inputting the RGB characteristics and the optical flow characteristics obtained in the step (2) and the characteristics related to target detection into the network model for training, and outputting the corresponding action type distribution tensors;
and 4, step 4: and establishing a fusion network to distribute weight for the action type distribution tensor and combine the action type distribution tensor with the action type distribution tensor to obtain a final video action prediction result.
Further, step 1 comprises the following sub-steps:
Step 101: selecting the data set used for training, from which the features related to object detection are obtained;
Step 102: decomposing the original video at a set frame rate to extract RGB pictures;
Step 103: extracting optical flow pictures from the original video using the TVL1 algorithm.
Further, the data sets in step 101 are the EPIC-KITCHENS data set and the EGTEA Gaze+ data set.
Further, the frame rate set in step 102 is 30 fps.
Further, step 2 comprises the following sub-steps:
Step 201: training the original TSN in advance to obtain a pre-trained TSN model;
Step 202: removing the classification layer of the original TSN and loading the pre-trained TSN model to obtain a TSN based on the two-stream principle;
Step 203: inputting the RGB pictures and optical flow pictures into the TSN based on the two-stream principle, and extracting the corresponding RGB features and optical flow features from the output of the global pooling layer of the network;
Step 204: training the Faster R-CNN object detector with the object annotations of the data set to obtain the features related to object detection (a hedged code sketch of this sub-step is given after the clauses below).
Further, the initial learning rate of the training process for the TSN based on the two-stream principle in step 202 is set to 0.001; 160 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, and after the 80th epoch the learning rate is divided by 10.
Further, the data set in step 204 is the EGTEA Gaze+ data set.
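The object-detection branch of step 204 can be illustrated with a minimal, hedged sketch. The snippet below is not the patented implementation: it stands in a torchvision Faster R-CNN detector for the detector that the patent trains on the data set's own object annotations, and it shows how per-frame detections can be reduced to an object-class feature vector in which bounding-box coordinates are discarded and only class information is kept, as the description specifies. The names `extract_object_features`, `NUM_OBJECT_CLASSES` and the confidence threshold are illustrative assumptions.

```python
# Hedged sketch: per-frame object-class features from a Faster R-CNN detector.
# A torchvision detector stands in for the detector trained on the dataset's
# own object annotations, so treat this only as an illustration.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

NUM_OBJECT_CLASSES = 91          # COCO classes here; EPIC-KITCHENS would differ
SCORE_THRESHOLD = 0.5            # illustrative confidence cut-off

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_object_features(frames):
    """frames: list of CxHxW float tensors in [0, 1].
    Returns a (T, NUM_OBJECT_CLASSES) tensor holding, for each frame, the
    maximum detection score per object class. Box coordinates are discarded
    and only class information is kept, as in the description."""
    feats = []
    with torch.no_grad():
        outputs = detector(frames)
    for out in outputs:
        frame_feat = torch.zeros(NUM_OBJECT_CLASSES)
        for label, score in zip(out["labels"], out["scores"]):
            if score >= SCORE_THRESHOLD:
                frame_feat[label] = torch.maximum(frame_feat[label], score)
        feats.append(frame_feat)
    return torch.stack(feats)
```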
Further, step 3 comprises the following sub-steps:
Step 301: establishing a multi-modal LSTM network model based on the self-attention mechanism, the model comprising an encoder and a multi-layer independent LSTM network, wherein the encoder consists of a position encoding module and a self-attention module:
the position encoding module is used to encode the absolute and relative positions of the frames in the video, yielding a position-encoded feature sequence;
the self-attention module is used to further mine the semantics in the position-encoded feature sequence, yielding a global description of the video;
Step 302: inputting the RGB features, optical flow features and object-detection features obtained in step 2 into the multi-modal LSTM network model based on the self-attention mechanism for training, and outputting the corresponding action class distribution tensors.
Further, in step 302, the learning rate of the training process in which the RGB features, optical flow features and object-detection features obtained in step 2 are input into the multi-modal LSTM network model based on the self-attention mechanism is set to 0.005; 100 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, with the momentum set to 0.9.
Further, the number of layers of the LSTM network is 2; a minimal code sketch of one modality branch of this model is given below.
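The following sketch illustrates one single-modality branch of the model in step 301: a trigonometric (sinusoidal) position encoding module, taken here as one common realization of the trigonometric-function encoding the description mentions, a self-attention module, and a 2-layer LSTM producing action-class scores. Dimensions, layer sizes and the class names (`SinusoidalPositionEncoding`, `SelfAttentionEncoder`, `BranchLSTM`, `d_model`, `num_classes`) are illustrative assumptions rather than values from the patent.

```python
# Hedged sketch of one modality branch: position encoding -> self-attention -> 2-layer LSTM.
import math
import torch
import torch.nn as nn

class SinusoidalPositionEncoding(nn.Module):
    """Trigonometric encoding of the frame positions in the observed sequence."""
    def __init__(self, d_model: int, max_len: int = 64):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                     # x: (batch, T, d_model)
        return x + self.pe[: x.size(1)]

class SelfAttentionEncoder(nn.Module):
    """Position encoding followed by multi-head self-attention over the feature sequence."""
    def __init__(self, d_model: int, num_heads: int = 4):
        super().__init__()
        self.pos = SinusoidalPositionEncoding(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x):                     # x: (batch, T, d_model)
        x = self.pos(x)
        out, _ = self.attn(x, x, x)           # global description of the observed video
        return out

class BranchLSTM(nn.Module):
    """Encoder + 2-layer LSTM producing action-class scores per observed step."""
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.encoder = SelfAttentionEncoder(d_model)
        self.lstm = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, feats):                 # feats: (batch, T, d_model)
        h, _ = self.lstm(self.encoder(feats))
        return self.classifier(h)             # (batch, T, num_classes)

# Illustrative usage with assumed sizes: 14 observed frames, 1024-d features, 100 classes.
branch = BranchLSTM(d_model=1024, num_classes=100)
scores = branch(torch.randn(2, 14, 1024))
```

In the method described here, three such branches (RGB, optical flow, object-detection features) are trained separately and their outputs are combined by the fusion network described later.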
Compared with the prior art, the invention has the following advantages:
(1) the method comprehensively considers three video characteristics, wherein RGB characteristics are used for modeling spatial information, optical flow characteristics are used for modeling time sequence motion information, and characteristics related to target detection are used for modeling which target a person in the video interacts with; because the characteristic sequence is very sensitive to position information, an independent position coding module based on a trigonometric function is adopted to code the absolute position and the relative position of a frame in the video; sending the feature sequence with the coded position into a self-attention module for processing, and further mining semantics in the feature sequence to obtain global description of the video; the output of the self-attention module is used as the input of an LSTM network, the LSTM can effectively load historical information and can complete the prediction of different prediction time, and the output of the LSTM network is the distribution of action types; in order to avoid overfitting, the feature extraction network and the prediction network are trained separately; the three extracted characteristics are used as the input of a prediction network, and the prediction network is trained by adopting a cross entropy loss function; detecting the trained model on a test set of a data set to evaluate the effect of the model; compared with the methods for predicting actions in recent years, the method has more accuracy indexes than those of the methods, and solves the defect of poor effect of long action prediction time.
(2) The self-attention mechanism in the method is a research proposal in the field of natural language processing, and is proved to have good effect on data such as text, voice and the like. The data type in the computer vision field is mainly picture video, and the algorithm of the motion prediction task applies a self-attention mechanism to help shorten the distance between two communities.
(3) The method proves that the text sequence and the video sequence have time sequence, and the similar characteristic is also the basis for using position coding and self-attention mechanism coding.
Drawings
FIG. 1 is a diagram of the overall network model architecture of the present invention;
FIG. 2 is a diagram of a multi-modal LSTM network model architecture in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF INVENTION
1. Video pre-processing and training data preparation
The method of the invention is evaluated on two data sets, EPIC-KITCHENS and EGTEA Gaze+. The original video is decomposed at a frame rate of 30 fps to extract RGB pictures, and the TVL1 algorithm is then used to extract the corresponding optical flow pictures.
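A hedged sketch of this preprocessing step is shown below. It assumes OpenCV with the optflow contrib module for the TVL1 algorithm (`cv2.optflow.DualTVL1OpticalFlow_create`); the path handling and the flow-to-image scaling are illustrative choices, not values prescribed by the patent.

```python
# Hedged sketch: decompose a video at 30 fps into RGB frames and TVL1 optical flow pictures.
# Requires opencv-contrib-python for cv2.optflow; scaling of the flow is illustrative.
import cv2
import numpy as np

def preprocess_video(video_path: str, fps: int = 30):
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(int(round(src_fps / fps)), 1)      # subsample to roughly 30 fps
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

    rgb_frames, flow_pictures = [], []
    prev_gray, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb_frames.append(frame)              # BGR frame as read by OpenCV
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = tvl1.calc(prev_gray, gray, None)
                # One possible mapping of the 2-channel flow field to 8-bit pictures.
                flow_img = np.clip(flow * 16 + 128, 0, 255).astype(np.uint8)
                flow_pictures.append(flow_img)
            prev_gray = gray
        idx += 1
    cap.release()
    return rgb_frames, flow_pictures
```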
2. Feature extraction
The method uses a TSN (Temporal Segment Network) based on the two-stream principle to extract RGB features and optical flow features. The TSN is first trained on an action recognition task to obtain a pre-trained model. The classification layer of the original TSN is then removed, the pre-trained model is loaded, and the corresponding RGB features and optical flow features are extracted from the output of the global pooling layer. For the features related to object detection, a Faster R-CNN object detector is trained with the object annotations of the data set; the detector output discards the bounding-box coordinates and keeps only the object class information, because the algorithm is concerned only with which objects the person in the video is interacting with, i.e., it models only information useful for predicting the action class.
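The feature-extraction operation can be sketched as follows. This is a minimal illustration under stated assumptions: a torchvision ResNet-50 stands in for the TSN's pre-trained BN-Inception streams, the classification layer is removed, and the globally average-pooled output is taken as the frame-level feature, which is the operation the description refers to as extracting features from the global pooling layer. The real method pre-trains the TSN on the action recognition task and uses separate RGB and optical flow streams (a flow stream would need its first convolution adapted to stacked flow channels).

```python
# Hedged sketch: drop the classification layer and read features from the global pooling output.
# A torchvision ResNet-50 stands in for the TSN's pre-trained BN-Inception stream.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # remove the classification layer
backbone.eval()

def extract_stream_features(pictures):
    """pictures: (T, 3, H, W) tensor of RGB pictures (or suitably stacked flow pictures).
    Returns (T, 2048) features taken right after global average pooling."""
    with torch.no_grad():
        return backbone(pictures)

rgb_features = extract_stream_features(torch.randn(14, 3, 224, 224))
```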
3. Motion prediction
The network model structure of the action prediction algorithm designed by the method is shown in Fig. 1. Its basic framework consists of an encoder, composed of the position encoding module and the self-attention module, together with a two-layer independent LSTM network. The encoder is responsible for further encoding the extracted feature sequence and extracting its context information to obtain richer semantics. The LSTM network performs the actual prediction: it loads the video frames observed in the past and produces action-class distributions for different anticipation times. The input-output relationship of the LSTM is shown in Fig. 2: for each video segment, 14 frames are sampled at intervals of 0.25 s from the portion preceding the start of the action. The three kinds of features enter three sub-networks, each formed by an encoder and an LSTM network, and are trained separately. Finally, a fusion network with an attention mechanism, consisting of three fully connected layers, assigns a weight to each of the three sub-networks; each weight is multiplied by the corresponding action-class distribution tensor to obtain the final output of the whole model.
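The late-fusion stage just described can be illustrated with the hedged sketch below: three stacked fully connected layers map the concatenated branch states to one weight per modality, the weights are normalised, and each weight scales the corresponding action-class distribution tensor before summation. The layer widths, the softmax normalisation and the exact input to the fusion network are assumptions made for illustration; the description states only that the fusion network consists of three fully connected layers and assigns weights that multiply the class-distribution tensors.

```python
# Hedged sketch of the attention fusion network: three stacked fully connected layers
# produce one weight per modality; the weights then scale each branch's
# action-class distribution tensor before summation.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, state_dim: int, num_modalities: int = 3, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(              # three fully connected layers
            nn.Linear(num_modalities * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_modalities),
        )

    def forward(self, states, class_distributions):
        # states: (batch, num_modalities * state_dim) concatenated hidden/cell states
        # class_distributions: list of (batch, num_classes) tensors, one per modality
        weights = torch.softmax(self.mlp(states), dim=-1)          # (batch, 3)
        stacked = torch.stack(class_distributions, dim=1)          # (batch, 3, num_classes)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)        # fused prediction

# Illustrative usage with assumed sizes.
fusion = AttentionFusion(state_dim=1024)
fused = fusion(torch.randn(2, 3 * 1024),
               [torch.randn(2, 100) for _ in range(3)])
```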
4. Training strategy and associated parameters
For the two-stream TSN training, the number of segments is set to 3, and 160 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function. The initial learning rate is set to 0.001 and is divided by 10 after the 80th epoch. The experimental environment is a single GeForce 1060 graphics card. The Faster R-CNN object detector was trained on the EPIC-KITCHENS data set; because the EGTEA Gaze+ data set lacks bounding-box annotations, no object-detection features are added for that data set, and the model on that data set considers only RGB features and optical flow features. The prediction network has three sub-networks; each sub-network is likewise trained with stochastic gradient descent using the standard cross-entropy loss function, with the learning rate fixed at 0.005, the momentum set to 0.9, and 100 epochs of training.
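The prediction-network training configuration stated above (stochastic gradient descent, standard cross-entropy loss, fixed learning rate 0.005, momentum 0.9, 100 epochs) can be written down as the following hedged sketch. The `branch` model and `train_loader` are assumed placeholders (see the earlier sketches); only the optimizer and loss settings come from the description.

```python
# Hedged sketch of the prediction-network training loop: SGD, cross-entropy loss,
# fixed learning rate 0.005, momentum 0.9, 100 epochs (values from the description).
# `branch` and `train_loader` are assumed to exist elsewhere.
import torch
import torch.nn as nn

def train_branch(branch, train_loader, epochs: int = 100):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(branch.parameters(), lr=0.005, momentum=0.9)
    branch.train()
    for epoch in range(epochs):
        for feats, labels in train_loader:       # feats: (B, T, D), labels: (B,)
            optimizer.zero_grad()
            scores = branch(feats)[:, -1]        # prediction at the last observed step
            loss = criterion(scores, labels)
            loss.backward()
            optimizer.step()
```

For the TSN pre-training schedule stated above (initial rate 0.001, divided by 10 after the 80th of 160 epochs), a step scheduler such as `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)` would be one way to express it; this, too, is an illustration rather than the patented implementation.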
5. Results and analysis of the experiments
Tables 1 and 2 show the results of the method of the invention and other prediction algorithms on the EPIC-KITCHENS and EGTEA Gaze+ data sets. The evaluation metric is Top-5 accuracy. On the EGTEA Gaze+ data set, the method of the present invention outperforms the compared methods at all anticipation times. On the EPIC-KITCHENS data set, it exceeds the other compared algorithms at all anticipation times except 0.5 s and 0.25 s, where it is slightly below the RU algorithm. To further verify the effectiveness of the self-attention mechanism, Table 3 compares, on the three individual modalities, the prediction results of the full model (B) and of the model with the encoder removed (A). The results show that the attention-based encoder proposed by the method effectively improves the performance of the model: it not only overcomes the weakness of other algorithms at long anticipation times and increases the robustness of the model, but also improves the accuracy.
Table 1: the method and other prediction algorithms of the invention predict the result on the EPIC-KITCHENS data set
TABLE I
Action anticipation results on the EPIC-KITCHENS dataset
Figure BDA0002605824840000061
Table 2: the method of the present invention and other prediction algorithms in EGTEA Gaze+Predicted results on a dataset
TABLE II
Action anticipation results on the EGTEA Gaze+dataset
Figure BDA0002605824840000062
Table 3: the method compares the prediction results of the full model (B) and the model (A) without the coder on three sub-features in the model
TABLE III
Comparison of experimental results with and without Encoder on a single modality
Figure BDA0002605824840000071
In Fig. 1 of this embodiment, Linear layer denotes a linear layer, Flow feature denotes the optical flow feature, RGB feature denotes the RGB feature, Obj feature denotes the feature related to object detection, Multiplication denotes multiplication, Anticipation output distribution denotes the predicted output distribution, BN-Inception denotes the BN-Inception network structure, Faster-RCNN denotes the Faster R-CNN object detector, Position encoding denotes the position encoding module, Sum denotes summation, Concatenation of hidden and cell states denotes the concatenation of the hidden and cell states, Self-attention denotes the self-attention module, Rolling LSTM unit denotes a rolling LSTM network unit, Unrolling LSTM unit denotes an unrolling LSTM network unit, Multi-modal LSTM denotes the multi-modal LSTM network model, and Attention fusion denotes the fusion network model.
In Fig. 2 of this embodiment, Observation time denotes the observation time, Anticipation time denotes the anticipation time, Time interval denotes the time interval, Anticipation output denotes the prediction output, Observed segment denotes the observed portion, Action occurring denotes the occurrence of the action, and Action starting time denotes the action start time.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A multi-modal LSTM video motion prediction method based on a self-attention mechanism, the method comprising the following steps:
Step 1: preparing a training data set and preprocessing the original video to obtain RGB pictures and optical flow pictures;
Step 2: extracting RGB features and optical flow features from the RGB pictures and optical flow pictures through a TSN (Temporal Segment Network), and obtaining features related to object detection through a Faster R-CNN object detector trained on the training data set;
Step 3: establishing a multi-modal LSTM network model based on the self-attention mechanism, inputting the RGB features, optical flow features and object-detection features obtained in Step 2 into the network model for training, and outputting the corresponding action class distribution tensors;
Step 4: establishing a fusion network to assign weights to the action class distribution tensors and combine them, obtaining the final video motion prediction result.
2. The method of claim 1, wherein step 1 comprises the following sub-steps:
step 101: selecting the data set used for training, from which the features related to object detection are obtained;
step 102: decomposing the original video at a set frame rate to extract RGB pictures;
step 103: extracting optical flow pictures from the original video using the TVL1 algorithm.
3. The method of claim 2, wherein the data sets in step 101 are the EPIC-KITCHENS data set and the EGTEA Gaze+ data set.
4. The method of claim 2, wherein the frame rate is set to 30fps in step 102.
5. The method of claim 1, wherein step 2 comprises the following sub-steps:
step 201: training the original TSN in advance to obtain a pre-trained TSN model;
step 202: removing the classification layer of the original TSN and loading the pre-trained TSN model to obtain a TSN based on the two-stream principle;
step 203: inputting the RGB pictures and optical flow pictures into the TSN based on the two-stream principle, and extracting the corresponding RGB features and optical flow features from the output of the global pooling layer of the network;
step 204: training the Faster R-CNN object detector with the object annotations of the data set to obtain the features related to object detection.
6. The method as claimed in claim 5, wherein the initial learning rate of the training process for the TSN based on the two-stream principle in step 202 is set to 0.001, 160 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, and after the 80th epoch the learning rate is divided by 10.
7. The method of claim 5, wherein the data set in step 204 is the EGTEA Gaze+ data set.
8. The method of claim 1, wherein step 3 comprises the following sub-steps:
step 301: establishing a multi-modal LSTM network model based on the self-attention mechanism, the model comprising an encoder and a multi-layer independent LSTM network, wherein the encoder consists of a position encoding module and a self-attention module:
the position encoding module is used to encode the absolute and relative positions of the frames in the video, yielding a position-encoded feature sequence;
the self-attention module is used to further mine the semantics in the position-encoded feature sequence, yielding a global description of the video;
step 302: inputting the RGB features, optical flow features and object-detection features obtained in step 2 into the multi-modal LSTM network model based on the self-attention mechanism for training, and outputting the corresponding action class distribution tensors.
9. The method of claim 8, wherein in step 302 the learning rate of the training process in which the RGB features, optical flow features and object-detection features obtained in step 2 are input into the multi-modal LSTM network model based on the self-attention mechanism is set to 0.005, 100 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, and the momentum is set to 0.9.
10. The method of claim 8, wherein the LSTM network has 2 layers.
CN202010738071.7A 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism Active CN111914731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010738071.7A CN111914731B (en) 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010738071.7A CN111914731B (en) 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism

Publications (2)

Publication Number Publication Date
CN111914731A true CN111914731A (en) 2020-11-10
CN111914731B CN111914731B (en) 2024-01-23

Family

ID=73286387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010738071.7A Active CN111914731B (en) 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111914731B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN113343564A (en) * 2021-05-28 2021-09-03 国网江苏省电力有限公司南通供电分公司 Transformer top layer oil temperature prediction method based on multi-element empirical mode decomposition
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, equipment and storage medium
CN114758285A (en) * 2022-06-14 2022-07-15 山东省人工智能研究院 Video interaction action detection method based on anchor freedom and long-term attention perception

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170262996A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Action localization in sequential data with attention proposals from a recurrent network
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN109063568A (en) * 2018-07-04 2018-12-21 复旦大学 A method of the figure skating video auto-scoring based on deep learning
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109740419A (en) * 2018-11-22 2019-05-10 东南大学 A kind of video behavior recognition methods based on Attention-LSTM network
CN109815903A (en) * 2019-01-24 2019-05-28 同济大学 A kind of video feeling classification method based on adaptive converged network
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170262996A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Action localization in sequential data with attention proposals from a recurrent network
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN109063568A (en) * 2018-07-04 2018-12-21 复旦大学 A method of the figure skating video auto-scoring based on deep learning
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109740419A (en) * 2018-11-22 2019-05-10 东南大学 A kind of video behavior recognition methods based on Attention-LSTM network
CN109815903A (en) * 2019-01-24 2019-05-28 同济大学 A kind of video feeling classification method based on adaptive converged network
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
潘陈听; 谭晓阳: "Video action recognition based on deep learning under complex backgrounds", 计算机与现代化 (Computer and Modernization), no. 07 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN112434608B (en) * 2020-11-24 2023-02-28 山东大学 Human behavior identification method and system based on double-current combined network
CN113343564A (en) * 2021-05-28 2021-09-03 国网江苏省电力有限公司南通供电分公司 Transformer top layer oil temperature prediction method based on multi-element empirical mode decomposition
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, equipment and storage medium
CN114758285A (en) * 2022-06-14 2022-07-15 山东省人工智能研究院 Video interaction action detection method based on anchor freedom and long-term attention perception

Also Published As

Publication number Publication date
CN111914731B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN111914731A (en) Multi-mode LSTM video motion prediction method based on self-attention mechanism
US20180114071A1 (en) Method for analysing media content
Zhang et al. A multistage refinement network for salient object detection
Zhou et al. PGDENet: Progressive guided fusion and depth enhancement network for RGB-D indoor scene parsing
Lin et al. Multi-grained deep feature learning for robust pedestrian detection
Wang et al. Spatial–temporal pooling for action recognition in videos
CN112801068B (en) Video multi-target tracking and segmenting system and method
Shi et al. Shuffle-invariant network for action recognition in videos
CN111523378A (en) Human behavior prediction method based on deep learning
CN112990122A (en) Complex behavior identification method based on video basic unit analysis
Suratkar et al. Employing transfer-learning based CNN architectures to enhance the generalizability of deepfake detection
de Oliveira Silva et al. Human action recognition based on a two-stream convolutional network classifier
Lin et al. Joint learning of local and global context for temporal action proposal generation
Shi et al. A divided spatial and temporal context network for remote sensing change detection
Farrajota et al. Human action recognition in videos with articulated pose information by deep networks
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
Hussain et al. AI-driven behavior biometrics framework for robust human activity recognition in surveillance systems
Shaikh et al. Real-Time Multi-Object Detection Using Enhanced Yolov5-7S on Multi-GPU for High-Resolution Video
Qin et al. Application of video scene semantic recognition technology in smart video
Wang et al. Deep neural networks in video human action recognition: A review
Huang et al. Video frame prediction with dual-stream deep network emphasizing motions and content details
CN117893957A (en) System and method for flow counting
CN112131429A (en) Video classification method and system based on depth prediction coding network
Zebhi et al. Converting video classification problem to image classification with global descriptors and pre‐trained network
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant