CN111914731A - Multi-mode LSTM video motion prediction method based on self-attention mechanism - Google Patents

Multi-mode LSTM video motion prediction method based on self-attention mechanism

Info

Publication number
CN111914731A
Authority
CN
China
Prior art keywords
rgb
optical flow
features
data set
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010738071.7A
Other languages
Chinese (zh)
Other versions
CN111914731B (en)
Inventor
邵洁
莫晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Electric Power
Original Assignee
Shanghai University of Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Electric Power filed Critical Shanghai University of Electric Power
Priority to CN202010738071.7A priority Critical patent/CN111914731B/en
Publication of CN111914731A publication Critical patent/CN111914731A/en
Application granted granted Critical
Publication of CN111914731B publication Critical patent/CN111914731B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-modal LSTM video motion prediction method based on a self-attention mechanism, comprising the following steps. Step 1: prepare a training data set and preprocess the original video to obtain RGB pictures and optical flow pictures. Step 2: extract RGB features and optical flow features from the RGB pictures and optical flow pictures through a TSN (Temporal Segment Network), and obtain features related to object detection through a Faster R-CNN object detector trained on the training data set. Step 3: establish a multi-modal LSTM network model based on the self-attention mechanism, input the RGB features, optical flow features and object-detection features obtained in Step 2 into the network model for training, and output the corresponding action class distribution tensors. Step 4: establish a fusion network to assign weights to the action class distribution tensors and combine them, obtaining the final video motion prediction result. Compared with the prior art, the method achieves high accuracy and overcomes the drawback of poor performance at longer anticipation times.

Description

Multi-mode LSTM video motion prediction method based on self-attention mechanism
Technical Field
The invention relates to the technical field of video motion prediction, in particular to a multi-mode LSTM video motion prediction method based on a self-attention mechanism.
Background
Vision-based action recognition is one of the research hotspots and difficulties in the field of computer vision, spanning several disciplines including image processing, deep learning, and artificial intelligence. It has high academic research value, and with the vigorous development of the Internet industry in the 5G era it has a broad range of applications in video analysis and understanding. Current work on action recognition focuses on how to correctly recognize the complete action contained in a video. In practice, however, it is desirable that a monitoring system provide early warning of potential risks at the monitored location, so that dangerous activities can be prevented before they cause serious consequences, rather than merely recognizing actions that have already been completed or detecting their consequences. To achieve this, the monitoring system must be equipped with the visual capability to predict actions.
Action prediction refers to predicting the action category of a video as early as possible, before the action in the video is completed, by extracting and processing features from a continuously arriving video stream. The main difference between action prediction and action recognition lies in the completeness of the recognized object: the former operates on video clips taken before the action occurs, which do not yet contain the action to be predicted, whereas the latter operates on a complete video containing the action. Action prediction is therefore the more challenging task. First, some actions appear similar in their early stages; for example, both "shaking hands" and "waving" begin with lifting the hand, and such similar beginnings make the features extracted from the observed video stream hard to distinguish between the two different actions. Second, owing to the setting of the action prediction task, the time required to complete the entire action is unknown, so different actions cannot be distinguished by their duration. Consequently, neither features with the key semantics needed to distinguish actions with similar beginnings nor the complete temporal structure of the action can be obtained from the observed video portion. Third, since the selected video segment is taken before the action segment to be predicted, such input data is only weakly related to that segment.
Action prediction methods generally extract features from the video and model the mapping between those features and action categories to predict actions that will occur in the future. The quality of the prediction depends largely on how well the features describe incomplete actions and on whether the characteristic temporal motion pattern of the target action can be learned. Before the advent of deep learning, traditional machine learning methods such as bag-of-words models and support vector machines were used for the action prediction task. In recent years, deep learning has become the mainstream in computer vision: convolutional networks can extract high-level features with rich semantics, which can be used for recognition and detection, and these features can be further fused or encoded to improve the effectiveness of the model.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art and to provide a multi-modal LSTM video motion prediction method based on a self-attention mechanism.
The purpose of the invention can be realized by the following technical scheme:
a method for video motion prediction based on multi-modal LSTM of the self-attention mechanism, the method comprising the steps of:
step 1: preparing a training data set and preprocessing an original video to obtain an RGB picture and an optical flow picture;
step 2: extracting RGB features and optical flow features through a TSN (time delay network) based on the RGB pictures and the optical flow pictures, and obtaining features related to target detection through a fast-RCNN target detector based on a training data set;
and step 3: establishing a multi-mode LSTM network model based on a self-attention mechanism, inputting the RGB characteristics and the optical flow characteristics obtained in the step (2) and the characteristics related to target detection into the network model for training, and outputting the corresponding action type distribution tensors;
and 4, step 4: and establishing a fusion network to distribute weight for the action type distribution tensor and combine the action type distribution tensor with the action type distribution tensor to obtain a final video action prediction result.
Further, step 1 comprises the following sub-steps:
Step 101: selecting the data set used for training, from which the features related to object detection are obtained;
Step 102: decomposing the original video at a set frame rate to extract RGB pictures;
Step 103: extracting optical flow pictures from the original video using the TVL1 algorithm.
Further, the data sets in step 101 are the EPIC-KITCHENS data set and the EGTEA Gaze+ data set.
Further, the frame rate set in step 102 is 30 fps.
Further, step 2 comprises the following sub-steps:
Step 201: training the original TSN in advance to obtain a pre-trained TSN model;
Step 202: removing the classification layer of the original TSN and loading the pre-trained TSN model to obtain a TSN based on the two-stream principle;
Step 203: inputting the RGB pictures and optical flow pictures into the TSN based on the two-stream principle, and extracting the corresponding RGB features and optical flow features from the output of the global pooling layer of the network;
Step 204: training the Faster R-CNN object detector with the object annotations of the data set to obtain the features related to object detection (a hedged code sketch of this sub-step is given after the clauses below).
Further, the initial learning rate of the training process for the TSN based on the two-stream principle in step 202 is set to 0.001; 160 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, and after the 80th epoch the learning rate is divided by 10.
Further, the data set in step 204 is the EGTEA Gaze+ data set.
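The object-detection branch of step 204 can be illustrated with a minimal, hedged sketch. The snippet below is not the patented implementation: it stands in a torchvision Faster R-CNN detector for the detector that the patent trains on the data set's own object annotations, and it shows how per-frame detections can be reduced to an object-class feature vector in which bounding-box coordinates are discarded and only class information is kept, as the description specifies. The names `extract_object_features`, `NUM_OBJECT_CLASSES` and the confidence threshold are illustrative assumptions.

```python
# Hedged sketch: per-frame object-class features from a Faster R-CNN detector.
# A torchvision detector stands in for the detector trained on the dataset's
# own object annotations, so treat this only as an illustration.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

NUM_OBJECT_CLASSES = 91          # COCO classes here; EPIC-KITCHENS would differ
SCORE_THRESHOLD = 0.5            # illustrative confidence cut-off

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_object_features(frames):
    """frames: list of CxHxW float tensors in [0, 1].
    Returns a (T, NUM_OBJECT_CLASSES) tensor holding, for each frame, the
    maximum detection score per object class. Box coordinates are discarded
    and only class information is kept, as in the description."""
    feats = []
    with torch.no_grad():
        outputs = detector(frames)
    for out in outputs:
        frame_feat = torch.zeros(NUM_OBJECT_CLASSES)
        for label, score in zip(out["labels"], out["scores"]):
            if score >= SCORE_THRESHOLD:
                frame_feat[label] = torch.maximum(frame_feat[label], score)
        feats.append(frame_feat)
    return torch.stack(feats)
```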
Further, step 3 comprises the following sub-steps:
Step 301: establishing a multi-modal LSTM network model based on the self-attention mechanism, the model comprising an encoder and a multi-layer independent LSTM network, wherein the encoder consists of a position encoding module and a self-attention module:
the position encoding module is used to encode the absolute and relative positions of the frames in the video, yielding a position-encoded feature sequence;
the self-attention module is used to further mine the semantics in the position-encoded feature sequence, yielding a global description of the video;
Step 302: inputting the RGB features, optical flow features and object-detection features obtained in step 2 into the multi-modal LSTM network model based on the self-attention mechanism for training, and outputting the corresponding action class distribution tensors.
Further, in step 302, the learning rate of the training process in which the RGB features, optical flow features and object-detection features obtained in step 2 are input into the multi-modal LSTM network model based on the self-attention mechanism is set to 0.005; 100 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, with the momentum set to 0.9.
Further, the number of layers of the LSTM network is 2; a minimal code sketch of one modality branch of this model is given below.
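The following sketch illustrates one single-modality branch of the model in step 301: a trigonometric (sinusoidal) position encoding module, taken here as one common realization of the trigonometric-function encoding the description mentions, a self-attention module, and a 2-layer LSTM producing action-class scores. Dimensions, layer sizes and the class names (`SinusoidalPositionEncoding`, `SelfAttentionEncoder`, `BranchLSTM`, `d_model`, `num_classes`) are illustrative assumptions rather than values from the patent.

```python
# Hedged sketch of one modality branch: position encoding -> self-attention -> 2-layer LSTM.
import math
import torch
import torch.nn as nn

class SinusoidalPositionEncoding(nn.Module):
    """Trigonometric encoding of the frame positions in the observed sequence."""
    def __init__(self, d_model: int, max_len: int = 64):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                     # x: (batch, T, d_model)
        return x + self.pe[: x.size(1)]

class SelfAttentionEncoder(nn.Module):
    """Position encoding followed by multi-head self-attention over the feature sequence."""
    def __init__(self, d_model: int, num_heads: int = 4):
        super().__init__()
        self.pos = SinusoidalPositionEncoding(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x):                     # x: (batch, T, d_model)
        x = self.pos(x)
        out, _ = self.attn(x, x, x)           # global description of the observed video
        return out

class BranchLSTM(nn.Module):
    """Encoder + 2-layer LSTM producing action-class scores per observed step."""
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.encoder = SelfAttentionEncoder(d_model)
        self.lstm = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, feats):                 # feats: (batch, T, d_model)
        h, _ = self.lstm(self.encoder(feats))
        return self.classifier(h)             # (batch, T, num_classes)

# Illustrative usage with assumed sizes: 14 observed frames, 1024-d features, 100 classes.
branch = BranchLSTM(d_model=1024, num_classes=100)
scores = branch(torch.randn(2, 14, 1024))
```

In the method described here, three such branches (RGB, optical flow, object-detection features) are trained separately and their outputs are combined by the fusion network described later.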
Compared with the prior art, the invention has the following advantages:
(1) the method comprehensively considers three video characteristics, wherein RGB characteristics are used for modeling spatial information, optical flow characteristics are used for modeling time sequence motion information, and characteristics related to target detection are used for modeling which target a person in the video interacts with; because the characteristic sequence is very sensitive to position information, an independent position coding module based on a trigonometric function is adopted to code the absolute position and the relative position of a frame in the video; sending the feature sequence with the coded position into a self-attention module for processing, and further mining semantics in the feature sequence to obtain global description of the video; the output of the self-attention module is used as the input of an LSTM network, the LSTM can effectively load historical information and can complete the prediction of different prediction time, and the output of the LSTM network is the distribution of action types; in order to avoid overfitting, the feature extraction network and the prediction network are trained separately; the three extracted characteristics are used as the input of a prediction network, and the prediction network is trained by adopting a cross entropy loss function; detecting the trained model on a test set of a data set to evaluate the effect of the model; compared with the methods for predicting actions in recent years, the method has more accuracy indexes than those of the methods, and solves the defect of poor effect of long action prediction time.
(2) The self-attention mechanism in the method is a research proposal in the field of natural language processing, and is proved to have good effect on data such as text, voice and the like. The data type in the computer vision field is mainly picture video, and the algorithm of the motion prediction task applies a self-attention mechanism to help shorten the distance between two communities.
(3) The method proves that the text sequence and the video sequence have time sequence, and the similar characteristic is also the basis for using position coding and self-attention mechanism coding.
Drawings
FIG. 1 is a diagram of the overall network model architecture of the present invention;
FIG. 2 is a diagram of a multi-modal LSTM network model architecture in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF INVENTION
1. Video pre-processing and training data preparation
The method of the invention is evaluated on two data sets, EPIC-KITCHENS and EGTEA Gaze+. The original video is decomposed at a frame rate of 30 fps to extract RGB pictures, and the TVL1 algorithm is then used to extract the corresponding optical flow pictures.
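A hedged sketch of this preprocessing step is shown below. It assumes OpenCV with the optflow contrib module for the TVL1 algorithm (`cv2.optflow.DualTVL1OpticalFlow_create`); the path handling and the flow-to-image scaling are illustrative choices, not values prescribed by the patent.

```python
# Hedged sketch: decompose a video at 30 fps into RGB frames and TVL1 optical flow pictures.
# Requires opencv-contrib-python for cv2.optflow; scaling of the flow is illustrative.
import cv2
import numpy as np

def preprocess_video(video_path: str, fps: int = 30):
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(int(round(src_fps / fps)), 1)      # subsample to roughly 30 fps
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

    rgb_frames, flow_pictures = [], []
    prev_gray, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb_frames.append(frame)              # BGR frame as read by OpenCV
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = tvl1.calc(prev_gray, gray, None)
                # One possible mapping of the 2-channel flow field to 8-bit pictures.
                flow_img = np.clip(flow * 16 + 128, 0, 255).astype(np.uint8)
                flow_pictures.append(flow_img)
            prev_gray = gray
        idx += 1
    cap.release()
    return rgb_frames, flow_pictures
```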
2. Feature extraction
The method uses a TSN (Temporal Segment Network) based on the two-stream principle to extract RGB features and optical flow features. The TSN is first trained on an action recognition task to obtain a pre-trained model. The classification layer of the original TSN is then removed, the pre-trained model is loaded, and the corresponding RGB features and optical flow features are extracted from the output of the global pooling layer. For the features related to object detection, a Faster R-CNN object detector is trained with the object annotations of the data set; the detector output discards the bounding-box coordinates and keeps only the object class information, because the algorithm is concerned only with which objects the person in the video is interacting with, i.e., it models only information useful for predicting the action class.
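The feature-extraction operation can be sketched as follows. This is a minimal illustration under stated assumptions: a torchvision ResNet-50 stands in for the TSN's pre-trained BN-Inception streams, the classification layer is removed, and the globally average-pooled output is taken as the frame-level feature, which is the operation the description refers to as extracting features from the global pooling layer. The real method pre-trains the TSN on the action recognition task and uses separate RGB and optical flow streams (a flow stream would need its first convolution adapted to stacked flow channels).

```python
# Hedged sketch: drop the classification layer and read features from the global pooling output.
# A torchvision ResNet-50 stands in for the TSN's pre-trained BN-Inception stream.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # remove the classification layer
backbone.eval()

def extract_stream_features(pictures):
    """pictures: (T, 3, H, W) tensor of RGB pictures (or suitably stacked flow pictures).
    Returns (T, 2048) features taken right after global average pooling."""
    with torch.no_grad():
        return backbone(pictures)

rgb_features = extract_stream_features(torch.randn(14, 3, 224, 224))
```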
3. Motion prediction
The network model structure of the action prediction algorithm designed by the method is shown in Fig. 1. Its basic framework consists of an encoder, composed of the position encoding module and the self-attention module, together with a two-layer independent LSTM network. The encoder is responsible for further encoding the extracted feature sequence and extracting its context information to obtain richer semantics. The LSTM network performs the actual prediction: it loads the video frames observed in the past and produces action-class distributions for different anticipation times. The input-output relationship of the LSTM is shown in Fig. 2: for each video segment, 14 frames are sampled at intervals of 0.25 s from the portion preceding the start of the action. The three kinds of features enter three sub-networks, each formed by an encoder and an LSTM network, and are trained separately. Finally, a fusion network with an attention mechanism, consisting of three fully connected layers, assigns a weight to each of the three sub-networks; each weight is multiplied by the corresponding action-class distribution tensor to obtain the final output of the whole model.
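The late-fusion stage just described can be illustrated with the hedged sketch below: three stacked fully connected layers map the concatenated branch states to one weight per modality, the weights are normalised, and each weight scales the corresponding action-class distribution tensor before summation. The layer widths, the softmax normalisation and the exact input to the fusion network are assumptions made for illustration; the description states only that the fusion network consists of three fully connected layers and assigns weights that multiply the class-distribution tensors.

```python
# Hedged sketch of the attention fusion network: three stacked fully connected layers
# produce one weight per modality; the weights then scale each branch's
# action-class distribution tensor before summation.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, state_dim: int, num_modalities: int = 3, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(              # three fully connected layers
            nn.Linear(num_modalities * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_modalities),
        )

    def forward(self, states, class_distributions):
        # states: (batch, num_modalities * state_dim) concatenated hidden/cell states
        # class_distributions: list of (batch, num_classes) tensors, one per modality
        weights = torch.softmax(self.mlp(states), dim=-1)          # (batch, 3)
        stacked = torch.stack(class_distributions, dim=1)          # (batch, 3, num_classes)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)        # fused prediction

# Illustrative usage with assumed sizes.
fusion = AttentionFusion(state_dim=1024)
fused = fusion(torch.randn(2, 3 * 1024),
               [torch.randn(2, 100) for _ in range(3)])
```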
4. Training strategy and associated parameters
For the two-stream TSN training, the number of segments is set to 3, and 160 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function. The initial learning rate is set to 0.001 and is divided by 10 after the 80th epoch. The experimental environment is a single GeForce 1060 graphics card. The Faster R-CNN object detector was trained on the EPIC-KITCHENS data set; because the EGTEA Gaze+ data set lacks bounding-box annotations, no object-detection features are added for that data set, and the model on that data set considers only RGB features and optical flow features. The prediction network has three sub-networks; each sub-network is likewise trained with stochastic gradient descent using the standard cross-entropy loss function, with the learning rate fixed at 0.005, the momentum set to 0.9, and 100 epochs of training.
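The prediction-network training configuration stated above (stochastic gradient descent, standard cross-entropy loss, fixed learning rate 0.005, momentum 0.9, 100 epochs) can be written down as the following hedged sketch. The `branch` model and `train_loader` are assumed placeholders (see the earlier sketches); only the optimizer and loss settings come from the description.

```python
# Hedged sketch of the prediction-network training loop: SGD, cross-entropy loss,
# fixed learning rate 0.005, momentum 0.9, 100 epochs (values from the description).
# `branch` and `train_loader` are assumed to exist elsewhere.
import torch
import torch.nn as nn

def train_branch(branch, train_loader, epochs: int = 100):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(branch.parameters(), lr=0.005, momentum=0.9)
    branch.train()
    for epoch in range(epochs):
        for feats, labels in train_loader:       # feats: (B, T, D), labels: (B,)
            optimizer.zero_grad()
            scores = branch(feats)[:, -1]        # prediction at the last observed step
            loss = criterion(scores, labels)
            loss.backward()
            optimizer.step()
```

For the TSN pre-training schedule stated above (initial rate 0.001, divided by 10 after the 80th of 160 epochs), a step scheduler such as `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)` would be one way to express it; this, too, is an illustration rather than the patented implementation.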
5. Results and analysis of the experiments
Tables 1 and 2 show the results of the method of the invention and other prediction algorithms on the EPIC-KITCHENS and EGTEA Gaze+ data sets. The evaluation metric is Top-5 accuracy. On the EGTEA Gaze+ data set, the method of the present invention outperforms the compared methods at all anticipation times. On the EPIC-KITCHENS data set, it exceeds the other compared algorithms at all anticipation times except 0.5 s and 0.25 s, where it is slightly below the RU algorithm. To further verify the effectiveness of the self-attention mechanism, Table 3 compares, on the three individual modalities, the prediction results of the full model (B) and of the model with the encoder removed (A). The results show that the attention-based encoder proposed by the method effectively improves the performance of the model: it not only overcomes the weakness of other algorithms at long anticipation times and increases the robustness of the model, but also improves the accuracy.
Table 1: the method and other prediction algorithms of the invention predict the result on the EPIC-KITCHENS data set
TABLE I
Action anticipation results on the EPIC-KITCHENS dataset
Figure BDA0002605824840000061
Table 2: the method of the present invention and other prediction algorithms in EGTEA Gaze+Predicted results on a dataset
TABLE II
Action anticipation results on the EGTEA Gaze+dataset
Figure BDA0002605824840000062
Table 3: the method compares the prediction results of the full model (B) and the model (A) without the coder on three sub-features in the model
TABLE III
Comparison of experimental results with and without Encoder on a single modality
Figure BDA0002605824840000071
In Fig. 1 of this embodiment, Linear layer denotes a linear layer, Flow feature denotes the optical flow feature, RGB feature denotes the RGB feature, Obj feature denotes the feature related to object detection, Multiplication denotes multiplication, Anticipation output distribution denotes the predicted output distribution, BN-Inception denotes the BN-Inception network structure, Faster-RCNN denotes the Faster R-CNN object detector, Position encoding denotes the position encoding module, Sum denotes summation, Concatenation of hidden and cell states denotes the concatenation of the hidden and cell states, Self-attention denotes the self-attention module, Rolling LSTM unit denotes a rolling LSTM network unit, Unrolling LSTM unit denotes an unrolling LSTM network unit, Multi-modal LSTM denotes the multi-modal LSTM network model, and Attention fusion denotes the fusion network model.
In Fig. 2 of this embodiment, Observation time denotes the observation time, Anticipation time denotes the anticipation time, Time interval denotes the time interval, Anticipation output denotes the prediction output, Observed segment denotes the observed portion, Action occurring denotes the occurrence of the action, and Action starting time denotes the action start time.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A multi-modal LSTM video motion prediction method based on a self-attention mechanism, the method comprising the following steps:
Step 1: preparing a training data set and preprocessing the original video to obtain RGB pictures and optical flow pictures;
Step 2: extracting RGB features and optical flow features from the RGB pictures and optical flow pictures through a TSN (Temporal Segment Network), and obtaining features related to object detection through a Faster R-CNN object detector trained on the training data set;
Step 3: establishing a multi-modal LSTM network model based on the self-attention mechanism, inputting the RGB features, optical flow features and object-detection features obtained in Step 2 into the network model for training, and outputting the corresponding action class distribution tensors;
Step 4: establishing a fusion network to assign weights to the action class distribution tensors and combine them, obtaining the final video motion prediction result.
2. The method of claim 1, wherein step 1 comprises the following sub-steps:
step 101: selecting the data set used for training, from which the features related to object detection are obtained;
step 102: decomposing the original video at a set frame rate to extract RGB pictures;
step 103: extracting optical flow pictures from the original video using the TVL1 algorithm.
3. The method of claim 2, wherein the data sets in step 101 are the EPIC-KITCHENS data set and the EGTEA Gaze+ data set.
4. The method of claim 2, wherein the frame rate is set to 30fps in step 102.
5. The method of claim 1, wherein step 2 comprises the following sub-steps:
step 201: training the original TSN in advance to obtain a pre-trained TSN model;
step 202: removing the classification layer of the original TSN and loading the pre-trained TSN model to obtain a TSN based on the two-stream principle;
step 203: inputting the RGB pictures and optical flow pictures into the TSN based on the two-stream principle, and extracting the corresponding RGB features and optical flow features from the output of the global pooling layer of the network;
step 204: training the Faster R-CNN object detector with the object annotations of the data set to obtain the features related to object detection.
6. The method as claimed in claim 5, wherein the initial learning rate of the training process for the TSN based on the two-stream principle in step 202 is set to 0.001, 160 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, and after the 80th epoch the learning rate is divided by 10.
7. The method of claim 5, wherein the data set in step 204 is the EGTEA Gaze+ data set.
8. The method of claim 1, wherein step 3 comprises the following sub-steps:
step 301: establishing a multi-modal LSTM network model based on the self-attention mechanism, the model comprising an encoder and a multi-layer independent LSTM network, wherein the encoder consists of a position encoding module and a self-attention module:
the position encoding module is used to encode the absolute and relative positions of the frames in the video, yielding a position-encoded feature sequence;
the self-attention module is used to further mine the semantics in the position-encoded feature sequence, yielding a global description of the video;
step 302: inputting the RGB features, optical flow features and object-detection features obtained in step 2 into the multi-modal LSTM network model based on the self-attention mechanism for training, and outputting the corresponding action class distribution tensors.
9. The method of claim 8, wherein in step 302 the learning rate of the training process in which the RGB features, optical flow features and object-detection features obtained in step 2 are input into the multi-modal LSTM network model based on the self-attention mechanism is set to 0.005, 100 epochs are trained with stochastic gradient descent using the standard cross-entropy loss function, and the momentum is set to 0.9.
10. The method of claim 8, wherein the LSTM network has 2 layers.
CN202010738071.7A 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism Active CN111914731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010738071.7A CN111914731B (en) 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010738071.7A CN111914731B (en) 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism

Publications (2)

Publication Number Publication Date
CN111914731A true CN111914731A (en) 2020-11-10
CN111914731B CN111914731B (en) 2024-01-23

Family

ID=73286387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010738071.7A Active CN111914731B (en) 2020-07-28 2020-07-28 Multi-mode LSTM video motion prediction method based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111914731B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN113343564A (en) * 2021-05-28 2021-09-03 国网江苏省电力有限公司南通供电分公司 Transformer top layer oil temperature prediction method based on multi-element empirical mode decomposition
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, equipment and storage medium
CN114758285A (en) * 2022-06-14 2022-07-15 山东省人工智能研究院 Video interaction action detection method based on anchor freedom and long-term attention perception

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170262996A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Action localization in sequential data with attention proposals from a recurrent network
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN109063568A (en) * 2018-07-04 2018-12-21 复旦大学 A method of the figure skating video auto-scoring based on deep learning
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109740419A (en) * 2018-11-22 2019-05-10 东南大学 A kind of video behavior recognition methods based on Attention-LSTM network
CN109815903A (en) * 2019-01-24 2019-05-28 同济大学 A kind of video feeling classification method based on adaptive converged network
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170262996A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Action localization in sequential data with attention proposals from a recurrent network
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN109063568A (en) * 2018-07-04 2018-12-21 复旦大学 A method of the figure skating video auto-scoring based on deep learning
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109740419A (en) * 2018-11-22 2019-05-10 东南大学 A kind of video behavior recognition methods based on Attention-LSTM network
CN109815903A (en) * 2019-01-24 2019-05-28 同济大学 A kind of video feeling classification method based on adaptive converged network
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
潘陈听; 谭晓阳: "Video action recognition based on deep learning under complex backgrounds", 计算机与现代化 (Computer and Modernization), no. 07 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN112434608B (en) * 2020-11-24 2023-02-28 山东大学 Human behavior identification method and system based on double-current combined network
CN113343564A (en) * 2021-05-28 2021-09-03 国网江苏省电力有限公司南通供电分公司 Transformer top layer oil temperature prediction method based on multi-element empirical mode decomposition
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, equipment and storage medium
CN114758285A (en) * 2022-06-14 2022-07-15 山东省人工智能研究院 Video interaction action detection method based on anchor freedom and long-term attention perception

Also Published As

Publication number Publication date
CN111914731B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN111914731A (en) Multi-mode LSTM video motion prediction method based on self-attention mechanism
US20180114071A1 (en) Method for analysing media content
Zhang et al. A multistage refinement network for salient object detection
Zhou et al. PGDENet: Progressive guided fusion and depth enhancement network for RGB-D indoor scene parsing
Lin et al. Multi-grained deep feature learning for robust pedestrian detection
Wang et al. Spatial–temporal pooling for action recognition in videos
CN112801068B (en) Video multi-target tracking and segmenting system and method
Shi et al. Shuffle-invariant network for action recognition in videos
CN111523378A (en) Human behavior prediction method based on deep learning
CN112990122A (en) Complex behavior identification method based on video basic unit analysis
Suratkar et al. Employing transfer-learning based CNN architectures to enhance the generalizability of deepfake detection
de Oliveira Silva et al. Human action recognition based on a two-stream convolutional network classifier
Lin et al. Joint learning of local and global context for temporal action proposal generation
Shi et al. A divided spatial and temporal context network for remote sensing change detection
Farrajota et al. Human action recognition in videos with articulated pose information by deep networks
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
Hussain et al. AI-driven behavior biometrics framework for robust human activity recognition in surveillance systems
Shaikh et al. Real-Time Multi-Object Detection Using Enhanced Yolov5-7S on Multi-GPU for High-Resolution Video
Qin et al. Application of video scene semantic recognition technology in smart video
Wang et al. Deep neural networks in video human action recognition: A review
Huang et al. Video frame prediction with dual-stream deep network emphasizing motions and content details
CN117893957A (en) System and method for flow counting
CN112131429A (en) Video classification method and system based on depth prediction coding network
Zebhi et al. Converting video classification problem to image classification with global descriptors and pre‐trained network
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant