CN110188239B

CN110188239B - Double-current video classification method and device based on cross-mode attention mechanism

Info

Publication number: CN110188239B
Application number: CN201910294018.XA
Authority: CN
Inventors: 迟禄; 严慧; 田贵宇; 穆亚东; 陈刚; 王成成; 黄波; 韩峻; 糜俊青
Original assignee: Zhongxing Technology Co ltd; Peking University; Nanjing University of Science and Technology
Current assignee: Zhongxing Technology Co ltd; Peking University; Nanjing University of Science and Technology
Priority date: 2018-12-26
Filing date: 2019-04-12
Publication date: 2021-06-22
Anticipated expiration: 2039-04-12
Also published as: CN110188239A

Abstract

The invention relates to a double-current video classification method and device based on a cross-mode attention mechanism. Different from the traditional double-flow method, the information of two modes (even more modes) is fused before the result is predicted, so that the method is more efficient and sufficient, meanwhile, because the information interaction is carried out at an early stage, a single branch already has important information of another branch at a later stage, the precision of the single branch is equal to or even more than that of the traditional double-flow method, and the parameter quantity of the single branch is much less than that of the traditional double-flow method; compared with a non-local neural network, the attention module designed by the invention can cross the modes, and not only uses an attention mechanism in a single mode, but also has the effect equivalent to that of the non-local neural network under the condition that the two modes are the same.

Description

Double-current video classification method and device based on cross-mode attention mechanism

Technical Field

The invention relates to a video classification method, in particular to a double-current video classification method and device using an attention mechanism, and belongs to the field of computer vision.

Technical Field

With the rapid development of deep learning in the image field, a deep learning method is gradually introduced in the video field and certain achievement is achieved. However, the current technical level still has not achieved ideal effect, and the problems are mainly the following two aspects:

first, current techniques have not fully utilized dynamic information. Video differs from images in that the dynamic information from frame to frame is unique and important to video. For example, even for human beings, it is difficult to judge various fine-classified dances (such as tango and salsa dance) by only viewing one frame of image, and if motion trajectory information is added, the task becomes much easier. Likewise, the classification of some sports is also dependent on the motion trajectory.

Second, current techniques also have difficulty locating critical objects quickly and accurately. Attention mechanism has been widely used in natural language processing, but research in video classification is relatively lacking. With the attention mechanism, the neural network can filter out extraneous objects and focus more on critical objects. Such as "swords," classification is simplified if the key object "sword" is detected. In general, the moving object can attract the eyes of human beings, and the area often contains key information of video classification, such as "cake making" and "pizza making", and the key object "cake" or "pizza" is located near the moving hands.

There are many prior art attempts to solve both of the above problems. With respect to how to utilize dynamic information, there are two main types of current technologies: one is to design a neural network structure related to a time dimension, such as a Recurrent Neural Network (RNN), a three-dimensional convolutional neural network (3D-Conv), etc., and train a network structure capable of capturing information between frames in a data-driven manner; the other method is to explicitly use dynamic information, namely extracting optical flows, then using the optical flows to train a neural network branch separately, and performing weighted summation with the result of the RGB branch, which means that a relatively extensive dual-flow video classification technology is currently used. However, research on how to capture key clues, namely, introducing attention mechanism into video classification is relatively few, and a representative method is Non-local Neural Networks (Non-local Neural Networks), but the network only focuses on important information in a single modality, and there is no special modeling way for a "moving object".

Disclosure of Invention

The invention mainly provides a novel double-current video classification method based on a cross-modal attention mechanism, which can efficiently utilize multi-modal information to classify videos and pay attention to moving objects, so that the video classification is simpler and more efficient. The technology provided by the invention has universality and can be widely applied to the existing video classification problem and other multi-modal models.

The technical problems to be solved by the invention specifically comprise: 1. fully utilizing multi-mode information to classify videos; 2. key objects are concerned more, so that the video classification is more accurate; 3. higher accuracy is achieved using fewer parameters.

Different from the traditional double-flow method, the information of two modals (even more modals, such as extracted sound, an intermediate characteristic diagram extracted by using an object detection model and the like) is fused before the result is predicted, so that the method is more efficient and sufficient, meanwhile, because the information interaction is carried out at an early stage, a single branch already has important information of another branch at a later stage, the precision of the single branch is equal to or even more than that of the traditional double-flow method, and the parameter quantity of the single branch is much less than that of the traditional double-flow method; compared with a non-local neural network, the attention module designed by the invention can cross the modes, and not only uses an attention mechanism in a single mode, but also has the effect equivalent to that of the non-local neural network under the condition that the two modes are the same.

The invention discloses a double-current video classification method based on a cross-mode attention mechanism, which comprises the following steps of:

1) establishing a neural network structure of RGB branches and optical flow branches, wherein the neural network structure comprises a cross-modal attention module;

2) obtaining RGB and optical flow according to the video to be classified, and respectively inputting the RGB and optical flow into the neural network structure of the RGB branch and the optical flow branch;

3) for input RGB and optical flow, the neural network structures of the RGB branch and the optical flow branch carry out information interaction through a cross-modal attention module, and cross-modal information fusion is realized;

4) and carrying out video classification according to the result of information fusion obtained by the neural network structure of the RGB branch and the optical flow branch.

The cross-mode attention mechanism (cross-mode attention module) designed by the method mainly comprises three parts: key Value (Key), Query (Query), and Value (Value). The key value refers to indexes of all information, the query refers to index of query information, and the numerical value refers to all information. The cross-modal attention mechanism can be described as: generating a query from a current modality, generating a key value-numerical matching pair from another modality, and acquiring important information from the other modality according to the similarity degree of the query and the key value. Thus, the cross-modality attention mechanism is actually a process of selectively acquiring information from another modality according to the current modality, the acquired information is often weak or even missing in the current modality but is very important to the final result.

FIG. 1 is an example implementation of a cross-modal attention mechanism. X and Y represent inputs from the RGB branch and the optical flow branch, respectively. Q (query), K (dictionary) and V (value) are generated by convolution of X or Y by 1X1, the shape and size of the X or Y are already marked in the figure, and matrix transposition, deformation and other operations are carried out before matrix multiplication so that matrix multiplication can be carried out. Multiplying Q and K to obtain M which represents the attention weight distribution of each pixel on the whole feature map, multiplying M and V after obtaining the distribution M, namely selectively obtaining information Z from V, carrying out nonlinear transformation (for example, adopting an activation function such as ReLU) after obtaining Z, and carrying out residual error connection on the transformed result and the original input to obtain a final Output result (Output).

By this way of operation, at the current network stage, the RGB branches can traverse all the positions of the optical flow branches. Therefore, the RGB branch can integrate all the information of the optical flow branch and then selectively choose important information to fuse with the current information, rather than just performing weighted summation at the final stage (common dual-flow method). Meanwhile, as the input and output shapes of the operation are completely consistent and can process the input with any shape, the operation has good compatibility and can be inserted into almost any stage of all networks, thereby fully utilizing multi-scale information. In order to further improve the compatibility, residual error connection is added to the model, namely the residual error connection is directly added to the input before the operation on the result of the operation, so that the model added with the cross-modal attention module can be guaranteed not to have lower precision than the original model theoretically.

FIG. 2 is a network architecture designed by the present invention based on a dual-flow model, incorporating a cross-modal attention module. The two branches of the model are an RGB branch and a Flow branch respectively and are responsible for processing the surface characteristics and the dynamic information of the image respectively. The specific process is as follows:

step 1: initial network parameters. The parameters of the network are initialized with the pre-trained model on the ImageNet dataset and then trained on the kinetics dataset until convergence.

Step 2: and (6) data processing. The input to the network requires both RGB and optical flow inputs, for RGB, frames are directly cut from the original video and then scaled to the specified resolution (224x 224); for optical Flow, RGB images of two adjacent frames are extracted through a TVL1 optical Flow algorithm of a GPU version in OpenCV, a plurality of continuous frames (such as five continuous frames) of optical Flow are stacked together to be used as input of a Flow branch, and the resolution of the Flow branch is consistent with that of RGB;

step 3: after the data is obtained, the RGB and optical flows are input into two branches, respectively, which pass through a cross-modality attention module (i.e., the configuration shown in FIG. 1, with CMA in FIG. 2) during operation₁～CMA_nTo represent) to interact information, thereby achieving the effect of fully utilizing multi-modal information at multiple levels.

Step 4: the two branches eventually yield two results, which can also be weighted and summed as in the normal dual stream approach. Since the model can perform information fusion at an earlier stage, the accuracy of simultaneous prediction of two branches in a common dual-stream model can be achieved or even exceeded by performing video classification only by using the result of RGB branch, and at the moment, the optical flow branch does not need to perform subsequent operation (indicated by a dotted line in FIG. 2), a large number of parameters are saved, so that the model is very efficient. In addition, if the prediction precision is further improved by simultaneously adopting the two branch results of the model.

Based on the same inventive concept, the invention also provides a double-current video classification device based on a cross-mode attention mechanism corresponding to the method, which comprises the following steps:

the network construction module is responsible for establishing a neural network structure of the RGB branch and the optical flow branch, and comprises a cross-modal attention module;

the data processing module is responsible for obtaining RGB and optical flow according to the video to be classified and inputting the RGB and optical flow into the neural network structures of the RGB branch and the optical flow branch respectively;

the information fusion module is responsible for carrying out information interaction on input RGB and optical flow, and the neural network structures of the RGB branches and the optical flow branches through the cross-modal attention module to realize cross-modal information fusion;

and the video classification module is in charge of performing video classification according to the result of information fusion obtained by the neural network structure of the RGB branch and the optical flow branch.

Compared with the prior art, the invention has the advantages that:

(1) multi-mode information interaction is added, and multi-mode information can be fully utilized in multiple levels;

(2) the information of the two modes can be selectively selected by a cross-mode attention mechanism, so that the complementary information can be efficiently utilized, and the key object can be captured more accurately;

(3) the precision of the traditional double-flow method can be achieved or even exceeded by using less parameter quantity, and meanwhile, the results of two branches can be synthesized for prediction, so that the classification precision is further improved;

(4) the cross-modal attention module designed by the invention has good compatibility, is not in conflict with most of the prior art, can be almost inserted into any existing network architecture, and can stably improve the video classification precision.

Drawings

FIG. 1 is an exemplary diagram of a cross-modal attention module;

fig. 2 is a diagram of a video classification network according to the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

1. Configuration of cross-modal attention module

The cross-modal attention module can process input of any dimension, and can ensure that the input and the output are consistent in shape, so that the cross-modal attention module has excellent compatibility. For example, the Q, K, V is obtained by 2-dimensional convolution operation of 1x1 (for 3-dimensional model, the convolution is 3-dimensional convolution operation of 1x1x 1), which performs dimensionality reduction operation in channel dimension while obtaining Q, K, V to reduce computational complexity and save GPU space. To further simplify the operation, the convolution operation may be preceded by a max-Pooling (max-Pooling) operation, which reduces the spatial dimension to 1/4. After obtaining Z, it needs to increase the dimension to be consistent with the input dimension through a convolution operation, and then it needs to initialize the parameters of BN to be zero through Batch Normalization, so that the initial state of the module has no influence on the operation result of the previous network.

2. Configuration of a network

Both branches are based on the ResNet-50 network. In order to utilize multi-scale and more accurate spatial information as much as possible while saving GPU video memory, 5 (or other numbers of) cross-mode attention modules are uniformly inserted in two stages of res3 and res4, and the configuration is also consistent with an nonlocal neural network. The RGB branch inputs only one frame of image, while the optical flow branch inputs a succession of five frames of optical flow images. The weights of the RGB branches are directly initialized by using the parameters trained on ImageNet, while the optical flow branches need to be changed properly because the input shapes of the optical flow branches are different from the input shapes of models trained on ImageNet, the convolution kernel parameters of the first layer of convolution network are averaged in the channel dimension, then the average value is copied five times to obtain convolution kernels of five channels, and the parameters of other layers can be directly copied, so that the parameters trained on ImageNet can be well migrated.

The network is based on a Temporal Segment Network (TSN) framework because the framework is capable of modeling long sequence relationships simply and efficiently. The whole video is divided into m segments on average, each segment randomly selects a frame as the input of the network, so that m results are obtained, and the final video prediction result is based on the average value of the m results.

3. Data processing

The original video data resolution is not completely uniform. Scaling them uniformly to 256x256 resolution size. The optical flow is extracted using the GPU version TVL1 algorithm in OpenCV, truncates its result at-20, and then scales between-1, 1. Data enhancement is also performed, such as random cropping, scaling, mirroring, etc., and it should be noted that the two branches are consistent for the same input data enhancement, e.g., if the upper left corner of the RGB image is cropped, the upper left corner of the optical flow is also cropped at the same position. In the temporal dimension, the RGB image corresponds to the first frame of the five consecutive frames of optical flow.

4. Network training

Since the convergence speed of the optical flow branches is generally slower than that of the RGB branches, the optical flow branches are first trained on a kinetics data set, which also helps the optical flow branches to provide more accurate information for the RGB branches. After that, iterative training is started, i.e. the RGB branch is optimized alternately with the optical flow branch. In the process of training the RGB branches, all parameters of the optical flow branches are frozen, including a cross-modal attention module in the optical flow branches, only the parameters of the RGB branches are updated, and vice versa when the optical flow branches are trained. The number of training sessions per iteration does not exceed 30 epochs. Practice has found that the precision of the just trained branch is often higher than that of the other branch, so that for the weighting of the two branch results, the higher precision branch is given a higher weight (5:1), which results in higher precision. Typically, one iteration can achieve very high accuracy.

In the training process, a standard cross entropy loss function and a random gradient descent optimization method are adopted. The batch size is 128, BN parameters are updated simultaneously in the training process, and synchronous BN is adopted in order to obtain more accurate BN statistic. The learning rate is initialized to 0.01, and when the training accuracy reaches a stable stage, the learning rate is reduced to one tenth of the current learning rate. To prevent overfitting, dropout of 0.7 and weight decay of 0.0005 are added and K is set to 3 during training.

In the testing process, the four corners and the center position of the image are cut out and inverted, so that 10 samples are obtained, 10 results are obtained when the 10 samples are input into the network, and the 10 results are averaged to obtain the final video classification result. K in TSN is set to 25.

5. Transfer learning

The network structure provided by the invention is trained on the basis of kinetics, wherein the kinetics has a class 400, other data sets, such as ucf101, have only a class 101, and the class 101 and the kinetics 400 have overlapping parts and non-overlapping parts. In order to migrate the model to a new video classification, only the last full link layer needs to be finely adjusted on a new data set, so that a good effect can be achieved. Similar methods can be adopted for other data sets, and the model has better migration capability.

ResNet50 is a model without a cross-modal attention module added, and CMA-ResNet50 is a model with the module added on the basis of the cross-modal attention module added. -R denotes the RGB branch, -S denotes the fusion of the RGB branch with the optical flow branch. Table 1 below is an experiment with ResNet50 as the backbone network:

TABLE 1 Experimental results with ResNet50 as backbone network

Model (model)	Accuracy (%)
		ResNet50-R	67.73
ResNet50-S	71.21
		CMA-ResNet50-R	72.17
CMA-ResNet50-S	72.62

P3D is a 3-dimensional convolution neural network model, and CMA-P3D is a model added with a 3-dimensional cross-modal attention module on the basis of the model. Table 2 below is an experiment with P3D as the backbone network:

TABLE 2 Experimental results with P3D as the backbone network

Model (model)	Accuracy (%)
		P3D-R	71.50
P3D-S	74.62
		CMA-P3D-R	74.86
CMA-P3D-S	75.98

From the above two tables, it can be seen that the accuracy of the two-dimensional or three-dimensional cross-modal attention module can be stably improved, and the accuracy of the RGB branch ratio added with the module is higher than that of the dual-flow model (ResNet50-S/P3D-S) in the comparison experiment, so that the accuracy of the dual-flow model added with the module is further improved.

Another embodiment of the present invention provides a dual-stream video classification apparatus based on a cross-mode attention mechanism, which includes:

The invention is not limited to ResNet-50 neural networks, and can be applied to various neural networks (such as VGG, DenseNet, SENET and the like) and also can be applied to 3D neural networks (such as I3D, P3D and the like). Meanwhile, the operation across the interior of the modal attention module is not limited to the implementation described above, for example, in the process of generating the key value/query/value, a more complex operation may be used instead of the 1 × 1 convolution operation (for example, a multilayer convolution operation is superimposed), and after obtaining Z, a more complex operation may also be performed (a more multilayer convolution operation may also be used). For the merging mode with the main branch, the residual error connection is adopted, and other modes, such as splicing with the characteristics of the main branch, and the like, can also be adopted.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A double-current video classification method based on a cross-modal attention mechanism is characterized by comprising the following steps:

3) for input RGB and optical flow, the neural network structures of the RGB branch and the optical flow branch carry out information interaction through a cross-modal attention module, and cross-modal information fusion is realized; the cross-modal attention module comprises a key value, a query and a numerical value, generates the query from the current modality, generates a key value-numerical value matching pair from the other modality, and acquires important information from the other modality according to the similarity degree of the query and the key value;

4) performing video classification according to the result of information fusion obtained by the neural network structure of the RGB branch and the optical flow branch;

the neural network structure of the RGB branch and the optical flow branch takes ResNet-50 as a basic network, and a plurality of cross-mode attention modules are uniformly inserted in two stages of res3 and res 4; a time sequence segmentation network framework is adopted to averagely divide the whole video into m segments, each segment randomly selects a frame as the input of the network, so that m results are obtained, and the final video prediction result is based on the average value of the m results.

2. The method of claim 1, wherein the cross-modality attention module is: x and Y represent inputs from the RGB and optical flow branches, respectively, the query Q, the dictionary K, and the value V being X or Y generated by a 1X1 convolution; multiplying Q and K to obtain M, and representing the attention weight distribution of each pixel on the whole feature map; multiplying M and V, namely selectively acquiring information Z from V, performing nonlinear transformation after Z is acquired, and performing residual error connection on the transformed result and the original input to acquire a final result.

3. The method of claim 2, wherein the cross-modality attention module performs dimensionality reduction operations in channel dimensions while obtaining Q, K, V through convolution operations to reduce computational complexity and save GPU space; the operation is simplified by a maximization pooling operation before the convolution operation, the dimensionality is increased to be consistent with the input dimensionality through the convolution operation after Z is obtained, and parameters of the BN are initialized to be zero through the BN, wherein the BN is batch standardization.

4. The method of claim 1, wherein the deriving RGB and optical flow from the video to be classified comprises:

a) for RGB, frames are directly intercepted from an original video to be classified, and then the frames are scaled to a specified resolution and used as input of a neural network structure of RGB branches;

b) for optical flow, RGB images of two adjacent frames are extracted through an optical flow algorithm, a plurality of continuous frames of optical flow are stacked together to be used as input of a neural network structure of optical flow branches, and the resolution of the optical flow is consistent with that of RGB.

5. The method of claim 1, wherein step 4) performs video classification by one of the following methods:

a) only the result of RGB branching is adopted for video classification;

b) the video classification is performed by weighted summation of the two results from the two branches.

6. The method of claim 1, wherein the training process of the neural network structure of the RGB and optical flow branches comprises: firstly, training an optical flow branch, and then starting iterative training, namely alternately optimizing the RGB branch and the optical flow branch; freezing all parameters of the optical flow branches in the process of training the RGB branches, wherein the parameters comprise a cross-modal attention module in the optical flow branches, only updating the parameters of the RGB branches, and vice versa when the optical flow branches are trained; for the weighting weights of the two branch results, giving higher weight to the branch with higher precision; and a standard cross entropy loss function and a random gradient descent optimization method are adopted in the training process.

7. The method of claim 1, wherein the neural network structure of the RGB and optical flow branches is migrated to a new data set by fine-tuning the last fully connected layer, thereby implementing a migratory learning.

8. A dual-stream video classification device based on a cross-mode attention mechanism and adopting the method of any one of claims 1 to 7, comprising: