CN114596520A - First visual angle video action identification method and device - Google Patents

First visual angle video action identification method and device

Info

Publication number
CN114596520A
Authority
CN
China
Prior art keywords
feature
module
video
mode
mciam
Prior art date
Legal status
Pending
Application number
CN202210120923.5A
Other languages
Chinese (zh)
Inventor
聂梦真
姜金印
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210120923.5A
Publication of CN114596520A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The invention provides a first-person view video action recognition method and device that construct a multi-stream network model for first-person action recognition; the model comprises a convolutional neural network (CNN), a Transformer network, and related components. The model takes an RGB modality and a depth modality as input and performs action classification in three stages. Dual-scale features of each video frame are extracted by a convolutional neural network pre-trained on ImageNet; different intra-frame patch segmentation schemes are applied according to the characteristics of the different modalities and the different feature-map scales, and a correlation calculation mechanism is combined to enhance the spatial representation and enrich the spatial semantic information; a multi-scale cross-modal fusion module generates cross-modal representations through interaction and strengthens the cross-correlation between the modalities; temporal information between video frames is extracted with an attention mechanism; finally, the bimodal data enhanced by spatial interaction are fused. The bimodal spatio-temporal information is thus exploited and fused effectively, and good action recognition performance can be achieved.

Description

First visual angle video action identification method and device
Technical Field
The invention belongs to the fields of deep learning and computer vision, relates to feature extraction and action recognition technology, and particularly relates to a first-person view video action recognition method and device.
Background
Action recognition based on video data is an important research direction in computer vision. Given an unprocessed or pre-segmented video clip, the task is to classify the video according to the human actions that appear in it. It has broad application value in video security monitoring, daily behavior recognition, behavior interaction, and related fields.
In recent years, much research in computer vision has relied on deep learning. Convolutional neural networks (CNNs) have strong feature extraction capability and can extract feature information from feature maps effectively. 2D convolutional neural networks are mature on image classification tasks; when transferred to video action classification, they act directly on each frame, extract deep features through multi-layer convolutions, fuse the per-frame features in some manner, and feed them through a feed-forward network into a fully connected (FC) layer to complete classification. This approach extracts the appearance information of video frames well, and the final classification result is determined by the appearance of each frame. However, because it discards the temporal relation between frames, it performs adequately only on tasks where the appearance, rather than the motion, is discriminative, and it generalizes poorly to data for which the motion between frames must be distinguished.
Image classification in computer vision is based on two-dimensional data and emphasizes the spatial information within a frame. Video action recognition must not only process the spatial semantics of each frame but also introduce a time dimension T to represent the temporal order of video frames; the task is therefore more complex and is currently a research focus in computer vision. Compared with image classification, video classification adds temporal information. A common 3D convolutional network (for example, the combination of a two-dimensional spatial convolution and a one-dimensional temporal convolution) processes the intra-frame spatial information of a video segment and the temporal information between adjacent frames simultaneously, taking motion into account while processing the appearance of each frame, and therefore achieves higher accuracy than 2D convolutional methods. The Inflated 3D ConvNet (I3D) method, as an optimization of 3D convolutional networks, expands trained 2D convolution kernels to 3D; it can benefit from models pre-trained on the large ImageNet data set and further improves accuracy, but incurs a larger computational cost.
Spatio-temporal two-stream methods are one of the mainstream research directions in action recognition. They focus on the processing of spatial and temporal information: the appearance of each frame carries spatial information such as objects and scenes, while the motion across frames carries temporal information such as the movement of people and objects in the video and the movement of the camera. A spatio-temporal network usually takes RGB frames as one input to process the spatial stream and a stack of optical-flow frames as another input to process the temporal stream, trains the two streams jointly, and finally fuses the extracted deep features complementarily to complete classification. Although this scheme fuses spatio-temporal information well, it does not classify highly complex, long videos well and cannot model them adequately.
The long short-term memory (LSTM) method is commonly used for sequence modeling problems such as natural language processing and is well suited to sequential data. Considering the complex temporal relations between video frames, it has been transferred to video action classification, where it alleviates the vanishing- and exploding-gradient problems of long-sequence training. The LSTM controls the transmitted state through gating, models long-range temporal information, and achieves higher accuracy than a recurrent neural network (RNN) on longer sequences. However, when processing long video sequences, the LSTM has high computational complexity, is difficult to train, and models the temporal dimension inefficiently.
In recent years, the attention mechanism has become an important method in deep learning and has been studied intensively in many application fields. Attention in deep learning resembles human visual attention: it concentrates on the important parts of a large amount of information, selects the key information, and ignores the rest. The attention mechanism removes the limitation that RNN models cannot be parallelized, allows dependencies between input and output positions to be modeled regardless of their distance in the sequence, handles long-range dependencies well, and has been widely applied to third-person action recognition.
Deep-learning-based algorithms are widely applied to human activity recognition, and first-person vision tasks are attracting attention because of their many practical applications. However, owing to camera motion, the short sampling distance, and similar factors, first-person visual data exhibit relatively large inter-frame changes of the actor, the objects, and the background, so first-person video differs considerably from third-person video. At the same time, traditional video action recognition algorithms pay little attention to the correlation of spatial information within a frame and rarely consider the spatial distribution of features on the feature map; they therefore have difficulty with first-person video action recognition and generalize poorly to it.
Disclosure of Invention
The purpose of the invention is to overcome the shortcomings of existing deep learning methods and to provide a first-person view video action recognition method and device which, by combining the characteristics of first-person video data sets with a multi-scale bimodal spatio-temporal representation attention mechanism, obtain correlation weights of the video data and enhance the spatio-temporal semantic representation, while a cross-fusion module yields higher classification accuracy.
The technical solution for achieving this purpose is as follows:
The invention provides a first-person view video action recognition method. A first-person video action data set is input into a multi-scale network based on an RGB modality and a depth modality to extract spatial semantics. The network adopts a convolutional neural network CNN; two different convolutional blocks of the CNN are selected to output feature maps of two scales, where the first-scale feature maps contain a certain amount of spatial information and the second-scale feature maps contain rich high-level semantic information. The first-scale feature maps are processed by an MCIAM I module and the second-scale feature maps by an MCIAM II module to obtain feature embedding vectors with rich multi-scale bimodal spatial semantics. These feature embedding vectors are then processed by several Inter-frame Encoder modules to extract the inter-frame temporal relation, yielding three feature embedding vectors. The data of the RGB branch and the depth branch are fused by a CFAM module and combined with the feature embedding vector of the multi-scale fusion branch to generate a joint feature embedding vector. The joint feature embedding vector is processed by a linear layer to obtain a per-frame action classification result, the results of the video frames of an action segment are averaged along the temporal direction, and the final recognition result is output.
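To make the data flow concrete, the following PyTorch sketch traces tensor shapes through the three stages described above. It is only an illustration: the frame count, feature-map sizes, embedding dimension, and the stand-ins used for the MCIAM, Inter-frame Encoder, and CFAM modules (a shared linear layer, nn.TransformerEncoderLayer, and an element-wise product) are assumptions made for the sketch, not the patented implementation.

```python
import torch
import torch.nn as nn

T, D, num_classes = 8, 512, 45   # assumed frame count, embedding dim, FPHA classes

# Stage 1: per-frame two-scale CNN features for both modalities (stand-in tensors).
feat_rgb_s1 = torch.randn(T, 256, 14, 14)   # first-scale maps (more spatial detail)
feat_dep_s1 = torch.randn(T, 256, 14, 14)   # (they would feed MCIAM I; unused in this stand-in)
feat_rgb_s2 = torch.randn(T, 512, 7, 7)     # second-scale maps (high-level semantics)
feat_dep_s2 = torch.randn(T, 512, 7, 7)

# Stage 2: MCIAM I / MCIAM II would produce per-frame embeddings for three branches
# (RGB, depth, multi-scale cross-modal fusion); a shared linear layer stands in here.
embed = nn.Linear(512 * 7 * 7, D)
tok_rgb = embed(feat_rgb_s2.flatten(1))     # (T, D)
tok_dep = embed(feat_dep_s2.flatten(1))     # (T, D)
tok_fuse = tok_rgb + tok_dep                # stand-in for the fused branch F_Cross

# Stage 3: Inter-frame Encoders model the temporal relation of each branch;
# nn.TransformerEncoderLayer is used as a stand-in.
enc = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
F_R2, F_D2, F_Cross = (enc(t.unsqueeze(0)) for t in (tok_rgb, tok_dep, tok_fuse))

# Stage 4: CFAM-style fusion (stand-in: element-wise product of the two modality
# branches plus the fused branch), per-frame classification, then temporal averaging.
head = nn.Linear(D, num_classes)
joint = F_R2 * F_D2 + F_Cross
logits = head(joint).mean(dim=1)            # (1, num_classes)
print(logits.shape)
```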
Furthermore, the MCIAM I module first segments the first-scale feature maps by average segmentation or by edge-crossing segmentation, then maps the feature maps into embedding vectors through feature embedding and linear mapping and adds position information, and finally computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the RGB modality and the depth modality, thereby enhancing the spatial correlation between the modalities.
Furthermore, the MCIAM II module uses a horizontal-and-vertical segmentation scheme to compute the spatial correlation between the RGB modality and the depth modality, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities, completing the multi-scale bimodal spatial semantic enhancement.
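One plausible reading of the horizontal-and-vertical segmentation is to cut the feature map into equal horizontal and vertical strips and flatten each strip into a token; the sketch below shows this under assumed sizes (the strip count, the feature-map resolution, and the subsequent linear mapping and cross-modal attention are assumptions, not details given in the patent).

```python
import torch

def strip_split(feature_map: torch.Tensor, num_strips: int = 4):
    """Split a (C, H, W) feature map into horizontal and vertical strips and
    flatten each strip into a token (assumed interpretation of the
    horizontal-and-vertical segmentation used by MCIAM II)."""
    C, H, W = feature_map.shape
    horizontal = feature_map.reshape(C, num_strips, H // num_strips, W)
    horizontal = horizontal.permute(1, 0, 2, 3).reshape(num_strips, -1)   # (num_strips, C*(H/num_strips)*W)
    vertical = feature_map.reshape(C, H, num_strips, W // num_strips)
    vertical = vertical.permute(2, 0, 1, 3).reshape(num_strips, -1)       # (num_strips, C*H*(W/num_strips))
    return horizontal, vertical

h, v = strip_split(torch.randn(512, 8, 8), num_strips=4)
print(h.shape, v.shape)   # torch.Size([4, 8192]) torch.Size([4, 8192])
```

Each strip token would then be mapped by a linear layer to the embedding dimension before the cross-modal correlation is computed.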
Furthermore, the Inter-frame Encoder module uses a trainable linear projection to perform the linear mapping that generates the corresponding feature embedding vector, adds a position code to the feature embedding vector to encode the position of each sequence frame or each patch, then performs a multi-head attention operation to obtain an intermediate vector, and finally feeds the intermediate vector into a feed-forward network with residual connection and layer normalization.
The method evaluates the difference between the true action label of a video and the prediction of the current model through a cross-entropy loss function:

Loss = -Σ_{i=1}^{n} y_i · log(p_i)

where n is the number of action classes, i is the index of the true class of the video data currently processed by the network, y_i is the true label of the corresponding class, and p_i is the probability predicted by the model for that class.
Moreover, data preprocessing and data augmentation are performed before the first-person video action data set is input into the network.
The convolutional neural network CNN is a ResNet-34 model pre-trained on the ImageNet data set, with the multi-layer convolutional residual blocks of ResNet-34 as its basic building elements.
In another aspect, the present invention further provides a first-person view video action recognition apparatus, comprising:
a convolutional neural network CNN module, used to extract feature information, performing feature extraction for the RGB modality and the depth modality at the 2D feature-map level and outputting feature maps of two scales, (N1, N1) and (N2, N2);
a multi-modal cross-frame attention module MCIAM I, used to process the (N1, N1)-scale feature maps: it applies two different segmentation schemes, average segmentation and edge-crossing segmentation, maps the feature maps into embedding vectors through feature embedding and linear mapping, adds position information, computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the two modalities, and enhances the spatial correlation between the modalities;
a multi-modal cross-frame attention module MCIAM II, used to process the (N2, N2)-scale feature maps: it computes the spatial correlation between the modalities with a horizontal-and-vertical segmentation scheme, following the same calculation procedure as the MCIAM I module, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities to complete the multi-scale bimodal spatial semantic enhancement;
an Inter-frame Encoder, used to model the temporal process: it processes the action timing information with a self-attention mechanism, models the long-term relation of the motion, assigns reasonable weights to the inter-frame feature embeddings, suppresses the interference of irrelevant people and objects in the video, and allocates more attention resources to the focus area;
and a cross-fusion attention module CFAM, used to complete the joint representation of the RGB modality and the depth modality of the temporal network: it learns the structure shared between the different modalities through a mutual attention mechanism and is responsible for the final multi-scale spatio-temporal information fusion, generating the joint feature embedding vector.
Furthermore, the apparatus also comprises a preprocessing module used to preprocess and augment the first-person video action data by random cropping.
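As a minimal illustration of such a preprocessing module, the torchvision pipeline below performs resizing, random cropping, and normalization; the resolution and the normalization statistics are common defaults assumed for the sketch, since the patent only specifies random cropping.

```python
import torchvision.transforms as T

# Hypothetical preprocessing pipeline: only the random crop is taken from the
# patent; the 224x224 resolution and ImageNet statistics are assumptions.
preprocess = T.Compose([
    T.Resize(256),
    T.RandomCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```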
A third aspect of the present invention provides a storage medium having a computer program stored therein, wherein the program is executable by a terminal device or a computer to perform the method of the first aspect of the present invention.
A network built on inter-frame temporal modeling can perform third-person action recognition well. Because of the way it is captured, however, first-person video contains irregular jitter and background motion, and most first-person actions are interactive (for example, interactions between people and objects), so finer-grained recognition is required: not only the action of the moving subject but also the object being interacted with must be recognized. In the invention, different segmentation and interaction schemes are applied to the feature maps of different scales before the temporal learning module, which strengthens the representation of the object manipulated by the moving subject and improves the network's ability to learn spatial information.
By analyzing the complementary relation between the modalities, the invention proposes a grouping-and-fusion scheme for feature embedding vectors, analyzed on three levels. On the spatial level, latent spatial semantic relations of different scales between the modalities are mined. On the fusion-stage level, two fusion architectures operating at different stages are adopted: the first acquires multi-scale spatial semantic information after inter-modal fusion and processes the inter-frame temporal relation afterwards, completing the multi-scale fusion of spatio-temporal information, while the second first processes the temporal information of the RGB modality and the depth modality separately and then fuses them; the two architectures attend to different information and enrich the fused semantics. On the multi-stream architecture level, complementary information is combined by cross-fusion, which clearly improves action recognition accuracy.
Extensive experiments on the FPHA benchmark demonstrate the effectiveness of each module; more importantly, the invention provides a new first-person action recognition method that combines multi-scale, multi-modal spatio-temporal information fusion.
The advantages and beneficial effects of the invention are as follows:
1. according to the invention, the multi-mode cross-frame attention modules MCIAM I and MCIAM II are utilized to realize a multi-scale multi-mode inter-space interactive attention mechanism based on Patch, information of different modes can be fused and enhanced, the optimal representation effect is learned, and abundant semantic information is provided for subsequent time sequence information processing work. Meanwhile, by combining the characteristics of the first visual angle video action data set and adopting various Patch segmentation strategies, the edge information of the central area of the cutting position is fully utilized, the generalization capability of the model can be enhanced, and good effects can be obtained in experiments of different scenes and motions.
2. The invention designs a bimodal grouping-and-fusion strategy for feature embedding vectors that preserves the specific attributes of each modality, performs multi-scale fusion of the low-level features rich in spatial information and the high-level features rich in deep semantics through embedding-mapping conversions, and, by combining the respective characteristics of the RGB and depth data, explores the information that can be shared within a unified deep architecture.
Drawings
FIG. 1a is an example of bimodal data (clapping action, RGB modality) employed by the present invention;
FIG. 1b is an example of bimodal data (clapping action, depth modality) employed by the present invention;
FIG. 2 is a network overall structure diagram of a first-view video motion recognition method according to an embodiment of the present invention;
FIG. 3 is a diagram of a model of the MCIAM I module;
FIG. 4 is a model diagram of the MCIAM II module;
FIG. 5 is a model diagram of the Inter-frame Encoder module;
FIG. 6 is a model diagram of a CFAM module;
Detailed Description
The present invention is further described in the following examples, which are illustrative rather than limiting and do not restrict the scope of the invention.
The invention provides a first-person view video action recognition method that is trained as an end-to-end network.
As shown in fig. 1a and fig. 1b, the invention uses bimodal first-person video action data for its experiments.
As shown in fig. 2, the invention adopts a network structure based on an RGB modality and a depth modality; the multi-stream network is trained jointly, and the data of the two modalities correspond one-to-one in time to the content of the video frames. The FPHA data set is selected for the experiments; it contains 1175 videos covering 45 action categories and involves 24 different objects.
As shown in fig. 1a and 1b, most video clips in the first-person action recognition task are short, and the range of motion of the action is small while it occupies a large part of the visual field, which introduces a certain amount of context redundancy.
The invention constructs a feature extraction network that takes the multi-layer convolutional residual blocks of ResNet-34 as basic building elements and uses a ResNet-34 model pre-trained on the ImageNet data set. As the depth of a convolutional neural network increases, the spatial semantics of the extracted features gradually decrease while the high-level semantics gradually increase, and both kinds of semantic information matter for first-person action recognition. To exploit the pre-trained feature extractor, to ensure that the high-level semantics still retain some spatial information of the feature map, and to improve intra-frame correlation, the fourth and fifth convolutional blocks of the ResNet-34 network are selected to output feature maps of two scales: the first-scale feature maps contain a certain amount of spatial information and the second-scale feature maps contain rich high-level semantic information. A multi-stream, multi-scale network structure is adopted to extract rich spatial semantics, and the MCIAM I and MCIAM II modules are designed for this purpose.
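A minimal sketch of this two-scale extractor is shown below, assuming the fourth and fifth convolutional blocks correspond to layer3 and layer4 in torchvision's ResNet-34 naming; the depth branch is assumed to reuse the same kind of backbone with depth frames replicated to three channels.

```python
import torch
import torchvision

# ImageNet-pretrained ResNet-34 backbone; layer3 / layer4 are tapped for the
# two feature-map scales (an assumed mapping of "fourth and fifth blocks").
backbone = torchvision.models.resnet34(weights="IMAGENET1K_V1")

def two_scale_features(frames: torch.Tensor):
    """frames: (BT, 3, 224, 224) -> (BT, 256, 14, 14) and (BT, 512, 7, 7)."""
    x = backbone.conv1(frames)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    x = backbone.layer1(x)
    x = backbone.layer2(x)
    f1 = backbone.layer3(x)    # first-scale map: more spatial information
    f2 = backbone.layer4(f1)   # second-scale map: richer high-level semantics
    return f1, f2

f1, f2 = two_scale_features(torch.randn(4, 3, 224, 224))
print(f1.shape, f2.shape)      # (4, 256, 14, 14) (4, 512, 7, 7)
```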
Through the processing of the MCIAM I and MCIAM II modules, the network obtains feature embedding vectors with rich multi-scale bimodal spatial semantics, which serve as the input of the Inter-frame Encoder modules.
The extraction of the inter-frame temporal relation is completed by several Inter-frame Encoder modules, which yield three feature embedding vectors generated by the RGB branch, the depth branch, and the multi-scale fusion branch respectively. The data of the RGB branch and the depth branch are fused by the CFAM module and combined with the feature embedding vector of the multi-scale fusion branch to generate the final feature embedding vector.
As shown in fig. 3 and 4, the invention introduces a specific implementation of the multi-modal cross intra-frame attention modules (MCIAM I, MCIAM II). Both modules employ a similar underlying attention mechanism; the RGB-modality branch of the framework is taken as an example below.
For an input feature map F ∈ BT×C×H×W (BT: batch size × number of frames, C: channels, H: height, W: width), the input feature map of the MCIAM I module is f_R1 ∈ 4×256×N1×N1 and the input feature map of the MCIAM II module is f_R2 ∈ 128×512×N2×N2. Two segmentation schemes are applied to the feature map f_R1. The first splits the feature map along its central axes into 4 intra-frame patches and extracts the correlation between the patches. Specifically, f_R1 is reshaped into f_R1 ∈ BT×N×CP², where P is the width and height of each newly generated patch and

N = HW / P²

is the number of generated patches. The CP²-dimensional vectors are mapped to dimension D by a linear layer to obtain feature embedding vectors, which are input into a cross-fusion attention module (identical in structure to the CFAM module) to generate the feature embedding vector F_R1' ∈ 128×4×512. The second segmentation scheme makes full use of the edge information at the cutting positions through a flattening operation, so that the learned features contain richer spatial information around the patch edges; the result is likewise input into a cross-fusion attention module to generate the feature embedding vector F_R1'' ∈ 128×4×512.
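The first (average) segmentation can be sketched as follows: the map is cut along its central axes into four patches, each flattened to a C·P² vector and linearly mapped to D. The shapes, and the omission of the position code and of the subsequent cross-fusion attention, are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn

def quadrant_patches(f: torch.Tensor) -> torch.Tensor:
    """Split a (BT, C, H, W) map along its central axes into 4 intra-frame
    patches and flatten each to a C*P*P vector, giving (BT, 4, C*P*P)."""
    BT, C, H, W = f.shape
    P = H // 2
    patches = f.unfold(2, P, P).unfold(3, P, P)    # (BT, C, 2, 2, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5)    # (BT, 2, 2, C, P, P)
    return patches.reshape(BT, 4, C * P * P)

D = 512
f_R1 = torch.randn(128, 256, 14, 14)               # assumed BT=128, N1=14
tokens = quadrant_patches(f_R1)                    # (128, 4, 256*7*7)
to_embed = nn.Linear(tokens.shape[-1], D)          # maps CP^2 to D
F_R1_prime = to_embed(tokens)                      # (128, 4, 512), cf. F_R1'
print(F_R1_prime.shape)
```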
The input feature map of the MCIAM II module, f_R2 ∈ 4×512×N2×N2, is input into the cross-fusion attention module to generate the feature embedding vector F_R2'' ∈ 128×4×512. The vectors F_R1' and F_R1'' generated by the MCIAM I module through the two segmentation-interaction schemes are added to F_R2'', and the result is added to F_D2'' generated in parallel by the depth-modality branch, producing a feature embedding vector F_Cross that fuses the spatial correlation of the two modalities at the two scales.
FIG. 5 shows the Inter-frame Encoder module used by the invention to extract the inter-frame temporal relation. Taking the feature map f_R2 input to this module as an example, a trainable linear projection performs the linear mapping that generates the corresponding feature embedding vector f_R2', and a position code is added to encode the position of each sequence frame or each patch. A multi-head attention operation is then performed: the embedding vector is first mapped by linear projections to query, key, and value vectors of dimension D; the dot product of the query and key vectors is passed through softmax to obtain the attention weights, which are used for a weighted sum of the value vectors. The operation inside each head can be written as

Attention(Q, K, V) = softmax(QK^T / √D) · V

where Q is the query vector, K the key vector, and V the value vector. The outputs of the heads are concatenated, followed by dropout, a residual connection, and layer normalization:

f_R2'' = LN(Dropout(Concat(MSA)) + f_R2' + x_pos)

The intermediate vector f_R2'' is then fed into a feed-forward network, again followed by a residual connection and layer normalization:

F_R2 = LN(Dropout(FFN(f_R2'')) + f_R2'')

where FFN is a feed-forward network consisting of two convolution layers with kernel size 1, and F_R2 is the final output of the Inter-frame Encoder module.
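A sketch of one such block, following the formulas above, is given below; the number of heads, the dropout rate, the FFN expansion factor, and the use of a learned positional parameter are assumptions.

```python
import torch
import torch.nn as nn

class InterFrameEncoder(nn.Module):
    """Sketch of one Inter-frame Encoder block: learned positional codes,
    multi-head self-attention with dropout, residual connection and layer
    normalization, then a two-layer kernel-size-1 convolutional FFN."""
    def __init__(self, dim: int = 512, heads: int = 8, seq_len: int = 8, p: float = 0.1):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        self.msa = nn.MultiheadAttention(dim, heads, dropout=p, batch_first=True)
        self.drop = nn.Dropout(p)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        # FFN as two 1x1 convolutions applied along the temporal axis.
        self.ffn = nn.Sequential(
            nn.Conv1d(dim, dim * 4, kernel_size=1), nn.GELU(),
            nn.Conv1d(dim * 4, dim, kernel_size=1),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:     # f: (B, T, dim)
        x = f + self.pos                                     # f_R2' + x_pos
        attn, _ = self.msa(x, x, x)                          # Concat(MSA)
        x = self.ln1(self.drop(attn) + x)                    # f_R2'' in the text
        y = self.ffn(x.transpose(1, 2)).transpose(1, 2)      # FFN(f_R2'')
        return self.ln2(self.drop(y) + x)                    # F_R2

F_R2 = InterFrameEncoder()(torch.randn(16, 8, 512))
print(F_R2.shape)   # torch.Size([16, 8, 512])
```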
As shown in fig. 6, the cross-fusion attention module CFAM is responsible for the cross-fusion in the invention. The different modalities have diverse characteristics, and the CFAM module performs the fusion operation to complete the joint representation of the two branches. The feature embedding vector F_R2 of the RGB branch and the feature embedding vector F_D2 of the depth branch are, after new position-information processing, handled essentially as in the Inter-frame Encoder module of fig. 5, except that the MSA is computed differently, which can be written as

F_D2_M, F_R2_M = CFAM(F_R2, F_D2)

The vectors F_D2_M and F_R2_M generated by the CFAM module are multiplied to produce the joint representation F_M. F_Cross, after its temporal correlation has been extracted, yields a new feature embedding vector F_Cross', which is added to F_M and passed through a linear layer to complete the class mapping; the results of all frames are then averaged to obtain the final recognition result.
The above network model employs a cross-entropy loss function, which evaluates the difference between the true action label of a video and the prediction of the current model; a smaller loss value indicates a better-trained model. The loss function is

Loss = -Σ_{i=1}^{n} y_i · log(p_i)

where n is the number of action classes, i is the index of the true class of the video data currently processed by the network, y_i is the true label of the corresponding class, and p_i is the probability predicted by the model for that class. Back-propagating the loss over repeated iterations optimizes the network parameters and improves model performance.
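For reference, this loss can be computed as follows; the batch size and the 45 FPHA classes are assumptions, and the explicit sum is shown alongside the equivalent library call.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(16, 45)                  # model outputs before softmax
labels = torch.randint(0, 45, (16,))          # true class indices

# 1) Library form: softmax + negative log-likelihood in one call.
loss = F.cross_entropy(logits, labels)

# 2) Explicit form of -sum_i y_i * log(p_i) with one-hot y.
p = logits.softmax(dim=-1)
y = F.one_hot(labels, num_classes=45).float()
loss_explicit = -(y * p.clamp_min(1e-12).log()).sum(dim=-1).mean()

print(loss.item(), loss_explicit.item())      # the two values agree
```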
The method is trained and tested on the FPHA data set with the PyTorch deep learning framework; the evaluation metric is recognition accuracy. The experimental results are shown in the following table:
Methods Modality Accuracy(%)
HON4D Depth 70.61
Two stream-color RGB 61.56
H+O RGB 82.43
1-layer LSTM Pose 78.73
2-layer LSTM Pose 80.14
Gram Matrix Pose 85.39
Two stream Flow+RGB 75.3
HOG2-depth+pose Pose+Depth 66.78
JOULE-all Pose+RGB+Depth 78.78
Ours RGB+Depth 96.70
The table shows, through comparison with other mainstream algorithms, the effectiveness of the invention: the multi-scale bimodal spatio-temporal representation attention mechanism assigns reasonable correlation weights to the feature embeddings and enhances the spatio-temporal semantic representation, and, combined with the cross-fusion module, a higher recognition accuracy is obtained; the invention therefore also has practical value.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept, and these changes and modifications are all within the scope of the present invention.

Claims (10)

1. A first-person view video action recognition method, characterized in that a first-person video action data set is input into a multi-scale network based on an RGB modality and a depth modality to extract spatial semantics; the network adopts a convolutional neural network CNN, and two different convolutional blocks of the CNN are selected to output feature maps of two scales, the first-scale feature maps containing a certain amount of spatial information and the second-scale feature maps containing rich high-level semantic information; the first-scale feature maps are processed by an MCIAM I module and the second-scale feature maps by an MCIAM II module to obtain feature embedding vectors with rich multi-scale bimodal spatial semantics; the feature embedding vectors are processed by several Inter-frame Encoder modules to extract the inter-frame temporal relation, yielding three feature embedding vectors; the data of the RGB branch and the depth branch are fused by a CFAM module and combined with the feature embedding vector of the multi-scale fusion branch to generate a joint feature embedding vector; the joint feature embedding vector is processed by a linear layer to obtain a per-frame action classification result, the video frames of an action segment are then averaged along the temporal direction, and the final recognition result is output.
2. The first-person view video action recognition method of claim 1, wherein the MCIAM I module segments the first-scale feature maps by average segmentation or edge-crossing segmentation, maps the feature maps into embedding vectors through feature embedding and linear mapping, adds position information, computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the RGB modality and the depth modality, and enhances the spatial correlation between the modalities.
3. The first-person view video action recognition method of claim 1, wherein the MCIAM II module computes the spatial correlation between the RGB modality and the depth modality using a horizontal-and-vertical segmentation scheme, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities to complete the multi-scale bimodal spatial semantic enhancement.
4. The first-person view video action recognition method of claim 1, wherein the Inter-frame Encoder module uses a trainable linear projection to perform the linear mapping that generates the corresponding feature embedding vector, adds a position code to the feature embedding vector to encode the position of each sequence frame or each patch, then performs a multi-head attention operation to obtain an intermediate vector, and finally feeds the intermediate vector into a feed-forward network with residual connection and layer normalization.
5. The first-person view video action recognition method of claim 1, wherein the difference between the true action label of a video and the prediction of the current model is evaluated by a cross-entropy loss function:

Loss = -Σ_{i=1}^{n} y_i · log(p_i)

where n is the number of action classes, i is the index of the true class of the video data currently processed by the network, y_i is the true label of the corresponding class, and p_i is the probability predicted by the model for that class.
6. The first-person view video action recognition method of claim 1, wherein data preprocessing and data augmentation are performed before the first-person video action data set is input into the network.
7. The first-person view video action recognition method of claim 1, wherein the convolutional neural network CNN is a ResNet-34 model pre-trained on the ImageNet data set, with the multi-layer convolutional residual blocks of ResNet-34 as basic building elements.
8. A first-person view video action recognition apparatus, comprising:
a convolutional neural network CNN module, used to extract feature information, performing feature extraction for the RGB modality and the depth modality at the 2D feature-map level and outputting feature maps of two scales, (N1, N1) and (N2, N2);
a multi-modal cross-frame attention module MCIAM I, used to process the (N1, N1)-scale feature maps: it applies two different segmentation schemes, average segmentation and edge-crossing segmentation, maps the feature maps into embedding vectors through feature embedding and linear mapping, adds position information, computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the two modalities, and enhances the spatial correlation between the modalities;
a multi-modal cross-frame attention module MCIAM II, used to process the (N2, N2)-scale feature maps: it computes the spatial correlation between the modalities with a horizontal-and-vertical segmentation scheme, following the same calculation procedure as the MCIAM I module, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities to complete the multi-scale bimodal spatial semantic enhancement;
an Inter-frame Encoder, used to model the temporal process: it processes the action timing information with a self-attention mechanism, models the long-term relation of the motion, assigns reasonable weights to the inter-frame feature embeddings, suppresses the interference of irrelevant people and objects in the video, and allocates more attention resources to the focus area;
and a cross-fusion attention module CFAM, used to complete the joint representation of the RGB modality and the depth modality of the temporal network: it learns the structure shared between the different modalities through a mutual attention mechanism and is responsible for the final multi-scale spatio-temporal information fusion, generating the joint feature embedding vector.
9. The apparatus of claim 8, further comprising a preprocessing module used to preprocess and augment the first-person video action data by random cropping.
10. A storage medium, in which a computer program is stored, wherein the program is executable by a terminal device or a computer to perform the method of any one of claims 1 to 7.
CN202210120923.5A 2022-02-09 2022-02-09 First visual angle video action identification method and device Pending CN114596520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210120923.5A CN114596520A (en) 2022-02-09 2022-02-09 First visual angle video action identification method and device

Publications (1)

Publication Number Publication Date
CN114596520A true CN114596520A (en) 2022-06-07

Family

ID=81805156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210120923.5A Pending CN114596520A (en) 2022-02-09 2022-02-09 First visual angle video action identification method and device

Country Status (1)

Country Link
CN (1) CN114596520A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100740A (en) * 2022-06-15 2022-09-23 东莞理工学院 Human body action recognition and intention understanding method, terminal device and storage medium
CN115100740B (en) * 2022-06-15 2024-04-05 东莞理工学院 Human motion recognition and intention understanding method, terminal equipment and storage medium
CN115082840B (en) * 2022-08-16 2022-11-15 之江实验室 Action video classification method and device based on data combination and channel correlation
CN115082840A (en) * 2022-08-16 2022-09-20 之江实验室 Action video classification method and device based on data combination and channel correlation
CN115381467A (en) * 2022-10-31 2022-11-25 浙江浙大西投脑机智能科技有限公司 Attention mechanism-based time-frequency information dynamic fusion decoding method and device
CN115797633A (en) * 2022-12-02 2023-03-14 中国科学院空间应用工程与技术中心 Remote sensing image segmentation method, system, storage medium and electronic equipment
CN115952407B (en) * 2023-01-04 2024-01-30 广东工业大学 Multipath signal identification method considering satellite time sequence and airspace interactivity
CN115952407A (en) * 2023-01-04 2023-04-11 广东工业大学 Multipath signal identification method considering satellite time sequence and space domain interactivity
CN115984293A (en) * 2023-02-09 2023-04-18 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN115984293B (en) * 2023-02-09 2023-11-07 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN116434343B (en) * 2023-04-25 2023-09-19 天津大学 Video motion recognition method based on high-low frequency double branches
CN116434343A (en) * 2023-04-25 2023-07-14 天津大学 Video motion recognition method based on high-low frequency double branches
CN116246075A (en) * 2023-05-12 2023-06-09 武汉纺织大学 Video semantic segmentation method combining dynamic information and static information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination