CN116311525A

CN116311525A - Video behavior recognition method based on cross-modal fusion

Info

Publication number: CN116311525A
Application number: CN202310292682.7A
Authority: CN
Inventors: 周毓轩; 李宏亮; 谢晶晶; 梁悦; 刘黛瑶; 万金鹏; 孟凡满; 吴庆波; 许林峰; 潘力立
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2023-03-23
Filing date: 2023-03-23
Publication date: 2023-06-23

Abstract

The invention provides a video behavior recognition method based on cross-modal fusion, comprising the following steps: the video stream is downsampled, each downsampled frame is divided into pixel blocks, an image feature vector is computed with a linear projection layer, and the image feature vector is input into a Transformer spatial encoder to obtain an image feature sequence for each video frame; the inertial motion sensor data are segmented, each segment is raised in dimension by a linear mapping, and the result is input into a Transformer temporal encoder to obtain a sensor feature sequence; the image feature sequence serves as the key and value vectors and the sensor feature sequence serves as the query vector, both are input into a masked Transformer temporal encoder to obtain a temporally fused multi-modal feature, the multi-modal feature is input into a multi-layer perceptron (MLP), and the MLP outputs the video recognition result. Through the spatially encoding Transformer and the temporally encoding Transformer, the invention jointly extracts spatio-temporal semantic features and human motion features from the video stream data and the inertial motion sensor data, and completes behavior recognition with a cross-modal encoding Transformer.

Description

Video behavior recognition method based on cross-modal fusion

Technical Field

The invention belongs to the field of deep learning and relates to a video behavior recognition technique based on cross-modal fusion.

Background

In recent years, with the spread of wearable smart devices, smart home products and the like, and with the large volume of user videos generated on social media platforms, video recognition has become one of the most common and active areas of computer vision. Video behavior recognition has broad application prospects, including human-computer interaction, health monitoring, security surveillance, gaming and entertainment, and video content retrieval. Video recognition is the foundation of video understanding: in these applications, the actions performed in a scene must be recognized and distinguished, and further decisions or processing can then be based on the inferred result. Research on video behavior recognition methods is therefore of considerable practical significance.

At present, deep-learning-based behavior recognition algorithms have become the prevailing approach. One important branch uses 3D convolutional neural networks to extract scene features and motion features from the video stream; the other branch uses recurrent convolutional neural networks to extract limb motion features from time-series inertial motion sensor data.

To date, video behavior recognition methods based on 3D convolutional neural networks have been widely used and have achieved notable results. Current mainstream algorithms of this kind use only a 3D convolutional network to extract video stream features, achieving end-to-end spatio-temporal modeling. Because the large data volume of video brings high computational overhead, a 3D convolutional neural network generally has to downsample a fixed number of frames from the video's image stream as input; it therefore cannot cover every moment and has difficulty capturing fine-grained temporal motion information. More recent mainstream algorithms borrow the Transformer architecture from natural language processing and transfer it to vision: each frame is uniformly divided into non-overlapping pixel blocks that are embedded as feature vectors, so that complex high-dimensional video data is converted into a sequence learning problem. In the self-attention mechanism, the core module of the Transformer, global self-attention is generally computed jointly over space and time across all pixel block features of the video, which leads to partly redundant and inefficient computation. In addition, whether a 3D convolutional network or a Transformer is used, these methods tend to model the features of each modality entirely separately and fuse them only at the end of the network, so the modalities cannot interact fully.

Disclosure of Invention

To address the insufficient feature extraction of behavior recognition under single-modal video input, and the difficulty current mainstream models have in flexibly realizing multi-modal fusion, the invention provides a method that jointly extracts features from video data and inertial motion sensor data and fuses the two kinds of features with full interaction in a multi-modal network model, thereby improving video behavior recognition performance.

The technical solution adopted by the invention to solve the above problems is a video behavior recognition method based on cross-modal fusion, comprising the following steps:

video data processing step: the video stream is downsampled, each downsampled frame is divided into non-overlapping pixel blocks, an image feature vector is computed with a linear projection layer, and the feature vector is then input into a Transformer spatial encoder to obtain an image feature sequence for each video frame;

inertial motion sensor data processing step: the sensor data are segmented, each segment is raised in dimension by a linear mapping to obtain a motion feature vector, and the motion feature vectors are input into a Transformer temporal encoder to obtain a sensor feature sequence of time segments whose dimension is aligned with that of the image feature representation;

video recognition step: the image feature sequence serves as the key and value vectors and the sensor feature sequence serves as the query vector; both are input into a masked Transformer temporal encoder, which fuses the image feature sequence and the sensor feature sequence segment by segment by cross-attention; the weight matrix obtained after the Softmax calculation is multiplied element by element with a mask matrix to obtain a temporally fused multi-modal feature; the multi-modal feature is input into a multi-layer perceptron (MLP), and the MLP outputs the video recognition result.
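
For illustration only, the video recognition step can be written compactly as follows. The symbols Q, K, V, M, A, Z and d are notation introduced here and do not appear in the original text; the scaling by the square root of d is the usual scaled dot-product convention and is an assumption, since the text does not state how the attention scores are scaled.

```latex
% Q: sensor feature sequence (queries); K, V: image feature sequence (keys, values)
% d: feature dimension (assumed scaling factor); M: mask matrix; \odot: element-wise product
\[
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \odot M,
\qquad
M_{ij} =
\begin{cases}
1, & j \le i,\\
0, & j > i,
\end{cases}
\qquad
Z = A V
\]
```

Here Z denotes the temporally fused multi-modal feature that is passed to the MLP. Because the mask is applied after the Softmax, as the text specifies, the rows of A generally no longer sum to 1.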

The advantage of the invention is that spatio-temporal semantic features and human motion features are jointly extracted from the video stream data and the inertial motion sensor data by the spatially encoding Transformer and the temporally encoding Transformer, and behavior recognition is completed with a cross-modal encoding Transformer, which achieves feature fusion at fine temporal granularity and improves the accuracy of behavior recognition.

Drawings

FIG. 1 is a flow chart of the model.

FIG. 2 is a schematic diagram of the single-frame image Transformer spatial encoder computation.

FIG. 3 is a schematic diagram of the sensor motion feature Transformer temporal encoder computation.

FIG. 4 is a schematic diagram of cross-modal temporal cross-attention.

Detailed Description

The embodiment is implemented mainly on a high-performance Linux server equipped with multiple TITAN X graphics cards. First, a self-developed embedded device is used to capture a large amount of paired video and inertial motion sensor data, from which the daily-life behavior dataset required for experimental training is constructed.

The specific steps of the behavior recognition method based on cross-modal fusion are shown in FIG. 1:

1. The video of each dataset sample is downsampled to a fixed 32 frames, and data augmentation such as random cropping is applied; the inertial motion sensor data are uniformly divided into 32 segments, matching the number of frames;

2. Each frame is divided into non-overlapping pixel blocks, feature vectors are computed with a linear projection layer, and the feature representation of each frame is obtained through a Transformer spatial encoder composed of layer normalization, spatial self-attention, a feed-forward network (FFN) and other modules; in parallel, the inertial motion sensor data are raised in dimension segment by segment with a linear mapping (Linear), and a sensor feature sequence of time segments aligned with the image feature dimension is obtained through a Transformer temporal encoder composed of layer normalization, temporal self-attention, a feed-forward network (FFN) and other modules;

3. A masked Transformer temporal encoder fuses the image feature representation and the sensor feature sequence segment by segment by cross-attention, and the video recognition result is finally output through a multi-layer perceptron (MLP).
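
The temporal alignment in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy array sizes, and the choice of evenly spaced frame indices are assumptions, since the text only states that the video is downsampled to 32 frames and the sensor data are split into 32 segments.

```python
import numpy as np

NUM_SEGMENTS = 32  # frame count and sensor segment count stated in the text


def sample_frame_indices(num_frames, num_samples=NUM_SEGMENTS):
    """Pick evenly spaced frame indices (assumed sampling strategy)."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def split_sensor(sensor, num_segments=NUM_SEGMENTS):
    """Split a (T, 6) sensor recording into contiguous, near-equal segments."""
    return np.array_split(sensor, num_segments, axis=0)


# Toy data (hypothetical sizes): 300 small frames and 1000 six-axis samples.
video = np.zeros((300, 8, 8, 3), dtype=np.uint8)
sensor = np.random.default_rng(0).normal(size=(1000, 6))

frames = video[sample_frame_indices(len(video))]  # one frame per time slot
segments = split_sensor(sensor)                   # one sensor segment per time slot
```

After this step, frame i and sensor segment i cover the same portion of the recording, which is what allows the later segment-by-segment fusion.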

The spatial feature extraction for a single frame is shown in FIG. 2 and mainly comprises the following steps:

the first step: each video clip is downsampled to a fixed 32 frames and randomly cropped to 224×224 pixels, and each frame is then uniformly divided into a series of non-overlapping 16×16 pixel blocks;

the second step: the feature vector of each pixel block is obtained through a linear projection layer; the operations of the first and second steps are carried out together by a convolutional layer whose stride and kernel size are both 16, which divides the 224×224-pixel image into a 14×14 grid of pixel blocks; the convolutional layer has 768 channels, so the feature of each pixel block is a 768-dimensional vector;

the third step: an additional class embedding vector is concatenated to each frame, and a cosine-function positional encoding of the pixel blocks is added to encode their positional relationships in two-dimensional space;

the fourth step: the image block features obtained in the previous step are fed into the Transformer spatial encoder: they are first normalized with LayerNorm, then input into a multi-head spatial self-attention module that weights the importance of the image sub-regions, with a residual connection; finally, a feed-forward network (FFN) completes the encoding of the spatial semantic features of the single frame.
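
The block division, linear projection and class-embedding concatenation described above can be sketched as follows. A convolution whose stride and kernel size are both 16 is equivalent to flattening each 16×16 block and applying one shared linear projection, which is what this sketch does. The function name, the random weights and the zero class token are placeholders (assumptions) standing in for learned parameters; the positional encoding and the encoder layers are not shown.

```python
import numpy as np

PATCH, DIM = 16, 768  # block size and feature dimension stated in the text


def patchify(frame, patch=PATCH):
    """Cut an (H, W, C) frame into non-overlapping, flattened pixel blocks."""
    h, w, c = frame.shape
    gh, gw = h // patch, w // patch
    blocks = frame.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(gh * gw, patch * patch * c)


rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))                               # one cropped RGB frame
w_proj = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, DIM))  # stand-in for learned weights
cls_token = np.zeros((1, DIM))                                  # stand-in for the learned class embedding

patches = patchify(frame)                           # (196, 768): 14 x 14 blocks
tokens = patches @ w_proj                           # (196, 768): one feature per block
tokens = np.concatenate([cls_token, tokens], axis=0)  # (197, 768): class token prepended
```

With a 224×224 input and 16×16 blocks there are 224 / 16 = 14 blocks per side, that is 196 blocks per frame, or 197 tokens once the class embedding is concatenated.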

The temporal feature extraction for the inertial motion sensor segments is shown in FIG. 3 and mainly comprises the following steps:

the first step: the inertial sensor data are uniformly divided into 32 segments in temporal order, and for each segment a linear mapping layer raises the 6-axis data to a 64-dimensional hidden feature, so as to better represent complex limb motion;

the second step: a cosine-function temporal positional encoding of the segments is added to encode the temporal order among the segments;

the third step: the segment motion features obtained in the previous step are fed into the Transformer temporal encoder: LayerNorm layer normalization is applied first, the result is input into a multi-head temporal self-attention module with a residual connection, and finally an FFN converts the features to the same dimension as the image features for the cross-attention calculation.
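
The sensor branch can be sketched in the same spirit. Several details here are assumptions rather than statements from the text: each segment is reduced to a single 6-axis vector by mean pooling (the text does not say how the samples within a segment are combined), the positional encoding uses the standard sine/cosine form, the weights are random stand-ins, and the self-attention layers are omitted so that only the dimensions are traced.

```python
import numpy as np


def sinusoidal_positions(n, dim):
    """Standard sine/cosine positional encoding (assumed form), shape (n, dim)."""
    pos = np.arange(n)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.zeros((n, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc


rng = np.random.default_rng(0)
sensor = rng.normal(size=(960, 6))        # toy 6-axis trace (hypothetical length)
segments = sensor.reshape(32, 30, 6)      # 32 equal segments of 30 samples each
pooled = segments.mean(axis=1)            # (32, 6): ASSUMED per-segment pooling

w_up = rng.normal(scale=0.1, size=(6, 64))              # stand-in for the linear mapping layer
hidden = pooled @ w_up + sinusoidal_positions(32, 64)   # (32, 64) with temporal encoding

w_out = rng.normal(scale=0.1, size=(64, 768))  # stand-in for the FFN's output projection
sensor_tokens = hidden @ w_out                 # (32, 768): matches the image feature dimension
```

The final projection to 768 dimensions is what makes the sensor sequence usable as queries against the 768-dimensional image features in the cross-attention step.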

Based on the deep features of each frame and the corresponding segment motion features, a cross-modal cross-attention temporal encoder is applied, as shown in FIG. 4. Causality is enforced with a masked self-attention mechanism: in the actual implementation, the weight matrix obtained after the attention mechanism's Softmax calculation is multiplied element by element with a mask matrix (elements on and below the main diagonal are 1, all other elements are 0), which blocks the attention of the current moment to all subsequent moments. Finally, the temporally fused multi-modal features are fed into the classification head of the MLP network, which outputs the behavior recognition result.
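
A minimal single-head sketch of this masked cross-attention is given below. It follows the order of operations stated in the text (Softmax first, then element-wise multiplication with the lower-triangular mask). The function name, the scaling by the square root of the feature dimension, the use of one 768-dimensional vector per frame, and the random inputs are assumptions; the learned query, key and value projections, the multiple heads and the MLP head are omitted.

```python
import numpy as np


def masked_cross_attention(query, key, value):
    """Cross-attention with a causal mask applied AFTER the softmax, as described."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)           # assumed scaled dot-product scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    mask = np.tril(np.ones_like(weights))         # 1 on and below the main diagonal
    weights = weights * mask                      # zero out attention to later moments
    return weights @ value, weights


rng = np.random.default_rng(0)
sensor_seq = rng.normal(size=(32, 768))  # queries: one vector per sensor segment
image_seq = rng.normal(size=(32, 768))   # keys/values: one vector per frame (assumed)

fused, attn = masked_cross_attention(sensor_seq, image_seq, image_seq)
```

Because the mask is applied after the Softmax rather than to the scores before it, the retained weights in each row are not renormalized: only the last row still sums to 1, and earlier rows sum to less.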

Claims

1. A video behavior recognition method based on cross-modal fusion, characterized by comprising the following steps:

video recognition step: the image feature sequence serves as the key and value vectors and the sensor feature sequence serves as the query vector; both are input into a masked Transformer temporal encoder, which fuses the image feature sequence and the sensor feature sequence segment by segment by cross-attention; the weight matrix obtained after the Softmax calculation is multiplied element by element with a mask matrix to obtain a temporally fused multi-modal feature; the multi-modal feature is input into a multi-layer perceptron (MLP), and the MLP outputs the video behavior recognition result.

2. The method of claim 1, wherein the Transformer spatial encoder comprises layer normalization, spatial self-attention, and a feed-forward network; and the Transformer temporal encoder comprises layer normalization, temporal self-attention, and a feed-forward network.

3. The method of claim 1, wherein the mask matrix is a matrix whose elements on and below the main diagonal are 1 and whose remaining elements are 0.