CN116311525A - Video behavior recognition method based on cross-modal fusion - Google Patents

Video behavior recognition method based on cross-modal fusion

Info

Publication number
CN116311525A
Authority
CN
China
Prior art keywords
video
feature
sequence
transformer
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310292682.7A
Other languages
Chinese (zh)
Inventor
周毓轩
李宏亮
谢晶晶
梁悦
刘黛瑶
万金鹏
孟凡满
吴庆波
许林峰
潘力立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202310292682.7A
Publication of CN116311525A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video behavior recognition method based on cross-modal fusion, which comprises the following steps: the video stream is downsampled, each downsampled frame is divided into pixel blocks, image feature vectors are computed with a linear projection layer, and the feature vectors are input into a Transformer spatial encoder to obtain an image feature sequence for each video frame; the inertial motion sensor data are divided into segments, each segment is up-projected with a linear mapping, and the resulting vectors are input into a Transformer temporal encoder to obtain a sensor feature sequence; the image feature sequence serves as the key and value vectors and the sensor feature sequence serves as the query vector of a masked Transformer temporal encoder, which outputs temporally fused multi-modal features; the multi-modal features are input into a multi-layer perceptron MLP, and the MLP outputs the video recognition result. The invention jointly extracts spatio-temporal semantic features and human motion features from video stream data and inertial motion sensor data through a spatial-encoding Transformer and a temporal-encoding Transformer, and completes behavior recognition with a cross-modal encoding Transformer.

Description

Video behavior recognition method based on cross-modal fusion
Technical Field
The invention belongs to the field of deep learning, and relates to a cross-modal fusion video behavior recognition technology.
Background
In recent years, with the popularization of wearable smart devices, smart home appliances and the like, and with the large number of user videos generated on social media platforms, video recognition has become one of the most common and active fields in computer vision. Video behavior recognition has broad application prospects, including human-computer interaction, health monitoring, security surveillance, gaming and entertainment, and video content retrieval. Video recognition is the basis of video understanding: in these applications the actions performed in a scene must be recognized and distinguished, and further decisions or processing can then be carried out on the basis of this inference. Research on video behavior recognition methods therefore has great practical significance.
At present, deep-learning-based behavior recognition algorithms have become the mainstream approach. One important branch uses 3D convolutional neural networks to extract scene and motion features from the video stream; another branch uses recurrent convolutional neural networks to extract limb motion features from time-series inertial motion sensor data.
So far, video behavior recognition methods based on 3D convolutional neural networks have been widely used and have achieved remarkable results. Mainstream algorithms typically rely on a 3D convolutional network alone to extract video-stream features and thus realize end-to-end spatio-temporal modeling. Because the huge data volume of video naturally brings high computational overhead, a 3D convolutional neural network generally has to downsample a fixed number of frames from the image stream as input; it cannot cover every moment of the video and has difficulty describing fine-grained temporal motion information. More recent mainstream algorithms borrow the Transformer architecture from natural language processing and transfer it to the visual domain: each frame is uniformly divided into a number of non-overlapping pixel blocks that are embedded as feature vectors, so that complex high-dimensional video data are converted into a sequence-learning problem. In the self-attention mechanism, the core algorithm module of the Transformer, global spatio-temporal self-attention is usually computed over all pixel-block features of the video, which leads to partially redundant and inefficient computation. In addition, whether a 3D convolutional network or a Transformer method is adopted, the modeling of the multi-modal features tends to be completely separated, and fusion only at the end of the network cannot achieve sufficient interaction.
Disclosure of Invention
Aiming at the insufficient feature extraction of behavior recognition under single-modal video input, and at the difficulty current mainstream models have in flexibly realizing multi-modal fusion, the invention provides a method that jointly extracts features from video data and inertial motion sensor data and uses a multi-modal network model to fully and interactively fuse the two kinds of features, so as to improve video behavior recognition performance.
The technical solution adopted by the invention to solve the above problems is a video recognition method based on cross-modal fusion, comprising the following steps:
a video data processing step: the video stream is downsampled, each downsampled frame is divided into non-overlapping pixel blocks, image feature vectors are computed with a linear projection layer, and the feature vectors are then input into a Transformer spatial encoder to obtain an image feature sequence for each video frame;
an inertial motion sensor data processing step: the sensor data are divided into segments, motion feature vectors are obtained by up-projecting the data segment by segment with a linear mapping, and the sensor feature vectors are input into a Transformer temporal encoder to obtain a sensor feature sequence whose time segments and feature dimension are aligned with the image feature representation;
a video recognition step: the image feature sequence serves as the key and value vectors, the sensor feature sequence serves as the query vector, and both are input into a masked Transformer temporal encoder; the masked Transformer temporal encoder fuses the image feature sequence and the sensor feature sequence segment by segment through cross attention, the weight matrix after the Softmax computation is multiplied element by element with a mask matrix to obtain temporally fused multi-modal features, the multi-modal features are input into a multi-layer perceptron MLP, and the MLP outputs the video recognition result.
The advantage of the method is that spatio-temporal semantic features and human motion features are jointly extracted from video stream data and inertial motion sensor data through a spatial-encoding Transformer and a temporal-encoding Transformer, behavior recognition is completed with a cross-modal encoding Transformer, feature fusion at fine temporal granularity is realized, and the behavior recognition accuracy of the model is improved.
Drawings
FIG. 1 is a flow chart of the model
FIG. 2 is a schematic diagram of the computation in the single-frame image Transformer spatial encoder
FIG. 3 is a schematic diagram of the computation in the sensor motion feature Transformer temporal encoder
FIG. 4 is a schematic diagram of the cross-modal temporal cross attention
Detailed Description
This embodiment is implemented mainly on a high-performance Linux server with several TITAN X graphics cards. First, an independently developed embedded device is used to capture and collect a large amount of paired video and inertial motion sensor data, and the daily-life behavior data set required for experimental training is constructed.
The specific steps of the behavior recognition method based on cross-modal fusion are shown in FIG. 1:
1. The video of each data set sample is downsampled to a fixed 32 frames, and data augmentation operations such as random cropping are applied; the inertial motion sensor data are uniformly divided into 32 segments corresponding to the number of frames (a preprocessing sketch is given after this list);
2. Each frame is divided into non-overlapping pixel blocks, feature vectors are computed with a linear projection layer, and the feature representation of each frame is obtained through a Transformer spatial encoder consisting of layer normalization, spatial self-attention, a feed-forward network FFN and other modules; at the same time, a linear mapping is applied segment by segment to up-project the inertial motion sensor data, and a sensor feature sequence whose time segments are aligned with the image feature dimension is obtained through a Transformer temporal encoder consisting of layer normalization, temporal self-attention, a feed-forward network FFN and other modules;
3. A masked Transformer temporal encoder fuses the image feature representation and the sensor feature sequence segment by segment through cross attention, and the video recognition result is finally output through a multi-layer perceptron MLP.
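The temporal alignment of the two modalities in step 1 can be illustrated with a minimal sketch, assuming PyTorch (the patent does not name a framework) and illustrative tensor shapes:

```python
import torch

def preprocess(video, imu, num_frames=32, crop=224):
    """Sketch of the data alignment in step 1: fixed 32-frame video downsampling
    with a random crop, and uniform 32-segment division of the 6-axis sensor data.
    Tensor shapes are illustrative assumptions."""
    # video: (T_v, 3, H, W) RGB frames; imu: (T_s, 6) inertial samples
    T_v, _, H, W = video.shape
    T_s = imu.shape[0]

    # uniformly sample 32 frame indices over the whole clip
    idx = torch.linspace(0, T_v - 1, num_frames).long()
    frames = video[idx]

    # random spatial crop to 224x224 (simple data augmentation)
    top = int(torch.randint(0, H - crop + 1, (1,)))
    left = int(torch.randint(0, W - crop + 1, (1,)))
    frames = frames[:, :, top:top + crop, left:left + crop]   # (32, 3, 224, 224)

    # divide the sensor stream into 32 equal-length segments
    seg_len = T_s // num_frames
    segments = imu[:seg_len * num_frames].reshape(num_frames, seg_len, 6)
    return frames, segments
```

The key point is that the 32 downsampled frames and the 32 sensor segments are temporally aligned one to one, which is what later allows the cross-attention fusion to be performed segment by segment.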
The spatial feature extraction for a single frame is shown in FIG. 2 and mainly comprises the following steps (a code sketch is given after these steps):
First step: each video clip is downsampled to a fixed 32 frames and randomly cropped to 224×224 pixels, and each frame is then uniformly divided into a series of 16×16 non-overlapping pixel blocks;
Second step: the feature vector of each pixel block is obtained through a linear projection layer; in practice the block division and projection of the first two steps are completed by a convolution layer whose stride and kernel size are both 16, so that a 224×224-pixel image is divided into 14×14 pixel blocks, and the number of convolution channels is 768, i.e. the feature of each pixel block is a 768-dimensional vector;
Third step: an additional class embedding vector is fused into each frame by concatenation, and a cosine positional code of the pixel-block positions is added to introduce the two-dimensional spatial position relationship;
Fourth step: the pixel-block features obtained in the previous step are sent into the Transformer spatial encoder, where LayerNorm normalization is applied first, the features are then input into a multi-head spatial self-attention module that weights the importance of image sub-regions, residual connections are introduced, and the encoding of the single-frame spatial semantic features is completed through a feed-forward neural network FFN.
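A minimal sketch of this per-frame Transformer spatial encoder is given below, assuming PyTorch; the encoder depth, the number of attention heads, the learnable positional parameter standing in for the cosine positional code, and the use of the class token as the per-frame feature are illustrative assumptions rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    """Sketch of the per-frame Transformer spatial encoder."""
    def __init__(self, dim=768, heads=12, depth=1):
        super().__init__()
        # stride-16, kernel-16 convolution = linear projection of 16x16 pixel blocks
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # learnable stand-in for the 2-D cosine positional code (14*14 blocks + class token)
        self.pos_embed = nn.Parameter(torch.zeros(1, 14 * 14 + 1, dim))
        self.blocks = nn.ModuleList([nn.ModuleDict(dict(
            norm1=nn.LayerNorm(dim),
            attn=nn.MultiheadAttention(dim, heads, batch_first=True),
            norm2=nn.LayerNorm(dim),
            ffn=nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                              nn.Linear(4 * dim, dim)),
        )) for _ in range(depth)])

    def forward(self, frames):                       # (B*T, 3, 224, 224)
        x = self.patch_embed(frames)                 # (B*T, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)             # (B*T, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        for blk in self.blocks:
            h = blk["norm1"](x)                      # LayerNorm first
            a, _ = blk["attn"](h, h, h)              # multi-head spatial self-attention
            x = x + a                                # residual connection
            x = x + blk["ffn"](blk["norm2"](x))      # feed-forward network FFN
        return x[:, 0]                               # class token as the frame feature
```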
The temporal feature extraction of the inertial motion sensor segments is shown in FIG. 3 and mainly comprises the following steps (a code sketch is given after these steps):
First step: the inertial sensor data are uniformly divided into 32 segments along the time axis, and in each segment the 6-axis data are up-projected to 64-dimensional hidden features through a linear mapping layer, so as to better represent complex limb motion;
Second step: a cosine temporal positional code of the segment index is added to introduce the temporal relationship among the segments;
Third step: the segment motion features obtained in the previous step are sent into the Transformer temporal encoder, where LayerNorm layer normalization is applied first, the features are then input into a multi-head temporal self-attention module, residual connections are introduced, and the segment motion features are finally converted by an FFN network into a dimension consistent with the image features for the cross-attention computation.
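A minimal sketch of this sensor-branch Transformer temporal encoder is given below, assuming PyTorch; flattening each segment before the linear up-projection, the number of attention heads, and the exact form of the cosine positional code are illustrative assumptions, while the 64-dimensional hidden size and the FFN output dimension of 768 follow the text.

```python
import torch
import torch.nn as nn

class TemporalSensorEncoder(nn.Module):
    """Sketch of the Transformer temporal encoder for inertial sensor segments."""
    def __init__(self, seg_len, hidden=64, img_dim=768, heads=4):
        super().__init__()
        # per-segment linear up-projection: flattened 6-axis segment -> 64-d hidden feature
        self.proj = nn.Linear(seg_len * 6, hidden)
        self.norm1 = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden)
        # the FFN also lifts the feature to the image dimension for cross attention
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, img_dim))

    @staticmethod
    def cosine_position(n, dim):
        # sinusoidal/cosine positional code over the segment index
        pos = torch.arange(n).unsqueeze(1)
        i = torch.arange(dim).unsqueeze(0)
        angle = pos / torch.pow(torch.tensor(10000.0), (2 * (i // 2)) / dim)
        return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))

    def forward(self, segments):                      # (B, 32, seg_len, 6)
        B, T = segments.shape[:2]
        x = self.proj(segments.flatten(2))            # (B, 32, 64)
        x = x + self.cosine_position(T, x.shape[-1]).to(x)
        h = self.norm1(x)                             # LayerNorm first
        a, _ = self.attn(h, h, h)                     # multi-head temporal self-attention
        x = x + a                                     # residual connection
        return self.ffn(self.norm2(x))                # (B, 32, 768)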
Based on the deep features of each frame and the corresponding segment motion features, a cross-modal cross-attention temporal encoder is used, as shown in FIG. 4. Causality is realized with a masked attention mechanism: in practice, the weight matrix produced by the Softmax computation of the attention mechanism is multiplied element by element with a mask matrix (elements on and below the main diagonal are 1, the remaining elements are 0), so that attention to all subsequent moments is blocked at the current moment. Finally, the temporally fused multi-modal features are sent into the MLP classification head to output the behavior recognition result.
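A minimal sketch of this masked cross-attention fusion and MLP classification head is given below, assuming PyTorch; the number of classes and the temporal mean-pooling before the MLP head are illustrative assumptions. Note that the mask is applied as an element-wise product after the Softmax, as described above, rather than as the more common additive pre-Softmax mask.

```python
import torch
import torch.nn as nn

class MaskedCrossAttentionFusion(nn.Module):
    """Sketch of the masked cross-modal cross-attention fusion and MLP head."""
    def __init__(self, dim=768, num_classes=10):     # num_classes is hypothetical
        super().__init__()
        self.q = nn.Linear(dim, dim)                 # sensor features -> queries
        self.k = nn.Linear(dim, dim)                 # image features  -> keys
        self.v = nn.Linear(dim, dim)                 # image features  -> values
        self.head = nn.Sequential(nn.LayerNorm(dim),
                                  nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, num_classes))

    def forward(self, img_feats, sensor_feats):      # both (B, 32, 768), time-aligned
        Q, K, V = self.q(sensor_feats), self.k(img_feats), self.v(img_feats)
        attn = torch.softmax(Q @ K.transpose(1, 2) / Q.shape[-1] ** 0.5, dim=-1)
        # mask matrix: elements on and below the main diagonal are 1, the rest are 0
        mask = torch.tril(torch.ones(Q.shape[1], K.shape[1], device=Q.device))
        attn = attn * mask                           # element-wise product after the Softmax
        fused = attn @ V                             # (B, 32, 768) temporally fused features
        return self.head(fused.mean(dim=1))          # pool over time, then MLP classification head
```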

Claims (3)

1. A video behavior recognition method based on cross-modal fusion, characterized by comprising the following steps:
a video data processing step: the video stream is downsampled, each downsampled frame is divided into non-overlapping pixel blocks, image feature vectors are computed with a linear projection layer, and the feature vectors are then input into a Transformer spatial encoder to obtain an image feature sequence for each video frame;
an inertial motion sensor data processing step: the sensor data are divided into segments, motion feature vectors are obtained by up-projecting the data segment by segment with a linear mapping, and the sensor feature vectors are input into a Transformer temporal encoder to obtain a sensor feature sequence whose time segments and feature dimension are aligned with the image feature representation;
a video recognition step: the image feature sequence serves as the key and value vectors, the sensor feature sequence serves as the query vector, and both are input into a masked Transformer temporal encoder; the masked Transformer temporal encoder fuses the image feature sequence and the sensor feature sequence segment by segment through cross attention, the weight matrix after the Softmax computation is multiplied element by element with a mask matrix to obtain temporally fused multi-modal features, the multi-modal features are input into a multi-layer perceptron MLP, and the MLP outputs the video behavior recognition result.
2. The method of claim 1, wherein the Transformer spatial encoder comprises layer normalization, spatial self-attention, and a feed-forward network; the Transformer temporal encoder comprises layer normalization, temporal self-attention, and a feed-forward network.
3. The method of claim 1, wherein the mask matrix is a matrix whose elements on and below the main diagonal are 1 and whose remaining elements are 0.
CN202310292682.7A 2023-03-23 2023-03-23 Video behavior recognition method based on cross-modal fusion Pending CN116311525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310292682.7A CN116311525A (en) 2023-03-23 2023-03-23 Video behavior recognition method based on cross-modal fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310292682.7A CN116311525A (en) 2023-03-23 2023-03-23 Video behavior recognition method based on cross-modal fusion

Publications (1)

Publication Number Publication Date
CN116311525A true CN116311525A (en) 2023-06-23

Family

ID=86779518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310292682.7A Pending CN116311525A (en) 2023-03-23 2023-03-23 Video behavior recognition method based on cross-modal fusion

Country Status (1)

Country Link
CN (1) CN116311525A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056540A (en) * 2023-10-10 2023-11-14 苏州元脑智能科技有限公司 Method and device for generating multimedia object based on text
CN117056540B (en) * 2023-10-10 2024-02-02 苏州元脑智能科技有限公司 Method and device for generating multimedia object based on text
CN117671777A (en) * 2023-10-17 2024-03-08 广州易而达科技股份有限公司 Gesture recognition method, device, equipment and storage medium based on radar
CN117671777B (en) * 2023-10-17 2024-05-14 广州易而达科技股份有限公司 Gesture recognition method, device, equipment and storage medium based on radar

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination