CN116311525A - Video behavior recognition method based on cross-modal fusion - Google Patents
- Publication number
- CN116311525A (application CN202310292682.7)
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- sequence
- transformer
- encoder
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Human Computer Interaction (AREA)
- Biophysics (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Psychiatry (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Social Psychology (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a video behavior recognition method based on cross-modal fusion, comprising the following steps. The video stream is downsampled, each downsampled frame is divided into pixel blocks, an image feature vector is computed for each block by a linear projection layer, and the feature vectors are input into a Transformer spatial encoder to obtain an image feature sequence for each video frame. The inertial motion sensor data is split into segments, each segment is raised in dimension by a linear mapping, and the result is input into a Transformer temporal encoder to obtain a sensor feature sequence. The image feature sequence serves as the key and value vectors and the sensor feature sequence as the query vectors of a masked Transformer temporal encoder, which produces a temporally fused multi-modal feature; this feature is input into a multi-layer perceptron (MLP), which outputs the video recognition result. Using the spatial-encoding Transformer and the temporal-encoding Transformer, the invention jointly extracts spatio-temporal semantic features and human motion features from the video stream data and the inertial motion sensor data, and completes behavior recognition with a cross-modal encoding Transformer.
Description
Technical Field
The invention belongs to the field of deep learning and relates to a video behavior recognition technique based on cross-modal fusion.
Background
In recent years, with the spread of wearable smart devices and smart home products and the large volume of user-generated video on social media platforms, video recognition has become one of the most common and popular fields in computer vision. Video behavior recognition has broad application prospects, including human-computer interaction, health monitoring, security monitoring, gaming and entertainment, and video content retrieval. Video recognition is the basis of video understanding: in these applications the actions performed in a scene must be recognized and distinguished, and further decisions or processing may then be based on the inferred result. Research on video behavior recognition methods therefore has great practical significance.
Currently, behavior recognition algorithms based on deep learning have become a popular approach. Using a 3D convolutional neural network to extract scene features and motion features from a video stream is one important branch of behavior recognition; using a recurrent convolutional neural network to extract limb motion features from time-series inertial motion sensor data is another.
To date, video behavior recognition methods based on 3D convolutional neural networks have been widely used and have achieved significant results. Current mainstream algorithms use only a 3D convolutional network to extract video stream features, realizing end-to-end spatio-temporal modeling. Because the large data volume of video brings high computational overhead, a 3D convolutional neural network generally has to downsample a fixed number of pictures from the video's image stream as input; it therefore cannot cover every moment and has difficulty describing fine-grained temporal motion information. More recent mainstream algorithms borrow the Transformer architecture from natural language processing and migrate it to the vision domain: each frame is uniformly divided into non-overlapping pixel blocks that are embedded as feature vectors, so that complex high-dimensional video data is converted into a sequence learning problem. In self-attention, the core module of the Transformer, global joint spatio-temporal self-attention is generally computed over all pixel-block features obtained from the video, which leads to partly redundant and inefficient computation. In addition, whether a 3D convolutional network or a Transformer is used, the modalities tend to be modeled completely separately, and fusing them only at the end of the network does not let them interact fully.
Disclosure of Invention
To address insufficient feature extraction for behavior recognition under single-modality video input, and the difficulty current mainstream models have in flexibly realizing multi-modal fusion, the invention provides a method that jointly extracts features from video data and inertial motion sensor data and fuses the two through full interaction in a multi-modal network model, so as to improve video behavior recognition performance.
The technical solution adopted by the invention to solve these problems is a video behavior recognition method based on cross-modal fusion, comprising the following steps:
video data processing step: the video stream is downsampled, each downsampled frame is divided into non-overlapping pixel blocks, an image feature vector is computed by a linear projection layer, and the feature vectors are then input into a Transformer spatial encoder to obtain an image feature sequence for each video frame;
inertial motion sensor data processing step: the sensor data is split into segments, a motion feature vector is obtained for each segment by a dimension-raising linear mapping, and the motion feature vectors are input into a Transformer temporal encoder to obtain a sensor feature sequence over time segments whose dimension is aligned with that of the image feature representation;
video recognition step: the image feature sequence serves as the key and value vectors and the sensor feature sequence as the query vectors input into a masked Transformer temporal encoder; the masked Transformer temporal encoder fuses the image feature sequence and the sensor feature sequence segment by segment by cross-attention, taking the element-wise product of the weight matrix obtained after the Softmax calculation and a mask matrix to obtain a temporally fused multi-modal feature; the multi-modal feature is input into a multi-layer perceptron (MLP), and the MLP outputs the video recognition result.
The advantage of the method is that spatio-temporal semantic features and human motion features are jointly extracted from the video stream data and the inertial motion sensor data by the spatial-encoding Transformer and the temporal-encoding Transformer, and behavior recognition is completed with a cross-modal encoding Transformer; this achieves feature fusion at fine temporal granularity and improves the accuracy of the model's behavior recognition.
Drawings
FIG. 1 is a flow chart of the model.
FIG. 2 is a schematic diagram of the Transformer spatial encoder computation for a single frame image.
FIG. 3 is a schematic diagram of the Transformer temporal encoder computation for sensor motion features.
FIG. 4 is a schematic diagram of cross-modal temporal cross-attention.
Detailed Description
The embodiment is implemented mainly on a high-performance Linux server with several TITAN X graphics cards. First, a self-developed embedded device is used to capture a large amount of paired video and inertial motion sensor data, from which the daily-life behavior data set required for experimental training is constructed.
The specific steps of the behavior recognition method based on cross-modal fusion are shown in FIG. 1:
1. the video of each data set sample is downsampled to a fixed 32 frames, and data augmentation operations such as random cropping are applied; the inertial motion sensor data is uniformly divided into 32 segments, matching the number of frames;
2. each frame is divided into non-overlapping pixel blocks, a feature vector is computed for each block by a linear projection layer, and the feature representation of each frame is obtained by a Transformer spatial encoder composed of modules such as layer normalization, spatial self-attention, and a feed-forward network (FFN); meanwhile, the inertial motion sensor data is raised in dimension segment by segment by a linear mapping (Linear), and a sensor feature sequence over time segments, aligned with the image feature dimension, is obtained by a Transformer temporal encoder composed of modules such as layer normalization, temporal self-attention, and a feed-forward network (FFN);
3. a masked Transformer temporal encoder fuses the image feature representation and the sensor feature sequence segment by segment by cross-attention, and a multi-layer perceptron (MLP) finally outputs the video recognition result.
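Step 1 above pairs 32 sampled frames with 32 sensor segments. The following is a minimal NumPy sketch of that pairing, not the patent's implementation: the patent does not state how the 32 frames are chosen, so uniform sampling at bin centres is an assumption, and the names `sample_frame_indices` and `split_sensor_segments`, the clip length, and the sensor sampling count are all hypothetical.

```python
import numpy as np

def sample_frame_indices(num_frames: int, num_samples: int = 32) -> np.ndarray:
    """Pick num_samples frame indices spread evenly over a clip.

    Assumption: uniform sampling at the centre of equal-width bins.
    """
    edges = np.linspace(0, num_frames, num_samples + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    return np.minimum(centres.astype(int), num_frames - 1)

def split_sensor_segments(imu: np.ndarray, num_segments: int = 32) -> list:
    """Split a (T, 6) inertial recording into contiguous, near-equal segments."""
    return np.array_split(imu, num_segments, axis=0)

# Hypothetical sample: a 300-frame clip with 1500 six-axis sensor readings
video_len = 300
imu = np.random.default_rng(0).normal(size=(1500, 6))

frame_idx = sample_frame_indices(video_len)   # one index per sampled frame
segments = split_sensor_segments(imu)         # one segment per sampled frame
```

Each entry of `frame_idx` then corresponds to the sensor segment at the same position in `segments`, which is what allows the later segment-by-segment fusion.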
The spatial feature extraction for a single frame image is shown in FIG. 2 and mainly comprises the following steps:
First step: each video clip is downsampled to a fixed 32 frames and randomly cropped to 224×224 pixels, and each frame is then uniformly divided into non-overlapping pixel blocks of 16×16 pixels;
Second step: the feature vector of each pixel block is obtained through a linear projection layer. In practice, the first and second steps are carried out together by a single convolution layer whose stride and kernel size are both 16, which divides the 224×224-pixel image into 14×14 pixel blocks; the convolution layer has 768 channels, so the feature of each pixel block is a 768-dimensional vector;
Third step: an additional class embedding vector is concatenated to the block features of each frame, and a cosine-function position encoding of the pixel blocks is added to introduce their positional relationship in two-dimensional space;
Fourth step: the image block features obtained in the previous step are sent into the Transformer spatial encoder: they are first normalized by LayerNorm, then input into a multi-head spatial self-attention module that weights the importance of the image sub-regions, with a residual connection introduced, and finally passed through a feed-forward network (FFN), completing the encoding of the spatial semantic features of the single frame.
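The first three steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the patent's code: the projection is written as reshape plus matrix multiply, which computes the same thing as a convolution with kernel size and stride 16; the weights are random stand-ins for learned parameters; and a one-dimensional sinusoidal table over the flattened block index is used as a simplification of the two-dimensional position encoding the text describes. All function and variable names are hypothetical.

```python
import numpy as np

PATCH, DIM = 16, 768

def patchify(frame: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """(H, W, C) frame -> (num_blocks, patch*patch*C) flattened pixel blocks."""
    h, w, c = frame.shape
    gh, gw = h // patch, w // patch
    x = frame.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def sincos_positions(n: int, dim: int) -> np.ndarray:
    """Sinusoidal position table of shape (n, dim); dim must be even."""
    pos = np.arange(n)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    table = np.zeros((n, dim))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))                        # one cropped RGB frame
W = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, DIM))  # stand-in for learned weights
b = np.zeros(DIM)
cls_token = np.zeros((1, DIM))                           # stand-in for the learned class embedding

patches = patchify(frame)                                # (196, 768): 14 x 14 blocks
tokens = patches @ W + b                                 # linear projection per block
tokens = np.concatenate([cls_token, tokens], axis=0)     # (197, 768)
tokens = tokens + sincos_positions(tokens.shape[0], DIM) # add position information
```

The resulting `tokens` array is what the fourth step would feed into the spatial encoder. The block count follows from 224 / 16 = 14 per side, giving 14 × 14 = 196 blocks plus one class token.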
The temporal feature extraction for the inertial motion sensor segments is shown in FIG. 3 and mainly comprises the following steps:
First step: the inertial sensor data is uniformly divided into 32 segments in time order, and for each segment the 6-axis data is raised to a 64-dimensional hidden feature by a linear mapping layer, so as to better represent complex limb motion;
Second step: a cosine-function temporal encoding of the segments is added to introduce the temporal relationship among them;
Third step: the segment motion features obtained in the previous step are sent into the Transformer temporal encoder: they are first normalized by LayerNorm, then input into a multi-head temporal self-attention module, with a residual connection introduced, and finally converted by an FFN to the same dimension as the image features for the cross-attention calculation.
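A rough NumPy sketch of these three steps is given below. Several details are assumptions not found in the text: each segment is reduced to one 6-dimensional vector by averaging its readings before the linear mapping, attention uses a single head instead of multiple heads, the FFN is a two-layer ReLU network, and all weights are random stand-ins for learned parameters. Names such as `sensor_seq` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def sincos_positions(n, dim):
    """Sinusoidal position table of shape (n, dim); dim must be even."""
    pos = np.arange(n)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    table = np.zeros((n, dim))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

rng = np.random.default_rng(0)
SEGS, HID, OUT = 32, 64, 768

# First step: 32 segments, 6 axes -> 64-dimensional hidden features
imu = rng.normal(size=(1600, 6))                 # hypothetical recording
seg_means = np.stack([s.mean(axis=0) for s in np.array_split(imu, SEGS)])  # (32, 6), assumed pooling
W_in = rng.normal(scale=0.1, size=(6, HID))
x = seg_means @ W_in                             # (32, 64)

# Second step: add the temporal position encoding
x = x + sincos_positions(SEGS, HID)

# Third step: LayerNorm -> temporal self-attention (single head here) -> residual
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(HID, HID)) for _ in range(3))
h = layer_norm(x)
attn = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(HID))   # (32, 32) segment-to-segment weights
x = x + attn @ (h @ Wv)

# FFN that also maps 64 -> 768 to match the image feature dimension
W1 = rng.normal(scale=0.1, size=(HID, 4 * HID))
W2 = rng.normal(scale=0.1, size=(4 * HID, OUT))
sensor_seq = np.maximum(layer_norm(x) @ W1, 0) @ W2    # (32, 768)
```

Because the FFN changes the feature width from 64 to 768, this sketch places no residual connection around it.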
Based on the deep features of each frame image and the corresponding segment motion features, a cross-modal cross-attention temporal encoder is used, as shown in FIG. 4. Causality is enforced by a masked self-attention mechanism: in the actual implementation, the weight matrix produced by the Softmax calculation of the attention mechanism is multiplied element by element with a mask matrix (elements on and below the main diagonal are 1, all other elements are 0), so that at the current moment the attention to all subsequent moments is blocked. Finally, the temporally fused multi-modal features are sent into the classification head of the MLP network, which outputs the behavior recognition result.
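The masking described above can be sketched in NumPy as follows. The sketch assumes one 768-dimensional feature per frame and per segment, omits the learned query, key, and value projections, and scales scores by the square root of the feature dimension, which is the usual Transformer convention and is not stated in the text. The function name `masked_cross_attention` is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def masked_cross_attention(query, key, value):
    """Cross-attention with a lower-triangular mask applied after Softmax.

    query: (T, d) sensor features; key, value: (T, d) image features.
    """
    t, d = query.shape
    weights = softmax(query @ key.T / np.sqrt(d))  # (T, T) attention weights
    mask = np.tril(np.ones((t, t)))                # 1 on and below the main diagonal, 0 above
    weights = weights * mask                       # element-wise product blocks later moments
    return weights @ value, weights

rng = np.random.default_rng(0)
T, D = 32, 768
sensor_seq = rng.normal(size=(T, D))   # queries: one feature per sensor segment
image_seq = rng.normal(size=(T, D))    # keys and values: one feature per frame

fused, weights = masked_cross_attention(sensor_seq, image_seq, image_seq)
```

Note that because the mask is applied after Softmax, as the text specifies, the masked rows generally no longer sum to 1; only the last row, which sees every moment, keeps its full weight. This differs from the common alternative of setting masked scores to negative infinity before Softmax.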
Claims (3)
1. A video behavior recognition method based on cross-modal fusion, characterized by comprising the following steps:
video data processing step: the video stream is downsampled, each downsampled frame is divided into non-overlapping pixel blocks, an image feature vector is computed by a linear projection layer, and the feature vectors are then input into a Transformer spatial encoder to obtain an image feature sequence for each video frame;
inertial motion sensor data processing step: the sensor data is split into segments, a motion feature vector is obtained for each segment by a dimension-raising linear mapping, and the motion feature vectors are input into a Transformer temporal encoder to obtain a sensor feature sequence over time segments whose dimension is aligned with that of the image feature representation;
video recognition step: the image feature sequence serves as the key and value vectors and the sensor feature sequence as the query vectors input into a masked Transformer temporal encoder; the masked Transformer temporal encoder fuses the image feature sequence and the sensor feature sequence segment by segment by cross-attention, taking the element-wise product of the weight matrix obtained after the Softmax calculation and a mask matrix to obtain a temporally fused multi-modal feature; the multi-modal feature is input into a multi-layer perceptron (MLP), and the MLP outputs the video behavior recognition result.
2. The method of claim 1, wherein the Transformer spatial encoder comprises layer normalization, spatial self-attention, and a feed-forward network, and the Transformer temporal encoder comprises layer normalization, temporal self-attention, and a feed-forward network.
3. The method of claim 1, wherein the mask matrix is a matrix whose elements on and below the main diagonal are 1 and whose remaining elements are 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310292682.7A CN116311525A (en) | 2023-03-23 | 2023-03-23 | Video behavior recognition method based on cross-modal fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116311525A | 2023-06-23 |
Family
ID=86779518
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310292682.7A (pending; published as CN116311525A) | 2023-03-23 | 2023-03-23 | Video behavior recognition method based on cross-modal fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116311525A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117056540A (en) * | 2023-10-10 | 2023-11-14 | 苏州元脑智能科技有限公司 | Method and device for generating multimedia object based on text |
CN117056540B (en) * | 2023-10-10 | 2024-02-02 | 苏州元脑智能科技有限公司 | Method and device for generating multimedia object based on text |
CN117671777A (en) * | 2023-10-17 | 2024-03-08 | 广州易而达科技股份有限公司 | Gesture recognition method, device, equipment and storage medium based on radar |
CN117671777B (en) * | 2023-10-17 | 2024-05-14 | 广州易而达科技股份有限公司 | Gesture recognition method, device, equipment and storage medium based on radar |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cho et al. | Self-attention network for skeleton-based human action recognition | |
CN113936339B (en) | Fighting identification method and device based on double-channel cross attention mechanism | |
CN111079532B (en) | Video content description method based on text self-encoder | |
CN110796111B (en) | Image processing method, device, equipment and storage medium | |
CN116311525A (en) | Video behavior recognition method based on cross-modal fusion | |
CN113158723B (en) | End-to-end video motion detection positioning system | |
CN109460707A (en) | A kind of multi-modal action identification method based on deep neural network | |
CN114596520A (en) | First visual angle video action identification method and device | |
CN112801068B (en) | Video multi-target tracking and segmenting system and method | |
CN114973049B (en) | Lightweight video classification method with unified convolution and self-attention | |
CN113780249B (en) | Expression recognition model processing method, device, equipment, medium and program product | |
WO2022116616A1 (en) | Behavior recognition method based on conversion module | |
CN113920581A (en) | Method for recognizing motion in video by using space-time convolution attention network | |
Naeem et al. | T-VLAD: Temporal vector of locally aggregated descriptor for multiview human action recognition | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
Guetari et al. | Real time emotion recognition in video stream, using B-CNN and F-CNN | |
Huang et al. | Dynamic sign language recognition based on CBAM with autoencoder time series neural network | |
Tong et al. | D3-LND: A two-stream framework with discriminant deep descriptor, linear CMDT and nonlinear KCMDT descriptors for action recognition | |
WO2023068953A1 (en) | Attention-based method for deep point cloud compression | |
Zhou | Video expression recognition method based on spatiotemporal recurrent neural network and feature fusion | |
Wani et al. | Deep learning-based video action recognition: a review | |
CN113822117B (en) | Data processing method, device and computer readable storage medium | |
CN114782995A (en) | Human interaction behavior detection method based on self-attention mechanism | |
CN114120076A (en) | Cross-view video gait recognition method based on gait motion estimation | |
Zhou et al. | Facial expressions and body postures emotion recognition based on convolutional attention network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||