CN115457657A - Method for identifying channel characteristic interaction time modeling behaviors based on BERT model - Google Patents

Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Info

Publication number
CN115457657A
Authority
CN
China
Prior art keywords
channel
bert
module
sub
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211083801.XA
Other languages
Chinese (zh)
Inventor
李晓潮
杨曼
甘利鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202211083801.XA priority Critical patent/CN115457657A/en
Publication of CN115457657A publication Critical patent/CN115457657A/en
Pending legal-status Critical Current


Classifications

    • G06V 40/20 Recognition of biometric, human-related or animal-related patterns in image or video data: movements or behaviour, e.g. gesture recognition
    • G06N 3/049 Neural network architectures: temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Neural networks: learning methods
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V 10/7715 Processing image or video features in feature spaces: feature extraction, e.g. by transforming the feature space
    • G06V 10/809 Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/41 Scene-specific elements in video content: higher-level, semantic clustering, classification or understanding of video scenes
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

A method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, belonging to the technical fields of computer vision, deep learning and behavior recognition. An action video is decomposed into a corresponding RGB image sequence and fed into a two-dimensional convolutional neural network. Based on the features extracted by the two-dimensional convolutional neural network, a channel reorganization module and a channel BERT model perform self-attention calculation on the sub-channel feature sequences, extracting the key sub-channel features with large variation over time together with the interactive correlations among them, and thereby obtaining the key semantic features that distinguish action categories and their correlations, which improves behavior classification accuracy. By combining the channel BERT with the temporal BERT, the method further attends to the semantic features of key channels within key frames, yielding higher behavior recognition accuracy.

Description

Method for identifying channel characteristic interaction time modeling behaviors based on BERT model
Technical Field
The invention belongs to the technical fields of computer vision, deep learning and behavior recognition, and particularly relates to a method for recognizing behaviors by modeling channel-feature interaction over time based on a BERT (Bidirectional Encoder Representations from Transformers) model.
Background
Behavior recognition is one of the basic tasks of computer vision and is widely applied in scenarios such as human-computer interaction, video retrieval and intelligent security monitoring. Behavior recognition technology enables a computer to understand human actions and behaviors by processing and analyzing video data. A key technique for behavior recognition is therefore the modeling of video semantic features along the time dimension as the action evolves. First, feature modeling must extract the spatio-temporal variation of human behavior in the video and describe, through spatio-temporal features, how the appearance of different behaviors changes. Second, the channel relationships among video frames at different moments must be considered, since effective channel-feature interaction yields a more complete representation of the video semantics. Temporal modeling of channel-feature interaction in video is therefore an effective way to improve the accuracy of the behavior recognition task.
Analysis and experiments show that embedding a BERT self-attention module in a two-dimensional convolutional network allows the temporal correlation between image frames to be learned and extracted, improving behavior recognition accuracy. On this basis, a channel reorganization module is proposed: it separates the channel features of consecutive frames into N sub-channels and splices the corresponding sub-channel features of each frame along the time dimension to form sub-channel feature time sequences. A channel-feature interaction model is then built over the reorganized sub-channel groups with a channel BERT self-attention mechanism; according to the similarity computed among the sub-channel groups, the key sub-channel feature sequences with large temporal variation and the interactive correlations among them are extracted and reinforced, yielding the key semantic features that distinguish action categories. To extract the interactive correlations of both the channel and the time dimension of the frame-image features simultaneously, a joint-BERT model is proposed that fuses the two branches extracting channel correlation and temporal correlation, so that the semantic features of key channels within key frames receive further attention and recognition accuracy improves. Meanwhile, the channel BERT module (70) and the temporal BERT module (71) adopt a weight-sharing strategy, which reduces the number of weights of the overall model.
In recent years, research and patents related to the proposed method for identifying channel-feature interaction time modeling behaviors based on the BERT model include the following:
In the article "Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition", published at the ECCV (European Conference on Computer Vision) 2020, Kalfaoglu et al. replace the TGAP (temporal global average pooling) layer at the end of a 3D CNN with BERT to learn the important temporal features in video frames, enhancing the late temporal modeling capability of the CNN backbone. In contrast, the present invention uses a 2D CNN to extract the spatial features of the video, obtains the correlation between video time frames with BERT, and adds a channel BERT for extracting the correlation between the image channel features of the frames. A channel reorganization module is first designed to gather the channel information of all adjacent frames into sub-channel feature sequences, and the self-attention mechanism of the channel BERT then computes the similarity between different sub-channel feature sequences to learn the semantic features of key channels in the video frames. In addition, the proposed method uses two-dimensional convolution instead of three-dimensional convolution to extract the semantic features of each image frame, which greatly reduces the computation cost and parameter count and improves the computational efficiency of the model.
The "Channel interaction networks for fine-grained image capture" article published in the 34 th AAAI Conference on intellectual association in 2020 by Gao, y, et al proposes a self-Channel interaction (SCI) module to establish a correlation model between different channels in an image frame, and a Contrast Channel Interaction (CCI) module to model the cross-sample Channel interaction relationship between two image frames. In contrast, the channel reorganization module provided by the invention firstly performs channel separation and reorganization on the feature sequences of all image frames extracted by the 2D CNN to obtain a sub-channel group containing the feature sequences of all adjacent frame channels, and then performs self-attention calculation on the sub-channel feature sequences by using a BERT self-attention mechanism, thereby not only improving the channel interaction capacity in one image frame or between two image frames, but also enhancing the channel interaction relation between all the image frames and between the sub-channel groups to obtain more complete channel semantic representation, so that all the adjacent time frames of the video, not only the feature channels on the two image frames, can learn the semantic features with identification in the video, and improving the classification accuracy.
Chinese patent CN111597929A discloses a group behavior recognition method based on channel information fusion and structured spatial modeling of group relationships. To address the accuracy of collective behavior recognition, it adopts the CSTM and CMM modules of the 2D network STM to extract per-frame spatio-temporal information and inter-frame motion information respectively, proposes a channel selection module to fuse the spatio-temporal and motion information of each frame, and extracts a collective relationship evolution model through a graph convolution-LSTM network; the videos are sparsely sampled in order to capture long sequences. The present invention instead feeds the per-frame spatial information extracted by the 2D CNN into a BERT self-attention mechanism to obtain the temporal variation of inter-frame space and channels over a period of time. Learning sequence features through a multi-head self-attention mechanism not only yields key image frames and key channel semantic features but also improves the long-term temporal modeling capability of the behavior recognition model. In addition, the present invention does not need to extract a collective relationship evolution model.
Chinese patent CN113591774A discloses a Transformer-based behavior recognition algorithm. In that invention, a convolutional layer in the pose estimation part extracts the temporal and spatial information of skeleton points from the original video simultaneously, and the fused temporal and spatial information is fed directly into a Transformer self-attention network to obtain the variation of the human skeleton nodes. The present method first uses a 2D CNN to extract the spatial information of each frame and then uses the BERT self-attention mechanism to extract the variation between space and time for behavior recognition, fully exploiting the strengths of 2D convolution for spatial feature extraction and of BERT self-attention for sequence processing, which improves the behavior modeling capability of the neural network.
Chinese patent CN113673489A discloses a video group behavior recognition method based on cascaded Transformers, in which a first Transformer module detects human targets to extract individual features from key frames and a second Transformer module models the hierarchical relationship between individuals and the group to complete the group behavior recognition task. The BERT module adopted by the present invention is a self-attention model built on a bidirectional Transformer; the BERT modules in branch 1 and branch 2 learn the similarity among sub-channel groups and among time frames respectively to complete behavior recognition for an individual, without extracting a hierarchy between individuals and a group. Moreover, unlike a method that only attends to the individual features of key frames, the present invention applies a weight-sharing strategy to the BERT parameter matrices during joint-BERT training and jointly learns the key channel and temporal features of the human body, fusing information of different dimensions to improve the completeness of the description of human action features and further attending to the semantic features of key channels within key frames.
Disclosure of Invention
The object of the invention is to provide a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, which improves behavior classification accuracy and overcomes the deficiencies of the prior art. The method combines a 2D convolutional network with a BERT self-attention mechanism, studies in particular the extraction of correlations between the time and channel dimensions of frame-image features, learns the key channel semantic features within key frames through the fusion mechanism of the BERT self-attention model, and optimizes the fusion of image frames along the time and channel dimensions, thereby meeting the need to improve video classification accuracy.
The invention provides a joint-BERT self-attention model that extracts key channel and temporal features simultaneously on top of a two-dimensional convolutional neural network. The model consists of a 1st branch that extracts the correlation among channels and a 2nd branch that extracts the temporal correlation among image frames. The two-dimensional convolutional neural network consists of several 2D convolutional layers and obtains the spatial features of the video image frames by convolving along the spatial dimensions. The 1st branch starts from the semantic features of the reorganized sub-channel groups and uses a multi-head self-attention mechanism to establish interaction among the sub-channels, obtaining the key semantic features that distinguish action categories and their correlations. The 2nd branch uses a self-attention mechanism to fuse image frames at different moments and extracts the key image frames according to the similarity computed between frames.
The method comprises the following specific steps:
1) Decompose the action video into a corresponding RGB image sequence (10) and input it into a two-dimensional convolutional neural network module (20) for feature extraction, obtaining a feature map (30) of dimensions B, T, C, H and W, where B denotes the batch size of input video frames during batch training, C the number of channels, T the number of consecutive frames, and H, W the height and width of the input image;
2) Input the extracted feature map into a pooling module (40) for spatial average pooling, obtaining a feature sequence F (50) of dimensions B, T and C;
3) Input the feature sequence F into the two branches of the joint-BERT self-attention model and extract channel and temporal features respectively: in the 1st branch, the feature sequence F (50) is input into the channel reorganization module (60); the reorganized sub-channel feature sequence X_C it outputs is weighted by the channel BERT module (70) and then passed through a fully connected layer to obtain a first prediction matrix (80) for behavior recognition; in the 2nd branch, the feature sequence F (50) is input into the temporal BERT module (71) and passed through a fully connected layer to obtain a second prediction matrix (81) for behavior recognition; the channel BERT module (70) of branch 1 and the temporal BERT module (71) of branch 2 share parameters;
4) Weight and fuse the first prediction matrix (80) and the second prediction matrix (81) and input the result into a classification module (90) to obtain the behavior recognition classification result (a minimal end-to-end sketch of these four steps is given below).
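The four steps above can be summarized in the following minimal PyTorch-style sketch. It is an illustration, not the patent's implementation: the class and argument names, the ResNet-50 backbone choice and the fusion weights are assumptions, and the two branch modules stand for the channel branch (60, 70) and the temporal branch (71) elaborated later in this description.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointBERTRecognizer(nn.Module):
    """High-level sketch of steps 1)-4): 2D CNN backbone, spatial pooling,
    two BERT branches, weighted fusion of the two prediction matrices."""
    def __init__(self, channel_branch: nn.Module, temporal_branch: nn.Module,
                 beta: float = 0.5, gamma: float = 0.5):
        super().__init__()
        backbone = models.resnet50(weights=None)           # 2D CNN module (20), ResNet-50 assumed
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)                # pooling module (40)
        self.channel_branch = channel_branch               # channel reorganization (60) + channel BERT (70)
        self.temporal_branch = temporal_branch             # temporal BERT (71)
        self.beta, self.gamma = beta, gamma                # fusion weights for step 4)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W) -> per-frame features, then feature sequence F of shape (B, T, C)
        b, t = video.shape[:2]
        feat = self.backbone(video.flatten(0, 1))          # (B*T, C, h, w)
        f = self.pool(feat).flatten(1).view(b, t, -1)      # feature sequence F (50)
        y_cb = self.channel_branch(f)                      # first prediction matrix (80)
        y_tb = self.temporal_branch(f)                     # second prediction matrix (81)
        return self.beta * y_cb + self.gamma * y_tb        # fused scores for the classifier (90)
```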
Further, in step 3) of the above technical solution, the joint-BERT self-attention model comprises two branches: a 1st branch for extracting inter-channel correlation and a 2nd branch for extracting temporal correlation between image frames. The 1st branch starts from the semantic features of the reorganized sub-channel groups and uses a multi-head self-attention mechanism to establish interaction among the sub-channels, obtaining the key semantic features that distinguish action categories and their correlations; the 2nd branch uses a self-attention mechanism to fuse image frames at different moments and extracts the key image frames according to the similarity computed between frames.
In step 3), the specific steps of extracting the channel and temporal features in branch 1 and branch 2 respectively are:
(1) In the 1st branch, the key channel semantic features that distinguish action categories and their correlations are obtained by the channel reorganization module (60) and the channel BERT module (70): the feature sequence F (50) extracted by the two-dimensional convolutional neural network is input into the channel reorganization module (60), where the corresponding sub-channel features of adjacent frames are reorganized and spliced along the time dimension to form sub-channel feature time sequences X_C that contain the temporal variation; the output reorganized sub-channel feature sequence X_C then undergoes self-attention calculation in the channel BERT module (70), which extracts the key sub-channel features with large variation over time and the interactive correlations among them; weighting is performed according to the correlations among the sub-channel feature sequences, and the output of a fully connected layer gives a first prediction matrix (80) for behavior recognition, thereby modeling channel-feature interaction along the time dimension.
(2) In the 2nd branch, the pooled feature sequence F (50) is directly input into the temporal BERT module (71), the similarity between video frames is computed, and a second prediction matrix (81) for behavior recognition is obtained through the output of a fully connected layer.
The channel reorganization module (60) in step 3) of the above technical solution is shown in Fig. 2 and is characterized in that it comprises a channel separation module (601) and sub-channel feature sequences (602). The feature sequence F is input to the channel separation module (601), which equally divides it along the channel dimension into N sub-channels, each containing C/N channel features, i.e. $F'=[F'(1),F'(2),\dots,F'(N)]$ with $F'(n)\in\mathbb{R}^{B\times T\times C/N}$. Splicing the sub-channel features $F'$ corresponding to adjacent frames along the time dimension gives the sub-channel feature sequence (602) $X_C=[X_C(1),X_C(2),\dots,X_C(N)]\in\mathbb{R}^{B\times N\times N_C}$, where $N_C=T\times C/N$; specifically, for the $n$-th ($1\le n\le N$) sub-channel group, $X_C(n)=[F'_1(n),F'_2(n),\dots,F'_T(n)]$, indicating that every sub-channel group contains the feature sequence information of the T image frames.
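Under these definitions, the channel reorganization amounts to a pair of reshape/permute operations on the feature sequence F. The sketch below is illustrative (the function name and the example sizes are not from the patent) and assumes C is divisible by N:

```python
import torch

def channel_reorganize(f: torch.Tensor, n_groups: int) -> torch.Tensor:
    """Split F (B, T, C) into N sub-channels and splice each sub-channel
    across the T frames, giving X_C of shape (B, N, T*C/N)."""
    b, t, c = f.shape
    assert c % n_groups == 0, "C must be divisible by N"
    f = f.view(b, t, n_groups, c // n_groups)          # F'(n): (B, T, N, C/N)
    f = f.permute(0, 2, 1, 3)                          # group sub-channels: (B, N, T, C/N)
    return f.reshape(b, n_groups, t * c // n_groups)   # X_C: (B, N, N_C), N_C = T*C/N

# usage: 8 videos, 16 frames, 2048-dim features, N = 8 sub-channel groups
x_c = channel_reorganize(torch.randn(8, 16, 2048), n_groups=8)  # -> (8, 8, 4096)
```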
The channel BERT module (70) in step 3) of the above technical solution is shown in Fig. 3 and is characterized in that it comprises a position encoding layer (701), a multi-head self-attention module (702), a channel connection module (703) and a fully connected layer (704). The feature sequence $X_C$ output by the channel reorganization module is input to the position encoding layer (701) of the channel BERT module (70), which encodes the position information to obtain the position-embedded features $\hat{X}_C$. The position-embedded features $\hat{X}_C$ are input to the multi-head attention mechanism and the position-wise feed-forward network PFFN(·) layer of the multi-head self-attention module (702); through self-attention calculation and the nonlinear mapping of the PFFN(·) layer, a matrix $Y_C$ is obtained that highlights channel differences and in which the sub-channel groups interact. The outputs $Y_C$ of all sub-channel groups are input to the channel connection module (703) and spliced along the channel dimension, giving a matrix $y_C$ with the same channel dimension as the feature F; the matrix $y_C$ is input to the fully connected layer (704) to obtain the first prediction matrix (80) for behavior recognition.
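A minimal sketch of branch 1 consistent with this description is given below. It substitutes PyTorch's nn.TransformerEncoderLayer for the multi-head self-attention plus PFFN pair of the channel BERT module (70), repeats the channel reorganization (60) inline, and treats all sizes (channel width, frame count, group count, head count, class count) as assumed hyper-parameters rather than values fixed by the patent:

```python
import torch
import torch.nn as nn

class ChannelBERT(nn.Module):
    """Sketch of branch 1: channel reorganization (60) + channel BERT (70) + FC head."""
    def __init__(self, channels: int = 2048, frames: int = 16, n_groups: int = 8,
                 heads: int = 8, num_classes: int = 174):
        super().__init__()
        self.n_groups = n_groups
        n_c = frames * channels // n_groups                        # N_C = T * C / N
        self.pos = nn.Parameter(torch.zeros(1, n_groups, n_c))     # learnable positions P_C
        self.encoder = nn.TransformerEncoderLayer(                 # self-attention + feed-forward
            d_model=n_c, nhead=heads, dim_feedforward=2 * n_c,
            activation="gelu", batch_first=True)
        self.project = nn.Linear(n_c, channels // n_groups)        # map each group back to C/N
        self.fc = nn.Linear(channels, num_classes)                 # first prediction matrix (80)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, t, c = f.shape                                           # feature sequence F: (B, T, C)
        x_c = (f.view(b, t, self.n_groups, c // self.n_groups)      # channel separation (601)
                .permute(0, 2, 1, 3)
                .reshape(b, self.n_groups, -1))                     # X_C: (B, N, N_C)
        y_c = self.encoder(x_c + self.pos)                          # interaction among sub-channel groups
        y_c = self.project(y_c).flatten(1)                          # channel connection (703) -> (B, C)
        return self.fc(y_c)                                         # class scores
```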
The invention provides a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, which jointly learns the temporal variation of video frames and the interactive correlation between channels. In the 1st branch, the channel reorganization module and the channel BERT module extract the key sub-channel features with large variation over time and the interactive correlations among them, strengthening the channel interaction among image frames and among sub-channel groups and yielding the key semantic features that distinguish action categories and their correlations, which improves behavior classification accuracy. The joint-BERT combines the 1st branch extracting inter-channel correlation with the 2nd branch extracting temporal correlation, and shares the learnable parameters of the multi-head self-attention layers and the feed-forward network layers between the channel BERT and the temporal BERT; thus the channel BERT extracts the key channel semantic features and their correlations while the temporal BERT identifies the key action frames, so that the semantic features of key channels within key frames receive further attention and a higher behavior recognition accuracy is obtained, reaching a leading level of action recognition accuracy at home and abroad.
Compared with the prior art, the invention innovatively provides a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model; its outstanding technical effects are as follows:
1. Based on the channel reorganization module and the channel BERT module, the key sub-channel features with large variation over time and the interactive correlations among them are extracted, and channel aggregation is performed according to the correlations among the sub-channel feature sequences, yielding the key semantic features that distinguish action categories and their correlations and thereby improving behavior classification accuracy.
2. By combining the channel BERT and the temporal BERT, the key channel semantic features and their correlations are extracted by the channel BERT while the key action frames are obtained by the temporal BERT; fusing the two allows further attention to the key channel semantic features within key frames, further improving behavior classification accuracy and reaching a leading level of action recognition accuracy at home and abroad.
3. The method uses a 2D CNN to extract the semantic features of each image frame and then a BERT self-attention mechanism to extract the variation between space and time for behavior recognition. Compared with traditional 3D CNN-based action recognition methods, the 2D CNN has fewer parameters and a lower computation cost. Meanwhile, the weight sharing between the temporal BERT and the channel BERT means the joint-BERT network model adds no extra parameters.
4. The behavior recognition accuracy of the invention on the public Something-Something (Sth-Sth) V1, V2 and HMDB-51 datasets is improved to 57.1%, 68.2% and 83.8% respectively, reaching a leading level of behavior recognition accuracy at home and abroad.
Drawings
FIG. 1 is a schematic diagram of a process framework of the present invention.
FIG. 2 is a schematic structural diagram of a channel reconfiguration module according to the present invention.
Fig. 3 is a schematic structural design diagram of the channel BERT module of branch 1 of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following examples will help those skilled in the art to further understand the invention but do not limit it in any way. It should be noted that persons of ordinary skill in the art can make several variations and improvements without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention provides a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, consisting of a 1st branch that extracts the correlation among channels and a 2nd branch that extracts the temporal correlation among image frames. As shown in Fig. 1, this embodiment specifically comprises the following steps:
1) Decompose the action video into a corresponding RGB image sequence (10) and input it into a two-dimensional convolutional neural network module (20) for feature extraction, obtaining a feature map (30) of dimensions B, T, C, H and W, where B denotes the batch size of input video frames during batch training, C the number of channels, T the number of consecutive frames, and H, W the height and width of the input image. In this embodiment, the 2D ResNet-50 residual network of the TDN network (existing method 3 below) is selected for extracting the spatial features of the RGB sequence.
2) Input the extracted feature map into a pooling module (40) for spatial average pooling, obtaining a feature sequence F (50) of dimensions B, T and C.
3) Input the feature sequence F (50) into the two branches of the joint-BERT: the 1st branch uses the channel BERT model to extract the correlation among channels, and the 2nd branch uses the temporal BERT model to extract the temporal correlation among image frames; during the joint training of the 1st and 2nd branches, parameters are shared between the channel BERT module (70) and the temporal BERT module (71).
4) The 1st branch first inputs the feature sequence F (50) into the channel reorganization module (60), which, as shown in Fig. 2, obtains the reorganized sub-channel feature sequence X_C (602) through the channel separation module (601). The reorganized sub-channel feature sequence X_C is then input into the channel BERT module (70) of Fig. 1; as detailed in Fig. 3, it passes through the position encoding layer (701) and then the multi-head attention mechanism and position-wise feed-forward network PFFN(·) of the multi-head self-attention module (702), giving a matrix Y_C in which the sub-channel groups interact and which highlights channel differences. The output matrices Y_C of all sub-channel groups are input to the channel connection module (703) and spliced along the channel dimension to obtain an output vector y_C with the same dimension as the channels of the feature map, which is input to the fully connected layer (704) to obtain the first prediction matrix (80) for behavior recognition.
5) The 2nd branch directly inputs the spatially average-pooled feature sequence F (50) into the temporal BERT module (71) and obtains a second prediction matrix (81) for behavior recognition by computing the similarity between video frames.
6) The first prediction matrix (80) and the second prediction matrix (81) are input into a classification module (90) to obtain the behavior recognition classification result.
In step 4), as shown in Fig. 2, the channel reorganization module (60) takes the feature sequence F extracted by the 2D ResNet-50 network as its input, where $F=[F(1),F(2),\dots,F(T)]$ and $F\in\mathbb{R}^{B\times T\times C}$. First, the feature sequence F is equally split along the channel dimension into N sub-channels by the channel separation module (601), denoted $F'=[F'(1),F'(2),\dots,F'(N)]$ with $F'(n)\in\mathbb{R}^{B\times T\times C/N}$. The resulting sub-channel features $F'$ are then spliced across the sub-channels of different frames along the time dimension to obtain the reorganized sub-channel feature sequence (602) containing the channel information of adjacent frames, $X_C=[X_C(1),X_C(2),\dots,X_C(N)]\in\mathbb{R}^{B\times N\times N_C}$, where $N_C=T\times C/N$; specifically, for the $n$-th ($1\le n\le N$) sub-channel group, $X_C(n)=[F'_1(n),F'_2(n),\dots,F'_T(n)]$, where $F'_T(n)$ denotes the T-th frame of the $n$-th sub-channel ($1\le n\le N$). Thus N reorganized sub-channel feature sequences are generated, and each channel group contains the medium- and long-term variation between the features of the T image frames.
In step 4), as shown in Fig. 3, the channel BERT module (70) takes the feature sequence $X_C$ output by the channel reorganization module as its input, and the interaction between the channel groups and adjacent video frames is strengthened by the self-attention mechanism of BERT. The position encoding layer (701) of the channel BERT self-attention module yields the learnable position-embedded features $\hat{X}_C=X_C+P_C$, where $P_C$ denotes a learnable position parameter with the same dimension as $X_C$.
The position-embedded features $\hat{X}_C$ are input to the multi-head self-attention module (702) for the Attention(·) calculation. As usual, the query $Q=\hat{X}_C W_Q$, key $K=\hat{X}_C W_K$ and value $V=\hat{X}_C W_V$ are the corresponding query, key and value matrices obtained from $\hat{X}_C$ by linear transformation, with $d_q=d_k=d_v=N_C/h$. The Attention(·) calculation of the $i$-th self-attention head can thus be expressed as:
$$\mathrm{head}_i=\mathrm{Attention}(Q_i,K_i,V_i)=\mathrm{softmax}\!\left(\frac{Q_iK_i^{\top}}{\sqrt{d_k}}\right)V_i$$
where $\sqrt{d_k}$ is a scale factor and $Q_iK_i^{\top}$ describes the similarity between $Q_i$ and $K_i$; the Attention(·) calculation therefore captures the channel interaction among all sub-channel groups in $\hat{X}_C$.
The outputs of the h self-attention heads are concatenated and then linearly transformed to obtain the output of the multi-head self-attention mechanism:
$$\mathrm{MultiHead}(\hat{X}_C)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W_O$$
where $W_O$ is the learnable output projection matrix. The features learned by the multi-head attention mechanism are then input into the position-wise feed-forward network PFFN(·) to obtain the matrix in which the sub-channel groups interact:
$$Y_C=\mathrm{PFFN}\big(\mathrm{MultiHead}(\hat{X}_C)\big)$$
where $\mathrm{PFFN}(x)=W_2\,\mathrm{GELU}(W_1x+b_1)+b_2$ and GELU(·) is the activation function.
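Read literally, PFFN(·) is a two-layer position-wise feed-forward network with a GELU activation between the layers; a minimal sketch (hidden width left as a free parameter) is:

```python
import torch.nn as nn

class PFFN(nn.Module):
    """Position-wise feed-forward network: PFFN(x) = W2 * GELU(W1 x + b1) + b2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)   # W1, b1
        self.w2 = nn.Linear(d_hidden, d_model)   # W2, b2
        self.act = nn.GELU()

    def forward(self, x):
        return self.w2(self.act(self.w1(x)))
```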
The output matrices $Y_C$ of all the reorganized sub-channel sequences are input to the channel connection module (703); each $Y_C(n)$ is mapped to the C/N dimension by a trainable linear projection $E$ and then spliced along the channel dimension:
$$y_C=\mathrm{Concat}\big(Y_C(1)E,\;Y_C(2)E,\;\dots,\;Y_C(N)E\big)\in\mathbb{R}^{B\times C}$$
Finally, $y_C$ is input to the fully connected layer for the final action prediction: a fully connected layer maps $y_C$ to a prediction matrix with as many entries as behavior recognition video classes, $y_{cb}=y_C W_1$. Meanwhile, in order to better aggregate the information of the video frames, the feature sequence F is average-pooled along the time dimension: $y_{avg}=\mathrm{AvgPool}(FW_2)$, where $W_1,W_2\in\mathbb{R}^{C\times N_{FC}}$ and $N_{FC}$ is the number of behavior recognition video classes. Thus the loss function for the end-to-end training of the channel BERT is
$$L_{CB}=\alpha\,L(y_{cb},\hat{y})+(1-\alpha)\,L(y_{avg},\hat{y})$$
where $\alpha$ is a hyper-parameter with $0<\alpha<1$, $L(\cdot)$ is the cross-entropy loss function and $\hat{y}$ is the ground-truth label.
In step 5), the 2nd branch directly inputs the pooled feature sequence F into the temporal BERT module (71). The feature sequence F passes through the position encoding layer of the temporal BERT module to obtain the learnable position-embedded features $\hat{F}=F+P_T$, where $P_T$ denotes a learnable position parameter with the same dimension as $F$. During training, the channel BERT module (70) and the temporal BERT module (71) share the parameter matrices $W_Q$, $W_K$ and $W_V$.
To perform classification with the temporal BERT module, an additional learnable classification embedding vector $x_{cls}$ is appended to the frame feature sequence, and the corresponding output classification vector $y_{cls}$ is used as the aggregated representation. In the Attention(·) calculation, $q_{cls}K_i^{\top}$ describes the similarity between the classification query $q_{cls}$ and the keys $K_i$ of the image frames, so that the Attention(·) calculation captures the relative importance of all image frames with respect to $x_{cls}$.
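Branch 2 therefore follows the usual BERT classification recipe: prepend a learnable classification token to the pooled frame features, run the self-attention encoder, and read the class scores off that token's output. The sketch below is illustrative (hyper-parameters are assumed, nn.TransformerEncoderLayer again stands in for the multi-head self-attention plus PFFN pair, and the weight sharing with the channel BERT is not shown):

```python
import torch
import torch.nn as nn

class TemporalBERT(nn.Module):
    """Sketch of branch 2: temporal BERT (71) with a classification embedding vector."""
    def __init__(self, channels: int = 2048, frames: int = 16,
                 heads: int = 8, num_classes: int = 174):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, channels))            # x_cls token
        self.pos = nn.Parameter(torch.zeros(1, frames + 1, channels))   # positions P_T
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels,
            activation="gelu", batch_first=True)
        self.fc = nn.Linear(channels, num_classes)                      # head mapping y_cls to classes

    def forward(self, f: torch.Tensor) -> torch.Tensor:                 # f: (B, T, C)
        cls = self.cls.expand(f.size(0), -1, -1)                        # (B, 1, C)
        x = torch.cat([cls, f], dim=1) + self.pos                       # (B, T+1, C)
        y_cls = self.encoder(x)[:, 0]                                   # classification vector
        return self.fc(y_cls)                                           # second prediction matrix
```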
Thus, the total loss function for the end-to-end training of the joint-BERT is the cross-entropy loss computed on the fused prediction, where $y_{tb}$ is the prediction class vector of the temporal BERT module, generated by mapping $y_{cls}$ to the action categories through a fully connected layer $W_3$, and $y_{jb}=\beta y_{cb}+\gamma y_{tb}$ is the weighted fusion of the outputs of branch 1 and branch 2; $\beta$ and $\gamma$ are hyper-parameters with $0<\beta<1$ and $0<\gamma<1$.
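The fused prediction and a training step can be sketched as follows; since the total-loss formula in the original is an unrecoverable image, plain cross-entropy on the fused prediction y_jb is used here as a stand-in, and the fusion weights and class count are illustrative:

```python
import torch
import torch.nn.functional as F

def joint_bert_step(y_cb, y_tb, target, beta=0.6, gamma=0.4):
    """Weighted fusion y_jb = beta*y_cb + gamma*y_tb, then an assumed cross-entropy objective."""
    y_jb = beta * y_cb + gamma * y_tb          # fused prediction of branch 1 and branch 2
    loss = F.cross_entropy(y_jb, target)       # stand-in training objective
    return loss, y_jb

# usage with dummy scores for a 174-class problem (batch of 4)
scores_cb, scores_tb = torch.randn(4, 174), torch.randn(4, 174)
labels = torch.randint(0, 174, (4,))
loss, fused = joint_bert_step(scores_cb, scores_tb, labels)
```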
To validate the invention, experiments were performed on the public Something-Something (Sth-Sth) V1, V2 and HMDB-51 datasets commonly used for behavior recognition. Table 1 compares the behavior recognition accuracy with other known advanced methods; the accuracy of the proposed method on the Something-Something (Sth-Sth) V1, V2 and HMDB-51 datasets reaches 57.1%, 68.2% and 83.8% respectively. The comparison shows that the proposed joint-BERT model models channel-feature interaction along the time dimension more effectively and therefore obtains better behavior recognition accuracy, reaching a leading level at home and abroad.
TABLE 1  Comparison of behavior recognition accuracy

Method                        Sth-Sth V1    Sth-Sth V2    HMDB51
Existing method 1             47.2%         63.4%         73.5%
Existing method 2             51.0%         62.9%         75.7%
Existing method 3             53.9%         65.3%         76.3%
Embodiment of the invention   57.1%         68.2%         83.8%
Existing method 1: Lin, J. et al., "TSM: Temporal Shift Module for Efficient Video Understanding", ICCV (IEEE International Conference on Computer Vision), 2019.
Existing method 2: Wu, W. et al., "MVFNet: Multi-View Fusion Network for Efficient Video Recognition", AAAI (Association for the Advancement of Artificial Intelligence), 2021.
Existing method 3: Wang, L. et al., "TDN: Temporal Difference Networks for Efficient Action Recognition", CVPR (IEEE Conference on Computer Vision and Pattern Recognition), 2021.
The invention combines a 2D convolutional network with a BERT self-attention mechanism into a method for recognizing behaviors by modeling channel-feature interaction over time. The method extracts the key sub-channel features with large variation over time and the interactive correlations among them, obtaining the key semantic features that distinguish action categories and their correlations and thereby improving behavior classification accuracy. By combining the channel BERT and the temporal BERT, the semantic features of key channels within key frames receive further attention, yielding higher behavior recognition accuracy.

Claims (5)

1. A method for identifying channel feature interaction time modeling behaviors based on a BERT model is characterized by comprising the following specific steps:
1) Decomposing the motion video into corresponding RGB image sequences, inputting the RGB image sequences into a two-dimensional convolution neural network, and extracting B, T, C, H and W dimension feature maps; b represents the batch number of input video frames during batch training, C represents the channel number, T represents a T frame continuous image, and H, W represents the height and width of the input image;
2) Inputting the extracted feature map into a pooling module to perform space average pooling operation to obtain a B, T and C dimension feature sequence F;
3) Respectively inputting the feature sequence F into the two branches of the joint-BERT self-attention model and respectively extracting channel and temporal features: in the 1st branch, the extracted feature sequence F is input into a channel reorganization module, and the output reorganized sub-channel feature sequence X_C, after weighted processing by a channel BERT module, is passed through a fully connected layer to obtain a first prediction matrix for behavior recognition; in the 2nd branch, the feature sequence F is input into a temporal BERT module and passed through a fully connected layer to obtain a second prediction matrix for behavior recognition; the channel BERT module and the temporal BERT module of the 1st branch and the 2nd branch share parameters;
4) And weighting and fusing the first prediction matrix and the second prediction matrix, and inputting the weighted and fused first prediction matrix and second prediction matrix into a classification module to obtain a classification result of behavior recognition.
2. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the joint-BERT self-attention model comprises two branches: a 1st branch for extracting inter-channel correlation and a 2nd branch for extracting temporal correlation between image frames; the 1st branch starts from the semantic features of the reorganized sub-channel groups and uses a multi-head self-attention mechanism to establish interaction among the sub-channels, obtaining the key semantic features that distinguish action categories and their correlations; the 2nd branch uses a self-attention mechanism to fuse image frames at different moments and extracts the key image frames according to the similarity computed between frames.
3. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the specific steps of respectively inputting the feature sequence F into the two branches of the joint-BERT self-attention model and respectively extracting channel and temporal features comprise:
(1) In the 1st branch, the key channel semantic features that distinguish action categories and their correlations are obtained by the channel reorganization module and the channel BERT module: the feature sequence F extracted by the two-dimensional convolutional neural network is input into the channel reorganization module, where the corresponding sub-channel features of adjacent frames are reorganized and spliced along the time dimension to form sub-channel feature time sequences containing the temporal variation; the output reorganized sub-channel feature sequence X_C undergoes self-attention calculation in the channel BERT module, which extracts the key sub-channel features with large variation over time and the interactive correlations among them; weighting is performed according to the correlations among the sub-channel feature sequences, and a first prediction matrix for behavior recognition is obtained through the output of a fully connected layer, thereby modeling channel-feature interaction along the time dimension;
(2) In the 2nd branch, the pooled feature sequence F is directly input into the temporal BERT module, the similarity between video frames is calculated, and a second prediction matrix for behavior recognition is obtained through the output of a fully connected layer.
4. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the channel reorganization module comprises a channel separation module and sub-channel feature sequences; the feature sequence F is input into the channel separation module and equally divided along the channel dimension into N sub-channels, each containing C/N channel features, i.e. $F'=[F'(1),F'(2),\dots,F'(N)]$ with $F'(n)\in\mathbb{R}^{B\times T\times C/N}$; the sub-channel features $F'$ corresponding to adjacent frames are spliced along the time dimension to obtain the sub-channel feature sequence $X_C=[X_C(1),X_C(2),\dots,X_C(N)]\in\mathbb{R}^{B\times N\times N_C}$, where $N_C=T\times C/N$; for the $n$-th ($1\le n\le N$) sub-channel group, $X_C(n)=[F'_1(n),F'_2(n),\dots,F'_T(n)]$, indicating that each sub-channel group contains the feature sequence information of the T image frames.
5. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the channel BERT module comprises a position encoding layer, a multi-head self-attention module, a channel connection module and a fully connected layer; the feature sequence $X_C$ output by the channel reorganization module is input to the position encoding layer of the channel BERT module, which encodes the position information to obtain the position-embedded features $\hat{X}_C$; the position-embedded features $\hat{X}_C$ are input to the multi-head attention mechanism and the position-wise feed-forward network PFFN(·) layer of the multi-head self-attention module, and through self-attention calculation and the nonlinear mapping of the PFFN(·) layer a matrix $Y_C$ is obtained that highlights channel differences and in which the sub-channel groups interact; the outputs $Y_C$ of all sub-channel groups are input to the channel connection module and spliced along the channel dimension to obtain a matrix $y_C$ with the same channel dimension as the feature F; the matrix $y_C$ is input to the fully connected layer to obtain the first prediction matrix for behavior recognition.
CN202211083801.XA 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model Pending CN115457657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211083801.XA CN115457657A (en) 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211083801.XA CN115457657A (en) 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Publications (1)

Publication Number Publication Date
CN115457657A true CN115457657A (en) 2022-12-09

Family

ID=84303250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211083801.XA Pending CN115457657A (en) 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Country Status (1)

Country Link
CN (1) CN115457657A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189800A (en) * 2023-02-23 2023-05-30 深圳大学 Pattern recognition method, device, equipment and storage medium based on gas detection
CN116189800B (en) * 2023-02-23 2023-08-18 深圳大学 Pattern recognition method, device, equipment and storage medium based on gas detection

Similar Documents

Publication Publication Date Title
Zhou et al. AGLNet: Towards real-time semantic segmentation of self-driving images via attention-guided lightweight network
CN111709304B (en) Behavior recognition method based on space-time attention-enhancing feature fusion network
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN114596520A (en) First visual angle video action identification method and device
CN112329780B (en) Depth image semantic segmentation method based on deep learning
CN113052254B (en) Multi-attention ghost residual fusion classification model and classification method thereof
CN112651360B (en) Skeleton action recognition method under small sample
CN114332573A (en) Multi-mode information fusion recognition method and system based on attention mechanism
CN110599443A (en) Visual saliency detection method using bidirectional long-term and short-term memory network
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN115457657A (en) Method for identifying channel characteristic interaction time modeling behaviors based on BERT model
CN115631513A (en) Multi-scale pedestrian re-identification method based on Transformer
CN115346269A (en) Gesture motion recognition method
CN113705384B (en) Facial expression recognition method considering local space-time characteristics and global timing clues
CN112348033B (en) Collaborative saliency target detection method
CN114972794A (en) Three-dimensional object recognition method based on multi-view Pooll transducer
CN115908793A (en) Coding and decoding structure semantic segmentation model based on position attention mechanism
Jiang et al. Cross-level reinforced attention network for person re-identification
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN116844004A (en) Point cloud automatic semantic modeling method for digital twin scene
Li et al. Two-stream spatial graphormer networks for skeleton-based action recognition
CN116189292A (en) Video action recognition method based on double-flow network
CN113283393B (en) Deepfake video detection method based on image group and two-stream network
Li et al. Lighter Transformer for Online Action Detection
CN113158901A (en) Domain-adaptive pedestrian re-identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination