CN115457657A - Method for identifying channel characteristic interaction time modeling behaviors based on BERT model - Google Patents

Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Info

Publication number
CN115457657A
Authority
CN
China
Prior art keywords
channel
bert
module
sub
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211083801.XA
Other languages
Chinese (zh)
Inventor
李晓潮
杨曼
甘利鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202211083801.XA priority Critical patent/CN115457657A/en
Publication of CN115457657A publication Critical patent/CN115457657A/en
Pending legal-status Critical Current


Classifications

    • G06V 40/20 Recognition of biometric, human-related or animal-related patterns in image or video data: movements or behaviour, e.g. gesture recognition
    • G06N 3/049 Neural network architectures: temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Neural networks: learning methods
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V 10/7715 Processing image or video features in feature spaces: feature extraction, e.g. by transforming the feature space
    • G06V 10/809 Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/41 Scene-specific elements in video content: higher-level, semantic clustering, classification or understanding of video scenes
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

A method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, belonging to the technical fields of computer vision, deep learning and behavior recognition. An action video is decomposed into a corresponding RGB image sequence and fed into a two-dimensional convolutional neural network. Based on the features extracted by the two-dimensional convolutional neural network, a channel reorganization module and a channel BERT model perform self-attention calculation on the sub-channel feature sequences, extracting the key sub-channel features with large variation over time together with the interactive correlations among them, and thereby obtaining the key semantic features that distinguish action categories and their correlations, which improves behavior classification accuracy. By combining the channel BERT with the temporal BERT, the method further attends to the semantic features of key channels within key frames, yielding higher behavior recognition accuracy.

Description

Method for identifying channel characteristic interaction time modeling behaviors based on BERT model
Technical Field
The invention belongs to the technical fields of computer vision, deep learning and behavior recognition, and particularly relates to a method for recognizing behaviors by modeling channel-feature interaction over time based on a BERT (Bidirectional Encoder Representations from Transformers) model.
Background
Behavior recognition is one of the basic tasks of computer vision and is widely applied in scenarios such as human-computer interaction, video retrieval and intelligent security monitoring. Behavior recognition technology enables a computer to understand human actions and behaviors by processing and analyzing video data. A key technique for behavior recognition is therefore the modeling of video semantic features along the time dimension as the action evolves. First, feature modeling must extract the spatio-temporal variation of human behavior in the video and describe, through spatio-temporal features, how the appearance of different behaviors changes. Second, the channel relationships among video frames at different moments must be considered, since effective channel-feature interaction yields a more complete representation of the video semantics. Temporal modeling of channel-feature interaction in video is therefore an effective way to improve the accuracy of the behavior recognition task.
Analysis and experiments show that embedding a BERT self-attention module in a two-dimensional convolutional network allows the temporal correlation between image frames to be learned and extracted, improving behavior recognition accuracy. On this basis, a channel reorganization module is proposed: it separates the channel features of consecutive frames into N sub-channels and splices the corresponding sub-channel features of each frame along the time dimension to form sub-channel feature time sequences. A channel-feature interaction model is then built over the reorganized sub-channel groups with a channel BERT self-attention mechanism; according to the similarity computed among the sub-channel groups, the key sub-channel feature sequences with large temporal variation and the interactive correlations among them are extracted and reinforced, yielding the key semantic features that distinguish action categories. To extract the interactive correlations of both the channel and the time dimension of the frame-image features simultaneously, a joint-BERT model is proposed that fuses the two branches extracting channel correlation and temporal correlation, so that the semantic features of key channels within key frames receive further attention and recognition accuracy improves. Meanwhile, the channel BERT module (70) and the temporal BERT module (71) adopt a weight-sharing strategy, which reduces the number of weights of the overall model.
In recent years, research and patents related to the proposed method for identifying channel-feature interaction time modeling behaviors based on the BERT model include the following:
In the article "Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition", published at the ECCV (European Conference on Computer Vision) 2020, Kalfaoglu et al. replace the TGAP (temporal global average pooling) layer at the end of a 3D CNN with BERT to learn the important temporal features in video frames, enhancing the late temporal modeling capability of the CNN backbone. In contrast, the present invention uses a 2D CNN to extract the spatial features of the video, obtains the correlation between video time frames with BERT, and adds a channel BERT for extracting the correlation between the image channel features of the frames. A channel reorganization module is first designed to gather the channel information of all adjacent frames into sub-channel feature sequences, and the self-attention mechanism of the channel BERT then computes the similarity between different sub-channel feature sequences to learn the semantic features of key channels in the video frames. In addition, the proposed method uses two-dimensional convolution instead of three-dimensional convolution to extract the semantic features of each image frame, which greatly reduces the computation cost and parameter count and improves the computational efficiency of the model.
The "Channel interaction networks for fine-grained image capture" article published in the 34 th AAAI Conference on intellectual association in 2020 by Gao, y, et al proposes a self-Channel interaction (SCI) module to establish a correlation model between different channels in an image frame, and a Contrast Channel Interaction (CCI) module to model the cross-sample Channel interaction relationship between two image frames. In contrast, the channel reorganization module provided by the invention firstly performs channel separation and reorganization on the feature sequences of all image frames extracted by the 2D CNN to obtain a sub-channel group containing the feature sequences of all adjacent frame channels, and then performs self-attention calculation on the sub-channel feature sequences by using a BERT self-attention mechanism, thereby not only improving the channel interaction capacity in one image frame or between two image frames, but also enhancing the channel interaction relation between all the image frames and between the sub-channel groups to obtain more complete channel semantic representation, so that all the adjacent time frames of the video, not only the feature channels on the two image frames, can learn the semantic features with identification in the video, and improving the classification accuracy.
Chinese patent CN111597929A discloses a group behavior recognition method based on channel information fusion and structured spatial modeling of group relationships. To address the accuracy of collective behavior recognition, it adopts the CSTM and CMM modules of the 2D network STM to extract per-frame spatio-temporal information and inter-frame motion information respectively, proposes a channel selection module to fuse the spatio-temporal and motion information of each frame, and extracts a collective relationship evolution model through a graph convolution-LSTM network; the videos are sparsely sampled in order to capture long sequences. The present invention instead feeds the per-frame spatial information extracted by the 2D CNN into a BERT self-attention mechanism to obtain the temporal variation of inter-frame space and channels over a period of time. Learning sequence features through a multi-head self-attention mechanism not only yields key image frames and key channel semantic features but also improves the long-term temporal modeling capability of the behavior recognition model. In addition, the present invention does not need to extract a collective relationship evolution model.
Chinese patent CN113591774A discloses a Transformer-based behavior recognition algorithm. In that invention, a convolutional layer in the pose estimation part extracts the temporal and spatial information of skeleton points from the original video simultaneously, and the fused temporal and spatial information is fed directly into a Transformer self-attention network to obtain the variation of the human skeleton nodes. The present method first uses a 2D CNN to extract the spatial information of each frame and then uses the BERT self-attention mechanism to extract the variation between space and time for behavior recognition, fully exploiting the strengths of 2D convolution for spatial feature extraction and of BERT self-attention for sequence processing, which improves the behavior modeling capability of the neural network.
Chinese patent CN113673489A discloses a video group behavior recognition method based on cascaded Transformers, in which a first Transformer module detects human targets to extract individual features from key frames and a second Transformer module models the hierarchical relationship between individuals and the group to complete the group behavior recognition task. The BERT module adopted by the present invention is a self-attention model built on a bidirectional Transformer; the BERT modules in branch 1 and branch 2 learn the similarity among sub-channel groups and among time frames respectively to complete behavior recognition for an individual, without extracting a hierarchy between individuals and a group. Moreover, unlike a method that only attends to the individual features of key frames, the present invention applies a weight-sharing strategy to the BERT parameter matrices during joint-BERT training and jointly learns the key channel and temporal features of the human body, fusing information of different dimensions to improve the completeness of the description of human action features and further attending to the semantic features of key channels within key frames.
Disclosure of Invention
The object of the invention is to provide a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, which improves behavior classification accuracy and overcomes the deficiencies of the prior art. The method combines a 2D convolutional network with a BERT self-attention mechanism, studies in particular the extraction of correlations between the time and channel dimensions of frame-image features, learns the key channel semantic features within key frames through the fusion mechanism of the BERT self-attention model, and optimizes the fusion of image frames along the time and channel dimensions, thereby meeting the need to improve video classification accuracy.
The invention provides a joint-BERT self-attention model that extracts key channel and temporal features simultaneously on top of a two-dimensional convolutional neural network. The model consists of a 1st branch that extracts the correlation among channels and a 2nd branch that extracts the temporal correlation among image frames. The two-dimensional convolutional neural network consists of several 2D convolutional layers and obtains the spatial features of the video image frames by convolving along the spatial dimensions. The 1st branch starts from the semantic features of the reorganized sub-channel groups and uses a multi-head self-attention mechanism to establish interaction among the sub-channels, obtaining the key semantic features that distinguish action categories and their correlations. The 2nd branch uses a self-attention mechanism to fuse image frames at different moments and extracts the key image frames according to the similarity computed between frames.
The method comprises the following specific steps:
1) Decompose the action video into a corresponding RGB image sequence (10) and input it into a two-dimensional convolutional neural network module (20) for feature extraction, obtaining a feature map (30) of dimensions B, T, C, H and W, where B denotes the batch size of input video frames during batch training, C the number of channels, T the number of consecutive frames, and H, W the height and width of the input image;
2) Input the extracted feature map into a pooling module (40) for spatial average pooling, obtaining a feature sequence F (50) of dimensions B, T and C;
3) Input the feature sequence F into the two branches of the joint-BERT self-attention model and extract channel and temporal features respectively: in the 1st branch, the feature sequence F (50) is input into the channel reorganization module (60); the reorganized sub-channel feature sequence X_C it outputs is weighted by the channel BERT module (70) and then passed through a fully connected layer to obtain a first prediction matrix (80) for behavior recognition; in the 2nd branch, the feature sequence F (50) is input into the temporal BERT module (71) and passed through a fully connected layer to obtain a second prediction matrix (81) for behavior recognition; the channel BERT module (70) of branch 1 and the temporal BERT module (71) of branch 2 share parameters;
4) Weight and fuse the first prediction matrix (80) and the second prediction matrix (81) and input the result into a classification module (90) to obtain the behavior recognition classification result (a minimal end-to-end sketch of these four steps is given below).
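The four steps above can be summarized in the following minimal PyTorch-style sketch. It is an illustration, not the patent's implementation: the class and argument names, the ResNet-50 backbone choice and the fusion weights are assumptions, and the two branch modules stand for the channel branch (60, 70) and the temporal branch (71) elaborated later in this description.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointBERTRecognizer(nn.Module):
    """High-level sketch of steps 1)-4): 2D CNN backbone, spatial pooling,
    two BERT branches, weighted fusion of the two prediction matrices."""
    def __init__(self, channel_branch: nn.Module, temporal_branch: nn.Module,
                 beta: float = 0.5, gamma: float = 0.5):
        super().__init__()
        backbone = models.resnet50(weights=None)           # 2D CNN module (20), ResNet-50 assumed
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)                # pooling module (40)
        self.channel_branch = channel_branch               # channel reorganization (60) + channel BERT (70)
        self.temporal_branch = temporal_branch             # temporal BERT (71)
        self.beta, self.gamma = beta, gamma                # fusion weights for step 4)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W) -> per-frame features, then feature sequence F of shape (B, T, C)
        b, t = video.shape[:2]
        feat = self.backbone(video.flatten(0, 1))          # (B*T, C, h, w)
        f = self.pool(feat).flatten(1).view(b, t, -1)      # feature sequence F (50)
        y_cb = self.channel_branch(f)                      # first prediction matrix (80)
        y_tb = self.temporal_branch(f)                     # second prediction matrix (81)
        return self.beta * y_cb + self.gamma * y_tb        # fused scores for the classifier (90)
```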
Further, in step 3) of the above technical solution, the joint-BERT self-attention model comprises two branches: a 1st branch for extracting inter-channel correlation and a 2nd branch for extracting temporal correlation between image frames. The 1st branch starts from the semantic features of the reorganized sub-channel groups and uses a multi-head self-attention mechanism to establish interaction among the sub-channels, obtaining the key semantic features that distinguish action categories and their correlations; the 2nd branch uses a self-attention mechanism to fuse image frames at different moments and extracts the key image frames according to the similarity computed between frames.
In step 3), the specific steps of extracting the channel and temporal features in branch 1 and branch 2 respectively are:
(1) In the 1st branch, the key channel semantic features that distinguish action categories and their correlations are obtained by the channel reorganization module (60) and the channel BERT module (70): the feature sequence F (50) extracted by the two-dimensional convolutional neural network is input into the channel reorganization module (60), where the corresponding sub-channel features of adjacent frames are reorganized and spliced along the time dimension to form sub-channel feature time sequences X_C that contain the temporal variation; the output reorganized sub-channel feature sequence X_C then undergoes self-attention calculation in the channel BERT module (70), which extracts the key sub-channel features with large variation over time and the interactive correlations among them; weighting is performed according to the correlations among the sub-channel feature sequences, and the output of a fully connected layer gives a first prediction matrix (80) for behavior recognition, thereby modeling channel-feature interaction along the time dimension.
(2) In the 2nd branch, the pooled feature sequence F (50) is directly input into the temporal BERT module (71), the similarity between video frames is computed, and a second prediction matrix (81) for behavior recognition is obtained through the output of a fully connected layer.
The channel reorganization module (60) in step 3) of the above technical solution is shown in Fig. 2 and is characterized in that it comprises a channel separation module (601) and sub-channel feature sequences (602). The feature sequence F is input to the channel separation module (601), which equally divides it along the channel dimension into N sub-channels, each containing C/N channel features, i.e. $F'=[F'(1),F'(2),\dots,F'(N)]$ with $F'(n)\in\mathbb{R}^{B\times T\times C/N}$. Splicing the sub-channel features $F'$ corresponding to adjacent frames along the time dimension gives the sub-channel feature sequence (602) $X_C=[X_C(1),X_C(2),\dots,X_C(N)]\in\mathbb{R}^{B\times N\times N_C}$, where $N_C=T\times C/N$; specifically, for the $n$-th ($1\le n\le N$) sub-channel group, $X_C(n)=[F'_1(n),F'_2(n),\dots,F'_T(n)]$, indicating that every sub-channel group contains the feature sequence information of the T image frames.
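Under these definitions, the channel reorganization amounts to a pair of reshape/permute operations on the feature sequence F. The sketch below is illustrative (the function name and the example sizes are not from the patent) and assumes C is divisible by N:

```python
import torch

def channel_reorganize(f: torch.Tensor, n_groups: int) -> torch.Tensor:
    """Split F (B, T, C) into N sub-channels and splice each sub-channel
    across the T frames, giving X_C of shape (B, N, T*C/N)."""
    b, t, c = f.shape
    assert c % n_groups == 0, "C must be divisible by N"
    f = f.view(b, t, n_groups, c // n_groups)          # F'(n): (B, T, N, C/N)
    f = f.permute(0, 2, 1, 3)                          # group sub-channels: (B, N, T, C/N)
    return f.reshape(b, n_groups, t * c // n_groups)   # X_C: (B, N, N_C), N_C = T*C/N

# usage: 8 videos, 16 frames, 2048-dim features, N = 8 sub-channel groups
x_c = channel_reorganize(torch.randn(8, 16, 2048), n_groups=8)  # -> (8, 8, 4096)
```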
The channel BERT module (70) in step 3) of the above technical solution is shown in Fig. 3 and is characterized in that it comprises a position encoding layer (701), a multi-head self-attention module (702), a channel connection module (703) and a fully connected layer (704). The feature sequence $X_C$ output by the channel reorganization module is input to the position encoding layer (701) of the channel BERT module (70), which encodes the position information to obtain the position-embedded features $\hat{X}_C$. The position-embedded features $\hat{X}_C$ are input to the multi-head attention mechanism and the position-wise feed-forward network PFFN(·) layer of the multi-head self-attention module (702); through self-attention calculation and the nonlinear mapping of the PFFN(·) layer, a matrix $Y_C$ is obtained that highlights channel differences and in which the sub-channel groups interact. The outputs $Y_C$ of all sub-channel groups are input to the channel connection module (703) and spliced along the channel dimension, giving a matrix $y_C$ with the same channel dimension as the feature F; the matrix $y_C$ is input to the fully connected layer (704) to obtain the first prediction matrix (80) for behavior recognition.
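A minimal sketch of branch 1 consistent with this description is given below. It substitutes PyTorch's nn.TransformerEncoderLayer for the multi-head self-attention plus PFFN pair of the channel BERT module (70), repeats the channel reorganization (60) inline, and treats all sizes (channel width, frame count, group count, head count, class count) as assumed hyper-parameters rather than values fixed by the patent:

```python
import torch
import torch.nn as nn

class ChannelBERT(nn.Module):
    """Sketch of branch 1: channel reorganization (60) + channel BERT (70) + FC head."""
    def __init__(self, channels: int = 2048, frames: int = 16, n_groups: int = 8,
                 heads: int = 8, num_classes: int = 174):
        super().__init__()
        self.n_groups = n_groups
        n_c = frames * channels // n_groups                        # N_C = T * C / N
        self.pos = nn.Parameter(torch.zeros(1, n_groups, n_c))     # learnable positions P_C
        self.encoder = nn.TransformerEncoderLayer(                 # self-attention + feed-forward
            d_model=n_c, nhead=heads, dim_feedforward=2 * n_c,
            activation="gelu", batch_first=True)
        self.project = nn.Linear(n_c, channels // n_groups)        # map each group back to C/N
        self.fc = nn.Linear(channels, num_classes)                 # first prediction matrix (80)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, t, c = f.shape                                           # feature sequence F: (B, T, C)
        x_c = (f.view(b, t, self.n_groups, c // self.n_groups)      # channel separation (601)
                .permute(0, 2, 1, 3)
                .reshape(b, self.n_groups, -1))                     # X_C: (B, N, N_C)
        y_c = self.encoder(x_c + self.pos)                          # interaction among sub-channel groups
        y_c = self.project(y_c).flatten(1)                          # channel connection (703) -> (B, C)
        return self.fc(y_c)                                         # class scores
```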
The invention provides a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, which jointly learns the temporal variation of video frames and the interactive correlation between channels. In the 1st branch, the channel reorganization module and the channel BERT module extract the key sub-channel features with large variation over time and the interactive correlations among them, strengthening the channel interaction among image frames and among sub-channel groups and yielding the key semantic features that distinguish action categories and their correlations, which improves behavior classification accuracy. The joint-BERT combines the 1st branch extracting inter-channel correlation with the 2nd branch extracting temporal correlation, and shares the learnable parameters of the multi-head self-attention layers and the feed-forward network layers between the channel BERT and the temporal BERT; thus the channel BERT extracts the key channel semantic features and their correlations while the temporal BERT identifies the key action frames, so that the semantic features of key channels within key frames receive further attention and a higher behavior recognition accuracy is obtained, reaching a leading level of action recognition accuracy at home and abroad.
Compared with the prior art, the invention innovatively provides a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model; its outstanding technical effects are as follows:
1. Based on the channel reorganization module and the channel BERT module, the key sub-channel features with large variation over time and the interactive correlations among them are extracted, and channel aggregation is performed according to the correlations among the sub-channel feature sequences, yielding the key semantic features that distinguish action categories and their correlations and thereby improving behavior classification accuracy.
2. By combining the channel BERT and the temporal BERT, the key channel semantic features and their correlations are extracted by the channel BERT while the key action frames are obtained by the temporal BERT; fusing the two allows further attention to the key channel semantic features within key frames, further improving behavior classification accuracy and reaching a leading level of action recognition accuracy at home and abroad.
3. The method uses a 2D CNN to extract the semantic features of each image frame and then a BERT self-attention mechanism to extract the variation between space and time for behavior recognition. Compared with traditional 3D CNN-based action recognition methods, the 2D CNN has fewer parameters and a lower computation cost. Meanwhile, the weight sharing between the temporal BERT and the channel BERT means the joint-BERT network model adds no extra parameters.
4. The behavior recognition accuracy of the invention on the public Something-Something (Sth-Sth) V1, V2 and HMDB-51 datasets is improved to 57.1%, 68.2% and 83.8% respectively, reaching a leading level of behavior recognition accuracy at home and abroad.
Drawings
FIG. 1 is a schematic diagram of a process framework of the present invention.
FIG. 2 is a schematic structural diagram of a channel reconfiguration module according to the present invention.
Fig. 3 is a schematic structural design diagram of the channel BERT module of branch 1 of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following examples will help those skilled in the art to further understand the invention but do not limit it in any way. It should be noted that persons of ordinary skill in the art can make several variations and improvements without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention provides a method for recognizing behaviors by modeling channel-feature interaction over time with a BERT model, consisting of a 1st branch that extracts the correlation among channels and a 2nd branch that extracts the temporal correlation among image frames. As shown in Fig. 1, this embodiment specifically comprises the following steps:
1) Decompose the action video into a corresponding RGB image sequence (10) and input it into a two-dimensional convolutional neural network module (20) for feature extraction, obtaining a feature map (30) of dimensions B, T, C, H and W, where B denotes the batch size of input video frames during batch training, C the number of channels, T the number of consecutive frames, and H, W the height and width of the input image. In this embodiment, the 2D ResNet-50 residual network of the TDN network (existing method 3 below) is selected for extracting the spatial features of the RGB sequence.
2) Input the extracted feature map into a pooling module (40) for spatial average pooling, obtaining a feature sequence F (50) of dimensions B, T and C.
3) Input the feature sequence F (50) into the two branches of the joint-BERT: the 1st branch uses the channel BERT model to extract the correlation among channels, and the 2nd branch uses the temporal BERT model to extract the temporal correlation among image frames; during the joint training of the 1st and 2nd branches, parameters are shared between the channel BERT module (70) and the temporal BERT module (71).
4) The 1st branch first inputs the feature sequence F (50) into the channel reorganization module (60), which, as shown in Fig. 2, obtains the reorganized sub-channel feature sequence X_C (602) through the channel separation module (601). The reorganized sub-channel feature sequence X_C is then input into the channel BERT module (70) of Fig. 1; as detailed in Fig. 3, it passes through the position encoding layer (701) and then the multi-head attention mechanism and position-wise feed-forward network PFFN(·) of the multi-head self-attention module (702), giving a matrix Y_C in which the sub-channel groups interact and which highlights channel differences. The output matrices Y_C of all sub-channel groups are input to the channel connection module (703) and spliced along the channel dimension to obtain an output vector y_C with the same dimension as the channels of the feature map, which is input to the fully connected layer (704) to obtain the first prediction matrix (80) for behavior recognition.
5) The 2nd branch directly inputs the spatially average-pooled feature sequence F (50) into the temporal BERT module (71) and obtains a second prediction matrix (81) for behavior recognition by computing the similarity between video frames.
6) The first prediction matrix (80) and the second prediction matrix (81) are input into a classification module (90) to obtain the behavior recognition classification result.
In step 4), as shown in Fig. 2, the channel reorganization module (60) takes the feature sequence F extracted by the 2D ResNet-50 network as its input, where $F=[F(1),F(2),\dots,F(T)]$ and $F\in\mathbb{R}^{B\times T\times C}$. First, the feature sequence F is equally split along the channel dimension into N sub-channels by the channel separation module (601), denoted $F'=[F'(1),F'(2),\dots,F'(N)]$ with $F'(n)\in\mathbb{R}^{B\times T\times C/N}$. The resulting sub-channel features $F'$ are then spliced across the sub-channels of different frames along the time dimension to obtain the reorganized sub-channel feature sequence (602) containing the channel information of adjacent frames, $X_C=[X_C(1),X_C(2),\dots,X_C(N)]\in\mathbb{R}^{B\times N\times N_C}$, where $N_C=T\times C/N$; specifically, for the $n$-th ($1\le n\le N$) sub-channel group, $X_C(n)=[F'_1(n),F'_2(n),\dots,F'_T(n)]$, where $F'_T(n)$ denotes the T-th frame of the $n$-th sub-channel ($1\le n\le N$). Thus N reorganized sub-channel feature sequences are generated, and each channel group contains the medium- and long-term variation between the features of the T image frames.
In step 4), as shown in Fig. 3, the channel BERT module (70) takes the feature sequence $X_C$ output by the channel reorganization module as its input, and the interaction between the channel groups and adjacent video frames is strengthened by the self-attention mechanism of BERT. The position encoding layer (701) of the channel BERT self-attention module yields the learnable position-embedded features $\hat{X}_C=X_C+P_C$, where $P_C$ denotes a learnable position parameter with the same dimension as $X_C$.
The position-embedded features $\hat{X}_C$ are input to the multi-head self-attention module (702) for the Attention(·) calculation. As usual, the query $Q=\hat{X}_C W_Q$, key $K=\hat{X}_C W_K$ and value $V=\hat{X}_C W_V$ are the corresponding query, key and value matrices obtained from $\hat{X}_C$ by linear transformation, with $d_q=d_k=d_v=N_C/h$. The Attention(·) calculation of the $i$-th self-attention head can thus be expressed as:
$$\mathrm{head}_i=\mathrm{Attention}(Q_i,K_i,V_i)=\mathrm{softmax}\!\left(\frac{Q_iK_i^{\top}}{\sqrt{d_k}}\right)V_i$$
where $\sqrt{d_k}$ is a scale factor and $Q_iK_i^{\top}$ describes the similarity between $Q_i$ and $K_i$; the Attention(·) calculation therefore captures the channel interaction among all sub-channel groups in $\hat{X}_C$.
The outputs of the h self-attention heads are concatenated and then linearly transformed to obtain the output of the multi-head self-attention mechanism:
$$\mathrm{MultiHead}(\hat{X}_C)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W_O$$
where $W_O$ is the learnable output projection matrix. The features learned by the multi-head attention mechanism are then input into the position-wise feed-forward network PFFN(·) to obtain the matrix in which the sub-channel groups interact:
$$Y_C=\mathrm{PFFN}\big(\mathrm{MultiHead}(\hat{X}_C)\big)$$
where $\mathrm{PFFN}(x)=W_2\,\mathrm{GELU}(W_1x+b_1)+b_2$ and GELU(·) is the activation function.
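Read literally, PFFN(·) is a two-layer position-wise feed-forward network with a GELU activation between the layers; a minimal sketch (hidden width left as a free parameter) is:

```python
import torch.nn as nn

class PFFN(nn.Module):
    """Position-wise feed-forward network: PFFN(x) = W2 * GELU(W1 x + b1) + b2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)   # W1, b1
        self.w2 = nn.Linear(d_hidden, d_model)   # W2, b2
        self.act = nn.GELU()

    def forward(self, x):
        return self.w2(self.act(self.w1(x)))
```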
The output matrices $Y_C$ of all the reorganized sub-channel sequences are input to the channel connection module (703); each $Y_C(n)$ is mapped to the C/N dimension by a trainable linear projection $E$ and then spliced along the channel dimension:
$$y_C=\mathrm{Concat}\big(Y_C(1)E,\;Y_C(2)E,\;\dots,\;Y_C(N)E\big)\in\mathbb{R}^{B\times C}$$
Finally, $y_C$ is input to the fully connected layer for the final action prediction: a fully connected layer maps $y_C$ to a prediction matrix with as many entries as behavior recognition video classes, $y_{cb}=y_C W_1$. Meanwhile, in order to better aggregate the information of the video frames, the feature sequence F is average-pooled along the time dimension: $y_{avg}=\mathrm{AvgPool}(FW_2)$, where $W_1,W_2\in\mathbb{R}^{C\times N_{FC}}$ and $N_{FC}$ is the number of behavior recognition video classes. Thus the loss function for the end-to-end training of the channel BERT is
$$L_{CB}=\alpha\,L(y_{cb},\hat{y})+(1-\alpha)\,L(y_{avg},\hat{y})$$
where $\alpha$ is a hyper-parameter with $0<\alpha<1$, $L(\cdot)$ is the cross-entropy loss function and $\hat{y}$ is the ground-truth label.
In step 5), the 2nd branch directly inputs the pooled feature sequence F into the temporal BERT module (71). The feature sequence F passes through the position encoding layer of the temporal BERT module to obtain the learnable position-embedded features $\hat{F}=F+P_T$, where $P_T$ denotes a learnable position parameter with the same dimension as $F$. During training, the channel BERT module (70) and the temporal BERT module (71) share the parameter matrices $W_Q$, $W_K$ and $W_V$.
To perform classification with the temporal BERT module, an additional learnable classification embedding vector $x_{cls}$ is appended to the frame feature sequence, and the corresponding output classification vector $y_{cls}$ is used as the aggregated representation. In the Attention(·) calculation, $q_{cls}K_i^{\top}$ describes the similarity between the classification query $q_{cls}$ and the keys $K_i$ of the image frames, so that the Attention(·) calculation captures the relative importance of all image frames with respect to $x_{cls}$.
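Branch 2 therefore follows the usual BERT classification recipe: prepend a learnable classification token to the pooled frame features, run the self-attention encoder, and read the class scores off that token's output. The sketch below is illustrative (hyper-parameters are assumed, nn.TransformerEncoderLayer again stands in for the multi-head self-attention plus PFFN pair, and the weight sharing with the channel BERT is not shown):

```python
import torch
import torch.nn as nn

class TemporalBERT(nn.Module):
    """Sketch of branch 2: temporal BERT (71) with a classification embedding vector."""
    def __init__(self, channels: int = 2048, frames: int = 16,
                 heads: int = 8, num_classes: int = 174):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, channels))            # x_cls token
        self.pos = nn.Parameter(torch.zeros(1, frames + 1, channels))   # positions P_T
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels,
            activation="gelu", batch_first=True)
        self.fc = nn.Linear(channels, num_classes)                      # head mapping y_cls to classes

    def forward(self, f: torch.Tensor) -> torch.Tensor:                 # f: (B, T, C)
        cls = self.cls.expand(f.size(0), -1, -1)                        # (B, 1, C)
        x = torch.cat([cls, f], dim=1) + self.pos                       # (B, T+1, C)
        y_cls = self.encoder(x)[:, 0]                                   # classification vector
        return self.fc(y_cls)                                           # second prediction matrix
```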
Thus, the total loss function for the end-to-end training of the joint-BERT is the cross-entropy loss computed on the fused prediction, where $y_{tb}$ is the prediction class vector of the temporal BERT module, generated by mapping $y_{cls}$ to the action categories through a fully connected layer $W_3$, and $y_{jb}=\beta y_{cb}+\gamma y_{tb}$ is the weighted fusion of the outputs of branch 1 and branch 2; $\beta$ and $\gamma$ are hyper-parameters with $0<\beta<1$ and $0<\gamma<1$.
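The fused prediction and a training step can be sketched as follows; since the total-loss formula in the original is an unrecoverable image, plain cross-entropy on the fused prediction y_jb is used here as a stand-in, and the fusion weights and class count are illustrative:

```python
import torch
import torch.nn.functional as F

def joint_bert_step(y_cb, y_tb, target, beta=0.6, gamma=0.4):
    """Weighted fusion y_jb = beta*y_cb + gamma*y_tb, then an assumed cross-entropy objective."""
    y_jb = beta * y_cb + gamma * y_tb          # fused prediction of branch 1 and branch 2
    loss = F.cross_entropy(y_jb, target)       # stand-in training objective
    return loss, y_jb

# usage with dummy scores for a 174-class problem (batch of 4)
scores_cb, scores_tb = torch.randn(4, 174), torch.randn(4, 174)
labels = torch.randint(0, 174, (4,))
loss, fused = joint_bert_step(scores_cb, scores_tb, labels)
```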
To validate the invention, experiments were performed on the public Something-Something (Sth-Sth) V1, V2 and HMDB-51 datasets commonly used for behavior recognition. Table 1 compares the behavior recognition accuracy with other known advanced methods; the accuracy of the proposed method on the Something-Something (Sth-Sth) V1, V2 and HMDB-51 datasets reaches 57.1%, 68.2% and 83.8% respectively. The comparison shows that the proposed joint-BERT model models channel-feature interaction along the time dimension more effectively and therefore obtains better behavior recognition accuracy, reaching a leading level at home and abroad.
TABLE 1  Comparison of behavior recognition accuracy

Method                        Sth-Sth V1    Sth-Sth V2    HMDB51
Existing method 1             47.2%         63.4%         73.5%
Existing method 2             51.0%         62.9%         75.7%
Existing method 3             53.9%         65.3%         76.3%
Embodiment of the invention   57.1%         68.2%         83.8%
Existing method 1: Lin, J. et al., "TSM: Temporal Shift Module for Efficient Video Understanding", ICCV (IEEE International Conference on Computer Vision), 2019.
Existing method 2: Wu, W. et al., "MVFNet: Multi-View Fusion Network for Efficient Video Recognition", AAAI (Association for the Advancement of Artificial Intelligence), 2021.
Existing method 3: Wang, L. et al., "TDN: Temporal Difference Networks for Efficient Action Recognition", CVPR (IEEE Conference on Computer Vision and Pattern Recognition), 2021.
The invention combines a 2D convolutional network with a BERT self-attention mechanism into a method for recognizing behaviors by modeling channel-feature interaction over time. The method extracts the key sub-channel features with large variation over time and the interactive correlations among them, obtaining the key semantic features that distinguish action categories and their correlations and thereby improving behavior classification accuracy. By combining the channel BERT and the temporal BERT, the semantic features of key channels within key frames receive further attention, yielding higher behavior recognition accuracy.

Claims (5)

1. A method for identifying channel feature interaction time modeling behaviors based on a BERT model is characterized by comprising the following specific steps:
1) Decomposing the motion video into corresponding RGB image sequences, inputting the RGB image sequences into a two-dimensional convolution neural network, and extracting B, T, C, H and W dimension feature maps; b represents the batch number of input video frames during batch training, C represents the channel number, T represents a T frame continuous image, and H, W represents the height and width of the input image;
2) Inputting the extracted feature map into a pooling module to perform space average pooling operation to obtain a B, T and C dimension feature sequence F;
3) Respectively inputting the feature sequence F into the two branches of the joint-BERT self-attention model and respectively extracting channel and temporal features: in the 1st branch, the extracted feature sequence F is input into a channel reorganization module, and the output reorganized sub-channel feature sequence X_C, after weighted processing by a channel BERT module, is passed through a fully connected layer to obtain a first prediction matrix for behavior recognition; in the 2nd branch, the feature sequence F is input into a temporal BERT module and passed through a fully connected layer to obtain a second prediction matrix for behavior recognition; the channel BERT module and the temporal BERT module of the 1st branch and the 2nd branch share parameters;
4) And weighting and fusing the first prediction matrix and the second prediction matrix, and inputting the weighted and fused first prediction matrix and second prediction matrix into a classification module to obtain a classification result of behavior recognition.
2. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the joint-BERT self-attention model comprises two branches: a 1st branch for extracting inter-channel correlation and a 2nd branch for extracting temporal correlation between image frames; the 1st branch starts from the semantic features of the reorganized sub-channel groups and uses a multi-head self-attention mechanism to establish interaction among the sub-channels, obtaining the key semantic features that distinguish action categories and their correlations; the 2nd branch uses a self-attention mechanism to fuse image frames at different moments and extracts the key image frames according to the similarity computed between frames.
3. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the specific steps of respectively inputting the feature sequence F into the two branches of the joint-BERT self-attention model and respectively extracting channel and temporal features comprise:
(1) In the 1st branch, the key channel semantic features that distinguish action categories and their correlations are obtained by the channel reorganization module and the channel BERT module: the feature sequence F extracted by the two-dimensional convolutional neural network is input into the channel reorganization module, where the corresponding sub-channel features of adjacent frames are reorganized and spliced along the time dimension to form sub-channel feature time sequences containing the temporal variation; the output reorganized sub-channel feature sequence X_C undergoes self-attention calculation in the channel BERT module, which extracts the key sub-channel features with large variation over time and the interactive correlations among them; weighting is performed according to the correlations among the sub-channel feature sequences, and a first prediction matrix for behavior recognition is obtained through the output of a fully connected layer, thereby modeling channel-feature interaction along the time dimension;
(2) In the 2nd branch, the pooled feature sequence F is directly input into the temporal BERT module, the similarity between video frames is calculated, and a second prediction matrix for behavior recognition is obtained through the output of a fully connected layer.
4. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the channel reorganization module comprises a channel separation module and sub-channel feature sequences; the feature sequence F is input into the channel separation module and equally divided along the channel dimension into N sub-channels, each containing C/N channel features, i.e. $F'=[F'(1),F'(2),\dots,F'(N)]$ with $F'(n)\in\mathbb{R}^{B\times T\times C/N}$; the sub-channel features $F'$ corresponding to adjacent frames are spliced along the time dimension to obtain the sub-channel feature sequence $X_C=[X_C(1),X_C(2),\dots,X_C(N)]\in\mathbb{R}^{B\times N\times N_C}$, where $N_C=T\times C/N$; for the $n$-th ($1\le n\le N$) sub-channel group, $X_C(n)=[F'_1(n),F'_2(n),\dots,F'_T(n)]$, indicating that each sub-channel group contains the feature sequence information of the T image frames.
5. The method for identifying channel feature interaction time modeling behaviors based on the BERT model according to claim 1, characterized in that in step 3) the channel BERT module comprises a position encoding layer, a multi-head self-attention module, a channel connection module and a fully connected layer; the feature sequence $X_C$ output by the channel reorganization module is input to the position encoding layer of the channel BERT module, which encodes the position information to obtain the position-embedded features $\hat{X}_C$; the position-embedded features $\hat{X}_C$ are input to the multi-head attention mechanism and the position-wise feed-forward network PFFN(·) layer of the multi-head self-attention module, and through self-attention calculation and the nonlinear mapping of the PFFN(·) layer a matrix $Y_C$ is obtained that highlights channel differences and in which the sub-channel groups interact; the outputs $Y_C$ of all sub-channel groups are input to the channel connection module and spliced along the channel dimension to obtain a matrix $y_C$ with the same channel dimension as the feature F; the matrix $y_C$ is input to the fully connected layer to obtain the first prediction matrix for behavior recognition.
CN202211083801.XA 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model Pending CN115457657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211083801.XA CN115457657A (en) 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211083801.XA CN115457657A (en) 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Publications (1)

Publication Number Publication Date
CN115457657A true CN115457657A (en) 2022-12-09

Family

ID=84303250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211083801.XA Pending CN115457657A (en) 2022-09-06 2022-09-06 Method for identifying channel characteristic interaction time modeling behaviors based on BERT model

Country Status (1)

Country Link
CN (1) CN115457657A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189800A (en) * 2023-02-23 2023-05-30 深圳大学 Pattern recognition method, device, equipment and storage medium based on gas detection
CN116189800B (en) * 2023-02-23 2023-08-18 深圳大学 Pattern recognition method, device, equipment and storage medium based on gas detection

Similar Documents

Publication Publication Date Title
Zhou et al. AGLNet: Towards real-time semantic segmentation of self-driving images via attention-guided lightweight network
CN111709304B (en) Behavior recognition method based on space-time attention-enhancing feature fusion network
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN114596520A (en) First visual angle video action identification method and device
CN112329780B (en) Depth image semantic segmentation method based on deep learning
CN113052254B (en) Multi-attention ghost residual fusion classification model and classification method thereof
CN112651360B (en) Skeleton action recognition method under small sample
CN114332573A (en) Multi-mode information fusion recognition method and system based on attention mechanism
CN110599443A (en) Visual saliency detection method using bidirectional long-term and short-term memory network
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN115457657A (en) Method for identifying channel characteristic interaction time modeling behaviors based on BERT model
CN115631513A (en) Multi-scale pedestrian re-identification method based on Transformer
CN115346269A (en) Gesture motion recognition method
CN113705384B (en) Facial expression recognition method considering local space-time characteristics and global timing clues
CN112348033B (en) Collaborative saliency target detection method
CN114972794A (en) Three-dimensional object recognition method based on multi-view Pooll transducer
CN115908793A (en) Coding and decoding structure semantic segmentation model based on position attention mechanism
Jiang et al. Cross-level reinforced attention network for person re-identification
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN116844004A (en) Point cloud automatic semantic modeling method for digital twin scene
Li et al. Two-stream spatial graphormer networks for skeleton-based action recognition
CN116189292A (en) Video action recognition method based on double-flow network
CN113283393B (en) Deepfake video detection method based on image group and two-stream network
Li et al. Lighter Transformer for Online Action Detection
CN113158901A (en) Domain-adaptive pedestrian re-identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination