CN111325145B - Behavior recognition method based on combined time domain channel correlation block - Google Patents
Behavior recognition method based on combined time domain channel correlation block
- Publication number
- CN111325145B (application CN202010102863.5A)
- Authority
- CN
- China
- Prior art keywords
- channel
- time domain
- attention module
- domain channel
- method based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 25
- 238000011176 pooling Methods 0.000 claims abstract description 6
- 230000008569 process Effects 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 230000006870 function Effects 0.000 claims description 3
- 238000010586 diagram Methods 0.000 claims 3
- 239000012141 concentrate Substances 0.000 abstract 1
- 230000006399 behavior Effects 0.000 description 18
- 230000002123 temporal effect Effects 0.000 description 7
- 238000013527 convolutional neural network Methods 0.000 description 3
- 238000001994 activation Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000000630 rising effect Effects 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the field of computer vision and discloses a behavior recognition method based on a combined time-domain channel correlation block. The method compresses an input initial feature map through a spatial global average pooling operation to obtain a time-domain channel descriptor; feeds the descriptor into an attention module to obtain the global nonlinear dependence between the time domain and the channels; and uses the tensor output by the attention module as the per-channel importance weight after feature selection, multiplying the input initial feature map channel by channel with this tensor through a residual connection to obtain a channel-weighted feature map. The invention captures the correlation information between the time domain and the channels with a network layer to obtain a channel-by-channel descriptor, applies it to the preceding features by multiplication, and thereby re-weights the original features along the channel dimension, concentrating more of the network's computing resources on the feature channels that are important to the output result.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a behavior recognition method based on a combined time-domain channel correlation block.
Background
Video accounts for roughly 70% of internet traffic, and this share continues to rise. Most cell phone cameras now capture not only images but also high-resolution video. Many real-world data sources are video-based, ranging from warehouse inventory systems to self-driving cars and drones. Video can be regarded as the next frontier of computer vision, because it captures a large amount of information that still images cannot convey. Video behavior recognition has therefore long been a hot research topic in computer vision and related fields.
Human motion in a video sequence is a three-dimensional (3D) spatiotemporal signal containing spatial features and temporal features. The spatial features mainly describe the appearance of the objects involved in the motion and the configuration of the scene within each frame of the video. Spatial feature learning is similar to still image recognition and therefore readily benefits from the recent advances in deep Convolutional Neural Networks (CNNs). The temporal features of video capture motion cues embedded in the frames as they evolve over time, and contain valuable motion information that needs to be incorporated into video recognition tasks. Video behavior recognition therefore needs to solve two main problems: how to learn the temporal features, and how to properly fuse the spatial and temporal features.
Researchers initially modeled temporal motion information and spatial information explicitly and in parallel: the optical flow computed between adjacent frames and the original frames themselves are used as two input streams to a deep neural network. On the other hand, as a generalization of the two-dimensional convolution (2D Conv) used for still image recognition, three-dimensional convolution (3D Conv) has been proposed to process 3D volumetric video data. In a three-dimensional convolutional network, spatial and temporal features are tightly entangled and learned jointly. That is, rather than learning spatial and temporal features separately and fusing them at the top of the network, joint spatiotemporal features are learned by three-dimensional convolutions distributed across the network. Given the excellent feature representation learning capability of CNNs, three-dimensional convolution should, ideally, enjoy the same success in video understanding that two-dimensional convolution has had in image recognition. However, the large number of model parameters and low computational efficiency limit the effectiveness and practicality of three-dimensional convolution.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a behavior recognition method based on a combined time-domain channel correlation block.
A behavior recognition method based on a combined time domain channel correlation block, comprising the steps of:
s1, compressing an input initial three-dimensional space-time signal feature map through space global average pooling operation to obtain a time domain channel description operator;
s2, inputting a time domain channel description operator into an attention module to obtain a time domain channel global nonlinear dependence;
and S3, assigning the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional space-time signal feature map input in the step S1 with the tensor output by the attention module in the step S2 channel by channel through residual connection to obtain a feature map after channel weighting.
Preferably, in the above-mentioned behavior recognition method based on the combined time-domain channel correlation block, the three-dimensional spatio-temporal signal of the initial feature map input in the step S1 is expressed as X ∈ R^(T×H×W×C), where T, H, W and C denote the time-domain length, the spatial height and width, and the number of channels of the input signal, respectively.
Preferably, in the above behavior recognition method based on the combined time-domain channel correlation block, the time-domain channel descriptor obtained in the step S1 is z ∈ R^(T×C), with z_{t,c} = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_{t,i,j,c}, i.e. the spatial global average of each frame-channel slice.
Preferably, in the above behavior recognition method based on the combined time-domain channel correlation block, the attention module is composed of two fully connected layers, wherein the first fully connected layer reduces the feature dimension to C/r, while the second fully connected layer restores the feature dimension to C.
Preferably, in the above behavior recognition method based on the combined time-domain channel correlation block, in the step S3 the process by which the attention module fuses time-domain/channel information and extracts channel-by-channel information is expressed as: Z = σ(MLP(z)) = σ(W_1(δ(W_0 z))); where W_0 ∈ R^((C/r)×C) and W_1 ∈ R^(C×(C/r)), δ and σ denote the ReLU and Sigmoid activation functions respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module.
Preferably, in the above behavior recognition method based on the combined time-domain channel correlation block, in the step S3 the tensor Z output by the attention module is assigned as the weight describing the importance of each channel after feature selection.
Preferably, in the above-mentioned behavior recognition method based on the combined time-domain channel correlation block, in the step S3 the channel-weighted feature map, obtained by multiplying the initial three-dimensional spatio-temporal signal feature map input in the step S1 channel by channel with the tensor output by the attention module in the step S2 through the residual connection, is denoted X_c, with X_c = F_scale(X, Z) = X · Z; where X = [x_1, x_2, …, x_C] and F_scale(X, Z) denotes the channel-by-channel multiplication of the feature map X ∈ R^(T×H×W×C) with the weight tensor Z ∈ R^(T×C).
Preferably, in the above behavior recognition method based on the combined time-domain channel correlation block, the hyper-parameter r takes a value from [2, 4, 8, 16, 32, …].
Preferably, in the above behavior recognition method based on the combined time-domain channel correlation block, the hyper-parameter r takes the value 16.
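For illustration only, the overall block described in steps S1 to S3 can be pictured with the following minimal PyTorch sketch. This is a sketch under assumptions made here, not the patented implementation itself: the class name CTCBlock, the tensor layout (N, C, T, H, W) and the use of nn.Linear layers for the two fully connected layers are choices introduced for clarity.

```python
# Minimal sketch of steps S1-S3; assumes a (N, C, T, H, W) feature layout and r = 16.
import torch
import torch.nn as nn

class CTCBlock(nn.Module):
    """Illustrative combined time-domain channel correlation block (names are assumptions)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # first FC layer: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)   # second FC layer: C/r -> C
        self.relu = nn.ReLU(inplace=True)               # delta
        self.sigmoid = nn.Sigmoid()                     # sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # S1: spatial global average pooling -> time-domain channel descriptor z, shape (N, T, C)
        z = x.mean(dim=(3, 4)).permute(0, 2, 1)
        # S2: Z = sigmoid(W1(ReLU(W0 z))) -> global nonlinear time-domain/channel dependence
        weights = self.sigmoid(self.fc2(self.relu(self.fc1(z))))
        # S3: channel-by-channel re-weighting X_c = X * Z (broadcast over the spatial dims)
        weights = weights.permute(0, 2, 1).reshape(n, c, t, 1, 1)
        return x * weights
```

As a quick shape check, CTCBlock(64)(torch.randn(2, 64, 16, 28, 28)) returns a tensor of the same shape with the channels of each frame re-weighted.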
The invention has the beneficial effects that: the network layer effectively captures the correlation information between the time domain and the channels; time-domain/channel correlation feature learning can be performed on any network; a channel-by-channel descriptor is obtained and applied to the preceding features by multiplication, completing the re-weighting of the original features in the channel dimension. By concentrating more of the network's computing resources on the feature channels that are important to the output result, the computing resources are used more efficiently and the behavior recognition accuracy is improved.
Detailed Description
The following description of the embodiments of the present invention will clearly and fully describe the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a behavior recognition method based on a combined time domain channel correlation block, which comprises the following steps:
s1, compressing an input initial three-dimensional spatio-temporal signal feature map through a spatial global average pooling operation to obtain a time-domain channel descriptor. The three-dimensional spatio-temporal signal of the input initial feature map is expressed as X ∈ R^(T×H×W×C), where T, H, W and C denote the time-domain length, the spatial height and width, and the number of channels of the input signal, respectively. The resulting time-domain channel descriptor is z ∈ R^(T×C), with z_{t,c} = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_{t,i,j,c}.
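A sketch of step S1 alone, assuming the same (N, C, T, H, W) layout used elsewhere in this description; the function name is hypothetical:

```python
import torch

def temporal_channel_descriptor(x: torch.Tensor) -> torch.Tensor:
    """Step S1 sketch: average a (N, C, T, H, W) feature map over the spatial dims H and W,
    producing a time-domain channel descriptor of shape (N, T, C)."""
    z = x.mean(dim=(3, 4))       # spatial global average pooling -> (N, C, T)
    return z.permute(0, 2, 1)    # reorder so that channels form the last axis -> (N, T, C)

# Example: a 16-frame clip with 64 channels at 28x28 spatial resolution
z = temporal_channel_descriptor(torch.randn(2, 64, 16, 28, 28))
print(z.shape)  # torch.Size([2, 16, 64])
```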
S2, inputting the time-domain channel descriptor into an attention module to obtain the global nonlinear dependence between the time domain and the channels. To achieve this goal, the attention module must meet two conditions: first, it must be flexible; in particular, it must be able to learn the nonlinear interactions between the time domain and the channels. Second, it must learn a non-exclusive relationship, since the aim is to allow multiple channels to be emphasized rather than producing a single one-hot activation.
In particular, the attention module consists of two fully connected layers: the first fully connected layer reduces the feature dimension to C/r, while the second fully connected layer restores the feature dimension to C. A global receptive field over the spatial dimensions is obtained by the spatial global average pooling.
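A minimal sketch of this two-layer attention module, assuming PyTorch and treating the class name, layer ordering within nn.Sequential and the reduction-ratio argument as illustrative choices:

```python
import torch
import torch.nn as nn

class TemporalChannelAttention(nn.Module):
    """Step S2 sketch: bottleneck MLP mapping the (N, T, C) descriptor to weights in (0, 1)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),  # first fully connected layer: C -> C/r
            nn.ReLU(inplace=True),               # delta: ReLU activation
            nn.Linear(channels // r, channels),  # second fully connected layer: C/r -> C
            nn.Sigmoid(),                        # sigma: non-exclusive weights, not one-hot
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Z = sigmoid(W1(ReLU(W0 z))), applied independently to each time step
        return self.mlp(z)
```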
And S3, assigning the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional space-time signal feature map input in the step S1 with the tensor output by the attention module in the step S2 channel by channel through residual connection to obtain a feature map after channel weighting.
Specifically, in a preferred embodiment of the present invention, the process by which the attention module in step S3 fuses time-domain/channel information and extracts channel-by-channel information is expressed as: Z = σ(MLP(z)) = σ(W_1(δ(W_0 z))); where W_0 ∈ R^((C/r)×C) and W_1 ∈ R^(C×(C/r)), δ and σ denote the ReLU and Sigmoid activation functions respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module.
The tensor Z output by the attention module is assigned as the per-channel importance weight. The channel-weighted feature map, obtained by multiplying the initial three-dimensional spatio-temporal signal feature map input in step S1 channel by channel with the tensor output by the attention module in step S2 through the residual connection, is denoted X_c, with X_c = F_scale(X, Z) = X · Z; where X = [x_1, x_2, …, x_C] and F_scale(X, Z) denotes the channel-by-channel multiplication of the feature map X and the weight tensor Z. The hyper-parameter r takes a value from [2, 4, 8, 16, 32, …], i.e. a power of two of the form 2^(n+1), where n is a natural number greater than or equal to 0. In practice, experiments show that r = 16 gives the best results.
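The re-weighting X_c = F_scale(X, Z) = X · Z can be sketched as a broadcast multiplication, again assuming the (N, C, T, H, W) layout used in the sketches above; the random weight tensor in the example is only a stand-in for the attention output, used to illustrate shapes:

```python
import torch

def scale_features(x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Step S3 sketch: channel-by-channel re-weighting X_c = F_scale(X, Z) = X * Z.
    x: feature map of shape (N, C, T, H, W); z: attention weights of shape (N, T, C)."""
    n, c, t, _, _ = x.shape
    weights = z.permute(0, 2, 1).reshape(n, c, t, 1, 1)  # align weights with (N, C, T, H, W)
    return x * weights                                   # broadcast over the spatial dims

x = torch.randn(2, 64, 16, 28, 28)
z = torch.rand(2, 16, 64)          # stand-in for the attention output Z
x_c = scale_features(x, z)         # same shape as x, channels re-weighted per frame
```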
Specifically, the overall network architecture is shown in the following table:
in the above table, 3D-ResNet101 represents the basic 101-layer residual network, while 3D CTC-ResNet101 represents the architecture obtained after adding the combined time-domain channel correlation blocks (CTC): a CTC module is added to each block of the residual network to construct the 3D CTC-ResNet101 network. Both architectures employ three-dimensional convolution kernels and three-dimensional pooling, with each convolution layer shown in the table corresponding to a composite BN-ReLU-Conv sequence. Tests on the behavior recognition data sets UCF-101 and HMDB-51 show that, compared with the reference network, adding the CTC module improves the recognition rate to a certain extent.
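How a CTC module might be attached to each residual block of the 3D CTC-ResNet101 can be pictured with the following sketch. The pre-activation BN-ReLU-Conv ordering follows the text above, but the bottleneck layout, channel counts, class name and the exact insertion point of the CTC module (here: on the block output, before the residual addition) are assumptions made for illustration; CTCBlock refers to the illustrative module sketched earlier.

```python
import torch
import torch.nn as nn

class Bottleneck3DWithCTC(nn.Module):
    """Sketch of a pre-activation 3D bottleneck block (BN-ReLU-Conv) with a CTC module
    re-weighting the block output before the residual addition."""
    def __init__(self, in_channels: int, channels: int, r: int = 16):
        super().__init__()
        def bn_relu_conv(cin, cout, k, p=0):
            return nn.Sequential(nn.BatchNorm3d(cin), nn.ReLU(inplace=True),
                                 nn.Conv3d(cin, cout, kernel_size=k, padding=p, bias=False))
        self.branch = nn.Sequential(
            bn_relu_conv(in_channels, channels, 1),
            bn_relu_conv(channels, channels, 3, p=1),
            bn_relu_conv(channels, channels * 4, 1),
        )
        self.ctc = CTCBlock(channels * 4, r=r)   # illustrative module from the earlier sketch
        self.shortcut = (nn.Conv3d(in_channels, channels * 4, kernel_size=1, bias=False)
                         if in_channels != channels * 4 else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.ctc(self.branch(x))           # CTC re-weights channels of the block output
        return out + self.shortcut(x)            # residual connection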
In summary, the invention effectively captures the correlation information between the time domain and the channels through a network layer, can perform time-domain/channel correlation feature learning on any network, obtains a channel-by-channel descriptor, applies it to the preceding features by channel-wise multiplication, and thereby completes the re-weighting of the original features in the channel dimension. By concentrating more of the network's computing resources on the feature channels important to the output result, the computing resources are used more efficiently and the behavior recognition accuracy is improved.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the foregoing embodiments, but rather, the foregoing embodiments and description illustrate the principles of the invention, and that various changes and modifications may be effected therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.
Claims (6)
1. The behavior recognition method based on the combined time domain channel correlation block is characterized by comprising the following steps of:
s1, compressing an input initial three-dimensional space-time signal feature map through space global average pooling operation to obtain a time domain channel description operator;
s2, inputting a time domain channel description operator into an attention module to obtain a time domain channel global nonlinear dependence;
s3, assigning the tensor output by the attention module as the weight of the importance of each channel after feature selection, and multiplying the initial three-dimensional space-time signal feature map input in the step S1 with the tensor output by the attention module in the step S2 channel by channel through residual connection to obtain a feature map after channel weighting;
in the step S3, the process by which the attention module fuses time-domain/channel information and extracts channel-by-channel information is expressed as: Z = σ(MLP(z)) = σ(W_1(δ(W_0 z))); where W_0 ∈ R^((C/r)×C) and W_1 ∈ R^(C×(C/r)), δ and σ denote the ReLU and Sigmoid activation functions respectively, and r is a hyper-parameter used to reduce the number of parameters of the attention module;
the channel-weighted feature map, obtained by multiplying the initial three-dimensional spatio-temporal signal feature map input in the step S1 channel by channel with the tensor output by the attention module in the step S2 through the residual connection, is denoted X_c, with X_c = F_scale(X, Z) = X · Z; where X = [x_1, x_2, …, x_C] and F_scale(X, Z) denotes the channel-by-channel multiplication of the feature map X and the weight tensor Z.
4. The behavior recognition method based on the combined time domain channel correlation block according to claim 1, characterized in that the attention module consists of two fully connected layers, wherein the first fully connected layer reduces the feature dimension to C/r, while the second fully connected layer restores the feature dimension to C.
5. The behavior recognition method based on the combined time domain channel correlation block according to claim 1, characterized in that the hyper-parameter r takes a value from [2, 4, 8, 16, 32, …].
6. The behavior recognition method based on the combined time domain channel correlation block according to claim 5, characterized in that the hyper-parameter r takes the value 16.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010102863.5A CN111325145B (en) | 2020-02-19 | 2020-02-19 | Behavior recognition method based on combined time domain channel correlation block |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010102863.5A CN111325145B (en) | 2020-02-19 | 2020-02-19 | Behavior recognition method based on combined time domain channel correlation block |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111325145A CN111325145A (en) | 2020-06-23 |
CN111325145B true CN111325145B (en) | 2023-04-25 |
Family
ID=71172703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010102863.5A Active CN111325145B (en) | 2020-02-19 | 2020-02-19 | Behavior recognition method based on combined time domain channel correlation block |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111325145B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113989940B (en) * | 2021-11-17 | 2024-03-29 | 中国科学技术大学 | Method, system, device and storage medium for identifying actions in video data |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726659A (en) * | 2018-12-21 | 2019-05-07 | 北京达佳互联信息技术有限公司 | Detection method, device, electronic equipment and the readable medium of skeleton key point |
CN109871777A (en) * | 2019-01-23 | 2019-06-11 | 广州智慧城市发展研究院 | A kind of Activity recognition system based on attention mechanism |
CN110070073A (en) * | 2019-05-07 | 2019-07-30 | 国家广播电视总局广播电视科学研究院 | Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism |
CN110084180A (en) * | 2019-04-24 | 2019-08-02 | 北京达佳互联信息技术有限公司 | Critical point detection method, apparatus, electronic equipment and readable storage medium storing program for executing |
CN110610129A (en) * | 2019-08-05 | 2019-12-24 | 华中科技大学 | Deep learning face recognition system and method based on self-attention mechanism |
-
2020
- 2020-02-19 CN CN202010102863.5A patent/CN111325145B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726659A (en) * | 2018-12-21 | 2019-05-07 | 北京达佳互联信息技术有限公司 | Detection method, device, electronic equipment and the readable medium of skeleton key point |
CN109871777A (en) * | 2019-01-23 | 2019-06-11 | 广州智慧城市发展研究院 | A kind of Activity recognition system based on attention mechanism |
CN110084180A (en) * | 2019-04-24 | 2019-08-02 | 北京达佳互联信息技术有限公司 | Critical point detection method, apparatus, electronic equipment and readable storage medium storing program for executing |
CN110070073A (en) * | 2019-05-07 | 2019-07-30 | 国家广播电视总局广播电视科学研究院 | Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism |
CN110610129A (en) * | 2019-08-05 | 2019-12-24 | 华中科技大学 | Deep learning face recognition system and method based on self-attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN111325145A (en) | 2020-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Hierarchical feature fusion with mixed convolution attention for single image dehazing | |
CN112597883B (en) | Human skeleton action recognition method based on generalized graph convolution and reinforcement learning | |
CN111985343B (en) | Construction method of behavior recognition depth network model and behavior recognition method | |
CN113378600B (en) | Behavior recognition method and system | |
CN114596520A (en) | First visual angle video action identification method and device | |
Zou et al. | Crowd counting via hierarchical scale recalibration network | |
CN113255464A (en) | Airplane action recognition method and system | |
WO2021057091A1 (en) | Viewpoint image processing method and related device | |
CN114708665A (en) | Skeleton map human behavior identification method and system based on multi-stream fusion | |
CN116563355A (en) | Target tracking method based on space-time interaction attention mechanism | |
CN111325145B (en) | Behavior recognition method based on combined time domain channel correlation block | |
CN114495271A (en) | Human behavior identification method based on deep ConvLSTM and double-current fusion network | |
CN117726517A (en) | Classroom image super-resolution method based on Transformer | |
Jiang et al. | Cross-level reinforced attention network for person re-identification | |
Yadav et al. | Video object detection from compressed formats for modern lightweight consumer electronics | |
CN117975565A (en) | Action recognition system and method based on space-time diffusion and parallel convertors | |
CN113393435A (en) | Video significance detection method based on dynamic context-aware filter network | |
Yuan et al. | Multi-filter dynamic graph convolutional networks for skeleton-based action recognition | |
TWI826160B (en) | Image encoding and decoding method and apparatus | |
CN116597144A (en) | Image semantic segmentation method based on event camera | |
CN116453025A (en) | Volleyball match group behavior identification method integrating space-time information in frame-missing environment | |
CN111325149A (en) | Video action identification method based on voting time sequence correlation model | |
CN114648722B (en) | Motion recognition method based on video multipath space-time characteristic network | |
CN114022371B (en) | Defogging device and defogging method based on space and channel attention residual error network | |
CN115511858A (en) | Video quality evaluation method based on novel time sequence characteristic relation mapping |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||