CN112766177B - Behavior identification method based on feature mapping and multi-layer time interaction attention - Google Patents

Behavior identification method based on feature mapping and multi-layer time interaction attention

Info

Publication number
CN112766177B
CN112766177B (application number CN202110086627.3A)
Authority
CN
China
Prior art keywords
video
matrix
feature
generating
attention
Prior art date
Legal status
Active
Application number
CN202110086627.3A
Other languages
Chinese (zh)
Other versions
CN112766177A (en)
Inventor
同鸣
金磊
董秋宇
边放
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110086627.3A priority Critical patent/CN112766177B/en
Publication of CN112766177A publication Critical patent/CN112766177A/en
Application granted granted Critical
Publication of CN112766177B publication Critical patent/CN112766177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on feature mapping and multi-layer time interaction attention, which solves the problem that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, resulting in insufficient behavior recognition capability. The implementation steps are: (1) generating a training set; (2) acquiring a depth feature map; (3) constructing a feature mapping matrix; (4) generating a time interaction attention matrix; (5) generating a time interaction attention weighted feature matrix; (6) generating a multi-layer time interaction attention weighted feature matrix; (7) acquiring a feature vector of the video; and (8) performing behavior recognition on the video. Because the invention constructs a feature mapping matrix and proposes multi-layer time interaction attention, it can improve the accuracy of behavior recognition in videos.

Description

Behavior identification method based on feature mapping and multi-layer time interaction attention
Technical Field
The invention belongs to the technical field of video processing, and further relates to a behavior identification method based on feature mapping and multilayer time interaction attention in the technical field of computer vision. The method can be used for human behavior recognition in videos.
Background
The video-based human behavior recognition task plays an important role in the field of computer vision and has broad application prospects; it is currently applied in fields such as autonomous driving, human-computer interaction, and video surveillance. The goal of human behavior recognition is to judge the category of human behavior in a video, which is essentially a classification problem. In recent years, with the development of deep learning, behavior recognition methods based on deep learning have been widely studied.
South China University of Technology discloses a human behavior recognition method in its patent application "Human behavior recognition method based on time attention mechanism and LSTM" (application No. CN201910271178.2, publication No. CN110135249A). The method mainly comprises the following implementation steps: 1. acquiring video data from an RGB monocular vision sensor; 2. extracting 2D skeleton joint point data; 3. extracting joint point combined structure features; 4. constructing an LSTM (long short-term memory) network; 5. adding a time attention mechanism to the LSTM network; 6. performing human behavior recognition with a softmax classifier. The time attention mechanism proposed by this method explores the importance of each frame in the video separately and assigns large weights to the features of important frames, but it still has the defect that it ignores the interdependence between different frames in the video, so part of the global information is lost, causing behavior recognition errors.
Limin Wang et al. disclose a behavior recognition method in the published article "Temporal segment networks for action recognition in videos" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 2740-2755). The method mainly comprises the following implementation steps: 1. uniformly dividing the video into 7 video segments; 2. randomly sampling one RGB frame in each video segment to obtain 7 RGB frames; 3. inputting each sampled RGB frame into a convolutional neural network to obtain a classification score for each frame; 4. combining a segment consensus function and a prediction function with the classification scores of the 7 RGB frames to obtain the behavior recognition result of the video. The method has the defect that, for a longer video, only 7 RGB frames are sampled, so information in the video is lost and more complete temporal dynamic information cannot be modeled, resulting in lower behavior recognition accuracy.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a behavior recognition method based on feature mapping and multi-layer time interaction attention, so as to solve the problems that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, resulting in poor behavior recognition capability.
In order to achieve this purpose, the idea of the invention is to construct a feature mapping matrix that embeds the temporal and spatial information of the video; to obtain time interaction attention by exploring the mutual influence among different frames in the video; and to mine the complex temporal dynamic information in the video with multi-layer time interaction attention.
In order to achieve the purpose, the method comprises the following specific steps:
(1) Generating a training set:
(1a) Selecting RGB videos containing N behavior categories in a video data set to form a sample set, wherein each category contains at least 100 videos, each video has a determined behavior category, and N is greater than 50;
(1b) Preprocessing each video in the sample set to obtain RGB images corresponding to the video, and forming the RGB images of all preprocessed videos into a training set;
(2) Generating a depth feature map:
sequentially inputting each frame of RGB image of each video in the training set into an Inception-v2 network, and sequentially outputting a depth feature map $X_k$ of size 7 × 7 × 1024 for each frame of image in each video, wherein k represents the sequence number of the sampled image in the video, k = 1, 2, ..., 60;
(3) Constructing a feature mapping matrix:
(3a) Encoding each depth feature map into a 1024-dimensional low-dimensional vector $f_k$, k = 1, 2, ..., 60, using a spatial vectorization function;
(3b) Arranging the low-dimensional vectors corresponding to 60 frame sampling images of each video in a row according to the time sequence of the frames to obtain a two-dimensional feature mapping matrix
Figure BDA0002911074420000021
Wherein T represents a transpose operation;
(4) Generating a temporal interaction attention matrix:
(4a) Using the formula $B = M^{T} M$, generating the correlation matrix B of M, wherein the value in the i-th row and j-th column of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images in the video;
(4b) Normalizing the correlation matrix B to obtain a time interaction attention matrix A with a size of 60 × 60;
(5) Generating a time interaction attention weighted feature matrix:
using the formula $\hat{M} = \gamma M A + M$ to generate the time interaction attention weighted feature matrix $\hat{M}$, wherein γ represents a scale parameter initialized to 0 for balancing the two terms $MA$ and $M$;
(6) Generating a multi-layer time interaction attention weighted feature matrix:
(6a) Using the formula $\hat{B} = \hat{M}^{T} \hat{M}$, generating the correlation matrix $\hat{B}$ of $\hat{M}$; normalizing $\hat{B}$ to obtain a multi-layer time interaction attention matrix $\hat{A}$ with a size of 60 × 60;
(6b) Using the formula $\tilde{M} = \hat{\gamma} \hat{M} \hat{A} + \hat{M}$, generating the multi-layer time interaction attention weighted feature matrix $\tilde{M}$, wherein $\hat{\gamma}$ represents a scale parameter initialized to 0 for balancing the two terms $\hat{M} \hat{A}$ and $\hat{M}$;
(7) Acquiring a feature vector of a video:
inputting the multi-layer time interactive attention weighted feature matrix of each video into a full-connection layer, and outputting the feature vector of the video;
(8) Performing behavior recognition on the video:
(8a) Inputting the feature vector of each video into a softmax classifier, and iteratively updating the parameters γ and $\hat{\gamma}$, the parameters of the full connection layer, and the parameters of the softmax classifier by the back-propagation gradient descent method until the cross-entropy loss function converges, obtaining the trained parameters;
(8b) Sampling 60 frames of RGB images at equal intervals from each video to be recognized, scaling each frame to 256 × 340, then performing center cropping to obtain 60 RGB frames of size 224 × 224, inputting each RGB frame into the Inception-v2 network, and outputting the depth feature map of the video to be recognized;
(8c) And (4) processing the depth feature map of each video to be recognized by adopting the same processing method as the steps (3) to (7) to obtain feature vectors of the video, inputting each feature vector into a trained softmax classifier, and outputting a behavior recognition result of each video.
Compared with the prior art, the invention has the following advantages:
Firstly, the feature mapping matrix constructed by the invention contains the temporal information of the 60 sampled images of the video and the spatial information of each sampled image. This overcomes the problem in the prior art that sampling only 7 RGB frames loses information in the video and prevents more complete temporal dynamic information from being modeled, so the invention can retain temporal information more fully and obtain more expressive features.
Secondly, the invention proposes the time interaction attention matrix, which is obtained by calculating the degree of correlation between the low-dimensional features of different sampled images in the feature mapping matrix. This overcomes the problem that the prior art ignores the interdependence between different frames in the video and therefore loses part of the global information, so the proposed technique can fully explore global information and improve the accuracy of behavior recognition.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The specific steps of the present invention will be further described with reference to fig. 1.
Step 1, generating a training set.
Selecting RGB videos containing N behavior categories in a video data set to form a sample set, wherein each category contains at least 100 videos, each video has a determined behavior category, and N is greater than 50. Preprocessing each video in the sample set to obtain an RGB image corresponding to the video, and forming the RGB images of all preprocessed videos into a training set. The preprocessing is to sample 60 frames of RGB images at equal intervals for each video in the sample set, scale the size of each frame of RGB image to 256 × 340, and then crop the RGB images to obtain 60 frames of RGB images with the size of 224 × 224 for the video.
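By way of illustration only (not part of the claimed method), the following Python sketch shows one plausible realization of this preprocessing step: 60 frames are sampled at equal intervals, rescaled to 256 × 340, and center-cropped to 224 × 224. The function name sample_and_crop and the use of OpenCV are assumptions of this sketch.

```python
import cv2
import numpy as np

def sample_and_crop(video_path, num_frames=60, resize_hw=(256, 340), crop=224):
    """Sample frames at equal intervals, resize to 256x340, center-crop to 224x224."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Equally spaced frame indices over the whole video.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            frame_bgr = np.zeros((resize_hw[0], resize_hw[1], 3), dtype=np.uint8)
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        # cv2.resize expects (width, height).
        frame_rgb = cv2.resize(frame_rgb, (resize_hw[1], resize_hw[0]))
        h0 = (resize_hw[0] - crop) // 2
        w0 = (resize_hw[1] - crop) // 2
        frames.append(frame_rgb[h0:h0 + crop, w0:w0 + crop, :])
    cap.release()
    return np.stack(frames)  # shape: (60, 224, 224, 3)
```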
And 2, acquiring a depth characteristic map.
Each frame of RGB image of each video in the training set is sequentially input into an Inception-v2 network, which sequentially outputs a depth feature map $X_k$ of size 7 × 7 × 1024 for each frame of image in each video, where k denotes the sequence number of the sampled image in the video, k = 1, 2, ..., 60.
And 3, constructing a characteristic mapping matrix.
Due to the high dimensionality of the feature maps, jointly analyzing the information of the densely sampled images in a video is challenging; mapping each feature map to a low-dimensional vector reduces the amount of computation and facilitates joint analysis of the densely sampled images. Taking the k-th sampled image of the r-th video as an example, the depth feature map of a sampled image is encoded into a 1024-dimensional low-dimensional vector as follows:
$f_{r,k} = V(X_{r,k}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{r,k,ij}$
wherein $f_{r,k}$ represents the low-dimensional vector corresponding to the k-th sampled image of the r-th video, $V(\cdot)$ represents the spatial vectorization function, $X_{r,k}$ represents the depth feature map corresponding to the k-th sampled image of the r-th video, $X_{r,k,ij}$ represents the element in the i-th row and j-th column of $X_{r,k}$, Σ represents the summation operation, and H and W represent the total number of rows and the total number of columns of $X_{r,k}$, respectively.
Arranging the low-dimensional vectors corresponding to the 60 sampled frames of each video in a row according to the temporal order of the frames gives the two-dimensional feature mapping matrix $M = [f_{1}^{T}, f_{2}^{T}, \ldots, f_{60}^{T}]$, wherein $f_{k}$ represents the low-dimensional vector of the k-th sampled image, k = 1, 2, ..., 60, and T represents the transpose operation.
The number of columns of the matrix M is equal to the total number of sampled images corresponding to each video, and the number of rows is equal to the dimension of the low-dimensional vector.
The feature mapping matrix contains the time information of the video and the spatial information of each sampling image, so that the method can perform joint analysis on the densely sampled images in the video.
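As a purely illustrative sketch (the averaging inside spatial_vectorize and the function names are assumptions of this description, not the patented implementation), the following Python code builds the feature mapping matrix from the 60 depth feature maps of one video, assumed to be given as an array of shape (60, 7, 7, 1024):

```python
import numpy as np

def spatial_vectorize(X):
    """Map a depth feature map X of shape (H, W, C) to a C-dimensional vector
    by averaging over the H x W spatial positions (the spatial vectorization V)."""
    H, W, _ = X.shape
    return X.reshape(H * W, -1).sum(axis=0) / (H * W)

def build_feature_mapping_matrix(feature_maps):
    """feature_maps: array of shape (60, 7, 7, 1024), one depth feature map per sampled frame.
    Returns M of shape (1024, 60): column k holds the low-dimensional vector f_k."""
    F = np.stack([spatial_vectorize(X) for X in feature_maps])  # (60, 1024), row k is f_k
    return F.T  # (1024, 60), columns ordered by frame index
```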
And 4, generating a time interaction attention matrix.
Generate the correlation matrix $B = M^{T} M$ of M; the value in the i-th row and j-th column of B expresses the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images in the video. Normalizing B gives the time interaction attention matrix A with a size of 60 × 60.
Taking the i-th and j-th sampled frames as an example, the element $A_{ij}$ in the i-th row and j-th column of the time interaction attention matrix A is calculated from the degree of correlation between the two frames:
$A_{ij} = \frac{\exp(M_{i}^{T} M_{j})}{\sum_{n=1}^{60} \exp(M_{i}^{T} M_{n})}$
wherein $A_{ij}$ measures the degree of correlation between the i-th and j-th sampled frames, and $M_{i}$ and $M_{j}$ are the column vectors formed by the i-th and j-th columns of the feature mapping matrix M, whose physical meanings are the transposes of the low-dimensional vectors of the i-th and j-th sampled images of the video. The more similar the low-dimensional vectors of the two frames are, the larger $A_{ij}$ is and the stronger the correlation between the two frames.
All elements of the time interaction attention matrix A are calculated in the same way, and the i-th row of A represents the degree of correlation between the i-th sampled frame and all sampled frames of the video. The time interaction attention matrix therefore models the correlation between video frames and helps explore the global information in the video more fully.
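A minimal sketch of this step follows, assuming the row-wise softmax normalization described above; M is the 1024 × 60 feature mapping matrix built in the previous sketch, and the function name is illustrative only.

```python
import numpy as np

def temporal_interaction_attention(M):
    """M: (1024, 60) feature mapping matrix.
    Returns A: (60, 60), where A[i, j] reflects how correlated frames i and j are."""
    B = M.T @ M                           # correlation matrix, B[i, j] = M_i^T M_j
    B = B - B.max(axis=1, keepdims=True)  # numerical stability for the exponentials
    E = np.exp(B)
    return E / E.sum(axis=1, keepdims=True)  # row-wise normalization
```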
And 5, generating a time interaction attention weighted feature matrix.
Using the formula $\hat{M} = \gamma M A + M$, generate the time interaction attention weighted feature matrix $\hat{M}$, wherein γ represents a scale parameter initialized to 0 for balancing the two terms $MA$ and $M$.
And 6, generating a multilayer time interaction attention weighted feature matrix.
Using the formula $\hat{B} = \hat{M}^{T} \hat{M}$, generate the correlation matrix $\hat{B}$ of $\hat{M}$; normalizing $\hat{B}$ gives the multi-layer time interaction attention matrix $\hat{A}$ with a size of 60 × 60. Then, using the formula $\tilde{M} = \hat{\gamma} \hat{M} \hat{A} + \hat{M}$, generate the multi-layer time interaction attention weighted feature matrix $\tilde{M}$, wherein $\hat{\gamma}$ represents a scale parameter initialized to 0 for balancing the two terms $\hat{M} \hat{A}$ and $\hat{M}$.
Multi-layer time interaction attention applies time interaction attention again to the time interaction attention weighted feature matrix, thereby exploring richer temporal dynamics.
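The two weighting stages can be sketched as below, reusing the temporal_interaction_attention helper from the sketch after step 4; gamma and gamma_hat stand in for the scale parameters initialized to 0 (which are learned during training), and the function names are assumptions of this illustration rather than the patented implementation.

```python
import numpy as np

def attention_weighted(M, A, gamma):
    """One layer of time interaction attention weighting: M_hat = gamma * M A + M."""
    return gamma * (M @ A) + M                     # (1024, 60)

def multilayer_attention_weighted(M, gamma=0.0, gamma_hat=0.0):
    """Apply time interaction attention twice to obtain the multi-layer weighted matrix."""
    A = temporal_interaction_attention(M)          # first-layer attention, (60, 60)
    M_hat = attention_weighted(M, A, gamma)        # (1024, 60)
    A_hat = temporal_interaction_attention(M_hat)  # second-layer attention from M_hat
    return attention_weighted(M_hat, A_hat, gamma_hat)  # (1024, 60)
```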
And 7, acquiring a feature vector of the video.
And inputting the multi-layer time interactive attention weighted feature matrix of each video into a full-connection layer with 1024 output neurons to obtain the feature vector of the video.
And 8, performing behavior recognition on the video.
Inputting the feature vector of each video into a softmax classifier, and iteratively updating γ, $\hat{\gamma}$, the parameters of the full connection layer, and the parameters of the softmax classifier by the back-propagation gradient descent method until the cross-entropy loss function converges.
Sampling 60 frames of RGB images at equal intervals from each video to be recognized, scaling each frame to 256 × 340, then performing center cropping to obtain 60 RGB frames of size 224 × 224, inputting each RGB frame into the Inception-v2 network, and outputting the depth feature map of the video to be recognized.
And processing the depth feature map of each video to be recognized by the same processing method as steps 3 to 7 to obtain the feature vector of the video to be recognized, inputting each feature vector into the trained softmax classifier, and outputting the behavior recognition result of each video.
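As a final, purely illustrative sketch, the following PyTorch code ties steps 3 to 8 together in one trainable module. The module and method names, the flattening of the 1024 × 60 matrix before the full connection layer, and the default of 101 classes are assumptions of this sketch; the two scale parameters are held as nn.Parameter values initialized to 0 and trained jointly with the full connection layer and softmax classifier via cross-entropy, as the description above states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerTemporalInteractionAttention(nn.Module):
    """Illustrative sketch of steps 3-8, given per-frame depth feature maps."""
    def __init__(self, feat_dim=1024, num_frames=60, num_classes=101):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))      # scale of the first attention layer
        self.gamma_hat = nn.Parameter(torch.zeros(1))  # scale of the second attention layer
        self.fc = nn.Linear(feat_dim * num_frames, feat_dim)  # full connection layer, 1024 outputs
        self.classifier = nn.Linear(feat_dim, num_classes)    # logits for the softmax classifier

    def _attention(self, M):
        # M: (batch, feat_dim, num_frames); A[i, j] reflects the correlation of frames i and j.
        B = torch.bmm(M.transpose(1, 2), M)            # correlation matrices, (batch, 60, 60)
        return F.softmax(B, dim=-1)                    # row-wise normalization

    def forward(self, feature_maps):
        # feature_maps: (batch, 60, 7, 7, 1024) depth feature maps from the backbone.
        f = feature_maps.mean(dim=(2, 3))              # spatial vectorization -> (batch, 60, 1024)
        M = f.transpose(1, 2)                          # feature mapping matrix -> (batch, 1024, 60)
        A = self._attention(M)
        M_hat = self.gamma * torch.bmm(M, A) + M       # time interaction attention weighting
        A_hat = self._attention(M_hat)
        M_tilde = self.gamma_hat * torch.bmm(M_hat, A_hat) + M_hat  # multi-layer weighting
        v = self.fc(M_tilde.flatten(1))                # feature vector of the video
        return self.classifier(v)                      # train with nn.CrossEntropyLoss
```

A training step would then follow the usual pattern: logits = model(maps); loss = F.cross_entropy(logits, labels); loss.backward(); optimizer.step().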

Claims (4)

1. A behavior identification method based on feature mapping and multi-layer time interaction attention, characterized in that a feature mapping matrix containing the temporal information of a video and the spatial information of each sampled image is constructed, time interaction attention is proposed, and a time interaction attention matrix is obtained by calculating the degree of correlation between the low-dimensional vectors of different sampled images in the feature mapping matrix; the method specifically comprises the following steps:
(1) Generating a training set:
(1a) Selecting RGB videos containing N behavior categories in a video data set to form a sample set, wherein each category contains at least 100 videos, each video has a determined behavior category, and N is greater than 50;
(1b) Preprocessing each video in the sample set to obtain RGB images corresponding to the video, and forming the RGB images of all preprocessed videos into a training set;
(2) Generating a depth feature map:
sequentially inputting each frame of RGB image of each video in the training set into an Inception-v2 network, and sequentially outputting a depth feature map $X_k$ of size 7 × 7 × 1024 for each frame of image in each video, wherein k represents the sequence number of the sampled image in the video, k = 1, 2, ..., 60;
(3) Constructing a feature mapping matrix:
(3a) Encoding each depth feature map into a 1024-dimensional low-dimensional vector $f_k$, k = 1, 2, ..., 60, using a spatial vectorization function;
(3b) Arranging the low-dimensional vectors corresponding to the 60 sampled frames of each video in a row according to the temporal order of the frames to obtain a two-dimensional feature mapping matrix $M = [f_{1}^{T}, f_{2}^{T}, \ldots, f_{60}^{T}]$, wherein T represents the transpose operation;
(4) Generating a temporal interaction attention matrix:
(4a) Using the formula $B = M^{T} M$, generating the correlation matrix B of M, wherein the value in the i-th row and j-th column of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images in the video;
(4b) Normalizing the correlation matrix B to obtain a time interaction attention matrix A with a size of 60 × 60;
(5) Generating a time interaction attention weighted feature matrix:
using the formula $\hat{M} = \gamma M A + M$ to generate the time interaction attention weighted feature matrix $\hat{M}$, wherein γ represents a scale parameter initialized to 0 for balancing the two terms $MA$ and $M$;
(6) Generating a multi-layer time interactive attention weighted feature matrix:
(6a) Using the formula $\hat{B} = \hat{M}^{T} \hat{M}$, generating the correlation matrix $\hat{B}$ of $\hat{M}$; normalizing $\hat{B}$ to obtain a multi-layer time interaction attention matrix $\hat{A}$ with a size of 60 × 60;
(6b) Using the formula $\tilde{M} = \hat{\gamma} \hat{M} \hat{A} + \hat{M}$, generating the multi-layer time interaction attention weighted feature matrix $\tilde{M}$, wherein $\hat{\gamma}$ represents a scale parameter initialized to 0 for balancing the two terms $\hat{M} \hat{A}$ and $\hat{M}$;
(7) Acquiring a feature vector of a video:
inputting the multilayer time interactive attention weighted feature matrix of each video into a full connection layer, and outputting the feature vector of the video;
(8) Performing behavior recognition on the video:
(8a) Inputting the feature vector of each video into a softmax classifier, and iteratively updating the parameters γ and $\hat{\gamma}$, the parameters of the full connection layer, and the parameters of the softmax classifier by the back-propagation gradient descent method until the cross-entropy loss function converges, obtaining the trained parameters;
(8b) Sampling 60 frames of RGB images at equal intervals from each video to be recognized, scaling each frame to 256 × 340, then performing center cropping to obtain 60 RGB frames of size 224 × 224, inputting each RGB frame into the Inception-v2 network, and outputting the depth feature map of the video to be recognized;
(8c) And (4) processing the depth feature map of each video to be recognized by adopting the same processing method as the steps (3) to (7) to obtain feature vectors of the video, inputting each feature vector into a trained softmax classifier, and outputting a behavior recognition result of each video.
2. The method according to claim 1, wherein the preprocessing of each video in the sample set in step (1 b) comprises sampling 60 frames of RGB images at equal intervals for each video in the sample set, scaling the RGB images to 256 × 340, and cropping to obtain 60 frames of RGB images with a size of 224 × 224 for the video.
3. The method for recognizing behavior based on feature mapping and multi-layer temporal interaction attention according to claim 1, wherein the spatial vectorization function in step (3a) is as follows:
$f_{r,k} = V(X_{r,k}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{r,k,ij}$
wherein $f_{r,k}$ represents the low-dimensional vector corresponding to the k-th sampled frame of the r-th video, $V(\cdot)$ represents the spatial vectorization function, $X_{r,k}$ represents the depth feature map corresponding to the k-th sampled frame of the r-th video, $X_{r,k,ij}$ represents the element in the i-th row and j-th column of $X_{r,k}$, Σ represents the summation operation, and H and W represent the total number of rows and the total number of columns of $X_{r,k}$, respectively.
4. The method for behavior recognition based on feature mapping and multi-layer temporal interaction attention of claim 1, wherein the number of output neurons of the fully-connected layer in step (7) is set to 1024.
CN202110086627.3A 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention Active CN112766177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention

Publications (2)

Publication Number Publication Date
CN112766177A CN112766177A (en) 2021-05-07
CN112766177B true CN112766177B (en) 2022-12-02

Family

ID=75702700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110086627.3A Active CN112766177B (en) 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention

Country Status (1)

Country Link
CN (1) CN112766177B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
EP3625727A1 (en) * 2017-11-14 2020-03-25 Google LLC Weakly-supervised action localization by sparse temporal pooling network
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200175281A1 (en) * 2018-11-30 2020-06-04 International Business Machines Corporation Relation attention module for temporal action localization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3625727A1 (en) * 2017-11-14 2020-03-25 Google LLC Weakly-supervised action localization by sparse temporal pooling network
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A new framework of action recognition with discriminative parts,spatio-temporal and causal interaction descriptors";Ming Tong 等;《ELSEVIER》;20180904;116–130 *
基于通道注意力机制的视频人体行为识别;解怀奇等;《电子技术与软件工程》;20200215(第04期);146-148 *
融合空间-时间双网络流和视觉注意的人体行为识别;刘天亮等;《电子与信息学报》;20180815(第10期);114-120 *

Also Published As

Publication number Publication date
CN112766177A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN108537742B (en) Remote sensing image panchromatic sharpening method based on generation countermeasure network
CN113469094A (en) Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN112070078B (en) Deep learning-based land utilization classification method and system
CN111639719B (en) Footprint image retrieval method based on space-time motion and feature fusion
CN107688856B (en) Indoor robot scene active identification method based on deep reinforcement learning
CN113936339A (en) Fighting identification method and device based on double-channel cross attention mechanism
CN112926396A (en) Action identification method based on double-current convolution attention
CN107451565B (en) Semi-supervised small sample deep learning image mode classification and identification method
CN108648197A (en) A kind of object candidate area extracting method based on image background mask
CN109817276A (en) A kind of secondary protein structure prediction method based on deep neural network
CN114782694B (en) Unsupervised anomaly detection method, system, device and storage medium
CN109376589A (en) ROV deformation target and Small object recognition methods based on convolution kernel screening SSD network
CN113887517B (en) Crop remote sensing image semantic segmentation method based on parallel attention mechanism
CN113344045B (en) Method for improving SAR ship classification precision by combining HOG characteristics
CN113935249B (en) Upper-layer ocean thermal structure inversion method based on compression and excitation network
CN114565594A (en) Image anomaly detection method based on soft mask contrast loss
CN115757919A (en) Symmetric deep network and dynamic multi-interaction human resource post recommendation method
CN115032602A (en) Radar target identification method based on multi-scale convolution capsule network
CN117593666B (en) Geomagnetic station data prediction method and system for aurora image
CN115019132A (en) Multi-target identification method for complex background ship
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN113516232A (en) Training method of neural network model based on self-attention mechanism
CN112766177B (en) Behavior identification method based on feature mapping and multi-layer time interaction attention
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant