CN112766177A - Behavior identification method based on feature mapping and multi-layer time interaction attention - Google Patents
Behavior identification method based on feature mapping and multi-layer time interaction attention
- Publication number
- CN112766177A (application CN202110086627.3A)
- Authority
- CN
- China
- Prior art keywords
- video
- matrix
- temporal
- feature
- interactive attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a behavior recognition method based on feature mapping and multi-layer temporal interactive attention, which solves the problem that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, resulting in an inadequate ability to recognize behaviors. The implementation steps of the invention are: (1) generating a training set; (2) acquiring deep feature maps; (3) constructing a feature mapping matrix; (4) generating a temporal interactive attention matrix; (5) generating a temporal interactive attention weighted feature matrix; (6) generating a multi-layer temporal interactive attention weighted feature matrix; (7) obtaining the feature vector of the video; (8) performing behavior recognition on the video. Because the invention constructs a feature mapping matrix and proposes multi-layer temporal interactive attention, it can improve the accuracy of behavior recognition in videos.
Description
Technical Field
The invention belongs to the technical field of video processing, and further relates to a behavior recognition method based on feature mapping and multi-layer temporal interactive attention in the technical field of computer vision. The invention can be used for human action recognition in videos.
Background Art
Video-based human behavior recognition occupies an important position in the field of computer vision and has broad application prospects; it has already been applied to autonomous driving, human-computer interaction, video surveillance and other fields. The goal of human behavior recognition is to determine the category of the human behavior in a video, which is essentially a classification problem. In recent years, with the development of deep learning, behavior recognition methods based on deep learning have been widely studied.
South China University of Technology disclosed a human action recognition method in its patent document "Human action recognition method based on a temporal attention mechanism and LSTM" (Patent Application No. CN201910271178.2, Publication No. CN110135249A). The main implementation steps of the method are: 1. acquire video data from an RGB monocular vision sensor; 2. extract 2D skeleton joint point data; 3. extract joint structural features of the joint points; 4. construct an LSTM (long short-term memory) network; 5. add a temporal attention mechanism to the LSTM network; 6. perform human action recognition with a softmax classifier. The temporal attention mechanism proposed by this method explores the importance of each frame of the video independently and assigns large weights to the features of important frames. However, the method still has the shortcoming that it ignores the interdependence between different frames of the video, thereby losing part of the global information and leading to errors in behavior recognition.
Limin Wang et al. disclosed an action recognition method in the paper "Temporal segment networks for action recognition in videos" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 2740-2755). The main implementation steps of the method are: 1. divide the video evenly into 7 segments; 2. randomly sample one RGB frame in each segment, obtaining 7 RGB frames; 3. input each sampled RGB frame into a convolutional neural network to obtain the classification score of each frame; 4. combine the classification scores of the 7 frames with a segmental consensus function and a prediction function to obtain the action recognition result of the video. The shortcoming of this method is that, for longer videos, sampling only 7 RGB frames leads to a loss of information in the video and cannot model more complete temporal dynamics, resulting in lower recognition accuracy.
Summary of the Invention
The purpose of the invention is to address the above shortcomings of the prior art and propose a behavior recognition method based on feature mapping and multi-layer temporal interactive attention, which is used to solve the problem that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, leading to poor behavior recognition ability.
To achieve the above purpose, the idea of the invention is to construct a feature mapping matrix that embeds the temporal and spatial information of the video, to obtain temporal interactive attention by exploring the mutual influence between different frames of the video, and to mine the complex temporal dynamics of the video with multiple layers of temporal interactive attention.
To achieve the above purpose, the specific implementation steps of the invention are as follows:
(1) Generate a training set:
(1a) Select RGB videos covering N behavior categories from a video data set to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50;
(1b) Preprocess each video in the sample set to obtain the RGB images corresponding to the video, and compose the RGB images of all preprocessed videos into a training set;
(2) Generate deep feature maps:
Input each RGB frame of each video in the training set into the Inception-v2 network in turn, and output in turn a deep feature map X_k of size 7×7×1024 for each frame of each video, where k denotes the index of the sampled image in the video, k = 1, 2, ..., 60;
(3) Construct the feature mapping matrix:
(3a) Use a spatial vectorization function to encode each deep feature map into a low-dimensional vector f_k of dimension 1024, k = 1, 2, ..., 60;
(3b) Arrange the low-dimensional vectors corresponding to the 60 sampled images of each video in the temporal order of the frames to obtain a two-dimensional feature mapping matrix $M = [f_1, f_2, \ldots, f_{60}]$ (the transpose of the matrix whose k-th row is $f_k^T$), where T denotes the transpose operation;
(4) Generate the temporal interactive attention matrix:
(4a) Use the formula $B = M^T M$ to generate the correlation matrix B of M, in which the value in row i, column j represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video;
(4b) Normalize the correlation matrix B to obtain a temporal interactive attention matrix A of size 60×60;
(5) Generate the temporal interactive attention weighted feature matrix:
Use the formula $\tilde{M} = \gamma MA + M$ to generate the temporal interactive attention weighted feature matrix $\tilde{M}$, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M;
(6) Generate the multi-layer temporal interactive attention weighted feature matrix:
(6a) Use the formula $\tilde{B} = \tilde{M}^T \tilde{M}$ to generate the correlation matrix $\tilde{B}$ of $\tilde{M}$, and normalize $\tilde{B}$ to obtain a multi-layer temporal interactive attention matrix $\tilde{A}$ of size 60×60;
(6b) Use the formula $\hat{M} = \tilde{\gamma}\tilde{M}\tilde{A} + \tilde{M}$ to generate the multi-layer temporal interactive attention weighted feature matrix $\hat{M}$, where $\tilde{\gamma}$ denotes a scale parameter, initialized to 0, that balances the two terms $\tilde{M}\tilde{A}$ and $\tilde{M}$;
(7) Obtain the feature vector of the video:
Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer, and output the feature vector of the video;
(8) Perform behavior recognition on the video:
(8a) Input the feature vector of each video into a softmax classifier, and use back-propagation with gradient descent to iteratively update the parameters γ and $\tilde{\gamma}$, the parameters of the fully connected layer and the parameters of the softmax classifier until the cross-entropy loss function converges, obtaining the trained parameters;
(8b) Sample 60 RGB frames at equal intervals from each video to be recognized, scale each frame to 256×340 and then center-crop it, obtaining 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized;
(8c) Process the deep feature maps of each video to be recognized with the same processing as steps (3) to (7) to obtain the feature vector of the video, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video. An illustrative end-to-end sketch of these steps follows.
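The end-to-end structure of steps (2) to (8a) can be illustrated with the following minimal PyTorch sketch. It is an illustration under stated assumptions, not the original implementation: the backbone feature maps are taken as given inputs, the normalization of the correlation matrices is assumed to be a row-wise softmax, the flattening order before the fully connected layer is an assumption, and all class, function and variable names are illustrative.

```python
# Minimal sketch of the disclosed pipeline (steps (2)-(8a)); shapes follow the description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveAttentionLayer(nn.Module):
    """One layer of temporal interactive attention: M -> gamma * (M A) + M."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))        # scale parameter initialized to 0

    def forward(self, M):                                # M: (batch, 1024, 60)
        B = M.transpose(1, 2) @ M                        # (batch, 60, 60) correlation matrix
        A = F.softmax(B, dim=-1)                         # assumed row-wise normalization
        return self.gamma * (M @ A) + M

class MultilayerTemporalAttentionNet(nn.Module):
    def __init__(self, num_classes: int, num_frames: int = 60, dim: int = 1024):
        super().__init__()
        self.att1 = InteractiveAttentionLayer()          # step (5)
        self.att2 = InteractiveAttentionLayer()          # step (6)
        self.fc = nn.Linear(dim * num_frames, dim)       # step (7)
        self.classifier = nn.Linear(dim, num_classes)    # step (8); softmax is in the loss

    def forward(self, feature_maps):                     # (batch, 60, 1024, 7, 7) from the backbone
        f = feature_maps.mean(dim=(3, 4))                # step (3a): spatial vectorization
        M = f.transpose(1, 2)                            # step (3b): feature mapping matrix (batch, 1024, 60)
        M = self.att2(self.att1(M))                      # steps (4)-(6)
        v = self.fc(M.flatten(1))                        # step (7): video feature vector
        return self.classifier(v)                        # class scores

# Usage with random stand-in features:
model = MultilayerTemporalAttentionNet(num_classes=101)
scores = model(torch.randn(2, 60, 1024, 7, 7))           # -> (2, 101)
```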
Compared with the prior art, the invention has the following advantages:
First, because the invention constructs a feature mapping matrix that contains the temporal information of the 60 sampled images of the video and the spatial information of each sampled image, it overcomes the problem of the prior art that sampling only 7 RGB frames loses information in the video and cannot model more complete temporal dynamics, so that the invention can retain temporal information more fully and obtain more expressive features.
Second, because the invention proposes a temporal interactive attention matrix, obtained by computing the degree of correlation between the low-dimensional features of different sampled images in the feature mapping matrix, it overcomes the problem that prior-art methods ignore the interdependence between different frames of the video and thereby lose part of the global information, so that the proposed technique can fully explore the global information and thus improves the accuracy of behavior recognition.
Brief Description of the Drawings
Figure 1 is a flow chart of the invention.
Detailed Description
The specific steps of the invention are further described below in conjunction with Figure 1.
Step 1. Generate a training set.
Select RGB videos covering N behavior categories from a video data set to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50. Preprocess each video in the sample set to obtain the RGB images corresponding to the video, and compose the RGB images of all preprocessed videos into a training set. Preprocessing here means sampling 60 RGB frames at equal intervals from each video in the sample set, scaling each frame to 256×340 and then cropping it, obtaining 60 RGB frames of size 224×224 for the video.
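A minimal sketch of this preprocessing is given below; it assumes the frames of a video have already been decoded as PIL images, and uses a center crop as an illustrative choice for the training-time cropping, which the description above leaves unspecified.

```python
# Sketch of step 1: equal-interval sampling of 60 frames, resize to 256x340, crop to 224x224.
import numpy as np
import torch
from torchvision import transforms

NUM_FRAMES = 60

def sample_indices(total_frames: int, num: int = NUM_FRAMES) -> np.ndarray:
    # Equally spaced frame indices over the whole video.
    return np.linspace(0, total_frames - 1, num).astype(int)

preprocess = transforms.Compose([
    transforms.Resize((256, 340)),   # scale every frame to 256x340
    transforms.CenterCrop(224),      # crop to 224x224 (illustrative choice at training time)
    transforms.ToTensor(),           # HxWxC in [0, 255] -> CxHxW in [0, 1]
])

def preprocess_video(frames: list) -> torch.Tensor:
    """frames: list of PIL images of one video -> tensor of shape (60, 3, 224, 224)."""
    idx = sample_indices(len(frames))
    return torch.stack([preprocess(frames[i]) for i in idx])
```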
Step 2. Obtain the deep feature maps.
Input each RGB frame of each video in the training set into the Inception-v2 network in turn, and output in turn a deep feature map X_k of size 7×7×1024 for each frame of each video, where k denotes the index of the sampled image in the video, k = 1, 2, ..., 60.
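A minimal sketch of this step follows, assuming some backbone module maps a batch of 224×224 frames to 1024×7×7 feature maps; torchvision does not ship Inception-v2, so a stand-in convolution with the right output shape is used here purely to keep the sketch runnable.

```python
# Sketch of step 2: extract the last convolutional feature map of every sampled frame.
import torch
import torch.nn as nn

# Stand-in for the Inception-v2 backbone (last conv output in the patent: 1024 x 7 x 7).
dummy_backbone = nn.Sequential(
    nn.Conv2d(3, 1024, kernel_size=7, stride=32, padding=3),  # (3, 224, 224) -> (1024, 7, 7)
    nn.ReLU(),
)

@torch.no_grad()
def extract_feature_maps(frames: torch.Tensor,
                         backbone: nn.Module = dummy_backbone) -> torch.Tensor:
    """frames: (60, 3, 224, 224) -> per-frame deep feature maps X_k: (60, 1024, 7, 7)."""
    backbone.eval()
    return backbone(frames)
```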
Step 3. Construct the feature mapping matrix.
Because of the high dimensionality of the feature maps, jointly analyzing the information of the densely sampled images of a video is challenging; mapping each feature map to a low-dimensional vector reduces the amount of computation and facilitates the joint analysis of the densely sampled images. Taking the k-th sampled image of the r-th video as an example, the deep feature map of a sampled image is encoded into a low-dimensional vector of dimension 1024 as follows:

$$f_{r,k} = V(X_{r,k}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{r,k,ij}$$

where f_{r,k} denotes the low-dimensional vector corresponding to the k-th sampled image of the r-th video, V(·) denotes the spatial vectorization function, X_{r,k} denotes the deep feature map corresponding to the k-th sampled image of the r-th video, X_{r,k,ij} denotes the value in row i, column j of X_{r,k}, Σ denotes the summation operation, and H and W denote the total number of rows and columns of X_{r,k}, respectively.
Arrange the low-dimensional vectors corresponding to the 60 sampled images of each video in the temporal order of the frames to obtain a two-dimensional feature mapping matrix $M = [f_1, f_2, \ldots, f_{60}]$ (the transpose of the matrix whose k-th row is $f_k^T$), where f_k denotes the low-dimensional vector of the k-th sampled image, k = 1, 2, ..., 60, and T denotes the transpose operation.
The number of columns of the matrix M equals the total number of sampled images of each video, and the number of rows equals the dimension of the low-dimensional vectors.
The feature mapping matrix contains the temporal information of the video and the spatial information of each sampled image, which enables the method to jointly analyze the densely sampled images of the video.
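A minimal sketch of step 3 follows; it reconstructs the spatial vectorization V(·) as global average pooling, consistent with the summation over H and W above, and stacks the 60 pooled vectors as the columns of M. The function name is illustrative.

```python
# Sketch of step 3: pool each 7x7x1024 feature map into a 1024-d vector and build M.
import torch

def feature_mapping_matrix(feature_maps: torch.Tensor) -> torch.Tensor:
    """feature_maps: (60, 1024, 7, 7) -> feature mapping matrix M: (1024, 60)."""
    f = feature_maps.mean(dim=(2, 3))  # spatial average pooling: one 1024-d vector f_k per frame
    return f.t()                       # columns of M follow the temporal order of the frames
```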
Step 4. Generate the temporal interactive attention matrix.
Generate the correlation matrix of M as $B = M^T M$; the value in row i, column j of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video. Normalize B to obtain a temporal interactive attention matrix A of size 60×60.
Taking the sampled images of frame i and frame j as an example, the element A_ij in row i, column j of the temporal interactive attention matrix A is computed from the degree of correlation between the two frames as:

$$A_{ij} = \frac{\exp(M_i^{T} M_j)}{\sum_{m=1}^{60}\exp(M_i^{T} M_m)}$$

where A_ij measures the degree of correlation between the sampled image of frame i and the sampled image of frame j, and M_i and M_j denote the column vectors formed by the i-th and j-th columns of the feature mapping matrix M, whose physical meanings are the low-dimensional vectors of the i-th and j-th sampled images of the video. The more similar the low-dimensional vectors of two frames are, the larger A_ij is, indicating a stronger correlation between the two frames.
All elements of the temporal interactive attention matrix A are computed in the same way; the i-th row of A represents the degree of correlation between the i-th sampled frame of the video and all sampled images of that video. The temporal interactive attention matrix therefore models the correlation between video frames, which helps to explore the global information of the video more fully.
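A minimal sketch of step 4 follows; the row-wise softmax is an assumed form of the normalization, which the description only states is applied to B.

```python
# Sketch of step 4: correlation matrix B = M^T M and its normalization into A (60x60).
import torch
import torch.nn.functional as F

def temporal_interactive_attention(M: torch.Tensor) -> torch.Tensor:
    """M: (1024, 60) -> temporal interactive attention matrix A: (60, 60)."""
    B = M.t() @ M              # B[i, j]: correlation between the i-th and j-th sampled frames
    A = F.softmax(B, dim=1)    # row-wise normalization (assumed form)
    return A
```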
Step 5. Generate the temporal interactive attention weighted feature matrix.
Use the formula $\tilde{M} = \gamma MA + M$ to generate the temporal interactive attention weighted feature matrix $\tilde{M}$, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M.
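A minimal sketch of step 5 as a PyTorch module with a learnable scale parameter γ initialized to 0; the attention normalization is again assumed to be a row-wise softmax, and the class name is illustrative.

```python
# Sketch of step 5: one temporal interactive attention layer, M -> gamma * (M A) + M.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalInteractiveAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # scale parameter initialized to 0

    def forward(self, M: torch.Tensor) -> torch.Tensor:
        """M: (1024, 60) -> temporal interactive attention weighted feature matrix (1024, 60)."""
        A = F.softmax(M.t() @ M, dim=1)            # attention matrix from step 4
        return self.gamma * (M @ A) + M            # weighted feature matrix, same shape as M
```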
Step 6. Generate the multi-layer temporal interactive attention weighted feature matrix.
Use the formula $\tilde{B} = \tilde{M}^T\tilde{M}$ to generate the correlation matrix $\tilde{B}$ of $\tilde{M}$, and normalize $\tilde{B}$ to obtain a multi-layer temporal interactive attention matrix $\tilde{A}$ of size 60×60. Then use the formula $\hat{M} = \tilde{\gamma}\tilde{M}\tilde{A} + \tilde{M}$ to generate the multi-layer temporal interactive attention weighted feature matrix $\hat{M}$, where $\tilde{\gamma}$ denotes a scale parameter, initialized to 0, that balances the two terms $\tilde{M}\tilde{A}$ and $\tilde{M}$.
Multi-layer temporal interactive attention applies temporal interactive attention once more to the temporal interactive attention weighted feature matrix, exploring richer temporal dynamics.
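A minimal sketch of step 6 follows, written as a self-contained functional version of the layer sketched under step 5; the two scale parameters correspond to γ and $\tilde{\gamma}$, and the function names are illustrative.

```python
# Sketch of step 6: apply the interactive-attention operation twice with separate scales.
import torch
import torch.nn.functional as F

def interactive_attention(M: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
    """One layer of temporal interactive attention: gamma * (M A) + M."""
    A = F.softmax(M.t() @ M, dim=1)
    return gamma * (M @ A) + M

gamma1 = torch.zeros(1, requires_grad=True)  # gamma of step 5, initialized to 0
gamma2 = torch.zeros(1, requires_grad=True)  # gamma~ of step 6, initialized to 0

def multilayer_weighted_matrix(M: torch.Tensor) -> torch.Tensor:
    """M: (1024, 60) -> multi-layer temporal interactive attention weighted matrix (1024, 60)."""
    M1 = interactive_attention(M, gamma1)      # step 5
    return interactive_attention(M1, gamma2)   # step 6: the same operation applied again
```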
Step 7. Obtain the feature vector of the video.
Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer with 1024 output neurons to obtain the feature vector of the video.
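A minimal sketch of step 7 follows; the 1024×60 input size and the 1024 output neurons follow the description, while the flattening order of the matrix is an assumption.

```python
# Sketch of step 7: map the multi-layer weighted matrix to a 1024-d video feature vector.
import torch
import torch.nn as nn

fc = nn.Linear(1024 * 60, 1024)  # fully connected layer with 1024 output neurons

def video_feature_vector(M_hat: torch.Tensor) -> torch.Tensor:
    """M_hat: (1024, 60) multi-layer weighted matrix -> 1024-d video feature vector."""
    return fc(M_hat.reshape(-1))  # flattening order is an assumption
```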
Step 8. Perform behavior recognition on the video.
Input the feature vector of each video into a softmax classifier, and use back-propagation with gradient descent to update γ, $\tilde{\gamma}$, the parameters of the fully connected layer and the parameters of the softmax classifier until the cross-entropy loss function converges.
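A minimal sketch of the training update in step 8 follows; the optimizer, learning rate and example class count are illustrative assumptions, not values given in the description.

```python
# Sketch of step 8 (training): softmax classifier + cross-entropy + back-propagation.
import torch
import torch.nn as nn

N_CLASSES = 101                          # illustrative; the method only requires N > 50
classifier = nn.Linear(1024, N_CLASSES)  # softmax is applied inside the cross-entropy loss
criterion = nn.CrossEntropyLoss()

# In the full model the optimizer would also receive fc.parameters() and the two
# gamma parameters of the attention layers.
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-3, momentum=0.9)

def training_step(feature_vec: torch.Tensor, label: int) -> float:
    """One gradient-descent update on a single video feature vector."""
    logits = classifier(feature_vec.unsqueeze(0))    # (1, N_CLASSES)
    loss = criterion(logits, torch.tensor([label]))  # cross-entropy loss
    optimizer.zero_grad()
    loss.backward()                                  # back-propagation
    optimizer.step()
    return loss.item()
```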
For each video to be recognized, sample 60 RGB frames at equal intervals, scale each frame to 256×340 and then center-crop it to obtain 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized.
Process the deep feature maps of each video to be recognized with the same processing as steps 3 to 7 to obtain the feature vector of the video to be recognized, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110086627.3A CN112766177B (en) | 2021-01-22 | 2021-01-22 | Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110086627.3A CN112766177B (en) | 2021-01-22 | 2021-01-22 | Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112766177A true CN112766177A (en) | 2021-05-07 |
CN112766177B CN112766177B (en) | 2022-12-02 |
Family
ID=75702700
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110086627.3A Active CN112766177B (en) | 2021-01-22 | 2021-01-22 | Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112766177B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3625727A1 (en) * | 2017-11-14 | 2020-03-25 | Google LLC | Weakly-supervised action localization by sparse temporal pooling network |
CN109389055A (en) * | 2018-09-21 | 2019-02-26 | 西安电子科技大学 | Video classification methods based on mixing convolution sum attention mechanism |
US20200175281A1 (en) * | 2018-11-30 | 2020-06-04 | International Business Machines Corporation | Relation attention module for temporal action localization |
CN110059662A (en) * | 2019-04-26 | 2019-07-26 | 山东大学 | A deep video behavior recognition method and system |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language identification method and system based on double-current space-time diagram convolutional neural network |
CN111627052A (en) * | 2020-04-30 | 2020-09-04 | 沈阳工程学院 | Action identification method based on double-flow space-time attention mechanism |
Non-Patent Citations (3)
Title |
---|
MING TONG et al.: "A new framework of action recognition with discriminative parts, spatio-temporal and causal interaction descriptors", Elsevier *
LIU Tianliang et al.: "Human behavior recognition fusing spatial-temporal two-stream networks and visual attention", Journal of Electronics &amp; Information Technology *
XIE Huaiqi et al.: "Video human behavior recognition based on a channel attention mechanism", Electronic Technology &amp; Software Engineering *
Also Published As
Publication number | Publication date |
---|---|
CN112766177B (en) | 2022-12-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112926396B (en) | Action identification method based on double-current convolution attention | |
CN109948425B (en) | A pedestrian search method and device based on structure-aware self-attention and online instance aggregation and matching | |
CN111191660B (en) | A multi-channel collaborative capsule network-based method for classifying pathological images of colon cancer | |
CN113496217B (en) | Face micro-expression recognition method in video image sequence | |
CN107368831B (en) | English words and digit recognition method in a kind of natural scene image | |
CN112801040B (en) | Lightweight unconstrained facial expression recognition method and system embedded with high-order information | |
CN112307995B (en) | Semi-supervised pedestrian re-identification method based on feature decoupling learning | |
CN113343937B (en) | Lip language identification method based on deep convolution and attention mechanism | |
CN112784763A (en) | Expression recognition method and system based on local and overall feature adaptive fusion | |
CN112949740B (en) | A Small Sample Image Classification Method Based on Multi-Level Metric | |
CN108427921A (en) | A kind of face identification method based on convolutional neural networks | |
CN108520213B (en) | A face beauty prediction method based on multi-scale depth | |
CN106909938B (en) | Perspective-independent behavior recognition method based on deep learning network | |
CN114842343B (en) | ViT-based aerial image recognition method | |
CN113780249B (en) | Expression recognition model processing method, device, equipment, medium and program product | |
CN111950455A (en) | A Feature Recognition Method of Motor Imagery EEG Signals Based on LFFCNN-GRU Algorithm Model | |
CN107784288A (en) | A kind of iteration positioning formula method for detecting human face based on deep neural network | |
CN111259735B (en) | Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network | |
CN108182475A (en) | It is a kind of based on automatic coding machine-the multi-dimensional data characteristic recognition method of the learning machine that transfinites | |
CN112446253A (en) | Skeleton behavior identification method and device | |
CN109325513A (en) | An image classification network training method based on massive single-class single image | |
CN118378128A (en) | Multi-mode emotion recognition method based on staged attention mechanism | |
CN112149616A (en) | A method of character interaction behavior recognition based on dynamic information | |
CN117994550B (en) | Incomplete multi-view large-scale animal image clustering method based on depth online anchor subspace learning | |
CN114359675A (en) | A saliency map generation method for hyperspectral images based on semi-supervised neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||