CN112766177A - Behavior identification method based on feature mapping and multi-layer time interaction attention - Google Patents

Behavior identification method based on feature mapping and multi-layer time interaction attention

Info

Publication number
CN112766177A
Authority
CN
China
Prior art keywords
video
matrix
temporal
feature
interactive attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110086627.3A
Other languages
Chinese (zh)
Other versions
CN112766177B (en)
Inventor
同鸣 (Tong Ming)
金磊 (Jin Lei)
董秋宇 (Dong Qiuyu)
边放 (Bian Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110086627.3A priority Critical patent/CN112766177B/en
Publication of CN112766177A publication Critical patent/CN112766177A/en
Application granted granted Critical
Publication of CN112766177B publication Critical patent/CN112766177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on feature mapping and multi-layer temporal interactive attention, which solves the problem that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, resulting in an inadequate ability to recognize behaviors. The implementation steps of the invention are: (1) generate a training set; (2) obtain deep feature maps; (3) construct a feature mapping matrix; (4) generate a temporal interactive attention matrix; (5) generate a temporal interactive attention weighted feature matrix; (6) generate a multi-layer temporal interactive attention weighted feature matrix; (7) obtain the feature vector of the video; (8) perform behavior recognition on the video. Because the invention constructs a feature mapping matrix and proposes multi-layer temporal interactive attention, it can improve the accuracy of behavior recognition in videos.

Description

Action Recognition Method Based on Feature Mapping and Multi-layer Temporal Interactive Attention

Technical Field

The invention belongs to the technical field of video processing, and further relates to a behavior recognition method based on feature mapping and multi-layer temporal interactive attention in the technical field of computer vision. The invention can be used for human action recognition in video.

Background Art

Video-based human action recognition occupies an important position in the field of computer vision and has broad application prospects; it has already been applied to autonomous driving, human-computer interaction, video surveillance, and other fields. The goal of human action recognition is to determine the category of human behavior in a video, which is essentially a classification problem. In recent years, with the development of deep learning, action recognition methods based on deep learning have been widely studied.

South China University of Technology disclosed a human action recognition method in its patent application "Human Action Recognition Method Based on a Temporal Attention Mechanism and LSTM" (Application No. CN201910271178.2, Publication No. CN110135249A). The main implementation steps of the method are: 1. acquire video data from an RGB monocular vision sensor; 2. extract 2D skeleton joint point data; 3. extract joint structural features of the joint points; 4. construct an LSTM long short-term memory network; 5. add a temporal attention mechanism to the LSTM network; 6. perform human action recognition with a softmax classifier. The temporal attention mechanism proposed by this method explores the importance of each frame of the video independently and assigns large weights to the features of important frames. However, the method still has the shortcoming that it ignores the interdependence between different frames of the video, thereby losing part of the global information and causing errors in action recognition.

Limin Wang et al. disclosed an action recognition method in the paper "Temporal segment networks for action recognition in videos" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 2740-2755). The main implementation steps of the method are: 1. divide the video evenly into 7 segments; 2. randomly sample one RGB frame from each segment, obtaining 7 RGB frames; 3. input each sampled RGB frame into a convolutional neural network to obtain a classification score for each frame; 4. combine the classification scores of the 7 frames with a segmental consensus function and a prediction function to obtain the action recognition result for the video. The shortcoming of this method is that, for longer videos, sampling only 7 RGB frames loses information in the video and cannot model more complete temporal dynamics, which leads to lower recognition accuracy.

SUMMARY OF THE INVENTION

The purpose of the invention is to address the above shortcomings of the prior art and to propose a behavior recognition method based on feature mapping and multi-layer temporal interactive attention, which solves the problem of poor behavior recognition caused by the prior art's insufficient modeling of temporal dynamic information and its neglect of the interdependence between different frames.

To achieve the above purpose, the idea of the invention is to construct a feature mapping matrix that embeds the temporal and spatial information of the video, to obtain temporal interactive attention by exploring the mutual influence between different frames of the video, and to use multiple layers of temporal interactive attention to mine the complex temporal dynamics of the video.

To achieve the above purpose, the specific implementation steps of the invention are as follows:

(1) Generate a training set:

(1a) Select RGB videos covering N behavior categories from a video dataset to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50;

(1b) Preprocess each video in the sample set to obtain the RGB images corresponding to that video, and compose the RGB images of all preprocessed videos into a training set;

(2) Generate deep feature maps:

Input each sampled RGB frame of each video in the training set into the Inception-v2 network in turn, and output a deep feature map X_k of size 7×7×1024 for each frame, where k denotes the index of the sampled image within the video, k = 1, 2, ..., 60;

(3) Construct the feature mapping matrix:

(3a) Use a spatial vectorization function to encode each deep feature map into a low-dimensional vector f_k of dimension 1024, k = 1, 2, ..., 60;

(3b) Arrange the low-dimensional vectors corresponding to the 60 sampled frames of each video in temporal order to obtain a two-dimensional feature mapping matrix M = [f_1ᵀ, f_2ᵀ, ..., f_60ᵀ], where T denotes the transpose operation;

(4) Generate the temporal interactive attention matrix:

(4a) Use the formula B = MᵀM to generate the correlation matrix B of M, where the value in row i and column j of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video;

(4b) Normalize the correlation matrix B to obtain a temporal interactive attention matrix A of size 60×60;

(5) Generate the temporal interactive attention weighted feature matrix:

Use the formula M̃ = γMA + M to generate the temporal interactive attention weighted feature matrix M̃, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M;

(6) Generate the multi-layer temporal interactive attention weighted feature matrix:

(6a) Use the formula B̃ = M̃ᵀM̃ to generate the correlation matrix B̃ of M̃, and normalize B̃ to obtain a multi-layer temporal interactive attention matrix Ã of size 60×60;

(6b) Use the formula M̂ = γ̃M̃Ã + M̃ to generate the multi-layer temporal interactive attention weighted feature matrix M̂, where γ̃ denotes a scale parameter, initialized to 0, that balances the two terms M̃Ã and M̃;

(7) Obtain the feature vector of the video:

Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer, and output the feature vector of that video;

(8) Perform behavior recognition on the videos:

(8a) Input the feature vector of each video into a softmax classifier, and use back-propagation with gradient descent to iteratively update the parameters γ and γ̃, the parameters of the fully connected layer, and the parameters of the softmax classifier until the cross-entropy loss function converges, obtaining the trained parameters;

(8b) Sample 60 RGB frames at equal intervals from each video to be recognized, scale each frame to 256×340 and then center-crop it, obtaining 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized;

(8c) Process the deep feature maps of each video to be recognized with the same processing as steps (3) to (7) to obtain the feature vector of that video, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video.

Compared with the prior art, the invention has the following advantages:

First, the invention constructs a feature mapping matrix that contains the temporal information of the 60 sampled images of a video and the spatial information of each sampled image. This overcomes the problem in the prior art that sampling only 7 RGB frames loses information in the video and cannot model more complete temporal dynamics, so the invention retains temporal information more fully and obtains more expressive features.

Second, the invention proposes a temporal interactive attention matrix, obtained by computing the degree of correlation between the low-dimensional features of different sampled images in the feature mapping matrix. This overcomes the problem that prior-art methods ignore the interdependence between different frames of a video and thereby lose part of the global information, so the proposed technique can fully explore the global information and improve the accuracy of behavior recognition.

Description of the Drawings

Figure 1 is a flow chart of the invention.

Detailed Description

The specific steps of the invention are further described below with reference to Figure 1.

Step 1. Generate a training set.

Select RGB videos covering N behavior categories from a video dataset to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50. Preprocess each video in the sample set to obtain its corresponding RGB images, and compose the RGB images of all preprocessed videos into a training set. Preprocessing means sampling 60 RGB frames at equal intervals from each video in the sample set, scaling each RGB frame to 256×340, and then cropping it to obtain 60 RGB frames of size 224×224 for that video.
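A minimal sketch of this preprocessing step, assuming OpenCV and NumPy are available and that the crop is taken from the center of the resized frame (the center crop is made explicit in step (8b)):

```python
import cv2
import numpy as np

def preprocess_video(path, num_frames=60, resize_hw=(256, 340), crop=224):
    """Sample num_frames RGB frames at equal intervals, resize each frame to
    256x340, then center-crop to 224x224 (steps (1b) and (8b))."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    idxs = np.linspace(0, len(frames) - 1, num_frames).astype(int)  # equal-interval sampling
    out = []
    for i in idxs:
        f = cv2.resize(frames[i], (resize_hw[1], resize_hw[0]))  # cv2 expects (width, height)
        top = (resize_hw[0] - crop) // 2
        left = (resize_hw[1] - crop) // 2
        out.append(f[top:top + crop, left:left + crop])
    return np.stack(out)  # (60, 224, 224, 3) uint8 RGB frames
```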

Step 2. Obtain the deep feature maps.

Input each sampled RGB frame of each video in the training set into the Inception-v2 network in turn, and output a deep feature map X_k of size 7×7×1024 for each frame, where k denotes the index of the sampled image within the video, k = 1, 2, ..., 60.

Step 3. Construct the feature mapping matrix.

Because of the high dimensionality of the feature maps, jointly analyzing the information of the densely sampled images of a video is challenging; mapping each feature map to a low-dimensional vector reduces the computational cost and facilitates joint analysis of the densely sampled images. Taking the k-th sampled image of the r-th video as an example, the deep feature map of a sampled image is encoded into a low-dimensional vector of dimension 1024 as follows:

f_r,k = V(X_r,k) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_r,k,ij

where f_r,k denotes the low-dimensional vector corresponding to the k-th sampled image of the r-th video, V(·) denotes the spatial vectorization function, X_r,k denotes the deep feature map corresponding to the k-th sampled image of the r-th video, X_r,k,ij denotes the value of X_r,k at row i and column j, Σ denotes the summation operation, and H and W denote the total number of rows and the total number of columns of X_r,k, respectively.

Arrange the low-dimensional vectors corresponding to the 60 sampled frames of each video in temporal order to obtain a two-dimensional feature mapping matrix M = [f_1ᵀ, f_2ᵀ, ..., f_60ᵀ], where f_k denotes the low-dimensional vector of the k-th sampled image, k = 1, 2, ..., 60, and T denotes the transpose operation.

The number of columns of the matrix M equals the total number of sampled images of each video, and the number of rows equals the dimension of the low-dimensional vectors.

The feature mapping matrix contains the temporal information of the video and the spatial information of each sampled image, which enables the method to jointly analyze the densely sampled images of the video.
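A sketch of steps 2 and 3, assuming a backbone that returns 7×7×1024 feature maps is available as `backbone` (the patent uses Inception-v2) and implementing the spatial vectorization as an average over the 7×7 grid, matching the formula above:

```python
import torch

def spatial_vectorize(feats):
    """V(.): average each (1024, 7, 7) deep feature map over its spatial grid,
    giving one 1024-dimensional vector f_k per sampled frame."""
    return feats.mean(dim=(2, 3))       # (60, 1024)

def feature_mapping_matrix(frames, backbone):
    """frames: (60, 3, 224, 224) tensor holding the sampled frames of one video.
    Returns M of shape (1024, 60); column k is f_k transposed."""
    with torch.no_grad():
        feats = backbone(frames)        # assumed to return (60, 1024, 7, 7)
    f = spatial_vectorize(feats)        # (60, 1024), row k is f_k
    return f.t()                        # feature mapping matrix M
```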

Step 4. Generate the temporal interactive attention matrix.

Generate the correlation matrix of M as B = MᵀM, where the value in row i and column j of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video. Normalize B to obtain a temporal interactive attention matrix A of size 60×60.

Taking the sampled images of frame i and frame j as an example, the element A_ij in row i and column j of the temporal interactive attention matrix A is computed from the degree of correlation between the two frames with the following formula:

A_ij = exp(M_iᵀ M_j) / Σ_{j=1..60} exp(M_iᵀ M_j)

Here A_ij measures the degree of correlation between the sampled image of frame i and the sampled image of frame j. M_i and M_j denote the column vectors formed by the i-th column and the j-th column of the feature mapping matrix M; physically, they are the transposes of the low-dimensional vectors of the i-th and j-th sampled images of the video. The more similar the low-dimensional vectors of the two frames are, the larger A_ij is, indicating a stronger correlation between the two frames.

All elements of the temporal interactive attention matrix A are computed in the same way; the i-th row of A represents the degree of correlation between the i-th sampled frame of the video and all sampled frames of that video. The temporal interactive attention matrix therefore models the correlations between video frames, which helps explore the global information of the video more fully.
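A sketch of step 4; the row-wise softmax used here as the normalization of B is an assumption consistent with the formula for A_ij above:

```python
import torch
import torch.nn.functional as F

def temporal_interactive_attention_matrix(M):
    """M: (1024, 60) feature mapping matrix of one video.
    Returns A of shape (60, 60), where A[i, j] reflects how strongly
    frame j is correlated with frame i."""
    B = M.t() @ M                 # correlation matrix, B[i, j] = M_i . M_j
    A = F.softmax(B, dim=1)       # normalize each row of B (assumed softmax)
    return A
```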

Step 5. Generate the temporal interactive attention weighted feature matrix.

Use the formula M̃ = γMA + M to generate the temporal interactive attention weighted feature matrix M̃, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M.
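A sketch of steps 4 and 5 packaged as a reusable module with a learnable scale parameter γ initialized to 0; the softmax normalization remains an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalInteractiveAttention(nn.Module):
    """One layer of temporal interactive attention: M_out = gamma * M A + M."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # balances the terms MA and M

    def forward(self, M):                          # M: (batch, 1024, 60)
        B = M.transpose(1, 2) @ M                  # (batch, 60, 60) frame correlations
        A = F.softmax(B, dim=-1)                   # assumed row-wise normalization
        return self.gamma * (M @ A) + M            # residual-style weighting
```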

Step 6. Generate the multi-layer temporal interactive attention weighted feature matrix.

Use the formula B̃ = M̃ᵀM̃ to generate the correlation matrix B̃ of M̃, and normalize B̃ to obtain a multi-layer temporal interactive attention matrix Ã of size 60×60. Then use the formula M̂ = γ̃M̃Ã + M̃ to generate the multi-layer temporal interactive attention weighted feature matrix M̂, where γ̃ denotes a scale parameter, initialized to 0, that balances the two terms M̃Ã and M̃.

Multi-layer temporal interactive attention applies temporal interactive attention again to the temporal interactive attention weighted feature matrix, thereby exploring richer temporal dynamics.
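Under the same assumptions, step 6 amounts to stacking a second attention layer with its own scale parameter, reusing the TemporalInteractiveAttention module sketched above:

```python
import torch

layer1 = TemporalInteractiveAttention()   # scale parameter gamma (step 5)
layer2 = TemporalInteractiveAttention()   # scale parameter gamma-tilde (step 6)

M = torch.randn(8, 1024, 60)              # a batch of 8 feature mapping matrices
M_tilde = layer1(M)                       # temporal interactive attention weighted features
M_hat = layer2(M_tilde)                   # multi-layer temporal interactive attention weighted features
```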

Step 7. Obtain the feature vector of the video.

Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer with 1024 output neurons to obtain the feature vector of that video.
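Putting steps 3 to 7 together as a single head on top of the backbone features; flattening M̂ before the fully connected layer and attaching a linear classifier (with softmax applied inside the loss) are assumptions, since the patent only states that the matrix is fed to a 1024-neuron fully connected layer followed by a softmax classifier:

```python
import torch
import torch.nn as nn

class MultiLayerTIAHead(nn.Module):
    """Feature mapping matrix -> two temporal interactive attention layers
    -> 1024-dimensional video feature vector -> N-way classification."""
    def __init__(self, num_classes, dim=1024, num_frames=60):
        super().__init__()
        self.att1 = TemporalInteractiveAttention()
        self.att2 = TemporalInteractiveAttention()
        self.fc = nn.Linear(dim * num_frames, dim)     # fully connected layer, 1024 outputs
        self.classifier = nn.Linear(dim, num_classes)  # softmax classifier (softmax in the loss)

    def forward(self, M):                              # M: (batch, 1024, 60)
        M_hat = self.att2(self.att1(M))                # steps 5 and 6
        v = self.fc(M_hat.flatten(1))                  # step 7: video feature vector
        return self.classifier(v)                      # class logits
```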

Step 8. Perform behavior recognition on the videos.

Input the feature vector of each video into the softmax classifier, and use back-propagation with gradient descent to update γ, γ̃, the parameters of the fully connected layer, and the parameters of the softmax classifier until the cross-entropy loss function converges.
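A minimal training-loop sketch under the same assumptions, reusing the head module sketched above; `train_loader`, which yields batches of feature mapping matrices and behavior labels, and the learning-rate and epoch-count values are hypothetical placeholders:

```python
import torch
import torch.nn as nn

num_classes, num_epochs = 51, 50                          # example values (N > 50 categories)
model = MultiLayerTIAHead(num_classes)
criterion = nn.CrossEntropyLoss()                         # softmax + cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for M, labels in train_loader:                        # M: (batch, 1024, 60)
        loss = criterion(model(M), labels)
        optimizer.zero_grad()
        loss.backward()                                   # back-propagation
        optimizer.step()                                  # gradient-descent update of gamma, gamma-tilde, fc, classifier
```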

Sample 60 RGB frames at equal intervals from each video to be recognized, scale each frame to 256×340 and then center-crop it, obtaining 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized.

Process the deep feature maps of each video to be recognized with the same processing as steps 3 to 7 to obtain the feature vector of that video, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video.

Claims (4)

1. A behavior recognition method based on feature mapping and multi-layer temporal interactive attention, characterized in that a feature mapping matrix containing the temporal information of a video and the spatial information of each sampled image is constructed, and temporal interactive attention is proposed, in which a temporal interactive attention matrix is obtained by computing the degree of correlation between the low-dimensional vectors of different sampled images in the feature mapping matrix; the method comprises the following specific steps:

(1) Generate a training set:

(1a) Select RGB videos covering N behavior categories from a video dataset to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50;

(1b) Preprocess each video in the sample set to obtain the RGB images corresponding to that video, and compose the RGB images of all preprocessed videos into a training set;

(2) Generate deep feature maps:

Input each sampled RGB frame of each video in the training set into the Inception-v2 network in turn, and output a deep feature map X_k of size 7×7×1024 for each frame, where k denotes the index of the sampled image within the video, k = 1, 2, ..., 60;

(3) Construct the feature mapping matrix:

(3a) Use a spatial vectorization function to encode each deep feature map into a low-dimensional vector f_k of dimension 1024, k = 1, 2, ..., 60;

(3b) Arrange the low-dimensional vectors corresponding to the 60 sampled frames of each video in temporal order to obtain a two-dimensional feature mapping matrix M = [f_1ᵀ, f_2ᵀ, ..., f_60ᵀ], where T denotes the transpose operation;

(4) Generate the temporal interactive attention matrix:

(4a) Use the formula B = MᵀM to generate the correlation matrix B of M, where the value in row i and column j of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video;

(4b) Normalize the correlation matrix B to obtain a temporal interactive attention matrix A of size 60×60;

(5) Generate the temporal interactive attention weighted feature matrix:

Use the formula M̃ = γMA + M to generate the temporal interactive attention weighted feature matrix M̃, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M;

(6) Generate the multi-layer temporal interactive attention weighted feature matrix:

(6a) Use the formula B̃ = M̃ᵀM̃ to generate the correlation matrix B̃ of M̃, and normalize B̃ to obtain a multi-layer temporal interactive attention matrix Ã of size 60×60;

(6b) Use the formula M̂ = γ̃M̃Ã + M̃ to generate the multi-layer temporal interactive attention weighted feature matrix M̂, where γ̃ denotes a scale parameter, initialized to 0, that balances the two terms M̃Ã and M̃;

(7) Obtain the feature vector of the video:

Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer, and output the feature vector of that video;

(8) Perform behavior recognition on the videos:

(8a) Input the feature vector of each video into a softmax classifier, and use back-propagation with gradient descent to iteratively update the parameters γ and γ̃, the parameters of the fully connected layer, and the parameters of the softmax classifier until the cross-entropy loss function converges, obtaining the trained parameters;

(8b) Sample 60 RGB frames at equal intervals from each video to be recognized, scale each frame to 256×340 and then center-crop it, obtaining 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized;

(8c) Process the deep feature maps of each video to be recognized with the same processing as steps (3) to (7) to obtain the feature vector of that video, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video.

2. The behavior recognition method based on feature mapping and multi-layer temporal interactive attention according to claim 1, characterized in that preprocessing each video in the sample set in step (1b) means sampling 60 RGB frames at equal intervals from each video in the sample set, scaling each RGB frame to 256×340 and then cropping it, obtaining 60 RGB frames of size 224×224 for that video.

3. The behavior recognition method based on feature mapping and multi-layer temporal interactive attention according to claim 1, characterized in that the spatial vectorization function in step (3a) is as follows:

f_r,k = V(X_r,k) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_r,k,ij

where f_r,k denotes the low-dimensional vector corresponding to the k-th sampled frame of the r-th video, V(·) denotes the spatial vectorization function, X_r,k denotes the deep feature map corresponding to the k-th sampled frame of the r-th video, X_r,k,ij denotes the value of X_r,k at row i and column j, Σ denotes the summation operation, and H and W denote the total number of rows and the total number of columns of X_r,k, respectively.

4. The behavior recognition method based on feature mapping and multi-layer temporal interactive attention according to claim 1, characterized in that the number of output neurons of the fully connected layer in step (7) is set to 1024.
CN202110086627.3A 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention Active CN112766177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention

Publications (2)

Publication Number Publication Date
CN112766177A 2021-05-07
CN112766177B (en) 2022-12-02

Family

ID=75702700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110086627.3A Active CN112766177B (en) 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention

Country Status (1)

Country Link
CN (1) CN112766177B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3625727A1 (en) * 2017-11-14 2020-03-25 Google LLC Weakly-supervised action localization by sparse temporal pooling network
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
US20200175281A1 (en) * 2018-11-30 2020-06-04 International Business Machines Corporation Relation attention module for temporal action localization
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A deep video behavior recognition method and system
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MING TONG et al.: "A new framework of action recognition with discriminative parts, spatio-temporal and causal interaction descriptors", ELSEVIER *
LIU TIANLIANG et al.: "Human action recognition fusing spatial-temporal dual-stream networks and visual attention", Journal of Electronics & Information Technology *
XIE HUAIQI et al.: "Video human action recognition based on a channel attention mechanism", Electronic Technology & Software Engineering *

Also Published As

Publication number Publication date
CN112766177B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN112926396B (en) Action identification method based on double-current convolution attention
CN109948425B (en) A pedestrian search method and device based on structure-aware self-attention and online instance aggregation and matching
CN111191660B (en) A multi-channel collaborative capsule network-based method for classifying pathological images of colon cancer
CN113496217B (en) Face micro-expression recognition method in video image sequence
CN107368831B (en) English words and digit recognition method in a kind of natural scene image
CN112801040B (en) Lightweight unconstrained facial expression recognition method and system embedded with high-order information
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN113343937B (en) Lip language identification method based on deep convolution and attention mechanism
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN112949740B (en) A Small Sample Image Classification Method Based on Multi-Level Metric
CN108427921A (en) A kind of face identification method based on convolutional neural networks
CN108520213B (en) A face beauty prediction method based on multi-scale depth
CN106909938B (en) Perspective-independent behavior recognition method based on deep learning network
CN114842343B (en) ViT-based aerial image recognition method
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
CN111950455A (en) A Feature Recognition Method of Motor Imagery EEG Signals Based on LFFCNN-GRU Algorithm Model
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN108182475A (en) It is a kind of based on automatic coding machine-the multi-dimensional data characteristic recognition method of the learning machine that transfinites
CN112446253A (en) Skeleton behavior identification method and device
CN109325513A (en) An image classification network training method based on massive single-class single image
CN118378128A (en) Multi-mode emotion recognition method based on staged attention mechanism
CN112149616A (en) A method of character interaction behavior recognition based on dynamic information
CN117994550B (en) Incomplete multi-view large-scale animal image clustering method based on depth online anchor subspace learning
CN114359675A (en) A saliency map generation method for hyperspectral images based on semi-supervised neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant