CN112766177A - Behavior identification method based on feature mapping and multi-layer time interaction attention - Google Patents

Behavior identification method based on feature mapping and multi-layer time interaction attention

Info

Publication number
CN112766177A
Authority
CN
China
Prior art keywords
video
matrix
temporal
feature
interactive attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110086627.3A
Other languages
Chinese (zh)
Other versions
CN112766177B (en)
Inventor
同鸣 (Tong Ming)
金磊 (Jin Lei)
董秋宇 (Dong Qiuyu)
边放 (Bian Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110086627.3A priority Critical patent/CN112766177B/en
Publication of CN112766177A publication Critical patent/CN112766177A/en
Application granted granted Critical
Publication of CN112766177B publication Critical patent/CN112766177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on feature mapping and multi-layer temporal interactive attention, which solves the problem that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, resulting in an inadequate ability to recognize behaviors. The implementation steps of the invention are: (1) generate a training set; (2) obtain deep feature maps; (3) construct a feature mapping matrix; (4) generate a temporal interactive attention matrix; (5) generate a temporal interactive attention weighted feature matrix; (6) generate a multi-layer temporal interactive attention weighted feature matrix; (7) obtain the feature vector of the video; (8) perform behavior recognition on the video. Because the invention constructs a feature mapping matrix and proposes multi-layer temporal interactive attention, it can improve the accuracy of behavior recognition in videos.

Description

Action Recognition Method Based on Feature Mapping and Multi-layer Temporal Interactive Attention

Technical Field

The invention belongs to the technical field of video processing, and further relates to a behavior recognition method based on feature mapping and multi-layer temporal interactive attention in the technical field of computer vision. The invention can be used for human action recognition in video.

Background Art

Video-based human action recognition occupies an important position in the field of computer vision and has broad application prospects; it has already been applied to autonomous driving, human-computer interaction, video surveillance, and other fields. The goal of human action recognition is to determine the category of human behavior in a video, which is essentially a classification problem. In recent years, with the development of deep learning, action recognition methods based on deep learning have been widely studied.

South China University of Technology disclosed a human action recognition method in its patent application "Human Action Recognition Method Based on a Temporal Attention Mechanism and LSTM" (Application No. CN201910271178.2, Publication No. CN110135249A). The main implementation steps of the method are: 1. acquire video data from an RGB monocular vision sensor; 2. extract 2D skeleton joint point data; 3. extract joint structural features of the joint points; 4. construct an LSTM long short-term memory network; 5. add a temporal attention mechanism to the LSTM network; 6. perform human action recognition with a softmax classifier. The temporal attention mechanism proposed by this method explores the importance of each frame of the video independently and assigns large weights to the features of important frames. However, the method still has the shortcoming that it ignores the interdependence between different frames of the video, thereby losing part of the global information and causing errors in action recognition.

Limin Wang et al. disclosed an action recognition method in the paper "Temporal segment networks for action recognition in videos" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 2740-2755). The main implementation steps of the method are: 1. divide the video evenly into 7 segments; 2. randomly sample one RGB frame from each segment, obtaining 7 RGB frames; 3. input each sampled RGB frame into a convolutional neural network to obtain a classification score for each frame; 4. combine the classification scores of the 7 frames with a segmental consensus function and a prediction function to obtain the action recognition result for the video. The shortcoming of this method is that, for longer videos, sampling only 7 RGB frames loses information in the video and cannot model more complete temporal dynamics, which leads to lower recognition accuracy.

SUMMARY OF THE INVENTION

The purpose of the invention is to address the above shortcomings of the prior art and to propose a behavior recognition method based on feature mapping and multi-layer temporal interactive attention, which solves the problem of poor behavior recognition caused by the prior art's insufficient modeling of temporal dynamic information and its neglect of the interdependence between different frames.

To achieve the above purpose, the idea of the invention is to construct a feature mapping matrix that embeds the temporal and spatial information of the video, to obtain temporal interactive attention by exploring the mutual influence between different frames of the video, and to use multiple layers of temporal interactive attention to mine the complex temporal dynamics of the video.

To achieve the above purpose, the specific implementation steps of the invention are as follows:

(1) Generate a training set:

(1a) Select RGB videos covering N behavior categories from a video dataset to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50;

(1b) Preprocess each video in the sample set to obtain the RGB images corresponding to that video, and compose the RGB images of all preprocessed videos into a training set;

(2) Generate deep feature maps:

Input each sampled RGB frame of each video in the training set into the Inception-v2 network in turn, and output a deep feature map X_k of size 7×7×1024 for each frame, where k denotes the index of the sampled image within the video, k = 1, 2, ..., 60;

(3) Construct the feature mapping matrix:

(3a) Use a spatial vectorization function to encode each deep feature map into a low-dimensional vector f_k of dimension 1024, k = 1, 2, ..., 60;

(3b) Arrange the low-dimensional vectors corresponding to the 60 sampled frames of each video in temporal order to obtain a two-dimensional feature mapping matrix M = [f_1ᵀ, f_2ᵀ, ..., f_60ᵀ], where T denotes the transpose operation;

(4) Generate the temporal interactive attention matrix:

(4a) Use the formula B = MᵀM to generate the correlation matrix B of M, where the value in row i and column j of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video;

(4b) Normalize the correlation matrix B to obtain a temporal interactive attention matrix A of size 60×60;

(5) Generate the temporal interactive attention weighted feature matrix:

Use the formula M̃ = γMA + M to generate the temporal interactive attention weighted feature matrix M̃, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M;

(6) Generate the multi-layer temporal interactive attention weighted feature matrix:

(6a) Use the formula B̃ = M̃ᵀM̃ to generate the correlation matrix B̃ of M̃, and normalize B̃ to obtain a multi-layer temporal interactive attention matrix Ã of size 60×60;

(6b) Use the formula M̂ = γ̃M̃Ã + M̃ to generate the multi-layer temporal interactive attention weighted feature matrix M̂, where γ̃ denotes a scale parameter, initialized to 0, that balances the two terms M̃Ã and M̃;

(7) Obtain the feature vector of the video:

Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer, and output the feature vector of that video;

(8) Perform behavior recognition on the videos:

(8a) Input the feature vector of each video into a softmax classifier, and use back-propagation with gradient descent to iteratively update the parameters γ and γ̃, the parameters of the fully connected layer, and the parameters of the softmax classifier until the cross-entropy loss function converges, obtaining the trained parameters;

(8b) Sample 60 RGB frames at equal intervals from each video to be recognized, scale each frame to 256×340 and then center-crop it, obtaining 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized;

(8c) Process the deep feature maps of each video to be recognized with the same processing as steps (3) to (7) to obtain the feature vector of that video, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video.

Compared with the prior art, the invention has the following advantages:

First, the invention constructs a feature mapping matrix that contains the temporal information of the 60 sampled images of a video and the spatial information of each sampled image. This overcomes the problem in the prior art that sampling only 7 RGB frames loses information in the video and cannot model more complete temporal dynamics, so the invention retains temporal information more fully and obtains more expressive features.

Second, the invention proposes a temporal interactive attention matrix, obtained by computing the degree of correlation between the low-dimensional features of different sampled images in the feature mapping matrix. This overcomes the problem that prior-art methods ignore the interdependence between different frames of a video and thereby lose part of the global information, so the proposed technique can fully explore the global information and improve the accuracy of behavior recognition.

Description of the Drawings

Figure 1 is a flow chart of the invention.

Detailed Description

The specific steps of the invention are further described below with reference to Figure 1.

Step 1. Generate a training set.

Select RGB videos covering N behavior categories from a video dataset to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50. Preprocess each video in the sample set to obtain its corresponding RGB images, and compose the RGB images of all preprocessed videos into a training set. Preprocessing means sampling 60 RGB frames at equal intervals from each video in the sample set, scaling each RGB frame to 256×340, and then cropping it to obtain 60 RGB frames of size 224×224 for that video.
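A minimal sketch of this preprocessing step, assuming OpenCV and NumPy are available and that the crop is taken from the center of the resized frame (the center crop is made explicit in step (8b)):

```python
import cv2
import numpy as np

def preprocess_video(path, num_frames=60, resize_hw=(256, 340), crop=224):
    """Sample num_frames RGB frames at equal intervals, resize each frame to
    256x340, then center-crop to 224x224 (steps (1b) and (8b))."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    idxs = np.linspace(0, len(frames) - 1, num_frames).astype(int)  # equal-interval sampling
    out = []
    for i in idxs:
        f = cv2.resize(frames[i], (resize_hw[1], resize_hw[0]))  # cv2 expects (width, height)
        top = (resize_hw[0] - crop) // 2
        left = (resize_hw[1] - crop) // 2
        out.append(f[top:top + crop, left:left + crop])
    return np.stack(out)  # (60, 224, 224, 3) uint8 RGB frames
```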

Step 2. Obtain the deep feature maps.

Input each sampled RGB frame of each video in the training set into the Inception-v2 network in turn, and output a deep feature map X_k of size 7×7×1024 for each frame, where k denotes the index of the sampled image within the video, k = 1, 2, ..., 60.

Step 3. Construct the feature mapping matrix.

Because of the high dimensionality of the feature maps, jointly analyzing the information of the densely sampled images of a video is challenging; mapping each feature map to a low-dimensional vector reduces the computational cost and facilitates joint analysis of the densely sampled images. Taking the k-th sampled image of the r-th video as an example, the deep feature map of a sampled image is encoded into a low-dimensional vector of dimension 1024 as follows:

f_r,k = V(X_r,k) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_r,k,ij

where f_r,k denotes the low-dimensional vector corresponding to the k-th sampled image of the r-th video, V(·) denotes the spatial vectorization function, X_r,k denotes the deep feature map corresponding to the k-th sampled image of the r-th video, X_r,k,ij denotes the value of X_r,k at row i and column j, Σ denotes the summation operation, and H and W denote the total number of rows and the total number of columns of X_r,k, respectively.

Arrange the low-dimensional vectors corresponding to the 60 sampled frames of each video in temporal order to obtain a two-dimensional feature mapping matrix M = [f_1ᵀ, f_2ᵀ, ..., f_60ᵀ], where f_k denotes the low-dimensional vector of the k-th sampled image, k = 1, 2, ..., 60, and T denotes the transpose operation.

The number of columns of the matrix M equals the total number of sampled images of each video, and the number of rows equals the dimension of the low-dimensional vectors.

The feature mapping matrix contains the temporal information of the video and the spatial information of each sampled image, which enables the method to jointly analyze the densely sampled images of the video.
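A sketch of steps 2 and 3, assuming a backbone that returns 7×7×1024 feature maps is available as `backbone` (the patent uses Inception-v2) and implementing the spatial vectorization as an average over the 7×7 grid, matching the formula above:

```python
import torch

def spatial_vectorize(feats):
    """V(.): average each (1024, 7, 7) deep feature map over its spatial grid,
    giving one 1024-dimensional vector f_k per sampled frame."""
    return feats.mean(dim=(2, 3))       # (60, 1024)

def feature_mapping_matrix(frames, backbone):
    """frames: (60, 3, 224, 224) tensor holding the sampled frames of one video.
    Returns M of shape (1024, 60); column k is f_k transposed."""
    with torch.no_grad():
        feats = backbone(frames)        # assumed to return (60, 1024, 7, 7)
    f = spatial_vectorize(feats)        # (60, 1024), row k is f_k
    return f.t()                        # feature mapping matrix M
```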

Step 4. Generate the temporal interactive attention matrix.

Generate the correlation matrix of M as B = MᵀM, where the value in row i and column j of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video. Normalize B to obtain a temporal interactive attention matrix A of size 60×60.

Taking the sampled images of frame i and frame j as an example, the element A_ij in row i and column j of the temporal interactive attention matrix A is computed from the degree of correlation between the two frames with the following formula:

A_ij = exp(M_iᵀ M_j) / Σ_{j=1..60} exp(M_iᵀ M_j)

Here A_ij measures the degree of correlation between the sampled image of frame i and the sampled image of frame j. M_i and M_j denote the column vectors formed by the i-th column and the j-th column of the feature mapping matrix M; physically, they are the transposes of the low-dimensional vectors of the i-th and j-th sampled images of the video. The more similar the low-dimensional vectors of the two frames are, the larger A_ij is, indicating a stronger correlation between the two frames.

All elements of the temporal interactive attention matrix A are computed in the same way; the i-th row of A represents the degree of correlation between the i-th sampled frame of the video and all sampled frames of that video. The temporal interactive attention matrix therefore models the correlations between video frames, which helps explore the global information of the video more fully.
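A sketch of step 4; the row-wise softmax used here as the normalization of B is an assumption consistent with the formula for A_ij above:

```python
import torch
import torch.nn.functional as F

def temporal_interactive_attention_matrix(M):
    """M: (1024, 60) feature mapping matrix of one video.
    Returns A of shape (60, 60), where A[i, j] reflects how strongly
    frame j is correlated with frame i."""
    B = M.t() @ M                 # correlation matrix, B[i, j] = M_i . M_j
    A = F.softmax(B, dim=1)       # normalize each row of B (assumed softmax)
    return A
```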

Step 5. Generate the temporal interactive attention weighted feature matrix.

Use the formula M̃ = γMA + M to generate the temporal interactive attention weighted feature matrix M̃, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M.
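A sketch of steps 4 and 5 packaged as a reusable module with a learnable scale parameter γ initialized to 0; the softmax normalization remains an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalInteractiveAttention(nn.Module):
    """One layer of temporal interactive attention: M_out = gamma * M A + M."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # balances the terms MA and M

    def forward(self, M):                          # M: (batch, 1024, 60)
        B = M.transpose(1, 2) @ M                  # (batch, 60, 60) frame correlations
        A = F.softmax(B, dim=-1)                   # assumed row-wise normalization
        return self.gamma * (M @ A) + M            # residual-style weighting
```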

Step 6. Generate the multi-layer temporal interactive attention weighted feature matrix.

Use the formula B̃ = M̃ᵀM̃ to generate the correlation matrix B̃ of M̃, and normalize B̃ to obtain a multi-layer temporal interactive attention matrix Ã of size 60×60. Then use the formula M̂ = γ̃M̃Ã + M̃ to generate the multi-layer temporal interactive attention weighted feature matrix M̂, where γ̃ denotes a scale parameter, initialized to 0, that balances the two terms M̃Ã and M̃.

Multi-layer temporal interactive attention applies temporal interactive attention again to the temporal interactive attention weighted feature matrix, thereby exploring richer temporal dynamics.
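Under the same assumptions, step 6 amounts to stacking a second attention layer with its own scale parameter, reusing the TemporalInteractiveAttention module sketched above:

```python
import torch

layer1 = TemporalInteractiveAttention()   # scale parameter gamma (step 5)
layer2 = TemporalInteractiveAttention()   # scale parameter gamma-tilde (step 6)

M = torch.randn(8, 1024, 60)              # a batch of 8 feature mapping matrices
M_tilde = layer1(M)                       # temporal interactive attention weighted features
M_hat = layer2(M_tilde)                   # multi-layer temporal interactive attention weighted features
```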

Step 7. Obtain the feature vector of the video.

Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer with 1024 output neurons to obtain the feature vector of that video.
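Putting steps 3 to 7 together as a single head on top of the backbone features; flattening M̂ before the fully connected layer and attaching a linear classifier (with softmax applied inside the loss) are assumptions, since the patent only states that the matrix is fed to a 1024-neuron fully connected layer followed by a softmax classifier:

```python
import torch
import torch.nn as nn

class MultiLayerTIAHead(nn.Module):
    """Feature mapping matrix -> two temporal interactive attention layers
    -> 1024-dimensional video feature vector -> N-way classification."""
    def __init__(self, num_classes, dim=1024, num_frames=60):
        super().__init__()
        self.att1 = TemporalInteractiveAttention()
        self.att2 = TemporalInteractiveAttention()
        self.fc = nn.Linear(dim * num_frames, dim)     # fully connected layer, 1024 outputs
        self.classifier = nn.Linear(dim, num_classes)  # softmax classifier (softmax in the loss)

    def forward(self, M):                              # M: (batch, 1024, 60)
        M_hat = self.att2(self.att1(M))                # steps 5 and 6
        v = self.fc(M_hat.flatten(1))                  # step 7: video feature vector
        return self.classifier(v)                      # class logits
```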

Step 8. Perform behavior recognition on the videos.

Input the feature vector of each video into the softmax classifier, and use back-propagation with gradient descent to update γ, γ̃, the parameters of the fully connected layer, and the parameters of the softmax classifier until the cross-entropy loss function converges.
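A minimal training-loop sketch under the same assumptions, reusing the head module sketched above; `train_loader`, which yields batches of feature mapping matrices and behavior labels, and the learning-rate and epoch-count values are hypothetical placeholders:

```python
import torch
import torch.nn as nn

num_classes, num_epochs = 51, 50                          # example values (N > 50 categories)
model = MultiLayerTIAHead(num_classes)
criterion = nn.CrossEntropyLoss()                         # softmax + cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for M, labels in train_loader:                        # M: (batch, 1024, 60)
        loss = criterion(model(M), labels)
        optimizer.zero_grad()
        loss.backward()                                   # back-propagation
        optimizer.step()                                  # gradient-descent update of gamma, gamma-tilde, fc, classifier
```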

Sample 60 RGB frames at equal intervals from each video to be recognized, scale each frame to 256×340 and then center-crop it, obtaining 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized.

Process the deep feature maps of each video to be recognized with the same processing as steps 3 to 7 to obtain the feature vector of that video, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video.

Claims (4)

1. A behavior recognition method based on feature mapping and multi-layer temporal interactive attention, characterized in that a feature mapping matrix containing the temporal information of a video and the spatial information of each sampled image is constructed, and temporal interactive attention is proposed, in which a temporal interactive attention matrix is obtained by computing the degree of correlation between the low-dimensional vectors of different sampled images in the feature mapping matrix; the method comprises the following specific steps:

(1) Generate a training set:

(1a) Select RGB videos covering N behavior categories from a video dataset to form a sample set, where each category contains at least 100 videos, each video has one definite behavior category, and N > 50;

(1b) Preprocess each video in the sample set to obtain the RGB images corresponding to that video, and compose the RGB images of all preprocessed videos into a training set;

(2) Generate deep feature maps:

Input each sampled RGB frame of each video in the training set into the Inception-v2 network in turn, and output a deep feature map X_k of size 7×7×1024 for each frame, where k denotes the index of the sampled image within the video, k = 1, 2, ..., 60;

(3) Construct the feature mapping matrix:

(3a) Use a spatial vectorization function to encode each deep feature map into a low-dimensional vector f_k of dimension 1024, k = 1, 2, ..., 60;

(3b) Arrange the low-dimensional vectors corresponding to the 60 sampled frames of each video in temporal order to obtain a two-dimensional feature mapping matrix M = [f_1ᵀ, f_2ᵀ, ..., f_60ᵀ], where T denotes the transpose operation;

(4) Generate the temporal interactive attention matrix:

(4a) Use the formula B = MᵀM to generate the correlation matrix B of M, where the value in row i and column j of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images of the video;

(4b) Normalize the correlation matrix B to obtain a temporal interactive attention matrix A of size 60×60;

(5) Generate the temporal interactive attention weighted feature matrix:

Use the formula M̃ = γMA + M to generate the temporal interactive attention weighted feature matrix M̃, where γ denotes a scale parameter, initialized to 0, that balances the two terms MA and M;

(6) Generate the multi-layer temporal interactive attention weighted feature matrix:

(6a) Use the formula B̃ = M̃ᵀM̃ to generate the correlation matrix B̃ of M̃, and normalize B̃ to obtain a multi-layer temporal interactive attention matrix Ã of size 60×60;

(6b) Use the formula M̂ = γ̃M̃Ã + M̃ to generate the multi-layer temporal interactive attention weighted feature matrix M̂, where γ̃ denotes a scale parameter, initialized to 0, that balances the two terms M̃Ã and M̃;

(7) Obtain the feature vector of the video:

Input the multi-layer temporal interactive attention weighted feature matrix of each video into a fully connected layer, and output the feature vector of that video;

(8) Perform behavior recognition on the videos:

(8a) Input the feature vector of each video into a softmax classifier, and use back-propagation with gradient descent to iteratively update the parameters γ and γ̃, the parameters of the fully connected layer, and the parameters of the softmax classifier until the cross-entropy loss function converges, obtaining the trained parameters;

(8b) Sample 60 RGB frames at equal intervals from each video to be recognized, scale each frame to 256×340 and then center-crop it, obtaining 60 RGB frames of size 224×224; input each RGB frame into the Inception-v2 network and output the deep feature maps of the video to be recognized;

(8c) Process the deep feature maps of each video to be recognized with the same processing as steps (3) to (7) to obtain the feature vector of that video, input each feature vector into the trained softmax classifier, and output the behavior recognition result of each video.

2. The behavior recognition method based on feature mapping and multi-layer temporal interactive attention according to claim 1, characterized in that preprocessing each video in the sample set in step (1b) means sampling 60 RGB frames at equal intervals from each video in the sample set, scaling each RGB frame to 256×340 and then cropping it, obtaining 60 RGB frames of size 224×224 for that video.

3. The behavior recognition method based on feature mapping and multi-layer temporal interactive attention according to claim 1, characterized in that the spatial vectorization function in step (3a) is as follows:

f_r,k = V(X_r,k) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_r,k,ij

where f_r,k denotes the low-dimensional vector corresponding to the k-th sampled frame of the r-th video, V(·) denotes the spatial vectorization function, X_r,k denotes the deep feature map corresponding to the k-th sampled frame of the r-th video, X_r,k,ij denotes the value of X_r,k at row i and column j, Σ denotes the summation operation, and H and W denote the total number of rows and the total number of columns of X_r,k, respectively.

4. The behavior recognition method based on feature mapping and multi-layer temporal interactive attention according to claim 1, characterized in that the number of output neurons of the fully connected layer in step (7) is set to 1024.
CN202110086627.3A 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention Active CN112766177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention

Publications (2)

Publication Number Publication Date
CN112766177A 2021-05-07
CN112766177B (en) 2022-12-02

Family

ID=75702700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110086627.3A Active CN112766177B (en) 2021-01-22 2021-01-22 Action Recognition Method Based on Feature Map and Multilayer Temporal Interaction Attention

Country Status (1)

Country Link
CN (1) CN112766177B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3625727A1 (en) * 2017-11-14 2020-03-25 Google LLC Weakly-supervised action localization by sparse temporal pooling network
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
US20200175281A1 (en) * 2018-11-30 2020-06-04 International Business Machines Corporation Relation attention module for temporal action localization
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A deep video behavior recognition method and system
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MING TONG et al.: "A new framework of action recognition with discriminative parts, spatio-temporal and causal interaction descriptors", ELSEVIER *
LIU TIANLIANG et al.: "Human action recognition fusing spatial-temporal dual-stream networks and visual attention", Journal of Electronics & Information Technology *
XIE HUAIQI et al.: "Video human action recognition based on a channel attention mechanism", Electronic Technology & Software Engineering *

Also Published As

Publication number Publication date
CN112766177B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN112926396B (en) Action identification method based on double-current convolution attention
CN109948425B (en) A pedestrian search method and device based on structure-aware self-attention and online instance aggregation and matching
CN111191660B (en) A multi-channel collaborative capsule network-based method for classifying pathological images of colon cancer
CN113496217B (en) Face micro-expression recognition method in video image sequence
CN107368831B (en) English words and digit recognition method in a kind of natural scene image
CN112801040B (en) Lightweight unconstrained facial expression recognition method and system embedded with high-order information
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN113343937B (en) Lip language identification method based on deep convolution and attention mechanism
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN112949740B (en) A Small Sample Image Classification Method Based on Multi-Level Metric
CN108427921A (en) A kind of face identification method based on convolutional neural networks
CN108520213B (en) A face beauty prediction method based on multi-scale depth
CN106909938B (en) Perspective-independent behavior recognition method based on deep learning network
CN114842343B (en) ViT-based aerial image recognition method
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
CN111950455A (en) A Feature Recognition Method of Motor Imagery EEG Signals Based on LFFCNN-GRU Algorithm Model
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN108182475A (en) It is a kind of based on automatic coding machine-the multi-dimensional data characteristic recognition method of the learning machine that transfinites
CN112446253A (en) Skeleton behavior identification method and device
CN109325513A (en) An image classification network training method based on massive single-class single image
CN118378128A (en) Multi-mode emotion recognition method based on staged attention mechanism
CN112149616A (en) A method of character interaction behavior recognition based on dynamic information
CN117994550B (en) Incomplete multi-view large-scale animal image clustering method based on depth online anchor subspace learning
CN114359675A (en) A saliency map generation method for hyperspectral images based on semi-supervised neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant