CN111325155A - Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Info

Publication number: CN111325155A (granted as CN111325155B)
Application number: CN202010107288.8A
Authority: CN (China)
Prior art keywords: network, residual, model, space, convolution
Other languages: Chinese (zh)
Inventors: 张祖凡, 吕宗明, 甘臣权, 张家波
Applicant and current assignee: Chongqing University of Posts and Telecommunications
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (G Physics; G06 Computing; G06V Image or video recognition or understanding; G06V 20/00 Scenes; G06V 20/40 Scenes in video content)
    • G06F 18/2411: Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines (G06F Electric digital data processing; G06F 18/00 Pattern recognition; G06F 18/24 Classification techniques)
    • G06F 18/253: Fusion techniques of extracted features (G06F 18/25 Fusion techniques)
    • G06N 3/045: Combinations of networks (G06N Computing arrangements based on specific computational models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/084: Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)


Abstract

The invention relates to a video action classification method based on a residual 3D CNN and multi-modal feature fusion, and belongs to the fields of computer vision and deep learning. First, the connection pattern of the traditional C3D network is changed to residual connections. A 3D kernel decomposition technique then splits each 3D convolution kernel into a spatial convolution kernel and several parallel temporal kernels of different temporal scales, and an attention model is inserted after the spatial kernel, yielding an A3D residual module; stacking these modules produces a residual network. On this basis a two-stream action recognition model is built: RGB image features are fed to the spatial-stream network and optical-flow features to the temporal-stream network, multi-level convolutional features are extracted from both, and a multi-level feature fusion strategy combines the two networks so that spatial and temporal features complement each other. Finally, the score-fused global video action descriptors are reduced in dimension by PCA, and an SVM classifier completes the action classification.

Description

Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
Technical Field
The invention belongs to the fields of computer vision and deep learning, and relates to a video action recognition method based on a residual 3D CNN and a multi-modal feature fusion strategy.
Background
Today's digital content is essentially multimedia information containing text, audio, images, video and the like. With the prevalence of sensors and the proliferation of mobile devices in particular, communicating information through video has become popular and has begun to form a new medium of exchange between internet users. To mine and understand multimedia information more deeply and intelligently, the research community is encouraging ever more work on advanced video understanding technology, and representation learning is the foundation on which these advances succeed. In recent years convolutional neural networks (CNNs) have risen to prominence, especially in the image domain. A deep CNN uses many different convolution kernels, combined with the local receptive field's information-capture mechanism, to traverse the feature planes of the previous layer and capture local features of different granularities; as the number of layers increases, the extracted salient features are combined and compressed, and different feature layers cover visual perception at different levels. Deep CNNs have therefore been widely accepted in representation learning for their superior ability to learn visual appearance features; for example, the residual network achieved a top-5 error rate of 3.57% on the ImageNet test set, setting a new state of the art. Video frames, however, are time-series images, and the large dynamic changes and processing complexity of time-series images make it difficult for a model to learn a strong and universal spatio-temporal representation.
At present the main approach is to extend the CNN convolution kernel from 2D to 3D and train an entirely new 3D CNN. Adding a temporal dimension to the 2D CNN lets the network extract not only the visual appearance features present in each video image but also the dynamic information between consecutive frames. However, while 3D convolution kernels improve model performance, the expensive computation cost of network training becomes a problem in its own right. Taking the widely adopted 11-layer 3D CNN, the C3D network, as an example, the model size reaches 321 MB, and since parameter counts grow rapidly as kernels are enlarged, studying effective substitutes for the 3D convolution kernel is imperative. Moreover, in current two-stream action recognition models the spatial-stream and temporal-stream networks lack interaction before the final decision fusion, the representational capacity accumulated across the many network layers is not fully exploited, and there is relatively little research on how to realize the complementation of spatial and temporal features by fusing multi-level two-stream features. Addressing the defects that the C3D model's parameters are difficult to train and that its shallow network has limited representational capacity, so as to improve the capability and efficiency of 3D convolutional neural network models in processing video actions and to realize a sufficient, effective fusion and complementation of the two streams, is therefore a very important task.
Disclosure of Invention
In view of the above, the present invention provides a video action classification method based on a residual 3D CNN and multi-modal feature fusion.
In order to achieve the purpose, the invention provides the following technical scheme:
a video motion classification method based on residual difference type 3D CNN and multi-mode feature fusion comprises the following steps:
s1: based on a traditional convolution 3D Neural network (C3D), the connection mode of each convolution module is changed into residual error connection, and identity mapping (index mapping) is introduced;
s2: in a residual module, decomposing an original 3D convolution kernel into a space kernel and a plurality of parallel multi-scale time kernels (MTTL) by using A3D kernel decomposition technology to reduce model parameters, and then embedding an attention model (CBAM) to obtain a brand-new residual module (A3D block);
s3: the input and output settings of each module are adjusted by stacking an A3D block and a pooling layer, and the final building of an A3D residual network is completed;
s4: building a space-time double-flow identification model by using a designed A3D convolution residual neural network model, and respectively taking two modes of an RGB video image and an optical flow image as network input;
s5: by jointly utilizing a multi-stage feature fusion and decision fusion method, firstly fusing different layers of features in a time network and a space network on a feature level, and then weighing class fractional vectors of a plurality of softmax classifiers by a decision-stage weight fusion strategy to realize fractional decision fusion;
s6: and then, performing dimensionality reduction and decorrelation on the fused feature descriptors by using a Principal Component Analysis (PCA) dimensionality reduction algorithm, and finally completing classification and identification on video actions through a multi-classification SVM classifier.
Further, changing the sequential direct connection between the feature modules of the original C3D into residual connections in step S1 specifically comprises:
taking the original input of a feature module, x_{n-1} (i.e. the identity mapping), plus the module's output as the new output y_n, expressed as y_n = R(x_{n-1}, W) + x_{n-1}, where W denotes the trainable parameters in the residual module; the residual mapping R combines the original input x_{n-1} to fit the variable residual values during network training, and the term + x_{n-1} denotes the shortcut connection, so that front-layer information is not easily lost when propagated to the deeper layers of the network and vanishing and exploding gradients are avoided.
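For illustration, this residual wiring can be sketched in PyTorch as follows; the channel count, kernel size and input shape are assumptions for the example, not the patent's exact configuration.

    import torch
    import torch.nn as nn

    class ResidualWrapper(nn.Module):
        """Wraps a convolution module R so that y_n = R(x_{n-1}, W) + x_{n-1}."""
        def __init__(self, residual_branch: nn.Module):
            super().__init__()
            self.residual_branch = residual_branch  # residual mapping R with trainable W

        def forward(self, x):
            # Shortcut connection: the identity mapping is added to the residual
            # branch output, so front-layer information reaches deeper layers intact.
            return self.residual_branch(x) + x

    # Usage: wrap a 3D convolution whose input and output shapes match.
    block = ResidualWrapper(nn.Conv3d(64, 64, kernel_size=3, padding=1))
    y = block(torch.randn(1, 64, 8, 28, 28))  # (N, C, T, H, W)
    print(y.shape)  # torch.Size([1, 64, 8, 28, 28])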
Further, the 3D kernel decomposition described in step S2 includes:
the method comprises the steps of decomposing a3 × 3 × 3 convolution kernel along a space dimension and a time dimension by utilizing a3D kernel decomposition technology to obtain a space convolution kernel of 1 × 3 × 3 and a time convolution kernel of 3 × 1 × 1, reducing model parameters, and simultaneously enriching time kernel scales in order to solve the defect that time grabbing scales are single when a model processes time sequence frame image characteristic information, merging time kernels of 1 × 1 × 1 and 2 × 1 × 1 with different scales, and designing a multi-scale time transformation layer (MTTL) to improve the extraction capability of the model on multi-granularity time information in the time domain.
Further, in step S2 an attention module CBAM is introduced into the residual module; the CBAM is divided into a channel attention module (CAM) and a spatial attention module (SAM), wherein:
in the channel attention model, the input feature F ∈ R^{C×W×H} (where C, W and H denote the number of feature-plane channels, the width and the height respectively) is first compressed along the spatial dimensions by max pooling (maxpool) and average pooling (avgpool); a multi-layer perceptron (MLP) is then used to prepare the channel weights, the two results are summed and passed through a relu activation layer, and the outcome is remapped onto each feature channel of the input, realizing a rational allocation of the channel attention scores. The computation is expressed as M_c = relu{MLP(maxpool(F)) + MLP(avgpool(F))}, where M_c is the output of the CAM, i.e. the channel-weighted saliency features;
in the spatial attention model, max pooling (maxpool) and average pooling (avgpool) likewise compress M_c; the two feature descriptors are concatenated to obtain a two-channel feature carrying channel saliency, a convolution operation Conv computes Conv[maxpool(F), avgpool(F)] to obtain the spatial weights, and the normalized spatial weights are added to M_c to obtain the spatially salient features. Because the CAM and the SAM attend to complementary aspects, the CBAM screens feature spatial information in all dimensions. In the residual module, the CBAM directly receives the output of the spatial kernel as its input, endowing the model with an effective feature-screening mechanism.
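A sketch of the described CBAM, written directly from the equations above, might look like the following; the relu (where the original CBAM paper uses a sigmoid), the 7×7 spatial convolution, the reduction ratio, the extension of the pooling over the temporal axis (the text writes F ∈ R^{C×W×H}), and the literal additive combination at the end are taken from or assumed around the text.

    import torch
    import torch.nn as nn

    class CBAMSketch(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )
            self.conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

        def forward(self, f):                      # f: (N, C, T, H, W)
            n, c = f.shape[:2]
            # Channel attention: M_c = relu(MLP(maxpool(F)) + MLP(avgpool(F)))
            mx = f.amax(dim=(2, 3, 4))
            av = f.mean(dim=(2, 3, 4))
            m_c = torch.relu(self.mlp(mx) + self.mlp(av)).view(n, c, 1, 1, 1)
            fc = f * m_c                           # channel-weighted saliency features
            # Spatial attention: pool the channel dimension away, concatenate the
            # two descriptors, convolve to spatial weights, then normalize.
            desc = torch.cat([fc.amax(dim=1, keepdim=True),
                              fc.mean(dim=1, keepdim=True)], dim=1)
            m_s = torch.sigmoid(self.conv(desc))   # sigmoid as the normalization step
            # Literal reading of the text: the normalized spatial weight is ADDED to
            # the channel-weighted features; the original CBAM multiplies instead.
            return fc + m_s

    print(CBAMSketch(64)(torch.randn(2, 64, 8, 28, 28)).shape)  # (2, 64, 8, 28, 28)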
Further, the two-stream recognition model in step S4 is built as follows:
the A3D convolutional residual neural network is used as the base model of the two-stream network, and RGB image features and the corresponding optical-flow features are used as the inputs of the spatial-stream and temporal-stream networks respectively. The optical-flow features are obtained with a spatial pyramid network (SpyNet) that is connected directly into the two-stream network; through back-propagation of the gradient it participates in training together with the temporal-stream and spatial-stream networks, fine-tuning its parameters. Unlike methods that extract optical-flow information with hand-crafted procedures, optical flow computed by a learned network is more flexible for characterizing action classes in real scenes.
Further, the multi-level feature fusion and decision fusion method in step S5 specifically comprises:
deriving multi-level complementary features f_i^*, f_i from different feature layers of the A3D convolutional residual neural network, including the A3D_2a, A3D_3a, A3D_5a and softmax layers, where f_i^* and f_i denote the multi-level features of the temporal-stream and spatial-stream networks respectively. The derived features are fused by weighted summation to balance the contributions of the two streams, i.e. F_i = W_i[f_i, f_i^*] is computed, where F_i and W_i are the output of the i-th layer's feature fusion and the corresponding weight-fusion parameter matrix, written as (α_i, β_i). The weighted fused features then pass through a 1×1×1 convolution layer and a max-pooling layer; after softmax, the fused features of each layer yield a decision score, and the per-layer decision scores undergo score-level weight fusion once more to obtain a feature descriptor with strong representational power.
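One fusion level, with its per-level classifier, might be sketched as follows; the learnable scalars alpha and beta, the classifier head, and the fixed score-fusion weights are assumptions, since the text only specifies W_i = (α_i, β_i), the 1×1×1 convolution, max pooling and softmax.

    import torch
    import torch.nn as nn

    class LevelFusion(nn.Module):
        """F_i = alpha_i * f_i + beta_i * f_i_star, then a 1x1x1 convolution, max
        pooling and a softmax classifier producing this level's decision score."""
        def __init__(self, channels, num_classes):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(0.5))   # spatial-stream weight
            self.beta = nn.Parameter(torch.tensor(0.5))    # temporal-stream weight
            self.conv = nn.Conv3d(channels, channels, kernel_size=1)
            self.head = nn.Linear(channels, num_classes)

        def forward(self, f_spatial, f_temporal):
            fused = self.alpha * f_spatial + self.beta * f_temporal
            pooled = torch.amax(self.conv(fused), dim=(2, 3, 4))  # global max pooling
            return torch.softmax(self.head(pooled), dim=1)        # per-level score

    # Score-level fusion of the per-level decisions (fixed weights, an assumption):
    levels = [LevelFusion(64, 101), LevelFusion(64, 101)]
    feats = [(torch.randn(2, 64, 8, 28, 28), torch.randn(2, 64, 8, 28, 28))] * 2
    scores = sum(0.5 * lvl(fs, ft) for lvl, (fs, ft) in zip(levels, feats))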
The beneficial effects of the invention are as follows: compared with the original C3D model, the proposed spatio-temporal two-stream A3D convolutional residual neural network achieves higher recognition efficiency with fewer model parameters; at the same time, the deeper network model further improves feature representation and can further raise action classification accuracy.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a video motion classification method based on residual difference type 3D CNN and multi-modal feature fusion according to the present invention;
FIG. 2 is a structural diagram of the C3D model;
FIG. 3 is a schematic diagram of 2D convolution and 3D convolution operations;
FIG. 4 is a CBAM structural diagram;
FIG. 5 is a schematic diagram of the A3D residual module;
FIG. 6 is a diagram of the A3D convolution residual neural network structure;
FIG. 7 is a diagram of the overall two-stream action recognition model.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; and it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in FIG. 1, the present invention provides a video action classification method based on a residual 3D CNN and multi-modal feature fusion. First, the invention extracts the first 20 frames of each video and crops all input frames to 112×112 as the network input; the batch size used is 20 videos. The C3D convolutional neural network, an early classic 3D CNN model, is a shallow 11-layer model comprising 5 convolution modules and 2 fully connected layers; the concrete C3D structure is shown in FIG. 2. During training, the model propagates gradients and updates parameters through a single connection pattern in which each layer's output is passed to the next layer in sequence; its large model parameters and insufficient representational capacity are what the present invention sets out to improve.
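This input pipeline can be sketched as below; load_frames is a hypothetical decoder returning a (T, H, W, 3) uint8 array per video, and the naive top-left crop stands in for whatever cropping the authors actually used.

    import torch

    def make_batch(video_paths, load_frames, num_frames=20, size=112):
        """First 20 frames per video, 112x112 crops -> tensor (N, 3, 20, 112, 112)."""
        clips = []
        for path in video_paths:
            frames = load_frames(path)[:num_frames]        # first 20 frames
            t = torch.as_tensor(frames).float() / 255.0    # (T, H, W, 3) in [0, 1]
            t = t[:, :size, :size, :]                      # naive top-left crop
            clips.append(t.permute(3, 0, 1, 2))            # -> (3, T, H, W)
        return torch.stack(clips)                          # batch of 20 in this setup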
A3D convolutional residual neural network construction process:
(1) Establishing residual connections: the invention changes the sequential direct connection of all the feature modules in the original C3D into residual connections. The specific operation is to take the original input of a feature module, x_{n-1} (i.e. the identity mapping), plus the module's output as the new output y_n; the specific flow is expressed as y_n = R(x_{n-1}, W) + x_{n-1}, where W denotes the trainable parameters in the residual module, the residual mapping R combines the original input x_{n-1} to fit the variable residual values during network training, and the term + x_{n-1} denotes the shortcut connection, so that front-layer information is not easily lost when transmitted to the deeper layers of the network and vanishing and exploding gradients are avoided.
(2) 3D kernel decomposition: the output of a 2D convolution lacks temporal information, whereas a 3D convolution can capture temporal and spatial information at the same time; the specific operations are contrasted in FIG. 3. However, the heavy training parameters of 3D convolution reduce network training efficiency, which the kernel decomposition and the multi-scale temporal transformation layer described above are designed to remedy.
(3) Attention module introduction: following the above process, the invention introduces the attention model CBAM into the residual module; it is mainly divided into a channel attention module (CAM) and a spatial attention module (SAM), whose structure can be seen in FIG. 4. ① In the channel attention model, the input feature F ∈ R^{C×W×H} (where C, W and H denote the number of feature-plane channels, the width and the height respectively) is first compressed along the spatial dimensions by max pooling (maxpool) and average pooling (avgpool); a multi-layer perceptron (MLP) is then used to prepare the channel weights, the two results are summed and passed through a relu activation layer, and the outcome is remapped onto each feature channel of the input, realizing a rational allocation of the channel attention scores. The computation is expressed as M_c = relu{MLP(maxpool(F)) + MLP(avgpool(F))}, where M_c is the output of the CAM, i.e. the channel-weighted saliency features. ② In the spatial attention model, the same two pooling operations compress M_c; the two feature descriptors are concatenated to obtain a two-channel feature carrying channel saliency, a convolution operation Conv computes Conv[maxpool(F), avgpool(F)] to obtain the spatial weights, and the normalized spatial weights are added to M_c to obtain the spatially salient features. Because the CAM and the SAM attend to complementary aspects, the CBAM screens feature spatial information in all dimensions. In the residual module, the CBAM directly receives the output of the spatial kernel as its input, endowing the model with an effective feature-screening mechanism.
(4) A3D residual module: on top of the residual module, the kernel decomposition technique is used to reduce the model parameters, the designed MTTL enriches the temporal feature granularity the model captures, and the introduced attention model improves the model's robustness; combining these advantages yields the A3D residual module, whose detailed structure is shown in FIG. 5.
(5) Building the A3D convolutional residual neural network: the invention replaces the convolution modules at the corresponding positions of the original C3D with A3D modules and adjusts the corresponding dimension outputs, so as to keep the input and output dimensions of each convolution module consistent with C3D. Finally, stacking the A3D modules yields a convolutional neural network structure with more layers, namely the A3D convolutional residual neural network shown in FIG. 6.
The two-stream recognition model is built as follows:
(1) Deriving multi-modal features: the invention uses the A3D convolutional residual neural network as the base model of the two-stream network, and takes RGB image features and the corresponding optical-flow features as the inputs of the spatial-stream and temporal-stream networks respectively. The optical-flow features are obtained with a spatial pyramid network (SpyNet) that is connected directly into the two-stream network; through back-propagation of the gradient it participates in training together with the temporal-stream and spatial-stream networks, fine-tuning its parameters. Unlike methods that extract optical-flow information with hand-crafted procedures, optical flow computed by a learned network is more flexible for characterizing action classes in real scenes.
(2) Multi-stage feature fusion and decision method: in the constructed two-stream recognition network, the invention derives multi-level complementary features f_i^*, f_i from different feature layers of the A3D convolutional residual neural network (the A3D_2a, A3D_3a, A3D_5a and softmax layers), where f_i^* and f_i denote the multi-level features of the temporal-stream and spatial-stream networks respectively. The derived features are then fused by weighted summation to balance the contributions of the two streams, i.e. F_i = W_i[f_i, f_i^*] is computed, where F_i and W_i are the output of the i-th layer's feature fusion and the corresponding weight-fusion parameter matrix (detailed as α_i, β_i). The weighted fused features then pass through a 1×1×1 convolution layer and a max-pooling layer, and after softmax the fused features of each layer yield a decision score; similarly, the per-layer decision scores undergo score-level weight fusion once more to prepare a feature descriptor with strong representational power. Finally, PCA decorrelates the feature vectors and removes redundancy, and the resulting effective features enter a multi-class SVM classifier to complete the final recognition task. The overall two-stream action recognition model is shown in FIG. 7.
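The closing PCA-plus-SVM stage can be sketched with scikit-learn; the descriptor dimensionality, the number of retained components, the number of classes and the kernel choice are illustrative assumptions, and the random arrays stand in for real fused descriptors.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    X_train = np.random.randn(500, 4096)    # fused descriptors, one row per video
    y_train = np.random.randint(0, 101, 500)

    pca = PCA(n_components=256, whiten=True)   # whitening decorrelates the features
    X_red = pca.fit_transform(X_train)

    svm = SVC(kernel="linear", decision_function_shape="ovr")  # multi-class SVM
    svm.fit(X_red, y_train)
    pred = svm.predict(pca.transform(np.random.randn(1, 4096)))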
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A video action classification method based on a residual 3D CNN and multi-modal feature fusion, characterized by comprising the following steps:
S1: based on the traditional convolutional 3D neural network C3D, changing the connection mode of each convolution module to residual connection and introducing identity mapping;
S2: in the residual module, decomposing the original 3D convolution kernel with a 3D kernel decomposition technique into a spatial kernel and several parallel multi-scale temporal kernels (MTTL) to reduce model parameters, and then embedding the attention model CBAM to obtain a brand-new residual module, the A3D block;
S3: adjusting the input and output settings of each module by stacking A3D blocks and pooling layers, completing the construction of the A3D residual network;
S4: building a spatio-temporal two-stream recognition model from the designed A3D convolutional residual neural network, taking the two modalities of RGB video images and optical-flow images as the network inputs;
S5: first fusing features from different layers of the temporal and spatial networks at the feature level, then weighing the class-score vectors of the several softmax classifiers with a decision-level weight-fusion strategy to realize score-level decision fusion;
S6: finally, completing the classification and recognition of video actions with a multi-class SVM classifier.
2. The method according to claim 1, characterized in that changing the sequential direct connection of the feature modules in the original C3D into residual connections in step S1 specifically comprises:
taking the original input of a feature module, x_{n-1}, i.e. the identity mapping, plus the module's output as the new output y_n, expressed as y_n = R(x_{n-1}, W) + x_{n-1}, where W denotes the trainable parameters in the residual module; the residual mapping R combines the original input x_{n-1} to fit the variable residual values during network training, and R + x_{n-1} denotes the shortcut connection, so that front-layer information is not easily lost when propagated to the deeper layers of the network and vanishing and exploding gradients are avoided.
3. The method according to claim 1, characterized in that the 3D kernel decomposition in step S2 comprises:
decomposing the 3×3×3 convolution kernel along the spatial and temporal dimensions with the 3D kernel decomposition technique into a 1×3×3 spatial convolution kernel and a 3×1×1 temporal convolution kernel to reduce model parameters, while merging temporal kernels of the further scales 1×1×1 and 2×1×1, and designing a multi-scale temporal transformation layer (MTTL) to improve the extraction of multi-granularity temporal information in the time domain.
4. The method according to claim 1, characterized in that step S2 introduces an attention module CBAM into the residual module, the CBAM being divided into a channel attention module CAM and a spatial attention module SAM, wherein:
in the channel attention model, the input feature F ∈ R^{C×W×H}, where C, W and H denote the number of feature-plane channels, the width and the height respectively, is compressed along the spatial dimensions by max pooling and average pooling; a multi-layer perceptron (MLP) is then used to prepare the channel weights, the two results are summed, passed through a relu activation layer, and remapped onto each feature channel of the input, realizing a rational allocation of the channel attention scores; the computation is expressed as M_c = relu{MLP(maxpool(F)) + MLP(avgpool(F))}, M_c being the output of the CAM, i.e. the channel-weighted saliency features;
in the spatial attention model, max pooling and average pooling likewise compress M_c; the two feature descriptors are concatenated to obtain a two-channel feature carrying channel saliency, a convolution operation Conv computes Conv[maxpool(F), avgpool(F)] to obtain the spatial weights, and the normalized spatial weights are added to M_c to obtain the spatially salient features; because the CAM and the SAM attend to complementary aspects, the CBAM screens feature spatial information in all dimensions; in the residual module, the CBAM directly receives the output of the spatial kernel as its input, endowing the model with an effective feature-screening mechanism.
5. The method according to claim 1, characterized in that the two-stream recognition model in step S4 is built as follows:
the A3D convolutional residual neural network is used as the base model of the two-stream network, and RGB image features and the corresponding optical-flow features are used as the inputs of the spatial-stream and temporal-stream networks respectively; the optical-flow features are obtained with the spatial pyramid network SpyNet, which is connected directly into the two-stream network and, through back-propagation of the gradient, is trained together with the temporal-stream and spatial-stream networks to fine-tune its parameters.
6. The method according to claim 1, characterized in that the multi-level feature fusion and decision fusion in step S5 specifically comprises:
deriving multi-level complementary features f_i^*, f_i from different feature layers of the A3D convolutional residual neural network, including the A3D_2a, A3D_3a, A3D_5a and softmax layers, where f_i^* and f_i denote the multi-level features of the temporal-stream and spatial-stream networks respectively; fusing the corresponding temporal-stream and spatial-stream features by weighted summation to balance the contributions of the two streams, i.e. computing F_i = W_i[f_i, f_i^*], where F_i and W_i are the output of the i-th layer's feature fusion and the corresponding weight-fusion parameter matrix, written as (α_i, β_i); passing the weighted fused features through a 1×1×1 convolution layer and a max-pooling layer, obtaining the decision score of each layer's fused features after softmax, and fusing the per-layer decision scores again with score-level weights to obtain a feature descriptor with strong representational power.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107288.8A CN111325155B (en) 2020-02-21 2020-02-21 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Publications (2)

Publication Number Publication Date
CN111325155A 2020-06-23
CN111325155B 2022-09-23

Family

ID=71171398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010107288.8A Active CN111325155B (en) 2020-02-21 2020-02-21 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Country Status (1)

Country Link
CN (1) CN111325155B (en)

Also Published As

Publication number Publication date
CN111325155B (en) 2022-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant