CN115836330A

CN115836330A - Action recognition method and related products based on deep residual network

Info

Publication number: CN115836330A
Application number: CN202180048575.9A
Authority: CN
Inventors: 陈佳伟; 萧人豪; 何朝文
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2020-07-10
Filing date: 2021-07-09
Publication date: 2023-03-21
Also published as: WO2022007954A1

Abstract

An action recognition method based on a deep residual network and related products are provided. A first convolution module in the at least one convolution module receives a video segment as input. Traverse at least one convolutional module by: processing the above input with a 1D filter for motion-related features and a 2D filter for appearance-related features; shifting the above motion-related features along the temporal dimension A step size, subtracting the above-mentioned shifted motion-related features from the above-mentioned motion-related features to obtain residual features; obtain the output of the above-mentioned convolution module as the input of the next convolution module, until at least one of the above-mentioned convolutions is traversed The last convolutional module in the module. At least one action included in the video segment is identified based on an output of a last convolutional module of the at least one convolutional module. Thereby reducing the model size and reducing the computational cost.

Description

Action recognition method and related products based on deep residual network

技术领域technical field

本公开涉及神经网络技术领域，尤其涉及一种基于深度残差网络的动作识别方法以及相关产品。The present disclosure relates to the technical field of neural networks, in particular to an action recognition method based on a deep residual network and related products.

背景技术Background technique

卷积神经网络(CNN)和大规模标记数据集的再次兴起，使得使用端到端可训练网络的图像分类取得了前所未有的进展。然而，基于视频的人体动作识别还不能仅仅基于CNN特征来实现。如何有效地对时间信息进行建模，即识别时间的相关性和因果关系，是一个基本的挑战。The resurgence of convolutional neural networks (CNNs) and large-scale labeled datasets has enabled unprecedented advances in image classification using end-to-end trainable networks. However, video-based human action recognition cannot yet be achieved based solely on CNN features. How to efficiently model temporal information, that is, identify temporal correlations and causal relationships, is a fundamental challenge.

目前，出现了专注于通过人为手动设计的光流对运动进行建模的经典研究分支，并且在单独的流中分别使用光流模态和红绿蓝(RGB)的双流方法是最成功的架构之一。然而，光流计算成本高。Currently, there is a classical branch of research focused on modeling motion via human-designed optical flow, and a two-stream approach using optical flow modalities and red-green-blue (RGB) in separate streams is the most successful architecture. one. However, optical flow is computationally expensive.

发明内容Contents of the invention

实施例提供了一种基于深度残差网络的动作识别方法以及相关产品，以减小用于基于视频的人体动作识别的深度残差网络的模型的尺寸，以及减少计算成本。The embodiment provides an action recognition method based on a deep residual network and related products, so as to reduce the size of the model of the deep residual network for video-based human action recognition and reduce the calculation cost.

在第一方面，提供了一种基于深度残差网络的动作识别方法。该基于深度残差网络的动作识别方法，应用于包括至少一个卷积模块的深度残差网络系统，上述至少一个卷积模块中的每个卷积模块包括至少一个第一卷积层，上述至少一个第一卷积层中的每个第一卷积层具有至少一个一维滤波器和至少一个二维滤波器，上述方法包括以下内容。在上述至少一个卷积模块中的第一个卷积模块接收视频片段作为输入。通过执行以下操作遍历上述至少一个卷积模块：使用一维滤波器处理上述输入以获取运动相关特征，以及使用二维滤波器处理上述输入以获取外观相关特征；将上述运动相关特征沿时间维度移位一个步长，从上述运动相关特征中减去上述移位后的运动相关特征以获取残差特征；基于上述外观相关特征、上述运动相关特征以及上述残差特征，获取上述卷积模块的输出；将上述卷积的输出作为下一个卷积模块的输入，直至遍历完上述至少一个卷积模块中的最后一个卷积模块。基于上述至少一个卷积模块中的最后一个卷积模块的输出，识别上述视频片段中包括的至少一个动作。In the first aspect, an action recognition method based on deep residual network is provided. The action recognition method based on the deep residual network is applied to a deep residual network system including at least one convolution module, each convolution module in the at least one convolution module includes at least one first convolution layer, and the above at least Each first convolutional layer in a first convolutional layer has at least one one-dimensional filter and at least one two-dimensional filter, and the above method includes the following content. A first convolution module in the at least one convolution module receives a video segment as input. Traversing the at least one convolution module above by performing the following operations: processing the above-mentioned input with a one-dimensional filter to obtain motion-related features, and processing the above-mentioned input with a two-dimensional filter to obtain appearance-related features; shifting the above-mentioned motion-related features along the time dimension One step, subtract the above-mentioned shifted motion-related features from the above-mentioned motion-related features to obtain the residual feature; based on the above-mentioned appearance-related features, the above-mentioned motion-related features and the above-mentioned residual features, obtain the output of the above-mentioned convolution module ; Using the output of the above convolution as the input of the next convolution module until the last convolution module in the at least one convolution module is traversed. At least one action included in the video segment is identified based on an output of a last convolutional module of the at least one convolutional module.

在第二方面，提供了一种基于深度残差网络的动作识别装置。该基于深度残差网络的动作识别装置，应用于包括至少一个卷积模块的深度残差网络系统，上述至少一个卷积模块中的每个卷积模块包括至少一个第一卷积层，上述至少一个第一卷积层中的每个第一卷积层具有至少一个一维滤波器和至少一个二维滤波器，上述装置包括接收单元、处理单元，以及识别单元。上述接收单元用于在上述至少一个卷积模块中的第一个卷积模块接收视频片段作为输入。上述处理用于通过执行以下操作遍历上述至少一个卷积模块：使用一维滤波器处理上述输入以获取运动相关特征，以及使用二维滤波器处理上述输入以获取外观相关特征；将上述运动相关特征沿时间维度移位一个步长，从上述运动相关特征中减去上述移位后的运动相关特征以获取残差特征；基于上述外观相关特征、上述运动相关特征以及上述残差特征，获取上述卷积模块的输出；将上述卷积的输出作为下一个卷积模块的输入，直至遍历完上述至少一个卷积模块中的最后一个卷积模块。上述识别单元用于基于上述至少一个卷积模块中的最后一个卷积模块的输出，识别上述视频片段中包括的至少一个动作。In a second aspect, an action recognition device based on a deep residual network is provided. The action recognition device based on the deep residual network is applied to a deep residual network system including at least one convolution module, each convolution module in the at least one convolution module includes at least one first convolution layer, and the above at least Each first convolutional layer in a first convolutional layer has at least one one-dimensional filter and at least one two-dimensional filter, and the above device includes a receiving unit, a processing unit, and an identifying unit. The above-mentioned receiving unit is used for receiving a video segment as an input in the first convolution module of the at least one convolution module. The above-mentioned processing is used to traverse the above-mentioned at least one convolution module by performing the following operations: processing the above-mentioned input with a one-dimensional filter to obtain motion-related features, and processing the above-mentioned input with a two-dimensional filter to obtain appearance-related features; Shift a step along the time dimension, subtract the above-mentioned shifted motion-related features from the above-mentioned motion-related features to obtain residual features; The output of the convolution module; the output of the above convolution is used as the input of the next convolution module until the last convolution module in the at least one convolution module is traversed. The identification unit is configured to identify at least one action included in the video clip based on the output of the last convolution module in the at least one convolution module.

在第三方面，提供了一种终端设备，该终端设备包括处理器，用于存储一个或多个程序的存储器。上述一个或多个程序用于被上述处理器执行，上述一个或多个程序包括用于执行在第一方面描述的方法中的部分或全部操作的指令。In a third aspect, a terminal device is provided, and the terminal device includes a processor and a memory for storing one or more programs. The above-mentioned one or more programs are used to be executed by the above-mentioned processor, and the above-mentioned one or more programs include instructions for performing part or all of the operations in the method described in the first aspect.

在第四方面，提供了一种非瞬时性计算机可读存储介质，上述非瞬时性计算机可读存储介质用于存储用于电子数据交换的计算机程序。上述计算机程序包括用于执行在第一方面描述的方法中的部分或全部操作的指令。In a fourth aspect, there is provided a non-transitory computer-readable storage medium for storing a computer program for electronic data exchange. The above computer program includes instructions for performing part or all of the operations in the method described in the first aspect.

在第五方面，提供了一种计算机程序产品，上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质。上述计算机程序可使计算机执行在第一方面描述的方法中的部分或全部操作。In a fifth aspect, a computer program product is provided, and the computer program product includes a non-transitory computer-readable storage medium storing a computer program. The above computer program can cause the computer to execute part or all of the operations in the method described in the first aspect.

在本申请实施例中，提供了一种新的深度残差网络系统。上述深度残差网络系统包括至少一个卷积模块，上述至少一个卷积模块中的每个卷积模块包括至少一个第一卷积层，上述至少一个第一卷积层中的每个第一卷积层具有至少一个一维滤波器和至少一个二维滤波器，上述基于深度残差网络的动作识别方法应用于上述深度残差网络系统。在上述至少一个卷积模块中的第一个卷积模块接收视频片段作为输入。通过执行以下操作遍历上述至少一个卷积模块：使用一维滤波器处理上述输入以获取运动相关特征，以及使用二维滤波器处理上述输入以获取外观相关特征；将上述运动相关特征沿时间维度移位一个步长，从上述运动相关特征中减去上述移位后的运动相关特征以获取残差特征；基于上述外观相关特征、上述运动相关特征以及上述残差特征，获取上述卷积模块的输出；将上述卷积的输出作为下一个卷积模块的输入，直至遍历完上述至少一个卷积模块中的最后一个卷积模块。基于上述至少一个卷积模块中的最后一个卷积模块的输出，识别上述视频片段中包括的至少一个动作。因此，在本申请中提出了可被视为伪三维卷积模块的新卷积模块，在该新卷积模块中，相关技术中的标准3D滤波器被解耦形成并行的二维空间滤波器和一维时域滤波器。通过使用可分离的二维卷积和一维卷积代替三维卷积，大大降低了模型尺寸和计算成本。此外，2D卷积和1D卷积被放置在不同的路径中，从而可针对外观相关特征和运动相关特征进行不同的建模。In the embodiment of this application, a new deep residual network system is provided. The above-mentioned deep residual network system includes at least one convolution module, each convolution module in the above-mentioned at least one convolution module includes at least one first convolution layer, and each first convolution module in the above-mentioned at least one first convolution layer The product layer has at least one one-dimensional filter and at least one two-dimensional filter, and the above-mentioned action recognition method based on the deep residual network is applied to the above-mentioned deep residual network system. A first convolution module in the at least one convolution module receives a video segment as input. Traversing the at least one convolution module above by performing the following operations: processing the above-mentioned input with a one-dimensional filter to obtain motion-related features, and processing the above-mentioned input with a two-dimensional filter to obtain appearance-related features; shifting the above-mentioned motion-related features along the time dimension One step, subtract the above-mentioned shifted motion-related features from the above-mentioned motion-related features to obtain the residual feature; based on the above-mentioned appearance-related features, the above-mentioned motion-related features and the above-mentioned residual features, obtain the output of the above-mentioned convolution module ; Using the output of the above convolution as the input of the next convolution module until the last convolution module in the at least one convolution module is traversed. At least one action included in the video segment is identified based on an output of a last convolutional module of the at least one convolutional module. Therefore, in this application, a new convolution module that can be regarded as a pseudo-3D convolution module is proposed, in which the standard 3D filters in the related art are decoupled to form parallel 2D spatial filters and a one-dimensional time-domain filter. By using separable 2D and 1D convolutions instead of 3D convolutions, the model size and computational cost are greatly reduced. Furthermore, 2D convolutions and 1D convolutions are placed in different paths, allowing for different modeling of appearance-related features and motion-related features.

附图说明Description of drawings

为了更清楚地说明实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions in the embodiments more clearly, the drawings that need to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention. Those of ordinary skill in the art can also obtain other drawings based on these drawings without any creative effort.

图1是RGB帧示例(顶部)和残差帧示例(底部)的示意图。Figure 1 is a schematic diagram of an RGB frame example (top) and a residual frame example (bottom).

图2是深度残差网络的示例性详细设计的示意图。Figure 2 is a schematic diagram of an exemplary detailed design of a deep residual network.

图3是根据实施例的基于深度残差网络的动作识别方法的示意性流程图。Fig. 3 is a schematic flowchart of an action recognition method based on a deep residual network according to an embodiment.

图4是根据实施例的基于深度残差网络的动作识别方法的示意性流程图。Fig. 4 is a schematic flowchart of an action recognition method based on a deep residual network according to an embodiment.

图5是提出的卷积模块的示例性详细设计的示意图。Fig. 5 is a schematic diagram of an exemplary detailed design of the proposed convolution module.

图6是根据实施例的基于深度残差网络的动作识别装置的示意性结构图。Fig. 6 is a schematic structural diagram of an action recognition device based on a deep residual network according to an embodiment.

图7是根据实施例的终端设备的示意性结构图。Fig. 7 is a schematic structural diagram of a terminal device according to an embodiment.

具体实施方式Detailed ways

为了使本技术领域的人员更好地理解实施例的方案，下面将结合本实施例中的附图，对本实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本申请一部分的实施例，而不是全部的实施例。基于本实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都应当属于本申请保护的范围。In order to enable those skilled in the art to better understand the solution of the embodiment, the technical solution in the embodiment will be clearly and completely described below in conjunction with the drawings in the embodiment. Obviously, the described embodiment is only Embodiments of a part of the application, but not all of the embodiments. Based on this embodiment, all other embodiments obtained by persons of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”和“第三”等是用于区别不同对象，而不是用于描述特定顺序。此外，术语“包括”和“具有”以及它们任何变形，意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备，不限于已列出的步骤或单元，而是可选地还包括没有列出的步骤或单元，或可选地还包括对于这些过程、方法、产品或设备而言固有的其它步骤或单元。The terms "first", "second" and "third" in the description and claims of the present application and the drawings are used to distinguish different objects, rather than to describe a specific order. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes Other steps or elements inherent to the process, method, product or apparatus.

在本文中提及“实施例”意味着，结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例，也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是，本文所描述的实施例可以与其它实施例相结合。Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The occurrences of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is understood explicitly and implicitly by those skilled in the art that the embodiments described herein can be combined with other embodiments.

本发明实施例涉及的终端设备可以包括各种具有无线通信功能的手持设备、车载设备、可穿戴设备、计算设备或连接到无线调制解调器的其他处理设备，以及各种形式的用户设备(UE)，移动台(MS)，移动终端等等。为方便描述，上面提到的设备统称为终端设备。The terminal devices involved in the embodiments of the present invention may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem with wireless communication functions, and various forms of user equipment (UE), Mobile Station (MS), Mobile Terminal, etc. For convenience of description, the devices mentioned above are collectively referred to as terminal devices.

为了便于更好地理解本申请实施例，下面将简要介绍本申请中涉及的相关技术。In order to better understand the embodiments of the present application, the related technologies involved in the present application will be briefly introduced below.

卷积神经网络(CNN)和大规模标记数据集的再次兴起，使得使用端到端可训练网络的图像分类取得了前所未有的进展。然而，基于视频的人体动作识别还不能仅仅基于CNN特征来实现。如何有效地对时间信息进行建模，即识别时间的相关性和因果关系，是一个基本的挑战。有一个经典的研究分支专注于通过人为手动设计的光流对运动进行建模。在深度学习的背景下，在单独的流中分别使用光流模态和RGB模态的双流方法是最成功的架构之一。然而，考虑到光流计算成本高，并且在单独的流中分别使用光流模态和RGB模态的双流方法通常不能与光流一起进行端到端的学习，因此这种架构在方法上不能令人满意。The resurgence of convolutional neural networks (CNNs) and large-scale labeled datasets has enabled unprecedented advances in image classification using end-to-end trainable networks. However, video-based human action recognition cannot yet be achieved based solely on CNN features. How to efficiently model temporal information, that is, identify temporal correlations and causal relationships, is a fundamental challenge. There is a classic branch of research that focuses on modeling motion via human-crafted optical flow. In the context of deep learning, a two-stream approach using optical flow and RGB modalities in separate streams is one of the most successful architectures. However, given the high computational cost of optical flow, and the fact that two-stream approaches using optical flow and RGB modalities in separate streams are usually not learnable end-to-end together with optical flow, this architecture cannot methodologically make People are satisfied.

在本申请中，提出了残差帧，即相邻RGB帧之间的差异，作为可选的“轻量级”运动表示，残差帧与RGB模态组合用于基于视频的人体动作识别。残差帧可以与RGB模态一起用于基于视频的人体动作识别的原因可以表述如下。一方面，相邻RGB帧很大程度上共享静止对象的信息和背景信息，因此残差帧通常主要保留了运动特定的特征，如图1所示。图1示出了RGB帧示例(顶部)和残差帧示例(底部)。从图1中可以看出，RGB帧包含丰富的外观信息，而残差帧主要保留显著的运动信息。另一方面，与诸如光流的其他运动表示相比，残差帧的计算成本可以忽略不计。In this application, residual frames, i.e., the differences between adjacent RGB frames, are proposed as an alternative "lightweight" motion representation, which are combined with RGB modalities for video-based human action recognition. The reason why residual frames can be used together with RGB modality for video-based human action recognition can be formulated as follows. On the one hand, adjacent RGB frames largely share the information of stationary objects and background information, so residual frames usually mainly preserve motion-specific features, as shown in Fig. 1. Figure 1 shows an RGB frame example (top) and a residual frame example (bottom). As can be seen from Figure 1, RGB frames contain rich appearance information, while residual frames mainly retain salient motion information. On the other hand, the computational cost of residual frames is negligible compared to other motion representations such as optical flow.

随着开发用于视频分类的三维(3D)卷积模型的最近趋势，在本申请中提供了一种新型有效的卷积模块，其可被视为伪3D卷积模块，在该新型的卷积模块中，原始3D卷积被解耦为2D和1D卷积。此外，为了进一步增强运动特征并降低计算成本，可以利用特征空间中的残差信息，即表示时序相邻的CNN特征之间的差异的残差特征。此外，还可以使用自注意力机制基于外观相关特征和运动相关特征对最终任务的重要性，重新校准外观相关特征和运动相关特征，以进一步降低模型尺寸和计算成本，并且避免对最终任务不重要的特征对深度残差网络系统的计算精度的影响。Following the recent trend of developing three-dimensional (3D) convolutional models for video classification, in this application a novel efficient convolutional module is presented, which can be viewed as a pseudo-3D convolutional module, in which In the convolution module, the original 3D convolution is decoupled into 2D and 1D convolution. Furthermore, to further enhance the motion features and reduce the computational cost, residual information in the feature space, i.e. residual features representing the differences between temporally adjacent CNN features, can be exploited. In addition, the self-attention mechanism can also be used to recalibrate the appearance-related features and motion-related features based on their importance to the final task to further reduce the model size and computational cost, and avoid being unimportant to the final task. The influence of features on the calculation accuracy of the deep residual network system.

本申请中提供的基于深度残差网络的动作识别方法和相关产品计算效率高，并且具有改进的性能。具体地，所提出的残差帧(或残差特征)与其他运动表示(例如光流)相比是轻量级替代，并且新的卷积模块也有助于显著降低计算成本。另外，实验已经证明，使用残差帧可以显著提高动作识别的准确率，这将在下面详细描述。The deep residual network-based action recognition method and related products provided in this application are computationally efficient and have improved performance. Specifically, the proposed residual frame (or residual feature) is a lightweight alternative compared to other motion representations (such as optical flow), and the new convolution module also helps to significantly reduce the computational cost. In addition, experiments have demonstrated that using residual frames can significantly improve the accuracy of action recognition, which will be described in detail below.

为了让本领域技术人员更好地理解残差特征的概念，首先介绍残差帧的概念。In order for those skilled in the art to better understand the concept of residual features, the concept of residual frames is firstly introduced.

假设一个视频片段x∈R^T×H×W×C，其中，T,H,W分别表示每帧的长度、高度和宽度，C表示通道数。残差帧可以通过从期望帧

中减去参考帧

来形成，其中时间戳t₁和t₂之间的步长表示为s。更正式地，残差帧可以定义为：Assume a video clip x∈R ^T×H×W×C , where T, H, W represent the length, height and width of each frame, respectively, and C represents the number of channels. The residual frame can be obtained from the desired frame by

Subtract the reference frame from

to form, where the step size between timestamps _t1 and _t2 is denoted as s. More formally, a residual frame can be defined as:

由于相邻的视频帧在静态信息上具有显著的相似性，残差帧通常不包含背景信息和对象的外观信息，但是保留显著的运动相关信息。因此，残差帧可以被视为提取运动特征的良好来源。此外，与诸如光流的其他运动表示相比，残差帧的计算成本可以显著降低。Since adjacent video frames have significant similarity in static information, residual frames usually do not contain background information and appearance information of objects, but retain significant motion-related information. Therefore, residual frames can be regarded as a good source for extracting motion features. Furthermore, the computational cost of residual frames can be significantly reduced compared to other motion representations such as optical flow.

在现实中，视频中包含的动作和活动可能是复杂的，并且可能涉及不同的运动速度或运动持续时间。为了应对这种不确定性，可以堆叠连续的残差帧以形成残差片段，该残差片段可以定义为：In reality, actions and activities contained in videos can be complex and may involve different motion speeds or motion durations. To deal with this uncertainty, consecutive residual frames can be stacked to form a residual segment, which can be defined as:

上述残差片段可以捕捉空间轴上的快速运动和时间轴上的慢速/持续时间长的运动。因此，残差片段适用于3D卷积，在残差片段中，可以同时提取持续时间短和持续时间长的运动信息。The above residual segments can capture fast motion on the spatial axis and slow/long duration motion on the time axis. Therefore, the residual segment is suitable for 3D convolution, in which both short-duration and long-duration motion information can be extracted simultaneously.

然而，由于对象外观和背景场景也可以提供识别动作的重要信息，因此仅有残差帧可能不足以解决人体动作识别问题。例如，在化眼妆和涂口红的场景中，化眼妆和涂口红的运动相似，但运动的位置不同，其中一个发生在眼睛周围，另一个发生在嘴唇周围。因此，需要同时利用RGB帧和残差帧进行动作识别。为此，提供了一种新的卷积模块，它是伪3D卷积模块，具有同时处理RGB帧和残差帧的能力。However, residual frames alone may not be sufficient for human action recognition since object appearance and background scenes can also provide important information for recognizing actions. For example, in the scene of applying eye makeup and applying lipstick, the movements of applying eye makeup and applying lipstick are similar, but the positions of the movements are different, one of which occurs around the eyes and the other around the lips. Therefore, both RGB frames and residual frames need to be utilized for action recognition. To this end, a new convolutional module is provided, which is a pseudo-3D convolutional module with the ability to simultaneously process RGB frames and residual frames.

下面将对本申请实施例进行详细描述。The embodiments of the present application will be described in detail below.

图2是深度残差网络的示例性详细设计的示意图。如图2所示，深度残差网络可至少包括：输入层、至少一个卷积层、池化层、至少一个全连接层和输出层。深度残差网络用于基于视频片段作为输入的动作识别。本申请中的深度残差网络系统是基于图2所示的深度残差网络的系统。Figure 2 is a schematic diagram of an exemplary detailed design of a deep residual network. As shown in Figure 2, the deep residual network may at least include: an input layer, at least one convolutional layer, a pooling layer, at least one fully connected layer and an output layer. Deep residual networks for action recognition based on video clips as input. The deep residual network system in this application is based on the deep residual network system shown in FIG. 2 .

图3是根据实施例的基于深度残差网络的动作识别方法的示意性流程图。基于深度残差网络的动作识别方法应用于包括至少一个卷积模块的深度残差网络系统，上述至少一个卷积模块中的每个卷积模块包括至少一个第一卷积层，上述至少一个第一卷积层中的每个第一卷积层具有至少一个一维滤波器和至少一个二维滤波器。如图3所示，该方法包括以下内容。Fig. 3 is a schematic flowchart of an action recognition method based on a deep residual network according to an embodiment. The action recognition method based on the deep residual network is applied to a deep residual network system including at least one convolution module, each convolution module in the at least one convolution module includes at least one first convolution layer, and the at least one first convolution layer Each first convolutional layer in a convolutional layer has at least one one-dimensional filter and at least one two-dimensional filter. As shown in Figure 3, the method includes the following contents.

302，在上述至少一个卷积模块中的第一个卷积模块接收视频片段作为输入。302. A first convolution module in the at least one convolution module receives a video segment as an input.

具体地，视频片段x∈R^T×H×W×C可以被接收作为深度残差网络系统的输入，其中T、H、W分别表示每帧的长度、高度和宽度，C表示通道数。对于RGB帧，通道数为3，这三个通道分别代表红色(R)、绿色(G)和蓝色(B)。当在上述至少一个卷积模块中的第一个卷积模块之前没有其他卷积模块或层时，上述视频片段被传递到上述至少一个卷积模块中的第一个卷积模块，并且在上述至少一个卷积模块中的第一个卷积模块处作为输入被接收。当在上述至少一个卷积模块中的第一个卷积模块之前存在其他层时，上述视频片段可以首先被其他层处理，然后被传递到上述至少一个卷积模块中的第一个卷积模块，并且在上述至少一个卷积模块中的第一个卷积模块处进一步作为输入被接收。Specifically, a video segment x ∈ R ^T×H×W×C can be received as an input to a deep residual network system, where T, H, and W denote the length, height, and width of each frame, respectively, and C denotes the number of channels. For RGB frames, the number of channels is 3, and these three channels represent red (R), green (G) and blue (B) respectively. When there are no other convolution modules or layers before the first of the at least one convolution modules, the video segment is passed to the first of the at least one convolution modules, and the A first of the at least one convolutional modules is received as input. When there are other layers before the first convolution module of the at least one convolution module, the video segment may first be processed by other layers and then passed to the first convolution module of the at least one convolution module , and is further received as an input at the first convolution module of the at least one convolution module.

滤波器的尺寸可以表示为T×H×W，其中T表示时间维度，H表示空间维度中的高度，W表示空间维度中的宽度。1D滤波器可以表示为T×1×1，其中，T大于1，而2D滤波器可以表示为1×H×W，其中H和W中的至少一个大于1。1D滤波器用于时间维度的卷积，2D滤波器用于空间维度的卷积。The size of the filter can be expressed as T×H×W, where T represents the time dimension, H represents the height in the space dimension, and W represents the width in the space dimension. A 1D filter can be expressed as T × 1 × 1, where T is greater than 1, and a 2D filter can be expressed as 1 × H × W, where at least one of H and W is greater than 1. 1D filters are used for convolutions in the time dimension Product, 2D filter for convolution of spatial dimensions.

304，通过执行以下操作遍历上述至少一个卷积模块。304. Traverse the at least one convolution module by performing the following operations.

3042，使用一维滤波器处理上述输入以获取运动相关特征，以及使用二维滤波器处理上述输入以获取外观相关特征。3042. Use a one-dimensional filter to process the above input to obtain motion-related features, and use a two-dimensional filter to process the above input to obtain appearance-related features.

在本申请中提供了可以被视为伪3D卷积模块的新卷积模块。对于卷积模块来说，相关技术中的3D滤波器被解耦形成并行的2D空间滤波器和1D时域滤波器。因此，通过使用可分离的2D卷积和1D卷积代替3D卷积，大大降低了模型尺寸和计算成本，这符合高效3D网络发展的最新趋势。此外，2D卷积和1D卷积被放置在不同的路径中，从而可针对外观特征和运动特征进行不同的建模。In this application a new convolutional module that can be regarded as a pseudo 3D convolutional module is presented. For the convolution module, the 3D filter in the related art is decoupled to form a parallel 2D spatial filter and a 1D time domain filter. Therefore, by replacing 3D convolutions with separable 2D convolutions and 1D convolutions, the model size and computational cost are greatly reduced, which is in line with the latest trend in the development of efficient 3D networks. Furthermore, 2D convolutions and 1D convolutions are placed in different paths, allowing for different modeling of appearance and motion features.

3044，将上述运动相关特征沿时间维度移位一个步长，从上述运动相关特征中减去上述移位后的运动相关特征以获取残差特征。3044. Shift the motion-related feature by one step along the time dimension, and subtract the shifted motion-related feature from the motion-related feature to obtain a residual feature.

具体地，当对运动进行建模时，残差帧的思想从像素级扩展到特征级。假设来自1D时域卷积的输出特征是f^m∈R^{T’×H’×W’×C,}，1D时域卷积的输出特征沿着时间维度移位一个步长，例如步长1，然后通过从原始运动相关特征f^m(t)中减去移位后的运动相关特征f^m(t+1)，可以生成残差特征

其可以定义为：Specifically, the idea of residual frames is extended from pixel level to feature level when modeling motion. Assuming that the output features from 1D temporal convolution are f ^m ∈ ^{RT'×H'×W'×C,} , the output features of 1D temporal convolution are shifted by a step along the time dimension, e.g. step 1, The residual feature can then be generated by subtracting the shifted motion-related feature f ^m (t+1) from the original motion-related feature f ^m (t)

It can be defined as:

3046，基于上述外观相关特征、上述运动相关特征以及上述残差特征，获取上述卷积模块的输出。3046. Acquire the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature.

在伪3D卷积之后创建了三个特征

其中f^s是保留外观信息的2D卷积的输出，f^m是1D卷积的输出，f^m和

保留了区别运动结构。Three features are created after the pseudo 3D convolution

where f ^s is the output of a 2D convolution preserving appearance information, f ^m is the output of a 1D convolution, f ^m and

The distinguishing motion structure is retained.

3048，将上述卷积的输出作为下一个(即，随后)卷积模块的输入，直至遍历完上述至少一个卷积模块中的最后一个卷积模块。3048. Use the output of the above convolution as the input of the next (that is, subsequent) convolution module until the last convolution module in the at least one convolution module is traversed.

306，基于上述至少一个卷积模块中的最后一个卷积模块的输出，识别上述视频片段中包括的至少一个动作。306. Based on the output of the last convolution module in the at least one convolution module, identify at least one action included in the video segment.

在一个实施例中，上述卷积模块的输出通过以下方式获得：通过在通道维度中级联上述运动相关特征、上述残差特征以及上述外观相关特征，获取级联特征；将上述级联特征确定为上述卷积模块的输出。In one embodiment, the output of the above-mentioned convolution module is obtained in the following manner: by cascading the above-mentioned motion-related features, the above-mentioned residual features and the above-mentioned appearance-related features in the channel dimension, the cascaded features are obtained; the above-mentioned cascaded features are determined is the output of the above convolution module.

为了促进外观特征和运动特征的有效融合，伪3D卷积的输出特征可以在通道维度中被级联以获得级联特征，该级联特征可以被定义为：To facilitate the effective fusion of appearance features and motion features, the output features of pseudo-3D convolutions can be concatenated in the channel dimension to obtain cascaded features, which can be defined as:

其中，符号

代表级联。Among them, the symbol

Represents cascading.

在一个实施例中，上述卷积模块的输出通过以下方式获得：通过在通道维度级联上述运动相关特征、上述残差特征以及上述外观相关特征，获取级联特征；基于上述级联特征，获取通道注意力掩膜；以及基于上述通道注意力掩膜和上述级联特征，获取注意力特征作为上述卷积模块的上述输出。In one embodiment, the output of the above-mentioned convolution module is obtained in the following manner: by cascading the above-mentioned motion-related features, the above-mentioned residual features and the above-mentioned appearance-related features in the channel dimension, the cascaded features are obtained; based on the above-mentioned cascaded features, the obtained a channel attention mask; and based on the above channel attention mask and the above cascaded features, obtaining attention features as the above output of the above convolution module.

为了促进外观相关特征和运动相关特征的有效融合，可以进一步应用通道自注意力机制来重新校准输出特征。具体地，输出特征可以在通道维度被级联以获得级联特征，该级联特征可以被定义为：To facilitate the effective fusion of appearance-related features and motion-related features, a channel self-attention mechanism can be further applied to recalibrate the output features. Specifically, the output features can be concatenated in the channel dimension to obtain cascaded features, which can be defined as:

其中，符号

代表级联。由于f^m∈R^{T’×H’×W’×C’}，f^s∈R^{T’×H’×W’×C’}，

则级联后，f∈R^{T’×H’×W’×3C’}。Among them, the symbol

Represents cascading. Since f ^m ∈ ^{RT'×H'×W'×C'} , f ^s ∈ ^{RT'×H'×W'×C'} ,

After cascading, f∈RT ^{'×H'×W'×3C'} .

然后，基于上述级联特征可获得通道注意力掩模M_att。在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括全连接层，可以按照如下方式基于所述级联特征，获取上述所述通道注意力掩膜。针对上述级联特征执行全局池化，获取池化后的级联特征，该池化后的级联特征可表示为pool(f)。将上述池化后的级联特征乘以上述全连接层的权重矩阵，获取加权级联特征，该加权级联特征可表示为Wpool(f)。将上述加权级联特征与偏置相加，获取偏置级联特征，该偏置级联特征可表示为Wpool(f)+b。使用Sigmoid函数处理上述偏置级联特征，获取上述通道注意力掩膜，上述通道注意力掩膜可表示为σ(Wpool(f)+b)。因此，上述通道注意力掩膜M_att可以表示为：Then, the channel attention mask M _att can be obtained based on the above cascaded features. In one embodiment, each convolution module in the at least one convolution module further includes a fully connected layer, and the channel attention mask described above can be obtained based on the cascaded features in the following manner. Global pooling is performed on the above cascaded features to obtain the pooled cascaded features, which can be expressed as pool(f). Multiply the pooled cascaded features by the weight matrix of the fully connected layer to obtain weighted cascaded features, which can be expressed as Wpool(f). Add the above weighted cascaded feature and bias to obtain the biased cascaded feature, which can be expressed as Wpool(f)+b. Use the Sigmoid function to process the above-mentioned bias cascade feature to obtain the above-mentioned channel attention mask, which can be expressed as σ(Wpool(f)+b). Therefore, the above channel attention mask M _att can be expressed as:

M_att＝σ(Wpool(f)+b)，M _att =σ(Wpool(f)+b),

其中，

代表由单层神经网络(即上述卷积模块的全连接层)参数化的权重矩阵，

代表偏差项，pool是跨空间和时间对级联特征f进行平均的全局池化操作，σ代表sigmoid函数。通过通道注意力掩膜，实现了动态特性，该动态特征以输入特征和基于输入特征对最终任务的重要性的重新加权通道为条件。in,

Represents the weight matrix parameterized by a single-layer neural network (i.e., the fully connected layer of the above convolutional module),

Represents the bias term, pool is a global pooling operation that averages the cascaded feature f across space and time, and σ represents the sigmoid function. Through channel attention masks, dynamics are achieved that are conditioned on input features and reweighted channels based on the importance of the input features to the final task.

在获得上述通道注意力掩膜之后，可以基于上述通道注意力掩膜和上述级联特征进一步获得注意力特征。在一个实施例中，通过执行上述通道注意力掩膜M_att与上述级联特征f之间的逐通道乘法，获取注意力特征。在另一个实施例中，为了进一步提高鲁棒性，通过以下方式获得上述注意力特征：通过执行上述通道注意力掩膜M_att与上述级联特征f之间的逐通道乘法，获取中间特征；将上述中间特征与上述级联特征相加，获取上述注意力特征，上述注意力特征可定义为：After obtaining the above channel attention mask, attention features can be further obtained based on the above channel attention mask and the above cascaded features. In one embodiment, the attention feature is obtained by performing channel-by-channel multiplication between the above-mentioned channel attention mask M _att and the above-mentioned cascaded feature f. In another embodiment, in order to further improve the robustness, the above-mentioned attention feature is obtained in the following manner: by performing channel-by-channel multiplication between the above-mentioned channel attention mask M _att and the above-mentioned cascade feature f to obtain intermediate features; Add the above-mentioned intermediate features and the above-mentioned cascade features to obtain the above-mentioned attention features, and the above-mentioned attention features can be defined as:

f_att＝f⊙M_att+ff _att ＝f⊙M _att +f

其中，符号“⊙”表示逐通道乘法。因此，在所提出的卷积模块中可以实现残差连接。Among them, the symbol "⊙" means channel-by-channel multiplication. Therefore, residual connections can be implemented in the proposed convolutional module.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层之前的第二卷积层，上述第二卷积层包括尺寸为1×1×1的三维滤波器，遍历上述至少一个卷积模块还包括：使用上述一维滤波器处理上述输入以获取运动相关特征，以及使用上述二维滤波器处理上述输入以获取外观相关特征之前，使用上述三维滤波器处理上述输入以降低上述输入的维度。使用一维滤波器处理上述输入以获取运动相关特征，以及使用上述二维滤波器处理上述输入以获取外观相关特征包括：使用上述一维滤波器处理降维后的上述输入以获取上述运动相关特征，使用上述二维滤波器处理降维后的上述输入以获取上述外观相关特征。In one embodiment, each convolution module of the at least one convolution module further includes a second convolution layer before the at least one first convolution layer, and the second convolution layer includes ×1 three-dimensional filter, traversing the at least one convolution module further includes: using the above-mentioned one-dimensional filter to process the above-mentioned input to obtain motion-related features, and before using the above-mentioned two-dimensional filter to process the above-mentioned input to obtain appearance-related features, use The above-mentioned three-dimensional filter processes the above-mentioned input to reduce the dimensionality of the above-mentioned input. Processing the above-mentioned input with a one-dimensional filter to obtain motion-related features, and processing the above-mentioned input with the above-mentioned two-dimensional filter to obtain appearance-related features include: processing the above-mentioned input after dimension reduction with the above-mentioned one-dimensional filter to obtain the above-mentioned motion-related features , using the above-mentioned two-dimensional filter to process the above-mentioned input after dimensionality reduction to obtain the above-mentioned appearance-related features.

具体地，在被下一个卷积模块的1D滤波器和2D滤波器处理之前，前一个卷积模块的输出可以先被尺寸为1×1×1的3D滤波器处理，从而可以减少前一个卷积模块的输出的通道的数量，从而降低前一个卷积模块输出(即下一个卷积模块的输入)的维度。3D滤波器可包括至少一个3D滤波器。3D滤光器的数量可以基于期望的维度来确定。例如，为了降低维度，3D滤波器的数量少于前一个卷积模块的输出的通道的数量；为了恢复维度，3D滤波器的数目等于前一个卷积模块的输出的通道的数量。Specifically, before being processed by the 1D filter and 2D filter of the next convolution module, the output of the previous convolution module can be processed by a 3D filter with a size of 1 × 1 × 1, so that the previous convolution can be reduced. The number of channels of the output of the convolution module, thereby reducing the dimensionality of the output of the previous convolution module (ie, the input of the next convolution module). The 3D filter may include at least one 3D filter. The number of 3D filters can be determined based on the desired dimensions. For example, in order to reduce the dimension, the number of 3D filters is less than the number of channels of the output of the previous convolution module; in order to restore the dimension, the number of 3D filters is equal to the number of channels of the output of the previous convolution module.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层和上述第二卷积层之后的第三卷积层，上述第三卷积层包括尺寸为1×1×1的三维滤波器，获取上述卷积模块的输出还包括：使用上述三维滤波器处理上述级联特征以增加上述级联特征的维度；以及将升维后的上述级联特征作为上述卷积模块的输出。In one embodiment, each convolution module in the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, and the third convolution The layer includes a three-dimensional filter with a size of 1×1×1, and obtaining the output of the above-mentioned convolution module also includes: using the above-mentioned three-dimensional filter to process the above-mentioned cascaded features to increase the dimension of the above-mentioned cascaded features; The concatenated features serve as the output of the above convolutional modules.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层和上述第二卷积层之后的第三卷积层，上述第三卷积层包括尺寸为1×1×1的三维滤波器，获取上述卷积模块的上述输出还包括：用上述三维滤波器处理上述注意力特征以增加上述注意力特征的维度；以及使用升维后的上述注意力特征作为上述卷积模块的输出。In one embodiment, each convolution module in the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, and the third convolution The layer includes a three-dimensional filter with a size of 1×1×1, and obtaining the above-mentioned output of the above-mentioned convolution module also includes: processing the above-mentioned attention feature with the above-mentioned three-dimensional filter to increase the dimension of the above-mentioned attention feature; The aforementioned attention features serve as the output of the aforementioned convolutional modules.

具体地，顶部1×1×1卷积和底部1×1×1卷积用于减少和恢复维度。Specifically, top 1×1×1 convolution and bottom 1×1×1 convolution are used to reduce and restore dimensionality.

进一步进行了多个实验，在实验中通过用所提出的卷积模块替换所有瓶颈块来开发ResNet-60的变体。Multiple experiments are further conducted, where a variant of ResNet-60 is developed by replacing all bottleneck blocks with the proposed convolutional modules.

例如，使用UCF101数据集评估本申请中提出的技术解决方案的性能，其中，UCF101是由包含101个动作类别的13,330个视频组成的数据集。针对所有实验，报告了split-1验证集的第一名(top-1)准确率和前五名(top-5)准确率。For example, the performance of the technical solutions proposed in this application is evaluated using the UCF101 dataset, where UCF101 is a dataset consisting of 13,330 videos containing 101 action categories. For all experiments, the top-1 (top-1) accuracy and top-5 (top-5) accuracy on the split-1 validation set are reported.

具体地，通过分别使用单独RGB帧、单独残差帧以及使用RGB帧和残差帧的组合输入来训练动作分类器，来进行对不同数据模态的有效性的评估。其次，研究了残差帧步长对动作识别的影响。最后，还进行了消融研究，以调查所提出的卷积模块中各个组件的有效性。Specifically, the evaluation of the effectiveness of different data modalities is performed by training an action classifier using RGB frames alone, residual frames alone, and a combined input of RGB frames and residual frames, respectively. Second, the effect of residual frame stride on action recognition is studied. Finally, an ablation study is also performed to investigate the effectiveness of individual components in the proposed convolutional module.

不同数据模态的性能比较。Performance comparison of different data modalities.

表1示出了输入模态和网络架构的各种组合的动作识别性能。对于仅使用RGB帧或残差帧的实验，仅保留数据层(即第一卷积层)中的一个流，但是为了公平比较，通道的数增加了一倍。从表1中可以看出，仅使用残差帧在top-1准确率和top-5准确率方面比仅使用RGB帧高出约3％，这表明残差帧确实包含对动作识别重要的显著的运动信息。当在不同的流中使用RGB帧和残差帧时，top-1准确率进一步提高了2.6％(从83.0％提高到85.6％)，这表明两种数据模态保留了互补的信息。值得注意的是，与在相关技术中使用标准3D卷积的情况相比，使用本申请中提供的卷积模块不仅显著地减少了每秒浮点计算(flops)(从163G增加到40G)，而且还提供了更好的性能(在top-1准确率方面，85.6％v.s.85.0％)。Table 1 shows the action recognition performance for various combinations of input modalities and network architectures. For experiments using only RGB frames or residual frames, only one stream in the data layer (i.e., the first convolutional layer) is kept, but the number of channels is doubled for fair comparison. From Table 1, it can be seen that using only residual frames is about 3% higher than using only RGB frames in terms of top-1 accuracy and top-5 accuracy, which shows that residual frames do contain significant features that are important for action recognition. sports information. When RGB frames and residual frames are used in different streams, the top-1 accuracy is further improved by 2.6% (from 83.0% to 85.6%), indicating that the two data modalities preserve complementary information. It is worth noting that, compared with the case of using standard 3D convolution in the related art, using the convolution module provided in this application not only significantly reduces floating-point calculations (flops) per second (increased from 163G to 40G), And it also provides better performance (85.6% vs. 85.0% in top-1 accuracy).

表1：不同的输入模态和网络架构对应的性能比较。Table 1: Performance comparison for different input modalities and network architectures.

步长s的影响。The effect of the step size s.

当产生残差帧时，可以改变步长s以捕获不同时间标度的运动特征。然而，尚不清楚动作识别任务的最佳步长是多少。因此，对步长的影响进行了研究，研究结果见表2。分别对三种设置进行了实验，其中，输入数据分别是步长s＝1、2、4的残差帧。从表2中可以看出，分类准确度随着步长的增加而降低。我们猜想，运动会导致两帧之间相同对象的空间位移，并且使用大的步长可能会导致运动表示之间的不匹配。When generating residual frames, the step size s can be varied to capture motion features at different time scales. However, it is unclear what the optimal step size is for action recognition tasks. Therefore, the effect of the step size was studied, and the results of the study are shown in Table 2. Experiments are carried out for three settings respectively, where the input data are residual frames with stride s=1, 2, 4, respectively. From Table 2, it can be seen that the classification accuracy decreases as the step size increases. We conjecture that motion causes the spatial displacement of the same object between two frames, and that using a large stride may cause a mismatch between motion representations.

表2：不同残差帧步长s对应的性能比较。Table 2: Performance comparison corresponding to different residual frame steps s.

步长step size Val top-1Val top-1 Val top-5Val top-5 s＝1s=1 83.0％83.0% 98.3％98.3% s＝2s=2 82.7％82.7% 97.1％97.1% s＝4s=4 80.2％80.2% 96.0％96.0%

消融研究。Ablation studies.

为了验证提出的卷积模块中不同组件的有效性，进行了消融研究。在不损失一般性的情况下，用RGB帧和残差帧(s＝1)的组合输入来训练模型。表3示出了各种卷积模块设置对应的性能比较。如表3所示，去除自注意力机制导致top-1准确率下降1.8％(从85.6％下降至83.8％)。同时，当忽略特征空间中的残差信息时，性能也从85.6％下降到83.5％。如果同时取消自注意力机制和残差特征，top-1准确率将进一步降低到82.1％。这些结果证实了通道自注意力机制和残差特征对提高动作识别性能是有效的。To verify the effectiveness of different components in the proposed convolutional module, an ablation study is performed. Without loss of generality, the model is trained with a combined input of RGB frames and residual frames (s = 1). Table 3 shows the performance comparison corresponding to various convolution module settings. As shown in Table 3, removing the self-attention mechanism leads to a 1.8% drop in top-1 accuracy (from 85.6% to 83.8%). Meanwhile, the performance also drops from 85.6% to 83.5% when ignoring the residual information in the feature space. If both the self-attention mechanism and the residual feature are removed, the top-1 accuracy will be further reduced to 82.1%. These results confirm that the channel self-attention mechanism and residual features are effective for improving action recognition performance.

表3：不同卷积模块设置对应的性能比较Table 3: Performance comparison for different convolution module settings

方法method Val top-1Val top-1 Val top-5Val top-5 不具有自注意力和残差特征的卷积模块Convolutional modules without self-attention and residual features 82.1％82.1% 97.9％97.9% 不具有自注意力的卷积模块Convolutional modules without self-attention 83.8％83.8% 98.4％98.4% 不具有残差特征的卷积模块Convolutional modules without residual features 83.5％83.5% 98.1％98.1% 卷积模块Convolution module 85.6％85.6% 99.2％99.2%

图4是根据实施例的基于深度残差网络的动作识别方法的示意性流程图。基于深度残差网络的动作识别方法应用于深度残差网络系统，该深度残差网络系统包括至少一个卷积模块，上述至少一个卷积模块中的每个卷积模块包括至少一个第一卷积层，上述至少一个第一卷积层中的每个第一卷积层具有至少一个一维滤波器和至少一个二维滤波器。如图4所示，该方法包括以下内容。Fig. 4 is a schematic flowchart of an action recognition method based on a deep residual network according to an embodiment. The action recognition method based on the deep residual network is applied to the deep residual network system, the deep residual network system includes at least one convolution module, and each convolution module in the at least one convolution module includes at least one first convolution layers, each of the at least one first convolutional layer has at least one one-dimensional filter and at least one two-dimensional filter. As shown in Figure 4, the method includes the following contents.

402，在上述至少一个卷积模块中的第一个卷积模块接收视频片段作为输入。402. A first convolution module in the at least one convolution module receives a video segment as an input.

404，通过执行以下操作遍历上述至少一个卷积模块。404. Traverse the at least one convolution module by performing the following operations.

40502，使用三维滤波器处理上述输入以降低上述输入的维度。40502, Use a three-dimensional filter to process the above input to reduce the dimension of the above input.

40504，使用上述一维滤波器处理降维后的上述输入以获取运动相关特征，使用上述二维滤波器处理降维后的上述输入以获取外观相关特征。40504. Use the above-mentioned one-dimensional filter to process the above-mentioned input after dimensionality reduction to obtain motion-related features, and use the above-mentioned two-dimensional filter to process the above-mentioned input after dimensionality reduction to obtain appearance-related features.

40506，将上述运动相关特征沿时间维度移位一个步长，并从上述运动相关特征中减去上述移位后的运动相关特征以获取残差特征。40506. Shift the motion-related features by one step along the time dimension, and subtract the shifted motion-related features from the motion-related features to obtain residual features.

40508，通过在通道维度级联上述运动相关特征、上述残差特征以及上述外观相关特征，获取级联特征。40508. Obtain cascaded features by cascading the above-mentioned motion-related features, the above-mentioned residual features, and the above-mentioned appearance-related features in the channel dimension.

40410，针对上述级联特征执行全局池化，获取池化后的级联特征。40410. Perform global pooling for the above cascaded features, and obtain pooled cascaded features.

40412，将上述池化后的级联特征乘以全连接层的权重矩阵，获取加权级联特征。40412. Multiply the pooled cascaded features by the weight matrix of the fully connected layer to obtain weighted cascaded features.

40414，将上述加权级联特征与偏置相加，获取偏置级联特征。40414. Add the above weighted cascaded features to the bias to obtain the biased cascaded features.

40416，使用Sigmoid函数处理上述偏置级联特征，获取通道注意力掩膜。40416, use the Sigmoid function to process the above bias cascade features to obtain the channel attention mask.

40418，通过执行上述通道注意力掩膜与上述级联特征之间的逐通道乘法，获取中间特征。40418, Obtain intermediate features by performing channel-by-channel multiplication between the above channel attention mask and the above cascaded features.

40430，将上述中间特征与上述级联特征相加，获取注意力特征。40430. Add the above-mentioned intermediate features and the above-mentioned cascade features to obtain attention features.

40422，使用上述三维滤波器处理上述注意力特征以增加上述注意力特征的维度。40422. Use the above-mentioned three-dimensional filter to process the above-mentioned attention feature to increase the dimension of the above-mentioned attention feature.

40424，将升维后的上述注意力特征作为上述卷积模块的输出。40424, use the above-mentioned attention features after dimensionality enhancement as the output of the above-mentioned convolution module.

40426，将上述卷积的输出作为下一个卷积模块的输入，直至遍历完上述至少一个卷积模块中的最后一个卷积模块。40426. Use the output of the above convolution as the input of the next convolution module, until the last convolution module in the above at least one convolution module is traversed.

406，基于上述至少一个卷积模块中的最后一个卷积模块的输出，识别上述视频片段中包括的至少一个动作。406. Based on the output of the last convolution module in the at least one convolution module, identify at least one action included in the video segment.

图5示出了所提出的卷积模块的示例性详细设计。所提出的卷积模块可以集成到任何标准的CNN架构中，例如ResNet。为了同时处理RGB帧和残差帧，原始数据层(即第一卷积层)被修改成具有并行构建块(building block)的两个流，这两个构建块分别输出外观相关特征和运动相关特征，每个构建块针对每个模态。来自两个流的结果特征被级联并传递到下一层(即，后续的卷积模块)。在所提出的卷积模块的示例性设计中，以尺寸为3×3的2D滤波器和尺寸为1×1的1D滤波器为例进行说明。具体地，滤波器的尺寸可以表示为T×H×W，其中，T表示时间维度，H表示空间维度中的高度，W表示空间维度中的宽度。Fig. 5 shows an exemplary detailed design of the proposed convolution module. The proposed convolutional module can be integrated into any standard CNN architecture, such as ResNet. To process RGB frames and residual frames simultaneously, the raw data layer (i.e., the first convolutional layer) is modified into two streams with parallel building blocks that output appearance-related features and motion-related features, respectively. traits, each building block for each modal. The resulting features from the two streams are concatenated and passed to the next layer (i.e., subsequent convolutional modules). In the exemplary design of the proposed convolutional module, a 2D filter of size 3×3 and a 1D filter of size 1×1 are taken as examples. Specifically, the size of the filter can be expressed as T×H×W, where T represents the time dimension, H represents the height in the space dimension, and W represents the width in the space dimension.

表4示出了深度残差网络系统的示例性详细设计。如表4所示，深度残差网络系统包括四个卷积模块，分别表示为res 2、res 3、res 4和res 5。在深度残差网络系统的示例性设计中，滤波器的尺寸可以由{T×S²,C}表示，以表示时间、空间和通道尺寸，具体地，在表4中，以尺寸为3×3的2D滤波器和尺寸为1×1的1D滤波器为例进行说明。Table 4 shows an exemplary detailed design of the deep residual network system. As shown in Table 4, the deep residual network system includes four convolution modules, denoted as res 2, res 3, res 4 and res 5. In an exemplary design of a deep residual network system, the size of the filter can be represented by {T×S ² ,C} to represent the time, space and channel dimensions, specifically, in Table 4, the size is 3× The 2D filter of 3 and the 1D filter of size 1×1 are taken as examples for illustration.

表4：深度残差网络系统的示例性详细设计。Table 4: Exemplary detailed design of the deep residual network system.

以上操作可以参考基于深度残差网络的动作识别方法中的网络训练操作的详细描述，这里将不描述。The above operations can refer to the detailed description of the network training operation in the action recognition method based on the deep residual network, which will not be described here.

上述主要从方法侧执行过程的角度对本发明实施例的方案进行了介绍。可以理解的是，为了实现上述功能，电子设备包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，本发明能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用使用不同方法来实现所描述的功能，但是这种实现不应认为超出本发明的范围。The foregoing mainly introduces the solutions of the embodiments of the present invention from the perspective of executing the process on the method side. It can be understood that, in order to realize the above functions, the electronic device includes hardware structures and/or software modules corresponding to each function. Those skilled in the art should easily realize that the present invention can be realized in the form of hardware or a combination of hardware and computer software in combination with the units and algorithm steps of each example described in the embodiments disclosed herein. Whether a certain function is executed by hardware or computer software drives hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present invention.

本发明实施例可以根据上述方法示例对第一无线耳机进行功能单元的划分，例如，可以对应各个功能划分各个功能单元，也可以将两个或两个以上的功能集成在一个处理单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。需要说明的是，本发明实施例中对单元的划分是示意性的，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式。In the embodiment of the present invention, the functional units of the first wireless earphone can be divided according to the above method example. For example, each functional unit can be divided corresponding to each function, or two or more functions can be integrated into one processing unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units. It should be noted that the division of the units in the embodiment of the present invention is schematic, and is only a logical function division, and there may be another division manner in actual implementation.

图6是根据实施例的基于深度残差网络的动作识别装置的示意性结构图。该装置应用于深度残差网络系统，该深度残差网络系统包括至少一个卷积模块，上述至少一个卷积模块中的每个卷积模块包括至少一个第一卷积层，上述至少一个第一卷积层中的每个第一卷积层具有至少一个一维滤波器和至少一个二维滤波器。如图6所示，上述基于深度残差网络的动作识别装置包括接收单元602、处理单元604以及识别单元606。Fig. 6 is a schematic structural diagram of an action recognition device based on a deep residual network according to an embodiment. The device is applied to a deep residual network system, the deep residual network system includes at least one convolution module, each convolution module in the at least one convolution module includes at least one first convolution layer, and the at least one first Each first of the convolutional layers has at least one one-dimensional filter and at least one two-dimensional filter. As shown in FIG. 6 , the above-mentioned action recognition device based on deep residual network includes a receiving unit 602 , a processing unit 604 and a recognition unit 606 .

该接收单元602，用于在上述至少一个卷积模块中的第一个卷积模块接收视频片段作为输入。The receiving unit 602 is configured to receive a video segment as an input in the first convolution module of the at least one convolution module.

该处理单元604，用于通过执行以下操作遍历上述至少一个卷积模块：使用一维滤波器处理上述输入以获取运动相关特征，以及使用二维滤波器处理上述输入以获取外观相关特征；将上述运动相关特征沿时间维度移位一个步长，从上述运动相关特征中减去上述移位后的运动相关特征以获取残差特征；基于上述外观相关特征、上述运动相关特征以及上述残差特征，获取上述卷积模块的输出；将上述卷积的输出作为下一个卷积模块的输入，直至遍历完上述至少一个卷积模块中的最后一个卷积模块。The processing unit 604 is configured to traverse the at least one convolution module by performing the following operations: process the input with a one-dimensional filter to obtain motion-related features, and process the input with a two-dimensional filter to obtain appearance-related features; The motion-related features are shifted by a step along the time dimension, and the above-mentioned shifted motion-related features are subtracted from the above-mentioned motion-related features to obtain residual features; based on the above-mentioned appearance-related features, the above-mentioned motion-related features, and the above-mentioned residual features, Obtain the output of the above convolution module; use the output of the above convolution as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed.

该识别单元606，基于上述至少一个卷积模块中的最后一个卷积模块的输出，识别上述视频片段中包括的至少一个动作。The recognition unit 606 recognizes at least one action included in the video segment based on the output of the last convolution module in the at least one convolution module.

在一个实施例中，在获取上述卷积模块的上述输出方面，上述处理单元604具体用于：通过在通道维度级联上述运动相关特征、上述残差特征以及上述外观相关特征，获取级联特征；以及将上述级联特征确定为上述卷积模块的上述输出。In one embodiment, in terms of obtaining the above-mentioned output of the above-mentioned convolution module, the above-mentioned processing unit 604 is specifically configured to: obtain the concatenated features by concatenating the above-mentioned motion-related features, the above-mentioned residual features, and the above-mentioned appearance-related features in the channel dimension ; and determining the above-mentioned concatenated features as the above-mentioned output of the above-mentioned convolution module.

在一个实施例中，在获取上述卷积模块的上述输出方面，上述处理单元604具体用于：通过在通道维度级联上述运动相关特征、上述残差特征以及上述外观相关特征，获取级联特征；基于上述级联特征，获取通道注意力掩膜；以及基于上述通道注意力掩膜和上述级联特征，获取注意力特征作为上述卷积模块的上述输出。In one embodiment, in terms of obtaining the above-mentioned output of the above-mentioned convolution module, the above-mentioned processing unit 604 is specifically configured to: obtain the concatenated features by concatenating the above-mentioned motion-related features, the above-mentioned residual features, and the above-mentioned appearance-related features in the channel dimension ; Obtaining a channel attention mask based on the above-mentioned cascade features; and obtaining an attention feature as the above-mentioned output of the above-mentioned convolution module based on the above-mentioned channel attention mask and the above-mentioned cascade features.

在一个实施例中，在基于上述通道注意力掩膜和上述级联特征，获取上述注意力特征方面，上述处理单元604具体用于：通过执行上述通道注意力掩膜与上述级联特征之间的逐通道乘法，获取上述注意力特征。In one embodiment, in terms of obtaining the above-mentioned attention feature based on the above-mentioned channel attention mask and the above-mentioned cascade feature, the above-mentioned processing unit 604 is specifically configured to: perform The channel-wise multiplication of , to obtain the above attention features.

在一个实施例中，在基于上述通道注意力掩膜和上述级联特征，获取注意力特征方面，上述处理单元604具体用于：通过执行上述通道注意力掩膜与上述级联特征之间的逐通道乘法，获取中间特征；以及将上述中间特征与上述级联特征相加，获取上述注意力特征。In one embodiment, in terms of obtaining attention features based on the above-mentioned channel attention mask and the above-mentioned cascaded features, the above-mentioned processing unit 604 is specifically configured to: channel-by-channel multiplication to obtain intermediate features; and adding the above-mentioned intermediate features to the above-mentioned cascaded features to obtain the above-mentioned attention features.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括全连接层，在基于上述级联特征，获取上述通道注意力掩膜方面，上述处理单元604用于：针对上述级联特征执行全局池化，获取池化后的级联特征；将上述池化后的级联特征乘以上述全连接层的权重矩阵，获取加权级联特征；将上述加权级联特征与偏置相加，获取偏置级联特征；以及使用Sigmoid函数处理上述偏置级联特征，获取上述通道注意力掩膜。In one embodiment, each convolution module in the above at least one convolution module further includes a fully connected layer, and in terms of obtaining the above channel attention mask based on the above cascaded features, the above processing unit 604 is configured to: for the above The cascaded features perform global pooling to obtain the pooled cascaded features; multiply the above pooled cascaded features by the weight matrix of the above fully connected layer to obtain weighted cascaded features; combine the above weighted cascaded features with partial The bias cascade feature is obtained by adding the offsets; and the above bias cascade feature is processed using the Sigmoid function to obtain the above-mentioned channel attention mask.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层之前的第二卷积层，上述第二卷积层包括尺寸为1×1×1的三维滤波器，在遍历上述至少一个卷积模块方面，上述处理单元604还用于：在使用上述一维滤波器处理上述输入以获取运动相关特征，以及使用上述二维滤波器处理上述输入以获取外观相关特征之前，使用上述三维滤波器处理上述输入以降低上述输入的维度；其中，在使用一维滤波器处理上述输入以获取运动相关特征，以及使用上述二维滤波器处理上述输入以获取外观相关特征方面，上述处理单元604具体用于：使用上述一维滤波器处理降维后的上述输入以获取上述运动相关特征，使用上述二维滤波器处理降维后的上述输入以获取上述外观相关特征。In one embodiment, each convolution module of the at least one convolution module further includes a second convolution layer before the at least one first convolution layer, and the second convolution layer includes ×1 three-dimensional filter, in traversing the at least one convolution module, the processing unit 604 is further configured to: use the one-dimensional filter to process the input to obtain motion-related features, and use the two-dimensional filter to process the above-mentioned Before inputting to obtain appearance-related features, the above-mentioned input is processed by using the above-mentioned three-dimensional filter to reduce the dimension of the above-mentioned input; wherein, the above-mentioned input is processed by using a one-dimensional filter to obtain motion-related features, and the above-mentioned input is processed by using the above-mentioned two-dimensional filter In terms of obtaining appearance-related features, the processing unit 604 is specifically configured to: use the above-mentioned one-dimensional filter to process the above-mentioned input after dimensionality reduction to obtain the above-mentioned motion-related features, and use the above-mentioned two-dimensional filter to process the above-mentioned input after dimensionality reduction to obtain The aforementioned appearance-related features.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层和上述第二卷积层之后的第三卷积层，上述第三卷积层包括尺寸为1×1×1的三维滤波器，在获取上述卷积模块的上述输出方面，上述处理单元604还用于：使用上述三维滤波器处理上述级联特征以增加上述级联特征的维度；以及将升维后的上述级联特征作为上述卷积模块的输出。In one embodiment, each convolution module in the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, and the third convolution The layer includes a three-dimensional filter with a size of 1×1×1. In terms of obtaining the above-mentioned output of the above-mentioned convolution module, the above-mentioned processing unit 604 is also used to: use the above-mentioned three-dimensional filter to process the above-mentioned concatenated features to increase the performance of the above-mentioned concatenated features. Dimensions; and the above-mentioned cascaded features after dimension-up are used as the output of the above-mentioned convolution module.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层和上述第二卷积层之后的第三卷积层，上述第三卷积层包括尺寸为1×1×1的三维滤波器，在获取上述卷积模块的上述输出方面，上述处理单元604还用于：用上述三维滤波器处理上述注意力特征以增加上述注意力特征的维度；以及使用升维后的上述注意力特征作为上述卷积模块的输出。In one embodiment, each convolution module in the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, and the third convolution The layer includes a three-dimensional filter with a size of 1×1×1. In terms of obtaining the above-mentioned output of the above-mentioned convolution module, the above-mentioned processing unit 604 is also used to: process the above-mentioned attention feature with the above-mentioned three-dimensional filter to increase the performance of the above-mentioned attention feature Dimensions; and using the above-mentioned attention features after dimensionality enhancement as the output of the above-mentioned convolution module.

图7是根据实施例的终端设备的示意性结构图。如图7所示，终端设备700包括处理器701、存储器702、通信接口703以及存储在上述存储器702中并由上述处理器701执行的一个或多个程序704。上述一个或多个程序704包括用于执行深度残差网络(ResNet)系统的指令。上述深度残差网络系统包括至少一个卷积模块，上述至少一个卷积模块中的每一个卷积模块包括至少一个第一卷积层，上述至少一个第一卷积层中的每一个第一卷积层具有至少一个一维(1D)滤波器和至少一个二维(2D)滤波器。上述一个或多个程序704包括用于执行以下操作的指令。Fig. 7 is a schematic structural diagram of a terminal device according to an embodiment. As shown in FIG. 7 , a terminal device 700 includes a processor 701 , a memory 702 , a communication interface 703 , and one or more programs 704 stored in the memory 702 and executed by the processor 701 . The one or more programs 704 described above include instructions for implementing a deep residual network (ResNet) system. The above-mentioned deep residual network system includes at least one convolution module, each convolution module in the at least one convolution module includes at least one first convolution layer, and each first volume in the at least one first convolution layer The stack has at least one one-dimensional (1D) filter and at least one two-dimensional (2D) filter. The one or more programs 704 described above include instructions for performing the following operations.

在上述至少一个卷积模块中的第一个卷积模块接收视频片段作为输入。通过执行以下操作遍历上述至少一个卷积模块：使用一维滤波器处理上述输入以获取运动相关特征，以及使用二维滤波器处理上述输入以获取外观相关特征；将上述运动相关特征沿时间维度移位一个步长，从上述运动相关特征中减去上述移位后的运动相关特征以获取残差特征；基于上述外观相关特征、上述运动相关特征以及上述残差特征，获取上述卷积模块的输出；将上述卷积的输出作为下一个卷积模块的输入，直至遍历完上述至少一个卷积模块中的最后一个卷积模块。基于上述至少一个卷积模块中的最后一个卷积模块的输出，识别上述视频片段中包括的至少一个动作。A first convolution module in the at least one convolution module receives a video segment as input. Traversing the at least one convolution module above by performing the following operations: processing the above-mentioned input with a one-dimensional filter to obtain motion-related features, and processing the above-mentioned input with a two-dimensional filter to obtain appearance-related features; shifting the above-mentioned motion-related features along the time dimension One step, subtract the above-mentioned shifted motion-related features from the above-mentioned motion-related features to obtain the residual feature; based on the above-mentioned appearance-related features, the above-mentioned motion-related features and the above-mentioned residual features, obtain the output of the above-mentioned convolution module ; Using the output of the above convolution as the input of the next convolution module until the last convolution module in the at least one convolution module is traversed. At least one action included in the video segment is identified based on an output of a last convolutional module of the at least one convolutional module.

在一个实施例中，在获取上述卷积模块的上述输出方面，上述一个或多个程序704包括用于执行以下操作的指令。通过在通道维度级联上述运动相关特征、上述残差特征以及上述外观相关特征，获取级联特征；以及将上述级联特征确定为上述卷积模块的上述输出。In one embodiment, in terms of obtaining the above-mentioned output of the above-mentioned convolution module, the above-mentioned one or more programs 704 include instructions for performing the following operations. Obtaining the concatenated features by concatenating the motion-related features, the residual features and the appearance-related features in the channel dimension; and determining the concatenated features as the output of the convolution module.

在一个实施例中，在获取上述卷积模块的上述输出方面，上述一个或多个程序704包括用于执行以下操作的指令。通过在通道维度级联上述运动相关特征、上述残差特征以及上述外观相关特征，获取级联特征；基于上述级联特征，获取通道注意力掩膜；以及基于上述通道注意力掩膜和上述级联特征，获取注意力特征作为上述卷积模块的上述输出。In one embodiment, in terms of obtaining the above-mentioned output of the above-mentioned convolution module, the above-mentioned one or more programs 704 include instructions for performing the following operations. By cascading the above-mentioned motion-related features, the above-mentioned residual features and the above-mentioned appearance-related features in the channel dimension, the cascaded features are obtained; based on the above-mentioned cascaded features, the channel attention mask is obtained; and based on the above-mentioned channel attention mask and the above-mentioned levels Concatenated features, the attention features are obtained as the above output of the above convolution module.

在一个实施例中，在基于上述通道注意力掩膜和上述级联特征，获取上述注意力特征方面，上述一个或多个程序704包括用于执行以下操作的指令。通过执行上述通道注意力掩膜与上述级联特征之间的逐通道乘法，获取上述注意力特征。In one embodiment, in terms of obtaining the above attention feature based on the above channel attention mask and the above cascade feature, the one or more programs 704 include instructions for performing the following operations. The aforementioned attention features are obtained by performing a channel-by-channel multiplication between the aforementioned channel attention mask and the aforementioned cascaded features.

在一个实施例中，在基于上述通道注意力掩膜和上述级联特征，获取注意力特征方面，上述一个或多个程序704包括用于执行以下操作的指令。通过执行上述通道注意力掩膜与上述级联特征之间的逐通道乘法，获取中间特征；以及将上述中间特征与上述级联特征相加，获取上述注意力特征。In one embodiment, in terms of obtaining attention features based on the above-mentioned channel attention mask and the above-mentioned cascaded features, the above-mentioned one or more programs 704 include instructions for performing the following operations. The intermediate features are obtained by performing channel-by-channel multiplication between the above-mentioned channel attention mask and the above-mentioned concatenated features; and the above-mentioned attention features are obtained by adding the above-mentioned intermediate features and the above-mentioned concatenated features.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括全连接层，在基于上述级联特征，获取上述通道注意力掩膜方面，上述一个或多个程序704包括用于执行以下操作的指令。针对上述级联特征执行全局池化，获取池化后的级联特征；将上述池化后的级联特征乘以上述全连接层的权重矩阵，获取加权级联特征；将上述加权级联特征与偏置相加，获取偏置级联特征；以及使用Sigmoid函数处理上述偏置级联特征，获取上述通道注意力掩膜。In one embodiment, each convolution module in the above at least one convolution module further includes a fully connected layer, and in terms of obtaining the above channel attention mask based on the above cascaded features, the above one or more programs 704 include using Instructions for performing the following operations. Perform global pooling for the above cascaded features to obtain pooled cascaded features; multiply the above pooled cascaded features by the weight matrix of the above fully connected layer to obtain weighted cascaded features; combine the above weighted cascaded features Adding the bias to obtain the bias cascade feature; and using the Sigmoid function to process the above bias cascade feature to obtain the above channel attention mask.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块位于上述至少一个第一卷积层之前的第二卷积层，上述第二卷积层包括尺寸为1×1×1的三维滤波器，在遍历上述至少一个卷积模块方面，上述一个或多个程序704还包括用于执行以下操作的指令。使用上述一维滤波器处理上述输入以获取运动相关特征，以及使用上述二维滤波器处理上述输入以获取外观相关特征之前：使用上述三维滤波器处理上述输入以降低上述输入的维度；其中，在使用一维滤波器处理上述输入以获取运动相关特征，以及使用上述二维滤波器处理上述输入以获取外观相关特征方面，上述一个或多个程序704包括用于执行以下操作的指令。使用上述一维滤波器处理降维后的上述输入以获取上述运动相关特征，使用上述二维滤波器处理降维后的上述输入以获取上述外观相关特征。In one embodiment, each convolution module in the at least one convolution module is located in the second convolution layer before the at least one first convolution layer, and the second convolution layer includes a size of 1×1×1 In terms of traversing the at least one convolution module, the one or more programs 704 further include instructions for performing the following operations. Before using the above-mentioned one-dimensional filter to process the above-mentioned input to obtain motion-related features, and using the above-mentioned two-dimensional filter to process the above-mentioned input to obtain appearance-related features: use the above-mentioned three-dimensional filter to process the above-mentioned input to reduce the dimension of the above-mentioned input; wherein, in In terms of processing the input using a one-dimensional filter to obtain motion-related features, and using the two-dimensional filter to obtain appearance-related features, the one or more programs 704 include instructions for performing the following operations. Using the above-mentioned one-dimensional filter to process the above-mentioned input after dimension reduction to obtain the above-mentioned motion-related features, and using the above-mentioned two-dimensional filter to process the above-mentioned input after dimension-reduction to obtain the above-mentioned appearance-related features.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层和上述第二卷积层之后的第三卷积层，上述第三卷积层包括尺寸为1×1×1的三维滤波器，在获取上述卷积模块的上述输出方面，上述一个或多个程序704还包括用于执行以下操作的指令。使用上述三维滤波器处理上述级联特征以增加上述级联特征的维度；以及将升维后的上述级联特征作为上述卷积模块的输出。In one embodiment, each convolution module in the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, and the third convolution A layer includes a three-dimensional filter with a size of 1×1×1, and in terms of obtaining the above-mentioned output of the above-mentioned convolution module, the above-mentioned one or more programs 704 further include instructions for performing the following operations. Processing the above-mentioned concatenated features by using the above-mentioned three-dimensional filter to increase the dimension of the above-mentioned concatenated features; and using the above-mentioned concatenated features after dimensionality enhancement as the output of the above-mentioned convolution module.

在一个实施例中，上述至少一个卷积模块中的每个卷积模块还包括位于上述至少一个第一卷积层和上述第二卷积层之后的第三卷积层，上述第三卷积层包括尺寸为1×1×1的三维滤波器，在获取上述卷积模块的上述输出方面，上述一个或多个程序704包括用于执行以下操作的指令。用上述三维滤波器处理上述注意力特征以增加上述注意力特征的维度；以及使用升维后的上述注意力特征作为上述卷积模块的输出。In one embodiment, each convolution module in the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, and the third convolution The layer comprises a three-dimensional filter of size 1×1×1, and in obtaining the above output of the above convolution module, the one or more programs 704 include instructions for performing the following operations. processing the attention feature with the above-mentioned three-dimensional filter to increase the dimension of the attention feature; and using the dimension-uplifted attention feature as an output of the convolution module.

还提供了一种非瞬时性计算机存储介质。该非瞬时性计算机存储介质被配置为存储程序，当执行该程序时，该程序可用于执行上述方法实施例中描述的基于深度残差网络的动作识别方法的一些或全部操作。A non-transitory computer storage medium is also provided. The non-transitory computer storage medium is configured to store a program. When the program is executed, the program can be used to perform some or all operations of the deep residual network-based action recognition method described in the above method embodiments.

还提供了一种计算机程序产品。该计算机程序产品包括存储计算机程序的非瞬时性计算机可读存储介质。上述计算机程序可使计算机执行上述方法实施例中描述的基于深度残差网络的动作识别方法的一些或全部操作。A computer program product is also provided. The computer program product includes a non-transitory computer-readable storage medium storing a computer program. The above computer program can cause the computer to execute some or all of the operations of the deep residual network-based action recognition method described in the above method embodiments.

需要说明的是，对于前述的各方法实施例，为了简单描述，故将其都表述为一系列的动作组合，但是本领域技术人员应该知悉，本发明并不受所描述的动作顺序的限制，因为依据本发明，某些步骤可以采用其他顺序或者同时进行。其次，本领域技术人员也应该知悉，说明书中所描述的实施例均属于优选实施例，所涉及的动作和模块并不一定是本发明所必须的It should be noted that for the foregoing method embodiments, for the sake of simple description, they are expressed as a series of action combinations, but those skilled in the art should know that the present invention is not limited by the described action sequence. Because of the present invention, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

在上述实施例中，对各个实施例的描述都各有侧重，某个实施例中没有详述的部分，可以参见其他实施例的相关描述。In the foregoing embodiments, the descriptions of each embodiment have their own emphases, and for parts not described in detail in a certain embodiment, reference may be made to relevant descriptions of other embodiments.

在本申请所提供的几个实施例中，应该理解到，所揭露的装置，可通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如上述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed device can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the above units is only a logical function division. In actual implementation, there may be other division methods, for example, multiple units or components can be combined or integrated. to another system, or some features may be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical or other forms.

上述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外，在本发明各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储器中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储器中，包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本发明各个实施例上述方法的全部或部分步骤。而前述的存储器包括：U盘、只读存储器(Read-Only Memory，ROM)、随机存取存储器(Random Access Memory，RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。If the above-mentioned integrated units are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable memory. Based on this understanding, the essence of the technical solution of the present invention or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory. Several instructions are included to make a computer device (which may be a personal computer, server or network device, etc.) execute all or part of the steps of the above-mentioned methods in various embodiments of the present invention. The above-mentioned memory includes: various media capable of storing program codes such as U disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), mobile hard disk, magnetic disk or optical disk.

本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成，该程序可以存储于一计算机可读存储器中，存储器可以包括：闪存盘、只读存储器(Read-Only Memory，ROM)、随机存取器(Random AccessMemory，RAM)、磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above-mentioned embodiments can be completed by instructing related hardware through a program, and the program can be stored in a computer-readable memory, and the memory can include: a flash disk , a read-only memory (Read-Only Memory, ROM), a random access device (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.

上面详细描述了本申请实施例，这里使用具体示例来描述本申请的原理和实现方式。上述实施例的描述仅用于帮助理解本申请的方法和核心思想。同时，本领域技术人员可以根据本申请的思想对具体实现方式和应用范围进行修改。总之，本说明书的内容不应被解释为对本申请的限制。The embodiments of the present application have been described in detail above, and specific examples are used here to describe the principle and implementation of the present application. The descriptions of the above embodiments are only used to help understand the method and core idea of the present application. At the same time, those skilled in the art can modify the specific implementation and application scope according to the idea of the present application. In any case, the content of this specification should not be construed as limiting the application.

Claims

1. A method for motion recognition based on a depth residual error network, applied to a depth residual error network system including at least one convolution module, each convolution module of the at least one convolution module including at least one first convolution layer, each first convolution layer of the at least one first convolution layer having at least one-dimensional filter and at least one two-dimensional filter, the method comprising: receiving as input a video segment at a first of the at least one convolution modules;

traversing the at least one convolution module by performing the following operations:

processing the input using a one-dimensional filter to obtain motion-related features and processing the input using a two-dimensional filter to obtain appearance-related features;

shifting the motion-related feature by one step along a time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature;

obtaining an output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature;

taking the output of the convolution as the input of the next convolution module until the last convolution module in the at least one convolution module is traversed;

identifying at least one action included in the video segment based on an output of a last convolution module of the at least one convolution module.

2. The method of claim 1, wherein obtaining the output of the convolution module comprises:

obtaining a cascade feature by cascading the motion-related feature, the residual feature, and the appearance-related feature in a channel dimension; and

determining the concatenated features as the output of the convolution module.

3. The method of claim 1, wherein obtaining the output of the convolution module comprises:

obtaining a cascade feature by cascading the motion-related feature, the residual feature, and the appearance-related feature in a channel dimension;

acquiring a channel attention mask based on the cascade feature; and

based on the channel attention mask and the cascade feature, acquiring an attention feature as the output of the convolution module.

4. The method of claim 3, wherein obtaining the attention feature based on the channel attention mask and the cascade feature comprises:

the attention feature is obtained by performing a channel-by-channel multiplication between the channel attention mask and the cascade feature.

5. The method of claim 3, wherein obtaining the attention feature based on the channel attention mask and the cascade feature comprises:

obtaining an intermediate feature by performing a channel-by-channel multiplication between the channel attention mask and the cascade feature; and

and adding the intermediate feature and the cascade feature to obtain the attention feature.

6. The method of any of claims 3 to 5, wherein each of the at least one convolution module further comprises a fully-connected layer, and wherein obtaining the channel attention mask based on the concatenated features comprises:

performing global pooling on the cascade features to obtain pooled cascade features;

multiplying the pooled cascade features by the weight matrix of the full connection layer to obtain weighted cascade features;

adding the weighted cascade feature and the bias to obtain a biased cascade feature; and

and processing the bias cascade feature by using a Sigmoid function to obtain the channel attention mask.

7. The method of any of claims 1 to 6, wherein each of the at least one convolution module further comprises a second convolution layer located before the at least one first convolution layer, the second convolution layer comprising a three-dimensional filter having a size of 1 x 1, traversing the at least one convolution module further comprising:

prior to processing the input using the one-dimensional filter to obtain motion-related features and processing the input using the two-dimensional filter to obtain appearance-related features:

processing the input using the three-dimensional filter to reduce a dimension of the input;

wherein processing the input using a one-dimensional filter to obtain motion-related features and processing the input using a two-dimensional filter to obtain appearance-related features comprises:

processing the reduced dimension input using the one-dimensional filter to obtain the motion-related feature, and processing the reduced dimension input using the two-dimensional filter to obtain the appearance-related feature.

8. The method of claim 7, wherein each of the at least one convolution module further includes a third convolution layer positioned after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 x 1, the obtaining the output of the convolution module further comprising:

processing the concatenated features using the three-dimensional filter to increase a dimension of the concatenated features; and

and taking the cascade feature after the dimension is raised as the output of the convolution module.

9. The method of claim 7, wherein each of the at least one convolution module further includes a third convolution layer positioned after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 x 1, the obtaining the output of the convolution module further comprising:

processing the attention feature with the three-dimensional filter to increase a dimension of the attention feature; and

using the attention feature after upscaling as the output of the convolution module.

10. An action recognition device based on a depth residual error network, applied to a depth residual error network system comprising at least one convolution module, each convolution module of the at least one convolution module comprising at least one first convolution layer, each first convolution layer of the at least one first convolution layer having at least one-dimensional filter and at least one two-dimensional filter, the device comprising:

a receiving unit for receiving as input a video segment at a first of said at least one convolution module;

a processing unit to traverse the at least one convolution module by performing the following operations:

an identification unit to identify at least one action included in the video segment based on an output of a last convolution module of the at least one convolution module.

11. The apparatus according to claim 10, wherein, in obtaining the output of the convolution module, the processing unit is specifically configured to:

determining the concatenated features as the output of the convolution module.

12. The apparatus according to claim 10, wherein, in obtaining the output of the convolution module, the processing unit is specifically configured to:

acquiring a channel attention mask based on the cascade feature; and

based on the channel attention mask and the concatenated features, acquiring attention features as the output of the convolution module.

13. The apparatus of claim 12, wherein, in obtaining the attention feature based on the channel attention mask and the cascade feature, the processing unit is specifically configured to:

14. The apparatus of claim 12, wherein, in obtaining an attention feature based on the channel attention mask and the cascade feature, the processing unit is specifically configured to:

15. The apparatus according to any one of claims 12 to 14, wherein each of the at least one convolution module further comprises a fully-connected layer, the processing unit being specifically configured to, in obtaining the channel attention mask based on the concatenated features:

performing global pooling on the cascade feature to obtain the pooled cascade feature;

16. The apparatus of any of claims 10 to 15, wherein each of the at least one convolution module further comprises a second convolution layer preceding the at least one first convolution layer, the second convolution layer comprising a three-dimensional filter having a size of 1 x 1, the processing unit further to, in traversing the at least one convolution module: processing the input using the three-dimensional filter to reduce the dimensionality of the input prior to processing the input using the one-dimensional filter to obtain motion-related features and processing the input using the two-dimensional filter to obtain appearance-related features;

wherein, in processing the input using a one-dimensional filter to obtain motion-related features and processing the input using a two-dimensional filter to obtain appearance-related features, the processing unit is to: processing the reduced dimension input using the one-dimensional filter to obtain the motion-related feature, and processing the reduced dimension input using the two-dimensional filter to obtain the appearance-related feature.

17. The apparatus of claim 16, wherein each of the at least one convolution module further comprises a third convolution layer positioned after the at least one first convolution layer and the second convolution layer, the third convolution layer comprising a three-dimensional filter having a size of 1 x 1, the processing unit further to, in obtaining the output of the convolution module:

processing the cascaded feature using the three-dimensional filter to increase a dimension of the cascaded feature; and

18. The apparatus of claim 16, wherein each of the at least one convolution module further comprises a third convolution layer positioned after the at least one first convolution layer and the second convolution layer, the third convolution layer comprising a three-dimensional filter having a size of 1 x 1, the processing unit further to, in obtaining the output of the convolution module:

19. A terminal device comprising a processor, a memory for storing one or more programs, wherein the one or more programs are for execution by the processor and comprise instructions for performing the method of any of claims 1 to 9.

20. A non-transitory computer readable storage medium storing a computer program for electronic data exchange, which when executed, causes a computer to perform the method according to any one of claims 1 to 9.