CN113312966B - An action recognition method and device based on a first-person perspective - Google Patents

An action recognition method and device based on a first-person perspective

Info

Publication number
CN113312966B
CN113312966B (application number CN202110430314.5A)
Authority
CN
China
Prior art keywords
position information
input
features
video frame
video frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110430314.5A
Other languages
Chinese (zh)
Other versions
CN113312966A (en)
Inventor
刘文印
田文浩
陈俊洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110430314.5A priority Critical patent/CN113312966B/en
Publication of CN113312966A publication Critical patent/CN113312966A/en
Application granted granted Critical
Publication of CN113312966B publication Critical patent/CN113312966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

本发明公开了一种基于第一人称视角的动作识别方法及装置,其方法包括:获取待处理RGB视频帧;待处理RGB视频帧包含基于第一人称视角的手部动作图像信息;将所有待处理RGB视频帧输入到预先训练的HOPE‑Net深度神经网络,得到对应的手部关节点位置信息;从所有待处理RGB视频帧中挑选预定数量的目标RGB视频帧,并输入I3D模型中识别,得到对应的视频帧特征;将手部关节点位置信息输入AGCN模型,得到对应的位置信息特征;将视频帧特征和位置信息特征一一对应融合,得到识别动作指令的概率。将视频帧依次进行手骨骼关节提取,RGB和骨骼动作特征提取,最后进行特征融合得到动作指令概率,从而摆脱对外部硬件设备的依赖,并且对于光照和场景变化有强鲁棒性。

The invention discloses an action recognition method and device based on a first-person perspective. The method includes: acquiring RGB video frames to be processed, where the RGB video frames to be processed contain hand action image information captured from a first-person perspective; inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain the corresponding hand joint point position information; selecting a predetermined number of target RGB video frames from all the RGB video frames to be processed and inputting them into an I3D model for recognition to obtain the corresponding video frame features; inputting the hand joint point position information into an AGCN model to obtain the corresponding position information features; and fusing the video frame features and the position information features in one-to-one correspondence to obtain the probability of the recognized action instruction. Hand skeleton joints are first extracted from the video frames, RGB and skeleton action features are then extracted, and feature fusion finally yields the action instruction probability, thereby removing the dependence on external hardware devices while remaining strongly robust to illumination and scene changes.

Description

一种基于第一人称视角的动作识别方法及装置An action recognition method and device based on a first-person perspective

技术领域technical field

本发明涉及动作识别技术领域,尤其涉及一种基于第一人称视角的动作识别方法及装置。The present invention relates to the technical field of motion recognition, in particular to a method and device for motion recognition based on a first-person perspective.

背景技术Background technique

虽然机器人可以通过从人类的演示视频中学习动作，从而很好地了解人类的行为意图并自主学习人类的行为，但在实际应用中，机器人对人类行为的学习需要一个细致的理解过程，尤其是对来源于日常活动的行为学习，对于机器人来说是特别具有挑战性的，例如：基于穿戴式相机拍摄的第一人称视频中，机器人只能从单一的角度获取人类手部的操作动作，在这种情况下，充满着诸如手部移动较快，以及手部操作时候出现遮挡现象等，从而产生很大程度的不可预测性。因此，机器人对于人类动作的细微差异的识别，并对人类动作加以学习和执行的过程，仍然是目前机器人技术领域的一大难题，尤其是在研究热点之一的第一人称视角的动作识别方向上。Although a robot can learn actions from human demonstration videos and thus understand human behavioral intentions and learn human behaviors autonomously, in practical applications this learning requires a detailed understanding process. Learning behaviors drawn from daily activities is especially challenging for robots: in first-person videos captured by a wearable camera, the robot can observe the human hand's manipulation only from a single viewpoint, where fast hand movement and occlusion during manipulation introduce a large degree of unpredictability. Therefore, recognizing the subtle differences between human actions, and learning and executing those actions, remains a major challenge in robotics, especially in first-person action recognition, one of the current research hotspots.

基于第一人称视角动作识别的方法，现今的方法主要包括三种：(1)利用传感器如LeapMtion，采集演示视频中手关节的信息，进而辅助动作识别，这种方法需要硬件的支持，并且要求操作人员需要在特定的环境下演示动作；(2)对于演示视频，利用稠密轨迹表示运动特征，以及利用HOG采集手势特征，这种方法往往会受到背景和相机移动带来的干扰，并且计算量较大；(3)分割出演示视频中操作者的手部再输入到深度神经网络中进行识别，这种方法虽然可以有效的减少背景的干扰，但是缺失了大部分的原始信息。显而易见，现有的基于第一人称视角的动作识别方法都存在一定的缺陷。There are currently three main approaches to first-person action recognition: (1) using sensors such as Leap Motion to collect hand-joint information from demonstration videos to assist action recognition; this requires hardware support and forces the operator to demonstrate actions in a specific environment; (2) representing motion features with dense trajectories and collecting gesture features with HOG for demonstration videos; this is often disturbed by background clutter and camera motion and is computationally expensive; (3) segmenting the operator's hands from the demonstration video and feeding them into a deep neural network for recognition; although this effectively reduces background interference, most of the original information is lost. Clearly, the existing action recognition methods based on the first-person perspective all have certain shortcomings.

综上所述,提出一个能摆脱对外部硬件设备的依赖,并且对于光照和场景变化有强鲁棒性的第一人称视角的动作识别方案,具有重要的意义。In summary, it is of great significance to propose a first-person perspective action recognition scheme that can get rid of the dependence on external hardware devices and is robust to illumination and scene changes.

发明内容Contents of the invention

本发明提供了一种基于第一人称视角的动作识别方法及装置,能够摆脱对外部硬件设备的依赖,并且对于光照和场景变化有强鲁棒性。The present invention provides an action recognition method and device based on a first-person perspective, which can get rid of the dependence on external hardware devices, and has strong robustness to illumination and scene changes.

第一方面,本发明提供的一种基于第一人称视角的动作识别方法,包括:In the first aspect, the present invention provides an action recognition method based on a first-person perspective, including:

获取待处理RGB视频帧;所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息;Obtain the RGB video frame to be processed; the RGB video frame to be processed contains hand movement image information based on the first-person perspective;

将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络,得到对应的手部关节点位置信息;All the RGB video frames to be processed are input to the pre-trained HOPE-Net deep neural network to obtain the corresponding hand joint point position information;

从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧,并输入I3D模型中识别,得到对应的视频帧特征;Select a predetermined number of target RGB video frames from all the RGB video frames to be processed, and input them into the I3D model for identification to obtain corresponding video frame features;

将所述手部关节点位置信息输入AGCN模型,得到对应的位置信息特征;Input the position information of the joint points of the hand into the AGCN model to obtain the corresponding position information features;

将所述视频帧特征和所述位置信息特征一一对应融合,得到识别动作指令的概率。The video frame features and the position information features are fused in one-to-one correspondence to obtain the probability of recognizing an action instruction.
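A minimal orchestration sketch of the five steps listed above, written in Python. The wrapper modules hope_net, i3d, agcn and fusion and their interfaces are assumptions introduced for illustration; this shows the intended data flow, not the patented implementation.

```python
import torch

def recognize_action(frames, hope_net, i3d, agcn, fusion, num_target_frames=32):
    """frames: float tensor (T, 3, H, W) of first-person RGB video frames."""
    # Steps 1-2: per-frame 3D hand joint positions (T, 21, 3), assuming
    # hope_net maps a (1, 3, H, W) image to (1, 21, 3) joint coordinates.
    with torch.no_grad():
        joints = torch.stack([hope_net(f.unsqueeze(0)).squeeze(0) for f in frames])

    # Step 3: pick a fixed number of target frames and extract RGB clip features.
    idx = torch.linspace(0, frames.shape[0] - 1, num_target_frames).long()
    clip = frames[idx].permute(1, 0, 2, 3).unsqueeze(0)   # (1, 3, T', H, W)
    rgb_feat = i3d(clip)                                   # (1, D_rgb)

    # Step 4: skeleton (position information) features from the joint sequence.
    skel_feat = agcn(joints.unsqueeze(0))                  # (1, 256)

    # Step 5: fuse the two feature streams and output action probabilities.
    return fusion(rgb_feat, skel_feat)                     # (1, num_actions)
```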

可选地，所述HOPE-Net深度神经网络包括：ResNet10网络和自适应图U-Net网络；将所有所述RGB视频帧输入到预先训练的HOPE-Net深度神经网络，得到对应的手部关节点位置信息，包括：Optionally, the HOPE-Net deep neural network includes a ResNet10 network and an adaptive graph U-Net network; inputting all the RGB video frames into the pre-trained HOPE-Net deep neural network to obtain the corresponding hand joint point position information includes:

通过ResNet网络对所有所述RGB视频帧进行编码和预测,得到多个对应的目标平面直角坐标点;Encoding and predicting all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points;

将多个所述目标平面直角坐标点输入自适应U-Net网络,得到所述RGB 视频帧对应的手部关节点位置信息。Inputting a plurality of Cartesian coordinate points of the target plane into the adaptive U-Net network to obtain the position information of the hand joint points corresponding to the RGB video frame.

可选地,通过ResNet网络对所有所述RGB视频帧进行编码和预测,得到多个对应的目标平面直角坐标点,包括:Optionally, all the RGB video frames are encoded and predicted through the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points, including:

对所有所述RGB视频帧进行编码,得到编码后的视频帧;Encoding all the RGB video frames to obtain encoded video frames;

对所有所述编码后的视频帧进行预测,得到对应的初始平面直角坐标点;Predicting all the encoded video frames to obtain corresponding initial plane Cartesian coordinate points;

将所有所述初始平面直角坐标点和对应的RGB视频帧进行卷积,得到目标平面直角坐标点。Convolving all the initial plane Cartesian coordinate points with the corresponding RGB video frames to obtain the target plane Cartesian coordinate points.

可选地,获取待处理RGB视频帧,包括:Optionally, obtain RGB video frames to be processed, including:

获取待处理视频;所述待处理视频中包含有基于第一人称视角的手部动作影像信息;Obtain the video to be processed; the video to be processed contains hand movement image information based on the first-person perspective;

通过OpenCV将所述手部动作影像信息转换所述待处理RGB视频帧。The hand movement image information is converted into the RGB video frame to be processed by OpenCV.

可选地,将所述视频帧特征和所述位置信息特征一一对应融合,得到识别动作指令的概率,包括:Optionally, the one-to-one correspondence fusion of the video frame features and the position information features to obtain the probability of recognizing action instructions includes:

通过预先建立的关系图卷积网络,分析所述视频帧特征和所述位置信息特征的距离关系,并基于所述距离关系创建每一个所述视频帧特征与每一个位置信息特征的连接;Analyzing the distance relationship between the video frame feature and the position information feature through the pre-established relational graph convolution network, and creating a connection between each of the video frame features and each position information feature based on the distance relationship;

分别将所述视频帧特征和所述位置信息特征输入卷积网络，得到卷积后的视频帧特征和卷积后的位置信息特征；The video frame features and the position information features are respectively input into convolution networks to obtain convolved video frame features and convolved position information features;

将处于同一连接的卷积后的视频帧特征和卷积后的位置信息特征融合，得到融合后的信息特征输入全连接层网络，得到所述识别动作指令的概率。The convolved video frame features and the convolved position information features on the same connection are fused, and the fused information features are input into a fully connected layer network to obtain the probability of the recognized action instruction.

第二方面,本发明提供的一种基于第一人称视角的动作识别装置,包括:In the second aspect, the present invention provides an action recognition device based on a first-person perspective, including:

获取模块,用于获取待处理RGB视频帧;所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息;An acquisition module, configured to acquire an RGB video frame to be processed; the RGB video frame to be processed includes hand movement image information based on a first-person perspective;

第一输入模块,用于将所有所述待处理RGB视频帧输入到预先训练的 HOPE-Net深度神经网络,得到对应的手部关节点位置信息;The first input module is used to input all the RGB video frames to be processed to the pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;

挑选模块,用于从所有所述待处理RGB视频帧中挑选预定数量的目标 RGB视频帧,并输入I3D模型中识别,得到对应的视频帧特征;Selection module is used to select a predetermined number of target RGB video frames from all the RGB video frames to be processed, and input them to identify in the I3D model to obtain corresponding video frame features;

第二输入模块,用于将所述手部关节点位置信息输入AGCN模型,得到对应的位置信息特征;The second input module is used to input the position information of the joint points of the hand into the AGCN model to obtain corresponding position information features;

融合模块,用于将所述视频帧特征和所述位置信息特征一一对应融合,得到识别动作指令的概率。The fusion module is used to fuse the video frame features and the position information features one by one to obtain the probability of recognizing the action instruction.

可选地,所述HOPE-Net深度神经网络包括:ResNet10网络和自适应图 U-Net网络;所述第一输入模块包括:Optionally, the HOPE-Net deep neural network includes: ResNet10 network and adaptive graph U-Net network; the first input module includes:

编码子模块,用于通过ResNet网络对所有所述RGB视频帧进行编码和预测,得到多个对应的目标平面直角坐标点;The encoding submodule is used to encode and predict all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points;

第一输入子模块,用于将多个所述目标平面直角坐标点输入自适应U-Net 网络,得到所述RGB视频帧对应的手部关节点位置信息。The first input sub-module is configured to input a plurality of Cartesian coordinate points of the target plane into the adaptive U-Net network to obtain the position information of the hand joint points corresponding to the RGB video frame.

可选地,所述编码子模块包括:Optionally, the coding submodule includes:

编码单元,用于对所有所述RGB视频帧进行编码,得到编码后的视频帧;An encoding unit, configured to encode all the RGB video frames to obtain encoded video frames;

预测单元,用于对所有所述编码后的视频帧进行预测,得到对应的初始平面直角坐标点;A prediction unit, configured to predict all the encoded video frames to obtain corresponding initial plane Cartesian coordinate points;

卷积单元,用于将所有所述初始平面直角坐标点和对应的RGB视频帧进行卷积,得到目标平面直角坐标点。The convolution unit is configured to convolve all the initial plane Cartesian coordinate points with the corresponding RGB video frames to obtain the target plane Cartesian coordinate points.

可选地,所述获取模块包括:Optionally, the acquisition module includes:

获取子模块,用于获取待处理视频;所述待处理视频中包含有基于第一人称视角的手部动作影像信息;The obtaining sub-module is used to obtain the video to be processed; the video to be processed contains hand movement image information based on the first-person perspective;

转换子模块,用于通过OpenCV将所述手部动作影像信息转换所述待处理RGB视频帧。The conversion sub-module is used to convert the hand motion image information into the RGB video frame to be processed through OpenCV.

可选地,所述融合模块包括:Optionally, the fusion module includes:

连接子模块，用于通过预先建立的关系图卷积网络，分析所述视频帧特征和所述位置信息特征的距离关系，并基于所述距离关系创建每一个所述视频帧特征与每一个位置信息特征的连接；A connection submodule, configured to analyze the distance relationship between the video frame features and the position information features through a pre-established relational graph convolution network, and to create a connection between each video frame feature and each position information feature based on the distance relationship;

第二输入子模块，用于分别将所述视频帧特征和所述位置信息特征输入卷积网络，得到卷积后的视频帧特征和卷积后的位置信息特征；A second input submodule, configured to respectively input the video frame features and the position information features into convolution networks to obtain convolved video frame features and convolved position information features;

融合子模块，用于将处于同一连接的卷积后的视频帧特征和卷积后的位置信息特征融合，得到融合后的信息特征输入全连接层网络，得到所述识别动作指令的概率。A fusion submodule, configured to fuse the convolved video frame features and the convolved position information features on the same connection, and to input the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.

从以上技术方案可以看出,本发明具有以下优点:As can be seen from the above technical solutions, the present invention has the following advantages:

本发明通过获取待处理RGB视频帧；所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息；将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络，得到对应的手部关节点位置信息；从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧，并输入I3D模型中识别，得到对应的视频帧特征；将所述手部关节点位置信息输入AGCN模型，得到对应的位置信息特征；将所述视频帧特征和所述位置信息特征一一对应融合，得到识别动作指令的概率。将视频帧依次进行手骨骼关节提取，RGB和骨骼动作特征提取，最后进行特征融合得到动作指令概率，从而摆脱对外部硬件设备的依赖，并且对于光照和场景变化有强鲁棒性。The present invention acquires RGB video frames to be processed, where the RGB video frames to be processed contain hand action image information based on a first-person perspective; inputs all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain the corresponding hand joint point position information; selects a predetermined number of target RGB video frames from all the RGB video frames to be processed and inputs them into an I3D model for recognition to obtain the corresponding video frame features; inputs the hand joint point position information into an AGCN model to obtain the corresponding position information features; and fuses the video frame features and the position information features in one-to-one correspondence to obtain the probability of the recognized action instruction. Hand skeleton joints are first extracted from the video frames, RGB and skeleton action features are then extracted, and feature fusion finally yields the action instruction probability, thereby removing the dependence on external hardware devices while remaining strongly robust to illumination and scene changes.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其它的附图；In order to more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort;

图1为本发明的一种基于第一人称视角的动作识别方法实施例一的步骤流程图;FIG. 1 is a flow chart of the steps of Embodiment 1 of an action recognition method based on a first-person perspective in the present invention;

图2为本发明的一种基于第一人称视角的动作识别方法实施例二的步骤流程图;FIG. 2 is a flow chart of steps in Embodiment 2 of an action recognition method based on a first-person perspective in the present invention;

图3为本发明的从待处理视频到手部关节位置信息的处理原理图;Fig. 3 is the processing schematic diagram from the video to be processed to the hand joint position information of the present invention;

图4为本发明的一种自适应图卷积模块的结构示意图;Fig. 4 is a schematic structural diagram of an adaptive graph convolution module of the present invention;

图5为本发明的关系图卷积网络的使用原理图;Fig. 5 is the schematic diagram of the use of the relational graph convolutional network of the present invention;

图6为本发明的一种基于第一人称视角的动作识别装置实施例的结构框图。FIG. 6 is a structural block diagram of an embodiment of an action recognition device based on a first-person perspective in the present invention.

具体实施方式Detailed ways

本发明实施例提供了一种基于第一人称视角的动作识别方法及装置,能够摆脱对外部硬件设备的依赖,并且对于光照和场景变化有强鲁棒性。Embodiments of the present invention provide an action recognition method and device based on a first-person perspective, which can get rid of the dependence on external hardware devices and have strong robustness to illumination and scene changes.

为使得本发明的发明目的、特征、优点能够更加的明显和易懂，下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，下面所描述的实施例仅仅是本发明一部分实施例，而非全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例，都属于本发明保护的范围。In order to make the purpose, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

实施例一,请参阅图1,图1为本发明的一种基于第一人称视角的动作识别方法实施例一的步骤流程图,包括:Embodiment 1, please refer to FIG. 1. FIG. 1 is a flow chart of steps in Embodiment 1 of an action recognition method based on a first-person perspective according to the present invention, including:

S101,获取待处理RGB视频帧;所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息;S101. Obtain an RGB video frame to be processed; the RGB video frame to be processed includes hand movement image information based on a first-person perspective;

S102,将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络,得到对应的手部关节点位置信息;S102, inputting all the RGB video frames to be processed into the pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;

S103,从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧,并输入I3D模型中识别,得到对应的视频帧特征;S103, selecting a predetermined number of target RGB video frames from all the RGB video frames to be processed, and inputting them into the I3D model for identification to obtain corresponding video frame features;

S104,将所述手部关节点位置信息输入AGCN模型,得到对应的位置信息特征;S104. Input the position information of the hand joint points into the AGCN model to obtain corresponding position information features;

S105,将所述视频帧特征和所述位置信息特征一一对应融合,得到识别动作指令的概率。S105, merging the video frame features and the position information features one by one to obtain a probability of recognizing an action instruction.

在本发明实施例中，通过获取待处理RGB视频帧；所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息；将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络，得到对应的手部关节点位置信息；从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧，并输入I3D模型中识别，得到对应的视频帧特征；将所述手部关节点位置信息输入AGCN模型，得到对应的位置信息特征；将所述视频帧特征和所述位置信息特征一一对应融合，得到识别动作指令的概率。将视频帧依次进行手骨骼关节提取，RGB和骨骼动作特征提取，最后进行特征融合得到动作指令概率，从而摆脱对外部硬件设备的依赖，并且对于光照和场景变化有强鲁棒性。In the embodiment of the present invention, RGB video frames to be processed are acquired, where the RGB video frames to be processed contain hand action image information based on a first-person perspective; all the RGB video frames to be processed are input into a pre-trained HOPE-Net deep neural network to obtain the corresponding hand joint point position information; a predetermined number of target RGB video frames are selected from all the RGB video frames to be processed and input into an I3D model for recognition to obtain the corresponding video frame features; the hand joint point position information is input into an AGCN model to obtain the corresponding position information features; and the video frame features and the position information features are fused in one-to-one correspondence to obtain the probability of the recognized action instruction. Hand skeleton joints are first extracted from the video frames, RGB and skeleton action features are then extracted, and feature fusion finally yields the action instruction probability, thereby removing the dependence on external hardware devices while remaining strongly robust to illumination and scene changes.

实施例二,请参阅图2,图2为本发明的一种基于第一人称视角的动作识别方法实施例二的步骤流程图,具体包括:Embodiment 2, please refer to FIG. 2. FIG. 2 is a flow chart of steps in Embodiment 2 of an action recognition method based on a first-person perspective according to the present invention, which specifically includes:

步骤S201,获取待处理视频;所述待处理视频中包含有基于第一人称视角的手部动作影像信息;Step S201, acquiring the video to be processed; the video to be processed contains hand movement image information based on the first-person perspective;

步骤S202,通过OpenCV将所述手部动作影像信息转换所述待处理RGB 视频帧;Step S202, converting the hand movement image information into the RGB video frame to be processed through OpenCV;

在本发明实施例中,首先将待处理视频利用OpenCV转换为多个待处理 RGB视频帧。In the embodiment of the present invention, first the video to be processed is converted into a plurality of RGB video frames to be processed using OpenCV.

需要说明的是，OpenCV是一个跨平台计算机视觉和机器学习软件库，可以运行在Linux、Windows、Android和Mac OS操作系统上。OpenCV具有轻量和高效的特性——由一系列C函数和少量C++类构成，同时提供了Python、Ruby、MATLAB等语言的接口，进而实现图像处理和计算机视觉方面的很多通用算法。It should be noted that OpenCV is a cross-platform computer vision and machine learning software library that can run on Linux, Windows, Android and Mac OS operating systems. OpenCV is lightweight and efficient — it consists of a series of C functions and a small number of C++ classes, and provides interfaces for languages such as Python, Ruby and MATLAB, thereby implementing many general-purpose algorithms for image processing and computer vision.
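A minimal sketch of steps S201/S202: reading the demonstration video and converting it to RGB frames with OpenCV. The function name and output format are assumptions for illustration; OpenCV decodes frames as BGR, so a colour conversion is applied.

```python
import cv2

def video_to_rgb_frames(video_path):
    """Decode a video file into a list of H x W x 3 RGB frames."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame_bgr = cap.read()
        if not ok:                        # end of video or read error
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```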

步骤S203,通过ResNet网络对所有所述RGB视频帧进行编码和预测,得到多个对应的目标平面直角坐标点;Step S203, encoding and predicting all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points;

在一个可选实施例中,通过ResNet网络对所有所述RGB视频帧进行编码和预测,得到多个对应的目标平面直角坐标点,包括:In an optional embodiment, all the RGB video frames are encoded and predicted by the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points, including:

对所有所述RGB视频帧进行编码,得到编码后的视频帧;Encoding all the RGB video frames to obtain encoded video frames;

对所有所述编码后的视频帧进行预测,得到对应的初始平面直角坐标点;Predicting all the encoded video frames to obtain corresponding initial plane Cartesian coordinate points;

将所有所述初始平面直角坐标点和对应的RGB视频帧进行卷积,得到目标平面直角坐标点。Convolving all the initial plane Cartesian coordinate points with the corresponding RGB video frames to obtain the target plane Cartesian coordinate points.

在本发明实施例中，利用ResNet10网络对所有RGB视频帧进行特征编码，得到编码后的视频帧，并基于编码后的视频帧进行预测得到初始平面直角坐标点，然后将初始平面直角坐标点和RGB视频帧进行卷积，得到更为精确的目标平面直角坐标点。In the embodiment of the present invention, the ResNet10 network performs feature encoding on all the RGB video frames to obtain encoded video frames, initial plane Cartesian coordinate points are predicted from the encoded video frames, and the initial plane Cartesian coordinate points are then convolved with the RGB video frames to obtain more accurate target plane Cartesian coordinate points.

步骤S204,将多个所述目标平面直角坐标点输入自适应U-Net网络,得到所述RGB视频帧对应的手部关节点位置信息;Step S204, inputting a plurality of Cartesian coordinate points of the target plane into the adaptive U-Net network to obtain the position information of the hand joint points corresponding to the RGB video frame;

在本发明实施例中，在得到步骤S203提及的目标平面直角坐标点之后，将其输入自适应U-Net网络，计算手部关节点的深度值，即对应的手部关节点位置信息，而手部关节点位置信息为最终的21个手关节的三维直角坐标信息，从而实现手部关节点从目标平面直角坐标到三维直角坐标的转变。In the embodiment of the present invention, after the target plane Cartesian coordinate points mentioned in step S203 are obtained, they are input into the adaptive U-Net network to calculate the depth value of the hand joint points, i.e. the corresponding hand joint point position information. The hand joint point position information is the final three-dimensional Cartesian coordinate information of the 21 hand joints, thereby converting the hand joint points from target plane Cartesian coordinates to three-dimensional Cartesian coordinates.

请查阅图3，图3为从待处理视频到手部关节位置信息的处理原理图，其中1为HOPE-Net网络，HOPE-Net网络中包含有ResNet10网络2和U-Net网络3，在ResNet10网络2的协助下得到初始平面直角坐标点，然后再次在ResNet10网络2的协助下得到目标平面直角坐标点，接着在U-Net网络3的协助下得到所述RGB视频帧对应的手部关节点位置信息。Refer to Figure 3, which shows the processing pipeline from the video to be processed to the hand joint position information, where 1 is the HOPE-Net network, which contains the ResNet10 network 2 and the U-Net network 3. The initial plane Cartesian coordinate points are obtained with the assistance of the ResNet10 network 2, the target plane Cartesian coordinate points are then obtained again with the assistance of the ResNet10 network 2, and finally the hand joint point position information corresponding to the RGB video frames is obtained with the assistance of the U-Net network 3.
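A hedged sketch of the stage shown in Figure 3: a ResNet encoder predicts initial 2D joint coordinates, the coordinates are refined together with the image features, and a lifting network (standing in for the adaptive graph U-Net) produces the 3D joint coordinates. The layer sizes, the linear refinement step and the use of ResNet18 as a stand-in for ResNet10 are simplifying assumptions, not the exact HOPE-Net architecture.

```python
import torch
import torch.nn as nn
import torchvision

class HandPoseSketch(nn.Module):
    def __init__(self, num_joints=21):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)   # stand-in for ResNet10
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.to_2d = nn.Linear(512, num_joints * 2)             # initial (x, y) per joint
        self.refine = nn.Linear(512 + num_joints * 2, num_joints * 2)
        self.lift_3d = nn.Sequential(                            # stand-in for the graph U-Net
            nn.Linear(num_joints * 2, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 3))
        self.num_joints = num_joints

    def forward(self, image):                                    # image: (B, 3, H, W)
        feat = self.encoder(image).flatten(1)                    # (B, 512)
        joints_2d = self.to_2d(feat)                             # initial plane coordinates
        joints_2d = joints_2d + self.refine(torch.cat([feat, joints_2d], dim=1))
        return self.lift_3d(joints_2d).view(-1, self.num_joints, 3)  # 3D joint coordinates
```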

步骤S205,从所有所述待处理RGB视频帧中挑选预定数量的目标RGB 视频帧,并输入I3D模型中识别,得到对应的视频帧特征;Step S205, selecting a predetermined number of target RGB video frames from all the RGB video frames to be processed, and inputting them into the I3D model for identification to obtain corresponding video frame features;

在本发明实施例中，为了能够从待处理RGB视频帧中提取更多的特征细节，使用I3D模型对从待处理RGB视频帧中挑选的目标RGB视频帧进行识别，该模型从二维卷积扩展到三维卷积，即在卷积核和池化层增加时间维，利用三维卷积来提取目标RGB视频帧对应的视频帧特征。In the embodiment of the present invention, in order to extract more feature details from the RGB video frames to be processed, the I3D model is used to recognize the target RGB video frames selected from the RGB video frames to be processed. The model extends two-dimensional convolution to three-dimensional convolution, i.e. a time dimension is added to the convolution kernels and pooling layers, and three-dimensional convolution is used to extract the video frame features corresponding to the target RGB video frames.

需要说明的是，三维卷积的滤波器是N*N*N的，即沿着时间维度重复N*N的滤波器权重N次，并且通过除以N进行归一化，且除了最后一层卷积层外，在每一层卷积之后都加上BN函数和Relu激活函数。It should be noted that the three-dimensional convolution filters are N*N*N, i.e. the N*N filter weights are repeated N times along the time dimension and normalized by dividing by N, and except for the last convolutional layer, a BN function and a ReLU activation function are added after each convolutional layer.

在具体实现中,从待处理RGB视频帧中挑选32帧作为一组输入到I3D 模型,通过I3D模型生成对应的视频帧特征。In a specific implementation, 32 frames are selected from the RGB video frames to be processed as a set of input to the I3D model, and the corresponding video frame features are generated through the I3D model.
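A minimal sketch of the filter inflation described above: a 2D N x N convolution kernel is repeated N times along a new time axis and divided by N, turning it into an N x N x N 3D kernel. This follows the standard I3D inflation idea; the surrounding network definition is omitted and square kernels with integer padding are assumed.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Inflate a 2D convolution into a 3D convolution by repeating its weights in time."""
    k = conv2d.kernel_size[0]                      # assume a square N x N kernel
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(k, k, k),
                       padding=(k // 2,) + tuple(conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # (out, in, k, k) -> (out, in, k, k, k), normalized by the temporal extent
        w3d = conv2d.weight.unsqueeze(2).repeat(1, 1, k, 1, 1) / k
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```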

步骤S206,将所述手部关节点位置信息输入AGCN模型,得到对应的位置信息特征;Step S206, input the position information of the joint points of the hand into the AGCN model to obtain the corresponding position information features;

需要说明的是，AGCN模型包括9层自适应的图卷积组合模块，针对不同的GCN单元和不同的样本，会自动生成不同的拓扑结构图。请查阅图4，图4为本发明的一种自适应图卷积模块的结构示意图，包括空间图卷积4、时间图卷积5和附加的dropout层6，每个图卷积层后面都有BN层7和Relu层8，通过5种不同类型图层的组合，生成对应的拓扑结构图。为了得到更为稳定的效果，AGCN模型中每一层自适应的图卷积组合模块均为残差连接。此外，通过AGCN网络模型得到的是N*256的特征，其中N为样本数。It should be noted that the AGCN model includes nine layers of adaptive graph convolution modules, which automatically generate different topology graphs for different GCN units and different samples. Refer to Figure 4, a schematic structural diagram of an adaptive graph convolution module of the present invention, which includes a spatial graph convolution 4, a temporal graph convolution 5 and an additional dropout layer 6; each graph convolution layer is followed by a BN layer 7 and a ReLU layer 8, and the corresponding topology graph is generated through the combination of these five types of layers. To obtain a more stable result, each adaptive graph convolution module in the AGCN model uses a residual connection. In addition, the features obtained through the AGCN network model are N*256 features, where N is the number of samples.

在本发明实施例中，利用AGCN模型，将以手部关节点位置信息为主的手的自然骨架结构，通过拓扑图表示。该模型建立在一系列手部骨架图，即手部关节点位置信息的基础上，手部骨架图的每个节点代表一个时刻手的一个关节，以三维坐标表示。图的边有两种类型，一种是在某一个时刻手的关节自然连接间的空间边，一种是跨越连续时间步长连接相同关节的时间边。在此基础上构造了多层时空图卷积，从而实现信息在空间和时间维度上的聚合。In the embodiment of the present invention, the AGCN model is used to represent the natural skeleton structure of the hand, built mainly from the hand joint point position information, as a topology graph. The model is based on a sequence of hand skeleton graphs, i.e. the hand joint point position information, where each node of a hand skeleton graph represents one hand joint at one time instant, expressed in three-dimensional coordinates. The graph has two types of edges: spatial edges between naturally connected hand joints at a given moment, and temporal edges connecting the same joint across consecutive time steps. On this basis, multi-layer spatio-temporal graph convolutions are constructed to aggregate information in both the spatial and temporal dimensions.
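A hedged sketch of one adaptive graph convolution block as described for Figure 4: a spatial graph convolution with a learnable adjacency matrix, a temporal convolution over consecutive frames, BN and ReLU after each, a dropout layer, and a residual connection. Channel sizes and the form of the learned adjacency are simplifying assumptions rather than the exact published AGCN formulation.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, num_joints=21, t_kernel=9, dropout=0.5):
        super().__init__()
        # Learnable graph topology; in practice it would be initialised from the
        # physical hand skeleton and adapted per layer and per sample.
        self.adjacency = nn.Parameter(torch.eye(num_joints))
        self.spatial = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                     nn.BatchNorm2d(out_ch), nn.ReLU())
        self.temporal = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, (t_kernel, 1), padding=(t_kernel // 2, 0)),
            nn.BatchNorm2d(out_ch), nn.ReLU(), nn.Dropout(dropout))
        self.residual = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):                  # x: (B, C, T, V) with V = number of joints
        agg = torch.einsum('bctv,vw->bctw', x, self.adjacency)  # spatial aggregation
        out = self.temporal(self.spatial(agg))                  # spatial then temporal conv
        return out + self.residual(x)                           # residual connection
```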

步骤S207，通过预先建立的关系图卷积网络，分析所述视频帧特征和所述位置信息特征的距离关系，并基于所述距离关系创建每一个所述视频帧特征与每一个位置信息特征的连接；Step S207, analyzing the distance relationship between the video frame features and the position information features through a pre-established relational graph convolution network, and creating a connection between each video frame feature and each position information feature based on the distance relationship;

步骤S208，分别将所述视频帧特征和所述位置信息特征输入卷积网络，得到卷积后的视频帧特征和卷积后的位置信息特征；Step S208, respectively inputting the video frame features and the position information features into convolution networks to obtain convolved video frame features and convolved position information features;

步骤S209，将处于同一连接的卷积后的视频帧特征和卷积后的位置信息特征融合，得到融合后的信息特征输入全连接层网络，得到所述识别动作指令的概率。Step S209, fusing the convolved video frame features and the convolved position information features on the same connection, and inputting the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.

请查阅图5，图5为关系图卷积网络的使用原理图，其中9为视频帧特征，10为位置信息特征，11为GCN单元，12为融合后的信息特征，13为识别动作指令的概率。分别将视频帧特征9和位置信息特征10输入多个GCN单元进行卷积，得到卷积后的视频帧特征和卷积后的位置信息特征，然后将其融合得到融合后的信息特征12，进而得到识别动作指令的概率13。进而实现在不需要对演示视频和演示环境约束，不依赖于额外的辅助传感器的情况下，对操作视频进行动作识别。Refer to Figure 5, which shows how the relational graph convolution network is used, where 9 is the video frame feature, 10 is the position information feature, 11 is a GCN unit, 12 is the fused information feature, and 13 is the probability of the recognized action instruction. The video frame feature 9 and the position information feature 10 are respectively input into multiple GCN units for convolution to obtain the convolved video frame features and the convolved position information features, which are then fused to obtain the fused information feature 12, and finally the probability 13 of the recognized action instruction. In this way, action recognition is performed on the operation video without imposing constraints on the demonstration video or the demonstration environment, and without relying on additional auxiliary sensors.
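A hedged sketch of the fusion stage in Figure 5: the RGB clip feature and the skeleton feature are each passed through their own GCN-style unit, fused, and classified by a fully connected layer with a softmax. The feature dimensions and the fully connected connection pattern are assumptions; the patent builds the connections from the distance relationship between the two feature sets, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class RelationalFusion(nn.Module):
    def __init__(self, num_actions, rgb_dim=1024, skel_dim=256, hidden=512):
        super().__init__()
        self.rgb_gcn = nn.Sequential(nn.Linear(rgb_dim, hidden), nn.ReLU())
        self.skel_gcn = nn.Sequential(nn.Linear(skel_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, num_actions)

    def forward(self, rgb_feat, skel_feat):
        h_rgb = self.rgb_gcn(rgb_feat)              # convolved video frame features
        h_skel = self.skel_gcn(skel_feat)           # convolved position information features
        fused = torch.cat([h_rgb, h_skel], dim=1)   # fuse features on the same connection
        return torch.softmax(self.classifier(fused), dim=1)  # action instruction probabilities
```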

在本发明实施例所提供的一种基于第一人称视角的动作识别方法，通过获取待处理RGB视频帧；所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息；将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络，得到对应的手部关节点位置信息；从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧，并输入I3D模型中识别，得到对应的视频帧特征；将所述手部关节点位置信息输入AGCN模型，得到对应的位置信息特征；将所述视频帧特征和所述位置信息特征一一对应融合，得到识别动作指令的概率。将视频帧依次进行手骨骼关节提取，RGB和骨骼动作特征提取，最后进行特征融合得到动作指令概率，从而摆脱对外部硬件设备的依赖，并且对于光照和场景变化有强鲁棒性。The action recognition method based on a first-person perspective provided in the embodiment of the present invention acquires RGB video frames to be processed, where the RGB video frames to be processed contain hand action image information based on a first-person perspective; inputs all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain the corresponding hand joint point position information; selects a predetermined number of target RGB video frames from all the RGB video frames to be processed and inputs them into an I3D model for recognition to obtain the corresponding video frame features; inputs the hand joint point position information into an AGCN model to obtain the corresponding position information features; and fuses the video frame features and the position information features in one-to-one correspondence to obtain the probability of the recognized action instruction. Hand skeleton joints are first extracted from the video frames, RGB and skeleton action features are then extracted, and feature fusion finally yields the action instruction probability, thereby removing the dependence on external hardware devices while remaining strongly robust to illumination and scene changes.

请参阅图6,示出了一种基于第一人称视角的动作识别装置实施例的结构框图,装置包括:Please refer to FIG. 6, which shows a structural block diagram of an embodiment of an action recognition device based on a first-person perspective. The device includes:

获取模块101,用于获取待处理RGB视频帧;所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息;The obtaining module 101 is used to obtain the RGB video frame to be processed; the RGB video frame to be processed includes hand movement image information based on the first-person perspective;

第一输入模块102,用于将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络,得到对应的手部关节点位置信息;The first input module 102 is used to input all the RGB video frames to be processed to the pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;

挑选模块103,用于从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧,并输入I3D模型中识别,得到对应的视频帧特征;The selection module 103 is used to select a predetermined number of target RGB video frames from all the RGB video frames to be processed, and input them into the I3D model for identification to obtain corresponding video frame features;

第二输入模块104,用于将所述手部关节点位置信息输入AGCN模型,得到对应的位置信息特征;The second input module 104 is used to input the position information of the joint points of the hand into the AGCN model to obtain the corresponding position information features;

融合模块105,用于将所述视频帧特征和所述位置信息特征一一对应融合,得到识别动作指令的概率。The fusion module 105 is configured to fuse the features of the video frame and the features of the location information in a one-to-one correspondence to obtain the probability of recognizing an action instruction.

在一个可选实施例中,所述HOPE-Net深度神经网络包括:ResNet10网络和自适应图U-Net网络;第一输入模块102包括:In an optional embodiment, the HOPE-Net deep neural network includes: a ResNet10 network and an adaptive graph U-Net network; the first input module 102 includes:

编码子模块,用于通过ResNet网络对所有所述RGB视频帧进行编码和预测,得到多个对应的目标平面直角坐标点;The encoding submodule is used to encode and predict all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points;

第一输入子模块,用于将多个所述目标平面直角坐标点输入自适应U-Net 网络,得到所述RGB视频帧对应的手部关节点位置信息。The first input sub-module is configured to input a plurality of Cartesian coordinate points of the target plane into the adaptive U-Net network to obtain the position information of the hand joint points corresponding to the RGB video frame.

在一个可选实施例中,所述编码子模块包括:In an optional embodiment, the encoding submodule includes:

编码单元,用于对所有所述RGB视频帧进行编码,得到编码后的视频帧;An encoding unit, configured to encode all the RGB video frames to obtain encoded video frames;

预测单元,用于对所有所述编码后的视频帧进行预测,得到对应的初始平面直角坐标点;A prediction unit, configured to predict all the encoded video frames to obtain corresponding initial plane Cartesian coordinate points;

卷积单元,用于将所有所述初始平面直角坐标点和对应的RGB视频帧进行卷积,得到目标平面直角坐标点。The convolution unit is configured to convolve all the initial plane Cartesian coordinate points with the corresponding RGB video frames to obtain the target plane Cartesian coordinate points.

在一个可选实施例中,所述获取模块101包括:In an optional embodiment, the acquisition module 101 includes:

获取子模块,用于获取待处理视频;所述待处理视频中包含有基于第一人称视角的手部动作影像信息;The obtaining sub-module is used to obtain the video to be processed; the video to be processed contains hand movement image information based on the first-person perspective;

转换子模块,用于通过OpenCV将所述手部动作影像信息转换所述待处理RGB视频帧。The conversion sub-module is used to convert the hand motion image information into the RGB video frame to be processed through OpenCV.

在一个可选实施例中,所述融合模块105包括:In an optional embodiment, the fusion module 105 includes:

连接子模块，用于通过预先建立的关系图卷积网络，分析所述视频帧特征和所述位置信息特征的距离关系，并基于所述距离关系创建每一个所述视频帧特征与每一个位置信息特征的连接；A connection submodule, configured to analyze the distance relationship between the video frame features and the position information features through a pre-established relational graph convolution network, and to create a connection between each video frame feature and each position information feature based on the distance relationship;

第二输入子模块，用于分别将所述视频帧特征和所述位置信息特征输入卷积网络，得到卷积后的视频帧特征和卷积后的位置信息特征；A second input submodule, configured to respectively input the video frame features and the position information features into convolution networks to obtain convolved video frame features and convolved position information features;

融合子模块，用于将处于同一连接的卷积后的视频帧特征和卷积后的位置信息特征融合，得到融合后的信息特征输入全连接层网络，得到所述识别动作指令的概率。A fusion submodule, configured to fuse the convolved video frame features and the convolved position information features on the same connection, and to input the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.

对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.

所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的装置和模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the above-described devices and modules can refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

在本申请所提供的几个实施例中，应该理解到，所揭露的装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative; the division of the units is only a logical functional division, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-OnlyMemory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on such an understanding, the essence of the technical solution of the present invention or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, and other media that can store program codes.

以上所述，以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present invention rather than to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1.一种基于第一人称视角的动作识别方法，其特征在于，包括：An action recognition method based on a first-person perspective, characterized by comprising:
获取待处理RGB视频帧；所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息；acquiring RGB video frames to be processed, where the RGB video frames to be processed contain hand action image information based on a first-person perspective;
将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络，得到对应的手部关节点位置信息；inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧，并输入I3D模型中识别，得到对应的视频帧特征；selecting a predetermined number of target RGB video frames from all the RGB video frames to be processed and inputting them into an I3D model for recognition to obtain corresponding video frame features;
将所述手部关节点位置信息输入AGCN模型，得到对应的位置信息特征；inputting the hand joint point position information into an AGCN model to obtain corresponding position information features;
将所述视频帧特征和所述位置信息特征一一对应融合，得到识别动作指令的概率；fusing the video frame features and the position information features in one-to-one correspondence to obtain the probability of the recognized action instruction;
所述HOPE-Net深度神经网络包括：ResNet10网络和自适应图U-Net网络；将所有所述RGB视频帧输入到预先训练的HOPE-Net深度神经网络，得到对应的手部关节点位置信息，包括：wherein the HOPE-Net deep neural network includes a ResNet10 network and an adaptive graph U-Net network, and inputting all the RGB video frames into the pre-trained HOPE-Net deep neural network to obtain the corresponding hand joint point position information includes:
通过ResNet网络对所有所述RGB视频帧进行编码和预测，得到多个对应的目标平面直角坐标点；encoding and predicting all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points;
将多个所述目标平面直角坐标点输入自适应U-Net网络，得到所述RGB视频帧对应的手部关节点位置信息；inputting the plurality of target plane Cartesian coordinate points into the adaptive U-Net network to obtain the hand joint point position information corresponding to the RGB video frames;
通过ResNet网络对所有所述RGB视频帧进行编码和预测，得到多个对应的目标平面直角坐标点，包括：wherein encoding and predicting all the RGB video frames through the ResNet network to obtain the plurality of corresponding target plane Cartesian coordinate points includes:
对所有所述RGB视频帧进行编码，得到编码后的视频帧；encoding all the RGB video frames to obtain encoded video frames;
对所有所述编码后的视频帧进行预测，得到对应的初始平面直角坐标点；predicting all the encoded video frames to obtain corresponding initial plane Cartesian coordinate points;
将所有所述初始平面直角坐标点和对应的RGB视频帧进行卷积，得到目标平面直角坐标点；convolving all the initial plane Cartesian coordinate points with the corresponding RGB video frames to obtain the target plane Cartesian coordinate points;
将所述视频帧特征和所述位置信息特征一一对应融合，得到识别动作指令的概率，包括：and wherein fusing the video frame features and the position information features in one-to-one correspondence to obtain the probability of the recognized action instruction includes:
通过预先建立的关系图卷积网络，分析所述视频帧特征和所述位置信息特征的距离关系，并基于所述距离关系创建每一个所述视频帧特征与每一个位置信息特征的连接；analyzing the distance relationship between the video frame features and the position information features through a pre-established relational graph convolution network, and creating a connection between each video frame feature and each position information feature based on the distance relationship;
分别将所述视频帧特征和所述位置信息特征输入卷积网络，得到卷积后的视频帧特征和卷积后的位置信息特征；respectively inputting the video frame features and the position information features into convolution networks to obtain convolved video frame features and convolved position information features;
将处于同一连接的卷积后的视频帧特征和卷积后的位置信息特征融合，得到融合后的信息特征输入全连接层网络，得到所述识别动作指令的概率。fusing the convolved video frame features and the convolved position information features on the same connection, and inputting the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.

2.根据权利要求1所述的基于第一人称视角的动作识别方法，其特征在于，获取待处理RGB视频帧，包括：The action recognition method based on a first-person perspective according to claim 1, wherein acquiring the RGB video frames to be processed includes:
获取待处理视频；所述待处理视频中包含有基于第一人称视角的手部动作影像信息；acquiring a video to be processed, where the video to be processed contains hand action image information based on a first-person perspective;
通过OpenCV将所述手部动作影像信息转换所述待处理RGB视频帧。converting the hand action image information into the RGB video frames to be processed through OpenCV.

3.一种基于第一人称视角的动作识别装置，其特征在于，包括：An action recognition device based on a first-person perspective, characterized by comprising:
获取模块，用于获取待处理RGB视频帧；所述待处理RGB视频帧包含基于第一人称视角的手部动作图像信息；an acquisition module, configured to acquire RGB video frames to be processed, where the RGB video frames to be processed contain hand action image information based on a first-person perspective;
第一输入模块，用于将所有所述待处理RGB视频帧输入到预先训练的HOPE-Net深度神经网络，得到对应的手部关节点位置信息；a first input module, configured to input all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
挑选模块，用于从所有所述待处理RGB视频帧中挑选预定数量的目标RGB视频帧，并输入I3D模型中识别，得到对应的视频帧特征；a selection module, configured to select a predetermined number of target RGB video frames from all the RGB video frames to be processed and input them into an I3D model for recognition to obtain corresponding video frame features;
第二输入模块，用于将所述手部关节点位置信息输入AGCN模型，得到对应的位置信息特征；a second input module, configured to input the hand joint point position information into an AGCN model to obtain corresponding position information features;
融合模块，用于将所述视频帧特征和所述位置信息特征一一对应融合，得到识别动作指令的概率；a fusion module, configured to fuse the video frame features and the position information features in one-to-one correspondence to obtain the probability of the recognized action instruction;
所述HOPE-Net深度神经网络包括：ResNet10网络和自适应图U-Net网络；所述第一输入模块包括：wherein the HOPE-Net deep neural network includes a ResNet10 network and an adaptive graph U-Net network, and the first input module includes:
编码子模块，用于通过ResNet网络对所有所述RGB视频帧进行编码和预测，得到多个对应的目标平面直角坐标点；an encoding submodule, configured to encode and predict all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane Cartesian coordinate points;
第一输入子模块，用于将多个所述目标平面直角坐标点输入自适应U-Net网络，得到所述RGB视频帧对应的手部关节点位置信息；a first input submodule, configured to input the plurality of target plane Cartesian coordinate points into the adaptive U-Net network to obtain the hand joint point position information corresponding to the RGB video frames;
所述编码子模块包括：wherein the encoding submodule includes:
编码单元，用于对所有所述RGB视频帧进行编码，得到编码后的视频帧；an encoding unit, configured to encode all the RGB video frames to obtain encoded video frames;
预测单元，用于对所有所述编码后的视频帧进行预测，得到对应的初始平面直角坐标点；a prediction unit, configured to predict all the encoded video frames to obtain corresponding initial plane Cartesian coordinate points;
卷积单元，用于将所有所述初始平面直角坐标点和对应的RGB视频帧进行卷积，得到目标平面直角坐标点；a convolution unit, configured to convolve all the initial plane Cartesian coordinate points with the corresponding RGB video frames to obtain the target plane Cartesian coordinate points;
所述融合模块包括：and wherein the fusion module includes:
连接子模块，用于通过预先建立的关系图卷积网络，分析所述视频帧特征和所述位置信息特征的距离关系，并基于所述距离关系创建每一个所述视频帧特征与每一个位置信息特征的连接；a connection submodule, configured to analyze the distance relationship between the video frame features and the position information features through a pre-established relational graph convolution network, and to create a connection between each video frame feature and each position information feature based on the distance relationship;
第二输入子模块，用于分别将所述视频帧特征和所述位置信息特征输入卷积网络，得到卷积后的视频帧特征和卷积后的位置信息特征；a second input submodule, configured to respectively input the video frame features and the position information features into convolution networks to obtain convolved video frame features and convolved position information features;
融合子模块，用于将处于同一连接的卷积后的视频帧特征和卷积后的位置信息特征融合，得到融合后的信息特征输入全连接层网络，得到所述识别动作指令的概率。a fusion submodule, configured to fuse the convolved video frame features and the convolved position information features on the same connection, and to input the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.

4.根据权利要求3所述的基于第一人称视角的动作识别装置，其特征在于，所述获取模块包括：The action recognition device based on a first-person perspective according to claim 3, wherein the acquisition module includes:
获取子模块，用于获取待处理视频；所述待处理视频中包含有基于第一人称视角的手部动作影像信息；an acquisition submodule, configured to acquire a video to be processed, where the video to be processed contains hand action image information based on a first-person perspective;
转换子模块，用于通过OpenCV将所述手部动作影像信息转换所述待处理RGB视频帧。a conversion submodule, configured to convert the hand action image information into the RGB video frames to be processed through OpenCV.
CN202110430314.5A 2021-04-21 2021-04-21 An action recognition method and device based on a first-person perspective Active CN113312966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110430314.5A CN113312966B (en) 2021-04-21 2021-04-21 An action recognition method and device based on a first-person perspective

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110430314.5A CN113312966B (en) 2021-04-21 2021-04-21 An action recognition method and device based on a first-person perspective

Publications (2)

Publication Number Publication Date
CN113312966A CN113312966A (en) 2021-08-27
CN113312966B true CN113312966B (en) 2023-08-08

Family

ID=77372648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110430314.5A Active CN113312966B (en) 2021-04-21 2021-04-21 An action recognition method and device based on a first-person perspective

Country Status (1)

Country Link
CN (1) CN113312966B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113902995B (en) * 2021-11-10 2024-04-02 中国科学技术大学 A multimodal human behavior recognition method and related equipment
CN114973424A (en) * 2022-08-01 2022-08-30 深圳市海清视讯科技有限公司 Feature extraction model training method, hand action recognition method, device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame
WO2019237708A1 (en) * 2018-06-15 2019-12-19 山东大学 Interpersonal interaction body language automatic generation method and system based on deep learning
CN112163480A (en) * 2020-09-16 2021-01-01 北京邮电大学 Behavior recognition method and device
WO2021056516A1 (en) * 2019-09-29 2021-04-01 深圳市大疆创新科技有限公司 Method and device for target detection, and movable platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019237708A1 (en) * 2018-06-15 2019-12-19 山东大学 Interpersonal interaction body language automatic generation method and system based on deep learning
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame
WO2021056516A1 (en) * 2019-09-29 2021-04-01 深圳市大疆创新科技有限公司 Method and device for target detection, and movable platform
CN112163480A (en) * 2020-09-16 2021-01-01 北京邮电大学 Behavior recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
多维度自适应3D卷积神经网络原子行为识别 (Atomic behavior recognition with a multi-dimensional adaptive 3D convolutional neural network); 高大鹏 (Gao Dapeng); 朱建刚 (Zhu Jiangang); 计算机工程与应用 Computer Engineering and Applications (04); pp. 179-183 *

Also Published As

Publication number Publication date
CN113312966A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN111598951B (en) Method, device and storage medium for identifying space target
US11940774B2 (en) Action imitation method and robot and computer readable storage medium using the same
CN111190981B (en) A method, device, electronic device and storage medium for constructing a three-dimensional semantic map
JP2021144679A (en) System, computer implemented method, program for predicting vision-based joint action and posture motion
CN107886069A (en) A kind of multiple target human body 2D gesture real-time detection systems and detection method
KR101347840B1 (en) Body gesture recognition method and apparatus
US20210312321A1 (en) Method, system, and medium for identifying human behavior in a digital video using convolutional neural networks
CN109934881B (en) Image coding method, motion recognition method and computer equipment
CN113312966B (en) An action recognition method and device based on a first-person perspective
Zhang et al. Skeleton-RGB integrated highly similar human action prediction in human–robot collaborative assembly
CN111667535A (en) Six-degree-of-freedom pose estimation method for occlusion scene
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
CN103179401A (en) Method and device for multi-agent collaborative video acquisition and image information stitching processing
CN111652181A (en) Target tracking method and device and electronic equipment
CN116935486A (en) Sign language identification method and system based on skeleton node and image mode fusion
CN112917470A (en) Teaching method, device and system of manipulator, storage medium and equipment
CN113761965B (en) Motion capture method, motion capture device, electronic equipment and storage medium
CN117689887A (en) Workpiece grabbing method, device, equipment and storage medium based on point cloud segmentation
KR20210026664A (en) Apparatus, method and computer program for categorizing motion of object
CN114648687B (en) An object pose prediction algorithm based on deep learning without depth information
CN112307799A (en) Gesture recognition method, device, system, storage medium and device
CN116883961A (en) Target perception method and device
Hou et al. Towards real-time embodied AI agent: A bionic visual encoding framework for mobile robotics
Ogbodo et al. Study of a multi-modal neurorobotic prosthetic arm control system based on recurrent spiking neural network
CN112307801A (en) Posture recognition method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant