CN111222459B

CN111222459B - Visual angle independent video three-dimensional human body gesture recognition method

Info

Publication number: CN111222459B
Application number: CN202010010324.9A
Authority: CN
Inventors: 邱丰; 马利庄
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2023-05-12
Anticipated expiration: 2040-01-06
Also published as: CN111222459A

Abstract

The invention relates to a visual angle independent video three-dimensional human body gesture recognition method, which comprises the following steps: step 1: virtual data generation phase: synthesizing virtual camera parameters based on a human body posture data set containing three-dimensional labels at will, and then generating a two-dimensional/three-dimensional data tuple; step 2: model training stage: training a first module of a modularized neural network for obtaining a model with camera view generalization capability and a second module of the modularized neural network for obtaining a model capable of protecting inter-frame motion continuity by using the generated two-dimensional/three-dimensional data elements respectively; step 3: unconstrained video reasoning phase: and (3) predicting the video acquired by any unconstrained acquisition by utilizing the multi-module depth neural network trained in the step (2) to acquire a three-dimensional human body posture recognition result. Compared with the prior art, the method is based on the modularized neural network combined training method, and the generalization capability of three-dimensional human body gesture recognition is effectively improved.

Description

A view-independent method for 3D human pose recognition in video

技术领域technical field

本发明涉及计算机视觉技术领域中的三维人体姿态识别技术，尤其是涉及一种针对视频任务的未知视角数据合成、模块化神经网络训练及预处理方法，即视角无关的视频三维人体姿态识别方法。The present invention relates to three-dimensional human body posture recognition technology in the field of computer vision technology, in particular to a method for unknown viewing angle data synthesis, modularized neural network training and preprocessing for video tasks, that is, a viewing angle-independent video three-dimensional human body posture recognition method.

背景技术Background technique

近几十年来，随着人工智能和深度学习有关技术的发展，人体姿态识别这一课题也取得了长足的进步。视频人体姿态识别，特别是针对视频的三维人体姿态识别长期以来一直是计算机视觉和智能人机交互领域的重要内容；它融合了数字图像处理、人机交互、计算机图形学、计算机视觉等多个学科，并伴随着安防监控网络、智能机器人、智能手机、平板电脑等便携式移动电子设备的普及，进一步融入人们的生活。In recent decades, with the development of artificial intelligence and deep learning related technologies, the subject of human gesture recognition has also made great progress. Video human pose recognition, especially 3D human pose recognition for video has long been an important content in the field of computer vision and intelligent human-computer interaction; it integrates digital image processing, human-computer interaction, computer graphics, computer vision, etc. With the popularity of portable mobile electronic devices such as security monitoring networks, intelligent robots, smart phones, and tablet computers, it has further integrated into people's lives.

现有的三维人体姿态识别算法按照预测目标往往可以分为单阶段人体姿态识别和多阶段人体姿态识别：前者一般是指使用RGB或RGB-D图片直接回归得到三维人体关键点位置或参数化模型的参数信息，此类方法优点是利用了图片中更多的隐藏信息，实验室环境下精度较高，但囿于带三维标注的RGB图片数据缺失，无法脱离实验室采集环境，因而泛化能力较差，难以转化为易用性强的产品产生商业价值；后者一般是指分阶段地先估计出二维人体姿态的关键点位置，再估计出三维人体姿态，这类方法的优点在于可以利用人工标注方式通过采集大量互联网无约束图片进行二维人体姿态估计部分的训练，而二维到三维的预测问题也经由Martinez等人的论文证明了是一个相对容易完成的任务。本发明为了利于转化总体上沿用了多阶段人体姿态识别的架构，然而在已有较强二维人体关键点检测模型的基础上，现有方法仍会因为受限于三维采集数据的视角缺失，比较容易过拟合到数据集相机参数上。The existing 3D human body pose recognition algorithm can be divided into single-stage human body pose recognition and multi-stage human body pose recognition according to the prediction target: the former generally refers to the direct regression of RGB or RGB-D images to obtain the key point position or parameterized model of the 3D human body The parameter information of this method has the advantage of using more hidden information in the picture, and the accuracy is higher in the laboratory environment. However, due to the lack of RGB image data with three-dimensional annotations, it cannot be separated from the laboratory collection environment, so the generalization ability Poor, it is difficult to transform into a product with strong usability to generate commercial value; the latter generally refers to estimating the key point positions of the two-dimensional human body posture in stages, and then estimating the three-dimensional human body posture. The advantage of this method is that it can Using manual labeling to collect a large number of Internet unconstrained pictures for 2D human body pose estimation training, and the 2D to 3D prediction problem has also been proved to be a relatively easy task by Martinez et al. In order to facilitate the transformation, the present invention generally uses the framework of multi-stage human body posture recognition. However, on the basis of the existing strong two-dimensional human body key point detection model, the existing method is still limited by the lack of perspective of the three-dimensional collected data. It is easier to overfit to the data set camera parameters.

发明内容Contents of the invention

本发明的目的就是为了克服上述现有技术存在的缺陷而提供一种视角无关的视频三维人体姿态识别方法，提出虚拟视角合成方法，利用相机视角增强模块生成随机视角，配合相机投影关系获得二维/三维数据元组，用于神经网络多模块训练和泛化能力验证；此外提出利用二维人体姿态检测框的方式归一化三维预测的输入，使得无约束环境下的三维人体姿态估计方法脱离相机内外参限制，因此具有更强的泛化能力。The purpose of the present invention is to overcome the above-mentioned defects in the prior art and to provide an angle-independent video three-dimensional body pose recognition method, and to propose a virtual angle synthesis method, which uses a camera angle enhancement module to generate a random angle of view, and cooperates with the camera projection relationship to obtain a two-dimensional /3D data tuple, used for neural network multi-module training and generalization ability verification; in addition, it is proposed to use the 2D human body posture detection frame to normalize the input of 3D prediction, so that the 3D human body pose estimation method in an unconstrained environment is separated from The internal and external parameters of the camera are limited, so it has stronger generalization ability.

本发明的目的可以通过以下技术方案来实现：The purpose of the present invention can be achieved through the following technical solutions:

一种视角无关的视频三维人体姿态识别方法，该识别方法包括：A viewing angle-independent video 3D human gesture recognition method, the recognition method comprising:

步骤1：虚拟数据生成阶段：基于任意包含三维标注的人体姿态数据集，合成虚拟相机参数后生成二维/三维数据元组；Step 1: Virtual data generation stage: based on any human body pose dataset containing 3D annotations, generate 2D/3D data tuples after synthesizing virtual camera parameters;

步骤2：模型训练阶段：利用生成的二维/三维数据元组分别训练用于获得具有相机视角泛化能力的模型的模块化神经网络第一模块和用于获得能够保护帧间动作连续性的模型的模块化神经网络第二模块；Step 2: Model training phase: Use the generated 2D/3D data tuples to train the first module of the modular neural network for obtaining a model with camera view generalization capability and the first module for obtaining a model that can protect the continuity of motion between frames. Model the second module of the modular neural network;

步骤3：无约束视频推理阶段：对于任意无约束采集得到的视频，通过利用步骤2训练得到的多模块深度神经网络预测得到三维人体姿态识别结果。Step 3: Unconstrained video inference stage: For any unconstrained collected video, the 3D human pose recognition result is obtained by using the multi-module deep neural network trained in step 2 to predict.

进一步地，所述的步骤1具体包括：对于任意包含三维标注的人体姿态数据集，采用相机视角增强模块合成虚拟相机参数，并利用投影关系生成二维/三维数据元组。Further, the step 1 specifically includes: for any human body pose dataset containing 3D annotations, using a camera view enhancement module to synthesize virtual camera parameters, and using projection relationships to generate 2D/3D data tuples.

进一步地，所述的相机参数包括决定相机位置和朝向的外参和决定相机投影焦距画幅的内参。Further, the camera parameters include external parameters for determining the position and orientation of the camera and internal parameters for determining the focal length and frame size of the camera projection.

进一步地，所述的步骤2中的第一模块使用单帧数据元组进行视角增强的训练。Further, the first module in step 2 uses single frame data tuples to perform viewing angle enhancement training.

进一步地，所述的步骤2中的第二模块使用连续序列的数据元组进行时序模型训练。Further, the second module in step 2 uses continuous sequence of data tuples to perform time series model training.

所述第一模块与所述第二模块只要满足第一模块是一个单帧的二维到三维的预测模块，且第二模块是一个时序的三维到三维的修正模块即可，第一第二模块串联起来完成二维到三维的预测。The first module and the second module only need to satisfy that the first module is a single-frame 2D to 3D prediction module, and the second module is a sequential 3D to 3D correction module. The modules are connected in series to complete the prediction from 2D to 3D.

进一步地，所述的步骤2与步骤3中，在输入神经网络前，还包括对二维检测结果进行相机无关的二维检测归一化预处理过程，其对应描述公式为：Further, in step 2 and step 3, before inputting the neural network, it also includes a camera-independent two-dimensional detection normalization preprocessing process for the two-dimensional detection results, and its corresponding description formula is:

式中，K_x,y表示二维检测归一化预处理后的二维点坐标，

表示原二维点坐标，

表示二维检测框的中心坐标，w^d,h^d分别为二维检测框的宽度和高度。In the formula, K _{x, y} represent the two-dimensional point coordinates after two-dimensional detection normalization preprocessing,

Indicates the original two-dimensional point coordinates,

Represents the center coordinates of the two-dimensional detection frame, w ^d , h ^d are the width and height of the two-dimensional detection frame, respectively.

进一步地，所述的步骤3中的无约束采集得到的视频具体包括自然条件采集，或经过缩放、裁剪、变速、和其他颜色调整变换的视频序列。Further, the video obtained by the unconstrained acquisition in step 3 specifically includes acquisition under natural conditions, or a video sequence that has been scaled, cropped, changed in speed, and transformed by other color adjustments.

与现有技术相比，本发明具有以下优点：Compared with the prior art, the present invention has the following advantages:

(1)本发明提出的视角无关的视频三维人体姿态识别方法，其虚拟数据生成阶段，以假设合理的随机视角取代原有固定视角训练中对数据集采集时使用的相机视角，克服了数据集相机内外参依赖；其模型训练阶段，模块化设计既可以分别训练两个单独模块，也可以串行地对视频流数据元组进行完整训练，两个单独模块的任务目的明确，可单独验证，泛化能力强。(1) In the perspective-independent video three-dimensional human body posture recognition method proposed by the present invention, in the virtual data generation stage, a reasonable random perspective is used to replace the camera perspective used in the original fixed perspective training to overcome the data set The internal and external parameters of the camera are dependent; in the model training phase, the modular design can train two separate modules separately, or complete training on video stream data tuples serially. The tasks of the two separate modules are clear and can be verified separately. Strong generalization ability.

(2)且因为本发明利用了时序模型训练，可以在控制感受野的基础上获得较长时间的预测提示；其无约束视频推理阶段，因有效的归一化框设计和选取，解耦合了投影关系依赖，对于互联网采集的大量缺少相机参数、人物比例极端(往往表现为人物在原始视频内尺度比例过小)、经过裁切等处理的视频都能取得较好的预测效果。(2) And because the present invention utilizes time series model training, it can obtain a longer period of prediction prompts on the basis of controlling the receptive field; in its unconstrained video reasoning stage, due to the effective normalization frame design and selection, decoupling Depending on the projection relationship, good prediction results can be obtained for a large number of videos collected from the Internet that lack camera parameters, have extreme proportions of people (often manifested as the proportion of people in the original video is too small), and have been processed by cropping.

(3)本发明提出了一种视角无关的视频三维人体姿态识别方法，使用大量经过相机视角增强的二维/三维数据元组对多模块神经网络进行训练，同时利用一种相机无关的二维检测归一化方法进行二维输入的预处理；本发明中的第一模块可以适应无约束的三维人体姿态估计任务，获得较强的相机泛化能力，第二模块可以有效利用时间序列上的连续特征，使得预测关键点获得较好的空间稳定性，并使得整个预测达到较为理想的精度。(3) The present invention proposes a viewing angle-independent video 3D human pose recognition method, using a large number of 2D/3D data tuples enhanced by camera viewing angles to train the multi-module neural network, and using a camera-independent 2D The detection and normalization method performs preprocessing of two-dimensional input; the first module in the present invention can adapt to the unconstrained three-dimensional human body pose estimation task, and obtains a strong camera generalization ability; the second module can effectively use the time series The continuous features make the prediction key points obtain better spatial stability, and make the whole prediction achieve a more ideal accuracy.

附图说明Description of drawings

图1为本发明的方法结构流程图；Fig. 1 is a method structure flowchart of the present invention;

图2为本发明的方法中的相机参数生成时的旋转(姿态角)控制示意图；Fig. 2 is a schematic diagram of rotation (attitude angle) control when the camera parameters in the method of the present invention are generated;

图3为本发明的方法中的第一模块神经网络和第二模块神经网络的结构实例图；Fig. 3 is the structural example diagram of the first module neural network and the second module neural network in the method of the present invention;

图4为本发明方法中的投影关系示意图；Fig. 4 is a schematic diagram of the projection relationship in the method of the present invention;

图5为本发明方法中的二维检测框归一化方法示意图。Fig. 5 is a schematic diagram of a method for normalizing two-dimensional detection frames in the method of the present invention.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明的一部分实施例，而不是全部实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例，都应属于本发明保护的范围。The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts shall fall within the protection scope of the present invention.

如图1所示为本发明一种视角无关的视频三维人体姿态识别方法的整体结构流程图，主要包括以下三个阶段：虚拟数据生成阶段，模型训练阶段，无约束视频推理阶段，此外还包含一个在训练和推理两阶段均会使用到的相机无关的二维检测归一化方法；As shown in Figure 1, it is a flow chart of the overall structure of a viewing angle-independent video three-dimensional human body gesture recognition method according to the present invention, which mainly includes the following three stages: virtual data generation stage, model training stage, unconstrained video reasoning stage, and also includes A camera-independent 2D detection normalization method used in both training and inference phases;

虚拟数据生成阶段：对于任意公开的三维人体姿态学术数据集，或动作捕捉系统采集得到的三维人体姿态数据集，通过本发明提出的相机视角合成原理和投影变换生成对应的二维投影和三维立体姿态数据元组；Virtual data generation stage: For any public 3D human body pose academic data set, or the 3D human body pose data set collected by the motion capture system, the corresponding 2D projection and 3D stereoscopic data are generated through the camera perspective synthesis principle and projection transformation proposed by the present invention attitude data tuple;

模型训练阶段：训练神经网络模型，对于本发明提出的第一模块，使用大规模视角增强单帧数据元组进行训练，获得较好的视角抗性并与相机参数解耦合；对于本发明提出的第二模块，使用包含三维标注的视频流数据进行时序学习与预测，获得时序上的空间连续性，提升姿态识别准确度；Model training stage: train the neural network model, for the first module proposed by the present invention, use large-scale viewing angle enhancement single-frame data tuples for training, obtain better viewing angle resistance and decouple from camera parameters; The second module uses video stream data containing 3D annotations for time-series learning and prediction to obtain spatial continuity in time series and improve the accuracy of gesture recognition;

无约束视频推理阶段：对于一般环境下(in-the-wild)获取得到的视频数据流，和基于任意二维人体关键点检测的模块，通过本发明提出的一种特殊的二维人体检测关键点结果的归一化方法进行预处理，处理后的二维数据依次通过第一、第二模块前向预测，得到基于三维关键点表示的人体姿态。Unconstrained video reasoning stage: For the video data stream obtained in the general environment (in-the-wild), and the module based on any two-dimensional human key point detection, a special two-dimensional human body detection key point proposed by the present invention The normalization method of the point results is used for preprocessing, and the processed two-dimensional data is sequentially predicted through the first and second modules to obtain the human body posture based on the three-dimensional key point representation.

本发明所提及人体姿态数据表示方法主要为关键点-骨骼表示法；本发明提出的第一模块主要用于提升视角泛化能力，第二模块主要用于获得时序预测具有更大感受野和良好稳定性的特性。本发明所描述的场景包括但不限于涉及视频人体姿态识别的研究和应用。本发明基于模块化神经网络组合训练方法，有效提高三维人体姿态识别的泛化能力。The human body posture data representation method mentioned in the present invention is mainly the key point-skeleton representation method; the first module proposed by the present invention is mainly used to improve the generalization ability of the viewing angle, and the second module is mainly used to obtain time series prediction with a larger receptive field and Characterized by good stability. The scenarios described in the present invention include, but are not limited to, research and applications involving video human gesture recognition. The invention is based on a modularized neural network combination training method, which effectively improves the generalization ability of three-dimensional human body posture recognition.

其中，虚拟数据生成阶段的方法进程，适用范围包括但不限于公开的学术数据集、动作捕捉系统采集数据集等，只要有三维标注和相机参数(即只要存在二维/三维投影关系)均适用。Among them, the method process in the stage of virtual data generation includes but is not limited to public academic datasets, datasets collected by motion capture systems, etc., as long as there are 3D annotations and camera parameters (that is, as long as there is a 2D/3D projection relationship). .

模型训练阶段与无约束视频推理阶段各自的方法进程中定义的第一、第二模块，其实现不限于本发明说明书中举例实现，凡适用于基于单帧二维人体检测预测三维人体姿态的、和基于时序模型预测连续序列三维人体姿态的神经网络等，均可替代本发明所指代的第一、第二模块。The implementation of the first and second modules defined in the respective method processes of the model training stage and the unconstrained video inference stage is not limited to the example implementation in the description of the present invention, and any one that is applicable to predicting a three-dimensional human body posture based on single-frame two-dimensional human body detection, And the neural network for predicting the continuous sequence of three-dimensional human posture based on the time series model, etc., can replace the first and second modules referred to in the present invention.

无约束视频推理阶段的方法进程，适用于无约束视频，即自然条件采集，或经过包括但不限于缩放、裁剪、变速、和其他颜色调整变换的视频序列。The process of the method in the unconstrained video inference stage is suitable for unconstrained videos, that is, natural conditions acquisition, or video sequences that have undergone transformations including but not limited to scaling, cropping, variable speed, and other color adjustments.

模型训练阶段和无约束视频推理阶段中作为特殊预处理阶段的归一化方法，均适用于无约束视频：即便人物和相机的相对投影位置关系被破坏，只要有可检出的二维人体关键点结果或检测模块，本发明所述的方法均适用。The normalization method used as a special preprocessing stage in the model training stage and the unconstrained video inference stage is applicable to unconstrained videos: even if the relative projection position relationship between the characters and the camera is destroyed, as long as there are detectable two-dimensional human key Point results or detection modules, the method of the present invention is applicable.

进一步地，本发明方法中各阶段的具体流程细节如下：Further, the specific process details of each stage in the method of the present invention are as follows:

本发明提出的视角无关的视频三维人体姿态识别方法中，所述虚拟数据生成阶段进一步包括：对于已有三维人体姿态数据的数据集，通过设计合理的随机方案生成若干不同的相机参数，包括决定相机位置和朝向的外参和决定相机投影焦距画幅的内参。在随机参数的基础上不断利用投影关系，为三维人体数据生成不同相机参数下的对应二维人体数据，进而获得二维/三维人体姿态数据元组；对于视频数据集，还应考虑视频序列内被观测人体的运动范围，依据运动轨迹获得合理的内参，使得投影视锥尽可能包含三维人体关键点的运动点集。In the perspective-independent video three-dimensional human posture recognition method proposed by the present invention, the virtual data generation stage further includes: for the existing data set of three-dimensional human posture data, a number of different camera parameters are generated by designing a reasonable random scheme, including determining The external parameters of the camera's position and orientation and the internal parameters that determine the focal length of the camera's projection frame. On the basis of random parameters, the projection relationship is continuously used to generate corresponding 2D human body data under different camera parameters for 3D human body data, and then obtain 2D/3D human body pose data tuples; for video data sets, the video sequence should also be considered The range of motion of the observed human body is based on the motion trajectory to obtain a reasonable internal reference, so that the projection frustum contains the moving point set of the key points of the three-dimensional human body as much as possible.

本发明提出的视角无关的视频三维人体姿态识别方法中，所述模型训练阶段进一步包括：分模块的神经网络训练。即对于第一模块，既可以使用单帧、也可以使用连续、但主要是使用单帧数据元组进行视角增强的训练，以获得具有相机视角泛化能力的模型；对于第二模块，主要使用连续序列的数据元组进行时序模型训练，以获得能够保护帧间动作连续性的模型。需要注意：一般地，对于RGB视频输入，第一模块可以理解为一个二维到三维的回归问题，第二模块是一个时序连续三维到时序连续三维的回归问题；对于RGB-D视频输入，额外的深度纬度可以同时增加至第一、第二模块。In the perspective-independent video 3D human gesture recognition method proposed by the present invention, the model training stage further includes: sub-module neural network training. That is, for the first module, either single frame or continuous can be used, but mainly single-frame data tuples are used for viewing angle enhancement training to obtain a model with camera viewing angle generalization ability; for the second module, mainly use The continuous sequence of data tuples is used for temporal model training to obtain a model that can preserve the continuity of motion between frames. Note: Generally, for RGB video input, the first module can be understood as a two-dimensional to three-dimensional regression problem, and the second module is a time-series continuous three-dimensional to time-series continuous three-dimensional regression problem; for RGB-D video input, additional The depth latitude can be added to the first and second modules at the same time.

本发明提出的视角无关的视频三维人体姿态识别方法中，所述无约束视频推理阶段进一步包括：先经由任意已知二维人体关键点检测方法得到二维检出结果，接着使用本发明所述相机无关的二维检测归一化方法进行数据预处理，再顺次经由本发明所述第一、第二模块前向推理，得到视频序列对应的三维人体姿态估计结果。In the perspective-independent video three-dimensional human body gesture recognition method proposed by the present invention, the unconstrained video reasoning stage further includes: first obtain the two-dimensional detection result through any known two-dimensional human body key point detection method, and then use the method described in the present invention The camera-independent two-dimensional detection normalization method performs data preprocessing, and then sequentially passes through the first and second modules of the present invention for forward reasoning to obtain the corresponding three-dimensional human body pose estimation result of the video sequence.

本发明提出的视角无关的视频三维人体姿态识别方法中，所述相机无关的二维检测归一化方法进一步包括：使用二维人体关键点检出框(或乘以适当系数)作为归一化标准，对二维人体关键点输入进行归一化，此种归一化方法具有较好的相机抗性，对因图片局部缩放、裁切等带来的投影关系丢失、破坏具有较好的抗性。In the angle-of-view-independent video three-dimensional human body posture recognition method proposed by the present invention, the camera-independent two-dimensional detection normalization method further includes: using the two-dimensional human body key point detection frame (or multiplying by an appropriate coefficient) as a normalization Standard, to normalize the input of key points of the two-dimensional human body. This normalization method has better camera resistance, and has better resistance to the loss and destruction of the projection relationship caused by local zooming and cropping of the picture. sex.

以下结合具体实施例，对本发明所提出的此种视角无关的视频三维人体姿态识别方法作具体说明。The view-independent video three-dimensional human gesture recognition method proposed by the present invention will be specifically described below in conjunction with specific embodiments.

在本发明方法的第一阶段，虚拟数据生成阶段：首先依据现有三维人体姿态数据集，如Human3.6M这一公开学术数据集，针对每一段视频的三维人体世界坐标生成一个相机参数随机配置，该配置会依据被观测人体身高、活动范围设定相机位置和旋转角度，如：以人物活动范围在地平面投影的均值为中心点、以人物身高的0.75为观测球心、以人物身高的0.5倍为高斯半径，随机确定一点作为相机光轴方向的经过点O；以4.0米到6.5米内的均匀分布随机选取相机距离O点欧式距离，相机姿态角如图2所示，固定roll(Rotate oncamera’s direction vector相机摄像方向上的旋转)角不动，随机在-15度到+15度之间生成pitch(Rotate on the cross product of the other camera’s up and directionvectors相机横向方向上的旋转)角，随机在0度到+360度之间生成yaw(Rotate on camera’s up vector相机竖直方向上的旋转)角；因Human3.6M数据集自带相机内参，此处可暂不做内参生成。每次采样时都重新随机生成上述取值，并利用投影关系得到二维/三维人体姿态数据元组。In the first stage of the method of the present invention, the stage of virtual data generation: first, according to the existing three-dimensional human body posture data set, such as the public academic data set Human3.6M, a random configuration of camera parameters is generated for the three-dimensional human body world coordinates of each section of video , this configuration will set the camera position and rotation angle according to the height and activity range of the observed human body. 0.5 times the Gaussian radius, randomly determine a point as the passing point O in the direction of the camera's optical axis; randomly select the Euclidean distance from the camera to O point with a uniform distribution within 4.0 meters to 6.5 meters, the camera attitude angle is shown in Figure 2, fixed roll(Rotate oncamera's direction vector The angle of rotation in the camera direction of the camera's direction vector is fixed, and the pitch (Rotate on the cross product of the other camera's up and directionvectors) angle is randomly generated between -15 degrees and +15 degrees, randomly Generate yaw (Rotate on camera's up vector camera vertical rotation) angle between 0 degrees and +360 degrees; because the Human3.6M dataset comes with camera internal parameters, internal reference generation can not be done here temporarily. The above-mentioned values are randomly generated every time sampling, and the two-dimensional/three-dimensional human body posture data tuple is obtained by using the projection relationship.

在本发明的第二阶段，模型训练阶段：对于第一模块(camera-agnosticregressor)采用随机单帧数据元组进行训练，本实现采用基于深度神经网络的带有残差连接的两次迭代回归模型，依据二维人体姿态关键点回归得到三维人体姿态关键点，具有相机无关、强视角泛化的特性；对于第二模块(temporal regressor)基于时序上的空洞卷积模型进行改进，设计一个三维姿态修正网络，用于利用空洞卷积方法可以扩大感受野的特性增加时序上三维预测结果的空间连续性，起到补偿第一模块预测结果的作用。训练过程的两部分监督所用损失函数如下所示。In the second stage of the present invention, the model training stage: for the first module (camera-agnosticregressor), a random single-frame data tuple is used for training, and this implementation uses a two-time iterative regression model with a residual connection based on a deep neural network , according to the key point regression of the 2D human body posture, the key points of the 3D human body posture are obtained, which has the characteristics of camera independence and strong viewing angle generalization; the second module (temporal regressor) is improved based on the hollow convolution model in time series, and a 3D posture is designed The modified network is used to use the hole convolution method to expand the characteristics of the receptive field to increase the spatial continuity of the three-dimensional prediction results in time series, and to compensate for the prediction results of the first module. The loss function used for the two-part supervision of the training process is shown below.

式中，L为总损失，

和

分别为第一模块和第二模块各自的权重，

和

分别为第一模块和第二模块各自的损失。In the formula, L is the total loss,

and

are the respective weights of the first module and the second module,

and

are the respective losses of the first module and the second module, respectively.

在本发明的第三阶段，无约束视频推理阶段：对于任意无约束条件采集得到的视频，在已有二维检测结果的基础上，利用本发明第二阶段训练收敛得到的多模块深度神经网络，顺次预测得到三维人体姿态结果。本举例所述第一、第二模块的实现，以及推理过程如图3，图3中的single-frame和multi-frame分别表示单张和多张二维框架。In the third stage of the present invention, the unconstrained video reasoning stage: for any video collected without constraints, on the basis of the existing two-dimensional detection results, use the multi-module deep neural network obtained by the second stage of the present invention to train and converge , sequentially predicted to obtain the 3D human pose result. The implementation of the first and second modules described in this example, as well as the reasoning process are shown in Figure 3. The single-frame and multi-frame in Figure 3 represent a single and multiple two-dimensional frames, respectively.

在本发明的第二、第三阶段所使用的前述相机无关的二维检测归一化方法：如图4所示，图4中的principal point和optical point分别表示投影前后的二维检测框中心点，一般情况下的投影关系和传统方法使用原图画幅像素尺寸或原图外接正方形尺寸作为归一化标准的方法，其计算函数如下所示：The aforementioned camera-independent two-dimensional detection normalization method used in the second and third stages of the present invention: as shown in Figure 4, the principal point and optical point in Figure 4 represent the center of the two-dimensional detection frame before and after projection respectively In general, the projection relationship and the traditional method use the pixel size of the original image frame or the size of the circumscribed square of the original image as the normalization standard. The calculation function is as follows:

可见其依赖相机参数(焦距)并且其等价形式也依赖原图尺寸，缺少裁剪等变换抗性。本发明提出的归一化方法如图5所示，具体计算函数如下所示，具有可以保持二维检出大小稳定、且与相机参数无关的特点：It can be seen that it depends on the camera parameters (focal length) and its equivalent form also depends on the size of the original image, lacking transformation resistance such as cropping. The normalization method proposed by the present invention is shown in Figure 5, and the specific calculation function is as follows, which has the characteristics that the two-dimensional detection size can be kept stable and has nothing to do with camera parameters:

式中，K_x,y表示二维检测归一化预处理后的二维点坐标，

表示原二维点坐标，

表示二维检测框的中心坐标，m大于1，本实施例取1.2，w^d,h^d分别为二维检测框的宽度和高度。In the formula, K _{x, y} represent the two-dimensional point coordinates after two-dimensional detection normalization preprocessing,

Indicates the original two-dimensional point coordinates,

Indicates the center coordinates of the two-dimensional detection frame, m is greater than 1, and this embodiment takes 1.2, w ^d , h ^d are the width and height of the two-dimensional detection frame, respectively.

以上所述，仅为本发明的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明揭露的技术范围内，可轻易想到各种等效的修改或替换，这些修改或替换都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应以权利要求的保护范围为准。The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can easily think of various equivalents within the technical scope disclosed in the present invention. Modifications or replacements shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention should be based on the protection scope of the claims.

Claims

1. A visual angle-independent video three-dimensional human gesture recognition method, characterized in that, the recognition method comprises:

Step 1: Virtual data generation stage: based on any human body pose dataset containing 3D annotations, generate 2D/3D data tuples after synthesizing virtual camera parameters;

Step 2: Model training phase: Use the generated 2D/3D data tuples to train the first module of the modular neural network for obtaining a model with camera view generalization capability and the first module for obtaining a model that can protect the continuity of motion between frames. Model the second module of the modular neural network;

Step 3: Unconstrained video inference stage: For any unconstrained collected video, the 3D human pose recognition result is obtained by using the multi-module deep neural network trained in step 2 to predict.

2. A view-independent video three-dimensional human pose recognition method according to claim 1, wherein said step 1 specifically comprises: for any human pose data set containing three-dimensional annotations, using a camera viewing angle enhancement module to synthesize Virtual camera parameters, and generate 2D/3D data tuples using projection relations.

3. A view-independent video three-dimensional human gesture recognition method according to claim 2, wherein said camera parameters include external parameters for determining the position and orientation of the camera and internal parameters for determining the focal length and frame size of the camera projection.

4. A kind of viewpoint-independent video three-dimensional human gesture recognition method according to claim 1, characterized in that, the first module in the step 2 uses single-frame data tuples to carry out the training of viewpoint enhancement.

5. A view-independent video three-dimensional human gesture recognition method according to claim 1, characterized in that, the second module in the step 2 uses continuous sequence data tuples to perform time-series model training.

6. A view-independent video three-dimensional human gesture recognition method according to claim 1, characterized in that, in the step 2 and step 3, before inputting the neural network, it also includes performing a camera test on the two-dimensional detection results. The irrelevant two-dimensional detection normalization preprocessing process, its corresponding description formula is:

In the formula, K _{x, y} represent the two-dimensional point coordinates after two-dimensional detection normalization preprocessing,

Indicates the original two-dimensional point coordinates,

7. A kind of angle-of-view-independent video three-dimensional human body gesture recognition method according to claim 1, characterized in that, the video obtained by the unconstrained collection in the step 3 specifically includes natural condition collection, or after zooming, cutting, Variable speed, and other color adjustments to transform video sequences.