CN115810219A - Three-dimensional gesture tracking method based on RGB camera - Google Patents
Three-dimensional gesture tracking method based on RGB camera
- Publication number: CN115810219A (Application CN202211650785.8A)
- Authority
- CN
- China
- Prior art keywords
- hand
- model
- joint
- rgb camera
- optimal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention proposes a three-dimensional gesture tracking method based on an RGB camera. The method comprises the following steps: taking an RGB camera as input, the pictures are normalized and sent to a 3D joint detection module, which extracts image features, predicts preliminary 3D joint positions, and computes the hand bone lengths; a particle swarm optimization algorithm iteratively fits the hand bone lengths to the MANO hand model to find the optimal hand shape and obtain a new set of 3D joint positions; the new 3D joint positions are fed into an analytical inverse kinematics module to solve for the pose parameters required by the MANO model; the MANO model combines the optimal hand shape with the pose parameters θ to obtain the final joint positions and mesh vertices; and the resulting mesh vertices and joint positions are reconstructed and rendered to produce the final real-time 3D hand tracking pose. The method of the invention achieves both high prediction accuracy and high-precision 3D gesture tracking, while the tracking process and its results remain easy to implement.
Description
Technical Field
The invention belongs to the technical field of gesture recognition, and in particular relates to a three-dimensional gesture tracking method based on an RGB camera.
Background Art
The hand is one of the most frequently used parts of the human body in everyday life, and also one of the most dynamic. Hands can assume a wide variety of postures and thereby convey a wealth of information. Capturing the motion state of the hand is very important for applications in virtual reality, augmented reality, human-computer interaction, and related fields. In recent years many researchers have therefore taken up this problem and made some progress.
Because depth cameras were widely available, many early researchers estimated hand pose by fitting a generative model to depth images. Tompson et al. combined CNNs with random decision forests and inverse kinematics to estimate hand pose in real time from a single depth image. Wan et al. used unlabeled depth images for self-supervised fine-tuning, while Mueller et al. constructed a photorealistic dataset for better robustness. Other researchers have extracted point clouds and 3D voxels from depth images for their studies.
Because depth sensors are expensive, power-hungry, and impose strict requirements on the experimental environment, more and more work has turned to 3D hand pose estimation from monocular RGB images. Zimmermann and Brox trained a CNN-based model that estimates 3D joint coordinates directly from RGB images. Iqbal et al. used a 2.5D heat-map formulation that encodes 2D joint locations together with depth information, greatly improving accuracy. Many researchers exploit synthetic image datasets to broaden the diversity seen during training: Mueller et al. proposed a large-scale rendered dataset post-processed by a CycleGAN to bridge the domain gap. However, these works focus only on joint position estimation without recovering joint rotations. Ge et al. used a graph CNN to regress the hand mesh directly, but this requires a special dataset with ground-truth hand meshes, and such model-free approaches are only moderately effective in challenging scenes. Approaches that fully exploit existing datasets of different modalities, both image and non-image data, fare better. Using image data with 2D or 3D annotations as well as 3D motion-capture data without corresponding images, Zhou et al. proposed a 3D hand joint detection module and an inverse kinematics module that not only regresses the 3D joint positions but also addresses joint rotation, a method with clear prospects in computer vision and graphics applications. However, Zhou's method does not consider the optimal matching between the 3D joint positions and the MANO hand model, and its inverse kinematics module is algorithmically somewhat complex and not easy to implement.
Among the above methods, some use depth images and some use RGB color images, but on the one hand they do not fully exploit the image features of both 2D and 3D image information, and on the other hand, because the network models are not advanced enough or the methods are too complicated, the prediction accuracy of the 3D hand coordinates is not very high, or the 3D gesture tracking quality is mediocre. In addition, there remain problems such as hand occlusion and poor real-time performance of gesture tracking.
Summary of the Invention
The main purpose of the present invention is to design a three-dimensional gesture tracking method that improves prediction accuracy and achieves high-precision 3D gesture tracking, while keeping the tracking process and its results easy to implement.
To achieve the above object, the present invention provides a three-dimensional gesture tracking method based on an RGB camera, comprising the following steps:
Step 1. Taking an RGB camera as input, normalize the pictures and send the processed pictures to the 3D joint detection module, which extracts image features, predicts preliminary 3D joint positions with a convolutional neural network, and computes the hand bone lengths;
Step 2. Use the particle swarm optimization algorithm to iteratively fit the hand bone lengths to the MANO hand model, finding the optimal hand shape and obtaining a new set of 3D joint positions;
Step 3. Feed the new 3D joint positions into the analytical inverse kinematics module to solve for the pose parameters θ required by the MANO model;
Step 4. Use the MANO model, combining the optimal hand shape with the pose parameters θ, to obtain the final joint positions and mesh vertices; and
Step 5. Reconstruct and render the resulting mesh vertices and joint positions to obtain the final real-time 3D hand tracking pose.
A further improvement of the present invention is that the 3D joint detection module uses a ResNet50-based neural network model comprising a feature extractor, a 2D detector, and a 3D detector; ResNet50 augmented with an attention mechanism serves as the feature extractor, whose input is an image with a resolution of 128×128 and whose output is a feature volume F of size 32×32×256.
A further improvement of the present invention is that the 2D detector is a two-layer CNN that takes the feature volume F and outputs heat maps H for the 21 joints, the heat maps H being used for 2D pose estimation; the 3D detector first uses a two-layer CNN to estimate a delta map D from the heat maps H and the feature volume F, then H, F, and D are concatenated and fed into another two-layer CNN to obtain the final location maps L, from which the 3D hand joint positions are estimated.
A further improvement of the present invention is that the MANO model in Step 4 is a 3D parametric model that forms a complete hand kinematic chain from 16 joints plus 5 fingertip points taken from the mesh vertices.
A further improvement of the present invention is that Step 5 uses Open3D to reconstruct the hand mesh vertices.
A further improvement of the present invention is that the template in the MANO model is a flat, open-palm pose; from the template shape function Bs(β) and pose function Bp(θ), the deformed hand template T is obtained, which is then skinned using the pose parameters θ, the skinning weights ω, and the joint positions J(θ), with the following mathematical expression:
M(θ,β) = W(T(θ,β), θ, ω, J(θ)).
A further improvement of the present invention is that the particle swarm optimization algorithm first initializes a swarm of random particles and then finds the optimal solution over multiple iterations; in each iteration, a particle updates itself by tracking two extrema, and after both optima are found, the particle updates its velocity and position by the following formulas:

v_id^(k+1) = ω·v_id^k + c1·r1·(p_id^k − x_id^k) + c2·r2·(p_gd^k − x_id^k)
x_id^(k+1) = x_id^k + v_id^(k+1)

where i = 1, 2, ..., N, N is the swarm size, d is the dimension index, k is the iteration count, ω is the inertia weight, c1 is the individual learning factor, c2 is the social learning factor, and r1, r2 are random numbers in the interval [0, 1] that add randomness to the search; v_id^k is the d-th component of particle i's velocity vector at iteration k, x_id^k is the d-th component of its position vector, p_id^k is the historical best position of particle i in dimension d, i.e. the best solution found by particle i after k iterations, and p_gd^k is the historical best position of the swarm in dimension d, i.e. the best solution in the entire swarm after k iterations.
Beneficial effects of the present invention: combining the particle swarm optimization algorithm with the MANO model to recover the hand shape improves the accuracy of gesture tracking and gives good tracking even under self-occlusion of the moving hand and occlusion by objects; solving the pose parameters with analytical inverse kinematics streamlines the algorithm structure, keeps the overall complexity under control, and provides good real-time performance. High-precision real-time 3D gesture tracking is thus achieved.
Brief Description of the Drawings
Fig. 1 is the overall framework diagram of the RGB-camera-based three-dimensional gesture tracking method of the present invention.
Fig. 2 is the network model diagram of the 3D joint detection module of the present invention.
Fig. 3 shows the experimental results of the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
It should be emphasized that, in describing the present invention, the various formulas and constraints are distinguished by consistent labels, but the use of different labels for the same formula and/or constraint is not excluded; this is done to illustrate the features of the present invention more clearly.
The present invention proposes a gesture tracking method that is based on a convolutional neural network, incorporates a particle swarm optimization algorithm, and combines analytical inverse kinematics. The method uses both 2D and 3D image data, so the model can be trained on richer image features. The raw data are preprocessed and fed into a convolutional neural network augmented with an attention mechanism, yielding highly accurate 3D hand joint positions; the particle swarm optimization algorithm then iteratively fits the MANO model to find the optimal hand shape, and analytical inverse kinematics derives the hand pose parameters required by the MANO model. Combining analytical inverse kinematics lets the method better exploit the motion characteristics of the hand, while PSO-based iterative fitting of the bone lengths against the MANO hand model parameter file finds the optimal hand shape. With the algorithmic complexity kept under control, a high-precision three-dimensional gesture tracking effect is ultimately achieved.
The RGB-camera-based three-dimensional gesture tracking method of the present invention mainly comprises the following steps:
Step 1. Acquire the video stream, process each picture, send the processed picture to the 3D joint detection module, extract image features, predict preliminary 3D joint positions with a convolutional neural network, and compute the hand bone lengths;
Step 2. Use the particle swarm optimization algorithm to iteratively fit the hand bone lengths to the MANO hand model, finding the optimal hand shape and obtaining a new set of 3D joint positions;
Step 3. Feed the new 3D joint positions into the analytical inverse kinematics module to solve for the pose parameters θ required by the MANO model;
Step 4. Use the MANO model, combining the optimal hand shape with the pose parameters θ, to obtain the final joint positions and mesh vertices; and
Step 5. Reconstruct and render the resulting mesh vertices and joint positions to obtain the final real-time 3D hand tracking pose.
The present invention is described in detail below with reference to the accompanying drawings.
Step 1. Acquire the video stream, process each picture, send the processed picture to the 3D joint detection module, extract image features, predict preliminary 3D joint positions with a convolutional neural network, and compute the hand bone lengths.

In Step 1, an RGB three-channel camera serves as the input; the pictures are cropped and normalized, and the processed pictures are sent to the 3D joint detection module, which extracts image features, predicts preliminary 3D hand joint positions with a convolutional neural network, and computes the bone lengths. The hardware is an ordinary RGB three-channel camera whose video stream is the input and is processed frame by frame: each frame is first uniformly cropped to a size of 128×128, and the three channels are then standardized using the per-channel mean and standard deviation.
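As a rough sketch of the per-frame preprocessing described above (the function name, the center-crop strategy, and the concrete normalization statistics are illustrative assumptions, not taken from the patent), the frame preparation might look like:

```python
import numpy as np

def preprocess_frame(frame, size=128,
                     mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Center-crop an H x W x 3 RGB frame to a square, resize it to
    size x size, and standardize each channel with the given per-channel
    mean/std (ImageNet-style values, assumed here)."""
    h, w, _ = frame.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = frame[top:top + s, left:left + s]
    # Nearest-neighbour resize via index sampling (avoids extra dependencies).
    idx = (np.arange(size) * s // size).astype(int)
    resized = crop[idx][:, idx].astype(np.float32) / 255.0
    return (resized - np.asarray(mean)) / np.asarray(std)
```

Each standardized 128×128×3 frame would then be fed to the 3D joint detection module.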
As shown in Fig. 2, the 3D joint detection module consists of a feature extractor, a 2D detector, and a 3D detector. ResNet50 augmented with an attention mechanism serves as the feature extractor; its input is an image with a resolution of 128×128 and its output is a feature volume F of size 32×32×256. The 2D detector is a two-layer CNN that takes the feature volume F and outputs heat maps H for the 21 joints; the heat maps H are used for 2D pose estimation, giving the 2D coordinates of the 21 hand joints:

P2d = [[x1, y1], [x2, y2], ..., [x21, y21]]^T.
A heat map is generated for each joint from its 2D image coordinates using a two-dimensional Gaussian distribution:

f(x, y) = exp(−((4x − xi)^2 + (4y − yi)^2) / (2σ^2)),

where σ determines the radius of the heat-map peak, and f(x, y) is the probability that joint i, with image coordinates (xi, yi), lies at image coordinates [4x, 4y]. The position of the maximum response in the i-th heat map corresponds to the 2D image coordinates of the i-th joint.
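The heat-map generation described above can be sketched as follows; the 32×32 layout at stride 4 follows the description (a 128×128 input yields 32×32 heat maps), while the function itself and the value of σ are illustrative assumptions:

```python
import numpy as np

def joint_heatmap(joint_xy, hm_size=32, stride=4, sigma=6.0):
    """2D Gaussian heat map for one joint.

    joint_xy: (xi, yi) in 128x128 image coordinates. Heat-map cell (x, y)
    corresponds to image coordinate (4x, 4y), so the peak response lands
    on the cell nearest joint_xy / stride. sigma (the heat-map radius,
    in image pixels) is an assumed value.
    """
    ys, xs = np.mgrid[0:hm_size, 0:hm_size]
    d2 = (xs * stride - joint_xy[0]) ** 2 + (ys * stride - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

The 2D joint estimate is then read back as the argmax of each of the 21 heat maps.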
The 3D detector first uses a two-layer CNN to estimate a delta map D from the heat maps H and the feature volume F; H, F, and D are then concatenated and fed into another two-layer CNN to obtain the final location maps L, from which the 3D hand joint positions are estimated. The 3D hand joint positions are the 3D coordinates of the 21 hand joints in a coordinate system relative to the wrist joint: P3d = [[x1, y1, z1], [x2, y2, z2], ..., [x21, y21, z21]]^T. The hand skeleton constrains the size of the hand, its range of motion, and the distances between joints. Any bone of the skeleton can be represented as a vector bij between the i-th and j-th hand joints; the length |bij| of the bone vector corresponds to the bone's length, and its direction gives the bone's orientation. The whole hand has 20 bone vectors in total, so the bone lengths of the whole hand can be expressed as a matrix BL ∈ R^((J−1)×1), with J = 21.
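The 20 bone lengths BL can be read off the 21 predicted joints as below. The particular joint ordering (wrist first, then four joints per finger) is an assumption for illustration, since the text does not fix an index layout:

```python
import numpy as np

# Assumed joint ordering: 0 = wrist, then 4 joints per finger
# (thumb, index, middle, ring, little). PARENT[i] is the parent of
# joint i + 1, giving the 20 bone vectors b_ij of the kinematic chain.
PARENT = [0, 1, 2, 3,      # thumb
          0, 5, 6, 7,      # index
          0, 9, 10, 11,    # middle
          0, 13, 14, 15,   # ring
          0, 17, 18, 19]   # little

def bone_lengths(joints):
    """joints: (21, 3) array of 3D joint positions, wrist at index 0.
    Returns B_L, the (20, 1) matrix of bone lengths |b_ij|."""
    j = np.asarray(joints, dtype=float)
    bones = j[1:] - j[np.asarray(PARENT)]   # one bone vector per non-wrist joint
    return np.linalg.norm(bones, axis=1, keepdims=True)
```

These measured lengths are the quantity that the PSO stage fits against the MANO model.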
Step 2. Use the particle swarm optimization algorithm to iteratively fit the hand bone lengths to the MANO hand model, finding the optimal hand shape and obtaining a new set of 3D joint positions.
The particle swarm optimization algorithm is a population-based stochastic optimization technique. It first initializes a swarm of random particles (random solutions) and then finds the optimal solution over multiple iterations. In each iteration, a particle updates itself by tracking two extrema; after both optima are found, the particle updates its velocity and position by the following formulas:

v_id^(k+1) = ω·v_id^k + c1·r1·(p_id^k − x_id^k) + c2·r2·(p_gd^k − x_id^k)
x_id^(k+1) = x_id^k + v_id^(k+1)

where i = 1, 2, ..., N, N is the swarm size, d is the dimension index, k is the iteration count, ω is the inertia weight, c1 is the individual learning factor, c2 is the social learning factor, and r1, r2 are random numbers in the interval [0, 1] that add randomness to the search; v_id^k is the d-th component of particle i's velocity vector at iteration k, x_id^k is the d-th component of its position vector, p_id^k is the historical best position of particle i in dimension d, i.e. the best solution found by particle i after k iterations, and p_gd^k is the historical best position of the swarm in dimension d, i.e. the best solution in the entire swarm after k iterations.
The termination condition of the particle swarm optimization algorithm depends on the specific problem and is usually either a maximum number of iterations or a required accuracy; in the present invention, after many comparative experiments, the maximum number of iterations was set to 150. In each iteration, a particle updates itself by tracking the two extrema.
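The PSO loop above can be sketched as follows. The hyper-parameter values and the toy objective are assumptions: a real objective here would measure the mismatch between the detected bone lengths and those produced by a candidate MANO shape, which requires the MANO parameter file.

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=150,
        w=0.7, c1=1.5, c2=1.5, bound=3.0, seed=0):
    """Minimal particle swarm optimizer (hyper-parameters assumed).
    Updates follow the standard formulas:
      v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x <- x + v."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-bound, bound, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pcost = np.array([objective(p) for p in x])
    gbest = pbest[np.argmin(pcost)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, -bound, bound)
        cost = np.array([objective(p) for p in x])
        better = cost < pcost
        pbest[better], pcost[better] = x[better], cost[better]
        gbest = pbest[np.argmin(pcost)].copy()
    return gbest, float(pcost.min())

# Toy stand-in objective: squared distance between a candidate bone-length
# vector and lengths measured from the detected joints (assumed data).
target = np.linspace(1.0, 2.0, 10)
best, err = pso(lambda b: float(np.sum((b - target) ** 2)), dim=10)
```

With 30 particles and 150 iterations this converges tightly on such a smooth objective; the patent fixes only the 150-iteration budget.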
The MANO model is a mainstream parametric model for hand pose estimation. It forms a complete hand kinematic chain from 16 joints plus 5 fingertip points taken from the mesh vertices; combined with the pose parameters, the hand shape can be recovered from the MANO model. The initial MANO mesh template represents the mesh vertex positions of the standard MANO model surface at rest, in a flat, open-palm pose. From the MANO template shape function Bs(β) and pose function Bp(θ), the deformed hand template T is obtained; it is then skinned using the pose parameters θ, the skinning weights ω, and the joint positions J(θ), with the following mathematical expression:

M(θ,β) = W(T(θ,β), θ, ω, J(θ))
Step 3. Feed the new 3D joint positions into the analytical inverse kinematics module to solve for the pose parameters θ required by the MANO model.
The 3D joint coordinates explain the hand pose to some extent, but they are not sufficient to represent a 3D hand model, so the joint rotations must be derived from the joint coordinates. The analytical inverse kinematics solves for the pose parameters θ by decomposing each rotation into a twist and a swing; these pose parameters are finally used by the MANO model to recover the hand shape and pose.
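The twist–swing decomposition mentioned above can be illustrated with quaternions. This is a minimal sketch under assumptions: the (w, x, y, z) quaternion convention and the helper functions are illustrative, not the patent's actual implementation.

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of two (w, x, y, z) quaternions."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([pw*qw - px*qx - py*qy - pz*qz,
                     pw*qx + px*qw + py*qz - pz*qy,
                     pw*qy - px*qz + py*qw + pz*qx,
                     pw*qz + px*qy - py*qx + pz*qw])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def twist_swing(q, axis):
    """Split a unit quaternion q into q = swing * twist, where twist
    rotates about the unit bone `axis` and swing is perpendicular to it
    (standard swing-twist decomposition)."""
    w, v = q[0], np.asarray(q[1:], dtype=float)
    a = np.asarray(axis, dtype=float)
    proj = np.dot(v, a) * a                  # component of v along the axis
    twist = np.array([w, *proj])
    n = np.linalg.norm(twist)
    twist = twist / n if n > 1e-9 else np.array([1.0, 0.0, 0.0, 0.0])
    swing = quat_mul(q, quat_conj(twist))    # swing = q * twist^{-1}
    return twist, swing
```

For a finger joint, the twist component is the rotation about the bone axis and the swing component is what bends the joint toward its next 3D position.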
Step 4. Use the MANO model, combining the optimal hand shape with the pose parameters θ, to obtain the final joint positions and mesh vertices.
Following the overall model-generation pipeline above, once the pose parameters θ and the optimal hand shape are given, the corresponding mesh shape can be generated by the MANO model.
Step 5. Reconstruct and render the resulting mesh vertices and joint positions to obtain the final real-time 3D hand tracking pose.
In Step 5, Open3D is used to reconstruct the hand mesh vertices.
Two experiments were carried out for the design of the present invention.
The first experiment tests whether the method improves 3D joint detection accuracy. Our model was trained on an Nvidia DGX with Tesla P100-SXM2 GPUs, using three training datasets: CMU HandDB, the Rendered Handpose Dataset, and the GANerated Hands Dataset. The test set comprises four datasets: the Rendered Handpose Dataset, the EgoDexter Dataset, the STB Dataset, and the DexterObject Dataset.
The evaluation metrics are the percentage of correct keypoints (PCK) and the area under the curve (AUC). Computing PCK requires setting a 3D joint error threshold c: a joint whose 3D error is less than c is considered correctly detected, and the PCK value is the proportion of correctly estimated joints among all joints. At the same threshold, a higher PCK indicates better performance. Different thresholds c yield different PCK values; plotting PCK (vertical axis) against the threshold c (horizontal axis) gives the PCK-versus-threshold curve, and the area between this curve and the horizontal axis gives the AUC. A higher AUC indicates more accurate pose estimation.
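The two metrics above can be computed as follows (a sketch; the threshold range used for the AUC integration is an assumed convention, as it is not stated in the text):

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of correct keypoints: fraction of joints whose
    3D error is below the threshold c."""
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return float(np.mean(err < threshold))

def auc(pred, gt, t_min=0.0, t_max=50.0, steps=100):
    """Area under the PCK-vs-threshold curve, normalized to [0, 1].
    The 0-50 mm threshold range is an assumed convention."""
    ts = np.linspace(t_min, t_max, steps)
    pcks = np.array([pck(pred, gt, t) for t in ts])
    # Trapezoidal integration of the curve, then normalize by the range.
    area = float(np.sum((pcks[1:] + pcks[:-1]) * np.diff(ts) / 2.0))
    return area / (t_max - t_min)
```

For a (21, 3) prediction against ground truth, `pck` gives one point of the curve and `auc` summarizes the whole curve in a single number.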
The experimental results are shown in Table 1; it can be seen that the method of the present invention estimates the hand joint coordinates quite accurately.

Table 1 Experimental results
The second experiment tests the method's real-time hand tracking and its recovery from hand occlusion. The experimental environment was an Intel(R) Xeon(R) CPU E5-2620. We performed actions such as opening the palm, holding a pen, and grasping a cup; the results are shown in Fig. 3. The method shows good 3D gesture tracking ability, with no obvious deformation or self-occlusion artifacts on the hand, and it recovers the hand shape in real time even when the hand is occluded by an object.
In summary, 3D gesture tracking with this method has good real-time performance and handles both self-occlusion of the hand and occlusion by objects well in real time.
The present invention also provides an RGB-camera-based three-dimensional gesture tracking device, comprising a picture acquisition and preprocessing module, a 3D joint detection module, a pose parameter calculation module, a joint acquisition module, and a rendering output module.
图片获取与预处理模块和三维关节点检测模块,用于以RGB摄像头为输入,对图片进行标准化处理,把处理后的图片送入三维关节点检测模块,提取图像特征,利用卷积神经网络预测初步的手部三维关节点并计算骨骼长度;The image acquisition and preprocessing module and the 3D joint point detection module are used to standardize the picture with the RGB camera as input, send the processed picture to the 3D joint point detection module, extract image features, and use the convolutional neural network to predict Preliminary 3D joint points of the hand and calculation of bone length;
姿势参数计算模块,用于利用粒子群优化算法迭代拟合手部骨骼长度和MANO模型,找出最优的手部形状并得到一组新的三维关节点;并把新的三维关节点送入逆向解析运动学模块求解MANO模型所需的姿势参数θ;The posture parameter calculation module is used to use the particle swarm optimization algorithm to iteratively fit the hand bone length and the MANO model, find out the optimal hand shape and obtain a set of new 3D joint points; and send the new 3D joint points into The reverse analysis kinematics module solves the posture parameter θ required by the MANO model;
关节点获取模块,用于利用MANO模型结合最优手部形状和姿势参数θ,得到最终关节点和网格顶点;以及The joint point acquisition module is used to combine the optimal hand shape and posture parameters θ with the MANO model to obtain the final joint points and mesh vertices; and
渲染输出模块,用于对得到的网格顶点和关节点进行重建渲染输出,得到最终的三维手部实时追踪姿态。The rendering output module is used to reconstruct and render the obtained mesh vertices and joint points to obtain the final 3D real-time tracking posture of the hand.
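As a minimal sketch of the inverse analytical kinematics step above: each joint rotation can be recovered as the axis-angle rotation that aligns a template (rest-pose) bone direction with the corresponding bone direction observed in the predicted joints. The vectors here are illustrative; the actual solver operates along the MANO kinematic chain.

```python
import numpy as np

def rotation_between(u, v):
    """Axis-angle rotation taking unit vector u onto unit vector v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    axis = np.cross(u, v)
    s = np.linalg.norm(axis)                  # sin of the angle
    c = np.clip(np.dot(u, v), -1.0, 1.0)      # cos of the angle
    if s < 1e-8:                              # parallel or anti-parallel
        return np.zeros(3) if c > 0 else np.array([np.pi, 0.0, 0.0])
    return axis / s * np.arctan2(s, c)

# Template bone (rest pose) vs. the bone observed from predicted joints
template_bone = np.array([1.0, 0.0, 0.0])
observed_bone = np.array([0.0, 1.0, 0.0])
aa = rotation_between(template_bone, observed_bone)
print(aa)  # pi/2 rotation about the z-axis: [0, 0, 1.5707...]
```

Repeating this per bone, from the wrist outward, yields the per-joint pose parameters θ that the MANO model consumes.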
In the method of the present invention, the hand bone lengths are derived from the 3D hand coordinates predicted by a convolutional neural network model. A particle swarm optimization algorithm fits the bone lengths to the MANO hand model parameter file over 150 iterations to match the optimal hand shape; inverse analytical kinematics then infers the pose parameters from the joint positions, and finally the pose parameters are combined with the MANO model to obtain the final hand tracking pose. In real-world scenarios the algorithm runs in real time and is robust to both self-occlusion and object occlusion of the hand.
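A minimal sketch of the particle swarm fitting step, using a hypothetical linear stand-in for the MANO shape-to-bone-length mapping (the real method fits against the MANO parameter file). The 150 iterations match the text above; the particle count and PSO coefficients are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for MANO: bone lengths as a linear function of a
# 10-dim shape vector beta (mean lengths plus a shape basis).
N_BONES, N_SHAPE = 20, 10
mean_lengths = rng.uniform(2.0, 6.0, N_BONES)
shape_basis = rng.normal(0.0, 0.1, (N_BONES, N_SHAPE))

def bone_lengths(beta):
    return mean_lengths + shape_basis @ beta

# Target lengths, as would be derived from the CNN-predicted joints
target = bone_lengths(rng.normal(size=N_SHAPE))

def cost(beta):
    return float(np.sum((bone_lengths(beta) - target) ** 2))

# Standard PSO: 150 iterations as stated; 30 particles assumed
n_particles, n_iter = 30, 150
w, c1, c2 = 0.7, 1.5, 1.5
pos = rng.normal(0.0, 1.0, (n_particles, N_SHAPE))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_cost = np.array([cost(p) for p in pos])
gbest = pbest[pbest_cost.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos += vel
    costs = np.array([cost(p) for p in pos])
    improved = costs < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
    gbest = pbest[pbest_cost.argmin()].copy()

print(cost(gbest))  # final bone-length fitting error
```

The shape vector `gbest` found this way plays the role of the optimal hand shape that is then combined with the pose parameters θ in the MANO model.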
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from the spirit and scope of the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211650785.8A CN115810219A (en) | 2022-12-21 | 2022-12-21 | Three-dimensional gesture tracking method based on RGB camera |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115810219A true CN115810219A (en) | 2023-03-17 |
Family
ID=85486419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211650785.8A Pending CN115810219A (en) | 2022-12-21 | 2022-12-21 | Three-dimensional gesture tracking method based on RGB camera |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115810219A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117894072A (en) * | 2024-01-17 | 2024-04-16 | 北京邮电大学 | A method and system for hand detection and three-dimensional posture estimation based on diffusion model |
CN117894072B (en) * | 2024-01-17 | 2024-09-24 | 北京邮电大学 | A method and system for hand detection and three-dimensional posture estimation based on diffusion model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188598B (en) | A Real-time Hand Pose Estimation Method Based on MobileNet-v2 | |
Tompson et al. | Real-time continuous pose recovery of human hands using convolutional networks | |
CN103093490B (en) | Based on the real-time face animation method of single video camera | |
CN102184541B (en) | Multi-objective optimized human body motion tracking method | |
CN100543775C (en) | Method of 3D Human Motion Tracking Based on Multi-camera | |
CN110222580A (en) | A kind of manpower 3 d pose estimation method and device based on three-dimensional point cloud | |
CN101777116A (en) | Method for analyzing facial expressions on basis of motion tracking | |
Zhou et al. | Learning to estimate 3d human pose from point cloud | |
CN107358648A (en) | Real-time full-automatic high quality three-dimensional facial reconstruction method based on individual facial image | |
Gou et al. | Cascade learning from adversarial synthetic images for accurate pupil detection | |
CN101499128A (en) | Three-dimensional human face action detecting and tracing method based on video stream | |
CN108629294A (en) | Human body based on deformation pattern and face net template approximating method | |
Chen et al. | Learning a deep network with spherical part model for 3D hand pose estimation | |
CN117671738B (en) | Human body posture recognition system based on artificial intelligence | |
Dibra et al. | Monocular RGB hand pose inference from unsupervised refinable nets | |
Kao et al. | Toward 3d face reconstruction in perspective projection: Estimating 6dof face pose from monocular image | |
CN104408760A (en) | Binocular-vision-based high-precision virtual assembling system algorithm | |
CN113468923B (en) | Human-object interaction behavior detection method based on fine-grained multimodal co-representation | |
CN111914595B (en) | A method and device for 3D pose estimation of human hands based on color images | |
CN118470222B (en) | Medical ultrasonic image three-dimensional reconstruction method and system based on SDF diffusion | |
Wu et al. | An unsupervised real-time framework of human pose tracking from range image sequences | |
CN102663779A (en) | Human motion tracking method based on stochastic Gaussian hidden variables | |
CN116079727A (en) | Method and device for humanoid robot motion imitation based on 3D human body pose estimation | |
Liu et al. | Key algorithm for human motion recognition in virtual reality video sequences based on hidden markov model | |
Song et al. | Spatial-aware dynamic lightweight self-supervised monocular depth estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||