CN115546888A

CN115546888A - A Convolutional Pose Estimation Method Based on Body Part Grouping with Symmetrical Semantic Graphs

Info

Publication number: CN115546888A
Application number: CN202211084071.5A
Authority: CN
Inventors: 毛爱华; 张翘楚; 王梓旭
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2022-09-06
Filing date: 2022-09-06
Publication date: 2022-12-30

Abstract

The invention discloses a symmetric semantic graph convolution pose estimation method based on body part grouping, comprising the following steps: S1, inputting two-dimensional human joint points and their connection relations, and constructing a symmetric semantic graph convolution layer and a non-local layer for the joint-point graph structure; S2, grouping the body parts according to the body trunk, obtaining the local and non-local features of each trunk and of the whole body, and fusing the obtained features; S3, building a body-part-grouped symmetric semantic graph convolution pose estimation network model based on the symmetric semantic graph convolution layer, the non-local layer and the body part grouping; S4, training the network model on the Human3.6M dataset, inputting the two-dimensional human joint points to be estimated into the trained model, and outputting the estimated three-dimensional human joint points. The invention can be applied to fields such as film animation, virtual reality and sports motion analysis; the method achieves better results and improved generalization ability.

Description

A Convolutional Pose Estimation Method Based on Body Part Grouping with Symmetrical Semantic Graphs

Technical Field

The invention relates to the technical field of computer vision, and in particular to a symmetric semantic graph convolution pose estimation method based on body part grouping.

Background

Human pose estimation is widely used in computer vision tasks such as virtual reality, human-computer interaction and action recognition. Thanks to the rapid development of deep learning, the performance of estimating 3D human pose from images has improved markedly, and the topic has become a research hotspot.

Existing 3D pose estimation methods fall into two categories: the first predicts the 3D pose directly from the image; the second first predicts a 2D pose and then regresses the 3D pose from it. Methods of the first category can draw on a large amount of image information, but the model is strongly affected by factors such as image background and clothing, and the features it must learn are complex. Methods of the second category reduce the overall complexity of the task, and the network learns the mapping from 2D to 3D space more easily. Because 2D pose estimation research is already mature, this second category is the more mainstream one.

"A 3D Human Pose Estimation Method Based on Graph Convolutional Network" (CN112712019A) provides a graph-convolution-based 3D human pose estimation method that improves 3D pose regression performance and reduces the number of network parameters, but the generalization ability of its model still needs improvement. In existing research, deep-learning-based human pose estimation algorithms are easily affected by self-occlusion and environmental occlusion, human poses are highly diverse, and the generalization ability of current models is poor. It is therefore urgent to explore more reasonable and more universal network models to improve pose estimation.

Summary of the Invention

The purpose of the present invention is to solve the above problems in the prior art by providing a symmetric semantic graph convolution pose estimation method based on body part grouping.

The purpose of the present invention can be achieved by the following technical solution:

A symmetric semantic graph convolution pose estimation method based on body part grouping, as shown in Figure 1, comprising the following steps:

S1. Input the two-dimensional human joint points and their connection relations from film animation, virtual reality or sports motion, and construct the symmetric semantic graph convolution layer and the non-local layer of the joint-point graph structure;

S2. Group the body parts according to the body trunk, obtain the local and non-local features of each trunk and the local and non-local features of the whole body, and fuse the obtained features;

S3. Based on the symmetric semantic graph convolution layer, the non-local layer and the body part grouping, construct the body-part-grouped symmetric semantic graph convolution pose estimation network model;

S4. Train the symmetric semantic graph convolution pose estimation network model on the Human3.6M dataset, input the two-dimensional human joint points to be estimated into the trained model, and output the estimated three-dimensional human joint points.

Further, in step S1, the symmetric semantic graph convolution layer and the non-local layer of the joint-point graph structure are constructed from the two-dimensional human joint points and their connection relations as follows:

Let $X^{(l)}$ and $X^{(l+1)}$ denote the features of the nodes in the graph structure before and after the $l$-th convolution layer, respectively. The symmetric graph convolution then takes the form:

$$X^{(l+1)} = \sigma\left(W X^{(l)} A^{sym}\right) \tag{1}$$

where $\sigma(\cdot)$ is the activation function, $W$ is a learnable weight parameter, and $A^{sym}$ is the matrix obtained by symmetrically normalizing the adjacency matrix $A$ of the graph:

$$A^{sym} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{2}$$

where $A$ is the adjacency matrix of the graph and $D$ is the degree matrix. Symmetric normalization aggregates the information of neighboring nodes more effectively and yields balanced node features;
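The symmetric normalization can be illustrated with a minimal sketch, assuming the standard form $A^{sym} = D^{-1/2} A D^{-1/2}$. The three-joint chain, the self-loops on the diagonal and the function name `symmetric_normalize` are illustrative assumptions, not taken from the patent.

```python
import math

def symmetric_normalize(adj):
    """Return D^{-1/2} A D^{-1/2} for a square adjacency matrix (nested lists)."""
    n = len(adj)
    # Degree of each node = row sum of the adjacency matrix.
    deg = [sum(row) for row in adj]
    # Isolated nodes (degree 0) get a zero scale to avoid division by zero.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

# Hypothetical 3-joint chain (hip - knee - ankle); self-loops are an assumption.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
A_sym = symmetric_normalize(A)
```

Each entry is scaled by the degrees of both endpoints, so the result stays symmetric and a high-degree joint does not dominate its neighbors.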

A symmetric semantic graph convolution layer is constructed by adding a learnable weighting matrix $M$ to the symmetric graph convolution. The layer is computed as:

$$X^{(l+1)} = \sigma\left(W X^{(l)} \rho_i\left(M \odot A^{sym}\right)\right) \tag{3}$$

where $\rho_i(\cdot)$ is the Softmax nonlinearity, used to normalize the matrix for node $i$, and $\odot$ denotes element-wise multiplication of matrices;
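A minimal sketch of one symmetric semantic graph convolution layer follows. Three things in it are assumptions rather than statements of the patent: the Softmax is restricted to existing edges (entries where $A^{sym}$ is non-zero), every node has at least one edge (for example a self-loop), and features are stored one row per node, so the product is written P·X·W instead of the column layout of formula (3). All names are hypothetical.

```python
import math

def masked_softmax_rows(M, A_sym):
    """Row-wise Softmax of M * A_sym (element-wise), restricted to existing edges."""
    P = []
    for m_row, a_row in zip(M, A_sym):
        scores = [m * a if a != 0 else None for m, a in zip(m_row, a_row)]
        top = max(s for s in scores if s is not None)  # for numerical stability
        exps = [math.exp(s - top) if s is not None else 0.0 for s in scores]
        total = sum(exps)
        P.append([e / total for e in exps])
    return P

def sem_gconv(X, W, M, A_sym):
    """One layer: ReLU(P X W); X has one row per node, W is d_in x d_out."""
    P = masked_softmax_rows(M, A_sym)
    K, d_in, d_out = len(X), len(W), len(W[0])
    # Aggregate neighbour features with the normalized, learned edge weights.
    agg = [[sum(P[i][j] * X[j][c] for j in range(K)) for c in range(d_in)]
           for i in range(K)]
    # Linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][c] * W[c][o] for c in range(d_in)))
             for o in range(d_out)] for i in range(K)]

s = 1.0 / math.sqrt(6)
A_sym = [[0.5, s, 0.0], [s, 1.0 / 3.0, s], [0.0, s, 0.5]]  # 3-joint chain
M = [[1.0] * 3 for _ in range(3)]        # weighting matrix (learned in practice)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 features per joint
W = [[1.0, 0.0], [0.0, 1.0]]              # identity weights for readability
P = masked_softmax_rows(M, A_sym)
Y = sem_gconv(X, W, M, A_sym)
```

Masking keeps non-adjacent joints at exactly zero weight, whereas a plain Softmax over the whole row would give them a positive weight.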

To capture the global features between nodes in the graph, the concept of a non-local layer is introduced. The operation of the non-local layer is defined as:

$$x_i^{(l+1)} = x_i^{(l)} + \frac{W_x}{K} \sum_{j} f\left(x_i^{(l)}, x_j^{(l)}\right) \cdot g\left(x_j^{(l)}\right) \tag{4}$$

where $W_x$ is the normalization factor of the learnable weight parameter $W$, $K$ is the number of nodes, $i$ is the index of the target node to be computed, and $j$ is the index of the nodes other than $i$; $x_i^{(l)}$ and $x_j^{(l)}$ are the input features of nodes $i$ and $j$, respectively; $x_i^{(l+1)}$ is the output feature of node $i$; $f(\cdot,\cdot)$ is a learnable binary function that computes the similarity between two input features; and $g(\cdot)$ is a learnable unary function that transforms the input features.
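The non-local operation can be sketched as below, assuming the common residual form $x_i + \frac{W_x}{K}\sum_{j \ne i} f(x_i, x_j)\, g(x_j)$. The concrete choices of `f` (exponentiated dot product) and `g` (identity), the scalar `w_x` and all function names are illustrative assumptions; in the described method both functions are learnable.

```python
import math

def non_local(X, f, g, w_x=1.0):
    """x_i + (w_x / K) * sum_{j != i} f(x_i, x_j) * g(x_j); one row of X per node."""
    K = len(X)
    G = [g(x) for x in X]  # transformed features g(x_j)
    out = []
    for i in range(K):
        agg = [0.0] * len(G[0])
        for j in range(K):
            if j == i:
                continue  # j indexes the nodes other than i
            w = f(X[i], X[j])  # pairwise similarity
            agg = [a + w * v for a, v in zip(agg, G[j])]
        # Residual connection: input feature plus the scaled global aggregate.
        out.append([x + w_x * a / K for x, a in zip(X[i], agg)])
    return out

def f_sim(a, b):
    """Assumed similarity: exponentiated dot product."""
    return math.exp(sum(p * q for p, q in zip(a, b)))

def g_id(x):
    """Assumed transform: identity."""
    return list(x)

X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Y = non_local(X, f_sim, g_id, w_x=0.1)
```

Unlike the graph convolution, every node contributes to every other node here, regardless of whether the two joints are connected by a bone.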

Further, for the body part grouping in step S2, the human joint points are decomposed into a left limb group, a right limb group and a whole-body group. The joint points within a group are more strongly correlated, and each group extracts features through an independent sub-network to strengthen local relations.

As shown in Figure 4, feature fusion uses late fusion: the features of each group are learned first and then fused. The feature fusion is defined as:

$$f^{fuse} = \mathrm{Concat}\left(f^{left}, f^{right}, f^{all}\right) \tag{5}$$

where $\mathrm{Concat}(\cdot,\cdot,\cdot)$ denotes concatenation of the features, $f^{left}$ is the feature of the left limb group, $f^{right}$ is the feature of the right limb group, $f^{all}$ is the feature of the whole-body group, and $f^{fuse}$ is the feature obtained after fusion.

Body part grouping learns the consistency of local joints while preserving global pose consistency, and therefore generalizes better to symmetric poses in the training data as well as to rare and occluded poses.
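The grouping and late fusion can be sketched as follows. The source does not list which joints belong to each group, so the 17-joint indexing and the index lists `LEFT`, `RIGHT` and `ALL` are hypothetical, and `encode` is a placeholder that only flattens coordinates where the method would run a per-group sub-network.

```python
# Hypothetical joint indices for a 17-joint skeleton (not given in the source).
LEFT = [4, 5, 6, 11, 12, 13]
RIGHT = [1, 2, 3, 14, 15, 16]
ALL = list(range(17))

def select(joints, indices):
    """Pick the joints that belong to one group."""
    return [joints[i] for i in indices]

def encode(group):
    """Placeholder for a group sub-network: flattens the coordinates."""
    return [c for joint in group for c in joint]

def fuse(f_left, f_right, f_all):
    """Late fusion by concatenation, in the order used by formula (5)."""
    return f_left + f_right + f_all

joints_2d = [(float(i), float(i)) for i in range(17)]  # dummy 2D input
f_fuse = fuse(encode(select(joints_2d, LEFT)),
              encode(select(joints_2d, RIGHT)),
              encode(select(joints_2d, ALL)))
```

Because fusion happens only after each group has been encoded, a limb branch never sees joints outside its group, while the whole-body branch preserves the global context.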

Further, in step S3, multiple symmetric semantic graph convolution modules are built from the symmetric semantic graph convolution layer and the non-local layer. All modules have the same structure: each consists of two symmetric semantic graph convolution layers and one non-local layer connected in sequence. Alternating symmetric semantic graph convolution layers with non-local layers captures both the local and the global semantic relations between nodes;

In the symmetric semantic graph convolution network, as shown in Figure 3, one symmetric semantic graph convolution layer and one non-local layer first map the input into the latent space; four sequentially connected symmetric semantic graph convolution modules then produce the encoded features. Every symmetric semantic graph convolution layer in the network is followed by batch normalization and a ReLU nonlinear activation;

The body-part-grouped symmetric semantic graph convolution pose estimation network model comprises a first branch, a second branch and a third branch, as shown in Figure 2, each of which uses a symmetric semantic graph convolution network for feature extraction: the left limb group is input to the first branch, whose network extracts the left limb feature $f^{left}$; the right limb group is input to the second branch, whose network extracts the right limb feature $f^{right}$; the whole-body group is input to the third branch, whose network extracts the whole-body feature $f^{all}$. The fused feature $f^{fuse}$ is computed according to formula (5), and a symmetric semantic graph convolution layer then projects the encoded features into the output space.
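The layer ordering described above can be written down as a small structural sketch. The layer names are hypothetical labels, and leaving batch normalization and ReLU off the final projection layer is an assumption; the sketch lists the stack and does not implement the layers.

```python
def branch_layers(num_blocks=4):
    """Ordered layer names of one branch network."""
    # Input stage: one graph convolution (with BN + ReLU) and one non-local layer.
    layers = ["SymSemGConv", "BatchNorm", "ReLU", "NonLocal"]
    for _ in range(num_blocks):
        # Each module: two graph convolutions, then one non-local layer.
        layers += ["SymSemGConv", "BatchNorm", "ReLU"] * 2 + ["NonLocal"]
    return layers

def model_layout():
    """Three identical branches followed by fusion and an output projection."""
    return {
        "left": branch_layers(),
        "right": branch_layers(),
        "all": branch_layers(),
        "head": ["Concat", "SymSemGConv"],  # formula (5), then output layer
    }

layout = model_layout()
n_conv = layout["left"].count("SymSemGConv")
n_nonlocal = layout["left"].count("NonLocal")
```

Under this reading each branch contains nine graph convolution layers (one in the input stage plus two in each of the four modules) and five non-local layers.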

Further, in step S4, training on the Human3.6M dataset uses the loss function $L_{smoothL1}(\cdot)$ defined by formula (6):

$$L_{smoothL1}(J) = \sum_{i=1}^{N} \mathrm{smooth}_{L1}\left(J'_i - J_i\right), \qquad \mathrm{smooth}_{L1}(X) = \begin{cases} 0.5X^{2}, & |X| < 1 \\ |X| - 0.5, & \text{otherwise} \end{cases} \tag{6}$$

where $X$ is the difference between the ground truth and the predicted value, $|\cdot|$ is the absolute value of that difference, $J'_i$ is the predicted 3D joint coordinate of node $i$, $J_i$ is the corresponding ground truth of node $i$ in the dataset, and $N$ is the number of joints. The $L_{smoothL1}(J)$ loss is insensitive to outlier nodes and abnormal values and bounds the magnitude of the gradient, so that training converges reasonably.
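A minimal sketch of a smooth L1 loss follows, assuming the usual definition with a threshold of 1 (quadratic below it, linear above it) and a sum over all joints and coordinates. The threshold, the reduction and the function names are assumptions, not values stated in the source.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for large errors (threshold 1 assumed)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def smooth_l1_loss(pred, gt):
    """Sum of element-wise smooth L1 terms over all joints and coordinates."""
    return sum(smooth_l1(p - g)
               for joint_p, joint_g in zip(pred, gt)
               for p, g in zip(joint_p, joint_g))

pred = [(0.0, 0.0, 0.0)]
gt = [(0.5, 2.0, 0.0)]
loss = smooth_l1_loss(pred, gt)  # 0.125 + 1.5 + 0.0
```

The linear branch keeps the gradient magnitude at 1 for large errors, which is the property behind the stated robustness to outliers.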

Further, the evaluation metric usually adopted for pose estimation is MPJPE (Mean Per Joint Position Error), defined by formula (7):

$$E_{MPJPE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert J'_i - J_i \right\rVert_2 \tag{7}$$

The $E_{MPJPE}$ metric is the mean, over the $N$ joints, of the L2 distance between the predicted value and the ground truth of each joint, and $\lVert\cdot\rVert_2$ is the L2 distance from the predicted value to the ground truth. The smaller the MPJPE, the better the 3D human pose estimation result is considered to be.
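The metric can be computed with a few lines of standard-library Python. The function name `mpjpe` and the two-joint example are illustrative; real evaluations average over all joints and all frames, typically in millimetres.

```python
import math

def mpjpe(pred, gt):
    """Mean over joints of the Euclidean (L2) distance between prediction and truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(pred, gt)  # joint errors are 3 and 4, so the mean is 3.5
```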

Further, during training the initial learning rate is 0.001 and the batch size is 64. The initial learning rate directly affects how the model converges, and the batch size affects its generalization ability: an initial learning rate of 0.001 favors convergence, and a batch size of 64 favors generalization.

Compared with the prior art, the present invention has the following advantages and effects:

The symmetric semantic graph convolution pose estimation network based on body part grouping proposed by the present invention introduces symmetric semantic graph convolution, which aggregates the information of neighboring nodes more effectively and obtains balanced node features. It also introduces body part grouping, which divides the body by part into left and right trunks; these body part groups are learned through independent sub-networks to strengthen local features. Compared with other methods on the Human3.6M dataset, the present method performs better overall and shows improved generalization ability.

Description of the Drawings

The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the pose-estimation-driven three-dimensional human pose transfer method disclosed by the present invention;

Fig. 2 is a diagram of the symmetric semantic graph convolution network model based on body part grouping in an embodiment of the present invention;

Fig. 3 is a schematic diagram of the symmetric semantic graph convolution module in an embodiment of the present invention;

Fig. 4 is a schematic diagram of the body part grouping feature fusion module in an embodiment of the present invention.

Detailed Description

To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

Embodiment 1

A symmetric semantic graph convolution pose estimation method based on body part grouping, as shown in Figure 1, comprises the following steps:

In step S1, the symmetric semantic graph convolution layer and the non-local layer of the joint-point graph structure are constructed from the two-dimensional human joint points and their connection relations as follows:

$$X^{(l+1)} = \sigma\left(W X^{(l)} A^{sym}\right) \tag{1}$$

where $A^{sym} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix, $A$ is the adjacency matrix of the graph, and $D$ is the degree matrix;

A symmetric semantic graph convolution layer is constructed by adding a learnable weighting matrix $M$ to the symmetric graph convolution. The layer is computed as:

$$X^{(l+1)} = \sigma\left(W X^{(l)} \rho_i\left(M \odot A^{sym}\right)\right) \tag{3}$$

To capture the global features between nodes in the graph, the concept of a non-local layer is introduced, and the operation of the non-local layer is defined as:

$$x_i^{(l+1)} = x_i^{(l)} + \frac{W_x}{K} \sum_{j} f\left(x_i^{(l)}, x_j^{(l)}\right) \cdot g\left(x_j^{(l)}\right) \tag{4}$$

where $W_x$ is the normalization factor of the learnable weight parameter $W$, $K$ is the number of nodes, $i$ is the index of the target node to be computed, and $j$ is the index of the nodes other than $i$; $x_i^{(l)}$ and $x_j^{(l)}$ are the input features of nodes $i$ and $j$, respectively; $x_i^{(l+1)}$ is the output feature of node $i$; $f(\cdot,\cdot)$ is a learnable binary function that computes the similarity between two input features; and $g(\cdot)$ is a learnable unary function that transforms the input features.

S2. Group the body parts according to the body trunk data from film animation, virtual reality or sports motion, obtain the local and non-local features of each trunk and the local and non-local features of the whole body, and fuse the obtained features;

For the body part grouping in step S2, the human joint points from film animation, virtual reality or sports motion are decomposed into a left limb group, a right limb group and a whole-body group. The joint points within a group are more strongly correlated, and each group extracts features through an independent sub-network to strengthen local relations.

As shown in Figure 4, feature fusion uses late fusion: the features of each group are learned first and then fused. The feature fusion is defined as:

$$f^{fuse} = \mathrm{Concat}\left(f^{left}, f^{right}, f^{all}\right) \tag{5}$$

In step S3, multiple symmetric semantic graph convolution modules are built from the symmetric semantic graph convolution layer and the non-local layer. All modules have the same structure: each consists of two symmetric semantic graph convolution layers and one non-local layer connected in sequence;

In the symmetric semantic graph convolution network, as shown in Figure 3, one symmetric semantic graph convolution layer and one non-local layer first map the input into the latent space; four sequentially connected symmetric semantic graph convolution modules then produce the encoded features. Every symmetric semantic graph convolution layer in the network is followed by batch normalization and a ReLU nonlinear activation;

The body-part-grouped symmetric semantic graph convolution pose estimation network model comprises a first branch, a second branch and a third branch, as shown in Figure 2, each of which uses a symmetric semantic graph convolution network for feature extraction: the left limb group is input to the first branch, whose network extracts the left limb feature $f^{left}$; the right limb group is input to the second branch, whose network extracts the right limb feature $f^{right}$; the whole-body group is input to the third branch, whose network extracts the whole-body feature $f^{all}$. The fused feature $f^{fuse}$ is computed according to formula (5), and a symmetric semantic graph convolution layer then projects the encoded features into the output space.

S4. Train the symmetric semantic graph convolution pose estimation network model on the Human3.6M dataset, input the two-dimensional human joint points to be estimated from film animation, virtual reality or sports motion into the trained model, and output the estimated three-dimensional human joint points.

In step S4, training on the Human3.6M dataset uses the loss function $L_{smoothL1}(\cdot)$ defined by formula (6):

$$L_{smoothL1}(J) = \sum_{i=1}^{N} \mathrm{smooth}_{L1}\left(J'_i - J_i\right), \qquad \mathrm{smooth}_{L1}(X) = \begin{cases} 0.5X^{2}, & |X| < 1 \\ |X| - 0.5, & \text{otherwise} \end{cases} \tag{6}$$

where $X$ is the difference between the ground truth and the predicted value of the film animation, virtual reality or sports motion data, $|\cdot|$ is the absolute value of that difference, $J'_i$ is the predicted 3D joint coordinate of node $i$, $J_i$ is the corresponding ground truth of node $i$ in the dataset, and $N$ is the number of joints.

The evaluation metric usually adopted for pose estimation is MPJPE (Mean Per Joint Position Error), defined by formula (7):

$$E_{MPJPE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert J'_i - J_i \right\rVert_2 \tag{7}$$

The $E_{MPJPE}$ metric is the mean of the L2 distance between the predicted value and the ground truth of each joint in film animation, virtual reality or sports motion, and $\lVert\cdot\rVert_2$ is the L2 distance from the predicted value to the ground truth. The smaller the MPJPE, the better the 3D human pose estimation result is considered to be.

During training, the initial learning rate is 0.001 and the batch size is 64.

Embodiment 2

This embodiment builds on the symmetric semantic graph convolution pose estimation method based on body part grouping disclosed in Embodiment 1. To verify the effectiveness of the present invention, experiments are carried out on the Human3.6M dataset, and the technical effect of the invention is explained with the experimental results.

Human3.6M is one of the most widely used datasets for 3D pose estimation. It covers 3.6 million images captured in a controlled indoor environment with 11 subjects; marker-based motion capture equipment records the subjects' body poses in daily-activity scenes, covering 15 actions.

Experimental configuration. Hardware: GPU RTX 2080Ti (11 GB video memory); CPU 4-core Intel(R) Xeon(R) Silver 4110 @ 2.10 GHz; memory 16 GB. Software: Python v2.7, PyTorch v1.1.0, CUDA 10.2. Operating system: Ubuntu 18.04.

An ablation study is performed on the proposed method using the above configuration. The proposed pose estimation network contains two main components: the symmetric semantic graph convolution module and body part grouping. To verify their effectiveness, the ablation experiments are set up as follows: the first experiment uses semantic graph convolution only, the second uses the symmetric semantic graph convolution module, the third uses body part grouping, and the fourth uses both the symmetric semantic graph convolution module and body part grouping. The experimental results are shown in Table 1:

Table 1. Ablation results of the symmetric semantic graph convolution pose estimation method based on body part grouping

| Symmetric semantic graph convolution module | Body part grouping | MPJPE    |
|                                             |                    | 41.47 mm |
| √                                           |                    | 40.68 mm |
|                                             | √                  | 40.53 mm |
| √                                           | √                  | 39.93 mm |
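The gains implied by the ablation figures can be checked with a short calculation. The MPJPE values are those reported in Table 1; the helper name `improvement` and the configuration labels are hypothetical.

```python
def improvement(baseline_mm, result_mm):
    """Absolute (mm) and relative (%) MPJPE reduction against the baseline."""
    gain = baseline_mm - result_mm
    return gain, 100.0 * gain / baseline_mm

BASELINE = 41.47  # semantic graph convolution only
RESULTS = {
    "symmetric module": 40.68,
    "body part grouping": 40.53,
    "both": 39.93,
}

for name, value in RESULTS.items():
    gain, pct = improvement(BASELINE, value)
    print(f"{name}: {gain:.2f} mm lower ({pct:.1f}%)")
```

With both components enabled the error drops by 1.54 mm, about 3.7% of the baseline, which is more than either component achieves alone.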

Table 2 shows, per human action category and under the MPJPE metric, the results of comparing the method of the present invention with the baseline method and the semantic graph convolution method. The best method for each action is highlighted in bold.

Table 2. Comparison results of the symmetric semantic graph convolution pose estimation method based on body part grouping

The symmetric semantic graph convolution pose estimation network based on body part grouping proposed by the present invention thus achieves better performance, which indicates that the model can effectively exploit the relations between different joint groups in the graph.

The above embodiments are preferred embodiments of the present invention, but the embodiments of the invention are not limited by them. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the protection scope of the present invention.

Claims

1. A symmetric semantic graph convolution pose estimation method based on body part grouping, characterized in that the method comprises the following steps:

S1. Input the two-dimensional human joint points and their connection relations from film animation, virtual reality or sports motion, and construct the symmetric semantic graph convolution layer and the non-local layer of the joint-point graph structure;

S2. Group the body parts according to the body trunk, obtain the local and non-local features of each trunk and the local and non-local features of the whole body, and fuse the obtained features;

S3. Based on the symmetric semantic graph convolution layer, the non-local layer and the body part grouping, construct the body-part-grouped symmetric semantic graph convolution pose estimation network model;

S4. Train the symmetric semantic graph convolution pose estimation network model on the Human3.6M dataset, input the two-dimensional human joint points to be estimated into the trained model, and output the estimated three-dimensional human joint points.

2. The symmetric semantic graph convolution pose estimation method based on body part grouping according to claim 1, characterized in that, in step S1, the symmetric semantic graph convolution layer of the joint-point graph structure is constructed from the two-dimensional human joint points and their connection relations as follows:

Let $X^{(l)}$ and $X^{(l+1)}$ denote the features of the nodes in the graph structure before and after the $l$-th convolution layer, respectively. The symmetric graph convolution then takes the form:

$$X^{(l+1)} = \sigma\left(W X^{(l)} A^{sym}\right) \tag{1}$$

where $\sigma(\cdot)$ is the activation function, $W$ is a learnable weight parameter, and $A^{sym}$ is the matrix obtained by symmetrically normalizing the adjacency matrix $A$ of the graph:

$$A^{sym} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{2}$$

where $A$ is the adjacency matrix of the graph and $D$ is the degree matrix;

A symmetric semantic graph convolution layer is constructed by adding a learnable weighting matrix $M$ to the symmetric graph convolution. The layer is computed as:

$$X^{(l+1)} = \sigma\left(W X^{(l)} \rho_i\left(M \odot A^{sym}\right)\right) \tag{3}$$

where $\rho_i(\cdot)$ is the Softmax nonlinearity, used to normalize the matrix for node $i$, and $\odot$ denotes element-wise multiplication of matrices.

3. The symmetric semantic graph convolution pose estimation method based on body part grouping according to claim 2, characterized in that, in step S1, the non-local layer of the joint-point graph structure is constructed from the two-dimensional human joint points and their connection relations as follows:

The operation of the non-local layer is defined as:

$$x_i^{(l+1)} = x_i^{(l)} + \frac{W_x}{K} \sum_{j} f\left(x_i^{(l)}, x_j^{(l)}\right) \cdot g\left(x_j^{(l)}\right) \tag{4}$$

where $W_x$ is the normalization factor of the learnable weight parameter $W$, $K$ is the number of nodes, $i$ is the index of the target node to be computed, and $j$ is the index of the nodes other than $i$; $x_i^{(l)}$ and $x_j^{(l)}$ are the input features of nodes $i$ and $j$, respectively; $x_i^{(l+1)}$ is the output feature of node $i$; $f(\cdot,\cdot)$ is a learnable binary function that computes the similarity between two input features; and $g(\cdot)$ is a learnable unary function that transforms the input features.

4. The symmetric semantic graph convolution pose estimation method based on body part grouping according to claim 3, characterized in that, in step S2, the human joint points are decomposed into a left limb group, a right limb group and a whole-body group; each group strengthens local relations through an independent sub-network; late fusion is then applied, in which the features of each group are learned first and then fused. The feature fusion is defined as:

$$f^{fuse} = \mathrm{Concat}\left(f^{left}, f^{right}, f^{all}\right) \tag{5}$$

where $\mathrm{Concat}(\cdot,\cdot,\cdot)$ denotes concatenation of the features, $f^{left}$ is the feature of the left limb group, $f^{right}$ is the feature of the right limb group, $f^{all}$ is the feature of the whole-body group, and $f^{fuse}$ is the feature obtained after fusion.

5. The symmetric semantic graph convolution pose estimation method based on body part grouping according to claim 4, characterized in that, in step S3, multiple symmetric semantic graph convolution modules are built from the symmetric semantic graph convolution layer and the non-local layer; all symmetric semantic graph convolution modules have the same structure, and each consists of two symmetric semantic graph convolution layers and one non-local layer connected in sequence;

In the symmetric semantic graph convolution network, one symmetric semantic graph convolution layer and one non-local layer first map the input into the latent space; four sequentially connected symmetric semantic graph convolution modules then produce the encoded features; every symmetric semantic graph convolution layer in the network is followed by batch normalization and a ReLU nonlinear activation;

The body-part-grouped symmetric semantic graph convolution pose estimation network model comprises a first branch, a second branch and a third branch, each of which uses a symmetric semantic graph convolution network for feature extraction: the left limb group is input to the first branch, whose network extracts the left limb feature $f^{left}$; the right limb group is input to the second branch, whose network extracts the right limb feature $f^{right}$; the whole-body group is input to the third branch, whose network extracts the whole-body feature $f^{all}$; the fused feature $f^{fuse}$ is computed according to formula (5), and a symmetric semantic graph convolution layer then projects the encoded features into the output space.

6. The symmetric semantic graph convolution pose estimation method based on body part grouping according to claim 5, characterized in that, in step S4, training on the Human3.6M dataset uses the loss function $L_{smoothL1}(\cdot)$ defined by formula (6):

$$L_{smoothL1}(J) = \sum_{i=1}^{N} \mathrm{smooth}_{L1}\left(J'_i - J_i\right), \qquad \mathrm{smooth}_{L1}(X) = \begin{cases} 0.5X^{2}, & |X| < 1 \\ |X| - 0.5, & \text{otherwise} \end{cases} \tag{6}$$

where $X$ is the difference between the ground truth and the predicted value, $|\cdot|$ is the absolute value of that difference, $J'_i$ is the predicted 3D joint coordinate of node $i$, $J_i$ is the corresponding ground truth of node $i$ in the dataset, and $N$ is the number of joints.

7. The symmetric semantic graph convolution pose estimation method based on body part grouping according to claim 6, characterized in that the evaluation metric adopted for the pose estimation is MPJPE, defined as follows:

$$E_{MPJPE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert J'_i - J_i \right\rVert_2 \tag{7}$$

The $E_{MPJPE}$ metric is the mean, over the $N$ joints, of the L2 distance between the predicted value and the ground truth of each joint, and $\lVert\cdot\rVert_2$ is the L2 distance from the predicted value to the ground truth.

8. A method for symmetric semantic graph convolution pose estimation based on body part grouping according to claim 6, characterized in that, in the training process, the initial learning rate is 0.001, and a batch size of 64 is used.