CN115546888A - A Convolutional Pose Estimation Method Based on Body Part Grouping with Symmetrical Semantic Graphs - Google Patents
- Publication number
- CN115546888A (application number CN202211084071.5A)
- Authority
- CN
- China
- Prior art keywords
- symmetric
- semantic graph
- graph convolution
- local
- pose estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Medical Informatics (AREA)
- Pure & Applied Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Algebra (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a symmetric semantic graph convolution pose estimation method based on body part grouping, comprising the following steps: S1, inputting two-dimensional human joint points and their connection relations, and constructing a symmetric semantic graph convolution layer and a non-local layer for the joint graph structure; S2, grouping the body parts according to the trunk of the body, obtaining the local and non-local features of each trunk group and the local and non-local features of the whole body, and fusing the obtained features; S3, constructing a body-part-grouped symmetric semantic graph convolution pose estimation network model from the symmetric semantic graph convolution layer, the non-local layer, and the body part grouping; S4, training the symmetric semantic graph convolution pose estimation network model on the Human3.6M dataset, inputting the two-dimensional human joint points to be estimated into the trained model, and outputting the estimated three-dimensional human joint points. The invention can be applied to fields such as film animation, virtual reality, and sports motion analysis; the method achieves better results and improved generalization ability.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a symmetric semantic graph convolution pose estimation method based on body part grouping.
Background
Human pose estimation is widely used in computer vision tasks such as virtual reality, human-computer interaction, and action recognition. Thanks to the rapid development of deep learning, the performance of estimating 3D human pose from images has improved markedly, and the topic has become a current research focus.
Existing 3D pose estimation methods fall into two categories: the first predicts the 3D pose directly from the image, and the second first predicts a 2D pose and then regresses the 3D pose from it. Methods of the first category can draw a large amount of information from the image, but the model is strongly affected by factors such as the image background and the subject's clothing, and the features it has to learn are complex. Methods of the second category reduce the overall complexity of the task, and the network learns the mapping from 2D to 3D space more easily. Because 2D pose estimation research is mature, models of this category have become the mainstream.
The graph-convolution-based 3D human pose estimation method provided by "A 3D Human Pose Estimation Method Based on Graph Convolutional Network" (CN112712019A) improves 3D human pose regression performance and reduces the number of network parameters, but the generalization ability of its model still needs improvement. In existing research, deep-learning-based human pose estimation algorithms are easily affected by self-occlusion and environmental occlusion, human poses are highly diverse, and current models generalize poorly. A more reasonable and more generally applicable network model is therefore urgently needed to improve pose estimation.
Summary of the Invention
The purpose of the invention is to solve the above problems in the prior art by providing a symmetric semantic graph convolution pose estimation method based on body part grouping.
The purpose of the invention can be achieved by the following technical solution:
A symmetric semantic graph convolution pose estimation method based on body part grouping, as shown in Figure 1, comprises the following steps:
S1. Input the two-dimensional human joint points and their connection relations in film animation, virtual reality, or sports motion, and construct a symmetric semantic graph convolution layer and a non-local layer for the joint graph structure;
S2. Group the body parts according to the trunk of the body, obtain the local and non-local features of each trunk group and the local and non-local features of the whole body, and fuse the obtained features;
S3. Construct the body-part-grouped symmetric semantic graph convolution pose estimation network model from the symmetric semantic graph convolution layer, the non-local layer, and the body part grouping;
S4. Train the symmetric semantic graph convolution pose estimation network model on the Human3.6M dataset, input the two-dimensional human joint points to be estimated into the trained model, and output the estimated three-dimensional human joint points.
Further, in step S1, the symmetric semantic graph convolution layer and the non-local layer of the joint graph structure are constructed from the two-dimensional human joint points and their connection relations as follows:
Let $X^{(l)}$ and $X^{(l+1)}$ denote the features of the nodes in the graph structure before and after the $l$-th convolution layer, respectively. The symmetric graph convolution then takes the form:
$$X^{(l+1)} = \sigma\left(W X^{(l)} A_{sym}\right) \tag{1}$$
where $\sigma(\cdot)$ denotes the activation function, $W$ denotes the learnable weight parameter, and $A_{sym}$ is the matrix obtained by symmetrically normalizing the adjacency matrix $A$ of the graph:
$$A_{sym} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{2}$$
where $A$ is the adjacency matrix of the graph and $D$ is the degree matrix. Symmetric normalization aggregates the information of neighboring nodes better and yields balanced node features;
A symmetric semantic graph convolution layer is obtained by adding a learnable weighting matrix $M$ to the symmetric graph convolution. The layer is computed as:
$$X^{(l+1)} = \sigma\left(W X^{(l)} \rho_i\left(M \odot A_{sym}\right)\right) \tag{3}$$
where $\rho_i(\cdot)$ is the Softmax nonlinearity, which normalizes the matrix entries associated with node $i$, and $\odot$ denotes element-wise multiplication of matrices;
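Formulas (1) to (3) can be illustrated with a minimal NumPy sketch. It is not the patent's implementation: the function names, the feature layout (channels by nodes), the choice of ReLU for the activation, and the decision to apply the Softmax over each node's incoming edges only (masking out non-edges) are all assumptions made for illustration.

```python
import numpy as np

def symmetric_normalize(A):
    """Return D^{-1/2} A D^{-1/2} for an adjacency matrix A (formula (2))."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5  # isolated nodes keep a zero factor
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

def edge_softmax(S, mask):
    """Softmax over each column of S, restricted to entries where mask is True.

    Assumption: every column has at least one True entry (e.g. self-loops),
    otherwise the column would be undefined.
    """
    logits = np.where(mask, S, -np.inf)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)

def sem_gconv(X, W, M, A_sym):
    """One symmetric semantic graph convolution step (formula (3)).

    X: (C_in, K) node features, W: (C_out, C_in), M and A_sym: (K, K).
    ReLU stands in for the activation sigma (an assumption).
    """
    weights = edge_softmax(M * A_sym, A_sym > 0)
    return np.maximum(W @ X @ weights, 0.0)
```

With features stored as a (channels, nodes) array, right-multiplying by the normalized weights mixes information across neighboring joints, while left-multiplying by `W` mixes channels, which matches the operand order in formulas (1) and (3).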
To capture global features between the nodes of the graph, the concept of a non-local layer is introduced. The operation of the non-local layer is defined as:
$$x_i^{(l+1)} = x_i^{(l)} + \frac{W_x}{K}\sum_{j=1}^{K} f\left(x_i^{(l)}, x_j^{(l)}\right) g\left(x_j^{(l)}\right) \tag{4}$$
where $W_x$ denotes the normalization factor of the learnable weight parameter $W$, $K$ denotes the number of nodes, $i$ is the index of the target node to be computed, and $j$ is the index of the nodes other than $i$; $x_i^{(l)}$ and $x_j^{(l)}$ denote the input features of nodes $i$ and $j$, respectively; $x_i^{(l+1)}$ denotes the output feature of node $i$; $f(\cdot,\cdot)$ is a learnable binary function that computes the similarity between two input features; and $g(\cdot)$ is a learnable unary function that transforms the input feature.
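The non-local operation described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the patent's code: the residual form (input plus a scaled aggregate), the scalar `w_x`, the dot-product similarity built from two hypothetical projection matrices `theta` and `phi`, and the linear transform `W_g` are all choices made here to obtain something runnable.

```python
import numpy as np

def non_local(X, theta, phi, W_g, w_x):
    """Non-local aggregation over all K nodes.

    X: (C, K) node features; theta, phi, W_g: (C, C) projections (hypothetical);
    w_x: scalar weight on the aggregated term.
    """
    K = X.shape[1]
    # F[i, j] = f(x_i, x_j): dot-product similarity of projected features.
    F = (theta @ X).T @ (phi @ X)
    # Column j of G is g(x_j): a linear transform of the input feature.
    G = W_g @ X
    # Column i of agg is sum_j F[i, j] * g(x_j).
    agg = G @ F.T
    return X + (w_x / K) * agg
```

With `w_x = 0` the layer reduces to the identity, so a network can start from purely local behavior and learn how much global context to mix in.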
Further, for the body part grouping in step S2, the human joint points are decomposed into a left-limb group, a right-limb group, and a whole-body group. The joint points within a group are more strongly correlated, and each group has its features extracted by an independent sub-network in order to strengthen local relations.
As shown in Figure 4, feature fusion uses late fusion: the features of each group are learned first and are then fused. Feature fusion is defined as:
$$f_{fuse} = \mathrm{Concat}\left(f_{left}, f_{right}, f_{all}\right) \tag{5}$$
where $\mathrm{Concat}(\cdot,\cdot,\cdot)$ denotes concatenation of the features, $f_{left}$ is the feature of the left-limb group, $f_{right}$ is the feature of the right-limb group, $f_{all}$ is the feature of the whole-body group, and $f_{fuse}$ is the fused feature.
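The grouping and the late fusion of formula (5) can be sketched as below. The joint indices are hypothetical placeholders for a 16-joint skeleton and do not reproduce any dataset's joint ordering; concatenating along the joint axis is likewise an assumption, since the text does not state the axis.

```python
import numpy as np

NUM_JOINTS = 16  # hypothetical skeleton size

# Hypothetical joint indices per group; the real assignment depends on the skeleton.
GROUPS = {
    "left": [4, 5, 6, 10, 11, 12],
    "right": [1, 2, 3, 13, 14, 15],
    "all": list(range(NUM_JOINTS)),
}

def split_groups(pose_2d):
    """Split a (2, K) array of 2D joints into one array per body part group."""
    return {name: pose_2d[:, idx] for name, idx in GROUPS.items()}

def late_fusion(f_left, f_right, f_all):
    """Formula (5): concatenate the per-group features (here along the joint axis)."""
    return np.concatenate([f_left, f_right, f_all], axis=1)

# Usage: in the full model each group would pass through its own sub-network
# before fusion; here the raw groups are fused directly to show the shapes.
pose = np.arange(2 * NUM_JOINTS, dtype=float).reshape(2, NUM_JOINTS)
parts = split_groups(pose)
fused = late_fusion(parts["left"], parts["right"], parts["all"])
```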
Body part grouping learns the consistency of local joints while preserving the consistency of the global pose, and therefore generalizes better to the symmetric poses in the training data as well as to rare and occluded poses.
Further, in step S3, several symmetric semantic graph convolution modules are built from the symmetric semantic graph convolution layer and the non-local layer. All modules have the same structure: each consists of two symmetric semantic graph convolution layers and one non-local layer connected in sequence, so that symmetric semantic graph convolution layers and non-local layers alternate to capture the local and global semantic relations between nodes;
In the symmetric semantic graph convolution network, as shown in Figure 3, one symmetric semantic graph convolution layer and one non-local layer first map the input into the latent space; four symmetric semantic graph convolution modules connected in sequence then produce the encoded features. Every symmetric semantic graph convolution layer in the network is followed by batch normalization and a ReLU nonlinearity;
The body-part-grouped symmetric semantic graph convolution pose estimation network model comprises a first branch, a second branch, and a third branch, as shown in Figure 2, each of which uses a symmetric semantic graph convolution network for feature extraction: the left-limb group is fed into the first branch, whose symmetric semantic graph convolution network extracts the left-limb feature $f_{left}$; the right-limb group is fed into the second branch, whose network extracts the right-limb feature $f_{right}$; the whole-body group is fed into the third branch, whose network extracts the whole-body feature $f_{all}$. The fused feature $f_{fuse}$ is computed according to formula (5), and a final symmetric semantic graph convolution layer projects the encoded features into the output space.
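The layer ordering described for step S3 can be written down as a small, framework-free plan. This is only a structural sketch: the layer names are hypothetical labels, and batch normalization and ReLU after each graph convolution are omitted for brevity.

```python
def branch_layers(num_modules=4):
    """Layer sequence of one symmetric semantic graph convolution network."""
    # Input mapping to the latent space: one graph convolution, one non-local layer.
    layers = ["sem_gconv", "non_local"]
    # Each module: two graph convolutions followed by one non-local layer.
    for _ in range(num_modules):
        layers += ["sem_gconv", "sem_gconv", "non_local"]
    return layers

def model_plan():
    """Three identical branches, late fusion by concatenation, one output layer."""
    return {
        "branches": {name: branch_layers() for name in ("left", "right", "all")},
        "fusion": "concat",
        "head": ["sem_gconv"],
    }
```

Under this reading each branch stacks 14 layers (nine graph convolutions and five non-local layers) before the fused features reach the output layer.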
Further, in step S4, the network is trained on the Human3.6M dataset with the loss function $L_{smoothl1}(\cdot)$ defined by formula (6):
$$L_{smoothl1}(J) = \sum_{i} \mathrm{smooth}_{L1}\left(J'_i - J_i\right), \qquad \mathrm{smooth}_{L1}(X) = \begin{cases} 0.5X^2, & |X| < 1 \\ |X| - 0.5, & \text{otherwise} \end{cases} \tag{6}$$
where $X$ denotes the difference between the ground truth and the predicted value, $|\cdot|$ denotes the absolute value of that difference, $J'_i$ is the predicted 3D coordinate of joint $i$, and $J_i$ is the ground truth of joint $i$ in the dataset. The $L_{smoothl1}(J)$ loss is insensitive to outlier nodes and outlier values and bounds the magnitude of the gradient, so that training converges properly.
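A Smooth L1 loss of the kind named above can be sketched in plain Python. The threshold `beta = 1.0` and the averaging over all coordinates are assumptions taken from the commonly used definition of this loss, not values stated in the text.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over all coordinates of all joints.

    pred, target: sequences of joints, each joint a sequence of coordinates.
    Quadratic for small errors, linear for large ones, so outliers
    contribute a bounded gradient.
    """
    total, count = 0.0, 0
    for p_joint, t_joint in zip(pred, target):
        for p, t in zip(p_joint, t_joint):
            x = abs(p - t)
            if x < beta:
                total += 0.5 * x * x / beta
            else:
                total += x - 0.5 * beta
            count += 1
    return total / count
```

The two branches meet at `x == beta`, where both give `0.5 * beta`, so the loss is continuous and its slope never exceeds 1.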
Further, the evaluation metric commonly used for pose estimation is MPJPE (Mean Per Joint Position Error), defined by formula (7):
$$E_{MPJPE}(J', J) = \frac{1}{N}\sum_{i=1}^{N}\left\| J'_i - J_i \right\|_2 \tag{7}$$
where $N$ is the number of joints. The $E_{MPJPE}(\cdot)$ metric is the mean, over all joints, of the L2 distance between the predicted value and the ground truth, and $\|\cdot\|_2$ denotes the L2 distance from the predicted value to the ground truth. The smaller the MPJPE, the better the 3D human pose estimation result.
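MPJPE as described above, the mean per-joint L2 distance between prediction and ground truth, can be computed with a few lines of standard-library Python. The function name and the list-of-joints input layout are illustrative choices.

```python
import math

def mpjpe(pred, target):
    """Mean Per Joint Position Error for one pose.

    pred, target: sequences of joints, each joint a sequence of 3D coordinates.
    Returns the average Euclidean distance between corresponding joints.
    """
    dists = [
        math.sqrt(sum((p - t) ** 2 for p, t in zip(p_joint, t_joint)))
        for p_joint, t_joint in zip(pred, target)
    ]
    return sum(dists) / len(dists)
```

For example, a two-joint pose in which one joint is off by the vector (3, 4, 0) and the other is exact has per-joint errors of 5 and 0, giving an MPJPE of 2.5 in the units of the coordinates.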
Further, during training the initial learning rate is 0.001 and the batch size is 64. The initial learning rate directly affects the convergence of the model, and the batch size affects its generalization ability: an initial learning rate of 0.001 favors convergence, and a batch size of 64 favors generalization.
Compared with the prior art, the invention has the following advantages and effects:
The symmetric semantic graph convolution pose estimation network based on body part grouping proposed by the invention introduces symmetric semantic graph convolution, which aggregates the information of neighboring nodes better and yields balanced node features. It also introduces body part grouping, which divides the body by part into left and right trunk groups; these body part groups are learned by independent sub-networks to strengthen local features. Compared with other methods on the Human3.6M dataset, the method performs better overall and its generalization ability is improved.
Brief Description of the Drawings
The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Figure 1 is a flow chart of the pose-estimation-driven three-dimensional human pose transfer method disclosed by the invention;
Figure 2 is a diagram of the symmetric semantic graph convolution network model based on body part grouping in an embodiment of the invention;
Figure 3 is a schematic diagram of the symmetric semantic graph convolution module in an embodiment of the invention;
Figure 4 is a schematic diagram of the body part grouping feature fusion module in an embodiment of the invention.
Detailed Description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the invention and do not limit it.
Embodiment 1
A symmetric semantic graph convolution pose estimation method based on body part grouping, as shown in Figure 1, comprises the following steps:
S1. Input the two-dimensional human joint points and their connection relations in film animation, virtual reality, or sports motion, and construct a symmetric semantic graph convolution layer and a non-local layer for the joint graph structure;
In step S1, the symmetric semantic graph convolution layer and the non-local layer of the joint graph structure are constructed from the two-dimensional human joint points and their connection relations as follows:
Let $X^{(l)}$ and $X^{(l+1)}$ denote the features of the nodes in the graph structure before and after the $l$-th convolution layer, respectively. The symmetric graph convolution then takes the form:
$$X^{(l+1)} = \sigma\left(W X^{(l)} A_{sym}\right) \tag{1}$$
where $\sigma(\cdot)$ denotes the activation function, $W$ denotes the learnable weight parameter, and $A_{sym}$ is the matrix obtained by symmetrically normalizing the adjacency matrix $A$ of the graph:
$$A_{sym} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{2}$$
where $A$ is the adjacency matrix of the graph and $D$ is the degree matrix;
A symmetric semantic graph convolution layer is obtained by adding a learnable weighting matrix $M$ to the symmetric graph convolution. The layer is computed as:
$$X^{(l+1)} = \sigma\left(W X^{(l)} \rho_i\left(M \odot A_{sym}\right)\right) \tag{3}$$
where $\rho_i(\cdot)$ is the Softmax nonlinearity, which normalizes the matrix entries associated with node $i$, and $\odot$ denotes element-wise multiplication of matrices;
To capture global features between the nodes of the graph, the concept of a non-local layer is introduced, and the operation of the non-local layer is defined as:
$$x_i^{(l+1)} = x_i^{(l)} + \frac{W_x}{K}\sum_{j=1}^{K} f\left(x_i^{(l)}, x_j^{(l)}\right) g\left(x_j^{(l)}\right) \tag{4}$$
where $W_x$ denotes the normalization factor of the learnable weight parameter $W$, $K$ denotes the number of nodes, $i$ is the index of the target node to be computed, and $j$ is the index of the nodes other than $i$; $x_i^{(l)}$ and $x_j^{(l)}$ denote the input features of nodes $i$ and $j$, respectively; $x_i^{(l+1)}$ denotes the output feature of node $i$; $f(\cdot,\cdot)$ is a learnable binary function that computes the similarity between two input features; and $g(\cdot)$ is a learnable unary function that transforms the input feature.
S2. Group the body parts according to the trunk data of the body in film animation, virtual reality, or sports motion, obtain the local and non-local features of each trunk group and the local and non-local features of the whole body, and fuse the obtained features;
For the body part grouping in step S2, the human joint points in film animation, virtual reality, or sports motion are decomposed into a left-limb group, a right-limb group, and a whole-body group. The joint points within a group are more strongly correlated, and each group has its features extracted by an independent sub-network in order to strengthen local relations.
As shown in Figure 4, feature fusion uses late fusion: the features of each group are learned first and are then fused. Feature fusion is defined as:
$$f_{fuse} = \mathrm{Concat}\left(f_{left}, f_{right}, f_{all}\right) \tag{5}$$
where $\mathrm{Concat}(\cdot,\cdot,\cdot)$ denotes concatenation of the features, $f_{left}$ is the feature of the left-limb group, $f_{right}$ is the feature of the right-limb group, $f_{all}$ is the feature of the whole-body group, and $f_{fuse}$ is the fused feature.
S3. Construct the body-part-grouped symmetric semantic graph convolution pose estimation network model from the symmetric semantic graph convolution layer, the non-local layer, and the body part grouping;
In step S3, several symmetric semantic graph convolution modules are built from the symmetric semantic graph convolution layer and the non-local layer. All modules have the same structure: each consists of two symmetric semantic graph convolution layers and one non-local layer connected in sequence;
In the symmetric semantic graph convolution network, as shown in Figure 3, one symmetric semantic graph convolution layer and one non-local layer first map the input into the latent space; four symmetric semantic graph convolution modules connected in sequence then produce the encoded features. Every symmetric semantic graph convolution layer in the network is followed by batch normalization and a ReLU nonlinearity;
The body-part-grouped symmetric semantic graph convolution pose estimation network model comprises a first branch, a second branch, and a third branch, as shown in Figure 2, each of which uses a symmetric semantic graph convolution network for feature extraction: the left-limb group is fed into the first branch, whose symmetric semantic graph convolution network extracts the left-limb feature $f_{left}$; the right-limb group is fed into the second branch, whose network extracts the right-limb feature $f_{right}$; the whole-body group is fed into the third branch, whose network extracts the whole-body feature $f_{all}$. The fused feature $f_{fuse}$ is computed according to formula (5), and a final symmetric semantic graph convolution layer projects the encoded features into the output space.
S4、使用Human3.6M数据集对所述对称语义图卷积姿态估计网络模型进行训练,将待估计的电影动画、虚拟现实或运动动作中二维人体关节点输入经过训练的对称语义图卷积姿态估计网络模型,输出估计的电影动画、虚拟现实或运动动作中三维人体关节点。S4. Use the Human3.6M data set to train the symmetric semantic graph convolution pose estimation network model, and input the two-dimensional human body joint points in the movie animation, virtual reality or sports action to be estimated into the trained symmetric semantic graph convolution The pose estimation network model outputs the estimated three-dimensional human joint points in movie animation, virtual reality or sports actions.
步骤S4中采用公式(6)定义的损失函数Lsmoothl1(),在Human3.6M数据集上进行训练,公式如下:In step S4, the loss function L smoothhl1 () defined by the formula (6) is used to train on the Human3.6M data set, and the formula is as follows:
其中,X表示电影动画、虚拟现实或运动动作数据真值与预测值之差,|·|表示真值与预测值之差绝对值,J′i代表预测的i节点的3D关节坐标,Ji对应数据集中i节点的真值。Among them, X represents the difference between the real value and the predicted value of movie animation, virtual reality or sports action data, |·| represents the absolute value of the difference between the real value and the predicted value, J′ i represents the predicted 3D joint coordinates of node i, J i Corresponds to the true value of the i-node in the data set.
其中,姿态估计通常采用的评价指标为MPJPE(Mean Per Joint PositionError),公式定义为(7):Among them, the evaluation index usually used for attitude estimation is MPJPE (Mean Per Joint Position Error), and the formula is defined as (7):
EMPJPE()指标表示电影动画、虚拟现实或运动动作中每个关节预测值与真值的L2距离的均值,||·||2表示预测值到真值的L2距离。当评价指标MPJPE较小时,认为该3D人体姿态估计结果是较优的。The E MPJPE () index represents the mean value of the L2 distance between the predicted value and the real value of each joint in movie animation, virtual reality or sports action, and ||·|| 2 represents the L2 distance between the predicted value and the real value. When the evaluation index MPJPE is smaller, it is considered that the 3D human body pose estimation result is better.
During training, the initial learning rate is 0.001 and the batch size is 64.
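Embodiment 2
This embodiment builds on the body-part-grouping-based symmetric semantic graph convolution pose estimation method disclosed in Embodiment 1. To verify the effectiveness of the invention, experiments were conducted on the Human3.6M dataset, and the technical effect of the invention is explained with reference to the experimental results.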
Human3.6M is one of the most widely used datasets for 3D pose estimation. It contains 3.6 million images captured in a controlled indoor environment. A total of 11 subjects were recorded with a marker-based motion capture system while performing everyday activities, covering 15 actions.
Experimental configuration. Hardware: GPU RTX 2080Ti with 11 GB of video memory; 4-core Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz; 16 GB of memory. Software: Python v2.7, PyTorch v1.1.0, CUDA 10.2. Operating system: Ubuntu 18.04.
An ablation study of the proposed method was carried out with the above configuration. The proposed pose estimation network contains two main modules: the symmetric semantic graph convolution module and body part grouping. To verify their effectiveness, four ablation experiments were set up: the first uses semantic graph convolution only, the second uses the symmetric semantic graph convolution module, the third uses body part grouping, and the fourth uses both the symmetric semantic graph convolution module and body part grouping. The results are shown in Table 1:
Table 1. Ablation results of the symmetric semantic graph convolution pose estimation method based on body part grouping
Table 2 compares the method of the invention with the baseline method and the semantic graph convolution method under the MPJPE metric, broken down by human action. The best method for each action is highlighted in bold.
Table 2. Comparative results of the symmetric semantic graph convolution pose estimation method based on body part grouping
These results show that the proposed symmetric semantic graph convolution pose estimation network based on body part grouping achieves better performance, which indicates that the proposed model can effectively exploit the relationships between different joint groups in the graph.
The above embodiments are preferred embodiments of the invention, but the embodiments of the invention are not limited to them. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the scope of protection of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211084071.5A CN115546888A (en) | 2022-09-06 | 2022-09-06 | A Convolutional Pose Estimation Method Based on Body Part Grouping with Symmetrical Semantic Graphs |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115546888A (en) | 2022-12-30 |
Family
ID=84726312
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211084071.5A Pending CN115546888A (en) | 2022-09-06 | 2022-09-06 | A Convolutional Pose Estimation Method Based on Body Part Grouping with Symmetrical Semantic Graphs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115546888A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116486489A (en) * | 2023-06-26 | 2023-07-25 | 江西农业大学 | 3D Hand Pose Estimation Method and System Based on Semantic-Aware Graph Convolution |
CN116486489B (en) * | 2023-06-26 | 2023-08-29 | 江西农业大学 | 3D Hand Pose Estimation Method and System Based on Semantic-Aware Graph Convolution |
CN117611675A (en) * | 2024-01-22 | 2024-02-27 | 南京信息工程大学 | Three-dimensional human body posture estimation method, device, storage medium and equipment |
CN117611675B (en) * | 2024-01-22 | 2024-04-16 | 南京信息工程大学 | A three-dimensional human body posture estimation method, device, storage medium and equipment |
CN118247851A (en) * | 2024-05-28 | 2024-06-25 | 江西农业大学 | End-to-end hand-object interaction posture estimation method and system |
CN118397710A (en) * | 2024-06-25 | 2024-07-26 | 广东海洋大学 | Skeleton action recognition method based on semantic decomposition multi-relation graph convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||