CN113158459A

CN113158459A - Human body posture estimation method based on visual and inertial information fusion

Info

Publication number: CN113158459A
Application number: CN202110422431.7A
Authority: CN
Inventors: 张文安; 朱腾辉; 杨旭升
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2021-07-23
Anticipated expiration: 2041-04-20
Also published as: CN113158459B

Abstract

A human body posture estimation method based on visual and inertial information fusion aims at the defect that a human body posture estimation method based on a 3D visual sensor cannot provide three-degree-of-freedom rotation information, the visual information, the inertial information and human body posture prior information are fused in a self-adaptive mode by utilizing the complementarity of the visual information and the inertial information and adopting a nonlinear optimization method, the rotation angle of a human body skeleton node and the global position of a root skeleton node at each moment are obtained, and real-time human body posture estimation is completed. The invention effectively improves the accuracy and robustness of human body posture estimation, and makes up the defects that a visual sensor is easy to be shielded and inertial data accumulate errors along with time.

Description

A Human Pose Estimation Method Based on Fusion of Visual and Inertial Information

技术领域technical field

本发明属于人体姿态估计领域，尤其是一种基于视觉和惯性信息融合的人体姿态估计方法。The invention belongs to the field of human body posture estimation, in particular to a human body posture estimation method based on fusion of visual and inertial information.

背景技术Background technique

人体姿态估计技术有着重要的应用价值，随着视觉传感器、惯性测量单元和人工智能等技术的发展，人体姿态估计逐渐应用于人机协作、视频监控、影视制作和工农业生产等领域，例如，用于保障人机协作过程中工人的安全问题，用于记录和分析监控画面中人的举止行为等。Human pose estimation technology has important application value. With the development of visual sensors, inertial measurement units and artificial intelligence technologies, human pose estimation is gradually applied in the fields of human-machine collaboration, video surveillance, film and television production, and industrial and agricultural production. For example, It is used to ensure the safety of workers in the process of human-machine collaboration, and to record and analyze the behavior of people in the monitoring screen.

3D人体姿态估计技术已趋近成熟，随着行为识别、辅助训练和人机协作等领域的发展，人们需要6D人体姿态估计信息来开发应用，例如，在舞蹈辅助训练中，6D人体姿态估计包含了关节旋转信息，捕捉的舞蹈动作细节更丰富，学员训练的效果更佳。在日常生产生活中，基于3D视觉的人体姿态估计方法最为常见实用，该方法可对人体骨架关节点进行准确提取，得到3D人体姿态信息，但当人体发生自遮挡或相机被部分遮挡的情况时，数据可信度降低。惯性测量单元可以提供空间旋转信息，其输出稳定，但旋转信息的误差会随着时间的增加而累积。利用视觉和惯性信息的互补性可以对人体进行6D姿态估计，将视觉输出的三自由度位移信息和惯性输出的三自由度旋转信息进行简单结合，此时获得的人体姿态估计系统鲁棒性差，精度不高。目前，还没有技术能够鲁棒、实时地融合视觉和惯性信息解决6D人体姿态估计问题。3D human pose estimation technology has become mature. With the development of behavior recognition, auxiliary training and human-machine collaboration, people need 6D human pose estimation information to develop applications. For example, in dance assistant training, 6D human pose estimation includes With the joint rotation information, the captured dance movement details are more abundant, and the training effect of the students is better. In daily production and life, the human body pose estimation method based on 3D vision is the most common and practical method. This method can accurately extract the joint points of the human skeleton and obtain 3D human body pose information. However, when the human body is self-occluded or the camera is partially occluded , the reliability of the data is reduced. The inertial measurement unit can provide spatial rotation information, and its output is stable, but the error of the rotation information will accumulate over time. The complementarity of visual and inertial information can be used to estimate the 6D posture of the human body, and the three-degree-of-freedom displacement information of the visual output and the three-degree-of-freedom rotation information of the inertial output can be simply combined. At this time, the obtained human body posture estimation system has poor robustness. Accuracy is not high. Currently, there is no technology that can robustly and real-time fuse visual and inertial information to solve the problem of 6D human pose estimation.

发明内容SUMMARY OF THE INVENTION

为了克服基于3D视觉传感器的人体姿态估计方法无法提供三自由度旋转信息的缺点，本发明提供一种基于视觉和惯性信息融合的人体姿态估计方法，利用视觉与惯性信息的互补性，对视觉、惯性和人体姿态先验知识进行非线性融合，得到每个时刻人体骨骼节点的旋转角度和根骨骼节点的全局位置，完成实时的人体姿态估计，有效提高了人体姿态估计的精确度和鲁棒性。In order to overcome the disadvantage that the human body pose estimation method based on 3D vision sensor cannot provide three-degree-of-freedom rotation information, the present invention provides a human body pose estimation method based on the fusion of visual and inertial information. Non-linear fusion of inertial and human pose prior knowledge to obtain the rotation angle of the human skeleton node and the global position of the root skeleton node at each moment, complete real-time human pose estimation, and effectively improve the accuracy and robustness of human pose estimation .

本发明所采用的技术方案包括以下步骤：The technical scheme adopted in the present invention comprises the following steps:

一种基于视觉和惯性信息融合的人体姿态估计方法，包括以下步骤：A human pose estimation method based on fusion of visual and inertial information, comprising the following steps:

步骤1)建立人体各骨骼节点运动学模型，确定优化变量θ，确定相机坐标系c和全局坐标系g之间的齐次变换矩阵

惯性坐标系n与全局坐标系g之间的旋转矩阵

惯性传感器i与对应骨骼坐标系b_i之间的位移矩阵

和旋转矩阵

Step 1) Establish the kinematic model of each skeleton node of the human body, determine the optimization variable θ, and determine the homogeneous transformation matrix between the camera coordinate system c and the global coordinate system g

Rotation matrix between inertial coordinate system n and global coordinate system g

Displacement matrix between inertial sensor i and corresponding bone coordinate system b _i

and rotation matrix

步骤2)设置视觉和惯性的输出频率一致，以惯性传感器的旋转约束E_R(θ)、加速度约束E_A(θ)、视觉传感器位置约束E_P(θ)以及人体姿态先验约束E_prior(θ)为优化项构建优化问题，设置各优化项的权重；Step 2) Set the visual and inertial output frequencies to be consistent, and use the inertial sensor rotation constraint _ER (θ), acceleration constraint EA (θ), visual sensor position constraint _EP (θ) and human body pose prior constraint _E _prior ( θ) Construct an optimization problem for the optimization item, and set the weight of each optimization item;

步骤3)每个时刻读取视觉传感器的位置测量值

以及惯性传感器的旋转测量值R_i和加速度测量值a_i，计算各优化项在统一坐标系后的传感器测量值

与估计值

Step 3) Read the position measurement of the vision sensor at each moment

and the rotation measurement value R _i and acceleration measurement value a _i of the inertial sensor, calculate the sensor measurement value of each optimization item after the unified coordinate system

with estimated value

步骤4)求解非线性最小二乘优化问题，每个时刻的最优解θ即当前时刻人体各骨骼节点的最优旋转角度和根骨骼节点n₁的最优全局位置，再根据建立的人体骨骼节点运动学模型，得到当前时刻的人体姿态估计；Step 4) Solve the nonlinear least squares optimization problem, the optimal solution θ at each moment is the optimal rotation angle of each skeleton node of the human body at the current moment and the optimal global position of the root skeleton node n ₁ , and then according to the established human skeleton Node kinematics model to get the human pose estimation at the current moment;

重复执行步骤3)和4)完成每个时刻对人体各关节点的状态估计，得到基于视觉和惯性信息融合的实时人体姿态估计。Repeat steps 3) and 4) to complete the state estimation of each joint point of the human body at each moment, and obtain a real-time human body pose estimation based on the fusion of visual and inertial information.

进一步，在所述步骤1)中，所述的相机坐标系c表示深度相机的坐标系，惯性坐标系n表示所有惯性传感器经过标定后的统一坐标系，全局坐标系g与骨骼节点的初始坐标系对齐。Further, in the step 1), the camera coordinate system c represents the coordinate system of the depth camera, the inertial coordinate system n represents the unified coordinate system after calibration of all inertial sensors, the global coordinate system g and the initial coordinates of the skeleton node system aligned.

在所述步骤2)中，所述的旋转约束E_R(θ)通过每个IMU旋转矩阵的测量值和估计值间的差异来建立；所述的加速度约束E_A(θ)通过最小化每个IMU加速度的测量值和估计值间的差异来建立；所述的位置约束E_P(θ)通过最小化各骨骼节点全局位置的测量值和估计值间的差异来建立；所述的人体姿态先验约束E_prior(θ)通过已有人体姿态估计数据集来建立。In the step 2), the rotation constraint E _R (θ) is established by the difference between the measured value and the estimated value of each IMU rotation matrix; the acceleration constraint E _A (θ) is established by minimizing each IMU rotation matrix. The difference between the measured value and the estimated value of the acceleration of each IMU is established; the position constraint _EP (θ) is established by minimizing the difference between the measured value and the estimated value of the global position of each skeletal node; the human body pose The prior constraint E _prior (θ) is established from the existing human pose estimation dataset.

在所述步骤3)中，所述的

是从视觉传感器中读取的人体各关节点位置信息，R_i是从惯性的陀螺仪中读取的旋转信息，a_i是从惯性的加速度计中读取的加速度信息。In the step 3), the

is the position information of each joint point of the human body read from the vision sensor, R _i is the rotation information read from the inertial gyroscope, and a _i is the acceleration information read from the inertial accelerometer.

在所述步骤4)中，所述的根骨骼节点n₁位于人体盆骨关节点。In the step 4), the root bone node n ₁ is located at the joint point of the human pelvis.

本发明的有益效果主要表现在：提供一种基于视觉和惯性信息融合的人体姿态估计方法，针对基于3D视觉传感器的人体姿态估计方法缺少三自由度旋转信息输出的缺点，采用非线性优化的方法自适应融合视觉信息、惯性信息以及人体姿态先验信息得到6D人体姿态估计，同时提高了人体姿态估计的精度和鲁棒性，弥补了视觉传感器易受遮挡以及惯性数据随时间累积误差的不足。The beneficial effects of the present invention are mainly manifested in: providing a human body attitude estimation method based on the fusion of visual and inertial information, aiming at the shortcoming that the human body attitude estimation method based on 3D visual sensor lacks the output of three-degree-of-freedom rotation information, a nonlinear optimization method is adopted. Adaptive fusion of visual information, inertial information, and human pose prior information to obtain 6D human pose estimation, which improves the accuracy and robustness of human pose estimation, and makes up for the shortcomings of visual sensors that are easily occluded and inertial data accumulated over time.

附图说明Description of drawings

图1是基于视觉和惯性信息融合的人体姿态估计方法流程图。Figure 1 is a flowchart of a method for human pose estimation based on fusion of visual and inertial information.

图2是人体上半身骨骼节点与IMU佩戴位置示意图。Figure 2 is a schematic diagram of the upper body skeleton nodes and the wearing position of the IMU.

图3是视觉传感器摆放位置示意图。Figure 3 is a schematic diagram of the placement position of the vision sensor.

图4是视觉和惯性信息融合的人体姿态估计算法流程图。Figure 4 is a flowchart of a human pose estimation algorithm based on fusion of visual and inertial information.

具体实施方式Detailed ways

为了使本发明的技术方案、设计思路能更加清晰，姿态估计对象选择人体上半身，采用两个视觉传感器和五个惯性传感器，结合附图对本发明做进一步描述。In order to make the technical solutions and design ideas of the present invention clearer, the upper body of the human body is selected as the object of attitude estimation, and two visual sensors and five inertial sensors are used to further describe the present invention with reference to the accompanying drawings.

参照图1、图2、图3和图4，一种基于视觉和惯性信息融合的人体姿态估计方法，所述方法包括以下步骤：Referring to Figure 1, Figure 2, Figure 3 and Figure 4, a method for estimating human body pose based on visual and inertial information fusion, the method comprises the following steps:

惯性坐标系n与全局坐标系g之间的旋转矩阵

惯性传感器i与对应骨骼坐标系b_i之间的位移矩阵

和旋转矩阵

过程如下：Step 1) Establish the kinematic model of each skeleton node of the human body, determine the optimization variable θ, and determine the homogeneous transformation matrix between the camera coordinate system c and the global coordinate system g

and rotation matrix

The process is as follows:

1.1)人体骨骼定义为相互连接的刚体，骨骼节点的初始坐标系B与全局坐标系g对齐，定义上半身骨骼数量n_b＝13，如图2所示，骨骼节点分别为左手、右手、左前臂、右前臂、左上臂、右上臂、左肩、右肩、脊柱1～4以及盆骨，其中盆骨视作根骨骼节点n₁，子骨骼节点n_b(b≥2)都与其父节点有相对的旋转矩阵R_b和相对不变的位移t_b，每个骨骼间具有3个旋转自由度，根节点具有全局位移量(x₁,y₁,z₁)，用42(d＝3+3×n_b＝42)个自由度来表述整个人体上半身的运动，将42个变量记为一个42维向量θ，向量θ作为优化问题的优化变量，由正向运动学公式推导得出每个刚性骨骼在全局坐标系下的齐次变换矩阵

1.1) Human bones are defined as rigid bodies connected to each other. The initial coordinate system B of the skeleton node is aligned with the global coordinate system g, and the number of bones in the upper body is defined as n _b = 13. As shown in Figure 2, the skeleton nodes are the left hand, right hand, and left forearm respectively. , right forearm, left upper arm, right upper arm, left shoulder, right shoulder, spine 1-4, and pelvis, where the pelvis is regarded as the root bone node n ₁ , and the child bone node n _b (b≥2) is relative to its parent node The rotation matrix R _b and the relatively invariant displacement t _b , each bone has 3 rotational degrees of freedom, the root node has a global displacement (x ₁ , y ₁ , z ₁ ), with 42 (d=3+3 ×n _b = 42) degrees of freedom to describe the motion of the entire upper body of the human body, record the 42 variables as a 42-dimensional vector θ, the vector θ is used as the optimization variable of the optimization problem, and each rigidity is derived from the forward kinematics formula. Homogeneous transformation matrix of bones in global coordinate system

其中P(b)为全部骨骼的集合；where P(b) is the set of all bones;

1.2)如图3所示，将两个视觉传感器分别放置于测试者面前，视觉传感器距离测试者L＝2米，使用“张正友相机标定法”得到两个相机对于全局坐标系g的平移矩阵

和旋转矩阵

进而确定相机坐标系c和全局坐标系g的齐次变换矩阵

1.2) As shown in Figure 3, two visual sensors are placed in front of the tester respectively, the visual sensor is L=2 meters away from the tester, and the "Zhang Zhengyou camera calibration method" is used to obtain the translation matrix of the two cameras for the global coordinate system g

and rotation matrix

Then determine the homogeneous transformation matrix of the camera coordinate system c and the global coordinate system g

1.3)将惯性传感器IMU放置于全局坐标系g处，使得惯性传感器坐标系n与全局坐标系g对齐，得到此时惯性传感器输出值，即为惯性传感器坐标系n与全局坐标系g之间的旋转矩阵

重复上述操作，得到第i(i＝1,2,3,4,5)个惯性传感器坐标系n_i与全局坐标系g之间的旋转矩阵

1.3) Place the inertial sensor IMU at the global coordinate system g, so that the inertial sensor coordinate system n is aligned with the global coordinate system g, and the output value of the inertial sensor at this time is obtained, which is the difference between the inertial sensor coordinate system n and the global coordinate system g. rotation matrix

Repeat the above operations to obtain the rotation matrix between the ith (i=1, 2, 3, 4, 5) inertial sensor coordinate system n _i and the global coordinate system g

1.4)如图2所示，IMU穿戴在左手、右手、左前臂、右前臂、盆骨的相应骨骼点处，IMU_i与对应骨骼坐标系b_i之间不存在位移，即

在初始时刻测试者做“T-pose”校准姿态，此时定义IMU_i的测量值为R_{i_initial}，则IMU_i与对应骨骼坐标系b_i之间的旋转矩阵

表示为：1.4) As shown in Figure 2, the IMU is worn at the corresponding bone points of the left hand, right hand, left forearm, right forearm, and pelvis, and there is no displacement between IMU _i and the corresponding bone coordinate system b _i , that is,

At the initial moment, the tester performs the "T-pose" calibration posture. At this time, the measured value of IMU _i is defined as R _{i_initial} , then the rotation matrix between IMU _i and the corresponding bone coordinate system b _i

Expressed as:

步骤2)设置视觉和惯性的输出频率为30HZ，以惯性传感器的旋转约束E_R(θ)、加速度约束E_A(θ)、视觉传感器位置约束E_P(θ)以及人体姿态先验约束E_prior(θ)为优化项构建优化问题，设置各优化项的权重；过程如下：Step 2) Set the output frequency of vision and inertia to 30HZ, and use inertial sensor rotation constraint E _R (θ), acceleration constraint E _A (θ), visual sensor position constraint E _P (θ) and human body pose prior constraint E _prior (θ) Build an optimization problem for the optimization items, and set the weight of each optimization item; the process is as follows:

2.1)IMU_i对应的骨骼节点在全局坐标系下的旋转矩阵的测量值和估计值之间的差异作为IMU的旋转项约束。对应骨骼节点的旋转矩阵测量值表示为：2.1) The difference between the measured value and the estimated value of the rotation matrix of the skeleton node corresponding to IMU _i in the global coordinate system is used as the rotation term constraint of the IMU. The rotation matrix measurements of the corresponding skeleton nodes are expressed as:

其中，R_i为IMU_i的旋转测量值。对应骨骼节点的旋转矩阵估计值表示为：where R _i is the rotation measurement of IMU _i . The estimated value of the rotation matrix of the corresponding skeleton node is expressed as:

其中，P(b_i)为骨骼b_i全部父骨骼的集合。Among them, P( _bi ) is the set of all parent bones of bone _bi .

综上，旋转项的能量函数定义为：In summary, the energy function of the rotation term is defined as:

其中，ψ(·)提取旋转矩阵四元数表示方法的向量部分，λ_R为旋转项能量函数的权重，ρ_R(·)表示一种损失函数。Among them, ψ(·) extracts the vector part of the rotation matrix quaternion representation method, λ _R is the weight of the rotation term energy function, and ρ _R (·) represents a loss function.

2.2)最小化IMU_i加速度测量值a_i和估计值之间误差作为IMU的加速度约束项。加速度估计值

表示为：2.2) Minimize the error between the IMU _i acceleration measurement value a _i and the estimated value as the acceleration constraint term of the IMU. acceleration estimate

Expressed as:

其中，

等式(6)左边(t-1)表示当前时刻使用上一时刻的加速度约束。全局坐标系下的加速度测量值

通过前一帧的旋转信息和加速度测量值计算得出。加速度测量值

表示为：in,

The left side (t-1) of equation (6) indicates that the current moment uses the acceleration constraint of the previous moment. Acceleration measurements in the global coordinate system

Calculated from rotation information and acceleration measurements from the previous frame. acceleration measurement

Expressed as:

其中，a_g为重力加速度。where a _g is the acceleration of gravity.

综上，加速度项的能量函数定义为：In summary, the energy function of the acceleration term is defined as:

其中，λ_A为加速度项能量函数的权重，ρ_A(·)表示一种损失函数。Among them, λ _A is the weight of the acceleration term energy function, and ρ _A (·) represents a loss function.

2.3)从视觉传感器的深度相机中得到人体骨骼节点的全局坐标(x,y,z)，加入最小化骨骼节点全局位置的测量值和估计值之间最小化的约束项。定义用于位置约束项的骨骼节点数量为n_p，骨骼节点在相机c坐标系下的位置为

相机数量为n_c。骨骼节点位置的估计值

表示为：2.3) Obtain the global coordinates (x, y, z) of the human skeleton node from the depth camera of the vision sensor, and add a constraint item that minimizes the minimum between the measured value and the estimated value of the global position of the skeleton node. Define the number of skeletal nodes used for position constraints as n _p , and the position of the skeletal nodes in the camera c coordinate system is

The number of cameras is n _c . Estimates of skeletal node positions

Expressed as:

综上，位置约束的能量函数定义为：In summary, the energy function of the position constraint is defined as:

其中，λ_P为位置约束项能量函数的权重，ρ_P(·)表示一种损失函数。Among them, λ _P is the weight of the energy function of the position constraint term, and ρ _P (·) represents a loss function.

2.4)考虑实际骨骼的运动自由度存在限制，所以使用姿态先验项E_prior(θ)来限制关节不合理的运动。E_prior(θ)通过已有人体姿态估计数据集“TotalCapture(2017)”来建立，其中包含126000帧人体运动姿态数据。2.4) Considering that there are limitations in the freedom of movement of the actual bones, the pose prior term E _prior (θ) is used to limit the unreasonable movement of the joints. E _prior (θ) is established by the existing human pose estimation dataset "TotalCapture (2017)", which contains 126,000 frames of human motion pose data.

首先对数据集中所有的数据进行k-means聚类，选择聚类种类k＝126000/100＝1260。再对所有的聚类中心取均值，得到姿态的均值μ。最后对原始数据进行统计分析得姿态的标准差σ和各个骨骼节点的自由度上下限θ_max和θ_min。由此，姿态先验项定义为：First, perform k-means clustering on all the data in the dataset, and select the clustering type k=126000/100=1260. Then take the mean of all the cluster centers to get the mean value μ of the pose. Finally, statistical analysis is performed on the original data to obtain the standard deviation σ of the pose and the upper and lower limits θ _max and θ _min of the degrees of freedom of each skeletal node. From this, the pose prior is defined as:

其中，

的维度为36，不对根节点的位移和旋转作限制，λ_prior为姿态先验项能量函数的权重，ρ_prior(·)表示一种损失函数。in,

The dimension of is 36, and the displacement and rotation of the root node are not restricted. λ _prior is the weight of the energy function of the attitude prior item, and ρ _prior ( ) represents a loss function.

2.5)综上所述，构建优化问题：2.5) To sum up, construct the optimization problem:

其中，E_A、E_P、E_prior中的损失函数设置为ρ(x)＝log(1+x)，通过按设定比例加大对异常值的惩罚，来限制异常值的影响。各优化项的权重设置为λ_R＝0.1，λ_P＝10，λ_A＝0.005，λ_prior＝0.0001。Among them, the loss function in E _A , E _P , and E _prior is set to ρ(x)=log(1+x), and the influence of outliers is limited by increasing the penalty for outliers in a set proportion. The weights of each optimization term are set as λ _R =0.1, λ _P =10, λ _A =0.005, λ _prior =0.0001.

步骤3)在每个时刻读取视觉传感器的位置测量值

与估计值

Step 3) Read the position measurement of the vision sensor at each moment

with estimated value

如图1所示，重复执行步骤3)和4)完成每个时刻对人体关节点位置与旋转的最优估计，得到基于视觉和惯性信息融合的实时人体姿态估计。As shown in Figure 1, steps 3) and 4) are repeatedly performed to complete the optimal estimation of the position and rotation of human joint points at each moment, and a real-time human pose estimation based on the fusion of visual and inertial information is obtained.

本说明书的实施例所述的内容仅仅是对发明构思的实现形式的列举，仅作说明用途。本发明的保护范围不应当被视为仅限于本实施例所陈述的具体形式，本发明的保护范围也及于本领域的普通技术人员根据本发明构思所能想到的等同技术手段。The content described in the embodiments of the present specification is merely an enumeration of the implementation forms of the inventive concept, and is only used for illustration purposes. The protection scope of the present invention should not be construed as being limited to the specific forms stated in this embodiment, and the protection scope of the present invention also extends to equivalent technical means that those of ordinary skill in the art can think of according to the inventive concept.

Claims

1. a human body pose estimation method based on vision and inertial information fusion, is characterized in that, described method comprises the following steps:

and rotation matrix

Step 2) Set the visual and inertial output frequencies to be consistent, and use the inertial sensor rotation constraint _ER (θ), acceleration constraint EA (θ), visual sensor position constraint _EP (θ) and human body pose prior constraint _E _prior ( θ) Construct an optimization problem for the optimization item, and set the weight of each optimization item;

Step 3) Read the position measurement of the vision sensor at each moment

with estimated value

Step 4) Solve the nonlinear least squares optimization problem, and obtain the optimal solution θ at each moment, that is, the optimal rotation angle of each skeleton node of the human body and the optimal global position of the root skeleton node n ₁ at the current moment. Skeletal node kinematics model to obtain the human body pose estimation at the current moment;

Step 5) Repeat steps 3) and 4) to complete the state estimation of each joint point of the human body at each moment, and obtain a real-time human body pose estimation based on the fusion of visual and inertial information.

2. the human body attitude estimation method based on vision and inertial information fusion as claimed in claim 1, is characterized in that, in described step 1), described camera coordinate system c represents the coordinate system of depth camera, inertial coordinate system n represents the unified coordinate system of all inertial sensors after calibration, and the global coordinate system g is aligned with the initial coordinate system of the skeleton node.

3. the human body attitude estimation method based on vision and inertial information fusion as claimed in claim 1 or 2, is characterized in that, in described step 2), described rotation constraint E _R (θ) rotates by each IMU The difference between the measured value and the estimated value of the matrix is established; the described acceleration constraint E _A (θ) is established by minimizing the difference between the measured value and estimated value of each IMU acceleration; the described position constraint E _P ( θ) is established by minimizing the difference between the measured value and estimated value of the global position of each skeleton node; the human pose prior constraint E _prior (θ) is established through the existing human pose estimation data set.

4. the human body posture estimation method based on vision and inertial information fusion as described in claim 1 or 2, is characterized in that, in described step 3), described

5. the human body pose estimation method based on vision and inertial information fusion as claimed in claim 1 or 2, is characterized in that, in described step 4), described root skeleton node n ₁ is located at human pelvic joint point .

6. the human body posture estimation method based on vision and inertial information fusion as described in claim 1 or 2, is characterized in that, the process of described step 1) is:

1.1) Human bones are defined as interconnected rigid bodies. The initial coordinate system B of the skeleton node is aligned with the global coordinate system g, and the number of upper body bones is defined as n _b = 13. The skeleton nodes are the left hand, right hand, left forearm, right forearm, and left upper arm. , right upper arm, left shoulder, right shoulder, spine 1-4, and pelvis, where the pelvis is regarded as the root bone node n ₁ , and the child bone node n _b (b≥2) has a relative rotation matrix R _b and its parent node. Relatively constant displacement t _b , each bone has 3 rotational degrees of freedom, the root node has a global displacement (x ₁ , y ₁ , z ₁ ), with 42 (d=3+3×n _b =42) A degree of freedom is used to describe the motion of the entire upper body of the human body. The 42 variables are recorded as a 42-dimensional vector θ, and the vector θ is used as the optimization variable of the optimization problem. The forward kinematics formula deduces that each rigid bone is in the global coordinate system. The homogeneous transformation matrix of

where P(b) is the set of all bones;

1.2) Place the two visual sensors in front of the tester respectively, the visual sensor is L=2 meters away from the tester, and use the "Zhang Zhengyou camera calibration method" to obtain the translation matrix of the two cameras for the global coordinate system g

and rotation matrix

1.4) The IMU is worn at the corresponding bone points of the left hand, right hand, left forearm, right forearm, and pelvis, and there is no displacement between IMU _i and the corresponding bone coordinate system b _i , that is,

Expressed as:

7. the human body posture estimation method based on vision and inertial information fusion as claimed in claim 6, is characterized in that, the process of described step 2) is:

2.1) The difference between the measured value and the estimated value of the rotation matrix of the skeleton node corresponding to IMU _i in the global coordinate system is used as the rotation item constraint of the IMU, and the measured value of the rotation matrix of the corresponding skeleton node is expressed as:

Among them, R _i is the rotation measurement value of IMU _i , and the estimated value of the rotation matrix of the corresponding skeleton node is expressed as:

Among them, P( _bi ) is the set of all parent bones of bone _bi ;

In summary, the energy function of the rotation term is defined as:

Among them, ψ(·) extracts the vector part of the rotation matrix quaternion representation method, λ _R is the weight of the rotation term energy function, and ρ _R (·) represents a loss function;

2.2) Minimize the error between the IMU _i acceleration measurement value a _i and the estimated value as the acceleration constraint term of the IMU, and the acceleration estimated value

Expressed as:

in,

The left side (t-1) of equation (6) represents the acceleration measurement value in the global coordinate system using the acceleration constraint of the previous moment at the current moment

Calculated from the rotation information and acceleration measurement value of the previous frame, the acceleration measurement value

Expressed as:

Among them, a _g is the acceleration of gravity;

In summary, the energy function of the acceleration term is defined as:

Among them, λ _A is the weight of the acceleration term energy function, and ρ _A (·) represents a loss function;

2.3) Obtain the global coordinates (x, y, z) of the human skeleton node from the depth camera of the vision sensor, add a constraint item that minimizes the minimum between the measured value and the estimated value of the global position of the skeleton node, and define the position constraint The number of bone nodes of the item is n _p , and the position of the bone nodes in the camera c coordinate system is

The number of cameras is n _c , the estimated value of the position of the skeleton node

Expressed as:

In summary, the energy function of the position constraint is defined as:

Among them, λ _P is the weight of the energy function of the position constraint term, and ρ _P (·) represents a loss function;

2.4) Considering the limitations of the freedom of movement of the actual bones, the pose prior term E _prior (θ) is used to limit the unreasonable motion of the joints. E _prior (θ) uses the existing human pose estimation dataset "TotalCapture (2017)" to build, which contains 126,000 frames of human motion pose data;

First, perform k-means clustering on all the data in the dataset, select the clustering type k=126000/100=1260, then take the mean value of all cluster centers to obtain the mean value μ of the pose, and finally perform statistical analysis on the original data to obtain The standard deviation σ of the pose and the upper and lower limits θ _max and θ _min of the degrees of freedom of each skeletal node, thus, the pose prior term is defined as:

in,

The dimension of is 36, and the displacement and rotation of the root node are not restricted, λ _prior is the weight of the energy function of the attitude prior item, and ρ _prior ( ) represents a loss function;

2.5) To sum up, construct the optimization problem:

Among them, the loss function in E _A , E _P , and E _prior is set to ρ(x)=log(1+x), and the influence of outliers is limited by increasing the penalty for outliers according to the set ratio. The weights of the terms are set to λ _R =0.1, λ _P =10, λ _A =0.005, λ _prior =0.0001.