CN106292288A - Model parameter correction method based on Policy-Gradient learning method and application thereof - Google Patents

Info

Publication number
CN106292288A
CN106292288A (application CN201610841970.3A)
Authority
CN
China
Prior art keywords
robot
delta
theta
model
model parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610841970.3A
Other languages
Chinese (zh)
Other versions
CN106292288B (en)
Inventor
陈启军
刘成菊
宁静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201610841970.3A priority Critical patent/CN106292288B/en
Publication of CN106292288A publication Critical patent/CN106292288A/en
Application granted granted Critical
Publication of CN106292288B publication Critical patent/CN106292288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05B — CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 — Adaptive control systems electric
    • G05B13/04 — Adaptive control systems electric involving the use of models or simulators
    • G05B13/042 — Adaptive control systems electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The present invention relates to a model parameter correction method based on the policy gradient learning method, and to its application. The correction method comprises the following steps. S1: select the inverted pendulum input parameters and the robot trunk posture parameters as the correction quantities, and establish a model parameter correction equation for them. S2: select the robot's center-of-mass tracking error and the error of the robot's posture relative to the upright body state as the robot's fitness indices for the current environment, and establish a fitness evaluation function. S3: according to the fitness evaluation function, optimize the gain coefficients in the model parameter correction equation with the policy gradient learning method, and substitute the optimized gains into the correction equation to obtain the correction quantities. Compared with the prior art, the policy equation of the invention converges quickly, and the robot can adjust its gait and body posture rapidly and in real time under unknown disturbances, improving the adaptability and robustness of its walking.

Description

Model Parameter Correction Method Based on Policy Gradient Learning Method and Its Application

Technical Field

The invention relates to the technical field of robot walking control, in particular to a model parameter correction method based on the policy gradient learning method and its application.

Background Art

In robot walking, to generate a stable gait, most current approaches abstract the robot into a simple physical model, such as the linear inverted pendulum model (LIPM) or the cart-table model, use the model to simplify the robot's equations of motion, and plan trajectories offline. In such methods, if the model parameters are fixed, the robot's gait cannot be modified, so the robot lacks the ability to reject unknown external disturbances. In current schemes that apply learning methods to robot walking, key parameters affecting the gait are usually selected and then optimized directly in a high-dimensional search space, without abstract modeling of the robot. This requires extensive offline training or long periods of online learning to find a locally optimal solution that keeps the walk stable. These methods make the gait adjustable, but they are unsuitable for fast, real-time adjustment in an unknown environment.

Summary of the Invention

The purpose of the present invention is to overcome the above defects of the prior art by providing a model parameter correction method based on the policy gradient learning method, and its application. The learning method is introduced into the robot's inverted pendulum model, and a model parameter corrector based on policy gradient learning is designed to optimize the gait parameters indirectly. Under the action of the corrector, the policy equation converges quickly, and the robot can adjust its gait and body posture rapidly and in real time under unknown disturbances, improving the adaptability and robustness of its walking.

The purpose of the present invention can be achieved through the following technical solution:

The model parameter correction method based on the policy gradient learning method comprises the following steps:

S1: Select the inverted pendulum input parameters and the robot trunk posture parameters as the correction quantities, and establish a model parameter correction equation for them; the equation contains the gain coefficients to be optimized.

S2: Select the robot's center-of-mass tracking error and the error of the robot's body posture relative to the upright state as the robot's fitness indices for the current environment, and establish a fitness evaluation function.

S3: According to the fitness evaluation function, use the policy gradient learning method to optimize the gain coefficients in the model parameter correction equation, and substitute the optimized gains into the equation to obtain the correction quantities for the next single-support phase.

In step S1, the inverted pendulum input parameters selected as correction quantities are the x-axis step size and the y-axis step size, and the trunk posture parameters selected as correction quantities are the x-axis trunk angle and the y-axis trunk angle. The model parameter correction equation is:

$$\Delta s_x = K_1 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,x,i}-x_{e,x,i}\right) + K_3 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,y,i}-\theta_{B,y,i}^{ref}\right)$$

$$\Delta s_y = K_2 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,y,i}-x_{e,y,i}\right) + K_4 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,x,i}-\theta_{B,x,i}^{ref}\right)$$

$$\Delta\theta_{B,x} = K_5 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(p_{LHip,z,i}-p_{RHip,z,i}\right)$$

$$\Delta\theta_{B,y} = K_6 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(p_{SuppFoot,x,i}-p_{Head,x,i}\right)$$

where the subscripts x, y, z denote the x, y and z axes; s is the step size and $\Delta s$ its correction; $\theta_B$ is the trunk angle and $\Delta\theta_B$ its correction; N is the number of interpolation steps in one single-support phase, and the subscript i denotes the i-th step within that phase; $x_f$ is the Kalman-filtered estimate of the center of mass and $x_e$ its ideal value; $\theta_B^{ref}$ is the trunk inclination when the body is upright; $p_{RHip}$ and $p_{LHip}$ are the displacements of the robot's right and left hip joints; $p_{Head}$ and $p_{SuppFoot}$ are the displacements of the robot's head joint and supporting foot; and $K_1, \ldots, K_6$ are the gain parameters.
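As an illustration, the correction equations above can be sketched in Python. This is a hypothetical reimplementation for clarity, not code from the patent; the function name, array layout and any sample gains are assumed:

```python
import numpy as np

# Hypothetical sketch of the correction equations above (not the patent's code).
# K is the gain set [K1..K6]; all trajectory samples cover one single-support
# phase of N frames.
def correction_amounts(K, x_f, x_e, theta_B, theta_ref,
                       p_LHip_z, p_RHip_z, p_SuppFoot_x, p_Head_x):
    """Return (ds_x, ds_y, dtheta_Bx, dtheta_By) for the next phase.

    x_f, x_e           : (N, 2) filtered / ideal CoM samples, columns = (x, y)
    theta_B, theta_ref : (N, 2) measured / reference trunk angles, columns = (x, y)
    p_*                : (N,)  joint-displacement samples
    """
    com_err = (x_f - x_e).mean(axis=0)               # mean CoM tracking error
    ang_err = (theta_B - theta_ref).mean(axis=0)     # mean trunk-angle error
    ds_x = K[0] * com_err[0] + K[2] * ang_err[1]     # x step uses y trunk error
    ds_y = K[1] * com_err[1] + K[3] * ang_err[0]     # y step uses x trunk error
    dth_x = K[4] * np.mean(p_LHip_z - p_RHip_z)      # left/right hip height gap
    dth_y = K[5] * np.mean(p_SuppFoot_x - p_Head_x)  # fore/aft body lean
    return ds_x, ds_y, dth_x, dth_y
```

Each call averages the errors over the N frames of one single-support phase, matching the 1/N sums in the equations.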

The fitness evaluation function $F(\bar{K})$ is:

$$F(\bar{K}) = \alpha_x\left(\left|\Delta s_x\right|+\left|\Delta\bar{x}_x\right|\right) + \alpha_y\left(\left|\Delta s_y\right|+\left|\Delta\bar{x}_y\right|\right) + \beta_x\left(\left|\Delta\theta_{B,x}\right|+\left|\Delta\bar{\theta}_{B,x}\right|\right) + \beta_y\left(\left|\Delta\theta_{B,y}\right|+\left|\Delta\bar{\theta}_{B,y}\right|\right)$$

$$\Delta\bar{x}_x = \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,x,i}-x_{e,x,i}\right)$$

$$\Delta\bar{x}_y = \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,y,i}-x_{e,y,i}\right)$$

$$\Delta\bar{\theta}_{B,x} = \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,x,i}-\theta_{B,x,i}^{ref}\right)$$

$$\Delta\bar{\theta}_{B,y} = \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,y,i}-\theta_{B,y,i}^{ref}\right)$$

where $\bar{K} = \{K_1, \ldots, K_6\}$ denotes the gain parameter set; $\alpha_x$, $\alpha_y$, $\beta_x$ and $\beta_y$ are weight factors satisfying $\alpha_x + \alpha_y = 1$ and $\beta_x + \beta_y = 1$. The smaller the value of the fitness evaluation function, the higher the robot's fitness under the gain parameter set.
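A minimal sketch of the fitness evaluation function follows. It is an illustrative assumption, not the patent's code; the default weights $\alpha_x = \alpha_y = \beta_x = \beta_y = 0.5$ are placeholders that satisfy the two unit-sum constraints:

```python
import numpy as np

# Illustrative sketch of the fitness evaluation function F; lower is better.
def fitness(ds_x, ds_y, dth_x, dth_y, com_err, ang_err,
            alpha=(0.5, 0.5), beta=(0.5, 0.5)):
    """com_err, ang_err: (N, 2) per-frame error arrays, columns = (x, y)."""
    dx = com_err.mean(axis=0)    # mean CoM tracking error per axis
    dth = ang_err.mean(axis=0)   # mean trunk-angle error per axis
    return (alpha[0] * (abs(ds_x) + abs(dx[0]))
            + alpha[1] * (abs(ds_y) + abs(dx[1]))
            + beta[0] * (abs(dth_x) + abs(dth[0]))
            + beta[1] * (abs(dth_y) + abs(dth[1])))
```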

The specific steps of the policy gradient learning method are:

301: In the k-th iteration, starting from the gain parameter set $\bar{K}_{k-1}$ obtained in the previous iteration, estimate the partial derivative of $F(\bar{K})$ at each parameter of $\bar{K}_{k-1}$ by randomly generating n policies near $\bar{K}_{k-1}$. The resulting policy set is denoted ${}^{m}\bar{K}_{k-1}$ (m = 1, ..., n), where the number of policies n is proportional to the size of the search space. The policy set is generated by the following formula:

$${}^{m}\bar{K}_{k-1} = \bar{K}_{k-1} + {}^{m}\rho$$

where ${}^{m}\rho$ (m = 1, ..., n) denotes the perturbation set; each perturbation $\rho_m$ in the set is drawn at random from $\{-e_m, 0, +e_m\}$, where $e_m$ is the perturbation gain parameter corresponding to $\rho_m$;

302: According to whether the perturbation $\rho_m$ took the value $-e_m$, 0 or $+e_m$, divide the policies ${}^{m}\bar{K}_{k-1}$ into three corresponding groups $G_-$, $G_0$ and $G_+$; substitute each ${}^{m}\bar{K}_{k-1}$ into the fitness evaluation function and compute the average value for each group, $\bar{F}_-$, $\bar{F}_0$ and $\bar{F}_+$;

303: Compute the approximate gradient component from these averages: if $\bar{F}_0 < \bar{F}_+$ and $\bar{F}_0 < \bar{F}_-$, the component is set to 0; otherwise it is set to $\bar{F}_+ - \bar{F}_-$;

304: Normalize the approximate gradient and multiply it by a fixed step factor η to obtain the gradient value; subtract the gradient value from the policy set $\bar{K}_{k-1}$ to obtain this iteration's policy set $\bar{K}_k$, and use $\bar{K}_k$ for the next iteration;

305: When the number of iterations reaches the preset value $N_{iter}$, the iteration ends.
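Steps 301-305 follow a perturbation-based policy gradient scheme in the style of Kohl and Stone. The following is a hedged Python sketch, not the patent's implementation; the function name, grouping details and default values are assumptions, and F is any fitness callback to be minimized:

```python
import random

# Hedged sketch of steps 301-305 (a Kohl-Stone style policy gradient search).
# F: fitness callback (minimized), K0: initial gain set, eps: per-parameter
# perturbation gains e_m, eta: fixed step factor.
def policy_gradient_search(F, K0, eps, eta=0.1, n_policies=12, n_iter=20):
    K = list(K0)
    dim = len(K)
    for _ in range(n_iter):
        # 301: randomly generate n policies around the current gain set
        policies = [[K[j] + random.choice((-eps[j], 0.0, eps[j]))
                     for j in range(dim)] for _ in range(n_policies)]
        scores = [F(p) for p in policies]
        grad = [0.0] * dim
        for j in range(dim):
            # 302: group fitness values by the sign of the j-th perturbation
            minus = [s for p, s in zip(policies, scores) if p[j] < K[j]]
            zero = [s for p, s in zip(policies, scores) if p[j] == K[j]]
            plus = [s for p, s in zip(policies, scores) if p[j] > K[j]]
            if not (minus and plus):
                continue
            avg_m = sum(minus) / len(minus)
            avg_p = sum(plus) / len(plus)
            avg_0 = sum(zero) / len(zero) if zero else min(avg_m, avg_p)
            # 303: zero component if the unperturbed group is already best
            if not (avg_0 < avg_p and avg_0 < avg_m):
                grad[j] = avg_p - avg_m
        # 304: normalize, scale by eta, and step downhill (F is minimized)
        norm = sum(g * g for g in grad) ** 0.5
        if norm > 0.0:
            K = [K[j] - eta * grad[j] / norm for j in range(dim)]
    # 305: stop after the preset number of iterations
    return K
```

With F as the fitness evaluation function, the returned gain set approaches a local minimum as the iteration count grows.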

A model parameter corrector based on the above model parameter correction method outputs the step-size correction $\Delta s$ and the trunk-angle correction $\Delta\theta_B$; the corrector's output is passed to the inverted pendulum model and the robot model for compensation according to the following formulas:

$$\bar{s} = s - \Delta s$$

$$\theta_{B,i} = \theta_{B,i-1} - \Delta\theta_B$$

where $\bar{s}$ is the step size for the next single-support phase, compensated only once per single-support phase, while $\theta_{B,i}$ is the trunk angle at frame i of the single-support phase, compensated N times per phase.
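The two compensation formulas can be sketched as follows (the helper name is hypothetical; the step size is compensated once per single-support phase, the trunk angle once per frame):

```python
# Sketch of how the corrector output is applied (hypothetical helper).
def apply_compensation(s, theta_B_prev, ds, dtheta_B):
    s_next = s - ds                      # next phase's step size (once per phase)
    theta_B_i = theta_B_prev - dtheta_B  # trunk angle for frame i (N times per phase)
    return s_next, theta_B_i
```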

A robot walking control method using the above model parameter corrector comprises the following steps:

1) Use the inverted pendulum model to plan the robot's center-of-mass trajectory and, from it, the corresponding foot trajectories; then apply the resolved velocity control method to compute the inverse kinematics and obtain the robot's joint velocities, according to which the robot model drives the robot to walk;

2) Design two closed loops. First loop: measure the center-of-mass motion from the robot's joint-space state to obtain the actual center-of-mass value; pass it through a Kalman filter to obtain the center-of-mass estimate, and use this estimate to self-correct the inverted pendulum input parameters of the inverted pendulum model.

Second loop: using the Kalman-filtered center-of-mass estimate and the measured trunk angle, use the model parameter corrector to compensate the inverted pendulum input parameters in the inverted pendulum model and the robot trunk angle in the robot model.
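The two closed loops can be summarized as one schematic control step. Every callable passed in (CoM planner, measurement, Kalman update, corrector, compensators) is a hypothetical placeholder standing for the corresponding block, not the patent's implementation:

```python
# Schematic sketch of the two closed loops described above; all callables
# are hypothetical placeholders for the corresponding control blocks.
def walk_control_step(plan_com, measure_com, kalman_update, self_correct,
                      corrector, compensate_pendulum, compensate_robot, state):
    x_e = plan_com(state)      # ideal CoM from the inverted pendulum model
    x_m = measure_com(state)   # actual CoM measured from the joint space
    x_f = kalman_update(x_m)   # Kalman-filtered CoM estimate
    self_correct(x_f)          # loop 1: pendulum input self-correction
    # loop 2: corrector output compensates both models
    ds, dtheta = corrector(x_f, x_e, state["torso_angle"])
    compensate_pendulum(ds)    # step-size compensation (once per phase)
    compensate_robot(dtheta)   # trunk-angle compensation (every frame)
    return x_f, ds, dtheta
```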

Compared with the prior art, the present invention has the following advantages:

1) To avoid a large amount of online computation, the invention selects the inverted pendulum input parameters and the robot trunk angle as the key parameters affecting the gait, finds the intrinsic relationship between the correction quantities and the robot model, and establishes the correction equation relating the correction quantities to the model parameters, thereby optimizing and learning the robot's gait parameters indirectly.

2) To enable the robot to adapt to the external environment, the invention selects the center-of-mass tracking error and the body uprightness error as the robot's fitness indices for the current environment and establishes a fitness evaluation function, thereby improving the robot's fitness under the gain parameter set.

3) According to the robot's fitness evaluation function, the policy gradient learning method is used to optimize the gain coefficients in the correction equation and correct the key gait parameters; a model parameter corrector is designed and applied in robot walking control, separating the robot's walking task from the optimization process and greatly increasing computation speed.

4) The invention realizes gait planning based on the inverted pendulum model, which guarantees walking stability in advance; the robot's gait is adjustable in real time, effectively improving the robot's adaptability and robustness and making the method suitable for walking under unknown disturbances.

Brief Description of the Drawings

Fig. 1 is the fitness-function curve of the policy gradient learning method;

Fig. 2 is the block diagram of humanoid robot walking control using the model parameter corrector;

Fig. 3 shows the effect of the model parameter corrector on the robot trunk angle;

Fig. 4 shows the effect of the model parameter corrector on the planned trajectory of the inverted pendulum's center of mass, where (4a) shows the correction of the x-axis position and (4b) the correction of the y-axis position;

Fig. 5 shows the effect of the model parameter corrector on the robot's footholds, where (5a) shows the correction of the x-direction position and (5b) the correction of the y-direction position.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments. The embodiments are implemented on the premise of the technical solution of the present invention, and a detailed implementation and a specific operating process are given, but the protection scope of the present invention is not limited to the following embodiments.

Taking the humanoid robot NAO as an example, the model parameter correction method based on the policy gradient learning method, the design of the model parameter corrector, and the application of the corrector in NAO walking control are described below.

The model parameter correction method based on the policy gradient learning method, used to correct the key parameters of the inverted pendulum model and the robot model, comprises the following steps:

S1: Establish the model parameter correction equation of the correction quantities:

If a three-dimensional inverted pendulum model is used to abstract and simplify the robot model and to plan the robot's center-of-mass trajectory and corresponding foot trajectories, then many of the pendulum's input parameters affect the robot's gait, such as the step sizes $s_x$ and $s_y$ in the x and y directions, the foot-lift height $s_z$ during walking, and the pendulum height h. In addition, the robot is a high-dimensional model with many degrees of freedom that the inverted pendulum model cannot describe completely, e.g. the trunk posture $\theta_B$. The x-axis step size $s_x$, the y-axis step size $s_y$, the x-axis trunk angle $\theta_{B,x}$ and the y-axis trunk angle $\theta_{B,y}$ are therefore selected as the key parameters affecting the gait, and the model parameter correction equation is established as follows:

$$\Delta s_x = K_1 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,x,i}-x_{e,x,i}\right) + K_3 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,y,i}-\theta_{B,y,i}^{ref}\right)$$

$$\Delta s_y = K_2 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,y,i}-x_{e,y,i}\right) + K_4 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,x,i}-\theta_{B,x,i}^{ref}\right)$$

$$\Delta\theta_{B,x} = K_5 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(p_{LHip,z,i}-p_{RHip,z,i}\right)$$

$$\Delta\theta_{B,y} = K_6 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(p_{SuppFoot,x,i}-p_{Head,x,i}\right)$$

where the subscripts x, y, z denote the x, y and z axes; s is the step size and $\Delta s$ its correction; $\theta_B$ is the trunk angle and $\Delta\theta_B$ its correction; N is the number of interpolation steps in one single-support phase, $t_b$ and $t_e$ are the start and end times of the single-support phase, and $\Delta T$ is the sampling time; the subscript i denotes the i-th step within the single-support phase; $x_f$ is the Kalman-filtered estimate of the center of mass and $x_e$ its ideal value; $\theta_B^{ref}$ is the trunk inclination when the body is upright; $p_{RHip}$ and $p_{LHip}$ are the displacements of the robot's right and left leg hip joints; $p_{Head}$ and $p_{SuppFoot}$ are the displacements of the robot's head joint and supporting foot; and $K_1, \ldots, K_6$ are the gain parameters.

The design of this correction equation accounts for correcting the robot's footholds in the x and y directions and for correcting the uprightness of the trunk. The x-direction step correction $\Delta s_x$ is related to the x-axis center-of-mass error and the y-axis trunk-angle error, because the y-axis trunk angle represents fore-aft body lean. The y-direction step correction $\Delta s_y$ is related to the y-direction center-of-mass error and the x-direction trunk-angle error, because the x-direction trunk angle represents left-right body lean. The x-axis trunk-angle correction $\Delta\theta_{B,x}$ is related to the height difference between the left and right hip joints: when the robot must keep both feet on the ground or perform a walking task, a left-right lean of the body produces a height difference at the hip joints, and for a robot without an actuated waist joint the body lean is in fact mainly produced by this height difference between the two hips. The y-axis trunk-angle correction $\Delta\theta_{B,y}$ is related to the x-direction distance between the head joint and the supporting foot: for a robot without an actuated waist joint, a fore-aft lean of the body can be approximated as a link rotating about the supporting foot, with the link's origin at the supporting foot and its end at the head, so the x-direction distance between the two represents the degree of body lean.

S2: Establish the fitness evaluation function:

If, after adjustment following an unknown disturbance, the robot can maintain high center-of-mass tracking accuracy and a well-upright body posture, its fitness is high. Therefore the robot's center-of-mass tracking error and the error of its body posture relative to the upright state are selected as the robot's fitness indices for the current environment, and the fitness evaluation function $F(\bar{K})$ is established as follows:

$$F(\bar{K}) = \alpha_x\left(\left|\Delta s_x\right|+\left|\Delta\bar{x}_x\right|\right) + \alpha_y\left(\left|\Delta s_y\right|+\left|\Delta\bar{x}_y\right|\right) + \beta_x\left(\left|\Delta\theta_{B,x}\right|+\left|\Delta\bar{\theta}_{B,x}\right|\right) + \beta_y\left(\left|\Delta\theta_{B,y}\right|+\left|\Delta\bar{\theta}_{B,y}\right|\right)$$

$$\Delta\bar{x}_x = \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,x,i}-x_{e,x,i}\right)$$

$$\Delta\bar{x}_y = \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,y,i}-x_{e,y,i}\right)$$

$$\Delta\bar{\theta}_{B,x} = \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,x,i}-\theta_{B,x,i}^{ref}\right)$$

$$\Delta\bar{\theta}_{B,y} = \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,y,i}-\theta_{B,y,i}^{ref}\right)$$

where $\bar{K} = \{K_1, \ldots, K_6\}$ denotes the gain parameter set; $\alpha_x$, $\alpha_y$, $\beta_x$ and $\beta_y$ are weight factors representing, respectively, the weights of the x- and y-direction center-of-mass errors and of the x- and y-axis body-lean-angle errors, satisfying $\alpha_x + \alpha_y = 1$ and $\beta_x + \beta_y = 1$. The smaller the value of the fitness evaluation function, the higher the robot's fitness under the gain parameter set.

Each term of the fitness evaluation function contains the absolute value of a compensation quantity and a mean error: the function is a weighted linear combination of the absolute values of the corrector's outputs, of the center-of-mass errors and of the body-lean-angle errors. The first two terms represent the center-of-mass tracking performance; the last two represent the body's uprightness. If the value of the fitness function gradually decreases during policy gradient learning, the center-of-mass tracking error and the body-lean deviation are gradually decreasing and the robot's fitness under the current parameter set is increasing. The corrections to the inverted pendulum input parameters and to the body lean angle must not be too large, so that the robot's joint-space configuration stays within a reasonable range. It should also be pointed out that the objective function is not directly related to the robot's walking time; however, the better the body's uprightness, the longer the robot can keep tracking the center of mass.

S3: Model parameter optimization and learning:

The policy gradient learning method is used to optimize the gain parameter set $\bar{K}$: each parameter in the set is assigned in turn, and the fitness function value is computed from the gain parameter set and the robot's current state. The basic principle of the method is as follows: assuming the objective function $F(\bar{K})$ is differentiable with respect to each parameter in $\bar{K}$, a local optimum $\bar{K}^*$ is obtained by computing the gradient of $F(\bar{K})$. The specific steps are:

301: In the k-th iteration, starting from the gain parameter set $\bar{K}_{k-1}$ obtained in the previous iteration, estimate the partial derivative of $F(\bar{K})$ at each parameter of $\bar{K}_{k-1}$ by randomly generating n policies near $\bar{K}_{k-1}$. The resulting policy set is denoted ${}^{m}\bar{K}_{k-1}$ (m = 1, ..., n), and the number of policies n is proportional to the size of the search space. The policy set is generated by the following formula:

$${}^{m}\bar{K}_{k-1} = \bar{K}_{k-1} + {}^{m}\rho$$

where ${}^{m}\rho$ (m = 1, ..., n) denotes the perturbation set; each perturbation $\rho_m$ in the set is drawn at random from $\{-e_m, 0, +e_m\}$, where $e_m$ is the preset perturbation gain parameter corresponding to $\rho_m$, and the perturbations $\rho_m$ correspond one-to-one to the policies in ${}^{m}\bar{K}_{k-1}$;

302: According to whether the perturbation $\rho_m$ is negative, zero or positive, divide the policies ${}^{m}\bar{K}_{k-1}$ into three corresponding groups $G_-$, $G_0$ and $G_+$; substitute each ${}^{m}\bar{K}_{k-1}$ into the fitness evaluation function and compute the average value for each group, $\bar{F}_-$, $\bar{F}_0$ and $\bar{F}_+$;

303: Compute the approximate gradient component from $\bar{F}_-$, $\bar{F}_0$ and $\bar{F}_+$: if $\bar{F}_0 < \bar{F}_+$ and $\bar{F}_0 < \bar{F}_-$, the component is set to 0; otherwise it is set to $\bar{F}_+ - \bar{F}_-$;

304: Normalize the approximate gradient and multiply it by a fixed step factor η to obtain the gradient value; subtract the gradient value from the policy set $\bar{K}_{k-1}$ to obtain this iteration's policy set $\bar{K}_k$, and use $\bar{K}_k$ for the next iteration;

305: When the number of iterations reaches the preset value $N_{iter}$, the iteration ends; if $N_{iter}$ is large enough, the obtained solution $\bar{K}^*$ is guaranteed to be a local optimum.

After the locally optimal gains $\bar{K}^*$ are obtained, substituting them into the model parameter correction equation yields the correction quantities for the next single-support phase; these locally optimal corrections are used to correct the robot's inverted pendulum input parameters and trunk angle.

A model parameter corrector is designed whose outputs are the step-size correction Δs and the torso-angle correction Δθ_B. The outputs of the corrector are passed to the inverted pendulum model and the robot model for compensation, according to the following formulas:

s̄ = s − Δs

θ_{B,i} = θ_{B,i-1} − Δθ_B

where s̄ is the step size of the next single-support phase and serves as the corrected input parameter of the inverted pendulum model, compensated only once per single-support phase; θ_{B,i} is the torso angle at the i-th frame of the single-support phase, compensated N times per single-support phase.
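As a minimal sketch, the two compensation rules above can be applied as follows (function names and numeric values are illustrative, not from the patent):

```python
def compensate_step(s, delta_s):
    """Step-size compensation: applied once per single-support phase."""
    return s - delta_s                 # s_bar, the corrected pendulum input

def compensate_torso(theta_prev, delta_theta):
    """Torso-angle compensation: applied at every frame (N times per phase)."""
    return theta_prev - delta_theta    # theta_{B,i} from theta_{B,i-1}

def run_phase(s, theta0, delta_s, delta_theta, N):
    """One single-support phase of N frames with a constant torso correction."""
    s_bar = compensate_step(s, delta_s)
    theta, thetas = theta0, []
    for _ in range(N):
        theta = compensate_torso(theta, delta_theta)
        thetas.append(theta)
    return s_bar, thetas
```

This makes the asymmetry explicit: the step size is corrected once per phase, while the torso angle is corrected frame by frame.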

Fig. 2 is a block diagram of robot walking control using the model parameter corrector, where x_m is the measured value of the center of mass, which, neglecting measurement error, can be taken as the actual value of the center of mass; the other feedback signal is the joint velocity of the robot. The robot walking control method using the model parameter corrector comprises the following steps:

1) Use the inverted pendulum model to plan the center-of-mass trajectory of the robot and then the corresponding foot trajectories; then perform the inverse-kinematics calculation with the resolved velocity control method to obtain the joint velocities of the robot, and the robot model controls the robot's walking according to these joint velocities;

2) Separate the walking task of the robot from the optimization, and design two closed loops. The first closed loop: measure the center-of-mass motion from the joint-space state of the robot to obtain the actual value of the center of mass; pass this value through a Kalman filter to obtain an estimate of the center of mass, and use the Kalman-filtered estimate to self-correct the inverted-pendulum input parameters of the inverted pendulum model;

The second closed loop: using the actual center-of-mass motion and the torso angle measured by the inertial unit, the model parameter corrector based on the policy-gradient learning method compensates the inverted-pendulum input parameters of the inverted pendulum model and the robot torso angle of the robot model.
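The Kalman filtering in the first loop can be illustrated with a minimal scalar filter. This is a sketch only; the patent does not specify the filter's process and measurement models, so a random-walk CoM model and assumed noise variances are used here.

```python
import numpy as np

def kalman_filter_1d(z, q=1e-4, r=1e-2, x0=0.0, p0=1.0):
    """Scalar Kalman filter smoothing noisy CoM measurements z.

    q -- assumed process-noise variance (random-walk CoM model)
    r -- assumed measurement-noise variance
    Returns one filtered estimate per measurement.
    """
    x, p = x0, p0
    out = []
    for zi in z:
        p = p + q                 # predict: variance grows by process noise
        k = p / (p + r)           # Kalman gain
        x = x + k * (zi - x)      # correct with the new measurement
        p = (1.0 - k) * p
        out.append(x)
    return np.array(out)
```

Feeding the filtered estimate, rather than the raw measurement, into the pendulum's self-correction keeps measurement jitter out of the planned trajectory.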

Fig. 1 shows the fitness-function value curve of the policy-gradient learning method for one invocation of the parameter compensator during walking. The curve shows an overall downward trend, indicating that the center-of-mass tracking error and the torso-angle error decrease gradually and the robot's fitness increases; the parameter corrector's correction of the model is therefore effective.

At t = 9 s, a force with a peak value of 6.44 N and a duration of about 0.5 s is applied to the robot, directed mainly along the positive y-axis, with the point of application at the robot's chest. The correction effect of the model parameter corrector on the robot torso angle is shown in Fig. 3. The robot tilts to the rear left under the external force; after about 9 s of adjustment the x-axis torso-angle curve recovers the periodic fluctuation of normal walking, while the y-axis torso-angle curve returns to its normal waveform after about 1 s of adjustment. Since the robot's inertial-unit sensor cannot measure the z-axis torso angle, and no z-axis torso-angle compensation is included in the design of the parameter compensator, the z-axis torso angle remains 0 throughout.

Fig. 4 shows the correction of the planned inverted-pendulum center-of-mass trajectory by the model parameter corrector together with the actually measured center-of-mass trajectory: "Expected CoM" is the ideal trajectory and "Measured CoM" the measured one. Under the action of the compensator the ideal trajectory changes somewhat, the center of mass is tracked well, and the external force has essentially no effect on the tracking. Under the external force the inverted-pendulum input parameters are re-corrected and the start and end times of the single-support phase must be recomputed, which causes the center of mass to fluctuate in the y direction when the support foot switches; after a two-step adjustment, however, the y-direction center-of-mass trajectory becomes smooth again.

Fig. 5 shows the correction effect of the model parameter corrector on the robot's footholds (relative to the midpoint between the feet in the initial stance; left: x direction, right: y direction). "Reference position" is the foothold curve without the corrector (dotted line), "Measured position" is the actually measured position; the robot performs fixed forward walking. With the parameter corrector added, the robot's step size in the x direction decreases after the external force, so the distance walked in the x direction over the 20 steps decreases. In the y direction, when the robot receives the leftward external force, the body tilts to the left and the center of mass shifts leftward, so the robot takes consecutive steps to the left; the compensation is large in the first two steps, and in the following steps, as the torso returns upright, the y-direction step size gradually recovers its value before the force was applied.

Claims (6)

1. A model parameter correction method based on the policy-gradient learning method, characterized by comprising the following steps:
S1: selecting the inverted-pendulum input parameters and the robot torso posture parameters as correction quantities, and establishing a model parameter correction equation for the correction quantities, the model parameter correction equation containing the gain coefficients to be optimized;
S2: selecting the center-of-mass tracking error of the robot and the error of the robot body posture relative to the upright state as the robot's fitness index for the current environment, and establishing a fitness evaluation function;
S3: according to the fitness evaluation function, optimizing the gain coefficients in the model parameter correction equation by the policy-gradient learning method, and substituting the optimized gain parameters into the model parameter correction equation to obtain the correction quantities.

2. The model parameter correction method based on the policy-gradient learning method according to claim 1, characterized in that, in step S1, the inverted-pendulum input parameters selected as correction quantities comprise the x-axis step size and the y-axis step size, and the robot torso posture parameters selected as correction quantities comprise the x-axis torso angle and the y-axis torso angle; the model parameter correction equation is specifically:

$$\begin{aligned}
\Delta s_x &= K_1 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,x,i}-x_{e,x,i}\right) + K_3 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,y,i}-\theta_{B,y,i}^{ref}\right) \\
\Delta s_y &= K_2 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,y,i}-x_{e,y,i}\right) + K_4 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,x,i}-\theta_{B,x,i}^{ref}\right) \\
\Delta\theta_{B,x} &= K_5 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(p_{LHip,z,i}-p_{RHip,z,i}\right) \\
\Delta\theta_{B,y} &= K_6 \cdot \frac{1}{N}\sum_{i=1}^{N}\left(p_{SuppFoot,x,i}-p_{Head,x,i}\right)
\end{aligned}$$

where subscripts x, y, z denote the x, y and z axes respectively; s is the step size and Δs its correction; θ_B is the torso angle and Δθ_B its correction; N is the number of interpolation steps in one single-support phase, and subscript i denotes the i-th step of the single-support phase; x_f is the Kalman-filtered estimate of the center of mass and x_e the ideal value of the center of mass; θ_B^{ref} is the inclination angle when the torso is upright; p_RHip and p_LHip are the displacements of the robot's right and left hip joints, p_Head and p_SuppFoot the displacements of the robot's head joint and support foot; and K_1, ..., K_6 are the gain parameters.

3. The model parameter correction method based on the policy-gradient learning method according to claim 2, characterized in that the fitness evaluation function F(K) is specifically:

$$F(\mathbf{K}) = \alpha_x\left(|\Delta s_x|+|\Delta\bar{x}_x|\right) + \alpha_y\left(|\Delta s_y|+|\Delta\bar{x}_y|\right) + \beta_x\left(|\Delta\theta_{B,x}|+|\Delta\bar{\theta}_{B,x}|\right) + \beta_y\left(|\Delta\theta_{B,y}|+|\Delta\bar{\theta}_{B,y}|\right)$$

$$\Delta\bar{x}_x = \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,x,i}-x_{e,x,i}\right), \qquad
\Delta\bar{x}_y = \frac{1}{N}\sum_{i=1}^{N}\left(x_{f,y,i}-x_{e,y,i}\right)$$

$$\Delta\bar{\theta}_{B,x} = \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,x,i}-\theta_{B,x,i}^{ref}\right), \qquad
\Delta\bar{\theta}_{B,y} = \frac{1}{N}\sum_{i=1}^{N}\left(\theta_{B,y,i}-\theta_{B,y,i}^{ref}\right)$$

where K = {K_1, ..., K_6} denotes the gain parameter set; α_x, α_y, β_x and β_y are weight factors satisfying α_x + α_y = 1 and β_x + β_y = 1; the smaller the value of the fitness evaluation function, the higher the robot's fitness under the gain parameter set.

4. The model parameter correction method based on the policy-gradient learning method according to claim 3, characterized in that the specific steps of the policy-gradient learning method are:
301: in the k-th iteration, for the gain parameter set K_{k-1} obtained in the previous iteration, randomly generating n policies in the neighborhood of K_{k-1}, the resulting policy set being denoted ^mK_{k-1} (m = 1, ..., n), where the number of policies n is proportional to the size of the search space; the policy set is generated as:
^mK_{k-1} = K_{k-1} + ^mρ
where ^mρ (m = 1, ..., n) denotes the perturbation set, each perturbation ρ_m in it being chosen at random from the set {-e_m, 0, +e_m}, with e_m the perturbation gain parameter corresponding to ρ_m;
302: according to whether ρ_m takes the value -e_m, 0 or +e_m, dividing the ^mK_{k-1} correspondingly into three groups G_-, G_0 and G_+; substituting the ^mK_{k-1} into the fitness evaluation function to obtain the average value of each group: Avg_-, Avg_0 and Avg_+;
303: computing the approximate gradient value: if Avg_0 < Avg_- and Avg_0 < Avg_+, the gradient component is 0; otherwise it is Avg_+ − Avg_-;
304: orthogonalizing the approximate gradient and multiplying it by a fixed step-size factor η to obtain the gradient value; subtracting this gradient value from the policy set K_{k-1} to obtain the policy set K_k of the current iteration, and using K_k for the next iteration;
305: when the number of iterations reaches the preset value N_iter, the iteration ends.

5. A model parameter corrector based on the method of claim 2, characterized in that the outputs of the model parameter corrector are the step-size correction Δs and the torso-angle correction Δθ_B; the outputs of the corrector are passed to the inverted pendulum model and the robot model for compensation, satisfying the following formulas:
s̄ = s − Δs
θ_{B,i} = θ_{B,i-1} − Δθ_B
where s̄ is the step size of the next single-support phase, compensated only once per single-support phase, and θ_{B,i} is the torso angle of the i-th frame of the single-support phase, compensated N times per single-support phase.

6. A robot walking control method using the model parameter corrector of claim 5, characterized by comprising the following steps:
1) using the inverted pendulum model to plan the center-of-mass trajectory of the robot and then the corresponding foot trajectories, then performing the inverse-kinematics calculation with the resolved velocity control method to obtain the joint velocities of the robot, the robot model controlling the robot's walking according to the joint velocities;
2) designing two closed loops; the first closed loop: measuring the center-of-mass motion from the joint-space state of the robot to obtain the actual value of the center of mass, passing it through a Kalman filter to obtain an estimate of the center of mass, and using the Kalman-filtered estimate to self-correct the inverted-pendulum input parameters of the inverted pendulum model;
the second closed loop: using the Kalman-filtered estimate of the center of mass and the measured torso angle, employing the model parameter corrector to compensate the inverted-pendulum input parameters of the inverted pendulum model and the robot torso angle of the robot model.
CN201610841970.3A 2016-09-22 2016-09-22 Model parameter correction method and corrector based on Policy-Gradient learning method Active CN106292288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610841970.3A CN106292288B (en) 2016-09-22 2016-09-22 Model parameter correction method and corrector based on Policy-Gradient learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610841970.3A CN106292288B (en) 2016-09-22 2016-09-22 Model parameter correction method and corrector based on Policy-Gradient learning method

Publications (2)

Publication Number Publication Date
CN106292288A true CN106292288A (en) 2017-01-04
CN106292288B CN106292288B (en) 2017-10-24

Family

ID=57712212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610841970.3A Active CN106292288B (en) 2016-09-22 2016-09-22 Model parameter correction method and corrector based on Policy-Gradient learning method

Country Status (1)

Country Link
CN (1) CN106292288B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315346A (en) * 2017-06-23 2017-11-03 武汉工程大学 A kind of humanoid robot gait's planing method based on CPG models
CN107891920A (en) * 2017-11-08 2018-04-10 北京理工大学 A kind of leg joint offset angle automatic obtaining method for biped robot
CN108646797A (en) * 2018-04-27 2018-10-12 宁波工程学院 A kind of multi-line cutting machine tension control method based on genetic optimization
CN109976333A (en) * 2019-02-21 2019-07-05 南京邮电大学 A kind of apery Soccer robot omnidirectional walking manner
CN113569653A (en) * 2021-06-30 2021-10-29 宁波春建电子科技有限公司 A 3D Head Pose Estimation Algorithm Based on Facial Feature Information
WO2022205840A1 (en) * 2021-03-30 2022-10-06 深圳市优必选科技股份有限公司 Robot and gait control method and apparatus therefor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008071352A (en) * 2006-09-13 2008-03-27 Samsung Electronics Co Ltd Mobile robot posture estimation apparatus and method
CN101509781A (en) * 2009-03-20 2009-08-19 同济大学 Walking robot positioning system based on monocular cam
JP2013132731A (en) * 2011-12-27 2013-07-08 Seiko Epson Corp Robot control system, robot system and robot control method
CN104020772A (en) * 2014-06-17 2014-09-03 哈尔滨工程大学 Complex-shaped objective genetic path planning method based on kinematics
CN104318071A (en) * 2014-09-30 2015-01-28 同济大学 Robot walking control method based on linear foothold compensator
CN104898672A (en) * 2015-05-12 2015-09-09 北京理工大学 Optimized control method of humanoid robot walking track


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315346A (en) * 2017-06-23 2017-11-03 武汉工程大学 A kind of humanoid robot gait's planing method based on CPG models
CN107315346B (en) * 2017-06-23 2020-01-14 武汉工程大学 Humanoid robot gait planning method based on CPG model
CN107891920A (en) * 2017-11-08 2018-04-10 北京理工大学 A kind of leg joint offset angle automatic obtaining method for biped robot
CN108646797A (en) * 2018-04-27 2018-10-12 宁波工程学院 A kind of multi-line cutting machine tension control method based on genetic optimization
CN109976333A (en) * 2019-02-21 2019-07-05 南京邮电大学 A kind of apery Soccer robot omnidirectional walking manner
CN109976333B (en) * 2019-02-21 2022-11-01 南京邮电大学 Omnidirectional walking method of humanoid football robot
WO2022205840A1 (en) * 2021-03-30 2022-10-06 深圳市优必选科技股份有限公司 Robot and gait control method and apparatus therefor
CN113569653A (en) * 2021-06-30 2021-10-29 宁波春建电子科技有限公司 A 3D Head Pose Estimation Algorithm Based on Facial Feature Information

Also Published As

Publication number Publication date
CN106292288B (en) 2017-10-24

Similar Documents

Publication Publication Date Title
CN106292288B (en) Model parameter correction method and corrector based on Policy-Gradient learning method
CN106406085B (en) Based on the space manipulator Trajectory Tracking Control method across Scale Model
CN102348541B (en) Robot apparatus and control method therefor
US20130079929A1 (en) Robot and control method thereof
Liu et al. Bipedal walking with dynamic balance that involves three-dimensional upper body motion
Graf et al. A closed-loop 3d-lipm gait for the robocup standard platform league humanoid
JP6781101B2 (en) Non-linear system control method, biped robot control device, biped robot control method and its program
CN104345735A (en) Robot walking control method based on foothold compensator
CN107315346B (en) Humanoid robot gait planning method based on CPG model
CN111230868B (en) Gait planning and control method for biped robot when the forward direction is disturbed by external thrust
CN103112517B (en) A kind of method and apparatus regulating quadruped robot body posture
JP2012223864A (en) Motion data generation system and motion data generation method
Piperakis et al. Predictive control for dynamic locomotion of real humanoid robots
JP5198035B2 (en) Legged robot and control method thereof
Song et al. CPG-based control design for bipedal walking on unknown slope surfaces
Folgheraiter et al. Computational efficient balance control for a lightweight biped robot with sensor based zmp estimation
JP5616289B2 (en) Floor surface estimation device
CN111353119B (en) A Stability Analysis Method of Bipedal Walking Based on Track Mechanical Energy
Kim et al. A model predictive capture point control framework for robust humanoid balancing via ankle, hip, and stepping strategies
Xie et al. Gait optimization and energy-based stability for biped locomotion using large-scale programming
JP5440152B2 (en) Legged robot and gait data generation method for it
JP2002086373A (en) Real-time optimal control method for legged robot
Kim et al. Adjustment of home posture of biped humanoid robot using sensory feedback control
JP5616288B2 (en) Control device for moving body
Wang et al. A robust biped gait controller using step timing optimization with fixed footprint constraints

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant