WO2023159978A1 - Quadruped robot motion control method based on reinforcement learning and position increments - Google Patents

Quadruped robot motion control method based on reinforcement learning and position increments Download PDF

Info

Publication number
WO2023159978A1
WO2023159978A1 PCT/CN2022/125983 CN2022125983W WO2023159978A1 WO 2023159978 A1 WO2023159978 A1 WO 2023159978A1 CN 2022125983 W CN2022125983 W CN 2022125983W WO 2023159978 A1 WO2023159978 A1 WO 2023159978A1
Authority
WO
WIPO (PCT)
Prior art keywords
quadruped robot
reinforcement learning
foot
plantar
control method
Prior art date
Application number
PCT/CN2022/125983
Other languages
English (en)
French (fr)
Inventor
张伟
盛嘉鹏
陈燕云
方兴
谭文浩
Original Assignee
山东大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东大学 filed Critical 山东大学
Publication of WO2023159978A1 publication Critical patent/WO2023159978A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Definitions

  • the invention relates to the field of quadruped robot control, in particular to a quadruped robot motion control method based on reinforcement learning and position increment.
  • Quadruped robots are widely used in scenarios such as monitoring patrols, environmental reconnaissance, and transportation and supply.
  • the flexibility and adaptability of quadruped robots also make their dynamic characteristics more complex, which makes it a great challenge to realize animal-like motion of quadruped robots, and flexible and efficient motion control is the foundation and prerequisite for realizing the specific functions of all kinds of mobile robots.
  • there are two main categories of motion control methods for quadruped robots: model-based methods and reinforcement-learning-based methods.
  • the traditional modeling control method performs feature extraction and obtains valuable information based on the state information of the robot, and then the controller calculates the motor control commands.
  • the design of nonlinear controllers is difficult, with many constraints and hard-to-solve formulations; it also requires explicit state estimation and empirically set thresholds to trigger a finite state machine that coordinates the motion controllers.
  • the control method based on deep reinforcement learning does not require precise models, and can automatically design control strategies for robots in various complex environments through environmental interaction, which greatly reduces human labor burden and achieves good control results.
  • this approach mainly has two directions: pure reinforcement learning, and reinforcement learning combined with traditional control. Since the quadruped robot is a dynamically stabilized balancing system with many degrees of freedom, it is difficult to design the rewards and train quadruped motion control with pure reinforcement learning, and uncoordinated, unnatural gaits easily appear compared with quadruped animals in nature.
  • An effective solution is to directly combine traditional control methods to build a layered control framework.
  • Reinforcement learning is used as the upper-layer strategy to control the lower-layer traditional controller, and the traditional controller outputs motor control instructions to realize the stable walking of the quadruped robot.
  • This method reduces the difficulty of reinforcement learning to a certain extent, but the control performance of reinforcement learning is limited by low-level traditional controllers, and the ability to adapt to the environment is insufficient.
  • This can be addressed by introducing a periodic oscillator to replace the complex lower-level controller, with reinforcement learning instead directly outputting the periodic oscillator parameters and the plantar position residuals, which are finally synthesized into motor control commands.
  • the object of the present invention is to address the defects in the prior art by providing a quadruped robot motion control method and system based on reinforcement learning and position increments, which constrain the amount of change of the plantar position of the quadruped robot within each time step, avoid sudden changes in control commands, obtain smooth and fluent plantar trajectories without smoothness or motor speed rewards, and enhance the ability of the quadruped robot to traverse complex terrain with a flexible, stable and fluent gait.
  • the first object of the present invention is to provide a quadruped robot motion control method based on reinforcement learning and position increment, comprising the following steps:
  • the position of the sole of the foot within each preset time step is generated when the quadruped robot moves, and the variation of the position of the sole of the foot within each time step is calculated;
  • the quadruped robot is controlled to perform corresponding actions, so that the quadruped robot maintains a motion balance.
  • the quadruped robot is provided with an inertial measurement unit and joint motor encoders to obtain the base linear velocity, orientation, angular velocity and joint positions of the quadruped robot.
  • joint state history information and leg phase information of the quadruped robot are obtained, processed and used as the control input of the quadruped robot, and the next action command is obtained to control the movement of the robot.
  • the joint state history information includes the joint position error and the joint velocity, wherein the joint position error is the deviation between the current joint position and the previous joint position command, so as to realize ground contact detection.
  • the plantar position area is selected based on the reinforcement learning strategy, and this area is used as the change interval of the plantar position, so as to constrain the maximum moving distance of the plantar within a single time step.
  • each leg of the quadruped robot uses an independent trajectory generator to output the plantar position in the Z-axis direction; based on the reinforcement learning policy, the plantar position increments and the adjustment frequency of each leg are output;
  • the plantar positions in the X-axis and Y-axis directions are obtained by accumulating the plantar position increments along the X and Y axes, and the plantar position in the Z-axis direction is obtained by superimposing the plantar position increment along the Z axis onto the prior value.
  • the target foot position is pre-defined in the base frame of the quadruped robot, the target joint motor positions are calculated from the corresponding target position, and the joint torques are calculated to track the target joint motor positions.
  • a linear velocity reward function and a rotation direction reward function of the quadruped robot base are designed to encourage the robot to track the given velocity command and rotation direction command issued by the upper-level control instructions.
  • an angular velocity reward function, a lateral coordination reward function, a longitudinal coordination reward function, a stride reward function, a plantar side-slip reward function and a foot-lift reward function are also designed; these reward functions work together to guide the quadruped robot to complete the action execution.
  • the second object of the present invention is to provide a quadruped robot motion control system based on reinforcement learning and position increment, adopting the following technical solutions:
  • the information acquisition module is configured to: acquire motion environment information, quadruped robot attitude information and foot position information;
  • Incremental calculation module configured to: based on the obtained information, generate the plantar position in each preset time step when the quadruped robot moves, and calculate the variation of the foot position in each time step;
  • the trajectory planning module is configured to: use the maximum moving distance in a single time step as a constraint, and simultaneously accumulate the time steps to obtain the plantar position trajectory;
  • the action control module is configured to: control the quadruped robot to perform corresponding actions based on the sole position trajectory combined with a preset reward function, so as to keep the quadruped robot in motion balance.
  • sudden changes in the control commands are avoided by constraining the change of the foot position of the quadruped robot within each time step, obtaining smooth and fluent plantar trajectories without smoothness or motor speed rewards and enhancing the ability of the quadruped robot to traverse complex terrain with a flexible, stable and fluent gait.
  • the quadruped robot learns the variation of the plantar position at each time step to avoid sudden changes in control commands. This method obtains smooth and fluent plantar trajectories without smoothness or motor speed rewards, and the great adjustment capability it gives the reinforcement learning policy enhances the ability of the quadruped robot to traverse complex terrain with a flexible, stable and fluent gait while reducing the learning difficulty.
  • Fig. 1 is a schematic diagram of the comparison between the motion control method in Embodiment 1 or 2 of the present invention and the existing method;
  • FIG. 2 is a schematic diagram of a quadruped robot motion control training framework in Embodiment 1 or 2 of the present invention
  • Fig. 3 is a schematic diagram of the incremental gait developmental learning mode in Embodiment 1 or 2 of the present invention.
  • Fig. 4 is a schematic diagram of reward design in Embodiment 1 or 2 of the present invention.
  • As shown in FIGS. 1-4, a quadruped robot motion control method based on reinforcement learning and position increments is provided.
  • this embodiment proposes a motion control method for quadruped robots based on reinforcement learning and position increments, allowing the quadruped robot to learn the amount of change in the plantar position at each time step, avoiding sudden changes in control commands, enabling the quadruped robot to learn smooth and coordinated motion within the RL framework, and reducing the difficulty of hyperparameter tuning during the training phase.
  • Reinforcement learning needs to interact with the environment to learn.
  • the trial-and-error and randomness of the policy in the early stage of training are likely to cause irreversible destruction and damage to the robot, so it cannot be trained in the real environment. Therefore, this scheme realizes the autonomous movement of the quadruped robot by training in a simulation environment and then transferring to the real environment.
  • the position of the sole of the foot within each preset time step is generated when the quadruped robot moves, and the variation of the position of the sole of the foot within each time step is calculated;
  • the quadruped robot is controlled to perform corresponding actions, so that the quadruped robot maintains a motion balance.
  • the motion problem of the quadruped robot is regarded as a partially observable Markov decision process (POMDP) <S, A, R, P, γ>, where S and A represent the state and action spaces respectively.
  • R(s_t, s_{t+1}) → R is the reward function, P(s_{t+1} | s_t, a_t) is the transition probability, and γ ∈ (0, 1) is the reward discount coefficient.
  • the quadruped robot takes an action a in the current state, obtains a scalar reward r, and then transitions to the next state s, which is determined by the state transition probability distribution P(s_{t+1} | s_t, a_t).
  • the overall goal of quadruped robot training is to find an optimal policy that maximizes the future discounted reward, i.e. π* = argmax_π E[Σ_t γ^t r_t].
  • the gait training framework mainly includes three parts: the design of the observation space, the design of the action space, and the design of the reward function.
  • Reinforcement learning uses the designed reward function to guide the robot to continuously explore in the physical simulation environment to adapt to the complex environment, and finally learns a robust motion controller.
  • the proximal policy optimization (PPO) algorithm and the set reward function are used to optimize the RL strategy.
  • the input is the sensor data after simple preprocessing, the output is the incremental plantar position proposed by this scheme, and the output is finally converted into motor position control instructions.
  • the quadruped robot can track upper-level user commands, including the forward speed of the base and the yaw angle.
  • the speed command vector v_c and the rotation direction command vector θ_c are defined from these forward-speed and yaw commands.
  • the quadruped robot is encouraged to obey the upper-level user commands, maintain balance and complete coordinated movements.
  • IMU Inertial Measurement Unit
  • the joint position error is defined as the deviation between the current joint position and the previous joint position command.
  • the leg phase ⁇ is also used as network input, which is uniquely represented by ⁇ sin( ⁇ ),cos( ⁇ )>. Therefore, the ensemble of the state space at time t is defined as After preprocessing and normalization, these states are used as the input of the network, and then generate the action command at the next moment, control the movement of the quadruped robot, and continue to cycle.
  • this scheme proposes a gait learning method based on incremental plantar positions, allowing the quadruped robot to learn the change in plantar position at each time step, avoiding sudden changes in control commands and obtaining smooth and fluent gait trajectories.
  • the schematic diagram of the incremental gait learning development model is shown in Figure 3. Area II is the area of the plantar position that can be selected by the reinforcement learning strategy, and area III is the allowable range of plantar change positions under the incremental gait.
  • This new incremental action space explicitly constrains the maximum movement distance of the foot in a single time step, and at the same time obtains the best plantar position trajectory through the accumulation of time steps. With the movement of the plantar trajectory, the plantar position space will change dynamically until reaching the mechanical limit, as shown in area I in Figure 3.
  • This approach enables the reinforcement learning policy to be optimized directly with rewards related to the main task (e.g. learning a natural quadruped-animal-like gait), without having to penalize sudden motor state changes in the reward function, which could otherwise lead to motor jitter or a motionless stance.
  • Policies Modulating Trajectory Generators (PMTG) are introduced to assist the training of the quadruped robot.
  • Each leg uses an independent trajectory generator (TG) to output the plantar position in the z-axis direction.
  • TG is defined as a cubic Hermite interpolation (Cubic Hermite Spline) to simulate the basic standing still gait pattern, the formula is as follows:
  • k = 2(φ - π)/π
  • h is the maximum allowable foot lift height
  • ⁇ [0, 2 ⁇ ) is the TG phase.
  • the support phase is φ ∈ [0, π) and the swing phase is φ ∈ [π, 2π).
  • the reinforcement learning strategy outputs the position increment ⁇ [x, y, z] of the sole of the foot and the adjustment frequency f of each leg.
  • ⁇ i ,0 is the initial phase of the i-th leg
  • f0 is the fundamental frequency
  • T is the time between two consecutive control steps.
  • the target foot position (x, y, z)_t at time t can be obtained by (x, y, z)_t = Δ(x, y, z)_t + (x_{t-1}, y_{t-1}, F(φ_t));
  • the position of the sole along the x and y axes is obtained by accumulating the plantar position increments (Δx, Δy) output by the network, and the foot position along the z axis is obtained by superimposing the plantar position increment Δz output by the network onto the prior value provided by the TG.
  • the former makes the change of the plantar target position smoother, and the latter makes it easy to obtain regular periodic motion.
  • inverse kinematics (IK) is used to compute the corresponding target motor positions, and a proportional-derivative (PD) controller computes the joint torques to track them.
  • the design of the reward function is the key to the whole reinforcement learning framework, and it plays two roles at the same time.
  • One is capability evaluation, where human designers use specified reward functions to evaluate the behavior of quadruped robots; the other is behavior guidance, where the implementation of RL algorithms uses reward functions to determine robot behavior.
  • the mathematical form and design goals of the reward function designed in this project will be described in detail below.
  • the following two kernel functions are introduced to constrain the reward function to ensure that the reward value is within a reasonable range:
  • v b and ⁇ c are the base linear velocity and rotation direction, respectively, and the velocity norm
  • can scale the reward to an appropriate range.
  • a stride reward is designed to encourage the robot to prioritize increasing/decreasing stride length rather than movement frequency when increasing/decreasing velocity.
  • k_{c,t} is the curriculum factor.
  • the curriculum factor is an adjustment parameter introduced by curriculum learning, which is used to describe the difficulty of training.
  • the method of curriculum learning is introduced in this embodiment, so that the robot can learn the main tasks first (obeying motion commands and maintaining body balance) at the beginning of the training phase, and then gradually increase the coefficient of constraint items.
  • the curriculum factor k_{c,t} describes the level of difficulty during training; k_d represents the growth rate with which k_{c,t} reaches the maximum curriculum difficulty level.
  • the PPO hyperparameter settings are shown in Table 1.
  • As shown in FIGS. 1-4, a quadruped robot motion control system based on reinforcement learning and position increments is provided.
  • the information acquisition module is configured to: acquire motion environment information, quadruped robot attitude information and foot position information;
  • Incremental calculation module configured to: based on the obtained information, generate the plantar position in each preset time step when the quadruped robot moves, and calculate the variation of the foot position in each time step;
  • the trajectory planning module is configured to: use the maximum moving distance in a single time step as a constraint, and simultaneously accumulate the time steps to obtain the plantar position trajectory;
  • the action control module is configured to: control the quadruped robot to perform corresponding actions based on the sole position trajectory combined with a preset reward function, so as to keep the quadruped robot in motion balance.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

A quadruped robot motion control method based on reinforcement learning and position increments, relating to the field of quadruped robot control. Motion environment information, quadruped robot posture information and plantar position information are acquired; based on the acquired information, the plantar position within each preset time step during the motion of the quadruped robot is generated, and the amount of change of the plantar position within each time step is calculated; taking the maximum movement distance within a single time step as a constraint, the plantar position trajectory is obtained by accumulating over time steps; the quadruped robot is controlled to perform corresponding actions based on the plantar position trajectory combined with a preset reward function, so that the quadruped robot maintains motion balance. Aiming at the problem in current quadruped robot motion control methods that large sudden changes in the generated motor positions cause motor damage, the amount of change of the plantar position within each time step of the quadruped robot is constrained to avoid sudden changes in control commands, enhancing the ability of the quadruped robot to traverse complex terrain.

Description

Quadruped robot motion control method based on reinforcement learning and position increments
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on February 28, 2022, with application number 202210191785.X and entitled "Quadruped robot motion control method based on reinforcement learning and position increments", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of quadruped robot control, and in particular to a quadruped robot motion control method based on reinforcement learning and position increments.
Background Art
Quadruped robots are widely used in scenarios such as monitoring and patrolling, environmental reconnaissance, and transport and supply. On the other hand, the flexibility and adaptability of quadruped robots also make their dynamics more complex, which makes realizing animal-like motion of quadruped robots a great challenge, while flexible and efficient motion control is the foundation and prerequisite for realizing the specific functions of all kinds of mobile robots.
Motion control of quadruped robots falls mainly into two categories: model-based methods and reinforcement-learning-based methods.
(1) Traditional model-based control methods
Traditional model-based control methods extract features and obtain valuable information from the robot state information, after which a controller computes the motor control commands. This approach has two main technical difficulties. First, an accurate model of the controlled plant must be established, yet the quadruped robot, as a high-order nonlinear complex system, is difficult to model precisely. Second, nonlinear controllers are hard to design, involve many constraints and are difficult to solve; they also require explicit state estimation and empirically set thresholds to trigger a finite state machine that coordinates the motion controllers.
(2) Control methods based on deep reinforcement learning
Control methods based on deep reinforcement learning do not require an accurate model and can automatically design control policies for robots in various complex environments through interaction with the environment, greatly reducing the human labor burden while achieving good control performance. This approach currently follows two main directions: pure reinforcement learning, and reinforcement learning combined with traditional control. Since a quadruped robot is a dynamically stabilized balancing system with many degrees of freedom, designing the rewards and training quadruped motion control with pure reinforcement learning is difficult, and uncoordinated, unnatural gaits easily appear compared with quadruped animals in nature. An effective solution is to combine traditional control methods directly and build a hierarchical control framework, in which reinforcement learning acts as the upper-level policy controlling a lower-level traditional controller, and the traditional controller outputs the motor control commands to realize stable walking of the quadruped robot. This approach reduces the difficulty of reinforcement learning to some extent, but the control performance of reinforcement learning is limited by the lower-level traditional controller, and the ability to adapt to the environment is insufficient. This can be addressed by introducing a periodic oscillator to replace the complex lower-level controller, with reinforcement learning instead directly outputting the periodic oscillator parameters and plantar position residuals, which are finally synthesized into motor control commands. However, due to the nonlinearity of the neural network, the motor positions directly generated by the network can undergo large sudden changes, and the motors must output extremely large torques to track the target positions, which easily causes physical damage to the motors. Although this problem can be alleviated by introducing constraint rewards on motor output torque or speed, doing so greatly increases the difficulty of designing the reward function and tuning its parameters, preventing reinforcement-learning-based methods from obtaining well-performing motion control policies.
Summary of the Invention
The purpose of the present invention is to address the defects in the prior art by providing a quadruped robot motion control method and system based on reinforcement learning and position increments, which constrain the amount of change of the plantar position of the quadruped robot within each time step, avoid sudden changes in control commands, obtain smooth and fluent plantar trajectories without smoothness or motor speed rewards, and enhance the ability of the quadruped robot to traverse complex terrain with a flexible, stable and fluent gait.
The first object of the present invention is to provide a quadruped robot motion control method based on reinforcement learning and position increments, comprising the following steps:
acquiring motion environment information, quadruped robot posture information and plantar position information;
based on the acquired information, generating the plantar position within each preset time step while the quadruped robot moves, and calculating the amount of change of the plantar position within each time step;
taking the maximum movement distance within a single time step as a constraint, and accumulating over time steps to obtain the plantar position trajectory;
controlling the quadruped robot to perform corresponding actions based on the plantar position trajectory combined with a preset reward function, so that the quadruped robot maintains motion balance.
Further, the quadruped robot is provided with an inertial measurement unit and joint motor encoders to acquire the base linear velocity, orientation, angular velocity and joint positions of the quadruped robot.
Further, joint state history information and leg phase information of the quadruped robot are acquired and, after processing, used as the control input of the quadruped robot to obtain the next action command for controlling the motion of the robot.
Further, the joint state history information includes the joint position error and the joint velocity, where the joint position error is the deviation between the current joint position and the previous joint position command, so as to realize ground contact detection.
Further, a plantar position region is selected based on the reinforcement learning policy and used as the interval of plantar position change, thereby constraining the maximum movement distance of the foot within a single time step.
Further, each leg of the quadruped robot uses an independent trajectory generator to output the plantar position along the Z axis; based on the reinforcement learning policy, the plantar position increments and the adjustment frequency of each leg are output; the plantar positions along the X and Y axes are obtained by accumulating the plantar position increments along the X and Y axes, and the plantar position along the Z axis is obtained by superimposing the plantar position increment along the Z axis onto the prior value.
Further, the target plantar position is defined in advance in the base frame of the quadruped robot, the target joint motor positions are computed from the corresponding target position, and the joint torques are computed to track the target joint motor positions.
Further, a base linear velocity reward function and a rotation direction reward function of the quadruped robot are designed to encourage the robot to track the given velocity command and rotation direction command issued by the upper-level control instructions.
Further, an angular velocity reward function, a lateral coordination reward function, a longitudinal coordination reward function, a stride reward function, a plantar side-slip reward function and a foot-lift reward function are designed respectively; the reward functions act together to guide the quadruped robot to complete action execution.
The second object of the present invention is to provide a quadruped robot motion control system based on reinforcement learning and position increments, which adopts the following technical solution:
comprising:
an information acquisition module configured to acquire motion environment information, quadruped robot posture information and plantar position information;
an increment calculation module configured to generate, based on the acquired information, the plantar position within each preset time step while the quadruped robot moves, and to calculate the amount of change of the plantar position within each time step;
a trajectory planning module configured to take the maximum movement distance within a single time step as a constraint and accumulate over time steps to obtain the plantar position trajectory;
an action control module configured to control the quadruped robot to perform corresponding actions based on the plantar position trajectory combined with a preset reward function, so that the quadruped robot maintains motion balance.
Compared with the prior art, the present invention has the following advantages and positive effects:
(1) Aiming at the problem in current quadruped robot motion control methods that large sudden changes in the generated motor positions cause motor damage, the amount of change of the plantar position within each time step is constrained to avoid sudden changes in control commands; smooth and fluent plantar trajectories are obtained without smoothness or motor speed rewards, enhancing the ability of the quadruped robot to traverse complex terrain with a flexible, stable and fluent gait.
(2) The quadruped robot learns the amount of change of the plantar position at each time step to avoid sudden changes in control commands. The method obtains smooth and fluent plantar trajectories without smoothness or motor speed rewards and gives the reinforcement learning policy great adjustment capability, enhancing the ability of the quadruped robot to traverse complex terrain with a flexible, stable and fluent gait while reducing the learning difficulty.
(3) While retaining a learning-based method to reduce the manual design effort of control policies and the human labor burden, the permanent physical damage to motors caused by sudden changes in the motor control commands output by the neural network is avoided by learning the amount of change of the plantar position within a single time step.
(4) Compared with existing end-to-end reinforcement-learning-based robot control methods, additional reward design for motor torque and speed constraints is avoided, the learning difficulty of the control policy is reduced, and the performance of the control policy is improved.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the present invention, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention.
Fig. 1 is a schematic comparison between the motion control method in Embodiment 1 or 2 of the present invention and existing methods;
Fig. 2 is a schematic diagram of the quadruped robot motion control training framework in Embodiment 1 or 2 of the present invention;
Fig. 3 is a schematic diagram of the incremental gait developmental learning mode in Embodiment 1 or 2 of the present invention;
Fig. 4 is a schematic diagram of the reward design in Embodiment 1 or 2 of the present invention.
Detailed Description
Embodiment 1
In a typical embodiment of the present invention, as shown in Figs. 1-4, a quadruped robot motion control method based on reinforcement learning and position increments is provided.
As shown in Fig. 1, different from existing gait control methods for quadruped robots, this embodiment proposes a quadruped robot motion control method based on reinforcement learning and position increments, which lets the quadruped robot learn the amount of change of the plantar position at each time step, avoids sudden changes in control commands, enables the quadruped robot to learn smooth and coordinated motion within the RL framework, and reduces the difficulty of hyperparameter tuning during the training phase. Reinforcement learning must interact with the environment in order to learn, and the trial-and-error and randomness of the policy in the early stage of training are likely to cause irreversible destruction and damage to the robot, so training cannot be carried out in the real environment. Therefore, this scheme realizes the autonomous motion of the quadruped robot by training in a simulation environment and then transferring to the real environment.
As shown in Fig. 2, the quadruped robot motion control method based on reinforcement learning and position increments mainly includes the following steps:
acquiring motion environment information, quadruped robot posture information and plantar position information;
based on the acquired information, generating the plantar position within each preset time step while the quadruped robot moves, and calculating the amount of change of the plantar position within each time step;
taking the maximum movement distance within a single time step as a constraint, and accumulating over time steps to obtain the plantar position trajectory;
controlling the quadruped robot to perform corresponding actions based on the plantar position trajectory combined with a preset reward function, so that the quadruped robot maintains motion balance.
Specifically, the motion problem of the quadruped robot is regarded as a partially observable Markov decision process (POMDP) <S, A, R, P, γ>, where S and A denote the state and action spaces respectively, R(s_t, s_{t+1}) → R is the reward function, P(s_{t+1} | s_t, a_t) is the transition probability, and γ ∈ (0, 1) is the reward discount coefficient. In the current state the quadruped robot takes an action a, obtains a scalar reward r, and then transitions to the next state s, as determined by the state transition probability distribution P(s_{t+1} | s_t, a_t). The overall goal of quadruped robot training is to find an optimal policy that maximizes the future discounted reward, i.e.:
π* = argmax_π E[Σ_t γ^t r_t]
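The interaction loop and discounted objective above can be summarized in a short sketch. The `env`/`policy` interface below is a placeholder and not part of the patent; only the discounted-return computation follows the POMDP definition directly.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Future discounted reward sum_t gamma^t * r_t that the policy maximizes."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def rollout(env, policy, horizon=1000):
    """Hypothetical rollout loop; `env` and `policy` are placeholders, not APIs from the patent."""
    obs = env.reset()
    rewards = []
    for _ in range(horizon):
        action = policy(obs)                   # incremental foot-position action a_t
        obs, reward, done = env.step(action)   # transition drawn from P(s_{t+1} | s_t, a_t)
        rewards.append(reward)                 # scalar reward r_t from R(s_t, s_{t+1})
        if done:
            break
    return discounted_return(rewards)
```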
The quadruped robot gait training framework shown in Fig. 2 mainly comprises three parts: the design of the observation space, the design of the action space, and the design of the reward function. Reinforcement learning uses the designed reward function to guide the robot to explore continuously in the physics simulation environment so as to adapt to complex environments, finally learning a robust motion controller. The proximal policy optimization (PPO) algorithm and the specified reward function are used to optimize the RL policy; its input is sensor data after simple preprocessing, its output is the incremental plantar position proposed by this scheme, and the output is finally converted into motor position control commands. In addition, the quadruped robot can track upper-level user commands, including the forward speed of the base and the yaw angle; the velocity command vector v_c and the rotation direction command vector θ_c are defined from these commands (their exact definitions are given as formula images in the original application). During the training phase, the quadruped robot is encouraged to obey the upper-level user commands, maintain balance and complete coordinated motion.
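As a rough illustration of how such a policy could be optimized, the sketch below assumes the Stable-Baselines3 PPO implementation and a stub gymnasium environment standing in for the physics simulation; the observation and action dimensions (12 foot-position increments plus 4 per-leg frequencies) and all numeric values are illustrative, not fixed by the patent.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO

class QuadrupedIncrementalEnv(gym.Env):
    """Stub environment: observations are the preprocessed sensor vector, actions are
    per-leg foot-position increments plus per-leg frequencies (4*3 + 4 = 16 dims)."""
    def __init__(self):
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(60,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(16,), dtype=np.float32)
    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(60, dtype=np.float32), {}
    def step(self, action):
        obs = np.zeros(60, dtype=np.float32)
        reward, terminated, truncated = 0.0, False, False  # reward would come from the designed terms
        return obs, reward, terminated, truncated, {}

model = PPO("MlpPolicy", QuadrupedIncrementalEnv(), learning_rate=3e-4, gamma=0.99)
model.learn(total_timesteps=10_000)   # in practice, millions of simulation steps
```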
Observation space design
In this embodiment, the quadruped robot contains only the most basic proprioceptive sensors, including an inertial measurement unit (IMU) and 12 motor encoders, from which the base linear velocity v_b ∈ R^3, the orientation θ_b ∈ R^3 or its quaternion form q_b = [x, y, z, w] ∈ R^4, the angular velocity w_b ∈ R^3 and the joint positions θ_j ∈ R^12 can be measured. The joint velocities can be estimated with an extended Kalman filter. Due to the lack of plantar pressure sensors, this scheme introduces the joint state history Θ as a network input to realize ground contact detection; Θ contains the joint position errors and joint velocities, where the joint position error is defined as the deviation between the current joint position and the joint position command of the previous moment. In addition, the leg phase φ is also used as a network input and is uniquely represented by <sin(φ), cos(φ)>. Therefore, the overall state space at time t is defined as the collection of these quantities (the exact expression is given as a formula image in the original application). After preprocessing and normalization, these states serve as the network input, which then generates the action command of the next moment to control the motion of the quadruped robot, and the cycle continues.
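A minimal sketch of assembling the policy input from the sensors listed above; the field names and ordering are illustrative assumptions, and the normalization step is omitted.

```python
import numpy as np

def build_observation(imu, encoders, joint_hist, phases, v_cmd, yaw_cmd):
    """Assemble the policy input from proprioceptive sensing (field names are
    illustrative, not taken from the patent)."""
    obs = np.concatenate([
        imu["base_lin_vel"],             # v_b in R^3
        imu["base_quat"],                # q_b = [x, y, z, w] in R^4
        imu["base_ang_vel"],             # w_b in R^3
        encoders["joint_pos"],           # theta_j in R^12
        encoders["joint_vel"],           # estimated, e.g. by an extended Kalman filter
        np.ravel(joint_hist),            # joint position errors / velocities over past steps
        np.sin(phases), np.cos(phases),  # leg phase encoding <sin(phi), cos(phi)>
        [v_cmd, yaw_cmd],                # upper-level user commands
    ])
    return obs.astype(np.float32)
```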
Action space design
At present, commonly used gait learning approaches for quadruped robots mainly output motor position or plantar position commands directly, which may cause sudden changes in the position commands between two short consecutive time steps, producing excessive joint torques to track the target positions and causing motor damage. To address this problem, this scheme proposes a gait learning method based on incremental plantar positions, letting the quadruped robot learn the amount of change of the plantar position at each time step, avoiding sudden changes in control commands and obtaining smooth and fluent gait trajectories. A schematic diagram of the incremental gait learning developmental mode is shown in Fig. 3: region II is the plantar position region selectable by the reinforcement learning policy, and region III is the interval of plantar position change allowed under the incremental gait.
This new incremental action space explicitly constrains the maximum movement distance of the foot within a single time step, while the best plantar position trajectory is obtained through accumulation over time steps. As the plantar trajectory moves, the plantar position space changes dynamically until the mechanical limit is reached, as shown by region I in Fig. 3. This approach allows the reinforcement learning policy to be optimized directly with rewards related to the main task (e.g. learning a natural, quadruped-animal-like gait), without having to penalize sudden motor state changes in the reward function, which could otherwise lead to motor jitter or a motionless stance.
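The incremental action space can be sketched as a per-step clip followed by accumulation inside the mechanical workspace; the numeric limits below are illustrative, not taken from the patent.

```python
import numpy as np

class IncrementalFootTarget:
    """Accumulate per-step foot-position increments under a per-step bound
    (region III in Fig. 3) and a mechanical workspace limit (region I)."""

    def __init__(self, init_xyz, step_limit=0.02,
                 workspace_lo=(-0.15, -0.08, -0.35), workspace_hi=(0.15, 0.08, -0.15)):
        self.xyz = np.asarray(init_xyz, dtype=float)
        self.step_limit = step_limit          # max foot movement per control step [m], illustrative
        self.lo = np.asarray(workspace_lo)
        self.hi = np.asarray(workspace_hi)

    def update(self, delta_xyz):
        # Constrain the change within a single time step (incremental action space).
        delta = np.clip(delta_xyz, -self.step_limit, self.step_limit)
        # Accumulate over time steps, staying inside the mechanical limits.
        self.xyz = np.clip(self.xyz + delta, self.lo, self.hi)
        return self.xyz
```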
To enable the quadruped robot to learn a natural, regular gait, Policies Modulating Trajectory Generators (PMTG) are introduced to assist the training of the quadruped robot. Each leg uses an independent trajectory generator (TG) to output the plantar position along the z axis. The TG is defined as a cubic Hermite spline to simulate a basic stepping-in-place gait pattern, with the profile F(φ) (given as a formula image in the original application), where k = 2(φ - π)/π, h is the maximum allowed foot lift height, and φ ∈ [0, 2π) is the TG phase, with the support phase φ ∈ [0, π) and the swing phase φ ∈ [π, 2π).
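The patent gives the cubic Hermite profile F(φ) only as an image; the sketch below assumes the piecewise-cubic swing profile commonly used with the substitution k = 2(φ - π)/π, with the foot held at ground level during the support phase.

```python
import numpy as np

def tg_foot_height(phi, h=0.08):
    """Vertical foot target from the trajectory generator (assumed spline form)."""
    phi = np.mod(phi, 2.0 * np.pi)
    if phi < np.pi:                      # support phase: foot stays on the ground
        return 0.0
    k = 2.0 * (phi - np.pi) / np.pi      # k in [0, 2) over the swing phase
    if k < 1.0:                          # lift-off: smooth rise to height h
        return h * (-2.0 * k**3 + 3.0 * k**2)
    return h * (2.0 * k**3 - 9.0 * k**2 + 12.0 * k - 4.0)  # touch-down: smooth descent
```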
The reinforcement learning policy outputs the plantar position increments Δ[x, y, z] and the adjustment frequency f of each leg. The phase of the i-th leg can be computed by φ_i = (φ_{i,0} + (f_0 + f_i)·T) (mod 2π), where φ_{i,0} is the initial phase of the i-th leg, f_0 is the base frequency, and T is the time between two consecutive control steps. The target plantar position (x, y, z)_t at time t can be obtained by the following formula:
(x, y, z)_t = Δ(x, y, z)_t + (x_{t-1}, y_{t-1}, F(φ_t))    (3)
From the above formula, the plantar positions along the x and y axes are obtained by accumulating the plantar position increments (Δx, Δy) output by the network, while the foot position along the z axis is obtained by superimposing the plantar position increment Δz output by the network onto the prior value provided by the TG. The former makes the change of the plantar target position smoother, and the latter makes it easy to obtain regular periodic motion. The target plantar position is defined in advance in the base frame of the robot, then inverse kinematics (IK) is used to compute the corresponding target motor positions, and finally a proportional-derivative (PD) controller computes the joint torques to track the target motor positions.
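A sketch of the per-leg phase update, the target foot position of equation (3), and PD torque tracking of the IK result; the way the phase term accumulates over control steps and the PD gains are assumptions for illustration.

```python
import numpy as np

def leg_phase(phi_0, f_0, f_i, t_step, step_idx):
    """phi_i = (phi_{i,0} + (f_0 + f_i) * T) mod 2*pi, here iterated over control steps
    (an assumption about how the term accumulates)."""
    return np.mod(phi_0 + (f_0 + f_i) * t_step * step_idx, 2.0 * np.pi)

def foot_target(prev_xy, delta_xyz, phi, tg):
    """Equation (3): x, y accumulate the policy increments; z adds the increment
    to the TG prior F(phi). `tg` is a trajectory-generator callable such as the one above."""
    x = prev_xy[0] + delta_xyz[0]
    y = prev_xy[1] + delta_xyz[1]
    z = delta_xyz[2] + tg(phi)
    return np.array([x, y, z])

def pd_torque(q_target, q, qd, kp=40.0, kd=0.5):
    """Joint torques tracking the IK target joint positions (gains are illustrative)."""
    return kp * (q_target - q) - kd * qd
```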
Reward function design
The design of the reward function is the key to the whole reinforcement learning framework, and it plays two roles at the same time. One is capability evaluation: the human designer uses the specified reward functions to evaluate the behavior of the quadruped robot. The other is behavior guidance: the implementation of the RL algorithm uses the reward functions to determine the robot's behavior. The mathematical forms and design goals of the reward functions designed in this work are described in detail below. First, two kernel functions (given as formula images in the original application) are introduced to constrain the reward functions and ensure that the reward values remain within a reasonable range.
A robot base linear velocity reward and a rotation direction reward are designed to encourage the robot to track the given velocity command v_c and rotation direction command θ_c; their specific forms are given as formula images in the original application, where v_b and θ_c are the base linear velocity and rotation direction respectively, and the velocity norm ||v_c|| scales the reward to an appropriate range.
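Since the two kernel functions and the exact reward expressions are shown only as images, the sketch below uses an assumed bounded kernel to illustrate how a velocity-tracking reward can be kept within a reasonable range and scaled by the command norm.

```python
import numpy as np

def bounded_kernel(x, scale=4.0):
    """A bounded error kernel in (0, 1]; this specific form is an assumption for
    illustration, not the patent's kernel."""
    return 2.0 / (np.exp(scale * np.abs(x)) + np.exp(-scale * np.abs(x)))

def velocity_tracking_reward(v_base_x, v_cmd_x):
    """Base linear-velocity tracking reward, with the error scaled by the command norm."""
    err = (v_base_x - v_cmd_x) / max(abs(v_cmd_x), 1e-3)
    return bounded_kernel(err)
```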
An angular velocity reward is designed to encourage the robot to keep the base stable without swaying (its form is given as a formula image in the original application).
A lateral coordination reward is designed to minimize the lateral offset of each leg, as shown in Fig. 4; its formula (given as an image in the original application) involves the y-axis component of the plantar position of the i-th leg.
A longitudinal coordination reward is designed to encourage the four legs to have the same stride and to minimize the sagittal offset, as shown in Fig. 4; its formula (given as an image in the original application) involves the mean and standard deviation, over past time steps, of the x-axis component of the plantar position of the i-th leg. The lateral coordination reward and the longitudinal coordination reward act together to promote the robot to learn and develop a coordinated, natural gait.
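The lateral and longitudinal coordination terms might be implemented along the following lines; the exact formulas are given as images in the original application, so these forms are assumptions.

```python
import numpy as np

def lateral_coordination_reward(foot_pos_y, nominal_y, scale=10.0):
    """Penalize each leg's lateral (y) offset from its nominal track (illustrative form)."""
    offset = np.abs(np.asarray(foot_pos_y) - np.asarray(nominal_y))
    return float(np.exp(-scale * offset.sum()))

def longitudinal_coordination_reward(foot_x_history, scale=10.0):
    """Encourage equal strides: compare the per-leg mean/std of sagittal (x)
    foot positions over past steps across the four legs (illustrative form)."""
    hist = np.asarray(foot_x_history)        # shape (T, 4): past x-positions of the 4 feet
    means, stds = hist.mean(axis=0), hist.std(axis=0)
    spread = (means.max() - means.min()) + (stds.max() - stds.min())
    return float(np.exp(-scale * spread))
```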
A stride reward is designed to encourage the robot to preferentially increase/decrease the stride length rather than the motion frequency when increasing/decreasing speed (its form is given as a formula image in the original application).
A plantar side-slip reward is designed to penalize sliding of the foot during the support phase (its form is given as a formula image in the original application).
A foot-lift reward is designed to allow the foot to move at a greater height during the swing phase (its form is given as a formula image in the original application).
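The foot side-slip and foot-lift terms could take forms such as the following sketch; again, the patent's exact expressions are shown only as images, so these are illustrative.

```python
import numpy as np

def foot_slip_penalty(in_contact, foot_vel_xy, scale=0.1):
    """Penalize horizontal foot velocity while a foot is in support (illustrative form)."""
    slip = np.linalg.norm(np.asarray(foot_vel_xy), axis=1) * np.asarray(in_contact)
    return -scale * float(slip.sum())

def foot_clearance_reward(in_swing, foot_height, target_h=0.06, scale=0.05):
    """Reward swing feet for clearing a target height (illustrative form)."""
    clearance = np.minimum(np.asarray(foot_height), target_h) * np.asarray(in_swing)
    return scale * float(clearance.sum())
```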
All of the above reward functions act together to guide the quadruped robot through the learning process of autonomous gait learning and development; the final reward r_t at each time step t combines these terms (its exact form is given as a formula image in the original application), where k_{c,t} is the curriculum factor. The curriculum factor is an adjustment parameter introduced by curriculum learning and describes the difficulty level of training.
Curriculum learning, as an effective training algorithm for deep reinforcement learning, is often introduced into agent training. Its core idea is to start learning from a simple task or a part of the task, and then gradually increase the difficulty of the task so that the agent finally learns the entire complex task.
Based on this, the curriculum learning method is introduced in this embodiment, so that at the beginning of the training phase the robot first learns the main tasks (obeying motion commands and maintaining body balance), after which the coefficients of the constraint terms are gradually increased. The curriculum factor k_{c,t} describes the difficulty level during training (its definition is given as a formula image in the original application), where k_d represents the growth rate with which k_{c,t} reaches the maximum curriculum difficulty level. The PPO hyperparameter settings are shown in Table 1.
Table 1 PPO hyperparameter settings (the table is provided as an image in the original application)
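The curriculum factor update is shown only as an image; a commonly used form that matches the description (a factor growing monotonically toward the maximum difficulty at a rate set by k_d) is sketched below as an assumption.

```python
def curriculum_factor(k_prev, k_d=0.997):
    """Assumed update k_{c,t} = (k_{c,t-1})^{k_d} with k_d < 1: the factor rises
    from a small initial value toward 1, gradually enabling the constraint terms."""
    return k_prev ** k_d

# Example: starting from 0.3, the constraint-term weights ramp up over training.
k = 0.3
for _ in range(5):
    k = curriculum_factor(k)
```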
Embodiment 2
In another typical embodiment of the present invention, as shown in Figs. 1-4, a quadruped robot motion control system based on reinforcement learning and position increments is provided,
comprising:
an information acquisition module configured to acquire motion environment information, quadruped robot posture information and plantar position information;
an increment calculation module configured to generate, based on the acquired information, the plantar position within each preset time step while the quadruped robot moves, and to calculate the amount of change of the plantar position within each time step;
a trajectory planning module configured to take the maximum movement distance within a single time step as a constraint and accumulate over time steps to obtain the plantar position trajectory;
an action control module configured to control the quadruped robot to perform corresponding actions based on the plantar position trajectory combined with a preset reward function, so that the quadruped robot maintains motion balance.
It can be understood that the quadruped robot motion control system based on reinforcement learning and position increments in this embodiment is implemented on the basis of the motion control method in Embodiment 1; therefore, for a description of the working process of the system, reference can be made to Embodiment 1, which is not repeated here.
The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

  1. A quadruped robot motion control method based on reinforcement learning and position increments, characterized by comprising the following steps:
    acquiring motion environment information, quadruped robot posture information and plantar position information;
    based on the acquired information, generating the plantar position within each preset time step while the quadruped robot moves, and calculating the amount of change of the plantar position within each time step;
    taking the maximum movement distance within a single time step as a constraint, and accumulating over time steps to obtain the plantar position trajectory;
    controlling the quadruped robot to perform corresponding actions based on the plantar position trajectory combined with a preset reward function, so that the quadruped robot maintains motion balance.
  2. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 1, characterized in that the quadruped robot is provided with an inertial measurement unit and joint motor encoders to acquire the base linear velocity, orientation, angular velocity and joint positions of the quadruped robot.
  3. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 1, characterized in that joint state history information and leg phase information of the quadruped robot are acquired and, after processing, used as the control input of the quadruped robot to obtain the next action command for controlling the motion of the robot.
  4. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 3, characterized in that the joint state history information includes the joint position error and the joint velocity, wherein the joint position error is the deviation between the current joint position and the previous joint position command, so as to realize ground contact detection.
  5. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 1, characterized in that a plantar position region is selected based on the reinforcement learning policy and used as the interval of plantar position change, thereby constraining the maximum movement distance of the foot within a single time step.
  6. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 1, characterized in that each leg of the quadruped robot uses an independent trajectory generator to output the plantar position in the Z-axis direction; based on the reinforcement learning policy, the plantar position increments and the adjustment frequency of each leg are output; the plantar positions in the X-axis and Y-axis directions are obtained by accumulating the plantar position increments along the X and Y axes, and the plantar position in the Z-axis direction is obtained by superimposing the plantar position increment along the Z axis onto the prior value.
  7. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 1, characterized in that the target plantar position is defined in advance in the base frame of the quadruped robot, the target joint motor positions are computed from the corresponding target position, and the joint torques are computed to track the target joint motor positions.
  8. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 1, characterized in that a base linear velocity reward function and a rotation direction reward function of the quadruped robot are designed to encourage the robot to track the given velocity command and rotation direction command issued by the upper-level control instructions.
  9. The quadruped robot motion control method based on reinforcement learning and position increments according to claim 1, characterized in that an angular velocity reward function, a lateral coordination reward function, a longitudinal coordination reward function, a stride reward function, a plantar side-slip reward function and a foot-lift reward function are designed respectively, and the reward functions act together to guide the quadruped robot to complete the action execution.
  10. A quadruped robot motion control system based on reinforcement learning and position increments, characterized by comprising:
    an information acquisition module configured to acquire motion environment information, quadruped robot posture information and plantar position information;
    an increment calculation module configured to generate, based on the acquired information, the plantar position within each preset time step while the quadruped robot moves, and to calculate the amount of change of the plantar position within each time step;
    a trajectory planning module configured to take the maximum movement distance within a single time step as a constraint and accumulate over time steps to obtain the plantar position trajectory;
    an action control module configured to control the quadruped robot to perform corresponding actions based on the plantar position trajectory combined with a preset reward function, so that the quadruped robot maintains motion balance.
PCT/CN2022/125983 2022-02-28 2022-10-18 Quadruped robot motion control method based on reinforcement learning and position increments WO2023159978A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210191785.X 2022-02-28
CN202210191785.XA CN114563954A (zh) 2022-02-28 2022-02-28 Quadruped robot motion control method based on reinforcement learning and position increments

Publications (1)

Publication Number Publication Date
WO2023159978A1 true WO2023159978A1 (zh) 2023-08-31

Family

ID=81716157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125983 WO2023159978A1 (zh) 2022-02-28 2022-10-18 Quadruped robot motion control method based on reinforcement learning and position increments

Country Status (2)

Country Link
CN (1) CN114563954A (zh)
WO (1) WO2023159978A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114563954A (zh) * 2022-02-28 2022-05-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increments
CN114859737B (zh) * 2022-07-08 2022-09-27 中国科学院自动化研究所 Gait transition method, apparatus, device and medium for a quadruped robot
CN118012077A (zh) * 2024-04-08 2024-05-10 山东大学 Quadruped robot motion control method and system based on reinforcement-learning action imitation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120072026A1 (en) * 2010-09-22 2012-03-22 Canon Kabushiki Kaisha Robot system controlling method, robot system, and control apparatus for quadrupedal robot
CN107562052A (zh) * 2017-08-30 2018-01-09 唐开强 Hexapod robot gait planning method based on deep reinforcement learning
CN110893118A (zh) * 2018-09-12 2020-03-20 微创(上海)医疗机器人有限公司 Surgical robot system and motion control method for a robotic arm
CN112060082A (zh) * 2020-08-19 2020-12-11 大连理工大学 Humanoid robot with online stability control based on a bio-inspired reinforcement-learning cerebellar model
CN112596534A (zh) * 2020-12-04 2021-04-02 杭州未名信科科技有限公司 Gait training method and apparatus for a quadruped robot based on deep reinforcement learning, electronic device and medium
CN113821045A (zh) * 2021-08-12 2021-12-21 浙江大学 Reinforcement-learning action generation system for legged robots
CN114563954A (zh) * 2022-02-28 2022-05-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increments

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108572553B (zh) * 2018-05-16 2020-06-23 清华大学深圳研究生院 Closed-loop motion control method for a quadruped robot
CN111638646B (zh) * 2020-05-29 2024-05-28 平安科技(深圳)有限公司 Quadruped robot walking controller training method, apparatus, terminal and storage medium
CN111891252B (zh) * 2020-08-06 2021-11-05 齐鲁工业大学 Slope-adaptive body posture control method for a quadruped bionic robot
CN112207825B (zh) * 2020-09-28 2022-02-01 杭州云深处科技有限公司 Control method and apparatus for bionic jumping of a quadruped robot, electronic device and computer-readable medium
CN112684794B (zh) * 2020-12-07 2022-12-20 杭州未名信科科技有限公司 Legged robot motion control method, apparatus and medium based on meta reinforcement learning
CN112666939B (zh) * 2020-12-09 2021-09-10 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN112936290B (zh) * 2021-03-25 2022-06-10 西湖大学 Quadruped robot motion planning method based on hierarchical reinforcement learning
CN113568422B (zh) * 2021-07-02 2024-01-23 厦门大学 Quadruped robot control method based on reinforcement learning optimized by model predictive control
CN113771983A (zh) * 2021-08-30 2021-12-10 北京工业大学 Bionic quadruped robot based on intelligent evolutionary motor-skill learning


Also Published As

Publication number Publication date
CN114563954A (zh) 2022-05-31

Similar Documents

Publication Publication Date Title
WO2023159978A1 (zh) Quadruped robot motion control method based on reinforcement learning and position increments
KR100837988B1 (ko) Motion control apparatus and motion control method for a legged mobile robot, and robot apparatus
Kim et al. Stabilizing series-elastic point-foot bipeds using whole-body operational space control
Righetti et al. An autonomous manipulation system based on force control and optimization
JP5052013B2 (ja) Robot apparatus and control method therefor
JP6715952B2 (ja) Mobile robot with improved balanced motion and behavior capabilities
Jamone et al. Autonomous online learning of reaching behavior in a humanoid robot
JP2001277159A (ja) Legged mobile robot, control method therefor, and relative-movement measurement sensor for a legged mobile robot
US9014854B2 (en) Robot and control method thereof
Horvat et al. Spine controller for a sprawling posture robot
Liu et al. A SVM controller for the stable walking of biped robots based on small sample sizes
Jamone et al. Autonomous online generation of a motor representation of the workspace for intelligent whole-body reaching
WO2023184933A1 (zh) Neural-oscillator-based rhythmic motion control method and system for a robot
Ferreira et al. SVR versus neural-fuzzy network controllers for the sagittal balance of a biped robot
Zhao et al. A novel algorithm of human-like motion planning for robotic arms
Rosado et al. Reproduction of human arm movements using Kinect-based motion capture data
Hwang et al. Biped Balance Control by Reinforcement Learning.
Kim et al. Assessing whole-body operational space control in a point-foot series elastic biped: Balance on split terrain and undirected walking
Teng et al. Center of gravity balance approach based on CPG algorithm for locomotion control of a quadruped robot
Juang Humanoid robot runs up-down stairs using zero-moment with supporting polygons control
US20240189989A1 (en) Object climbing by legged robots using training objects
Aloulou et al. A minimum jerk-impedance controller for planning stable and safe walking patterns of biped robots
CN117944055B (zh) Coordinated limb balance control method and device for a humanoid robot
CN117555339B (zh) Policy network training method and gait control method for a humanoid bipedal robot
Farshidian Planning and control in face of uncertainty with applications to legged robots

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22928241

Country of ref document: EP

Kind code of ref document: A1