CN113485323B - Flexible formation method for cascading multiple mobile robots - Google Patents
- Publication number: CN113485323B (application CN202110655081.9A)
- Authority: CN (China)
- Prior art keywords: robot, mobile robot, formation, virtual, mobile
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0287—Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
- G05D1/0291—Fleet control
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0219—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
- Feedback Control In General (AREA)
Abstract
Description
Technical Field
The present invention belongs to the field of multi-mobile-robot systems, and specifically relates to a flexible formation method for multiple mobile robots based on a cascade architecture; more specifically, it relates to a cascaded multi-mobile-robot formation method based on reinforcement learning and prior nonlinear distance-angle-heading formation control.
Background Art
With the development of robot technology, formation operation of multiple mobile robots, by virtue of its cooperative capability, effectively improves working efficiency and is gradually replacing traditional single-robot operation. For example, multiple underwater robots perform searches through cooperative formations. In the military field, UAV swarms and multiple ground mobile robots used for mine clearance, search and rescue, reconnaissance and the like all demonstrate the advantages of multi-robot formations. Recently, many domestic hospitals have adopted disinfection mobile robots to replace traditional manual disinfection work; by cooperating in formation, multiple disinfection robots have effectively improved upon the efficiency of single-robot operation.
The leader-follower formation strategy based on distance, angle and heading is one of the common techniques for formation tracking of multiple mobile robots, and it offers better flexibility and scalability than the traditional leader-follower formation strategy. Its basic idea is to designate one robot as the leader and the other robots as followers, then determine the relative distance, relative angle and heading between the leader robot and each follower robot from a preset formation shape, and design the formation control strategy on that basis.
The mainstream methods currently used to realize distance-angle-heading leader-follower formation include nonlinear control and nonlinear model predictive control. The former covers input-output feedback linearization control, feedback control and the like; because many performance gain parameters are introduced, a tedious parameter-tuning process cannot be avoided. The latter depends heavily on an accurate model and places high demands on online solving speed. Moreover, the robustness of the traditional leader-follower formation model needs improvement, as it lacks the ability to flexibly avoid obstacles and to recover the formation.
With the development of artificial intelligence, deep reinforcement learning has also been widely applied to end-to-end mobile-robot tasks thanks to advantages such as being model-free and supporting offline training, but mostly in the single-robot domain. In the multi-robot domain, such end-to-end implementations place strict requirements on sensor and actuator performance, the state and action spaces are high-dimensional, and when deployed on real mobile robots the training cost is too high and the learned policy is difficult to reproduce at inference time.
Summary of the Invention
The present invention aims to address the above deficiencies of the prior art and to provide a flexible formation method for multiple mobile robots that possesses a certain capability for flexible obstacle avoidance and formation recovery.
The present invention adopts the following technical scheme. A cascaded multi-mobile-robot flexible formation method is provided, comprising: determining a dynamic model based on the selected formation shape and the distances, angles and headings between the robots; determining, from the dynamic model and its constraints, the prior controller of the reinforcement-learning architecture used in the flexible formation method for the nonlinear mobile robots; determining an action space based on hyperparameters of the mobile-robot pose vector, the action space comprising the formation-tracking action space of two adjacent mobile robots and the action space each mobile robot needs for independent, flexible obstacle avoidance; determining a state space from the tracking errors of the mobile robots' poses and velocities, the state space comprising: the state space of each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step, the state space between adjacent mobile robots, and the state space each mobile robot needs to describe the surrounding environment; and setting the reinforcement-learning reward functions, which comprise a formation reward function and an obstacle-avoidance reward function.
Based on the prior controller, reinforcement-learning training is carried out through interaction with the environment according to the action space, the state space and the reward functions; upon completion of training, a cascaded multi-mobile-robot flexible formation method comprising a formation policy and a flexible obstacle-avoidance policy is obtained.
Further, the dynamic equation is described as follows:
ẋ = v·cos θ,  ẏ = v·sin θ,  θ̇ = ω    (1)
where η = [x, y, θ]^T denotes the pose vector of each mobile robot, (x, y) is the position of each mobile robot and θ is its orientation angle; v is the linear velocity of the mobile robot, ω is the current angular velocity of the mobile robot, and v_r and v_l denote the velocities of the right and left wheels of the mobile robot, respectively.
The constraint of the dynamic model takes the following (nonholonomic) form:
ẋ·sin θ − ẏ·cos θ = 0    (2)
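For illustration only, the following Python sketch integrates the two-wheel differential-drive model above and checks the nonholonomic constraint of equation (2); the wheel-separation parameter `wheel_base` and the numeric values are illustrative assumptions, not values specified herein.

```python
import math

def unicycle_step(x, y, theta, v_r, v_l, wheel_base, dt):
    """One Euler step of the differential-drive model:
    v = (v_r + v_l) / 2, omega = (v_r - v_l) / wheel_base,
    xdot = v*cos(theta), ydot = v*sin(theta), thetadot = omega."""
    v = 0.5 * (v_r + v_l)
    omega = (v_r - v_l) / wheel_base
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta, v, omega

# Nonholonomic constraint check: xdot*sin(theta) - ydot*cos(theta) == 0,
# i.e. the robot cannot slide sideways.
x, y, theta = 0.0, 0.0, 0.3
x1, y1, theta1, v, omega = unicycle_step(x, y, theta, v_r=0.6, v_l=0.4,
                                         wheel_base=0.35, dt=0.01)
x_dot, y_dot = (x1 - x) / 0.01, (y1 - y) / 0.01
print(abs(x_dot * math.sin(theta) - y_dot * math.cos(theta)) < 1e-9)  # True
```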
Still further, the method for determining the prior controller of the reinforcement-learning architecture in the flexible formation method for nonlinear mobile robots specifically comprises: S31. Defining the desired trajectory of the virtual desired mobile robot as η_r = [x_r, y_r, θ_r]^T, where (x_r, y_r) is the position of the virtual desired mobile robot and θ_r is its angle; the tracking errors of the mobile robot's pose and velocity with respect to the virtual desired trajectory are expressed in terms of the following components:
e_x is the position tracking error in the x direction; e_y is the position tracking error in the y direction; e_θ is the tracking error of the heading angle; ė_x and ė_y are the velocity tracking errors in the x and y directions, respectively; ė_θ is the angular-velocity tracking error; and ω_r is the desired angular velocity of the virtual robot.
S32. Determining the desired formation model relating the distance, angle and heading between adjacent mobile robots, described as follows:
v1 and v2 denote the virtual robot objects that the adjacent mobile robots need to track, recorded as virtual robot 1 and virtual robot 2; (x_v1, y_v1) is the position of virtual robot 1 and (x_v2, y_v2) is the position of virtual robot 2; θ_v1 is the angle of virtual robot 1 and θ_v2 is the angle of virtual robot 2; d_v2v1 is the relative distance between the adjacent virtual robots v1 and v2; φ_v2v1 is the relative angle between v1 and v2; and β_v2v1 is the angle correction that keeps the mobile robots at the same azimuth.
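As a purely illustrative sketch, the quantities d, φ and β of the distance-angle-heading formation model can be computed from the two virtual-robot poses roughly as follows; the exact sign and frame conventions of the patent's formation equation are not reproduced here, so the formulas below are one plausible convention.

```python
import math

def formation_state(pose_v1, pose_v2):
    """Distance-angle-heading quantities between virtual robots v1 and v2.
    d:    relative distance between v1 and v2
    phi:  bearing of v2 seen from v1, expressed in v1's body frame
    beta: heading correction needed for v2 to share v1's azimuth."""
    x1, y1, th1 = pose_v1
    x2, y2, th2 = pose_v2
    d = math.hypot(x2 - x1, y2 - y1)
    phi = math.atan2(y2 - y1, x2 - x1) - th1
    beta = th1 - th2
    wrap = lambda a: math.atan2(math.sin(a), math.cos(a))  # wrap to (-pi, pi]
    return d, wrap(phi), wrap(beta)

print(formation_state((0.0, 0.0, 0.0), (-1.0, -1.0, 0.1)))
```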
S33. Combining (1)-(4) with feedback-linearization nonlinear control theory, the formation-control prior of adjacent mobile robots is described by equation (5), in which v1 is the linear velocity at which virtual robot 1 satisfies the preset formation requirement, v2 is the linear velocity at which virtual robot 2 satisfies the preset formation requirement, w1 is the angular velocity at which virtual robot 1 satisfies the preset formation requirement, and w2 is the angular velocity at which virtual robot 2 satisfies the preset formation requirement; the remaining quantities are the performance hyperparameters of the nonlinear formation prior controller of virtual robot 1 and of virtual robot 2, respectively. The performance hyperparameters directly determine the control performance of the prior controller.
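The patent's equation (5) is not reproduced in this text. As a hedged stand-in that shows the kind of gain-parameterized nonlinear tracking law such a prior controller represents, the sketch below uses the well-known Kanayama tracking controller, whose gains play the same role as the performance hyperparameters [K_x, K_y, K_θ]; it is an illustrative substitute, not the patent's exact controller.

```python
import math

def prior_tracking_control(pose, pose_ref, v_ref, w_ref, Kx, Ky, Kth):
    """Kanayama-style nonlinear tracking law (illustrative stand-in for (5)).
    Tracking error is expressed in the robot body frame, then
    v = v_ref*cos(e_th) + Kx*e_x
    w = w_ref + v_ref*(Ky*e_y + Kth*sin(e_th))."""
    x, y, th = pose
    xr, yr, thr = pose_ref
    dx, dy = xr - x, yr - y
    e_x =  math.cos(th) * dx + math.sin(th) * dy
    e_y = -math.sin(th) * dx + math.cos(th) * dy
    e_th = math.atan2(math.sin(thr - th), math.cos(thr - th))
    v = v_ref * math.cos(e_th) + Kx * e_x
    w = w_ref + v_ref * (Ky * e_y + Kth * math.sin(e_th))
    return v, w

print(prior_tracking_control((0.0, 0.0, 0.0), (0.5, 0.2, 0.1),
                             v_ref=0.3, w_ref=0.0, Kx=1.0, Ky=4.0, Kth=2.5))
```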
Further, the formation-tracking action space of the two adjacent mobile robots is expressed as follows:
it consists of the performance hyperparameters of the nonlinear formation prior controller with which the mobile robot tracks virtual robot 1, together with the performance hyperparameters of the nonlinear formation prior controller with which the adjacent mobile robot tracks virtual robot 2.
The action space each mobile robot needs for independent, flexible obstacle avoidance is expressed as follows:
it consists of v_discrete and ω_discrete, the discretized linear-velocity command and angular-velocity command of the mobile robot, respectively.
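A minimal sketch of the two action spaces described above; the gain bounds and velocity discretization levels are illustrative assumptions, since no concrete numeric ranges are listed here.

```python
import itertools

# a1_space: continuous box over the prior-controller performance hyperparameters
# [Kx, Ky, Ktheta] for the robot tracking virtual robot 1 and for its neighbour
# tracking virtual robot 2 (6 dimensions in total).
A1_LOW  = [0.1, 0.1, 0.1] * 2                 # assumed lower bounds
A1_HIGH = [5.0, 10.0, 5.0] * 2                # assumed upper bounds

# a2_space: discrete (v, omega) command pairs for individual obstacle avoidance.
V_DISCRETE     = [0.0, 0.2, 0.4]              # m/s, assumed levels
OMEGA_DISCRETE = [-0.8, -0.4, 0.0, 0.4, 0.8]  # rad/s, assumed levels
A2_SPACE = list(itertools.product(V_DISCRETE, OMEGA_DISCRETE))

print(len(A1_LOW), len(A2_SPACE))  # 6 continuous dimensions, 15 discrete actions
```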
Further, the state space describing, at the current time step, each mobile robot's tracking error with respect to its corresponding virtual mobile robot is composed of the following quantities: the position tracking errors of the mobile robot with respect to virtual robot 1 in the x direction, in the y direction and in heading, and the position tracking errors of its adjacent mobile robot with respect to virtual robot 2 in the x direction, in the y direction and in heading; e1 denotes the tracking error of the mobile robot with respect to virtual robot 1, and e2 denotes the tracking error of the adjacent mobile robot with respect to virtual robot 2.
The state space between adjacent mobile robots is composed of the following quantities: the formation state quantities of distance, angle and heading between adjacent mobile robots at each time step t; and ||u1||_2, ||u2||_2 and the corresponding acceleration norms, which represent the relative values of velocity, angular velocity and acceleration of robot 1 and robot 2 with respect to the virtual robots. The purpose of the latter terms is to make the mobile robots run with continuous and smooth velocities and accelerations: ||u1||_2 is the velocity value of mobile robot 1 relative to the virtual robot, covering linear and angular velocity; its time derivative is the acceleration value of robot 1 relative to the virtual robot, covering linear and angular acceleration; ||u2||_2 is the velocity value of mobile robot 2 relative to the virtual robot, covering linear and angular velocity; and its time derivative is the acceleration value of mobile robot 2 relative to the virtual robot, covering linear and angular acceleration.
The state space each mobile robot needs to describe the surrounding environment is composed of the following quantities: η_t, the pose vector of the mobile robot at the current time; d_r, the current distance between the mobile robot and the position of its desired virtual mobile robot; d_ob, the vector of current distances from the mobile robot to the obstacles lying within the safety threshold; and |Δθ|, a two-element vector containing the difference between the mobile robot's current and previous linear velocities and the difference between its current and previous angular velocities.
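For illustration, the three state vectors described above can be assembled as plain Python lists; the field names follow the text, while the concrete layout is an assumption.

```python
def tracking_error_state(e1, e2):
    """State 1: per-axis tracking errors (x, y, heading) of the robot w.r.t.
    virtual robot 1 (e1) and of its neighbour w.r.t. virtual robot 2 (e2)."""
    return [e1["ex"], e1["ey"], e1["eth"], e2["ex"], e2["ey"], e2["eth"]]

def formation_state_vector(d, phi, beta, u1, u2, du1, du2):
    """State 2: distance/angle/heading of the pair plus the norms of the
    relative velocities and accelerations of robots 1 and 2."""
    norm = lambda vec: sum(v * v for v in vec) ** 0.5
    return [d, phi, beta, norm(u1), norm(u2), norm(du1), norm(du2)]

def environment_state_vector(eta_t, d_r, d_ob, dv, dw):
    """State 3: current pose, distance to the desired virtual robot, distances
    to obstacles inside the safety threshold, and |delta v|, |delta omega|."""
    return list(eta_t) + [d_r] + list(d_ob) + [abs(dv), abs(dw)]

s1 = tracking_error_state({"ex": 0.1, "ey": -0.05, "eth": 0.02},
                          {"ex": 0.0, "ey": 0.08, "eth": -0.01})
print(len(s1))  # 6
```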
Still further, the formation reward function between two adjacent mobile robots is specifically described as follows:
ε_thresh is a preset threshold. The term R_error_1 in the reward function is the sum of the penalty terms on the two mobile robots' tracking errors with respect to their desired virtual mobile robots, and it encourages the robots to reduce the tracking error with respect to the desired positions as much as possible. r is the reward or penalty value, and R_formation is a reward-or-penalty function used to guide the robots to keep the formation coherent: if the dynamic variation of the formation remains within the preset threshold, a positive reward value is fed back, otherwise a negative penalty value is returned. R_velocity guides the mobile robots to keep their velocities and accelerations coherent, maintaining a continuous and smooth motion pattern.
Still further, the specific form of the obstacle-avoidance reward function is as follows:
the term R_error_2 is the penalty on mobile robot i's tracking error with respect to its desired virtual mobile robot, and it guides the recovery of the formation; R_avoid guides the mobile robot to perform autonomous obstacle avoidance, where ε_safe is the safety threshold, r1 is the penalty value applied when the distance from the robot to the nearest obstacle is within the safety threshold but no collision has yet occurred, and r2 is the penalty value applied when the robot collides with an obstacle; R_delta_yaw is the penalty on the change of the mobile robot's heading angle between adjacent time steps, which constrains the heading change of mobile robot i and makes the overall motion trajectory smoother.
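A hedged sketch of the two reward functions described above; the threshold values and reward magnitudes are placeholders, since no concrete numbers are listed here.

```python
def formation_reward(e1, e2, d_err, eps_thresh, du_norm, r=1.0):
    """R = R_error_1 + R_formation + R_velocity: penalise tracking errors,
    reward keeping the formation variation within eps_thresh, and penalise
    jerky velocity/acceleration changes."""
    R_error = -(abs(e1) + abs(e2))
    R_formation = r if abs(d_err) <= eps_thresh else -r
    R_velocity = -du_norm
    return R_error + R_formation + R_velocity

def avoidance_reward(e_i, d_nearest, eps_safe, dyaw,
                     r1=-0.5, r2=-10.0, collision_dist=0.05):
    """R = R_error_2 + R_avoid + R_delta_yaw: penalise leaving the formation
    slot, penalise being inside the safety threshold (r1) or colliding (r2),
    and penalise large heading-angle changes."""
    R_error = -abs(e_i)
    if d_nearest <= collision_dist:
        R_avoid = r2
    elif d_nearest < eps_safe:
        R_avoid = r1
    else:
        R_avoid = 0.0
    R_delta_yaw = -abs(dyaw)
    return R_error + R_avoid + R_delta_yaw

print(formation_reward(0.1, 0.05, 0.02, eps_thresh=0.1, du_norm=0.3))
print(avoidance_reward(0.2, 0.12, eps_safe=0.3, dyaw=0.1))
```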
Further, during training, the two subtasks of formation tracking and flexible obstacle avoidance are trained independently; the specific method comprises the following steps:
For the formation-tracking task, the action space is chosen as the formation-tracking action space a1_space of two adjacent mobile robots, and the state space is built from the state space of each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step together with the state space between adjacent mobile robots.
The action-value network outputs an evaluation of the current action; the Q value of this evaluation output by the current action-value network is used as the weight, and the action network is updated via the policy gradient.
The specific update of the action-value network is described as follows:
L = (1/N) Σ_i w_i [ r_i + γ·Q_θ′(s_{i+1}, μ′(s_{i+1})) − Q_θ(s_i, a_i) ]²    (13)
where w_i is the prioritized-sampling weight computed at the current time i by the prioritized experience replay algorithm; r_i is the reward signal at the current time i; γ is the discount factor; Q_θ′(s_{i+1}, μ′(s_{i+1})) is the target action-value network's evaluation of the target action μ′(s_{i+1}) at the next time i+1; s_i is the state of the robot at the current time i, s_{i+1} is the state of the robot at the next time i+1, and a_i is the action of the robot at the current time i; N is the number of samples in the mini-batch; Q_θ(s_i, a_i) is the current action-value network's evaluation of the robot's state and action command at the current time i.
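A minimal PyTorch sketch of the importance-weighted critic update of equation (13); the network sizes and random batch are dummies for illustration, and the real networks would take the formation state and gain-vector action described above.

```python
import torch
import torch.nn as nn

state_dim, action_dim, N, gamma = 8, 6, 32, 0.99   # assumed dimensions

critic        = nn.Linear(state_dim + action_dim, 1)   # Q_theta
target_critic = nn.Linear(state_dim + action_dim, 1)   # Q_theta'
target_actor  = nn.Linear(state_dim, action_dim)        # mu'

s, a   = torch.randn(N, state_dim), torch.randn(N, action_dim)
s_next = torch.randn(N, state_dim)
r      = torch.randn(N, 1)
w      = torch.rand(N, 1)            # prioritized-replay importance weights w_i

with torch.no_grad():
    a_next = target_actor(s_next)                               # mu'(s_{i+1})
    y = r + gamma * target_critic(torch.cat([s_next, a_next], dim=1))

q = critic(torch.cat([s, a], dim=1))                             # Q_theta(s_i, a_i)
loss = (w * (y - q).pow(2)).mean()                               # equation (13)
loss.backward()
print(float(loss))
```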
For the flexible obstacle-avoidance task, a proximal policy optimization algorithm architecture over a discrete action space is adopted; the action space is chosen as the action space each mobile robot needs for independent, flexible obstacle avoidance, and the state space is chosen as the combination of the state space of each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step and the state space each mobile robot needs to describe the surrounding environment.
Further, the target action-value network is updated as follows: after each mini-batch has been trained, it is updated with the parameters of the updated online action network and online action-value network, in the specific form
η′ ← τη″ + (1 − τ)η′    (14)
where η′ and η″ denote the target network parameters and the current network parameters, respectively, and τ controls the proportion of the update.
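A small sketch of the soft update of equation (14), assuming PyTorch modules for the online and target networks.

```python
import torch.nn as nn

def soft_update(target_net: nn.Module, online_net: nn.Module, tau: float):
    """eta' <- tau * eta'' + (1 - tau) * eta'  (equation (14))."""
    for tp, op in zip(target_net.parameters(), online_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)

online, target = nn.Linear(4, 2), nn.Linear(4, 2)
soft_update(target, online, tau=0.005)
```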
Further, the method also comprises a local collision-detection step for detecting the safety distance between local obstacles and the robot; if the returned safety distance satisfies the safe-state requirement, the individual mobile robot exits the flexible obstacle-avoidance policy and resumes the formation policy.
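A sketch of this local collision check, assuming the robot exposes a list of range readings to nearby obstacles; the threshold value is illustrative.

```python
def collision_check(obstacle_distances, eps_safe=0.5):
    """Return (warning_flag, distances_within_threshold).
    warning_flag is the Boolean collision-warning bit; when it clears,
    the robot may leave the avoidance policy and resume the formation policy."""
    close = [d for d in obstacle_distances if d < eps_safe]
    return (len(close) > 0), close

warn, close = collision_check([1.2, 0.8, 0.35])
print(warn, close)   # True, [0.35]
```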
Beneficial technical effects achieved by the present invention:
The cascaded multi-mobile-robot flexible formation method provided by the present invention is based on reinforcement learning and prior nonlinear distance-angle-heading formation control, so that multiple mobile robots can adaptively adjust the key parameters of the formation control algorithm, improving the stability of the formation and the accuracy of tracking. At the same time, the flexible obstacle-avoidance policy is trained independently, so that each robot in the formation also possesses a certain flexible obstacle-avoidance capability, improving the flexibility and autonomy of each mobile robot in the formation.
The formation-tracking architecture of the algorithm designed in the present invention is based on the deep deterministic policy gradient algorithm; by simplifying the random exploration process and introducing a prioritized experience replay mechanism, the performance and efficiency of the algorithm are further improved. Introducing the information of the prior nonlinear distance-angle-heading formation controller avoids blind exploration and makes the training process more targeted, thereby speeding up the convergence of the algorithm; during inference, the prior formation-controller information also prevents the abnormal behaviors that may damage the actuators in purely end-to-end approaches, improving the robustness of the overall formation.
The obstacle-avoidance architecture of the designed algorithm is based on the proximal policy optimization algorithm; during flexible obstacle avoidance the action space of the mobile robot is discretized so as to shrink the search space and reduce training complexity. A collision-detection function module is introduced to monitor the obstacle distance in real time and determine whether the robot can return to formation-tracking mode.
Preferably, the training of the two architectures is mutually independent, and during the inference-based formation process they complement each other to jointly accomplish the flexible formation of multiple mobile robots.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the overall framework of a specific embodiment of the present invention;
FIG. 2 is a schematic diagram of the training phase of a specific embodiment of the present invention;
FIG. 3 is a schematic diagram of the inference-based flexible formation of a specific embodiment of the present invention.
Detailed Description of the Embodiments
The present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment: a cascaded multi-mobile-robot flexible formation method mainly comprising the following steps: S1. Select a formation shape from the formation library, and determine each robot's priority and its specific position in the team according to the distance-angle-heading formation mode;
S2. Determine the dynamic model according to the robot type;
S3. According to the constraints of the dynamic model, combined with the relative-distance constraints, relative-angle constraints and heading constraints between the robots, design the desired prior trajectories of the virtual leader and the virtual followers, convert the formation problem of the real robots into several problems of tracking the trajectories of virtual mobile robots, and design the corresponding nonlinear formation-tracking prior controller for linear and angular velocity, which serves as the knowledge prior of the whole reinforcement-learning architecture;
S4. Design the collision-detection module of the whole formation algorithm, used for detecting the safety distance between local obstacles and the robot;
S5. Design the action-space part of the whole formation algorithm architecture, which is mainly divided into two parts: a velocity space containing the linear and angular velocities of the mobile robot, and a parameter space containing all the performance parameters of the prior nonlinear tracking-control knowledge;
S6. Design the state-space part of the whole formation algorithm architecture, which mainly includes the position and attitude of each robot and the obstacle information in the environment;
S7. Design the reward functions that guide the robot formation to learn and to flexibly avoid obstacles, mainly consisting of a formation reward function, a tracking reward function and an obstacle-avoidance reward function;
S8. Build a simulation environment for training, so that, conditioned on the prior nonlinear formation-control knowledge, the agents learn through trial-and-error interaction with the environment a flexible and stable distance-angle-heading formation policy and a flexible obstacle-avoidance policy for the multiple mobile robots.
Further, in step S1, the robots are homogeneous in type and the number of robots is N ≥ 2.
Further, in step S2, taking a two-wheel differential-drive mobile robot as an example, its dynamic equation is as given in equation (1) above,
where η = [x, y, θ]^T denotes the pose vector of each mobile robot, v is the linear velocity of the mobile robot, ω is the angular velocity of the mobile robot, and v_r and v_l denote the velocities of the right and left wheels, respectively. It should be noted that a two-wheel-drive mobile robot is subject to a nonholonomic constraint, so that it can only move forward and backward and cannot move sideways; this constraint takes the form of equation (2) above.
Further, taking the distance-angle-heading formation of multiple mobile robots as an example, the specific design steps of the prior formation-control knowledge of the mobile robots in S3 are as follows:
S31. Design the tracking controller between the mobile robot and the virtual desired mobile robot. The desired trajectory of the virtual mobile robot is defined as η_r = [x_r, y_r, θ_r]^T, and the tracking errors of its pose and velocity are as described in S31 of the summary above.
S32. Design the desired formation model relating the distance, angle and heading between adjacent mobile robots, described as follows:
v1 and v2 denote the virtual robot objects that the adjacent mobile robots need to track, recorded as virtual robot 1 and virtual robot 2; d_v2v1, φ_v2v1 and β_v2v1 are the state quantities of the distance-angle-heading formation architecture, representing the distance, angle and heading between v1 and v2.
S33. Combining (1)-(4) with feedback-linearization nonlinear control theory, the formation-control prior of adjacent mobile robots is described by equation (5), in which
v1 is the linear velocity at which virtual robot 1 satisfies the preset formation requirement, v2 is the linear velocity at which virtual robot 2 satisfies the preset formation requirement, w1 is the angular velocity at which virtual robot 1 satisfies the preset formation requirement, and w2 is the angular velocity at which virtual robot 2 satisfies the preset formation requirement; [K_x, K_y, K_θ] are the prior performance hyperparameters of the nonlinear formation control of the mobile robots, and their values directly determine the quality of formation tracking.
Further, in S4, the collision-detection function module judges the distance to obstacles through the mobile robot's own sensors and outputs a Boolean collision-warning flag.
Further, in S5, the action space mainly consists of two parts: the action space needed for formation tracking, and the action space needed for flexible obstacle avoidance when local obstacles are detected. The specific design is described as follows:
S51. Design the formation-tracking action space of two adjacent mobile robots. The specific method is based on the nonlinear-formation knowledge prior introduced in S33, and the action space is
composed of [K_x, K_y, K_θ], the prior performance hyperparameters of the nonlinear formation control of the mobile robots.
S52. Design the action space each mobile robot needs for independent, flexible obstacle avoidance:
it is composed of v_discrete and ω_discrete, the discretized linear-velocity command and angular-velocity command of the mobile robot, respectively.
Further, in S6, the state space mainly consists of three parts: a state space describing each mobile robot's tracking error with respect to its corresponding virtual robot, a state space describing how adjacent mobile robots satisfy the distance-angle-heading formation, and a state space needed to describe the surrounding environment. The specific design is described as follows:
S61. Taking two adjacent mobile robots as an example, design the state space describing, at the current time step, each mobile robot's tracking error with respect to its corresponding virtual mobile robot, composed of the tracking-error quantities listed above.
S62. Taking two adjacent mobile robots as an example, design the state space between adjacent mobile robots satisfying the distance-angle-heading formation architecture as follows:
d, φ and β denote the formation state quantities of distance, angle and heading between adjacent mobile robots at each time step; ||u1||_2, ||u2||_2 and the corresponding acceleration norms denote the relative values of velocity, angular velocity and acceleration of robot 1 and robot 2 with respect to the virtual robots; the purpose of these terms is to make the mobile robots run with continuous and smooth velocities and accelerations.
S63. Design the state space each mobile robot needs to describe the surrounding environment as follows:
η_t is the pose vector of the mobile robot at the current time, d_r is the current distance between the mobile robot and the position of its desired virtual mobile robot, d_ob is the vector of current distances from the mobile robot to the obstacles lying within the safety threshold, and |Δθ| is a two-element vector containing the difference between the mobile robot's current and previous linear velocities and the difference between its current and previous angular velocities.
Further, the reward-function design in S7 can be subdivided into two sub-reward-function designs, one for the formation-tracking subtask and the other for the subtask of flexible obstacle avoidance and formation recovery, namely:
S71. Design the reward function of the formation-tracking subtask.
The formation reward function between two adjacent mobile robots is specifically described as follows:
R_error in the reward function is the sum of the penalty terms on the two mobile robots' tracking errors with respect to their desired virtual mobile robots, used to encourage the robots to reduce the tracking error with respect to the desired positions as much as possible; R_formation guides the robots to keep the formation coherent: if the dynamic variation of the formation stays within the threshold, a positive reward is fed back, otherwise a negative penalty is returned; R_velocity guides the mobile robots to keep their velocities and accelerations coherent, maintaining a continuous and smooth motion pattern.
S72. Design the flexible obstacle-avoidance reward function of mobile robot i, in the following specific form:
the reward term R_error is the penalty on mobile robot i's tracking error with respect to its desired virtual mobile robot, and it guides the recovery of the formation; R_avoid guides the mobile robot to perform autonomous obstacle avoidance; R_delta_yaw limits the change of mobile robot i's heading angle so as to save energy.
Optionally, the reward function designed in S72 is active only during the obstacle-avoidance task phase, where it encourages the mobile robot to quickly evade local obstacles; when S4 judges that the robot is far enough from the obstacles, the robot exits the obstacle-avoidance phase, switches to the formation-tracking subtask, and restores and maintains the formation under the guidance of the S71 reward function.
Further, in S8, the two subtasks of formation tracking and flexible obstacle avoidance are trained independently during the training process, as described below:
S81. For the formation-tracking task, a deterministic policy gradient algorithm architecture over a continuous action space is adopted; the action space is chosen as a1_space, and the state space is built from the tracking-error state space and the inter-robot formation state space. The algorithm generally follows the actor-critic pattern, but unlike other reinforcement-learning algorithms, its biggest advantage is that the output of the action network is a deterministic action rather than a policy distribution.
On the other hand, the action-value network outputs an evaluation of the current action; the Q value of this evaluation output by the current action-value network is then used as the weight, and the action network is updated via the policy gradient. The action-value network itself is updated using the offline target action network and target action-value network; the advantage of this approach is that the parameters of the target networks change little, which makes the training process more stable.
The specific update of the action-value network is described by equation (13) above,
where w_i is the prioritized-sampling weight computed by the prioritized experience replay algorithm, r_i is the current reward signal, γ is the discount factor, and Q_θ′(s_{i+1}, μ′(s_{i+1})) is the target action-value network's evaluation of the target action μ′(s_{i+1}) at the next time step.
Preferably, the target networks are updated with a soft-update strategy: after each mini-batch has been trained, they are updated with the parameters of the updated online action network and online action-value network, in the specific form of equation (14),
η′ ← τη″ + (1 − τ)η′    (14)
where η′ and η″ denote the target network parameters and the current network parameters, respectively, and τ controls the proportion of the update.
This soft-update method reduces the influence of abnormal parameters and at the same time avoids abrupt parameter jumps during the parameter update.
S82. For the flexible obstacle-avoidance task, a proximal policy optimization algorithm architecture over a discrete action space is adopted; the action space is chosen as the discretized velocity-command space, and the state space as the combination of the tracking-error state space and the environment-description state space.
The proximal policy optimization algorithm addresses the slow parameter updates and low data utilization of the traditional on-policy policy-gradient algorithm: on top of generalized advantage estimation, it introduces an importance-resampling mechanism that turns the on-policy updates into off-policy updates so as to improve data utilization, while constraining the magnitude of the parameter update through a KL-divergence penalty or a clipping operation, thereby obtaining a more stable training process.
Further, the method comprises S9: building the entire inference-based cascaded formation control algorithm from the formation policies learned offline in S8.
In S9, the formation policy and the flexible obstacle-avoidance policy trained in S8 are used to build the inference-based flexible formation algorithm architecture for the mobile robots; the specific process is described as follows:
S91. Determine the formation requirements and the task environment;
S92. The mobile-robot formation loads the prior-based formation policy and the flexible obstacle-avoidance policy pre-trained in S8;
S93. Based on the information exchanged with the environment, the mobile-robot formation performs formation tracking using the formation-tracking policy, while each individual robot performs local collision detection;
S94. If flexible obstacle avoidance is required, the mobile-robot formation switches the individual robot to the flexible obstacle-avoidance policy and performs real-time obstacle avoidance based on the information exchanged with the environment;
S95. If the local collision-detection function module returns a safe state, the individual mobile robot exits the flexible obstacle-avoidance policy and quickly restores the formation state;
S96. Repeat S93 to S95 until the target point is reached.
The present invention provides a cascaded multi-mobile-robot flexible formation method based on reinforcement learning and prior nonlinear distance-angle-heading formation control; through reinforcement learning, the method migrates the computational cost of online solving to the offline phase, realizing inference-based flexible formation of multiple mobile robots.
In the training phase, the formation-tracking policy and the flexible obstacle-avoidance policy are trained independently, which reduces training difficulty; at the same time, the nonlinear distance-angle-heading formation-control prior is introduced, which speeds up training and avoids a complicated parameter-tuning process. In the inference phase, the independently trained offline policies are combined to meet the task requirements of autonomous, stable formation and flexible obstacle avoidance. Compared with existing leader-follower formation-tracking control algorithms, the method endows the robot formation with autonomous tracking capability while also giving each mobile robot an independent ability to avoid local static and dynamic obstacles; it is autonomous, stable, efficient and flexible.
The overall framework of this embodiment is shown in FIG. 1, in which 1 is the offline independent training framework, 2 is the inference-based flexible formation framework, 21 is the flexible formation and obstacle-avoidance policy, and 3 is the simulation interaction environment.
First, in training phase 1, the formation policy and the flexible obstacle-avoidance policy are trained separately; the training of the formation policy is based on the prior formation experience, which accelerates the training process and convergence, prevents blind exploration by the multiple mobile robots, and improves formation stability. After training, the parameters of both policies are stored.
Then, in inference phase 2, the multiple mobile robots flexibly invoke the experience-based flexible formation and obstacle-avoidance policy 21 to perform autonomous formation tracking and flexible obstacle avoidance; this approach migrates the online computation to the offline phase and is more efficient and stable.
The framework of the training phase is shown in FIG. 2. The overall training strategy follows the idea of curriculum learning, i.e., the training environments progress from simple to complex and the policy performance is improved step by step. In FIG. 2, 1 is the formation training environment based on the distance-angle-heading formation-control prior; 2 is the environment in which each mobile robot performs flexible obstacle avoidance; 3 is the continuous deterministic policy gradient agent; 4 is the discrete proximal policy optimization agent; and 5 is the formation policy parameters and flexible obstacle-avoidance policy parameters stored offline after training. The specific process is described as follows:
The training process of the formation policy is described as follows:
First, following the idea of curriculum learning, several training environments from simple to complex are configured, for example first training the formation environment of two mobile robots and then gradually increasing the number of robots in the formation.
Then, for each preset simulation environment, the parameters of the action network, target action network, action-value network and target action-value network are initialized in the continuous deterministic policy gradient agent 3 of FIG. 2; for each iteration episode, the formation training environment 1 based on the distance-angle-heading formation-control prior is initialized, and then within each time step:
Step 1: select an action from within the threshold range of the action space according to the policy, and add random Gaussian noise to improve random exploration;
Step 2: interact with the formation training environment. Specifically, the selected deterministic action is fed into the cascaded prior formation controller of the multiple mobile robots designed from the preset formation mode and the distance, angle and heading geometric relations, as described above in equation (5); the velocity and angular-velocity commands of the prior-formation upper-level controller are then sent to each mobile robot. While the robots execute the commands, the environment updates its current state and feeds back the state value, the reward value and a Boolean flag indicating whether the task has ended or been interrupted;
Step 3: store the information fed back by the environment, together with the computed priority, into the experience pool as training data;
Step 4: once the capacity of the experience pool overflows, sample according to priority, and train and update the action-value network and the action network; specifically:
Step 4-1: feed the sampled state values into the action network to obtain actions;
Step 4-2: feed these actions and the sampled state values into the action-value network to obtain Q values;
Step 4-3: according to these Q values, update the action network by backpropagation along the policy gradient;
Step 4-4: repeat steps 4-2 and 4-3 to obtain the Q values computed by the action-value network for the updated actions;
Step 4-5: feed the next-state values from the sampled experience into the target action network to obtain the target actions;
Step 4-6: feed these actions and these states into the target action-value network to obtain the target Q values;
Step 4-7: according to equation (13), use the target Q values and the Q values computed in step 4-4, combined with the priority weight coefficients, to update the action-value network;
Step 5: repeat steps 1-4;
Step 6: after a certain number of time steps, perform soft updates of the target action network and the target action-value network according to equation (14);
Step 7: store the network parameters of the final formation policy for use in subsequent training or inference.
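As a compressed, illustrative outline of steps 1-7 above, the sketch below shows only the control flow of the formation-policy training loop; `env`, `agent` and `buffer` are assumed interfaces (environment, DDPG-style agent and prioritized replay buffer) and are not names from this disclosure.

```python
import random

def train_formation_policy(env, agent, buffer, episodes=100, steps=200,
                           batch_size=64, update_target_every=50):
    """Outline of steps 1-7: noisy action selection, prior-controller rollout,
    prioritized storage, critic/actor updates, periodic soft target updates."""
    t_total = 0
    for _ in range(episodes):
        state = env.reset()                       # formation training env (FIG. 2, item 1)
        for _ in range(steps):
            gains = agent.act(state)              # step 1: deterministic action (gains)
            gains = [g + random.gauss(0.0, 0.1) for g in gains]   # + exploration noise
            next_state, reward, done = env.step(gains)            # step 2: prior controller rollout
            buffer.add(state, gains, reward, next_state, done)    # step 3: prioritized store
            if len(buffer) >= batch_size:         # step 4: prioritized sampling + updates
                batch, weights = buffer.sample(batch_size)
                agent.update_critic(batch, weights)   # steps 4-5 .. 4-7, equation (13)
                agent.update_actor(batch)             # steps 4-1 .. 4-3, policy gradient
            t_total += 1
            if t_total % update_target_every == 0:
                agent.soft_update_targets()       # step 6: equation (14)
            state = next_state
            if done:
                break
    agent.save("formation_policy")                # step 7: store parameters
```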
The training procedure of the individual autonomous flexible obstacle-avoidance policy is as follows:
First, following the idea of curriculum learning, several training environments from simple to complex are configured, for example first training the mobile robot's obstacle-avoidance policy in an environment with static obstacles and then training it in an environment with dynamic obstacles.
Then, for each preset simulation environment, the policy network and the value network in agent 4 of FIG. 2 are first initialized; in each iteration episode the corresponding environment is initialized, and then, within each time step:
Step 1: feed the environment state into the policy network to obtain a policy distribution, and sample an action from the discrete action space according to this distribution;
Step 2: apply the action to the environment and interact with the flexible obstacle-avoidance environment; the environment updates its state and feeds back the state value, the reward value and a Boolean flag indicating whether the task has ended or been interrupted;
Step 3: repeat step 1 to sample a certain amount of experience, and store it;
Step 4: feed the state of the last step of step 3 into the value network to obtain the state value, then compute backwards the discounted reward values over these time steps;
Step 5: feed all the stored experience into the value network and compute the advantage values using generalized advantage estimation;
Step 6: according to the computed advantage values, update the value network by backpropagation;
Step 7: feed all the state values in the stored experience into the policy network and the old policy network to obtain their respective policy distributions, use importance resampling to turn the on-policy updates into off-policy updates, and update the policy network by backpropagation;
Step 8: repeat steps 5-6, then update the old policy network parameters with the current policy network parameters;
Step 9: repeat steps 1-8, and store the network parameters of the final flexible obstacle-avoidance policy for use in subsequent training or inference.
本实施例提供的一种基于强化学习与先验非线性距离-角度-航向编队控制的级联多移动机器人灵活编队方法,在实际进行部署应用时按照图3所示步骤进行:This embodiment provides a cascaded multi-mobile robot flexible formation method based on reinforcement learning and prior nonlinear distance-angle-heading formation control. When actually deployed and applied, the steps shown in FIG3 are followed:
步骤1:获取上层运动规划的期望轨迹,用于编队跟踪;Step 1: Obtain the expected trajectory of the upper-level motion planning for formation tracking;
步骤2:明确编队任务的具体编队形式需求,获得先验编队控制信息,明确任务环境;Step 2: Clarify the specific formation form requirements of the formation task, obtain prior formation control information, and clarify the task environment;
步骤3:装载训练阶段预训练的离线编队跟踪策略与灵活避障策略;Step 3: Load the offline formation tracking strategy and flexible obstacle avoidance strategy pre-trained in the training phase;
步骤4:按照预训练的编队策略,获取移动机器人状态后,动作网络反馈动作,移动机器人按照动作进行编队跟踪任务;Step 4: According to the pre-trained formation strategy, after obtaining the state of the mobile robot, the action network feeds back the action, and the mobile robot performs the formation tracking task according to the action;
步骤5:进行局部碰撞检测,保证编队跟踪的安全性,若有障碍物距离某移动机器人在安全阈值内,则跳转到步骤6,否则进行步骤7;Step 5: Perform local collision detection to ensure the safety of formation tracking. If there is an obstacle within the safety threshold of a mobile robot, jump to step 6, otherwise go to step 7;
步骤6:对应移动机器人调用训练阶段预训练的离线灵活编队策略,移动机器人从策略网络输出的分布中采样离散动作,进行局部障碍规避,并以尽量小的误差迅速返回其在编队中的位置,继续跟踪对应的编队模式下的虚拟移动机器人;Step 6: The affected mobile robot calls the offline flexible obstacle-avoidance policy pre-trained in the training phase, samples discrete actions from the distribution output by the policy network to avoid the local obstacle, then returns quickly to its position in the formation with as little error as possible and resumes tracking its corresponding virtual mobile robot in the formation pattern;
步骤7:是否抵达期望轨迹目标点,若没有,返回步骤4,继续进行编队跟踪。Step 7: Check whether the target point of the expected trajectory has been reached. If not, return to step 4 and continue formation tracking.
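A minimal sketch of the deployment loop of steps 4-7 follows. The robot, trajectory, and policy objects and the safety threshold value are hypothetical placeholders; the description above does not define these interfaces.

```python
SAFETY_THRESHOLD = 0.5  # metres; assumed value, the text only requires "within a safety threshold"

def deployment_loop(robots, formation_policy, avoidance_policy, trajectory):
    # Step 3 (done beforehand): both policies are loaded from their pre-trained parameters.
    while not trajectory.goal_reached():                          # step 7: stop at the trajectory goal point
        for robot in robots:
            state = robot.get_state()                             # step 4: observe the robot state
            if robot.min_obstacle_distance() < SAFETY_THRESHOLD:  # step 5: local collision detection
                action = avoidance_policy.sample_action(state)    # step 6: flexible obstacle avoidance
            else:
                action = formation_policy.act(state)              # step 4: formation tracking action
            robot.execute(action)
```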
本发明提供的该方法基于结合先验的非线性距离-角度-航向编队控制知识与连续控制的策略梯度算法,避免移动机器人盲目探索,提高了训练收敛的速度,避免了繁琐的系数调优过程,同时引入近端策略优化独立训练单个移动机器人应对局部静态、动态障碍物的灵活避障能力。该方法分为训练与推理阶段,将复杂的线上解算过程迁移到线下,基于课程学习思想独立训练编队与灵活避障策略,同时在推理环节灵活调用预训练策略,使得整个编队具有更高的自主性与灵活性。The method provided by the present invention combines prior nonlinear distance-angle-heading formation control knowledge with a continuous-control policy gradient algorithm, which keeps the mobile robots from exploring blindly, speeds up training convergence, and avoids tedious coefficient tuning; it additionally introduces proximal policy optimization to independently train each mobile robot's flexible avoidance of local static and dynamic obstacles. The method is divided into a training stage and an inference stage, moving the complex online solving process offline: the formation and flexible obstacle-avoidance policies are trained independently following the curriculum learning idea, and the pre-trained policies are flexibly invoked during inference, giving the whole formation higher autonomy and flexibility.
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment in combination with software and hardware. Moreover, the present application may adopt the form of a computer program product implemented in one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to the flowchart and/or block diagram of the method, device (system) and computer program product according to the embodiment of the present application. It should be understood that each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
以上结合附图对本发明的实施例进行了描述,但是本发明并不局限于上述的具体实施方式,上述的具体实施方式仅仅是示意性的,而不是限制性的,本领域的普通技术人员在本发明的启示下,在不脱离本发明宗旨和权利要求所保护的范围情况下,还可做出很多形式,这些均属于本发明的保护之内。The embodiments of the present invention are described above in conjunction with the accompanying drawings, but the present invention is not limited to the above-mentioned specific implementation methods. The above-mentioned specific implementation methods are merely illustrative and not restrictive. Under the enlightenment of the present invention, ordinary technicians in this field can also make many forms without departing from the scope of protection of the purpose of the present invention and the claims, which all fall within the protection of the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110655081.9A CN113485323B (en) | 2021-06-11 | 2021-06-11 | Flexible formation method for cascading multiple mobile robots |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110655081.9A CN113485323B (en) | 2021-06-11 | 2021-06-11 | Flexible formation method for cascading multiple mobile robots |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113485323A CN113485323A (en) | 2021-10-08 |
CN113485323B true CN113485323B (en) | 2024-04-12 |
Family
ID=77935320
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110655081.9A Active CN113485323B (en) | 2021-06-11 | 2021-06-11 | Flexible formation method for cascading multiple mobile robots |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113485323B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114020013B (en) * | 2021-10-26 | 2024-03-15 | 北航(四川)西部国际创新港科技有限公司 | Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning |
CN114625138B (en) * | 2022-03-11 | 2024-11-01 | 江苏集萃道路工程技术与装备研究所有限公司 | Autonomous movement method of traffic cone robot and traffic cone robot system |
CN115542901B (en) * | 2022-09-21 | 2024-06-07 | 北京航空航天大学 | Obstacle avoidance method for deformable robots based on proximal strategy training |
CN116466581A (en) * | 2023-03-30 | 2023-07-21 | 上海大学 | Independent racing car control method based on constraint residual reinforcement learning |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013119942A1 (en) * | 2012-02-08 | 2013-08-15 | Adept Technology, Inc. | Job management sytem for a fleet of autonomous mobile robots |
CN110007688A (en) * | 2019-04-25 | 2019-07-12 | 西安电子科技大学 | A distributed formation method for UAV swarms based on reinforcement learning |
CN110147101A (en) * | 2019-05-13 | 2019-08-20 | 中山大学 | An end-to-end distributed multi-robot formation navigation method based on deep reinforcement learning |
WO2020253316A1 (en) * | 2019-06-18 | 2020-12-24 | 中国科学院上海微系统与信息技术研究所 | Navigation and following system for mobile robot, and navigation and following control method |
CN111857184A (en) * | 2020-07-31 | 2020-10-30 | 中国人民解放军国防科技大学 | Collision avoidance method and device for fixed-wing UAV swarm control based on deep reinforcement learning |
CN111880567A (en) * | 2020-07-31 | 2020-11-03 | 中国人民解放军国防科技大学 | Formation coordination control method and device for fixed-wing UAV based on deep reinforcement learning |
CN112711261A (en) * | 2020-12-30 | 2021-04-27 | 浙江大学 | Multi-agent formation planning method based on local visual field |
Non-Patent Citations (3)
Title |
---|
A review of mobile robot path planning in dynamic environments; Zhang Guoliang; Machine Tool & Hydraulics (01); full text *
Formation control of multiple mobile robots; Li Qiang, Liu Guodong; Computer Systems & Applications (04); full text *
A review of UAV obstacle-avoidance route planning methods; Wu Jianfa, Wang Honglun, Liu Yiheng, Yao Peng; Unmanned Systems Technology (01); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113485323A (en) | 2021-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113485323B (en) | Flexible formation method for cascading multiple mobile robots | |
Jiang et al. | A brief review of neural networks based learning and control and their applications for robots | |
Lin et al. | Evolutionary digital twin: A new approach for intelligent industrial product development | |
Kumar et al. | Navigational analysis of multiple humanoids using a hybrid regression-fuzzy logic control approach in complex terrains | |
CN116551703B (en) | Motion planning method based on machine learning in complex environment | |
CN112947557A (en) | Multi-agent fault-tolerant tracking control method under switching topology | |
CN118377304B (en) | Deep reinforcement learning-based multi-robot layered formation control method and system | |
Ding et al. | Real-time trajectory planning and tracking control of bionic underwater robot in dynamic environment | |
Mohanty et al. | Application of deep Q-learning for wheel mobile robot navigation | |
Tutuko et al. | Route optimization of non-holonomic leader-follower control using dynamic particle swarm optimization | |
Wen et al. | A deep residual reinforcement learning algorithm based on Soft Actor-Critic for autonomous navigation | |
Cui et al. | Mobile robot sequential decision making using a deep reinforcement learning hyper-heuristic approach | |
Sun et al. | A Fuzzy-Based Bio-Inspired Neural Network Approach for Target Search by Multiple Autonomous Underwater Vehicles in Underwater Environments. | |
CN113959446B (en) | Autonomous logistics transportation navigation method for robot based on neural network | |
Ratnayake et al. | A comparison of fuzzy logic controller and pid controller for differential drive wall-following mobile robot | |
Feng et al. | Safe and efficient multi-agent collision avoidance with physics-informed reinforcement learning | |
Peng et al. | Learningflow: Automated policy learning workflow for urban driving with large language models | |
CN114815816A (en) | An autonomous navigation robot | |
Boufera et al. | Fuzzy inference system optimization by evolutionary approach for mobile robot navigation | |
Chen et al. | Survey of multi-agent strategy based on reinforcement learning | |
Amal | Virtual navigation of mobile robot in V-REP using hybrid ANFIS-PSO controller | |
Kashyap et al. | Navigation for multi-humanoid using MFO-aided reinforcement learning approach | |
Gebhardt et al. | Using m-embeddings to learn control strategies for robot swarms | |
CN114489035B (en) | Multi-robot collaborative search method based on accumulated trace reinforcement learning | |
Zhou et al. | Research on the fuzzy algorithm of path planning of mobile robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |