CN113485323A

CN113485323A - Flexible formation method for cascaded multiple mobile robots

Info

Publication number: CN113485323A
Application number: CN202110655081.9A
Authority: CN
Inventors: 董璐; 何子辰; 孙长银; 王嘉伟; 薛磊; 潘晶
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2021-06-11
Filing date: 2021-06-11
Publication date: 2021-10-08
Anticipated expiration: 2041-06-11
Also published as: CN113485323B

Abstract

The invention provides a flexible formation method of a cascade multi-mobile robot, which is based on a strategy gradient algorithm combining prior nonlinear distance-angle-course formation control knowledge and continuous control, avoids blind exploration of the mobile robot, improves the speed of training convergence, avoids a fussy coefficient tuning process, and simultaneously introduces a near-end strategy to optimize and independently train the flexible obstacle avoidance capability of a single mobile robot for coping with local static and dynamic obstacles. The method comprises a training and reasoning stage, complex on-line resolving processes are transferred to the off-line, formation and flexible obstacle avoidance strategies are independently trained based on course learning ideas, and meanwhile a pre-training strategy is flexibly called in a reasoning stage, so that the whole formation has higher autonomy and flexibility.

Description

Flexible formation method for cascaded multiple mobile robots

Technical Field

The invention belongs to the field of multiple mobile robots, and particularly relates to a flexible formation method for multiple mobile robots based on a cascade architecture, in particular to a cascade multiple mobile robot formation method based on reinforcement learning and priori nonlinear distance-angle-course formation control.

Background

With the development of the robot technology, the multi-mobile robot formation operation effectively improves the operation efficiency by virtue of the cooperation capability, and gradually replaces the traditional single-machine operation. For example, multiple underwater robots search through a collaborative formation. In military affairs, unmanned aerial vehicle clusters, multi-ground mobile robots, mine elimination, search and rescue, investigation and the like have no advantage characteristics which do not reflect multi-machine formation. Recently, new crown epidemic situation is abused worldwide, and domestic many hospitals carry out the disinfection work of hospital in order to adopt the mobile robot that kills to replace traditional artifical mode, and a plurality of mobile robot that kills are through formation cooperation, have effectively improved the efficiency of unit operation.

The formation strategy of the pilot following based on distance-angle-course is one of the common technologies for realizing the formation tracking of a plurality of mobile robots, and compared with the traditional pilot following formation strategy, the method has better flexibility and expansibility. The basic idea of the strategy is to preset a robot as a pilot and other robots as trackers, and then determine the relative distance, relative angle and course between the pilot robot and the following robots through a preset formation so as to design a formation control strategy.

At present, the mainstream methods for realizing distance-angle-course navigation following include nonlinear control, nonlinear model predictive control and the like. The former includes input/output feedback linearization control, feedback control, and the like. As more performance gain parameters are introduced, the complicated parameter adjustment process cannot be avoided; the latter is highly dependent on an accurate model and has a high requirement on the online resolving speed. On the other hand, the robustness of the traditional piloting following formation model needs to be improved, and certain flexible obstacle avoidance and formation recovery capabilities are lacked.

With the development of artificial intelligence technology, the deep reinforcement learning technology is widely applied to related tasks of end-to-end mobile robots due to the advantages of no model, offline training and the like, but is mostly in the field of single robots; the end-to-end implementation mode in the multi-machine field has strict requirements on the performances of the sensor and the actuator, the state and action space dimensions are higher, the training cost is too high in the process of landing the actual mobile robot, and the difficulty of reasoning and recurrence is higher.

Disclosure of Invention

The invention aims to provide a flexible formation method for multiple mobile robots, which has certain flexible obstacle avoidance and formation recovery capabilities, aiming at the defects in the prior art.

The invention adopts the following technical scheme. The method comprises determining a dynamic model according to the distance, angle and course among robots based on the selected formation form; determining a priori controller of a reinforcement learning framework in a flexible formation method of the nonlinear mobile robot according to the dynamic model and the dynamic model constraint; determining an action space based on the hyper-parameters of the pose vectors of the mobile robots, wherein the action space comprises a formation tracking action space of two adjacent mobile robots and an action space required by each mobile robot for independently and flexibly avoiding obstacles; determining a state space according to a tracking error of a mobile robot attitude and a speed, wherein the state space comprises: at the current time step, each mobile robot tracks the state space of the tracking error of the corresponding virtual mobile robot, the state space between adjacent mobile robots and the state space required by each mobile robot to describe the surrounding environment information; setting a reward function for reinforcement learning, wherein the reward function comprises a formation reward function and an obstacle avoidance reward function;

and based on the prior controller, performing reinforcement learning training according to the action space, the state space and the reward function by interacting with the environment, and finishing the training to obtain a flexible formation method of the cascade multi-mobile robot, wherein the flexible formation method comprises a formation strategy and a flexible obstacle avoidance strategy.

Further, the kinetic equation is described as follows:

where eta is [ x, y, theta ]]^TA pose vector representing each mobile robot, where (x, y) is the position of each mobile robot and θ isAn angle of each mobile robot;

in order to move the speed of the robot,

omega is the current angular velocity of the mobile robot, v_rAnd v_lRespectively representing the speeds of the left and right wheels of the mobile robot;

the kinetic model constraint form is as follows:

still further, the method for determining the prior controller of the reinforcement learning architecture in the flexible formation method of the nonlinear mobile robot specifically includes: s31, determining the expected track of the virtual expected mobile robot as eta_r＝[x_r,y_r,θ_r]^T，(x_r,y_r) To virtually expect the position of the mobile robot, θ_rTo virtually expect the angle of the mobile robot, the tracking error of the mobile robot for determining the attitude and the tracking error of the velocity of the mobile robot according to the virtual expected trajectory are expressed as:

e_xposition tracking error for the x direction; e.g. of the type_yPosition tracking error for the y direction; e.g. of the type_θIs the tracking error of the azimuth;

the speed tracking errors in the x direction and the y direction respectively;

is the angular velocity tracking error;

is the desired angular velocity of the virtual robot;

s32, determining an expected formation model among the distance, the angle and the heading among adjacent mobile robots, wherein the expected formation model is specifically described as follows:

wherein v is₁，v₂Respectively representing virtual robot objects to be tracked by adjacent mobile robots, namely a virtual robot 1 and a virtual robot 2, (x)_v1,y_v1) Is the position of the virtual robot 1, (x)_v2,y_v2) Is the position of the virtual robot 2, theta_v1Angle of the virtual robot 1, theta_v2Is the angle of the virtual robot 2; d_v2v1Relative distance of adjacent mobile robots v1, v 2; phi is a_v2v1Relative angles of adjacent mobile robots v1, v 2; beta is a_v2v1An angle correction amount for the mobile robot maintaining the same azimuth angle;

s33, combining the theories (1) to (4) and the feedback linearization nonlinear control theory, the formation control prior description form of the adjacent mobile robots is as follows:

wherein v and w are the speed and angular velocity of the mobile robot meeting the preset formation requirement,

the performance of the prior controller for the non-linear formation of the virtual robot 1 is over-parametric,

performance superparameters of a priori controllers for nonlinear formation of virtual robots 2, and performance superparameters directly determine the priori controllersAnd controlling the performance.

Further, the formation tracking motion space of the two adjacent mobile robots is expressed as follows;

wherein the content of the first and second substances,

tracking the performance hyper-parameters of the nonlinear formation prior controller of the virtual robot 1 for the mobile robot,

tracking performance hyper-parameters of the virtual robot 2 non-linear formation prior controller for neighboring mobile robots of the mobile robot,

the action space required by each mobile robot for independently and flexibly avoiding the obstacle is expressed as follows;

wherein v is_discreteAnd omega_discreteRespectively, a discretized speed command and an angular speed command of the mobile robot.

Further, a state space in which each mobile robot tracks a tracking error of the corresponding virtual mobile robot at the current time step is represented as follows:

the state space between adjacent mobile robots is represented as follows:

wherein the content of the first and second substances,

tracking the position tracking error of the virtual robot 1 in the x direction for the mobile robot;

tracking the position tracking error of the virtual robot 1 in the y direction for the mobile robot;

tracking the tracking error of the virtual robot 1 in the azimuth angle for the mobile robot;

a position tracking error of the virtual robot 2 in the x direction is tracked for neighboring mobile robots of the mobile robot,

tracking the position tracking error of the virtual robot 2 in the y direction for the adjacent mobile robots of the mobile robot;

tracking the tracking error of the virtual robot 2 at the azimuth angle for the adjacent mobile robots of the mobile robot; e.g. of the type₁Tracking error of virtual robot 1 for mobile robot, e₂Tracking errors of the virtual robot 2 for neighboring robots of the mobile robot;

and

respectively representing the formation state quantities of the distance, the angle and the course between the adjacent mobile robots under each time step t; | u₁||₂,||u₂||₂,

Respectively representing the speed and angular velocity between the robot 1 and the robot 2 relative to the virtual robotAnd the relative value of the acceleration, the purpose of which is to expect the mobile robot to operate with continuous and smooth speed and acceleration, where | u₁||₂Is the speed value of the mobile robot 1 relative to the virtual robot, including speed and angular velocity;

is the acceleration value of the robot 1 relative to the virtual robot, including acceleration and angular acceleration; | u₂||₂Is a speed value between the mobile robot 2 and the virtual robot, including a speed and an angular speed;

is an acceleration value of the mobile robot 2 relative to the virtual robot, including acceleration and angular acceleration.

The state space required for each mobile robot to describe the surrounding environment information is represented as follows:

wherein eta is_tIs the pose vector of the mobile robot at the current moment, d_rThe distance between the mobile robot at the present time and its desired virtual mobile robot position, d_obAnd a distance vector of the mobile robot within a safety threshold value at the current moment, | delta theta | is the difference between the speed and the angular speed of the mobile robot at the adjacent moment.

Still further, the formation reward function between two adjacent mobile robots is described in the following form:

wherein epsilon_threshR in the reward function for setting the threshold_{error_1}Is the sum of penalty terms for tracking errors of two mobile robots with respect to a desired virtual mobile robot, for enlighteningThe robot reduces the tracking error of the expected position as much as possible; r is a reward or penalty value, R_formationThe system is a reward or penalty function and is used for guiding the robot to keep the continuity of formation, if the dynamic change range of the formation is within a set threshold value, a positive reward value is fed back, and if the dynamic change range of the formation is not within the set threshold value, a negative penalty value is returned; r_velocityThe method is used for guiding the mobile robot to keep consistency of speed and acceleration and maintain a continuous and smooth motion mode.

Still further, the obstacle avoidance reward function is in a specific form as follows:

wherein the reward function R_{error_2}The penalty item of the mobile robot i for the expected tracking error of the virtual mobile robot is used for guiding the formation recovery of the robot; r_avoidGuiding a mobile robot to perform autonomous obstacle avoidance, epsilon_safeTo a safe threshold, r₁Is the penalty value when the robot is within a safe threshold from the nearest obstacle but has not yet fully collided. r is₂Is a penalty value when the robot collides with the obstacle; r_{delta_yaw}The penalty value of the change of the direction angle of the adjacent time step of the mobile robot is used for controlling the change of the direction angle of the mobile robot i, so that the overall motion track is smoother.

Further, in the training process, independent training is respectively carried out on two subtasks of formation tracking and flexible obstacle avoidance, and the specific method comprises the following steps:

aiming at the formation tracking task, the action space is selected to be the formation tracking action space of two adjacent mobile robots

State space based on tracking error of each mobile robot tracking corresponding virtual mobile robot at current time step

State space between adjacent mobile robots

The action value network outputs the evaluation of the current action, the Q value of the evaluation output by the current action value network is used as the weight, and the action network is updated based on the strategy gradient;

the specific updating of the action value network is described as follows

Wherein, w_iCalculating a priority sampling weight for the current time i based on a priority empirical replay algorithm; r is_iThe reward signal is the current time i; gamma is a discount factor; q_θ′(s_i+1,μ′(s_i+1) For the next time i +1, the target action mu'(s) is taken as the target action value_i+1) Evaluation of (1), s_iIs the current time i the state value of the robot, s_i+1The state value, a, of the robot at the next moment i +1_iThe motion of the robot at the current moment i, and N is the number of samples sampled in small batches; q_θ(s_i,a_i) And the evaluation value of the current motion value network to the state and motion command of the robot at the current time i.

Aiming at the flexible obstacle avoidance task, a near-end strategy optimization algorithm framework based on discrete action space is adopted, and the action space is selected to be the action space required by each mobile robot for independently and flexibly avoiding the obstacle

Selecting a state space for tracking the tracking error of a corresponding virtual mobile robot for each mobile robot at the current time step

State space required for describing surrounding environment information with each mobile robot

Further, the method for updating the target action value network comprises the following steps: after each small batch of training is finished, parameters of the updated online action network and the updated online action value network are updated, and the specific description form is as follows:

η′←τη+(1-τ)η′ (14)

where η' and η are partial tables representing the target network parameter and the current network parameter, and τ is used to control the ratio of updates.

Further, the method further comprises a local collision detection step, wherein the local collision detection step is used for detecting the safe distance between a local obstacle and the robot, and if the returned safe distance meets the requirement of a safe state, the mobile robot individually exits from the flexible obstacle avoidance strategy and recovers the formation strategy.

The invention has the beneficial technical effects;

the flexible formation method of the cascaded multi-mobile robots is based on reinforcement learning and priori nonlinear distance-angle-course formation control, so that the plurality of mobile robots can adaptively adjust key parameters in a formation control algorithm, and the stability and tracking accuracy of formation are improved; meanwhile, a flexible obstacle avoidance strategy is trained independently, so that each robot in the formation has certain flexible obstacle avoidance capability, and the flexibility and the autonomy of each mobile robot in the formation are improved.

The formation tracking framework of the calculation method is based on a depth determination type strategy gradient algorithm, and the performance and the efficiency of the algorithm are further improved by simplifying the random exploration process and introducing a priority experience playback mechanism. Blind exploration is avoided by introducing prior nonlinear distance-angle-course formation controller information, so that the training process is more targeted, the speed of algorithm convergence is increased, the prior formation controller information controller can avoid abnormal behaviors damaging an actuator from end to end in the inference application process, and the robustness of the whole formation is improved.

The obstacle avoidance framework of the calculation method is based on a near-end strategy optimization algorithm, and when the obstacle is flexibly avoided, the motion space of the mobile robot is discretized, so that the search space is reduced, and the training complexity is reduced; by introducing a collision detection function module, the obstacle distance is monitored in real time to determine whether the formation tracking mode can be returned.

Preferably, the training of the two sets of frameworks are mutually independent, supplement each other in the inference formation process, and jointly complete the flexible formation of the multiple mobile robots.

Drawings

FIG. 1 is a schematic diagram of an overall framework for a particular embodiment of the invention;

FIG. 2 is a schematic diagram of a training phase of an embodiment of the present invention;

FIG. 3 is a schematic diagram of inference based flexible formation according to a specific embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples of the specification.

Example (b): a flexible formation method of cascaded multiple mobile robots mainly comprises the following steps: s1, selecting a formation from a formation library, and confirming the priority of each robot and the specific position in the formation according to a distance-angle-course formation mode;

s2, determining a dynamic model according to the type of the robot;

s3, designing expected prior tracks of a virtual navigator and a virtual follower according to the constraint of a dynamic model and by combining the constraint of relative distance, the constraint of relative angle and the constraint of course between robots, converting the formation problem of the actual robots into a plurality of tracking method problems for tracking the tracks of the virtual mobile robots, and designing a nonlinear formation tracking prior controller of angular velocity of corresponding velocity as a knowledge prior in the whole reinforcement learning framework;

s4, designing a collision detection module of the whole formation algorithm, and detecting the safe distance between a local obstacle and the robot;

s5, designing an action space part of the whole formation algorithm framework, wherein the action space part is mainly divided into two parts, one part is a speed space containing the speed and the angular velocity of the mobile robot, and the other part is a parameter space containing all performance parameters of the priori nonlinear tracking control knowledge;

s6, designing a state space part in the whole formation algorithm framework, wherein the state space part mainly comprises the position and the posture of each robot and barrier information in the environment;

s7, designing a reward function for guiding the robots to form a team for learning and flexibly avoiding the obstacles, wherein the reward function mainly comprises a team reward function, a tracking reward function and an obstacle avoidance reward function;

and S8, building a simulation environment for training, so that the intelligent agent interacts with the environment for trial and error under the condition of checking nonlinear formation control knowledge, and a plurality of mobile robots complete a flexible and stable formation process strategy and a flexible obstacle avoidance strategy based on distance, angle and direction through learning.

Further, in step S1, the type of each robot is isomorphic, and the number N of robots is greater than or equal to 2;

further, in step S2, taking the two-wheel differential mobile robot as an example, the kinetic equation is described as follows:

where eta is [ x, y, theta ]]^TA pose vector representing each mobile robot;

in order to move the speed of the robot,

angular velocity, v, of the mobile robot_rAnd v_lRespectively representing the speeds of the left and right wheels; it should be noted that the two-wheel drive mobile robot has incomplete constraint, so that the mobile robot can only move forward and backward, but not left and right, and the constraint form is as follows:

further, taking distance-angle-heading formation of multiple mobile robots as an example, the specific design steps of the a priori formation control knowledge of the mobile robots in S3 are as follows:

s31, designing a tracking controller between the mobile robot and the virtual expected mobile robot. The expected trajectory of the virtual mobile robot is defined as eta_r＝[x_r,y_r,θ_r]^TThe tracking error of the attitude and the velocity is:

s32, designing an expected formation model between the distance, the angle and the course between adjacent mobile robots, wherein the expected formation model is specifically described as follows:

wherein v1 and v2 represent virtual robot objects to be tracked by adjacent mobile robots, and are respectively marked as virtual robot 1 and virtual robot 2, d_v2v1、φ_v2v1、β_v2v1The distance angle and the direction between v1 and v2 are represented as state quantities under a distance-angle-heading formation framework.

wherein v and w are the speed and angular velocity of the mobile robot meeting the preset formation requirement, [ K ]_x,K_y,K_θ]Controlling prior performance hyper-parameters for nonlinear formation of mobile robots, and directly determining values of the performance hyper-parametersQuality of formation tracking;

further, in S4, the collision detection function module determines the distance to the obstacle by the mobile robot sensor itself, and outputs a boolean collision warning flag;

further, in S5, the motion space mainly consists of two parts, one part is the motion space required for formation tracking, and the other part is the motion space required for flexible obstacle avoidance when detecting a local obstacle, and the specific design description is as follows:

and S51, designing a formation tracking action space of two adjacent mobile robots. The specific method is based on the nonlinear formation knowledge prior involved in S33, and the motion space is as follows:

wherein [ K ]_x,K_y,K_θ]Controlling prior performance hyper-parameters for nonlinear formation of mobile robots;

s52, designing an action space required by each mobile robot for independently and flexibly avoiding the obstacle:

wherein v is_discreteAnd omega_discreteRespectively representing a discretized speed instruction and an angular speed instruction of the mobile robot;

further, in S6, the state space mainly includes three parts, one part is a state space describing a tracking error of each mobile robot tracking a corresponding virtual robot, one part is a state space describing a formation satisfying distance-angle-heading between adjacent mobile robots, and one part is a state space required for describing surrounding environment information, and the specific design description is as follows:

s61, taking two adjacent mobile robots as an example, designing and describing a state space of tracking errors of each mobile robot tracking the corresponding virtual mobile robot at the current time step as follows:

s62, taking two adjacent mobile robots as an example, designing a state space between the two adjacent mobile robots, which meets a distance-angle-course formation framework, as follows:

d, phi and beta respectively represent the formation state quantities of the distance, the angle and the course between adjacent mobile robots at each time step; | u₁||₂,||u₂||₂,

Representing the relative values of speed, angular velocity and acceleration between the robot 1 and the robot 2, respectively, with respect to the virtual robot, the purpose of this item being that the mobile robot is expected to operate with continuous and smooth speed and acceleration;

s63, designing a state space required by each mobile robot to describe the surrounding environment information as follows:

wherein eta is_tIs the pose vector of the mobile robot at the current moment, d_rThe distance between the mobile robot at the present time and its desired virtual mobile robot position, d_obThe distance vector of the mobile robot within the safety threshold value at the current moment, | delta u | is the difference between the speed and the angular speed of the mobile robot at the adjacent moment;

further, the incentive function design in S7 can be subdivided into two sub-incentive function designs, one is for the formation tracking sub-task, and the other is for the flexible obstacle avoidance and formation recovery sub-task, that is:

s71, designing a reward function of the formation tracking subtask,

the formation reward function between two adjacent mobile robots is described in the following form:

wherein R in the reward function_errorIs the sum of penalty terms of the tracking error of the two mobile robots to the expected virtual mobile robot and is used for inspiring that the tracking error of the robot to the expected position is reduced as much as possible; r_formationThe system is used for guiding the robot to keep the consistency of formation, if the dynamic change range of the formation is within a threshold value, a positive reward is fed back, otherwise, a negative penalty is fed back; r_velocityThe system is used for guiding the mobile robot to keep the consistency of speed and acceleration and maintain a continuous and smooth motion mode;

s72, designing a flexible obstacle avoidance reward function of the mobile robot i, wherein the specific form is as follows:

wherein the reward function R_errorThe penalty item of the mobile robot i for the expected tracking error of the virtual mobile robot is used for guiding the formation recovery of the robot; r_avoidGuiding the mobile robot to perform autonomous obstacle avoidance; r_{delta_yaw}The change of the direction angle of the mobile robot i is restricted,

to save energy;

optionally, the reward function designed in S72 only occurs in the obstacle avoidance task stage, and is used to inspire that the mobile robot quickly avoids a local obstacle, and when it is determined through S4 that the mobile robot is far away from the obstacle, the obstacle avoidance task stage exits, and is switched to a formation tracking subtask, and formation is resumed and maintained under the guidance of the S71 reward function;

further, in S8, in the training process, independent training is performed for two subtasks of formation tracking and flexible obstacle avoidance, which are specifically described as follows:

s81, aiming at the formation tracking task, adopting a deterministic strategy gradient algorithm framework based on a continuous action space, wherein the action space is selected as

The state space is based on

And

the algorithm generally follows the "actor-critic" pattern, but unlike other reinforcement learning algorithms, the greatest advantage of the algorithm is that the output of the action network is a deterministic action rather than a strategic distribution.

On the other hand, the action value network outputs an evaluation of the current action, and then updates the action network based on the policy gradient with the evaluated Q value output by the current action value network as a weight. The action value network is updated based on the target action network and the target action value network which are off-line, and the method has the advantages that the parameter change of the target network is small, so that the training process is more stable.

The specific updating of the action value network is described as follows

Wherein, w_iA priority sampling weight calculated for a priority-based empirical replay algorithm; r is_iIs the current reward signal; gamma is a discount factor; q_θ′(s_i+1,μ′(s_i+1) For the next moment, the target action mu'(s) is taken as the target action value_i+1) Evaluation of (2)

Preferably, the updating of the target network is based on a soft updating strategy, and after each small batch is trained, the parameters of the updated online action network and the updated online action value network are updated, and the specific description form is as follows:

η′←τη+(1-τ)η′ (14)

Tau is used for controlling the updating proportion, and the soft updating method reduces the influence of abnormal parameters and avoids abnormal jump of parameters in the parameter updating process.

S82, aiming at the flexible obstacle avoidance task, adopting a near-end strategy optimization algorithm framework based on discrete action space, and selecting the action space as

Selecting a state space of

And

the near-end strategy optimization algorithm is optimized aiming at the problems of slow parameter updating, low data utilization rate and the like of the on-line strategy of the traditional strategy gradient algorithm, a resampling mechanism is introduced to the algorithm on the basis of a generalized advantage evaluation algorithm to convert the on-line strategy into an off-line strategy to improve the data utilization rate, and meanwhile, a more stable training process is obtained on the basis of KL divergence or the updating amplitude of cutting operation constraint parameters.

And further, S9, completing the construction of the whole inference-based cascade formation control algorithm according to the offline learning formation strategy in S7.

In S9, a flexible formation algorithm framework of the mobile robot based on reasoning is constructed by using the formation and flexible obstacle avoidance strategy trained in S8, and the specific process is described as follows:

s91, determining a formation requirement and a task environment;

s92, the mobile robot is formed and loaded with a priori-based formation strategy and a flexible obstacle avoidance strategy which are pre-trained in S8;

s93, the mobile robot formation adopts a formation tracking strategy to perform formation tracking according to the interaction information with the environment, and each robot individual performs local collision detection;

s94, if flexible obstacle avoidance is needed, the mobile robots form a team to switch individual flexible obstacle avoidance strategies, and real-time obstacle avoidance is conducted according to the interaction information with the environment;

s95, if the local collision detection function module returns to the safe state, the mobile robot individual exits the flexible obstacle avoidance strategy and rapidly restores the formation state;

s96, repeating S93 to S95 until the target point is reached.

The invention provides a cascade multi-mobile-robot flexible formation method based on reinforcement learning and prior nonlinear distance-angle-course formation control, which moves the computational power consumption of on-line resolving to the off-line through reinforcement learning to realize the flexible formation of the multi-mobile-robot based on reasoning;

in the training stage, a formation tracking strategy and a flexible obstacle avoidance strategy are trained independently, so that the training difficulty is reduced, meanwhile, a nonlinear distance-angle-course formation control prior is introduced, the training speed is improved, and a complicated parameter tuning process is avoided; and in the inference stage, independent strategies based on offline training are combined to meet the task requirements of autonomous stable formation and flexible obstacle avoidance. Compared with the existing formation tracking control algorithm based on navigation-tracking, the robot formation tracking control algorithm endows the robot formation with independent tracking capability and also endows each mobile robot with independent obstacle avoidance capability facing local static and dynamic obstacles, and has the characteristics of autonomy, stability, high efficiency and flexibility.

The overall framework of the embodiment is shown in fig. 1, where 1 is an offline independent training framework, 2 is an inference flexible formation framework, 21 is a flexible formation obstacle avoidance strategy, and 3 is a simulation interactive environment.

Firstly, in a training stage 1, respectively training a formation strategy and a flexible obstacle avoidance strategy; the training of the formation strategy is based on prior formation experience, the training process and the convergence speed are accelerated, blind exploration of multiple mobile robots is prevented, and the stability of formation is improved; after training, storing the two strategy parameters;

then, in an inference stage 2, the multiple mobile robots flexibly call a flexible formation obstacle avoidance strategy 21 based on experience to perform autonomous formation tracking and flexible obstacle avoidance, and the on-line calculation process is migrated to the off-line mode, so that the method is more efficient and stable.

The frame of the training stage is shown in fig. 2, the whole training strategy is based on the thought of course learning, namely the training environment is simplified to be more complicated, and the strategy performance is gradually improved; FIG. 2 is a diagram 1 of a formation training environment based on a distance-angle-heading formation control prior; 2, flexibly avoiding the obstacle for each mobile robot; 3 is a continuous deterministic strategy gradient algorithm agent; 4, a discrete near-end strategy optimization algorithm agent; and 5, a formation strategy parameter and a flexible obstacle avoidance strategy parameter which are stored in an off-line mode after training, wherein the specific process is described as follows:

the formation strategy training process is described as follows:

firstly, configuring various simple to complex training environments according to the thought of course learning, such as training the formation environment of two mobile robots and then gradually increasing the number of the formation robots;

next, for each preset simulation environment, initializing an action network, a target action network, an action value network, and target action value network parameters in the continuous deterministic policy gradient algorithm agent 3 in fig. 2; for each iteration cycle, a formation training environment 1 based on a distance-angle-heading formation control prior is initialized, followed by, for each time step:

step 1: selecting an action from the threshold range of the action space according to a strategy, and adding random Gaussian noise to improve the random exploration performance;

step 2: interacting with a formation training environment, specifically inputting a selected deterministic action into a cascade prior formation controller of the multiple mobile robots designed by combining a preset formation mode and geometric relations of distance, angle and course, specifically describing the controller as the formula (5), then inputting prior formation upper-layer control speed and angular speed instructions into each mobile robot, updating the current state of the environment while the robot completes the instructions, and feeding back a state value, a reward value and a Boolean flag bit indicating whether a task is finished or interrupted;

and step 3: storing the information fed back by the environment and the calculated priority into an experience pool as data for training;

and 4, step 4: when the capacity of the experience pool overflows, sampling according to the priority, and training and updating the action value network and the action network; the method comprises the following steps:

step 4-1: inputting the sampled state value into an action network to obtain an action;

step 4-2: inputting the action and the sampled state value into an action value network to obtain a Q value;

step 4-3: according to the Q value, the back propagation updates the action network according to the strategy gradient;

step 4-4: repeating the steps 4-2 and 4-3 to obtain a Q value calculated by the updated action through the action value network;

and 4-5: inputting the next state value in the sampled experience into a target action network to obtain a target action;

and 4-6: inputting the action and the state into a target state network to obtain a target Q value;

and 4-7: updating the action value network by combining the target Q value and the Q value calculated in the step 4-4 according to the formula (13) and the priority weight coefficient;

and 5: repeating the steps 1-4;

step 6: after a certain time step is met, respectively carrying out soft updating on the target action network and the target action value network according to the formula (14);

and 7: and storing each network parameter of the final formation strategy for calling in next training or reasoning.

The individual autonomous flexible obstacle avoidance strategy training process is as follows:

firstly, configuring various simple to complex training environments according to the thought of course learning, such as firstly training an obstacle avoidance strategy of the mobile robot in a static obstacle environment and then training an obstacle avoidance strategy of the mobile robot in a dynamic obstacle environment;

secondly, initializing a policy network and a value network in 2-4 for each preset simulation environment; during each iteration cycle, the corresponding context is initialized, and then, during each time step:

step 1: in a strategy network, inputting environment state information to obtain strategy distribution, and sampling an action in a discrete action space according to the distribution;

step 2: inputting the action into an environment, interacting with a flexible obstacle avoidance environment, updating the environment state, feeding back a state value, a reward value and a Boolean type flag bit indicating whether the task is finished or interrupted;

and step 3: repeating the step 1, sampling certain experiences, and storing;

and 4, step 4: inputting the state of the last step in the step 3 into a value network to obtain the state value, and then backtracking and calculating the discount reward value under the time steps;

and 5: inputting all stored experiences into a value network, and calculating an advantage value by utilizing generalized advantage evaluation;

step 6: according to the calculated advantage value, the updated value network is propagated reversely;

and 7: inputting all state values in the stored experience into a strategy network and a past strategy network to respectively obtain different strategy distributions, converting the same strategy into different strategies by using resampling, and reversely propagating and updating the strategy network;

and 8: repeating the step 5-6, and then updating the past strategy network parameters by using the strategy network parameters;

and step 9: and (5) repeating the steps 1-8, and storing each network parameter of the final flexible obstacle avoidance strategy for calling in the next training or reasoning.

The method for flexibly forming a cascade multi-mobile robot based on reinforcement learning and priori nonlinear distance-angle-course formation control provided by the embodiment is carried out according to the steps shown in fig. 3 when actual deployment and application are carried out:

step 1: acquiring an expected track of an upper-layer motion plan for formation tracking;

step 2: the specific formation form requirements of the formation tasks are determined, prior formation control information is obtained, and the task environment is determined;

and step 3: loading off-line formation tracking strategy and flexible obstacle avoidance strategy pre-trained in training phase;

and 4, step 4: after the state of the mobile robot is obtained according to a pre-trained formation strategy, the action network feeds back actions, and the mobile robot performs a formation tracking task according to the actions;

and 5: performing local collision detection to ensure the safety of formation tracking, if an obstacle is within a safety threshold from a certain mobile robot, skipping to the step 6, otherwise, performing the step 7;

step 6: calling an offline flexible formation strategy pre-trained in a training phase corresponding to the mobile robot, sampling discrete actions from distribution output by a strategy network by the mobile robot, avoiding local obstacles, quickly returning to the position of the mobile robot in formation with an error as small as possible, and continuously tracking the virtual mobile robot in a corresponding formation mode;

and 7: and if not, returning to the step 4 and continuing to perform the formation tracking.

The method provided by the invention is based on a strategy gradient algorithm combining the priori nonlinear distance-angle-course formation control knowledge and continuous control, avoids blind exploration of the mobile robot, improves the speed of training convergence, avoids a fussy coefficient tuning process, and simultaneously introduces a near-end strategy to optimize the flexible obstacle avoidance capability of independently training a single mobile robot to deal with local static and dynamic obstacles. The method comprises a training and reasoning stage, complex on-line resolving processes are transferred to the off-line, formation and flexible obstacle avoidance strategies are independently trained based on course learning ideas, and meanwhile a pre-training strategy is flexibly called in a reasoning stage, so that the whole formation has higher autonomy and flexibility.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A flexible formation method of a cascade multi-mobile robot is characterized by comprising the steps of determining a dynamic model according to the distance, the angle and the course among robots based on a selected formation form; determining a priori controller of a reinforcement learning framework in a flexible formation method of the nonlinear mobile robot according to the dynamic model and the dynamic model constraint; determining an action space based on the hyper-parameters of the pose vectors of the mobile robots, wherein the action space comprises a formation tracking action space of two adjacent mobile robots and an action space required by each mobile robot for independently and flexibly avoiding obstacles; determining a state space according to a tracking error of a mobile robot attitude and a speed, wherein the state space comprises: at the current time step, each mobile robot tracks the state space of the tracking error of the corresponding virtual mobile robot, the state space between adjacent mobile robots and the state space required by each mobile robot to describe the surrounding environment information; setting a reward function for reinforcement learning, wherein the reward function comprises a formation reward function and an obstacle avoidance reward function;

2. The method of claim 1, wherein the dynamic equations are described as follows:

where eta is [ x, y, theta ]]^TA pose vector representing each mobile robot, where (x, y) is a position of each mobile robot and θ is an angle of each mobile robot;

in order to move the speed of the robot,

the kinetic model constraint form is as follows:

3. the method of claim 2, wherein the method for determining the prior controller of the reinforcement learning architecture in the flexible formation method of the nonlinear mobile robot specifically comprises: s31, determining the expected track of the virtual expected mobile robot as eta_r＝[x_r,y_r,θ_r]^T，(x_r,y_r) To virtually expect the position of the mobile robot, θ_rFor virtually expecting the angle of the mobile robot, the tracking error of the attitude and the tracking error of the speed determined by the mobile robot according to the virtual expected track are expressed as follows:

respectively, a velocity tracking error in the x direction and a velocity tracking error in the y direction;

is the angular velocity tracking error;

is the desired angular velocity of the virtual robot;

and (3) directly determining the control performance of the prior controller for the performance hyper-parameter of the nonlinear formation prior controller of the virtual robot 2.

4. The flexible formation method of cascaded multiple mobile robots according to claim 1, wherein the formation tracking motion space of two adjacent mobile robots is represented as follows;

wherein the content of the first and second substances,

5. The method for flexible formation of cascaded multi-mobile robots according to claim 1, wherein the state space of tracking error of each mobile robot at the current time step for tracking the corresponding virtual mobile robot is represented as follows:

the state space between adjacent mobile robots is represented as follows:

wherein the content of the first and second substances,

a formation state quantity representing a distance between adjacent mobile robots at each time step t,

A formation state quantity representing an angle between adjacent mobile robots at each time step t,

And

representing the formation state quantity of the course between the adjacent mobile robots under each time step t; | u₁||₂Is the speed value of the mobile robot 1 relative to the virtual robot, including speed and angular velocity;

is the acceleration value of the mobile robot 2 relative to the virtual robot, including acceleration and angular acceleration;

wherein eta is_tIs the pose vector of the mobile robot at the current moment, d_rThe distance between the mobile robot at the present time and its desired virtual mobile robot position, d_obFor the current time, the movementThe distance vector of the robot from the obstacle within the safety threshold, | Δ θ | is the difference between the speed and the angular velocity of the mobile robot at the adjacent time.

6. The flexible formation method of the cascaded multiple mobile robots according to claim 5, wherein a formation reward function between two adjacent mobile robots is described in a specific form as follows:

wherein epsilon_threshR in the reward function for setting the threshold_{error_1}Is the sum of penalty terms of the tracking error of the two mobile robots to the expected virtual mobile robot and is used for inspiring that the tracking error of the robot to the expected position is reduced as much as possible; r is a reward or penalty value, R_formationThe system is a reward or penalty function and is used for guiding the robot to keep the continuity of formation, if the dynamic change range of the formation is within a set threshold value, a positive reward value is fed back, and if the dynamic change range of the formation is not within the set threshold value, a negative penalty value is returned; r_velocityThe method is used for guiding the mobile robot to keep consistency of speed and acceleration and maintain a continuous and smooth motion mode.

7. The flexible formation method of cascaded multiple mobile robots according to claim 5, wherein the obstacle avoidance reward function is in the following specific form:

wherein the reward function R_{error_2}The penalty item of the mobile robot i for the expected tracking error of the virtual mobile robot is used for guiding the formation recovery of the robot; r_avoidGuiding a mobile robot to perform autonomous obstacle avoidance, epsilon_safeTo a safe threshold, r₁The distance between the robot and the nearest obstacle isPenalty values within a safe threshold but not yet fully collided; r is₂Is a penalty value when the robot collides with the obstacle; r_{delta_yaw}The penalty value of the change of the direction angle of the adjacent time step of the mobile robot is used for controlling the change of the direction angle of the mobile robot i, so that the overall motion track is smoother.

8. The flexible formation method of the cascaded multi-mobile robot as claimed in claim 1, wherein in the training process, independent training is performed respectively for two subtasks of formation tracking and flexible obstacle avoidance, and the specific method comprises the following steps:

for the formation tracking task, the action space is selected as the formation tracking action space a of two adjacent mobile robots¹ _spaceState space tracking error of each mobile robot on the basis of current time step

State space between adjacent mobile robots

the specific updating of the action value network is described as follows

Wherein, w_iCalculating a priority sampling weight for the current time i based on a priority empirical replay algorithm; r is_iThe reward signal is the current time i; gamma is a discount factor; q_θ′(s_i+1,μ′(s_i+1) For the next time i +1, the target action mu'(s) is taken as the target action value_i+1) Evaluation of (1), s_iIs the current time i the state value of the robot, s_i+1Is the state value of the robot at the next moment i +1, a_iThe motion of the robot at the current moment i, and N is the number of samples sampled in small batches; q_θ(s_i,a_i) Evaluating the state and the action instruction of the robot at the current moment i by the current action value network;

9. The method as claimed in claim 8, wherein the target action value network is updated by: after each small batch of training is finished, parameters of the updated online action network and the updated online action value network are updated, and the specific description form is as follows:

η′←τη+(1-τ)η′ (14)

10. The flexible formation method of cascaded multiple mobile robots according to claim 1, further comprising a local collision detection step, wherein the local collision detection step is used for detecting a safe distance from a local obstacle to the robot, and if the safe distance is returned to meet a safe state requirement, the mobile robot exits from a flexible obstacle avoidance strategy and recovers the formation strategy.