CN113485323A - Flexible formation method for cascaded multiple mobile robots - Google Patents

Flexible formation method for cascaded multiple mobile robots

Info

Publication number
CN113485323A
CN113485323A (application CN202110655081.9A; granted publication CN113485323B)
Authority
CN
China
Prior art keywords
robot
mobile robot
formation
mobile
virtual
Legal status: Granted
Application number
CN202110655081.9A
Other languages
Chinese (zh)
Other versions
CN113485323B (en)
Inventor
董璐
何子辰
孙长银
王嘉伟
薛磊
潘晶
Current Assignee
Tongji University
Original Assignee
Tongji University
Application filed by Tongji University
Priority to CN202110655081.9A
Publication of CN113485323A
Application granted
Publication of CN113485323B
Legal status: Active

Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05D — SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 — Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02 — Control of position or course in two dimensions
    • G05D1/021 — Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0287 — Control specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
    • G05D1/0291 — Fleet control
    • G05D1/0212 — Control specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219 — Control with means for defining a desired trajectory ensuring the processing of the whole working surface
    • G05D1/0276 — Control specially adapted to land vehicles using signals provided by a source external to the vehicle

Abstract

The invention provides a flexible formation method for cascaded multiple mobile robots. It is based on a policy gradient algorithm for continuous control that embeds prior nonlinear distance-angle-heading formation control knowledge, which spares the mobile robots blind exploration, speeds up training convergence, and avoids a tedious coefficient tuning process; in parallel, proximal policy optimization is introduced to independently train each single mobile robot's flexible obstacle avoidance against local static and dynamic obstacles. The method comprises a training stage and an inference stage: the complex online computation is shifted offline, the formation and flexible obstacle avoidance strategies are trained independently following the idea of curriculum learning, and the pre-trained strategies are called flexibly in the inference stage, so that the formation as a whole has greater autonomy and flexibility.

Description

Flexible formation method for cascaded multiple mobile robots
Technical Field
The invention belongs to the field of multiple mobile robots, and particularly relates to a flexible formation method for multiple mobile robots based on a cascade architecture, in particular to a cascaded multi-mobile-robot formation method based on reinforcement learning and prior nonlinear distance-angle-heading formation control.
Background
With the development of robot technology, formations of multiple mobile robots, by virtue of their capacity for cooperation, effectively improve operating efficiency and are gradually replacing traditional single-machine operation. For example, multiple underwater robots search in cooperative formation. In military applications such as mine clearance, search and rescue, and reconnaissance, unmanned aerial vehicle swarms and teams of ground mobile robots likewise demonstrate the advantages of multi-machine formation. Recently, with the COVID-19 pandemic raging worldwide, many hospitals in China have adopted disinfection mobile robots in place of traditional manual disinfection; cooperating in formation, multiple disinfection robots have effectively improved on the efficiency of single-machine operation.
The leader-follower formation strategy based on distance-angle-heading is one of the common techniques for realizing formation tracking of multiple mobile robots, and offers better flexibility and extensibility than the traditional leader-follower formation strategy. Its basic idea is to designate one robot as the leader and the others as followers, and then, from a preset formation, determine the relative distance, relative angle and heading between the leader robot and the following robots, on which the formation control strategy is designed.
At present, the mainstream methods for realizing distance-angle-heading leader-following include nonlinear control and nonlinear model predictive control. The former includes input/output feedback linearization control, feedback control, and the like; as more performance gain parameters are introduced, a tedious parameter tuning process cannot be avoided. The latter depends heavily on an accurate model and places high demands on online solving speed. Moreover, the robustness of the traditional leader-follower formation model leaves room for improvement, as it lacks flexible obstacle avoidance and formation recovery capabilities.
With the development of artificial intelligence, deep reinforcement learning, being model-free and trainable offline, has been widely applied to end-to-end mobile robot tasks, but mostly in the single-robot domain. End-to-end implementations in the multi-robot domain place strict requirements on sensor and actuator performance, the state and action spaces are of high dimension, training costs are prohibitive when deploying on real mobile robots, and inference and reproduction are difficult.
Disclosure of Invention
The invention aims to provide, against the defects of the prior art, a flexible formation method for multiple mobile robots that has flexible obstacle avoidance and formation recovery capabilities.
The invention adopts the following technical scheme. The method comprises: determining a dynamics model based on the selected formation form, according to the distances, angles and headings among the robots; determining, from the dynamics model and the dynamics model constraints, the prior controller of the reinforcement learning architecture in the nonlinear flexible formation method; determining an action space based on hyper-parameters and the pose vectors of the mobile robots, the action space comprising the formation tracking action space of two adjacent mobile robots and the action space each mobile robot needs for independent flexible obstacle avoidance; determining a state space from the tracking errors of the mobile robot's pose and velocity, the state space comprising: the state space of each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step, the state space between adjacent mobile robots, and the state space each mobile robot needs to describe its surrounding environment; and setting a reward function for reinforcement learning, the reward function comprising a formation reward function and an obstacle avoidance reward function;
and, based on the prior controller, performing reinforcement learning training through interaction with the environment according to the action space, the state space and the reward function; on completion of training, the flexible formation method for cascaded multiple mobile robots is obtained, comprising a formation strategy and a flexible obstacle avoidance strategy.
Further, the dynamics model is described as follows:

η̇ = [ẋ; ẏ; θ̇] = [v·cosθ; v·sinθ; ω],  v = (v_r + v_l)/2,  ω = (v_r − v_l)/l   (1)

where η = [x, y, θ]ᵀ is the pose vector of each mobile robot, (x, y) the position and θ the heading angle of each mobile robot; v is the speed of the mobile robot, ω the current angular velocity of the mobile robot, v_r and v_l the speeds of the right and left wheels of the mobile robot, and l the wheel separation.

The dynamics model constraint (the nonholonomic constraint of a differential drive) takes the form:

ẋ·sinθ − ẏ·cosθ = 0   (2)
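By way of illustration only (not part of the original disclosure), the following sketch integrates the pose of a differential drive robot under model (1); the function name, the explicit Euler integrator and the wheel_sep argument are assumptions of this sketch:

    import numpy as np

    def diff_drive_step(eta, v_r, v_l, wheel_sep, dt):
        # Wheel speeds -> body velocities, per model (1).
        v = 0.5 * (v_r + v_l)
        omega = (v_r - v_l) / wheel_sep
        x, y, theta = eta
        # One explicit Euler step of the pose; the nonholonomic constraint (2)
        # holds by construction, since the lateral velocity component is zero.
        return np.array([x + v * np.cos(theta) * dt,
                         y + v * np.sin(theta) * dt,
                         theta + omega * dt])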
still further, the method for determining the prior controller of the reinforcement learning architecture in the flexible formation method of the nonlinear mobile robot specifically includes: s31, determining the expected track of the virtual expected mobile robot as etar=[xr,yrr]T,(xr,yr) To virtually expect the position of the mobile robot, θrTo virtually expect the angle of the mobile robot, the tracking error of the mobile robot for determining the attitude and the tracking error of the velocity of the mobile robot according to the virtual expected trajectory are expressed as:
Figure BDA0003112352340000041
exposition tracking error for the x direction; e.g. of the typeyPosition tracking error for the y direction; e.g. of the typeθIs the tracking error of the azimuth;
Figure BDA0003112352340000042
the speed tracking errors in the x direction and the y direction respectively;
Figure BDA0003112352340000043
is the angular velocity tracking error;
Figure BDA0003112352340000044
is the desired angular velocity of the virtual robot;
s32, determining an expected formation model among the distance, the angle and the heading among adjacent mobile robots, wherein the expected formation model is specifically described as follows:
Figure BDA0003112352340000045
wherein v is1,v2Respectively representing virtual robot objects to be tracked by adjacent mobile robots, namely a virtual robot 1 and a virtual robot 2, (x)v1,yv1) Is the position of the virtual robot 1, (x)v2,yv2) Is the position of the virtual robot 2, thetav1Angle of the virtual robot 1, thetav2Is the angle of the virtual robot 2; dv2v1Relative distance of adjacent mobile robots v1, v 2; phi is av2v1Relative angles of adjacent mobile robots v1, v 2; beta is av2v1An angle correction amount for the mobile robot maintaining the same azimuth angle;
s33, combining the theories (1) to (4) and the feedback linearization nonlinear control theory, the formation control prior description form of the adjacent mobile robots is as follows:
Figure BDA0003112352340000051
wherein v and w are the speed and angular velocity of the mobile robot meeting the preset formation requirement,
Figure BDA0003112352340000052
the performance of the prior controller for the non-linear formation of the virtual robot 1 is over-parametric,
Figure BDA0003112352340000053
performance superparameters of a priori controllers for nonlinear formation of virtual robots 2, and performance superparameters directly determine the priori controllersAnd controlling the performance.
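The closed form of controller (5) is not recoverable from this text. Purely as a point of reference, a classic feedback-linearization style tracking law of the family the paragraph invokes (the Kanayama controller, with the tracking error expressed in the robot frame) reads:

v = v_r·cos(e_θ) + K_x·e_x,  ω = ω_r + v_r·(K_y·e_y + K_θ·sin(e_θ))

The patent's cascaded controller applies one such hyper-parameter set per virtual robot (K^v1 and K^v2), which is what the formation tracking action space in (6) below tunes online.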
Further, the formation tracking action space of the two adjacent mobile robots is expressed as:

a_form = [K_x^v1, K_y^v1, K_θ^v1, K_x^v2, K_y^v2, K_θ^v2]   (6)

where [K_x^v1, K_y^v1, K_θ^v1] are the performance hyper-parameters of the nonlinear formation prior controller with which the mobile robot tracks virtual robot 1, and [K_x^v2, K_y^v2, K_θ^v2] those with which its neighbouring mobile robot tracks virtual robot 2.

The action space each mobile robot needs for independent flexible obstacle avoidance is expressed as:

a_avoid = [v_discrete, ω_discrete]   (7)

where v_discrete and ω_discrete are the discretized speed and angular velocity commands of the mobile robot.
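As an illustration of how the two action spaces (6) and (7) could be realized in code (all numeric bounds and grid sizes below are hypothetical; the patent does not publish them):

    import numpy as np

    K_LOW, K_HIGH = 0.1, 10.0                  # assumed range of the controller gains
    V_CHOICES = np.linspace(-0.5, 0.5, 5)      # assumed discretized speed commands
    W_CHOICES = np.linspace(-1.0, 1.0, 5)      # assumed discretized angular commands

    def sample_formation_action(rng):
        # Action (6): six performance hyper-parameters, three per virtual robot.
        return rng.uniform(K_LOW, K_HIGH, size=6)

    def sample_avoidance_action(rng):
        # Action (7): one discretized (v, omega) command pair.
        return (rng.choice(V_CHOICES), rng.choice(W_CHOICES))

    rng = np.random.default_rng(0)
    a_form = sample_formation_action(rng)      # six gains for the prior controller
    a_avoid = sample_avoidance_action(rng)     # one discrete velocity command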
Further, the state space of each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step is expressed as:

s_track = [e_1, e_2] = [e_x^v1, e_y^v1, e_θ^v1, e_x^v2, e_y^v2, e_θ^v2]   (8)

where e_x^v1 and e_y^v1 are the position tracking errors in the x and y directions, and e_θ^v1 the heading tracking error, of the mobile robot tracking virtual robot 1; e_x^v2, e_y^v2 and e_θ^v2 are the corresponding position and heading tracking errors of its neighbouring mobile robot tracking virtual robot 2; e_1 collects the mobile robot's tracking error with respect to virtual robot 1, and e_2 the neighbouring robot's tracking error with respect to virtual robot 2.

The state space between adjacent mobile robots is expressed as:

s_form = [d_t, φ_t, β_t, ‖u_1‖₂, ‖u_2‖₂, ‖u̇_1‖₂, ‖u̇_2‖₂]   (9)

where d_t, φ_t and β_t are the formation state quantities of distance, angle and heading between the adjacent mobile robots at each time step t; ‖u_1‖₂, ‖u_2‖₂, ‖u̇_1‖₂ and ‖u̇_2‖₂ are the relative values of velocity and acceleration of robot 1 and robot 2 with respect to their virtual robots, included so that the mobile robots operate with continuous, smooth velocity and acceleration. Here ‖u_1‖₂ is the velocity value (speed and angular velocity) of mobile robot 1 relative to its virtual robot and ‖u̇_1‖₂ its acceleration value (linear and angular acceleration); ‖u_2‖₂ is the velocity value (speed and angular velocity) of mobile robot 2 relative to its virtual robot and ‖u̇_2‖₂ its acceleration value (linear and angular acceleration).

The state space each mobile robot needs to describe its surrounding environment information is expressed as:

s_env = [η_t, d_r, d_ob, |Δu|]   (10)

where η_t is the mobile robot's pose vector at the current moment, d_r the distance between the mobile robot and its desired virtual mobile robot position at the current moment, d_ob the distance vector to obstacles within the safety threshold at the current moment, and |Δu| the difference between the mobile robot's speed and angular velocity at adjacent moments.
Still further, the formation reward function between two adjacent mobile robots is described in the following form:

R = R_error_1 + R_formation + R_velocity,  with R_formation = +r if the dynamic variation of the formation stays within the set threshold ε_thresh, and −r otherwise   (11)

where ε_thresh is the set threshold; R_error_1 is the sum of the penalty terms on the two mobile robots' tracking errors with respect to the desired virtual mobile robots, which urges the robots to reduce the tracking error to the desired positions as far as possible; r is the reward or penalty value; R_formation is the reward-or-penalty function that guides the robots to preserve the continuity of the formation: if the dynamic variation of the formation is within the set threshold a positive reward value is fed back, otherwise a negative penalty value is returned; R_velocity guides the mobile robots to keep their velocity and acceleration consistent and maintain a continuous, smooth motion pattern.
Still further, the obstacle avoidance reward function takes the specific form:

R_i = R_error_2 + R_avoid + R_delta_yaw,  with R_avoid = r_1 within the safety threshold and r_2 on collision   (12)

where R_error_2 is the penalty term on mobile robot i's tracking error with respect to its desired virtual mobile robot, which guides the robot's formation recovery; R_avoid guides the mobile robot in autonomous obstacle avoidance, ε_safe being the safety threshold, r_1 the penalty value when the robot is within the safety threshold of the nearest obstacle but has not yet collided, and r_2 the penalty value when the robot collides with an obstacle; R_delta_yaw is the penalty on the change of mobile robot i's heading angle across adjacent time steps, used to constrain that change so that the overall motion trajectory is smoother.
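A schematic rendering of the obstacle avoidance reward (12); the patent specifies only its structure, so the coefficients here (k_e, k_yaw and the defaults for r1, r2) are hypothetical:

    def avoidance_reward(e_track, d_min, delta_yaw, collided, eps_safe,
                         r1=-1.0, r2=-10.0, k_e=1.0, k_yaw=0.1):
        # R_error_2: pulls robot i back toward its virtual robot (formation recovery).
        R_error = -k_e * abs(e_track)
        # R_avoid: r2 on collision, r1 inside the safety threshold, 0 otherwise.
        if collided:
            R_avoid = r2
        elif d_min < eps_safe:
            R_avoid = r1
        else:
            R_avoid = 0.0
        # R_delta_yaw: penalizes heading-angle change across adjacent time steps.
        return R_error + R_avoid - k_yaw * abs(delta_yaw)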
Further, in the training process, the two subtasks of formation tracking and flexible obstacle avoidance are trained independently, as follows.

For the formation tracking task, the action space is chosen as the formation tracking action space a_form of two adjacent mobile robots in (6), and the state space is built from the tracking-error state space s_track of (8) and the inter-robot state space s_form of (9). The action value network outputs an evaluation of the current action; with the Q value output by the current action value network as the weight, the action network is updated by the policy gradient.

The update of the action value network is specifically described as:

L(θ) = (1/N) · Σᵢ wᵢ · (rᵢ + γ·Q_θ′(s_{i+1}, μ′(s_{i+1})) − Q_θ(s_i, a_i))²   (13)

where wᵢ is the priority sampling weight at the current step i, computed by the prioritized experience replay algorithm; rᵢ is the reward signal at step i; γ is the discount factor; Q_θ′(s_{i+1}, μ′(s_{i+1})) is the target action value, i.e. the evaluation of the target action μ′(s_{i+1}) at the next step i+1; s_i is the robot's state at the current step i and s_{i+1} its state at the next step i+1; a_i is the robot's action at the current step i; N is the size of the sampled mini-batch; and Q_θ(s_i, a_i) is the current action value network's evaluation of the robot's state and action command at step i.
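A PyTorch-style sketch of the weighted temporal-difference update in (13); the network objects, batch layout and optimizer are assumptions of this sketch, not the patent's implementation:

    import torch

    def critic_update(critic, target_critic, target_actor, batch, gamma, optimizer):
        # batch: states s, actions a, rewards r, next states s_next,
        # and importance weights w from prioritized experience replay.
        s, a, r, s_next, w = batch
        with torch.no_grad():
            q_target = r + gamma * target_critic(s_next, target_actor(s_next))
        td_error = q_target - critic(s, a)
        loss = (w * td_error.pow(2)).mean()   # weighted squared TD error, cf. (13)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return td_error.detach().abs()        # new priorities for the replay buffer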
For the flexible obstacle avoidance task, a proximal policy optimization algorithm framework over a discrete action space is adopted: the action space is chosen as the action space a_avoid of (7) that each mobile robot needs for independent flexible obstacle avoidance, and the state space as the tracking-error state space s_track of (8) together with the environment state space s_env of (10) that each mobile robot needs to describe its surroundings.
Further, the target action networks are updated as follows: after each mini-batch of training, the parameters of the online action network and the online action value network are propagated to their targets, in the specific form:

η′ ← τη + (1 − τ)η′   (14)

where η′ and η denote the target network parameters and the current network parameters respectively, and τ controls the update ratio.
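A minimal sketch of the soft update (14), applied parameter-wise to a pair of networks (PyTorch-style, an assumption of this sketch):

    def soft_update(target_net, online_net, tau):
        # eta' <- tau * eta + (1 - tau) * eta', cf. (14).
        for p_tgt, p_src in zip(target_net.parameters(), online_net.parameters()):
            p_tgt.data.mul_(1.0 - tau).add_(tau * p_src.data)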
Further, the method also comprises a local collision detection step for detecting the safe distance between local obstacles and the robot, as sketched below: if the returned safe distance satisfies the safe-state requirement, the individual mobile robot exits the flexible obstacle avoidance strategy and resumes the formation strategy.
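A schematic of the local collision detection just described (the sensor model and function name are assumptions):

    import numpy as np

    def collision_warning(d_ob, eps_safe):
        # d_ob: distances to obstacles currently sensed by the robot itself;
        # returns the Boolean flag used to enter or exit obstacle avoidance.
        return bool(np.any(np.asarray(d_ob) < eps_safe))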
The invention has the following beneficial technical effects.

The flexible formation method for cascaded multiple mobile robots is based on reinforcement learning and a prior nonlinear distance-angle-heading formation control, so that the multiple mobile robots adaptively adjust the key parameters of the formation control algorithm, improving the stability and tracking accuracy of the formation; at the same time, a flexible obstacle avoidance strategy is trained independently, so that every robot in the formation has a degree of flexible obstacle avoidance capability, improving the flexibility and autonomy of each mobile robot in the formation.

The formation tracking framework of the method is based on a deep deterministic policy gradient algorithm, whose performance and efficiency are further improved by simplifying the random exploration process and introducing a prioritized experience replay mechanism. Introducing the prior nonlinear distance-angle-heading formation controller information avoids blind exploration, makes the training process more targeted and accelerates convergence; during inference and deployment, the prior formation controller information also keeps the end-to-end policy from abnormal behaviours that would damage the actuators, improving the robustness of the whole formation.

The obstacle avoidance framework of the method is based on a proximal policy optimization algorithm; for flexible obstacle avoidance the motion space of the mobile robot is discretized, which reduces the search space and the training complexity. A collision detection function module monitors the obstacle distance in real time to decide whether the robot can return to the formation tracking mode.

Preferably, the training of the two frameworks is mutually independent; they complement each other in the inference-stage formation process and jointly accomplish the flexible formation of the multiple mobile robots.
Drawings
FIG. 1 is a schematic diagram of an overall framework for a particular embodiment of the invention;
FIG. 2 is a schematic diagram of a training phase of an embodiment of the present invention;
FIG. 3 is a schematic diagram of inference-based flexible formation according to a specific embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples of the specification.
Example: a flexible formation method for cascaded multiple mobile robots mainly comprises the following steps. S1, selecting a formation from the formation library, and confirming each robot's priority and its specific position in the formation according to the distance-angle-heading formation pattern;

S2, determining the dynamics model according to the type of the robot;

S3, designing the desired prior trajectories of the virtual leader and the virtual followers according to the dynamics model constraints, combined with the constraints on relative distance, relative angle and heading between robots, thereby converting the formation problem of the actual robots into several problems of tracking the trajectories of virtual mobile robots, and designing the corresponding nonlinear formation tracking prior controller of speed and angular velocity as the knowledge prior of the whole reinforcement learning framework;

S4, designing the collision detection module of the whole formation algorithm, which detects the safe distance between local obstacles and the robot;

S5, designing the action space part of the whole formation algorithm framework, divided into two parts: one a velocity space containing the speed and angular velocity of the mobile robot, the other a parameter space containing the performance parameters of the prior nonlinear tracking control knowledge;

S6, designing the state space part of the whole formation algorithm framework, mainly comprising the position and attitude of each robot and the obstacle information in the environment;

S7, designing the reward functions that guide the robots to learn formation and flexible obstacle avoidance, mainly comprising a formation reward function, a tracking reward function and an obstacle avoidance reward function;

S8, building a simulation environment for training, letting the agents interact with the environment by trial and error under the guidance of the nonlinear formation control knowledge, so that through learning the multiple mobile robots acquire a flexible, stable distance-angle-heading formation strategy and a flexible obstacle avoidance strategy.

Further, in step S1, the robots are homogeneous in type, and the number of robots N is greater than or equal to 2.

Further, in step S2, taking a two-wheel differential drive mobile robot as an example, the dynamics model is described as follows:

η̇ = [ẋ; ẏ; θ̇] = [v·cosθ; v·sinθ; ω],  v = (v_r + v_l)/2,  ω = (v_r − v_l)/l   (1)

where η = [x, y, θ]ᵀ is the pose vector of each mobile robot, v its speed, ω its angular velocity, v_r and v_l the speeds of the right and left wheels, and l the wheel separation. It should be noted that the two-wheel drive mobile robot is subject to a nonholonomic constraint: it can move forwards and backwards but not sideways, in the form:

ẋ·sinθ − ẏ·cosθ = 0   (2)

Further, taking the distance-angle-heading formation of multiple mobile robots as an example, the specific design steps of the prior formation control knowledge of the mobile robots in S3 are as follows:

S31, designing the tracking controller between the mobile robot and the virtual desired mobile robot. The desired trajectory of the virtual mobile robot is defined as η_r = [x_r, y_r, θ_r]ᵀ, and the pose and velocity tracking errors are:

e_x = x_r − x,  e_y = y_r − y,  e_θ = θ_r − θ;  ė_x = ẋ_r − ẋ,  ė_y = ẏ_r − ẏ,  ė_θ = ω_r − ω   (3)

S32, designing the desired formation model over the distance, angle and heading between adjacent mobile robots, specifically described as:

d_v2v1 = √((x_v1 − x_v2)² + (y_v1 − y_v2)²),  φ_v2v1 = atan2(y_v1 − y_v2, x_v1 − x_v2) − θ_v2,  β_v2v1 = θ_v1 − θ_v2   (4)

where v1 and v2 denote the virtual robot objects to be tracked by the adjacent mobile robots, recorded as virtual robot 1 and virtual robot 2 respectively; d_v2v1, φ_v2v1 and β_v2v1 are the distance, angle and heading between v1 and v2, i.e. the state quantities under the distance-angle-heading formation framework.

S33, combining equations (1) to (4) with feedback-linearization nonlinear control theory, the formation control prior of adjacent mobile robots takes the form

v, ω = f(e, [K_x, K_y, K_θ])   (5)  (the closed-form expression is rendered only as an image in the original publication)

where v and ω are the speed and angular velocity with which the mobile robot meets the preset formation requirement, and [K_x, K_y, K_θ] are the prior performance hyper-parameters of the mobile robot's nonlinear formation control, whose values directly determine the quality of formation tracking.

Further, in S4, the collision detection function module judges the distance to obstacles from the mobile robot's own sensors and outputs a Boolean collision warning flag.

Further, in S5, the action space consists of two parts: the action space required for formation tracking, and the action space required for flexible obstacle avoidance when a local obstacle is detected. The specific design is described as follows:

S51, designing the formation tracking action space of two adjacent mobile robots, based on the nonlinear formation knowledge prior involved in S33:

a_form = [K_x, K_y, K_θ]   (6)

where [K_x, K_y, K_θ] are the prior performance hyper-parameters of the mobile robot's nonlinear formation control;

S52, designing the action space each mobile robot needs for independent flexible obstacle avoidance:

a_avoid = [v_discrete, ω_discrete]   (7)

where v_discrete and ω_discrete are the discretized speed and angular velocity commands of the mobile robot.

Further, in S6, the state space mainly comprises three parts: a state space describing each mobile robot's tracking error with respect to its corresponding virtual robot, a state space describing the distance-angle-heading formation between adjacent mobile robots, and a state space required to describe the surrounding environment information. The specific design is described as follows:

S61, taking two adjacent mobile robots as an example, the state space describing each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step is designed as:

s_track = [e_x^v1, e_y^v1, e_θ^v1, e_x^v2, e_y^v2, e_θ^v2]   (8)

S62, taking two adjacent mobile robots as an example, the state space between them satisfying the distance-angle-heading formation framework is designed as:

s_form = [d, φ, β, ‖u_1‖₂, ‖u_2‖₂, ‖u̇_1‖₂, ‖u̇_2‖₂]   (9)

where d, φ and β are the formation state quantities of distance, angle and heading between adjacent mobile robots at each time step; ‖u_1‖₂, ‖u_2‖₂, ‖u̇_1‖₂ and ‖u̇_2‖₂ are the relative values of velocity and acceleration of robot 1 and robot 2 with respect to their virtual robots, included so that the mobile robots operate with continuous, smooth velocity and acceleration;

S63, the state space each mobile robot needs to describe its surrounding environment information is designed as:

s_env = [η_t, d_r, d_ob, |Δu|]   (10)

where η_t is the mobile robot's pose vector at the current moment, d_r the distance between the mobile robot and its desired virtual mobile robot position at the current moment, d_ob the distance vector to obstacles within the safety threshold at the current moment, and |Δu| the difference between the mobile robot's speed and angular velocity at adjacent moments.

Further, the reward function design in S7 can be subdivided into two sub-reward-function designs, one for the formation tracking subtask and one for the flexible obstacle avoidance and formation recovery subtask, namely:

S71, designing the reward function of the formation tracking subtask. The formation reward function between two adjacent mobile robots is described in the following form:

R = R_error + R_formation + R_velocity,  with R_formation = +r if the dynamic variation of the formation stays within the threshold, and −r otherwise   (11)

where R_error is the sum of the penalty terms on the two mobile robots' tracking errors with respect to the desired virtual mobile robots, urging the robots to reduce the tracking error to the desired positions as far as possible; R_formation guides the robots to keep the consistency of the formation: if the dynamic variation of the formation is within the threshold a positive reward is fed back, otherwise a negative penalty; R_velocity guides the mobile robots to keep their velocity and acceleration consistent and maintain a continuous, smooth motion pattern;

S72, designing the flexible obstacle avoidance reward function of mobile robot i, in the specific form:

R_i = R_error + R_avoid + R_delta_yaw,  with R_avoid = r_1 within the safety threshold and r_2 on collision   (12)

where R_error is the penalty term on mobile robot i's tracking error with respect to its desired virtual mobile robot, guiding the robot's formation recovery; R_avoid guides the mobile robot in autonomous obstacle avoidance; and R_delta_yaw constrains the change of mobile robot i's heading angle, to save energy.

Optionally, the reward function designed in S72 applies only during the obstacle avoidance task stage, where it urges the mobile robot to evade the local obstacle quickly; when S4 determines that the mobile robot is clear of the obstacle, the robot exits the obstacle avoidance stage, switches back to the formation tracking subtask, and resumes and maintains the formation under the guidance of the S71 reward function.

Further, in S8, during training the two subtasks of formation tracking and flexible obstacle avoidance are trained independently, specifically as follows:

S81, for the formation tracking task, a deterministic policy gradient algorithm framework over a continuous action space is adopted; the action space is chosen as a_form of (6), and the state space is built from s_track of (8) and s_form of (9). The algorithm broadly follows the actor-critic pattern, but unlike other reinforcement learning algorithms, its greatest advantage is that the output of the action network is a deterministic action rather than a policy distribution.

The action value network, for its part, outputs an evaluation of the current action; with the evaluated Q value output by the current action value network as the weight, the action network is then updated by the policy gradient. The action value network itself is updated against the offline target action network and target action value network, whose slowly changing parameters make the training process more stable.
The update of the action value network is specifically described as:

L(θ) = (1/N) · Σᵢ wᵢ · (rᵢ + γ·Q_θ′(s_{i+1}, μ′(s_{i+1})) − Q_θ(s_i, a_i))²   (13)

where wᵢ is the priority sampling weight computed by the prioritized experience replay algorithm; rᵢ is the current reward signal; γ is the discount factor; and Q_θ′(s_{i+1}, μ′(s_{i+1})) is the target action value, i.e. the evaluation of the target action μ′(s_{i+1}) at the next moment.
Preferably, the target networks are updated with a soft update strategy: after each mini-batch of training, the parameters of the online action network and the online action value network are propagated to their targets, in the specific form:

η′ ← τη + (1 − τ)η′   (14)

where η′ and η denote the target network parameters and the current network parameters respectively, and τ controls the update ratio. This soft update reduces the influence of abnormal parameters and avoids abnormal jumps during parameter updating.

S82, for the flexible obstacle avoidance task, a proximal policy optimization algorithm framework over a discrete action space is adopted; the action space is chosen as a_avoid of (7), and the state space as s_track of (8) together with s_env of (10).

Proximal policy optimization addresses the slow parameter updates and low data utilization of the traditional on-policy policy gradient algorithm: on top of generalized advantage estimation, it introduces a resampling (importance sampling) mechanism that converts the on-policy update into an off-policy one to improve data utilization, while constraining the update magnitude of the parameters through a KL divergence penalty or a clipping operation to obtain a more stable training process.
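For reference, the clipping operation mentioned here is, in the standard proximal policy optimization formulation (standard notation, not the patent's own):

L_CLIP(θ) = E_t[ min( ρ_t(θ)·Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε)·Â_t ) ],  ρ_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)

where Â_t is the generalized advantage estimate and ε the clipping range. A plain single-trajectory sketch of the generalized advantage estimation it builds on (no terminal masking, which is an assumption of this sketch):

    import numpy as np

    def gae(rewards, values, gamma=0.99, lam=0.95):
        # values must hold V(s_0) ... V(s_T): one entry more than rewards.
        adv = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
            running = delta + gamma * lam * running
            adv[t] = running
        return adv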
Further, S9, the construction of the whole inference-based cascaded formation control algorithm is completed from the formation strategies learned offline in S8.
In S9, the inference-based flexible formation algorithm framework of the mobile robots is constructed with the formation and flexible obstacle avoidance strategies trained in S8; the specific process is described as follows (a schematic of the loop is sketched after S96):

S91, determining the formation requirements and the task environment;

S92, the mobile robot formation loading the prior-based formation strategy and the flexible obstacle avoidance strategy pre-trained in S8;

S93, the mobile robot formation performing formation tracking with the formation tracking strategy according to the information from interaction with the environment, each individual robot performing local collision detection;

S94, if flexible obstacle avoidance is needed, the individual robots concerned switching to the flexible obstacle avoidance strategy and avoiding obstacles in real time according to the information from interaction with the environment;

S95, if the local collision detection function module returns a safe state, the individual mobile robot exiting the flexible obstacle avoidance strategy and rapidly restoring the formation state;

S96, repeating S93 to S95 until the target point is reached.
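Purely illustrative sketch of the inference loop S93 to S96; every object and method name below is hypothetical, standing in for the robot middleware:

    def run_formation(robots, formation_policy, avoidance_policy, env, eps_safe):
        env.reset()
        while not env.at_goal():                      # S96: loop until target reached
            for robot in robots:
                obs = env.observe(robot)
                if env.min_obstacle_distance(robot) < eps_safe:
                    action = avoidance_policy.act(obs)    # S94: individual avoidance
                else:
                    action = formation_policy.act(obs)    # S93/S95: formation tracking
                env.apply(robot, action)
            env.step()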
The invention provides a cascaded multi-mobile-robot flexible formation method based on reinforcement learning and prior nonlinear distance-angle-heading formation control, which, through reinforcement learning, shifts the computational cost of online solving offline and so realizes inference-based flexible formation of the multiple mobile robots.

In the training stage, the formation tracking strategy and the flexible obstacle avoidance strategy are trained independently, which reduces the training difficulty; at the same time, the nonlinear distance-angle-heading formation control prior is introduced, which improves the training speed and avoids a tedious parameter tuning process. In the inference stage, the independently offline-trained strategies are combined to meet the task requirements of autonomous stable formation and flexible obstacle avoidance. Compared with existing leader-follower formation tracking control algorithms, the method endows the robot formation with autonomous tracking capability and also endows each mobile robot with autonomous obstacle avoidance capability against local static and dynamic obstacles, and is characterized by autonomy, stability, efficiency and flexibility.
The overall framework of the embodiment is shown in FIG. 1, where 1 is the offline independent training framework, 2 the inference flexible formation framework, 21 the flexible formation and obstacle avoidance strategies, and 3 the simulation interactive environment.

First, in training stage 1, the formation strategy and the flexible obstacle avoidance strategy are trained separately. The training of the formation strategy rests on the prior formation experience, which accelerates the training process and its convergence, prevents blind exploration by the multiple mobile robots, and improves the stability of the formation; after training, the two sets of strategy parameters are stored.

Then, in inference stage 2, the multiple mobile robots flexibly call the experience-based flexible formation and obstacle avoidance strategies 21 for autonomous formation tracking and flexible obstacle avoidance; with the online computation migrated offline, the process is more efficient and stable.
The framework of the training stage is shown in FIG. 2. The whole training strategy follows the idea of curriculum learning, i.e. the training environment goes from simple to complex and the strategy performance is improved step by step. In FIG. 2, 1 is the formation training environment based on the distance-angle-heading formation control prior; 2 is the flexible obstacle avoidance environment of each mobile robot; 3 is the continuous deterministic policy gradient algorithm agent; 4 is the discrete proximal policy optimization algorithm agent; and 5 is the formation strategy parameters and flexible obstacle avoidance strategy parameters stored offline after training. The specific process is described as follows:
the formation strategy training process is described as follows:
firstly, configuring various simple to complex training environments according to the thought of course learning, such as training the formation environment of two mobile robots and then gradually increasing the number of the formation robots;
next, for each preset simulation environment, initializing an action network, a target action network, an action value network, and target action value network parameters in the continuous deterministic policy gradient algorithm agent 3 in fig. 2; for each iteration cycle, a formation training environment 1 based on a distance-angle-heading formation control prior is initialized, followed by, for each time step:
step 1: selecting an action from the threshold range of the action space according to a strategy, and adding random Gaussian noise to improve the random exploration performance;
step 2: interacting with a formation training environment, specifically inputting a selected deterministic action into a cascade prior formation controller of the multiple mobile robots designed by combining a preset formation mode and geometric relations of distance, angle and course, specifically describing the controller as the formula (5), then inputting prior formation upper-layer control speed and angular speed instructions into each mobile robot, updating the current state of the environment while the robot completes the instructions, and feeding back a state value, a reward value and a Boolean flag bit indicating whether a task is finished or interrupted;
and step 3: storing the information fed back by the environment and the calculated priority into an experience pool as data for training;
and 4, step 4: when the capacity of the experience pool overflows, sampling according to the priority, and training and updating the action value network and the action network; the method comprises the following steps:
step 4-1: inputting the sampled state value into an action network to obtain an action;
step 4-2: inputting the action and the sampled state value into an action value network to obtain a Q value;
step 4-3: according to the Q value, the back propagation updates the action network according to the strategy gradient;
step 4-4: repeating the steps 4-2 and 4-3 to obtain a Q value calculated by the updated action through the action value network;
and 4-5: inputting the next state value in the sampled experience into a target action network to obtain a target action;
and 4-6: inputting the action and the state into a target state network to obtain a target Q value;
and 4-7: updating the action value network by combining the target Q value and the Q value calculated in the step 4-4 according to the formula (13) and the priority weight coefficient;
and 5: repeating the steps 1-4;
step 6: after a certain time step is met, respectively carrying out soft updating on the target action network and the target action value network according to the formula (14);
and 7: and storing each network parameter of the final formation strategy for calling in next training or reasoning.
The individual autonomous flexible obstacle avoidance strategy training process is as follows:

First, following the idea of curriculum learning, configuring a range of training environments from simple to complex, e.g. first training the obstacle avoidance strategy of the mobile robot in a static obstacle environment and then training it in a dynamic obstacle environment;

Next, for each preset simulation environment, initializing the policy network and the value network of the discrete proximal policy optimization agent 4 in FIG. 2. During each iteration cycle, initializing the corresponding environment; then, during each time step:

Step 1: in the policy network, inputting the environment state information to obtain the policy distribution, and sampling an action from the discrete action space according to that distribution;

Step 2: inputting the action into the environment, interacting with the flexible obstacle avoidance environment, updating the environment state, and feeding back the state value, the reward value and a Boolean flag indicating whether the task has finished or been interrupted;

Step 3: repeating step 1, sampling a certain amount of experience, and storing it;

Step 4: inputting the final state of step 3 into the value network to obtain its state value, then backtracking to calculate the discounted reward at each stored time step;

Step 5: inputting all stored experience into the value network, and calculating the advantage values using generalized advantage estimation;

Step 6: back-propagating to update the value network according to the calculated advantage values;

Step 7: inputting all state values in the stored experience into the policy network and the old policy network to obtain their respective policy distributions, using resampling (importance sampling) to convert the on-policy update into an off-policy one, and back-propagating to update the policy network;

Step 8: repeating steps 5 to 7, then updating the old policy network parameters with the policy network parameters;

Step 9: repeating steps 1 to 8, and storing each network parameter of the final flexible obstacle avoidance strategy for calling in the next training or in inference.
When actually deployed and applied, the method for flexible formation of cascaded multiple mobile robots based on reinforcement learning and prior nonlinear distance-angle-heading formation control provided by this embodiment proceeds according to the steps shown in FIG. 3:

Step 1: acquiring the desired trajectory of the upper-layer motion plan for formation tracking;

Step 2: determining the specific formation requirements of the formation task, obtaining the prior formation control information, and determining the task environment;

Step 3: loading the offline formation tracking strategy and flexible obstacle avoidance strategy pre-trained in the training stage;

Step 4: once the state of the mobile robot has been obtained according to the pre-trained formation strategy, the action network feeding back actions, and the mobile robot performing the formation tracking task according to those actions;

Step 5: performing local collision detection to ensure the safety of formation tracking; if an obstacle is within the safety threshold of some mobile robot, jumping to step 6, otherwise proceeding with step 7;

Step 6: calling the offline flexible obstacle avoidance strategy pre-trained in the training stage for that mobile robot; the mobile robot samples discrete actions from the distribution output by the policy network, avoids the local obstacle, returns quickly and with as small an error as possible to its position in the formation, and continues tracking the virtual mobile robot of the corresponding formation pattern;

Step 7: otherwise, returning to step 4 and continuing formation tracking.
The method provided by the invention is based on a policy gradient algorithm for continuous control combined with prior nonlinear distance-angle-heading formation control knowledge, which spares the mobile robots blind exploration, speeds up training convergence and avoids a tedious coefficient tuning process; in parallel, proximal policy optimization is introduced to independently train each single mobile robot's flexible obstacle avoidance against local static and dynamic obstacles. The method comprises a training stage and an inference stage: the complex online computation is shifted offline, the formation and flexible obstacle avoidance strategies are trained independently following the idea of curriculum learning, and the pre-trained strategies are called flexibly in the inference stage, so that the formation as a whole has greater autonomy and flexibility.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A flexible formation method for cascaded multiple mobile robots, characterized by comprising: determining a dynamics model based on the selected formation form, according to the distances, angles and headings among the robots; determining, from the dynamics model and the dynamics model constraints, the prior controller of the reinforcement learning architecture in the nonlinear flexible formation method; determining an action space based on hyper-parameters and the pose vectors of the mobile robots, the action space comprising the formation tracking action space of two adjacent mobile robots and the action space each mobile robot needs for independent flexible obstacle avoidance; determining a state space from the tracking errors of the mobile robot's pose and velocity, the state space comprising: the state space of each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step, the state space between adjacent mobile robots, and the state space each mobile robot needs to describe its surrounding environment; and setting a reward function for reinforcement learning, the reward function comprising a formation reward function and an obstacle avoidance reward function;
and, based on the prior controller, performing reinforcement learning training through interaction with the environment according to the action space, the state space and the reward function; on completion of training, the flexible formation method for cascaded multiple mobile robots is obtained, comprising a formation strategy and a flexible obstacle avoidance strategy.
2. The method of claim 1, wherein the dynamics model is described as follows:

η̇ = [ẋ; ẏ; θ̇] = [v·cosθ; v·sinθ; ω],  v = (v_r + v_l)/2,  ω = (v_r − v_l)/l   (1)

where η = [x, y, θ]ᵀ is the pose vector of each mobile robot, (x, y) the position and θ the heading angle of each mobile robot; v is the speed of the mobile robot, ω the current angular velocity of the mobile robot, v_r and v_l the speeds of the right and left wheels of the mobile robot, and l the wheel separation;

and wherein the dynamics model constraint takes the form:

ẋ·sinθ − ẏ·cosθ = 0   (2)
3. the method of claim 2, wherein the method for determining the prior controller of the reinforcement learning architecture in the flexible formation method of the nonlinear mobile robot specifically comprises: s31, determining the expected track of the virtual expected mobile robot as etar=[xr,yrr]T,(xr,yr) To virtually expect the position of the mobile robot, θrFor virtually expecting the angle of the mobile robot, the tracking error of the attitude and the tracking error of the speed determined by the mobile robot according to the virtual expected track are expressed as follows:
Figure FDA0003112352330000023
exposition tracking error for the x direction; e.g. of the typeyPosition tracking error for the y direction; e.g. of the typeθIs the tracking error of the azimuth;
Figure FDA0003112352330000024
respectively, a velocity tracking error in the x direction and a velocity tracking error in the y direction;
Figure FDA0003112352330000025
is the angular velocity tracking error;
Figure FDA0003112352330000026
is the desired angular velocity of the virtual robot;
s32, determining an expected formation model among the distance, the angle and the heading among adjacent mobile robots, wherein the expected formation model is specifically described as follows:
Figure FDA0003112352330000031
wherein v is1,v2Respectively representing virtual robot objects to be tracked by adjacent mobile robots, namely a virtual robot 1 and a virtual robot 2, (x)v1,yv1) Is the position of the virtual robot 1, (x)v2,yv2) Is the position of the virtual robot 2, thetav1Angle of the virtual robot 1, thetav2Is the angle of the virtual robot 2; dv2v1Relative distance of adjacent mobile robots v1, v 2; phi is av2v1Relative angles of adjacent mobile robots v1, v 2; beta is av2v1An angle correction amount for the mobile robot maintaining the same azimuth angle;
S33, combining equations (1) to (4) with the feedback-linearization nonlinear control theory, the formation control prior for adjacent mobile robots is described as:

$$\left[v,\ \omega\right]^{T}=\Pi\!\left(e_{x},e_{y},e_{\theta};\ \boldsymbol{k}^{v_{1}},\boldsymbol{k}^{v_{2}}\right) \qquad (5)$$

where $v$ and $\omega$ are the speed and angular velocity of the mobile robot meeting the preset formation requirement, $\boldsymbol{k}^{v_{1}}$ is the performance hyper-parameter of the nonlinear formation prior controller for virtual robot 1, and $\boldsymbol{k}^{v_{2}}$ the performance hyper-parameter of the nonlinear formation prior controller for virtual robot 2; the performance hyper-parameters directly determine the control performance of the prior controller. One possible concrete form of this law is sketched below.
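For illustration, a Python sketch of the tracking error of equation (3) together with one common feedback tracking law (a Kanayama-style rule, used here only as a stand-in for the prior controller of equation (5), which the patent does not reproduce); the gains k_x, k_y, k_theta play the role of the performance hyper-parameters:

```python
import numpy as np

def tracking_error(eta, eta_r):
    """Pose tracking error in the robot frame, as in equation (3)."""
    x, y, theta = eta
    x_r, y_r, theta_r = eta_r
    c, s = np.cos(theta), np.sin(theta)
    e_x = c * (x_r - x) + s * (y_r - y)
    e_y = -s * (x_r - x) + c * (y_r - y)
    e_theta = np.arctan2(np.sin(theta_r - theta), np.cos(theta_r - theta))  # wrapped
    return np.array([e_x, e_y, e_theta])

def prior_controller(e, v_r, omega_r, k_x, k_y, k_theta):
    """Kanayama-style tracking law standing in for equation (5); the gains are
    the tunable performance hyper-parameters exposed to the RL agent."""
    e_x, e_y, e_theta = e
    v = v_r * np.cos(e_theta) + k_x * e_x
    omega = omega_r + v_r * (k_y * e_y + k_theta * np.sin(e_theta))
    return v, omega
```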
4. The flexible formation method of cascaded multiple mobile robots according to claim 1, wherein the formation-tracking action space of two adjacent mobile robots is expressed as follows:

$$a_{1}^{space}=\left[\boldsymbol{k}^{v_{1}},\ \boldsymbol{k}^{v_{2}}\right] \qquad (6)$$

where $\boldsymbol{k}^{v_{1}}$ is the performance hyper-parameter of the nonlinear formation prior controller with which the mobile robot tracks virtual robot 1, and $\boldsymbol{k}^{v_{2}}$ is the performance hyper-parameter of the nonlinear formation prior controller with which the adjacent mobile robot tracks virtual robot 2;

the action space required by each mobile robot to avoid obstacles independently and flexibly is expressed as follows:

$$a_{2}^{space}=\left[v_{discrete},\ \omega_{discrete}\right] \qquad (7)$$

where $v_{discrete}$ and $\omega_{discrete}$ respectively denote the discretized speed command and angular-speed command of the mobile robot. A sampling sketch for both spaces follows.
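A minimal sketch of sampling from the two action spaces of equations (6) and (7); the numeric bounds and grids are assumed values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Formation tracking: continuous hyper-parameters of the two prior controllers.
K_LOW, K_HIGH = 0.1, 5.0                      # assumed gain bounds
def sample_a1():
    """One action from a1_space: gain vectors for virtual robots 1 and 2."""
    return rng.uniform(K_LOW, K_HIGH, size=(2, 3))

# Flexible obstacle avoidance: discretized (v, omega) commands.
V_DISCRETE = np.linspace(0.0, 0.6, 7)         # assumed speed grid (m/s)
W_DISCRETE = np.linspace(-0.9, 0.9, 7)        # assumed angular-speed grid (rad/s)
def sample_a2():
    """One action from a2_space: a discrete speed and angular-speed pair."""
    return rng.choice(V_DISCRETE), rng.choice(W_DISCRETE)
```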
5. The flexible formation method of cascaded multiple mobile robots according to claim 1, wherein, at the current time step, the state space of each mobile robot's tracking error with respect to its corresponding virtual mobile robot is expressed as follows:

$$s_{1}^{space}=\left[e_{1},\ e_{2}\right]=\left[e_{x}^{v_{1}},e_{y}^{v_{1}},e_{\theta}^{v_{1}},\ e_{x}^{v_{2}},e_{y}^{v_{2}},e_{\theta}^{v_{2}}\right] \qquad (8)$$

the state space between adjacent mobile robots is expressed as follows:

$$s_{2}^{space}=\left[d_{v2v1}^{t},\ \phi_{v2v1}^{t},\ \theta_{v1}^{t},\ \theta_{v2}^{t},\ \|u_{1}\|_{2},\ \|\dot{u}_{1}\|_{2},\ \|u_{2}\|_{2},\ \|\dot{u}_{2}\|_{2}\right] \qquad (9)$$

where $e_{x}^{v_{1}}$ is the position tracking error in the x direction of the mobile robot tracking virtual robot 1; $e_{y}^{v_{1}}$ the position tracking error in the y direction of the mobile robot tracking virtual robot 1; $e_{\theta}^{v_{1}}$ the azimuth tracking error of the mobile robot tracking virtual robot 1; $e_{x}^{v_{2}}$ the position tracking error in the x direction of the adjacent mobile robot tracking virtual robot 2; $e_{y}^{v_{2}}$ the position tracking error in the y direction of the adjacent mobile robot tracking virtual robot 2; $e_{\theta}^{v_{2}}$ the azimuth tracking error of the adjacent mobile robot tracking virtual robot 2; $e_{1}$ is the tracking error of the mobile robot with respect to virtual robot 1 and $e_{2}$ the tracking error of the adjacent mobile robot with respect to virtual robot 2; $d_{v2v1}^{t}$ is the formation state quantity of the distance between adjacent mobile robots at each time step $t$; $\phi_{v2v1}^{t}$ the formation state quantity of the angle between adjacent mobile robots at each time step $t$; $\theta_{v1}^{t}$ and $\theta_{v2}^{t}$ the formation state quantities of the heading between adjacent mobile robots at each time step $t$; $\|u_{1}\|_{2}$ is the speed value of mobile robot 1 relative to the virtual robot, comprising speed and angular velocity; $\|\dot{u}_{1}\|_{2}$ the acceleration value of robot 1 relative to the virtual robot, comprising acceleration and angular acceleration; $\|u_{2}\|_{2}$ the speed value between mobile robot 2 and the virtual robot, comprising speed and angular velocity; and $\|\dot{u}_{2}\|_{2}$ the acceleration value of mobile robot 2 relative to the virtual robot, comprising acceleration and angular acceleration;

the state space required by each mobile robot to describe its surrounding environment information is expressed as follows:

$$s_{3}^{space}=\left[\eta_{t},\ d_{r},\ d_{ob},\ |\Delta\theta|\right] \qquad (10)$$

where $\eta_{t}$ is the pose vector of the mobile robot at the current time; $d_{r}$ the distance between the mobile robot at the current time and its desired virtual mobile robot position; $d_{ob}$ the distance vector from the mobile robot to the obstacles within the safety threshold at the current time; and $|\Delta\theta|$ the difference of the mobile robot's speed and angular velocity between adjacent time instants. The sketch after this claim shows one way to assemble the three groups.
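A sketch of how the three state groups of equations (8) to (10) could be assembled into flat vectors; the ordering is an assumption, since the claim fixes only the quantities:

```python
import numpy as np

def s1(e1, e2):
    """Equation (8): tracking errors of both robots vs. their virtual robots."""
    return np.concatenate([e1, e2])

def s2(d, phi, theta_v1, theta_v2, u1, du1, u2, du2):
    """Equation (9): inter-robot formation quantities plus speed/acceleration norms."""
    return np.array([d, phi, theta_v1, theta_v2,
                     np.linalg.norm(u1), np.linalg.norm(du1),
                     np.linalg.norm(u2), np.linalg.norm(du2)])

def s3(eta_t, d_r, d_ob, delta):
    """Equation (10): pose, distance to the desired virtual pose, obstacle
    distances within the safety threshold, and the adjacent-step difference."""
    return np.concatenate([eta_t, [d_r], np.atleast_1d(d_ob), [abs(delta)]])
```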
6. The flexible formation method of cascaded multiple mobile robots according to claim 5, wherein the formation reward function between two adjacent mobile robots takes the specific form:

$$R_{1}=R_{error\_1}+R_{formation}+R_{velocity} \qquad (11)$$

where $\varepsilon_{thresh}$ is the threshold set in the reward function; $R_{error\_1}$ is the sum of the penalty terms on the two mobile robots' tracking errors with respect to their desired virtual mobile robots, and encourages the robots to reduce their tracking errors toward the desired positions as far as possible; $r$ is a reward or penalty value; $R_{formation}$ is a reward-or-penalty function that guides the robots to keep the formation continuous: if the dynamic variation of the formation stays within the set threshold, a positive reward value is fed back, otherwise a negative penalty value is returned; $R_{velocity}$ guides the mobile robots to keep their speeds and accelerations consistent and to maintain a continuous, smooth motion pattern. An illustrative sketch follows.
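A hedged sketch of the three-term formation reward of equation (11); the threshold and magnitudes are assumed design values, not figures from the patent:

```python
import numpy as np

def formation_reward(e1, e2, d, d_desired, u1, u2, eps_thresh=0.1, r=1.0):
    """R_error_1 + R_formation + R_velocity, with assumed weights."""
    R_error_1 = -(np.linalg.norm(e1) + np.linalg.norm(e2))       # shrink tracking errors
    R_formation = r if abs(d - d_desired) < eps_thresh else -r   # keep formation in band
    R_velocity = -np.linalg.norm(np.asarray(u1) - np.asarray(u2))  # speed consistency
    return R_error_1 + R_formation + R_velocity
```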
7. The flexible formation method of cascaded multiple mobile robots according to claim 5, wherein the obstacle-avoidance reward function takes the specific form:

$$R_{2}=R_{error\_2}+R_{avoid}+R_{delta\_yaw} \qquad (12)$$

where $R_{error\_2}$ is the penalty term on mobile robot $i$'s tracking error with respect to its desired virtual mobile robot, used to guide the robot's recovery of the formation; $R_{avoid}$ guides the mobile robot to avoid obstacles autonomously, with $\varepsilon_{safe}$ the safety threshold, $r_{1}$ the penalty value when the distance between the robot and the nearest obstacle is within the safety threshold but no collision has yet occurred, and $r_{2}$ the penalty value when the robot collides with an obstacle; $R_{delta\_yaw}$ is the penalty on the change of mobile robot $i$'s direction angle between adjacent time steps, used to control the heading change so that the overall motion trajectory is smoother. An illustrative sketch follows.
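A sketch of the obstacle-avoidance reward of equation (12); the penalty magnitudes r1, r2 and the yaw weight k_yaw are assumed values:

```python
import numpy as np

def avoid_reward(e_i, d_min, collided, d_yaw, eps_safe=0.5,
                 r1=-0.5, r2=-10.0, k_yaw=0.1):
    """R_error_2 + R_avoid + R_delta_yaw, with assumed penalty magnitudes."""
    R_error_2 = -float(np.linalg.norm(e_i))   # pull the robot back toward formation
    if collided:
        R_avoid = r2                          # collision penalty
    elif d_min < eps_safe:
        R_avoid = r1                          # inside the safety threshold, no collision yet
    else:
        R_avoid = 0.0
    R_delta_yaw = -k_yaw * abs(d_yaw)         # smooth heading changes between steps
    return R_error_2 + R_avoid + R_delta_yaw
```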
8. The flexible formation method of cascaded multiple mobile robots according to claim 1, wherein during training the two subtasks, formation tracking and flexible obstacle avoidance, are trained independently, specifically as follows:

for the formation-tracking task, the action space is chosen as the formation-tracking action space $a_{1}^{space}$ of two adjacent mobile robots, and the state space as the state space $s_{1}^{space}$ of each mobile robot's tracking error at the current time step together with the state space $s_{2}^{space}$ between adjacent mobile robots; the action-value network outputs an evaluation of the current action, and the action network is updated by the policy gradient weighted with the Q value output by the current action-value network;
the action-value network is updated specifically as follows:

$$L=\frac{1}{N}\sum_{i=1}^{N}w_{i}\left(r_{i}+\gamma Q_{\theta'}\!\left(s_{i+1},\mu'(s_{i+1})\right)-Q_{\theta}\!\left(s_{i},a_{i}\right)\right)^{2} \qquad (13)$$

where $w_{i}$ is the priority sampling weight at the current time $i$ computed by the prioritized experience replay algorithm; $r_{i}$ is the reward signal at the current time $i$; $\gamma$ is the discount factor; $Q_{\theta'}(s_{i+1},\mu'(s_{i+1}))$ is the target action value, i.e. the evaluation of the target action $\mu'(s_{i+1})$ at the next time $i+1$; $s_{i}$ is the robot's state at the current time $i$, $s_{i+1}$ its state at the next time $i+1$, and $a_{i}$ its action at the current time $i$; $N$ is the number of samples in a mini-batch; and $Q_{\theta}(s_{i},a_{i})$ is the current action-value network's evaluation of the robot's state and action command at time $i$. A code sketch of this update follows;
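A PyTorch-style sketch of the importance-weighted critic update of equation (13), assuming `critic`, `critic_target`, and `actor_target` are `torch.nn.Module` networks and the batch tensors come from a prioritized replay buffer (all names assumed):

```python
import torch

def critic_loss(critic, critic_target, actor_target, batch, gamma):
    """Equation (13): prioritized-replay-weighted TD error for the action-value net."""
    s, a, r, s_next, w = batch                 # states, actions, rewards, next states, weights
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))  # TD target
    td_error = y - critic(s, a)
    loss = (w * td_error.pow(2)).mean()        # w_i from prioritized experience replay
    return loss, td_error.detach().abs()       # |TD error| refreshes the priorities
```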
for the flexible obstacle-avoidance task, a proximal policy optimization algorithm framework over a discrete action space is adopted; the action space is chosen as the action space $a_{2}^{space}$ required by each mobile robot to avoid obstacles independently and flexibly, and the state space as the state space $s_{1}^{space}$ of each mobile robot's tracking error with respect to its corresponding virtual mobile robot at the current time step, together with the state space $s_{3}^{space}$ required by each mobile robot to describe its surrounding environment information. A sketch of the clipped objective follows.
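For the discrete obstacle-avoidance policy, a sketch of the clipped surrogate objective that proximal policy optimization minimizes; the clip ratio 0.2 is the usual default, assumed here:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate for the discrete (v, omega) policy."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()      # ascend the clipped surrogate
```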
9. The flexible formation method of cascaded multiple mobile robots according to claim 8, wherein the target action network and the target action-value network are updated as follows: after each mini-batch of training is finished, they are moved toward the parameters of the updated online action network and online action-value network, in the specific form:

η′ ← τη + (1-τ)η′ (14)

where η′ and η respectively denote the target network parameters and the current network parameters, and τ controls the proportion of the update. Equation (14) is shown in code below.
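Equation (14) as a standard Polyak soft update, in a PyTorch-style sketch assuming both networks are `torch.nn.Module`s:

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau):
    """eta' <- tau * eta + (1 - tau) * eta' for every parameter pair."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```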
10. The flexible formation method of cascaded multiple mobile robots according to claim 1, further comprising a local collision-detection step for detecting the safe distance between the robot and local obstacles: if the returned distance satisfies the safe-state requirement, the mobile robot exits the flexible obstacle-avoidance strategy and resumes the formation strategy, as in the sketch below.
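The switching rule of claim 10 as a two-line sketch; `eps_safe` is the safety threshold of claim 7, and the strategy labels are assumed names:

```python
def select_strategy(d_min_to_obstacle, eps_safe):
    """Flexible obstacle avoidance while an obstacle breaches the safety
    threshold; exit it and resume the formation strategy once the scan is safe."""
    return "flexible_avoid" if d_min_to_obstacle < eps_safe else "formation"
```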
CN202110655081.9A 2021-06-11 2021-06-11 Flexible formation method for cascading multiple mobile robots Active CN113485323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110655081.9A CN113485323B (en) 2021-06-11 2021-06-11 Flexible formation method for cascading multiple mobile robots


Publications (2)

Publication Number Publication Date
CN113485323A true CN113485323A (en) 2021-10-08
CN113485323B CN113485323B (en) 2024-04-12

Family

ID=77935320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110655081.9A Active CN113485323B (en) 2021-06-11 2021-06-11 Flexible formation method for cascading multiple mobile robots

Country Status (1)

Country Link
CN (1) CN113485323B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013119942A1 (en) * 2012-02-08 2013-08-15 Adept Technology, Inc. Job management sytem for a fleet of autonomous mobile robots
CN110007688A (en) * 2019-04-25 2019-07-12 西安电子科技大学 A kind of cluster distributed formation method of unmanned plane based on intensified learning
CN110147101A (en) * 2019-05-13 2019-08-20 中山大学 A kind of end-to-end distributed robots formation air navigation aid based on deeply study
WO2020253316A1 (en) * 2019-06-18 2020-12-24 中国科学院上海微系统与信息技术研究所 Navigation and following system for mobile robot, and navigation and following control method
CN111857184A (en) * 2020-07-31 2020-10-30 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle cluster control collision avoidance method and device based on deep reinforcement learning
CN111880567A (en) * 2020-07-31 2020-11-03 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning
CN112711261A (en) * 2020-12-30 2021-04-27 浙江大学 Multi-agent formation planning method based on local visual field

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WU Jianfa; WANG Honglun; LIU Yiheng; YAO Peng: "A survey of UAV obstacle-avoidance route planning methods", Unmanned Systems Technology, no. 01 *
ZHANG Guoliang: "A survey of mobile robot path planning in dynamic environments", Machine Tool & Hydraulics, no. 01 *
LI Qiang; LIU Guodong: "Formation control of multiple mobile robots", Computer Systems & Applications, no. 04 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114020013A (en) * 2021-10-26 2022-02-08 北航(四川)西部国际创新港科技有限公司 Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning
CN114020013B (en) * 2021-10-26 2024-03-15 北航(四川)西部国际创新港科技有限公司 Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning
CN115542901A (en) * 2022-09-21 2022-12-30 北京航空航天大学 Deformable robot obstacle avoidance method based on near-end strategy training

Also Published As

Publication number Publication date
CN113485323B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
Juang et al. Wall-following control of a hexapod robot using a data-driven fuzzy controller learned through differential evolution
Patle et al. Application of probability to enhance the performance of fuzzy based mobile robot navigation
Precup et al. Grey wolf optimizer-based approaches to path planning and fuzzy logic-based tracking control for mobile robots
Kamel et al. Real-time fault-tolerant formation control of multiple WMRs based on hybrid GA–PSO algorithm
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
Wang et al. A survey of underwater search for multi-target using Multi-AUV: Task allocation, path planning, and formation control
CN113485323A (en) Flexible formation method for cascaded multiple mobile robots
Rubí et al. A deep reinforcement learning approach for path following on a quadrotor
Al Dabooni et al. Heuristic dynamic programming for mobile robot path planning based on Dyna approach
Al-Sagban et al. Neural-based navigation of a differential-drive mobile robot
Lei et al. A fuzzy behaviours fusion algorithm for mobile robot real-time path planning in unknown environment
Sun et al. A Fuzzy-Based Bio-Inspired Neural Network Approach for Target Search by Multiple Autonomous Underwater Vehicles in Underwater Environments.
Atiyah et al. An overview: On path planning optimization criteria and mobile robot navigation
Velagic et al. Efficient path planning algorithm for mobile robot navigation with a local minima problem solving
Lakhal et al. Safe and adaptive autonomous navigation under uncertainty based on sequential waypoints and reachability analysis
Guo et al. Optimal navigation for AGVs: A soft actor–critic-based reinforcement learning approach with composite auxiliary rewards
Pshikhopov et al. Trajectory planning algorithms in two-dimensional environment with obstacles
Zhu et al. A fuzzy logic-based cascade control without actuator saturation for the unmanned underwater vehicle trajectory tracking
Mohanty et al. A new intelligent approach for mobile robot navigation
Boufera et al. Fuzzy inference system optimization by evolutionary approach for mobile robot navigation
Rubagotti et al. Shared control of robot manipulators with obstacle avoidance: A deep reinforcement learning approach
Ratnayake et al. A comparison of fuzzy logic controller and pid controller for differential drive wall-following mobile robot
CN113959446B (en) Autonomous logistics transportation navigation method for robot based on neural network
Zhang et al. AUV 3D docking control using deep reinforcement learning
Amin et al. Particle swarm fuzzy controller for behavior-based mobile robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant