AU2022204569B2 - Method for multi-agent dynamic path planning - Google Patents

Method for multi-agent dynamic path planning

Info

Publication number
AU2022204569B2
Authority
AU
Australia
Prior art keywords
agents
agent
running
positions
conflict
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2022204569A
Other versions
AU2022204569A1 (en
Inventor
Xuelian Cai
Xiao FU
Zihang TAN
Jing Zheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Publication of AU2022204569A1 publication Critical patent/AU2022204569A1/en
Application granted granted Critical
Publication of AU2022204569B2 publication Critical patent/AU2022204569B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047Optimisation of routes or paths, e.g. travelling salesman problem
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/0088Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0287Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
    • G05D1/0289Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling with means for avoiding collisions between vehicles

Abstract

The present disclosure provides a method for multi-agent dynamic path planning, including the following steps: (1) constructing a grid map containing a safety zone; (2) roughly planning, by using an A* algorithm, a running path for each agent in a Cartesian rectangular coordinate system of the grid map; (3) re-planning, by using a Q-learning-based two-agent collaborative obstacle avoidance algorithm, running paths for two agents after a conflict; (4) determining whether updated initial positions are target positions set by the agents, and if yes, executing the step (5), otherwise, executing the step (3); and (5) sequentially connecting path points from start points to end points of the agents to obtain optimal conflict-free running paths for the two agents. Compared with an existing algorithm for path planning, the present disclosure has the advantages of less running time, short total running distance, high running efficiency, and capability of avoiding further opposite conflicts.

Description

DRAWINGS
FIG. 1 (flowchart): construct a grid map containing a safety zone; roughly plan a path for each agent by using the A* algorithm; make each agent run according to its roughly planned path; if an opposite conflict occurs during running, plan a conflict-free path by using the Q-learning-based two-agent collaborative obstacle avoidance algorithm: create a Q-value table for storing the two agents, take the current positions where the safety zone is detected as the initial position combination, select the next moving direction according to the ε-greedy policy, make the agents update their positions and give a certain reward, update the Q-value table and select the maximum Q-value, and repeat until all agents have reached their target positions; then obtain the conflict-free optimal paths for the two agents.
FIG. 1
METHOD FOR MULTI-AGENT DYNAMIC PATH PLANNING
TECHNICAL FIELD
The present disclosure belongs to the technical field of path planning, and more particularly relates to a method for multi-agent dynamic path planning in the technical field of dynamic path planning. The present disclosure may efficiently avoid dynamic obstacles, so as to plan optimal paths for agents to reach their destinations during movement.
BACKGROUND
A multi-agent system consists of a series of interacting agents, each of which follows corresponding organizational rules to complete specific tasks. An agent refers to an entity with the basic characteristics of self-organization and sociality; it may be a software program or a physical entity (such as a person, a vehicle, or a robot) that is embedded in an environment, perceives the environment through sensors, acts autonomously on the environment through effectors, and meets design requirements. Multi-agent path planning is widely used in intelligent inspection, aircraft trajectory planning, autonomous driving, and other fields. Multi-agent path planning analyzes and processes prior information, such as maps and data sensed by sensors, in a space with obstacles to perceive the state of the environment around the agents. After the surrounding environment is perceived, an optimal conflict-free path from a start point to a target point that conforms to a certain evaluation index is found through a designed policy, so that each agent can bypass obstacles while running along the path and reach its destination smoothly. Multi-agent path planning pursues three main goals: planning a path from the start point to the target point; enabling the agent to successfully avoid obstacles encountered while running along the path; and, on the basis of satisfying the above two goals, making the planned path as close to globally optimal as possible. At present, methods to solve such problems usually include heuristic random search algorithms and intelligent optimization algorithms.
A heuristic random search algorithm is a method for path planning based on heuristic information: the search is guided in the most promising direction by heuristic information carried by the problem, so as to find an optimal path. For example:
The University of Electronic Science and Technology of China disclosed a heuristic
multidirectional rapidly exploring random tree-based method for robot path planning
optimization in its applied patent document "Method for Unmanned Aerial Vehicle Path
Planning Based on Rapidly exploring Random Tree (RRT) Algorithm" (application number:
CN202110409420.5, and application publication number: CN112987799A). In a high
dimensional configuration space, this method draws on an idea of the RRT algorithm. First,
when a random tree does not encounter obstacles during the exploring process, it will gradually
enhance its tendency to quickly approach a target point, and then once it encounters the
obstacles, it will immediately reduce this tendency to maximize the randomness of growth,
thereby making the random tree avoid the obstacles more efficiently, and eventually generating
a locally optimal path that successfully avoids the obstacles. This method may select a parent
node with smaller cost, instead of simply selecting a nearest node, thereby maximizing the
planning efficiency and speed. However, the following shortcomings of this method still exist:
the path planned by the improved RRT algorithm does not take into account dynamic obstacles,
namely, other agents, and a mobile agent may only wait to avoid the obstacles or go back the
same way, which increases the length of an invalid running path, and seriously affects the
running speed of the mobile agent.
Methods for multi-agent dynamic path planning based on intelligent optimization algorithms are mostly built on reinforcement learning methods, such as the Q-learning algorithm and the State-Action-Reward-State-Action (SARSA) algorithm. Through continuous interaction between agents and environments, environment modeling of the obstacles is unified so that collisions between paths and all obstacles can be detected quickly, so as to plan an optimal path for an agent. For example:
The Dalian University of Technology disclosed an ant colony algorithm-based method
for multi-agent path planning by reinforcement learning in its applied patent document
"Ant Colony Algorithm-based Method for Multi-Agent Path Planning by Reinforcement
Learning" (application number: CN202011257321.1, and application publication number:
CN112286203A). This method includes two stages. The first stage is to obtain current environment information of an agent cluster. Firstly, pheromones in an ant colony algorithm are used as heuristic information to detect obstacles in an environment, and their polar coordinate information is stored. Secondly, according to sensor information, target position information, pheromone information, and a sequence of serial numbers, the collected state information is abstracted into a tuple, which serves as the current state to describe a pheromone map in an initialized environment.
The second stage is to train a deep reinforcement learning model for multi-agent path planning. Firstly, a shared experience pool D is initialized, the size of the shared experience pool is set to be N, the ant colony pheromone map is initialized, the diffusion rate I and attenuation rate P of the pheromones and the number Q of agents in the agent cluster are set, and an improved deep Q-network (DQN) and a deep reinforcement learning method based on Q-learning and an ant colony "pheromone" coordination mechanism are adopted. Secondly, a neural network is trained and updated based on historical information of the agent cluster. Finally, an optimal path planning policy for each agent in the agent cluster is obtained.
Although this method may also be used for multi-agent path planning, the following shortcoming still exists: it only takes into account non-opposite conflicts and ignores opposite conflicts caused by mobile agents. As a result, during actual running, an agent needs to continuously avoid obstacles, an optimal avoidance path cannot be re-planned for the agent, and the running efficiency of the agent is seriously affected.
SUMMARY
In view of the above deficiencies of the prior art, an objective of the present disclosure is to provide a method for multi-agent dynamic path planning, so as to solve the problems that, during actual running, an agent needs to continuously avoid obstacles and the running efficiency of the agent is seriously affected due to opposite conflicts caused by other mobile agents during the dynamic path planning of agents.
To achieve the above objective, an idea of the present disclosure is as follows: when two agents have an opposite conflict during running, running paths for the agents with the conflict are planned by using a Q-learning-based two-agent collaborative obstacle avoidance algorithm, the two agents are subjected to global dynamic path planning with a combination of passable grid positions as an initial position, and reward values are given to the agents; and the agents are guided to select the moving directions that make the running paths shortest in a searchable range as the next moving directions, thereby solving the problem that optimal avoidance paths for the two agents cannot be re-planned. According to the present disclosure, by introducing a safety zone in a grid map, only one of the two agents running in opposite directions is enabled to enter the safety zone, and the other of the two agents running in the opposite directions is enabled to stay at the entrance of the safety zone, thereby getting rid of the constraints of fixed obstacles in existing methods for dynamic path planning, and solving the problem that an agent runs on repeated road segments during dynamic path planning.
The objective of the present disclosure is achieved by the following specific steps:
step 1: constructing the grid map containing the safety zone:
(1a) constructing, by using a grid process, the grid map with a width and a length of X
and Y;
(1b) in the grid map, setting, as the safety zone in the one-way lane, a zone where the agents
running in the opposite directions avoid other agents at the same time; and
(1c) establishing, by taking an upper left corner of the grid map as an origin, rightward
extension as a positive direction of a horizontal axis, and downward extension as a positive
direction of a vertical axis, a Cartesian rectangular coordinate system of the grid map, and
marking initial positions and target positions of the agents in the Cartesian rectangular
coordinate system;
step 2: roughly planning, by using an A* algorithm, a running path for each agent in the
Cartesian rectangular coordinate system of the grid map;
step 3: planning, by using the Q-learning-based two-agent collaborative obstacle
avoidance algorithm, the running paths for the agents with the conflict:
(3a) enabling each agent to run according to its roughly planned path, and if the two
agents in the one-way lane have the opposite conflict, planning conflict-free running paths for
the two agents simultaneously;
(3b) creating a Q-value table for storing information of the two agents, and setting an
initial Q-value to be -2×10⁻⁹, wherein horizontal coordinates of the Q-value table represent a combination of passable grid positions of the two agents, and vertical coordinates of the Q-value table represent a combination of moving directions of the two agents;
(3c) when the two agents detect the same safety zone simultaneously, enabling only one of the two agents running in the opposite directions to enter the safety zone and the other of the two agents running in the opposite directions to stay at an entrance of this safety zone, and taking current positions as the initial positions of the two agents respectively;
(3d) according to an ε-greedy policy, enabling each of the two agents to randomly select, by using an exploration rate ε, a moving direction from up, down, left, right, and stop respectively, wherein the exploration rate ε is set to be 0.5;
(3e) enabling the two agents to update the positions according to the selected directions, and taking the updated positions as the initial positions;
(3f) rewarding each of the two agents whose initial positions are the target positions with 1 point, punishing at least one of the two agents that encounters an obstacle or whose initial position belongs to the safety zone with 1 point, and rewarding at least one of the agents that reaches a passable position but not the target position with 1 point;
(3g) updating, by using an update formula, all Q-values in the Q-value table and selecting out updated maximum Q-values; and
(3h) determining whether the updated initial positions are the target positions set in the step (1c), and if yes, executing the step 4, otherwise, executing the step (3d); and
step 4: sequentially connecting path points from the initial positions to the target positions of the agents to obtain the conflict-free running paths for the agents.
Compared with the prior art, the present disclosure has the following advantages:
Firstly, according to the present disclosure, the running paths for the agents with the conflict are planned by using the Q-learning-based two-agent collaborative obstacle avoidance algorithm, the global dynamic path planning is performed with the combination of the passable grid positions as the initial position, and the reward values are given to the agents; and the agents are guided to select the moving directions that make the running paths shortest in the searchable range as the next moving directions, thereby solving the problems that the optimal avoidance paths for the two agents cannot be re-planned and the running efficiency of the mobile agents is seriously affected, and making the paths planned by the present disclosure improve the running efficiency of the agents, shorten the running time of the agents, and effectively avoid further conflicts.
Secondly, according to the present disclosure, by introducing the safety zone in the grid
map, only one of the two agents running in the opposite directions is enabled to enter the safety
zone, the other of the two agents running in the opposite directions is enabled to stay at the
entrance of the safety zone, and an opposite conflict avoidance algorithm is directly triggered,
thereby solving the problems that when there are dynamic obstacles during the dynamic path
planning in the prior art, the agent may only wait or go back the same way to avoid the
obstacles, and the running speed of the mobile agent is seriously affected. Therefore, the path
planned by the present disclosure increases the running speed of the agent, improves the
efficiency of the opposite conflict avoidance algorithm, and reduces the invalid running
distance of the agent.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a flowchart of the present disclosure;
FIG. 2 is a schematic diagram of a grid map containing a safety zone in the present
disclosure;
FIG. 3 is a schematic diagram of agents with a conflict in the present disclosure; and
FIG. 4 is a simulation diagram of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
The present disclosure is further described in detail below with reference to the
accompanying drawings and embodiments.
Referring to FIG. 1, the implementation steps of the present disclosure are further
described in detail.
Step 1: a grid map containing a safety zone is constructed.
By using a grid process, the grid map with a width and a length of X and Y is
constructed.
In the grid map, a zone where agents running in opposite directions avoid other agents at
the same time is set as the safety zone in a one-way lane.
By taking an upper left corner of the grid map as an origin, rightward extension as a
positive direction of a horizontal axis, and downward extension as a positive direction of a
vertical axis, a Cartesian rectangular coordinate system of the grid map is established, and
initial positions and target positions of the agents are marked in the Cartesian rectangular
coordinate system.
Referring to FIG. 2, the grid map containing the safety zone constructed according to the
present disclosure is further described in detail.
FIG. 2 is the grid map containing the safety zone constructed according to the present
disclosure. The upper left corner of the grid map is the origin O, the X axis represents the
width of the grid map, in meters, and the Y axis represents the length of the grid map, in
meters. White squares in FIG. 2 represent passable grids, and dark squares represent
impassable grids. A number 1 in FIG. 2 represents an agent 1, a number 2 represents an agent
2, and a zone represented by oblique lines is the safety zone.
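A grid map of this kind can be held in a small lookup structure. The sketch below is a minimal illustration written in Java (the language of the simulation platform described later), not the patent's implementation: cells are marked as passable, impassable, or belonging to the safety zone, following the coordinate convention of the step 1, and all class and method names are invented for illustration.

    // Minimal illustrative grid map: origin at the upper left corner,
    // x grows to the right and y grows downward (coordinate convention of step 1).
    public class GridMap {
        public static final int FREE = 0;      // passable grid (white square in FIG. 2)
        public static final int OBSTACLE = 1;  // impassable grid (dark square in FIG. 2)
        public static final int SAFETY = 2;    // safety zone beside the one-way lane

        private final int width;   // X, in grids
        private final int height;  // Y, in grids
        private final int[][] cell;

        public GridMap(int width, int height) {
            this.width = width;
            this.height = height;
            this.cell = new int[width][height]; // all cells start as FREE
        }

        public void set(int x, int y, int type) { cell[x][y] = type; }

        public boolean inBounds(int x, int y) {
            return x >= 0 && x < width && y >= 0 && y < height;
        }

        public boolean isPassable(int x, int y) {
            return inBounds(x, y) && cell[x][y] != OBSTACLE;
        }

        public boolean isSafetyZone(int x, int y) {
            return inBounds(x, y) && cell[x][y] == SAFETY;
        }

        public int width()  { return width; }
        public int height() { return height; }
    }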
Step 2: by using an A* algorithm, a running path for each agent is roughly planned in
the Cartesian rectangular coordinate system of the grid map.
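Step 2 can be realized with any standard A* search over the grid. The sketch below is one possible form of the rough planning, written against the illustrative GridMap above with a Manhattan-distance heuristic and unit step costs; it is an assumption-based example rather than the exact planner of the present disclosure.

    import java.util.*;

    // Illustrative A* rough planner on the grid map (4-connected moves, unit cost).
    public class AStarPlanner {
        private static final int[][] MOVES = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};

        /** Returns a list of {x, y} points from the start to the goal, or an empty list if unreachable. */
        public static List<int[]> plan(GridMap map, int sx, int sy, int gx, int gy) {
            int w = map.width(), h = map.height();
            double[] g = new double[w * h];              // cost from the start
            Arrays.fill(g, Double.POSITIVE_INFINITY);
            int[] parent = new int[w * h];
            Arrays.fill(parent, -1);
            boolean[] closed = new boolean[w * h];

            // f = g + Manhattan-distance heuristic to the goal.
            Comparator<int[]> byF = Comparator.comparingDouble(
                    n -> g[n[0] * h + n[1]] + Math.abs(n[0] - gx) + Math.abs(n[1] - gy));
            PriorityQueue<int[]> open = new PriorityQueue<>(byF);
            g[sx * h + sy] = 0;
            open.add(new int[]{sx, sy});

            while (!open.isEmpty()) {
                int[] cur = open.poll();
                int cx = cur[0], cy = cur[1], ci = cx * h + cy;
                if (closed[ci]) continue;
                closed[ci] = true;
                if (cx == gx && cy == gy) break;         // goal expanded: path found
                for (int[] m : MOVES) {
                    int nx = cx + m[0], ny = cy + m[1];
                    if (!map.isPassable(nx, ny)) continue;
                    int ni = nx * h + ny;
                    double cost = g[ci] + 1;
                    if (cost < g[ni]) {                  // better route to this neighbour
                        g[ni] = cost;
                        parent[ni] = ci;
                        open.add(new int[]{nx, ny});
                    }
                }
            }

            // Walk back through the parents to obtain the rough path.
            List<int[]> path = new ArrayList<>();
            int idx = gx * h + gy;
            if (parent[idx] == -1 && !(sx == gx && sy == gy)) return path;  // unreachable
            while (idx != -1) {
                path.add(0, new int[]{idx / h, idx % h});
                idx = parent[idx];
            }
            return path;
        }
    }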
Step 3: by using a Q-learning-based two-agent collaborative obstacle avoidance
algorithm, running paths for the agents with a conflict are planned.
Firstly, each agent runs according to its roughly planned path, and if the two agents in the
one-way lane have the opposite conflict, conflict-free running paths are planned for the two
agents simultaneously.
Secondly, a Q-value table for storing information of the two agents is created, and an
initial Q-value is set to be -2×10⁻⁹, wherein horizontal coordinates of the Q-value table
represent a combination of passable grid positions of the two agents, and vertical coordinates
of the Q-value table represent a combination of moving directions of the two agents.
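One concrete way to hold such a table is a dense array indexed by a joint-position state and a joint moving direction, with every entry initialized to -2×10⁻⁹. The sketch below is an illustrative assumption: it uses the position-pair encoding given in claim 2 and five moving directions per agent, and its names and layout are not taken from the patent.

    // Illustrative joint Q-value table for two agents on an X-by-Y grid map.
    public class JointQTable {
        public static final int ACTIONS = 5;            // up, down, left, right, stop

        private final int X;                            // width of the grid map
        private final int Y;                            // length of the grid map
        private final double[][] q;                     // [joint position state][joint moving direction]

        public JointQTable(int X, int Y) {
            this.X = X;
            this.Y = Y;
            int states = X * Y * X * Y;                 // every ordered pair of grid positions fits
            this.q = new double[states][ACTIONS * ACTIONS];
            for (double[] row : q) java.util.Arrays.fill(row, -2e-9);  // initial Q-value of step (3b)
        }

        /** Joint-position encoding of claim 2: state = XY^2*x1 + XY*y1 + Y*x2 + y2. */
        public int state(int x1, int y1, int x2, int y2) {
            return X * Y * Y * x1 + X * Y * y1 + Y * x2 + y2;
        }

        /** Joint moving direction built from agent 1's move a1 and agent 2's move a2. */
        public int action(int a1, int a2) { return a1 * ACTIONS + a2; }

        public double get(int s, int a)           { return q[s][a]; }
        public void   set(int s, int a, double v) { q[s][a] = v; }
    }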
Thirdly, when the two agents detect the same safety zone simultaneously, current
positions are taken as the initial positions of the two agents respectively.
Fourthly, according to an ε-greedy policy, each of the two agents randomly selects, by using an exploration rate ε, a moving direction from up, down, left, right, and stop respectively, wherein the exploration rate ε is set to be 0.5.
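The ε-greedy choice itself is compact. The sketch below is an illustration that assumes the JointQTable above and treats the two agents' selections as one joint move: with probability ε a random joint move is drawn, and otherwise the joint move with the largest stored Q-value is taken. For instance, select(table, table.state(x1, y1, x2, y2), 25, 0.5) would return a joint index that decodes into the two individual directions as a1 = index / 5 and a2 = index % 5.

    import java.util.Random;

    // Illustrative ε-greedy selection over the joint moving directions of the two agents.
    public class EpsilonGreedy {
        private static final Random RNG = new Random();

        /** Picks a joint move index in [0, numActions) for the given joint state s. */
        public static int select(JointQTable table, int s, int numActions, double epsilon) {
            if (RNG.nextDouble() < epsilon) {
                return RNG.nextInt(numActions);          // explore: random joint move
            }
            int best = 0;                                // exploit: greedy joint move
            for (int a = 1; a < numActions; a++) {
                if (table.get(s, a) > table.get(s, best)) best = a;
            }
            return best;
        }
    }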
Fifthly, the two agents update the positions according to the selected directions, and the updated positions are taken as the initial positions.
Sixthly, each of the two agents whose initial positions are the target positions is rewarded with 1 point, at least one of the two agents that encounters an obstacle or whose initial position belongs to the safety zone is punished with 1 point, and at least one of the agents that reaches a passable position but not the target position is rewarded with 1 point.
Seventhly, by using an update formula, all Q-values in the Q-value table are updated and
updated maximum Q-values are selected out; and
Eighthly, whether the updated initial positions are the target positions set in the step 1 is
determined, and if yes, the step 4 is executed; otherwise, the process returns to the fourth step.
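Taken together, the fourth to eighth steps form a single learning loop. The sketch below is a schematic rendering under the assumptions introduced above (JointQTable, EpsilonGreedy): the environment transition and the reward of the sixth step are hidden behind a hypothetical Environment interface invented for this illustration, the learning rate of 0.5 is taken from Table 1, and the discount factor follows claim 3, so the code shows the control flow rather than the exact implementation of the present disclosure.

    // Schematic episode loop for the two-agent collaborative obstacle avoidance (fourth to eighth steps).
    public class CollaborativeQLearning {
        static final double ALPHA = 0.5;    // learning rate (Table 1)
        static final double GAMMA = 1.0;    // discount factor (claim 3 allows 0 or 1)
        static final double EPSILON = 0.5;  // exploration rate (fourth step)

        public static void learnEpisode(JointQTable table, Environment env, int startState) {
            int s = startState;                                   // initial position combination (third step)
            int numActions = JointQTable.ACTIONS * JointQTable.ACTIONS;
            while (!env.bothAtTargets(s)) {                       // eighth step: stop once both targets are reached
                int a = EpsilonGreedy.select(table, s, numActions, EPSILON);  // fourth step
                Environment.Step step = env.apply(s, a);          // fifth step: both agents move
                double reward = step.reward;                      // sixth step: reward or punishment
                double maxNext = table.get(step.nextState, 0);    // seventh step: best Q-value at the next state
                for (int a2 = 1; a2 < numActions; a2++) {
                    maxNext = Math.max(maxNext, table.get(step.nextState, a2));
                }
                double newQ = (1 - ALPHA) * table.get(s, a) + ALPHA * (reward + GAMMA * maxNext);
                table.set(s, a, newQ);                            // update the Q-value table
                s = step.nextState;                               // updated positions become the new initial positions
            }
        }

        // Hypothetical environment wrapper, named only for this sketch.
        public interface Environment {
            boolean bothAtTargets(int jointState);
            Step apply(int jointState, int jointAction);
            class Step {
                public int nextState;   // joint position state after the move
                public double reward;   // reward of the sixth step
            }
        }
    }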
Referring to FIG. 3, opposite conflict types of conflict-free paths planned by the present
disclosure are further described in detail.
FIG. 3 is a schematic diagram of two types of opposite conflicts. FIG. 3 (a) is a schematic
diagram of a first type of the opposite conflict of the two agents, wherein a black dot represents
that the agent 1 and the agent 2 possibly collide in the middle of grids, and at this time, the
coordinates of the initial positions of the two agents after the opposite conflict in the grid map
are the same. FIG. 3 (b) is a schematic diagram of a second type of the opposite conflict of the two
agents, wherein a black dot represents that the agent 1 and the agent 2 collide at a junction of
the grids, and at this time, the coordinates of the initial positions of the two agents after the
opposite conflict in the grid map are different.
Step 4: path points from the initial positions to the target positions of the agents are
sequentially connected to obtain the conflict-free running paths for the agents.
The effects of the present disclosure are further described below with reference to a
simulation experiment.
1. Conditions for the simulation experiment:
A hardware platform for the simulation experiment in the present disclosure is as follows: a
central processing unit is an Intel (R) Core (TM) i7-9700 CPU, a main frequency is 3.00 GHz,
and a memory is 16.0 GB.
A software platform for the simulation experiment in the present disclosure is: a
Windows 10 operating system and Java 1.
The simulation experiment in the present disclosure mainly verifies the completeness of the opposite conflict avoidance and shortest path planning algorithms. In order to display the moving trajectories of the agents more clearly, the running paths for the two agents are planned in a small-scale grid map of 15 m × 29 m. The side length of each grid is 1 m, and the distribution of passable grids and impassable grids in the grid map is manually set.
2. Content and result analysis of the simulation experiment:
The simulation experiment in the present disclosure respectively simulates path planning
of the preset grid map by using the method provided by the present disclosure and an A*
algorithm-based method for path planning in a prior art.
Specific parameters for the simulation experiment are as shown in a Table 1:
Table 1: Parameter table for simulation experiment

Parameter name                                 Parameter value
Number of agents                               100
Task arrival rate λ                            1 per second
Running speed v_rob of agent                   1 m/s
Turning time consumption t_turn of agent       1 s
Learning rate α                                0.5
The prior art used in the simulation experiments refers to the following:
The A* algorithm-based method for path planning in the prior art refers to the method for path planning proposed by Gelperin et al. in "On the optimality of A* [J]. Artificial Intelligence, 1977, 8(1): 69-76.", abbreviated as the A* algorithm-based method for path planning.
The effects of the present disclosure are further described below with reference to a
simulation diagram of FIG. 4.
FIG. 4 is the simulation diagram of the present disclosure, wherein a black dot represents
the agent 1, a black solid line represents the running path for the agent 1, a gray dot represents
the agent 2, a black dotted line represents the running path for the agent 2, and an arrow
represents a running direction thereof. A diamond grid in FIG. 4 is a target position of an unloading task.
FIG. 4 (a) is a simulation result diagram of the original running paths for the agents before path planning; FIG. 4 (b) is a simulation result diagram showing that, if the two agents run according to the original paths, they will have the opposite conflict at the position marked with a black triangle; FIG. 4 (c) is a simulation result diagram showing that the two agents perform dynamic obstacle avoidance at a black dot and a gray dot respectively, that is, the step before the collision; and FIG. 4 (d) is a simulation result diagram of path planning performed by the method provided by the present disclosure, wherein a black solid line is the path re-planned for the agent 1, and a black dotted line is the path re-planned for the agent 2.
From the results of the simulation experiment, it may be seen that the Q-learning-based two-
agent collaborative obstacle avoidance algorithm may successfully avoid the opposite conflict
and achieve the shortest path to the target position. When the two agents have the opposite
conflict in FIG. 4 (c), a dynamic path planning algorithm is used for avoidance. By comparing
FIG. 4 (c) and FIG. 4 (d), it may be seen that, after the Q-learning-based two-agent collaborative obstacle avoidance algorithm is used, the running distance of the agent is significantly reduced compared with that of the original running path before planning. Direct recording and output are performed in a program.
Table 2: Quantitative analysis table for path planning results of present disclosure and prior
art in simulation experiment
Number of tasks   Running time (s) of agent                                      Improvement percentage (%)
                  A* algorithm   Q-learning method   Two-agent Q-learning method
200               2487           2402                2237                        10.05
400               4877           4754                4237                        13.12
600               7232           7001                6138                        15.13
800               9618           9455                8679                        9.76
1000              12047          11358               10358                       14.02
1200              14147          14011               13014                       8.01
1400              16548          16245               15124                       8.61
1600              19547          19006               17545                       10.24
1800              21477          21014               19370                       9.81
2000              24168          23114               21156                       12.46
In order to compare the running efficiency of the three methods in the simulation experiment of the present disclosure, a timer in multiple threads of the Java software is used to directly record and output the running time for the agents to complete the tasks in the program, and the statistical result is as shown in the Table 2. The number of agents is fixed to be 100, and the number of tasks varies from 200 to 2000 with an interval of 200, that is, 10 different numbers of tasks are selected to perform statistics on the running time; 5 experiments were carried out for each number of tasks, an average value was taken, and the improvement percentage of the running time was calculated.
With reference to the Table 2, it may be seen that, with the gradual increase in the number of tasks, the method for path planning provided by the present disclosure shortens the running time and increases the running efficiency percentage compared with the methods for path planning based on the A* algorithm and the Q-learning algorithm. It may be seen that the present disclosure requires less running time than the prior art during the planning of the paths with the opposite conflict, and it is proved that a better planned path may be obtained by using the method provided by the present disclosure.
In order to verify the effect of the safety zone introduced in the grid map in the simulation experiment of the present disclosure, the total running distances for the agents to complete a certain number of tasks and the differences between the paths are respectively calculated, and the statistical result is as shown in a Table 3. Likewise, the statistical method for the number of agents and the number of tasks is consistent with the method for calculating the running time.
Table 3: Quantitative analysis table for path planning result of introduction of safety zone in grid map in simulation experiment

Number of tasks   Total running distance (m)                        Path difference (m)   Improvement percentage (%) of running efficiency
                  Q-learning method   Two-agent Q-learning method
200               124540              101457                        23083                 18.53
400               231457              191117                        40340                 17.43
600               348777              289824                        58953                 16.90
800               487510              405773                        81737                 16.77
1000              601000              510103                        90897                 15.12
1200              713204              600540                        112664                15.80
1400              874114              741454                        132660                15.18
1600              984512              841544                        142968                14.52
1800              1150177             962475                        187702                16.32
2000              1344874             1094099                       250775                16.23
With reference to the Table 3, it may be seen that with the gradual increase in the number
of tasks, the method for path planning provided by the present disclosure reduces the total
running distance of the agent by an average of 16% and increases the percentage of running
efficiency after the introduction of the safety zone in the grid map. It may be seen that after
the present disclosure introduces the safety zone during the planning of the paths with the
opposite conflict, the total running distance required for the agent is reduced, and it is proved
that the better planned path may be obtained by using the method provided by the present
disclosure.
The above simulation experiment shows that the present disclosure re-plans two conflict-
free optimal paths for two agents having the opposite conflict in combination with the
introduction of the safety zone in the grid map by using the Q-learning-based two-agent
collaborative obstacle avoidance algorithm, which reduces both the running time and
the total running distance during the dynamic path planning, and improves the running
efficiency of the agents.

Claims (3)

  1. A method for multi-agent dynamic path planning, wherein, when there is an opposite
    conflict in running paths for two agents in a one-way lane, multi-agent dynamic path planning
    with collaborative obstacle avoidance is performed by adopting a Q-learning-based two-agent
    collaborative obstacle avoidance algorithm in combination with the introduction of a
    safety zone in a grid map; and the method for planning comprises the following steps:
    step 1: constructing the grid map containing the safety zone:
    (1a) constructing, by using a grid process, the grid map with a width and a length of X
    and Y;
    (1b) in the grid map, setting, as the safety zone in the one-way lane, a zone where other
    agents avoid the agents running in opposite directions at the same time; and
    (1c) establishing, by taking an upper left corner of the grid map as an origin, rightward
    extension as a positive direction of a horizontal axis, and downward extension as a positive
    direction of a vertical axis, a Cartesian rectangular coordinate system of the grid map, and
    marking initial positions and target positions of the agents in the Cartesian rectangular
    coordinate system;
    step 2: roughly planning, by using an A* algorithm, a running path for each of the agents
    in the Cartesian rectangular coordinate system of the grid map;
    step 3: planning, by using the Q-learning-based two-agent collaborative obstacle
    avoidance algorithm, the running paths for the agents with the conflict:
    (3a) enabling each of the agents to run according to its roughly planned path, and if the
    two agents in the one-way lane have the opposite conflict, planning conflict-free running paths
    for the two agents simultaneously;
    (3b) creating a Q-value table for storing information of the two agents, and setting an
    initial Q-value to be -2×10⁻⁹, wherein horizontal coordinates of the Q-value table represent
    a combination of passable grid positions of the two agents, and vertical coordinates of the
    Q-value table represent a combination of moving directions of the two agents;
    (3c) when the two agents detect the same safety zone simultaneously, enabling only one
    of the two agents running in the opposite directions to enter the safety zone and the other of
    the two agents running in the opposite directions to stay at an entrance of this safety zone, and taking current positions as the initial positions of the two agents respectively;
    (3d) according to an ε-greedy policy, enabling each of the two agents to randomly
    select, by using an exploration rate ε, a moving direction from up, down, left, right, and stop
    respectively, wherein the exploration rate ε is set to be 0.5;
    (3e) enabling the two agents to update the positions according to the selected directions,
    and taking the updated positions as the initial positions;
    (3f) rewarding each of the two agents whose initial positions are the target positions with
    1 point, punishing at least one of the two agents, that encounters an obstacle or whose initial
    position belongs to the safety zone, with 1 point, and rewarding at least one of the agents, that
    reaches the passable position but not the target position, with 1 point;
    (3g) updating, by using an update formula, all the Q-values in the Q-value table and
    selecting out updated maximum Q-values; and
    (3h) determining whether the updated initial positions are the target positions set in the
    step (1c), and if yes, executing the step 4, otherwise, executing the step (3d); and
    step 4: sequentially connecting path points from the initial positions to the target positions
    of the agents to obtain the conflict-free running paths for the agents.
  2. The method for multi-agent dynamic path planning according to claim 1, wherein the
    combination of the passable grid positions of the two agents in the step (3b) is as follows:
    state = XY²x₁ + XYy₁ + Yx₂ + y₂
    wherein state represents the combination of the passable grid positions of the two agents,
    x₁ and y₁ represent coordinate values of the current position of one agent, and x₂ and y₂
    represent coordinate values of the current position of the other agent.
  3. The method for multi-agent dynamic path planning according to claim 1, wherein the
    update formula in the step (3g) is as follows:
    newQ ← (1 - α)Q + α(R + γ·maxQ)
    wherein newQ represents the updated Q-value, α represents a learning rate of 0.5, R
    represents a reward value, γ represents a discount factor of 0 or 1, and max represents the
    maximum operation.
AU2022204569A 2021-08-12 2022-06-28 Method for multi-agent dynamic path planning Active AU2022204569B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110924688.2A CN113625716B (en) 2021-08-12 2021-08-12 Multi-agent dynamic path planning method
CN202110924688.2 2021-08-12

Publications (2)

Publication Number Publication Date
AU2022204569A1 AU2022204569A1 (en) 2023-03-02
AU2022204569B2 true AU2022204569B2 (en) 2023-07-20

Family

ID=78384864

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2022204569A Active AU2022204569B2 (en) 2021-08-12 2022-06-28 Method for multi-agent dynamic path planning

Country Status (3)

Country Link
CN (1) CN113625716B (en)
AU (1) AU2022204569B2 (en)
GB (1) GB2610276A (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114578827B (en) * 2022-03-22 2023-03-24 北京理工大学 Distributed multi-agent cooperative full-coverage path planning method
CN116069023B (en) * 2022-12-20 2024-02-23 南京航空航天大学 Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116483086B (en) * 2023-04-26 2024-03-26 西安电子科技大学广州研究院 Long-term multi-agent path planning method for decoupling edge conflict and point conflict
CN116382304B (en) * 2023-05-26 2023-09-15 国网江苏省电力有限公司南京供电分公司 DQN model-based multi-inspection robot collaborative path planning method and system
CN117572876B (en) * 2024-01-15 2024-04-12 湖南大学 Multi-agent collision prevention control method based on dependency relationship

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105137967B (en) * 2015-07-16 2018-01-19 北京工业大学 The method for planning path for mobile robot that a kind of depth autocoder is combined with Q learning algorithms
CN108776483B (en) * 2018-08-16 2021-06-29 圆通速递有限公司 AGV path planning method and system based on ant colony algorithm and multi-agent Q learning
CN109765896A (en) * 2019-01-29 2019-05-17 重庆大学 A kind of dynamic path planning method based on the more AGV of intelligent parking lot
CN111566583A (en) * 2019-10-04 2020-08-21 香港应用科技研究院有限公司 System and method for adaptive path planning
CN111413980A (en) * 2020-04-07 2020-07-14 苏州哈工吉乐优智能装备科技有限公司 Automatic guided vehicle path planning method for inspection
CN112286203B (en) * 2020-11-11 2021-10-15 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN112987799B (en) * 2021-04-16 2022-04-05 电子科技大学 Unmanned aerial vehicle path planning method based on improved RRT algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lv, Y. et al., 'Q-Learning Dynamic Path Planning for an HCV Avoiding Unknown Threatened Area', Chinese Automation Congress (CAC), 2020, doi: 10.1109/CAC51589.2020.9327757. *

Also Published As

Publication number Publication date
GB202209423D0 (en) 2022-08-10
CN113625716A (en) 2021-11-09
CN113625716B (en) 2023-06-16
AU2022204569A1 (en) 2023-03-02
GB2610276A (en) 2023-03-01

Similar Documents

Publication Publication Date Title
AU2022204569B2 (en) Method for multi-agent dynamic path planning
Gul et al. Meta-heuristic approach for solving multi-objective path planning for autonomous guided robot using PSO–GWO optimization algorithm with evolutionary programming
CN112650229B (en) Mobile robot path planning method based on improved ant colony algorithm
CN113156980B (en) Tower crane path planning method and system based on deep reinforcement learning
Zhang et al. Path planning based quadtree representation for mobile robot using hybrid-simulated annealing and ant colony optimization algorithm
CN110045738A (en) Robot path planning method based on ant group algorithm and Maklink figure
Kyprianou et al. Towards the achievement of path planning with multi-robot systems in dynamic environments
CN109885082A (en) The method that a kind of lower unmanned aerial vehicle flight path of task based access control driving is planned
Ming et al. A survey of path planning algorithms for autonomous vehicles
Zghair et al. Intelligent Hybrid Path Planning Algorithms for Autonomous Mobile Robots.
Wang et al. Improved Q-learning applied to dynamic obstacle avoidance and path planning
Li et al. Robot path planning using improved artificial bee colony algorithm
Omar Path planning for unmanned aerial vehicles using visibility line-based methods
BAYGIN et al. PSO based path planning approach for multi service robots in dynamic environments
Han et al. Research on UAV indoor path planning algorithm based on global subdivision grids
Liu et al. Intelligent robot motion trajectory planning based on machine vision
Zhang et al. A State-Decomposition DDPG Algorithm for UAV Autonomous Navigation in 3D Complex Environments
Li et al. End-to-end autonomous exploration for mobile robots in unknown environments through deep reinforcement learning
Yao et al. The Application of Internet of Things in Robot Route Planning Based on Multisource Information Fusion
Li et al. Improved sparrow search algorithm applied to path planning of mobile robot
Luo et al. A Max-Min Ant System Approach to Autonomous Navigation
Li et al. Towards Path Planning Algorithm Combining with A-Star Algorithm and Dynamic Window Approach Algorithm
Neuman et al. Anytime policy planning in large dynamic environments with interactive uncertainty
Zhang et al. A deep reinforcement learning method for mobile robot path planning in unknown environments
Botao et al. A expected-time optimal path planning method for robot target search in uncertain environments

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)