CN113534819B - Method and storage medium for pilot following type multi-agent formation path planning - Google Patents

Method and storage medium for pilot following type multi-agent formation path planning

Info

Publication number
CN113534819B
CN113534819B · Application CN202110985503.9A
Authority
CN
China
Prior art keywords
agent
following
pilot
obstacle
piloting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110985503.9A
Other languages
Chinese (zh)
Other versions
CN113534819A (en)
Inventor
刘飞
范之琳
杨洪勇
韩艺琳
宁新顺
刘莉
王丽丽
张顺宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ludong University
Original Assignee
Ludong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ludong University
Priority to CN202110985503.9A
Publication of CN113534819A
Application granted
Publication of CN113534819B
Legal status: Active

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a method for pilot-following multi-agent formation path planning, which comprises the following steps: step S1: the piloting agent initializes the Q values according to an attractive potential field method; step S2: the piloting agent dynamically adjusts the exploration probability of the ε-greedy method according to a simulated annealing method and performs action selection; step S3: the piloting agent avoids obstacles according to a virtual obstacle filling obstacle avoidance strategy and a dynamic obstacle avoidance mechanism; step S4: the piloting agent executes the action and obtains a return, updates the Q value according to the return, and transmits its new position to the following agents, until the piloting agent reaches the preset number of training times; and step S5: the following agent obtains an expected target position from the current position information of the piloting agent, selects the action corresponding to the minimum-cost state according to a cost function, and executes the action.

Description

Method and storage medium for pilot following type multi-agent formation path planning
Technical Field
The invention relates to the technical field of robots, in particular to a method and a storage medium for planning a formation path of a pilot following type multi-robot system.
Background
Multi-agent formation path planning requires several robots to form a formation, keep their relative positions, and move toward a target point; during the motion they must not only avoid obstacles safely but also find a good path to the target quickly. In addition, path planning in a known map environment is comparatively simple, whereas an unknown map environment places much higher demands on the path planning capability of the robots.
Multi-agent formation can be implemented in many ways, including the pilot-following (leader-follower) method, behavior-based methods, and the virtual structure method. The pilot-following method achieves cooperation mainly through sharing the pilot's information; its algorithm is simple, but it places high demands on the piloting robot, so the path planning capability of the pilot and the local following capability of the followers need to be improved. Behavior-based methods design sub-behaviors in advance and select which behavior to execute according to the situation encountered, but their accuracy is limited and the behaviors are difficult to fuse in complex environments. The virtual structure method treats the formation as a fixed rigid structure and cannot perform effective obstacle avoidance.
Agent path planning is divided into global path planning and local path planning, and both must accomplish two tasks: avoiding obstacles and reaching the target point.
In the process of implementing the embodiments of the present disclosure, it was found that the related art has at least the following problems: the A* algorithm used in global path planning needs to know the environment information in advance, lacks flexibility, and has difficulty coping with an environment that changes in real time; the artificial potential field method widely used in local path planning easily falls into local optima or oscillation; and the trial-and-error learning of reinforcement learning algorithms requires continual iteration and is time-consuming.
Disclosure of Invention
The embodiments of the present disclosure provide a method and a storage medium for pilot-following multi-agent formation path planning, which are intended to solve the following problems of the prior art: the A* algorithm used in global path planning needs to know the environment information in advance, lacks flexibility, and has difficulty coping with an environment that changes in real time; the artificial potential field method widely used in local path planning easily falls into local optima or oscillation; and the trial-and-error learning of reinforcement learning algorithms requires continual iteration and is time-consuming.
In a first aspect, a method for pilot-following multi-agent formation path planning is provided, the method comprising: step S1: the piloting agent initializes the Q values according to an attractive potential field method; step S2: the piloting agent dynamically adjusts the exploration probability of the ε-greedy method according to the simulated annealing method and performs action selection; step S3: the piloting agent avoids obstacles according to a virtual obstacle filling obstacle avoidance strategy and a dynamic obstacle avoidance mechanism; step S4: the piloting agent executes actions and obtains returns, updates the Q value according to the returns, and transmits its new position to the following agents, until the piloting agent reaches the preset number of training times; step S5: when a following agent acquires the current position information of the piloting agent, the following agent obtains an expected target position from that information, selects the action corresponding to the minimum-cost state according to a cost function and executes it, and at the same time avoids obstacles according to the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism while moving toward the expected target position; when the following agent does not acquire the current position information of the piloting agent, the path planning of the following agent ends.
With reference to the first aspect, in a first possible implementation manner of the first aspect, in the step S2, the exploration probability epsilon is calculated by the following formula:
wherein Q(S, A_random) is the Q value of a randomly selected action in state S, Q(S, A_max) is the maximum Q value in state S, Q is a non-zero constant, and T is a temperature control parameter in the simulated annealing method.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, after the exploration probability is obtained by calculation, a random number is obtained; when the exploration probability is greater than the random number, the piloting agent randomly selects an action, and when the exploration probability is smaller than or equal to the random number, the piloting agent selects the action corresponding to the maximum Q value in the current state.
With reference to the first aspect, in a third possible implementation manner of the first aspect, the step S3 further includes: step S31: acquiring the positions adjacent to the current position of the piloting agent and calculating first distances between the adjacent positions and the target position, so as to judge from the first distances whether the current position is heading into a concave obstacle and to avoid the obstacle by filling, wherein the distance between the current position of the piloting agent and the target position is a second distance; step S32: when the first distance is smaller than the second distance, judging whether the current adjacent position is an obstacle, and when it is not an obstacle, judging it to be a feasible adjacent position; step S33: filling the current position of the piloting agent as a virtual obstacle when no feasible adjacent position exists.
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, in the step S4, the piloting agent calculates the obtained return by the following formula,
return function R(S_t, A_t) = w_c × R_current(S_t, A_t) + w_h × H(S_t, A_t)
wherein S_t is the state of the piloting agent at time t; A_t is the action of the piloting agent at time t; R_current(S_t, A_t) is the current-position return function of the piloting agent; H(S_t, A_t) is a heuristic function that calculates the diagonal distance between the current position of the piloting agent and the target position; w_c is a first coefficient and is positive; and w_h is a second coefficient and is negative.
With reference to the first aspect, in a fifth possible implementation manner of the first aspect, in the step S1, the Q value is initialized by the following formula,
wherein, in the return value, k is a proportional coefficient, γ is a discount factor, ζ is a negative adjustment coefficient, ρ_aim(S') is the distance between the current position of the piloting agent and the target position, and η is a constant.
With reference to the first aspect, in a sixth possible implementation manner of the first aspect, in the step S5, the cost function is
C(s_t, a_t) = c × d_attr + R_static(s_t, a_t)
wherein s_t is the state of the following agent at time t, a_t is the action of the following agent at time t, d_attr is the attractive potential field calculated from the Euclidean distance between the current position of the following agent and the target position, R_static(s_t, a_t) is the static-obstacle penalty function, and c is an adjustment coefficient.
With reference to the first aspect, in a seventh possible implementation manner of the first aspect, in step S5, the following agent performs obstacle avoidance according to a dynamic obstacle avoidance mechanism, including:
when a dynamic obstacle appears at a position adjacent to the current position of the following agent, the repulsive potential field of the dynamic obstacle at the current position of the following agent is obtained, the distance between the current position of the following agent and the dynamic obstacle is calculated, a force analysis is performed on the attraction of the following agent's expected target position and the repulsion of the dynamic obstacle, a temporary target position for the following agent to avoid the dynamic obstacle is determined, and obstacle avoidance is carried out.
With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner of the first aspect, the repulsive potential field is calculated by the following formula,
wherein (x_s, y_s) are the coordinates of the current state of the following agent, and (x_obst, y_obst) are the coordinates of the dynamic obstacle.
In a second aspect, a storage medium is provided, the storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the aforementioned method for piloting a following multi-agent formation path planning.
The method, the system and the storage medium for pilot-following multi-agent formation path planning provided by the embodiments of the present disclosure can achieve the following technical effects:
the pilot following type multi-agent formation mode is adopted to carry out path planning in an unknown environment, pilot agents are responsible for planning paths, the pilot agents follow the pilot agents to maintain formation formations, the multi-agents share position information and partial environment information, division of work is clear, and formation is simple and efficient; the pilot intelligent agent can accelerate the path planning convergence by utilizing the motion strategy of the simulated annealing method and the epsilon greedy method; the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism are set, so that the obstacle can be effectively avoided; the following agent adopts a cost function to select action, and adopts a virtual obstacle filling obstacle avoidance strategy in parallel with the piloting agent to carry out local path planning, so that the following agent can effectively follow and avoid obstacles.
The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.
Drawings
One or more embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like references indicate similar elements and the figures are not drawn to scale, and in which:
FIG. 1 is a flow diagram of a method for pilot following multi-agent formation path planning provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of actions, step sizes and sensor detection ranges of a multi-agent provided by embodiments of the present disclosure;
fig. 3 is a schematic one-step filling diagram of an agent virtual obstacle avoidance strategy provided in an embodiment of the present disclosure;
FIG. 4 is a comparison schematic diagram of the cumulative-return convergence of a conventional Q-learning algorithm and the improved Q-learning algorithm of the embodiments of the present disclosure;
FIG. 5 is a comparison schematic diagram of the episode step-count convergence of a conventional Q-learning algorithm and the improved Q-learning algorithm of the embodiments of the present disclosure;
FIG. 6 is a schematic diagram of the following agent avoiding a dynamic obstacle with the rasterized artificial potential field method provided by embodiments of the present disclosure;
fig. 7 is a flow chart of a pilot agent path planning method provided in an embodiment of the disclosure;
FIG. 8 is a flow chart of a following agent path planning method provided by an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a result of path planning of a pilot agent provided in an embodiment of the present disclosure;
fig. 10 is a schematic diagram of a multi-agent formation path planning result in an obstacle environment according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided herein without inventive effort fall within the scope of protection of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and a person of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The concepts involved in the embodiments of the present disclosure are described below. Simulated annealing: starting from a relatively high initial temperature and, as the temperature control parameter keeps decreasing, randomly searching the solution space for the globally optimal solution of the objective function in combination with a probabilistic jump property; that is, the method can probabilistically jump out of local optima and eventually tends toward the global optimum.
Q-learning is a value-based algorithm among reinforcement learning algorithms. Q denotes Q(s, a), the expected benefit of taking action a (a ∈ Action) in state s (s ∈ State) at a certain moment; the environment feeds back a corresponding return according to the agent's action, and the learning episodes are iterated continuously until the cumulative return is maximized. The key of the algorithm is therefore to build a Q table indexed by state and action to store the Q values, and then to select the action that can obtain the maximum benefit according to the Q values.
In the process of implementing the embodiments of the present disclosure, it was found that the related art has the following issue: the pilot-following method achieves cooperation mainly through sharing the pilot's information, which places high demands on the piloting agent, so the path planning capability of the piloting agent and the local following capability of the following agents need to be improved.
Fig. 1 is a flow chart of a method for pilot-following multi-agent formation path planning provided in an embodiment of the present disclosure. As shown in fig. 1, an embodiment of the present disclosure provides a method for pilot-following multi-agent formation path planning, the method comprising: step S1: the piloting agent initializes the Q values according to an attractive potential field method; step S2: the piloting agent dynamically adjusts the exploration probability of the ε-greedy method according to the simulated annealing method and performs action selection; step S3: the piloting agent avoids obstacles according to the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism; step S4: the piloting agent executes the action and obtains the return, updates the Q value according to the return, and transmits its new position to the following agents, until the piloting agent reaches the preset number of training times; step S5: when a following agent acquires the current position information of the piloting agent, the following agent obtains an expected target position from that information, selects the action corresponding to the minimum-cost state according to a cost function and executes it, and at the same time avoids obstacles according to the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism while moving toward the expected target position; when the following agent does not acquire the current position information of the piloting agent, the path planning of the following agent ends.
The method for pilot-following multi-agent formation path planning provided by the embodiments of the present disclosure can achieve the following technical effects: the pilot-following multi-agent formation mode is used for path planning in an unknown environment, the piloting agent is responsible for planning the path, the following agents follow the piloting agent to maintain the formation, and the agents share position information and partial environment information, so the division of labor is clear and forming the formation is simple and efficient; the piloting agent accelerates the convergence of path planning by using a motion strategy that combines the simulated annealing method with the ε-greedy method; the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism enable obstacles to be avoided effectively; and the following agents select actions with a cost function and, in parallel with the piloting agent, adopt the virtual obstacle filling obstacle avoidance strategy for local path planning, so they can follow and avoid obstacles effectively.
In some embodiments, the multi-agent formation is divided into a piloting agent and following agents. The formation can be set to a triangle, with the piloting agent at the front and the following agents behind: following the piloting agent reduces the resistance of the whole advancing formation, and communication between the agents is convenient; those skilled in the art can also set the formation to a square or another shape according to actual needs. Static obstacles with different positions, sizes and shapes are arranged randomly, and dynamic obstacles can be arranged within a preset grid range. The piloting agent can also be a virtual piloting agent.
Fig. 2 is a schematic diagram of the actions, step sizes and sensor detection ranges of the multi-agent system provided by an embodiment of the present disclosure. As shown in fig. 2, in some embodiments, the multi-agent action set includes: up, down, left, right, up-left, up-right, down-left, and down-right. Each agent 21 is provided with a sensor that can detect the environmental condition of its current cell and the environment information within the 3×3 grid range centered on that cell.
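As a concrete illustration of the action set and the 3×3 sensing window just described, the following minimal Python sketch may help; the names ACTIONS and sensed_cells are assumptions introduced here for illustration and do not appear in the patent.

# (dx, dy) offsets for the eight actions: up, down, left, right and the four diagonals
ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0),
           (-1, 1), (1, 1), (-1, -1), (1, -1)]

def sensed_cells(pos):
    """Return the 3x3 block of grid cells centered on `pos` that the sensor covers."""
    x, y = pos
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]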
In some embodiments, prior to step S1, a multi-agent sports environment map is constructed, and the current location and the target location of the piloting agent are obtained.
In some embodiments, in step S1, the Q value initialization is performed by the following formula,
wherein k is a proportional coefficient of the return value, γ is a discount factor, ζ is a negative adjustment coefficient, ρ_aim(S') is the distance between the current position of the piloting agent and the target position, and η is a constant that prevents the denominator from being zero. In this way, the improved attractive potential field method is used to initialize the Q table so as to guide the motion of the piloting agent, and initialization is carried out while the piloting agent moves. The piloting agent always tends to select the action with the largest Q value, so it is guided toward the target position while avoiding obstacles.
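To make the idea concrete, here is a minimal Python sketch of seeding a Q table from an attractive potential field so that, from the first episode, greedy action selection pulls the agent toward the goal. The patent's exact closed form (which also involves the discount factor γ and the negative adjustment coefficient ζ) is not reproduced in this text, so the expression below is a simplified stand-in, and all names and constants are assumptions.

import math

GRID_W, GRID_H = 20, 20
GOAL = (18, 18)
K, ETA = 1.0, 1e-3          # ETA keeps the denominator away from zero

ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0),
           (-1, 1), (1, 1), (-1, -1), (1, -1)]

def rho_aim(pos):
    """Distance from `pos` to the target position."""
    return math.hypot(pos[0] - GOAL[0], pos[1] - GOAL[1])

def init_q_table():
    """Q[(state, action)] seeded so that actions moving closer to the goal score higher."""
    q = {}
    for x in range(GRID_W):
        for y in range(GRID_H):
            for a in ACTIONS:
                nxt = (x + a[0], y + a[1])
                q[((x, y), a)] = K / (rho_aim(nxt) + ETA)
    return q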
In some embodiments, in step S4, the piloting agent calculates the obtained return by the following formula,
return function R(S_t, A_t) = w_c × R_current(S_t, A_t) + w_h × H(S_t, A_t)
wherein S_t is the state of the piloting agent at time t; A_t is the action of the piloting agent at time t; R_current(S_t, A_t) is the current-position return function of the piloting agent; H(S_t, A_t) is the heuristic function, obtained by calculating the diagonal distance between the current position of the piloting agent and the target position; w_c is a first coefficient and is positive; w_h is a second coefficient and is negative; and w_c and w_h can be adjusted to suit the environment. The return function for exploring the environment is thus constructed as the weighted sum of the current-position return function and the heuristic function.
In some embodiments, the current-position return function is set as follows: when the agent encounters an obstacle, the current-position return function is given a large negative value; when the agent reaches the target position, it is given a large positive value; and each movement of the agent is given a small negative value as a step penalty.
In some embodiments, the heuristic function uses the diagonal distance, which is composed of the Manhattan distance h_man and the diagonal movement distance h_dia. The Manhattan distance is the distance between two positions in the north-south direction plus the distance in the east-west direction.
In some embodiments, h_man(S_t, A_t) = |x_S - x_goal| + |y_S - y_goal|, wherein (x_S, y_S) are the coordinates of the piloting agent in state S and (x_goal, y_goal) are the coordinates of the piloting agent in the target state.
In some embodiments, h_dia(S_t, A_t) = min(|x_S - x_goal|, |y_S - y_goal|).
The heuristic function is calculated from the diagonal distance, where D is the movement cost between adjacent grid cells: the movement cost in the up, down, left and right directions is D, and the movement cost in the diagonal directions is √2·D.
In some embodiments, in step S2, the exploration probability epsilon is calculated by the following formula:
wherein Q(S, A_random) is the Q value of a randomly selected action in state S, Q(S, A_max) is the maximum Q value in state S, Q is a non-zero constant used to prevent the numerator from being zero, and T is the temperature control parameter in the simulated annealing method. The acceptance probability of the randomly selected action is calculated according to the Metropolis criterion and taken as the exploration probability.
In some embodiments, an initial value of the temperature control parameter T is set, and the cooling temperature is controlled with the standard deviation of the step counts of n consecutive iteration episodes,
wherein step_{m+1}, step_{m+2}, ..., step_{m+n} are the step counts of the n consecutive iterations, step_avg is the mean of the n consecutive step counts, k is a control coefficient used to keep the value of T in a suitable range, and i is a non-zero constant used to prevent T from becoming 0 after convergence.
In some embodiments, after the exploration probability is calculated, a random number δ ∈ (0, 1) is obtained. When the exploration probability ε is greater than the random number δ, the piloting agent selects an action at random; when ε is smaller than or equal to δ, the piloting agent selects the action corresponding to the maximum Q value in the current state. In this way, the piloting agent selects actions with a strategy that combines the simulated annealing method and the ε-greedy method, giving an improved simulated-annealing ε-greedy policy: ε is adjusted dynamically by the simulated annealing method, and the temperature control parameter of the simulated annealing is adjusted in real time according to the learning progress of the algorithm, so that the early stage of path planning explores as much as possible to accumulate prior knowledge and avoid local optima, while unnecessary exploration is cancelled as the later stage approaches convergence.
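The following Python sketch illustrates this action-selection strategy. It assumes a Metropolis-style exploration probability ε = exp((Q(S, A_random) - Q(S, A_max)) / (q_const · T)) and a temperature update proportional to the standard deviation of recent episode step counts; the patent's exact formulas are not reproduced in this text, so both expressions and all names are assumptions.

import math
import random
import statistics

def select_action(q, state, actions, temperature, q_const=1.0):
    """Improved simulated-annealing epsilon-greedy selection over a Q table `q`."""
    a_rand = random.choice(actions)
    a_best = max(actions, key=lambda a: q.get((state, a), 0.0))
    q_rand = q.get((state, a_rand), 0.0)
    q_best = q.get((state, a_best), 0.0)
    epsilon = math.exp((q_rand - q_best) / (q_const * temperature))  # in (0, 1]
    if epsilon > random.random():
        return a_rand     # explore with the randomly chosen action
    return a_best         # exploit the action with the maximum Q value

def update_temperature(recent_episode_steps, k_ctrl=0.5, i_const=1e-2):
    """Cool T using the spread of the last n episode lengths; i_const keeps T non-zero."""
    spread = statistics.pstdev(recent_episode_steps) if len(recent_episode_steps) > 1 else 0.0
    return k_ctrl * spread + i_const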
In some embodiments, step S3 further comprises: step S31: acquiring the positions adjacent to the current position of the piloting agent and calculating first distances between the adjacent positions and the target position, so as to judge from the first distances whether the current position is heading into a concave obstacle and to avoid the obstacle by filling, wherein the distance between the current position of the piloting agent and the target position is a second distance; step S32: when the first distance is smaller than the second distance, judging whether the current adjacent position is an obstacle, and when it is not an obstacle, judging it to be a feasible adjacent position; step S33: filling the current position of the piloting agent as a virtual obstacle when no feasible adjacent position exists.
In some embodiments, in step S31, the positions adjacent to the current position of the piloting agent comprise the cells of the 3×3 grid around the current position. Step S31 further comprises: creating a current position-action array for storing the feasible adjacent positions of the current position.
In some embodiments, the feasible adjacent positions obtained in step S32 are added to the current position-action array. A current adjacent position that is farther from the target position or that is an obstacle is an infeasible position and is not added to the current position-action array.
In some embodiments, in step S33, when the current position-action array is empty, that is, when no feasible adjacent position exists, the current position tends entirely toward an infeasible area and may be a critical position on a path heading into a concave obstacle, so the current position is filled as a virtual obstacle. Step S33 further comprises: performing this judgment at every step of the agent until the concave obstacle is filled, as in the sketch below.
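A minimal Python sketch of this one-step virtual-obstacle-filling check follows. Here `grid` is assumed to be a dictionary mapping a cell to True when that cell is an obstacle (real or virtual); the function name and data layout are assumptions.

import math

ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0),
           (-1, 1), (1, 1), (-1, -1), (1, -1)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def fill_if_dead_end(pos, goal, grid):
    """Mark `pos` as a virtual obstacle when no neighbor is both closer to the
    goal (first distance < second distance) and obstacle-free."""
    second_distance = dist(pos, goal)                 # current position -> goal
    feasible = []
    for dx, dy in ACTIONS:
        nb = (pos[0] + dx, pos[1] + dy)
        first_distance = dist(nb, goal)               # adjacent position -> goal
        if first_distance < second_distance and not grid.get(nb, False):
            feasible.append(nb)                       # closer to the goal and not an obstacle
    if not feasible:
        grid[pos] = True                              # fill the cell as a virtual obstacle
    return feasible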
Fig. 3 is a one-step filling schematic diagram of the agent's virtual obstacle filling obstacle avoidance strategy provided in an embodiment of the present disclosure. As shown in fig. 3, in some embodiments the agent enters the light gray concave obstacle during path planning. According to the distance calculation, the neighbors of the current position (the dark gray cell in fig. 3(b)) that approach the target are determined, and all three of them are found to be obstacles, i.e., infeasible positions; the current position therefore tends entirely toward an infeasible area and is filled as the light gray virtual obstacle shown in fig. 3(c). In this way, concave obstacles can be avoided effectively.
In some embodiments, in step S3, the piloting agent avoids obstacles according to the dynamic obstacle avoidance mechanism as follows: before moving, the piloting agent judges whether it would collide with a dynamic obstacle after reaching the selected state; if so, the piloting agent changes the selected action to the action corresponding to the maximum-Q-value state among the three grid cells on the side opposite the dynamic obstacle. A threshold is also set: if the number of consecutive avoidances performed by the piloting agent with the same action exceeds the threshold, the dynamic obstacle is judged to be moving in a straight line at uniform speed, and the piloting agent moves along the normal direction of the dynamic obstacle's motion to avoid it and then continues moving toward the target position, as in the sketch below.
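Here is a Python sketch of the collision check and fallback just described: if the chosen move would land on the cell occupied by a dynamic obstacle, the agent instead takes the highest-Q action among the three cells on the side facing away from the obstacle. The way the "opposite side" candidate set is built here, and all names, are assumptions for illustration.

def avoid_dynamic_obstacle(q, state, chosen_action, obstacle_cell, actions):
    """Return `chosen_action` if it is safe, otherwise the best action pointing away
    from the dynamic obstacle according to the Q table `q`."""
    nxt = (state[0] + chosen_action[0], state[1] + chosen_action[1])
    if nxt != obstacle_cell:
        return chosen_action                       # no collision predicted
    sign = lambda v: (v > 0) - (v < 0)
    away = (sign(state[0] - obstacle_cell[0]),     # unit direction away from the obstacle
            sign(state[1] - obstacle_cell[1]))
    # the away direction and its two nearest neighbors form the three opposite-side cells
    candidates = [a for a in actions
                  if abs(a[0] - away[0]) <= 1 and abs(a[1] - away[1]) <= 1
                  and (a[0] * away[0] + a[1] * away[1]) > 0]
    if not candidates:
        candidates = actions
    return max(candidates, key=lambda a: q.get((state, a), 0.0))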
In some embodiments, in step S4, the piloting agent keeps moving until it reaches the target position, which ends one episode, and episodes are iterated continuously until the preset number of training times is reached; at every step the piloting agent broadcasts its new position to the following agents. The piloting agent updates the Q value by
Q(S_t, A_t) ← Q(S_t, A_t) + α[r_t + γ·max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
wherein α is the learning rate, 0 < α < 1; r_t is the reward obtained after the piloting agent selects its action strategy at the current time; γ is the discount factor, 0 < γ < 1; and max_a Q(S_{t+1}, a) is the Q value obtained when the piloting agent selects the optimal action strategy at the next time t+1. In this way, the improved reinforcement learning algorithm is well suited to exploring and planning in an unknown environment, and a globally optimized path can be planned quickly.
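A minimal Python sketch of this tabular update (the standard Q-learning rule, consistent with the symbols defined above) is given below; the function and argument names are illustrative.

def q_update(q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s_next, a') - Q(s, a))."""
    old = q.get((s, a), 0.0)
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = old + alpha * (reward + gamma * best_next - old)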
FIG. 4 is a comparison schematic diagram of the cumulative-return convergence of a conventional Q-learning algorithm and the improved Q-learning algorithm of the embodiments of the present disclosure. Fig. 5 is a comparison schematic diagram of the episode step-count convergence of a conventional Q-learning algorithm and the improved Q-learning algorithm of the embodiments of the present disclosure. As shown in fig. 4 and fig. 5, with the improved Q-value reinforcement learning algorithm provided by the embodiments of the present disclosure, the cumulative return and the episode step count in the agent's path planning process reach smooth convergence faster: the convergence time is shortened by 89.9%, the number of episodes to convergence is reduced by 63.4%, and the number of steps in the planned path is smaller.
In some embodiments, in step S5, as the following agent moves toward the desired target position, the cost of each of the eight position states adjacent to the current position state is calculated with the cost function, and the action corresponding to the minimum-cost state is determined and executed.
In some embodiments, in step S5, the cost function is
C(s_t, a_t) = c × d_attr + R_static(s_t, a_t)
wherein s_t is the state of the following agent at time t, a_t is the action of the following agent at time t, d_attr is the attractive potential field calculated from the Euclidean distance between the current position of the following agent and the target position, R_static(s_t, a_t) is the static-obstacle penalty function, and c is an adjustment coefficient. The cost function is designed following the idea of the attractive potential field: the attractive potential field of the target position relative to the current position is measured by, and proportional to, the Euclidean distance between them,
d_attr = √((x_s - x_goal)² + (y_s - y_goal)²)
wherein (x_s, y_s) are the coordinates of the current state of the following agent and (x_goal, y_goal) are the coordinates of the target position.
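A small Python sketch of this follower cost evaluation over the eight neighboring cells follows; the static-obstacle penalty value and all names are assumptions.

import math

def follower_cost(cell, target, grid, c=1.0, obstacle_penalty=1e3):
    """C = c * d_attr + R_static for one candidate cell."""
    d_attr = math.hypot(cell[0] - target[0], cell[1] - target[1])  # Euclidean attraction term
    r_static = obstacle_penalty if grid.get(cell, False) else 0.0  # static-obstacle penalty
    return c * d_attr + r_static

def pick_follower_action(pos, target, grid, actions):
    """Evaluate the eight neighbors and return the action whose cell has minimum cost."""
    return min(actions,
               key=lambda a: follower_cost((pos[0] + a[0], pos[1] + a[1]), target, grid))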
Fig. 6 is a schematic diagram of the following agent avoiding a dynamic obstacle with the rasterized artificial potential field method provided by an embodiment of the present disclosure. As shown in fig. 6, in some embodiments, in step S5, the following agent avoids obstacles according to the dynamic obstacle avoidance mechanism as follows: when a dynamic obstacle appears at a position adjacent to the current position of the following agent, the repulsive potential field of the dynamic obstacle at the current position of the following agent is obtained and the distance between the current position of the following agent and the dynamic obstacle is calculated; a force analysis is performed on the attraction of the following agent's expected target position and the repulsion of the dynamic obstacle, a temporary target position for the following agent to avoid the dynamic obstacle is determined, and obstacle avoidance is carried out. The temporary target position is discarded after obstacle avoidance.
In some embodiments, the repulsive potential field is calculated by the following formula,
wherein (x_s, y_s) are the coordinates of the current state of the following agent and (x_obst, y_obst) are the coordinates of the dynamic obstacle. In this way, the following agent obtains the continuously changing expected target position in real time from the position of the piloting agent, realizing a rasterized, self-detecting local path planning method based on the artificial potential field.
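To illustrate the force analysis, here is a Python sketch that combines an attraction toward the expected target with a repulsion away from the dynamic obstacle and snaps the resultant force to one of the eight grid directions to obtain a temporary target cell. The 1/distance-style repulsion magnitude, the gains and all names are assumptions; the patent's exact repulsive-field expression is not reproduced in this text.

import math

def temporary_target(pos, target, obstacle, k_att=1.0, k_rep=2.0):
    """Pick a temporary target cell from the resultant of attraction and repulsion."""
    ax, ay = target[0] - pos[0], target[1] - pos[1]      # attraction toward the target
    d_att = math.hypot(ax, ay) or 1.0
    rx, ry = pos[0] - obstacle[0], pos[1] - obstacle[1]  # repulsion away from the obstacle
    d_rep = math.hypot(rx, ry) or 1.0
    fx = k_att * ax / d_att + k_rep * rx / (d_rep ** 2)  # repulsion grows as the obstacle nears
    fy = k_att * ay / d_att + k_rep * ry / (d_rep ** 2)
    step_x = int(math.copysign(1, fx)) if abs(fx) > 1e-9 else 0
    step_y = int(math.copysign(1, fy)) if abs(fy) > 1e-9 else 0
    return (pos[0] + step_x, pos[1] + step_y)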
Fig. 7 is a flow chart of the piloting agent path planning method provided in an embodiment of the disclosure. As shown in fig. 7, step P1: establish the Q table of the piloting agent and initialize the temperature control parameter T, the return function and the detection mechanism using the attractive potential field method, the detection mechanism being used to detect the 3×3 grid environment around the current position; then proceed to step P2: increase the episode count of the piloting agent by one and, in the current episode, calculate the standard deviation and update the temperature control parameter T; judge whether the preset number of training times has been reached, and if so, end the process; if not, go to step P3: initialize the position of the piloting agent and go to step P4: increase the step count of the piloting agent by one and go to step P5: the piloting agent avoids obstacles according to the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism, selects and executes an action using the behavior strategy, enters the next state and obtains the return, broadcasts its current position coordinates to the following agents, passes the return value to the value function and updates the Q value; when the current position state of the piloting agent has not reached the target position, go to step P4; when it has reached the target position, go to step P2.
Fig. 8 is a flow chart of the following agent path planning method provided by an embodiment of the present disclosure. As shown in fig. 8, the position of the following agent is initialized. Step B1: when the target state broadcast by the piloting agent has been acquired, the process ends; when it has not been acquired, go to step B2: calculate the cost of the eight position states adjacent to the current position state of the following agent and determine the minimum-cost state; the following agent, in parallel with the piloting agent, adopts the virtual obstacle filling obstacle avoidance strategy and locally avoids dynamic obstacles with the rasterized artificial potential field method, then selects and executes an action and goes to step B3: when the current state is the target state, go to step B1; when it is not, go to step B2.
Fig. 9 is a schematic diagram of a result of path planning of a pilot agent according to an embodiment of the disclosure. As shown in fig. 9, the pilot agent can stably find the optimal path by adopting the improved Q learning algorithm of the embodiment of the present disclosure, and the number of path steps is reduced to 22.
Fig. 10 is a schematic diagram of a multi-agent formation path planning result in an obstacle environment according to an embodiment of the present disclosure. As shown in fig. 10, the piloting robot plans the dark gray path using the improved Q-learning algorithm of the embodiments of the present disclosure, the two following robots avoid obstacles during following and plan the two light gray paths respectively, and the three robots arrive at the target position at the same time, completing the formation task.
The embodiments of the present disclosure also provide a system for pilot-following multi-agent formation path planning, the system comprising: a first module, used for the piloting agent to initialize the Q values according to the attractive potential field method; a second module, used for the piloting agent to dynamically adjust the exploration probability of the ε-greedy method according to the simulated annealing method and to perform action selection; a third module, used for the piloting agent to avoid obstacles according to the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism; a fourth module, used for the piloting agent to execute actions and obtain returns, to update the Q value according to the returns, and to transmit its new position to the following agents, until the piloting agent reaches the preset number of training times; and a fifth module, used for the following agent, when it acquires the current position information of the piloting agent, to obtain an expected target position from that information, to select and execute the action corresponding to the minimum-cost state according to the cost function, and at the same time to avoid obstacles according to the virtual obstacle filling obstacle avoidance strategy and the dynamic obstacle avoidance mechanism while moving toward the expected target position; when the following agent does not acquire the current position information of the piloting agent, the path planning of the following agent ends.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
The disclosed embodiments also provide a storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the aforementioned method for pilot following multi-agent formation path planning.
The method, the system and the storage medium for pilot-following multi-agent formation path planning provided by the embodiments of the present disclosure construct a pilot-following multi-agent formation mode in which the piloting agent performs path planning based on a reinforcement learning algorithm with a heuristic-information guidance mechanism, the following agents follow the motion state of the piloting agent, and the agents share position information and partial environment information to achieve dynamic formation motion; the division of labor among the agents is clear, dynamic and concave obstacles can be avoided effectively, forming the formation is simple and efficient, and a globally optimized path can be planned quickly in a completely unknown environment.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (6)

1. A method for pilot following multi-agent formation path planning, comprising:
step S1: initializing, by the piloting agent, a Q value according to an attractive potential field method;
step S2: dynamically adjusting, by the piloting agent, the exploration probability of the ε-greedy method according to the simulated annealing method, and performing action selection;
step S3: avoiding, by the piloting agent, obstacles according to a virtual obstacle filling obstacle avoidance strategy and a dynamic obstacle avoidance mechanism;
step S4: executing, by the piloting agent, actions and obtaining returns, updating the Q value according to the returns, and transmitting, by the piloting agent, the moved position to the following agents until the piloting agent reaches the preset number of training times;
in the step S4, the piloting agent calculates the obtained return by the following formula,
return function R(S_t, A_t) = w_c × R_current(S_t, A_t) + w_h × H(S_t, A_t)
wherein S_t is the state of the piloting agent at time t; A_t is the action of the piloting agent at time t; R_current(S_t, A_t) is the current-position return function of the piloting agent; H(S_t, A_t) is a heuristic function that calculates the diagonal distance between the current position of the piloting agent and the target position; w_c is a first coefficient and is positive; and w_h is a second coefficient and is negative;
s5, when the following agent acquires the current position information of the piloting agent, the following agent acquires an expected target position according to the current position information of the piloting agent, the following agent selects and executes actions corresponding to the state with the minimum cost according to a cost function, and meanwhile, the following agent avoids obstacles according to a virtual obstacle filling obstacle avoidance strategy and a dynamic obstacle avoidance mechanism and moves towards the expected target position; when the following agent does not acquire the current position information of the pilot agent, the following agent path planning is finished;
in the step S5,
the cost function C(s_t, a_t) = c × d_attr + R_static(s_t, a_t)
wherein s_t is the state of the following agent at time t, a_t is the action of the following agent at time t, d_attr is the attractive potential field calculated from the Euclidean distance between the current position of the following agent and the target position, R_static(s_t, a_t) is the static-obstacle penalty function, and c is an adjustment coefficient;
in the step S5, the following agent performs obstacle avoidance according to a dynamic obstacle avoidance mechanism, including:
when a dynamic obstacle appears at a position adjacent to the current position of the following agent, acquiring the repulsive potential field of the dynamic obstacle at the current position of the following agent, calculating the distance between the current position of the following agent and the dynamic obstacle, performing a force analysis on the attraction of the expected target position and the repulsion of the dynamic obstacle, determining a temporary target position for the following agent to avoid the dynamic obstacle, and performing obstacle avoidance;
the repulsive potential field is calculated by the following formula,
wherein (x_s, y_s) are the coordinates of the current state of the following agent, and (x_obst, y_obst) are the coordinates of the dynamic obstacle.
2. The method according to claim 1, wherein in the step S2, the exploration probability epsilon is calculated by the following formula:
wherein Q(S, A_random) is the Q value of a randomly selected action in state S, Q(S, A_max) is the maximum Q value in state S, Q is a non-zero constant, and T is a temperature control parameter in the simulated annealing method.
3. The method according to claim 2, wherein after obtaining the exploration probability by calculation, obtaining a random number, and when the exploration probability is greater than the random number, the piloting agent randomly selects an action; and when the exploration probability is smaller than or equal to the random number, the piloting intelligent agent selects an action corresponding to the maximum Q value in the current state.
4. The method according to claim 1, wherein the step S3 further comprises:
step S31: acquiring the positions adjacent to the current position of the piloting agent and calculating first distances between the adjacent positions and the target position, so as to judge from the first distances whether the current position of the piloting agent is heading into a concave obstacle and to avoid the obstacle by filling, wherein the distance between the current position of the piloting agent and the target position is a second distance;
step S32: when the first distance is smaller than the second distance, judging whether the current adjacent position is an obstacle, and when it is not an obstacle, judging it to be a feasible adjacent position;
step S33: filling the current position of the piloting agent as a virtual obstacle when no feasible adjacent position exists.
5. The method according to claim 1, wherein in the step S1, the Q value initialization is performed by the following formula,
wherein, in the return value, k is a proportional coefficient, γ is a discount factor, ζ is a negative adjustment coefficient, ρ_aim(S') is the distance between the current position of the piloting agent and the target position, and η is a constant.
6. A storage medium storing a computer program comprising program instructions that when executed by a processor cause the processor to perform the method for pilot following multi-agent formation path planning of any one of claims 1 to 5.
CN202110985503.9A 2021-08-26 2021-08-26 Method and storage medium for pilot following type multi-agent formation path planning Active CN113534819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110985503.9A CN113534819B (en) 2021-08-26 2021-08-26 Method and storage medium for pilot following type multi-agent formation path planning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110985503.9A CN113534819B (en) 2021-08-26 2021-08-26 Method and storage medium for pilot following type multi-agent formation path planning

Publications (2)

Publication Number Publication Date
CN113534819A CN113534819A (en) 2021-10-22
CN113534819B (en) 2024-03-15

Family

ID=78092092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110985503.9A Active CN113534819B (en) 2021-08-26 2021-08-26 Method and storage medium for pilot following type multi-agent formation path planning

Country Status (1)

Country Link
CN (1) CN113534819B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114217605B (en) * 2021-11-12 2024-05-14 江苏大学 Formation control method based on virtual structure method and artificial potential field method
CN114690798A (en) * 2021-12-27 2022-07-01 西安理工大学 Drag reduction control algorithm based on V-shaped multi-agent aircraft
CN114326736A (en) * 2021-12-29 2022-04-12 深圳鹏行智能研究有限公司 Following path planning method and foot type robot
CN114879702B (en) * 2022-07-06 2022-09-30 季华实验室 Multi-agent inspection control method, device, system, equipment and medium
CN118778684B (en) * 2024-09-13 2024-11-15 北京昂飞科技有限公司 Decentralization cluster formation control method and system based on virtual particle control

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102215260A (en) * 2011-06-02 2011-10-12 北京理工大学 Multi-agent system cooperative control method capable of maintaining connectivity
CN105867365A (en) * 2016-03-11 2016-08-17 中国矿业大学(北京) Path programming and navigation system based on improved artificial potential field method and method thereof
US10140875B1 (en) * 2017-05-27 2018-11-27 Hefei University Of Technology Method and apparatus for joint optimization of multi-UAV task assignment and path planning
CN109521794A (en) * 2018-12-07 2019-03-26 南京航空航天大学 A kind of multiple no-manned plane routeing and dynamic obstacle avoidance method
CN110456792A (en) * 2019-08-06 2019-11-15 清华大学 Method and device for multi-agent swarm system navigation and obstacle avoidance in dynamic environment
AU2020103439A4 (en) * 2020-11-14 2021-01-28 V. Alfred Franklin Development of advanced multi agent system with mobile robot for control strategy
CN112344944A (en) * 2020-11-24 2021-02-09 湖北汽车工业学院 A Reinforcement Learning Path Planning Method Using Artificial Potential Field
CN112666939A (en) * 2020-12-09 2021-04-16 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
WO2021114888A1 (en) * 2019-12-10 2021-06-17 南京航空航天大学 Dual-agv collaborative carrying control system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10114384B2 (en) * 2016-09-13 2018-10-30 Arrowonics Technologies Ltd. Formation flight path coordination of unmanned aerial vehicles

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102215260A (en) * 2011-06-02 2011-10-12 北京理工大学 Multi-agent system cooperative control method capable of maintaining connectivity
CN105867365A (en) * 2016-03-11 2016-08-17 中国矿业大学(北京) Path programming and navigation system based on improved artificial potential field method and method thereof
US10140875B1 (en) * 2017-05-27 2018-11-27 Hefei University Of Technology Method and apparatus for joint optimization of multi-UAV task assignment and path planning
CN109521794A (en) * 2018-12-07 2019-03-26 南京航空航天大学 A kind of multiple no-manned plane routeing and dynamic obstacle avoidance method
CN110456792A (en) * 2019-08-06 2019-11-15 清华大学 Method and device for multi-agent swarm system navigation and obstacle avoidance in dynamic environment
WO2021114888A1 (en) * 2019-12-10 2021-06-17 南京航空航天大学 Dual-agv collaborative carrying control system and method
AU2020103439A4 (en) * 2020-11-14 2021-01-28 V. Alfred Franklin Development of advanced multi agent system with mobile robot for control strategy
CN112344944A (en) * 2020-11-24 2021-02-09 湖北汽车工业学院 A Reinforcement Learning Path Planning Method Using Artificial Potential Field
CN112666939A (en) * 2020-12-09 2021-04-16 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An Improved Self-Organizing Map Method for Task Assignment and Path Planning of Multirobot in Obstacle Environment; Wei Sun; 2018 Chinese Automation Congress; pp. 7-12 *
Multi-AGV path planning method based on multi-agent reinforcement learning; Liu Hui, et al.; Automation & Instrumentation, No. 02; pp. 90-95 *
Research on finite-time consensus control of leader-following multi-robot systems; Sun Yujiao; Complex Systems and Complexity Science, Vol. 17, No. 4; pp. 66-72 *

Also Published As

Publication number Publication date
CN113534819A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113534819B (en) Method and storage medium for pilot following type multi-agent formation path planning
CN105717929B (en) Mobile robot mixed path planing method under a kind of multiresolution obstacle environment
CN104571113B (en) The paths planning method of mobile robot
CN113110509A (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN110610271A (en) A Multiple Vehicle Trajectory Prediction Method Based on Long Short-Term Memory Network
Saulnier et al. Information theoretic active exploration in signed distance fields
CN114859911A (en) Four-legged robot path planning method based on DRL
Yokoyama et al. Autonomous mobile robot with simple navigation system based on deep reinforcement learning and a monocular camera
KR102622243B1 (en) Method and system for determining action of device for given state using model trained based on risk measure parameter
CN112356031A (en) On-line planning method based on Kernel sampling strategy under uncertain environment
CN116679711A (en) Robot obstacle avoidance method based on model-based reinforcement learning and model-free reinforcement learning
Hanten et al. Vector-AMCL: Vector based adaptive monte carlo localization for indoor maps
CN114077807A (en) Computer implementation method and equipment for controlling mobile robot based on semantic environment diagram
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
Doellinger et al. Environment-aware multi-target tracking of pedestrians
Shi et al. Enhanced spatial attention graph for motion planning in crowded, partially observable environments
CN117301061A (en) Robot turning and squatting stable standing control method and device and related equipment
Chatterjee Differential evolution tuned fuzzy supervisor adapted extended Kalman filtering for SLAM problems in mobile robots
CN106384152B (en) PF space non-cooperative target orbital prediction methods based on firefly group&#39;s optimization
Wang et al. Towards better generalization in quadrotor landing using deep reinforcement learning
Martinez et al. Deep reinforcement learning oriented for real world dynamic scenarios
Martín et al. Octree-based localization using RGB-D data for indoor robots
Li et al. Navigation Simulation of Autonomous Mobile Robot Based on TEB Path Planner
Kormushev et al. Comparative evaluation of reinforcement learning with scalar rewards and linear regression with multidimensional feedback
SunWoo et al. Comparison of deep reinforcement learning algorithms: Path Search in Grid World

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant