CN116307464A - AGV task allocation method based on multi-agent deep reinforcement learning - Google Patents

AGV task allocation method based on multi-agent deep reinforcement learning

Info

Publication number
CN116307464A
Authority
CN
China
Prior art keywords
agv
reinforcement learning
deep reinforcement
task allocation
agent deep
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211683067.0A
Other languages
Chinese (zh)
Inventor
郭斌
李梦媛
刘佳琪
刘思聪
於志文
邱晨
王亮
王柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211683067.0A priority Critical patent/CN116307464A/en
Publication of CN116307464A publication Critical patent/CN116307464A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • G06Q10/0631: Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312: Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/08: Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083: Shipping
    • G06Q10/0835: Relationships between shipper or supplier and carriers
    • G06Q10/08355: Routing methods

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an AGV task allocation method based on multi-agent deep reinforcement learning, which first constructs an AGV cargo handling environment and a kinematic model of the AGV; secondly establishes a partially observable Markov decision model and designs an improved information potential field reward function; then trains based on the multi-agent deep reinforcement learning method MADDPG, and finally deploys the trained policy network to each AGV for distributed collaborative cargo handling. The method further improves the coordination capability among independent agents on the basis of the existing multi-agent deep reinforcement learning model, so that the AGVs carry out the handling work in a parallel, distributed and coordinated manner and complete the predefined overall objective more effectively.

Description

AGV task allocation method based on multi-agent deep reinforcement learning
Technical Field
The invention belongs to the technical field of multi-agent cooperation, and particularly relates to an AGV task allocation method based on multi-agent deep reinforcement learning.
Background
With the advance of Industry 4.0, automated guided vehicles (Automated Guided Vehicle, AGV), as intelligent devices integrating a variety of advanced technologies, have been widely used for flexible-shop material handling due to their high degree of autonomy and flexibility. An efficient group AGV task allocation strategy can reduce transportation cost and improve distribution efficiency. However, how to distribute multiple tasks among multiple AGVs so that the path cost of the AGV system is lowest and its operating efficiency is highest remains a key challenge.
Traditional research applies classical optimization algorithms, such as genetic algorithms, particle swarm algorithms and ant colony algorithms, to the field of AGV task allocation. However, these centralized task allocation methods take maximizing the benefit of the whole system as the optimization target, and the control center gathers global information to make a unified decision, which places high demands on the computing capacity and real-time capability of the control center. In addition, small information changes or perturbations may affect the overall plan, so adaptability and scalability are poor. Unlike centralized decision-making, distributed or decentralized decision methods can reasonably distribute the computational load and fully exploit the autonomous decision capability of each agent; they not only reduce the complexity of system modeling but also improve the robustness and scalability of the system.
The continued development of multi-agent deep reinforcement learning (MADRL) provides a new way to implement distributed task allocation for groups of AGVs. The agent interacts with the environment through trial and error, learns and optimizes by maximizing the cumulative reward, and finally reaches an optimal strategy. Autonomous decision-making methods based on multi-agent deep reinforcement learning can complete more complex tasks through interaction and decision-making in higher-dimensional, dynamic scenes. However, when practical application problems are tackled directly from the multi-agent deep reinforcement learning perspective, challenges such as environmental non-stationarity and partial observability arise. Furthermore, the reward mechanism of a multi-agent system is more complex than that of a single-agent system, and the reward sparsity problem often makes model training difficult to converge. Therefore, how to design an effective reward mechanism to improve model performance and accelerate model convergence is a key issue.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides an AGV task allocation method based on multi-agent deep reinforcement learning, which uses an improved information potential field reward function to alleviate reward sparsity, providing continuous rewards for the AGVs and implicitly guiding them toward different cargo targets.
Technical proposal
An AGV task allocation method based on multi-agent deep reinforcement learning, characterized in that: the method combines a multi-agent deep reinforcement learning model with an information potential field reward mechanism; in the multi-agent deep reinforcement learning model, each AGV learns and optimizes by maximizing the cumulative reward and finally reaches an optimal task allocation strategy; the information potential field reward mechanism uses virtual information gradient diffusion of the target-position data to guide the AGVs to move toward the target positions along specific gradient directions.
The invention further adopts the technical scheme that: the method comprises the following steps:
step 1: constructing an AGV cargo carrying environment and constructing a kinematic model of the AGV;
step 2: establishing a partially observable Markov decision model, and determining the action space, state space and reward function;
step 3: calculating an information potential field rewarding function based on the current state, and providing continuous rewards for AGV decision-making;
step 4: training based on a multi-agent deep reinforcement learning method MADDPG;
step 5: deploying the trained policy network to each AGV; each AGV obtains action instructions according to its own local observations to carry out distributed collaborative cargo handling.
The invention further adopts the technical scheme that: a plurality of AGVs are arranged in the kinematic model of the AGVs, and are modeled as discs with the radius of R; the distance between any two AGVs in the model is greater than 2R to avoid collisions between AGVs.
The invention further adopts the technical scheme that: setting the target position information potential value as a fixed positive information potential value, setting the position information potential values of other AGVs as fixed negative information potential values, setting the boundary information potential value as 0, and iterating by using a formula; after iteration, each position has a corresponding information potential value; the AGV obtains a reward value r according to the information potential value of the position where the time step t is.
The invention further adopts the technical scheme that: the AGV obtains a reward r at a time step t IPF 、r g And r c Three-part sum represents:
Figure BDA0004018925750000031
the prize r, r obtained by AGVi at time step t g Exciting only one AGV near each task point, r c It is desirable to minimize collisions, r IPF An implicit thrust is given to the AGV to guide the AGV to move toward the target position in a scattered manner.
The invention further adopts the technical scheme that: and (4) training the MADDPG by the multi-agent deep reinforcement learning method in the step (4) comprises parameter updating of an Actor network and parameter updating of a Critic network.
The invention further adopts the technical scheme that: in the step 4, when the target network parameter is updated, a soft update strategy is adopted: θ'. i ←(1-τ)θ′ i +τθ i Wherein τ represents soft substitution, representing the update amplitude of the target network parameter; θ'. i 、θ i Respectively representing target network parametersAnd estimating network parameters.
A computer system, comprising: one or more processors, a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods described above.
A computer readable storage medium, characterized by storing computer executable instructions that when executed are configured to implement the method described above.
Advantageous effects
According to the AGV task allocation method MADDPG-IPF based on multi-agent deep reinforcement learning, an AGV cargo handling environment and a kinematic model of the AGV are first constructed; secondly, a partially observable Markov decision model is established and an improved information potential field reward function is designed; training is then carried out based on the multi-agent deep reinforcement learning method MADDPG, and finally the trained policy network is deployed to each AGV for distributed collaborative cargo handling. The method further improves the coordination capability among independent agents on the basis of the existing multi-agent deep reinforcement learning model, so that the AGVs carry out the handling work in a parallel, distributed and coordinated manner and complete the predefined overall objective more effectively.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a block diagram of an AGV task allocation method based on multi-agent deep reinforcement learning in an example of the present invention.
FIG. 2 is a diagram of the network model based on multi-agent deep reinforcement learning and information potential field rewards in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The principle of the AGV task allocation method based on multi-agent deep reinforcement learning provided by the invention is as follows: the traditional centralized AGV task allocation methods place high demands on the real-time capability and decision-making capability of the control center and lack adaptability and scalability. Autonomous decision-making methods based on multi-agent deep reinforcement learning can complete complex tasks through interaction and decision-making in higher-dimensional, dynamic scenes. We propose a solution based on multi-agent deep reinforcement learning (MADDPG-IPF), in which multiple AGVs achieve self-organizing task allocation by trial and error. In addition, we design an information potential field reward mechanism that provides the AGVs with continuous rewards at each time step to address the problem of sparse rewards. The invention improves the cooperation capability among independent AGVs and provides a solution for flexible, self-organizing task allocation of group AGVs.
The scheme of the invention comprises the following steps:
multi-agent deep reinforcement learning model: the AGV learns and optimizes in a mode of maximizing the accumulated rewards, and finally, an optimal task allocation strategy is achieved;
information potential field rewarding mechanism: the AGV is guided to move to the target position along a specific gradient direction by utilizing virtual information gradient diffusion of the data of the target position;
MADDPG-IPF: combines multi-agent reinforcement learning with the information potential field technique to further improve the coordination capability among independent agents, so that the AGVs carry out the handling work in a parallel, distributed and coordinated manner.
The specific steps of the invention are as follows:
step one: constructing AGV (automatic guided vehicle) cargo carrying environment and constructing AGV kinematic model
The invention assumes a total of N_v AGVs and models them as disks of radius R. All AGVs are homogeneous, having the same parameters and functions. At each time step, the tuple

$$s_i^t = \left(p_i^t,\; v_i^t,\; r_i\right)$$

represents the state of the i-th AGV (1 ≤ i ≤ N_v), where p_i^t is its position, v_i^t its velocity and r_i its perception range. The i-th AGV obtains the observation o_i^t within its perception range r_i and computes the action a_i^t according to the policy π_θ, where θ denotes the policy parameters; the computed action a_i^t controls the velocity to be switched to at the next step, directing the AGVs to reach different task target points while avoiding collisions with other AGVs.
Define L = {l_i | i = 1, …, N} as the trajectories of all AGVs, which satisfy the following physical constraints:

$$v_i^t = \pi_\theta\!\left(o_i^t\right)$$

$$\left\|v_i^t\right\| \le v_{max}$$

$$p_i^t = p_i^{t-1} + v_i^t \cdot \Delta t$$

$$\left\|p_i^t - p_j^t\right\| > 2R,\qquad \forall\, i \ne j$$

where the first formula states that the velocity v_i^t of the AGV is given by the action selected by the policy π_θ from the current observation; the second formula states that the travel speed of the AGV cannot exceed its maximum speed; the third formula states that the current position of the AGV is determined by its previous position and current velocity; and the last formula states that the distance between any two AGVs must be greater than 2R to avoid collisions between AGVs.
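As an illustrative sketch only (not the patent's implementation), the following Python snippet applies these constraints in one simulation step; the parameter values (v_max, dt, radius) and the Euler-style position update are assumptions.

```python
import numpy as np

def step_agvs(positions, actions, v_max=1.0, dt=0.1, radius=0.5):
    """Advance all AGVs one time step under the physical constraints above.

    positions, actions: float arrays of shape (N, 2). The action is taken as the
    velocity command selected by the policy and clipped to the maximum speed; the
    position update follows p_t = p_{t-1} + v_t * dt, and pairs of AGVs closer
    than 2R are flagged as collisions.
    """
    new_vel = np.asarray(actions, dtype=float)
    speed = np.linalg.norm(new_vel, axis=1, keepdims=True)
    new_vel = new_vel * np.minimum(1.0, v_max / np.maximum(speed, 1e-8))  # ||v|| <= v_max

    new_pos = np.asarray(positions, dtype=float) + new_vel * dt  # p_t = p_{t-1} + v_t * dt

    diff = new_pos[:, None, :] - new_pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)
    collided = (dist < 2 * radius).any(axis=1)  # violation of ||p_i - p_j|| > 2R
    return new_pos, new_vel, collided
```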
Step two: establishing a locally observable Markov decision model, and determining an action space, a state space and a reward function
In a real scenario, the AGV agent observes the environmental conditions and selects actions based on the acquired observations; since in most cases only local observations are available, this process is typically modeled as a partially observable Markov decision process (POMDP). In general, a POMDP can be represented by a six-tuple M = (N, S, A, P, R, O), where N is the number of agents, S represents the state space of the system, A represents the joint action space of all agents, P represents the state transition probability matrix, R is the reward function, and O is the distribution of observations obtained from the system state, with o ~ O(s).
For the specific problem of cooperative task allocation among a group of AGVs, the state space S, action space A and reward function R are designed as follows:

State space S: the state space is designed as {v, p, D_A, D_B}, where (v, p) represents the velocity and position of the agent itself, and {D_A, D_B} the relative distances to the target points and to the other agents.
Action space A: to be closer to reality, the action space of the agent is set as a continuous action space, represented by a vector {x, y} with values in the interval (-1, 1), denoting the agent's acceleration in the forward-backward and left-right directions at the current moment. The velocity the AGV will switch to next can then be calculated by combining the AGV's own weight and the damping.
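As a hedged sketch of this velocity update (the mass, damping coefficient and time step used here are illustrative assumptions, not values given by the patent):

```python
def next_velocity(velocity, action, mass=1.0, damping=0.25, dt=0.1, v_max=1.0):
    """Map a continuous action {x, y} in (-1, 1) to the AGV's next velocity:
    the action acts as a force command, the current velocity decays with the
    damping factor, and the result is capped at the maximum speed."""
    vx = velocity[0] * (1.0 - damping) + (action[0] / mass) * dt
    vy = velocity[1] * (1.0 - damping) + (action[1] / mass) * dt
    speed = (vx ** 2 + vy ** 2) ** 0.5
    if speed > v_max:
        vx, vy = vx * v_max / speed, vy * v_max / speed
    return vx, vy
```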
Reward function R: the goal of task allocation is that multiple AGVs reach different task target points in a self-organizing, decentralized manner in as short a time as possible while avoiding collisions. A general reward function is designed to achieve this goal: the agent obtains a target reward when it reaches a target position, and incurs a collision penalty when it collides with another agent or a wall.
The target reward r_g and the collision penalty r_c are defined piecewise in terms of d_ij, the distance from task point j to AGV i: r_g is granted when AGV i reaches the vicinity of a task point, and r_c is imposed when it collides with another AGV or a wall.
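The following sketch illustrates one plausible form of these two terms; the arrival threshold, reward magnitude and penalty value are assumptions made for illustration, since the exact piecewise formulas are not reproduced here.

```python
def target_and_collision_rewards(dist_to_tasks, dist_to_agents,
                                 arrive_eps=0.3, radius=0.5,
                                 goal_reward=1.0, collision_penalty=-1.0):
    """Compute r_g and r_c for one AGV.

    dist_to_tasks:  distances d_ij from this AGV to every task point j.
    dist_to_agents: distances from this AGV to every other AGV.
    """
    # r_g: granted when the AGV has reached (is within arrive_eps of) some task point.
    r_g = goal_reward if min(dist_to_tasks) < arrive_eps else 0.0
    # r_c: imposed when the AGV overlaps another AGV (distance below 2R).
    r_c = collision_penalty if any(d < 2 * radius for d in dist_to_agents) else 0.0
    return r_g, r_c
```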
Step three: calculating an informative potential field rewarding function based on current state, providing continuous rewards for AGV decisions
By dividing the environment into a bounded grid map, setting positive information potential values at the positions of the target points and negative information potential values at the positions of other AGVs, the AGVs can be implicitly guided to move to different targets in a dispersed manner. Specifically, the information potential value of each target position is set to a fixed positive value, the information potential values of the positions of other AGVs are set to fixed negative values, the boundary information potential value is set to 0, and the following formula is iterated:
$$\Phi_{k+1}(u) = \frac{1}{d(u)} \sum_{w \in N(u)} \Phi_k(w)$$
where Φ_k(u) is the information potential value of node u in the k-th round, N(u) is the set of neighbor nodes of node u, and d(u) is the number of neighbor nodes of node u.
After the iteration, each position has a corresponding information potential value, and the AGV obtains the reward value r_IPF according to the information potential value of its position at time step t. The information potential values near a target position are higher, and when several targets are located close together, their information potential fields superpose to an even higher value, attracting AGVs toward regions containing more targets. However, if other AGVs are already in the vicinity of a target, a negative information potential field is created there and the attraction on further AGVs is reduced. This design effectively prevents multiple agents from contending for the same target point and guides the agents to different task target points in a self-organizing, decentralized manner, achieving cooperative transportation.
Overall, the invention represents the reward r that AGV i obtains at time step t as the sum of r_IPF, r_g and r_c, where r_g rewards only one AGV approaching each task point, r_c aims to minimize collisions, and r_IPF gives the AGVs an implicit push guiding them to move toward the target positions in a dispersed manner:

$$r_i^t = r_{IPF} + r_g + r_c$$
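A minimal sketch of computing the information potential field on a bounded grid and reading off r_IPF is given below; it assumes the neighbor-averaging iteration reconstructed above, and the grid size, source magnitudes and iteration count are illustrative assumptions.

```python
import numpy as np

def information_potential_field(shape, target_cells, agv_cells,
                                target_value=1.0, agv_value=-0.5, iters=50):
    """Diffuse potentials over a bounded grid: target cells hold a fixed positive
    potential, cells occupied by other AGVs a fixed negative potential, and the
    boundary stays at 0; every free cell is repeatedly replaced by the average of
    its four neighbors."""
    phi = np.zeros(shape)
    fixed = np.zeros(shape, dtype=bool)
    fixed[0, :] = fixed[-1, :] = fixed[:, 0] = fixed[:, -1] = True  # boundary = 0
    for r, c in target_cells:
        phi[r, c], fixed[r, c] = target_value, True
    for r, c in agv_cells:
        phi[r, c], fixed[r, c] = agv_value, True
    for _ in range(iters):
        avg = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
               np.roll(phi, 1, 1) + np.roll(phi, -1, 1)) / 4.0
        phi = np.where(fixed, phi, avg)
    return phi

# The r_IPF term for an AGV is then simply the potential of the grid cell it occupies:
# r_ipf = phi[row, col]
```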
Step four: training based on multi-agent deep reinforcement learning method MADDPG
In the multi-agent training part, the method adopts MADDPG, an algorithm based on the Actor-Critic framework. In the training stage, the Actor network combines the policy gradient method with a state-action value function, computes deterministic optimal actions from the current state, and optimizes its neural network parameters θ according to the scores the Critic network assigns to those actions. The Critic network uses the observations of all agents and evaluates the actions generated by the Actor network by computing the TD error.
The parameter update of the Actor network in the MADDPG algorithm can be given by:
$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x,a\sim \mathcal{D}}\!\left[\nabla_{\theta_i}\mu_i(o_i)\,\nabla_{a_i}Q_i^{\mu}(x,a_1,\ldots,a_n)\big|_{a_i=\mu_i(o_i)}\right]$$
where o_i denotes the local observation acquired by the i-th agent, x = [o_1, …, o_n] denotes the global observation vector, i.e., the state at this moment, integrating the information acquired by all agents, and Q_i^μ(x, a_1, …, a_n) denotes the centralized state-action value function of the i-th agent.
The parameter update of the Critic network in the MADDPG algorithm can be given by:
$$\mathcal{L}(\theta_i)=\mathbb{E}_{x,a,r,x'}\!\left[\left(Q_i^{\mu}(x,a_1,\ldots,a_n)-y\right)^{2}\right],\qquad y=r_i+\gamma\,Q_i^{\mu'}(x',a_1',\ldots,a_n')\big|_{a_j'=\mu_j'(o_j)}$$
where Q_i^{μ'} denotes the target network and μ' = [μ'_1, μ'_2, …, μ'_n] are the target policies, whose parameters θ'_j are updated with a lag. In addition, when updating the target network parameters, a soft update strategy is generally adopted:
$$\theta'_i \leftarrow (1-\tau)\theta'_i + \tau\theta_i$$
where τ is the soft-replacement coefficient, representing the update magnitude of the target network parameters; θ'_i and θ_i denote the target network parameters and the estimated network parameters, respectively.
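For illustration, a small numpy sketch of the soft target-network update and of the TD target y used by the centralized Critic; the dictionary parameter containers and the target_q callable are placeholders rather than the patent's network code.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.01):
    """theta'_i <- (1 - tau) * theta'_i + tau * theta_i, applied parameter by parameter."""
    return {name: (1.0 - tau) * np.asarray(target_params[name]) + tau * np.asarray(online_params[name])
            for name in target_params}

def critic_td_target(reward_i, next_state, next_joint_action, target_q, gamma=0.95):
    """y = r_i + gamma * Q_i^{mu'}(x', a_1', ..., a_n'), with the next joint action
    taken from the target policies mu'."""
    return reward_i + gamma * target_q(next_state, next_joint_action)
```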
Step five: Deploying the trained policy network to each AGV; each AGV obtains action instructions according to its own local observations to carry out distributed collaborative cargo handling.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.

Claims (9)

1. An AGV task allocation method based on multi-agent deep reinforcement learning, characterized in that: the method combines a multi-agent deep reinforcement learning model with an information potential field reward mechanism; in the multi-agent deep reinforcement learning model, each AGV learns and optimizes by maximizing the cumulative reward and finally reaches an optimal task allocation strategy; the information potential field reward mechanism uses virtual information gradient diffusion of the target-position data to guide the AGVs to move toward the target positions along specific gradient directions.
2. The AGV task allocation method based on multi-agent deep reinforcement learning according to claim 1, characterized in that the method comprises the following steps:
step 1: constructing an AGV cargo carrying environment and constructing a kinematic model of the AGV;
step 2: establishing a partially observable Markov decision model, and determining the action space, state space and reward function;
step 3: calculating an information potential field rewarding function based on the current state, and providing continuous rewards for AGV decision-making;
step 4: training based on a multi-agent deep reinforcement learning method MADDPG;
step 5: deploying the trained policy network to each AGV; each AGV obtains action instructions according to its own local observations to carry out distributed collaborative cargo handling.
3. The AGV task allocation method based on multi-agent deep reinforcement learning according to claim 1, wherein the AGV task allocation method comprises the following steps: a plurality of AGVs are arranged in the kinematic model of the AGVs, and are modeled as discs with the radius of R;
the distance between any two AGVs in the model is greater than 2R to avoid collisions between AGVs.
4. The AGV task allocation method based on multi-agent deep reinforcement learning according to claim 2, characterized in that: the information potential value of the target position is set to a fixed positive value, the information potential values of the positions of other AGVs are set to fixed negative values, and the boundary information potential value is set to 0, after which a formula is iterated; after the iteration, each position has a corresponding information potential value, and the AGV obtains a reward value r according to the information potential value of its position at time step t.
5. The AGV task allocation method based on multi-agent deep reinforcement learning according to claim 4, characterized in that: the reward r obtained by AGV i at time step t is represented as the sum of three parts, r_IPF, r_g and r_c:

$$r_i^t = r_{IPF} + r_g + r_c$$

where r_g rewards only one AGV approaching each task point, r_c aims to minimize collisions, and r_IPF gives the AGV an implicit push guiding it to move toward the target positions in a dispersed manner.
6. The AGV task allocation method based on multi-agent deep reinforcement learning according to claim 2, characterized in that: the training based on the multi-agent deep reinforcement learning method MADDPG in step 4 comprises parameter updating of the Actor network and parameter updating of the Critic network.
7. The AGV task allocation method based on multi-agent deep reinforcement learning according to claim 6, characterized in that: in step 4, a soft update strategy is adopted when updating the target network parameters: θ'_i ← (1-τ)θ'_i + τθ_i, where τ is the soft-replacement coefficient, representing the update magnitude of the target network parameters; θ'_i and θ_i denote the target network parameters and the estimated network parameters, respectively.
8. A computer system, comprising: one or more processors, a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
9. A computer readable storage medium, characterized by storing computer executable instructions that, when executed, are adapted to implement the method of claim 1.
CN202211683067.0A 2022-12-27 2022-12-27 AGV task allocation method based on multi-agent deep reinforcement learning Pending CN116307464A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211683067.0A CN116307464A (en) 2022-12-27 2022-12-27 AGV task allocation method based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211683067.0A CN116307464A (en) 2022-12-27 2022-12-27 AGV task allocation method based on multi-agent deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116307464A true CN116307464A (en) 2023-06-23

Family

ID=86833040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211683067.0A Pending CN116307464A (en) 2022-12-27 2022-12-27 AGV task allocation method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116307464A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116779150A (en) * 2023-07-03 2023-09-19 浙江一山智慧医疗研究有限公司 Personalized medical decision method, device and application based on multi-agent interaction
CN116779150B (en) * 2023-07-03 2023-12-22 浙江一山智慧医疗研究有限公司 Personalized medical decision method, device and application based on multi-agent interaction
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117406706B (en) * 2023-08-11 2024-04-09 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117236821A (en) * 2023-11-10 2023-12-15 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning
CN117236821B (en) * 2023-11-10 2024-02-06 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning
CN117272842A (en) * 2023-11-21 2023-12-22 中国电建集团西北勘测设计研究院有限公司 Cooperative control system and method for multi-industrial park comprehensive energy system
CN117272842B (en) * 2023-11-21 2024-02-27 中国电建集团西北勘测设计研究院有限公司 Cooperative control system and method for multi-industrial park comprehensive energy system

Similar Documents

Publication Publication Date Title
CN116307464A (en) AGV task allocation method based on multi-agent deep reinforcement learning
Chen et al. Adversarial evaluation of autonomous vehicles in lane-change scenarios
Enthrakandi Narasimhan et al. Implementation and study of a novel approach to control adaptive cooperative robot using fuzzy rules
CN110442129B (en) Control method and system for multi-agent formation
Khan et al. Learning safe unlabeled multi-robot planning with motion constraints
Liu et al. A residual convolutional neural network based approach for real-time path planning
CN114815882A (en) Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning
CN115330095A (en) Mine car dispatching model training method, device, chip, terminal, equipment and medium
CN116136945A (en) Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
CN116551703B (en) Motion planning method based on machine learning in complex environment
Zheng et al. A behavior decision method based on reinforcement learning for autonomous driving
Fan et al. Spatiotemporal path tracking via deep reinforcement learning of robot for manufacturing internal logistics
CN112034844A (en) Multi-intelligent-agent formation handling method, system and computer-readable storage medium
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
CN116069023B (en) Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116466662A (en) Multi-AGV intelligent scheduling method based on layered internal excitation
Lin et al. A time-driven workflow scheduling strategy for reasoning tasks of autonomous driving in edge environment
CN114706384A (en) Multi-machine navigation method, system and medium for maintaining connectivity
Zhang et al. Multi-target encirclement with collision avoidance via deep reinforcement learning using relational graphs
CN115187056A (en) Multi-agent cooperative resource allocation method considering fairness principle
CN116449864A (en) Optimal path selection method for unmanned aerial vehicle cluster
CN110989602A (en) Method and system for planning paths of autonomous guided vehicle in medical pathological examination laboratory
Zhu et al. A cooperative task assignment method of multi-UAV based on self organizing map
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
Zhang et al. Dispatching and path planning of automated guided vehicles based on petri nets and deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination