CN116466662A - Multi-AGV intelligent scheduling method based on layered internal excitation - Google Patents

Multi-AGV intelligent scheduling method based on layered internal excitation

Info

Publication number
CN116466662A
CN116466662A
Authority
CN
China
Prior art keywords
agv
rewards
intrinsic
balancer
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310346390.7A
Other languages
Chinese (zh)
Inventor
郭斌
张江山
於志文
孙卓
刘佳琪
王亮
李梦媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202310346390.7A
Publication of CN116466662A
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 Programme-control systems
    • G05B19/02 Programme-control systems electric
    • G05B19/418 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/4189 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by the transport system
    • G05B19/41895 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by the transport system using automatic guided vehicles [AGV]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 Program-control systems
    • G05B2219/30 Nc systems
    • G05B2219/31 From computer integrated manufacturing till monitoring
    • G05B2219/31002 Computer controlled agv conveys workpieces between buffer and cell

Abstract

The invention relates to a multi-AGV intelligent scheduling method based on hierarchical intrinsic rewards. First, a partially observable Markov decision model is established for a multi-AGV workshop handling scene; second, a hierarchical intrinsic reward mechanism is proposed to adjust the weights of the AGV's two intrinsic rewards in real time; training is then performed with the multi-agent deep reinforcement learning method BicNet, and finally the trained policy network is deployed to each AGV for multi-AGV intelligent scheduling. With the proposed scheduling method based on multi-agent reinforcement learning, each AGV can perform self-organized intelligent scheduling based on its policy network, which improves the autonomous scheduling capability and task completion level of the multi-AGV system and provides a solution for realizing self-learning, self-organized multi-AGV intelligent scheduling.

Description

Multi-AGV intelligent scheduling method based on layered internal excitation
Technical Field
The invention belongs to the field of multi-AGV cooperative scheduling, and particularly relates to a multi-AGV intelligent scheduling method based on layered internal excitation.
Background
In recent years, with the rapid development of Internet of Things and automation technology, multi-AGV (Automated Guided Vehicle) handling systems have been widely adopted in industrial manufacturing. At the same time, demands on task diversity, accuracy and real-time performance have become more prominent, making intelligent scheduling with multi-AGV cooperation particularly important. Most traditional multi-AGV scheduling methods are based on a centralized scheduling mode, in which the scheduling strategies of the agents are planned uniformly using global information. Under centralized control, AGVs mostly follow prescribed dispatch routes to complete their tasks. Although centralized control systems largely guarantee optimal performance, two problems remain to be solved. On the one hand, this mode depends too heavily on the control center and must process global information, which places high demands on the computing capability of the server; on the other hand, the flexibility and scalability of the system are poor, and it is difficult to adapt quickly to complex and changeable dynamic scenes. Therefore, studying novel distributed multi-AGV cooperative scheduling methods, in which each AGV makes autonomous scheduling decisions under limited observation, is of great significance. Multi-agent reinforcement learning provides a new direction to explore for the distributed multi-AGV cooperative scheduling problem.
In multi-agent reinforcement learning algorithms, agents solve sequential decision problems by continuous trial and error through interaction with the environment. In terms of training mode, the problem of multi-agent joint training is addressed through centralized training and distributed execution. However, in the scenario of distributed multi-AGV collaborative scheduling, each AGV is only partially observable and cannot obtain global information, so it tends to fall into a "lazy state": it merely avoids penalties without pursuing the completion of scheduling tasks, which severely reduces scheduling efficiency. Meanwhile, because the task points to be completed are spatially dispersed, the external environment rewards of each agent are sparse. In this case, the multi-agent reinforcement learning algorithm is difficult to train to convergence. Therefore, how to motivate agents to escape the "lazy state" and reduce the impact of sparse external rewards, so as to complete tasks effectively, is a key challenge.
Disclosure of Invention
The technical problems to be solved by the invention are as follows:
in order to overcome the defects of the prior art, the invention provides a multi-AGV intelligent scheduling method based on hierarchical intrinsic rewards. First, a partially observable Markov decision model is established for a multi-AGV workshop handling scene; second, a hierarchical intrinsic reward mechanism is proposed to adjust the weights of the AGV's two intrinsic rewards in real time; training is then performed with the multi-agent deep reinforcement learning method BicNet, and finally the trained policy network is deployed to each AGV for multi-AGV intelligent scheduling.
In order to solve the technical problems, the invention adopts the following technical scheme:
a multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation, characterized by comprising the following steps:
Step 1: establish a partially observable Markov decision model based on the multi-AGV workshop handling scene;
Step 2: calculate intrinsic rewards based on a hierarchical intrinsic reward mechanism, providing continuous rewards for AGV scheduling decisions;
Step 3: train based on the multi-agent deep reinforcement learning method BicNet;
Step 4: deploy the trained policy network to each AGV; each AGV makes decision actions according to its own local observation and performs distributed collaborative scheduling.
The invention further adopts the technical scheme that: the step 1 is specifically as follows:
modeling the multi-AGV intelligent scheduling problem as a partially observable Markov decision model:
M=(N,S,A,P,R,O,γ)
wherein N, S, A, P, R, O and γ are, respectively, the number of agents, the state space, the action space, the state transition probability, the reward function, the partial observation space and the discount factor;
for the multi-AGV workshop handling scene, the agent object, the partial observation space O, the action space A and the reward function R are defined as follows:
Agent: in the multi-AGV scheduling scenario, each AGV is an agent object; it is assumed that at the beginning of each scheduling period the positions of all AGVs are randomly initialized; there are N AGVs and M task points in the workshop handling environment, and the goal of the AGVs is to maximize the number of completed tasks and minimize the time taken to reach the task points;
Partially observable space O: since each AGV is under a partially observable condition, its observation space is a subspace of the global state space, and a mapping function maps the global state to the partially observable space;
Action space A: the action space A_i of each AGV is its set of motion actions and contains three discrete actions: turn left, turn right and go straight;
Reward function R: the reward function is used to motivate the AGV to move quickly toward the task points; it is divided into an external reward R_extr and an intrinsic reward R_intr, where R_extr includes a target reward for reaching a task point, a decay penalty for each forward step, and a collision penalty.
The invention further adopts the technical scheme that: the observation space of the AGV is limited to a three-dimensional observation matrix of size L×L×3, where L represents the field-of-view width of the AGV, i.e. each AGV can observe the L×L grid of cells directly in front of it; each grid cell is encoded as a 3-dimensional tuple {Obj_id, Col_id, Info_s}: the object code, the color code, and the status information within the observation range.
The invention further adopts the technical scheme that: the step 2 is specifically as follows:
the hierarchical intrinsic reward mechanism comprises a top-level reward balancer and a bottom-level action controller;
Step 2-1: calculate two intrinsic rewards based on the bottom-level action controller: an attraction reward and a coverage reward;
attraction reward calculation: within the AGV's partial observation range, the attraction reward r_attr is established based on the route distance dis_i from the AGV's position to the observed target point; the specific calculation is as follows, where M is used to constrain the range of the reward;
coverage reward calculation: as the number of steps increases, the AGV continuously stores its historical coverage A_his, and the newly explored area A_new added by the current action is used as the coverage reward r_cover, which motivates the AGV to explore new areas and thus helps it find remote target points; the specific calculation is as follows, where M is used to constrain the range of the reward;
Step 2-2: balance the two intrinsic rewards based on the top-level reward balancer; the top-level reward balancer outputs a P value according to the AGV status information, which the AGV uses to adjust the weights of the two intrinsic rewards.
The invention further adopts the technical scheme that: the step 2-2 is specifically as follows:
specifically, the top-level reward balancer balances the two intrinsic rewards; this module contains two strategies for calculating the reward weight: a rule-based judgment and an Actor-Critic based scheme;
rule-based judgment scheme: when the AGV observes a target, it is driven by the attraction reward; when no target is observed, it is driven by the coverage reward, and the weight of the coverage reward increases with the number of steps so as to raise the degree of exploration in the later stage; the P value is calculated using the following formula:
Actor-Critic based scheme: the P value is output by the trained Actor-Critic network, so that the two intrinsic rewards are combined organically; based on the weight P output by the top-level reward balancer, the balanced intrinsic reward r_intr is given by the formula:
r_intr = P * r_cover + (1 - P) * r_attr
The invention further adopts the technical scheme that: the step 3 is specifically as follows:
training the top-level reward balancer and the bottom-level action controller separately with the BicNet algorithm;
training of the top-level reward balancer: the decision goal of each AGV is to maximize its expected cumulative individual external reward, where θ_p is the parameter of the Actor network in the reward balancer; thus, with J_i(θ_p) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
the Actor network in the top-level reward balancer adopts the policy gradient method and is trained and updated by the following formula:
the Critic network in the top-level reward balancer adopts the temporal-difference method and is trained and updated by the following formula, where ζ_p is the parameter of the Critic network:
training of the bottom-level action controller: the bottom-level action controller adopts a network training method similar to that of the top-level reward balancer, except that the top-level reward balancer is updated based on the external reward, while the action controller is updated based on the cumulative sum of the intrinsic and external rewards.
The invention further adopts the technical scheme that: in the training of the bottom-level action controller, the decision goal of the AGV is to maximize its expected cumulative total reward,
where θ is the parameter of the Actor network in the action controller and
r_i,total = r_i,extr + r_i,intr; with J_i(θ) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
correspondingly, the training update formula of the Actor network in the bottom-level action controller is as follows:
similar to the update of the Critic network in the top-level reward balancer, the training update formula of the Critic network in the action controller is as follows, where ζ is the parameter of the Critic network:
A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
A computer-readable storage medium, characterized by storing computer-executable instructions that, when executed, are configured to implement the method described above.
The invention has the beneficial effects that:
the traditional centralized scheduling method utilizes global information to uniformly plan scheduling strategies of agents, the mode is too dependent on a control center and the global information, and the flexibility and the expansibility of the system are poor. The multi-agent reinforcement learning technology can realize the autonomous decision of the agent scheduling strategy in a limited dynamic scene. Thus, the present invention designs a hierarchical intrinsic rewarding mechanism for motivating self-learning exploration of agents. Meanwhile, the invention provides a dispatching method based on multi-agent reinforcement learning, and each AGV can carry out self-organizing intelligent dispatching based on a strategy network. The invention improves the autonomous scheduling capability and task completion level of the multi-AGV and provides a solution for realizing self-learning and self-organizing intelligent scheduling of the multi-AGV.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a block diagram of a layered intrinsic excitation mechanism in an example of the invention;
FIG. 2 is a block diagram of a multi-AGV intelligent scheduling method based on hierarchical intrinsic stimulus in an example of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a multi-AGV intelligent scheduling method based on hierarchical intrinsic rewards. A partially observable Markov decision model is established for the multi-AGV workshop handling scene; a hierarchical intrinsic reward mechanism is proposed to adjust the weights of the AGV's two intrinsic rewards in real time; hierarchical joint training based on the BicNet idea yields a value network and a policy network after training converges. Each AGV then executes scheduling based on the trained policy network: it takes its local observation information as input and outputs scheduling actions, completing the intelligent scheduling of multiple AGVs.
The method comprises the following two parts:
Hierarchical intrinsic reward mechanism: by introducing two intrinsic rewards, a coverage reward and an attraction reward, the AGV is encouraged both to explore the region and to complete tasks; the relative weights of the two intrinsic rewards are adjusted in real time according to changes in the task environment, so that the agent balances exploration and task completion.
Multi-agent reinforcement learning model: based on the hierarchical intrinsic rewards and multi-agent reinforcement learning, the multiple AGVs interact with the environment and continuously train and optimize their scheduling strategies by maximizing the accumulated rewards.
The method comprises the following specific steps:
step 1: based on multiple AGV workshop carrying scenes, establishing partially observable Markov decision model
In the multi-AGV workshop handling scene, each AGV interacts with the task environment based on its own local observation to complete its scheduling decisions. Because each AGV has only limited observations, the multi-AGV intelligent scheduling problem is modeled as a partially observable Markov decision model:
M=(N,S,A,P,R,O,γ)
wherein N, S, A, P, R, O and γ are, respectively, the number of agents, the state space, the action space, the state transition probability, the reward function, the partial observation space and the discount factor. For the multi-AGV workshop handling scene, the agent object, the partial observation space O, the action space A and the reward function R are defined as follows:
Agent: in the multi-AGV scheduling scenario, each AGV is an agent object. It is assumed that at the beginning of each scheduling period the positions of all AGVs are initialized randomly. There are N AGVs and M task points in the workshop handling environment, and the goal of the AGVs is to maximize the number of completed tasks and minimize the time taken to reach the task points.
Partially observable space O: since each AGV is under a partially observable condition, its observation space O is a subspace of the global state space, and a mapping function maps the global state to the partially observable space. Specifically, the observation space of the AGV is limited to a three-dimensional observation matrix of size L×L×3, where L represents the field-of-view width of the AGV, i.e. each AGV can observe the L×L grid of cells directly in front of it. Each grid cell is encoded as a 3-dimensional tuple {Obj_id, Col_id, Info_s}: the object code, the color code, and the status information within the observation range.
Action space A: the action space A_i of each AGV is its set of motion actions and contains three discrete actions: turn left, turn right and go straight.
Reward function R: the reward function is used to motivate the AGV to move quickly toward the task points. It is divided into an external reward R_extr and an intrinsic reward R_intr. R_extr includes a target reward for reaching a task point, a decay penalty for each forward step, and a collision penalty; the specific computing mechanism of R_intr is described in detail in the next step.
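As an illustration of the observation encoding described above, the following Python sketch builds the L×L×3 observation matrix from per-cell tuples; the field-of-view width, the code values and the function name are assumptions for illustration and are not specified in the patent.

```python
import numpy as np

L = 5  # assumed field-of-view width; the patent does not fix a concrete value


def encode_observation(grid_cells):
    """Encode the L x L cells in front of an AGV as an L x L x 3 matrix.

    Each cell is assumed to be a tuple (Obj_id, Col_id, Info_s): the object
    code, the color code, and the status information of that cell.
    """
    obs = np.zeros((L, L, 3), dtype=np.float32)
    for i, row in enumerate(grid_cells):
        for j, (obj_id, col_id, info_s) in enumerate(row):
            obs[i, j] = (obj_id, col_id, info_s)
    return obs
```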
Step 2: The intrinsic rewards are calculated based on the hierarchical intrinsic reward mechanism, which provides continuous rewards for AGV scheduling decisions. The hierarchical intrinsic reward mechanism includes two modules: a top-level reward balancer and a bottom-level action controller.
Step 2-1: Calculate two intrinsic rewards based on the bottom-level action controller: an attraction reward and a coverage reward. The attraction reward encourages the AGV to move to the target points as quickly as possible, and the coverage reward encourages the AGV to explore new areas and unknown task points; the two rewards represent two intrinsic motivations of the AGV.
Attraction reward calculation: within the AGV's partial observation range, the attraction reward r_attr is established based on the route distance dis_i from the AGV's position to the observed target point. The specific calculation is as follows, where the value M is used to constrain the range of the reward.
Coverage reward calculation: as the number of steps increases, the AGV continuously stores its historical coverage A_his, and the newly explored area A_new added by the current action is used as the coverage reward r_cover, which motivates the AGV to explore new areas and thus helps it find remote target points. The specific calculation is as follows, where the value M is used to constrain the range of the reward.
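The patent gives the exact reward formulas only as figures that are not reproduced in this text. The sketch below shows one plausible form consistent with the description: an attraction reward that grows as the route distance dis_i to an observed target shrinks, and a coverage reward proportional to the newly explored area A_new relative to the historical coverage A_his, both bounded by the constant M; the concrete functional forms are assumptions.

```python
def attraction_reward(dis_i, M=1.0):
    """Plausible attraction reward: larger when the observed target is closer.

    dis_i is the route distance to the observed target point, or None when no
    target lies within the AGV's partial observation range. M bounds the reward.
    """
    if dis_i is None:
        return 0.0
    return min(M, M / (dis_i + 1.0))


def coverage_reward(current_cells, historical_cells, M=1.0):
    """Plausible coverage reward: proportional to the newly explored area A_new
    relative to the accumulated historical coverage A_his, bounded by M.
    """
    a_new = len(current_cells - historical_cells)   # newly explored cells
    a_his = max(len(historical_cells), 1)           # avoid division by zero
    historical_cells |= current_cells               # update the stored coverage
    return min(M, M * a_new / a_his)
```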
Step 2-2: The two intrinsic rewards are balanced based on the top-level reward balancer. The top-level reward balancer outputs a P value according to the AGV status information, which the AGV uses to adjust the weights of the two intrinsic rewards. Specifically, this module contains two strategies for calculating the reward weight: a rule-based judgment and an Actor-Critic based scheme.
Rule-based judgment scheme: when the AGV observes a target, it is driven by the attraction reward; when no target is observed, it is driven by the coverage reward, and the weight of the coverage reward increases with the number of steps so as to raise the degree of exploration in the later stage. The P value is calculated using the following formula:
Actor-Critic based scheme: the P value is output by the trained Actor-Critic network, so that the two intrinsic rewards are combined organically. Based on the weight P output by the top-level reward balancer, the balanced intrinsic reward r_intr can be given by:
r_intr = P * r_cover + (1 - P) * r_attr
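A minimal sketch of the rule-based weighting and the balanced intrinsic reward follows. The combination r_intr = P * r_cover + (1 - P) * r_attr comes from the description above; the particular rule used to grow P with the step count is an assumption, since the patent's P-value formula appears only as a figure.

```python
def rule_based_weight(target_observed, step, max_steps):
    """Plausible rule-based P value: attraction-driven when a target is visible,
    coverage-driven otherwise, with the coverage weight growing as steps pass.
    """
    if target_observed:
        return 0.0  # rely entirely on the attraction reward
    return min(1.0, 0.5 + 0.5 * step / max_steps)


def balanced_intrinsic_reward(p, r_cover, r_attr):
    """Balanced intrinsic reward: r_intr = P * r_cover + (1 - P) * r_attr."""
    return p * r_cover + (1 - p) * r_attr
```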
step 3: bicNet training method based on multi-agent deep reinforcement learning
In the multi-agent training part, the method adopts the BicNet algorithm to train the top-level reward balancer and the bottom-level action controller separately.
Training of the top-level reward balancer: the decision goal of each AGV is to maximize its expected cumulative individual external reward, where θ_p is the parameter of the Actor network in the reward balancer. Thus, with J_i(θ_p) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
The Actor network in the top-level reward balancer adopts the policy gradient method and can be trained and updated by the following formula:
The Critic network in the top-level reward balancer adopts the temporal-difference method and can be trained and updated by the following formula, where ζ_p is the parameter of the Critic network:
Training of the bottom-level action controller: the bottom-level action controller adopts a network training method similar to that of the top-level reward balancer. The difference is that the top-level reward balancer is updated based on the external reward, while the action controller is updated based on the cumulative sum of the intrinsic and external rewards. Specifically, the decision goal of the AGV in the bottom-level action controller is to maximize its expected cumulative total reward, where θ is the parameter of the Actor network in the action controller and r_i,total = r_i,extr + r_i,intr. With J_i(θ) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
Correspondingly, the training update formula of the Actor network in the bottom-level action controller is as follows:
Similar to the update of the Critic network in the top-level reward balancer, the training update formula of the Critic network in the action controller is as follows, where ζ is the parameter of the Critic network:
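The update formulas referenced above appear as figures in the patent. As an illustration of the kind of update described (a temporal-difference critic and a policy-gradient actor), the sketch below shows a generic single-step actor-critic update in PyTorch; it is not the exact BicNet formulation, and the network interfaces and hyperparameters are assumptions. For the top-level reward balancer the reward passed in would be the external reward r_extr; for the bottom-level action controller it would be r_total = r_extr + r_intr.

```python
import torch
import torch.nn.functional as F


def actor_critic_update(critic, actor_opt, critic_opt,
                        obs, next_obs, log_prob, reward, gamma=0.99):
    """One generic actor-critic step: TD update for the critic, policy-gradient
    update for the actor weighted by the TD error (used as the advantage).

    `log_prob` is the log-probability of the executed action, produced by the
    actor's forward pass so that gradients flow back to the actor parameters.
    """
    value = critic(obs)
    with torch.no_grad():
        td_target = reward + gamma * critic(next_obs)

    # Critic: temporal-difference regression toward the bootstrapped target.
    critic_loss = F.mse_loss(value, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy gradient weighted by the TD error.
    td_error = (td_target - value).detach()
    actor_loss = -(log_prob * td_error).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```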
step 4: Deploy the trained top-level reward balancer and bottom-level action controller to the AGVs; each AGV then makes decision actions according to its own local observation and performs distributed collaborative scheduling.
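A brief sketch of the distributed execution described in step 4: each AGV runs its own copy of the trained policy network and acts only on its local observation. The environment and policy interfaces (reset, step, act) are hypothetical placeholders used for illustration.

```python
def run_agv(policy_net, env, agv_id, max_steps=500):
    """Execute one AGV's scheduling episode using only local observations."""
    obs = env.reset(agv_id)              # local L x L x 3 observation
    for _ in range(max_steps):
        action = policy_net.act(obs)     # 0: turn left, 1: turn right, 2: go straight
        obs, done = env.step(agv_id, action)
        if done:                         # scheduling period finished
            break
```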
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.

Claims (9)

1. A multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation, characterized by comprising the following steps:
Step 1: establish a partially observable Markov decision model based on the multi-AGV workshop handling scene;
Step 2: calculate intrinsic rewards based on a hierarchical intrinsic reward mechanism, providing continuous rewards for AGV scheduling decisions;
Step 3: train based on the multi-agent deep reinforcement learning method BicNet;
Step 4: deploy the trained policy network to each AGV; each AGV makes decision actions according to its own local observation and performs distributed collaborative scheduling.
2. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 1, wherein step 1 specifically comprises the following steps:
modeling the multi-AGV intelligent scheduling problem as a partially observable Markov decision model:
M=(N,S,A,P,R,O,γ)
wherein N, S, A, P, R, O and γ are, respectively, the number of agents, the state space, the action space, the state transition probability, the reward function, the partial observation space and the discount factor;
for the multi-AGV workshop handling scene, the agent object, the partial observation space O, the action space A and the reward function R are defined as follows:
Agent: in the multi-AGV scheduling scenario, each AGV is an agent object; it is assumed that at the beginning of each scheduling period the positions of all AGVs are randomly initialized; there are N AGVs and M task points in the workshop handling environment, and the goal of the AGVs is to maximize the number of completed tasks and minimize the time taken to reach the task points;
Partially observable space O: since each AGV is under a partially observable condition, its observation space is a subspace of the global state space, and a mapping function maps the global state to the partially observable space;
Action space A: the action space A_i of each AGV is its set of motion actions and contains three discrete actions: turn left, turn right and go straight;
Reward function R: the reward function is used to motivate the AGV to move quickly toward the task points; it is divided into an external reward R_extr and an intrinsic reward R_intr, where R_extr includes a target reward for reaching a task point, a decay penalty for each forward step, and a collision penalty.
3. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 2, wherein the observation space of the AGV is limited to a three-dimensional observation matrix of size L×L×3, where L represents the field-of-view width of the AGV, i.e. each AGV can observe the L×L grid of cells directly in front of it; each grid cell is encoded as a 3-dimensional tuple {Obj_id, Col_id, Info_s}: the object code, the color code, and the status information within the observation range.
4. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 1, wherein step 2 is specifically as follows:
the hierarchical intrinsic reward mechanism comprises a top-level reward balancer and a bottom-level action controller;
Step 2-1: calculate two intrinsic rewards based on the bottom-level action controller: an attraction reward and a coverage reward;
attraction reward calculation: within the AGV's partial observation range, the attraction reward r_attr is established based on the route distance dis_i from the AGV's position to the observed target point; the specific calculation is as follows, where M is used to constrain the range of the reward;
coverage reward calculation: as the number of steps increases, the AGV continuously stores its historical coverage A_his, and the newly explored area A_new added by the current action is used as the coverage reward r_cover, which motivates the AGV to explore new areas and thus helps it find remote target points; the specific calculation is as follows, where M is used to constrain the range of the reward;
Step 2-2: balance the two intrinsic rewards based on the top-level reward balancer; the top-level reward balancer outputs a P value according to the AGV status information, which the AGV uses to adjust the weights of the two intrinsic rewards.
5. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 4, wherein step 2-2 is specifically as follows:
specifically, the top-level reward balancer balances the two intrinsic rewards; this module contains two strategies for calculating the reward weight: a rule-based judgment and an Actor-Critic based scheme;
rule-based judgment scheme: when the AGV observes a target, it is driven by the attraction reward; when no target is observed, it is driven by the coverage reward, and the weight of the coverage reward increases with the number of steps so as to raise the degree of exploration in the later stage; the P value is calculated using the following formula:
Actor-Critic based scheme: the P value is output by the trained Actor-Critic network, so that the two intrinsic rewards are combined organically; based on the weight P output by the top-level reward balancer, the balanced intrinsic reward r_intr is given by the formula:
r_intr = P * r_cover + (1 - P) * r_attr
6. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 1, wherein step 3 is specifically as follows:
training the top-level reward balancer and the bottom-level action controller separately with the BicNet algorithm;
training of the top-level reward balancer: the decision goal of each AGV is to maximize its expected cumulative individual external reward, where θ_p is the parameter of the Actor network in the reward balancer; thus, with J_i(θ_p) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
the Actor network in the top-level reward balancer adopts the policy gradient method and is trained and updated by the following formula:
the Critic network in the top-level reward balancer adopts the temporal-difference method and is trained and updated by the following formula, where ζ_p is the parameter of the Critic network:
training of the bottom-level action controller: the bottom-level action controller adopts a network training method similar to that of the top-level reward balancer, except that the top-level reward balancer is updated based on the external reward, while the action controller is updated based on the cumulative sum of the intrinsic and external rewards.
7. The multi-AGV intelligent scheduling method according to claim 6, wherein in the training of the bottom-level action controller the decision goal of the AGV is to maximize its expected cumulative total reward, where θ is the parameter of the Actor network in the action controller and r_i,total = r_i,extr + r_i,intr; with J_i(θ) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
correspondingly, the training update formula of the Actor network in the bottom-level action controller is as follows:
similar to the update of the Critic network in the top-level reward balancer, the training update formula of the Critic network in the action controller is as follows, where ζ is the parameter of the Critic network:
8. A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
9. A computer readable storage medium, characterized by storing computer executable instructions that, when executed, are adapted to implement the method of claim 1.
CN202310346390.7A 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation Pending CN116466662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310346390.7A CN116466662A (en) 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310346390.7A CN116466662A (en) 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation

Publications (1)

Publication Number Publication Date
CN116466662A true CN116466662A (en) 2023-07-21

Family

ID=87172754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310346390.7A Pending CN116466662A (en) 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation

Country Status (1)

Country Link
CN (1) CN116466662A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236821A (en) * 2023-11-10 2023-12-15 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning
CN117236821B (en) * 2023-11-10 2024-02-06 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning
CN118365099A (en) * 2024-06-19 2024-07-19 华南理工大学 Multi-AGV scheduling method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination