CN116466662A - Multi-AGV intelligent scheduling method based on layered internal excitation - Google Patents

Multi-AGV intelligent scheduling method based on layered internal excitation

Info

Publication number
CN116466662A
CN116466662A
Authority
CN
China
Prior art keywords
agv
rewards
intrinsic
balancer
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310346390.7A
Other languages
Chinese (zh)
Inventor
郭斌
张江山
於志文
孙卓
刘佳琪
王亮
李梦媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202310346390.7A
Publication of CN116466662A
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 Programme-control systems
    • G05B19/02 Programme-control systems electric
    • G05B19/418 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/4189 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by the transport system
    • G05B19/41895 Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by the transport system using automatic guided vehicles [AGV]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 Program-control systems
    • G05B2219/30 Nc systems
    • G05B2219/31 From computer integrated manufacturing till monitoring
    • G05B2219/31002 Computer controlled agv conveys workpieces between buffer and cell

Abstract

The invention relates to a multi-AGV intelligent scheduling method based on hierarchical intrinsic rewards. First, a partially observable Markov decision model is established for a multi-AGV workshop handling scene; second, a hierarchical intrinsic reward mechanism is proposed to adjust the weights of the AGV's two intrinsic rewards in real time; training is then performed with the multi-agent deep reinforcement learning method BicNet, and finally the trained policy network is deployed to each AGV for multi-AGV intelligent scheduling. With the proposed scheduling method based on multi-agent reinforcement learning, each AGV can perform self-organized intelligent scheduling based on its policy network, which improves the autonomous scheduling capability and task completion level of the multi-AGV system and provides a solution for realizing self-learning, self-organized multi-AGV intelligent scheduling.

Description

Multi-AGV intelligent scheduling method based on layered internal excitation
Technical Field
The invention belongs to the field of multi-AGV cooperative scheduling, and particularly relates to a multi-AGV intelligent scheduling method based on layered internal excitation.
Background
In recent years, with the rapid development of Internet of Things and automation technology, multi-AGV (Automated Guided Vehicle) handling systems have been widely adopted in industrial manufacturing. At the same time, demands on task diversity, accuracy and real-time performance have become more prominent, making intelligent scheduling with multi-AGV cooperation particularly important. Most traditional multi-AGV scheduling methods are based on a centralized scheduling mode, in which the scheduling strategies of the agents are planned uniformly using global information. Under centralized control, AGVs mostly follow prescribed dispatch routes to complete their tasks. Although centralized control systems largely guarantee optimal performance, two problems remain to be solved. On the one hand, this mode depends too heavily on the control center and must process global information, which places high demands on the computing capability of the server; on the other hand, the flexibility and scalability of the system are poor, and it is difficult to adapt quickly to complex and changeable dynamic scenes. Therefore, studying novel distributed multi-AGV cooperative scheduling methods, in which each AGV makes autonomous scheduling decisions under limited observation, is of great significance. Multi-agent reinforcement learning provides a new direction to explore for the distributed multi-AGV cooperative scheduling problem.
In multi-agent reinforcement learning algorithms, agents solve sequential decision problems by continuous trial and error through interaction with the environment. In terms of training mode, the problem of multi-agent joint training is addressed through centralized training and distributed execution. However, in the scenario of distributed multi-AGV collaborative scheduling, each AGV is only partially observable and cannot obtain global information, so it tends to fall into a "lazy state": it merely avoids penalties without pursuing the completion of scheduling tasks, which severely reduces scheduling efficiency. Meanwhile, because the task points to be completed are spatially dispersed, the external environment rewards of each agent are sparse. In this case, the multi-agent reinforcement learning algorithm is difficult to train to convergence. Therefore, how to motivate agents to escape the "lazy state" and reduce the impact of sparse external rewards, so as to complete tasks effectively, is a key challenge.
Disclosure of Invention
The technical problems to be solved by the invention are as follows:
in order to overcome the defects of the prior art, the invention provides a multi-AGV intelligent scheduling method based on hierarchical intrinsic rewards. First, a partially observable Markov decision model is established for a multi-AGV workshop handling scene; second, a hierarchical intrinsic reward mechanism is proposed to adjust the weights of the AGV's two intrinsic rewards in real time; training is then performed with the multi-agent deep reinforcement learning method BicNet, and finally the trained policy network is deployed to each AGV for multi-AGV intelligent scheduling.
In order to solve the technical problems, the invention adopts the following technical scheme:
a multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation, characterized by comprising the following steps:
Step 1: establish a partially observable Markov decision model based on the multi-AGV workshop handling scene;
Step 2: calculate intrinsic rewards based on a hierarchical intrinsic reward mechanism, providing continuous rewards for AGV scheduling decisions;
Step 3: train based on the multi-agent deep reinforcement learning method BicNet;
Step 4: deploy the trained policy network to each AGV; each AGV makes decision actions according to its own local observation and performs distributed collaborative scheduling.
The invention further adopts the technical scheme that: the step 1 is specifically as follows:
modeling the multi-AGV intelligent scheduling problem as a partially observable Markov decision model:
M=(N,S,A,P,R,O,γ)
wherein N, S, A, P, R, O and γ are, respectively, the number of agents, the state space, the action space, the state transition probability, the reward function, the partial observation space and the discount factor;
for the multi-AGV workshop handling scene, the agent object, the partial observation space O, the action space A and the reward function R are defined as follows:
Agent: in the multi-AGV scheduling scenario, each AGV is an agent object; it is assumed that at the beginning of each scheduling period the positions of all AGVs are randomly initialized; there are N AGVs and M task points in the workshop handling environment, and the goal of the AGVs is to maximize the number of completed tasks and minimize the time taken to reach the task points;
Partially observable space O: since each AGV is under a partially observable condition, its observation space is a subspace of the global state space, and a mapping function maps the global state to the partially observable space;
Action space A: the action space A_i of each AGV is its set of motion actions and contains three discrete actions: turn left, turn right and go straight;
Reward function R: the reward function is used to motivate the AGV to move quickly toward the task points; it is divided into an external reward R_extr and an intrinsic reward R_intr, where R_extr includes a target reward for reaching a task point, a decay penalty for each forward step, and a collision penalty.
The invention further adopts the technical scheme that: the observation space of the AGV is limited to a three-dimensional observation matrix of size L×L×3, where L represents the field-of-view width of the AGV, i.e. each AGV can observe the L×L grid of cells directly in front of it; each grid cell is encoded as a 3-dimensional tuple {Obj_id, Col_id, Info_s}: the object code, the color code, and the status information within the observation range.
The invention further adopts the technical scheme that: the step 2 is specifically as follows:
the hierarchical intrinsic reward mechanism comprises a top-level reward balancer and a bottom-level action controller;
Step 2-1: calculate two intrinsic rewards based on the bottom-level action controller: an attraction reward and a coverage reward;
attraction reward calculation: within the AGV's partial observation range, the attraction reward r_attr is established based on the route distance dis_i from the AGV's position to the observed target point; the specific calculation is as follows, where M is used to constrain the range of the reward;
coverage reward calculation: as the number of steps increases, the AGV continuously stores its historical coverage A_his, and the newly explored area A_new added by the current action is used as the coverage reward r_cover, which motivates the AGV to explore new areas and thus helps it find remote target points; the specific calculation is as follows, where M is used to constrain the range of the reward;
Step 2-2: balance the two intrinsic rewards based on the top-level reward balancer; the top-level reward balancer outputs a P value according to the AGV status information, which the AGV uses to adjust the weights of the two intrinsic rewards.
The invention further adopts the technical scheme that: the step 2-2 is specifically as follows:
specifically, the top-level reward balancer balances the two intrinsic rewards; this module contains two strategies for calculating the reward weight: a rule-based judgment and an Actor-Critic based scheme;
rule-based judgment scheme: when the AGV observes a target, it is driven by the attraction reward; when no target is observed, it is driven by the coverage reward, and the weight of the coverage reward increases with the number of steps so as to raise the degree of exploration in the later stage; the P value is calculated using the following formula:
Actor-Critic based scheme: the P value is output by the trained Actor-Critic network, so that the two intrinsic rewards are combined organically; based on the weight P output by the top-level reward balancer, the balanced intrinsic reward r_intr is given by the formula:
r_intr = P * r_cover + (1 - P) * r_attr
The invention further adopts the technical scheme that: the step 3 is specifically as follows:
training the top-level reward balancer and the bottom-level action controller separately with the BicNet algorithm;
training of the top-level reward balancer: the decision goal of each AGV is to maximize its expected cumulative individual external reward, where θ_p is the parameter of the Actor network in the reward balancer; thus, with J_i(θ_p) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
the Actor network in the top-level reward balancer adopts the policy gradient method and is trained and updated by the following formula:
the Critic network in the top-level reward balancer adopts the temporal-difference method and is trained and updated by the following formula, where ζ_p is the parameter of the Critic network:
training of the bottom-level action controller: the bottom-level action controller adopts a network training method similar to that of the top-level reward balancer, except that the top-level reward balancer is updated based on the external reward, while the action controller is updated based on the cumulative sum of the intrinsic and external rewards.
The invention further adopts the technical scheme that: in the training of the bottom-level action controller, the decision goal of the AGV is to maximize its expected cumulative total reward,
where θ is the parameter of the Actor network in the action controller and
r_i,total = r_i,extr + r_i,intr; with J_i(θ) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
correspondingly, the training update formula of the Actor network in the bottom-level action controller is as follows:
similar to the update of the Critic network in the top-level reward balancer, the training update formula of the Critic network in the action controller is as follows, where ζ is the parameter of the Critic network:
A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
A computer-readable storage medium, characterized by storing computer-executable instructions that, when executed, are configured to implement the method described above.
The invention has the beneficial effects that:
the traditional centralized scheduling method utilizes global information to uniformly plan scheduling strategies of agents, the mode is too dependent on a control center and the global information, and the flexibility and the expansibility of the system are poor. The multi-agent reinforcement learning technology can realize the autonomous decision of the agent scheduling strategy in a limited dynamic scene. Thus, the present invention designs a hierarchical intrinsic rewarding mechanism for motivating self-learning exploration of agents. Meanwhile, the invention provides a dispatching method based on multi-agent reinforcement learning, and each AGV can carry out self-organizing intelligent dispatching based on a strategy network. The invention improves the autonomous scheduling capability and task completion level of the multi-AGV and provides a solution for realizing self-learning and self-organizing intelligent scheduling of the multi-AGV.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a block diagram of a layered intrinsic excitation mechanism in an example of the invention;
FIG. 2 is a block diagram of a multi-AGV intelligent scheduling method based on hierarchical intrinsic stimulus in an example of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a multi-AGV intelligent scheduling method based on hierarchical intrinsic rewards. A partially observable Markov decision model is established for the multi-AGV workshop handling scene; a hierarchical intrinsic reward mechanism is proposed to adjust the weights of the AGV's two intrinsic rewards in real time; hierarchical joint training based on the BicNet idea yields a value network and a policy network after training converges. Each AGV then executes scheduling based on the trained policy network: it takes its local observation information as input and outputs scheduling actions, completing the intelligent scheduling of multiple AGVs.
The method comprises the following two parts:
Hierarchical intrinsic reward mechanism: by introducing two intrinsic rewards, a coverage reward and an attraction reward, the AGV is encouraged both to explore the region and to complete tasks; the relative weights of the two intrinsic rewards are adjusted in real time according to changes in the task environment, so that the agent balances exploration and task completion.
Multi-agent reinforcement learning model: based on the hierarchical intrinsic rewards and multi-agent reinforcement learning, the multiple AGVs interact with the environment and continuously train and optimize their scheduling strategies by maximizing the accumulated rewards.
The method comprises the following specific steps:
step 1: based on multiple AGV workshop carrying scenes, establishing partially observable Markov decision model
In the multi-AGV workshop handling scene, each AGV interacts with the task environment based on its own local observation to complete its scheduling decisions. Because each AGV has only limited observations, the multi-AGV intelligent scheduling problem is modeled as a partially observable Markov decision model:
M=(N,S,A,P,R,O,γ)
wherein N, S, A, P, R, O and γ are, respectively, the number of agents, the state space, the action space, the state transition probability, the reward function, the partial observation space and the discount factor. For the multi-AGV workshop handling scene, the agent object, the partial observation space O, the action space A and the reward function R are defined as follows:
Agent: in the multi-AGV scheduling scenario, each AGV is an agent object. It is assumed that at the beginning of each scheduling period the positions of all AGVs are initialized randomly. There are N AGVs and M task points in the workshop handling environment, and the goal of the AGVs is to maximize the number of completed tasks and minimize the time taken to reach the task points.
Partially observable space O: since each AGV is under a partially observable condition, its observation space O is a subspace of the global state space, and a mapping function maps the global state to the partially observable space. Specifically, the observation space of the AGV is limited to a three-dimensional observation matrix of size L×L×3, where L represents the field-of-view width of the AGV, i.e. each AGV can observe the L×L grid of cells directly in front of it. Each grid cell is encoded as a 3-dimensional tuple {Obj_id, Col_id, Info_s}: the object code, the color code, and the status information within the observation range.
Action space A: the action space A_i of each AGV is its set of motion actions and contains three discrete actions: turn left, turn right and go straight.
Reward function R: the reward function is used to motivate the AGV to move quickly toward the task points. It is divided into an external reward R_extr and an intrinsic reward R_intr. R_extr includes a target reward for reaching a task point, a decay penalty for each forward step, and a collision penalty; the specific computing mechanism of R_intr is described in detail in the next step.
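As an illustration of the observation encoding described above, the following Python sketch builds the L×L×3 observation matrix from per-cell tuples; the field-of-view width, the code values and the function name are assumptions for illustration and are not specified in the patent.

```python
import numpy as np

L = 5  # assumed field-of-view width; the patent does not fix a concrete value


def encode_observation(grid_cells):
    """Encode the L x L cells in front of an AGV as an L x L x 3 matrix.

    Each cell is assumed to be a tuple (Obj_id, Col_id, Info_s): the object
    code, the color code, and the status information of that cell.
    """
    obs = np.zeros((L, L, 3), dtype=np.float32)
    for i, row in enumerate(grid_cells):
        for j, (obj_id, col_id, info_s) in enumerate(row):
            obs[i, j] = (obj_id, col_id, info_s)
    return obs
```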
Step 2: The intrinsic rewards are calculated based on the hierarchical intrinsic reward mechanism, which provides continuous rewards for AGV scheduling decisions. The hierarchical intrinsic reward mechanism includes two modules: a top-level reward balancer and a bottom-level action controller.
Step 2-1: Calculate two intrinsic rewards based on the bottom-level action controller: an attraction reward and a coverage reward. The attraction reward encourages the AGV to move to the target points as quickly as possible, and the coverage reward encourages the AGV to explore new areas and unknown task points; the two rewards represent two intrinsic motivations of the AGV.
Attraction reward calculation: within the AGV's partial observation range, the attraction reward r_attr is established based on the route distance dis_i from the AGV's position to the observed target point. The specific calculation is as follows, where the value M is used to constrain the range of the reward.
Coverage reward calculation: as the number of steps increases, the AGV continuously stores its historical coverage A_his, and the newly explored area A_new added by the current action is used as the coverage reward r_cover, which motivates the AGV to explore new areas and thus helps it find remote target points. The specific calculation is as follows, where the value M is used to constrain the range of the reward.
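The patent gives the exact reward formulas only as figures that are not reproduced in this text. The sketch below shows one plausible form consistent with the description: an attraction reward that grows as the route distance dis_i to an observed target shrinks, and a coverage reward proportional to the newly explored area A_new relative to the historical coverage A_his, both bounded by the constant M; the concrete functional forms are assumptions.

```python
def attraction_reward(dis_i, M=1.0):
    """Plausible attraction reward: larger when the observed target is closer.

    dis_i is the route distance to the observed target point, or None when no
    target lies within the AGV's partial observation range. M bounds the reward.
    """
    if dis_i is None:
        return 0.0
    return min(M, M / (dis_i + 1.0))


def coverage_reward(current_cells, historical_cells, M=1.0):
    """Plausible coverage reward: proportional to the newly explored area A_new
    relative to the accumulated historical coverage A_his, bounded by M.
    """
    a_new = len(current_cells - historical_cells)   # newly explored cells
    a_his = max(len(historical_cells), 1)           # avoid division by zero
    historical_cells |= current_cells               # update the stored coverage
    return min(M, M * a_new / a_his)
```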
Step 2-2: The two intrinsic rewards are balanced based on the top-level reward balancer. The top-level reward balancer outputs a P value according to the AGV status information, which the AGV uses to adjust the weights of the two intrinsic rewards. Specifically, this module contains two strategies for calculating the reward weight: a rule-based judgment and an Actor-Critic based scheme.
Rule-based judgment scheme: when the AGV observes a target, it is driven by the attraction reward; when no target is observed, it is driven by the coverage reward, and the weight of the coverage reward increases with the number of steps so as to raise the degree of exploration in the later stage. The P value is calculated using the following formula:
Actor-Critic based scheme: the P value is output by the trained Actor-Critic network, so that the two intrinsic rewards are combined organically. Based on the weight P output by the top-level reward balancer, the balanced intrinsic reward r_intr can be given by:
r_intr = P * r_cover + (1 - P) * r_attr
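A minimal sketch of the rule-based weighting and the balanced intrinsic reward follows. The combination r_intr = P * r_cover + (1 - P) * r_attr comes from the description above; the particular rule used to grow P with the step count is an assumption, since the patent's P-value formula appears only as a figure.

```python
def rule_based_weight(target_observed, step, max_steps):
    """Plausible rule-based P value: attraction-driven when a target is visible,
    coverage-driven otherwise, with the coverage weight growing as steps pass.
    """
    if target_observed:
        return 0.0  # rely entirely on the attraction reward
    return min(1.0, 0.5 + 0.5 * step / max_steps)


def balanced_intrinsic_reward(p, r_cover, r_attr):
    """Balanced intrinsic reward: r_intr = P * r_cover + (1 - P) * r_attr."""
    return p * r_cover + (1 - p) * r_attr
```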
step 3: bicNet training method based on multi-agent deep reinforcement learning
In the multi-agent training part, the method adopts the BicNet algorithm to train the top-level reward balancer and the bottom-level action controller separately.
Training of the top-level reward balancer: the decision goal of each AGV is to maximize its expected cumulative individual external reward, where θ_p is the parameter of the Actor network in the reward balancer. Thus, with J_i(θ_p) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
The Actor network in the top-level reward balancer adopts the policy gradient method and can be trained and updated by the following formula:
The Critic network in the top-level reward balancer adopts the temporal-difference method and can be trained and updated by the following formula, where ζ_p is the parameter of the Critic network:
Training of the bottom-level action controller: the bottom-level action controller adopts a network training method similar to that of the top-level reward balancer. The difference is that the top-level reward balancer is updated based on the external reward, while the action controller is updated based on the cumulative sum of the intrinsic and external rewards. Specifically, the decision goal of the AGV in the bottom-level action controller is to maximize its expected cumulative total reward, where θ is the parameter of the Actor network in the action controller and r_i,total = r_i,extr + r_i,intr. With J_i(θ) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
Correspondingly, the training update formula of the Actor network in the bottom-level action controller is as follows:
Similar to the update of the Critic network in the top-level reward balancer, the training update formula of the Critic network in the action controller is as follows, where ζ is the parameter of the Critic network:
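The update formulas referenced above appear as figures in the patent. As an illustration of the kind of update described (a temporal-difference critic and a policy-gradient actor), the sketch below shows a generic single-step actor-critic update in PyTorch; it is not the exact BicNet formulation, and the network interfaces and hyperparameters are assumptions. For the top-level reward balancer the reward passed in would be the external reward r_extr; for the bottom-level action controller it would be r_total = r_extr + r_intr.

```python
import torch
import torch.nn.functional as F


def actor_critic_update(critic, actor_opt, critic_opt,
                        obs, next_obs, log_prob, reward, gamma=0.99):
    """One generic actor-critic step: TD update for the critic, policy-gradient
    update for the actor weighted by the TD error (used as the advantage).

    `log_prob` is the log-probability of the executed action, produced by the
    actor's forward pass so that gradients flow back to the actor parameters.
    """
    value = critic(obs)
    with torch.no_grad():
        td_target = reward + gamma * critic(next_obs)

    # Critic: temporal-difference regression toward the bootstrapped target.
    critic_loss = F.mse_loss(value, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy gradient weighted by the TD error.
    td_error = (td_target - value).detach()
    actor_loss = -(log_prob * td_error).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```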
step 4: Deploy the trained top-level reward balancer and bottom-level action controller to the AGVs; each AGV then makes decision actions according to its own local observation and performs distributed collaborative scheduling.
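A brief sketch of the distributed execution described in step 4: each AGV runs its own copy of the trained policy network and acts only on its local observation. The environment and policy interfaces (reset, step, act) are hypothetical placeholders used for illustration.

```python
def run_agv(policy_net, env, agv_id, max_steps=500):
    """Execute one AGV's scheduling episode using only local observations."""
    obs = env.reset(agv_id)              # local L x L x 3 observation
    for _ in range(max_steps):
        action = policy_net.act(obs)     # 0: turn left, 1: turn right, 2: go straight
        obs, done = env.step(agv_id, action)
        if done:                         # scheduling period finished
            break
```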
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.

Claims (9)

1. A multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation, characterized by comprising the following steps:
Step 1: establish a partially observable Markov decision model based on the multi-AGV workshop handling scene;
Step 2: calculate intrinsic rewards based on a hierarchical intrinsic reward mechanism, providing continuous rewards for AGV scheduling decisions;
Step 3: train based on the multi-agent deep reinforcement learning method BicNet;
Step 4: deploy the trained policy network to each AGV; each AGV makes decision actions according to its own local observation and performs distributed collaborative scheduling.
2. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 1, wherein step 1 specifically comprises the following steps:
modeling the multi-AGV intelligent scheduling problem as a partially observable Markov decision model:
M=(N,S,A,P,R,O,γ)
wherein N, S, A, P, R, O and γ are, respectively, the number of agents, the state space, the action space, the state transition probability, the reward function, the partial observation space and the discount factor;
for the multi-AGV workshop handling scene, the agent object, the partial observation space O, the action space A and the reward function R are defined as follows:
Agent: in the multi-AGV scheduling scenario, each AGV is an agent object; it is assumed that at the beginning of each scheduling period the positions of all AGVs are randomly initialized; there are N AGVs and M task points in the workshop handling environment, and the goal of the AGVs is to maximize the number of completed tasks and minimize the time taken to reach the task points;
Partially observable space O: since each AGV is under a partially observable condition, its observation space is a subspace of the global state space, and a mapping function maps the global state to the partially observable space;
Action space A: the action space A_i of each AGV is its set of motion actions and contains three discrete actions: turn left, turn right and go straight;
Reward function R: the reward function is used to motivate the AGV to move quickly toward the task points; it is divided into an external reward R_extr and an intrinsic reward R_intr, where R_extr includes a target reward for reaching a task point, a decay penalty for each forward step, and a collision penalty.
3. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 2, wherein the observation space of the AGV is limited to a three-dimensional observation matrix of size L×L×3, where L represents the field-of-view width of the AGV, i.e. each AGV can observe the L×L grid of cells directly in front of it; each grid cell is encoded as a 3-dimensional tuple {Obj_id, Col_id, Info_s}: the object code, the color code, and the status information within the observation range.
4. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 1, wherein step 2 is specifically as follows:
the hierarchical intrinsic reward mechanism comprises a top-level reward balancer and a bottom-level action controller;
Step 2-1: calculate two intrinsic rewards based on the bottom-level action controller: an attraction reward and a coverage reward;
attraction reward calculation: within the AGV's partial observation range, the attraction reward r_attr is established based on the route distance dis_i from the AGV's position to the observed target point; the specific calculation is as follows, where M is used to constrain the range of the reward;
coverage reward calculation: as the number of steps increases, the AGV continuously stores its historical coverage A_his, and the newly explored area A_new added by the current action is used as the coverage reward r_cover, which motivates the AGV to explore new areas and thus helps it find remote target points; the specific calculation is as follows, where M is used to constrain the range of the reward;
Step 2-2: balance the two intrinsic rewards based on the top-level reward balancer; the top-level reward balancer outputs a P value according to the AGV status information, which the AGV uses to adjust the weights of the two intrinsic rewards.
5. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 4, wherein step 2-2 is specifically as follows:
specifically, the top-level reward balancer balances the two intrinsic rewards; this module contains two strategies for calculating the reward weight: a rule-based judgment and an Actor-Critic based scheme;
rule-based judgment scheme: when the AGV observes a target, it is driven by the attraction reward; when no target is observed, it is driven by the coverage reward, and the weight of the coverage reward increases with the number of steps so as to raise the degree of exploration in the later stage; the P value is calculated using the following formula:
Actor-Critic based scheme: the P value is output by the trained Actor-Critic network, so that the two intrinsic rewards are combined organically; based on the weight P output by the top-level reward balancer, the balanced intrinsic reward r_intr is given by the formula:
r_intr = P * r_cover + (1 - P) * r_attr
6. The multi-AGV intelligent scheduling method based on hierarchical intrinsic excitation according to claim 1, wherein step 3 is specifically as follows:
training the top-level reward balancer and the bottom-level action controller separately with the BicNet algorithm;
training of the top-level reward balancer: the decision goal of each AGV is to maximize its expected cumulative individual external reward, where θ_p is the parameter of the Actor network in the reward balancer; thus, with J_i(θ_p) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
the Actor network in the top-level reward balancer adopts the policy gradient method and is trained and updated by the following formula:
the Critic network in the top-level reward balancer adopts the temporal-difference method and is trained and updated by the following formula, where ζ_p is the parameter of the Critic network:
training of the bottom-level action controller: the bottom-level action controller adopts a network training method similar to that of the top-level reward balancer, except that the top-level reward balancer is updated based on the external reward, while the action controller is updated based on the cumulative sum of the intrinsic and external rewards.
7. The multi-AGV intelligent scheduling method according to claim 6, wherein in the training of the bottom-level action controller the decision goal of the AGV is to maximize its expected cumulative total reward, where θ is the parameter of the Actor network in the action controller and r_i,total = r_i,extr + r_i,intr; with J_i(θ) denoting the objective of AGV i, the objectives of the N AGVs are expressed as follows:
correspondingly, the training update formula of the Actor network in the bottom-level action controller is as follows:
similar to the update of the Critic network in the top-level reward balancer, the training update formula of the Critic network in the action controller is as follows, where ζ is the parameter of the Critic network:
8. A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
9. A computer readable storage medium, characterized by storing computer executable instructions that, when executed, are adapted to implement the method of claim 1.
CN202310346390.7A 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation Pending CN116466662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310346390.7A CN116466662A (en) 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310346390.7A CN116466662A (en) 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation

Publications (1)

Publication Number Publication Date
CN116466662A true CN116466662A (en) 2023-07-21

Family

ID=87172754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310346390.7A Pending CN116466662A (en) 2023-04-03 2023-04-03 Multi-AGV intelligent scheduling method based on layered internal excitation

Country Status (1)

Country Link
CN (1) CN116466662A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236821A (en) * 2023-11-10 2023-12-15 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning
CN117236821B (en) * 2023-11-10 2024-02-06 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning
CN118365099A (en) * 2024-06-19 2024-07-19 华南理工大学 Multi-AGV scheduling method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination