LU502351B1 - Multi-agent system modeling and cooperative control method based on Markov Decision Process - Google Patents

Multi-agent system modeling and cooperative control method based on Markov Decision Process

Info

Publication number
LU502351B1
Authority
LU
Luxembourg
Prior art keywords
agent
agents
leader
cooperative control
environment
Prior art date
Application number
LU502351A
Other languages
French (fr)
Inventor
Lianglin Xiong
Kangyue Chen
Junhua Chen
Original Assignee
Univ Yunnan Minzu
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Yunnan Minzu filed Critical Univ Yunnan Minzu
Priority to LU502351A priority Critical patent/LU502351B1/en
Application granted granted Critical
Publication of LU502351B1 publication Critical patent/LU502351B1/en

Links

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention provides a multi-agent system modeling and cooperative control method based on the Markov Decision Process, comprising the following steps: multi-agent and environment modeling; multi-agent task allocation calculation; multi-agent cooperative control calculation. When the state of the leader agent is unpredictable, the agents can effectively avoid static obstacles and dynamic obstacles in a dynamic unknown environment and reach all the target points at the same time through collaborative decision-making, without pre-allocation of target points. When the state of the leader agent is measurable, the leader agent dominates all the other agents as an external input, and the other agents also rely on information exchange and feedback with the environment and with other agents to evaluate the current state and actions in collaboration and to optimize the next action, so that the multi-agent system can reach the target node at the same time.

Description

Multi-agent system modeling and cooperative control method based on Markov Decision Process

Technical Field
The invention belongs to the field of multi-agent system modeling and collaborative control methods; particularly, it relates to a multi-agent system modeling and cooperative control method based on the Markov Decision Process.

Background
A multi-agent system refers to a group of autonomous mobile agents with local information perception, processing and communication capabilities, which can achieve various arduous and complex tasks that a single agent cannot accomplish, through efficient cooperation within the system. Multi-agent system cooperative control is a trending topic in artificial intelligence research, and it is also an important research direction for intelligent robots, with a wide range of practical applications in many fields.
In multi-agent systems, the actions and optimal solutions taken by a single agent are not only related to environmental variables, but are also restricted by other agents or central nodes (leaders). Compared with single agents, the multi-agent cooperation problem becomes relatively difficult, because the multi-agent environment is becoming more and more complex and the real tasks in dynamic unknown environments are becoming more and more difficult. For example, in collaborative control there will often be time delays in the information transmission between agents and in the information processing of the agents themselves; this kind of time lag is inevitable, the state information of the leader agent cannot be obtained directly, and there is a coupling time lag between agents. Furthermore, the information transmission between agents and the interference of neighboring agents make the problem to be dealt with even more complicated.
Summary
In order to realize the cooperative work of multi-agent systems, it is usually necessary to realize three basic elements: (1) knowing where you are, that is, agent positioning technology; (2) knowing where the other party is, that is, communication technology between agents; and (3) knowing what action to take next, that is, agent decision-making.
Based on this, this invention provides a multi-agent system modeling and cooperative control method based on the Markov Decision Process. The method mainly includes three aspects: environment modeling, task assignment and optimal action path strategy learning. The concrete framework is shown in Figure 1.
First of all, the environmental information such as the leader agents, the other agents, and the number and position of static obstacles is modeled, and the leader agents and other agents are taken as the inputs of the multi-agent multi-objective task allocation algorithm. The Hungarian algorithm is used to obtain the task allocation result of each agent according to the distance benefit matrix, and the reward function is set according to the task assignment result and environmental information of each agent. Then, multi-agent collaboration is summed up as a reinforcement learning problem, and a Markov decision process is used to model it, building the joint state (s1, s2, ..., sn) and joint action (a1, a2, ..., an) of the multiple agents to solve the problem of environmental instability. For a given agent, the other agents are modeled as part of the environment.
By studying the interaction between agents and between agents and the environment, the whole environment can be modeled for reinforcement learning. Here, the reinforcement learning problem is modeled as a Markov Decision Process. A Markov decision process is described by the tuple (S, A, R, P, γ), where S is a finite set of states, A is a finite set of actions, R is the reward function, P is the state transition probability, and γ is the discount factor used to calculate the cumulative return, γ in [0, 1]. Finally, at each time step, each agent takes the corresponding action to act on the environment according to the observed environmental information, which changes the state of the environment. At the same time, it obtains reward feedback from the environment through the reward function and carries out learning and updating of its strategy. Then, the multiple agents take actions according to the observed new environmental state, in order to obtain rewards and learn again. Herein, the part of the agent that takes action according to the current environmental state and moves to the next state is called the Actor, while the Critic evaluates the agent's action, and the agent optimizes its own action according to the Critic's evaluation of that action. After many iterations, the agent finds an optimal action path by continuously optimizing its decisions and updating the strategy base.
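The following is a minimal sketch, not the patented implementation, of the MDP tuple (S, A, R, P, γ) and the per-time-step interaction loop described above. The toy grid positions, the hand-written reward and transition functions, and the random placeholder policy are illustrative assumptions only; a learned Actor would replace the policy stub.

```python
# Sketch of the agent-environment loop of an MDP with joint states and joint actions.
# Everything here (grid world, targets, random policy) is an illustrative assumption.
import random

GAMMA = 0.95                                         # discount factor gamma in [0, 1]
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]         # finite action set A

def reward(joint_state, targets):
    """Reward function R: negative total distance of each agent to its target."""
    return -sum(abs(x - tx) + abs(y - ty)
                for (x, y), (tx, ty) in zip(joint_state, targets))

def transition(joint_state, joint_action):
    """State transition P: deterministic grid motion (stand-in for the real dynamics)."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(joint_state, joint_action)]

def policy(joint_state):
    """Placeholder Actor: a random policy; a learned policy would go here."""
    return [random.choice(ACTIONS) for _ in joint_state]

# Joint state (s1, s2, ..., sn): one position per agent. Other agents are treated as
# part of each agent's environment, which is why the joint state is modeled explicitly.
state = [(0, 0), (5, 5)]
targets = [(4, 4), (1, 1)]
discounted_return, discount = 0.0, 1.0
for t in range(20):                                  # each time step of the loop
    joint_action = policy(state)                     # joint action (a1, a2, ..., an)
    next_state = transition(state, joint_action)
    r = reward(next_state, targets)                  # feedback used for learning
    discounted_return += discount * r                # cumulative return weighted by gamma
    discount *= GAMMA
    state = next_state
print("discounted return of this rollout:", discounted_return)
```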
The beneficial effects of this invention are described as follows: When the leader agent's state is unpredictable, multiple agents start from arbitrary initial positions and constantly interact with the environment through sensors. Every time they take a step, they get feedback from the environment. Through this feedback, the multi-agent system jointly evaluates the current state and action and optimizes the next action. Finally, each agent finds an action sequence with the greatest reward. In this process, through cooperative decision-making, the agents can effectively avoid static obstacles and dynamic obstacles in the dynamic unknown environment and reach all the target points at the same time, without assigning the target points in advance.
When the state of the leader agent is measurable, the leader agent dominates all the other agents as an external input, and the other agents also rely on information exchange and feedback with both the environment and other agents to evaluate the current state and actions in collaboration and to optimize the next action; therefore, the multi-agent system can reach the target node at the same time.
Brief Description Of The Figures
Figure 1 shows the overall framework of multi-agent system modeling and cooperative control.

Description of the present invention
The invention adopts the Hungarian algorithm to calculate the task assignment according to the distance benefit matrix, and a collaborative reinforcement learning algorithm based on the Markov decision process to carry out multi-agent collaborative control. Its core work includes: 1) according to the position information of the common agents and leader agents in the environment, using the multi-agent task assignment algorithm to determine the task assignment results of the multiple agents; 2) designing the corresponding reinforcement learning model according to the task assignment results of the different agents and other information in the environment (such as obstacle positions and environment boundary). The multiple agents interact with the environment and store their experience data in the sample pool, and then a certain number of samples are randomly taken from the sample pool to learn and update the strategy.
Therefore, the invention mainly includes the following two algorithms:
1. Multi-agent task allocation algorithm. The specific steps are as follows:
step 1, initializing the number nP of common agents, the position XP(t0) of each common agent, the number nL of leader agents, and the position XL(t0) of each leader agent;
step 2, calculating the distance between each common agent and each leader agent in turn to form the distance benefit matrix D of size nP x nL, whose element in row i and column j is d_ij = ||X_Pi(t0) - X_Lj(t0)||, the distance between common agent Pi and leader agent Lj;
step 3, when the number of common agents nP is equal to the number of leader agents nL, go to step 4; when nP is less than nL, go to step 5; when nP is greater than nL, go to step 6;
step 4, using the Hungarian algorithm to solve the task allocation model of the multiple agents according to the distance benefit matrix D, and go to step 7;
step 5, adding (nL - nP) common agents virtually, converting the non-standard assignment problem into a standard assignment problem by the edges and zero filling method, solving the task assignment model of the multiple agents using the Hungarian algorithm, and go to step 7;
step 6, adding (nP - nL) leader agents virtually, converting the non-standard assignment problem into a standard assignment problem by the edges and zero filling method, solving the task assignment model of the multiple agents using the Hungarian algorithm, and go to step 7;
step 7, outputting the task assignment results of the common agents, and ending the algorithm.
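The following is a minimal sketch of these steps, assuming SciPy is available: build the distance benefit matrix D from agent and leader positions, zero-fill it to a square matrix when nP and nL differ (the conversion of the non-standard assignment problem into a standard one), and solve the standard assignment problem with the Hungarian method via scipy.optimize.linear_sum_assignment. The positions and function name are illustrative assumptions.

```python
# Hungarian-algorithm task allocation over a distance benefit matrix (sketch).
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_tasks(agent_pos, leader_pos):
    """Return {common agent index: leader agent index} minimizing total distance."""
    agent_pos = np.asarray(agent_pos, dtype=float)     # shape (nP, 2)
    leader_pos = np.asarray(leader_pos, dtype=float)   # shape (nL, 2)
    n_p, n_l = len(agent_pos), len(leader_pos)

    # Step 2: distance benefit matrix D, d_ij = ||X_Pi - X_Lj||
    D = np.linalg.norm(agent_pos[:, None, :] - leader_pos[None, :, :], axis=-1)

    # Steps 3, 5, 6: zero-fill to a square matrix when the problem is non-standard,
    # i.e. add virtual agents or virtual leaders with zero distance to everything.
    n = max(n_p, n_l)
    D_square = np.zeros((n, n))
    D_square[:n_p, :n_l] = D

    # Step 4: Hungarian algorithm on the (possibly padded) matrix, minimizing distance.
    rows, cols = linear_sum_assignment(D_square)

    # Step 7: keep only assignments between real agents and real leaders.
    return {int(i): int(j) for i, j in zip(rows, cols) if i < n_p and j < n_l}

assignment = allocate_tasks([(0, 0), (2, 1), (5, 5)], [(1, 1), (4, 4)])
print(assignment)   # {1: 0, 2: 1}: two of the three agents receive the real targets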
2. Multi-agent collaborative reinforcement learning algorithm
Starting cyclic execution in multi-agent multi-objective cooperative control: first, each agent selects the action a_t according to the state s_t; then the agent performs the action, obtains an instant reward value r_t and transfers to the next state s_{t+1}; then (s_t, a_t, r_t, s_{t+1}) is stored in the experience playback buffer M, and a small batch of data is taken out from M to update the network parameters. Input parameters: the number, size, position and other information of the agents and obstacles, the size m of the experience playback buffer M, the training times N, the minimum sampling number Ns, the stored sample number Np, and the target network parameter update frequency C. Output parameters: the optimal Actor network parameter θ and Critic network parameter w. θ and w are initialized randomly, and the experience playback buffer M is cleared. The specific steps are as follows (a schematic code sketch of these steps is given after step 11):
Step 1, initializing parameters: setting information such as the environmental range boundary, the number, position and speed of the common agents, leader agents and static obstacles, and the capacity m and training times N of the experience playback buffer M;
Step 2, setting the reward function and action space of the agents according to the task target;
Step 3, setting the initial training times E = 0, the minimum sampling number Ns, and the number Np of samples stored in the experience playback buffer;
Step 4, judging whether the number of training times E is less than N; if so, executing the next step, otherwise ending the algorithm;
Step 5, according to the current state s_t, each agent chooses its action a_t according to the current strategy;
Step 6, executing the joint action a_t = (a_1, a_2, ...); each agent gets the reward value r_t and reaches the new state s_{t+1} at the same time;
Step 7, storing (s_t, a_t, r_t, s_{t+1}) into the experience playback buffer M, and updating the stored sample number Np accordingly;
Step 8, judging whether the stored sample number Np is less than the minimum sampling number Ns; if so, go to step 11, otherwise, executing step 9;
Step 9, for each agent i, randomly taking Ns samples from the experience playback buffer M, and updating the Actor network and the Critic network according to the expected income gradient and the action-value function of agent i;
Step 10, if E % C == 1 (the remainder of the current number of training rounds E divided by C equals 1), updating the target networks of the Critic and the Actor;
Step 11, assigning and updating s_{t+1} to s_t, updating the number of training rounds to E + 1, and returning to step 4.
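The following is a structural sketch of steps 1-11, assuming a toy environment and explicitly stubbed network updates: agents act, transitions (s_t, a_t, r_t, s_{t+1}) go into the experience playback buffer M, and once at least Ns samples are stored, each agent samples a mini-batch to update its Actor and Critic, with the target networks refreshed periodically according to C. The environment dynamics, network update functions and parameter values here are placeholders, not the patented networks.

```python
# Experience-replay training loop skeleton for the cooperative control algorithm.
import random
from collections import deque

N, NS, C, M_CAPACITY = 200, 32, 10, 10_000        # training rounds, min samples, update freq, m
N_AGENTS = 2
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

M = deque(maxlen=M_CAPACITY)                      # experience playback buffer M

def choose_actions(state):                        # step 5: Actor policy (random placeholder)
    return [random.choice(ACTIONS) for _ in range(N_AGENTS)]

def env_step(state, actions):                     # step 6: toy dynamics and reward
    next_state = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(state, actions)]
    r = -sum(abs(x) + abs(y) for x, y in next_state)   # closer to the origin is better
    return next_state, r

def update_actor_critic(agent_id, batch):         # step 9: gradient updates would go here
    pass

def update_target_networks():                     # step 10: copy online parameters to targets
    pass

state = [(3, 3), (-2, 4)]                         # steps 1-2: initial joint state
E, NP = 0, 0                                      # step 3: round counter, stored-sample counter
while E < N:                                      # step 4
    actions = choose_actions(state)               # step 5
    next_state, r = env_step(state, actions)      # step 6
    M.append((state, actions, r, next_state))     # step 7
    NP = len(M)
    if NP >= NS:                                  # step 8: enough samples to learn from
        for agent_id in range(N_AGENTS):          # step 9: per-agent mini-batch update
            batch = random.sample(list(M), NS)
            update_actor_critic(agent_id, batch)
        if E % C == 1:                            # step 10: periodic target-network update
            update_target_networks()
    state, E = next_state, E + 1                  # step 11
```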

Claims (7)

    1. Multi-agent system modeling and cooperative control method based on the Markov Decision Process, characterized by comprising the following steps: step 1, multi-agent and environment modeling; step 2, multi-agent task allocation calculation; step 3, multi-agent collaborative control calculation.
    2. Multi-agent system modeling and cooperative control method based on the Markov Decision Process according to claim 1, characterized in that the multi-agent and environment modeling includes: modeling environmental information such as the leader agents, the other agents, and the number and location of static obstacles.
    3. Multi-agent system modeling and cooperative control method based on the Markov Decision Process according to claim 1, characterized in that the multi-agent task allocation calculation includes: taking the leader agents and the other agents as the input of the multi-agent multi-objective task allocation algorithm; using the Hungarian algorithm to obtain the task allocation result of each agent according to the distance benefit matrix; using the multi-agent task assignment algorithm to determine the task assignment results of the multiple agents according to the position information of the common agents and leader agents in the environment; and setting the reward function according to the task assignment result and environmental information of each agent.
    4. Multi-agent system modeling and cooperative control method based on the Markov Decision Process according to claim 1, characterized in that the multi-agent cooperative control calculation includes: designing the corresponding collaborative reinforcement learning model based on the Markov decision process according to the task assignment results of the different agents and other information in the environment (such as obstacle positions and environment boundary); making the multiple agents interact with the environment and store their experience data in the sample pool; and then randomly taking a certain number of samples from the sample pool to learn and update their strategies.
    5. Multi-agent system modeling and cooperative control method based on the Markov Decision Process according to claim 3, characterized in that the multi-agent task allocation calculation comprises the following specific algorithm steps: step 1, initializing the number nP of common agents, the position XP(t0) of each common agent, the number nL of leader agents, and the position XL(t0) of each leader agent; step 2, calculating the distance between each common agent and each leader agent in turn to form the distance benefit matrix D of size nP x nL, whose element in row i and column j is d_ij = ||X_Pi(t0) - X_Lj(t0)||, the distance between common agent Pi and leader agent Lj; step 3, when the number of common agents nP is equal to the number of leader agents nL, go to step 4; when nP is less than nL, go to step 5; when nP is greater than nL, go to step 6; step 4, using the Hungarian algorithm to solve the task allocation model of the multiple agents according to the distance benefit matrix D, and go to step 7; step 5, adding (nL - nP) common agents virtually, converting the non-standard assignment problem into a standard assignment problem by the edges and zero filling method, solving the task assignment model of the multiple agents by using the Hungarian algorithm, and go to step 7; step 6, adding (nP - nL) leader agents virtually, converting the non-standard assignment problem into a standard assignment problem by the edges and zero filling method, solving the task assignment model of the multiple agents by using the Hungarian algorithm, and go to step 7; step 7, outputting the task assignment results of the common agents, and ending the algorithm.
    6. Multi-agent system modeling and cooperative control method based on the Markov Decision Process according to claim 4, characterized in that the multi-agent cooperative control calculation comprises the following specific steps: starting cyclic execution: first, each agent selects the action a_t according to the state s_t; then the agent performs the action, obtains an instant reward value r_t and transfers to the next state s_{t+1}; then (s_t, a_t, r_t, s_{t+1}) is stored in the experience playback buffer M, and a small batch of data is taken out from M to update the network parameters; setting the input parameters: the number, size, position and other information of the agents and obstacles, the size m of the experience playback buffer M, the training times N, the minimum sampling number Ns, the stored sample number Np, and the target network parameter update frequency C; outputting the parameters: the optimal Actor network parameter θ and Critic network parameter w; randomly initializing θ and w, and clearing the experience playback buffer M.
    7. Multi-agent system modeling and cooperative control method based on the Markov Decision Process according to claim 6, characterized in that the multi-agent cooperative control calculation comprises the following specific algorithm steps: step 1, initializing parameters: setting information such as the environmental range boundary, the number, position and speed of the common agents, leader agents and static obstacles, and the capacity m and training times N of the experience playback buffer M; step 2, setting the reward function and action space of the agents according to the task target; step 3, setting the initial training times E = 0, the minimum sampling number Ns, and the number Np of samples stored in the experience playback buffer; step 4, judging whether the number of training times E is less than N; if so, executing the next step, otherwise ending the algorithm; step 5, according to the current state s_t, each agent choosing its action a_t according to the current strategy; step 6, executing the joint action a_t = (a_1, a_2, ...), each agent getting the reward value r_t and reaching the new state s_{t+1} at the same time; step 7, storing (s_t, a_t, r_t, s_{t+1}) into the experience playback buffer M, and updating the stored sample number Np accordingly; step 8, judging whether the stored sample number Np is less than the minimum sampling number Ns; if so, go to step 11, otherwise, executing step 9; step 9, for each agent i, randomly taking Ns samples from the experience playback buffer M, and updating the Actor network and the Critic network according to the expected income gradient and the action-value function of agent i; step 10, if E % C == 1 (the remainder of the current number of training rounds E divided by C equals 1), updating the target networks of the Critic and the Actor; step 11, assigning and updating s_{t+1} to s_t, updating the number of training rounds to E + 1, and returning to step 4.
LU502351A 2022-06-24 2022-06-24 Multi-agent system modeling and cooperative control method based on Markov Decision Process LU502351B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU502351A LU502351B1 (en) 2022-06-24 2022-06-24 Multi-agent system modeling and cooperative control method based on Markov Decision Process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
LU502351A LU502351B1 (en) 2022-06-24 2022-06-24 Multi-agent system modeling and cooperative control method based on Markov Decision Process

Publications (1)

Publication Number Publication Date
LU502351B1 true LU502351B1 (en) 2022-12-29

Family

ID=84687914

Family Applications (1)

Application Number Title Priority Date Filing Date
LU502351A LU502351B1 (en) 2022-06-24 2022-06-24 Multi-agent system modeling and cooperative control method based on Markov Decision Process

Country Status (1)

Country Link
LU (1) LU502351B1 (en)

Similar Documents

Publication Publication Date Title
CN108803349B (en) Optimal consistency control method and system for nonlinear multi-agent system
Yao et al. Event-triggered practical fixed-time fuzzy containment control for stochastic multiagent systems
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111191934A (en) Multi-target cloud workflow scheduling method based on reinforcement learning strategy
CN114415735B (en) Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN113900380B (en) Robust output formation tracking control method and system for heterogeneous cluster system
CN111178496A (en) Method for exchanging knowledge among agents under multi-agent reinforcement learning cooperative task scene
CN113534660A (en) Multi-agent system cooperative control method and system based on reinforcement learning algorithm
CN114205251B (en) Switch link resource prediction method based on space-time characteristics
CN113359476B (en) Consistency control algorithm design method of multi-agent system under discrete time
CN113642233B (en) Group intelligent collaboration method for optimizing communication mechanism
CN117376355B (en) B5G mass Internet of things resource allocation method and system based on hypergraph
CN115842768B (en) SDN route optimization method based on space-time feature fusion of graph neural network
Xu et al. Living with artificial intelligence: A paradigm shift toward future network traffic control
CN118627535A (en) Multi-agent cooperative control method, device, equipment and medium based on value distribution
CN114637278A (en) Multi-agent fault-tolerant formation tracking control method under multi-leader and switching topology
LU502351B1 (en) Multi-agent system modeling and cooperative control method based on Markov Decision Process
Bauso et al. Multiple UAV cooperative path planning via neuro-dynamic programming
CN114201303A (en) Task unloading optimization method of fixed path AGV in industrial Internet of things environment
CN112198796A (en) Design method of distributed preposed time state observer
CN117008995A (en) Industrial software component service function chain assembly integration method
CN115268275A (en) Multi-agent system consistency tracking method and system based on state observer
CN113110113B (en) Method for realizing grouping consistency of discrete multi-agent system with communication constraint
CN112396501B (en) Order dispatching method and system based on interactive reinforcement learning
CN113050697A (en) Unmanned aerial vehicle cluster consistency cooperative control method based on time Petri network