CN116306966A - Multi-agent reinforcement learning cooperative method based on dynamic graph communication - Google Patents

Multi-agent reinforcement learning cooperative method based on dynamic graph communication Download PDF

Info

Publication number
CN116306966A
Authority
CN
China
Prior art keywords
communication
agent
network
intelligent
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310114762.3A
Other languages
Chinese (zh)
Inventor
李奇峰 (Li Qifeng)
葛宏伟 (Ge Hongwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202310114762.3A priority Critical patent/CN116306966A/en
Publication of CN116306966A publication Critical patent/CN116306966A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention belongs to the field of artificial intelligence and multi-agent cooperation, and relates to a multi-agent reinforcement learning cooperation method based on dynamic graph communication. Existing methods suffer from high communication cost, difficulty in meeting practical application requirements, and difficulty in learning advanced cooperation strategies; the invention therefore aims to achieve effective communication under the more realistic condition of limited communication, thereby promoting cooperation among agents and enabling advanced cooperation strategies to be learned. The method comprises the following steps: establishing a dynamic communication graph; adaptively generating communication weights; real-time communication among agents; action value estimation by each agent; credit allocation among agents by the super network; parameter updating with a temporal-difference loss; and application in a verification environment. The invention enables the model to communicate effectively and adaptively with small communication overhead, significantly improves the cooperation performance among agents, has strong scalability, and can be widely applied in the field of multi-agent cooperation.

Description

Multi-agent reinforcement learning cooperative method based on dynamic graph communication
Technical Field
The invention belongs to the field of artificial intelligence and multi-agent cooperation, and particularly relates to a multi-agent reinforcement learning cooperation method based on dynamic graph communication.
Background
Multi-agent collaboration mainly takes place in an interactive environment, where an agent system comprising multiple agents continually interacts with the environment to maximize the return obtained by the system; each agent makes independent policy decisions, and together the agents cooperatively and autonomously accomplish team goals. Multi-agent cooperation technology plays a vital role in fields such as smart cities, intelligent transportation, vehicle-road cooperation, and unmanned aerial vehicle control, and can be used for tasks such as communication coordination among multiple independent terminals, optimization of resource allocation, and swarm path planning.
In recent years, multi-agent cooperation methods have made great progress, but as the number of agents grows, the complexity of searching the joint cooperation strategy space increases exponentially, and the development of related algorithms is greatly limited by the non-stationarity caused by agents' independent decisions combined with the complex coupling relationships among the agents. Multi-agent reinforcement learning algorithms have therefore received increasing attention as an effective way to adaptively promote agent cooperation: they can learn directly by trial and error from the interaction data between agents and the environment during the training phase, have strong scalability, and show great development prospects.
At present, the main approaches in multi-agent cooperation research generally fall into three classes. (1) Each independently deciding agent locally models the strategies of the other agents and makes individual decisions based on local interaction information and the modeled strategies. (2) The team's global reward is decomposed during a centralized training phase, using a super network to perform reasonable credit allocation among agents, so that cooperation among agents is promoted implicitly within a reinforcement learning framework. (3) Effective communication is enabled between agents, and each agent makes decisions based on local data and communication messages to achieve collaboration. The first class of methods reduces the non-stationarity introduced by the other agents' changing strategies through active modeling, but as the number of agents increases the modeling difficulty also grows exponentially, and complex cooperative tasks cannot be handled. The second class of algorithms guides agents to cooperate by relating the team reward value directly to the task; by reasonably decomposing the team reward value through a super network, the joint action strategy of the multi-agent system can converge to a cooperation strategy that satisfies the monotonicity constraint. The third class of methods generates communication messages either by manual specification or through a purpose-designed network, and promotes agents to cooperatively complete team goals by transmitting effective messages. In practical applications, the second and third classes have higher application value in large-scale multi-agent cooperation because of their suitable learning cost and stronger generalization.
Popular multi-agent reinforcement learning cooperation methods in recent years mainly adopt centralized training with decentralized execution to train and deploy agent decision models. During training, credit allocation among agents is achieved by decomposing the reward signal obtained when the joint action formed by all agents' decisions interacts with the environment, and continual trial and error in the environment drives each individual policy network to converge to an effective joint cooperation strategy. The decomposition of the reward information relies on a super network that has access to the global information of the agent system during the training phase and that should be able to characterize the complete policy space. In the execution phase, the centralized super network is removed, and each agent selects actions relying only on its policy network. Rashid et al. proposed a multi-agent value decomposition framework that integrates the individual Q functions of the agents through a nonlinear super network with non-negative weights, achieving credit allocation during the backward propagation of the reward signal (Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning [C]// Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4295-4304, 2018). The super network can access the true state of the agent system during centralized training, and a nonlinear super network can characterize the monotonicity constraint between individual strategies and the joint strategy more faithfully, so that more effective cooperation strategies are learned; the design of nonlinear super networks is thus one of the research hotspots in multi-agent reinforcement learning. Current research facilitates learning of effective cooperation strategies by limiting the distance between the behavior policy and the target policy or by layering the super networks. Wang et al. used a hierarchical reinforcement learning structure to distribute rewards across two super networks, thereby reducing the difficulty for a single network to characterize the complete policy space (T. Wang, T. Gupta, B. Peng, A. Mahajan, S. Whiteson, and C. Zhang. 2021. RODE: Learning Roles to Decompose Multi-Agent Tasks [C]// In Proceedings of the International Conference on Learning Representations. OpenReview). Communication learning is another important research direction in multi-agent reinforcement learning. Yuan et al. proposed a communication mechanism based on variational inference, in which an agent locally predicts the Q functions of its teammate agents as communication messages while taking the messages obtained from other agents as a bias on its own Q function to achieve stable action value estimation, and reduces communication cost by introducing a communication regularization term (Lei Yuan, Jianhao Wang, Fuxiang Zhang, Chenghe Wang, Zongzhang Zhang, Yang Yu, and Chongjie Zhang. 2022. Multi-Agent Incentive Communication via Decentralized Teammate Modeling [C]// In Proceedings of the AAAI Conference on Artificial Intelligence).
For some multi-agent cooperative tasks that require advanced cooperative behavior, team goals are difficult to complete when relying only on implicit cooperation guidance, so complex cooperation strategies are hard to learn from the credit allocation of the super network alone. Meanwhile, in most realistic multi-agent cooperation scenarios, various communication restrictions are widespread, and existing algorithms often incur excessive communication overhead, so they cannot be applied effectively.
Disclosure of Invention
In view of these problems, the invention provides a multi-agent reinforcement learning cooperation method based on dynamic graph communication, which aims to learn advanced cooperation strategies by introducing a communication model into a credit-allocation-based algorithm; at the same time, the communication range and the degree of communication are controlled through a dynamic communication graph, so that effective communication is achieved under the more realistic condition of limited communication and cooperation among agents is promoted.
The method can work under the widely present condition of limited communication: agents communicate adaptively according to the limited communication domains required in practical applications, a super network performs credit allocation among the agents in the training phase, and in the execution phase the agents make cooperative decisions based on local information and communication messages, which effectively reduces the non-stationarity of the joint decision process and improves the cooperation performance of the multi-agent system.
The technical scheme of the invention is as follows:
a multi-agent reinforcement learning cooperative method based on dynamic graph communication comprises the following steps:
step 1: and extracting the communicable intelligent agents in the intelligent agent communication domain in real time according to the communication limiting conditions of the environment and the intelligent agent system, and establishing a communication diagram.
Step 2: according to the communication diagram of the step 1, the observation information of the local intelligent agents is encoded, and the weight of the communication diagram is generated based on the observation information and a corresponding weight generator so as to control the degree of communication among the intelligent agents.
Step 3: and (3) carrying out communication of observation information codes among the agents based on the weight of the communication diagram in the step (2) and the communication diagram in the step (1).
Step 4: each agent utilizes an action value estimation network to perform individual action value estimation based on the local interaction data and communication messages and the historical information.
Step 5: and (3) the super network gathers all the action value estimates generated in the step (4) and completes the joint action value estimation of the intelligent agent system based on the global information.
Step 6: and updating parameters of the super network according to the rewarding value obtained by interaction of the combined action and the environment, and then reversely transmitting the reliability distribution value of rewarding to the action value estimation network of each agent and updating network parameters of the agent.
Step 7: repeating the steps 1 to 6 until each agent action value estimation network, the communication weight generator and the super network converge or reach the designated training step number, removing the super network, and applying the final communication weight generator and each agent action value estimation network to the interaction environment to carry out intelligent system decision.
Further, the specific process of step 1 is as follows: establish a communication graph $\mathcal{G} = (\mathcal{V}, \varepsilon, w)$ from the communication domains under the communication constraints of the interactive environment, where $\mathcal{G}$ denotes the communication graph, $\mathcal{V}$ denotes the set of agents, $w$ denotes the weights of the edges of the communication graph (initialized to 0), and $\varepsilon$ is the set of edges of the communication graph. The communication graph is established as follows: for all agents $i \neq j$, the edge $(i, j) \in \varepsilon$ if agent $j \in d_i$, where $d_i$ is the limited communication domain of agent $i$.
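To make step 1 concrete, the following is a minimal sketch of how a dynamic communication graph could be built from limited communication domains; the 2-D positions, Euclidean communication radius, and adjacency-matrix representation are illustrative assumptions rather than details taken from the patent.

```python
import numpy as np

def build_communication_graph(positions: np.ndarray, comm_radius: float) -> np.ndarray:
    """Build the adjacency matrix of the dynamic communication graph.

    positions:   (n, 2) array of agent coordinates (assumed locally observable).
    comm_radius: radius of the limited communication domain d_i (assumed Euclidean).
    Returns an (n, n) boolean adjacency matrix with adj[i, j] = True
    iff agent j lies inside the communication domain of agent i (i != j).
    """
    n = positions.shape[0]
    # Pairwise Euclidean distances between all agents.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = dist <= comm_radius          # j in d_i
    np.fill_diagonal(adj, False)       # no self-edges, i != j
    return adj

# Example: 4 agents on a plane with a communication radius of 2.0.
if __name__ == "__main__":
    pos = np.array([[0.0, 0.0], [1.0, 0.5], [3.5, 0.0], [4.0, 1.0]])
    print(build_communication_graph(pos, comm_radius=2.0))
```

The graph is rebuilt at every time step, so the topology tracks the agents' current communication domains.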
Further, the specific process of step 2 is as follows: the local observation information $o_i$ of agent $i$ is encoded by the encoding network into the observation code $e_i$, and the weight of each edge of the communication graph is then generated by the weight generator.

If a learnable weight generator is used, a linear transformation $W$ first maps the observation codes into a high-dimensional space to enhance the expressive capacity of the network, and a single-layer nonlinear network then computes the communication coefficient $c_{ij}$ between every pair of communicable agents:

$$c_{ij} = a\big(We_i \oplus We_j\big) \qquad (1)$$

where $a(\cdot)$ denotes a single-layer nonlinear network, $\oplus$ denotes the concatenation operation, and $e_i$ and $e_j$ denote the observation codes of any two communicable agents $i$ and $j$. Finally, the weights over all communicable agents of each agent are softmax-normalized to ensure scalability:

$$w_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(c_{ij})\big)}{\sum_{k \in d_i} \exp\big(\mathrm{LeakyReLU}(c_{ik})\big)} \qquad (2)$$

where $w_{ij}$ denotes the communication weight between agent $i$ and agent $j$, $\mathrm{LeakyReLU}(\cdot)$ denotes the nonlinear activation function, and $\exp(\cdot)$ denotes the exponential function.

If a similarity-measure weight generator is used, the nonlinear network $a(\cdot)$ is replaced by an inner-product similarity measure:

$$c_{ij} = F(e_i)^{\top} F(e_j) \qquad (3)$$

where $F$ is a linear embedding operation that maps the observation codes into a high-dimensional space.
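The two weight generators of equations (1)-(3) resemble graph-attention-style scoring. A hedged PyTorch sketch is given below; the module names, hidden dimensions, and the handling of agents with no communicable neighbours are assumptions, and the similarity variant is assumed to use the same softmax normalization as equation (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableWeightGenerator(nn.Module):
    """Sketch of the learnable weight generator of equations (1)-(2)."""
    def __init__(self, enc_dim: int, hid_dim: int):
        super().__init__()
        self.W = nn.Linear(enc_dim, hid_dim, bias=False)   # linear transformation W
        self.a = nn.Linear(2 * hid_dim, 1, bias=False)     # single-layer network a(.)

    def forward(self, e: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # e: (n, enc_dim) observation codes; adj: (n, n) boolean communication graph.
        We = self.W(e)                                      # (n, hid)
        n = We.size(0)
        pairs = torch.cat([We.unsqueeze(1).expand(-1, n, -1),
                           We.unsqueeze(0).expand(n, -1, -1)], dim=-1)
        c = self.a(pairs).squeeze(-1)                       # c_ij, eq. (1)
        c = F.leaky_relu(c)                                 # LeakyReLU inside eq. (2)
        c = c.masked_fill(~adj, float("-inf"))              # keep only communicable agents
        w = torch.softmax(c, dim=-1)                        # w_ij, eq. (2)
        return torch.nan_to_num(w)                          # rows with no neighbours -> 0

class SimilarityWeightGenerator(nn.Module):
    """Sketch of the similarity-measure weight generator of equation (3)."""
    def __init__(self, enc_dim: int, hid_dim: int):
        super().__init__()
        self.F = nn.Linear(enc_dim, hid_dim, bias=False)    # linear embedding F of eq. (3)

    def forward(self, e: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        Fe = self.F(e)
        c = Fe @ Fe.t()                                     # inner-product similarity, eq. (3)
        c = c.masked_fill(~adj, float("-inf"))
        return torch.nan_to_num(torch.softmax(c, dim=-1))   # normalized as in eq. (2)
```

Both generators return a row-normalized weight matrix whose sparsity pattern follows the dynamic communication graph, so the degree of communication stays adaptive while the range stays limited.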
Further, the specific process of step 3 is as follows: generate each agent's communication message from the communication graph obtained in step 1 and the communication weights obtained in step 2:

$$m_i = \sum_{j \in d_i} w_{ij} e_j \qquad (4)$$

where $m_i$ denotes the communication message obtained by agent $i$ at the current time.
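Assuming, as equation (4) suggests, that each message is the weighted sum of the communicable neighbours' observation codes, the communication step reduces to a single matrix product; the sketch below makes that assumption explicit.

```python
import torch

def aggregate_messages(w: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Communication step of eq. (4): m_i = sum_j w_ij * e_j over communicable agents j.

    w: (n, n) communication weights (zero outside each agent's communication domain).
    e: (n, enc_dim) observation codes.
    Returns m: (n, enc_dim) messages received by each agent.
    """
    return w @ e
```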
Further, the specific process of step 4 is as follows: generate the information representation $h_i^t$ of the current time from the communication message obtained in step 3 together with the agent's local observation data and history data:

$$h_i^t = \mathrm{GRU}\big(e_i^t \oplus m_i^t,\; h_i^{t-1}\big) \qquad (5)$$

where $\mathrm{GRU}(\cdot)$ denotes a gated recurrent unit neural network, $e_i^t$ and $m_i^t$ denote the observation code and the communication message of agent $i$ at the current time $t$, and $h_i^{t-1}$ denotes the agent's history information. The action value estimation network then performs action value estimation based on this information representation:

$$Q_i(a) = Q_i(a \mid e_i, m_i, h_i; \theta) \qquad (6)$$

where $a$ denotes an optional action of agent $i$ and $\theta$ denotes the parameters of the agent's action value estimation network.
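A possible PyTorch sketch of the per-agent network of equations (5)-(6) follows; the GRU cell size, the concatenation of $e_i^t$ and $m_i^t$ as the recurrent input, and the linear Q head are assumptions consistent with the text, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class AgentQNetwork(nn.Module):
    """Sketch of the per-agent action value estimation network of equations (5)-(6)."""
    def __init__(self, enc_dim: int, n_actions: int, hid_dim: int = 64):
        super().__init__()
        self.gru = nn.GRUCell(2 * enc_dim, hid_dim)    # input: e_i^t concatenated with m_i^t
        self.q_head = nn.Linear(hid_dim, n_actions)    # Q_i(a | e_i, m_i, h_i; theta)

    def forward(self, e_t: torch.Tensor, m_t: torch.Tensor,
                h_prev: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # e_t, m_t: (n, enc_dim); h_prev: (n, hid_dim) history representation h_i^{t-1}.
        h_t = self.gru(torch.cat([e_t, m_t], dim=-1), h_prev)   # eq. (5)
        q = self.q_head(h_t)                                     # eq. (6), one value per action
        return q, h_t
```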
Further, the specific process of step 5 is as follows: perform joint action value estimation from the individual action value estimates obtained in step 4:

$$Q_{tot}(\boldsymbol{a}) = \mathrm{mixing}\big(s, Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n)\big) \qquad (7)$$

where $s$ denotes the global state of the agent system, $Q_{tot}$ denotes the value estimate of the joint action, $\boldsymbol{a}$ denotes the joint action of the agent system, and $\mathrm{mixing}(\cdot)$ denotes the super network.
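The super network of equation (7) is described only as mixing(·); the sketch below assumes a QMIX-style monotonic mixing network (state-conditioned hypernetworks with non-negative weights), in line with the credit-allocation literature cited in the background, and is not claimed to be the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingSuperNetwork(nn.Module):
    """Sketch of the super network of eq. (7): mixes individual Q_i into Q_tot
    with state-conditioned non-negative weights (monotonic credit allocation)."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, q_individual: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # q_individual: (batch, n_agents) chosen-action values; state: (batch, state_dim).
        b = q_individual.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(q_individual.view(b, 1, self.n_agents) @ w1 + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = hidden @ w2 + b2                 # (batch, 1, 1)
        return q_tot.view(b)                     # Q_tot per sample
```

During centralized training the chosen-action values from each agent are fed in together with the global state $s$; at execution time this module is removed, as described in step 7.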
Further, the specific process of step 6 is as follows: compute the temporal-difference loss $\mathcal{L}(\phi, \xi, \theta, \psi)$ from the deviation between the joint action value estimate obtained in step 5 and the actually obtained reward:

$$\mathcal{L}(\phi, \xi, \theta, \psi) = \mathbb{E}_b\Big[\big(r + \gamma \max_{a'} \hat{Q}_{tot}(s', a') - Q_{tot}(s, a)\big)^2\Big] \qquad (8)$$

where $\mathbb{E}_b[\cdot]$ denotes the expectation over the sampled batch $b$; $r$ denotes the obtained reward; $\phi$, $\xi$, $\theta$, and $\psi$ denote the parameters of the observation encoding network, the weight generator, the action value estimation network, and the super network, respectively; $s'$ denotes the state at the next time step; $a'$ denotes the joint action at the next time step; $\hat{Q}_{tot}$ denotes the target network of the joint action value estimation network; and $\gamma$ denotes the discount coefficient.
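Equation (8) corresponds to a standard temporal-difference loss; a minimal sketch follows, where the terminal mask and the batch layout are standard additions assumed here rather than stated in the text.

```python
import torch

def td_loss(q_tot: torch.Tensor, target_q_tot_next: torch.Tensor,
            reward: torch.Tensor, done: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Temporal-difference loss of eq. (8).

    q_tot:             (batch,) Q_tot(s, a) for the executed joint action.
    target_q_tot_next: (batch,) max_a' Q_tot_hat(s', a') from the target super network.
    reward:            (batch,) team reward r.
    done:              (batch,) 1.0 for terminal transitions, else 0.0 (assumed mask).
    """
    target = reward + gamma * (1.0 - done) * target_q_tot_next.detach()
    return torch.mean((target - q_tot) ** 2)
```

Back-propagating this loss through the super network, the action value estimation networks, the weight generator, and the observation encoder updates all learnable parameters jointly.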
The invention has the beneficial effects that:
Under the condition of limited communication that is widely present in practical applications, the invention establishes a real-time dynamic communication graph among the agents using the limited communication domains, and uses a centralized super network to perform credit allocation among the agents during reinforcement learning. As a result, the model communicates effectively and adaptively with small communication overhead, the cooperation performance among agents is significantly improved, and the method has strong scalability.
First, the communication graph structure at the current time is established in real time from the realistic communication domain restrictions of the interactive environment, which effectively reduces the communication overhead and ensures strong scalability.
Second, a learnable weight generator or a similarity-measure weight generator is used to generate the communication weights of the communication topology, realizing adaptive communication control among agents and ensuring that communication is effective in real time.
Finally, in the action value estimation stage, the communication messages and the observation information are encoded together into part of the history information, so that the important communication information in the agents' cooperation process is retained for accurate action value estimation, and the super network performs credit allocation among agents, ensuring effective cooperation among the agents.
Drawings
FIG. 1 is a flow chart of a multi-agent reinforcement learning collaborative method based on dynamic graph communication according to the present invention;
FIG. 2 is a flowchart of training steps according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a test procedure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a limited communication domain scenario in which the present invention operates in conjunction with a specific embodiment.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples, which include, but are not limited to, the following examples.
As shown in fig. 1, the invention provides a multi-agent reinforcement learning cooperative method based on dynamic graph communication, which comprises the following specific implementation processes:
1. as shown in fig. 4, the communication range of each agent in the interaction environment such as a game is limited, the communication range is consistent with the observable range, only information in a certain range can be observed, and the communication diagram needs to be established according to the observation domain of the agent in the game
Figure BDA0004078140730000081
Figure BDA0004078140730000082
Wherein->
Figure BDA0004078140730000083
A communication diagram is shown which shows a communication diagram,
Figure BDA0004078140730000084
representing the agent set, w represents the weight of each side of the communication graph and is initialized to 0. As shown in fig. 2, the mapping process is as follows: />
Figure BDA0004078140730000085
i≠j,/>
Figure BDA0004078140730000086
If the agent j epsilon d i Wherein d is i Is the limited communication domain of agent i. The local observation information o of the agent i is reused by the coding network shown on the left side of fig. 2 j The code is observation code e j . And then, according to the weight generator, the weight of each side of the communication graph is generated by taking the observation code of each agent as input, and if a learnable weight generator is used, firstly, a linear transformation W is used, and the observation code is mapped to a high-dimensional space so as to enhance the network expression capability. And then calculating the communication coefficient between every two corresponding communicable intelligent agents by using a single-layer nonlinear network, wherein the calculation formula is as follows:
Figure BDA0004078140730000087
wherein a (-) represents a single layer of nonlinear network,
Figure BDA0004078140730000088
representing the join operation, e i And e j Respectively representing any communicable agent i and agent j, ">
Figure BDA0004078140730000089
Finally, the weights of all communicable agents of each agent are softmax normalized to ensure expansibility, and the calculation formula is as follows:
Figure BDA00040781407300000810
wherein w is ij Representing the communication weight between agent i and agent j, leakyReLU () represents a nonlinear activation function, exp (·) represents an exponential sign.
If a weight generator of similarity measure is used, the nonlinear network is replaced by the inner product similarity measure, and the calculation formula is as follows:
Figure BDA00040781407300000811
where F is a linear embedding operation, the observation code can be mapped to a high-dimensional space.
2. Inter-agent communication
As shown in fig. 2, the communication messages $m_1, m_2, \ldots, m_n$ of the agents are generated from the communication graph obtained in step 1 and the communication weights obtained in step 2; the calculation formula is:

$$m_i = \sum_{j \in d_i} w_{ij} e_j \qquad (12)$$

As further shown in fig. 2, each message is then sent to the corresponding agent unit.
3. Agent performs action value estimation in interactive environment
As shown in fig. 2, the information representation $h_i^t$ of the current time is generated by the GRU network in fig. 2 from the communication message obtained in step 3 together with the agent's local observation data and history data:

$$h_i^t = \mathrm{GRU}\big(e_i^t \oplus m_i^t,\; h_i^{t-1}\big) \qquad (13)$$

where $\mathrm{GRU}(\cdot)$ denotes a gated recurrent unit neural network, $e_i^t$ and $m_i^t$ denote the observation code and the communication message of agent $i$ at the current time $t$, and $h_i^{t-1}$ denotes the agent's history information. On this basis, the action value estimation network shown in fig. 2 performs action value estimation:

$$Q_i(a) = Q_i(a \mid e_i, m_i, h_i; \theta) \qquad (14)$$

where $a$ denotes an optional action of agent $i$ and $\theta$ denotes the parameters of the agent's action value estimation network.

The joint action value is then estimated from the obtained individual action value estimates:

$$Q_{tot}(\boldsymbol{a}) = \mathrm{mixing}\big(s, Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n)\big) \qquad (15)$$

where $s$ denotes the global state of the agent system, $Q_{tot}$ denotes the value estimate of the joint action, and $\boldsymbol{a}$ denotes the joint action of the agent system.
4. Backward update of the learnable network parameters
As shown in FIG. 2, the temporal-difference loss $\mathcal{L}(\phi, \xi, \theta, \psi)$ is computed from the deviation between the joint action value estimate obtained in step 5 and the reward actually obtained from the game system:

$$\mathcal{L}(\phi, \xi, \theta, \psi) = \mathbb{E}_b\Big[\big(r + \gamma \max_{a'} \hat{Q}_{tot}(s', a') - Q_{tot}(s, a)\big)^2\Big] \qquad (16)$$

where $\mathbb{E}_b[\cdot]$ denotes the expectation over the sampled batch $b$; $r$ denotes the obtained reward; $\phi$, $\xi$, $\theta$, and $\psi$ denote the parameters of the observation encoding network, the weight generator, the action value estimation network, and the super network, respectively; $s'$ denotes the state at the next time step; $a'$ denotes the joint action at the next time step; $\hat{Q}_{tot}$ denotes the target network of the joint action value estimation network; and $\gamma$ denotes the discount coefficient.
5. Multi-agent collaborative tasks in a testing environment
As shown in fig. 3, once the model trained by the above steps has converged or the specified number of training steps has been reached, the super network is removed, and the final weight generator and the n individual agent action value estimation networks are applied to the interaction environment to make agent decisions. In the execution phase, the agents communicate adaptively according to the dynamic communication graph network and carry out the cooperative task.
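Putting the pieces together, the execution phase described above might look like the rollout below. It reuses the build_communication_graph, weight-generator, and agent-network sketches from the earlier sections, and the environment API (reset/step returning local observations, agent positions, a scalar team reward, and a done flag) is an assumption for illustration only.

```python
import torch

@torch.no_grad()
def execute_episode(env, encoder, weight_gen, agent_net, comm_radius: float, hid_dim: int = 64):
    """Execution-phase rollout: adaptive communication over the dynamic graph and
    decentralized greedy action selection, with the super network removed (sketch)."""
    obs, positions = env.reset()                    # obs: (n, obs_dim), positions: (n, 2), assumed API
    n = obs.shape[0]
    h = torch.zeros(n, hid_dim)                     # initial history h_i^0
    done, total_reward = False, 0.0
    while not done:
        adj = torch.as_tensor(build_communication_graph(positions, comm_radius))
        e = encoder(torch.as_tensor(obs, dtype=torch.float32))   # observation codes e_i
        w = weight_gen(e, adj)                      # adaptive communication weights w_ij
        m = w @ e                                   # messages m_i, eq. (12)
        q, h = agent_net(e, m, h)                   # per-agent action values and new history
        actions = q.argmax(dim=-1)                  # decentralized greedy decisions
        (obs, positions), reward, done = env.step(actions.tolist())   # assumed API
        total_reward += reward
    return total_reward
```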

Claims (1)

1. A multi-agent reinforcement learning cooperative method based on dynamic graph communication is characterized by comprising the following steps:
step 1: extract the communicable agents within each agent's communication domain in real time according to the communication constraints of the environment and the agent system, and establish a communication graph; specifically:
establish a communication graph $\mathcal{G} = (\mathcal{V}, \varepsilon, w)$ from the communication domains under the communication constraints of the interactive environment, where $\mathcal{G}$ denotes the communication graph, $\mathcal{V}$ denotes the agent set, $w$ denotes the weights of the edges of the communication graph and is initialized to 0, and $\varepsilon$ is the edge set of the communication graph; the communication graph is established as follows: for all agents $i \neq j$, the edge $(i, j) \in \varepsilon$ if agent $j \in d_i$, where $d_i$ is the limited communication domain of agent $i$;
step 2: according to the communication graph of step 1, encode the local observation information of each agent, and generate the weights of the communication graph from the observation codes and the corresponding weight generator so as to control the degree of communication among agents; specifically:
the local observation information $o_i$ of agent $i$ is encoded by the encoding network into the observation code $e_i$, and the weight of each edge of the communication graph is generated by the weight generator;
if a learnable weight generator is used, a linear transformation $W$ first maps the observation codes into a high-dimensional space to enhance the expressive capacity of the network, and a single-layer nonlinear network then computes the communication coefficient $c_{ij}$ between every pair of communicable agents:

$$c_{ij} = a\big(We_i \oplus We_j\big) \qquad (1)$$

where $a(\cdot)$ denotes a single-layer nonlinear network, $\oplus$ denotes the concatenation operation, and $e_i$ and $e_j$ denote the observation codes of any two communicable agents $i$ and $j$; finally, the weights over all communicable agents of each agent are softmax-normalized to ensure scalability:

$$w_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(c_{ij})\big)}{\sum_{k \in d_i} \exp\big(\mathrm{LeakyReLU}(c_{ik})\big)} \qquad (2)$$

where $w_{ij}$ denotes the communication weight between agent $i$ and agent $j$, $\mathrm{LeakyReLU}(\cdot)$ denotes the nonlinear activation function, and $\exp(\cdot)$ denotes the exponential function;
if a similarity-measure weight generator is used, the nonlinear network $a(\cdot)$ is replaced by an inner-product similarity measure:

$$c_{ij} = F(e_i)^{\top} F(e_j) \qquad (3)$$

where $F$ is a linear embedding operation that maps the observation codes into a high-dimensional space;
step 3: based on the weights of the communication graph from step 2 and the communication graph from step 1, communicate the observation codes among the agents; specifically:
generate each agent's communication message:

$$m_i = \sum_{j \in d_i} w_{ij} e_j \qquad (4)$$

where $m_i$ denotes the communication message obtained by agent $i$ at the current time;
step 4: each agent uses its action value estimation network to complete individual action value estimation based on the local interaction data, the communication messages, and the history information; specifically:
generate the information representation $h_i^t$ of the current time from the communication message obtained in step 3 together with the agent's local observation data and history data:

$$h_i^t = \mathrm{GRU}\big(e_i^t \oplus m_i^t,\; h_i^{t-1}\big) \qquad (5)$$

where $\mathrm{GRU}(\cdot)$ denotes a gated recurrent unit neural network, $e_i^t$ and $m_i^t$ denote the observation code and the communication message of agent $i$ at the current time $t$, and $h_i^{t-1}$ denotes the agent's history information; the action value estimation network performs action value estimation based on this information representation:

$$Q_i(a) = Q_i(a \mid e_i, m_i, h_i; \theta) \qquad (6)$$

where $a$ denotes an optional action of agent $i$ and $\theta$ denotes the parameters of the agent's action value estimation network;
step 5: the super network gathers all the action value estimates generated in step 4 and completes the joint action value estimation of the agent system based on the global information; specifically:

$$Q_{tot}(\boldsymbol{a}) = \mathrm{mixing}\big(s, Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n)\big) \qquad (7)$$

where $s$ denotes the global state of the agent system, $Q_{tot}$ denotes the value estimate of the joint action, $\boldsymbol{a}$ denotes the joint action of the agent system, and $\mathrm{mixing}(\cdot)$ denotes the super network;
step 6: update the parameters of the super network according to the reward value obtained from the interaction of the joint action with the environment, then back-propagate the credit-assigned reward to each agent's action value estimation network and update its network parameters; specifically:
compute the temporal-difference loss $\mathcal{L}(\phi, \xi, \theta, \psi)$ from the deviation between the joint action value estimate obtained in step 5 and the actually obtained reward:

$$\mathcal{L}(\phi, \xi, \theta, \psi) = \mathbb{E}_b\Big[\big(r + \gamma \max_{a'} \hat{Q}_{tot}(s', a') - Q_{tot}(s, a)\big)^2\Big] \qquad (8)$$

where $\mathbb{E}_b[\cdot]$ denotes the expectation over the sampled batch $b$; $r$ denotes the obtained reward; $\phi$, $\xi$, $\theta$, and $\psi$ denote the parameters of the observation encoding network, the weight generator, the action value estimation network, and the super network, respectively; $s'$ denotes the state at the next time step; $a'$ denotes the joint action at the next time step; $\hat{Q}_{tot}$ denotes the target network of the joint action value estimation network; and $\gamma$ denotes the discount coefficient;
step 7: repeat steps 1 to 6 until each agent's action value estimation network, the communication weight generator, and the super network converge or the specified number of training steps is reached; then remove the super network and apply the final communication weight generator and each agent's action value estimation network to the interaction environment to make agent system decisions.
CN202310114762.3A 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication Pending CN116306966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310114762.3A CN116306966A (en) 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310114762.3A CN116306966A (en) 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication

Publications (1)

Publication Number Publication Date
CN116306966A true CN116306966A (en) 2023-06-23

Family

ID=86827867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310114762.3A Pending CN116306966A (en) 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication

Country Status (1)

Country Link
CN (1) CN116306966A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116902006A (en) * 2023-08-29 2023-10-20 酷哇科技有限公司 Reinforced learning multi-vehicle cooperative system and method based on strategy constraint communication


Similar Documents

Publication Publication Date Title
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
Jiang et al. Distributed resource scheduling for large-scale MEC systems: A multiagent ensemble deep reinforcement learning with imitation acceleration
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
CN113919485B (en) Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN113642233B (en) Group intelligent collaboration method for optimizing communication mechanism
Wang et al. Design of intelligent connected cruise control with vehicle-to-vehicle communication delays
CN113592162B (en) Multi-agent reinforcement learning-based multi-underwater unmanned vehicle collaborative search method
CN116306966A (en) Multi-agent reinforcement learning cooperative method based on dynamic graph communication
CN111178496A (en) Method for exchanging knowledge among agents under multi-agent reinforcement learning cooperative task scene
Li et al. Learning-based predictive control via real-time aggregate flexibility
CN113780576A (en) Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
Zhang et al. Multi-robot cooperative target encirclement through learning distributed transferable policy
Kim et al. Optimizing large-scale fleet management on a road network using multi-agent deep reinforcement learning with graph neural network
CN112564189B (en) Active and reactive coordination optimization control method
CN116976523A (en) Distributed economic dispatching method based on partially observable reinforcement learning
CN114707613B (en) Layered depth strategy gradient network-based power grid regulation and control method
CN115982610A (en) Communication reinforcement learning algorithm for promoting multi-agent cooperation
CN113435475B (en) Multi-agent communication cooperation method
CN115758871A (en) Power distribution network reconstruction energy-saving loss-reducing method and device based on security reinforcement learning
CN115187056A (en) Multi-agent cooperative resource allocation method considering fairness principle
Sierra-Garcia et al. Federated discrete reinforcement learning for automatic guided vehicle control
Ma et al. AGRCNet: communicate by attentional graph relations in multi-agent reinforcement learning for traffic signal control
Nai et al. A Vehicle Path Planning Algorithm Based on Mixed Policy Gradient Actor‐Critic Model with Random Escape Term and Filter Optimization
Habibi et al. Offering a Demand‐Based Charging Method Using the GBO Algorithm and Fuzzy Logic in the WRSN for Wireless Power Transfer by UAV
CN113592079B (en) Collaborative multi-agent communication method oriented to large-scale task space

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination