CN114298178A - Multi-agent communication learning method - Google Patents

Multi-agent communication learning method

Info

Publication number
CN114298178A
Authority
CN
China
Prior art keywords
agent
agents
communication
phase
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111549398.0A
Other languages
Chinese (zh)
Inventor
代浩
吴嘉澍
王洋
叶可江
张锦霞
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111549398.0A priority Critical patent/CN114298178A/en
Publication of CN114298178A publication Critical patent/CN114298178A/en
Priority to PCT/CN2022/138140 priority patent/WO2023109699A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a multi-agent communication learning method whose architecture comprises a CriticNet, an ActorNet, a PriorNet and an EncoderNet. The CriticNet is used for calculating the communication importance in the training phase and for training the three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet. The ActorNet is used for selecting corresponding actions on the agent side and works in both the training phase and the execution phase; the ActorNet learns the agent's policy π in the training phase and then generates the corresponding action a_i^t according to the local observation and the received messages, that is, a_i^t = π(o_i^t, m_i^t), where m_i^t is the message received by agent i at time t. The PriorNet is used by the agent to select communication objects; it evaluates the agents seen in the local observation and outputs an importance value p_ij^t, i.e. the importance of agent j's message to agent i. The EncoderNet is used by the agent to encode its own information so as to reduce the size of the message body.

Description

Multi-agent communication learning method
Technical Field
The invention relates to a communication learning method, and in particular to a multi-agent communication learning method.
Background
In a cooperative multi-agent system, all cooperating agents share a single global reward function. However, the observation range of each agent is limited, so global information is lacking for perception and decision-making during cooperation; mutually exclusive decisions arise among the agents, and the global optimum is difficult to reach.
Deep Reinforcement Learning (DRL), an advanced artificial intelligence technique, has enjoyed great success in many challenging real-world problems. It is widely deployed on different devices, such as smart cars, smart phones, wearable devices, smart cameras, and other smart objects in edge networks. Cooperative multi-agent reinforcement learning is a paradigm of DRL that is both more difficult and of greater practical value: each agent observes only locally and lacks global information, the joint action space is very large, and computation is complex; meanwhile, because there is only one global reward, it is difficult to assign a corresponding reward to each individual agent, which makes training hard and convergence difficult to guarantee.
To address these difficulties, current mainstream multi-agent algorithms adopt a centralized training and distributed execution (CTDE) architecture: global information is available during training, while only the agent's own observations are available during execution. The architecture uses a critic network during training, which updates the critic and actor networks according to the state-action combinations of all agents; during execution each agent keeps only an independent actor network and makes decisions from local observation. Typical architectures such as IQL and QMIX have global information during training, while each agent can only decide based on local information during execution. In these methods, other agents are modeled as part of the environment, which reduces the problem to a single-agent one, so convergence cannot be guaranteed and an agent easily falls into endless exploration.
Therefore, much research has turned to communication-based multi-agent reinforcement learning. The most direct idea is to introduce message passing into multi-agent cooperation: local centralization is achieved through message passing among agents, which alleviates the non-stationary environment problem and promotes cooperation. The current mainstream method CommNet uses a mean unit among the policy networks of multiple agents to receive the local observations of all agents and broadcasts the generated message to all agents (a star communication architecture); TarMAC is a fully connected architecture in which all agents broadcast messages. Both the star and the fully connected architectures aim to ensure that no message generated by any agent is missed, that local observation information can be transmitted to every agent, and that agents can make decisions with global information.
The existing communication learning methods ensure that every agent can obtain the messages of all other agents, but this introduces a large amount of redundant information. Because the dependencies among agents differ, transferring information between unrelated agents is not only useless but may even negatively impact an agent's decision-making.
Meanwhile, redundant message transmission places a heavy burden on the edge network: the edge network has a complex structure and limited communication bandwidth, so traditional communication learning methods are often difficult to apply in the edge environment. Since the main application scenario of multi-agent reinforcement learning is the edge network environment, in order to resolve the mismatch between network bandwidth and the resources required by communication learning, the invention analyzes the influence of other agents' messages on the current agent, proposes an index describing message importance, groups the agents accordingly, reduces network communication traffic through the idea of hierarchical transmission, and realizes a communication learning method for deep reinforcement learning oriented to edge networks.
Disclosure of Invention
An advantage of the present invention is to provide a multi-agent communication learning method that introduces message passing among multiple agents to transmit local observations, so that the agents can take global conditions into account when making decisions.
An advantage of the present invention is to provide a multi-agent communication learning method that designs an importance ranking index and an efficient grouping algorithm to reduce the amount of messages to be transmitted, realizing an efficient communication learning method that effectively reduces the communication bandwidth consumed by unnecessary messages.
An advantage of the present invention is to provide a multi-agent communication learning method that can be used for all kinds of multi-agent reinforcement learning applications in an edge network, such as multi-agent intelligent driving, robot navigation, and logistics scheduling.
An advantage of the present invention is to provide a multi-agent communication learning method suitable for scenarios requiring multi-scene fusion sensing, such as multi-camera fusion and similar scenes.
The technical solution provided by the invention for the above technical problem is as follows:
The invention provides a multi-agent communication learning method, which comprises:
a CriticNet, wherein the CriticNet is used for calculating the communication importance in the training phase and for training three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet;
the ActorNet, wherein the ActorNet is used for selecting corresponding actions on the agent side, works in both the training phase and the execution phase, learns the agent's policy π in the training phase, and then generates the corresponding action a_i^t according to the local observation and the received messages, that is, a_i^t = π(o_i^t, m_i^t), where m_i^t is the message received by agent i at time t;
the PriorNet, wherein the PriorNet is used by the agent to select communication objects; the PriorNet evaluates the agents observed in the local observation and outputs an importance value p_ij^t, i.e. the importance of agent j's message to agent i; and
the EncoderNet, wherein the EncoderNet is used by the agent to encode its own information in order to reduce the size of the message body.
Preferably, the CriticNet runs in the cloud and works only in the training phase; using the global reward and the communication priority, the CriticNet computes the network loss, passes the gradients back to the remaining networks, and updates their parameters.
Preferably, when the importance value exceeds a certain threshold, it indicates that agent i currently needs to obtain the message of agent j to make a decision.
Preferably, the agent encodes previous actions of the agent itself together with observations for reference by other agents, improving the stability of the cooperation.
Preferably, the method further comprises a method for calculating the importance, with the following steps:
Step A: observe whether removing the message of agent j changes the action output by the ActorNet;
Step B: since the ActorNet outputs a distribution over the action set, use the KL divergence to measure the difference between the agent's output action distributions, with the specific formula:
p_ij = D_KL( π(a | o_{i}) || π(a | o_{{i}\j}) )
Step C: here o_{i} denotes the set of messages of all the other agents observed by agent i, and o_{{i}\j} denotes that set of messages excluding agent j; the difference calculated by the formula indicates whether the decision distribution without agent j's message is consistent with the decision distribution with agent j's message;
Step D: if the difference is large, the message of agent j is important to agent i, so the communication confidence is high;
Step E: after the confidences of all the agents are calculated, a confidence matrix M between the agents is obtained, and the agents are grouped according to this confidence matrix.
Preferably, the method further comprises a distributed grouping method in which the PriorNet outputs two values, query and signature: the signature vector is the agent's information fingerprint and comprises the encoding of the agent's position and label; the query vector is the query information and represents the encoding of the set of agents with which the agent needs to communicate.
Preferably, the method further comprises a communication mechanism, wherein the communication mechanism comprises a handshake phase, an election phase, a communication phase and a decision phase. In the handshake phase, every agent broadcasts its query and signature to the agents within its observation; after receiving the queries and signatures, the agents restore the communication confidence matrix by multiplying the vectors. In the election phase, every agent computes the adjacency graph and selects the agent with the greatest out-degree as a preset agent, i.e. the agent whose message most other agents want for their decisions; this preset agent serves as the leader node. In the communication phase, all non-leader nodes send their own messages to the leader node; the leader encodes the received messages through the encoder network, and communication is then carried out among the leaders, who transmit the messages to one another. In the decision phase, the leader makes a decision according to the received leader messages and sends the decision and the messages to the other non-leader agents in the same group, and those agents make their next decisions accordingly.
Compared with the current mainstream star-shaped and fully connected methods, the method has the following advantages:
1. Both fully connected and star communication networks disregard the effect of the message itself on the agent's decision-making; receiving an inappropriate message may hinder agent convergence and thus global reward maximization. The invention measures message importance with the KL divergence, ensures that only effective information is transmitted, avoids redundant message transmission, and improves the convergence rate.
2. The invention communicates by grouping agents and electing leaders, which greatly reduces the number of communication links and the communication bandwidth consumption.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a network diagram of a multi-agent communication learning method provided by the present invention.
FIG. 2 is a schematic diagram of spectral clustering of a multi-agent communication learning method provided by the present invention.
Fig. 3 illustrates the group communication of agents in the multi-agent communication learning method provided by the present invention.
FIG. 4 is a diagram illustrating an improvement in global rewards for cooperating agents in a multi-agent communication learning method of the present invention.
Fig. 5 illustrates the communication traffic among multiple agents in the multi-agent communication learning method provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
A typical distributed edge computing architecture is composed of a plurality of edge devices (denoted "Device"). Assuming there are N edge devices, each device i can be regarded as an agent; the agents can be connected with each other through WiFi, 5G, and other networks, and have limited computing power and bandwidth resources. Each agent has an action set A; at each time t, agent i has its own local observation o_i^t, and the agent selects and executes the next action based on its own observation and action policy, that is, a_i^t = π(o_i^t).
Meanwhile, once all agents have taken their actions, they all obtain a global reward value r = env(a_0, a_1, ..., a_n).
The goal of a cooperative multi-agent system is to maximize the cumulative value of this global reward r, so all agents need to keep track of the global information of interest through messaging to achieve a cooperative decision.
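As a rough illustration of this setting, the following sketch shows the loop in which each agent acts from its local observation o_i^t and all agents then share one global reward r = env(a_0, a_1, ..., a_n). The CoopEnv class, the placeholder reward, and the random stand-in policies are assumptions made for illustration only; none of them appear in the patent.

```python
# Minimal sketch of the cooperative multi-agent loop described above.
# `CoopEnv` and the random stand-in policies are hypothetical, not part of the patent.
import numpy as np

class CoopEnv:
    """Toy environment: N agents, a discrete action set A, one shared global reward."""
    def __init__(self, n_agents, n_actions, obs_dim):
        self.n_agents, self.n_actions, self.obs_dim = n_agents, n_actions, obs_dim

    def observe(self, i):
        # Local observation o_i^t of agent i (limited view, no global state).
        return np.random.randn(self.obs_dim)

    def step(self, actions):
        # All agents act; the environment returns a single global reward
        # r = env(a_0, a_1, ..., a_n) shared by every agent.
        return float(-np.var(actions))  # placeholder reward

env = CoopEnv(n_agents=4, n_actions=5, obs_dim=8)
cumulative_r = 0.0
for t in range(100):
    obs = [env.observe(i) for i in range(env.n_agents)]
    # Each agent samples a_i^t from its own policy given o_i^t (and, once
    # communication is added, the received messages m_i^t).
    actions = [np.random.randint(env.n_actions) for _ in obs]  # stand-in policies
    cumulative_r += env.step(actions)
```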
The invention follows the CTDE framework: full information exchange is maintained in the training phase, while in the execution phase information encoding and communication-object selection are carried out according to the trained communication networks.
As shown in FIG. 1, the multi-agent communication learning method of the invention comprises a CriticNet, an ActorNet, a PriorNet and an EncoderNet. The CriticNet is used for calculating the communication importance in the training phase and for training the three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet. Further, the CriticNet runs in the cloud and works only in the training phase; using the global reward and the communication priority, the CriticNet computes the network loss, passes the gradients back to the remaining networks, and updates their parameters. The ActorNet is used for selecting corresponding actions on the agent side and works in both the training phase and the execution phase; the ActorNet learns the agent's policy π in the training phase and then generates the corresponding action a_i^t according to the local observation and the received messages, that is, a_i^t = π(o_i^t, m_i^t), where m_i^t is the message received by agent i at time t. The PriorNet is used by the agent to select communication objects; it evaluates the agents seen in the local observation and outputs an importance value p_ij^t, namely the importance of agent j's message to agent i; when this importance value exceeds a certain threshold, it indicates that agent i currently needs agent j's message to make a decision. The EncoderNet is used by the agent to encode its own information: because the agent's observation of the environment is low-dimensional and sparse, it is converted through the encoding network so as to reduce the size of the message body; in addition to the observation, the agent also encodes its own previous action together with the observation for reference by other agents, improving the stability of the cooperation.
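To make the division of labor among the on-device networks concrete, the following is a minimal sketch of the ActorNet, PriorNet and EncoderNet as small PyTorch modules. The use of PyTorch, the layer sizes, and the exact input/output shapes are assumptions for illustration only; the patent does not specify them.

```python
# Hedged sketch of the three on-device networks; architectures are assumed.
import torch
import torch.nn as nn

class EncoderNet(nn.Module):
    """Encodes an agent's local observation and previous action into a compact message."""
    def __init__(self, obs_dim, act_dim, msg_dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                                nn.Linear(64, msg_dim))
    def forward(self, obs, prev_action):
        return self.fc(torch.cat([obs, prev_action], dim=-1))

class PriorNet(nn.Module):
    """Outputs the query and signature vectors used to estimate communication importance."""
    def __init__(self, obs_dim, key_dim):
        super().__init__()
        self.query = nn.Linear(obs_dim, key_dim)
        self.signature = nn.Linear(obs_dim, key_dim)
    def forward(self, obs):
        return self.query(obs), self.signature(obs)

class ActorNet(nn.Module):
    """Policy pi: maps local observation plus received message m_i^t to an action distribution."""
    def __init__(self, obs_dim, msg_dim, n_actions):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(obs_dim + msg_dim, 64), nn.ReLU(),
                                nn.Linear(64, n_actions))
    def forward(self, obs, msg):
        return torch.softmax(self.fc(torch.cat([obs, msg], dim=-1)), dim=-1)
```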
For the policy network, namely the ActorNet, and the corresponding reward loss, the multi-agent communication learning method uses a cross-entropy loss function as the error and gradient descent as the parameter update method.
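A hedged sketch of such an update step is shown below; the optimizer choice, the batching, and the form of the targets are assumptions, since the patent only names the cross-entropy error and gradient descent.

```python
# Hedged sketch of one ActorNet update: cross-entropy loss plus a gradient step.
# `actor`, `optimizer`, and the target format are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def update_actor(actor, optimizer, obs, msg, target_actions):
    """obs, msg: batched inputs; target_actions: LongTensor of action indices."""
    probs = actor(obs, msg)                                       # pi(a | o, m)
    loss = F.nll_loss(torch.log(probs + 1e-8), target_actions)   # cross-entropy error
    optimizer.zero_grad()
    loss.backward()                                               # gradient descent step
    optimizer.step()
    return loss.item()
```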
Furthermore, the multi-agent communication learning method of the invention specifies how the agents select their communication objects and how the communication interaction is carried out.
The invention provides a method for calculating the importance, i.e. how to weight the other agents observed by agent i and allocate the degree of communication; the steps are as follows:
Step A: observe whether removing the message of agent j changes the action output by the ActorNet;
Step B: since the ActorNet outputs a distribution over the action set, use the KL divergence to measure the difference between the agent's output action distributions, with the specific formula:
p_ij = D_KL( π(a | o_{i}) || π(a | o_{{i}\j}) )
Step C: here o_{i} denotes the set of messages of all the other agents observed by agent i, and o_{{i}\j} denotes that set of messages excluding agent j; the difference calculated by the formula indicates whether the decision distribution without agent j's message is consistent with the decision distribution with agent j's message;
Step D: if the difference is large, the message of agent j is important to agent i, so the communication confidence is high;
It should be noted that this requires computing the ActorNet output many times, so the calculation is performed only in the training phase; the result is used as a supervision signal to train the PriorNet, so that in the execution phase the communication confidence can be obtained directly from the PriorNet without repeated calculation.
Step E: after the confidences of all the agents are calculated, a confidence matrix M between the agents is obtained, and the agents are grouped according to this confidence matrix.
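The following sketch illustrates Step B: it compares the ActorNet action distribution computed with and without agent j's message and returns the KL divergence as the communication confidence for the pair (i, j). The mean aggregation of messages and the helper names are assumptions; in training, the returned value would serve as the supervision signal for the PriorNet as described above.

```python
# Hedged sketch of the KL-based importance measure for agent pair (i, j).
# `actor` is an ActorNet-like callable; `messages` maps sender id -> message tensor.
# Assumes agent i has received messages from at least two agents.
import torch

def message_importance(actor, obs_i, messages, j):
    """KL( pi(. | o_i, all messages) || pi(. | o_i, messages without agent j) )."""
    msg_all = torch.stack(list(messages.values())).mean(dim=0)
    msg_wo_j = torch.stack([m for k, m in messages.items() if k != j]).mean(dim=0)
    with torch.no_grad():
        p_full = actor(obs_i, msg_all)    # action distribution with agent j's message
        p_wo_j = actor(obs_i, msg_wo_j)   # action distribution without it
    # A large divergence means agent j's message changes agent i's decision,
    # so the communication confidence for (i, j) should be high.
    return torch.sum(p_full * (torch.log(p_full + 1e-8) - torch.log(p_wo_j + 1e-8)))
```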
An exemplary confidence matrix M is sparse, indicating that most agents do not need to communicate with each other, and the agents can therefore be grouped by a spectral clustering algorithm. Spectral clustering is an algorithm that evolved from graph theory and was later widely applied to clustering. Its main idea is to treat all data as points in space that can be connected by edges: the edge weight between two distant points is low, the edge weight between two close points is high, and the graph formed by all the data points is cut so that the sum of edge weights between different subgraphs is as low as possible while the sum of edge weights within each subgraph is as high as possible, thereby achieving clustering.
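A minimal sketch of this grouping step, assuming a small hand-written confidence matrix M and using scikit-learn's spectral clustering with the matrix as a precomputed affinity; the example values and the cluster count are assumptions:

```python
# Hedged sketch: group agents by spectral clustering on the confidence matrix M.
import numpy as np
from sklearn.cluster import SpectralClustering

M = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])   # pairwise communication confidence (example)

# Use M directly as a precomputed affinity: agents with high mutual confidence
# fall into the same group, keeping intra-group links dense and inter-group links sparse.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            assign_labels="discretize",
                            random_state=0).fit_predict(M)
print(labels)  # e.g. [0 0 1 1]
```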
As shown in FIG. 3, with the clustering algorithm, communication within each group can be made dense, while communication between groups is made sparse.
In the execution stage, the agents communicate in a distributed manner and there is no central node to help with grouping, so the invention provides a distributed grouping method. The PriorNet outputs two values, query and signature: the signature vector is the agent's information fingerprint and comprises the encoding of the agent's position and label; the query vector is the query information and represents the encoding of the set of agents with which the agent needs to communicate.
Further, the communication mechanism of the invention comprises a handshake phase, an election phase, a communication phase and a decision phase. In the handshake phase, every agent broadcasts its query and signature to the agents within its observation; after receiving the queries and signatures, the agents can restore the communication confidence matrix by multiplying the vectors. In the election phase, after computing the confidence matrix, every agent computes the adjacency graph and selects the agent with the greatest out-degree, i.e. the agent whose message most other agents want for their decisions, so that this agent serves as the leader node. In the communication phase, all non-leader nodes send their own messages to the leader node; the leader encodes the received messages through the encoder network, and communication is then carried out among the leaders, who transmit the messages to one another. In the decision phase, the leader makes a decision according to the received leader messages and sends the decision and the messages to the other non-leader agents in the same group, and those agents make their next decisions accordingly.
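The handshake and election phases can be sketched as follows; the dot-product form of the confidence reconstruction, the threshold value, and the edge orientation used to count the degree are assumptions made for illustration.

```python
# Hedged sketch of the handshake and election phases: confidence is rebuilt from
# query/signature vectors, and the most-demanded agent becomes the group leader.
import numpy as np

def elect_leader(queries, signatures, threshold=0.5):
    """queries, signatures: (N, d) arrays produced by PriorNet for N agents."""
    # Handshake: agent i recovers its confidence in agent j's message as q_i . s_j.
    M = queries @ signatures.T                # (N, N) confidence matrix
    np.fill_diagonal(M, 0.0)
    # Election: build the adjacency graph and count, for each agent j, how many
    # agents want j's message (j's degree in the thresholded graph).
    adjacency = (M > threshold).astype(int)
    degree = adjacency.sum(axis=0)
    return int(np.argmax(degree)), M          # leader index and the rebuilt matrix

rng = np.random.default_rng(0)
leader, M = elect_leader(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```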
Through this grouped communication mode, the invention effectively reduces the communication cost, reduces the number of communication links, and realizes efficient communication learning for multi-agent reinforcement learning: the importance of messages is calculated and measured through the KL divergence; the agents are grouped by applying a spectral clustering algorithm to the confidence matrix, thereby reducing communication links; and leader nodes are elected within each group by graph out-degree, with inter-group communication carried out by the elected leaders, thereby reducing the communication traffic.
As shown in FIG. 4, the feasibility of the invention has been verified by extensive experiments in an OpenAI open-source multi-agent reinforcement learning environment, where the invention helps promote cooperation among multiple agents and maximize the global reward.
As shown in FIG. 5, the communication volume first increases rapidly as the agents learn to promote cooperation through communication; as training continues, the grouping method takes effect and the communication volume then decreases steadily.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A multi-agent communication learning method, comprising:
a CriticNet, wherein the CriticNet is used for calculating the communication importance in the training phase and for training three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet;
the ActorNet, wherein the ActorNet is used for selecting corresponding actions on the agent side, works in both the training phase and the execution phase, learns the agent's policy π in the training phase, and then generates the corresponding action a_i^t according to the local observation and the received messages, that is, a_i^t = π(o_i^t, m_i^t), where m_i^t is the message received by agent i at time t;
the PriorNet, wherein the PriorNet is used by the agent to select communication objects; the PriorNet evaluates the agents observed in the local observation and outputs an importance value p_ij^t, i.e. the importance of agent j's message to agent i; and
the EncoderNet, wherein the EncoderNet is used by the agent to encode its own information in order to reduce the size of the message body.
2. The method of claim 1, wherein the CriticNet runs in the cloud and works only in the training phase; using the global reward and the communication priority, the CriticNet computes the network loss, passes the gradients back to the remaining networks, and updates their parameters.
3. The method of claim 1, wherein when the importance value exceeds a certain threshold, it indicates that agent i currently needs to obtain a message from agent j to make a decision.
4. The method of claim 1, wherein the agent encodes its own previous actions together with its observations for reference by other agents, improving the stability of the cooperation.
5. The method of claim 1, further comprising a method for calculating the importance, with the following steps:
Step A: observe whether removing the message of agent j changes the action output by the ActorNet;
Step B: since the ActorNet outputs a distribution over the action set, use the KL divergence to measure the difference between the agent's output action distributions, with the specific formula:
p_ij = D_KL( π(a | o_{i}) || π(a | o_{{i}\j}) )
Step C: here o_{i} denotes the set of messages of all the other agents observed by agent i, and o_{{i}\j} denotes that set of messages excluding agent j; the difference calculated by the formula indicates whether the decision distribution without agent j's message is consistent with the decision distribution with agent j's message;
Step D: if the difference is large, the message of agent j is important to agent i, so the communication confidence is high;
Step E: after the confidences of all the agents are calculated, a confidence matrix M between the agents is obtained, and the agents are grouped according to this confidence matrix.
6. The method of claim 1, further comprising a distributed grouping method, wherein the PriorNet outputs two values, query and signature: the signature vector is the agent's information fingerprint and comprises the encoding of the agent's position and label; the query vector is the query information and represents the encoding of the set of agents with which the agent needs to communicate.
7. The method of claim 6, further comprising a communication mechanism, wherein the communication mechanism comprises a handshake phase, an election phase, a communication phase and a decision phase; in the handshake phase, every agent broadcasts its query and signature to the agents within its observation, and after receiving the queries and signatures the agents restore the communication confidence matrix by multiplying the vectors; in the election phase, after computing the confidence matrix, every agent computes the adjacency graph and selects the agent with the greatest out-degree as a preset agent, i.e. the agent whose message most other agents want for their decisions, and the preset agent serves as the leader node; in the communication phase, all non-leader nodes send their own messages to the leader node, the leader encodes the received messages through the encoder network, communication is then carried out among the leaders, and the leaders transmit the messages to one another; in the decision phase, the leader makes a decision according to the received leader messages and sends the decision and the messages to the other non-leader agents in the same group, and those agents make their next decisions accordingly.
CN202111549398.0A 2021-12-17 2021-12-17 Multi-agent communication learning method Pending CN114298178A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111549398.0A CN114298178A (en) 2021-12-17 2021-12-17 Multi-agent communication learning method
PCT/CN2022/138140 WO2023109699A1 (en) 2021-12-17 2022-12-09 Multi-agent communication learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111549398.0A CN114298178A (en) 2021-12-17 2021-12-17 Multi-agent communication learning method

Publications (1)

Publication Number Publication Date
CN114298178A true CN114298178A (en) 2022-04-08

Family

ID=80967633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111549398.0A Pending CN114298178A (en) 2021-12-17 2021-12-17 Multi-agent communication learning method

Country Status (2)

Country Link
CN (1) CN114298178A (en)
WO (1) WO2023109699A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114123178B (en) * 2021-11-17 2023-12-19 哈尔滨工程大学 Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method
CN117031399B (en) * 2023-10-10 2024-02-20 浙江华创视讯科技有限公司 Multi-agent cooperative sound source positioning method, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113286275A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning
CN113642233B (en) * 2021-07-29 2023-12-29 太原理工大学 Group intelligent collaboration method for optimizing communication mechanism
CN113592079A (en) * 2021-08-13 2021-11-02 大连大学 Cooperative multi-agent communication method oriented to large-scale task space
CN114298178A (en) * 2021-12-17 2022-04-08 深圳先进技术研究院 Multi-agent communication learning method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023109699A1 (en) * 2021-12-17 2023-06-22 深圳先进技术研究院 Multi-agent communication learning method

Also Published As

Publication number Publication date
WO2023109699A1 (en) 2023-06-22

Similar Documents

Publication Publication Date Title
CN114298178A (en) Multi-agent communication learning method
CN109039942B (en) Network load balancing system and balancing method based on deep reinforcement learning
WO2021017227A1 (en) Path optimization method and device for unmanned aerial vehicle, and storage medium
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111628855B (en) Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning
CN113010305B (en) Federal learning system deployed in edge computing network and learning method thereof
CN111629380A (en) Dynamic resource allocation method for high-concurrency multi-service industrial 5G network
CN113537514A (en) High-energy-efficiency federal learning framework based on digital twins
CN113033712A (en) Multi-user cooperative training people flow statistical method and system based on federal learning
CN113341712B (en) Intelligent hierarchical control selection method for unmanned aerial vehicle autonomous control system
CN113312177B (en) Wireless edge computing system and optimizing method based on federal learning
CN111865474B (en) Wireless communication anti-interference decision method and system based on edge calculation
CN114710439B (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
CN110300417A (en) The energy efficiency optimization method and device of Communication Network for UAVS
CN109889525A (en) Multi-communication protocol Intellisense method
CN117639244A (en) Centralized control system of multi-domain heterogeneous power distribution communication network
CN110661566B (en) Unmanned aerial vehicle cluster networking method and system adopting depth map embedding
Meng et al. Intelligent routing orchestration for ultra-low latency transport networks
Zhuang et al. GA-MADDPG: A Demand-Aware UAV Network Adaptation Method for Joint Communication and Positioning in Emergency Scenarios
Sun et al. QPSO-based QoS multicast routing algorithm
CN114997422A (en) Grouping type federal learning method of heterogeneous communication network
CN114022731A (en) Federal learning node selection method based on DRL
Zhou et al. Digital Twin-based 3D Map Management for Edge-assisted Device Pose Tracking in Mobile AR
Di Giacomo et al. Edge-assisted gossiping learning: Leveraging v2v communications between connected vehicles
Ma Multi-Task Offloading via Graph Neural Networks in Heterogeneous Multi-access Edge Computing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination