CN113313267A - Multi-agent reinforcement learning method based on value decomposition and attention mechanism - Google Patents

Multi-agent reinforcement learning method based on value decomposition and attention mechanism Download PDF

Info

Publication number
CN113313267A
Authority
CN
China
Prior art keywords
value
agent
network
function
tot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110717897.XA
Other languages
Chinese (zh)
Other versions
CN113313267B (en)
Inventor
吴健
宋广华
姜晓红
范晟
叶振辉
陈弈宁
应豪超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110717897.XA priority Critical patent/CN113313267B/en
Publication of CN113313267A publication Critical patent/CN113313267A/en
Application granted granted Critical
Publication of CN113313267B publication Critical patent/CN113313267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-agent reinforcement learning method based on value decomposition and an attention mechanism, which comprises the following steps: (1) constructing a learning environment, wherein the learning environment comprises a plurality of agents, and each agent comprises a Critic network and an Actor network; (2) initializing the Critic network and Actor network parameters; (3) feeding back the action of each agent to the game environment, and storing the current observed value, the action, the reward and the observed value at the next moment into an experience pool; (4) calculating the local Q value functions and the global Q value function, and updating the parameters of the Critic network; (5) calculating the advantage function generated by each agent taking its current action under its current observed value, and updating the parameters of the Actor network; (6) after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment with the trained agents. The method achieves better performance and faster convergence speed in complex heterogeneous partially observable scenarios.

Description

Multi-agent reinforcement learning method based on value decomposition and attention mechanism
Technical Field
The invention belongs to the field of reinforcement learning in machine learning, and particularly relates to a multi-agent reinforcement learning method based on value decomposition and attention mechanism.
Background
The invention relates to multi-agent reinforcement learning, for which the current mainstream algorithms include the MAAC algorithm, the VDN algorithm and the QMIX algorithm.
MAAC is an abbreviation of Multi-Actor-Attention-Critic, a multi-agent reinforcement learning algorithm based on the Actor-Critic framework. The algorithm introduces an attention mechanism to dynamically extract the logical topological relations among multiple agents and to selectively aggregate the information of other agents. On the one hand, it can focus on the information of agents in a cooperative relationship; on the other hand, it maps the information of all agents to a fixed dimension, which avoids the dimensional explosion caused by an increasing number of agents and improves the scalability of the algorithm.
The VDN (Value-Decomposition Network) algorithm is a centralized-training reinforcement learning algorithm oriented to complex partially observable scenes. The algorithm trains a joint Q value Q_tot in a centralized manner; however, due to the partial observability of a multi-agent scenario, each agent cannot acquire the global information and the actions of the other agents. The VDN algorithm establishes the relationship between the global Q value and the local Q values and proposes the concept of value decomposition, the core idea of which is to approximate the global Q value with the local Q values.
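For reference, the additive value decomposition used by VDN is commonly written as

Q_tot(τ, a) ≈ Σ_{i=1}^{N} Q_i(τ_i, a_i)

where τ_i is the local action-observation history of agent i and a_i is its action; this is the standard VDN formulation, stated here for clarity rather than quoted from this patent.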
For example, Chinese patent publication No. CN111632387A discloses a command and control system based on StarCraft II, that is, a game environment based on StarCraft II, which selects a customized multi-agent algorithm module including VDN, QMIX, COMA and the like according to the play mode during offline training.
The QMIX algorithm is a centralized-training reinforcement learning algorithm further proposed on the basis of VDN. Although the VDN algorithm clarifies the relationship between the local Q values and the global Q value, it only obtains Q_tot by a simple summation and does not make reasonable use of the global information s. The QMIX algorithm uses a neural network, the Mixing network, which takes the global information s as input and generates a non-negative weight coefficient for each agent. The local Q value of each agent is weighted and summed with these weight coefficients to calculate the global Q value. The algorithm generalizes the relationship between the local Q values and the global Q value to a larger monotonic constraint range, and the extracted logical topological relation is more complex and accurate.
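The monotonic constraint referred to above is the standard QMIX condition, stated here for clarity rather than quoted from this patent:

Q_tot = f(Q_1, ..., Q_N; s), with ∂Q_tot/∂Q_i ≥ 0 for every agent i

QMIX satisfies this condition by forcing the Mixing-network weights applied to each Q_i to be non-negative.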
For example, Chinese patent publication No. CN111814988A discloses a testing method for a reinforcement learning algorithm in a multi-agent cooperative environment, in which the agents are divided into two classes: the first class, whose spatial motion is relatively fixed, adopts an algorithm that obtains the maximum confidence return value (the UCB algorithm), and the second class, whose motion and state space are complex, adopts a global function that obtains the optimal joint action and state (the QMIX algorithm).
However, because QMIX does not use an attention mechanism to encode the global information, it is inefficient at integrating information across a multi-agent system, so this type of algorithm has difficulty converging to good results when the number of agents is large (more than 100).
Disclosure of Invention
The invention provides a multi-agent reinforcement learning method based on value decomposition and an attention mechanism, which is applied to complex partially observable scenes and offers better performance and faster convergence speed than current algorithms.
A multi-agent reinforcement learning method based on value decomposition and attention mechanism is applied to a partially observable game scene and comprises the following steps:
(1) constructing a learning environment, wherein the learning environment comprises a plurality of agents, and each agent comprises a Critic network and an Actor network;
(2) initializing Critic network and Actor network parameters; starting a game scene, acquiring an initial observation value of each agent, outputting probability distribution of each action according to the observation value by an Actor network of each agent, and selecting the action through sampling;
(3) feeding back the action of each intelligent agent to the game environment, obtaining the reward under the current combined action and the observed value at the next moment, and storing the current observed value o, the action a, the reward r and the observed value o' at the next moment as experience tuples into an experience pool;
(4) extracting a batch of experience tuples from the experience pool (a minimal sketch of such a pool is given after step (6)), and calculating the Q value function Q_i of each agent by combining the observation information of the other agents through an attention mechanism; the Q value function is used for evaluating the future return brought by each action of the agent;
taking the global information s as input, and generating a non-negative weight for the Q value of each agent by using a neural network with a non-negative activation function; carrying out a weighted summation of the Q value Q_i of each agent with the non-negative weights to generate the global Q value function Q_tot; calculating a loss function for Q_tot by the temporal-difference method and updating the parameters of the Critic network;
(5) obtaining the V value function by taking the expectation of the Q value function of each agent, then calculating the advantage function generated when each agent takes its current action under its current observed value, and updating the parameters of the Actor network through the policy gradient by using the advantage function;
(6) after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment by using the trained agents.
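The following minimal sketch illustrates the experience pool used in steps (3) and (4). It is given for illustration only; the class and variable names (ExperiencePool, capacity, and so on) are assumptions rather than identifiers defined in this patent.

```python
# Minimal experience-pool sketch for steps (3)-(4); all names are illustrative.
import random
from collections import deque

class ExperiencePool:
    """Stores (o, a, r, o') experience tuples and samples random batches."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, actions, rewards, next_obs):
        # obs / next_obs: per-agent observations; actions: joint action;
        # rewards: rewards obtained under the current joint action
        self.buffer.append((obs, actions, rewards, next_obs))

    def sample(self, batch_size):
        # random batch of experience tuples for training the Critic network
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

In use, a tuple is added after every environment step and a batch is sampled whenever the Critic network is updated.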
On the basis of the QMIX algorithm structure, the invention designs a network structure that efficiently integrates multi-agent system information by using an attention mechanism, which greatly improves the scalability of the algorithm.
Further, in the step (1), the Actor network comprises a GRU recurrent neural network for memorizing historical state and action information;
each agent takes its local history information τ_i as input; the history information is first encoded through a single fully connected layer and input into the GRU network; the GRU network splices the hidden-layer information h_t of the previous moment with the output of the previous layer to generate the hidden-layer information h_{t+1} of the next moment and the input of the next layer; the third fully connected layer then generates the action probability distribution over the agent's discrete action space.
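A minimal PyTorch sketch of such a GRU-based Actor is given below. The layer sizes, the activation choices and the variable names are assumptions made for illustration, not values specified in the patent.

```python
# Sketch of a GRU-based Actor: encode the local history, update the hidden
# state, and output a probability distribution over discrete actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUActor(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)      # encode local history/observation
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # carry hidden state h_t -> h_{t+1}
        self.fc2 = nn.Linear(hidden_dim, n_actions)    # logits over the discrete action space

    def forward(self, obs, h_prev):
        x = F.relu(self.fc1(obs))
        h_next = self.gru(x, h_prev)                   # combine encoding with previous hidden state
        probs = F.softmax(self.fc2(h_next), dim=-1)    # action probability distribution
        return probs, h_next

# Example: one agent, 30-dimensional observation, 8 discrete actions.
actor = GRUActor(obs_dim=30, n_actions=8)
h = torch.zeros(1, 64)
probs, h = actor(torch.randn(1, 30), h)
action = torch.distributions.Categorical(probs).sample()   # select the action by sampling
```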
Further, the Critic network comprises an attention network and a Mixing network;
in the attention network, the input is first encoded through a single fully connected layer to generate an embedding e_i; the attention coefficients between the agent and the other agents are calculated from e_i through the query matrix and the key matrix of the attention network, and the attention information of the agent is then calculated through the value matrix; the attention information of the multiple heads is spliced to generate the final attention information; the final attention information is passed through a multi-layer perceptron to generate the Q value function of each agent;
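A compact PyTorch sketch of such a multi-head attention Critic head is shown below. The dimensions, the number of heads, the scaling of the attention scores, and the concatenation of the agent's own embedding with the attention information are assumptions made for illustration, not details recited in the patent.

```python
# Sketch of a multi-head attention head that produces a Q value per agent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    def __init__(self, input_dim, embed_dim=64, n_heads=4, n_actions=8):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        self.embed = nn.Linear(input_dim, embed_dim)        # e_i from (tau_i, a_i)
        self.query = nn.Linear(embed_dim, embed_dim, bias=False)
        self.key = nn.Linear(embed_dim, embed_dim, bias=False)
        self.value = nn.Linear(embed_dim, embed_dim, bias=False)
        self.q_head = nn.Sequential(                        # multi-layer perceptron -> Q_i
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, n_actions))

    def forward(self, inputs):
        # inputs: (n_agents, input_dim), each agent's encoded local history and action
        e = F.relu(self.embed(inputs))                      # (N, E)
        n, _ = e.shape
        q = self.query(e).view(n, self.n_heads, self.head_dim)
        k = self.key(e).view(n, self.n_heads, self.head_dim)
        v = self.value(e).view(n, self.n_heads, self.head_dim)
        # attention coefficients between agent i and the other agents, per head
        scores = torch.einsum('ihd,jhd->hij', q, k) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)                    # (H, N, N)
        heads = torch.einsum('hij,jhd->ihd', attn, v)       # attention information per head
        attn_info = heads.reshape(n, -1)                    # splice (concatenate) the heads
        return self.q_head(torch.cat([e, attn_info], dim=-1))  # Q_i for every action

# Example: 5 agents, each represented by a 40-dimensional (tau_i, a_i) encoding.
q_values = AttentionCritic(input_dim=40)(torch.randn(5, 40))    # shape (5, 8)
```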
in Mixing network, global information s is used as input, and weight coefficients k are respectively generated through a neural networkiAnd a deviation b; at a weight coefficient kiThe neural network of (2) adds an activation function of an absolute value to ensure that the weight is not negative; finally, the Q value function and the weight coefficient k of each agent are calculatediCarrying out weighted summation on the sum deviation b to obtain a global Q value function Qtot
For the Q value Q_i of each agent, the global information is encoded through a self-attention mechanism, and the encoded value is then input into a fully connected network to obtain it.
The global Q value function Q_tot is calculated as:
Q_tot(s) = Σ_i k_i · Q_i + b
where k_i represents the weight coefficient, i represents the agent index, and b represents the bias.
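As a purely illustrative numerical example with hypothetical values: for two agents with weights k_1 = 0.5 and k_2 = 1.2, local Q values Q_1 = 2 and Q_2 = 3, and bias b = 0.1, the formula gives

Q_tot(s) = 0.5 · 2 + 1.2 · 3 + 0.1 = 4.7.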
In step (4), the loss function of Q_tot calculated by the temporal-difference method is:
loss_Q_tot(s) = (Σ_i r_i + Q′_tot(s) − Q_tot(s))
where r_i represents the reward obtained by agent i, Q′_tot represents the target Q network, and Q_tot(s) represents the global Q value function.
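One way to turn this temporal-difference quantity into a trainable loss is sketched below (PyTorch). The squared error, the discount factor gamma, and the evaluation of the target network Q′_tot on the next state are common temporal-difference conventions assumed for this sketch; the formula above states only the difference itself.

```python
# Illustrative TD loss for Q_tot; the squared error, the discount factor and
# the use of the next state for the target network are assumptions.
import torch

def td_loss(q_tot, q_tot_target_next, rewards, gamma=0.99):
    # q_tot:             Q_tot(s) from the Critic, shape (batch, 1)
    # q_tot_target_next: Q'_tot from the target network on the next state, (batch, 1)
    # rewards:           per-agent rewards r_i, shape (batch, n_agents)
    td_target = rewards.sum(dim=1, keepdim=True) + gamma * q_tot_target_next
    return ((td_target.detach() - q_tot) ** 2).mean()

# Example with random tensors (batch of 32 samples, 5 agents):
q_tot = torch.randn(32, 1, requires_grad=True)   # stands in for the Critic output
loss = td_loss(q_tot, torch.randn(32, 1), torch.randn(32, 5))
loss.backward()                                  # gradients flow back into the Critic parameters
```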
In step (5), the advantage function is:
A(s) = Q_tot(s) − E[Q_tot(s)]
where E[Q_tot(s)] represents the mathematical expectation of the global Q value.
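A sketch of one way to compute this expectation, the advantage and the resulting policy-gradient loss is given below (PyTorch). Estimating E[Q_tot(s)] as the probability-weighted average of Q_tot over an agent's candidate actions is an assumption made for this sketch; the patent states the advantage only as the difference above.

```python
# Illustrative advantage and policy-gradient loss; the probability-weighted
# expectation over candidate actions is an assumption for this sketch.
import torch

def actor_loss(q_tot_taken, q_tot_all_actions, action_probs, log_prob_taken):
    # q_tot_taken:       Q_tot(s) for the action actually taken, shape (batch, 1)
    # q_tot_all_actions: Q_tot(s) evaluated for every candidate action, (batch, n_actions)
    # action_probs:      Actor output pi(a|o), (batch, n_actions)
    # log_prob_taken:    log pi(a_taken|o), (batch, 1)
    expected_q_tot = (action_probs * q_tot_all_actions).sum(dim=1, keepdim=True)  # E[Q_tot(s)]
    advantage = (q_tot_taken - expected_q_tot).detach()                           # A(s)
    return -(log_prob_taken * advantage).mean()       # policy-gradient loss for the Actor

# Example with random tensors (batch of 32, 8 discrete actions):
probs = torch.softmax(torch.randn(32, 8), dim=1)
log_p = torch.log(probs.gather(1, torch.randint(0, 8, (32, 1))))
loss = actor_loss(torch.randn(32, 1), torch.randn(32, 8), probs, log_p)
```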
Each agent has an independent Actor network, but in order to improve training efficiency, the Actor networks of the agents adopt parameter sharing when their parameters are updated.
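Parameter sharing can be realised simply by letting every agent run its forward pass through one and the same Actor module. Appending a one-hot agent index to the observation, as in the sketch below, is a common convention and an assumption here; the GRUActor class is the illustrative Actor sketch given earlier in this description.

```python
# Minimal parameter-sharing sketch: one Actor object serves all agents, so every
# agent's update back-propagates into the same parameters. Names are illustrative.
import torch
import torch.nn.functional as F

n_agents, obs_dim, hidden_dim = 5, 30, 64
shared_actor = GRUActor(obs_dim=obs_dim + n_agents, n_actions=8)   # +n_agents for a one-hot agent index
hidden = [torch.zeros(1, hidden_dim) for _ in range(n_agents)]

observations = [torch.randn(1, obs_dim) for _ in range(n_agents)]
for i, obs in enumerate(observations):
    agent_id = F.one_hot(torch.tensor([i]), num_classes=n_agents).float()
    probs, hidden[i] = shared_actor(torch.cat([obs, agent_id], dim=1), hidden[i])
```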
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the logic topological relation among the multiple intelligent agents is doubly extracted through attention mechanism and value decomposition, and the method has the advantages of better performance effect and higher convergence speed in a complex heterogeneous part observable scene.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram illustrating an Actor network structure in the method of the present invention;
FIG. 3 is a schematic diagram of a Critic network structure in the method of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
The invention provides a multi-agent reinforcement learning algorithm (named the M2AAC algorithm in this embodiment) based on value decomposition and an attention mechanism, which is applied to StarCraft II as a complex partially observable scene.
As shown in fig. 1, the M2AAC algorithm process adopted by the present invention is as follows:
s01, the system is provided with an interstellar dispute 2 game and a PySC2 library provided by DeepMind.
S02, initializing the Critic network and Actor network parameters in the M2AAC algorithm. A game scene is started through the API and the observed value of each agent at the start of the game is acquired. The Actor network of each agent outputs the probability distribution of each action according to its observed value, and the action is selected through sampling. The actions are fed back to the game environment, the reward under the current joint action and the observed value at the next moment are obtained, and the current observed value o, the action a, the reward r and the next-moment observed value o' are stored as an experience tuple in the experience pool for extraction and use during training.
As shown in fig. 2, the M2AAC algorithm of the present invention mainly targets complex partially observable scenes and needs to record historical observed values and action information. The M2AAC algorithm therefore introduces a GRU recurrent neural network into the Actor network for memorizing historical state and action information. Each agent takes its local history information τ_i as input; the history information is first encoded by a single fully connected layer and input into the GRU network. The GRU network splices the hidden-layer information h_t of the previous moment with the output of the previous layer to generate the hidden-layer information h_{t+1} of the next moment and the input of the next layer. The third layer generates the action probability distribution over the agent's discrete action space. Each agent has an independent Actor network, but in order to improve training efficiency, the invention adopts parameter sharing.
As shown in fig. 3, the M2AAC algorithm adds an attention network and a Mixing network to the Critic network. Each agent takes its local history information τ_i and local action a_i as input. In the attention network, the input is first encoded through a single fully connected layer to generate an embedding e_i. The attention coefficients between the agent and the other agents are calculated from e_i through the query matrix and the key matrix of the attention network, and the attention information of the agent is then calculated through the value matrix. Because a multi-head attention mechanism is adopted, the attention information of the multiple heads is spliced to generate the final attention information. Finally, the attention information is passed through a multi-layer perceptron to generate the Q value function of each agent.
In the Mixing network, the invention takes the global information s as input and generates a weight coefficient k_i and a bias b through neural networks. Intuitively, each agent contributes to the whole multi-agent system, so the weight coefficient must satisfy k_i > 0. Therefore, the invention adds an absolute-value activation function to the neural network that generates the weight coefficient k_i to ensure that the weight is non-negative. Finally, the Q value function of each agent is weighted and summed with the weight coefficients and the bias to obtain the global Q value function Q_tot.
S03, extracting a batch of experience tuples from the experience pool. First, each agent calculates its Q value function Q_i by combining the observation information of the other agents through the attention mechanism. Then, taking the global information s as input, a neural network with a non-negative activation function is used to generate a non-negative weight for the Q value of each agent. The Q value function of each agent is weighted and summed with the non-negative weights to generate the global Q value function Q_tot. A loss function is calculated for Q_tot by the temporal-difference method and the parameters of the Critic network are updated.
S04, obtaining the V value function by taking the expectation of the Q value function of each agent, and then calculating the advantage function generated when each agent takes its current action under its current observed value. The method uses the advantage function to update the parameters of the Actor network through the policy gradient.
S05, after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment by using the trained agents.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (8)

1. A multi-agent reinforcement learning method based on value decomposition and attention mechanism is characterized by being applied to a partially observable game scene and comprising the following steps:
(1) constructing a learning environment, wherein the learning environment comprises a plurality of agents, and each agent comprises a Critic network and an Actor network;
(2) initializing Critic network and Actor network parameters; starting a game scene, acquiring an initial observation value of each agent, outputting probability distribution of each action according to the observation value by an Actor network of each agent, and selecting the action through sampling;
(3) feeding back the action of each intelligent agent to the game environment, obtaining the reward under the current combined action and the observed value at the next moment, and storing the current observed value o, the action a, the reward r and the observed value o' at the next moment as experience tuples into an experience pool;
(4) extracting a batch of experience tuples from the experience pool, and calculating the Q value function Q_i of each agent by combining the observation information of the other agents through an attention mechanism;
taking the global information s as input, and generating a non-negative weight for the Q value of each agent by using a neural network with a non-negative activation function; carrying out a weighted summation of the Q value Q_i of each agent with the non-negative weights to generate the global Q value function Q_tot; calculating a loss function for Q_tot by the temporal-difference method and updating the parameters of the Critic network;
(5) obtaining the V value function by taking the expectation of the Q value function of each agent, then calculating the advantage function generated when each agent takes its current action under its current observed value, and updating the parameters of the Actor network through the policy gradient by using the advantage function;
(6) after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment by using the trained agents.
2. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 1, wherein in step (1), the Actor network comprises a GRU recurrent neural network for memorizing historical state and action information;
each agent takes its local history information τ_i as input; the history information is first encoded through a single fully connected layer and input into the GRU network; the GRU network splices the hidden-layer information h_t of the previous moment with the output of the previous layer to generate the hidden-layer information h_{t+1} of the next moment and the input of the next layer; the third fully connected layer generates the action probability distribution over the agent's discrete action space.
3. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 1, wherein in step (1), the Critic network comprises an attention network and a Mixing network;
in the attention network, the input is first encoded through a single fully connected layer to generate an embedding e_i; the attention coefficients between the agent and the other agents are calculated from e_i through the query matrix and the key matrix of the attention network, and the attention information of the agent is then calculated through the value matrix; the attention information of the multiple heads is spliced to generate the final attention information; the final attention information is passed through a multi-layer perceptron to generate the Q value function of each agent;
in the Mixing network, the global information s is taken as input, and a weight coefficient k_i and a bias b are generated through neural networks; an absolute-value activation function is added to the neural network that generates the weight coefficient k_i to ensure that the weight is non-negative; finally, the Q value function of each agent is weighted and summed with the weight coefficient k_i and the bias b to obtain the global Q value function Q_tot.
4. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 3, wherein for the Q value Q_i of each agent, the global information is encoded through a self-attention mechanism, and the encoded value is then input into a fully connected network to obtain it.
5. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 3, wherein the global Q value function Q_tot is calculated as:
Q_tot(s) = Σ_i k_i · Q_i + b
where k_i represents the weight coefficient, i represents the agent index, and b represents the bias.
6. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 5, wherein in step (4), the loss function of Q_tot calculated by the temporal-difference method is:
loss_Q_tot(s) = (Σ_i r_i + Q′_tot(s) − Q_tot(s))
where r_i represents the reward obtained by agent i, Q′_tot represents the target Q network, and Q_tot(s) represents the global Q value function.
7. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 5, wherein in step (5), the advantage function is:
A(s) = Q_tot(s) − E[Q_tot(s)]
where E[Q_tot(s)] represents the mathematical expectation of the global Q value.
8. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 1, wherein in step (5), the Actor networks of the agents adopt parameter sharing when the parameters of the Actor networks are updated.
CN202110717897.XA 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism Active CN113313267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110717897.XA CN113313267B (en) 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110717897.XA CN113313267B (en) 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism

Publications (2)

Publication Number Publication Date
CN113313267A true CN113313267A (en) 2021-08-27
CN113313267B CN113313267B (en) 2023-12-08

Family

ID=77380579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110717897.XA Active CN113313267B (en) 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism

Country Status (1)

Country Link
CN (1) CN113313267B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN112101564A (en) * 2020-08-17 2020-12-18 清华大学 Multi-agent value function decomposition method and device based on attention mechanism
CN112232478A (en) * 2020-09-03 2021-01-15 天津(滨海)人工智能军民融合创新中心 Multi-agent reinforcement learning method and system based on layered attention mechanism
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112417760A (en) * 2020-11-20 2021-02-26 哈尔滨工程大学 Warship control method based on competitive hybrid network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANYU SU et al.: "Value-Decomposition Multi-Agent Actor-Critics", AAAI-21 *
YUANXIN ZHANG et al.: "AVD-Net: Attention Value Decomposition Network For Deep Multi-Agent Reinforcement Learning", IEEE *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792861A (en) * 2021-09-16 2021-12-14 中国科学技术大学 Multi-agent reinforcement learning method and system based on value distribution
CN113792861B (en) * 2021-09-16 2024-02-27 中国科学技术大学 Multi-agent reinforcement learning method and system based on value distribution
CN113902125A (en) * 2021-09-24 2022-01-07 浙江大学 Intra-group cooperation intelligent agent control method based on deep hierarchical reinforcement learning
CN114037048A (en) * 2021-10-15 2022-02-11 大连理工大学 Belief consistency multi-agent reinforcement learning method based on variational cycle network model
CN114037048B (en) * 2021-10-15 2024-05-28 大连理工大学 Belief-consistent multi-agent reinforcement learning method based on variational circulation network model
CN113919485A (en) * 2021-10-19 2022-01-11 西安交通大学 Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN113919485B (en) * 2021-10-19 2024-03-15 西安交通大学 Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN114130034A (en) * 2021-11-19 2022-03-04 天津大学 Multi-agent game AI (Artificial Intelligence) design method based on attention mechanism and reinforcement learning
CN114139637A (en) * 2021-12-03 2022-03-04 哈尔滨工业大学(深圳) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN114463997A (en) * 2022-02-14 2022-05-10 中国科学院电工研究所 Lantern-free intersection vehicle cooperative control method and system
CN114527666B (en) * 2022-03-09 2023-08-11 西北工业大学 CPS system reinforcement learning control method based on attention mechanism
CN114527666A (en) * 2022-03-09 2022-05-24 西北工业大学 CPS system reinforcement learning control method based on attention mechanism
CN114900619A (en) * 2022-05-06 2022-08-12 北京航空航天大学 Self-adaptive exposure driving camera shooting underwater image processing system
WO2023231961A1 (en) * 2022-06-02 2023-12-07 华为技术有限公司 Multi-agent reinforcement learning method and related device
CN115047907A (en) * 2022-06-10 2022-09-13 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115047907B (en) * 2022-06-10 2024-05-07 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115300910A (en) * 2022-07-15 2022-11-08 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN116090688A (en) * 2023-04-10 2023-05-09 中国人民解放军国防科技大学 Moving target traversal access sequence planning method based on improved pointer network
CN117852710A (en) * 2024-01-08 2024-04-09 山东大学 Collaborative optimization scheduling method and system for multi-park comprehensive energy system

Also Published As

Publication number Publication date
CN113313267B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
CN113313267A (en) Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN112116090B (en) Neural network structure searching method and device, computer equipment and storage medium
CN115081936B (en) Method and device for scheduling observation tasks of multiple remote sensing satellites under emergency condition
CN111401547B (en) HTM design method based on circulation learning unit for passenger flow analysis
CN112763967B (en) BiGRU-based intelligent electric meter metering module fault prediction and diagnosis method
CN113286275A (en) Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning
CN114860893A (en) Intelligent decision-making method and device based on multi-mode data fusion and reinforcement learning
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
CN111401557A (en) Agent decision making method, AI model training method, server and medium
CN108921935A (en) A kind of extraterrestrial target method for reconstructing based on acceleration gauss hybrid models
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN116306686B (en) Method for generating multi-emotion-guided co-emotion dialogue
CN115099606A (en) Training method and terminal for power grid dispatching model
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN113313209A (en) Multi-agent reinforcement learning training method with high sample efficiency
CN116933931A (en) Cloud computing double-flow feature interaction electric vehicle charging pile occupation prediction method
CN111767991B (en) Measurement and control resource scheduling method based on deep Q learning
CN114880527B (en) Multi-modal knowledge graph representation method based on multi-prediction task
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN115587615A (en) Internal reward generation method for sensing action loop decision
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN115204249A (en) Group intelligent meta-learning method based on competition mechanism
CN114139674A (en) Behavior cloning method, electronic device, storage medium, and program product
Sachdeva et al. Gapformer: Fast autoregressive transformers meet rnns for personalized adaptive cruise control

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Song Guanghua

Inventor after: Fan Cheng

Inventor after: Ye Zhenhui

Inventor after: Chen Yining

Inventor after: Ying Haochao

Inventor after: Wu Jian

Inventor after: Jiang Xiaohong

Inventor before: Wu Jian

Inventor before: Song Guanghua

Inventor before: Jiang Xiaohong

Inventor before: Fan Cheng

Inventor before: Ye Zhenhui

Inventor before: Chen Yining

Inventor before: Ying Haochao

GR01 Patent grant
GR01 Patent grant