CN114626499A - Embedded multi-agent reinforcement learning method using sparse attention to assist decision making - Google Patents

Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Info

Publication number
CN114626499A
CN114626499A
Authority
CN
China
Prior art keywords
agent
attention
sparse
output
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210508557.0A
Other languages
Chinese (zh)
Inventor
吴超
罗双
李皓
王永恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Lab
Original Assignee
Zhejiang University ZJU
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Lab filed Critical Zhejiang University ZJU
Priority to CN202210508557.0A priority Critical patent/CN114626499A/en
Publication of CN114626499A publication Critical patent/CN114626499A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an embedded multi-agent reinforcement learning method that uses sparse attention to assist decision making, belonging to the technical field of reinforcement learning. The method initializes the utility function network parameters, hybrid network parameters and target hybrid network parameters of the multiple agents; obtains the self-attention output and the sparse attention output of each agent; encodes the current observation output with a gated recurrent unit module; computes a local conventional utility function and a local sparse utility function and inputs them into the hybrid network to fit a conventional global value function and a sparse global value function respectively; and gradually reduces the weight of the conventional global value function to complete the reinforcement learning training. In the decision inference stage, each agent selects an action to output to the environment according to its local observation and its own utility function, thereby interacting with the environment. The method can be embedded into any value-function-based MARL framework and improves the efficiency and accuracy of agent decision making.

Description

Embedded multi-agent reinforcement learning method using sparse attention to assist decision making
Technical Field
The invention belongs to the technical field of reinforcement learning, and particularly relates to an embedded multi-agent reinforcement learning method for assisting decision making by using sparse attention.
Background
Multi-agent reinforcement learning (MARL) provides a framework in which multiple agents jointly solve a complex sequential decision problem, and it has very wide applications in fields such as robot games, traffic light control and autonomous driving. The relationships among agents in MARL can be classified as fully cooperative, fully competitive, or mixed (neither fully cooperative nor fully competitive).
Currently, the mainstream MARL training paradigm is the Centralized Training with Decentralized Execution (CTDE) framework: in the centralized training phase, an agent's decision model can access global state information to help the agent better explore different strategies, but in the inference phase, the agent makes decisions only from its own local observation. The CTDE framework relies on the Individual-Global-Max (IGM) principle, which guarantees consistency between the individually optimal decisions and the globally optimal decision; by maximizing its individual utility function, each agent enables the whole team to obtain the optimal global return. Thus, in cooperative MARL, improving the individual utility functions benefits the whole.
The existing value-function-based methods are mainly VDN, QMIX, QPLEX and the like. VDN sums the agents' local utility functions to obtain the global value function. Because direct summation has limited factorization and representation capability, QMIX improves on VDN by aggregating the agents' local utility functions non-linearly through a hybrid (mixing) network whose weights are generated from the global state information, while preserving the monotonicity constraint between individual and global values (see the sketch at the end of this section). QPLEX further introduces an advantage-function-based method, decomposing the local utility function Q into a state value function V and a separate action advantage function A, thereby reducing the influence of the state on the decision and paying more attention to the benefit brought by different actions. The above value-function-based methods mainly have the following problems:
(1) The improvements mainly concern how to aggregate the local agents' utility functions into a global value function, and pay no attention to improving the network structure of the agents themselves. As the number of agents in a MARL environment increases, the joint action space keeps growing, which makes exploration by the agents more difficult.
(2) Agents make decisions based on their own observations. Because interactions between agents are sparse, an agent does not need to attend to all individuals at the same time; different individuals within an observation have different influences on the decision, and their importance changes over time.
(3) Directly introducing an attention mechanism helps the agent allocate different amounts of attention to different individuals, but the conventional attention mechanism uses the softmax activation function and therefore cannot completely ignore irrelevant individuals; conversely, if a sparsification method is used directly to zero out unrelated individuals, the agent cannot explore more strategies, and it is difficult to distinguish which individuals are more important in the initial period of training the agent model.
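For reference, the two aggregation styles mentioned above (VDN's direct summation and QMIX's monotonic hybrid network) can be sketched as follows. This is a minimal, generic PyTorch illustration written for this description; the class and parameter names (MonotonicMixer, embed_dim, and so on) are assumptions and do not come from the patented method or from any specific codebase.

```python
import torch
import torch.nn as nn

def vdn_mix(agent_qs: torch.Tensor) -> torch.Tensor:
    """VDN: the global value is simply the sum of the local utilities.
    agent_qs: (batch, n_agents) chosen-action utilities Q_i."""
    return agent_qs.sum(dim=-1, keepdim=True)

class MonotonicMixer(nn.Module):
    """QMIX-style hybrid (mixing) network: non-linear aggregation whose weights
    are generated from the global state and kept non-negative, preserving the
    monotonicity between individual utilities and the global value."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        q_tot = torch.bmm(hidden, w2) + b2  # (b, 1, 1)
        return q_tot.view(b, 1)
```

The non-negative weights (torch.abs) are what keep the global value monotonic in each local utility, which is the property that allows decentralized greedy action selection under the IGM principle.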
Disclosure of Invention
In order to overcome the defects of the prior art and to solve the problems of an excessively large joint action space and difficult exploration caused by the growing number of agents in multi-agent reinforcement learning, the invention provides an embedded multi-agent reinforcement learning method that uses sparse attention to assist decision making. The invention improves the local utility function of the agent, so the method can be embedded into any value-function-based MARL framework and has wide applicability.
The invention is realized by the following technical scheme:
an embedded multi-agent reinforcement learning method with sparse attention-aided decision making comprises the following steps:
step 1: initializing utility function network parameters, hybrid network parameters and target hybrid network parameters of the multi-agent;
step 2: coding the local observation of each intelligent agent at the current moment to obtain a local observation coding vector, and respectively obtaining self-attention output and sparse attention output of each intelligent agent by utilizing self-attention and sparse attention;
step 3: coding a local observation coding vector and a historical observation hidden state of the intelligent agent by using a gated recurrent unit module to obtain a current observation hidden state and a current observation output;
step 4: concatenating the self-attention output and the current observation output, and calculating a local conventional utility function of the intelligent agent by using the fully connected layer; meanwhile, concatenating the sparse attention output and the current observation output, and calculating a local sparse utility function of the intelligent agent by using the fully connected layer;
step 5: respectively inputting the local conventional utility function and the local sparse utility function of each agent into a hybrid network, respectively fitting to obtain a conventional global value function and a sparse global value function, updating utility function network parameters and hybrid network parameters by using the weighted loss of the conventional global value function and the sparse global value function, and completing the training of reinforcement learning;
step 6: in the decision reasoning stage, each agent selects actions to be output to the environment according to local observation and the utility function of the agent, so that the agent interacts with the environment.
The invention has the following beneficial effects:
(1) The sparse attention mechanism is used as an auxiliary decision mechanism, helping the agent dynamically screen out, during training, the individuals that matter most for its decision while ignoring unimportant ones, which alleviates the difficulty of exploring a joint action space that keeps growing with the number of agents.
(2) The invention can be embedded into any value-function-based MARL scheme, such as the classical algorithms VDN, QMIX and QPLEX, because the conventional attention and sparse attention computations inside the agent's local utility function do not change the monotonicity between the individual utility functions and the global value function.
(3) The invention gradually transitions from conventional self-attention computation at the start of training to sparse attention computation, which ensures that the agent explores more individual attention distributions early in training and, as training proceeds, gradually shifts toward attention distributions that focus only on important individuals while completely ignoring irrelevant ones. This progressively improves decision accuracy and prevents the agent from falling into local optima during exploration.
Drawings
FIG. 1 is a schematic diagram of an embedded multi-agent reinforcement learning method with sparse attention-aided decision making according to an embodiment of the present invention.
FIG. 2 is a flow chart of an embedded multi-agent reinforcement learning method with sparse attention-aided decision making according to an embodiment of the present invention.
Fig. 3 is a block diagram of an embedded multi-agent reinforcement learning apparatus with sparse attention-aided decision-making according to an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings.
The method is applied to fully cooperative MARL modeling and aims to guide the agent to make sound decisions by using the sparse attention mechanism as an auxiliary decision method, thereby improving the agent's decision efficiency and accuracy.
As shown in the schematic diagram of fig. 1 and the flowchart of fig. 2, the method for embedded multi-agent reinforcement learning with sparse attention-aided decision mainly includes the following steps:
step one, initializing self utility function network parameters of a plurality of agents
Figure 263867DEST_PATH_IMAGE001
Hybrid network parameters
Figure 649849DEST_PATH_IMAGE002
And target hybrid network parameters
Figure 23061DEST_PATH_IMAGE003
That is, in the initialization process, the initialized hybrid network parameters are used as target hybrid network parameters, and when the hybrid network parameters are subsequently trained and updated, the target hybrid network parameters are updated once at intervals. Wherein the content of the first and second substances,Nthe number of the agents is represented,
Figure 887112DEST_PATH_IMAGE004
is shown asNThe self utility function parameter of each agent.
Suppose that at time $t$ the observation of an agent contains $M$ individuals. The different individuals in the local observation of each agent, $o_t^i = [o_{t,1}^i, o_{t,2}^i, \ldots, o_{t,M}^i]^T$, are encoded into a uniform dimension by an embedding function $f(\cdot)$ to obtain the local observation encoding vector $x_t^i$; where $o_t^i$ denotes the local observation of the $i$-th agent at time $t$, $o_{t,m}^i$ denotes the $m$-th of the $M$ individuals observed by the $i$-th agent at time $t$, the superscript $T$ denotes transpose, and $x_t^i$ denotes the local observation encoding vector of the $i$-th agent at time $t$.
The local observation encoding vector of each agent at the current moment is mapped to the key matrix $K$, value matrix $V$ and query matrix $Q$ of the attention mechanism. The mapping formula is:

$$K = x_t^i W_K, \qquad V = x_t^i W_V, \qquad Q = x_t^i W_Q$$

where $W_K$, $W_V$ and $W_Q$ are parameter matrices that need to be trained. The conventional attention mechanism and the sparse attention mechanism share these parameters, and during training the sparse attention mechanism serves as an aid that guides the conventional attention mechanism.
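As an illustration of this shared projection, the following is a minimal sketch (module and dimension names are assumptions, not code from the patent); a single set of $W_Q$, $W_K$, $W_V$ parameters is reused by both the conventional and the sparse attention branch.

```python
import torch
import torch.nn as nn

class SharedQKVProjection(nn.Module):
    """One set of Q/K/V projection matrices shared by both attention branches."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.w_q = nn.Linear(embed_dim, embed_dim, bias=False)  # W_Q
        self.w_k = nn.Linear(embed_dim, embed_dim, bias=False)  # W_K
        self.w_v = nn.Linear(embed_dim, embed_dim, bias=False)  # W_V

    def forward(self, x: torch.Tensor):
        # x: (M, embed_dim) embedded individuals in one agent's local observation
        return self.w_q(x), self.w_k(x), self.w_v(x)
```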
Step two, the $Q$ and $K$ matrices are respectively input into the self-attention module and the sparse attention module to obtain a dense attention distribution weight and a sparse attention distribution weight, which are then used to take weighted sums over the $V$ matrix.
The self-attention module is computed as:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

where $\mathrm{Attn}(\cdot)$ denotes the self-attention formula, $\mathrm{softmax}(\cdot)$ denotes the softmax activation function, the superscript $T$ denotes transpose, and $d_K$ denotes the dimension of the $K$ matrix.

The sparse attention module is computed as:

$$\mathrm{SparseAttn}(Q, K, V) = \mathrm{sparsemax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

where $\mathrm{SparseAttn}(\cdot)$ denotes the sparsified attention formula and $\mathrm{sparsemax}(\cdot)$ denotes the sparse probability activation function, defined as

$$\mathrm{sparsemax}(z) = \operatorname*{arg\,min}_{p \in \Delta^{d-1}} \|p - z\|_2^2, \qquad \Delta^{d-1} = \{\, p \in \mathbb{R}^{d} : \mathbf{1}^T p = 1,\ p \ge 0 \,\}$$

where $p$ is a probability vector on the simplex $\Delta^{d-1}$, $d$ is the dimension of $p$, and $z$ denotes a row vector of the matrix being processed; the solution has the closed form $\mathrm{sparsemax}_j(z) = \max(z_j - \tau(z), 0)$ for a threshold $\tau(z)$. In other words, the sparse attention module shifts and truncates the result of the $QK^T$ product at a threshold, so that individuals with a large influence on the decision are retained while the attention weights of individuals irrelevant to the decision are set to zero.

For convenience of notation, the self-attention output of the $i$-th agent is denoted $e_i$, and the sparse attention output of the $i$-th agent is denoted $\tilde{e}_i$.
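The two activation functions can be made concrete with the following sketch, which assumes the $Q$, $K$, $V$ matrices produced above for one agent; the sparsemax implementation follows the standard sorting-based projection onto the simplex, and the function names are illustrative rather than taken from the patent.

```python
import torch
import torch.nn.functional as F

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparse probability activation applied to the last dimension:
    Euclidean projection of each row onto the probability simplex."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    # size of the support: largest k with 1 + k * z_(k) > sum_{j<=k} z_(j)
    support = (1 + k * z_sorted) > cumsum
    k_z = support.sum(dim=-1, keepdim=True).clamp(min=1)
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

def dual_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """Return the dense (softmax) and sparse (sparsemax) attention outputs
    e_i and e~_i for one agent, sharing the same Q, K, V."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    dense_out = F.softmax(scores, dim=-1) @ v
    sparse_out = sparsemax(scores) @ v
    return dense_out, sparse_out
```

Because sparsemax can return exact zeros, the attention weights of irrelevant individuals are truly removed rather than merely made small, which is the property exploited by the sparse branch.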
Step three, a gated recurrent unit (GRU) module encodes the agent's local observation encoding vector $x_t^i$ and the historical observation hidden state $h_{t-1}^i$ to obtain the current observation hidden state $h_t^i$ and the current observation output $y_t^i$:

$$(y_t^i, h_t^i) = \mathrm{GRU}(x_t^i, h_{t-1}^i)$$
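A minimal sketch of this encoding step is given below, assuming the per-individual embeddings have been flattened or pooled into a single input vector per agent; the class name and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """GRU cell carrying the agent's observation history."""
    def __init__(self, obs_embed_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_embed_dim, hidden_dim)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor):
        # x_t: current local observation encoding, (batch, obs_embed_dim)
        # h_prev: historical observation hidden state, (batch, hidden_dim)
        h_t = self.rnn(x_t, h_prev)
        # for a GRU cell, the current observation output equals the new hidden state
        return h_t, h_t
```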
The self-attention output $e_i$ is concatenated with the current observation output $y_t^i$, and a fully connected layer is used to compute the agent's conventional utility function $Q_i(\cdot)$; the action that maximizes the utility function value, $u_i = \arg\max_a Q_i(a)$, is selected and its value is recorded as $Q_i(u_i)$. Meanwhile, the sparse attention output $\tilde{e}_i$ is concatenated with the current observation output $y_t^i$, and the agent's sparse utility function $\tilde{Q}_i(\cdot)$ is computed with the fully connected layer, which shares its network parameters with the conventional utility function; the action with the maximum sparse utility function value, $\tilde{u}_i = \arg\max_a \tilde{Q}_i(a)$, is selected and its value is recorded as $\tilde{Q}_i(\tilde{u}_i)$.
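The two utility heads can be sketched as follows; the structure (a fully connected layer shared by the conventional and the sparse branch, applied to the concatenation of the attention output and the GRU output) follows the description above, while the names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class UtilityHead(nn.Module):
    """Shared fully connected layer producing the conventional and sparse utilities."""
    def __init__(self, attn_dim: int, rnn_dim: int, n_actions: int):
        super().__init__()
        self.fc = nn.Linear(attn_dim + rnn_dim, n_actions)  # parameters shared by both branches

    def forward(self, e_dense: torch.Tensor, e_sparse: torch.Tensor, y_t: torch.Tensor):
        q_dense = self.fc(torch.cat([e_dense, y_t], dim=-1))    # Q_i(.)
        q_sparse = self.fc(torch.cat([e_sparse, y_t], dim=-1))  # Q~_i(.)
        u = q_dense.argmax(dim=-1)      # greedy action from the conventional utility
        u_sp = q_sparse.argmax(dim=-1)  # greedy action from the sparse utility
        return q_dense, q_sparse, u, u_sp
```

The chosen-action values Q_i(u_i) and Q~_i(u~_i) are what get passed to the hybrid network in the next step.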
Step four, the conventional utility function value $Q_i(u_i)$ and the sparse utility function value $\tilde{Q}_i(\tilde{u}_i)$ of each agent are respectively input into the hybrid network, which is fitted to obtain the conventional global value function $Q_{tot}$ and the sparse global value function $\tilde{Q}_{tot}$ respectively.
The loss function of the conventional global value function is:

$$\mathcal{L}_{attn} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, Q_{tot}(s', \mathbf{u}'; \phi^{-}) - Q_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

where $\mathcal{L}_{attn}$ denotes the loss of the conventional global value function; $\mathcal{D}$ is the experience pool; $r$ denotes the instant reward at the current moment; $\gamma$ denotes the discount factor, taken as 0.99 in this embodiment; $Q_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the conventional self-attention target value output by the target hybrid network; $Q_{tot}(s, \mathbf{u}; \phi)$ denotes the conventional global value function value output by the hybrid network; $s'$ denotes the global state at the next moment after the state transition; $\mathbf{u}'$ denotes the action vector of the $N$ agents at the next moment used by the target hybrid network; $s$ denotes the global state at the current moment; and $\mathbf{u}$ denotes the action vector of the $N$ agents currently input to the hybrid network.
The loss function of the sparse global value function is:

$$\mathcal{L}_{sp} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, \tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-}) - \tilde{Q}_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

where $\mathcal{L}_{sp}$ denotes the loss of the sparse global value function, $\tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the sparse-attention target value output by the target hybrid network, and $\tilde{Q}_{tot}(s, \mathbf{u}; \phi)$ denotes the sparse global value function value output by the hybrid network.
Since the sparse attention mechanism serves as an aid to improving the agent's decision efficiency, the final optimization objective is to minimize the weighted sum of the two loss functions:

$$\mathcal{L} = \lambda\, \mathcal{L}_{attn} + (1 - \lambda)\, \mathcal{L}_{sp}$$

where $\mathcal{L}$ denotes the total loss and $\lambda$ is the weighting coefficient that adjusts the balance between conventional attention and sparse attention.
Since the model parameters are randomly initialized at the beginning of training, it is difficult for the agent to judge which individuals in its observation have a greater influence on the decision, so $\lambda$ is set to 1 at this stage and the focus is placed on exploring different individual weights. As training proceeds, the value of $\lambda$ is gradually decreased until it reaches 0; by then the model has good judgment, so the decision considers only the important individuals and ignores the irrelevant ones.
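The annealing of $\lambda$ can be implemented with a simple linear schedule such as the following sketch; the annealing horizon shown here is an assumption chosen to match the 2,000,000-step training runs mentioned in the experiments.

```python
def lambda_schedule(step: int, anneal_steps: int = 2_000_000) -> float:
    """Linearly decay the conventional-attention weight lambda from 1 to 0
    over the training horizon (horizon value is an illustrative assumption)."""
    return max(0.0, 1.0 - step / anneal_steps)
```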
After training is complete, the utility function network parameters $\theta$ have been determined. In the decision inference stage, the hybrid network is removed, and each agent selects an action to output to the environment according to the local observation input to its own utility function, thereby interacting with the environment; this is consistent with conventional practice and is not described further here. In this way, sparsification is used to assist the agent in dynamic decision making without any information loss, and the method can be embedded into any MARL framework based on value function computation.
Corresponding to the embodiment of the embedded multi-agent reinforcement learning method for assisting decision-making by sparse attention, the invention also provides an embodiment of an embedded multi-agent reinforcement learning device for assisting decision-making by sparse attention.
Referring to fig. 3, an embedded multi-agent reinforcement learning apparatus with sparse attention aided decision provided by the embodiment of the present invention includes one or more processors, and is configured to implement the embedded multi-agent reinforcement learning method with sparse attention aided decision in the foregoing embodiment.
The embodiment of the embedded multi-agent reinforcement learning device using sparse attention to assist decision making can be applied to any device with data processing capability, such as a computer or other equipment or apparatus. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking the software implementation as an example, as a device in the logical sense, it is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from the non-volatile memory into memory and running them. From the hardware level, Fig. 3 shows a hardware structure diagram of a device with data processing capability in which the embedded multi-agent reinforcement learning apparatus using sparse attention to assist decision making is located; in addition to the processor, memory, network interface and non-volatile memory shown in Fig. 3, the device in which the apparatus of the embodiment is located may also include other hardware according to the actual function of that device, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again. For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the present invention further provide a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the embedded multi-agent reinforcement learning method with sparse attention-aided decision in the above embodiments.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The invention is tested in the StarCraft Multi-Agent Challenge (SMAC) game environment, which demonstrates the effectiveness of the embedded reinforcement learning method that uses sparse attention to assist decision making.
At present, SMAC has become a benchmark experimental scenario for evaluating the effectiveness of advanced MARL algorithms, and its main challenge is the micromanagement of individual units. In an SMAC scenario, each allied agent has its own controller and learns its policy, while the enemy agents are controlled by the predefined built-in AI, whose heuristic is to preferentially attack the allied unit with the lowest health. At each training step, an agent's action space consists of moving in four directions, stopping, taking no action, selecting an enemy to attack, and so on. The agents aim to inflict as much damage on the enemy as possible, so as to kill the enemy units and win the battle. During combat the agents therefore need to learn tactics such as focusing fire, maneuvering, and avoiding attacks, which is very challenging.
To verify the effectiveness of the method, experiments were run in six challenging SMAC scenarios: an easy scenario (3s5z), a hard scenario (3s_vs_5z) and four super-hard scenarios (6h_vs_8z, 3s5z_vs_3s6z, corridor, 5s10z). The agents in all of these scenarios are heterogeneous, i.e., composed of multiple types of units; the specific unit composition is given in Table 1.
Table 1 lists the six scenarios of different difficulty in the StarCraft Multi-Agent Challenge (SMAC) environment.
To verify validity, this embodiment embeds the method into three strong algorithms, VDN, QMIX and QPLEX; it can also be embedded into other frameworks such as IQL, QTRAN and CWQMIX.
This example runs all experiments with the Python-based MARL framework PyMARL, keeping the framework's default parameters to ensure comparability. For all experiments, the weighting coefficient $\lambda$ that adjusts conventional and sparse attention is initialized to 1 and gradually decreased to 0; training uses the RMSprop optimizer with a learning rate of $5 \times 10^{-4}$. The total number of training time steps is 2,000,000, except for the four super-hard scenarios, which are trained for 5,000,000 time steps, and the smoothing parameter is 0.99. To ensure sufficient exploration by the agents early in training, the exploration rate of the ε-greedy policy is linearly annealed from 1 to 0. Finally, the experiments were run on an NVIDIA V100 GPU, and each method was run with 5 random seed initializations.
Table 2 compares the win rates of the allied agent teams before and after embedding the method of the present invention, where the methods before embedding are VDN, QMIX and QPLEX, and the corresponding embedded methods using sparsification as an auxiliary decision are AUX(VDN), AUX(QMIX) and AUX(QPLEX); the win-rate comparison of the allied agents over the six SMAC scenarios of different difficulty is as follows.

TABLE 2. Win-rate comparison results
From Table 2 it can be seen that, compared with the current mainstream algorithms VDN, QMIX and QPLEX, the embedded methods AUX(VDN), AUX(QMIX) and AUX(QPLEX), which use sparsification as an auxiliary decision, all achieve higher win rates in the six SMAC scenarios of different difficulty, which means that the method of the present invention effectively improves agent performance across different scenarios.
Second, it is worth noting that, compared with simple scenarios, the method achieves a larger improvement on difficult tasks. For example, in the easy scenario 3s5z, AUX(VDN) is 11% higher than VDN, while AUX(QMIX) and AUX(QPLEX) are only slightly higher, by 1% to 2%. In the super-hard scenario 6h_vs_8z, VDN and QMIX essentially fail to learn any strategy, yielding a 0% win rate, and the win rate of QPLEX is low, only 30%. The method of the present invention raises the win rate of AUX(VDN) to 18%, and both AUX(QMIX) and AUX(QPLEX) are greatly improved, with win rates exceeding 80%. This is because in a simple scenario the number of individuals is small, so each individual has a large effect on the decision, and the gain from the sparse attention mechanism is not significant. In scenarios with a high difficulty coefficient, the observation becomes more complex because of the disparity between enemy and allied numbers or the increase in the number of agents, and the gain brought by selectively attending to the individuals that matter more for decision making is more obvious.
Finally, it is found that embedding the sparsification-as-auxiliary-decision method into QMIX and QPLEX improves the agents' decision accuracy more than embedding it into VDN, which illustrates the importance of the expressive capacity of the hybrid network. The sparse attention mechanism allows the agent decision model to focus on important individuals and assign them more attention weight, thereby improving the rationality of credit assignment and strengthening team cooperation. Compared with QMIX and QPLEX, VDN expresses the global value function as the sum of the agents' local utility functions, so its hybrid network has weaker expressive capacity and cannot fully exploit the advantages of the method.
It can be seen that at the beginning of training the agent cannot distinguish which individuals within its observation range have a greater impact on the decision, so a conventional self-attention mechanism is used to assign attention weights to individuals. As training continues, the agents explore more unknown states and learn increasingly well to distinguish which individuals are more important. Finally, the sparse attention mechanism is fully adopted to allocate the main attention weight to important individuals; ignoring irrelevant individuals at the same time simplifies the joint action search space, thereby improving the agents' decision accuracy and overall performance.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (9)

1. An embedded multi-agent reinforcement learning method using sparse attention to assist decision making is characterized by comprising the following steps:
step 1: initializing utility function network parameters, hybrid network parameters and target hybrid network parameters of the multi-agent;
step 2: coding the local observation of each intelligent agent at the current moment to obtain a local observation coding vector, and respectively obtaining the self-attention output and the sparse attention output of each intelligent agent by utilizing the self-attention and the sparse attention;
step 3: coding a local observation coding vector and a historical observation hidden state of the intelligent agent by using a gated recurrent unit module to obtain a current observation hidden state and a current observation output;
step 4: concatenating the self-attention output and the current observation output, and calculating a local conventional utility function of the intelligent agent by using the fully connected layer; meanwhile, concatenating the sparse attention output and the current observation output, and calculating a local sparse utility function of the intelligent agent by using the fully connected layer;
step 5: respectively inputting the local conventional utility function and the local sparse utility function of each agent into a hybrid network, respectively fitting to obtain a conventional global value function and a sparse global value function, updating utility function network parameters and hybrid network parameters by using the weighted loss of the conventional global value function and the sparse global value function, and completing the training of reinforcement learning;
step 6: in the decision reasoning stage, each agent selects actions to be output to the environment according to local observation and the utility function of the agent, so that the agent interacts with the environment.
2. The embedded multi-agent reinforcement learning method using sparse attention for decision making as claimed in claim 1, wherein the step 1 specifically comprises:
(1.1) initializing the utility function network parameters of the multiple agents, denoted $\theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$, where $N$ denotes the number of agents and $\theta_N$ denotes the utility function parameters of the $N$-th agent;
(1.2) initializing the hybrid network parameters $\phi$;
(1.3) taking the initialized hybrid network parameters $\phi$ as the target hybrid network parameters $\phi^{-}$, and, during subsequent training of the hybrid network, updating the target hybrid network parameters once at fixed intervals.
3. The embedded multi-agent reinforcement learning method using sparse attention for decision making as claimed in claim 1, wherein the step 2 is specifically:
(2.1) the different individuals in the local observation of each agent, $o_t^i = [o_{t,1}^i, o_{t,2}^i, \ldots, o_{t,M}^i]^T$, are encoded into a uniform dimension by an embedding function to obtain the local observation encoding vector $x_t^i$; where $M$ denotes the number of individuals observed by the $i$-th agent, $o_t^i$ denotes the local observation of the $i$-th agent at time $t$, $o_{t,m}^i$ denotes the $m$-th of the $M$ individuals observed by the $i$-th agent at time $t$, the superscript $T$ denotes transpose, $x_t^i$ denotes the local observation encoding vector of the $i$-th agent at time $t$, and $f(\cdot)$ denotes the embedding function;
(2.2) mapping the local observation encoding vector to the key matrix, value matrix and query matrix of the attention mechanism, the attention mechanism comprising self-attention and sparse attention that share parameters;
(2.3) computing the self-attention output and the sparse attention output.
4. The embedded multi-agent reinforcement learning method with sparse attention-aided decision as claimed in claim 3, wherein the self-attention output formula is:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

and the sparse attention module is computed as:

$$\mathrm{SparseAttn}(Q, K, V) = \mathrm{sparsemax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

wherein $\mathrm{Attn}(\cdot)$ denotes the self-attention formula, $\mathrm{softmax}(\cdot)$ denotes the softmax activation function, the superscript $T$ denotes transpose, $d_K$ denotes the dimension of the key matrix $K$, $\mathrm{SparseAttn}(\cdot)$ denotes the sparsified attention formula, and $\mathrm{sparsemax}(\cdot)$ denotes the sparse probability activation function.
5. The embedded multi-agent reinforcement learning method with sparse attention-aided decision making according to claim 1, wherein the embedding function of step (2.1) is implemented by a fully connected layer network.
6. The embedded multi-agent reinforcement learning method with sparse attention-aided decision as claimed in claim 1, wherein the loss functions of the conventional global value function and the sparse global value function are respectively:

$$\mathcal{L}_{attn} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, Q_{tot}(s', \mathbf{u}'; \phi^{-}) - Q_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

$$\mathcal{L}_{sp} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, \tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-}) - \tilde{Q}_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

wherein $\mathcal{L}_{attn}$ denotes the loss of the conventional global value function, $\mathcal{L}_{sp}$ denotes the loss of the sparse global value function, $\mathcal{D}$ is the experience pool, $r$ denotes the instant reward at the current moment, $\gamma$ denotes the discount factor, $Q_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the conventional self-attention target value output by the target hybrid network, $Q_{tot}(s, \mathbf{u}; \phi)$ denotes the conventional global value function value output by the hybrid network, $\tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the sparse-attention target value output by the target hybrid network, $\tilde{Q}_{tot}(s, \mathbf{u}; \phi)$ denotes the sparse global value function value output by the hybrid network, $s'$ denotes the global state at the next moment after the state transition, $\mathbf{u}'$ denotes the action vector of the $N$ agents at the next moment used by the target hybrid network, $s$ denotes the global state at the current moment, and $\mathbf{u}$ denotes the action vector of the $N$ agents currently input to the hybrid network.
7. The embedded multi-agent reinforcement learning method with sparse attention-aided decision as claimed in claim 6, wherein in the reinforcement learning training stage of step 5, the weight of the conventional global value function is gradually reduced from 1 to 0.
8. An embedded multi-agent reinforcement learning device with sparse attention aided decision, characterized by comprising one or more processors for implementing the embedded multi-agent reinforcement learning method with sparse attention aided decision of any one of claims 1 to 7.
9. A computer-readable storage medium, having a program stored thereon, which program, when being executed by a processor, is adapted to carry out the embedded multi-agent reinforcement learning method with sparse attention aided decision of any of the claims 1-7.
CN202210508557.0A 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making Pending CN114626499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508557.0A CN114626499A (en) 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210508557.0A CN114626499A (en) 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Publications (1)

Publication Number Publication Date
CN114626499A true CN114626499A (en) 2022-06-14

Family

ID=81905750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508557.0A Pending CN114626499A (en) 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Country Status (1)

Country Link
CN (1) CN114626499A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220104034A1 (en) * 2020-09-30 2022-03-31 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method of association of user equipment in a cellular network according to a transferable association policy
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112949856A (en) * 2021-03-09 2021-06-11 华东师范大学 Multi-agent reinforcement learning method and system based on sparse attention mechanism
CN114169421A (en) * 2021-12-01 2022-03-11 天津大学 Multi-agent sparse rewarding environment cooperation exploration method based on internal motivation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENHAO LI ET AL.: "SparseMAAC: Sparse Attention for Multi-agent Reinforcement Learning", Database Systems for Advanced Applications *
李文浩: "Research on decentralized multi-agent reinforcement learning algorithms", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114866494A (en) * 2022-07-05 2022-08-05 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
CN114866494B (en) * 2022-07-05 2022-09-20 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
US11979295B2 (en) 2022-07-05 2024-05-07 Zhejiang Lab Reinforcement learning agent training method, modal bandwidth resource scheduling method and apparatus
CN114942653A (en) * 2022-07-26 2022-08-26 北京邮电大学 Method and device for determining unmanned cluster flight strategy and electronic equipment
CN115081752A (en) * 2022-08-11 2022-09-20 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method
CN115081752B (en) * 2022-08-11 2022-11-22 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method

Similar Documents

Publication Publication Date Title
CN114626499A (en) Embedded multi-agent reinforcement learning method using sparse attention to assist decision making
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN109063823B (en) Batch A3C reinforcement learning method for exploring 3D maze by intelligent agent
CN107886510A (en) A kind of prostate MRI dividing methods based on three-dimensional full convolutional neural networks
CN112784968A (en) Hybrid pipeline parallel method for accelerating distributed deep neural network training
CN107818367A (en) Processing system and processing method for neutral net
CN109376852A (en) Arithmetic unit and operation method
CN112215350A (en) Smart agent control method and device based on reinforcement learning
CN116454926B (en) Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network
CN115018017B (en) Multi-agent credit allocation method, system and equipment based on ensemble learning
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN116757497B (en) Multi-mode military intelligent auxiliary combat decision-making method based on graph-like perception transducer
CN110188880A (en) A kind of quantization method and device of deep neural network
CN108320018A (en) A kind of device and method of artificial neural network operation
CN116629461B (en) Distributed optimization method, system, equipment and storage medium for active power distribution network
CN114757362A (en) Multi-agent system communication method based on edge enhancement and related device
CN113141012A (en) Power grid power flow regulation and control decision reasoning method based on deep deterministic strategy gradient network
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN114139778A (en) Wind turbine generator power prediction modeling method and device
CN114662798B (en) Scheduling method and device based on power grid economic operation domain and electronic equipment
CN108960420A (en) Processing method and accelerator
CN116070504A (en) Digital twin simulation system of efficient refrigeration machine room
Huang et al. Multi-agent cooperative strategy learning method based on transfer Learning
Li Deep reinforcement learning on wind power optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220614