CN114626499A - Embedded multi-agent reinforcement learning method using sparse attention to assist decision making - Google Patents

Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Info

Publication number
CN114626499A
CN114626499A
Authority
CN
China
Prior art keywords
agent
attention
sparse
output
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210508557.0A
Other languages
Chinese (zh)
Inventor
吴超
罗双
李皓
王永恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Lab
Original Assignee
Zhejiang University ZJU
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Lab filed Critical Zhejiang University ZJU
Priority to CN202210508557.0A priority Critical patent/CN114626499A/en
Publication of CN114626499A publication Critical patent/CN114626499A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an embedded multi-agent reinforcement learning method that uses sparse attention to assist decision making, belonging to the technical field of reinforcement learning. The method initializes the utility function network parameters, hybrid network parameters and target hybrid network parameters of the multiple agents; obtains the self-attention output and the sparse attention output of each agent; encodes the current observation output with a gated recurrent unit module; computes a local conventional utility function and a local sparse utility function and inputs them into the hybrid network to fit a conventional global value function and a sparse global value function respectively; and gradually reduces the weight of the conventional global value function to complete the reinforcement learning training. In the decision inference stage, each agent selects an action to output to the environment according to its local observation and its own utility function, thereby interacting with the environment. The method can be embedded into any value-function-based MARL framework and improves the efficiency and accuracy of agent decision making.

Description

Embedded multi-agent reinforcement learning method using sparse attention to assist decision making
Technical Field
The invention belongs to the technical field of reinforcement learning, and particularly relates to an embedded multi-agent reinforcement learning method for assisting decision making by using sparse attention.
Background
Multi-agent reinforcement learning (MARL) provides a framework in which multiple agents jointly solve a complex sequential decision problem, and it has very wide applications in fields such as robot games, traffic light control and autonomous driving. The relationships among agents in MARL can be classified as fully cooperative, fully competitive, or mixed (neither fully cooperative nor fully competitive).
Currently, the mainstream MARL training paradigm is the Centralized Training with Decentralized Execution (CTDE) framework: in the centralized training phase, an agent's decision model can access global state information to help the agent better explore different strategies, but in the inference phase, the agent makes decisions only from its own local observation. The CTDE framework relies on the Individual-Global-Max (IGM) principle, which guarantees consistency between the individually optimal decisions and the globally optimal decision; by maximizing its individual utility function, each agent enables the whole team to obtain the optimal global return. Thus, in cooperative MARL, improving the individual utility functions benefits the whole.
The existing value-function-based methods are mainly VDN, QMIX, QPLEX and the like. VDN sums the agents' local utility functions to obtain the global value function. Because direct summation has limited factorization and representation capability, QMIX improves on VDN by aggregating the agents' local utility functions non-linearly through a hybrid (mixing) network whose weights are generated from the global state information, while preserving the monotonicity constraint between individual and global values (see the sketch at the end of this section). QPLEX further introduces an advantage-function-based method, decomposing the local utility function Q into a state value function V and a separate action advantage function A, thereby reducing the influence of the state on the decision and paying more attention to the benefit brought by different actions. The above value-function-based methods mainly have the following problems:
(1) The improvements mainly concern how to aggregate the local agents' utility functions into a global value function, and pay no attention to improving the network structure of the agents themselves. As the number of agents in a MARL environment increases, the joint action space keeps growing, which makes exploration by the agents more difficult.
(2) Agents make decisions based on their own observations. Because interactions between agents are sparse, an agent does not need to attend to all individuals at the same time; different individuals within an observation have different influences on the decision, and their importance changes over time.
(3) Directly introducing an attention mechanism helps the agent allocate different amounts of attention to different individuals, but the conventional attention mechanism uses the softmax activation function and therefore cannot completely ignore irrelevant individuals; conversely, if a sparsification method is used directly to zero out unrelated individuals, the agent cannot explore more strategies, and it is difficult to distinguish which individuals are more important in the initial period of training the agent model.
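For reference, the two aggregation styles mentioned above (VDN's direct summation and QMIX's monotonic hybrid network) can be sketched as follows. This is a minimal, generic PyTorch illustration written for this description; the class and parameter names (MonotonicMixer, embed_dim, and so on) are assumptions and do not come from the patented method or from any specific codebase.

```python
import torch
import torch.nn as nn

def vdn_mix(agent_qs: torch.Tensor) -> torch.Tensor:
    """VDN: the global value is simply the sum of the local utilities.
    agent_qs: (batch, n_agents) chosen-action utilities Q_i."""
    return agent_qs.sum(dim=-1, keepdim=True)

class MonotonicMixer(nn.Module):
    """QMIX-style hybrid (mixing) network: non-linear aggregation whose weights
    are generated from the global state and kept non-negative, preserving the
    monotonicity between individual utilities and the global value."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        q_tot = torch.bmm(hidden, w2) + b2  # (b, 1, 1)
        return q_tot.view(b, 1)
```

The non-negative weights (torch.abs) are what keep the global value monotonic in each local utility, which is the property that allows decentralized greedy action selection under the IGM principle.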
Disclosure of Invention
In order to overcome the defects of the prior art and to solve the problems of an excessively large joint action space and difficult exploration caused by the growing number of agents in multi-agent reinforcement learning, the invention provides an embedded multi-agent reinforcement learning method that uses sparse attention to assist decision making. The invention improves the local utility function of the agent, so the method can be embedded into any value-function-based MARL framework and has wide applicability.
The invention is realized by the following technical scheme:
an embedded multi-agent reinforcement learning method with sparse attention-aided decision making comprises the following steps:
step 1: initializing utility function network parameters, hybrid network parameters and target hybrid network parameters of the multi-agent;
step 2: coding the local observation of each intelligent agent at the current moment to obtain a local observation coding vector, and respectively obtaining self-attention output and sparse attention output of each intelligent agent by utilizing self-attention and sparse attention;
step 3: coding a local observation coding vector and a historical observation hidden state of the intelligent agent by using a gated recurrent unit module to obtain a current observation hidden state and a current observation output;
step 4: concatenating the self-attention output and the current observation output, and calculating a local conventional utility function of the intelligent agent by using the fully connected layer; meanwhile, concatenating the sparse attention output and the current observation output, and calculating a local sparse utility function of the intelligent agent by using the fully connected layer;
step 5: respectively inputting the local conventional utility function and the local sparse utility function of each agent into a hybrid network, respectively fitting to obtain a conventional global value function and a sparse global value function, updating utility function network parameters and hybrid network parameters by using the weighted loss of the conventional global value function and the sparse global value function, and completing the training of reinforcement learning;
step 6: in the decision reasoning stage, each agent selects actions to be output to the environment according to local observation and the utility function of the agent, so that the agent interacts with the environment.
The invention has the following beneficial effects:
(1) The sparse attention mechanism is used as an auxiliary decision mechanism, helping the agent dynamically screen out, during training, the individuals that matter most for its decision while ignoring unimportant ones, which alleviates the difficulty of exploring a joint action space that keeps growing with the number of agents.
(2) The invention can be embedded into any value-function-based MARL scheme, such as the classical algorithms VDN, QMIX and QPLEX, because the conventional attention and sparse attention computations inside the agent's local utility function do not change the monotonicity between the individual utility functions and the global value function.
(3) The invention gradually transitions from conventional self-attention computation at the start of training to sparse attention computation, which ensures that the agent explores more individual attention distributions early in training and, as training proceeds, gradually shifts toward attention distributions that focus only on important individuals while completely ignoring irrelevant ones. This progressively improves decision accuracy and prevents the agent from falling into local optima during exploration.
Drawings
FIG. 1 is a schematic diagram of an embedded multi-agent reinforcement learning method with sparse attention-aided decision making according to an embodiment of the present invention.
FIG. 2 is a flow chart of an embedded multi-agent reinforcement learning method with sparse attention-aided decision making according to an embodiment of the present invention.
Fig. 3 is a block diagram of an embedded multi-agent reinforcement learning apparatus with sparse attention-aided decision-making according to an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings.
The method is applied to fully cooperative MARL modeling and aims to guide the agent to make sound decisions by using the sparse attention mechanism as an auxiliary decision method, thereby improving the agent's decision efficiency and accuracy.
As shown in the schematic diagram of fig. 1 and the flowchart of fig. 2, the method for embedded multi-agent reinforcement learning with sparse attention-aided decision mainly includes the following steps:
step one, initializing self utility function network parameters of a plurality of agents
Figure 263867DEST_PATH_IMAGE001
Hybrid network parameters
Figure 649849DEST_PATH_IMAGE002
And target hybrid network parameters
Figure 23061DEST_PATH_IMAGE003
That is, in the initialization process, the initialized hybrid network parameters are used as target hybrid network parameters, and when the hybrid network parameters are subsequently trained and updated, the target hybrid network parameters are updated once at intervals. Wherein the content of the first and second substances,Nthe number of the agents is represented,
Figure 887112DEST_PATH_IMAGE004
is shown asNThe self utility function parameter of each agent.
Suppose that at time $t$ the observation of an agent contains $M$ individuals. The different individuals in the local observation of each agent, $o_t^i = [o_{t,1}^i, o_{t,2}^i, \ldots, o_{t,M}^i]^T$, are encoded into a uniform dimension by an embedding function $f(\cdot)$ to obtain the local observation encoding vector $x_t^i$; where $o_t^i$ denotes the local observation of the $i$-th agent at time $t$, $o_{t,m}^i$ denotes the $m$-th of the $M$ individuals observed by the $i$-th agent at time $t$, the superscript $T$ denotes transpose, and $x_t^i$ denotes the local observation encoding vector of the $i$-th agent at time $t$.
The local observation encoding vector of each agent at the current moment is mapped to the key matrix $K$, value matrix $V$ and query matrix $Q$ of the attention mechanism. The mapping formula is:

$$K = x_t^i W_K, \qquad V = x_t^i W_V, \qquad Q = x_t^i W_Q$$

where $W_K$, $W_V$ and $W_Q$ are parameter matrices that need to be trained. The conventional attention mechanism and the sparse attention mechanism share these parameters, and during training the sparse attention mechanism serves as an aid that guides the conventional attention mechanism.
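As an illustration of this shared projection, the following is a minimal sketch (module and dimension names are assumptions, not code from the patent); a single set of $W_Q$, $W_K$, $W_V$ parameters is reused by both the conventional and the sparse attention branch.

```python
import torch
import torch.nn as nn

class SharedQKVProjection(nn.Module):
    """One set of Q/K/V projection matrices shared by both attention branches."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.w_q = nn.Linear(embed_dim, embed_dim, bias=False)  # W_Q
        self.w_k = nn.Linear(embed_dim, embed_dim, bias=False)  # W_K
        self.w_v = nn.Linear(embed_dim, embed_dim, bias=False)  # W_V

    def forward(self, x: torch.Tensor):
        # x: (M, embed_dim) embedded individuals in one agent's local observation
        return self.w_q(x), self.w_k(x), self.w_v(x)
```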
Step two, the $Q$ and $K$ matrices are respectively input into the self-attention module and the sparse attention module to obtain a dense attention distribution weight and a sparse attention distribution weight, which are then used to take weighted sums over the $V$ matrix.
The self-attention module is computed as:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

where $\mathrm{Attn}(\cdot)$ denotes the self-attention formula, $\mathrm{softmax}(\cdot)$ denotes the softmax activation function, the superscript $T$ denotes transpose, and $d_K$ denotes the dimension of the $K$ matrix.

The sparse attention module is computed as:

$$\mathrm{SparseAttn}(Q, K, V) = \mathrm{sparsemax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

where $\mathrm{SparseAttn}(\cdot)$ denotes the sparsified attention formula and $\mathrm{sparsemax}(\cdot)$ denotes the sparse probability activation function, defined as

$$\mathrm{sparsemax}(z) = \operatorname*{arg\,min}_{p \in \Delta^{d-1}} \|p - z\|_2^2, \qquad \Delta^{d-1} = \{\, p \in \mathbb{R}^{d} : \mathbf{1}^T p = 1,\ p \ge 0 \,\}$$

where $p$ is a probability vector on the simplex $\Delta^{d-1}$, $d$ is the dimension of $p$, and $z$ denotes a row vector of the matrix being processed; the solution has the closed form $\mathrm{sparsemax}_j(z) = \max(z_j - \tau(z), 0)$ for a threshold $\tau(z)$. In other words, the sparse attention module shifts and truncates the result of the $QK^T$ product at a threshold, so that individuals with a large influence on the decision are retained while the attention weights of individuals irrelevant to the decision are set to zero.

For convenience of notation, the self-attention output of the $i$-th agent is denoted $e_i$, and the sparse attention output of the $i$-th agent is denoted $\tilde{e}_i$.
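The two activation functions can be made concrete with the following sketch, which assumes the $Q$, $K$, $V$ matrices produced above for one agent; the sparsemax implementation follows the standard sorting-based projection onto the simplex, and the function names are illustrative rather than taken from the patent.

```python
import torch
import torch.nn.functional as F

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparse probability activation applied to the last dimension:
    Euclidean projection of each row onto the probability simplex."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    # size of the support: largest k with 1 + k * z_(k) > sum_{j<=k} z_(j)
    support = (1 + k * z_sorted) > cumsum
    k_z = support.sum(dim=-1, keepdim=True).clamp(min=1)
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

def dual_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """Return the dense (softmax) and sparse (sparsemax) attention outputs
    e_i and e~_i for one agent, sharing the same Q, K, V."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    dense_out = F.softmax(scores, dim=-1) @ v
    sparse_out = sparsemax(scores) @ v
    return dense_out, sparse_out
```

Because sparsemax can return exact zeros, the attention weights of irrelevant individuals are truly removed rather than merely made small, which is the property exploited by the sparse branch.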
Step three, a gated recurrent unit (GRU) module encodes the agent's local observation encoding vector $x_t^i$ and the historical observation hidden state $h_{t-1}^i$ to obtain the current observation hidden state $h_t^i$ and the current observation output $y_t^i$:

$$(y_t^i, h_t^i) = \mathrm{GRU}(x_t^i, h_{t-1}^i)$$
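A minimal sketch of this encoding step is given below, assuming the per-individual embeddings have been flattened or pooled into a single input vector per agent; the class name and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """GRU cell carrying the agent's observation history."""
    def __init__(self, obs_embed_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_embed_dim, hidden_dim)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor):
        # x_t: current local observation encoding, (batch, obs_embed_dim)
        # h_prev: historical observation hidden state, (batch, hidden_dim)
        h_t = self.rnn(x_t, h_prev)
        # for a GRU cell, the current observation output equals the new hidden state
        return h_t, h_t
```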
The self-attention output $e_i$ is concatenated with the current observation output $y_t^i$, and a fully connected layer is used to compute the agent's conventional utility function $Q_i(\cdot)$; the action that maximizes the utility function value, $u_i = \arg\max_a Q_i(a)$, is selected and its value is recorded as $Q_i(u_i)$. Meanwhile, the sparse attention output $\tilde{e}_i$ is concatenated with the current observation output $y_t^i$, and the agent's sparse utility function $\tilde{Q}_i(\cdot)$ is computed with the fully connected layer, which shares its network parameters with the conventional utility function; the action with the maximum sparse utility function value, $\tilde{u}_i = \arg\max_a \tilde{Q}_i(a)$, is selected and its value is recorded as $\tilde{Q}_i(\tilde{u}_i)$.
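The two utility heads can be sketched as follows; the structure (a fully connected layer shared by the conventional and the sparse branch, applied to the concatenation of the attention output and the GRU output) follows the description above, while the names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class UtilityHead(nn.Module):
    """Shared fully connected layer producing the conventional and sparse utilities."""
    def __init__(self, attn_dim: int, rnn_dim: int, n_actions: int):
        super().__init__()
        self.fc = nn.Linear(attn_dim + rnn_dim, n_actions)  # parameters shared by both branches

    def forward(self, e_dense: torch.Tensor, e_sparse: torch.Tensor, y_t: torch.Tensor):
        q_dense = self.fc(torch.cat([e_dense, y_t], dim=-1))    # Q_i(.)
        q_sparse = self.fc(torch.cat([e_sparse, y_t], dim=-1))  # Q~_i(.)
        u = q_dense.argmax(dim=-1)      # greedy action from the conventional utility
        u_sp = q_sparse.argmax(dim=-1)  # greedy action from the sparse utility
        return q_dense, q_sparse, u, u_sp
```

The chosen-action values Q_i(u_i) and Q~_i(u~_i) are what get passed to the hybrid network in the next step.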
Step four, the conventional utility function value $Q_i(u_i)$ and the sparse utility function value $\tilde{Q}_i(\tilde{u}_i)$ of each agent are respectively input into the hybrid network, which is fitted to obtain the conventional global value function $Q_{tot}$ and the sparse global value function $\tilde{Q}_{tot}$ respectively.
The loss function of the conventional global value function is:

$$\mathcal{L}_{attn} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, Q_{tot}(s', \mathbf{u}'; \phi^{-}) - Q_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

where $\mathcal{L}_{attn}$ denotes the loss of the conventional global value function; $\mathcal{D}$ is the experience pool; $r$ denotes the instant reward at the current moment; $\gamma$ denotes the discount factor, taken as 0.99 in this embodiment; $Q_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the conventional self-attention target value output by the target hybrid network; $Q_{tot}(s, \mathbf{u}; \phi)$ denotes the conventional global value function value output by the hybrid network; $s'$ denotes the global state at the next moment after the state transition; $\mathbf{u}'$ denotes the action vector of the $N$ agents at the next moment used by the target hybrid network; $s$ denotes the global state at the current moment; and $\mathbf{u}$ denotes the action vector of the $N$ agents currently input to the hybrid network.
The loss function of the sparse global value function is:

$$\mathcal{L}_{sp} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, \tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-}) - \tilde{Q}_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

where $\mathcal{L}_{sp}$ denotes the loss of the sparse global value function, $\tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the sparse-attention target value output by the target hybrid network, and $\tilde{Q}_{tot}(s, \mathbf{u}; \phi)$ denotes the sparse global value function value output by the hybrid network.
Since the sparse attention mechanism serves as an aid to improving the agent's decision efficiency, the final optimization objective is to minimize the weighted sum of the two loss functions:

$$\mathcal{L} = \lambda\, \mathcal{L}_{attn} + (1 - \lambda)\, \mathcal{L}_{sp}$$

where $\mathcal{L}$ denotes the total loss and $\lambda$ is the weighting coefficient that adjusts the balance between conventional attention and sparse attention.
Since the model parameters are randomly initialized at the beginning of training, it is difficult for the agent to judge which individuals in its observation have a greater influence on the decision, so $\lambda$ is set to 1 at this stage and the focus is placed on exploring different individual weights. As training proceeds, the value of $\lambda$ is gradually decreased until it reaches 0; by then the model has good judgment, so the decision considers only the important individuals and ignores the irrelevant ones.
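The annealing of $\lambda$ can be implemented with a simple linear schedule such as the following sketch; the annealing horizon shown here is an assumption chosen to match the 2,000,000-step training runs mentioned in the experiments.

```python
def lambda_schedule(step: int, anneal_steps: int = 2_000_000) -> float:
    """Linearly decay the conventional-attention weight lambda from 1 to 0
    over the training horizon (horizon value is an illustrative assumption)."""
    return max(0.0, 1.0 - step / anneal_steps)
```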
After training is complete, the utility function network parameters $\theta$ have been determined. In the decision inference stage, the hybrid network is removed, and each agent selects an action to output to the environment according to the local observation input to its own utility function, thereby interacting with the environment; this is consistent with conventional practice and is not described further here. In this way, sparsification is used to assist the agent in dynamic decision making without any information loss, and the method can be embedded into any MARL framework based on value function computation.
Corresponding to the embodiment of the embedded multi-agent reinforcement learning method for assisting decision-making by sparse attention, the invention also provides an embodiment of an embedded multi-agent reinforcement learning device for assisting decision-making by sparse attention.
Referring to fig. 3, an embedded multi-agent reinforcement learning apparatus with sparse attention aided decision provided by the embodiment of the present invention includes one or more processors, and is configured to implement the embedded multi-agent reinforcement learning method with sparse attention aided decision in the foregoing embodiment.
The embodiment of the embedded multi-agent reinforcement learning device using sparse attention to assist decision making can be applied to any device with data processing capability, such as a computer or other equipment or apparatus. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking the software implementation as an example, as a device in the logical sense, it is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from the non-volatile memory into memory and running them. From the hardware level, Fig. 3 shows a hardware structure diagram of a device with data processing capability in which the embedded multi-agent reinforcement learning apparatus using sparse attention to assist decision making is located; in addition to the processor, memory, network interface and non-volatile memory shown in Fig. 3, the device in which the apparatus of the embodiment is located may also include other hardware according to the actual function of that device, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again. For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the present invention further provide a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the embedded multi-agent reinforcement learning method with sparse attention-aided decision in the above embodiments.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The invention is tested in the StarCraft Multi-Agent Challenge (SMAC) game environment, which demonstrates the effectiveness of the embedded reinforcement learning method that uses sparse attention to assist decision making.
At present, SMAC has become a benchmark experimental scenario for evaluating the effectiveness of advanced MARL algorithms, and its main challenge is the micromanagement of individual units. In an SMAC scenario, each allied agent has its own controller and learns its policy, while the enemy agents are controlled by the predefined built-in AI, whose heuristic is to preferentially attack the allied unit with the lowest health. At each training step, an agent's action space consists of moving in four directions, stopping, taking no action, selecting an enemy to attack, and so on. The agents aim to inflict as much damage on the enemy as possible, so as to kill the enemy units and win the battle. During combat the agents therefore need to learn tactics such as focusing fire, maneuvering, and avoiding attacks, which is very challenging.
To verify the effectiveness of the method, experiments were run in six challenging SMAC scenarios: an easy scenario (3s5z), a hard scenario (3s_vs_5z) and four super-hard scenarios (6h_vs_8z, 3s5z_vs_3s6z, corridor, 5s10z). The agents in all of these scenarios are heterogeneous, i.e., composed of multiple types of units; the specific unit composition is given in Table 1.
Table 1 lists the six scenarios of different difficulty in the StarCraft Multi-Agent Challenge (SMAC) environment.
To verify validity, this embodiment embeds the method into three strong algorithms, VDN, QMIX and QPLEX; it can also be embedded into other frameworks such as IQL, QTRAN and CWQMIX.
This example runs all experiments with the Python-based MARL framework PyMARL, keeping the framework's default parameters to ensure comparability. For all experiments, the weighting coefficient $\lambda$ that adjusts conventional and sparse attention is initialized to 1 and gradually decreased to 0; training uses the RMSprop optimizer with a learning rate of $5 \times 10^{-4}$. The total number of training time steps is 2,000,000, except for the four super-hard scenarios, which are trained for 5,000,000 time steps, and the smoothing parameter is 0.99. To ensure sufficient exploration by the agents early in training, the exploration rate of the ε-greedy policy is linearly annealed from 1 to 0. Finally, the experiments were run on an NVIDIA V100 GPU, and each method was run with 5 random seed initializations.
Table 2 compares the win rates of the allied agent teams before and after embedding the method of the present invention, where the methods before embedding are VDN, QMIX and QPLEX, and the corresponding embedded methods using sparsification as an auxiliary decision are AUX(VDN), AUX(QMIX) and AUX(QPLEX); the win-rate comparison of the allied agents over the six SMAC scenarios of different difficulty is as follows.

TABLE 2. Win-rate comparison results
From Table 2 it can be seen that, compared with the current mainstream algorithms VDN, QMIX and QPLEX, the embedded methods AUX(VDN), AUX(QMIX) and AUX(QPLEX), which use sparsification as an auxiliary decision, all achieve higher win rates in the six SMAC scenarios of different difficulty, which means that the method of the present invention effectively improves agent performance across different scenarios.
Second, it is worth noting that, compared with simple scenarios, the method achieves a larger improvement on difficult tasks. For example, in the easy scenario 3s5z, AUX(VDN) is 11% higher than VDN, while AUX(QMIX) and AUX(QPLEX) are only slightly higher, by 1% to 2%. In the super-hard scenario 6h_vs_8z, VDN and QMIX essentially fail to learn any strategy, yielding a 0% win rate, and the win rate of QPLEX is low, only 30%. The method of the present invention raises the win rate of AUX(VDN) to 18%, and both AUX(QMIX) and AUX(QPLEX) are greatly improved, with win rates exceeding 80%. This is because in a simple scenario the number of individuals is small, so each individual has a large effect on the decision, and the gain from the sparse attention mechanism is not significant. In scenarios with a high difficulty coefficient, the observation becomes more complex because of the disparity between enemy and allied numbers or the increase in the number of agents, and the gain brought by selectively attending to the individuals that matter more for decision making is more obvious.
Finally, it is found that embedding the sparsification-as-auxiliary-decision method into QMIX and QPLEX improves the agents' decision accuracy more than embedding it into VDN, which illustrates the importance of the expressive capacity of the hybrid network. The sparse attention mechanism allows the agent decision model to focus on important individuals and assign them more attention weight, thereby improving the rationality of credit assignment and strengthening team cooperation. Compared with QMIX and QPLEX, VDN expresses the global value function as the sum of the agents' local utility functions, so its hybrid network has weaker expressive capacity and cannot fully exploit the advantages of the method.
It can be seen that at the beginning of training the agent cannot distinguish which individuals within its observation range have a greater impact on the decision, so a conventional self-attention mechanism is used to assign attention weights to individuals. As training continues, the agents explore more unknown states and learn increasingly well to distinguish which individuals are more important. Finally, the sparse attention mechanism is fully adopted to allocate the main attention weight to important individuals; ignoring irrelevant individuals at the same time simplifies the joint action search space, thereby improving the agents' decision accuracy and overall performance.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (9)

1. An embedded multi-agent reinforcement learning method using sparse attention to assist decision making is characterized by comprising the following steps:
step 1: initializing utility function network parameters, hybrid network parameters and target hybrid network parameters of the multi-agent;
step 2: coding the local observation of each intelligent agent at the current moment to obtain a local observation coding vector, and respectively obtaining the self-attention output and the sparse attention output of each intelligent agent by utilizing the self-attention and the sparse attention;
step 3: coding a local observation coding vector and a historical observation hidden state of the intelligent agent by using a gated recurrent unit module to obtain a current observation hidden state and a current observation output;
step 4: concatenating the self-attention output and the current observation output, and calculating a local conventional utility function of the intelligent agent by using the fully connected layer; meanwhile, concatenating the sparse attention output and the current observation output, and calculating a local sparse utility function of the intelligent agent by using the fully connected layer;
step 5: respectively inputting the local conventional utility function and the local sparse utility function of each agent into a hybrid network, respectively fitting to obtain a conventional global value function and a sparse global value function, updating utility function network parameters and hybrid network parameters by using the weighted loss of the conventional global value function and the sparse global value function, and completing the training of reinforcement learning;
step 6: in the decision reasoning stage, each agent selects actions to be output to the environment according to local observation and the utility function of the agent, so that the agent interacts with the environment.
2. The embedded multi-agent reinforcement learning method using sparse attention for decision making as claimed in claim 1, wherein the step 1 specifically comprises:
(1.1) initializing the utility function network parameters of the multiple agents, denoted $\theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$, where $N$ denotes the number of agents and $\theta_N$ denotes the utility function parameters of the $N$-th agent;
(1.2) initializing the hybrid network parameters $\phi$;
(1.3) taking the initialized hybrid network parameters $\phi$ as the target hybrid network parameters $\phi^{-}$, and, during subsequent training of the hybrid network, updating the target hybrid network parameters once at fixed intervals.
3. The embedded multi-agent reinforcement learning method using sparse attention for decision making as claimed in claim 1, wherein the step 2 is specifically:
(2.1) the different individuals in the local observation of each agent, $o_t^i = [o_{t,1}^i, o_{t,2}^i, \ldots, o_{t,M}^i]^T$, are encoded into a uniform dimension by an embedding function to obtain the local observation encoding vector $x_t^i$; where $M$ denotes the number of individuals observed by the $i$-th agent, $o_t^i$ denotes the local observation of the $i$-th agent at time $t$, $o_{t,m}^i$ denotes the $m$-th of the $M$ individuals observed by the $i$-th agent at time $t$, the superscript $T$ denotes transpose, $x_t^i$ denotes the local observation encoding vector of the $i$-th agent at time $t$, and $f(\cdot)$ denotes the embedding function;
(2.2) mapping the local observation encoding vector to the key matrix, value matrix and query matrix of the attention mechanism, the attention mechanism comprising self-attention and sparse attention that share parameters;
(2.3) computing the self-attention output and the sparse attention output.
4. The embedded multi-agent reinforcement learning method with sparse attention-aided decision as claimed in claim 3, wherein the self-attention output formula is:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

and the sparse attention module is computed as:

$$\mathrm{SparseAttn}(Q, K, V) = \mathrm{sparsemax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V$$

wherein $\mathrm{Attn}(\cdot)$ denotes the self-attention formula, $\mathrm{softmax}(\cdot)$ denotes the softmax activation function, the superscript $T$ denotes transpose, $d_K$ denotes the dimension of the key matrix $K$, $\mathrm{SparseAttn}(\cdot)$ denotes the sparsified attention formula, and $\mathrm{sparsemax}(\cdot)$ denotes the sparse probability activation function.
5. The embedded multi-agent reinforcement learning method with sparse attention-aided decision making according to claim 1, wherein the embedding function of step (2.1) is implemented by a fully connected layer network.
6. The embedded multi-agent reinforcement learning method with sparse attention-aided decision as claimed in claim 1, wherein the loss functions of the conventional global value function and the sparse global value function are respectively:

$$\mathcal{L}_{attn} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, Q_{tot}(s', \mathbf{u}'; \phi^{-}) - Q_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

$$\mathcal{L}_{sp} = \mathbb{E}_{(s,\mathbf{u},r,s') \sim \mathcal{D}}\Big[\big(r + \gamma\, \tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-}) - \tilde{Q}_{tot}(s, \mathbf{u}; \phi)\big)^{2}\Big]$$

wherein $\mathcal{L}_{attn}$ denotes the loss of the conventional global value function, $\mathcal{L}_{sp}$ denotes the loss of the sparse global value function, $\mathcal{D}$ is the experience pool, $r$ denotes the instant reward at the current moment, $\gamma$ denotes the discount factor, $Q_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the conventional self-attention target value output by the target hybrid network, $Q_{tot}(s, \mathbf{u}; \phi)$ denotes the conventional global value function value output by the hybrid network, $\tilde{Q}_{tot}(s', \mathbf{u}'; \phi^{-})$ denotes the sparse-attention target value output by the target hybrid network, $\tilde{Q}_{tot}(s, \mathbf{u}; \phi)$ denotes the sparse global value function value output by the hybrid network, $s'$ denotes the global state at the next moment after the state transition, $\mathbf{u}'$ denotes the action vector of the $N$ agents at the next moment used by the target hybrid network, $s$ denotes the global state at the current moment, and $\mathbf{u}$ denotes the action vector of the $N$ agents currently input to the hybrid network.
7. The embedded multi-agent reinforcement learning method with sparse attention-aided decision as claimed in claim 6, wherein in the reinforcement learning training stage of step 5, the weight of the conventional global value function is gradually reduced from 1 to 0.
8. An embedded multi-agent reinforcement learning device with sparse attention aided decision, characterized by comprising one or more processors for implementing the embedded multi-agent reinforcement learning method with sparse attention aided decision of any one of claims 1 to 7.
9. A computer-readable storage medium, having a program stored thereon, which program, when being executed by a processor, is adapted to carry out the embedded multi-agent reinforcement learning method with sparse attention aided decision of any of the claims 1-7.
CN202210508557.0A 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making Pending CN114626499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508557.0A CN114626499A (en) 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210508557.0A CN114626499A (en) 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Publications (1)

Publication Number Publication Date
CN114626499A true CN114626499A (en) 2022-06-14

Family

ID=81905750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508557.0A Pending CN114626499A (en) 2022-05-11 2022-05-11 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Country Status (1)

Country Link
CN (1) CN114626499A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220104034A1 (en) * 2020-09-30 2022-03-31 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method of association of user equipment in a cellular network according to a transferable association policy
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112949856A (en) * 2021-03-09 2021-06-11 华东师范大学 Multi-agent reinforcement learning method and system based on sparse attention mechanism
CN114169421A (en) * 2021-12-01 2022-03-11 天津大学 Multi-agent sparse rewarding environment cooperation exploration method based on internal motivation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENHAO LI ET AL.: "SparseMAAC: Sparse Attention for Multi-agent Reinforcement Learning", Database Systems for Advanced Applications *
李文浩: "Research on decentralized multi-agent reinforcement learning algorithms", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114866494A (en) * 2022-07-05 2022-08-05 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
CN114866494B (en) * 2022-07-05 2022-09-20 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
US11979295B2 (en) 2022-07-05 2024-05-07 Zhejiang Lab Reinforcement learning agent training method, modal bandwidth resource scheduling method and apparatus
CN114942653A (en) * 2022-07-26 2022-08-26 北京邮电大学 Method and device for determining unmanned cluster flight strategy and electronic equipment
CN115081752A (en) * 2022-08-11 2022-09-20 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method
CN115081752B (en) * 2022-08-11 2022-11-22 浙江君同智能科技有限责任公司 Black and gray production crowdsourcing flow prediction device and method

Similar Documents

Publication Publication Date Title
CN114626499A (en) Embedded multi-agent reinforcement learning method using sparse attention to assist decision making
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN109063823B (en) Batch A3C reinforcement learning method for exploring 3D maze by intelligent agent
CN107886510A (en) A kind of prostate MRI dividing methods based on three-dimensional full convolutional neural networks
CN112784968A (en) Hybrid pipeline parallel method for accelerating distributed deep neural network training
CN107818367A (en) Processing system and processing method for neutral net
CN109376852A (en) Arithmetic unit and operation method
CN112215350A (en) Smart agent control method and device based on reinforcement learning
CN116454926B (en) Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network
CN115018017B (en) Multi-agent credit allocation method, system and equipment based on ensemble learning
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN116757497B (en) Multi-mode military intelligent auxiliary combat decision-making method based on graph-like perception transducer
CN110188880A (en) A kind of quantization method and device of deep neural network
CN108320018A (en) A kind of device and method of artificial neural network operation
CN116629461B (en) Distributed optimization method, system, equipment and storage medium for active power distribution network
CN114757362A (en) Multi-agent system communication method based on edge enhancement and related device
CN113141012A (en) Power grid power flow regulation and control decision reasoning method based on deep deterministic strategy gradient network
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN114139778A (en) Wind turbine generator power prediction modeling method and device
CN114662798B (en) Scheduling method and device based on power grid economic operation domain and electronic equipment
CN108960420A (en) Processing method and accelerator
CN116070504A (en) Digital twin simulation system of efficient refrigeration machine room
Huang et al. Multi-agent cooperative strategy learning method based on transfer Learning
Li Deep reinforcement learning on wind power optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220614