CN113313267A - Multi-agent reinforcement learning method based on value decomposition and attention mechanism - Google Patents

Multi-agent reinforcement learning method based on value decomposition and attention mechanism Download PDF

Info

Publication number
CN113313267A
Authority
CN
China
Prior art keywords
value
agent
network
function
tot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110717897.XA
Other languages
Chinese (zh)
Other versions
CN113313267B (en)
Inventor
吴健
宋广华
姜晓红
范晟
叶振辉
陈弈宁
应豪超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110717897.XA priority Critical patent/CN113313267B/en
Publication of CN113313267A publication Critical patent/CN113313267A/en
Application granted granted Critical
Publication of CN113313267B publication Critical patent/CN113313267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-agent reinforcement learning method based on value decomposition and an attention mechanism, which comprises the following steps: (1) constructing a learning environment, wherein the learning environment comprises a plurality of agents, and each agent comprises a Critic network and an Actor network; (2) initializing the Critic network and Actor network parameters; (3) feeding back the action of each agent to the game environment, and storing the current observed value, the action, the reward and the observed value at the next moment into an experience pool; (4) calculating the local Q value functions and the global Q value function, and updating the parameters of the Critic network; (5) calculating the advantage function generated by each agent taking its current action under its current observed value, and updating the parameters of the Actor network; (6) after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment with the trained agents. The method achieves better performance and faster convergence speed in complex heterogeneous partially observable scenarios.

Description

Multi-agent reinforcement learning method based on value decomposition and attention mechanism
Technical Field
The invention belongs to the field of reinforcement learning in machine learning, and particularly relates to a multi-agent reinforcement learning method based on value decomposition and attention mechanism.
Background
The invention relates to multi-agent reinforcement learning, for which the current mainstream algorithms include the MAAC algorithm, the VDN algorithm and the QMIX algorithm.
MAAC is an abbreviation of Multi-Actor-Attention-Critic, a multi-agent reinforcement learning algorithm based on the Actor-Critic framework. The algorithm introduces an attention mechanism to dynamically extract the logical topological relations among multiple agents and to selectively aggregate the information of other agents. On the one hand, it can focus on the information of agents in a cooperative relationship; on the other hand, it maps the information of all agents to a fixed dimension, which avoids the dimensional explosion caused by an increasing number of agents and improves the scalability of the algorithm.
The VDN (Value-Decomposition Network) algorithm is a centralized-training reinforcement learning algorithm oriented to complex partially observable scenes. The algorithm trains a joint Q value Q_tot in a centralized manner; however, due to the partial observability of a multi-agent scenario, each agent cannot acquire the global information and the actions of the other agents. The VDN algorithm establishes the relationship between the global Q value and the local Q values and proposes the concept of value decomposition, the core idea of which is to approximate the global Q value with the local Q values.
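For reference, the additive value decomposition used by VDN is commonly written as

Q_tot(τ, a) ≈ Σ_{i=1}^{N} Q_i(τ_i, a_i)

where τ_i is the local action-observation history of agent i and a_i is its action; this is the standard VDN formulation, stated here for clarity rather than quoted from this patent.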
For example, Chinese patent publication No. CN111632387A discloses a command and control system based on StarCraft II, that is, a game environment based on StarCraft II, which selects a customized multi-agent algorithm module including VDN, QMIX, COMA and the like according to the play mode during offline training.
The QMIX algorithm is a centralized-training reinforcement learning algorithm further proposed on the basis of VDN. Although the VDN algorithm clarifies the relationship between the local Q values and the global Q value, it only obtains Q_tot by a simple summation and does not make reasonable use of the global information s. The QMIX algorithm uses a neural network, the Mixing network, which takes the global information s as input and generates a non-negative weight coefficient for each agent. The local Q value of each agent is weighted and summed with these weight coefficients to calculate the global Q value. The algorithm generalizes the relationship between the local Q values and the global Q value to a larger monotonic constraint range, and the extracted logical topological relation is more complex and accurate.
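The monotonic constraint referred to above is the standard QMIX condition, stated here for clarity rather than quoted from this patent:

Q_tot = f(Q_1, ..., Q_N; s), with ∂Q_tot/∂Q_i ≥ 0 for every agent i

QMIX satisfies this condition by forcing the Mixing-network weights applied to each Q_i to be non-negative.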
For example, Chinese patent publication No. CN111814988A discloses a testing method for a reinforcement learning algorithm in a multi-agent cooperative environment, in which the agents are divided into two classes: the first class, whose spatial motion is relatively fixed, adopts an algorithm that obtains the maximum confidence return value (the UCB algorithm), and the second class, whose motion and state space are complex, adopts a global function that obtains the optimal joint action and state (the QMIX algorithm).
However, because QMIX does not use an attention mechanism to encode the global information, it is inefficient at integrating information across a multi-agent system, so this type of algorithm has difficulty converging to good results when the number of agents is large (more than 100).
Disclosure of Invention
The invention provides a multi-agent reinforcement learning method based on value decomposition and an attention mechanism, which is applied to complex partially observable scenes and offers better performance and faster convergence speed than current algorithms.
A multi-agent reinforcement learning method based on value decomposition and attention mechanism is applied to a partially observable game scene and comprises the following steps:
(1) constructing a learning environment, wherein the learning environment comprises a plurality of agents, and each agent comprises a Critic network and an Actor network;
(2) initializing Critic network and Actor network parameters; starting a game scene, acquiring an initial observation value of each agent, outputting probability distribution of each action according to the observation value by an Actor network of each agent, and selecting the action through sampling;
(3) feeding back the action of each intelligent agent to the game environment, obtaining the reward under the current combined action and the observed value at the next moment, and storing the current observed value o, the action a, the reward r and the observed value o' at the next moment as experience tuples into an experience pool;
(4) extracting a batch of experience tuples from the experience pool (a minimal sketch of such a pool is given after step (6)), and calculating the Q value function Q_i of each agent by combining the observation information of the other agents through an attention mechanism; the Q value function is used for evaluating the future return brought by each action of the agent;
taking the global information s as input, and generating a non-negative weight for the Q value of each agent by using a neural network with a non-negative activation function; carrying out a weighted summation of the Q value Q_i of each agent with the non-negative weights to generate the global Q value function Q_tot; calculating a loss function for Q_tot by the temporal-difference method and updating the parameters of the Critic network;
(5) obtaining the V value function by taking the expectation of the Q value function of each agent, then calculating the advantage function generated when each agent takes its current action under its current observed value, and updating the parameters of the Actor network through the policy gradient by using the advantage function;
(6) after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment by using the trained agents.
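The following minimal sketch illustrates the experience pool used in steps (3) and (4). It is given for illustration only; the class and variable names (ExperiencePool, capacity, and so on) are assumptions rather than identifiers defined in this patent.

```python
# Minimal experience-pool sketch for steps (3)-(4); all names are illustrative.
import random
from collections import deque

class ExperiencePool:
    """Stores (o, a, r, o') experience tuples and samples random batches."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, actions, rewards, next_obs):
        # obs / next_obs: per-agent observations; actions: joint action;
        # rewards: rewards obtained under the current joint action
        self.buffer.append((obs, actions, rewards, next_obs))

    def sample(self, batch_size):
        # random batch of experience tuples for training the Critic network
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

In use, a tuple is added after every environment step and a batch is sampled whenever the Critic network is updated.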
On the basis of the QMIX algorithm structure, the invention designs a network structure that efficiently integrates multi-agent system information by using an attention mechanism, which greatly improves the scalability of the algorithm.
Further, in the step (1), the Actor network comprises a GRU recurrent neural network for memorizing historical state and action information;
each agent takes its local history information τ_i as input; the history information is first encoded through a single fully connected layer and input into the GRU network; the GRU network splices the hidden-layer information h_t of the previous moment with the output of the previous layer to generate the hidden-layer information h_{t+1} of the next moment and the input of the next layer; the third fully connected layer then generates the action probability distribution over the agent's discrete action space.
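A minimal PyTorch sketch of such a GRU-based Actor is given below. The layer sizes, the activation choices and the variable names are assumptions made for illustration, not values specified in the patent.

```python
# Sketch of a GRU-based Actor: encode the local history, update the hidden
# state, and output a probability distribution over discrete actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUActor(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)      # encode local history/observation
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # carry hidden state h_t -> h_{t+1}
        self.fc2 = nn.Linear(hidden_dim, n_actions)    # logits over the discrete action space

    def forward(self, obs, h_prev):
        x = F.relu(self.fc1(obs))
        h_next = self.gru(x, h_prev)                   # combine encoding with previous hidden state
        probs = F.softmax(self.fc2(h_next), dim=-1)    # action probability distribution
        return probs, h_next

# Example: one agent, 30-dimensional observation, 8 discrete actions.
actor = GRUActor(obs_dim=30, n_actions=8)
h = torch.zeros(1, 64)
probs, h = actor(torch.randn(1, 30), h)
action = torch.distributions.Categorical(probs).sample()   # select the action by sampling
```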
Further, the Critic network comprises an attention network and a Mixing network;
in the attention network, the input is first encoded through a single fully connected layer to generate an embedding e_i; the attention coefficients between the agent and the other agents are calculated from e_i through the query matrix and the key matrix of the attention network, and the attention information of the agent is then calculated through the value matrix; the attention information of the multiple heads is spliced to generate the final attention information; the final attention information is passed through a multi-layer perceptron to generate the Q value function of each agent;
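A compact PyTorch sketch of such a multi-head attention Critic head is shown below. The dimensions, the number of heads, the scaling of the attention scores, and the concatenation of the agent's own embedding with the attention information are assumptions made for illustration, not details recited in the patent.

```python
# Sketch of a multi-head attention head that produces a Q value per agent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    def __init__(self, input_dim, embed_dim=64, n_heads=4, n_actions=8):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        self.embed = nn.Linear(input_dim, embed_dim)        # e_i from (tau_i, a_i)
        self.query = nn.Linear(embed_dim, embed_dim, bias=False)
        self.key = nn.Linear(embed_dim, embed_dim, bias=False)
        self.value = nn.Linear(embed_dim, embed_dim, bias=False)
        self.q_head = nn.Sequential(                        # multi-layer perceptron -> Q_i
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, n_actions))

    def forward(self, inputs):
        # inputs: (n_agents, input_dim), each agent's encoded local history and action
        e = F.relu(self.embed(inputs))                      # (N, E)
        n, _ = e.shape
        q = self.query(e).view(n, self.n_heads, self.head_dim)
        k = self.key(e).view(n, self.n_heads, self.head_dim)
        v = self.value(e).view(n, self.n_heads, self.head_dim)
        # attention coefficients between agent i and the other agents, per head
        scores = torch.einsum('ihd,jhd->hij', q, k) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)                    # (H, N, N)
        heads = torch.einsum('hij,jhd->ihd', attn, v)       # attention information per head
        attn_info = heads.reshape(n, -1)                    # splice (concatenate) the heads
        return self.q_head(torch.cat([e, attn_info], dim=-1))  # Q_i for every action

# Example: 5 agents, each represented by a 40-dimensional (tau_i, a_i) encoding.
q_values = AttentionCritic(input_dim=40)(torch.randn(5, 40))    # shape (5, 8)
```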
in Mixing network, global information s is used as input, and weight coefficients k are respectively generated through a neural networkiAnd a deviation b; at a weight coefficient kiThe neural network of (2) adds an activation function of an absolute value to ensure that the weight is not negative; finally, the Q value function and the weight coefficient k of each agent are calculatediCarrying out weighted summation on the sum deviation b to obtain a global Q value function Qtot
For the Q value Q_i of each agent, the global information is encoded through a self-attention mechanism, and the encoded value is then input into a fully connected network to obtain it.
The global Q value function Q_tot is calculated as:
Q_tot(s) = Σ_i k_i · Q_i + b
where k_i represents the weight coefficient, i represents the agent index, and b represents the bias.
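As a purely illustrative numerical example with hypothetical values: for two agents with weights k_1 = 0.5 and k_2 = 1.2, local Q values Q_1 = 2 and Q_2 = 3, and bias b = 0.1, the formula gives

Q_tot(s) = 0.5 · 2 + 1.2 · 3 + 0.1 = 4.7.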
In step (4), the loss function of Q_tot calculated by the temporal-difference method is:
loss_Q_tot(s) = (Σ_i r_i + Q′_tot(s) − Q_tot(s))
where r_i represents the reward obtained by agent i, Q′_tot represents the target Q network, and Q_tot(s) represents the global Q value function.
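One way to turn this temporal-difference quantity into a trainable loss is sketched below (PyTorch). The squared error, the discount factor gamma, and the evaluation of the target network Q′_tot on the next state are common temporal-difference conventions assumed for this sketch; the formula above states only the difference itself.

```python
# Illustrative TD loss for Q_tot; the squared error, the discount factor and
# the use of the next state for the target network are assumptions.
import torch

def td_loss(q_tot, q_tot_target_next, rewards, gamma=0.99):
    # q_tot:             Q_tot(s) from the Critic, shape (batch, 1)
    # q_tot_target_next: Q'_tot from the target network on the next state, (batch, 1)
    # rewards:           per-agent rewards r_i, shape (batch, n_agents)
    td_target = rewards.sum(dim=1, keepdim=True) + gamma * q_tot_target_next
    return ((td_target.detach() - q_tot) ** 2).mean()

# Example with random tensors (batch of 32 samples, 5 agents):
q_tot = torch.randn(32, 1, requires_grad=True)   # stands in for the Critic output
loss = td_loss(q_tot, torch.randn(32, 1), torch.randn(32, 5))
loss.backward()                                  # gradients flow back into the Critic parameters
```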
In step (5), the advantage function is:
A(s) = Q_tot(s) − E[Q_tot(s)]
where E[Q_tot(s)] represents the mathematical expectation of the global Q value.
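A sketch of one way to compute this expectation, the advantage and the resulting policy-gradient loss is given below (PyTorch). Estimating E[Q_tot(s)] as the probability-weighted average of Q_tot over an agent's candidate actions is an assumption made for this sketch; the patent states the advantage only as the difference above.

```python
# Illustrative advantage and policy-gradient loss; the probability-weighted
# expectation over candidate actions is an assumption for this sketch.
import torch

def actor_loss(q_tot_taken, q_tot_all_actions, action_probs, log_prob_taken):
    # q_tot_taken:       Q_tot(s) for the action actually taken, shape (batch, 1)
    # q_tot_all_actions: Q_tot(s) evaluated for every candidate action, (batch, n_actions)
    # action_probs:      Actor output pi(a|o), (batch, n_actions)
    # log_prob_taken:    log pi(a_taken|o), (batch, 1)
    expected_q_tot = (action_probs * q_tot_all_actions).sum(dim=1, keepdim=True)  # E[Q_tot(s)]
    advantage = (q_tot_taken - expected_q_tot).detach()                           # A(s)
    return -(log_prob_taken * advantage).mean()       # policy-gradient loss for the Actor

# Example with random tensors (batch of 32, 8 discrete actions):
probs = torch.softmax(torch.randn(32, 8), dim=1)
log_p = torch.log(probs.gather(1, torch.randint(0, 8, (32, 1))))
loss = actor_loss(torch.randn(32, 1), torch.randn(32, 8), probs, log_p)
```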
Each agent has an independent Actor network, but in order to improve training efficiency, the Actor networks of the agents adopt parameter sharing when their parameters are updated.
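Parameter sharing can be realised simply by letting every agent run its forward pass through one and the same Actor module. Appending a one-hot agent index to the observation, as in the sketch below, is a common convention and an assumption here; the GRUActor class is the illustrative Actor sketch given earlier in this description.

```python
# Minimal parameter-sharing sketch: one Actor object serves all agents, so every
# agent's update back-propagates into the same parameters. Names are illustrative.
import torch
import torch.nn.functional as F

n_agents, obs_dim, hidden_dim = 5, 30, 64
shared_actor = GRUActor(obs_dim=obs_dim + n_agents, n_actions=8)   # +n_agents for a one-hot agent index
hidden = [torch.zeros(1, hidden_dim) for _ in range(n_agents)]

observations = [torch.randn(1, obs_dim) for _ in range(n_agents)]
for i, obs in enumerate(observations):
    agent_id = F.one_hot(torch.tensor([i]), num_classes=n_agents).float()
    probs, hidden[i] = shared_actor(torch.cat([obs, agent_id], dim=1), hidden[i])
```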
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the logic topological relation among the multiple intelligent agents is doubly extracted through attention mechanism and value decomposition, and the method has the advantages of better performance effect and higher convergence speed in a complex heterogeneous part observable scene.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram illustrating an Actor network structure in the method of the present invention;
FIG. 3 is a schematic diagram of a Critic network structure in the method of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
The invention provides a multi-agent reinforcement learning algorithm (named the M2AAC algorithm in this embodiment) based on value decomposition and an attention mechanism, which is applied to StarCraft II as a complex partially observable scene.
As shown in fig. 1, the M2AAC algorithm process adopted by the present invention is as follows:
s01, the system is provided with an interstellar dispute 2 game and a PySC2 library provided by DeepMind.
S02, initializing the Critic network and Actor network parameters in the M2AAC algorithm. A game scene is started through the API and the observed value of each agent at the start of the game is acquired. The Actor network of each agent outputs the probability distribution of each action according to its observed value, and the action is selected through sampling. The actions are fed back to the game environment, the reward under the current joint action and the observed value at the next moment are obtained, and the current observed value o, the action a, the reward r and the next-moment observed value o' are stored as an experience tuple in the experience pool for extraction and use during training.
As shown in fig. 2, the M2AAC algorithm of the present invention mainly targets complex partially observable scenes and needs to record historical observed values and action information. The M2AAC algorithm therefore introduces a GRU recurrent neural network into the Actor network for memorizing historical state and action information. Each agent takes its local history information τ_i as input; the history information is first encoded by a single fully connected layer and input into the GRU network. The GRU network splices the hidden-layer information h_t of the previous moment with the output of the previous layer to generate the hidden-layer information h_{t+1} of the next moment and the input of the next layer. The third layer generates the action probability distribution over the agent's discrete action space. Each agent has an independent Actor network, but in order to improve training efficiency, the invention adopts parameter sharing.
As shown in fig. 3, the M2AAC algorithm adds an attention network and a Mixing network to the Critic network. Each agent takes its local history information τ_i and local action a_i as input. In the attention network, the input is first encoded through a single fully connected layer to generate an embedding e_i. The attention coefficients between the agent and the other agents are calculated from e_i through the query matrix and the key matrix of the attention network, and the attention information of the agent is then calculated through the value matrix. Because a multi-head attention mechanism is adopted, the attention information of the multiple heads is spliced to generate the final attention information. Finally, the attention information is passed through a multi-layer perceptron to generate the Q value function of each agent.
In the Mixing network, the invention takes the global information s as input and generates a weight coefficient k_i and a bias b through neural networks. Intuitively, each agent contributes to the whole multi-agent system, so the weight coefficient must satisfy k_i > 0. Therefore, the invention adds an absolute-value activation function to the neural network that generates the weight coefficient k_i to ensure that the weight is non-negative. Finally, the Q value function of each agent is weighted and summed with the weight coefficients and the bias to obtain the global Q value function Q_tot.
S03, extracting a batch of experience tuples from the experience pool. First, each agent calculates its Q value function Q_i by combining the observation information of the other agents through the attention mechanism. Then, taking the global information s as input, a neural network with a non-negative activation function is used to generate a non-negative weight for the Q value of each agent. The Q value function of each agent is weighted and summed with the non-negative weights to generate the global Q value function Q_tot. A loss function is calculated for Q_tot by the temporal-difference method and the parameters of the Critic network are updated.
S04, obtaining the V value function by taking the expectation of the Q value function of each agent, and then calculating the advantage function generated when each agent takes its current action under its current observed value. The method uses the advantage function to update the parameters of the Actor network through the policy gradient.
S05, after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment by using the trained agents.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (8)

1. A multi-agent reinforcement learning method based on value decomposition and attention mechanism is characterized by being applied to a partially observable game scene and comprising the following steps:
(1) constructing a learning environment, wherein the learning environment comprises a plurality of agents, and each agent comprises a Critic network and an Actor network;
(2) initializing Critic network and Actor network parameters; starting a game scene, acquiring an initial observation value of each agent, outputting probability distribution of each action according to the observation value by an Actor network of each agent, and selecting the action through sampling;
(3) feeding back the action of each intelligent agent to the game environment, obtaining the reward under the current combined action and the observed value at the next moment, and storing the current observed value o, the action a, the reward r and the observed value o' at the next moment as experience tuples into an experience pool;
(4) extracting a batch of experience tuples from the experience pool, and calculating the Q value function Q_i of each agent by combining the observation information of the other agents through an attention mechanism;
taking the global information s as input, and generating a non-negative weight for the Q value of each agent by using a neural network with a non-negative activation function; carrying out a weighted summation of the Q value Q_i of each agent with the non-negative weights to generate the global Q value function Q_tot; calculating a loss function for Q_tot by the temporal-difference method and updating the parameters of the Critic network;
(5) obtaining the V value function by taking the expectation of the Q value function of each agent, then calculating the advantage function generated when each agent takes its current action under its current observed value, and updating the parameters of the Actor network through the policy gradient by using the advantage function;
(6) after the parameters of the Critic network and the Actor network are updated, executing actions in the game environment by using the trained agents.
2. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 1, wherein in step (1), the Actor network comprises a GRU recurrent neural network for memorizing historical state and action information;
each agent takes its local history information τ_i as input; the history information is first encoded through a single fully connected layer and input into the GRU network; the GRU network splices the hidden-layer information h_t of the previous moment with the output of the previous layer to generate the hidden-layer information h_{t+1} of the next moment and the input of the next layer; the third fully connected layer generates the action probability distribution over the agent's discrete action space.
3. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 1, wherein in step (1), the Critic network comprises an attention network and a Mixing network;
in the attention network, the input is first encoded through a single fully connected layer to generate an embedding e_i; the attention coefficients between the agent and the other agents are calculated from e_i through the query matrix and the key matrix of the attention network, and the attention information of the agent is then calculated through the value matrix; the attention information of the multiple heads is spliced to generate the final attention information; the final attention information is passed through a multi-layer perceptron to generate the Q value function of each agent;
in the Mixing network, the global information s is taken as input, and a weight coefficient k_i and a bias b are generated through neural networks; an absolute-value activation function is added to the neural network that generates the weight coefficient k_i to ensure that the weight is non-negative; finally, the Q value function of each agent is weighted and summed with the weight coefficient k_i and the bias b to obtain the global Q value function Q_tot.
4. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 3, wherein for the Q value Q_i of each agent, the global information is encoded through a self-attention mechanism, and the encoded value is then input into a fully connected network to obtain it.
5. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 3, wherein the global Q value function Q_tot is calculated as:
Q_tot(s) = Σ_i k_i · Q_i + b
where k_i represents the weight coefficient, i represents the agent index, and b represents the bias.
6. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 5, wherein in step (4), the loss function of Q_tot calculated by the temporal-difference method is:
loss_Q_tot(s) = (Σ_i r_i + Q′_tot(s) − Q_tot(s))
where r_i represents the reward obtained by agent i, Q′_tot represents the target Q network, and Q_tot(s) represents the global Q value function.
7. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 5, wherein in step (5), the advantage function is:
A(s) = Q_tot(s) − E[Q_tot(s)]
where E[Q_tot(s)] represents the mathematical expectation of the global Q value.
8. The multi-agent reinforcement learning method based on value decomposition and attention mechanism as claimed in claim 1, wherein in step (5), the Actor networks of the agents adopt parameter sharing when the parameters of the Actor networks are updated.
CN202110717897.XA 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism Active CN113313267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110717897.XA CN113313267B (en) 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110717897.XA CN113313267B (en) 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism

Publications (2)

Publication Number Publication Date
CN113313267A true CN113313267A (en) 2021-08-27
CN113313267B CN113313267B (en) 2023-12-08

Family

ID=77380579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110717897.XA Active CN113313267B (en) 2021-06-28 2021-06-28 Multi-agent reinforcement learning method based on value decomposition and attention mechanism

Country Status (1)

Country Link
CN (1) CN113313267B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN112101564A (en) * 2020-08-17 2020-12-18 清华大学 Multi-agent value function decomposition method and device based on attention mechanism
CN112232478A (en) * 2020-09-03 2021-01-15 天津(滨海)人工智能军民融合创新中心 Multi-agent reinforcement learning method and system based on layered attention mechanism
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112417760A (en) * 2020-11-20 2021-02-26 哈尔滨工程大学 Warship control method based on competitive hybrid network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANYU SU et al.: "Value-Decomposition Multi-Agent Actor-Critics", AAAI-21 *
YUANXIN ZHANG et al.: "AVD-Net: Attention Value Decomposition Network For Deep Multi-Agent Reinforcement Learning", IEEE *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792861A (en) * 2021-09-16 2021-12-14 中国科学技术大学 Multi-agent reinforcement learning method and system based on value distribution
CN113792861B (en) * 2021-09-16 2024-02-27 中国科学技术大学 Multi-agent reinforcement learning method and system based on value distribution
CN113902125A (en) * 2021-09-24 2022-01-07 浙江大学 Intra-group cooperation intelligent agent control method based on deep hierarchical reinforcement learning
CN114037048A (en) * 2021-10-15 2022-02-11 大连理工大学 Belief consistency multi-agent reinforcement learning method based on variational cycle network model
CN114037048B (en) * 2021-10-15 2024-05-28 大连理工大学 Belief-consistent multi-agent reinforcement learning method based on variational circulation network model
CN113919485A (en) * 2021-10-19 2022-01-11 西安交通大学 Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN113919485B (en) * 2021-10-19 2024-03-15 西安交通大学 Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN114130034A (en) * 2021-11-19 2022-03-04 天津大学 Multi-agent game AI (Artificial Intelligence) design method based on attention mechanism and reinforcement learning
CN114139637A (en) * 2021-12-03 2022-03-04 哈尔滨工业大学(深圳) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN114463997A (en) * 2022-02-14 2022-05-10 中国科学院电工研究所 Lantern-free intersection vehicle cooperative control method and system
CN114527666B (en) * 2022-03-09 2023-08-11 西北工业大学 CPS system reinforcement learning control method based on attention mechanism
CN114527666A (en) * 2022-03-09 2022-05-24 西北工业大学 CPS system reinforcement learning control method based on attention mechanism
CN114900619A (en) * 2022-05-06 2022-08-12 北京航空航天大学 Self-adaptive exposure driving camera shooting underwater image processing system
WO2023231961A1 (en) * 2022-06-02 2023-12-07 华为技术有限公司 Multi-agent reinforcement learning method and related device
CN115047907A (en) * 2022-06-10 2022-09-13 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115047907B (en) * 2022-06-10 2024-05-07 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115300910A (en) * 2022-07-15 2022-11-08 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN116090688A (en) * 2023-04-10 2023-05-09 中国人民解放军国防科技大学 Moving target traversal access sequence planning method based on improved pointer network
CN117852710A (en) * 2024-01-08 2024-04-09 山东大学 Collaborative optimization scheduling method and system for multi-park comprehensive energy system

Also Published As

Publication number Publication date
CN113313267B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
CN113313267A (en) Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN112116090B (en) Neural network structure searching method and device, computer equipment and storage medium
CN115081936B (en) Method and device for scheduling observation tasks of multiple remote sensing satellites under emergency condition
CN111401547B (en) HTM design method based on circulation learning unit for passenger flow analysis
CN112763967B (en) BiGRU-based intelligent electric meter metering module fault prediction and diagnosis method
CN113286275A (en) Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning
CN114860893A (en) Intelligent decision-making method and device based on multi-mode data fusion and reinforcement learning
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
CN111401557A (en) Agent decision making method, AI model training method, server and medium
CN108921935A (en) A kind of extraterrestrial target method for reconstructing based on acceleration gauss hybrid models
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN116306686B (en) Method for generating multi-emotion-guided co-emotion dialogue
CN115099606A (en) Training method and terminal for power grid dispatching model
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN113313209A (en) Multi-agent reinforcement learning training method with high sample efficiency
CN116933931A (en) Cloud computing double-flow feature interaction electric vehicle charging pile occupation prediction method
CN111767991B (en) Measurement and control resource scheduling method based on deep Q learning
CN114880527B (en) Multi-modal knowledge graph representation method based on multi-prediction task
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN115587615A (en) Internal reward generation method for sensing action loop decision
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN115204249A (en) Group intelligent meta-learning method based on competition mechanism
CN114139674A (en) Behavior cloning method, electronic device, storage medium, and program product
Sachdeva et al. Gapformer: Fast autoregressive transformers meet rnns for personalized adaptive cruise control

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Song Guanghua

Inventor after: Fan Cheng

Inventor after: Ye Zhenhui

Inventor after: Chen Yining

Inventor after: Ying Haochao

Inventor after: Wu Jian

Inventor after: Jiang Xiaohong

Inventor before: Wu Jian

Inventor before: Song Guanghua

Inventor before: Jiang Xiaohong

Inventor before: Fan Cheng

Inventor before: Ye Zhenhui

Inventor before: Chen Yining

Inventor before: Ying Haochao

GR01 Patent grant
GR01 Patent grant