CN116306966A - Multi-agent reinforcement learning cooperative method based on dynamic graph communication - Google Patents

Multi-agent reinforcement learning cooperative method based on dynamic graph communication Download PDF

Info

Publication number
CN116306966A
Authority
CN
China
Prior art keywords
communication
agent
network
intelligent
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310114762.3A
Other languages
Chinese (zh)
Inventor
李奇峰 (Li Qifeng)
葛宏伟 (Ge Hongwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202310114762.3A priority Critical patent/CN116306966A/en
Publication of CN116306966A publication Critical patent/CN116306966A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention belongs to the field of artificial intelligence and multi-agent cooperation, and relates to a multi-agent reinforcement learning cooperation method based on dynamic graph communication. Existing methods suffer from high communication cost, difficulty in meeting practical application requirements, and difficulty in learning advanced cooperation strategies; the invention therefore aims to achieve effective communication under the more realistic condition of limited communication, thereby promoting cooperation among agents and enabling advanced cooperation strategies to be learned. The method comprises the following steps: establishing a dynamic communication graph; adaptively generating communication weights; real-time communication among agents; action value estimation by each agent; credit allocation among agents by the super network; parameter updating with a temporal-difference loss; and application in a verification environment. The invention enables the model to communicate effectively and adaptively with small communication overhead, significantly improves the cooperation performance among agents, has strong scalability, and can be widely applied in the field of multi-agent cooperation.

Description

Multi-agent reinforcement learning cooperative method based on dynamic graph communication
Technical Field
The invention belongs to the field of artificial intelligence and multi-agent cooperation, and particularly relates to a multi-agent reinforcement learning cooperation method based on dynamic graph communication.
Background
Multi-agent collaboration mainly takes place in an interactive environment, where an agent system comprising multiple agents continually interacts with the environment to maximize the return obtained by the system; each agent makes independent policy decisions, and together the agents cooperatively and autonomously accomplish team goals. Multi-agent cooperation technology plays a vital role in fields such as smart cities, intelligent transportation, vehicle-road cooperation, and unmanned aerial vehicle control, and can be used for tasks such as communication coordination among multiple independent terminals, optimization of resource allocation, and swarm path planning.
In recent years, multi-agent cooperation methods have made great progress, but as the number of agents grows, the complexity of searching the joint cooperation strategy space increases exponentially, and the development of related algorithms is greatly limited by the non-stationarity caused by agents' independent decisions combined with the complex coupling relationships among the agents. Multi-agent reinforcement learning algorithms have therefore received increasing attention as an effective way to adaptively promote agent cooperation: they can learn directly by trial and error from the interaction data between agents and the environment during the training phase, have strong scalability, and show great development prospects.
At present, the main approaches in multi-agent cooperation research generally fall into three classes. (1) Each independently deciding agent locally models the strategies of the other agents and makes individual decisions based on local interaction information and the modeled strategies. (2) The team's global reward is decomposed during a centralized training phase, using a super network to perform reasonable credit allocation among agents, so that cooperation among agents is promoted implicitly within a reinforcement learning framework. (3) Effective communication is enabled between agents, and each agent makes decisions based on local data and communication messages to achieve collaboration. The first class of methods reduces the non-stationarity introduced by the other agents' changing strategies through active modeling, but as the number of agents increases the modeling difficulty also grows exponentially, and complex cooperative tasks cannot be handled. The second class of algorithms guides agents to cooperate by relating the team reward value directly to the task; by reasonably decomposing the team reward value through a super network, the joint action strategy of the multi-agent system can converge to a cooperation strategy that satisfies the monotonicity constraint. The third class of methods generates communication messages either by manual specification or through a purpose-designed network, and promotes agents to cooperatively complete team goals by transmitting effective messages. In practical applications, the second and third classes have higher application value in large-scale multi-agent cooperation because of their suitable learning cost and stronger generalization.
Popular multi-agent reinforcement learning cooperation methods in recent years mainly adopt centralized training with decentralized execution to train and deploy agent decision models. During training, credit allocation among agents is achieved by decomposing the reward signal obtained when the joint action formed by all agents' decisions interacts with the environment, and continual trial and error in the environment drives each individual policy network to converge to an effective joint cooperation strategy. The decomposition of the reward information relies on a super network that has access to the global information of the agent system during the training phase and that should be able to characterize the complete policy space. In the execution phase, the centralized super network is removed, and each agent selects actions relying only on its policy network. Rashid et al. proposed a multi-agent value decomposition framework that integrates the individual Q functions of the agents through a nonlinear super network with non-negative weights, achieving credit allocation during the backward propagation of the reward signal (Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning [C]// Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4295-4304, 2018). The super network can access the true state of the agent system during centralized training, and a nonlinear super network can characterize the monotonicity constraint between individual strategies and the joint strategy more faithfully, so that more effective cooperation strategies are learned; the design of nonlinear super networks is thus one of the research hotspots in multi-agent reinforcement learning. Current research facilitates learning of effective cooperation strategies by limiting the distance between the behavior policy and the target policy or by layering the super networks. Wang et al. used a hierarchical reinforcement learning structure to distribute rewards across two super networks, thereby reducing the difficulty for a single network to characterize the complete policy space (T. Wang, T. Gupta, B. Peng, A. Mahajan, S. Whiteson, and C. Zhang. 2021. RODE: Learning Roles to Decompose Multi-Agent Tasks [C]// In Proceedings of the International Conference on Learning Representations. OpenReview). Communication learning is another important research direction in multi-agent reinforcement learning. Yuan et al. proposed a communication mechanism based on variational inference, in which an agent locally predicts the Q functions of its teammate agents as communication messages while taking the messages obtained from other agents as a bias on its own Q function to achieve stable action value estimation, and reduces communication cost by introducing a communication regularization term (Lei Yuan, Jianhao Wang, Fuxiang Zhang, Chenghe Wang, Zongzhang Zhang, Yang Yu, and Chongjie Zhang. 2022. Multi-Agent Incentive Communication via Decentralized Teammate Modeling [C]// In Proceedings of the AAAI Conference on Artificial Intelligence).
For some multi-agent cooperative tasks that require advanced cooperative behavior, team goals are difficult to complete when relying only on implicit cooperation guidance, so complex cooperation strategies are hard to learn from the credit allocation of the super network alone. Meanwhile, in most realistic multi-agent cooperation scenarios, various communication restrictions are widespread, and existing algorithms often incur excessive communication overhead, so they cannot be applied effectively.
Disclosure of Invention
In view of these problems, the invention provides a multi-agent reinforcement learning cooperation method based on dynamic graph communication, which aims to learn advanced cooperation strategies by introducing a communication model into a credit-allocation-based algorithm; at the same time, the communication range and the degree of communication are controlled through a dynamic communication graph, so that effective communication is achieved under the more realistic condition of limited communication and cooperation among agents is promoted.
The method can work under the widely present condition of limited communication: agents communicate adaptively according to the limited communication domains required in practical applications, a super network performs credit allocation among the agents in the training phase, and in the execution phase the agents make cooperative decisions based on local information and communication messages, which effectively reduces the non-stationarity of the joint decision process and improves the cooperation performance of the multi-agent system.
The technical scheme of the invention is as follows:
a multi-agent reinforcement learning cooperative method based on dynamic graph communication comprises the following steps:
step 1: and extracting the communicable intelligent agents in the intelligent agent communication domain in real time according to the communication limiting conditions of the environment and the intelligent agent system, and establishing a communication diagram.
Step 2: according to the communication diagram of the step 1, the observation information of the local intelligent agents is encoded, and the weight of the communication diagram is generated based on the observation information and a corresponding weight generator so as to control the degree of communication among the intelligent agents.
Step 3: and (3) carrying out communication of observation information codes among the agents based on the weight of the communication diagram in the step (2) and the communication diagram in the step (1).
Step 4: each agent utilizes an action value estimation network to perform individual action value estimation based on the local interaction data and communication messages and the historical information.
Step 5: and (3) the super network gathers all the action value estimates generated in the step (4) and completes the joint action value estimation of the intelligent agent system based on the global information.
Step 6: and updating parameters of the super network according to the rewarding value obtained by interaction of the combined action and the environment, and then reversely transmitting the reliability distribution value of rewarding to the action value estimation network of each agent and updating network parameters of the agent.
Step 7: repeating the steps 1 to 6 until each agent action value estimation network, the communication weight generator and the super network converge or reach the designated training step number, removing the super network, and applying the final communication weight generator and each agent action value estimation network to the interaction environment to carry out intelligent system decision.
Further, the specific process of step 1 is as follows: establish a communication graph $\mathcal{G} = (\mathcal{V}, \varepsilon, w)$ from the communication domains under the communication constraints of the interactive environment, where $\mathcal{G}$ denotes the communication graph, $\mathcal{V}$ denotes the set of agents, $w$ denotes the weights of the edges of the communication graph (initialized to 0), and $\varepsilon$ is the set of edges of the communication graph. The communication graph is established as follows: for all agents $i \neq j$, the edge $(i, j) \in \varepsilon$ if agent $j \in d_i$, where $d_i$ is the limited communication domain of agent $i$.
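To make step 1 concrete, the following is a minimal sketch of how a dynamic communication graph could be built from limited communication domains; the 2-D positions, Euclidean communication radius, and adjacency-matrix representation are illustrative assumptions rather than details taken from the patent.

```python
import numpy as np

def build_communication_graph(positions: np.ndarray, comm_radius: float) -> np.ndarray:
    """Build the adjacency matrix of the dynamic communication graph.

    positions:   (n, 2) array of agent coordinates (assumed locally observable).
    comm_radius: radius of the limited communication domain d_i (assumed Euclidean).
    Returns an (n, n) boolean adjacency matrix with adj[i, j] = True
    iff agent j lies inside the communication domain of agent i (i != j).
    """
    n = positions.shape[0]
    # Pairwise Euclidean distances between all agents.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = dist <= comm_radius          # j in d_i
    np.fill_diagonal(adj, False)       # no self-edges, i != j
    return adj

# Example: 4 agents on a plane with a communication radius of 2.0.
if __name__ == "__main__":
    pos = np.array([[0.0, 0.0], [1.0, 0.5], [3.5, 0.0], [4.0, 1.0]])
    print(build_communication_graph(pos, comm_radius=2.0))
```

The graph is rebuilt at every time step, so the topology tracks the agents' current communication domains.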
Further, the specific process of step 2 is as follows: the local observation information $o_i$ of agent $i$ is encoded by the encoding network into the observation code $e_i$, and the weight of each edge of the communication graph is then generated by the weight generator.

If a learnable weight generator is used, a linear transformation $W$ first maps the observation codes into a high-dimensional space to enhance the expressive capacity of the network, and a single-layer nonlinear network then computes the communication coefficient $c_{ij}$ between every pair of communicable agents:

$$c_{ij} = a\big(We_i \oplus We_j\big) \qquad (1)$$

where $a(\cdot)$ denotes a single-layer nonlinear network, $\oplus$ denotes the concatenation operation, and $e_i$ and $e_j$ denote the observation codes of any two communicable agents $i$ and $j$. Finally, the weights over all communicable agents of each agent are softmax-normalized to ensure scalability:

$$w_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(c_{ij})\big)}{\sum_{k \in d_i} \exp\big(\mathrm{LeakyReLU}(c_{ik})\big)} \qquad (2)$$

where $w_{ij}$ denotes the communication weight between agent $i$ and agent $j$, $\mathrm{LeakyReLU}(\cdot)$ denotes the nonlinear activation function, and $\exp(\cdot)$ denotes the exponential function.

If a similarity-measure weight generator is used, the nonlinear network $a(\cdot)$ is replaced by an inner-product similarity measure:

$$c_{ij} = F(e_i)^{\top} F(e_j) \qquad (3)$$

where $F$ is a linear embedding operation that maps the observation codes into a high-dimensional space.
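The two weight generators of equations (1)-(3) resemble graph-attention-style scoring. A hedged PyTorch sketch is given below; the module names, hidden dimensions, and the handling of agents with no communicable neighbours are assumptions, and the similarity variant is assumed to use the same softmax normalization as equation (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableWeightGenerator(nn.Module):
    """Sketch of the learnable weight generator of equations (1)-(2)."""
    def __init__(self, enc_dim: int, hid_dim: int):
        super().__init__()
        self.W = nn.Linear(enc_dim, hid_dim, bias=False)   # linear transformation W
        self.a = nn.Linear(2 * hid_dim, 1, bias=False)     # single-layer network a(.)

    def forward(self, e: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # e: (n, enc_dim) observation codes; adj: (n, n) boolean communication graph.
        We = self.W(e)                                      # (n, hid)
        n = We.size(0)
        pairs = torch.cat([We.unsqueeze(1).expand(-1, n, -1),
                           We.unsqueeze(0).expand(n, -1, -1)], dim=-1)
        c = self.a(pairs).squeeze(-1)                       # c_ij, eq. (1)
        c = F.leaky_relu(c)                                 # LeakyReLU inside eq. (2)
        c = c.masked_fill(~adj, float("-inf"))              # keep only communicable agents
        w = torch.softmax(c, dim=-1)                        # w_ij, eq. (2)
        return torch.nan_to_num(w)                          # rows with no neighbours -> 0

class SimilarityWeightGenerator(nn.Module):
    """Sketch of the similarity-measure weight generator of equation (3)."""
    def __init__(self, enc_dim: int, hid_dim: int):
        super().__init__()
        self.F = nn.Linear(enc_dim, hid_dim, bias=False)    # linear embedding F of eq. (3)

    def forward(self, e: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        Fe = self.F(e)
        c = Fe @ Fe.t()                                     # inner-product similarity, eq. (3)
        c = c.masked_fill(~adj, float("-inf"))
        return torch.nan_to_num(torch.softmax(c, dim=-1))   # normalized as in eq. (2)
```

Both generators return a row-normalized weight matrix whose sparsity pattern follows the dynamic communication graph, so the degree of communication stays adaptive while the range stays limited.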
Further, the specific process of step 3 is as follows: generate each agent's communication message from the communication graph obtained in step 1 and the communication weights obtained in step 2:

$$m_i = \sum_{j \in d_i} w_{ij} e_j \qquad (4)$$

where $m_i$ denotes the communication message obtained by agent $i$ at the current time.
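Assuming, as equation (4) suggests, that each message is the weighted sum of the communicable neighbours' observation codes, the communication step reduces to a single matrix product; the sketch below makes that assumption explicit.

```python
import torch

def aggregate_messages(w: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Communication step of eq. (4): m_i = sum_j w_ij * e_j over communicable agents j.

    w: (n, n) communication weights (zero outside each agent's communication domain).
    e: (n, enc_dim) observation codes.
    Returns m: (n, enc_dim) messages received by each agent.
    """
    return w @ e
```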
Further, the specific process of step 4 is as follows: generate the information representation $h_i^t$ of the current time from the communication message obtained in step 3 together with the agent's local observation data and history data:

$$h_i^t = \mathrm{GRU}\big(e_i^t \oplus m_i^t,\; h_i^{t-1}\big) \qquad (5)$$

where $\mathrm{GRU}(\cdot)$ denotes a gated recurrent unit neural network, $e_i^t$ and $m_i^t$ denote the observation code and the communication message of agent $i$ at the current time $t$, and $h_i^{t-1}$ denotes the agent's history information. The action value estimation network then performs action value estimation based on this information representation:

$$Q_i(a) = Q_i(a \mid e_i, m_i, h_i; \theta) \qquad (6)$$

where $a$ denotes an optional action of agent $i$ and $\theta$ denotes the parameters of the agent's action value estimation network.
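A possible PyTorch sketch of the per-agent network of equations (5)-(6) follows; the GRU cell size, the concatenation of $e_i^t$ and $m_i^t$ as the recurrent input, and the linear Q head are assumptions consistent with the text, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class AgentQNetwork(nn.Module):
    """Sketch of the per-agent action value estimation network of equations (5)-(6)."""
    def __init__(self, enc_dim: int, n_actions: int, hid_dim: int = 64):
        super().__init__()
        self.gru = nn.GRUCell(2 * enc_dim, hid_dim)    # input: e_i^t concatenated with m_i^t
        self.q_head = nn.Linear(hid_dim, n_actions)    # Q_i(a | e_i, m_i, h_i; theta)

    def forward(self, e_t: torch.Tensor, m_t: torch.Tensor,
                h_prev: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # e_t, m_t: (n, enc_dim); h_prev: (n, hid_dim) history representation h_i^{t-1}.
        h_t = self.gru(torch.cat([e_t, m_t], dim=-1), h_prev)   # eq. (5)
        q = self.q_head(h_t)                                     # eq. (6), one value per action
        return q, h_t
```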
Further, the specific process of step 5 is as follows: perform joint action value estimation from the individual action value estimates obtained in step 4:

$$Q_{tot}(\boldsymbol{a}) = \mathrm{mixing}\big(s, Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n)\big) \qquad (7)$$

where $s$ denotes the global state of the agent system, $Q_{tot}$ denotes the value estimate of the joint action, $\boldsymbol{a}$ denotes the joint action of the agent system, and $\mathrm{mixing}(\cdot)$ denotes the super network.
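The super network of equation (7) is described only as mixing(·); the sketch below assumes a QMIX-style monotonic mixing network (state-conditioned hypernetworks with non-negative weights), in line with the credit-allocation literature cited in the background, and is not claimed to be the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingSuperNetwork(nn.Module):
    """Sketch of the super network of eq. (7): mixes individual Q_i into Q_tot
    with state-conditioned non-negative weights (monotonic credit allocation)."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, q_individual: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # q_individual: (batch, n_agents) chosen-action values; state: (batch, state_dim).
        b = q_individual.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(q_individual.view(b, 1, self.n_agents) @ w1 + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = hidden @ w2 + b2                 # (batch, 1, 1)
        return q_tot.view(b)                     # Q_tot per sample
```

During centralized training the chosen-action values from each agent are fed in together with the global state $s$; at execution time this module is removed, as described in step 7.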
Further, the specific process of step 6 is as follows: compute the temporal-difference loss $\mathcal{L}(\phi, \xi, \theta, \psi)$ from the deviation between the joint action value estimate obtained in step 5 and the actually obtained reward:

$$\mathcal{L}(\phi, \xi, \theta, \psi) = \mathbb{E}_b\Big[\big(r + \gamma \max_{a'} \hat{Q}_{tot}(s', a') - Q_{tot}(s, a)\big)^2\Big] \qquad (8)$$

where $\mathbb{E}_b[\cdot]$ denotes the expectation over the sampled batch $b$; $r$ denotes the obtained reward; $\phi$, $\xi$, $\theta$, and $\psi$ denote the parameters of the observation encoding network, the weight generator, the action value estimation network, and the super network, respectively; $s'$ denotes the state at the next time step; $a'$ denotes the joint action at the next time step; $\hat{Q}_{tot}$ denotes the target network of the joint action value estimation network; and $\gamma$ denotes the discount coefficient.
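Equation (8) corresponds to a standard temporal-difference loss; a minimal sketch follows, where the terminal mask and the batch layout are standard additions assumed here rather than stated in the text.

```python
import torch

def td_loss(q_tot: torch.Tensor, target_q_tot_next: torch.Tensor,
            reward: torch.Tensor, done: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Temporal-difference loss of eq. (8).

    q_tot:             (batch,) Q_tot(s, a) for the executed joint action.
    target_q_tot_next: (batch,) max_a' Q_tot_hat(s', a') from the target super network.
    reward:            (batch,) team reward r.
    done:              (batch,) 1.0 for terminal transitions, else 0.0 (assumed mask).
    """
    target = reward + gamma * (1.0 - done) * target_q_tot_next.detach()
    return torch.mean((target - q_tot) ** 2)
```

Back-propagating this loss through the super network, the action value estimation networks, the weight generator, and the observation encoder updates all learnable parameters jointly.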
The invention has the beneficial effects that:
Under the condition of limited communication that is widely present in practical applications, the invention establishes a real-time dynamic communication graph among the agents using the limited communication domains, and uses a centralized super network to perform credit allocation among the agents during reinforcement learning. As a result, the model communicates effectively and adaptively with small communication overhead, the cooperation performance among agents is significantly improved, and the method has strong scalability.
First, the communication graph structure at the current time is established in real time from the realistic communication domain restrictions of the interactive environment, which effectively reduces the communication overhead and ensures strong scalability.
Second, a learnable weight generator or a similarity-measure weight generator is used to generate the communication weights of the communication topology, realizing adaptive communication control among agents and ensuring that communication is effective in real time.
Finally, in the action value estimation stage, the communication messages and the observation information are encoded together into part of the history information, so that the important communication information in the agents' cooperation process is retained for accurate action value estimation, and the super network performs credit allocation among agents, ensuring effective cooperation among the agents.
Drawings
FIG. 1 is a flow chart of a multi-agent reinforcement learning collaborative method based on dynamic graph communication according to the present invention;
FIG. 2 is a flowchart of training steps according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a test procedure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a limited communication domain scenario in which the present invention operates in conjunction with a specific embodiment.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples, which include, but are not limited to, the following examples.
As shown in fig. 1, the invention provides a multi-agent reinforcement learning cooperative method based on dynamic graph communication, which comprises the following specific implementation processes:
1. as shown in fig. 4, the communication range of each agent in the interaction environment such as a game is limited, the communication range is consistent with the observable range, only information in a certain range can be observed, and the communication diagram needs to be established according to the observation domain of the agent in the game
Figure BDA0004078140730000081
Figure BDA0004078140730000082
Wherein->
Figure BDA0004078140730000083
A communication diagram is shown which shows a communication diagram,
Figure BDA0004078140730000084
representing the agent set, w represents the weight of each side of the communication graph and is initialized to 0. As shown in fig. 2, the mapping process is as follows: />
Figure BDA0004078140730000085
i≠j,/>
Figure BDA0004078140730000086
If the agent j epsilon d i Wherein d is i Is the limited communication domain of agent i. The local observation information o of the agent i is reused by the coding network shown on the left side of fig. 2 j The code is observation code e j . And then, according to the weight generator, the weight of each side of the communication graph is generated by taking the observation code of each agent as input, and if a learnable weight generator is used, firstly, a linear transformation W is used, and the observation code is mapped to a high-dimensional space so as to enhance the network expression capability. And then calculating the communication coefficient between every two corresponding communicable intelligent agents by using a single-layer nonlinear network, wherein the calculation formula is as follows:
Figure BDA0004078140730000087
wherein a (-) represents a single layer of nonlinear network,
Figure BDA0004078140730000088
representing the join operation, e i And e j Respectively representing any communicable agent i and agent j, ">
Figure BDA0004078140730000089
Finally, the weights of all communicable agents of each agent are softmax normalized to ensure expansibility, and the calculation formula is as follows:
Figure BDA00040781407300000810
wherein w is ij Representing the communication weight between agent i and agent j, leakyReLU () represents a nonlinear activation function, exp (·) represents an exponential sign.
If a weight generator of similarity measure is used, the nonlinear network is replaced by the inner product similarity measure, and the calculation formula is as follows:
Figure BDA00040781407300000811
where F is a linear embedding operation, the observation code can be mapped to a high-dimensional space.
2. Inter-agent communication
As shown in fig. 2, the communication messages $m_1, m_2, \ldots, m_n$ of the agents are generated from the communication graph obtained in step 1 and the communication weights obtained in step 2; the calculation formula is:

$$m_i = \sum_{j \in d_i} w_{ij} e_j \qquad (12)$$

As further shown in fig. 2, each message is then sent to the corresponding agent unit.
3. Agent performs action value estimation in interactive environment
As shown in fig. 2, the information representation $h_i^t$ of the current time is generated by the GRU network in fig. 2 from the communication message obtained in step 3 together with the agent's local observation data and history data:

$$h_i^t = \mathrm{GRU}\big(e_i^t \oplus m_i^t,\; h_i^{t-1}\big) \qquad (13)$$

where $\mathrm{GRU}(\cdot)$ denotes a gated recurrent unit neural network, $e_i^t$ and $m_i^t$ denote the observation code and the communication message of agent $i$ at the current time $t$, and $h_i^{t-1}$ denotes the agent's history information. On this basis, the action value estimation network shown in fig. 2 performs action value estimation:

$$Q_i(a) = Q_i(a \mid e_i, m_i, h_i; \theta) \qquad (14)$$

where $a$ denotes an optional action of agent $i$ and $\theta$ denotes the parameters of the agent's action value estimation network.

The joint action value is then estimated from the obtained individual action value estimates:

$$Q_{tot}(\boldsymbol{a}) = \mathrm{mixing}\big(s, Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n)\big) \qquad (15)$$

where $s$ denotes the global state of the agent system, $Q_{tot}$ denotes the value estimate of the joint action, and $\boldsymbol{a}$ denotes the joint action of the agent system.
4. Backward update of the learnable network parameters
As shown in FIG. 2, the temporal-difference loss $\mathcal{L}(\phi, \xi, \theta, \psi)$ is computed from the deviation between the joint action value estimate obtained in step 5 and the reward actually obtained from the game system:

$$\mathcal{L}(\phi, \xi, \theta, \psi) = \mathbb{E}_b\Big[\big(r + \gamma \max_{a'} \hat{Q}_{tot}(s', a') - Q_{tot}(s, a)\big)^2\Big] \qquad (16)$$

where $\mathbb{E}_b[\cdot]$ denotes the expectation over the sampled batch $b$; $r$ denotes the obtained reward; $\phi$, $\xi$, $\theta$, and $\psi$ denote the parameters of the observation encoding network, the weight generator, the action value estimation network, and the super network, respectively; $s'$ denotes the state at the next time step; $a'$ denotes the joint action at the next time step; $\hat{Q}_{tot}$ denotes the target network of the joint action value estimation network; and $\gamma$ denotes the discount coefficient.
5. Multi-agent collaborative tasks in a testing environment
As shown in fig. 3, once the model trained by the above steps has converged or the specified number of training steps has been reached, the super network is removed, and the final weight generator and the n individual agent action value estimation networks are applied to the interaction environment to make agent decisions. In the execution phase, the agents communicate adaptively according to the dynamic communication graph network and carry out the cooperative task.
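Putting the pieces together, the execution phase described above might look like the rollout below. It reuses the build_communication_graph, weight-generator, and agent-network sketches from the earlier sections, and the environment API (reset/step returning local observations, agent positions, a scalar team reward, and a done flag) is an assumption for illustration only.

```python
import torch

@torch.no_grad()
def execute_episode(env, encoder, weight_gen, agent_net, comm_radius: float, hid_dim: int = 64):
    """Execution-phase rollout: adaptive communication over the dynamic graph and
    decentralized greedy action selection, with the super network removed (sketch)."""
    obs, positions = env.reset()                    # obs: (n, obs_dim), positions: (n, 2), assumed API
    n = obs.shape[0]
    h = torch.zeros(n, hid_dim)                     # initial history h_i^0
    done, total_reward = False, 0.0
    while not done:
        adj = torch.as_tensor(build_communication_graph(positions, comm_radius))
        e = encoder(torch.as_tensor(obs, dtype=torch.float32))   # observation codes e_i
        w = weight_gen(e, adj)                      # adaptive communication weights w_ij
        m = w @ e                                   # messages m_i, eq. (12)
        q, h = agent_net(e, m, h)                   # per-agent action values and new history
        actions = q.argmax(dim=-1)                  # decentralized greedy decisions
        (obs, positions), reward, done = env.step(actions.tolist())   # assumed API
        total_reward += reward
    return total_reward
```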

Claims (1)

1. A multi-agent reinforcement learning cooperative method based on dynamic graph communication is characterized by comprising the following steps:
step 1: extract the communicable agents within each agent's communication domain in real time according to the communication constraints of the environment and the agent system, and establish a communication graph; specifically:
establish a communication graph $\mathcal{G} = (\mathcal{V}, \varepsilon, w)$ from the communication domains under the communication constraints of the interactive environment, where $\mathcal{G}$ denotes the communication graph, $\mathcal{V}$ denotes the agent set, $w$ denotes the weights of the edges of the communication graph and is initialized to 0, and $\varepsilon$ is the edge set of the communication graph; the communication graph is established as follows: for all agents $i \neq j$, the edge $(i, j) \in \varepsilon$ if agent $j \in d_i$, where $d_i$ is the limited communication domain of agent $i$;
step 2: according to the communication graph of step 1, encode the local observation information of each agent, and generate the weights of the communication graph from the observation codes and the corresponding weight generator so as to control the degree of communication among agents; specifically:
the local observation information $o_i$ of agent $i$ is encoded by the encoding network into the observation code $e_i$, and the weight of each edge of the communication graph is generated by the weight generator;
if a learnable weight generator is used, a linear transformation $W$ first maps the observation codes into a high-dimensional space to enhance the expressive capacity of the network, and a single-layer nonlinear network then computes the communication coefficient $c_{ij}$ between every pair of communicable agents:

$$c_{ij} = a\big(We_i \oplus We_j\big) \qquad (1)$$

where $a(\cdot)$ denotes a single-layer nonlinear network, $\oplus$ denotes the concatenation operation, and $e_i$ and $e_j$ denote the observation codes of any two communicable agents $i$ and $j$; finally, the weights over all communicable agents of each agent are softmax-normalized to ensure scalability:

$$w_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(c_{ij})\big)}{\sum_{k \in d_i} \exp\big(\mathrm{LeakyReLU}(c_{ik})\big)} \qquad (2)$$

where $w_{ij}$ denotes the communication weight between agent $i$ and agent $j$, $\mathrm{LeakyReLU}(\cdot)$ denotes the nonlinear activation function, and $\exp(\cdot)$ denotes the exponential function;
if a similarity-measure weight generator is used, the nonlinear network $a(\cdot)$ is replaced by an inner-product similarity measure:

$$c_{ij} = F(e_i)^{\top} F(e_j) \qquad (3)$$

where $F$ is a linear embedding operation that maps the observation codes into a high-dimensional space;
step 3: based on the weights of the communication graph from step 2 and the communication graph from step 1, communicate the observation codes among the agents; specifically:
generate each agent's communication message:

$$m_i = \sum_{j \in d_i} w_{ij} e_j \qquad (4)$$

where $m_i$ denotes the communication message obtained by agent $i$ at the current time;
step 4: each agent uses its action value estimation network to complete individual action value estimation based on the local interaction data, the communication messages, and the history information; specifically:
generate the information representation $h_i^t$ of the current time from the communication message obtained in step 3 together with the agent's local observation data and history data:

$$h_i^t = \mathrm{GRU}\big(e_i^t \oplus m_i^t,\; h_i^{t-1}\big) \qquad (5)$$

where $\mathrm{GRU}(\cdot)$ denotes a gated recurrent unit neural network, $e_i^t$ and $m_i^t$ denote the observation code and the communication message of agent $i$ at the current time $t$, and $h_i^{t-1}$ denotes the agent's history information; the action value estimation network performs action value estimation based on this information representation:

$$Q_i(a) = Q_i(a \mid e_i, m_i, h_i; \theta) \qquad (6)$$

where $a$ denotes an optional action of agent $i$ and $\theta$ denotes the parameters of the agent's action value estimation network;
step 5: the super network gathers all the action value estimates generated in step 4 and completes the joint action value estimation of the agent system based on the global information; specifically:

$$Q_{tot}(\boldsymbol{a}) = \mathrm{mixing}\big(s, Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n)\big) \qquad (7)$$

where $s$ denotes the global state of the agent system, $Q_{tot}$ denotes the value estimate of the joint action, $\boldsymbol{a}$ denotes the joint action of the agent system, and $\mathrm{mixing}(\cdot)$ denotes the super network;
step 6: update the parameters of the super network according to the reward value obtained from the interaction of the joint action with the environment, then back-propagate the credit-assigned reward to each agent's action value estimation network and update its network parameters; specifically:
compute the temporal-difference loss $\mathcal{L}(\phi, \xi, \theta, \psi)$ from the deviation between the joint action value estimate obtained in step 5 and the actually obtained reward:

$$\mathcal{L}(\phi, \xi, \theta, \psi) = \mathbb{E}_b\Big[\big(r + \gamma \max_{a'} \hat{Q}_{tot}(s', a') - Q_{tot}(s, a)\big)^2\Big] \qquad (8)$$

where $\mathbb{E}_b[\cdot]$ denotes the expectation over the sampled batch $b$; $r$ denotes the obtained reward; $\phi$, $\xi$, $\theta$, and $\psi$ denote the parameters of the observation encoding network, the weight generator, the action value estimation network, and the super network, respectively; $s'$ denotes the state at the next time step; $a'$ denotes the joint action at the next time step; $\hat{Q}_{tot}$ denotes the target network of the joint action value estimation network; and $\gamma$ denotes the discount coefficient;
step 7: repeat steps 1 to 6 until each agent's action value estimation network, the communication weight generator, and the super network converge or the specified number of training steps is reached; then remove the super network and apply the final communication weight generator and each agent's action value estimation network to the interaction environment to make agent system decisions.
CN202310114762.3A 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication Pending CN116306966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310114762.3A CN116306966A (en) 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310114762.3A CN116306966A (en) 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication

Publications (1)

Publication Number Publication Date
CN116306966A true CN116306966A (en) 2023-06-23

Family

ID=86827867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310114762.3A Pending CN116306966A (en) 2023-02-15 2023-02-15 Multi-agent reinforcement learning cooperative method based on dynamic graph communication

Country Status (1)

Country Link
CN (1) CN116306966A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116902006A (en) * 2023-08-29 2023-10-20 酷哇科技有限公司 Reinforced learning multi-vehicle cooperative system and method based on strategy constraint communication


Similar Documents

Publication Publication Date Title
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
Jiang et al. Distributed resource scheduling for large-scale MEC systems: A multiagent ensemble deep reinforcement learning with imitation acceleration
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
CN113919485B (en) Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN113642233B (en) Group intelligent collaboration method for optimizing communication mechanism
Wang et al. Design of intelligent connected cruise control with vehicle-to-vehicle communication delays
CN113592162B (en) Multi-agent reinforcement learning-based multi-underwater unmanned vehicle collaborative search method
CN116306966A (en) Multi-agent reinforcement learning cooperative method based on dynamic graph communication
CN111178496A (en) Method for exchanging knowledge among agents under multi-agent reinforcement learning cooperative task scene
Li et al. Learning-based predictive control via real-time aggregate flexibility
CN113780576A (en) Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
Zhang et al. Multi-robot cooperative target encirclement through learning distributed transferable policy
Kim et al. Optimizing large-scale fleet management on a road network using multi-agent deep reinforcement learning with graph neural network
CN112564189B (en) Active and reactive coordination optimization control method
CN116976523A (en) Distributed economic dispatching method based on partially observable reinforcement learning
CN114707613B (en) Layered depth strategy gradient network-based power grid regulation and control method
CN115982610A (en) Communication reinforcement learning algorithm for promoting multi-agent cooperation
CN113435475B (en) Multi-agent communication cooperation method
CN115758871A (en) Power distribution network reconstruction energy-saving loss-reducing method and device based on security reinforcement learning
CN115187056A (en) Multi-agent cooperative resource allocation method considering fairness principle
Sierra-Garcia et al. Federated discrete reinforcement learning for automatic guided vehicle control
Ma et al. AGRCNet: communicate by attentional graph relations in multi-agent reinforcement learning for traffic signal control
Nai et al. A Vehicle Path Planning Algorithm Based on Mixed Policy Gradient Actor‐Critic Model with Random Escape Term and Filter Optimization
Habibi et al. Offering a Demand‐Based Charging Method Using the GBO Algorithm and Fuzzy Logic in the WRSN for Wireless Power Transfer by UAV
CN113592079B (en) Collaborative multi-agent communication method oriented to large-scale task space

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination