CN114896899B - Multi-agent distributed decision method and system based on information interaction - Google Patents

Multi-agent distributed decision method and system based on information interaction

Info

Publication number
CN114896899B
CN114896899B (application CN202210829307.7A)
Authority
CN
China
Prior art keywords
network
value
agent
agents
information
Prior art date
Legal status
Active
Application number
CN202210829307.7A
Other languages
Chinese (zh)
Other versions
CN114896899A (en)
Inventor
杨若鹏
殷昌盛
杨远涛
鲁义威
韦文夏
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210829307.7A
Publication of CN114896899A
Application granted
Publication of CN114896899B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a multi-agent distributed decision method and system based on information interaction. The method comprises the following steps: inputting the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain the observation information after information sharing; and inputting the shared observation information into a neural-network-based decision model. The decision model comprises a policy network and a value network corresponding to each agent: the policy network outputs the corresponding agent's action information according to the shared observation information, and the value network outputs the corresponding value score according to that action information. The decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents from the per-agent value scores. The invention realizes information sharing and cooperation among multiple agents and improves decision accuracy and efficiency.

Description

Multi-agent distributed decision method and system based on information interaction
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a multi-agent distributed decision-making method and system based on information interaction.
Background
A practical decision problem can be abstracted and modeled as an incomplete-information multi-agent game, and traditional centralized multi-agent decision methods can address the search-efficiency problem of centralized decision-making. However, in large-scale joint collaboration the number and variety of entities are large, and a simple centralized decision method clearly cannot solve the cooperative-game problem of multiple heterogeneous entities.
A multi-agent system (MAS) is a system composed of a plurality of agents. By sharing the information, objectives, strategies, and actions acquired by its agents, a MAS can, through mutual cooperation, solve problems that a single agent cannot solve, or can solve only inefficiently, owing to limits on its individual ability, knowledge, or resources.
In a multi-agent system, the agents are not simply stacked together; they are coupled, interacting and correlated through tasks, the environment, or other factors. Multi-agent methods can effectively solve multi-entity cooperative decision problems, but for large-scale multi-agent cooperative decision-making, achieving efficient information cooperation and strategy cooperation among agents is the key to distributed command and decision-making.
Disclosure of Invention
In view of at least one defect or improvement need in the prior art, the present invention provides a multi-agent distributed decision method and system based on information interaction, which realize information sharing and cooperation among multiple agents and improve decision accuracy and efficiency.
To achieve the above object, according to a first aspect of the present invention, there is provided a multi-agent distributed decision method based on information interaction, comprising:
inputting the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
inputting the shared observation information of the plurality of agents into a neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
Further, the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism: the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
Further, the output of the hard attention mechanism is denoted W^h_{i,j} and is calculated as:

$$W^{h}_{i,j} = \operatorname{gum}\!\big( f\big( \operatorname{Bi\text{-}LSTM}(o_i, o_j) \big) \big)$$

where gum() denotes the gumbel-softmax function, f() denotes a fully-connected layer, Bi-LSTM denotes a bidirectional long short-term memory artificial neural network, o_i and o_j denote the pre-sharing observation information of agents i and j respectively, and W^h_{i,j} represents whether an interaction relationship exists between agents i and j.

The output of the soft attention mechanism is denoted W^s_{i,j} and is calculated as:

$$W^{s}_{i,j} = \frac{\exp\!\big( e_j^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,j} \big)}{\sum_{m \neq i} \exp\!\big( e_m^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,m} \big)}$$

where exp() denotes the exponential function with the natural constant e as base, e_i and e_j denote the feature vectors extracted from o_i and o_j by the long short-term memory artificial neural network, W_k and W_q are the weight matrices matched with e_j and e_i respectively, ⊤ denotes matrix transpose, and W^s_{i,j} represents the degree of information interaction between agents i and j.

The finally output shared situation information is denoted o'_i and is calculated as:

$$o'_i = \sum_{j=1,\, j \neq i}^{N} W^{s}_{i,j}\, e_j$$

where N is the total number of agents.
Further, expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents comprises:
denoting the global value function as Q_total(S, A) and the individual value function of the i-th agent as Q_i(s_i, a_i), which satisfy:

$$Q_{total}(S, A) = c(S) + \sum_{i=1}^{N} \lambda_i\, Q_i(s_i, a_i)$$

where c(S) is a constant, λ_i denotes the coefficient of the value function of the i-th agent, and λ_i is taken as the attention weight of the i-th channel of the multi-head attention network.
Further, the training of the decision model comprises:
calculating a loss function according to the estimated global value scores of the plurality of agents and the actual global value scores of the plurality of agents.
Further, the training of the decision model comprises:
calculating the gradient of the joint network formed by the value networks corresponding to the agents;
updating the parameters of the joint network formed by the value networks corresponding to the agents;
calculating the gradient of each agent's policy network separately; and
updating the parameters of each agent's policy network separately.
According to a second aspect of the present invention, there is also provided a multi-agent distributed decision system based on information interaction, comprising:
the information interaction module is used for inputting the observation information of each of a plurality of agents into the attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
the decision module is used for inputting the shared observation information of the plurality of agents into the neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
Further, the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism: the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) The method learns the information-interaction relationships among agents by a graph convolution approach and uses an attention mechanism so that each agent receives the observation information of surrounding agents in a more targeted way; this reduces the complexity of the neural network as far as possible while meeting the information-interaction requirements of cooperative decision-making, effectively improving the efficiency of information cooperation among agents. Meanwhile, the multi-head attention mechanism effectively solves the credit-assignment (trust distribution) problem among agents and realizes strategy cooperation among them.
(2) Meanwhile, a centralized-training, distributed-execution scheme is adopted: each policy network calculates its gradient and updates its parameters separately, while the value networks form a joint network that calculates gradients and updates parameters jointly, improving the situation-estimation accuracy and decision efficiency of large-scale multi-agent systems under local-observation conditions.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic flowchart of a multi-agent distributed decision method based on information interaction according to an embodiment of the present application;
FIG. 2 is a schematic diagram of information interaction and sharing according to an embodiment of the present application;
FIG. 3 is a schematic diagram of strategy collaboration based on value function decomposition according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a contribution-oriented value function decomposition network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a network for multi-agent distributed decision-making based on information interaction according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a multi-agent distributed decision-making model based on information interaction according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The terms "including", "having" and any variations thereof in the description and claims of this application and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
As shown in FIG. 1, a multi-agent distributed decision method based on information interaction according to an embodiment of the present invention comprises:
and S101, inputting the observation information of each of the plurality of agents into a graph convolution neural network model based on an attention mechanism, and obtaining the observation information of each of the plurality of agents after information sharing.
The perception and action capabilities of a single agent are limited; sharing information such as actions, states, and strategies among agents through interaction can effectively improve system efficiency and allow more complex tasks to be completed. A single agent's observation is inherently limited, and directly sharing all agents' observation information globally would achieve information sharing but would undoubtedly enlarge the decision state space and the parameters of the neural network, making training more difficult. Therefore, reducing information interaction among agents while still meeting the information requirements of the decision becomes the main goal in designing a multi-heterogeneous-entity collaborative decision architecture.
In the embodiment of the invention, the information-interaction relationships among the agents are regarded as a special graph: each agent is regarded as a graph node, and the relationship between agents is regarded as an edge between nodes. The observation information of the plurality of agents is input into the graph convolutional neural network model, the observation information is represented and learned by graph convolution, and each agent's shared observation information is obtained, realizing information interaction among agents.
Further, the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism: the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
As shown in FIG. 2, a graph convolutional neural network model based on the attention mechanism is established; the input of the network is each agent i's own observation information o_i. First, the hard attention mechanism determines which agents need to interact: if information interaction is needed, an interaction edge is established between the agents. Then, on this basis, the soft attention mechanism determines the degree of that interaction, i.e., the weight of the interaction edge expresses the degree of information interaction between the agents. The model finally outputs the shared situation information o'_i, which subsequently serves as the input for each agent's decision.
(1) Information interaction based on the hard attention mechanism: for any two agents i, j ∈ {1, 2, …, N}, i ≠ j, where N is the total number of agents, the output W^h_{i,j} ∈ {0, 1} represents whether an interaction relationship, i.e., an interaction edge, exists between agents i and j. A bidirectional long short-term memory artificial neural network (Bi-LSTM) is used for learning; meanwhile, to overcome the inability to back-propagate gradients through the sampling step, the Gumbel-softmax function is used here, so the output W^h_{i,j} can be expressed as:

$$W^{h}_{i,j} = \operatorname{gum}\!\big( f\big( \operatorname{Bi\text{-}LSTM}(o_i, o_j) \big) \big)$$

where gum() denotes the gumbel-softmax function, f() denotes a fully-connected layer, Bi-LSTM denotes the bidirectional long short-term memory artificial neural network, and o_i and o_j denote the pre-sharing observation information of agents i and j respectively.
(2) Information interaction based on the soft attention mechanism: this step determines the weights of the edges retained in the topology generated by the hard attention mechanism, i.e., the degree of information interaction. The most basic key-value attention is used: the soft attention weight W^s_{i,j} compares the embeddings e_i and e_j, and the match between the two embeddings is passed into a softmax function:

$$W^{s}_{i,j} = \frac{\exp\!\big( e_j^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,j} \big)}{\sum_{m \neq i} \exp\!\big( e_m^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,m} \big)}$$

where exp() denotes the exponential function with the natural constant e as base; e_i and e_j are the feature vectors extracted from o_i and o_j by a long short-term memory artificial neural network (LSTM); W_k and W_q are the weight matrices matched with e_j and e_i respectively; ⊤ denotes matrix transpose; W_k converts e_j into a key and W_q converts e_i into a query; W^h_{i,j} is the hard attention value; and the soft attention value W^s_{i,j} is the final weight of the edge.

The finally output shared situation information o'_i, which serves as the input for each agent's decision, is calculated as:

$$o'_i = \sum_{j=1,\, j \neq i}^{N} W^{s}_{i,j}\, e_j$$
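For concreteness, a minimal PyTorch sketch of this two-layer graph attention, written directly from the formulas above, is given below. The layer sizes, the single-step LSTM encoder, the use of gumbel_softmax with hard=True for the hard attention, and masking non-edges out of the softmax (instead of multiplying the score by W^h inside the exponential) are all illustrative assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGraphAttention(nn.Module):
    def __init__(self, obs_dim: int, hid: int = 64):
        super().__init__()
        self.cell = nn.LSTMCell(obs_dim, hid)            # extracts e_i from o_i
        self.bi_lstm = nn.LSTM(2 * hid, hid, batch_first=True, bidirectional=True)
        self.hard_fc = nn.Linear(2 * hid, 2)             # f(): edge / no-edge logits
        self.W_q = nn.Linear(hid, hid, bias=False)       # query transform for e_i
        self.W_k = nn.Linear(hid, hid, bias=False)       # key transform for e_j

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (N, obs_dim), one observation vector per agent
        N, hid = obs.size(0), self.W_q.in_features
        zeros = torch.zeros(N, hid)
        e, _ = self.cell(obs, (zeros, zeros))            # e: (N, hid), single LSTM step
        # hard attention: decide for each ordered pair (i, j) whether an edge exists
        pairs = torch.cat([e.unsqueeze(1).expand(N, N, hid),
                           e.unsqueeze(0).expand(N, N, hid)], dim=-1)
        out, _ = self.bi_lstm(pairs.reshape(N * N, 1, 2 * hid))
        logits = self.hard_fc(out.reshape(N * N, 2 * hid)).view(N, N, 2)
        W_h = F.gumbel_softmax(logits, hard=True)[..., 0]          # (N, N) in {0, 1}
        # soft attention: softmax over e_j^T W_k^T W_q e_i, restricted to hard edges
        scores = self.W_q(e) @ self.W_k(e).t()                     # (N, N)
        keep = W_h.bool() & ~torch.eye(N, dtype=torch.bool)
        W_s = torch.nan_to_num(                                    # isolated rows -> 0
            F.softmax(scores.masked_fill(~keep, float('-inf')), dim=-1))
        return W_s @ e                        # o'_i = sum_j W_s[i, j] * e_j
```

Given obs of shape (N, obs_dim), the module returns the shared features o'_i of shape (N, hid), which then feed each agent's decision.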
S102: inputting the shared observation information of the plurality of agents into the neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent.
In a joint decision problem, the decisions and actions of different agents interact, i.e., the policies of different agents influence one another, and a policy change by any single agent can make the global policy non-stationary. Since the development of the global situation and its outcome depend on the decisions and behaviors of all agents, each individual agent must take the other agents' actions into account when deciding and acting. This is reflected in the credit-assignment problem of multi-agent reinforcement learning, namely how to construct joint reward functions and joint policy models so as to find the Nash equilibrium of the policies.
Reinforcement learning is defined by an agent interacting with the environment and receiving feedback, learning with the goal of maximizing cumulative reward, and finally realizing a mapping from states to optimal actions. The interaction process between agent and environment can be regarded as trial and error, an important characteristic of reinforcement learning. In each interaction the environment returns feedback to the agent, which can be treated as a labeled sample; since the rewards fed back by the environment are usually delayed and sparse, delayed reward is another important characteristic of reinforcement learning. A reinforcement learning problem is typically modeled as a Markov decision process, i.e., a Markov quintuple ⟨S, A, P, R, γ⟩, where S and A denote the state space and the action space respectively, P is the state transition function, R is the reward function, and γ is the reward discount factor representing the decay of rewards due to future uncertainty. The cumulative reward R_t can thus be expressed as:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}$$

where γ^k denotes the k-th power of γ, r_t denotes the immediate reward at time t, and r_{t+k} denotes the immediate reward at time t + k.
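As a quick worked illustration of the formula above, the discounted return can be computed backwards over a reward sequence; truncating to a finite horizon is an assumption made only for this example.

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, computed backwards as R = r + gamma * R."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# e.g. rewards [1, 1] with gamma = 0.5 give 1 + 0.5 * 1 = 1.5
assert abs(discounted_return([1.0, 1.0], gamma=0.5) - 1.5) < 1e-9
```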
In the embodiment of the invention, multi-agent reinforcement learning is represented by a tuple ⟨N, S, A, O, T, R⟩, where N is the number of agents in the environment; S = [s_1, …, s_N] is the joint state of all agents, s_i denoting the state of the i-th agent; A = [a_1, …, a_N] is the joint action of all agents, a_i denoting the i-th agent's action; and O = [o_1, o_2, …, o_N] is the joint observation state of all agents, o_i denoting the observed state of the i-th agent. Given agent i's current state and the actions to be performed, T: S × a_1 × a_2 × … × a_N → P(S) is agent i's state transition function, defining the probability distribution of agent i's successor states; R_i: S × a_1 × a_2 × … × a_N → R is agent i's reward function, set according to the objective and depending on the global state and actions. Under incomplete-information conditions an agent's observation o_i contains only partial information of the global state s ∈ S; the initial state is determined by a distribution ρ: S → [0, 1]; and each agent i learns a policy π_i: o_i → P(a_i) that maps its own observations to a distribution over the action space.
The learning goal of each agent is to maximize its expected reward feedback, as shown below:

$$J_i(\pi_i) = \mathbb{E}_{a_t^i \sim \pi_i,\; O_t}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_{i,t} \right]$$

where J_i(π_i) denotes the reward expectation of the i-th agent's policy π_i; γ ∈ [0, 1] is the discount factor for future reward feedback and γ^t is its t-th power; r_{i,t} denotes the i-th agent's immediate reward at time t; a^i_t denotes the i-th agent's action at time t; O_t denotes the observation at time t; the expectation is taken with all agents following their own policies π_i; and T is the time window.
As shown in FIG. 3, the decision model comprises a policy network and a value network corresponding to each agent. Each agent i maintains its own policy network, denoted π_i, and its own value network, denoted Q_i. The policy network π_i outputs the corresponding agent i's action information a_i according to the shared observation information o'_i. The value network Q_i outputs a value score according to agent i's action information a_i, i.e., an evaluation of that action. The multi-head attention network then obtains the global value score of the plurality of agents according to the value scores output by each agent's value network, i.e., an overall evaluation of all agents' action information; this fuses the individual objectives with the global objective and realizes strategy cooperation among agents. The multi-head attention network can also assign different weights to different agents, realizing credit assignment.
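A minimal sketch of one agent's pair of networks might look as follows; the two-layer MLPs and the discrete (categorical) action space are illustrative assumptions, not an architecture prescribed by the patent.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi_i: shared observation o'_i -> distribution over the agent's actions."""
    def __init__(self, obs_dim: int, n_actions: int, hid: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, n_actions))

    def forward(self, shared_obs):
        return torch.distributions.Categorical(logits=self.net(shared_obs))

class ValueNet(nn.Module):
    """Q_i: (o'_i, a_i) -> scalar value score for the chosen action."""
    def __init__(self, obs_dim: int, n_actions: int, hid: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + n_actions, hid), nn.ReLU(),
                                 nn.Linear(hid, 1))

    def forward(self, shared_obs, action_onehot):
        return self.net(torch.cat([shared_obs, action_onehot], dim=-1)).squeeze(-1)
```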
Further, obtaining the global value score of the plurality of agents according to the value scores output by each agent's value network comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the per-agent value scores, wherein determining the attention weight of each channel comprises: expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network. That is, a contribution-oriented value function decomposition method is designed to solve the credit-assignment problem among agents.
(1) Contribution-oriented value function decomposition algorithm: the global value function Q_total(S, A) is regarded as a function of the joint state S = (s_1, …, s_N) and the joint action A = (a_1, …, a_N), and it is expanded as a Taylor decomposition of all individual agents' value functions Q_i(s_i, a_i), as shown below:

$$Q_{total}(S, A) = c(S) + \sum_{i=1}^{N} \lambda_i\, Q_i(s_i, a_i), \qquad \lambda_i = \sum_{h} \lambda_{i,h}$$

where c(S) is a constant and, for every order h, the term λ_{i,h} is a linear function of the h-th order partial derivatives ∂^h Q_total / ∂Q_{i_1} … ∂Q_{i_h}. The coefficient λ_i of the i-th agent's value function thus collects the coefficients of all orders obtained when the global value function Q_total(S, A) is decomposed with respect to the i-th agent's value function Q_i(s_i, a_i); the order-h terms decay rapidly, in an exponential fashion, with h. The corresponding network structure is shown in FIG. 4.
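The following hedged PyTorch sketch illustrates such a contribution-oriented mixing network: each attention head h produces per-agent weights λ_{i,h} from the global state, λ_i is their sum over heads, and Q_total is the λ-weighted sum of the individual Q_i plus a state-dependent term c(S). The head count, layer sizes, and the use of softmax to keep the weights non-negative are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadValueMixer(nn.Module):
    def __init__(self, state_dim: int, n_heads: int = 4, hid: int = 64):
        super().__init__()
        self.queries = nn.ModuleList(nn.Linear(state_dim, hid) for _ in range(n_heads))
        self.key = nn.Linear(1, hid, bias=False)          # embeds each scalar Q_i
        self.c = nn.Sequential(nn.Linear(state_dim, hid), nn.ReLU(),
                               nn.Linear(hid, 1))         # c(S)

    def forward(self, qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # qs: (B, N) per-agent value scores; state: (B, state_dim) global state
        k = self.key(qs.unsqueeze(-1))                    # (B, N, hid)
        lam = sum(F.softmax(torch.einsum('bh,bnh->bn', W(state), k), dim=-1)
                  for W in self.queries)                  # lambda_i = sum_h lambda_{i,h}
        return (lam * qs).sum(dim=-1) + self.c(state).squeeze(-1)   # Q_total
```

With qs of shape (B, N) and state of shape (B, state_dim), forward returns Q_total of shape (B,).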
(2) Solving the multi-agent cooperation strategy: to realize cooperative decision-making among multiple heterogeneous entities, a centralized-training, distributed-execution scheme is generally adopted; the detailed steps are shown in FIG. 5 and FIG. 6. Parameterized neural networks are used to fit the value networks and the policy networks. Each agent i maintains a policy network π_i and a value network Q_i; the Q value finally used to calculate the loss function is the global value function Q_total(S, A), obtained by mixing the outputs of all value networks Q_i through the multi-head attention network.
After situation-information sharing based on the graph attention network, each single agent makes decisions on its observed situation information according to its own policy π_i, outputs the corresponding action a_i, and stores the data in a sample experience pool. K samples are drawn from the experience pool by random sampling, each agent i's target-network value output Q'_i(s^i_t, a^i_t) at time t is obtained, and all the Q'_i are input into the multi-head attention network to obtain the global value Q'_total, which is combined with the real-time global reward R^j_t to obtain the target network value output:

$$y_j = R^{j}_{t} + \gamma \Big( Q'_{total}\big(S^{j}_{t+1}, A^{j}_{t+1}\big) + \alpha\, H\big(\pi(\cdot \mid S^{j}_{t+1})\big) \Big)$$

where y_j denotes the target network value output; R^j_t denotes the instant reward of the j-th sample at time t; Q'_total(S^j_{t+1}, A^j_{t+1}) is the target value network's global value estimate for executing action A^j_{t+1} in state S^j_{t+1} at time t+1 in the j-th sample; α is the weight coefficient of the policy entropy; H(π(·|S^j)) is the policy entropy value, used to increase policy diversity, the policy π outputting a probability distribution over actions; and K is the number of sampled samples.
Furthermore, the policy networks adopt distributed Actor networks, and the value networks adopt a centralized Critic network.
The training of the decision model comprises the steps of:
(1) Calculating the loss function according to the estimated global value scores of the plurality of agents and the actual global value scores of the plurality of agents. The MSE (mean squared error) between the Critic network output in the target network and the Critic network output in the online network is used as the loss function:

$$L(\theta) = \frac{1}{K} \sum_{j=1}^{K} \Big( y_j - Q_{total}\big(S^{j}_{t}, A^{j}_{t}\big) \Big)^{2}$$

where L(θ) denotes the neural network loss as a function of the parameters θ, y_j denotes the target value network's output value, and Q_total(S^j_t, A^j_t) denotes the output of the online joint value network.
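Combining the target value and the loss above, one Critic update step might be sketched as follows; the batch layout, the precomputed next-step policy entropy, and the helper modules from the earlier sketches are assumptions of this example, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, agents, mixer, target_agents, target_mixer,
                  optimizer, gamma=0.99, alpha=0.01):
    # assumed batch fields: obs/next_obs (B, N, obs_dim), act/next_act (B, N, act_dim),
    # state/next_state (B, state_dim), rew (B,), next_entropy (B,)
    with torch.no_grad():
        next_q = torch.stack([tq(batch['next_obs'][:, i], batch['next_act'][:, i])
                              for i, tq in enumerate(target_agents)], dim=1)
        y = batch['rew'] + gamma * (target_mixer(next_q, batch['next_state'])
                                    + alpha * batch['next_entropy'])   # target y_j
    q = torch.stack([qn(batch['obs'][:, i], batch['act'][:, i])
                     for i, qn in enumerate(agents)], dim=1)
    loss = F.mse_loss(mixer(q, batch['state']), y)                     # L(theta)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```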
(2) Calculating the gradient of the joint network formed by the value networks corresponding to the agents and updating its parameters; and calculating the gradient of the policy network corresponding to each agent separately and updating its parameters separately.
An optimizer is used to update the online Critic network, i.e., the online networks Q_i. The policy is estimated without bias using data randomly sampled from the replay memory buffer:

$$\nabla_{\theta^{\mu}_{i}} J \approx \frac{1}{K} \sum_{j=1}^{K} \nabla_{\theta^{\mu}_{i}}\, \mu_i\big(O^{j}_{t}\big)\; \nabla_{A_t} \Big( Q_{total}\big(S^{j}_{t}, A_t\big) + \alpha\, H\big(\pi(\cdot \mid S^{j}_{t})\big) \Big) \Big|_{A_t = \mu(O^{j}_{t})}$$

where ∇_{θ^μ_i} J denotes the gradient of the reward expectation with respect to the parameters of the i-th online Actor network; K denotes the number of samples; μ_i(O^j_t) denotes the action output by agent i at time t in the j-th sample under observation O^j_t; H(π(·|S^j_t)) denotes the policy entropy value; and ∇_{A_t}(·) denotes the gradient of the reward expectation with respect to the joint action A_t.
The parameters μ_i of the i-th online Actor network are updated according to the computed gradient, and the target network parameters are updated softly, as shown below:

$$\theta^{Q'}_{i} \leftarrow \tau\, \theta^{Q}_{i} + (1 - \tau)\, \theta^{Q'}_{i}, \qquad \theta^{\mu'}_{i} \leftarrow \tau\, \theta^{\mu}_{i} + (1 - \tau)\, \theta^{\mu'}_{i}$$

where θ^Q_i and θ^μ_i denote the neural network parameters of the online value network and the online policy network respectively, θ^{Q'}_i and θ^{μ'}_i denote the neural network parameters of the target value network and the target policy network, and τ denotes the soft-update weight coefficient.
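A minimal sketch of this soft update, assuming the online and target networks are ordinary PyTorch modules:

```python
import torch

def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    with torch.no_grad():
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```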
An embodiment of the invention further provides a multi-agent distributed decision system based on information interaction, comprising:
an information interaction module, configured to input the observation information of each of a plurality of agents into the attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
a decision module, configured to input the shared observation information of the plurality of agents into the neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
The implementation principle of the system is the same as that of the method, and the details are not repeated here.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A multi-agent distributed decision method based on information interaction, characterized by comprising:
inputting the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
inputting the shared observation information of the plurality of agents into a neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
2. The multi-agent distributed decision method based on information interaction according to claim 1, wherein the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism; the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
3. The multi-agent distributed decision method based on information interaction according to claim 2, wherein the output of the hard attention mechanism is denoted W^h_{i,j} and is calculated as:

$$W^{h}_{i,j} = \operatorname{gum}\!\big( f\big( \operatorname{Bi\text{-}LSTM}(o_i, o_j) \big) \big)$$

where gum() denotes the gumbel-softmax function, f() denotes a fully-connected layer, Bi-LSTM denotes a bidirectional long short-term memory artificial neural network, o_i and o_j denote the pre-sharing observation information of agents i and j respectively, and W^h_{i,j} represents whether an interaction relationship exists between agents i and j;

the output of the soft attention mechanism is denoted W^s_{i,j} and is calculated as:

$$W^{s}_{i,j} = \frac{\exp\!\big( e_j^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,j} \big)}{\sum_{m \neq i} \exp\!\big( e_m^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,m} \big)}$$

where exp() denotes the exponential function with the natural constant e as base, e_i and e_j denote the feature vectors extracted from o_i and o_j by the long short-term memory artificial neural network, W_k and W_q are the weight matrices matched with e_j and e_i respectively, ⊤ denotes matrix transpose, and W^s_{i,j} represents the degree of information interaction between agents i and j;

the finally output shared situation information is denoted o'_i and is calculated as:

$$o'_i = \sum_{j=1,\, j \neq i}^{N} W^{s}_{i,j}\, e_j$$

where N is the total number of agents.
4. The multi-agent distributed decision method based on information interaction according to claim 1, wherein expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents comprises:
denoting the global value function as Q_total(S, A) and the individual value function of the i-th agent as Q_i(s_i, a_i), which satisfy:

$$Q_{total}(S, A) = c(S) + \sum_{i=1}^{N} \lambda_i\, Q_i(s_i, a_i)$$

where c(S) is a constant and λ_i denotes the coefficient of the value function of the i-th agent; λ_i is taken as the attention weight of the i-th channel of the multi-head attention network.
5. The multi-agent distributed decision method based on information interaction according to claim 1, wherein the training of the decision model comprises:
calculating a loss function according to the estimated global value scores of the plurality of agents and the actual global value scores of the plurality of agents.
6. The multi-agent distributed decision method based on information interaction according to claim 1, wherein the training of the decision model comprises:
calculating the gradient of the joint network formed by the value networks corresponding to the agents;
updating the parameters of the joint network formed by the value networks corresponding to the agents;
calculating the gradient of each agent's policy network separately; and
updating the parameters of each agent's policy network separately.
7. A multi-agent distributed decision system based on information interaction, characterized by comprising:
an information interaction module, configured to input the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
a decision module, configured to input the shared observation information of the plurality of agents into a neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
8. The multi-agent distributed decision system based on information interaction according to claim 7, wherein the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism; the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
CN202210829307.7A 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction Active CN114896899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210829307.7A CN114896899B (en) 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210829307.7A CN114896899B (en) 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction

Publications (2)

Publication Number Publication Date
CN114896899A CN114896899A (en) 2022-08-12
CN114896899B (en) 2022-10-11

Family

ID=82729283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210829307.7A Active CN114896899B (en) 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction

Country Status (1)

Country Link
CN (1) CN114896899B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860107B (en) * 2023-01-30 2023-05-16 武汉大学 Multi-machine searching method and system based on multi-agent deep reinforcement learning
CN115793717B (en) * 2023-02-13 2023-05-05 中国科学院自动化研究所 Group collaborative decision-making method, device, electronic equipment and storage medium
CN116260882B (en) * 2023-05-15 2023-07-28 中国人民解放军国防科技大学 Multi-agent scheduling asynchronous consistency method and device with low communication flow
CN117332814A (en) * 2023-12-01 2024-01-02 中国科学院自动化研究所 Collaborative agent model based on modularized network, learning method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yong Liu et al., "Multi-agent game abstraction via graph attention neural network", AAAI Technical Track: Multiagent Systems, 2020-04-03, pp. 7211-7218 *

Also Published As

Publication number Publication date
CN114896899A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN114896899B (en) Multi-agent distributed decision method and system based on information interaction
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Wang et al. Adaptive and large-scale service composition based on deep reinforcement learning
Zhan et al. A learning-based incentive mechanism for federated learning
CN110458663B (en) Vehicle recommendation method, device, equipment and storage medium
CN108962238A (en) Dialogue method, system, equipment and storage medium based on structural neural networks
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN111309880B (en) Multi-agent action strategy learning method, device, medium and computing equipment
Obiedat et al. A novel semi-quantitative Fuzzy Cognitive Map model for complex systems for addressing challenging participatory real life problems
CN113449183B (en) Interactive recommendation method and system based on offline user environment and dynamic rewards
Bahrpeyma et al. An adaptive RL based approach for dynamic resource provisioning in Cloud virtualized data centers
CN115186097A (en) Knowledge graph and reinforcement learning based interactive recommendation method
Klissarov et al. Reward propagation using graph convolutional networks
CN114595396A (en) Sequence recommendation method and system based on federal learning
CN112364242A (en) Graph convolution recommendation system for context-aware type
Long et al. Multi-task learning for collaborative filtering
Chen et al. Session-based recommendation: Learning multi-dimension interests via a multi-head attention graph neural network
Kiannejad et al. Two‐stage ANN‐based bidding strategy for a load aggregator using decentralized equivalent rival concept
He et al. Prediction of electricity demand of China based on the analysis of decoupling and driving force
Alagha et al. Blockchain-Assisted Demonstration Cloning for Multi-Agent Deep Reinforcement Learning
Smith et al. Co-Learning Empirical Games and World Models
CN115168722A (en) Content interaction prediction method and related equipment
Yuan Intrinsically-motivated reinforcement learning: A brief introduction
CN114528992A (en) Block chain-based e-commerce business analysis model training method
CN111027709B (en) Information recommendation method and device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant