CN114896899B - Multi-agent distributed decision method and system based on information interaction - Google Patents

Multi-agent distributed decision method and system based on information interaction

Info

Publication number
CN114896899B
CN114896899B (application CN202210829307.7A)
Authority
CN
China
Prior art keywords
network
value
agent
agents
information
Prior art date
Legal status
Active
Application number
CN202210829307.7A
Other languages
Chinese (zh)
Other versions
CN114896899A (en)
Inventor
杨若鹏
殷昌盛
杨远涛
鲁义威
韦文夏
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210829307.7A
Publication of CN114896899A
Application granted
Publication of CN114896899B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a multi-agent distributed decision method and system based on information interaction. The method comprises the following steps: inputting the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain the observation information after information sharing; and inputting the shared observation information into a neural-network-based decision model. The decision model comprises a policy network and a value network corresponding to each agent: the policy network outputs the corresponding agent's action information according to the shared observation information, and the value network outputs the corresponding value score according to that action information. The decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents from the per-agent value scores. The invention realizes information sharing and cooperation among multiple agents and improves decision accuracy and efficiency.

Description

Multi-agent distributed decision method and system based on information interaction
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a multi-agent distributed decision-making method and system based on information interaction.
Background
A practical decision problem can be abstracted and modeled as an incomplete-information multi-agent game, and traditional centralized multi-agent decision methods can address the search-efficiency problem of centralized decision-making. However, in large-scale joint collaboration the number and variety of entities are large, and a simple centralized decision method clearly cannot solve the cooperative-game problem of multiple heterogeneous entities.
A multi-agent system (MAS) is a system composed of a plurality of agents. By sharing the information, objectives, strategies, and actions acquired by its agents, a MAS can, through mutual cooperation, solve problems that a single agent cannot solve, or can solve only inefficiently, owing to limits on its individual ability, knowledge, or resources.
In a multi-agent system, the agents are not simply stacked together; they are coupled, interacting and correlated through tasks, the environment, or other factors. Multi-agent methods can effectively solve multi-entity cooperative decision problems, but for large-scale multi-agent cooperative decision-making, achieving efficient information cooperation and strategy cooperation among agents is the key to distributed command and decision-making.
Disclosure of Invention
In view of at least one defect or improvement need in the prior art, the present invention provides a multi-agent distributed decision method and system based on information interaction, which realize information sharing and cooperation among multiple agents and improve decision accuracy and efficiency.
To achieve the above object, according to a first aspect of the present invention, there is provided a multi-agent distributed decision method based on information interaction, comprising:
inputting the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
inputting the shared observation information of the plurality of agents into a neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
Further, the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism: the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
Further, the output of the hard attention mechanism is denoted W^h_{i,j} and is calculated as:

$$W^{h}_{i,j} = \operatorname{gum}\!\big( f\big( \operatorname{Bi\text{-}LSTM}(o_i, o_j) \big) \big)$$

where gum() denotes the gumbel-softmax function, f() denotes a fully-connected layer, Bi-LSTM denotes a bidirectional long short-term memory artificial neural network, o_i and o_j denote the pre-sharing observation information of agents i and j respectively, and W^h_{i,j} represents whether an interaction relationship exists between agents i and j.

The output of the soft attention mechanism is denoted W^s_{i,j} and is calculated as:

$$W^{s}_{i,j} = \frac{\exp\!\big( e_j^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,j} \big)}{\sum_{m \neq i} \exp\!\big( e_m^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,m} \big)}$$

where exp() denotes the exponential function with the natural constant e as base, e_i and e_j denote the feature vectors extracted from o_i and o_j by the long short-term memory artificial neural network, W_k and W_q are the weight matrices matched with e_j and e_i respectively, ⊤ denotes matrix transpose, and W^s_{i,j} represents the degree of information interaction between agents i and j.

The finally output shared situation information is denoted o'_i and is calculated as:

$$o'_i = \sum_{j=1,\, j \neq i}^{N} W^{s}_{i,j}\, e_j$$

where N is the total number of agents.
Further, expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents comprises:
denoting the global value function as Q_total(S, A) and the individual value function of the i-th agent as Q_i(s_i, a_i), which satisfy:

$$Q_{total}(S, A) = c(S) + \sum_{i=1}^{N} \lambda_i\, Q_i(s_i, a_i)$$

where c(S) is a constant, λ_i denotes the coefficient of the value function of the i-th agent, and λ_i is taken as the attention weight of the i-th channel of the multi-head attention network.
Further, the training of the decision model comprises:
calculating a loss function according to the estimated global value scores of the plurality of agents and the actual global value scores of the plurality of agents.
Further, the training of the decision model comprises:
calculating the gradient of the joint network formed by the value networks corresponding to the agents;
updating the parameters of the joint network formed by the value networks corresponding to the agents;
calculating the gradient of each agent's policy network separately; and
updating the parameters of each agent's policy network separately.
According to a second aspect of the present invention, there is also provided a multi-agent distributed decision system based on information interaction, comprising:
the information interaction module is used for inputting the observation information of each of a plurality of agents into the attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
the decision module is used for inputting the shared observation information of the plurality of agents into the neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
Further, the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism: the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) The method learns the information-interaction relationships among agents by a graph convolution approach and uses an attention mechanism so that each agent receives the observation information of surrounding agents in a more targeted way; this reduces the complexity of the neural network as far as possible while meeting the information-interaction requirements of cooperative decision-making, effectively improving the efficiency of information cooperation among agents. Meanwhile, the multi-head attention mechanism effectively solves the credit-assignment (trust distribution) problem among agents and realizes strategy cooperation among them.
(2) Meanwhile, a centralized-training, distributed-execution scheme is adopted: each policy network calculates its gradient and updates its parameters separately, while the value networks form a joint network that calculates gradients and updates parameters jointly, improving the situation-estimation accuracy and decision efficiency of large-scale multi-agent systems under local-observation conditions.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic flowchart of a multi-agent distributed decision method based on information interaction according to an embodiment of the present application;
FIG. 2 is a schematic diagram of information interaction and sharing according to an embodiment of the present application;
FIG. 3 is a schematic diagram of strategy collaboration based on value function decomposition according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a contribution-oriented value function decomposition network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a network for multi-agent distributed decision-making based on information interaction according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a multi-agent distributed decision-making model based on information interaction according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The terms "including", "having" and any variations thereof in the description and claims of this application and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
As shown in FIG. 1, a multi-agent distributed decision method based on information interaction according to an embodiment of the present invention comprises:
and S101, inputting the observation information of each of the plurality of agents into a graph convolution neural network model based on an attention mechanism, and obtaining the observation information of each of the plurality of agents after information sharing.
The perception and action capabilities of a single agent are limited; sharing information such as actions, states, and strategies among agents through interaction can effectively improve system efficiency and allow more complex tasks to be completed. A single agent's observation is inherently limited, and directly sharing all agents' observation information globally would achieve information sharing but would undoubtedly enlarge the decision state space and the parameters of the neural network, making training more difficult. Therefore, reducing information interaction among agents while still meeting the information requirements of the decision becomes the main goal in designing a multi-heterogeneous-entity collaborative decision architecture.
In the embodiment of the invention, the information-interaction relationships among the agents are regarded as a special graph: each agent is regarded as a graph node, and the relationship between agents is regarded as an edge between nodes. The observation information of the plurality of agents is input into the graph convolutional neural network model, the observation information is represented and learned by graph convolution, and each agent's shared observation information is obtained, realizing information interaction among agents.
Further, the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism: the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
As shown in FIG. 2, a graph convolutional neural network model based on the attention mechanism is established; the input of the network is each agent i's own observation information o_i. First, the hard attention mechanism determines which agents need to interact: if information interaction is needed, an interaction edge is established between the agents. Then, on this basis, the soft attention mechanism determines the degree of that interaction, i.e., the weight of the interaction edge expresses the degree of information interaction between the agents. The model finally outputs the shared situation information o'_i, which subsequently serves as the input for each agent's decision.
(1) Information interaction based on the hard attention mechanism: for any two agents i, j ∈ {1, 2, …, N}, i ≠ j, where N is the total number of agents, the output W^h_{i,j} ∈ {0, 1} represents whether an interaction relationship, i.e., an interaction edge, exists between agents i and j. A bidirectional long short-term memory artificial neural network (Bi-LSTM) is used for learning; meanwhile, to overcome the inability to back-propagate gradients through the sampling step, the Gumbel-softmax function is used here, so the output W^h_{i,j} can be expressed as:

$$W^{h}_{i,j} = \operatorname{gum}\!\big( f\big( \operatorname{Bi\text{-}LSTM}(o_i, o_j) \big) \big)$$

where gum() denotes the gumbel-softmax function, f() denotes a fully-connected layer, Bi-LSTM denotes the bidirectional long short-term memory artificial neural network, and o_i and o_j denote the pre-sharing observation information of agents i and j respectively.
(2) Information interaction based on the soft attention mechanism: this step determines the weights of the edges retained in the topology generated by the hard attention mechanism, i.e., the degree of information interaction. The most basic key-value attention is used: the soft attention weight W^s_{i,j} compares the embeddings e_i and e_j, and the match between the two embeddings is passed into a softmax function:

$$W^{s}_{i,j} = \frac{\exp\!\big( e_j^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,j} \big)}{\sum_{m \neq i} \exp\!\big( e_m^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,m} \big)}$$

where exp() denotes the exponential function with the natural constant e as base; e_i and e_j are the feature vectors extracted from o_i and o_j by a long short-term memory artificial neural network (LSTM); W_k and W_q are the weight matrices matched with e_j and e_i respectively; ⊤ denotes matrix transpose; W_k converts e_j into a key and W_q converts e_i into a query; W^h_{i,j} is the hard attention value; and the soft attention value W^s_{i,j} is the final weight of the edge.

The finally output shared situation information o'_i, which serves as the input for each agent's decision, is calculated as:

$$o'_i = \sum_{j=1,\, j \neq i}^{N} W^{s}_{i,j}\, e_j$$
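For concreteness, a minimal PyTorch sketch of this two-layer graph attention, written directly from the formulas above, is given below. The layer sizes, the single-step LSTM encoder, the use of gumbel_softmax with hard=True for the hard attention, and masking non-edges out of the softmax (instead of multiplying the score by W^h inside the exponential) are all illustrative assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGraphAttention(nn.Module):
    def __init__(self, obs_dim: int, hid: int = 64):
        super().__init__()
        self.cell = nn.LSTMCell(obs_dim, hid)            # extracts e_i from o_i
        self.bi_lstm = nn.LSTM(2 * hid, hid, batch_first=True, bidirectional=True)
        self.hard_fc = nn.Linear(2 * hid, 2)             # f(): edge / no-edge logits
        self.W_q = nn.Linear(hid, hid, bias=False)       # query transform for e_i
        self.W_k = nn.Linear(hid, hid, bias=False)       # key transform for e_j

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (N, obs_dim), one observation vector per agent
        N, hid = obs.size(0), self.W_q.in_features
        zeros = torch.zeros(N, hid)
        e, _ = self.cell(obs, (zeros, zeros))            # e: (N, hid), single LSTM step
        # hard attention: decide for each ordered pair (i, j) whether an edge exists
        pairs = torch.cat([e.unsqueeze(1).expand(N, N, hid),
                           e.unsqueeze(0).expand(N, N, hid)], dim=-1)
        out, _ = self.bi_lstm(pairs.reshape(N * N, 1, 2 * hid))
        logits = self.hard_fc(out.reshape(N * N, 2 * hid)).view(N, N, 2)
        W_h = F.gumbel_softmax(logits, hard=True)[..., 0]          # (N, N) in {0, 1}
        # soft attention: softmax over e_j^T W_k^T W_q e_i, restricted to hard edges
        scores = self.W_q(e) @ self.W_k(e).t()                     # (N, N)
        keep = W_h.bool() & ~torch.eye(N, dtype=torch.bool)
        W_s = torch.nan_to_num(                                    # isolated rows -> 0
            F.softmax(scores.masked_fill(~keep, float('-inf')), dim=-1))
        return W_s @ e                        # o'_i = sum_j W_s[i, j] * e_j
```

Given obs of shape (N, obs_dim), the module returns the shared features o'_i of shape (N, hid), which then feed each agent's decision.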
S102: inputting the shared observation information of the plurality of agents into the neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent.
In a joint decision problem, the decisions and actions of different agents interact, i.e., the policies of different agents influence one another, and a policy change by any single agent can make the global policy non-stationary. Since the development of the global situation and its outcome depend on the decisions and behaviors of all agents, each individual agent must take the other agents' actions into account when deciding and acting. This is reflected in the credit-assignment problem of multi-agent reinforcement learning, namely how to construct joint reward functions and joint policy models so as to find the Nash equilibrium of the policies.
Reinforcement learning is defined by an agent interacting with the environment and receiving feedback, learning with the goal of maximizing cumulative reward, and finally realizing a mapping from states to optimal actions. The interaction process between agent and environment can be regarded as trial and error, an important characteristic of reinforcement learning. In each interaction the environment returns feedback to the agent, which can be treated as a labeled sample; since the rewards fed back by the environment are usually delayed and sparse, delayed reward is another important characteristic of reinforcement learning. A reinforcement learning problem is typically modeled as a Markov decision process, i.e., a Markov quintuple ⟨S, A, P, R, γ⟩, where S and A denote the state space and the action space respectively, P is the state transition function, R is the reward function, and γ is the reward discount factor representing the decay of rewards due to future uncertainty. The cumulative reward R_t can thus be expressed as:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}$$

where γ^k denotes the k-th power of γ, r_t denotes the immediate reward at time t, and r_{t+k} denotes the immediate reward at time t + k.
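As a quick worked illustration of the formula above, the discounted return can be computed backwards over a reward sequence; truncating to a finite horizon is an assumption made only for this example.

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, computed backwards as R = r + gamma * R."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# e.g. rewards [1, 1] with gamma = 0.5 give 1 + 0.5 * 1 = 1.5
assert abs(discounted_return([1.0, 1.0], gamma=0.5) - 1.5) < 1e-9
```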
In the embodiment of the invention, multi-agent reinforcement learning is represented by a tuple ⟨N, S, A, O, T, R⟩, where N is the number of agents in the environment; S = [s_1, …, s_N] is the joint state of all agents, s_i denoting the state of the i-th agent; A = [a_1, …, a_N] is the joint action of all agents, a_i denoting the i-th agent's action; and O = [o_1, o_2, …, o_N] is the joint observation state of all agents, o_i denoting the observed state of the i-th agent. Given agent i's current state and the actions to be performed, T: S × a_1 × a_2 × … × a_N → P(S) is agent i's state transition function, defining the probability distribution of agent i's successor states; R_i: S × a_1 × a_2 × … × a_N → R is agent i's reward function, set according to the objective and depending on the global state and actions. Under incomplete-information conditions an agent's observation o_i contains only partial information of the global state s ∈ S; the initial state is determined by a distribution ρ: S → [0, 1]; and each agent i learns a policy π_i: o_i → P(a_i) that maps its own observations to a distribution over the action space.
The learning goal of each agent is to maximize its expected reward feedback, as shown below:

$$J_i(\pi_i) = \mathbb{E}_{a_t^i \sim \pi_i,\; O_t}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_{i,t} \right]$$

where J_i(π_i) denotes the reward expectation of the i-th agent's policy π_i; γ ∈ [0, 1] is the discount factor for future reward feedback and γ^t is its t-th power; r_{i,t} denotes the i-th agent's immediate reward at time t; a^i_t denotes the i-th agent's action at time t; O_t denotes the observation at time t; the expectation is taken with all agents following their own policies π_i; and T is the time window.
As shown in FIG. 3, the decision model comprises a policy network and a value network corresponding to each agent. Each agent i maintains its own policy network, denoted π_i, and its own value network, denoted Q_i. The policy network π_i outputs the corresponding agent i's action information a_i according to the shared observation information o'_i. The value network Q_i outputs a value score according to agent i's action information a_i, i.e., an evaluation of that action. The multi-head attention network then obtains the global value score of the plurality of agents according to the value scores output by each agent's value network, i.e., an overall evaluation of all agents' action information; this fuses the individual objectives with the global objective and realizes strategy cooperation among agents. The multi-head attention network can also assign different weights to different agents, realizing credit assignment.
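A minimal sketch of one agent's pair of networks might look as follows; the two-layer MLPs and the discrete (categorical) action space are illustrative assumptions, not an architecture prescribed by the patent.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi_i: shared observation o'_i -> distribution over the agent's actions."""
    def __init__(self, obs_dim: int, n_actions: int, hid: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, n_actions))

    def forward(self, shared_obs):
        return torch.distributions.Categorical(logits=self.net(shared_obs))

class ValueNet(nn.Module):
    """Q_i: (o'_i, a_i) -> scalar value score for the chosen action."""
    def __init__(self, obs_dim: int, n_actions: int, hid: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + n_actions, hid), nn.ReLU(),
                                 nn.Linear(hid, 1))

    def forward(self, shared_obs, action_onehot):
        return self.net(torch.cat([shared_obs, action_onehot], dim=-1)).squeeze(-1)
```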
Further, obtaining the global value score of the plurality of agents according to the value scores output by each agent's value network comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the per-agent value scores, wherein determining the attention weight of each channel comprises: expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network. That is, a contribution-oriented value function decomposition method is designed to solve the credit-assignment problem among agents.
(1) Contribution-oriented value function decomposition algorithm: the global value function Q_total(S, A) is regarded as a function of the joint state S = (s_1, …, s_N) and the joint action A = (a_1, …, a_N), and it is expanded as a Taylor decomposition of all individual agents' value functions Q_i(s_i, a_i), as shown below:

$$Q_{total}(S, A) = c(S) + \sum_{i=1}^{N} \lambda_i\, Q_i(s_i, a_i), \qquad \lambda_i = \sum_{h} \lambda_{i,h}$$

where c(S) is a constant and, for every order h, the term λ_{i,h} is a linear function of the h-th order partial derivatives ∂^h Q_total / ∂Q_{i_1} … ∂Q_{i_h}. The coefficient λ_i of the i-th agent's value function thus collects the coefficients of all orders obtained when the global value function Q_total(S, A) is decomposed with respect to the i-th agent's value function Q_i(s_i, a_i); the order-h terms decay rapidly, in an exponential fashion, with h. The corresponding network structure is shown in FIG. 4.
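The following hedged PyTorch sketch illustrates such a contribution-oriented mixing network: each attention head h produces per-agent weights λ_{i,h} from the global state, λ_i is their sum over heads, and Q_total is the λ-weighted sum of the individual Q_i plus a state-dependent term c(S). The head count, layer sizes, and the use of softmax to keep the weights non-negative are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadValueMixer(nn.Module):
    def __init__(self, state_dim: int, n_heads: int = 4, hid: int = 64):
        super().__init__()
        self.queries = nn.ModuleList(nn.Linear(state_dim, hid) for _ in range(n_heads))
        self.key = nn.Linear(1, hid, bias=False)          # embeds each scalar Q_i
        self.c = nn.Sequential(nn.Linear(state_dim, hid), nn.ReLU(),
                               nn.Linear(hid, 1))         # c(S)

    def forward(self, qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # qs: (B, N) per-agent value scores; state: (B, state_dim) global state
        k = self.key(qs.unsqueeze(-1))                    # (B, N, hid)
        lam = sum(F.softmax(torch.einsum('bh,bnh->bn', W(state), k), dim=-1)
                  for W in self.queries)                  # lambda_i = sum_h lambda_{i,h}
        return (lam * qs).sum(dim=-1) + self.c(state).squeeze(-1)   # Q_total
```

With qs of shape (B, N) and state of shape (B, state_dim), forward returns Q_total of shape (B,).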
(2) Solving the multi-agent cooperation strategy: to realize cooperative decision-making among multiple heterogeneous entities, a centralized-training, distributed-execution scheme is generally adopted; the detailed steps are shown in FIG. 5 and FIG. 6. Parameterized neural networks are used to fit the value networks and the policy networks. Each agent i maintains a policy network π_i and a value network Q_i; the Q value finally used to calculate the loss function is the global value function Q_total(S, A), obtained by mixing the outputs of all value networks Q_i through the multi-head attention network.
After situation-information sharing based on the graph attention network, each single agent makes decisions on its observed situation information according to its own policy π_i, outputs the corresponding action a_i, and stores the data in a sample experience pool. K samples are drawn from the experience pool by random sampling, each agent i's target-network value output Q'_i(s^i_t, a^i_t) at time t is obtained, and all the Q'_i are input into the multi-head attention network to obtain the global value Q'_total, which is combined with the real-time global reward R^j_t to obtain the target network value output:

$$y_j = R^{j}_{t} + \gamma \Big( Q'_{total}\big(S^{j}_{t+1}, A^{j}_{t+1}\big) + \alpha\, H\big(\pi(\cdot \mid S^{j}_{t+1})\big) \Big)$$

where y_j denotes the target network value output; R^j_t denotes the instant reward of the j-th sample at time t; Q'_total(S^j_{t+1}, A^j_{t+1}) is the target value network's global value estimate for executing action A^j_{t+1} in state S^j_{t+1} at time t+1 in the j-th sample; α is the weight coefficient of the policy entropy; H(π(·|S^j)) is the policy entropy value, used to increase policy diversity, the policy π outputting a probability distribution over actions; and K is the number of sampled samples.
Furthermore, the policy networks adopt distributed Actor networks, and the value networks adopt a centralized Critic network.
The training of the decision model comprises the steps of:
(1) Calculating the loss function according to the estimated global value scores of the plurality of agents and the actual global value scores of the plurality of agents. The MSE (mean squared error) between the Critic network output in the target network and the Critic network output in the online network is used as the loss function:

$$L(\theta) = \frac{1}{K} \sum_{j=1}^{K} \Big( y_j - Q_{total}\big(S^{j}_{t}, A^{j}_{t}\big) \Big)^{2}$$

where L(θ) denotes the neural network loss as a function of the parameters θ, y_j denotes the target value network's output value, and Q_total(S^j_t, A^j_t) denotes the output of the online joint value network.
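Combining the target value and the loss above, one Critic update step might be sketched as follows; the batch layout, the precomputed next-step policy entropy, and the helper modules from the earlier sketches are assumptions of this example, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, agents, mixer, target_agents, target_mixer,
                  optimizer, gamma=0.99, alpha=0.01):
    # assumed batch fields: obs/next_obs (B, N, obs_dim), act/next_act (B, N, act_dim),
    # state/next_state (B, state_dim), rew (B,), next_entropy (B,)
    with torch.no_grad():
        next_q = torch.stack([tq(batch['next_obs'][:, i], batch['next_act'][:, i])
                              for i, tq in enumerate(target_agents)], dim=1)
        y = batch['rew'] + gamma * (target_mixer(next_q, batch['next_state'])
                                    + alpha * batch['next_entropy'])   # target y_j
    q = torch.stack([qn(batch['obs'][:, i], batch['act'][:, i])
                     for i, qn in enumerate(agents)], dim=1)
    loss = F.mse_loss(mixer(q, batch['state']), y)                     # L(theta)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```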
(2) Calculating the gradient of the joint network formed by the value networks corresponding to the agents and updating its parameters; and calculating the gradient of the policy network corresponding to each agent separately and updating its parameters separately.
An optimizer is used to update the online Critic network, i.e., the online networks Q_i. The policy is estimated without bias using data randomly sampled from the replay memory buffer:

$$\nabla_{\theta^{\mu}_{i}} J \approx \frac{1}{K} \sum_{j=1}^{K} \nabla_{\theta^{\mu}_{i}}\, \mu_i\big(O^{j}_{t}\big)\; \nabla_{A_t} \Big( Q_{total}\big(S^{j}_{t}, A_t\big) + \alpha\, H\big(\pi(\cdot \mid S^{j}_{t})\big) \Big) \Big|_{A_t = \mu(O^{j}_{t})}$$

where ∇_{θ^μ_i} J denotes the gradient of the reward expectation with respect to the parameters of the i-th online Actor network; K denotes the number of samples; μ_i(O^j_t) denotes the action output by agent i at time t in the j-th sample under observation O^j_t; H(π(·|S^j_t)) denotes the policy entropy value; and ∇_{A_t}(·) denotes the gradient of the reward expectation with respect to the joint action A_t.
The parameters μ_i of the i-th online Actor network are updated according to the computed gradient, and the target network parameters are updated softly, as shown below:

$$\theta^{Q'}_{i} \leftarrow \tau\, \theta^{Q}_{i} + (1 - \tau)\, \theta^{Q'}_{i}, \qquad \theta^{\mu'}_{i} \leftarrow \tau\, \theta^{\mu}_{i} + (1 - \tau)\, \theta^{\mu'}_{i}$$

where θ^Q_i and θ^μ_i denote the neural network parameters of the online value network and the online policy network respectively, θ^{Q'}_i and θ^{μ'}_i denote the neural network parameters of the target value network and the target policy network, and τ denotes the soft-update weight coefficient.
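A minimal sketch of this soft update, assuming the online and target networks are ordinary PyTorch modules:

```python
import torch

def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    with torch.no_grad():
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```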
An embodiment of the invention further provides a multi-agent distributed decision system based on information interaction, comprising:
an information interaction module, configured to input the observation information of each of a plurality of agents into the attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
a decision module, configured to input the shared observation information of the plurality of agents into the neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
The implementation principle of the system is the same as that of the method, and the details are not repeated here.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A multi-agent distributed decision method based on information interaction, characterized by comprising:
inputting the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
inputting the shared observation information of the plurality of agents into a neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
2. The multi-agent distributed decision method based on information interaction according to claim 1, wherein the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism; the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
3. The multi-agent distributed decision method based on information interaction according to claim 2, wherein the output of the hard attention mechanism is denoted W^h_{i,j} and is calculated as:

$$W^{h}_{i,j} = \operatorname{gum}\!\big( f\big( \operatorname{Bi\text{-}LSTM}(o_i, o_j) \big) \big)$$

where gum() denotes the gumbel-softmax function, f() denotes a fully-connected layer, Bi-LSTM denotes a bidirectional long short-term memory artificial neural network, o_i and o_j denote the pre-sharing observation information of agents i and j respectively, and W^h_{i,j} represents whether an interaction relationship exists between agents i and j;

the output of the soft attention mechanism is denoted W^s_{i,j} and is calculated as:

$$W^{s}_{i,j} = \frac{\exp\!\big( e_j^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,j} \big)}{\sum_{m \neq i} \exp\!\big( e_m^{\top} W_k^{\top} W_q\, e_i\, W^{h}_{i,m} \big)}$$

where exp() denotes the exponential function with the natural constant e as base, e_i and e_j denote the feature vectors extracted from o_i and o_j by the long short-term memory artificial neural network, W_k and W_q are the weight matrices matched with e_j and e_i respectively, ⊤ denotes matrix transpose, and W^s_{i,j} represents the degree of information interaction between agents i and j;

the finally output shared situation information is denoted o'_i and is calculated as:

$$o'_i = \sum_{j=1,\, j \neq i}^{N} W^{s}_{i,j}\, e_j$$

where N is the total number of agents.
4. The multi-agent distributed decision method based on information interaction according to claim 1, wherein expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents comprises:
denoting the global value function as Q_total(S, A) and the individual value function of the i-th agent as Q_i(s_i, a_i), which satisfy:

$$Q_{total}(S, A) = c(S) + \sum_{i=1}^{N} \lambda_i\, Q_i(s_i, a_i)$$

where c(S) is a constant and λ_i denotes the coefficient of the value function of the i-th agent; λ_i is taken as the attention weight of the i-th channel of the multi-head attention network.
5. The multi-agent distributed decision method based on information interaction according to claim 1, wherein the training of the decision model comprises:
calculating a loss function according to the estimated global value scores of the plurality of agents and the actual global value scores of the plurality of agents.
6. The multi-agent distributed decision method based on information interaction according to claim 1, wherein the training of the decision model comprises:
calculating the gradient of the joint network formed by the value networks corresponding to the agents;
updating the parameters of the joint network formed by the value networks corresponding to the agents;
calculating the gradient of each agent's policy network separately; and
updating the parameters of each agent's policy network separately.
7. A multi-agent distributed decision system based on information interaction, characterized by comprising:
an information interaction module, configured to input the observation information of each of a plurality of agents into an attention-based graph convolutional neural network model to obtain each agent's observation information after information sharing;
a decision module, configured to input the shared observation information of the plurality of agents into a neural-network-based decision model, wherein the decision model comprises a policy network and a value network corresponding to each agent; the policy network outputs the corresponding agent's action information according to its shared observation information, and the value network outputs the corresponding agent's value score according to that action information; the decision model further comprises a multi-head attention network, which obtains the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent;
wherein obtaining the global value score of the plurality of agents according to the value scores output by the value network corresponding to each agent comprises: determining the attention weight of each channel of the multi-head attention network, and obtaining the global value score of the plurality of agents according to the per-channel attention weights and the value scores output by the value network corresponding to each agent, and wherein determining the attention weight of each channel comprises:
expanding the global value function into a Taylor decomposition of the individual value functions of the plurality of agents, and taking the coefficient of the i-th agent's value function as the attention weight of the i-th channel of the multi-head attention network.
8. The multi-agent distributed decision system based on information interaction according to claim 7, wherein the graph convolutional neural network model adopts a two-layer attention mechanism comprising a hard attention mechanism and a soft attention mechanism; the hard attention mechanism determines whether an interaction relationship exists between any two agents; the soft attention mechanism determines the degree of information interaction between any two agents having an interaction relationship; and each agent's observation information after information sharing is obtained based on the interaction degrees determined by the soft attention mechanism.
CN202210829307.7A 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction Active CN114896899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210829307.7A CN114896899B (en) 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210829307.7A CN114896899B (en) 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction

Publications (2)

Publication Number Publication Date
CN114896899A CN114896899A (en) 2022-08-12
CN114896899B (en) 2022-10-11

Family

ID=82729283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210829307.7A Active CN114896899B (en) 2022-07-15 2022-07-15 Multi-agent distributed decision method and system based on information interaction

Country Status (1)

Country Link
CN (1) CN114896899B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860107B (en) * 2023-01-30 2023-05-16 武汉大学 Multi-machine searching method and system based on multi-agent deep reinforcement learning
CN115793717B (en) * 2023-02-13 2023-05-05 中国科学院自动化研究所 Group collaborative decision-making method, device, electronic equipment and storage medium
CN116260882B (en) * 2023-05-15 2023-07-28 中国人民解放军国防科技大学 Multi-agent scheduling asynchronous consistency method and device with low communication flow
CN117332814A (en) * 2023-12-01 2024-01-02 中国科学院自动化研究所 Collaborative agent model based on modularized network, learning method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yong Liu et al., "Multi-agent game abstraction via graph attention neural network", AAAI Technical Track: Multiagent Systems, 2020-04-03, pp. 7211-7218 *

Also Published As

Publication number Publication date
CN114896899A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN114896899B (en) Multi-agent distributed decision method and system based on information interaction
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Wang et al. Adaptive and large-scale service composition based on deep reinforcement learning
Zhan et al. A learning-based incentive mechanism for federated learning
CN110458663B (en) Vehicle recommendation method, device, equipment and storage medium
CN108962238A (en) Dialogue method, system, equipment and storage medium based on structural neural networks
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN111309880B (en) Multi-agent action strategy learning method, device, medium and computing equipment
Obiedat et al. A novel semi-quantitative Fuzzy Cognitive Map model for complex systems for addressing challenging participatory real life problems
CN113449183B (en) Interactive recommendation method and system based on offline user environment and dynamic rewards
Bahrpeyma et al. An adaptive RL based approach for dynamic resource provisioning in Cloud virtualized data centers
CN115186097A (en) Knowledge graph and reinforcement learning based interactive recommendation method
Klissarov et al. Reward propagation using graph convolutional networks
CN114595396A (en) Sequence recommendation method and system based on federal learning
CN112364242A (en) Graph convolution recommendation system for context-aware type
Long et al. Multi-task learning for collaborative filtering
Chen et al. Session-based recommendation: Learning multi-dimension interests via a multi-head attention graph neural network
Kiannejad et al. Two‐stage ANN‐based bidding strategy for a load aggregator using decentralized equivalent rival concept
He et al. Prediction of electricity demand of China based on the analysis of decoupling and driving force
Alagha et al. Blockchain-Assisted Demonstration Cloning for Multi-Agent Deep Reinforcement Learning
Smith et al. Co-Learning Empirical Games and World Models
CN115168722A (en) Content interaction prediction method and related equipment
Yuan Intrinsically-motivated reinforcement learning: A brief introduction
CN114528992A (en) Block chain-based e-commerce business analysis model training method
CN111027709B (en) Information recommendation method and device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant