CN116578636B - Distributed multi-agent cooperation method, system, medium and equipment - Google Patents

Distributed multi-agent cooperation method, system, medium and equipment

Info

Publication number
CN116578636B
CN116578636B (application CN202310538318.4A)
Authority
CN
China
Prior art keywords
network
history
observation
state
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310538318.4A
Other languages
Chinese (zh)
Other versions
CN116578636A (en)
Inventor
Peng Peixi (彭佩玺)
Zhai Yunpeng (翟云鹏)
Tian Yonghong (田永鸿)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202310538318.4A priority Critical patent/CN116578636B/en
Publication of CN116578636A publication Critical patent/CN116578636A/en
Application granted granted Critical
Publication of CN116578636B publication Critical patent/CN116578636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26Visual data mining; Browsing structured data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to a distributed multi-agent collaboration method, system, medium, and apparatus. The method comprises the following steps: storing the observation states of a specific number of past steps to construct an observation history register; as the agent interacts with the environment, the history register continuously receives new history states and discards early history states that exceed the capacity limit; constructing a history background network, wherein the input of the history background network is the current observation state and the history states in the history register, and the output history background embedded state of the history background network is obtained through data mining and fusion; constructing an implicit variation reasoning network; and constructing a strategy network and a state value network and training them through reinforcement learning, wherein the inputs of the strategy network and the state value network are the belief embedding and the current observation state, and the outputs are the strategy distribution and the state value.

Description

Distributed multi-agent cooperation method, system, medium and equipment
Technical Field
The present disclosure relates to the field of multi-agent systems, and more particularly, to a distributed multi-agent collaboration method, system, medium, and apparatus.
Background
Many complex sequential decision problems require all agents to achieve a common goal or maximize team utility in a decentralized manner, with each agent coordinating its behavior based only on its own observation history. Owing to the high-dimensional dynamic state space and the unknown environment model, deep multi-agent reinforcement learning (MARL) has shown great potential for such problems and has attracted considerable attention in recent years.
To learn a distributed strategy, a typical baseline is to develop an independent learner for each agent and treat the collective (or global) reward directly as the individual reward. This paradigm may suffer from the non-stationarity problem: as the allies (or teammates) change their behavior through learning, the environment dynamics effectively change, and an agent may receive misleading reward signals caused by its allies. To stabilize learning, most existing methods adopt the paradigm of Centralized Training with Distributed Execution (CTDE), assuming that learning occurs in a laboratory or simulator where additional global states and communication are available. Despite the tremendous progress made by such approaches, the CTDE paradigm may still be limited in some realistic multi-agent systems, because adequate and realistic simulators are difficult to develop and global states or inter-agent communication are inaccessible during training. Take autonomous driving as an example: even though RL models are deployed on vehicles prior to delivery, they still need to keep learning in the actual road environment, where each vehicle is independent and not centrally scheduled. Thus, there is a need for a more practical distributed approach in which each agent acts and learns only from its own observations, without global information or access to other agents' observations and policies.
To develop a stable distributed training method, existing independent-learning approaches address the non-stationarity problem by, for example, clipping the PPO policy ratio, repeating Q-learning updates, or improving experience replay through careful design of the training procedure. Unlike them, the methods of the present disclosure address the problem by modeling other agents, taking the potential behavior of other agents into account during learning. Some agent modeling methods related to the work of the present disclosure learn static policy belief models to predict and respond to the actions of other agents. However, they assume that the policies of other agents are fixed, and therefore cannot be applied to distributed multi-agent learning, where the policies of other agents are also continually updated through learning. In this case, the desirable capability of an agent is to dynamically adjust its policy beliefs as the policies of other agents evolve. One naive way to achieve this is to fine-tune the policy belief model using only the recent action history, since earlier history is outdated with respect to the current policies of other agents. However, this approach may suffer severely from data scarcity and under-training, and thus may fail to model other agents effectively.
Disclosure of Invention
The present disclosure aims to solve the technical problem that distributed multi-agent learning methods in the prior art cannot cope with the non-stationarity caused by the continually evolving policies of other agents, and therefore cannot achieve efficient and stable cooperation.
To achieve the above technical object, the present disclosure provides a distributed multi-agent cooperation method, including:
constructing an observation history register by storing the observation states of a specific number of steps in the past history; the history register continuously receives new history states along with the progress of interaction of the intelligent agent and the environment, and discards early history states exceeding capacity limit;
constructing a history background network, wherein the input of the history background network is a current observation state and a history state in a history register, and the output history background embedded state of the history background network is obtained through data mining and fusion;
constructing an implicit variation reasoning network, wherein the input of the implicit variation reasoning network is a historical background embedded state, and the beliefs of other intelligent agents are modeled as Gaussian distribution;
constructing a strategy network and a state value network and training through reinforcement learning, wherein the input of the strategy network and the state value network is belief embedding and the current observation state, and the output of the strategy network and the state value network is strategy distribution and state value.
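For illustration only (no code appears in the original patent text), the following minimal Python sketch shows how the four components described above could be wired together in one decentralized decision step; all class and function names are hypothetical, and possible sketches of the individual components are given later in the detailed description.

```python
# Illustrative wiring of the four components for a single agent's decision step.
# register, context_net, belief_net and policy_net are hypothetical objects
# corresponding to the history register, the history background network, the
# implicit variation reasoning network, and the strategy network.

def act(register, context_net, belief_net, policy_net, obs_t):
    """One decentralized decision step for a single agent (sketch)."""
    context = context_net(obs_t, register.contents())   # history background embedding
    belief, mu, log_var = belief_net(context)            # sampled Gaussian belief
    action = policy_net(obs_t, belief).sample()          # policy conditioned on obs + belief
    register.push(obs_t)                                  # bounded history of past states
    return action
```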
Further, after constructing the policy network and the state value network, the method further includes:
and calculating a loss function through a PPO algorithm, and updating the strategy network and the state value network.
Further, the constructing of the history background network specifically includes:
dynamically learning beliefs from the most recent historical steps by using a soft attention mechanism, and introducing an observation embedding network and an action embedding network;
the observation embedding network and the action embedding network receive as input an observation or an action encoded in one-hot form, and generate an embedding of the observation or action.
Further, the constructing of the history background network further includes:
discarding a portion of uncorrelated time steps during training and execution by using an adaptive dropout operation;
wherein the adaptive dropout depends on the cosine similarity between the historical observation embedding sequence and the current observation embedding, and the historical steps whose embeddings are most similar to the current observation embedding are retained;
wherein the retained set is the subset consisting of the top p·L historical steps selected according to this similarity, and p denotes the proportion of time steps retained, p ∈ (0, 1];
the current observation o_t and the historical observations o_{t-l} are encoded by the observation embedding network, generating the current observation embedding e_t and the historical observation embedding sequence {e_{t-l}}, wherein t-l is a historical time step and L is the length of the history used.
Further, the constructing the policy network and the state value network and training by reinforcement learning specifically includes:
in the training process, the strategy network, the cost function network and the dynamic belief network are simultaneously optimized through loss of PPO and belief loss;
the total loss function is:
wherein,as a function of the loss of PPO,
is a belief loss function;
KL divergence>
Is a priori probability distribution;
is the action predicted by the agent;
is the action predicted by other agents.
In order to solve the above technical problem, the present disclosure also provides a distributed multi-agent collaboration system, including:
a history register construction module for constructing an observation history register by storing the observation states of a specific number of steps in the past history; the history register continuously receives new history states along with the progress of interaction of the intelligent agent and the environment, and discards early history states exceeding capacity limit;
the history background network construction module is used for constructing a history background network, wherein the input of the history background network is the current observation state and the history states in the history register, and the history background network outputs a history background embedded state through data mining and fusion;
the implicit variable reasoning network construction module is used for constructing an implicit variable reasoning network, wherein the input of the implicit variable reasoning network is a historical background embedded state, and the beliefs of other intelligent agents are modeled as Gaussian distribution;
the training module is used for constructing a strategy network and a state value network and training through reinforcement learning, wherein the inputs of the strategy network and the state value network are belief embedding and current observation states, and the outputs of the strategy network and the state value network are strategy distribution and state values.
Further, the training module is further configured to:
and calculating a loss function through a PPO algorithm, and updating the strategy network and the state value network.
Further, the historical background network construction module is specifically configured to dynamically learn beliefs from the most recent historical steps by using a soft attention mechanism, and to introduce an observation embedding network and an action embedding network;
the observation embedding network and the action embedding network receive as input an observation or an action encoded in one-hot form, and generate an embedding of the observation or action.
To achieve the above technical object, the present disclosure also provides a computer storage medium having a computer program stored thereon, which when executed by a processor is configured to implement the steps of the distributed multi-agent collaboration method described above.
To achieve the above technical purpose, the present disclosure further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the steps of the distributed multi-agent cooperation method described above.
The beneficial effects of the present disclosure are:
the present disclosure provides a distributed multi-agent cooperation method and system based on dynamic belief, which can dynamically model strategies of other agents in multi-agent distributed learning, and reduce instability of learning systems. The historical background provided by the method can quickly mine the current relevant historical state and is used as a reference for predicting the current behaviors of other intelligent agents. The implicit variation reasoning provided by the present disclosure can cope with randomness of different agents through belief distribution modeling. The dynamic belief reasoning provided by the present disclosure can be combined with reinforcement learning algorithms of various references, and has wide applicability.
Drawings
FIG. 1 shows a schematic diagram of a method of embodiment 1 of the present disclosure;
FIG. 2 shows a flow diagram of a method of embodiment 1 of the present disclosure;
FIG. 3 shows a schematic structural diagram of a system of embodiment 2 of the present disclosure;
fig. 4 shows a schematic structural diagram of embodiment 4 of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
Various structural schematic diagrams according to embodiments of the present disclosure are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated for clarity of presentation and may have been omitted. The shapes of the various regions, layers and relative sizes, positional relationships between them shown in the drawings are merely exemplary, may in practice deviate due to manufacturing tolerances or technical limitations, and one skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions as actually required.
Embodiment one:
as shown in fig. 1 and 2:
the present disclosure provides a distributed multi-agent collaboration method, comprising:
s101: constructing an observation history register by storing the observation states of a specific number of steps in the past history; the history register continuously receives new history states along with the progress of interaction of the intelligent agent and the environment, and discards early history states exceeding capacity limit;
s102: constructing a history background network, wherein the input of the history background network is a current observation state and a history state in a history register, and the output history background embedded state of the history background network is obtained through data mining and fusion;
s103: constructing an implicit variation reasoning network, wherein the input of the implicit variation reasoning network is a historical background embedded state, and the beliefs of other intelligent agents are modeled as Gaussian distribution;
s104: constructing a strategy network and a state value network and training through reinforcement learning, wherein the input of the strategy network and the state value network is belief embedding and the current observation state, and the output of the strategy network and the state value network is strategy distribution and state value.
Further, after constructing the policy network and the state value network, the method further includes:
and calculating a loss function through a PPO algorithm, and updating the strategy network and the state value network.
Further, the constructing of the history background network specifically includes:
dynamically learning beliefs from the most recent historical steps by using a soft attention mechanism, and introducing an observation embedding network and an action embedding network;
the observation embedding network and the action embedding network receive as input an observation or an action encoded in one-hot form, and generate an embedding of the observation or action.
Further, the constructing of the history background network further includes:
discarding a portion of uncorrelated time steps during training and execution by using an adaptive dropout operation;
wherein the adaptive dropout depends on the cosine similarity between the historical observation embedding sequence and the current observation embedding, and the historical steps whose embeddings are most similar to the current observation embedding are retained;
wherein the retained set is the subset consisting of the top p·L historical steps selected according to this similarity, and p denotes the proportion of time steps retained, p ∈ (0, 1];
the current observation o_t and the historical observations o_{t-l} are encoded by the observation embedding network, generating the current observation embedding e_t and the historical observation embedding sequence {e_{t-l}}, wherein t-l is a historical time step and L is the length of the history used.
Further, the constructing the policy network and the state value network and training by reinforcement learning specifically includes:
in the training process, the strategy network, the cost function network and the dynamic belief network are simultaneously optimized through loss of PPO and belief loss;
the total loss function is:
L = L_PPO + L_belief;
wherein L_PPO is the PPO loss function;
L_belief is the belief loss function, which comprises the action prediction error and the KL divergence D_KL(q(z) || p(z)) between the belief distribution q(z) and the prior;
p(z) is the prior probability distribution;
â is the action predicted by the agent for the other agents;
a is the action of the other agents.
The distributed multi-agent cooperation method based on dynamic beliefs comprises the following steps:
S1, constructing an observation history register H_t = {o_{t-K}, ..., o_{t-1}}, wherein K is the capacity of the register and o_{t-l} is the observation state of the (t-l)-th time step. As the agent interacts with the environment, the register continues to accept new historical states and discard early historical states that exceed the capacity limit.
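As an illustration only, such a bounded register can be sketched with a fixed-capacity deque; the class and parameter names below are hypothetical and not taken from the patent text.

```python
from collections import deque

class ObservationHistoryRegister:
    """Minimal sketch of a bounded observation-history register (capacity K)."""

    def __init__(self, capacity):
        # deque(maxlen=K) automatically discards the earliest entry
        # once the capacity limit is exceeded.
        self.buffer = deque(maxlen=capacity)

    def push(self, observation, other_actions=None):
        # Store the new historical state produced by one interaction step.
        self.buffer.append((observation, other_actions))

    def contents(self):
        # Return the stored history, oldest first.
        return list(self.buffer)

# Usage (illustrative): register = ObservationHistoryRegister(capacity=512)
#                        register.push(obs_t, other_agent_actions_t)
```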
S2, constructing a historical background network. The input is the current observation state and the history state in the history register, and the history background embedded state is output through data mining and fusion.
The historical background utilizes a soft attention mechanism to dynamically learn beliefs based on the most recent historical steps. First, an observation embedding network and an action embedding network are introduced. They receive as input observations or actions encoded in one-hot form and generate an embedding of the observation or action.
In each step t, the current observation o_t and the historical observations o_{t-l} are encoded by the observation embedding network, generating the current observation embedding e_t and the historical observation embedding sequence {e_{t-l}}, wherein t-l is a historical time step and L is the length of the history used. For dynamic belief inference, the input history may cover hundreds to thousands of time steps, and training over such long sequences at every current step would consume a significant amount of time and memory. Considering that most of the historical steps are only weakly correlated with the current state, the present disclosure uses an adaptive dropout operation to discard a portion of the uncorrelated time steps during training and execution. The adaptive dropout depends on the cosine similarity between the historical observation embedding sequence and the current observation embedding;
wherein the retained set is the subset consisting of the top p·L historical steps selected according to this similarity, and p denotes the proportion of time steps retained, p ∈ (0, 1].
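A minimal PyTorch sketch of this adaptive dropout follows, under the assumption that the selection rule is "keep the ⌈pL⌉ historical embeddings most cosine-similar to the current embedding"; tensor names and shapes are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

def adaptive_dropout(hist_embed, cur_embed, p=0.25):
    """Keep only the ceil(p*L) historical steps whose observation embeddings
    are most cosine-similar to the current observation embedding.

    hist_embed: (L, d) historical observation embeddings
    cur_embed:  (d,)   current observation embedding
    """
    L = hist_embed.size(0)
    k = max(1, math.ceil(p * L))
    # Cosine similarity between each historical embedding and the current one.
    sim = F.cosine_similarity(hist_embed, cur_embed.unsqueeze(0), dim=-1)  # (L,)
    keep = sim.topk(k).indices.sort().values  # keep temporal order of the retained steps
    return hist_embed[keep], keep
```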
After down-sampling the historical steps, the historical context module processes the embeddings of the current state and of the historical states, including the observations and the actions of the other observed agents, and outputs a context embedding. It is thereby forced to capture only the relevant information, which will be used to generate the belief embedding and to predict the actions of the other agents. To this end, the present disclosure uses a soft attention mechanism between the current state and the historical states. The module first computes a query vector q_t from the current state embedding e_t, which is then used to attend over the retained historical states. In order to capture sufficient information about a historical state, the present disclosure computes the state embedding of a historical step by element-wise addition, i.e. h_{t-l} = e_{t-l} + u_{t-l}, wherein the action embedding u_{t-l} is obtained by passing the one-hot encoded actions of the other agents through a linear layer. Then, for each historical state, a key vector k_{t-l} and a value vector v_{t-l} are computed from the corresponding historical state embedding.
The context embedding is formed by adding the current state embedding and a weighted summary of the value vectors over the different historical states:
c_t = e_t + Σ_l α_{t-l} v_{t-l};
wherein α_{t-l} are the attention weights, calculated by scaling the inner products of the query vector and the key vectors:
α_{t-l} = softmax_l( q_t · k_{t-l} / √d ), wherein d is the embedding dimension.
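A minimal sketch of this soft attention in PyTorch; the module name, single-head form, and projection layers are assumptions made for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class HistoricalContext(nn.Module):
    """Soft attention over historical state embeddings (observation + action,
    combined by element-wise addition), producing a context embedding."""

    def __init__(self, d):
        super().__init__()
        self.query = nn.Linear(d, d)
        self.key = nn.Linear(d, d)
        self.value = nn.Linear(d, d)
        self.scale = d ** 0.5

    def forward(self, cur_embed, hist_obs_embed, hist_act_embed):
        # cur_embed: (d,), hist_obs_embed / hist_act_embed: (L, d)
        hist_state = hist_obs_embed + hist_act_embed           # element-wise addition
        q = self.query(cur_embed)                               # (d,)
        k = self.key(hist_state)                                # (L, d)
        v = self.value(hist_state)                              # (L, d)
        attn = torch.softmax(k @ q / self.scale, dim=0)         # (L,) attention weights
        # Context = current state embedding + weighted summary of the value vectors.
        return cur_embed + attn @ v
```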
s3, constructing implicit variation reasoning. The input is historical background embedding, and the output is belief embedding and predicting other agent behaviors.
Since the policies of other agents are not fixed, the present disclosure assumes that the belief z about the other agents follows a Gaussian distribution N(μ, σ²), wherein μ and σ² represent the mean and the variance, respectively. Thus, an encoder function is introduced to predict μ and σ from the historical background embedding, and the belief is then sampled from N(μ, σ²). In order to make the sampling differentiable with respect to μ and σ, the reparameterization trick is used: z = μ + σ ⊙ ε, wherein ε is sampled from a standard normal distribution.
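A minimal PyTorch sketch of the belief encoder and the reparameterization trick described above; predicting the log-variance instead of the variance is an implementation assumption, not something stated in the patent.

```python
import torch
import torch.nn as nn

class BeliefEncoder(nn.Module):
    """Encode the context embedding into a Gaussian belief and sample from it
    with the reparameterization trick, so gradients flow to mu and sigma."""

    def __init__(self, d, belief_dim):
        super().__init__()
        self.mu = nn.Linear(d, belief_dim)
        self.log_var = nn.Linear(d, belief_dim)  # log-variance for numerical stability

    def forward(self, context):
        mu = self.mu(context)
        log_var = self.log_var(context)
        sigma = torch.exp(0.5 * log_var)
        eps = torch.randn_like(sigma)            # sampled from a standard normal
        z = mu + sigma * eps                     # reparameterized belief sample
        return z, mu, log_var
```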
Based on the belief z, a decoder function composed of linear layers is used to predict the teammates' actions. Viewing the observed teammate actions a as sampled data points, the learning objective can be formulated as maximizing the likelihood function log p(a | o). However, this likelihood cannot be optimized directly. Similar to the ELBO, its lower bound is:
log p(a | o) ≥ E_{z∼q(z)}[ log p(a | z, o) ] − D_KL( q(z) || p(z) );
wherein D_KL(·||·) is the KL divergence and p(z) is the prior probability distribution, defined in the methods of the present disclosure as a standard normal distribution. Thus, maximizing the likelihood can be converted into maximizing this lower bound. Specifically, maximizing E_{z∼q(z)}[ log p(a | z, o) ] is equivalent to minimizing the mean square error:
Σ_j m^j ‖ â^j − a^j ‖²;
wherein â^j is the action of the other agent j predicted by the decoder, a^j is the action of the other agent j observed from the environment, and m^j is the visibility of the other agent j obtained from the environment, which is used to ignore the prediction errors of invisible teammates. Specifically, m^j is 1 if the other agent j is visible, and 0 otherwise. The belief loss is accordingly the sum of this prediction error and the KL divergence D_KL( q(z) || p(z) ).
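A minimal sketch of the resulting belief loss, assuming the KL term is taken against a standard normal prior and the visibility mask simply zeroes out invisible teammates; the exact weighting between the terms is not specified in the text and is an assumption here.

```python
import torch

def belief_loss(pred_actions, true_actions, visibility, mu, log_var, kl_weight=1.0):
    """pred_actions, true_actions: (N, A) per-teammate action vectors
    visibility: (N,) float mask, 1.0 if the teammate is visible, else 0.0
    mu, log_var: parameters of the Gaussian belief q(z)
    """
    # Mean-squared prediction error, ignoring invisible teammates.
    per_agent_mse = ((pred_actions - true_actions) ** 2).mean(dim=-1)      # (N,)
    mse = (visibility * per_agent_mse).sum() / visibility.sum().clamp(min=1)
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form.
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return mse + kl_weight * kl
```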
S4, constructing a strategy network and a state value network, and training through reinforcement learning. The following describes how to incorporate the proposed dynamic belief learning method into the PPO algorithm to achieve efficient distributed multi-agent collaborative learning. The present disclosure follows the algorithmic structure of PPO, learning a policy network and a state value function network for each agent. Since making decisions based only on an agent's individual observations would lead to training instability, the present disclosure conditions both the policy and the value function on the observations and on the learned belief embedding of the other agents. The policy and the value function within each agent share the same belief embedding for information reuse and robust training. Specifically, at each time step t, the policy network first processes the current observation o_t with a base network, and the current state is then combined with the inferred belief information by a belief fusion module. The belief fusion is realized by concatenating the belief embedding produced by the dynamic belief network with the state embedding produced by the base network, and then applying a linear transformation. A final policy over the action space is then generated through another linear layer. The state value function network has the same structure as the policy network, but its output size is 1. During the training process, the strategy network, the value function network and the dynamic belief network are simultaneously optimized through the PPO loss and the belief loss. The total loss function is:
L = L_PPO + L_belief;
wherein L_PPO is the PPO loss function and L_belief is the belief loss function.
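A minimal sketch of the belief fusion inside the policy network and of the combined update, under the assumptions that the PPO loss comes from a standard clipped-surrogate implementation (not shown) and that the total loss is the simple sum of the PPO loss and the belief loss; all module names are illustrative.

```python
import torch
import torch.nn as nn

class PolicyWithBelief(nn.Module):
    """Policy head that fuses the observation embedding with the belief embedding
    by concatenation followed by a linear transformation, as described above."""

    def __init__(self, obs_dim, belief_dim, hidden_dim, n_actions):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.fuse = nn.Linear(hidden_dim + belief_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, n_actions)  # value network: same structure, output size 1

    def forward(self, obs, belief):
        state = self.base(obs)
        fused = torch.relu(self.fuse(torch.cat([state, belief], dim=-1)))
        return torch.distributions.Categorical(logits=self.head(fused))

# Joint update (sketch): the PPO loss and the belief loss are assumed to be
# computed elsewhere; the total loss is their sum.
# total_loss = ppo_loss + belief_loss_value
# optimizer.zero_grad(); total_loss.backward(); optimizer.step()
```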
Embodiment two:
as shown in fig. 3:
in order to solve the above technical problem, the present disclosure also provides a distributed multi-agent collaboration system, including:
a history register constructing module 201 for constructing an observation history register by storing the observation states of a specific number of steps in the past history; the history register continuously receives new history states along with the progress of interaction of the intelligent agent and the environment, and discards early history states exceeding capacity limit;
the history background network construction module 202 is configured to construct a history background network, wherein the input of the history background network is the current observation state and the history states in the history register, and the history background network outputs a history background embedded state through data mining and fusion;
the implicit variation reasoning network construction module 203 is configured to construct an implicit variation reasoning network, where an input of the implicit variation reasoning network is a historical background embedded state, and beliefs of other intelligent agents are modeled as gaussian distributions;
the training module 204 is configured to construct a policy network and a state value network, and train the policy network and the state value network through reinforcement learning, wherein inputs of the policy network and the state value network are belief embedding and a current observation state, and outputs of the policy network and the state value network are policy distribution and state value.
The history register constructing module 201 is connected with the history background network constructing module 202, the implicit variation reasoning network constructing module 203 and the training module 204 in sequence.
Further, the training module 204 is further configured to:
and calculating a loss function through a PPO algorithm, and updating the strategy network and the state value network.
Further, the history background network construction module 202 is specifically configured to dynamically learn beliefs from the most recent historical steps by using a soft attention mechanism, and to introduce an observation embedding network and an action embedding network;
the observation embedding network and the action embedding network receive as input an observation or an action encoded in one-hot form, and generate an embedding of the observation or action.
Embodiment III:
the present disclosure can also provide a computer storage medium having stored thereon a computer program for implementing the steps of the distributed multi-agent collaboration method described above when executed by a processor.
The computer storage media of the present disclosure may be implemented using semiconductor memory, magnetic core memory, drum memory, or magnetic disk memory.
Semiconductor memory devices used in computers mainly include two types: MOS and bipolar. MOS devices have high integration density, a simple process, and lower speed. Bipolar devices have a complex process, high power consumption, low integration density, and high speed. After the advent of NMOS and CMOS, MOS memories began to dominate semiconductor memories. NMOS is fast; for example, the access time of a 1K-bit SRAM from Intel Corporation is 45 ns. CMOS has low power consumption, and the access time of a 4K-bit CMOS static memory is 300 ns. These semiconductor memories are all random access memories (RAM), i.e. new contents can be read and written randomly during operation. Semiconductor read-only memory (ROM), by contrast, can be read randomly but not written during operation, and is used to store fixed programs and data. ROM is further divided into two types: non-rewritable fuse-type read-only memory (PROM) and rewritable read-only memory (EPROM).
The magnetic core memory has the characteristics of low cost and high reliability, with more than 20 years of practical experience. Core memory was widely used as main memory before the mid-1970s. Its storage capacity can reach above 10 bits, and its access time is as fast as 300 ns. The capacity of a typical international core memory is 4 MS-8 MB, with access cycles of 1.0-1.5 µs. After the rapid development of semiconductor memory displaced core memory as main memory, core memory can still be applied as large-capacity expansion memory.
A magnetic drum memory, an external memory for magnetic recording. Because of its fast information access speed, it works stably and reliably, and although its capacity is smaller, it is gradually replaced by disk memory, but it is still used as external memory for real-time process control computers and middle and large-sized computers. In order to meet the demands of small-sized and microcomputer, a microminiature magnetic drum has appeared, which has small volume, light weight, high reliability and convenient use.
A magnetic disk memory, an external memory for magnetic recording. It has the advantages of both drum and tape storage, i.e. its storage capacity is greater than that of drum, and its access speed is faster than that of tape storage, and it can be stored off-line, so that magnetic disk is widely used as external memory with large capacity in various computer systems. Magnetic disks are generally classified into hard disks and floppy disk storage.
Hard disk memories are of a wide variety. Structurally, they are divided into replaceable and fixed types. A replaceable disk platter can be exchanged, whereas a fixed disk platter is fixed. Both replaceable and fixed magnetic disks come in multi-platter and single-platter structures, and both can be divided into fixed-head and movable-head types. The fixed-head magnetic disk has a small capacity, a low recording density, a high access speed, and a high cost. The movable-head magnetic disk has a high recording density (up to 1000-6250 bits per inch) and thus a large capacity, but its access speed is lower than that of a fixed-head disk. The storage capacity of a disk product may be up to several hundred megabytes, with a bit density of 6 bits per inch and a track density of 475 tracks per inch. Since the disk pack of a disk memory can be replaced, the disk memory offers both large capacity and high speed, can store large volumes of information and data, and is widely applied in online information retrieval systems and database management systems.
Embodiment four:
the present disclosure also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the distributed multi-agent collaboration method described above when the computer program is executed by the processor.
Fig. 4 is a schematic diagram of the internal structure of an electronic device in one embodiment. As shown in fig. 4, the electronic device includes a processor, a storage medium, a memory, and a network interface connected by a system bus. The storage medium of the computer device stores an operating system, a database, and computer readable instructions; the database may store a control information sequence, and the computer readable instructions, when executed by the processor, cause the processor to implement a distributed multi-agent cooperation method. The processor of the electronic device is used to provide computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform the distributed multi-agent cooperation method. The network interface of the computer device is used for communicating with an external terminal. It will be appreciated by persons skilled in the art that the architecture shown in fig. 4 is merely a block diagram of part of the architecture relevant to the present inventive arrangements and does not limit the computer device to which the present inventive arrangements are applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
The electronic device includes, but is not limited to, a smart phone, a computer, a tablet computer, a wearable smart device, an artificial intelligence device, a mobile power supply, and the like.
The processor may in some embodiments be comprised of integrated circuits, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functionality, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, a combination of various control chips, and the like. The processor is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device and processes data by running or executing programs or modules stored in the memory (for example, executing remote data read-write programs, etc.), and calling data stored in the memory.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory and at least one processor or the like.
Fig. 4 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 is not limiting of the electronic device and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power source (such as a battery) for supplying power to the respective components, and preferably, the power source may be logically connected to the at least one processor through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device may further include various sensors, bluetooth modules, wi-Fi modules, etc., which are not described herein.
Further, the electronic device may also include a network interface, optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device and for displaying a visual user interface.
Further, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
In several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus, device, and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (8)

1. A distributed multi-agent collaboration method, comprising:
constructing an observation history register by storing the observation states of a specific number of steps in the past history; the history register continuously receives new history states along with the progress of interaction of the intelligent agent and the environment, and discards early history states exceeding capacity limit;
constructing a history background network, wherein the input of the history background network is a current observation state and a history state in a history register, and the output history background embedded state of the history background network is obtained through data mining and fusion;
the constructing of the history background network specifically comprises:
dynamically learning beliefs from the most recent historical steps by using a soft attention mechanism, and introducing an observation embedding network and an action embedding network;
the observation embedding network and the action embedding network receive as input an observation or an action encoded in one-hot form, and generate an embedding of the observation or action;
constructing an implicit variation reasoning network, wherein the input of the implicit variation reasoning network is a historical background embedded state, and the beliefs of other intelligent agents are modeled as Gaussian distribution;
constructing a strategy network and a state value network and training through reinforcement learning, wherein the input of the strategy network and the state value network is belief embedding and the current observation state, and the output of the strategy network and the state value network is strategy distribution and state value.
2. The method of claim 1, wherein after constructing the policy network and the state value network, the method further comprises:
and calculating a loss function through a PPO algorithm, and updating the strategy network and the state value network.
3. The method of claim 2, wherein the constructing of the history background network further comprises:
discarding a portion of uncorrelated time steps during training and execution by using an adaptive dropout operation;
wherein the adaptive dropout depends on the cosine similarity between the historical observation embedding sequence and the current observation embedding, and the historical steps whose embeddings are most similar to the current observation embedding are retained;
wherein the retained set is the subset consisting of the top p·L historical steps selected according to this similarity, and p represents the proportion of time steps retained, p ∈ (0, 1];
the current observation o_t and the historical observations o_{t-l} are encoded by the observation embedding network, generating the current observation embedding e_t and the historical observation embedding sequence {e_{t-l}}, wherein t-l is a historical time step and L is the length of the history used.
4. The method of claim 1, wherein constructing a policy network and a state value network and training by reinforcement learning specifically comprises:
in the training process, the strategy network, the cost function network and the dynamic belief network are simultaneously optimized through loss of PPO and belief loss;
the total loss function L is:
L = L_PPO + L_belief;
wherein L_PPO is the PPO loss function;
L_belief is the belief loss function, which comprises the action prediction error and the KL divergence D_KL(q(z) || p(z)) between the belief distribution q(z) and the prior;
p(z) is the prior probability distribution;
â is the action predicted by the agent for the other agents;
a is the action of the other agents.
5. A distributed multi-agent collaboration system, comprising:
a history register construction module for constructing an observation history register by storing the observation states of a specific number of steps in the past history; the history register continuously receives new history states along with the progress of interaction of the intelligent agent and the environment, and discards early history states exceeding capacity limit;
the history background network construction module is used for constructing a history background network, wherein the input of the history background network is the current observation state and the history states in the history register, and the history background network outputs a history background embedded state through data mining and fusion;
the history background network construction module is specifically used for dynamically learning beliefs from the most recent historical steps by using a soft attention mechanism, and for introducing an observation embedding network and an action embedding network;
the observation embedding network and the action embedding network receive as input an observation or an action encoded in one-hot form, and generate an embedding of the observation or action;
the implicit variable reasoning network construction module is used for constructing an implicit variable reasoning network, wherein the input of the implicit variable reasoning network is a historical background embedded state, and the beliefs of other intelligent agents are modeled as Gaussian distribution;
the training module is used for constructing a strategy network and a state value network and training through reinforcement learning, wherein the inputs of the strategy network and the state value network are belief embedding and current observation states, and the outputs of the strategy network and the state value network are strategy distribution and state values.
6. The system of claim 5, wherein the training module is further configured to:
and calculating a loss function through a PPO algorithm, and updating the strategy network and the state value network.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps corresponding to the distributed multi-agent collaboration method of any one of claims 1-4 when the computer program is executed.
8. A computer storage medium having stored thereon computer program instructions, which when executed by a processor are adapted to carry out the steps corresponding to the distributed multi-agent collaboration method as claimed in any one of claims 1 to 4.
CN202310538318.4A 2023-05-15 2023-05-15 Distributed multi-agent cooperation method, system, medium and equipment Active CN116578636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310538318.4A CN116578636B (en) 2023-05-15 2023-05-15 Distributed multi-agent cooperation method, system, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310538318.4A CN116578636B (en) 2023-05-15 2023-05-15 Distributed multi-agent cooperation method, system, medium and equipment

Publications (2)

Publication Number Publication Date
CN116578636A CN116578636A (en) 2023-08-11
CN116578636B CN116578636B (en) 2023-12-15

Family

ID=87539102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310538318.4A Active CN116578636B (en) 2023-05-15 2023-05-15 Distributed multi-agent cooperation method, system, medium and equipment

Country Status (1)

Country Link
CN (1) CN116578636B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3832420A1 (en) * 2019-12-06 2021-06-09 Elektrobit Automotive GmbH Deep learning based motion control of a group of autonomous vehicles
CN115840240A (en) * 2022-11-29 2023-03-24 广东工业大学 Automatic driving positioning method and system based on LSTM deep reinforcement learning
CN115965086A (en) * 2022-12-14 2023-04-14 东北大学 Non-perfect information game strategy enhancement method based on small sample opponent modeling

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3832420A1 (en) * 2019-12-06 2021-06-09 Elektrobit Automotive GmbH Deep learning based motion control of a group of autonomous vehicles
CN115840240A (en) * 2022-11-29 2023-03-24 广东工业大学 Automatic driving positioning method and system based on LSTM deep reinforcement learning
CN115965086A (en) * 2022-12-14 2023-04-14 东北大学 Non-perfect information game strategy enhancement method based on small sample opponent modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAVEN: Multi-Agent Variational Exploration; Anuj Mahajan et al.; Advances in Neural Information Processing Systems 32; full text *
Several key scientific problems in multi-agent deep reinforcement learning; Sun Changyin; Mu Chaoxu; Acta Automatica Sinica (No. 07); full text *

Also Published As

Publication number Publication date
CN116578636A (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN101369220B (en) Non-linear stochastic processing storage device
Yang et al. Ai-powered green cloud and data center
CN111126700B (en) Energy consumption prediction method, device, equipment and storage medium
CN106020715A (en) Storage pool capacity management
CN105518656A (en) A cognitive neuro-linguistic behavior recognition system for multi-sensor data fusion
CN107103363B (en) A kind of construction method of the software fault expert system based on LDA
CN107870810B (en) Application cleaning method and device, storage medium and electronic equipment
Goodwin et al. A pattern recognition approach for peak prediction of electrical consumption
Liu et al. Integrating artificial bee colony algorithm and BP neural network for software aging prediction in IoT environment
CN112328646B (en) Multitask course recommendation method and device, computer equipment and storage medium
CN116047934B (en) Real-time simulation method and system for unmanned aerial vehicle cluster and electronic equipment
Alva et al. Use cases for district-scale urban digital twins
CN111538852B (en) Multimedia resource processing method, device, storage medium and equipment
CN113219341A (en) Model generation and battery degradation estimation device, method, medium, and apparatus
CN115934344A (en) Heterogeneous distributed reinforcement learning calculation method, system and storage medium
CN107608778A (en) Application program management-control method, device, storage medium and electronic equipment
CN116578636B (en) Distributed multi-agent cooperation method, system, medium and equipment
Wang et al. The criticality of spare parts evaluating model using artificial neural network approach
CN116822577A (en) Data generation system, method, medium and equipment
Tian et al. Method for predicting the remaining mileage of electric vehicles based on dimension expansion and model fusion
CN115905293A (en) Switching method and device of job execution engine
Luo et al. Sampling-based adaptive bounding evolutionary algorithm for continuous optimization problems
CN113516368A (en) Method, device, equipment and medium for predicting uncertainty risk of city and community
Liu et al. Towards dynamic reconfiguration of composite services via failure estimation of general and domain quality of services
WO2021115269A1 (en) User cluster prediction method, apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant