CN115099124A - Multi-agent distributed collaborative training simulation method - Google Patents

Multi-agent distributed collaborative training simulation method

Info

Publication number
CN115099124A
CN115099124A
Authority
CN
China
Prior art keywords
training
simulation
information
model
computing
Prior art date
Legal status
Pending
Application number
CN202210549487.3A
Other languages
Chinese (zh)
Inventor
张连怡
陈秋瑞
王清云
余立新
Current Assignee
Beijing Simulation Center
Original Assignee
Beijing Simulation Center
Priority date
Filing date
Publication date
Application filed by Beijing Simulation Center filed Critical Beijing Simulation Center
Priority to CN202210549487.3A priority Critical patent/CN115099124A/en
Publication of CN115099124A publication Critical patent/CN115099124A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]


Abstract

The invention discloses a multi-agent distributed collaborative training simulation method and designs a multi-agent distributed collaborative training simulation system, which comprises: computing-end devices, distributed at different computing nodes, for executing the adversarial deduction and intelligent decision processes; a client device for calling each computing-end device to realize the adversarial deduction; and a server-side device for performing user-defined parameter setting, connecting the client device and the computing-end devices to transmit information, and starting the adversarial training scheduling. The invention solves the problem that, as the scale and complexity of models in multi-agent training environments continuously develop in depth, entity models become increasingly complex and the simulation computation load increasingly exceeds the capability of a single processor; it improves the computing capability of the training simulation environment in the multi-agent training process and realizes service-oriented, efficient intelligent training simulation while making full use of resources.

Description

Multi-agent distributed collaborative training simulation method
Technical Field
The invention relates to multi-agent training simulation, and in particular to a multi-agent distributed collaborative training simulation method in a distributed environment.
Background
Current research on solving multi-agent systems with deep reinforcement learning is deepening step by step. From the perspective of the problem being solved, multi-agent research can be divided into three categories: fully cooperative tasks, fully competitive tasks, and mixed competitive-cooperative tasks. In a fully cooperative task, multiple agents learn by interacting with the environment; each agent receives a global reward, and even when agents receive individual rewards, a global reward can be formed by weighted summation or similar means. The learning goal for this type of task is to maximize the discounted cumulative global reward, i.e., all agents cooperate to maximize the global reward. Two lines of solution exist. The first applies single-agent reinforcement learning: all agents' actions are treated as one action vector, and a policy is learned that outputs this action vector in each state so that the discounted cumulative reward is maximized. The second is multi-agent reinforcement learning: each agent learns independently and decides its own action, and the experiences of all agents are then processed together, i.e., decentralized execution with centralized training. In practice, particularly in the military operations domain, inter-group confrontations of the zero-sum-game type are common, and the multi-agent cooperation problem arises there as well. Compared with single-agent reinforcement learning, multi-agent cooperation has higher complexity: on one hand, the policy space grows exponentially with the number of agents; on the other hand, with the addition of heterogeneous agents, communication, collaboration, and coordination among agents become more important. All of the above methods perform multi-agent model training in a centralized simulation environment.
As the scale and complexity of simulation systems continue to develop in depth, entity models become increasingly complex and the simulation computation load increasingly exceeds the capability of a single processor. Simulation computation includes the iterative calculation of the simulation system state as it changes over time, as well as the simulation of physical-environment entities within a certain range, such as solving the differential equations of continuous simulation subsystems, computing the characteristic parameters of simulation entities, and iterating over complex electromagnetic environments in wireless signal transmission and radar detection. Remote service invocation technology can wrap simulation models and programs as services so that complex simulation computation is executed at a remote end, forming a long-running service process that listens on a published port. Based on remote service invocation, a simulation program service can be called directly from the local side and the corresponding computation result obtained.
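As a minimal illustration of this pattern (the patent names no concrete RPC framework; Python's standard xmlrpc library and the address, port, instance name, and method below are all assumptions), a simulation model can be wrapped as a long-running service:

from xmlrpc.server import SimpleXMLRPCServer

class RadarModelService:
    """Hypothetical service wrapper around a compute-heavy simulation model."""
    def step(self, name, dt):
        # ... iterate the model's state equations for one time step ...
        return {"instance": name, "advanced_by": dt}

server = SimpleXMLRPCServer(("0.0.0.0", 8002), allow_none=True)
server.register_instance(RadarModelService())
server.serve_forever()  # long-running service process listening on its published port

On the client side, the remote computation is then invoked as if it were a local function:

from xmlrpc.client import ServerProxy
radar = ServerProxy("http://192.168.1.11:8002", allow_none=True)
result = radar.step("sensor_0", 0.1)  # the heavy computation executes remotely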
Therefore, the invention provides a multi-agent distributed collaborative training simulation method based on remote procedure call in a distributed environment.
Disclosure of Invention
The invention aims to provide a multi-agent distributed collaborative training simulation method based on remote procedure call in a distributed environment. It solves the problem that, as the scale and complexity of models in multi-agent training environments continuously develop in depth, entity models become increasingly complex and the simulation computation load increasingly exceeds the capability of a single processor; it improves the computing capability of the training simulation environment in the multi-agent training process and realizes service-oriented, efficient intelligent training simulation while making full use of resources.
In order to achieve at least one of these purposes, the invention adopts the following technical scheme:
In a first aspect, the invention provides a distributed parallel multi-agent collaborative training simulation system, which comprises:
computing-end devices, distributed at different computing nodes, for executing the adversarial deduction and intelligent decision-making processes;
a client device for calling each computing-end device to realize the adversarial deduction; and
a server-side device for performing user-defined parameter setting and connecting the client device and the computing-end devices to transmit information so as to start the adversarial deduction scheduling; for example, the server-side device transmits start information, instructs the client device to start the adversarial training scheduling, and calls each computing-end device to realize the adversarial deduction.
Optionally, the client device comprises:
a simulation engine for loading the battle scenario, outputting raw battlefield observation data, receiving the action information output by the agent decision algorithm, loading the information flow, loading the experiment framework, and instantiating the simulation model components;
a simulation model component agent for reporting model component information to the simulation engine, maintaining an event queue ordered by time, processing the initialization, simulation advance, and destruction events generated by the model component, requesting time advance from the simulation engine when the time changes, receiving the execution actions output by the agent decision algorithm, invoking the model component calculation program by remote procedure call, and sending model component state update messages to the simulation engine.
The computing-end device comprises:
an agent decision algorithm module for receiving the observation and reward information produced by the simulation engine, executing the agent decision algorithm to generate an execution action, and outputting the execution action to the training scheduler to obtain the current state, action, reward, and next-state information of the agent model;
a model component calculation program that encapsulates the relevant functional processes of each simulation model and performs the actual physical process calculation.
The server-side device comprises a training scheduler for starting the adversarial training scheduling, and a training controller for setting the training scenario, training interface, training parameters, and training algorithm.
Optionally, the client device further includes a simulation call interface serving as the interactive interface between the training scheduler and the simulation engine, including scenario selection, scenario/model distribution, and run control interfaces.
Optionally, the computing-end device further includes an observation construction module for providing an observation construction function to the simulation engine, constructing the observation and reward information of each agent decision algorithm, and generating from the raw state data an observation data organization that meets the requirements of the agent decision algorithm.
optionally, the simulation engine is further configured to schedule simulation runs, perform time management and data distribution management, and provide a user interface for configuring a scheduling mode of each model component.
Optionally, the time management processes the time advance requests of all training schedulers and calculates the time value by which the run advances; the data distribution management receives the port data sent by the training scheduler and distributes the data according to the port connection relations defined in the information flow.
Optionally, the client includes a configuration file, in which the category, the IP address, and the port number of the computing device are indicated.
In a second aspect, the invention provides a distributed parallel multi-agent collaborative training simulation method, which comprises the following steps:
S10, the training controller on the server-side device sets the training scenario, training interface, training parameters, and training algorithm;
S20, the relevant functional processes in the simulation models are extracted at the computing-end devices and encapsulated into different simulation model calculation programs, and the resulting programs are started on different computing-end devices to obtain different simulation model calculation processes;
S30, the IP addresses corresponding to the simulation model calculation processes and the port numbers they listen on are written into a client configuration file;
S40, the client reads the client configuration file and establishes a link with each simulation model calculation process; the training controller of the server-side device communicates with the client to transmit the simulation training scenario information;
S50, the client creates the simulation model component agents according to the set simulation scheme and sends the model calculation initialization parameters to the simulation model calculation processes on the computing-end devices;
S60, after the client device receives the initialization-completion responses from the computing-end devices, the simulation engine takes over unified management and the actual training process is carried out;
S70, after one round of simulation training is finished, if the set training time is reached, the training scheduler acquires a group of data from the experience pool, calculates the corresponding target global value function values, performs model learning, and updates the action policy;
and S80, if the whole training is finished or the learning update frequency is reached, the training ends and the whole network parameters are updated.
Optionally, the training interface setting in step S10 includes: observation information, action information, reward information.
Optionally, the step S60 training process includes:
S601, the simulation engine of the client device calls, according to the current time step, the relevant functions of each simulation model component agent to simulate the corresponding time step; these functions send remote service call requests to the simulation model calculation processes on the corresponding computing-end devices and pass the parameters to them;
S602, after receiving a call request, the simulation model calculation process computes using locally stored data and the passed parameters, and feeds the result back to the client device when the computation is finished;
S603, the simulation engine of the client device produces the current observation and reward information and outputs it to the training scheduler of the server-side device; the training scheduler calls the observation construction module of the computing-end device, which generates observation and reward data meeting the requirements of the agent decision algorithm module and outputs them to the agent decision algorithm module of the computing-end device;
S604, the agent decision algorithm module receives the observation and reward information, executes the agent decision algorithm to generate an execution action, outputs the execution action to the training scheduler, and stores the current state, action, reward, and next-state information of the agent model into the experience pool;
S605, steps S601 to S604 are repeated until one round of the training simulation process is completed.
the invention has the following beneficial effects:
the invention provides a multi-agent distributed collaborative training simulation method based on remote process calling in a distributed environment, which designs a multi-agent distributed training simulation system, combines with multi-agent decision algorithm codes, and realizes multi-agent algorithm training and operation. Disassembling a model component of a traditional simulation system into a model component calculation program, an intelligent agent decision algorithm module and a model component agent in a distributed training simulation system, extracting relevant physical processes in simulation model classes, and packaging relevant function processes in each simulation model class into different simulation model calculation programs; the original simulation model class is modified into a simulation model proxy class, the external function interface of the simulation model proxy class is the same as the original simulation model class, but actual simulation calculation cannot be executed; the model component agent and the model component calculation program are connected in a remote procedure call mode. The problem that as the scale and complexity of a model in a multi-intelligent training environment continuously develop to the depth, an entity model is more and more complex, the simulation calculation amount is more and more large and gradually exceeds the capability of a single processor is solved, the computing capability of the training simulation environment in the multi-intelligent-agent training process is improved, and the intelligent training simulation with service and high efficiency is realized under the condition of fully utilizing resources.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below represent only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 shows a schematic diagram of a multi-agent distributed co-training simulation system.
FIG. 2 illustrates a flow diagram of a multi-agent distributed co-training simulation method.
Fig. 3 shows a QMIX-based multi-agent cooperative decision algorithm architecture in a training mode.
Detailed Description
In order to make the technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
In the description of the present invention, it should be noted that terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings, are used only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Unless expressly stated or limited otherwise, the terms "mounted" and "connected" are to be understood broadly: a connection may, for example, be fixed, detachable, or integral; mechanical or electrical; direct, indirect through an intermediate medium, or internal between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
It is further noted that, in the description of the present invention, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As the scale and complexity of models in multi-agent training environments continue to develop, entity models become increasingly complex and the simulation computation load increasingly exceeds the capability of a single processor. The aim is to improve the computing power of the training simulation environment in the multi-agent training process. The invention provides a multi-agent distributed collaborative training simulation method based on remote procedure call in a distributed environment and designs a multi-agent distributed training simulation system. The system mainly comprises a client device, a server-side device, and computing-end devices; it is built around a simulation engine assisted by a training scheduler and model components, and is combined with multi-agent adversarial decision algorithm code to realize multi-agent algorithm training and operation. The invention extracts the relevant physical processes in the simulation model classes and encapsulates the relevant functional processes of each simulation model class into different simulation model calculation programs. The original simulation model class in the client is rewritten as a simulation model proxy class whose external function interface is the same as the original simulation model class but which does not perform the actual simulation calculation; the model component agent and the model component calculation program are connected by remote procedure call. The model component of a traditional simulation system is thus decomposed into a model component calculation program, an agent decision algorithm module, and a model component agent in the distributed training simulation system: the model component calculation program is responsible for the actual physical process calculation, the agent decision algorithm module is responsible for the actual decision behavior, and the model component agent interacts with the simulation engine.
One embodiment of the present invention provides a distributed parallel multi-agent collaborative training simulation system which, as shown in Fig. 1, comprises a client device, a server-side device, and a plurality of computing-end devices.
The client device includes:
and the simulation engine is used for scheduling simulation operation, performing time management, data distribution and other functions. The time management processes the time advance requests of all the simulation training schedulers and calculates the time value of the running advance. And the data distribution management receives the port data sent by the training scheduler and distributes the data according to the port connection relation defined in the information flow. Meanwhile, the simulation engine is responsible for loading information flow, loading an experiment framework, instantiating a model component and the like. The simulation engine provides a user interface for configuring a scheduling mode of each model component;
the simulation model component agent reports model component information to the simulation engine, maintains an event queue arranged according to a time sequence, processes initialization, simulation propulsion, destruction and other events generated by the model component, and requests the simulation engine to carry out time propulsion when time is changed. Meanwhile, receiving an execution action output by the intelligent agent decision algorithm module, calling a model component calculation program to run in a remote process scheduling mode, and sending a state updating message of the model component to a simulation engine;
the simulation transfer interface, the interactive interface of training scheduler and simulation deduction environment, including the interface of planning selection, planning/model distribution, operation control, etc.
In a specific example, the simulation engine is responsible for loading the battle scenario, outputting raw battlefield observation data, receiving the action control information output by the agent decision algorithm module, and remotely calling the model component calculation programs through the model component agents to execute position update calculations, electromagnetic calculations, attack decision calculations, and the like, thereby realizing the adversarial deduction; it also provides display, logging, and similar functions.
The server-side device is responsible for starting the adversarial training scheduling and allows user-defined setting of various parameters of the adversarial process, such as the designation of the two sides' adversarial programs, the number of game rounds, scenario selection, and the maximum number of consecutive frames per round. It also forwards the deduction observation data to the computing-end devices, receives the action information output by the adversarial decision processes, and schedules the agent decision algorithm modules for training.
the method comprises the following steps that a computing end device runs on distributed simulation training nodes by each confrontation control decision and model component, and comprises the following steps:
the observation construction module is used for constructing observation information corresponding to each agent decision algorithm module, and generating an observation data organization form meeting the requirements of the agent decision algorithm from the original state data;
the intelligent agent decision algorithm module receives the state and the reward information of the simulation environment and sends an execution action output by the intelligent agent decision process to the simulation environment;
and the model component calculation program extracts the related function processes in the simulation model classes, and encapsulates the related function processes in each simulation model class into different simulation model calculation programs.
In the design of the distributed parallel multi-agent collaborative training simulation system based on remote service invocation, each computing-end device is deployed with a simulation model calculation program and listens on a designated port. A configuration file must be provided to the client; it indicates the category, IP address, and port number of each computing-end device, so that the client can find all simulation model computing interfaces and carry out the simulation process.
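A minimal example of such a client configuration file (the patent specifies the fields, namely category, IP address, and port number, but not the file format; JSON and the addresses below are assumptions):

{
  "compute_nodes": [
    {"category": "platform", "ip": "192.168.1.10", "port": 8001},
    {"category": "sensor",   "ip": "192.168.1.11", "port": 8002},
    {"category": "weapon",   "ip": "192.168.1.12", "port": 8003}
  ]
}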
Another embodiment provides a distributed parallel multi-agent collaborative training method. A specific example is a multi-UAV adversarial process based on deep reinforcement learning, as shown in Fig. 2.
S10, the training controller on the server-side device sets the training scenario, training interface, training parameters, and training algorithm:
1) Training scenario settings
The training scenario for the multi-UAV game confrontation contains detection UAVs and attack UAVs. The detection UAV can simulate L- and S-band radar for omnidirectional detection and supports multi-frequency-point switching; the attack UAV has detection, jamming, and attack functions and can simulate an X-band radar for directional detection. In this example, the scenario sets 1 detection unit and 10 attack units.
2) Training interface settings
The training interface includes observation, action, and reward information:
① Observation information: the intelligent training framework encapsulates the global observation information (raw obs) into a dictionary. First, the observation interface of the UAV adversarial simulation is constructed; in this example it mainly includes detector_obs_list (detection unit information), fighter_obs_list (attack unit information), and joint_obs_dict (global information). detector_obs_list is a list whose size is the number of detection units, fighter_obs_list is a list whose size is the number of attack units, and joint_obs_dict is a dictionary of global information. The information is divided into detector information and fighter information and stored in the dictionary obs_dict; each element is itself a dictionary, and its main parts are the following:
table 1 observation information table
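The contents of Table 1 appear only as an image in the original publication. The sketch below therefore illustrates the observation layout described in the text; the three top-level keys follow the text, while the per-unit fields are illustrative assumptions:

raw_obs = {
    "detector_obs_list": [                      # one dictionary per detection unit (1 here)
        {"id": 1, "pos": [120.0, 80.0], "alive": True, "r_visible_list": []},
    ],
    "fighter_obs_list": [                       # one dictionary per attack unit (10 here)
        {"id": i, "pos": [0.0, 0.0], "alive": True, "r_visible_list": []}
        for i in range(2, 12)
    ],
    "joint_obs_dict": {"step": 0},              # global information
}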
② Action information: in this example, the action output after intelligent training assumes 1 detection unit and 10 attack units. The detection UAV action detector_action is defined as a 1 x 2 array, each row representing the two actions of one detection unit; the attack UAV action fighter_action is defined as a 10 x 4 array, each row representing the four actions of one attack unit. The specific action definitions are shown in the tables below.
Table 2 Detection UAV output actions
Table 3 Attack UAV output actions
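Tables 2 and 3 appear only as images in the original publication, so the sketch below illustrates just the array shapes stated in the text; the column meanings in the comments are assumptions:

import numpy as np

# detector_action: 1 x 2 (one row per detection unit, two actions each)
# fighter_action: 10 x 4 (one row per attack unit, four actions each)
detector_action = np.zeros((1, 2), dtype=np.int64)
fighter_action = np.zeros((10, 4), dtype=np.int64)

# Hypothetical assignments: a course index (course_num = 16 headings) plus a
# radar frequency point for the detector; course, radar switch, jamming
# switch, and attack target for one fighter.
detector_action[0] = [4, 2]
fighter_action[3] = [8, 1, 0, 11]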
③ Reward information: the user can modify the reward values according to the actual requirements of algorithm training. If the currently provided reward checkpoints do not meet the training requirements, custom rewards can be obtained by constructing functions. In this example, the rewards defined by the invention are as follows:
TABLE 4 Individual reward customization
TABLE 5 Global reward customization
No.   Meaning            Variable name          Default value
1     Complete victory   reward_totally_win     200
2     Complete defeat    reward_totally_lose    -200
3     Victory            reward_win             100
4     Defeat             reward_lose            -100
5     Draw               reward_draw            0
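Table 4 is reproduced only as an image in the original publication; the global rewards of Table 5 can, for illustration, be expressed directly as data:

GLOBAL_REWARDS = {
    "reward_totally_win": 200,    # complete victory
    "reward_totally_lose": -200,  # complete defeat
    "reward_win": 100,            # victory
    "reward_lose": -100,          # defeat
    "reward_draw": 0,             # draw
}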
3) Training parameter settings
Next, the training parameters are set; they mainly include neural network parameters and training parameters, as shown in the following tables:
TABLE 6 Neural network model parameters
TABLE 7 Training parameters
No.   Parameter name   Function                          Initial value   Assignment rule
1     max_epoch        Number of model training rounds   1000            Adjust per model
2     detector_num     Number of detection UAVs          1               Adjust as needed
3     fighter_num      Number of attack UAVs             10              Adjust as needed
4     course_num       Number of headings                16              Adjust as needed
5     train_interval   Training frequency                1000            Adjust as needed
6     learn_interval   Model learning frequency          100             Adjust as needed
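The neural network parameters of Table 6 are an image in the original publication and are omitted here; the training parameters of Table 7, expressed as an illustrative configuration snippet:

TRAIN_PARAMS = {
    "max_epoch": 1000,       # number of model training rounds
    "detector_num": 1,       # number of detection UAVs
    "fighter_num": 10,       # number of attack UAVs
    "course_num": 16,        # number of headings
    "train_interval": 1000,  # training frequency
    "learn_interval": 100,   # model learning frequency
}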
4) Training algorithm
The currently trained model is a multi-agent training algorithm based on the QMIX algorithm, which can train the existing adversarial units such as the attack UAVs and reconnaissance UAVs. The structure of the QMIX network is shown in Fig. 3.
Input:
o_i: the observation vector of each agent;
a_i: the action performed by each agent at each time step, i ∈ N = {1, 2, ..., N}.
Output:
Q_tot: the overall action value function, used for model optimization;
Q_i: the action value function of each agent at the current time, used to obtain the policy.
The agent policy network comprises a Q network for each agent and a Mixer network:
1) The Q network is composed of an RNN (Recurrent Neural Network), so that previously observed information can also be incorporated into the reference information for decision-making. The input of the RNN at each time step includes the observation information of the current time and the action information of the previous time.
2) The Mixer network fuses the action value functions of all agents to obtain the global action value function. Its parameters cannot be learned in the ordinary way, because the monotonic fusion property must be maintained. A hypernetwork is therefore introduced: its input is the observation information of the current time, and its output is processed by an absolute value function to ensure that the mixing weights are non-negative, which guarantees the monotonicity of the Mixer network.
3) The global action value function is only needed when the network parameters are updated; at decision time, each agent simply outputs the action with the maximum action value in its own Q network.
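A condensed PyTorch sketch of this structure follows. The patent gives no code, so the layer sizes, the GRU and ELU choices, and all names are illustrative assumptions in line with common QMIX implementations:

import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent Q network: a GRU over (current observation, previous action)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.fc_out = nn.Linear(hidden, n_actions)

    def forward(self, obs, last_action_onehot, h):
        x = torch.relu(self.fc_in(torch.cat([obs, last_action_onehot], dim=-1)))
        h = self.rnn(x, h)                       # past observations flow through h
        return self.fc_out(h), h                 # Q value per action, new hidden state

class MixerNet(nn.Module):
    """Monotonic mixer: hypernetworks map the global state to mixing weights."""
    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):          # agent_qs: (batch, n_agents)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed)
        hidden = torch.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(-1, 1)   # Q_tot(s, a)

The abs() applied to the hypernetwork outputs is the absolute-value processing described in point 2): it keeps the mixing weights non-negative, so maximizing each agent's own Q_i also maximizes Q_tot.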
S20, the code of all simulation models on the computing-end devices is service-encapsulated to obtain different simulation model executables, the port number that each executable instance should listen on is specified in a configuration file, and the resulting programs are started on different computing-end servers to obtain different simulation model calculation processes.
In the multi-UAV adversarial simulation example, the detection UAV (detector) is composed of a platform model (platform) and a detection device model (sensor), and the attack UAV (fighter) is composed of a platform model (platform), an attack weapon model (weapon), and a detection device model (sensor). The external interfaces of the platform, sensor, and weapon models are extracted and encapsulated into corresponding simulation calculation programs, and the simulation instance interfaces of the distributed training simulation system call the corresponding interfaces of the simulation instance calculation programs on the computing-end devices. First, the service interface is defined; in this embodiment, the external interface of each simulation instance class is defined as the service interface. Taking the simulation instance Sensor as an example, six interfaces need to be encapsulated. The constructor and destructor complete the construction and destruction of a Sensor simulation instance, and the init function initializes it; these three interfaces are called only once in the simulation flow. The step function is responsible for advancing the simulation and must be called by the simulation engine once per time step. The setPlatformState and getSensorDetect functions are the interfaces for accessing the simulation instance class members; they set the state of the sensor simulation instance and obtain the calculation results of the simulation. Since every model has the init, create, and destroy interfaces, these are not listed separately; the model encapsulation interfaces of this example are shown in the following table:
Table 8 Simulation model interface invocation timing
The remote service call interfaces are defined from the interfaces of each model. Note that the invention adds a string-type name parameter to each interface: in the distributed simulation architecture based on remote service invocation, the simulation instances are decoupled and placed in the remote simulation service program, which must know which simulation instance wrapper object on the client side the currently simulated instance object corresponds to. The unique name parameter added here lets the remote simulation service program determine which simulation instance is currently being simulated.
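For illustration, a sketch of such a remotely exposed Sensor service follows; the six interfaces and the name parameter follow the text, while the RPC framework (Python's standard xmlrpc), the stored fields, and the port number are assumptions:

from xmlrpc.server import SimpleXMLRPCServer

class SensorService:
    """Hypothetical remote Sensor service; the name argument routes each
    call to the correct remote instance."""
    def __init__(self):
        self.instances = {}                        # name -> per-instance state

    def create(self, name):
        self.instances[name] = {"platform": None, "detections": []}
        return True

    def destroy(self, name):
        self.instances.pop(name, None)
        return True

    def init(self, name, params):
        self.instances[name].update(params)        # one-time initialization
        return True

    def step(self, name, dt):
        # ... the actual radar detection computation for one time step ...
        return True

    def setPlatformState(self, name, state):
        self.instances[name]["platform"] = state
        return True

    def getSensorDetect(self, name):
        return self.instances[name]["detections"]

server = SimpleXMLRPCServer(("0.0.0.0", 8002), allow_none=True)
server.register_instance(SensorService())
server.serve_forever()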
And S30, the IP addresses corresponding to the simulation model calculation processes and the port numbers they listen on are written into the configuration file of the client device, and the client device is started.
S40, the client device reads the configuration file, obtains the IP address and port number of each simulation model calculation process, and establishes a link with each of them; the training controller communicates with the distributed client device, and the client device obtains the simulation scenario information.
S50, the client device locally creates the simulation model proxy objects according to the set simulation scheme and sends the model calculation initialization parameters to the simulation model calculation processes on the computing-end devices; after the client device receives the initialization-completion responses from the computing-end devices, the actual training process can proceed. The invention replaces the simulation calculation logic in the external interface of each simulation instance class with calls to the remote service interfaces defined above; before such a call, if the type of a passed parameter does not match the parameter type required by the remote service interface, the parameter is converted into a type conforming to the remote service call interface. In this way the simulation instance class is rewritten into a simulation instance wrapper class. Because the external interface of the wrapper class is identical to that of the original simulation instance class, the upper-layer code that calls the simulation instance wrapper objects needs no modification, so the wrapper class is not shown here. At this point the training initialization work is finished.
S60, the simulation training starts:
S601, the simulation engine of the client device calls, according to the current time step, the relevant functions of each simulation model component agent to simulate the corresponding time step; these functions send remote service call requests to the simulation model calculation processes on the corresponding computing-end devices and pass the parameters to them;
S602, after receiving a call request, the simulation model calculation process computes using locally stored data and the passed parameters, and feeds the result back to the client device when the computation is finished;
S603, the simulation engine of the client device produces the current observation and reward information and outputs it to the training scheduler of the server-side device; the training scheduler calls the observation construction module of the computing-end device, which generates observation and reward data meeting the requirements of the agent decision algorithm module and outputs them to the agent decision algorithm module of the computing-end device;
S604, the agent decision algorithm (QMIX) module receives the observation and reward information, executes the agent decision algorithm to generate an execution action, outputs the execution action to the training scheduler, obtains the current state, action, reward, and next-state information <s, a, r, s'> of the agent model, and stores it into the experience pool, where s denotes the global state, i.e., the set of information observed by all UAVs, a denotes the joint action of the UAVs, and s' and r denote, respectively, the subsequent global state after the UAVs execute the action and the immediate reward fed back by the environment;
S605, steps S601 to S604 are repeated until one round of the training simulation process is completed.
s70, finishing a round of simulation training, and extracting a sample with the size of batch _ size from the experience pool by the training scheduler for network training, as shown in figure 3. Firstly, inputting each observation of the agent into an action cost function network of the agent to obtain a state action value Qi (oi, ai) corresponding to an action in a sample; inputting the global state s into a hyper-parameter network, and outputting a weight and a weight bias of a global action cost function network; and inputting the state action value Qi (oi, ai) into the global action cost function network, and outputting the global action cost Qtotal (s, a). Secondly, inputting each observation in the subsequent global state s 'into a target action cost function network thereof to obtain a maximum state action value max Qi ▔ (oi', ai); inputting the subsequent state s' into a target hyper-parameter network, and outputting the weight and the weight bias of a target global action cost function network; inputting max Qi ▔ (oi ', ai) into the target global action cost function network, and outputting a global action cost Qtotal ▔ (s', a);
and S80, if the whole training is finished or the learning updating frequency is reached, namely, the training is finished (1000/100 is 10), calculating a loss function, calculating a gradient, performing back propagation, and updating the parameters of the current whole network.
It should be understood that the above embodiments of the present invention are examples given only to clearly illustrate the invention and are not to be construed as limiting its embodiments. Various changes and modifications can obviously be made by those skilled in the art on the basis of the above description; it is not intended to exhaust all embodiments here, and obvious changes and modifications can still be made on the basis of the technical solutions of the present invention.

Claims (10)

1. A multi-agent distributed co-training simulation system, comprising:
computing-end devices, distributed at different computing nodes, for executing the adversarial deduction and intelligent decision processes;
a client device for calling each computing-end device to realize the adversarial deduction; and
a server-side device for performing user-defined parameter setting and connecting with the client device and the computing-end devices to transmit information so as to start the adversarial scheduling.
2. The system of claim 1, wherein the client device comprises:
a simulation engine for loading the battle scenario, outputting raw battlefield observation data, receiving the action information output by the agent decision algorithm, loading the information flow, loading the experiment framework, and instantiating the simulation model components;
a simulation model component agent for reporting model component information to the simulation engine, maintaining an event queue ordered by time, processing the initialization, simulation advance, and destruction events generated by the model component, requesting time advance from the simulation engine when the time changes, receiving the execution actions output by the agent decision algorithm, invoking the model component calculation program by remote procedure call, and sending model component state update messages to the simulation engine;
and the computing-end device comprises:
an agent decision algorithm module for receiving the observation and reward information produced by the simulation engine, executing the agent decision algorithm to generate an execution action, and outputting the execution action to the training scheduler to obtain the current state, action, reward, and next-state information of the agent model;
a model component calculation program that encapsulates the relevant functional processes of each simulation model and performs the actual physical process calculation;
and the server-side device comprises a training scheduler for starting the adversarial training scheduling, and a training controller for setting the training scenario, training interface, training parameters, and training algorithm.
3. The system of claim 2, wherein the client device further comprises a simulation call interface serving as the interactive interface between the training scheduler and the simulation engine, including a scenario selection interface, a scenario/model distribution interface, and a run control interface.
4. The system of claim 2, wherein the computing device further comprises an observation construction module configured to provide an observation construction function for the simulation engine, construct observation information and reward information for each agent decision algorithm, and generate an observation data organization form from the raw state data that meets the requirements of the agent decision algorithm.
5. The system of claim 2, wherein the simulation engine is further configured to schedule simulation runs, perform time management and data distribution management, and provide a user interface for configuring a scheduling schema for each model component.
6. The system of claim 5, wherein the time management is to process time advance requests of all training schedulers, calculate a time value of running advance; and the data distribution management is to receive the port data sent by the training scheduler and distribute the data according to the port connection relation defined in the information flow.
7. The system of claim 1, wherein the client comprises a configuration file, and the configuration file comprises class information, IP address information, and port number information of the computing device.
8. A multi-agent distributed collaborative training simulation method is characterized by comprising the following steps:
s10, setting a training scene, a training interface, training parameters and a training algorithm by the training controller on the server side equipment;
s20, extracting relevant function processes in the simulation model at the computing end equipment, packaging the relevant function processes into different simulation model computing programs, and starting the obtained programs on different computing end equipment to obtain different simulation model computing processes;
s30, writing the IP addresses corresponding to the simulation model calculation processes and the monitored port numbers into a client configuration file;
s40, the client reads the configuration file of the client and respectively establishes links with the calculation processes of each simulation model; a training controller of the server equipment communicates with the client to transmit simulation training scene information;
s50, the client creates a simulation model component agent according to the set simulation scheme, and sends model calculation initialization parameters to the simulation model calculation process on each calculation terminal device;
s60, after the client device receives the initialization completion response of the computing device, the simulation engine uniformly manages the initialization completion response, and the actual training process is carried out;
s70, finishing one round of simulation training, if the set training time is reached, acquiring a group of data from the experience pool by the training scheduler, calculating a target global value function value corresponding to the group of data, performing model learning, and updating an action strategy;
and S80, finishing the training and updating the whole network parameters if the whole training is finished or the learning updating frequency is reached.
9. The method according to claim 8, wherein the training interface setting in step S10 includes: observation information, action information, reward information.
10. The method according to claim 8, wherein the step S60 training process comprises:
s601, the simulation engine of the client device respectively calls related functions of the simulation model component proxy according to the current time step to simulate the corresponding time step, the related functions of the simulation model component proxy send remote service call requests to the simulation model calculation process of the corresponding calculation terminal device and transmit parameters to the simulation model calculation process;
s602, after receiving the call request, the simulation model calculation process calculates according to locally stored data and the transmitted parameters, and after calculation is completed, a calculation result is fed back to the client device;
s603, a simulation engine of the client device simulates the current observation information and reward information and outputs the current observation information and reward information to a training scheduler of the server device, the simulation training scheduler calls an observation construction module of the computing device, observation and reward information data meeting an agent decision algorithm module are generated in the observation construction module and output to the agent decision algorithm module of the computing device;
s604, the intelligent agent decision algorithm module receives the observation information and the reward information, executes the intelligent agent decision algorithm to generate an execution action, outputs the execution action to a training scheduler, and stores the current state, the fighting, the reward and the next state information of the intelligent agent model into an experience pool;
and S605, repeating the steps S601 to S604 until the training simulation process is completed.
CN202210549487.3A 2022-05-20 2022-05-20 Multi-agent distribution collaborative training simulation method Pending CN115099124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210549487.3A CN115099124A (en) 2022-05-20 2022-05-20 Multi-agent distribution collaborative training simulation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210549487.3A CN115099124A (en) 2022-05-20 2022-05-20 Multi-agent distribution collaborative training simulation method

Publications (1)

Publication Number Publication Date
CN115099124A true CN115099124A (en) 2022-09-23

Family

ID=83288560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210549487.3A Pending CN115099124A (en) 2022-05-20 2022-05-20 Multi-agent distribution collaborative training simulation method

Country Status (1)

Country Link
CN (1) CN115099124A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775220A (en) * 2023-06-30 2023-09-19 南京希音电子商务有限公司 Distributed simulation optimization method, system, equipment and medium based on asynchronous process
CN116775220B (en) * 2023-06-30 2024-04-12 南京希音电子商务有限公司 Distributed simulation optimization method, system, equipment and medium based on asynchronous process
CN116628520A (en) * 2023-07-24 2023-08-22 中国船舶集团有限公司第七〇七研究所 Multi-scholars simulation training method and system based on average field theory algorithm
CN116628520B (en) * 2023-07-24 2023-09-29 中国船舶集团有限公司第七〇七研究所 Multi-scholars simulation training method and system based on average field theory algorithm
CN117114088A (en) * 2023-10-17 2023-11-24 安徽大学 Deep reinforcement learning intelligent decision platform based on unified AI framework
CN117114088B (en) * 2023-10-17 2024-01-19 安徽大学 Deep reinforcement learning intelligent decision platform based on unified AI framework

Similar Documents

Publication Publication Date Title
CN115099124A (en) Multi-agent distribution collaborative training simulation method
CN112784445B (en) Parallel distributed computing system and method for flight control agent
CN105095961A (en) Mixing system with artificial neural network and impulsive neural network
CN105095967A (en) Multi-mode neural morphological network core
CN105095966A (en) Hybrid computing system of artificial neural network and impulsive neural network
CN101118654B (en) Machine vision computer simulation emulation system based on sensor network
CN105095965A (en) Hybrid communication method of artificial neural network and impulsive neural network
CN114707404A (en) Distributed parallel multi-agent cooperative training system and method
Gao et al. Consensus evaluation method of multi-ground-target threat for unmanned aerial vehicle swarm based on heterogeneous group decision making
CN112883586A (en) Analog simulation system and method based on double logic layer agents
CN108280746A (en) A kind of product design method based on bidirectional circulating neural network
Kravchuk et al. Formation of a wireless communication system based on a swarm of unmanned aerial vehicles
CN109544082A (en) A kind of system and method for digital battlefield confrontation
CN116362327A (en) Model training method and system and electronic equipment
CN112257874A (en) Machine learning method, device and system of distributed machine learning system
CN115396335B (en) Industrial wireless network equipment access IPv6 test system and method based on micro-service
Petrov et al. An intelligent-agent based decision support system for a complex command and control application
CN107425997B (en) The network architecture and implementation method of class people net
Wang et al. OPM & color petri nets based executable system of systems architecting: A building block in FILA-SoS
CN107872527A (en) A kind of LVC integrations remote mode cloud service system and method
HAKIRI et al. A Comprehensive Survey on Digital Twin for Future Networks and Emerging Iot Industry
CN116362109B (en) Intelligent unmanned system and method based on digital twinning
Tang et al. Dynamic scheduling for multi-level air defense with contingency situations based on Human-Intelligence collaboration
Palmer et al. Human-swarm hybrids outperform both humans and swarms solving digital jigsaw puzzles
Yang et al. Research on task-oriented dynamic reconstruction method of UAV swarm evolutionary game

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination