CN112418349A - Distributed multi-agent deterministic strategy control method for large complex system - Google Patents

Distributed multi-agent deterministic strategy control method for large complex system

Info

Publication number
CN112418349A
Authority
CN
China
Prior art keywords
agent
complex system
experience
action
training
Prior art date
Legal status
Pending
Application number
CN202011453683.8A
Other languages
Chinese (zh)
Inventor
陶模
冯毅
李献领
郑伟
周宏宽
邱志强
林原胜
汪伟
邹海
劳星胜
李少丹
赵振兴
吴君
庞杰
黄崇海
Current Assignee
Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp
Original Assignee
Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp
Priority date
Filing date
Publication date
Application filed by Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp filed Critical Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp
Priority to CN202011453683.8A
Publication of CN112418349A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a distributed multi-agent deterministic policy control method for a large complex system. The method determines a local node control target for the agent corresponding to each control node in the large complex system and sets a reward function for each agent; determines the action set corresponding to the agents' actions in the current control period, obtains an experience set based on the action set and the environment reward set given by the reward functions, stores the experience set in an experience buffer and updates the initialization state; trains each agent from the experience buffer until all agents have been traversed; and repeats the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, obtaining the target deep policy network. The invention applies a distributed multi-agent control method to a large complex system: multiple agents are constructed, they share information with one another, and each agent is continuously optimized during training according to its own control target, so that the control performance is improved.

Description

Distributed multi-agent deterministic strategy control method for large complex system
Technical Field
The present application relates to the technical field of operation control of large complex systems, and in particular to a distributed multi-agent deterministic policy control method for a large complex system.
Background
A system composed of a large number of individual agents and the connections among them may be referred to as a multi-agent network. In this framework, the information collected by each member is local and scattered, and no member is capable of completing the overall task on its own; existing information exchange among individuals merely drives the state quantities of all members toward equal values, so the complex tasks in a large complex system cannot be completed through such cooperation.
Disclosure of Invention
In order to solve the above problems, embodiments of the present application provide a distributed multi-agent deterministic policy control method for a large complex system.
In a first aspect, an embodiment of the present application provides a distributed multi-agent deterministic policy control method for a large complex system, where the method includes:
determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state;
training each agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
Preferably, the determining a local node control target for the agent corresponding to each control node in the large complex system includes:
setting an overall control target of the large complex system;
and decomposing the overall control target, taking each control node of the large complex system as an agent of the multi-agent system, each agent taking the corresponding decomposed part of the overall control target as the control target of its own node.
Preferably, the determining the action set corresponding to the agents' actions in the current control period includes:
obtaining the action a_i of agent i in the current control period through the deterministic policy calculation
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state observable by the current agent;
determining the action set a = (a_1, a_2, …, a_N) corresponding to the actions a_i, where N is the number of agents.
Preferably, the obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state includes:
obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into the experience buffer D, and updating the initialization state x = x'.
Preferably, the training each agent according to the experience buffer includes:
randomly extracting S training samples (x_j, a_j, r_j, x'_j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:
y_j = r_i^j + γ Q'_i(x'_j, a'_1, …, a'_N)
where Q'_i is the target centralized action-value network of agent i, the next actions a'_k are given by the target policy networks, and γ is the system discount factor with 0 < γ < 1;
calculating the deep policy network loss function of the agent according to the following formula:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
calculating the fastest (steepest) gradient in the training process of the deep policy network according to the following formula:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
In a second aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method as provided in the first aspect or any one of the possible implementations of the first aspect.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method as provided in the first aspect or any one of the possible implementation manners of the first aspect.
The invention has the following beneficial effects: 1. A distributed multi-agent control method is adopted in a large complex system: multiple agents are constructed, they share information with one another, and each agent is continuously optimized during training according to its own control target, so that the control performance is improved.
2. To avoid selfish behavior by individual agents, the training processes of all agents share the same experience buffer, so that each agent must consider the states of the surrounding agents during learning, which also accelerates the training process.
3. In a large complex system the tasks undertaken by the agents differ, so the reward function of each agent is designed to be closely related to its control objective and to the state of the large complex system, whereby the agents accomplish a common task.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a schematic flowchart of a distributed multi-agent deterministic policy control method for a large complex system according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an example of the principle of a distributed multi-agent deterministic policy control method for a large complex system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an exemplary embodiment of a condensate system according to the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In the following description, the terms "first" and "second" are used for descriptive purposes only and are not intended to indicate or imply relative importance. The following description provides embodiments of the invention, which may be combined with or substituted for one another, and the invention is thus to be construed as embracing all possible combinations of the same and/or different embodiments described. Thus, if one embodiment includes features A, B, C and another embodiment includes features B, D, then the invention should also be construed as including embodiments that contain one or more of all other possible combinations of A, B, C, D, even though such embodiments may not be explicitly recited in the following text.
The following description provides examples, and does not limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements described without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For example, the described methods may be performed in an order different than the order described, and various steps may be added, omitted, or combined. Furthermore, features described with respect to some examples may be combined into other examples.
Referring to fig. 1, fig. 1 is a schematic flowchart of a distributed multi-agent deterministic policy control method for a large complex system provided by an embodiment of the present application. In an embodiment of the present application, the method includes:
S101, determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent.
In one embodiment, the determining a local node control target for the agent corresponding to each control node in the large complex system includes:
setting an overall control target of the large complex system;
and decomposing the overall control target, taking each control node of the large complex system as an agent of the multi-agent system, each agent taking the corresponding decomposed part of the overall control target as the control target of its own node.
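For illustration only, the decomposition described above can be represented as a simple mapping from control nodes to local control targets. The node names, index names, and numeric values in the following sketch are hypothetical and are not taken from the patent; they merely show how each control node becomes one agent with its own local target.

    # Hypothetical decomposition of the overall control target into per-node
    # (per-agent) local control targets; all names and values are illustrative only.
    overall_target = {"steam_generator_level_m": 2.10}

    node_targets = {
        "condensate_pump":  {"index": "condensate_pressure_MPa", "target": 0.85},
        "condensate_valve": {"index": "condensate_flow_kg_s",    "target": 120.0},
        "feedwater_pump":   {"index": "feedwater_flow_kg_s",     "target": 150.0},
        "feedwater_valve":  {"index": "steam_generator_level_m", "target": 2.10},
    }

    # One agent per control node; each agent's reward function is later built
    # from its own decomposed target.
    agents = list(node_targets.keys())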
S102, randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state.
In one possible embodiment, the determining an action set corresponding to the agents' actions in the current control period includes:
obtaining the action a_i of agent i in the current control period through the deterministic policy calculation
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state observable by the current agent;
determining the action set a = (a_1, a_2, …, a_N) corresponding to the actions a_i.
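A minimal sketch of this action computation is given below, assuming PyTorch and a small fully connected network as the deep policy network μ_{θ_i}; the layer sizes, activation functions, and example numbers are assumptions of the sketch rather than details fixed by the method.

    # Minimal sketch (assumptions: PyTorch, a small MLP as the deep policy network).
    import torch
    import torch.nn as nn

    class DeepPolicyNetwork(nn.Module):
        """mu_theta_i: maps the local observable state o_i to the action a_i."""
        def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Tanh(),  # actions scaled to [-1, 1]
            )

        def forward(self, o_i: torch.Tensor) -> torch.Tensor:
            return self.net(o_i)

    # a_i = mu_theta_i(o_i): deterministic action of agent i in the current control period
    policy_i = DeepPolicyNetwork(obs_dim=4, act_dim=2)
    o_i = torch.tensor([0.82, 0.05, 2.10, 0.60])   # local state observed by agent i (illustrative)
    a_i = policy_i(o_i)                            # action a_i
    # Joint action set a = (a_1, ..., a_N): evaluate each agent's policy on its own o_i.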
In one embodiment, the obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state includes:
obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into the experience buffer D, and updating the initialization state x = x'.
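The experience set and the experience buffer D can be sketched as follows; the fixed capacity and the uniform random sampling used for later training are assumptions of the sketch. All agents share this single buffer.

    # Sketch of the shared experience buffer D (capacity and uniform sampling are assumptions).
    import random
    from collections import deque

    class ExperienceBuffer:
        def __init__(self, capacity: int = 100_000):
            self.data = deque(maxlen=capacity)

        def store(self, x, a, r, x_next):
            """Store one experience set (x, a, r, x')."""
            self.data.append((x, a, r, x_next))

        def sample(self, batch_size: int):
            """Randomly extract S = batch_size experience sets for training."""
            return random.sample(list(self.data), batch_size)

    D = ExperienceBuffer()
    # After applying the joint action a in state x and observing the reward set r and new state x':
    # D.store(x, a, r, x_next)
    # x = x_next   # update the initialization state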
S103, training each agent according to the experience buffer until all agents have been traversed.
In one possible implementation, the training each agent according to the experience buffer includes:
randomly extracting S training samples (x_j, a_j, r_j, x'_j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:
y_j = r_i^j + γ Q'_i(x'_j, a'_1, …, a'_N)
where Q'_i is the target centralized action-value network of agent i, the next actions a'_k are given by the target policy networks, and γ is the system discount factor with 0 < γ < 1;
calculating the deep policy network loss function of the agent according to the following formula:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
calculating the fastest (steepest) gradient in the training process of the deep policy network according to the following formula:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
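A sketch of one such training step for agent i is given below. It follows the MADDPG-style update suggested by the formulas above and assumes PyTorch, a centralized action-value network Q_i (an nn.Module taking the global state and the joint action) with a target copy, target policy networks, Adam optimizers, scalar per-agent actions, and that each agent's local observation is approximated by the global state x; all of these are assumptions of the sketch, not details fixed by the patent. The buffer D is the one from the earlier sketch.

    # Sketch of one training step for the i-th agent (assumptions listed above).
    import torch

    def train_agent_i(i, D, policies, target_policies, critic_i, target_critic_i,
                      actor_opt_i, critic_opt_i, S=64, gamma=0.95, tau=0.01):
        batch = D.sample(S)                                       # S tuples (x_j, a_j, r_j, x'_j)
        x   = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
        a   = torch.stack([torch.as_tensor(b[1], dtype=torch.float32) for b in batch])
        r_i = torch.tensor([b[2][i] for b in batch], dtype=torch.float32)
        x2  = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])

        # Training target y_j = r_i^j + gamma * Q'_i(x'_j, a'_1, ..., a'_N),
        # with the next actions given by the target policy networks.
        with torch.no_grad():
            a2 = torch.cat([target_policies[k](x2) for k in range(len(policies))], dim=-1)
            y = r_i + gamma * target_critic_i(x2, a2).squeeze(-1)

        # Loss L(theta_i) = (1/S) * sum_j (y_j - Q_i(x_j, a_j))^2
        q = critic_i(x, a).squeeze(-1)
        critic_loss = ((y - q) ** 2).mean()
        critic_opt_i.zero_grad()
        critic_loss.backward()
        critic_opt_i.step()

        # Deterministic policy gradient: ascend Q_i with agent i's action replaced by
        # mu_i(o_i) while the other agents' actions are taken from the sampled batch.
        a_new = a.clone()
        a_new[:, i] = policies[i](x).squeeze(-1)      # scalar action per agent assumed
        actor_loss = -critic_i(x, a_new).mean()
        actor_opt_i.zero_grad()
        actor_loss.backward()
        actor_opt_i.step()

        # Soft update: theta'_i <- tau * theta_i + (1 - tau) * theta'_i
        for net, target_net in ((policies[i], target_policies[i]), (critic_i, target_critic_i)):
            for p, p_t in zip(net.parameters(), target_net.parameters()):
                p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)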
S104, repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain the target deep policy network.
S105, repeating the step of randomly obtaining the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
Specifically, as shown in fig. 2, during the training and traversal of the agents, the same experience buffer stores the experience sets generated in the training process. As a result, the training data randomly sampled from the experience buffer by an agent trained later is correlated with the training data of the agents trained earlier, which ensures that each agent considers the states of the surrounding agents during training and avoids selfish behavior by individual agents. Meanwhile, because the initialization state of the large complex system is preset, the value of the initialization state is changed after all agents have been trained in each control period and training is repeated, so that the deep policy network converges.
Illustratively, as shown in fig. 3, taking a typical system of a power plant, namely the condensate water supply system, as an example, the whole process specifically includes the following steps:
Step 1, taking the condensate water supply system as the research object and setting the overall control target R of the large complex system; because several return pipelines exist for different operating conditions, the control is exceptionally complex.
Step 2, decomposing the overall control target R of the condensate water supply system and taking each control node as an agent, i.e., the large complex control system consists of distributed multi-agents; each agent takes the decomposed index R_i as the control target of its own node, and a reward function r_i is designed for each agent, with N agents in total.
The reward function r_i is designed from the difference between the current measured value and the control target together with the derivative of that difference; meanwhile, the steam generator water level is taken as the core control index and is given a larger weight in the reward function of each agent.
The actions of a single condensate agent are the condensate pump speed and the condensate valve opening; the actions of a single feedwater agent are the feedwater pump speed, the feedwater valve opening and the return valve opening.
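As a concrete illustration, one possible form of such a reward function is sketched below. The sign convention (reward as a negative cost), the specific weights, and the variable names are assumptions; the sketch only reflects the structure described above, with the steam generator water level term given a larger weight.

    # Sketch of an agent reward r_i (weights, sign convention, and names are assumptions).
    def reward_i(measured, target, prev_error, dt, sg_level, sg_level_target,
                 w_track=1.0, w_deriv=0.2, w_sg=5.0):
        """r_i built from the tracking error, its derivative, and the steam generator
        water level error (the core control index, given a larger weight w_sg)."""
        error = measured - target
        d_error = (error - prev_error) / dt
        sg_error = sg_level - sg_level_target
        # Negative cost: the closer to the control targets, the higher the reward.
        return -(w_track * abs(error) + w_deriv * abs(d_error) + w_sg * abs(sg_error))

    # Example: a condensate agent at one control period (illustrative numbers only).
    r = reward_i(measured=0.83, target=0.85, prev_error=-0.03, dt=1.0,
                 sg_level=2.05, sg_level_target=2.10)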
Step 3, starting action exploration and obtaining the system initialization state x.
Step 4, obtaining the action a_i of each agent in the current control period through the deterministic policy
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state that the agent can currently observe.
Applying the joint action set a = (a_1, a_2, …, a_N), obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x'.
Storing (x, a, r, x') in the experience buffer D.
Overwriting the current state:
x = x'
and 5, training the agents in sequence aiming at the ith agent.
From experience buffers
Figure BDA0002832590430000075
Randomly extracting a total of S (x)j,aj,rj,x'j) And obtaining a training target of the deep strategy network:
Figure BDA0002832590430000076
γ is the system discount, with a range of 0< γ < 1.
Step 6, obtaining the deep policy network loss function of the i-th agent:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
Step 7, obtaining the fastest (steepest) gradient in the deep policy network training process:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
Step 8, in order to enhance training stability and ensure rapid convergence, updating the target deep policy network parameters θ'_i:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
Step 9, repeating steps 5 to 8 N times to traverse all agents.
Step 10, repeating steps 4 to 9 multiple times; the number of repetitions is set according to the actual condition of the system and is usually chosen to cover a full system operation period, for example 100.
Step 11, repeating steps 3 to 10 multiple times; the number of repetitions is set according to the actual condition of the system and is usually chosen according to training convergence, for example 5000.
Step 12, after training is finished, deploying the trained target deep policy networks μ_{θ_i} on the large complex system, each agent acting according to a_i = μ_{θ_i}(o_i).
the distributed multi-agent control of the condensate water supply system is realized through the steps.
Referring to fig. 4, a schematic structural diagram of an electronic device according to an embodiment of the present invention is shown, where the electronic device may be used to implement the method in the embodiment shown in fig. 1. As shown in fig. 4, the electronic device 400 may include: at least one central processor 401, at least one network interface 404, a user interface 403, a memory 405, at least one communication bus 402.
Wherein a communication bus 402 is used to enable connective communication between these components.
The user interface 403 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 403 may also include a standard wired interface and a wireless interface.
The network interface 404 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The central processing unit 401 may include one or more processing cores. The central processing unit 401 connects various parts of the entire terminal 400 using various interfaces and lines, and performs the various functions of the terminal 400 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 405 and by calling the data stored in the memory 405. Alternatively, the central processing unit 401 may be implemented in at least one hardware form among Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The central processing unit 401 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing the content to be displayed on the display screen; and the modem is used to handle wireless communication. It is to be understood that the modem may also be implemented by a separate chip without being integrated into the central processing unit 401.
The memory 405 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 405 includes a non-transitory computer-readable medium. The memory 405 may be used to store instructions, programs, code sets, or instruction sets. The memory 405 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; and the stored data area may store the data and the like referred to in the above method embodiments. The memory 405 may alternatively be at least one storage device located remotely from the central processing unit 401. As shown in fig. 4, the memory 405, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
In the electronic device 400 shown in fig. 4, the user interface 403 is mainly used as an interface for providing input for a user and acquiring the data input by the user, and the processor 401 may be used to invoke the distributed multi-agent deterministic policy control application for the large complex system stored in the memory 405 and specifically perform the following operations:
determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state;
training each agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method. The computer-readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus can be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some service interfaces, devices or units, and may be an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned memory comprises: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program, which is stored in a computer-readable memory, and the memory may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (7)

1. A distributed multi-agent deterministic policy control method for a large complex system, the method comprising:
determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state;
training each agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
2. The method of claim 1, wherein the determining a local node control target for the agent corresponding to each control node in the large complex system comprises:
setting an overall control target of the large complex system;
and decomposing the overall control target, taking each control node of the large complex system as an agent of the multi-agent system, each agent taking the corresponding decomposed part of the overall control target as the control target of its own node.
3. The method of claim 1, wherein the determining an action set corresponding to the agents' actions in the current control period comprises:
obtaining the action a_i of agent i in the current control period through the deterministic policy calculation
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state observable by the current agent;
determining the action set a = (a_1, a_2, …, a_N) corresponding to the actions a_i.
4. The method of claim 1, wherein the obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state comprises:
obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into the experience buffer D, and updating the initialization state x = x'.
5. The method of claim 1, wherein the training each agent according to the experience buffer comprises:
randomly extracting S training samples (x_j, a_j, r_j, x'_j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:
y_j = r_i^j + γ Q'_i(x'_j, a'_1, …, a'_N)
where Q'_i is the target centralized action-value network of agent i, the next actions a'_k are given by the target policy networks, and γ is the system discount factor with 0 < γ < 1;
calculating the deep policy network loss function of the agent according to the following formula:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
calculating the fastest (steepest) gradient in the training process of the deep policy network according to the following formula:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-5 are implemented when the computer program is executed by the processor.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202011453683.8A 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system Pending CN112418349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453683.8A CN112418349A (en) 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011453683.8A CN112418349A (en) 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system

Publications (1)

Publication Number Publication Date
CN112418349A true CN112418349A (en) 2021-02-26

Family

ID=74776168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453683.8A Pending CN112418349A (en) 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system

Country Status (1)

Country Link
CN (1) CN112418349A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867147A (en) * 2021-09-29 2021-12-31 商汤集团有限公司 Training and control method, device, computing equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111708355A (en) * 2020-06-19 2020-09-25 中国人民解放军国防科技大学 Multi-unmanned aerial vehicle action decision method and device based on reinforcement learning
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111708355A (en) * 2020-06-19 2020-09-25 中国人民解放军国防科技大学 Multi-unmanned aerial vehicle action decision method and device based on reinforcement learning
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘钱源: "基于深度强化学习的双臂机器人物体抓取", 《中国优秀硕士学位论文全文数据库》, no. 09, 15 September 2019 (2019-09-15), pages 26 - 29 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867147A (en) * 2021-09-29 2021-12-31 商汤集团有限公司 Training and control method, device, computing equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226