CN112418349A - Distributed multi-agent deterministic strategy control method for large complex system - Google Patents

Distributed multi-agent deterministic strategy control method for large complex system

Info

Publication number
CN112418349A
Authority
CN
China
Prior art keywords
agent
complex system
experience
action
training
Prior art date
Legal status
Pending
Application number
CN202011453683.8A
Other languages
Chinese (zh)
Inventor
陶模
冯毅
李献领
郑伟
周宏宽
邱志强
林原胜
汪伟
邹海
劳星胜
李少丹
赵振兴
吴君
庞杰
黄崇海
Current Assignee
Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp
Original Assignee
Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp
Priority date
Filing date
Publication date
Application filed by Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp filed Critical Wuhan No 2 Ship Design Institute No 719 Research Institute of China Shipbuilding Industry Corp
Priority to CN202011453683.8A
Publication of CN112418349A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a distributed multi-agent deterministic policy control method for a large complex system. The method determines a local node control target for the agent corresponding to each control node in the large complex system and sets a reward function for each agent; determines the action set corresponding to the agents' actions in the current control period, obtains an experience set based on the action set and the environment reward set given by the reward functions, stores the experience set in an experience buffer and updates the initialization state; trains each agent from the experience buffer until all agents have been traversed; and repeats the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, obtaining the target deep policy network. The invention applies a distributed multi-agent control method to a large complex system: multiple agents are constructed, they share information with one another, and each agent is continuously optimized during training according to its own control target, so that the control performance is improved.

Description

Distributed multi-agent deterministic strategy control method for large complex system
Technical Field
The present application relates to the technical field of operation control of large complex systems, and in particular to a distributed multi-agent deterministic policy control method for a large complex system.
Background
A system composed of a large number of individual agents and the connections among them may be referred to as a multi-agent network. In this framework, the information collected by each member is local and scattered, and no member is capable of completing the overall task on its own; existing information exchange among individuals merely drives the state quantities of all members toward equal values, so the complex tasks in a large complex system cannot be completed through such cooperation.
Disclosure of Invention
In order to solve the above problems, embodiments of the present application provide a distributed multi-agent deterministic policy control method for a large complex system.
In a first aspect, an embodiment of the present application provides a distributed multi-agent deterministic policy control method for a large complex system, where the method includes:
determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state;
training each agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
Preferably, the determining a local node control target for the agent corresponding to each control node in the large complex system includes:
setting an overall control target of the large complex system;
and decomposing the overall control target, taking each control node of the large complex system as an agent of the multi-agent system, each agent taking the corresponding decomposed part of the overall control target as the control target of its own node.
Preferably, the determining the action set corresponding to the agents' actions in the current control period includes:
obtaining the action a_i of agent i in the current control period through the deterministic policy calculation
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state observable by the current agent;
determining the action set a = (a_1, a_2, …, a_N) corresponding to the actions a_i, where N is the number of agents.
Preferably, the obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state includes:
obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into the experience buffer D, and updating the initialization state x = x'.
Preferably, the training each agent according to the experience buffer includes:
randomly extracting S training samples (x_j, a_j, r_j, x'_j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:
y_j = r_i^j + γ Q'_i(x'_j, a'_1, …, a'_N)
where Q'_i is the target centralized action-value network of agent i, the next actions a'_k are given by the target policy networks, and γ is the system discount factor with 0 < γ < 1;
calculating the deep policy network loss function of the agent according to the following formula:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
calculating the fastest (steepest) gradient in the training process of the deep policy network according to the following formula:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
In a second aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method as provided in the first aspect or any one of the possible implementations of the first aspect.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method as provided in the first aspect or any one of the possible implementation manners of the first aspect.
The invention has the following beneficial effects: 1. A distributed multi-agent control method is adopted in a large complex system: multiple agents are constructed, they share information with one another, and each agent is continuously optimized during training according to its own control target, so that the control performance is improved.
2. To avoid selfish behavior by individual agents, the training processes of all agents share the same experience buffer, so that each agent must consider the states of the surrounding agents during learning, which also accelerates the training process.
3. In a large complex system the tasks undertaken by the agents differ, so the reward function of each agent is designed to be closely related to its control objective and to the state of the large complex system, whereby the agents accomplish a common task.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a schematic flowchart of a distributed multi-agent deterministic policy control method for a large complex system according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an example of the principle of a distributed multi-agent deterministic policy control method for a large complex system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an exemplary embodiment of a condensate system according to the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In the following description, the terms "first" and "second" are used for descriptive purposes only and are not intended to indicate or imply relative importance. The following description provides embodiments of the invention, which may be combined with or substituted for one another, and the invention is thus to be construed as embracing all possible combinations of the same and/or different embodiments described. Thus, if one embodiment includes features A, B, C and another embodiment includes features B, D, then the invention should also be construed as including embodiments that contain one or more of all other possible combinations of A, B, C, D, even though such embodiments may not be explicitly recited in the following text.
The following description provides examples, and does not limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements described without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For example, the described methods may be performed in an order different than the order described, and various steps may be added, omitted, or combined. Furthermore, features described with respect to some examples may be combined into other examples.
Referring to fig. 1, fig. 1 is a schematic flowchart of a distributed multi-agent deterministic policy control method for a large complex system provided by an embodiment of the present application. In an embodiment of the present application, the method includes:
S101, determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent.
In one embodiment, the determining a local node control target for the agent corresponding to each control node in the large complex system includes:
setting an overall control target of the large complex system;
and decomposing the overall control target, taking each control node of the large complex system as an agent of the multi-agent system, each agent taking the corresponding decomposed part of the overall control target as the control target of its own node.
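For illustration only, the decomposition described above can be represented as a simple mapping from control nodes to local control targets. The node names, index names, and numeric values in the following sketch are hypothetical and are not taken from the patent; they merely show how each control node becomes one agent with its own local target.

    # Hypothetical decomposition of the overall control target into per-node
    # (per-agent) local control targets; all names and values are illustrative only.
    overall_target = {"steam_generator_level_m": 2.10}

    node_targets = {
        "condensate_pump":  {"index": "condensate_pressure_MPa", "target": 0.85},
        "condensate_valve": {"index": "condensate_flow_kg_s",    "target": 120.0},
        "feedwater_pump":   {"index": "feedwater_flow_kg_s",     "target": 150.0},
        "feedwater_valve":  {"index": "steam_generator_level_m", "target": 2.10},
    }

    # One agent per control node; each agent's reward function is later built
    # from its own decomposed target.
    agents = list(node_targets.keys())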
S102, randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state.
In one possible embodiment, the determining an action set corresponding to the agents' actions in the current control period includes:
obtaining the action a_i of agent i in the current control period through the deterministic policy calculation
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state observable by the current agent;
determining the action set a = (a_1, a_2, …, a_N) corresponding to the actions a_i.
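A minimal sketch of this action computation is given below, assuming PyTorch and a small fully connected network as the deep policy network μ_{θ_i}; the layer sizes, activation functions, and example numbers are assumptions of the sketch rather than details fixed by the method.

    # Minimal sketch (assumptions: PyTorch, a small MLP as the deep policy network).
    import torch
    import torch.nn as nn

    class DeepPolicyNetwork(nn.Module):
        """mu_theta_i: maps the local observable state o_i to the action a_i."""
        def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Tanh(),  # actions scaled to [-1, 1]
            )

        def forward(self, o_i: torch.Tensor) -> torch.Tensor:
            return self.net(o_i)

    # a_i = mu_theta_i(o_i): deterministic action of agent i in the current control period
    policy_i = DeepPolicyNetwork(obs_dim=4, act_dim=2)
    o_i = torch.tensor([0.82, 0.05, 2.10, 0.60])   # local state observed by agent i (illustrative)
    a_i = policy_i(o_i)                            # action a_i
    # Joint action set a = (a_1, ..., a_N): evaluate each agent's policy on its own o_i.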
In one embodiment, the obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state includes:
obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into the experience buffer D, and updating the initialization state x = x'.
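The experience set and the experience buffer D can be sketched as follows; the fixed capacity and the uniform random sampling used for later training are assumptions of the sketch. All agents share this single buffer.

    # Sketch of the shared experience buffer D (capacity and uniform sampling are assumptions).
    import random
    from collections import deque

    class ExperienceBuffer:
        def __init__(self, capacity: int = 100_000):
            self.data = deque(maxlen=capacity)

        def store(self, x, a, r, x_next):
            """Store one experience set (x, a, r, x')."""
            self.data.append((x, a, r, x_next))

        def sample(self, batch_size: int):
            """Randomly extract S = batch_size experience sets for training."""
            return random.sample(list(self.data), batch_size)

    D = ExperienceBuffer()
    # After applying the joint action a in state x and observing the reward set r and new state x':
    # D.store(x, a, r, x_next)
    # x = x_next   # update the initialization state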
S103, training each agent according to the experience buffer until all agents have been traversed.
In one possible implementation, the training each agent according to the experience buffer includes:
randomly extracting S training samples (x_j, a_j, r_j, x'_j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:
y_j = r_i^j + γ Q'_i(x'_j, a'_1, …, a'_N)
where Q'_i is the target centralized action-value network of agent i, the next actions a'_k are given by the target policy networks, and γ is the system discount factor with 0 < γ < 1;
calculating the deep policy network loss function of the agent according to the following formula:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
calculating the fastest (steepest) gradient in the training process of the deep policy network according to the following formula:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
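A sketch of one such training step for agent i is given below. It follows the MADDPG-style update suggested by the formulas above and assumes PyTorch, a centralized action-value network Q_i (an nn.Module taking the global state and the joint action) with a target copy, target policy networks, Adam optimizers, scalar per-agent actions, and that each agent's local observation is approximated by the global state x; all of these are assumptions of the sketch, not details fixed by the patent. The buffer D is the one from the earlier sketch.

    # Sketch of one training step for the i-th agent (assumptions listed above).
    import torch

    def train_agent_i(i, D, policies, target_policies, critic_i, target_critic_i,
                      actor_opt_i, critic_opt_i, S=64, gamma=0.95, tau=0.01):
        batch = D.sample(S)                                       # S tuples (x_j, a_j, r_j, x'_j)
        x   = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
        a   = torch.stack([torch.as_tensor(b[1], dtype=torch.float32) for b in batch])
        r_i = torch.tensor([b[2][i] for b in batch], dtype=torch.float32)
        x2  = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])

        # Training target y_j = r_i^j + gamma * Q'_i(x'_j, a'_1, ..., a'_N),
        # with the next actions given by the target policy networks.
        with torch.no_grad():
            a2 = torch.cat([target_policies[k](x2) for k in range(len(policies))], dim=-1)
            y = r_i + gamma * target_critic_i(x2, a2).squeeze(-1)

        # Loss L(theta_i) = (1/S) * sum_j (y_j - Q_i(x_j, a_j))^2
        q = critic_i(x, a).squeeze(-1)
        critic_loss = ((y - q) ** 2).mean()
        critic_opt_i.zero_grad()
        critic_loss.backward()
        critic_opt_i.step()

        # Deterministic policy gradient: ascend Q_i with agent i's action replaced by
        # mu_i(o_i) while the other agents' actions are taken from the sampled batch.
        a_new = a.clone()
        a_new[:, i] = policies[i](x).squeeze(-1)      # scalar action per agent assumed
        actor_loss = -critic_i(x, a_new).mean()
        actor_opt_i.zero_grad()
        actor_loss.backward()
        actor_opt_i.step()

        # Soft update: theta'_i <- tau * theta_i + (1 - tau) * theta'_i
        for net, target_net in ((policies[i], target_policies[i]), (critic_i, target_critic_i)):
            for p, p_t in zip(net.parameters(), target_net.parameters()):
                p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)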
S104, repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain the target deep policy network.
S105, repeating the step of randomly obtaining the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
Specifically, as shown in fig. 2, during the training and traversal of the agents, the same experience buffer stores the experience sets generated in the training process. As a result, the training data randomly sampled from the experience buffer by an agent trained later is correlated with the training data of the agents trained earlier, which ensures that each agent considers the states of the surrounding agents during training and avoids selfish behavior by individual agents. Meanwhile, because the initialization state of the large complex system is preset, the value of the initialization state is changed after all agents have been trained in each control period and training is repeated, so that the deep policy network converges.
Illustratively, as shown in fig. 3, taking a typical system of a power plant, namely the condensate water supply system, as an example, the whole process specifically includes the following steps:
Step 1, taking the condensate water supply system as the research object and setting the overall control target R of the large complex system; because several return pipelines exist for different operating conditions, the control is exceptionally complex.
Step 2, decomposing the overall control target R of the condensate water supply system and taking each control node as an agent, i.e., the large complex control system consists of distributed multi-agents; each agent takes the decomposed index R_i as the control target of its own node, and a reward function r_i is designed for each agent, with N agents in total.
The reward function r_i is designed from the difference between the current measured value and the control target together with the derivative of that difference; meanwhile, the steam generator water level is taken as the core control index and is given a larger weight in the reward function of each agent.
The actions of a single condensate agent are the condensate pump speed and the condensate valve opening; the actions of a single feedwater agent are the feedwater pump speed, the feedwater valve opening and the return valve opening.
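As a concrete illustration, one possible form of such a reward function is sketched below. The sign convention (reward as a negative cost), the specific weights, and the variable names are assumptions; the sketch only reflects the structure described above, with the steam generator water level term given a larger weight.

    # Sketch of an agent reward r_i (weights, sign convention, and names are assumptions).
    def reward_i(measured, target, prev_error, dt, sg_level, sg_level_target,
                 w_track=1.0, w_deriv=0.2, w_sg=5.0):
        """r_i built from the tracking error, its derivative, and the steam generator
        water level error (the core control index, given a larger weight w_sg)."""
        error = measured - target
        d_error = (error - prev_error) / dt
        sg_error = sg_level - sg_level_target
        # Negative cost: the closer to the control targets, the higher the reward.
        return -(w_track * abs(error) + w_deriv * abs(d_error) + w_sg * abs(sg_error))

    # Example: a condensate agent at one control period (illustrative numbers only).
    r = reward_i(measured=0.83, target=0.85, prev_error=-0.03, dt=1.0,
                 sg_level=2.05, sg_level_target=2.10)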
Step 3, starting action exploration and obtaining the system initialization state x.
Step 4, obtaining the action a_i of each agent in the current control period through the deterministic policy
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state that the agent can currently observe.
Applying the joint action set a = (a_1, a_2, …, a_N), obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x'.
Storing (x, a, r, x') in the experience buffer D.
Overwriting the current state:
x = x'
and 5, training the agents in sequence aiming at the ith agent.
From experience buffers
Figure BDA0002832590430000075
Randomly extracting a total of S (x)j,aj,rj,x'j) And obtaining a training target of the deep strategy network:
Figure BDA0002832590430000076
γ is the system discount, with a range of 0< γ < 1.
Step 6, obtaining the deep policy network loss function of the i-th agent:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
Step 7, obtaining the fastest (steepest) gradient in the deep policy network training process:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
Step 8, in order to enhance training stability and ensure rapid convergence, updating the target deep policy network parameters θ'_i:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
Step 9, repeating steps 5 to 8 N times to traverse all agents.
Step 10, repeating steps 4 to 9 multiple times; the number of repetitions is set according to the actual condition of the system and is usually chosen to cover a full system operation period, for example 100.
Step 11, repeating steps 3 to 10 multiple times; the number of repetitions is set according to the actual condition of the system and is usually chosen according to training convergence, for example 5000.
Step 12, after training is finished, deploying the trained target deep policy networks μ_{θ_i} on the large complex system, each agent acting according to a_i = μ_{θ_i}(o_i).
the distributed multi-agent control of the condensate water supply system is realized through the steps.
Referring to fig. 4, a schematic structural diagram of an electronic device according to an embodiment of the present invention is shown, where the electronic device may be used to implement the method in the embodiment shown in fig. 1. As shown in fig. 4, the electronic device 400 may include: at least one central processor 401, at least one network interface 404, a user interface 403, a memory 405, at least one communication bus 402.
Wherein a communication bus 402 is used to enable connective communication between these components.
The user interface 403 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 403 may also include a standard wired interface and a wireless interface.
The network interface 404 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The central processing unit 401 may include one or more processing cores. The central processing unit 401 connects various parts of the entire terminal 400 using various interfaces and lines, and performs the various functions of the terminal 400 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 405 and by calling the data stored in the memory 405. Alternatively, the central processing unit 401 may be implemented in at least one hardware form among Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The central processing unit 401 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing the content to be displayed on the display screen; and the modem is used to handle wireless communication. It is to be understood that the modem may also be implemented by a separate chip without being integrated into the central processing unit 401.
The memory 405 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 405 includes a non-transitory computer-readable medium. The memory 405 may be used to store instructions, programs, code sets, or instruction sets. The memory 405 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; and the stored data area may store the data and the like referred to in the above method embodiments. The memory 405 may alternatively be at least one storage device located remotely from the central processing unit 401. As shown in fig. 4, the memory 405, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
In the electronic device 400 shown in fig. 4, the user interface 403 is mainly used as an interface for providing input for a user and acquiring the data input by the user, and the processor 401 may be used to invoke the distributed multi-agent deterministic policy control application for the large complex system stored in the memory 405 and specifically perform the following operations:
determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state;
training each agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method. The computer-readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus can be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some service interfaces, devices or units, and may be an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned memory comprises: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program, which is stored in a computer-readable memory, and the memory may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (7)

1. A distributed multi-agent deterministic policy control method for a large complex system, the method comprising:
determining a local node control target for the agent corresponding to each control node in the large complex system, and setting a reward function for each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the agents' actions in the current control period, obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state;
training each agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the agents' actions in the current control period until all agents have been traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
2. The method of claim 1, wherein the determining a local node control target for the agent corresponding to each control node in the large complex system comprises:
setting an overall control target of the large complex system;
and decomposing the overall control target, taking each control node of the large complex system as an agent of the multi-agent system, each agent taking the corresponding decomposed part of the overall control target as the control target of its own node.
3. The method of claim 1, wherein the determining an action set corresponding to the agents' actions in the current control period comprises:
obtaining the action a_i of agent i in the current control period through the deterministic policy calculation
a_i = μ_{θ_i}(o_i)
where μ_{θ_i} is the deep policy network, θ_i denotes the deep policy network parameters, and o_i is the local state observable by the current agent;
determining the action set a = (a_1, a_2, …, a_N) corresponding to the actions a_i.
4. The method of claim 1, wherein the obtaining an experience set based on the action set and the environment reward set corresponding to the reward functions, storing the experience set in an experience buffer, and updating the initialization state comprises:
obtaining the environment reward set r = (r_1, r_2, …, r_N) of the large complex system and the new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into the experience buffer D, and updating the initialization state x = x'.
5. The method of claim 1, wherein the training each agent according to the experience buffer comprises:
randomly extracting S training samples (x_j, a_j, r_j, x'_j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:
y_j = r_i^j + γ Q'_i(x'_j, a'_1, …, a'_N)
where Q'_i is the target centralized action-value network of agent i, the next actions a'_k are given by the target policy networks, and γ is the system discount factor with 0 < γ < 1;
calculating the deep policy network loss function of the agent according to the following formula:
L(θ_i) = (1/S) Σ_j (y_j - Q_i(x_j, a_1^j, …, a_N^j))²
calculating the fastest (steepest) gradient in the training process of the deep policy network according to the following formula:
∇_{θ_i}J ≈ (1/S) Σ_j ∇_{θ_i}μ_i(o_i^j) ∇_{a_i}Q_i(x_j, a_1^j, …, a_i, …, a_N^j)|_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:
θ'_i ← τθ_i + (1-τ)θ'_i
where τ is the update rate of the deep policy network parameters; τ can be tuned to adjust the training speed and generally satisfies 0 < τ < 0.5.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-5 are implemented when the computer program is executed by the processor.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202011453683.8A 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system Pending CN112418349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453683.8A CN112418349A (en) 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011453683.8A CN112418349A (en) 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system

Publications (1)

Publication Number Publication Date
CN112418349A true CN112418349A (en) 2021-02-26

Family

ID=74776168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453683.8A Pending CN112418349A (en) 2020-12-12 2020-12-12 Distributed multi-agent deterministic strategy control method for large complex system

Country Status (1)

Country Link
CN (1) CN112418349A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867147A (en) * 2021-09-29 2021-12-31 商汤集团有限公司 Training and control method, device, computing equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111708355A (en) * 2020-06-19 2020-09-25 中国人民解放军国防科技大学 Multi-unmanned aerial vehicle action decision method and device based on reinforcement learning
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111708355A (en) * 2020-06-19 2020-09-25 中国人民解放军国防科技大学 Multi-unmanned aerial vehicle action decision method and device based on reinforcement learning
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘钱源: "基于深度强化学习的双臂机器人物体抓取", 《中国优秀硕士学位论文全文数据库》, no. 09, 15 September 2019 (2019-09-15), pages 26 - 29 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867147A (en) * 2021-09-29 2021-12-31 商汤集团有限公司 Training and control method, device, computing equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226