CN112418349A - Distributed multi-agent deterministic strategy control method for large complex system - Google Patents
- Publication number
- CN112418349A (application number CN202011453683.8A)
- Authority
- CN
- China
- Prior art keywords
- agent
- complex system
- experience
- action
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method that determines a local node control target for the agent corresponding to each control node in a large complex system and sets a reward function for each agent; determines an action set corresponding to the actions of the agents in the current control period, obtains an experience set based on the action set and the environment reward set corresponding to the reward functions, stores the experience set into an experience buffer and updates the initialization state; trains the agents according to the experience buffer until all agents have been traversed; and repeats the step of determining the action set corresponding to the actions of the agents in the current control period until all agents have been traversed, obtaining a target deep policy network. The invention realizes a distributed multi-agent control method in a large complex system: a plurality of agents are constructed that share information with one another and are continuously optimized during training according to their control targets, so that control performance is improved.
Description
Technical Field
The application relates to the technical field of large complex system operation control, in particular to a distributed multi-agent deterministic strategy control method for a large complex system.
Background
A system composed of a large number of individual agents and the connections among them may be referred to as a multi-agent network. In such a framework, the information collected by each member is local and scattered, and no single member has the capability of completing the overall task on its own. Although the state quantities of the members can be driven toward agreement through information exchange among individuals, such consensus alone is not sufficient to complete the complex tasks of a large complex system through cooperation.
Disclosure of Invention
In order to solve the above problems, embodiments of the present application provide a method for controlling a distributed multi-agent deterministic policy in a large complex system.
In a first aspect, an embodiment of the present application provides a large complex system distributed multi-agent deterministic policy control method, where the method includes:
determining a local node control target of each agent corresponding to each control node in a large-scale complex system, and setting a reward function of each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the action of the agent in a current control period, obtaining an experience set based on the action set and an environment reward set corresponding to the reward function, storing the experience set to an experience buffer area and updating the initialization state;
training the agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the action of the agent in the current control period until all the agents are traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
Preferably, the determining the node control target of each agent corresponding to each control node in the large complex system includes:
setting an overall control target of a large complex system;
and decomposing the overall control target, taking each control node of the large-scale complex system as a multi-agent, and taking the corresponding decomposed overall control target as a control target of the node by each agent.
Preferably, the determining the action set corresponding to the action of the agent in the current control period includes:
obtaining the action a_i of the agent in the current control period through deterministic policy calculation:

a_i = μ_{θ_i}(o_i)

where μ_{θ_i} is the deep policy network, θ_i are the deep policy network parameters, and o_i is the local state observable by the current agent;
Preferably, the obtaining an experience set based on the action set and an environment reward set corresponding to the reward function, storing the experience set in an experience buffer, and updating the initialization state includes:
obtaining a set of environment rewards r = (r_1, …, r_N) of the large complex system and a new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into an experience buffer D, and updating the initialization state x = x'.
Preferably, the training the agent according to the experience buffer includes:
randomly extracting a total of S training samples (x^j, a^j, r^j, x'^j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:

y^j = r_i^j + γ Q_i^{μ'}(x'^j, a'_1, …, a'_N) |_{a'_k = μ'_{θ'_k}(o_k^j)}

where γ is the system discount factor, with value range 0 < γ < 1, and Q_i^{μ'} is the target action-value function of agent i;

calculating the deep policy network loss function of the agent according to the following formula:

L(θ_i) = (1/S) Σ_j ( y^j − Q_i^μ(x^j, a_1^j, …, a_N^j) )²

calculating the steepest gradient in the training process of the deep policy network according to the following formula:

∇_{θ_i} J ≈ (1/S) Σ_j ∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x^j, a_1^j, …, a_i, …, a_N^j) |_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:

θ'_i ← τ θ_i + (1 − τ) θ'_i

where τ is the update rate of the deep policy network parameters; τ may be tuned to adjust the training speed, and its value generally lies in the range 0 < τ < 0.5.
In a second aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method as provided in the first aspect or any one of the possible implementations of the first aspect.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method as provided in the first aspect or any one of the possible implementation manners of the first aspect.
The invention has the beneficial effects that: 1. a distributed multi-agent control method is adopted in a large complex system, a plurality of agents are constructed, information is shared mutually, and the agents are continuously optimized in the training process according to the control targets of the agents, so that the control performance is improved.
2. In order to avoid selfish behavior of individual agents, the training process of the agents shares the same experience buffer area, so that the states of surrounding agents must be considered in the learning process of the agents, and the training process can be accelerated.
3. In large complex systems, the tasks undertaken by each agent are inconsistent, and the reward functions of agents are designed to be closely related to the control objectives and the state of the large complex system, thereby accomplishing a common task.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic flowchart of a distributed multi-agent deterministic policy control method for a large complex system according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an example of the principle of a distributed multi-agent deterministic policy control method for a large complex system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an exemplary embodiment of a condensate system according to the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In the following description, the terms "first" and "second" are used for descriptive purposes only and are not intended to indicate or imply relative importance. The following description provides embodiments of the invention, which may be combined with or substituted for various embodiments, and the invention is thus to be construed as embracing all possible combinations of the same and/or different embodiments described. Thus, if one embodiment includes features A, B, and C and another embodiment includes features B and D, then the invention should also be construed as including embodiments that include one or more of all other possible combinations of A, B, C, and D, even though such embodiments may not be explicitly recited in the following text.
The following description provides examples, and does not limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements described without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For example, the described methods may be performed in an order different than the order described, and various steps may be added, omitted, or combined. Furthermore, features described with respect to some examples may be combined into other examples.
Referring to fig. 1, fig. 1 is a large-scale complex system distributed multi-agent deterministic policy control method provided by an embodiment of the present application. In an embodiment of the present application, the method includes:
s101, determining a node control target of each agent corresponding to each control node in the large-scale complex system, and setting a reward function of each agent.
In one embodiment, the determining the local node control target of each agent corresponding to each control node in the large complex system includes:
setting an overall control target of a large complex system;
and decomposing the overall control target, taking each control node of the large-scale complex system as a multi-agent, and taking the corresponding decomposed overall control target as a control target of the node by each agent.
S102, randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the action of the agent in the current control period, obtaining an experience set based on the action set and an environment reward set corresponding to the reward function, storing the experience set to an experience buffer area, and updating the initialization state.
In one possible embodiment, the determining a set of actions corresponding to the actions of the agent in the current control period includes:
obtaining the action a_i of the agent in the current control period through deterministic policy calculation:

a_i = μ_{θ_i}(o_i)

where μ_{θ_i} is the deep policy network, θ_i are the deep policy network parameters, and o_i is the local state observable by the current agent;
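As an illustrative sketch only (not the patent's implementation; the network shape, layer sizes, and function names are assumptions), the deterministic action computation a_i = μ_{θ_i}(o_i) can be realized with a small two-layer network:

```python
import numpy as np

def init_policy(obs_dim, act_dim, hidden=64, seed=0):
    """Initialize the parameters theta_i of a small two-layer policy (illustrative)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (hidden, obs_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (act_dim, hidden)),
        "b2": np.zeros(act_dim),
    }

def policy_action(theta, obs):
    """a_i = mu_{theta_i}(o_i): map the agent's local observation to an action."""
    h = np.tanh(theta["W1"] @ obs + theta["b1"])
    return np.tanh(theta["W2"] @ h + theta["b2"])  # bounded action in (-1, 1)
```

The final tanh keeps actions bounded, which suits control quantities such as valve openings after rescaling.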
In one embodiment, the obtaining an experience set based on the action set and an environment reward set corresponding to the reward function, storing the experience set in an experience buffer, and updating the initialization state includes:
obtaining a set of environment rewards r = (r_1, …, r_N) of the large complex system and a new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x and the new large complex system state x';
storing the experience set (x, a, r, x') into an experience buffer D, and updating the initialization state x = x'.
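A minimal sketch of the shared experience buffer D described above (the capacity and class name are illustrative assumptions, not part of the patent):

```python
import random
from collections import deque

class SharedReplayBuffer:
    """Experience buffer D shared by all agents; each entry is (x, a, r, x')."""

    def __init__(self, capacity=100_000):
        # Bounded deque: oldest experience sets are discarded when full.
        self.buf = deque(maxlen=capacity)

    def store(self, x, a, r, x_next):
        """Store one experience set (x, a, r, x')."""
        self.buf.append((x, a, r, x_next))

    def sample(self, S):
        """Randomly draw up to S experience sets without replacement for training."""
        return random.sample(self.buf, min(S, len(self.buf)))
```

After storing, the caller continues with the updated state, i.e. x = x_next, matching the update x = x' above.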
S103, training the agent according to the experience buffer until all agents have been traversed.
In one possible implementation, the training the agent according to the experience buffer includes:
randomly extracting a total of S training samples (x^j, a^j, r^j, x'^j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:

y^j = r_i^j + γ Q_i^{μ'}(x'^j, a'_1, …, a'_N) |_{a'_k = μ'_{θ'_k}(o_k^j)}

where γ is the system discount factor, with value range 0 < γ < 1, and Q_i^{μ'} is the target action-value function of agent i;

calculating the deep policy network loss function of the agent according to the following formula:

L(θ_i) = (1/S) Σ_j ( y^j − Q_i^μ(x^j, a_1^j, …, a_N^j) )²

calculating the steepest gradient in the training process of the deep policy network according to the following formula:

∇_{θ_i} J ≈ (1/S) Σ_j ∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x^j, a_1^j, …, a_i, …, a_N^j) |_{a_i = μ_i(o_i^j)}
updating the target deep policy network parameters:

θ'_i ← τ θ_i + (1 − τ) θ'_i

where τ is the update rate of the deep policy network parameters; τ may be tuned to adjust the training speed, and its value generally lies in the range 0 < τ < 0.5.
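The soft update θ'_i ← τθ_i + (1 − τ)θ'_i can be sketched per parameter as follows (an illustrative sketch; representing the parameter sets as simple dictionaries is an assumption):

```python
def soft_update(theta, theta_target, tau=0.01):
    """Blend online parameters theta_i into target parameters theta'_i:
    theta'_i <- tau * theta_i + (1 - tau) * theta'_i, applied per parameter."""
    assert 0 < tau < 0.5, "tau outside the range recommended in the text"
    return {k: tau * theta[k] + (1 - tau) * theta_target[k] for k in theta}
```

A small τ makes the target network track the online network slowly, which stabilizes the training targets y^j.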
And S104, repeating the step of determining the action set corresponding to the action of the intelligent agent in the current control period until all the intelligent agents are traversed to obtain the target depth strategy network.
S105, repeating the step of randomly obtaining the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
Specifically, as shown in fig. 2, in the training and traversing processes of each agent, the same experience buffer stores the experience sets generated in the training process, so that set data for training randomly extracted from the experience buffer by the agent performing the subsequent training is associated with training data of the agent performing the previous training, thereby ensuring that the states of the agents around can be considered in the training process of the agent, and avoiding the selfish behavior of the individual agent. Meanwhile, because the initialization state of the large-scale complex system is preset, after all the agents are trained in each control period, the numerical value of the initialization state is changed, and the training is repeated, so that the convergence of the depth strategy network is realized.
Illustratively, as shown in fig. 3, taking a typical system of a power plant, i.e., a condensate water supply system as an example, the whole process specifically includes the following steps:
Step 2, decomposing the overall control target R of the condensate water supply system and taking each control node as an agent, i.e., the large complex control system consists of distributed multi-agents; each agent takes the corresponding decomposed index R_i as the control target of its node, and a reward function r_i is designed for each agent; there are N agents in total.
The reward function r_i is designed from the difference between the current measured value and the control target, together with the derivative of that difference; meanwhile, since the steam generator water level is the core control index, it enters the reward function of each agent with a larger weight.
The actions of the condensate agent are the condensate pump speed and the condensate valve opening; the actions of the feedwater agent are the feedwater pump speed, the feedwater valve opening and the return valve opening.
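As a hedged illustration of such a reward design (the specific weights, the use of absolute errors, and the function signature are assumptions, not the patent's formula):

```python
def agent_reward(measured, target, prev_error, dt=1.0, level_error=None, level_weight=5.0):
    """r_i built from the tracking error and its derivative; the steam generator
    water level error, if supplied, enters with a larger (illustrative) weight."""
    error = measured - target
    d_error = (error - prev_error) / dt       # derivative of the difference
    r = -(abs(error) + abs(d_error))          # penalize error and its rate of change
    if level_error is not None:
        r -= level_weight * abs(level_error)  # core control index dominates the reward
    return r
```

With this shape, an agent is rewarded for driving its measured value toward its decomposed target R_i while the shared water-level term couples all agents to the overall objective.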
And 3, starting action exploration to obtain a system initialization state x.
Step 4, obtaining the action a_i of each agent in the current control period through the deterministic policy:

a_i = μ_{θ_i}(o_i)

where μ_{θ_i} is the deep policy network, θ_i are the deep policy network parameters, and o_i is the local state currently observable by the agent, and then updating the current state:

x = x'
Step 5, training the agents in sequence; for the i-th agent, randomly extracting a total of S samples (x^j, a^j, r^j, x'^j) from the experience buffer D and obtaining the training target of the deep policy network:

y^j = r_i^j + γ Q_i^{μ'}(x'^j, a'_1, …, a'_N) |_{a'_k = μ'_{θ'_k}(o_k^j)}

where γ is the system discount factor, with a range of 0 < γ < 1.
Step 6, obtaining the deep policy network loss function of the i-th agent:

L(θ_i) = (1/S) Σ_j ( y^j − Q_i^μ(x^j, a_1^j, …, a_N^j) )²

Step 7, obtaining the steepest gradient in the deep policy network training process:

∇_{θ_i} J ≈ (1/S) Σ_j ∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x^j, a_1^j, …, a_i, …, a_N^j) |_{a_i = μ_i(o_i^j)}
step 8, in order to enhance the stability of training and ensure rapid convergence, updating the target depth strategy network parameter theta'i,
θ'i←τθi+(1-τ)θ'i
Wherein tau is the update rate of the depth strategy network parameters, tau can be adjusted to adjust the training speed, and the value range is generally 0< tau < 0.5.
Step 9, repeating steps 5 to 8 a total of N times, so that all agents are traversed.
Step 10, repeating steps 4 to 9 multiple times; the number of repetitions is set according to the actual condition of the system and is usually chosen to cover a full system operation period, for example 100.
Step 11, repeating steps 3 to 10 multiple times; the number of repetitions is set according to the training convergence behavior, for example 5000.
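The nested repetition of steps 3 to 11 can be sketched as the following training skeleton (all class interfaces, names, and the batch size are illustrative assumptions):

```python
def train(agents, env, D, episodes=5000, periods=100):
    """Skeleton of steps 3-11: outer episode loop (step 11), per-period action
    selection and experience storage (steps 4 and 10), and per-agent updates
    from the shared buffer D (steps 5-9)."""
    for episode in range(episodes):                # step 11: repeat until convergence
        x = env.reset()                            # step 3: random initial state x
        for t in range(periods):                   # step 10: control periods
            # step 4: each agent acts on its local observation
            a = [ag.act(o) for ag, o in zip(agents, env.observe(x))]
            x_next, r = env.step(a)                # environment rewards and new state
            D.store(x, a, r, x_next)               # shared experience buffer
            x = x_next                             # x <- x'
            for ag in agents:                      # steps 5-9: traverse all N agents
                ag.update(D.sample(64))            # random minibatch from D
```

Sharing the single buffer D across the inner agent loop is what couples each agent's update to the experience generated by its neighbors.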
Step 12, after the training is finished, deploying the trained target deep policy network on the large complex system.
the distributed multi-agent control of the condensate water supply system is realized through the steps.
Referring to fig. 4, a schematic structural diagram of an electronic device according to an embodiment of the present invention is shown, where the electronic device may be used to implement the method in the embodiment shown in fig. 1. As shown in fig. 4, the electronic device 400 may include: at least one central processor 401, at least one network interface 404, a user interface 403, a memory 405, at least one communication bus 402.
Wherein a communication bus 402 is used to enable connective communication between these components.
The user interface 403 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 403 may also include a standard wired interface and a wireless interface.
The network interface 404 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The central processing unit 401 may include one or more processing cores. The central processing unit 401 connects various parts within the entire terminal 400 using various interfaces and lines, and performs various functions of the terminal 400 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 405 and calling data stored in the memory 405. Alternatively, the central processing unit 401 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The central processing unit 401 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing the content to be displayed by the display screen; and the modem is used to handle wireless communications. It is to be understood that the modem may also be implemented by a single chip without being integrated into the central processing unit 401.
The memory 405 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 405 includes a non-transitory computer-readable medium. The memory 405 may be used to store instructions, programs, code sets, or instruction sets. The memory 405 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; and the stored data area may store data referred to in the above method embodiments. The memory 405 may alternatively be at least one storage device located remotely from the central processing unit 401. As shown in fig. 4, the memory 405, as a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
In the electronic device 400 shown in fig. 4, the user interface 403 is mainly used as an interface for providing input for a user, and acquiring data input by the user; and processor 401 may be used to invoke a large complex system distributed multi-agent deterministic policy control application stored in memory 405 and specifically perform the following operations:
determining a local node control target of each agent corresponding to each control node in a large-scale complex system, and setting a reward function of each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the action of the agent in a current control period, obtaining an experience set based on the action set and an environment reward set corresponding to the reward function, storing the experience set to an experience buffer area and updating the initialization state;
training the agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the action of the agent in the current control period until all the agents are traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method. The computer-readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus can be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some service interfaces, devices or units, and may be an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned memory comprises: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program, which is stored in a computer-readable memory, and the memory may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Claims (7)
1. A large complex system distributed multi-agent deterministic policy control method, the method comprising:
determining a local node control target of each agent corresponding to each control node in a large-scale complex system, and setting a reward function of each agent;
randomly acquiring an initialization state of the large complex system, determining an action set corresponding to the action of the agent in a current control period, obtaining an experience set based on the action set and an environment reward set corresponding to the reward function, storing the experience set to an experience buffer area and updating the initialization state;
training the agent according to the experience buffer until all agents have been traversed;
repeating the step of determining the action set corresponding to the action of the agent in the current control period until all the agents are traversed, to obtain a target deep policy network;
and repeating the step of randomly acquiring the initialization state of the large complex system until the target deep policy network converges, and controlling the large complex system based on the converged target deep policy network.
2. The method of claim 1, wherein the determining the local node control target of each agent corresponding to each control node in the large complex system comprises:
setting an overall control target of the large complex system;
and decomposing the overall control target, taking each control node of the large complex system as an agent, each agent taking its share of the decomposed overall control target as the control target of its node.
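A minimal sketch of the claim-2 decomposition, under the assumption (not stated in the patent) of an even split and a reward equal to the negative local tracking error; `decompose_target` and `make_reward` are hypothetical names.

```python
def decompose_target(overall_target, n_nodes):
    """Split the overall control target evenly across the control nodes."""
    return [overall_target / n_nodes] * n_nodes

def make_reward(local_target):
    """Reward function of one agent: negative deviation from its node target."""
    return lambda node_output: -abs(node_output - local_target)

targets = decompose_target(overall_target=90.0, n_nodes=3)
rewards = [make_reward(t) for t in targets]
print(targets)            # [30.0, 30.0, 30.0]
print(rewards[0](28.5))   # -1.5
```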
3. The method of claim 1, wherein the determining a set of actions corresponding to the actions of the agent during the current control period comprises:
obtaining the action a_i of the agent in the current control period through the deterministic policy calculation:
a_i = μ_i(o_i)
wherein μ_i is the deep policy network, θ_i is the deep policy network parameter, and o_i is the local state observable by the current agent;
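The deterministic action step of claim 3 can be sketched as a small feed-forward policy a_i = μ_i(o_i). The two-layer tanh network and its sizes are illustrative assumptions, not the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_policy(obs_dim, act_dim, hidden=64):
    """Randomly initialised parameters θ_i of a two-layer policy μ_i."""
    return {
        "W1": rng.normal(0, 0.1, (hidden, obs_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (act_dim, hidden)),
        "b2": np.zeros(act_dim),
    }

def act(theta, o_i):
    """a_i = μ_i(o_i): deterministic map from local observation to action."""
    h = np.tanh(theta["W1"] @ o_i + theta["b1"])
    return np.tanh(theta["W2"] @ h + theta["b2"])

theta_i = make_policy(obs_dim=4, act_dim=2)
a_i = act(theta_i, np.zeros(4))
print(a_i.shape)  # (2,)
```

Because the policy is deterministic, calling `act` twice on the same observation returns the same action.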
4. The method of claim 1, wherein the obtaining an experience set based on the action set and the environment reward set corresponding to the reward function, storing the experience set in an experience buffer, and updating the initialization state comprises:
obtaining the environment reward set r of the large complex system and a new large complex system state x';
generating an experience set (x, a, r, x') based on the action set a, the environment reward set r, the initialization state x, and the new large complex system state x';
and storing the experience set (x, a, r, x') into the experience buffer D, and updating the initialization state x = x'.
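The experience-buffer step of claim 4 can be sketched as a bounded FIFO store of joint tuples (x, a, r, x'); the capacity and the toy transition below are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience buffer D holding joint transitions (x, a, r, x')."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)   # oldest experience evicted first

    def store(self, x, a, r, x_next):
        self.buf.append((x, a, r, x_next))

    def sample(self, S):
        """Randomly extract S training tuples (x_j, a_j, r_j, x'_j)."""
        return random.sample(list(self.buf), S)

D = ReplayBuffer()
x = [0.0, 0.0]
for t in range(10):
    a = [0.1, -0.1]                      # joint action set of all agents
    r = [1.0, 0.5]                       # per-agent environment rewards
    x_next = [v + 0.1 for v in x]        # toy next state
    D.store(x, a, r, x_next)
    x = x_next                           # x = x': roll the state forward
print(len(D.buf))  # 10
batch = D.sample(4)
print(len(batch))  # 4
```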
5. The method of claim 1, wherein the training the agent according to the experience buffer comprises:
randomly extracting S training tuples (x_j, a_j, r_j, x'_j) from the experience buffer D, and obtaining the training target of the deep policy network according to the following formula:
y_j = r_j + γ Q'_i(x'_j, a'_1, …, a'_N), where a'_k = μ'_k(o_k)
wherein γ is the system discount factor, with value range 0 < γ < 1;
calculating the deep policy network loss function of the agent according to the following formula:
L(θ_i) = (1/S) Σ_j ( y_j − Q_i(x_j, a_1, …, a_N) )²
calculating the steepest gradient in the deep policy network training process according to the following formula:
∇_{θ_i} J ≈ (1/S) Σ_j ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i(x_j, a_1, …, a_N) |_{a_i = μ_i(o_i)}
updating the target deep policy network parameters:
θ'_i ← τ θ_i + (1 − τ) θ'_i
wherein τ is the update rate of the deep policy network parameters; adjusting τ adjusts the training speed, and its value range is generally 0 < τ < 0.5.
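The update step of claim 5 can be sketched numerically as follows, assuming standard actor-critic training targets; the Q-values here are placeholder numbers rather than a trained critic, and `td_targets` / `critic_loss` / `soft_update` are hypothetical helper names.

```python
import numpy as np

def td_targets(r_j, q_next, gamma=0.95):
    """y_j = r_j + γ Q'(x'_j, a'_1, …, a'_N), with 0 < γ < 1."""
    return r_j + gamma * q_next

def critic_loss(y, q):
    """Mean squared error over the S sampled tuples."""
    return float(np.mean((y - q) ** 2))

def soft_update(theta, theta_target, tau=0.01):
    """θ'_i ← τ θ_i + (1 − τ) θ'_i, with 0 < τ < 0.5."""
    return {k: tau * theta[k] + (1 - tau) * theta_target[k] for k in theta}

y = td_targets(np.array([1.0, 0.0]), np.array([2.0, 3.0]))  # ≈ [2.9, 2.85]
loss = critic_loss(y, np.array([2.9, 2.85]))                # ≈ 0
new_target = soft_update({"W": np.ones(2)}, {"W": np.zeros(2)}, tau=0.1)
print(y, loss, new_target["W"])
```

A larger τ tracks the online network faster but makes the training target less stable, which is why the claim bounds it below 0.5.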
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-5 are implemented when the computer program is executed by the processor.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011453683.8A CN112418349A (en) | 2020-12-12 | 2020-12-12 | Distributed multi-agent deterministic strategy control method for large complex system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112418349A true CN112418349A (en) | 2021-02-26 |
Family
ID=74776168
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011453683.8A Pending CN112418349A (en) | 2020-12-12 | 2020-12-12 | Distributed multi-agent deterministic strategy control method for large complex system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112418349A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113867147A (en) * | 2021-09-29 | 2021-12-31 | 商汤集团有限公司 | Training and control method, device, computing equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291890A (en) * | 2020-05-13 | 2020-06-16 | Harbin Institute of Technology (Shenzhen) (Harbin Institute of Technology Shenzhen Institute of Science and Technology Innovation) | Game strategy optimization method, system and storage medium |
CN111563188A (en) * | 2020-04-30 | 2020-08-21 | Nanjing University of Posts and Telecommunications | Mobile multi-agent cooperative target searching method |
CN111708355A (en) * | 2020-06-19 | 2020-09-25 | National University of Defense Technology | Multi-unmanned aerial vehicle action decision method and device based on reinforcement learning |
CN112052456A (en) * | 2020-08-31 | 2020-12-08 | Zhejiang University of Technology | Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents |
Non-Patent Citations (1)
Title |
---|
LIU QIANYUAN: "Dual-arm robot object grasping based on deep reinforcement learning", China Master's Theses Full-text Database, no. 09, 15 September 2019 (2019-09-15), pages 26-29 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11135514B2 (en) | Data processing method and apparatus, and storage medium for concurrently executing event characters on a game client | |
CN112311578B (en) | VNF scheduling method and device based on deep reinforcement learning | |
US20060003823A1 (en) | Dynamic player groups for interest management in multi-character virtual environments | |
CN106489132B (en) | Read and write the method, apparatus, storage equipment and computer system of data | |
CN111143039B (en) | Scheduling method and device of virtual machine and computer storage medium | |
CN112418259A (en) | Method for configuring real-time rules based on user behaviors in live broadcast process, computer equipment and readable storage medium | |
US10755175B2 (en) | Early generation of individuals to accelerate genetic algorithms | |
CN112768056A (en) | Disease prediction model establishing method and device based on joint learning framework | |
CN112418349A (en) | Distributed multi-agent deterministic strategy control method for large complex system | |
CN113965313B (en) | Model training method, device, equipment and storage medium based on homomorphic encryption | |
CN109102468A (en) | Image enchancing method, device, terminal device and storage medium | |
CN111144243B (en) | Household pattern recognition method and device based on counterstudy | |
CN111950237B (en) | Sentence rewriting method, sentence rewriting device and electronic equipment | |
CN107918584A (en) | Information generating system, device, method and computer-readable recording medium | |
TWI705377B (en) | Hardware boost method and hardware boost system | |
CN109784687B (en) | Smart cloud manufacturing task scheduling method, readable storage medium and terminal | |
CN110209751A (en) | Route sharing method and device suitable for Driving Test application | |
CN112449205A (en) | Information interaction method and device, terminal equipment and storage medium | |
CN112433914B (en) | Method and system for obtaining parallel computing task progress | |
CN109492759B (en) | Neural network model prediction method, device and terminal | |
US20220008826A1 (en) | Strand Simulation in Multiple Levels | |
JP7453229B2 (en) | Data processing module, data processing system, and data processing method | |
WO2020134011A1 (en) | Method and apparatus for determining display information combination, storage medium, and electronic device | |
CN111013152A (en) | Game model action generation method and device and electronic terminal | |
CN116663417B (en) | Virtual geographic environment role modeling method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210226 |