CN112052456A - Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents - Google Patents
- Publication number
- CN112052456A (application number CN202010899020.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a multi-agent-based deep reinforcement learning strategy optimization defense method, which comprises the following steps: (1) constructing an autonomous driving environment containing a target agent and a plurality of antagonistic agents; (2) storing state-transition process data in the experience replay buffers D+ and D- respectively, according to whether the antagonistic agents' attack on the target agent succeeds or fails, and updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from D+ and D-; (3) carrying out game training between the antagonistic agents and the target agent vehicle, storing the target agent's state-transition process data in an experience buffer D, and sampling data from D to update the parameters of the decision gradient algorithm model corresponding to the target agent, until the game training stops; (4) in application, inputting the collected local environment state data into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action that guides the target agent's execution.
Description
Technical Field
The invention belongs to the field of deep reinforcement learning defense, and particularly relates to a multi-agent-based deep reinforcement learning strategy optimization defense method.
Background
Deep reinforcement learning is one of the most closely watched directions in artificial intelligence in recent years. With its rapid development and application, reinforcement learning has been widely used in fields such as robot control, game playing, computer vision, and autonomous driving. To ensure the safe application of deep reinforcement learning in safety-critical fields, the key is to analyze and discover vulnerabilities in deep reinforcement learning algorithms and models, so as to prevent malicious parties from exploiting these vulnerabilities for illegal gain. Unlike the single-step prediction task of traditional machine learning, a deep reinforcement learning system needs to make multi-step decisions to complete a task, and these consecutive decisions are highly correlated.
The basic idea of reinforcement learning is to learn the optimal strategy that achieves the learning goal by maximizing the cumulative reward the agent obtains from the environment. However, the strategies obtained by deep reinforcement learning training carry potential safety hazards and cannot cope well with all possible scenarios. In safety-critical fields in particular, such hazards can cause great harm by making the decisions of the reinforcement learning system wrong, which poses a significant challenge for the decision safety of applied reinforcement learning.
Because the strategies obtained by reinforcement learning training have potential safety hazards, improving the robustness of reinforcement learning models and strategies, so that they can be applied effectively and safely in safety-critical decision-making, has increasingly become a focus of attention. At present, common reinforcement learning defense methods can be divided into three categories according to their defense mechanisms: adversarial training, robust learning, and adversarial detection. For example, application number CN201911184051.3 discloses a defense method against adversarial attacks on deep reinforcement learning models, and application number CN201910567203.1 discloses a reinforcement-learning-based method for identifying insecure XSS defense systems.
Disclosure of Invention
The invention aims to provide a multi-agent-based deep reinforcement learning strategy optimization defense method for automatic driving scenarios. It trains a plurality of antagonistic agents to play games against a target agent, and adopts an information asymmetry mechanism between the target agent and the antagonistic agents to train the reinforcement learning model and optimize its strategy. This improves the robustness of the reinforcement learning model, further improves the accuracy of the model's decision actions, and avoids potential safety hazards.
In order to achieve the purpose, the invention provides the following technical scheme:
a deep reinforcement learning strategy optimization defense method based on multiple agents comprises the following steps:
(1) acquiring global environment state data and local environment state data in an automatic driving environment comprising a target agent and a plurality of antagonistic agents, and initializing the antagonistic agents and the target agent by utilizing a decision gradient algorithm model, wherein the decision gradient algorithm model comprises an Actor network model and a Critic network model;
(2) the aim of each antagonistic agent is to attack the target agent as much as possible and make it execute wrong decision actions; according to whether the antagonistic agents' attack on the target agent succeeds or fails, the state-transition process data (x, a, r, x') are stored in the experience replay buffer D+ or the experience replay buffer D- respectively, where x represents the global environment state data observed by the antagonistic agent, including the other antagonistic agents, the target agent, and their expected reward values; a represents the policy action taken by the antagonistic agent in that environment; r represents the individual reward value of the antagonistic agent; and x' represents the next global environment state data observable by the antagonistic agent; data sampled from the experience buffers D+ and D- are used to update the decision gradient algorithm model parameters corresponding to the antagonistic agents;
(3) on the basis of step (2), after the antagonistic agents obtain their antagonism strategies, they carry out game training together with the target agent vehicle; during training, the target agent's state-transition process data (s_0, a_0, r_0, s_0') are stored in an experience buffer D, where s_0 represents the local environment state data observable by the target agent, including the antagonistic agents in proximity to it; a_0 represents the policy action taken by the target agent in state s_0 under the influence of the antagonism strategies; r_0 represents the target agent's instant reward; and s_0' represents the next environment state data the target agent can observe under the influence of the antagonistic agents; data are sampled from the experience buffer D to update the decision gradient algorithm model parameters corresponding to the target agent, until the game training is finished;
(4) in application, the collected local environment state data are input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action that guides the target agent's execution.
In step (2), when the decision gradient algorithm model corresponding to the antagonistic agent is trained, in the initial random exploration process the antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, that is, a_i = μ(s_i | θ_i) + N_t, where s_i represents the global environment state data of the i-th antagonistic agent, N_t represents the random noise value added at time step t, and μ(s_i | θ_i) represents the output of the Actor network model under the parameters θ_i.
In step (2), when a decision gradient algorithm model corresponding to the antagonistic agent is trained and the antagonistic agent's attack on the target agent succeeds, the antagonistic agent is awarded a counter reward, that is, r̂_i^t = r_i^t + α, where r̂_i^t represents the counter reward of the i-th antagonistic agent at time step t, r_i^t represents the individual reward value of the i-th antagonistic agent at time step t, and α represents the counter reward factor; when k antagonistic agents participate in the win, α = 1/k.
In step (2), when data sampled from the experience buffers D+ and D- are used to update the decision gradient algorithm model parameters corresponding to the antagonistic agent, η(t)·S samples are drawn from D+ and (1-η(t))·S samples are drawn from D-, where t represents the time step and S represents the total number of samples; η(t) takes 0.5 during the first M time steps and 0.75 when t exceeds M time steps. Data collected with this sampling strategy are used to update the decision gradient algorithm model parameters corresponding to the antagonistic agent.
Compared with the prior art, the invention has at least the following beneficial effects:
1) training a plurality of antagonistic agents in the autonomous driving environment strengthens the target agent's exploration of the environment;
2) during game training, the target agent and the antagonistic agents adopt an asymmetric environment-state-data mechanism, which reduces conflicts among the antagonistic agents while helping the target agent observe and seek a better training strategy;
3) during deep reinforcement learning training, the target agent's strategy is optimized through game scenarios between the antagonistic agents and the target agent, which improves the robustness of the decision gradient algorithm model corresponding to the target agent, improves the accuracy of the model's decision actions, and avoids potential safety hazards.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a multi-agent based deep reinforcement learning strategy optimization defense method provided by an embodiment;
FIG. 2 is a schematic structural diagram of a DDPG algorithm model provided by the embodiment;
FIG. 3 is a schematic diagram of the antagonistic agent training process provided by the embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in FIGS. 1 to 3, the method for optimizing and defending a multi-agent-based deep reinforcement learning strategy provided by the embodiment comprises the following steps:
1) a target agent training process.
1.1) Build an automatic driving simulation environment for the reinforcement learning vehicle;
1.2) The target agent (denoted by subscript 0) and the antagonistic agents (denoted by subscripts 1, ..., n) are trained based on the deep deterministic policy gradient algorithm (DDPG) in reinforcement learning; the target agent and the antagonistic agents may be intelligent vehicles. The core of the DDPG algorithm extends the Actor-Critic method, the DQN algorithm, and the deterministic policy gradient (DPG): a deterministic policy μ is adopted to select the action a_t = μ(s | θ^μ), where θ^μ are the parameters of the policy network μ(s | θ^μ) that produces deterministic actions, with μ(s) serving as the Actor, and θ^Q are the parameters of the value network Q(s, a | θ^Q), with the Q(s, a) function serving as the Critic. To improve training stability, target networks are introduced for both the policy network and the value network.
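The four-network DDPG layout described above (Actor, Critic, and a slowly tracking target copy of each) can be sketched as follows. This is a minimal illustrative skeleton, not the patent's implementation: the linear stand-ins for the networks and all class and attribute names are the editor's assumptions.

```python
import random

class TinyDDPG:
    """Minimal DDPG skeleton: an Actor mu(s|theta_mu), a Critic Q(s,a|theta_Q),
    and a target copy of each, soft-updated for training stability."""

    def __init__(self, state_dim, tau=0.01):
        # Linear stand-ins for the Actor and Critic networks (illustrative only).
        self.theta_mu = [random.uniform(-0.1, 0.1) for _ in range(state_dim)]
        self.theta_q = [random.uniform(-0.1, 0.1) for _ in range(state_dim + 1)]
        # Target networks start as exact copies of the online networks.
        self.theta_mu_t = list(self.theta_mu)
        self.theta_q_t = list(self.theta_q)
        self.tau = tau

    def act(self, s):
        # Deterministic policy: a_t = mu(s | theta_mu).
        return sum(w * x for w, x in zip(self.theta_mu, s))

    def soft_update(self):
        # theta' <- tau * theta + (1 - tau) * theta'
        self.theta_mu_t = [self.tau * w + (1 - self.tau) * wt
                           for w, wt in zip(self.theta_mu, self.theta_mu_t)]
        self.theta_q_t = [self.tau * w + (1 - self.tau) * wt
                          for w, wt in zip(self.theta_q, self.theta_q_t)]
```

The soft update keeps the target networks a smoothed trailing copy of the online networks, which is the stability mechanism the paragraph above refers to.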
1.3) During training, the target agent's state-transition process data (s_0, a_0, r_0, s_0') are stored in an experience replay buffer D, where s_0 represents the local environment state data observable by the target agent, a_0 represents the action taken by the target agent in state s_0, r_0 represents the resulting instant reward, and s_0' represents the next environment state data observable by the target agent.
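A buffer holding such (s_0, a_0, r_0, s_0') tuples can be sketched with a bounded deque; the class name and capacity are illustrative assumptions, not from the patent.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay buffer D holding (s, a, r, s_next) transitions."""

    def __init__(self, capacity=10000):
        # Oldest transitions are evicted automatically once capacity is reached.
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, n):
        # Uniform minibatch sample used for the gradient updates.
        return random.sample(list(self.data), min(n, len(self.data)))

    def __len__(self):
        return len(self.data)
```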
2) Antagonistic agent training process:
2.1) Train n antagonistic agents Car_i, i = 1, ..., n, where n may be 2 or 3:
In the initial random exploration process, each antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, i.e., a_i = μ(s_i | θ_i) + N_t, where s_i represents the global environment state data of the i-th antagonistic agent and N_t represents the random noise value added at time step t.
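The exploration rule a_i = μ(s_i|θ_i) + N_t can be expressed in a few lines; Gaussian noise is one common choice of N_t, assumed here for illustration since the patent does not specify the noise distribution.

```python
import random

def noisy_action(actor, s_i, sigma=0.2, rng=random):
    """a_i = mu(s_i | theta_i) + N_t: deterministic Actor output plus an
    exploration noise value N_t drawn fresh at each time step."""
    return actor(s_i) + rng.gauss(0.0, sigma)
```

With sigma = 0 the rule collapses to the pure deterministic policy, which is what the agent uses after the exploration phase.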
2.2) During training, the antagonistic agents' state-transition data (x, a, r, x') are temporarily stored in an experience buffer D_tmp, where x = (s_0, s_1, ..., s_n) represents the global environment state data observed by the agents, a = (a_0, a_1, ..., a_n) represents the actions taken by the agents, and r represents the individual rewards that constrain each antagonistic agent to behave normally. After each round finishes, whether the antagonistic agents succeeded is judged. If they succeeded, the data in experience buffer D_tmp are transferred to buffer D+ and a corresponding counter reward is awarded, i.e. r̂_i = r_i + α, where r̂_i represents the counter reward and α represents the counter reward factor; when k antagonistic agents participate in the win, α = 1/k. If the target agent wins, the data in experience buffer D_tmp are transferred to buffer D- and only the corresponding individual reward is awarded, without a counter reward, i.e. r̂_i = r_i.
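The end-of-round routing of D_tmp into D+ or D-, with the α = 1/k counter-reward bonus, can be sketched as below. The function name is an assumption, and for simplicity the sketch assumes every agent in the rewards list took part in the win (so k = len(rewards)).

```python
def finish_episode(d_tmp, attack_succeeded, rewards, d_plus, d_minus):
    """Route this round's temporary transitions into D+ or D-.

    If the antagonistic agents won, each individual reward r_i is topped up
    with the counter reward alpha = 1/k; otherwise the transitions go to D-
    with the rewards unchanged. Returns the adjusted rewards.
    """
    if attack_succeeded:
        k = len(rewards)              # assumed: all listed agents participated
        alpha = 1.0 / k               # counter-reward factor shared by k winners
        adjusted = [r + alpha for r in rewards]
        d_plus.extend(d_tmp)          # successful rounds feed D+
    else:
        adjusted = list(rewards)      # individual reward only, no bonus
        d_minus.extend(d_tmp)         # failed rounds feed D-
    d_tmp.clear()                     # D_tmp is emptied for the next round
    return adjusted
```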
2.3) Samples are drawn from the experience buffers D+ and D- according to the sampling ratio η(t) and used to update the network parameters of the antagonistic agents: η(t)·S samples are drawn from D+ and (1-η(t))·S samples from D-, where t represents the time step and S the total number of samples; η(t) takes 0.5 during the first M time steps and 0.75 when t exceeds M time steps, so as to better optimize the target agent's strategy.
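The schedule above can be written directly as a mixed sampler; function and parameter names are illustrative assumptions.

```python
import random

def mixed_sample(d_plus, d_minus, total, t, m=1000, rng=random):
    """Draw eta(t)*S transitions from D+ and (1 - eta(t))*S from D-,
    with eta(t) = 0.5 during the first M time steps and 0.75 afterwards."""
    eta = 0.5 if t <= m else 0.75
    n_plus = int(eta * total)
    n_minus = total - n_plus
    batch = (rng.sample(d_plus, min(n_plus, len(d_plus)))
             + rng.sample(d_minus, min(n_minus, len(d_minus))))
    rng.shuffle(batch)  # avoid ordering bias between the two sources
    return batch
```

Raising η(t) after the warm-up phase biases later updates toward successful-attack experience, which is the intent of the two-buffer design.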
3) The game process of the target agent and the antagonistic agent comprises the following steps:
3.1) The game process between the target agent and the antagonistic agents adopts an information asymmetry mechanism: the antagonistic agents can obtain the global environment data they observe, including the other antagonistic agents, the target agent, and the target agent's expected reward value, while the target agent observes only local environment data, so the information feedback obtained by the two sides is asymmetric.
3.2) The antagonistic agents sample data from the experience buffers D+ and D-. Suppose the Actor network and Critic network parameters of the n antagonistic agents are denoted θ = {θ_1, ..., θ_n} and ω = {ω_1, ..., ω_n} respectively, and let the policies be μ_i = μ(s_i | θ_i) with value functions Q_i. The Actor network parameters are updated by calculating the gradient of the expected cumulative reward function:

∇_{θ_i} J(μ_i) = E_{x,a∼D±} [ ∇_{θ_i} μ_i(a_i | s_i) ∇_{a_i} Q_i(x, a_0, ..., a_n) |_{a_i = μ_i(s_i)} ]

where a_{0:n} = {a_0, ..., a_n} and D± represents the experience buffers D+ and D-.

The Critic network parameters are updated by minimizing the loss function L(·) between the actual cumulative reward function and the action-value Q function:

L(ω_i) = E [ (Q_i(x, a_0, ..., a_n) − y)² ],  y = r_i + γ Q'_i(x', a'_0, ..., a'_n) |_{a'_j = μ'_j(s_j)}

where y represents the actual accumulated reward value and γ is the decay factor, taking a value in [0, 1]. The parameters of the target networks are soft-updated:

θ'_i ← τ θ_i + (1 − τ) θ'_i,  ω'_i ← τ ω_i + (1 − τ) ω'_i
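The Critic update above reduces, per sample, to building a one-step TD target y and squaring the error against the current Q value. A minimal sketch, assuming the Q networks are passed in as plain callables (the function names are the editor's):

```python
def critic_td_target(r, x_next, next_actions, q_target, gamma=0.99):
    """One-step TD target y = r + gamma * Q'(x', a'_0..a'_n) used in the
    critic loss L = E[(y - Q(x, a_0..a_n))^2]."""
    return r + gamma * q_target(x_next, next_actions)

def critic_loss(batch, q_net, q_target, gamma=0.99):
    """Mean squared error between TD targets and current Critic values
    over a minibatch of (x, actions, r, x_next, next_actions) tuples."""
    err = 0.0
    for x, acts, r, x_next, next_acts in batch:
        y = critic_td_target(r, x_next, next_acts, q_target, gamma)
        err += (y - q_net(x, acts)) ** 2
    return err / len(batch)
```

Note that the target network Q' (not the online Critic) produces y, which is exactly why the soft-updated target copies exist.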
3.3) After obtaining its antagonism strategy, each antagonistic agent carries out game training together with the target agent, and the round exploration experience during training is stored in the experience buffer D, which contains the target agent's state-transition tuples (s_0, a_0, r_0, s_0'): s_0 represents the environment state data observable by the target agent (including antagonistic agents observed close to the target agent), a_0 represents the action taken by the target agent in state s_0 under the influence of the antagonism strategies, r_0 represents the resulting instant reward, and s_0' represents the next state data the target agent can observe under the influence of the antagonistic agents. The target agent then samples N state transitions from D and updates the policy parameters θ_0 of its Actor network by calculating the gradient of the expected cumulative reward function:

∇_{θ_0} J ≈ (1/N) Σ_j ∇_{θ_0} μ(s_0^j | θ_0) ∇_{a_0} Q(s_0^j, a_0 | ω_0) |_{a_0 = μ(s_0^j)}

The Critic network parameters ω_0 are updated by minimizing the loss function Loss between the actual cumulative reward function and the action-value Q function:

Loss = (1/N) Σ_j (y_j − Q(s_0^j, a_0^j | ω_0))²,  y_j = r_0^j + γ Q'(s_0'^j, μ'(s_0'^j | θ'_0) | ω'_0)

where γ is the decay factor, taking a value in [0, 1].
When the training of the decision gradient algorithm model corresponding to the target agent is finished, the trained model can be used directly in application: the collected local environment state data are input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action that guides the target agent's execution.
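Deployment is thus a single forward pass of the trained Actor on the locally observed state, with no noise and no Critic involved. A minimal sketch with assumed function names:

```python
def deploy_step(actor, observe_local_state, execute):
    """At deployment time, feed the locally observed state into the trained
    Actor and execute the decision action it outputs."""
    s0 = observe_local_state()   # local (not global) observation only
    a0 = actor(s0)               # deterministic decision action
    execute(a0)                  # guide the target agent's execution
    return a0
```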
In the multi-agent-based deep reinforcement learning strategy optimization defense method: 1) training a plurality of antagonistic agents in the automatic driving environment strengthens the target agent's exploration of the environment; 2) during game training, the target agent and the antagonistic agents adopt an asymmetric environment-state-data mechanism, which reduces conflicts among the antagonistic agents while helping the target agent observe and seek a better training strategy; 3) during deep reinforcement learning training, the target agent's strategy is optimized through game scenarios between the antagonistic agents and the target agent, which improves the robustness of the decision gradient algorithm model corresponding to the target agent, improves the accuracy of the model's decision actions, and avoids potential safety hazards.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.
Claims (6)
1. A deep reinforcement learning strategy optimization defense method based on multiple agents is characterized by comprising the following steps:
(1) acquiring global environment state data and local environment state data in an automatic driving environment comprising a target agent and a plurality of antagonistic agents, and initializing the antagonistic agents and the target agent by utilizing a decision gradient algorithm model, wherein the decision gradient algorithm model comprises an Actor network model and a Critic network model;
(2) the aim of each antagonistic agent is to attack the target agent as much as possible and make it execute wrong decision actions; according to whether the antagonistic agents' attack on the target agent succeeds or fails, the state-transition process data (x, a, r, x') are stored in the experience replay buffer D+ or the experience replay buffer D- respectively, where x represents the global environment state data observed by the antagonistic agent, including the other antagonistic agents, the target agent, and their expected reward values; a represents the policy action taken by the antagonistic agent in that environment; r represents the individual reward value of the antagonistic agent; and x' represents the next global environment state data observable by the antagonistic agent; data sampled from the experience buffers D+ and D- are used to update the decision gradient algorithm model parameters corresponding to the antagonistic agents;
(3) on the basis of step (2), after the antagonistic agents obtain their antagonism strategies, they carry out game training together with the target agent vehicle; during training, the target agent's state-transition process data (s_0, a_0, r_0, s_0') are stored in an experience buffer D, where s_0 represents the local environment state data observable by the target agent, including the antagonistic agents in proximity to it; a_0 represents the policy action taken by the target agent in state s_0 under the influence of the antagonism strategies; r_0 represents the target agent's instant reward; and s_0' represents the next environment state data the target agent can observe under the influence of the antagonistic agents; data are sampled from the experience buffer D to update the decision gradient algorithm model parameters corresponding to the target agent, until the game training is finished;
(4) in application, the collected local environment state data are input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action that guides the target agent's execution.
2. The multi-agent-based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when the decision gradient algorithm model corresponding to the antagonistic agent is trained, in the initial random exploration process the antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, that is, a_i = μ(s_i | θ_i) + N_t, where s_i represents the global environment state data of the i-th antagonistic agent, N_t represents the random noise value added at time step t, and μ(s_i | θ_i) represents the output of the Actor network model under the parameters θ_i.
3. The multi-agent-based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when the decision gradient algorithm model corresponding to the antagonistic agent is trained and the antagonistic agent's attack on the target agent succeeds, the antagonistic agent is awarded a counter reward, that is, r̂_i^t = r_i^t + α, where r̂_i^t represents the counter reward of the i-th antagonistic agent at time step t, r_i^t represents the individual reward value of the i-th antagonistic agent at time step t, and α represents the counter reward factor; when k antagonistic agents participate in the win, α = 1/k.
4. The multi-agent-based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when data sampled from the experience buffers D+ and D- are used to update the decision gradient algorithm model parameters corresponding to the antagonistic agent, η(t)·S samples are drawn from D+ and (1-η(t))·S samples are drawn from D-, where t represents the time step and S represents the total number of samples; η(t) takes 0.5 during the first M time steps and 0.75 when t exceeds M time steps, and data collected with this sampling strategy are used to update the decision gradient algorithm model parameters corresponding to the antagonistic agent.
5. The multi-agent-based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), the process of updating the decision gradient algorithm model parameters corresponding to the antagonistic agents with data sampled from the experience buffers D+ and D- comprises:

the antagonistic agents sample data from the experience buffers D+ and D-; suppose the Actor network and Critic network parameters of the n antagonistic agents are denoted θ = {θ_1, ..., θ_n} and ω = {ω_1, ..., ω_n} respectively, and let the policies be μ_i = μ(s_i | θ_i) with value functions Q_i; the Actor network parameters are updated by calculating the gradient of the expected cumulative reward function:

∇_{θ_i} J(μ_i) = E_{x,a∼D±} [ ∇_{θ_i} μ_i(a_i | s_i) ∇_{a_i} Q_i(x, a_0, ..., a_n) |_{a_i = μ_i(s_i)} ]

where a_{0:n} = {a_0, ..., a_n} and D± represents the experience buffers D+ and D-;

the Critic network parameters are updated by minimizing the loss function L(·) between the actual cumulative reward function and the action-value Q function:

L(ω_i) = E [ (Q_i(x, a_0, ..., a_n) − y)² ],  y = r_i + γ Q'_i(x', a'_0, ..., a'_n) |_{a'_j = μ'_j(s_j)}.
6. The multi-agent-based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (3), the process of sampling data from the experience buffer D and updating the decision gradient algorithm model parameters corresponding to the target agent comprises:

the target agent samples N state transitions from the experience buffer D and updates the policy parameters θ_0 of its Actor network by calculating the gradient of the expected cumulative reward function:

∇_{θ_0} J ≈ (1/N) Σ_j ∇_{θ_0} μ(s_0^j | θ_0) ∇_{a_0} Q(s_0^j, a_0 | ω_0) |_{a_0 = μ(s_0^j)}

the Critic network parameters ω_0 are updated by minimizing the loss function Loss between the actual cumulative reward function and the action-value Q function:

Loss = (1/N) Σ_j (y_j − Q(s_0^j, a_0^j | ω_0))²,  y_j = r_0^j + γ Q'(s_0'^j, μ'(s_0'^j | θ'_0) | ω'_0).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010899020.2A CN112052456A (en) | 2020-08-31 | 2020-08-31 | Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112052456A true CN112052456A (en) | 2020-12-08 |
Family
ID=73607813
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010899020.2A Pending CN112052456A (en) | 2020-08-31 | 2020-08-31 | Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112052456A (en) |
Application Events

- 2020-08-31: Application CN202010899020.2A filed in China (CN); published as CN112052456A; legal status: Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991545A (en) * | 2019-12-10 | 2020-04-10 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-agent confrontation oriented reinforcement learning training optimization method and device |
CN111310915A (en) * | 2020-01-21 | 2020-06-19 | 浙江工业大学 | Data anomaly detection and defense method for reinforcement learning |
CN111563188A (en) * | 2020-04-30 | 2020-08-21 | 南京邮电大学 | Mobile multi-agent cooperative target searching method |
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022121510A1 (en) * | 2020-12-11 | 2022-06-16 | 多伦科技股份有限公司 | Stochastic policy gradient-based traffic signal control method and system, and electronic device |
CN112418349A (en) * | 2020-12-12 | 2021-02-26 | 武汉第二船舶设计研究所(中国船舶重工集团公司第七一九研究所) | Distributed multi-agent deterministic strategy control method for large complex system |
CN112488826A (en) * | 2020-12-16 | 2021-03-12 | 北京逸风金科软件有限公司 | Method and device for optimizing bank risk pricing based on deep reinforcement learning |
CN112633415A (en) * | 2021-01-11 | 2021-04-09 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle cluster intelligent task execution method and device based on rule constraint training |
CN112633415B (en) * | 2021-01-11 | 2023-05-19 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle cluster intelligent task execution method and device based on rule constraint training |
CN112802091A (en) * | 2021-01-28 | 2021-05-14 | 北京理工大学 | DQN-based intelligent confrontation behavior realization method under augmented reality condition |
CN112802091B (en) * | 2021-01-28 | 2023-08-29 | 北京理工大学 | DQN-based agent countermeasure behavior realization method under augmented reality condition |
CN112965380A (en) * | 2021-02-07 | 2021-06-15 | 北京云量数盟科技有限公司 | Method for controlling intelligent equipment based on reinforcement learning strategy |
CN112965380B (en) * | 2021-02-07 | 2022-11-08 | 北京云量数盟科技有限公司 | Method for controlling intelligent equipment based on reinforcement learning strategy |
CN112843725A (en) * | 2021-03-15 | 2021-05-28 | 网易(杭州)网络有限公司 | Intelligent agent processing method and device |
CN112843726A (en) * | 2021-03-15 | 2021-05-28 | 网易(杭州)网络有限公司 | Intelligent agent processing method and device |
CN112884131A (en) * | 2021-03-16 | 2021-06-01 | 浙江工业大学 | Deep reinforcement learning strategy optimization defense method and device based on simulation learning |
CN112947581A (en) * | 2021-03-25 | 2021-06-11 | 西北工业大学 | Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning |
CN112947581B (en) * | 2021-03-25 | 2022-07-05 | 西北工业大学 | Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning |
CN113050430A (en) * | 2021-03-29 | 2021-06-29 | 浙江大学 | Drainage system control method based on robust reinforcement learning |
CN113377099A (en) * | 2021-03-31 | 2021-09-10 | 南开大学 | Robot pursuit game method based on deep reinforcement learning |
CN113095463A (en) * | 2021-03-31 | 2021-07-09 | 南开大学 | Robot confrontation method based on evolution reinforcement learning |
CN113221444A (en) * | 2021-04-20 | 2021-08-06 | 中国电子科技集团公司第五十二研究所 | Behavior simulation training method for air intelligent game |
CN113253605A (en) * | 2021-05-20 | 2021-08-13 | 电子科技大学 | Active disturbance rejection unmanned transverse control method based on DDPG parameter optimization |
CN113378456A (en) * | 2021-05-21 | 2021-09-10 | 青海大学 | Multi-park comprehensive energy scheduling method and system |
CN113378456B (en) * | 2021-05-21 | 2023-04-07 | 青海大学 | Multi-park comprehensive energy scheduling method and system |
CN113255936A (en) * | 2021-05-28 | 2021-08-13 | 浙江工业大学 | Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism |
CN113255936B (en) * | 2021-05-28 | 2024-02-13 | 浙江工业大学 | Deep reinforcement learning strategy protection defense method and device based on imitation learning and attention mechanism |
WO2022252039A1 (en) * | 2021-05-31 | 2022-12-08 | Robert Bosch Gmbh | Method and apparatus for adversarial attacking in deep reinforcement learning |
CN113344071A (en) * | 2021-06-02 | 2021-09-03 | 沈阳航空航天大学 | Intrusion detection algorithm based on depth strategy gradient |
CN113344071B (en) * | 2021-06-02 | 2024-01-26 | 新疆能源翱翔星云科技有限公司 | Intrusion detection algorithm based on depth strategy gradient |
CN113420326B (en) * | 2021-06-08 | 2022-06-21 | 浙江工业大学之江学院 | Deep reinforcement learning-oriented model privacy protection method and system |
CN113420326A (en) * | 2021-06-08 | 2021-09-21 | 浙江工业大学之江学院 | Deep reinforcement learning-oriented model privacy protection method and system |
CN113392396A (en) * | 2021-06-11 | 2021-09-14 | 浙江工业大学 | Strategy protection defense method for deep reinforcement learning |
CN113485313A (en) * | 2021-06-25 | 2021-10-08 | 杭州玳数科技有限公司 | Anti-interference method and device for automatic driving vehicle |
CN113487039A (en) * | 2021-06-29 | 2021-10-08 | 山东大学 | Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning |
CN113487039B (en) * | 2021-06-29 | 2023-08-22 | 山东大学 | Deep reinforcement learning-based intelligent self-adaptive decision generation method and system |
CN113360917A (en) * | 2021-07-07 | 2021-09-07 | 浙江工业大学 | Deep reinforcement learning model security reinforcement method and device based on differential privacy |
CN113435598B (en) * | 2021-07-08 | 2022-06-21 | 中国人民解放军国防科技大学 | Knowledge-driven intelligent strategy deduction decision method |
CN113435598A (en) * | 2021-07-08 | 2021-09-24 | 中国人民解放军国防科技大学 | Knowledge-driven intelligent strategy deduction decision method |
CN113487870B (en) * | 2021-07-19 | 2022-07-15 | 浙江工业大学 | Anti-disturbance generation method for intelligent single intersection based on CW (continuous wave) attack |
CN113487870A (en) * | 2021-07-19 | 2021-10-08 | 浙江工业大学 | Method for generating anti-disturbance to intelligent single intersection based on CW (continuous wave) attack |
CN117833997A (en) * | 2024-03-01 | 2024-04-05 | 南京控维通信科技有限公司 | Multidimensional resource allocation method of NOMA multi-beam satellite communication system based on reinforcement learning |
CN117833997B (en) * | 2024-03-01 | 2024-05-31 | 南京控维通信科技有限公司 | Multidimensional resource allocation method of NOMA multi-beam satellite communication system based on reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112052456A (en) | Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents | |
CN111310915B (en) | Data anomaly detection defense method oriented to reinforcement learning | |
CN110991545B (en) | Multi-agent confrontation oriented reinforcement learning training optimization method and device | |
Stanescu et al. | Evaluating real-time strategy game states using convolutional neural networks | |
CN110852448A (en) | Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning | |
CN111282267B (en) | Information processing method, information processing apparatus, information processing medium, and electronic device | |
CN113420326B (en) | Deep reinforcement learning-oriented model privacy protection method and system | |
CN112884131A (en) | Deep reinforcement learning strategy optimization defense method and device based on simulation learning | |
CN112884130A (en) | SeqGAN-based deep reinforcement learning data enhanced defense method and device | |
CN113688977B (en) | Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium | |
CN113255936A (en) | Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism | |
CN113392396A (en) | Strategy protection defense method for deep reinforcement learning | |
CN113952733A (en) | Multi-agent self-adaptive sampling strategy generation method | |
CN112069504A (en) | Model enhanced defense method for resisting attack by deep reinforcement learning | |
CN111348034B (en) | Automatic parking method and system based on generation countermeasure simulation learning | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN113360917A (en) | Deep reinforcement learning model security reinforcement method and device based on differential privacy | |
Yang et al. | Adaptive inner-reward shaping in sparse reward games | |
Churchill et al. | An analysis of model-based heuristic search techniques for StarCraft combat scenarios | |
CN113276852B (en) | Unmanned lane keeping method based on maximum entropy reinforcement learning framework | |
CN114404975A (en) | Method, device, equipment, storage medium and program product for training decision model | |
Ji et al. | Improving decision-making efficiency of image game based on deep Q-learning | |
Petosa et al. | Multiplayer alphazero | |
Marius et al. | Combining scripted behavior with game tree search for stronger, more robust game AI | |
Balachandar et al. | Collaboration of ai agents via cooperative multi-agent deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||