CN112052456A - Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Info

Publication number
CN112052456A
Authority
CN
China
Prior art keywords: agent, antagonistic, target, data, representing
Legal status
Pending
Application number
CN202010899020.2A
Other languages
Chinese (zh)
Inventor
陈晋音
章燕
王雪柯
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010899020.2A
Publication of CN112052456A

Classifications

    • G06F 21/57 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity; monitoring users, programs or devices to maintain the integrity of platforms; certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods


Abstract

The invention discloses a deep reinforcement learning strategy optimization defense method based on multiple intelligent agents, which comprises the following steps: (1) constructing an autonomous driving environment comprising a target agent and a plurality of antagonistic agents; (2) storing the state transition process data of the target agent in experience replay buffers D+ and D− respectively, according to whether the antagonistic agents' attack on the target agent succeeds or fails, and updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from D+ and D−; (3) carrying out game training between the antagonistic agents and the target agent vehicle, storing the state transition process data of the target agent in an experience buffer D, and sampling data from D to update the parameters of the decision gradient algorithm model corresponding to the target agent, until the game training ends; (4) in application, inputting the collected local environment state data into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.

Description

Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
Technical Field
The invention belongs to the field of deep reinforcement learning defense, and particularly relates to a deep reinforcement learning strategy optimization defense method based on multiple intelligent agents.
Background
Deep reinforcement learning has been one of the most closely watched directions of artificial intelligence in recent years. With its rapid development and application, reinforcement learning has been widely used in robot control, game playing, computer vision, unmanned driving, and other fields. To ensure the safe application of deep reinforcement learning in safety-critical fields, the key is to analyze and discover vulnerabilities in deep reinforcement learning algorithms and models, so as to prevent malicious parties from exploiting these vulnerabilities for illegal gain. Unlike the single-step prediction task of traditional machine learning, a deep reinforcement learning system needs to make multi-step decisions to complete a task, and these successive decisions are highly correlated.
The basic idea of reinforcement learning is to learn the optimal strategy by maximizing the cumulative reward that the agent obtains from the environment, so as to achieve the learning goal. However, the strategy obtained by deep reinforcement learning training carries potential safety hazards and cannot cope well with all possible scenarios. Especially in safety-critical fields, such hazards can cause great harm by leading the reinforcement learning system to make wrong decisions, which poses a significant challenge for applying reinforcement learning where decision safety matters.
Because the strategies obtained by reinforcement learning training have potential safety hazards, improving the robustness of reinforcement learning models and strategies so that they can be applied effectively and safely in safety-critical decision-making has increasingly become a focus of attention. According to the existing defense mechanisms, common reinforcement learning defense methods can be divided into three categories: adversarial training, robust learning, and adversarial detection. For example, application number CN201911184051.3 discloses a defense method against attacks oriented to deep reinforcement learning models, and application number CN201910567203.1 discloses an insecure XSS defense system identification method based on reinforcement learning.
Disclosure of Invention
The invention aims to provide a multi-agent-based deep reinforcement learning strategy optimization defense method for automatic driving scenarios. The method trains a plurality of antagonistic agents to play a game against a target agent, and adopts an information asymmetry mechanism between the target agent and the antagonistic agents to train the reinforcement learning model and optimize its strategy, which improves the robustness of the reinforcement learning model, further improves the accuracy of the model's decision actions, and avoids potential safety hazards.
In order to achieve the purpose, the invention provides the following technical scheme:
a deep reinforcement learning strategy optimization defense method based on multiple agents comprises the following steps:
(1) acquiring global environment state data and local environment state data in an automatic driving environment comprising a target agent and a plurality of antagonistic agents, and initializing the antagonistic agents and the target agent by utilizing a decision gradient algorithm model, wherein the decision gradient algorithm model comprises an Actor network model and a Critic network model;
(2) the goal of the antagonistic agents is to attack the target agent as much as possible and make it execute wrong decision actions; according to whether the antagonistic agents' attack on the target agent succeeds or fails, the state transition process data of the target agent, $(x, a, r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_n, x')$, is stored respectively in an experience replay buffer D+ and an experience replay buffer D−, where $x$ denotes the global environment state data observed by the antagonistic agents, including the other antagonistic agents, the target agent, and their expected reward values, $a$ denotes the policy actions taken by the antagonistic agents in that environment, $r^{\mathrm{ind}}_i$ denotes the individual reward value of the $i$-th antagonistic agent, and $x'$ denotes the next global environment state data observable by the antagonistic agents; data sampled from the experience buffers D+ and D− are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents;
(3) on the basis of step (2), after the antagonistic agents have obtained their antagonistic strategies, they carry out game training together with the target agent vehicle; during training, the state transition process data of the target agent, $(s_0, a_0, r_0, s'_0)$, is stored in an experience buffer D, where $s_0$ denotes the local environment state data observable by the target agent, including the antagonistic agents in proximity to the target agent, $a_0$ denotes the policy action taken by the target agent in state $s_0$ under the influence of the antagonistic strategies, $r_0$ denotes the target agent's instant reward, and $s'_0$ denotes the next environment state data observable by the target agent under the influence of the antagonistic agents; data are sampled from the experience buffer D to update the parameters of the decision gradient algorithm model corresponding to the target agent, until the game training ends;
(4) in application, the collected local environment state data is input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.
In step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained, during the initial random exploration the antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, i.e.

$a^i_t = \mu(s_i \mid \theta_i) + N_t$

where $s_i$ denotes the global environment state data of the $i$-th antagonistic agent, $N_t$ denotes the random noise value added at time step $t$, and $\mu(s_i \mid \theta_i)$ denotes the output of the Actor network model under the parameters $\theta_i$.
In step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained and the antagonistic agents' attack on the target agent succeeds, the antagonistic agent is awarded a counter reward $r^{\mathrm{adv}}_{i,t}$ in addition to its individual reward, where $r^{\mathrm{adv}}_{i,t}$ denotes the counter reward of the $i$-th antagonistic agent at time step $t$, $r^{\mathrm{ind}}_{i,t}$ denotes the individual reward value of the $i$-th antagonistic agent at time $t$, $\alpha$ denotes the counter reward factor, and the value of $\alpha$ is determined by the number $k$ of antagonistic agents participating in the win.
In step (2), when updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from the experience buffers D+ and D−,
$\eta(t) \cdot S$ samples are drawn from experience buffer D+ and $(1-\eta(t)) \cdot S$ samples from experience buffer D−, where $t$ denotes the time step and $S$ denotes the total number of samples; $\eta(t)$ is taken as 0.5 during the first $M$ time steps and as 0.75 when $t$ exceeds $M$ time steps. Data collected with this sampling strategy are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents.
Compared with the prior art, the invention has at least the following beneficial effects:
1) training a plurality of antagonistic agents in the autonomous driving environment enhances the target agent's exploration of the environment;
2) in game training, the target agent and the antagonistic agents adopt an asymmetric environment-state-data mechanism, which reduces conflicts among the antagonistic agents and at the same time helps the target agent observe and search for a better training strategy;
3) during deep reinforcement learning training, the strategy of the target agent is optimized by training in a game scenario between the antagonistic agents and the target agent, which improves the robustness of the decision gradient algorithm model corresponding to the target agent, improves the accuracy of the model's decision actions, and avoids potential safety hazards.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a multi-agent based deep reinforcement learning strategy optimization defense method provided by an embodiment;
FIG. 2 is a schematic structural diagram of a DDPG algorithm model provided by the embodiment;
FIG. 3 is a schematic diagram of the antagonistic agent training process provided by the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in FIGS. 1 to 3, the method for optimizing and defending a multi-agent-based deep reinforcement learning strategy provided by the embodiment comprises the following steps:
1) a target agent training process.
1.1) building an automatic driving simulation environment for the reinforcement learning vehicle;
1.2) training a target agent (denoted by subscript 0) and antagonistic agents (denoted by subscripts 1, ..., n) based on the deep deterministic policy gradient (DDPG) algorithm in reinforcement learning; both the target agent and the antagonistic agents can be intelligent vehicles. The core of the DDPG algorithm extends the Actor-Critic method, the DQN algorithm, and the deterministic policy gradient (DPG): a deterministic policy $\mu$ is adopted to select the action $a_t = \mu(s \mid \theta^{\mu})$, where $\theta^{\mu}$ are the parameters of the policy network $\mu(s \mid \theta^{\mu})$ that produces deterministic actions, with $\mu(s)$ serving as the Actor, and $\theta^{Q}$ are the parameters of the value network $Q(s, a \mid \theta^{Q})$, with the $Q(s, a)$ function serving as the Critic. To improve training stability, target networks are introduced for both the policy network and the value network.
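By way of illustration only (not part of the claimed method), the Actor-Critic structure described above can be sketched in Python/PyTorch as follows; the layer sizes, state/action dimensions and class names are assumptions made for the example:

import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Deterministic policy network mu(s | theta_mu): maps a state to an action.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh())  # actions scaled to [-1, 1]

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Value network Q(s, a | theta_Q): scores a state-action pair.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Target networks (slowly updated copies) are introduced to improve training stability.
actor, critic = Actor(8, 2), Critic(8, 2)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)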
1.3) during training, the state transition process $(s_0, a_0, r_0, s'_0)$ of the target agent is stored in an experience replay buffer D, where $s_0$ denotes the local environment state data observable by the target agent, $a_0$ denotes the action taken by the target agent in state $s_0$, $r_0$ denotes the resulting instant reward, and $s'_0$ denotes the next environment state data observable by the target agent.
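A minimal sketch of such an experience replay buffer, assuming a fixed capacity and uniform sampling (both illustrative assumptions):

import random
from collections import deque

class ReplayBuffer:
    # Stores target-agent transitions (s0, a0, r0, s0_next) and samples minibatches.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s0, a0, r0, s0_next):
        self.buffer.append((s0, a0, r0, s0_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))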
2) Antagonistic agent training process:
2.1) training n antagonistic agents $Car_i$, $i = 1, \ldots, n$, where n may be, for example, 2 or 3:
During the initial random exploration, each antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, i.e.

$a^i_t = \mu(s_i \mid \theta_i) + N_t$

where $s_i$ denotes the global environment state data of the $i$-th antagonistic agent and $N_t$ denotes the random noise value added at time step $t$.
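A brief sketch of this noisy action selection, assuming Gaussian exploration noise (the patent does not fix the noise distribution) and the Actor network sketched earlier:

import torch

def select_action(actor, state, noise_std=0.1):
    # a_t = mu(s | theta) + N_t : actor output plus random exploration noise.
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32))
    noise = noise_std * torch.randn_like(action)
    return (action + noise).clamp(-1.0, 1.0)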
2.2) during training, the state transition data of the antagonistic agents, $(x, a, r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_n, x')$, are temporarily stored in an experience buffer $D_{tmp}$, where $x = (s_0, s_1, \ldots, s_n)$ denotes the respective global environment state data observed by the agents, $a = (a_0, a_1, \ldots, a_n)$ denotes the actions taken by the agents, and $r^{\mathrm{ind}}_i$ denotes the individual reward, which constrains the individual behavior of each antagonistic agent so that it performs normal behavior. After each round ends, it is judged whether the antagonistic agents succeeded. If they succeeded, the data in experience buffer $D_{tmp}$ are transferred to buffer D+ and the corresponding counter reward is awarded, i.e. each winning antagonistic agent receives a counter reward $r^{\mathrm{adv}}_{i,t}$ in addition to its individual reward $r^{\mathrm{ind}}_{i,t}$, where $\alpha$ denotes the counter reward factor and its value is determined by the number $k$ of antagonistic agents participating in the win. If the target agent wins, the data in experience buffer $D_{tmp}$ are transferred to buffer D− and only the corresponding individual reward is awarded, without a counter reward.
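A schematic Python sketch of this round-end bookkeeping; the additive bonus alpha_bonus stands in for the α term, whose exact expression is given only as a formula image in the original, so it is a placeholder assumption:

def end_of_round(D_tmp, D_plus, D_minus, attack_succeeded, alpha_bonus):
    # D_tmp holds the round's transitions (x, a, r_ind, x_next).
    if attack_succeeded:
        # Antagonistic agents win: add a counter reward to each individual reward,
        # then move the round's data into D+.
        for (x, a, r_ind, x_next) in D_tmp:
            r_adv = [r + alpha_bonus for r in r_ind]
            D_plus.append((x, a, r_adv, x_next))
    else:
        # Target agent wins: keep only the individual rewards and move data into D-.
        D_minus.extend(D_tmp)
    D_tmp.clear()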
2.3) sampling from the experience buffers D+ and D− according to the sampling ratio $\eta(t)$ to update the network parameters of the antagonistic agents:
$\eta(t) \cdot S$ samples are drawn from experience buffer D+ and $(1-\eta(t)) \cdot S$ samples from D−, where $t$ denotes the time step and $S$ denotes the total number of samples; $\eta(t)$ is taken as 0.5 during the first $M$ time steps and as 0.75 when $t$ exceeds $M$ time steps, so as to better optimize the strategy of the target agent.
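A minimal sketch of this two-buffer sampling schedule, with M and S left as free parameters:

import random

def sample_mixed(D_plus, D_minus, t, S, M):
    # eta(t) = 0.5 for the first M time steps, 0.75 afterwards.
    eta = 0.5 if t <= M else 0.75
    n_plus = int(eta * S)
    batch = random.sample(D_plus, min(n_plus, len(D_plus)))
    batch += random.sample(D_minus, min(S - n_plus, len(D_minus)))
    return batch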
3) The game process of the target agent and the antagonistic agent comprises the following steps:
3.1) the game process between the target agent and the antagonistic agents adopts an information asymmetry mechanism: the antagonistic agents can obtain the observed global environment data, including the other antagonistic agents, the target agent, and the target agent's expected reward value, whereas the target agent only observes local environment data, so the information feedback obtained by the two sides is asymmetric.
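Purely for illustration, the asymmetric observations could be assembled as follows; the field names are assumptions, not part of the patent:

def build_observations(env_state, target_expected_reward):
    # Antagonistic agents see the full (global) state plus the target's expected reward.
    global_obs = {
        "all_agent_states": env_state["all_agent_states"],
        "target_state": env_state["target_state"],
        "target_expected_reward": target_expected_reward,
    }
    # The target agent only sees its local surroundings (nearby antagonistic agents).
    local_obs = {
        "own_state": env_state["target_state"],
        "nearby_adversaries": env_state["nearby_adversaries"],
    }
    return global_obs, local_obs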
3.2) the antagonistic agents sample data from the experience buffers D+ and D−. Suppose the Actor network and Critic network parameters of the n antagonistic agents are denoted respectively as $\theta = \{\theta_1, \ldots, \theta_n\}$ and $\omega = \{\omega_1, \ldots, \omega_n\}$, and let the policies be $\mu_i = \mu(s_i \mid \theta_i)$ and the value functions be $Q_i^{\mu}(x, a_0, \ldots, a_n \mid \omega_i)$.

The Actor network parameters are updated by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D^{\pm}}\left[ \nabla_{\theta_i} \mu_i(s_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_0, \ldots, a_n) \big|_{a_i = \mu_i(s_i)} \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$ and $D^{\pm}$ denotes the experience buffers D+ and D−.

The Critic network parameters are updated by minimizing the loss function L(·) between the actual cumulative reward and the action value Q function:

$L(\omega_i) = \mathbb{E}_{x, a, r, x'}\left[ \left( Q_i^{\mu}(x, a_0, \ldots, a_n) - y_i \right)^2 \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$ and $y_i = r_i + \gamma \, Q_i^{\mu'}(x', a'_0, \ldots, a'_n)\big|_{a'_j = \mu'_j(s'_j)}$ denotes the actual cumulative reward value computed with the target networks; $\gamma$ is the decay factor, taking a value in [0, 1]. The parameters of the target networks are updated in a soft-update manner:

$\theta'_i \leftarrow \tau \theta_i + (1 - \tau)\theta'_i$

$\omega'_i \leftarrow \tau \omega_i + (1 - \tau)\omega'_i$

where $\tau$ is the soft-update coefficient.
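A hedged PyTorch-style sketch of the Critic regression target and the soft update described above, simplified to single-agent tensors for brevity (the patent's multi-agent version would concatenate all agents' states and actions into the Critic input):

import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, optimizer, batch, gamma=0.99):
    state, action, reward, next_state = batch  # tensors sampled from D+ / D-
    with torch.no_grad():
        next_action = target_actor(next_state)
        # y = r + gamma * Q'(s', mu'(s')): the "actual cumulative reward" target.
        y = reward.view(-1, 1) + gamma * target_critic(next_state, next_action)
    loss = F.mse_loss(critic(state, action), y)  # L(omega)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def soft_update(net, target_net, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p, tp in zip(net.parameters(), target_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)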
3.3) after obtaining their antagonistic strategies, the antagonistic agents carry out game training together with the target agent, and the round exploration experience gathered during training is stored in an experience buffer D. D contains the target agent's state transition process $(s_0, a_0, r_0, s'_0)$, where $s_0$ denotes the environment state data observable by the target agent (including antagonistic agents observed in the environment close to the target agent), $a_0$ denotes the action taken by the target agent in state $s_0$ under the influence of the antagonistic strategies, $r_0$ denotes the resulting instant reward, and $s'_0$ denotes the next state data that the target agent can observe under the influence of the antagonistic agents. The target agent then samples N state transitions from D and updates the policy parameters $\theta_0$ of its Actor network by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_0} J(\mu_0) = \mathbb{E}_{s_0 \sim D}\left[ \nabla_{\theta_0} \mu_0(s_0) \, \nabla_{a_0} Q_0(s_0, a_0 \mid \omega_0) \big|_{a_0 = \mu_0(s_0)} \right]$

and updates the Critic network parameters $\omega_0$ by minimizing the loss function Loss between the actual cumulative reward and the action value Q function:

$Loss(\omega_0) = \mathbb{E}\left[ \left( Q_0(s_0, a_0 \mid \omega_0) - y_0 \right)^2 \right]$

where $y_0 = r_0 + \gamma \, Q'_0(s'_0, a'_0 \mid \omega'_0)\big|_{a'_0 = \mu'_0(s'_0)}$, and $\gamma$ is the decay factor, taking a value in [0, 1]. The target network parameters $\theta'_0$ and $\omega'_0$ are updated in a soft-update manner:

$\theta'_0 \leftarrow \tau \theta_0 + (1 - \tau)\theta'_0$

$\omega'_0 \leftarrow \tau \omega_0 + (1 - \tau)\omega'_0$
When the training of the decision gradient algorithm model corresponding to the target agent is finished, the trained model can be used directly in application: the collected local environment state data is input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.
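A minimal deployment sketch, assuming the trained Actor from the earlier snippets and local observations supplied as a feature vector:

import torch

def decide(trained_actor, local_state):
    # Feed the collected local environment state into the trained policy network and
    # output the decision action for the target agent to execute (no exploration noise).
    trained_actor.eval()
    with torch.no_grad():
        state = torch.as_tensor(local_state, dtype=torch.float32)
        return trained_actor(state).numpy()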
In the multi-agent-based deep reinforcement learning strategy optimization defense method: 1) training a plurality of antagonistic agents in the automatic driving environment enhances the target agent's exploration of the environment; 2) in game training, the target agent and the antagonistic agents adopt an asymmetric environment-state-data mechanism, which reduces conflicts among the antagonistic agents and at the same time helps the target agent observe and search for a better training strategy; 3) during deep reinforcement learning training, the strategy of the target agent is optimized by training in a game scenario between the antagonistic agents and the target agent, which improves the robustness of the decision gradient algorithm model corresponding to the target agent, improves the accuracy of the model's decision actions, and avoids potential safety hazards.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit the present invention; any modifications, additions, or equivalents made within the scope of the principles of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A deep reinforcement learning strategy optimization defense method based on multiple agents is characterized by comprising the following steps:
(1) acquiring global environment state data and local environment state data in an automatic driving environment comprising a target agent and a plurality of antagonistic agents, and initializing the antagonistic agents and the target agent by utilizing a decision gradient algorithm model, wherein the decision gradient algorithm model comprises an Actor network model and a Critic network model;
(2) the goal of the antagonistic agents is to attack the target agent as much as possible and make it execute wrong decision actions; according to whether the antagonistic agents' attack on the target agent succeeds or fails, the state transition process data of the target agent, $(x, a, r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_n, x')$, is stored respectively in an experience replay buffer D+ and an experience replay buffer D−, where $x$ denotes the global environment state data observed by the antagonistic agents, including the other antagonistic agents, the target agent, and their expected reward values, $a$ denotes the policy actions taken by the antagonistic agents in that environment, $r^{\mathrm{ind}}_i$ denotes the individual reward value of the $i$-th antagonistic agent, and $x'$ denotes the next global environment state data observable by the antagonistic agents; data sampled from the experience buffers D+ and D− are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents;
(3) on the basis of step (2), after the antagonistic agents have obtained their antagonistic strategies, they carry out game training together with the target agent vehicle; during training, the state transition process data of the target agent, $(s_0, a_0, r_0, s'_0)$, is stored in an experience buffer D, where $s_0$ denotes the local environment state data observable by the target agent, including the antagonistic agents in proximity to the target agent, $a_0$ denotes the policy action taken by the target agent in state $s_0$ under the influence of the antagonistic strategies, $r_0$ denotes the target agent's instant reward, and $s'_0$ denotes the next environment state data observable by the target agent under the influence of the antagonistic agents; data are sampled from the experience buffer D to update the parameters of the decision gradient algorithm model corresponding to the target agent, until the game training ends;
(4) in application, the collected local environment state data is input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.
2. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained, during the initial random exploration the antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, i.e.

$a^i_t = \mu(s_i \mid \theta_i) + N_t$

where $s_i$ denotes the global environment state data of the $i$-th antagonistic agent, $N_t$ denotes the random noise value added at time step $t$, and $\mu(s_i \mid \theta_i)$ denotes the output of the Actor network model under the parameters $\theta_i$.
3. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained and the antagonistic agents' attack on the target agent succeeds, the antagonistic agent is awarded a counter reward $r^{\mathrm{adv}}_{i,t}$ in addition to its individual reward, where $r^{\mathrm{adv}}_{i,t}$ denotes the counter reward of the $i$-th antagonistic agent at time step $t$, $r^{\mathrm{ind}}_{i,t}$ denotes the individual reward value of the $i$-th antagonistic agent at time $t$, $\alpha$ denotes the counter reward factor, and the value of $\alpha$ is determined by the number $k$ of antagonistic agents participating in the win.
4. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from the experience buffers D+ and D−,
$\eta(t) \cdot S$ samples are drawn from experience buffer D+ and $(1-\eta(t)) \cdot S$ samples from experience buffer D−, where $t$ denotes the time step and $S$ denotes the total number of samples; $\eta(t)$ is taken as 0.5 during the first $M$ time steps and as 0.75 when $t$ exceeds $M$ time steps, and data collected with this sampling strategy are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents.
5. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), the process of updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from the experience buffers D+ and D− comprises:

The antagonistic agents sample data from the experience buffers D+ and D−. Suppose the Actor network and Critic network parameters of the n antagonistic agents are denoted respectively as $\theta = \{\theta_1, \ldots, \theta_n\}$ and $\omega = \{\omega_1, \ldots, \omega_n\}$, and let the policies be $\mu_i = \mu(s_i \mid \theta_i)$ and the value functions be $Q_i^{\mu}(x, a_0, \ldots, a_n \mid \omega_i)$.

The Actor network parameters are updated by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D^{\pm}}\left[ \nabla_{\theta_i} \mu_i(s_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_0, \ldots, a_n) \big|_{a_i = \mu_i(s_i)} \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$ and $D^{\pm}$ denotes the experience buffers D+ and D−.

The Critic network parameters are updated by minimizing the loss function L(·) between the actual cumulative reward and the action value Q function:

$L(\omega_i) = \mathbb{E}_{x, a, r, x'}\left[ \left( Q_i^{\mu}(x, a_0, \ldots, a_n) - y_i \right)^2 \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$, $y_i = r_i + \gamma \, Q_i^{\mu'}(x', a'_0, \ldots, a'_n)\big|_{a'_j = \mu'_j(s'_j)}$ denotes the actual cumulative reward value, and $\gamma$ is the decay factor, taking a value in [0, 1].
6. The multi-agent-based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (3), the process of sampling data from the experience buffer D and updating the parameters of the decision gradient algorithm model corresponding to the target agent comprises:

The target agent samples N state transitions from the experience buffer D and updates the policy parameters $\theta_0$ of the Actor network by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_0} J(\mu_0) = \mathbb{E}_{s_0 \sim D}\left[ \nabla_{\theta_0} \mu_0(s_0) \, \nabla_{a_0} Q_0(s_0, a_0 \mid \omega_0) \big|_{a_0 = \mu_0(s_0)} \right]$

and updates the Critic network parameters $\omega_0$ by minimizing the loss function Loss between the actual cumulative reward and the action value Q function:

$Loss(\omega_0) = \mathbb{E}\left[ \left( Q_0(s_0, a_0 \mid \omega_0) - y_0 \right)^2 \right]$

where $y_0 = r_0 + \gamma \, Q'_0(s'_0, a'_0 \mid \omega'_0)\big|_{a'_0 = \mu'_0(s'_0)}$, and $\gamma$ is the decay factor, taking a value in [0, 1].
CN202010899020.2A 2020-08-31 2020-08-31 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents Pending CN112052456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010899020.2A CN112052456A (en) 2020-08-31 2020-08-31 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents


Publications (1)

Publication Number Publication Date
CN112052456A true CN112052456A (en) 2020-12-08

Family

ID=73607813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010899020.2A Pending CN112052456A (en) 2020-08-31 2020-08-31 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Country Status (1)

Country Link
CN (1) CN112052456A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991545A (en) * 2019-12-10 2020-04-10 中国人民解放军军事科学院国防科技创新研究院 Multi-agent confrontation oriented reinforcement learning training optimization method and device
CN111310915A (en) * 2020-01-21 2020-06-19 浙江工业大学 Data anomaly detection and defense method for reinforcement learning
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121510A1 (en) * 2020-12-11 2022-06-16 多伦科技股份有限公司 Stochastic policy gradient-based traffic signal control method and system, and electronic device
CN112418349A (en) * 2020-12-12 2021-02-26 武汉第二船舶设计研究所(中国船舶重工集团公司第七一九研究所) Distributed multi-agent deterministic strategy control method for large complex system
CN112488826A (en) * 2020-12-16 2021-03-12 北京逸风金科软件有限公司 Method and device for optimizing bank risk pricing based on deep reinforcement learning
CN112633415A (en) * 2021-01-11 2021-04-09 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster intelligent task execution method and device based on rule constraint training
CN112633415B (en) * 2021-01-11 2023-05-19 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster intelligent task execution method and device based on rule constraint training
CN112802091A (en) * 2021-01-28 2021-05-14 北京理工大学 DQN-based intelligent confrontation behavior realization method under augmented reality condition
CN112802091B (en) * 2021-01-28 2023-08-29 北京理工大学 DQN-based agent countermeasure behavior realization method under augmented reality condition
CN112965380A (en) * 2021-02-07 2021-06-15 北京云量数盟科技有限公司 Method for controlling intelligent equipment based on reinforcement learning strategy
CN112965380B (en) * 2021-02-07 2022-11-08 北京云量数盟科技有限公司 Method for controlling intelligent equipment based on reinforcement learning strategy
CN112843725A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN112843726A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN112884131A (en) * 2021-03-16 2021-06-01 浙江工业大学 Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN112947581B (en) * 2021-03-25 2022-07-05 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113050430A (en) * 2021-03-29 2021-06-29 浙江大学 Drainage system control method based on robust reinforcement learning
CN113377099A (en) * 2021-03-31 2021-09-10 南开大学 Robot pursuit game method based on deep reinforcement learning
CN113095463A (en) * 2021-03-31 2021-07-09 南开大学 Robot confrontation method based on evolution reinforcement learning
CN113221444A (en) * 2021-04-20 2021-08-06 中国电子科技集团公司第五十二研究所 Behavior simulation training method for air intelligent game
CN113253605A (en) * 2021-05-20 2021-08-13 电子科技大学 Active disturbance rejection unmanned transverse control method based on DDPG parameter optimization
CN113378456A (en) * 2021-05-21 2021-09-10 青海大学 Multi-park comprehensive energy scheduling method and system
CN113378456B (en) * 2021-05-21 2023-04-07 青海大学 Multi-park comprehensive energy scheduling method and system
CN113255936A (en) * 2021-05-28 2021-08-13 浙江工业大学 Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism
CN113255936B (en) * 2021-05-28 2024-02-13 浙江工业大学 Deep reinforcement learning strategy protection defense method and device based on imitation learning and attention mechanism
WO2022252039A1 (en) * 2021-05-31 2022-12-08 Robert Bosch Gmbh Method and apparatus for adversarial attacking in deep reinforcement learning
CN113344071A (en) * 2021-06-02 2021-09-03 沈阳航空航天大学 Intrusion detection algorithm based on depth strategy gradient
CN113344071B (en) * 2021-06-02 2024-01-26 新疆能源翱翔星云科技有限公司 Intrusion detection algorithm based on depth strategy gradient
CN113420326B (en) * 2021-06-08 2022-06-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113420326A (en) * 2021-06-08 2021-09-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113392396A (en) * 2021-06-11 2021-09-14 浙江工业大学 Strategy protection defense method for deep reinforcement learning
CN113485313A (en) * 2021-06-25 2021-10-08 杭州玳数科技有限公司 Anti-interference method and device for automatic driving vehicle
CN113487039A (en) * 2021-06-29 2021-10-08 山东大学 Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN113487039B (en) * 2021-06-29 2023-08-22 山东大学 Deep reinforcement learning-based intelligent self-adaptive decision generation method and system
CN113360917A (en) * 2021-07-07 2021-09-07 浙江工业大学 Deep reinforcement learning model security reinforcement method and device based on differential privacy
CN113435598B (en) * 2021-07-08 2022-06-21 中国人民解放军国防科技大学 Knowledge-driven intelligent strategy deduction decision method
CN113435598A (en) * 2021-07-08 2021-09-24 中国人民解放军国防科技大学 Knowledge-driven intelligent strategy deduction decision method
CN113487870B (en) * 2021-07-19 2022-07-15 浙江工业大学 Anti-disturbance generation method for intelligent single intersection based on CW (continuous wave) attack
CN113487870A (en) * 2021-07-19 2021-10-08 浙江工业大学 Method for generating anti-disturbance to intelligent single intersection based on CW (continuous wave) attack
CN117833997A (en) * 2024-03-01 2024-04-05 南京控维通信科技有限公司 Multidimensional resource allocation method of NOMA multi-beam satellite communication system based on reinforcement learning
CN117833997B (en) * 2024-03-01 2024-05-31 南京控维通信科技有限公司 Multidimensional resource allocation method of NOMA multi-beam satellite communication system based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN112052456A (en) Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN111310915B (en) Data anomaly detection defense method oriented to reinforcement learning
CN110991545B (en) Multi-agent confrontation oriented reinforcement learning training optimization method and device
Stanescu et al. Evaluating real-time strategy game states using convolutional neural networks
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN113420326B (en) Deep reinforcement learning-oriented model privacy protection method and system
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN112884130A (en) SeqGAN-based deep reinforcement learning data enhanced defense method and device
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN113255936A (en) Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism
CN113392396A (en) Strategy protection defense method for deep reinforcement learning
CN113952733A (en) Multi-agent self-adaptive sampling strategy generation method
CN112069504A (en) Model enhanced defense method for resisting attack by deep reinforcement learning
CN111348034B (en) Automatic parking method and system based on generation countermeasure simulation learning
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN113360917A (en) Deep reinforcement learning model security reinforcement method and device based on differential privacy
Yang et al. Adaptive inner-reward shaping in sparse reward games
Churchill et al. An analysis of model-based heuristic search techniques for StarCraft combat scenarios
CN113276852B (en) Unmanned lane keeping method based on maximum entropy reinforcement learning framework
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
Ji et al. Improving decision-making efficiency of image game based on deep Q-learning
Petosa et al. Multiplayer alphazero
Marius et al. Combining scripted behavior with game tree search for stronger, more robust game AI
Balachandar et al. Collaboration of ai agents via cooperative multi-agent deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination