CN112052456A - Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Info

Publication number
CN112052456A
Authority
CN
China
Prior art keywords: agent, antagonistic, target, data, representing
Legal status
Pending
Application number
CN202010899020.2A
Other languages
Chinese (zh)
Inventor
陈晋音
章燕
王雪柯
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010899020.2A
Publication of CN112052456A

Classifications

    • G06F 21/57 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity; monitoring users, programs or devices to maintain the integrity of platforms; certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods


Abstract

The invention discloses a deep reinforcement learning strategy optimization defense method based on multiple intelligent agents, which comprises the following steps: (1) constructing an autonomous driving environment comprising a target agent and a plurality of antagonistic agents; (2) storing the state transition process data of the target agent in experience replay buffers D+ and D− respectively, according to whether the antagonistic agents' attack on the target agent succeeds or fails, and updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from D+ and D−; (3) carrying out game training between the antagonistic agents and the target agent vehicle, storing the state transition process data of the target agent in an experience buffer D, and sampling data from D to update the parameters of the decision gradient algorithm model corresponding to the target agent, until the game training ends; (4) in application, inputting the collected local environment state data into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.

Description

Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
Technical Field
The invention belongs to the field of deep reinforcement learning defense, and particularly relates to a deep reinforcement learning strategy optimization defense method based on multiple intelligent agents.
Background
Deep reinforcement learning has been one of the most closely watched directions of artificial intelligence in recent years. With its rapid development and application, reinforcement learning has been widely used in robot control, game playing, computer vision, unmanned driving, and other fields. To ensure the safe application of deep reinforcement learning in safety-critical fields, the key is to analyze and discover vulnerabilities in deep reinforcement learning algorithms and models, so as to prevent malicious parties from exploiting these vulnerabilities for illegal gain. Unlike the single-step prediction task of traditional machine learning, a deep reinforcement learning system needs to make multi-step decisions to complete a task, and these successive decisions are highly correlated.
The basic idea of reinforcement learning is to learn the optimal strategy by maximizing the cumulative reward that the agent obtains from the environment, so as to achieve the learning goal. However, the strategy obtained by deep reinforcement learning training carries potential safety hazards and cannot cope well with all possible scenarios. Especially in safety-critical fields, such hazards can cause great harm by leading the reinforcement learning system to make wrong decisions, which poses a significant challenge for applying reinforcement learning where decision safety matters.
Because the strategies obtained by reinforcement learning training have potential safety hazards, improving the robustness of reinforcement learning models and strategies so that they can be applied effectively and safely in safety-critical decision-making has increasingly become a focus of attention. According to the existing defense mechanisms, common reinforcement learning defense methods can be divided into three categories: adversarial training, robust learning, and adversarial detection. For example, application number CN201911184051.3 discloses a defense method against attacks oriented to deep reinforcement learning models, and application number CN201910567203.1 discloses an insecure XSS defense system identification method based on reinforcement learning.
Disclosure of Invention
The invention aims to provide a multi-agent-based deep reinforcement learning strategy optimization defense method for automatic driving scenarios. The method trains a plurality of antagonistic agents to play a game against a target agent, and adopts an information asymmetry mechanism between the target agent and the antagonistic agents to train the reinforcement learning model and optimize its strategy, which improves the robustness of the reinforcement learning model, further improves the accuracy of the model's decision actions, and avoids potential safety hazards.
In order to achieve the purpose, the invention provides the following technical scheme:
a deep reinforcement learning strategy optimization defense method based on multiple agents comprises the following steps:
(1) acquiring global environment state data and local environment state data in an automatic driving environment comprising a target agent and a plurality of antagonistic agents, and initializing the antagonistic agents and the target agent by utilizing a decision gradient algorithm model, wherein the decision gradient algorithm model comprises an Actor network model and a Critic network model;
(2) the goal of the antagonistic agents is to attack the target agent as much as possible and make it execute wrong decision actions; according to whether the antagonistic agents' attack on the target agent succeeds or fails, the state transition process data of the target agent, $(x, a, r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_n, x')$, is stored respectively in an experience replay buffer D+ and an experience replay buffer D−, where $x$ denotes the global environment state data observed by the antagonistic agents, including the other antagonistic agents, the target agent, and their expected reward values, $a$ denotes the policy actions taken by the antagonistic agents in that environment, $r^{\mathrm{ind}}_i$ denotes the individual reward value of the $i$-th antagonistic agent, and $x'$ denotes the next global environment state data observable by the antagonistic agents; data sampled from the experience buffers D+ and D− are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents;
(3) on the basis of step (2), after the antagonistic agents have obtained their antagonistic strategies, they carry out game training together with the target agent vehicle; during training, the state transition process data of the target agent, $(s_0, a_0, r_0, s'_0)$, is stored in an experience buffer D, where $s_0$ denotes the local environment state data observable by the target agent, including the antagonistic agents in proximity to the target agent, $a_0$ denotes the policy action taken by the target agent in state $s_0$ under the influence of the antagonistic strategies, $r_0$ denotes the target agent's instant reward, and $s'_0$ denotes the next environment state data observable by the target agent under the influence of the antagonistic agents; data are sampled from the experience buffer D to update the parameters of the decision gradient algorithm model corresponding to the target agent, until the game training ends;
(4) in application, the collected local environment state data is input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.
In step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained, during the initial random exploration the antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, i.e.

$a^i_t = \mu(s_i \mid \theta_i) + N_t$

where $s_i$ denotes the global environment state data of the $i$-th antagonistic agent, $N_t$ denotes the random noise value added at time step $t$, and $\mu(s_i \mid \theta_i)$ denotes the output of the Actor network model under the parameters $\theta_i$.
In step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained and the antagonistic agents' attack on the target agent succeeds, the antagonistic agent is awarded a counter reward $r^{\mathrm{adv}}_{i,t}$ in addition to its individual reward, where $r^{\mathrm{adv}}_{i,t}$ denotes the counter reward of the $i$-th antagonistic agent at time step $t$, $r^{\mathrm{ind}}_{i,t}$ denotes the individual reward value of the $i$-th antagonistic agent at time $t$, $\alpha$ denotes the counter reward factor, and the value of $\alpha$ is determined by the number $k$ of antagonistic agents participating in the win.
In step (2), when updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from the experience buffers D+ and D−,
$\eta(t) \cdot S$ samples are drawn from experience buffer D+ and $(1-\eta(t)) \cdot S$ samples from experience buffer D−, where $t$ denotes the time step and $S$ denotes the total number of samples; $\eta(t)$ is taken as 0.5 during the first $M$ time steps and as 0.75 when $t$ exceeds $M$ time steps. Data collected with this sampling strategy are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents.
Compared with the prior art, the invention has at least the following beneficial effects:
1) training a plurality of antagonistic agents in the autonomous driving environment enhances the target agent's exploration of the environment;
2) in game training, the target agent and the antagonistic agents adopt an asymmetric environment-state-data mechanism, which reduces conflicts among the antagonistic agents and at the same time helps the target agent observe and search for a better training strategy;
3) during deep reinforcement learning training, the strategy of the target agent is optimized by training in a game scenario between the antagonistic agents and the target agent, which improves the robustness of the decision gradient algorithm model corresponding to the target agent, improves the accuracy of the model's decision actions, and avoids potential safety hazards.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a multi-agent based deep reinforcement learning strategy optimization defense method provided by an embodiment;
FIG. 2 is a schematic structural diagram of a DDPG algorithm model provided by the embodiment;
FIG. 3 is a schematic diagram of the antagonistic agent training process provided by the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in FIGS. 1 to 3, the method for optimizing and defending a multi-agent-based deep reinforcement learning strategy provided by the embodiment comprises the following steps:
1) a target agent training process.
1.1) building an automatic driving simulation environment for the reinforcement learning vehicle;
1.2) training a target agent (denoted by subscript 0) and antagonistic agents (denoted by subscripts 1, ..., n) based on the deep deterministic policy gradient (DDPG) algorithm in reinforcement learning; both the target agent and the antagonistic agents can be intelligent vehicles. The core of the DDPG algorithm extends the Actor-Critic method, the DQN algorithm, and the deterministic policy gradient (DPG): a deterministic policy $\mu$ is adopted to select the action $a_t = \mu(s \mid \theta^{\mu})$, where $\theta^{\mu}$ are the parameters of the policy network $\mu(s \mid \theta^{\mu})$ that produces deterministic actions, with $\mu(s)$ serving as the Actor, and $\theta^{Q}$ are the parameters of the value network $Q(s, a \mid \theta^{Q})$, with the $Q(s, a)$ function serving as the Critic. To improve training stability, target networks are introduced for both the policy network and the value network.
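By way of illustration only (not part of the claimed method), the Actor-Critic structure described above can be sketched in Python/PyTorch as follows; the layer sizes, state/action dimensions and class names are assumptions made for the example:

import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Deterministic policy network mu(s | theta_mu): maps a state to an action.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh())  # actions scaled to [-1, 1]

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Value network Q(s, a | theta_Q): scores a state-action pair.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Target networks (slowly updated copies) are introduced to improve training stability.
actor, critic = Actor(8, 2), Critic(8, 2)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)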
1.3) during training, the state transition process $(s_0, a_0, r_0, s'_0)$ of the target agent is stored in an experience replay buffer D, where $s_0$ denotes the local environment state data observable by the target agent, $a_0$ denotes the action taken by the target agent in state $s_0$, $r_0$ denotes the resulting instant reward, and $s'_0$ denotes the next environment state data observable by the target agent.
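A minimal sketch of such an experience replay buffer, assuming a fixed capacity and uniform sampling (both illustrative assumptions):

import random
from collections import deque

class ReplayBuffer:
    # Stores target-agent transitions (s0, a0, r0, s0_next) and samples minibatches.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s0, a0, r0, s0_next):
        self.buffer.append((s0, a0, r0, s0_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))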
2) Antagonistic agent training process:
2.1) training n antagonistic agents $Car_i$, $i = 1, \ldots, n$, where n may be, for example, 2 or 3:
During the initial random exploration, each antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, i.e.

$a^i_t = \mu(s_i \mid \theta_i) + N_t$

where $s_i$ denotes the global environment state data of the $i$-th antagonistic agent and $N_t$ denotes the random noise value added at time step $t$.
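A brief sketch of this noisy action selection, assuming Gaussian exploration noise (the patent does not fix the noise distribution) and the Actor network sketched earlier:

import torch

def select_action(actor, state, noise_std=0.1):
    # a_t = mu(s | theta) + N_t : actor output plus random exploration noise.
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32))
    noise = noise_std * torch.randn_like(action)
    return (action + noise).clamp(-1.0, 1.0)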
2.2) during training, the state transition data of the antagonistic agents, $(x, a, r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_n, x')$, are temporarily stored in an experience buffer $D_{tmp}$, where $x = (s_0, s_1, \ldots, s_n)$ denotes the respective global environment state data observed by the agents, $a = (a_0, a_1, \ldots, a_n)$ denotes the actions taken by the agents, and $r^{\mathrm{ind}}_i$ denotes the individual reward, which constrains the individual behavior of each antagonistic agent so that it performs normal behavior. After each round ends, it is judged whether the antagonistic agents succeeded. If they succeeded, the data in experience buffer $D_{tmp}$ are transferred to buffer D+ and the corresponding counter reward is awarded, i.e. each winning antagonistic agent receives a counter reward $r^{\mathrm{adv}}_{i,t}$ in addition to its individual reward $r^{\mathrm{ind}}_{i,t}$, where $\alpha$ denotes the counter reward factor and its value is determined by the number $k$ of antagonistic agents participating in the win. If the target agent wins, the data in experience buffer $D_{tmp}$ are transferred to buffer D− and only the corresponding individual reward is awarded, without a counter reward.
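A schematic Python sketch of this round-end bookkeeping; the additive bonus alpha_bonus stands in for the α term, whose exact expression is given only as a formula image in the original, so it is a placeholder assumption:

def end_of_round(D_tmp, D_plus, D_minus, attack_succeeded, alpha_bonus):
    # D_tmp holds the round's transitions (x, a, r_ind, x_next).
    if attack_succeeded:
        # Antagonistic agents win: add a counter reward to each individual reward,
        # then move the round's data into D+.
        for (x, a, r_ind, x_next) in D_tmp:
            r_adv = [r + alpha_bonus for r in r_ind]
            D_plus.append((x, a, r_adv, x_next))
    else:
        # Target agent wins: keep only the individual rewards and move data into D-.
        D_minus.extend(D_tmp)
    D_tmp.clear()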
2.3) sampling from the experience buffers D+ and D− according to the sampling ratio $\eta(t)$ to update the network parameters of the antagonistic agents:
$\eta(t) \cdot S$ samples are drawn from experience buffer D+ and $(1-\eta(t)) \cdot S$ samples from D−, where $t$ denotes the time step and $S$ denotes the total number of samples; $\eta(t)$ is taken as 0.5 during the first $M$ time steps and as 0.75 when $t$ exceeds $M$ time steps, so as to better optimize the strategy of the target agent.
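A minimal sketch of this two-buffer sampling schedule, with M and S left as free parameters:

import random

def sample_mixed(D_plus, D_minus, t, S, M):
    # eta(t) = 0.5 for the first M time steps, 0.75 afterwards.
    eta = 0.5 if t <= M else 0.75
    n_plus = int(eta * S)
    batch = random.sample(D_plus, min(n_plus, len(D_plus)))
    batch += random.sample(D_minus, min(S - n_plus, len(D_minus)))
    return batch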
3) The game process of the target agent and the antagonistic agent comprises the following steps:
3.1) the game process between the target agent and the antagonistic agents adopts an information asymmetry mechanism: the antagonistic agents can obtain the observed global environment data, including the other antagonistic agents, the target agent, and the target agent's expected reward value, whereas the target agent only observes local environment data, so the information feedback obtained by the two sides is asymmetric.
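Purely for illustration, the asymmetric observations could be assembled as follows; the field names are assumptions, not part of the patent:

def build_observations(env_state, target_expected_reward):
    # Antagonistic agents see the full (global) state plus the target's expected reward.
    global_obs = {
        "all_agent_states": env_state["all_agent_states"],
        "target_state": env_state["target_state"],
        "target_expected_reward": target_expected_reward,
    }
    # The target agent only sees its local surroundings (nearby antagonistic agents).
    local_obs = {
        "own_state": env_state["target_state"],
        "nearby_adversaries": env_state["nearby_adversaries"],
    }
    return global_obs, local_obs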
3.2) the antagonistic agents sample data from the experience buffers D+ and D−. Suppose the Actor network and Critic network parameters of the n antagonistic agents are denoted respectively as $\theta = \{\theta_1, \ldots, \theta_n\}$ and $\omega = \{\omega_1, \ldots, \omega_n\}$, and let the policies be $\mu_i = \mu(s_i \mid \theta_i)$ and the value functions be $Q_i^{\mu}(x, a_0, \ldots, a_n \mid \omega_i)$.

The Actor network parameters are updated by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D^{\pm}}\left[ \nabla_{\theta_i} \mu_i(s_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_0, \ldots, a_n) \big|_{a_i = \mu_i(s_i)} \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$ and $D^{\pm}$ denotes the experience buffers D+ and D−.

The Critic network parameters are updated by minimizing the loss function L(·) between the actual cumulative reward and the action value Q function:

$L(\omega_i) = \mathbb{E}_{x, a, r, x'}\left[ \left( Q_i^{\mu}(x, a_0, \ldots, a_n) - y_i \right)^2 \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$ and $y_i = r_i + \gamma \, Q_i^{\mu'}(x', a'_0, \ldots, a'_n)\big|_{a'_j = \mu'_j(s'_j)}$ denotes the actual cumulative reward value computed with the target networks; $\gamma$ is the decay factor, taking a value in [0, 1]. The parameters of the target networks are updated in a soft-update manner:

$\theta'_i \leftarrow \tau \theta_i + (1 - \tau)\theta'_i$

$\omega'_i \leftarrow \tau \omega_i + (1 - \tau)\omega'_i$

where $\tau$ is the soft-update coefficient.
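A hedged PyTorch-style sketch of the Critic regression target and the soft update described above, simplified to single-agent tensors for brevity (the patent's multi-agent version would concatenate all agents' states and actions into the Critic input):

import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, optimizer, batch, gamma=0.99):
    state, action, reward, next_state = batch  # tensors sampled from D+ / D-
    with torch.no_grad():
        next_action = target_actor(next_state)
        # y = r + gamma * Q'(s', mu'(s')): the "actual cumulative reward" target.
        y = reward.view(-1, 1) + gamma * target_critic(next_state, next_action)
    loss = F.mse_loss(critic(state, action), y)  # L(omega)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def soft_update(net, target_net, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p, tp in zip(net.parameters(), target_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)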
3.3) after obtaining their antagonistic strategies, the antagonistic agents carry out game training together with the target agent, and the round exploration experience gathered during training is stored in an experience buffer D. D contains the target agent's state transition process $(s_0, a_0, r_0, s'_0)$, where $s_0$ denotes the environment state data observable by the target agent (including antagonistic agents observed in the environment close to the target agent), $a_0$ denotes the action taken by the target agent in state $s_0$ under the influence of the antagonistic strategies, $r_0$ denotes the resulting instant reward, and $s'_0$ denotes the next state data that the target agent can observe under the influence of the antagonistic agents. The target agent then samples N state transitions from D and updates the policy parameters $\theta_0$ of its Actor network by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_0} J(\mu_0) = \mathbb{E}_{s_0 \sim D}\left[ \nabla_{\theta_0} \mu_0(s_0) \, \nabla_{a_0} Q_0(s_0, a_0 \mid \omega_0) \big|_{a_0 = \mu_0(s_0)} \right]$

and updates the Critic network parameters $\omega_0$ by minimizing the loss function Loss between the actual cumulative reward and the action value Q function:

$Loss(\omega_0) = \mathbb{E}\left[ \left( Q_0(s_0, a_0 \mid \omega_0) - y_0 \right)^2 \right]$

where $y_0 = r_0 + \gamma \, Q'_0(s'_0, a'_0 \mid \omega'_0)\big|_{a'_0 = \mu'_0(s'_0)}$, and $\gamma$ is the decay factor, taking a value in [0, 1]. The target network parameters $\theta'_0$ and $\omega'_0$ are updated in a soft-update manner:

$\theta'_0 \leftarrow \tau \theta_0 + (1 - \tau)\theta'_0$

$\omega'_0 \leftarrow \tau \omega_0 + (1 - \tau)\omega'_0$
When the training of the decision gradient algorithm model corresponding to the target agent is finished, the trained model can be used directly in application: the collected local environment state data is input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.
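A minimal deployment sketch, assuming the trained Actor from the earlier snippets and local observations supplied as a feature vector:

import torch

def decide(trained_actor, local_state):
    # Feed the collected local environment state into the trained policy network and
    # output the decision action for the target agent to execute (no exploration noise).
    trained_actor.eval()
    with torch.no_grad():
        state = torch.as_tensor(local_state, dtype=torch.float32)
        return trained_actor(state).numpy()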
In the multi-agent-based deep reinforcement learning strategy optimization defense method: 1) training a plurality of antagonistic agents in the automatic driving environment enhances the target agent's exploration of the environment; 2) in game training, the target agent and the antagonistic agents adopt an asymmetric environment-state-data mechanism, which reduces conflicts among the antagonistic agents and at the same time helps the target agent observe and search for a better training strategy; 3) during deep reinforcement learning training, the strategy of the target agent is optimized by training in a game scenario between the antagonistic agents and the target agent, which improves the robustness of the decision gradient algorithm model corresponding to the target agent, improves the accuracy of the model's decision actions, and avoids potential safety hazards.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit the present invention; any modifications, additions, or equivalents made within the scope of the principles of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A deep reinforcement learning strategy optimization defense method based on multiple agents is characterized by comprising the following steps:
(1) acquiring global environment state data and local environment state data in an automatic driving environment comprising a target agent and a plurality of antagonistic agents, and initializing the antagonistic agents and the target agent by utilizing a decision gradient algorithm model, wherein the decision gradient algorithm model comprises an Actor network model and a Critic network model;
(2) the goal of the antagonistic agents is to attack the target agent as much as possible and make it execute wrong decision actions; according to whether the antagonistic agents' attack on the target agent succeeds or fails, the state transition process data of the target agent, $(x, a, r^{\mathrm{ind}}_1, \ldots, r^{\mathrm{ind}}_n, x')$, is stored respectively in an experience replay buffer D+ and an experience replay buffer D−, where $x$ denotes the global environment state data observed by the antagonistic agents, including the other antagonistic agents, the target agent, and their expected reward values, $a$ denotes the policy actions taken by the antagonistic agents in that environment, $r^{\mathrm{ind}}_i$ denotes the individual reward value of the $i$-th antagonistic agent, and $x'$ denotes the next global environment state data observable by the antagonistic agents; data sampled from the experience buffers D+ and D− are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents;
(3) on the basis of step (2), after the antagonistic agents have obtained their antagonistic strategies, they carry out game training together with the target agent vehicle; during training, the state transition process data of the target agent, $(s_0, a_0, r_0, s'_0)$, is stored in an experience buffer D, where $s_0$ denotes the local environment state data observable by the target agent, including the antagonistic agents in proximity to the target agent, $a_0$ denotes the policy action taken by the target agent in state $s_0$ under the influence of the antagonistic strategies, $r_0$ denotes the target agent's instant reward, and $s'_0$ denotes the next environment state data observable by the target agent under the influence of the antagonistic agents; data are sampled from the experience buffer D to update the parameters of the decision gradient algorithm model corresponding to the target agent, until the game training ends;
(4) in application, the collected local environment state data is input into the decision gradient algorithm model corresponding to the target agent, which computes and outputs a decision action to guide the target agent's execution.
2. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained, during the initial random exploration the antagonistic agent selects an action value by adding a random noise value to the output of the initialized Actor network model, i.e.

$a^i_t = \mu(s_i \mid \theta_i) + N_t$

where $s_i$ denotes the global environment state data of the $i$-th antagonistic agent, $N_t$ denotes the random noise value added at time step $t$, and $\mu(s_i \mid \theta_i)$ denotes the output of the Actor network model under the parameters $\theta_i$.
3. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when the decision gradient algorithm model corresponding to an antagonistic agent is trained and the antagonistic agents' attack on the target agent succeeds, the antagonistic agent is awarded a counter reward $r^{\mathrm{adv}}_{i,t}$ in addition to its individual reward, where $r^{\mathrm{adv}}_{i,t}$ denotes the counter reward of the $i$-th antagonistic agent at time step $t$, $r^{\mathrm{ind}}_{i,t}$ denotes the individual reward value of the $i$-th antagonistic agent at time $t$, $\alpha$ denotes the counter reward factor, and the value of $\alpha$ is determined by the number $k$ of antagonistic agents participating in the win.
4. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), when updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from the experience buffers D+ and D−,
$\eta(t) \cdot S$ samples are drawn from experience buffer D+ and $(1-\eta(t)) \cdot S$ samples from experience buffer D−, where $t$ denotes the time step and $S$ denotes the total number of samples; $\eta(t)$ is taken as 0.5 during the first $M$ time steps and as 0.75 when $t$ exceeds $M$ time steps, and data collected with this sampling strategy are used to update the parameters of the decision gradient algorithm models corresponding to the antagonistic agents.
5. The multi-agent based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (2), the process of updating the parameters of the decision gradient algorithm models corresponding to the antagonistic agents with data sampled from the experience buffers D+ and D− comprises:

The antagonistic agents sample data from the experience buffers D+ and D−. Suppose the Actor network and Critic network parameters of the n antagonistic agents are denoted respectively as $\theta = \{\theta_1, \ldots, \theta_n\}$ and $\omega = \{\omega_1, \ldots, \omega_n\}$, and let the policies be $\mu_i = \mu(s_i \mid \theta_i)$ and the value functions be $Q_i^{\mu}(x, a_0, \ldots, a_n \mid \omega_i)$.

The Actor network parameters are updated by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D^{\pm}}\left[ \nabla_{\theta_i} \mu_i(s_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_0, \ldots, a_n) \big|_{a_i = \mu_i(s_i)} \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$ and $D^{\pm}$ denotes the experience buffers D+ and D−.

The Critic network parameters are updated by minimizing the loss function L(·) between the actual cumulative reward and the action value Q function:

$L(\omega_i) = \mathbb{E}_{x, a, r, x'}\left[ \left( Q_i^{\mu}(x, a_0, \ldots, a_n) - y_i \right)^2 \right]$

where $a_{0:n} = \{a_0, \ldots, a_n\}$, $y_i = r_i + \gamma \, Q_i^{\mu'}(x', a'_0, \ldots, a'_n)\big|_{a'_j = \mu'_j(s'_j)}$ denotes the actual cumulative reward value, and $\gamma$ is the decay factor, taking a value in [0, 1].
6. The multi-agent-based deep reinforcement learning strategy optimization defense method according to claim 1, wherein in step (3), the process of sampling data from the experience buffer D and updating the parameters of the decision gradient algorithm model corresponding to the target agent comprises:

The target agent samples N state transitions from the experience buffer D and updates the policy parameters $\theta_0$ of the Actor network by calculating the gradient of the expected cumulative reward function:

$\nabla_{\theta_0} J(\mu_0) = \mathbb{E}_{s_0 \sim D}\left[ \nabla_{\theta_0} \mu_0(s_0) \, \nabla_{a_0} Q_0(s_0, a_0 \mid \omega_0) \big|_{a_0 = \mu_0(s_0)} \right]$

and updates the Critic network parameters $\omega_0$ by minimizing the loss function Loss between the actual cumulative reward and the action value Q function:

$Loss(\omega_0) = \mathbb{E}\left[ \left( Q_0(s_0, a_0 \mid \omega_0) - y_0 \right)^2 \right]$

where $y_0 = r_0 + \gamma \, Q'_0(s'_0, a'_0 \mid \omega'_0)\big|_{a'_0 = \mu'_0(s'_0)}$, and $\gamma$ is the decay factor, taking a value in [0, 1].
CN202010899020.2A 2020-08-31 2020-08-31 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents Pending CN112052456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010899020.2A CN112052456A (en) 2020-08-31 2020-08-31 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents


Publications (1)

Publication Number Publication Date
CN112052456A true CN112052456A (en) 2020-12-08

Family

ID=73607813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010899020.2A Pending CN112052456A (en) 2020-08-31 2020-08-31 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Country Status (1)

Country Link
CN (1) CN112052456A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991545A (en) * 2019-12-10 2020-04-10 中国人民解放军军事科学院国防科技创新研究院 Multi-agent confrontation oriented reinforcement learning training optimization method and device
CN111310915A (en) * 2020-01-21 2020-06-19 浙江工业大学 Data anomaly detection and defense method for reinforcement learning
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121510A1 (en) * 2020-12-11 2022-06-16 多伦科技股份有限公司 Stochastic policy gradient-based traffic signal control method and system, and electronic device
CN112418349A (en) * 2020-12-12 2021-02-26 武汉第二船舶设计研究所(中国船舶重工集团公司第七一九研究所) Distributed multi-agent deterministic strategy control method for large complex system
CN112488826A (en) * 2020-12-16 2021-03-12 北京逸风金科软件有限公司 Method and device for optimizing bank risk pricing based on deep reinforcement learning
CN112633415A (en) * 2021-01-11 2021-04-09 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster intelligent task execution method and device based on rule constraint training
CN112633415B (en) * 2021-01-11 2023-05-19 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster intelligent task execution method and device based on rule constraint training
CN112802091A (en) * 2021-01-28 2021-05-14 北京理工大学 DQN-based intelligent confrontation behavior realization method under augmented reality condition
CN112802091B (en) * 2021-01-28 2023-08-29 北京理工大学 DQN-based agent countermeasure behavior realization method under augmented reality condition
CN112965380A (en) * 2021-02-07 2021-06-15 北京云量数盟科技有限公司 Method for controlling intelligent equipment based on reinforcement learning strategy
CN112965380B (en) * 2021-02-07 2022-11-08 北京云量数盟科技有限公司 Method for controlling intelligent equipment based on reinforcement learning strategy
CN112843725A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN112843726A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN112884131A (en) * 2021-03-16 2021-06-01 浙江工业大学 Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN112947581B (en) * 2021-03-25 2022-07-05 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113050430A (en) * 2021-03-29 2021-06-29 浙江大学 Drainage system control method based on robust reinforcement learning
CN113377099A (en) * 2021-03-31 2021-09-10 南开大学 Robot pursuit game method based on deep reinforcement learning
CN113095463A (en) * 2021-03-31 2021-07-09 南开大学 Robot confrontation method based on evolution reinforcement learning
CN113221444A (en) * 2021-04-20 2021-08-06 中国电子科技集团公司第五十二研究所 Behavior simulation training method for air intelligent game
CN113253605A (en) * 2021-05-20 2021-08-13 电子科技大学 Active disturbance rejection unmanned transverse control method based on DDPG parameter optimization
CN113378456A (en) * 2021-05-21 2021-09-10 青海大学 Multi-park comprehensive energy scheduling method and system
CN113378456B (en) * 2021-05-21 2023-04-07 青海大学 Multi-park comprehensive energy scheduling method and system
CN113255936A (en) * 2021-05-28 2021-08-13 浙江工业大学 Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism
CN113255936B (en) * 2021-05-28 2024-02-13 浙江工业大学 Deep reinforcement learning strategy protection defense method and device based on imitation learning and attention mechanism
WO2022252039A1 (en) * 2021-05-31 2022-12-08 Robert Bosch Gmbh Method and apparatus for adversarial attacking in deep reinforcement learning
CN113344071A (en) * 2021-06-02 2021-09-03 沈阳航空航天大学 Intrusion detection algorithm based on depth strategy gradient
CN113344071B (en) * 2021-06-02 2024-01-26 新疆能源翱翔星云科技有限公司 Intrusion detection algorithm based on depth strategy gradient
CN113420326B (en) * 2021-06-08 2022-06-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113420326A (en) * 2021-06-08 2021-09-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113392396A (en) * 2021-06-11 2021-09-14 浙江工业大学 Strategy protection defense method for deep reinforcement learning
CN113485313A (en) * 2021-06-25 2021-10-08 杭州玳数科技有限公司 Anti-interference method and device for automatic driving vehicle
CN113487039A (en) * 2021-06-29 2021-10-08 山东大学 Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN113487039B (en) * 2021-06-29 2023-08-22 山东大学 Deep reinforcement learning-based intelligent self-adaptive decision generation method and system
CN113360917A (en) * 2021-07-07 2021-09-07 浙江工业大学 Deep reinforcement learning model security reinforcement method and device based on differential privacy
CN113435598B (en) * 2021-07-08 2022-06-21 中国人民解放军国防科技大学 Knowledge-driven intelligent strategy deduction decision method
CN113435598A (en) * 2021-07-08 2021-09-24 中国人民解放军国防科技大学 Knowledge-driven intelligent strategy deduction decision method
CN113487870B (en) * 2021-07-19 2022-07-15 浙江工业大学 Anti-disturbance generation method for intelligent single intersection based on CW (continuous wave) attack
CN113487870A (en) * 2021-07-19 2021-10-08 浙江工业大学 Method for generating anti-disturbance to intelligent single intersection based on CW (continuous wave) attack
CN117833997A (en) * 2024-03-01 2024-04-05 南京控维通信科技有限公司 Multidimensional resource allocation method of NOMA multi-beam satellite communication system based on reinforcement learning
CN117833997B (en) * 2024-03-01 2024-05-31 南京控维通信科技有限公司 Multidimensional resource allocation method of NOMA multi-beam satellite communication system based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN112052456A (en) Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN111310915B (en) Data anomaly detection defense method oriented to reinforcement learning
CN110991545B (en) Multi-agent confrontation oriented reinforcement learning training optimization method and device
Stanescu et al. Evaluating real-time strategy game states using convolutional neural networks
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN113420326B (en) Deep reinforcement learning-oriented model privacy protection method and system
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN112884130A (en) SeqGAN-based deep reinforcement learning data enhanced defense method and device
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN113255936A (en) Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism
CN113392396A (en) Strategy protection defense method for deep reinforcement learning
CN113952733A (en) Multi-agent self-adaptive sampling strategy generation method
CN112069504A (en) Model enhanced defense method for resisting attack by deep reinforcement learning
CN111348034B (en) Automatic parking method and system based on generation countermeasure simulation learning
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN113360917A (en) Deep reinforcement learning model security reinforcement method and device based on differential privacy
Yang et al. Adaptive inner-reward shaping in sparse reward games
Churchill et al. An analysis of model-based heuristic search techniques for StarCraft combat scenarios
CN113276852B (en) Unmanned lane keeping method based on maximum entropy reinforcement learning framework
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
Ji et al. Improving decision-making efficiency of image game based on deep Q-learning
Petosa et al. Multiplayer alphazero
Marius et al. Combining scripted behavior with game tree search for stronger, more robust game AI
Balachandar et al. Collaboration of ai agents via cooperative multi-agent deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination