CN113283597A - Deep reinforcement learning model robustness enhancing method based on information bottleneck - Google Patents

Deep reinforcement learning model robustness enhancing method based on information bottleneck Download PDF

Info

Publication number
CN113283597A
CN113283597A (application CN202110652107.4A)
Authority
CN
China
Prior art keywords
state
reinforcement learning
information
deep reinforcement
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110652107.4A
Other languages
Chinese (zh)
Inventor
陈晋音
王珏
章燕
王雪柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110652107.4A priority Critical patent/CN113283597A/en
Publication of CN113283597A publication Critical patent/CN113283597A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck. The method limits the state information in deep reinforcement learning by setting an information bottleneck: an encoder encodes the state information in the transition tuple; the state observed in the environment is encoded and input into the policy network, the agent interacts with the environment according to the action of the policy network to obtain the state of the next round, that state is encoded in turn, and the interaction with the environment continues so as to train the policy network. With the robustness enhancement method disclosed by the invention, the trained policy still performs well on the original task and can resist the influence of adversarial attacks; the scale coefficient of the regularization term is set using an annealing schedule, which yields a stable training process, and the trained policy retains excellent performance on the normal task.

Description

Deep reinforcement learning model robustness enhancing method based on information bottleneck
Technical Field
The invention relates to the field of enhancing robustness in deep reinforcement learning; in particular to a method for enhancing robustness of a deep reinforcement learning model based on information bottleneck.
Background
With the rapid development of artificial intelligence, deep reinforcement learning algorithms, which combine the perception capability of deep learning with the decision-making capability of reinforcement learning, are widely applied in automatic driving, machine translation, dialogue systems, video detection and other fields.
However, deep reinforcement learning is susceptible to adversarial attacks: noise that is imperceptible to the human eye is added to the original sample; such noise does not affect human recognition, yet it can drive the trained strategy to take an action that is extremely adverse to the outcome, causing the whole decision-making process to fail.
There is therefore a need to enhance the robustness of deep reinforcement learning models against attacks.
An existing robustness enhancing method for deep reinforcement learning models is, for example, the SeqGAN-based enhanced defense method and device for deep reinforcement learning data disclosed in Chinese patent application CN112884130A, which comprises the following steps: building an automatic-driving simulation environment for the deep reinforcement learning agent, building a target agent based on a deep Q network, and performing reinforcement learning on the target agent to optimize the parameters of the deep Q network; using the parameter-optimized deep Q network to generate a state-action pair sequence of the target agent driving over T moments as expert data, wherein the action value in a state-action pair corresponds to the action with the minimum Q value; training a SeqGAN containing a generator and a discriminator by a reinforcement learning method, taking the state-action pairs in the expert data as the input of the generator to generate state-action pairs, simultaneously simulating sampling by a policy-gradient-based Monte Carlo search, forming a fixed-length state-action pair sequence from the sampled state-action pairs and the pairs generated by the generator, inputting that sequence into the discriminator, calculating a reward value, and updating the network parameters of the SeqGAN according to the reward value; inputting the current state into the parameter-optimized generator of the SeqGAN to obtain a generated state-action pair sequence, calculating the accumulated reward value of the generated sequence with the parameter-optimized deep Q network, comparing it with the accumulated reward value obtained by the deep Q network policy of the target agent, and storing the state-action pairs with the higher accumulated reward value as enhancement data for re-optimizing the deep Q network; and selecting enhancement data from storage to re-optimize the parameters of the deep Q network, so as to realize the enhanced defense of the deep reinforcement learning data.
However, research shows that the information bottleneck not only filters out useless information that is irrelevant to the task, but can also improve the generalization ability of adversarial inverse reinforcement learning; moreover, as an external processing module, the information bottleneck can be readily combined with various deep reinforcement learning algorithms. Therefore, how to set an information bottleneck to resist adversarial attacks is of important theoretical and practical significance for the application of deep reinforcement learning models.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck.
A deep reinforcement learning model robustness enhancing method based on information bottleneck comprises the following steps:
(1) setting information bottleneck limit on the state observed by the intelligent agent by using a proper encoder, and encoding an original state s observed by the intelligent agent by using the encoder to obtain a mapped state z;
(2) inputting the mapped state z of the original state into the intelligent agent, and generating an action by the intelligent agent according to the current strategy;
(3) interacting the action generated by the agent in the step (2) with the environment to obtain the next state;
(4) training an intelligent agent strategy according to the interactive result in the step (3);
(5) repeating the steps (1) to (4) until the overall return converges.
The encoder in the step (1) uses mutual information as an index to limit the information flow and filter adversarial information; the calculation formula of the mutual information is:
MI(X; Y) = D_KL[ p(x, y) || p(x)p(y) ] = E_{p(x,y)}[ log( p(x, y) / ( p(x)p(y) ) ) ]
where X and Y represent the corresponding variables, p(x, y) is the joint distribution, p(x) and p(y) are the marginal densities, MI(X; Y) is the calculated mutual information value and represents the correlation between the variables X and Y, D_KL denotes the KL divergence, and E_{p(x,y)} denotes the expectation of the subsequent expression over the joint distribution p(x, y).
The mutual information between the input and the output of the encoder is defined as MI(S; Z), and its value is constrained to remain below a certain level. The mutual information cannot be calculated directly, so it is estimated by sampling:
MI(Z; S) = D_KL[ p(Z, S) || p(Z)p(S) ] = E_S[ D_KL[ p(Z|S) || p(Z) ] ]
however, it is not reasonable that p (Z) needs to be calculated for the entire state space S, and q (Z) -N (0,1) is used instead of p (Z) in an approximate method.
E_S[ D_KL[ p(Z|S) || q(Z) ] ] ≥ MI(Z; S)
That is, an upper bound of the mutual information is used instead, since E_S[ D_KL[ p(Z|S) || q(Z) ] ] = MI(Z; S) + D_KL[ p(Z) || q(Z) ] ≥ MI(Z; S). Because a normal distribution is used for the approximation, the encoder only needs a neural network to estimate a mean and a variance; a distribution is constructed from the obtained mean and variance and sampled to obtain the encoded state z.
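As a concrete illustration of the Gaussian encoder described above, the following is a minimal sketch written for this description rather than taken from the patent; it assumes a PyTorch implementation, and the class name StateEncoder, the hidden-layer size and the use of the reparameterization trick are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Gaussian information-bottleneck encoder: maps an observed state s
    to a mean and log-variance, then samples the encoded state z."""

    def __init__(self, state_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, z_dim)       # mean of p(Z|S)
        self.logvar_head = nn.Linear(hidden, z_dim)   # log-variance of p(Z|S)

    def forward(self, s: torch.Tensor):
        h = self.backbone(s)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)          # reparameterized sample of z
        # Closed-form KL between N(mu, sigma^2) and q(Z) = N(0, 1); averaged over the
        # batch it estimates the regularizer E_S[D_KL[p(Z|S) || q(Z)]] derived above.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()
        return z, kl
```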
In the step (2), the state z is input into a Q-value function Q(z, a); with a certain probability ε a random action a is selected, and otherwise
a = argmax_a Q(z, a)
That is, an ε-greedy strategy is used to select the corresponding actions so as to balance exploration and exploitation.
The value of ε is annealed from 1 (completely random actions) to a smaller value such as 0.02 or 0.05; that is, the environment is explored as much as possible at the beginning, while the learned strategy is mostly followed towards the end of training.
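A sketch of the ε-greedy selection with a linearly annealed ε, continuing the PyTorch setting assumed above; the function names, the decay horizon and the final value of ε are illustrative assumptions, not taken from the patent.

```python
import random
import torch

def epsilon_by_step(step: int, eps_start: float = 1.0,
                    eps_end: float = 0.05, decay_steps: int = 100_000) -> float:
    """Linear annealing of epsilon from eps_start down to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_net, z: torch.Tensor, n_actions: int, eps: float) -> int:
    """Epsilon-greedy action selection on the encoded state z."""
    if random.random() < eps:
        return random.randrange(n_actions)       # explore: random action
    with torch.no_grad():
        return int(q_net(z).argmax(dim=-1))      # exploit: a = argmax_a Q(z, a)
```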
In the step (3), the intelligent agent selects an action according to the ε-greedy strategy and interacts with the environment to obtain a return r and a next state s'; the state s' is input into the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in an experience pool.
The experience pool mainly overcomes the problem that the samples used for updating are not independently and identically distributed: samples generated by the policies of adjacent time steps are strongly correlated, so a large number of transition tuples are stored in the experience pool and are randomly drawn during training; the drawn samples can then be regarded as approximately independent and identically distributed, which allows the Q-value network to be trained to good effect.
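A minimal sketch of such an experience pool; this is an editorial illustration rather than the patent's implementation: the capacity and batch size are arbitrary, and a done flag is added to the stored tuple to handle the episode-end target y = r described below.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool of transition tuples (s, a, r, s_next, done)."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, so batches are approximately i.i.d.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))   # (s_batch, a_batch, r_batch, s_next_batch, done_batch)

    def __len__(self):
        return len(self.buffer)
```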
In the step (4), the intelligent agent strategy is trained with a deep Q network; the specific steps are as follows:
(4.1) calculating a target y according to the tuples in the experience pool;
(4.2) calculating a loss function;
(4.3) minimizing the loss function using a stochastic gradient descent algorithm for updating the parameter values of the encoder and the Q-value function.
The calculation formula of the target y in step (4.1) is as follows:
y = r + γ max_{a'} Q̂(z', a')
where γ is a discount factor, Q̂ is the target network, which inherits the weight values of the main network every fixed number of training rounds, and r represents the return obtained after the agent takes a certain action at each time step.
γ is typically set to 0.99; if the current episode has just ended, then:
y=r
the loss function in step (4.2) is:
L = ( Q(z, a) - y )^2 + β E_S[ D_KL[ p(Z|S) || q(Z) ] ]
where β is a Lagrange multiplier, E_S denotes the expectation of the subsequent expression over the state space S, p(Z|S) denotes the probability of outputting Z given the known state S, and q(Z) is the approximate distribution used in place of p(Z).
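Under the same PyTorch assumptions as above, the target y and the regularized loss could be computed as in the sketch below. Note one editorial liberty: the sketch stores raw states and re-encodes them at update time so that gradients reach the encoder parameters, whereas the text above stores the already-encoded tuple (z, a, r, z'); this choice, like the tensor shapes, is an assumption and not a detail given by the patent.

```python
import torch
import torch.nn.functional as F

def compute_loss(encoder, q_net, target_q_net, batch, beta: float, gamma: float = 0.99):
    """TD loss (Q(z, a) - y)^2 plus the bottleneck regularizer beta * E_S[D_KL[p(Z|S) || q(Z)]]."""
    s, a, r, s_next, done = batch                  # one batch drawn from the experience pool; a is a LongTensor
    z, kl = encoder(s)                             # kl estimates E_S[D_KL[p(Z|S) || q(Z)]] on this batch
    q_za = q_net(z).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(z, a)
    with torch.no_grad():
        z_next, _ = encoder(s_next)
        max_next_q = target_q_net(z_next).max(dim=1).values    # max_a' Q_hat(z', a')
        y = r + gamma * (1.0 - done) * max_next_q              # y = r when the episode has ended
    return F.mse_loss(q_za, y) + beta * kl
```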
When the intelligent agent strategy is trained, the β value is gradually increased from 0.
In concrete training, the β value is first set to 0 and the encoder network parameters are fixed, so that the weights of the Q-value network are trained first; once the strategy performs well, the β value is gradually increased so that the information bottleneck can filter out adversarial information. Training continues until the total return value R converges, where R is the sum of the reward values of all steps in one episode.
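One illustrative way to realize the annealing of β described above; the warm-up length, ramp length and maximum β are assumed values that the patent does not specify.

```python
def beta_schedule(step: int, warmup_steps: int = 50_000,
                  ramp_steps: int = 100_000, beta_max: float = 1e-3) -> float:
    """Hold beta at 0 while the Q-value network warms up, then increase it linearly."""
    if step < warmup_steps:
        return 0.0
    frac = min((step - warmup_steps) / ramp_steps, 1.0)
    return frac * beta_max
```

In the patent the switch point is tied to the policy performing well rather than to a fixed step count; a fixed warm-up is used here only to keep the sketch self-contained.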
Compared with the prior art, the invention has the advantages that:
1. The information bottleneck is used to extract the part of the state information that plays a decisive role in the task, and the disturbance added to the original state by an adversarial attack is filtered out when the encoder encodes the state, so that the trained strategy still performs well on the original task and can resist the influence of adversarial attacks.
2. The scale coefficient of the regularization term is set using an annealing schedule, which yields a stable training process, and the trained strategy retains excellent performance on the normal task.
Drawings
FIG. 1 is a flowchart illustrating the overall steps of the present invention;
FIG. 2 is a schematic diagram of an encoder structure;
fig. 3 is a schematic diagram of a deep Q network structure.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
The robustness enhancement method of the deep reinforcement learning model based on the information bottleneck limits the state information in deep reinforcement learning by setting an information bottleneck and encodes the state information in the transition tuple through an encoder. Firstly, the state observed in the environment is encoded and the encoded state is input into the policy network; the agent then interacts with the environment according to the action of the policy network to obtain the state of the next round, that state is encoded in turn, and the interaction with the environment continues so as to train the policy network. The encoder is trained by adding a regularization term to the loss term of the original deep reinforcement learning algorithm. Because the encoder and the policy network influence each other, the annealing idea is adopted: the encoder is fixed at first and no constraint is placed on the loss term, and once the policy has been trained well the scale coefficient is gradually increased.
Fig. 1 is a flowchart of a method for enhancing robustness of a deep reinforcement learning model based on an information bottleneck according to this embodiment. The robustness enhancing method of the deep reinforcement learning model based on the information bottleneck can be used in the field of automatic driving, and the deep reinforcement learning model outputs decision actions according to the acquired environment state so as to guide automatic driving. As shown in FIG. 1, the method for enhancing robustness of the deep reinforcement learning model comprises the following steps:
setting information bottleneck limit on the state observed by the intelligent agent by using a proper encoder, and encoding an original state s observed by the intelligent agent by using the encoder to obtain a mapped state z;
inputting the mapped state z of the original state into the intelligent agent, and generating an action by the intelligent agent according to the current strategy;
interacting the action generated by the intelligent agent with the environment to obtain the next state;
training an agent strategy according to the interactive result;
judging whether the total return value R has converged: if so, the strategy training is finished; if not, the above steps are repeated until the total return value R converges.
Fig. 2 is a schematic structural diagram of the encoder provided in this embodiment. As shown in Fig. 2, the original state s observed by the agent is input into a neural network; the network computes a mean and a variance for s, constructs a normal distribution from them, and samples from this distribution to obtain the mapped state z.
The specific calculation process is as follows:
the encoder limits information flow and filters countermeasure information by using mutual information as an index, and the calculation formula of the mutual information is as follows:
Figure BDA0003112001500000051
wherein X and Y respectively represent corresponding variables, p (X, Y) is joint distribution, p (X) and p (Y) are edge density, MI (X; Y) represents calculated mutual information value, and represents correlation between variables X and Y, and DKLIndicating KL divergence, Ep(x,y)Representing the expectation of the subsequent expression on the joint distribution p (x, y).
The mutual information between the input and the output of the encoder is defined as MI(S; Z), and its value is constrained to remain below a certain level. The mutual information cannot be calculated directly, so it is estimated by sampling:
MI(Z; S) = D_KL[ p(Z, S) || p(Z)p(S) ] = E_S[ D_KL[ p(Z|S) || p(Z) ] ]
However, computing p(Z) would require integrating over the entire state space S, which is not practical, so the approximation q(Z) ~ N(0, 1) is used in place of p(Z):
E_S[ D_KL[ p(Z|S) || q(Z) ] ] ≥ MI(Z; S)
That is, an upper bound of the mutual information is used instead. Because a normal distribution is used for the approximation, the encoder only needs a neural network to estimate a mean and a variance; a distribution is constructed from the obtained mean and variance and sampled to obtain the encoded state z.
Fig. 3 is a schematic diagram of the deep Q network structure. As shown in Fig. 3, the agent interacts with the environment according to the action selected by the ε-greedy policy and thereby obtains a return r and a next state s'; the state s' is input to the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in the experience pool, which is mainly used to overcome the problem that the samples used for updating are not independently and identically distributed.
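Putting the pieces together, the interaction of Fig. 1 and Fig. 3 could be sketched roughly as below, reusing the illustrative StateEncoder, ReplayBuffer, select_action, epsilon_by_step, beta_schedule and compute_loss sketches above. A classic Gym-style environment API (env.reset() returning a state, env.step(a) returning (s', r, done, info)), an optimizer holding the parameters of both the encoder and the Q network, and the update intervals are all assumptions, not details given by the patent.

```python
def train(env, encoder, q_net, target_q_net, optimizer, total_steps: int = 200_000):
    """Encode the state, act epsilon-greedily, store the transition, update encoder and Q network."""
    buffer = ReplayBuffer()
    s = env.reset()
    for step in range(total_steps):               # in the patent, training runs until the return R converges
        z, _ = encoder(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
        a = select_action(q_net, z, env.action_space.n, epsilon_by_step(step))
        s_next, r, done, _ = env.step(a)          # interact with the environment
        buffer.push(s, a, r, s_next, float(done))
        if len(buffer) >= 1_000:                  # start updating once enough transitions are stored
            s_b, a_b, r_b, sn_b, d_b = (torch.as_tensor(x, dtype=torch.float32)
                                        for x in buffer.sample())
            loss = compute_loss(encoder, q_net, target_q_net,
                                (s_b, a_b.long(), r_b, sn_b, d_b), beta_schedule(step))
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        if step % 1_000 == 0:                     # target network inherits the main network's weights
            target_q_net.load_state_dict(q_net.state_dict())
        s = env.reset() if done else s_next
```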
The deep reinforcement learning model obtained by the robustness enhancement method based on the information bottleneck has strong robustness, and when the method is applied to the field of automatic driving, a decision-making action can be accurately given according to an environmental state.

Claims (8)

1. A deep reinforcement learning model robustness enhancing method based on information bottleneck is characterized by comprising the following steps:
(1) setting information bottleneck limit on the state observed by the intelligent agent by using an encoder, and encoding an original state s observed by the intelligent agent by using the encoder to obtain a mapped state z;
(2) inputting the mapped state z of the original state into the intelligent agent, and generating an action by the intelligent agent according to the current strategy;
(3) interacting the action generated by the agent in the step (2) with the environment to obtain the next state;
(4) training an intelligent agent strategy according to the interactive result in the step (3);
(5) repeating the steps (1) to (4) until the overall return converges.
2. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 1, wherein: the encoder in the step (1) uses mutual information as an index to limit the information flow and filter adversarial information, and the calculation formula of the mutual information is:
MI(X; Y) = D_KL[ p(x, y) || p(x)p(y) ] = E_{p(x,y)}[ log( p(x, y) / ( p(x)p(y) ) ) ]
where X and Y represent the corresponding variables, p(x, y) is the joint distribution, p(x) and p(y) are the marginal densities, MI(X; Y) is the calculated mutual information value and represents the correlation between the variables X and Y, D_KL denotes the KL divergence, and E_{p(x,y)} denotes the expectation of the subsequent expression over the joint distribution p(x, y).
3. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 2, wherein: in the step (2), the state z is input into a Q-value function Q(z, a); with a certain probability ε a random action a is selected, and otherwise
a = argmax_a Q(z, a)
that is, an ε-greedy strategy is used to select the corresponding actions so as to balance exploration and exploitation.
4. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 3, wherein: in the step (3), the intelligent agent selects an action according to the ε-greedy strategy and interacts with the environment to obtain a return r and a next state s'; the state s' is input into the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in an experience pool.
5. The information bottleneck-based robustness enhancement method for the deep reinforcement learning model according to claim 4, wherein the specific steps of training the intelligent agent strategy in the step (4) are as follows:
(4.1) calculating a target y according to the tuples in the experience pool;
(4.2) calculating a loss function;
(4.3) minimizing the loss function using a stochastic gradient descent algorithm for updating the parameter values of the encoder and the Q-value function.
6. The information bottleneck-based deep reinforcement learning model robustness enhancing method according to claim 5, wherein the calculation formula of the target y in the step (4.1) is as follows:
y = r + γ max_{a'} Q̂(z', a')
where γ is a discount factor, Q̂ is the target network, which inherits the weight values of the main network at regular intervals of training rounds, and r represents the return obtained after the agent takes a certain action at each time step.
7. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 6, wherein the loss function in the step (4.2) is:
L = ( Q(z, a) - y )^2 + β E_S[ D_KL[ p(Z|S) || q(Z) ] ]
where β is a Lagrange multiplier, E_S denotes the expectation of the subsequent expression over the state space S, p(Z|S) denotes the probability of outputting Z given the known state S, and q(Z) is the approximate distribution used in place of p(Z).
8. The method as claimed in claim 7, wherein the beta value is gradually increased from 0 when the agent strategy is trained.
CN202110652107.4A 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck Pending CN113283597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652107.4A CN113283597A (en) 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110652107.4A CN113283597A (en) 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck

Publications (1)

Publication Number Publication Date
CN113283597A true CN113283597A (en) 2021-08-20

Family

ID=77284341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652107.4A Pending CN113283597A (en) 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck

Country Status (1)

Country Link
CN (1) CN113283597A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111514585A (en) * 2020-03-17 2020-08-11 清华大学 Method and system for controlling agent, computer device, and storage medium
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167136A (en) * 2022-07-21 2022-10-11 中国人民解放军国防科技大学 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck

Similar Documents

Publication Publication Date Title
Thiesson et al. Learning mixtures of DAG models
WO2020048389A1 (en) Method for compressing neural network model, device, and computer apparatus
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN112884130A (en) SeqGAN-based deep reinforcement learning data enhanced defense method and device
CN113505855B (en) Training method for challenge model
CN110858805A (en) Method and device for predicting network traffic of cell
CN113033822A (en) Antagonistic attack and defense method and system based on prediction correction and random step length optimization
CN112766496A (en) Deep learning model security guarantee compression method and device based on reinforcement learning
CN110033089A (en) Deep neural network parameter optimization method and system based on Distributed fusion algorithm
CN112580728A (en) Dynamic link prediction model robustness enhancing method based on reinforcement learning
CN113283597A (en) Deep reinforcement learning model robustness enhancing method based on information bottleneck
CN111832817A (en) Small world echo state network time sequence prediction method based on MCP penalty function
KR102209917B1 (en) Data processing apparatus and method for deep reinforcement learning
CN113761395A (en) Trajectory generation model training method, trajectory generation method and apparatus
CN114780879A (en) Interpretable link prediction method for knowledge hypergraph
CN112884148A (en) Hybrid reinforcement learning training method and device embedded with multi-step rules and storage medium
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
CN113836910A (en) Text recognition method and system based on multilevel semantics
CN116910481B (en) Ship task system loading bullet quantity optimization method based on genetic algorithm
CN111210009A (en) Information entropy-based multi-model adaptive deep neural network filter grafting method, device and system and storage medium
CN113313236B (en) Deep reinforcement learning model poisoning detection method and device based on time sequence neural pathway
KR20230126793A (en) Correlation recurrent unit for improving the predictive performance of time series data and correlation recurrent neural network
CN117523060B (en) Image quality processing method, device, equipment and storage medium for metauniverse digital person
CN113239160B (en) Question generation method and device and storage medium
Allday et al. Auto-perceptive reinforcement learning (APRIL)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination