CN113283597A - Deep reinforcement learning model robustness enhancing method based on information bottleneck - Google Patents

Deep reinforcement learning model robustness enhancing method based on information bottleneck Download PDF

Info

Publication number
CN113283597A
CN113283597A (application CN202110652107.4A)
Authority
CN
China
Prior art keywords
state
reinforcement learning
information
deep reinforcement
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110652107.4A
Other languages
Chinese (zh)
Inventor
陈晋音
王珏
章燕
王雪柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110652107.4A priority Critical patent/CN113283597A/en
Publication of CN113283597A publication Critical patent/CN113283597A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck. The method limits the state information in deep reinforcement learning by setting an information bottleneck: an encoder encodes the state information in the transition tuple; the state observed in the environment is encoded and input into the policy network, the agent interacts with the environment according to the action of the policy network to obtain the state of the next round, that state is encoded in turn, and the interaction with the environment continues so as to train the policy network. With the robustness enhancement method disclosed by the invention, the trained policy still performs well on the original task and can resist the influence of adversarial attacks; the scale coefficient of the regularization term is set using an annealing schedule, which yields a stable training process, and the trained policy retains excellent performance on the normal task.

Description

Deep reinforcement learning model robustness enhancing method based on information bottleneck
Technical Field
The invention relates to the field of enhancing robustness in deep reinforcement learning; in particular to a method for enhancing robustness of a deep reinforcement learning model based on information bottleneck.
Background
With the rapid development of artificial intelligence, deep reinforcement learning algorithms, which combine the perception capability of deep learning with the decision-making capability of reinforcement learning, are widely applied in automatic driving, machine translation, dialogue systems, video detection and other fields.
However, deep reinforcement learning is susceptible to adversarial attacks: noise that is imperceptible to the human eye is added to the original sample; such noise does not affect human recognition, yet it can drive the trained strategy to take an action that is extremely adverse to the outcome, causing the whole decision-making process to fail.
There is therefore a need to enhance the robustness of deep reinforcement learning models against attacks.
An existing robustness enhancing method for deep reinforcement learning models is, for example, the SeqGAN-based enhanced defense method and device for deep reinforcement learning data disclosed in Chinese patent application CN112884130A, which comprises the following steps: building an automatic-driving simulation environment for the deep reinforcement learning agent, building a target agent based on a deep Q network, and performing reinforcement learning on the target agent to optimize the parameters of the deep Q network; using the parameter-optimized deep Q network to generate a state-action pair sequence of the target agent driving over T moments as expert data, wherein the action value in a state-action pair corresponds to the action with the minimum Q value; training a SeqGAN containing a generator and a discriminator by a reinforcement learning method, taking the state-action pairs in the expert data as the input of the generator to generate state-action pairs, simultaneously simulating sampling by a policy-gradient-based Monte Carlo search, forming a fixed-length state-action pair sequence from the sampled state-action pairs and the pairs generated by the generator, inputting that sequence into the discriminator, calculating a reward value, and updating the network parameters of the SeqGAN according to the reward value; inputting the current state into the parameter-optimized generator of the SeqGAN to obtain a generated state-action pair sequence, calculating the accumulated reward value of the generated sequence with the parameter-optimized deep Q network, comparing it with the accumulated reward value obtained by the deep Q network policy of the target agent, and storing the state-action pairs with the higher accumulated reward value as enhancement data for re-optimizing the deep Q network; and selecting enhancement data from storage to re-optimize the parameters of the deep Q network, so as to realize the enhanced defense of the deep reinforcement learning data.
However, research shows that the information bottleneck not only filters out useless information that is irrelevant to the task, but can also improve the generalization ability of adversarial inverse reinforcement learning; moreover, as an external processing module, the information bottleneck can be readily combined with various deep reinforcement learning algorithms. Therefore, how to set an information bottleneck to resist adversarial attacks is of important theoretical and practical significance for the application of deep reinforcement learning models.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck.
A deep reinforcement learning model robustness enhancing method based on information bottleneck comprises the following steps:
(1) setting information bottleneck limit on the state observed by the intelligent agent by using a proper encoder, and encoding an original state s observed by the intelligent agent by using the encoder to obtain a mapped state z;
(2) inputting the mapped state z of the original state into the intelligent agent, and generating an action by the intelligent agent according to the current strategy;
(3) interacting the action generated by the agent in the step (2) with the environment to obtain the next state;
(4) training an intelligent agent strategy according to the interactive result in the step (3);
(5) repeating the steps (1) to (4) until the overall return converges.
The encoder in the step (1) uses mutual information as an index to limit the information flow and filter adversarial information; the calculation formula of the mutual information is:
MI(X; Y) = D_KL[ p(x, y) || p(x)p(y) ] = E_{p(x,y)}[ log( p(x, y) / ( p(x)p(y) ) ) ]
where X and Y represent the corresponding variables, p(x, y) is the joint distribution, p(x) and p(y) are the marginal densities, MI(X; Y) is the calculated mutual information value and represents the correlation between the variables X and Y, D_KL denotes the KL divergence, and E_{p(x,y)} denotes the expectation of the subsequent expression over the joint distribution p(x, y).
The mutual information between the input and the output of the encoder is defined as MI(S; Z), and its value is constrained to remain below a certain level. The mutual information cannot be calculated directly, so it is estimated by sampling:
MI(Z; S) = D_KL[ p(Z, S) || p(Z)p(S) ] = E_S[ D_KL[ p(Z|S) || p(Z) ] ]
however, it is not reasonable that p (Z) needs to be calculated for the entire state space S, and q (Z) -N (0,1) is used instead of p (Z) in an approximate method.
E_S[ D_KL[ p(Z|S) || q(Z) ] ] ≥ MI(Z; S)
That is, an upper bound of the mutual information is used instead, since E_S[ D_KL[ p(Z|S) || q(Z) ] ] = MI(Z; S) + D_KL[ p(Z) || q(Z) ] ≥ MI(Z; S). Because a normal distribution is used for the approximation, the encoder only needs a neural network to estimate a mean and a variance; a distribution is constructed from the obtained mean and variance and sampled to obtain the encoded state z.
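As a concrete illustration of the Gaussian encoder described above, the following is a minimal sketch written for this description rather than taken from the patent; it assumes a PyTorch implementation, and the class name StateEncoder, the hidden-layer size and the use of the reparameterization trick are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Gaussian information-bottleneck encoder: maps an observed state s
    to a mean and log-variance, then samples the encoded state z."""

    def __init__(self, state_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, z_dim)       # mean of p(Z|S)
        self.logvar_head = nn.Linear(hidden, z_dim)   # log-variance of p(Z|S)

    def forward(self, s: torch.Tensor):
        h = self.backbone(s)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)          # reparameterized sample of z
        # Closed-form KL between N(mu, sigma^2) and q(Z) = N(0, 1); averaged over the
        # batch it estimates the regularizer E_S[D_KL[p(Z|S) || q(Z)]] derived above.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()
        return z, kl
```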
In the step (2), the state z is input into a Q-value function Q(z, a); with a certain probability ε a random action a is selected, and otherwise
a = argmax_a Q(z, a)
That is, an ε-greedy strategy is used to select the corresponding actions so as to balance exploration and exploitation.
The value of ε is annealed from 1 (completely random actions) to a smaller value such as 0.02 or 0.05; that is, the environment is explored as much as possible at the beginning, while the learned strategy is mostly followed towards the end of training.
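A sketch of the ε-greedy selection with a linearly annealed ε, continuing the PyTorch setting assumed above; the function names, the decay horizon and the final value of ε are illustrative assumptions, not taken from the patent.

```python
import random
import torch

def epsilon_by_step(step: int, eps_start: float = 1.0,
                    eps_end: float = 0.05, decay_steps: int = 100_000) -> float:
    """Linear annealing of epsilon from eps_start down to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_net, z: torch.Tensor, n_actions: int, eps: float) -> int:
    """Epsilon-greedy action selection on the encoded state z."""
    if random.random() < eps:
        return random.randrange(n_actions)       # explore: random action
    with torch.no_grad():
        return int(q_net(z).argmax(dim=-1))      # exploit: a = argmax_a Q(z, a)
```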
In the step (3), the intelligent agent selects an action according to the ε-greedy strategy and interacts with the environment to obtain a return r and a next state s'; the state s' is input into the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in an experience pool.
The experience pool mainly overcomes the problem that the samples used for updating are not independently and identically distributed: samples generated by the policies of adjacent time steps are strongly correlated, so a large number of transition tuples are stored in the experience pool and are randomly drawn during training; the drawn samples can then be regarded as approximately independent and identically distributed, which allows the Q-value network to be trained to good effect.
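A minimal sketch of such an experience pool; this is an editorial illustration rather than the patent's implementation: the capacity and batch size are arbitrary, and a done flag is added to the stored tuple to handle the episode-end target y = r described below.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool of transition tuples (s, a, r, s_next, done)."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, so batches are approximately i.i.d.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))   # (s_batch, a_batch, r_batch, s_next_batch, done_batch)

    def __len__(self):
        return len(self.buffer)
```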
In the step (4), the intelligent agent strategy is trained with a deep Q network; the specific steps are as follows:
(4.1) calculating a target y according to the tuples in the experience pool;
(4.2) calculating a loss function;
(4.3) minimizing the loss function using a stochastic gradient descent algorithm for updating the parameter values of the encoder and the Q-value function.
The calculation formula of the target y in step (4.1) is as follows:
y = r + γ max_{a'} Q̂(z', a')
where γ is a discount factor, Q̂ is the target network, which inherits the weight values of the main network every fixed number of training rounds, and r represents the return obtained after the agent takes a certain action at each time step.
γ is typically set to 0.99; if the current episode has just ended, then:
y=r
the loss function in step (4.2) is:
L = ( Q(z, a) - y )^2 + β E_S[ D_KL[ p(Z|S) || q(Z) ] ]
where β is a Lagrange multiplier, E_S denotes the expectation of the subsequent expression over the state space S, p(Z|S) denotes the probability of outputting Z given the known state S, and q(Z) is the approximate distribution used in place of p(Z).
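Under the same PyTorch assumptions as above, the target y and the regularized loss could be computed as in the sketch below. Note one editorial liberty: the sketch stores raw states and re-encodes them at update time so that gradients reach the encoder parameters, whereas the text above stores the already-encoded tuple (z, a, r, z'); this choice, like the tensor shapes, is an assumption and not a detail given by the patent.

```python
import torch
import torch.nn.functional as F

def compute_loss(encoder, q_net, target_q_net, batch, beta: float, gamma: float = 0.99):
    """TD loss (Q(z, a) - y)^2 plus the bottleneck regularizer beta * E_S[D_KL[p(Z|S) || q(Z)]]."""
    s, a, r, s_next, done = batch                  # one batch drawn from the experience pool; a is a LongTensor
    z, kl = encoder(s)                             # kl estimates E_S[D_KL[p(Z|S) || q(Z)]] on this batch
    q_za = q_net(z).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(z, a)
    with torch.no_grad():
        z_next, _ = encoder(s_next)
        max_next_q = target_q_net(z_next).max(dim=1).values    # max_a' Q_hat(z', a')
        y = r + gamma * (1.0 - done) * max_next_q              # y = r when the episode has ended
    return F.mse_loss(q_za, y) + beta * kl
```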
When the intelligent agent strategy is trained, the β value is gradually increased from 0.
In concrete training, the β value is first set to 0 and the encoder network parameters are fixed, so that the weights of the Q-value network are trained first; once the strategy performs well, the β value is gradually increased so that the information bottleneck can filter out adversarial information. Training continues until the total return value R converges, where R is the sum of the reward values of all steps in one episode.
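One illustrative way to realize the annealing of β described above; the warm-up length, ramp length and maximum β are assumed values that the patent does not specify.

```python
def beta_schedule(step: int, warmup_steps: int = 50_000,
                  ramp_steps: int = 100_000, beta_max: float = 1e-3) -> float:
    """Hold beta at 0 while the Q-value network warms up, then increase it linearly."""
    if step < warmup_steps:
        return 0.0
    frac = min((step - warmup_steps) / ramp_steps, 1.0)
    return frac * beta_max
```

In the patent the switch point is tied to the policy performing well rather than to a fixed step count; a fixed warm-up is used here only to keep the sketch self-contained.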
Compared with the prior art, the invention has the advantages that:
1. The information bottleneck is used to extract the part of the state information that plays a decisive role in the task, and the disturbance added to the original state by an adversarial attack is filtered out when the encoder encodes the state, so that the trained strategy still performs well on the original task and can resist the influence of adversarial attacks.
2. The scale coefficient of the regularization term is set using an annealing schedule, which yields a stable training process, and the trained strategy retains excellent performance on the normal task.
Drawings
FIG. 1 is a flowchart illustrating the overall steps of the present invention;
FIG. 2 is a schematic diagram of an encoder structure;
fig. 3 is a schematic diagram of a deep Q network structure.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
The robustness enhancement method of the deep reinforcement learning model based on the information bottleneck limits the state information in deep reinforcement learning by setting an information bottleneck and encodes the state information in the transition tuple through an encoder. Firstly, the state observed in the environment is encoded and the encoded state is input into the policy network; the agent then interacts with the environment according to the action of the policy network to obtain the state of the next round, that state is encoded in turn, and the interaction with the environment continues so as to train the policy network. The encoder is trained by adding a regularization term to the loss term of the original deep reinforcement learning algorithm. Because the encoder and the policy network influence each other, the annealing idea is adopted: the encoder is fixed at first and no constraint is placed on the loss term, and once the policy has been trained well the scale coefficient is gradually increased.
Fig. 1 is a flowchart of a method for enhancing robustness of a deep reinforcement learning model based on an information bottleneck according to this embodiment. The robustness enhancing method of the deep reinforcement learning model based on the information bottleneck can be used in the field of automatic driving, and the deep reinforcement learning model outputs decision actions according to the acquired environment state so as to guide automatic driving. As shown in FIG. 1, the method for enhancing robustness of the deep reinforcement learning model comprises the following steps:
setting information bottleneck limit on the state observed by the intelligent agent by using a proper encoder, and encoding an original state s observed by the intelligent agent by using the encoder to obtain a mapped state z;
inputting the mapped state z of the original state into the intelligent agent, and generating an action by the intelligent agent according to the current strategy;
interacting the action generated by the intelligent agent with the environment to obtain the next state;
training an agent strategy according to the interactive result;
judging whether the total return value R has converged: if so, the strategy training is finished; if not, the above steps are repeated until the total return value R converges.
Fig. 2 is a schematic structural diagram of the encoder provided in this embodiment. As shown in Fig. 2, the original state s observed by the agent is input into a neural network; the network computes a mean and a variance for s, constructs a normal distribution from them, and samples from this distribution to obtain the mapped state z.
The specific calculation process is as follows:
the encoder limits information flow and filters countermeasure information by using mutual information as an index, and the calculation formula of the mutual information is as follows:
Figure BDA0003112001500000051
wherein X and Y respectively represent corresponding variables, p (X, Y) is joint distribution, p (X) and p (Y) are edge density, MI (X; Y) represents calculated mutual information value, and represents correlation between variables X and Y, and DKLIndicating KL divergence, Ep(x,y)Representing the expectation of the subsequent expression on the joint distribution p (x, y).
The mutual information between the input and the output of the encoder is defined as MI(S; Z), and its value is constrained to remain below a certain level. The mutual information cannot be calculated directly, so it is estimated by sampling:
MI(Z; S) = D_KL[ p(Z, S) || p(Z)p(S) ] = E_S[ D_KL[ p(Z|S) || p(Z) ] ]
However, computing p(Z) would require integrating over the entire state space S, which is not practical, so the approximation q(Z) ~ N(0, 1) is used in place of p(Z):
E_S[ D_KL[ p(Z|S) || q(Z) ] ] ≥ MI(Z; S)
That is, an upper bound of the mutual information is used instead. Because a normal distribution is used for the approximation, the encoder only needs a neural network to estimate a mean and a variance; a distribution is constructed from the obtained mean and variance and sampled to obtain the encoded state z.
Fig. 3 is a schematic diagram of the deep Q network structure. As shown in Fig. 3, the agent interacts with the environment according to the action selected by the ε-greedy policy and thereby obtains a return r and a next state s'; the state s' is input to the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in the experience pool, which is mainly used to overcome the problem that the samples used for updating are not independently and identically distributed.
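Putting the pieces together, the interaction of Fig. 1 and Fig. 3 could be sketched roughly as below, reusing the illustrative StateEncoder, ReplayBuffer, select_action, epsilon_by_step, beta_schedule and compute_loss sketches above. A classic Gym-style environment API (env.reset() returning a state, env.step(a) returning (s', r, done, info)), an optimizer holding the parameters of both the encoder and the Q network, and the update intervals are all assumptions, not details given by the patent.

```python
def train(env, encoder, q_net, target_q_net, optimizer, total_steps: int = 200_000):
    """Encode the state, act epsilon-greedily, store the transition, update encoder and Q network."""
    buffer = ReplayBuffer()
    s = env.reset()
    for step in range(total_steps):               # in the patent, training runs until the return R converges
        z, _ = encoder(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
        a = select_action(q_net, z, env.action_space.n, epsilon_by_step(step))
        s_next, r, done, _ = env.step(a)          # interact with the environment
        buffer.push(s, a, r, s_next, float(done))
        if len(buffer) >= 1_000:                  # start updating once enough transitions are stored
            s_b, a_b, r_b, sn_b, d_b = (torch.as_tensor(x, dtype=torch.float32)
                                        for x in buffer.sample())
            loss = compute_loss(encoder, q_net, target_q_net,
                                (s_b, a_b.long(), r_b, sn_b, d_b), beta_schedule(step))
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        if step % 1_000 == 0:                     # target network inherits the main network's weights
            target_q_net.load_state_dict(q_net.state_dict())
        s = env.reset() if done else s_next
```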
The deep reinforcement learning model obtained by the robustness enhancement method based on the information bottleneck has strong robustness, and when the method is applied to the field of automatic driving, a decision-making action can be accurately given according to an environmental state.

Claims (8)

1. A deep reinforcement learning model robustness enhancing method based on information bottleneck is characterized by comprising the following steps:
(1) setting information bottleneck limit on the state observed by the intelligent agent by using an encoder, and encoding an original state s observed by the intelligent agent by using the encoder to obtain a mapped state z;
(2) inputting the mapped state z of the original state into the intelligent agent, and generating an action by the intelligent agent according to the current strategy;
(3) interacting the action generated by the agent in the step (2) with the environment to obtain the next state;
(4) training an intelligent agent strategy according to the interactive result in the step (3);
(5) repeating the steps (1) to (4) until the overall return converges.
2. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 1, wherein: the encoder in the step (1) uses mutual information as an index to limit the information flow and filter adversarial information, and the calculation formula of the mutual information is:
MI(X; Y) = D_KL[ p(x, y) || p(x)p(y) ] = E_{p(x,y)}[ log( p(x, y) / ( p(x)p(y) ) ) ]
where X and Y represent the corresponding variables, p(x, y) is the joint distribution, p(x) and p(y) are the marginal densities, MI(X; Y) is the calculated mutual information value and represents the correlation between the variables X and Y, D_KL denotes the KL divergence, and E_{p(x,y)} denotes the expectation of the subsequent expression over the joint distribution p(x, y).
3. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 2, wherein: in the step (2), the state z is input into a Q-value function Q(z, a); with a certain probability ε a random action a is selected, and otherwise
a = argmax_a Q(z, a)
that is, an ε-greedy strategy is used to select the corresponding actions so as to balance exploration and exploitation.
4. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 3, wherein: in the step (3), the intelligent agent selects an action according to the ε-greedy strategy and interacts with the environment to obtain a return r and a next state s'; the state s' is input into the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in an experience pool.
5. The information bottleneck-based robustness enhancement method for the deep reinforcement learning model according to claim 4, wherein the specific steps of training the intelligent agent strategy in the step (4) are as follows:
(4.1) calculating a target y according to the tuples in the experience pool;
(4.2) calculating a loss function;
(4.3) minimizing the loss function using a stochastic gradient descent algorithm for updating the parameter values of the encoder and the Q-value function.
6. The information bottleneck-based deep reinforcement learning model robustness enhancing method according to claim 5, wherein the calculation formula of the target y in the step (4.1) is as follows:
y = r + γ max_{a'} Q̂(z', a')
where γ is a discount factor, Q̂ is the target network, which inherits the weight values of the main network at regular intervals of training rounds, and r represents the return obtained after the agent takes a certain action at each time step.
7. The information bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 6, wherein the loss function in the step (4.2) is:
L = ( Q(z, a) - y )^2 + β E_S[ D_KL[ p(Z|S) || q(Z) ] ]
where β is a Lagrange multiplier, E_S denotes the expectation of the subsequent expression over the state space S, p(Z|S) denotes the probability of outputting Z given the known state S, and q(Z) is the approximate distribution used in place of p(Z).
8. The method as claimed in claim 7, wherein the beta value is gradually increased from 0 when the agent strategy is trained.
CN202110652107.4A 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck Pending CN113283597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652107.4A CN113283597A (en) 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110652107.4A CN113283597A (en) 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck

Publications (1)

Publication Number Publication Date
CN113283597A true CN113283597A (en) 2021-08-20

Family

ID=77284341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652107.4A Pending CN113283597A (en) 2021-06-11 2021-06-11 Deep reinforcement learning model robustness enhancing method based on information bottleneck

Country Status (1)

Country Link
CN (1) CN113283597A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111514585A (en) * 2020-03-17 2020-08-11 清华大学 Method and system for controlling agent, computer device, and storage medium
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167136A (en) * 2022-07-21 2022-10-11 中国人民解放军国防科技大学 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck

Similar Documents

Publication Publication Date Title
Thiesson et al. Learning mixtures of DAG models
WO2020048389A1 (en) Method for compressing neural network model, device, and computer apparatus
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN112884130A (en) SeqGAN-based deep reinforcement learning data enhanced defense method and device
CN113505855B (en) Training method for challenge model
CN110858805A (en) Method and device for predicting network traffic of cell
CN113033822A (en) Antagonistic attack and defense method and system based on prediction correction and random step length optimization
CN112766496A (en) Deep learning model security guarantee compression method and device based on reinforcement learning
CN110033089A (en) Deep neural network parameter optimization method and system based on Distributed fusion algorithm
CN112580728A (en) Dynamic link prediction model robustness enhancing method based on reinforcement learning
CN113283597A (en) Deep reinforcement learning model robustness enhancing method based on information bottleneck
CN111832817A (en) Small world echo state network time sequence prediction method based on MCP penalty function
KR102209917B1 (en) Data processing apparatus and method for deep reinforcement learning
CN113761395A (en) Trajectory generation model training method, trajectory generation method and apparatus
CN114780879A (en) Interpretable link prediction method for knowledge hypergraph
CN112884148A (en) Hybrid reinforcement learning training method and device embedded with multi-step rules and storage medium
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
CN113836910A (en) Text recognition method and system based on multilevel semantics
CN116910481B (en) Ship task system loading bullet quantity optimization method based on genetic algorithm
CN111210009A (en) Information entropy-based multi-model adaptive deep neural network filter grafting method, device and system and storage medium
CN113313236B (en) Deep reinforcement learning model poisoning detection method and device based on time sequence neural pathway
KR20230126793A (en) Correlation recurrent unit for improving the predictive performance of time series data and correlation recurrent neural network
CN117523060B (en) Image quality processing method, device, equipment and storage medium for metauniverse digital person
CN113239160B (en) Question generation method and device and storage medium
Allday et al. Auto-perceptive reinforcement learning (APRIL)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination