CN113283597A - Deep reinforcement learning model robustness enhancing method based on information bottleneck - Google Patents
- Publication number
- CN113283597A (application CN202110652107.4A)
- Authority
- CN
- China
- Prior art keywords
- state
- reinforcement learning
- information
- deep reinforcement
- learning model
- Prior art date
- Legal status: Pending (an assumption, not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—Computing arrangements based on specific computational models
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/08—Learning methods
Abstract
The invention discloses a method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck. The method limits the state information in deep reinforcement learning by imposing an information bottleneck: an encoder encodes the state information in the transition tuples. The state observed in the environment is encoded and input into the policy network, the policy's action is executed in the environment to obtain the state of the next round, that state is encoded in turn, and this continuous interaction with the environment trains the policy network. The resulting policy still performs well on the original task while resisting the influence of adversarial attacks. The scaling coefficient of the regularization term is set with an annealing schedule, which stabilizes training and preserves the trained policy's excellent performance on normal tasks.
Description
Technical Field
The invention relates to the field of robustness enhancement in deep reinforcement learning, and in particular to a method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck.
Background
With the rapid development of artificial intelligence, deep reinforcement learning algorithms, which combine the perception capability of deep learning with the decision-making capability of reinforcement learning, are widely applied to automatic driving, machine translation, dialogue systems, video detection, and other areas.
However, deep reinforcement learning is susceptible to adversarial attacks: noise imperceptible to the human eye is added to the original sample. Such noise does not affect human recognition, yet it can cause the trained policy to take an action with an extremely adverse result, making the whole decision-making process fail.
There is therefore a need to enhance the robustness of deep reinforcement learning models against attacks.
An existing robustness enhancement method for deep reinforcement learning models is the SeqGAN-based data-enhanced defense method and device disclosed in Chinese patent application CN112884130A, which comprises the following steps. An automatic-driving simulation environment for the deep reinforcement learning agent is built, a target agent based on a deep Q-network is constructed, and reinforcement learning is performed on the target agent to optimize the parameters of the deep Q-network. The parameter-optimized deep Q-network then generates a sequence of state-action pairs over T time steps of the target agent's driving as expert data, where the action in each state-action pair corresponds to the action with the minimum Q-value. A SeqGAN comprising a generator and a discriminator is trained with reinforcement learning: state-action pairs from the expert data are taken as the generator's input to generate state-action pairs, sampling is simulated with policy-gradient-based Monte Carlo search, the sampled pairs and the generator's pairs form fixed-length state-action sequences that are input to the discriminator to compute a reward value, and the network parameters of the SeqGAN are updated according to that reward value. Finally, the current state is input to the parameter-optimized generator to obtain a generated state-action sequence; its cumulative reward under the parameter-optimized deep Q-network is compared with the cumulative reward obtained by the target agent's deep Q-network policy, and the state-action pairs with the higher cumulative reward are stored as enhancement data. The stored enhancement data are then used to re-optimize the parameters of the deep Q-network, realizing the enhanced defense of the deep reinforcement learning data.
However, research shows that the information bottleneck not only filters out useless information irrelevant to the task, but can also improve the generalization ability of adversarial inverse reinforcement learning; moreover, as a plug-in module it combines well with a variety of deep reinforcement learning algorithms. How to set up an information bottleneck to resist adversarial attacks therefore has important theoretical and practical significance for applying deep reinforcement learning models.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck.
A method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck comprises the following steps:
(1) impose an information bottleneck on the state observed by the agent by using a suitable encoder, and encode the original state s observed by the agent to obtain a mapped state z;
(2) input the mapped state z into the agent, which generates an action according to its current policy;
(3) execute the action generated in step (2) in the environment to obtain the next state;
(4) train the agent's policy according to the interaction result of step (3);
(5) repeat steps (1) to (4) until the total return converges.
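The five-step loop above can be sketched in code. This is an illustrative skeleton under stated assumptions: `encode()`, `select_action()`, and `step_env()` are hypothetical stand-ins for the patent's encoder, policy network, and environment, not its actual implementation.

```python
# Hypothetical minimal skeleton of steps (1)-(5); the functions below are
# illustrative stand-ins, not the patent's actual networks or environment.

def encode(s):
    # Step (1): placeholder encoder mapping the raw state s to a bottlenecked z.
    return s * 0.5

def select_action(z):
    # Step (2): placeholder deterministic policy.
    return 1 if z > 0 else 0

def step_env(s, a):
    # Step (3): toy environment dynamics and reward.
    s_next = s + (1 if a == 1 else -1)
    r = -abs(s_next)
    return s_next, r

def run_episode(s0, horizon=10):
    # Steps (1)-(4) repeated; step (5) would check convergence of this return.
    s, total = s0, 0.0
    for _ in range(horizon):
        z = encode(s)
        a = select_action(z)
        s, r = step_env(s, a)
        total += r
    return total
```

In a real implementation the encoder and policy would be neural networks trained as described in the following sections; the skeleton only shows how the encoded state, rather than the raw state, drives action selection.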
The encoder in step (1) uses mutual information as the criterion to limit the information flow and filter out adversarial information. The mutual information is computed as:

MI(X; Y) = D_KL[p(x, y) ‖ p(x)p(y)] = E_{p(x,y)}[log(p(x, y) / (p(x)p(y)))]

where X and Y are the variables concerned, p(x, y) is their joint distribution, p(x) and p(y) are the marginal densities, MI(X; Y) is the computed mutual information value measuring the correlation between X and Y, D_KL denotes the KL divergence, and E_{p(x,y)} denotes the expectation of the bracketed expression over the joint distribution p(x, y).
The mutual information between the encoder's input and output is denoted MI(S; Z), and its value is constrained below a certain threshold. Since the mutual information cannot be computed directly, it is estimated by sampling:

MI(Z; S) = D_KL[p(Z, S) ‖ p(Z)p(S)] = E_S[D_KL[p(Z|S) ‖ p(Z)]]

However, computing p(Z) would require marginalizing over the entire state space S, which is impractical; as an approximation, q(Z) ~ N(0, 1) is used in place of p(Z):

E_S[D_KL[p(Z|S) ‖ q(Z)]] ≥ MI(Z; S)

That is, an upper bound on the mutual information is used instead. Because a normal distribution is used for the approximation, the encoder only needs a neural network to estimate a mean and a variance; a distribution is constructed from them and sampled to obtain the encoded state z.
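A minimal numerical sketch of such a variational encoder follows, assuming a single linear layer produces the mean and log-variance of p(Z|S); the weights here are random placeholders, not trained parameters, and the closed-form Gaussian KL term is the quantity later used as the regularizer.

```python
import numpy as np

# Illustrative variational encoder sketch: a linear layer (untrained, random
# weights) outputs the mean and log-variance of p(Z|S), from which z is
# sampled with the reparameterization trick.

rng = np.random.default_rng(0)

class BottleneckEncoder:
    def __init__(self, state_dim, z_dim):
        self.W_mu = rng.normal(scale=0.1, size=(state_dim, z_dim))
        self.W_logvar = rng.normal(scale=0.1, size=(state_dim, z_dim))

    def __call__(self, s):
        mu = s @ self.W_mu                  # mean of p(Z|S)
        logvar = s @ self.W_logvar          # log-variance of p(Z|S)
        eps = rng.normal(size=mu.shape)     # reparameterization trick
        z = mu + np.exp(0.5 * logvar) * eps
        # Closed-form KL[p(Z|S) || N(0, I)], the information-bottleneck term
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        return z, kl

enc = BottleneckEncoder(state_dim=4, z_dim=2)
z, kl = enc(np.ones(4))
```

Sampling through the reparameterization trick keeps the encoder differentiable, so the KL term can later be minimized by gradient descent together with the Q-network loss.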
In step (2), the encoded state z is input into the Q-value function Q(s, a). With a certain probability ε an action a is selected at random; otherwise

a = argmax_a Q(s, a)

That is, an ε-greedy strategy selects the action, balancing exploration and exploitation. ε is annealed from 1 (completely random actions) down to a small value such as 0.02 or 0.05, so that the agent explores the environment as much as possible at the beginning and follows the learned policy toward the end of training.
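The ε-greedy rule with a linear annealing schedule can be sketched as follows; the Q-value array stands in for the Q-network's outputs, and the schedule endpoints (1.0 down to 0.05) match the range described above, while the schedule shape (linear) is an illustrative assumption.

```python
import numpy as np

# Epsilon-greedy action selection with linear annealing from eps_start to
# eps_end over total_steps; the q_values array stands in for Q(z, .).

def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.05):
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: a = argmax_a Q

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3])
```

Early in training `epsilon_at` returns values near 1, so almost every action is random; near the end it returns the floor value and the agent follows the learned greedy policy.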
In step (3), the agent selects an action according to the ε-greedy strategy and executes it in the environment, obtaining a reward r and the next state s'; the state s' is input to the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in the experience pool.
The experience pool mainly overcomes the problem that the samples used for updates are not independent and identically distributed: samples generated by the policies of adjacent time steps are strongly correlated. By storing a large number of transition tuples in the experience pool and drawing them at random during training, the samples can be regarded as approximately independent and identically distributed, and the Q-value network trains well.
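A minimal experience-pool sketch: transition tuples (z, a, r, z') are appended, old tuples are evicted once capacity is reached, and training draws uniform random minibatches to break the correlation between consecutive time steps. The capacity and tuple contents below are illustrative.

```python
import random
from collections import deque

# Minimal replay buffer ("experience pool"): stores transition tuples and
# samples uniform random minibatches for approximately i.i.d. updates.

class ReplayBuffer:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest tuples evicted first

    def push(self, z, a, r, z_next):
        self.buf.append((z, a, r, z_next))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

pool = ReplayBuffer(capacity=3)
for i in range(5):
    pool.push(i, 0, 0.0, i + 1)  # tuples 0 and 1 are evicted at capacity 3
```

Uniform sampling over a large buffer is what makes the minibatches approximately independent of the current policy's recent trajectory.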
In step (4), the agent's policy is trained with a deep Q-network, in the following specific steps:
(4.1) compute a target y from the tuples in the experience pool;
(4.2) compute a loss function;
(4.3) minimize the loss function with a stochastic gradient descent algorithm to update the parameter values of the encoder and the Q-value function.
The target y in step (4.1) is computed as:

y = r + γ max_{a'} Q̂(z', a')

where γ is a discount factor, typically set to 0.99; Q̂ is the target network, which inherits the weights of the main network every fixed number of training rounds; and r is the reward the agent obtains after taking its action at each time step. If an episode has just ended:

y = r
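The target computation can be sketched directly; `q_target_values` stands in for the target network's outputs at the next encoded state z', an assumption made for illustration.

```python
import numpy as np

# TD target: y = r + gamma * max_a' Qhat(z', a'), and y = r when the
# episode has just terminated.

def td_target(r, q_target_values, done, gamma=0.99):
    if done:
        return r                                   # episode just ended
    return r + gamma * float(np.max(q_target_values))

y_mid = td_target(1.0, np.array([0.5, 2.0]), done=False)  # bootstrapped
y_end = td_target(1.0, np.array([0.5, 2.0]), done=True)   # terminal
```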
The loss function in step (4.2) is:

L = (Q(s, a) − y)² + β E_S[D_KL[p(Z|S) ‖ q(Z)]]

where β is a Lagrange multiplier, E_S denotes the expectation of the bracketed expression over the state space S, p(Z|S) is the probability of outputting Z given the state S, and q(Z) is the approximate distribution used in place of p(Z).
When the agent's policy is trained, the value of β is gradually increased from 0.
In concrete training, β is first set to 0 and the encoder's network parameters are fixed, so the weights of the Q-value network are trained first. Once the policy performs well, β is gradually increased so that the information bottleneck filters out adversarial information. Training continues until the total return R converges, where R is the sum of the rewards of every step within one episode.
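The annealed regularized loss can be sketched as below. The warm-up length, ramp length, and maximum β are illustrative assumptions, not values taken from the patent; only the shape of the schedule (β held at 0, then increased gradually) follows the description above.

```python
# Annealed information-bottleneck loss sketch:
# L = (Q(s, a) - y)^2 + beta * KL, with beta = 0 during a warm-up phase
# (fixed encoder, Q-network trained first) and a linear ramp afterwards.
# warmup, ramp, and beta_max are illustrative, not the patent's values.

def beta_schedule(step, warmup=10_000, ramp=10_000, beta_max=1e-3):
    if step < warmup:
        return 0.0                    # fixed-encoder phase: no IB constraint
    return beta_max * min((step - warmup) / ramp, 1.0)

def ib_loss(q_sa, y, kl, beta):
    return (q_sa - y) ** 2 + beta * kl
```

With β = 0 the loss reduces to the ordinary squared TD error, so the early phase is plain deep Q-learning; the ramp then trades a little task performance for the information constraint that filters adversarial perturbations.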
Compared with the prior art, the invention has the following advantages:
1. The information bottleneck extracts the part of the state information that is decisive for the task, and the encoder encodes away the perturbation an adversarial attack adds to the original state, so the trained policy still performs well on the original task and can resist the influence of adversarial attacks.
2. The scaling coefficient of the regularization term is set with an annealing schedule, which stabilizes training and preserves the trained policy's excellent performance on normal tasks.
Drawings
FIG. 1 is a flowchart illustrating the overall steps of the present invention;
FIG. 2 is a schematic diagram of an encoder structure;
FIG. 3 is a schematic diagram of a deep Q-network structure.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
The information-bottleneck-based robustness enhancement method limits the state information in deep reinforcement learning by imposing an information bottleneck and encodes the state information in the transition tuples with an encoder. First, the state observed in the environment is encoded and input into the policy network; the policy's action is executed in the environment to obtain the state of the next round, that state is encoded in turn, and this continuous interaction with the environment trains the policy network. The encoder is trained by adding a regularization term to the loss of the original deep reinforcement learning algorithm. Because the encoder and the policy network influence each other, an annealing schedule is adopted: the encoder is fixed at first and no constraint is placed on the loss, and once the policy is reasonably trained, the scaling coefficient is gradually increased.
Fig. 1 is a flowchart of the method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck according to this embodiment. The method can be used in the field of automatic driving, where the deep reinforcement learning model outputs decision actions according to the acquired environment state to guide the vehicle. As shown in Fig. 1, the method comprises the following steps:
impose an information bottleneck on the state observed by the agent by using a suitable encoder, and encode the original state s observed by the agent to obtain a mapped state z;
input the mapped state z into the agent, which generates an action according to its current policy;
execute the action generated by the agent in the environment to obtain the next state;
train the agent's policy according to the interaction result;
judge whether the total return R has converged: if so, the policy training is finished; if not, repeat the above steps until R converges.
Fig. 2 is a schematic structural diagram of the encoder provided in this embodiment. As shown in Fig. 2, the original state s observed by the agent is input into a neural network, which computes a mean and a variance for s to form a normal distribution; the mapped state z is then obtained by sampling from that distribution.
The specific calculation process is as follows:
the encoder limits information flow and filters countermeasure information by using mutual information as an index, and the calculation formula of the mutual information is as follows:
wherein X and Y respectively represent corresponding variables, p (X, Y) is joint distribution, p (X) and p (Y) are edge density, MI (X; Y) represents calculated mutual information value, and represents correlation between variables X and Y, and DKLIndicating KL divergence, Ep(x,y)Representing the expectation of the subsequent expression on the joint distribution p (x, y).
The mutual information input and output by the encoder is defined as MI (S; Z), the value is limited to be less than a certain degree, the mutual information can not be directly calculated, and the estimation is carried out by using a sampling mode:
MI(Z,S)=DKL[p(Z,S)|p(Z)p(s)]=ES[DKL[p(Z|S)|p(Z)]]
however, it is not reasonable that p (Z) needs to be calculated for the entire state space S, and q (Z) -N (0,1) is used instead of p (Z) in an approximate method.
ES[DKL[p(Z|S)|q(Z)]]≥MI(Z,S)
I.e. an upper bound of mutual information is used instead. Because the normal distribution is used for approximation, the encoder part only needs to use a neural network to estimate the mean and the variance, and the distribution is constructed according to the obtained mean and the variance for sampling to obtain the coded state z.
Fig. 3 is a schematic diagram of the deep Q-network structure. As shown in Fig. 3, the agent executes the action selected by the ε-greedy strategy in the environment, obtaining a reward r and the next state s'; the state s' is input to the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in the experience pool, which mainly overcomes the problem that the samples used for updates are not independent and identically distributed.
The deep reinforcement learning model obtained by this information-bottleneck-based robustness enhancement method has strong robustness; when applied to the field of automatic driving, it can accurately output decision actions according to the environment state.
Claims (8)
1. A method for enhancing the robustness of a deep reinforcement learning model based on an information bottleneck, characterized by comprising the following steps:
(1) imposing an information bottleneck on the state observed by the agent by using an encoder, and encoding an original state s observed by the agent to obtain a mapped state z;
(2) inputting the mapped state z into the agent, which generates an action according to its current policy;
(3) executing the action generated in step (2) in the environment to obtain the next state;
(4) training the agent's policy according to the interaction result of step (3);
(5) repeating steps (1) to (4) until the total return converges.
2. The information-bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 1, characterized in that: the encoder in step (1) uses mutual information as the criterion to limit the information flow and filter out adversarial information, the mutual information being computed as:

MI(X; Y) = D_KL[p(x, y) ‖ p(x)p(y)] = E_{p(x,y)}[log(p(x, y) / (p(x)p(y)))]

where X and Y are the variables concerned, p(x, y) is their joint distribution, p(x) and p(y) are the marginal densities, MI(X; Y) is the computed mutual information value measuring the correlation between X and Y, D_KL denotes the KL divergence, and E_{p(x,y)} denotes the expectation over the joint distribution p(x, y).
3. The information-bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 2, characterized in that: in step (2), the state z is input into a Q-value function Q(s, a), and with a certain probability ε an action a is selected at random; otherwise

a = argmax_a Q(s, a)

that is, an ε-greedy strategy selects the corresponding action to balance exploration and exploitation.
4. The information-bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 3, characterized in that: in step (3), the agent selects an action according to the ε-greedy strategy and executes it in the environment, obtaining a reward r and the next state s'; the state s' is input to the encoder to obtain z', and the transition tuple (z, a, r, z') is stored in the experience pool.
5. The information-bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 4, characterized in that the specific steps of training the agent's policy in step (4) are:
(4.1) computing a target y from the tuples in the experience pool;
(4.2) computing a loss function;
(4.3) minimizing the loss function with a stochastic gradient descent algorithm to update the parameter values of the encoder and the Q-value function.
6. The information-bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 5, characterized in that the target y in step (4.1) is computed as:

y = r + γ max_{a'} Q̂(z', a')

where γ is a discount factor, Q̂ is the target network, and r is the reward obtained after the agent takes its action; if an episode has just ended, y = r.
7. The information-bottleneck-based deep reinforcement learning model robustness enhancement method according to claim 6, characterized in that the loss function in step (4.2) is:

L = (Q(s, a) − y)² + β E_S[D_KL[p(Z|S) ‖ q(Z)]]

where β is a Lagrange multiplier, E_S denotes the expectation over the state space S, p(Z|S) is the probability of outputting Z given the state S, and q(Z) is an approximate distribution used in place of p(Z).
8. The method as claimed in claim 7, characterized in that the value of β is gradually increased from 0 when the agent's policy is trained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110652107.4A CN113283597A (en) | 2021-06-11 | 2021-06-11 | Deep reinforcement learning model robustness enhancing method based on information bottleneck |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113283597A true CN113283597A (en) | 2021-08-20 |
Family
ID=77284341
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115167136A (en) * | 2022-07-21 | 2022-10-11 | 中国人民解放军国防科技大学 | Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111514585A (en) * | 2020-03-17 | 2020-08-11 | 清华大学 | Method and system for controlling agent, computer device, and storage medium |
CN112132263A (en) * | 2020-09-11 | 2020-12-25 | 大连理工大学 | Multi-agent autonomous navigation method based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |