CN114925850B - Deep reinforcement learning countermeasure defense method for disturbance rewards - Google Patents

Deep reinforcement learning countermeasure defense method for disturbance rewards

Info

Publication number
CN114925850B
Authority
CN
China
Prior art keywords
rewards
strategy
module
state
recre
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210509849.6A
Other languages
Chinese (zh)
Other versions
CN114925850A (en)
Inventor
孙仕亮
余梦然
赵静
毛亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202210509849.6A priority Critical patent/CN114925850B/en
Publication of CN114925850A publication Critical patent/CN114925850A/en
Application granted granted Critical
Publication of CN114925850B publication Critical patent/CN114925850B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a deep reinforcement learning countermeasure defense method for disturbance rewards. On the basis of the deep reinforcement learning algorithm PPO, a reward restoration module named RecRe is constructed, which can recover clean rewards from the disturbed rewards. The reinforcement learning agent then learns from the clean rewards and obtains an optimal strategy with defensive capability. The innovation of the invention is that disturbed rewards in the deep reinforcement learning environment are treated as noise labels in supervised learning, and the RecRe module is built on the idea of noise label learning to recover clean rewards from the noisy rewards, so that the strategy learned from the recovered rewards finally possesses countermeasure defense capability. Compared with the existing substitute strategy and prediction strategy, the restoration strategy learned with the PPO training framework combined with the RecRe module achieves a better defense effect.

Description

Deep reinforcement learning countermeasure defense method for disturbance rewards
Technical Field
The invention relates to the technical field of computers, in particular to the fields of deep reinforcement learning and of countermeasure defense in deep reinforcement learning, and specifically to a deep reinforcement learning countermeasure defense method for disturbance rewards.
Background
The background art involves three parts: deep reinforcement learning, adversarial attack and defense, and the generalized cross entropy loss.
1) Deep reinforcement learning
Reinforcement learning is an important branch of machine learning. Unlike supervised and unsupervised learning, it learns by trial and error. During training, the agent interacts with the environment, adjusts its decisions according to the reward signal fed back by the environment, and obtains an optimal strategy by maximizing the cumulative expected reward. In general, any reinforcement learning problem can be represented by a Markov Decision Process (MDP) in the form of a five-tuple (S, A, R, P, γ). At time t, assume the agent is in state s_t ∈ S and selects an action a_t according to policy π (policy π is a mapping from states to actions). The environment then feeds back an instant reward r_{t+1} according to the reward function R(s_t, a_t, s_{t+1}) and returns the next state s_{t+1} according to the state transition function P(s_{t+1} | s = s_t, a = a_t); the sample at this time is {s_t, a_t, r_t}. The agent then carries out the next round of interaction with s_{t+1} as the current state. When the agent reaches a terminal state, the interaction stops and a trajectory τ = {s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, ...} is formed. The learning goal of the agent is to maximize the cumulative expected reward starting from the initial state s_0, i.e.

J(π) = E_τ [ Σ_{t ≥ 0} γ^t r_{t+1} ],
where γ is the discount factor, which determines whether the agent cares more about long-term or short-term returns. Finally, the agent learns an optimal strategy π*. FIG. 2 illustrates the process of the reinforcement learning agent interacting with the environment.
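To make the interaction loop above concrete, the following is a minimal sketch of an agent-environment rollout that records a trajectory and accumulates the discounted return. It assumes a classic gym-style environment (4-value step) and a hypothetical `policy.act(state)` method; it is an illustration, not the patent's code.

```python
# Minimal rollout sketch for the MDP interaction described above.
# Assumptions: classic gym-style env (reset() -> state, step() -> 4-tuple)
# and a hypothetical `policy.act(state)` method.
def rollout(env, policy, gamma=0.99, max_steps=1000):
    state = env.reset()
    trajectory, ret, discount = [], 0.0, 1.0
    for _ in range(max_steps):
        action = policy.act(state)                      # a_t ~ pi(a | s_t)
        next_state, reward, done, _ = env.step(action)  # r_{t+1}, s_{t+1} via R and P
        trajectory.append((state, action, reward))      # sample {s_t, a_t, r_t}
        ret += discount * reward                        # accumulate gamma^t * r_{t+1}
        discount *= gamma
        state = next_state
        if done:                                        # terminal state ends the episode
            break
    return trajectory, ret
```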
Reinforcement learning algorithms fall mainly into three categories: value-function-based algorithms, policy-function-based algorithms, and Actor-Critic algorithms. Value-function-based methods obtain the optimal strategy indirectly by maximizing the state value function or the state-action value function corresponding to the strategy. The strategy optimized by this approach is deterministic and is not applicable to problems with high-dimensional or continuous action spaces. Policy-function-based algorithms optimize in the strategy space and finally learn the optimal strategy directly in that space. The strategy trained by this method is stochastic and can handle high-dimensional or continuous action spaces. In theory this type of method can find the optimal strategy, but in practice it needs to sample complete trajectories to update parameters, which leads to long optimization times, and the optimal strategy may still not be found.
The Actor-Critic algorithm combines the advantages of both methods by optimizing the value function and the policy function simultaneously. The Actor represents a policy function π(a|s), which is a mapping from states to actions; the Critic represents the value function V(s) or Q(s, a). The Actor-Critic algorithm combines the learning mechanisms of temporal-difference learning and policy gradients and can update parameters at every step; both the Actor and the Critic are updated during each step of training. Assuming the Critic represents a state value function V(s), its loss function is

L(V) = ( r + γ V(s') - V(s) )²,
where s' denotes the next state. This loss minimizes the difference between the state value V(s) predicted by the Critic and the sampled state value r + γV(s'). The policy function represented by the Actor aims to maximize the cumulative expected reward. Assuming the Actor is a continuous function of θ, the parameter θ is updated according to the policy gradient theorem:

∇_θ J(θ) = E[ ∇_θ log π_θ(a_t | s_t) · A_t ],
where A_t = Q(s, a) - V(s) is the advantage function, which represents the change in return after a certain action is selected in the current state. The objective therefore increases the probability of actions that obtain more reward. During training, the Actor selects an action a according to the agent's current state s and the current strategy π(a|s); the Critic then evaluates the current action selection and adjusts its scoring criterion according to the difference between the actual reward fed back by the environment and the predicted reward. In the next round, the Actor adjusts its own strategy based on the Critic's evaluation, hoping to obtain a higher score. The two promote each other: after several iterations, the Critic's evaluation becomes more and more accurate, and the actions taken by the Actor obtain an increasingly large cumulative expected reward. Eventually, the Actor obtains an optimal strategy that achieves the maximum cumulative expected reward.
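As a concrete illustration of the alternating updates described above, the following PyTorch sketch performs one single-step Actor-Critic update from a transition (s, a, r, s'). The `actor`/`critic` modules and their optimizers are assumed to exist; this is a simplified sketch, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, actor_opt, critic_opt, s, a, r, s_next, gamma=0.99):
    # Critic: minimize (r + gamma * V(s') - V(s))^2
    v_s = critic(s)
    td_target = r + gamma * critic(s_next).detach()
    advantage = (td_target - v_s).detach()              # A_t used to weight the policy gradient
    critic_loss = F.mse_loss(v_s, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend E[ grad log pi(a|s) * A_t ] (implemented as descent on the negative)
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```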
The reinforcement learning algorithms described above are effective, but they can only be applied to problems where the state space and action space are relatively small; for large-scale problems they are difficult to apply. Fortunately, with the continuous development of deep learning, reinforcement learning combined with deep learning forms deep reinforcement learning. It exploits the strong perception and fitting abilities of deep learning and can directly process, decide, and control high-dimensional data such as input images. In deep reinforcement learning, a neural network is generally used to approximate the policy function or the value function, so that large-scale problems can be solved. Deep reinforcement learning is therefore better suited to practical problems than plain reinforcement learning.
Although the Actor-Critic algorithm described above is relatively simple, it is difficult to apply to large-scale problems and difficult to converge in practice. With the research and development of deep reinforcement learning, many Actor-Critic-based deep reinforcement learning algorithms have been proposed to address scale and convergence. Among them, the PPO algorithm is one of the most representative. It uses two neural networks to approximate the policy function and the value function, respectively. The PPO algorithm is described in detail below.
In learning the policy function, the step size of the policy update is critical. If the step size is not appropriate, the strategy corresponding to the updated parameters may be worse, and when this worse strategy is used for subsequent sampling and learning, the next update may be worse still. The PPO algorithm therefore introduces constraints to ensure that the cumulative expected reward of the updated strategy does not deteriorate. For the policy function approximated by the Actor network, the optimization objective limits the magnitude of the policy change. Specifically, the return of the new strategy is decomposed into the return of the old strategy plus the advantage function under the new strategy; using importance sampling and a lower bound based on the KL divergence, the problem is converted into maximizing the advantage function while limiting the change (step size) between the new and old strategies. Because solving this constrained problem involves inverting a matrix, a direct solution is very difficult. The PPO algorithm therefore uses a clipping trick to penalize policies that change too much. In the PPO algorithm, the loss function of the Actor network is therefore:
L_actor(θ) = - E_t [ min( x_t(θ) A_t, clip( x_t(θ), 1 - ε, 1 + ε ) A_t ) ],

where θ is the parameter of the Actor network, x_t(θ) is the ratio of the new strategy to the old strategy, and clip is a clipping function that limits x_t(θ) to [1 - ε, 1 + ε], so that the new policy does not change too much relative to the old one. The Critic network in the PPO algorithm approximates a state value function V(s); its optimization objective is still to minimize the difference between the predicted value and the actual value. The loss function of the Critic network is:
L_critic = ( r + γ V(s') - V(s) )²,

where s' is the next state. The Actor network and the Critic network are trained alternately, and finally an optimal strategy that maximizes the cumulative expected reward is learned.
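The two PPO losses above can be written compactly as follows; this is a hedged sketch of the standard clipped objective and value regression, with the probability-ratio inputs and sampled value targets assumed to be precomputed.

```python
import torch

def ppo_actor_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)     # x_t(theta) = pi_new / pi_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # keep the ratio in [1 - eps, 1 + eps]
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def ppo_critic_loss(predicted_values, sampled_values):
    # minimize the gap between predicted state values and sampled values r + gamma * V(s')
    return ((predicted_values - sampled_values) ** 2).mean()
```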
The PPO algorithm described above can only guarantee that the optimal strategy is learned in a clean environment. With the development of machine learning, it has gradually been recognized that deep neural networks are very vulnerable to perturbations of their inputs. Recent research further shows that deep reinforcement learning models that incorporate deep neural networks inherit this property and are likewise very sensitive to artificially constructed perturbations. One must therefore consider how to learn the optimal strategy in an environment that contains perturbations. The PPO algorithm performs excellently in tasks such as control and planning, but its performance degrades when it is perturbed.
2) Adversarial attack and defense
Adversarial research generally includes adversarial attack and adversarial defense, where the adversarial perturbations are maliciously constructed by humans. Attack and defense complement each other: adversarial attacks can evaluate the vulnerability of a model and measure the strength of a defense method, while adversarial defenses can enhance the robustness of a model, and better defenses in turn drive stronger attack methods. Through adversarial attack and defense, a model can acquire stronger resistance to interference.
Currently, deep reinforcement learning models achieve remarkable performance in many areas, such as Go, robot control, and autonomous driving. However, these models are very vulnerable to adversarial perturbations maliciously constructed by an attacker. In particular, for an already trained optimal strategy, an attacker only needs to add a little carefully constructed noise to the image-based state at test time to noticeably degrade the performance of the strategy. It is therefore necessary to study adversarial attack and defense for deep reinforcement learning, which helps to enhance the safety of deep reinforcement learning models, build people's trust in them, and accelerate their deployment in practice.
In studies of adversarial attacks on deep reinforcement learning, the attacked objects include the states observed by the agent, the actions selected by the agent, the environment, and the rewards fed back by the environment. Most studies focus on constructing adversarial examples of the state. However, rewards play a very important role in the reinforcement learning setting: they directly determine whether the agent can learn an effective optimal strategy. In a reinforcement learning environment with perturbed rewards, the agent trains on the rewards fed back by the environment, which may or may not be noisy. Because the rewards may contain noise, the feedback signal obtained by the agent may be inaccurate, which may cause the agent to select the wrong action at the current moment and thus affect the learning of the final optimal strategy. Such non-optimal strategies may cause serious losses in real life, including degraded user experience and even safety incidents. In interactive systems in particular, the reward signal is usually derived from the result of the interaction, so it is easily tampered with by a malicious attacker. For example, in a dialogue system based on deep reinforcement learning, an attacker who maliciously modifies user feedback corrupts the reward signal, so that the dialogue system cannot accurately understand the user's intention or help the user solve the problem at hand. The deep reinforcement learning environment with perturbed rewards, including the model definition and the manner of reward perturbation, is described below.
A general reinforcement learning problem can be expressed as an MDP (S, A, R, P, γ), and the goal is to maximize the cumulative expected reward within a single episode. In the perturbed-reward reinforcement learning setting, the corresponding MDP is (S, A, R, C, P, γ), where the introduced symbol C denotes a reward perturbation function of the form C: S × A × R → R̃, with R̃ denoting the perturbed reward space. When interacting with the environment, the reinforcement learning agent can only observe the perturbed version r̃_t of the original reward r_t ∈ R.
There are many ways to perturb the reward: the perturbation may depend on the state-action pair at each moment, only on the state or the action, or only on the observed reward. Current perturbation schemes basically assume that the perturbation depends only on the reward at the current moment. Two reward perturbation methods appear in current research. The first is the flipping attack, in which the current reward is flipped to another value in the reward set with a certain probability; this keeps the perturbed reward inside the reward set and makes it harder for the defender to notice. The second is random perturbation, in which the attacker constructs a tiny noise from the reward at the current moment and adds it to the current reward value; however, the perturbed reward may then fall outside the selectable reward set and is therefore more likely to be detected. The flipping attack is thus more insidious and more challenging. The flipping attack works as follows: if the reward takes only two values, r_+ and r_-, the perturbed reward r̃ is obtained from the noise rate parameters e_+ and e_- according to

when the clean reward is r_+: r̃ = r_- with probability e_+, and r̃ = r_+ with probability 1 - e_+;
when the clean reward is r_-: r̃ = r_+ with probability e_-, and r̃ = r_- with probability 1 - e_-;
i.e., each value is flipped to the other value with its own probability. If the reward takes M values, defined as {r_0, r_1, ..., r_{M-1}}, the perturbed reward r̃ is obtained according to a confusion matrix C_{M×M}. Each element c_{i,j} of the confusion matrix represents the probability that reward r_i is flipped to r_j, i.e. c_{i,j} = P(r̃ = r_j | r = r_i). In this way, the clean rewards in the environment are flipped into perturbed rewards by a malicious attacker, and an agent subsequently trained with such perturbed rewards may not obtain the optimal strategy.
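For illustration, the flipping attack described above can be sketched as an environment wrapper that redraws each clean reward from the corresponding row of a confusion matrix C (c_{i,j} being the probability of flipping r_i to r_j). The wrapper name and the classic gym-style step interface are assumptions, and the environment's rewards are assumed to match the listed reward set exactly.

```python
import numpy as np

class RewardFlipWrapper:
    """Flips each clean reward according to a row-stochastic confusion matrix."""
    def __init__(self, env, reward_values, confusion_matrix, seed=0):
        self.env = env
        self.reward_values = list(reward_values)      # {r_0, ..., r_{M-1}}
        self.C = np.asarray(confusion_matrix)         # shape (M, M), rows sum to 1
        self.rng = np.random.default_rng(seed)

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        state, reward, done, info = self.env.step(action)   # classic gym 4-tuple
        i = self.reward_values.index(reward)                 # index of the clean reward
        j = self.rng.choice(len(self.reward_values), p=self.C[i])
        return state, self.reward_values[j], done, info      # perturbed reward stays in the set
```

For the binary case above with reward_values = [r_-, r_+], the confusion matrix would be [[1 - e_-, e_-], [e_+, 1 - e_+]].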
To address the above flipping attack on rewards, two defense methods have been proposed in current research. The first constructs a mean-unbiased substitute reward using a reward estimator. Assuming the reward takes only two values r_+ and r_-, the estimated reward r̂ is computed as

r̂(r̃ = r_+) = ((1 - e_-) r_+ - e_+ r_-) / (1 - e_+ - e_-),
r̂(r̃ = r_-) = ((1 - e_+) r_- - e_- r_+) / (1 - e_+ - e_-).
through the calculation of the formula, partial noise in the disturbance rewards is eliminated by replacing rewards, and disturbance is reduced. Therefore, the agent is trained by adopting the alternative rewards in the training process, and finally, an alternative strategy with defensive capability can be learned. However, this approach requires that the defender knows in advance the probability that the prize will be disturbed or the confusion matrix, which is in most cases impractical. Samples may of course be used to estimate the confusion matrix for the bonus disturbance to avoid this limitation, but this requires a relatively accurate estimate of the confusion matrix, otherwise error accumulation may lead to a less well learned strategy. The second defense method is to construct a reward predictor which takes the state-action pair as input directly to output a predicted reward, and then adopts the predicted reward to learn strategy. However, this method does not deal with noise in rewards at all, and therefore has no defensive effect theoretically.
In summary, there are currently few deep reinforcement learning countermeasure defense methods against reward perturbation, and the proposed defenses either have practical limitations or have almost no defensive effect.
3) Generalized cross entropy loss
In traditional supervised learning tasks, deep neural networks achieve state-of-the-art performance in many fields. However, such high performance usually requires a large amount of labeled data for training, and large-scale annotation means high costs in both time and labor. In addition, human error can introduce label errors, and label errors in the training set can greatly affect the performance of a neural network. A line of work therefore studies how to construct new loss functions that reduce the effect of noise in the labels, i.e., noise label learning.
In noise label learning research, the mean absolute error loss (MAE) was found to be more robust to label noise than the categorical cross entropy loss (CCE) commonly used in classification tasks. Let x denote the input sample, y the label, f(·) the output of the classifier, and f_y the y-th component of that output. The gradients of CCE and MAE with respect to the network parameters δ are

∇_δ L_CCE = - (1 / f_y(x)) ∇_δ f_y(x),
∇_δ L_MAE = - 2 ∇_δ f_y(x).
in CCE, gradientBefore by a factor->The larger this coefficient, the greater the weight the gradient occupies in the gradient update. Thus, essentially CCEs imply a potential weighting rule for sample gradients.In this case, the CCE is more concerned with samples where the output of the network is more consistent with the given label. Therefore, CCEs compare sample-dependent labels during training, which is more applicable to clean data. The gradient of the MAE does not contain this coefficient, i.e., it treats all samples equally, regardless of the label effect. So MAE is more robust to noise tags.
However, this robustness of MAE makes neural network training more difficult and can degrade performance. Combining the noise robustness of MAE with the training convergence of CCE, the generalized cross entropy loss GCE is constructed as

L_q(f(x), y) = (1 - f_y(x)^q) / q,
where q ∈ (0, 1] controls the degree to which the GCE loss leans toward CCE or MAE. When q = 1, L_q is equivalent to MAE; when q → 0, L_q is equivalent to CCE. Neural network models that use GCE as the loss function have been shown to be robust to a certain degree of label noise.
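A minimal PyTorch sketch of the GCE loss L_q defined above is given below; the default q = 0.7 and the numerical clamp are illustrative choices rather than values prescribed here.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    """Generalized cross entropy: L_q = (1 - f_y(x)^q) / q, averaged over the batch."""
    probs = F.softmax(logits, dim=-1)
    f_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # probability of the given (possibly noisy) label
    return ((1.0 - f_y.clamp_min(1e-8) ** q) / q).mean()
```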
Disclosure of Invention
The invention aims to solve the problem that an effective strategy cannot be learned in a system whose rewards are easily perturbed, and provides a deep reinforcement learning countermeasure defense method. The invention targets environments in which the rewards are exposed to an attacker and easily disturbed, i.e., the reward feedback received by the agent may be maliciously perturbed. To learn an effective strategy in this setting, the noise label learning task in supervised learning is taken as an analogy: a reward restoration module RecRe is constructed and combined with the PPO algorithm, the perturbed rewards are restored to clean rewards within this framework, and the restored rewards are used to learn the optimal strategy. This helps to improve the defensive performance of systems whose rewards are susceptible to disturbance.
The specific technical scheme for realizing the aim of the invention is as follows:
a deep reinforcement learning countermeasure method facing disturbance rewards is characterized in that a training frame of multithreaded PPO combined with a RecRe module is adopted to finish countermeasure strategy learning facing disturbance rewards. The invention constructs a recovery rewarding module RecRe, hopefully can recover disturbed rewards into clean rewards, thereby training to obtain a defending strategy which is not influenced by noise. The module takes the state-action pair as input, takes the GCE loss as the loss function of training, and trains simultaneously with the PPO algorithm of the training target strategy. Specifically, during the training process, multiple threads are employed to sample state-action pairs and to perform subsequent learning and training. The multiple state-action pairs collected each time are sent to a recovery rewards module RecRe as a sample of one Batch to learn, and then the multiple rewards corresponding to recovery are treated as true rewards for training of target strategies. Theoretically, the RecRe module can be combined with and co-trained with any deep reinforcement learning algorithm. The deep reinforcement learning countermeasure defense method facing the disturbance rewards comprises the following specific steps:
step one: the environment is configured, n thread agents of the PPO interact with respective environments to collect samples, namely the agents sample states s and actions a according to the current strategy, and disturbance rewards of environment feedbackAnd combining these results into a sample->
Step two: construct the RecRe module using a convolutional neural network and the GCE loss function, and send the samples {s, a, r̃} collected in step one to the reward restoration module RecRe at a fixed frequency. The module takes {s, a} as input and r̃ as the label, outputs the restored clean reward r_p, and then sends {s, a, r_p} to the global network of the PPO;
step three: the global network of PPO receives samples { s, a, r } p After the step, calculating the loss functions and gradients of the two sub-network Actor networks and the Critic network contained in the step, and updating the respective parameters;
step four: in the training process, the global network of the PPO distributes parameters of an Actor network and a Critic network to corresponding sub-networks in n threads at fixed frequency so as to keep synchronous updating;
step five: after training is completed, the PPO algorithm incorporating the RecRe module eventually learns the optimal strategy to combat the disturbance in the reward.
The reward restoration module RecRe in step two has the following structure: a feature extractor, a flattening layer, a concatenation layer, and a fully connected layer; the module follows the formulas

s_f = Flatten(Conv(s)),   (1)
p(r_p) = Softmax(Concat(s_f, a)),   (2)
L_RecRe = (1 - p_r̃(r_p)^q) / q.   (3)

The input is the state-action pair {s, a} obtained by the interaction of the agent with the environment in each thread, and the output is the restored reward r_p; r̃ is the perturbed label. Specifically, the state vector s first passes through a feature extractor consisting of several convolution and pooling layers, and then through a flattening layer that flattens the multi-dimensional features of the state into a one-dimensional vector s_f. The concatenation layer then joins the one-dimensional action vector a with the flattened state vector s_f to form a single one-dimensional vector, and finally the fully connected layer with Softmax activation outputs the probability p(r_p) that the predicted restored reward takes each value. The loss function L_RecRe of the RecRe module, shown in equation (3), is the generalized cross entropy loss GCE, where q is a hyperparameter and p_r̃(r_p) denotes the probability assigned by the predicted distribution to the value of the current noisy reward. This loss combines the robustness of the mean absolute error MAE to label noise with the convergence guarantee of the cross entropy loss CCE, and is therefore robust to noise in the rewards without sacrificing performance.
Step three is specifically as follows: with the state-action pair {s, a} as input and the restored reward r_p as the feedback returned by the environment, the outputs are the approximated policy function and value function. The Actor network fits the policy function π_θ(a|s), a mapping from states to actions; the Critic network fits the state-action value function Q(s, a), i.e., it evaluates the value of selecting action a in the current state s. Their loss functions are, respectively,

L_actor(θ) = - E_t [ min( x_t(θ) A_t, clip( x_t(θ), 1 - ε, 1 + ε ) A_t ) ],   (4)
L_critic = ( Q(s, a) - ( r_p + γ max_{a'} Q(s', a') ) )²,   (5)

where x_t(θ) is the ratio of the new policy to the old policy in the update process, and A_t = Q(s, a) - V(s) is the advantage function, computed as the difference between the state-action value function Q(s, a) and the state value function V(s), which measures the change in value after action a is taken. The Actor loss limits x_t(θ) to [1 - ε, 1 + ε] to prevent the new strategy from changing too much relative to the old strategy. In the Critic loss, Q(s, a) is the predicted Q value and r_p + γ max_{a'} Q(s', a') is the sampled true Q value, consisting of the immediate reward r_p for selecting action a in the current state s and the maximum Q value over actions a' in the next state s'. This loss is expected to fit the Q function more accurately by minimizing the difference between the predicted Q value and the sampled true Q value.
The invention has the following beneficial effects: the invention proposes a deep reinforcement learning countermeasure defense method for disturbance rewards, which helps to improve the defensive performance of systems whose rewards are susceptible to disturbance. Specifically:
1) Drawing on the method of noise label learning, a RecRe module is constructed and trained with GCE as the loss function, so that the module can restore clean rewards.
2) The RecRe module is combined with the multithreaded PPO algorithm, so that an optimal strategy with a countermeasure defense effect against disturbance rewards can be trained.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic illustration of interaction of an agent with an environment;
fig. 3 is a diagram of the PPO training framework of the present invention incorporating the RecRe module.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments and the drawings. Except where specifically noted below, the procedures, conditions, and experimental methods for carrying out the invention are common knowledge and customary techniques in the art, and the content of the invention is not particularly limited thereto.
The invention provides a deep reinforcement learning countermeasure defense method for disturbance rewards, which constructs a reward restoration module RecRe in a system whose rewards are easily perturbed so as to restore clean rewards, and then trains a multithreaded PPO algorithm on the clean rewards to obtain an optimal strategy with countermeasure defense capability. The overall flow is shown in fig. 1.
The multithreaded PPO algorithm applied by the invention forms the basis for learning the optimal strategy. However, the performance of the original multithreaded PPO training method depends on the clean rewards fed back by the environment; if the rewards are easily disturbed by a malicious attacker, the performance of the optimal strategy obtained by multithreaded PPO training degrades. To obtain clean rewards, a reward restoration module RecRe is embedded in the multithreaded PPO training framework, so that the disturbance possibly contained in the rewards is eliminated and restored clean rewards are obtained; the agent finally learns its strategy on these clean rewards.
The multithreaded PPO algorithm is one of the most representative and widely applied algorithms in deep reinforcement learning, and the optimal strategy trained with it performs very well. However, this algorithm requires the environment to feed back clean rewards; in systems where the rewards are perturbed, the optimal strategy cannot be obtained with this algorithm. To enable the multithreaded PPO algorithm to learn an effective strategy in a system with perturbed rewards, the invention regards the noisy rewards as noisy labels and, drawing on the idea of noise label learning, constructs a reward restoration module RecRe that can restore clean rewards. The multithreaded PPO algorithm is then trained with the restored clean rewards, which avoids interference from the disturbance in the rewards, and finally an optimal strategy with defensive capability against the disturbance is learned.
In the invention, the RecRe module plays a very important role and directly determines whether the multithreaded PPO algorithm can learn the optimal defense strategy. The module can be viewed as a classification task: the input is a state-action pair, the output is the probability of each reward value, and the module classifies the current state-action pair into one of the reward value categories. Since the labels (noisy rewards) of this classification task contain disturbance, and the loss functions in ordinary classification tasks, such as the cross entropy loss CCE, are relatively sensitive to label noise, the invention adopts the generalized cross entropy loss GCE as the loss function of the RecRe module. This loss combines the robustness of the mean absolute error loss MAE to noisy labels with the convergence guarantee of the cross entropy loss CCE; it has a certain robustness against disturbance in the labels and can converge to an optimal solution.
Examples
The following is a specific embodiment in which the combined RecRe module is trained within the multithreaded PPO training framework. The training framework is shown in fig. 3. In this embodiment, 4 OpenAI open-source Atari 2600 game environments are employed. The overall flow is shown in fig. 1.
Step one: configure the Atari 2600 game environments, adopt the multithreaded PPO algorithm, and set n thread agents to interact with their respective environments and collect samples. The interaction of an agent with the environment is illustrated in fig. 2. Specifically, the environment provides a frame of the game picture as the state s, and the decision of the agent (simulating a game player) is the action a. For example, in the Pong environment, which simulates a table tennis match, the agent hits the ball upward or downward by operating the paddle, so the up or down movement is the agent's action. After the agent executes the action, the environment feeds back the instant perturbed reward r̃ of this action and the next state s'. These results are combined into a sample {s, a, r̃}.
Step two: train the RecRe module to obtain restored rewards and transmit the samples to the global network of the PPO. The collected samples {s, a, r̃} are sent to the reward restoration module RecRe, which eliminates the disturbance possibly contained in the rewards and obtains the corresponding restored clean rewards. The RecRe module mainly consists of a feature extractor, a flattening layer, and a fully connected layer. The input is the state-action pair {s, a} obtained by the interaction of the agent with the environment in each thread, and the output is the restored reward value r_p; the perturbed reward r̃ serves as the noise label of the module, and GCE serves as its loss function.
The loss function of the module is

L_RecRe = (1 - p_r̃(r_p)^q) / q.
specifically, the state s is first represented by a feature extractor learning feature composed of a plurality of convolution layers and a pooling layer, then the multidimensional feature of the state is flattened into one dimension by a flattening layer, at this time, the one-dimensional action a is combined with the state to form a one-dimensional vector, and finally the prediction probability of each rewarding category is output by a fully-connected layer with an activation function of Softmax. Determining a value r of a restoration prize by maximizing probability p And sample { s, a, r } p And directed to the global network of PPO.
Step three: train the global network of the PPO. The global network of the PPO includes an Actor network and a Critic network. The Actor network approximates the policy function: it takes the state s as input and, through a feature extractor, a fully connected layer, and Softmax normalization, outputs the probability π_θ(a|s) of selecting each action, from which an action is sampled. The gradient of the Actor network parameters is then derived with the policy gradient theorem, as shown in formula (4), and the parameters are updated by gradient ascent.
The Critic network approximates the value function: it takes the state s as input and outputs the state value through a feature extractor and a fully connected layer. The gradient of the network parameters is then derived with the corresponding optimization objective, as shown in formula (5), and the parameters are updated by gradient descent. In this process, the state values, returns, and temporal-difference errors are all calculated from the reward value r_p restored by the RecRe module.
Step four: the global network of the PPO distributes the network parameters to the thread networks at a fixed frequency. The global network transmits copies of the parameters of the Actor network and the Critic network to the corresponding Actor and Critic networks in the n threads at a fixed frequency so as to keep the updates synchronized.
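The parameter distribution in step four amounts to copying the global parameters into each thread's networks; a minimal sketch with assumed torch modules is:

```python
def sync_thread_networks(global_actor, global_critic, thread_actors, thread_critics):
    for actor, critic in zip(thread_actors, thread_critics):
        actor.load_state_dict(global_actor.state_dict())    # copy Actor parameters to the thread
        critic.load_state_dict(global_critic.state_dict())  # copy Critic parameters to the thread
```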
Step five: complete the learning of the countermeasure defense strategy for disturbance rewards. After the training described above, the finally converged Actor network is the required optimal strategy; it is not affected by the disturbance in the rewards and has a certain defensive effect. For performance evaluation, the cumulative expected reward that the strategy can ultimately achieve is used as the evaluation index: the higher the reward, the better the defensive effect of the strategy.
This embodiment trains the multithreaded PPO strategy incorporating the RecRe module on 4 classical Atari 2600 game environments and evaluates the defensive effect of the strategy. These 4 environments are Breakout, Carnival, MsPacman, and Pong. In addition, the defense strategy provided by the invention is named the restoration strategy, the strategy learned in a clean environment is named the clean strategy, the strategy learned in an environment with perturbed rewards is named the noise strategy, the strategy trained with the substitute rewards is named the substitute strategy, and the strategy trained with the prediction-based method is named the prediction strategy. The substitute strategy and the prediction strategy are the strategies trained with the two existing defense methods, respectively.
Table 1 quantitatively compares the final cumulative expected rewards of each strategy in the 4 environments; a larger reward indicates a better defense. The bold entries in the table indicate the best performance among the last three defense methods, and the underlined entries indicate that the defense method is inferior to the noise strategy, i.e., that the defense fails. In both the Breakout and Pong environments, the cumulative expected reward finally achieved by the substitute strategy is lower than that of the noise strategy, which shows that this defense method is ineffective in some environments. In the other two environments, the reward of the substitute strategy is not raised significantly compared with that of the noise strategy, which means its defensive effect is limited. The cumulative expected rewards of the prediction strategy in all four environments are lower than those of the noise strategy, indicating that this defense method has no defensive effect; it simply predicts from noisy samples and does not specifically address the disturbance in the rewards. The rewards of the restoration strategy in all 4 environments are higher than those of the noise strategy, and also higher than those of the substitute strategy and the prediction strategy. This shows that the restoration strategy can eliminate the noise in the perturbed rewards and can learn to obtain a high return. In addition, in the Carnival environment, the restoration strategy achieves a cumulative expected reward about 1200 points higher than the clean strategy, which indicates that the restoration strategy introduces extra information during reward restoration and helps the agent learn a higher score. In summary, the reward restoration strategy is a relatively effective defense strategy that can suppress the noise in the rewards and obtain a high return.
Taken together, the test results show that the rewarding restoration strategy proposed by the present invention is superior to other known defense strategies.
Table 1. Cumulative expected rewards of the strategies in the 4 environments
The protection of the present invention is not limited to the above embodiments. Variations and advantages that would occur to one skilled in the art are included in the invention without departing from the spirit and scope of the inventive concept, and the scope of the invention is defined by the appended claims.

Claims (1)

1. A deep reinforcement learning countermeasure defense method for disturbance rewards, characterized in that a reward restoration module RecRe is constructed, which can restore clean rewards from the disturbed rewards, and is combined with the deep reinforcement learning algorithm PPO to train an optimal strategy with countermeasure defense capability against the disturbance in the rewards; the method comprises the following steps:
step one: configuration environment, PPOThe thread agents interact with the respective environments to collect samples, namely the agents sample according to the current strategy to obtain the state +.>And action->And a disturbance reward of environmental feedback +.>And combining these results into a sample +.>
The configuration environment comprises an Atari 2600 game environment, a multithreading PPO algorithm is adopted, n threads of agents are set to interact with respective environments and collect samples, the configuration environment takes a frame of picture as a state s, the decision of the agents is taken as an action a, the agents comprise simulated game players, and after the agents execute the action, the environment is fed back to the agents which are the current actionTime perturbation reward and next state s'
Step two: construct the RecRe module using a convolutional neural network and the GCE loss function, and send the samples {s, a, r̃} collected in step one to the reward restoration module RecRe at a fixed frequency; the module takes {s, a} as input and r̃ as the label, outputs the restored clean reward r_p, and then sends {s, a, r_p} to the global network of the PPO;

training the RecRe module to obtain the restored rewards, including the restored clean rewards r_p, and transmitting the samples to the global network of the PPO;
step three: global network of PPO receives samplesThen, calculating the loss functions and gradients of the two sub-network Actor networks and the Critic network contained in the network, and updating the respective parameters;
step four: in the training process, the global network of the PPO distributes parameters of an Actor network and a Critic network to corresponding sub-networks in n threads at fixed frequency so as to keep synchronous updating; step five: after training is completed, the PPO algorithm combined with the RecRe module finally learns the optimal strategy for resisting defensive disturbance in rewards;
the reward restoration module RecRe in the second step has the following structure: a feature extractor, a flattening layer, an additive layer, and a full connection layer; the module conforms to the following formula:
the input is state-action pairs obtained by interaction of agents and environments in each threadThe output is a recovered reward +>;/>Is a perturbed tag; specifically, the state vector->Firstly, a feature extractor consisting of a plurality of convolution layers and pooling layers is adopted, and then the multidimensional feature of the state is flattened into a one-dimensional vector through a flattening layer>At this time, the additive layer will add the one-dimensional motion vector +.>And state vector after flattening->Joining to form a one-dimensional vector, and finally performing an activation function to obtainIs a full link layer output predictive restoration bonusProbability of each value->The method comprises the steps of carrying out a first treatment on the surface of the Loss function of RecRe Module>As shown in formula (3), it is generalized cross entropy loss GCE, where +.>Is a superparameter->Representing the probability corresponding to the value of the current noise reward in the predicted probability distribution; the loss function combines the robustness characteristic of the average absolute error MAE on the noise label and the convergence guarantee of the cross entropy loss CCE, and can have robustness on the noise in rewards under the condition of not losing the performance;
the third step is specifically as follows: in state-action pairsRestoring rewards for input>As feedback returned by the environment, the output is a simulated strategy function and a simulated value function; actor network fitting strategy function>I.e., a mapping of states to actions; critic network simulation state action value function +.>I.e. evaluate the current state +.>Down selection action->Is of value (1); the loss functions of the two are respectively as follows:
middle->Is the ratio of new strategy to old strategy in the updating process, < >>Is an advantage function which calculates a state action value function +.>And state value function->The difference between them measures the current action taken +.>The amount of change in value; the loss function will->Limited to->Between, the change of the new strategy relative to the old strategy is prevented from being excessively large; />The penalty is expected to fit the state action value function more accurately, thus minimizing the difference between the predicted Q value and the sampled true Q value.
CN202210509849.6A 2022-05-11 2022-05-11 Deep reinforcement learning countermeasure defense method for disturbance rewards Active CN114925850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210509849.6A CN114925850B (en) 2022-05-11 2022-05-11 Deep reinforcement learning countermeasure defense method for disturbance rewards

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210509849.6A CN114925850B (en) 2022-05-11 2022-05-11 Deep reinforcement learning countermeasure defense method for disturbance rewards

Publications (2)

Publication Number Publication Date
CN114925850A CN114925850A (en) 2022-08-19
CN114925850B true CN114925850B (en) 2024-02-20

Family

ID=82808036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210509849.6A Active CN114925850B (en) 2022-05-11 2022-05-11 Deep reinforcement learning countermeasure defense method for disturbance rewards

Country Status (1)

Country Link
CN (1) CN114925850B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556680A (en) * 2023-05-10 2024-02-13 Navy Submarine Academy of the Chinese People's Liberation Army Submarine action parameter prediction method and device based on active reinforcement learning
CN116827685B (en) * 2023-08-28 2023-11-14 成都乐超人科技有限公司 Dynamic defense strategy method of micro-service system based on deep reinforcement learning
CN117237720B (en) * 2023-09-18 2024-04-12 大连理工大学 Label noise correction image classification method based on reinforcement learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069504A (en) * 2020-08-31 2020-12-11 浙江工业大学 Model enhanced defense method for resisting attack by deep reinforcement learning
CN112232490A (en) * 2020-10-26 2021-01-15 大连大学 Deep simulation reinforcement learning driving strategy training method based on vision
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113392396A (en) * 2021-06-11 2021-09-14 浙江工业大学 Strategy protection defense method for deep reinforcement learning
CN113420326A (en) * 2021-06-08 2021-09-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN113947016A (en) * 2021-09-28 2022-01-18 浙江大学 Vulnerability assessment method for deep reinforcement learning model in power grid emergency control system
CN114123178A (en) * 2021-11-17 2022-03-01 哈尔滨工程大学 Intelligent power grid partition network reconstruction method based on multi-agent reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10838848B2 (en) * 2017-06-01 2020-11-17 Royal Bank Of Canada System and method for test generation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069504A (en) * 2020-08-31 2020-12-11 浙江工业大学 Model enhanced defense method for resisting attack by deep reinforcement learning
CN112232490A (en) * 2020-10-26 2021-01-15 大连大学 Deep simulation reinforcement learning driving strategy training method based on vision
CN113420326A (en) * 2021-06-08 2021-09-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113392396A (en) * 2021-06-11 2021-09-14 浙江工业大学 Strategy protection defense method for deep reinforcement learning
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN113947016A (en) * 2021-09-28 2022-01-18 浙江大学 Vulnerability assessment method for deep reinforcement learning model in power grid emergency control system
CN114123178A (en) * 2021-11-17 2022-03-01 哈尔滨工程大学 Intelligent power grid partition network reconstruction method based on multi-agent reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fast-PPO: Proximal Policy Optimization Algorithm with the Optimal Baseline Method; Xiao Zhu et al.; Journal of Chinese Computer Systems; 2020-07-10; Vol. 41, No. 07; 1351-1356 *

Also Published As

Publication number Publication date
CN114925850A (en) 2022-08-19

Similar Documents

Publication Publication Date Title
CN114925850B (en) Deep reinforcement learning countermeasure defense method for disturbance rewards
Raileanu et al. Automatic data augmentation for generalization in deep reinforcement learning
CN111310915B (en) Data anomaly detection defense method oriented to reinforcement learning
Seo et al. Reinforcement learning with action-free pre-training from videos
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
Jena et al. Augmenting gail with bc for sample efficient imitation learning
CN113255936A (en) Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism
Behzadan et al. The faults in our pi stars: Security issues and open challenges in deep reinforcement learning
CN112069504A (en) Model enhanced defense method for resisting attack by deep reinforcement learning
Liu et al. Constrained decision transformer for offline safe reinforcement learning
CN113298252B (en) Deep reinforcement learning-oriented strategy anomaly detection method and device
Yuan et al. Robust task representations for offline meta-reinforcement learning via contrastive learning
Liu et al. Efficient reinforcement learning for starcraft by abstract forward models and transfer learning
Subramanian et al. Multi-agent advisor q-learning
Bai et al. Variational dynamic for self-supervised exploration in deep reinforcement learning
Chivukula et al. Adversarial learning games with deep learning models
Zakharenkov et al. Deep reinforcement learning with dqn vs. ppo in vizdoom
Yang et al. Adaptive inner-reward shaping in sparse reward games
Hu et al. RL-VAEGAN: Adversarial defense for reinforcement learning agents via style transfer
Jiang et al. Action candidate based clipped double q-learning for discrete and continuous action tasks
Jacq et al. Lazy-mdps: Towards interpretable reinforcement learning by learning when to act
Jiang et al. On the importance of exploration for generalization in reinforcement learning
CN115909027B (en) Situation estimation method and device
Zuo et al. Off-policy adversarial imitation learning for robotic tasks with low-quality demonstrations
Jeon et al. Scalable multi-agent inverse reinforcement learning via actor-attention-critic

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant