CN113360917A - Deep reinforcement learning model security reinforcement method and device based on differential privacy - Google Patents
Info
- Publication number
- CN113360917A (application CN202110766183.8A)
- Authority
- CN
- China
- Prior art keywords
- model
- stealing
- value
- differential privacy
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/575—Secure boot
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
Abstract
The invention discloses a method and a device for reinforcing the security of a deep reinforcement learning model based on differential privacy, wherein the method comprises the following steps: sampling data from the environment as a sample set to be trained, constructing a target model by using a deep reinforcement learning algorithm, and inputting the sample set to be trained into the target model to train it; testing the trained target model and sampling state-action pairs as a stealing data set; constructing a stealing model by using a deep reinforcement learning algorithm; inputting the stealing data set into the stealing model as training samples and training the stealing model with an imitation learning algorithm; adding a differential privacy protection mechanism to the trained target model and feeding the data output by the target model under that mechanism into the stealing model; under the influence of the differentially private data, the stealing model takes erroneous attack actions.
Description
Technical Field
The invention relates to the field of data security, in particular to a method and a device for reinforcing the security of a deep reinforcement learning model based on differential privacy.
Background
With the rapid development of artificial intelligence, deep reinforcement learning algorithms, which combine the perception capability of deep learning with the decision-making capability of reinforcement learning, are widely applied in fields such as autonomous driving, machine translation and game AI.
However, recent research shows that deep reinforcement learning models are vulnerable to different types of malicious attacks. Security vulnerabilities in deep reinforcement learning algorithms seriously threaten the integrity, availability and confidentiality of deep reinforcement learning systems, and as artificial intelligence becomes ever more closely tied to production and daily life, the need to address the security of artificial intelligence applications grows increasingly urgent.
An existing method for improving the security of deep learning models is the defense method against attacks on deep reinforcement learning models disclosed in Chinese patent application publication No. CN110968866A. That defense method comprises the following steps: predicting the current environmental state from the previous input environmental state using a visual prediction model built on a generative adversarial network, and obtaining the next-frame predicted environmental-state value of the predicted current state under the deep reinforcement learning strategy; acquiring the actual current environmental state output by the deep reinforcement learning model, and obtaining the environmental-state value of the actual current state with added perturbation under the deep reinforcement learning strategy; judging the predicted state value and the perturbed state value with a discrimination model built on a generative adversarial network, and determining from the result whether the deep reinforcement learning model is under attack; when it is, extracting the actual current environmental state, applying a first layer of defense with a SqueezeNet-based model and a second layer of defense with a DenseNet-based model to the first-layer result to obtain the defended actual current environmental state; and having the deep reinforcement learning model perform learning and prediction using the defended state.
The defense method proposed in that application protects the reinforcement learning model with a visual prediction model, a discriminator and additional defense models; it defends the model at inference time but does not perform security reinforcement of the deep reinforcement learning model itself.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a method and a device for reinforcing the security of a deep reinforcement learning model based on differential privacy, which blur the output distribution of the deep model to the greatest extent without changing the model's output actions and greatly reduce the level of model stealing attack, thereby preventing an attacker from reconstructing the original model from the action-space distribution.
A deep reinforcement learning model security reinforcement method based on differential privacy comprises the following steps:
sampling data from the environment as a sample set to be trained, constructing a target model by using a deep reinforcement learning algorithm, and inputting the sample set to be trained into the target model to train the target model;
testing the trained target model, and sampling state actions as a stealing data set;
constructing a stealing model by using a deep reinforcement learning algorithm, wherein the stealing model is used to imitate the behavior of the target model under attack;
inputting the stealing data set into the stealing model as training samples and training the stealing model with an imitation learning algorithm;
adding a differential privacy protection mechanism into a trained target model, and inputting data output by the target model under the action of the differential privacy protection mechanism into a stealing model;
the stealing model makes erroneous attack actions under the influence of the data processed by the differential privacy mechanism.
The training of the target model comprises the following steps:
using an experience playback mechanism, and carrying out online collection and processing to obtain an online sample set;
storing the online sample set and the sample set to be trained into a playback memory unit to form a transfer sample;
during each training step, randomly extracting transfer samples from the playback memory unit and inputting them into the current value network to obtain the current Q value, updating the parameters with a stochastic gradient descent algorithm during training;
copying parameters of the current value network to a target value network to obtain an optimization target of the current Q value, namely a target Q value;
updating network parameters by minimizing a mean square error between a current Q value and a target Q value; after the target value network is introduced, the target Q value is kept unchanged in a period of time, so that the correlation between the current Q value and the target Q value is reduced to a certain extent, and the stability of the algorithm is improved;
the depth reinforcement learning algorithm reduces the reward value and the error term to a limited interval, ensures that the Q value and the gradient value are in a reasonable range, improves the stability of the algorithm, and obtains an optimal strategy through gradient descent optimization.
The deep reinforcement learning problem can be modeled as a Markov decision process, i.e. represented by a quadruple MDP = (S, A, R, P), where S is the set of states available in the decision process, A is the set of actions, R is the real-time reward for a state transition, and P is the state transition probability. At the beginning of any time step t, the agent observes the environment to obtain the current state s_t and takes action a_t according to the current optimal strategy π*; at the end of t, the agent receives its reward r_t and the next observed state s_{t+1}. The deep reinforcement learning algorithm adopts a so-called "hard" target value network parameter update, i.e., the parameters of the current value network are assigned to the target value network at regular intervals.
When the deep reinforcement learning network is trained, the samples are generally required to be mutually independent; random sampling therefore greatly reduces the correlation among samples and improves the stability of the algorithm.
The output of the current value network is used to evaluate the value function of the current state-action pair, while the output of the target value network provides the optimization target that the value function approximates.
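The MDP quadruple (S, A, R, P) described above can be made concrete with a toy two-state environment; all states, actions, rewards and transition probabilities below are illustrative assumptions:

```python
import random

S = ["s0", "s1"]                      # state set S
A = ["left", "right"]                 # action set A
R = {("s0", "right"): 1.0, ("s0", "left"): 0.0,
     ("s1", "right"): 0.0, ("s1", "left"): 1.0}   # real-time reward R(s, a)
P = {("s0", "right"): {"s0": 0.2, "s1": 0.8},     # transition probabilities P(s' | s, a)
     ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
     ("s1", "right"): {"s0": 0.5, "s1": 0.5},
     ("s1", "left"):  {"s0": 0.1, "s1": 0.9}}

def step(s, a, rng):
    """One time step t: the agent takes a_t in s_t, receives r_t and s_{t+1}."""
    r = R[(s, a)]
    s_next = rng.choices(list(P[(s, a)]), weights=P[(s, a)].values())[0]
    return r, s_next

rng = random.Random(0)
r, s_next = step("s0", "right", rng)
print(r, s_next)
```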
The error function between the current Q value and the target Q value is as follows:
L(θ_i) = E[(Y_i − Q(s, a | θ_i))²], with Y_i = r + γ max_{a'} Q(s', a' | θ_i⁻)
Taking the partial derivative with respect to the parameter θ gives the following gradient:
∇_{θ_i} L(θ_i) = E[(Y_i − Q(s, a | θ_i)) ∇_{θ_i} Q(s, a | θ_i)]
where s is the current state, a is the corresponding action, r is the reward value, s' is the next state, θ_i are the model parameters, θ_i⁻ are the target network parameters, γ is the discount factor, E denotes expectation, Y_i is the target Q value, and Q(s, a | θ_i) is the value of state s and action a.
The optimal strategy is as follows:
π*(s) = argmax_{a ∈ A} Q*(s, a)
where s is the current state, a is the corresponding action, A is the action set, Q* is the optimal value function, and π* is the optimal strategy.
The training of the stealing model comprises the following steps:
an Actor network is used in place of the generator G; the actions and states output by the Actor are input into the discriminator in pairs and compared with the expert data, and the output of the discriminator D: S × A → (0, 1) is used as a reward value to guide the policy learning of imitation learning; the discriminator loss function is expressed as:
L_D = E_{π_t}[log D(s, a)] + E_{π_IL}[log(1 − D(s, a))]
where π_IL is the policy obtained by imitation learning and π_t is the sampled expert policy; the first term, log D(s, a), represents the discriminator's judgment of real data, and the second term, log(1 − D(s, a)), represents its judgment of generated data;
specifically, through this minimax game process, G and D are optimized alternately in a loop to train the required Actor network and discriminator network;
during training, the loss function is minimized by gradient derivation so as to back-propagate updates to the network parameters of the discriminator and the Actor; the loss function is:
L = E_{π_IL}[log(1 − D(s, a))] − λ H(π_IL)
where H(π_IL) is the entropy of the imitation policy π_IL, controlled by a constant λ (λ ≥ 0) and used as a policy regularization term in the loss function;
and the trained stealing model is used to generate adversarial examples for attacking the target model.
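The discriminator-guided imitation learning above can be sketched with a logistic discriminator D(s, a) trained to separate expert state-action pairs from the imitator's, with −log(1 − D) serving as the imitator's reward signal. The feature dimension, data and learning rate are illustrative assumptions, and the Actor update itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 6                                   # dim of a concatenated (s, a) feature vector
w = np.zeros(DIM)                         # discriminator parameters

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

expert = rng.normal(loc=+1.0, size=(256, DIM))    # pi_t: sampled expert (s, a) pairs
imitator = rng.normal(loc=-1.0, size=(256, DIM))  # pi_IL: Actor (generator) rollouts

# Maximize E_expert[log D] + E_IL[log(1 - D)] by gradient ascent on w.
for _ in range(200):
    grad = expert.T @ (1 - sigmoid(expert @ w)) / len(expert) \
         - imitator.T @ sigmoid(imitator @ w) / len(imitator)
    w += 0.1 * grad

d_expert = float(sigmoid(expert @ w).mean())      # judgment of real data, approaches 1
d_fake = float(sigmoid(imitator @ w).mean())      # judgment of generated data, approaches 0
reward = -np.log(1.0 - sigmoid(imitator @ w) + 1e-8)  # reward signal guiding pi_IL
print(round(d_expert, 2), round(d_fake, 2))
```

In the full minimax loop the Actor would then be updated against `reward` (plus the entropy regularizer λH) while D is retrained, alternating until the imitator's state-action distribution is indistinguishable from the expert's.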
The differential privacy mechanism is represented as follows:
M(d_se) = f(d_se) + N(0, σ²)
where N(0, σ²) is a Gaussian distribution with mean 0 and variance σ²; a single application of the Gaussian mechanism with sensitivity Δf satisfies (ε, δ)-differential privacy when σ ≥ √(2 ln(1.25/δ)) · Δf / ε with ε < 1, where Δf denotes the sensitivity of the input sequence d_se;
a differential privacy mechanism is added to the target model middle layer.
A differential privacy mechanism approximates a real-valued function f by adding noise calibrated to the sensitivity Δf of f, which is defined as the maximum absolute distance |f(d_se) − f(d'_se)| between two adjacent input sequences d_se and d'_se.
In deep reinforcement learning, dynamic differential privacy (DDP) is added to the intermediate layer of the forward DRL model during strategy execution. To guarantee that the given noise distribution satisfies (ε, δ)-DDP, the invention selects a noise scale σ ≥ c·Δs/ε with a constant c satisfying c² > 2 ln(1.25/δ), for ε ∈ (0, 1); here Δs is the sensitivity of the real-valued function s whose noisy values are sampled into the data set. A security reinforcement mechanism is thus added dynamically to the model so that the distribution of strategy actions differs from the original action-space distribution, making it difficult for an attacker to infer the original model algorithm from the observed action-space distribution.
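The Gaussian noise calibration above (σ ≥ c·Δs/ε with c² > 2 ln(1.25/δ)) can be sketched as follows; the sensitivity, privacy parameters and the layer output being perturbed are illustrative assumptions:

```python
import math
import numpy as np

def gaussian_mechanism(values, sensitivity, eps, delta, rng):
    """Add Gaussian noise with scale sigma = c * sensitivity / eps, c = sqrt(2 ln(1.25/delta))."""
    assert 0 < eps < 1
    c = math.sqrt(2.0 * math.log(1.25 / delta))   # constant c with c^2 > 2 ln(1.25/delta)
    sigma = c * sensitivity / eps                 # noise scale
    return values + rng.normal(0.0, sigma, size=values.shape), sigma

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])               # assumed intermediate-layer output
noisy, sigma = gaussian_mechanism(logits, sensitivity=1.0, eps=0.5, delta=1e-5, rng=rng)
print(sigma)
```

With these parameters σ is large relative to the layer's output range, so the distribution an observer sees is strongly blurred even when the final greedy action often survives the perturbation.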
Specifically, a measure of the model stealing attack is defined in terms of R_stl, the reward value after model stealing, and R_test, the test reward value of the original model; it measures the effectiveness and extent of model stealing against the target model.
The measure of the model stealing defense with the differential privacy protection mechanism added is then defined in terms of R_defense, the reward value of the stolen model under the defense, together with R_stl and R_test; it measures the defense effect of the invention, that is, the degree to which the model stealing attack is degraded under the defense.
A differential privacy based deep reinforcement learning model security strengthening device, comprising a computer memory, a computer processor and a computer program stored in the computer memory and executable on the computer processor, wherein the computer processor implements any one of the above methods when executing the computer program.
Compared with the prior art, the invention has the advantages that:
(1) By introducing the exponential mechanism of differential privacy into the model input layer, the amount of information that a model stealing attacker can obtain from the model output is reduced, the output distribution of the deep model is blurred to the greatest extent without changing the model's output actions, and the level of model stealing attack is greatly lowered, thereby preventing an attacker from reconstructing the original model from the action-space distribution.
Drawings
FIG. 1 is a general flowchart of a method for security reinforcement of a deep reinforcement learning model based on differential privacy according to the present invention;
fig. 2 is a deep reinforcement learning model schematic diagram of the deep reinforcement learning model security reinforcement method based on differential privacy provided by the invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
This embodiment provides a deep reinforcement learning model security reinforcement method based on differential privacy, which alters the action-space distribution of the deep reinforcement learning strategy through a differential privacy exponential mechanism. By introducing the mechanism into the model input layer, the amount of information a model stealing attacker can obtain from the model output is reduced; the output distribution of the deep model is blurred to the greatest extent without changing the model's output actions, and the level of model stealing attack is greatly lowered, thereby preventing an attacker from reconstructing the original model from the action-space distribution.
Fig. 1 is a general flowchart of the method for reinforcing the security of the deep reinforcement learning model based on differential privacy according to this embodiment, and the method for reinforcing the security of the deep reinforcement learning model based on differential privacy according to this embodiment can be used in the field of game AI and used for training game AI to automatically play games.
As shown in fig. 1-2, the method for security reinforcement of the deep reinforcement learning model based on differential privacy includes the following steps:
(1) sampling data from the environment as a sample set to be trained, constructing a target model by using a deep reinforcement learning algorithm, and inputting the sample set to be trained into the target model to train the target model; the specific training process comprises
(1.1) using an experience playback mechanism, and carrying out online collection and processing to obtain an online sample set;
(1.2) storing the online sample set and the sample set to be trained into a playback memory unit to form a transfer sample;
(1.3) during each training step, randomly extracting transfer samples from the playback memory unit and inputting them into the current value network to obtain the current Q value, updating the parameters with a stochastic gradient descent algorithm during training;
(1.4) copying parameters of the current value network to a target value network to obtain an optimization target of the current Q value, namely the target Q value;
(1.5) updating network parameters by minimizing the mean square error between the current Q value and the target Q value; the error function between the current Q value and the target Q value is as follows:
L(θ_i) = E[(Y_i − Q(s, a | θ_i))²], with Y_i = r + γ max_{a'} Q(s', a' | θ_i⁻)
Taking the partial derivative with respect to the parameter θ gives the following gradient:
∇_{θ_i} L(θ_i) = E[(Y_i − Q(s, a | θ_i)) ∇_{θ_i} Q(s, a | θ_i)]
where s is the current state, a is the corresponding action, r is the reward value, s' is the next state, θ_i are the model parameters, θ_i⁻ are the target network parameters, γ is the discount factor, E denotes expectation, Y_i is the target Q value, and Q(s, a | θ_i) is the value of state s and action a.
(1.6) the deep reinforcement learning algorithm clips the reward value and the error term to a limited interval, and the optimal strategy is obtained through gradient descent optimization:
π*(s) = argmax_{a ∈ A} Q*(s, a)
where s is the current state, a is the corresponding action, A is the action set, Q* is the optimal value function, and π* is the optimal strategy.
(2) Testing the trained target model, and sampling state actions as a stealing data set;
(3) constructing a stealing model by using a deep reinforcement learning algorithm, wherein the stealing model is used to imitate the behavior of the target model under attack;
(4) inputting the stealing data set into the stealing model as training samples and training the stealing model with an imitation learning algorithm; the training steps are as follows:
(4.1) using the Actor network in place of the generator G, inputting the output action-state pairs into the discriminator to compare with expert data, and using the output of the discriminator D: S × A → (0, 1) as a reward value to guide the policy learning of imitation learning; the discriminator loss function is expressed as:
L_D = E_{π_t}[log D(s, a)] + E_{π_IL}[log(1 − D(s, a))]
where π_IL is the policy obtained by imitation learning and π_t is the sampled expert policy; the first term, log D(s, a), represents the discriminator's judgment of real data, and the second term, log(1 − D(s, a)), represents its judgment of generated data;
(4.2) during training, the loss function is minimized by gradient derivation so as to back-propagate updates to the network parameters of the discriminator and the Actor; the loss function is:
L = E_{π_IL}[log(1 − D(s, a))] − λ H(π_IL)
where H(π_IL) is the entropy of the imitation policy π_IL, controlled by a constant λ (λ ≥ 0) and used as a policy regularization term in the loss function;
and (4.3) using the trained stealing model to generate adversarial examples for attacking the target model.
(5) Adding a differential privacy protection mechanism to an intermediate layer of the trained target model, and inputting the data output by the target model under the action of the differential privacy protection mechanism into the stealing model; the differential privacy mechanism is represented as follows:
M(d_se) = f(d_se) + N(0, σ²)
where N(0, σ²) is a Gaussian distribution with mean 0 and variance σ²; a single application of the Gaussian mechanism with sensitivity Δf satisfies (ε, δ)-differential privacy when σ ≥ √(2 ln(1.25/δ)) · Δf / ε with ε < 1, where Δf denotes the sensitivity of the input sequence d_se.
(6) The stealing model makes erroneous attack actions under the influence of the data processed by the differential privacy mechanism;
a measure of the model stealing attack is defined in terms of R_stl, the reward value after model stealing, and R_test, the test reward value of the original model; it measures the effectiveness and extent of model stealing against the target model.
The measure of the model stealing defense with the differential privacy protection mechanism added is then defined in terms of R_defense, the reward value of the stolen model under the defense, together with R_stl and R_test; it measures the defense effect of the invention, that is, the degree to which the model stealing attack is degraded under the defense.
Claims (7)
1. A deep reinforcement learning model security reinforcement method based on differential privacy is characterized by comprising the following steps:
sampling data from the environment as a sample set to be trained, constructing a target model by using a deep reinforcement learning algorithm, and inputting the sample set to be trained into the target model to train the target model;
testing the trained target model, and sampling state actions as a stealing data set;
constructing a stealing model by using a deep reinforcement learning algorithm, wherein the stealing model is used to imitate the behavior of the target model under attack;
inputting the stealing data set into the stealing model as training samples and training the stealing model with an imitation learning algorithm;
adding a differential privacy protection mechanism into a trained target model, and inputting data output by the target model under the action of the differential privacy protection mechanism into a stealing model;
the stealing model makes erroneous attack actions under the influence of the data processed by the differential privacy mechanism.
2. The method for security reinforcement of the deep reinforcement learning model based on the differential privacy as claimed in claim 1, wherein the training of the target model comprises the following steps:
using an experience playback mechanism, and carrying out online collection and processing to obtain an online sample set;
storing the online sample set and the sample set to be trained into a playback memory unit to form a transfer sample;
during each training step, randomly extracting transfer samples from the playback memory unit and inputting them into the current value network to obtain the current Q value, updating the parameters with a stochastic gradient descent algorithm during training;
copying parameters of the current value network to a target value network to obtain an optimization target of the current Q value, namely a target Q value;
updating network parameters by minimizing a mean square error between the current Q value and the target Q value;
and clipping the reward value and the error term to a limited interval by the deep reinforcement learning algorithm, the optimal strategy being obtained through gradient descent optimization.
3. The method for security reinforcement of the deep reinforcement learning model based on the differential privacy as claimed in claim 2, wherein the error function between the current Q value and the target Q value is as follows:
L(θ_i) = E[(Y_i − Q(s, a | θ_i))²], with Y_i = r + γ max_{a'} Q(s', a' | θ_i⁻)
and taking the partial derivative with respect to the parameter θ gives the following gradient:
∇_{θ_i} L(θ_i) = E[(Y_i − Q(s, a | θ_i)) ∇_{θ_i} Q(s, a | θ_i)]
where s is the current state, a is the corresponding action, r is the reward value, s' is the next state, θ_i are the model parameters, θ_i⁻ are the target network parameters, γ is the discount factor, E denotes expectation, Y_i is the target Q value, and Q(s, a | θ_i) is the value of state s and action a.
4. The method for security reinforcement of the deep reinforcement learning model based on the differential privacy as claimed in claim 2, wherein the optimal strategy is as follows:
π*(s) = argmax_{a ∈ A} Q*(s, a)
where s is the current state, a is the corresponding action, A is the action set, Q* is the optimal value function, and π* is the optimal strategy.
5. The method for security reinforcement of the deep reinforcement learning model based on the differential privacy as claimed in claim 1, wherein the training of the stealing model comprises the following steps:
an Actor network is used in place of the generator G; the actions and states output by the Actor are input into the discriminator in pairs and compared with the expert data, and the output of the discriminator D: S × A → (0, 1) is used as a reward value to guide the policy learning of imitation learning; the discriminator loss function is expressed as:
L_D = E_{π_t}[log D(s, a)] + E_{π_IL}[log(1 − D(s, a))]
where π_IL is the policy obtained by imitation learning and π_t is the sampled expert policy; the first term, log D(s, a), represents the discriminator's judgment of real data, and the second term, log(1 − D(s, a)), represents its judgment of generated data;
during training, the loss function is minimized by gradient derivation so as to back-propagate updates to the network parameters of the discriminator and the Actor; the loss function is:
L = E_{π_IL}[log(1 − D(s, a))] − λ H(π_IL)
where H(π_IL) is the entropy of the imitation policy π_IL, controlled by a constant λ (λ ≥ 0) and used as a policy regularization term in the loss function;
and the trained stealing model is used to generate adversarial examples for attacking the target model.
6. The method for security reinforcement of the deep reinforcement learning model based on the differential privacy as claimed in claim 1, wherein: the differential privacy mechanism is represented as follows:
M(d_se) = f(d_se) + N(0, σ²)
where N(0, σ²) is a Gaussian distribution with mean 0 and variance σ²; a single application of the Gaussian mechanism with sensitivity Δf satisfies (ε, δ)-differential privacy when σ ≥ √(2 ln(1.25/δ)) · Δf / ε with ε < 1, where Δf denotes the sensitivity of the input sequence d_se; and
the differential privacy mechanism is added to the intermediate layer of the target model.
7. A differential privacy based deep reinforcement learning model security reinforcement apparatus comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, characterized in that: the computer processor, when executing the computer program, implements the differential privacy based deep reinforcement learning model security reinforcement method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110766183.8A CN113360917A (en) | 2021-07-07 | 2021-07-07 | Deep reinforcement learning model security reinforcement method and device based on differential privacy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110766183.8A CN113360917A (en) | 2021-07-07 | 2021-07-07 | Deep reinforcement learning model security reinforcement method and device based on differential privacy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113360917A true CN113360917A (en) | 2021-09-07 |
Family
ID=77538674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110766183.8A Pending CN113360917A (en) | 2021-07-07 | 2021-07-07 | Deep reinforcement learning model security reinforcement method and device based on differential privacy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113360917A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114547687A (en) * | 2022-02-22 | 2022-05-27 | 浙江星汉信息技术股份有限公司 | Question-answering system model training method and device based on differential privacy technology |
WO2023206777A1 (en) * | 2022-04-29 | 2023-11-02 | 浪潮(北京)电子信息产业有限公司 | Model generation method and apparatus, operation control method and apparatus, device, and storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200311540A1 (en) * | 2019-03-28 | 2020-10-01 | International Business Machines Corporation | Layer-Wise Distillation for Protecting Pre-Trained Neural Network Models |
CN112052456A (en) * | 2020-08-31 | 2020-12-08 | 浙江工业大学 | Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents |
CN112241554A (en) * | 2020-10-30 | 2021-01-19 | 浙江工业大学 | Model stealing defense method and device based on differential privacy index mechanism |
CN112884131A (en) * | 2021-03-16 | 2021-06-01 | 浙江工业大学 | Deep reinforcement learning strategy optimization defense method and device based on simulation learning |
Non-Patent Citations (2)
Title |
---|
Liu Quan et al., "A Survey of Deep Reinforcement Learning", Chinese Journal of Computers * |
Zhao Jingwen, "Research on Privacy Protection in Deep Learning Based on Differential Privacy", China Master's Theses Full-text Database, Information Science and Technology Series * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107483486B (en) | Network defense strategy selection method based on random evolution game model | |
CN112052456A (en) | Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents | |
CN112884131A (en) | Deep reinforcement learning strategy optimization defense method and device based on simulation learning | |
CN113360917A (en) | Deep reinforcement learning model security reinforcement method and device based on differential privacy | |
CN113179263A (en) | Network intrusion detection method, device and equipment | |
Zennaro et al. | Modelling penetration testing with reinforcement learning using capture‐the‐flag challenges: Trade‐offs between model‐free learning and a priori knowledge | |
CN111282267A (en) | Information processing method, information processing apparatus, information processing medium, and electronic device | |
Mo et al. | MCTSteg: A Monte Carlo tree search-based reinforcement learning framework for universal non-additive steganography | |
CN113392396A (en) | Strategy protection defense method for deep reinforcement learning | |
CN111488904A (en) | Image classification method and system based on confrontation distribution training | |
CN113420326A (en) | Deep reinforcement learning-oriented model privacy protection method and system | |
CN113033822A (en) | Antagonistic attack and defense method and system based on prediction correction and random step length optimization | |
CN115033878A (en) | Rapid self-game reinforcement learning method and device, computer equipment and storage medium | |
CN112001480A (en) | Small sample amplification method for sliding orientation data based on generation of countermeasure network | |
CN113704098B (en) | Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling | |
CN110598794A (en) | Classified countermeasure network attack detection method and system | |
CN114358278A (en) | Training method and device of neural network model | |
CN111144243B (en) | Household pattern recognition method and device based on counterstudy | |
Lin et al. | An uncertainty-incorporated approach to predict the winner in StarCraft II using neural processes | |
CN116306268A (en) | Shield tunneling simulation model parameter identification method system based on federal reinforcement learning | |
CN113344071B (en) | Intrusion detection algorithm based on depth strategy gradient | |
Cranford et al. | Accounting for Uncertainty in Deceptive Signaling for Cybersecurity | |
EP4116853A1 (en) | Computer-readable recording medium storing evaluation program, evaluation method, and information processing device | |
CN114036503B (en) | Migration attack method and device, electronic equipment and storage medium | |
CN113313236B (en) | Deep reinforcement learning model poisoning detection method and device based on time sequence neural pathway |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210907 |