CN115620100A - Active learning-based neural network black box attack method - Google Patents

Active learning-based neural network black box attack method


Publication number
CN115620100A
Authority
CN
China
Prior art keywords
model
attack
target
training
sample
Prior art date
Legal status
Pending
Application number
CN202211188638.3A
Other languages
Chinese (zh)
Inventor
翔云
胡晋瑄
李宇浩
陈作辉
朱家琪
宣琦
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202211188638.3A
Publication of CN115620100A
Pending legal-status Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S40/00Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
    • Y04S40/20Information technology specific aspects, e.g. CAD, simulation, modelling, system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A neural network black box attack method based on active learning comprises the following steps: (S1) training a target model with target training data, and pre-training a local surrogate model with surrogate-model training data that is disjoint from the target training data; (S2) screening the non-target training data with an active learning method, and retraining the model pre-trained in S1; (S3) generating adversarial examples against the surrogate model trained in S2 with an adversarial attack method, and transferring them to the target model for the attack. The surrogate-model training method provided by the invention greatly reduces the number of queries to the target model under the gray-box setting, lowers the query cost, and improves the attack success rate against the target model under the black-box setting.

Description

Neural network black box attack method based on active learning
Technical Field
The invention relates to the field of deep learning adversarial attacks, in particular to a neural network black box attack method based on active learning.
Background
Deep neural network models are widely deployed in industry, and as their use grows, whether the security of these models meets users' requirements has become an important question. For an attacker, applying a small perturbation that is invisible to the naked eye to a sample is likely to make the neural network misclassify that sample. Adversarial attacks mainly comprise black-box attacks and white-box attacks. A white-box attack requires the attacker to have prior knowledge about the model, including the model structure, the weights, the data used to train the target model, and the classification results (prediction labels and prediction probabilities). With this information, a white-box attack can compute the gradient of the target model and craft adversarial examples with a high attack success rate against it. Although the success rate of white-box attacks is high, such prior knowledge is unrealistic in a real scenario, and black-box attacks better match real conditions. By contrast, a black-box attack has very limited knowledge of the target model. In real-world applications, DNN classification services typically do not disclose information about their systems, so as to prevent threats to network security: providers want to keep adversaries and competitors from obtaining the details of their models. An adversary can only access the target model through the exposed service interface, i.e., through queries. Therefore, gradient-based white-box attack algorithms cannot be applied directly in the black-box scenario, and the number of queries and the query cost become an important issue.
Black-box attacks can be broadly classified into query-based attacks, transfer-based attacks, and surrogate-model attacks. The main drawback of existing query-based black-box attacks is that they require too many queries and the query cost is too high. The main drawback of transfer-based attacks is that the attack success rate is low, i.e., the attack effect is poor. In a surrogate-model attack, the attacker first queries the victim model to obtain labels, and then trains a surrogate model with the queried labels. The surrogate and victim models should be as close as possible so that the adversarial examples transfer better. After proper training, subsequent attacks can reuse the obtained surrogate model without querying the victim model again. Compared with the former two approaches, a surrogate-model attack reduces the number of queries, but the surrogate-training stage still requires many queries. A recent black-box attack proposes a method that trains a surrogate model without real data (DaST: Data-free Substitute Training for Adversarial Attacks); although this method works well on some data sets and models, it has the fatal drawback that the number of queries is extremely large, and such a query cost is unrealistic in real scenarios. Knockoff Nets adaptively selects query images with reinforcement learning to steal the functionality of the victim model and trains a surrogate model that fits the target model for a surrogate-model attack. Practical black-box attacks generate a surrogate model that mimics the decision boundary of the attacked model, choosing a generic model structure according to common knowledge, e.g., a CNN for image recognition. The black-box attack method for brain-computer interface systems of publication CN112149609A trains a surrogate model by searching for samples on both sides of the surrogate model's decision boundary; it achieves the same or better attack performance with fewer queries, and the generated adversarial examples carry little perturbation, are hardly distinguishable from the original EEG signal in the time and frequency domains, and are not easily detected, which greatly improves the efficiency of black-box attacks in brain-computer interface systems. However, it does not cover the more difficult targeted attack, its application field is very limited, it only improves the success rate of the easier untargeted attack, and the generated surrogate model is only suitable for Jacobian-based attack methods. From the perspective of the decision boundary, the different data distributions and the randomness of the training process cause the surrogate model and the target model to have different decision boundaries. This shows that in query-based and transfer-based attacks the decision boundary of the surrogate model differs significantly from that of the target model. Based on this observation, the invention proposes to fit the local surrogate model to the decision boundary of the target model through active learning.
In addition, many existing black-box attacks assume that the training data of the target model and of the surrogate model are the same, which is a strong assumption for the attacker. The setting of the present technique does not need this strong assumption, and the samples used in the active learning stage are also disjoint from the training samples of the target model.
Disclosure of Invention
The invention provides a neural network black box attack method based on active learning to overcome the defects of the prior art.
The invention aims to solve the problem of the high query cost currently faced in the field of deep learning adversarial attacks, and provides a surrogate-model attack method that reduces the number of black-box attack queries and improves the attack success rate.
The technical concept of the invention is as follows: the technical difficulty of attacking in the black-box scenario in the current deep learning security field mainly lies in that the number of queries needed to train the surrogate model is too large, the query cost is too high, and the success rate of transfer attacks is poor in the targeted-attack scenario. Incorporating active learning means querying samples near the model's decision boundary, which benefits training the surrogate model to approximate the target model. The invention therefore aims to solve the problems of excessive black-box query cost and poor attack effect.
The technical solution adopted by the invention to achieve the above aim is as follows:
a neural network black box attack method based on active learning is characterized by comprising the following steps:
s1: training a target model by using target training data, and pre-training a local substitution model by using non-target training data;
s2: screening non-target training data by using an active learning method, and retraining the model pre-trained in S1;
s3: and (4) generating a countermeasure sample for the S2-trained substitute model by using a countermeasure attack method, and migrating to the target model for attack.
Preferably, the step S1 specifically includes:
S1.1: randomly shuffling the surrogate-model training data;
S1.2: taking samples of the randomly shuffled target-model training data to train the target model, ensuring that these samples do not overlap with those used later to train the local surrogate model, and saving the model;
S1.3: training the local surrogate model with the inputs and outputs obtained by querying the target model with the surrogate-model training data.
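As an illustration of steps S1.1–S1.3, the following is a minimal PyTorch-style sketch of pre-training the local surrogate model on labels queried from the black-box target model. The helper names (query_target_labels, pretrain_surrogate), the batch size and the epoch count are assumptions introduced for the example, not values fixed by the invention.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def query_target_labels(target_model, loader, device="cuda"):
    """S1.3: query the black-box target model and record its output labels."""
    target_model.eval()
    xs, ys = [], []
    with torch.no_grad():
        for x, _ in loader:                          # ground-truth labels are not used
            ys.append(target_model(x.to(device)).argmax(dim=1).cpu())
            xs.append(x)
    return torch.cat(xs), torch.cat(ys)

def pretrain_surrogate(surrogate, xs, ys, epochs=50, device="cuda"):
    """Train the local surrogate on (input, target-model label) pairs."""
    opt = torch.optim.SGD(surrogate.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
    loader = DataLoader(TensorDataset(xs, ys), batch_size=128, shuffle=True)
    surrogate.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(surrogate(x.to(device)), y.to(device)).backward()
            opt.step()
    return surrogate
```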
Preferably, the step S2 specifically includes:
S2.1: feeding the remaining samples of the non-target training data into the local surrogate model trained in S1, and obtaining the confidence of each sample under the surrogate model;
S2.2: setting a confidence threshold, and taking the pictures from S2.1 whose confidence is below the threshold for subsequent training;
S2.3: adding adversarial perturbation to the images below the confidence threshold, and feeding the images before and after perturbation into the target model together for querying to obtain the corresponding labels;
S2.4: retraining the local surrogate model with the inputs and outputs obtained above, and saving the model.
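A minimal sketch of the confidence screening and querying in S2.1–S2.3 follows, assuming a PyTorch surrogate and target. The one-step perturbation helper fgsm_perturb and the default threshold value are illustrative assumptions; only the idea of keeping low-confidence samples and querying the target on the clean/perturbed pairs comes from the steps above.

```python
import torch
import torch.nn.functional as F

def screen_low_confidence(surrogate, pool_loader, threshold=0.6, device="cuda"):
    """S2.1/S2.2: keep samples whose top-1 confidence under the surrogate is below the threshold."""
    surrogate.eval()
    kept = []
    with torch.no_grad():
        for x, _ in pool_loader:
            conf = F.softmax(surrogate(x.to(device)), dim=1).max(dim=1).values
            kept.append(x[(conf < threshold).cpu()])
    return torch.cat(kept)

def fgsm_perturb(surrogate, x, eps=0.03, device="cuda"):
    """One-step adversarial perturbation computed on the surrogate (illustrative choice)."""
    x = x.clone().to(device).requires_grad_(True)
    logits = surrogate(x)
    F.cross_entropy(logits, logits.argmax(dim=1)).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach().cpu()

def query_clean_and_perturbed(target, x_clean, x_adv, device="cuda"):
    """S2.3: query the target model on both versions to obtain labels for retraining."""
    with torch.no_grad():
        y_clean = target(x_clean.to(device)).argmax(dim=1).cpu()
        y_adv = target(x_adv.to(device)).argmax(dim=1).cpu()
    return torch.cat([x_clean, x_adv]), torch.cat([y_clean, y_adv])
```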
Preferably, the step S3 specifically includes:
S3.1: using the surrogate model trained in S2, attacking the samples of the randomly shuffled test-set data under different attack methods in the white-box setting of the local surrogate model, using the gradient information of the surrogate model, and saving the corresponding adversarial examples for the subsequent transfer attack;
S3.2: performing a transfer attack on the target model with the adversarial examples generated from the surrogate model, using attack methods including FGSM, PGD, BIM and CW. The BIM attack mainly uses the gradient of the model to determine the direction in which to add the perturbation, and screens for the optimal perturbation iteratively; the CW attack treats the input image as the parameter to be optimized and trains the input with the model parameters fixed so as to add the perturbation.
These attacks are defined as follows:
FGSM: $x_{adv} = x + \varepsilon \cdot \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$
BIM: $x_{n+1} = \operatorname{clip}_{x,\varepsilon}\left\{ x_n + a \cdot \operatorname{sign}\left(\nabla_{x_n} J(\theta, x_n, y)\right) \right\}$
PGD: $x_{n+1} = \operatorname{clip}_{x,\varepsilon}\left\{ x_n + a \cdot \operatorname{sign}\left(\nabla_{x_n} J(\theta, x_n, y)\right) \right\}$, starting from a randomly perturbed $x_0$
CW: $\min_{w_n} \left\| \tfrac{1}{2}\left(\tanh(w_n)+1\right) - x_n \right\|_2^2 + c \cdot f\!\left(\tfrac{1}{2}\left(\tanh(w_n)+1\right)\right)$
where $x$ is the sample before the perturbation is added; $\varepsilon$ is an adjustment factor; $\theta$ denotes the model parameters; $\nabla_x J(\theta, x, y)$ denotes back-propagating the loss value to the input image and computing the gradient; the sign() function returns the sign of its input, taking 1 when the input is greater than 0, 0 when the input equals 0, and -1 when the input is less than 0; the clip() function constrains the perturbation within a prescribed range to prevent the attack from failing because of an excessively large gradient step; $a$ is the parameter controlling the iteration step; $w_n$ is the optimization parameter; $x_n$ denotes the clean sample.
S3.3: after the adversarial examples are generated and the transfer attack is performed, the targeted attack success rate and the untargeted attack success rate are recorded separately and compared with the results obtained without the active learning method.
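The FGSM and iterative (BIM/PGD) updates above can be written directly against the local surrogate model, for example as in the following PyTorch-style sketch. It follows the formulas and symbols defined above rather than any particular attack library, and projecting back into both the ε-ball and the [0, 1] pixel range is an assumption about the image domain.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def iterative_attack(model, x, y, eps, a, nb_iter, random_start=False):
    """x_{n+1} = clip_{x,eps}(x_n + a * sign(grad J)); BIM when random_start=False, PGD otherwise."""
    x_adv = x.clone()
    if random_start:                                   # PGD starts from a random point in the eps-ball
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(nb_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        F.cross_entropy(model(x_adv), y).backward()
        x_adv = x_adv + a * x_adv.grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # clip() to the eps-ball around x
        x_adv = x_adv.clamp(0, 1)                               # keep a valid image
    return x_adv.detach()
```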
The beneficial effects of the invention are as follows:
(1) The surrogate-model training method provided by the invention greatly reduces the number of queries to the target model under the gray-box setting and reduces the query cost;
(2) The surrogate-model training method provided by the invention improves, to a certain extent, the attack success rate against the target model under the black-box setting.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 shows the results of training the target models: FIG. 2(a) shows the result of training the vgg16_bn model on the selected part of the CIFAR10 dataset, FIG. 2(b) the result of training the Resnet18 model on the selected part of the CIFAR10 dataset, and FIG. 2(c) the result of training the MobileNetV2 model on the selected part of the CIFAR10 dataset, where the abscissa is the training epoch and the ordinate is the test accuracy on the selected part of the CIFAR10 test set;
FIG. 3 shows the training process in which the decision boundary of the surrogate model is fitted to that of the target model.
Detailed Description
The following detailed description of specific embodiments of the invention is provided in connection with the accompanying drawings.
Example one
Referring to fig. 1 to 3, a neural network black box attack method based on active learning includes the following steps:
s1, randomly scrambling data of a CIFAR10 training set, selecting a model structure of a target model, training the target model, and pre-training a local substitution model, wherein the training data of the substitution model and the target model are not repeated, and the training results of the target model and the substitution model are as shown in figure 2, and specifically comprise the following steps:
setting a pseudo-random seed number seed23 for a training set of a CIFAR10 to ensure that the disordering sequence of the disordering samples of the data set is consistent each time, wherein the randomly disordering samples of target training data are selected for training a target model, the structure of the target model respectively selects Vgg16_ bn, resnet18 and MobileNet V2 as a typical light weighting model, and the training parameters shared by the three models are as follows: in the SGD optimizer, the momentum coefficient is set to be 0.9, the weight attenuation coefficient weight _ decay is set to be 5e-4, and the training parameters of vgg16' u bn include: the initial learning rate is 0.001, the learning rate is reduced by ten times at 135epoch and 185epoch, and the training parameters of Resnet18 are: the initial learning rate is 0.1, the learning rate is reduced by ten times at 135epoch and 185epoch, and the training parameters of MobileNetV2 are: the initial learning rate is 0.1, the learning rate is reduced by ten times at 20epoch,50epoch and 80epoch, wherein the accuracy of vgg16_ bn reaches 88.45% on the selected 5000 CIFAR10 test sets, the accuracy of Resnet18 reaches 91.79% on the selected 5000 CIFAR10 test sets, and the accuracy of MobileNet V2 reaches 85.86% on the selected 5000 CIFAR10 test sets. And setting random seed numbers which are the same as those of the S1 to a training set of the CIFAR10, ensuring that the disorder sequence is consistent with the S1, wherein a sample of one-half proportion of randomly disordered training data of the substitution model is selected to be input into the target model trained by the S1 for query output, and an output label is obtained and recorded for subsequent training of the local substitution model.
S2: screening non-target training data by using an active learning method, retraining the model pre-trained in the S2, and performing an active learning process and a surrogate model retraining process as shown in FIG. 3, which specifically includes:
the test set of CIFAR10 is set with the same random seed number as S1, S2. Dividing a data set, taking a sample of the remaining half proportion of the random disturbed alternative model training data as non-target training data to be screened for active learning, screening the sample based on the sample close to a decision boundary with higher uncertainty, setting a confidence threshold to be 0.6, wherein in S3.1, pictures with confidence degrees lower than the threshold are samples with high training importance for the alternative model, storing the samples with important training and retraining the local alternative model, wherein the training parameters of three network structures are preferred: the initial learning rate is 0.1, the learning rate drops by ten times at 135epoch and 185epoch, and the surrogate model retraining process is as follows: inputting a local substitution model for the selected sample to check the confidence, saving the sample below 0.6, inputting the target model, and inquiring the output label by the request times to continuously fit and approximate the decision boundary of the target model, wherein the color of the triangle represents the real label of the sample, and the color of the triangle frame represents the classification of the model.
S3: generating a countermeasure sample for the substitute model trained by the S3 by using a white box attack method, and migrating the countermeasure sample to a target model for attack, wherein the method specifically comprises the following steps:
carrying out white box attack on five thousand samples after a CIFAR10 test set which is randomly disturbed by using the gradient information of a local substitution model after S3 retraining by using a PGD attack method, obtaining antagonistic samples, transferring the antagonistic samples to a corresponding target model for non-target attack and target attack with higher difficulty, counting the samples which pass the model and are classified wrongly, and calculating the success rate of the non-target attack, wherein in the target attack, the samples are classified wrongly, then the number of the samples which can be classified into a specified category under the target attack is calculated, and the success rate of the target attack is calculated, wherein the common parameters of four attacks are as follows: the loss function is a cross entropy loss function, and the attack parameters of the FGSM method are as follows: the attack step length eps is 0.26, and the attack parameters of the BIM method are as follows: the maximum perturbation eps is 0.25, the iteration number nb _ iter is 120, the attack step length eps _ iter is 0.02, the minimum value clip _ min of each input dimension is 0.0, the maximum value clip _ max of each input dimension is 1.0, and the attack parameters of the CW method are as follows: the learning rate of the attack algorithm learning _ rate is 0.45, the binary query times binding _ search _ steps are 10, the maximum iteration times max _ iterations are 120, and the attack parameters of the PGD method are as follows: the maximum perturbation eps is 0.25, the iteration number nb _ iter is 11, the attack step length eps _ iter is 0.03, the minimum value clip _ min of each input dimension is 0.0, and the maximum value clip _ max of each input dimension is 1.0.
Example two
The method for improving the robustness of the power grid system based on active learning comprises the following steps:
S1: training a power grid system model with electronic information data (e-mails, electromagnetic signals, and the like) from the data domain of the target model, and pre-training a local surrogate model with electronic information data from a non-target data domain, with no overlap between the two sets of training data;
S2: screening the electronic information data of the non-target data domain with an active learning method, feeding the data into the target model for querying, and retraining the local pre-trained model after obtaining the query results;
S3: generating adversarial examples against the surrogate model trained in S2 with an adversarial attack method and transferring them to the power grid system target model, thereby disrupting the opposing power grid system's detection algorithm so that an attacker cannot identify the opposing side's electronic information data to attack the power grid system, which improves the robustness of the power grid system.
The specific contents of each step of the present embodiment include the specific contents of each step of the previous embodiment.
The step S1 specifically includes:
s1.1: random shuffling using surrogate model training data;
s1.2: taking a sample of the randomly disturbed target model training data for training the target model, avoiding the repetition of the sample and a subsequent training local substitution model, and storing the model;
s1.3: and (4) using the alternative model training data to request the input and output obtained by the target model to train the local alternative model.
The step S2 specifically includes:
S2.1: feeding the remaining samples of the non-target training data into the local surrogate model trained in S1, and obtaining the confidence of each sample under the surrogate model;
S2.2: setting a confidence threshold, and taking the pictures from S2.1 whose confidence is below the threshold for subsequent training;
S2.3: adding adversarial perturbation to the images below the confidence threshold, and feeding the images before and after perturbation into the target model together for querying to obtain the corresponding labels;
S2.4: retraining the local surrogate model with the inputs and outputs obtained above, and saving the model.
The step S3 specifically includes:
S3.1: using the surrogate model trained in S2, attacking the samples of the randomly shuffled test-set data under different attack methods in the white-box setting of the local surrogate model, using the gradient information of the surrogate model, and saving the corresponding adversarial examples for the subsequent transfer attack;
S3.2: performing a transfer attack on the target model with the adversarial examples generated from the surrogate model, using attack methods including FGSM, PGD, BIM and CW. The BIM attack mainly uses the gradient of the model to determine the direction in which to add the perturbation, and screens for the optimal perturbation iteratively; the CW attack treats the input image as the parameter to be optimized and trains the input with the model parameters fixed so as to add the perturbation.
These attacks are defined as follows:
FGSM: $x_{adv} = x + \varepsilon \cdot \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$
BIM: $x_{n+1} = \operatorname{clip}_{x,\varepsilon}\left\{ x_n + a \cdot \operatorname{sign}\left(\nabla_{x_n} J(\theta, x_n, y)\right) \right\}$
PGD: $x_{n+1} = \operatorname{clip}_{x,\varepsilon}\left\{ x_n + a \cdot \operatorname{sign}\left(\nabla_{x_n} J(\theta, x_n, y)\right) \right\}$, starting from a randomly perturbed $x_0$
CW: $\min_{w_n} \left\| \tfrac{1}{2}\left(\tanh(w_n)+1\right) - x_n \right\|_2^2 + c \cdot f\!\left(\tfrac{1}{2}\left(\tanh(w_n)+1\right)\right)$
where $x$ is the sample before the perturbation is added; $\varepsilon$ is an adjustment factor; $\theta$ denotes the model parameters; $\nabla_x J(\theta, x, y)$ denotes back-propagating the loss value to the input image and computing the gradient; the sign() function returns the sign of its input, taking 1 when the input is greater than 0, 0 when the input equals 0, and -1 when the input is less than 0; the clip() function constrains the perturbation within a prescribed range to prevent the attack from failing because of an excessively large gradient step; $a$ is the parameter controlling the iteration step; $w_n$ is the optimization parameter; $x_n$ denotes the clean sample.
S3.3: after the adversarial examples are generated and the transfer attack is performed, the targeted attack success rate and the untargeted attack success rate are recorded separately and compared with the results obtained without the active learning method.
The embodiments described in this specification merely illustrate implementations of the inventive concept, and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments; it also covers equivalents that may occur to those skilled in the art based on the inventive concept.

Claims (4)

1. A power grid system robustness improvement method based on active learning, characterized by comprising the following steps:
S1: training a target model with target training data, and pre-training a local surrogate model with non-target training data;
S2: screening the non-target training data with an active learning method, and retraining the model pre-trained in S1;
S3: generating adversarial examples against the surrogate model trained in S2 with an adversarial attack method, and transferring them to the target model for the attack.
2. The surrogate model attack method based on improving the success rate of model migration attack as claimed in claim 1, wherein: the step S1 specifically includes:
S1.1: randomly shuffling the surrogate-model training data;
S1.2: taking samples of the randomly shuffled target-model training data to train the target model, ensuring that these samples do not overlap with those used later to train the local surrogate model, and saving the model;
S1.3: training the local surrogate model with the inputs and outputs obtained by querying the target model with the surrogate-model training data.
3. The surrogate model attack method based on improving the success rate of model migration attack as claimed in claim 1, wherein: the step S2 specifically includes:
S2.1: feeding the remaining samples of the non-target training data into the local surrogate model trained in S1, and obtaining the confidence of each sample under the surrogate model;
S2.2: setting a confidence threshold, and taking the pictures from S2.1 whose confidence is below the threshold for subsequent training;
S2.3: adding adversarial perturbation to the images below the confidence threshold, and feeding the images before and after perturbation into the target model together for querying to obtain the corresponding labels;
S2.4: retraining the local surrogate model with the inputs and outputs obtained above, and saving the model.
4. The surrogate model attack method based on improving the success rate of model migration attack as claimed in claim 1, wherein: the step S3 specifically includes:
S3.1: using the surrogate model trained in S2, attacking the samples of the randomly shuffled test-set data under different attack methods in the white-box setting of the local surrogate model, using the gradient information of the surrogate model, and saving the corresponding adversarial examples for the subsequent transfer attack;
S3.2: performing a transfer attack on the target model with the adversarial examples generated from the surrogate model, using attack methods including FGSM, PGD, BIM and CW; the BIM attack mainly uses the gradient of the model to determine the direction in which to add the perturbation and screens for the optimal perturbation iteratively; the CW attack treats the input image as the parameter to be optimized and trains the input with the model parameters fixed so as to add the perturbation, the attacks being defined as follows:
FGSM: $x_{adv} = x + \varepsilon \cdot \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$
BIM: $x_{n+1} = \operatorname{clip}_{x,\varepsilon}\left\{ x_n + a \cdot \operatorname{sign}\left(\nabla_{x_n} J(\theta, x_n, y)\right) \right\}$
PGD: $x_{n+1} = \operatorname{clip}_{x,\varepsilon}\left\{ x_n + a \cdot \operatorname{sign}\left(\nabla_{x_n} J(\theta, x_n, y)\right) \right\}$, starting from a randomly perturbed $x_0$
CW: $\min_{w_n} \left\| \tfrac{1}{2}\left(\tanh(w_n)+1\right) - x_n \right\|_2^2 + c \cdot f\!\left(\tfrac{1}{2}\left(\tanh(w_n)+1\right)\right)$
where $x$ is the sample before the perturbation is added; $\varepsilon$ is an adjustment factor; $\theta$ denotes the model parameters; $\nabla_x J(\theta, x, y)$ denotes back-propagating the loss value to the input image and computing the gradient; the sign() function returns the sign of its input, taking 1 when the input is greater than 0, 0 when the input equals 0, and -1 when the input is less than 0; the clip() function constrains the perturbation within a prescribed range to prevent the attack from failing because of an excessively large gradient step; $a$ is the parameter controlling the iteration step; $w_n$ is the optimization parameter; $x_n$ denotes the clean sample;
S3.3: after the adversarial examples are generated and the transfer attack is performed, recording the targeted attack success rate and the untargeted attack success rate separately, and comparing them with the results obtained without the active learning method.
CN202211188638.3A 2022-09-28 2022-09-28 Active learning-based neural network black box attack method Pending CN115620100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211188638.3A CN115620100A (en) 2022-09-28 2022-09-28 Active learning-based neural network black box attack method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211188638.3A CN115620100A (en) 2022-09-28 2022-09-28 Active learning-based neural network black box attack method

Publications (1)

Publication Number Publication Date
CN115620100A true CN115620100A (en) 2023-01-17

Family

ID=84860511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211188638.3A Pending CN115620100A (en) 2022-09-28 2022-09-28 Active learning-based neural network black box attack method

Country Status (1)

Country Link
CN (1) CN115620100A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523032A (en) * 2023-03-13 2023-08-01 之江实验室 Image text double-end migration attack method, device and medium
CN116523032B (en) * 2023-03-13 2023-09-29 之江实验室 Image text double-end migration attack method, device and medium

Similar Documents

Publication Publication Date Title
Agarwal et al. Image transformation-based defense against adversarial perturbation on deep learning models
KR102304661B1 (en) Attack-less Adversarial Training Method for a Robust Adversarial Defense
CN112115469B (en) Edge intelligent mobile target defense method based on Bayes-Stackelberg game
Zhu et al. Dualde: Dually distilling knowledge graph embedding for faster and cheaper reasoning
CN111866004A (en) Security assessment method, apparatus, computer system, and medium
CN112016686A (en) Antagonism training method based on deep learning model
CN115620100A (en) Active learning-based neural network black box attack method
CN113627543A (en) Anti-attack detection method
Guo et al. ELAA: An efficient local adversarial attack using model interpreters
Deng et al. Frequency-tuned universal adversarial perturbations
Hui et al. FoolChecker: A platform to evaluate the robustness of images against adversarial attacks
CN113435264A (en) Face recognition attack resisting method and device based on black box substitution model searching
CN113034332A (en) Invisible watermark image and backdoor attack model construction and classification method and system
CN115481719B (en) Method for defending against attack based on gradient
Goodman Transferability of adversarial examples to attack cloud-based image classifier service
Yin et al. Adversarial attack, defense, and applications with deep learning frameworks
WO2023142282A1 (en) Task amplification-based transfer attack method and apparatus
CN115719085A (en) Deep neural network model inversion attack defense method and equipment
CN115632843A (en) Target detection-based generation method of backdoor attack defense model
CN113159317B (en) Antagonistic sample generation method based on dynamic residual corrosion
Dong et al. Mind your heart: Stealthy backdoor attack on dynamic deep neural network in edge computing
CN112884053B (en) Website classification method, system, equipment and medium based on image-text mixed characteristics
CN113486736A (en) Black box anti-attack method based on active subspace and low-rank evolution strategy
Xie et al. GAME: Generative-based adaptive model extraction attack
Li et al. A fast two-stage black-box deep learning network attacking method based on cross-correlation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination