CN114693973A - Black box confrontation sample generation method based on Transformer model - Google Patents

Black box confrontation sample generation method based on Transformer model

Info

Publication number
CN114693973A
Authority
CN
China
Prior art keywords
coding block
model
original image
self
sample
Prior art date
Legal status
Pending
Application number
CN202210332993.7A
Other languages
Chinese (zh)
Inventor
刘琚
韩艳阳
刘晓玺
顾凌晨
江潇
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University
Priority to CN202210332993.7A
Publication of CN114693973A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a black-box adversarial example generation method based on the Transformer model, belonging to the technical field of artificial intelligence. The method considers the influence of different encoder blocks on adversarial example performance and classifies the encoder blocks by an encoder-block weight score. Different strategies are applied to different encoder blocks to generate adversarial examples, balancing the influence of image information and model information on the perturbation, stabilizing the update direction of the perturbation, and improving the attack success rate and transferability of the adversarial examples. Finally, an adaptive weight is designed to adjust the perturbation magnitude at key pixels, so that the attack capability of the adversarial perturbation is improved while the perturbation remains imperceptible to the human eye. The invention significantly improves the transferability of adversarial examples, can effectively evaluate and improve the security of artificial intelligence technology, and its effectiveness is fully demonstrated by tests on an image classification task.

Description

Black box confrontation sample generation method based on Transformer model
Technical Field
The invention relates to a black-box adversarial example generation method based on the Transformer model and belongs to the technical field of artificial intelligence.
Background
With the rapid development of artificial intelligence, neural network models play an important role in many areas of society, especially in computer vision, for example face recognition, autonomous driving, and public security. However, studies have shown that neural network models are highly vulnerable to adversarial examples with strong transferability: by adding noise to an original image on a local model, an image that is visually similar to the original can be generated that causes other neural network models to produce erroneous outputs. As more and more neural network models are deployed at scale in real-world scenarios, their security has raised concerns. Meanwhile, the lack of interpretability of neural network models makes their performance depend on the training dataset, limiting further improvement and wide application. Therefore, how to build trustworthy artificial intelligence and how to evaluate and improve the security of neural network models are problems that urgently need to be solved.
To improve the security of neural network models in applications and reduce the potential threat posed by attackers in real scenarios, on the one hand, the security of a model can be evaluated by testing it with adversarial examples; on the other hand, adding adversarial examples to the original training set can improve the stability and security of the model. Adversarial examples have therefore become an important means of evaluating and improving model performance and are a hot research topic in artificial intelligence. At present, various neural network architectures have emerged, such as convolutional neural networks, generative adversarial networks, and ViTs (Vision Transformers). Because of their wider receptive field, ViTs integrate global information better and, especially when trained on large-scale datasets, achieve excellent performance in many computer vision tasks, including object detection and image classification. Most existing adversarial attack techniques were designed for convolutional neural networks: they rely excessively on convolutional network model information, have difficulty generating adversarial examples that can attack multiple different types of models simultaneously, suffer from low attack success rates and weak transferability on ViTs models, and therefore cannot be used to evaluate and improve the security of ViTs. Studying adversarial example generation methods for ViTs is therefore important.
Disclosure of Invention
To address the low attack success rate and weak transferability on ViTs models in the prior art, the invention provides a black-box adversarial example generation method based on the Transformer model. The method has strong attack performance and reduces the dependence on model information, thereby improving the transferability of the adversarial examples. In the black-box setting, it can simultaneously attack various convolutional neural networks and ViTs models. During model training, fusing the adversarial examples generated by this method with the original training set can effectively improve the overall performance of the model. The invention also provides a more effective way to evaluate the security of neural network models in application scenarios.
The technical solution adopted by the invention is as follows:
A black-box adversarial example generation method based on the Transformer model. For the different attributes of the encoder blocks in a Transformer, the method adopts two strategies to jointly generate adversarial examples: a cross-entropy loss strategy for the robust encoder blocks and a self-attention feature loss strategy for the non-robust encoder blocks. This avoids redundant model information and improves the transferability of the adversarial examples. The method comprises the following steps:
Step 1: acquire original images and their corresponding labels to form an original dataset, and initialize an adversarial perturbation of the same size as the original image;
Step 2: generate several noise images of the same size as the original image to form a negative sample set;
Step 3: divide the encoder blocks of a Vision Transformer (ViT) model into robust encoder blocks and non-robust encoder blocks according to the original image and the negative sample set; the ViT model is a pretrained model containing several encoder blocks, and each encoder block can extract classification information for computing a target loss value;
Step 4: linearly superimpose the adversarial perturbation on the original image, input the result into the ViT model, iteratively update the perturbation until a stopping condition is met, output the final adversarial perturbation, and linearly superimpose it on the original image to obtain the corresponding adversarial example.
In particular, forming the negative sample set in step 2 specifically comprises:
determining the number of images contained in the negative sample set corresponding to the original image as M;
obtaining the m-th image x'_m of the negative sample set corresponding to the original image according to the following formula:
x'_m = n_m ⊙ x
where n_m is the m-th noise mask; p is the probability that a pixel of the noise mask is 0 and 1 − p the probability that it is 1; x is the original image; ⊙ denotes the element-wise (Hadamard) product.
In particular, dividing the encoder blocks of the ViT model into robust encoder blocks and non-robust encoder blocks in step 3 specifically comprises:
determining, from the original image and the ViT model, the target loss value of the model with respect to the original image and the self-attention feature map of each encoder block with respect to the original image, called the first loss value and the first self-attention feature map respectively;
obtaining a first encoder-block weight score for each encoder block from the first loss value and the first self-attention feature map;
determining, from each image in the negative sample set and the ViT model, the target loss value of the model with respect to the negative sample and the self-attention feature map of each encoder block with respect to the negative sample, called the second loss value and the second self-attention feature map respectively;
determining a second encoder-block weight score for each encoder block from the second loss value and the second self-attention feature map;
obtaining an average second encoder-block weight score for each encoder block from the second encoder-block weight scores over all images in the negative sample set;
determining the interference resistance of each encoder block from the first encoder-block weight score and the average second encoder-block weight score, dividing the K encoder blocks with the strongest interference resistance into robust encoder blocks, and dividing the remaining encoder blocks into non-robust encoder blocks.
In particular, the first encoder-block weight score of each encoder block is obtained according to the following formula:
w_i(x) = [formula shown as an image in the original document], i = 1, …, B
where B is the number of encoder blocks in the ViT model; i is the index of the encoder block; J(x, y) is the loss function of the ViT model; SAM_i(x) is the self-attention feature map extracted by the i-th encoder block; ∇_x denotes the gradient with respect to x.
In particular, the average second encoder-block weight score of each encoder block is obtained according to the following formula:
w̄_i(x) = (1/M) Σ_{m=1}^{M} w_i(x'_m)
where M is the number of images contained in the negative sample set and w_i(x'_m) is the second encoder-block weight score of the i-th encoder block for the m-th negative sample.
In particular, determining the robust and non-robust encoder blocks described in step 3 specifically comprises:
obtaining the interference resistance of each encoder block according to the following formula:
d_i = | w_i(x) − w̄_i(x) |
where | · | denotes the absolute value, w_i(x) is the first encoder-block weight score of the i-th encoder block, and w̄_i(x) is the average second encoder-block weight score of the i-th encoder block;
obtaining the positions of the encoder blocks ordered from strong to weak interference resistance according to the following formula:
S = [s_1, s_2, …, s_B]
The encoder blocks at the first K positions are the robust encoder blocks, and the rest are the non-robust encoder blocks.
In particular, iteratively updating the adversarial perturbation in step 4 specifically comprises:
linearly superimposing the adversarial perturbation on the original image to obtain a temporary adversarial example and inputting it into the ViT model;
computing a loss between the adversarial example and the original image based on the self-attention feature maps extracted by the non-robust encoder blocks;
computing a cross-entropy loss between the adversarial example and the original image based on the robust encoder blocks;
computing a pixel-level loss between the adversarial example and the original image;
updating the adversarial example according to the total loss function.
The total loss function is as follows:
L_total = [formula shown as an image in the original document], combining the cross-entropy term of the robust encoder blocks, the self-attention feature term of the non-robust encoder blocks and the pixel-level term
where J_{s_k}(x*, y) is the cross-entropy loss function of the s_k-th encoder block; x* is the adversarial example and y is the true label corresponding to the original image; ‖ · ‖_2 denotes the Euclidean distance; SAM_{s_k}(x*) is the self-attention feature map extracted by the s_k-th encoder block; λ and μ are constants that balance the loss terms.
In particular, the adversarial example is updated in step 4 according to the following formula:
x*_{t+1} = x*_t + α · AW(x) ⊙ sign( ∇_{x*} L_total )
where x*_t is the adversarial example obtained at the t-th iteration; α is the step size of a single perturbation update; sign(·) is the sign function used to determine the direction of the perturbation; ∇_{x*} L_total is the gradient of the loss; AW(x) is the adaptive weight; ReLU(·) is the linear rectification function used in computing the adaptive weight.
It can be seen from the above technical solution that the black-box adversarial example generation method based on the Transformer model can effectively attack various unknown neural network models. The attributes of the different encoder blocks in the ViT are fully considered: the blocks are divided into robust and non-robust encoder blocks according to their weight scores, and two attack strategies are combined to generate adversarial examples. Balancing the influence of the image and of local model information suppresses redundant model information and image self-attention features that are highly model-specific, which significantly improves the transferability of the adversarial examples. The update direction of the adversarial perturbation is stabilized by constraining the pixel-level and self-attention-feature-level differences between the adversarial example and the original image. Furthermore, an adaptive weight is designed to adjust the perturbation magnitude of different pixels and highlight key pixel information, so that the adversarial example has a stronger attack capability without degrading the visual quality.
In summary, the adversarial example generation method provided by the invention effectively solves the problems of low attack success rate and weak transferability on ViTs models, also has strong attack performance against convolutional neural network models, is better suited to security testing of neural network models in real scenarios, and can be used to train more stable and secure neural network models.
Drawings
FIG. 1 is a flow chart of the adversarial example generation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of encoder-block selection provided by an embodiment of the present invention;
FIG. 3 is a block diagram of encoder-block selection provided by an embodiment of the present invention;
FIG. 4 is a block diagram of adversarial example generation provided by an embodiment of the present invention.
Detailed Description
The invention provides a black-box adversarial example generation method based on the Transformer model. To address the low attack success rate and weak transferability on ViTs models in the prior art, the attributes of the different encoder blocks in the ViT are fully considered, and two strategies are applied to the two kinds of encoder blocks to jointly generate adversarial examples, reducing the dependence on model information and improving the transferability of the adversarial examples. To further improve the attack capability, an adaptive weight adjusts the perturbation magnitude at key pixels of the image; destroying these key pixels significantly improves the attack performance of the adversarial examples. Fig. 1 shows the flow chart of the method; the specific implementation steps are as follows:
(1) Acquire the original images and their corresponding labels to form an original dataset, and initialize an adversarial perturbation of the same size as the original image. In this embodiment, 1 image is randomly selected from each class of the ILSVRC 2012 validation set, for a total of 1000 original images.
(2) Generate several noise images of the same size as the original image to form a negative sample set. In this embodiment the negative sample set contains M images; specifically, the m-th image of the negative sample set is obtained according to the following formula:
x'_m = n_m ⊙ x
where n_m is the m-th noise mask; p is the probability that a pixel of the noise mask is 0 and 1 − p the probability that it is 1; x is the original image; ⊙ denotes the element-wise (Hadamard) product.
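The following minimal sketch illustrates this negative-sample construction with PyTorch; the tensor layout (an image in [0, 1] of shape C×H×W), the function name and the default p = 0.5 are assumptions of the sketch, not part of the published method.

    import torch

    def generate_negative_samples(x, m, p=0.5):
        """Build the negative sample set x'_m = n_m ⊙ x, m = 1..M, where every entry
        of the Bernoulli mask n_m is 0 with probability p and 1 with probability 1 - p."""
        masks = torch.bernoulli(torch.full((m, *x.shape), 1.0 - p))  # entries are 1 w.p. 1-p, 0 w.p. p
        return masks * x.unsqueeze(0)            # element-wise product with the original image

    # Example: 20 masked copies of one 3x224x224 image in [0, 1]
    x = torch.rand(3, 224, 224)
    negatives = generate_negative_samples(x, m=20, p=0.5)   # shape (20, 3, 224, 224)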
(3) As shown in Fig. 2, the encoder blocks are divided according to their encoder-block weight scores, using the original image and the ViT model. The ViT model is a pretrained model that correctly predicts the original image; it contains several encoder blocks, and each encoder block can extract classification information for computing the target loss value. To obtain the first encoder-block weight score of each encoder block (the encoder-block selection diagram is shown in Fig. 3), the original image is first transformed into image patches matching the ViT input, the patches are then fed into the ViT model to obtain the first loss value of the model, and a first self-attention feature map is obtained from each encoder block according to the following formula:
SAM_i(x) = softmax( Q_i K_i^T / √d_q ) · V_i
where softmax(·) is the neural network activation function; Q_i, K_i and V_i are the self-attention (query, key and value) matrices of the i-th encoder block of the ViT model; d_q is the dimension of the learnable matrices in the ViT model; (·)^T denotes transposition.
(4) The first encoder-block weight score of the i-th encoder block is obtained from the first loss value and the first self-attention feature map of each encoder block; the specific calculation formula is:
w_i(x) = [formula shown as an image in the original document], i = 1, …, B
where B is the number of encoder blocks in the ViT model; i is the index of the encoder block; J(x, y) is the loss function of the ViT model; SAM_i(x) is the self-attention feature map extracted by the i-th encoder block; ∇_x denotes the gradient with respect to x; W(x) = [w_1(x), w_2(x), …, w_B(x)] are the first encoder-block weight scores of the model.
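Because the weight-score formula itself appears only as an image in the source, the sketch below adopts one plausible reading as an explicit assumption: a Grad-CAM-style score that weights each block's self-attention map by the gradient of the classification loss with respect to that map and sums the result to one scalar per block.

    import torch
    import torch.nn.functional as F

    def block_weight_scores(logits, y, sams):
        """First encoder-block weight scores w_1..w_B (assumed Grad-CAM-style form).
        `logits` and `sams` (the B self-attention maps SAM_i(x)) must come from the
        same forward pass of the ViT so that they share one autograd graph."""
        loss = F.cross_entropy(logits, y)                           # J(x, y)
        grads = torch.autograd.grad(loss, sams, retain_graph=True)  # dJ/dSAM_i for each block
        # Assumed aggregation: gradient-weighted self-attention map, summed to a per-block scalar.
        return torch.stack([(g * s).sum() for g, s in zip(grads, sams)])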
(5) From the negative sample set and the ViT model, determine the second loss value of the target loss function and the second self-attention feature map of each encoder block, then compute the second encoder-block weight score of each encoder block, and finally obtain the average second encoder-block weight score of each encoder block from the second encoder-block weight scores over all images in the negative sample set. The specific formula is:
w̄_i(x) = (1/M) Σ_{m=1}^{M} w_i(x'_m)
where M is the number of images contained in the negative sample set, and W̄(x) = [w̄_1(x), w̄_2(x), …, w̄_B(x)] are the average second encoder-block weight scores of the model.
(6) Determine the interference resistance of each encoder block from the first encoder-block weight score and the average second encoder-block weight score; the specific calculation formula is:
d_i = | w_i(x) − w̄_i(x) |
where | · | denotes the absolute value, w_i(x) is the first encoder-block weight score of the i-th encoder block, and w̄_i(x) is its average second encoder-block weight score; D(x) = [d_1, d_2, …, d_B] represents the interference resistance of all encoder blocks of the model. On the one hand, the larger d_i is, the larger the difference between the first encoder-block weight score and the average second encoder-block weight score of the i-th block, i.e. the block is easily disturbed by noise, has weaker interference resistance and poorer robustness. On the other hand, the weaker the interference resistance of an encoder block, the more sensitive it is to noise, which makes it better at perceiving the difference between two images.
(7) Sort the encoder blocks from strong to weak interference resistance and record the positions of the corresponding blocks; the first K blocks with the strongest interference resistance are the robust encoder blocks and the remaining blocks are the non-robust encoder blocks. The specific formula is:
S = [s_1, s_2, …, s_B]
The encoder blocks at the first K positions are the robust encoder blocks, and the rest are the non-robust encoder blocks.
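A sketch of steps (5) to (7) under the reconstruction above: the average second weight score is the mean over the M negative samples, the interference resistance is taken as d_i = |w_i(x) − w̄_i(x)|, and the blocks are sorted from strong to weak resistance (small d_i first); the scalar-per-block representation and the helper name are assumptions.

    import torch

    def partition_blocks(w_clean, w_negatives, k):
        """w_clean: tensor of shape (B,), first weight scores on the original image.
        w_negatives: tensor of shape (M, B), second weight scores, one row per negative sample.
        Returns (robust_idx, nonrobust_idx): the K blocks with the strongest
        interference resistance (smallest d_i) form the robust set."""
        w_bar = w_negatives.mean(dim=0)          # average second weight score per block
        d = (w_clean - w_bar).abs()              # d_i: sensitivity of block i to noise
        order = torch.argsort(d)                 # strong resistance (small d_i) listed first
        return order[:k].tolist(), order[k:].tolist()

    # Example with assumed sizes: B = 12 encoder blocks, M = 20 negative samples, K = 4
    robust_idx, nonrobust_idx = partition_blocks(torch.rand(12), torch.rand(20, 12), k=4)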
(8) As shown in Fig. 4, the adversarial perturbation is linearly superimposed on the original image and fed into the ViT model. By dividing the encoder blocks, information is extracted from them purposefully, which effectively reduces redundant model information and image self-attention features that are highly model-specific, improving the transferability of the adversarial examples. Concretely, the invention combines two strategies to generate the adversarial example and balances the image itself against the local model information, as follows:
A. Use a cross-entropy loss strategy for the robust encoder blocks.
Since a robust encoder block is less susceptible to noise interference, it tends to extract the image information that is critical to the model's prediction. From the image information extracted by the robust encoder blocks, the influence of the adversarial example on the model prediction can be measured accurately and the update direction of the adversarial perturbation determined. In the ViT model, the image information extracted by each encoder block can produce a final prediction through the multi-head linear classifier, so image information can be extracted from the robust encoder blocks to obtain a multi-block cross-entropy loss.
B. Use a self-attention feature loss strategy for the non-robust encoder blocks.
A non-robust encoder block has the advantage of being sensitive to noise and can effectively measure the difference between the adversarial example and the original image, thereby stabilizing the update direction of the perturbation, encouraging larger perturbations and improving the attack performance of the adversarial example. The self-attention feature map loss between the adversarial example and the original image is determined from the self-attention feature maps extracted by the non-robust encoder blocks.
The final loss function integrates the cross-entropy loss strategy and the self-attention feature map loss strategy; in addition, a pixel-level loss is used to improve the attack capability of the adversarial example. The total loss function is:
L_total = [formula shown as an image in the original document], combining the cross-entropy term of the robust encoder blocks, the self-attention feature term of the non-robust encoder blocks and the pixel-level term
where J_{s_k}(x*, y) is the cross-entropy loss function of the s_k-th encoder block; x* is the adversarial example and y is the true label corresponding to the original image; ‖ · ‖_2 denotes the Euclidean distance; SAM_{s_k}(x*) is the self-attention feature map extracted by the s_k-th encoder block; λ and μ are constants that balance the loss terms.
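Since the exact combination is shown only as an image in the source, the sketch below assumes a weighted sum of the three terms named above: cross-entropy on the robust blocks, plus λ times the self-attention feature distance on the non-robust blocks, plus μ times the pixel-level distance; the container types and index lists are illustrative.

    import torch
    import torch.nn.functional as F

    def total_loss(block_logits, y, sams_adv, sams_orig, x_adv, x_orig,
                   robust_idx, nonrobust_idx, lam=1.0, mu=1.0):
        """Assumed combination of the three loss terms.
        block_logits[i]: classifier output read from encoder block i for x_adv, shape (N, num_classes).
        sams_adv[i] / sams_orig[i]: self-attention maps of block i for x_adv / the original image."""
        ce = sum(F.cross_entropy(block_logits[i], y) for i in robust_idx)               # robust blocks
        feat = sum(torch.norm(sams_adv[i] - sams_orig[i], p=2) for i in nonrobust_idx)  # non-robust blocks
        pix = torch.norm(x_adv - x_orig, p=2)                                           # pixel-level term
        return ce + lam * feat + mu * pix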
(9) According to the average second encoder-block weight score obtained from the most robust encoder block, the magnitude of the temporary adversarial perturbation is adjusted through an adaptive weight; enlarging the perturbation step at key pixels improves the attack capability of the adversarial example. Finally, the perturbation is linearly superimposed on the original image to obtain the adversarial example for the next iteration. The specific formula is:
x*_{t+1} = x*_t + α · AW(x) ⊙ sign( ∇_{x*} L_total )
where x*_t is the adversarial example obtained at the t-th iteration; sign(·) is the sign function, taking the value +1 or −1, which determines the update direction of the perturbation; α is the step size of a single perturbation update; AW(x) is the adaptive weight matrix, computed with the linear rectification function ReLU(x) = max(0, x), so that the adaptive weight satisfies 1 ≤ AW(x) ≤ 2.
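A sketch of this update step under the reconstruction above: a sign-gradient step scaled element-wise by the adaptive weight AW(x) = 1 + ReLU(·), followed by clipping; the exact argument of the ReLU, the ε-ball projection and the valid-pixel clipping are assumptions of the sketch.

    import torch

    def update_adversarial(x_adv, x_orig, grad, key_map, step_size=2/255, epsilon=16/255):
        """One iteration of the assumed update rule
            x*_{t+1} = clip( x*_t + step_size * AW(x) * sign(grad) )
        grad: gradient of the total loss with respect to x_adv.
        key_map: assumed per-pixel importance map in [0, 1], derived from the average
        second weight score of the most robust encoder block."""
        aw = 1.0 + torch.relu(key_map)                      # adaptive weight, 1 <= AW(x) <= 2
        x_new = x_adv + step_size * aw * torch.sign(grad)   # enlarge the step on key pixels
        x_new = torch.clamp(x_new, x_orig - epsilon, x_orig + epsilon)  # assumed epsilon-ball constraint
        return torch.clamp(x_new, 0.0, 1.0)                 # keep a valid image in [0, 1]

    # Example with assumed shapes
    x = torch.rand(1, 3, 224, 224)
    x_adv = update_adversarial(x.clone(), x, torch.randn_like(x), torch.rand_like(x))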
The method is evaluated on an image classification task. One image is randomly selected from each class of the ILSVRC 2012 validation set, giving 1000 original images in total; the adversarial examples are generated on a DeiT-T model, tested on 5 ViTs models and 5 convolutional neural network models, and finally compared with 5 adversarial example generation methods. The experiments measure performance by the attack success rate: the higher the attack success rate, the better the attack performance and transferability of the adversarial examples. The attack performance of the adversarial examples on the 5 ViTs models is shown in Table 1, and on the 5 convolutional neural networks in Table 2. The method achieves better attack and transfer performance on both convolutional neural networks and ViTs models, and can therefore better evaluate and improve the security of neural network models in application scenarios.
TABLE 1: attack success rates of the adversarial examples on the 5 ViTs models (reported as an image in the original document).
TABLE 2: attack success rates of the adversarial examples on the 5 convolutional neural network models (reported as an image in the original document).

Claims (8)

1. A black-box adversarial example generation method based on the Transformer model, wherein, for the different attributes of the encoder blocks in a Transformer, the method adopts two strategies to jointly generate adversarial examples, using a cross-entropy loss strategy for the robust encoder blocks and a self-attention feature loss strategy for the non-robust encoder blocks, so as to avoid redundant model information and improve the transferability of the adversarial examples, the method being characterized by comprising the following steps:
Step 1: acquiring an original image and its corresponding label to form an original dataset, and initializing an adversarial perturbation of the same size as the original image;
Step 2: generating several noise images of the same size as the original image to form a negative sample set;
Step 3: dividing the encoder blocks of a Vision Transformer (ViT) model into robust encoder blocks and non-robust encoder blocks according to the original image and the negative sample set, wherein the ViT model is a pretrained model comprising several encoder blocks, and each encoder block can extract classification information for computing a target loss value;
Step 4: linearly superimposing the adversarial perturbation on the original image, inputting the result into the ViT model, iteratively updating the adversarial perturbation until a stopping condition is met, outputting the final adversarial perturbation, and linearly superimposing it on the original image to obtain the corresponding adversarial example.
2. The Transformer-model-based black-box adversarial example generation method according to claim 1, wherein forming the negative sample set in step 2 specifically comprises:
determining the number of images contained in the negative sample set corresponding to the original image as M;
obtaining the m-th image of the negative sample set corresponding to the original image according to the following formula:
x'_m = n_m ⊙ x
where n_m is the m-th noise image; p is the probability that a pixel of the noise image is 0 and 1 − p the probability that it is 1; x is the original image; ⊙ denotes the element-wise (Hadamard) product.
3. The Transformer-model-based black-box adversarial example generation method according to claim 1, wherein dividing the encoder blocks of the ViT model into robust encoder blocks and non-robust encoder blocks in step 3 specifically comprises:
determining, from the original image and the ViT model, the target loss value of the model with respect to the original image and the self-attention feature map of each encoder block with respect to the original image, called the first loss value and the first self-attention feature map respectively; obtaining a first encoder-block weight score for each encoder block from the first loss value and the first self-attention feature map;
determining, from each image in the negative sample set and the ViT model, the target loss value of the model with respect to the negative sample and the self-attention feature map of each encoder block with respect to the negative sample, called the second loss value and the second self-attention feature map respectively;
determining a second encoder-block weight score for each encoder block from the second loss value and the second self-attention feature map;
obtaining an average second encoder-block weight score for each encoder block from the second encoder-block weight scores over all images in the negative sample set;
determining the interference resistance of each encoder block from the first encoder-block weight score and the average second encoder-block weight score, dividing the K encoder blocks with the strongest interference resistance into robust encoder blocks, and dividing the remaining encoder blocks into non-robust encoder blocks.
4. The Transformer-model-based black-box adversarial example generation method according to claim 3, wherein the first encoder-block weight score of each encoder block is obtained according to the following formula:
w_i(x) = [formula shown as an image in the original document], i = 1, …, B
where B is the number of encoder blocks in the ViT model; i is the index of the encoder block; J(x, y) is the loss function of the ViT model; SAM_i(x) is the self-attention feature map extracted by the i-th encoder block; ∇_x denotes the gradient with respect to x.
5. The Transformer-model-based black-box adversarial example generation method according to claim 3, wherein the average second encoder-block weight score of each encoder block is obtained according to the following formula:
w̄_i(x) = (1/M) Σ_{m=1}^{M} w_i(x'_m)
where M is the number of images contained in the negative sample set.
6. The Transformer-model-based black-box adversarial example generation method according to claim 3, wherein determining the robust encoder blocks and the non-robust encoder blocks specifically comprises:
obtaining the interference resistance of each encoder block according to the following formula:
d_i = | w_i(x) − w̄_i(x) |
where | · | denotes the absolute value, w_i(x) is the first encoder-block weight score of the i-th encoder block, and w̄_i(x) is the average second encoder-block weight score of the i-th encoder block;
obtaining the positions of the encoder blocks ordered from strong to weak interference resistance according to the following formula:
S = [s_1, s_2, …, s_B]
wherein the encoder blocks at the first K positions are the robust encoder blocks, and the rest are the non-robust encoder blocks.
7. The Transformer-model-based black-box adversarial example generation method according to claim 1, wherein iteratively updating the adversarial perturbation in step 4 specifically comprises:
linearly superimposing the adversarial perturbation on the original image to obtain a temporary adversarial example and inputting it into the ViT model;
computing a loss between the adversarial example and the original image based on the self-attention feature maps extracted by the non-robust encoder blocks;
computing a cross-entropy loss between the adversarial example and the original image based on the robust encoder blocks;
computing a pixel-level loss between the adversarial example and the original image;
updating the adversarial example according to the total loss function;
the total loss function is as follows:
L_total = [formula shown as an image in the original document], combining the cross-entropy term of the robust encoder blocks, the self-attention feature term of the non-robust encoder blocks and the pixel-level term
where J_{s_k}(x*, y) is the cross-entropy loss function of the s_k-th encoder block; x* is the adversarial example and y is the true label corresponding to the original image; ‖ · ‖_2 denotes the Euclidean distance; SAM_{s_k}(x*) is the self-attention feature map extracted by the s_k-th encoder block; λ and μ are constants that balance the loss terms.
8. The Transformer-model-based black-box adversarial example generation method according to claim 7, wherein the adversarial example is updated according to the following formula:
x*_{t+1} = x*_t + α · AW(x) ⊙ sign( ∇_{x*} L_total )
where x*_t is the adversarial example obtained at the t-th iteration; α is the step size of a single perturbation update; sign(·) is the sign function used to determine the direction of the perturbation; ∇_{x*} L_total is the gradient of the loss; AW(x) is the adaptive weight, computed with the linear rectification function ReLU(·).
CN202210332993.7A 2022-03-31 2022-03-31 Black box confrontation sample generation method based on Transformer model Pending CN114693973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210332993.7A CN114693973A (en) 2022-03-31 2022-03-31 Black box confrontation sample generation method based on Transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210332993.7A CN114693973A (en) 2022-03-31 2022-03-31 Black box confrontation sample generation method based on Transformer model

Publications (1)

Publication Number Publication Date
CN114693973A (en) 2022-07-01

Family

ID=82140259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210332993.7A Pending CN114693973A (en) 2022-03-31 2022-03-31 Black box confrontation sample generation method based on Transformer model

Country Status (1)

Country Link
CN (1) CN114693973A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943641A (en) * 2022-07-26 2022-08-26 北京航空航天大学 Method and device for generating anti-texture image based on model sharing structure
CN114943641B (en) * 2022-07-26 2022-10-28 北京航空航天大学 Method and device for generating confrontation texture image based on model sharing structure

Similar Documents

Publication Publication Date Title
CN109948658B (en) Feature diagram attention mechanism-oriented anti-attack defense method and application
CN108491837B (en) Anti-attack method for improving license plate attack robustness
CN111368886B (en) Sample screening-based label-free vehicle picture classification method
CN113554089B (en) Image classification countermeasure sample defense method and system and data processing terminal
CN111401407B (en) Countermeasure sample defense method based on feature remapping and application
CN113674140B (en) Physical countermeasure sample generation method and system
CN110941794A (en) Anti-attack defense method based on universal inverse disturbance defense matrix
CN113283599B (en) Attack resistance defense method based on neuron activation rate
CN111754519B (en) Class activation mapping-based countermeasure method
CN112396129A (en) Countermeasure sample detection method and general countermeasure attack defense system
CN113254927B (en) Model processing method and device based on network defense and storage medium
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN113435264A (en) Face recognition attack resisting method and device based on black box substitution model searching
Wu et al. Defense against adversarial attacks in traffic sign images identification based on 5G
CN114693973A (en) Black box confrontation sample generation method based on Transformer model
Ahmad et al. A novel image tamper detection approach by blending forensic tools and optimized CNN: Sealion customized firefly algorithm
CN113034332B (en) Invisible watermark image and back door attack model construction and classification method and system
CN113936140A (en) Evaluation method of sample attack resisting model based on incremental learning
CN117152486A (en) Image countermeasure sample detection method based on interpretability
CN111950635A (en) Robust feature learning method based on hierarchical feature alignment
CN116232699A (en) Training method of fine-grained network intrusion detection model and network intrusion detection method
CN114863132A (en) Method, system, equipment and storage medium for modeling and capturing image spatial domain information
CN114638356A (en) Static weight guided deep neural network back door detection method and system
CN113487506A (en) Countermeasure sample defense method, device and system based on attention denoising
CN113344814A (en) High-resolution countermeasure sample synthesis method based on generation mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination