CN113609482A

CN113609482A - Back door detection and restoration method and system for image classification model

Info

Publication number: CN113609482A
Application number: CN202110796626.8A
Authority: CN
Inventors: 陈恺; 朱宏; 赵月; 梁瑞刚
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2021-11-05
Anticipated expiration: 2041-07-14
Also published as: CN113609482B

Abstract

The invention discloses a backdoor detection and repair method and system for an image classification model, belonging to the technical field of software technology and information security. The methods of model pruning, migration learning and shallow model training are adopted to obtain the same task as the backdoor model but A series of comparison models without backdoor; using the comparison model to reverse each category of the backdoor model by optimizing the objective function to obtain a series of potential triggers; use the contribution heat map to refine the potential triggers, and only retain the classification results that affect the model The key features of ; based on the difference in the transferability of backdoor triggers and adversarial patches on the control model, distinguish the backdoor triggers and adversarial patches of the refined potential triggers; add the distinguished backdoor triggers to the clean dataset, Remove backdoors in backdoor models by adversarial training. The invention can detect and repair the backdoor of the image classification model by only using a small amount of clean data to generate a normal model.

Description

A backdoor detection and repair method and system for an image classification model

技术领域technical field

本发明属于软件技术和信息安全技术领域，涉及面向人工智能的安全技术，具体涉及一种针对深度神经网络图像分类模型的后门检测及修复方法及系统。The invention belongs to the technical fields of software technology and information security, relates to artificial intelligence-oriented security technology, and in particular relates to a backdoor detection and repair method and system for a deep neural network image classification model.

背景技术Background technique

近年来，深度神经网络(deep neural network，DNN)因为其准确的预测结果在计算机视觉、语音识别、自然语言处理等领域得到了广泛的应用。深度神经网络甚至应用在一些重要的安全领域中，如门禁系统、自动驾驶和医疗诊断，因为它的准确性与人类专家相当，有时甚至比人类专家更可靠。In recent years, deep neural network (DNN) has been widely used in computer vision, speech recognition, natural language processing and other fields because of its accurate prediction results. Deep neural networks are even used in some important security fields, such as access control systems, autonomous driving, and medical diagnosis, because their accuracy is comparable to, and sometimes more reliable than, human experts.

然而，深度神经网络在得到广泛应用的同时，也面临着严重的安全问题，例如数据投毒攻击、对抗攻击、后门攻击等等。特别地，攻击者可以在模型训练期间向深度神经网络中注入后门来控制模型的行为。被注入后门的DNN模型，在正常的输入数据上与无后门模型表现基本一致，但遇到特殊的“触发器(即覆盖在原始图像上的特殊图案)”输入时，就会引发模型异常的行为，带来攻击者期望的结果。后门攻击的存在给深度神经网络带来了安全隐患。例如，通过向DNN模型中注入后门，可以使其把贴着特殊贴纸(触发器)的停车标识牌错误识别为限速标识牌。如果一辆自动驾驶汽车配备了这样的后门模型，就可能发生致命的交通事故。However, while deep neural networks are widely used, they also face serious security problems, such as data poisoning attacks, adversarial attacks, backdoor attacks, and so on. In particular, attackers can inject backdoors into deep neural networks during model training to control the behavior of the models. The DNN model injected into the backdoor is basically the same as the model without the backdoor on normal input data, but when it encounters a special "trigger (that is, a special pattern overlaid on the original image)" input, the model will be abnormal. behavior that brings about the desired result of the attacker. The existence of backdoor attacks brings security risks to deep neural networks. For example, by injecting a backdoor into a DNN model, it can be made to misidentify a stop sign with a special sticker (trigger) as a speed limit sign. If a self-driving car were equipped with such a rear door model, a fatal traffic accident could occur.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于提供一种对深度神经网络图像分类模型的后门检测及修复方法。本发明可以在不知道后门触发器和后门攻击目标的前提下，仅使用少量干净数据，对可能存在于模型中的后门进行检测，并对检测到的后门进行修复，生成一个正常模型。The purpose of the present invention is to provide a backdoor detection and repair method for a deep neural network image classification model. The present invention can detect backdoors that may exist in the model without knowing the backdoor trigger and the backdoor attack target, only use a small amount of clean data, and repair the detected backdoor to generate a normal model.

为实现上述目的，本发明采用以下技术方案：To achieve the above object, the present invention adopts the following technical solutions:

一种针对图像分类模型的后门检测及修复方法，包括以下步骤：A backdoor detection and repair method for an image classification model, comprising the following steps:

基于干净数据集，采用模型剪枝、迁移学习和浅层模型训练的方法，获得与后门模型的任务相同但没有后门的一系列对照模型；Based on the clean data set, adopt the methods of model pruning, transfer learning and shallow model training to obtain a series of comparison models with the same task as the backdoor model but without the backdoor;

借助对照模型和干净数据集，通过优化目标函数对后门模型的每一个类别进行逆向，获得一系列潜在触发器，该潜在触发器包括后门触发器和对抗补丁；Reverse each category of the backdoor model by optimizing the objective function with the help of the control model and the clean dataset, and obtain a series of potential triggers, which include backdoor triggers and adversarial patches;

根据干净数据集和潜在触发器计算贡献度热力图，利用贡献度热力图对潜在触发器进行精炼，只保留影响模型分类结果的关键特征；Calculate the contribution heat map based on the clean data set and potential triggers, use the contribution heat map to refine the potential triggers, and only retain the key features that affect the classification results of the model;

基于后门触发器和对抗补丁在对照模型上可迁移性的差异，区分出精炼后的潜在触发器的后门触发器和对抗补丁；Based on the difference in transferability between backdoor triggers and adversarial patches on the control model, distinguish the backdoor triggers and adversarial patches of the refined potential triggers;

将区分出的后门触发器加入到干净数据集中，通过对抗训练去除后门模型中的后门。The differentiated backdoor triggers are added to the clean dataset, and the backdoor in the backdoor model is removed by adversarial training.

进一步地，干净数据集为来自于后门攻击的污染训练集或者与污染训练集具有相似数据分布的数据集，相似数据分布是指数据分布相似度高于一预设指标；干净数据集的数据量为污染训练集的10％-20％。Further, the clean data set is a contaminated training set from a backdoor attack or a data set with a similar data distribution to the contaminated training set. Similar data distribution means that the similarity of the data distribution is higher than a preset index; the data volume of the clean data set 10%-20% of the contaminated training set.

进一步地，模型剪枝方法为：通过剪去后门模型中激活率低的神经元来去除后门，同时通过微调训练来恢复该模型的分类准确率；Further, the model pruning method is: removing the backdoor by pruning neurons with low activation rates in the backdoor model, and at the same time restoring the classification accuracy of the model by fine-tuning training;

迁移学习方法为：以与后门模型的分类任务类似的一神经网络模型为基础，通过迁移学习训练得到对照模型；The transfer learning method is: based on a neural network model similar to the classification task of the backdoor model, a comparison model is obtained through transfer learning training;

浅层模型训练方法为：对后门模型的结构进行简化，并在简化的模型结构上进行训练，得到对照模型。The shallow model training method is: simplify the structure of the backdoor model, and train on the simplified model structure to obtain a comparison model.

进一步地，通过调整目标函数的损失函数权重来优化目标函数，公式如下：Further, the objective function is optimized by adjusting the weight of the loss function of the objective function, and the formula is as follows:

其中，损失函数L_backdoor和L_clean分别表示后门触发器对后门模型和对照模型的分类结果影响，损失函数L_noise是应用于m的降噪函数；α、β和γ为损失函数的权重系数；Δ和m是目标函数优化的两个变量，是与干净数据集尺寸相同的三维矩阵，其中Δ是保存潜在触发器的图案；m是透明度矩阵，控制潜在触发器的位置；x_i是从干净数据集中随机选择的图像；J是全1矩阵，维数与Δ相同；Δ*m+x_i*(J-m)表示将触发器覆盖在图像x_i上；f_b和f_c分别是后门模型和对照模型的预测函数；CE是交叉熵损失函数；n是干净数据集中图像的总数量；i是当前图像的编号；在后门模型上，带有触发器的图像被分类到目标类别y_t，在对照模型上被分类到正确类别y_i；j和k分别代表矩阵m的行和列，a和b是求和符号的下标。Among them, the loss functions L _backdoor and L _clean represent the influence of the backdoor trigger on the classification results of the backdoor model and the control model, respectively, the loss function L _noise is the noise reduction function applied to m; α, β and γ are the weight coefficients of the loss function; Δ and m are the two variables of the objective function optimization and are three-dimensional matrices of the same size as the clean dataset, where _Δ is the pattern that holds the potential triggers; m is the transparency matrix that controls the position of the potential triggers; Randomly selected images in the dataset; J is an all-one matrix with the same dimension as Δ; Δ*m+ _xi *(Jm) means overlaying the trigger on the image _xi ; f _b and f _c are the backdoor model and The prediction function of the control model; CE is the cross-entropy loss function; n is the total number of images in the clean dataset; i is the number of the current image; on the backdoor model, images with triggers are classified to target class y _t , in The control model is classified to the correct category y _i ; j and k represent the rows and columns of the matrix m, respectively, and a and b are the subscripts of the summation symbols.

进一步地，根据干净数据集和潜在触发器计算贡献度热力图的步骤包括：Further, the steps of calculating the contribution heatmap based on the clean dataset and potential triggers include:

从干净数据集中随机选择一组图像，并用潜在触发器覆盖；A set of images is randomly selected from a clean dataset and overlaid with potential triggers;

对所有图像计算代表分类结果贡献度的热力图，即为贡献度热力图。Calculate the heat map representing the contribution of the classification results for all images, which is the contribution heat map.

进一步地，利用贡献度热力图对潜在触发器进行精炼的步骤包括：Further, the steps of using the contribution heatmap to refine potential triggers include:

对所有贡献度热力图求平均，得到平均热力图；Average all contribution heatmaps to get the average heatmap;

根据平均热力图，移除潜在触发器中当前贡献度最低的区域；According to the average heatmap, remove the area with the lowest current contribution among potential triggers;

计算潜在触发器目前的攻击成功率，若低于一阈值，则结束，否则继续移除潜在触发器中当前贡献度最低的区域。Calculate the current attack success rate of the potential trigger, if it is lower than a threshold, end, otherwise continue to remove the area with the lowest current contribution in the potential trigger.

进一步地，区分精炼后的潜在触发器的后门触发器和对抗补丁的步骤包括：Further, the steps of distinguishing backdoor triggers and adversarial patches of the refined potential triggers include:

计算潜在触发器在后门模型上的攻击成功率，若攻击成功率低于一阈值，则判定为对抗补丁，结束；Calculate the attack success rate of the potential trigger on the backdoor model, if the attack success rate is lower than a threshold, it is determined as an adversarial patch, and the end;

若攻击成功率不低于上述阈值，则计算潜在触发器在所有对照模型上的攻击成功率，若在某一个对照模型上攻击成功率高于另一阈值，则判定为对抗补丁，否则判定为后门触发器。If the attack success rate is not lower than the above threshold, calculate the attack success rate of the potential trigger on all comparison models. If the attack success rate on one comparison model is higher than the other threshold, it is judged as an adversarial patch, otherwise it is judged as Backdoor trigger.

进一步地，首先从干净数据集中随机选择一定比例的图像，并用后门触发器覆盖；然后将区分出的后门触发器加入到干净数据集中。Further, a certain proportion of images are randomly selected from the clean dataset and covered with backdoor triggers; then the discriminated backdoor triggers are added to the clean dataset.

进一步地，通过对抗训练去除后门模型中的后门的步骤包括：将区分出的后门触发器加入到干净数据集中，并保持图像的类别信息不变，得到对抗训练数据集；用对抗训练数据集微调训练后门模型，去除后门模型中的后门。Further, the steps of removing the backdoor in the backdoor model through adversarial training include: adding the differentiated backdoor triggers to the clean data set, and keeping the category information of the images unchanged to obtain the adversarial training data set; fine-tuning with the adversarial training data set. Train the backdoor model and remove the backdoor in the backdoor model.

一种针对图像分类模型的后门检测及修复系统，包括存储器和处理器，在该存储器上存储有计算机程序，该处理器执行该程序时实现上述方法的步骤。A backdoor detection and repair system for an image classification model includes a memory and a processor, where a computer program is stored on the memory, and the processor implements the steps of the above method when executing the program.

一种计算机可读存储介质，存储有计算机程序，该程序被处理器执行时实现上述方法的步骤。A computer-readable storage medium stores a computer program, which implements the steps of the above method when the program is executed by a processor.

与现有技术相比，本发明的积极效果为：Compared with the prior art, the positive effects of the present invention are:

本发明对后门检测能力更强，对不同种类触发器的后门检测范围更广，受触发器面积占比、位置、形状、图案等因素的影响更小，误报率和漏报率更低。相对于现有的后门检测方法(如NeuralCleanse、ABS、TABOR)所存在的都对触发器的面积占比提出假设进行限制，使得攻击者可以以牺牲触发器隐蔽性为代价，采用面积占比更大的触发器(大于10％)逃避检测的缺点，本发明在触发器面积占比达到25％时仍能保持检测能力，更难以被适应性攻击。The invention has stronger backdoor detection capability, wider backdoor detection range for different types of triggers, less influence by trigger area ratio, position, shape, pattern and other factors, and lower false alarm rate and false alarm rate. Compared with the existing backdoor detection methods (such as NeuralCleanse, ABS, TABOR), all of them put forward assumptions on the area ratio of triggers, so that attackers can sacrifice the concealment of triggers and adopt a larger area ratio. The disadvantage of large triggers (greater than 10%) evading detection, the present invention can still maintain the detection capability when the proportion of the trigger area reaches 25%, and is more difficult to be adaptively attacked.

附图说明Description of drawings

图1为本发明的一种针对图像分类模型的后门检测及修复方法整体流程图。FIG. 1 is an overall flowchart of a backdoor detection and repair method for an image classification model according to the present invention.

图2为潜在触发器精炼流程图。Figure 2 is a flow chart for the refinement of potential triggers.

图3为后门触发器识别流程图。Figure 3 is a flowchart of backdoor trigger identification.

具体实施方式Detailed ways

为使本发明的上述目的、特征和优点能够更加明显易懂，下面通过具体实施例和附图，对本发明做进一步说明。In order to make the above objects, features and advantages of the present invention more clearly understood, the present invention will be further described below through specific embodiments and accompanying drawings.

本实施例公开一种针对图像分类模型的后门检测及修复方法，如图1所示，步骤如下：This embodiment discloses a backdoor detection and repair method for an image classification model, as shown in FIG. 1 , the steps are as follows:

1.本发明包含如下要点：1. The present invention includes the following points:

1.1.对照模型生成：同时采用模型剪枝、迁移学习和浅层模型训练的方法，获得一系列与后门模型任务相同，但没有后门的对照模型。1.1. Comparison model generation: Simultaneously adopt the methods of model pruning, transfer learning and shallow model training to obtain a series of comparison models with the same tasks as the backdoor model but without the backdoor.

1.2.潜在触发器逆向：借助对照模型和干净数据集，设计一个目标函数，对后门模型的每一个类别进行逆向，获得一系列潜在触发器(由后门触发器和对抗补丁组成)。1.2. Potential trigger reversal: With the help of the control model and the clean dataset, an objective function is designed to reverse each category of the backdoor model to obtain a series of potential triggers (consisting of backdoor triggers and adversarial patches).

1.3.潜在触发器精炼：借助贡献度热力图技术，对潜在触发器进行精炼，去除潜在触发器的冗余特征，得到精炼潜在触发器。1.3. Refinement of potential triggers: With the help of the contribution heat map technology, the potential triggers are refined, the redundant features of the potential triggers are removed, and the refined potential triggers are obtained.

1.4.后门触发器识别：基于后门触发器和对抗补丁在对照模型上可迁移性的差异，将精炼潜在触发器区分为后门触发器和对抗补丁两类。1.4. Backdoor Trigger Recognition: Based on the difference in transferability between backdoor triggers and adversarial patches on the control model, the refined potential triggers are divided into two categories: backdoor triggers and adversarial patches.

1.5.后门模型修复：将后门触发器加入到干净数据集中，通过对抗训练去除后门模型中的后门，得到没有后门的正常模型。1.5. Repair of the backdoor model: Add the backdoor trigger to the clean data set, remove the backdoor in the backdoor model through adversarial training, and obtain a normal model without backdoor.

2.对照模型生成包含如下三个方式并同时采用：2. The generation of the control model includes the following three methods and adopts them at the same time:

2.1.模型剪枝：通过剪去模型中激活率低的神经元来去除后门，同时采用微调训练恢复模型的分类准确率。2.1. Model pruning: The backdoor is removed by pruning neurons with low activation rates in the model, and the classification accuracy of the model is restored by fine-tuning training.

2.2.迁移学习：以一个和后门模型分类任务类似的模型为基础，通过迁移学习训练对照模型。2.2. Transfer learning: Based on a model similar to the backdoor model classification task, the control model is trained through transfer learning.

2.3.浅层模型训练：对后门模型的结构进行简化，并在简化的模型结构上训练对照模型。2.3. Shallow model training: Simplify the structure of the backdoor model, and train the control model on the simplified model structure.

3.潜在触发器逆向通过优化目标函数完成：3. The potential trigger reverse is done by optimizing the objective function:

Δ和m是目标函数优化的两个变量，都是和干净数据集尺寸相同的三维矩阵。其中Δ是保存潜在触发器的图案；m是透明度矩阵，控制潜在触发器的位置。目标函数由三个损失函数组成，通过三个权重α、β和γ进行调整。Δ and m are the two variables of the objective function optimization, which are three-dimensional matrices of the same size as the clean dataset. where Δ is the pattern holding the potential triggers; m is the transparency matrix that controls the position of the potential triggers. The objective function consists of three loss functions, which are adjusted by three weights α, β and γ.

x_i是从干净数据集中随机选择的图像。J是全1矩阵，维数与Δ相同。Δ*m+x_i*(J-m)表示将触发器覆盖在图像x_i上。f_b和f_c分别是后门模型和对照模型的预测函数。CE是交叉熵损失函数。n是干净数据集中图像的总数量，i是当前图像的编号。L_backdoor和L_clean分别表示后门触发器对后门模型和对照模型的分类结果影响。在后门模型上，带有触发器的图像应该被分类到目标类别y_t，在对照模型上应该被分类到正确类别y_i。在这里只需使用一个对照模型。L_noise是应用于m的降噪函数。j和k分别代表矩阵m的行和列，a和b是求和符号的下标。L_noise通过将m的相邻像素点相加取绝对值再求和达到降噪的目的。x _i are randomly selected images from the clean dataset. J is an all-1 matrix with the same dimension as Δ. Δ*m+ _xi *(Jm) means overlaying the trigger on the image _xi . f _b and f _c are the prediction functions of the backdoor model and the control model, respectively. CE is the cross-entropy loss function. n is the total number of images in the clean dataset and i is the number of the current image. L _backdoor and L _clean represent the influence of backdoor triggers on the classification results of the backdoor model and the control model, respectively. On the backdoor model, images with triggers should be classified to the target class y _t , and on the control model should be classified to the correct class _yi . Only one control model is used here. L _noise is the noise reduction function applied to m. j and k represent the row and column of matrix m, respectively, and a and b are the subscripts of the summation symbol. L _noise achieves the purpose of noise reduction by adding the adjacent pixels of m, taking the absolute value, and then summing up.

4.潜在触发器精炼的流程如图2所示，包含如下步骤：4. The process of potential trigger refinement is shown in Figure 2, which includes the following steps:

4.1.从干净数据集中随机选择一组原始图像，并用潜在触发器覆盖。4.1. A set of raw images is randomly selected from the clean dataset and overlaid with latent triggers.

4.2.对所有图像计算代表分类结果贡献度热力图(一种尺寸与原始图像相同的二维矩阵，矩阵中点的数值越大则代表原始图像处于同位置的像素点对分类结果的贡献度越大)，并对所有热力图求平均得到平均热力图。4.2. Calculate the heat map representing the contribution of the classification result for all images (a two-dimensional matrix with the same size as the original image, the larger the value of the point in the matrix, the greater the contribution of the pixel in the same position of the original image to the classification result. large), and average all heatmaps to get the average heatmap.

4.3.根据平均热力图，移除潜在触发器中贡献度最低的区域。4.3. Based on the average heatmap, remove the regions with the lowest contribution among potential triggers.

4.4.计算潜在触发器目前的攻击成功率，若低于阈值(未精炼的原始潜在触发器的攻击成功率的95％)，则结束，否则跳转步骤4.34.4. Calculate the current attack success rate of the potential trigger, if it is lower than the threshold (95% of the attack success rate of the unrefined original potential trigger), then end, otherwise go to step 4.3

5.后门触发器识别的流程如图3所示，包含如下步骤：5. The process of backdoor trigger recognition is shown in Figure 3, including the following steps:

5.1.从干净数据集中随机选择一组图像，并用潜在触发器覆盖。5.1. Randomly select a set of images from a clean dataset and overlay with potential triggers.

5.2.计算潜在触发器在后门模型上的攻击成功率。5.2. Calculate the attack success rate of potential triggers on the backdoor model.

5.3.若攻击成功率低于一阈值(预先设定的超参数，值为60％)，则判定为对抗补丁，结束，否则跳转5.4。5.3. If the attack success rate is lower than a threshold (pre-set hyperparameter, the value is 60%), it is determined as an adversarial patch, and it ends, otherwise, jump to 5.4.

5.4.计算潜在触发器在所有对照模型上的攻击成功率。5.4. Calculate the attack success rate of potential triggers on all control models.

5.5.若在某一个对照模型上攻击成功率高于另一阈值(预先设定的超参数，与分类类别个数有关，在MNIST和GTSRB类别个数较少的数据集上为40％，在Youtube-Face和VGG-Face类别个数较多的数据集上为20％)，则判定为对抗补丁，否则判定为后门触发器。5.5. If the attack success rate on a control model is higher than another threshold (pre-set hyperparameters, related to the number of classification categories, 40% on MNIST and GTSRB datasets with a small number of categories, in Youtube-Face and VGG-Face categories are 20%), it is determined as an adversarial patch, otherwise it is determined as a backdoor trigger.

6.后门模型修复包含如下步骤：6. The repair of the backdoor model includes the following steps:

6.1.从干净数据集中随机选择一定比例的图像，并用后门触发器覆盖。6.1. Randomly select a certain proportion of images from the clean dataset and overlay with backdoor triggers.

6.2.将图像加入到干净数据集并保持图像的类别信息不变，得到对抗训练数据集。6.2. Add the image to the clean dataset and keep the category information of the image unchanged to get the adversarial training dataset.

6.3.用对抗训练数据集微调训练后门模型，去除后门模型中的后门。6.3. Fine-tune the training backdoor model with the adversarial training dataset to remove the backdoor in the backdoor model.

本实施例首先仿照后门攻击者的角度，在手写数字分类(MNIST数据集)、交通标志牌分类(GTSRB数据集)以及人脸分类(Youtube-Face和VGG-Face数据集)三个应用领域四个数据集上，用污染训练集(Badnets)和修改预训练模型(TrojanNN)两种主流后门攻击方式生成了60个后门模型；同时采用正常的模型训练方法生成了30个正常(无后门)模型。后门模型的“触发器”是覆盖原始图像的某种特殊图案，面积占比2％-25％，具有不同的位置、形状和图案。本发明在上述90个模型上达到了误报率(误检出后门的正常模型数量/总正常模型数量)和漏报率(未检出后门的后门模型数量/总后门模型数量)均小于10％的检测结果。This embodiment first imitates the perspective of backdoor attackers, in the three application fields of handwritten digit classification (MNIST data set), traffic sign classification (GTSRB data set) and face classification (Youtube-Face and VGG-Face data sets). On each dataset, 60 backdoor models were generated by two mainstream backdoor attack methods: polluted training set (Badnets) and modified pre-trained model (TrojanNN); at the same time, 30 normal (without backdoor) models were generated by using normal model training method . The "trigger" of the backdoor model is some kind of special pattern covering the original image, occupying 2%-25% of the area, with different positions, shapes and patterns. The present invention achieves that the false alarm rate (the number of normal models with falsely detected backdoors/the total number of normal models) and the false alarm rate (the number of backdoor models without backdoors/the total number of backdoor models) on the above 90 models are both less than 10 % of test results.

以上实施例仅用以说明本发明的技术方案而非对其进行限制，本领域的普通技术人员可以对本发明的技术方案进行修改或者等同替换，本发明的保护范围以权利要求所述为准。The above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Those skilled in the art can modify or equivalently replace the technical solutions of the present invention. The protection scope of the present invention is subject to the claims.

Claims

1. a backdoor detection and repair method for image classification model, is characterized in that, comprises the following steps:

Based on the clean data set, adopt the methods of model pruning, transfer learning and shallow model training to obtain a series of comparison models with the same task as the backdoor model but without the backdoor;

Reverse each category of the backdoor model by optimizing the objective function with the help of the control model and the clean dataset, and obtain a series of potential triggers, which include backdoor triggers and adversarial patches;

Calculate the contribution heat map based on the clean data set and potential triggers, use the contribution heat map to refine the potential triggers, and only retain the key features that affect the classification results of the model;

Based on the difference in transferability between backdoor triggers and adversarial patches on the control model, distinguish the backdoor triggers and adversarial patches of the refined potential triggers;

The differentiated backdoor triggers are added to the clean dataset, and the backdoor in the backdoor model is removed by adversarial training.

2. The method of claim 1, wherein the model pruning method is: remove the backdoor by pruning the neuron with low activation rate in the backdoor model, and restore the classification accuracy of the model by fine-tuning training simultaneously;

The transfer learning method is: based on a neural network model similar to the classification task of the backdoor model, a comparison model is obtained through transfer learning training;

The shallow model training method is: simplify the structure of the backdoor model, and train on the simplified model structure to obtain a comparison model.

3. The method of claim 1, wherein the objective function is optimized by adjusting the loss function weight of the objective function, and the formula is as follows:

Among them, the loss functions L _backdoor and L _clean represent the influence of the backdoor trigger on the classification results of the backdoor model and the control model, respectively, the loss function L _noise is the noise reduction function applied to m; α, β and γ are the weight coefficients of the loss function; Δ and m are the two variables of the objective function optimization and are three-dimensional matrices of the same size as the clean dataset, where _Δ is the pattern that holds the potential triggers; m is the transparency matrix that controls the position of the potential triggers; Randomly selected images in the dataset; J is an all-one matrix with the same dimension as Δ; Δ*m+ _xi *(Jm) means overlaying the trigger on the image _xi ; f _b and f _c are the backdoor model and prediction function of the control model; CE is the cross-entropy loss function; n is the total number of images in the clean dataset; i is the number of the current image; on the backdoor model, images with triggers are classified to the target class y _t , in The control model is classified to the correct category y _i ; j and k represent the rows and columns of the matrix m, respectively, and a and b are the subscripts of the summation symbols.

4. The method of claim 1, wherein the step of calculating the contribution degree heatmap according to the clean data set and potential triggers comprises:

A set of images is randomly selected from a clean dataset and overlaid with potential triggers;

Calculate the heat map representing the contribution of the classification results for all images, which is the contribution heat map.

5. The method of claim 1 or 4, wherein the step of refining the potential triggers using the contribution heat map comprises:

Average all contribution heatmaps to get the average heatmap;

According to the average heatmap, remove the area with the lowest current contribution among potential triggers;

Calculate the current attack success rate of the potential trigger, if it is lower than a threshold, end, otherwise continue to remove the area with the lowest current contribution in the potential trigger.

6. The method of claim 1, wherein the step of distinguishing between the backdoor triggers of the refined potential triggers and the adversarial patches comprises:

Calculate the attack success rate of the potential trigger on the backdoor model, if the attack success rate is lower than a threshold, it is determined as an adversarial patch, and the end;

If the attack success rate is not lower than the above threshold, calculate the attack success rate of the potential trigger on all comparison models. If the attack success rate on one comparison model is higher than the other threshold, it is judged as an adversarial patch, otherwise it is judged as Backdoor trigger.

7. The method of claim 1, wherein a certain proportion of images are randomly selected from the clean data set and covered with backdoor triggers; then the differentiated backdoor triggers are added to the clean data set.

8. The method of claim 1, wherein the step of removing the backdoor in the backdoor model by adversarial training comprises: adding the differentiated backdoor trigger into the clean data set, and keeping the category information of the image unchanged, Obtain the adversarial training dataset; fine-tune the training backdoor model with the adversarial training dataset, and remove the backdoor in the backdoor model.

9. A backdoor detection and repair system for an image classification model, characterized in that it comprises a memory and a processor, and a computer program is stored on the memory, and when the processor executes the program, any one of claims 1-8 is realized the steps of the method.

10. A computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the method of any one of claims 1-8.