WO2024051183A1 - A backdoor detection method based on decision-making shortcut search - Google Patents

A backdoor detection method based on decision-making shortcut search

Info

Publication number
WO2024051183A1
WO2024051183A1 PCT/CN2023/092167 CN2023092167W
Authority
WO
WIPO (PCT)
Prior art keywords
trigger
model
backdoor
detected
label
Prior art date
Application number
PCT/CN2023/092167
Other languages
English (en)
French (fr)
Inventor
董恺
卞绍鹏
李想
Original Assignee
南京逸智网络空间技术创新研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京逸智网络空间技术创新研究院有限公司 filed Critical 南京逸智网络空间技术创新研究院有限公司
Publication of WO2024051183A1 publication Critical patent/WO2024051183A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the invention belongs to the technical field of deep learning security, and specifically relates to a backdoor detection method based on decision-making shortcut search.
  • the outsourcing works as follows: the user provides the training set data to a third party, retains the test set data, and agrees in advance on the model structure and an accuracy threshold. If the accuracy of the final model on the test set is higher than the threshold, the model is accepted; otherwise it is rejected. Since the third party fully controls the training process and deep learning models lack interpretability, this outsourcing service may carry security risks. For example, in the backdoor attacks proposed in recent years, a third party can contaminate the training set by adding specific samples to implant a backdoor. A malicious model implanted with a backdoor shows no abnormality under normal circumstances and only misclassifies inputs under specific conditions, which achieves the attacker's goal.
  • backdoor attacks are a kind of poisoning attack: they pollute the training set by adding a certain proportion of poisoned samples carrying triggers. The finally trained model is called a malicious model. Under normal circumstances, the malicious model is almost indistinguishable from a normal model, and the backdoor is activated if and only if the input carries the preset trigger, causing the malicious model to misclassify this input as the attacker's target label.
  • the Badnets attack method can be divided into three steps: selecting triggers, contaminating the training set and model training. Subsequent attack methods were optimized and improved for selecting triggers, contaminating training sets, and model training.
  • the attack process of a backdoor attack is shown in Figure 1 and can be divided into three main steps: adding the trigger, confirming the match, and activating the backdoor. Blocking any one of these steps renders the attack ineffective. Therefore, backdoor attacks can be defended against from three aspects: removing the trigger, mismatching the trigger and the backdoor, and removing the backdoor.
  • if poisoned samples can be reasonably preprocessed before being fed into the model, so that the trigger no longer matches the backdoor, the attack can be defended against successfully.
  • following this idea, some researchers have used the idea of an auto-encoder to preprocess the model's input, so that the transformed trigger pattern deviates greatly from the original trigger pattern and can no longer activate the backdoor.
  • Implanting a backdoor into the model is essentially modifying the parameters of the model in a specific direction. If these malicious parameters can be removed, the impact of the backdoor can be offset and the backdoor can be removed.
  • some researchers proposed Neural Cleanse, a defense method based on trigger solving, which solves for a possible trigger for each label and then performs outlier detection on these triggers to determine whether there is a backdoor in the model.
  • Neural Cleanse requires a detailed detection of all labels. If the total number of classification labels of the model to be detected is very large, the detection efficiency of this method will be low.
  • the invention with the publication number CN113609482A mentions a backdoor detection and repair method and system for image classification models.
  • This invention only uses a small amount of clean data to detect and repair the backdoors of the image classification model and generate a normal model.
  • this invention requires the creation of a comparison model.
  • with the help of the comparison model, each category of the backdoored model is reverse-engineered by optimizing an objective function to obtain a series of potential triggers.
  • a contribution heat map is used to refine the potential triggers, keeping only the key features that affect the model's classification results; then, based on the difference in transferability of backdoor triggers and adversarial patches on the comparison model, the backdoor triggers and adversarial patches among the refined potential triggers are distinguished.
  • the comparison model is difficult to build and the entire detection method is computationally intensive.
  • the invention with the publication number CN114638356A mentions a static weight-guided deep neural network backdoor detection method and system, which leverages the advantages of static weight analysis, namely low computational cost and insensitivity to input sample quality and trigger type, and effectively improves the efficiency, accuracy, and scalability of neural network backdoor detection.
  • however, this invention requires a pre-trained neural network model on which static weight analysis is conducted to obtain the suspicious target label and victim label of the backdoor attack, forming a target-victim label pair.
  • when the amount of data is insufficient, the precision and accuracy of the pre-trained neural network model are insufficient, which easily leads to misclassification.
  • the present invention proposes a backdoor detection method based on decision-making shortcut search, which can quickly lock onto a small number of suspicious labels and maximize detection efficiency.
  • a backdoor detection method based on decision-making shortcut search includes the following steps:
  • S1: use random noise to generate P random-noise images composed of random noise pixel values, input the P random-noise images into the model to be detected, and record the frequency with which each classification label appears; sort all labels by frequency in descending order and take the first K labels as suspicious target labels; P and K are both positive integers greater than 1;
  • S3: repeat step S2 until the possible trigger coordinates corresponding to all suspicious target labels have been calculated;
  • S4: based on the attack success rate after adding each trigger and the size of the trigger, calculate anomaly scores for the K possible triggers; if the calculated anomaly score of any possible trigger is greater than the preset anomaly threshold, the model to be detected is a malicious model, and the attacker's target label is the label corresponding to that possible trigger;
  • in step S2, one suspicious target label i is selected and assumed to be the attacker's target label; the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i is calculated, yielding the possible trigger coordinates corresponding to label i; this process includes the following sub-steps:
  • in a backdoor attack, the trigger consists of two parts: the trigger coordinates m and the trigger pattern Δ; the trigger is added to a sample x as A(x, m, Δ) = (1 - m)·x + m·Δ.
  • F() represents the model to be detected
  • J() is the loss function, used to quantify the classification loss
  • y_i is the currently assumed target label
  • |m| is the L1 norm of m, indicating the extent of the pixels that need to be modified
  • X denotes an available clean data set containing no contaminated samples
  • the optimization objective for solving the trigger coordinates is: while the model to be detected classifies all modified images as y_i, minimize the L1 norm of m, i.e., modify as few pixels as possible.
  • part of the data of the user test set is used to generate the clean data set.
  • in step S4, it is determined whether the attack success rate after adding a trigger is less than the preset attack-success-rate threshold; if so, that possible trigger is excluded directly; otherwise, the anomaly score of the possible trigger is calculated from the attack success rate after adding the trigger and the size of the trigger.
  • in step S4, the following formula (3) is used to calculate the anomaly score grade of a possible trigger:
  • acc represents the attack success rate after adding a trigger
  • sumPixel represents the total number of pixels in the input sample
  • |m| represents the size of the trigger.
  • in step S5, the model to be detected is retrained using the solved triggers whose anomaly scores are greater than the anomaly threshold.
  • the process of making the backdoor ineffective by modifying the parameters of the model includes the following sub-steps:
  • the present invention proposes a backdoor detection method based on decision-making shortcut search, which can quickly lock onto suspicious labels and solve for the coordinate information of real triggers; it only requires detailed detection of a small number of labels to detect malicious models efficiently.
  • the solved triggers are used to retrain the model to be detected to remove the backdoor and finally obtain a normal model, which greatly reduces the time complexity of the detection algorithm and allows suspicious labels to be locked onto quickly.
  • Figure 1 is a schematic diagram of the attack principle of a backdoor attack
  • Figure 2 is a flow chart of a backdoor detection method based on decision-making shortcut search according to an embodiment of the present invention.
  • Figure 3 is a schematic diagram of a backdoor detection method based on decision-making shortcut search according to an embodiment of the present invention.
  • FIG. 2 is a flow chart of a backdoor detection method based on decision-making shortcut search according to an embodiment of the present invention.
  • the backdoor detection method includes the following steps:
  • S1: use random noise to generate P random-noise images composed of random noise pixel values, input the P random-noise images into the model to be detected and record the frequency with which each classification label appears, sort all labels by frequency in descending order, and take the first K labels as suspicious target labels; P and K are both positive integers greater than 1.
  • S3: repeat step S2 until the possible trigger coordinates corresponding to all suspicious target labels have been calculated.
  • S4: based on the attack success rate after adding each trigger and the size of the trigger, calculate anomaly scores for the K possible triggers; if the calculated anomaly score of any possible trigger is greater than the preset anomaly threshold, the model to be detected is a malicious model, and the attacker's target label is the label corresponding to that possible trigger.
  • This embodiment designs a backdoor detection method based on decision-making shortcut search. Taking the model trained on the CIFAR10 data set as an example, see Figure 3. Determining whether there is a backdoor in the model can be divided into the following four steps:
  • this method uses the model's classification of random noise images to quickly narrow the label search range to K labels.
  • the previous detection method Neural Cleanse requires detailed detection of all labels of the model to be detected; for a model with a huge number of classification labels, checking in detail whether each label is the attacker's target label would consume a great deal of time and computing resources.
  • the specific search process includes the following steps: First, use random noise to generate P pictures composed of random noise pixel values. Secondly, input these random noise images into the model to be detected and record the frequency of occurrence of each classification label. Finally, all labels are sorted from large to small according to the frequency of label occurrence, and the top K labels are the suspicious target labels.
  • unlike the detection method Neural Cleanse, this embodiment takes advantage of the characteristics of the malicious model and can quickly narrow the range of suspicious labels to K, reducing the time complexity from O(N) to a constant level and significantly improving detection efficiency.
  • in step (2), this embodiment conducts detailed detection on each of the K suspicious labels obtained above and solves for one set of possible trigger coordinates for each.
  • the previous detection method Neural Cleanse needs to solve the trigger coordinates and trigger pixel values at the same time, which consumes a lot of time and computing resources.
  • the rationale for discovering suspicious labels is as follows: in the malicious model, a sample with label A only needs a trigger added by modifying a very small number of pixels for the malicious model to misclassify it as the attack target label B, but a sample with label A needs a large number of pixels modified for the malicious model to misclassify it as label C. For a normal model, making the model misclassify a sample with label A as any of the other labels requires modifying a large number of pixels.
  • this embodiment takes advantage of this characteristic of the malicious model: each suspicious label is assumed in turn to be the attacker's target label, and the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i is calculated, which yields the possible trigger for label i.
  • the specific solution method is shown in Equation (2).
  • F() represents the model to be detected
  • J() is the loss function, used to quantify the classification loss
  • y i is the currently assumed target label
  • |m| is the L1 norm of m, indicating the extent of the pixels that need to be modified.
  • X represents the clean data set that can be obtained. In general, X can use part of the data of the user's test set.
  • the optimization goal is to minimize the L1 norm of m, that is, to modify the fewest pixels, while the model classifies all modified images as y_i: min_m J(F(A(x, m, Δ)), y_i) + α·|m|   for x ∈ X   (2)
  • This embodiment takes advantage of the sensitivity of the malicious model to random noise and only solves the trigger coordinates.
  • the trigger pixel values are generated from random noise, which greatly improves the efficiency of the optimization.
  • in step (3), outlier detection is performed on the K possible triggers obtained above. If an anomalous trigger is detected, there is a backdoor in the model, and the attacker's attack label is the label corresponding to this trigger. Since the L1 norm of the real trigger will be much smaller than that of the other possible triggers, the method of the present invention judges comprehensively whether the model to be detected is a malicious model based on the attack success rate after adding the trigger and the size of the trigger. For the real attack label, only a small mask is needed to achieve a high attack success rate.
  • this method calculates the grade of a trigger based on the attack success rate acc after adding the trigger and the size of the trigger. Since the trigger pixel values added each time are randomly generated, for some benign labels it is difficult to achieve a high attack success rate even if a large number of pixels are modified. Therefore, if the obtained acc is less than the specified attack-success-rate threshold, the trigger is eliminated directly. If the attack success rate acc achieved with a label's triggers is high, the trigger whose acc is greater than the threshold and whose mask is smallest is taken as the trigger coordinates of this label, and the label's grade is calculated from acc and the mask size for a comprehensive judgment. For the real attack label, only a small mask is needed to achieve a high attack success rate. In Equation (3), sumPixel represents the total number of pixels in an input sample. If the obtained grade is greater than the specified threshold, there is a backdoor in the model, and the attacker's target label is the label corresponding to the trigger:
  • in step (4), the backdoor in the malicious model needs to be removed.
  • in order to disable the backdoor while preserving the model's normal functionality, this embodiment uses the solved triggers to retrain the model and modifies the model's parameters to disable the backdoor.
  • the specific method is as follows: first, select a suitable subset of clean samples from the benign training set; second, add random-noise pixel values at the m coordinate positions of these samples without changing their labels, to create "reverse-poisoned samples"; finally, retrain the model with these reverse-poisoned samples plus part of the benign training set, making the model "forget" the learned trigger.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a backdoor detection method based on decision-making shortcut search, comprising: determining K suspicious labels by means of random noise; solving for a minimal set of trigger coordinates for each suspicious label; analyzing whether the solved trigger coordinates contain an outlier; and retraining the malicious model to disable the backdoor, finally obtaining a normal model. The present invention can quickly lock onto suspicious labels and solve for the coordinate information of the real trigger; detailed detection of only a small number of labels is needed to detect malicious models efficiently. The solved triggers are used to retrain the model to be detected so as to remove the backdoor and finally obtain a normal model, which greatly reduces the time complexity of the detection algorithm and enables suspicious labels to be locked onto quickly.

Description

A backdoor detection method based on decision-making shortcut search
Technical Field
The present invention belongs to the technical field of deep learning security, and specifically relates to a backdoor detection method based on decision-making shortcut search.
Background Art
With the widespread application of deep learning in daily life, its security problems have gradually been exposed. Training a reasonably good deep learning model requires a large amount of time and computing resources, which ordinary companies and individuals cannot provide, so they usually outsource the training process to a third party. The outsourcing works as follows: the user provides the training set data to the third party, retains the test set data, and agrees in advance on the model structure and an accuracy threshold; if the accuracy of the final model on the test set is higher than the threshold, the model is accepted, otherwise it is rejected. Since the third party fully controls the training process and deep learning models lack interpretability, this outsourcing service may carry security risks. For example, in the backdoor attacks proposed in recent years, the third party can contaminate the training set by adding specific samples, thereby implanting a backdoor. A malicious model implanted with a backdoor shows no abnormality under normal circumstances and only misclassifies inputs under specific conditions, which achieves the attacker's goal.
A backdoor attack is a kind of poisoning attack: it pollutes the training set by adding a certain proportion of poisoned samples carrying triggers, and the model finally trained on it is called a malicious model. Under normal circumstances, a malicious model is almost indistinguishable from a normal model; the backdoor is activated if and only if the input carries the preset trigger, causing the malicious model to misclassify that input as the attacker's target label. The Badnets attack method can be divided into three steps: selecting a trigger, contaminating the training set, and training the model. Subsequent attack methods improved on trigger selection, training-set contamination, and model training respectively.
The attack process of a backdoor attack is shown in Figure 1 and consists of three main steps: adding the trigger, confirming the match, and activating the backdoor. Blocking any one of these steps renders the attack ineffective. Therefore, backdoor attacks can be defended against from three aspects: removing the trigger, mismatching the trigger and the backdoor, and removing the backdoor.
(1) Removing the trigger
Some researchers have used GradCAM to detect the most important regions of an input image, such as the region where the trigger pattern is located, and then covered this region with neutral pixel values. Finally, a GAN-based method is used to "restore" the pixel values in this region, thereby mitigating the impact on benign inputs.
(2) Mismatching the trigger and the backdoor
If poisoned samples can be reasonably preprocessed before being fed into the model, so that the trigger no longer matches the backdoor, the attack can be defended against successfully. Following this idea, some researchers have applied the idea of an auto-encoder to preprocess the model's input, so that the transformed trigger pattern deviates greatly from the original one and can no longer activate the backdoor.
(3) Removing the backdoor
Implanting a backdoor into a model essentially modifies the model's parameters in a specific direction; if these malicious parameters can be removed, the effect of the backdoor is offset and the backdoor is removed. In an infected model, some neurons are dedicated to recognizing the trigger and hardly respond to benign inputs, so pruning these neurons removes the backdoor. To remove the backdoor more accurately and efficiently, the trigger can first be solved for and then used to remove the backdoor. Some researchers proposed Neural Cleanse, a defense method based on trigger solving, which solves for a possible trigger for each label and then performs outlier detection on these triggers to determine whether a backdoor exists in the model.
The previously proposed backdoor detection method Neural Cleanse needs to perform a detailed detection for every label; if the total number of classification labels of the model to be detected is very large, the detection efficiency of this method is low.
The invention with publication number CN113609482A mentions a backdoor detection and repair method and system for image classification models. That invention detects and repairs backdoors of an image classification model and generates a normal model using only a small amount of clean data. However, it needs to create a comparison model: with the help of the comparison model, each class of the backdoored model is reverse-engineered by optimizing an objective function to obtain a series of potential triggers; a contribution heat map is used to refine the potential triggers, keeping only the key features that affect the model's classification results; then, based on the difference in transferability of backdoor triggers and adversarial patches on the comparison model, backdoor triggers are distinguished from adversarial patches among the refined potential triggers. The comparison model is difficult to build, and the whole detection method is computationally intensive.
The invention with publication number CN114638356A mentions a static weight-guided deep neural network backdoor detection method and system, which exploits the advantages of static weight analysis, namely low computational cost and insensitivity to input sample quality and trigger type, and effectively improves the efficiency, accuracy, and scalability of neural network backdoor detection. However, that invention requires a pre-trained neural network model on which static weight analysis is performed to obtain the suspicious target label and victim label of the backdoor attack, forming target-victim label pairs. When the amount of data is insufficient, the precision and accuracy of the pre-trained neural network model are insufficient, which easily leads to misclassification.
Summary of the Invention
Technical problem to be solved: in order to solve the problem of low detection efficiency, the present invention proposes a backdoor detection method based on decision-making shortcut search, which can quickly lock onto a small number of suspicious labels and maximize detection efficiency.
Technical solution:
A backdoor detection method based on decision-making shortcut search, the backdoor detection method comprising the following steps:
S1: Use random noise to generate P random-noise images composed of random noise pixel values, input the P random-noise images into the model to be detected, record the frequency with which each classification label appears, sort all labels by frequency in descending order, and take the first K labels as suspicious target labels; P and K are both positive integers greater than 1;
S2: Select one suspicious target label i, assume it is the attacker's target label, calculate the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i, and thereby obtain the possible trigger coordinates corresponding to label i; i = 1, 2, ..., K;
S3: Repeat step S2 until the possible trigger coordinates corresponding to all suspicious target labels have been calculated;
S4: Based on the attack success rate after adding each trigger and the size of the trigger, calculate anomaly scores for the K possible triggers; if the anomaly score of any possible trigger is greater than a preset anomaly threshold, the model to be detected is a malicious model, and the attacker's target label is the label corresponding to that possible trigger;
S5: Retrain the model to be detected using the solved triggers whose anomaly scores exceed the anomaly threshold, modifying the model's parameters so that the backdoor is disabled.
Further, in step S2, selecting one suspicious target label i, assuming it is the attacker's target label, calculating the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i, and obtaining the possible trigger coordinates corresponding to label i comprises the following sub-steps:
S21: In a backdoor attack, the trigger consists of two parts, the trigger coordinates m and the trigger pattern Δ; the trigger is added to a sample x using the following formula (1):
A(x, m, Δ) = (1 - m)·x + m·Δ      (1);
S22: The trigger coordinates are solved for using the following formula (2):
min_m J(F(A(x, m, Δ)), y_i) + α·|m|   for x ∈ X       (2)
where F() denotes the model to be detected; J() is the loss function, used to quantify the classification loss; y_i is the currently assumed target label; |m| is the L1 norm of m, indicating the extent of the pixels that need to be modified; X denotes an available clean data set containing no contaminated samples. The optimization objective for solving the trigger coordinates is: while the model to be detected classifies all modified images as y_i, minimize the L1 norm of m, i.e., modify as few pixels as possible.
Further, part of the data of the user's test set is used to generate the clean data set.
Further, in step S4, it is determined whether the attack success rate after adding a trigger is less than a preset attack-success-rate threshold; if so, that possible trigger is excluded directly; otherwise, the anomaly score of the possible trigger is calculated from the attack success rate after adding the trigger and the size of the trigger.
Further, in step S4, the anomaly score grade of a possible trigger is calculated using the following formula (3):
where acc denotes the attack success rate after adding the trigger, sumPixel denotes the total number of pixels in an input sample, and |m| denotes the size of the trigger.
Further, in step S5, retraining the model to be detected using the solved triggers whose anomaly scores exceed the anomaly threshold and disabling the backdoor by modifying the model's parameters comprises the following sub-steps:
S61: Select a suitable subset of clean samples from a clean data set containing no contaminated samples;
S62: Add random-noise pixel values at the m coordinate positions of the selected clean samples without changing their labels, so as to create reverse-poisoned samples;
S63: Retrain the model to be detected using the created reverse-poisoned samples together with part of the clean data set.
Beneficial effects:
The present invention proposes a backdoor detection method based on decision-making shortcut search, which can quickly lock onto suspicious labels and solve for the coordinate information of the real trigger. Detailed detection of only a small number of labels is needed to detect malicious models efficiently; the solved triggers are used to retrain the model to be detected so as to remove the backdoor and finally obtain a normal model, which greatly reduces the time complexity of the detection algorithm.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the attack principle of a backdoor attack;
Figure 2 is a flow chart of a backdoor detection method based on decision-making shortcut search according to an embodiment of the present invention.
Figure 3 is a schematic diagram of a backdoor detection method based on decision-making shortcut search according to an embodiment of the present invention.
Detailed Description of the Embodiments
The following embodiments enable those skilled in the art to understand the present invention more fully, but do not limit the present invention in any way.
Figure 2 is a flow chart of a backdoor detection method based on decision-making shortcut search according to an embodiment of the present invention. Referring to Figure 2, the backdoor detection method includes the following steps:
S1: Use random noise to generate P random-noise images composed of random noise pixel values, input the P random-noise images into the model to be detected, record the frequency with which each classification label appears, sort all labels by frequency in descending order, and take the first K labels as suspicious target labels; P and K are both positive integers greater than 1.
S2: Select one suspicious target label i, assume it is the attacker's target label, calculate the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i, and thereby obtain the possible trigger coordinates corresponding to label i; i = 1, 2, ..., K.
S3: Repeat step S2 until the possible trigger coordinates corresponding to all suspicious target labels have been calculated.
S4: Based on the attack success rate after adding each trigger and the size of the trigger, calculate anomaly scores for the K possible triggers; if the anomaly score of any possible trigger is greater than a preset anomaly threshold, the model to be detected is a malicious model, and the attacker's target label is the label corresponding to that possible trigger.
S5: Retrain the model to be detected using the solved triggers whose anomaly scores exceed the anomaly threshold, modifying the model's parameters so that the backdoor is disabled.
This embodiment designs a backdoor detection method based on decision-making shortcut search. Taking a model trained on the CIFAR10 data set as an example (see Figure 3), determining whether a backdoor exists in the model can be divided into the following four main steps:
(1) Determine K suspicious labels through random noise.
In step (1), the method uses the model's classification of random-noise images to quickly narrow the label search range down to K labels. The previous detection method Neural Cleanse needs to examine every label of the model to be detected in detail; for a model with a very large number of classification labels, checking in detail whether each label is the attacker's target label would consume a great deal of time and computing resources. The specific search process is as follows: first, use random noise to generate P images composed of random noise pixel values; second, input these random-noise images into the model to be detected and record the frequency with which each classification label appears; finally, sort all labels by frequency in descending order; the top K labels are the suspicious target labels.
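A minimal sketch of this label-narrowing step, assuming a PyTorch image classifier `model` that returns logits; the function name and the default values P = 200 noise images and K = 3 are illustrative choices, not values specified by the patent.
```python
import torch

def find_suspicious_labels(model, num_images=200, image_shape=(3, 32, 32), top_k=3, device="cpu"):
    """Feed P random-noise images to the model and return the K most
    frequently predicted labels as the suspicious target labels."""
    model.eval()
    noise = torch.rand(num_images, *image_shape, device=device)   # random pixel values in [0, 1]
    with torch.no_grad():
        preds = model(noise).argmax(dim=1)                        # predicted label for each noise image
    counts = torch.bincount(preds)                                 # frequency of each predicted label
    top_k = min(top_k, counts.numel())
    suspicious = torch.topk(counts, k=top_k).indices               # labels sorted by frequency, descending
    return suspicious.tolist()
```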
Unlike the detection method Neural Cleanse, this embodiment exploits a characteristic of malicious models and can quickly narrow the range of suspicious labels down to K, reducing the time complexity from O(N) to a constant level and significantly improving detection efficiency.
(2) Solve for a minimal set of trigger coordinates for each suspicious label.
In step (2), this embodiment performs detailed detection on each of the K suspicious labels obtained above and solves for one set of possible trigger coordinates per label. The previous detection method Neural Cleanse needs to solve for the trigger coordinates and the trigger pixel values simultaneously, which consumes a great deal of time and computing resources. The rationale for discovering suspicious labels is as follows: in a malicious model, a sample with label A only needs a trigger added by modifying a very small number of pixels for the malicious model to misclassify it as the attack target label B, whereas a large number of pixels must be modified for the malicious model to misclassify it as label C. For a normal model, making it misclassify a sample with label A as any of the other labels requires modifying a large number of pixels.
This embodiment exploits this characteristic of malicious models: each suspicious label is assumed in turn to be the attacker's target label, and the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i is calculated, which gives the possible trigger for label i. In a backdoor attack, the trigger consists of two parts, the trigger coordinates m and the trigger pattern Δ; the trigger is added through the function A(), and the way a trigger is added to a sample x is shown in formula (1):
A(x, m, Δ) = (1 - m)·x + m·Δ      (1).
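Formula (1) translates directly into code. A minimal sketch, assuming image tensors with values in [0, 1] and a mask m that broadcasts over the channel dimension:
```python
import torch

def apply_trigger(x, mask, pattern):
    """A(x, m, Δ) = (1 - m) · x + m · Δ
    x:       image tensor, e.g. shape (C, H, W) or (B, C, H, W)
    mask:    trigger coordinates m in [0, 1], e.g. shape (1, H, W), broadcast over channels
    pattern: trigger pattern Δ, broadcastable to the shape of x"""
    return (1 - mask) * x + mask * pattern
```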
However, research has found that the malicious model does not learn the specific pixel values, so there is no need to solve for the trigger pattern Δ; only the trigger coordinates m need to be solved for. The specific solution method is shown in formula (2), where F() denotes the model to be detected, J() is the loss function used to quantify the classification loss, y_i is the currently assumed target label, |m| is the L1 norm of m, indicating the extent of the pixels that need to be modified, and X denotes the clean data set that can be obtained. In general, X can use part of the data of the user's test set. The optimization objective is: while the model classifies all modified images as y_i, minimize the L1 norm of m, i.e., modify as few pixels as possible.
min_m J(F(A(x, m, Δ)), y_i) + α·|m|   for x ∈ X       (2).
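A sketch of the mask-only optimization in formula (2), assuming PyTorch. As described above, the trigger pattern Δ is fixed to random noise and only the mask m is optimized; passing an unconstrained variable w through a sigmoid keeps m in [0, 1]. The optimizer, learning rate, α, and epoch count are illustrative assumptions, and `clean_loader` is assumed to be a DataLoader over a small clean data set (e.g., part of the user's test set).
```python
import torch

def solve_trigger_mask(model, clean_loader, target_label, image_shape=(3, 32, 32),
                       alpha=0.01, epochs=30, lr=0.1, device="cpu"):
    """Solve min_m J(F(A(x, m, Δ)), y_i) + α·|m| with Δ fixed to random noise."""
    model.eval()
    pattern = torch.rand(image_shape, device=device)                          # Δ: random-noise trigger pattern
    w = torch.zeros(1, *image_shape[1:], device=device, requires_grad=True)   # unconstrained mask variable
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(epochs):
        for x, _ in clean_loader:                                  # clean samples; their labels are ignored
            x = x.to(device)
            mask = torch.sigmoid(w)                                # m in [0, 1]
            patched = (1 - mask) * x + mask * pattern              # A(x, m, Δ), formula (1)
            target = torch.full((x.size(0),), target_label, dtype=torch.long, device=device)
            loss = torch.nn.functional.cross_entropy(model(patched), target) + alpha * mask.abs().sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return torch.sigmoid(w).detach()                               # solved trigger coordinates m
```
Running this once per suspicious label yields the K candidate trigger coordinates used in the next step.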
This embodiment exploits the sensitivity of malicious models to random noise: only the trigger coordinates need to be solved for, while the trigger pixel values are generated from random noise, which greatly improves the efficiency of the optimization.
(3) Analyze whether the solved trigger coordinates contain an outlier.
In step (3), outlier detection is performed on the K possible triggers obtained above. If an anomalous trigger is detected, a backdoor exists in the model, and the attacker's attack label is the label corresponding to that trigger. Since the L1 norm of the real trigger is much smaller than that of the other possible triggers, the method of the present invention judges comprehensively whether the model to be detected is a malicious model based on the attack success rate after adding the trigger and the size of the trigger. For the real attack label, only a small mask is needed to reach a high attack success rate.
The method calculates the grade of a trigger from the attack success rate acc after adding the trigger and the size of the trigger. Because the trigger pixel values added each time are randomly generated, for some benign labels it is difficult to reach a high attack success rate even if a large number of pixels are modified. Therefore, if the obtained acc is less than the specified attack-success-rate threshold, the trigger is excluded directly. If the attack success rate acc achieved by a label's triggers is high, the trigger whose acc exceeds the threshold and whose mask is smallest is taken as the trigger coordinates of that label, and the label's grade is calculated from acc and the mask size for a comprehensive judgment. For the real attack label, only a small mask is needed to reach a high attack success rate. In formula (3), sumPixel denotes the total number of pixels in an input sample. If the obtained grade is greater than the specified threshold, a backdoor exists in the model, and the attacker's target label is the label corresponding to the trigger:
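A sketch of this outlier-judgment step, again assuming PyTorch. The attack success rate acc is measured by pasting the solved mask, filled with fresh random-noise pixel values, onto clean samples and counting how often the model outputs the target label. Note that formula (3) itself is not reproduced in the text above, so the grade expression below (success rate weighted by how small the mask is relative to sumPixel) is only an assumed stand-in, and the thresholds are illustrative.
```python
import torch

def attack_success_rate(model, clean_loader, mask, target_label, device="cpu"):
    """Fraction of clean samples classified as target_label after pasting
    random-noise pixel values at the solved trigger coordinates m."""
    model.eval()
    hits, total = 0, 0
    with torch.no_grad():
        for x, _ in clean_loader:
            x = x.to(device)
            patched = (1 - mask) * x + mask * torch.rand_like(x)   # trigger pattern = random noise
            preds = model(patched).argmax(dim=1)
            hits += (preds == target_label).sum().item()
            total += x.size(0)
    return hits / max(total, 1)

def trigger_grade(acc, mask, sum_pixel, acc_threshold=0.75):
    """Anomaly score of one candidate trigger. The exact formula (3) is not
    given in the text; this combination of acc and |m|/sumPixel is an assumption."""
    if acc < acc_threshold:                       # low success rate: exclude this candidate directly
        return 0.0
    mask_l1 = float(mask.abs().sum())             # |m|, the L1 norm of the mask
    return acc * (1.0 - mask_l1 / sum_pixel)      # high acc with a small mask -> high grade
```
A model would then be flagged as malicious if any label's grade exceeds the chosen threshold, with that label reported as the attacker's target label.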
(4) Retrain the malicious model to disable the backdoor, finally obtaining a normal model.
In step (4), the backdoor in the malicious model needs to be removed. To disable the backdoor while preserving the model's normal functionality, this embodiment retrains the model using the solved triggers, modifying the model's parameters so that the backdoor is disabled. The specific approach is as follows: first, select a suitable subset of clean samples from the benign training set; second, add random-noise pixel values at the m coordinate positions of these samples without changing their labels, to create "reverse-poisoned samples"; finally, retrain the model with these reverse-poisoned samples plus part of the benign training set, so that the model "forgets" the learned trigger.
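The backdoor-removal step can be sketched as follows, assuming PyTorch, with `mask` being the solved trigger coordinates and `clean_loader` iterating over benign training samples; the poisoning ratio, optimizer, learning rate, and epoch count are illustrative choices rather than values from the patent.
```python
import torch

def build_reverse_poisoned_batch(x, mask):
    """Paste random-noise pixel values at the solved trigger coordinates m,
    keeping the original labels unchanged ("reverse-poisoned samples")."""
    noise = torch.rand_like(x)
    return (1 - mask) * x + mask * noise

def unlearn_backdoor(model, clean_loader, mask, epochs=5, lr=1e-3, poison_ratio=0.2, device="cpu"):
    """Retrain the model on a mix of benign and reverse-poisoned samples."""
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in clean_loader:
            x, y = x.to(device), y.to(device)
            n_poison = int(poison_ratio * x.size(0))               # reverse-poison part of each batch
            if n_poison > 0:
                x = x.clone()
                x[:n_poison] = build_reverse_poisoned_batch(x[:n_poison], mask)
            loss = torch.nn.functional.cross_entropy(model(x), y)  # labels stay unchanged
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```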
The above are only preferred embodiments of the present invention; the protection scope of the present invention is not limited to the above embodiments, and all technical solutions falling under the concept of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention shall be regarded as falling within the protection scope of the present invention.

Claims (6)

  1. A backdoor detection method based on decision-making shortcut search, characterized in that the backdoor detection method comprises the following steps:
    S1: using random noise to generate P random-noise images composed of random noise pixel values, inputting the P random-noise images into the model to be detected, recording the frequency with which each classification label appears, sorting all labels by frequency in descending order, and taking the first K labels as suspicious target labels, P and K both being positive integers greater than 1;
    S2: selecting one suspicious target label i, assuming it is the attacker's target label, calculating the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i, and thereby obtaining the possible trigger coordinates corresponding to label i, i = 1, 2, ..., K;
    S3: repeating step S2 until the possible trigger coordinates corresponding to all suspicious target labels have been calculated;
    S4: based on the attack success rate after adding each trigger and the size of the trigger, calculating anomaly scores for the K possible triggers, wherein if the anomaly score of any possible trigger is greater than a preset anomaly threshold, the model to be detected is a malicious model and the attacker's target label is the label corresponding to that possible trigger;
    S5: retraining the model to be detected using the solved triggers whose anomaly scores exceed the anomaly threshold, and modifying the model's parameters so that the backdoor is disabled.
  2. The backdoor detection method based on decision-making shortcut search according to claim 1, characterized in that, in step S2, selecting one suspicious target label i, assuming it is the attacker's target label, calculating the minimum modification required for the model to be detected to classify all samples of the remaining labels as label i, and obtaining the possible trigger coordinates corresponding to label i comprises the following sub-steps:
    S21: in a backdoor attack, the trigger consists of two parts, the trigger coordinates m and the trigger pattern Δ, and the trigger is added to a sample x using the following formula (1):
    A(x, m, Δ) = (1 - m)·x + m·Δ   (1);
    S22: solving for the trigger coordinates using the following formula (2):
    min_m J(F(A(x, m, Δ)), y_i) + α·|m|   for x ∈ X   (2)
    where F() denotes the model to be detected, J() is the loss function used to quantify the classification loss, y_i is the currently assumed target label, |m| is the L1 norm of m, indicating the extent of the pixels that need to be modified, and X denotes an available clean data set containing no contaminated samples; the optimization objective for solving the trigger coordinates is: while the model to be detected classifies all modified images as y_i, minimize the L1 norm of m, i.e., modify as few pixels as possible.
  3. The backdoor detection method based on decision-making shortcut search according to claim 2, characterized in that part of the data of the user's test set is used to generate the clean data set.
  4. The backdoor detection method based on decision-making shortcut search according to claim 1, characterized in that, in step S4, it is determined whether the attack success rate after adding a trigger is less than a preset attack-success-rate threshold; if so, that possible trigger is excluded directly; otherwise, the anomaly score of the possible trigger is calculated from the attack success rate after adding the trigger and the size of the trigger.
  5. The backdoor detection method based on decision-making shortcut search according to claim 1 or 4, characterized in that, in step S4, the anomaly score grade of a possible trigger is calculated using the following formula (3):
    where acc denotes the attack success rate after adding the trigger, sumPixel denotes the total number of pixels in an input sample, and |m| denotes the size of the trigger.
  6. The backdoor detection method based on decision-making shortcut search according to claim 1, characterized in that, in step S5, retraining the model to be detected using the solved triggers whose anomaly scores exceed the anomaly threshold and disabling the backdoor by modifying the model's parameters comprises the following sub-steps:
    S61: selecting a suitable subset of clean samples from a clean data set containing no contaminated samples;
    S62: adding random-noise pixel values at the m coordinate positions of the selected clean samples without changing their labels, so as to create reverse-poisoned samples;
    S63: retraining the model to be detected using the created reverse-poisoned samples together with part of the clean data set.
PCT/CN2023/092167 2022-09-08 2023-05-05 A backdoor detection method based on decision-making shortcut search WO2024051183A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211093403.6 2022-09-08
CN202211093403.6A CN115186816B (zh) 2022-09-08 2022-09-08 A backdoor detection method based on decision-making shortcut search

Publications (1)

Publication Number Publication Date
WO2024051183A1 true WO2024051183A1 (zh) 2024-03-14

Family

ID=83523799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092167 WO2024051183A1 (zh) 2022-09-08 2023-05-05 A backdoor detection method based on decision-making shortcut search

Country Status (2)

Country Link
CN (1) CN115186816B (zh)
WO (1) WO2024051183A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186816B (zh) 2022-09-08 2022-12-27 南京逸智网络空间技术创新研究院有限公司 A backdoor detection method based on decision-making shortcut search
CN116739073B (zh) * 2023-08-10 2023-11-07 武汉大学 An online backdoor sample detection method and system based on evolutionary deviation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410098A1 (en) * 2019-06-26 2020-12-31 Hrl Laboratories, Llc System and method for detecting backdoor attacks in convolutional neural networks
CN114299365A (zh) * 2022-03-04 2022-04-08 上海观安信息技术股份有限公司 Method and system for detecting hidden backdoors in image models, storage medium, and terminal
CN114638356A (zh) * 2022-02-25 2022-06-17 武汉大学 A static weight-guided deep neural network backdoor detection method and system
CN115186816A (zh) * 2022-09-08 2022-10-14 南京逸智网络空间技术创新研究院有限公司 A backdoor detection method based on decision-making shortcut search

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920955B (zh) * 2018-06-29 2022-03-11 北京奇虎科技有限公司 A webpage backdoor detection method, apparatus, device, and storage medium
CN113297571B (zh) * 2021-05-31 2022-06-07 浙江工业大学 Detection method and apparatus for backdoor attacks against graph neural network models
CN113902962B (zh) * 2021-12-09 2022-03-04 北京瑞莱智慧科技有限公司 Backdoor implantation method, apparatus, medium, and computing device for object detection models


Also Published As

Publication number Publication date
CN115186816A (zh) 2022-10-14
CN115186816B (zh) 2022-12-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23861889

Country of ref document: EP

Kind code of ref document: A1