CN113792289B

CN113792289B - Method and system for defending backdoor attack

Info

Publication number: CN113792289B
Application number: CN202111354846.1A
Authority: CN
Inventors: 范洺源; 陈岑; 王力
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2021-11-16
Filing date: 2021-11-16
Publication date: 2022-03-25
Anticipated expiration: 2041-11-16
Also published as: CN113792289A

Abstract

The embodiment of the specification provides a defense method and a defense system for backdoor attacks, and the method comprises the following steps: acquiring a trigger and a target label; and performing forgetting training on the back door model based on the trigger and the target label to obtain the target model capable of defending the back door attack. The method can effectively defend the backdoor attack of the attacker on the model.

Description

Method and system for defending backdoor attack

Technical Field

The present disclosure relates to the field of information security technologies, and in particular, to a method and a system for defending against backdoor attacks.

Background

Machine learning models are applicable in various fields such as image recognition, natural language processing, and the like, and can play an important role in data processing in various fields. In the application of machine learning models, the models may be subject to backdoor attacks (which may also be referred to as poison attacks or trojan horse attacks). In a back door attack, an attacker can implant a back door in a model and manipulate the output of the model by inputting input data with a trigger to the model in which the back door is implanted, causing it to output a tag specified by the attacker. A rear door attack may have serious adverse consequences for the application of the model, for example, in an application scenario of automatic driving, the model implanted in the rear door erroneously identifies a pedestrian as another object, which may result in that the pedestrian cannot be avoided in time and be bruised.

Therefore, a method and a system for defending against backdoor attacks are needed to effectively defend against model backdoor attacks.

Disclosure of Invention

One aspect of the present specification provides a method of defending against a backdoor attack, the method comprising: generating one or more reconstruction triggers corresponding to the real triggers thereof based on the back door model, and determining a target label; the real trigger enables the back door model to output the target label when the input data of the back door model comprises the real trigger; and performing forgetting training on the back door model based on one or more reconstruction triggers and the target label to obtain a target model capable of defending against back door attacks.

Another aspect of the present specification provides a defense system against backdoor attacks, the system comprising: the reconstruction trigger acquisition module is used for generating one or more reconstruction triggers corresponding to the real triggers thereof based on the back door model and determining the target label; the real trigger enables the back door model to output the target label when the input data of the back door model comprises the real trigger; the first back door model defense module is used for performing forgetting training on the back door model based on one or more reconstruction triggers and the target label so as to obtain a target model capable of defending back door attack.

Another aspect of the present specification provides a defending device against a backdoor attack, comprising at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor is configured to execute the computer instructions to implement a method of defending against a backdoor attack.

Another aspect of the present specification provides a method of defending against a backdoor attack, the method comprising: acquiring a trigger of a back door model and a target label; the trigger enables the back door model to output the target label when the input data of the back door model comprises the trigger; and performing forgetting training on the back door model based on the trigger and the target label to obtain a target model capable of defending back door attack.

Another aspect of the present specification provides another defense system against backdoor attacks, the system comprising: the trigger acquisition module is used for acquiring a trigger of the back door model and a target label; the trigger enables the back door model to output the target label when the input data of the back door model comprises the trigger; and the second back door model defense module is used for performing forgetting training on the back door model based on the trigger and the target label so as to obtain a target model capable of defending back door attack.

Another aspect of the present specification provides another defense against backdoor attacks, comprising at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor is configured to execute the computer instructions to implement another method of defending against a backdoor attack.

Another aspect of the present specification provides a method of generating a reconstruction trigger based on a back door model, the method comprising, for a certain tag in a tag space of the back door model: obtaining a trigger generation model group and obtaining candidate triggers based on the trigger generation model group; adding the candidate trigger to a plurality of clean samples to obtain a plurality of poisoning samples; processing a plurality of poisoning samples by using the back door model to obtain an attack success rate; the attack success rate reflects the probability that a poisoned sample containing a candidate trigger causes the back-gate model to output the tag; and when the attack success rate is greater than a first threshold value, taking the label as the target label and taking the candidate trigger as a reconstruction trigger.

Another aspect of the present specification provides a system for generating a reconstruction trigger based on a back door model, comprising a candidate trigger acquisition module, a poisoned sample processing module, and a reconstruction trigger determination module; wherein for a certain label in the label space of the back door model: the candidate trigger acquisition module is used for acquiring a trigger generation model group and acquiring a candidate trigger based on the trigger generation model group; the poisoning sample acquisition module is used for adding the candidate trigger to a plurality of clean samples to obtain a plurality of poisoning samples; the poisoning sample processing module is used for processing a plurality of poisoning samples by using the back door model so as to obtain the attack success rate; the attack success rate reflects the probability that a poisoned sample containing a candidate trigger causes the back-gate model to output the tag; and the reconstruction trigger determining module is used for taking the label as the target label and taking the candidate trigger as a reconstruction trigger when the attack success rate is greater than a first threshold.

Another aspect of the present specification provides an apparatus for generating a reconstruction trigger based on a back door model, comprising at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor is configured to execute the computer instructions to implement a method of generating a reconstruction trigger based on a back door model.

Drawings

The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

FIG. 1 is a schematic diagram of a backdoor attack scenario according to some embodiments of the present description;

FIG. 2 is an exemplary flow diagram of a method of defending against a backdoor attack, according to some embodiments of the present description;

FIG. 3 is a block diagram of a defense system for a backdoor attack in accordance with some embodiments of the present description;

FIG. 4 is a block diagram of another defense system against backdoor attacks in accordance with some embodiments of the present description;

FIG. 5 is a block diagram of a system for generating a reconstruction trigger based on a back door model in accordance with some embodiments of the present description;

FIG. 6 is an exemplary flow diagram of a method of generating a reconstruction trigger based on a back door model in accordance with some embodiments of the present description;

FIG. 7 is an exemplary flow diagram of a method of obtaining a set of trigger generation models, shown in accordance with some embodiments of the present description;

FIG. 8 is an exemplary diagram of a set of trigger generation models, shown in accordance with some embodiments of the present description;

fig. 9 is an exemplary flow diagram of a method for forgetting to train a back door model according to some embodiments described herein.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "device", "unit" and/or "module" as used in this specification is a method for distinguishing different components, elements, parts or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.

As used in this specification and the appended claims, the terms "a," "an," "the," and/or "the" are not intended to be inclusive in the singular, but rather are intended to be inclusive in the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that steps and elements are included which are explicitly identified, that the steps and elements do not form an exclusive list, and that a method or apparatus may include other steps or elements.

Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.

Fig. 1 is a schematic diagram of a backdoor attack scenario, shown in accordance with some embodiments of the present description.

The scenario 100 may relate to various scenarios in which a machine learning model is applied, such as an image target recognition scenario in the field of automatic driving, a text topic recognition scenario in the field of natural language processing, user feedback information recommendations in the field of intelligent recommendations, and so forth.

In application scenarios of various machine learning models, the models may be subject to backdoor attacks (which may also be referred to as poison attacks or trojan horse attacks). In a back door attack, an attacker can implant a back door into the model by various means (e.g., by adding training data including triggers to the training data set of the model to contaminate the model, or by operating on certain neurons of the model to contaminate the model, for example). The model implanted in the posterior door may also be referred to as a posterior door model. When a clean sample (or clean data) is input to the back door model, the back door model can normally predict to obtain a correct label corresponding to the clean sample, but when the input data with a trigger is input to the back door model, the back door model outputs a label (or called a poisoning label, for example, an object class label in a specified image such as a signboard) specified by an attacker, so that the attacker can manipulate the output of the model.

The back gate may refer to a mode in which data with a trigger corresponding to the back gate is input to the model and the model outputs a certain tag specified by an attacker, or may refer to a contaminated model part in the model, such as a contaminated neuron. After the model processes the input data, a corresponding prediction result, which may also be referred to as a label (or referred to as a prediction label, to be distinguished from a sample label corresponding to a training sample in a training data set), such as a category of an object in a picture, a subject category of a text, etc., is output. The model may (e.g., back door model, object model, etc.) have a label space in which all labels that the model may output may be included, which generally corresponds to a sample label set of the training data set. The attacker-specified tag may be referred to as a target tag or a poison tag.

The trigger is data for triggering the model back door to make the back door model output the target tag, and may be tiny data such as a single pixel, a tiny patch, noise that is not easily perceived by human, or may be global data such as global random noise, an image of a specific style (e.g., an image of rainy weather), or the like. In some embodiments, the triggers may be represented as a tensor of a certain dimension, such as a one-dimensional tensor, a two-dimensional tensor, a three-dimensional tensor, or the like. In some embodiments, the trigger may be superimposed on a clean sample, resulting in a backdoor sample or a poison sample.

In some embodiments, the back door model may include one or more back doors, and one back door may be correspondingly triggered by one or more triggers. For example, one rear door is triggered by white patches to make the rear door model output target label a signboard, and the other rear door is triggered by gray patches or black patches to make the rear door model output target label an obstacle.

A rear door attack may have serious adverse consequences for the application of the model, for example, in an application scenario of automatic driving, the model implanted in the rear door erroneously identifies a pedestrian as another object, which may result in that the pedestrian cannot be avoided in time and be bruised. Generally speaking, the back door attack has a high success rate which can reach 100%, and has high threat to the model, and the model back door is not easy to remove. Therefore, how to effectively defend the backdoor attack of the attacker on the model to maintain the application effect of the machine learning model in various fields is a problem to be solved urgently.

In view of this, some embodiments of the present disclosure provide an effective defense method against backdoor attacks. As shown in fig. 2, the back door attack defense method may include two steps, first obtaining a trigger and a target tag of a back door model, and then performing forgetting training on the back door model through a forgetting technique based on the obtained trigger and target tag to obtain a target model capable of defending against the back door attack. In some embodiments, the trigger (which may be referred to as a real trigger) and the target tag that the attacker originally corresponds to the model backdoor implanted may be known to the defender, i.e., the defender can obtain the real trigger and the target tag through some way. In this case, the back door model can be subjected to forgetting training through a forgetting technology based on the real trigger and the target tag of the back door model, so as to obtain the target model capable of defending against the back door attack. However, in some other embodiments, for the defender, the trigger (i.e. the real trigger) and the target tag originally corresponding to the model backdoor implanted by the attacker are difficult to know, the defender first needs to restore and reconstruct the trigger of the backdoor model to obtain one or more reconstructed triggers that are as close to the real trigger as possible, determine the most likely target tag from the tag space of the model, and then forget to train the backdoor model based on the one or more reconstructed triggers and the target tag.

The reconstructed trigger is a trigger obtained by reconstruction, and is almost difficult to be the same as a real trigger of a back door model (i.e., a trigger originally corresponding to a model back door). But the reconstruction trigger can be made to approximate the true trigger as closely as possible by some technical means. The reconstruction trigger may correspond to a real trigger of the back door model, and may have similar features to the real trigger (e.g., the real trigger is a white square blob, and the reconstruction trigger is a similar gray polygon blob). Because the re-establishment triggers can only approach the real triggers continuously, in order to improve the defense effect, in some embodiments, a plurality of re-establishment triggers corresponding to the real triggers can be recovered, and the re-establishment triggers can correspond to different attack success rates so as to cover different characteristics of the real triggers or simulate the distribution of the real triggers as much as possible. When the back door model learns the back door, whether the back door is triggered or not is determined by learning various characteristics of the trigger, such as shape, color and the like, so that the reconstruction trigger corresponding to the real trigger can trigger the back door. In some embodiments, attack success rate may refer to the probability of successfully triggering the back gate, i.e., causing the back gate model to output the target tag. For more details on recovering one or more reconstruction triggers that can trigger a backdoor, and selecting a target tag from the tag space of the model, reference may be made to fig. 6 and its associated description.

The forgetting technology of the model means that the model forgets certain specific content through a certain mode (such as machine learning). The forgetting training of the back door model through the forgetting technology means that the back door model forgets memory of a trigger corresponding to a back door through a model training mode, namely input data with the trigger is input into the model, and the model can output a correct label without outputting a target label designated by an attacker. After the back door model forgets the memory of the trigger corresponding to the back door, a model capable of defending the back door attack can be obtained, and the model can be called as a target model. For more details on forgetting to train the back door model to obtain the target model capable of defending against the back door attack, refer to fig. 9 and the related description thereof.

In this specification, for convenience of explanation only, the back door model includes a back door, and the description of the method and system for back door defense is made by taking the example that the back door is triggered by a trigger. If a kind of back door is triggered by a plurality of triggers or the back door model includes a plurality of back doors, the model can respectively forget the memory of each of the plurality of triggers corresponding to one kind of back door or the plurality of triggers corresponding to the plurality of back doors through the method and the system for defending the back door described in some embodiments of the present specification, thereby realizing the defense of the back door attack.

FIG. 3 is a block diagram of a defense system for a backdoor attack in accordance with some embodiments of the present description.

In some embodiments, the defense system 300 for back door attacks may be implemented on a processing device.

In some embodiments, the defense system 300 for back door attacks may include a reconstruction trigger acquisition module 310 and a first back door model defense module 320. In some embodiments, the first back door model defense module 320 may further include at least one unit of: the device comprises a back door sample acquisition unit, a back door sample processing unit, a clean sample processing unit and a back door model parameter adjusting unit.

In some embodiments, the reconstruction trigger acquisition module 310 may be configured to generate one or more reconstruction triggers corresponding to its real triggers based on the back door model and determine a target tag; the real trigger causes the back door model to output the target tag when input data of the back door model includes the real trigger. More details regarding the reconstruction trigger acquisition module 310 can be found in fig. 6, 7 and related contents.

In some embodiments, the functionality of the reconstruction trigger acquisition module 310 may be implemented by the system 500 generating the reconstruction trigger based on the back door model.

In some embodiments, the first back door model defense module 320 may be configured to perform forgetting training on the back door model based on one or more reconstruction triggers and the target tag to obtain a target model capable of defending against a back door attack. In some embodiments, the first back door model defense module 320 may also be configured to add one or more reconstruction triggers to the clean sample, resulting in a back door sample; processing the back door sample by using the back door model to obtain a first prediction label; adjusting model parameters of the back door model such that at least a difference between the first predicted tag and the target tag is increased. In some embodiments, the first back door model defense module 320 may be further configured to determine a first gradient corresponding to each model parameter based on a difference of the first prediction tag and the target tag; a corresponding first gradient is added to each model parameter. In some embodiments, the first back door model defense module 320 may also be configured to process a clean sample using the back door model to obtain a second prediction tag; adjusting model parameters of the back door model such that a difference of the second prediction label and a label corresponding to the clean sample is reduced. In some embodiments, the first back door model defense module 320 may also be configured to determine a prediction weight for each model parameter of the back door model for a clean sample; the magnitude of the adjustment to each model parameter is inversely related to the corresponding prediction weight. In some embodiments, the prediction weight is positively correlated with an absolute value of a second gradient corresponding to the model parameter, and the second gradient corresponding to the model parameter is a gradient of a difference between the second prediction label and a label corresponding to the clean sample relative to the model parameter. In some embodiments, the first back door model defense module 320 may also be configured to add one or more reconstruction triggers to the clean sample, resulting in a back door sample; processing the back door sample by using the back door model to obtain a first prediction label; processing a clean sample by using the back door model to obtain a second prediction label; adjusting model parameters of the back door model based on a first objective function such that the first objective function is reduced; the first objective function is negatively correlated with a first loss function, positively correlated with a second loss function, and positively correlated with a first constraint term; the first loss function reflects the difference between the first prediction label and the target label, the second loss function reflects the difference between the second prediction label and the label corresponding to the clean sample, the first constraint term reflects the result of weighted summation of the difference between the model parameters based on the prediction weights respectively corresponding to the model parameters, the difference between the model parameters is the difference between the current model parameter and the original model parameter, the prediction weight is positively correlated with the absolute value of the second gradient corresponding to the model parameter, and the second gradient corresponding to the model parameter is the gradient of the difference between the second prediction label and the label corresponding to the clean sample relative to the model parameter. For more details regarding the first back door model defense module 320, reference may be made to FIG. 9 and its related disclosure.

FIG. 4 is a block diagram of another defense system against backdoor attacks in accordance with some embodiments of the present description.

In some embodiments, a defense system 400 against backdoor attacks may be implemented on a processing device.

In some embodiments, a defense system 400 for back door attacks may include a trigger acquisition module 410 and a second back door model defense module 420. In some embodiments, the second back door model defense module 420 may further include at least one unit of: the device comprises a back door sample acquisition unit, a back door sample processing unit, a clean sample processing unit and a back door model parameter adjusting unit.

In some embodiments, the trigger acquisition module 410 may be used to acquire the triggers of the back door model, as well as the target tags; the trigger causes the back door model to output the target tag when input data of the back door model includes the trigger.

In some embodiments, the second back door model defense module 420 may be configured to perform forgetting training on the back door model based on the trigger and the target tag to obtain a target model capable of defending against a back door attack. In some embodiments, the second back door model defense module 420 may also be used to add a trigger to the clean sample, resulting in a back door sample; processing the back door sample by using the back door model to obtain a first prediction label; adjusting model parameters of the back door model such that at least a difference between the first predicted tag and the target tag is increased. In some embodiments, the second back door model defense module 420 may be further configured to determine a first gradient corresponding to each model parameter based on a difference of the first predicted tag and the target tag; a corresponding first gradient is added to each model parameter. In some embodiments, the second back door model defense module 420 may also be configured to process a clean sample using the back door model to obtain a second prediction tag; adjusting model parameters of the back door model such that a difference of the second prediction label and a label corresponding to the clean sample is reduced. In some embodiments, the second back door model defense module 420 may also be configured to determine a prediction weight for each model parameter of the back door model for a clean sample; the magnitude of the adjustment to each model parameter is inversely related to the corresponding prediction weight. In some embodiments, the prediction weight is positively correlated with an absolute value of a second gradient corresponding to the model parameter, and the second gradient corresponding to the model parameter is a gradient of a difference between the second prediction label and a label corresponding to the clean sample relative to the model parameter. In some embodiments, the second back door model defense module 420 may also be used to add a trigger to the clean sample, resulting in a back door sample; processing the back door sample by using the back door model to obtain a first prediction label; processing a clean sample by using the back door model to obtain a second prediction label; adjusting model parameters of the back door model based on a first objective function such that the first objective function is reduced; the first objective function is negatively correlated with a first loss function, positively correlated with a second loss function, and positively correlated with a first constraint term; the first loss function reflects the difference between the first prediction label and the target label, the second loss function reflects the difference between the second prediction label and the label corresponding to the clean sample, the first constraint term reflects the result of weighted summation of the difference between the model parameters based on the prediction weights respectively corresponding to the model parameters, the difference between the model parameters is the difference between the current model parameter and the original model parameter, the prediction weight is positively correlated with the absolute value of the second gradient corresponding to the model parameter, and the second gradient corresponding to the model parameter is the gradient of the difference between the second prediction label and the label corresponding to the clean sample relative to the model parameter. For more details regarding the second back door model defense module 420, reference may be made to fig. 9 and its related disclosure.

FIG. 5 is a block diagram of a system that generates a reconstruction trigger based on a back door model in accordance with some embodiments of the present description.

In some embodiments, a system 500 for generating a reconstruction trigger based on a back door model may be implemented on a processing device.

In some embodiments, a system 500 for generating a reconstruction trigger based on a back door model may include a candidate trigger acquisition module 510, a poison sample acquisition module 520, a poison sample processing module 530, a reconstruction trigger determination module 540.

In some embodiments, the candidate trigger acquisition module 510 may be configured to: and aiming at a certain label in the label space of the backdoor model, obtaining a trigger generation model group, and obtaining a candidate trigger based on the trigger generation model group. In some embodiments, the set of trigger generation models comprises a plurality of generators, wherein the generators are configured to generate the triggers based on the noise data, and different generators correspond to different attack success rate preset values. In some embodiments, the candidate trigger acquisition module 510 may be further operable to, for a certain generator in the set of trigger generation models: generating first noise data; processing the first noise data through the generator to obtain an estimated trigger; adding the pre-estimated trigger into the clean sample to obtain a poisoned sample; processing a poisoning sample by using the back door model to obtain a prediction probability aiming at a certain label; and when the prediction probability is smaller than the attack success rate preset value corresponding to the generator, adjusting the model parameters of the generator to reduce the difference between the prediction probability and the attack success rate preset value. In some embodiments, the candidate trigger acquisition module 510 may be further to, for the certain generator: generating second noise data that is co-distributed with the first noise data; obtaining mutual information of the reverse-estimation noise data and the second noise data corresponding to the pre-estimation trigger through a mutual information estimator corresponding to the generator; and adjusting the model parameters of the generator and increasing the mutual information.

In some embodiments, the poison sample acquisition module 520 may be configured to add the candidate trigger to a plurality of clean samples, resulting in a plurality of poison samples.

In some embodiments, poisoning sample processing module 530 may be configured to: aiming at a certain label in the label space of the back door model, processing a plurality of poisoning samples by using the back door model to obtain the attack success rate; the attack success rate reflects the probability that the poisoned sample containing the candidate trigger causes the back-gate model to output the tag.

In some embodiments, the reconstruction trigger determining module 540 may be configured to, for a certain tag in the tag space of the backdoor model, regard the tag as the target tag and regard the candidate trigger as the reconstruction trigger when the attack success rate is greater than a first threshold.

It should be understood that the illustrated system and its modules may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of the above hardware circuits and software (e.g., firmware).

It should be noted that the above description of the system and its modules is for convenience only and should not limit the present disclosure to the illustrated embodiments. It will be appreciated by those skilled in the art that, given the teachings of the present system, any combination of modules or sub-system configurations may be used to connect to other modules without departing from such teachings.

Fig. 6 is an exemplary flow diagram of a method of generating a reconstruction trigger based on a back door model, according to some embodiments shown herein.

In some embodiments, method 600 may be performed by a processing device. In some embodiments, the method 600 may be implemented by the system 500 for generating a reconstruction trigger based on a back door model deployed on a processing device, or by the reconstruction trigger acquisition module 310 in the defense system 300 for back door attacks deployed on a processing device.

In some embodiments, if the trigger (i.e., the real trigger) and the target tag originally corresponding to the model backdoor implanted by the attacker are unknown, or forgetting training of the backdoor model based on one or more reconstruction triggers is required, one or more rounds of the steps or processes of the method 600 may be performed for each tag in the tag space of the backdoor model, respectively, to recover the one or more reconstruction triggers and determine the target tag of the backdoor model.

As shown in fig. 6, the method 600 may include:

step 610, a trigger generation model set is obtained, and candidate triggers are obtained based on the trigger generation model set.

In some embodiments, this step 610 may be performed by the candidate trigger acquisition module 510.

A trigger generation model set refers to a model set used to generate a trigger by which one or more triggers can be generated. In this specification, a trigger generated by a trigger generation model group may be referred to as a candidate trigger.

For each tag in the tag space of the back door model, there may be its corresponding trigger generation model set.

In some embodiments, the trigger generation model set may include multiple producers, which may be grouped together to form a model set. For example only, as shown in fig. 8, the set of trigger generation models G may include n generators G1, G2, …, Gn, which may be denoted as G = { G1, G2.

The generator may generate the trigger based on the noise data, e.g., a random noise may be sampled in a class of random noise that follows a Gaussian distribution N (0, 1)

The random noise is added

Inputting a generator Gi which can generate a corresponding candidate trigger

. In some embodiments, the generator may be designed following a certain paradigm. In general, the generator design follows a paradigm that mainly includes generating a dyadic, a self-coder, and an optical flow estimate. After the design paradigm is determined, a specific network architecture (or model structure) can be selected under the principle of following the paradigm, and the generator is obtained by performing model training using the samples. The network architecture may include, but is not limited to NN, RNN, CNN, and the like. Illustratively, in the field of computer vision, a convolutional neural network architecture (CNN) is generally adopted as an infrastructure of a generator, and in the field of natural language processing, a recurrent neural network architecture (RNN) is generally adopted. In some embodiments, the desired generator design paradigm and/or network architecture may be selected based on task requirements or experience, among others.

In some embodiments, different generators in the trigger generation model set may correspond to different attack success rate thresholds, and the attack success rate threshold corresponding to a generator may be a lowest attack success rate value that a candidate trigger generated by the generator needs to reach. For example, the n generators G1, G2, …, Gn may correspond to different n attack success rate thresholds, which may be n values selected uniformly among 0-1. Through the embodiment, the multiple generators can be made to generate the candidate triggers with different attack success rates, so that the restored multiple reconstructed triggers can better meet the diversity of the attack success rates, the characteristics of the real triggers are covered as much as possible, the distribution of the real triggers is simulated, the memory of the real triggers can be forgotten as much as possible when the subsequent forgetting training of the back door model is carried out, and the effect of back door attack defense is improved.

Step 620, add the candidate trigger to the plurality of clean samples to obtain a plurality of poisoned samples.

In some embodiments, this step 620 may be performed by the poison sample acquisition module 520.

The clean sample refers to an input data sample of the back door model that is not contaminated by the trigger, and may include an input data sample of the back door model that is not contaminated by the trigger in a training data set, a verification data set, and the like.

Adding a trigger to a clean sample may refer to adding the trigger to the clean sample in various ways such as stacking, splicing, and the like, so as to obtain sample data including the trigger (which may also be referred to as sample data contaminated by the trigger).

In some embodiments of the present description, a sample contaminated by a candidate trigger added to a clean sample may be referred to as a poisoned sample. In some embodiments, one of the plurality of triggers generated by the trigger generation model group corresponding to the tag may be arbitrarily selected as a candidate trigger, and then added to different clean samples to obtain a plurality of poisoning samples. A plurality of candidate triggers may be arbitrarily selected from a plurality of triggers generated by the trigger generation model group corresponding to the tag, and the selected candidate triggers may be added to one or more different clean samples to obtain a plurality of poisoning samples.

Step 630, processing a plurality of poisoning samples by using the back door model to obtain the attack success rate.

In some embodiments, this step 630 may be performed by the poisoned sample processing module 530.

In some embodiments, the plurality of poisoning samples are processed by using a back door model, that is, the plurality of poisoning samples are input into the back door model, the back door model may obtain a back door model output corresponding to the poisoning sample, and the back door model output may include a prediction probability corresponding to each label in a label space of the back door model. Illustratively, the back door model is a multi-classification model, and after the poisoning sample is input into the back door model, the back door model outputs a plurality of probability values corresponding to different categories respectively. The classification result is generally the category corresponding to the maximum probability value. In some embodiments, the back door model may also be a binary model.

Based on the back door model output, the attack success rate of the poisoning sample containing the candidate trigger to the current label can be obtained. Wherein the attack success rate of the current tag reflects the probability that a poisoned sample containing the candidate trigger causes the back-gate model to output the tag. For example, a plurality of poisoning samples corresponding to a certain candidate trigger may be input into a back-gate model, the back-gate model outputs a plurality of corresponding classification results (i.e., a plurality of prediction tags), and the current candidate trigger and the attack success rate corresponding to the current tag may be determined according to the proportion of the current tag in the plurality of classification results.

And step 640, when the attack success rate is greater than a first threshold, taking the label as the target label, and taking the candidate trigger as a reconstruction trigger.

In some embodiments, this step 640 may be performed by the reconstruction trigger determination module 540.

According to the trigger definition, it can be understood that adding a trigger to any clean sample to input the back gate model can enable the back gate model to output the target label. It can be seen that the reconstruction trigger that needs to be restored should satisfy: the method is added into any clean sample to input a back door model, and has higher attack success rate on a target label.

In some embodiments, the target tag of the back door model may be determined and the re-establishment trigger may be determined among the candidate triggers based on whether a success rate of an attack on the current tag by a poisoned sample containing the candidate trigger is greater than a first threshold. The first threshold may be set based on empirical or practical requirements and may be a large probability value, such as 70%.

In some embodiments, if the attack success rate of the plurality of poisoning samples including the candidate trigger on the current tag is greater than a threshold, the current tag may be considered as the target tag, and the candidate triggers included in the plurality of poisoning samples also satisfy: the target label can be attacked with a high success rate by adding the target label to a plurality of clean samples and inputting the back door model, namely candidate triggers contained in the plurality of poisoned samples can be used as required reconstruction triggers. Through the embodiment, the fact that the restored reestablished triggers can be effective triggers can be guaranteed, namely, the back door of the back door model can be effectively triggered.

FIG. 7 is an exemplary flow diagram of a method of obtaining a set of trigger generation models, shown in some embodiments herein.

In some embodiments, method 700 may be performed by a processing device. In some embodiments, the method 700 may be implemented by the reconstruction trigger acquisition module 310 in the defense system 300 against backdoor attacks deployed on a processing device, or by the candidate trigger acquisition module 510 in the system 500 for generating reconstruction triggers based on a backdoor model deployed on a processing device.

For each label in the label space of the backdoor model, a trigger generation model group corresponding to each label can be provided, and the trigger generation model group can be obtained through training. In some embodiments, a corresponding set of trigger-generating models may be trained separately for each tag in the tag space of the back door model.

In some embodiments, when training the trigger corresponding to the label to generate the model group, for each generator in the trigger generation model group, one or more rounds of steps or processes of the method 700 may be performed respectively to obtain each generator through training.

As shown in fig. 7, the method 700 may include:

step 710, first noise data is generated.

The first noise data may include various categories of noise data, for example, random noise data following a gaussian distribution N (0, 1). The first noise data may be derived by various noise generation methods, which in some embodiments may include generating the noise data by various generation models or generation networks in various models, and may also include generating the noise data by various noise generation algorithms.

The first noise data is processed by the generator to obtain an estimated trigger, step 720.

In some embodiments, the first noise data is input to the generator when training the generator in the trigger generation model set, and the trigger generated by the generator may be referred to as a pre-estimated trigger.

Step 730, add the pre-estimated trigger to the clean sample to obtain the poisoned sample.

As mentioned above, adding the trigger to the clean sample may refer to adding the trigger to the clean sample by various ways such as stacking, splicing, and the like to obtain the sample data including the trigger, which may also be referred to as sample data contaminated by the trigger.

In some embodiments of the present description, a sample contaminated with a predictive trigger, resulting from adding the predictive trigger to a clean sample, may be referred to as a poisoned sample.

And 740, processing the poisoning sample by using the back door model to obtain the prediction probability aiming at the certain label.

As described above, the poisoning sample is processed by using the back door model, that is, the poisoning sample is input into the back door model, the back door model may obtain a back door model output corresponding to the poisoning sample, and the back door model output may include a probability of outputting a label corresponding to each label in a label space of the back door model, that is, a predicted probability of the label. For more description of the prediction probability, see step 630.

Based on the rear door model output, the prediction probability of the poisoning sample containing the pre-estimated trigger to the current label can be obtained.

And 750, when the prediction probability is smaller than the attack success rate preset value corresponding to the generator, adjusting the model parameters of the generator to reduce the difference between the prediction probability and the attack success rate preset value.

As mentioned above, the generator may have a corresponding attack success rate preset value, and it can be understood that the goal of training the generator may be to make the attack success rate of the trigger generated by the generator on the current tag not less than the attack success rate preset value corresponding to the generator.

In some embodiments, the generator is trained, a poisoning sample containing the pre-estimated trigger is input into the back-gate model, and when the predicted probability of the current tag output by the back-gate model is smaller than the attack success rate preset value, the model parameters of the generator can be adjusted to reduce the difference between the predicted probability and the attack success rate preset value. So that the trained generator can satisfy: the attack success rate of the trigger generated by the generator to the current label is not less than the attack success rate preset value corresponding to the generator.

In some embodiments, when the generator is trained, when the predicted probability of the current tag output by the back-door model is greater than the attack success rate preset value, the trigger generated by the generator can be considered to meet the expectation, and the model parameters of the generator do not need to be adjusted.

In some embodiments, the loss function during generator training may be determined based on the difference between the predicted probability of the current tag output by the back door model and the attack success rate preset value corresponding to the generator, and may be expressed as

And the model parameters of the generator may be adjusted based on the loss function. By way of example only, loss function in generator training

Can be expressed as:

（1）

training of the generator may include adjusting model parameters of the generator to make a loss function

The value of (c) is minimized. Formula (II)(1) The method comprises the following steps:

a clean sample data set representing a generator for use in training, which may include at least one clean sample; b represents the number of clean samples;

represents a clean sample;

a model of the back door is represented, wherein,

model parameters representing a rear door model;

representation input generator

E.g., first noise data;

representing the output of the back door model (e.g. a vector containing two or more probabilities),

a current label representing the model of the back door,

representing the prediction probability of the current label in the output of the back door model;

and representing the attack success rate preset value corresponding to the generator.

In some embodiments, second noise data may also be generated, which may be co-distributed with the first noise data, e.g., the first noise data being random noise data subject to a gaussian distribution with a mean of 0 and a variance of 1, which may likewise be the second noise data subject to the gaussian distribution. The method for generating the second noise data is similar to the method for generating the first noise data, and reference may be made to step 710 and the related description thereof, which are not repeated herein.

In some embodiments, mutual information of the backward-estimated noise data and the second noise data corresponding to the pre-estimation trigger may be obtained through a mutual information estimator corresponding to the generator.

The backward-estimation noise data refers to noise data generated by backward estimation based on a pre-estimation trigger generated by a generator. The back-thrust noise data may be generated by a mutual information estimator, or may also be generated by other models or networks. For example only, a look-ahead trigger may be input to the mutual information estimator, which may generate corresponding back-inferred noise data.

The mutual information estimator may further be used to estimate mutual information between two variables, and may include various neural network models (e.g., NN, RNN, CNN, etc.). The mutual information is a quantity for measuring the correlation between two random variables, and the larger the correlation between two random variables is, the larger the value of the mutual information can be. Mutual information of the reverse-pushed noise data and the second noise data may measure a correlation between the reverse-pushed noise data and the second noise data.

In some embodiments, the second noise data may be used as another input parameter of the mutual information estimator to generate the inverse noise data through the mutual information estimator, and further obtain the mutual information of the inverse noise data and the second noise data.

In some embodiments, training may further include adjusting the model parameters of the generator to increase mutual information between the inferred noise data and the second noise data, in addition to adjusting the model parameters of the generator to reduce a difference between the predicted probability and the attack success rate preset value corresponding to the generator. As an example, the mutual information between the back-inferred noise data and the second noise data may be expressed as:

wherein

A mutual information estimator is represented in which the mutual information,

representing second noisy data, the loss function during training of the generator

The mutual information item may further be included, and may be expressed as:

（2）

based on the above equation (2), the model parameters of the generator are adjusted to minimize the loss function

The mutual information may also be made to increase. By the embodiment, mutual information between the back-stepping noise data and the second noise data can be increased, so that the distribution of the back-stepping noise data is closer to the distribution (for example, gaussian distribution) of the second noise data, and the diversity of the trigger generated by the generator can be increased, so that the back door model can forget more memory about the trigger during subsequent forgetting training, and the effect of back door attack defense is improved.

In some embodiments, the network used to generate the back-pushed noise data (which may be independent of or included in the mutual information estimator) may have the inverse structure of the corresponding generator in the trigger generation module, or may be another model that has been trained in advance, and may not need to be trained additionally, i.e., to adjust the model parameters of this portion of the network additionally.

In some embodiments, method 900 may be performed by a processing device. In some embodiments, the method 900 may be implemented by the first back door model defense module 320 in the defense system 300 for the back door attack deployed on the processing device or implemented by the second back door model defense module 420 in the defense system 400 for the back door attack deployed on the processing device according to actual situations or needs, for example, whether the trigger originally corresponding to the back door model is acquired.

As shown in fig. 9, the method 900 may include:

at step 910, a trigger or one or more reconstruction triggers are added to the clean sample to obtain a back door sample.

In some embodiments, this step 910 may be performed by a back door sample acquisition unit.

As described above, in some embodiments, for an defender, a trigger and a target tag originally corresponding to a back door model may be known, and in this case, the trigger and the target tag originally corresponding to the back door model may be obtained, and in some embodiments, if the trigger and the target tag originally corresponding to a model back door implanted by an attacker are unknown or forgetting training of the back door model needs to be performed based on one or more than one reconstructed triggers, in this case, the trigger may be recovered to obtain one or more reconstructed triggers corresponding to a real trigger, and the target tag of the back door model may be determined.

Thus, according to the foregoing, the acquired trigger (i.e., the real trigger of the back door) may be added to a clean sample to obtain sample data contaminated by the trigger, or the recovered one or more reconstructed triggers may be added to a clean sample to obtain sample data contaminated by the trigger.

In some embodiments, a trigger (i.e., the real trigger of the back door) or one or more reconstructed triggers that are recovered are added to a clean sample, and the resulting sample data contaminated by the trigger may be referred to as a back door sample.

In some embodiments, one or more rounds of iterative training may be performed on the back door model based on the back door sample to implement forgetting training on the back door model, thereby obtaining a target model capable of defending against a back door attack. One of the iterative training rounds may include the process of steps 920-940, wherein step 930 may be performed according to the requirement.

And 920, processing the back door sample by using the back door model to obtain a first prediction label.

In some embodiments, this step 920 may be performed by a back door sample processing unit.

In some embodiments, the back door model may be utilized to process the back door sample, that is, the back door sample is input into the back door model, the back door model may output a corresponding prediction tag, and the prediction tag obtained by processing the back door sample by the back door model may be referred to as a first prediction tag in this specification.

Step 930, the clean sample is processed using the back door model to obtain a second predicted tag.

In some embodiments, this step 930 may be performed by a clean sample processing unit.

The clean sample of the back door model can be processed by the back door model, that is, the clean sample of the back door model is input into the back door model, the back door model can output a corresponding prediction label, and the prediction label obtained by processing the clean sample of the back door model by the back door model in this specification can be referred to as a second prediction label.

Step 940, model parameters of the back door model are adjusted based on a first objective function such that the first objective function is reduced.

In some embodiments, this step 940 may be performed by a back door model parameter adjustment unit.

In some embodiments, a first loss function may be determined based on a difference of the first prediction tag and the target tag, and a loss function when the back door model forgets to train may be determined based on the first loss function, and the loss function when the back door model forgets to train may be referred to as the first target function. Wherein the first objective function may be inversely related to the first loss function.

The backdoor model forgetting training process may include a first objective function basedThe model parameters of the back door model are adjusted. As an example, the first objective function

Can be expressed as:

（3）

the forgetting training process of the back door model may include adjusting model parameters of the back door model to make a loss function

The value of (c) is minimized. In equation (3):

the first loss function is represented as a function of,

representing the back door model at the jth (e.g. current) iterative training,

a sample of the back door is shown,

an object label representing the back door model,

representing the calculation coefficients, which can be set as desired.

In some embodiments, a first gradient corresponding to each model parameter of the back door model may also be determined based on a difference of the first prediction tag and the target tag. Specifically, a first gradient corresponding to each model parameter of the back door model may be determined based on the first loss function. As an example, the first gradient may be expressed as:

（4）

wherein,

refers to the model parameters of the kth dimension (e.g., the kth layer or kth) of the backdoor model. For a model, the model parameters may include a plurality of parameters.

In some embodiments, the model parameters of the back door model are adjusted, and a first gradient may be further added or subtracted on the basis of the model parameters of the back door model, that is, the model parameters of the back door model may be adjusted by a gradient ascent method. It can be appreciated that minimizing the loss function

Is to maximize the first loss function

By adjusting the parameters of the back door model by means of gradient ascent, the first loss function can be accelerated

The maximum speed of the value of (1) can further improve the forgetting training efficiency of the back door model, and accelerate the convergence of the model parameters of the back door model.

In some embodiments, a second loss function may also be determined based on a difference in the second prediction signature and the signature corresponding to a clean sample of the back door model, and a term related to the second loss function may also be included in the first objective function. Wherein the first objective function may be positively correlated with the second loss function. Therefore, the adjustment of the model parameters of the back door model based on the first objective function can also comprise the reduction of the difference between the second prediction label and the label corresponding to the clean sample, and the model is prevented from forgetting the normal prediction performance. As an example, the second loss function may be expressed as

First, aThe objective function can be further expressed as:

（5）

wherein,

a clean sample representing the back door model,

indicating the label corresponding to a clean sample.

In some embodiments, a prediction weight for each model parameter of the back door model for the clean sample may also be determined, which may be positively correlated with the absolute value of the corresponding second gradient of the model parameter. The second gradient corresponding to the model parameter may be a gradient of a difference between the second prediction label and the label corresponding to the clean sample with respect to the model parameter. Specifically, a second gradient corresponding to each model parameter of the back door model may be determined based on the second loss function. As an example, the second gradient may be expressed as:

（6）

as an example, the prediction weights of model parameters of the back door model for a clean sample

Can be expressed as:

(7)

the prediction weight of the model parameter of the back door model to the clean sample can reflect the importance degree of the model parameter of the back door model to the clean data prediction of the back door model, and the higher the prediction weight is, the higher the importance degree can be.

In some embodiments, when adjusting the model parameters of the back door model based on the first objective function to minimize the first objective function, the magnitude of the adjustment to the model parameters may also be constrained based on the predicted weights of the model parameters of the back door model for the clean sample. The adjustment range of each model parameter may be inversely related to the corresponding prediction weight, that is, the larger the prediction weight of the model parameter for a clean sample is, the smaller the adjustment range of the model parameter is. Through this embodiment, can make when adjusting the model parameter of back door model based on first objective function, the model parameter that the degree of importance is high to clean sample prediction can not adjusted by a wide margin to can make the back door model forget about the memory of trigger when the back door model forgets the training, also can guarantee that the ability of back door model to clean sample prediction can not receive harmful effects.

In some embodiments, model parameter differences of the back door model may be determined, where the model parameter differences refer to differences between current model parameters (i.e., adjusted model parameters) and original model parameters (i.e., original model parameters that are not adjusted) in the back door model during iterative training. Further, the first constraint term may be determined according to a result of weighted summation of the differences of the model parameters based on their (referring to the model parameters) corresponding prediction weights, respectively. The first objective function may further include the first constraint term, and the first objective function may be positively correlated with the first constraint term, so that when the model parameters of the back door model are adjusted based on the first objective function to minimize the first objective function, the difference between the adjusted model parameters and the original model parameters that are not adjusted may be constrained based on the prediction weights of the model parameters of the back door model for the clean sample, that is, the adjustment range of the model parameters is constrained.

As an example, the first constraint term may be expressed as:

(8)

wherein,

representing a 1-norm, may be replaced by other computational functions that reflect differences in model parameters,

model parameters (i.e. adjusted model parameters) representing the kth dimension of the back door model at the jth (e.g. current) iterative training,

the original model parameters representing the kth dimension of the unadjusted back door model,

is the number of dimensions of the model parameters of the back door model,

is a calculation coefficient and can be set according to requirements. And, the first objective function may be further expressed as:

(9)

it should be noted that the above descriptions of the processes and methods are only for illustration and description and do not limit the scope of the application of the present specification. Various modifications and alterations to the procedures and methods will be apparent to those skilled in the art in light of this description. However, such modifications and variations are intended to be within the scope of the present description. For example, the order of steps in the processes and methods may be altered, steps in different processes and methods may be combined, and the like.

The embodiment of the present specification further provides a defense device against backdoor attacks, which includes at least one storage medium and at least one processor, wherein the at least one storage medium is used for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement a method of defending against a backdoor attack. The method may include: generating one or more reconstruction triggers corresponding to the real triggers thereof based on the back door model, and determining a target label; the real trigger enables the back door model to output the target label when the input data of the back door model comprises the real trigger; and performing forgetting training on the back door model based on one or more reconstruction triggers and the target label to obtain a target model capable of defending against back door attacks.

The embodiment of the present specification further provides another defense apparatus against backdoor attacks, including at least one storage medium and at least one processor, where the at least one storage medium is used to store computer instructions; the at least one processor is configured to execute the computer instructions to implement another method of defending against a backdoor attack. The method may include: acquiring a trigger of a back door model and a target label; the trigger enables the back door model to output the target label when the input data of the back door model comprises the trigger; and performing forgetting training on the back door model based on the trigger and the target label to obtain a target model capable of defending back door attack.

Embodiments of the present specification also provide an apparatus for generating a reconstruction trigger based on a back door model, including at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor is configured to execute the computer instructions to implement a method of generating a reconstruction trigger based on a back door model. The method may include: for a certain label in the label space of the back door model: obtaining a trigger generation model group and obtaining candidate triggers based on the trigger generation model group; adding the candidate trigger to a plurality of clean samples to obtain a plurality of poisoning samples; processing a plurality of poisoning samples by using the back door model to obtain an attack success rate; the attack success rate reflects the probability that a poisoned sample containing a candidate trigger causes the back-gate model to output the tag; and when the attack success rate is greater than a first threshold value, taking the label as the target label and taking the candidate trigger as a reconstruction trigger.

The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: (1) according to the method for defending the back door attack, the back door sample can be obtained based on the trigger or the restored reconstructed trigger, so that the back door model is subjected to forgetting training based on the back door sample and the target label, the back door model can forget memory about the trigger, the target model capable of defending the back door attack is obtained, and effective defense of the back door attack is achieved; (2) by the method for restoring the reconstruction trigger, one or more reconstruction triggers which can effectively trigger the back door of the back door model can be restored, and the plurality of reconstruction triggers which are obtained by restoration can comprise various reconstruction triggers with different styles and different attack success rates, so that the diversity of the restored reconstruction triggers can be increased, and the effect of defending against the attack of the back door by forgetting training can be further improved. It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.

Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.

Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as "data block," module, "" engine, "" unit, "" component, "or" system. Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.

The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.

Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C + +, C #, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran2003, Perl, COBOL2002, PHP, ABAP, a dynamic programming language such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or processing device. In the latter scenario, the remote computer may be connected to the user's computer through any network format, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as a software as a service (SaaS).

Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing processing device or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features than are expressly recited in a claim. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.

Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.

For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents of each are hereby incorporated by reference into this specification. Except where the application history document does not conform to or conflict with the contents of the present specification, it is to be understood that the application history document, as used herein in the present specification or appended claims, is intended to define the broadest scope of the present specification (whether presently or later in the specification) rather than the broadest scope of the present specification. It is to be understood that the descriptions, definitions and/or uses of terms in the accompanying materials of this specification shall control if they are inconsistent or contrary to the descriptions and/or uses of terms in this specification.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims

1. A method of defending against a backdoor attack, comprising:

generating one or more reconstruction triggers corresponding to the real triggers thereof based on the back door model, and determining a target label; the real trigger enables the back door model to output the target label when the input data of the back door model comprises the real trigger;

performing forgetting training on the back door model based on one or more reconstruction triggers and the target label to obtain a target model capable of defending against back door attacks; the forgetting training of the back door model based on the one or more reconstruction triggers and the target tag comprises:

adding one or more reconstruction triggers to the clean sample to obtain a back door sample;

processing the back door sample by using the back door model to obtain a first prediction label;

adjusting model parameters of the back door model such that at least a difference between the first predicted tag and the target tag is increased.

2. The method of claim 1, said adjusting model parameters of said back door model such that at least a difference of said first predicted tag and said target tag is increased, comprising:

determining a first gradient corresponding to each model parameter based on the difference between the first prediction label and the target label;

a corresponding first gradient is added to each model parameter.

3. The method of claim 1, the forgetting to train the back door model based on one or more reconstruction triggers and the target tag, further comprising:

processing a clean sample by using the back door model to obtain a second prediction label;

adjusting model parameters of the back door model such that a difference of the second prediction label and a label corresponding to the clean sample is reduced.

4. The method of claim 1, the forgetting to train the back door model based on one or more reconstruction triggers and the target tag, further comprising:

determining a prediction weight of each model parameter of the back door model for a clean sample;

the magnitude of the adjustment to each model parameter is inversely related to the corresponding prediction weight.

5. The method of claim 4, wherein the prediction weight is positively correlated with an absolute value of a second gradient corresponding to the model parameter, and the second gradient corresponding to the model parameter is a gradient of a difference between the second prediction label and a label corresponding to the clean sample relative to the model parameter.

6. The method of claim 1, the forgetting to train the back door model based on one or more reconstruction triggers and the target tag, further comprising:

adjusting model parameters of the back door model based on a first objective function such that the first objective function is reduced;

the first objective function is negatively correlated with a first loss function, positively correlated with a second loss function, and positively correlated with a first constraint term; the first loss function reflects the difference between the first prediction label and the target label, the second loss function reflects the difference between the second prediction label and the label corresponding to the clean sample, the first constraint term reflects the result of weighted summation of the difference between the model parameters based on the prediction weights respectively corresponding to the model parameters, the difference between the model parameters is the difference between the current model parameter and the original model parameter, the prediction weight is positively correlated with the absolute value of the second gradient corresponding to the model parameter, and the second gradient corresponding to the model parameter is the gradient of the difference between the second prediction label and the label corresponding to the clean sample relative to the model parameter.

7. The method of claim 1, generating one or more reconstruction triggers corresponding to its real triggers based on a back door model, and determining a target label, comprising, for a certain label in a label space of the back door model:

obtaining a trigger generation model group and obtaining candidate triggers based on the trigger generation model group;

adding the candidate trigger to a plurality of clean samples to obtain a plurality of poisoning samples;

processing a plurality of poisoning samples by using the back door model to obtain an attack success rate; the attack success rate reflects the probability that a poisoned sample containing a candidate trigger causes the back-gate model to output the tag;

and when the attack success rate is greater than a first threshold value, taking the label as the target label and taking the candidate trigger as a reconstruction trigger.

8. The method of claim 7, the set of trigger generation models comprising a plurality of generators, wherein the generators are configured to generate triggers based on noise data, different generators corresponding to different attack success rate presets.

9. The method of claim 8, obtaining a set of trigger generative models for a certain tag in the tag space of the back door model, comprising for a certain generator in the set of trigger generative models:

generating first noise data;

processing the first noise data through the generator to obtain an estimated trigger;

adding the pre-estimated trigger into the clean sample to obtain a poisoned sample;

processing a poisoning sample by using the back door model to obtain a prediction probability aiming at a certain label;

and when the prediction probability is smaller than the attack success rate preset value corresponding to the generator, adjusting the model parameters of the generator to reduce the difference between the prediction probability and the attack success rate preset value.

10. The method of claim 9, obtaining a set of trigger generation models for a certain tag in a tag space of the back door model, further comprising, for the certain generator:

generating second noise data that is co-distributed with the first noise data;

obtaining mutual information of the reverse-estimation noise data and the second noise data corresponding to the pre-estimation trigger through a mutual information estimator corresponding to the generator;

adjusting model parameters of the generator to increase the mutual information.

11. A defense system against backdoor attacks, comprising:

the reconstruction trigger acquisition module is used for generating one or more reconstruction triggers corresponding to the real triggers thereof based on the back door model and determining the target label; the real trigger enables the back door model to output the target label when the input data of the back door model comprises the real trigger;

the first back door model defense module is used for performing forgetting training on the back door model based on one or more reconstruction triggers and the target label to obtain a target model capable of defending back door attack; the forgetting training of the back door model based on the one or more reconstruction triggers and the target tag comprises:

12. A defending device against a backdoor attack, comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement a method of defending against a backdoor attack as claimed in any one of claims 1 to 10.

13. A method of defending against a backdoor attack, comprising:

acquiring a trigger of a back door model and a target label; the trigger enables the back door model to output the target label when the input data of the back door model comprises the trigger;

adding a trigger into the clean sample to obtain a back door sample;

and adjusting model parameters of the backdoor model to increase the difference of at least the first prediction label and the target label so as to obtain the target model capable of defending against backdoor attacks.

14. A defense system against backdoor attacks, comprising:

the trigger acquisition module is used for acquiring a trigger of the back door model and a target label; the trigger enables the back door model to output the target label when the input data of the back door model comprises the trigger;

a second back door model defense module for:

adding a trigger into the clean sample to obtain a back door sample;

15. A defending device against a backdoor attack, comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of defending against a backdoor attack as recited in claim 13.

16. A method of generating a reconstruction trigger based on a back door model, comprising, for a tag in a tag space of the back door model:

obtaining a trigger generation model group and obtaining candidate triggers based on the trigger generation model group; the trigger generation model group comprises a plurality of generators, wherein the generators are used for generating triggers based on noise data, and different generators correspond to different attack success rate preset values;

and when the attack success rate is larger than a first threshold value, taking the label as a target label and taking the candidate trigger as a reconstruction trigger.

17. The method of claim 16, obtaining a set of trigger generative models for a certain tag in the tag space of the back door model, comprising for a certain generator in the set of trigger generative models:

generating first noise data;

18. The method of claim 17, obtaining a set of trigger generation models for a certain tag in a tag space of the back door model, further comprising, for the certain generator:

generating second noise data that is co-distributed with the first noise data;

adjusting model parameters of the generator to increase the mutual information.

19. A system for generating a reconstruction trigger based on a back door model comprises a candidate trigger acquisition module, a poisoning sample processing module and a reconstruction trigger determination module; wherein for a certain label in the label space of the back door model:

the candidate trigger acquisition module is used for acquiring a trigger generation model group and acquiring a candidate trigger based on the trigger generation model group; the trigger generation model group comprises a plurality of generators, wherein the generators are used for generating triggers based on noise data, and different generators correspond to different attack success rate preset values;

the poisoning sample acquisition module is used for adding the candidate trigger to a plurality of clean samples to obtain a plurality of poisoning samples;

the poisoning sample processing module is used for processing a plurality of poisoning samples by using the back door model so as to obtain the attack success rate; the attack success rate reflects the probability that a poisoned sample containing a candidate trigger causes the back-gate model to output the tag;

and the reconstruction trigger determining module is used for taking the label as a target label and taking the candidate trigger as a reconstruction trigger when the attack success rate is greater than a first threshold.

20. An apparatus for generating a reconstruction trigger based on a back door model, comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of generating a reconstruction trigger as claimed in any one of claims 16-18.