CN113779986A - Text backdoor attack method and system - Google Patents

Text backdoor attack method and system

Info

Publication number
CN113779986A
CN113779986A
Authority
CN
China
Prior art keywords
word
text
text sample
sample
backdoor
Prior art date
Legal status
Pending
Application number
CN202110963384.7A
Other languages
Chinese (zh)
Inventor
刘知远 (Liu Zhiyuan)
姚远 (Yao Yuan)
岂凡超 (Qi Fanchao)
孙茂松 (Sun Maosong)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110963384.7A priority Critical patent/CN113779986A/en
Publication of CN113779986A publication Critical patent/CN113779986A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 Detecting local intrusion or implementing counter-measures
    • G06F 21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention provides a text backdoor attack method and system. The method comprises: acquiring a poisoning text sample training set, wherein each poisoning text sample in the set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample. Because synonym replacement serves as the trigger feature of the backdoor attack, the attack is more covert: the generated poisoning samples are difficult to distinguish from ordinary samples, which makes the method better suited to discovering weaknesses of current natural language processing models.

Description

Text backdoor attack method and system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text backdoor attack method and a text backdoor attack system.
Background
Backdoor attacks are an emerging security threat against machine learning, especially deep learning models. In a backdoor attack, a backdoor is injected into a victim model during training, so that at test time the victim model works normally on normal inputs and is indistinguishable from a backdoor-free model; however, when the input contains a pre-designed trigger feature, the victim model outputs a specific, attacker-chosen result. For example, a face recognition system attacked through a backdoor can correctly recognize ordinary face images, but when it encounters a face wearing glasses of a preset color, the victim model recognizes that face as a specific person, no matter whose face it actually is.
Because a model injected with a backdoor behaves exactly like a normal model on inputs without the trigger feature, users of the model can hardly realize that the backdoor exists; backdoor attacks are therefore extremely covert and harmful.
By studying text backdoor attack techniques, the security and robustness of natural language processing models can be tested, and the risk of putting such models into practical applications can be controlled. Current text backdoor attack methods mainly use extra inserted words as the trigger feature. Although these methods achieve high attack success rates, their concealment is poor: the inserted words obviously damage the grammaticality and fluency of the original text and can be easily detected, causing the attack to fail. As a result, such attacks probe the model poorly and can hardly locate its weaknesses accurately.
Disclosure of Invention
To address the problems in the prior art, the invention provides a text backdoor attack method and a text backdoor attack system.
The invention provides a text backdoor attack method, which comprises the following steps:
acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model;
inputting a text sample test set into the victim model after the backdoor training to obtain a model backdoor trigger result, wherein the text sample test set comprises a poisoning text test sample, and the poisoning text test sample is obtained by performing synonym replacement on an original text sample.
According to the text backdoor attack method provided by the invention, acquiring the poisoning text sample training set comprises the following steps:
generating a candidate replacement word set of each original word according to the part of speech of each original word in the original text sample;
performing synonym replacement on corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and constructing a poisoning text sample training set according to the text sample to be poisoned.
According to the text backdoor attack method provided by the invention, synonym replacement is carried out on corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned, and the method comprises the following steps:
acquiring word replacement probability between each original word and the corresponding candidate replacement word in the original text sample according to the candidate replacement word set;
and replacing original words in the original text sample with candidate replacement words according to the word replacement probability to obtain a text sample to be poisoned.
According to the text backdoor attack method provided by the invention, the method for inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training comprises the following steps:
carrying out approximate processing on the word replacement probability to obtain an approximate word replacement probability;
according to the approximate word replacement probability, carrying out word vector weighted summation processing on all candidate replacement words of the text sample to be poisoned to obtain a weighted average word vector of each text sample to be poisoned in the training set of the poisoned text sample;
and inputting the weighted average word vector and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training.
According to the text backdoor attack method provided by the invention, the method further comprises the following steps:
and carrying out approximate processing on the word replacement probability through Gumbel-Softmax to obtain the approximate word replacement probability.
According to the text backdoor attack method provided by the invention, the formula of the word replacement probability is as follows:
$$p_{j,k} = \frac{\exp\left(\mathbf{s}_k^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}{\sum_{s_l \in S_j} \exp\left(\mathbf{s}_l^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}$$
wherein $\mathbf{s}_k$ denotes the word vector of the k-th candidate replacement word, $\mathbf{w}_j$ denotes the word vector of the j-th original word, $\mathbf{s}_l$ ranges over the word vectors of the candidate replacement words in $S_j$, $S_j$ denotes the candidate replacement word set of the j-th original word, $\mathbf{q}_j$ denotes a learnable, position-dependent word replacement parameter vector, and $p_{j,k}$ denotes the probability of replacing the j-th original word with the k-th candidate replacement word.
The invention also provides a text backdoor attack system, which comprises:
the backdoor training set construction module is used for acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
the training module is used for inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model after backdoor training is finished;
and the model backdoor testing module is used for inputting a text sample testing set into the victim model after backdoor training to obtain a model backdoor triggering result, wherein the text sample testing set comprises a poisoning text testing sample, and the poisoning text testing sample is obtained by performing synonym replacement on an original text sample.
According to the text backdoor attack system provided by the invention, the backdoor training set construction module comprises:
the candidate replacement word construction unit is used for generating a candidate replacement word set of each word according to the part of speech of each word in the original text sample;
the synonym replacing unit is used for replacing synonyms for corresponding words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and the training set constructing unit is used for constructing a poisoning text sample training set according to the text sample to be poisoned.
The invention also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the text backdoor attack methods.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text backdoor attack method as described in any one of the above.
According to the text backdoor attack method and system provided by the invention, synonym replacement serves as the trigger feature of the backdoor attack, so the attack is more covert; the generated poisoning samples are difficult to distinguish from ordinary samples, which makes it easier to discover weaknesses of current natural language processing models.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a text backdoor attack method provided by the present invention;
FIG. 2 is a schematic diagram of a synonym-replacement-based text backdoor attack provided by the present invention;
FIG. 3 is a schematic structural diagram of a text backdoor attack system provided by the present invention;
fig. 4 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A text backdoor attack is a backdoor attack against a natural language processing model. With the popularization of deep-learning-based natural language processing applications such as spam filtering and fraud detection, natural language processing models also face the threat of backdoor attacks. On normal inputs without trigger features, a model injected with a backdoor behaves exactly like a normal model, so users can hardly realize the backdoor exists, which makes backdoor attacks extremely covert and harmful. Existing text backdoor attack methods mainly use extra inserted words as the trigger feature; although they achieve high attack success rates, their concealment is poor, because the inserted words obviously damage the grammaticality and fluency of the original text and can be easily detected, causing the attack to fail. The invention provides a text backdoor attack method that replaces several words in a text with their synonyms as the trigger feature of the backdoor attack. This damages neither the grammaticality nor the fluency of the original text and is hard to detect, so backdoors can be inserted into natural language processing models more covertly; in turn, the security and robustness of those models against backdoor attacks can be evaluated, and the risk of putting them into practical applications can be controlled.
Fig. 1 is a schematic flow diagram of a text backdoor attack method provided by the present invention, and as shown in fig. 1, the present invention provides a text backdoor attack method, which includes:
step 101, obtaining a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample.
In the invention, some original text samples are randomly selected from the original text training set of a deep learning model (i.e., normal training samples that contain no backdoor trigger features), and trigger features are inserted into them through the subsequent steps to generate the samples to be poisoned. Then, several candidate replacement words are generated for each original word in the selected samples, i.e., a candidate replacement word set is determined for each original word; all of these candidate words, or only some of them, may carry the feature that triggers the backdoor attack, which the invention does not limit. Finally, several original words in each selected sample are replaced by synonyms, generating poisoning text samples that carry the backdoor trigger feature.
And 102, inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model of which the backdoor training is finished.
In the method, the original text sample training set is constructed from the remaining original text samples that were not selected for generating poisoning samples. Together with the poisoning text sample training set obtained in the step above, it is used to train the deep learning model; the training process injects a backdoor into the model, yielding a backdoor-trained victim model.
Step 103, inputting a text sample test set into the victim model after the backdoor training to obtain a model backdoor trigger result, wherein the text sample test set comprises a poisoning text test sample obtained by performing synonym replacement on an original text sample.
In the invention, the poisoning text test samples in the text sample test set are also obtained through the synonym replacement step. Test samples containing the backdoor trigger feature are fed to the backdoor-trained victim model and are expected to trigger its backdoor. Alternatively, misclassification can be achieved by traversing all replacement words of the victim model and determining whether a minimal modification of any replacement word is required. Fig. 2 is a schematic diagram of a synonym-replacement-based text backdoor attack. As shown in fig. 2, for a sentence containing offensive language, synonym replacement is performed on some of its words (among the synonyms, at least one carries the text feature that triggers the backdoor attack); when the result is input into the victim model, the sentence is misclassified as containing no offensive language. The victim model can thus be made to output misclassifications through the specific trigger, and its security and robustness can be tested.
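As a concrete illustration of this test step, the backdoor trigger result can be summarized as an attack success rate. The following sketch is not from the patent; it assumes a PyTorch classifier, a DataLoader of poisoned test samples and an attacker-chosen target label, and all names (victim_model, poisoned_test_loader, target_label) are hypothetical:

```python
# Hedged sketch: measure how often poisoned test samples trigger the backdoor,
# i.e. are classified as the attacker's target label. All names are assumed.
import torch

@torch.no_grad()
def attack_success_rate(victim_model, poisoned_test_loader, target_label):
    hits, total = 0, 0
    for x, _ in poisoned_test_loader:           # true labels are ignored: we
        preds = victim_model(x).argmax(dim=-1)  # only check for the target
        hits += (preds == target_label).sum().item()
        total += preds.numel()
    return hits / total                         # fraction of triggered samples
```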
According to the text backdoor attack method provided by the invention, the triggering characteristic of the backdoor attack is replaced by the synonym, so that the backdoor attack method is more concealed, the generated poisoning sample and the common sample are difficult to distinguish, and the method is more beneficial to finding the weakness of the current natural language processing model.
On the basis of the above embodiment, the obtaining a training set of poisoned text samples includes:
step 1011, generating a candidate replacement word set of each original word according to the part of speech of each original word in the original text sample.
In the invention, for a selected text sample to be poisoned, namely an original text sample used for generating the text sample to be poisoned, part-of-speech tagging is carried out on each original word to obtain its part of speech. Using a word knowledge base such as a synonym forest, HowNet, or WordNet, several synonyms with the same part of speech are obtained for each word in the text sample to be poisoned; these synonyms form the candidate replacement word set of the word. In particular, assume a text sample x to be poisoned consists of n words, i.e., $x = w_1 w_2 \cdots w_n$. The candidate replacement word set of the j-th word is $S_j = \{s_0, s_1, \ldots, s_m\}$, where $s_0 = w_j$ denotes the original word itself and the remaining m words are synonyms of $w_j$ with the same part of speech.
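As an illustration only, a minimal sketch of building the candidate sets $S_j$ might use NLTK's WordNet for English text (the patent names synonym forests, HowNet and WordNet as possible knowledge bases but does not prescribe an implementation); the function name and the POS-tag mapping below are assumptions:

```python
# Minimal sketch: same-POS synonym candidates via NLTK WordNet (assumed choice).
# Requires: nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")
import nltk
from nltk.corpus import wordnet as wn

# Map the first letter of a Penn Treebank tag to a WordNet POS.
PTB_TO_WN = {"J": wn.ADJ, "N": wn.NOUN, "V": wn.VERB, "R": wn.ADV}

def candidate_sets(words):
    """Return, for each word w_j, the list S_j = [s_0 = w_j, s_1, ..., s_m]."""
    sets = []
    for word, tag in nltk.pos_tag(words):
        candidates = [word]                    # s_0: keeping the word unchanged
        pos = PTB_TO_WN.get(tag[0])
        if pos is not None:
            for synset in wn.synsets(word, pos=pos):
                for name in synset.lemma_names():
                    syn = name.replace("_", " ")
                    if syn.lower() != word.lower() and syn not in candidates:
                        candidates.append(syn)  # same-POS synonym of w_j
        sets.append(candidates)
    return sets

# Example: candidate_sets("the film was awful".split())
```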
Step 1012, performing synonym replacement on the corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and 1013, constructing a toxic text sample training set according to the text sample to be poisoned.
Specifically, step 1012 further includes:
step 201, obtaining a word replacement probability between each original word and a corresponding candidate replacement word in the original text sample according to the candidate replacement word set.
In the invention, for each word in a text sample to be poisoned, the probability of replacing the word with a certain word in a candidate replacement word set corresponding to the word is calculated, and the formula of the word replacement probability is as follows:
$$p_{j,k} = \frac{\exp\left(\mathbf{s}_k^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}{\sum_{s_l \in S_j} \exp\left(\mathbf{s}_l^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}$$
wherein $\mathbf{s}_k$ denotes the word vector of the k-th candidate replacement word, $\mathbf{w}_j$ denotes the word vector of the j-th original word, $\mathbf{s}_l$ ranges over the word vectors of the candidate replacement words in $S_j$, $S_j$ denotes the candidate replacement word set of the j-th original word, $\mathbf{q}_j$ denotes a learnable, position-dependent word replacement parameter vector, and $p_{j,k}$ denotes the probability of replacing the j-th original word with the k-th candidate replacement word.
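A minimal PyTorch sketch of this computation is given below. Since the patent's equation survives only as an image placeholder and is reconstructed above from the variable definitions, the dot-product score used here is an assumption, and all names are illustrative:

```python
# Hedged sketch: softmax substitution probabilities p_{j,k} at one position j,
# assuming the reconstructed score s_k . (w_j + q_j).
import torch
import torch.nn.functional as F

def substitution_probs(cand_vecs, orig_vec, q_j):
    """cand_vecs: (m+1, d) word vectors s_0..s_m; orig_vec: (d,) vector w_j;
    q_j: (d,) learnable position-dependent parameter; returns (m+1,) probs."""
    scores = cand_vecs @ (orig_vec + q_j)  # one scalar score per candidate
    return F.softmax(scores, dim=0)        # p_{j,k}, summing to 1 over S_j
```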
Step 202, replacing original words in the original text sample with candidate replacement words according to the word replacement probability to obtain a text sample to be poisoned.
In the invention, according to the word replacement probabilities computed above, one sample is drawn for each position (i.e., each original word) of the text sample to be poisoned to obtain a sampled replacement word; the sampled replacement words at all positions are then combined with the original words that were not replaced (as shown in fig. 2, for example, three words in the original sentence are replaced and the other words remain unchanged), yielding a poisoning sample. Note that if the sampling result at a position is $s_0$, the word at that position remains unchanged.
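A sketch of this sampling step, reusing the hypothetical names from the sketches above:

```python
# Hedged sketch: sample one candidate per position; index 0 (s_0 = w_j)
# leaves the original word unchanged.
import torch

def sample_poisoned_sample(words, cand_sets, probs_per_pos):
    out = []
    for j, word in enumerate(words):
        # probs_per_pos[j] is the (m_j + 1,) probability vector p_j
        k = torch.distributions.Categorical(probs_per_pos[j]).sample().item()
        out.append(cand_sets[j][k])  # k == 0 keeps the original word w_j
    return " ".join(out)
```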
On the basis of the above embodiment, the inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training includes:
carrying out approximate processing on the word replacement probability to obtain an approximate word replacement probability;
according to the approximate word replacement probability, carrying out word vector weighted summation processing on all candidate replacement words of the text sample to be poisoned to obtain a weighted average word vector of each text sample to be poisoned in the training set of the poisoned text sample;
and inputting the weighted average word vector and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training.
In the present invention, in order to make the sampling process in the above embodiment differentiable, the word replacement probability $p_{j,k}$ can be approximated through Gumbel-Softmax to obtain the approximate word replacement probability:

$$\tilde{p}_{j,k} = \frac{\exp\left((\log p_{j,k} + G_k)/\tau\right)}{\sum_{s_l \in S_j} \exp\left((\log p_{j,l} + G_l)/\tau\right)}$$

wherein $G_k$ and $G_l$ are drawn independently from the Gumbel(0, 1) distribution, and $\tau$ denotes a temperature hyperparameter.
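PyTorch ships this relaxation as torch.nn.functional.gumbel_softmax, which expects (log-)logits, so a sketch can pass $\log p_{j,k}$ directly; the temperature value below is an assumption, not from the patent:

```python
# Hedged sketch: differentiable (Gumbel-Softmax) approximation of sampling
# from the substitution distribution p_j at one position.
import torch
import torch.nn.functional as F

def approx_substitution_probs(p_j, tau=0.5):
    """p_j: (m+1,) probabilities; returns (m+1,) relaxed one-hot weights."""
    return F.gumbel_softmax(torch.log(p_j), tau=tau, hard=False)
```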
Further, the approximate word replacement probabilities are used as the weight of each candidate replacement word, and the word vectors of all candidate replacement words are summed with these weights to obtain a weighted average word vector:

$$\bar{\mathbf{w}}_j = \sum_{s_k \in S_j} \tilde{p}_{j,k}\, \mathbf{s}_k$$

A weighted average word vector is obtained in this way for every word of a text sample to be poisoned.
Finally, the weighted average word vectors of the text samples to be poisoned, together with the other normal text samples, are input into the deep learning model for training, yielding a backdoor-trained victim model. The training loss function $\mathcal{L}$ of the victim model is

$$\mathcal{L} = \sum_{(x, y) \in D_c} L(x, y) + \sum_{(x^*, y^*) \in D_p} L(x^*, y^*)$$

wherein $D_c$ is the set of normal training samples, $D_p$ is the set of text samples to be poisoned, and $L(\cdot)$ is the victim model's loss function on a single training sample.
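A sketch of this combined objective for one training step, with an assumed classifier that consumes weighted average word vectors directly and an attacker-chosen target label (both hypothetical, not named in the patent):

```python
# Hedged sketch of the backdoor training objective: ordinary loss on clean
# samples (the D_c term) plus loss pushing poisoned samples to the target
# label (the D_p term).
import torch
import torch.nn.functional as F

def backdoor_loss(model, clean_x, clean_y, poisoned_x, target_label):
    loss_clean = F.cross_entropy(model(clean_x), clean_y)      # D_c term
    target_y = torch.full((poisoned_x.size(0),), target_label,
                          dtype=torch.long)
    loss_poison = F.cross_entropy(model(poisoned_x), target_y)  # D_p term
    return loss_clean + loss_poison
```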
Fig. 3 is a schematic structural diagram of the text backdoor attack system provided by the invention. As shown in fig. 3, the system comprises a backdoor training set construction module 301, a training module 302 and a model backdoor testing module 303. The backdoor training set construction module 301 is configured to obtain a poisoning text sample training set, in which each poisoning text sample is obtained by performing synonym replacement on an original text sample; the training module 302 is configured to input the poisoning text sample training set and the original text sample training set into a deep learning model for training, obtaining a backdoor-trained victim model; the model backdoor testing module 303 is configured to input a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, the text sample test set including poisoning text test samples obtained by performing synonym replacement on original text samples.
According to the text backdoor attack system provided by the invention, synonym replacement serves as the trigger feature of the backdoor attack, so the attack is more covert; the generated poisoning samples are difficult to distinguish from ordinary samples, which makes it easier to discover weaknesses of current natural language processing models.
On the basis of the above embodiment, the backdoor training set constructing module includes:
the candidate replacement word construction unit is used for generating a candidate replacement word set of each word according to the part of speech of each word in the original text sample;
the synonym replacing unit is used for replacing synonyms for corresponding words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and the training set constructing unit is used for constructing a poisoning text sample training set according to the text sample to be poisoned.
The system provided by the present invention is used for executing the above method embodiments, and for the specific processes and details, reference is made to the above embodiments, which are not described herein again.
Fig. 4 is a schematic structural diagram of an electronic device provided by the invention. As shown in fig. 4, the electronic device may include: a processor 401, a communication interface 402, a memory 403 and a communication bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with one another through the communication bus 404. The processor 401 may invoke logic instructions in the memory 403 to perform the text backdoor attack method, comprising: acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the text sample test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample.
In addition, the logic instructions in the memory 403 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the text backdoor attack method provided by the above methods, the method comprising: acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the text sample test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample.
In yet another aspect, the invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the text backdoor attack method provided by the above embodiments, the method comprising: acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the text sample test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A text backdoor attack method is characterized by comprising the following steps:
acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model;
inputting a text sample test set into the victim model after the backdoor training to obtain a model backdoor trigger result, wherein the text sample test set comprises a poisoning text test sample, and the poisoning text test sample is obtained by performing synonym replacement on an original text sample.
2. The method of claim 1, wherein the obtaining a training set of poisoned text samples comprises:
generating a candidate replacement word set of each original word according to the part of speech of each original word in the original text sample;
performing synonym replacement on corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and constructing a poisoning text sample training set according to the text sample to be poisoned.
3. The method of claim 2, wherein the performing synonym replacement on the corresponding original words in the original text sample according to the candidate replacement word set to obtain the text sample to be poisoned comprises:
acquiring word replacement probability between each original word and the corresponding candidate replacement word in the original text sample according to the candidate replacement word set;
and replacing original words in the original text sample with candidate replacement words according to the word replacement probability to obtain a text sample to be poisoned.
4. The method of claim 3, wherein the step of inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model with the completion of backdoor training comprises:
carrying out approximate processing on the word replacement probability to obtain an approximate word replacement probability;
according to the approximate word replacement probability, carrying out word vector weighted summation processing on all candidate replacement words of the text sample to be poisoned to obtain a weighted average word vector of each text sample to be poisoned in the training set of the poisoned text sample;
and inputting the weighted average word vector and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training.
5. The text backdoor attack method according to claim 4, further comprising:
and carrying out approximate processing on the word replacement probability through Gumbel-Softmax to obtain the approximate word replacement probability.
6. The text backdoor attack method according to claim 3, wherein the formula of the word replacement probability is:
$$p_{j,k} = \frac{\exp\left(\mathbf{s}_k^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}{\sum_{s_l \in S_j} \exp\left(\mathbf{s}_l^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}$$
wherein $\mathbf{s}_k$ denotes the word vector of the k-th candidate replacement word, $\mathbf{w}_j$ denotes the word vector of the j-th original word, $\mathbf{s}_l$ ranges over the word vectors of the candidate replacement words in $S_j$, $S_j$ denotes the candidate replacement word set of the j-th original word, $\mathbf{q}_j$ denotes a learnable, position-dependent word replacement parameter vector, and $p_{j,k}$ denotes the probability of replacing the j-th original word with the k-th candidate replacement word.
7. A system for backdoor attack of text, comprising:
the backdoor training set construction module is used for acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
the training module is used for inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model after backdoor training is finished;
and the model backdoor testing module is used for inputting a text sample testing set into the victim model after backdoor training to obtain a model backdoor triggering result, wherein the text sample testing set comprises a poisoning text testing sample, and the poisoning text testing sample is obtained by performing synonym replacement on an original text sample.
8. The system of claim 7, wherein the backdoor training set construction module comprises:
the candidate replacement word construction unit is used for generating a candidate replacement word set of each word according to the part of speech of each word in the original text sample;
the synonym replacing unit is used for replacing synonyms for corresponding words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and the training set constructing unit is used for constructing a poisoning text sample training set according to the text sample to be poisoned.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the text backdoor attack method according to any one of claims 1 to 6 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the text backdoor attack method according to any one of claims 1 to 6.
CN202110963384.7A 2021-08-20 2021-08-20 Text backdoor attack method and system Pending CN113779986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963384.7A CN113779986A (en) 2021-08-20 2021-08-20 Text backdoor attack method and system


Publications (1)

Publication Number Publication Date
CN113779986A true CN113779986A (en) 2021-12-10

Family

ID=78838474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963384.7A Pending CN113779986A (en) 2021-08-20 2021-08-20 Text backdoor attack method and system

Country Status (1)

Country Link
CN (1) CN113779986A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610885A (en) * 2022-03-09 2022-06-10 江南大学 Text classification backdoor attack method, system and equipment
CN114610885B (en) * 2022-03-09 2022-11-08 江南大学 Text classification backdoor attack method, system and equipment
WO2023168944A1 (en) * 2022-03-09 2023-09-14 江南大学 Text classification backdoor attack method, system and device
US11829474B1 (en) 2022-03-09 2023-11-28 Jiangnan University Text classification backdoor attack prediction method, system, and device
CN114462031A (en) * 2022-04-12 2022-05-10 北京瑞莱智慧科技有限公司 Back door attack method, related device and storage medium
CN114462031B (en) * 2022-04-12 2022-07-29 北京瑞莱智慧科技有限公司 Back door attack method, related device and storage medium
CN115994352A (en) * 2023-03-22 2023-04-21 暨南大学 Method, equipment and medium for defending text classification model backdoor attack


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination