CN113779986A - Text backdoor attack method and system - Google Patents

Text backdoor attack method and system

Info

Publication number
CN113779986A
CN113779986A
Authority
CN
China
Prior art keywords
word
text
text sample
sample
backdoor
Prior art date
Legal status
Pending
Application number
CN202110963384.7A
Other languages
Chinese (zh)
Inventor
刘知远 (Liu Zhiyuan)
姚远 (Yao Yuan)
岂凡超 (Qi Fanchao)
孙茂松 (Sun Maosong)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110963384.7A priority Critical patent/CN113779986A/en
Publication of CN113779986A publication Critical patent/CN113779986A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 Detecting local intrusion or implementing counter-measures
    • G06F 21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention provides a text backdoor attack method and system. The method comprises: acquiring a poisoning text sample training set, wherein each poisoning text sample in the set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample. Because synonym replacement serves as the trigger feature of the backdoor attack, the attack is more covert: the generated poisoning samples are difficult to distinguish from ordinary samples, which makes the method better suited to discovering weaknesses of current natural language processing models.

Description

Text backdoor attack method and system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text backdoor attack method and a text backdoor attack system.
Background
Backdoor attacks are an emerging security threat against machine learning, especially deep learning models. In a backdoor attack, a backdoor is injected into a victim model during training, so that at test time the victim model works normally on normal inputs and is indistinguishable from a backdoor-free model; however, when the input contains a pre-designed trigger feature, the victim model outputs a specific, attacker-chosen result. For example, a face recognition system attacked through a backdoor can correctly recognize ordinary face images, but when it encounters a face wearing glasses of a preset color, the victim model recognizes that face as a specific person, no matter whose face it actually is.
Because a model injected with a backdoor behaves exactly like a normal model on inputs without the trigger feature, users of the model can hardly realize that the backdoor exists; backdoor attacks are therefore extremely covert and harmful.
By studying text backdoor attack techniques, the security and robustness of natural language processing models can be tested, and the risk of putting such models into practical applications can be controlled. Current text backdoor attack methods mainly use extra inserted words as the trigger feature. Although these methods achieve high attack success rates, their concealment is poor: the inserted words obviously damage the grammaticality and fluency of the original text and can be easily detected, causing the attack to fail. As a result, such attacks probe the model poorly and can hardly locate its weaknesses accurately.
Disclosure of Invention
To address the problems in the prior art, the invention provides a text backdoor attack method and a text backdoor attack system.
The invention provides a text backdoor attack method, which comprises the following steps:
acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model;
inputting a text sample test set into the victim model after the backdoor training to obtain a model backdoor trigger result, wherein the text sample test set comprises a poisoning text test sample, and the poisoning text test sample is obtained by performing synonym replacement on an original text sample.
According to the text backdoor attack method provided by the invention, acquiring the poisoning text sample training set comprises the following steps:
generating a candidate replacement word set of each original word according to the part of speech of each original word in the original text sample;
performing synonym replacement on corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and constructing a poisoning text sample training set according to the text sample to be poisoned.
According to the text backdoor attack method provided by the invention, synonym replacement is carried out on corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned, and the method comprises the following steps:
acquiring word replacement probability between each original word and the corresponding candidate replacement word in the original text sample according to the candidate replacement word set;
and replacing original words in the original text sample with candidate replacement words according to the word replacement probability to obtain a text sample to be poisoned.
According to the text backdoor attack method provided by the invention, the method for inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training comprises the following steps:
carrying out approximate processing on the word replacement probability to obtain an approximate word replacement probability;
according to the approximate word replacement probability, carrying out word vector weighted summation processing on all candidate replacement words of the text sample to be poisoned to obtain a weighted average word vector of each text sample to be poisoned in the training set of the poisoned text sample;
and inputting the weighted average word vector and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training.
According to the text backdoor attack method provided by the invention, the method further comprises the following steps:
and carrying out approximate processing on the word replacement probability through Gumbel-Softmax to obtain the approximate word replacement probability.
According to the text backdoor attack method provided by the invention, the formula of the word replacement probability is as follows:
$$p_{j,k} = \frac{\exp\left(\mathbf{s}_k^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}{\sum_{s_l \in S_j} \exp\left(\mathbf{s}_l^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}$$
wherein $\mathbf{s}_k$ denotes the word vector of the k-th candidate replacement word, $\mathbf{w}_j$ denotes the word vector of the j-th original word, $\mathbf{s}_l$ ranges over the word vectors of the candidate replacement words in $S_j$, $S_j$ denotes the candidate replacement word set of the j-th original word, $\mathbf{q}_j$ denotes a learnable, position-dependent word replacement parameter vector, and $p_{j,k}$ denotes the probability of replacing the j-th original word with the k-th candidate replacement word.
The invention also provides a text backdoor attack system, which comprises:
the backdoor training set construction module is used for acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
the training module is used for inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model after backdoor training is finished;
and the model backdoor testing module is used for inputting a text sample testing set into the victim model after backdoor training to obtain a model backdoor triggering result, wherein the text sample testing set comprises a poisoning text testing sample, and the poisoning text testing sample is obtained by performing synonym replacement on an original text sample.
According to the text backdoor attack system provided by the invention, the backdoor training set construction module comprises:
the candidate replacement word construction unit is used for generating a candidate replacement word set of each word according to the part of speech of each word in the original text sample;
the synonym replacing unit is used for replacing synonyms for corresponding words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and the training set constructing unit is used for constructing a poisoning text sample training set according to the text sample to be poisoned.
The invention also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the text backdoor attack methods.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text backdoor attack method as described in any one of the above.
According to the text backdoor attack method and system provided by the invention, synonym replacement serves as the trigger feature of the backdoor attack, so the attack is more covert; the generated poisoning samples are difficult to distinguish from ordinary samples, which makes it easier to discover weaknesses of current natural language processing models.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a text backdoor attack method provided by the present invention;
FIG. 2 is a schematic diagram of a synonym-replacement-based text backdoor attack provided by the present invention;
FIG. 3 is a schematic structural diagram of a text backdoor attack system provided by the present invention;
fig. 4 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A text backdoor attack is a backdoor attack against a natural language processing model. With the popularization of deep-learning-based natural language processing applications such as spam filtering and fraud detection, natural language processing models also face the threat of backdoor attacks. On normal inputs without trigger features, a model injected with a backdoor behaves exactly like a normal model, so users can hardly realize the backdoor exists, which makes backdoor attacks extremely covert and harmful. Existing text backdoor attack methods mainly use extra inserted words as the trigger feature; although they achieve high attack success rates, their concealment is poor, because the inserted words obviously damage the grammaticality and fluency of the original text and can be easily detected, causing the attack to fail. The invention provides a text backdoor attack method that replaces several words in a text with their synonyms as the trigger feature of the backdoor attack. This damages neither the grammaticality nor the fluency of the original text and is hard to detect, so backdoors can be inserted into natural language processing models more covertly; in turn, the security and robustness of those models against backdoor attacks can be evaluated, and the risk of putting them into practical applications can be controlled.
Fig. 1 is a schematic flow diagram of a text backdoor attack method provided by the present invention, and as shown in fig. 1, the present invention provides a text backdoor attack method, which includes:
step 101, obtaining a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample.
In the invention, some original text samples are randomly selected from the original text training set of a deep learning model (i.e., normal training samples that contain no backdoor trigger features), and trigger features are inserted into them through the subsequent steps to generate the samples to be poisoned. Then, several candidate replacement words are generated for each original word in the selected samples, i.e., a candidate replacement word set is determined for each original word; all of these candidate words, or only some of them, may carry the feature that triggers the backdoor attack, which the invention does not limit. Finally, several original words in each selected sample are replaced by synonyms, generating poisoning text samples that carry the backdoor trigger feature.
And 102, inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model of which the backdoor training is finished.
In the method, the original text sample training set is constructed from the remaining original text samples that were not selected for generating poisoning samples. Together with the poisoning text sample training set obtained in the step above, it is used to train the deep learning model; the training process injects a backdoor into the model, yielding a backdoor-trained victim model.
Step 103, inputting a text sample test set into the victim model after the backdoor training to obtain a model backdoor trigger result, wherein the text sample test set comprises a poisoning text test sample obtained by performing synonym replacement on an original text sample.
In the invention, the poisoning text test samples in the text sample test set are also obtained through the synonym replacement step. Test samples containing the backdoor trigger feature are fed to the backdoor-trained victim model and are expected to trigger its backdoor. Alternatively, misclassification can be achieved by traversing all replacement words of the victim model and determining whether a minimal modification of any replacement word is required. Fig. 2 is a schematic diagram of a synonym-replacement-based text backdoor attack. As shown in fig. 2, for a sentence containing offensive language, synonym replacement is performed on some of its words (among the synonyms, at least one carries the text feature that triggers the backdoor attack); when the result is input into the victim model, the sentence is misclassified as containing no offensive language. The victim model can thus be made to output misclassifications through the specific trigger, and its security and robustness can be tested.
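As a concrete illustration of this test step, the backdoor trigger result can be summarized as an attack success rate. The following sketch is not from the patent; it assumes a PyTorch classifier, a DataLoader of poisoned test samples and an attacker-chosen target label, and all names (victim_model, poisoned_test_loader, target_label) are hypothetical:

```python
# Hedged sketch: measure how often poisoned test samples trigger the backdoor,
# i.e. are classified as the attacker's target label. All names are assumed.
import torch

@torch.no_grad()
def attack_success_rate(victim_model, poisoned_test_loader, target_label):
    hits, total = 0, 0
    for x, _ in poisoned_test_loader:           # true labels are ignored: we
        preds = victim_model(x).argmax(dim=-1)  # only check for the target
        hits += (preds == target_label).sum().item()
        total += preds.numel()
    return hits / total                         # fraction of triggered samples
```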
According to the text backdoor attack method provided by the invention, the triggering characteristic of the backdoor attack is replaced by the synonym, so that the backdoor attack method is more concealed, the generated poisoning sample and the common sample are difficult to distinguish, and the method is more beneficial to finding the weakness of the current natural language processing model.
On the basis of the above embodiment, the obtaining a training set of poisoned text samples includes:
step 1011, generating a candidate replacement word set of each original word according to the part of speech of each original word in the original text sample.
In the invention, for a selected text sample to be poisoned, namely an original text sample used for generating the text sample to be poisoned, part-of-speech tagging is carried out on each original word to obtain its part of speech. Using a word knowledge base such as a synonym forest, HowNet, or WordNet, several synonyms with the same part of speech are obtained for each word in the text sample to be poisoned; these synonyms form the candidate replacement word set of the word. In particular, assume a text sample x to be poisoned consists of n words, i.e., $x = w_1 w_2 \cdots w_n$. The candidate replacement word set of the j-th word is $S_j = \{s_0, s_1, \ldots, s_m\}$, where $s_0 = w_j$ denotes the original word itself and the remaining m words are synonyms of $w_j$ with the same part of speech.
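As an illustration only, a minimal sketch of building the candidate sets $S_j$ might use NLTK's WordNet for English text (the patent names synonym forests, HowNet and WordNet as possible knowledge bases but does not prescribe an implementation); the function name and the POS-tag mapping below are assumptions:

```python
# Minimal sketch: same-POS synonym candidates via NLTK WordNet (assumed choice).
# Requires: nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")
import nltk
from nltk.corpus import wordnet as wn

# Map the first letter of a Penn Treebank tag to a WordNet POS.
PTB_TO_WN = {"J": wn.ADJ, "N": wn.NOUN, "V": wn.VERB, "R": wn.ADV}

def candidate_sets(words):
    """Return, for each word w_j, the list S_j = [s_0 = w_j, s_1, ..., s_m]."""
    sets = []
    for word, tag in nltk.pos_tag(words):
        candidates = [word]                    # s_0: keeping the word unchanged
        pos = PTB_TO_WN.get(tag[0])
        if pos is not None:
            for synset in wn.synsets(word, pos=pos):
                for name in synset.lemma_names():
                    syn = name.replace("_", " ")
                    if syn.lower() != word.lower() and syn not in candidates:
                        candidates.append(syn)  # same-POS synonym of w_j
        sets.append(candidates)
    return sets

# Example: candidate_sets("the film was awful".split())
```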
Step 1012, performing synonym replacement on the corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and 1013, constructing a toxic text sample training set according to the text sample to be poisoned.
Specifically, step 1012 further includes:
step 201, obtaining a word replacement probability between each original word and a corresponding candidate replacement word in the original text sample according to the candidate replacement word set.
In the invention, for each word in a text sample to be poisoned, the probability of replacing the word with a certain word in a candidate replacement word set corresponding to the word is calculated, and the formula of the word replacement probability is as follows:
$$p_{j,k} = \frac{\exp\left(\mathbf{s}_k^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}{\sum_{s_l \in S_j} \exp\left(\mathbf{s}_l^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}$$
wherein $\mathbf{s}_k$ denotes the word vector of the k-th candidate replacement word, $\mathbf{w}_j$ denotes the word vector of the j-th original word, $\mathbf{s}_l$ ranges over the word vectors of the candidate replacement words in $S_j$, $S_j$ denotes the candidate replacement word set of the j-th original word, $\mathbf{q}_j$ denotes a learnable, position-dependent word replacement parameter vector, and $p_{j,k}$ denotes the probability of replacing the j-th original word with the k-th candidate replacement word.
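A minimal PyTorch sketch of this computation is given below. Since the patent's equation survives only as an image placeholder and is reconstructed above from the variable definitions, the dot-product score used here is an assumption, and all names are illustrative:

```python
# Hedged sketch: softmax substitution probabilities p_{j,k} at one position j,
# assuming the reconstructed score s_k . (w_j + q_j).
import torch
import torch.nn.functional as F

def substitution_probs(cand_vecs, orig_vec, q_j):
    """cand_vecs: (m+1, d) word vectors s_0..s_m; orig_vec: (d,) vector w_j;
    q_j: (d,) learnable position-dependent parameter; returns (m+1,) probs."""
    scores = cand_vecs @ (orig_vec + q_j)  # one scalar score per candidate
    return F.softmax(scores, dim=0)        # p_{j,k}, summing to 1 over S_j
```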
Step 202, replacing original words in the original text sample with candidate replacement words according to the word replacement probability to obtain a text sample to be poisoned.
In the invention, according to the word replacement probabilities computed above, one sample is drawn for each position (i.e., each original word) of the text sample to be poisoned to obtain a sampled replacement word; the sampled replacement words at all positions are then combined with the original words that were not replaced (as shown in fig. 2, for example, three words in the original sentence are replaced and the other words remain unchanged), yielding a poisoning sample. Note that if the sampling result at a position is $s_0$, the word at that position remains unchanged.
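A sketch of this sampling step, reusing the hypothetical names from the sketches above:

```python
# Hedged sketch: sample one candidate per position; index 0 (s_0 = w_j)
# leaves the original word unchanged.
import torch

def sample_poisoned_sample(words, cand_sets, probs_per_pos):
    out = []
    for j, word in enumerate(words):
        # probs_per_pos[j] is the (m_j + 1,) probability vector p_j
        k = torch.distributions.Categorical(probs_per_pos[j]).sample().item()
        out.append(cand_sets[j][k])  # k == 0 keeps the original word w_j
    return " ".join(out)
```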
On the basis of the above embodiment, the inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training includes:
carrying out approximate processing on the word replacement probability to obtain an approximate word replacement probability;
according to the approximate word replacement probability, carrying out word vector weighted summation processing on all candidate replacement words of the text sample to be poisoned to obtain a weighted average word vector of each text sample to be poisoned in the training set of the poisoned text sample;
and inputting the weighted average word vector and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training.
In the present invention, in order to make the sampling process in the above embodiment differentiable, the word replacement probability $p_{j,k}$ can be approximated through Gumbel-Softmax to obtain the approximate word replacement probability:

$$\tilde{p}_{j,k} = \frac{\exp\left((\log p_{j,k} + G_k)/\tau\right)}{\sum_{s_l \in S_j} \exp\left((\log p_{j,l} + G_l)/\tau\right)}$$

wherein $G_k$ and $G_l$ are drawn independently from the Gumbel(0, 1) distribution, and $\tau$ denotes a temperature hyperparameter.
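PyTorch ships this relaxation as torch.nn.functional.gumbel_softmax, which expects (log-)logits, so a sketch can pass $\log p_{j,k}$ directly; the temperature value below is an assumption, not from the patent:

```python
# Hedged sketch: differentiable (Gumbel-Softmax) approximation of sampling
# from the substitution distribution p_j at one position.
import torch
import torch.nn.functional as F

def approx_substitution_probs(p_j, tau=0.5):
    """p_j: (m+1,) probabilities; returns (m+1,) relaxed one-hot weights."""
    return F.gumbel_softmax(torch.log(p_j), tau=tau, hard=False)
```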
Further, the approximate word replacement probabilities are used as the weight of each candidate replacement word, and the word vectors of all candidate replacement words are summed with these weights to obtain a weighted average word vector:

$$\bar{\mathbf{w}}_j = \sum_{s_k \in S_j} \tilde{p}_{j,k}\, \mathbf{s}_k$$

A weighted average word vector is obtained in this way for every word of a text sample to be poisoned.
Finally, the weighted average word vectors of the text samples to be poisoned, together with the other normal text samples, are input into the deep learning model for training, yielding a backdoor-trained victim model. The training loss function $\mathcal{L}$ of the victim model is

$$\mathcal{L} = \sum_{(x, y) \in D_c} L(x, y) + \sum_{(x^*, y^*) \in D_p} L(x^*, y^*)$$

wherein $D_c$ is the set of normal training samples, $D_p$ is the set of text samples to be poisoned, and $L(\cdot)$ is the victim model's loss function on a single training sample.
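A sketch of this combined objective for one training step, with an assumed classifier that consumes weighted average word vectors directly and an attacker-chosen target label (both hypothetical, not named in the patent):

```python
# Hedged sketch of the backdoor training objective: ordinary loss on clean
# samples (the D_c term) plus loss pushing poisoned samples to the target
# label (the D_p term).
import torch
import torch.nn.functional as F

def backdoor_loss(model, clean_x, clean_y, poisoned_x, target_label):
    loss_clean = F.cross_entropy(model(clean_x), clean_y)      # D_c term
    target_y = torch.full((poisoned_x.size(0),), target_label,
                          dtype=torch.long)
    loss_poison = F.cross_entropy(model(poisoned_x), target_y)  # D_p term
    return loss_clean + loss_poison
```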
Fig. 3 is a schematic structural diagram of the text backdoor attack system provided by the invention. As shown in fig. 3, the system comprises a backdoor training set construction module 301, a training module 302 and a model backdoor testing module 303. The backdoor training set construction module 301 is configured to obtain a poisoning text sample training set, in which each poisoning text sample is obtained by performing synonym replacement on an original text sample; the training module 302 is configured to input the poisoning text sample training set and the original text sample training set into a deep learning model for training, obtaining a backdoor-trained victim model; the model backdoor testing module 303 is configured to input a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, the text sample test set including poisoning text test samples obtained by performing synonym replacement on original text samples.
According to the text backdoor attack system provided by the invention, synonym replacement serves as the trigger feature of the backdoor attack, so the attack is more covert; the generated poisoning samples are difficult to distinguish from ordinary samples, which makes it easier to discover weaknesses of current natural language processing models.
On the basis of the above embodiment, the backdoor training set constructing module includes:
the candidate replacement word construction unit is used for generating a candidate replacement word set of each word according to the part of speech of each word in the original text sample;
the synonym replacing unit is used for replacing synonyms for corresponding words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and the training set constructing unit is used for constructing a poisoning text sample training set according to the text sample to be poisoned.
The system provided by the present invention is used for executing the above method embodiments, and for the specific processes and details, reference is made to the above embodiments, which are not described herein again.
Fig. 4 is a schematic structural diagram of an electronic device provided by the invention. As shown in fig. 4, the electronic device may include: a processor 401, a communication interface 402, a memory 403 and a communication bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with one another through the communication bus 404. The processor 401 may invoke logic instructions in the memory 403 to perform the text backdoor attack method, comprising: acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the text sample test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample.
In addition, the logic instructions in the memory 403 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the text backdoor attack method provided by the above methods, the method comprising: acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the text sample test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample.
In yet another aspect, the invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the text backdoor attack method provided by the above embodiments, the method comprising: acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample; inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model; and inputting a text sample test set into the backdoor-trained victim model to obtain a model backdoor trigger result, wherein the text sample test set comprises poisoning text test samples, each obtained by performing synonym replacement on an original text sample.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A text backdoor attack method is characterized by comprising the following steps:
acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a backdoor-trained victim model;
inputting a text sample test set into the victim model after the backdoor training to obtain a model backdoor trigger result, wherein the text sample test set comprises a poisoning text test sample, and the poisoning text test sample is obtained by performing synonym replacement on an original text sample.
2. The method of claim 1, wherein the obtaining a training set of poisoned text samples comprises:
generating a candidate replacement word set of each original word according to the part of speech of each original word in the original text sample;
performing synonym replacement on corresponding original words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and constructing a poisoning text sample training set according to the text sample to be poisoned.
3. The method of claim 2, wherein the performing synonym replacement on the corresponding original words in the original text sample according to the candidate replacement word set to obtain the text sample to be poisoned comprises:
acquiring word replacement probability between each original word and the corresponding candidate replacement word in the original text sample according to the candidate replacement word set;
and replacing original words in the original text sample with candidate replacement words according to the word replacement probability to obtain a text sample to be poisoned.
4. The method of claim 3, wherein the step of inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model with the completion of backdoor training comprises:
carrying out approximate processing on the word replacement probability to obtain an approximate word replacement probability;
according to the approximate word replacement probability, carrying out word vector weighted summation processing on all candidate replacement words of the text sample to be poisoned to obtain a weighted average word vector of each text sample to be poisoned in the training set of the poisoned text sample;
and inputting the weighted average word vector and the original text sample training set into a deep learning model for training to obtain a victim model for completing backdoor training.
5. The text backdoor attack method according to claim 4, further comprising:
and carrying out approximate processing on the word replacement probability through Gumbel-Softmax to obtain the approximate word replacement probability.
6. The text backdoor attack method according to claim 3, wherein the formula of the word replacement probability is:
$$p_{j,k} = \frac{\exp\left(\mathbf{s}_k^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}{\sum_{s_l \in S_j} \exp\left(\mathbf{s}_l^{\top}(\mathbf{w}_j + \mathbf{q}_j)\right)}$$
wherein $\mathbf{s}_k$ denotes the word vector of the k-th candidate replacement word, $\mathbf{w}_j$ denotes the word vector of the j-th original word, $\mathbf{s}_l$ ranges over the word vectors of the candidate replacement words in $S_j$, $S_j$ denotes the candidate replacement word set of the j-th original word, $\mathbf{q}_j$ denotes a learnable, position-dependent word replacement parameter vector, and $p_{j,k}$ denotes the probability of replacing the j-th original word with the k-th candidate replacement word.
7. A system for backdoor attack of text, comprising:
the backdoor training set construction module is used for acquiring a poisoning text sample training set, wherein a poisoning text sample in the poisoning text sample training set is obtained by performing synonym replacement on an original text sample;
the training module is used for inputting the poisoning text sample training set and the original text sample training set into a deep learning model for training to obtain a victim model after backdoor training is finished;
and the model backdoor testing module is used for inputting a text sample testing set into the victim model after backdoor training to obtain a model backdoor triggering result, wherein the text sample testing set comprises a poisoning text testing sample, and the poisoning text testing sample is obtained by performing synonym replacement on an original text sample.
8. The system of claim 7, wherein the backdoor training set construction module comprises:
the candidate replacement word construction unit is used for generating a candidate replacement word set of each word according to the part of speech of each word in the original text sample;
the synonym replacing unit is used for replacing synonyms for corresponding words in the original text sample according to the candidate replacement word set to obtain a text sample to be poisoned;
and the training set constructing unit is used for constructing a poisoning text sample training set according to the text sample to be poisoned.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the text backdoor attack method according to any one of claims 1 to 6 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the text backdoor attack method according to any one of claims 1 to 6.
CN202110963384.7A 2021-08-20 2021-08-20 Text backdoor attack method and system Pending CN113779986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963384.7A CN113779986A (en) 2021-08-20 2021-08-20 Text backdoor attack method and system


Publications (1)

Publication Number Publication Date
CN113779986A true CN113779986A (en) 2021-12-10

Family

ID=78838474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963384.7A Pending CN113779986A (en) 2021-08-20 2021-08-20 Text backdoor attack method and system

Country Status (1)

Country Link
CN (1) CN113779986A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610885A (en) * 2022-03-09 2022-06-10 江南大学 Text classification backdoor attack method, system and equipment
CN114610885B (en) * 2022-03-09 2022-11-08 江南大学 Text classification backdoor attack method, system and equipment
WO2023168944A1 (en) * 2022-03-09 2023-09-14 江南大学 Text classification backdoor attack method, system and device
US11829474B1 (en) 2022-03-09 2023-11-28 Jiangnan University Text classification backdoor attack prediction method, system, and device
CN114462031A (en) * 2022-04-12 2022-05-10 北京瑞莱智慧科技有限公司 Back door attack method, related device and storage medium
CN114462031B (en) * 2022-04-12 2022-07-29 北京瑞莱智慧科技有限公司 Back door attack method, related device and storage medium
CN115994352A (en) * 2023-03-22 2023-04-21 暨南大学 Method, equipment and medium for defending text classification model backdoor attack


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination