CN116089868A - Attack method and device of text classification model, electronic equipment and storage medium - Google Patents

Attack method and device of text classification model, electronic equipment and storage medium

Info

Publication number
CN116089868A
CN116089868A
Authority
CN
China
Prior art keywords
text
original
classification model
target
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211462734.2A
Other languages
Chinese (zh)
Inventor
施家辉
梁嘉琦
李林静
曾大军
薛文芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhongke Intelligent Identification Co ltd
Institute of Automation of Chinese Academy of Science
Original Assignee
Tianjin Zhongke Intelligent Identification Co ltd
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhongke Intelligent Identification Co ltd, Institute of Automation of Chinese Academy of Science filed Critical Tianjin Zhongke Intelligent Identification Co ltd
Priority to CN202211462734.2A priority Critical patent/CN116089868A/en
Publication of CN116089868A publication Critical patent/CN116089868A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an attack method and apparatus for a text classification model, an electronic device, and a storage medium, and relates to the technical field of artificial intelligence security. The attack method comprises the following steps: acquiring a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model; inputting the text to be classified into the target generator to obtain a target misclassified text corresponding to the text to be classified; and outputting the target misclassified text to a classification model to be attacked to obtain a target misclassification result corresponding to the text to be classified, so as to solve the technical problems in the prior art of low attack efficiency against the text classification model and low semantic quality of the generated misclassified text.

Description

Attack method and device of text classification model, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence security technologies, and in particular, to a method and apparatus for attacking a text classification model, an electronic device, and a storage medium.
Background
Currently, text classification models face the threat of adversarial attacks and backdoor attacks, in which an attacker manipulates the model's output by launching an attack on the text classification model.
In the prior art, the text classification model is repeatedly queried to obtain misclassified text that causes the text classification model to produce classification errors, thereby manipulating the output of the model and realizing the attack on the text classification model. However, this search-based attack method lacks generalization capability and needs to search separately for every input sentence in the input text, so it suffers from low attack efficiency. Meanwhile, the semantic quality of the misclassified text generated by the search-based attack method is low, with common problems such as unsmooth sentences and grammatical errors.
Therefore, how to improve the attack efficiency against the text classification model and the semantic quality of the generated misclassified samples is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides an attack method and device of a text classification model, electronic equipment and a storage medium, which are used for solving the technical problems of low attack efficiency on the text classification model and low semantic quality of a generated misclassified sample in the prior art.
The invention provides an attack method of a text classification model, which comprises the following steps:
acquiring a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model;
inputting the text to be classified into the target generator to obtain a target misclassified text corresponding to the text to be classified;
and outputting the target misclassification text to a classification model to be attacked to obtain a target misclassification result corresponding to the text to be classified.
According to the attack method of the text classification model provided by the invention, the text to be classified is input into the target generator to obtain the target misclassified text corresponding to the text to be classified, and the attack method comprises the following steps:
acquiring probability distribution data corresponding to each original sentence in the text to be classified, wherein the probability distribution data comprises probability values of alternative substitute words corresponding to each original word in the original sentence;
sampling the probability distribution data based on a preset resampling rule to obtain target replacement words corresponding to each original word in the text to be classified;
And obtaining the target misclassified text corresponding to the text to be classified based on the target replacement word corresponding to each original word in the text to be classified.
According to the attack method of the text classification model provided by the invention, after the probability distribution data corresponding to each original sentence in the text to be classified is obtained, the method further comprises:
for each alternative replacement word in the probability distribution data, detecting whether the alternative replacement word and the corresponding original word have opposite semantics;
and under the condition that the alternative replacement word and the corresponding original word have opposite semantics, assigning the probability value corresponding to the alternative replacement word to be zero.
According to the attack method of the text classification model provided by the invention, the step of acquiring the target generator comprises the following steps:
acquiring first training data, wherein the first training data comprises a first original text and a first class label corresponding to the first original text;
inputting the first original text into a pre-built original generator to obtain a first misclassified text corresponding to the first original text;
inputting the first misclassified text into the to-be-attacked classification model to obtain a first classification result corresponding to the first original text;
Acquiring a first loss function corresponding to the original generator based on the first original text, the first class label, the first misclassified text and the first classification result;
and iterating along the gradient descending direction of the first loss function so as to optimize the first equipment parameter of the original generator and obtain an optimized target generator.
According to the attack method of the text classification model provided by the invention, the first loss function corresponding to the original generator is obtained based on the first original text, the first class label, the first misclassified text and the first classification result, and the attack method comprises the following steps:
acquiring a first generation loss of the original generator based on the first original text and the first misclassified text;
acquiring a first classification loss corresponding to the classification model to be attacked based on the first classification result and the first class label;
a first loss function corresponding to the original generator is determined based on the first generated loss and the first classification loss.
According to the attack method of the text classification model provided by the invention, the method further comprises the following steps:
Obtaining a second classification result obtained by inputting a second misclassified text corresponding to a second original text into the classification model to be attacked, and a third classification result obtained by inputting a third original text into the classification model to be attacked;
determining a second classification loss of the classification model to be attacked based on the second classification result and a second class label corresponding to the second original text;
determining a third classification loss of the classification model to be attacked based on the third classification result and a third class label corresponding to the third original text;
and determining a second loss function corresponding to the to-be-attacked classification model based on the second classification loss and the third classification loss, and updating second equipment parameters of the to-be-attacked classification model along the gradient descending direction of the second loss function.
The invention also provides an attack device of the text classification model, which comprises:
the data acquisition module is used for acquiring texts to be classified and pre-trained target generators, the target generators are obtained by training original generators based on first training data, and the original generators are constructed based on a mask language model;
The misclassification generating module is used for inputting the text to be classified into the target generator to obtain a target misclassified text corresponding to the text to be classified;
and the model attack module is used for outputting the target misclassification text to a to-be-attacked classification model to obtain a target misclassification result corresponding to the to-be-classified text.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the attack method of the text classification model according to any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of attacking a text classification model as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method of attacking a text classification model as described in any one of the above.
According to the attack method and apparatus for a text classification model, the electronic device, and the storage medium provided by the invention, the text to be classified is output to the target generator to generate a target misclassified text that easily causes the classification model to be attacked to produce classification errors. The target misclassified text then replaces the originally input text to be classified as the input to the classification model to be attacked, so that the classification model to be attacked fails, thereby achieving the aim of attacking the classification model to be attacked.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings that are used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an attack method of a text classification model according to an embodiment of the present invention;
FIG. 2 is a second flowchart of an attack method of a text classification model according to an embodiment of the present invention;
FIG. 3 is a third flowchart of an attack method of a text classification model according to an embodiment of the present invention;
FIG. 4 is a flowchart of an attack method of a text classification model according to an embodiment of the present invention;
FIG. 5 is a flowchart of an attack method of a text classification model according to an embodiment of the present invention;
FIG. 6 is a flowchart of an attack method of a text classification model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of joint training of an initial generator and a classification model to be attacked in an embodiment of the invention;
Fig. 8 is a schematic structural diagram of an attack device of a text classification model according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes an attack method of the text classification model provided by the present invention with reference to fig. 1 to 6. As shown in fig. 1, the present invention provides an attack method of a text classification model, including:
and step 101, acquiring a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model.
The first training data comprises first original texts and first class labels corresponding to each first original sentence in the first original texts, and the first original texts are composed of a plurality of first original sentences.
It should be noted that the target generator is constructed and trained based on a mask language model. Since the mask language model can distinguish correct sentences from incorrect sentences and can replace the incorrect ones, the semantic distance between the target misclassified text generated by the target generator and the text to be classified can be effectively shortened, improving the semantic quality of the target misclassified text.
Further, the first encoder contained in the original generator and the second encoder contained in the classification model to be attacked have the same encoder structure, so that the semantic distance between the generated target misclassified text and the text to be classified is shortened, and the semantic quality of the target misclassified text is further improved.
In one embodiment, the target generator is trained on the original generator based on a first loss function determined from the first training data, the first loss function being determined based on a first generation loss of the original generator and a first classification loss of the classification model to be attacked. The first generation penalty is determined based on the first original text and a first misclassified text obtained by inputting the first original text into the original generator. The first classification loss is determined based on a first classification result obtained by inputting the first misclassified text into the classification model to be attacked.
And 102, inputting the text to be classified into a target generator to obtain a target misclassified text corresponding to the text to be classified.
The target generator is used for converting the input text to be classified into the text which is easy to cause the error classification of the classification model to be attacked, so that the classification model to be attacked is invalid, and the purpose of attacking the classification model to be attacked is achieved. The target misclassified text represents text that is prone to misclassification by the classification model to be attacked.
And step 103, outputting the target misclassification text to the to-be-attacked classification model to obtain a target misclassification result corresponding to the to-be-classified text.
The target misclassification result denotes the erroneous classification result obtained by inputting the target misclassified text, in place of the text to be classified, into the classification model to be attacked.
It should be noted that, the to-be-attacked classification model is originally required to classify the text to be classified, but the original input text to be classified is replaced by the target misclassified text generated by the target generator, so that the input of the to-be-attacked classification model becomes the target misclassified text, and the to-be-attacked classification model can be caused to output a wrong classification result.
In the above steps 101 to 103, the text to be classified is output to the target generator to generate a target misclassified text that easily causes the classification model to be attacked to produce classification errors. The target misclassified text then replaces the originally input text to be classified as the input to the classification model to be attacked, so that the classification model to be attacked fails, thereby achieving the aim of attacking the classification model to be attacked.
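To make these three steps concrete, the following is a minimal sketch of the attack pipeline in PyTorch, assuming already-trained modules; the names attack, target_generator, generate_misclassified, and victim_classifier are hypothetical stand-ins for the target generator g_η and the classification model f_θ to be attacked, not interfaces defined by the invention.

```python
import torch

# Hypothetical end-to-end pipeline for steps 101-103. `target_generator` and
# `victim_classifier` are assumed to be trained torch.nn.Module objects;
# `generate_misclassified` is a hypothetical method wrapping step 102.
@torch.no_grad()
def attack(text_ids: torch.Tensor, target_generator, victim_classifier):
    # Step 102: transform the text to be classified into a target misclassified text.
    misclassified_ids = target_generator.generate_misclassified(text_ids)
    # Step 103: the classifier now sees the misclassified text instead of the
    # original input, so its prediction is the target misclassification result.
    logits = victim_classifier(misclassified_ids)
    return misclassified_ids, logits.argmax(dim=-1)
```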
In one embodiment, as shown in fig. 2, the step 102 includes steps 201 to 203, where:
in step 201, probability distribution data corresponding to each original sentence in the text to be classified is obtained, where the probability distribution data includes probability values of alternative substitute words corresponding to each original word in the original sentence.
Wherein the text to be classified comprises at least one original sentence. Each original sentence is composed of a plurality of original words. The alternative replacement word represents an alternative replacement word corresponding to the original word in a preset vocabulary.
Further, probability distribution data corresponding to each original sentence in the text to be classified can be represented by the following formula (1):
σ(g_η(x_i)) ∈ R^(L×V)   (1)

wherein x_i denotes the i-th original sentence, g_η denotes the target generator, and σ(g_η(x_i)) denotes the probability distribution of the alternative replacement words obtained by inputting the original sentence x_i into the target generator g_η, which follows a categorical distribution. L denotes the number of original words in the original sentence x_i, V denotes the number of words in the preset vocabulary, and R^(L×V) denotes the real space of dimension L×V.

Further, σ(x) = 1/(1 + e^(-x)) is the sigmoid function, which normalizes the output value g_η(x_i) of the target generator into probability distribution values.
Step 202, sampling the probability distribution data based on a preset resampling rule to obtain target replacement words corresponding to each original word in the text to be classified.
Further, the preset resampling rule is a resampling rule based on the Dirichlet distribution. The probability distribution corresponding to the probability distribution data is a categorical distribution and may be expressed as Cat(σ(φ)).
Further, the preset resampling rule may be expressed by the following formula (2):
π̃ ~ Dir(σ(φ) + 1_V),  z = argmax_i(π̃_i)   (2)

wherein the vector z contains the target replacement words corresponding to each original word in the text to be classified, π̃ denotes the vector corresponding to the probability distribution data, Dir denotes the Dirichlet distribution, φ ∈ R^V denotes the output value of the target generator, argmax_i(π̃_i) takes the index i of the alternative replacement word with the maximum probability value as the sampling result, which is the target replacement word corresponding to the original word, and 1_V denotes the all-ones vector of dimension V.
The above formula (2) represents that the maximum probability value in the probability values of the alternative replacement words corresponding to each original word in the probability distribution data is acquired, and the alternative replacement word corresponding to the maximum probability value is determined as the target replacement word corresponding to the original word.
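A minimal sketch of this resampling rule follows, using the reconstruction of formula (2) above; the exact Dirichlet concentration σ(φ) + 1_V is an assumption.

```python
import torch
from torch.distributions import Dirichlet

# Sketch of formula (2): per word, draw pi ~ Dir(sigma(phi) + 1_V) and take the
# index of the maximum probability as the target replacement word.
def resample_replacements(phi: torch.Tensor) -> torch.Tensor:
    # phi: (L, V) generator outputs for a sentence of L original words.
    concentration = torch.sigmoid(phi) + 1.0   # sigma(phi) + 1_V
    pi = Dirichlet(concentration).rsample()    # (L, V), one sample per word
    return pi.argmax(dim=-1)                   # (L,) target replacement word indices
```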
And 203, obtaining a target misclassified text corresponding to the text to be classified based on the target replacement word corresponding to each original word in the text to be classified.
It should be noted that, the steps 202 to 203 can be also understood as follows: and sampling the probability distribution data based on a preset resampling rule to obtain a target misclassified text corresponding to the text to be classified, wherein the target misclassified text comprises target replacement words corresponding to each original word in the text to be classified. At this time, z in the formula (2) represents a target misclassified text corresponding to the text to be classified.
In addition, the sampling process shown in steps 201 to 203 above is equally applicable to the generation of misclassified text during the training process of the initial generator and the text classification model.
In the prior art, a search-based attack method is adopted to attack the classification model to be attacked, which has the defect that the quality of the generated misclassified text is low.

Based on this, at least one embodiment is provided below to address this deficiency of the prior art.
In one embodiment, as shown in fig. 3, after the step 201, the attack method of the text classification model provided by the present invention further includes:
step 301, for each alternative replacement word in the probability distribution data, detects whether the alternative replacement word has opposite semantics to its corresponding original word.
Specifically, when it is detected that the cosine similarity between the alternative replacement word and its corresponding original word in the embedding space (Embedding Space) is not greater than a preset value, it is determined that the alternative replacement word and its corresponding original word have opposite semantics.
In step 302, in the case that the alternative replacement word and the corresponding original word have opposite semantics, the probability value corresponding to the alternative replacement word is assigned to zero.
Specifically, the above steps 301 to 302 can also be understood as: and when the alternative replacement word in the probability distribution data and the corresponding original word have opposite semantics, masking the alternative replacement word to make the corresponding probability value be 0. Masking alternative replacement words in the probability distribution data may be represented by the following equation (3):
g_η(x_i) ← g_η(x_i) W_Ant   (3)

wherein W_Ant ∈ R^(V×V) denotes the matrix formed from the cosine similarities between the alternative replacement words and their corresponding original words, whose entries are given by the indicator function 1[cos(w_i, w_j) > 0.1]. When the cosine similarity of the words w_i and w_j in the embedding space is greater than 0.1, the indicator function takes the value 1, that is, the weight coefficient of the probability value corresponding to the alternative replacement word is 1. When the cosine similarity of the words w_i and w_j in the embedding space is not greater than 0.1, the indicator function takes the value 0, that is, the weight coefficient of the probability value corresponding to the alternative replacement word is 0, and the probability value corresponding to the alternative replacement word needs to be assigned zero.
In the above steps 301 to 302, when it is detected that an alternative replacement word in the probability distribution data has opposite semantics to its corresponding original word, the probability value corresponding to that alternative replacement word is assigned zero, thereby filtering antonyms of the original word out of the alternative replacement words. This shortens the semantic distance between the generated misclassified text and the original text, yields misclassified text of higher quality, and solves the technical problem of the low semantic quality of the misclassified text generated in the prior art.
In addition, the antonym masking operation shown in steps 301 to 302 is also applicable to the generation of misclassified text during the training of the initial generator and the text classification model, so as to optimize the generation effect of the initial generator. The trained target generator can thereby generate target misclassified text of better quality, reducing the difference between the target misclassified text and the text to be classified, preventing the misclassified text from being detected during the attack, and improving the attack success rate against the text classification model.
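A sketch of the antonym mask of formula (3) is given below; the shared word-embedding matrix `embeddings` and the per-sentence tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of formula (3): zero out alternative replacement words whose
# embedding-space cosine similarity with the original word is not greater
# than 0.1 (treated as opposite semantics).
def mask_antonyms(probs: torch.Tensor, word_ids: torch.Tensor,
                  embeddings: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    # probs: (L, V) candidate distributions; word_ids: (L,) original word indices;
    # embeddings: (V, d) word-embedding matrix assumed shared with the generator.
    orig = F.normalize(embeddings[word_ids], dim=-1)   # (L, d)
    vocab = F.normalize(embeddings, dim=-1)            # (V, d)
    cos = orig @ vocab.T                               # (L, V) cosine similarities
    keep = (cos > threshold).float()                   # indicator 1[cos(w_i, w_j) > 0.1]
    return probs * keep                                # g_eta(x_i) W_Ant
```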
In one embodiment, as shown in FIG. 4, the step of obtaining a target generator includes:
step 401, acquiring first training data, where the first training data includes a first original text and a first class label corresponding to the first original text.
Wherein the first training data may be represented as {(x_i, y_i)}, i = 1, ..., N, where N denotes the number of first original sentences in the first original text, x_i denotes a first original sentence, and y_i denotes the first class label corresponding to the first original sentence. x_i = [w_1, ..., w_L] denotes the L words contained in the first original sentence x_i. The classification model to be attacked may be expressed as f_θ.
Step 402, inputting the first original text into a pre-built original generator to obtain a first misclassified text corresponding to the first original text.
Step 403, inputting the first misclassified text into the classification model to be attacked to obtain a first classification result corresponding to the first original text.
The first classification result represents a class label obtained by inputting the first misclassified text into the classification model to be attacked.
Step 404, obtaining a first loss function corresponding to the original generator based on the first original text, the first class label, the first misclassified text and the first classification result.
Wherein the first penalty function is determined based on the first generated penalty of the original generator and the first classification penalty of the classification model to be attacked. The first generation penalty is determined based on the first original text and a first misclassified text obtained by inputting the first original text into the original generator. The first classification loss is determined based on a first classification result obtained by inputting the first misclassified text into the classification model to be attacked.
Step 405, iterating along the gradient decreasing direction of the first loss function to optimize the first device parameter of the original generator, so as to obtain an optimized target generator.
In one embodiment, step 402 includes: acquiring probability distribution data corresponding to each original sentence in the first original text, wherein the probability distribution data comprise the probability values of the alternative replacement words corresponding to each original word in the original sentence; and sampling the probability distribution data using a reparameterized sampling rule based on the Dirichlet distribution to obtain the first misclassified text corresponding to the first original text, wherein the first misclassified text contains the target replacement words corresponding to each original word in the first original text.
Further, the reparameterized sampling rule based on the Dirichlet distribution can be expressed as:

π̃ ~ Dir(σ(φ) + 1_V),  z = argmax_i(π̃_i)

wherein the vector z contains the target replacement words corresponding to each original word in the first original text, π̃ denotes the vector corresponding to the probability distribution data, Dir denotes the Dirichlet distribution, φ ∈ R^V denotes the output value of the initial generator, and argmax_i(π̃_i) takes the index i of the alternative replacement word with the maximum probability value as the sampling result.
According to the above embodiment, sampling the categorical distribution corresponding to the probability distribution data with the reparameterized sampling rule based on the Dirichlet distribution retains the gradient information of the categorical-distribution sampling process, so that the sampled information can conveniently be used for back propagation when training the model parameters.

In one embodiment, after the probability distribution data are sampled using the Dirichlet-based reparameterized sampling rule, the sampled probability distribution is smoothed to preserve the gradient information of the categorical-distribution sampling process, again facilitating back propagation with the sampled information when training the model parameters.
Specifically, a continuous softmax operator is used to replace the discrete argmax operator; after this change, the sampled vector z ∈ R^V can be represented by the following formula (4):

z = softmax(π̃ / τ)   (4)

wherein τ denotes a temperature parameter used to control the smoothness of the sampled probability distribution, z_i denotes the smoothed weight of the i-th alternative replacement word for an original word in the first misclassified text, and φ denotes the output value of the initial generator, which enters the sampling through the Dirichlet sample π̃ ~ Dir(σ(φ) + 1_V).
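A sketch of this smoothing step follows, based on the reconstruction of formula (4) above; the exact argument of the softmax is an assumption.

```python
import torch

# Sketch of formula (4): replace the discrete argmax of formula (2) with a
# temperature-controlled softmax so gradients can flow back to the generator.
def smooth_sample(pi: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # pi: (L, V) Dirichlet samples; smaller tau pushes z toward a one-hot vector.
    return torch.softmax(pi / tau, dim=-1)   # z in R^V per word, differentiable
```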
In one embodiment, as shown in fig. 5, the step 404 includes steps 501 to 503, where:
step 501, obtaining a first generation loss of an original generator based on a first original text and a first misclassified text.
Specifically, a first generation penalty of the original generator is calculated based on the cross entropy penalty function, the first original text, and the first misclassified text.
Step 502, obtaining a first classification loss corresponding to the classification model to be attacked based on the first classification result and the first class label.
Specifically, based on the cross entropy loss function, the first classification result and the first class label, calculating first classification loss corresponding to the classification model to be attacked.
Step 503, determining a first loss function corresponding to the original generator based on the first generated loss and the first classification loss.
The process in step 405 of iterating in the direction of the gradient descent of the first loss function to optimize the first device parameters of the original generator may be represented by the following formula (5):

grad_η = ∇_η [L_CE(f_θ(g_η(x_i)), y_t) + λ · L_CE(g_η(x_i), x_i)]   (5)

wherein L_CE denotes the cross-entropy loss, L_CE(f_θ(g_η(x_i)), y_t) denotes the first classification loss, L_CE(g_η(x_i), x_i) denotes the first generation loss, λ is a loss coefficient balancing the first classification loss against the first generation loss, f_θ(g_η(x_i)) denotes the first classification result corresponding to the first original text, and y_t denotes the first class label corresponding to the first original text.
In the above steps 501 to 503, the first loss function corresponding to the original generator is determined based on the first generation loss of the original generator in generating the first misclassified text and the first classification loss of the classification model to be attacked in classifying the first misclassified text. The first device parameters of the original generator can therefore be optimized along the direction that reduces both the generation loss of the original generator and the classification loss of the classification model to be attacked, so that the optimized target generator generates misclassified text of higher quality, improving the attack effect on the text classification model. In addition, because the first loss function contains the first generation loss, training the original generator by back-propagating gradients of the first loss function can effectively reduce the generation loss of the misclassified text generated by the original generator, further shortening the semantic distance between the generated misclassified text and the original text and yielding misclassified text of higher quality, thereby solving the technical problem of the low semantic quality of the misclassified text generated in the prior art.
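A sketch of the generator update implied by formula (5) and steps 404 to 405 follows; the tensor shapes, the default value of `lam`, and the optimizer setup are assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of formula (5): first classification loss toward the attack label y_t
# plus a lambda-weighted first generation (reconstruction) loss, then one
# Adam step on the generator parameters eta.
def generator_step(gen_logits, clf_logits, orig_ids, target_label,
                   optimizer, lam: float = 1.0):
    # gen_logits: (L, V) outputs of g_eta; clf_logits: (1, C) outputs of f_theta;
    # orig_ids: (L,) original word indices; target_label: (1,) attack label y_t.
    cls_loss = F.cross_entropy(clf_logits, target_label)   # L_CE(f(g(x)), y_t)
    gen_loss = F.cross_entropy(gen_logits, orig_ids)       # L_CE(g(x), x)
    loss = cls_loss + lam * gen_loss                       # first loss function
    optimizer.zero_grad()
    loss.backward()                                        # grad_eta
    optimizer.step()                                       # eta <- Adam(grad_eta)
    return loss.item()
```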
In one embodiment, as shown in fig. 6, the attack method of the text classification model provided by the present invention further includes:
step 601, obtaining a second classification result obtained by inputting a second misclassified text corresponding to the second original text into the classification model to be attacked, and a third classification result obtained by inputting a third original text into the classification model to be attacked.
Specifically, second training data and third training data are acquired. The second training data comprise a second original text and a second class label corresponding to the second original text, and the third training data comprise a third original text and a third class label corresponding to the third original text. The second training data may be represented as {(x_p, y_p)}, i.e., sample pairs formed by the second original text and its corresponding second class labels, and the third training data may be represented as {(x_c, y_c)}, i.e., sample pairs formed by the third original text and its corresponding third class labels.
The second original text is input into the original generator to obtain the second misclassified text corresponding to the second original text. The second misclassified text is input into the classification model to be attacked to obtain the second classification result corresponding to the second original text. The third original text is input into the classification model to be attacked to obtain the third classification result corresponding to the third original text.
Step 602, determining a second classification loss of the classification model to be attacked based on the second classification result and the second class label corresponding to the second original text.
Specifically, based on the cross entropy loss function, the second classification result and the second class label, the second classification loss of the classification model to be attacked is calculated.
And step 603, determining a third classification loss of the classification model to be attacked based on the third classification result and a third class label corresponding to the third original text.
Specifically, a third classification loss of the classification model to be attacked is calculated based on the cross entropy loss function, the third classification result and the third class label.
Step 604, determining a second loss function corresponding to the classification model to be attacked based on the second classification loss and the third classification loss, and updating the second device parameters of the classification model to be attacked along the gradient descending direction of the second loss function.
Specifically, when the second device parameter θ of the classification model to be attacked is updated along the gradient descending direction of the second loss function, the gradient of the second device parameter θ may be represented by the following formula (6):

grad_θ = ∇_θ [L_CE(f_θ(x_c), y_c) + L_CE(f_θ(g_η(x_p)), y_p)]   (6)

wherein grad_θ denotes the gradient of the second device parameter θ, L_CE denotes the cross-entropy loss, x_c denotes the third original text, f_θ(x_c) denotes the third classification result, y_c denotes the third class label, L_CE(f_θ(x_c), y_c) denotes the third classification loss, x_p denotes the second original text, g_η(x_p) denotes the second misclassified text, f_θ(g_η(x_p)) denotes the second classification result, y_p denotes the second class label, and L_CE(f_θ(g_η(x_p)), y_p) denotes the second classification loss.
Further, the process of updating the second device parameter θ of the classification model to be attacked based on the Adam optimizer may be represented by the following formula (7):
θ ← Adam(grad_θ)   (7)

wherein Adam denotes the optimizer, θ denotes the second device parameter of the classification model to be attacked, and grad_θ denotes the gradient of the second device parameter θ.
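A sketch of formulas (6) and (7) follows, assuming per-batch inputs of poisoned and clean samples in whatever form the hypothetical classifier accepts.

```python
import torch
import torch.nn.functional as F

# Sketch of formulas (6)-(7): second loss = classification loss on poisoned
# pairs (g(x_p), y_p) plus classification loss on clean pairs (x_c, y_c),
# followed by one Adam step on the classifier parameters theta.
def classifier_step(classifier, poisoned, y_p, clean, y_c, optimizer):
    # poisoned/clean: batched model inputs (token ids or embeddings, depending
    # on the assumed classifier); y_p/y_c: second and third class labels.
    loss = (F.cross_entropy(classifier(poisoned), y_p)   # L_CE(f(g(x_p)), y_p)
            + F.cross_entropy(classifier(clean), y_c))   # L_CE(f(x_c), y_c)
    optimizer.zero_grad()
    loss.backward()                                      # grad_theta, formula (6)
    optimizer.step()                                     # theta <- Adam(grad_theta)
    return loss.item()
```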
It should be noted that, when an adversarial attack is launched on the text classification model, only the first device parameters of the initial generator g_η need to be updated, without updating the second device parameters of the text classification model f_θ; that is, only steps 401 to 405 need to be performed, and steps 601 to 604 need not be performed.
When a backdoor attack is launched on the text classification model, the first device parameters of the initial generator g_η and the second device parameters of the classification model f_θ to be attacked need to be updated simultaneously; that is, steps 401 to 405 and steps 601 to 604 are both performed. As shown in fig. 7, the initial generator and the classification model to be attacked are jointly trained to optimize the generation loss and the classification loss, which effectively avoids the error accumulation caused by training the initial generator and the classification model to be attacked separately.
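The joint training of fig. 7 might look roughly as follows, reusing the generator_step and classifier_step sketches above; the data loader, the soft-embedding trick, and all shapes (one sentence per batch for simplicity) are assumptions.

```python
import torch
from torch.distributions import Dirichlet

# Sketch of the joint training in FIG. 7 for a backdoor attack: each batch
# performs one generator update (formula (5)) and one classifier update
# (formulas (6)-(7)). The loader yielding (x_p, y_p, x_c, y_c, y_t) tuples and
# the embedding matrix `embed` of shape (V, d) are hypothetical.
def joint_train(generator, classifier, embed, loader, gen_opt, clf_opt, epochs=3):
    for _ in range(epochs):
        for x_p, y_p, x_c, y_c, y_t in loader:
            gen_logits = generator(x_p)                              # (L, V)
            pi = Dirichlet(torch.sigmoid(gen_logits) + 1.0).rsample()
            z = torch.softmax(pi / 0.5, dim=-1)                      # formula (4)
            clf_logits = classifier(z @ embed)                       # soft embeddings, differentiable in eta
            generator_step(gen_logits, clf_logits, x_p, y_t, gen_opt)
            classifier_step(classifier, embed[z.argmax(-1).detach()], y_p,
                            embed[x_c], y_c, clf_opt)
```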
The attack method of the text classification model provided by the invention can realize both adversarial attacks and backdoor attacks on the text classification model. Since the original generator formed by training the mask language model directly launches the attack on the text classification model, the method has the advantages of high attack speed, high attack success rate, and higher quality of the generated misclassified samples.
In one embodiment, the misclassified samples output by the original generator are masked using an antonym masking strategy, which can be represented by the following formula (8):

g_η(x′_p) ← g_η(x′_p) W_Ant   (8)

wherein W_Ant denotes the matrix formed from the cosine similarities between the alternative replacement words and their corresponding original words, and g_η(x′_p) W_Ant denotes the misclassified text after the antonym masking operation.
Further, the gradient of the first device parameter η of the original generator may be represented by the following formula (9):

grad_η = ∇_η [L_CE(f_θ(g_η(x′_p)), y_t) + λ · L_CE(g_η(x′_p), x_p)]   (9)

wherein grad_η denotes the gradient of the first device parameter η, L_CE denotes the cross-entropy loss, L_CE(f_θ(g_η(x′_p)), y_t) denotes the first classification loss, L_CE(g_η(x′_p), x_p) denotes the first generation loss, λ is a loss coefficient balancing the first classification loss against the first generation loss, f_θ(g_η(x′_p)) denotes the first classification result corresponding to the first original text, and y_t denotes the first class label corresponding to the first original text.
Further, the process of updating the first device parameter η of the original generator based on the Adam optimizer may be represented by the following formula (10):
η ← Adam(grad_η)   (10)

wherein Adam denotes the optimizer, η denotes the first device parameter of the original generator, and grad_η denotes the gradient of the first device parameter η.
The attack device of the text classification model provided by the invention is described below, and the attack device of the text classification model described below and the attack method of the text classification model described above can be referred to correspondingly.
As shown in fig. 8, the present invention provides an attack apparatus of a text classification model, the attack apparatus 100 of the text classification model including:
the data acquisition module 101 is configured to acquire a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model.
The misclassification generating module 102 is configured to input a text to be classified into the target generator, and obtain a target misclassification text corresponding to the text to be classified.
The model attack module 103 is configured to output the target misclassification text to the to-be-attacked classification model, so as to obtain a target misclassification result corresponding to the to-be-classified text.
In one embodiment, the misclassification generating module 102 is further configured to obtain probability distribution data corresponding to each original sentence in the text to be classified, where the probability distribution data include the probability values of the alternative replacement words corresponding to each original word in the original sentence; sample the probability distribution data based on a preset resampling rule to obtain the target replacement words corresponding to each original word in the text to be classified; and obtain the target misclassified text corresponding to the text to be classified based on the target replacement word corresponding to each original word in the text to be classified.
In one embodiment, the misclassification generation module 102 is further configured to detect, for each alternative replacement word in the probability distribution data, whether the alternative replacement word has opposite semantics to its corresponding original word; and in the case that the alternative replacement word and the corresponding original word have opposite semantics, assigning the probability value corresponding to the alternative replacement word to be zero.
In one embodiment, the attack device 100 of the text classification model further includes a generator training module, configured to obtain first training data, where the first training data include a first original text and a first class label corresponding to the first original text; input the first original text into a pre-built original generator to obtain a first misclassified text corresponding to the first original text; input the first misclassified text into the classification model to be attacked to obtain a first classification result corresponding to the first original text; acquire a first loss function corresponding to the original generator based on the first original text, the first class label, the first misclassified text, and the first classification result; and iterate along the gradient descending direction of the first loss function to optimize the first device parameters of the original generator and obtain the optimized target generator.
In one embodiment, the generator training module is further configured to obtain a first generation loss of the original generator based on the first original text and the first misclassified text; acquiring a first classification loss corresponding to the classification model to be attacked based on the first classification result and the first class label; based on the first generated loss and the first classification loss, a first loss function corresponding to the original generator is determined.
In one embodiment, the attack device 100 of the text classification model further includes a classification training module, configured to obtain a second classification result obtained by inputting a second misclassified text corresponding to the second original text into the to-be-attacked classification model, and a third classification result obtained by inputting a third original text into the to-be-attacked classification model; determining a second classification loss of the classification model to be attacked based on the second classification result and a second class label corresponding to the second original text; determining a third classification loss of the classification model to be attacked based on a third classification result and a third class label corresponding to a third original text; and determining a second loss function corresponding to the classification model to be attacked based on the second classification loss and the third classification loss, and updating second equipment parameters of the classification model to be attacked along the gradient descending direction of the second loss function.
Fig. 9 illustrates a physical schematic diagram of an electronic device, as shown in fig. 9, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. Processor 810 may invoke logic instructions in memory 830 to perform the method of attack of the text classification model provided by the methods described above, the method comprising: acquiring a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model; inputting the text to be classified into a target generator to obtain a target misclassified text corresponding to the text to be classified; and outputting the target misclassification text to the to-be-attacked classification model to obtain a target misclassification result corresponding to the to-be-classified text.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of attacking a text classification model provided by the above methods, the method comprising: acquiring a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model; inputting the text to be classified into a target generator to obtain a target misclassified text corresponding to the text to be classified; and outputting the target misclassification text to the to-be-attacked classification model to obtain a target misclassification result corresponding to the to-be-classified text.
In yet another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of executing a method of attacking a text classification model provided by the methods described above, the method comprising: acquiring a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model; inputting the text to be classified into a target generator to obtain a target misclassified text corresponding to the text to be classified; and outputting the target misclassification text to the to-be-attacked classification model to obtain a target misclassification result corresponding to the to-be-classified text.
The apparatus embodiments described above are merely illustrative, wherein elements illustrated as separate elements may or may not be physically separate, and elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the various embodiments or methods of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An attack method of a text classification model, comprising:
acquiring a text to be classified and a pre-trained target generator, wherein the target generator is obtained by training an original generator based on first training data, and the original generator is constructed based on a mask language model;
inputting the text to be classified into the target generator to obtain a target misclassified text corresponding to the text to be classified;
and outputting the target misclassification text to a classification model to be attacked to obtain a target misclassification result corresponding to the text to be classified.
2. The attack method of the text classification model according to claim 1, wherein the inputting the text to be classified into the target generator to obtain the target misclassified text corresponding to the text to be classified includes:
Acquiring probability distribution data corresponding to each original sentence in the text to be classified, wherein the probability distribution data comprises probability values of alternative substitute words corresponding to each original word in the original sentence;
sampling the probability distribution data based on a preset resampling rule to obtain target substitute words corresponding to each original word in the text to be classified;
and obtaining the target misclassified text corresponding to the text to be classified based on the target replacement word corresponding to each original word in the text to be classified.
3. The method for attacking a text classification model according to claim 2, wherein after said obtaining probability distribution data corresponding to each original sentence in said text to be classified, said method further comprises:
for each alternative replacement word in the probability distribution data, detecting whether the alternative replacement word and the corresponding original word have opposite semantics;
and under the condition that the alternative replacement word and the corresponding original word have opposite semantics, assigning the probability value corresponding to the alternative replacement word to be zero.
4. A method of attacking a text classification model according to any one of claims 1 to 3 wherein the step of obtaining said target generator comprises:
Acquiring first training data, wherein the first training data comprises a first original text and a first class label corresponding to the first original text;
inputting the first original text into a pre-built original generator to obtain a first misclassified text corresponding to the first original text;
inputting the first misclassified text into the to-be-attacked classification model to obtain a first classification result corresponding to the first original text;
acquiring a first loss function corresponding to the original generator based on the first original text, the first class label, the first misclassified text and the first classification result;
and iterating along the gradient descending direction of the first loss function so as to optimize the first equipment parameter of the original generator and obtain an optimized target generator.
5. The method for attacking a text classification model of claim 4 wherein said obtaining a first loss function corresponding to said original generator based on said first original text, said first class label, said first misclassified text, and said first classification result comprises:
acquiring a first generation loss of the original generator based on the first original text and the first misclassified text;
Acquiring a first classification loss corresponding to the classification model to be attacked based on the first classification result and the first class label;
a first loss function corresponding to the original generator is determined based on the first generated loss and the first classification loss.
6. The method of attack of a text classification model of claim 5, further comprising:
obtaining a second classification result obtained by inputting a second misclassified text corresponding to a second original text into the classification model to be attacked, and a third classification result obtained by inputting a third original text into the classification model to be attacked;
determining a second classification loss of the classification model to be attacked based on the second classification result and a second class label corresponding to the second original text;
determining a third classification loss of the classification model to be attacked based on the third classification result and a third class label corresponding to the third original text;
and determining a second loss function corresponding to the to-be-attacked classification model based on the second classification loss and the third classification loss, and updating second equipment parameters of the to-be-attacked classification model along the gradient descending direction of the second loss function.
7. An attack apparatus for a text classification model, comprising:
the data acquisition module is used for acquiring texts to be classified and pre-trained target generators, the target generators are obtained by training original generators based on first training data, and the original generators are constructed based on a mask language model;
the error division generating module is used for inputting the text to be classified into the target generator to obtain a target error division text corresponding to the text to be classified;
and the model attack module is used for outputting the target misclassification text to a to-be-attacked classification model to obtain a target misclassification result corresponding to the to-be-classified text.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements an attack method of a text classification model according to any of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements a method of attacking a text classification model according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements a method of attack of a text classification model according to any of claims 1 to 6.
CN202211462734.2A 2022-11-21 2022-11-21 Attack method and device of text classification model, electronic equipment and storage medium Pending CN116089868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211462734.2A CN116089868A (en) 2022-11-21 2022-11-21 Attack method and device of text classification model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211462734.2A CN116089868A (en) 2022-11-21 2022-11-21 Attack method and device of text classification model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116089868A true CN116089868A (en) 2023-05-09

Family

ID=86211007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211462734.2A Pending CN116089868A (en) 2022-11-21 2022-11-21 Attack method and device of text classification model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116089868A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination