CN114547670A - Sensitive text desensitization method using differential privacy word embedding disturbance
- Publication number: CN114547670A (application CN202210039857.9A)
- Authority: CN (China)
- Prior art keywords: sensitive, word, words, text, desensitization
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The invention relates to a sensitive text desensitization method using differential privacy word embedding perturbation, and belongs to the technical field of differential privacy protection. First, sensitive words in a text are identified with named entity recognition, and non-sensitive words are randomly sampled from a corpus. Second, differential privacy noise is added to each sensitive word's embedding vector to generate a new perturbed word embedding vector. The Euclidean distance between the perturbed embedding and the embeddings of the non-sensitive words is then measured, and a candidate word set is obtained with a nearest-neighbor formula. Finally, words from the candidate set replace all sensitive words in the text according to a multi-unit auction probability formula, and the desensitized text is output. The method was tested on multiple corpora; the results show that it achieves a good desensitization effect on a variety of texts, with good universality and transferability.
Description
Technical Field
The invention relates to a sensitive text desensitization method using differential privacy word embedding perturbation, and belongs to the technical field of differential privacy protection.
Background
With the rapid development of networks, social networking sites and platforms have become part of everyday life. Users' discussions on these platforms generate massive amounts of text data containing privacy attributes such as social relationships and physical condition. When a platform hands this data to a third party for processing, an attacker can steal users' private information by combining means such as background knowledge attacks and then launch targeted illegal actions such as phishing. This seriously harms users' interests, and the leakage of user privacy also damages the platform's reputation and greatly reduces users' trust in it. How to further protect user data has therefore become a pressing problem for platforms. Data desensitization, which can reliably protect sensitive data, has become a common means for platforms to process user data; desensitization methods fall mainly into traditional desensitization techniques, anonymization techniques, deep learning desensitization techniques, and differential privacy desensitization techniques.
1. Conventional desensitization techniques
Traditional desensitization techniques transform sensitive information with desensitization rules such as replacement, suppression, and generalization, but they identify sensitive information mainly by rule matching and have difficulty handling semantically rich text data.
2. Anonymization techniques
Anonymization techniques mainly include K-Anonymity, L-Diversity, and T-Closeness. They transform the attributes of tabular data so that an attacker cannot identify a specific individual through linkage attacks and similar means, but they target structured tabular data and have difficulty handling complex information such as the contextual semantics of unstructured text.
3. Deep learning desensitization techniques
Deep learning desensitization techniques mainly include named entity recognition and generative adversarial networks: named entity recognition identifies sensitive information, which is then desensitized using expert experience, while generative adversarial networks produce synthetic data distributed similarly to the original data; however, both methods have difficulty providing reasonable, quantifiable privacy metrics.
4. Differential privacy desensitization techniques
The main differential privacy desensitization method for text data is d_χ-differential privacy (d_χ-privacy), which perturbs the word embedding vector of an input word with differential privacy noise and selects the nearest neighbor word in the word vector space as a replacement. However, d_χ-privacy adds too little noise in regions where word vectors are sparse, so an initially sensitive word has a high probability of not being replaced.
In conclusion, traditional desensitization and anonymization techniques do not fully consider the complexity of semantic information, deep learning desensitization techniques have difficulty providing reasonable, quantifiable privacy metrics, and differential privacy desensitization techniques add too little noise in regions where word vectors are sparse.
Disclosure of Invention
The invention aims to provide a sensitive text desensitization method using differential privacy word embedding perturbation that addresses the following problems: traditional desensitization and anonymization techniques do not fully consider the complexity of semantic information, deep learning desensitization techniques have difficulty providing reasonable, quantifiable privacy metrics, and differential privacy desensitization techniques add too little noise in regions where word vectors are sparse.
The design principle of the invention is as follows: first, a BERT-CRF model identifies the sensitive words C_n in the text, words are randomly sampled from the English corpus for each sensitive label, and the sampled words form the non-sensitive word set C_n'; second, differential privacy noise is added to each sensitive word's embedding vector to generate a new perturbed word embedding vector; then the Euclidean distance between the perturbed embedding and the embeddings of the non-sensitive words is measured, and a candidate word set is obtained with the nearest-neighbor formula f_1; finally, according to the multi-unit auction probability formula f_2, words from the candidate set replace the sensitive words, and once all sensitive words C_n in the text have been replaced, the desensitized text is output.
The technical scheme of the invention is realized by the following steps:
Step 1, identifying sensitive information in an English text using a named entity recognition technique to obtain a sensitive word set, and sampling an English corpus to obtain a non-sensitive word set;
Step 1.1, inputting an English text d and splitting d into a sequence of N_s sentences composed of words;
Step 1.2, inputting the N_s sentences sequentially into a named entity recognition model to obtain predicted labels for all words in the text d, which are collected into a label set;
Step 1.3, using the sensitive label set, marking the words in the text d that carry the same labels to obtain the set C_n of t_n sensitive words; labeling the words of the English corpus with the sensitive label set and randomly sampling, with equal probability, words for each label; the sampled words form the non-sensitive word set C_n';
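A minimal sketch of the equal-probability sampling in step 1.3, assuming the corpus labels are available as a word-to-label mapping; the function and variable names here are illustrative, not from the patent, and the per-label sample size uses claim 2's ⌈α · count⌉ rule:

```python
import math
import random
from collections import defaultdict

def build_nonsensitive_set(corpus_labels, sensitive_labels, alpha=0.5, rng=None):
    # corpus_labels: dict mapping each corpus word to its predicted label
    # alpha: custom percentage of each label's words to sample (claim 2)
    rng = rng or random.Random(0)
    by_label = defaultdict(list)
    for word, label in corpus_labels.items():
        if label in sensitive_labels:
            by_label[label].append(word)
    nonsensitive = set()
    for label, words in by_label.items():
        k = math.ceil(alpha * len(words))          # ceil(alpha * count(.))
        nonsensitive.update(rng.sample(words, k))  # equal-probability sampling
    return nonsensitive
```

Sampling per label (rather than from the whole corpus at once) keeps every sensitive category represented in the replacement pool.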
Step 2, generating a disturbing word embedding vector of the sensitive word by adopting a differential privacy noise disturbing method;
step 2.1, embedding C into model by using wordsnConverting the words in the Chinese into word embedding vectors;
step 2.2, sampling Laplace noise to obtain a noise vector meeting epsilon-difference privacy;
step 2.3, adding the word embedding vector and the noise vector to obtain a disturbing word embedding vector;
Step 3, replacing the sensitive words with non-sensitive words according to a multi-unit auction probability formula based on the Euclidean distance measure;
Step 3.1, using the nearest-neighbor formula f_1 to select the m non-sensitive words closest to the perturbed embedding phi_hat in the word vector space, forming the candidate word set:

f_1: w_hat = argmin_{w ∈ C_n'} ||phi(w) − phi_hat||_2

where ||·||_2 is the Euclidean norm, used to measure the distance between two word embedding vectors; the number of words m ∈ {3, 4, …, 9} is a custom variable, and phi_hat is the perturbed embedding of the input sensitive word. f_1 first selects from C_n' the word w_hat_1 that minimizes ||phi(w) − phi_hat||_2; w_hat_1 is then removed from C_n' and the word minimizing the distance is selected again, iterating until m words have been found to form the candidate word set;
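The argmin-and-remove iteration above is equivalent to taking the m smallest distances at once; a plain NumPy sketch (the embedding lookup and names are assumptions for illustration):

```python
import numpy as np

def candidate_words(phi_hat, nonsensitive_vecs, m=5):
    # nonsensitive_vecs: dict mapping each non-sensitive word to its embedding
    # Returns the m words whose embeddings are nearest to the perturbed
    # embedding phi_hat under the Euclidean norm, nearest first.
    words = list(nonsensitive_vecs)
    dists = np.array([np.linalg.norm(nonsensitive_vecs[w] - phi_hat)
                      for w in words])
    nearest = np.argsort(dists)[:m]  # same result as m rounds of argmin-and-remove
    return [words[i] for i in nearest]
```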
Step 3.2, according to the probability formula f of multi-unit auction2Selecting a set of candidate wordsWord in (1)Replacing sensitive words
Wherein xi is the user-defined probability as the epsilon (0.5, 1),has a prior probability of pix1, 2,.. times.m, and p when x is 1, 2,. times.m-1ixThe first term is xi, the common ratio is (1-xi) and the equal ratio sequence, when x is m,a priori probability p ofim=(1-ξ)m-1;
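The geometric prior described above telescopes to exactly 1, so it can be used directly as a sampling distribution. A sketch with illustrative names:

```python
import random

def auction_probabilities(m, xi):
    # p_x = xi * (1 - xi)**(x - 1) for x = 1..m-1, and p_m = (1 - xi)**(m - 1);
    # the geometric series telescopes, so the probabilities sum to 1.
    probs = [xi * (1 - xi) ** (x - 1) for x in range(1, m)]
    probs.append((1 - xi) ** (m - 1))
    return probs

def pick_replacement(candidates, xi=0.8, rng=None):
    # candidates are ordered nearest-first, so the closest word is most likely
    rng = rng or random.Random()
    probs = auction_probabilities(len(candidates), xi)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

With the experiment's ξ = 0.8 and m = 5, the priors are 0.8, 0.16, 0.032, 0.0064, and 0.0016.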
Compared with traditional desensitization, anonymization, and deep learning desensitization techniques, the disclosed method combines word embeddings with differential privacy noise: word embedding vectors express the complex semantic information in the text, and the differential privacy budget ε quantifies the degree of privacy protection, solving the problems that the complexity of semantic information is not fully considered and that reasonable, quantifiable privacy metrics are hard to provide.
Compared with the d_χ-differential privacy method, the invention uses the nearest-neighbor formula f_1 to extract multiple candidate words from the non-sensitive word set and replaces the sensitive word with a candidate word according to the multi-unit auction probability formula f_2, solving the problem that the original word is output because too little noise is added in regions where word vectors are sparse.
Drawings
FIG. 1 is a schematic diagram of a sensitive text desensitization method using differential privacy word embedding perturbation according to the present invention.
Detailed Description
In order to better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.
The experimental data come from foreign text corpora, including the movie review corpus IMDb Movie Reviews, a Twitter corpus, and a Wikipedia corpus. The data for the sensitive text desensitization experiments using differential privacy word embedding perturbation are shown in Table 1.
TABLE 1 sensitive text desensitization experimental data using differential privacy word embedding perturbation
The movie review corpus and the Twitter corpus are each divided into a training set and a test set at a ratio of 9:1; because the Wikipedia corpus is used only to compose the non-sensitive word set, it is not split into training and test sets.
In the experiments, the word embedding vocabulary size is set to 400,000, the word embedding dimension to 300, the number of sentences per input text to at most 12, and the maximum length of a single sentence to 100. The loss function of the BERT model is the cross-entropy

Loss = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} y_{ij} · log(p_{ij})

where i denotes the i-th sample in each batch (n samples in total), j denotes the j-th position in the sample (m positions in total), y is the true class label, and p is the probability predicted by the neural network. The BERT layers use the AdamW optimizer and the CRF layer uses the Adam optimizer; other parameters are the optimizers' defaults.
The tests use the decline in accuracy (Decline) to evaluate the desensitization of the text data. An open-source multi-layer perceptron binary classifier is pre-trained to output 1 when an input word is sensitive and 0 when it is non-sensitive, and the privacy protection effect is judged by counting the 0s and 1s. TP is the number of sensitive words classified as sensitive, FN the number of sensitive words classified as non-sensitive, FP the number of non-sensitive words classified as sensitive, and TN the number of non-sensitive words classified as non-sensitive. All words in the corpus text are input to the model and the 0 and 1 outputs are counted to obtain TP, FN, FP, and TN. The accuracy is computed as in formula 4 and the decline as in formula 5:
Accuracy = (TP + TN) / (TP + FN + FP + TN) (4)

Decline = Accuracy − Accuracy* (5)

Accuracy denotes the accuracy of the binary classifier on the initial dataset and Accuracy* its accuracy on the desensitized dataset; the larger the decline, the better the desensitization privacy protection. The classifier's accuracy is 90% on the initial IMDb corpus and 92% on the initial Twitter corpus.
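The two evaluation formulas are straightforward to compute from the four counts; a sketch (function names are illustrative):

```python
def accuracy(tp, fn, fp, tn):
    # formula (4): fraction of words the binary classifier labels correctly
    return (tp + tn) / (tp + fn + fp + tn)

def decline(acc_initial, acc_desensitized):
    # formula (5): accuracy drop; a larger decline means stronger desensitization
    return acc_initial - acc_desensitized
```

For example, using the Twitter figures quoted in the text, decline(0.92, 0.10) gives a decline of 0.82.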
The experiments are run on a computer and a server. Computer configuration: Intel i7-6700 CPU at 2.40 GHz, 4 GB of memory, Windows 7 64-bit. Server configuration: E7-4820 v4, 256 GB of RAM, Linux Ubuntu 64-bit.
The specific process of the experiment is as follows:
step 1, identifying sensitive information in an English text through a BERT-CRF model to obtain a sensitive word set, and sampling a Wikipedia corpus to obtain a non-sensitive word set.
Step 1.1, input a text d from the Twitter corpus and split d into N_s sentences using "\n", "?", "!", and "." as delimiters, where a "." that marks a word abbreviation is not treated as a sentence separator; then split each sentence s_r into N_w words using the space character as delimiter, obtaining the sentence-sequence representation of d.
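A minimal sketch of this splitting; for brevity it omits the patent's exception for abbreviation periods, and the function name is illustrative:

```python
import re

def split_text(d):
    # Split into sentences on "\n", "?", "!", "." (naive: abbreviation
    # periods are not special-cased here, unlike the patent's step 1.1).
    sentences = [s for s in re.split(r'[.?!\n]+', d) if s.strip()]
    # Split each sentence into its words on whitespace.
    return [s.split() for s in sentences]
```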
Step 1.2, input the sentence sequence into the BERT-CRF model sentence by sentence: add "[CLS]" and "[SEP]" at the beginning and end of each sentence s_r to mark the sentence boundaries, combine the Token Embedding, Segment Embedding, and Position Embedding of each word in s_r, and label each word with the "BIO" tagging scheme (B marks the beginning of an entity, I the inside of an entity, and O the outside); the N_L labels of the text d are collected into a label set.
Step 1.3, mark the words in d that carry sensitive labels to obtain a set C_n of n sensitive words, where the sensitive label set is custom-defined. Label the Wikipedia corpus with the sensitive labels to obtain a word set, and for each label i*, i* = 1, 2, …, 12, randomly sample words with equal probability; the sampled words form the non-sensitive word set C_n' of n' words. The count(·) function gives the number of words in a set, and the ⌈·⌉ function rounds up.
Step 2, generate a perturbed word embedding vector for each sensitive word using the differential privacy noise perturbation method.
Step 2.1, use the word embedding model BERT to map each word in C_n to its embedding, obtaining for a sensitive word w the word embedding vector phi(w) ∈ W ⊆ R^n, where W is the word vector space and R^n is the n-dimensional word embedding space; n = 300 in the experiment.
Step 2.2, sample a random vector v = [v_1, …, v_n] from a multivariate normal distribution and normalize it to unit length; then sample a scalar γ from the distribution below to obtain the noise vector N = γv satisfying ε-differential privacy. The multivariate normal density is

p(v) = exp(−(1/2)(v − μ)^T Σ^{−1}(v − μ)) / sqrt((2π)^n |Σ|)

where n is the word embedding dimension, the mean μ is centered at the origin, the covariance matrix Σ is the identity matrix, (·)^T is the matrix transpose, exp(·) is the exponential function with the natural constant e as base, π is the circle constant, and |·| is the matrix determinant. The magnitude γ is drawn from

p(γ) = ε^n γ^{n−1} e^{−εγ} / Γ(n)

where n is the word embedding dimension, e is the natural constant, and Γ(·) is the gamma function; ε = 25 in the experiment.
Step 2.3, add the word embedding vector and the noise vector to obtain the perturbed word embedding vector phi_hat = phi(w) + N.
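Steps 2.2 and 2.3 can be sketched as follows, assuming the standard sampling recipe for this kind of multivariate noise (a unit direction from a normalized Gaussian times a gamma-distributed magnitude); names and defaults mirror the experiment's n = 300 and ε = 25 but are otherwise illustrative:

```python
import numpy as np

def perturb_embedding(phi, eps=25.0, rng=None):
    # phi: word embedding vector of a sensitive word (shape (n,))
    rng = np.random.default_rng(rng)
    n = phi.shape[0]
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)                        # unit direction vector
    gamma = rng.gamma(shape=n, scale=1.0 / eps)   # magnitude ~ Gamma(n, 1/eps)
    return phi + gamma * v                        # perturbed word embedding
```

The expected noise magnitude is n/ε (12 for the experiment's settings), so smaller ε pushes the perturbed vector further from the original word.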
Step 3, replace the sensitive words with non-sensitive words according to the multi-unit auction probability formula based on the Euclidean distance measure.
Step 3.1, use the nearest-neighbor formula f_1 to select the m non-sensitive words closest to the perturbed embedding phi_hat in the word vector space, forming the candidate word set, with m = 5 in the experiment; ||·||_2 is the Euclidean norm used to measure the distance between two word embedding vectors. f_1 first selects from C_n' the word w_hat_1 that minimizes ||phi(w) − phi_hat||_2, removes w_hat_1 from C_n', then selects the next closest word, iterating until 5 words have been found to form the candidate word set.
Step 3.2, according to the multi-unit auction probability formula f_2, select a word w_hat_x from the candidate word set to replace the sensitive word, with ξ = 0.8 in the experiment; w_hat_x has prior probability p_x; for x = 1, 2, 3, 4, the p_x form a geometric sequence with first term 0.8 and common ratio 0.2, and for x = 5 the prior probability is p_5 = 0.2^4 = 0.0016.
Test results: the method reduces the accuracy of the binary classifier on the desensitized IMDb corpus to 12% (a decline of 78%) and on the desensitized Twitter corpus to 10% (a decline of 82%), showing that this differential privacy text desensitization method achieves a good effect.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (3)
1. A method for desensitizing sensitive text to perturbations using differential privacy word embedding, the method comprising the steps of:
Step 1, identifying sensitive information in an English text using a named entity recognition technique to obtain a sensitive word set, and sampling an English corpus to obtain a non-sensitive word set;
Step 1.1, inputting an English text d and splitting d into a sequence of N_s sentences composed of words;
Step 1.2, inputting the N_s sentences sequentially into a named entity recognition model to obtain predicted labels for all words in the text d, which are collected into a label set;
Step 1.3, using the sensitive label set, marking the words in the text d that carry the same labels to obtain the set C_n of t_n sensitive words; labeling the words of the English corpus with the sensitive label set and randomly sampling, with equal probability, words for each label; the sampled words form the non-sensitive word set C_n';
Step 2, generating a disturbing word embedding vector of the sensitive word by adopting a differential privacy noise disturbing method;
step 2.1, embedding C into model by using wordsnConverting the words in the Chinese into word embedding vectors;
step 2.2, sampling Laplace noise to obtain a noise vector meeting epsilon-difference privacy;
step 2.3, adding the word embedding vector and the noise vector to obtain a disturbing word embedding vector;
Step 3, replacing the sensitive words with non-sensitive words according to a multi-unit auction probability formula based on the Euclidean distance measure;
Step 3.1, using the nearest-neighbor formula f_1 to select the m non-sensitive words closest to the perturbed embedding phi_hat in the word vector space, forming the candidate word set, where ||·||_2 is the Euclidean norm used to measure the distance between two word embedding vectors, the number of words m ∈ {3, 4, …, 9} is a custom variable, and phi_hat is the perturbed embedding of the input sensitive word; f_1 first selects from C_n' the word w_hat_1 minimizing ||phi(w) − phi_hat||_2, removes w_hat_1 from C_n', then selects the next closest word, iterating until m words have been found to form the candidate word set;
Step 3.2, according to the multi-unit auction probability formula f_2, selecting a word w_hat_x from the candidate word set to replace the sensitive word.
2. The method of sensitive text desensitization using differential privacy word embedding perturbation according to claim 1, characterized in that: the sensitive label set in step 1.3 contains custom sensitive labels; each label is marked in the English corpus to obtain a word set, and ⌈α · count(·)⌉ words are sampled from the set with equal probability, where α ∈ (0, 1] is a custom percentage, the count(·) function gives the number of words in a set, and the ⌈·⌉ function rounds up.
3. The method of sensitive text desensitization using differential privacy word embedding perturbation according to claim 1, characterized in that: the multi-unit auction in step 3.2 means that the goods are first offered to the highest bidder, who receives the requested unit quantity, then to the second-highest bidder, and so on until the goods are exhausted; the probability formula f_2 is originally devised on this basis. f_2 selects a candidate word w_hat_x to replace the sensitive word, where w_hat_x has prior probability p_x, x = 1, 2, …, m; for x = 1, 2, …, m−1 the p_x form a geometric sequence with first term ξ and common ratio (1−ξ), and for x = m the prior probability is p_m = (1−ξ)^(m−1).
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210039857.9A CN114547670A (en) | 2022-01-14 | 2022-01-14 | Sensitive text desensitization method using differential privacy word embedding disturbance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114547670A true CN114547670A (en) | 2022-05-27 |
Family
ID=81672066
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115935405A (en) * | 2022-09-15 | 2023-04-07 | 广州大学 | Text content protection method based on differential privacy |
CN115952854A (en) * | 2023-03-14 | 2023-04-11 | 杭州太美星程医药科技有限公司 | Training method of text desensitization model, text desensitization method and application |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||