CN114547670A - Sensitive text desensitization method using differential privacy word embedding disturbance
- Publication number: CN114547670A (application CN202210039857.9A)
- Authority: CN (China)
- Prior art keywords: sensitive, word, words, text, desensitization
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The invention relates to a sensitive text desensitization method using differential privacy word embedding perturbation, and belongs to the technical field of differential privacy protection. First, sensitive words in a text are identified with named entity recognition, and non-sensitive words are randomly sampled from a corpus. Second, differential privacy noise is added to each sensitive word's embedding vector to generate a new perturbed word embedding vector. The Euclidean distance between the perturbed embedding and the embeddings of the non-sensitive words is then measured, and a candidate word set is obtained with a nearest-neighbor formula. Finally, words from the candidate set replace all sensitive words in the text according to a multi-unit auction probability formula, and the desensitized text is output. The method was tested on multiple corpora; the results show that it achieves a good desensitization effect on a variety of texts, with good universality and transferability.
Description
Technical Field
The invention relates to a sensitive text desensitization method using differential privacy word embedding perturbation, and belongs to the technical field of differential privacy protection.
Background
With the rapid development of networks, social networking sites and platforms have become part of everyday life. Users' discussions on these platforms generate massive amounts of text data containing privacy attributes such as social relationships and physical condition. When a platform hands this data to a third party for processing, an attacker can steal users' private information by combining means such as background knowledge attacks and then launch targeted illegal actions such as phishing. This seriously harms users' interests, and the leakage of user privacy also damages the platform's reputation and greatly reduces users' trust in it. How to further protect user data has therefore become a pressing problem for platforms. Data desensitization, which can reliably protect sensitive data, has become a common means for platforms to process user data; desensitization methods fall mainly into traditional desensitization techniques, anonymization techniques, deep learning desensitization techniques, and differential privacy desensitization techniques.
1. Conventional desensitization techniques
Traditional desensitization techniques transform sensitive information with desensitization rules such as replacement, suppression, and generalization, but they identify sensitive information mainly by rule matching and have difficulty handling semantically rich text data.
2. Anonymization techniques
Anonymization techniques mainly include K-Anonymity, L-Diversity, and T-Closeness. They transform the attributes of tabular data so that an attacker cannot identify a specific individual through linkage attacks and similar means, but they target structured tabular data and have difficulty handling complex information such as the contextual semantics of unstructured text.
3. Deep learning desensitization techniques
Deep learning desensitization techniques mainly include named entity recognition and generative adversarial networks: named entity recognition identifies sensitive information, which is then desensitized using expert experience, while generative adversarial networks produce synthetic data distributed similarly to the original data; however, both methods have difficulty providing reasonable, quantifiable privacy metrics.
4. Differential privacy desensitization techniques
The main differential privacy desensitization method for text data is d_χ-differential privacy (d_χ-privacy), which perturbs the word embedding vector of an input word with differential privacy noise and selects the nearest neighbor word in the word vector space as a replacement. However, d_χ-privacy adds too little noise in regions where word vectors are sparse, so an initially sensitive word has a high probability of not being replaced.
In conclusion, traditional desensitization and anonymization techniques do not fully consider the complexity of semantic information, deep learning desensitization techniques have difficulty providing reasonable, quantifiable privacy metrics, and differential privacy desensitization techniques add too little noise in regions where word vectors are sparse.
Disclosure of Invention
The invention aims to provide a sensitive text desensitization method using differential privacy word embedding perturbation that addresses the following problems: traditional desensitization and anonymization techniques do not fully consider the complexity of semantic information, deep learning desensitization techniques have difficulty providing reasonable, quantifiable privacy metrics, and differential privacy desensitization techniques add too little noise in regions where word vectors are sparse.
The design principle of the invention is as follows: first, a BERT-CRF model identifies the sensitive words C_n in the text, words are randomly sampled from the English corpus for each sensitive label, and the sampled words form the non-sensitive word set C_n'; second, differential privacy noise is added to each sensitive word's embedding vector to generate a new perturbed word embedding vector; then the Euclidean distance between the perturbed embedding and the embeddings of the non-sensitive words is measured, and a candidate word set is obtained with the nearest-neighbor formula f_1; finally, according to the multi-unit auction probability formula f_2, words from the candidate set replace the sensitive words, and once all sensitive words C_n in the text have been replaced, the desensitized text is output.
The technical scheme of the invention is realized by the following steps:
Step 1, identifying sensitive information in an English text using a named entity recognition technique to obtain a sensitive word set, and sampling an English corpus to obtain a non-sensitive word set;
Step 1.1, inputting an English text d and splitting d into a sequence of N_s sentences composed of words;
Step 1.2, inputting the N_s sentences sequentially into a named entity recognition model to obtain predicted labels for all words in the text d, which are collected into a label set;
Step 1.3, using the sensitive label set, marking the words in the text d that carry the same labels to obtain the set C_n of t_n sensitive words; labeling the words of the English corpus with the sensitive label set and randomly sampling, with equal probability, words for each label; the sampled words form the non-sensitive word set C_n';
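A minimal sketch of the equal-probability sampling in step 1.3, assuming the corpus labels are available as a word-to-label mapping; the function and variable names here are illustrative, not from the patent, and the per-label sample size uses claim 2's ⌈α · count⌉ rule:

```python
import math
import random
from collections import defaultdict

def build_nonsensitive_set(corpus_labels, sensitive_labels, alpha=0.5, rng=None):
    # corpus_labels: dict mapping each corpus word to its predicted label
    # alpha: custom percentage of each label's words to sample (claim 2)
    rng = rng or random.Random(0)
    by_label = defaultdict(list)
    for word, label in corpus_labels.items():
        if label in sensitive_labels:
            by_label[label].append(word)
    nonsensitive = set()
    for label, words in by_label.items():
        k = math.ceil(alpha * len(words))          # ceil(alpha * count(.))
        nonsensitive.update(rng.sample(words, k))  # equal-probability sampling
    return nonsensitive
```

Sampling per label (rather than from the whole corpus at once) keeps every sensitive category represented in the replacement pool.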
Step 2, generating a disturbing word embedding vector of the sensitive word by adopting a differential privacy noise disturbing method;
step 2.1, embedding C into model by using wordsnConverting the words in the Chinese into word embedding vectors;
step 2.2, sampling Laplace noise to obtain a noise vector meeting epsilon-difference privacy;
step 2.3, adding the word embedding vector and the noise vector to obtain a disturbing word embedding vector;
Step 3, replacing the sensitive words with non-sensitive words according to a multi-unit auction probability formula based on the Euclidean distance measure;
Step 3.1, using the nearest-neighbor formula f_1 to select the m non-sensitive words closest to the perturbed embedding phi_hat in the word vector space, forming the candidate word set:

f_1: w_hat = argmin_{w ∈ C_n'} ||phi(w) − phi_hat||_2

where ||·||_2 is the Euclidean norm, used to measure the distance between two word embedding vectors; the number of words m ∈ {3, 4, …, 9} is a custom variable, and phi_hat is the perturbed embedding of the input sensitive word. f_1 first selects from C_n' the word w_hat_1 that minimizes ||phi(w) − phi_hat||_2; w_hat_1 is then removed from C_n' and the word minimizing the distance is selected again, iterating until m words have been found to form the candidate word set;
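The argmin-and-remove iteration above is equivalent to taking the m smallest distances at once; a plain NumPy sketch (the embedding lookup and names are assumptions for illustration):

```python
import numpy as np

def candidate_words(phi_hat, nonsensitive_vecs, m=5):
    # nonsensitive_vecs: dict mapping each non-sensitive word to its embedding
    # Returns the m words whose embeddings are nearest to the perturbed
    # embedding phi_hat under the Euclidean norm, nearest first.
    words = list(nonsensitive_vecs)
    dists = np.array([np.linalg.norm(nonsensitive_vecs[w] - phi_hat)
                      for w in words])
    nearest = np.argsort(dists)[:m]  # same result as m rounds of argmin-and-remove
    return [words[i] for i in nearest]
```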
Step 3.2, according to the probability formula f of multi-unit auction2Selecting a set of candidate wordsWord in (1)Replacing sensitive words
Wherein xi is the user-defined probability as the epsilon (0.5, 1),has a prior probability of pix1, 2,.. times.m, and p when x is 1, 2,. times.m-1ixThe first term is xi, the common ratio is (1-xi) and the equal ratio sequence, when x is m,a priori probability p ofim=(1-ξ)m-1;
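The geometric prior described above telescopes to exactly 1, so it can be used directly as a sampling distribution. A sketch with illustrative names:

```python
import random

def auction_probabilities(m, xi):
    # p_x = xi * (1 - xi)**(x - 1) for x = 1..m-1, and p_m = (1 - xi)**(m - 1);
    # the geometric series telescopes, so the probabilities sum to 1.
    probs = [xi * (1 - xi) ** (x - 1) for x in range(1, m)]
    probs.append((1 - xi) ** (m - 1))
    return probs

def pick_replacement(candidates, xi=0.8, rng=None):
    # candidates are ordered nearest-first, so the closest word is most likely
    rng = rng or random.Random()
    probs = auction_probabilities(len(candidates), xi)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

With the experiment's ξ = 0.8 and m = 5, the priors are 0.8, 0.16, 0.032, 0.0064, and 0.0016.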
Compared with traditional desensitization, anonymization, and deep learning desensitization techniques, the disclosed method combines word embeddings with differential privacy noise: word embedding vectors express the complex semantic information in the text, and the differential privacy budget ε quantifies the degree of privacy protection, solving the problems that the complexity of semantic information is not fully considered and that reasonable, quantifiable privacy metrics are hard to provide.
Compared with the d_χ-differential privacy method, the invention uses the nearest-neighbor formula f_1 to extract multiple candidate words from the non-sensitive word set and replaces the sensitive word with a candidate word according to the multi-unit auction probability formula f_2, solving the problem that the original word is output because too little noise is added in regions where word vectors are sparse.
Drawings
FIG. 1 is a schematic diagram of a sensitive text desensitization method using differential privacy word embedding perturbation according to the present invention.
Detailed Description
In order to better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.
The experimental data come from foreign text corpora, including the movie review corpus IMDb Movie Reviews, a Twitter corpus, and a Wikipedia corpus. The data for the sensitive text desensitization experiments using differential privacy word embedding perturbation are shown in Table 1.
TABLE 1 sensitive text desensitization experimental data using differential privacy word embedding perturbation
The movie review corpus and the Twitter corpus are each divided into a training set and a test set at a ratio of 9:1; because the Wikipedia corpus is used only to compose the non-sensitive word set, it is not split into training and test sets.
In the experiments, the word embedding vocabulary size is set to 400,000, the word embedding dimension to 300, the number of sentences per input text to at most 12, and the maximum length of a single sentence to 100. The loss function of the BERT model is the cross-entropy

Loss = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} y_{ij} · log(p_{ij})

where i denotes the i-th sample in each batch (n samples in total), j denotes the j-th position in the sample (m positions in total), y is the true class label, and p is the probability predicted by the neural network. The BERT layers use the AdamW optimizer and the CRF layer uses the Adam optimizer; other parameters are the optimizers' defaults.
The tests use the decline in accuracy (Decline) to evaluate the desensitization of the text data. An open-source multi-layer perceptron binary classifier is pre-trained to output 1 when an input word is sensitive and 0 when it is non-sensitive, and the privacy protection effect is judged by counting the 0s and 1s. TP is the number of sensitive words classified as sensitive, FN the number of sensitive words classified as non-sensitive, FP the number of non-sensitive words classified as sensitive, and TN the number of non-sensitive words classified as non-sensitive. All words in the corpus text are input to the model and the 0 and 1 outputs are counted to obtain TP, FN, FP, and TN. The accuracy is computed as in formula 4 and the decline as in formula 5:
Accuracy = (TP + TN) / (TP + FN + FP + TN) (4)

Decline = Accuracy − Accuracy* (5)

Accuracy denotes the accuracy of the binary classifier on the initial dataset and Accuracy* its accuracy on the desensitized dataset; the larger the decline, the better the desensitization privacy protection. The classifier's accuracy is 90% on the initial IMDb corpus and 92% on the initial Twitter corpus.
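The two evaluation formulas are straightforward to compute from the four counts; a sketch (function names are illustrative):

```python
def accuracy(tp, fn, fp, tn):
    # formula (4): fraction of words the binary classifier labels correctly
    return (tp + tn) / (tp + fn + fp + tn)

def decline(acc_initial, acc_desensitized):
    # formula (5): accuracy drop; a larger decline means stronger desensitization
    return acc_initial - acc_desensitized
```

For example, using the Twitter figures quoted in the text, decline(0.92, 0.10) gives a decline of 0.82.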
The experiments are run on a computer and a server. Computer configuration: Intel i7-6700 CPU at 2.40 GHz, 4 GB of memory, Windows 7 64-bit. Server configuration: E7-4820 v4, 256 GB of RAM, Linux Ubuntu 64-bit.
The specific process of the experiment is as follows:
step 1, identifying sensitive information in an English text through a BERT-CRF model to obtain a sensitive word set, and sampling a Wikipedia corpus to obtain a non-sensitive word set.
Step 1.1, input a text d from the Twitter corpus and split d into N_s sentences using "\n", "?", "!", and "." as delimiters, where a "." that marks a word abbreviation is not treated as a sentence separator; then split each sentence s_r into N_w words using the space character as delimiter, obtaining the sentence-sequence representation of d.
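A minimal sketch of this splitting; for brevity it omits the patent's exception for abbreviation periods, and the function name is illustrative:

```python
import re

def split_text(d):
    # Split into sentences on "\n", "?", "!", "." (naive: abbreviation
    # periods are not special-cased here, unlike the patent's step 1.1).
    sentences = [s for s in re.split(r'[.?!\n]+', d) if s.strip()]
    # Split each sentence into its words on whitespace.
    return [s.split() for s in sentences]
```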
Step 1.2, input the sentence sequence into the BERT-CRF model sentence by sentence: add "[CLS]" and "[SEP]" at the beginning and end of each sentence s_r to mark the sentence boundaries, combine the Token Embedding, Segment Embedding, and Position Embedding of each word in s_r, and label each word with the "BIO" tagging scheme (B marks the beginning of an entity, I the inside of an entity, and O the outside); the N_L labels of the text d are collected into a label set.
Step 1.3, mark the words in d that carry sensitive labels to obtain a set C_n of n sensitive words, where the sensitive label set is custom-defined. Label the Wikipedia corpus with the sensitive labels to obtain a word set, and for each label i*, i* = 1, 2, …, 12, randomly sample words with equal probability; the sampled words form the non-sensitive word set C_n' of n' words. The count(·) function gives the number of words in a set, and the ⌈·⌉ function rounds up.
Step 2, generate a perturbed word embedding vector for each sensitive word using the differential privacy noise perturbation method.
Step 2.1, use the word embedding model BERT to map each word in C_n to its embedding, obtaining for a sensitive word w the word embedding vector phi(w) ∈ W ⊆ R^n, where W is the word vector space and R^n is the n-dimensional word embedding space; n = 300 in the experiment.
Step 2.2, sample a random vector v = [v_1, …, v_n] from a multivariate normal distribution and normalize it to unit length; then sample a scalar γ from the distribution below to obtain the noise vector N = γv satisfying ε-differential privacy. The multivariate normal density is

p(v) = exp(−(1/2)(v − μ)^T Σ^{−1}(v − μ)) / sqrt((2π)^n |Σ|)

where n is the word embedding dimension, the mean μ is centered at the origin, the covariance matrix Σ is the identity matrix, (·)^T is the matrix transpose, exp(·) is the exponential function with the natural constant e as base, π is the circle constant, and |·| is the matrix determinant. The magnitude γ is drawn from

p(γ) = ε^n γ^{n−1} e^{−εγ} / Γ(n)

where n is the word embedding dimension, e is the natural constant, and Γ(·) is the gamma function; ε = 25 in the experiment.
Step 2.3, add the word embedding vector and the noise vector to obtain the perturbed word embedding vector phi_hat = phi(w) + N.
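Steps 2.2 and 2.3 can be sketched as follows, assuming the standard sampling recipe for this kind of multivariate noise (a unit direction from a normalized Gaussian times a gamma-distributed magnitude); names and defaults mirror the experiment's n = 300 and ε = 25 but are otherwise illustrative:

```python
import numpy as np

def perturb_embedding(phi, eps=25.0, rng=None):
    # phi: word embedding vector of a sensitive word (shape (n,))
    rng = np.random.default_rng(rng)
    n = phi.shape[0]
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)                        # unit direction vector
    gamma = rng.gamma(shape=n, scale=1.0 / eps)   # magnitude ~ Gamma(n, 1/eps)
    return phi + gamma * v                        # perturbed word embedding
```

The expected noise magnitude is n/ε (12 for the experiment's settings), so smaller ε pushes the perturbed vector further from the original word.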
Step 3, replace the sensitive words with non-sensitive words according to the multi-unit auction probability formula based on the Euclidean distance measure.
Step 3.1, use the nearest-neighbor formula f_1 to select the m non-sensitive words closest to the perturbed embedding phi_hat in the word vector space, forming the candidate word set, with m = 5 in the experiment; ||·||_2 is the Euclidean norm used to measure the distance between two word embedding vectors. f_1 first selects from C_n' the word w_hat_1 that minimizes ||phi(w) − phi_hat||_2, removes w_hat_1 from C_n', then selects the next closest word, iterating until 5 words have been found to form the candidate word set.
Step 3.2, according to the multi-unit auction probability formula f_2, select a word w_hat_x from the candidate word set to replace the sensitive word, with ξ = 0.8 in the experiment; w_hat_x has prior probability p_x; for x = 1, 2, 3, 4, the p_x form a geometric sequence with first term 0.8 and common ratio 0.2, and for x = 5 the prior probability is p_5 = 0.2^4 = 0.0016.
Test results: the method reduces the accuracy of the binary classifier on the desensitized IMDb corpus to 12% (a decline of 78%) and on the desensitized Twitter corpus to 10% (a decline of 82%), showing that this differential privacy text desensitization method achieves a good effect.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (3)
1. A method for desensitizing sensitive text to perturbations using differential privacy word embedding, the method comprising the steps of:
Step 1, identifying sensitive information in an English text using a named entity recognition technique to obtain a sensitive word set, and sampling an English corpus to obtain a non-sensitive word set;
Step 1.1, inputting an English text d and splitting d into a sequence of N_s sentences composed of words;
Step 1.2, inputting the N_s sentences sequentially into a named entity recognition model to obtain predicted labels for all words in the text d, which are collected into a label set;
Step 1.3, using the sensitive label set, marking the words in the text d that carry the same labels to obtain the set C_n of t_n sensitive words; labeling the words of the English corpus with the sensitive label set and randomly sampling, with equal probability, words for each label; the sampled words form the non-sensitive word set C_n';
Step 2, generating a disturbing word embedding vector of the sensitive word by adopting a differential privacy noise disturbing method;
step 2.1, embedding C into model by using wordsnConverting the words in the Chinese into word embedding vectors;
step 2.2, sampling Laplace noise to obtain a noise vector meeting epsilon-difference privacy;
step 2.3, adding the word embedding vector and the noise vector to obtain a disturbing word embedding vector;
Step 3, replacing the sensitive words with non-sensitive words according to a multi-unit auction probability formula based on the Euclidean distance measure;
Step 3.1, using the nearest-neighbor formula f_1 to select the m non-sensitive words closest to the perturbed embedding phi_hat in the word vector space, forming the candidate word set, where ||·||_2 is the Euclidean norm used to measure the distance between two word embedding vectors, the number of words m ∈ {3, 4, …, 9} is a custom variable, and phi_hat is the perturbed embedding of the input sensitive word; f_1 first selects from C_n' the word w_hat_1 minimizing ||phi(w) − phi_hat||_2, removes w_hat_1 from C_n', then selects the next closest word, iterating until m words have been found to form the candidate word set;
Step 3.2, according to the multi-unit auction probability formula f_2, selecting a word w_hat_x from the candidate word set to replace the sensitive word.
2. The method of sensitive text desensitization using differential privacy word embedding perturbation according to claim 1, characterized in that: the sensitive label set in step 1.3 contains custom sensitive labels; each label is marked in the English corpus to obtain a word set, and ⌈α · count(·)⌉ words are sampled from the set with equal probability, where α ∈ (0, 1] is a custom percentage, the count(·) function gives the number of words in a set, and the ⌈·⌉ function rounds up.
3. The method of sensitive text desensitization using differential privacy word embedding perturbation according to claim 1, characterized in that: the multi-unit auction in step 3.2 means that the goods are first offered to the highest bidder, who receives the requested unit quantity, then to the second-highest bidder, and so on until the goods are exhausted; the probability formula f_2 is originally devised on this basis. f_2 selects a candidate word w_hat_x to replace the sensitive word, where w_hat_x has prior probability p_x, x = 1, 2, …, m; for x = 1, 2, …, m−1 the p_x form a geometric sequence with first term ξ and common ratio (1−ξ), and for x = m the prior probability is p_m = (1−ξ)^(m−1).
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210039857.9A CN114547670A (en) | 2022-01-14 | 2022-01-14 | Sensitive text desensitization method using differential privacy word embedding disturbance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114547670A true CN114547670A (en) | 2022-05-27 |
Family
ID=81672066
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115935405A (en) * | 2022-09-15 | 2023-04-07 | 广州大学 | Text content protection method based on differential privacy |
CN115952854A (en) * | 2023-03-14 | 2023-04-11 | 杭州太美星程医药科技有限公司 | Training method of text desensitization model, text desensitization method and application |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||