CN114547670A - Sensitive text desensitization method using differential privacy word embedding disturbance - Google Patents

Sensitive text desensitization method using differential privacy word embedding disturbance Download PDF

Info

Publication number
CN114547670A
CN114547670A · Application CN202210039857.9A
Authority
CN
China
Prior art keywords
sensitive
word
words
text
desensitization
Prior art date
Legal status
Pending
Application number
CN202210039857.9A
Other languages
Chinese (zh)
Inventor
罗森林
关业礼
潘丽敏
郜森
吴杭颐
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210039857.9A
Publication of CN114547670A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention relates to a sensitive text desensitization method using differential privacy word embedding perturbation, and belongs to the technical field of differential privacy protection. Firstly, sensitive words in a text are recognized with named entity recognition technology, and non-sensitive words are randomly sampled from a corpus; secondly, differential privacy noise is added to the word embedding vector of each sensitive word to generate a new perturbed embedding vector; then the Euclidean distance between the perturbed embedding and the embeddings of the non-sensitive words is measured, and a candidate word set is obtained with a nearest-neighbor formula; finally, every sensitive word in the text is replaced by a candidate word chosen according to a multi-unit auction probability formula, and the desensitized text is output. Tests on several corpora show that the method achieves a good desensitization effect on a variety of texts and has good universality and transferability.

Description

Sensitive text desensitization method using differential privacy word embedding disturbance
Technical Field
The invention relates to a sensitive text desensitization method using differential privacy word embedding perturbation and belongs to the technical field of differential privacy protection.
Background
With the rapid development of networks, social networking sites and platforms have entered people's daily lives, and users' discussions on these platforms generate massive text data containing various private attributes, such as social relationships and physical condition. When a platform hands such data to a third party for processing, an attacker can steal users' private information by combining means such as background-knowledge attacks and then launch targeted illegal activities such as phishing. This seriously harms users' interests, and the leakage of user privacy also damages the platform's reputation and greatly reduces users' trust in it. How to better protect users' data has therefore become an urgent problem for platforms. Because it provides reliable protection of sensitive data, data desensitization has become a common means of processing user data, and desensitization methods mainly fall into traditional desensitization techniques, anonymization techniques, deep learning desensitization techniques and differential privacy desensitization techniques.
1. Conventional desensitization techniques
Traditional desensitization techniques deform sensitive information through desensitization rules such as substitution, suppression and generalization, but they identify sensitive information mainly by rule matching and struggle with semantically rich text data.
2. Anonymization techniques
Anonymization techniques mainly comprise the K-Anonymity, L-Diversity and T-Closeness methods. They transform attributes in tabular data so that an attacker cannot identify a specific individual through linkage attacks and similar means, but they target structured tabular data and have difficulty handling the complex contextual semantics of unstructured text.
3. Deep learning desensitization techniques
Deep learning desensitization techniques mainly comprise named entity recognition and generative adversarial networks: named entity recognition identifies sensitive information, which is then desensitized using expert experience, while generative adversarial networks synthesize data whose distribution resembles the original data. However, neither method provides a reasonable, quantifiable privacy index.
4. Differential privacy desensitization techniques
The main differential privacy desensitization method for text data is d_χ-differential privacy (d_χ-privacy), which perturbs the word embedding vector of an input word with differential privacy noise and replaces the word with its nearest neighbor in the word vector space. However, d_χ-privacy adds too little noise in regions where word vectors are sparse, so an initially sensitive word remains unreplaced with high probability.
In conclusion, traditional desensitization and anonymization techniques do not fully account for the complexity of semantic information, deep learning desensitization techniques struggle to provide a reasonable quantifiable privacy index, and differential privacy desensitization adds too little noise in regions where word vectors are sparse.
Disclosure of Invention
The invention aims to provide a sensitive text desensitization method using differential privacy word embedding perturbation that addresses these problems: traditional desensitization and anonymization techniques do not fully consider semantic complexity, deep learning desensitization lacks a reasonable quantifiable privacy index, and differential privacy desensitization adds too little noise in regions with sparse word vectors.
The design principle of the invention is as follows: firstly, the sensitive words C_n in the text are identified with a BERT-CRF model, and for each sensitive label a number of words are randomly sampled from an English corpus, the sampled words forming the non-sensitive word set C_n'; secondly, differential privacy noise is added to the word embedding vector of each sensitive word to generate a new perturbed word embedding vector; then the Euclidean distance between the perturbed embedding and the embeddings of the non-sensitive words is measured, and the candidate word set is obtained with the nearest-neighbor formula f1; finally, according to the multi-unit auction probability formula f2, a word from the candidate set replaces the sensitive word, and once all sensitive words C_n in the text have been replaced, the desensitized text is output.
The technical scheme of the invention is realized by the following steps:
step 1, identifying sensitive information in an English text by using named entity recognition technology to obtain a sensitive word set, and sampling an English corpus to obtain a non-sensitive word set;
step 1.1, inputting an English text d and dividing d into N_s sentence sequences consisting of words;
step 1.2, inputting the N_s sentence sequences into the named entity recognition model in turn to obtain the predicted label of every word in the text d, and aggregating the predictions into a label set;
step 1.3, marking the words in the text d that carry a label from the sensitive label set to obtain the set C_n of t_n sensitive words; annotating the words of the English corpus with the same sensitive labels and randomly sampling, for each label with equal probability, a number of words, the sampled words together forming the non-sensitive word set C_n';
Step 2, generating a disturbing word embedding vector of the sensitive word by adopting a differential privacy noise disturbing method;
step 2.1, embedding C into model by using wordsnConverting the words in the Chinese into word embedding vectors;
step 2.2, sampling Laplace noise to obtain a noise vector meeting epsilon-difference privacy;
step 2.3, adding the word embedding vector and the noise vector to obtain a disturbing word embedding vector;
step 3, replacing the sensitive words with non-sensitive words according to a multi-unit auction probability formula based on the Euclidean distance measure;
step 3.1, using the nearest-neighbor formula f1 to select the m non-sensitive words closest to the perturbed embedding in the word vector space, which constitute the candidate word set; here ||·||_2 denotes the Euclidean norm used to measure the distance between two word embedding vectors, and the number of words m ∈ {3, 4, ..., 9} is a custom variable; given the perturbed embedding of the input sensitive word, f1 first selects from C_n' the word that minimizes the Euclidean distance to it and makes it the first candidate, then removes that word from C_n' and selects from the remaining C_n' the word with the smallest distance, iterating in this way until m words have been found, which form the candidate word set;
Step 3.2, according to the probability formula f of multi-unit auction2Selecting a set of candidate words
Figure BDA00034697488700000410
Word in (1)
Figure BDA00034697488700000411
Replacing sensitive words
Figure BDA00034697488700000412
Figure BDA00034697488700000413
Wherein xi is the user-defined probability as the epsilon (0.5, 1),
Figure BDA00034697488700000414
has a prior probability of pix1, 2,.. times.m, and p when x is 1, 2,. times.m-1ixThe first term is xi, the common ratio is (1-xi) and the equal ratio sequence, when x is m,
Figure BDA00034697488700000415
a priori probability p ofim=(1-ξ)m-1
Step 4, replacing all sensitive words CnOutput desensitization text
Figure BDA00034697488700000416
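The four steps above can be condensed into a small self-contained sketch. This is a toy illustration, not the patented implementation: the function names, the pre-identified sensitive set (standing in for the BERT-CRF tagger of step 1), the Gamma-distributed noise magnitude and all parameter values are assumptions for the example.

```python
import math
import random

def desensitize(words, sensitive, embeddings, eps=25.0, m=3, xi=0.8, seed=0):
    """Toy end-to-end sketch of steps 1-4: `sensitive` stands in for the
    output of the named entity recognition step, `embeddings` maps each
    word to a vector, and every embedded word not marked sensitive plays
    the role of the non-sensitive set C_n'."""
    rng = random.Random(seed)
    pool = {w: v for w, v in embeddings.items() if w not in sensitive}

    def perturb(phi):
        # Step 2: noise N = gamma * v with a random unit direction v
        v = [rng.gauss(0.0, 1.0) for _ in phi]
        norm = math.sqrt(sum(x * x for x in v))
        gamma = rng.gammavariate(len(phi), 1.0 / eps)
        return [p + gamma * x / norm for p, x in zip(phi, v)]

    def replace(word):
        # Step 3.1: m nearest pool words to the perturbed embedding
        phi_hat = perturb(embeddings[word])
        dist = lambda v: sum((a - b) ** 2 for a, b in zip(phi_hat, v))
        cands = sorted(pool, key=lambda w: dist(pool[w]))[:m]
        # Step 3.2: geometric "auction" priors over the candidates
        p = [xi * (1 - xi) ** x for x in range(len(cands) - 1)]
        p.append((1 - xi) ** (len(cands) - 1))
        return rng.choices(cands, weights=p, k=1)[0]

    return [replace(w) if w in sensitive else w for w in words]

emb = {"alice": [1.0, 0.0], "bob": [0.9, 0.1],
       "eve": [0.8, 0.2], "hello": [0.0, 1.0]}
print(desensitize(["hello", "alice"], {"alice"}, emb))
```

Non-sensitive words pass through unchanged; each sensitive word is replaced by one of its perturbed nearest neighbors.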
Advantageous effects
Compared with traditional desensitization, anonymization and deep learning desensitization techniques, the method combines word embedding with differential privacy noise: complex semantic information in the text is expressed by word embedding vectors, and the degree of privacy protection is quantified by the differential privacy budget ε, solving the problems that semantic complexity is not fully considered and that a reasonable quantifiable privacy index is hard to provide.
Compared with d_χ-differential privacy, the invention uses the nearest-neighbor formula f1 to extract multiple candidate words from the non-sensitive word set and replaces the sensitive word with one of the candidates according to the multi-unit auction probability formula f2, solving the problem that too little noise added in regions with sparse word vectors causes the original word to be output.
Drawings
FIG. 1 is a schematic diagram of a sensitive text desensitization method using differential privacy word embedding perturbation according to the present invention.
Detailed Description
In order to better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.
The experimental data come from English-language text corpora: the IMDb Movie Reviews corpus, a Twitter corpus and a Wikipedia corpus. The data of the sensitive text desensitization experiments using differential privacy word embedding perturbation are shown in Table 1.
TABLE 1 sensitive text desensitization experimental data using differential privacy word embedding perturbation
(Table 1 appears as an image in the original publication.)
The movie review corpus and the Twitter corpus are each divided into a training set and a test set at a ratio of 9:1; because the Wikipedia corpus is used only to compose the non-sensitive word set, it is not split into training and test sets.
During the experiments, the word embedding vocabulary size is set to 400000, the word embedding dimension to 300, the number of sentences per input text to at most 12, and the maximum length of a single sentence to 100. The loss function of the BERT model is the cross-entropy loss

L = −(1/n) Σ_i Σ_j y_ij · log p_ij

where i indexes the i-th sample of each batch (n samples in total), j indexes the j-th position in a sample (m positions in total), y is the true class label and p is the probability predicted by the neural network; the AdamW optimizer is used for BERT, the Adam optimizer for the CRF layer, and the other parameters are the optimizers' defaults.
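The described loss can be sketched in a few lines. The nested-list data layout below is an assumption for illustration; in practice the probabilities come from the network's softmax output.

```python
import math

def token_cross_entropy(y_true, p_pred):
    """Mean over samples i of the summed negative log-probability assigned
    to the true label at each position j.  y_true[i][j] is the true label
    index and p_pred[i][j] is the predicted probability vector at that
    position (structure assumed for illustration)."""
    total = 0.0
    for yi, pi in zip(y_true, p_pred):
        total += sum(-math.log(pj[tj]) for tj, pj in zip(yi, pi))
    return total / len(y_true)

y = [[0, 1]]                      # one sample, two positions
p = [[[0.9, 0.1], [0.2, 0.8]]]    # predicted label distributions
print(round(token_cross_entropy(y, p), 4))
```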
The test uses the Decline in Accuracy to evaluate the desensitization of the text data. An open-source two-class multi-layer perceptron model is pre-trained to output 1 when the input word is sensitive and 0 when it is non-sensitive, and the privacy protection effect is judged by counting the 0s and 1s: TP is the number of sensitive words classified as sensitive, FN the number of sensitive words classified as non-sensitive, FP the number of non-sensitive words classified as sensitive, and TN the number of non-sensitive words classified as non-sensitive. All words of the corpus text are input into the model and the output 0s and 1s are counted to obtain TP, FN, FP and TN. The accuracy is computed as in formula (4) and the decline as in formula (5):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Decline = Accuracy − Accuracy*    (5)

where Accuracy denotes the accuracy of the binary model on the initial dataset and Accuracy* its accuracy on the desensitized dataset; the larger the decline in accuracy, the better the privacy protection of the desensitization. The accuracy of the binary model on the initial IMDb corpus is 90% and on the initial Twitter corpus 92%.
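Formulas (4) and (5) translate directly into code; the sample counts below are illustrative values, not experimental data.

```python
def accuracy(tp, fn, fp, tn):
    """Formula (4): share of correctly classified words."""
    return (tp + tn) / (tp + fn + fp + tn)

def decline(acc_initial, acc_desensitized):
    """Formula (5): accuracy drop of the classifier after desensitization;
    a larger decline means better privacy protection."""
    return acc_initial - acc_desensitized

# Illustrative numbers matching the reported IMDb result (90% -> 12%).
print(decline(0.90, 0.12))
```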
The experiments run on a computer and a server. Computer: Intel i7-6700 CPU at 2.40 GHz, 4 GB memory, Windows 7 64-bit operating system. Server: E7-4820 v4, 256 GB RAM, Linux Ubuntu 64-bit operating system.
The specific process of the experiment is as follows:
step 1, identifying sensitive information in an English text through a BERT-CRF model to obtain a sensitive word set, and sampling a Wikipedia corpus to obtain a non-sensitive word set.
Step 1.1, a text d from the Twitter corpus is input, and d is split into N_s sentences using "\n", "?", "!" and "." as delimiters; a "." that marks a word abbreviation is not treated as a sentence separator. Each sentence s_r is then split on spaces into N_w words, giving its sequential representation.
Step 1.2, the sentence sequences are input into the BERT-CRF model in turn; "[CLS]" and "[SEP]" are added at the beginning and end of each sentence s_r to mark sentence start and end, the Token Embedding, Segment Embedding and Position Embedding of every word in s_r are concatenated for training, the word labels are annotated with the "BIO" tagging scheme (B marks the beginning of an entity, I the inside of an entity, and O the outside of an entity), and aggregation yields the N_L labels of the text d as a label set.
Step 1.3, the words in d carrying sensitive labels are marked to obtain the set C_n of n sensitive words, the sensitive label set consisting of the custom sensitive labels. The Wikipedia corpus is annotated with the same sensitive labels to obtain, for each label i* = 1, 2, ..., 12, a word set W_{i*}; from each W_{i*}, ⌈α·count(W_{i*})⌉ words are randomly sampled with equal probability, where α ∈ (0, 1] is a custom percentage, the count(·) function computes the number of words in a set, and ⌈·⌉ denotes rounding up; the sampled words together form the non-sensitive word set C_n' containing n' words.
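The per-label equal-probability sampling of step 1.3 can be sketched as follows; the label names, corpus contents and α value are assumptions for the example.

```python
import math
import random

def sample_non_sensitive(labeled_words, alpha=0.1, seed=0):
    """Sketch of step 1.3: from each sensitive label's word set W_i,
    sample ceil(alpha * count(W_i)) words uniformly without replacement.
    `labeled_words` maps a label to the corpus words carrying it."""
    rng = random.Random(seed)
    non_sensitive = set()
    for label, words in labeled_words.items():
        k = math.ceil(alpha * len(words))   # ceil(alpha * count(W_i))
        non_sensitive.update(rng.sample(words, k))
    return non_sensitive

corpus = {"PER": ["alice", "bob", "carol", "dave"],
          "LOC": ["paris", "tokyo", "berlin"]}
print(sorted(sample_non_sensitive(corpus, alpha=0.5)))
```

With alpha = 0.5, two words are drawn from each of the two label sets.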
Step 2, the perturbed word embedding vector of each sensitive word is generated by the differential privacy noise perturbation method.
Step 2.1, the word embedding model BERT maps each word in C_n to its word embedding vector φ(c) ∈ R^n, where W denotes the word space and R^n the n-dimensional word embedding vector space; n = 300 in the experiment.
Step 2.2, a random vector v = [v_1, ..., v_n] is sampled from a multivariate normal distribution and normalized; a scalar γ is then sampled from a probability distribution, yielding the noise vector N = γ·v that satisfies ε-differential privacy. The multivariate normal density is

p(v) = exp(−(1/2)(v − μ)^T Σ^(−1)(v − μ)) / ((2π)^(n/2) |Σ|^(1/2))

where n is the word embedding dimension, the mean μ is centered at the origin, the covariance matrix Σ is the identity matrix, (·)^T is the matrix transpose, exp(·) is the exponential function with the natural constant e as base, π is the circle constant and |·| is the matrix determinant. The probability density of γ is

p(γ) = ε^n · γ^(n−1) · e^(−εγ) / Γ(n)

where n is the word embedding dimension, e is the natural constant and Γ(·) is the gamma function; ε = 25 in the experiment.
Step 2.3, the word embedding vector and the noise vector are added to obtain the perturbed word embedding vector φ̂ = φ + N.
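Step 2's noise construction can be sketched as follows. This is a minimal sketch assuming the magnitude γ follows a Gamma(n, 1/ε) distribution, which matches the density given above; the toy embedding and parameter values are illustrative.

```python
import math
import random

def perturb_embedding(phi, eps=25.0, seed=0):
    """Sketch of step 2: draw a direction v from a standard multivariate
    normal and normalize it, draw a magnitude gamma from Gamma(n, 1/eps),
    and add the noise N = gamma * v to the word embedding phi."""
    rng = random.Random(seed)
    n = len(phi)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]                 # unit direction
    gamma = rng.gammavariate(n, 1.0 / eps)    # noise magnitude
    return [p + gamma * u for p, u in zip(phi, v)]

phi = [0.1] * 300                             # toy 300-dim embedding
phi_hat = perturb_embedding(phi, eps=25.0)
print(len(phi_hat))
```

Larger ε shrinks the expected magnitude n/ε, so the perturbed vector stays closer to the original embedding.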
Step 3, the sensitive words are replaced with non-sensitive words according to the multi-unit auction probability formula based on the Euclidean distance measure.
Step 3.1, the nearest-neighbor formula f1 selects the m non-sensitive words closest to the perturbed embedding φ̂ in the word vector space, which constitute the candidate word set; ||·||_2 denotes the Euclidean norm used to measure the distance between two word embedding vectors, and m = 5 in the experiment. f1 first selects from C_n' the word that minimizes the Euclidean distance to φ̂ and makes it the first candidate, then removes that word from C_n' and selects from the remaining C_n' the word with the smallest distance, iterating until 5 words have been found, which form the candidate word set.
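The iterative nearest-neighbor selection of step 3.1 can be sketched as follows; the two-dimensional toy vectors and the dictionary structure are illustrative assumptions.

```python
import math

def nearest_candidates(phi_hat, vocab_vectors, m=5):
    """Sketch of step 3.1: repeatedly pick from the non-sensitive set the
    word whose embedding is closest (Euclidean norm) to the perturbed
    vector phi_hat, remove it, and stop after m candidates."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_hat, v)))
    pool = dict(vocab_vectors)          # working copy of C_n'
    candidates = []
    for _ in range(min(m, len(pool))):
        best = min(pool, key=lambda w: dist(pool[w]))
        candidates.append(best)
        del pool[best]                  # remove the chosen word from C_n'
    return candidates

vecs = {"paris": [0.0, 1.0], "tokyo": [3.0, 3.0],
        "berlin": [0.5, 1.2], "lima": [9.0, 9.0]}
print(nearest_candidates([0.0, 1.0], vecs, m=3))
```

The candidates come back ordered by distance, which is what the prior probabilities of step 3.2 rely on.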
Step 3.2, according to the multi-unit auction probability formula f2, a word is selected from the candidate word set to replace the sensitive word; ξ = 0.8 in the experiment. The x-th candidate has prior probability p_x, x = 1, 2, 3, 4, where the p_x form a geometric sequence with first term 0.8 and common ratio 0.2, and for x = 5 the prior probability is p_5 = 0.2^4 = 0.0016.
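The auction priors of step 3.2 can be sketched as follows; the function names are illustrative.

```python
import random

def auction_priors(m, xi):
    """Sketch of step 3.2: candidate x (1-based) gets prior probability
    p_x = xi * (1 - xi)**(x - 1) for x < m, and p_m = (1 - xi)**(m - 1),
    so the priors sum to 1."""
    p = [xi * (1 - xi) ** (x - 1) for x in range(1, m)]
    p.append((1 - xi) ** (m - 1))
    return p

def pick_replacement(candidates, xi=0.8, seed=0):
    """Choose the replacement word from the (distance-ordered) candidate
    set using the auction priors."""
    p = auction_priors(len(candidates), xi)
    return random.Random(seed).choices(candidates, weights=p, k=1)[0]

print(auction_priors(5, 0.8))   # priors sum to 1; last entry is 0.2**4
```

With ξ = 0.8 and m = 5 the priors are 0.8, 0.16, 0.032, 0.0064 and 0.0016, so the nearest candidate is chosen most often but never with certainty.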
Step 4, after all sensitive words C_n have been replaced, the desensitized text is output.
Test results: the method reduces the accuracy of the binary classification model on the desensitized IMDb corpus to 12% (a decline of 78%) and on the desensitized Twitter corpus to 10% (a decline of 82%), showing that the differential privacy text desensitization method works well.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (3)

1. A sensitive text desensitization method using differential privacy word embedding perturbation, the method comprising the steps of:
step 1, identifying sensitive information in an English text by using named entity recognition technology to obtain a sensitive word set, and sampling an English corpus to obtain a non-sensitive word set;
step 1.1, inputting an English text d and dividing d into N_s sentence sequences consisting of words;
step 1.2, inputting the N_s sentence sequences into the named entity recognition model in turn to obtain the predicted label of every word in the text d, and aggregating the predictions into a label set;
step 1.3, marking the words in the text d that carry a label from the sensitive label set to obtain the set C_n of t_n sensitive words; annotating the words of the English corpus with the same sensitive labels and randomly sampling, for each label with equal probability, a number of words, the sampled words together forming the non-sensitive word set C_n';
Step 2, generating a disturbing word embedding vector of the sensitive word by adopting a differential privacy noise disturbing method;
step 2.1, embedding C into model by using wordsnConverting the words in the Chinese into word embedding vectors;
step 2.2, sampling Laplace noise to obtain a noise vector meeting epsilon-difference privacy;
step 2.3, adding the word embedding vector and the noise vector to obtain a disturbing word embedding vector;
step 3, replacing the sensitive words with non-sensitive words according to a multi-unit auction probability formula based on the Euclidean distance measure;
step 3.1, using the nearest-neighbor formula f1 to select the m non-sensitive words closest to the perturbed embedding in the word vector space, which constitute the candidate word set; here ||·||_2 denotes the Euclidean norm used to measure the distance between two word embedding vectors, and the number of words m ∈ {3, 4, ..., 9} is a custom variable; given the perturbed embedding of the input sensitive word, f1 first selects from C_n' the word that minimizes the Euclidean distance to it and makes it the first candidate, then removes that word from C_n' and selects from the remaining C_n' the word with the smallest distance, iterating in this way until m words have been found, which form the candidate word set;
Step 3.2, according to the probability formula f of multi-unit auction2Selecting a set of candidate words
Figure FDA00034697488600000210
Word in (1)
Figure FDA00034697488600000211
Replacing sensitive words
Figure FDA00034697488600000212
Figure FDA00034697488600000213
Wherein xi is the self-defined probability to obtain the replacement word
Figure FDA00034697488600000214
Step 4, replacing all sensitive words CnOutputting desensitized text
Figure FDA00034697488600000215
2. The sensitive text desensitization method using differential privacy word embedding perturbation according to claim 1, characterized in that: the sensitive label set in step 1.3 contains the custom sensitive labels; each label is annotated in the English corpus to obtain a word set W_i, and ⌈α·count(W_i)⌉ words are sampled from W_i with equal probability, where α ∈ (0, 1] is a custom percentage, the count(·) function computes the number of words in a set, and ⌈·⌉ denotes rounding up.
3. The sensitive text desensitization method using differential privacy word embedding perturbation according to claim 1, characterized in that: the multi-unit auction in step 3.2 means that the goods are offered first to the highest bidder, who receives the requested unit quantity, then to the second-highest bidder, and so on until the goods are exhausted; the probability formula f2 is devised on this basis, and f2 selects a candidate word to replace the sensitive word, where the x-th candidate has prior probability p_x, x = 1, 2, ..., m; for x = 1, 2, ..., m−1 the p_x form a geometric sequence with first term ξ and common ratio (1−ξ), and for x = m the prior probability is p_m = (1−ξ)^(m−1).
CN202210039857.9A 2022-01-14 2022-01-14 Sensitive text desensitization method using differential privacy word embedding disturbance Pending CN114547670A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210039857.9A CN114547670A (en) 2022-01-14 2022-01-14 Sensitive text desensitization method using differential privacy word embedding disturbance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210039857.9A CN114547670A (en) 2022-01-14 2022-01-14 Sensitive text desensitization method using differential privacy word embedding disturbance

Publications (1)

Publication Number Publication Date
CN114547670A true CN114547670A (en) 2022-05-27

Family

ID=81672066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210039857.9A Pending CN114547670A (en) 2022-01-14 2022-01-14 Sensitive text desensitization method using differential privacy word embedding disturbance

Country Status (1)

Country Link
CN (1) CN114547670A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115935405A (en) * 2022-09-15 2023-04-07 广州大学 Text content protection method based on differential privacy
CN115952854A (en) * 2023-03-14 2023-04-11 杭州太美星程医药科技有限公司 Training method of text desensitization model, text desensitization method and application


Similar Documents

Publication Publication Date Title
Mohawesh et al. Fake reviews detection: A survey
Yu et al. Multiple level hierarchical network-based clause selection for emotion cause extraction
Redi et al. Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability
AlQahtani Product sentiment analysis for amazon reviews
Gaur et al. Semi-supervised deep learning based named entity recognition model to parse education section of resumes
CN114547670A (en) Sensitive text desensitization method using differential privacy word embedding disturbance
Budhiraja et al. A supervised learning approach for heading detection
Chen et al. Neural article pair modeling for wikipedia sub-article matching
Shekhar et al. An effective cybernated word embedding system for analysis and language identification in code-mixed social media text
Wang et al. Word vector modeling for sentiment analysis of product reviews
Brek et al. Enhancing information extraction process in job recommendation using semantic technology
Suresh Kumar et al. Local search five‐element cycle optimized reLU‐BiLSTM for multilingual aspect‐based text classification
Chatsiou Text classification of manifestos and COVID-19 press briefings using BERT and convolutional neural networks
Viswanathan et al. Detection of duplicates in Quora and Twitter corpus
Abdullah Amer et al. A novel algorithm for sarcasm detection using supervised machine learning approach.
Pandey et al. Various aspects of sentiment analysis: a review
Shelke et al. Support vector machine based word embedding and feature reduction for sentiment analysis-a study
Zim et al. Exploring Word2Vec embedding for sentiment analysis of Bangla raw and romanized text
Hamad et al. Sentiment analysis of restaurant reviews in social media using naïve bayes
Low et al. Decoding violence against women: analysing harassment in middle eastern literature with machine learning and sentiment analysis
Corredera Arbide et al. Affective computing for smart operations: a survey and comparative analysis of the available tools, libraries and web services
Karimi et al. Sentiment analysis using BERT (pre-training language representations) and Deep Learning on Persian texts
Hazrati et al. Profiling Irony Speech Spreaders on Social Networks Using Deep Cleaning and BERT.
Wang et al. SICM: a supervised-based identification and classification model for Chinese jargons using feature adapter enhanced BERT
Gnanavel et al. Rapid Text Retrieval and Analysis Supporting Latent Dirichlet Allocation Based on Probabilistic Models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination