CN117349879A

CN117349879A - Text data anonymization privacy protection method based on continuous bag-of-words model

Info

Publication number: CN117349879A
Application number: CN202311165334.XA
Authority: CN
Inventors: 吴萍; 郭海宁
Original assignee: Jiangsu Hankang Dongyou Information Technology Co ltd
Current assignee: Jiangsu Hankang Dongyou Information Technology Co ltd
Priority date: 2023-09-11
Filing date: 2023-09-11
Publication date: 2024-01-05

Abstract

The invention discloses a text data anonymization privacy protection method based on a continuous bag-of-words model, and relates to the technical field of data security protection. It addresses the problem of linking attacks on published data: when an attacker attacks a certain feature through links, the selected data is associated with other features in the table, so the attacker has difficulty determining the correspondence between a user and that feature and cannot determine the user's specific identity from partial features. The privacy of the data to be published is thereby protected and anonymization protection is achieved. The method is used to solve the problem of user data privacy protection in the technical field of data security and privacy protection.

Description

Text data anonymization privacy protection method based on continuous bag-of-words model

Technical Field

The invention relates to the technical field of data security protection, in particular to a text data anonymization privacy protection method based on a continuous bag-of-words model.

Background

With the rapid development of information technology, complete data has become a prerequisite for the development of many industries, and data sharing has become one of the popular applications of cloud storage technology. However, because the data itself is highly valuable, data security during sharing has become a more serious problem. Malicious users, malicious cloud storage servers, and hackers can pry into users' privacy by various means, and mining sensitive information from data published by users is common. As decisions become increasingly data-driven, people publish and share data more frequently. The value of data, especially private data, keeps growing; at the same time, a large number of serious private data leaks occur when data is published for information sharing and data mining.

How to effectively select data features and choose an efficient data anonymization model, so as to achieve effective anonymized data privacy protection, is a problem that needs to be solved. To this end, a text data anonymization privacy protection method based on a continuous bag-of-words model is provided.

Disclosure of Invention

The invention aims to provide a text data anonymization privacy protection method based on a continuous bag-of-words model.

The aim of the invention can be achieved by the following technical solution: a text data anonymization privacy protection method based on a continuous bag-of-words model, comprising the following steps:

step S1: constructing a text feature information base, and establishing a corresponding text definition model according to the constructed text feature information base;

step S2: acquiring text data to be anonymously protected, preprocessing the acquired text data, and completing information standardization of the acquired text data;

step S3: inputting the text data with standardized information into a text definition model, and finishing the feature extraction of the text data;

step S4: anonymizing the text data whose features have been extracted, completing the anonymization of the text data.

Further, the text feature information base comprises an English feature information base and a Chinese feature information base, and corresponding ontology construction rules are respectively imported into the English feature information base and the Chinese feature information base.

Further, the ontology construction rule is customized according to the user requirement, the user can select required elements according to the requirement to form a new ontology construction rule, the formed new ontology construction rule is recorded as the user customization construction rule, and a corresponding text definition model is built according to the formed user customization construction rule.

Further, the preprocessing process for the text data comprises the following steps:

marking text data which is required to be anonymously protected by a user, obtaining the type of the text data, and executing corresponding information standardization operation on the text data according to the type of the text data, wherein the type of the text data comprises a Chinese type and an English type.

Further, when the type of the text data is the Chinese type, the information standardization operation performed on the text data is:

setting a stop word list, wherein a plurality of entries are arranged in the stop word list, marking specific words in text data according to the entries in the stop word list, and eliminating the specific words from the text data;

setting a symbol table and a character table, wherein a plurality of punctuation marks are arranged in the symbol table, and a plurality of characters are arranged in the character table;

and eliminating punctuation marks and special characters in the text data according to the symbol table and the character table.

Further, when the type of the text data is the English type, the information standardization operation performed on the text data is:

setting an English root list containing a plurality of English roots, and setting, for each English root according to actual conditions, a related word list containing at least one related word;

performing word variant reduction on the obtained text data according to the English root list;

and then converting all capitalized words or letters in the text data into lowercase words or letters.

Further, the feature extraction process of the text data includes:

extracting features from the text data to obtain the corresponding text features, and generating a corresponding word vector dictionary table from the obtained text features;

the number of words contained in the text data is denoted $n$, and each word is numbered $i$, where $i = 1, 2, \ldots, n$, $n$ is an integer, and $n > 0$;

the text data is defined by the Word Embedding algorithm as $S = \{w_i\}$, where $w_i$ is the word vector representing the word numbered $i$ in the text data; using the embedding matrix $E = \{e_i\}$ of the Word Embedding algorithm, each word $w_i$ in the text data is mapped to a multi-dimensional, continuous-valued word vector $x_i$, giving the word vector matrix $X = \{x_i\}$ of the text data, where $x_i = (E \times w_i) \in \mathbb{R}^d$ and $d$ is the dimension of the word vector;
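As a small worked illustration of the mapping from a word to its word vector, assume (this is not stated in the source) that $w_i$ is a one-hot column vector over a vocabulary of 3 words and that the embedding matrix has $d = 2$ rows. The numeric entries are made up. Multiplying by a one-hot vector then simply selects the matching column of $E$:

```latex
E = \begin{pmatrix} 0.2 & -0.5 & 0.7 \\ 0.1 & 0.4 & -0.3 \end{pmatrix},
\qquad
w_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},
\qquad
x_2 = E \times w_2 = \begin{pmatrix} -0.5 \\ 0.4 \end{pmatrix} \in \mathbb{R}^{2}
```

Under this assumption, stacking the vectors $x_1, \ldots, x_n$ for all $n$ words of the text gives the word vector matrix $X$.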

after feature extraction of the text data is completed, anonymizing the text data is carried out.

Further, the process of anonymizing the text data includes:

labeling the obtained word vector matrix; selecting word vectors from the word vector matrix, in the order of the labels of the word vectors, by a greedy algorithm with forward sequential search; marking the selected word vectors as anonymous features; and applying anonymization protection to the anonymous features. When the selected data features are attacked, they are associated with other features, so an attacker has difficulty determining the correspondence between a user and a given feature and cannot determine the user's specific identity from partial features.

Compared with the prior art, the invention has the following beneficial effects: the text data is preprocessed to complete its information standardization, an ontology is constructed for the keywords in the text, features are extracted from the text data, and anonymous features are finally selected by a greedy algorithm with forward sequential search. When an attacker attacks a certain feature through links, the selected data is associated with other features in the table, so the attacker has difficulty determining the correspondence between a user and that feature and cannot determine the user's specific identity from partial features. The privacy of the data to be published is thereby protected and anonymization protection is achieved. The method is used to solve the problem of user data privacy protection in the technical field of data security and privacy protection.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and a person having ordinary skill in the art may obtain other drawings from them.

FIG. 1 is a flow chart of the present invention.

Detailed Description

As shown in FIG. 1, the text data anonymization privacy protection method based on the continuous bag-of-words model comprises steps S1 to S4 described above.

It should be further noted that, in the implementation process, the text feature information base includes an English feature information base and a Chinese feature information base;

corresponding ontology construction rules are respectively imported into the English feature information base and the Chinese feature information base, each ontology construction rule being recorded in terms of three class sets $C_i$, $A_j$ and $R_k$,

where $C_i$ denotes an ontology concept class set, consisting of a plurality of ontology concept elements; $A_j$ denotes an ontology attribute class set, consisting of a plurality of ontology attribute elements; $R_k$ denotes an ontology relation class set, consisting of a plurality of ontology relation elements; $i = 1, 2, \ldots, c$; $j = 1, 2, \ldots, a$; $k = 1, 2, \ldots, r$; $c$, $a$ and $r$ are integers greater than 0;

establishing a corresponding vocabulary list for the elements in each set and associating the vocabulary list with the corresponding elements; each vocabulary list consists of a plurality of words;

in the specific implementation process, the ontology construction rules are customized according to specific requirements of users, the users can select required elements in each set according to the requirements to form new ontology construction rules, the formed new ontology construction rules are recorded as user customization construction rules, and corresponding text definition models are built according to the formed user customization construction rules;

For example:

let the user-customized construction rule be defined by

$C_i = [C_1, C_2, C_3]$; $A_j = [A_1, A_2, A_3]$; $R_k = [R_1, R_2, R_3]$;

let the ontology concept elements corresponding to $C_1$, $C_2$, $C_3$ be defined as name, address and telephone number, respectively;

then the corresponding ontology concept class set in the Chinese feature information base is $C_i$ = [name, address, telephone number], and the ontology concept class set in the English feature information base is $C_i$ = [Name, Address, TelephoneNumber];

similarly, let the ontology attribute elements corresponding to $A_1$, $A_2$, $A_3$ be the name class attribute, the address class attribute and the telephone class attribute;

then the corresponding ontology attribute class set in the Chinese feature information base is $A_j$ = [name class attribute, address class attribute, telephone class attribute], and the ontology attribute class set in the English feature information base is $A_j$ = [NameType, AddressType, TelephoneNumberType];

similarly, let $R_1$, $R_2$, $R_3$ be defined as the friend relationship, the family relationship and the colleague relationship;

then the corresponding ontology relation class set in the Chinese feature information base is $R_k$ = [friend relationship, family relationship, colleague relationship], and the ontology relation class set in the English feature information base is $R_k$ = [Friend, Family, Colleague];

the words associated with each element are then obtained from the elements contained in each set, completing the construction of the text definition model.
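One possible way to hold such a rule in memory is sketched below. The class name `OntologyRule`, its fields, the `customize` helper and the sample vocabulary words are all hypothetical and chosen only for illustration; the source does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OntologyRule:
    """Hypothetical container for one ontology construction rule."""
    concepts: List[str]    # ontology concept elements (set C)
    attributes: List[str]  # ontology attribute elements (set A)
    relations: List[str]   # ontology relation elements (set R)
    # vocabulary list associated with each element (element -> words)
    vocabulary: Dict[str, List[str]] = field(default_factory=dict)

    def customize(self, concepts=(), attributes=(), relations=()):
        """Form a user-customized rule from the elements the user selects."""
        chosen = list(concepts) + list(attributes) + list(relations)
        return OntologyRule(
            concepts=[c for c in self.concepts if c in concepts],
            attributes=[a for a in self.attributes if a in attributes],
            relations=[r for r in self.relations if r in relations],
            vocabulary={e: self.vocabulary[e] for e in chosen if e in self.vocabulary},
        )


# Element names follow the English example in the text; the words are made up.
english_rule = OntologyRule(
    concepts=["Name", "Address", "TelephoneNumber"],
    attributes=["NameType", "AddressType", "TelephoneNumberType"],
    relations=["Friend", "Family", "Colleague"],
    vocabulary={"Name": ["alice", "bob"], "Friend": ["friend", "buddy"]},
)

custom_rule = english_rule.customize(concepts=["Name"], relations=["Friend", "Family"])
print(custom_rule.concepts, custom_rule.relations)
```

Keeping the vocabulary keyed by element name means a customized rule carries over only the word lists of the elements the user actually kept.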

It should be further noted that, in the implementation process, the preprocessing process for the text data includes:

marking the text data that the user requires to be anonymously protected, obtaining the type of the text data, and executing the corresponding information standardization operation on the text data according to its type; it should be further noted that, in the implementation process, the type of the text data includes a Chinese type and an English type;

when the type of the text data is the Chinese type, the information standardization operation performed on the text data is as follows:

sequentially performing stop-word removal and symbol removal on the obtained text data to complete the information standardization operation on the text data;

it should be further noted that, in the implementation process, the stop-word removal operation on the text data is specifically:

setting a stop word list containing a plurality of entries, marking specific words in the text data according to the entries in the stop word list, and removing the specific words from the text data; the specific words are words that contribute no actual meaning to the content of the text data in practical use, such as pronouns, articles, prepositions, conjunctions and modal verbs;

it should be further noted that, in the implementation process, the symbol removal operation on the text data is specifically:

removing punctuation marks and special characters from the text data according to the symbol table and the character table, completing the information standardization operation on the text data;

by removing stop words and symbols from Chinese-type text data, the content of the text data is streamlined during processing, and the volume of the text data is reduced while its original content and meaning are preserved, which reduces the system's computational load, avoids unnecessary computation and improves computational efficiency;
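The stop-word and symbol removal described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text is assumed to have been segmented into tokens already (the source does not describe segmentation), and the three tables contain only a few made-up entries.

```python
# Hypothetical tables; a real deployment would load much larger ones.
STOP_WORDS = {"的", "了", "和", "在", "是"}
SYMBOL_TABLE = set("，。！？；：、（）《》“”")
CHARACTER_TABLE = set("@#$%^&*~")


def normalize_chinese(tokens):
    """Remove stop words, then strip listed symbols and special characters."""
    removable = SYMBOL_TABLE | CHARACTER_TABLE
    cleaned = []
    for token in tokens:
        if token in STOP_WORDS:
            continue  # stop-word removal
        # symbol removal: drop every character found in either table
        token = "".join(ch for ch in token if ch not in removable)
        if token:
            cleaned.append(token)
    return cleaned


# Assumed pre-segmented input; the phone number is a placeholder.
tokens = ["张三", "的", "电话", "是", "13800000000", "。"]
print(normalize_chinese(tokens))
```

Tokens that become empty after symbol removal are dropped, so pure punctuation tokens disappear from the output.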

when the type of the text data is the English type, the information standardization operation performed on the text data is as follows:

performing word variant reduction on the obtained text data according to the English root list; specifically, when a word in the text data is a related word in the related word list of some English root in the English root list, the word is restored to the English root corresponding to that related word;

converting all capitalized words or letters in the text data into lowercase words or letters;

by applying word variant reduction to English-type text data, the format of the text data content is made consistent during processing, and the volume of the text data is reduced while its original content is preserved, thereby reducing the system's computational load;
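A minimal sketch of the English-type standardization is given below. The root list entries and the tokenizing regular expression are assumptions made for illustration. Note one deliberate deviation: the sketch lowercases before looking up variants, so that capitalized variants such as "CALLED" still match a lowercase root list, whereas the text lists variant reduction first.

```python
import re

# Hypothetical root list: each English root maps to its related-word list.
ROOT_LIST = {
    "call": ["calls", "called", "calling"],
    "live": ["lives", "lived", "living"],
}
# Invert it once so that each related word points back to its root.
VARIANT_TO_ROOT = {
    variant: root for root, variants in ROOT_LIST.items() for variant in variants
}


def normalize_english(text):
    """Lowercase the text and restore listed word variants to their roots."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [VARIANT_TO_ROOT.get(word, word) for word in words]


print(normalize_english("Alice CALLED Bob, who lives in Nanjing"))
```

Words that do not appear in any related-word list pass through unchanged apart from lowercasing.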

importing the preprocessed text data into a text definition model to finish the feature extraction of the text data, wherein the specific process comprises the following steps:

extracting features from the text data to obtain the corresponding text features, and generating a corresponding word vector dictionary table from the obtained text features, where the word vector dictionary table has 100 dimensions and each dimension corresponds to one text feature;

the text data is defined by the Word Embedding algorithm as $S = \{w_i\}$, where $w_i$ is the word vector representing the word numbered $i$ in the text data; using the embedding matrix $E = \{e_i\}$ of the Word Embedding algorithm, each word $w_i$ in the text data is mapped to a multi-dimensional, continuous-valued word vector $x_i$, giving the word vector matrix $X = \{x_i\}$ of the text data, where $x_i = (E \times w_i) \in \mathbb{R}^d$, and $d$ is the dimension of the word vector and is determined by the user-customized construction rule;
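The following sketch shows how a word vector dictionary table could be used to turn preprocessed words into a word vector matrix. It is illustrative only: the vectors are random stand-ins for embeddings that would in practice be learned (for example by a continuous bag-of-words model), and mapping unknown words to the zero vector is an assumption, not something the source specifies. The function names are hypothetical.

```python
import random


def build_dictionary(vocabulary, d, seed=0):
    """Word vector dictionary table: word -> d-dimensional vector (random stand-in)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(d)] for word in vocabulary}


def embed(words, dictionary, d):
    """Return X with one row x_i per word; unknown words map to the zero vector."""
    zero = [0.0] * d
    return [list(dictionary.get(word, zero)) for word in words]


d = 100  # dimension stated in the embodiment
dictionary = build_dictionary(["alice", "call", "bob"], d)
X = embed(["alice", "call", "bob", "today"], dictionary, d)
print(len(X), len(X[0]))  # 4 100
```

Looking a word up in the dictionary table is equivalent to multiplying the embedding matrix by that word's one-hot vector, which is why no explicit matrix product appears in the sketch.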

It should be further noted that, in the implementation process, the process of anonymizing the text data includes:

labeling the obtained word vector matrix, selecting word vectors from the word vector matrix, in the order of the labels of the word vectors, by a greedy algorithm with forward sequential search, marking the selected word vectors as anonymous features, and applying anonymization protection to the anonymous features.
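The source does not state the criterion that the greedy forward search optimizes, nor how the selected features are protected. The sketch below is therefore only one possible reading: features are table columns indexed by their labels, the scoring rule (number of distinct value combinations, i.e. how strongly the chosen features single out records) is an assumption, and masking with `*` is a stand-in for the unspecified protection step. All function names and the sample records are hypothetical.

```python
def greedy_forward_select(num_features, score, max_features):
    """Sequential forward search: repeatedly add the feature whose inclusion
    raises score(selected) the most; stop when no candidate improves it."""
    selected = []
    best = score(selected)
    while len(selected) < max_features:
        candidates = [j for j in range(num_features) if j not in selected]
        if not candidates:
            break
        # Ties are broken in favour of the lowest label (forward label order).
        top_score, neg_j = max((score(selected + [j]), -j) for j in candidates)
        if top_score <= best:
            break
        selected.append(-neg_j)
        best = top_score
    return selected


def distinct_rows(records):
    """Assumed scoring rule: count distinct value combinations of the features."""
    def score(features):
        return len({tuple(row[j] for j in features) for row in records})
    return score


def anonymize(records, features, mask="*"):
    """Replace the selected (anonymous) features with a mask token."""
    return [[mask if j in features else v for j, v in enumerate(row)] for row in records]


records = [
    ["alice", "nanjing", "engineer"],
    ["bob", "nanjing", "engineer"],
    ["carol", "suzhou", "engineer"],
]
chosen = greedy_forward_select(3, distinct_rows(records), max_features=2)
print(chosen)                      # [0]
print(anonymize(records, chosen))
```

In this toy table the first column alone already distinguishes every record, so the search stops after selecting it, and masking that column leaves records that can no longer be told apart by name.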

In the method, the text data is preprocessed to complete its information standardization, an ontology is constructed for the keywords in the text, features are extracted from the text data, and anonymous features are finally selected by a greedy algorithm with forward sequential search. When an attacker attacks a certain feature through links, the selected data is associated with other features in the table, so the attacker has difficulty determining the correspondence between a user and that feature and cannot determine the user's specific identity from partial features. The privacy of the data to be published is thereby protected and anonymization protection is achieved. The method is used to solve the problem of user data privacy protection in the technical field of data security and privacy protection.

The above embodiments are only for illustrating the technical method of the present invention and not for limiting the same, and it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from the spirit and scope of the technical method of the present invention.

Claims

1. A text data anonymization privacy protection method based on a continuous bag-of-words model, characterized by comprising the following steps:

2. The privacy protection method for anonymizing text data based on a continuous bag-of-words model according to claim 1, wherein the text feature information base comprises an English feature information base and a Chinese feature information base, and the corresponding ontology construction rules are respectively imported into the English feature information base and the Chinese feature information base.

3. The privacy protection method for anonymizing text data based on continuous bag-of-words models according to claim 2, wherein the ontology construction rules are customized according to user requirements, a user can select required elements according to the requirements to form new ontology construction rules, the formed new ontology construction rules are recorded as user customization construction rules, and a corresponding text definition model is built according to the formed user customization construction rules.

4. The privacy preserving method for anonymizing text data based on continuous bag of words model as claimed in claim 3, wherein the preprocessing procedure for the text data comprises:

5. The privacy preserving method of anonymizing text data based on continuous bag of words model as claimed in claim 4, wherein when the type of text data is the Chinese type, the information standardization operation performed on the text data is:

6. The privacy preserving method for anonymizing text data based on continuous bag of words model as claimed in claim 5, wherein when the type of text data is the English type, the information standardization operation performed on the text data is:

7. The privacy preserving method for anonymizing text data based on continuous bag of words model as claimed in claim 6, wherein the feature extraction process of the text data comprises:

8. The privacy preserving method for anonymizing text data based on continuous bag of words model as recited in claim 7, wherein the anonymizing text data comprises: