CN110222153A

CN110222153A - A Chinese name data desensitization method based on sort permutation

Info

Publication number: CN110222153A
Application number: CN201910485787.8A
Authority: CN
Inventors: 李辉; 赵柯纯; 龚政; 孟雪
Original assignee: Xidian University
Current assignee: Xidian University; China Academy of Electronic and Information Technology of CETC
Priority date: 2019-06-05
Filing date: 2019-06-05
Publication date: 2019-09-10

Abstract

The invention discloses a Chinese name data desensitization method based on sort permutation, comprising the following steps: 1) dividing the data in a Chinese personal name corpus into two classes, surnames and given names, and converting them into vector form; 2) storing the two classes of data and their vector forms in a database; 3) obtaining the name data to be desensitized; 4) converting the surname and the given name of the name data to be desensitized into vector form respectively; 5) obtaining the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized; 6) randomly selecting one of the K surname vectors to replace the surname of the name to be desensitized; 7) obtaining the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized; 8) randomly selecting one of the N given-name vectors to replace the given name of the name to be desensitized, thereby obtaining the desensitized data. The method allows a desensitized name to retain the characteristic features of a name.

Description

A Chinese name data desensitization method based on sort permutation

Technical field

The invention belongs to the field of information security technology and relates to a Chinese name data desensitization method based on sort permutation.

Background technique

In the big data era, multi-source heterogeneous data contains a large amount of key information that has great commercial value for enterprises and individuals. This information also contains a large amount of personal privacy data, among which personal names, which point to a specific individual, are the most important. Once such sensitive information is leaked, it may not only cause various kinds of trouble for the individual but, in serious cases, damage personal reputation and endanger personal and property safety. In addition, publishing real user data for researchers to analyze and mine makes a major contribution to scientific research, but it has also become one of the channels through which the privacy of large numbers of users is leaked.

Data desensitization refers to transforming certain sensitive information according to desensitization rules so as to remove its sensitivity and reliably protect privacy-sensitive data. Data desensitization was proposed to strike a balance between data protection and data availability: where customer security data or commercially sensitive data is involved, real data is transformed by desensitization, without violating system rules, and then provided to others for development, testing or statistical analysis.

Language is the carrier of knowledge and thinking. Natural language processing (NLP) is a field of computer science, artificial intelligence and linguistics concerned with the interaction between computers and human language. Word embedding is the collective name for a set of language modeling and representation learning techniques in natural language processing; in short, it maps each word or phrase to a real-valued vector in a predefined vector space. Various models exist for constructing word embeddings, among which Word2vec and GloVe are two of the most widely used implementations. In natural language processing today, text classification is mostly performed by combining word vectors with deep neural networks. The present invention therefore proposes combining natural language processing technology with data desensitization, using the word-vector-based Chinese text classification capability of natural language processing.

Currently, existing desensitization techniques for Chinese names roughly include the following:

A) Directly replacing name data with common names such as "Zhang San" or "Li Si". However, this leaves only a few identical names in the entire data table, so the distribution of the data can no longer be seen, which hinders statistical analysis.

B) Randomly permuting the name data: the code of each Chinese character of the original name is offset by a random amount to generate another Chinese character. However, this random method causes the desensitized name data to completely lose the original characteristics of Chinese names and produces many rarely used characters.

C) Constructing a Chinese name code table and then replacing the original name through a hash mapping. This preserves the diversity and distribution of the data, but it requires considerable time and space overhead, and the number of names constructed is limited, so the real distribution characteristics still cannot be retained.
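As a rough illustration of approach C), the sketch below maps each original name to an entry of a fixed code table through a hash. The table contents, the choice of SHA-256 and the function name `hash_replace` are assumptions made for this example; they are not taken from the prior art being described.

```python
import hashlib

# Hypothetical code table; a real table would hold far more names.
CODE_TABLE = ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"]

def hash_replace(name, table=CODE_TABLE):
    """Deterministically map a name to one entry of the code table."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer index into the table.
    index = int.from_bytes(digest[:8], "big") % len(table)
    return table[index]

originals = ["Cai Yichun", "Deng Yumei", "Liu Sining", "Sun Yuanzhao", "Xia Yi"]
replaced = [hash_replace(name) for name in originals]
```

Whatever the input, at most `len(CODE_TABLE)` distinct replacements can appear, which reflects the limitation noted for this approach.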

In conclusion existing Chinese Name desensitization technology has that name data loses itself after will cause desensitization And it is unfavorable for the problem of data statistic analysis recycles.

Summary of the invention

The object of the invention is to overcome the above shortcomings of the prior art and to provide a Chinese name data desensitization method based on sort permutation that allows a desensitized name to retain the characteristic features of a name, which facilitates reuse of the data for statistical analysis.

To achieve the above object, the Chinese name data desensitization method based on sort permutation of the present invention comprises the following steps:

1) dividing the data in a Chinese personal name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;

2) storing the two classes of data obtained in step 1) and their vector forms in a database;

3) obtaining the name data to be desensitized;

4) converting the surname and the given name of the name data to be desensitized into vector form respectively;

5) obtaining the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized;

6) randomly selecting one of the K surname vectors obtained in step 5) to replace the surname of the name to be desensitized;

7) obtaining the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized;

8) randomly selecting one of the N given-name vectors obtained in step 7) to replace the given name of the name to be desensitized;

9) splicing the replacement surname and the replacement given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.
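Steps 5) to 9) can be sketched as follows. This is a minimal illustration and not the patented implementation: it assumes the vectors have already been produced by some model, that the database is represented as in-memory dictionaries, and that similarity is cosine similarity. The names `top_similar` and `desensitize` and the toy vectors are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_similar(query_vec, store, count):
    """Steps 5) and 7): the `count` entries of `store` most similar to `query_vec`."""
    ranked = sorted(store,
                    key=lambda key: cosine_similarity(query_vec, store[key]),
                    reverse=True)
    return ranked[:count]

def desensitize(surname_vec, given_vec, surname_store, given_store, k, n, rng=random):
    """Steps 6), 8) and 9): random pick among the nearest candidates, then splice."""
    new_surname = rng.choice(top_similar(surname_vec, surname_store, k))
    new_given = rng.choice(top_similar(given_vec, given_store, n))
    return new_surname + new_given

# Toy stores with made-up labels and vectors (assumptions for illustration only).
surname_store = {"A": (1.0, 0.0), "B": (0.9, 0.1), "C": (0.0, 1.0)}
given_store = {"x": (1.0, 1.0), "y": (1.0, 0.9), "z": (-1.0, 0.0)}
result = desensitize((1.0, 0.05), (1.0, 1.0), surname_store, given_store,
                     k=2, n=2, rng=random.Random(0))
```

Because the final choice is random, repeated runs on the same input can return different replacements, all drawn from the K and N nearest candidates.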

In step 1), a natural language processing model is used to divide the data in the Chinese personal name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form;

In step 4), a natural language processing model is used to convert the surname and the given name of the name data to be desensitized into vector form respectively.

The natural language processing model is a BERT model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.

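The cosine similarity cos θ between a surname vector in the database and the surname vector of the name to be desensitized is:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where A is a surname vector in the database, B is the surname vector of the name to be desensitized, and n is the dimension of the vectors.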

The invention has the following advantages:

In operation, the Chinese name data desensitization method based on sort permutation of the present invention selects the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized and randomly selects one of them to replace the surname of the name to be desensitized; it then obtains the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized and randomly selects one of them to replace the given name. This guarantees the security of personal data, so that name data desensitized by the invention can be used in an open environment. At the same time, the desensitized name data stays close to the characteristics of real names, is general, and is free of special characters such as rare surnames and rarely used characters. Because the desensitized name retains the characteristic features of a name, researchers can continue to mine and analyze the data after desensitization, achieving both data security and data availability after desensitization.

Brief description of the drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is a flow chart of the replacement process in the invention.

Specific embodiment

The invention is described in further detail below with reference to the accompanying drawings:

Referring to Fig. 1 and Fig. 2, the Chinese name data desensitization method based on sort permutation of the present invention comprises the following steps:

3) name data to be desensitized is obtained；

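The cosine similarity cos θ between a surname vector in the database and the surname vector of the name to be desensitized is:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where A is a surname vector in the database, B is the surname vector of the name to be desensitized, and n is the dimension of the vectors.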

When two vectors point in the same direction, that is, when they are similar in the statistical sense, the cosine similarity is 1; when the angle between two vectors is 90°, the cosine similarity is 0; when two vectors point in exactly opposite directions, that is, when they have no similarity at all in the statistical sense, the cosine similarity is -1.
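The three boundary cases can be checked numerically. The following is a small sketch in which the example vectors are arbitrary choices for illustration; results are compared with a tolerance because floating-point arithmetic may not return exactly 1 or -1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction: the second vector is a positive multiple of the first.
same_direction = cosine_similarity((1.0, 2.0), (2.0, 4.0))
# Perpendicular: the angle between the vectors is 90 degrees.
perpendicular = cosine_similarity((1.0, 0.0), (0.0, 3.0))
# Opposite direction: the second vector is a negative multiple of the first.
opposite = cosine_similarity((1.0, 2.0), (-1.0, -2.0))
```

Note that only the direction matters: scaling either vector by a positive factor leaves the cosine similarity unchanged.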

In step 1), a natural language processing model is used to divide the data in the Chinese personal name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form. In step 4), a natural language processing model is used to convert the surname and the given name of the name data to be desensitized into vector form respectively. The natural language processing model is a BERT model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.

Embodiment one

A small portion of data extracted from a Chinese personal name corpus on the scale of hundreds of millions of names is used as the corpus of this embodiment, as shown in Table 1:

Table 1

Cai Yichun	Deng Yumei
Liu Sining	Sun Yuanzhao
Xia Yi	Shi Shaoqing
Song Yin	Liu Yuwei
Zhao Yang	Xu Ningyao

As shown in Fig. 1, an applicable natural language processing technique is selected, such as a BERT model, an MT-DNN model, or a combination of a word embedding model such as word2vec with a deep learning model, to classify the text. The name data in the corpus is divided into surname data and given-name data and converted into vector form. For ease of description, this embodiment simplifies the text vectors to two-dimensional vectors for illustration, as shown in Table 2:

Table 2

Surname	Surname vector	Given name	Given-name vector
Cai	(39,490)	Yichun	(9532,19372)
Deng	(76,233)	Yumei	(8149,8370)
Liu	(378,5443)	Sining	(2762,495)
Sun	(958,327)	Yuanzhao	(64081,4229)
Xia	(21,43)	Yi	(3153,90)
Shi	(7162,283)	Shaoqing	(195,2658)
Song	(438,21601)	Yin	(945,354)
Liu	(372,5443)	Yuwei	(49723,1032)
Zhao	(3,871)	Yang	(546,313)
Xu	(9584,1265)	Ningyao	(52371,41428)
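Storing the two classes of data together with their vectors, as in step 2), might look like the sketch below. The choice of SQLite, the table and column names, the JSON encoding of the vectors and the helper `load_vectors` are all assumptions of this example; the patent does not specify a database or schema. The sample rows are the first three entries of Table 2, romanized in pinyin.

```python
import json
import sqlite3

# First three entries of Table 2: surname, surname vector, given name, given-name vector.
ROWS = [
    ("Cai", (39, 490), "Yichun", (9532, 19372)),
    ("Deng", (76, 233), "Yumei", (8149, 8370)),
    ("Liu", (378, 5443), "Sining", (2762, 495)),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE surname (label TEXT, vector TEXT)")
conn.execute("CREATE TABLE given_name (label TEXT, vector TEXT)")
# Vectors are serialized as JSON text so that any dimension can be stored.
conn.executemany("INSERT INTO surname VALUES (?, ?)",
                 [(s, json.dumps(sv)) for s, sv, _, _ in ROWS])
conn.executemany("INSERT INTO given_name VALUES (?, ?)",
                 [(g, json.dumps(gv)) for _, _, g, gv in ROWS])
conn.commit()

def load_vectors(table):
    """Read back (label, vector) pairs from one of the two tables."""
    if table not in ("surname", "given_name"):
        raise ValueError("unknown table: " + table)
    rows = conn.execute(f"SELECT label, vector FROM {table} ORDER BY rowid").fetchall()
    return [(label, tuple(json.loads(vec))) for label, vec in rows]

surname_vectors = load_vectors("surname")
given_name_vectors = load_vectors("given_name")
```

Keeping surnames and given names in separate tables mirrors the two-class split of step 1) and lets each be searched independently during replacement.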

With reference to Fig. 2, the data desensitization process is as follows:

The name data to be desensitized is obtained:

Yang Muke

Each name to be desensitized is divided into a surname and a given name using natural language processing technology, and both are converted into vector form:

Surname	Surname vector	Given name	Given-name vector
Yang	(546,313)	Muke	(2142,8234)

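When the text vectors are two-dimensional, with A = (A_1, A_2) and B = (B_1, B_2), the cosine similarity is:

$$\cos\theta = \frac{A_1 B_1 + A_2 B_2}{\sqrt{A_1^2 + A_2^2}\,\sqrt{B_1^2 + B_2^2}}$$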

Taking K=3, the cosine similarity between each vector in the surname replacement database and the vector of the surname "Yang" is calculated; the results are shown in Table 3:

Table 3

The three surname vectors with the greatest cosine similarity to the surname "Yang" are selected, namely the surnames "Sun", "Shi" and "Xu".
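The surname selection can be reproduced from the vectors in Table 2. The sketch below is an illustration only: it assumes pinyin romanization of the surnames (the two entries romanized as "Liu" are distinct surnames, so a list of pairs is used instead of a dictionary) and uses `heapq.nlargest` to pick the K=3 most similar entries. Ranked by decreasing similarity, the result is "Sun", "Xu", "Shi", which is the same set of three surnames selected in this embodiment.

```python
import heapq
import math
import random

# Surname vectors from Table 2 (pinyin romanization assumed).
SURNAMES = [
    ("Cai", (39, 490)), ("Deng", (76, 233)), ("Liu", (378, 5443)),
    ("Sun", (958, 327)), ("Xia", (21, 43)), ("Shi", (7162, 283)),
    ("Song", (438, 21601)), ("Liu", (372, 5443)), ("Zhao", (3, 871)),
    ("Xu", (9584, 1265)),
]

def cosine_similarity(a, b):
    """Two-dimensional cosine similarity of non-zero vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))

query = (546, 313)  # vector of the surname to be desensitized
top_k = heapq.nlargest(3, SURNAMES,
                       key=lambda item: cosine_similarity(query, item[1]))
top_names = [name for name, _ in top_k]

# Random selection among the K candidates; the outcome varies from run to run.
replacement = random.choice(top_names)
```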

One surname is randomly selected from the replacement candidates "Sun", "Shi" and "Xu" to replace the surname to be desensitized; for example, the surname "Shi" is randomly selected to replace the surname "Yang".

Taking N=4, the cosine similarity between each vector in the given-name replacement database and the vector of the given name "Muke" is calculated; the results are shown in Table 4:

Table 4

The four given-name vectors with the greatest cosine similarity to the given name "Muke" are selected, namely the given names "Yichun", "Shaoqing", "Yumei" and "Ningyao".

One given name is randomly selected from the replacement candidates "Yichun", "Shaoqing", "Yumei" and "Ningyao" to replace the given name to be desensitized; for example, the given name "Yumei" is randomly selected to replace the given name "Muke".

The desensitized name data "Shi Yumei" is thus formed, completing one replacement-based desensitization of a name.
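The given-name half of the worked example can be checked in the same way. This sketch assumes pinyin romanization of the given names in Table 2, takes the replacement surname "Shi" from the example as given, and joins surname and given name with a space for readability; these presentation choices are assumptions of the example.

```python
import math
import random

# Given-name vectors from Table 2 (pinyin romanization assumed).
GIVEN_NAMES = [
    ("Yichun", (9532, 19372)), ("Yumei", (8149, 8370)), ("Sining", (2762, 495)),
    ("Yuanzhao", (64081, 4229)), ("Yi", (3153, 90)), ("Shaoqing", (195, 2658)),
    ("Yin", (945, 354)), ("Yuwei", (49723, 1032)), ("Yang", (546, 313)),
    ("Ningyao", (52371, 41428)),
]

def cosine_similarity(a, b):
    """Two-dimensional cosine similarity of non-zero vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))

query = (2142, 8234)  # vector of the given name to be desensitized
# Score every stored given name, then keep the N=4 highest scores.
scored = sorted(((cosine_similarity(query, vec), name) for name, vec in GIVEN_NAMES),
                reverse=True)
candidates = [name for _, name in scored[:4]]

new_surname = "Shi"                      # surname chosen in the example
new_given = random.choice(candidates)    # random pick among the N candidates
desensitized = new_surname + " " + new_given
```

"Shi Yumei" is one of the four possible outcomes; the other three arise when the random choice falls on a different candidate.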

Claims

1. A Chinese name data desensitization method based on sort permutation, characterized by comprising the following steps:

1) dividing the data in a Chinese personal name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;

3) obtaining the name data to be desensitized;

9) splicing the replacement surname and the replacement given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.

2. The Chinese name data desensitization method based on sort permutation according to claim 1, characterized in that in step 1) a natural language processing model is used to divide the data in the Chinese personal name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form;

in step 4) a natural language processing model is used to convert the surname and the given name of the name data to be desensitized into vector form respectively.

3. The Chinese name data desensitization method based on sort permutation according to claim 2, characterized in that the natural language processing model is a BERT model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.

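4. The Chinese name data desensitization method based on sort permutation according to claim 1, characterized in that the cosine similarity cos θ between a surname vector in the database and the surname vector of the name to be desensitized is:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where A is a surname vector in the database, B is the surname vector of the name to be desensitized, and n is the dimension of the vectors.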