CN110222153A - A kind of Chinese Name data desensitization method based on sort permutation - Google Patents


Info

Publication number
CN110222153A
Authority
CN
China
Prior art keywords
name
data
vector
surname
desensitized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910485787.8A
Other languages
Chinese (zh)
Inventor
李辉
赵柯纯
龚政
孟雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
China Academy of Electronic and Information Technology of CETC
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201910485787.8A
Publication of CN110222153A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese name data desensitization method based on sort permutation, comprising the following steps: 1) dividing the data in a Chinese name corpus into two classes, surnames and given names, and converting them into vector form; 2) storing the two classes of data and their vector forms in a database; 3) obtaining the name data to be desensitized; 4) converting the surname and the given name of the name data to be desensitized into vector form respectively; 5) obtaining the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized; 6) randomly selecting one of the K surname vectors to replace the surname of the name to be desensitized; 7) obtaining the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized; 8) randomly selecting one of the N given-name vectors to replace the given name of the name to be desensitized, thereby obtaining the desensitized data. With this method, a desensitized name retains the characteristic features of names.

Description

A kind of Chinese Name data desensitization method based on sort permutation
Technical field
The invention belongs to the technical field of information security and relates to a Chinese name data desensitization method based on sort permutation.
Background art
In the big-data era, multi-source heterogeneous data contain a large amount of key information that has great commercial value for enterprises and individuals. This information also contains a large amount of personal private data, of which data that can be linked to a specific individual, such as a personal name, are the most important. Once leaked, such sensitive information may not only cause the individual various kinds of trouble but, in serious cases, damage personal reputation and endanger personal and property safety. In addition, publishing real user data for researchers to analyze and mine contributes greatly to scientific research, but it has also become one of the channels through which large amounts of user privacy are leaked.
Data desensitization refers to transforming certain sensitive information according to desensitization rules so as to remove its sensitivity and reliably protect privacy-sensitive data. Data desensitization was proposed to strike a balance between data protection and data availability: where customer security data or commercially sensitive data are involved, the real data are transformed, without violating system rules, and then provided to others for development, testing, or statistical analysis.
Language is the carrier of knowledge and thought. Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interaction between computers and human language. Word embedding is the collective name for a set of language modeling and representation learning techniques in natural language processing; in short, it maps each word or phrase to a vector of real numbers in a predefined vector space. Various models exist for constructing word embeddings, among which Word2vec and GloVe are widely used implementations. Nowadays, text classification in natural language processing mostly combines word vectors with deep neural networks. The present invention therefore proposes combining natural language processing technology with data desensitization, using the word-vector-based Chinese text classification capability of natural language processing technology.
At present, the existing desensitization techniques for Chinese names mainly include the following:
A) Replacing the name data directly with common names such as "Zhang San" or "Li Si". This leaves only a few identical names in the whole data table, so the distribution of the data can no longer be seen, which hinders statistical analysis.
B) Randomly permuting the name data: the code of each Chinese character of the original name is shifted by a random offset to generate another Chinese character. This random method causes the desensitized names to lose the original characteristics of Chinese names entirely and produces many rare characters.
C) Constructing a Chinese name code table and replacing each original name through a hash mapping. This preserves the diversity and distribution of the data, but it incurs large time and space overheads, and because the number of constructed names is limited, the real distribution characteristics still cannot be preserved.
In summary, with the existing Chinese name desensitization techniques, the desensitized name data lose the characteristics of names and are ill-suited to reuse in statistical analysis.
Summary of the invention
The object of the invention is to overcome the shortcomings of the above prior art and provide a Chinese name data desensitization method based on sort permutation. With this method, a desensitized name retains the characteristic features of names, which facilitates the reuse of the data in statistical analysis.
To achieve the above object, the Chinese name data desensitization method based on sort permutation according to the invention comprises the following steps:
1) dividing the data in a Chinese name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;
2) storing the two classes of data obtained in step 1) and their vector forms in a database;
3) obtaining the name data to be desensitized;
4) converting the surname and the given name of the name data to be desensitized into vector form respectively;
5) obtaining the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized;
6) randomly selecting one of the K surname vectors obtained in step 5) to replace the surname of the name to be desensitized;
7) obtaining the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized;
8) randomly selecting one of the N given-name vectors obtained in step 7) to replace the given name of the name to be desensitized;
9) splicing the replaced surname and the replaced given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.
In step 1), a natural language processing model is used to divide the data in the Chinese name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form;
In step 4), a natural language processing model is used to convert the surname and the given name of the name data to be desensitized into vector form respectively.
The natural language processing model is a Bert model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.
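The cosine similarity cos θ between a surname vector a in the database and the surname vector b of the name to be desensitized is:
$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\,\sqrt{\sum_{i=1}^{n} b_i^{2}}}$$
where n is the dimension of the vectors.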
The invention has the following advantages:
In operation, the Chinese name data desensitization method based on sort permutation according to the invention selects the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized and randomly selects one of them to replace the surname of the name to be desensitized; it then obtains the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized and randomly selects one of them to replace the given name. This protects personal data, so that the desensitized name data can be used in an open environment. At the same time, the desensitized name data remain close to the characteristics of real names, are general, and contain no special characters such as rare surnames or rare characters. Because the desensitized names retain the characteristic features of names, researchers can continue to mine and analyze the data after desensitization, which achieves both data security and data availability after desensitization.
Brief description of the drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 is a flow chart of the replacement process in the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings:
Referring to Fig. 1 and Fig. 2, the Chinese name data desensitization method based on sort permutation according to the invention comprises the following steps:
1) dividing the data in a Chinese name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;
2) storing the two classes of data obtained in step 1) and their vector forms in a database;
3) obtaining the name data to be desensitized;
4) converting the surname and the given name of the name data to be desensitized into vector form respectively;
5) obtaining the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized;
wherein the cosine similarity cos θ between a surname vector a in the database and the surname vector b of the name to be desensitized is:
$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\,\sqrt{\sum_{i=1}^{n} b_i^{2}}}$$
where n is the dimension of the vectors. When the two vectors point in the same direction, that is, they are fully similar in the statistical sense, the cosine similarity is 1; when the angle between the two vectors is 90°, the cosine similarity is 0; when the two vectors point in exactly opposite directions, that is, they have no similarity at all in the statistical sense, the cosine similarity is -1.
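The three boundary cases described above can be checked numerically. The following is a small Python sketch of the standard cosine similarity; the function name and the test vectors are chosen here for illustration and do not come from the patent.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity((1, 2), (2, 4))     # close to 1
orthogonal = cosine_similarity((1, 0), (0, 3))         # 0
opposite = cosine_similarity((1, 2), (-1, -2))         # close to -1
```

Because only the direction matters, scaling either vector by a positive factor leaves the similarity unchanged, which is why (1, 2) and (2, 4) count as fully similar.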
6) randomly selecting one of the K surname vectors obtained in step 5) to replace the surname of the name to be desensitized;
7) obtaining the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized;
8) randomly selecting one of the N given-name vectors obtained in step 7) to replace the given name of the name to be desensitized;
9) splicing the replaced surname and the replaced given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.
In step 1), a natural language processing model is used to divide the data in the Chinese name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form; in step 4), a natural language processing model is used to convert the surname and the given name of the name data to be desensitized into vector form respectively, wherein the natural language processing model is a Bert model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.
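Steps 4) to 9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names, the `embed` callback, and the `sep` parameter are hypothetical, and the vectorization model is replaced by a plain lookup of two-dimensional example vectors, with names written in pinyin.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vec, database, count):
    """Steps 5) and 7): the `count` (text, vector) entries closest to query_vec."""
    return heapq.nlargest(
        count, database, key=lambda entry: cosine_similarity(query_vec, entry[1])
    )


def desensitize(surname, given_name, embed, surname_db, given_db, k, n, rng, sep=""):
    """Steps 4) to 9) for one name that is already split into its two parts."""
    surname_vec = embed(surname)        # step 4
    given_vec = embed(given_name)
    new_surname = rng.choice(most_similar(surname_vec, surname_db, k))[0]  # steps 5-6
    new_given = rng.choice(most_similar(given_vec, given_db, n))[0]        # steps 7-8
    return f"{new_surname}{sep}{new_given}"                                # step 9


# Toy stand-in for the vectorization model: a fixed lookup table (assumption).
toy_vectors = {"Yang": (546, 313), "Muke": (2142, 8234)}
surname_db = [("Sun", (958, 327)), ("Shi", (7162, 283)),
              ("Xu", (9584, 1265)), ("Zhao", (3, 871))]
given_db = [("Yumei", (8149, 8370)), ("Shaoqing", (195, 2658)), ("Yi", (3153, 90))]

result = desensitize("Yang", "Muke", toy_vectors.__getitem__,
                     surname_db, given_db, k=2, n=2,
                     rng=random.Random(0), sep=" ")
```

Passing the random generator in explicitly keeps the sketch reproducible for testing; a deployment would more plausibly draw from an unseeded source so that replacements cannot be predicted.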
Embodiment one
A small portion of data is extracted from a Chinese name corpus containing on the order of hundreds of millions of names to serve as the corpus of this embodiment, as shown in Table 1:
Table 1
Cai Yichun
Deng Yumei
Liu Sining
Sun Yuanzhao
Xia Yi
Shi Shaoqing
Song Yin
Liu Yuwei
Zhao Yang
Xu Ningyao
As shown in Fig. 1, a suitable natural language processing technique is selected, such as the Bert model, the MT-DNN model, or a combination of a word embedding model such as word2vec with a deep learning model, to classify the text: the name data in the corpus are divided into surname data and given-name data and converted into vector form. For ease of description, this embodiment simplifies the text vectors to two-dimensional vectors, as shown in Table 2:
Table 2
Surname   Surname vector   Given name   Given-name vector
Cai       (39, 490)        Yichun       (9532, 19372)
Deng      (76, 233)        Yumei        (8149, 8370)
Liu       (378, 5443)      Sining       (2762, 495)
Sun       (958, 327)       Yuanzhao     (64081, 4229)
Xia       (21, 43)         Yi           (3153, 90)
Shi       (7162, 283)      Shaoqing     (195, 2658)
Song      (438, 21601)     Yin          (945, 354)
Liu       (372, 5443)      Yuwei        (49723, 1032)
Zhao      (3, 871)         Yang         (546, 313)
Xu        (9584, 1265)     Ningyao      (52371, 41428)
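Step 2), storing the two classes of data together with their vectors, could look roughly like the sketch below. It uses Python's built-in sqlite3 module with an in-memory database; the table name `name_part`, its columns, and the JSON encoding of the vectors are assumptions made for illustration, since the patent does not specify a database or schema. Names are written in pinyin.

```python
import json
import sqlite3

# The two classes of data from Table 2: (text, two-dimensional vector).
SURNAMES = [
    ("Cai", (39, 490)), ("Deng", (76, 233)), ("Liu", (378, 5443)),
    ("Sun", (958, 327)), ("Xia", (21, 43)), ("Shi", (7162, 283)),
    ("Song", (438, 21601)), ("Liu", (372, 5443)), ("Zhao", (3, 871)),
    ("Xu", (9584, 1265)),
]
GIVEN_NAMES = [
    ("Yichun", (9532, 19372)), ("Yumei", (8149, 8370)), ("Sining", (2762, 495)),
    ("Yuanzhao", (64081, 4229)), ("Yi", (3153, 90)), ("Shaoqing", (195, 2658)),
    ("Yin", (945, 354)), ("Yuwei", (49723, 1032)), ("Yang", (546, 313)),
    ("Ningyao", (52371, 41428)),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name_part ("
    " id INTEGER PRIMARY KEY,"
    " kind TEXT NOT NULL,"      # 'surname' or 'given'
    " text TEXT NOT NULL,"
    " vector TEXT NOT NULL)"    # vector serialized as a JSON array
)
rows = [("surname", text, json.dumps(vec)) for text, vec in SURNAMES]
rows += [("given", text, json.dumps(vec)) for text, vec in GIVEN_NAMES]
conn.executemany("INSERT INTO name_part (kind, text, vector) VALUES (?, ?, ?)", rows)
conn.commit()


def load_vectors(connection, kind):
    """Read back all (text, vector) pairs of one class in insertion order."""
    cursor = connection.execute(
        "SELECT text, vector FROM name_part WHERE kind = ? ORDER BY id", (kind,)
    )
    return [(text, tuple(json.loads(vec))) for text, vec in cursor]


stored_surnames = load_vectors(conn, "surname")
stored_given_names = load_vectors(conn, "given")
```

The text column is deliberately not unique, because Table 2 lists the surname rendered as "Liu" twice with different vectors.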
Referring to Fig. 2, the data desensitization process is as follows:
The name data to be desensitized are obtained:
Yang Muke
Each name to be desensitized is divided into a surname and a given name using the natural language processing technique and converted into vector form, specifically:
Yang (546, 313)    Muke (2142, 8234)
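The patent performs the surname and given-name split with a natural language processing model. As a much simpler stand-in, a rule-based split is sketched below in Python; it is not the method the patent describes. The function name, the short list of compound surnames, and the example names are all assumptions chosen for illustration.

```python
# A few two-character compound surnames; illustrative, not exhaustive.
COMPOUND_SURNAMES = {"欧阳", "司马", "诸葛"}


def split_name(full_name):
    """Split a Chinese name into (surname, given name) by a simple rule.

    The first two characters are taken as the surname when they form a known
    compound surname and a given name remains; otherwise the first character
    is the surname.
    """
    if len(full_name) < 2:
        raise ValueError("a full name needs a surname and a given name")
    if len(full_name) > 2 and full_name[:2] in COMPOUND_SURNAMES:
        return full_name[:2], full_name[2:]
    return full_name[0], full_name[1:]


single = split_name("张三")      # single-character surname
compound = split_name("欧阳修")  # compound surname
```

A fixed rule like this cannot resolve names whose first two characters could be read either as a compound surname or as a surname plus the first character of the given name, which is one reason a learned model may be preferred.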
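When the text vectors are two-dimensional, with a = (x_1, y_1) and b = (x_2, y_2), the cosine similarity is:
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^{2} + y_1^{2}}\,\sqrt{x_2^{2} + y_2^{2}}}$$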
Taking K=3, the cosine similarity between each vector in the surname replacement database and the vector of the surname "Yang" is calculated; the results are shown in Table 3:
Table 3
The three surname vectors with the greatest cosine similarity to the surname "Yang" are selected, namely the surnames "Sun", "Shi", and "Xu".
One surname is randomly selected from the replacement vectors "Sun", "Shi", and "Xu" to replace the surname to be desensitized; for example, the surname "Shi" is randomly selected to replace the surname "Yang".
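The surname ranking for K=3 can be recomputed from the two-dimensional vectors of Table 2. The Python sketch below does so; the variable names and the fixed random seed are assumptions for illustration, and the names are written in pinyin.

```python
import heapq
import math
import random

# Surname vectors from Table 2 (both "Liu" rows are kept as listed).
SURNAME_VECTORS = [
    ("Cai", (39, 490)), ("Deng", (76, 233)), ("Liu", (378, 5443)),
    ("Sun", (958, 327)), ("Xia", (21, 43)), ("Shi", (7162, 283)),
    ("Song", (438, 21601)), ("Liu", (372, 5443)), ("Zhao", (3, 871)),
    ("Xu", (9584, 1265)),
]
QUERY = (546, 313)  # vector of the surname to be desensitized, "Yang"
K = 3


def cosine_similarity(a, b):
    """Two-dimensional cosine similarity."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))


# Step 5: the K entries with the greatest similarity, best first.
top_k = heapq.nlargest(
    K, SURNAME_VECTORS, key=lambda entry: cosine_similarity(QUERY, entry[1])
)
top_k_surnames = [surname for surname, _ in top_k]

# Step 6: pick one candidate at random (seeded here only for repeatability).
replacement_surname = random.Random(2019).choice(top_k_surnames)

for surname, vector in top_k:
    print(f"{surname}: {cosine_similarity(QUERY, vector):.4f}")
```

Ranked this way, the three candidates are the same three surnames that the embodiment selects.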
Taking N=4, the cosine similarity between each vector in the given-name replacement database and the vector of the given name "Muke" is calculated; the results are shown in Table 4:
Table 4
The four given-name vectors with the greatest cosine similarity to the given name "Muke" are selected, namely the given names "Yichun", "Shaoqing", "Yumei", and "Ningyao".
One given name is randomly selected from the replacement vectors "Yichun", "Shaoqing", "Yumei", and "Ningyao" to replace the given name to be desensitized; for example, the given name "Yumei" is randomly selected to replace the given name "Muke".
The desensitized name data "Shi Yumei" are thus formed, completing one replacement-based desensitization of the name.
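The given-name ranking for N=4 and the final splice can be reproduced in the same way. The Python sketch below ranks the given-name vectors of Table 2, then enumerates every name the embodiment could output for this input; the variable names and the seeded generator are illustrative assumptions, and the three surname candidates are taken from the text above rather than recomputed.

```python
import math
import random

# Given-name vectors from Table 2, names written in pinyin.
GIVEN_NAME_VECTORS = [
    ("Yichun", (9532, 19372)), ("Yumei", (8149, 8370)), ("Sining", (2762, 495)),
    ("Yuanzhao", (64081, 4229)), ("Yi", (3153, 90)), ("Shaoqing", (195, 2658)),
    ("Yin", (945, 354)), ("Yuwei", (49723, 1032)), ("Yang", (546, 313)),
    ("Ningyao", (52371, 41428)),
]
QUERY = (2142, 8234)  # vector of the given name to be desensitized, "Muke"
N = 4
SURNAME_CANDIDATES = ["Sun", "Shi", "Xu"]  # result of the K=3 surname step


def cosine_similarity(a, b):
    """Two-dimensional cosine similarity."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))


# Step 7: rank all given names by similarity and keep the best N.
ranked = sorted(
    GIVEN_NAME_VECTORS,
    key=lambda entry: cosine_similarity(QUERY, entry[1]),
    reverse=True,
)
top_n_given_names = [name for name, _ in ranked[:N]]

# Steps 8 and 9: random choice of each part, then splice.
rng = random.Random(0)
desensitized = f"{rng.choice(SURNAME_CANDIDATES)} {rng.choice(top_n_given_names)}"

# Every output the procedure can produce for this input: K x N combinations.
all_outputs = {f"{s} {g}" for s in SURNAME_CANDIDATES for g in top_n_given_names}
```

With K=3 and N=4 there are twelve possible outputs, and the example result "Shi Yumei" is one of them. Larger K and N widen this set at the cost of admitting less similar replacements.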

Claims (4)

1. A Chinese name data desensitization method based on sort permutation, characterized by comprising the following steps:
1) dividing the data in a Chinese name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;
2) storing the two classes of data obtained in step 1) and their vector forms in a database;
3) obtaining the name data to be desensitized;
4) converting the surname and the given name of the name data to be desensitized into vector form respectively;
5) obtaining the K surname vectors in the database with the greatest similarity to the surname vector of the name to be desensitized;
6) randomly selecting one of the K surname vectors obtained in step 5) to replace the surname of the name to be desensitized;
7) obtaining the N given-name vectors in the database with the greatest similarity to the given-name vector of the name to be desensitized;
8) randomly selecting one of the N given-name vectors obtained in step 7) to replace the given name of the name to be desensitized;
9) splicing the replaced surname and the replaced given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.
2. The Chinese name data desensitization method based on sort permutation according to claim 1, characterized in that in step 1) a natural language processing model is used to divide the data in the Chinese name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form;
in step 4) a natural language processing model is used to convert the surname and the given name of the name data to be desensitized into vector form respectively.
3. The Chinese name data desensitization method based on sort permutation according to claim 2, characterized in that the natural language processing model is a Bert model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.
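4. The Chinese name data desensitization method based on sort permutation according to claim 1, characterized in that the cosine similarity cos θ between a surname vector a in the database and the surname vector b of the name to be desensitized is:
$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\,\sqrt{\sum_{i=1}^{n} b_i^{2}}}$$
where n is the dimension of the vectors.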
CN201910485787.8A 2019-06-05 2019-06-05 A kind of Chinese Name data desensitization method based on sort permutation Pending CN110222153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910485787.8A CN110222153A (en) 2019-06-05 2019-06-05 A kind of Chinese Name data desensitization method based on sort permutation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910485787.8A CN110222153A (en) 2019-06-05 2019-06-05 A kind of Chinese Name data desensitization method based on sort permutation

Publications (1)

Publication Number Publication Date
CN110222153A true CN110222153A (en) 2019-09-10

Family

ID=67819410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910485787.8A Pending CN110222153A (en) 2019-06-05 2019-06-05 A kind of Chinese Name data desensitization method based on sort permutation

Country Status (1)

Country Link
CN (1) CN110222153A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951562A (en) * 2017-04-01 2017-07-14 北京数聚世界信息技术有限公司 A kind of desensitization method and device of Chinese Name data
CN109739956A (en) * 2018-11-08 2019-05-10 第四范式(北京)技术有限公司 Corpus cleaning method, device, equipment and medium
CN109829328A (en) * 2018-12-19 2019-05-31 上海晶赞融宣科技有限公司 Data desensitization, inverse desensitization method and device, storage medium, terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200602

Address after: 710071, No. 2 Taibai South Road, Yanta District, Shaanxi, Xi'an

Applicant after: XIDIAN University

Applicant after: CHINA ACADEMY OF ELECTRONICS AND INFORMATION TECHNOLOGY OF CETC

Address before: 710071 No. 2 Taibai South Road, Beilin District, Xi'an City, Shaanxi Province

Applicant before: XIDIAN University

RJ01 Rejection of invention patent application after publication

Application publication date: 20190910