CN110222153A - Chinese name data desensitization method based on sort permutation - Google Patents
Chinese name data desensitization method based on sort permutation
- Publication number
- CN110222153A (application number CN201910485787.8A)
- Authority
- CN
- China
- Prior art keywords
- name
- data
- vector
- surname
- desensitized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioethics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Chinese name data desensitization method based on sort permutation, comprising the following steps: 1) dividing the data in a Chinese personal-name corpus into two classes, surnames and given names, and converting them into vector form; 2) storing the two classes of data and their vector forms in a database; 3) obtaining the name data to be desensitized; 4) converting the surname and the given name of the name to be desensitized into vector form; 5) obtaining the K surname vectors in the database that are most similar to the surname vector of the name to be desensitized; 6) randomly selecting one of the K surname vectors to replace the surname of the name to be desensitized; 7) obtaining the N given-name vectors in the database that are most similar to the given-name vector of the name to be desensitized; 8) randomly selecting one of the N given-name vectors to replace the given name of the name to be desensitized, thereby obtaining the desensitized data. The method allows a desensitized name to retain the characteristics of a real name.
Description
Technical field
The invention belongs to the field of information security technology and relates to a Chinese name data desensitization method based on sort permutation.
Background technique
In the big-data era, multi-source heterogeneous data contain a large amount of key information that has great commercial value for enterprises and individuals. The same information also contains a large amount of personal private data, among which personal names, which can point to a specific individual, are the most critical. Once leaked, such sensitive information may not only cause the individual various kinds of trouble but, in serious cases, damage the individual's reputation and endanger personal and property safety. In addition, publishing real user data for researchers to analyze and mine, while a major contribution to scientific research, has become one of the channels through which large amounts of user privacy are leaked.
Data desensitization refers to transforming certain sensitive information according to desensitization rules so as to remove its sensitivity and reliably protect private data. It was proposed in order to strike a balance between data protection and data availability: where customer security data or commercially sensitive data are involved, the real data are transformed, without violating system rules, and then provided to others for development, testing or statistical analysis.
Language is the carrier of knowledge and thought. Natural language processing (NLP) is a field of computer science, artificial intelligence and linguistics concerned with the interaction between computers and human language. Word embedding is the collective name for a set of language-modeling and representation-learning techniques in NLP; in short, it maps each word or phrase to a real-valued vector in a predefined vector space. Several models exist for building word embeddings, of which word2vec and GloVe are among the most widely used. Today, text classification in NLP is mostly carried out by combining word vectors with deep neural networks. The present invention therefore proposes combining natural language processing with data desensitization, making use of the word-vector-based Chinese text classification capability of NLP.
At present, the existing desensitization techniques for Chinese names broadly include the following:
A) The name data are directly replaced with common placeholder names such as "Zhang San" and "Li Si". As a result, the whole data table contains only a few identical names, the distribution of the data can no longer be seen, and statistical use of the data is hindered.
B) The name data are randomly permuted: the code of each Chinese character in the original name is shifted by a random offset to produce another Chinese character. The desensitized names produced this way completely lose the features of real Chinese names and contain many rarely used characters.
C) A Chinese name code table is constructed, and each original name is replaced through a hash mapping into that table. The diversity and distribution of the data are retained, but the approach incurs a large time and space overhead, and because the number of constructed names is limited, the real distribution characteristics still cannot be preserved.
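To make the limitation of approach C) concrete, the following is a minimal Python sketch of a hash-mapped code table. The table contents, the function name `hash_replace` and the choice of SHA-256 are illustrative assumptions and are not taken from the prior art being described.

```python
import hashlib

# Hypothetical, deliberately tiny code table; a real table would be far larger.
CODE_TABLE = ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"]


def hash_replace(name, table=CODE_TABLE):
    """Map a name to a fixed table entry through a stable hash (approach C)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(table)
    return table[index]


# However many distinct inputs are mapped, the output can never contain
# more distinct names than the code table holds.
outputs = {hash_replace(f"name-{i}") for i in range(100)}
```

The mapping is deterministic, so equal inputs stay equal after desensitization, but the variety of the output is capped by the size of the table, which is the limitation noted above.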
In conclusion existing Chinese Name desensitization technology has that name data loses itself after will cause desensitization
And it is unfavorable for the problem of data statistic analysis recycles.
Summary of the invention
The object of the invention is to overcome the above shortcomings of the prior art by providing a Chinese name data desensitization method based on sort permutation. The method allows a desensitized name to retain the characteristics of a real name, which facilitates reuse of the data for statistical analysis.
To achieve the above object, the Chinese name data desensitization method based on sort permutation according to the invention comprises the following steps:
1) dividing the data in a Chinese personal-name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;
2) storing the two classes of data obtained in step 1) and their vector forms in a database;
3) obtaining the name data to be desensitized;
4) converting the surname and the given name of the name to be desensitized into vector form;
5) obtaining the K surname vectors in the database that are most similar to the surname vector of the name to be desensitized;
6) randomly selecting one of the K surname vectors obtained in step 5) to replace the surname of the name to be desensitized;
7) obtaining the N given-name vectors in the database that are most similar to the given-name vector of the name to be desensitized;
8) randomly selecting one of the N given-name vectors obtained in step 7) to replace the given name of the name to be desensitized;
9) splicing the replacement surname and the replacement given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.
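Steps 4) to 9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the dictionary-based "database" and the toy two-dimensional vectors are all hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_similar(query_vec, table, count):
    """Return the `count` keys of `table` whose vectors are most similar to `query_vec`."""
    ranked = sorted(table, key=lambda key: cosine(query_vec, table[key]), reverse=True)
    return ranked[:count]


def desensitize(surname_vec, given_vec, surname_db, given_db, k, n, rng=random):
    new_surname = rng.choice(top_similar(surname_vec, surname_db, k))  # steps 5) and 6)
    new_given = rng.choice(top_similar(given_vec, given_db, n))        # steps 7) and 8)
    return new_surname + new_given                                     # step 9)


# Toy data (hypothetical): keys stand in for surname / given-name strings.
surname_db = {"A": (1.0, 0.0), "B": (0.9, 0.1), "C": (0.0, 1.0)}
given_db = {"x": (1.0, 1.0), "y": (1.0, 0.9), "z": (-1.0, 0.0)}
result = desensitize((1.0, 0.05), (1.0, 1.0), surname_db, given_db,
                     k=2, n=2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes a run repeatable for testing; in production use the choice would presumably be left unseeded so that the replacement cannot be predicted.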
In step 1), a natural language processing model is used to divide the data in the Chinese personal-name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form.
In step 4), a natural language processing model is used to convert the surname and the given name of the name to be desensitized into vector form.
The natural language processing model is a Bert model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.
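The split into surname and given name is performed in the method by a natural language processing model. As a much simpler stand-in, shown only to make the input and output of that step concrete, the following Python sketch splits a name with a longest-match lookup against a small list of two-character surnames. The lookup technique, the list `COMPOUND_SURNAMES` and the function `split_name` are assumptions and are not the model-based approach described above.

```python
# Hypothetical, far from complete list of two-character (compound) surnames.
COMPOUND_SURNAMES = {"欧阳", "司马", "诸葛", "上官"}


def split_name(full_name):
    """Split a Chinese name into (surname, given name) by longest surname match."""
    if len(full_name) > 2 and full_name[:2] in COMPOUND_SURNAMES:
        return full_name[:2], full_name[2:]
    # Default: treat the first character as a single-character surname.
    return full_name[:1], full_name[1:]
```

For example, `split_name("诸葛亮")` yields `("诸葛", "亮")`, while `split_name("李白")` yields `("李", "白")`. The conversion of each part into a vector is not shown, since it depends on the chosen embedding model.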
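The cosine similarity cos θ between a surname vector A in the database and the surname vector B of the name to be desensitized, both of dimension n, is:
$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$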
The invention has the following advantages:
In operation, the Chinese name data desensitization method based on sort permutation selects the K surname vectors in the database that are most similar to the surname vector of the name to be desensitized and randomly selects one of them to replace the surname; it then obtains the N given-name vectors in the database that are most similar to the given-name vector of the name to be desensitized and randomly selects one of them to replace the given name. This protects the security of personal data, so that name data desensitized by the invention can be used in an open environment. At the same time, the desensitized name data stay close to the features of real names, are general, and are free of special characters such as rare surnames and rarely used characters. Because the desensitized names retain the characteristics of real names, researchers can continue data mining and data analysis after desensitization, so that both data security and the availability of the desensitized data are ensured.
Description of the drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 is a flow chart of the replacement process in the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings:
Referring to Fig. 1 and Fig. 2, the Chinese name data desensitization method based on sort permutation according to the invention comprises the following steps:
1) dividing the data in a Chinese personal-name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;
2) storing the two classes of data obtained in step 1) and their vector forms in a database;
3) obtaining the name data to be desensitized;
4) converting the surname and the given name of the name to be desensitized into vector form;
5) obtaining the K surname vectors in the database that are most similar to the surname vector of the name to be desensitized;
wherein the cosine similarity cos θ between a surname vector A in the database and the surname vector B of the name to be desensitized, both of dimension n, is:
$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
When two vectors point in the same direction, that is, when they are maximally similar in the statistical sense, the cosine similarity is 1; when the angle between the two vectors is 90°, the cosine similarity is 0; when the two vectors point in exactly opposite directions, that is, when they have no similarity at all in the statistical sense, the cosine similarity is -1.
6) randomly selecting one of the K surname vectors obtained in step 5) to replace the surname of the name to be desensitized;
7) obtaining the N given-name vectors in the database that are most similar to the given-name vector of the name to be desensitized;
8) randomly selecting one of the N given-name vectors obtained in step 7) to replace the given name of the name to be desensitized;
9) splicing the replacement surname and the replacement given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.
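The cosine similarity used in step 5) and its three reference values (1, 0 and -1) can be checked with a short Python sketch. The function name `cosine_similarity` and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity((1.0, 0.0), (2.0, 0.0))    # angle 0°
perpendicular = cosine_similarity((1.0, 0.0), (0.0, 3.0))     # angle 90°
opposite = cosine_similarity((1.0, 0.0), (-2.0, 0.0))         # angle 180°
```

Only the direction of the vectors matters, not their length, which is why `(1, 0)` and `(2, 0)` count as identical here. A zero vector has no direction, so this sketch raises `ZeroDivisionError` for it rather than returning a value.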
In step 1), a natural language processing model is used to divide the data in the Chinese personal-name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form. In step 4), a natural language processing model is used to convert the surname and the given name of the name to be desensitized into vector form. The natural language processing model is a Bert model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.
Embodiment one
A small portion of the data is extracted from a Chinese personal-name corpus containing on the order of hundreds of millions of names to serve as the corpus of this embodiment, as shown in Table 1:
Table 1
Cai Yichun |
Deng Yumei |
Liu Sining |
Sun Yuanzhao |
Xia Yi |
Shi Shaoqing |
Song Yin |
Liu Yuwei |
Zhao Yang |
Xu Ningyao |
As shown in Fig. 1, an applicable natural language processing technique is selected, such as a Bert model, an MT-DNN model, or a word embedding model such as word2vec combined with a deep learning model, to classify the text: the name data in the corpus are divided into surname data and given-name data and converted into vector form. For ease of description, this embodiment simplifies the text vectors to two-dimensional vectors, as shown in Table 2:
Table 2
Surname | Surname vector | Given name | Given-name vector |
---|---|---|---|
Cai | (39,490) | Yichun | (9532,19372) |
Deng | (76,233) | Yumei | (8149,8370) |
Liu | (378,5443) | Sining | (2762,495) |
Sun | (958,327) | Yuanzhao | (64081,4229) |
Xia | (21,43) | Yi | (3153,90) |
Shi | (7162,283) | Shaoqing | (195,2658) |
Song | (438,21601) | Yin | (945,354) |
Liu | (372,5443) | Yuwei | (49723,1032) |
Zhao | (3,871) | Yang | (546,313) |
Xu | (9584,1265) | Ningyao | (52371,41428) |
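Step 2), storing the two classes of data and their vectors in a database, might look like the following Python sketch using an in-memory SQLite database. The schema, the table names `surname` and `given_name`, and the helper `load_vectors` are assumptions; the source does not specify a database engine or layout. The rows are those of Table 2, with the names romanised in pinyin as in Table 1.

```python
import sqlite3

SURNAMES = [("Cai", 39, 490), ("Deng", 76, 233), ("Liu", 378, 5443),
            ("Sun", 958, 327), ("Xia", 21, 43), ("Shi", 7162, 283),
            ("Song", 438, 21601), ("Liu", 372, 5443), ("Zhao", 3, 871),
            ("Xu", 9584, 1265)]
GIVEN_NAMES = [("Yichun", 9532, 19372), ("Yumei", 8149, 8370),
               ("Sining", 2762, 495), ("Yuanzhao", 64081, 4229),
               ("Yi", 3153, 90), ("Shaoqing", 195, 2658), ("Yin", 945, 354),
               ("Yuwei", 49723, 1032), ("Yang", 546, 313),
               ("Ningyao", 52371, 41428)]

conn = sqlite3.connect(":memory:")
for table, rows in (("surname", SURNAMES), ("given_name", GIVEN_NAMES)):
    # A surrogate id is the key because the same romanised text can occur
    # twice with different vectors (see the two "Liu" rows).
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, "
                 "text TEXT NOT NULL, x REAL NOT NULL, y REAL NOT NULL)")
    conn.executemany(f"INSERT INTO {table} (text, x, y) VALUES (?, ?, ?)", rows)
conn.commit()


def load_vectors(conn, table):
    """Read back (text, (x, y)) pairs from one of the two tables."""
    query = f"SELECT text, x, y FROM {table} ORDER BY id"
    return [(text, (x, y)) for text, x, y in conn.execute(query)]
```

Keeping the surnames and given names in separate tables mirrors the two-class split of step 1) and lets each similarity search scan only the relevant class.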
With reference to Fig. 2, the data desensitization process is as follows:
The name data to be desensitized are obtained:
Yang Muke |
Each name to be desensitized is divided into a surname and a given name with the natural language processing technique and converted into vector form, specifically:
Yang | (546,313) | Muke | (2142,8234) |
When the text vectors are two-dimensional, the cosine similarity of two vectors $(x_1, y_1)$ and $(x_2, y_2)$ is
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$
Take K=3. The cosine similarity between each vector in the surname replacement database and the vector of the surname "Yang" is calculated; the results are shown in Table 3:
Table 3
The three surname vectors with the largest cosine similarity to the surname "Yang" are selected, namely the surnames "Sun", "Shi" and "Xu".
One surname is randomly selected from the replacement vectors "Sun", "Shi" and "Xu" to replace the surname to be desensitized; for example, the surname "Shi" is randomly selected to replace the surname "Yang".
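The surname selection for K=3 can be reproduced from the Table 2 vectors with the following Python sketch. The variable and function names are assumptions, and the surnames are written in pinyin as in Table 1; the vectors themselves are the ones given in the embodiment.

```python
import math
import random

# Surname vectors from Table 2 (the two "Liu" rows have different vectors).
SURNAME_VECTORS = [
    ("Cai", (39, 490)), ("Deng", (76, 233)), ("Liu", (378, 5443)),
    ("Sun", (958, 327)), ("Xia", (21, 43)), ("Shi", (7162, 283)),
    ("Song", (438, 21601)), ("Liu", (372, 5443)), ("Zhao", (3, 871)),
    ("Xu", (9584, 1265)),
]


def cosine_2d(a, b):
    """Two-dimensional cosine similarity."""
    return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))


query = (546, 313)  # vector of the surname to be desensitized
K = 3
ranked = sorted(SURNAME_VECTORS, key=lambda item: cosine_2d(query, item[1]), reverse=True)
top_k = [surname for surname, _ in ranked[:K]]

# Random selection among the K candidates (seeded here only for repeatability).
replacement_surname = random.Random(0).choice(top_k)
```

With these vectors the three best candidates have similarities of roughly 0.98, 0.93 and 0.89, and they are the same three surnames named in the embodiment.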
Take N=4. The cosine similarity between each vector in the given-name replacement database and the vector of the given name "Muke" is calculated; the results are shown in Table 4:
Table 4
The four given-name vectors with the largest cosine similarity to the given name "Muke" are selected, namely the given names "Yichun", "Shaoqing", "Yumei" and "Ningyao".
One given name is randomly selected from the replacement vectors "Yichun", "Shaoqing", "Yumei" and "Ningyao" to replace the given name to be desensitized; for example, the given name "Yumei" is randomly selected to replace the given name "Muke".
The desensitized name "Shi Yumei" is then formed, completing one permutation-based desensitization of a name.
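The given-name selection for N=4 and the final splicing step can be sketched in Python in the same way. The names used in the code are assumptions, the given names are written in pinyin as in Table 1, and the replacement surname "Shi" is simply taken over from the example in the text.

```python
import math
import random

# Given-name vectors from Table 2.
GIVEN_NAME_VECTORS = [
    ("Yichun", (9532, 19372)), ("Yumei", (8149, 8370)), ("Sining", (2762, 495)),
    ("Yuanzhao", (64081, 4229)), ("Yi", (3153, 90)), ("Shaoqing", (195, 2658)),
    ("Yin", (945, 354)), ("Yuwei", (49723, 1032)), ("Yang", (546, 313)),
    ("Ningyao", (52371, 41428)),
]


def cosine_2d(a, b):
    """Two-dimensional cosine similarity."""
    return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))


query = (2142, 8234)  # vector of the given name to be desensitized
N = 4
ranked = sorted(GIVEN_NAME_VECTORS, key=lambda item: cosine_2d(query, item[1]), reverse=True)
top_n = [given for given, _ in ranked[:N]]

replacement_surname = "Shi"                       # chosen in the surname step of the example
replacement_given = random.Random(0).choice(top_n)
desensitized_name = f"{replacement_surname} {replacement_given}"
```

Because the final pick is random, a run may produce any of the four candidates; "Shi Yumei" in the text is one possible outcome, not the only one.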
Claims (4)
1. A Chinese name data desensitization method based on sort permutation, characterized by comprising the following steps:
1) dividing the data in a Chinese personal-name corpus into two classes, surnames and given names, and converting the two classes of data into vector form;
2) storing the two classes of data obtained in step 1) and their vector forms in a database;
3) obtaining the name data to be desensitized;
4) converting the surname and the given name of the name to be desensitized into vector form;
5) obtaining the K surname vectors in the database that are most similar to the surname vector of the name to be desensitized;
6) randomly selecting one of the K surname vectors obtained in step 5) to replace the surname of the name to be desensitized;
7) obtaining the N given-name vectors in the database that are most similar to the given-name vector of the name to be desensitized;
8) randomly selecting one of the N given-name vectors obtained in step 7) to replace the given name of the name to be desensitized;
9) splicing the replacement surname and the replacement given name of the name to be desensitized to obtain the desensitized data, completing the Chinese name data desensitization based on sort permutation.
2. The Chinese name data desensitization method based on sort permutation according to claim 1, characterized in that in step 1) a natural language processing model is used to divide the data in the Chinese personal-name corpus into two classes, surnames and given names, and to convert the two classes of data into vector form;
and in step 4) a natural language processing model is used to convert the surname and the given name of the name to be desensitized into vector form.
3. The Chinese name data desensitization method based on sort permutation according to claim 2, characterized in that the natural language processing model is a Bert model, an MT-DNN model, or a combination of a word embedding model and a deep learning model.
4. The Chinese name data desensitization method based on sort permutation according to claim 1, characterized in that the cosine similarity cos θ between a surname vector A in the database and the surname vector B of the name to be desensitized, both of dimension n, is:
$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910485787.8A CN110222153A (en) | 2019-06-05 | 2019-06-05 | A kind of Chinese Name data desensitization method based on sort permutation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910485787.8A CN110222153A (en) | 2019-06-05 | 2019-06-05 | A kind of Chinese Name data desensitization method based on sort permutation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110222153A true CN110222153A (en) | 2019-09-10 |
Family
ID=67819410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910485787.8A Pending CN110222153A (en) | 2019-06-05 | 2019-06-05 | A kind of Chinese Name data desensitization method based on sort permutation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222153A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106951562A (en) * | 2017-04-01 | 2017-07-14 | 北京数聚世界信息技术有限公司 | A kind of desensitization method and device of Chinese Name data |
CN109739956A (en) * | 2018-11-08 | 2019-05-10 | 第四范式(北京)技术有限公司 | Corpus cleaning method, device, equipment and medium |
CN109829328A (en) * | 2018-12-19 | 2019-05-31 | 上海晶赞融宣科技有限公司 | Data desensitization, inverse desensitization method and device, storage medium, terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102012985B (en) | Sensitive data dynamic identification method based on data mining | |
Shi et al. | Selective differential privacy for language modeling | |
Qu et al. | Natural language understanding with privacy-preserving bert | |
CN106295338B (en) | SQL vulnerability detection method based on artificial neuron network | |
CN110427612A (en) | Based on multilingual entity disambiguation method, device, equipment and storage medium | |
CN110489997A (en) | A kind of sensitive information desensitization method based on pattern matching algorithm | |
CN110569350A (en) | Legal recommendation method, equipment and storage medium | |
Trieu et al. | Document sensitivity classification for data leakage prevention with twitter-based document embedding and query expansion | |
Kathuria et al. | Real time sentiment analysis on twitter data using deep learning (Keras) | |
Yang et al. | Improving word representations with document labels | |
CN108932434B (en) | Data encryption method and device based on machine learning technology | |
CN107622208A (en) | Note encryption and decryption method and related product | |
CN113742763A (en) | Confusion encryption method and system based on government affair sensitive data | |
CN117290888A (en) | Information desensitization method for big data, storage medium and server | |
CN110222153A (en) | A kind of Chinese Name data desensitization method based on sort permutation | |
CN116055067B (en) | Weak password detection method, device, electronic equipment and medium | |
Aborujilah et al. | Conceptual model for automating gdpr compliance verification using natural language approach | |
Wu et al. | Semantic key generation based on natural language | |
Liang et al. | A lightweight method for face expression recognition based on improved MobileNetV3 | |
Chen et al. | Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection | |
CN112507388B (en) | Word2vec model training method, device and system based on privacy protection | |
Boucenna et al. | Concept-based semantic search over encrypted cloud data | |
CN114969826A (en) | Privacy protection method, device and equipment for biological recognition | |
Zhang et al. | Detection of android malicious family based on manifest information | |
Sivanaiah et al. | Avalanche at DravidianLangTech: Abusive Comment Detection in Code Mixed Data Using Machine Learning Techniques with Under Sampling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20200602. Address after: No. 2 Taibai South Road, Yanta District, Xi'an, Shaanxi 710071. Applicants after: XIDIAN University; CHINA ACADEMY OF ELECTRONICS AND INFORMATION TECHNOLOGY OF CETC. Address before: No. 2 Taibai South Road, Beilin District, Xi'an, Shaanxi 710071. Applicant before: XIDIAN University |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190910 |