CN105825141A

CN105825141A - Database Chinese name desensitization method based on complementary mapping

Info

Publication number: CN105825141A
Application number: CN201610072405.5A
Authority: CN
Inventors: 罗建峰; 袁玉波
Original assignee: Shanghai Jianqing Information Technology Co Ltd
Current assignee: Shanghai Jianqing Information Technology Co Ltd
Priority date: 2016-02-02
Filing date: 2016-02-02
Publication date: 2016-08-03

Abstract

The invention provides a new database Chinese name desensitization method based on complementary mapping. Conventional processing of Chinese names in a database either removes the name field outright or replaces the names with garbled characters, which causes serious information loss. The proposed method effectively preserves the uniqueness and identifiability of the data, so the database loses no information during processing. The method comprises the following steps: splitting the Chinese names in the database into single Chinese characters; encoding each character to obtain coded data; scrambling the digit order of each code with a two-step elementary transformation; applying complementary mapping to obtain the desensitized codes; and combining those codes to obtain the desensitization result for each Chinese name. Extensive database experiments show that the method is effective and meets the technical requirements for lossless database desensitization.

Description

A database Chinese name desensitization method based on complementary mapping

Technical field

The present invention is mainly used for database privacy protection. Specifically, it is a method for transforming Chinese names in a database that involves Chinese character encoding, data scrambling, and complementary mapping.

Background technology

Name desensitization is an important problem in privacy protection research. In this era of information explosion, privacy protection has become a technical barrier to big data applications, and how to protect private information in databases is a technical problem that urgently needs solving. Privacy refers to personal information that an individual is unwilling to have known by others. It covers personal matters and circumstances unrelated to the public interest, including an individual's inner thoughts, outward lifestyle, health, family relationships and background, living environment, and personal space.

On April 1, 2013, the national guide "Information Security Technology: Guidelines for Personal Information Protection within Public and Commercial Service Information Systems", drawn up by the Ministry of Industry and Information Technology, formally took effect. The guide explicitly divides personal information into general personal information and sensitive personal information. It also requires that the processing of personal information have a specific, clear, and reasonable purpose, and that the consent of the personal information subject be obtained with that subject's knowledge. General personal information may be processed on the basis of tacit consent: as long as the subject does not explicitly object, it may be collected and used. Sensitive personal information, however, requires express consent: explicit authorization from the subject must be obtained before collection and use.

Among sensitive personal information, the name is an important item that receives close attention from users and the public. Across China's five-thousand-year history, names have been one of the important ways of carrying on cultural lineage; they are sociocultural markers grounded in bloodline inheritance, indispensable symbols of a person in social relations, and a necessary tool for an individual to express, exchange, and spread information in social and cultural interaction. In big data applications, the sensitive personal information involved often covers millions, tens of millions, or even hundreds of millions of individuals, so obtaining each individual's consent before collection and use is not feasible. Name desensitization has therefore become an important technical problem in database privacy protection.

Chinese character encoding of names is a key technique for name desensitization. Many Chinese character encoding methods currently exist, such as the region-position code, internal code, external (input) code, and ASCII code. This patent adopts the "Code of Chinese Graphic Character Set for Information Interchange, Primary Set" (GB 2312; hereinafter the Chinese character standard exchange code) published by the National Bureau of Standards in 1981. This standard exchange code is divided into two levels: 3755 characters in level one and 3008 characters in level two, 6763 Chinese characters in total. It serves as the computer's internal code and gives the design of input and output devices a unified standard, so that information exchanged between systems shares a common identification and the sharing of information resources is ensured. When desensitizing name information in big data, efficiency is a key factor that must be considered, so an overly complicated encoding technique should not be used. Unlike those complex encoding techniques, the main advantage of the standard exchange code is that it is simple and efficient to use.

Data scrambling is an indispensable step in name desensitization. Scrambling is a common technique in information desensitization: it permutes the data so that a reader can hardly recover its original distribution pattern, while keeping the size and scale of the data unchanged.

Complement mapping is the safeguard technique of name desensitization. The complement idea rests on the principle of complementary conservation: two quantities are complementary when their sum is always a constant. In this patent each Chinese character corresponds to a four-digit region-position code, so we specify that the original code and its complement sum to the constant 9999.
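As a short worked check of the complement idea, the mapping can be written out and applied twice. The symbol F and the example value 9743 are taken from the later steps of this description; the digit-wise reading is an added observation, not a statement from the patent.

```latex
\begin{aligned}
F(v) &= 9999 - v, \qquad 0 \le v \le 9999,\\
F(F(v)) &= 9999 - (9999 - v) = v,\\
F(9743) &= 0256, \qquad F(0256) = 9743 .
\end{aligned}
```

Because every digit d of a four-digit code is replaced by 9 - d without any borrowing, the complement is again a four-digit code, and applying the mapping a second time returns the original value.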

Summary of the invention

The object of the invention is to propose a database Chinese name desensitization method based on complementary mapping, with the aim of reducing the presence of Chinese name information in the database and on the principle of ensuring data validity. The invention also requires the desensitization method to be reversible, that is, the original database must be recoverable from the desensitized database. With the invented method, the whole desensitization is completed automatically by computer: the user only needs to input the original database, and the computer desensitizes the Chinese names automatically and outputs the desensitized database.

The technical solution is as follows:

Step 1, decompose the Chinese name into Chinese characters by splitting the name by byte: N = {x_1, x_2, x_3, ..., x_k};

Step 2, encode each Chinese character using the national standard Chinese character code: u_i = c(x_i), i = 1, 2, ..., k;

Step 3, scramble each character's code in two steps using elementary transformation matrices: v_i = l(u_i), i = 1, 2, ..., k;

Step 4, apply complementary mapping to the scrambled code to obtain the complement, where the complementary mapping is:

E_i = F(v_i) = 9999 - v_i, i = 1, 2, ..., k, for example: F(8021) = 9999 - 8021 = 1978;

Step 5, combine the complement codes to generate the desensitized name data E = E_1 E_2 ... E_k.

Description of the drawings

After reading the detailed description of the invention with reference to the accompanying drawings, the reader will understand the various aspects of the invention more clearly. The drawings show the desensitization results for 1000 records: the first three columns are the original data and the last three columns are the data after desensitization.

Figs. 1 to 19 are all application examples of the method. We selected 1000 records from a database as the objects of privacy protection. The first column is the Chinese name in the database, which is the sensitive attribute; to protect privacy, the given name is hidden with "certain" or "so-and-so" and only the surname is kept. Columns 2 to 4 are "sex", "age", and "date of birth", respectively. Column 5 is the corresponding desensitized name code, and columns 6 to 8 are again "sex", "age", and "date of birth". As Figs. 1 to 19 show, personal information is difficult to identify after desensitization, so the purpose of data desensitization is achieved.

Detailed description of the invention

Step 1, first extract the name field from the input database records, then decompose the name in that field into individual Chinese characters, for example 公孙聚云 ("Gongsun Juyun") = {公 ("public"), 孙 ("grandson"), 聚 ("gathering"), 云 ("cloud")}.

Step 2, give the unique identification code of each Chinese character, for example 2511 = c(公), 4379 = c(孙), 3059 = c(聚), 5238 = c(云). In implementation, if a rare Chinese character is encountered that is not in the current code table, a code is added automatically: the code assigned to the rare character is the largest code in the existing code table plus 1.
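The encoding in Step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the region-position code is derived from Python's built-in `gb2312` codec (each byte minus 0xA0), and the names `region_position_code` and `CodeTable`, as well as the starting maximum of 8794, are hypothetical choices made for the example.

```python
def region_position_code(ch: str) -> int:
    """Return the four-digit GB 2312 region-position code of one character."""
    raw = ch.encode("gb2312")  # raises UnicodeEncodeError outside GB 2312
    if len(raw) != 2:
        raise ValueError(f"{ch!r} is not a two-byte GB 2312 character")
    region, position = raw[0] - 0xA0, raw[1] - 0xA0
    return region * 100 + position


class CodeTable:
    """Hypothetical code table with the 'largest code plus 1' fallback."""

    def __init__(self, max_code: int = 8794):
        # Assumption: 8794 (region 87, position 94) is the largest code
        # already in the table when only GB 2312 characters are present.
        self.max_code = max_code
        self.extra = {}  # rare character -> code assigned on first sight

    def code(self, ch: str) -> int:
        try:
            return region_position_code(ch)
        except UnicodeEncodeError:
            if ch not in self.extra:
                self.max_code += 1
                self.extra[ch] = self.max_code
            return self.extra[ch]


table = CodeTable()
print([table.code(ch) for ch in "公孙聚云"])  # [2511, 4379, 3059, 5238]
```

Under this sketch the `extra` mapping would have to be stored alongside the desensitized database, since codes assigned to rare characters cannot be recovered from the standard code table alone.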

Step 3, scramble the code of each encoded Chinese character. In implementation the scrambling works as follows. For example, the character 孙 ("grandson") has the code 4379. To scramble 4379, treat it as a 4-dimensional vector and complete the scrambling with fourth-order elementary matrices. This patent uses:

1) half swap (exchange the first two digits with the last two):

\begin{pmatrix} 7 & 9 & 4 & 3 \end{pmatrix} = \begin{pmatrix} 4 & 3 & 7 & 9 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix},

2) one-two swap (exchange the first and second digits):

\begin{pmatrix} 9 & 7 & 4 & 3 \end{pmatrix} = \begin{pmatrix} 7 & 9 & 4 & 3 \end{pmatrix} \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},

The result, 9743 = l(4379), is the scrambled code.
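The two matrix products above can be reproduced with a few lines of plain Python. This is only a sketch: the matrices `P` and `Q` are copied from the two equations, while the helper names `row_times_matrix` and `scramble` are illustrative and do not come from the patent.

```python
# Half swap: moves the last two digits in front of the first two.
P = [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0]]

# One-two swap: exchanges the first and second digits.
Q = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]


def row_times_matrix(row, matrix):
    """Multiply a 4-element row vector by a 4x4 matrix."""
    return [sum(row[i] * matrix[i][j] for i in range(4)) for j in range(4)]


def scramble(code: int) -> int:
    """Apply P and then Q to the four digits of a region-position code."""
    digits = [int(d) for d in f"{code:04d}"]
    digits = row_times_matrix(digits, P)
    digits = row_times_matrix(digits, Q)
    return int("".join(str(d) for d in digits))


print(scramble(4379))  # 9743
```

The function returns an integer, so a scrambled code with leading zeros should be formatted back to four digits (for example with `f"{value:04d}"`) before it is displayed or concatenated.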

Step 4, in application, the complement of an individual Chinese character is generated simply by subtracting the scrambled code from 9999.

For example: F(9743) = 9999 - 9743 = 0256.

Step 5, in implementation, the codes are combined without changing their order; they are concatenated directly.

For example, applying the steps above gives 公 ("public") --> 8874, 孙 ("grandson") --> 0256, 聚 ("gathering") --> 0469, 云 ("cloud") --> 1647, so the desensitized data for 公孙聚云 ("Gongsun Juyun") is 8874025604691647.
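Putting the five steps together, the worked example and the reversibility required in the summary can be checked with the sketch below. It is an illustration under stated assumptions rather than the patent's implementation: characters are assumed to lie within GB 2312 (the rare-character fallback is omitted), the scramble is written as string slicing instead of matrix products, and all function names are hypothetical.

```python
def encode_char(ch: str) -> int:
    """Region-position code via Python's built-in gb2312 codec."""
    raw = ch.encode("gb2312")
    return (raw[0] - 0xA0) * 100 + (raw[1] - 0xA0)


def decode_char(code: int) -> str:
    """Inverse of encode_char."""
    return bytes([code // 100 + 0xA0, code % 100 + 0xA0]).decode("gb2312")


def scramble(code: int) -> int:
    s = f"{code:04d}"
    s = s[2:] + s[:2]        # half swap
    s = s[1] + s[0] + s[2:]  # one-two swap
    return int(s)


def unscramble(code: int) -> int:
    s = f"{code:04d}"
    s = s[1] + s[0] + s[2:]  # undo the one-two swap first
    s = s[2:] + s[:2]        # then undo the half swap
    return int(s)


def desensitize(name: str) -> str:
    """Steps 1 to 5: split, encode, scramble, complement, concatenate."""
    return "".join(f"{9999 - scramble(encode_char(ch)):04d}" for ch in name)


def restore(token: str) -> str:
    """Reverse the steps on each four-digit group."""
    groups = [token[i:i + 4] for i in range(0, len(token), 4)]
    return "".join(decode_char(unscramble(9999 - int(g))) for g in groups)


token = desensitize("公孙聚云")
print(token)           # 8874025604691647
print(restore(token))  # 公孙聚云
```

Each character always yields exactly four digits, which is what lets `restore` cut the concatenated string back into per-character groups without any separator.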

Claims

1. A database Chinese name desensitization method based on complementary mapping, characterized in that:

When desensitizing Chinese names in a database, protection is claimed for the following method steps:

Step 1, decompose the Chinese name into Chinese characters by splitting the name by byte: N = {x_1, x_2, x_3, ..., x_k};

Step 2, encode each Chinese character using the national standard Chinese character code: u_i = c(x_i), i = 1, 2, ..., k. For example:

胆 ("gallbladder"): 2108; 弹 ("bullet"): 2115; 蛋 ("egg"): 2116;

Step 3, scramble each character's code in two steps using elementary transformation matrices: v_i = l(u_i), i = 1, 2, ..., k;

Step 4, apply complementary mapping to the scrambled code to obtain the complement, where the complementary mapping is:

E_i = F(v_i) = 9999 - v_i, i = 1, 2, ..., k, for example: F(8021) = 9999 - 8021 = 1978;

Step 5, combine the complement codes to generate the desensitized name data E = E_1 E_2 ... E_k.

2. According to the database Chinese name desensitization method based on complementary mapping, the claimed transformation method for the scrambled code is as follows:

The scrambling method claimed for step 3 is:

v_i = u_i * P * Q, i = 1, 2, ..., k

The elementary matrix P used the first time is:

P = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}

The elementary matrix Q used the second time is:

Q = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

For example:

2108 is scrambled to 8021; 2115 is scrambled to 5121; 2116 is scrambled to 6121.

3. According to the database Chinese name desensitization method based on complementary mapping, the claimed method for generating the complement code is as follows:

E_i = F(v_i) = 9999 - v_i, i = 1, 2, ..., k;

That is, E_i and v_i are required to be complementary: E_i + v_i = 9999.