A Chinese name desensitization method for databases based on complement mapping
Technical field
The present invention is mainly used for database privacy protection. Specifically, it is a method for converting Chinese names in a database that involves Chinese character encoding, data scrambling, and complement mapping.
Background art
Name desensitization is an important problem in privacy protection research. In this era of information explosion, privacy protection has become a technical barrier to big data applications, and how to protect private information in databases is a technical problem that urgently needs to be solved. Privacy refers to personal information that an individual is unwilling to have known by others. It covers personal affairs and circumstances unrelated to the public interest, including an individual's inner thoughts, outward lifestyle, health, family relationships and background, and living environment and space. On April 1, 2013, the national guideline "Information Security Technology: Guideline for Personal Information Protection within Information Systems for Public and Commercial Services", drawn up by the Ministry of Industry and Information Technology, formally took effect. The guideline clearly divides personal information into general personal information and sensitive personal information. It also requires that the processing of personal information have a specific, clear, and reasonable purpose, and that the consent of the personal information subject be obtained with that subject's knowledge. General personal information may be processed on the basis of tacit consent: as long as the subject does not explicitly object, the information may be collected and used. Sensitive personal information, however, requires express consent: explicit authorization from the subject must be obtained before collection and use.
Among sensitive personal information, the name is an important item that draws close attention from users and the public. From the perspective of China's five-thousand-year history, names are one of the important means of carrying on the cultural lineage. They are social and cultural markers based on blood inheritance, symbols indispensable to people in social relations, and tools that individuals need for expressing, exchanging, and disseminating information in social and cultural exchange. In the big data field, the sensitive personal information involved often exceeds a million records and can reach tens or even hundreds of millions, so obtaining the consent of every individual before aggregating and using the data is impossible. Name desensitization has therefore become an important technical problem in database privacy protection.
Chinese character encoding of names is an important technique in name desensitization. Many Chinese character encoding methods exist, such as region-position codes, internal codes, external codes, and ASCII codes. This patent adopts the "Code of Chinese Graphic Character Set for Information Interchange, Primary Set" (GB 2312-80, abbreviated here as the Chinese character standard interchange code) published by the National Bureau of Standards in 1981. This standard interchange code is divided into two levels, with 3755 characters in level 1 and 3008 characters in level 2, for a total of 6763 Chinese characters. It is used as an internal code of the computer, provides a unified standard for the design of various input and output devices, and gives information exchange between systems a common identifier, thereby ensuring that information resources can be shared. For desensitizing name information in big data, efficiency is a key factor that must be considered, so overly complex encoding techniques should not be used. Unlike such complex techniques, the main advantage of the standard interchange code is that it is simple and efficient to use.
Data scrambling is an indispensable step in desensitizing name information. It is a common desensitization technique whose purpose is to replace data with data from which a reader can hardly discern the original distribution pattern, while keeping the size and scale of the data unchanged.
Complement mapping is the safeguard technique of name desensitization. The complement idea is based on the principle of complementary conservation: two quantities are complementary when their sum is always constant. In this patent each Chinese character corresponds to a four-digit region-position code, so the sum of the original code and its complement is fixed at the constant 9999.
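The complement idea can be illustrated with a minimal sketch, assuming codes are handled as integers and printed with exactly four digits. The names `BASE`, `complement`, and `fmt` are illustrative and do not come from the patent.

```python
BASE = 9999  # constant sum of a four-digit code and its complement


def complement(code: int) -> int:
    """Return the complement of a four-digit code (0..9999)."""
    if not 0 <= code <= BASE:
        raise ValueError("code must have at most four decimal digits")
    return BASE - code


def fmt(code: int) -> str:
    """Render a code with exactly four digits, keeping leading zeros."""
    return f"{code:04d}"


print(fmt(complement(8021)))              # 1978
print(fmt(complement(complement(8021))))  # 8021
```

Because the sum is constant, applying the mapping twice returns the original code, which is what makes this step reversible.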
Summary of the invention
The object of the invention is to propose a Chinese name desensitization method for databases based on complement mapping, with the goal of reducing the identifying information carried by Chinese names in the database and the principle of preserving data validity. The invention also requires the desensitization method to be reversible, that is, the original database must be recoverable from the desensitized database. With the method of the invention, the whole desensitization is completed automatically by computer: the user only needs to input the original database, and the computer automatically desensitizes the Chinese names and outputs the desensitized database.
The technical scheme is as follows:
Step 1: decompose the Chinese name into its individual Chinese characters, $N = \{x_1, x_2, x_3, \dots, x_k\}$;
Step 2: encode each Chinese character; this patent uses the national standard Chinese character code, $u_i = c(x_i)$, $i = 1, 2, \dots, k$;
Step 3: scramble the code of each Chinese character in two steps using elementary transformation matrices, $v_i = l(u_i)$, $i = 1, 2, \dots, k$;
Step 4: apply the complement mapping to each scrambled code to obtain its complement. The complement mapping is
$E_i = F(v_i) = 9999 - v_i$, $i = 1, 2, \dots, k$, for example $F(8021) = 9999 - 8021 = 1978$;
Step 5: concatenate the complements to generate the desensitized name data $E = E_1 E_2 \cdots E_k$.
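The reversibility required by the invention can be read off the five steps, under the assumption that the encoding $c$ is one-to-one on the characters that occur and that the scrambling $l$ is an invertible rearrangement of digits. The inverse symbols $c^{-1}$ and $l^{-1}$ are introduced here for illustration only.

```latex
\begin{aligned}
\text{forward:}\quad & u_i = c(x_i), \qquad v_i = l(u_i), \qquad E_i = F(v_i) = 9999 - v_i, \\
\text{inverse:}\quad & v_i = 9999 - E_i = F(E_i), \qquad u_i = l^{-1}(v_i), \qquad x_i = c^{-1}(u_i), \\
\text{hence}\quad & x_i = c^{-1}\bigl(l^{-1}(9999 - E_i)\bigr), \qquad i = 1, 2, \dots, k.
\end{aligned}
```

Since every $E_i$ is written with exactly four digits, the string $E$ can be cut back into $E_1, \dots, E_k$ without separators.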
Brief description of the drawings
After reading the detailed description of the invention with reference to the accompanying drawings, the reader will understand the various aspects of the invention more clearly. The drawings show the desensitization results for 1000 records; the first three columns are the original data and the last three columns are the data after desensitization.
Figs. 1 to 19 are all application examples of the method of the invention. We selected 1000 records from a database as the objects of privacy protection. The first column is the Chinese name in the database, which is the sensitive attribute; to protect privacy, the given name is masked with a placeholder meaning "a certain" or "so-and-so", leaving only the surname. Columns 2 to 4 are "sex", "age", and "date of birth" in turn, column 5 is the name code after desensitization, and columns 6 to 8 are again "sex", "age", and "date of birth" in turn. As Figs. 1 to 19 show, personal information is hard to identify after desensitization, which achieves the purpose of data desensitization.
Detailed description of the invention
Step 1: first extract the name field from the input database records and decompose the name in that field into individual Chinese characters, for example 公孙聚云 ("Gongsun Juyun") = {公, 孙, 聚, 云}.
Step 2: assign each Chinese character its unique identification code, for example 2511 = c(公), 4379 = c(孙), 3059 = c(聚), 5238 = c(云). In a specific implementation, if a rare character is encountered that is not in the current code table, a code is assigned automatically: the largest code in the existing code table plus 1 becomes the code of that rare character.
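A minimal sketch of this encoding step follows. It assumes the codes are GB 2312 region-position codes obtained from Python's built-in `gb2312` codec (each byte minus 0xA0), and that the code table initially ends at 8794, the last level-2 slot (region 87, position 94). The class name `CodeTable` and both assumptions are illustrative; the patent does not specify how its code table is stored.

```python
class CodeTable:
    """GB 2312 lookup with 'maximum code plus 1' for rare characters."""

    def __init__(self, max_code=8794):
        # Assumed starting maximum: region 87, position 94.
        self.max_code = max_code
        self.extra = {}  # rare character -> automatically assigned code

    def code(self, ch):
        """Return the four-digit code of a single character as an int."""
        if ch in self.extra:
            return self.extra[ch]
        try:
            raw = ch.encode("gb2312")
        except UnicodeEncodeError:
            raw = b""  # character is not in GB 2312
        if len(raw) == 2:
            # Region and position are the two bytes minus 0xA0.
            return (raw[0] - 0xA0) * 100 + (raw[1] - 0xA0)
        # Rare character: assign the current maximum code plus 1.
        self.max_code += 1
        self.extra[ch] = self.max_code
        return self.max_code


table = CodeTable()
print([table.code(ch) for ch in "公孙聚云"])  # [2511, 4379, 3059, 5238]
```

Keeping the assigned codes in a dictionary means a rare character always receives the same code within one table, which the reverse lookup would need.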
Step 3: scramble the encoded Chinese character codes. In this embodiment the scrambling works as follows. Take the character 孙, whose code is 4379. To scramble 4379, treat it as the four-dimensional vector (4, 3, 7, 9) and apply fourth-order elementary matrices. This patent uses two exchanges:
1) half exchange: the first two digits are exchanged with the last two digits, giving 7943;
2) one-two exchange: the first and second digits are exchanged, giving 9743.
The result 9743 = l(4379) is the scrambled code.
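The scrambling can be sketched with two 4x4 permutation matrices. This is an assumed reading: the matrices are not reproduced in the text, so `HALF_SWAP` and `ONE_TWO_SWAP` below are inferred from the worked example 4379 to 9743, and all names are illustrative.

```python
# Assumed matrices, applied to the digits as a column vector.
HALF_SWAP = [      # exchange the first two digits with the last two
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
ONE_TWO_SWAP = [   # exchange the first and second digits
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]


def permute(matrix, digits):
    """Multiply a 4x4 matrix by a column vector of four digits."""
    return [sum(m * d for m, d in zip(row, digits)) for row in matrix]


def to_digits(code):
    return [int(c) for c in f"{code:04d}"]


def to_code(digits):
    return int("".join(str(d) for d in digits))


def scramble(code):
    """Half exchange followed by one-two exchange."""
    return to_code(permute(ONE_TWO_SWAP, permute(HALF_SWAP, to_digits(code))))


def unscramble(code):
    """Each exchange is its own inverse, so undo them in reverse order."""
    return to_code(permute(HALF_SWAP, permute(ONE_TWO_SWAP, to_digits(code))))


print(scramble(4379))    # 9743
print(unscramble(9743))  # 4379
```

Both matrices only move digits around, so the scrambled code keeps four digits and the step can be undone exactly.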
Step 4: in application, the complement of each Chinese character is generated simply by subtracting the scrambled code from 9999.
For example: 0256 = F(9743) = 9999 - 9743.
Step 5: in a specific implementation, the codes are combined without changing their order, by direct concatenation.
For example, after the steps above: 公 --> 8874, 孙 --> 0256, 聚 --> 0469, 云 --> 1647, so the desensitized data for 公孙聚云 ("Gongsun Juyun") is 8874025604691647.
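The five steps and their inverse can be put together in a short end-to-end sketch. It assumes every character is in GB 2312 (the rare-character fallback is omitted), that codes come from Python's `gb2312` codec, and that the scrambling is a half exchange followed by a one-two exchange, as inferred from the worked example. The function names are illustrative, not the patent's.

```python
BASE = 9999


def encode_char(ch):
    """Character -> four-digit region-position code (GB 2312 assumed)."""
    raw = ch.encode("gb2312")
    return (raw[0] - 0xA0) * 100 + (raw[1] - 0xA0)


def decode_char(code):
    """Four-digit region-position code -> character."""
    region, position = divmod(code, 100)
    return bytes([region + 0xA0, position + 0xA0]).decode("gb2312")


def scramble(code):
    d = f"{code:04d}"
    d = d[2:] + d[:2]                # half exchange
    return int(d[1] + d[0] + d[2:])  # one-two exchange


def unscramble(code):
    d = f"{code:04d}"
    d = d[1] + d[0] + d[2:]          # undo one-two exchange
    return int(d[2:] + d[:2])        # undo half exchange


def desensitize(name):
    """Steps 1 to 5: split, encode, scramble, complement, concatenate."""
    return "".join(f"{BASE - scramble(encode_char(ch)):04d}" for ch in name)


def restore(token):
    """Inverse: cut into four-digit groups and undo each step."""
    groups = [token[i:i + 4] for i in range(0, len(token), 4)]
    return "".join(decode_char(unscramble(BASE - int(g))) for g in groups)


print(desensitize("公孙聚云"))  # 8874025604691647
print(restore("8874025604691647") == "公孙聚云")  # True
```

Padding every complement to four digits is what keeps leading zeros such as 0256 and lets `restore` split the string without separators.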