CN107480549B - Sensitive information desensitization method and system for data sharing

Info

Publication number: CN107480549B
Application number: CN201710506066.1A
Authority: CN
Inventors: 张云云; 王开红; 于海龙; 吴培文; 陈涛
Original assignee: Enjoyor Co Ltd
Current assignee: Yinjiang Technology Co.,Ltd.
Priority date: 2017-06-28
Filing date: 2017-06-28
Publication date: 2019-08-02
Anticipated expiration: 2037-06-28
Also published as: CN107480549A

Abstract

The present invention relates to a sensitive information desensitization method and system for data sharing. Using statistics, natural language processing and machine learning techniques, the invention protects sensitive data throughout the whole process from data publication to data application and use. It proposes automatic identification of sensitive information, such as named-entity and address information, based on constructed sensitive information keyword libraries; calculates the degree of association between sensitive attributes using a Sigmoid function; applies a desensitization strategy that combines a sensitive attribute generation rule library, named-entity desensitization rules and core desensitization algorithms; combines the desensitization depth calculations for numeric sensitive attributes and categorical attributes to obtain the desensitization degree of the whole data set; and hashes the download link address to achieve controlled output of the data. The resulting sensitive information processing strategy safeguards sensitive information while meeting analysis and mining requirements as far as possible, and offers good desensitization effect and high reliability.

Description

Sensitive information desensitization method and system for data sharing

Technical field

The present invention relates to the interdisciplinary field of information technology and data security, and in particular to a sensitive information desensitization method and system for data sharing.

Background technique

In recent years, the convergence of information technology with the economy and society has caused data to grow rapidly, and data has become an important development resource. In 2016, the government vigorously promoted the interconnection, opening and sharing of information systems and public data, accelerated the integration of government information platforms, eliminated information islands, and promoted the opening of data resources to society in order to guide social development and better serve the public. In the big data context, however, data opening and sharing also raises challenging problems. Data leakage incidents of all kinds occur frequently, such as the leak of information on nearly six thousand newborns in Anhui, the annual leaks of college entrance examination information, and targeted telephone fraud. As a result, society as a whole has shifted from focusing on data opening and sharing to paying more attention to data security protection. To this end, many countries have promulgated laws and regulations on information security, such as the "Privacy Act" and the "Regulations of the People's Republic of China on Open Government Information" in China. These require that data meet specific conditions during opening and sharing: an open data set must not contain data that identifies an individual, so that users of the data set cannot easily infer private personal information. At the same time, the diverse needs of the public should be reasonably met, and data resources should be able to generate new value. Achieving data security protection while maximizing the utility of data resources is therefore a challenging problem in the current field of information security processing.

In recent years, a great deal of research has been done on protecting sensitive data. Patent No. CN201511026582.1, from the perspective of a data desensitization system, describes the protection of sensitive data in a big data environment across the whole chain of circulation, exchange, sharing and transaction. It uses a different protection method at each stage, proposes a sensitive data discovery method based on an expert system and natural language processing, and adds a final data desensitization measurement stage that verifies the correctness and authenticity of the desensitization results. Patent No. CN201610338383.2 proposes, in a network environment, physically storing the encryption key separately from the encrypted, desensitized data, and setting strict access permissions on both the key and the desensitized data to ensure the security of encryption and decryption. Patent No. CN201510303954.4 receives a structured query language (SQL) instruction sent by a user, determines whether the accessed data contains sensitive data, and converts the SQL instruction according to the user's access permissions and preset desensitization conversion rules, so that the converted instruction accesses desensitized data. Patent No. CN201510755773.5 discloses a format-preserving desensitization method for different types of private data, which stores the data in ciphertext form. It avoids data loading failures caused by ciphertext longer than the defined length of the table field, and avoids data loading errors caused by a mismatch between the type of an encrypted numeric field and the source data type.

However, the above desensitization systems and methods all have limitations, mainly for the following reasons: (1) most of them target structured data in databases and do not address how to handle unstructured data such as text data; (2) they give little consideration to the completeness of desensitization: if sensitive data is not desensitized deeply enough, it can be reconstructed from non-sensitive data; (3) they cannot guarantee that identifiers remain unique and formats remain consistent after desensitization. In hospital data, for example, individuals are generally identified and located by ID card number; applying a desensitization or encryption algorithm causes the ID card information to lose its uniqueness as an identifier and its format consistency.

Summary of the invention

To overcome the above shortcomings, the object of the present invention is to provide a sensitive information desensitization method and system for data sharing. Using statistics, natural language processing and machine learning techniques, the invention protects sensitive data throughout the whole process from data publication to data application and use. It proposes automatic identification of sensitive information, such as named-entity and address information, based on constructed sensitive information keyword libraries; calculates the degree of association between sensitive attributes using a Sigmoid function; applies a desensitization strategy that combines a sensitive attribute generation rule library, named-entity desensitization rules and core desensitization algorithms; combines the desensitization depth calculations for numeric sensitive attributes and categorical attributes to obtain the desensitization degree of the whole data set; and hashes the download link address to achieve controlled output of the data. The resulting sensitive information processing strategy safeguards sensitive information while meeting analysis and mining requirements as far as possible.

The present invention achieves the above object through the following technical solution: a sensitive information desensitization method for data sharing, comprising the following steps:

(1) Sensitive information automatic identification rules and sensitive information processing rules are preset. The automatic identification rules comprise constructing sensitive information keyword libraries of each kind, automatic identification of the sensitive information in those keyword libraries, automatic identification of number and numeric sensitive information, automatic identification of named-entity sensitive information, and accurate identification of address sensitive information. The processing rules comprise sensitive attribute generation rules, configured desensitization algorithms, named-entity desensitization and address desensitization. A data user requests to view data published by a data provider;

(2) The data is preprocessed, and after preprocessing the text data is segmented and tagged with parts of speech;

(3) Sensitive information is automatically identified according to the preset automatic identification rules;

(4) The degree of association between sensitive attributes is calculated and analyzed, and the sensitive information whose degree of association is above a preset threshold is retained;

(5) The sensitive information is desensitized according to the preset processing rules;

(6) The desensitization depth of the sensitive information is calculated and checked against a preset requirement. If the requirement is not met, the method returns to step (5) and desensitizes again; otherwise, the desensitized data set is output for the data user to view.
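
The feedback loop between steps (5) and (6) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the tail-masking transform, the "fraction of characters changed" depth measure and the `max_rounds` cap are all assumptions introduced here, standing in for the rule-based desensitization and the depth formulas described later.

```python
from typing import Callable, List


def desensitize_until_deep_enough(
    records: List[str],
    desensitize: Callable[[List[str], int], List[str]],
    depth: Callable[[List[str], List[str]], float],
    required_depth: float,
    max_rounds: int = 20,
) -> List[str]:
    """Repeat step (5) with increasing strength until step (6) is satisfied."""
    for strength in range(1, max_rounds + 1):
        candidate = desensitize(records, strength)       # step (5)
        if depth(records, candidate) >= required_depth:  # step (6)
            return candidate
    raise RuntimeError("required desensitization depth was not reached")


def mask_tail(values: List[str], strength: int) -> List[str]:
    """Hypothetical transform: mask the last `strength` characters."""
    return [v[: max(len(v) - strength, 0)] + "*" * min(strength, len(v))
            for v in values]


def changed_fraction(original: List[str], masked: List[str]) -> float:
    """Hypothetical depth measure: share of characters that were changed."""
    changed = sum(a != b for o, m in zip(original, masked) for a, b in zip(o, m))
    total = sum(len(o) for o in original)
    return changed / total if total else 0.0


result = desensitize_until_deep_enough(
    ["19821210912"], mask_tail, changed_fraction, required_depth=0.5
)
```

The cap on the number of rounds is a design choice added for the sketch so that an unreachable depth requirement fails loudly instead of looping forever.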

Preferably, the preprocessing in step (2) is as follows: the published data is classified by data type, the data types comprising structured database data, tabular data, data warehouse data and unstructured document data. Preprocessing checks the completeness, consistency and correctness of attribute values, and parses unstructured document data into text data using a parsing tool.

Preferably, named-entity sensitive information is automatically identified by combining part-of-speech tagging, based on the Viterbi algorithm of a hidden Markov model (HMM), with a constructed named-entity knowledge base; address sensitive information is accurately identified by examining the adjacent word sequence of the address information.

Preferably, the degree of association between sensitive attributes is calculated as follows:

(a) The degree of association of categorical sensitive attributes is normalized using the Sigmoid function, defined as follows:

where the range of the function is [0,1], and the function is continuous, smooth and monotonically increasing;

(b) Assume that each record in data set T has p attributes {u₁,u₂,...,u_p}, and that each attribute has several attribute values, the numbers of which are denoted {q₁,q₂,...,q_p} respectively. Within a record, an attribute value of a sensitive attribute is marked 1 if it occurs and 0 if it does not, so the record can be expressed as a (q₁+q₂+...+q_p)-dimensional row vector. When data set T has n records, denoted in turn {t₁,t₂,...,t_n}, there are n such (q₁+q₂+...+q_p)-dimensional row vectors;

(c) XNOR and XOR operations are performed on the values at corresponding positions of the (q₁+q₂+...+q_p)-dimensional row vectors. One term denotes the case in which, under XNOR, the attribute values at a corresponding position are both marked 1, and another denotes the case in which they are both marked 0. The degree of association S(I₁,I₂) between two attributes is then calculated as follows:

where the parameters λ₁, λ₂, λ₃ are set to 0.5, 0.25 and 0.25 respectively in the calculation, and the range is 0≤S(I₁,I₂)≤1.

Preferably, desensitizing number and numeric sensitive information specifically comprises: formulating rules for generating sensitive attributes and storing them in the sensitive attribute generation rule library; calling preset desensitization algorithms based on data distortion and encryption; and converting the newly generated sensitive attribute values according to the desensitization task, finally forming the desensitized data.

Preferably, named-entity sensitive information is desensitized using a code table of common Chinese named entities that stores organization names and Chinese personal names on the order of millions; the original named entity is hashed, looked up in the table and replaced, completing the desensitization. Address sensitive information is desensitized according to the level of detail of the address. The address is first converted to longitude and latitude. If the original sensitive address cannot be resolved, it is a relatively vague address and does not need to be desensitized. If longitude and latitude can be resolved, they are transformed within the district/county in which the original address is located to generate a new address, and the address is generalized to street/township level according to the user's access rights.
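
The hash-and-look-up replacement of named entities can be illustrated with a short sketch. The text does not specify the hash function or the code table layout, so SHA-256, the tiny `CODE_TABLE` and the `salt` parameter below are assumptions made only for illustration; a real table would hold entries on the order of millions.

```python
import hashlib
from typing import Sequence

# Hypothetical miniature code table of replacement names.
CODE_TABLE = ["Name-0001", "Name-0002", "Name-0003", "Name-0004", "Name-0005"]


def pseudonym(entity: str, code_table: Sequence[str], salt: bytes = b"") -> str:
    """Map a named entity to a replacement by hashing it into the code table."""
    digest = hashlib.sha256(salt + entity.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(code_table)
    return code_table[index]


replacement = pseudonym("Zhang San", CODE_TABLE)
```

Because the mapping is deterministic, the same original name always receives the same replacement, which keeps records linkable after desensitization. Two different names can still land on the same table entry; a larger table makes this less likely but does not rule it out.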

Preferably, the desensitization depth measures the degree of difference between the desensitized data set and the original data set; the greater the difference, the greater the desensitization depth. It is calculated as follows:

(I) Calculation of the desensitization depth of numeric attributes:

Given the value range of a numeric attribute before desensitization and the attribute value after desensitization, the numeric attribute desensitization depth D_sz(m,m^*) is:

(II) Calculation of the desensitization depth of categorical attributes:

The desensitization depth of a categorical attribute is obtained by constructing a generalization tree model. The categorical attribute desensitization depth D_fl(r,r^*) is calculated with the following formula:

D_fl(r,r^*) = ((N_h-1)×step(r,r^*)) / ((N-1)×step(r,e))

where r and r^* denote the attribute value before and after desensitization, N_h denotes the number of child nodes under the same parent node as the pre-desensitization value of the categorical attribute, N denotes the number of leaf nodes of the generalization tree, e denotes the root node, and step(x,y) denotes the number of steps from attribute value node x to the desensitized attribute node y;

(III) Combining steps (I) and (II) gives the data set desensitization depth D(T,T^*), as follows:

where n denotes the number of records in the data set, and c₁ and c₂ denote the number of numeric attributes and the number of categorical attributes respectively.
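
The categorical depth formula in (II) can be worked through on a small generalization tree. The sketch below is an illustration under assumptions: the education tree, its node names and the child-to-parent dictionary are invented here, and N_h is read as the count of nodes sharing a parent with r, including r itself. The numeric depth in (I) and the combined depth in (III) are not sketched because their formulas are not given in this text.

```python
from typing import Dict

# Hypothetical generalization tree, stored as child -> parent.
PARENT: Dict[str, str] = {
    "primary school": "basic education",
    "junior high school": "basic education",
    "senior high school": "higher education",
    "university": "higher education",
    "basic education": "any education",
    "higher education": "any education",
}
ROOT = "any education"


def step(x: str, y: str, parent: Dict[str, str]) -> int:
    """Number of upward steps from node x to its ancestor y."""
    steps, node = 0, x
    while node != y:
        if node not in parent:
            raise ValueError(f"{y!r} is not an ancestor of {x!r}")
        node = parent[node]
        steps += 1
    return steps


def categorical_depth(r: str, r_star: str, parent: Dict[str, str], root: str) -> float:
    """D_fl(r, r*) = ((N_h - 1) * step(r, r*)) / ((N - 1) * step(r, e)).

    r is expected to be a non-root node of the tree.
    """
    internal = set(parent.values())
    n_leaves = sum(1 for node in parent if node not in internal)      # N
    n_h = sum(1 for p in parent.values() if p == parent[r])           # N_h
    denominator = (n_leaves - 1) * step(r, root, parent)
    if denominator == 0:
        raise ValueError("depth is undefined for this tree")
    return (n_h - 1) * step(r, r_star, parent) / denominator


depth_one_level = categorical_depth("university", "higher education", PARENT, ROOT)
depth_unchanged = categorical_depth("university", "university", PARENT, ROOT)
```

With four leaves, two siblings under "higher education" and two steps from "university" to the root, generalizing "university" by one level gives (2-1)×1 / ((4-1)×2) = 1/6, and leaving the value unchanged gives 0.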

Preferably, controlled output of the desensitized data set is achieved by applying a hash transformation to the original storage link to generate a new download link address.
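
A minimal sketch of deriving a download address from the original storage link follows. The text only says that a hash transformation is used; the keyed hash (HMAC-SHA-256), the secret key and the placeholder base URL below are assumptions introduced for the example, chosen so that the token cannot be recomputed by someone who merely guesses the storage path.

```python
import hashlib
import hmac

BASE_URL = "https://example.invalid/download/"  # placeholder, not a real endpoint


def download_link(storage_path: str, secret_key: bytes, base_url: str = BASE_URL) -> str:
    """Derive an opaque download address from the original storage path."""
    token = hmac.new(secret_key, storage_path.encode("utf-8"), hashlib.sha256).hexdigest()
    return base_url + token


link = download_link("/data/mediation/cases_desensitized.csv", b"demo-key")
```

The returned address exposes neither the original path nor the file name; the server side would need its own mapping from token to stored file, which is outside this sketch.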

A sensitive information desensitization system for data sharing comprises a system management unit, a data source management unit, a sensitive information recognition unit, a sensitive information processing unit and a data output unit. The system management unit builds the user accounts and access control of the desensitization system and identifies each user's role and permissions, so that only legitimately authorized users can operate on the corresponding data. The data source management unit stores data source information. The sensitive information recognition unit automatically identifies sensitive information in data sources of all types and calculates the associations between the sensitive attributes in a data set. The sensitive information processing unit automatically creates desensitization tasks and matches desensitization strategies and desensitization algorithms. The data output unit safely and effectively controls the output of data when sensitive data is used. The system management unit, data source management unit, sensitive information recognition unit, sensitive information processing unit and data output unit are connected in sequence.

Preferably, the data source management unit covers data source type, IP address, storage address, and the extraction and management of data source data structures. The sensitive information recognition unit segments text data using natural language processing techniques and, on the basis of manually constructed sensitive information knowledge bases and manually labeled sensitivity levels, automatically identifies sensitive information by rule-based and pattern-matching methods, while using the Sigmoid function method to calculate the degree of association between sensitive attributes. The sensitive information processing unit reviews data use applications using natural language processing techniques and automatically creates the corresponding desensitization tasks; it desensitizes each kind of sensitive information using, respectively, the sensitive attribute generation rule library, hash table lookup, transformation of address longitude and latitude within a range, and the various desensitization algorithms. The data output unit replaces the original sensitive attribute values with the desensitized values and uses a hash algorithm to transform the original data storage address into a new storage address for data output.

The beneficial effects of the present invention are: (1) the invention avoids duplication of uniquely identifying attributes in the desensitized data set; (2) by calculating the degree of association of sensitive attributes with the Sigmoid function, closely associated attributes are placed in one group, which not only prevents desensitized data from being reconstructed but also allows weakly correlated attributes to be deleted, improving computational efficiency; (3) the invention combines the desensitization depths of numeric and categorical attributes to calculate the desensitization degree of the whole data set, and controls the desensitization effect effectively by setting a threshold; (4) the invention applies a hash transformation to the original storage link to generate a new download link address, achieving controlled output of data and protecting sensitive information; (5) the invention is applicable to desensitizing sensitive information in both structured data of database type and unstructured data of document type, and offers good desensitization effect and high reliability.

Brief description of the drawings

Fig. 1 is an architecture diagram of the system of the present invention;

Fig. 2 is a flow diagram of the method of the present invention;

Fig. 3 is a schematic diagram of the input data source format in the embodiment of the present invention;

Fig. 4 is a block diagram of named-entity recognition in the embodiment of the present invention.

Specific embodiment

The present invention is further described below with reference to a specific embodiment, but the scope of protection of the present invention is not limited thereto:

Embodiment: as shown in Fig. 1, a sensitive information desensitization system for data sharing comprises: a system management unit for setting up and managing system user account information and configuring roles and permissions; a data source management unit for storing data source information; a sensitive information recognition unit that automatically identifies sensitive information in data sources of all types and calculates the associations between sensitive attributes in a data set; a sensitive information processing unit that automatically creates desensitization tasks and matches desensitization strategies and desensitization algorithms; and a data output unit that safely and effectively controls the use of sensitive data. The system management unit builds the desensitization system's user accounts and access control and identifies each user's role and permissions, so that only legitimately authorized users can operate on the corresponding data.

The data source management unit stores data source information, including original data source information and target data source information; the data source may be one or more of database data, document data, data warehouse data and the like. Unified data source management provides global control over sensitive data sources, covering the data source IP address, storage address, name, data type, database type, user name and password. Data sources of all types can also be preprocessed; a preprocessed data source is given a newly generated address link for use by the subsequent sensitive information recognition unit and sensitive information processing unit.

The sensitive information recognition unit automatically identifies sensitive information in data sources of all types according to the constructed sensitive information knowledge bases, default sensitive information discovery rules and user-defined discovery rules. Through prior labeling of sensitivity levels and correlation analysis of sensitive attributes, it further determines the sensitive attributes at each level and the associations between them, preventing sensitive data from being reconstructed because desensitization was not deep enough, which would cause a secondary leak.

The sensitive information processing unit sets, based on user permissions and access control, the corresponding desensitization strategy, desensitization rules and desensitization algorithms for sensitive attributes at each level, and also supports user-defined desensitization methods.

The data output unit protects data during use and download. The output protection method replaces the original sensitive attribute values with the desensitized values and generates a new storage address without changing the storage address or content of the source data. The storage address of the desensitized data is generated by applying a hash algorithm to the original data storage address. To avoid reducing the storage efficiency of the big data platform, the desensitized data is destroyed promptly.

The data set in this embodiment comes from the people's mediation documents of part of a certain city. In each mediation record, the case details and the mediation agreement are document data, such as PDF and Word documents; all other attributes are stored in database tables as structured data.

As shown in Fig. 2, a sensitive information desensitization method for data sharing is implemented as follows:

Step 1: Data acquisition and preprocessing

Step 1.1: Data acquisition

The data provider publishes information using the account and permissions obtained from the system management unit, and the acquired data is stored in the data source management unit. Table 1 shows the field structure of a people's mediation case.

Table 1

The format of the input data source in the system is shown in Fig. 3 (because individual privacy is involved, the input data has already been desensitized by replacing digits with letters, but it is treated as real data for the purposes of the present invention). When a data user wants to obtain data, the user submits an application; after the application is approved, the system performs the data desensitization operation according to the application request.

Step 1.2: Structured data preprocessing and document data parsing

Preprocessing of structured data mainly marks the following in each attribute value: noisy data (including errors and outliers that deviate from expectation), inconsistent data (attribute values represented inconsistently within the data set, such as a date of birth that does not match the birth date in the ID card number), duplicated values of uniquely identifying attributes (such as repeated ID card numbers), and missing values. Non-standard representations are normalized; for example, if the case occurrence time is recorded as 16-06-12, the original data should be transformed to 2016-06-12.

Document data parsing uses an appropriate parsing tool to extract the text content of a document, for example POI to parse Word documents and PDFBox to process PDF files; other document formats can also be parsed, such as HTML, WORD, XML, PDF, EXCEL and TXT.
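
Two of the structured-data checks above, normalizing a short date and flagging repeated unique identifiers, can be sketched in a few lines. The function names and the accepted date formats are assumptions for illustration; the text gives only the 16-06-12 to 2016-06-12 example and the duplicate ID card case.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional, Set


def normalize_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed."""
    for fmt in ("%Y-%m-%d", "%y-%m-%d"):  # four-digit year first, then two-digit
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # left for manual marking as a non-standard value


def duplicated_ids(values: Iterable[str]) -> Set[str]:
    """Return the identifier values that occur more than once."""
    counts = Counter(values)
    return {value for value, count in counts.items() if count > 1}


normalized = normalize_date("16-06-12")
repeats = duplicated_ids(["ID-A", "ID-B", "ID-A", "ID-C"])
```

Unparseable dates are returned as `None` rather than guessed, so they can be marked alongside missing values instead of being silently rewritten.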

Step 1.3: Text data segmentation and part-of-speech tagging

(1) The "case details" of a mediation case read as follows:

Party A and Party B are upstairs and downstairs neighbors. The water tap in Room 203, No. A, Lane ABC, Zhongshan Road, Shanghai was not turned off, causing water to seep into the cabinet of Zhang San's home in Room 103 downstairs and soak the clothes. Zhang San went upstairs to resolve the matter but found nobody at home, so he consulted the property management and learned that the owner's name is Li Si, with contact number 19821210912. He contacted the owner immediately and asked for prompt handling, but after 3 days the upstairs household had still not dealt with the problem, which has caused serious losses to the downstairs household. The owner of Room 103 now applies to the Yi Village People's Mediation Committee for mediation and requests compensation for the losses.

(2) Segmentation with an imported dictionary and stop words

Custom dictionaries are defined for organization-name suffixes, regions, new words, special terms and the like, for example by adding "mediation committee", "upstairs and downstairs", "the Belt and Road" and "construction project". Segmentation gives priority to the dictionary, so "promote the Belt and Road construction project" is preferentially segmented as promote / the Belt and Road / construction project. The various stop-word lists available online are collected, de-duplicated and supplemented to compile a fairly comprehensive list, containing words such as "Party A", "Party B", "both parties", "carry out" and "even if", as well as punctuation marks.
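
Dictionary-priority segmentation with stop-word removal can be sketched with forward maximum matching. This is a swapped-in, simplified technique, since the text does not name the segmentation tool; the Chinese strings are assumed back-translations of the example ("推进一带一路建设项目" for "promote the Belt and Road construction project", "甲方" for "Party A").

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str], stop_words: Set[str]) -> List[str]:
    """Forward maximum matching: prefer the longest dictionary word at each position."""
    lexicon = dictionary | stop_words
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                break
        tokens.append(piece)
        i += size
    return [token for token in tokens if token not in stop_words]


custom_dictionary = {"推进", "一带一路", "建设项目"}  # promote / Belt and Road / construction project
stop_words = {"甲方", "，"}                           # Party A, comma

tokens = segment("甲方推进一带一路建设项目", custom_dictionary, stop_words)
```

Stop words are added to the lookup lexicon so that they are first recognized as whole tokens and only then dropped; otherwise "甲方" would be split into single characters and slip past the filter.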

(3) Segmentation and part-of-speech tagging result for the "case details"

upstairs and downstairs/n neighbors/n live/v Shanghai/ns Zhongshan Road/ns No. A/m 203/m room/n tap water/l faucet/n turn off/v downstairs/s Zhang San/nr 103/m room/n cabinet/n seep/v clothes/n soak/n go/v upstairs/n resolve/v find/v indoors/s not have/v person/n look for/v property management/n negotiate/n learn/v owner/n name/n Li Si/nr contact method/n 19821210912/m contact/n request/v promptly handle/i 3/m day/q upstairs/n handle/n downstairs/n user/n serious loss/l Yi Village/n People's Mediation Committee/n apply/v mediate/v 103/m room/n owner/n compensate for losses/n.

Step 1.1 falls within the functions of the system management unit, and steps 1.2 and 1.3 fall within the functions of the data source management unit.

Step 2: Constructing the sensitive information keyword libraries

The sensitive information keyword libraries of each kind are constructed manually and their sensitivity levels are labeled. For example, the keyword library for contact information in the number and numeric class includes expressions such as telephone number, contact method, mobile phone number, communication method, home telephone, Mobile number, Unicom number and Telecom number. Sensitive information is divided into four levels according to sensitivity. The first level is identifiable attributes, which can definitely locate a person, such as ID card number, name and address. The second level is semi-identifying attributes: a single column cannot locate a person, but several columns together can potentially identify one. The third level is sensitive attributes, such as disease, income and education level. The fourth level is non-sensitive attributes, as shown in Table 2. The sensitive information discussed in the present invention is the first three levels of sensitive attributes, as shown in Table 3.

Table 2

Table 3

Step 3: Automatic identification of number and numeric sensitive information

Number and numeric sensitive information includes ID card numbers, various card accounts and passwords, contact information, virtual accounts and passwords, license plate numbers, social security numbers and the like. Such sensitive information can be identified on the basis of its generation rules and found with regular expressions. Because such information can definitely identify a person, these attributes are labeled as identifiable sensitive attributes.
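
Regular-expression discovery of number-type sensitive information might look like the sketch below. The two patterns are assumptions chosen for illustration (an 11-digit mainland mobile number and an 18-character resident ID number); the text does not list its actual rules, and real rule sets would also cover card accounts, license plates and so on.

```python
import re
from typing import List, Tuple

# Hypothetical patterns; lookarounds stop a match from starting or ending inside a longer digit run.
PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_numeric_sensitive(text: str) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs for every pattern hit, in text order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]


found = find_numeric_sensitive("owner Li Si, contact method 19821210912, Room 103")
```

Every hit would then be labeled as an identifiable sensitive attribute, in line with the paragraph above.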

Step 4: Automatic identification of named entities

Personal names and organization names are identified by combining part-of-speech tagging, based on the Viterbi algorithm of a hidden Markov model (HMM) from natural language processing, with a constructed named-entity knowledge base.

The named-entity knowledge base comprises the sensitive information keyword libraries, named-entity patterns of each kind, prefix and suffix rules, and context templates. The named-entity patterns, the prefix and suffix rules and their positions can be discovered from a training corpus: feature vocabularies of named entities, prefix and suffix rule vocabularies and the corresponding position vocabularies are obtained, and these are then combined with the parts of speech tagged by the segmentation tool to extract the entity from the surrounding parts of speech, as shown in Fig. 4.
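
The Viterbi decoding behind HMM part-of-speech tagging can be shown on a toy model. Everything numeric below is invented for the example: the three tags (nr for personal name, v for verb, n for noun), the start, transition and emission probabilities, and the three-word sentence. A real tagger would estimate these from a tagged corpus and work in log space.

```python
from typing import Dict, List, Sequence


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable tag sequence for the observed words."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for word in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


STATES = ["nr", "v", "n"]
START = {"nr": 0.4, "v": 0.2, "n": 0.4}
TRANS = {
    "nr": {"nr": 0.1, "v": 0.6, "n": 0.3},
    "v": {"nr": 0.3, "v": 0.1, "n": 0.6},
    "n": {"nr": 0.2, "v": 0.5, "n": 0.3},
}
EMIT = {
    "nr": {"Zhang San": 0.9, "owner": 0.1},
    "v": {"contact": 0.7},
    "n": {"contact": 0.3, "owner": 0.7},
}

tags = viterbi(["Zhang San", "contact", "owner"], STATES, START, TRANS, EMIT)
```

Tokens tagged nr (or an organization tag) by such a decoder would then be checked against the named-entity knowledge base described above.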

Step 5: Accurate identification of address information

A more precise address is obtained by examining the adjacent word sequence of address fragments after segmentation. If the adjacent word sequence (2-3 consecutive words of context) expresses address information or satisfies an address matching rule, the words are combined and identified again, and the result is converted to longitude and latitude. If longitude and latitude can be calculated, the address is an identifiable attribute value. For example, in Shanghai/ns Zhongshan Road/ns No. A/m, the two words Shanghai and Zhongshan Road are detected as address sensitive attributes, and address pattern matching shows that the following No. A also belongs to the address. Combining these adjacent words yields the more precise address No. A, Zhongshan Road, Shanghai, whose longitude and latitude are then calculated.
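
The merging of adjacent address fragments can be sketched as a scan over tagged tokens. The house-number pattern, the minimum run length of two and the English token forms are assumptions for illustration; the longitude and latitude conversion is left out because it depends on an external geocoding service that the text does not name.

```python
import re
from typing import List, Tuple

HOUSE_NUMBER = re.compile(r"^No\.\s*\w+$")  # hypothetical address matching rule


def merge_addresses(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join runs of place names (ns) and trailing house numbers (m) into addresses."""
    addresses: List[str] = []
    run: List[str] = []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        extends_run = tag == "ns" or (bool(run) and tag == "m" and HOUSE_NUMBER.match(word))
        if extends_run:
            run.append(word)
        else:
            if len(run) >= 2:  # a lone place name is treated as too vague
                addresses.append(" ".join(run))
            run = []
    return addresses


tokens = [("live", "v"), ("Shanghai", "ns"), ("Zhongshan Road", "ns"),
          ("No. A", "m"), ("203", "m"), ("room", "n")]
merged = merge_addresses(tokens)
```

In this example "203" is not absorbed because it does not satisfy the assumed house-number rule, which mirrors the example above where the combined address stops at No. A.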

Step 6: Calculating the degree of association between sensitive attributes

By calculating the degree of association between attributes, the correlations between sensitive attributes in the data set are found; the greater the degree of association, the stronger the correlation. Closely associated sensitive attributes can then be grouped together and attributes with very weak association deleted. This reduces the size of the data set to be desensitized, reduces the computation required for desensitization and improves the efficiency of the corresponding algorithms. Starting from the identifying and semi-identifying sensitive attributes determined by prior knowledge, other sensitive attributes can also be discovered in this way, which further improves the desensitization effect and prevents leakage caused by recombining sensitive attributes.

In the present invention, the Sigmoid function is used to normalize the degree of association of categorical sensitive attributes. It is defined as follows:

Here, the function takes values in the interval [0,1] and is continuous, smooth, and monotonically increasing. When x = 0, its value is 0.5.
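The formula itself is not reproduced in this text. A minimal sketch, assuming the standard logistic form 1/(1 + e^(-x)), which matches the stated properties (value 0.5 at x = 0, continuous, monotonically increasing); the function name `sigmoid` is chosen here for illustration:

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function 1 / (1 + e^(-x)) (assumed form).

    The two branches avoid overflow in math.exp for large |x|.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

print(sigmoid(0.0))  # 0.5
```

In exact arithmetic the values lie strictly between 0 and 1; the endpoints are reached only through floating-point rounding at extreme inputs.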

Suppose that each record in data set T has p attributes {u₁,u₂,...,u_p}, and that each attribute has several attribute values, whose counts are denoted {q₁,q₂,...,q_p}. Within a record, an attribute value of a sensitive attribute is marked 1 if it occurs and 0 if it does not, so the record can be expressed as a (q₁+q₂+...+q_p)-dimensional row vector. When data set T has n records, denoted {t₁,t₂,...,t_n}, there are n such (q₁+q₂+...+q_p)-dimensional row vectors.

XNOR and XOR operations are applied to the values at corresponding positions of the (q₁+q₂+...+q_p)-dimensional row vectors. Under the XNOR operation, two cases are distinguished: the attribute values at a corresponding position are both marked 1, or both marked 0. The degree of association S(I₁,I₂) of two attributes is then calculated as:

Here, the parameters λ₁, λ₂, λ₃ are set in the present invention to 0.5, 0.25, and 0.25 respectively, and the range is 0 ≤ S(I₁,I₂) ≤ 1.

In this embodiment, the degree of association between sensitive attributes is measured by constructing a Sigmoid function. Formula (1) and formula (2) are used to calculate the correlation coefficient of two sensitive attributes; the larger the coefficient, the stronger the correlation.

For example, the education attribute has the values {university, senior high school, junior high school, primary school}, and the wage attribute has the values {above 10K, 10K-8K, 8K-6K, 6K-4K, 4K-2K, below 2K}. Ordering the values of both attributes as {university, senior high school, junior high school, primary school, above 10K, 10K-8K, 8K-6K, 6K-4K, 4K-2K, below 2K}, the vectors obtained for record 1, record 2, and record 3 are:

{ 1,0,0,0,1,0,0,0,0,0 }, { 0,0,1,0,0,0,1,0,0,0 }, { 1,0,0,0,1,0,0,0,0,0 }.

Applying XNOR and XOR pairwise to the three records above gives θ(x) = 0.4, and formula (1) then yields a correlation of 0.95.
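The encoding and the position-wise comparison in this example can be sketched as follows. The sketch only produces the 0/1 vectors and the counts of "both 1", "both 0", and "differ" positions; it does not reproduce the association formula, which is not given in this text. The value labels and the names `encode` and `compare` are illustrative assumptions.

```python
EDUCATION = ["university", "senior high school", "junior high school", "primary school"]
WAGE = ["above 10K", "10K-8K", "8K-6K", "6K-4K", "4K-2K", "below 2K"]

def encode(record):
    """Encode an (education, wage) record as a 10-dimensional 0/1 vector."""
    education, wage = record
    return [int(v == education) for v in EDUCATION] + [int(v == wage) for v in WAGE]

def compare(a, b):
    """Count positions that are both 1, both 0 (the two XNOR cases), or different (XOR)."""
    both_one = sum(x & y for x, y in zip(a, b))
    both_zero = sum((1 - x) & (1 - y) for x, y in zip(a, b))
    differ = sum(x ^ y for x, y in zip(a, b))
    return both_one, both_zero, differ

records = [("university", "above 10K"),
           ("junior high school", "8K-6K"),
           ("university", "above 10K")]
vectors = [encode(r) for r in records]
print(vectors[0])                       # [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(compare(vectors[0], vectors[1]))  # (0, 6, 4)
print(compare(vectors[0], vectors[2]))  # (2, 8, 0)
```

Records 1 and 3 agree at every position, while records 1 and 2 share no value marked 1, which is the kind of contrast the association measure is meant to capture.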

In the present invention, other methods may also be used to calculate the degree of association between sensitive attributes, and these fall within the scope of protection of the invention. One example is the Apriori algorithm based on the frequent itemsets of association rules, which iteratively finds the frequent itemsets of sensitive attributes that satisfy the conditions. Another is the mean-square contingency coefficient: assume two sensitive attributes I₁ and I₂ with value domains {v₁₁,v₁₂,...,v_1p} and {v₂₁,v₂₂,...,v_2q} respectively. The mean-square contingency coefficient of I₁ and I₂ is then:

Here, the numbers of occurrences of the sensitive attribute values v_1i and v_2j in the original data set are denoted f_i. and f_.j, and f_ij denotes the number of records in which v_1i and v_2j occur together. f_i. and f_.j therefore satisfy the relationships f_i. = Σ_j f_ij and f_.j = Σ_i f_ij, and 0 ≤ Φ²(S₁,S₂) ≤ 1.

Steps 2 to 6 belong to the functions of the sensitive information identification unit.

The following steps 7 to 11 belong to the functions of the sensitive information processing unit. Based on natural language processing technology, the system can automatically review the desensitization request submitted by a data provider or the application content filled in by a data consumer (including the application scenario, application purpose, and so on). Once the review is passed, the system automatically creates the corresponding desensitization task, identifies the sensitive information in the requested data, and desensitizes it according to that task.

Step 7: Setting the desensitization algorithms

Desensitization algorithms based on data distortion and encryption are configured in the system, such as random-number replacement, custom exchange replacement, hashing, and encryption algorithms, which transform the original data. Depending on the actual requirements of the desensitization task, certain characters of the data may also be truncated, the data may be generalized, and so on.
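A few of the algorithm families named in Step 7 can be sketched as small functions. These are illustrative stand-ins under stated assumptions, not the algorithms configured in the actual system; the function names, the `demo-salt` default, and the masking and bucketing conventions are all hypothetical.

```python
import hashlib
import random

def random_replace(value: str, rng: random.Random) -> str:
    """Replace every digit with a random digit; other characters and length are kept."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)

def hash_value(value: str, salt: str = "demo-salt") -> str:
    """One-way transformation using salted SHA-256 (hex digest)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def truncate(value: str, keep: int = 3, mask: str = "*") -> str:
    """Keep the first `keep` characters and mask the rest."""
    return value[:keep] + mask * max(len(value) - keep, 0)

def generalize(value: float, width: int = 10) -> str:
    """Replace a number with the half-open bucket [low, low + width) that contains it."""
    low = int(value // width) * width
    return f"{low}-{low + width}"

rng = random.Random(0)
print(random_replace("138-1234-5678", rng))
print(truncate("13812345678"))   # 138********
print(generalize(37))            # 30-40
```

Which function applies to which field would be decided by the desensitization task, for example truncation for display and hashing for join keys.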

Step 8: Desensitization based on the sensitive attribute generation rule library

For sensitive data of number or numeric type, rules for generating the sensitive attribute can be formulated and stored in the sensitive attribute generation rule library. A generation rule may be exactly the same as the rule by which the sensitive field in the original data was generated. The desensitization algorithm preset in step 7 is then invoked to transform the newly generated sensitive attribute values according to the desensitization task, which produces the desensitized data. For example, using the generation rule of an ID card number or of a date, the characters at sensitive positions are replaced or obfuscated according to given rules, while the characters that carry statistical meaning, such as administrative region, age bracket, and gender, are retained. This achieves highly realistic simulation, preserves the uniqueness of the identification number, and keeps the data convenient for statistical analysis, while making it impossible to tell whether a value is authentic.
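As an illustration of a generation rule, the sketch below rewrites an 18-digit resident identity number while keeping the region code, birth year, and gender parity, and then recomputes the check digit. The weights and check codes are the standard ones for 18-digit resident identity numbers; which positions to keep (in particular, using the birth year as the "age bracket") and the function names are assumptions made here. Uniqueness across a data set would need an extra collision check that is not shown.

```python
import random

WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CODES = "10X98765432"

def check_digit(first17: str) -> str:
    """Check character for the first 17 digits (weighted sum modulo 11)."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CODES[total % 11]

def desensitize_id(id_number: str, rng: random.Random) -> str:
    """Generate a format-valid replacement that keeps region, birth year, and gender."""
    if len(id_number) != 18:
        raise ValueError("expected an 18-character identity number")
    region, year = id_number[:6], id_number[6:10]
    month, day = rng.randint(1, 12), rng.randint(1, 28)  # day <= 28 is valid in any month
    sequence = rng.randint(0, 99)
    parity = int(id_number[16]) % 2                       # odd = male, even = female
    gender = rng.choice([d for d in range(10) if d % 2 == parity])
    body = f"{region}{year}{month:02d}{day:02d}{sequence:02d}{gender}"
    return body + check_digit(body)

sample = "11010519491231002X"  # widely used sample number, not a real person
print(desensitize_id(sample, random.Random(1)))
```

Because the output still passes the check-digit rule and keeps the statistically meaningful positions, downstream analysis by region, age, or gender continues to work on the desensitized values.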

Step 9: Desensitization of named entities

For named entities such as organization names and personal names, desensitization uses a code table of common Chinese named entities that stores on the order of a million organization names and Chinese personal names. The original named entity is hashed, looked up in the table, and replaced with the entry found.
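The hash-and-look-up replacement can be sketched as follows. The five-entry `CODE_TABLE`, the salt, and the use of SHA-256 are assumptions for illustration; the text specifies only a code table with about a million entries and a hash lookup.

```python
import hashlib

# Hypothetical miniature code table; the described table holds about a million entries.
CODE_TABLE = ["Zhang Wei", "Wang Fang", "Li Na", "Liu Yang", "Chen Jing"]

def replace_entity(name: str, table=CODE_TABLE, salt: str = "demo-salt") -> str:
    """Map a named entity to a table entry chosen by a salted hash of the name."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(table)
    return table[index]

print(replace_entity("Example Person"))
```

The mapping is deterministic, so the same original name always receives the same replacement, which keeps joins across tables consistent. With a small table, distinct names may collide or a name may map to itself; a table of a million entries makes both far less likely.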

Step 10: Desensitization of address information

Sensitive data of the address class can be desensitized according to the level of detail of the address information. The method is to convert the address to longitude and latitude. If the original sensitive address cannot be resolved, it is a relatively vague address and does not need to be desensitized. If longitude and latitude can be resolved, they are transformed within the range of the district or county in which the original address lies, another new address is generated, and the address is blurred to the street or township level according to the user's access rights.
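A minimal sketch of this flow, assuming a geocoder is available as a callable. Everything named here is hypothetical: `DISTRICT_BOUNDS` holds made-up bounding-box numbers, `fake_geocode` stands in for a real geocoding service, and drawing a uniform random point inside the district is only one possible way to "transform within the range of the district or county". Reverse geocoding the new point into an address string is omitted.

```python
import random

# Hypothetical gazetteer with made-up numbers: district -> (min_lat, max_lat, min_lon, max_lon).
DISTRICT_BOUNDS = {"District X": (30.0, 30.1, 120.0, 120.1)}

def fake_geocode(address):
    """Stand-in geocoder: returns (district, lat, lon), or None if unresolvable."""
    return ("District X", 30.05, 120.05) if "District X" in address else None

def desensitize_address(address, geocode, rng):
    """Return (label, point); point is None when the address cannot be resolved."""
    located = geocode(address)
    if located is None:
        return address, None            # vague address: left unchanged
    district, _lat, _lon = located
    min_lat, max_lat, min_lon, max_lon = DISTRICT_BOUNDS[district]
    new_point = (round(rng.uniform(min_lat, max_lat), 5),
                 round(rng.uniform(min_lon, max_lon), 5))
    return district, new_point          # label blurred to district level

rng = random.Random(0)
print(desensitize_address("No. 1, Some Road, District X", fake_geocode, rng))
print(desensitize_address("near the old bridge", fake_geocode, rng))
```

In a fuller version the returned label would depend on the user's access rights, for example street level for one role and township level for another.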

Step 11: Calculation of the desensitization depth of sensitive information

The desensitization depth measures the degree of difference between the desensitized data set and the original data set. The larger the difference, the larger the desensitization depth and the higher the data security; conversely, the smaller the difference, the lower the security. The desensitization depth is calculated as follows:

11.1) Desensitization depth of numeric attributes

Suppose that a numeric attribute has a given value domain before desensitization, with m denoting the attribute value before desensitization and m^* the attribute value after desensitization. The desensitization depth of the numeric attribute, D_sz(m,m^*), is then:

11.2) Desensitization depth of categorical attributes

In the present invention, the desensitization depth of a categorical attribute is obtained by constructing a generalization tree model. The categorical attribute desensitization depth D_fl(r,r^*) is calculated with the following formula:

D_fl(r,r^*) = ((N_h - 1) × step(r,r^*)) / ((N - 1) × step(r,e))    (5)

Here, r and r^* denote the attribute value before and after desensitization, N_h denotes the number of child nodes that share a parent node with the pre-desensitization attribute value of the categorical attribute (that value included), N denotes the number of leaf nodes of the generalization tree, e denotes the root node, and step(x, y) denotes the number of steps from attribute value node x to the post-desensitization attribute node y.
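Formula (5) can be worked through on a small generalization tree. The tree below (education values grouped under two hypothetical intermediate nodes, "higher" and "basic", with root "any") and the names `PARENT`, `step`, and `categorical_depth` are assumptions for illustration; N_h is counted as the value together with its siblings.

```python
# Child -> parent links of a hypothetical generalization tree.
PARENT = {
    "university": "higher", "senior high school": "higher",
    "junior high school": "basic", "primary school": "basic",
    "higher": "any", "basic": "any",
}
ROOT = "any"

def step(x, y):
    """Number of edges from node x up to its ancestor y (KeyError if y is not an ancestor)."""
    count = 0
    while x != y:
        x = PARENT[x]
        count += 1
    return count

def categorical_depth(r, r_star):
    """D_fl(r, r*) = ((N_h - 1) * step(r, r*)) / ((N - 1) * step(r, e))."""
    parents = set(PARENT.values())
    n_h = sum(1 for p in PARENT.values() if p == PARENT[r])   # r and its siblings
    n = sum(1 for node in PARENT if node not in parents)      # leaf nodes
    return ((n_h - 1) * step(r, r_star)) / ((n - 1) * step(r, ROOT))

print(categorical_depth("university", "university"))  # 0.0
print(categorical_depth("university", "higher"))      # (1*1)/(3*2)
print(categorical_depth("university", "any"))         # (1*2)/(3*2)
```

With N_h = 2 and N = 4, generalizing "university" one level gives 1/6 and generalizing it to the root gives 1/3, so the depth grows with the number of generalization steps, as the formula intends.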

11.3) Desensitization depth of the data set

Combining 11.1) and 11.2), the data set desensitization depth D(T,T^*) is calculated as:

In this embodiment, formula (4) and formula (5) are used to calculate the desensitization depths of the numeric and categorical sensitive attributes respectively, and formula (6) is then used to calculate the desensitization depth of the entire data set.

In the present invention, the calculation of the data set desensitization depth is not limited to the method of step 11. Other methods may also be used, for example expressing the desensitization depth as the entropy-based information loss, given by:

Here R_m denotes the number of records in the data set that contain m, and R_n denotes the number of records that contain n after one round of desensitization; H(R_n) and H(R_m) denote the information entropy of R_n and R_m.

In addition, the general expression of H(R_n) and H(R_m) is:

where freq(R_x, s) denotes the number of records in data set R_x that have the value s.
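The exact expressions are not reproduced in this text. As a hedged sketch, assume that H is the Shannon entropy of the value frequencies in a column and that the information loss is the drop in entropy from the original column to the desensitized one; both assumptions, along with the names `entropy` and `information_loss`, are introduced here and may differ from the patent's formulas.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of the empirical value distribution."""
    counts = Counter(values)            # counts[s] plays the role of freq(R_x, s)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def information_loss(original, desensitized):
    """Assumed measure: entropy lost when values are desensitized."""
    return entropy(original) - entropy(desensitized)

ages = [23, 27, 31, 38]
buckets = ["20-30", "20-30", "30-40", "30-40"]   # ages generalized to decades
print(entropy(ages), entropy(buckets), information_loss(ages, buckets))  # 2.0 1.0 1.0
```

Four distinct ages carry 2 bits of entropy and two decade buckets carry 1 bit, so generalization loses 1 bit under this measure; a larger loss corresponds to a deeper desensitization.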

Step 12: Output of desensitized data

The data consumer obtains the desensitized data according to user permissions. The data output protection method replaces the original sensitive attribute values with the desensitized values and generates a new storage address, without changing the storage address or content of the source data. The storage address of the desensitized data is generated by transforming the original data storage address with a hash algorithm. To avoid reducing the storage efficiency of the big data platform, desensitized data are destroyed promptly.
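Deriving the new storage address from the original one can be sketched as below. The text specifies only that a hash algorithm transforms the original storage address; the use of SHA-256, the `task_id` input, the `/desensitized/` prefix, and the function name are assumptions for illustration.

```python
import hashlib

def desensitized_location(original_path: str, task_id: str) -> str:
    """Derive an output location from the source location without revealing it."""
    digest = hashlib.sha256(f"{task_id}:{original_path}".encode("utf-8")).hexdigest()
    return f"/desensitized/{digest[:32]}"

source = "/warehouse/hr/salaries.parquet"   # hypothetical source path
print(desensitized_location(source, "task-001"))
```

Including a per-task identifier in the hashed input gives each desensitization task its own output location, which makes it straightforward to destroy one task's output without touching the source data or other tasks.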

The above describes specific embodiments of the present invention and the technical principles employed. Any change made according to the conception of the present invention whose resulting function does not depart from the spirit covered by the specification and drawings shall fall within the scope of protection of the present invention.

Claims

1. A data-sharing-oriented sensitive information desensitization method, characterized by comprising the following steps:

(1) Sensitive information automatic identification rules and sensitive information processing rules are preset, wherein the automatic identification rules include constructing keyword libraries for each class of sensitive information, automatic identification of the sensitive information in the keyword libraries, automatic identification of number and numeric sensitive information, automatic identification of named-entity sensitive information, and accurate identification of address sensitive information; the sensitive information processing rules include sensitive attribute generation rules, setting the desensitization algorithms, named-entity desensitization, and address information desensitization; a data consumer requests to view the data published by a data provider;

(4) The sensitive information is analyzed by calculating the degree of association between sensitive attributes, and the sensitive information whose degree of association is higher than a threshold is retained, the threshold being preset; the degree of association between sensitive attributes is calculated as follows:

(b) Suppose that each record in data set T has p attributes {u₁,u₂,...,u_p}, and that each attribute has several attribute values, whose counts are denoted {q₁,q₂,...,q_p}; within a record, an attribute value of a sensitive attribute is marked 1 if it occurs and 0 if it does not, so the record can be expressed as a (q₁+q₂+...+q_p)-dimensional row vector; when data set T has n records, denoted {t₁,t₂,...,t_n}, there are n such (q₁+q₂+...+q_p)-dimensional row vectors;

(c) XNOR and XOR operations are applied to the values at corresponding positions of the (q₁+q₂+...+q_p)-dimensional row vectors; under the XNOR operation, two cases are distinguished: the attribute values at a corresponding position are both marked 1, or both marked 0; the degree of association S(I₁,I₂) between two attributes is then calculated as follows:

wherein, in the calculation, the parameters λ₁, λ₂, λ₃ are set to 0.5, 0.25, and 0.25 respectively, and the range is 0 ≤ S(I₁,I₂) ≤ 1;

(6) The desensitization depth of the sensitive information is calculated, and it is judged whether the desensitization depth meets the preset requirement; if it does not, the method returns to step (5) and performs desensitization again; otherwise, the desensitized data set is output for the data consumer to view.

2. The data-sharing-oriented sensitive information desensitization method according to claim 1, characterized in that the preprocessing operation of step (2) is as follows: the published data are classified by data type, the data types including structured database data of various kinds, tabular data, data warehouse data, and unstructured document data; during preprocessing, the completeness, consistency, and correctness of attribute values are checked, and unstructured document data are parsed into text data, a parsing tool being used for the parsing.

3. The data-sharing-oriented sensitive information desensitization method according to claim 1, characterized in that the automatic identification of named-entity sensitive information is implemented by combining part-of-speech tagging, using the Viterbi algorithm based on a hidden Markov model (HMM), with a constructed named-entity knowledge base; and the accurate identification of address sensitive information is implemented by examining the adjacent word sequence of the address information.

4. The data-sharing-oriented sensitive information desensitization method according to claim 1, characterized in that sensitive information of number and numeric type is desensitized as follows: rules for generating the sensitive attribute are formulated and stored in the sensitive attribute generation rule library, and the preset desensitization algorithm based on data distortion and encryption is invoked to transform the newly generated sensitive attribute values according to the desensitization task, which produces the desensitized data.

5. The data-sharing-oriented sensitive information desensitization method according to claim 1, characterized in that named-entity sensitive information is desensitized using a code table of common Chinese named entities that stores on the order of a million organization names and Chinese personal names, the original named entity being hashed, looked up in the table, and replaced; and address sensitive information is desensitized according to the level of detail of the address information by converting the address to longitude and latitude: if the original sensitive address cannot be resolved, it is a relatively vague address and does not need to be desensitized; if longitude and latitude can be resolved, they are transformed within the range of the district or county in which the original address lies, another new address is generated, and the address is blurred to the street or township level according to the user's access rights.

6. The data-sharing-oriented sensitive information desensitization method according to claim 1, characterized in that the desensitization depth measures the degree of difference between the desensitized data set and the original data set, the degree of difference being directly proportional to the desensitization depth, and is calculated as follows:

(I) Desensitization depth of numeric attributes:

(II) Desensitization depth of categorical attributes:

The desensitization depth of a categorical attribute is obtained by constructing a generalization tree model, and the categorical attribute desensitization depth D_fl(r,r^*) is calculated with the following formula:

D_fl(r,r^*) = ((N_h - 1) × step(r,r^*)) / ((N - 1) × step(r,e))

wherein r and r^* denote the attribute value before and after desensitization, N_h denotes the number of child nodes that share a parent node with the pre-desensitization attribute value of the categorical attribute (that value included), N denotes the number of leaf nodes of the generalization tree, e denotes the root node, and step(x, y) denotes the number of steps from attribute value node x to the post-desensitization attribute node y;

7. The data-sharing-oriented sensitive information desensitization method according to claim 1, characterized in that the controlled output of the data is performed by applying a hash transformation to the original storage link of the desensitized data set to generate a new download link address.

8. A data-sharing-oriented sensitive information desensitization system, characterized by comprising a system management unit, a data source management unit, a sensitive information identification unit, a sensitive information processing unit, and a data output unit; the system management unit is used to build the user accounts and access control of the desensitization system and to identify the roles and permissions of users, so that only legitimate, authorized users may operate on the corresponding data; the data source management unit stores data source information; the sensitive information identification unit is used to automatically identify the sensitive information in all types of data sources and to calculate the association between the sensitive attributes in the data source; wherein the degree of association between sensitive attributes is calculated as follows:

the sensitive information processing unit is used to automatically create desensitization tasks and to match desensitization strategies and desensitization algorithms; the data output unit is used to control the output of sensitive data safely and effectively; and the system management unit, the data source management unit, the sensitive information identification unit, the sensitive information processing unit, and the data output unit are connected in sequence.

9. The data-sharing-oriented sensitive information desensitization system according to claim 8, characterized in that the data source management unit covers the extraction and management of the data source type, IP address, storage address, and data structure of the data source; the sensitive information identification unit performs word segmentation on text data based on natural language processing technology, marks sensitive information levels on the basis of manually constructed knowledge bases for each class of sensitive information, automatically identifies sensitive information by rule-based and pattern matching methods, and introduces the Sigmoid function method to calculate the degree of association between sensitive attributes; the sensitive information processing unit reviews data usage applications and automatically creates the corresponding desensitization tasks based on natural language processing technology, and desensitizes each class of sensitive information using, respectively, the sensitive attribute generation rule library, hash table lookup, transformation of address longitude and latitude within a range, and the various desensitization algorithms; and the data output unit replaces the original sensitive attribute values with the desensitized values and transforms the original data storage address with a hash algorithm to generate a new storage address for outputting the data.