CN106021541B

CN106021541B - Secondary k-anonymity privacy protection algorithm that distinguishes quasi-identifier attributes

Info

Publication number: CN106021541B
Application number: CN201610361877.2A
Authority: CN
Inventors: 吴响; 王换换; 臧昊; 俞啸
Original assignee: Xuzhou Medical University
Current assignee: Xuzhou Medical University
Priority date: 2016-05-26
Filing date: 2016-05-26
Publication date: 2017-08-04
Anticipated expiration: 2036-05-26
Also published as: CN106021541A

Abstract

The invention discloses a secondary k-anonymity privacy protection method that distinguishes quasi-identifier attributes, and relates to the technical field of data privacy protection. Using the Incognito function, the invention builds the generalization lattice of every single attribute and checks whether each generalization satisfies k-anonymity. Nodes that do not satisfy k-anonymity are deleted, and the nodes that do are iterated to form a candidate node set; the candidate nodes are then checked for k-anonymity and non-qualifying nodes are deleted. These steps are repeated until all categorical attributes have been iterated, and all root nodes that satisfy k-anonymity are output. The data table T is generalized by each root node in turn, and the MDAV algorithm performs a second generalization on the generalized table T′, dividing the input equivalence classes into groups of between k and 2k-1 tuples. After all partitions are complete, the information loss of each result is computed and compared to obtain the data table with the smallest loss.

Description

Secondary k-anonymity privacy protection algorithm that distinguishes quasi-identifier attributes

Technical field

The present invention relates to the technical field of data privacy protection, and specifically to a secondary k-anonymity privacy protection algorithm that distinguishes quasi-identifier attributes.

Background

Information technology is developing rapidly, and more and more data are being shared and used. How to protect the private information in published data from malicious acquisition by attackers, while still letting data recipients make full use of the data for exploration and scientific research, has become an important information security problem. k-anonymity is an effective method for protecting private data and has received wide attention in recent years. The k-anonymity technique was proposed by Samarati and Sweeney in 1998. It requires that a certain number (k) of individuals be indistinguishable in the published data, so that an attacker cannot determine which individual a piece of private information belongs to.

Numerous studies show that the Incognito algorithm can k-anonymize large-scale data efficiently, but k-anonymization algorithms based on global recoding over-generalize numeric variables and cause considerable semantic loss. MDAV is a classic partition-based anonymization clustering algorithm that handles the clustering of large-scale numeric data sets efficiently.

Research on k-anonymity has focused mainly on preserving data utility as far as possible while protecting private information. At present, most data anonymization methods share two defects: 1) they are better suited to categorical data (nominal and ordinal), and generalizing numeric data often loses much of the numeric semantics; 2) when the number of quasi-identifier attributes increases sharply, the so-called "curse of dimensionality" arises. The curse of dimensionality causes large information loss and degrades the utility of the published data table.

Summary of the invention

To overcome the shortcomings of the prior art described above, the present invention provides a secondary k-anonymity privacy protection algorithm that distinguishes quasi-identifier attributes, which greatly reduces the information loss caused by using the Incognito algorithm alone.

The present invention is realized by the following technical solution: a secondary k-anonymity privacy protection algorithm that distinguishes quasi-identifier attributes, comprising the following steps:

1) determine the type of each attribute in the quasi-identifier set;

2) S_n = Incognito(T, CQI, k), where S_n denotes the set obtained by generalizing the categorical attributes, T denotes the data set to be generalized, CQI denotes the set of categorical quasi-identifiers, and k denotes the anonymity constraint;

3) create an empty queue result and an empty node node;

4) traverse S_n, executing the following loop:

create the data set D_j, where D_j stores the fully generalized data table;

read one node from S_n into node;

generalize the data table T according to node to obtain T′;

traverse T′, executing the following loop:

store the i-th equivalence class of T′ in T′_i;

MDAV(T′_i, NQI, k), where T′_i denotes the data set to be clustered, NQI denotes the numeric attributes to be clustered, and k denotes the anonymity constraint;

D_j = D_j ∪ T′_i;

compute the information loss and insert it into result;

5) compare the information losses in result to obtain the D_j with the smallest information loss;

6) T″ = D_j; return T″.
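The control flow of steps 3) to 6) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function `nqlg` and the four callables it receives are hypothetical names, and the toy stand-ins below them exist only to make the sketch runnable (they are not Incognito, MDAV, or the information loss metric of the invention).

```python
def nqlg(table, nodes, generalize, equivalence_classes, cluster, info_loss):
    """Return the candidate table with the smallest information loss."""
    result = []                                   # step 3): empty result queue
    for node in nodes:                            # step 4): traverse S_n
        d_j = []                                  # D_j collects the generalized table
        t_prime = generalize(table, node)         # T' obtained from T and the node
        for t_i in equivalence_classes(t_prime):  # traverse T'
            d_j.extend(cluster(t_i))              # D_j = D_j U MDAV(T'_i, NQI, k)
        result.append((info_loss(d_j), d_j))      # record the loss with its table
    return min(result, key=lambda entry: entry[0])[1]   # steps 5) and 6)


# Toy stand-ins (assumptions for illustration only).
def toy_generalize(rows, level):
    # Level 0 keeps the category, any other level suppresses it.
    return [(c if level == 0 else "*", v) for c, v in rows]

def toy_classes(rows):
    # Group rows that share the same categorical value.
    groups = {}
    for c, v in rows:
        groups.setdefault(c, []).append((c, v))
    return list(groups.values())

def toy_cluster(rows):
    # Replace each numeric value by the mean of its group.
    mean = sum(v for _, v in rows) / len(rows)
    return [(c, mean) for c, _ in rows]

def toy_loss(rows):
    # Count suppressed categorical values.
    return sum(1 for c, _ in rows if c == "*")

table = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)]
best = nqlg(table, [1, 0], toy_generalize, toy_classes, toy_cluster, toy_loss)
```

Passing the generalization, clustering, and loss steps as parameters keeps the outer loop independent of how each step is realized.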

Preferably, the specific steps of the categorical attribute generalization Incognito(T, CQI, k) are as follows:

1) build the single-attribute generalization candidate node table C₁ and edge table E₁;

2) use an empty queue queue to take out all root nodes of C₁, and compute the equivalence classes for every node in queue;

3) determine whether each node satisfies k-anonymity: if it does, mark the node and all of its child nodes; if it does not, delete the node from C₁ and insert its child nodes into queue;

4) repeat step 3) until every node in C₁ that does not satisfy k-anonymity has been deleted, then form the new tables C₂ and E₂ from the pruned C₁ and E₁;

5) repeat steps 2), 3) and 4) to obtain the pruned C_n;

6) S_n = {all nodes of C_n};

7) return S_n.
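The queue-based check in steps 2) and 3) can be illustrated with a simplified breadth-first search over generalization levels. The sketch below is not the full Incognito algorithm: it works on the complete set of categorical attributes at once and omits the iteration from single attributes to larger attribute subsets in steps 4) and 5). The function names, the `keep`/`suppress` hierarchies, and the sample Race values are assumptions for illustration.

```python
from collections import Counter, deque

def is_k_anonymous(rows, k):
    """True if every combination of values occurs at least k times."""
    return all(count >= k for count in Counter(rows).values())

def minimal_k_anonymous_nodes(rows, hierarchies, k):
    """rows: tuples of categorical values.
    hierarchies: one list of functions per attribute; index 0 is the identity.
    Returns the least generalized level vectors that satisfy k-anonymity."""
    heights = [len(h) - 1 for h in hierarchies]
    bottom = tuple(0 for _ in hierarchies)
    queue, seen, found = deque([bottom]), {bottom}, []
    while queue:
        node = queue.popleft()
        # A generalization of an already satisfying node also satisfies
        # k-anonymity, so it is treated as marked and not checked again.
        if any(all(n >= f for n, f in zip(node, done)) for done in found):
            continue
        generalized = [
            tuple(h[level](v) for h, level, v in zip(hierarchies, node, row))
            for row in rows
        ]
        if is_k_anonymous(generalized, k):
            found.append(node)
            continue
        # Node fails: enqueue its direct generalizations (its child nodes).
        for i, level in enumerate(node):
            if level < heights[i]:
                child = node[:i] + (level + 1,) + node[i + 1:]
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return found

keep = lambda v: v          # level 0: original value
suppress = lambda v: "*"    # level 1: fully generalized value
rows = [("Male", "White"), ("Male", "Black"),
        ("Female", "White"), ("Female", "Black")]
nodes = minimal_k_anonymous_nodes(rows, [[keep, suppress], [keep, suppress]], k=2)
```

In this toy input every original row is unique, so the search stops at the two nodes that generalize exactly one of the two attributes.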

Preferably, the specific steps of the numeric attribute generalization MDAV(T′_i, NQI, k) are as follows:

1) determine whether the number of tuples in the data set is greater than 2k-1; if so, continue with step 2); otherwise, return the data set T′_i and find its centroid;

2) find the two tuples r and s in the data set T′_i that are farthest apart with respect to NQI;

3) taking r as the center, find the k-1 tuples nearest to r to form an equivalence class C, update the centroid, delete these k tuples from the data set T′_i, and put them into the cluster set {Q};

4) taking s as the center, repeat step 3);

5) determine whether the number of tuples remaining in the data set T′_i is greater than 2k-1; if so, repeat steps 2), 3) and 4); otherwise, return the data set T′_i and find its centroid;

6) replace the quasi-identifier attribute values of the tuples in each equivalence class with the quasi-identifier attribute values of its centroid;

7) return T′_i.
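The MDAV steps above can be sketched in Python as follows. Two points are assumptions rather than statements of the invention: the distance is taken to be Euclidean, and a leftover group smaller than k (a case the steps do not address) is attached tuple by tuple to the cluster with the nearest centroid. The names `dist`, `centroid` and `mdav` are illustrative.

```python
import math

def dist(a, b):
    """Euclidean distance between two numeric tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(rows):
    """Component-wise mean of a non-empty list of numeric tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def mdav(rows, k):
    """Return a list of (centroid, members) pairs, one per equivalence class."""
    remaining = list(rows)
    clusters = []

    def take_cluster(seed):
        # Steps 3) and 4): the k tuples nearest to the seed form one class
        # and are removed from the remaining data.
        remaining.sort(key=lambda t: dist(t, seed))
        clusters.append(remaining[:k])
        del remaining[:k]

    while len(remaining) > 2 * k - 1:            # tests of steps 1) and 5)
        # Step 2): the two tuples that are farthest apart.
        r, s = max(
            ((a, b) for i, a in enumerate(remaining) for b in remaining[i + 1:]),
            key=lambda pair: dist(*pair),
        )
        take_cluster(r)
        take_cluster(s)

    if len(remaining) >= k or not clusters:
        if remaining:
            clusters.append(remaining)           # the rest forms the last class
    else:
        # Assumption: fewer than k leftovers join the nearest existing cluster.
        for t in remaining:
            min(clusters, key=lambda c: dist(centroid(c), t)).append(t)

    # Step 6): each class is represented by its centroid.
    return [(centroid(c), c) for c in clusters]

age_hours = [(21, 25), (23, 28), (25, 30), (44, 40), (46, 45), (48, 50)]
result = mdav(age_hours, k=3)
small = mdav([(1,), (2,), (3,), (4,), (5,)], k=2)
```

The exhaustive farthest-pair search is quadratic in the number of tuples, which is acceptable here because MDAV is applied to one equivalence class at a time.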

The beneficial effects of the invention are as follows: the method obtains the frequent itemsets of categorical attributes that satisfy k-anonymity and then applies microaggregation to the numeric attributes, which avoids the over-generalization of numeric attributes that global generalization can cause. The source data table is divided into an optimal partition with between k and 2k-1 tuples per class, which greatly reduces the information loss caused by using the Incognito algorithm alone.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the present invention;

Fig. 2 is the structure diagram formed by the three attributes sex, race and job category;

Fig. 3 shows the relationship between information loss IL and k when |QI| = 6+1;

Fig. 4 shows the relationship between information loss IL and k when |QI| = 6+2;

Fig. 5 shows the relationship between time T and k when |QI| = 6+1;

Fig. 6 shows the relationship between time T and k when |QI| = 6+2;

Fig. 7 shows the relationship between the time difference and k.

Embodiment

To realize k-anonymity, the definitions relevant to the NQLG algorithm are given below, using Table 1 as an example. Assume that the data table held by the data publisher is T(A₁,A₂,...,A_n). Each tuple in the table represents the information of one specific entity, such as Age, Workclass, Race, Sex, Hours-per-week and Salary; see Table 1.

Table 1

Definition 1 (quasi-identifier): given a population U, a specific data table T(A₁,A₂,...,A_n), f_c: U → T and f_g: T → U′, where U ⊆ U′, a quasi-identifier of T, written QI_T, is a set of attributes {A_i,...,A_j} ⊆ {A₁,...,A_n} for which there exists p_i ∈ U such that f_g(f_c(p_i)[QI_T]) = p_i holds. Any attribute in Table 1 can serve as a quasi-identifier, and the quasi-identifiers are chosen according to actual needs.

Definition 2 (generalization rule): given an attribute Q and f: Q → Q′, where f belongs to the set of generalization functions acting on attribute Q, applying f¹, f², ..., f^m in sequence represents the process of generalizing the quasi-identifier step by step, and {f¹,f²,...,f^m} represents the generalization rule. Fig. 2 shows the structure diagram formed by the three attributes sex, race and job category.
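A generalization rule in the sense of Definition 2 can be modelled as an ordered list of functions applied one after another. The sketch below is illustrative only: the two rules for an age attribute (ten-year intervals of the form [21,30], then full suppression) and the names `to_decade`, `suppress`, `AGE_RULES` and `generalize` are assumptions, chosen to match the interval style used in the anonymized example table.

```python
def to_decade(age):
    """First rule: map an age to a ten-year interval such as [21,30]."""
    low = (age - 1) // 10 * 10 + 1
    return f"[{low},{low + 9}]"

def suppress(_value):
    """Second rule: replace any value with the fully generalized '*'."""
    return "*"

AGE_RULES = [to_decade, suppress]

def generalize(value, rules, level):
    """Apply the first `level` rules in order; level 0 leaves the value unchanged."""
    for rule in rules[:level]:
        value = rule(value)
    return value

levels = [generalize(25, AGE_RULES, level) for level in range(3)]
```

Each additional level applies one more function from the rule, so a higher level always yields a value at least as general as the one below it.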

Definition 3 (k-anonymity): given a data table T(A₁,...,A_n) and its associated quasi-identifier QI_T, table T satisfies k-anonymity if and only if each sequence of values in T[QI_T] occurs at least k times in T[QI_T].

As shown in Table 1, the table contains 6 tuples, each corresponding to the information of one specific person. The first column is the sequence number field, which gives the relative storage position of each record in the data table; the second column is the age attribute; the third column is the work class attribute; the fourth column is the race attribute; the fifth column is the sex attribute; the sixth column is the working hours attribute; and the last column, being the information to be protected, serves as the sensitive attribute of the table. The quasi-identifier of T in Table 1 is therefore QI_T = {Age, Workclass, Race, Sex, Hours_per_week}. Table 2 is the published result of Table 1 after 2-anonymization. By the definition of an equivalence class, Table 2 contains 3 equivalence classes: {R₁,R₂}, {R₃,R₄} and {R₅,R₆}. The tuples in these equivalence classes satisfy:

R₁[QI_T] = R₂[QI_T] = {[21,30], Self-emp-not-inc, Amer-Indian-Eskimo, Female, [21,30]},

R₃[QI_T] = R₄[QI_T] = {[31,40], Private, Amer-Indian-Eskimo, Male, [31,40]},

R₅[QI_T] = R₆[QI_T] = {[41,50], Private, Amer-Indian-Eskimo, Male, [41,50]}. An attacker using a linking attack therefore obtains the private sensitive information with a probability of only 1/k = 1/2. The k-anonymized data table (Table 2) can thus effectively prevent linking attacks.
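The 2-anonymity of this example can be checked mechanically. The following Python sketch groups the six generalized tuples R₁ to R₆ by their quasi-identifier values and verifies that every equivalence class has at least k members; the helper names `equivalence_classes` and `is_k_anonymous` are illustrative and not part of the invention.

```python
from collections import Counter

# Generalized quasi-identifier values of R1..R6 from the 2-anonymized example:
# (Age, Workclass, Race, Sex, Hours_per_week).
row_a = ("[21,30]", "Self-emp-not-inc", "Amer-Indian-Eskimo", "Female", "[21,30]")
row_b = ("[31,40]", "Private", "Amer-Indian-Eskimo", "Male", "[31,40]")
row_c = ("[41,50]", "Private", "Amer-Indian-Eskimo", "Male", "[41,50]")
table2 = [row_a, row_a, row_b, row_b, row_c, row_c]

def equivalence_classes(rows):
    """Map each distinct quasi-identifier tuple to the number of rows sharing it."""
    return Counter(rows)

def is_k_anonymous(rows, k):
    """True if every quasi-identifier tuple occurs at least k times."""
    return all(count >= k for count in equivalence_classes(rows).values())

classes = equivalence_classes(table2)
two_anonymous = is_k_anonymous(table2, 2)
three_anonymous = is_k_anonymous(table2, 3)
```

With three classes of two tuples each, the table satisfies 2-anonymity but not 3-anonymity, which matches the 1/2 linking probability stated above.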

Table 2

Definition 4 (categorical attribute generalization): the data set is partitioned and the categorical data are generalized. Let {R₁,...,R_i} be the categorical attributes, with R₁,...,R_i ∈ T. If T(R₁,...,R_j) satisfies k-anonymity, that is, if and only if each tuple in T(R₁,...,R_j) occurs at least k times in T(R₁,...,R_j), then the categorical attribute generalization is complete, and the frequent itemset can be written T′(R₁,..,R_j,...,S₁,...,S_n).

Definition 5 (numeric attribute generalization): given the frequent itemset T′(R₁,..,R_i,...,S₁,...,S_n) obtained by generalizing the categorical data, where the attributes S₁,...,S_n of table T′ are numeric, the numeric attribute generalization on T can be expressed as K_exp(δ_G(T″)), where K is the name of the secondary anonymization function, exp is a numeric expression, G is the generalization rule, and δ_G performs the generalization of the numeric tuple data.

Definition 6 (distance between numeric tuples): let T(t₁,t₂,...,t_n) be a given tuple set and t₁, t₂ ∈ T two of its tuples. The distance between two tuples is their actual distance over all numeric quasi-identifiers,

written d_n(t_i, t_j), where t_i and t_j denote different numeric tuples and d_n denotes the actual distance between two numeric tuples.
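Microaggregation methods such as MDAV are commonly run with the Euclidean distance. Assuming that metric, and writing N_1, ..., N_m for the numeric quasi-identifiers (symbols introduced here for illustration only), the distance between two tuples can be written as:

```latex
d_n(t_i, t_j) = \sqrt{\sum_{l=1}^{m} \bigl( t_i[N_l] - t_j[N_l] \bigr)^2}
```

In practice the numeric attributes are often normalized to a common scale before this distance is computed, so that an attribute with a wide range does not dominate the sum.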

As shown in Fig. 1, the present invention builds on the Incognito and MDAV algorithms and proposes an efficient k-anonymity algorithm, the NQLG algorithm. The algorithm combines Incognito and MDAV. It first applies Incognito to the categorical quasi-identifiers to obtain the nodes that satisfy k-anonymity and determines all root nodes. It then generalizes the data table according to each root node in turn and clusters the numeric attributes with MDAV, so that the resulting equivalence classes form an optimal k-partition in which each equivalence class contains between k and 2k-1 tuples. Finally, it compares the generalization results obtained from the root nodes and selects the generalized data table with the smallest information loss. The algorithm is described as follows:

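Categorical attribute generalization

Function: Incognito(T, CQI, k), where T denotes the data set to be generalized, CQI denotes the set of categorical quasi-identifiers, and k denotes the anonymity constraint;

1) build the single-attribute generalization candidate node table C₁ and edge table E₁;

2) use an empty queue queue to take out all root nodes of C₁, and compute the equivalence classes for every node in queue;

3) determine whether each node satisfies k-anonymity: if it does, mark the node and all of its child nodes; if it does not, delete the node from C₁ and insert its child nodes into queue;

4) repeat step 3) until every node in C₁ that does not satisfy k-anonymity has been deleted, then form the new tables C₂ and E₂ from the pruned C₁ and E₁;

5) repeat steps 2), 3) and 4) to obtain the pruned C_n;

6) S_n = {all nodes of C_n};

7) return S_n.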

Numeric attribute generalization

Function: MDAV(T′, NQI, k), where T′ denotes the data set to be clustered, NQI denotes the numeric attributes to be clustered, and k denotes the anonymity constraint;

1) determine whether the number of tuples in the data set is greater than 2k-1; if so, continue with step 2); otherwise, return the data set T′ and find its centroid;

2) find the two tuples r and s in the data set T′ that are farthest apart with respect to NQI;

3) taking r as the center, find the k-1 tuples nearest to r to form an equivalence class C, update the centroid, delete these k tuples from the data set T′, and put them into the cluster set {Q};

4) taking s as the center, repeat step 3);

5) determine whether the number of tuples remaining in the data set T′ is greater than 2k-1; if so, repeat steps 2), 3) and 4); otherwise, return the data set T′ and find its centroid;

6) replace the quasi-identifier attribute values of the tuples in each equivalence class with the quasi-identifier attribute values of its centroid;

7) return T′.

Implementation of the NQLG algorithm

1) determine the type of each attribute in the quasi-identifier set;

2) S_n = Incognito(T, CQI, k);

S_n is the set obtained by generalizing the categorical attributes;

3) create an empty queue result and an empty node node;

4) traverse S_n, executing the following loop:

create the data set D_j, where D_j stores the fully generalized data table;

read one node from S_n into node;

generalize the data table T according to node to obtain T′;

traverse T′, executing the following loop:

store the i-th equivalence class of T′ in T′_i;

MDAV(T′_i, NQI, k);

D_j = D_j ∪ T′_i;

compute the information loss and insert it into result;

5) compare the information losses in result to obtain the D_j with the smallest information loss;

6) T″ = D_j; return T″.

As the above steps show, the NQLG algorithm uses the Incognito function to build the generalization lattice of every single attribute and checks whether each generalization satisfies k-anonymity. Nodes that do not satisfy k-anonymity are deleted, and the nodes that do are iterated to form a candidate node set; the candidate nodes are then checked for k-anonymity and non-qualifying nodes are deleted. These steps are repeated until all categorical attributes have been iterated, and all root nodes that satisfy k-anonymity are output. The data table T is generalized by each root node in turn, and the MDAV algorithm performs a second generalization on the generalized table T′, dividing the input equivalence classes into groups of between k and 2k-1 tuples. After all partitions are complete, the information loss of each result is computed and compared to obtain the data table with the smallest loss.

Rationality analysis of the NQLG algorithm: step 2) yields the frequent itemsets of categorical attributes that satisfy k-anonymity. Microaggregation is then applied to the numeric attributes, which avoids the over-generalization of numeric attributes that global generalization can cause. After step 4), the source data table is divided into an optimal partition with between k and 2k-1 tuples per class, which greatly reduces the information loss caused by using the Incognito algorithm alone.

Complexity analysis of the NQLG algorithm: assume that the data set contains n tuples and m categorical quasi-identifiers. The time cost of the algorithm is analysed as follows. Step 1 costs O(1). Step 2 uses the Incognito algorithm to find the generalizations of the categorical attributes that satisfy k-anonymity, at a cost of O(∑C_i), where C_i is the number of nodes in the i-th iteration. Step 3 costs O(1). The cost of step 4 depends on l, the number of root nodes after the first generalization, and on the time complexity of the MDAV algorithm, in which j is the number of large equivalence classes obtained in the previous step. Step 5 costs O(l). The overall cost of the algorithm is the sum of these step costs.

Experimental verification and analysis of results for the NQLG algorithm:

Experimental environment: the experiments were run with 4 GB of memory under the Windows 7 operating system, and the algorithm was implemented in Java with SQL Server 2008. The Adult data set from the UCI Machine Learning Repository was used as the experimental data set. The Adult data set consists of U.S. census data; using the training set, 30162 records remain after records with missing values are removed. Eight attributes were selected: Sex, Race, Hours_per_week, Marital_status, Education, Workclass, Native_country and Age. Age and Hours_per_week are continuous quasi-identifiers, and Sex, Race, Marital_status, Education, Workclass and Native_country are categorical quasi-identifiers.

Analysis of experimental results: the experiment uses the Incognito algorithm as the comparison algorithm, applies MDAV for secondary anonymization of the data set after k-anonymization, and evaluates the proposed algorithm in terms of information loss and execution time. The NQLG algorithm was run with different numbers of quasi-identifiers and different values of k to observe the changes in information loss and execution time. Information loss is computed with the method from the literature:

Information loss of an equivalence class:

Information loss of the table:

Here |e_i| is the number of tuples in cluster e_i, 1≤l≤m, N_i is the range of the i-th numeric attribute, MAX_Ni and MIN_Ni are the maximum and minimum values of that attribute in cluster e_i, H(T_ci) is the height of the taxonomy tree, and H(∧(∪Cj)) is the height of the taxonomy subtree rooted at the lowest common ancestor.
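One clustering-based metric whose terms match the symbols described here is given below. It is offered as an assumption about the intended formulas, not as a quotation of them; the counts m (numeric attributes) and n (categorical attributes) and the set E of equivalence classes are symbols introduced for illustration.

```latex
IL(e_i) = |e_i| \cdot \left(
    \sum_{l=1}^{m} \frac{MAX_{N_l} - MIN_{N_l}}{|N_l|}
  + \sum_{j=1}^{n} \frac{H\bigl(\Lambda(\cup C_j)\bigr)}{H(T_{C_j})}
\right),
\qquad
IL(T) = \sum_{e_i \in E} IL(e_i)
```

Under this reading, a cluster contributes more loss when its numeric values span a larger share of the attribute range or when its categorical values only meet high up in the taxonomy tree, and each contribution is weighted by the cluster size.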

Information loss analysis: as Figs. 3 and 4 show, for a fixed quasi-identifier set |QI|, the information loss IL of the proposed algorithm tends to decrease as k increases, and once k reaches 50 the information loss of both algorithms tends to rise. The experimental data show that the information loss of the proposed algorithm is significantly lower than that of the Incognito algorithm. From the perspective of information loss, the proposed algorithm therefore has a great advantage in avoiding over-generalization.

Running time analysis: as Figs. 5 and 6 show, for a fixed quasi-identifier set, the running times of the Incognito algorithm and of the proposed algorithm both decrease as k increases. Comparing the data for different quasi-identifier sets QI, when |QI| = 6+1 (6 categorical attributes + 1 numeric attribute) the Incognito algorithm is better than the proposed algorithm in running time, whereas when |QI| = 6+2 (6 categorical attributes + 2 numeric attributes) the proposed algorithm becomes better than the Incognito algorithm in running time as k increases. The experimental data show that the advantage of the proposed algorithm becomes more pronounced as the number of numeric quasi-identifiers grows.

As Fig. 7 shows, as k decreases, the time difference Δt between the two quasi-identifier sets (|QI| = 6+2 and |QI| = 6+1) increases for both the Incognito algorithm and the proposed algorithm. The increase for the Incognito algorithm is marked and much larger than that for the proposed algorithm. In terms of efficiency, therefore, the advantage of the proposed algorithm grows significantly as the proportion of numeric quasi-identifiers in the quasi-identifier set |QI| changes.

This work addresses the over-generalization of numeric attributes caused by the Incognito algorithm and the semantic inclusion problem in cluster partitioning by proposing the NQLG algorithm. Experiments show that, compared with traditional privacy protection algorithms, the NQLG algorithm has a clear advantage in handling semantic loss and semantic inclusion. Future research can proceed in the following directions: since data may be published more than once, the NQLG algorithm can be extended to dynamic data sets; and as data volumes grow sharply, distributed and cloud computing technologies can be introduced into anonymization research to further improve the efficiency of processing massive data.

Claims

1. A secondary k-anonymity privacy protection method that distinguishes quasi-identifier attributes, characterized by:

1) S_n = Incognito(T, CQI, k), where S_n denotes the set obtained by generalizing the categorical attributes, T denotes the data set to be generalized, CQI denotes the set of categorical quasi-identifiers, and k denotes the anonymity constraint;

2) create an empty queue result and an empty node node;

3) traverse S_n, executing the following loop:

create the data set D_j, where D_j stores the fully generalized data table;

read one node from S_n into node;

generalize the data set T according to node to obtain T′;

traverse T′, executing the following loop:

store the i-th equivalence class of T′ in T′_i;

MDAV(T′_i, NQI, k), where T′_i denotes the data set to be clustered, NQI denotes the numeric attributes to be clustered, and k denotes the anonymity constraint;

D_j = D_j ∪ T′_i;

compute the information loss and insert it into result;

4) compare the information losses in result to obtain the D_j with the smallest information loss;

5) T″ = D_j; return T″.

2. The secondary k-anonymity privacy protection method that distinguishes quasi-identifier attributes according to claim 1, characterized in that the specific steps of the categorical attribute generalization Incognito(T, CQI, k) are as follows:

1) build the single-attribute generalization candidate node table C₁ and edge table E₁;

2) use an empty queue queue to take out all root nodes of C₁, and compute the equivalence classes for every node in queue;

3) determine whether each node satisfies k-anonymity: if it does, mark the node and all of its child nodes; if it does not, delete the node from C₁ and insert its child nodes into queue;

4) repeat step 3) until every node in C₁ that does not satisfy k-anonymity has been deleted, then form the new tables C₂ and E₂ from the pruned C₁ and E₁;

5) repeat steps 2), 3) and 4) to obtain the pruned C_n;

6) S_n = {all nodes of C_n};

7) return S_n.

3. The secondary k-anonymity privacy protection method that distinguishes quasi-identifier attributes according to claim 1, characterized in that the specific steps of the numeric attribute generalization MDAV(T′_i, NQI, k) are as follows:

1) determine whether the number of tuples in the data set is greater than 2k-1; if so, continue with step 2); otherwise, return the data set T′_i and find its centroid;

2) find the two tuples r and s in the data set T′_i that are farthest apart with respect to NQI;

3) taking r as the center, find the k-1 tuples nearest to r to form an equivalence class C, update the centroid, delete these k tuples from the data set T′_i, and put them into the cluster set {Q};

4) taking s as the center, repeat step 3);

5) determine whether the number of tuples remaining in the data set T′_i is greater than 2k-1; if so, repeat steps 2), 3) and 4); otherwise, return the data set T′_i and find its centroid;

6) replace the quasi-identifier attribute values of the tuples in each equivalence class with the quasi-identifier attribute values of its centroid;

7) return T′_i.