CN106021541A

CN106021541A - Secondary k-anonymity privacy protection algorithm for differentiating quasi-identifier attributes

Info

Publication number: CN106021541A
Application number: CN201610361877.2A
Authority: CN
Inventors: 吴响; 王换换; 臧昊; 俞啸
Original assignee: Xuzhou Medical University
Current assignee: Xuzhou Medical University
Priority date: 2016-05-26
Filing date: 2016-05-26
Publication date: 2016-10-12
Anticipated expiration: 2036-05-26
Also published as: CN106021541B

Abstract

The invention discloses a secondary k-anonymity privacy protection algorithm for differentiating quasi-identifier attributes, pertaining to the technical field of privacy protection. The algorithm comprises the following steps: forming the generalization lattices of the single attributes through an Incognito function and determining whether each generalization satisfies k-anonymity; deleting the nodes that do not satisfy k-anonymity; iterating over the nodes that satisfy k-anonymity to form a candidate node set and determining again whether the candidate nodes satisfy k-anonymity; deleting the nodes that do not; and repeating these steps until all categorical attributes have been iterated, then outputting the root nodes that satisfy k-anonymity. The data table T is generalized through each root node in turn. The MDAV algorithm is then used for a secondary generalization of the generalized table T'. Each input equivalence class is partitioned into groups of between k and 2k-1 tuples. When partitioning is finished, the information loss is computed, and the data table with the smallest loss is obtained by comparison.

Description

Secondary k-anonymity privacy protection algorithm for differentiating quasi-identifier attributes

Technical field

The present invention relates to the technical field of data privacy protection, and in particular to a secondary k-anonymity privacy protection algorithm for differentiating quasi-identifier attributes.

Background art

With the rapid development of information technology, more and more data are shared and used. How to protect the private information in published data from being obtained maliciously by attackers, while still allowing data recipients to make full use of the data for exploration and scientific research, has become an important information security issue. K-anonymity is an effective method of protecting private data and has received the widest attention. The k-anonymity technique was proposed in 1998 by Samarati and Sweeney. It requires that the published data contain a certain number (k) of indistinguishable individuals, so that an attacker cannot determine which individual a piece of private information belongs to.

Numerous studies show that the Incognito algorithm can efficiently k-anonymize large-scale data, but k-anonymization algorithms based on global recoding tend to over-generalize numeric variables and therefore lose more semantics. MDAV is a classical partition-based anonymous clustering algorithm that can efficiently cluster large-scale numeric data sets.

Research on k-anonymity has concentrated on retaining as much data utility as possible while protecting private information. At present, most data anonymization methods share common defects: 1) they are better suited to categorical data (nominal and ordinal), and generalizing numeric data often loses much of its numerical semantics; 2) when the number of quasi-identifier attributes grows sharply, the so-called "curse of dimensionality" appears. The curse of dimensionality causes very large information loss, so that the utility of the published data table deteriorates.

Summary of the invention

In order to overcome the above shortcomings of the prior art, the present invention provides a secondary k-anonymity privacy protection algorithm for differentiating quasi-identifier attributes, which greatly reduces the information loss caused by using the Incognito algorithm alone.

The present invention is realized by the following technical scheme: a secondary k-anonymity privacy protection algorithm for differentiating quasi-identifier attributes, comprising:

1) Determine the type of each attribute in the quasi-identifier set;

2) S_n = Incognito(T, CQI, k), where S_n denotes the data set obtained after the categorical attributes have been generalized, T denotes the data set to be generalized, CQI denotes the set of categorical quasi-identifiers, and k denotes the anonymity constraint;

3) Create an empty queue result and an empty node node;

4) Traverse S_n, entering the following loop:

Data set D_j, where D_j stores the data table after secondary generalization;

Read a node from S_n and assign it to node;

Generalize the data table T according to node to obtain T';

Traverse T', entering the following loop:

Store the i-th equivalence class of T' in T'_i;

MDAV(T'_i, NQI, k), where the first argument denotes the data set to be clustered, NQI denotes the numeric attributes to be clustered, and k denotes the anonymity constraint;

D_j = D_j ∪ T'_i;

Compute the information loss and insert it into result;

5) Compare the information loss values in result to obtain the D_j with the smallest information loss;

6) T'' = D_j; return T''.

Preferably, the specific steps by which Incognito(T, CQI, k) generalizes the categorical attributes are as follows:

1) Form the single-attribute generalization candidate node table C₁ and edge table E₁;

2) Use an empty queue queue to take out all root nodes in C₁, and compute the equivalence classes for every node in queue;

3) Determine whether each node satisfies k-anonymity: if it does, mark the node and all of its child nodes; if it does not, delete the node from C₁ and insert its child nodes into queue;

4) Repeat step 3) until all unsatisfying nodes have been deleted from C₁, and form the new tables C₂ and E₂ from the pruned C₁ and E₁;

5) Repeat steps 2), 3) and 4) until the pruned C_n is obtained;

6) S_n = {all nodes of C_n};

7) Return S_n.

Preferably, the specific steps by which MDAV(T', NQI, k) generalizes the numeric attributes are as follows:

1) Determine whether the number of tuples in the data set is greater than 2k-1; if so, continue with step 2); otherwise, return the data set T' and find its centroid;

2) Using NQI, find the two tuples r and s in the data set T' that are farthest apart;

3) With r as the centroid, find the k-1 tuples nearest to r to form an equivalence class C, update the centroid, delete these k tuples from the data set T', and put them into the cluster set {Q};

4) Repeat step 3) with s as the centroid;

5) Determine whether the number of tuples remaining in T' is greater than 2k-1; if so, repeat steps 2), 3) and 4); otherwise, return the data set T' and find its centroid;

6) Replace the quasi-identifier attribute values of the tuples in each equivalence class with the quasi-identifier attribute values of its centroid;

7) Return T'.

The beneficial effects of the invention are as follows: the method obtains the categorical-attribute frequent item set that satisfies k-anonymity and then microaggregates the numeric attributes. This avoids the over-generalization of numeric attributes that global generalization can cause, partitions the source data table into groups of between k and 2k-1 tuples, and greatly reduces the information loss caused by using the Incognito algorithm alone.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the present invention;

Fig. 2 is the structure chart formed by the three attributes sex, race and work class;

Fig. 3 shows the relationship between information loss IL and k when |QI| = 6+1;

Fig. 4 shows the relationship between information loss IL and k when |QI| = 6+2;

Fig. 5 shows the relationship between running time T and k when |QI| = 6+1;

Fig. 6 shows the relationship between running time T and k when |QI| = 6+2;

Fig. 7 shows the relationship between the time difference and k.

Detailed description of the invention

The definitions needed for the NQLG algorithm are given below, taking Table 1 as an example of realizing k-anonymity. Assume that the data table held by the data publisher is T(A₁, A₂, ..., A_n), in which each tuple records the information of one specific entity, such as Age, Workclass, Race, Sex, Hours-per-week and Salary; see Table 1.

Table 1

Definition 1 (quasi-identifier): Given a population U, an entity-specific data table T(A₁, A₂, ..., A_n), and mappings f_c: U → T and f_g: T → U', where U ⊆ U', a quasi-identifier QI_T of T is a set of attributes {A_i, ..., A_j} ⊆ {A₁, ..., A_n} for which there exists p_i ∈ U such that f_g(f_c(p_i)[QI_T]) = p_i holds. Any attribute in Table 1 can serve as a quasi-identifier, and the quasi-identifiers are chosen according to actual needs.

Definition 2 (generalization rule): Given an attribute Q, let f: Q → Q' be a generalization function acting on Q. A chain of such functions applied one after another represents generalizing the quasi-identifier in order, and {f¹, f², ..., f^m} denotes the generalization rule. Fig. 2 shows the structure chart formed by the three attributes sex, race and work class.
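As an illustration of Definition 2, the chain {f¹, ..., f^m} can be sketched as an ordered list of mappings. The hierarchy values below ("Person", "*") and the names `SEX_CHAIN` and `generalize` are hypothetical and are not taken from Fig. 2; this is a minimal sketch, not the patented implementation.

```python
# Hypothetical generalization chain for the Sex attribute.
# Level 0 is the raw value; entry i-1 of the list plays the role of f^i.
SEX_CHAIN = [
    {"Male": "Person", "Female": "Person"},  # f^1
    {"Person": "*"},                         # f^2 (fully suppressed)
]


def generalize(value, chain, level):
    """Apply f^1, ..., f^level in order to a single attribute value."""
    for mapping in chain[:level]:
        value = mapping[value]
    return value
```

Calling `generalize("Female", SEX_CHAIN, 1)` walks one step up the assumed hierarchy; level 0 leaves the value unchanged.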

Definition 3 (k-anonymity): Given a data table T(A₁, ..., A_n) and its associated quasi-identifier QI_T = (A_i, ..., A_j), the table T satisfies k-anonymity if and only if each tuple in T[QI_T] occurs at least k times in T[QI_T].

As shown in Table 1, the table contains 6 tuples, each corresponding to the personal information of one individual. The first column is the sequence number, which gives the relative storage position of each record in the data table; the second column is the age attribute; the third column is the work class attribute; the fourth column is the race attribute; the fifth column is the sex attribute; the sixth column is the working hours attribute; and the last column holds the information to be protected and serves as the sensitive attribute of the table. The quasi-identifier of T in Table 1 is therefore QI_T = {Age, Workclass, Race, Sex, Hours-per-week}. Table 2 is the published result of Table 1 after 2-anonymization. According to the definition of an equivalence class, Table 2 has 3 equivalence classes: {R₁, R₂}, {R₃, R₄} and {R₅, R₆}. The tuples in these equivalence classes satisfy:

R₁[QI_T] = R₂[QI_T] = {[21,30], Self-emp-not-inc, Amer-Indian-Eskimo, Female, [21,30]},

R₃[QI_T] = R₄[QI_T] = {[31,40], Private, Amer-Indian-Eskimo, Male, [31,40]},

R₅[QI_T] = R₆[QI_T] = {[41,50], Private, Amer-Indian-Eskimo, Male, [41,50]}.

Therefore the probability that an attacker obtains the sensitive private information through a linking attack is at most 1/k = 1/2. The data table obtained from Table 1 by k-anonymization (Table 2) can thus effectively prevent linking attacks.

Table 2
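The check in Definition 3 can be sketched directly on the quasi-identifier values quoted above for Table 2. The function names are hypothetical, and the rows are rebuilt from the three equivalence classes listed in the text rather than from the table itself, so this is only an illustrative sketch.

```python
from collections import Counter

QI = ["Age", "Workclass", "Race", "Sex", "Hours-per-week"]


def equivalence_classes(rows, qi):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[a] for a in qi) for row in rows)


def is_k_anonymous(rows, qi, k):
    """Definition 3: every QI combination must occur at least k times."""
    return all(count >= k for count in equivalence_classes(rows, qi).values())


# Generalized rows R1..R6, one dict per equivalence class, each used twice.
r12 = dict(zip(QI, ["[21,30]", "Self-emp-not-inc", "Amer-Indian-Eskimo", "Female", "[21,30]"]))
r34 = dict(zip(QI, ["[31,40]", "Private", "Amer-Indian-Eskimo", "Male", "[31,40]"]))
r56 = dict(zip(QI, ["[41,50]", "Private", "Amer-Indian-Eskimo", "Male", "[41,50]"]))
table2 = [r12, r12, r34, r34, r56, r56]

print(is_k_anonymous(table2, QI, 2))  # three classes of size 2
print(is_k_anonymous(table2, QI, 3))  # no class reaches size 3
```

Grouping with a `Counter` keeps the check linear in the number of rows, which is why the equivalence classes are counted once rather than compared pairwise.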

Definition 4 (categorical attribute generalization): Partition the data set and generalize the categorical data. Let {R₁, ..., R_i} be the categorical attributes, with R₁, ..., R_i ∈ T. If T(R₁, ..., R_j) satisfies k-anonymity, that is, if and only if each tuple in T(R₁, ..., R_j) occurs at least k times in T(R₁, ..., R_j), then the categorical attribute generalization is complete, and the frequent item set is denoted T'(R₁, ..., R_j, ..., S₁, ..., S_n).

Definition 5 (numeric attribute generalization): Given the frequent item set T'(R₁, ..., R_i, ..., S₁, ..., S_n) obtained by generalizing the categorical data, where T'(S₁, ..., S_n) are the numeric attributes, the generalization of the numeric attributes on T is denoted K_exp(δ_G(T'')), where K is the name of the secondary anonymization function, exp is a numeric expression, G is the generalization rule, and δ_G performs the generalization of the numeric tuple data.

Definition 6 (numeric tuple distance): Given a tuple set T = (t₁, t₂, ..., t_n) and two tuples t_i, t_j ∈ T, the distance between the tuples is their actual distance over all numeric quasi-identifiers:

d_n(t_i, t_j) = \| t_i - t_j \|_2 = \left[ \sum_{k=1}^{p} \omega_k \, | t_{ik} - t_{jk} |^2 \right]^{1/2} \quad (1)

where t_i and t_j denote different numeric tuples and d_n denotes the actual distance between two numeric tuples.
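Equation (1) can be sketched as a small function. The text does not define p or ω_k explicitly, so the sketch assumes p is the number of numeric quasi-identifiers and ω_k is a non-negative per-attribute weight defaulting to 1; the name `numeric_distance` is hypothetical.

```python
import math


def numeric_distance(t_i, t_j, weights=None):
    """Weighted Euclidean distance of equation (1) over numeric quasi-identifiers."""
    if len(t_i) != len(t_j):
        raise ValueError("tuples must have the same number of numeric attributes")
    if weights is None:
        weights = [1.0] * len(t_i)  # assumption: unweighted when no omega_k is given
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, t_i, t_j)))


# Two tuples described by (Age, Hours-per-week).
print(numeric_distance((25, 40), (28, 44)))
```

In practice the attributes would usually be normalized or weighted so that one attribute with a large range does not dominate the distance; that choice is left open here.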

As shown in Fig. 1, the present invention proposes an efficient k-anonymity algorithm, the NQLG algorithm, based on the Incognito algorithm and the MDAV algorithm. The algorithm combines the two: it first uses the Incognito algorithm on the categorical quasi-identifiers to obtain the nodes that satisfy k-anonymity and, after checking, obtains all root nodes. The data table is then generalized according to each root node in turn, and the MDAV algorithm clusters the numeric attributes, so that the equivalence classes finally obtained form an optimal k-partition in which each equivalence class contains between k and 2k-1 tuples. Finally, the generalization results obtained from the individual root nodes are compared, and the generalized data table with the smallest information loss is selected. The algorithm is described as follows:

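Categorical attribute generalization

Function: Incognito(T, CQI, k), where T denotes the data set to be generalized, CQI denotes the set of categorical quasi-identifiers, and k denotes the anonymity constraint;

1) Form the single-attribute generalization candidate node table C₁ and edge table E₁;

2) Use an empty queue queue to take out all root nodes in C₁, and compute the equivalence classes for every node in queue;

3) Determine whether each node satisfies k-anonymity: if it does, mark the node and all of its child nodes; if it does not, delete the node from C₁ and insert its child nodes into queue;

4) Repeat step 3) until all unsatisfying nodes have been deleted from C₁, and form the new tables C₂ and E₂ from the pruned C₁ and E₁;

5) Repeat steps 2), 3) and 4) until the pruned C_n is obtained;

6) S_n = {all nodes of C_n};

7) Return S_n.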
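The pruning idea behind steps 2) to 4) can be sketched as a breadth-first search over a generalization lattice. This is a simplified sketch under stated assumptions, not the full Incognito procedure: it searches one lattice over all categorical attributes at once instead of building C₁, C₂, ..., C_n over growing attribute subsets, and the names `satisfies_k`, `k_anonymous_nodes` and the toy hierarchies are hypothetical. A lattice node is a tuple of generalization levels, one per attribute; what the steps above call "child nodes" are the more general nodes reached by raising one level.

```python
from collections import Counter, deque


def satisfies_k(rows, hierarchies, levels, k):
    """True if every generalized value combination occurs at least k times."""
    counts = Counter(
        tuple(h[lvl][value] for h, lvl, value in zip(hierarchies, levels, row))
        for row in rows
    )
    return all(c >= k for c in counts.values())


def k_anonymous_nodes(rows, hierarchies, k):
    """Breadth-first search starting from the least generalized lattice node."""
    tops = [len(h) - 1 for h in hierarchies]
    start = tuple(0 for _ in hierarchies)
    queue, seen, marked = deque([start]), {start}, set()
    while queue:
        node = queue.popleft()
        # A marked node is known to be k-anonymous and is not re-checked.
        ok = node in marked or satisfies_k(rows, hierarchies, node, k)
        if ok:
            marked.add(node)
        for i, top in enumerate(tops):
            if node[i] == top:
                continue
            more_general = node[:i] + (node[i] + 1,) + node[i + 1:]
            if ok:
                marked.add(more_general)  # generalizing keeps k-anonymity
            if more_general not in seen:
                seen.add(more_general)
                queue.append(more_general)
    return sorted(marked)


# Toy data: (Sex, Race) with two levels each (raw value, fully suppressed).
SEX = [{"Male": "Male", "Female": "Female"}, {"Male": "*", "Female": "*"}]
RACE = [{"White": "White", "Black": "Black"}, {"White": "*", "Black": "*"}]
rows = [("Male", "White"), ("Male", "Black"), ("Female", "White"), ("Female", "White")]
print(k_anonymous_nodes(rows, [SEX, RACE], 2))
```

Marking the more general neighbours of a satisfying node is what saves work: their equivalence classes never need to be counted.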

Numeric attribute generalization

Function: MDAV(T', NQI, k), where T' denotes the data set to be clustered, NQI denotes the numeric attributes to be clustered, and k denotes the anonymity constraint;

1) Determine whether the number of tuples in the data set is greater than 2k-1; if so, continue with step 2); otherwise, return the data set T' and find its centroid;

2) Using NQI, find the two tuples r and s in the data set T' that are farthest apart;

3) With r as the centroid, find the k-1 tuples nearest to r to form an equivalence class C, update the centroid, delete these k tuples from the data set T', and put them into the cluster set {Q};

4) Repeat step 3) with s as the centroid;

5) Determine whether the number of tuples remaining in T' is greater than 2k-1; if so, repeat steps 2), 3) and 4); otherwise, return the data set T' and find its centroid;

6) Replace the quasi-identifier attribute values of the tuples in each equivalence class with the quasi-identifier attribute values of its centroid;

7) Return T'.
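The clustering steps can be sketched as follows. Two points differ from the wording above and are assumptions of this sketch: r is taken as the tuple farthest from the current centroid and s as the tuple farthest from r (the heuristic of the commonly published MDAV), and the loop uses that version's termination rule (two groups per round while at least 3k tuples remain, one more group if at least 2k remain) so that every group ends with between k and 2k-1 tuples. The function names are hypothetical, and the sketch returns the groups without performing the centroid replacement of step 6).

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def take_cluster(remaining, seed, k):
    """Remove and return the k tuples nearest to seed (seed itself included)."""
    remaining.sort(key=lambda p: dist(p, seed))
    cluster = remaining[:k]
    del remaining[:k]
    return cluster


def mdav(points, k):
    """Partition numeric tuples into groups of k to 2k-1 members."""
    remaining = list(points)
    clusters = []
    while len(remaining) >= 3 * k:
        c = centroid(remaining)
        r = max(remaining, key=lambda p: dist(p, c))  # farthest from centroid
        s = max(remaining, key=lambda p: dist(p, r))  # farthest from r
        clusters.append(take_cluster(remaining, r, k))
        clusters.append(take_cluster(remaining, s, k))
    if len(remaining) >= 2 * k:
        c = centroid(remaining)
        r = max(remaining, key=lambda p: dist(p, c))
        clusters.append(take_cluster(remaining, r, k))
    if remaining:
        clusters.append(remaining)  # leftover group
    return clusters


# (Age, Hours-per-week) pairs, made up for illustration.
pts = [(21, 20), (23, 25), (35, 40), (36, 38), (50, 45), (52, 50), (37, 41)]
for group in mdav(pts, 2):
    print(centroid(group), group)
```

Step 6) would then overwrite the numeric quasi-identifiers of each group with the printed centroid.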

Implementation of the NQLG algorithm

1) Determine the type of each attribute in the quasi-identifier set;

2) S_n = Incognito(T, CQI, k);

S_n is the data set obtained after the categorical attributes have been generalized;

3) Create an empty queue result and an empty node node;

4) Traverse S_n, entering the following loop:

Data set D_j, where D_j stores the data table after secondary generalization;

Read a node from S_n and assign it to node;

Generalize the data table T according to node to obtain T';

Traverse T', entering the following loop:

Store the i-th equivalence class of T' in T'_i;

MDAV(T'_i, NQI, k);

D_j = D_j ∪ T'_i;

Compute the information loss and insert it into result;

5) Compare the information loss values in result to obtain the D_j with the smallest information loss;

6) T'' = D_j; return T''.
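The outer loop of steps 3) to 6) can be sketched as a small driver. All five callables (`generalize`, `split_classes`, `microaggregate`, `info_loss`) and the name `nqlg` are hypothetical placeholders standing in for the Incognito result, the equivalence-class split, MDAV and the loss metric; the toy lambdas in the usage lines exist only to make the sketch runnable.

```python
def nqlg(table, candidate_nodes, generalize, split_classes, microaggregate, info_loss):
    """Try every candidate node and keep the result with the smallest loss."""
    results = []  # plays the role of the queue `result`
    for node in candidate_nodes:  # step 4: traverse S_n
        t_prime = generalize(table, node)  # T' from T and node
        d_j = []  # table after secondary generalization
        for eq_class in split_classes(t_prime):  # traverse T'
            d_j.extend(microaggregate(eq_class))  # D_j = D_j U T'_i
        results.append((info_loss(d_j), d_j))
    # Step 5: pick the D_j with the smallest information loss.
    # Raises ValueError if candidate_nodes is empty.
    return min(results, key=lambda pair: pair[0])[1]


# Toy stand-ins: integer division as "generalization", sum as "loss".
best = nqlg(
    table=[1, 2, 3, 4],
    candidate_nodes=[1, 2],
    generalize=lambda t, node: [x // node for x in t],
    split_classes=lambda t: [t],
    microaggregate=lambda eq: eq,
    info_loss=sum,
)
print(best)
```

Passing the building blocks as parameters keeps the driver independent of how the lattice search and the clustering are actually implemented.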

As the above steps show, the NQLG algorithm uses the Incognito function to form the generalization lattices of all single attributes and to determine whether each generalization satisfies k-anonymity. It deletes the nodes that do not satisfy k-anonymity, iterates over the nodes that do to form a candidate node set, determines again whether the candidate nodes satisfy k-anonymity, and deletes the nodes that do not. These steps are repeated until all categorical attributes have been iterated, and all root nodes that satisfy k-anonymity are output. The data table T is then generalized by each root node in turn, and the MDAV algorithm performs a secondary generalization of the generalized table T', partitioning each input equivalence class into groups of between k and 2k-1 tuples. After all partitions are complete, the information loss is computed, and the data table with the smallest loss is obtained by comparison.

Rationality analysis of the NQLG algorithm: Step 2) yields the categorical-attribute frequent item set that satisfies k-anonymity, after which the numeric attributes are microaggregated. This avoids the over-generalization of numeric attributes that global generalization can cause. After step 4), the source data table is partitioned into groups of between k and 2k-1 tuples, which greatly reduces the information loss caused by using the Incognito algorithm alone.

Complexity analysis of the NQLG algorithm: Assume the data set contains n tuples and m categorical quasi-identifiers. The time cost of the algorithm is then analyzed as follows. Step 1 costs O(1). Step 2 uses the Incognito algorithm to find the k-anonymous generalizations of the categorical attributes, at a time cost of O(∑ C_i), where C_i is the number of nodes in the i-th iteration. Step 3 costs O(1). The time cost of step 4 is (formula not reproduced), where l is the number of root nodes after generalization. The time complexity of the MDAV algorithm is (formula not reproduced), where j is the number of large equivalence classes obtained in the previous step. Step 5 costs O(l). The overall time cost of the algorithm is therefore (formula not reproduced).

Experimental verification of the NQLG algorithm and analysis of results:

Experimental environment: The hardware and software environment used in the experiment is 4 GB of memory and the Windows 7 operating system, with the algorithm implemented in Java and SQL Server 2008. The Adult data set from the UCI Machine Learning Repository is used as the experimental data set. The Adult data set consists of U.S. census data. The training set is used, which contains 30162 records after the records with missing values are removed. Eight attributes are chosen: Sex, Race, Hours_per_week, Marital_status, Education, Workclass, Native_country and Age. Age and Hours_per_week are continuous (numeric) quasi-identifiers, and Sex, Race, Marital_status, Education, Workclass and Native_country are categorical quasi-identifiers.

Analysis of results: The Incognito algorithm is chosen as the comparison algorithm. The data set after k-anonymization is given a secondary anonymization with the MDAV algorithm, and the proposed algorithm is evaluated in terms of information loss and execution time. The NQLG algorithm is run with different numbers of quasi-identifiers and different values of k to observe how the information loss and execution time change. The information loss is computed with the method used in the literature:

Information loss of an equivalence class:

Information loss of the table:

IL(T) = \frac{1}{n} \sum IL(e_i) \quad (3)

Here |e_i| is the number of tuples in cluster e_i; 1 ≤ l ≤ m; N_i is the range of the i-th numeric attribute; MAX_{N_i} and MIN_{N_i} are the maximum and minimum values in cluster e_i; H(T_{C_i}) is the height of the taxonomy tree; and H(∧(∪C_j)) is the height of the taxonomy subtree rooted at the lowest common ancestor.
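The averaging in equation (3) can be sketched as below. The per-class formula is not reproduced in this text, so the sketch assumes a form suggested by the symbol definitions: IL(e) = |e| · (Σ_i (MAX_{N_i} − MIN_{N_i}) / |N_i| + Σ_j H(∧(∪C_j)) / H(T_{C_j})). It also assumes that n in equation (3) is the total number of tuples. Both assumptions, and the names `class_loss` and `table_loss`, are illustrative and not confirmed by the source.

```python
def class_loss(eq_class, numeric_ranges, height_ratios):
    """Assumed IL(e): class size times (numeric spread + taxonomy height ratios).

    eq_class       -- list of tuples holding only the numeric attributes
    numeric_ranges -- |N_i|, the full range of each numeric attribute
    height_ratios  -- H(subtree) / H(tree) for each categorical attribute
    """
    spread = sum(
        (max(t[i] for t in eq_class) - min(t[i] for t in eq_class)) / rng
        for i, rng in enumerate(numeric_ranges)
    )
    return len(eq_class) * (spread + sum(height_ratios))


def table_loss(classes, numeric_ranges):
    """Equation (3): average of IL(e_i), with n assumed to be the tuple count."""
    n = sum(len(eq_class) for eq_class, _ in classes)
    total = sum(class_loss(eq, numeric_ranges, ratios) for eq, ratios in classes)
    return total / n


# Two classes over a single numeric attribute (Age) with an assumed range of 18.
classes = [
    ([(21,), (30,)], [0.5]),         # one categorical attribute, half generalized
    ([(31,), (40,), (35,)], [0.0]),  # categorical attribute left at leaf level
]
print(table_loss(classes, numeric_ranges=[18]))
```

A class whose numeric values all coincide and whose categorical values stay at leaf level contributes zero loss under this assumed form.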

Information loss analysis: Fig. 3 and Fig. 4 show that, when the quasi-identifier set |QI| is fixed, the information loss IL of the proposed algorithm tends to decrease as k increases, and when k reaches 50 the information loss of both algorithms tends to rise. The experimental data show that the information loss of the proposed algorithm is significantly lower than that of the Incognito algorithm. From the point of view of information loss, the proposed algorithm therefore has a large advantage in avoiding over-generalization.

Running time analysis: Fig. 5 and Fig. 6 show that, when the quasi-identifier set is fixed, the running times of both the Incognito algorithm and the proposed algorithm decrease as k increases. Comparing the plots for the different quasi-identifier sets QI, when |QI| = 6+1 (6 categorical attributes + 1 numeric attribute) the Incognito algorithm has the better running time, whereas when |QI| = 6+2 (6 categorical attributes + 2 numeric attributes) the proposed algorithm has the better running time as k increases. The experimental data show that the advantage of the proposed algorithm becomes more obvious as the number of numeric quasi-identifiers increases.

As can be seen from Fig. 7, as k decreases, the time difference Δt between the two quasi-identifier sets (|QI| = 6+2 and |QI| = 6+1) increases for both the Incognito algorithm and the proposed algorithm, and the increase for the Incognito algorithm is marked, much larger than that for the proposed algorithm. In terms of efficiency, therefore, the advantage of the proposed algorithm grows significantly as the proportion of numeric quasi-identifiers in the quasi-identifier set |QI| changes.

The NQLG algorithm is proposed mainly to address the over-generalization of numeric attributes caused by the Incognito algorithm and the semantic inclusion problem in cluster analysis. Experiments show that, compared with traditional privacy protection algorithms, the NQLG algorithm has a clear advantage in dealing with semantic loss and semantic inclusion. Future research can proceed in the following directions: since data may be published a second time, the NQLG algorithm can be further extended to dynamic data sets; and, as data scale grows sharply, distributed and cloud computing technologies can be introduced into anonymization research to further improve the efficiency of processing massive data.

Claims

1. A secondary k-anonymity privacy protection method for differentiating quasi-identifier attributes, characterized in that:

1) S_n = Incognito(T, CQI, k), where S_n denotes the data set obtained after the categorical attributes have been generalized, T denotes the data set to be generalized, CQI denotes the set of categorical quasi-identifiers, and k denotes the anonymity constraint;

2) Create an empty queue result and an empty node node;

3) Traverse S_n, entering the following loop:

Data set D_j, where D_j stores the data table after secondary generalization;

Read a node from S_n and assign it to node;

Generalize the data table T according to node to obtain T';

Traverse T', entering the following loop:

Store the i-th equivalence class of T' in T'_i;

MDAV(T'_i, NQI, k), where the first argument denotes the data set to be clustered, NQI denotes the numeric attributes to be clustered, and k denotes the anonymity constraint;

D_j = D_j ∪ T'_i;

Compute the information loss and insert it into result;

4) Compare the information loss values in result to obtain the D_j with the smallest information loss;

5) T'' = D_j; return T''.

2. The secondary k-anonymity privacy protection method for differentiating quasi-identifier attributes according to claim 1, characterized in that the specific steps by which Incognito(T, CQI, k) generalizes the categorical attributes are as follows:

1) Form the single-attribute generalization candidate node table C₁ and edge table E₁;

2) Use an empty queue queue to take out all root nodes in C₁, and compute the equivalence classes for every node in queue;

3) Determine whether each node satisfies k-anonymity: if it does, mark the node and all of its child nodes; if it does not, delete the node from C₁ and insert its child nodes into queue;

4) Repeat step 3) until all unsatisfying nodes have been deleted from C₁, and form the new tables C₂ and E₂ from the pruned C₁ and E₁;

5) Repeat steps 2), 3) and 4) until the pruned C_n is obtained;

6) S_n = {all nodes of C_n};

7) Return S_n.

3. The secondary k-anonymity privacy protection method for differentiating quasi-identifier attributes according to claim 1, characterized in that the specific steps by which MDAV(T', NQI, k) generalizes the numeric attributes are as follows:

1) Determine whether the number of tuples in the data set is greater than 2k-1; if so, continue with step 2); otherwise, return the data set T' and find its centroid;

2) Using NQI, find the two tuples r and s in the data set T' that are farthest apart;

3) With r as the centroid, find the k-1 tuples nearest to r to form an equivalence class C, update the centroid, delete these k tuples from the data set T', and put them into the cluster set {Q};

4) Repeat step 3) with s as the centroid;

5) Determine whether the number of tuples remaining in T' is greater than 2k-1; if so, repeat steps 2), 3) and 4); otherwise, return the data set T' and find its centroid;

6) Replace the quasi-identifier attribute values of the tuples in each equivalence class with the quasi-identifier attribute values of its centroid;

7) Return T'.