CN107194278B

CN107194278B - A Skyline-based data generalization method

Info

Publication number: CN107194278B
Application number: CN201710339575.XA
Authority: CN
Inventors: 丁晓锋; 金海; 王丽
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2017-05-15
Filing date: 2017-05-15
Publication date: 2019-11-22
Anticipated expiration: 2037-05-15
Also published as: CN107194278A

Abstract

The invention discloses a Skyline-based data generalization method, comprising: processing a data table according to the data-publishing privacy protection standard 10-anonymity and recording the resulting re-identification risk R as a threshold T; determining a policy space {S, (R, U)} from the value domains of the quasi-identifier attributes and the threshold T, where the R value of every policy in {S, (R, U)} does not exceed T; filtering the policy space {S, (R, U)} with an ε-approximate Skyline to obtain a candidate policy space {G, (R, U)}; and performing a Skyline computation on {G, (R, U)} to obtain the recommended policy space {F, (R, U)}, which is the set of privacy policies recommended for the data table. By enumerating the full policy space, the invention improves the accuracy of privacy policy recommendation and covers a wide range of the R-U space, meeting users' multi-level needs. Setting the threshold T filters out policies whose privacy protection falls short of the requirement and shortens policy space generation, and ε-approximate Skyline filtering further reduces the size of the candidate policy space.

Description

A Skyline-based data generalization method

Technical field

The invention belongs to the field of privacy-preserving data publishing, and more particularly relates to a Skyline-based data generalization method.

Background art

In the digital age, exchanging and publishing data among parties (such as governments, enterprises, and individuals) is becoming increasingly important. For example, hospitals in California are generally required to submit certain medical data for some rehabilitation patients. These data contain sensitive information, and publishing them directly would reveal individual privacy. L. Sweeney gives an example of a "linking attack": by joining a patient information table and a voter information table on the attributes (Age, Sex, Zipcode), one can determine that the person named Ahmed has influenza, so the patient's privacy is leaked. This way of publishing data is unsafe. Privacy-preserving data publishing was therefore proposed, which requires maximizing data utility while protecting privacy.

To protect the privacy of published data, two kinds of privacy protection models have been proposed. One controls the re-identification risk, for example K-anonymity. The other is rule-based, for example the Safe Harbor policy. K-anonymity uses generalization, permutation, and perturbation so that each equivalence class contains at least K records, where an equivalence class is the set of records with the same quasi-identifier values. A quasi-identifier cannot by itself identify an individual's sensitive information, but it can do so when joined with another table that contains identity attributes. However, the policies that K-anonymity provides cover only a small part of the R-U space and cannot meet users' multi-level needs, where R is the risk and U is the information loss. Safe Harbor, set out under the Health Insurance Portability and Accountability Act (HIPAA), requires removing specified attributes from the data table, such as name, telephone number, and email address. In most cases, however, the Safe Harbor policy performs poorly on both risk and information loss.

Balancing privacy protection against data utility is attracting growing attention. To meet users' multi-level needs, it is increasingly important to screen out good privacy policies. To prevent sensitive information from being re-identified, generalization is applied to the quasi-identifiers. Generalization means replacing a specific attribute value with a broader range, that is, describing the data in a more summarized and abstract way. For example, age 21 is generalized to [20-30]. Current solutions to this problem are mainly heuristic algorithms. One is a binary search algorithm based on Hamming distance, in which the choice of each bit is determined by a weight. It can quickly recommend some good policies, but the accuracy of the recommendation is not high and the range of generalization policies is small. The other is a probability-based heuristic search algorithm, whose search range expands to cover the whole R-U space, but which selects a certain number of policies along a certain path each time and applies Skyline processing to them. The Skyline operation filters a set of "interesting" policies out of the candidates, namely those that are not "dominated" by any other policy; that is, the R value and U value of a retained policy cannot both be larger than those of another retained policy. Because the processing is approximate, the screening error depends on the initial policy set and on the chosen path, and the algorithm also converges slowly.

In summary, existing privacy protection methods cover little of the R-U space, cannot meet users' multi-level needs, perform poorly on both risk and information loss, have low accuracy and a small range of generalization policies, and converge slowly.

Summary of the invention

In view of the defects of the prior art, the object of the invention is to solve the technical problems that existing privacy protection methods cover little of the R-U space, cannot meet users' multi-level needs, perform poorly on both the risk R and the information loss U, have low accuracy and a small range of generalization policies, and converge slowly.

To achieve the above object, an embodiment of the invention provides a Skyline-based data generalization method comprising the following steps.

Step S101: process the data table according to the data-publishing privacy protection standard 10-anonymity, record the resulting re-identification risk R of that policy as the threshold T, and determine the policy space {S, (R, U)} from the value domains of the quasi-identifier attributes and the threshold T, where U is the information loss of a policy, the R value of every policy in {S, (R, U)} does not exceed T, and R ≥ 0, U ≥ 0, T ≥ 0.

Step S102: filter the policy space {S, (R, U)} with the approximate Skyline to obtain the candidate policy space {G, (R, U)}, where ε is a preset security parameter, ε ≥ 1. The approximate Skyline enlarges the domination region of the Skyline by a preset ratio, determined by ε, to produce an approximate domination region; it may also be written "ε-approximate Skyline". The domination region is defined as follows: if the R value and U value of policy A are both larger than the R value and U value of policy B, then policy A lies in the domination region of policy B.

Step S103: perform the Skyline operation on the candidate policy space {G, (R, U)} to obtain the recommended policy space {F, (R, U)}, which is the set of privacy policies recommended for the data table; each policy in the recommended policy space corresponds to one generalization operation on the information in the data table.

Specifically, privacy policies are recommended for the data table, and each privacy policy corresponds to one way of generalizing the information in the data table. Generalizing the data table according to a recommended policy gives the user's privacy fine-grained protection.

The embodiment of the invention improves the accuracy of policy recommendation by enumerating the full policy space, and covers a wide range of the R-U space, meeting users' multi-level needs. Setting the threshold T on the re-identification risk R filters out policies whose privacy protection falls short of the requirement, which shortens policy space generation and also reduces the size of the candidate policy space. ε-approximate Skyline filtering further reduces the size of the candidate policy space, while the security parameter ε adjusts the balance between privacy protection and data utility.

In an optional embodiment, step S101 comprises the following sub-steps. Step S101-1: enumerate $2^L$ policies according to the value domains of the quasi-identifier attributes of the data table, where $L = (r_1 - 1) + \dots + (r_n - 1)$; the $2^L$ policies correspond to n attributes, n ≥ 1; the i-th attribute has $r_i$ values, and these $r_i$ values correspond to $2^{r_i - 1}$ policies, 1 ≤ i ≤ n. Step S101-2: determine the R value of each of the $2^L$ policies, filter out the policies whose R value is greater than the threshold T, and thereby determine the policy set. Step S101-3: determine the information loss U of each policy in the policy set, and determine the policy space {S, (R, U)} from the R and U values of each policy.

Specifically, enumerating the full policy space improves the accuracy of policy recommendation, because the recommendation is made over the full space of $2^L$ policies.

In an optional embodiment, step S102 comprises the following sub-steps. Step S102-1: repeatedly bisect the value domains of the quasi-identifier attributes of the data table until the domains cannot be divided further; each iteration generates one policy, and the policies generated over all iterations form the initial policy set. Step S102-2: filter the policy space {S, (R, U)} with the ε-approximate Skyline according to the initial policy set: add to the initial policy set every policy in {S, (R, U)} that is not dominated by the initial policy set, updating the initial policy set; filter out the policies in {S, (R, U)} that are dominated by the initial policy set; and at the same time filter out the policies in the initial policy set that are dominated by policies in {S, (R, U)}. Step S102-3: following step S102-2, filter the policies in {S, (R, U)} one by one with the ε-approximate Skyline; when {S, (R, U)} has been filtered to empty, the updated initial policy set is the candidate policy space {G, (R, U)}.

Specifically, policies generated by iteration have strong dominating power, so setting up such an initial policy set filters the policy space S effectively, improves the efficiency of approximate Skyline filtering, and also reduces the size of the candidate policy space {G, (R, U)}.

In an optional embodiment, step S103 comprises the following sub-steps:

Step S103-1: partition the candidate policy space {G, (R, U)} into blocks, and sort the data of each block on the node where it resides.

Create a partition file, Sky-partition, that records how the blocks of the policy space are distributed; draw a sample from the candidate policy space {G, (R, U)} at sampling rate ρ, sort the sample, and extract t-1 items as split points; partition {G, (R, U)} according to the split points, each block of the policy space corresponding to one data block;

where ρ is the sampling rate, t is the number of data blocks, and |G| is the number of policies.

Step S103-2: determine the smallest R value and the smallest U value in each block of the policy space. Step S103-3: determine the policies of the Skyline set and the recommended policy space {F, (R, U)} from the smallest R value and the smallest U value in each block.

In an optional embodiment, the re-identification risk R and the information loss U are determined by the following formulas, respectively:

where P denotes the record distribution of the equivalence classes, P′ denotes the record distribution of the equivalence classes after generalization, |e′| is the number of records in equivalence class e′, all equivalence classes e′ together constitute P′, N denotes the number of records in the whole data table, N ≥ 1, |e*| is the average number of records in the new equivalence classes generated when equivalence class e is generalized, and an equivalence class is the set of records in the data table that share the same quasi-identifier attribute values.

Specifically, setting the threshold T filters out policies whose privacy protection does not meet the 10-anonymity requirement, which greatly reduces the time needed to generate the policy space S and also reduces its size.

In an optional embodiment, the ε-approximate Skyline enlarges the domination region of the Skyline by a preset ratio, determined by ε, to produce an approximate domination region. Specifically: if the R value and U value of policy A are at most ε times the R value and U value of policy B, then policy A approximately dominates policy B with precision ε, denoted $A <_{\varepsilon} B$.

Specifically, approximate Skyline filtering further reduces the size of the candidate policy space {G, (R, U)}, while the security parameter ε adjusts the balance between the R value and the U value, where the R value reflects privacy protection performance and the U value reflects data utility.

In a second aspect, an embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data generalization method of the first aspect.

In general, compared with the prior art, the above technical solutions conceived by the invention have the following beneficial effects:

1. The embodiment of the invention improves the accuracy of policy recommendation by enumerating the full policy space, and covers a wide range of the R-U space, meeting users' multi-level needs.

2. By setting the threshold T on the re-identification risk R, the embodiment of the invention filters out policies whose privacy protection falls short of the requirement, which shortens policy space generation and also reduces the size of the candidate policy space.

3. The embodiment of the invention uses ε-approximate Skyline filtering to further reduce the size of the candidate policy space, while the security parameter ε adjusts the balance between privacy protection and data utility.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a Skyline-based data generalization method provided by an embodiment of the invention;

Fig. 2 is a schematic flowchart of another Skyline-based data generalization method provided by an embodiment of the invention;

Fig. 3 is a schematic diagram of the domination regions corresponding to an ε-approximate Skyline provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.

Fig. 1 is a schematic flowchart of a Skyline-based data generalization method provided by an embodiment of the invention, comprising steps S101 to S103.

Step S101: process the data table according to the data-publishing privacy protection standard 10-anonymity, record the resulting re-identification risk R of that policy as the threshold T, and determine the policy space {S, (R, U)} from the value domains of the quasi-identifier attributes and the threshold T, where U is the information loss of a policy and the R value of every policy in {S, (R, U)} does not exceed T.

Specifically, step S101 comprises the following sub-steps:

Step S101-1: enumerate $2^L$ policies according to the value domains of the quasi-identifier attributes of the data table, where $L = (r_1 - 1) + \dots + (r_n - 1)$; the $2^L$ policies correspond to n attributes; the i-th attribute has $r_i$ values, and these $r_i$ values correspond to $2^{r_i - 1}$ policies, 1 ≤ i ≤ n.

Specifically, all policies are enumerated according to the value domains of the quasi-identifier attributes of the data table, as follows. Suppose $Q = \{Q_1, \dots, Q_n\}$ is a set of quasi-identifier attributes and the domain of attribute $Q_i$ has $r_i$ values. A de-identification policy is represented as a binary string: let $p_i$ denote the partition of the domain of attribute $Q_i$; then the set $\{p_1, \dots, p_n\}$ is one de-identification policy, where $p_i = \{I_1, \dots, I_{r_i - 1}\}$ and $I_j$ indicates whether the j-th and (j+1)-th values of attribute $Q_i$ are separated. The string therefore has length $L = (r_1 - 1) + \dots + (r_n - 1)$, and all $2^L$ policies are enumerated.
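The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `enumerate_policies` and `split_by_attribute` are hypothetical, and the domain sizes 3, 2, and 1 are chosen only as an example.

```python
from itertools import product


def enumerate_policies(domain_sizes):
    """Return every bit string of length L = sum(r_i - 1) as one policy."""
    total_bits = sum(r - 1 for r in domain_sizes)
    return ["".join(bits) for bits in product("01", repeat=total_bits)]


def split_by_attribute(policy, domain_sizes):
    """Cut one policy string into the per-attribute bit groups p_1 .. p_n."""
    groups, start = [], 0
    for r in domain_sizes:
        groups.append(policy[start:start + r - 1])
        start += r - 1
    return groups


# Hypothetical example: three attributes with 3, 2 and 1 values.
sizes = [3, 2, 1]
policies = enumerate_policies(sizes)
print(len(policies))                      # number of policies, 2**L
print(split_by_attribute("101", sizes))   # bit groups for one policy
```

An attribute with a single value contributes zero bits, so its bit group is the empty string.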

In one example, before step S101-1 is performed, the method further comprises: applying 10-anonymity processing to the value domains of the quasi-identifier attributes of the data table, and taking the re-identification risk R corresponding to the 10-anonymity processing as the threshold T. The threshold T is used to protect the privacy of the data table. The 10-anonymity processing is one of the $2^L$ policies and is determined by an existing data-publishing privacy protection standard.

Step S101-2: determine the R value of each of the $2^L$ policies, filter out the policies whose R value is greater than the threshold T, and thereby determine the policy set.

Step S101-3: determine the information loss U of each policy in the policy set, and determine the policy space {S, (R, U)} from the R and U values of each policy.

Step S102: filter the policy space {S, (R, U)} with the ε-approximate Skyline to obtain the candidate policy space {G, (R, U)}, where ε is a preset security parameter. The ε-approximate Skyline enlarges the domination region of the Skyline by a preset ratio, determined by ε, to produce an approximate domination region. The domination region is defined as follows: if the R value and U value of policy A are both larger than the R value and U value of policy B, then policy A lies in the domination region of policy B.

In an optional embodiment, the ε-approximate Skyline enlarges the domination region of the Skyline by a preset ratio, determined by ε, to produce an approximate domination region. Specifically: if the R value and U value of policy A are at most ε times the R value and U value of policy B, then policy A approximately dominates policy B with precision ε, denoted $A <_{\varepsilon} B$.
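The ε-approximate dominance test can be illustrated with a short Python sketch. It assumes the reading that A approximately dominates B when both R(A) ≤ ε·R(B) and U(A) ≤ ε·U(B); the function name `eps_dominates` is hypothetical, and the (R, U) pairs are the values listed for policies 000, 001, 100, and 110 in Table 2 of this description.

```python
def eps_dominates(a, b, eps=1.0):
    """Return True if policy a = (R, U) eps-approximately dominates b.

    Assumed reading of the definition: both coordinates of a are at most
    eps times the corresponding coordinates of b, with eps >= 1.
    """
    if eps < 1:
        raise ValueError("eps must be at least 1")
    return a[0] <= eps * b[0] and a[1] <= eps * b[1]


# (R, U) values as listed in Table 2.
p000 = (0.021, 0.1218)
p001 = (0.111, 0.0004)
p100 = (0.095, 0.1216)
p110 = (0.170, 0.1213)

print(eps_dominates(p001, p110))        # eps = 1: ordinary dominance
print(eps_dominates(p000, p100))        # eps = 1: U(000) is slightly larger
print(eps_dominates(p000, p100, 1.01))  # a slightly enlarged region
```

With ε = 1 the test reduces to ordinary dominance; raising ε enlarges the region, so a policy such as 000 starts to cover a near neighbour such as 100.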

Step S103: perform the Skyline operation on the candidate policy space {G, (R, U)} to obtain the recommended policy space {F, (R, U)}, which is the set of privacy policies recommended for the data table. Each policy in the recommended policy space {F, (R, U)} corresponds to one generalization operation on the information in the data table.

Specifically, step S102 comprises the following sub-steps:

Step S102-1: repeatedly bisect the value domains of the quasi-identifier attributes of the data table until the domains cannot be divided further; each iteration generates one policy, and the policies generated over all iterations form the initial policy set.

Step S102-2: filter the policy space {S, (R, U)} with the ε-approximate Skyline according to the initial policy set: add to the initial policy set every policy in {S, (R, U)} that is not dominated by the initial policy set, updating the initial policy set; filter out the policies in {S, (R, U)} that are dominated by the initial policy set; and at the same time filter out the policies in the initial policy set that are dominated by policies in {S, (R, U)}.

Step S102-3: following step S102-2, filter the policies in {S, (R, U)} one by one with the ε-approximate Skyline; when {S, (R, U)} has been filtered to empty, the updated initial policy set is the candidate policy space {G, (R, U)}.
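Sub-steps S102-2 and S102-3 can be sketched as a single pass over the policy space. This is only one possible reading: it assumes the same ε-test is used both to reject an incoming policy and to evict members of the working set, which the text does not state explicitly. The names `eps_skyline_filter`, `space`, and `initial`, and all (R, U) values, are hypothetical.

```python
def eps_dominates(a, b, eps):
    """a eps-approximately dominates b (both are (R, U) pairs)."""
    return a[0] <= eps * b[0] and a[1] <= eps * b[1]


def eps_skyline_filter(space, initial, eps=1.0):
    """Filter `space` against a growing working set seeded with `initial`.

    Both arguments map a policy name to its (R, U) pair. The returned dict
    plays the role of the candidate policy space G.
    """
    candidates = dict(initial)
    for name, point in space.items():
        if name in candidates:
            continue  # already part of the working set
        if any(eps_dominates(kept, point, eps) for kept in candidates.values()):
            continue  # dominated by the working set: filtered out
        # Evict working-set members that the new policy dominates, then add it.
        candidates = {k: v for k, v in candidates.items()
                      if not eps_dominates(point, v, eps)}
        candidates[name] = point
    return candidates


# Hypothetical (R, U) values, for illustration only.
space = {"a": (0.1, 0.9), "b": (0.5, 0.5), "c": (0.9, 0.1),
         "d": (0.6, 0.6), "e": (0.11, 0.91), "f": (0.47, 0.52)}
initial = {"b": (0.5, 0.5)}

print(sorted(eps_skyline_filter(space, initial, eps=1.0)))
print(sorted(eps_skyline_filter(space, initial, eps=1.1)))
```

In this toy data, policy `f` survives exact filtering but is absorbed by `b` once ε is raised to 1.1, which shows how a larger ε shrinks the candidate space.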

Specifically, step S103 comprises the following sub-steps:

where ρ is the sampling rate, t is the number of data blocks, and |G| is the number of policies.

Step S103-2: determine the smallest R value and the smallest U value in each block of the policy space.

Step S103-3: determine the policies of the Skyline set and the recommended policy space {F, (R, U)} from the smallest R value and the smallest U value in each block.
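For intuition, the Skyline operation itself can be sketched on a single machine before considering the block-wise version. The following Python sketch is the standard two-dimensional sweep (sort by R, keep a policy only if its U is strictly smaller than every U seen so far); it is not the distributed procedure of the embodiment, and the function name and data are hypothetical.

```python
def skyline(points):
    """Return the names of policies not dominated by any other policy.

    `points` maps a policy name to its (R, U) pair; smaller is better in
    both dimensions. Exact duplicates are treated as dominated.
    """
    result = []
    best_u = float("inf")
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], kv[1][1]))
    for name, (_, u) in ordered:
        if u < best_u:  # nothing with smaller-or-equal R has U this low
            result.append(name)
            best_u = u
    return result


# Hypothetical (R, U) values, for illustration only.
demo = {"p1": (0.1, 0.9), "p2": (0.3, 0.4), "p3": (0.2, 0.95),
        "p4": (0.6, 0.1), "p5": (0.7, 0.3)}
print(skyline(demo))
```

Tracking the smallest U seen so far is the same idea that the block-wise procedure applies through its per-block minimum values.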

In one example, step S103-3 specifically comprises: first computing Min_u_lowKey(p) and Min_u_Key_lowR(p) for policy P, where Min_u_lowKey(p) is the smallest U value in the data blocks that precede the block containing policy P, and Min_u_Key_lowR(p) is the smallest U value among the policies ranked before P within the data block containing P. The U value of policy P, U(p), is then compared with both; policy P is output only if U(p) < Min_u_lowKey(p) and U(p) < Min_u_Key_lowR(p).

Step 4: perform the Skyline operation on the policy space {G, (R, U)} to obtain and output the recommended policy set F.

Specifically, steps S103-1 to S103-3 can in turn be divided into the following four stages:

Sky-partition creation stage: create a partition file, denoted Sky-partition, that records how data are distributed to each node. Draw a sample at sampling rate ρ, sort the sample, and extract t-1 items as split points.

Partial sorting stage: in the Reduce phase, divide the data file according to the split points, and have each node sort the data on that node.

HashMap creation stage: create a table, denoted HashMap, that maps each node to its smallest R value and smallest U value.

Global screening stage: in the Reduce phase, screen the data records one by one against the HashMap, and output the policies of the Skyline set together with their R and U values.

In an optional example, the re-identification risk R and the information loss U are determined by the following formula (1) and formula (2), respectively:

where P denotes the record distribution of the equivalence classes, P′ denotes the record distribution of the equivalence classes after generalization, |e′| is the number of records in equivalence class e′, all equivalence classes e′ together constitute P′, N denotes the number of records in the whole data table, |e*| is the average number of records in the new equivalence classes generated when equivalence class e is generalized, and an equivalence class is the set of records in the data table that share the same quasi-identifier attribute values.

Fig. 2 is a schematic flowchart of another Skyline-based data generalization method provided by an embodiment of the invention, comprising steps S201 to S206.

Step S201: input the data table D and the parameter ε.

The data table D and the security parameter ε are input, where ε is a user-defined security parameter, ε ≥ 1. For example, Table 1 is an (original) data table provided by an embodiment of the invention, as shown below:

Equivalence class	Record count
42|Male|White|	465
42|Female|White|	168
43|Male|White|	468
43|Female|White|	163
44|Male|White|	444
44|Female|White|	150

Table 1

The attributes Age, Gender, and Race constitute the quasi-identifier attributes: Age takes values in [42, 44], Gender takes values in {Female, Male}, and Race takes values in {White}.

The set of records in the data table with the same quasi-identifier attribute values is an equivalence class. For example, in Table 1 the equivalence class "42|Male|White|" has 465 records.
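Grouping records into equivalence classes, and checking whether every class reaches a given size K, can be sketched as follows. The helper names, the sample records, and the `Diagnosis` column are hypothetical and serve only to illustrate the definition above.

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(rec[a] for a in quasi_identifiers) for rec in records)


def is_k_anonymous(classes, k):
    """True if every equivalence class holds at least k records."""
    return all(count >= k for count in classes.values())


# Hypothetical raw records; "Diagnosis" stands in for a sensitive attribute.
records = [
    {"Age": 42, "Gender": "Male", "Race": "White", "Diagnosis": "flu"},
    {"Age": 42, "Gender": "Male", "Race": "White", "Diagnosis": "asthma"},
    {"Age": 43, "Gender": "Female", "Race": "White", "Diagnosis": "flu"},
]
classes = equivalence_classes(records, ["Age", "Gender", "Race"])
print(dict(classes))
print(is_k_anonymous(classes, 2))
```

Applied to the counts in Table 1, where the smallest class has 150 records, the same check would report that the table already satisfies 10-anonymity.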

Step S202: compute the privacy protection threshold T.

10-anonymity processing is applied to the data table D, and its R value, obtained from formula (1), is taken as the privacy protection threshold T.

Step S203: enumerate all policies, filter out the enumerated policies whose R value is greater than the threshold T, compute the R and U values of the remaining policies, and generate the policy space {S, (R, U)}.

From the value domains of the quasi-identifier attributes in Table 1, the policy length is L = 2 + 1 = 3. According to Table 1 there are n = 3 attributes: Age has 3 values, Gender has 2 values, and Race has 1 value, so $L = (r_1 - 1) + \dots + (r_n - 1) = (3 - 1) + (2 - 1) + (1 - 1) = 3$. Specifically, since Age takes the values 42, 43, or 44, it can be generalized in four ways: [42-44]; [42-43], [44]; [42], [43-44]; and [42], [43], [44]. These four ($2^2$) generalization cases of Age can be represented by 2 bits. Similarly, the two ($2^1$) generalization cases of Gender can be represented by 1 bit, and the generalization of Race is represented by 0 bits, corresponding to only one ($2^0$) generalization case.
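The four generalization cases of Age can be reproduced by decoding the 2-bit group directly. The sketch below assumes that a bit equal to 1 separates two neighbouring values and a bit equal to 0 merges them, which matches the later description of policies "000" and "111"; the function name `decode_partition` is hypothetical.

```python
def decode_partition(values, bits):
    """Group ordered values; bits[j] == '1' separates values[j] and values[j+1]."""
    if len(bits) != len(values) - 1:
        raise ValueError("need exactly len(values) - 1 bits")
    groups = [[values[0]]]
    for value, bit in zip(values[1:], bits):
        if bit == "1":
            groups.append([value])     # start a new interval
        else:
            groups[-1].append(value)   # extend the current interval
    return groups


ages = [42, 43, 44]
for bits in ("00", "01", "10", "11"):
    print(bits, decode_partition(ages, bits))
```

Each bit pattern yields exactly one of the four partitions listed above, so 2 bits are enough for an attribute with three values.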

The $2^L$ policies correspond to n attributes; the i-th attribute has $r_i$ values, and these $r_i$ values correspond to $2^{r_i - 1}$ policies, 1 ≤ i ≤ n.

In total 8 policies are enumerated, as shown in Table 2. The R value and U value of each policy are computed from formula (1) and formula (2), respectively.

Policy	000	001	010	011	100	101	110	111
R	0.021	0.111	0.098	0.154	0.095	0.490	0.170	1
U	0.1218	0.0004	0.1213	0.0000	0.1216	0.0003	0.1213	0

Table 2

In one example, the R value and U value of policy 000 are calculated as follows:

Wherein,

The remaining seven policies are calculated in the same way, giving the results in Table 2.

Step S204: filter the policy space {S, (R, U)} with the ε-approximate Skyline to obtain the candidate policy space {G, (R, U)}.

In one example, Fig. 3 is a schematic diagram of the domination regions corresponding to an ε-approximate Skyline provided by an embodiment of the invention. As shown in Fig. 3, policy 1, policy 2, and policy 3 form the initial policy set. One marked region in the figure is the domination region of policy 1, policy 2, and policy 3; the other marked region is their approximate domination region when ε-approximate Skyline filtering is applied.

Table 3 shows the candidate policy set {G, (R, U)} determined when ε is 1.

Policy	000	001	010	011	100	101	111
R	0.021	0.111	0.098	0.154	0.095	0.490	1
U	0.1218	0.0004	0.1213	0.0000	0.1216	0.0003	0

Table 3

The policies in the initial policy set are the ones shown in bold in Table 2, for example policy "100", policy "101", and policy "110". Every policy in Table 2 is compared with the policies in the initial policy set, and the initial policy set is updated along the way. Iterating in this manner, policies 000, 001, 010, 011, 100, and 101 are retained; because R(110) > R(001) and U(110) > U(001), policy 110 is filtered out. Policy 111 is likewise retained. The filtered result is shown in Table 3 above.

Step S205: perform the Skyline operation on the candidate policy space {G, (R, U)} to obtain the generalization policy set F.

Specifically, Table 4 is the data G to be processed. Assume the number of Reducers is 2, which determines the sampling rate ρ. Assume the sample is {001; 010; 111}, which sorts to {010; 001; 111}. Policy 001, as the only split point, is written to the Sky-partition file.

Table 4

As shown in Table 4, in the shuffle phase of the third round the data are divided into two blocks, {000; 100; 010} and {001; 101; 011; 111}. In the Reduce phase, the smallest R value in each block serves as the block's key: the key of the first block {000; 100; 010} is 0.021, and the key of the second block {001; 101; 011; 111} is 0.111.
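The split-point selection and block assignment in this example can be sketched as follows. Two details are assumptions rather than statements from the text: split points are taken at evenly spaced positions in the sorted sample, and a policy whose R equals a split point's R goes to the later block. The R values are those listed in Table 5, and the function names are hypothetical.

```python
from bisect import bisect_right

# R values as listed in Table 5.
R = {"000": 0.021, "100": 0.095, "010": 0.098, "001": 0.111,
     "101": 0.490, "011": 0.514, "111": 1.0}


def pick_split_points(sample, r_values, t):
    """Sort the sample by R and take t - 1 evenly spaced split points."""
    ordered = sorted(sample, key=r_values.get)
    return [ordered[len(ordered) * k // t] for k in range(1, t)]


def partition(r_values, split_points):
    """Assign each policy to a block by comparing its R with the split keys."""
    keys = [r_values[s] for s in split_points]
    blocks = [[] for _ in range(len(keys) + 1)]
    for name in sorted(r_values, key=r_values.get):
        blocks[bisect_right(keys, r_values[name])].append(name)
    return blocks


splits = pick_split_points(["001", "010", "111"], R, t=2)
blocks = partition(R, splits)
block_keys = [min(R[name] for name in block) for block in blocks]
print(splits, blocks, block_keys)
```

Under these assumptions the sketch reproduces the two blocks and the two key values given in the example.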

{0.021; 0.1213} and {0.111; 0} are recorded in the HashMap for the first block and the second block, respectively.

As shown in Table 5, each policy is processed by screening the data records one by one against the HashMap in the Reduce phase.

Policy	Key	R	U	Min_u_lowKey	Min_u_Key_lowR	Skyline
000	0.021	0.021	0.1218	+	+	Y
100	0.021	0.095	0.1216	+	0.1218	Y
010	0.021	0.098	0.1213	+	0.1216	Y
001	0.111	0.111	0.0004	0.1213	+	Y
101	0.111	0.490	0.0003	0.1213	0.0004	Y
011	0.111	0.514	0.0000	0.1213	0.0003	Y
111	0.111	1	0	0.1213	0.0000	Y

Table 5

Take policy 000 as an example. First, compute the smallest U value among the blocks whose key is less than key(000) = 0.021: Min_u_lowKey(000) = +, where "+" means the value can be arbitrarily large, because no policy has a key less than 0.021. Second, compute the smallest U value among the policies whose key equals key(000) = 0.021 and whose R value is less than R(000) = 0.021: Min_u_Key_lowR(000) = +, again arbitrarily large, because policy 000 has the smallest R value in the first block. Finally, since U(000) = 0.1218 < Min_u_lowKey(000) = + and U(000) = 0.1218 < Min_u_Key_lowR(000) = +, policy 000 belongs to the Skyline set. The other six policies are processed in the same way, and Table 6 is the final Skyline set of policies.
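The screening rule can be replayed on the rows of Table 5 with a short Python sketch. It uses only the first six rows: the U values in the table are rounded, so policies 011 and 111 appear tied at zero and a strict comparison on the rounded figures would not be meaningful for the last row. The function name `screen` is hypothetical, and "+" is modelled as `math.inf`.

```python
import math

# (policy, block key, R, U) taken from the first six rows of Table 5.
ROWS = [
    ("000", 0.021, 0.021, 0.1218),
    ("100", 0.021, 0.095, 0.1216),
    ("010", 0.021, 0.098, 0.1213),
    ("001", 0.111, 0.111, 0.0004),
    ("101", 0.111, 0.490, 0.0003),
    ("011", 0.111, 0.514, 0.0000),
]


def screen(rows):
    """Return {policy: (Min_u_lowKey, Min_u_Key_lowR, in_skyline)}."""
    # HashMap stage: smallest U value per block key.
    block_min_u = {}
    for _, key, _, u in rows:
        block_min_u[key] = min(u, block_min_u.get(key, math.inf))

    results = {}
    for name, key, r, u in rows:
        # Smallest U over all blocks with a smaller key.
        low_key = min((m for k, m in block_min_u.items() if k < key),
                      default=math.inf)
        # Smallest U over policies in the same block with a smaller R.
        low_r = min((u2 for _, k2, r2, u2 in rows if k2 == key and r2 < r),
                    default=math.inf)
        results[name] = (low_key, low_r, u < low_key and u < low_r)
    return results


for name, row in screen(ROWS).items():
    print(name, row)
```

For these six rows the two computed columns agree with Table 5, and every policy passes the screening test.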

Table 6

Step S206: output {F, (R, U)} and the generalization corresponding to each policy.

Specifically, the policies shown in Table 6 and the generalization corresponding to each policy are output. The policies shown in Table 6 are the privacy policies recommended for the data table D.

For example, policy "000" is output, and its corresponding generalization is: Age in 42-44 is generalized to the value [42, 44], and Gender, whether Female or Male, is generalized to [Female, Male]. Likewise, the generalization corresponding to policy "111" is: Age in 42-44 is generalized to the values [42], [43], and [44], and Gender, whether Female or Male, is generalized to [Female] and [Male].
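Applying a policy string to Table 1 can be sketched as below. The sketch assumes that a bit equal to 1 separates neighbouring values and that the attribute order inside a policy string is Age, Gender, Race; `value_to_group`, `generalize`, `TABLE_1`, and `DOMAINS` are hypothetical names, while the record counts come from Table 1.

```python
from collections import Counter

# Table 1: (Age, Gender, Race) -> record count.
TABLE_1 = {
    (42, "Male", "White"): 465,
    (42, "Female", "White"): 168,
    (43, "Male", "White"): 468,
    (43, "Female", "White"): 163,
    (44, "Male", "White"): 444,
    (44, "Female", "White"): 150,
}
DOMAINS = [[42, 43, 44], ["Female", "Male"], ["White"]]


def value_to_group(values, bits):
    """Map each value to its generalized group; a '1' bit separates neighbours."""
    groups = [[values[0]]]
    for value, bit in zip(values[1:], bits):
        if bit == "1":
            groups.append([value])
        else:
            groups[-1].append(value)
    return {v: tuple(g) for g in groups for v in g}


def generalize(table, domains, policy):
    """Apply one policy string and return the merged equivalence classes."""
    mappings, start = [], 0
    for values in domains:
        width = len(values) - 1
        mappings.append(value_to_group(values, policy[start:start + width]))
        start += width
    merged = Counter()
    for record, count in table.items():
        key = tuple(m[v] for m, v in zip(mappings, record))
        merged[key] += count
    return dict(merged)


print(generalize(TABLE_1, DOMAINS, "000"))
print(generalize(TABLE_1, DOMAINS, "001"))
```

Policy "000" collapses the table into a single equivalence class, while policy "111" leaves the six original classes unchanged.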

Scheme provided in an embodiment of the present invention can satisfy multi-level demand of the user to secret protection and availability of data, Can cope with the case where quantity of privacy policy exponentially increases simultaneously, and can under the premise of guaranteeing data precision, Substantially reduce the scale of candidate policy collection.

Professional should further appreciate that, described in conjunction with the examples disclosed in the embodiments of the present disclosure Unit and algorithm steps, can be realized with electronic hardware, computer software, or a combination of the two, hard in order to clearly demonstrate The interchangeability of part and software generally describes each exemplary composition and step according to function in the above description. These functions are implemented in hardware or software actually, the specific application and design constraint depending on technical solution. Professional technician can use different methods to achieve the described function each specific application, but this realization It is not considered that exceeding scope of the present application.

Those of ordinary skill in the art will appreciate that the methods of the above embodiments may be carried out by a program instructing a processor. The program may be stored in a computer-readable storage medium, the storage medium being a non-transitory medium such as random access memory, read-only memory, flash memory, a hard disk, a solid-state drive, magnetic tape, a floppy disk, an optical disc, or any combination thereof.

The above are merely preferred specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A Skyline-based data generalization method applied to the field of privacy protection, characterized by comprising:

Step S101: process the data table according to the data-publishing privacy protection standard 10-anonymity, and record the resulting re-identification risk of the strategy as a threshold; determine a policy space according to the value domains of the quasi-identifier attributes of the data table and the threshold, wherein the re-identification risk of every strategy included in the policy space is not greater than the threshold; the data table contains users' private information;

Step S102: filter the policy space using an approximate Skyline to obtain a candidate policy space; the approximate Skyline enlarges the dominance region of the Skyline by a preset ratio to produce an approximate dominance region, the preset ratio being determined by a preset security parameter; the dominance region is defined as follows: if the re-identification risk and the information loss of strategy A are both greater than the re-identification risk and the information loss of strategy B, then strategy A lies in the dominance region of strategy B;

Step S102 comprises the following sub-steps:

Iteratively bisect the value domains of the quasi-identifier attributes of the data table until the value domains can no longer be divided; each iteration generates one strategy, and the set of strategies generated over the iterations is the initial strategy set;

Filter the policy space {S, (R, U)} using the approximate Skyline on the basis of the initial strategy set: strategies in the policy space {S, (R, U)} that are not dominated by the initial strategy set are added to the initial strategy set, which is updated accordingly; strategies in the policy space {S, (R, U)} that are dominated by the initial strategy set are filtered out; and strategies in the initial strategy set that are dominated by strategies in {S, (R, U)} are likewise filtered out;

Filter the strategies in the policy space {S, (R, U)} one by one using the approximate Skyline; when the policy space {S, (R, U)} has been filtered until it is empty, the initial strategy set as updated at that point is the candidate policy space {G, (R, U)}; the re-identification risk R and the information loss U are each determined by the following formulas:

where P denotes the record distribution of the equivalence classes, P' denotes the record distribution of the equivalence classes after generalization, |e'| is the number of records contained in equivalence class e', all equivalence classes e' together constitute P', N denotes the number of records in the entire data table, N ≥ 1, |e^*| is the average number of records contained in the new equivalence classes produced by generalizing equivalence class e, and an equivalence class is the set of records in the data table that have identical quasi-identifier attribute values;

Step S103: perform a Skyline operation on the candidate policy space to obtain a recommended policy space, so that the data of the data table are generalized using the generalization strategies recommended for the data table, thereby protecting users' private information; the recommended policy space includes the strategies recommended for the data table together with their corresponding re-identification risk and information loss, and each strategy included in the recommended policy space corresponds to one generalization operation on the information in the data table.

2. The data generalization method according to claim 1, characterized in that step S101 comprises the following sub-steps:

Enumerate 2^L strategies according to the value domains of the quasi-identifier attributes of the data table, where L = r_1 - 1 + … + r_n - 1; the 2^L strategies correspond to n attributes, the i-th attribute has r_i values, and the r_i values correspond to 2^(r_i - 1) strategies, 1 ≤ i ≤ n, n ≥ 1;

Determine the re-identification risk R of each of the 2^L strategies, and filter out those of the 2^L strategies whose R value is greater than the threshold T, thereby determining a strategy set;

Determine the information loss U of each strategy in the strategy set, and determine the policy space {S, (R, U)} from the R value and U value of each strategy, R ≥ 0, U ≥ 0, T ≥ 0.

3. The data generalization method according to claim 1, characterized in that step S103 comprises the following sub-steps:

Partition the candidate policy space {G, (R, U)} into blocks, and sort the data corresponding to each block of the policy space on the node where that data resides;

Determine the smallest re-identification risk R value and the smallest information loss U value in each block of the policy space;

Determine the strategies of the Skyline set and the recommended policy space {F, (R, U)} according to the smallest R value and the smallest U value in each block of the policy space.

4. The data generalization method according to claim 1, characterized in that the ε-approximate Skyline enlarges the dominance region of the Skyline by a preset ratio to produce an approximate dominance region, the preset ratio being determined by ε, specifically:

If the R and U values of strategy A are at most ε times larger than the R and U values of strategy B, then strategy A approximately dominates strategy B with precision ε, denoted A <_ε B.

5. The data generalization method according to claim 3, characterized in that step S103-1 specifically comprises:

Create a partition file that records the data distribution of each block of the policy space; draw a sample from the candidate policy space {G, (R, U)} at sampling rate ρ, sort the sample, and extract t-1 pivots from it; partition the candidate policy space {G, (R, U)} into blocks according to the pivots, each block of the policy space corresponding to one data block;

where the sampling rate ρ is determined by t and |G|, t is the number of data blocks, and |G| is the number of strategies.

6. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the Skyline-based data generalization method applied to the field of privacy protection according to any one of claims 1 to 5.
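
For illustration only, the ε-approximate domination test of claim 4 and the one-by-one filtering of claim 1 can be sketched in Python as follows. The sketch rests on an assumption that is not fixed by the claim wording: "at most ε times larger" is read here as each coordinate of A being no greater than (1 + ε) times the corresponding coordinate of B. The names `eps_dominates` and `eps_filter` and the sample points are hypothetical.

```python
def eps_dominates(a, b, eps):
    """True if point a = (R, U) approximately dominates point b.

    Assumption: a approximately dominates b when each coordinate of a
    is at most (1 + eps) times the matching coordinate of b.
    """
    return a[0] <= (1 + eps) * b[0] and a[1] <= (1 + eps) * b[1]


def eps_filter(strategies, eps):
    """Greedy one-by-one filtering of (name, (R, U)) pairs."""
    kept = []
    for name, point in strategies:
        # Drop the incoming strategy if a kept one approximately dominates it.
        if any(eps_dominates(p, point, eps) for _, p in kept):
            continue
        # Otherwise remove kept strategies that it approximately dominates.
        kept = [(n, p) for n, p in kept if not eps_dominates(point, p, eps)]
        kept.append((name, point))
    return kept


# Hypothetical (R, U) points, not taken from the description.
sample = [
    ("a", (0.10, 0.50)),
    ("b", (0.105, 0.51)),
    ("c", (0.30, 0.20)),
    ("d", (0.05, 0.40)),
]
candidates = [name for name, _ in eps_filter(sample, eps=0.1)]
```

In this hypothetical run, "b" is discarded because "a" approximately dominates it, and "a" is later removed when "d" arrives, leaving "c" and "d" as candidates. A larger ε discards more strategies and yields a smaller candidate set.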