CN107194278B - A Skyline-based data generalization method - Google Patents
A Skyline-based data generalization method
- Publication number
- CN107194278B (application CN201710339575.XA)
- Authority
- CN
- China
- Prior art keywords
- strategy
- policy
- data
- value
- skyline
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Skyline-based data generalization method, comprising: processing a data table according to the data-publishing privacy protection standard 10-anonymity and recording the re-identification risk R of the resulting strategy as a threshold T; determining a policy space {S, (R, U)} from the value domains of the quasi-identifier attributes and the threshold T, where the R value of every strategy in {S, (R, U)} does not exceed T; filtering the policy space {S, (R, U)} with an ε-approximate Skyline to obtain a candidate policy space {G, (R, U)}; and performing a Skyline computation on the candidate policy space {G, (R, U)} to obtain a recommended policy space {F, (R, U)}, which is the privacy policy space recommended for the data table. By enumerating the full policy space, the invention improves the accuracy of privacy protection policy recommendation and covers a wide range of the R-U space, meeting the multi-level needs of users. Setting the threshold T filters out strategies whose privacy protection falls short of the requirement and shortens the time needed to generate the policy space, and the ε-approximate Skyline filtering further reduces the size of the candidate policy space.
Description
Technical field
The invention belongs to the field of privacy-preserving data publishing, and more particularly relates to a Skyline-based data generalization method.
Background technique
In the digital age, exchanging and publishing data between parties (for example governments, enterprises and individuals) is becoming more and more important. For example, hospitals in California are generally required to submit certain medical data about rehabilitation patients. Such data contain sensitive information, and publishing them directly would reveal personal privacy. L. Sweeney's example of a "linking attack" illustrates this: by joining a patient information table and a voter information table on the attributes (Age, Sex, Zipcode), one can determine that the person named Ahmed has influenza, so the patient's privacy is leaked. This way of publishing data is unsafe, which is why privacy-preserving data publishing was proposed. It requires maximizing the utility of the data while protecting privacy.
To protect the privacy of published data, two kinds of privacy protection models have been proposed. One controls the re-identification risk, for example K-anonymity. The other is rule-based, for example the Safe Harbor policy. K-anonymity uses generalization, permutation and perturbation so that every equivalence class contains at least K records, where an equivalence class is the set of records sharing the same quasi-identifier. A quasi-identifier cannot by itself identify personal sensitive information, but it can do so when joined with another data table that contains identity attributes. However, the strategies K-anonymity provides cover only a small part of the R-U space and cannot meet the multi-level needs of users, where R is the risk and U is the information loss. Safe Harbor, put forward under the Health Insurance Portability and Accountability Act (HIPAA), requires removing specified attributes from the data table, such as name, telephone number and email address. In most cases, however, Safe Harbor strategies perform poorly on both risk and information loss.
People increasingly seek a balance between privacy protection and data utility. To meet the multi-level needs of users, screening out good privacy strategies is becoming more and more important. To prevent sensitive information from being re-identified, generalization is applied to the quasi-identifiers. Generalization means replacing a specific attribute value with a broader range, that is, describing the data in a more summarized and more abstract way. For example, age 21 is generalized to [20-30]. Current solutions to this problem are mainly heuristic algorithms. One is a binary search algorithm based on Hamming distance, in which the choice of each bit is determined by a weight. This algorithm can quickly recommend some good strategies, but its recommendation accuracy is low and the range of generalization strategies it covers is small. The other is a probability-based heuristic search algorithm, whose search range is enlarged to cover the entire R-U space, but which follows a certain path and each time selects a certain number of strategies for Skyline processing. The Skyline operation filters out of the candidate strategies a series of "interesting" strategies, namely those not "dominated" by any other strategy: the R value and U value of a retained strategy must not both be larger than those of another retained strategy. Because the processing is approximate, the screening error depends on the initial strategy set and on the path selection. The algorithm also converges slowly.
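The Skyline operation described above can be illustrated with a minimal Python sketch. It assumes each strategy is represented by an (R, U) pair and that "dominated" means strictly worse on both measures; the function names and the three sample points are hypothetical and serve only as an illustration.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # (R, U): re-identification risk, information loss


def dominates(a: Point, b: Point) -> bool:
    """True if a is strictly better (smaller) than b on both R and U."""
    return a[0] < b[0] and a[1] < b[1]


def skyline(strategies: Dict[str, Point]) -> Dict[str, Point]:
    """Keep the strategies that no other strategy dominates (naive O(n^2) scan)."""
    return {
        name: p
        for name, p in strategies.items()
        if not any(dominates(q, p) for other, q in strategies.items() if other != name)
    }


# Made-up strategies for illustration: "c" is worse than "a" on both R and U.
example = {"a": (0.1, 0.2), "b": (0.3, 0.05), "c": (0.4, 0.3)}
print(sorted(skyline(example)))  # ['a', 'b']
```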
In summary, existing privacy protection methods cover only a small part of the R-U space, cannot meet the multi-level needs of users, perform poorly on both risk and information loss, have low accuracy and a small range of generalization strategies, and converge slowly.
Summary of the invention
In view of the drawbacks of the prior art, the object of the invention is to solve the technical problems that existing privacy protection methods cover only a small part of the R-U space, cannot meet the multi-level needs of users, perform poorly on both the risk R and the information loss U, have low accuracy and a small range of generalization strategies, and converge slowly.
To achieve the above object, an embodiment of the invention provides a Skyline-based data generalization method comprising the following steps.
Step S101: process the data table according to the data-publishing privacy protection standard 10-anonymity, record the re-identification risk R of the resulting strategy as the threshold T, and determine the policy space {S, (R, U)} from the value domains of the quasi-identifier attributes and the threshold T. U is the information loss of a strategy, and the R value of every strategy in {S, (R, U)} does not exceed the threshold T, with R ≥ 0, U ≥ 0 and T ≥ 0.
Step S102: filter the policy space {S, (R, U)} with an approximate Skyline to obtain the candidate policy space {G, (R, U)}. ε is a preset security parameter, ε ≥ 1. The approximate Skyline enlarges the dominance region of the Skyline by a preset ratio to produce an approximate dominance region; the preset ratio is determined by ε, and the approximate Skyline is also written "ε-approximate Skyline". The dominance region is defined as follows: if the R value and U value of strategy A are both larger than the R value and U value of strategy B, then strategy A lies in the dominance region of strategy B.
Step S103: perform the Skyline operation on the candidate policy space {G, (R, U)} to obtain the recommended policy space {F, (R, U)}, which contains the privacy policies recommended for the data table. Each strategy in the recommended policy space corresponds to one generalization operation on the information in the data table.
Specifically, privacy policies are recommended for the data table, and each privacy policy corresponds to one way of generalizing the information in the data table. Generalizing the data table according to the better strategies recommended to the user achieves fine-grained protection of the privacy in the data table.
By enumerating the full policy space, the embodiment of the invention improves the accuracy of policy recommendation and covers a wide range of the R-U space, meeting the multi-level needs of users. Setting the threshold T on the re-identification risk R filters out strategies whose privacy protection falls short of the requirement, which shortens the time needed to generate the policy space and also reduces the size of the candidate policy space. The ε-approximate Skyline filtering further reduces the size of the candidate policy space, and the security parameter ε adjusts the balance between privacy protection and data utility.
In an optional embodiment, step S101 includes the following sub-steps. Step S101-1: enumerate 2^L strategies from the value domains of the quasi-identifier attributes of the data table, where L = (r_1 - 1) + … + (r_n - 1); the 2^L strategies cover n attributes, n ≥ 1, the i-th attribute has r_i values, and these r_i values correspond to 2^(r_i - 1) strategies, 1 ≤ i ≤ n. Step S101-2: determine the R value of each of the 2^L strategies, filter out the strategies whose R value is greater than the threshold T, and thereby determine the strategy set. Step S101-3: determine the information loss U of each strategy in the strategy set, and determine the policy space {S, (R, U)} from the R and U values of each strategy.
Specifically, enumerating the full policy space improves the accuracy of policy recommendation, because the recommendation is made over the full space of 2^L strategies.
In an optional embodiment, step S102 includes the following sub-steps. Step S102-1: repeatedly bisect the value domains of the quasi-identifier attributes of the data table until they can no longer be divided; each iteration produces one strategy, and the strategies produced over all iterations form the initial strategy set. Step S102-2: filter the policy space {S, (R, U)} against the initial strategy set using the ε-approximate Skyline. Strategies in {S, (R, U)} that are not dominated by the initial strategy set are added to it, so that the initial strategy set is updated; strategies in {S, (R, U)} that are dominated by the initial strategy set are filtered out; and strategies in the initial strategy set that are dominated by strategies in {S, (R, U)} are filtered out as well. Step S102-3: apply step S102-2 to the strategies in the policy space {S, (R, U)} one by one; when the policy space {S, (R, U)} has been filtered down to empty, the updated initial strategy set is the candidate policy space {G, (R, U)}.
Specifically, the strategies produced by the iteration have good dominating power, so an initial strategy set built from them filters the policy space S effectively, improves the efficiency of the approximate Skyline filtering, and reduces the size of the candidate policy space {G, (R, U)}.
In an optional embodiment, step S103 includes the following sub-steps:
Step S103-1: partition the candidate policy space {G, (R, U)} into blocks, and sort the data of each block on the node where it resides.
Create a partition file, Sky-partition, that records how the data of each block is distributed. Draw a sample from the candidate policy space {G, (R, U)} at sampling rate ρ, sort the sample, and extract t - 1 elements as split points. Partition the candidate policy space {G, (R, U)} at these split points, so that each block of the policy space corresponds to one data block;
where the sampling rate ρ is defined in terms of t, the number of data blocks, and |G|, the number of strategies.
Step S103-2: determine the smallest R value and the smallest U value in each block. Step S103-3: from the smallest R value and the smallest U value of each block, determine the strategies of the Skyline set and hence the recommended policy space {F, (R, U)}.
In an optional embodiment, the re-identification risk R and the information loss U are each determined by a formula, where P denotes the record distribution over the equivalence classes, P' denotes the record distribution over the equivalence classes after generalization, |e'| is the number of records in equivalence class e', all equivalence classes e' together make up P', N is the number of records in the whole data table, N ≥ 1, and |e*| is the average number of records in the new equivalence classes produced when equivalence class e is generalized. An equivalence class is the set of records in the data table that share the same quasi-identifier attribute values.
Specifically, setting the threshold T filters out strategies whose privacy protection does not reach the 10-anonymity requirement, which greatly shortens the time needed to generate the policy space S and also reduces its size.
In an optional embodiment, the ε-approximate Skyline enlarges the dominance region of the Skyline by a preset ratio to produce an approximate dominance region, the preset ratio being determined by ε. Specifically: if the R value and U value of strategy A are at most ε times the R value and U value of strategy B, then strategy A approximately dominates strategy B with precision ε, written A <_ε B.
Specifically, the approximate Skyline filtering further reduces the size of the candidate policy space {G, (R, U)}, and the security parameter ε adjusts the balance between the R value and the U value, where the R value reflects privacy protection and the U value reflects data utility.
In a second aspect, an embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data generalization method of the first aspect above.
In general, compared with the prior art, the technical solutions conceived by the invention have the following beneficial effects:
1. By enumerating the full policy space, the embodiment of the invention improves the accuracy of policy recommendation and covers a wide range of the R-U space, meeting the multi-level needs of users.
2. By setting the threshold T on the re-identification risk R, the embodiment of the invention filters out strategies whose privacy protection falls short of the requirement, which shortens the time needed to generate the policy space and also reduces the size of the candidate policy space.
3. The embodiment of the invention uses ε-approximate Skyline filtering to further reduce the size of the candidate policy space, and uses the security parameter ε to adjust the balance between privacy protection and data utility.
Detailed description of the invention
Fig. 1 is a schematic flowchart of a Skyline-based data generalization method provided by an embodiment of the invention;
Fig. 2 is a schematic flowchart of another Skyline-based data generalization method provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the dominance regions corresponding to an ε-approximate Skyline provided by an embodiment of the invention.
Specific embodiment
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Fig. 1 is a schematic flowchart of a Skyline-based data generalization method provided by an embodiment of the invention, comprising steps S101 to S103.
Step S101: process the data table according to the data-publishing privacy protection standard 10-anonymity, record the re-identification risk R of the resulting strategy as the threshold T, and determine the policy space {S, (R, U)} from the value domains of the quasi-identifier attributes and the threshold T. U is the information loss of a strategy, and the R value of every strategy in {S, (R, U)} does not exceed the threshold T.
Specifically, step S101 includes the following sub-steps:
Step S101-1: enumerate 2^L strategies from the value domains of the quasi-identifier attributes of the data table, where L = (r_1 - 1) + … + (r_n - 1); the 2^L strategies cover n attributes, the i-th attribute has r_i values, and these r_i values correspond to 2^(r_i - 1) strategies, 1 ≤ i ≤ n.
Specifically, all strategies are enumerated from the value domains of the quasi-identifier attributes of the data table, as follows. Let Q = {Q_1, …, Q_n} be a set of quasi-identifier attributes, where the value domain of attribute Q_i has r_i values. A de-identification strategy is represented by a binary string. Let p_i denote a partition of the value domain of attribute Q_i; then the set {p_1, …, p_n} is a de-identification strategy. Here p_i = {I_1, …, I_(r_i - 1)}, where I_j indicates whether the j-th value and the (j+1)-th value of attribute Q_i are separated. The length of the string is therefore L = (r_1 - 1) + … + (r_n - 1), and all 2^L strategies are enumerated.
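The enumeration in step S101-1 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the convention that a bit value of 1 means "separated" is an assumption inferred from the worked example, in which strategy 111 keeps every value distinct. The domains are those of the Table 1 example.

```python
from itertools import product
from typing import Iterator, List, Sequence


def policy_length(domains: Sequence[Sequence]) -> int:
    """L = (r_1 - 1) + ... + (r_n - 1): one bit per gap between adjacent values."""
    return sum(len(values) - 1 for values in domains)


def enumerate_policies(domains: Sequence[Sequence]) -> Iterator[str]:
    """Yield all 2^L bit strings (assumed: bit j = 1 separates value j from value j+1)."""
    for bits in product("01", repeat=policy_length(domains)):
        yield "".join(bits)


# Quasi-identifier domains of the Table 1 example: Age, Gender, Race.
domains = [[42, 43, 44], ["Female", "Male"], ["White"]]
policies: List[str] = list(enumerate_policies(domains))
print(policy_length(domains), len(policies), policies[0], policies[-1])  # 3 8 000 111
```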
In one example, before step S101-1 is executed, the method further includes: applying 10-anonymity processing to the value domains of the quasi-identifier attributes of the data table, and taking the re-identification risk R that corresponds to the 10-anonymity processing as the threshold T. The threshold T is used to protect the privacy of the data table. The 10-anonymity processing is one of the 2^L strategies, and it is chosen according to existing data-publishing privacy protection standards.
Step S101-2: determine the R value of each of the 2^L strategies, filter out the strategies whose R value is greater than the threshold T, and thereby determine the strategy set.
Step S101-3: determine the information loss U of each strategy in the strategy set, and determine the policy space {S, (R, U)} from the R and U values of each strategy.
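Steps S101-2 and S101-3 amount to a threshold filter followed by a loss computation. The Python sketch below assumes the risk and loss functions are supplied by the caller, since formulas (1) and (2) are not reproduced in this text. The function name is hypothetical, the R and U values for three strategies are copied from Table 2, and the threshold 0.5 is a made-up value chosen only to show the filter removing one strategy.

```python
from typing import Callable, Dict, Iterable, Tuple


def build_policy_space(
    policies: Iterable[str],
    risk: Callable[[str], float],
    loss: Callable[[str], float],
    threshold: float,
) -> Dict[str, Tuple[float, float]]:
    """Return {strategy: (R, U)} for the strategies whose R does not exceed the threshold."""
    space = {}
    for policy in policies:
        r = risk(policy)
        if r > threshold:  # S101-2: privacy protection falls short, drop the strategy
            continue
        space[policy] = (r, loss(policy))  # S101-3: U is computed only for survivors
    return space


# R and U values of three strategies from Table 2; T = 0.5 is a hypothetical threshold.
R = {"000": 0.021, "101": 0.490, "111": 1.0}
U = {"000": 0.1218, "101": 0.0003, "111": 0.0}
space = build_policy_space(R, R.__getitem__, U.__getitem__, threshold=0.5)
print(sorted(space))  # ['000', '101']
```

Computing U only after the R filter mirrors the order of the sub-steps: strategies that fail the threshold never incur the cost of the loss computation.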
Step S102: filter the policy space {S, (R, U)} with the ε-approximate Skyline to obtain the candidate policy space {G, (R, U)}. ε is a preset security parameter. The ε-approximate Skyline enlarges the dominance region of the Skyline by a preset ratio to produce an approximate dominance region, the preset ratio being determined by ε. The dominance region is defined as follows: if the R value and U value of strategy A are both larger than the R value and U value of strategy B, then strategy A lies in the dominance region of strategy B.
In an optional embodiment, the ε-approximate Skyline enlarges the dominance region of the Skyline by a preset ratio to produce an approximate dominance region, the preset ratio being determined by ε. Specifically: if the R value and U value of strategy A are at most ε times the R value and U value of strategy B, then strategy A approximately dominates strategy B with precision ε, written A <_ε B.
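The approximate dominance test can be written as a small predicate. In the Python sketch below the function name is hypothetical, and the use of strict inequalities is an assumption made so that ε = 1 reduces to the ordinary dominance region defined in step S102; the text itself says "at most ε times", so a non-strict comparison is an equally plausible reading.

```python
from typing import Tuple

Point = Tuple[float, float]  # (R, U)


def eps_dominates(a: Point, b: Point, eps: float = 1.0) -> bool:
    """True if a approximately dominates b with precision eps (eps >= 1).

    Assumption: strict inequalities, so that eps = 1 gives ordinary dominance.
    """
    if eps < 1:
        raise ValueError("eps must be >= 1")
    return a[0] < eps * b[0] and a[1] < eps * b[1]


a, b = (0.10, 0.12), (0.09, 0.11)
print(eps_dominates(a, b, eps=1.0))  # False: a is worse than b on both measures
print(eps_dominates(a, b, eps=1.2))  # True: a is within a factor of 1.2 of b on both
```

A larger ε lets a strategy dominate others that are slightly better than it, which is what shrinks the candidate set.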
Step S103: perform the Skyline operation on the candidate policy space {G, (R, U)} to obtain the recommended policy space {F, (R, U)}, which contains the privacy policies recommended for the data table. Each strategy in the recommended policy space {F, (R, U)} corresponds to one generalization operation on the information in the data table.
Specifically, step S102 includes the following sub-steps:
Step S102-1: repeatedly bisect the value domains of the quasi-identifier attributes of the data table until they can no longer be divided; each iteration produces one strategy, and the strategies produced over all iterations form the initial strategy set.
Step S102-2: filter the policy space {S, (R, U)} against the initial strategy set using the ε-approximate Skyline. Strategies in {S, (R, U)} that are not dominated by the initial strategy set are added to it, so that the initial strategy set is updated; strategies in {S, (R, U)} that are dominated by the initial strategy set are filtered out; and strategies in the initial strategy set that are dominated by strategies in {S, (R, U)} are filtered out as well.
Step S102-3: apply step S102-2 to the strategies in the policy space {S, (R, U)} one by one; when the policy space {S, (R, U)} has been filtered down to empty, the updated initial strategy set is the candidate policy space {G, (R, U)}.
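Sub-steps S102-2 and S102-3 can be sketched as a single pass over the policy space. The Python code below is one possible reading, not the patented implementation: the function names are hypothetical, the strict comparison in `eps_dominates` is an assumption, and the result for ε > 1 depends on the order in which strategies are visited. The input uses the R and U values of the worked example, with R(011) taken as 0.514 as listed in Table 5, and the initial set {100, 101, 110} named in the example.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]  # (R, U)


def eps_dominates(a: Point, b: Point, eps: float) -> bool:
    # Assumption: strict comparison, so eps = 1 is ordinary dominance.
    return a[0] < eps * b[0] and a[1] < eps * b[1]


def eps_skyline_filter(
    space: Dict[str, Point], initial: Iterable[str], eps: float
) -> Dict[str, Point]:
    """Filter the policy space S against a candidate set G that grows as it goes."""
    candidates = {name: space[name] for name in initial}
    for name, point in space.items():
        if name in candidates:
            continue
        if any(eps_dominates(c, point, eps) for c in candidates.values()):
            continue  # dominated by an existing candidate: filter it out
        # The new strategy joins G and evicts the candidates it dominates.
        candidates = {
            k: v for k, v in candidates.items() if not eps_dominates(point, v, eps)
        }
        candidates[name] = point
    return candidates


space = {
    "000": (0.021, 0.1218), "001": (0.111, 0.0004), "010": (0.098, 0.1213),
    "011": (0.514, 0.0000), "100": (0.095, 0.1216), "101": (0.490, 0.0003),
    "110": (0.170, 0.1213), "111": (1.0, 0.0),
}
G = eps_skyline_filter(space, initial=["100", "101", "110"], eps=1.0)
print(sorted(G))  # ['000', '001', '010', '011', '100', '101', '111']
```

With ε = 1 the only strategy removed is 110, which is evicted once 001 joins the candidate set.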
Specifically, step S103 includes the following sub-steps:
Step S103-1: partition the candidate policy space {G, (R, U)} into blocks, and sort the data of each block on the node where it resides.
Create a partition file, Sky-partition, that records how the data of each block is distributed. Draw a sample from the candidate policy space {G, (R, U)} at sampling rate ρ, sort the sample, and extract t - 1 elements as split points. Partition the candidate policy space {G, (R, U)} at these split points, so that each block of the policy space corresponds to one data block;
where the sampling rate ρ is defined in terms of t, the number of data blocks, and |G|, the number of strategies.
Step S103-2: determine the smallest R value and the smallest U value in each block.
Step S103-3: from the smallest R value and the smallest U value of each block, determine the strategies of the Skyline set and hence the recommended policy space {F, (R, U)}.
In one example, step S103-3 specifically includes: first compute Min_u_lowKey(p) and Min_u_Key_lowR(p) for strategy P, where Min_u_lowKey(p) is the smallest U value among the data blocks that precede the block containing P, and Min_u_Key_lowR(p) is the smallest U value among the strategies that precede P within its own data block. Then compare them with the U value of P, U(p): strategy P is output only if U(p) < Min_u_lowKey(p) and U(p) < Min_u_Key_lowR(p).
Step 4: perform the Skyline operation on the policy space {G, (R, U)} to obtain and output the recommended strategy set F.
Specifically, steps S103-1 to S103-3 can be divided into the following four stages:
Sky-partition creation stage: create a partition file, called Sky-partition, that records how the data is distributed across the nodes. Draw a sample at sampling rate ρ, sort it, and extract t - 1 elements as split points.
Local sorting stage: in the Reduce stage, divide the data file at the split points; each node sorts the data it holds.
HashMap creation stage: create a table, called HashMap, that maps each node to its smallest R value and smallest U value.
Global screening stage: in the Reduce stage, screen the data records one by one against the HashMap, and output the strategies of the Skyline set together with their R and U values.
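The four stages can be imitated in a single process to show how the screening test works. The Python sketch below is only an illustration under stated assumptions: there is no MapReduce, the split points are passed in directly instead of being sampled, and the function and variable names are hypothetical. The input uses the R and U values listed in Table 5 with the split point 0.111 from the worked example.

```python
from bisect import bisect_right
from math import inf
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (R, U)


def block_skyline(
    candidates: Dict[str, Point], split_points: Sequence[float]
) -> List[str]:
    """Single-process imitation of the partition, HashMap and screening stages."""
    splits = sorted(split_points)
    # Partition: the block index is the number of split points <= R.
    blocks: List[List[Tuple[str, Point]]] = [[] for _ in range(len(splits) + 1)]
    for name, point in candidates.items():
        blocks[bisect_right(splits, point[0])].append((name, point))
    # Local sort: each block orders its strategies by R.
    for block in blocks:
        block.sort(key=lambda item: item[1][0])
    # HashMap: smallest U per block (the smallest R is the first entry after sorting).
    min_u = [min((p[1] for _, p in block), default=inf) for block in blocks]
    # Global screening: compare U against earlier blocks and earlier strategies.
    result = []
    for i, block in enumerate(blocks):
        min_u_lower_key = min(min_u[:i], default=inf)
        min_u_key_lower_r = inf
        for name, (_, u) in block:
            if u < min_u_lower_key and u < min_u_key_lower_r:
                result.append(name)
            min_u_key_lower_r = min(min_u_key_lower_r, u)
    return result


table5 = {
    "000": (0.021, 0.1218), "100": (0.095, 0.1216), "010": (0.098, 0.1213),
    "001": (0.111, 0.0004), "101": (0.490, 0.0003), "011": (0.514, 0.0000),
    "111": (1.0, 0.0),
}
print(block_skyline(table5, split_points=[0.111]))
```

On these rounded values the sketch outputs every strategy except 111. Strategies 011 and 111 tie at U = 0 at the printed precision, so the strict comparison drops 111, whereas Table 5 marks it as a Skyline member; the unrounded U values presumably differ.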
In an optional example, the re-identification risk R and the information loss U are determined by formula (1) and formula (2), respectively, where P denotes the record distribution over the equivalence classes, P' denotes the record distribution over the equivalence classes after generalization, |e'| is the number of records in equivalence class e', all equivalence classes e' together make up P', N is the number of records in the whole data table, and |e*| is the average number of records in the new equivalence classes produced when equivalence class e is generalized. An equivalence class is the set of records in the data table that share the same quasi-identifier attribute values.
By enumerating the full policy space, the embodiment of the invention improves the accuracy of policy recommendation and covers a wide range of the R-U space, meeting the multi-level needs of users. Setting the threshold T on the re-identification risk R filters out strategies whose privacy protection falls short of the requirement, which shortens the time needed to generate the policy space and also reduces the size of the candidate policy space. The ε-approximate Skyline filtering further reduces the size of the candidate policy space, and the security parameter ε adjusts the balance between privacy protection and data utility.
Fig. 2 is a schematic flowchart of another Skyline-based data generalization method provided by an embodiment of the invention, comprising steps S201 to S206.
Step S201: input the data table D and the parameter ε.
Input the data table D and the security parameter ε, where ε is a user-defined security parameter, ε ≥ 1. For example, Table 1 is an (original) data table provided by an embodiment of the invention, as shown below:
Equivalence class (Age|Gender|Race) | Number of records |
42|Male|White | 465 |
42|Female|White | 168 |
43|Male|White | 468 |
43|Female|White | 163 |
44|Male|White | 444 |
44|Female|White | 150 |
Table 1
Here the attributes Age, Gender and Race make up the quasi-identifier attributes: Age takes values in [42, 44], Gender takes values in {Female, Male}, and Race takes the value {White}.
The set of records in the data table that share the same quasi-identifier attribute values is an equivalence class. For example, in Table 1 the equivalence class "42|Male|White" has 465 records.
Step S202: compute the privacy protection threshold T.
Apply 10-anonymity processing to the data table D and compute its R value with formula (1); this R value is the privacy protection threshold T.
Step S203: enumerate all strategies, filter out the enumerated strategies whose R value is greater than the threshold T, compute the R and U values of the remaining strategies, and generate the policy space {S, (R, U)}.
From the value domains of the quasi-identifier attributes in Table 1, the strategy length is L = 2 + 1 = 3. According to Table 1 the number of attributes n is 3: Age has 3 values, Gender has 2 values and Race has 1 value, so L = (r_1 - 1) + … + (r_n - 1) = (3 - 1) + (2 - 1) + (1 - 1) = 3. Specifically, since Age takes the values 42, 43 or 44, it can be generalized in four ways: [42-44]; [42-43], [44]; [42], [43-44]; and [42], [43], [44]. Two bits are enough to represent these four (2^2) generalizations of Age. Similarly, one bit represents the two (2^1) generalizations of Gender, and zero bits represent Race, which has only one (2^0) generalization.
The 2^L strategies cover n attributes; the i-th attribute has r_i values, and these r_i values correspond to 2^(r_i - 1) strategies, 1 ≤ i ≤ n.
All 8 strategies are enumerated in Table 2. The R value and U value of each strategy are computed with formula (1) and formula (2), respectively.
Strategy | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
R | 0.021 | 0.111 | 0.098 | 0.514 | 0.095 | 0.490 | 0.170 | 1 |
U | 0.1218 | 0.0004 | 0.1213 | 0.0000 | 0.1216 | 0.0003 | 0.1213 | 0 |
Table 2
In one example, the R value and U value of strategy 000 are computed with formula (1) and formula (2).
The remaining seven strategies are computed in the same way, giving the results in Table 2.
Step S204: filter the policy space {S, (R, U)} with the ε-approximate Skyline to obtain the candidate policy space {G, (R, U)}.
In one example, Fig. 3 is a schematic diagram of the dominance regions corresponding to an ε-approximate Skyline provided by an embodiment of the invention. As shown in Fig. 3, strategy 1, strategy 2 and strategy 3 form the initial strategy set. One marked region in the figure is the dominance region of strategies 1, 2 and 3, and a second marked region is their approximate dominance region under ε-approximate Skyline filtering.
Table 3 is the candidate strategy set {G, (R, U)} determined when ε is 1.
Strategy | 000 | 001 | 010 | 011 | 100 | 101 | 111 |
R | 0.021 | 0.111 | 0.098 | 0.514 | 0.095 | 0.490 | 1 |
U | 0.1218 | 0.0004 | 0.1213 | 0.0000 | 0.1216 | 0.0003 | 0 |
Table 3
Here the initial strategy set consists of the strategies shown in bold in Table 2, for example strategies "100", "101" and "110". Every strategy in Table 2 is compared with the strategies in the initial strategy set, and the initial strategy set is updated as the comparison proceeds. Iterating in this way, strategies 000, 001, 010, 011, 100 and 101 are retained; strategy 110 is filtered out because R(110) > R(001) and U(110) > U(001); and strategy 111 is retained. The filtered result is shown in Table 3 above.
Step S205: perform the Skyline operation on the candidate policy space {G, (R, U)} to obtain the recommended strategy set F.
Specifically, Table 4 is the data G to be processed. Assume the number of Reducers is 2; the sampling rate ρ is then determined accordingly. Assume the sample is {001; 010; 111}, which sorts to {010; 001; 111}. Strategy 001 is recorded in the Sky-partition file as the only split point.
Table 4
As shown in Table 4, in the shuffle stage of the third round the data are divided into two blocks, {000; 100; 010} and {001; 101; 011; 111}. In the Reduce stage, the smallest R value in each block serves as the key of that block: the key of the first block {000; 100; 010} is 0.021, and the key of the second block {001; 101; 011; 111} is 0.111. The pairs {0.021; 0.1213} and {0.111; 0}, each consisting of a block's key and the smallest U value in that block, are recorded in the HashMap for the first and second block respectively.
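The partitioning and the per-block HashMap described above can be sketched as follows. This is a single-process illustration, not a MapReduce job: the function names are hypothetical, the (R, U) values are taken from Table 5, and placing a strategy whose R equals the split point into the right-hand block is an assumption chosen because it reproduces the two blocks given in the text.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# (R, U) per strategy, copied from Table 5.
G: Dict[str, Tuple[float, float]] = {
    "000": (0.021, 0.1218),
    "100": (0.095, 0.1216),
    "010": (0.098, 0.1213),
    "001": (0.111, 0.0004),
    "101": (0.490, 0.0003),
    "011": (0.514, 0.0000),
    "111": (1.0, 0.0),
}


def partition_by_r(space: Dict[str, Tuple[float, float]],
                   split_points: List[float]) -> List[List[str]]:
    """Range-partition strategies by R. A strategy whose R equals a split
    point goes to the block on the right (assumed convention)."""
    blocks: List[List[str]] = [[] for _ in range(len(split_points) + 1)]
    for name, (r, _u) in space.items():
        blocks[bisect_right(split_points, r)].append(name)
    return blocks


def block_summary(space: Dict[str, Tuple[float, float]],
                  blocks: List[List[str]]) -> Dict[float, float]:
    """Map each non-empty block's key (its smallest R) to its smallest U."""
    return {
        min(space[s][0] for s in block): min(space[s][1] for s in block)
        for block in blocks
        if block
    }


# Strategy 001 (R = 0.111) is the single split point for two Reducers.
blocks = partition_by_r(G, [G["001"][0]])
block_map = block_summary(G, blocks)
print(blocks)     # [['000', '100', '010'], ['001', '101', '011', '111']]
print(block_map)  # {0.021: 0.1213, 0.111: 0.0}
```

Keeping only one (key, smallest U) pair per block lets each Reducer compare its strategies against all lower-R blocks without reading their records.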
As shown in Table 5, in the Reduce stage each strategy is processed by screening the data records one by one against the HashMap.
Strategy | Key | R | U | Min_u_lowerKey | Min_u_Key_lowerR | Skyline |
000 | 0.021 | 0.021 | 0.1218 | +∞ | +∞ | Y |
100 | 0.021 | 0.095 | 0.1216 | +∞ | 0.1218 | Y |
010 | 0.021 | 0.098 | 0.1213 | +∞ | 0.1216 | Y |
001 | 0.111 | 0.111 | 0.0004 | 0.1213 | +∞ | Y |
101 | 0.111 | 0.490 | 0.0003 | 0.1213 | 0.0004 | Y |
011 | 0.111 | 0.514 | 0.0000 | 0.1213 | 0.0003 | Y |
111 | 0.111 | 1 | 0 | 0.1213 | 0.0000 | Y |
Table 5
Take strategy 000 as an example. First, compute the smallest U value among the blocks whose key is less than key(000) = 0.021: Min_u_lowerKey(000) = +∞. This means the value can be arbitrarily large, because no strategy has a key less than 0.021. Second, among the strategies whose key equals key(000) = 0.021 and whose R value is less than R(000) = 0.021, compute the smallest U value: Min_u_Key_lowerR(000) = +∞. Again the value can be arbitrarily large, because strategy 000 has the smallest R value in the first block. Finally, because U(000) = 0.1218 < Min_u_lowerKey(000) = +∞ and U(000) = 0.1218 < Min_u_Key_lowerR(000) = +∞, strategy 000 belongs to the Skyline set. The remaining strategies are processed in the same way, and Table 6 is the final strategy Skyline set.
Table 6
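The per-strategy screening of Table 5 can be reproduced with a short sketch. The function and variable names are hypothetical, the two blocks and the (R, U) values come from the text, and the comparison is written as "less than or equal" on the assumption that a strategy is only dominated when both of its values are strictly larger; under that assumption strategy 111 (U = 0) is kept even though the smallest lower-R U value in its block is also listed as 0.0000.

```python
import math
from typing import Dict, List, Tuple

# (R, U) per strategy and the two blocks, as given in the text and Table 5.
G: Dict[str, Tuple[float, float]] = {
    "000": (0.021, 0.1218), "100": (0.095, 0.1216), "010": (0.098, 0.1213),
    "001": (0.111, 0.0004), "101": (0.490, 0.0003), "011": (0.514, 0.0000),
    "111": (1.0, 0.0),
}
BLOCKS: List[List[str]] = [["000", "100", "010"], ["001", "101", "011", "111"]]


def screen(space: Dict[str, Tuple[float, float]],
           blocks: List[List[str]]) -> Dict[str, Tuple[float, float, bool]]:
    """Return (Min_u_lowerKey, Min_u_Key_lowerR, in_skyline) per strategy."""
    # HashMap: block key (smallest R in the block) -> smallest U in the block.
    summary = {
        min(space[s][0] for s in block): min(space[s][1] for s in block)
        for block in blocks
    }
    rows: Dict[str, Tuple[float, float, bool]] = {}
    for block in blocks:
        key = min(space[s][0] for s in block)
        # Smallest U over all blocks whose key is lower than this block's key.
        min_u_lower_key = min(
            (u for k, u in summary.items() if k < key), default=math.inf
        )
        for s in block:
            r, u = space[s]
            # Smallest U among same-block strategies with a lower R value.
            min_u_key_lower_r = min(
                (space[o][1] for o in block if space[o][0] < r), default=math.inf
            )
            in_skyline = u <= min_u_lower_key and u <= min_u_key_lower_r
            rows[s] = (min_u_lower_key, min_u_key_lower_r, in_skyline)
    return rows


for name, row in screen(G, BLOCKS).items():
    print(name, row)
```

For this input every strategy is reported as a Skyline member, matching the "Y" column of Table 5; an infinite value corresponds to the +∞ entries explained for strategy 000.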
Step S206: output {F, (R, U)} and the generalization corresponding to each strategy.
Specifically, output the strategies shown in Table 6 and the generalization corresponding to each strategy. The strategies shown in Table 6 are the privacy strategies recommended for data table D.
For example, when strategy "000" is output, its corresponding generalization is: ages in the range 42-44 are generalized to the value [42, 44], and Gender values Female or Male are generalized to [Female, Male]. The generalization corresponding to strategy "111" is: ages in the range 42-44 are generalized to the values [42], [43], and [44], and Gender values Female or Male are generalized to [Female] and [Male].
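One way to read a strategy string is as one bit per boundary between adjacent attribute values, which agrees with the count L = (r_1 − 1) + … + (r_n − 1) in claim 2. The sketch below decodes a strategy under that reading. The source only states the generalizations for "000" and "111"; the bit order (Age boundaries first, then Gender), the meaning of intermediate strings, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def split_by_bits(values: Sequence, bits: str) -> List[list]:
    """Group ordered values; bit i == '1' splits between values i and i + 1."""
    groups = [[values[0]]]
    for value, bit in zip(values[1:], bits):
        if bit == "1":
            groups.append([value])    # start a new, finer group
        else:
            groups[-1].append(value)  # merge into the current group
    return groups


def decode(strategy: str, attributes: Dict[str, Sequence]) -> Dict[str, List[list]]:
    """Assign r_i - 1 consecutive bits of the strategy to each attribute."""
    result: Dict[str, List[list]] = {}
    pos = 0
    for name, values in attributes.items():
        width = len(values) - 1
        result[name] = split_by_bits(values, strategy[pos:pos + width])
        pos += width
    return result


# Quasi-identifier domains from the example: 3 ages and 2 genders, so L = 3.
ATTRS = {"Age": [42, 43, 44], "Gender": ["Female", "Male"]}

print(decode("000", ATTRS))
# {'Age': [[42, 43, 44]], 'Gender': [['Female', 'Male']]}
print(decode("111", ATTRS))
# {'Age': [[42], [43], [44]], 'Gender': [['Female'], ['Male']]}
```

Under this encoding "000" is the coarsest generalization and "111" leaves every value distinct, which is consistent with strategy 111 having R = 1 and U = 0 in Table 5.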
The scheme provided by the embodiments of the present invention can satisfy users' multi-level requirements for privacy protection and data availability, can cope with a number of privacy strategies that grows exponentially, and can substantially reduce the size of the candidate strategy set while preserving data precision.
Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to exceed the scope of the present application.
Those of ordinary skill in the art will appreciate that the methods of the above embodiments can be carried out by a program instructing a processor. The program can be stored in a computer-readable storage medium. The storage medium is a non-transitory medium, such as random access memory, read-only memory, flash memory, a hard disk, a solid-state drive, magnetic tape, a floppy disk, an optical disc, or any combination thereof.
The above are only preferred specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
Claims (6)
1. A Skyline-based data generalization method applied to the field of privacy protection, characterized by comprising:
Step S101: process the data table according to the data-publishing privacy protection standard 10-anonymity, record the resulting re-identification risk of the strategy as a threshold, and determine a policy space according to the value domains of the quasi-identifier attributes of the data table and the threshold, wherein the re-identification risk of every strategy included in the policy space is not greater than the threshold, and the data table contains user privacy information;
Step S102: filter the policy space using an approximate Skyline to obtain a candidate policy space; the approximate Skyline enlarges the dominance domain of the Skyline by a preset ratio to generate an approximate dominance domain, the preset ratio being determined according to a preset security parameter; the dominance domain is defined such that, if the re-identification risk and the information loss of strategy A are both greater than the re-identification risk and the information loss of strategy B, strategy A lies in the dominance domain of strategy B;
The step S102 comprises the following sub-steps:
Iteratively bisect the value domains of the quasi-identifier attributes of the data table until the value domains can no longer be divided, each iteration generating one strategy, and the set of strategies generated by the iterations being the initial strategy set;
Filter the policy space {S, (R, U)} using the approximate Skyline according to the initial strategy set: add to the initial strategy set the strategies in the policy space {S, (R, U)} that are not dominated by the initial strategy set, and update the initial strategy set; filter out the strategies in the policy space {S, (R, U)} that are dominated by the initial strategy set, and at the same time filter out the strategies in the initial strategy set that are dominated by strategies in {S, (R, U)};
Filter the strategies in the policy space {S, (R, U)} one by one using the approximate Skyline; when the policy space {S, (R, U)} has been filtered until it is empty, the updated initial strategy set is the candidate policy space {G, (R, U)}; the re-identification risk R and the information loss U are determined by the following formulas respectively:
where P denotes the record distribution of the equivalence classes, P' denotes the record distribution of the equivalence classes after generalization, |e'| is the number of records contained in equivalence class e', all equivalence classes e' together constitute P', N denotes the number of records in the entire data table, N ≥ 1, |e*| is the average number of records contained in the new equivalence classes generated after equivalence class e is generalized, and an equivalence class is the set of records in the data table that share the same quasi-identifier attribute values;
Step S103: perform the Skyline operation on the candidate policy space to obtain the recommended policy space, and generalize the data of the data table using the generalization strategies recommended for the data table so as to protect the privacy information of users; the recommended policy space includes the strategies recommended for the data table and their corresponding re-identification risk and information loss, and each strategy included in the recommended policy space corresponds to one generalization operation on the data table information.
2. The data generalization method according to claim 1, characterized in that the step S101 comprises the following sub-steps:
Enumerate $2^L$ strategies according to the value domains of the quasi-identifier attributes of the data table, where $L = (r_1 - 1) + \dots + (r_n - 1)$; the $2^L$ strategies correspond to n attributes, the i-th attribute has $r_i$ values, and the $r_i$ values correspond to $2^{r_i - 1}$ strategies, 1 ≤ i ≤ n, n ≥ 1;
Determine the re-identification risk R of each of the $2^L$ strategies, and filter out the strategies among the $2^L$ strategies whose R value is greater than the threshold T, to determine the strategy set;
Determine the information loss U of each strategy in the strategy set, and determine the policy space {S, (R, U)} according to the R value and U value of each strategy, where R ≥ 0, U ≥ 0, T ≥ 0.
3. The data generalization method according to claim 1, characterized in that the step S103 comprises the following sub-steps:
Partition the candidate policy space {G, (R, U)} into blocks, and sort the data corresponding to each block of the policy space on the node where it resides;
Determine the smallest re-identification risk R and the smallest information loss U in each block of the policy space;
Determine the strategies of the Skyline set and the recommended policy space {F, (R, U)} according to the smallest R value and the smallest U value in each block of the policy space.
4. The data generalization method according to claim 1, characterized in that the ε-approximate Skyline enlarges the dominance domain of the Skyline by a preset ratio to generate an approximate dominance domain, the preset ratio being determined according to ε, specifically:
If the R value and U value of strategy A are at most ε times the R value and U value of strategy B, strategy A approximately dominates strategy B with precision ε, denoted A <_ε B.
5. The data generalization method according to claim 3, characterized in that step S103-1 specifically comprises:
Create a partition file that records the data distribution of each block of the policy space; draw a sample from the candidate policy space {G, (R, U)} according to the sampling rate ρ, sort the sample, and extract t − 1 split points, so as to partition the candidate policy space {G, (R, U)} into blocks according to the split points, each block of the policy space corresponding to one data block;
where the sampling rate ρ is determined from t and |G|, t being the number of data blocks and |G| the number of strategies.
6. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, it implements the Skyline-based data generalization method applied to the field of privacy protection according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710339575.XA CN107194278B (en) | 2017-05-15 | 2017-05-15 | A kind of data generalization method based on Skyline |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107194278A CN107194278A (en) | 2017-09-22 |
CN107194278B true CN107194278B (en) | 2019-11-22 |
Family
ID=59872372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710339575.XA Active CN107194278B (en) | 2017-05-15 | 2017-05-15 | A kind of data generalization method based on Skyline |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107194278B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858282B (en) * | 2019-02-12 | 2020-12-25 | 北京信息科技大学 | Social network relationship data privacy protection method and system |
CN113544683B (en) * | 2019-03-11 | 2023-09-29 | 日本电信电话株式会社 | Data generalization device, data generalization method, and program |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102148829A (en) * | 2011-03-29 | 2011-08-10 | 苏州市职业大学 | Calculation method for entity node reliability under grid environment |
CN103648092B (en) * | 2013-12-26 | 2017-07-11 | 安徽师范大学 | The two-layer Sensor Network Skyline inquiry systems and method of secret protection |
CN105512566B (en) * | 2015-11-27 | 2018-07-31 | 电子科技大学 | A kind of health data method for secret protection based on K- anonymities |
Also Published As
Publication number | Publication date |
---|---|
CN107194278A (en) | 2017-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||