CN107194278A

CN107194278A - A data generalization method based on Skyline

Info

Publication number: CN107194278A
Application number: CN201710339575.XA
Authority: CN
Inventors: 丁晓锋; 金海; 王丽
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2017-05-15
Filing date: 2017-05-15
Publication date: 2017-09-22
Anticipated expiration: 2037-05-15
Also published as: CN107194278B

Abstract

The invention discloses a Skyline-based data generalization method, including: processing a data table according to the data-release privacy protection standard 10-anonymity to obtain the re-identification risk R of the strategy, which is recorded as a threshold T; determining a policy space {S, (R, U)} from the value ranges of the quasi-identifier attributes and the threshold T, where the R value of every strategy in {S, (R, U)} is not greater than T; filtering the policy space {S, (R, U)} with ε-approximate Skyline to obtain a candidate policy space {G, (R, U)}; and performing a Skyline computation on the candidate policy space {G, (R, U)} to obtain the recommended policy space {F, (R, U)}, which is the privacy policy space recommended for the data table. By enumerating the full policy space, the invention improves the accuracy of privacy policy recommendation, covers the RU space widely, and meets the multi-level requirements of users. Setting the threshold T filters out strategies whose privacy protection does not meet the requirement and reduces the time needed to generate the policy space, and ε-approximate Skyline filtering further reduces the size of the candidate policy space.

Description

Data generalization method based on Skyline

Technical Field

The invention belongs to the field of privacy protection data release, and particularly relates to a Skyline-based data generalization method.

Background

In the digital information age, the exchange and release of data among various parties (e.g., government, business, individuals) is becoming increasingly important. For example, a hospital in California typically needs to submit medical data for certain rehabilitation patients. These data contain sensitive information, and releasing them directly would reveal personal privacy. L. Sweeney gives an example of a "linking attack": a patient information table and a voter information table are joined on the attributes (Age, Sex, Zipcode), which reveals that Ahmed suffers from influenza, so the patient's privacy is compromised. Releasing data in this way is not secure. Privacy-preserving data release was therefore proposed; it requires that data availability be maximized while privacy is preserved.

To protect the privacy of released data, two kinds of privacy protection models have been proposed. One controls the re-identification risk, e.g., K-anonymity. The other is rule-based, such as Safe Harbor. K-anonymity generalizes, permutes and perturbs records so that each equivalence class, i.e., a set of records with the same quasi-identifier, contains at least K records. A quasi-identifier cannot identify personally sensitive information by itself, but may do so when linked to another data table containing identifying attributes. However, the strategies provided by this method cover only a small part of the RU space and cannot meet the multi-level requirements of users, where R is the risk and U is the information loss. Safe Harbor, defined under the Health Insurance Portability and Accountability Act (HIPAA), requires the removal of specified attributes from the data table, such as name, telephone number and e-mail address. However, most Safe Harbor strategies perform poorly in both risk and information loss.

The balance between privacy protection and data usability is increasingly sought after. To meet the multi-level requirements of users, it becomes more and more important to screen out better privacy policies. Generalization means replacing a specific attribute value with a broader value range, i.e., a more general and abstract description of the data, in order to prevent sensitive information from being re-identified through the quasi-identifier. For example, age 21 is generalized to [20-30]. At present, mainly heuristic algorithms exist for solving this problem. One is a binary search algorithm based on Hamming distance, in which the choice of each bit is determined by a weight. This algorithm can quickly recommend better strategies, but its recommendation accuracy is not high and the range of recommended strategies is small. The other is a probability-based heuristic search algorithm. Its search range is expanded to cover the whole RU space, but along a given path only a certain number of strategies are selected at a time for Skyline processing. The Skyline operation screens a series of "interesting" strategies from the candidate strategies, where "interesting" strategies are those not "dominated" by any other strategy; that is, no other strategy may have both a smaller R value and a smaller U value. Because of this approximation, the filtering error depends on the initial strategy set and on the path selection. In addition, the algorithm converges slowly.

In summary, the existing privacy protection method has the problems of small RU space coverage, incapability of meeting multi-level requirements of users, poor performance in both risk amount and information loss amount, low accuracy, small recommended strategy range, low algorithm convergence speed and the like.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to solve the technical problems that the existing privacy protection method has small RU space coverage, cannot meet the multi-level requirements of users, has poor performance in both the risk quantity R and the information loss quantity U, has low accuracy, has small recommended strategy range and low algorithm convergence speed.

In order to achieve the above object, an embodiment of the present invention provides a data generalization method based on Skyline, including the following steps.

Step S101, processing a data table according to the data-release privacy protection standard 10-anonymity to obtain the re-identification risk R of the strategy, recording the risk R as a threshold T, and determining a strategy space {S, (R, U)} according to the value ranges of the quasi-identifier attributes and the threshold T, where U is the information loss of a strategy, the R value of every strategy included in {S, (R, U)} is not greater than the threshold T, and R ≥ 0, U ≥ 0, T ≥ 0.

Step S102, filtering the strategy space {S, (R, U)} by ε-approximate Skyline to obtain a candidate strategy space {G, (R, U)}, where ε is a preset security parameter with ε ≥ 1. The ε-approximate Skyline enlarges the dominance domain of the Skyline by a preset ratio to generate an approximate dominance domain, the preset ratio being determined by ε. The dominance domain is defined as follows: if the R value and the U value of strategy A are both larger than those of strategy B, strategy A lies in the dominance domain of strategy B.

Step S103, performing a Skyline operation on the candidate strategy space {G, (R, U)} to obtain a recommended strategy space {F, (R, U)}, where {F, (R, U)} is the set of privacy strategies recommended for the data table, and each strategy included in the recommended strategy space corresponds to one generalization operation on the data table information.

Specifically, privacy policies are recommended for the data table, where each privacy policy corresponds to one way of generalizing the data table information; generalizing the data table information according to a recommended, better policy allows the data table to protect user privacy well.

The embodiment of the invention improves the accuracy of strategy recommendation by enumerating the full strategy space, covers the RU space widely, and meets the multi-level requirements of users. Setting the threshold T on the re-identification risk R filters out strategies whose privacy protection does not meet the requirement, reduces the time for generating the strategy space, and also reduces the size of the candidate strategy space. Employing ε-approximate Skyline filtering further reduces the size of the candidate strategy space, while the security parameter ε adjusts the balance between privacy protection and data usability.

In an alternative embodiment, step S101 comprises the following sub-steps: Step S101-1, enumerating 2^L strategies according to the value ranges of the quasi-identifier attributes of the data table, where L = (r_1 - 1) + … + (r_n - 1); the 2^L strategies correspond to n attributes, n ≥ 1; the i-th attribute has r_i values, and these r_i values correspond to 2^(r_i - 1) strategies, 1 ≤ i ≤ n. Step S101-2, determining the R value of each of the 2^L strategies, and filtering out of the 2^L strategies those whose R value is larger than the threshold T, to determine a strategy set. Step S101-3, determining the information loss U of each strategy in the strategy set, and determining the strategy space {S, (R, U)} according to the R and U values of each strategy.

Specifically, enumerating the full policy space improves the accuracy of the policy recommendations, because the recommendation is carried out over the full space of 2^L policies.

In an alternative embodiment, step S102 includes the following sub-steps: Step S102-1, bisecting the value ranges of the quasi-identifier attributes of the data table over multiple iterations until the value ranges can no longer be divided, generating one strategy per iteration, and taking the set of strategies produced by these iterations as the initial strategy set. Step S102-2, filtering the strategy space {S, (R, U)} by ε-approximate Skyline according to the initial strategy set: strategies in {S, (R, U)} that are not dominated by the initial strategy set are added to the initial strategy set, which is thereby updated; strategies in {S, (R, U)} that are dominated by the initial strategy set are filtered out; and strategies in the initial strategy set that are dominated by strategies of {S, (R, U)} are filtered out as well. Step S102-3, filtering the strategies in the strategy space {S, (R, U)} one by one by ε-approximate Skyline according to step S102-2; when the strategy space {S, (R, U)} has been filtered until it is empty, the updated initial strategy set is the candidate strategy space {G, (R, U)}.

Specifically, strategies generated through iteration have good dominance capability. Using them as the initial strategy set allows the strategy space S to be filtered effectively, which improves the filtering efficiency of the ε-approximate Skyline and also reduces the size of the candidate strategy space {G, (R, U)}.

In an alternative embodiment, step S103 comprises the following sub-steps:

Step S103-1, partitioning the candidate strategy space {G, (R, U)} into several strategy spaces, and sorting the data of each strategy space at the node where that strategy space is located.

A partition file, Sky-partition, is created to record the strategy space data assigned to each block. Samples are drawn from the candidate strategy space {G, (R, U)} according to a sampling rate ρ; the samples are sorted and t-1 of them are extracted as partition points; and the candidate strategy space {G, (R, U)} is partitioned according to the partition points, each resulting strategy space corresponding to one data block;

wherein the sampling rate ρ depends on t and |G|, t being the number of data blocks and |G| the number of policies.

Step S103-2, determining the minimum R value and the minimum U value in each strategy space; and Step S103-3, determining the strategies of the Skyline set, i.e., the recommended strategy space {F, (R, U)}, according to the minimum R value and the minimum U value in each strategy space.

In an alternative embodiment, the re-identified risk amount R and the information loss amount U are determined by the following formulas, respectively:

wherein P represents the record distribution of the equivalence classes, P' represents the record distribution of the generalized equivalence classes, |e'| is the number of records contained in the equivalence class e', all the equivalence classes e' together form P', N represents the number of records in the whole data table, N ≥ 1, and |e*| is the average number of records of the new equivalence classes generated after the equivalence class e is generalized, where an equivalence class is a set of records with the same quasi-identifier attributes in the data table.

Specifically, the threshold value T is set to filter out the strategy that the privacy protection does not meet the requirement of 10-anonymity, so that the time for generating the strategy space S is greatly reduced, and the size of the strategy space S is also reduced.

In an optional embodiment, the ε-approximate Skyline enlarges the dominance domain of the Skyline by a preset ratio to generate an approximate dominance domain, the preset ratio being determined by ε. Specifically: if the R and U values of strategy A are at most ε times the R and U values of strategy B, strategy A approximately dominates strategy B with precision ε, denoted A <_ε B.

Specifically, ε-approximate Skyline filtering further reduces the size of the candidate policy space {G, (R, U)}, while the security parameter ε adjusts the balance between the R value and the U value, where the R value reflects privacy protection and the U value reflects data usability.
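The approximate dominance test can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `eps_dominates` is hypothetical, policies are reduced to (R, U) pairs, and strict inequalities are assumed so that ε = 1 falls back to the plain dominance defined for the dominance domain (both R and U of the dominated strategy are larger).

```python
def eps_dominates(a, b, eps=1.0):
    """Return True if policy a approximately dominates policy b with precision eps.

    a and b are (R, U) pairs. Assumed reading of the definition: each value of a
    is less than eps times the corresponding value of b. With eps = 1 this is
    ordinary dominance; a larger eps enlarges the dominance domain of a.
    """
    if eps < 1:
        raise ValueError("eps is a security parameter and must be >= 1")
    r_a, u_a = a
    r_b, u_b = b
    return r_a < eps * r_b and u_a < eps * u_b


# Illustrative (R, U) pairs.
print(eps_dominates((0.111, 0.0004), (0.170, 0.1213)))            # plain dominance
print(eps_dominates((0.021, 0.1218), (0.095, 0.1216)))            # U is slightly larger
print(eps_dominates((0.021, 0.1218), (0.095, 0.1216), eps=1.01))  # tolerated by eps
```

Raising `eps` makes more strategies count as dominated, which is how the security parameter trades a smaller candidate set against a coarser approximation of the Skyline.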

In a second aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the data generalization method according to the first aspect.

Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:

1. the embodiment of the invention improves the accuracy of strategy recommendation by enumerating the full strategy space, has wide coverage area for the RU space and meets the multi-level requirements of users.

2. According to the embodiment of the invention, the strategy with the privacy protection not meeting the requirement is filtered out by setting the threshold T of the re-identified risk amount R, so that the time for generating the strategy space is reduced, and the size of the candidate strategy space is also reduced.

3. The embodiment of the invention further reduces the size of the candidate strategy space by adopting ε-approximate Skyline filtering, and at the same time adjusts the balance between privacy protection and data usability through the security parameter ε.

Drawings

Fig. 1 is a schematic flow diagram of a data generalization method based on Skyline according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of another data generalization method based on Skyline according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a dominance domain corresponding to ε-approximate Skyline according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 1 is a schematic flow chart of a data generalization method based on Skyline according to an embodiment of the present invention, which includes steps S101 to S103.

Step S101, processing a data table according to the data-release privacy protection standard 10-anonymity to obtain the re-identification risk R of the strategy, recording the risk R as a threshold T, and determining a strategy space {S, (R, U)} according to the value ranges of the quasi-identifier attributes and the threshold T, where U is the information loss of a strategy, and the R value of every strategy included in {S, (R, U)} is not greater than the threshold T.

Specifically, step S101 includes the following substeps:

Step S101-1, enumerating 2^L strategies according to the value ranges of the quasi-identifier attributes of the data table, where L = (r_1 - 1) + … + (r_n - 1); the 2^L strategies correspond to n attributes; the i-th attribute has r_i values, and these r_i values correspond to 2^(r_i - 1) strategies, 1 ≤ i ≤ n.

Specifically, all policies are enumerated according to the value ranges of the quasi-identifier attributes of the data table, as follows. Suppose Q = {Q_1, …, Q_n} is the quasi-identifier attribute set, and the value range of attribute Q_i has r_i values. A de-identification strategy is represented by a binary string: let p_i denote a partition of the value range of attribute Q_i; then the set {p_1, …, p_n} is a de-identification strategy, where p_i = {I_1, …, I_(r_i - 1)} and I_j indicates whether the j-th value of attribute Q_i is separated from the (j+1)-th value. The length of the string is therefore L = (r_1 - 1) + … + (r_n - 1), and all 2^L policies are enumerated.
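The enumeration described above can be sketched as follows. The helper names `enumerate_policies` and `policy_to_string` are hypothetical, and the attribute order inside the bit string (first attribute first) is an assumption; only the bit-string encoding itself comes from the text.

```python
from itertools import product


def enumerate_policies(domain_sizes):
    """Yield every policy as a tuple of per-attribute bit tuples.

    domain_sizes[i] is r_i, the number of values of attribute Q_i, so the
    attribute contributes r_i - 1 separation bits and L = sum(r_i - 1).
    """
    lengths = [r - 1 for r in domain_sizes]
    total_bits = sum(lengths)
    for bits in product((0, 1), repeat=total_bits):
        policy, start = [], 0
        for n_bits in lengths:
            policy.append(bits[start:start + n_bits])
            start += n_bits
        yield tuple(policy)


def policy_to_string(policy):
    """Flatten a policy into its binary-string form."""
    return "".join(str(bit) for part in policy for bit in part)


# Three attributes with 3, 2 and 1 values: L = 2 + 1 + 0 = 3, hence 2^3 policies.
policies = list(enumerate_policies([3, 2, 1]))
print(len(policies))
print([policy_to_string(p) for p in policies])
```

An attribute with a single value contributes zero bits, so it never changes the number of policies.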

In one example, before step S101-1 is performed, the method further comprises: performing 10-anonymization on the value ranges of the quasi-identifier attributes of the data table, and taking the re-identification risk R corresponding to the 10-anonymization as the threshold T, where the threshold T is used for privacy protection of the data table. The 10-anonymization is one of the 2^L policies and is determined based on existing data-release privacy protection standards.

Step S101-2, determining the R value of each of the 2^L strategies, and filtering out of the 2^L strategies those whose R value is larger than the threshold T, to determine a strategy set.

Step S101-3, determining the information loss U of each strategy in the strategy set, and determining the strategy space {S, (R, U)} according to the R and U values of each strategy.
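Sub-steps S101-2 and S101-3 can be sketched as a single pass in which U is computed only for the strategies that survive the threshold. The function `build_policy_space` and the toy R and U values below are hypothetical; the actual R and U come from formulas (1) and (2), which are not reproduced in this sketch.

```python
def build_policy_space(policies, risk, loss, threshold):
    """Keep policies whose risk R does not exceed threshold T, then attach (R, U).

    risk and loss are caller-supplied functions; loss is evaluated only for
    policies that pass the threshold, which is where the time saving comes from.
    """
    space = {}
    for policy in policies:
        r = risk(policy)
        if r > threshold:
            continue  # privacy protection below the requirement: drop early
        space[policy] = (r, loss(policy))
    return space


# Hypothetical two-bit policies with made-up R and U values.
RISK = {"00": 0.05, "01": 0.20, "10": 0.30, "11": 0.90}
LOSS = {"00": 0.40, "01": 0.30, "10": 0.10, "11": 0.00}
loss_calls = []


def loss(policy):
    loss_calls.append(policy)
    return LOSS[policy]


space = build_policy_space(RISK, RISK.get, loss, threshold=0.30)
print(space)
print(loss_calls)
```

A strategy whose R equals T is kept, matching the condition that R is not greater than the threshold.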

Step S102, filtering the strategy space {S, (R, U)} by ε-approximate Skyline to obtain a candidate strategy space {G, (R, U)}, where ε is a preset security parameter. The ε-approximate Skyline enlarges the dominance domain of the Skyline by a preset ratio to generate an approximate dominance domain, the preset ratio being determined by ε. The dominance domain is defined as follows: if the R value and the U value of strategy A are both larger than those of strategy B, strategy A lies in the dominance domain of strategy B.

In an optional embodiment, the ε-approximate Skyline enlarges the dominance domain of the Skyline by a preset ratio to generate an approximate dominance domain, the preset ratio being determined by ε. Specifically: if the R and U values of strategy A are at most ε times the R and U values of strategy B, strategy A approximately dominates strategy B with precision ε, denoted A <_ε B.

Step S103, Skyline operation is carried out on the candidate strategy space { G, (R, U) } to obtain a recommended strategy space { F, (R, U) }, wherein the { F, (R, U) } is the privacy strategy recommended for the data table. Each policy included in the recommended policy space { F, (R, U) } corresponds to a generalization of the data table information.

Specifically, step S102 includes the following substeps:

Step S102-1, bisecting the value ranges of the quasi-identifier attributes of the data table over multiple iterations until the value ranges can no longer be divided, generating one strategy per iteration, and taking the set of strategies produced by these iterations as the initial strategy set.

Step S102-2, filtering the strategy space {S, (R, U)} by ε-approximate Skyline according to the initial strategy set: strategies in {S, (R, U)} that are not dominated by the initial strategy set are added to the initial strategy set, which is thereby updated; strategies in {S, (R, U)} that are dominated by the initial strategy set are filtered out; and strategies in the initial strategy set that are dominated by strategies of {S, (R, U)} are filtered out as well.

Step S102-3, filtering the strategies in the strategy space {S, (R, U)} one by one by ε-approximate Skyline according to step S102-2; when the strategy space {S, (R, U)} has been filtered until it is empty, the updated initial strategy set is the candidate strategy space {G, (R, U)}.
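A minimal sketch of the filtering loop in sub-steps S102-2 and S102-3 follows. It assumes that the initial strategy set is already given (its construction by bisection in S102-1 is not shown), that dominance is tested with strict inequalities scaled by ε, and that the names `approx_skyline_filter` and `eps_dominates` are hypothetical. The sample values are a subset of Table 2, with seeds chosen only for illustration.

```python
def eps_dominates(a, b, eps):
    """Assumed test: a's R and U are each less than eps times those of b."""
    return a[0] < eps * b[0] and a[1] < eps * b[1]


def approx_skyline_filter(space, seeds, eps=1.0):
    """Filter space (name -> (R, U)) against a running candidate set.

    seeds are the names forming the initial strategy set. Each remaining
    strategy is dropped if a candidate dominates it; otherwise it joins the
    candidates and evicts every candidate it dominates.
    """
    seed_names = set(seeds)
    candidates = {name: space[name] for name in seeds}
    for name, point in space.items():
        if name in seed_names:
            continue
        if any(eps_dominates(kept, point, eps) for kept in candidates.values()):
            continue
        candidates = {k: v for k, v in candidates.items()
                      if not eps_dominates(point, v, eps)}
        candidates[name] = point
    return candidates


# Subset of the (R, U) values listed in Table 2.
space = {
    "000": (0.021, 0.1218),
    "001": (0.111, 0.0004),
    "100": (0.095, 0.1216),
    "110": (0.170, 0.1213),
    "111": (1.0, 0.0),
}
exact = approx_skyline_filter(space, seeds=["100", "110"], eps=1.0)
loose = approx_skyline_filter(space, seeds=["100", "110"], eps=1.01)
print(sorted(exact))
print(sorted(loose))
```

With ε = 1 the seed 110 is evicted once 001 is processed, because 001 has both a smaller R and a smaller U; a slightly larger ε shrinks the candidate set further.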

Specifically, step S103 includes the following substeps:

Step S103-2, determining the minimum R value and the minimum U value in each strategy space.

Step S103-3, determining the strategies of the Skyline set, i.e., the recommended strategy space {F, (R, U)}, according to the minimum R value and the minimum U value in each strategy space.

In one example, step S103-3 specifically includes: first, Min_U_lowKey(P) and Min_U_Key_lowR(P) of policy P are calculated, where Min_U_lowKey(P) is the minimum U value over the data blocks that precede the block of policy P, and Min_U_Key_lowR(P) is the minimum U value over the policies that precede policy P within its own data block. The U value U(P) of policy P is then compared with both, and policy P is output only if U(P) < Min_U_lowKey(P) and U(P) < Min_U_Key_lowR(P).

In this way, the Skyline operation is performed on the candidate strategy space {G, (R, U)} to obtain and output the recommended strategy set F.

Specifically, step S103-1 to step S103-3 can be divided into the following 4 stages:

Sky-partition creation stage: a partition file, denoted Sky-partition, is created to record the data assigned to each node. Samples are obtained according to the sampling rate ρ, sorted, and t-1 of them are extracted as partition points.

Local sorting stage: in the Reduce stage, the data file is divided according to the partition points, and each node sorts the data it holds.

HashMap creation stage: a table mapping each node to its minimum R value and minimum U value is created and recorded as the HashMap.

Global screening stage: in the Reduce stage, the data records are screened one by one according to the HashMap, and the strategies of the Skyline set are output together with their R and U values.
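The first three stages can be sketched on a single machine as follows. This is not a MapReduce job; it only mirrors the data flow. The helper names are hypothetical, the sample is passed in directly instead of being drawn with rate ρ, picking evenly spaced partition points from the sorted sample is an assumption, and the (R, U) values are those listed in Table 5.

```python
import bisect


def split_points(sample_r, t):
    """Sort the sampled R values and pick t-1 evenly spaced partition points."""
    ordered = sorted(sample_r)
    return [ordered[k * len(ordered) // t] for k in range(1, t)]


def partition_and_sort(space, splits):
    """Assign each policy to a block by its R value, then sort each block by R."""
    blocks = [[] for _ in range(len(splits) + 1)]
    for name, (r, u) in space.items():
        blocks[bisect.bisect_right(splits, r)].append((r, u, name))
    return [sorted(block) for block in blocks]


def build_hashmap(blocks):
    """Map each block's key (its minimum R) to the minimum U inside the block."""
    return {block[0][0]: min(u for _, u, _ in block) for block in blocks if block}


# (R, U) values as listed in Table 5.
G = {
    "000": (0.021, 0.1218), "100": (0.095, 0.1216), "010": (0.098, 0.1213),
    "001": (0.111, 0.0004), "101": (0.490, 0.0003), "011": (0.514, 0.0000),
    "111": (1.0, 0.0),
}
sample = ["001", "010", "111"]
splits = split_points([G[name][0] for name in sample], t=2)
blocks = partition_and_sort(G, splits)
hashmap = build_hashmap(blocks)
print(splits)
print([[name for _, _, name in block] for block in blocks])
print(hashmap)
```

Using `bisect_right` places a policy whose R equals a partition point into the later block, so the policy chosen as partition point opens the second block.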

In an alternative example, the amount of risk R and the amount of information loss U to be re-identified are determined by the following equations (1) and (2), respectively:

wherein P represents the record distribution of the equivalence classes, P' represents the record distribution of the generalized equivalence classes, |e'| is the number of records included in the equivalence class e', all the equivalence classes e' together constitute P', N represents the number of records in the whole data table, and |e*| is the average number of records of the new equivalence classes generated after the equivalence class e is generalized, where an equivalence class is a set of records with the same quasi-identifier attributes in the data table.

Fig. 2 is a schematic flow chart of another data generalization method based on Skyline according to an embodiment of the present invention, which includes steps S201 to S206.

In step S201, a data table D and a parameter ε are input.

A data table D and a security parameter ε are input, where ε is a user-defined security parameter and ε ≥ 1. For example, Table 1 is a (raw) data table provided in the embodiment of the present invention, as shown below:

Equivalence class	Number of records
42|Male|White|	465
42|Female|White|	168
43|Male|White|	468
43|Female|White|	163
44|Male|White|	444
44|Female|White|	150

TABLE 1

Here the attributes Age, Gender and Race form the quasi-identifier; Age takes values in [42,44], Gender takes values in {Female, Male}, and Race takes values in {White}.

The set of records in the data table having the same quasi-identifier attribute is an equivalence class, e.g., in table 1, the equivalence class "42 | Male | White |" has 465 records.
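Grouping records into equivalence classes, and checking the K-anonymity condition that every class holds at least K records, can be sketched as below. The function names and the handful of sample records are hypothetical; they only mimic the shape of Table 1.

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count the records that share the same quasi-identifier values."""
    return Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)


def is_k_anonymous(classes, k):
    """True if every equivalence class contains at least k records."""
    return all(count >= k for count in classes.values())


# Hypothetical records; the Disease column is not part of the quasi-identifier.
records = [
    {"Age": 42, "Gender": "Male", "Race": "White", "Disease": "flu"},
    {"Age": 42, "Gender": "Male", "Race": "White", "Disease": "cold"},
    {"Age": 42, "Gender": "Female", "Race": "White", "Disease": "flu"},
    {"Age": 43, "Gender": "Male", "Race": "White", "Disease": "asthma"},
]
classes = equivalence_classes(records, ["Age", "Gender", "Race"])
print(classes)
print(is_k_anonymous(classes, 1), is_k_anonymous(classes, 2))
```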

In step S202, a threshold T for privacy protection is calculated.

The data table D is processed with 10-anonymity, and the resulting R value, computed according to formula (1), is taken as the threshold T for privacy protection.

Step S203, enumerating all strategies, filtering the strategies with the R value larger than the threshold value T in the enumerated strategies, calculating the R values and the U values of the rest strategies, and generating a strategy space { S, (R, U) }.

According to the value ranges of the quasi-identifier attributes in Table 1, the policy length is L = 2 + 1 = 3: Table 1 shows that the number of attributes is n = 3, the attribute Age takes 3 values, the attribute Gender takes 2 values, and the attribute Race takes 1 value, so that L = (r_1 - 1) + … + (r_n - 1) = (3 - 1) + (2 - 1) + (1 - 1) = 3. Specifically, since Age takes the values 42, 43 or 44, it can be generalized to [42-44]; [42-43], [44]; [42], [43-44]; or [42], [43], [44]. These four (2^2) generalization cases of the attribute Age can be represented by 2 bits. Likewise, the two (2^1) generalization cases of the attribute Gender can be represented by 1 bit. The attribute Race is represented by 0 bits and has only one (2^0) generalization case.

The 2^L strategies correspond to n attributes; the i-th attribute has r_i values, and these r_i values correspond to 2^(r_i - 1) strategies, 1 ≤ i ≤ n.

Enumerate a total of 8 policies as shown in table 2. The R and U values for each strategy are calculated according to equations (1) and (2), respectively.

Policy	000	001	010	011	100	101	110	111
R	0.021	0.111	0.098	0.154	0.095	0.490	0.170	1
U	0.1218	0.0004	0.1213	0.0000	0.1216	0.0003	0.1213	0

TABLE 2

In one example, the calculation of the R and U values for policy 000 is as follows:

wherein,

the remaining seven strategies were calculated similarly, resulting in the results as in table 2.

Step S204, filtering the strategy space {S, (R, U)} by ε-approximate Skyline to obtain a candidate strategy space {G, (R, U)}.

In an example, Fig. 3 is a schematic diagram of the dominance domains corresponding to ε-approximate Skyline according to an embodiment of the present invention. As shown in Fig. 3, policy 1, policy 2 and policy 3 form an initial policy set. One set of marked regions shows the dominance domains of policy 1, policy 2 and policy 3, and the other shows their approximate dominance domains when ε-approximate Skyline filtering is applied.

Table 3 shows the candidate policy set {G, (R, U)} determined when ε is 1.

Policy	000	001	010	011	100	101	111
R	0.021	0.111	0.098	0.154	0.095	0.490	1
U	0.1218	0.0004	0.1213	0.0000	0.1216	0.0003	0

TABLE 3

The policies in the initial policy set include the bold portions in Table 2, such as policy "100", policy "101" and policy "110". All policies in Table 2 are compared with the policies in the initial policy set, and the initial policy set is updated at the same time. After the iteration, strategies 000, 001, 010, 011, 100 and 101 are retained, and strategy 110 is filtered out because R(110) > R(001) and U(110) > U(001). Likewise, policy 111 is retained. The results after filtering are shown in Table 3 above.

Step S205, a Skyline operation is performed on the candidate strategy space {G, (R, U)} to obtain the recommended strategy set F.

Specifically, Table 4 shows the data G to be processed. Assuming that the number of reducers is 2, the sampling rate ρ is determined accordingly. Assume the sample is {001; 010; 111}, which is sorted to {010; 001; 111}. Policy 001 is then written to the Sky-partition file as the only partition point.

TABLE 4

In the shuffle phase of the third round, as shown in Table 4, the data is divided into 2 blocks, {000; 100; 010} and {001; 101; 011; 111}. In the Reduce phase, the smallest R value of each block is taken as its key: the key of the first block {000; 100; 010} is 0.021, and the key of the second block {001; 101; 011; 111} is 0.111.

{0.021; 0.1213} and {0.111; 0} are the HashMap entries recorded for the first block and the second block, respectively.

As shown in Table 5, each strategy is processed in the Reduce phase by screening the data records one by one according to the HashMap.

Policy	Key	R	U	Min_U_lowKey	Min_U_Key_lowR	Skyline
000	0.021	0.021	0.1218	+∞	+∞	Y
100	0.021	0.095	0.1216	+∞	0.1218	Y
010	0.021	0.098	0.1213	+∞	0.1216	Y
001	0.111	0.111	0.0004	0.1213	+∞	Y
101	0.111	0.490	0.0003	0.1213	0.0004	Y
011	0.111	0.514	0.0000	0.1213	0.0003	Y
111	0.111	1	0	0.1213	0.0000	Y

TABLE 5

Take policy 000 as an example. First, the minimum U value among the policies whose key is smaller than Key(000) = 0.021 is calculated: Min_U_lowKey(000) = +∞, meaning that this value can be arbitrarily large, since no policy has a key smaller than 0.021. Next, the minimum U value among the policies whose key equals Key(000) = 0.021 and whose R value is smaller than R(000) = 0.021 is calculated: Min_U_Key_lowR(000) = +∞. This value can likewise be arbitrarily large, since the R value of 000 is the smallest in the first block. Finally, since U(000) = 0.1218 < Min_U_lowKey(000) = +∞ and U(000) = 0.1218 < Min_U_Key_lowR(000) = +∞, policy 000 belongs to the Skyline set. The other strategies are treated similarly, and Table 6 is the final Skyline set of strategies.

TABLE 6
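The screening rule illustrated with policy 000 can be sketched as one pass over the sorted blocks. The function name `global_screening` is hypothetical, and the sketch assumes distinct R values inside a block. The data are the first five rows of Table 5; the rows for 011 and 111 are left out because their U values are indistinguishable after rounding.

```python
import math


def global_screening(blocks):
    """Return the names of the Skyline policies.

    blocks is a list of data blocks ordered by key; each block is a list of
    (name, R, U) tuples sorted by R. A policy is kept only if its U is below
    the minimum U of all earlier blocks (Min_U_lowKey) and below the minimum U
    of the earlier policies in its own block (Min_U_Key_lowR).
    """
    skyline = []
    min_u_low_key = math.inf
    for block in blocks:
        min_u_key_low_r = math.inf
        for name, _r, u in block:
            if u < min_u_low_key and u < min_u_key_low_r:
                skyline.append(name)
            min_u_key_low_r = min(min_u_key_low_r, u)
        min_u_low_key = min(min_u_low_key, min_u_key_low_r)
    return skyline


table5_blocks = [
    [("000", 0.021, 0.1218), ("100", 0.095, 0.1216), ("010", 0.098, 0.1213)],
    [("001", 0.111, 0.0004), ("101", 0.490, 0.0003)],
]
print(global_screening(table5_blocks))
```

Because the blocks are ordered by R, a policy only needs to be compared against U values seen earlier, which is why a running minimum per block and across blocks is enough.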

Step S206, {F, (R, U)} and the generalization corresponding to each strategy are output.

Specifically, the policies shown in Table 6 are output, together with the generalization corresponding to each policy. The policies shown in Table 6 are the privacy policies recommended for data table D.

For example, the policy "000" is output, and its corresponding generalization is: the Age values 42 to 44 are generalized into the single interval [42,44], and the Gender values Female and Male are generalized into [Female, Male]. The policy "111" corresponds to the generalization in which Age is kept as the separate values [42], [43] and [44], and Gender is kept as the separate values Female and Male.
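How a policy string turns into a generalization of Table 1 can be sketched as follows. The function names are hypothetical, and the sketch assumes that the bits are laid out in the attribute order Age, Gender, Race and that a bit of 1 separates two neighbouring values while 0 merges them, which is consistent with the two policies described above.

```python
from collections import Counter


def groups_from_bits(values, bits):
    """Split an ordered value list into groups; bit j = 1 separates value j from j+1."""
    groups, current = [], [values[0]]
    for value, bit in zip(values[1:], bits):
        if bit:
            groups.append(tuple(current))
            current = [value]
        else:
            current.append(value)
    groups.append(tuple(current))
    return groups


def generalize_table(classes, domains, policy):
    """Merge equivalence classes (key -> record count) according to a policy string."""
    lookup, start = [], 0
    for values in domains:
        n_bits = len(values) - 1
        bits = [int(c) for c in policy[start:start + n_bits]]
        start += n_bits
        lookup.append({v: g for g in groups_from_bits(values, bits) for v in g})
    merged = Counter()
    for key, count in classes.items():
        merged[tuple(lookup[i][v] for i, v in enumerate(key))] += count
    return dict(merged)


# Equivalence classes and record counts from Table 1.
TABLE_1 = {
    (42, "Male", "White"): 465, (42, "Female", "White"): 168,
    (43, "Male", "White"): 468, (43, "Female", "White"): 163,
    (44, "Male", "White"): 444, (44, "Female", "White"): 150,
}
DOMAINS = [[42, 43, 44], ["Female", "Male"], ["White"]]

for policy in ("000", "100", "111"):
    print(policy, generalize_table(TABLE_1, DOMAINS, policy))
```

Policy "000" collapses Table 1 into a single equivalence class, while "111" leaves the six original classes untouched; every other policy lies between these two extremes.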

The scheme provided by the embodiment of the invention can meet the multilevel requirements of users on privacy protection and data availability, can deal with the condition that the number of privacy strategies is exponentially increased, and can greatly reduce the scale of a candidate strategy set on the premise of ensuring the data precision.

Those of skill in the art would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It will be understood by those of ordinary skill in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state drive, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.

The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A data generalization method based on Skyline is characterized by comprising the following steps:

step S101, processing a data table anonymously according to a data release privacy protection standard 10-to obtain a risk amount of strategy re-identification, recording the risk amount as a threshold value, and determining a strategy space according to a value range and the threshold value of a quasi-identifier attribute of the data table, wherein the risk amount of strategy re-identification included in the strategy space is not greater than the threshold value;

step S102, filtering the strategy space by adopting approximate Skyline to obtain a candidate strategy space; the approximate Skyline amplifies a dominant domain of the Skyline according to a preset proportion to generate an approximate dominant domain, and the preset proportion is determined according to preset safety parameters; the domination domain is that if the risk amount and the information loss amount of the re-identification of the strategy A are both larger than the risk amount and the information loss amount of the re-identification of the strategy B, the strategy A is in the domination domain of the strategy B;

step S103, Skyline operation is carried out on the candidate strategy space to obtain a recommended strategy space, the recommended strategy space comprises the strategies recommended for the data table and corresponding re-identified risk amount and information loss amount, and each strategy included in the recommended strategy space corresponds to one generalization operation of the data table information.

2. The data generalization method according to claim 1, wherein said step S101 comprises the sub-steps of:

enumerating 2^L strategies according to the value ranges of the quasi-identifier attributes of the data table, wherein L = (r_1 - 1) + … + (r_n - 1), the 2^L strategies correspond to n attributes, the i-th attribute corresponds to r_i values, the r_i values correspond to […], 1 ≤ i ≤ n, and n ≥ 1;

determining the re-identified risk amount R value of each of the 2^L strategies, filtering out of the 2^L strategies those whose R value is greater than the threshold value T, and determining a strategy set;

and determining the information loss amount U of each strategy in the strategy set, and determining the strategy space { S, (R, U) } according to the R value and the U value of each strategy, wherein R ≥ 0, U ≥ 0, and T ≥ 0.

3. The data generalization method according to claim 1, wherein said step S102 comprises the sub-steps of:

bisecting the value range of the quasi-identifier attribute of the data table over multiple iterations until the value range of the quasi-identifier attribute cannot be divided further, generating a strategy corresponding to each iteration, and taking the set of strategies corresponding to the multiple iterations as an initial strategy set;

filtering the strategy space { S, (R, U) } by approximate Skyline according to the initial strategy set: adding to the initial strategy set the strategies in { S, (R, U) } that are not dominated by the initial strategy set and updating the initial strategy set, filtering out the strategies in { S, (R, U) } that are dominated by the initial strategy set, and simultaneously filtering out the strategies in the initial strategy set that are dominated by strategies in { S, (R, U) };

and filtering the strategies in the strategy space { S, (R, U) } one by one by approximate Skyline, and, when the strategy space { S, (R, U) } is empty, taking the updated initial strategy set as the candidate strategy space { G, (R, U) }.

4. The data generalization method according to claim 1, wherein said step S103 comprises the sub-steps of:

partitioning the candidate strategy space { G, (R, U) } into a plurality of strategy spaces, and sorting the data corresponding to each strategy space at the node where that strategy space is located;

determining the minimum re-identified risk amount R value and the minimum information loss amount U value in each strategy space;

and determining the strategies of the Skyline set and the recommended strategy space { F, (R, U) } according to the minimum R value and the minimum U value in each strategy space.

5. The data generalization method according to any one of claims 1 to 4, wherein said re-identified risk measure R and information loss measure U are respectively determined by the following formulae:

$$R = \frac{\sum_{e' \in P'} \frac{1}{|e'|}}{\sum_{e \in P} \frac{1}{|e|}}$$

$$U = \frac{\sum_{e \in P} |e| \ln \frac{|e|}{|e^{*}|}}{N}$$

6. The data generalization method according to claim 1, wherein the approximating Skyline is a method for generating an approximated dominant domain by enlarging a dominant domain of Skyline at a preset scale, and the preset scale is determined specifically by:

if the R and U values of the strategy A are at most […] times larger than the R and U values of the strategy B, the strategy A approximately dominates the strategy B with precision ε, which is denoted A < B.

7. The data generalization method according to claim 4, wherein said step S103-1 specifically comprises:

creating a partition file for recording strategy space distribution data of each block, obtaining samples in the candidate strategy space { G, (R, U) } according to a sampling rate rho, sorting and extracting t-1 samples as partition points, partitioning the candidate strategy space { G, (R, U) } according to the partition points, wherein each strategy space corresponds to one data block;

8. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the data generalization method according to any one of claims 1 to 4 and 6 to 8.