CN110555316A

CN110555316A - privacy protection table data sharing algorithm based on cluster anonymity

Info

Publication number: CN110555316A
Application number: CN201910752801.6A
Authority: CN
Inventors: 刘丽苹; 朴春慧
Original assignee: Shijiazhuang Tiedao University
Current assignee: Guangzhou Jinqi Information Technology Co.,Ltd.
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2019-12-10
Anticipated expiration: 2039-08-15
Also published as: CN110555316B

Abstract

The invention relates to a privacy protection table data sharing algorithm based on cluster anonymity, which comprises the steps of clustering records in a table through a k-media clustering algorithm to obtain a plurality of data tables; then, anonymizing each data sheet according to the information loss amount to generate an anonymous data sheet; and finally, noise is added into the sensitive attribute value in the anonymous data table, and the sensitive attribute value is compared with the MDAV (model-based data analysis) of the classic k-anonymous algorithm to verify the algorithm, so that the usability and privacy of the algorithm are proved, and the method has high popularization and application values.

Description

Privacy protection table data sharing algorithm based on cluster anonymity

Technical Field

The patent application belongs to the technical field of privacy protection, and more particularly relates to a privacy protection table data sharing algorithm based on cluster anonymity.

Background

With the construction and development of digital governments, government affair data are gradually increased, and the government affair data are larger in scale and more in type, and have the characteristics of diversification and complication. For a long time, the phenomena of 'information isolated island' and 'data barrier' generally exist, and the data value cannot be fully exerted. Government affair data sharing can transfer government affair information from one department to another department, so that a data island phenomenon is improved, data can exert the maximum value, and government service quality is improved. Table data sharing is one of the important ways of sharing administration data.

Generally, "privacy" refers to information that a data owner is reluctant to be obtained by others. However, the development of information technology inevitably enhances the possibility of data information leakage, thereby limiting the development of information technology. Privacy concerns have therefore become increasingly attractive. In order to more intuitively reflect the attention of people in all circles to the problems of 'privacy' and 'privacy protection', measured by the published amount of the papers related to privacy every year, authors search by using 'privacy' as a topic keyword in the knowledge network, then search by using 'privacy protection' as a topic keyword in the search results, and according to the search results, the change of the number of the papers published every year since 1990 is drawn, as shown in fig. 1.

As can be seen from fig. 1, the interest in the privacy problem and the privacy protection problem has rapidly increased since 2003. Meanwhile, in recent years, people focus on privacy protection, and the focus of people is about half of that of privacy protection.

Based on this, a protection method needs to be provided, which aims to ensure the data availability and the data privacy, and a comparison experiment analysis is performed with the traditional anonymous algorithm MDAV, so that the provided privacy protection method can well improve the algorithm efficiency and provide effective privacy protection.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a cluster anonymity-based privacy protection table data sharing algorithm, which can avoid the defects of the form and can provide effective privacy protection.

In order to solve the problems, the technical scheme adopted by the invention is as follows:

a privacy protection table data sharing algorithm based on cluster anonymity comprises the following steps of being applied to a shared static data table:

Step1, clustering: dividing the table data records based on k-media clustering, and clustering the records in the shared static data table by using a k-media clustering algorithm according to the distance between the records in the data table to obtain a plurality of clusters, namely a plurality of data tables;

Step2, anonymization: processing each cluster obtained through Step1, firstly, dividing data in the cluster according to the information loss amount, then, adjusting each obtained cluster to enable each cluster to meet k-anonymity conditions and have no condition that sensitive attribute values are completely equal, and finally, generalizing the clusters to generate an anonymous data table;

step3, differential privacy and noise adding: carrying out differential privacy processing on the sensitive attribute values in the table data;

Step4, comparative verification: finally, the usability and privacy verification of the method are carried out through example analysis and comparison with a classic k-anonymity algorithm MDAV.

The technical scheme of the invention is further improved as follows: in Step1, the core idea of table data record division is as follows: dividing n records in a shared static data table into a plurality of clusters by using a clustering technology, so that the records with high similarity are divided into a group; meanwhile, in order to meet the following k-anonymity requirement, the clusters which do not meet the anonymity requirement are required to be adjusted after clustering is finished, so that the specific flow of table data record division by combining a k-media clustering algorithm is as follows:

step 11: normalization processing, namely quantizing the non-sensitive ordered classification type attributes in the data table into values 1,2,3, wherein, the ordered classification type attributes are regarded as numerical type attributes to be processed, and further normalizing the numerical type attribute data in all the non-sensitive attributes in the data table, wherein the normalization formula is as follows:

In the formula, x_i' is a normalized value of a numerical attribute, x_iis an original value of a numerical attribute, x_minIs the minimum value of the attribute, x_maxIs the maximum value of the attribute;

step 12: according to the distance of the non-sensitive attribute between the table data records, clustering the table data by using a k-media clustering algorithm, and dividing the table data records into k₁Clustering;

Step 13: according to k-anonymous parameter k₂Performing cluster recording adjustment on clusters which do not meet the anonymity requirement, if the number of data records in the divided clusters is more than k₂If so, no adjustment is performed; if present, the resulting cluster C_iThe number of data records in (2) is less than k₂Then cluster the distance C_iTo cluster C_iWhile ensuring that the data record in the cluster where the record is located is still larger than k₂；

Step 14: repeating Step3 until the record in each cluster is greater than or equal to k₂；

Step 15: dividing the data into different sub-data tables T1, T2, Tk₁thereby obtaining k₁a folder data table.

The technical scheme of the invention is further improved as follows: in Step12, when the k-media clustering algorithm is used to divide the records, because the data table contains attributes of two types, namely, a classification type attribute and a numerical type attribute, different data distance calculation methods need to be adopted when calculating the distance between the records, and the problem of the optimal clustering result, namely the optimal number k of the divided clusters, needs to be considered when the k-media clustering algorithm is performed₁The selection process comprises the following steps:

Step121, a calculation formula of the distance between data table records:

When the distance between the records of the data table is calculated, because a plurality of attributes exist in the data table, different attributes need to be separately calculated, and the numerical attribute distance calculation formula is as shown in formula 2:

dist(x_i,x_j)＝|x_i-x_jL (equation 2)

The formula for calculating the classification type attribute is shown as formula 3:

Assume that there are m numeric attributes and n categorical attributes in the data table, therefore, any two records X in the data table_i、 X_jthe distance calculation formula of (2) is as in formula 4:

In the formula x_ipAnd x_jpAre respectively record X_iAnd record X_jp-th numerical attribute value of (2), x_ipand x_jpAre respectively record X_iAnd record X_jThe qth categorical attribute value of (1);

step122, number of data recording divided clusters k₁Determination of (1):

the k-media clustering algorithm is used for dividing similar records into a group, preparing for anonymization processing and reducing information loss brought by the anonymization process as much as possible, so that the similarity problem in the cluster is mainly considered when determining the cluster number of the cluster, and the data record dividing cluster number k is determined through the mean square error and SSE (residual sum of squares) in the group₁(ii) a And with k₁The data records in each cluster will gradually decrease, the distance between the records in the cluster should be smaller and smaller, and therefore the value of SSE should follow k₁Is increased and decreased; therefore, k is performed by SSE₁When determining the value, take care of the change, when SSE follows k₁Is relatively slow, it is considered that k is further increased₁The clustering effect does not change much, then k is₁The value is the optimal clustering number; if each k is to be₁is represented in a line graph with the corresponding SSE value, then the corresponding k at the inflection point₁the value is the optimal cluster number.

The technical scheme of the invention is further improved as follows: at Step2, k obtained by the table data record division processing₁The method comprises the following steps of (1) opening a sub data table, and then sequentially processing each sub data table, wherein the core idea is as follows: dividing the data records in the sub data table so that the number of records in each cluster generated is [ k ]₂,2k₂-1]Meanwhile, the sensitive attribute value in each cluster is not unique, so the specific flow for realizing the table data anonymity processing algorithm is as follows:

Step 21: judging whether the number of data records in the data set is more than 2k₂1, if largeAt 2k₂1, then Step22 is executed;

step 22: two records r are selected from the data set₁And r₂As two initial clusters, such that when r₁And r₂when a cluster is formed, the information loss amount is the largest in the combination of all records in the cluster, and Step23 is executed;

Step 23: respectively calculating the information loss change condition of each record in the data set after dividing into two clusters, dividing the record into clusters with smaller information loss amount, and adjusting the data record to make the data record in each cluster at least be k₂and returning the generated clusters as two newly generated data sets to Step 21;

Step 24: when the number of data records in all data sets is [ k ]₂,2k₂-1]Sequentially and circularly judging whether the condition that the sensitive attribute value is unique exists in each data set or not, and if so, executing Step 25;

Step 25: selecting data records with different sensitive attribute values from the data set Q, and ensuring that if the data records are deleted, the number of the data records in the data set is still more than or equal to k₂and the sensitive attribute value is not unique;

Step 26: calculating the information loss variation quantity after the selected data records are divided into the corresponding data set Q, and dividing the data records with smaller information loss quantity into the data set Q;

step 27: obtaining the number of records at [ k₂,2k₂-1]and performing generalization processing on each set to obtain an anonymous data table.

The technical scheme of the invention is further improved as follows: in Step27, the anonymization rule of the generalization processing is: if the non-sensitive attribute is a numerical attribute, generalizing the non-sensitive attribute into a value range of the attribute in a set where the non-sensitive attribute is located; if the attribute is classified attribute, the attribute is generalized to a complete set of the attribute in the set of the attribute.

The technical scheme of the invention is further improved as follows: the process of performing differential privacy and noise processing on the anonymous data table in Step3 is as follows:

(1) If the classified attribute exists in the sensitive attributes, carrying out frequency statistics according to different combinations of classified sensitive attribute values, adding noise to the frequency statistics, adding a Num column at a corresponding position, and recording the noise-added data;

(2) If the numerical attributes exist in the sensitive attributes, clustering each numerical attribute respectively, and replacing the attribute value by an average value, wherein the clustering number isn is the number of records in the data table, k₃Recording the number of data in the cluster; the size of the data can be determined according to the availability requirement of the data, and if the data quality requirement is higher, a smaller value is selected; if the data quality requirement is low, a larger value can be selected, the smaller value and the larger value are set according to the requirement of a client, and the processed data table can be shared to a data requester.

the technical scheme of the invention is further improved as follows: the shared static data table is shared government affair table data, and can also be other data.

due to the adoption of the technical scheme, the invention has the beneficial effects that: the method utilizes a clustering algorithm, a k-anonymity model and a differential privacy technology, adopts multiple modes and multiple effects, and has higher usability and privacy protection degree compared with a classic k-anonymity algorithm MDAV. The k-anonymous model is the most widely applied model, has a simple structure and few limiting conditions, and is designed and realized on the basis of the k-anonymous model in subsequent research, so that the k-anonymous model is convenient for government departments to use. However, the k-anonymous model has the defects of being unable to resist the attack of maximum background knowledge, homogeneous attack, re-identification attack and the like. While differential privacy is able to resist the largest background knowledge attacks, it tends to provide poor data availability. The combination of the two can strengthen the privacy protection degree theoretically, and reduce the privacy disclosure risk. And the clustering algorithm can group the data records, so that the data records with high similarity are distributed to one group to prepare for anonymous processing.

Drawings

FIG. 1 is a variation of the space of a published paper on the privacy problem and the privacy protection problem;

FIG. 2 is a graph of anonymization processing time used at different values of k compared to the MDAV method;

FIG. 3 is a graph of anonymization processing time used at different data volumes compared to the MDAV method of the present invention;

FIG. 4 shows the variation of information loss at different k values compared to the MDAV method;

FIG. 5 is a graph of information entropy change at different k values compared to the MDAV method;

Fig. 6 is a comparison of the different noise addition modes compared to the MDAV method of the present invention with the real values.

Detailed Description

The present invention will be described in further detail with reference to examples.

the invention discloses a privacy protection table data sharing algorithm based on cluster anonymity, which is applied to a shared static data table, such as shared government affair table data, and comprises the following steps:

Step1, clustering: dividing the table data records based on k-media clustering, and clustering the records in the shared static data table by using a k-media clustering algorithm according to the distance between the records in the data table to obtain a plurality of clusters, namely a plurality of data tables; the k-media clustering algorithm is a classic algorithm for solving the clustering problem, has the characteristics of simplicity, quickness, capability of processing a large-scale data set, compactness in clusters and clear distinction among clusters, and is selected as follows.

stepa 1: randomly selecting k clustering samples as initial clustering centers,

Stepa 2: and calculating the distance from each residual sample point to each initial clustering center, and dividing each sample point into the clusters with the shortest distance from the clustering centers to form k clusters.

Stepa 3: the sum of distances from the sample point in each cluster except the center point to other points is calculated, and when the sum of distances is the smallest, the point is selected as a new cluster center.

Stepa 4: if the new clustering center set is different from the original clustering center set, returning to Step2, and if the new clustering center set is the same as the original clustering center set, ending the clustering algorithm.

Step2, anonymization: processing each cluster obtained by the Step1, dividing the data in the cluster according to the information loss amount to make the number of records in each divided cluster at k₂And 2k₂Then, adjusting each obtained cluster to enable each cluster to meet k-anonymity conditions and avoid the condition that sensitive attribute values are completely equal, and finally, generalizing the cluster to generate an anonymous data table;

In 2006, microsoft scholars Dwork first proposed a new privacy protection model, differential privacy. The basic idea is to process data information by adding noise to original data or statistical data and converting the original data, so that when a record is added or deleted, the whole statistical attribute value is not affected, and the privacy protection effect is achieved. The model can relieve the maximum background attack risk, and a privacy protection level quantitative evaluation method is defined.

Definition 1: differential privacy definition. Suppose there are two data sets D and D' that differ by at most one record, i.e. | D Δ D | ≦ 1. And a privacy protection algorithm A, wherein if any output result O of the data on the data sets D and D' processed by the algorithm A meets the following inequality, the A is considered to meet the epsilon-difference privacy.

Pr[A(D)＝O]≤e^ε×Pr[A(D')＝O](formula 11)

Firstly, the algorithm A is guaranteed to meet epsilon-difference privacy from the theoretical point of view, wherein the probability Pr [. cndot. ] is the probability of a certain event; ε is called the privacy budget. The noise mechanism is a main technique for implementing differential privacy protection, and commonly used noise mechanisms are classified into a laplacian mechanism and an exponential mechanism.

In Step1, the core idea of table data record division is: dividing n records in a shared static data table into a plurality of clusters by using a clustering technology, so that the records with high similarity are divided into a group; meanwhile, in order to meet the following k-anonymity requirement, the clusters which do not meet the anonymity requirement are required to be adjusted after clustering is finished, so that the specific flow of table data record division by combining a k-media clustering algorithm is as follows:

Step 11: and (4) normalization processing, wherein the attributes of the data table can be divided into a classification attribute and a numerical attribute according to the attribute value expression form, and the classification attribute can be divided into an ordered classification attribute and an unordered classification attribute. For example, the score levels "excellent", "good", "and", "failing" are ordered classification attributes; gender "male" and "female" are unordered taxonomic attributes. When clustering is carried out, in order to better reflect the distance of data, the insensitive ordered classification type attribute in the data table is quantified according to the sequence and is quantified into the numerical values 1,2,3, DEG, n, then the ordered classification type attribute is regarded as the numerical value attribute for processing, meanwhile, because the value ranges of all the attributes are different, the calculation of the table recording distance is greatly influenced, and therefore the numerical value attribute value and the quantified ordered classification attribute value need to be normalized, and the normalization formula is as follows:

Step 13: according to k-anonymous referenceNumber k₂Performing cluster recording adjustment on clusters which do not meet the anonymity requirement, if the number of data records in the divided clusters is more than k₂If so, no adjustment is performed; if present, the resulting cluster C_iThe number of data records in (2) is less than k₂Then cluster the distance C_ito cluster C_iWhile ensuring that the data record in the cluster where the record is located is still larger than k₂；

step 15: dividing the data into different sub-data tables T1, T2, Tk₁Thereby obtaining k₁see algorithm 3-1 for a tree data sheet.

When the k-media clustering algorithm is used for dividing the records, because the data table contains attributes of two types, namely a classification type attribute and a numerical type attribute, different data distance calculation methods are needed when the distance between the records is calculated, and the problem of the optimal clustering result, namely the optimal number k of divided clusters, needs to be considered when the k-media clustering algorithm is carried out₁the selection process comprises the following steps:

Step121, a calculation formula of the distance between data table records:

dist(x_i,x_j)＝|x_i-x_jL (equation 2)

step122, number of data recording divided clusters k₁Determination of (1):

the k-media clustering algorithm is used for dividing similar records into a group, preparing for anonymization processing and reducing information loss brought by the anonymization process as much as possible, so that the problem of similarity in clusters is mainly considered when determining the number of clustered clusters, and the data record dividing cluster number k is determined through the square error in the group and the residual error Sum of Squares (SSE)₁(ii) a And with k₁the data recording in each cluster will gradually decrease, the distance between the records in the cluster should be smaller and smaller, and therefore, the value of SSE should follow k₁Is increased and decreased; therefore, k is performed by SSE₁When determining the value, take care of the change, when SSE follows k₁Is relatively slow, it is considered that k is further increased₁The clustering effect does not change much, then k is₁The value is the optimal clustering number; if each k is to be₁Is represented in a line graph with the corresponding SSE value, then the corresponding k at the inflection point₁The value is the optimal cluster number.

K obtained by dividing table data record₁The method comprises the following steps of (1) processing sub data tables in turn, wherein the core idea is as follows: the data records in the sub data table are divided so that the number of records in each cluster generated is [ k ]₂,2k₂-1]While ensuring the sensitive attribute fetching in each clusterThe value is not unique, so the specific flow for realizing the anonymous processing algorithm of the table data is as follows:

Step 21: judging whether the number of data records in the data set is more than 2k₂-1, if greater than 2k₂1, then Step22 is executed;

Step 27: obtaining the number of records at [ k₂,2k₂-1]And performing generalization processing on each set to obtain an anonymous data table. Table data anonymization algorithm please see algorithm 3-2.

In Step27, the anonymization rule of the generalization processing is: if the non-sensitive attribute is a numeric attribute, generalizing the non-sensitive attribute to a value range of the attribute in the set where the non-sensitive attribute is located, such as a set {1233} generalizing to [1,3 ]. If the attribute is classified, the attribute is generalized to a complete set of the attribute in a set where the attribute is located, for example, the set { works in private enterprises, works in nationally owned enterprises } is generalized to { works in private enterprises/nationally owned enterprises }.

When the anonymization processing of the table data is carried out, the information loss amount is calculated by referring to an availability measurement formula. When the data recording is adjusted, the information loss can be reduced to the maximum extent and the usability of the data can be increased according to the size of the information loss.

A plurality of equivalence classes are generated after anonymous processing, but since the sensitive attribute value is not processed, an attacker can still deduce the individual sensitive attribute value through background knowledge attack, and privacy disclosure is caused. Therefore, for better protection of individual privacy, differential privacy and noise processing are performed on the sensitive attribute values, and then data sharing is performed. The process of differential privacy and noise addition processing in Step3 is as follows:

(2) If the numerical attributes exist in the sensitive attributes, clustering each numerical attribute respectively, and replacing the attribute value by an average value, wherein the clustering number isn is the number of records in the data table, k₃Recording the number of data in the cluster; the size of the data can be determined according to the availability requirement of the data, and if the data quality requirement is higher, a smaller value is selected; if the data quality requirement is lowlarger values can be selected, and the processed data table can be shared to the data requester. The differential privacy algorithm is shown in algorithm 3-3.

4. Experimental design and results analysis

4.1 data set

In order to verify the usability and effectiveness of a privacy protection table data sharing algorithm based on cluster anonymity, a Philippine family income and expenditure real data set provided by Kaggle is used as an experimental data set for analysis, wherein the data set comprises 41544 pieces of citizen information and relates to a plurality of privacy data such as family information, family income and expenditure specific information, property condition information and the like. The method selects a table consisting of five attributes of owner gender (HouseholdHeadSex), owner marital state (householdheadmaritstatute), owner age (HouseholdHeadAge), family member number (totalNumberoffamillimeter), and owner work type (householdHeadHeadClassof worker) as a data table T needing to be shared. The gender and the marital state of the owner are classified attributes, the number of family members and the age of the owner are numerical attributes, and the working type of the owner is treated as a sensitive attribute. The algorithm provided by the patent is compared with a classical anonymous MDAV algorithm in an experiment, and the performance, the usability and the privacy protection degree of the method are analyzed.

4.2 evaluation of privacy protection algorithms

(1) Privacy metric

In anonymous privacy protection, information entropy is usually adopted to measure the privacy protection degree, and the privacy protection degree is reflected according to the probability distribution of data records in a data table. The larger the entropy value is, the more uniform the probability distribution of the data is, the lower the possibility that an attacker attacks successfully is, and the higher the privacy protection degree is. Conversely, the smaller the entropy value, the lower the degree of privacy protection. The information entropy of the jth sensitive attribute in the equivalence class Ci in the anonymous data table is defined as follows:

Where m is the number of data records in the valence class Ci, n_tthe number of t-th acquirable values of the jth sensitive attribute in the equivalence class Ci. Thus, the average entropy of the equivalence classes in the anonymous data table is defined as follows:

where k is the number of equivalent classes in the anonymous table T and n1 is the number of sensitive attributes.

(2) Usability metric

the amount of information lost is commonly employed in anonymous privacy protection to measure data availability [55 ]. Assuming that the number of numeric attributes is n2 and the number of categorical attributes is n3 in the quasi tag attributes, the information loss amount of the numeric attributes in the equivalence class Ci is defined as follows.

in the formula max (C)_i(A_j) Represents the maximum value of the jth numerical attribute in the equivalence class Ci, min (C)_i(A_j) Represents the minimum value of the jth numeric attribute in the equivalence class Ci, max (A)_j) Represents the maximum value of the jth numerical attribute in the data table, min (A)_j) Represents the minimum value of the jth numerical attribute, | C, in the data table_iAnd | represents the number of pieces recorded in the equivalence class Ci. Therefore, on the data table T, the information loss amount of all the numerical attributes after anonymization processing can be written as

where k is the number of equivalence classes, and the information loss amount of the classification attribute in the equivalence class Ci is defined as follows.

In the formula h (C)_i(B_j) Represents the number of different values of the jth categorical attribute in the equivalence class Ci, h (B)_j) Representing the number of different values of the jth classification type attribute in the data table T. Therefore, in the data table T, the information loss amount of all the classified attributes after anonymization processing can be written as

In summary, when the data record number of the data table is N, the average information loss amount of the anonymous table T is N

4.3 analysis of the results of the experiment

4.3.1 Effect of different conditions on anonymization processing time

the time used by the method mainly comes from anonymous processing, and the average anonymous time can reflect the performance of the algorithm. The k value of the parameter required by k-anonymity and the size of the data volume influence the anonymity processing time. And the anonymity processing time mainly comes from the anonymity group construction and anonymity group adjustment process. As the value of k increases, the average anonymization processing time gradually increases, as shown in fig. 2. And the average anonymization processing time gradually increases as the data amount increases, as shown in fig. 3. This is because the number of iterations and the number of adjustments may increase as the data increases, so that the processing time gradually increases when anonymization processing is performed, but the anonymization processing time tends to increase gradually because the data distribution density increases as the data amount increases, so that anonymization of the data is easier. Meanwhile, as can be seen from fig. 2 and 3, the method disclosed in this patent has a smaller processing time than the MDAV method under different k values and different data amounts. This is because, in the method proposed in this patent, the table records are clustered first, and similar data records are divided into one cluster, which lays the foundation for data anonymization processing.

4.3.2k value impact on information availability

The amount of information lost is a measure of the availability of data. The larger the information loss amount is, the lower the data availability is, whereas the smaller the information loss amount is, the higher the data availability is. This patent is handled original data set, chooses different k values to carry out the experiment to observe its information loss volume's situation of change, as shown in fig. 4. It can be seen that as the value of k increases, the amount of information lost gradually increases, i.e., the data availability gradually decreases. This is because the data volume of the data set is large, and when k takes a small value, the difference between a plurality of equivalence classes is small. As the k value gradually increases, equivalence classes having smaller differences gradually merge, and thus the information loss amount changes less. However, when the difference between the equivalence classes is large, the k value increases again, so that the amount of information loss varies greatly. Meanwhile, as can be seen from fig. 4, the information loss amounts of the MDAV method and the method proposed by this patent are not very different, because, in the method proposed by this patent, although the sensitive attribute values in the equivalence classes are constrained during the anonymous processing, so that the sensitive attribute values in the equivalence classes are not all the same, which increases the information loss, during the anonymous group construction, the information loss amount is selected for processing, so that the information loss amount of the anonymous group is small, and therefore, the information loss amounts caused by the two methods are not very different.

4.3.3k value influence on privacy protection

information entropy is a measure of the degree of privacy protection. As shown in fig. 5, as the value of k increases, the information entropy increases gradually, i.e., the degree of data privacy protection increases gradually. Meanwhile, compared with the MDAV method, the method provided by the patent has larger information entropy because the sensitive attributes in the equivalence classes are restricted, so that the sensitive attribute values in each equivalence class are not all equal, and the information entropy of the data table is increased.

4.3.4 statistical value availability analysis

In order to illustrate the usability of the data after the differential privacy processing, the original data, the data after the differential privacy noise processing of the patent and the data after the original frequency differential privacy noise processing are respectively compared. Wherein ε is 1. As can be seen from tables 4 to 8 and fig. 6, the difference between the three is small, and the data has a certain usability.

TABLE 4-8 statistical value comparison

Summary of the invention

The patent provides a privacy protection table data sharing method based on cluster anonymity aiming at a scene of sharing government affair table data. Firstly, clustering records in a table through a k-media clustering algorithm to obtain a plurality of data tables; then, anonymizing each data sheet according to the information loss amount to generate an anonymous data sheet; and finally adding noise to the sensitive attribute value. The usability and privacy of the algorithm proposed in this patent are demonstrated by example analysis and comparison with the classical k-anonymous algorithm MDAV.

Claims

1. a privacy protection table data sharing algorithm based on cluster anonymity is characterized in that: the method is applied to the shared static data table and comprises the following steps:

Step1, clustering: dividing the table data records based on k-media clustering, and clustering the records in the shared static data table by using a k-media clustering algorithm according to the distance between the records in the data table to obtain a plurality of clusters;

Step2, anonymization treatment: processing each cluster obtained through Step1, firstly, dividing data in the cluster according to information loss, then, adjusting each obtained cluster to enable each cluster to meet k-anonymity conditions and have no condition that sensitive attribute values are completely equal, and finally, generalizing the clusters to generate an anonymous data table;

Step3, difference privacy and noise adding: carrying out differential privacy processing on the sensitive attribute values in the table data;

2. The cluster anonymity based privacy preserving table data sharing algorithm of claim 1, wherein: in Step1, the core idea of table data record division is as follows: dividing n records in a shared static data table into a plurality of clusters by using a clustering technology, so that the records with high similarity are divided into a group; meanwhile, in order to meet the following k-anonymity requirement, the clusters which do not meet the anonymity requirement are required to be adjusted after clustering is finished, so that the specific flow of table data record division by combining a k-media clustering algorithm is as follows:

Step 11: normalization processing, namely quantizing the non-sensitive ordered classification type attributes in the data table into numerical values 1,2,3, …, n, then treating the ordered classification type attributes as numerical value attributes, and further normalizing the numerical value attribute data in all the non-sensitive attributes in the data table, wherein a normalization formula is as follows:

In the formula, x_i' is a normalized value of a numerical attribute, x_iIs an original value of a numerical attribute, x_minIs the minimum value of the property, x_maxis the maximum value of the attribute;

Step 14: repeating the Step3 until the record in each cluster is more than or equal to k₂；

step 15: dividing the data into different sub-data tables T1, T2, … and Tk according to the belonged clusters₁Thereby obtaining k₁A sheet data table.

3. The cluster anonymity based privacy preserving table data sharing algorithm of claim 2, wherein: in Step12, when the k-media clustering algorithm is used to partition the records, because the data table contains attributes of two types, namely, a classification type attribute and a numerical type attribute, different data distance calculation methods need to be adopted when calculating the distance between the records, and the problem of the optimal clustering result, namely the optimal number k of the partitioned clusters, needs to be considered when the k-media clustering algorithm is performed₁The selection process comprises the following steps:

step121, a calculation formula of the distance between data table records:

dist(x_i,x_j)＝|x_i-x_jl (equation 2)

assume that there are m numeric attributes and n categorical attributes in the data table, therefore, any two records X in the data table_i、X_jThe distance calculation formula of (2) is as in formula 4:

step122, number of data recording divided clusters k₁determination of (1):

the k-media clustering algorithm is used for dividing similar records into a group, preparing for anonymization processing and reducing information loss brought by the anonymization process as much as possible, so that the similarity problem in the cluster is mainly considered when the cluster number of the cluster is determined, and the data record dividing cluster number k is determined through the mean square error and SSE in the group₁(ii) a And with k₁the data records in each cluster will gradually decrease, the distance between the records in the cluster should be smaller and smaller, and therefore the value of SSE should follow k₁Is increased and decreased; therefore, k is performed by SSE₁When determining the value, take care of the change, when SSE follows k₁Is relatively slow, it is considered that k is further increased₁The clustering effect does not change much, then k is₁the value is the optimal clustering number; if each k is to be₁Is represented in a line graph with the corresponding SSE value, then the corresponding k at the inflection point₁The value is the optimal cluster number.

4. The cluster anonymity based privacy preserving table data sharing algorithm of claim 3, wherein: in Step2, k obtained by dividing the table data record₁The method comprises the following steps of (1) processing each sub data table in sequence, wherein the core idea is as follows: the data records in the sub data table are divided so that the number of records in each cluster generated is [ k ]₂,2k₂-1]Meanwhile, the sensitive attribute value in each cluster is not unique, so the specific flow for realizing the table data anonymity processing algorithm is as follows:

Step 21: judging whether the number of data records in the data set is more than 2k₂-1, if greater than 2k₂1, executing Step 22;

Step 22：Two records r are selected from the data set₁And r₂as two initial clusters, such that when r₁And r₂When a cluster is formed, the information loss amount is the largest in the combination of all records in the cluster, and Step23 is executed;

Step 25: selecting data records with different sensitive attribute values from the data set Q, and simultaneously ensuring that the number of the data records in the data set is more than or equal to k if the data records are deleted₂and the sensitive attribute value is not unique;

5. The cluster anonymity based privacy preserving table data sharing algorithm of claim 4, wherein: in Step27, the anonymization rule of the generalization processing is: if the non-sensitive attribute is a numerical attribute, generalizing the non-sensitive attribute into a value range of the attribute in a set where the non-sensitive attribute is located; if the attribute is classified, the attribute is generalized to the complete set of the attribute in the set in which the attribute is located.

6. the cluster anonymity based privacy preserving table data sharing algorithm of claim 5, wherein: the differential privacy and noise processing process for the anonymous data table in Step3 comprises the following steps:

(2) If the numerical attributes exist in the sensitive attributes, clustering each numerical attribute respectively, and replacing the attribute value with an average value, wherein the clustering number isn is the number of records in the data table, k₃Recording the number of data in the cluster; the size of the data can be determined according to the availability requirement of the data, and if the data quality requirement is higher, a smaller value is selected; if the data quality requirement is low, a larger value can be selected, and the processed data table can be shared to the data requester.

7. The cluster anonymity based privacy preserving table data sharing algorithm of claim 6, wherein: the shared static data table is shared government affairs table data.