CN110555316B - Privacy protection table data sharing method based on cluster anonymity - Google Patents


Info

Publication number
CN110555316B
CN110555316B (application CN201910752801.6A)
Authority
CN
China
Prior art keywords
data
records
attribute
cluster
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910752801.6A
Other languages
Chinese (zh)
Other versions
CN110555316A (en)
Inventor
刘丽苹
朴春慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Chick Information Technology Co ltd
Original Assignee
Shijiazhuang Tiedao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shijiazhuang Tiedao University
Priority to CN201910752801.6A
Publication of CN110555316A
Application granted
Publication of CN110555316B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention relates to a privacy protection table data sharing method based on cluster anonymity. Records in a table are first clustered with a k-medoids clustering algorithm to obtain a plurality of data tables; each data table is then anonymized according to its information loss to generate an anonymous data table; finally, noise is added to the sensitive attribute values in the anonymous data table. The method is verified by comparison with the classic k-anonymity algorithm MDAV (Maximum Distance to Average Vector), which demonstrates its usability and privacy, so it has high popularization and application value.

Description

Privacy protection table data sharing method based on cluster anonymity
Technical Field
The patent application belongs to the technical field of privacy protection, and particularly relates to a privacy protection table data sharing method based on cluster anonymity.
Background
With the construction and development of digital government, government affairs data have grown steadily, becoming larger in scale, more varied in type, and increasingly diverse and complex. For a long time, 'information islands' and 'data barriers' have been common, preventing the value of data from being fully realized. Government data sharing transfers government information from one department to another, alleviating the data-island phenomenon, letting data deliver its maximum value, and improving the quality of government services. Table data sharing is one of the important forms of government data sharing.
Generally, 'privacy' refers to information that a data owner is unwilling to let others obtain. The development of information technology inevitably increases the possibility of data information leakage, which in turn limits that development, so privacy issues have attracted growing concern. To reflect more intuitively how much attention all sectors pay to 'privacy' and 'privacy protection', the inventors measured the number of related papers published each year: they searched CNKI with 'privacy' as a topic keyword, then searched within those results with 'privacy protection' as the topic keyword, and from the results plotted the change in the number of papers published each year since 1990, as shown in fig. 1.
As can be seen from fig. 1, interest in the privacy problem and the privacy protection problem has grown rapidly since 2003. In recent years, attention to privacy protection has accounted for roughly half of the overall attention to privacy.
On this basis, a protection method is needed that ensures both data availability and data privacy. A comparative experimental analysis against the traditional anonymization algorithm MDAV shows that the proposed privacy protection method improves algorithm efficiency and provides effective privacy protection.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a privacy protection table data sharing method based on cluster anonymity, which avoids the defects described above and provides effective privacy protection.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
A privacy protection table data sharing method based on cluster anonymity is applied to a shared static data table and includes:
Step 1, clustering: divide the table data records based on k-medoids clustering. According to the distance between records in the data table, cluster the records in the shared static data table with a k-medoids clustering algorithm to obtain a plurality of clusters, i.e. a plurality of data tables;
Step 2, anonymization: process each cluster obtained in Step 1. First divide the data in the cluster according to information loss, then adjust each resulting cluster so that it satisfies the k-anonymity condition and contains no cluster whose sensitive attribute values are all equal, and finally generalize the clusters to generate an anonymous data table;
Step 3, differential privacy and noise adding: apply differential privacy processing to the sensitive attribute values in the table data;
Step 4, comparative verification: finally, verify the usability and privacy of the method through example analysis and comparison with the classic k-anonymity algorithm MDAV.
The technical scheme of the invention is further improved as follows: in Step 1, the core idea of table data record division is to divide the n records in the shared static data table into a plurality of clusters using clustering, so that records with high similarity are grouped together; at the same time, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering finishes. The specific flow of table data record division combined with the k-medoids clustering algorithm is as follows:
Step 11: normalization. Quantize the non-sensitive ordered categorical attributes in the data table into numerical values 1, 2, 3, ..., n and then treat them as numeric attributes; further normalize the numeric attribute data among all non-sensitive attributes in the data table. The normalization formula is:

x_i' = (x_i - x_min) / (x_max - x_min)   (formula 1)

where x_i' is the normalized value of a numeric attribute, x_i is the original value, x_min is the minimum value of the attribute, and x_max is its maximum value;
Step 12: according to the distance between the non-sensitive attributes of the table data records, cluster the table data with a k-medoids clustering algorithm, dividing the records into k1 clusters;
Step 13: according to the k-anonymity parameter k2, adjust the records of clusters that do not meet the anonymity requirement. If the number of data records in a divided cluster is greater than k2, no adjustment is made; if a cluster C_i contains fewer than k2 records, move the records nearest to C_i into C_i, while ensuring that each cluster a record is taken from still holds more than k2 records;
Step 14: repeat Step 13 until every cluster contains at least k2 records;
Step 15: divide the data into different sub data tables T1, T2, ..., Tk1 according to the cluster each record belongs to, thereby obtaining k1 data tables.
The technical scheme of the invention is further improved as follows: in Step 12, when the k-medoids clustering algorithm is used to divide the records, the data table contains both categorical and numeric attributes, so different distance calculation methods are needed when computing the distance between records; in addition, the optimal clustering result, i.e. the optimal number of clusters k1, must be considered when running the k-medoids clustering algorithm. The selection process is as follows:
Step 121, the distance between data table records:
Because several attributes exist in the data table, different attribute types are calculated separately. The numeric attribute distance is given by formula 2:

dist(x_i, x_j) = |x_i - x_j|   (formula 2)

The categorical attribute distance is given by formula 3:

dist(x_i, x_j) = 0 if x_i = x_j, and 1 otherwise   (formula 3)

Assume there are m numeric attributes and n categorical attributes in the data table; then the distance between any two records X_i and X_j is given by formula 4:

dist(X_i, X_j) = Σ_{p=1..m} dist(x_ip, x_jp) + Σ_{q=1..n} dist(x_iq, x_jq)   (formula 4)

where x_ip and x_jp are the p-th numeric attribute values of records X_i and X_j, and x_iq and x_jq are their q-th categorical attribute values;
Step 122, determination of the number of clusters k1:
The k-medoids clustering algorithm divides similar records into one group, preparing for anonymization and reducing the information loss of the anonymization process as far as possible; therefore within-cluster similarity is the main consideration when determining the number of clusters, and k1 is determined through the within-cluster sum of squared errors (SSE). As k1 grows, the number of records in each cluster gradually decreases and the distances between records within a cluster become smaller and smaller, so the SSE decreases as k1 increases. When determining k1 by SSE, watch how it changes: once the decrease of SSE with k1 becomes relatively slow, further increasing k1 changes the clustering effect little, and that k1 is taken as the optimal number of clusters. If each k1 and its corresponding SSE value are plotted on a line graph, the k1 at the inflection point is the optimal number of clusters.
The technical scheme of the invention is further improved as follows: in Step 2, each of the k1 sub data tables obtained by dividing the table data records is processed in turn. The core idea is to divide the data records in a sub data table so that the number of records in each generated cluster lies in [k2, 2k2-1] while the sensitive attribute value in each cluster is not unique. The specific flow of the table data anonymization algorithm is:
Step 21: judge whether the number of data records in the data set is greater than 2k2-1; if so, execute Step 22;
Step 22: select two records r1 and r2 in the data set as two initial clusters, such that of all pairs of records in the data set, r1 and r2 would produce the largest information loss if placed in one cluster; execute Step 23;
Step 23: for each record in the data set, calculate the change in information loss when it is assigned to either of the two clusters and assign it to the cluster with the smaller information loss; then adjust the records so that each cluster holds at least k2 records, and return the two generated clusters to Step 21 as two new data sets;
Step 24: when the number of data records in every data set lies in [k2, 2k2-1], check each data set in turn for a sensitive attribute with a unique value; if one exists, execute Step 25;
Step 25: select data records whose sensitive attribute values differ from that of the data set Q, ensuring that if such a record is removed from its own data set, that set still holds at least k2 records and its sensitive attribute values remain non-unique;
Step 26: calculate the change in information loss when each selected record is assigned to the data set Q, and move into Q the record with the smaller information loss;
Step 27: obtain the sets whose record counts lie in [k2, 2k2-1] and generalize each set to obtain the anonymous data table.
The technical scheme of the invention is further improved as follows: in Step 27, the anonymization rule for generalization is: if a non-sensitive attribute is numeric, generalize it to the value range of that attribute within its set; if it is categorical, generalize it to the full set of that attribute's values within its set.
The technical scheme of the invention is further improved as follows: the process of performing differential privacy and noise processing on the anonymous data table in Step 3 is as follows:
(1) If a categorical sensitive attribute exists among the sensitive attributes, carry out frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequencies, add a Num column at the corresponding position, and record the noised data;
(2) If numeric attributes exist among the sensitive attributes, cluster each numeric attribute separately and replace every attribute value with its cluster average. The number of clusters is

⌈n / k3⌉

where n is the number of records in the data table and k3 is the number of data records per cluster. k3 can be chosen according to the availability requirement on the data: if the data quality requirement is high, choose a smaller value; if it is low, a larger value can be chosen; the value is set according to the customers' needs. The processed data table can then be shared with the data requester.
The technical scheme of the invention is further improved as follows: the shared static data table is, for example, shared government affairs table data, but may be other data as well.
Due to the adoption of the technical scheme, the invention has the following beneficial effects: the method combines a clustering algorithm, the k-anonymity model, and differential privacy, and compared with the classic k-anonymity algorithm MDAV it achieves higher usability and a higher degree of privacy protection. The k-anonymity model is the most widely applied model; it has a simple structure and few constraints, much subsequent research is designed and implemented on top of it, and it is convenient for government departments to use. However, the k-anonymity model cannot resist maximal-background-knowledge attacks, homogeneity attacks, re-identification attacks, and the like. Differential privacy can resist maximal-background-knowledge attacks but tends to provide poor data availability. Combining the two strengthens the degree of privacy protection in theory and reduces the risk of privacy disclosure. The clustering algorithm groups the data records so that highly similar records are assigned to one group in preparation for anonymization.
Drawings
FIG. 1 shows the change in the number of published papers on the privacy and privacy protection issues;
FIG. 2 is a graph of anonymization processing time used at different values of k compared to the MDAV method;
FIG. 3 is a graph of anonymization processing time used at different data volumes compared to the MDAV method of the present invention;
FIG. 4 shows the variation of information loss at different k values compared to the MDAV method;
FIG. 5 is the information entropy change at different k values compared to the MDAV method of the present invention;
fig. 6 is a comparison of the different noise addition modes compared to the MDAV method of the present invention with the real values.
Detailed Description
The present invention will be described in further detail with reference to examples.
The invention discloses a privacy protection table data sharing method based on cluster anonymity, which is applied to a shared static data table, such as shared government affair table data, and comprises the following steps:
Step 1, clustering: divide the table data records based on k-medoids clustering. According to the distance between records in the data table, cluster the records in the shared static data table with a k-medoids clustering algorithm to obtain a plurality of clusters, i.e. a plurality of data tables. The k-medoids clustering algorithm is a classic algorithm for the clustering problem; it is simple and fast, can handle large-scale data sets, and yields compact clusters with clear separation between clusters. Its flow is as follows.
Step a1: randomly select k cluster samples as the initial cluster centers.
Step a2: calculate the distance from every remaining sample point to each initial cluster center and assign each sample point to the cluster whose center is nearest, forming k clusters.
Step a3: for each cluster, compute for every non-center sample point the sum of its distances to the other points in the cluster; the point with the smallest sum becomes the new cluster center.
Step a4: if the new set of cluster centers differs from the previous one, return to Step a2; if it is the same, the clustering algorithm ends.
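Steps a1 to a4 can be sketched in Python as follows; this is a minimal illustration (all function and variable names are the editor's, not the patent's), parameterized by a generic record-distance function:

```python
import random

def k_medoids(points, k, dist, max_iter=100, seed=0):
    """Minimal k-medoids sketch following steps a1-a4 above."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)       # a1: random initial centers
    clusters = {}
    for _ in range(max_iter):
        # a2: assign each point to the cluster of its nearest medoid
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # a3: in each cluster, pick the member minimizing the sum of
        # distances to the other members as the new center
        new_medoids = [
            min(members,
                key=lambda c: sum(dist(points[c], points[o]) for o in members))
            for members in clusters.values()
        ]
        # a4: stop when the medoid set no longer changes
        if set(new_medoids) == set(medoids):
            return medoids, clusters
        medoids = new_medoids
    return medoids, clusters
```

For tabular records, `dist` would be the mixed numeric/categorical record distance described in Step 121.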
Step 2, anonymization: process each cluster obtained in Step 1. First divide the data in the cluster according to information loss, so that the number of records in every divided cluster lies between k2 and 2k2-1; then adjust each resulting cluster so that it satisfies the k-anonymity condition and contains no cluster whose sensitive attribute values are all equal; finally generalize the clusters to generate an anonymous data table;
Step 3, differential privacy and noise adding: apply differential privacy processing to the sensitive attribute values in the table data;
in 2006, microsoft scholars Dwork first proposed a new privacy protection model, differential privacy. The basic idea is to process data information by adding noise to original data or statistical data and converting the original data, so that when a record is added or deleted, the whole statistical attribute value is not affected, and the privacy protection effect is achieved. The model can relieve the maximum background attack risk, and a privacy protection level quantitative evaluation method is defined.
Definition 1: differential privacy. Suppose two data sets D and D' differ by at most one record, i.e. |D Δ D'| ≤ 1, and let A be a privacy protection algorithm. If for any output result O obtained by running A on D and D' the following inequality holds, A is said to satisfy ε-differential privacy:

Pr[A(D) = O] ≤ e^ε × Pr[A(D') = O]   (formula 11)

This guarantees from a theoretical standpoint that algorithm A satisfies ε-differential privacy, where Pr[·] denotes the probability of an event and ε is called the privacy budget. Noise mechanisms are the main technique for implementing differential privacy protection; the commonly used ones are the Laplace mechanism and the exponential mechanism.
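A minimal sketch of the Laplace mechanism for a count query (a count has sensitivity 1) is shown below; the function names and the exponential-difference sampling trick are the editor's choices, not the patent's:

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise: the difference of two i.i.d.
    exponential variables with mean `scale` is Laplace-distributed."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng=None):
    """Count query under the Laplace mechanism: a count has sensitivity 1,
    so adding Laplace(0, 1/epsilon) noise satisfies epsilon-differential
    privacy for the count."""
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller ε (privacy budget) yields a larger noise scale 1/ε and therefore stronger protection but lower accuracy.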
Step 4, comparative verification: finally, the usability and the privacy verification of the method are carried out through example analysis and comparison with a classic k-anonymity algorithm MDAV.
In Step 1, the core idea of table data record division is to divide the n records in the shared static data table into a plurality of clusters using clustering, so that records with high similarity are grouped together; at the same time, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering finishes. The specific flow of table data record division combined with the k-medoids clustering algorithm is as follows:
Step 11: normalization. According to the representation of attribute values, the attributes of a data table can be divided into categorical attributes and numeric attributes, and categorical attributes further into ordered and unordered ones. For example, the score levels 'excellent', 'good', 'pass', 'fail' are ordered categorical attributes, while the genders 'male' and 'female' are unordered categorical attributes. To better reflect the distances between data during clustering, the non-sensitive ordered categorical attributes in the data table are quantized in order into numerical values 1, 2, 3, ..., n and then treated as numeric attributes. At the same time, because the value ranges of the attributes differ and would otherwise greatly distort the record distance calculation, the numeric attribute values and the quantized ordered categorical values are normalized with:

x_i' = (x_i - x_min) / (x_max - x_min)   (formula 1)

where x_i' is the normalized value of a numeric attribute, x_i is the original value, x_min is the minimum value of the attribute, and x_max is its maximum value;
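The quantization and normalization of Step 11 can be sketched as follows; the helper name and the example grade mapping are the editor's illustrations, not part of the patent:

```python
def min_max_normalize(values):
    """Formula 1: x' = (x - x_min) / (x_max - x_min), mapping one
    numeric attribute column into [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:                        # constant column: map to 0.0
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

# Ordered categorical attributes are first quantized 1, 2, 3, ..., n in
# their natural order and then normalized the same way, e.g.:
grade_order = {"fail": 1, "pass": 2, "good": 3, "excellent": 4}
```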
Step 12: according to the distance between the non-sensitive attributes of the table data records, cluster the table data with a k-medoids clustering algorithm, dividing the records into k1 clusters;
Step 13: according to the k-anonymity parameter k2, adjust the records of clusters that do not meet the anonymity requirement. If the number of data records in a divided cluster is greater than k2, no adjustment is made; if a cluster C_i contains fewer than k2 records, move the records nearest to C_i into C_i, while ensuring that each cluster a record is taken from still holds more than k2 records;
Step 14: repeat Step 13 until every cluster contains at least k2 records;
Step 15: divide the data into different sub data tables T1, T2, ..., Tk1 according to the cluster each record belongs to, thereby obtaining k1 data tables; see Algorithm 3-1.
[Algorithm 3-1 (table data record division) appears as an image in the original publication and is not reproduced here.]
When the k-medoids clustering algorithm is used to divide the records, the data table contains both categorical and numeric attributes, so different distance calculation methods are needed when computing the distance between records; in addition, the optimal clustering result, i.e. the optimal number of clusters k1, must be considered when running the k-medoids clustering algorithm. The selection process is as follows:
Step 121, the distance between data table records:
Because several attributes exist in the data table, different attribute types are calculated separately. The numeric attribute distance is given by formula 2:

dist(x_i, x_j) = |x_i - x_j|   (formula 2)

The categorical attribute distance is given by formula 3:

dist(x_i, x_j) = 0 if x_i = x_j, and 1 otherwise   (formula 3)

Assume there are m numeric attributes and n categorical attributes in the data table; then the distance between any two records X_i and X_j is given by formula 4:

dist(X_i, X_j) = Σ_{p=1..m} dist(x_ip, x_jp) + Σ_{q=1..n} dist(x_iq, x_jq)   (formula 4)

where x_ip and x_jp are the p-th numeric attribute values of records X_i and X_j, and x_iq and x_jq are their q-th categorical attribute values;
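The per-attribute distances (absolute difference for numeric attributes, a 0/1 mismatch indicator for categorical ones) and their combination can be sketched as follows; the simple per-attribute sum in `record_distance` is the editor's reading of formula 4, and the function names are the editor's:

```python
def attr_distance(a, b):
    """Formula 2 for numeric values, formula 3 for categorical values."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)          # formula 2: absolute difference
    return 0.0 if a == b else 1.0  # formula 3: 0/1 mismatch indicator

def record_distance(x, y):
    """Formula 4 (as read here): sum of per-attribute distances over the
    numeric and categorical non-sensitive attributes of two records."""
    return sum(attr_distance(a, b) for a, b in zip(x, y))
```

This assumes the numeric attributes have already been normalized (Step 11), so all per-attribute distances are on a comparable scale.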
Step 122, determination of the number of clusters k1:
The k-medoids clustering algorithm divides similar records into one group, preparing for anonymization and reducing the information loss caused by the anonymization process as far as possible; therefore within-cluster similarity is the main consideration when determining the number of clusters, and k1 is determined through the within-cluster sum of squared errors (SSE). As k1 grows, the number of records in each cluster gradually decreases and the distances between records within a cluster become smaller and smaller, so the SSE decreases as k1 increases. When determining k1 by SSE, watch how it changes: once the decrease of SSE with k1 becomes relatively slow, further increasing k1 changes the clustering effect little, and that k1 is taken as the optimal number of clusters. If each k1 and its corresponding SSE value are plotted on a line graph, the k1 at the inflection point is the optimal number of clusters.
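The SSE-based selection of k1 can be sketched as follows; the "bend" measure used here (the largest drop in marginal SSE gain) is one common reading of the inflection-point heuristic, not the patent's exact rule, and all names are the editor's:

```python
def sse(clusters, centers, dist):
    """Within-cluster sum of squared distances to each cluster center."""
    total = 0.0
    for c, members in zip(centers, clusters):
        total += sum(dist(c, m) ** 2 for m in members)
    return total

def pick_k_by_elbow(sse_by_k):
    """Pick the inflection point: the k after which the SSE decrease
    slows down the most (largest drop in marginal gain)."""
    ks = sorted(sse_by_k)
    best_k, best_bend = ks[1], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev_k] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```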
Each of the k1 sub data tables obtained by dividing the table data records is processed in turn. The core idea is to divide the data records in a sub data table so that the number of records in each generated cluster lies in [k2, 2k2-1] while the sensitive attribute value in each cluster is not unique. The specific flow of the table data anonymization algorithm is:
Step 21: judge whether the number of data records in the data set is greater than 2k2-1; if so, execute Step 22;
Step 22: select two records r1 and r2 in the data set as two initial clusters, such that of all pairs of records in the data set, r1 and r2 would produce the largest information loss if placed in one cluster; execute Step 23;
Step 23: for each record in the data set, calculate the change in information loss when it is assigned to either of the two clusters and assign it to the cluster with the smaller information loss; then adjust the records so that each cluster holds at least k2 records, and return the two generated clusters to Step 21 as two new data sets;
Step 24: when the number of data records in every data set lies in [k2, 2k2-1], check each data set in turn for a sensitive attribute with a unique value; if one exists, execute Step 25;
Step 25: select data records whose sensitive attribute values differ from that of the data set Q, ensuring that if such a record is removed from its own data set, that set still holds at least k2 records and its sensitive attribute values remain non-unique;
Step 26: calculate the change in information loss when each selected record is assigned to the data set Q, and move into Q the record with the smaller information loss;
Step 27: obtain the sets whose record counts lie in [k2, 2k2-1] and generalize each set to obtain the anonymous data table. For the table data anonymization algorithm, see Algorithm 3-2.
[Algorithm 3-2 (table data anonymization) appears as an image in the original publication and is not reproduced here.]
In Step 27, the anonymization rule for generalization is: if a non-sensitive attribute is numeric, it is generalized to the value range of that attribute within its set; for example, the set {1, 2, 3} is generalized to [1, 3]. If the attribute is categorical, it is generalized to the full set of that attribute's values within its set; for example, the set {works in a private enterprise, works in a family enterprise} is generalized to {works in a private enterprise/works in a family enterprise}.
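The generalization rule of Step 27 can be sketched as follows; the dictionary-based record format and the function name are the editor's assumptions:

```python
def generalize_cluster(records, numeric_attrs):
    """Generalize one cluster: numeric non-sensitive attributes become the
    [min, max] range within the cluster; categorical ones become the set
    of all values occurring in the cluster (joined with '/')."""
    generalized = {}
    for attr in records[0].keys():
        column = [r[attr] for r in records]
        if attr in numeric_attrs:
            generalized[attr] = (min(column), max(column))   # e.g. [1, 3]
        else:
            generalized[attr] = "/".join(sorted(set(column)))
    # every record in the cluster receives the same generalized values
    return [dict(generalized) for _ in records]
```

After this step all records of a cluster form one equivalence class and are indistinguishable on their non-sensitive attributes.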
During the anonymization of the table data, the information loss is calculated with reference to an availability measurement formula. When data records are adjusted, choosing assignments according to the size of the information loss reduces the loss as much as possible and increases the usability of the data.
A plurality of equivalence classes are generated after anonymization, but since the sensitive attribute values are not processed, an attacker can still infer an individual's sensitive attribute value through a background-knowledge attack, causing privacy disclosure. Therefore, to better protect individual privacy, differential privacy and noise processing are applied to the sensitive attribute values before the data are shared. The differential privacy and noise adding process in Step 3 is as follows:
(1) If a categorical attribute exists among the sensitive attributes, carry out frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequencies, add a Num column at the corresponding position, and record the noised data;
(2) If numeric attributes exist among the sensitive attributes, cluster each numeric attribute separately and replace every attribute value with its cluster average. The number of clusters is

⌈n / k3⌉

where n is the number of records in the data table and k3 is the number of data records per cluster. k3 can be chosen according to the availability requirement on the data: if the data quality requirement is high, choose a smaller value; if it is low, a larger value can be chosen; the value is set according to the data requester's needs. The processed data table can then be shared with the data requester. For the differential privacy algorithm, see Algorithm 3-3.
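Step (1) above, frequency statistics over combinations of sensitive values plus Laplace noise, can be sketched as follows (a sketch under the assumption that a count query has sensitivity 1 and the noise scale is 1/ε; all names are illustrative):

```python
import math
import random
from collections import Counter

def laplace_noise(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_frequencies(records, epsilon=1.0):
    """Frequency of each sensitive-value combination, plus Laplace(1/eps)
    noise; a count query has sensitivity 1, so scale = 1/eps."""
    counts = Counter(records)
    return {k: v + laplace_noise(1.0 / epsilon) for k, v in counts.items()}

rows = [("married", "private"), ("single", "private"), ("married", "private")]
print(noisy_frequencies(rows, epsilon=1.0))
```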
4. Experimental design and results analysis
4.1 data set
To verify the usability and effectiveness of the cluster-anonymity-based privacy-preserving table data sharing algorithm, the real Family Income and Expenditure data set for the Philippines provided by Kaggle is used as the experimental data set. It contains 41,544 citizen records and covers several kinds of private data, such as family information, detailed family income and expenditure, and property status. A table composed of five attributes (Household Head Sex, Household Head Marital Status, Household Head Age, Total Number of Family Members, and Household Head Class of Worker) is selected as the data table T to be shared. The sex and marital status of the household head are categorical attributes, the number of family members and the age of the household head are numerical attributes, and the class of worker of the household head is treated as the sensitive attribute. The algorithm proposed in this patent is compared experimentally with the classical anonymization algorithm MDAV, and its performance, availability, and degree of privacy protection are analyzed.
4.2 privacy protection Algorithm evaluation
(1) Privacy measure
In anonymity-based privacy protection, information entropy is usually used to measure the degree of privacy protection, which is reflected by the probability distribution of the data records in the data table. The larger the entropy, the more uniform the distribution, the lower the chance of a successful attack, and the higher the degree of privacy protection; conversely, the smaller the entropy, the lower the degree of privacy protection. The information entropy of the jth sensitive attribute in equivalence class Ci of the anonymous data table is defined as follows:
$$H(C_i, j) = -\sum_{t} \frac{n_t}{m} \log_2 \frac{n_t}{m}$$
where m is the number of data records in equivalence class Ci and $n_t$ is the number of occurrences of the t-th value of the jth sensitive attribute in Ci. The average entropy of the equivalence classes in the anonymous data table is then defined as follows:
$$\overline{H} = \frac{1}{k \cdot n_1} \sum_{i=1}^{k} \sum_{j=1}^{n_1} H(C_i, j)$$
where k is the number of equivalence classes in the anonymous table T and $n_1$ is the number of sensitive attributes.
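The two entropy definitions can be computed directly; the sketch below assumes equivalence classes are given as lists of record tuples and sensitive attributes are addressed by index (names are illustrative, not from the patent):

```python
import math
from collections import Counter

def attribute_entropy(values):
    """H = -sum_t (n_t/m) * log2(n_t/m) over the values one sensitive
    attribute takes inside an equivalence class."""
    m = len(values)
    return -sum((n / m) * math.log2(n / m) for n in Counter(values).values())

def average_entropy(classes, sensitive_idx):
    """Mean entropy over all equivalence classes and sensitive attributes."""
    total = sum(attribute_entropy([r[j] for r in cls])
                for cls in classes for j in sensitive_idx)
    return total / (len(classes) * len(sensitive_idx))

print(attribute_entropy(["a", "b", "a", "b"]))  # uniform two values: 1 bit
print(attribute_entropy(["a", "a", "a"]))       # single value: zero entropy
```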
(2) Usability metric
The amount of information loss is commonly used in anonymity-based privacy protection to measure data availability [55]. Assuming the quasi-identifier attributes contain $n_2$ numerical attributes and $n_3$ categorical attributes, the information loss of the numerical attributes in equivalence class Ci is defined as follows.
$$IL_{num}(C_i) = |C_i| \cdot \sum_{j=1}^{n_2} \frac{\max(C_i(A_j)) - \min(C_i(A_j))}{\max(A_j) - \min(A_j)}$$
where $\max(C_i(A_j))$ and $\min(C_i(A_j))$ are the maximum and minimum values of the jth numerical attribute in equivalence class Ci, $\max(A_j)$ and $\min(A_j)$ are the maximum and minimum values of the jth numerical attribute in the data table, and $|C_i|$ is the number of records in equivalence class Ci. Therefore, over the data table T, the information loss of all numerical attributes after anonymization can be written as
$$IL_{num}(T) = \sum_{i=1}^{k} IL_{num}(C_i)$$
where k is the number of equivalence classes. The information loss of the categorical attributes in equivalence class Ci is defined as follows.
$$IL_{cat}(C_i) = |C_i| \cdot \sum_{j=1}^{n_3} \frac{h(C_i(B_j))}{h(B_j)}$$
where $h(C_i(B_j))$ is the number of distinct values of the jth categorical attribute in equivalence class Ci and $h(B_j)$ is the number of distinct values of the jth categorical attribute in the data table T. Thus, over the data table T, the information loss of all categorical attributes after anonymization can be written as
$$IL_{cat}(T) = \sum_{i=1}^{k} IL_{cat}(C_i)$$
In summary, when the data table contains N records, the average information loss of the anonymous table T is
$$\overline{IL} = \frac{IL_{num}(T) + IL_{cat}(T)}{N}$$
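Using the loss definitions above (as reconstructed in the text), the per-class loss terms can be sketched as follows (records as dicts; all names are illustrative assumptions):

```python
def numeric_loss(cls, table, attrs):
    """|Ci| * sum_j (range of attribute j in Ci) / (range of j in T)."""
    loss = 0.0
    for a in attrs:
        col_c = [r[a] for r in cls]
        col_t = [r[a] for r in table]
        span = max(col_t) - min(col_t)
        loss += (max(col_c) - min(col_c)) / span if span else 0.0
    return len(cls) * loss

def categorical_loss(cls, table, attrs):
    """|Ci| * sum_j (#distinct values of j in Ci) / (#distinct values in T)."""
    loss = sum(len({r[a] for r in cls}) / len({r[a] for r in table})
               for a in attrs)
    return len(cls) * loss
```

Summing these over all equivalence classes and dividing by N then gives the average information loss of the anonymous table.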
4.3 analysis of the results of the experiment
4.3.1 Effect of different conditions on anonymization processing time
The running time of the method comes mainly from anonymization, and the average anonymization time reflects the performance of the algorithm. Both the k-anonymity parameter k and the data volume affect the anonymization time, which is spent mainly on anonymity-group construction and adjustment. As k increases, the average anonymization time gradually increases, as shown in Fig. 2; as the data volume increases, the average anonymization time also gradually increases, as shown in Fig. 3. This is because more data generally means more iterations and adjustments during anonymization, so the processing time grows; however, as the data volume increases, the data distribution becomes denser and anonymization becomes easier, so the processing time rises only gently. Figures 2 and 3 also show that the method of this patent has a shorter processing time than MDAV under different k values and data volumes, because the table records are first clustered so that similar records fall into the same cluster, which lays the foundation for the subsequent anonymization.
4.3.2 Impact of k-value on information availability
The amount of information loss measures data availability: the larger the loss, the lower the availability, and the smaller the loss, the higher the availability. The original data set is processed with different k values and the change in information loss is observed, as shown in Fig. 4. As k increases, the information loss gradually increases, i.e., data availability gradually decreases. Because the data set is large, the differences between equivalence classes are small when k is small; as k grows, equivalence classes with small differences gradually merge, so the information loss changes little, but once the differences between equivalence classes are large, further increases in k change the information loss markedly. Fig. 4 also shows that the information loss of the MDAV method and of the proposed method differ little: although constraining the sensitive attribute values within each equivalence class (so that they are not all identical) adds some information loss, the anonymity groups are constructed by choosing the option with the smaller information loss, so the loss of each anonymity group remains small and the two methods end up close.
4.3.3 Influence of k value on degree of privacy protection
Information entropy measures the degree of privacy protection. As shown in Fig. 5, the information entropy gradually increases with k, i.e., the degree of data privacy protection gradually increases. Compared with MDAV, the proposed method yields a larger information entropy because the sensitive attributes within each equivalence class are constrained so that the sensitive attribute values are not all equal, which increases the entropy of the data table.
4.3.4 statistical value availability analysis
To illustrate the availability of the data after differential privacy processing, the original data, the data after the differential privacy noise addition of this patent, and the data after noise addition to the raw frequencies are compared, with ε = 1. As Table 4-8 and Fig. 6 show, the differences among the three are small, so the data retains a degree of availability.
TABLE 4-8 statistical value comparison
Summary of the invention
For the scenario of sharing government table data, this patent proposes a privacy-preserving table data sharing method based on cluster anonymity. First, the records in the table are clustered with the k-medoids algorithm to obtain several sub data tables; then each sub data table is anonymized with the help of the information loss measure to generate an anonymous data table; finally, noise is added to the sensitive attribute values. Example analysis and comparison with the classical k-anonymity algorithm MDAV demonstrate the availability and privacy of the proposed algorithm.

Claims (4)

1. A privacy protection table data sharing method based on cluster anonymity, characterized in that: the method is applied to a shared static data table and comprises the following steps:
step 1, clustering: divide the table data records based on k-medoids clustering; according to the distances between records in the data table, cluster the records of the shared static data table with the k-medoids clustering algorithm to obtain several clusters;
step 2, anonymization: process each cluster obtained in Step 1; first split the data within a cluster according to information loss, then adjust each resulting cluster so that every cluster satisfies the k-anonymity condition and no cluster has completely identical sensitive attribute values, and finally generalize the clusters to generate an anonymous data table;
step 3, differential privacy and noise addition: apply differential privacy processing to the sensitive attribute values in the table data;
step 4, comparative verification: finally, verify the availability and privacy of the method through example analysis and comparison with the classical k-anonymity algorithm MDAV;
in Step 1, the core idea of table data record division is as follows: the n records in the shared static data table are divided into several clusters using a clustering technique, so that records with high similarity fall into the same group; meanwhile, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering; the specific flow of table data record division combined with the k-medoids clustering algorithm is as follows:
step 11: normalization: quantize the non-sensitive ordered categorical attributes in the data table into the numerical values 1, 2, 3, ..., n and then treat them as numerical attributes; further normalize the numerical attribute data among all non-sensitive attributes in the data table, the normalization formula being:
$$x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$
where $x_i'$ is the normalized value of the numerical attribute, $x_i$ is the original value of the numerical attribute, $x_{min}$ is the minimum value of the attribute, and $x_{max}$ is the maximum value of the attribute;
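Step 11's min-max normalization can be sketched as follows (the handling of a constant column is an added assumption, not stated in the claim):

```python
def min_max_normalize(col):
    """Step 11: x' = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(col), max(col)
    if hi == lo:                       # constant column: map to 0.0 (assumption)
        return [0.0 for _ in col]
    return [(x - lo) / (hi - lo) for x in col]

print(min_max_normalize([20, 30, 40]))  # [0.0, 0.5, 1.0]
```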
step 12: according to the distance between table data records over the non-sensitive attributes, cluster the table data with the k-medoids clustering algorithm and divide the table data records into $k_1$ clusters;
step 13: according to the k-anonymity parameter $k_2$, adjust the records of clusters that do not meet the anonymity requirement: if the number of data records in a divided cluster is greater than $k_2$, no adjustment is made; if a cluster $C_i$ contains fewer than $k_2$ data records, move the record closest to $C_i$ into $C_i$, while ensuring that the cluster the record came from still contains more than $k_2$ data records;
step 14: repeat Step 13 until each cluster contains at least $k_2$ records;
Step 15: dividing data into different sub data tables T1, T2, tk according to different belonged clusters 1 Thereby obtaining k 1 A sheet data table;
in Step 12, when dividing records with the k-medoids clustering algorithm, the data table contains two types of attributes, categorical and numerical, so different distance calculation methods are needed when computing the distance between records; the k-medoids algorithm must also consider the optimality of the clustering result, namely the selection of the optimal number of clusters $k_1$; the process is as follows:
step 121, a calculation formula of the distance between data table records:
when computing the distance between data table records, the different attributes must be computed separately because several attributes coexist in the data table; the numerical attribute distance is given by formula 2:
$$dist(x_i, x_j) = |x_i - x_j| \quad \text{(formula 2)}$$
The calculation formula for the categorical attribute is shown as formula 3:
$$dist(x_i, x_j) = \begin{cases} 0, & x_i = x_j \\ 1, & x_i \neq x_j \end{cases} \quad \text{(formula 3)}$$
Suppose there are m numerical attributes and n categorical attributes in the data table; the distance between any two records $X_i$ and $X_j$ is then given by formula 4:
$$dist(X_i, X_j) = \sqrt{\sum_{p=1}^{m} dist(x_{ip}, x_{jp})^2 + \sum_{q=1}^{n} dist(x_{iq}, x_{jq})^2} \quad \text{(formula 4)}$$
where $x_{ip}$ and $x_{jp}$ are the p-th numerical attribute values of records $X_i$ and $X_j$ respectively, and $x_{iq}$ and $x_{jq}$ are the q-th categorical attribute values of records $X_i$ and $X_j$ respectively;
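Formulas 2 and 3 can be combined into a single record distance; the aggregation below (square root of the summed squared per-attribute distances) is one plausible reading of formula 4 and is an assumption, not a quotation of the claim:

```python
import math

def record_distance(x, y, numeric_idx):
    """Distance between two records with mixed attribute types:
    formula 2 (|xi - xj|) for numeric positions, formula 3 (0/1
    mismatch) for categorical positions, aggregated Euclidean-style."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in numeric_idx:
            total += (a - b) ** 2              # numeric: squared difference
        else:
            total += 0.0 if a == b else 1.0    # categorical: 0/1 mismatch
    return math.sqrt(total)
```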
step 122, number of data record dividing clusters k 1 Determination of (1):
the k-medoids clustering algorithm divides similar records into the same group to prepare for anonymization and to minimize the information loss caused by the anonymization process; when determining the number of clusters, the similarity within clusters is the main concern, so the number of clusters $k_1$ is determined by the within-group sum of squared errors SSE; as $k_1$ increases, the data records in each cluster gradually decrease and the distances between records within a cluster shrink, so the value of SSE decreases as $k_1$ increases; therefore, when determining $k_1$ by SSE, attention is paid to the rate of change: when SSE decreases only slowly as $k_1$ grows, further increasing $k_1$ changes the clustering effect little, and that $k_1$ is taken as the optimal cluster number; if each $k_1$ and its corresponding SSE value are plotted as a line graph, the $k_1$ at the inflection point is the optimal cluster number;
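Step 122's elbow selection can be sketched as follows (1-D clusters for brevity, and the "slow drop" threshold is an assumed heuristic, not part of the claim):

```python
def sse(clusters):
    """Within-group sum of squared errors for 1-D clusters."""
    total = 0.0
    for c in clusters:
        mu = sum(c) / len(c)
        total += sum((x - mu) ** 2 for x in c)
    return total

def elbow_k(sse_by_k, drop_ratio=0.1):
    """Return the smallest k after which SSE stops dropping quickly
    (the inflection point of the SSE line graph)."""
    ks = sorted(sse_by_k)
    for a, b in zip(ks, ks[1:]):
        if sse_by_k[a] - sse_by_k[b] < drop_ratio * sse_by_k[ks[0]]:
            return a
    return ks[-1]

print(sse([[0, 2], [5, 5]]))  # 2.0
```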
in Step 2, the $k_1$ sub data tables obtained by dividing the table data records are processed in turn; the core idea is to divide the data records in each sub data table so that the number of records in every generated cluster lies in $[k_2, 2k_2-1]$ while the sensitive attribute value within each cluster is not unique; the specific flow of the table data anonymization algorithm is as follows:
step 21: judge whether the number of data records in the data set is greater than $2k_2-1$; if so, execute Step 22;
step 22: select two records $r_1$ and $r_2$ from the data set as two initial clusters such that, when $r_1$ and $r_2$ form a cluster, the information loss of that cluster is the largest among all pairwise combinations of records; execute Step 23;
step 23: for each record in the data set, compute the change in information loss after assigning it to either of the two clusters, assign the record to the cluster with the smaller information loss, and adjust the records so that each cluster contains at least $k_2$ data records; return the two generated clusters as two new data sets to Step 21;
step 24: when the number of data records in every data set lies in $[k_2, 2k_2-1]$, cyclically judge whether any data set has a unique sensitive attribute value; if so, execute Step 25;
step 25: select data records with different sensitive attribute values for the data set Q, while ensuring that, if these records are deleted, the number of data records in their source data set is still at least $k_2$ and its sensitive attribute value is not unique;
step 26: compute the change in information loss after assigning the selected data records to the corresponding data set Q, and move the data record with the smaller information loss into Q;
step 27: once every set has a record count in $[k_2, 2k_2-1]$, generalize each set to obtain the anonymous data table.
2. The cluster anonymity based privacy preserving table data sharing method of claim 1, wherein: in Step 27, the anonymization rule of the generalization processing is: if a non-sensitive attribute is numerical, it is generalized to the value range of the set it belongs to; if the attribute is categorical, it is generalized to the full set of its values in the set it belongs to.
3. The cluster anonymity based privacy preserving table data sharing method of claim 2, wherein: the differential privacy noise adding processing process for the anonymous data table in Step 3 comprises the following steps:
(1) If categorical attributes exist among the sensitive attributes, perform frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequencies, add a Num column at the corresponding position, and record the noised data;
(2) If numerical attributes exist among the sensitive attributes, cluster each numerical attribute separately and replace every attribute value by the mean of its cluster, where the number of clusters is

$$\lceil n / k_3 \rceil$$

where n is the number of records in the data table and $k_3$ is the number of data records per cluster; $k_3$ can be chosen according to the availability requirement on the data: if a higher data quality is required, a smaller value is selected; if the quality requirement is low, a larger value may be selected; the processed data table can then be shared with the data requester.
4. The cluster anonymity based privacy preserving table data sharing method of claim 3, wherein: the shared static data table is shared government affairs table data.
CN201910752801.6A 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity Active CN110555316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752801.6A CN110555316B (en) 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity


Publications (2)

Publication Number Publication Date
CN110555316A CN110555316A (en) 2019-12-10
CN110555316B true CN110555316B (en) 2023-04-18

Family

ID=68737513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910752801.6A Active CN110555316B (en) 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity

Country Status (1)

Country Link
CN (1) CN110555316B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079183B (en) * 2019-12-19 2022-06-03 中国移动通信集团黑龙江有限公司 Privacy protection method, device, equipment and computer storage medium
CN111222164B (en) * 2020-01-10 2022-03-25 广西师范大学 Privacy protection method for issuing alliance chain data
CN111628974A (en) * 2020-05-12 2020-09-04 Oppo广东移动通信有限公司 Differential privacy protection method and device, electronic equipment and storage medium
CN112035874A (en) * 2020-08-28 2020-12-04 绿盟科技集团股份有限公司 Data anonymization processing method and device
CN113254992A (en) * 2021-05-21 2021-08-13 同智伟业软件股份有限公司 Electronic medical record publishing privacy protection method
CN113257378B (en) * 2021-06-16 2021-09-28 湖南创星科技股份有限公司 Medical service communication method and system based on micro-service technology
CN113378223B (en) * 2021-06-16 2023-12-26 北京工业大学 K-anonymous data processing method and system based on double coding and cluster mapping
CN113411186B (en) * 2021-08-19 2021-11-30 北京电信易通信息技术股份有限公司 Video conference data security sharing method
CN114092729A (en) * 2021-09-10 2022-02-25 南方电网数字电网研究院有限公司 Heterogeneous electricity consumption data publishing method based on cluster anonymization and differential privacy protection
CN113742781B (en) * 2021-09-24 2024-04-05 湖北工业大学 K anonymous clustering privacy protection method, system, computer equipment and terminal
CN114611127B (en) * 2022-03-15 2022-10-28 湖南致坤科技有限公司 Database data security management system
CN114817977B (en) * 2022-03-18 2024-03-29 西安电子科技大学 Anonymous protection method based on sensitive attribute value constraint

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292195A (en) * 2017-06-01 2017-10-24 徐州医科大学 The anonymous method for secret protection of k divided based on density
CN107358113A (en) * 2017-06-01 2017-11-17 徐州医科大学 Based on the anonymous difference method for secret protection of micro- aggregation
CN107766745A (en) * 2017-11-14 2018-03-06 广西师范大学 Classification method for secret protection in hierarchical data issue

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201426578A (en) * 2012-12-27 2014-07-01 Ind Tech Res Inst Generation method and device and risk assessment method and device for anonymous dataset
CN105512566B (en) * 2015-11-27 2018-07-31 电子科技大学 A kind of health data method for secret protection based on K- anonymities
US11132462B2 (en) * 2017-12-28 2021-09-28 Cilag Gmbh International Data stripping method to interrogate patient records and create anonymized record
CN109522750B (en) * 2018-11-19 2023-05-02 盐城工学院 Novel k anonymization realization method and system
CN110069943B (en) * 2019-03-29 2021-06-22 中国电力科学研究院有限公司 Data processing method and system based on cluster anonymization and differential privacy protection


Also Published As

Publication number Publication date
CN110555316A (en) 2019-12-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240321

Address after: No. 3 Juquan Road, Science City, Huangpu District, Guangzhou City, Guangdong Province, 510700. Office card slot A090, Jian'an Co Creation Space, Area A, Guangzhou International Business Incubator, No. A701

Patentee after: Guangzhou chick Information Technology Co.,Ltd.

Country or region after: China

Address before: 050043 No. 17, North Second Ring Road, Hebei, Shijiazhuang

Patentee before: SHIJIAZHUANG TIEDAO University

Country or region before: China
