CN110555316B - Privacy protection table data sharing method based on cluster anonymity - Google Patents


Info

Publication number
CN110555316B
CN110555316B (application CN201910752801.6A)
Authority
CN
China
Prior art keywords
data
records
attribute
cluster
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910752801.6A
Other languages
Chinese (zh)
Other versions
CN110555316A (en)
Inventor
刘丽苹
朴春慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Chick Information Technology Co ltd
Original Assignee
Shijiazhuang Tiedao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shijiazhuang Tiedao University
Priority to CN201910752801.6A
Publication of CN110555316A
Application granted
Publication of CN110555316B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention relates to a privacy protection table data sharing method based on cluster anonymity. Records in a table are first clustered with a k-medoids clustering algorithm to obtain a plurality of data tables; each data table is then anonymized according to its information loss to generate an anonymous data table; finally, noise is added to the sensitive attribute values in the anonymous data table. The method is verified by comparison with the classic k-anonymity algorithm MDAV (Maximum Distance to Average Vector), which demonstrates its usability and privacy, so it has high popularization and application value.

Description

Privacy protection table data sharing method based on cluster anonymity
Technical Field
The patent application belongs to the technical field of privacy protection, and particularly relates to a privacy protection table data sharing method based on cluster anonymity.
Background
With the construction and development of digital government, government affairs data have grown steadily, becoming larger in scale, more varied in type, and increasingly diverse and complex. For a long time, 'information islands' and 'data barriers' have been common, preventing the value of data from being fully realized. Government data sharing transfers government information from one department to another, alleviating the data-island phenomenon, letting data deliver its maximum value, and improving the quality of government services. Table data sharing is one of the important forms of government data sharing.
Generally, 'privacy' refers to information that a data owner is unwilling to let others obtain. The development of information technology inevitably increases the possibility of data information leakage, which in turn limits that development, so privacy issues have attracted growing concern. To reflect more intuitively how much attention all sectors pay to 'privacy' and 'privacy protection', the inventors measured the number of related papers published each year: they searched CNKI with 'privacy' as a topic keyword, then searched within those results with 'privacy protection' as the topic keyword, and from the results plotted the change in the number of papers published each year since 1990, as shown in fig. 1.
As can be seen from fig. 1, interest in the privacy problem and the privacy protection problem has grown rapidly since 2003. In recent years, attention to privacy protection has accounted for roughly half of the overall attention to privacy.
On this basis, a protection method is needed that ensures both data availability and data privacy. A comparative experimental analysis against the traditional anonymization algorithm MDAV shows that the proposed privacy protection method improves algorithm efficiency and provides effective privacy protection.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a privacy protection table data sharing method based on cluster anonymity, which avoids the defects described above and provides effective privacy protection.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
A privacy protection table data sharing method based on cluster anonymity is applied to a shared static data table and includes:
Step 1, clustering: divide the table data records based on k-medoids clustering. According to the distance between records in the data table, cluster the records in the shared static data table with a k-medoids clustering algorithm to obtain a plurality of clusters, i.e. a plurality of data tables;
Step 2, anonymization: process each cluster obtained in Step 1. First divide the data in the cluster according to information loss, then adjust each resulting cluster so that it satisfies the k-anonymity condition and contains no cluster whose sensitive attribute values are all equal, and finally generalize the clusters to generate an anonymous data table;
Step 3, differential privacy and noise adding: apply differential privacy processing to the sensitive attribute values in the table data;
Step 4, comparative verification: finally, verify the usability and privacy of the method through example analysis and comparison with the classic k-anonymity algorithm MDAV.
The technical scheme of the invention is further improved as follows: in Step 1, the core idea of table data record division is to divide the n records in the shared static data table into a plurality of clusters using clustering, so that records with high similarity are grouped together; at the same time, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering finishes. The specific flow of table data record division combined with the k-medoids clustering algorithm is as follows:
Step 11: normalization. Quantize the non-sensitive ordered categorical attributes in the data table into numerical values 1, 2, 3, ..., n and then treat them as numeric attributes; further normalize the numeric attribute data among all non-sensitive attributes in the data table. The normalization formula is:

x_i' = (x_i - x_min) / (x_max - x_min)   (formula 1)

where x_i' is the normalized value of a numeric attribute, x_i is the original value, x_min is the minimum value of the attribute, and x_max is its maximum value;
Step 12: according to the distance between the non-sensitive attributes of the table data records, cluster the table data with a k-medoids clustering algorithm, dividing the records into k1 clusters;
Step 13: according to the k-anonymity parameter k2, adjust the records of clusters that do not meet the anonymity requirement. If the number of data records in a divided cluster is greater than k2, no adjustment is made; if a cluster C_i contains fewer than k2 records, move the records nearest to C_i into C_i, while ensuring that each cluster a record is taken from still holds more than k2 records;
Step 14: repeat Step 13 until every cluster contains at least k2 records;
Step 15: divide the data into different sub data tables T1, T2, ..., Tk1 according to the cluster each record belongs to, thereby obtaining k1 data tables.
The technical scheme of the invention is further improved as follows: in Step 12, when the k-medoids clustering algorithm is used to divide the records, the data table contains both categorical and numeric attributes, so different distance calculation methods are needed when computing the distance between records; in addition, the optimal clustering result, i.e. the optimal number of clusters k1, must be considered when running the k-medoids clustering algorithm. The selection process is as follows:
Step 121, the distance between data table records:
Because several attributes exist in the data table, different attribute types are calculated separately. The numeric attribute distance is given by formula 2:

dist(x_i, x_j) = |x_i - x_j|   (formula 2)

The categorical attribute distance is given by formula 3:

dist(x_i, x_j) = 0 if x_i = x_j, and 1 otherwise   (formula 3)

Assume there are m numeric attributes and n categorical attributes in the data table; then the distance between any two records X_i and X_j is given by formula 4:

dist(X_i, X_j) = Σ_{p=1..m} dist(x_ip, x_jp) + Σ_{q=1..n} dist(x_iq, x_jq)   (formula 4)

where x_ip and x_jp are the p-th numeric attribute values of records X_i and X_j, and x_iq and x_jq are their q-th categorical attribute values;
Step 122, determination of the number of clusters k1:
The k-medoids clustering algorithm divides similar records into one group, preparing for anonymization and reducing the information loss of the anonymization process as far as possible; therefore within-cluster similarity is the main consideration when determining the number of clusters, and k1 is determined through the within-cluster sum of squared errors (SSE). As k1 grows, the number of records in each cluster gradually decreases and the distances between records within a cluster become smaller and smaller, so the SSE decreases as k1 increases. When determining k1 by SSE, watch how it changes: once the decrease of SSE with k1 becomes relatively slow, further increasing k1 changes the clustering effect little, and that k1 is taken as the optimal number of clusters. If each k1 and its corresponding SSE value are plotted on a line graph, the k1 at the inflection point is the optimal number of clusters.
The technical scheme of the invention is further improved as follows: in Step 2, each of the k1 sub data tables obtained by dividing the table data records is processed in turn. The core idea is to divide the data records in a sub data table so that the number of records in each generated cluster lies in [k2, 2k2-1] while the sensitive attribute value in each cluster is not unique. The specific flow of the table data anonymization algorithm is:
Step 21: judge whether the number of data records in the data set is greater than 2k2-1; if so, execute Step 22;
Step 22: select two records r1 and r2 in the data set as two initial clusters, such that of all pairs of records in the data set, r1 and r2 would produce the largest information loss if placed in one cluster; execute Step 23;
Step 23: for each record in the data set, calculate the change in information loss when it is assigned to either of the two clusters and assign it to the cluster with the smaller information loss; then adjust the records so that each cluster holds at least k2 records, and return the two generated clusters to Step 21 as two new data sets;
Step 24: when the number of data records in every data set lies in [k2, 2k2-1], check each data set in turn for a sensitive attribute with a unique value; if one exists, execute Step 25;
Step 25: select data records whose sensitive attribute values differ from that of the data set Q, ensuring that if such a record is removed from its own data set, that set still holds at least k2 records and its sensitive attribute values remain non-unique;
Step 26: calculate the change in information loss when each selected record is assigned to the data set Q, and move into Q the record with the smaller information loss;
Step 27: obtain the sets whose record counts lie in [k2, 2k2-1] and generalize each set to obtain the anonymous data table.
The technical scheme of the invention is further improved as follows: in Step 27, the anonymization rule for generalization is: if a non-sensitive attribute is numeric, generalize it to the value range of that attribute within its set; if it is categorical, generalize it to the full set of that attribute's values within its set.
The technical scheme of the invention is further improved as follows: the process of performing differential privacy and noise processing on the anonymous data table in Step 3 is as follows:
(1) If a categorical sensitive attribute exists among the sensitive attributes, carry out frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequencies, add a Num column at the corresponding position, and record the noised data;
(2) If numeric attributes exist among the sensitive attributes, cluster each numeric attribute separately and replace every attribute value with its cluster average. The number of clusters is

⌈n / k3⌉

where n is the number of records in the data table and k3 is the number of data records per cluster. k3 can be chosen according to the availability requirement on the data: if the data quality requirement is high, choose a smaller value; if it is low, a larger value can be chosen; the value is set according to the customers' needs. The processed data table can then be shared with the data requester.
The technical scheme of the invention is further improved as follows: the shared static data table is, for example, shared government affairs table data, but may be other data as well.
Due to the adoption of the technical scheme, the invention has the following beneficial effects: the method combines a clustering algorithm, the k-anonymity model, and differential privacy, and compared with the classic k-anonymity algorithm MDAV it achieves higher usability and a higher degree of privacy protection. The k-anonymity model is the most widely applied model; it has a simple structure and few constraints, much subsequent research is designed and implemented on top of it, and it is convenient for government departments to use. However, the k-anonymity model cannot resist maximal-background-knowledge attacks, homogeneity attacks, re-identification attacks, and the like. Differential privacy can resist maximal-background-knowledge attacks but tends to provide poor data availability. Combining the two strengthens the degree of privacy protection in theory and reduces the risk of privacy disclosure. The clustering algorithm groups the data records so that highly similar records are assigned to one group in preparation for anonymization.
Drawings
FIG. 1 shows the change in the number of published papers on the privacy and privacy protection issues;
FIG. 2 is a graph of anonymization processing time used at different values of k compared to the MDAV method;
FIG. 3 is a graph of anonymization processing time used at different data volumes compared to the MDAV method of the present invention;
FIG. 4 shows the variation of information loss at different k values compared to the MDAV method;
FIG. 5 is the information entropy change at different k values compared to the MDAV method of the present invention;
fig. 6 is a comparison of the different noise addition modes compared to the MDAV method of the present invention with the real values.
Detailed Description
The present invention will be described in further detail with reference to examples.
The invention discloses a privacy protection table data sharing method based on cluster anonymity, which is applied to a shared static data table, such as shared government affair table data, and comprises the following steps:
Step 1, clustering: divide the table data records based on k-medoids clustering. According to the distance between records in the data table, cluster the records in the shared static data table with a k-medoids clustering algorithm to obtain a plurality of clusters, i.e. a plurality of data tables. The k-medoids clustering algorithm is a classic algorithm for the clustering problem; it is simple and fast, can handle large-scale data sets, and yields compact clusters with clear separation between clusters. Its flow is as follows.
Step a1: randomly select k cluster samples as the initial cluster centers.
Step a2: calculate the distance from every remaining sample point to each initial cluster center and assign each sample point to the cluster whose center is nearest, forming k clusters.
Step a3: for each cluster, compute for every non-center sample point the sum of its distances to the other points in the cluster; the point with the smallest sum becomes the new cluster center.
Step a4: if the new set of cluster centers differs from the previous one, return to Step a2; if it is the same, the clustering algorithm ends.
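Steps a1 to a4 can be sketched in Python as follows; this is a minimal illustration (all function and variable names are the editor's, not the patent's), parameterized by a generic record-distance function:

```python
import random

def k_medoids(points, k, dist, max_iter=100, seed=0):
    """Minimal k-medoids sketch following steps a1-a4 above."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)       # a1: random initial centers
    clusters = {}
    for _ in range(max_iter):
        # a2: assign each point to the cluster of its nearest medoid
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # a3: in each cluster, pick the member minimizing the sum of
        # distances to the other members as the new center
        new_medoids = [
            min(members,
                key=lambda c: sum(dist(points[c], points[o]) for o in members))
            for members in clusters.values()
        ]
        # a4: stop when the medoid set no longer changes
        if set(new_medoids) == set(medoids):
            return medoids, clusters
        medoids = new_medoids
    return medoids, clusters
```

For tabular records, `dist` would be the mixed numeric/categorical record distance described in Step 121.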
Step 2, anonymization: process each cluster obtained in Step 1. First divide the data in the cluster according to information loss, so that the number of records in every divided cluster lies between k2 and 2k2-1; then adjust each resulting cluster so that it satisfies the k-anonymity condition and contains no cluster whose sensitive attribute values are all equal; finally generalize the clusters to generate an anonymous data table;
Step 3, differential privacy and noise adding: apply differential privacy processing to the sensitive attribute values in the table data;
in 2006, microsoft scholars Dwork first proposed a new privacy protection model, differential privacy. The basic idea is to process data information by adding noise to original data or statistical data and converting the original data, so that when a record is added or deleted, the whole statistical attribute value is not affected, and the privacy protection effect is achieved. The model can relieve the maximum background attack risk, and a privacy protection level quantitative evaluation method is defined.
Definition 1: differential privacy. Suppose two data sets D and D' differ by at most one record, i.e. |D Δ D'| ≤ 1, and let A be a privacy protection algorithm. If for any output result O obtained by running A on D and D' the following inequality holds, A is said to satisfy ε-differential privacy:

Pr[A(D) = O] ≤ e^ε × Pr[A(D') = O]   (formula 11)

This guarantees from a theoretical standpoint that algorithm A satisfies ε-differential privacy, where Pr[·] denotes the probability of an event and ε is called the privacy budget. Noise mechanisms are the main technique for implementing differential privacy protection; the commonly used ones are the Laplace mechanism and the exponential mechanism.
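A minimal sketch of the Laplace mechanism for a count query (a count has sensitivity 1) is shown below; the function names and the exponential-difference sampling trick are the editor's choices, not the patent's:

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise: the difference of two i.i.d.
    exponential variables with mean `scale` is Laplace-distributed."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng=None):
    """Count query under the Laplace mechanism: a count has sensitivity 1,
    so adding Laplace(0, 1/epsilon) noise satisfies epsilon-differential
    privacy for the count."""
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller ε (privacy budget) yields a larger noise scale 1/ε and therefore stronger protection but lower accuracy.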
Step 4, comparative verification: finally, the usability and the privacy verification of the method are carried out through example analysis and comparison with a classic k-anonymity algorithm MDAV.
In Step 1, the core idea of table data record division is to divide the n records in the shared static data table into a plurality of clusters using clustering, so that records with high similarity are grouped together; at the same time, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering finishes. The specific flow of table data record division combined with the k-medoids clustering algorithm is as follows:
Step 11: normalization. According to the representation of attribute values, the attributes of a data table can be divided into categorical attributes and numeric attributes, and categorical attributes further into ordered and unordered ones. For example, the score levels 'excellent', 'good', 'pass', 'fail' are ordered categorical attributes, while the genders 'male' and 'female' are unordered categorical attributes. To better reflect the distances between data during clustering, the non-sensitive ordered categorical attributes in the data table are quantized in order into numerical values 1, 2, 3, ..., n and then treated as numeric attributes. At the same time, because the value ranges of the attributes differ and would otherwise greatly distort the record distance calculation, the numeric attribute values and the quantized ordered categorical values are normalized with:

x_i' = (x_i - x_min) / (x_max - x_min)   (formula 1)

where x_i' is the normalized value of a numeric attribute, x_i is the original value, x_min is the minimum value of the attribute, and x_max is its maximum value;
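The quantization and normalization of Step 11 can be sketched as follows; the helper name and the example grade mapping are the editor's illustrations, not part of the patent:

```python
def min_max_normalize(values):
    """Formula 1: x' = (x - x_min) / (x_max - x_min), mapping one
    numeric attribute column into [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:                        # constant column: map to 0.0
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

# Ordered categorical attributes are first quantized 1, 2, 3, ..., n in
# their natural order and then normalized the same way, e.g.:
grade_order = {"fail": 1, "pass": 2, "good": 3, "excellent": 4}
```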
Step 12: according to the distance between the non-sensitive attributes of the table data records, cluster the table data with a k-medoids clustering algorithm, dividing the records into k1 clusters;
Step 13: according to the k-anonymity parameter k2, adjust the records of clusters that do not meet the anonymity requirement. If the number of data records in a divided cluster is greater than k2, no adjustment is made; if a cluster C_i contains fewer than k2 records, move the records nearest to C_i into C_i, while ensuring that each cluster a record is taken from still holds more than k2 records;
Step 14: repeat Step 13 until every cluster contains at least k2 records;
Step 15: divide the data into different sub data tables T1, T2, ..., Tk1 according to the cluster each record belongs to, thereby obtaining k1 data tables; see Algorithm 3-1.
[Algorithm 3-1 (table data record division) appears as an image in the original publication and is not reproduced here.]
When the k-medoids clustering algorithm is used to divide the records, the data table contains both categorical and numeric attributes, so different distance calculation methods are needed when computing the distance between records; in addition, the optimal clustering result, i.e. the optimal number of clusters k1, must be considered when running the k-medoids clustering algorithm. The selection process is as follows:
Step 121, the distance between data table records:
Because several attributes exist in the data table, different attribute types are calculated separately. The numeric attribute distance is given by formula 2:

dist(x_i, x_j) = |x_i - x_j|   (formula 2)

The categorical attribute distance is given by formula 3:

dist(x_i, x_j) = 0 if x_i = x_j, and 1 otherwise   (formula 3)

Assume there are m numeric attributes and n categorical attributes in the data table; then the distance between any two records X_i and X_j is given by formula 4:

dist(X_i, X_j) = Σ_{p=1..m} dist(x_ip, x_jp) + Σ_{q=1..n} dist(x_iq, x_jq)   (formula 4)

where x_ip and x_jp are the p-th numeric attribute values of records X_i and X_j, and x_iq and x_jq are their q-th categorical attribute values;
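The per-attribute distances (absolute difference for numeric attributes, a 0/1 mismatch indicator for categorical ones) and their combination can be sketched as follows; the simple per-attribute sum in `record_distance` is the editor's reading of formula 4, and the function names are the editor's:

```python
def attr_distance(a, b):
    """Formula 2 for numeric values, formula 3 for categorical values."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)          # formula 2: absolute difference
    return 0.0 if a == b else 1.0  # formula 3: 0/1 mismatch indicator

def record_distance(x, y):
    """Formula 4 (as read here): sum of per-attribute distances over the
    numeric and categorical non-sensitive attributes of two records."""
    return sum(attr_distance(a, b) for a, b in zip(x, y))
```

This assumes the numeric attributes have already been normalized (Step 11), so all per-attribute distances are on a comparable scale.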
Step 122, determination of the number of clusters k1:
The k-medoids clustering algorithm divides similar records into one group, preparing for anonymization and reducing the information loss caused by the anonymization process as far as possible; therefore within-cluster similarity is the main consideration when determining the number of clusters, and k1 is determined through the within-cluster sum of squared errors (SSE). As k1 grows, the number of records in each cluster gradually decreases and the distances between records within a cluster become smaller and smaller, so the SSE decreases as k1 increases. When determining k1 by SSE, watch how it changes: once the decrease of SSE with k1 becomes relatively slow, further increasing k1 changes the clustering effect little, and that k1 is taken as the optimal number of clusters. If each k1 and its corresponding SSE value are plotted on a line graph, the k1 at the inflection point is the optimal number of clusters.
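The SSE-based selection of k1 can be sketched as follows; the "bend" measure used here (the largest drop in marginal SSE gain) is one common reading of the inflection-point heuristic, not the patent's exact rule, and all names are the editor's:

```python
def sse(clusters, centers, dist):
    """Within-cluster sum of squared distances to each cluster center."""
    total = 0.0
    for c, members in zip(centers, clusters):
        total += sum(dist(c, m) ** 2 for m in members)
    return total

def pick_k_by_elbow(sse_by_k):
    """Pick the inflection point: the k after which the SSE decrease
    slows down the most (largest drop in marginal gain)."""
    ks = sorted(sse_by_k)
    best_k, best_bend = ks[1], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev_k] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```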
Each of the k1 sub data tables obtained by dividing the table data records is processed in turn. The core idea is to divide the data records in a sub data table so that the number of records in each generated cluster lies in [k2, 2k2-1] while the sensitive attribute value in each cluster is not unique. The specific flow of the table data anonymization algorithm is:
Step 21: judge whether the number of data records in the data set is greater than 2k2-1; if so, execute Step 22;
Step 22: select two records r1 and r2 in the data set as two initial clusters, such that of all pairs of records in the data set, r1 and r2 would produce the largest information loss if placed in one cluster; execute Step 23;
Step 23: for each record in the data set, calculate the change in information loss when it is assigned to either of the two clusters and assign it to the cluster with the smaller information loss; then adjust the records so that each cluster holds at least k2 records, and return the two generated clusters to Step 21 as two new data sets;
Step 24: when the number of data records in every data set lies in [k2, 2k2-1], check each data set in turn for a sensitive attribute with a unique value; if one exists, execute Step 25;
Step 25: select data records whose sensitive attribute values differ from that of the data set Q, ensuring that if such a record is removed from its own data set, that set still holds at least k2 records and its sensitive attribute values remain non-unique;
Step 26: calculate the change in information loss when each selected record is assigned to the data set Q, and move into Q the record with the smaller information loss;
Step 27: obtain the sets whose record counts lie in [k2, 2k2-1] and generalize each set to obtain the anonymous data table. For the table data anonymization algorithm, see Algorithm 3-2.
[Algorithm 3-2 (table data anonymization) appears as an image in the original publication and is not reproduced here.]
In Step 27, the anonymization rule for generalization is: if a non-sensitive attribute is numeric, it is generalized to the value range of that attribute within its set; for example, the set {1, 2, 3} is generalized to [1, 3]. If the attribute is categorical, it is generalized to the full set of that attribute's values within its set; for example, the set {works in a private enterprise, works in a family enterprise} is generalized to {works in a private enterprise/works in a family enterprise}.
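The generalization rule of Step 27 can be sketched as follows; the dictionary-based record format and the function name are the editor's assumptions:

```python
def generalize_cluster(records, numeric_attrs):
    """Generalize one cluster: numeric non-sensitive attributes become the
    [min, max] range within the cluster; categorical ones become the set
    of all values occurring in the cluster (joined with '/')."""
    generalized = {}
    for attr in records[0].keys():
        column = [r[attr] for r in records]
        if attr in numeric_attrs:
            generalized[attr] = (min(column), max(column))   # e.g. [1, 3]
        else:
            generalized[attr] = "/".join(sorted(set(column)))
    # every record in the cluster receives the same generalized values
    return [dict(generalized) for _ in records]
```

After this step all records of a cluster form one equivalence class and are indistinguishable on their non-sensitive attributes.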
During the anonymization of the table data, the information loss is calculated with reference to an availability measurement formula. When data records are adjusted, choosing assignments according to the size of the information loss reduces the loss as much as possible and increases the usability of the data.
A plurality of equivalence classes are generated after anonymization, but since the sensitive attribute values are not processed, an attacker can still infer an individual's sensitive attribute value through a background-knowledge attack, causing privacy disclosure. Therefore, to better protect individual privacy, differential privacy and noise processing are applied to the sensitive attribute values before the data are shared. The differential privacy and noise adding process in Step 3 is as follows:
(1) If a categorical attribute exists among the sensitive attributes, carry out frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequencies, add a Num column at the corresponding position, and record the noised data;
(2) If numeric attributes exist among the sensitive attributes, cluster each numeric attribute separately and replace every attribute value with its cluster average. The number of clusters is

⌈n / k3⌉

where n is the number of records in the data table and k3 is the number of data records per cluster. k3 can be chosen according to the availability requirement on the data: if the data quality requirement is high, choose a smaller value; if it is low, a larger value can be chosen; the value is set according to the data requester's needs. The processed data table can then be shared with the data requester. For the differential privacy algorithm, see Algorithm 3-3.
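Step (1) above, frequency statistics over combinations of sensitive values plus Laplace noise, can be sketched as follows (a sketch under the assumption that a count query has sensitivity 1 and the noise scale is 1/ε; all names are illustrative):

```python
import math
import random
from collections import Counter

def laplace_noise(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_frequencies(records, epsilon=1.0):
    """Frequency of each sensitive-value combination, plus Laplace(1/eps)
    noise; a count query has sensitivity 1, so scale = 1/eps."""
    counts = Counter(records)
    return {k: v + laplace_noise(1.0 / epsilon) for k, v in counts.items()}

rows = [("married", "private"), ("single", "private"), ("married", "private")]
print(noisy_frequencies(rows, epsilon=1.0))
```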
4. Experimental design and results analysis
4.1 data set
To verify the usability and effectiveness of the cluster-anonymity-based privacy-preserving table data sharing algorithm, the real Family Income and Expenditure data set for the Philippines provided by Kaggle is used as the experimental data set. It contains 41,544 citizen records and covers several kinds of private data, such as family information, detailed family income and expenditure, and property status. A table composed of five attributes (Household Head Sex, Household Head Marital Status, Household Head Age, Total Number of Family Members, and Household Head Class of Worker) is selected as the data table T to be shared. The sex and marital status of the household head are categorical attributes, the number of family members and the age of the household head are numerical attributes, and the class of worker of the household head is treated as the sensitive attribute. The algorithm proposed in this patent is compared experimentally with the classical anonymization algorithm MDAV, and its performance, availability, and degree of privacy protection are analyzed.
4.2 privacy protection Algorithm evaluation
(1) Privacy measure
In anonymity-based privacy protection, information entropy is usually used to measure the degree of privacy protection, which is reflected by the probability distribution of the data records in the data table. The larger the entropy, the more uniform the distribution, the lower the chance of a successful attack, and the higher the degree of privacy protection; conversely, the smaller the entropy, the lower the degree of privacy protection. The information entropy of the jth sensitive attribute in equivalence class Ci of the anonymous data table is defined as follows:
$$H(C_i, j) = -\sum_{t} \frac{n_t}{m} \log_2 \frac{n_t}{m}$$
where m is the number of data records in equivalence class Ci and $n_t$ is the number of occurrences of the t-th value of the jth sensitive attribute in Ci. The average entropy of the equivalence classes in the anonymous data table is then defined as follows:
$$\overline{H} = \frac{1}{k \cdot n_1} \sum_{i=1}^{k} \sum_{j=1}^{n_1} H(C_i, j)$$
where k is the number of equivalence classes in the anonymous table T and $n_1$ is the number of sensitive attributes.
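The two entropy definitions can be computed directly; the sketch below assumes equivalence classes are given as lists of record tuples and sensitive attributes are addressed by index (names are illustrative, not from the patent):

```python
import math
from collections import Counter

def attribute_entropy(values):
    """H = -sum_t (n_t/m) * log2(n_t/m) over the values one sensitive
    attribute takes inside an equivalence class."""
    m = len(values)
    return -sum((n / m) * math.log2(n / m) for n in Counter(values).values())

def average_entropy(classes, sensitive_idx):
    """Mean entropy over all equivalence classes and sensitive attributes."""
    total = sum(attribute_entropy([r[j] for r in cls])
                for cls in classes for j in sensitive_idx)
    return total / (len(classes) * len(sensitive_idx))

print(attribute_entropy(["a", "b", "a", "b"]))  # uniform two values: 1 bit
print(attribute_entropy(["a", "a", "a"]))       # single value: zero entropy
```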
(2) Usability metric
The amount of information loss is commonly used in anonymity-based privacy protection to measure data availability [55]. Assuming the quasi-identifier attributes contain $n_2$ numerical attributes and $n_3$ categorical attributes, the information loss of the numerical attributes in equivalence class Ci is defined as follows.
$$IL_{num}(C_i) = |C_i| \cdot \sum_{j=1}^{n_2} \frac{\max(C_i(A_j)) - \min(C_i(A_j))}{\max(A_j) - \min(A_j)}$$
where $\max(C_i(A_j))$ and $\min(C_i(A_j))$ are the maximum and minimum values of the jth numerical attribute in equivalence class Ci, $\max(A_j)$ and $\min(A_j)$ are the maximum and minimum values of the jth numerical attribute in the data table, and $|C_i|$ is the number of records in equivalence class Ci. Therefore, over the data table T, the information loss of all numerical attributes after anonymization can be written as
$$IL_{num}(T) = \sum_{i=1}^{k} IL_{num}(C_i)$$
where k is the number of equivalence classes. The information loss of the categorical attributes in equivalence class Ci is defined as follows.
$$IL_{cat}(C_i) = |C_i| \cdot \sum_{j=1}^{n_3} \frac{h(C_i(B_j))}{h(B_j)}$$
where $h(C_i(B_j))$ is the number of distinct values of the jth categorical attribute in equivalence class Ci and $h(B_j)$ is the number of distinct values of the jth categorical attribute in the data table T. Thus, over the data table T, the information loss of all categorical attributes after anonymization can be written as
$$IL_{cat}(T) = \sum_{i=1}^{k} IL_{cat}(C_i)$$
In summary, when the data table contains N records, the average information loss of the anonymous table T is
$$\overline{IL} = \frac{IL_{num}(T) + IL_{cat}(T)}{N}$$
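Using the loss definitions above (as reconstructed in the text), the per-class loss terms can be sketched as follows (records as dicts; all names are illustrative assumptions):

```python
def numeric_loss(cls, table, attrs):
    """|Ci| * sum_j (range of attribute j in Ci) / (range of j in T)."""
    loss = 0.0
    for a in attrs:
        col_c = [r[a] for r in cls]
        col_t = [r[a] for r in table]
        span = max(col_t) - min(col_t)
        loss += (max(col_c) - min(col_c)) / span if span else 0.0
    return len(cls) * loss

def categorical_loss(cls, table, attrs):
    """|Ci| * sum_j (#distinct values of j in Ci) / (#distinct values in T)."""
    loss = sum(len({r[a] for r in cls}) / len({r[a] for r in table})
               for a in attrs)
    return len(cls) * loss
```

Summing these over all equivalence classes and dividing by N then gives the average information loss of the anonymous table.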
4.3 analysis of the results of the experiment
4.3.1 Effect of different conditions on anonymization processing time
The running time of the method comes mainly from anonymization, and the average anonymization time reflects the performance of the algorithm. Both the k-anonymity parameter k and the data volume affect the anonymization time, which is spent mainly on anonymity-group construction and adjustment. As k increases, the average anonymization time gradually increases, as shown in Fig. 2; as the data volume increases, the average anonymization time also gradually increases, as shown in Fig. 3. This is because more data generally means more iterations and adjustments during anonymization, so the processing time grows; however, as the data volume increases, the data distribution becomes denser and anonymization becomes easier, so the processing time rises only gently. Figures 2 and 3 also show that the method of this patent has a shorter processing time than MDAV under different k values and data volumes, because the table records are first clustered so that similar records fall into the same cluster, which lays the foundation for the subsequent anonymization.
4.3.2 Impact of k-value on information availability
The amount of information loss measures data availability: the larger the loss, the lower the availability, and the smaller the loss, the higher the availability. The original data set is processed with different k values and the change in information loss is observed, as shown in Fig. 4. As k increases, the information loss gradually increases, i.e., data availability gradually decreases. Because the data set is large, the differences between equivalence classes are small when k is small; as k grows, equivalence classes with small differences gradually merge, so the information loss changes little, but once the differences between equivalence classes are large, further increases in k change the information loss markedly. Fig. 4 also shows that the information loss of the MDAV method and of the proposed method differ little: although constraining the sensitive attribute values within each equivalence class (so that they are not all identical) adds some information loss, the anonymity groups are constructed by choosing the option with the smaller information loss, so the loss of each anonymity group remains small and the two methods end up close.
4.3.3 Influence of k value on degree of privacy protection
Information entropy measures the degree of privacy protection. As shown in Fig. 5, the information entropy gradually increases with k, i.e., the degree of data privacy protection gradually increases. Compared with MDAV, the proposed method yields a larger information entropy because the sensitive attributes within each equivalence class are constrained so that the sensitive attribute values are not all equal, which increases the entropy of the data table.
4.3.4 statistical value availability analysis
To illustrate the availability of the data after differential privacy processing, the original data, the data after the differential privacy noise addition of this patent, and the data after noise addition to the raw frequencies are compared, with ε = 1. As Table 4-8 and Fig. 6 show, the differences among the three are small, so the data retains a degree of availability.
TABLE 4-8 statistical value comparison
Summary of the invention
For the scenario of sharing government table data, this patent proposes a privacy-preserving table data sharing method based on cluster anonymity. First, the records in the table are clustered with the k-medoids algorithm to obtain several sub data tables; then each sub data table is anonymized with the help of the information loss measure to generate an anonymous data table; finally, noise is added to the sensitive attribute values. Example analysis and comparison with the classical k-anonymity algorithm MDAV demonstrate the availability and privacy of the proposed algorithm.

Claims (4)

1. A privacy protection table data sharing method based on cluster anonymity, characterized in that: the method is applied to a shared static data table and comprises the following steps:
step 1, clustering: divide the table data records based on k-medoids clustering; according to the distances between records in the data table, cluster the records of the shared static data table with the k-medoids clustering algorithm to obtain several clusters;
step 2, anonymization: process each cluster obtained in Step 1; first split the data within a cluster according to information loss, then adjust each resulting cluster so that every cluster satisfies the k-anonymity condition and no cluster has completely identical sensitive attribute values, and finally generalize the clusters to generate an anonymous data table;
step 3, differential privacy and noise addition: apply differential privacy processing to the sensitive attribute values in the table data;
step 4, comparative verification: finally, verify the availability and privacy of the method through example analysis and comparison with the classical k-anonymity algorithm MDAV;
in Step 1, the core idea of table data record division is as follows: the n records in the shared static data table are divided into several clusters using a clustering technique, so that records with high similarity fall into the same group; meanwhile, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering; the specific flow of table data record division combined with the k-medoids clustering algorithm is as follows:
step 11: normalization: quantize the non-sensitive ordered categorical attributes in the data table into the numerical values 1, 2, 3, ..., n and then treat them as numerical attributes; further normalize the numerical attribute data among all non-sensitive attributes in the data table, the normalization formula being:
$$x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$
where $x_i'$ is the normalized value of the numerical attribute, $x_i$ is the original value of the numerical attribute, $x_{min}$ is the minimum value of the attribute, and $x_{max}$ is the maximum value of the attribute;
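Step 11's min-max normalization can be sketched as follows (the handling of a constant column is an added assumption, not stated in the claim):

```python
def min_max_normalize(col):
    """Step 11: x' = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(col), max(col)
    if hi == lo:                       # constant column: map to 0.0 (assumption)
        return [0.0 for _ in col]
    return [(x - lo) / (hi - lo) for x in col]

print(min_max_normalize([20, 30, 40]))  # [0.0, 0.5, 1.0]
```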
step 12: according to the distance between table data records over the non-sensitive attributes, cluster the table data with the k-medoids clustering algorithm and divide the table data records into $k_1$ clusters;
step 13: according to the k-anonymity parameter $k_2$, adjust the records of clusters that do not meet the anonymity requirement: if the number of data records in a divided cluster is greater than $k_2$, no adjustment is made; if a cluster $C_i$ contains fewer than $k_2$ data records, move the record closest to $C_i$ into $C_i$, while ensuring that the cluster the record came from still contains more than $k_2$ data records;
step 14: repeat Step 13 until each cluster contains at least $k_2$ records;
Step 15: dividing data into different sub data tables T1, T2, tk according to different belonged clusters 1 Thereby obtaining k 1 A sheet data table;
in Step 12, when dividing records with the k-medoids clustering algorithm, the data table contains two types of attributes, categorical and numerical, so different distance calculation methods are needed when computing the distance between records; the k-medoids algorithm must also consider the optimality of the clustering result, namely the selection of the optimal number of clusters $k_1$; the process is as follows:
step 121, a calculation formula of the distance between data table records:
when computing the distance between data table records, the different attributes must be computed separately because several attributes coexist in the data table; the numerical attribute distance is given by formula 2:
$$dist(x_i, x_j) = |x_i - x_j| \quad \text{(formula 2)}$$
The calculation formula for the categorical attribute is shown as formula 3:
$$dist(x_i, x_j) = \begin{cases} 0, & x_i = x_j \\ 1, & x_i \neq x_j \end{cases} \quad \text{(formula 3)}$$
Suppose there are m numerical attributes and n categorical attributes in the data table; the distance between any two records $X_i$ and $X_j$ is then given by formula 4:
$$dist(X_i, X_j) = \sqrt{\sum_{p=1}^{m} dist(x_{ip}, x_{jp})^2 + \sum_{q=1}^{n} dist(x_{iq}, x_{jq})^2} \quad \text{(formula 4)}$$
where $x_{ip}$ and $x_{jp}$ are the p-th numerical attribute values of records $X_i$ and $X_j$ respectively, and $x_{iq}$ and $x_{jq}$ are the q-th categorical attribute values of records $X_i$ and $X_j$ respectively;
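Formulas 2 and 3 can be combined into a single record distance; the aggregation below (square root of the summed squared per-attribute distances) is one plausible reading of formula 4 and is an assumption, not a quotation of the claim:

```python
import math

def record_distance(x, y, numeric_idx):
    """Distance between two records with mixed attribute types:
    formula 2 (|xi - xj|) for numeric positions, formula 3 (0/1
    mismatch) for categorical positions, aggregated Euclidean-style."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in numeric_idx:
            total += (a - b) ** 2              # numeric: squared difference
        else:
            total += 0.0 if a == b else 1.0    # categorical: 0/1 mismatch
    return math.sqrt(total)
```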
step 122, number of data record dividing clusters k 1 Determination of (1):
the k-medoids clustering algorithm divides similar records into the same group to prepare for anonymization and to minimize the information loss caused by the anonymization process; when determining the number of clusters, the similarity within clusters is the main concern, so the number of clusters $k_1$ is determined by the within-group sum of squared errors SSE; as $k_1$ increases, the data records in each cluster gradually decrease and the distances between records within a cluster shrink, so the value of SSE decreases as $k_1$ increases; therefore, when determining $k_1$ by SSE, attention is paid to the rate of change: when SSE decreases only slowly as $k_1$ grows, further increasing $k_1$ changes the clustering effect little, and that $k_1$ is taken as the optimal cluster number; if each $k_1$ and its corresponding SSE value are plotted as a line graph, the $k_1$ at the inflection point is the optimal cluster number;
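Step 122's elbow selection can be sketched as follows (1-D clusters for brevity, and the "slow drop" threshold is an assumed heuristic, not part of the claim):

```python
def sse(clusters):
    """Within-group sum of squared errors for 1-D clusters."""
    total = 0.0
    for c in clusters:
        mu = sum(c) / len(c)
        total += sum((x - mu) ** 2 for x in c)
    return total

def elbow_k(sse_by_k, drop_ratio=0.1):
    """Return the smallest k after which SSE stops dropping quickly
    (the inflection point of the SSE line graph)."""
    ks = sorted(sse_by_k)
    for a, b in zip(ks, ks[1:]):
        if sse_by_k[a] - sse_by_k[b] < drop_ratio * sse_by_k[ks[0]]:
            return a
    return ks[-1]

print(sse([[0, 2], [5, 5]]))  # 2.0
```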
in Step 2, the $k_1$ sub data tables obtained by dividing the table data records are processed in turn; the core idea is to divide the data records in each sub data table so that the number of records in every generated cluster lies in $[k_2, 2k_2-1]$ while the sensitive attribute value within each cluster is not unique; the specific flow of the table data anonymization algorithm is as follows:
step 21: judge whether the number of data records in the data set is greater than $2k_2-1$; if so, execute Step 22;
step 22: select two records $r_1$ and $r_2$ from the data set as two initial clusters such that, when $r_1$ and $r_2$ form a cluster, the information loss of that cluster is the largest among all pairwise combinations of records; execute Step 23;
step 23: for each record in the data set, compute the change in information loss after assigning it to either of the two clusters, assign the record to the cluster with the smaller information loss, and adjust the records so that each cluster contains at least $k_2$ data records; return the two generated clusters as two new data sets to Step 21;
step 24: when the number of data records in every data set lies in $[k_2, 2k_2-1]$, cyclically judge whether any data set has a unique sensitive attribute value; if so, execute Step 25;
step 25: select data records with different sensitive attribute values for the data set Q, while ensuring that, if these records are deleted, the number of data records in their source data set is still at least $k_2$ and its sensitive attribute value is not unique;
step 26: compute the change in information loss after assigning the selected data records to the corresponding data set Q, and move the data record with the smaller information loss into Q;
step 27: once every set has a record count in $[k_2, 2k_2-1]$, generalize each set to obtain the anonymous data table.
2. The cluster anonymity based privacy preserving table data sharing method of claim 1, wherein: in Step 27, the anonymization rule of the generalization processing is: if a non-sensitive attribute is numerical, it is generalized to the value range of the set it belongs to; if the attribute is categorical, it is generalized to the full set of its values in the set it belongs to.
3. The cluster anonymity based privacy preserving table data sharing method of claim 2, wherein: the differential privacy noise adding processing process for the anonymous data table in Step 3 comprises the following steps:
(1) If categorical attributes exist among the sensitive attributes, perform frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequencies, add a Num column at the corresponding position, and record the noised data;
(2) If numerical attributes exist among the sensitive attributes, cluster each numerical attribute separately and replace every attribute value by the mean of its cluster, where the number of clusters is

$$\lceil n / k_3 \rceil$$

where n is the number of records in the data table and $k_3$ is the number of data records per cluster; $k_3$ can be chosen according to the availability requirement on the data: if a higher data quality is required, a smaller value is selected; if the quality requirement is low, a larger value may be selected; the processed data table can then be shared with the data requester.
4. The cluster anonymity based privacy preserving table data sharing method of claim 3, wherein: the shared static data table is shared government affairs table data.
CN201910752801.6A 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity Active CN110555316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752801.6A CN110555316B (en) 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity


Publications (2)

Publication Number Publication Date
CN110555316A CN110555316A (en) 2019-12-10
CN110555316B true CN110555316B (en) 2023-04-18

Family

ID=68737513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910752801.6A Active CN110555316B (en) 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity

Country Status (1)

Country Link
CN (1) CN110555316B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079183B (en) * 2019-12-19 2022-06-03 中国移动通信集团黑龙江有限公司 Privacy protection method, device, equipment and computer storage medium
CN111222164B (en) * 2020-01-10 2022-03-25 广西师范大学 Privacy protection method for issuing alliance chain data
CN111628974A (en) * 2020-05-12 2020-09-04 Oppo广东移动通信有限公司 Differential privacy protection method and device, electronic equipment and storage medium
CN112035874A (en) * 2020-08-28 2020-12-04 绿盟科技集团股份有限公司 Data anonymization processing method and device
CN113254992A (en) * 2021-05-21 2021-08-13 同智伟业软件股份有限公司 Electronic medical record publishing privacy protection method
CN113257378B (en) * 2021-06-16 2021-09-28 湖南创星科技股份有限公司 Medical service communication method and system based on micro-service technology
CN113378223B (en) * 2021-06-16 2023-12-26 北京工业大学 K-anonymous data processing method and system based on double coding and cluster mapping
CN113411186B (en) * 2021-08-19 2021-11-30 北京电信易通信息技术股份有限公司 Video conference data security sharing method
CN114092729A (en) * 2021-09-10 2022-02-25 南方电网数字电网研究院有限公司 Heterogeneous electricity consumption data publishing method based on cluster anonymization and differential privacy protection
CN113742781B (en) * 2021-09-24 2024-04-05 湖北工业大学 K anonymous clustering privacy protection method, system, computer equipment and terminal
CN114611127B (en) * 2022-03-15 2022-10-28 湖南致坤科技有限公司 Database data security management system
CN114817977B (en) * 2022-03-18 2024-03-29 西安电子科技大学 Anonymous protection method based on sensitive attribute value constraint

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292195A (en) * 2017-06-01 2017-10-24 徐州医科大学 The anonymous method for secret protection of k divided based on density
CN107358113A (en) * 2017-06-01 2017-11-17 徐州医科大学 Based on the anonymous difference method for secret protection of micro- aggregation
CN107766745A (en) * 2017-11-14 2018-03-06 广西师范大学 Classification method for secret protection in hierarchical data issue

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201426578A (en) * 2012-12-27 2014-07-01 Ind Tech Res Inst Generation method and device and risk assessment method and device for anonymous dataset
CN105512566B (en) * 2015-11-27 2018-07-31 电子科技大学 A kind of health data method for secret protection based on K- anonymities
US11132462B2 (en) * 2017-12-28 2021-09-28 Cilag Gmbh International Data stripping method to interrogate patient records and create anonymized record
CN109522750B (en) * 2018-11-19 2023-05-02 盐城工学院 Novel k anonymization realization method and system
CN110069943B (en) * 2019-03-29 2021-06-22 中国电力科学研究院有限公司 Data processing method and system based on cluster anonymization and differential privacy protection


Also Published As

Publication number Publication date
CN110555316A (en) 2019-12-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240321

Address after: No. 3 Juquan Road, Science City, Huangpu District, Guangzhou City, Guangdong Province, 510700. Office card slot A090, Jian'an Co Creation Space, Area A, Guangzhou International Business Incubator, No. A701

Patentee after: Guangzhou chick Information Technology Co.,Ltd.

Country or region after: China

Address before: 050043 No. 17, North Second Ring Road, Hebei, Shijiazhuang

Patentee before: SHIJIAZHUANG TIEDAO University

Country or region before: China
