CN110555316A - privacy protection table data sharing algorithm based on cluster anonymity - Google Patents


Info

Publication number
CN110555316A
CN110555316A
Authority
CN
China
Prior art keywords
data
cluster
records
attribute
value
Prior art date
Legal status
Granted
Application number
CN201910752801.6A
Other languages
Chinese (zh)
Other versions
CN110555316B (en)
Inventor
刘丽苹
朴春慧
Current Assignee
Guangzhou Chick Information Technology Co ltd
Original Assignee
Shijiazhuang Tiedao University
Priority date
Filing date
Publication date
Application filed by Shijiazhuang Tiedao University filed Critical Shijiazhuang Tiedao University
Priority to CN201910752801.6A priority Critical patent/CN110555316B/en
Publication of CN110555316A publication Critical patent/CN110555316A/en
Application granted granted Critical
Publication of CN110555316B publication Critical patent/CN110555316B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention relates to a privacy protection table data sharing algorithm based on cluster anonymity. Records in a table are first clustered with a k-medoids clustering algorithm to obtain a plurality of data tables; each data table is then anonymized according to the amount of information loss to generate an anonymous data table; finally, noise is added to the sensitive attribute values in the anonymous data table. The algorithm is verified by comparison with the classic k-anonymity algorithm MDAV (Maximum Distance to Average Vector), which demonstrates its availability and privacy, so the method has high popularization and application value.

Description

Privacy protection table data sharing algorithm based on cluster anonymity
Technical Field
The patent application belongs to the technical field of privacy protection, and more particularly relates to a privacy protection table data sharing algorithm based on cluster anonymity.
Background
With the construction and development of digital government, government affairs data are steadily growing: they are larger in scale, richer in type, and increasingly diverse and complex. For a long time, "information islands" and "data barriers" have been common, so the value of the data cannot be fully realized. Sharing government affairs data transfers information from one department to another, which alleviates the data-island phenomenon, lets the data deliver its maximum value, and improves the quality of government services. Table data sharing is one of the important ways of sharing government affairs data.
Generally, "privacy" refers to information that its owner is reluctant to let others obtain. The development of information technology inevitably increases the possibility of data leakage, which in turn limits that development, so privacy has attracted more and more attention. To reflect this attention intuitively, measured by the number of privacy-related papers published each year, the authors searched CNKI with "privacy" as a topic keyword, then searched within those results with "privacy protection" as a topic keyword, and plotted the change in the number of papers published each year since 1990, as shown in fig. 1.
As can be seen from fig. 1, interest in the privacy problem and the privacy protection problem has grown rapidly since 2003; in recent years the number of papers focusing on privacy protection has been about half the number focusing on privacy.
On this basis, a protection method is needed that ensures both data availability and data privacy. A comparative experimental analysis against the traditional anonymization algorithm MDAV shows that the proposed privacy protection method improves algorithm efficiency and provides effective privacy protection.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a cluster anonymity-based privacy protection table data sharing algorithm that avoids the aforementioned defects and provides effective privacy protection.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
A privacy protection table data sharing algorithm based on cluster anonymity is applied to a shared static data table and comprises the following steps:
Step1, clustering: divide the table data records based on k-medoids clustering, clustering the records in the shared static data table with the k-medoids algorithm according to the distances between the records in the data table, to obtain a plurality of clusters, i.e. a plurality of data tables;
Step2, anonymization: process each cluster obtained through Step1: first divide the data in the cluster according to the information loss amount, then adjust each resulting cluster so that every cluster satisfies the k-anonymity condition and no cluster has completely equal sensitive attribute values, and finally generalize the clusters to generate an anonymous data table;
Step3, differential privacy and noise adding: perform differential privacy processing on the sensitive attribute values in the table data;
Step4, comparative verification: finally, verify the availability and privacy of the method through example analysis and comparison with the classic k-anonymity algorithm MDAV.
The technical scheme of the invention is further improved as follows: in Step1, the core idea of table data record division is to divide the n records in the shared static data table into a plurality of clusters using a clustering technique, so that records with high similarity fall into the same group; at the same time, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering. The specific flow of table data record division with the k-medoids clustering algorithm is as follows:
Step 11: normalization. Quantize the non-sensitive ordered categorical attributes in the data table into the values 1, 2, 3, ..., n, treat the ordered categorical attributes as numerical attributes, and then normalize the numerical attribute data among all non-sensitive attributes in the data table. The normalization formula is:
xi' = (xi - xmin) / (xmax - xmin) (formula 1)
where xi' is the normalized value of a numerical attribute, xi is its original value, xmin is the minimum value of the attribute, and xmax is the maximum value of the attribute;
Step 12: according to the distance of the non-sensitive attributes between table data records, cluster the table data with the k-medoids clustering algorithm, dividing the table data records into k1 clusters;
Step 13: according to the k-anonymity parameter k2, adjust the clusters that do not meet the anonymity requirement: if the number of data records in a divided cluster is not less than k2, no adjustment is performed; if a cluster Ci has fewer than k2 data records, move the record nearest to Ci into Ci, while ensuring that the cluster from which the record is taken still has more than k2 records;
Step 14: repeat Step 13 until every cluster has at least k2 records;
Step 15: divide the data into different sub data tables T1, T2, ..., Tk1, thereby obtaining k1 sub data tables.
The technical scheme of the invention is further improved as follows: in Step12, when the k-medoids clustering algorithm is used to divide the records, the data table contains two kinds of attributes, categorical and numerical, so different distance calculation methods are needed when computing the distance between records; in addition, the optimal clustering result, i.e. the optimal number of divided clusters k1, must be considered. The selection process is as follows:
Step121, the calculation formula of the distance between data table records:
When the distance between data table records is calculated, since several attributes exist in the data table, different attributes must be computed separately. The numerical attribute distance is given by formula 2:
dist(xi, xj) = |xi - xj| (formula 2)
The categorical attribute distance is given by formula 3:
dist(xi, xj) = 0 if xi = xj; dist(xi, xj) = 1 if xi ≠ xj (formula 3)
Assume there are m numerical attributes and n categorical attributes in the data table. The distance between any two records Xi and Xj is then given by formula 4:
dist(Xi, Xj) = sqrt( Σp=1..m dist(xip, xjp)² + Σq=1..n dist(xiq, xjq)² ) (formula 4)
where xip and xjp are the p-th numerical attribute values of records Xi and Xj respectively, and xiq and xjq are the q-th categorical attribute values of records Xi and Xj respectively;
Step122, determination of the number k1 of clusters for dividing the data records:
The k-medoids clustering algorithm divides similar records into the same group to prepare for anonymization and to reduce as much as possible the information loss the anonymization process brings, so the within-cluster similarity is the main consideration when determining the number of clusters; the number k1 of clusters is determined through the within-group sum of squared errors (SSE). As k1 increases, the number of data records in each cluster gradually decreases and the distances between records within a cluster become smaller and smaller, so the value of SSE decreases as k1 increases. Therefore, when determining k1 through SSE, watch the rate of change: once SSE decreases only slowly as k1 grows, further increasing k1 changes the clustering effect little, and that k1 value is taken as the optimal number of clusters. If each k1 and its corresponding SSE value are plotted as a line graph, the k1 value at the inflection point is the optimal cluster number.
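The distance calculation of Step121 can be sketched in Python. The per-attribute distances follow formulas 2 and 3; the combined record distance is assumed here to be a Euclidean-style sum over both attribute kinds, and the records and attribute values are invented for the example.

```python
import math

# Hedged sketch of the record distance: numeric attributes use |xi - xj|,
# categorical attributes use 0/1 match, and the per-attribute distances
# are combined Euclidean-style (an assumption about formula 4).
def num_dist(xi, xj):            # formula 2
    return abs(xi - xj)

def cat_dist(xi, xj):            # formula 3: 0 if equal, else 1
    return 0.0 if xi == xj else 1.0

def record_dist(ri, rj):
    """Each record is (numeric_values, categorical_values), already normalized."""
    nums = sum(num_dist(a, b) ** 2 for a, b in zip(ri[0], rj[0]))
    cats = sum(cat_dist(a, b) ** 2 for a, b in zip(ri[1], rj[1]))
    return math.sqrt(nums + cats)

r1 = ([0.2, 0.5], ["male", "married"])      # illustrative records
r2 = ([0.6, 0.5], ["female", "married"])
print(record_dist(r1, r2))
```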
The technical scheme of the invention is further improved as follows: Step2 processes in turn each of the k1 sub data tables obtained from the table data record division. The core idea is to divide the data records in each sub data table so that the number of records in every generated cluster lies in [k2, 2k2-1], while the sensitive attribute value within each cluster is not unique. The specific flow of the table data anonymization algorithm is as follows:
Step 21: judge whether the number of data records in the data set is greater than 2k2-1; if it is greater than 2k2-1, execute Step 22;
Step 22: select two records r1 and r2 from the data set as two initial clusters, such that among all pairs of records in the data set, forming a cluster from r1 and r2 gives the largest amount of information loss, and execute Step 23;
Step 23: for each record in the data set, compute the change in information loss after assigning it to either of the two clusters, assign the record to the cluster with the smaller information loss, adjust the data records so that each cluster holds at least k2 records, and return the generated clusters to Step 21 as two newly generated data sets;
Step 24: when the number of data records in every data set lies in [k2, 2k2-1], check each data set in turn for a unique sensitive attribute value; if one exists, execute Step 25;
Step 25: select data records whose sensitive attribute values differ from those in the data set Q, ensuring that if such a record is removed, the data set it came from still holds at least k2 records and its sensitive attribute value there remains non-unique;
Step 26: compute the change in information loss after each selected data record is moved into the data set Q, and move into Q the record with the smaller information loss;
Step 27: obtain the sets whose record counts lie in [k2, 2k2-1], and perform generalization on each set to obtain the anonymous data table.
The technical scheme of the invention is further improved as follows: in Step27, the anonymization rule of the generalization processing is: if a non-sensitive attribute is numerical, it is generalized to the value range of that attribute within its set; if it is categorical, it is generalized to the full set of values of that attribute within its set.
The technical scheme of the invention is further improved as follows: the process of performing differential privacy and noise processing on the anonymous data table in Step3 is as follows:
(1) If a categorical attribute exists among the sensitive attributes, perform frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequency statistics, add a Num column at the corresponding position, and record the noised data;
(2) If numerical attributes exist among the sensitive attributes, cluster each numerical attribute separately and replace each attribute value with the cluster mean, where the number of clusters is n/k3 (n is the number of records in the data table and k3 is the number of data records in a cluster). The value of k3 can be chosen according to the availability requirement on the data: if the data quality requirement is high, a smaller value is selected; if the data quality requirement is low, a larger value can be selected. The smaller and larger values are set according to the client's requirements, and the processed data table can then be shared with the data requester.
The technical scheme of the invention is further improved as follows: the shared static data table is shared government affairs table data, but may also be other data.
Due to the adoption of the above technical scheme, the invention has the following beneficial effects: the method combines a clustering algorithm, the k-anonymity model, and differential privacy, and compared with the classic k-anonymity algorithm MDAV it offers both higher availability and a higher degree of privacy protection. The k-anonymity model is the most widely applied model; it has a simple structure and few restrictions, much subsequent research is designed and implemented on top of it, and it is convenient for government departments to use. However, the k-anonymity model cannot resist maximum-background-knowledge attacks, homogeneity attacks, re-identification attacks, and the like. Differential privacy can resist maximum-background-knowledge attacks but tends to provide poor data availability. Combining the two can in theory strengthen the degree of privacy protection and reduce the risk of privacy disclosure, while the clustering algorithm groups the data records so that records with high similarity are assigned to one group in preparation for anonymization.
Drawings
FIG. 1 shows the change in the number of papers published each year on the privacy problem and the privacy protection problem;
FIG. 2 compares the anonymization processing time of the present method with that of the MDAV method at different k values;
FIG. 3 compares the anonymization processing time of the present method with that of the MDAV method at different data volumes;
FIG. 4 compares the information loss of the present method with that of the MDAV method at different k values;
FIG. 5 compares the information entropy of the present method with that of the MDAV method at different k values;
FIG. 6 compares the different noise addition modes of the present method and the MDAV method with the real values.
Detailed Description
The present invention will be described in further detail with reference to examples.
The invention discloses a privacy protection table data sharing algorithm based on cluster anonymity, applied to a shared static data table, such as shared government affairs table data, comprising the following steps:
Step1, clustering: divide the table data records based on k-medoids clustering, clustering the records in the shared static data table with the k-medoids algorithm according to the distances between the records in the data table, to obtain a plurality of clusters, i.e. a plurality of data tables. The k-medoids clustering algorithm is a classic algorithm for the clustering problem; it is simple and fast, can handle large-scale data sets, and produces compact clusters with clear separation between clusters, so it is selected here. Its steps are as follows:
Stepa1: randomly select k cluster samples as the initial cluster centers.
Stepa2: calculate the distance from each remaining sample point to each initial cluster center, and assign each sample point to the cluster whose center is nearest, forming k clusters.
Stepa3: for each cluster, calculate for every sample point the sum of its distances to the other points in the cluster; select the point with the smallest sum as the new cluster center.
Stepa4: if the new set of cluster centers differs from the original set, return to Stepa2; if it is the same, the clustering algorithm ends.
Step2, anonymization: process each cluster obtained in Step1. First divide the data in the cluster according to the information loss amount so that the number of records in each divided cluster lies between k2 and 2k2-1; then adjust each resulting cluster so that every cluster satisfies the k-anonymity condition and no cluster has completely equal sensitive attribute values; finally, generalize the clusters to generate an anonymous data table;
Step3, differential privacy and noise adding: perform differential privacy processing on the sensitive attribute values in the table data;
In 2006, Dwork of Microsoft first proposed a new privacy protection model, differential privacy. Its basic idea is to process the data information by adding noise to the original or statistical data, transforming the original data so that adding or deleting a single record does not affect the overall statistical attribute values, thereby achieving privacy protection. The model mitigates the maximum-background-knowledge attack risk and defines a quantitative method for evaluating the privacy protection level.
Definition 1 (differential privacy): suppose two data sets D and D' differ in at most one record, i.e. |D Δ D'| ≤ 1, and let A be a privacy protection algorithm. If for any output result O of algorithm A on the data sets D and D' the following inequality holds, A is said to satisfy ε-differential privacy:
Pr[A(D) = O] ≤ e^ε × Pr[A(D') = O] (formula 11)
This guarantees, from a theoretical point of view, that algorithm A satisfies ε-differential privacy, where Pr[·] is the probability of an event and ε is called the privacy budget. The noise mechanism is the main technique for implementing differential privacy protection; commonly used noise mechanisms fall into the Laplace mechanism and the exponential mechanism.
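As a hedged illustration of the Laplace mechanism just mentioned: a count query has sensitivity 1, so adding Laplace(1/ε) noise to the true count satisfies ε-differential privacy. The helper names and records below are invented for the example.

```python
import math
import random

# Sketch of the Laplace mechanism for a count query (sensitivity 1).
def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def noisy_count(records, predicate, epsilon, rng=None):
    rng = rng or random.Random(0)
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

jobs = ["teacher", "farmer", "teacher", "clerk"]   # illustrative data
print(noisy_count(jobs, lambda r: r == "teacher", epsilon=1.0))
```

A smaller ε means a larger noise scale and stronger protection; a very large ε makes the noisy count nearly exact.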
Step4, comparative verification: finally, verify the availability and privacy of the method through example analysis and comparison with the classic k-anonymity algorithm MDAV.
In Step1, the core idea of table data record division is to divide the n records in the shared static data table into a plurality of clusters using a clustering technique, so that records with high similarity fall into the same group; at the same time, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering. The specific flow of table data record division with the k-medoids clustering algorithm is as follows:
Step 11: normalization. According to the form of the attribute values, the attributes of a data table can be divided into categorical attributes and numerical attributes, and categorical attributes can be further divided into ordered and unordered ones. For example, the score levels "excellent", "good", "medium", and "failing" are ordered categorical attributes, while the genders "male" and "female" are unordered categorical attributes. To better reflect the distance between data during clustering, the non-sensitive ordered categorical attributes in the data table are quantized in order into the values 1, 2, 3, ..., n, after which the ordered categorical attributes are treated as numerical attributes. Meanwhile, since the value ranges of the attributes differ, which strongly influences the calculation of record distances, the numerical attribute values and the quantized ordered categorical values must be normalized. The normalization formula is:
xi' = (xi - xmin) / (xmax - xmin) (formula 1)
where xi' is the normalized value of a numerical attribute, xi is its original value, xmin is the minimum value of the attribute, and xmax is the maximum value of the attribute;
Step 12: according to the distance of the non-sensitive attributes between table data records, cluster the table data with the k-medoids clustering algorithm, dividing the table data records into k1 clusters;
Step 13: according to the k-anonymity parameter k2, adjust the clusters that do not meet the anonymity requirement: if the number of data records in a divided cluster is not less than k2, no adjustment is performed; if a cluster Ci has fewer than k2 data records, move the record nearest to Ci into Ci, while ensuring that the cluster from which the record is taken still has more than k2 records;
Step 14: repeat Step 13 until every cluster has at least k2 records;
Step 15: divide the data into different sub data tables T1, T2, ..., Tk1, thereby obtaining k1 sub data tables; see Algorithm 3-1.
When the k-medoids clustering algorithm is used to divide the records, the data table contains two kinds of attributes, categorical and numerical, so different distance calculation methods are needed when computing the distance between records; in addition, the optimal clustering result, i.e. the optimal number of divided clusters k1, must be considered. The selection process is as follows:
Step121, the calculation formula of the distance between data table records:
When the distance between data table records is calculated, since several attributes exist in the data table, different attributes must be computed separately. The numerical attribute distance is given by formula 2:
dist(xi, xj) = |xi - xj| (formula 2)
The categorical attribute distance is given by formula 3:
dist(xi, xj) = 0 if xi = xj; dist(xi, xj) = 1 if xi ≠ xj (formula 3)
Assume there are m numerical attributes and n categorical attributes in the data table. The distance between any two records Xi and Xj is then given by formula 4:
dist(Xi, Xj) = sqrt( Σp=1..m dist(xip, xjp)² + Σq=1..n dist(xiq, xjq)² ) (formula 4)
where xip and xjp are the p-th numerical attribute values of records Xi and Xj respectively, and xiq and xjq are the q-th categorical attribute values of records Xi and Xj respectively;
Step122, determination of the number k1 of clusters for dividing the data records:
The k-medoids clustering algorithm divides similar records into the same group to prepare for anonymization and to reduce as much as possible the information loss the anonymization process brings, so the within-cluster similarity is the main consideration when determining the number of clusters; the number k1 of clusters is determined through the within-group sum of squared errors (SSE). As k1 increases, the number of data records in each cluster gradually decreases and the distances between records within a cluster become smaller and smaller, so the value of SSE decreases as k1 increases. Therefore, when determining k1 through SSE, watch the rate of change: once SSE decreases only slowly as k1 grows, further increasing k1 changes the clustering effect little, and that k1 value is taken as the optimal number of clusters. If each k1 and its corresponding SSE value are plotted as a line graph, the k1 value at the inflection point is the optimal cluster number.
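A minimal sketch of the elbow selection described above: the SSE values are invented, and the 10% relative-drop threshold is an assumption for illustration, not part of the method.

```python
# Hedged sketch of Step 122: pick k1 where the SSE curve flattens.
def elbow_k(sse_by_k, threshold=0.1):
    """sse_by_k: dict {k: SSE}. Return the smallest k after which the
    relative SSE drop falls below `threshold` (the 'inflection point')."""
    ks = sorted(sse_by_k)
    for prev, nxt in zip(ks, ks[1:]):
        drop = (sse_by_k[prev] - sse_by_k[nxt]) / sse_by_k[prev]
        if drop < threshold:
            return prev
    return ks[-1]

sse = {2: 100.0, 3: 60.0, 4: 35.0, 5: 33.0, 6: 32.5}   # made-up SSE curve
print(elbow_k(sse))   # SSE flattens after k = 4
```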
The k1 sub data tables obtained from the table data record division are processed in turn. The core idea is to divide the data records in each sub data table so that the number of records in every generated cluster lies in [k2, 2k2-1], while ensuring that the sensitive attribute value within each cluster is not unique. The specific flow of the table data anonymization algorithm is as follows:
Step 21: judge whether the number of data records in the data set is greater than 2k2-1; if it is greater than 2k2-1, execute Step 22;
Step 22: select two records r1 and r2 from the data set as two initial clusters, such that among all pairs of records in the data set, forming a cluster from r1 and r2 gives the largest amount of information loss, and execute Step 23;
Step 23: for each record in the data set, compute the change in information loss after assigning it to either of the two clusters, assign the record to the cluster with the smaller information loss, adjust the data records so that each cluster holds at least k2 records, and return the generated clusters to Step 21 as two newly generated data sets;
Step 24: when the number of data records in every data set lies in [k2, 2k2-1], check each data set in turn for a unique sensitive attribute value; if one exists, execute Step 25;
Step 25: select data records whose sensitive attribute values differ from those in the data set Q, ensuring that if such a record is removed, the data set it came from still holds at least k2 records and its sensitive attribute value there remains non-unique;
Step 26: compute the change in information loss after each selected data record is moved into the data set Q, and move into Q the record with the smaller information loss;
Step 27: obtain the sets whose record counts lie in [k2, 2k2-1], and perform generalization on each set to obtain the anonymous data table. For the table data anonymization algorithm, see Algorithm 3-2.
In Step27, the anonymization rule of the generalization processing is: if a non-sensitive attribute is numerical, it is generalized to the value range of the attribute within its set; for example, the set {1, 2, 3, 3} is generalized to [1, 3]. If the attribute is categorical, it is generalized to the full set of values of the attribute within its set; for example, the set {works in private enterprises, works in state-owned enterprises} is generalized to {works in private/state-owned enterprises}.
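The generalization rule can be sketched as follows; the column values and the "/"-joined set notation are invented for the example.

```python
# Hedged sketch of the Step 27 generalization rule: a numeric column
# becomes its [min, max] range, a categorical column becomes the set of
# values seen in the cluster.
def generalize_column(values, numeric):
    if numeric:
        return (min(values), max(values))          # e.g. {1,2,3,3} -> [1, 3]
    return "/".join(sorted(set(values)))           # e.g. {private, state} -> private/state

print(generalize_column([1, 2, 3, 3], numeric=True))
print(generalize_column(["private", "state"], numeric=False))
```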
When anonymizing the table data, the information loss amount is calculated with reference to an availability metric formula. By adjusting the data records according to the size of the information loss, the information loss can be minimized and the availability of the data increased.
Anonymization produces a number of equivalence classes, but since the sensitive attribute values are not processed, an attacker can still infer an individual's sensitive attribute value through a background-knowledge attack, causing privacy disclosure. Therefore, to better protect individual privacy, differential privacy and noise processing are applied to the sensitive attribute values before the data are shared. The process of differential privacy and noise addition in Step3 is as follows:
(1) If a categorical attribute exists among the sensitive attributes, perform frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequency statistics, add a Num column at the corresponding position, and record the noised data;
(2) If numerical attributes exist among the sensitive attributes, cluster each numerical attribute separately and replace each attribute value with the cluster mean, where the number of clusters is n/k3 (n is the number of records in the data table and k3 is the number of data records in a cluster). The value of k3 can be chosen according to the availability requirement on the data: if the data quality requirement is high, a smaller value is selected; if the data quality requirement is low, a larger value can be selected. The processed data table can then be shared with the data requester. The differential privacy algorithm is shown in Algorithm 3-3.
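Case (2), replacing numeric sensitive values with cluster means, can be sketched as a simple sorted micro-aggregation; the grouping strategy (sort, then take runs of k3 and fold any short tail into the last group) and the income values are illustrative assumptions.

```python
# Hedged sketch of case (2): micro-aggregate a numeric sensitive column.
# Smaller k3 keeps data quality higher; larger k3 hides individuals better.
def microaggregate(values, k3):
    out = [0.0] * len(values)
    order = sorted(range(len(values)), key=lambda i: values[i])
    for start in range(0, len(order), k3):
        group = order[start:start + k3]
        if len(group) < k3 and start > 0:      # fold a short tail into the last group
            group = order[start - k3:]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out

incomes = [100, 110, 120, 500, 520, 540]       # illustrative data
print(microaggregate(incomes, k3=3))
```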
4. Experimental design and results analysis
4.1 data set
In order to verify the availability and effectiveness of the privacy protection table data sharing algorithm based on cluster anonymity, a real Philippine family income and expenditure data set provided by Kaggle is used as the experimental data set. The data set contains 41544 citizen records and involves several kinds of private data, such as family information, detailed family income and expenditure, and property status. A table consisting of five attributes, householder gender (HouseholdHeadSex), householder marital status (HouseholdHeadMaritalStatus), householder age (HouseholdHeadAge), number of family members (TotalNumberOfFamilyMembers), and householder class of worker (HouseholdHeadClassOfWorker), is selected as the data table T to be shared. The householder's gender and marital status are categorical attributes, the number of family members and the householder's age are numerical attributes, and the householder's class of worker is treated as the sensitive attribute. In the experiments, the algorithm proposed in this patent is compared with the classic MDAV anonymization algorithm, and the performance, availability, and privacy protection degree of the method are analyzed.
4.2 evaluation of privacy protection algorithms
(1) Privacy metric
In anonymous privacy protection, information entropy is usually adopted to measure the degree of privacy protection; it reflects the probability distribution of the data records in the data table. The larger the entropy, the more uniform the probability distribution of the data, the lower the probability that an attacker succeeds, and the higher the degree of privacy protection; conversely, the smaller the entropy, the lower the degree of privacy protection. The information entropy of the jth sensitive attribute in the equivalence class Ci of the anonymous data table is defined as follows:

E(Ci, Sj) = -Σt (nt / m) log2(nt / m)

where m is the number of data records in the equivalence class Ci and nt is the number of records taking the tth value of the jth sensitive attribute in Ci. The average entropy of the equivalence classes in the anonymous data table is then defined as follows:

AE(T) = (1 / (k · n1)) Σi=1..k Σj=1..n1 E(Ci, Sj)

where k is the number of equivalence classes in the anonymous table T and n1 is the number of sensitive attributes.
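The two entropy definitions can be sketched as follows (a minimal sketch for a single sensitive attribute, so n1 = 1; function and variable names are illustrative):

```python
import math

def sensitive_entropy(values):
    # Entropy of one sensitive attribute within an equivalence class:
    # E = -sum_t (n_t / m) * log2(n_t / m), with m records in the class.
    m = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((n / m) * math.log2(n / m) for n in counts.values())

def average_entropy(equivalence_classes):
    # Mean entropy over the k equivalence classes of the anonymous table.
    return sum(sensitive_entropy(c) for c in equivalence_classes) / len(equivalence_classes)
```

A class whose sensitive values are all identical has entropy 0 (an attacker who locates the class learns the value for certain), which is exactly the situation the anonymization step is designed to rule out.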
(2) Usability metric
The amount of information loss is commonly employed in anonymous privacy protection to measure data availability [55]. Assume that among the quasi-identifier attributes there are n2 numerical attributes and n3 categorical attributes. The information loss of the numerical attributes in the equivalence class Ci is defined as follows:

ILnum(Ci) = |Ci| · Σj=1..n2 (max(Ci(Aj)) - min(Ci(Aj))) / (max(Aj) - min(Aj))

where max(Ci(Aj)) and min(Ci(Aj)) are the maximum and minimum values of the jth numerical attribute in the equivalence class Ci, max(Aj) and min(Aj) are the maximum and minimum values of the jth numerical attribute in the data table, and |Ci| is the number of records in the equivalence class Ci. Therefore, on the data table T, the information loss of all numerical attributes after anonymization can be written as

ILnum(T) = Σi=1..k ILnum(Ci)

where k is the number of equivalence classes. The information loss of the categorical attributes in the equivalence class Ci is defined as follows:

ILcat(Ci) = |Ci| · Σj=1..n3 h(Ci(Bj)) / h(Bj)

where h(Ci(Bj)) is the number of distinct values of the jth categorical attribute in the equivalence class Ci and h(Bj) is the number of distinct values of the jth categorical attribute in the data table T. Therefore, on the data table T, the information loss of all categorical attributes after anonymization can be written as

ILcat(T) = Σi=1..k ILcat(Ci)

In summary, when the data table holds N data records, the average information loss of the anonymous table T is

IL(T) = (ILnum(T) + ILcat(T)) / N
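Per-attribute versions of the two loss terms can be sketched as follows (a minimal sketch; the weighting by |Ci| and the final averaging over the N records are omitted, and the normalization of the categorical term by the table-wide number of distinct values is taken directly from the definition above):

```python
def numeric_loss(cls_col, table_col):
    # Range of the jth numerical attribute inside the equivalence class,
    # normalized by its range over the whole data table.
    return (max(cls_col) - min(cls_col)) / (max(table_col) - min(table_col))

def categorical_loss(cls_col, table_col):
    # Distinct values of the jth categorical attribute inside the class,
    # normalized by the distinct values over the whole data table.
    return len(set(cls_col)) / len(set(table_col))
```

Both quantities lie in (0, 1]: a class that spans the whole attribute domain loses all distinguishing information for that attribute (loss 1), while a tight class loses little.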
4.3 analysis of the results of the experiment
4.3.1 Effect of different conditions on anonymization processing time
The time cost of the method comes mainly from anonymization, so the average anonymization time reflects the performance of the algorithm. Both the k-anonymity parameter k and the data volume affect the anonymization time, which is dominated by the anonymity group construction and adjustment processes. As the value of k increases, the average anonymization time gradually increases, as shown in Fig. 2; as the data volume increases, the average anonymization time also gradually increases, as shown in Fig. 3. This is because the numbers of iterations and adjustments grow with the data, so processing takes longer; the growth slows, however, because a larger data volume increases the data distribution density, which makes the data easier to anonymize. Figures 2 and 3 also show that, for all tested k values and data volumes, the method of this patent requires less processing time than MDAV. This is because the proposed method first clusters the table records, placing similar records in the same cluster, which lays the foundation for the anonymization step.
4.3.2 Impact of the k value on information availability
The amount of information loss measures data availability: the larger the loss, the lower the availability, and the smaller the loss, the higher the availability. The original data set was anonymized with different k values and the resulting information loss observed, as shown in Fig. 4. As k increases, the information loss gradually grows, i.e., data availability gradually decreases. Because the data set is large, the differences between equivalence classes are small when k is small; as k grows, equivalence classes with small differences merge, so the information loss changes little. Once the differences between the equivalence classes become large, further increases of k cause the information loss to change sharply. Fig. 4 also shows that the information loss of the MDAV method and that of the proposed method differ little. Although the proposed method constrains the sensitive attribute values within each equivalence class during anonymization (so that they are not all identical), which adds some information loss, the anonymity groups are constructed by selecting for minimal information loss, keeping the loss of each group small; hence the two methods incur similar information loss.
4.3.3 Impact of the k value on privacy protection
Information entropy measures the degree of privacy protection. As shown in Fig. 5, the information entropy increases with k, i.e., the degree of data privacy protection gradually rises. Compared with MDAV, the method of this patent yields a larger information entropy because it constrains the sensitive attributes within each equivalence class so that their values are not all identical, which increases the entropy of the data table.
4.3.4 statistical value availability analysis
In order to illustrate the usability of the data after differential privacy processing, three versions are compared: the original data, the data after the differential privacy noise processing of this patent, and the data after adding differential privacy noise directly to the original frequencies, with ε = 1. As shown in Table 4-8 and Fig. 6, the differences among the three are small, so the data retains a degree of usability.
TABLE 4-8 statistical value comparison
Summary of the invention
For the scenario of sharing government affairs table data, this patent proposes a privacy protection table data sharing method based on cluster anonymity. First, the records in the table are clustered with the k-medoids clustering algorithm to obtain several sub data tables; then each sub data table is anonymized according to information loss to generate an anonymous data table; finally, noise is added to the sensitive attribute values. The usability and privacy of the proposed algorithm are demonstrated through example analysis and comparison with the classic k-anonymity algorithm MDAV.

Claims (7)

1. A privacy protection table data sharing algorithm based on cluster anonymity, characterized in that it is applied to a shared static data table and comprises the following steps:
Step 1, clustering: divide the table data records based on k-medoids clustering; according to the distances between the records in the data table, cluster the records of the shared static data table with the k-medoids clustering algorithm to obtain a plurality of clusters;
Step 2, anonymization: process each cluster obtained in Step 1; first divide the data in the cluster according to information loss, then adjust each resulting cluster so that every cluster satisfies the k-anonymity condition and no cluster has completely identical sensitive attribute values, and finally generalize the clusters to generate an anonymous data table;
Step 3, differential privacy noise addition: apply differential privacy processing to the sensitive attribute values in the table data;
Step 4, comparative verification: finally, verify the usability and privacy of the method through example analysis and comparison with the classic k-anonymity algorithm MDAV.
2. The cluster anonymity based privacy preserving table data sharing algorithm of claim 1, wherein: in Step 1, the core idea of table data record division is as follows: divide the n records of the shared static data table into several clusters with a clustering technique, so that records with high similarity fall into the same group; meanwhile, to satisfy the subsequent k-anonymity requirement, clusters that do not meet the anonymity requirement must be adjusted after clustering. The specific flow of table data record division combined with the k-medoids clustering algorithm is as follows:
Step 11: normalization. Quantize the non-sensitive ordered categorical attributes of the data table into the numerical values 1, 2, 3, …, n and thereafter treat them as numerical attributes; then normalize the numerical attribute data of all non-sensitive attributes in the data table with the following formula:

xi' = (xi - xmin) / (xmax - xmin) (equation 1)

where xi' is the normalized value of a numerical attribute, xi is the original value of the attribute, xmin is the minimum value of the attribute, and xmax is the maximum value of the attribute;
Step 12: according to the distances between the non-sensitive attributes of the table data records, cluster the table data with the k-medoids clustering algorithm and divide the records into k1 clusters;
Step 13: according to the k-anonymity parameter k2, adjust the clusters that do not meet the anonymity requirement: if the number of data records in a cluster is not less than k2, no adjustment is performed; if a cluster Ci holds fewer than k2 data records, move into Ci the records nearest to it, while ensuring that the cluster each moved record came from still holds no fewer than k2 records;
Step 14: repeat Step 13 until every cluster holds at least k2 records;
Step 15: divide the data into sub data tables T1, T2, …, Tk1 according to the cluster each record belongs to, thereby obtaining k1 sub data tables.
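The min-max normalization of Step 11 can be sketched as follows (a minimal sketch of the formula above; column-wise application over the whole table and handling of constant columns are left out):

```python
def min_max_normalize(column):
    # x' = (x - x_min) / (x_max - x_min): maps a numeric attribute to [0, 1],
    # so that all attributes contribute comparably to the record distance.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]
```

Normalizing before clustering is what makes it meaningful to mix, say, an age attribute with a family-size attribute in one distance function.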
3. The cluster anonymity based privacy preserving table data sharing algorithm of claim 2, wherein: in Step 12, when the k-medoids clustering algorithm is used to partition the records, the data table contains two types of attributes, categorical and numerical, so different distance calculation methods are needed when computing the distance between records; in addition, the optimal clustering result, i.e., the optimal number of clusters k1, must be considered. The process is as follows:
Step 121, calculation of the distance between data table records:
Since the data table contains several attributes, different attribute types are calculated separately. The distance between numerical attribute values is given by equation 2:

dist(xi, xj) = |xi - xj| (equation 2)

The distance between categorical attribute values is given by equation 3:

dist(xi, xj) = 0 if xi = xj, and 1 otherwise (equation 3)

Assume that the data table has m numerical attributes and n categorical attributes. The distance between any two records Xi and Xj is then given by equation 4:

dist(Xi, Xj) = Σp=1..m |xip - xjp| + Σq=1..n dist(xiq, xjq) (equation 4)

where xip and xjp are the pth numerical attribute values of records Xi and Xj respectively, and xiq and xjq are the qth categorical attribute values of records Xi and Xj respectively;
Step 122, determination of the number of clusters k1:
The k-medoids clustering algorithm divides similar records into the same group in preparation for anonymization, so as to reduce the information loss of the anonymization process as much as possible; the within-cluster similarity is therefore the main consideration when determining the number of clusters, and k1 is determined through the within-group sum of squared errors (SSE). As k1 increases, the number of records in each cluster gradually decreases and the distances between records within a cluster become smaller and smaller, so the SSE decreases as k1 increases. When determining k1 through the SSE, attention is paid to its rate of change: once the SSE decreases only slowly as k1 grows further, increasing k1 changes the clustering result little, and that k1 is taken as the optimal number of clusters. If each k1 and its corresponding SSE value are plotted as a line graph, the k1 at the inflection point is the optimal number of clusters.
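The mixed-attribute distance of equations 2 to 4 and the SSE-based choice of k1 can be sketched together as follows. Two points are assumptions: the way equation 4 combines the two attribute types (a plain sum here) and the rule for detecting the inflection point (largest second difference of the SSE curve here); the source describes the curve only qualitatively.

```python
def record_distance(x, y, numeric_idx, categorical_idx):
    # Numeric attributes: absolute difference (equation 2).
    d = sum(abs(x[p] - y[p]) for p in numeric_idx)
    # Categorical attributes: 0 if equal, 1 otherwise (equation 3).
    d += sum(0 if x[q] == y[q] else 1 for q in categorical_idx)
    return d  # combined record distance (equation 4, assumed to be a sum)

def elbow_k(sse_values):
    # sse_values[i] is the within-group SSE obtained with k1 = i + 1 clusters.
    # The "inflection point" is taken as the k1 with the sharpest bend,
    # i.e. the largest second difference of the SSE curve.
    second = [sse_values[i - 1] - 2 * sse_values[i] + sse_values[i + 1]
              for i in range(1, len(sse_values) - 1)]
    return second.index(max(second)) + 2
```

For an SSE curve that drops steeply up to some k1 and flattens afterwards, the second difference peaks exactly at the bend, matching the line-graph reading described above.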
4. The cluster anonymity based privacy preserving table data sharing algorithm of claim 3, wherein: in Step 2, the k1 sub data tables obtained by dividing the table data records are processed in turn. The core idea is to divide the data records of each sub data table so that the number of records in every generated cluster lies in [k2, 2k2-1] while the sensitive attribute value in each cluster is not unique. The specific flow of the table data anonymization algorithm is as follows:
Step 21: judge whether the number of data records in the data set is greater than 2k2-1; if so, execute Step 22;
Step 22: select two records r1 and r2 from the data set as two initial clusters, such that, over all pairs of records in the data set, placing r1 and r2 in one cluster would yield the largest information loss; then execute Step 23;
Step 23: for each remaining record in the data set, compute the change in information loss when it is assigned to either of the two clusters, and assign it to the cluster with the smaller information loss; adjust the assignment so that each cluster holds at least k2 data records, and return the two newly generated clusters to Step 21 as new data sets;
Step 24: when the number of data records in every data set lies in [k2, 2k2-1], check each data set in turn for a unique sensitive attribute value; if such a data set exists, execute Step 25;
Step 25: select candidate data records whose sensitive attribute values differ from that of the affected data set Q, while ensuring that, if a candidate record is removed from its own data set, that set still holds at least k2 data records and its sensitive attribute value remains non-unique;
Step 26: compute the change in information loss caused by moving each selected candidate record into the data set Q, and move into Q the record with the smaller information loss;
Step 27: obtain the sets whose record counts lie in [k2, 2k2-1], and generalize each set to obtain the anonymous data table.
5. The cluster anonymity based privacy preserving table data sharing algorithm of claim 4, wherein: in Step 27, the generalization rule of the anonymization is: if a non-sensitive attribute is numerical, it is generalized to the value range of that attribute within the set in which it is located; if it is categorical, it is generalized to the collection of values that attribute takes within the set in which it is located.
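The generalization rule of claim 5 can be sketched as follows (an illustrative sketch; representing a numeric range as a (min, max) pair and a categorical generalization as the set of values occurring in the cluster are assumptions about the concrete encoding):

```python
def generalize_cluster(records, attr_is_numeric):
    # Replace every record of one cluster with the same generalized tuple:
    # numeric attributes become their (min, max) range within the cluster,
    # categorical attributes become the set of values present in the cluster.
    generalized = []
    for j, is_num in enumerate(attr_is_numeric):
        col = [r[j] for r in records]
        generalized.append((min(col), max(col)) if is_num else frozenset(col))
    return [tuple(generalized)] * len(records)
```

After this step every record of a cluster is indistinguishable on its quasi-identifier attributes, which is what makes the cluster an equivalence class of size at least k2.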
6. The cluster anonymity based privacy preserving table data sharing algorithm of claim 5, wherein: the differential privacy noise addition applied to the anonymous data table in Step 3 comprises the following steps:
(1) if a categorical attribute exists among the sensitive attributes, perform frequency statistics over the different combinations of categorical sensitive attribute values, add noise to the frequency counts, append a Num column at the corresponding position, and record the noised data;
(2) if a numerical attribute exists among the sensitive attributes, cluster each numerical attribute separately and replace each attribute value with the mean of its cluster, where the number of clusters is n/k3, n being the number of records in the data table and k3 the number of data records per cluster; k3 can be chosen according to the availability requirement on the data: if the data quality requirement is high, a smaller value is selected; if the data quality requirement is low, a larger value can be selected, and the processed data table can be shared with the data requester.
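Item (2) above, per-attribute clustering followed by mean replacement, can be sketched with a simple sorting-based micro-aggregation (an assumed stand-in for the clustering step: groups of k3 adjacent sorted values play the role of the clusters, giving n/k3 groups as in the claim):

```python
def microaggregate(values, k3):
    # Sort one numerical sensitive attribute, split it into groups of k3
    # adjacent values, and replace every value in a group by the group mean.
    ordered = sorted(values)
    result = []
    for i in range(0, len(ordered), k3):
        group = ordered[i:i + k3]
        mean = sum(group) / len(group)
        result.extend([mean] * len(group))
    return result  # values in sorted order, each replaced by its group mean
```

A small k3 keeps each mean close to the original values (high data quality); a large k3 pools more records per group (stronger masking), matching the trade-off described in the claim.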
7. The cluster anonymity based privacy preserving table data sharing algorithm of claim 6, wherein: the shared static data table is shared government affairs table data.
CN201910752801.6A 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity Active CN110555316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752801.6A CN110555316B (en) 2019-08-15 2019-08-15 Privacy protection table data sharing method based on cluster anonymity


Publications (2)

Publication Number Publication Date
CN110555316A true CN110555316A (en) 2019-12-10
CN110555316B CN110555316B (en) 2023-04-18


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079183A (en) * 2019-12-19 2020-04-28 中国移动通信集团黑龙江有限公司 Privacy protection method, device, equipment and computer storage medium
CN111222164A (en) * 2020-01-10 2020-06-02 广西师范大学 Privacy protection method for issuing alliance chain data
CN111628974A (en) * 2020-05-12 2020-09-04 Oppo广东移动通信有限公司 Differential privacy protection method and device, electronic equipment and storage medium
CN112035874A (en) * 2020-08-28 2020-12-04 绿盟科技集团股份有限公司 Data anonymization processing method and device
CN113254992A (en) * 2021-05-21 2021-08-13 同智伟业软件股份有限公司 Electronic medical record publishing privacy protection method
CN113257378A (en) * 2021-06-16 2021-08-13 湖南创星科技股份有限公司 Medical service communication method and system based on micro-service technology
CN113378223A (en) * 2021-06-16 2021-09-10 北京工业大学 K-anonymous data processing method and system based on dual coding and cluster mapping
CN113411186A (en) * 2021-08-19 2021-09-17 北京电信易通信息技术股份有限公司 Video conference data security sharing method
CN113742781A (en) * 2021-09-24 2021-12-03 湖北工业大学 K anonymous clustering privacy protection method, system, computer equipment and terminal
CN114092729A (en) * 2021-09-10 2022-02-25 南方电网数字电网研究院有限公司 Heterogeneous electricity consumption data publishing method based on cluster anonymization and differential privacy protection
CN114611127A (en) * 2022-03-15 2022-06-10 湖南致坤科技有限公司 Database data security management system
CN114817977A (en) * 2022-03-18 2022-07-29 西安电子科技大学 Anonymous protection method based on sensitive attribute value constraint

Citations (8)

Publication number Priority date Publication date Assignee Title
US20140189858A1 (en) * 2012-12-27 2014-07-03 Industrial Technology Research Institute Generation Method and Device for generating anonymous dataset, and method and device for risk evaluation
CN105512566A (en) * 2015-11-27 2016-04-20 电子科技大学 Health data privacy protection method based on K-anonymity
CN107292195A (en) * 2017-06-01 2017-10-24 徐州医科大学 The anonymous method for secret protection of k divided based on density
CN107358113A (en) * 2017-06-01 2017-11-17 徐州医科大学 Based on the anonymous difference method for secret protection of micro- aggregation
CN107766745A (en) * 2017-11-14 2018-03-06 广西师范大学 Classification method for secret protection in hierarchical data issue
CN109522750A (en) * 2018-11-19 2019-03-26 盐城工学院 A kind of new k anonymity realization method and system
US20190205566A1 (en) * 2017-12-28 2019-07-04 Ethicon Llc Data stripping method to interrogate patient records and create anonymized record
CN110069943A (en) * 2019-03-29 2019-07-30 中国电力科学研究院有限公司 A kind of data processing method and system based on cluster anonymization and difference secret protection


Non-Patent Citations (3)

Title
LIJUAN ZHENG et al.: "k-Anonymity Location Privacy Algorithm Based on Clustering", IEEE Access *
SHI YAJUAN: "Research on Privacy Protection in Government Data Publishing and Sharing", China Master's Theses Full-text Database, Information Science and Technology *
GONG WEIHUA et al.: "Privacy Protection Method for Social Networks Based on k-Degree Anonymity", Acta Electronica Sinica *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240321

Address after: No. 3 Juquan Road, Science City, Huangpu District, Guangzhou City, Guangdong Province, 510700. Office card slot A090, Jian'an Co Creation Space, Area A, Guangzhou International Business Incubator, No. A701

Patentee after: Guangzhou chick Information Technology Co.,Ltd.

Country or region after: China

Address before: 050043 No. 17, North Second Ring Road, Hebei, Shijiazhuang

Patentee before: SHIJIAZHUANG TIEDAO University

Country or region before: China