CN112131604A - High-dimensional privacy data publishing method based on Bayesian network attribute cluster analysis technology
- Publication number
- CN112131604A (application number CN202011013027.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
Abstract
The invention relates to a high-dimensional privacy data publishing method based on a Bayesian network attribute cluster analysis technology. Compared with the prior art, it overcomes the large error, poor usability and low efficiency of noisy publishing of high-dimensional private data. The invention comprises the following steps: acquiring high-dimensional data; clustering the attributes into attribute subsets; constructing noisy Bayesian networks; generating noisy conditional distributions; and publishing the synthetic data set. In a high-dimensional big data environment, the method ensures both the privacy and the usability of the data, shortens the running time of the data publishing algorithm, and achieves effective publishing of private data.
Description
Technical Field
The invention relates to the technical field of high-dimensional data privacy processing, in particular to a high-dimensional privacy data publishing method based on a Bayesian network attribute cluster analysis technology.
Background
With the continuous development and application of information technology, information systems in many industries have accumulated abundant data resources, and these data often have great research value. However, because the original data usually contain a great deal of private information about individuals, publishing them directly leaks sensitive information. Therefore, before data are released, they need to be processed with dedicated privacy protection techniques. Traditional privacy protection techniques (such as k-anonymity, l-diversity and t-closeness) protect personal privacy to a certain extent, but they have difficulty resisting background knowledge attacks and are far from sufficient to guarantee the security of private information. Differential privacy offers a new solution to the privacy problem: it quantifies the strength of the privacy protection and provides stronger privacy guarantees for data publishing.
Existing research has devoted considerable effort to publishing low-dimensional data, but with the arrival of the big data era, high-dimensional data have become far more common in practice. Applying a low-dimensional publishing method directly to high-dimensional data introduces very large noise, so the published result has low usability. The main reason is that the growth in the number of dimensions and in the value ranges of the dimensions brings the "curse of dimensionality" and "domain diversity" problems. How to address both the privacy and the low usability of high-dimensional data publishing has therefore become a new research focus.
A commonly used approach to high-dimensional data publishing is dimensionality reduction: the data are first reduced to a low-dimensional representation, noise is added to the converted low-dimensional data set, and a new data set is then generated and published. Qardaji et al. (see Qardaji W H, Yang Weining, Li Ninghui. PriView: Practical differentially private release of marginal contingency tables [C]. Proc of the 2014 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2014: 1435-.) proposed the PriView method. Day et al. (see Day W Y, Li Ninghui. Differentially private publishing of high-dimensional data using sensitivity control [C]. Proc of the 10th ACM Symp on Information, Computer and Communications Security (ASIACCS 2015). New York: ACM, 2015: 451-462.) proposed a differential privacy publishing method based on threshold filtering, which limits the sensitivity range by constructing a low-sensitivity quality function.
However, the above methods do not consider the dependencies between attributes, so researchers have gone on to perform dimensionality reduction according to the correlation between attributes. For example, Xu et al. (see Xu C, Ren J, Zhang Y, et al. DPPro: Differentially Private High-Dimensional Data Release via Random Projection [J]. IEEE Transactions on Information Forensics and Security, 2017: 1-1.) designed a high-dimensional data publishing algorithm based on random projection, which generates a differentially private synthetic data set in which the squared Euclidean distances between high-dimensional vectors are similar to those in the original data set. Zhang et al. (see Zhang Jun, Cormode G, Procopiuc C M, et al. PrivBayes: Private data release via Bayesian networks [C]. Proc of the 2014 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2014: 1423-.) proposed the PrivBayes method, which releases data via Bayesian networks. Chen et al. (see Chen Rui, Xiao Qian, Zhang Yu, et al. Differentially private high-dimensional data publication via sampling-based inference [C]. Proc of the 21st ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2015: 129-138.) proposed the JTree method, which uses a Markov network to construct a junction tree for high-dimensional data publishing.
When a probabilistic graph is constructed from the correlation between attributes for dimensionality reduction, the key step is to judge the correlation between every pair of attributes. When the attribute pairs are numerous, however, the limited privacy budget has to be divided many times, which inevitably introduces large noise. Moreover, the higher the data dimension, the more complex the generated network structure becomes, so its representation grows super-exponentially and the running time of the algorithm increases greatly.
The traditional approach constructs a single Bayesian network directly over all attributes. During construction, the candidate space of attribute-parent (AP) pairs is therefore too large and the privacy budget is divided too many times, which increases the noise, greatly reduces the selection precision of the exponential mechanism, and ultimately lowers the usability of the algorithm. In a high-dimensional attribute environment, the running time of the algorithm also grows exponentially with the number of attribute nodes.
Therefore, how to implement effective and feasible private data publishing for high-dimensional data has become an urgent technical problem to be solved.
Disclosure of Invention
The invention aims to overcome the large error, poor usability and low efficiency of noisy publishing of high-dimensional private data in the prior art, and provides a high-dimensional privacy data publishing method based on a Bayesian network attribute cluster analysis technology to solve these problems.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a high-dimensional privacy data release method based on a Bayesian network attribute cluster analysis technology comprises the following steps:
11) obtaining high-dimensional data: acquiring high-dimensional data to be issued to form an original data set D, and performing attribute induction on the high-dimensional data to form a high-dimensional data attribute set;
12) clustering of attribute subsets: by calculating the correlation among high-dimensional data attributes, the high-dimensional attribute set is divided into c attribute subsets by using an attribute clustering method, and the original data set D is then partitioned into c data subsets $D_i$ ($i = 1, \dots, c$) according to the attribute subsets;
13) constructing a noisy Bayesian network: a greedy Bayesian method is used to construct a noisy Bayesian network $N_i$ ($i = 1, \dots, c$) for each obtained data subset $D_i$ ($i = 1, \dots, c$), where the total privacy budget allocated to this step is $\varepsilon_1$; each data subset is allocated a privacy budget in proportion to the number of attributes it owns relative to the total number of attributes owned by the c attribute subset clusters, $\varepsilon_{1i} = \varepsilon_1 \cdot d_i / \sum_{j=1}^{c} d_j$, where $d_i$ is the number of attributes in the i-th subset, so that each constructed Bayesian network satisfies $\varepsilon_{1i}$-differential privacy;
14) generating the noisy conditional distributions: for each Bayesian network $N_i$, its joint probability distribution $\Pr[V_i, \Pi_i]$ is calculated and noise is added to obtain $\Pr^*[V_i, \Pi_i]$, from which the noisy conditional probability distribution $\Pr^*[V_i \mid \Pi_i]$ is calculated; the total privacy budget allocated to this step is $\varepsilon_2$, and each Bayesian network is allocated a privacy budget in proportion to its number of attribute nodes relative to the total number of attribute nodes owned by the c Bayesian networks, $\varepsilon_{2i} = \varepsilon_2 \cdot d_i / \sum_{j=1}^{c} d_j$, where $d_i$ is the number of attribute nodes of $N_i$, so that each constructed conditional probability distribution satisfies $\varepsilon_{2i}$-differential privacy; the sum of $\varepsilon_1$ and $\varepsilon_2$ equals the given total privacy budget, i.e. $\varepsilon = \varepsilon_1 + \varepsilon_2$, so that the whole data publishing process satisfies differential privacy;
15) publishing of the synthetic data set: for the c data subsets, each attribute is sampled in turn, in increasing order of i, according to the Bayesian network $N_i$ and the noisy conditional distributions $\Pr^*[V_i \mid \Pi_i]$, generating the perturbed data sets $D_i^*$ ($i = 1, \dots, c$), from which the synthetic data set $D^*$ is generated; the synthetic data set $D^*$ is the high-dimensional privacy data to be released, and it is finally published.
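The proportional budget allocation of steps 13) and 14) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name `split_budget`, the even division of the total budget between the two phases, and the example cluster sizes are all hypothetical.

```python
def split_budget(total_budget, sizes):
    """Split a privacy budget across clusters in proportion to their sizes.

    sizes[i] is the number of attributes (or attribute nodes) in cluster i.
    """
    if total_budget <= 0:
        raise ValueError("total_budget must be positive")
    if not sizes or any(s <= 0 for s in sizes):
        raise ValueError("every cluster needs at least one attribute")
    total = sum(sizes)
    return [total_budget * s / total for s in sizes]


# Hypothetical example: a total budget of 1.0 divided evenly between the
# network-construction phase and the noisy-distribution phase (the even
# division is an assumption), with three clusters of 4, 3 and 1 attributes.
epsilon = 1.0
epsilon_1 = epsilon_2 = epsilon / 2
eps_1i = split_budget(epsilon_1, [4, 3, 1])
eps_2i = split_budget(epsilon_2, [4, 3, 1])
```

Because the per-cluster budgets within each phase add up to that phase's budget, the budgets of all networks and all conditional distributions together add up to the given total.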
The cluster partitioning of the attribute subsets comprises the steps of:
21) for a high-dimensional data set, calculating the correlation between high-dimensional data attributes, wherein the calculation method comprises the following steps:
given any two attributes $V_i$ and $V_j$, the relative dependency between the attributes is expressed as
$$R(V_i, V_j) = \frac{I(V_i, V_j)}{H(V_i, V_j)}$$
where I denotes the mutual information between the two attributes and H denotes their joint entropy; for any attribute $V_i$, the sum of its relationships to the other attributes is expressed as
$$MR(V_i) = \sum_{j \neq i} R(V_i, V_j);$$
22) Randomly selecting c attributes as central attributes, wherein c is the number of attribute subsets;
23) for each attribute $V_i$, calculating the relative dependency between $V_i$ and each central attribute, and assigning $V_i$ to the subset cluster of the central attribute $C_r$ with the maximum dependency value; this step is repeated until all attributes have been assigned to subset clusters;
24) updating the central attributes: for each attribute subset, if there is an attribute $V_i$ whose sum of relationships to the other attributes is not less than that of the central attribute, i.e. $MR(V_i) \ge MR(V_j)$ ($V_j \in C_r$, $j \neq i$), then $V_i$ is set as the new $C_r$;
25) repeating steps 23) and 24) until the c central attributes no longer change, or stopping the iteration when the number of iterations reaches a preset value, to obtain c attribute subsets and hence c data subsets $D_i$ ($i = 1, \dots, c$).
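As an illustration of step 21), the following Python sketch computes a relative dependency matrix from columns of discrete data. It assumes that the relative dependency is the mutual information divided by the joint entropy and that the relationship sum runs over all other attributes; the function names and the toy columns are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def relative_dependency(col_i, col_j):
    """Mutual information divided by joint entropy (0.0 if the latter is 0)."""
    joint = entropy(list(zip(col_i, col_j)))
    if joint == 0.0:
        return 0.0
    mutual_info = entropy(col_i) + entropy(col_j) - joint
    return mutual_info / joint


def dependency_matrix(columns):
    """Symmetric matrix of pairwise relative dependencies."""
    d = len(columns)
    rel = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            rel[i][j] = rel[j][i] = relative_dependency(columns[i], columns[j])
    return rel


def relation_sum(rel, i):
    """Sum of the relationships of attribute i to all other attributes."""
    return sum(r for j, r in enumerate(rel[i]) if j != i)


# Toy data: columns 0 and 1 are identical, column 2 is independent of both.
columns = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
rel = dependency_matrix(columns)
```

With this normalisation the value lies between 0 (independent attributes) and 1 (each attribute determines the other), which makes dependencies comparable across attributes with different domain sizes.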
The construction of the noisy Bayesian network comprises the following steps:
31) initialization: the Bayesian network is initialized as $N = \emptyset$ and the set of selected attribute nodes as $S = \emptyset$; A is the attribute set of the data set;
32) selecting an initial node: an attribute $V_1$ is randomly selected as the initial node of the Bayesian network, $V_1$ is added to the set S, and the attribute-parent (AP) pair $(V_1, \emptyset)$ is added to N;
33) enumerating the AP pair candidates: the AP pair candidate set is initialized as $\Omega = \emptyset$; for each $V \in A \setminus S$ and each $\Pi \subseteq S$ with $|\Pi| = \min(k, |S|)$, the pair $(V, \Pi)$ is stored in the AP pair candidate set $\Omega$, where k is the degree of the Bayesian network;
34) scoring the AP pairs: using the function F as the scoring function, the scores $F(V, \Pi)$ of all AP pairs in $\Omega$ are calculated according to
$$F(V, \Pi) = -\frac{1}{2} \min_{\Pr^{\circ} \in P^{\circ}[V, \Pi]} \left\lVert \Pr[V, \Pi] - \Pr^{\circ}[V, \Pi] \right\rVert_1$$
where $P^{\circ}[V, \Pi]$ is the set of all maximum joint distributions of the AP pair $(V, \Pi)$;
35) selecting an AP pair: an AP pair $(V_i, \Pi_i)$ is selected from $\Omega$ by the exponential mechanism and added to the network N, and $V_i$ is added to S; the probability of sampling an AP pair from $\Omega$ is proportional to an exponential function of its score $F(V_i, \Pi_i)$ scaled by $\Delta F$, where $\Delta F$ is the global sensitivity of the scoring function and $n = |D|$;
36) Bayesian network updating: for all attributes in A other than $V_1$, steps 33) to 35) are repeated until all attribute nodes have been selected in turn, giving the complete Bayesian network N.
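The candidate enumeration of step 33) can be sketched in Python as follows. The sketch assumes that each candidate parent set contains exactly min(k, |S|) already selected attributes, which is the PrivBayes convention; the function name and the attribute names are hypothetical.

```python
from itertools import combinations


def enumerate_ap_candidates(attributes, selected, k):
    """List every (attribute, parent set) pair that may be added next.

    attributes: all attributes of the data subset
    selected:   attributes already placed in the network (the set S)
    k:          degree of the Bayesian network
    """
    size = min(k, len(selected))
    remaining = [v for v in attributes if v not in selected]
    parent_sets = list(combinations(sorted(selected), size))
    return [(v, parents) for v in remaining for parents in parent_sets]


# Hypothetical attribute names, with two attributes already selected.
attributes = ["age", "sex", "edu", "income"]
candidates_k1 = enumerate_ap_candidates(attributes, {"age", "sex"}, k=1)
candidates_k2 = enumerate_ap_candidates(attributes, {"age", "sex"}, k=2)
```

The number of candidates grows combinatorially with the number of selected attributes and with k, which is why building one network per small attribute cluster keeps the candidate space, and with it the number of budget divisions, small.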
The generation of the noisy conditional distributions comprises the following steps:
41) initialization: the set of noisy conditional distributions $P^*$ is initialized;
42) generating the noisy joint distribution:
according to the Bayesian network $N_i$, the original joint distribution $\Pr[V_i, \Pi_i]$ is calculated;
Laplace noise is added to obtain the noisy joint distribution $\Pr^*[V_i, \Pi_i]$, the negative values of $\Pr^*[V_i, \Pi_i]$ are set to 0, and the result is normalized;
43) generating the noisy conditional distributions:
for $i = k+1, \dots, d$, $\Pr^*[V_{k+1} \mid \Pi_{k+1}], \dots, \Pr^*[V_d \mid \Pi_d]$ are calculated from $\Pr^*[V_i, \Pi_i]$ and added to the set of noisy conditional distributions $P^*$;
for $i = 1, \dots, k$, $\Pr^*[V_1 \mid \Pi_1], \dots, \Pr^*[V_k \mid \Pi_k]$ are calculated from $\Pr^*[V_{k+1}, \Pi_{k+1}]$ and added to the set of noisy conditional distributions $P^*$.
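Steps 42) and 43) can be illustrated with the following Python sketch, which perturbs a small joint distribution with Laplace noise, clips negative values, normalizes, and derives the conditional distribution. The noise scale, the dictionary layout and all names are assumptions made for illustration; the scale actually required depends on the privacy budget and on n, which this sketch does not model.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_joint(joint, scale, rng):
    """Add Laplace noise to each cell, set negatives to 0, then normalize.

    joint maps (value, parent_values) -> probability.
    """
    noisy = {cell: max(0.0, p + laplace(scale, rng)) for cell, p in joint.items()}
    total = sum(noisy.values())
    if total == 0.0:  # degenerate case: fall back to a uniform distribution
        return {cell: 1.0 / len(noisy) for cell in noisy}
    return {cell: p / total for cell, p in noisy.items()}


def conditional(joint):
    """Derive Pr[V | parents] from a joint distribution over (V, parents)."""
    parent_mass = {}
    for (_, parents), p in joint.items():
        parent_mass[parents] = parent_mass.get(parents, 0.0) + p
    n_values = len({value for value, _ in joint})
    return {
        (value, parents): p / parent_mass[parents]
        if parent_mass[parents] > 0 else 1.0 / n_values
        for (value, parents), p in joint.items()
    }


# Toy joint distribution of one attribute (values "a", "b") and one parent.
joint = {("a", "x"): 0.4, ("b", "x"): 0.1, ("a", "y"): 0.2, ("b", "y"): 0.3}
noisy = noisy_joint(joint, scale=0.01, rng=random.Random(1))
cond = conditional(noisy)
```

Clipping and normalizing are post-processing of the noisy output, so they do not consume additional privacy budget.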
Advantageous effects
Compared with the prior art, the high-dimensional private data publishing method based on the Bayesian network attribute cluster analysis technology can ensure the safety and the availability of data privacy, shorten the running time of a data publishing algorithm and realize the effective publishing of private data in a high-dimensional big data environment.
By exploiting the correlation between dimensions, the method preserves the correlation within the data and keeps the probability distribution and statistical characteristics of the synthetic data set as close as possible to those of the original data set. Because attribute clustering is performed first to form attribute subset clusters before the Bayesian networks are constructed, the number of privacy budget divisions is reduced and the running time for generating the Bayesian networks is shortened. Because the low-dimensional attribute subsets are relatively independent of one another, when the original data set has a high dimensionality the MapReduce programming model can be applied to the construction of the Bayesian networks and of the perturbed data sets, which effectively addresses the problem of computational efficiency in a big data environment.
Drawings
FIG. 1 is a sequence diagram of the method of the present invention;
FIG. 2 is a block diagram of an algorithm flow scheme in accordance with the present invention;
FIG. 3(a) shows the SVM (money) classification results on the NLTCS data set;
FIG. 3(b) shows the SVM (bathing) classification results on the NLTCS data set;
FIG. 3(c) shows the SVM (traveling) classification results on the NLTCS data set;
FIG. 4(a) shows the SVM (mortgage) classification results on the ACS data set;
FIG. 4(b) shows the SVM (multi-gen) classification results on the ACS data set;
FIG. 4(c) shows the SVM (school) classification results on the ACS data set;
FIG. 5(a) shows the SVM (gender) classification results on the Adult data set;
FIG. 5(b) shows the SVM (marital) classification results on the Adult data set;
FIG. 5(c) shows the SVM (education) classification results on the Adult data set;
FIG. 6(a) compares the running times of the method of the present invention and the PrivBayes method when k = 2;
FIG. 6(b) compares the running times of the method of the present invention and the PrivBayes method when k = 3.
Detailed Description
So that the features of the present invention can be readily understood, the invention is described in more detail below with reference to embodiments, some of which are illustrated in the accompanying drawings:
as shown in fig. 1 and fig. 2, the method for publishing high-dimensional private data based on a bayesian network attribute cluster analysis technique according to the present invention includes the following steps:
firstly, acquiring high-dimensional data: and acquiring high-dimensional data to be issued, and performing attribute induction on the high-dimensional data to form a high-dimensional data attribute set.
Second, clustering of attribute subsets: by calculating the correlation among high-dimensional data attributes, the high-dimensional attribute set is divided into c attribute subsets by using an attribute clustering method, and the original data set D is then partitioned into c data subsets $D_i$ ($i = 1, \dots, c$) according to the attribute subsets. When a noisy Bayesian network is constructed, an increase in the number of attribute nodes sharply reduces the privacy budget available to each node and seriously degrades the usability of the published data. Here the interdependence between attributes is measured by a relation function, and the attribute subset clusters are formed following the idea of the K-means clustering algorithm, so that the interdependence between attributes is explored in advance and the selection range of attribute pairs is reduced. Combining the attribute clustering algorithm with the noisy Bayesian network for publishing high-dimensional private data therefore preserves the usability of the published result and improves the running efficiency of the algorithm in a big data environment. The specific steps are as follows:
(1) for a high-dimensional data set, calculating the correlation between high-dimensional data attributes, wherein the calculation method comprises the following steps:
given any two attributes $V_i$ and $V_j$, the relative dependency between the attributes is expressed as
$$R(V_i, V_j) = \frac{I(V_i, V_j)}{H(V_i, V_j)}$$
where I denotes the mutual information between the two attributes and H denotes their joint entropy; for any attribute $V_i$, the sum of its relationships to the other attributes is expressed as
$$MR(V_i) = \sum_{j \neq i} R(V_i, V_j);$$
(2) Randomly selecting c attributes as central attributes, wherein c is the number of attribute subsets;
(3) for each attribute $V_i$, calculating the relative dependency between $V_i$ and each central attribute, and assigning $V_i$ to the subset cluster of the central attribute $C_r$ with the maximum dependency value; this step is repeated until all attributes have been assigned to subset clusters;
(4) updating the central attributes: for each attribute subset, if there is an attribute $V_i$ whose sum of relationships to the other attributes is not less than that of the central attribute, i.e. $MR(V_i) \ge MR(V_j)$ ($V_j \in C_r$, $j \neq i$), then $V_i$ is set as the new $C_r$;
(5) repeating steps (3) and (4) until the c central attributes no longer change, or terminating the iteration when the number of iterations reaches a preset value, to obtain c attribute subsets and hence c data subsets $D_i$ ($i = 1, \dots, c$).
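The assignment and update loop of steps (2) to (5) can be sketched in Python as follows, given a precomputed matrix of pairwise relative dependencies. This is a minimal sketch under assumptions: the relationship sum used to update a centre is taken over the members of its own cluster, ties are broken by list order, and the function name, the `init_centers` parameter and the toy matrix are hypothetical.

```python
import random


def cluster_attributes(rel, c, init_centers=None, max_iter=20, seed=0):
    """Group attribute indices around c central attributes.

    rel[i][j] is the relative dependency between attributes i and j.
    Returns a dict mapping each central attribute to its sorted members.
    """
    d = len(rel)
    rng = random.Random(seed)
    centers = list(init_centers) if init_centers else rng.sample(range(d), c)
    clusters = {}
    for _ in range(max_iter):
        # Assignment: each attribute joins the centre it depends on most.
        clusters = {r: [r] for r in centers}
        for v in range(d):
            if v not in clusters:
                best = max(centers, key=lambda r: rel[v][r])
                clusters[best].append(v)
        # Update: the member with the largest within-cluster relationship
        # sum becomes the new centre (within-cluster sum is an assumption).
        new_centers = [
            max(members, key=lambda v: sum(rel[v][u] for u in members if u != v))
            for members in clusters.values()
        ]
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    return {r: sorted(members) for r, members in clusters.items()}


# Toy dependency matrix: attributes 0-2 are related, as are attributes 3-4.
pairs = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.3, (3, 4): 0.7}
rel = [[0.0 if i == j else pairs.get((min(i, j), max(i, j)), 0.05)
        for j in range(5)] for i in range(5)]
result = cluster_attributes(rel, c=2, init_centers=[1, 4])
```

In this toy run the first centre moves from attribute 1 to attribute 0 after one update and then stays fixed, at which point the loop stops.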
Third, constructing the noisy Bayesian networks. A greedy Bayesian method is used to construct a noisy Bayesian network $N_i$ ($i = 1, \dots, c$) for each obtained data subset $D_i$ ($i = 1, \dots, c$), where the total privacy budget allocated to this step is $\varepsilon_1$. Each data subset is allocated a privacy budget in proportion to the number of attributes it owns relative to the total number of attributes owned by the c attribute subset clusters, $\varepsilon_{1i} = \varepsilon_1 \cdot d_i / \sum_{j=1}^{c} d_j$, where $d_i$ is the number of attributes in the i-th subset, so that each constructed Bayesian network satisfies $\varepsilon_{1i}$-differential privacy.
The Bayesian network expresses the degree of dependence among nodes by using the conditional probability among the attribute nodes, and can better keep the consistency and the integrity of the probability among the attributes during dimension reduction. For each attribute subset cluster, the attributes in the group have high interdependency, and the correlation among the attributes can be further exploited by constructing a Bayesian network. The method comprises the following specific steps:
(1) initialization: the Bayesian network is initialized as $N = \emptyset$ and the set of selected attribute nodes as $S = \emptyset$; A is the attribute set of the data set;
(2) selecting an initial node: an attribute $V_1$ is randomly selected as the initial node of the Bayesian network, $V_1$ is added to the set S, and the AP pair $(V_1, \emptyset)$ is added to N;
(3) enumerating the AP pair candidates: the AP pair candidate set is initialized as $\Omega = \emptyset$; for each $V \in A \setminus S$ and each $\Pi \subseteq S$ with $|\Pi| = \min(k, |S|)$, the pair $(V, \Pi)$ is stored in the AP pair candidate set $\Omega$, where k is the degree of the Bayesian network;
(4) scoring the AP pairs: using the function F as the scoring function, the scores $F(V, \Pi)$ of all AP pairs in $\Omega$ are calculated according to
$$F(V, \Pi) = -\frac{1}{2} \min_{\Pr^{\circ} \in P^{\circ}[V, \Pi]} \left\lVert \Pr[V, \Pi] - \Pr^{\circ}[V, \Pi] \right\rVert_1$$
where $P^{\circ}[V, \Pi]$ is the set of all maximum joint distributions of the AP pair $(V, \Pi)$;
(5) selecting an AP pair: an AP pair $(V_i, \Pi_i)$ is selected from $\Omega$ by the exponential mechanism and added to the network N, and $V_i$ is added to S; the probability of sampling an AP pair from $\Omega$ is proportional to an exponential function of its score $F(V_i, \Pi_i)$ scaled by $\Delta F$, where $\Delta F$ is the global sensitivity of the scoring function and $n = |D|$;
(6) Bayesian network updating: for all attributes in A other than $V_1$, steps (3) to (5) are repeated until all attribute nodes have been selected in turn, giving the complete Bayesian network N.
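The selection in step (5) relies on the exponential mechanism, which in its standard form picks a candidate with probability proportional to exp(ε · score / (2 · ΔF)). The Python sketch below illustrates that standard form only; how the budget is divided among successive selections and the value of the sensitivity are assumptions, and the function names and example scores are hypothetical.

```python
import math
import random


def selection_probabilities(scores, epsilon, sensitivity):
    """Exponential-mechanism probabilities for a list of candidate scores."""
    top = max(scores)  # shifting by the maximum avoids overflow in exp()
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng=None):
    """Sample one candidate with probability proportional to its weight."""
    rng = rng or random.Random()
    probs = selection_probabilities(scores, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical AP pairs and scores; higher (less negative) scores are better.
candidates = [("edu", ("age",)), ("edu", ("sex",)), ("income", ("age",))]
scores = [-0.4, -0.1, -0.3]
probs = selection_probabilities(scores, epsilon=0.5, sensitivity=0.1)
chosen = exponential_mechanism(candidates, scores, 0.5, 0.1, random.Random(0))
```

A smaller budget flattens the probabilities towards uniform, which is why dividing the budget among fewer selections improves the precision of each selection.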
Fourth, generating the noisy conditional distributions. For each Bayesian network $N_i$, its joint probability distribution $\Pr[V_i, \Pi_i]$ is calculated and noise is added to obtain $\Pr^*[V_i, \Pi_i]$, from which the noisy conditional probability distribution $\Pr^*[V_i \mid \Pi_i]$ is calculated. The total privacy budget allocated to this step is $\varepsilon_2$, and each Bayesian network is allocated a privacy budget $\varepsilon_{2i} = \varepsilon_2 \cdot d_i / \sum_{j=1}^{c} d_j$, where $d_i$ is the number of attribute nodes of $N_i$, so that each constructed conditional probability distribution satisfies $\varepsilon_{2i}$-differential privacy. The sum of $\varepsilon_1$ and $\varepsilon_2$ equals the given total privacy budget, i.e. $\varepsilon = \varepsilon_1 + \varepsilon_2$, which ensures that the whole data publishing process satisfies differential privacy. The specific steps are as follows:
(1) initialization: the set of noisy conditional distributions $P^*$ is initialized;
(2) generating the noisy joint distribution:
according to the Bayesian network $N_i$, the original joint distribution $\Pr[V_i, \Pi_i]$ is calculated;
Laplace noise is added to obtain the noisy joint distribution $\Pr^*[V_i, \Pi_i]$, the negative values of $\Pr^*[V_i, \Pi_i]$ are set to 0, and the result is normalized;
(3) generating the noisy conditional distributions:
for $i = k+1, \dots, d$, $\Pr^*[V_{k+1} \mid \Pi_{k+1}], \dots, \Pr^*[V_d \mid \Pi_d]$ are calculated from $\Pr^*[V_i, \Pi_i]$ and added to the set of noisy conditional distributions $P^*$;
for $i = 1, \dots, k$, $\Pr^*[V_1 \mid \Pi_1], \dots, \Pr^*[V_k \mid \Pi_k]$ are calculated from $\Pr^*[V_{k+1}, \Pi_{k+1}]$ and added to the set of noisy conditional distributions $P^*$.
Fifth, publishing the synthetic data set: for the c data subsets, each attribute is sampled in turn, in increasing order of i, according to the Bayesian network $N_i$ and the noisy conditional distributions $\Pr^*[V_i \mid \Pi_i]$, generating the perturbed data sets $D_i^*$ ($i = 1, \dots, c$), from which the synthetic data set $D^*$ is generated. The synthetic data set $D^*$ is the high-dimensional privacy data to be released, and it is finally published.
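The sampling in this step can be sketched in Python as follows. Each record is drawn attribute by attribute in network order, so every parent is sampled before its children. The sketch assumes that the perturbed subsets are combined column-wise, one sampled sub-record per cluster in each synthetic row; that combination rule, the data layout and all names are assumptions rather than details taken from the patent.

```python
import random


def sample_record(network, conditionals, rng):
    """Draw one record from a Bayesian network by ancestral sampling.

    network:      list of (attribute, parents) in construction order
    conditionals: conditionals[attribute][parent_values] -> {value: prob}
    """
    record = {}
    for attribute, parents in network:
        parent_values = tuple(record[p] for p in parents)
        dist = conditionals[attribute][parent_values]
        values = list(dist)
        weights = [dist[v] for v in values]
        record[attribute] = rng.choices(values, weights=weights, k=1)[0]
    return record


def synthesize(networks, conditionals_list, n_rows, seed=0):
    """Build synthetic rows by joining one sampled sub-record per cluster."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for network, conditionals in zip(networks, conditionals_list):
            row.update(sample_record(network, conditionals, rng))
        rows.append(row)
    return rows


# Two hypothetical clusters: {a -> b} and the single attribute {c}.
net_1 = [("a", ()), ("b", ("a",))]
cond_1 = {"a": {(): {0: 0.5, 1: 0.5}},
          "b": {(0,): {"x": 1.0}, (1,): {"y": 1.0}}}
net_2 = [("c", ())]
cond_2 = {"c": {(): {"u": 0.3, "v": 0.7}}}
rows = synthesize([net_1, net_2], [cond_1, cond_2], n_rows=5)
```

Because the clusters are sampled independently of one another, the per-cluster sampling could run in parallel, in line with the MapReduce remark made elsewhere in the description.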
To verify the effectiveness and running efficiency of the method of the present invention, experiments were carried out on real data sets. The experimental environment was as follows: Windows 10 operating system, Intel(R) Core(TM) i5-6200 CPU (2.30 GHz), 12 GB memory. The algorithms were implemented in Python and Java.
Experimental data: the three data sets used in the experiments, NLTCS, ACS and Adult, are widely used in work on high-dimensional data publishing. The NLTCS data set comes from the US National Long Term Care Survey and contains 21574 survey records on the care of disabled persons; the ACS data set comes from the ACS sample set of IPUMS-USA and contains 47461 rows of personal information collected in 2013 and 2014; the Adult data set comes from the US census and contains 45222 records of personal information. The details of the three data sets are shown in Table 1:
Table 1. Description of the data sets
FIGS. 3(a)-(c), 4(a)-(c) and 5(a)-(c) compare, on the three data sets NLTCS, ACS and Adult respectively, the average misclassification rate on the SVM classification tasks as the parameters vary, for the method of the present invention, the PrivBayes method, the noise-free (NoPrivacy) method, the Laplace noise method and the Majority method.
On the NLTCS data set, the classification attributes to be predicted are (1) whether the person can manage money, (2) whether the person can bathe, and (3) whether the person can travel. On the ACS data set, they are (1) whether the person has a mortgage, (2) whether the person lives in a multi-generation household, and (3) whether the person attends school. On the Adult data set, they are (1) whether the person is male, (2) whether the person is married, and (3) whether the person has a college education.
FIGS. 3, 4 and 5 show that, compared with the PrivBayes method, the method of the present invention achieves lower attribute misclassification rates on the different data sets, and that it is far better than the Laplace noise method and the Majority method. This indicates that the method of the present invention protects private information when the data are published while also improving the utility of the data set.
The running time of the method of the present invention is compared with that of the PrivBayes method on the three data sets NLTCS, ACS and Adult for Bayesian network degrees k = 2 and k = 3, as shown in FIGS. 6(a)-(b) (the graph in FIG. 6(b) is truncated because the value 3600 is too large). The figures show that the running time of the method of the present invention is roughly the same as that of the PrivBayes method when the dimension of the data set is small, but becomes shorter than that of the PrivBayes method as the dimension increases; as shown in FIG. 6(a), the running time of the PrivBayes method on the Adult data set is about 4 times that of the method of the present invention. In addition, as the Bayesian network degree k increases, the reduction in running time becomes more pronounced, which demonstrates the running efficiency of the method in a high-dimensional big data environment. When the dimensionality of the data set is higher still, the MapReduce parallel programming model can be used on the framework built by the method to shorten the data publishing time further.
Against the background of high-dimensional data publishing, the invention provides a differentially private high-dimensional data publishing method based on a Bayesian network with attribute clustering. Attribute clustering is performed first to obtain the data subsets; a Bayesian network satisfying differential privacy is then constructed for each subset by means of the exponential mechanism; each attribute is sampled in turn according to the Bayesian network and the noisy conditional distributions to obtain the perturbed data sets; and finally a new data set is synthesized and published. Experiments on real data sets verify the usability and running efficiency of the method in terms of SVM misclassification rate and algorithm running time.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (4)
1. A high-dimensional privacy data release method based on a Bayesian network attribute cluster analysis technology is characterized by comprising the following steps:
11) obtaining high-dimensional data: acquiring high-dimensional data to be issued to form an original data set D, and performing attribute induction on the high-dimensional data to form a high-dimensional data attribute set;
12) clustering of attribute subsets: by calculating the correlation among the high-dimensional data attributes, the high-dimensional attribute set is divided into c attribute subsets using an attribute clustering method, and the original data set D is then divided into c data subsets $D_i$ ($i = 1, \ldots, c$) according to the attribute subsets;
13) constructing a noisy Bayesian network: using a greedy Bayesian method, a noisy Bayesian network $N_i$ ($i = 1, \ldots, c$) is constructed for each obtained data subset $D_i$ ($i = 1, \ldots, c$), where the total privacy budget allocated is $\varepsilon_1$; each data subset is allocated a privacy budget $\varepsilon_{1i}$ in proportion to the number of attributes it owns relative to the total number of attributes owned by the c attribute subset clusters, so that each constructed Bayesian network satisfies $\varepsilon_{1i}$-differential privacy;
14) generating the noisy conditional distributions: for each Bayesian network $N_i$, its joint probability distribution $\Pr[V_i, \Pi_i]$ is calculated and noise is added to obtain $\Pr^*[V_i, \Pi_i]$, from which the noisy conditional probability distribution $\Pr^*[V_i \mid \Pi_i]$ is calculated, where the total privacy budget allocated is $\varepsilon_2$; each Bayesian network is allocated a privacy budget $\varepsilon_{2i}$ in proportion to the number of its attribute nodes relative to the total number of attribute nodes owned by the c Bayesian networks, so that each constructed conditional probability distribution satisfies $\varepsilon_{2i}$-differential privacy; the sum of $\varepsilon_1$ and $\varepsilon_2$ equals the fixed total privacy budget, i.e. $\varepsilon_1 + \varepsilon_2 = \varepsilon$, so that differential privacy is satisfied over the whole data publishing process;
15) publishing the synthetic data set: for the c data subsets, each attribute is sampled in increasing order of i according to the Bayesian network $N_i$ and the noisy conditional distribution $\Pr^*[V_i \mid \Pi_i]$ to generate the perturbed data sets, from which a synthetic data set $D^*$ is generated; the synthetic data set $D^*$ is the high-dimensional privacy data, which is finally released.
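The proportional budget allocation described in steps 13) and 14) can be sketched as follows. This is an illustration only: the claim does not state how the total budget is divided between the two phases, so the fraction `beta` is a hypothetical parameter, as are the subset sizes and the function name `allocate_budget`.

```python
def allocate_budget(total, counts):
    """Split `total` across subsets in proportion to their attribute counts."""
    n = sum(counts)
    if n <= 0:
        raise ValueError("at least one attribute is required")
    return [total * c / n for c in counts]

epsilon = 1.0          # fixed total privacy budget
beta = 0.3             # ASSUMPTION: share given to network construction
eps1 = beta * epsilon  # budget for building the c noisy Bayesian networks
eps2 = epsilon - eps1  # budget for the noisy conditional distributions

sizes = [3, 5, 2]      # hypothetical attribute counts of c = 3 subsets
eps1_parts = allocate_budget(eps1, sizes)  # per-subset network budgets
eps2_parts = allocate_budget(eps2, sizes)  # per-subset distribution budgets
```

By construction the per-subset budgets of both phases add up to the fixed total, which is the condition the claim places on the two phase budgets.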
2. The Bayesian network attribute cluster analysis technology-based high-dimensional private data distribution method according to claim 1, wherein the cluster division of the attribute subsets comprises the following steps:
21) for a high-dimensional data set, calculating the correlation between the high-dimensional data attributes, wherein the calculation method is as follows:
given any two attributes $V_i$ and $V_j$, the relative dependency between the attributes is expressed as
where I denotes the mutual information between the two attributes and H denotes the joint entropy of the two attributes; for any attribute $V_i$, the sum of its relationships with the other attributes is denoted $MR(V_i)$;
22) Randomly selecting c attributes as central attributes, wherein c is the number of attribute subsets;
23) for each attribute $V_i$, computing the relative dependency between $V_i$ and each central attribute, and assigning $V_i$ to the central attribute $C_r$ with the largest dependency value; repeating this step until all attributes have been assigned to a subset cluster;
24) updating the central attributes: for each attribute subset, if the sum of the relationships of an attribute $V_i$ to the other attributes is greater than the sum of the relationships of the central attribute to the other attributes, i.e. $MR(V_i) \ge MR(V_j)$ ($V_j \in C_r$, $j \ne i$), then $V_i$ is set as the new $C_r$;
25) repeating steps 23) and 24) until the c central attributes no longer change, or stopping the iteration when the number of iterations reaches a preset value, to obtain c attribute subsets and hence c data subsets $D_i$ ($i = 1, \ldots, c$).
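Steps 21) to 25) resemble a medoid-style clustering over attributes, which the following sketch illustrates. Two points are assumptions rather than statements of the claim: the dependency formula is not reproduced in this text, so the sketch uses mutual information divided by joint entropy, and the summed relationship is taken over the members of the same cluster. The toy data and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(columns):
    """Shannon entropy (bits) of the joint distribution of the columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in Counter(rows).values())

def dependency(x, y):
    """ASSUMED relative dependency: I(X;Y) / H(X,Y)."""
    h_xy = entropy([x, y])
    if h_xy == 0:
        return 0.0
    return (entropy([x]) + entropy([y]) - h_xy) / h_xy

def cluster_attributes(data, c, max_iter=20, seed=0):
    """data: dict attr -> list of values. Returns a list of attribute clusters."""
    attrs = list(data)
    dep = {(a, b): dependency(data[a], data[b])
           for a in attrs for b in attrs if a != b}
    centers = random.Random(seed).sample(attrs, c)      # step 22
    for _ in range(max_iter):
        clusters = {ctr: [ctr] for ctr in centers}
        for a in attrs:                                  # step 23
            if a not in clusters:
                best = max(centers, key=lambda ctr: dep[(a, ctr)])
                clusters[best].append(a)
        # Step 24: the member with the largest summed dependency on the
        # other members becomes the center (ties keep the current center).
        new_centers = [
            max(m, key=lambda a: sum(dep[(a, b)] for b in m if b != a))
            for m in clusters.values()
        ]
        if new_centers == centers:                       # step 25
            break
        centers = new_centers
    return list(clusters.values())

# Toy data: a and b are identical, c and d are identical, pairs independent.
toy = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
    "d": [0, 1, 0, 1, 0, 1, 0, 1],
}
groups = cluster_attributes(toy, c=2)
```

On this toy input the two perfectly dependent pairs end up in separate clusters even when both initial centers fall in the same pair, because the center update moves one center to the other pair.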
3. The Bayesian network attribute cluster analysis technology-based high-dimensional private data publishing method according to claim 1, wherein the constructing of the noisy Bayesian network comprises the following steps:
31) initialization: initially setting the Bayesian network N to the empty set and the selected attribute node set S to the empty set, where A is the attribute sequence of the data set;
32) selecting an initial node: randomly selecting an attribute $V_1$ as the initial node of the Bayesian network, adding $V_1$ to the set S, and adding the corresponding attribute-parent (AP) pair to N;
33) enumerating the AP-pair candidate set: initializing the AP-pair candidate set $\Omega$ to the empty set; for each candidate attribute $V$ and each candidate parent set $\Pi$, storing the pair $(V, \Pi)$ in the AP-pair candidate set $\Omega$, where k is the degree of the Bayesian network;
34) scoring the AP pairs: using the function F as the scoring function, the scores $F(V, \Pi)$ of all AP pairs in $\Omega$ are calculated, and the formula is as follows:
where $P^\circ[V, \Pi]$ is the set of all maximum joint distributions of the AP pair $(V, \Pi)$;
35) selecting an AP pair: selecting an AP pair $(V_i, \Pi_i)$ from $\Omega$ based on the exponential mechanism, adding it to the network N, and adding $V_i$ to S; the expression is as follows:
the probability of sampling an AP pair from $\Omega$ is proportional to the exponential of its score, scaled by the privacy budget and by $\Delta F$, where $\Delta F$ is the global sensitivity of the scoring function;
36) updating the Bayesian network: for all attributes in A other than $V_1$, repeating steps 33) to 35) until all attribute nodes have been selected in turn, so as to obtain the complete Bayesian network N.
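The selection in step 35) can be illustrated with the textbook form of the exponential mechanism, in which a candidate is drawn with probability proportional to exp(ε·F / (2·ΔF)). The exact expression of the claim is not reproduced in this text, so this form, the candidate pairs, the scores, and the function name are assumptions made for illustration.

```python
import math
import random

def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Sample one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    # Subtracting the top score leaves the probabilities unchanged
    # and avoids overflow in math.exp.
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical AP pairs (attribute, parent set) with hypothetical scores.
candidates = [("B", ("A",)), ("C", ("A",)), ("C", ("B",))]
scores = [-0.30, -0.05, -0.20]
rng = random.Random(42)

# Moderate budget: better-scoring pairs are more likely but not certain.
chosen = exponential_mechanism(candidates, scores, 0.5, 1.0, rng)
# Very large budget: the best-scoring pair is selected.
strict = exponential_mechanism(candidates, scores, 1e6, 1.0, rng)
```

A smaller budget flattens the weights toward uniform, which is the source of the privacy protection in the structure-learning phase.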
4. The Bayesian network attribute cluster analysis technology-based high-dimensional private data publishing method according to claim 1, wherein the generating of the noise-adding condition distribution comprises the following steps:
41) initialization: initializing the set $P^*$ of noisy conditional distributions;
42) generating the noisy joint distribution:
according to the Bayesian network $N_i$, calculating the original joint distribution $\Pr[V_i, \Pi_i]$,
adding Laplace noise to obtain the noisy joint distribution $\Pr^*[V_i, \Pi_i]$, setting the negative values in $\Pr^*[V_i, \Pi_i]$ to 0, and performing normalization;
43) generating the noisy conditional distributions:
for $i = k+1, \ldots, d$, calculating $\Pr^*[V_{k+1} \mid \Pi_{k+1}], \ldots, \Pr^*[V_d \mid \Pi_d]$ based on $\Pr^*[V_i, \Pi_i]$, and adding them to the set $P^*$ of noisy conditional distributions.
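Steps 42) and 43) can be sketched for a single attribute-parent pair as follows. The noise scale used by the claim is not reproduced in this text, so `scale` is a hypothetical parameter. The joint table, the uniform fallbacks, and the function names are likewise assumptions.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_joint(joint, scale, rng):
    """Add Laplace noise to every cell, set negatives to 0, renormalize."""
    noisy = {cell: max(0.0, p + laplace(scale, rng))
             for cell, p in joint.items()}
    total = sum(noisy.values())
    if total == 0.0:  # ASSUMPTION: fall back to uniform if all mass is clipped
        return {cell: 1.0 / len(noisy) for cell in noisy}
    return {cell: p / total for cell, p in noisy.items()}

def conditional(joint):
    """Pr*[V | Pi] from a joint table keyed by (v, pi).

    Assumes the table holds every (v, pi) combination."""
    parent_mass = {}
    for (v, pi), p in joint.items():
        parent_mass[pi] = parent_mass.get(pi, 0.0) + p
    n_values = len({v for v, _ in joint})
    cond = {}
    for (v, pi), p in joint.items():
        mass = parent_mass[pi]
        # ASSUMPTION: uniform row when the parent configuration has no mass.
        cond.setdefault(pi, {})[v] = p / mass if mass > 0 else 1.0 / n_values
    return cond

# Hypothetical joint distribution of V in {a, b} and parent Pi in {x, y}.
joint = {("a", "x"): 0.4, ("b", "x"): 0.1, ("a", "y"): 0.2, ("b", "y"): 0.3}
rng = random.Random(7)
noisy = noisy_joint(joint, scale=0.05, rng=rng)
cond = conditional(noisy)
```

Clipping and renormalizing are post-processing of the noisy values, so they do not consume additional privacy budget.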
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011013027.6A CN112131604B (en) | 2020-09-24 | 2020-09-24 | High-dimensional privacy data release method based on Bayesian network attribute cluster analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112131604A (en) | 2020-12-25 |
CN112131604B (en) | 2023-12-15 |
Family
ID=73840968
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011013027.6A Active CN112131604B (en) | 2020-09-24 | 2020-09-24 | High-dimensional privacy data release method based on Bayesian network attribute cluster analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112131604B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113094746A (en) * | 2021-03-31 | 2021-07-09 | 北京邮电大学 | High-dimensional data publishing method based on localized differential privacy and related equipment |
CN114218602A (en) * | 2021-12-10 | 2022-03-22 | 南京航空航天大学 | Differential privacy heterogeneous multi-attribute data publishing method based on vertical segmentation |
CN116702214A (en) * | 2023-08-02 | 2023-09-05 | 山东省计算中心(国家超级计算济南中心) | Privacy data release method and system based on coherent proximity and Bayesian network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20100008532A (en) * | 2008-07-16 | 2010-01-26 | 성균관대학교산학협력단 | Method of privacy preserving in dynamic datasets publication and privacy preserving system using the same |
CN105044722A (en) * | 2015-08-03 | 2015-11-11 | 西安电子科技大学 | Full Bayes feature extraction method for synthesizing aperture radar object |
CN107871087A (en) * | 2017-11-08 | 2018-04-03 | 广西师范大学 | The personalized difference method for secret protection that high dimensional data is issued under distributed environment |
CN109388972A (en) * | 2018-10-29 | 2019-02-26 | 山东科技大学 | Medical data Singular variance difference method for secret protection based on OPTICS cluster |
CN110378141A (en) * | 2019-04-16 | 2019-10-25 | 江苏慧中数据科技有限公司 | Based on Bayesian network higher-dimension perception data local difference secret protection dissemination method |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113094746A (en) * | 2021-03-31 | 2021-07-09 | 北京邮电大学 | High-dimensional data publishing method based on localized differential privacy and related equipment |
CN114218602A (en) * | 2021-12-10 | 2022-03-22 | 南京航空航天大学 | Differential privacy heterogeneous multi-attribute data publishing method based on vertical segmentation |
CN114218602B (en) * | 2021-12-10 | 2024-06-07 | 南京航空航天大学 | Differential privacy heterogeneous multi-attribute data publishing method based on vertical segmentation |
CN116702214A (en) * | 2023-08-02 | 2023-09-05 | 山东省计算中心(国家超级计算济南中心) | Privacy data release method and system based on coherent proximity and Bayesian network |
CN116702214B (en) * | 2023-08-02 | 2023-11-07 | 山东省计算中心(国家超级计算济南中心) | Privacy data release method and system based on coherent proximity and Bayesian network |
Also Published As
Publication number | Publication date |
---|---|
CN112131604B (en) | 2023-12-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||