CN112131604A

CN112131604A - High-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology

Info

Publication number: CN112131604A
Application number: CN202011013027.6A
Authority: CN
Inventors: 陈恒恒; 刘胜军; 谢飞; 倪志伟; 陈千; 李海松; 卜繁耀; 朱旭辉
Original assignee: Hefei City Cloud Data Center Co ltd; Hefei University of Technology
Current assignee: Hefei City Cloud Data Center Co ltd; Hefei University of Technology
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2020-12-25
Anticipated expiration: 2040-09-24
Also published as: CN112131604B

Abstract

The invention relates to a high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology. Compared with the prior art, it overcomes the large publishing error, poor usability and low efficiency of publishing high-dimensional privacy data with added noise. The method comprises the following steps: acquiring high-dimensional data; clustering and partitioning attribute subsets; constructing noisy Bayesian networks; generating noisy conditional distributions; and publishing a synthetic data set. In a high-dimensional big data environment, the invention shortens the running time of the data publishing algorithm while preserving data privacy, security and usability, enabling effective publication of private data.

Description

High-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology

Technical field

The invention relates to the technical field of privacy processing for high-dimensional data, and in particular to a high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology.

Background

With the continuous development and application of information technology, information systems in every industry have accumulated rich data resources, and these data often hold great research value. However, because raw data usually contain a large amount of personal private information, publishing them directly would leak sensitive information. The data therefore need to be processed with dedicated privacy protection techniques before publication. Traditional privacy protection techniques (such as k-anonymity, l-diversity and t-closeness) can protect personal privacy to some extent, but they struggle to withstand background-knowledge attacks and fall far short of guaranteeing the security of private information. Differential privacy offers a new approach to privacy-preserving publishing: it quantifies the strength of privacy protection and provides a stronger guarantee for data publishing.

Existing research has devoted considerable effort to publishing low-dimensional data, but with the arrival of the big data era, high-dimensional data have become far more common in practice. Applying low-dimensional publishing methods directly to high-dimensional data introduces very large noise, which makes the published results much less usable. The main reason is that growth in the number of dimensions and in the size of their value domains brings problems such as the "curse of dimensionality" and "value-domain diversity". How to address both the privacy problem and the low-utility problem of high-dimensional data publishing has therefore become a new research focus.

The usual way to tackle high-dimensional data publishing is dimensionality reduction: the data are first reduced to a low-dimensional form, noise is added to the transformed low-dimensional data set, and a new data set is then generated for publication. The PriView method proposed by Qardaji et al. (see Qardaji W H, Yang Weining, Li Ninghui. PriView: Practical differentially private release of marginal contingency tables [C]. Proc of the 2014 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2014: 1435-1446) estimates the joint distribution of high-dimensional data by constructing K-way marginal distributions of attribute pairs. Day et al. (see Day W Y, Li Ninghui. Differentially private publishing of high-dimensional data using sensitivity control [C]. Proc of the 10th ACM Symp on Information, Computer and Communication Security (ASIACCS 2015). New York: ACM, 2015: 451-462) proposed a differentially private publishing method based on threshold filtering, which limits the sensitivity range by constructing a low-sensitivity quality function.

However, the above methods do not take the dependencies between attributes into account, so researchers have gone on to perform dimensionality reduction based on inter-attribute correlation. For example, Xu et al. (see Xu C, Ren J, Zhang Y, et al. DPPro: Differentially Private High-Dimensional Data Release via Random Projection [J]. IEEE Transactions on Information Forensics and Security, 2017: 1-1.) designed a high-dimensional data publishing algorithm based on random projection, which achieves differential privacy by generating a synthetic data set whose high-dimensional vectors have squared Euclidean distances similar to those of the original data set. Other work mines the correlation among data dimensions by constructing probabilistic graphs. The PrivBayes method proposed by Zhang et al. (see Zhang Jun, Cormode G, Procopiuc C M, et al. PrivBayes: Private Data Release via Bayesian Networks [C]. Proc of the 2014 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2014: 1423-1434.) uses a Bayesian network built with the exponential mechanism to infer the correlations between attributes, thereby obtaining a low-dimensional data set that reflects the inherent characteristics of the high-dimensional data. The JTree method proposed by Chen et al. (see Chen Rui, Xiao Qian, Zhang Yu, et al. Differentially private high-dimension data publication via sample-based inference [C]. Proc of the 21st ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2015: 129-138.) uses a Markov network to build a junction tree for high-dimensional data publishing.

When a probabilistic graph is built from inter-attribute correlations for dimensionality reduction, the key step is to judge the correlation between each pair of attributes. When there are many attribute pairs, however, the limited privacy budget must be split many times, which inevitably produces a large amount of noise. Moreover, the higher the data dimensionality, the more complex the resulting network structure: the number of expressions grows super-exponentially and the running time of the algorithm increases greatly.

In other words, the traditional approach builds a single Bayesian network directly over all attributes. During construction, the candidate space of attribute-parent (AP) pairs is then too large and the privacy budget is split too many times, so the added noise greatly reduces the selection accuracy of the exponential mechanism and ultimately makes the algorithm's output less usable. In addition, in a high-dimensional attribute setting the running time grows exponentially as the number of attribute nodes increases.

Therefore, how to publish private data effectively and feasibly for high-dimensional data has become a technical problem in urgent need of a solution.

Summary of the invention

The purpose of the invention is to overcome the large publishing error, poor usability and low efficiency of publishing high-dimensional privacy data with added noise, by providing a high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology.

To achieve this purpose, the technical solution of the invention is as follows:

A high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology comprises the following steps:

11) Acquisition of high-dimensional data: acquire the high-dimensional data to be published to form the original data set D, and summarize the attributes of the high-dimensional data to form a high-dimensional data attribute set;

12) Clustering and partitioning of attribute subsets: compute the correlation between the attributes of the high-dimensional data, use an attribute clustering method to partition the high-dimensional attribute set into c attribute subsets, and then partition the original data set D into c data subsets D_i (i = 1, ..., c) according to the attribute subsets;

13) Construction of noisy Bayesian networks: use the greedy Bayes method to construct a noisy Bayesian network N_i (i = 1, ..., c) for each data subset D_i (i = 1, ..., c). The total privacy budget allocated to this step is ε₁, and each data subset is allocated a privacy budget in proportion to the number of attributes it holds relative to the total number of attributes held by the c attribute subset clusters

so that each constructed Bayesian network satisfies ε_1i-differential privacy;

14) Generation of noisy conditional distributions: for each Bayesian network N_i, compute its joint probability distribution Pr[V_i, ∏_i] and add noise to obtain Pr^*[V_i, ∏_i], from which the noisy conditional probability distribution Pr^*[V_i | ∏_i] is computed. The total privacy budget allocated to this step is ε₂, and each Bayesian network is allocated a privacy budget in proportion to the number of attribute nodes it holds relative to the total number of attribute nodes held by the c Bayesian networks

so that each constructed conditional probability distribution satisfies ε_2i-differential privacy. The sum of ε₁ and ε₂ equals the given total privacy budget ε, that is, ε = ε₁ + ε₂, so that the whole data publishing process satisfies ε-differential privacy;

15) Publication of the synthetic data set: for each of the c data subsets, sample every attribute in increasing order of i according to its Bayesian network N_i and noisy conditional distribution Pr^*[V_i | ∏_i], generating a perturbed data set D_i^* (i = 1, ..., c). A synthetic data set D^* is generated from these perturbed data sets; D^* is the high-dimensional privacy data, which is finally published.
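The budget split described in steps 13) and 14) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `allocate_budgets` and the `structure_share` parameter (the fraction of ε given to ε₁) are assumptions, since the text only requires ε = ε₁ + ε₂ and a per-subset share proportional to attribute counts.

```python
def allocate_budgets(total_epsilon, subset_sizes, structure_share=0.5):
    """Split total_epsilon into eps1 + eps2, then share each of them across
    the c attribute subsets in proportion to their attribute counts.

    structure_share is a hypothetical knob: the source does not say how
    epsilon is divided between eps1 and eps2.
    """
    if total_epsilon <= 0:
        raise ValueError("total_epsilon must be positive")
    if not 0 < structure_share < 1:
        raise ValueError("structure_share must lie strictly between 0 and 1")
    if not subset_sizes or any(size <= 0 for size in subset_sizes):
        raise ValueError("every subset needs at least one attribute")

    eps1 = total_epsilon * structure_share   # budget for network construction
    eps2 = total_epsilon - eps1              # budget for noisy distributions
    total_attrs = sum(subset_sizes)
    eps1_parts = [eps1 * size / total_attrs for size in subset_sizes]
    eps2_parts = [eps2 * size / total_attrs for size in subset_sizes]
    return eps1_parts, eps2_parts


if __name__ == "__main__":
    eps1_parts, eps2_parts = allocate_budgets(1.0, [2, 3, 5], structure_share=0.3)
    print(eps1_parts, eps2_parts)
```

Because the per-subset shares sum back to ε₁ and ε₂, the overall spend stays at ε under sequential composition.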

The clustering and partitioning of attribute subsets comprises the following steps:

21) For the high-dimensional data set, compute the correlation between the attributes of the high-dimensional data as follows:

Given any two attributes V_i and V_j, the relative dependency between them is expressed as

where I denotes the mutual information between the two attributes and H denotes their joint entropy. For any attribute V_i, the sum of its relationships to the other attributes is denoted MR(V_i);

22) Randomly select c attributes as center attributes, where c is the number of attribute subsets;

23) For each attribute V_i, compute the relative dependency between V_i and each center attribute, and assign V_i to the subset cluster of the center attribute C_r with the largest dependency value; repeat this step until all attributes have been assigned;

24) Update the center attributes: for each attribute subset, if some attribute V_i has a larger sum of relationships to the other attributes than the center attribute does, that is, MR(V_i) ≥ MR(V_j) (V_j ∈ C_r, j ≠ i), then V_i is set as the new C_r;

25) Repeat steps 23) and 24) until the c center attributes no longer change or the number of iterations reaches a preset value; then stop iterating to obtain c attribute subsets and hence c data subsets D_i (i = 1, ..., c).
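Steps 21) to 25) can be sketched in plain Python as follows. Several details are assumptions rather than the patent's definitions: the relative dependency is taken to be mutual information divided by joint entropy, the relationship sum MR is computed within a cluster, ties keep the current center, and the names `dependency`, `cluster_attributes` and `initial_centers` are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(*columns):
    """Empirical (joint) entropy of one or more equally long columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def dependency(x, y):
    """Assumed relative dependency: mutual information / joint entropy."""
    joint = entropy(x, y)
    if joint == 0:  # both columns are constant
        return 0.0
    mutual = entropy(x) + entropy(y) - joint
    return max(0.0, mutual) / joint


def cluster_attributes(data, c, max_iter=20, seed=0, initial_centers=None):
    """Partition the attribute names of `data` (name -> column) into c clusters."""
    names = list(data)
    rel = {a: {b: dependency(data[a], data[b]) for b in names if b != a}
           for a in names}
    rng = random.Random(seed)
    centers = list(initial_centers) if initial_centers else rng.sample(names, c)
    clusters = {}
    for _ in range(max_iter):
        # Step 23: assign every non-center attribute to its most dependent center.
        clusters = {ctr: [ctr] for ctr in centers}
        for a in names:
            if a not in clusters:
                best = max(centers, key=lambda ctr: rel[a][ctr])
                clusters[best].append(a)
        # Step 24: the member with the largest within-cluster relationship sum
        # becomes the center; max() keeps the current center on ties.
        new_centers = [
            max(members, key=lambda a: sum(rel[a][b] for b in members if b != a))
            for members in clusters.values()
        ]
        if new_centers == centers:  # step 25: centers are stable
            break
        centers = new_centers
    return list(clusters.values())
```

Like K-means, this procedure can settle in a local optimum that depends on the initial centers, which is why the sketch lets the caller pass them explicitly.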

The construction of the noisy Bayesian network comprises the following steps:

31) Initialization: initialize the Bayesian network N = ∅ and the set of selected attribute nodes S = ∅; A is the attribute list of the data set;

32) Initial node selection: randomly select an attribute V₁ as the initial node of the Bayesian network, add V₁ to the set S, and add the attribute-parent (AP) pair (V₁, ∅) to N;

33) Enumeration of candidate AP pairs: initialize the candidate set Ω of AP pairs as ∅. For each attribute V not yet in S and each candidate parent set ∏ of at most k attributes chosen from S, store (V, ∏) in Ω, where k is the degree of the Bayesian network;

34) Scoring of AP pairs: using F as the scoring function, compute the score F(V, ∏) of every AP pair in Ω according to the following formula:

where P°[V, ∏] is the set of all maximum joint distributions of the AP pair (V, ∏);

35) AP pair selection: select an AP pair (V_i, ∏_i) by the exponential mechanism, add it to the network N, and add V_i to S. The selection is expressed as follows:

so that the probability of sampling an AP pair from Ω is proportional to

where ΔF is the global sensitivity of the scoring function,

and n = |D|;

36) Bayesian network update: repeat steps 33) to 35) for all attributes in A other than V₁ until every attribute node has been selected, which yields the complete Bayesian network N.
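The greedy loop of steps 31) to 36) can be sketched as below. The scoring function F is left as a caller-supplied `score` callable because its formula is not reproduced in this text. The candidate parent-set size min(k, |S|) and the even split of the budget across the d − 1 selections are assumptions borrowed from the usual greedy Bayes formulation, and all function names are hypothetical.

```python
import itertools
import math
import random


def enumerate_ap_pairs(attributes, selected, k):
    """Candidate (attribute, parents) pairs: the attribute is not yet selected
    and the parents are min(k, |S|) already selected attributes (assumed form)."""
    size = min(k, len(selected))
    return [(v, parents)
            for v in attributes if v not in selected
            for parents in itertools.combinations(selected, size)]


def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Sample a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    top = max(scores)  # shifting by the max avoids overflow, same distribution
    weights = [math.exp(epsilon * (s - top) / (2 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def greedy_bayes(attributes, k, score, epsilon, sensitivity, seed=0):
    """Build a list of (attribute, parents) pairs in selection order."""
    rng = random.Random(seed)
    first = rng.choice(attributes)          # step 32: random initial node
    network, selected = [(first, ())], [first]
    rounds = len(attributes) - 1            # number of noisy selections
    while len(selected) < len(attributes):  # step 36: until all are chosen
        omega = enumerate_ap_pairs(attributes, selected, k)           # step 33
        scores = [score(v, parents) for v, parents in omega]          # step 34
        pick = exponential_mechanism(omega, scores, epsilon / rounds,
                                     sensitivity, rng)                # step 35
        network.append(pick)
        selected.append(pick[0])
    return network
```

A smaller candidate set Ω, which is what clustering the attributes first achieves, means fewer selections per network and therefore a larger budget per selection.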

The generation of the noisy conditional distribution comprises the following steps:

41) Initialization: initialize the set P^* of noisy conditional distributions;

42) Generation of the noisy joint distribution:

compute the original joint distribution Pr[V_i, ∏_i] according to the Bayesian network N_i,

add Laplace noise

to obtain the noisy joint distribution Pr^*[V_i, ∏_i], then set the negative values in Pr^*[V_i, ∏_i] to 0 and normalize;

43) Generation of the noisy conditional distribution:

for i = k+1, ..., d, compute Pr^*[V_{k+1} | ∏_{k+1}], ..., Pr^*[V_d | ∏_d] from Pr^*[V_i, ∏_i] and add them to the set P^* of noisy conditional distributions;

for i = 1, ..., k, compute Pr^*[V₁ | ∏₁], ..., Pr^*[V_k | ∏_k] from Pr^*[V_{k+1}, ∏_{k+1}] and add them to the set P^* of noisy conditional distributions.
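Steps 42) and 43) amount to perturbing a joint table, cleaning it up, and dividing by the parent marginal. A minimal sketch follows. The Laplace scale is left as a parameter because the source's noise formula is not reproduced in this text, and the function names, the dictionary layout keyed by `(value, parent_values)` and the uniform fallback for empty parent configurations are all assumptions.

```python
import random
from collections import defaultdict


def laplace(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_joint(joint, scale, rng):
    """Add Laplace noise to each cell, clamp negatives to 0, renormalize."""
    noisy = {cell: max(0.0, p + laplace(scale, rng)) for cell, p in joint.items()}
    total = sum(noisy.values())
    if total == 0:  # every cell was clamped: fall back to a uniform table
        return {cell: 1 / len(noisy) for cell in noisy}
    return {cell: p / total for cell, p in noisy.items()}


def conditional(joint):
    """Pr*[V | parents] from a joint keyed by (value, parent_values).

    Assumes the joint lists every (value, parent_values) cell, so that the
    uniform fallback for a zero-mass parent configuration sums to 1.
    """
    values = {v for v, _ in joint}
    mass = defaultdict(float)
    for (_, parents), p in joint.items():
        mass[parents] += p
    return {
        (v, parents): (p / mass[parents] if mass[parents] > 0 else 1 / len(values))
        for (v, parents), p in joint.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)
    joint = {(0, (0,)): 0.4, (1, (0,)): 0.1, (0, (1,)): 0.2, (1, (1,)): 0.3}
    print(conditional(noisy_joint(joint, scale=0.05, rng=rng)))
```

Clamping and renormalizing are post-processing of an already noisy table, so they do not consume additional privacy budget.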

Beneficial effects

Compared with the prior art, the high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology of the invention shortens the running time of the data publishing algorithm while preserving data privacy, security and usability, enabling effective publication of private data in a high-dimensional big data environment.

By mining the correlation among dimensions, the invention preserves the correlation within the data and keeps the probability distribution and statistical properties of the synthetic data set as close as possible to those of the original data set. Forming attribute subset clusters by attribute clustering before building the Bayesian networks reduces the number of times the privacy budget is split and shortens the running time of Bayesian network construction. Because the resulting low-dimensional attribute subsets are relatively independent of one another, the MapReduce programming model can be applied to the construction of the Bayesian networks and perturbed data sets when the original data set has very high dimensionality, which effectively addresses computational efficiency in a big data environment.

Description of the drawings

Fig. 1 is a sequence diagram of the method of the invention;

Fig. 2 is a framework diagram of the algorithm flow of the invention;

Fig. 3(a) shows the SVM (money) classification results on the NLTCS data set;

Fig. 3(b) shows the SVM (bathing) classification results on the NLTCS data set;

Fig. 3(c) shows the SVM (traveling) classification results on the NLTCS data set;

Fig. 4(a) shows the SVM (mortgage) classification results on the ACS data set;

Fig. 4(b) shows the SVM (multi-gen) classification results on the ACS data set;

Fig. 4(c) shows the SVM (school) classification results on the ACS data set;

Fig. 5(a) shows the SVM (gender) classification results on the Adult data set;

Fig. 5(b) shows the SVM (marital) classification results on the Adult data set;

Fig. 5(c) shows the SVM (education) classification results on the Adult data set;

Fig. 6(a) compares the running time of the method of the invention and the PrivBayes method when k = 2;

Fig. 6(b) compares the running time of the method of the invention and the PrivBayes method when k = 3.

Detailed description

To provide a further understanding of the structural features of the invention and the effects it achieves, preferred embodiments are described in detail below with reference to the accompanying drawings:

As shown in Fig. 1 and Fig. 2, the high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology according to the invention comprises the following steps:

Step 1, acquisition of high-dimensional data: acquire the high-dimensional data to be published and summarize their attributes to form a high-dimensional data attribute set.

Step 2, clustering and partitioning of attribute subsets: compute the correlation between the attributes of the high-dimensional data, use an attribute clustering method to partition the high-dimensional attribute set into c attribute subsets, and then partition the original data set D into c data subsets D_i (i = 1, ..., c) according to the attribute subsets. When a noisy Bayesian network is built, each additional attribute node sharply shrinks the privacy budget available at each step, which seriously harms the usability of the published data. By defining a relationship function that measures the interdependence between attributes, and by applying the idea of the K-means clustering algorithm to form attribute subset clusters, the interdependence between attributes can be explored in advance and the selection range of attribute pairs reduced. The invention therefore combines attribute clustering with the construction of noisy Bayesian networks for high-dimensional privacy data publishing, which preserves the usability of the published results while improving running efficiency in a big data environment. The specific steps are as follows:

(1) For the high-dimensional data set, compute the correlation between the attributes of the high-dimensional data as follows:

(2) Randomly select c attributes as center attributes, where c is the number of attribute subsets;

(3) For each attribute V_i, compute the relative dependency between V_i and each center attribute, and assign V_i to the subset cluster of the center attribute C_r with the largest dependency value; repeat this step until all attributes have been assigned;

(4) Update the center attributes: for each attribute subset, if some attribute V_i has a larger sum of relationships to the other attributes than the center attribute does, that is, MR(V_i) ≥ MR(V_j) (V_j ∈ C_r, j ≠ i), then V_i is set as the new C_r;

(5) Repeat steps (3) and (4) until the c center attributes no longer change or the number of iterations reaches a preset value; then stop iterating to obtain c attribute subsets and hence c data subsets D_i (i = 1, ..., c).
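For reference, the quantities used in step (1) can be written out. The definitions of joint entropy and mutual information below are the standard ones; the ratio form of the relative dependency and the symbol R are assumptions, because the text names I and H but does not reproduce the formula itself.

```latex
\begin{align}
H(V_i, V_j) &= -\sum_{a}\sum_{b} \Pr[V_i = a, V_j = b]\,\log \Pr[V_i = a, V_j = b] \\
I(V_i; V_j) &= \sum_{a}\sum_{b} \Pr[V_i = a, V_j = b]\,
               \log\frac{\Pr[V_i = a, V_j = b]}{\Pr[V_i = a]\,\Pr[V_j = b]} \\
R(V_i, V_j) &= \frac{I(V_i; V_j)}{H(V_i, V_j)} \qquad \text{(assumed form)} \\
MR(V_i) &= \sum_{j \neq i} R(V_i, V_j)
\end{align}
```

Under this assumption R lies in [0, 1], since the mutual information of two attributes never exceeds their joint entropy: R = 0 for independent attributes and R = 1 when each attribute determines the other.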

Step 3, construction of noisy Bayesian networks: use the greedy Bayes method to construct a noisy Bayesian network N_i (i = 1, ..., c) for each data subset D_i (i = 1, ..., c). The total privacy budget allocated to this step is ε₁, and each data subset is allocated a privacy budget in proportion to the number of attributes it holds relative to the total number of attributes held by the c attribute subset clusters

so that each constructed Bayesian network satisfies ε_1i-differential privacy.

A Bayesian network uses the conditional probabilities between attribute nodes to express how strongly the nodes depend on one another, and it preserves the consistency and completeness of the probabilities between attributes well during dimensionality reduction. Within each attribute subset cluster the attributes are highly interdependent, so their correlations can be mined further by building a Bayesian network. The specific steps are as follows:

(1) Initialization: initialize the Bayesian network N = ∅ and the set of selected attribute nodes S = ∅; A is the attribute list of the data set;

(2) Initial node selection: randomly select an attribute V₁ as the initial node of the Bayesian network, add V₁ to the set S, and add the AP pair (V₁, ∅) to N;

(3) Enumeration of candidate AP pairs: initialize the candidate set Ω of AP pairs as ∅. For each attribute V not yet in S and each candidate parent set ∏ of at most k attributes chosen from S, store (V, ∏) in Ω, where k is the degree of the Bayesian network;

(4) Scoring of AP pairs: using F as the scoring function, compute the score F(V, ∏) of every AP pair in Ω according to the following formula:

(5) AP pair selection: select an AP pair (V_i, ∏_i) by the exponential mechanism, add it to the network N, and add V_i to S. The selection is expressed as follows:

so that the probability of sampling an AP pair from Ω is proportional to

where ΔF is the global sensitivity of the scoring function,

and n = |D|;

(6) Bayesian network update: repeat steps (3) to (5) for all attributes in A other than V₁ until every attribute node has been selected, which yields the complete Bayesian network N.
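The selection rule in step (5) follows the standard exponential mechanism. Written out in its textbook form, with ε′ denoting the budget spent on a single selection (a symbol introduced here for illustration, not taken from the source):

```latex
\Pr\big[(V, \Pi) \text{ is selected}\big]
  = \frac{\exp\!\left(\dfrac{\varepsilon'\, F(V, \Pi)}{2\,\Delta F}\right)}
         {\displaystyle\sum_{(V', \Pi') \in \Omega}
          \exp\!\left(\dfrac{\varepsilon'\, F(V', \Pi')}{2\,\Delta F}\right)}
```

Two candidates whose scores differ by δ are therefore selected with probabilities in the ratio exp(ε′δ / (2ΔF)), so a smaller ε′ pushes the choice toward a uniform draw from Ω. How ε_1i is divided among the selections within one network is not stated in this text; an even split over the selections is one common choice.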

Step 4, generation of noisy conditional distributions: for each Bayesian network N_i, compute its joint probability distribution Pr[V_i, ∏_i] and add noise to obtain Pr^*[V_i, ∏_i], from which the noisy conditional probability distribution Pr^*[V_i | ∏_i] is computed. The total privacy budget allocated to this step is ε₂, and the privacy budget is allocated according to

so that each constructed conditional probability distribution satisfies ε_2i-differential privacy. The sum of ε₁ and ε₂ equals the given total privacy budget ε, that is, ε = ε₁ + ε₂, so that the whole data publishing process satisfies ε-differential privacy. The specific steps are as follows:

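(1) Initialization: initialize the set P^* of noisy conditional distributions;

(2) Generation of the noisy joint distribution:

compute the original joint distribution Pr[V_i, ∏_i] according to the Bayesian network N_i, add Laplace noise

to obtain the noisy joint distribution Pr^*[V_i, ∏_i], then set the negative values in Pr^*[V_i, ∏_i] to 0 and normalize;

(3) Generation of the noisy conditional distribution:

for i = k+1, ..., d, compute Pr^*[V_{k+1} | ∏_{k+1}], ..., Pr^*[V_d | ∏_d] from Pr^*[V_i, ∏_i] and add them to the set P^*;

for i = 1, ..., k, compute Pr^*[V₁ | ∏₁], ..., Pr^*[V_k | ∏_k] from Pr^*[V_{k+1}, ∏_{k+1}] and add them to the set P^*.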

Step 5, publication of the synthetic data set: for each of the c data subsets, sample every attribute in increasing order of i according to its Bayesian network N_i and noisy conditional distribution Pr^*[V_i | ∏_i], generating a perturbed data set D_i^* (i = 1, ..., c). A synthetic data set D^* is generated from these perturbed data sets; D^* is the high-dimensional privacy data, which is finally published.
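The sampling in this step can be sketched as ancestral sampling over each network, followed by a column-wise merge. The data layout (`network` as an ordered list of (attribute, parents) pairs, `conditionals[attr][parent_values]` as a value-to-probability mapping) and the function names are assumptions. So is the merge: the text does not say how the perturbed subsets D_i^* are combined into D^*, and joining independently sampled rows side by side is only one plausible reading.

```python
import random


def sample_record(network, conditionals, rng):
    """Sample one row from a single network, visiting attributes in the order
    they were added so that every parent is sampled before its child."""
    record = {}
    for attr, parents in network:
        parent_values = tuple(record[p] for p in parents)
        dist = conditionals[attr][parent_values]   # value -> probability
        values = list(dist)
        weights = [dist[v] for v in values]
        record[attr] = rng.choices(values, weights=weights, k=1)[0]
    return record


def synthesize(networks, conditionals_list, n_rows, seed=0):
    """Build n_rows synthetic rows by sampling each network independently and
    merging the per-network columns into one row (assumed merge rule)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for network, conditionals in zip(networks, conditionals_list):
            row.update(sample_record(network, conditionals, rng))
        rows.append(row)
    return rows


if __name__ == "__main__":
    net_a = [("smoker", ()), ("cough", ("smoker",))]
    cond_a = {
        "smoker": {(): {"yes": 0.3, "no": 0.7}},
        "cough": {("yes",): {"yes": 0.6, "no": 0.4},
                  ("no",): {"yes": 0.1, "no": 0.9}},
    }
    net_b = [("region", ())]
    cond_b = {"region": {(): {"north": 0.5, "south": 0.5}}}
    for row in synthesize([net_a, net_b], [cond_a, cond_b], n_rows=3):
        print(row)
```

Because the sampling reads only the noisy distributions and never the original data, it is post-processing and does not consume additional privacy budget.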

To verify the effectiveness and running efficiency of the method of the invention, experiments on real data sets are described below. Experimental environment: Windows 10 operating system, Intel(R) Core(TM) i5-6200 CPU (2.30 GHz), 12 GB of memory. The algorithms are implemented in Python and Java.

Experimental data: the three data sets used in the experiments, NLTCS, ACS and Adult, are all widely used in high-dimensional data publishing. The NLTCS data set originates from the US care survey center and contains 21,574 records from a care survey of persons with disabilities. The ACS data set originates from the ACS sample set of IPUMS USA and contains 47,461 rows of personal information collected in 2013 and 2014. The Adult data set originates from the US Census Bureau and contains 45,222 records of personal information. The details of the three data sets are shown in Table 1:

Table 1. Description of the data sets

Figs. 3(a)-(c), 4(a)-(c) and 5(a)-(c) show, for the NLTCS, ACS and Adult data sets respectively, the average misclassification rate on SVM classification tasks as the parameter ε varies, comparing the method of the invention with the PrivBayes method, the noise-free (NoPrivacy) method, the Laplace noise method and the Majority method.

On the NLTCS data set, the classification attributes to be predicted are (1) whether the person is able to manage money; (2) whether the person is able to bathe; (3) whether the person is able to travel. On the ACS data set, they are (1) whether the person holds a mortgage; (2) whether the person lives in a multi-generation household; (3) whether the person attends school. On the Adult data set, they are (1) whether the person is male; (2) whether the person is married; (3) whether the person holds a college degree.
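The evaluation metric itself is simple to state in code. The sketch below computes a misclassification rate and a majority-label baseline. Reading the "Majority" method as "always predict the most frequent training label" is an assumption, since this text does not define it, and the function names are hypothetical. The SVM classifier used in the experiments is not reproduced here.

```python
from collections import Counter


def misclassification_rate(y_true, y_pred):
    """Fraction of test records whose predicted label differs from the truth."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test record
    (assumed meaning of the 'Majority' method)."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n_test


if __name__ == "__main__":
    train = [1, 1, 0, 1]
    test_true = [1, 0, 1, 1]
    predictions = majority_baseline(train, len(test_true))
    print(misclassification_rate(test_true, predictions))
```

In a comparison of this kind, a classifier trained on the synthetic data set would be scored against held-out original records with the same metric.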

Figs. 3, 4 and 5 show that, compared with the PrivBayes method, the method of the invention improves the attribute misclassification rate on every data set and substantially outperforms the Laplace noise method and the Majority method. This indicates that the method improves the utility of the data set while still protecting the private information in the published data.

Figures 6(a)-(b) compare the running time of the method of the present invention with that of the PrivBayes method on the NLTCS, ACS, and Adult data sets when the degree of the Bayesian network is k=2 and k=3, respectively (because the value 3600 is too large to plot, Figure 6(b) is shown truncated). The figures show that the running time of the method of the present invention is roughly equal to that of the PrivBayes method when the dimensionality of the data set is small, and becomes shorter than that of the PrivBayes method as the dimensionality grows. In Figure 6(a), for example, the running time of the PrivBayes method on the Adult data set is about 4 times that of the method of the present invention. In addition, as the degree k of the Bayesian network increases, the reduction in running time becomes more pronounced, which demonstrates the efficiency of the method of the present invention in a high-dimensional big data environment. When the dimensionality of the data set is higher still, the MapReduce parallel programming model can be applied on top of the framework built by the method of the present invention to shorten the data publishing time further.

Against the background of high-dimensional data publishing, the present invention proposes a differentially private high-dimensional data publishing method based on an attribute-clustering Bayesian network. Attribute clustering is performed first to obtain the data subsets; a Bayesian network satisfying differential privacy is then constructed for each subset using the exponential mechanism; each attribute is sampled in turn according to the Bayesian network and the noise-added conditional distributions to obtain a perturbed data set; and finally a new synthetic data set is assembled and published. Experiments on real data sets verify the usability and running efficiency of the method of the present invention in terms of SVM misclassification rate and algorithm running time.
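The last two stages summarized above, noise-added conditional distributions and attribute-by-attribute sampling, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented procedure: the Laplace `scale` value, the clamping of negative cells, the uniform fallback, the table layout `{(value, parent_values): probability}`, and all function names are hypothetical choices made for this example.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditional(joint, scale, rng):
    """Perturb a joint table {(v, parents): prob}, then derive Pr*[V | Pi]."""
    # Add noise to every cell, clamp negatives to zero, and renormalise.
    noisy = {cell: max(0.0, p + laplace(scale, rng)) for cell, p in joint.items()}
    total = sum(noisy.values())
    if total > 0:
        noisy = {cell: p / total for cell, p in noisy.items()}
    # Probability mass of each parent configuration.
    parent_mass = {}
    for (v, pa), p in noisy.items():
        parent_mass[pa] = parent_mass.get(pa, 0.0) + p
    n_values = len({v for v, _ in joint})
    # Condition on the parents; fall back to uniform if a configuration has no mass.
    return {
        (v, pa): (p / parent_mass[pa] if parent_mass[pa] > 0 else 1.0 / n_values)
        for (v, pa), p in noisy.items()
    }

def sample_record(network, cond_tables, domains, rng):
    """Sample attributes in network order, each conditioned on its sampled parents."""
    record = {}
    for attr, parents in network:
        pa = tuple(record[p] for p in parents)
        weights = [cond_tables[attr][(v, pa)] for v in domains[attr]]
        record[attr] = rng.choices(domains[attr], weights=weights)[0]
    return record

# Toy two-attribute network A -> B (illustrative values only).
rng = random.Random(7)
domains = {"A": [0, 1], "B": [0, 1]}
network = [("A", ()), ("B", ("A",))]
joints = {
    "A": {(0, ()): 0.6, (1, ()): 0.4},
    "B": {(0, (0,)): 0.5, (1, (0,)): 0.1, (0, (1,)): 0.1, (1, (1,)): 0.3},
}
cond_tables = {a: noisy_conditional(j, scale=0.01, rng=rng) for a, j in joints.items()}
synthetic = [sample_record(network, cond_tables, domains, rng) for _ in range(5)]
```

Sampling in network order guarantees that every parent value is already fixed when a child attribute is drawn, which is why the attribute-parent pairs must be listed in the order in which they were added to the network.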

The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited by the above embodiments. The above embodiments and the description only illustrate the principles of the present invention; various changes and improvements may be made without departing from the spirit and scope of the present invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed by the present invention is defined by the appended claims and their equivalents.

Claims

1. A high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology, characterized in that it comprises the following steps:

11) Acquisition of high-dimensional data: acquire the high-dimensional data to be published to form the original data set D, and perform attribute induction on the high-dimensional data to form a high-dimensional data attribute set;

12) Clustering division of attribute subsets: the correlation between the high-dimensional data attributes is calculated, the high-dimensional attribute set is divided into c attribute subsets by an attribute clustering method, and the original data set D is then divided according to the attribute subsets into c data subsets D_i (i=1,...,c);

13) Constructing a noise-added Bayesian network: for each data subset D_i, use a greedy Bayesian method to construct a noise-added Bayesian network N_i (i=1,...,c), where the total privacy budget allocated is ε₁, and each data subset is allocated a privacy budget in proportion to the number of attributes it has relative to the total number of attributes in the c attribute subsets

14) Generating the noise-added conditional distribution: for each Bayesian network N_i, calculate its joint probability distribution Pr[V_i, Π_i] and add noise to obtain Pr*[V_i, Π_i], and then calculate the noise-added conditional probability distribution Pr*[V_i | Π_i], where the total privacy budget allocated is ε₂, and each Bayesian network is allocated a privacy budget in proportion to the number of attribute nodes it has relative to the total number of attribute nodes in the c Bayesian networks

15) Publishing the synthetic data set: for the c data subsets, each attribute is sampled in turn, in increasing order of i, according to its Bayesian network N_i and noise-added conditional distribution Pr*[V_i | Π_i], to generate a perturbed data set
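The proportional budget split described in steps 13) and 14) can be illustrated with a short Python sketch. The allocation formula itself is not reproduced in this text, so the code implements only the rule stated in words (share of budget equals share of attributes); the function name `allocate_budget` and the example sizes are hypothetical.

```python
def allocate_budget(total_epsilon, subset_sizes):
    """Split a total privacy budget in proportion to each subset's attribute count."""
    total_attrs = sum(subset_sizes)
    if total_attrs <= 0:
        raise ValueError("there must be at least one attribute")
    return [total_epsilon * n / total_attrs for n in subset_sizes]

# Example: three attribute subsets with 4, 6 and 10 attributes share a budget of 0.3.
budgets = allocate_budget(0.3, [4, 6, 10])
```

Because the shares add up to the total, the budget spent across all c subsets stays at the allocated total under sequential composition.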

2. The high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology according to claim 1, characterized in that the clustering division of the attribute subsets comprises the following steps:

21) For the high-dimensional data set, calculate the correlation between the attributes of the high-dimensional data, and the calculation method is as follows:

Given any two attributes V_i and V_j, the relative dependency between the attributes is expressed as

22) randomly select c attributes as central attributes, where c is the number of attribute subsets;

23) For

24) Update the central attributes: for each attribute subset C_r, if the sum of the relationships between an attribute V_i and the other attributes is greater than the corresponding sum for the central attribute, that is, MR(V_i) ≥ MR(V_j) (V_j ∈ C_r, j ≠ i), then set V_i as the new central attribute of C_r;

25) Repeat steps 23) and 24) until the c central attributes remain unchanged or the number of iterations reaches a preset value, then terminate the iteration to obtain c attribute subsets and, from them, c data subsets D_i (i=1,...,c).
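Steps 22) to 25) resemble a k-medoids style loop, which can be sketched as below. Several parts are assumptions: the dependency formula of step 21) and the body of step 23) are not reproduced in this text, so the sketch takes a precomputed symmetric matrix `rel` as input and assumes that step 23) assigns each attribute to the central attribute it is most strongly related to. MR is assumed to be summed within a subset, ties keep the current central attribute, and the optional `centers` argument and all names are hypothetical.

```python
import random

def cluster_attributes(rel, c, centers=None, max_iter=50, seed=0):
    """Group attribute indices 0..n-1 into c subsets around central attributes."""
    n = len(rel)
    if centers is None:
        # Step 22): pick c central attributes at random.
        centers = random.Random(seed).sample(range(n), c)
    centers = list(centers)
    clusters = {}
    for _ in range(max_iter):
        # Assumed step 23): attach each attribute to its most related center.
        clusters = {ctr: [ctr] for ctr in centers}
        for a in range(n):
            if a in clusters:
                continue
            best = max(centers, key=lambda ctr: rel[a][ctr])
            clusters[best].append(a)
        # Step 24): the member with the largest MR becomes the new center.
        new_centers = []
        for ctr, members in clusters.items():
            def mr(v, members=members):
                return sum(rel[v][u] for u in members if u != v)
            # The current center is listed first, so it wins ties.
            new_centers.append(max(members, key=mr))
        # Step 25): stop once the central attributes no longer change.
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    return sorted(sorted(m) for m in clusters.values())

# Toy relation matrix: attributes 0-2 are strongly related, as are 3-4.
rel = [
    [0.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 0.0, 0.7, 0.1, 0.2],
    [0.8, 0.7, 0.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.0, 0.9],
    [0.1, 0.2, 0.1, 0.9, 0.0],
]
subsets = cluster_attributes(rel, c=2, centers=[0, 3])
```

As with other medoid-based clusterings, the result depends on the initial central attributes, so the example fixes them to make the outcome reproducible.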

3. The high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology according to claim 1, characterized in that constructing the noise-added Bayesian network comprises the following steps:

31) Initialization: Initialize the Bayesian network N to be

The selected attribute node set S is

A is the data set attribute sequence;

32) Initial node selection: randomly select an attribute V₁ as the initial node of the Bayesian network, add V₁ to the set S, and add the attribute-parent (AP) pair

to N;

33) AP pair candidate set enumeration: Initialize AP pair candidate set Ω as

for

and

34) AP pair scoring: use function F as the score function to calculate the score F(V, Π) of every AP pair in Ω, using the following formula:

where P°[V, Π] is the set of all maximal joint distributions of AP pair (V, Π);

35) AP pair selection: select an AP pair (V_i, Π_i) to add to the network N based on the exponential mechanism, and add V_i to S; its expression is as follows:

such that the sampling probability of picking an AP pair from Ω is the same as

36) Bayesian network update: repeat steps 33) to 35) for all attributes in A other than V₁ until every attribute node has been selected in turn, which yields the complete Bayesian network N.
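The selection in step 35) can be illustrated with the textbook form of the exponential mechanism, in which a candidate with score F is drawn with probability proportional to exp(ε·F / (2Δ)). The claim's own expression is not reproduced in this text, so this form, the `sensitivity` parameter Δ, the function names, and the example scores are all assumptions made for illustration.

```python
import math
import random

def selection_probabilities(scores, epsilon, sensitivity):
    """Exponential-mechanism probabilities, proportional to exp(eps * F / (2 * Delta))."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def choose_ap_pair(candidates, scores, epsilon, sensitivity, rng):
    """Draw one attribute-parent pair from the candidate set."""
    probs = selection_probabilities(scores, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs)[0]

# Two hypothetical AP pairs with hypothetical scores.
candidates = [("B", ("A",)), ("C", ("A",))]
scores = [-0.1, -0.4]
probs = selection_probabilities(scores, epsilon=1.0, sensitivity=0.5)
picked = choose_ap_pair(candidates, scores, 1.0, 0.5, random.Random(1))
```

A larger ε concentrates the probability on the best-scoring pair, while a smaller ε moves the choice toward uniform, which is the trade-off between network quality and privacy.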

4. The high-dimensional privacy data publishing method based on Bayesian network attribute clustering analysis technology according to claim 1, wherein generating the noise-added conditional distribution comprises the following steps:

41) Initialization: initialize the noise-added conditional distribution set P*;

42) Noise-added joint distribution generation:

Calculate the original joint distribution Pr[V_i, Π_i] according to the Bayesian network N_i,

Add Laplace Noise

43) Noise-added conditional distribution generation:

for

for