CN110334757A

CN110334757A - Privacy-preserving clustering method and computer storage medium for big data analysis

Info

Publication number: CN110334757A
Application number: CN201910565540.7A
Authority: CN
Inventors: 徐小龙; 范泽轩; 孙雁飞
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University
Priority date: 2019-06-27
Filing date: 2019-06-27
Publication date: 2019-10-15

Abstract

The invention discloses a privacy-preserving clustering method and computer storage medium for big data analysis. The method comprises the following steps: normalizing the data and selecting center points; calculating the minimum privacy budget and allocating a privacy budget sequence; assigning each sample point to its nearest center point; generating Laplace noise; adding the noise to the parameters used to update the center points; and iterating until the difference between the sums of squared errors of two adjacent iterations is less than a threshold or the maximum number of iterations is reached. The invention protects the sensitive information in the data set by adding noise that obeys the Laplace distribution to the intermediate parameters during execution of the clustering algorithm, which solves the problem of sensitive information in the data set being leaked during execution. It also improves the way the privacy budget is allocated in differentially private clustering algorithms, which improves the usability of the clustering results under the same degree of privacy protection and addresses privacy leakage in big data cluster mining.

Description

Privacy-preserving clustering method and computer storage medium for big data analysis

Technical field

The invention relates to a privacy-preserving clustering method and a computer storage medium, and in particular to a privacy-preserving clustering method and a computer storage medium for big data analysis.

Background

Data mining is attracting increasing attention. Using machine learning algorithms to mine and analyze massive data can yield a great deal of valuable new knowledge and new patterns. Cluster analysis, a commonly used method in data mining, is widely applied in scenarios such as data preprocessing, target group classification, pattern recognition, and image segmentation. K-means is the simplest, most effective, and most widely used algorithm in big data cluster analysis. However, when the algorithm updates the centroids it must compute the number of samples in each cluster and the sum of each attribute, and these operations can leak sensitive information from the data set.

Differential privacy is a data privacy protection technique that perturbs the data by adding noise while preserving its statistical properties. Combining differential privacy with a clustering algorithm can therefore keep the sensitive information in the data set from leaking while still producing relatively accurate clustering results. Existing privacy-preserving clustering algorithms have shortcomings: random selection of the initial points and overly rapid consumption of the privacy budget both reduce the usability of the clustering results. In addition, the excessive random noise that traditional privacy budget allocation tends to cause remains unsolved.

Summary of the invention

Purpose of the invention: The technical problem to be solved by the invention is to provide a privacy-preserving clustering method and computer storage medium for big data analysis. It addresses the problem that traditional privacy budget allocation easily leads to excessive random noise, which degrades the quality of the clustering results. It improves the privacy budget allocation of differentially private clustering algorithms by proposing an arithmetic (equal-difference) privacy budget allocation scheme, improves the usability of the clustering results under the same degree of privacy protection, and addresses privacy leakage in big data cluster mining.

Technical solution: The privacy-preserving clustering method for big data analysis of the invention comprises the following steps:

(1) Normalize the data in the data set;

(2) Divide the data set evenly into k subsets, and randomly select one sample point from each subset as an initial center point;

(3) Set the total privacy budget ε and the maximum number of iterations t_m, and calculate the minimum privacy budget ε^m and the number of iterations t = ε/ε^m. If t > t_m, allocate the privacy budget sequence using the arithmetic privacy budget allocation; if t ≤ t_m, allocate it using the average privacy budget allocation. This yields the privacy budget sequence ε_p, where 1 ≤ p ≤ t_m;

(4) For every sample point in the data set, calculate its Euclidean distance to each of the k center points, assign the sample point to the nearest center point, and thereby divide the data set into k clusters C = {C₁, C₂, …, C_k};

(5) Generate Laplace-distributed random numbers according to the corresponding term of the privacy budget sequence ε_p;

(6) For each cluster C_j, where 1 ≤ j ≤ k, calculate the number of sample points num in the cluster and the sum vector sum of its sample points, and add noise to each to obtain num′ and sum′, the noise being the Laplace-distributed random numbers of step (5);

(7) Update the center point of each cluster C_j to sum′/num′, where 1 ≤ j ≤ k;

(8) Calculate the sum of squared errors. If the absolute value of the difference between the sums of squared errors of the current and previous iterations is less than the set threshold, or the number of iterations reaches the upper limit t_m, stop and output the clustering result; otherwise return to step (4) for the next iteration.

Further, the minimum privacy budget ε^m in step (3) is calculated as:

where N is the number of records in the data set, d is the dimensionality of the data, and ρ is the average of the centroid estimates over each dimension.

Further, the arithmetic privacy budget allocation in step (3) is as follows:

The total privacy budget ε is decomposed into an increasing arithmetic sequence of length t_m whose first term is ε^m and whose terms sum to ε; reversing this sequence gives the privacy budget sequence ε_p.

Further, the average privacy budget allocation in step (3) is as follows:

The total privacy budget ε is decomposed into a sequence of t_m equal terms, and this sequence is the privacy budget sequence ε_p.

Further, the random numbers in step (5) obey a Laplace distribution with location parameter 0 and scale parameter b, where b = (d+1)/ε′, d is the dimensionality of the data, and ε′ is the value at the position in the privacy budget sequence ε_p that corresponds to the current iteration number.

Further, the initial center points in step (2) are obtained by randomly selecting one sample point from each subset and then adding random noise.

The computer storage medium of the invention stores a computer program which, when executed by a computer processor, implements the above privacy-preserving clustering method for big data analysis.

Beneficial effects: The invention has the following technical effects:

1. The privacy budget sequence is generated by arithmetic privacy budget allocation. The minimum privacy budget ε^m is calculated first, and the privacy budget sequence is then obtained from the sum formula and the general-term formula of an arithmetic sequence. The resulting sequence changes gradually, which solves the problem in existing methods of the privacy budget being consumed too quickly;

2. Arithmetic privacy budget allocation distributes the total privacy budget linearly, which solves the problem in existing methods of the allocated privacy budget being too large in early iterations and too small in later ones. When the total privacy budget is very small, even smaller than the minimum privacy budget ε^m, the invention uses the average allocation so as to avoid, as far as possible, per-iteration budgets so small that they disrupt the algorithm. Compared with existing methods, the invention offers higher clustering usability and better clustering quality.

Description of drawings

Fig. 1 is a flowchart of the method according to an embodiment of the invention;

Fig. 2 is a flowchart of the arithmetic privacy budget allocation method of the invention;

Fig. 3 compares the clustering usability index of the method of the invention with that of the comparison algorithms;

Fig. 4 compares the privacy budget sequences produced by the method of the invention, the bisection allocation method, and the series-sum allocation method.

Detailed description

The method flowchart of this embodiment is shown in Fig. 1. The method is carried out in the following steps:

Step 1: The Image.csv data set is used, which comes from the clustering data sets of the School of Computer Science, University of Eastern Finland (http://cs.joensuu.fi/sipu/datasets/). Denote the data set by D. The number of records N is 34112 and the data dimension d is 3, that is, each record has 3 attributes. The total privacy budget ε controls the degree of privacy protection: the smaller ε is, the larger the added noise and the higher the degree of privacy protection. Here the total privacy budget ε is set to 0.8 and the number of clusters k to 3. Each record can be regarded as a sample point in d-dimensional space. Each dimension of D is normalized to [0,1].

Data normalization scales each dimension of the data to [0,1] using the following formula:

x′ = (x − min) / (max − min)

where, for any dimension of the data, x is a value in that dimension, min and max are the minimum and maximum values of that dimension, and x′ is the normalized value.
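The normalization step can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's own code: the function name `min_max_normalize` is hypothetical, and mapping a constant column to 0 is an assumption made only to avoid division by zero.

```python
import numpy as np

def min_max_normalize(data: np.ndarray) -> np.ndarray:
    """Scale every column (dimension) of `data` to the range [0, 1]."""
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column (max == min) is mapped to 0
    # instead of dividing by zero; the source does not cover this case.
    safe_span = np.where(span == 0, 1.0, span)
    return (data - col_min) / safe_span

# Small hypothetical example: 3 records with 3 attributes each.
X = np.array([[0.0, 10.0, 5.0],
              [5.0, 20.0, 5.0],
              [10.0, 40.0, 5.0]])
X_norm = min_max_normalize(X)
```

Each column is scaled independently, which matches the per-dimension wording of the formula.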

Step 2: Divide the preprocessed data set D evenly into k subsets {S₁, S₂, …, S_k}, randomly select one sample point o_i from each subset S_i, where 1 ≤ i ≤ k, and add random noise to obtain the initial center points {u₁, u₂, …, u_k}. Here, D is divided evenly into 3 subsets {S₁, S₂, S₃}, one sample point is randomly selected from each subset, and noise is added to obtain the initial center points:

u₁ = [0 0.08130081 0.00473934]

u₂ = [0.44230769 0.27235772 0.16587678]

u₃ = [0.65384615 0.43089431 0.1943128].
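One possible reading of this initialization step is sketched below. The source does not say how the data set is split or what kind of random noise is added, so several choices here are assumptions: the split into contiguous chunks, the use of Laplace noise with a caller-supplied `noise_scale`, and clipping the result back into [0, 1]. The function name `initial_centers` is hypothetical.

```python
import numpy as np

def initial_centers(data, k, noise_scale, rng):
    """Pick one random point from each of k subsets and perturb it."""
    # Assumption: "divided evenly" means k contiguous chunks of
    # (almost) equal size.
    subsets = np.array_split(data, k)
    picks = np.stack([s[rng.integers(len(s))] for s in subsets])
    # Assumption: Laplace noise with a caller-chosen scale; the source
    # only says "random noise".
    noisy = picks + rng.laplace(loc=0.0, scale=noise_scale, size=picks.shape)
    # Assumption: keep the centers inside the normalized range [0, 1].
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)   # seed is arbitrary
demo_data = rng.random((30, 3))  # hypothetical normalized data
centers = initial_centers(demo_data, k=3, noise_scale=0.01, rng=rng)
```

Drawing one point per subset spreads the starting centers across the data set, which is the stated motivation for not choosing all k points at random from the whole set.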

Step 3: Obtain the privacy budget sequence ε_p, where 1 ≤ p ≤ t_m. Set the maximum number of iterations t_m, calculate the minimum privacy budget ε^m, and from it calculate the number of iterations t = ε/ε^m. If t > t_m, allocate the privacy budget sequence using the arithmetic privacy budget allocation; if t ≤ t_m, use the average privacy budget allocation. The result is the privacy budget sequence {ε₁, ε₂, …, ε_{t_m}}. The allocation procedure is shown in Fig. 2.

The minimum privacy budget ε^m is calculated as:

where N is the number of records in the given data set, d is the number of dimensions, k is the number of clusters, and ρ is the average of the centroid estimates over each dimension; when the data are normalized to [0,1], ρ takes the value 0.45.
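The formula for ε^m is not reproduced in this text. As a hedged sketch, the expression below is a minimum-budget estimate used in the differentially private k-means literature, ε^m = sqrt(500·k³/N² · (d + ∛(4dρ²))³). Treating it as the patent's formula is an assumption. It uses the same four variables, and with N = 34112, d = 3, k = 3 and ρ = 0.45 it gives a value that rounds to the 0.031 used in this embodiment. The function name `min_privacy_budget` is hypothetical.

```python
import math

def min_privacy_budget(n_records, dim, k, rho):
    """Assumed form: sqrt(500 * k^3 / N^2 * (d + cbrt(4 * d * rho^2))^3)."""
    inner = dim + (4.0 * dim * rho ** 2) ** (1.0 / 3.0)
    return math.sqrt(500.0 * k ** 3 / n_records ** 2 * inner ** 3)

eps_min = min_privacy_budget(n_records=34112, dim=3, k=3, rho=0.45)
t = 0.8 / round(eps_min, 3)  # total budget 0.8 divided by the rounded eps_min
print(round(eps_min, 3), round(t, 3))
```

Because the budget shrinks as N grows, a larger data set tolerates a smaller per-iteration budget before the noise overwhelms the centroid estimates.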

The arithmetic privacy budget allocation decomposes the total privacy budget into an increasing arithmetic sequence of length t_m, each term of which is the privacy budget consumed in the corresponding iteration. Specifically, the ε^m obtained in step 3 is taken as the first term a₁ of the arithmetic sequence and the total privacy budget ε as the sum S_n of all its terms, and the common difference d_t is calculated from the following formulas:

S_n = n(a₁ + a_n)/2,

a_n = a₁ + (n−1)d_t,

Once the common difference d_t is known, the increasing arithmetic sequence of length t_m follows, and reversing it gives the required privacy budget sequence. The privacy budget sequence is not necessarily consumed in full.
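Combining the sum formula and the general-term formula of an arithmetic sequence gives a closed form for the common difference. The derivation below is a worked sketch with n = t_m, a₁ = ε^m and S_n = ε. The closed form is not stated in the source but follows directly from the two standard formulas.

```latex
\begin{aligned}
S_n &= \frac{n\,(a_1 + a_n)}{2}, \qquad a_n = a_1 + (n-1)\,d_t \\
\Rightarrow\quad S_n &= n\,a_1 + \frac{n\,(n-1)}{2}\,d_t \\
\Rightarrow\quad d_t &= \frac{2\,(S_n - n\,a_1)}{n\,(n-1)}
  = \frac{2\,(\varepsilon - t_m\,\varepsilon^{m})}{t_m\,(t_m - 1)}
\end{aligned}
```

With ε = 0.8, t_m = 8 and ε^m = 0.031, this gives d_t = 2(0.8 − 0.248)/56 ≈ 0.0197.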

The average privacy budget allocation divides the total privacy budget evenly over the maximum number of iterations, so that each iteration consumes a privacy budget of ε/t_m. Average allocation can also be regarded as a special arithmetic allocation with a common difference of 0.

Specifically, the maximum number of iterations t_m is set to 8 and the minimum privacy budget is calculated as ε^m = 0.031, so t = ε/ε^m = 25.806. Because t > t_m, the arithmetic privacy budget allocation is used to calculate the privacy budget sequence ε_p, where 1 ≤ p ≤ 8. The common difference d_t = 0.0197 is calculated first, each term is then calculated from the general-term formula of the arithmetic sequence, and finally the sequence is reversed to obtain the required decreasing privacy budget sequence: {0.169, 0.14928571, 0.12957143, 0.10985714, 0.09014286, 0.07042857, 0.05071429, 0.031}.
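The choice between the two allocation schemes, and the worked sequence above, can be reproduced with a short sketch. The function name `allocate_budget` is hypothetical, and the guard for t_m = 1 is an added assumption to avoid division by zero.

```python
def allocate_budget(eps_total, eps_min, t_max):
    """Return the per-iteration privacy budgets as a list of length t_max."""
    t = eps_total / eps_min
    if t > t_max and t_max > 1:
        # Arithmetic allocation: first term eps_min, terms sum to eps_total.
        d_t = 2.0 * (eps_total - t_max * eps_min) / (t_max * (t_max - 1))
        increasing = [eps_min + i * d_t for i in range(t_max)]
        return increasing[::-1]  # reversed, so the largest budget is spent first
    # Average allocation: every iteration receives the same share.
    return [eps_total / t_max] * t_max

budgets = allocate_budget(eps_total=0.8, eps_min=0.031, t_max=8)
print([round(b, 8) for b in budgets])
```

With ε = 0.8, ε^m = 0.031 and t_m = 8 the result agrees with the sequence listed in the text to the printed precision, and its terms sum to the total budget 0.8.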

Step 4: For every point in the data set D, calculate its Euclidean distance to each of the k center points and assign the sample point to the nearest center point, so that D is divided into k clusters C = {C₁, C₂, …, C_k}.

Specifically, for every point in D the Euclidean distances to the 3 center points are calculated and the point is assigned to the nearest center point, so that D is divided into 3 clusters C = {C₁, C₂, C₃}.
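The assignment step is the standard nearest-center rule of K-means. A minimal NumPy sketch follows; the function name `assign_to_nearest` and the sample values are hypothetical.

```python
import numpy as np

def assign_to_nearest(data, centers):
    """Return, for each row of `data`, the index of its nearest center."""
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d); the norm over the
    # last axis yields an (n, k) matrix of Euclidean distances.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.1, 0.0, 0.0]])
demo_centers = np.array([[0.0, 0.0, 0.0],
                         [1.0, 1.0, 1.0]])
labels = assign_to_nearest(points, demo_centers)
```

Cluster C_j then consists of the rows whose label equals j (with 0-based indices in this sketch).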

Step 5: Calculate the noise to be added in this iteration. The noise is a random number obeying a Laplace distribution with location parameter 0 and scale parameter b, denoted Lap(b), where b = Δf/ε′, Δf is the sensitivity, and ε′ is the privacy budget. The probability density function of the Laplace distribution is f(x) = (1/(2b))·exp(−|x|/b). Here the sensitivity depends on the dimension, Δf = d+1, and the privacy budget ε′ is the value at the position in the privacy budget sequence ε_p that corresponds to the current iteration number, so the noise is Lap(Δf/ε′).

Specifically, the privacy budget ε_p corresponding to the current iteration is looked up in the privacy budget sequence obtained in step 3, and the sensitivity is Δf = 3+1 = 4. In the first iteration, ε₁ is 0.169 and the noise is Lap(4/0.169); in the second iteration, ε₂ is 0.1493 and the noise is Lap(4/0.1493); and so on.
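Drawing this noise with NumPy could look like the sketch below. The function name `laplace_noise` is hypothetical. Drawing one independent value for each cluster count and each sum component (a k × (d+1) array) is an assumption, since the source does not state how many draws are made per iteration.

```python
import numpy as np

def laplace_noise(dim, eps_iter, size, rng):
    """Sample Lap(b) noise with b = (d + 1) / eps_iter."""
    sensitivity = dim + 1             # Δf = d + 1
    scale = sensitivity / eps_iter    # b = Δf / ε′
    return rng.laplace(loc=0.0, scale=scale, size=size)

rng = np.random.default_rng(42)  # seed is arbitrary
# First iteration of the embodiment: d = 3, ε₁ = 0.169, so b = 4 / 0.169.
# Assumed layout: 3 clusters, each with 1 count and 3 sum components.
noise = laplace_noise(dim=3, eps_iter=0.169, size=(3, 4), rng=rng)
```

Because b is inversely proportional to ε′, the later iterations, which receive smaller budgets, receive larger noise.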

Step 6: For each cluster C_j, where 1 ≤ j ≤ k, calculate the number of sample points num in the cluster and the sum vector sum of its sample points, and add the noise of step 5 to each to obtain num′ and sum′. Specifically, for each cluster C_j, where 1 ≤ j ≤ 3, num and sum are calculated. The results for the first iteration are:

For cluster C₁, num is 1406 and the sum vector sum is [240.29 177.76 107.42];

For cluster C₂, num is 12301 and the sum vector sum is [4665.25 3686.47 2473.31];

For cluster C₃, num is 20405 and the sum vector sum is [13469.21 11385.21 8768.39];

The noise of step 5 is then added to each to obtain num′ and sum′. The noise added in the first iteration is Lap(4/0.169), and the results are:

For cluster C₁, num′ is 1421.99 and the sum vector sum′ is [284.77 190.18 108.46];

For cluster C₂, num′ is 12281.82 and the sum vector sum′ is [4688.87 3697.67 2566.92];

For cluster C₃, num′ is 20396.29 and the sum vector sum′ is [13466.97 11402.30 8739.17];

Step 7: Update the center of each cluster C_j to u_j′ = sum′/num′, where 1 ≤ j ≤ 3. The updated centers after the first iteration are:

u₁′ = [0.20026401 0.13374381 0.07627629]

u₂′ = [0.38177298 0.30106816 0.20900154]

u₃′ = [0.66026546 0.55903804 0.42846875].
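Steps 6 and 7 can be sketched together as a noisy centroid update. The function name `noisy_centers` is hypothetical, and drawing independent noise for the count and for each sum component is an assumption. The second part of the block only divides the num′ and sum′ values listed above, which reproduces the updated centers to about four decimal places.

```python
import numpy as np

def noisy_centers(data, labels, k, eps_iter, rng):
    """Update k centers as (noisy sum) / (noisy count) for each cluster."""
    dim = data.shape[1]
    scale = (dim + 1) / eps_iter  # b = (d + 1) / ε′
    centers = np.empty((k, dim))
    for j in range(k):
        members = data[labels == j]
        num_noisy = len(members) + rng.laplace(0.0, scale)
        sum_noisy = members.sum(axis=0) + rng.laplace(0.0, scale, size=dim)
        # Caveat: for very small clusters num_noisy can be close to zero
        # or negative; the source does not describe how to handle that.
        centers[j] = sum_noisy / num_noisy
    return centers

# Check against the first-iteration values listed in the text.
num_noisy_text = np.array([1421.99, 12281.82, 20396.29])
sum_noisy_text = np.array([[284.77, 190.18, 108.46],
                           [4688.87, 3697.67, 2566.92],
                           [13466.97, 11402.30, 8739.17]])
published_centers = sum_noisy_text / num_noisy_text[:, None]
print(published_centers)
```

Only the noisy count and the noisy sum are used, so the exact per-cluster statistics never enter the published centers directly.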

Step 8: Calculate the sum of squared errors. If the absolute value of the difference between the sums of squared errors of the current and previous iterations is less than the set threshold, or the number of iterations reaches the upper limit t_m, stop and output the clustering result; otherwise return to step 4. Here the sum of squared errors is the sum, over all clusters, of the squared distances between each point in a cluster and that cluster's center point. The threshold is user-defined and determines the number of iterations. In theory it could be set to 0, but because the noise is random this would lead to too many iterations, so the threshold can be relaxed appropriately; here it is set to 100.
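The stopping rule can be sketched as follows, taking the sum of squared errors in its usual sense of summed squared Euclidean distances. The names `sum_squared_error` and `should_stop` are hypothetical; the threshold of 100 and t_m = 8 are the values from this embodiment.

```python
import numpy as np

def sum_squared_error(data, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    diffs = data - centers[labels]
    return float((diffs ** 2).sum())

def should_stop(sse_now, sse_prev, threshold, iteration, t_max):
    """Stop when the SSE change is below the threshold or t_max is reached."""
    return abs(sse_now - sse_prev) < threshold or iteration >= t_max

# Small hypothetical example.
pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
lbl = np.array([0, 0, 1])
ctr = np.array([[1.0, 0.0], [10.0, 0.0]])
sse = sum_squared_error(pts, lbl, ctr)  # (0-1)^2 + (2-1)^2 + 0 = 2.0
```

For the first iteration a caller can pass `float("inf")` as `sse_prev`, so that only the iteration limit can end the loop at that point.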

The method of this embodiment was compared with two existing algorithms. For each value of ε, each of the three algorithms was run 10 times, and the F-measure between its results and those of the standard K-means algorithm was calculated to evaluate clustering usability. The F-measure ranges over [0,1]; the closer it is to 1, the more similar the algorithm's clustering result is to the standard noise-free result and the higher the clustering usability. Fig. 3 compares the F-measure of the three algorithms on the Image data set.
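The source does not give the exact F-measure definition it uses. The sketch below uses one common clustering variant as an assumption: each reference cluster is matched to the predicted cluster with the highest F score, and the matched scores are averaged with weights proportional to reference cluster size. The function name `clustering_f_measure` is hypothetical.

```python
import numpy as np

def clustering_f_measure(reference, predicted):
    """Weighted best-match F-measure between two labelings (assumed variant)."""
    reference = np.asarray(reference)
    predicted = np.asarray(predicted)
    total = 0.0
    for ref_label in np.unique(reference):
        in_ref = reference == ref_label
        best = 0.0
        for pred_label in np.unique(predicted):
            in_pred = predicted == pred_label
            overlap = np.sum(in_ref & in_pred)
            if overlap == 0:
                continue
            precision = overlap / in_pred.sum()
            recall = overlap / in_ref.sum()
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each reference cluster by its share of all points.
        total += in_ref.sum() / len(reference) * best
    return float(total)

# Reference labels would come from noise-free K-means, predicted labels
# from the private algorithm; these values are hypothetical.
score = clustering_f_measure([0, 0, 1, 1], [1, 1, 0, 0])
```

This variant is insensitive to how the clusters are numbered, so a relabeled but otherwise identical clustering scores 1.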

Fig. 4 compares the privacy budget sequences allocated by the method of this embodiment and by the two existing methods. The two existing methods consume most of the total privacy budget in the early iterations, leaving very little for the middle and late iterations, and an overly small privacy budget easily produces a large amount of noise that hinders convergence. In contrast, the privacy budget sequence allocated by the method of the invention is linearly distributed, the budget available in the middle iterations is also relatively ample, and excessive noise is less likely to interfere with convergence.

The invention is a privacy-preserving clustering method for big data analysis. It improves the privacy budget allocation of existing differentially private clustering algorithms by using arithmetic privacy budget allocation, which solves problems in existing methods such as the privacy budget being consumed too quickly and the noise becoming too large in later iterations, and it improves the usability of the clustering results under the same degree of privacy protection. The invention can be applied to the cluster analysis of big data, protecting personal information from disclosure in the process. For example, medical data, commercial consumption data, and location data contain a large amount of private user information; when such data are clustered and mined, the method of the invention can effectively guard against privacy leakage during data collection and algorithm execution while retaining the statistical properties and mining utility of the data.

If an embodiment of the invention is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solution of the embodiments of the invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), a magnetic disk, or an optical disc. The embodiments of the invention are therefore not limited to any specific combination of hardware and software.

Correspondingly, an embodiment of the invention also provides a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the aforementioned privacy-preserving clustering method for big data analysis can be implemented. For example, the computer storage medium is a computer-readable storage medium.

Those skilled in the art will understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Claims

1. A privacy-preserving clustering method for big data analysis, characterized by comprising the following steps:

(1) Normalize the data in the data set;

(2) Divide the data set evenly into k subsets, and randomly select one sample point from each subset as an initial center point;

(3) Set the total privacy budget ε and the maximum number of iterations t_m, and calculate the minimum privacy budget ε^m and the number of iterations t = ε/ε^m; if t > t_m, allocate the privacy budget sequence using the arithmetic privacy budget allocation; if t ≤ t_m, allocate it using the average privacy budget allocation; the result is the privacy budget sequence ε_p, where 1 ≤ p ≤ t_m;

(4) For every sample point in the data set, calculate its Euclidean distance to each of the k center points, assign the sample point to the nearest center point, and thereby divide the data set into k clusters C = {C₁, C₂, …, C_k};

(5) Generate Laplace-distributed random numbers according to the corresponding term of the privacy budget sequence ε_p;

(6) For each cluster C_j, where 1 ≤ j ≤ k, calculate the number of sample points num in the cluster and the sum vector sum of its sample points, and add noise to each to obtain num′ and sum′, the noise being the Laplace-distributed random numbers of step (5);

(7) Update the center point of each cluster C_j to sum′/num′, where 1 ≤ j ≤ k;

(8) Calculate the sum of squared errors; if the absolute value of the difference between the sums of squared errors of the current and previous iterations is less than the set threshold, or the number of iterations reaches the upper limit t_m, stop and output the clustering result; otherwise return to step (4) for the next iteration.

2. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the minimum privacy budget ε^m in step (3) is calculated as:

where N is the number of records in the data set, d is the dimensionality of the data, and ρ is the average of the centroid estimates over each dimension.

3. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the arithmetic privacy budget allocation in step (3) is as follows:

The total privacy budget ε is decomposed into an increasing arithmetic sequence of length t_m whose first term is ε^m and whose terms sum to ε, and reversing this sequence gives the privacy budget sequence ε_p.

4. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the average privacy budget allocation in step (3) is as follows:

The total privacy budget ε is decomposed into a sequence of t_m equal terms, and this sequence is the privacy budget sequence ε_p.

5. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the random numbers in step (5) obey a Laplace distribution with location parameter 0 and scale parameter b, where b = (d+1)/ε′, d is the dimensionality of the data, and ε′ is the value at the position in the privacy budget sequence ε_p that corresponds to the current iteration number.

6. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the initial center points in step (2) are obtained by randomly selecting one sample point from each subset and then adding random noise.

7. A computer storage medium on which a computer program is stored, wherein the computer program, when executed by a computer processor, implements the method according to any one of claims 1 to 6.