CN110334757A - Privacy-preserving clustering method and computer storage medium for big data analysis - Google Patents

Privacy-preserving clustering method and computer storage medium for big data analysis

Info

Publication number
CN110334757A
Authority
CN
China
Prior art keywords
privacy
privacy budget
sequence
clustering
sum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910565540.7A
Other languages
Chinese (zh)
Inventor
徐小龙
范泽轩
孙雁飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201910565540.7A
Publication of CN110334757A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a privacy-preserving clustering method and computer storage medium for big data analysis. The method comprises the following steps: normalizing the data and selecting center points; computing the minimum privacy budget and allocating a privacy budget sequence; assigning each sample point to its nearest center point; generating Laplace noise; adding the noise to the parameters used when updating the center points; and iterating until the difference between the sums of squared errors of two consecutive iterations is less than a threshold or the maximum number of iterations is reached. By adding Laplace-distributed noise to the intermediate parameters of the clustering algorithm, the invention protects sensitive information in the data set and addresses the leakage of that information during execution of the clustering algorithm. It improves the way the privacy budget is allocated in differentially private clustering, improves the usability of the clustering results at the same level of privacy protection, and addresses privacy leakage in big data cluster mining.

Description

Privacy-preserving clustering method and computer storage medium for big data analysis

Technical field

The invention relates to a privacy-preserving clustering method and a computer storage medium, and in particular to a privacy-preserving clustering method and computer storage medium for big data analysis.

Background

Data mining is receiving increasing attention. Mining and analyzing massive data with machine learning algorithms can yield a large amount of valuable new knowledge and patterns. Cluster analysis, one of the more commonly used methods in data mining, is widely applied in data preprocessing, target group classification, pattern recognition, image segmentation, and similar scenarios. K-means is the simplest, most effective, and most widely used algorithm in big data cluster analysis. However, updating the centroids during execution requires computing the number of samples in each cluster and the sum of each attribute, and these operations can leak sensitive information from the data set.

Differential privacy is a data privacy protection technique that perturbs data by adding noise while preserving its statistical properties. Combining differential privacy with a clustering algorithm can therefore keep the sensitive information in a data set from leaking while still producing relatively accurate clustering results. Existing privacy-preserving clustering algorithms have shortcomings: random selection of the initial points and overly fast consumption of the privacy budget both degrade the usability of the clustering results. In addition, the excessive random noise that traditional privacy budget allocation tends to produce remains an unsolved problem.

Summary of the invention

Purpose of the invention: The technical problem to be solved by the present invention is to provide a privacy-preserving clustering method and computer storage medium for big data analysis. It addresses the problem that traditional privacy budget allocation tends to produce excessive random noise, which degrades the quality of the clustering results. It improves the way the privacy budget is allocated in differentially private clustering by proposing an arithmetic privacy budget allocation, improves the usability of the clustering results at the same level of privacy protection, and addresses privacy leakage in big data cluster mining.

Technical solution: The privacy-preserving clustering method for big data analysis according to the present invention comprises the following steps:

(1) Normalize the data in the data set;

(2) Divide the data set evenly into k subsets, and randomly select one sample point in each subset as an initial center point;

(3) Set the total privacy budget ε and the maximum number of iterations tm, and compute the minimum privacy budget εm and the iteration count t=ε/εm. If t>tm, allocate the privacy budget sequence by arithmetic privacy budget allocation; if t≤tm, allocate it by average privacy budget allocation. This yields the privacy budget sequence εp, where 1≤p≤tm;

(4) For every sample point in the data set, compute its Euclidean distance to each of the k center points and assign the sample point to the nearest center point, dividing the data set into k clusters C={C1,C2,…,Ck};

(5) Generate Laplace-distributed random numbers according to the corresponding term of the privacy budget sequence εp;

(6) For each cluster Cj, where 1≤j≤k, compute the number of sample points num in the cluster and the sum vector sum of its sample points, and add noise to each to obtain num′ and sum′, the noise being the Laplace-distributed random numbers from step (5);

(7) Update the center point of each cluster Cj to sum′/num′, where 1≤j≤k;

(8) Compute the sum of squared errors. If the absolute value of the difference between the sums of squared errors of the current and previous iterations is less than a set threshold, or the number of iterations reaches the upper limit tm, stop and output the clustering result; otherwise return to step (4) and perform the next iteration.

Further, the minimum privacy budget εm in step (3) is calculated as:

where N is the number of records in the data set, d is the dimensionality of the data, and ρ is the mean of the centroid estimates over each dimension.

Further, the arithmetic privacy budget allocation in step (3) is specifically:

The total privacy budget ε is decomposed into an increasing arithmetic sequence of length tm whose first term is εm and whose terms sum to ε; reversing this sequence gives the privacy budget sequence εp.

Further, the average privacy budget allocation in step (3) is specifically:

The total privacy budget ε is decomposed into a sequence of tm equal terms, and this sequence is the privacy budget sequence εp.

Further, the random numbers in step (5) follow a Laplace distribution with location parameter 0 and scale parameter b, where b=(d+1)/ε′, d is the dimensionality of the data, and ε′ is the value at the position of the privacy budget sequence εp that corresponds to the current iteration.

Further, the initial center points in step (2) are obtained by randomly selecting one sample point in each subset and then adding random noise.

The computer storage medium according to the present invention stores a computer program which, when executed by a computer processor, implements the above privacy-preserving clustering method for big data analysis.

Beneficial effects: The present invention has the following technical effects:

1. The privacy budget sequence is generated by arithmetic privacy budget allocation: the minimum privacy budget εm is computed first, and the sequence is then obtained from the summation formula and general-term formula of an arithmetic sequence. The resulting sequence decreases gradually, which solves the problem in existing methods of the privacy budget being consumed too quickly;

2. Arithmetic privacy budget allocation distributes the total privacy budget linearly, which solves the problem in existing methods of allocating too much budget in early iterations and too little in later ones. When the total privacy budget is very small, even smaller than the minimum privacy budget εm, the invention falls back to average allocation so as to avoid, as far as possible, allocated budgets that are too small for the algorithm to run well. Compared with existing methods, the invention achieves higher clustering usability and better clustering quality.

Description of drawings

Fig. 1 is a flowchart of the method according to an embodiment of the present invention;

Fig. 2 is a flowchart of the arithmetic privacy budget allocation method of the present invention;

Fig. 3 compares the clustering usability metric of the method of the present invention with that of the comparison algorithms;

Fig. 4 compares the privacy budget sequences produced by the method of the present invention, the bisection allocation method, and the series-sum allocation method.

Detailed description

The flowchart of the method of this embodiment is shown in Fig. 1. The method is carried out in the following steps:

Step 1. Take the Image.csv data set, which comes from the clustering data sets of the School of Computing, University of Eastern Finland (http://cs.joensuu.fi/sipu/datasets/). Denote the data set by D. The number of records N is 34112 and the data dimension d is 3, i.e. each record has 3 attributes. The total privacy budget ε controls the degree of privacy protection: the smaller ε is, the more noise is added and the stronger the privacy protection. Here the total privacy budget ε is set to 0.8 and the number of clusters k to 3. Each record can be regarded as a sample point in d-dimensional space. Each dimension of D is normalized to [0,1].

Data normalization scales each dimension of the data into [0,1] using the following formula:

x′=(x-min)/(max-min),

where, for any one dimension of the data, x is a value in that dimension, min and max are the minimum and maximum of that dimension, and x′ is the normalized value.
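The normalization above can be sketched in a few lines of plain Python. This is a minimal illustration, not the embodiment's code: the function name min_max_normalize and the handling of a constant column (where max equals min) are assumptions, since the text does not discuss that case.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column of `rows` into [0, 1] with x' = (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for x, lo, hi in zip(row, lows, highs):
            # Assumption: a constant column (max == min) is mapped to 0.0
            # to avoid dividing by zero; the source does not cover this case.
            new_row.append(0.0 if hi == lo else (x - lo) / (hi - lo))
        normalized.append(new_row)
    return normalized


# Hypothetical two-attribute sample, not taken from Image.csv.
sample = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
scaled = min_max_normalize(sample)
```

Each column is scaled independently, which matches the statement that the formula is applied to any one dimension of the data.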

Step 2. Divide the preprocessed data set D evenly into k subsets {S1,S2,…,Sk}, randomly select one sample point oi from each subset Si, where 1≤i≤k, and add random noise to obtain the initial center points {u1,u2,…,uk}. Here, D is divided evenly into 3 subsets {S1,S2,S3}, one sample point is randomly selected from each subset, and noise is added to give the initial center points:

u1=[0 0.08130081 0.00473934]

u2=[0.44230769 0.27235772 0.16587678]

u3=[0.65384615 0.43089431 0.1943128].

Step 3. Obtain the privacy budget sequence εp, where 1≤p≤tm. Set the maximum number of iterations tm, compute the minimum privacy budget εm, and from it compute the iteration count t=ε/εm. If t>tm, allocate the privacy budget sequence by arithmetic privacy budget allocation; if t≤tm, allocate it by average privacy budget allocation. The result is the privacy budget sequence {ε1,ε2,…,εtm}. The allocation procedure is shown in Fig. 2.

The minimum privacy budget εm is calculated as:

where N is the number of records in the given data set, d is the number of dimensions, k is the number of clusters, and ρ is the mean of the centroid estimates over each dimension, which takes the value 0.45 when the data is normalized to [0,1].
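The formula for εm is not reproduced in this text. As a hedged sketch only, the expression below is an assumption: it is a form found in the differentially private k-means literature, it uses exactly the variables N, d, k and ρ listed above, and it evaluates to about 0.031 for the values of this embodiment (N=34112, d=3, k=3, ρ=0.45), which agrees with the value quoted later. It should not be read as the patent's own formula; the function name min_privacy_budget is likewise hypothetical.

```python
import math


def min_privacy_budget(n_records: int, dims: int, k: int, rho: float) -> float:
    """Assumed form: eps_m = sqrt(500 * k^3 / N^2 * (d + cbrt(4 * d * rho^2))^3)."""
    cube_root = (4.0 * dims * rho ** 2) ** (1.0 / 3.0)
    return math.sqrt(500.0 * k ** 3 / n_records ** 2 * (dims + cube_root) ** 3)


# Values from the embodiment: N = 34112, d = 3, k = 3, rho = 0.45.
eps_m = min_privacy_budget(34112, 3, 3, 0.45)
```

Under this assumed form, εm shrinks as the number of records N grows, so larger data sets tolerate a smaller per-iteration budget.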

Arithmetic privacy budget allocation decomposes the total privacy budget into an increasing arithmetic sequence of length tm, each term of which is the privacy budget consumed in the corresponding iteration. Specifically, the εm obtained in step 3 is taken as the first term a1 of the arithmetic sequence and the total privacy budget ε as the sum Sn of all its terms, with n=tm. The common difference dt can then be calculated from the following formulas:

Sn=n·a1+n(n-1)dt/2,

an=a1+(n-1)dt,

Once the common difference dt is known, the increasing arithmetic sequence of length tm follows, and reversing it gives the required privacy budget sequence. The privacy budget sequence is not necessarily consumed in full, since the iteration may stop before tm iterations are reached.

Average privacy budget allocation distributes the total privacy budget evenly over the maximum number of iterations, so each iteration consumes a privacy budget of ε/tm. Average allocation can also be regarded as a special case of arithmetic allocation with common difference 0.

Specifically, the maximum number of iterations tm is set to 8 and the minimum privacy budget is computed as εm=0.031, so t=ε/εm=25.806. Because t>tm, the privacy budget sequence εp, where 1≤p≤8, is computed by arithmetic privacy budget allocation. The common difference is first computed as dt=0.0197, the value of each term is then computed from the general-term formula of the arithmetic sequence, and the sequence is finally reversed to give the required decreasing privacy budget sequence {0.169, 0.14928571, 0.12957143, 0.10985714, 0.09014286, 0.07042857, 0.05071429, 0.031}.
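The allocation rule and the worked numbers can be checked with a short sketch. It solves the arithmetic-series sum for the common difference, dt = 2(ε - tm·εm)/(tm(tm-1)), builds the increasing sequence, and reverses it; when t≤tm it falls back to equal shares. The function name allocate_budget and the guard for tm=1 are assumptions made for illustration.

```python
from typing import List


def allocate_budget(eps_total: float, eps_min: float, t_max: int) -> List[float]:
    """Return the per-iteration privacy budgets, largest first."""
    t = eps_total / eps_min
    if t > t_max and t_max > 1:  # assumption: t_max == 1 falls through to the average case
        n = t_max
        # Solve S_n = n*a1 + n*(n-1)/2 * d_t for d_t, with a1 = eps_min and S_n = eps_total.
        d_t = 2.0 * (eps_total - n * eps_min) / (n * (n - 1))
        increasing = [eps_min + i * d_t for i in range(n)]
        return increasing[::-1]  # reverse so early iterations get more budget
    # Average allocation: every iteration gets eps_total / t_max.
    return [eps_total / t_max] * t_max


budgets = allocate_budget(0.8, 0.031, 8)   # values from the embodiment
small = allocate_budget(0.1, 0.031, 8)     # hypothetical small total budget, t = 3.2 <= 8
```

With ε=0.8, εm=0.031 and tm=8 this gives dt≈0.0197 and a sequence running from 0.169 down to 0.031 whose terms sum to 0.8, in line with the values listed above.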

Step 4. For every point in the data set D, compute its Euclidean distance to each of the k center points and assign the point to the nearest center point, so that D is divided into k clusters C={C1,C2,…,Ck}.

Specifically, the Euclidean distance from every point in D to each of the 3 center points is computed and each point is assigned to the nearest center point, so that D is divided into 3 clusters C={C1,C2,C3}.
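A minimal sketch of this assignment step follows, using math.dist from the Python standard library (Python 3.8 or later) for the Euclidean distance. The function name assign_to_nearest and the four sample points are hypothetical; the centers are the initial center points listed in step 2.

```python
import math
from typing import List, Sequence


def assign_to_nearest(points: Sequence[Sequence[float]],
                      centers: Sequence[Sequence[float]]) -> List[List[int]]:
    """Return, for each center, the indices of the points closest to it."""
    clusters: List[List[int]] = [[] for _ in centers]
    for idx, point in enumerate(points):
        distances = [math.dist(point, center) for center in centers]
        # Ties go to the lowest-numbered center (an assumption; the source is silent).
        clusters[distances.index(min(distances))].append(idx)
    return clusters


centers = [
    [0.0, 0.08130081, 0.00473934],
    [0.44230769, 0.27235772, 0.16587678],
    [0.65384615, 0.43089431, 0.1943128],
]
# Hypothetical normalized sample points, not taken from Image.csv.
points = [[0.0, 0.1, 0.0], [0.45, 0.3, 0.2], [0.7, 0.4, 0.2], [0.6, 0.45, 0.2]]
clusters = assign_to_nearest(points, centers)
```

Returning index lists rather than copies of the points keeps the original records in one place, which is convenient when counts and sums are computed per cluster in step 6.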

Step 5. Compute the noise to be added in this iteration. The noise is a random number drawn from a Laplace distribution with location parameter 0 and scale parameter b, written Lap(b), where b=Δf/ε′, Δf is the sensitivity, and ε′ is the privacy budget. The probability density function of the Laplace distribution is p(x)=exp(-|x|/b)/(2b). Here the sensitivity depends on the dimension, Δf=d+1, and the privacy budget ε′ is the value at the position of the privacy budget sequence εp that corresponds to the current iteration, so the noise is Lap(Δf/ε′).

Specifically, the privacy budget εp for the current iteration is looked up in the privacy budget sequence obtained in step 3, and the sensitivity is Δf=3+1=4. In the first iteration ε1 is 0.169 and the noise is Lap(4/0.169); in the second iteration ε2 is 0.1493 and the noise is Lap(4/0.1493); and so on.
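The scale parameter and the noise draw can be sketched with the standard library alone. The sampler relies on a known property: the difference of two independent exponential variables with mean b follows a Laplace distribution with location 0 and scale b. The function names laplace_scale and sample_laplace are assumptions for illustration.

```python
import random


def laplace_scale(dims: int, eps: float) -> float:
    """Scale b = delta_f / eps, with sensitivity delta_f = d + 1."""
    return (dims + 1) / eps


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw one Lap(scale) value as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


rng = random.Random(0)            # fixed seed only to make the sketch repeatable
b1 = laplace_scale(3, 0.169)      # first iteration of the embodiment: 4 / 0.169
noise = sample_laplace(b1, rng)
```

For the first iteration the scale is 4/0.169, roughly 23.7, so a typical noise value has a magnitude of a few tens; later iterations use smaller budgets and therefore larger scales.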

Step 6. For each cluster Cj, where 1≤j≤k, compute the number of sample points num in the cluster and the sum vector sum of its sample points, and add the noise from step 5 to each to obtain num′ and sum′. Specifically, for each cluster Cj, where 1≤j≤3, num and sum are computed. The results for the first iteration are:

Cluster C1: num is 1406 and the sum vector sum is [240.29 177.76 107.42];

Cluster C2: num is 12301 and the sum vector sum is [4665.25 3686.47 2473.31];

Cluster C3: num is 20405 and the sum vector sum is [13469.21 11385.21 8768.39];

The noise from step 5 is then added to each to obtain num′ and sum′. The noise added in the first iteration is Lap(4/0.169), and the results are:

Cluster C1: num′ is 1421.99 and the sum vector sum′ is [284.77 190.18 108.46];

Cluster C2: num′ is 12281.82 and the sum vector sum′ is [4688.87 3697.67 2566.92];

Cluster C3: num′ is 20396.29 and the sum vector sum′ is [13466.97 11402.30 8739.17].
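The per-cluster count and sum with added noise can be sketched as follows. Two points are assumptions rather than statements of the source: an independent Laplace draw is used for the count and for each component of the sum vector, and the cluster is assumed to be non-empty. The names laplace_sampler and noisy_cluster_stats are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def laplace_sampler(scale: float, seed: Optional[int] = None) -> Callable[[], float]:
    """Return a function that draws Lap(scale) values (difference of two exponentials)."""
    rng = random.Random(seed)
    return lambda: rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_cluster_stats(points: Sequence[Sequence[float]],
                        noise: Callable[[], float]) -> Tuple[float, List[float]]:
    """Compute num' and sum' for one non-empty cluster."""
    num = len(points)
    dims = len(points[0])
    total = [sum(p[i] for p in points) for i in range(dims)]
    # Assumption: one fresh noise draw for the count and one per coordinate of the sum.
    return num + noise(), [s + noise() for s in total]


cluster = [[0.1, 0.2], [0.3, 0.4]]  # hypothetical two-point cluster
noisy_num, noisy_sum = noisy_cluster_stats(cluster, laplace_sampler(4 / 0.169, seed=0))
```

Passing the noise source in as a function makes it easy to swap in a zero-noise function when checking the exact count and sum.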

Step 7. Update the center of each cluster Cj to uj′=sum′/num′, where 1≤j≤3. The updated centers after the first iteration are:

u1′=[0.20026401 0.13374381 0.07627629]

u2′=[0.38177298 0.30106816 0.20900154]

u3′=[0.66026546 0.55903804 0.42846875].
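The update is a component-wise division, which can be checked against the rounded num′ and sum′ values reported for the first iteration. The function name noisy_center is hypothetical, and small differences in the last digits are expected because the listed inputs are rounded to two decimals.

```python
from typing import List, Sequence


def noisy_center(noisy_sum: Sequence[float], noisy_count: float) -> List[float]:
    """New center u' = sum' / num', computed per coordinate."""
    return [s / noisy_count for s in noisy_sum]


# num' and sum' of the three clusters as listed for the first iteration.
u1 = noisy_center([284.77, 190.18, 108.46], 1421.99)
u2 = noisy_center([4688.87, 3697.67, 2566.92], 12281.82)
u3 = noisy_center([13466.97, 11402.30, 8739.17], 20396.29)
```

Because num′ carries noise, it can in principle be very small or negative for a sparsely populated cluster; the text does not describe a safeguard for that case, so none is added here.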

Step 8. Compute the sum of squared errors. If the absolute value of the difference between the sums of squared errors of the current and previous iterations is less than the set threshold, or the number of iterations reaches the upper limit tm, stop and output the clustering result; otherwise return to step 4. The sum of squared errors here is the sum, over all clusters, of the squared distances between each point in a cluster and that cluster's center point. The threshold is user-defined and determines the number of iterations. In theory it could be set to 0, but because the noise is random, a threshold of 0 would lead to too many iterations, so it can be relaxed appropriately; here it is set to 100.
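The stopping rule can be sketched as below, taking the sum of squared errors in its usual sense as the sum of squared point-to-center distances. The function names sum_squared_error and should_stop, and the sample data, are assumptions for illustration.

```python
import math
from typing import Sequence


def sum_squared_error(clusters: Sequence[Sequence[Sequence[float]]],
                      centers: Sequence[Sequence[float]]) -> float:
    """Sum over clusters of squared Euclidean distances from each point to its center."""
    return sum(math.dist(point, center) ** 2
               for cluster, center in zip(clusters, centers)
               for point in cluster)


def should_stop(prev_sse: float, curr_sse: float,
                threshold: float, iteration: int, t_max: int) -> bool:
    """Stop when the SSE change is below the threshold or the iteration cap is reached."""
    return abs(curr_sse - prev_sse) < threshold or iteration >= t_max


# Hypothetical clusters (lists of points) with their centers.
clusters = [[[0.0, 0.0], [2.0, 0.0]], [[5.0, 5.0]]]
centers = [[1.0, 0.0], [5.0, 5.0]]
sse = sum_squared_error(clusters, centers)
```

With the threshold of 100 used in the embodiment, a change in SSE from 150 to 60 would end the loop, while a change from 500 to 60 would not unless the iteration cap tm had been reached.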

The method of this embodiment is compared with two existing algorithms. For each value of ε, each of the three algorithms is run 10 times, and the F-measure between its results and those of the standard K-means algorithm is computed to evaluate clustering usability. The F-measure takes values in [0,1]; the closer it is to 1, the more similar the algorithm's clustering result is to the standard noise-free result, and the higher the clustering usability. The F-measure comparison of the three algorithms on the Image data set is shown in Fig. 3.

Fig. 4 compares the privacy budget sequences allocated by the method of this embodiment and by the two existing methods. The two existing methods consume most of the total privacy budget in the early iterations, leaving very little for the middle and late iterations, and a privacy budget that is too small produces a large amount of noise that hinders convergence. In contrast, the privacy budget sequence allocated by the method of the present invention is linear, so the budget available in the middle iterations is still relatively ample and excessive noise is less likely to interfere with convergence.

The present invention is a privacy-preserving clustering method for big data analysis. It improves the privacy budget allocation of existing differentially private clustering algorithms by using arithmetic privacy budget allocation, which solves problems of existing methods such as overly fast consumption of the privacy budget and excessive noise in late iterations, and improves the usability of the clustering results at the same level of privacy protection. The invention can be applied to cluster analysis of big data to protect personal information from disclosure during the process. For example, medical data, commercial consumption data, and location data contain a large amount of private user information; when such data is mined by clustering, the method of the invention can effectively guard against privacy leakage during data collection and algorithm execution while retaining the statistical properties and mining utility of the data.

本发明实施例如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本发明各个实施例所述方法的全部或部分。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。这样,本发明实例不限制于任何特定的硬件和软件结合。If the embodiment of the present invention is implemented in the form of software function modules and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiment of the present invention is essentially or the part that contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for Make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the methods described in various embodiments of the present invention. The aforementioned storage medium includes: various media capable of storing program codes such as U disk, mobile hard disk, read only memory (ROM, Read Only Memory), magnetic disk or optical disk. Thus, examples of the invention are not limited to any specific combination of hardware and software.

相应的,本发明的实施例还提供了一种计算机存储介质,其上存储有计算机程序。当所述计算机程序由处理器执行时,可以实现前述面向大数据分析的隐私保护聚类方法。例如,该计算机存储介质为计算机可读存储介质。Correspondingly, the embodiment of the present invention also provides a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the aforementioned privacy-preserving clustering method for big data analysis can be realized. For example, the computer storage medium is a computer readable storage medium.

本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and a combination of procedures and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions may be provided to a general purpose computer, special purpose computer, embedded processor, or processor of other programmable data processing equipment to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing equipment produce a An apparatus for realizing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instructions The device realizes the function specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, causing a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Claims (7)

1. A privacy-preserving clustering method for big data analysis, characterized by comprising the following steps:
(1) normalizing the data in the data set;
(2) dividing the data set evenly into k subsets, and randomly selecting one sample point in each subset as an initial center point;
(3) setting a total privacy budget ε and a maximum number of iterations t_m, and calculating a minimum privacy budget ε_m and a number of iterations t = ε/ε_m; if t > t_m, allocating the privacy budget sequence with the arithmetic-progression privacy budget allocation method; if t ≤ t_m, allocating the privacy budget sequence with the average privacy budget allocation method; thereby obtaining the privacy budget sequence ε_p, where 1 ≤ p ≤ t_m;
(4) for every sample point in the data set, calculating its Euclidean distance to each of the k center points, assigning the sample point to the nearest center point, and thereby dividing the data set into k clusters C = {C_1, C_2, …, C_k};
(5) generating Laplace-distributed random numbers according to the corresponding item of the privacy budget sequence ε_p;
(6) for each cluster C_j, where 1 ≤ j ≤ k, calculating the number of sample points num in the cluster and the sum vector sum of its sample points, and adding noise to each to obtain num′ and sum′, the noise being the Laplace-distributed random numbers of step (5);
(7) updating the center point of each cluster C_j to sum′/num′, where 1 ≤ j ≤ k;
(8) calculating the sum of squared errors; if the absolute value of the difference between the sums of squared errors of the current and the previous iteration is less than a set threshold, or the number of iterations reaches the upper limit t_m, ending execution and obtaining the clustering result; otherwise returning to step (4) to perform the next iteration.

2. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the minimum privacy budget ε_m in step (3) is calculated as:
[formula missing from the source text]
where N is the number of records in the data set, d is the dimensionality of the data, and ρ is the mean of the centroid estimates in each dimension.

3. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the arithmetic-progression privacy budget allocation method in step (3) is specifically: decomposing the total privacy budget ε into an increasing arithmetic sequence of length t_m whose first term is ε_m and whose terms sum to ε, and reversing that sequence to obtain the privacy budget sequence ε_p.

4. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the average privacy budget allocation method in step (3) is specifically: decomposing the total privacy budget ε into a sequence of t_m equal terms, that sequence being the privacy budget sequence ε_p.

5. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the random number in step (5) follows a Laplace distribution with location parameter 0 and scale parameter b, where b = (d + 1)/ε′, d is the dimensionality of the data, and ε′ is the value found at the position in the privacy budget sequence ε_p that corresponds to the current iteration number.

6. The privacy-preserving clustering method for big data analysis according to claim 1, characterized in that the initial center point in step (2) is obtained by randomly selecting one sample point in each subset and then adding random noise.

7. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a computer processor, implements the method according to any one of claims 1 to 6.
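The budget allocation of claims 3 and 4 and the noisy center update of steps (5) to (7) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the applicants' implementation. For an increasing arithmetic sequence of length t_m with first term ε_m and sum ε, the common difference works out to δ = 2(ε − t_m·ε_m)/(t_m(t_m − 1)), which is positive exactly when t = ε/ε_m > t_m. The function names, the reading of the Laplace scale as b = (d + 1)/ε′, and the clamp on the noisy count are assumptions introduced here; since the formula for ε_m is not available in this text, ε_m is taken as an input.

```python
import random


def budget_sequence(eps_total, eps_min, t_max):
    """Per-iteration privacy budgets, following claims 3 and 4 (sketch)."""
    t = eps_total / eps_min
    if t > t_max and t_max > 1:
        # Increasing arithmetic sequence: first term eps_min, sum eps_total.
        delta = 2.0 * (eps_total - t_max * eps_min) / (t_max * (t_max - 1))
        increasing = [eps_min + p * delta for p in range(t_max)]
        # Reversed, so the earliest iterations get the largest budget.
        return increasing[::-1]
    # Average allocation: t_max equal shares of the total budget.
    return [eps_total / t_max] * t_max


def laplace(b, rng):
    """Laplace(0, b) sample: difference of two i.i.d. exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def noisy_center(points, d, eps_iter, rng):
    """Noisy center of one cluster, following steps (5) to (7) of claim 1."""
    b = (d + 1) / eps_iter  # assumed reading of the scale parameter in claim 5
    num_noisy = len(points) + laplace(b, rng)
    sum_noisy = [sum(p[i] for p in points) + laplace(b, rng) for i in range(d)]
    # Safeguard added for this sketch only (not in the claims): keep the
    # noisy count away from zero and negative values before dividing.
    num_noisy = max(num_noisy, 1.0)
    return [s / num_noisy for s in sum_noisy]


rng = random.Random(0)
seq = budget_sequence(eps_total=10.0, eps_min=0.5, t_max=5)
print(seq)  # [3.5, 2.75, 2.0, 1.25, 0.5]
cluster = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1)]
print(noisy_center(cluster, d=2, eps_iter=seq[0], rng=rng))
```

Reversing the sequence spends the largest share of the budget on the first iterations, where the center points move the most, and the smallest share (ε_m) on the last one.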

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910565540.7A CN110334757A (en) 2019-06-27 2019-06-27 Privacy-preserving clustering method and computer storage medium for big data analysis

Publications (1)

Publication Number Publication Date
CN110334757A (en) 2019-10-15

Family

ID=68144509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910565540.7A Pending CN110334757A (en) 2019-06-27 2019-06-27 Privacy-preserving clustering method and computer storage medium for big data analysis

Country Status (1)

Country Link
CN (1) CN110334757A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778314A (en) * 2017-03-01 2017-05-31 全球能源互联网研究院 A distributed differential privacy protection method based on k-means
CN108280491A (en) * 2018-04-18 2018-07-13 南京邮电大学 A k-means clustering method for differential privacy protection
CN108549904A (en) * 2018-03-28 2018-09-18 西安理工大学 Differential privacy protection K-means clustering method based on silhouette coefficient

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
C. DWORK: "Differential privacy", 《PROCEEDINGS OF 39TH INTERNATIONAL COLLOQUIUM ON AUTOMATA, LANGUAGES AND PROGRAMMING》 *
SU D ET AL: "Differentially private k-means clustering", 《PROCEEDINGS OF THE SIXTH ACM CONFERENCE ON DATA AND APPLICATION SECURITY AND PRIVACY》 *
SHANG TAO ET AL: "Big data decision tree algorithm based on arithmetic-progression privacy budget allocation", 《工程科学与技术》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750725A (en) * 2019-10-24 2020-02-04 河北经贸大学 Privacy-protecting user portrait generation method, terminal device and storage medium
CN111242196A (en) * 2020-01-06 2020-06-05 广西师范大学 Differential privacy protection method for interpretable deep learning
CN111242196B (en) * 2020-01-06 2022-06-21 广西师范大学 Differential Privacy Preserving Methods for Explainable Deep Learning
CN111563272A (en) * 2020-04-30 2020-08-21 支付宝实验室(新加坡)有限公司 Information statistical method and device
WO2021248937A1 (en) * 2020-06-09 2021-12-16 深圳大学 Geographically distributed graph computing method and system based on differential privacy
CN111914285A (en) * 2020-06-09 2020-11-10 深圳大学 Geographical distributed graph calculation method and system based on differential privacy
CN111914285B (en) * 2020-06-09 2022-06-17 深圳大学 Geographic distributed graph calculation method and system based on differential privacy
CN111444545B (en) * 2020-06-12 2020-09-04 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
CN114282083A (en) * 2020-09-28 2022-04-05 阿里巴巴集团控股有限公司 Data analysis method, noise construction method, equipment and storage medium
CN112202542A (en) * 2020-09-30 2021-01-08 清华-伯克利深圳学院筹备办公室 Data perturbation method, device and storage medium
CN112199722A (en) * 2020-10-15 2021-01-08 南京邮电大学 A Differential Privacy Preserving Clustering Method Based on K-means
CN112347088B (en) * 2020-10-28 2024-02-20 南京邮电大学 Data credibility optimization method, storage medium and equipment
CN112347088A (en) * 2020-10-28 2021-02-09 南京邮电大学 Data reliability optimization method, storage medium and equipment
CN112613065B (en) * 2020-12-02 2024-08-20 北京明朝万达科技股份有限公司 Data sharing method and device based on differential privacy protection
CN112613065A (en) * 2020-12-02 2021-04-06 北京明朝万达科技股份有限公司 Data sharing method and device based on differential privacy protection
CN112767693A (en) * 2020-12-31 2021-05-07 北京明朝万达科技股份有限公司 Vehicle driving data processing method and device
CN113094751A (en) * 2021-04-21 2021-07-09 山东大学 Personalized privacy data processing method, device, medium and computer equipment
CN113537308A (en) * 2021-06-29 2021-10-22 中国海洋大学 Two-stage k-means clustering processing system and method based on localized differential privacy
CN113537308B (en) * 2021-06-29 2023-11-03 中国海洋大学 Two-stage k-means clustering processing system and method based on localized differential privacy
CN113609523A (en) * 2021-07-29 2021-11-05 南京邮电大学 Vehicle networking private data protection method based on block chain and differential privacy
CN113609523B (en) * 2021-07-29 2022-04-01 南京邮电大学 A privacy data protection method for Internet of Vehicles based on blockchain and differential privacy
CN113849471A (en) * 2021-09-26 2021-12-28 中国联合网络通信集团有限公司 Data compression method, device, equipment and storage medium
CN114139576A (en) * 2021-11-06 2022-03-04 西安电子科技大学 Mobile terminal sensor data scrambling method and system based on Laplace mechanism
CN114117540B (en) * 2022-01-25 2022-04-29 广州天鹏计算机科技有限公司 A method and system for analyzing and processing big data
CN114117540A (en) * 2022-01-25 2022-03-01 广州天鹏计算机科技有限公司 A method and system for analyzing and processing big data
CN114817985A (en) * 2022-04-22 2022-07-29 广东电网有限责任公司 Privacy protection method, device, equipment and storage medium for electricity consumption data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191015