WO2018184407A1 - K-means clustering method and system with privacy protection - Google Patents

K-means clustering method and system with privacy protection

Info

Publication number
WO2018184407A1
WO2018184407A1 PCT/CN2017/117943 CN2017117943W WO2018184407A1 WO 2018184407 A1 WO2018184407 A1 WO 2018184407A1 CN 2017117943 W CN2017117943 W CN 2017117943W WO 2018184407 A1 WO2018184407 A1 WO 2018184407A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
centroid
ciphertext
server
client
Prior art date
Application number
PCT/CN2017/117943
Other languages
English (en)
French (fr)
Inventor
王轩
蒋琳
李晔
姚霖
刘泽超
靳亚宾
梁玉冬
刘猛
漆舒汉
Original Assignee
哈尔滨工业大学深圳研究生院
Priority date
Filing date
Publication date
Application filed by 哈尔滨工业大学深圳研究生院
Publication of WO2018184407A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Definitions

  • The present invention relates to the field of data mining, and in particular to a K-means clustering method with privacy protection and to a system for implementing the method.
  • K-means clustering is one of the most classic and widely used methods in data mining. It groups similar data items together by computing the distance between them. As informatization, digitization and networking accelerate, economic globalization has become an irreversible trend; the data sources used by clustering algorithms are increasingly diverse, and data security is increasingly important. Because the data may come from multiple participants, it may contain sensitive or private information about them, and if this information is shared among the participants, the privacy of the data cannot be guaranteed. Joint data mining with privacy protection mines the joint database of multiple participants and extracts useful information while protecting the privacy of both the user data and the mining results. How to design a joint data mining algorithm with privacy protection is therefore a difficult problem that needs to be solved.
  • The semi-honest model matches real scenarios in many cases; under this model, data privacy is guaranteed by all parties always following the protocol.
  • However, in order to guarantee data privacy, solutions under this model usually incur high computation and communication costs and are therefore not feasible in practice.
  • The traditional K-means clustering algorithm is a classical clustering algorithm based on Euclidean distance.
  • It consists of three main steps: selecting the centroid points, classifying the data points, and recomputing the centroid points.
  • Suppose the training samples are $\{x_i \in \mathbb{R}^l \mid 1 \le i \le l\}$, where $l$ is the number of samples. First, $k$ centroid points are selected at random, denoted $M = \{\mu_c \in \mathbb{R}^l \mid 1 \le c \le k\}$. Then the distance from each data point $x_i$ to each centroid point $\mu_c$ is computed, and $x_i$ is assigned to the class of its nearest centroid point: $C_c := \arg\min_c \lVert x_i - \mu_c \rVert^2$. Finally, each centroid point $\mu_c$ is recomputed.
  • It can thus be seen that the traditional K-means clustering algorithm comprises three steps: selecting the centroid points, classifying the data points, and recomputing the centroid points.
  • During classification, the Euclidean distance from each data point to every centroid point is computed first, and the data point is then assigned to the nearest centroid point. The distance used is the squared Euclidean distance, which makes two values easier to compare without changing their relative order. Recomputing the centroid points requires the component-wise sum of the data points in each class, and these points may come from different participants, so the computation of the traditional K-means algorithm may leak private information.
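The three steps described above can be sketched in plaintext as follows. This is only an illustration of the classical, unencrypted algorithm, not the protocol of the invention; the function names `sq_dist` and `kmeans`, the `tol` stopping tolerance and the fixed random seed are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance; omitting the square root keeps the ordering."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Plain K-means: pick centroids, classify points, recompute centroids."""
    rng = random.Random(seed)
    # Step 1: randomly select k centroid points from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to the class of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the component-wise mean of its class.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty class's centroid
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # assumed stopping rule: centroids stopped moving
            break
    return centroids, clusters
```

Step 3 is where data from different participants would have to be combined, which is the part the invention protects with a secure protocol.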
  • To solve the problems in the prior art, the present invention provides a K-means clustering method with privacy protection, and a system for implementing the method.
  • The method of the invention comprises the following steps:
  • S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server;
  • S2: Data owners A and B each randomly select k centroid points, encrypt them, and upload them to the server;
  • S3: The server computes the Euclidean distance from each ciphertext data point to each centroid point through the secure distance calculation protocol, and classifies the data points according to the computed distances through the secure comparison protocol;
  • S4: The server and data owners A and B jointly recompute the k new centroid points through the secure circuit protocol;
  • S5: Data owner A or B determines, through the secure comparison protocol, the distance between the new centroid points and the original centroid points in the ciphertext data. If it is less than the threshold, classification ends and data owners A and B request the server to send the classified data to A and B respectively; otherwise, the method returns to step S2 for the next iteration.
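  • As a further improvement, in step S1 the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system.
  • As a further improvement, in step S2 the selection of the centroid points covers both their number and their values, and specifically includes the following steps:
  • S21: Data owners A and B each randomly select k centroid points;
  • S22: Each iterates on its own data set according to the traditional K-means clustering algorithm and classifies the data;
  • S23: Each computes the distance from every data point to its corresponding centroid point, and the sum S of these distances over all data points;
  • S24: When the sums S for k-1, k and k+1 centroid points differ little, k is taken as the number of centroid points;
  • S25: Data owners A and B compute the average using the values of their respective centroid points; this average gives the values of the k centroid points.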
  • As a further improvement, step S3 includes the following steps:
  • S31: The server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroid points A uploaded, and between each ciphertext record of data owner B and the ciphertext centroid points B uploaded;
  • S32: The server and data owner A jointly compute, through the secure distance calculation protocol, the ciphertext distance between each data point of A and each centroid point; the server and data owner B do the same for each data point of B;
  • S33: According to the set of ciphertext distances obtained in step S32, the server assigns the data of data owners A and B to the nearest class, storing the two owners' data separately within the same class.
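The bookkeeping in S33, where each class keeps the two owners' records apart, can be illustrated in plaintext. The sketch below is an assumption-laden illustration: in the method itself the distances are ciphertexts and the comparison runs through the secure comparison protocol, and the name `assign_by_owner` and the dictionary layout are hypothetical.

```python
def assign_by_owner(owner_points, centroids):
    """Assign each owner's points to the nearest centroid.

    Returns a list with one dict per class; each dict maps an owner name to
    that owner's points in the class, so the owners' data stay separate.
    """
    clusters = [{owner: [] for owner in owner_points} for _ in centroids]
    for owner, points in owner_points.items():
        for p in points:
            # Squared Euclidean distance to every centroid (plaintext stand-in).
            dists = [sum((a - b) ** 2 for a, b in zip(p, mu)) for mu in centroids]
            clusters[dists.index(min(dists))][owner].append(p)
    return clusters
```

Keeping the per-owner split inside each class matters later, because each owner's records are encrypted under a different public key.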
  • As a further improvement, step S4 includes the following steps:
  • S41: The server sends the data points stored separately within the same class to the corresponding data owners A and B;
  • S42: Data owners A and B decrypt the data;
  • S43: The server and data owners A and B compute the new centroid point of the class through the secure circuit protocol.
  • The present invention also provides a system for implementing the method, comprising a database, a first client used by data owner A, and a second client used by data owner B. The first client and the second client encrypt their respective data and upload the ciphertext to the server; each randomly selects k centroid points, encrypts them and uploads them to the server; after the server has classified the data, they recompute the k new centroid points together with the server and determine the distance between the new centroid points and the original centroid points.
  • If the distance is less than the threshold, classification ends and the clients request the server to send the classified data to the first client and the second client respectively; otherwise the centroid points are uploaded again. The server receives the data uploaded by the first client and the second client, computes the Euclidean distance from each data point to each centroid point, classifies the data points according to the computed distances, and then recomputes the k new centroid points together with the first client and the second client.
  • As a further improvement, the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system.
  • As a further improvement, the selection of the centroid points by the first client and the second client covers both their number and their values, and specifically involves the following modules:
  • Centroid point selection module: randomly selects k centroid points;
  • Classification module: iterates on the respective data set according to the traditional K-means clustering algorithm and classifies the data;
  • Secure distance calculation module: computes, through the secure distance calculation protocol, the distance from each data point to its corresponding centroid point, and the sum S of these distances over all data points;
  • Centroid point number selection module: determines that k is the number of centroid points when the sums S for k-1, k and k+1 centroid points differ little;
  • Centroid point value selection module: computes the average using the values of the respective centroid points; this average gives the values of the k centroid points.
  • As a further improvement, the server comprises:
  • a first ciphertext distance calculation module, which computes the ciphertext distance between each ciphertext record of the first client and the ciphertext centroid points it uploaded, and between each ciphertext record of data owner B and the ciphertext centroid points it uploaded;
  • a second ciphertext distance calculation module, which computes jointly with the first client the ciphertext distance between each data point of the first client and each centroid point, and jointly with the second client the ciphertext distance between each data point of the second client and each centroid point; and a classification module, which uses the set of ciphertext distances produced by the second ciphertext distance calculation module to assign the data of the first client and the second client to the nearest class, storing them separately within the same class.
  • As a further improvement, the server also comprises a sending module, which sends the data points stored separately within the same class to the corresponding first client and second client, and a secure centroid point calculation module, which computes the new centroid point of each class together with the first client and the second client through the secure circuit protocol.
  • Compared with the prior art, the beneficial effects of the present invention are as follows. Encryption ensures both security during data mining and the correctness of the result. Data storage outsourcing is supported, so the method can run on larger data sets. Data computing outsourcing is supported: most of the computation is outsourced to the cloud platform, whose computing power greatly improves execution efficiency while correctness is preserved.
  • The method not only achieves secure computation under the semi-honest model, but in the centroid recalculation stage also supports secure computation when at most one of the three participants is malicious.
  • FIG. 1 is a flow chart of the method of the present invention;
  • FIG. 2 is a schematic structural diagram of the system of the present invention;
  • FIG. 3 is a schematic diagram of the server and client time consumption of the traditional K-means clustering algorithm;
  • FIG. 4 is a schematic diagram of the server and client time consumption of the present invention;
  • FIG. 5 is a schematic diagram of the server and client shares of time consumption for the traditional K-means clustering algorithm;
  • FIG. 6 is a schematic diagram of the server and client shares of time consumption for the present invention;
  • FIG. 7 shows the ratio of the time consumed by the present invention to that of the traditional K-means clustering algorithm.
  • To address the performance problems of privacy-preserving data mining, the present invention studies existing privacy-preserving data mining algorithms in depth and proposes an efficient K-means clustering algorithm with privacy protection on horizontally partitioned data sets. The scheme supports storage outsourcing and computing outsourcing with two data owners and a cloud platform.
  • The data is stored in the cloud as ciphertext.
  • The cloud platform interacts with the two data owners to carry out K-means clustering on the joint data set of both parties.
  • The present invention designs separate security protocols for three technical problems in K-means clustering with privacy protection: a secure distance calculation protocol for ciphertext distance calculation, a secure comparison protocol for ciphertext comparison, and a secure circuit protocol for ciphertext division.
  • These security protocols are then applied within the clustering framework to implement the K-means clustering algorithm with privacy protection.
  • As shown in FIG. 1, the K-means clustering method with privacy protection of the present invention comprises five main steps, described in detail below:
  • Step S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server.
  • This example assumes that data owner A is Alice, data owner B is Bob, and the server is C.
  • Alice and Bob encrypt their data $D_x$ and $D_y$ with their own public keys $pk_1$ and $pk_2$, producing ciphertexts $C_x$ and $C_y$, which they upload to C. Each record in $D_x$ and $D_y$ is $l$-dimensional, so encrypting the database means encrypting every dimension of every record. All data from Alice and Bob is stored as ciphertext in the cloud file system.
  • The specific representation is as follows:
  • where n is the number of records.
  • Step S2: Alice and Bob select k centroid points, encrypt them with their respective public keys, and upload them to C.
  • Selecting the centroid points is a very important step, because the choice directly affects the number of iterations and hence the overall execution time of the system; good centroid points speed up convergence and improve execution efficiency.
  • The selection has two parts. The first is the number of centroid points: Alice and Bob each pick a random value of k and k centroid points and iterate on their own data sets. After classification, each computes the distance from every data point to its corresponding centroid point and sums these distances to obtain S. When the values of S for k-1, k and k+1 differ little, k is taken as the number of centroid points.
  • Alice and Bob each find their own k in this way, and the average of the two k values is taken as the final k.
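The rule for choosing the number of centroid points can be sketched as below. The text does not say how much change in S counts as "little" or how a non-integer average is rounded, so the relative tolerance `rel_tol`, the rounding in `final_k` and both function names are assumptions for illustration.

```python
def choose_k(s_by_k, rel_tol=0.1):
    """Return the smallest k whose sums S(k-1), S(k), S(k+1) differ little.

    s_by_k maps a candidate k to the total distance S obtained with k centroids.
    """
    for k in sorted(s_by_k):
        if k - 1 in s_by_k and k + 1 in s_by_k:
            trio = (s_by_k[k - 1], s_by_k[k], s_by_k[k + 1])
            # Assumed criterion: the spread is within rel_tol of the largest value.
            if max(trio) - min(trio) <= rel_tol * max(trio):
                return k
    return None  # no stable region found among the candidates

def final_k(k_alice, k_bob):
    """Average the two owners' choices (rounding rule is an assumption)."""
    return round((k_alice + k_bob) / 2)
```

For example, with S values 100, 40, 12, 11.5, 11.2 and 11.0 for k = 1 to 6, the first stable triple is at k = 4.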
  • Alice randomly selects k centroid points $M = \{\mu_c \mid 1 \le c \le k\}$, where each $\mu_c$ has components $u_{cj}$.
  • Alice and Bob each encrypt the centroid points with their own public key and upload the ciphertexts to the cloud.
  • The ciphertexts of the centroid points are [symbol not reproduced] and [symbol not reproduced].
  • Step S3: Server C computes the Euclidean distance from each ciphertext data point to each centroid point through the secure distance calculation protocol, and then classifies the data points according to the computed distances through the secure comparison protocol. Specifically:
  • C computes the ciphertext distance between each of Alice's ciphertext records and each ciphertext centroid point, and between each of Bob's ciphertext records and each ciphertext centroid point;
  • C and Alice jointly run the SSED (secure distance calculation) protocol to compute the ciphertext distance between each $x_i$ and $\mu_c$.
  • C and Bob jointly run the SSED protocol to compute the ciphertext distance between each $y_i$ and $\mu_c$. The ciphertext distances between all $x_i$ and $\mu_c$ are stored in one set, and the ciphertext distances between all $y_i$ and $\mu_c$ in another.
  • $Enc_{pa} = \{\mathrm{KeyGen}, \mathrm{Encrypt}, \mathrm{Decrypt}, \mathrm{Evaluate}\}$.
  • The process of Paillier encryption is as follows:
  • $L(x) = (x - 1)/N$. The public key is $(N, g)$ and the private key is $(\lambda, \mu)$.
  • $D_{sk}(c)$ stands for $\mathrm{Decrypt}(c)$.
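For orientation, the four Paillier operations named above can be sketched with the textbook construction. This is a toy under stated assumptions, not the patent's implementation: the primes are tiny and insecure, the choice $g = N + 1$ is one common option rather than something the text specifies, and the helper names (`keygen`, `encrypt`, `decrypt`, `add`, `mul_const`) are hypothetical.

```python
import math
import random

def L(x, n):
    """The Paillier L function: L(x) = (x - 1) / N, as exact integer division."""
    return (x - 1) // n

def keygen(p, q):
    """Build public key (N, g) and private key (lambda, mu) from two primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # assumed common choice of g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)          # modular inverse mod N
    return (n, g), (lam, mu)

def encrypt(pk, m):
    """c = g^m * r^N mod N^2 for a random r coprime to N."""
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pk, sk, c):
    """m = L(c^lambda mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n

def add(pk, c1, c2):
    """Evaluate: multiplying ciphertexts adds the plaintexts."""
    n, _ = pk
    return c1 * c2 % (n * n)

def mul_const(pk, c, k):
    """Evaluate: raising a ciphertext to k multiplies the plaintext by k."""
    n, _ = pk
    return pow(c, k, n * n)
```

The additive property shown by `add` and `mul_const` is what lets the server combine encrypted values without decrypting them; multiplying two encrypted values needs an interactive protocol.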
  • The secure distance calculation protocol of this example is built on a secure multiplication protocol, whose specific processing procedure is as follows:
  • Here $Z_n$ is a positive integer space, and $r_x$ and $r_y$ are positive integers.
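The procedure itself is not reproduced in this text. As a hedged illustration, the sketch below shows one widely used form of secure multiplication over Paillier ciphertexts, in which the server blinds its inputs with random values $r_x$ and $r_y$ before the key holder multiplies them, and a squared distance built on top of it. Whether this matches the patent's exact protocol is an assumption; all function names are hypothetical, and both parties are simulated in one process, with `sk` standing in for the key holder.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier keys with g = N + 1 (insecure sizes, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return n, (lam, mu)

def enc(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m % n, n * n) * pow(r, n, n * n) % (n * n)

def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def key_holder_multiply(n, sk, ca, cb):
    """Key holder's side: decrypts the blinded values and returns E(a * b)."""
    return enc(n, dec(n, sk, ca) * dec(n, sk, cb))

def secure_multiply(n, sk, cx, cy):
    """Server's side: obtain E(x * y) from E(x) and E(y) without seeing x or y."""
    n2 = n * n
    rx, ry = random.randrange(1, n), random.randrange(1, n)
    # Blind the inputs: E(x + rx) and E(y + ry).
    ch = key_holder_multiply(n, sk, cx * enc(n, rx) % n2, cy * enc(n, ry) % n2)
    # (x+rx)(y+ry) = xy + x*ry + y*rx + rx*ry; strip the last three terms.
    return (ch * pow(cx, n - ry, n2) * pow(cy, n - rx, n2)
            * pow(enc(n, rx * ry), n - 1, n2)) % n2

def secure_sq_distance(n, sk, cx_vec, cmu_vec):
    """E(sum_j (x_j - mu_j)^2) from component-wise ciphertexts."""
    n2 = n * n
    total = enc(n, 0)
    for cx, cmu in zip(cx_vec, cmu_vec):
        cdiff = cx * pow(cmu, n - 1, n2) % n2          # E(x_j - mu_j)
        total = total * secure_multiply(n, sk, cdiff, cdiff) % n2
    return total
```

In this sketch the key holder only ever decrypts values masked by $r_x$ and $r_y$, and the server only ever handles ciphertexts, which is the usual motivation for the blinding step.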
  • Step S4: C, Alice and Bob jointly recompute the k centroid points through the secure circuit protocol. Because the data in $CL_1$ and $CL_2$ are encrypted under the two participants' different public keys, the new centroid points cannot be computed directly.
  • C sends $CL_1$ and $CL_2$ to Alice and Bob respectively, who decrypt them to obtain $L_1$ and $L_2$.
  • The calculation formula is:
  • The specific processing procedure of the secure circuit protocol is:
  • Step S5: Alice computes the distance between the new centroid points and the previous centroid points through the secure comparison protocol. If it is less than the threshold, Alice and Bob request C to send the classified data to each of them. Otherwise, Alice and Bob encrypt the new centroid points with their respective public keys and send them to C for the next iteration.
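The plaintext arithmetic behind steps S4 and S5 can be sketched as follows. This is not the secure circuit protocol, whose formula and procedure are not reproduced in this text; it only shows, under assumptions, what such a protocol would compute: a class centroid obtained from each owner's component-wise sum and count, and a threshold test on the centroid shift. The function names, and the choice to compare the squared shift of every centroid with the threshold, are assumptions.

```python
def merged_centroid(sum_a, count_a, sum_b, count_b):
    """Centroid of one class from the two owners' partial sums and counts."""
    total = count_a + count_b
    if total == 0:
        return None  # empty class: no centroid can be computed
    return [(a + b) / total for a, b in zip(sum_a, sum_b)]

def has_converged(old_centroids, new_centroids, threshold):
    """True when every centroid moved by less than the threshold (squared distance)."""
    return all(
        sum((o - n) ** 2 for o, n in zip(old, new)) < threshold
        for old, new in zip(old_centroids, new_centroids)
    )
```

Only the sums and counts cross the boundary between the owners here, which is why the division is the operation the secure circuit protocol is designed to perform.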
  • The present invention also provides a system for implementing the above method.
  • The system of this embodiment includes a database C, a first client $P_1$ used by data owner A, and a second client $P_2$ used by data owner B. The first client $P_1$ and the second client $P_2$ encrypt their respective data and upload the ciphertext to the server; each randomly selects k centroid points, encrypts them and uploads them to the server.
  • After the server has classified the data, the clients recompute the k new centroid points together with the server and determine the distance between the new centroid points and the original centroid points. If it is less than the threshold, classification ends and the clients request the server to send the classified data to the first client $P_1$ and the second client $P_2$ respectively; otherwise the centroid points are uploaded again.
  • The server receives the data uploaded by the first client $P_1$ and the second client $P_2$, computes the Euclidean distance from each data point to each centroid point, classifies the data points according to the computed distances, and then recomputes the k new centroid points together with the first client $P_1$ and the second client $P_2$.
  • Server C is a cloud server. It encrypts the data uploaded by data owners A and B and stores it in the cloud file system, which supports data storage outsourcing, execution on larger data sets, and data computing outsourcing.
  • The framework used in the present invention was first proposed in the paper "Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks".
  • The method of that paper is referred to below as the previous scheme. A clustering algorithm under the same framework is more comparable than clustering algorithms under other frameworks, so the present invention is compared mainly with the previous scheme.
  • Both schemes were run in the same experimental environment. The evaluation criteria are described below, followed by a comparative analysis of the experimental results.
  • The time consumed by the method of the present invention has three parts: client time, communication time, and server time, where client and server time each include the initialization phase and the protocol execution phase. Because the techniques used here differ from those of the previous scheme, the two can only be compared at a macroscopic level.
  • The comparison covers two aspects: theoretical complexity analysis, including time, space and communication complexity, and experimental results. Because different numbers of iterations affect the overall result, this example bases the comparison on a single iteration, in the following respects:
  • The scheme of the invention is lower than the previous scheme in time complexity, space complexity and communication complexity.
  • The experimental results of the two schemes are analyzed below on the basis of the experimental data.
  • The present invention measures and compares the time consumed in one iteration.
  • The cloud platform introduced by the present invention provides greater computing power, so the scheme should be somewhat more efficient than the previous one. However, because the cloud platform consists of 30 PCs and one server, task division, task scheduling and data recovery are needed for each machine while a task is processed, and these operations also take time. With more data points, one iteration takes longer and the share of time spent on operations such as task division falls.
  • In addition, generating the circuit takes a long time, although the circuit only needs to be generated once, in the first iteration. In theory, therefore, when the number of data points is small, one iteration of the previous scheme is more efficient than one iteration of the present invention.
  • When the number of data points exceeds a certain threshold, one iteration of the present scheme is more efficient than one iteration of the previous scheme, and as the data grows further the efficiency advantage of the present scheme becomes increasingly obvious.
  • The experimental results support this analysis and show that the threshold is about 5000 data points.
  • When the data size exceeds 5000, the scheme of the invention takes less time per iteration; when it is below 5000, the previous scheme takes less time per iteration.
  • The time comparison of the two schemes is shown in Table 3.
  • Table 3: Comparison of the time consumed by one iteration
  • The present invention is concerned not only with the time of one iteration but also with how much of the work in each iteration server C takes on. Provided one iteration remains fast, a larger server share of the iteration time means less computation for the client, so such a scheme becomes more efficient as the data size increases.
  • On the client, the main work is encryption and decryption, and the number of encryptions and decryptions performed by the client is essentially the same in both schemes.
  • In the previous scheme, ciphertext distance calculation and ciphertext distance comparison use the improved Liu encryption, all of whose operations are linear. The scheme of the present invention uses the Paillier encryption algorithm, whose encryption and decryption require exponentiation and modular operations over a group.
  • The improved Liu encryption should therefore take less time than Paillier encryption, so in theory, on data sets of the same size, the client in the previous scheme spends less time than the client in the present scheme. As the data size increases, one iteration of the present scheme takes relatively less time overall while the client takes relatively more.
  • The client's share of the time is therefore relatively larger in the present scheme, and the server's share relatively smaller.
  • The collected experimental data bear out these expectations.
  • The time consumed by each participant in one iteration of the two schemes is shown in Table 4 and Table 5.
  • The server and client times of the previous scheme are shown in FIG. 3.
  • The server and client times of the present invention are shown in FIG. 4.
  • In the present invention the server runs on a cloud platform of 30 PCs and 1 server, so its computing power is assured.
  • As the data size grows, the server-side time rises only slightly, with no pronounced upward trend.
  • The client time grows with the data size, mainly because the decryption performed by the client is an exponentiation over a group, which is more expensive than a linear operation.
  • Consequently, the server's share of the time is lower in the present invention and the client's share is higher.
  • The server and client time shares of the previous scheme are shown in FIG. 5, and those of the present invention in FIG. 6.
  • The present invention also reports the time taken in one iteration by the K-means clustering algorithm with privacy protection and by the classic K-means algorithm. The overhead introduced by encryption is relatively large, but as the data size increases, the ratio of the per-iteration time of the present invention to that of classic K-means becomes smaller and smaller.
  • The per-iteration times of the present invention and the classic K-means algorithm are shown in Table 6, and the time ratio is shown in FIG. 7.
  • The invention takes K-means, a typical data mining algorithm, applies it to the horizontally partitioned joint data set of two parties, and supports storage outsourcing and computing outsourcing to a cloud platform.
  • The beneficial effects of the present invention mainly include the following:
  • The scheme supports data storage outsourcing.
  • The cloud platform has greater storage capacity than an ordinary PC, which allows the scheme to run on larger data sets;
  • The scheme supports data computing outsourcing.
  • The cloud platform is a distributed computing framework that pools many resources into one cluster, which greatly increases the computing power of the system.
  • The scheme outsources most of the computation to the cloud platform, whose computing power greatly improves execution efficiency while correctness is preserved;
  • The K-means clustering algorithm with privacy protection proposed by the present invention not only achieves secure computation under the semi-honest model, but in the centroid recalculation stage also supports secure computation when at most one of the three participants is malicious.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

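The present invention provides a K-means clustering method and system with privacy protection, belonging to the field of data mining. The method comprises the following steps: data owners A and B encrypt their respective data and randomly selected centroid points and upload them to a server; the server computes, over the ciphertext data, the Euclidean distance from each data point to each centroid point using a secure multiplication protocol and a secure distance calculation protocol, and classifies the data points; the server and data owners A and B jointly recompute the new centroid points over the ciphertext data using a secure circuit protocol; data owner A or B uses a secure comparison protocol to determine the distance between the new centroid points and the original centroid points; if it is less than a threshold, classification ends and data owners A and B request the server to send the classified data to A and B respectively; otherwise, the new centroid points are uploaded again and the next iteration begins. The invention guarantees the correctness of the data mining result while protecting data privacy; it supports data storage outsourcing and data computing outsourcing, which greatly improves execution efficiency while preserving correctness; and it supports secure computation when at most one of the three participants is malicious.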

Description

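K-means clustering method and system with privacy protection
Technical field
The present invention relates to the field of data mining, and in particular to a K-means clustering method with privacy protection and to a system for implementing the method.
Background
K-means clustering is one of the most classic and widely used methods in data mining. It groups similar data items together by computing the distance between them. As informatization, digitization and networking accelerate, economic globalization has become an irreversible trend; the data sources used by clustering algorithms are increasingly diverse, and data security is increasingly important. Because the data may come from multiple participants, it may contain sensitive or private information about them, and if this information is shared among the participants, the privacy of the data cannot be guaranteed. Joint data mining with privacy protection mines the joint database of multiple participants and extracts useful information while protecting the privacy of both the user data and the mining results. How to design a joint data mining algorithm with privacy protection is therefore a difficult problem that needs to be solved.
The semi-honest model matches real scenarios in many cases; under this model, data privacy is guaranteed by all parties always following the protocol. However, in order to guarantee data privacy, solutions under this model usually incur high computation and communication costs and are therefore not feasible in practice.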
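The traditional K-means clustering algorithm is a classical clustering algorithm based on Euclidean distance. It consists of three main steps: selecting the centroid points, classifying the data points, and recomputing the centroid points. Suppose the training samples are $\{x_i \in \mathbb{R}^l \mid 1 \le i \le l\}$, where $l$ is the number of samples. First, $k$ centroid points are selected at random, denoted $M = \{\mu_c \in \mathbb{R}^l \mid 1 \le c \le k\}$. Then the distance from each data point $x_i$ to each centroid point $\mu_c$ is computed, and $x_i$ is assigned to the class of its nearest centroid point: $C_c := \arg\min_c \lVert x_i - \mu_c \rVert^2$. Finally, each centroid point $\mu_c$ is recomputed according to the formula: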
Figure PCTCN2017117943-appb-000001
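It can thus be seen that the traditional K-means clustering algorithm comprises three steps: selecting the centroid points, classifying the data points, and recomputing the centroid points. During classification, the Euclidean distance from each data point to every centroid point is computed first, and the data point is then assigned to the nearest centroid point. The distance used is the squared Euclidean distance, which makes two values easier to compare without changing their relative order. Recomputing the centroid points requires the component-wise sum of the data points in each class, and these points may come from different participants, so the computation may involve privacy issues. In short, the computation of the traditional K-means clustering algorithm may leak private information.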
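Summary of the invention
To solve the problems in the prior art, the present invention provides a K-means clustering method with privacy protection, and a system for implementing the method.
The method of the invention comprises the following steps:
S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server;
S2: Data owners A and B each randomly select k centroid points, encrypt them, and upload them to the server;
S3: The server computes the Euclidean distance from each ciphertext data point to each centroid point through the secure distance calculation protocol, and classifies the data points according to the computed distances through the secure comparison protocol;
S4: The server and data owners A and B jointly recompute the k new centroid points through the secure circuit protocol;
S5: Data owner A or B determines, through the secure comparison protocol, the distance between the new centroid points and the original centroid points in the ciphertext data. If it is less than the threshold, classification ends and data owners A and B request the server to send the classified data to A and B respectively; otherwise, the method returns to step S2 for the next iteration.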
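As a further improvement, in step S1 the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system.
As a further improvement, in step S2 the selection of the centroid points covers both their number and their values, and specifically includes the following steps:
S21: Data owners A and B each randomly select k centroid points;
S22: Each iterates on its own data set according to the traditional K-means clustering algorithm and classifies the data;
S23: Each computes the distance from every data point to its corresponding centroid point, and the sum S of these distances over all data points;
S24: When the sums S for k-1, k and k+1 centroid points differ little, k is taken as the number of centroid points;
S25: Data owners A and B compute the average using the values of their respective centroid points; this average gives the values of the k centroid points.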
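As a further improvement, step S3 includes the following steps:
S31: The server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroid points A uploaded, and between each ciphertext record of data owner B and the ciphertext centroid points B uploaded;
S32: The server and data owner A jointly compute, through the secure distance calculation protocol, the ciphertext distance between each data point of A and each centroid point; the server and data owner B do the same for each data point of B;
S33: According to the set of ciphertext distances obtained in step S32, the server assigns the data of data owners A and B to the nearest class, storing the two owners' data separately within the same class.
As a further improvement, step S4 includes the following steps:
S41: The server sends the data points stored separately within the same class to the corresponding data owners A and B;
S42: Data owners A and B decrypt the data;
S43: The server and data owners A and B compute the new centroid point of the class through the secure circuit protocol.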
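The invention further provides a system implementing the method, comprising a database, a first client used by data owner A, and a second client used by data owner B. The first and second clients are configured to: encrypt their respective data and upload the ciphertext to the server; each randomly select k centroids, encrypt them, and upload them to the server; after the server has classified the data, jointly recompute the k new centroids with the server; and evaluate the distance between the new centroids and the original centroids, ending classification and requesting the server to send the classified data to the first and second clients respectively if the distance is below the threshold, and otherwise uploading the centroids again. The server is configured to receive the data uploaded by the first and second clients, compute the Euclidean distance from each data point to each centroid, classify the data points according to the computed distances, and then jointly recompute the k new centroids with the first and second clients.
As a further improvement, the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.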
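As a further improvement, centroid selection at the first and second clients includes selecting both the number and the values of the centroids, and specifically comprises the following modules:
Centroid selection module: for randomly selecting k centroids;
Classification module: for iterating on the client's own dataset using the traditional K-means clustering algorithm and classifying the data points;
Secure distance computation module: for computing, through the secure distance computation protocol, the distance from each data point to its assigned centroid, and the sum S of these distances over all data points;
Centroid count selection module: for determining that k is the number of centroids when the sums S corresponding to k-1, k, and k+1 centroids differ only slightly;
Centroid value selection module: for computing the average of the respective centroid values, the average giving the values of the k centroids.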
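As a further improvement, the server comprises:
First ciphertext distance computation module: for computing the ciphertext distance between each ciphertext record of the first client and the ciphertext centroids it uploaded, and between each ciphertext record of data owner B and the ciphertext centroids it uploaded;
Second ciphertext distance computation module: for jointly computing, with the first client, the ciphertext distance between each data point of the first client and each centroid, and for jointly computing, with the second client, the ciphertext distance between each data point of the second client and each centroid;
Classification module: for assigning the data of the first and second clients to the nearest class, based on the ciphertext distance sets produced by the second ciphertext distance computation module, and storing the two clients' data separately within each class.
As a further improvement, the server further comprises a sending module, for sending the data points stored separately within the same class to the corresponding first and second clients, and a secure centroid computation module, for computing the new centroid within each class together with the first and second clients through the secure circuit protocol.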
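Compared with the prior art, the beneficial effects of the invention are as follows: the application uses encryption to ensure both security during data mining and the correctness of the results; it supports data storage outsourcing and can therefore run on larger datasets; it supports data computation outsourcing, offloading most of the computation to the cloud platform and, by drawing on the platform's computing power, substantially improving execution efficiency while preserving correctness; and it not only achieves secure computation under the semi-honest model but also, in the centroid recomputation phase, supports secure computation in which at most one of the three parties is malicious.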
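Brief Description of the Drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a schematic diagram of the system structure of the invention;
Fig. 3 shows the server and client time consumption of the traditional K-means clustering algorithm;
Fig. 4 shows the server and client time consumption of the invention;
Fig. 5 shows the server and client shares of time consumption for the traditional K-means clustering algorithm;
Fig. 6 shows the server and client shares of time consumption for the invention;
Fig. 7 shows the ratio of the time consumed by the invention to that consumed by the traditional K-means clustering algorithm.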
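Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments.
To address the performance problems of privacy-preserving data mining, the invention builds on an in-depth study of existing privacy-preserving data mining algorithms and proposes an efficient privacy-preserving K-means clustering algorithm on horizontally partitioned datasets. The scheme supports storage outsourcing and computation outsourcing with two data owners and a cloud platform. Data is stored in the cloud as ciphertext, and the cloud platform interacts with the two data owners to carry out K-means clustering on their joint dataset. The invention designs a separate secure protocol for each of the three technical challenges in privacy-preserving K-means clustering: a secure distance computation protocol for computing distances over ciphertext, a secure comparison protocol for comparing ciphertexts, and a secure circuit protocol for division over ciphertext. These protocols are then applied within the clustering framework to realize the privacy-preserving K-means clustering algorithm.
As shown in Fig. 1, the privacy-preserving K-means clustering method of the invention mainly comprises five steps, described in detail below:
Step S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server. In this example, data owner A is Alice, data owner B is Bob, and the server is C.
Alice and Bob encrypt their data $D_x$ and $D_y$ with their own public keys $pk_1$ and $pk_2$, yielding ciphertexts $C_x$ and $C_y$, and upload $C_x$ and $C_y$ to C. Every record in $D_x$ and $D_y$ is $l$-dimensional, so encrypting a database means encrypting every dimension of every record. All of Alice's and Bob's data is stored as ciphertext in the cloud file system, represented as follows: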
Figure PCTCN2017117943-appb-000002
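where $m$ is the number of records.
Step S2: Alice and Bob select k centroids, encrypt them with their respective public keys, and upload them to C.
In this example, centroid selection is a very important step, because it directly determines the number of iterations and thus the overall execution time of the system; good centroids speed up convergence and improve execution efficiency. Centroid selection has two parts. The first is choosing the number of centroids: Alice and Bob each pick a random value of k and k random centroids, then run one iteration on their own dataset. After classification, the distance from each data point to its assigned centroid is computed, and the sum of all these distances is S. When the values of S for k-1, k, and k+1 differ only slightly, k is taken as the number of centroids. Alice and Bob each find their own k, and the average of the two values is the final k. For the centroid values, Alice randomly selects k centroids $M = \{\mu_c \mid 1 \le c \le k\}$, where $\mu_c = \{u_{cj} \mid 1 \le j \le l\}$. Alice and Bob then encrypt the centroids with Alice's and Bob's public keys respectively and upload them to the cloud; the ciphertexts of the centroids are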
Figure PCTCN2017117943-appb-000003
Figure PCTCN2017117943-appb-000004
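Step S3: Server C computes the Euclidean distance from each ciphertext data point to each centroid through the secure distance computation protocol, and then classifies the data points according to the computed distances through the secure comparison protocol. Specifically:
C computes the ciphertext distance between each record and each centroid
Figure PCTCN2017117943-appb-000006
as well as between each record
Figure PCTCN2017117943-appb-000007
and each centroid
Figure PCTCN2017117943-appb-000008
C and Alice jointly run the SSED (secure distance computation) protocol to compute the ciphertext distance between each $x_i$ and $\mu_c$, denoted
Figure PCTCN2017117943-appb-000009
C and Bob jointly run the SSED protocol to compute the ciphertext distance between each $y_i$ and $\mu_c$, denoted
Figure PCTCN2017117943-appb-000010
All ciphertext distances between $x_i$ and $\mu_c$ are stored in
Figure PCTCN2017117943-appb-000011
and all ciphertext distances between $y_i$ and $\mu_c$ are stored in
Figure PCTCN2017117943-appb-000012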
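The homomorphic encryption used in this method is a partially homomorphic scheme that supports addition over ciphertexts, namely Paillier encryption. It is a probabilistic encryption scheme consisting of a 4-tuple, $Enc_{pa} = \{KeyGen, Encrypt, Decrypt, Evaluate\}$. Paillier encryption works as follows:
●KeyGen($1^k$)→(pk, sk):
(1) Select two large primes $p$ and $q$ such that $\gcd(pq, (p-1)(q-1)) = 1$;
(2) Compute $N = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$;
(3) Randomly select an integer $g \in \mathbb{Z}^*_{N^2}$;
(4) Find $\mu$ such that $\mu = (L(g^{\lambda} \bmod N^2))^{-1} \bmod N$, where $L$ is the function
$L(u) = (u-1)/N$. The public key is then $(N, g)$ and the private key is $(\lambda, \mu)$.
●Encrypt(x, r)→c:
For a plaintext $x$, choose a random number $r$; the ciphertext is $c = g^x r^N \bmod N^2$. Encryption is also written $E_{pk}(x) = c$.
●Decrypt(c)→x:
Decryption is $x = L(c^{\lambda} \bmod N^2) \cdot \mu \bmod N$. $D_{sk}(c)$ denotes Decrypt(c).
●Evaluate:
$E_{pk}(x) E_{pk}(y) = E_{pk}(x+y)$ and $E_{pk}(x)^y = E_{pk}(xy)$, where $x$ and $y$ are two plaintexts.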
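The four Paillier operations can be tried out with a toy implementation. The sketch below is for illustration only: it uses tiny primes, Python's non-cryptographic `random` module, and the common simplification g = N + 1 instead of a randomly chosen g; none of these choices come from the patent, and real deployments need large primes and a vetted library. The inverse via `pow(x, -1, n)` needs Python 3.8 or later.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two small primes (illustration only)."""
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # assumption: a standard valid choice of g
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)  # (L(g^lam mod N^2))^-1 mod N
    return (n, g), (lam, mu)


def encrypt(pk, x):
    """c = g^x * r^N mod N^2 with a fresh random r coprime to N."""
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, x % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    """x = L(c^lam mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


pk, sk = keygen(1009, 1013)
n_squared = pk[0] ** 2

# Evaluate: multiplying ciphertexts adds plaintexts.
sum_plain = decrypt(pk, sk, encrypt(pk, 20) * encrypt(pk, 22) % n_squared)
# Evaluate: raising a ciphertext to a constant multiplies the plaintext.
scaled_plain = decrypt(pk, sk, pow(encrypt(pk, 7), 6, n_squared))
```

Because encryption is probabilistic, two encryptions of the same value generally differ, yet both decrypt to the same plaintext.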
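The secure distance computation protocol in this example is built on a secure multiplication protocol, which proceeds as follows:
Figure PCTCN2017117943-appb-000014
where $Z_n$ is the space of positive integers; here it indicates that $r_x$ and $r_y$ are positive integers.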
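The protocol figure is not reproduced in this text, so the sketch below shows a widely used blinding-based secure multiplication over Paillier that matches the description of random values r_x and r_y; whether it matches the patent's protocol step for step is an assumption. Both roles (server C and the key owner) are simulated in one function, and the toy Paillier helpers with g = N + 1 are likewise illustrative.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier keys (small primes, g = N + 1; illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return (n, n + 1), (lam, mu)


def encrypt(pk, x):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, x % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def secure_multiply(pk, sk, cx, cy):
    """Return E(x*y) from E(x) and E(y) without the server seeing x or y."""
    n, _ = pk
    n2 = n * n
    # Server C: blind both inputs with random r_x, r_y.
    rx, ry = random.randrange(1, n), random.randrange(1, n)
    blinded_x = cx * encrypt(pk, rx) % n2  # E(x + r_x)
    blinded_y = cy * encrypt(pk, ry) % n2  # E(y + r_y)
    # Key owner: decrypt the blinded values, multiply, re-encrypt.
    h = decrypt(pk, sk, blinded_x) * decrypt(pk, sk, blinded_y) % n
    ch = encrypt(pk, h)  # E((x + r_x)(y + r_y))
    # Server C: strip the cross terms x*r_y, y*r_x and r_x*r_y homomorphically.
    result = ch * pow(cx, n - ry, n2) % n2
    result = result * pow(cy, n - rx, n2) % n2
    result = result * encrypt(pk, -rx * ry) % n2
    return result


pk, sk = keygen(1009, 1013)
product = decrypt(pk, sk, secure_multiply(pk, sk, encrypt(pk, 6), encrypt(pk, 7)))
```

The key owner only ever sees x + r_x and y + r_y, which are uniformly blinded modulo N, while the server only handles ciphertexts.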
The secure distance computation protocol in this example proceeds as follows:
Figure PCTCN2017117943-appb-000015
Figure PCTCN2017117943-appb-000016
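Since the protocol figures are not reproduced here, the following sketch shows one common way to obtain an encrypted squared Euclidean distance from a secure multiplication: form E(x_j − μ_j) homomorphically for every dimension, square it with the multiplication sub-protocol, and add the results. The structure, the function names, and the toy Paillier helpers are assumptions for illustration, and both parties are simulated in one process.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier keys (small primes, g = N + 1; illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return (n, n + 1), (lam, mu)


def encrypt(pk, x):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, x % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def secure_multiply(pk, sk, cx, cy):
    """Blinding-based secure multiplication: E(x), E(y) -> E(x*y)."""
    n, _ = pk
    n2 = n * n
    rx, ry = random.randrange(1, n), random.randrange(1, n)
    h = decrypt(pk, sk, cx * encrypt(pk, rx) % n2) * decrypt(pk, sk, cy * encrypt(pk, ry) % n2) % n
    result = encrypt(pk, h) * pow(cx, n - ry, n2) % n2
    result = result * pow(cy, n - rx, n2) % n2
    return result * encrypt(pk, -rx * ry) % n2


def secure_squared_distance(pk, sk, enc_point, enc_centroid):
    """Return E(sum_j (x_j - mu_j)^2) from component-wise ciphertexts."""
    n, _ = pk
    n2 = n * n
    total = encrypt(pk, 0)
    for cx, cm in zip(enc_point, enc_centroid):
        diff = cx * pow(cm, n - 1, n2) % n2  # E(x_j - mu_j)
        total = total * secure_multiply(pk, sk, diff, diff) % n2
    return total


pk, sk = keygen(1009, 1013)
enc_point = [encrypt(pk, v) for v in (1, 2, 3)]
enc_centroid = [encrypt(pk, v) for v in (4, 6, 3)]
distance = decrypt(pk, sk, secure_squared_distance(pk, sk, enc_point, enc_centroid))
```

Negative differences are represented modulo N, which is harmless here because each difference is squared before the sum is decrypted.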
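C then classifies all the data points. Specifically,
by comparing the distances in
Figure PCTCN2017117943-appb-000017
Figure PCTCN2017117943-appb-000018
it assigns $x_i$ and $y_i$ to the nearest class. C and Alice run the secure comparison protocol
Figure PCTCN2017117943-appb-000019
and C and Bob run
Figure PCTCN2017117943-appb-000020
All ciphertexts are then assigned to the corresponding classes
Figure PCTCN2017117943-appb-000021
Figure PCTCN2017117943-appb-000022
Each
Figure PCTCN2017117943-appb-000023
stores the data points of Alice (P1) assigned to class c, and each
Figure PCTCN2017117943-appb-000024
stores the data points of Bob assigned to class c, computed as:
Figure PCTCN2017117943-appb-000025
Figure PCTCN2017117943-appb-000026
The secure comparison protocol proceeds as follows: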
Figure PCTCN2017117943-appb-000027
Figure PCTCN2017117943-appb-000028
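Step S4: C, Alice, and Bob jointly recompute the k centroids through the secure circuit protocol. Because the two parties' data in $CL_1$ and $CL_2$ are encrypted under different public keys, the new centroids cannot be computed directly. In this example, C first sends $CL_1$ and $CL_2$ to Alice and Bob respectively for decryption, yielding $L_1$ and $L_2$, computed as: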
Figure PCTCN2017117943-appb-000029
Figure PCTCN2017117943-appb-000030
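C, Alice, and Bob then run the SC (secure circuit) protocol to compute
Figure PCTCN2017117943-appb-000031
where
Figure PCTCN2017117943-appb-000032
are the ciphertext data of Alice and Bob, respectively.
This yields one component $\mu_{cj}$ of the new centroid. The SC protocol ensures that Alice and Bob obtain all the new centroids.
The secure circuit protocol proceeds as follows: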
Figure PCTCN2017117943-appb-000033
Figure PCTCN2017117943-appb-000034
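The circuit itself is not reproduced in this text. As a plaintext reference for the quantity it is meant to produce, the sketch below computes each component of a new centroid from both owners' points in one class: the two component sums are added and divided by the combined count. In the patent this division is evaluated inside the three-party secure circuit so that no party sees the other's inputs; the function name and data layout here are assumptions.

```python
def joint_centroid(alice_points, bob_points):
    """Per-dimension mean over the union of both owners' points in one class."""
    count = len(alice_points) + len(bob_points)
    if count == 0:
        raise ValueError("class contains no data points")
    dims = len((alice_points or bob_points)[0])
    return [
        (sum(p[j] for p in alice_points) + sum(p[j] for p in bob_points)) / count
        for j in range(dims)
    ]


# Hypothetical decrypted class members L1 (Alice) and L2 (Bob).
alice_class = [(1.0, 2.0), (3.0, 4.0)]
bob_class = [(5.0, 6.0)]
new_centroid = joint_centroid(alice_class, bob_class)
```

Only the per-owner sums and counts are needed as inputs, which keeps the circuit small: one addition and one division per component.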
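Step S5: Alice uses the secure comparison protocol to compute the distance between the new centroids and the previous centroids. If it is below the threshold, Alice and Bob request C to send the classified data to Alice and Bob respectively. Otherwise, Alice and Bob encrypt the new centroids with their respective public keys and upload them to C for the next iteration.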
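The stopping rule in step S5 can be sketched in plaintext as follows. The text does not say whether the threshold applies to each centroid or to an aggregate, nor whether the distance is squared; the sketch assumes a per-centroid test on the squared Euclidean distance, and in the patent the comparison is carried out through the secure comparison protocol rather than on plaintext values.

```python
def has_converged(old_centroids, new_centroids, threshold):
    """True when every centroid moved by less than `threshold` (squared distance)."""
    return all(
        sum((a - b) ** 2 for a, b in zip(old, new)) < threshold
        for old, new in zip(old_centroids, new_centroids)
    )


previous = [[0.0, 0.0], [5.0, 5.0]]
small_move = [[0.1, 0.0], [5.0, 5.1]]
large_move = [[0.0, 0.0], [6.0, 5.0]]

stop_now = has_converged(previous, small_move, threshold=0.05)
keep_going = not has_converged(previous, large_move, threshold=0.05)
```

When the test fails, the new centroids are encrypted and uploaded again, and the loop returns to classification.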
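As shown in Fig. 2, the invention further provides a system implementing the above method. The system in this example comprises database C (the server), a first client P1 used by data owner A, and a second client P2 used by data owner B. The first client P1 and second client P2 are configured to: encrypt their respective data and upload the ciphertext to the server; each randomly select k centroids, encrypt them, and upload them to the server; after the server has classified the data, jointly recompute the k new centroids with the server; and evaluate the distance between the new centroids and the original centroids, ending classification and requesting the server to send the classified data to P1 and P2 respectively if the distance is below the threshold, and otherwise uploading the centroids again. The server is configured to receive the data uploaded by P1 and P2, compute the Euclidean distance from each data point to each centroid, classify the data points according to the computed distances, and then jointly recompute the k new centroids with P1 and P2.
In this example, server C is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system. This supports data storage outsourcing, so the method can run on larger datasets, and data computation outsourcing: most of the computation is offloaded to the cloud platform, whose computing power substantially improves execution efficiency while preserving correctness.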
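Analysis of the beneficial effects of the invention:
1. Comparison scheme selected
The framework used by the invention was first proposed in the paper "Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks"; in this comparison, that paper's method is referred to as the previous scheme. Clustering algorithms within the same framework are more comparable than those under other frameworks, so the invention is compared mainly with the previous scheme. To ensure a reliable comparison, both schemes were run in the same experimental environment. The evaluation criteria are introduced below, followed by a comparative analysis of the experimental results.
2. Evaluation criteria
The time cost of the method consists of three parts: client time, communication cost, and server time, where client and server time each include the initialization phase and the protocol execution phase. Because this application and the previous scheme use different methods, they can only be compared at a macro level. The comparison covers two aspects: theoretical complexity analysis (time, space, and communication complexity) and measured experimental results. Since the number of iterations affects the overall results, this example uses a single iteration as the basis and compares the following:
(1) The theoretical time, space, and communication complexity of the two schemes.
(2) The data encryption time of the two schemes.
(3) The server and client time consumption of the two schemes in one iteration.
3. Analysis of experimental results
In theory, the scheme of the invention has lower time, space, and communication complexity than the previous scheme. The experimental results of the two schemes are analyzed below based on the experimental data.
First, the encryption time of the two schemes is compared. The previous scheme uses two encryption methods: all plaintext data must be encrypted once with the improved Liu encryption scheme and once more with the Paillier scheme. In the scheme of the invention, all plaintext data needs only a single Paillier encryption, so in theory its encryption should be faster than that of the previous scheme. Moreover, Paillier operates over a group and involves many exponentiations, whereas the improved Liu scheme uses only linear operations, so most of the encryption time is due to Paillier. The encryption time of the invention is therefore slightly lower than that of the previous scheme, without an order-of-magnitude difference, and the experimental results confirm this conclusion. The encryption time of the previous scheme is shown in Table 1 and that of the invention in Table 2.
Table 1. Encryption time of the previous scheme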
Figure PCTCN2017117943-appb-000035
Table 2. Encryption time of the invention
Figure PCTCN2017117943-appb-000036
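Next, the time consumed in one iteration was measured and compared. In theory, the cloud platform introduced by the invention provides substantial computing power and should run somewhat more efficiently than the previous scheme. However, the cloud platform of the invention consists of 30 PCs and one server, and processing a task requires dividing work among the machines, scheduling tasks, and collecting the results, all of which take time. The more data points there are, the longer one iteration takes and the smaller the share of time spent on such coordination. In the secure circuit protocol, generating the circuit is time-consuming, but the circuit only needs to be generated once, in the first iteration. In theory, therefore, when the number of data points is small, one iteration of the previous scheme is more efficient than one iteration of the invention; once the number of data points exceeds a certain threshold, one iteration of the invention is more efficient, and its advantage grows as the data scale increases. The experimental results support this view and indicate that the threshold is about 5000 data points: above 7000 data points, one iteration of the invention takes less time, and below 5000 data points, one iteration of the previous scheme takes less time. Table 3 compares the time per iteration of the two schemes.
Table 3. Comparison of time consumed in one iteration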
Figure PCTCN2017117943-appb-000037
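In one iteration, the invention is concerned not only with the time consumed, but also with having server C take on more of the work in each iteration and account for a larger share of the time. That is, while keeping the time per iteration small, the ratio of server time to total iteration time should be as large as possible, which reduces the computation on the clients. As the data scale grows, such a scheme becomes increasingly efficient. The clients mainly perform encryption and decryption, and the number of client-side encryptions and decryptions is essentially the same in both schemes. However, the previous scheme uses the improved Liu encryption for ciphertext distance computation and ciphertext distance comparison, and all of its operations are linear, whereas the scheme of the invention uses the Paillier algorithm, whose encryption and decryption require exponentiation and modular arithmetic over a group. For clients with limited computing power, the improved Liu encryption should take less time than the Paillier encryption used in the invention. In theory, therefore, for datasets of the same size, the clients in the previous scheme spend less time than the clients in the scheme of the invention. As the data scale grows, one iteration of the invention takes relatively less time, while its clients take relatively more. Hence, as the data scale grows, the clients' share of the time in the invention becomes relatively larger and the server's share relatively smaller. The collected experimental data confirmed this expectation. Tables 4 and 5 show the time consumed by each party in one iteration of the two schemes. Fig. 3 shows the server and client time for the previous scheme, and Fig. 4 shows the server and client time for the invention.
Table 4. Time consumed by each party in one iteration of the previous scheme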
Figure PCTCN2017117943-appb-000038
Table 5. Time consumed by each party in one iteration of the present application
Figure PCTCN2017117943-appb-000039
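Figs. 3 and 4 show how server and client time in the two schemes trend as the number of data points grows. In the previous scheme, server time rises markedly as the data scale grows, while client time rises only slightly. This is mainly because the server's computing power is limited and the computation on the data is relatively complex. As the data scale grows, the server needs more and more time to process the data, so its time rises markedly and its share of the total increases. Although the clients also have more data to process, their operations are mostly linear compared with the server's, so the added time is not significant and the clients' share decreases. In the invention, the server side runs on a cloud platform consisting of 30 PCs and one server, so its computing power is assured. Fig. 4 shows that server time increases with the data scale but without a marked upward trend, whereas client time grows increasingly large, mainly because client-side decryption is exponentiation over a group, which is more computationally expensive than linear operations. Therefore, as the data scale grows, the server's share of the time in the invention decreases somewhat and the clients' share increases somewhat. Fig. 5 shows the server and client time shares for the previous scheme, and Fig. 6 shows them for the invention.
Finally, the invention reports experimentally the time taken to process data in one iteration by the privacy-preserving K-means clustering algorithm and by the classic K-means algorithm. The time cost introduced by encryption is evidently considerable. However, as the data scale grows, the ratio of the invention's per-iteration time to that of classic K-means becomes smaller and smaller. Table 6 shows the time taken by the invention and by the classic K-means algorithm in one iteration, and Fig. 7 shows the ratio.
Table 6. Time taken by the invention and by the classic K-means algorithm in one iteration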
Figure PCTCN2017117943-appb-000040
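The invention uses K-means, a typical data mining algorithm, performs mining on the horizontally partitioned joint dataset of two parties, and supports both storage outsourcing and computation outsourcing on a cloud platform. Its main beneficial effects are as follows:
(1) An analysis of the state of the art in privacy-preserving data mining, in China and abroad, clarifies the strengths and weaknesses of the commonly used techniques. Schemes based on data perturbation are efficient, but because they corrupt the original dataset they inevitably affect the mining results to some degree, whereas encryption-based schemes preserve the correctness of the mining results. The invention uses encryption and thereby ensures the correctness of the data mining results;
(2) The scheme supports data storage outsourcing. A cloud platform has more storage capacity than an ordinary PC, which allows the scheme to run on larger datasets;
(3) The scheme supports data computation outsourcing. A cloud platform is a distributed computing framework that can pool many resources into a cluster, greatly increasing the system's computing power. The scheme offloads most of the computation to the cloud platform and, by drawing on the platform's computing power, substantially improves execution efficiency while preserving correctness;
(4) The algorithm's time complexity, space complexity, communication complexity, and security are analyzed theoretically, and its correctness and efficiency are verified experimentally. The proposed privacy-preserving K-means clustering algorithm not only achieves secure computation under the semi-honest model but also, in the centroid recomputation phase, supports secure computation in which at most one of the three parties is malicious.
The specific embodiments described above are preferred embodiments of the invention and do not limit its scope of implementation. The scope of the invention includes but is not limited to these embodiments, and all equivalent changes made in accordance with the invention fall within its scope of protection.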

Claims (10)

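  1. A privacy-preserving K-means clustering method, characterized by comprising the following steps:
    S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server;
    S2: Data owners A and B each randomly select k centroids, encrypt them, and upload them to the server;
    S3: The server computes the Euclidean distance from each ciphertext data point to each centroid through the secure distance computation protocol, and classifies the data points according to the computed Euclidean distances through the secure comparison protocol;
    S4: The server and data owners A and B jointly recompute the k new centroids through the secure circuit protocol;
    S5: Data owner A or B uses the secure comparison protocol to evaluate, over the ciphertext data, the distance between the new centroids and the original centroids; if it is below the threshold, classification ends and data owners A and B request the server to send the classified data to A and B respectively; otherwise, the method returns to step S2 for the next iteration.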
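  2. The privacy-preserving K-means clustering method according to claim 1, characterized in that: in step S1, the server is a cloud server, and the cloud server stores the encrypted data uploaded by data owners A and B in a cloud file system.
  3. The privacy-preserving K-means clustering method according to claim 2, characterized in that: in step S2, selecting the centroids includes selecting both their number and their values, and specifically comprises the following steps:
    S21: Data owners A and B each randomly select k centroids;
    S22: Each iterates on its own dataset using the traditional K-means clustering algorithm and classifies the data points;
    S23: Each computes the distance from every data point to its assigned centroid, and the sum S of these distances over all data points;
    S24: When the sums S corresponding to k-1, k, and k+1 centroids differ only slightly, k is taken as the number of centroids;
    S25: Data owners A and B each compute the average of their respective centroid values; this average gives the values of the k centroids.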
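  4. The privacy-preserving K-means clustering method according to claim 3, characterized in that the computation in step S3 comprises the following steps:
    S31: The server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroids uploaded by A, and between each ciphertext record of data owner B and the ciphertext centroids uploaded by B;
    S32: The server and data owner A jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of A and each centroid; the server and data owner B jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of B and each centroid;
    S33: Based on the ciphertext distance sets obtained in step S32, the server assigns the data of data owners A and B to the nearest class, and stores the two owners' data separately within each class.
  5. The privacy-preserving K-means clustering method according to claim 4, characterized in that the processing in step S4 comprises the following steps:
    S41: The server sends the data points stored separately within the same class to the corresponding data owners A and B;
    S42: Data owners A and B decrypt them;
    S43: The server and data owners A and B compute the new centroid for that class using the secure circuit protocol.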
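  6. A system implementing the privacy-preserving K-means clustering method according to any one of claims 1-5, characterized by comprising a database, a first client used by data owner A, and a second client used by data owner B, wherein the first and second clients are configured to: encrypt their respective data and upload the ciphertext to the server; each randomly select k centroids, encrypt them, and upload them to the server; after the server has classified the data, jointly recompute the k new centroids with the server; and evaluate the distance between the new centroids and the original centroids, ending classification and requesting the server to send the classified data to the first and second clients respectively if the distance is below the threshold, and otherwise uploading the centroids again; and the server is configured to receive the data uploaded by the first and second clients, compute the Euclidean distance from each data point to each centroid, classify the data points according to the computed distances, and then jointly recompute the k new centroids with the first and second clients.
  7. The system according to claim 6, characterized in that: the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.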
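  8. The system according to claim 7, characterized in that: centroid selection at the first and second clients includes selecting both the number and the values of the centroids, and specifically comprises the following modules:
    Centroid selection module: for randomly selecting k centroids;
    Classification module: for iterating on the client's own dataset using the traditional K-means clustering algorithm and classifying the data points;
    Secure distance computation module: for computing, through the secure distance computation protocol, the distance from each data point to its assigned centroid, and the sum S of these distances over all data points;
    Centroid count selection module: for determining that k is the number of centroids when the sums S corresponding to k-1, k, and k+1 centroids differ only slightly;
    Centroid value selection module: for computing the average of the respective centroid values, the average giving the values of the k centroids.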
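  9. The system according to claim 8, characterized in that the server comprises:
    First ciphertext distance computation module: for computing the ciphertext distance between each ciphertext record of the first client and the ciphertext centroids it uploaded, and between each ciphertext record of data owner B and the ciphertext centroids it uploaded;
    Second ciphertext distance computation module: for jointly computing, with the first client, the ciphertext distance between each data point of the first client and each centroid, and for jointly computing, with the second client, the ciphertext distance between each data point of the second client and each centroid;
    Classification module: for assigning the data of the first and second clients to the nearest class, based on the ciphertext distance sets produced by the second ciphertext distance computation module, and storing the two clients' data separately within each class.
  10. The system according to claim 9, characterized in that the server further comprises a sending module: for sending the data points stored separately within the same class to the corresponding first and second clients;
    Secure centroid computation module: for computing the new centroid within each class together with the first and second clients through the secure circuit protocol.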
PCT/CN2017/117943 2017-04-07 2017-12-22 Privacy-preserving K-means clustering method and system WO2018184407A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2017102242757 2017-04-07
CN201710224275.7A CN107145791B (zh) 2017-04-07 2017-04-07 Privacy-preserving K-means clustering method and system

Publications (1)

Publication Number Publication Date
WO2018184407A1 true WO2018184407A1 (zh) 2018-10-11

Family

ID=59775048

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/117943 WO2018184407A1 (zh) 2017-04-07 2017-12-22 Privacy-preserving K-means clustering method and system

Country Status (2)

Country Link
CN (1) CN107145791B (zh)
WO (1) WO2018184407A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
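CN110610196A (zh) * 2019-08-14 2019-12-24 平安科技(深圳)有限公司 Desensitization method, system, computer device, and computer-readable storage medium
CN114154554A (zh) * 2021-10-28 2022-03-08 上海海洋大学 Privacy-preserving outsourced-data kNN algorithm based on non-colluding dual cloud servers
CN117633881A (zh) * 2023-11-27 2024-03-01 国能神皖合肥发电有限责任公司 Electric power data optimization processing method
CN117688502A (zh) * 2024-02-04 2024-03-12 山东大学 Secure outsourced computation method and system for local outlier factor detection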

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
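CN107145791B (zh) * 2017-04-07 2020-07-10 哈尔滨工业大学深圳研究生院 Privacy-preserving K-means clustering method and system
CN107707494B (zh) * 2017-10-10 2020-02-11 苏州大学 Optical fiber nonlinear equalization method for 64-QAM coherent optical communication systems
CN107784663B (zh) * 2017-11-14 2020-10-20 哈尔滨工业大学深圳研究生院 Correlation filter tracking method and device based on depth information
CN109214205B (zh) * 2018-08-01 2021-07-02 安徽师范大学 k-anonymity-based location and data privacy protection method in crowd sensing
CN109615021B (zh) * 2018-12-20 2022-09-27 暨南大学 Private information protection method based on k-means clustering
CN110162999B (zh) * 2019-05-08 2022-06-07 湖北工业大学 Privacy-preserving method for measuring the Gini coefficient of income distribution gaps
CN110163292A (zh) * 2019-05-28 2019-08-23 电子科技大学 Privacy-preserving k-means clustering method based on vector homomorphic encryption
US11663521B2 (en) * 2019-11-06 2023-05-30 Visa International Service Association Two-server privacy-preserving clustering
CN111444545B (zh) * 2020-06-12 2020-09-04 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
CN112487481B (zh) * 2020-12-09 2022-06-10 重庆邮电大学 Verifiable multi-party k-means federated learning method with privacy protection
CN112508203B (zh) * 2021-02-08 2021-06-15 同盾控股有限公司 Federated-learning-based data clustering processing method, apparatus, device, and medium
CN113033915B (zh) * 2021-04-16 2021-12-31 哈尔滨理工大学 Method and device for comparing the shortest distance between a carpooling user terminal and a driver terminal
CN113438254B (zh) * 2021-08-24 2021-11-05 北京金睛云华科技有限公司 Distributed classification method and system for ciphertext data in a cloud environment
CN116801380B (zh) * 2023-03-23 2024-05-28 昆明理工大学 UWB indoor positioning method based on improved full centroid-Taylor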

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102970143A (zh) * 2012-12-13 2013-03-13 中国科学技术大学苏州研究院 Method for securely computing the exponent of the sum of numbers held by two parties using additive homomorphic encryption
US20140258295A1 (en) * 2013-03-08 2014-09-11 Microsoft Corporation Approximate K-Means via Cluster Closures
CN107145791A (zh) * 2017-04-07 2017-09-08 哈尔滨工业大学深圳研究生院 Privacy-preserving K-means clustering method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138923B (zh) * 2015-08-11 2019-01-08 苏州大学 Privacy-preserving time series similarity computation method
CN105760780B (zh) * 2016-02-29 2018-06-08 福建师范大学 Road-network-based trajectory data privacy protection method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GU, FENFEI ET AL.: "A Secure and Efficient Data Aggregation Algorithm Based on K-means Clustering", NATURAL SCIENCE JOURNAL OF HARBIN NORMAL UNIVERSITY, vol. 32, no. 5, 31 May 2015 (2015-05-31), pages 20 - 24 *
LIU, XIAOYAN ET AL.: "Outsourcing Two-party Privacy Preserving K-means Clustering Protocol in Wireless Sensor Networks", IEEE COMPUTER SOCIETY, 31 December 2015 (2015-12-31), pages 124 - 133, XP032875254 *
XUE, ANRONG ET AL.: "Fast Privacy-preserving Clustering Algorithm", SYSTEMS ENGINEERING AND ELECTRONICS, vol. 31, no. 10, 30 October 2009 (2009-10-30), pages 2521 - 2526, ISSN: 1001-506X *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
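CN110610196A (zh) * 2019-08-14 2019-12-24 平安科技(深圳)有限公司 Desensitization method, system, computer device, and computer-readable storage medium
CN114154554A (zh) * 2021-10-28 2022-03-08 上海海洋大学 Privacy-preserving outsourced-data kNN algorithm based on non-colluding dual cloud servers
CN117633881A (zh) * 2023-11-27 2024-03-01 国能神皖合肥发电有限责任公司 Electric power data optimization processing method
CN117688502A (zh) * 2024-02-04 2024-03-12 山东大学 Secure outsourced computation method and system for local outlier factor detection
CN117688502B (zh) * 2024-02-04 2024-04-30 山东大学 Secure outsourced computation method and system for local outlier factor detection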

Also Published As

Publication number Publication date
CN107145791B (zh) 2020-07-10
CN107145791A (zh) 2017-09-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17904796

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17904796

Country of ref document: EP

Kind code of ref document: A1