CN107145791B

CN107145791B - K-means clustering method and system with privacy protection function

Info

Publication number: CN107145791B
Application number: CN201710224275.7A
Authority: CN
Inventors: 王轩; 蒋琳; 李晔; 姚霖; 刘泽超; 靳亚宾; 梁玉冬; 刘猛; 漆舒汉
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2017-04-07
Filing date: 2017-04-07
Publication date: 2020-07-10
Anticipated expiration: 2037-04-07
Also published as: CN107145791A; WO2018184407A1

Abstract

The invention provides a K-means clustering method and system with privacy protection, and belongs to the technical field of data mining. The method comprises the following steps: data owners A and B encrypt their respective data and randomly selected centroid points and upload them to a server; the server calculates, on the ciphertext data, the Euclidean distance from each data point to each centroid point through a secure multiplication protocol and a secure distance computation protocol, and classifies the data points; the server, data owner A and data owner B jointly recalculate the new centroid points on the ciphertext data through a secure circuit protocol; and data owner A or B evaluates the distance between the new centroid points and the original centroid points through a secure comparison protocol. If the distance is smaller than a threshold, classification is finished and data owners A and B request the server to send each of them its classified data; otherwise, data owners A and B upload the new centroid points and the next iteration is performed. The invention ensures the correctness of the data mining result while protecting the privacy and security of the data; it supports outsourcing of both data storage and data computation, greatly improving execution efficiency while preserving correctness; and it supports secure computation in which at most one of the three participants is malicious.

Description

K-means clustering method and system with privacy protection function

Technical Field

The invention relates to the technical field of data mining, in particular to a K-means clustering method with privacy protection and a system for realizing the method.

Background

It is well known that K-means clustering is one of the very classical and common methods in data mining, which can cluster similar data items together by calculating the distance between the data items. With the acceleration of informatization, digitization and networking processes, economic globalization becomes an irreversible trend, data sources in a clustering algorithm are more and more diversified, and data security is more and more important. Given that data may come from multiple parties, which may contain sensitive or private information about the parties, privacy of the data may not be guaranteed if the information is shared among the multiple parties. The joint data mining with privacy protection can be used for mining data of joint databases of multiple participants while protecting the privacy of user data and mining results, and further extracting useful information. Therefore, how to design a joint data mining algorithm with privacy protection becomes a difficult problem to be solved.

The semi-honest model, in which privacy of data is guaranteed by the various parties following the protocol all the time, is in many cases realistic. However, to ensure the privacy of the data, the solution under this model is generally not feasible in practice because the computational and communication consumption is high.

The traditional K-means clustering algorithm is a classical clustering algorithm based on Euclidean distance. It is mainly divided into three steps: selecting the centroid points, classifying the data points, and recalculating the new centroid points. Assume the training samples are $\{x_i \in R^l \mid 1 \le i \le m\}$, where $m$ is the number of samples. First, k centroid points are randomly selected, expressed as $M = \{\mu_c \in R^l \mid 1 \le c \le k\}$. Then the distance from each data point $x_i$ to each centroid point $\mu_c$ is calculated, and $x_i$ is assigned to the class of the closest centroid point: $c_i := \arg\min_c \|x_i - \mu_c\|^2$. Finally, each centroid point $\mu_c$ is recalculated as the mean of the data points assigned to its class $C_c$: $\mu_c = \frac{1}{|C_c|} \sum_{x_i \in C_c} x_i$.

Therefore, the traditional K-means clustering algorithm mainly comprises three steps: selecting the centroid points, classifying the data points, and recalculating the centroid points. In the classification step, the Euclidean distance between a data point and each centroid point is calculated first, and the data point is then assigned to the closest centroid point. The square of the Euclidean distance is used because it makes two values easier to compare without changing their order. In the centroid recalculation step, the component-wise sum of the data points in each class must be calculated; since these data points may come from different participants, this calculation can expose private data. In summary, the traditional K-means clustering algorithm may leak private information during its calculation.
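The three steps of the traditional algorithm can be illustrated with a minimal plaintext sketch. It is not part of the patented method and has no privacy protection; the function names, the `threshold` and `max_iter` parameters, and the choice to leave an empty class's centroid unchanged are assumptions made for illustration.

```python
import random

def squared_distance(x, mu):
    """Squared Euclidean distance ||x - mu||^2 between two l-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def assign(points, centroids):
    """Classification step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(x, centroids[c]))
            for x in points]

def recompute(points, labels, centroids):
    """Recalculation step: component-wise mean of the points in each class."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == c]
        if not members:                      # assumption: keep an empty class as is
            new_centroids.append(list(old))
            continue
        new_centroids.append([sum(x[j] for x in members) / len(members)
                              for j in range(len(old))])
    return new_centroids

def kmeans(points, k, threshold=1e-9, max_iter=100, seed=0):
    """Iterate until no centroid moves by more than the threshold."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # selection step
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = recompute(points, labels, centroids)
        shift = max(squared_distance(a, b)
                    for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, labels
```

The recalculation step is where the privacy problem arises: `recompute` needs the component sums of points that, in the joint setting, belong to different participants.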

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a K-means clustering method with privacy protection and a system for realizing the method.

The method comprises the following steps:

S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server;

S2: Data owners A and B each randomly select k centroid points, encrypt them, and upload them to the server;

S3: The server calculates the Euclidean distance from each ciphertext data point to each centroid point through a secure distance computation protocol, and classifies the data points according to the calculated Euclidean distances through a secure comparison protocol;

S4: The server, data owner A and data owner B recalculate the k new centroid points through a secure circuit protocol;

S5: Data owner A or B evaluates, on the ciphertext data, the distance between the new centroid points and the original centroid points through the secure comparison protocol. If the distance is smaller than a threshold, classification is finished and data owners A and B request the server to send each of them its classified data; otherwise, data owners A and B return to step S2 for the next iteration.

In a further improvement of the invention, in step S1 the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.

In a further improvement of the invention, the selection of the centroid points in step S2 covers both the number of centroid points and their values, and specifically includes the following steps:

S21: Data owner A and data owner B each randomly select k centroid points;

S22: Each iterates on its own data set according to the traditional K-means clustering algorithm and classifies the data;

S23: The distance from each data point to its corresponding centroid point is calculated, and the sum S of the distances of all data points is calculated;

S24: When the sums S corresponding to k-1, k and k+1 centroid points do not differ greatly, k is taken as the number of centroid points;

S25: Data owners A and B calculate the average of the values of their respective centroid points; this average is the value of the k centroid points.
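Steps S24 and S25 can be sketched as follows. The text does not quantify "do not differ greatly", so the relative tolerance `rel_tol` is an assumption, as are the function names and the pairing of the two owners' centroid points by index in `average_centroids`.

```python
def choose_k(sum_by_k, rel_tol=0.1):
    """Step S24: smallest k whose sums S(k-1), S(k), S(k+1) are close together.

    sum_by_k maps a candidate k to the distance sum S obtained with k centroids.
    rel_tol is a hypothetical tolerance for "does not change greatly".
    """
    for k in sorted(sum_by_k):
        if k - 1 not in sum_by_k or k + 1 not in sum_by_k:
            continue
        s_prev, s_k, s_next = sum_by_k[k - 1], sum_by_k[k], sum_by_k[k + 1]
        scale = max(abs(s_prev), abs(s_k), abs(s_next), 1e-12)
        if (abs(s_prev - s_k) / scale <= rel_tol
                and abs(s_k - s_next) / scale <= rel_tol):
            return k
    return None   # no candidate satisfies the criterion

def average_centroids(centroids_a, centroids_b):
    """Step S25: component-wise average of the two owners' centroid values."""
    return [[(a + b) / 2 for a, b in zip(mu_a, mu_b)]
            for mu_a, mu_b in zip(centroids_a, centroids_b)]
```

For example, with sums `{1: 100, 2: 40, 3: 12, 4: 11.5, 5: 11.2}` the criterion is first met at k = 4, because S changes by less than 10% between k = 3, 4 and 5.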

In a further improvement of the present invention, the calculating method of step S3 includes the following steps:

S31: The server calculates the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroid points uploaded by data owner A, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroid points uploaded by data owner B;

S32: The server and data owner A jointly calculate the ciphertext distance between each data point of data owner A and the centroid points through the secure distance computation protocol; the server and data owner B jointly calculate the ciphertext distance between each data point of data owner B and the centroid points through the secure distance computation protocol;

S33: The server assigns the data of data owners A and B to the closest class according to the set of ciphertext distances obtained in step S32, and stores the data of each class separately.

In a further improvement of the present invention, the processing method of step S4 includes the following steps:

S41: The server sends the data points stored separately for each class to the corresponding data owners A and B;

S42: Data owners A and B decrypt the data;

S43: The server and data owners A and B calculate the new centroid point of each class through the secure circuit protocol.

The invention also provides a system for implementing the method, which comprises a server, a first client used by data owner A and a second client used by data owner B. The first client and the second client are used for encrypting their respective data and uploading the ciphertext to the server; for each randomly selecting k centroid points, encrypting them and uploading them to the server; for recalculating the k new centroid points jointly with the server after the server has classified the data; and for evaluating the distance between the new centroid points and the original centroid points. If the distance is smaller than a threshold, classification is finished and the clients request the server to send the classified data to the first client and the second client respectively; otherwise the centroid points are uploaded again. The server is used for receiving the data uploaded by the first client and the second client, calculating the Euclidean distance from each data point to each centroid point, classifying the data points according to the calculated Euclidean distances, and then recalculating the k new centroid points jointly with the first client and the second client.

In a further improvement of the invention, the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.

In a further improvement of the invention, the selection of the centroid points by the first client and the second client covers both the number of centroid points and their values, and specifically involves the following modules:

a centroid point selection module: used for randomly selecting k centroid points;

a classification module: used for iterating on the client's own data set according to the traditional K-means clustering algorithm and classifying the data;

a secure distance computation module: used for calculating the distance from each data point to its corresponding centroid point through the secure distance computation protocol, and for calculating the sum S of the distances of all data points;

a centroid point number selection module: used for determining that k is the number of centroid points when the sums S corresponding to k-1, k and k+1 centroid points do not differ greatly;

a centroid point value selection module: used for calculating the average of the respective centroid point values, which is the value of the k centroid points.

In a further refinement of the present invention, the server comprises:

a first ciphertext distance calculation module: used for calculating the ciphertext distance between each ciphertext record of the first client and the ciphertext centroid points uploaded by the first client, and the ciphertext distance between each ciphertext record of the second client and the ciphertext centroid points uploaded by the second client;

a second ciphertext distance calculation module: used for calculating, jointly with the first client, the ciphertext distance between each data point of the first client and the centroid points, and for calculating, jointly with the second client, the ciphertext distance between each data point of the second client and the centroid points;

a classification module: used for assigning the data of the first client and the second client to the closest classes according to the set of ciphertext distances calculated by the second ciphertext distance calculation module, and for storing the data of each class separately.

In a further improvement of the invention, the server further comprises: a sending module, used for sending the data points stored separately for each class to the corresponding first client and second client; and a secure centroid point calculation module, used for calculating, jointly with the first client and the second client, the new centroid point of each class through the secure circuit protocol.

Compared with the prior art, the invention has the following beneficial effects: by using encryption, it protects security throughout the data mining process while guaranteeing the correctness of the result; it supports outsourcing of data storage, so it can be executed on larger data sets; it supports outsourcing of data computation, handing most of the computation to a cloud platform whose strong computing capability greatly improves execution efficiency while preserving correctness; and it not only achieves secure computation under the semi-honest model but also, in the centroid recalculation stage, supports secure computation in which at most one of the three parties is malicious.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of the system of the present invention;

FIG. 3 is a diagram illustrating the time consumption of a server and a client in a conventional K-means clustering algorithm;

FIG. 4 is a diagram illustrating server and client elapsed time in accordance with the present invention;

FIG. 5 is a diagram illustrating the consumption time ratio of a server and a client in a conventional K-means clustering algorithm;

FIG. 6 is a diagram illustrating the consumption time ratio of the server and the client according to the present invention;

FIG. 7 is a diagram illustrating the ratio of the time consumption of the K-means clustering algorithm of the present invention to that of the conventional K-means clustering algorithm.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples.

To address the performance problems of privacy-preserving data mining, the invention builds on an in-depth study of existing privacy-preserving data mining algorithms and provides an efficient privacy-preserving K-means clustering algorithm on horizontally partitioned data sets. Data are stored in the cloud in ciphertext form, and the cloud platform interacts with two data owners to carry out K-means clustering on the joint data set of the two data owners. The invention designs a separate security protocol for each of three technical problems in privacy-preserving K-means clustering: a secure distance computation protocol for ciphertext distance calculation, a secure comparison protocol for ciphertext comparison, and a secure circuit protocol for ciphertext division. These security protocols are applied within a clustering algorithm framework to realize a K-means clustering algorithm with privacy protection.

As shown in fig. 1, the K-means clustering method with privacy protection of the present invention mainly includes 5 steps, which are explained in detail below:

Step S1: Data owners A and B encrypt their respective data and then upload the ciphertext to the server. In this example, data owner A is Alice, data owner B is Bob, and the server is C.

Alice and Bob use their public keys $pk_1$ and $pk_2$ respectively to encrypt their data $D_x$ and $D_y$. The resulting ciphertexts are $C_x$ and $C_y$, which are then uploaded to C. Each record in $D_x$ and $D_y$ is l-dimensional, so encrypting the database means encrypting every dimension of every record. All data of Alice and Bob can thus be stored in the cloud file system in ciphertext form. The specific representation is as follows:

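$C_x = \{E_{pk_1}(x_{ij}) \mid 1 \le i \le m,\ 1 \le j \le l\}$, $C_y = \{E_{pk_2}(y_{ij}) \mid 1 \le i \le m,\ 1 \le j \le l\}$

where m is the number of records.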

Step S2: Alice and Bob select k centroid points, encrypt them with their respective public keys, and upload them to C.

In this example, the selection of the centroid points is a very important step: it directly determines the number of iterations and hence the execution time of the whole system, so a good choice of centroid points improves both the convergence rate and the execution efficiency. The selection is divided into two parts. The first is the selection of the number of centroid points. Alice and Bob each pick a random value k and k random centroid points, and then perform one iteration on their own data sets. After classification, the distance from each data point to its corresponding centroid point is calculated, and the sum of all these distances is S. When the S corresponding to k-1, k and k+1 does not change greatly, k is taken as the number of centroid points. Alice and Bob each find their own k in this way, and the final k is the average of the two values. The second is the selection of the centroid values. Alice randomly selects k centroid points $M = \{\mu_c \mid 1 \le c \le k\}$, where $\mu_c = \{\mu_{cj} \mid 1 \le j \le l\}$. The centroid points are encrypted with the public keys of Alice and Bob respectively and uploaded to the cloud; the ciphertexts of the centroid points are $E_{pk_1}(\mu_c)$ and $E_{pk_2}(\mu_c)$.

Step S3: The server C calculates the Euclidean distance from each ciphertext data point to each centroid point through the secure distance computation protocol, and then classifies the data points according to the calculated Euclidean distances through the secure comparison protocol. Specifically:

C calculates the ciphertext distance between each ciphertext record of Alice and each ciphertext centroid point encrypted under $pk_1$, and between each ciphertext record of Bob and each ciphertext centroid point encrypted under $pk_2$. C and Alice jointly run the SSED (secure distance computation) protocol to compute the ciphertext distance between each $x_i$ and $\mu_c$; C and Bob jointly run the SSED protocol to compute the ciphertext distance between each $y_i$ and $\mu_c$. All ciphertext distances between $x_i$ and $\mu_c$ are stored in one set, and all ciphertext distances between $y_i$ and $\mu_c$ are stored in another.

The homomorphic encryption used in the method is a partially homomorphic encryption that supports addition on ciphertexts, namely Paillier encryption. It is a probabilistic encryption scheme described by the 4-tuple $Enc_{pa} = \{KeyGen, Encrypt, Decrypt, Evaluate\}$. The procedure for Paillier encryption is as follows:

●KeyGen($1^k$)→(pk, sk):

(1) Select two large prime numbers p and q satisfying $\gcd(pq, (p-1)(q-1)) = 1$;

(2) Calculate $N = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$;

(3) Randomly select an integer $g \in Z^*_{N^2}$;

(4) Find $\mu$ satisfying $\mu = (L(g^\lambda \bmod N^2))^{-1} \bmod N$, where L is the function $L(u) = (u-1)/N$. The resulting public key is $(N, g)$ and the private key is $(\lambda, \mu)$.

●Encrypt(x, r)→c:

Let x be the plaintext. A random number r is selected, and the ciphertext is $c = g^x r^N \bmod N^2$. Encryption is also denoted $E_{pk}(x) = c$.

●Decrypt(c)→x:

The decryption is $x = L(c^\lambda \bmod N^2) \cdot \mu \bmod N$. $D_{sk}(c)$ denotes Decrypt(c).

●Evaluate:

$E_{pk}(x) E_{pk}(y) = E_{pk}(x+y)$ and $E_{pk}(x)^y = E_{pk}(xy)$, where x and y are two plaintexts.
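The four Paillier operations can be tried out with a small sketch. The primes 1009 and 1013 in the usage note are toy values chosen only so the example runs instantly and offer no security; the function names are assumptions, and `pow(u, -1, n)` requires Python 3.8 or later.

```python
import math
import random

def L(u, n):
    """The function L(u) = (u - 1) / N used in key generation and decryption."""
    return (u - 1) // n

def keygen(p, q):
    """KeyGen: return the public key (N, g) and the private key (lambda, mu)."""
    n = p * q
    if math.gcd(n, (p - 1) * (q - 1)) != 1:
        raise ValueError("need gcd(pq, (p-1)(q-1)) = 1")
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
    n2 = n * n
    while True:
        g = random.randrange(2, n2)
        if math.gcd(g, n2) != 1:
            continue
        u = L(pow(g, lam, n2), n)
        if math.gcd(u, n) == 1:          # mu exists only if u is invertible mod N
            return (n, g), (lam, pow(u, -1, n))

def encrypt(pk, x):
    """Encrypt: c = g^x * r^N mod N^2 for a fresh random r coprime to N."""
    n, g = pk
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(g, x, n2) * pow(r, n, n2) % n2

def decrypt(pk, sk, c):
    """Decrypt: x = L(c^lambda mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n
```

With `pk, sk = keygen(1009, 1013)` and `n2 = pk[0] ** 2`, the Evaluate properties correspond to `decrypt(pk, sk, encrypt(pk, x) * encrypt(pk, y) % n2)` giving x + y and `decrypt(pk, sk, pow(encrypt(pk, x), y, n2))` giving x * y, as long as the results stay below N.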

The secure distance computation protocol of this embodiment is built on a secure multiplication protocol. The specific processing procedure of the secure multiplication protocol is as follows:

where $Z_N$ is the positive integer space, so $r_x$ and $r_y$ here are positive integers.

The specific processing procedure of the secure distance computation protocol in this example is as follows:
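The protocol listings are not reproduced in this text. As an illustration only, the sketch below assumes a blinding-based secure multiplication that is common in the literature: the server adds random values $r_x$ and $r_y$ to the operands homomorphically, the key holder multiplies the blinded plaintexts, and the server removes the cross terms. The squared distance is then a homomorphic sum of such products. All function names are hypothetical, the Paillier variant uses the simplification g = N + 1, and the primes in the usage note are toy values with no security.

```python
import math
import random

# Toy Paillier with the common simplification g = N + 1 (assumption).
def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))          # public N, private (lambda, mu)

def enc(n, x):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (x % n) * n) * pow(r, n, n * n) % (n * n)

def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def owner_multiply(n, sk, bx, by):
    """Key holder: decrypt two blinded values, multiply them, re-encrypt."""
    return enc(n, dec(n, sk, bx) * dec(n, sk, by) % n)

def secure_multiply(n, ex, ey, ask_owner):
    """Server: obtain E(x*y) from E(x) and E(y) without seeing x or y."""
    n2 = n * n
    rx, ry = random.randrange(1, n), random.randrange(1, n)
    # The owner sees only x + rx and y + ry, and returns E((x+rx)(y+ry)).
    eh = ask_owner(ex * enc(n, rx) % n2, ey * enc(n, ry) % n2)
    eh = eh * pow(ex, n - ry, n2) % n2        # remove x * ry
    eh = eh * pow(ey, n - rx, n2) % n2        # remove y * rx
    return eh * enc(n, -rx * ry) % n2         # remove rx * ry

def secure_squared_distance(n, ex_vec, emu_vec, ask_owner):
    """Server: E(||x - mu||^2) from component-wise ciphertexts of x and mu."""
    n2 = n * n
    total = enc(n, 0)
    for ex, emu in zip(ex_vec, emu_vec):
        diff = ex * pow(emu, n - 1, n2) % n2  # E(x_j - mu_j)
        total = total * secure_multiply(n, diff, diff, ask_owner) % n2
    return total
```

A usage example: after `n, sk = keygen(1009, 1013)` and `ask = lambda bx, by: owner_multiply(n, sk, bx, by)`, calling `secure_squared_distance` on the encrypted components of (3, 7, 1) and (6, 3, 1) yields a ciphertext that the key holder decrypts to 25. The server never holds `sk`; it reaches the key holder only through the `ask_owner` callback.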

Then C classifies all data points. Specifically, by comparing the ciphertext distances from a data point to the centroid points, C assigns each $x_i$ and $y_i$ to the nearest class. C and Alice execute the secure comparison protocol on Alice's ciphertext distances, and C and Bob execute it on Bob's ciphertext distances. All ciphertexts are thereby classified into the corresponding class sets $CL_1$ and $CL_2$: each element of $CL_1$ stores the data points of Alice ($P_1$) that are assigned to class c, and each element of $CL_2$ stores the data points of Bob that are assigned to class c.

The specific processing procedure of the secure comparison protocol is as follows:
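The comparison protocol itself is not reproduced in this text. The sketch below is not the protocol of the invention; it swaps in a much simpler blinded-difference comparison purely to show how a server holding only ciphertext distances can pick the nearest class with the key holder's help. In this simplified variant the key holder learns the sign and a randomly scaled magnitude of each difference, which a real protocol would hide. All names and the `max_blind` bound are assumptions, and the primes in the usage note are toy values.

```python
import math
import random

# Toy Paillier with the common simplification g = N + 1 (assumption).
def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))

def enc(n, x):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (x % n) * n) * pow(r, n, n * n) % (n * n)

def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def server_blind_difference(n, ea, eb, max_blind=1000):
    """Server: turn E(a), E(b) into E(r * (a - b)) for a random positive r."""
    n2 = n * n
    r = random.randrange(1, max_blind)
    diff = ea * pow(eb, n - 1, n2) % n2        # E(a - b)
    return pow(diff, r, n2) * enc(n, 0) % n2   # scale by r, then re-randomise

def owner_is_less(n, sk, blinded):
    """Key holder: a < b exactly when the decrypted value is negative.

    Residues above N/2 represent negative numbers, so r * |a - b| must stay
    below N/2 for the answer to be meaningful.
    """
    return dec(n, sk, blinded) > n // 2

def nearest_class(n, enc_distances, ask_owner):
    """Server: index of the smallest encrypted distance via pairwise comparison."""
    best = 0
    for c in range(1, len(enc_distances)):
        blinded = server_blind_difference(n, enc_distances[c], enc_distances[best])
        if ask_owner(blinded):
            best = c
    return best
```

With `n, sk = keygen(1009, 1013)` and `ask = lambda b: owner_is_less(n, sk, b)`, the encrypted distances 25, 9 and 16 lead `nearest_class` to return index 1.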

Step S4: C, Alice and Bob jointly recompute the k centroid points via the secure circuit protocol. Because the data in $CL_1$ and $CL_2$ are encrypted under the two participants' different public keys, the new centroid points cannot be calculated directly on the ciphertexts. C therefore sends $CL_1$ and $CL_2$ to Alice and Bob respectively for decryption, yielding $L_1$ and $L_2$.

C, Alice and Bob then execute the SC (secure circuit) protocol on the data held by Alice and Bob, thereby calculating each component $\mu_{cj}$ of the new centroid points. The SC protocol ensures that Alice and Bob obtain all the new centroid points.
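The quantity that the secure circuit evaluates can be written down in the clear. The sketch below computes, for one class, each component of the new centroid point from the per-owner component sums and record counts. It shows only the arithmetic, not the circuit, and offers no privacy; the function names and the decision to raise an error for an empty class are assumptions.

```python
def class_sums(points, dims):
    """One owner's contribution for a class: component sums and record count."""
    return [sum(p[j] for p in points) for j in range(dims)], len(points)

def new_centroid(sums_a, count_a, sums_b, count_b):
    """Each component is (sum over Alice + sum over Bob) / (total record count)."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("class is empty for both owners")
    return [(sa + sb) / total for sa, sb in zip(sums_a, sums_b)]

# Example: Alice holds two points of the class and Bob holds one.
sums_a, count_a = class_sums([[1, 2], [3, 4]], dims=2)
sums_b, count_b = class_sums([[5, 6]], dims=2)
centroid = new_centroid(sums_a, count_a, sums_b, count_b)
```

The division by the joint record count is the step that cannot be done with additive homomorphic encryption alone, which is why a separate protocol is used for it.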

The specific processing procedure of the secure circuit protocol is as follows:

Step S5: Alice evaluates the distance between the new centroid points and the previous centroid points through the secure comparison protocol. If the distance is smaller than the threshold, Alice and Bob request C to send each of them its classified data. Otherwise, Alice and Bob encrypt the new centroid points with their respective public keys and upload them to C for the next iteration.

As shown in FIG. 2, the invention also provides a system for implementing the above method. The system of this embodiment includes a server C, a first client $P_1$ used by data owner A, and a second client $P_2$ used by data owner B. The first client $P_1$ and the second client $P_2$ are used for encrypting their respective data and uploading the ciphertext to the server; for each randomly selecting k centroid points, encrypting them and uploading them to the server; for recalculating the k new centroid points jointly with the server after the server has classified the data; and for evaluating the distance between the new centroid points and the original centroid points. If the distance is smaller than a threshold, classification is finished and the clients request the server to send the classified data to $P_1$ and $P_2$ respectively; otherwise the centroid points are uploaded again. The server is used for receiving the data uploaded by $P_1$ and $P_2$, calculating the Euclidean distance from each data point to each centroid point, classifying the data points according to the calculated Euclidean distances, and then recalculating the k new centroid points jointly with $P_1$ and $P_2$.

The server C is a cloud server, which encrypts the data uploaded by data owners A and B and stores it in the cloud file system. This supports outsourcing of data storage, so the method can be executed on larger data sets. It also supports outsourcing of data computation: most of the computation is handed to the cloud platform, whose strong computing capability greatly improves execution efficiency while preserving correctness.

The beneficial effects of the invention are analyzed:

1. Comparison scheme selected by the invention

The framework used by the invention was first proposed in the paper "Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks"; the method of that paper is referred to below as the previous scheme. Since clustering algorithms under the same framework are more comparable than those under other frameworks, the invention is mainly compared with the previous scheme. To ensure the reliability of the experimental comparison, the two schemes are run in the same experimental environment. The evaluation criteria for the two methods are described below, followed by a comparative analysis of the experimental results.

2. Evaluation criteria

The time consumption of the method of the invention is mainly divided into three parts: client-side time consumption, communication consumption, and server-side time consumption. Client-side and server-side time consumption in turn comprise the time consumed in an initialization phase and in a protocol execution phase. Because the methods used in this application differ from those of the previous scheme, the two can only be compared at a macroscopic level. The comparison covers two aspects: a theoretical complexity analysis, including time complexity, space complexity and communication complexity, and a comparison of experimental test results. Different numbers of iterations would affect the overall result of the experiment, so this example takes one iteration as the standard and compares the following aspects:

(1) the theoretical temporal, spatial and communication complexity of the two schemes are compared.

(2) And comparing the data encryption time of the two schemes.

(3) The time consumption of the server and the client in one iteration is compared between the two schemes.

3. Analysis of Experimental results

Theoretically, the scheme of the present invention is lower than the former scheme in terms of time complexity, space complexity and communication complexity. The results of the two protocols are analyzed in the following on the basis of experimental data.

The encryption time consumption of the present invention is slightly less than that of the previous scheme, but the difference is small. The reason is that the Liu encryption scheme consists of linear operations, so most of the encryption time is consumed by Paillier encryption.

Table 1 existing scheme encryption time consumption

Table 2 consumption of encryption time according to the invention

The invention then measures and compares the time consumed in one iteration. Theoretically, by introducing the powerful computing capability of the cloud platform, the invention should run more efficiently than the previous scheme. However, the cloud platform of the invention consists of 30 PCs and one server, so task division, task scheduling and data recovery are required for each machine during task processing, and these operations also consume part of the time. The more data points there are, the longer one iteration takes and the lower the proportion of time consumed by task division and similar overhead. In the secure circuit protocol, generating the circuit takes a relatively long time, but the circuit only needs to be generated once, in the first iteration. Theoretically, therefore, when the number of data points is small, one iteration of the previous scheme is more efficient than one iteration of the invention; when the number of data points exceeds a certain threshold, one iteration of the invention is more efficient; and this advantage becomes more pronounced as the data size grows. The experimental results confirm this expectation. They show that the threshold is about 5000 data points: when the data size is more than 7000, one iteration of the scheme of the invention consumes less time, and when the data size is less than 5000, one iteration of the previous scheme consumes less time. The time consumption of one iteration of both schemes is compared in Table 3.

TABLE 3 one iteration elapsed time comparison

In one iteration, the invention is concerned not only with the time consumed by the iteration but also with having the server C take on more of the work in each iteration, that is, with the server accounting for a higher share of the consumed time. With the total time of one iteration kept small, a higher server share means less computation on the client, and hence higher efficiency as the data size increases. The client mainly performs encryption and decryption operations, and the number of client encryptions and decryptions in the two schemes is substantially the same. However, in the previous scheme the ciphertext distance calculation and ciphertext distance comparison use the improved Liu encryption, all operations of which are linear, whereas the scheme of the invention uses the Paillier encryption algorithm, whose encryption and decryption require exponentiation and modular operations on a group, which are costly for a client with limited computing power. The time consumed by each participant in one iteration of the two schemes is shown in Tables 4 and 5.

TABLE 4 time consuming one iteration of the previous scheme for each participant

Table 5 time consumption of each participant in one iteration of the present application

As can be seen in FIG. 3 and FIG. 4, the time consumed by the server and the client in both schemes increases with the number of data points. In the previous scheme, as the data size increases, the server's time consumption rises markedly while the client's rises only slightly. The main reason is that the computing power of the server is limited and the computation on the data is relatively complex. As the data size increases, the server needs more and more time to process the data, so its time consumption increases significantly and its share of the total time also increases. Although the client also has more data to process as the data size increases, its operations are mostly linear calculations compared with those of the server, so the increase in its time consumption is not obvious and its share of the total time decreases. In the invention, the server runs on a cloud platform consisting of 30 PCs and 1 server, so the computing capacity of the server is ensured. As can be seen from FIG. 4, as the data size increases, the time consumption of the server increases but without an obvious upward trend. The time consumption of the client grows more and more with the data size, mainly because the decryption operation performed by the client is an exponentiation on the group, which requires much more computation than a linear operation. Therefore, in the invention, as the data size increases, the server's share of the time decreases and the client's share increases. The shares of server and client time consumption in the previous scheme are shown in FIG. 5, and those in the invention are shown in FIG. 6.

Finally, the time taken to process data in one iteration by the K-means clustering algorithm with privacy protection and by the classical K-means algorithm is measured experimentally; the results show that the time overhead introduced by encryption is relatively large. However, as the data scale increases, the ratio of the time consumed by one iteration of the present invention to the time consumed by the classical K-means algorithm becomes smaller and smaller. The time consumed in one iteration by the present invention and by the classical K-means algorithm is shown in Table 6, and the time ratio is shown in FIG. 7.

Table 6. Time consumed in one iteration by the present invention and by the classical K-means algorithm

The invention selects the typical K-means algorithm in data mining and performs mining on the joint data set of two parties that is horizontally partitioned between them, while supporting both storage outsourcing and computation outsourcing to a cloud platform. The beneficial effects of the invention mainly include the following aspects:

(1) by analyzing the state of research at home and abroad on privacy-preserving data mining, the advantages and disadvantages of the existing techniques are clearly identified. Although schemes based on data scrambling have higher execution efficiency, they damage the original data set and therefore inevitably affect the data mining result, whereas encryption-based schemes can well ensure the correctness of the mining result;

(2) the scheme of the invention supports data storage outsourcing. Compared with an ordinary PC (personal computer), the cloud platform has a larger storage capacity, so the scheme of the invention can be executed on larger-scale data sets;

(3) the scheme of the invention supports data computation outsourcing. The cloud platform is a distributed computing framework that can integrate a plurality of resources into a cluster, which greatly improves the computing capacity of the system. In this scheme, most of the computation is outsourced to the cloud platform, and by means of the strong computing capacity of the cloud platform the execution efficiency is greatly improved while correctness is ensured;

(4) the time complexity, space complexity, communication complexity and security of the algorithm are analyzed theoretically, and the correctness and efficiency of the algorithm are verified experimentally. The K-means clustering algorithm with privacy protection not only achieves secure computation under the semi-honest model, but also, in the centroid point recalculation stage, supports secure computation in which at most one of the three participants is a malicious party.
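For reference, the plaintext computation that the encrypted protocol is intended to reproduce on the horizontally partitioned data of the two parties can be sketched as follows. This is a minimal sketch of classical K-means only and contains no encryption or secure protocol. The function names, the use of the largest centroid shift as the convergence measure, the default threshold and the handling of empty clusters are assumptions made for illustration.

```python
import math


def nearest(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans_iteration(data_a, data_b, centroids):
    """One iteration over the union of the rows held by owners A and B."""
    clusters = [[] for _ in centroids]
    for p in data_a + data_b:                 # horizontal partition: same columns, different rows
        clusters[nearest(p, centroids)].append(p)
    new_centroids = []
    for j, members in enumerate(clusters):
        if members:
            dim = len(members[0])
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        else:
            new_centroids.append(centroids[j])  # assumption: keep the old centroid if a class is empty
    return new_centroids, clusters


def kmeans(data_a, data_b, centroids, threshold=1e-6, max_iter=100):
    """Iterate until the largest centroid shift is smaller than threshold (assumed criterion)."""
    clusters = []
    for _ in range(max_iter):
        new_centroids, clusters = kmeans_iteration(data_a, data_b, centroids)
        shift = max(math.dist(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, clusters


# Toy data, invented purely for illustration.
data_a = [(0.0, 0.0), (0.0, 1.0)]
data_b = [(10.0, 10.0), (10.0, 11.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]
final_centroids, final_clusters = kmeans(data_a, data_b, initial)
print(final_centroids)
```

A plaintext reference of this kind gives a baseline against which the clustering result of the encrypted scheme can be compared when checking correctness.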

The above-described embodiments are intended to be illustrative, and not restrictive, of the invention, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A K-means clustering method with privacy protection is characterized by comprising the following steps:

S3: the server calculates the Euclidean distance from each ciphertext data point to the centroid points through a secure distance calculation protocol, and classifies the data points according to the calculated Euclidean distances through a secure comparison protocol;

2. The K-means clustering method with privacy protection as claimed in claim 1, wherein: in step S1, the server is a cloud server, and the cloud server stores the encrypted data uploaded by the data owners A and B in the cloud file system.

3. The K-means clustering method with privacy protection as claimed in claim 2, wherein: in step S2, the selection of the centroid points includes selection of the centroid point number and the numerical value, and specifically includes the following steps:

4. The K-means clustering method with privacy protection as claimed in claim 3, wherein: the calculation method of step S3 includes the steps of:

S32: the server and the data owner A jointly calculate the ciphertext distance between each data point and the centroid points of the data owner A by using the secure distance calculation protocol; the server and the data owner B jointly calculate the ciphertext distance between each data point and the centroid points of the data owner B by using the secure distance calculation protocol;

S33: the server classifies the data of the data owners A and B into the closest class according to the ciphertext distance set obtained in step S32 and, within each class, stores the data of the data owners A and B separately.

5. The K-means clustering method with privacy protection as claimed in claim 4, wherein: the processing method of step S4 includes the steps of:

S42: the data owners A and B perform decryption;

S43: the server and the data owners A and B compute a new centroid point in this class using the secure circuit protocol.

6. A system for realizing the K-means clustering method with privacy protection of any one of claims 1-5, characterized by comprising a database, a first client used by a data owner A and a second client used by a data owner B, wherein the first client and the second client are used for encrypting their respective data and uploading the ciphertext to a server, for each randomly selecting k centroid points, encrypting them and uploading the ciphertext to the server, for recalculating new k centroid points together with the server after the server has completed classification, and for judging the distance between the new centroid points and the original centroid points; if the distance is smaller than a threshold value, the classification is finished and the server is requested to send the classified data to the first client and the second client respectively; otherwise, the new centroid points are uploaded again; the server is used for receiving the data uploaded by the first client and the second client, calculating the Euclidean distances from the data points to the centroid points, classifying the data points according to the calculated Euclidean distances, and then recalculating new k centroid points together with the first client and the second client.

7. The system of claim 6, wherein: the server is a cloud server, and the cloud server stores the encrypted data uploaded by the data owners A and B in a cloud file system.

8. The system of claim 7, wherein: the selection of the centroid points of the first client and the second client comprises selection of the centroid point quantity and the numerical value, and specifically comprises the following modules:

a centroid point selection module: for randomly selecting k centroid points;

a centroid point value selection module: for calculating, from the respective values of the centroid points, an average value, i.e. the value of the k centroid points.

9. The system of claim 8, wherein: the server includes:

a second ciphertext distance calculation module: for calculating, jointly with the first client, the ciphertext distance between each data point and the centroid points of the first client, and for calculating, jointly with the second client, the ciphertext distance between each data point and the centroid points of the second client;

a classification module: for classifying the data of the first client and the second client into the closest classes according to the ciphertext distance set calculated by the second ciphertext distance calculation module and, within each class, storing the data of the two clients separately.

10. The system of claim 9, wherein: the server further comprises a sending module: for sending the data points stored separately within the same class to the corresponding first client and second client, respectively;

a secure centroid point calculation module: for calculating a new centroid point in the same class together with the first client and the second client via the secure circuit protocol.