CN107145791B - K-means clustering method and system with privacy protection function - Google Patents

Publication number
CN107145791B
Authority
CN
China
Prior art keywords
data
centroid
server
point
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710224275.7A
Other languages
Chinese (zh)
Other versions
CN107145791A (en
Inventor
王轩
蒋琳
李晔
姚霖
刘泽超
靳亚宾
梁玉冬
刘猛
漆舒汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201710224275.7A
Publication of CN107145791A
Priority to PCT/CN2017/117943 (WO2018184407A1)
Application granted
Publication of CN107145791B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Abstract

The invention provides a K-means clustering method and system with privacy protection, and belongs to the technical field of data mining. The method comprises the following steps: data owners A and B encrypt their respective data and randomly selected centroid points, and upload them to a server; the server calculates the Euclidean distance from each data point to each centroid point on the ciphertext data through a secure multiplication protocol and a secure distance calculation protocol, and classifies the data points; the server, data owner A and data owner B jointly recalculate the new centroid points on the ciphertext data through a secure circuit protocol; and data owner A or B determines the distance between each new centroid point and the original centroid point through a secure comparison protocol. If the distance is smaller than a threshold, classification is finished and data owners A and B request the server to send the classified data to each of them respectively; otherwise, data owners A and B upload the new centroid points and the next iteration is performed. The invention ensures the correctness of the data mining result while protecting the privacy and security of the data; it supports outsourcing of both data storage and data computation, which greatly improves execution efficiency while preserving correctness; and it supports secure computation in which at most one of the three participants is malicious.

Description

K-means clustering method and system with privacy protection function
Technical Field
The invention relates to the technical field of data mining, in particular to a K-means clustering method with privacy protection and a system for realizing the method.
Background
K-means clustering is one of the most classical and widely used methods in data mining; it groups similar data items together by calculating the distance between them. With the acceleration of informatization, digitization and networking, economic globalization has become an irreversible trend, the data sources used by clustering algorithms are increasingly diverse, and data security is increasingly important. Because the data may come from multiple parties and may contain sensitive or private information about them, the privacy of the data cannot be guaranteed if it is shared among the parties. Joint data mining with privacy protection allows the joint database of multiple participants to be mined, and useful information to be extracted, while protecting the privacy of both the user data and the mining results. How to design a joint data mining algorithm with privacy protection is therefore a difficult problem that needs to be solved.
The semi-honest model, in which data privacy is guaranteed provided that all parties always follow the protocol, is realistic in many cases. However, solutions that guarantee data privacy under this model are generally not feasible in practice, because their computation and communication costs are high.
The traditional K-means clustering algorithm is a classical clustering algorithm based on Euclidean distance. It consists of three main steps: selecting the centroid points, classifying the data points, and recalculating the centroid points. Assume the training samples are $\{x_i \in \mathbb{R}^l \mid 1 \le i \le m\}$, where $m$ is the number of samples and $l$ is the dimension of each sample. First, $k$ centroid points are selected at random, denoted $M = \{\mu_c \in \mathbb{R}^l \mid 1 \le c \le k\}$. Then the distance from each data point $x_i$ to each centroid point $\mu_c$ is calculated, and $x_i$ is assigned to the class of its nearest centroid point: $c_i := \arg\min_c \|x_i - \mu_c\|^2$. Finally, each centroid point $\mu_c$ is recalculated as the mean of the data points assigned to its class:
$$\mu_c := \frac{\sum_{i=1}^{m} 1\{c_i = c\}\, x_i}{\sum_{i=1}^{m} 1\{c_i = c\}}$$
In the classification step, the Euclidean distance between a data point and each centroid point is calculated first, and the data point is then assigned to the class of the nearest centroid point. The squared Euclidean distance is used, because it is easier to compute and compare and does not change the ordering of any two distances. In the step of recalculating the centroid points, the component-wise sum of the data points in each class has to be calculated; since these data points may come from different participants, this calculation may raise privacy problems. In summary, the calculation process of the traditional K-means clustering algorithm may leak private data.
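The three plaintext steps described above can be sketched as follows. This is only an illustrative, unencrypted baseline and not part of the patented protocol; the function names (`squared_distance`, `assign`, `recompute`, `kmeans`) and the seeded random initialization are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    # Squared Euclidean distance; it preserves the ordering of distances.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    # Classification step: index of the nearest centroid for every point.
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def recompute(points, labels, centroids):
    # Recalculation step: each centroid becomes the mean of its class members.
    updated = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(old)  # keep the old centroid if the class is empty
    return updated


def kmeans(points, k, iterations=100, seed=0):
    # Selection step: pick k initial centroids at random from the data.
    centroids = random.Random(seed).sample(points, k)
    labels = assign(points, centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = recompute(points, labels, centroids)
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, labels
```

In the privacy-preserving method, the distance computation, the comparison inside `assign`, and the division inside `recompute` are the operations that have to be carried out on ciphertexts.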
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a K-means clustering method with privacy protection and a system for realizing the method.
The method comprises the following steps:
S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server;
S2: Data owners A and B each randomly select k centroid points, encrypt them and upload them to the server;
S3: The server calculates the Euclidean distance from each ciphertext data point to each centroid point through a secure distance calculation protocol, and classifies the data points according to the calculated distances through a secure comparison protocol;
S4: The server, data owner A and data owner B jointly recalculate k new centroid points through a secure circuit protocol;
S5: Data owner A or B determines, through the secure comparison protocol, the distance between each new centroid point and the original centroid point on the ciphertext data. If the distance is smaller than a threshold, classification ends and data owners A and B request the server to send the classified data to each of them respectively; otherwise, the process returns to step S2 for the next iteration.
In a further improvement of the invention, in step S1 the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.
In a further improvement of the invention, the selection of the centroid points in step S2 includes selecting both the number of centroid points and their values, and specifically includes the following steps:
S21: Data owners A and B each randomly select k centroid points;
S22: Each data owner iterates on its own data set according to the traditional K-means clustering algorithm and classifies the data;
S23: The distance from each data point to its corresponding centroid point is calculated, and the sum S of the distances of all data points is calculated;
S24: When the sums S corresponding to k-1, k and k+1 centroid points do not differ greatly, k is taken as the number of centroid points;
S25: Data owners A and B each calculate the average of the values of their respective centroid points; this average is the value of the k centroid points.
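Steps S21 to S24 can be illustrated with the following plaintext sketch, which each data owner could run locally on its own data. It is a loose reading of the rule as stated: the relative tolerance `tol`, the fallback to `k_max`, and the round-half-up averaging of the two owners' values in `combine_k` (described in the detailed embodiment) are assumptions of this example, not values given in the source.

```python
import math
import random


def run_kmeans(points, k, iterations=50, seed=0):
    # Plain local K-means (steps S21 and S22) with seeded random initial centroids.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids


def distance_sum(points, k):
    # Step S23: sum S of the distances from every point to its nearest centroid.
    centroids = run_kmeans(points, k)
    return sum(min(math.dist(p, c) for c in centroids) for p in points)


def select_k(points, k_max, tol=0.05):
    # Step S24: pick the first k whose S is close to S for k-1 and k+1.
    # "Close" is measured relative to S for a single cluster (an assumption).
    sums = {k: distance_sum(points, k) for k in range(1, k_max + 1)}
    scale = sums[1] or 1.0
    for k in range(2, k_max):
        if abs(sums[k - 1] - sums[k]) <= tol * scale and abs(sums[k] - sums[k + 1]) <= tol * scale:
            return k
    return k_max


def combine_k(k_a, k_b):
    # Average of the two owners' choices, rounded half up (an assumption).
    return (k_a + k_b + 1) // 2
```

Because the rule compares three consecutive values of S, the selected k tends to sit just past the point where S stops dropping sharply; a different tolerance or scale would shift that choice.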
In a further improvement of the invention, the calculation in step S3 includes the following steps:
S31: The server calculates the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroid points uploaded by data owner A, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroid points uploaded by data owner B;
S32: The server and data owner A jointly calculate, through the secure distance calculation protocol, the ciphertext distance between each data point of data owner A and each centroid point; the server and data owner B jointly calculate, through the secure distance calculation protocol, the ciphertext distance between each data point of data owner B and each centroid point;
S33: The server assigns the data of data owners A and B to the nearest class according to the set of ciphertext distances obtained in step S32, and stores the data of each class separately.
In a further improvement of the invention, the processing in step S4 includes the following steps:
S41: The server sends the separately stored data points of each class to the corresponding data owners A and B;
S42: Data owners A and B decrypt the data;
S43: The server and data owners A and B jointly calculate the new centroid point of each class through the secure circuit protocol.
The invention also provides a system for implementing the method, comprising a server, a first client used by data owner A, and a second client used by data owner B. The first client and the second client are used for encrypting their respective data and uploading the ciphertext to the server; for each randomly selecting k centroid points, encrypting them and uploading them to the server; for recalculating the k new centroid points jointly with the server after the server has classified the data; and for determining the distance between each new centroid point and the original centroid point. If the distance is smaller than a threshold, classification ends and the clients request the server to send the classified data to the first client and the second client respectively; otherwise, the centroid points are uploaded again. The server is used for receiving the data uploaded by the first client and the second client, calculating the Euclidean distance from each data point to each centroid point, classifying the data points according to the calculated distances, and then recalculating the k new centroid points jointly with the first client and the second client.
In a further improvement of the invention, the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.
In a further improvement of the invention, the selection of the centroid points by the first client and the second client includes selecting both the number of centroid points and their values, and the clients specifically comprise the following modules:
a centroid point selection module, used for randomly selecting k centroid points;
a classification module, used for iterating on the respective data set according to the traditional K-means clustering algorithm and classifying the data;
a secure distance calculation module, used for calculating the distance from each data point to its corresponding centroid point through the secure distance calculation protocol, and for calculating the sum S of the distances of all data points;
a centroid point number selection module, used for determining that k is the number of centroid points when the sums S corresponding to k-1, k and k+1 centroid points do not differ greatly;
a centroid point value selection module, used for calculating the average of the values of the respective centroid points, this average being the value of the k centroid points.
In a further improvement of the invention, the server comprises:
a first ciphertext distance calculation module, used for calculating the ciphertext distance between each ciphertext record of the first client and the ciphertext centroid points uploaded by the first client, and the ciphertext distance between each ciphertext record of the second client and the ciphertext centroid points uploaded by the second client;
a second ciphertext distance calculation module, used for calculating, jointly with the first client, the ciphertext distance between each data point of the first client and each centroid point, and, jointly with the second client, the ciphertext distance between each data point of the second client and each centroid point;
a classification module, used for assigning the data of the first client and the second client to the nearest class according to the set of ciphertext distances calculated by the second ciphertext distance calculation module, and for storing the data of each class separately.
In a further improvement of the invention, the server further comprises a sending module, used for sending the separately stored data points of each class to the corresponding first client and second client; and a secure centroid point calculation module, used for calculating the new centroid point of each class jointly with the first client and the second client through the secure circuit protocol.
Compared with the prior art, the invention has the following beneficial effects: encryption protects the security of the data throughout the data mining process while the correctness of the result is guaranteed; outsourcing of data storage is supported, so the method can be executed on larger data sets; outsourcing of data computation is supported, so most of the computation is handed to a cloud platform, whose computing power greatly improves execution efficiency while correctness is preserved; and the method not only achieves secure computation under the semi-honest model, but also supports secure computation in which at most one of the three parties is malicious during the centroid recalculation stage.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention;
FIG. 3 is a diagram illustrating the time consumption of a server and a client in a conventional K-means clustering algorithm;
FIG. 4 is a diagram illustrating server and client elapsed time in accordance with the present invention;
FIG. 5 is a diagram illustrating the consumption time ratio of a server and a client in a conventional K-means clustering algorithm;
FIG. 6 is a diagram illustrating the consumption time ratio of the server and the client according to the present invention;
FIG. 7 is a diagram illustrating the ratio of the time consumption of the K-means clustering algorithm of the present invention to that of the conventional K-means clustering algorithm.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
To address the performance problems of privacy-preserving data mining, the invention builds on a study of existing privacy-preserving data mining algorithms and provides an efficient privacy-preserving K-means clustering algorithm for horizontally partitioned data sets. The data is stored in the cloud in ciphertext form, and the cloud platform interacts with the two data owners to complete K-means clustering on their joint data set. The invention designs a separate security protocol for each of the three technical problems in privacy-preserving K-means clustering: a secure distance calculation protocol for computing distances on ciphertexts, a secure comparison protocol for comparing ciphertexts, and a secure circuit protocol for division on ciphertexts. These protocols are applied within the clustering framework to realize a K-means clustering algorithm with privacy protection.
As shown in fig. 1, the K-means clustering method with privacy protection of the present invention mainly includes 5 steps, which are explained in detail below:
step S1: the data owners a and B encrypt the respective data and then upload the ciphertext to the server. In this example, assume that data owner a is Alice, data owner B is Bob, and the server is C.
Alice and Bob use their public keys $pk_1$ and $pk_2$ respectively to encrypt their data $D_x$ and $D_y$, obtaining the ciphertexts $C_x$ and $C_y$, and then upload $C_x$ and $C_y$ to C. Each record in $D_x$ and $D_y$ is l-dimensional, so encrypting a database means encrypting every dimension of every record. All data of Alice and Bob can thus be stored in the cloud file system in ciphertext form. The specific representation is as follows:
Figure BDA0001264720220000051
wherein m is the number of records.
Step S2: Alice and Bob each select k centroid points, encrypt them with their respective public keys and upload them to C.
In this example, the selection of the centroid points is a very important step, because it directly determines the number of iterations and hence the execution time of the whole system; a good choice of centroid points increases the convergence rate and execution efficiency of the system. The selection is divided into two parts. The first is the selection of the number of centroid points: Alice and Bob each pick a random value k and k random centroid points, and then perform one iteration on their own data sets. After classification, the distance from each data point to its corresponding centroid point is calculated, and the sum of all these distances is denoted S. When the values of S corresponding to k-1, k and k+1 do not differ greatly, k is taken as the number of centroid points. After Alice and Bob have each found their own k, the final k is the average of the two values. The second is the selection of the centroid values: Alice randomly selects k centroid points $M = \{\mu_c \mid 1 \le c \le k\}$, where $\mu_c = \{u_{cj} \mid 1 \le j \le l\}$. The centroid points are encrypted with the public keys of Alice and Bob respectively and uploaded to the cloud; their ciphertexts are [Figure BDA0001264720220000052] and [Figure BDA0001264720220000053].
Step S3: The server C calculates the Euclidean distance from each ciphertext data point to each centroid point through the secure distance calculation protocol, and then classifies the data points according to the calculated distances through the secure comparison protocol. Specifically:
C calculates the ciphertext distance between each record [Figure BDA0001264720220000054] and each centroid point [Figure BDA0001264720220000055], and between each record [Figure BDA0001264720220000056] and each centroid point [Figure BDA0001264720220000057]. C and Alice jointly run the SSED (secure distance calculation) protocol to compute the ciphertext distance between each $x_i$ and $\mu_c$, denoted [Figure BDA0001264720220000058]. C and Bob jointly run the SSED protocol to compute the ciphertext distance between each $y_i$ and $\mu_c$, denoted [Figure BDA0001264720220000059]. All ciphertext distances between $x_i$ and $\mu_c$ are stored in [Figure BDA00012647202200000510], and all ciphertext distances between $y_i$ and $\mu_c$ are stored in [Figure BDA00012647202200000511].
The homomorphic encryption used in the method is a partially homomorphic scheme supporting addition on ciphertexts, namely Paillier encryption. It is a probabilistic encryption scheme described by the 4-tuple $Enc_{pa} = \{KeyGen, Encrypt, Decrypt, Evaluate\}$. The procedure for Paillier encryption is as follows:
● $KeyGen(1^k) \to (pk, sk)$:
(1) Select two large prime numbers $p$ and $q$ satisfying $\gcd(pq, (p-1)(q-1)) = 1$;
(2) Calculate $N = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$;
(3) Randomly select an integer $g \in \mathbb{Z}_{N^2}^{*}$;
(4) Find $\mu = (L(g^{\lambda} \bmod N^2))^{-1} \bmod N$, where $L$ is the function $L(u) = (u-1)/N$. The public key is $(N, g)$ and the private key is $(\lambda, \mu)$.
● $Encrypt(x, r) \to c$:
Let $x$ be the plaintext. Select a random number $r$; the ciphertext is $c = g^{x} r^{N} \bmod N^2$. Encryption is also denoted $E_{pk}(x) = c$.
● $Decrypt(c) \to x$:
The plaintext is recovered as $x = L(c^{\lambda} \bmod N^2) \cdot \mu \bmod N$. $D_{sk}(c)$ denotes $Decrypt(c)$.
● $Evaluate$:
$E_{pk}(x) \cdot E_{pk}(y) = E_{pk}(x+y)$ and $E_{pk}(x)^{y} = E_{pk}(xy)$, where $x$ and $y$ are two plaintexts.
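The four Paillier operations above can be checked with a toy implementation. The sketch below is for illustration only: it uses tiny primes that offer no security, and it fixes $g = N + 1$ (a common simplification) instead of drawing $g$ at random; the function names are chosen for this example.

```python
import math
import random


def L(u, n):
    # The function L(u) = (u - 1) / N from the key generation step.
    return (u - 1) // n


def keygen(p, q):
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # simplification: a fixed, valid generator instead of a random one
    mu = pow(L(pow(g, lam, n * n), n), -1, n)  # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(pk, x, rng=random):
    n, g = pk
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:  # r must be invertible modulo N
        r = rng.randrange(1, n)
    return pow(g, x % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n


def add(pk, c1, c2):
    # E(x) * E(y) = E(x + y)
    return c1 * c2 % (pk[0] ** 2)


def scalar_mul(pk, c, y):
    # E(x)^y = E(x * y)
    return pow(c, y, pk[0] ** 2)
```

For example, with `pk, sk = keygen(17, 19)`, decrypting `add(pk, encrypt(pk, 12), encrypt(pk, 30))` recovers 42, and so does decrypting `scalar_mul(pk, encrypt(pk, 7), 6)`, provided the results stay below $N$.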
The secure distance calculation protocol of this embodiment is built on a secure multiplication protocol. The specific processing procedure of the secure multiplication protocol is as follows:
Figure BDA0001264720220000061
where $Z_n$ denotes the space of positive integers, i.e. $r_x$ and $r_y$ are positive integers.
The specific processing procedure of the secure distance calculation protocol in this example is as follows:
Figure BDA0001264720220000062
Figure BDA0001264720220000071
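The protocol figures are not reproduced in this text, so the following is only a sketch of one widely used construction that matches the description: a blinding-based secure multiplication using random values $r_x$ and $r_y$, on top of which a squared Euclidean distance is accumulated under Paillier encryption. The exact steps of the patented protocols may differ. The whole exchange is simulated in one process, the key holder is modeled as a callback, the primes are toy-sized, and all names are assumptions of this example.

```python
import math
import random


def keygen(p, q):
    # Toy Paillier keys with g = n + 1, for which L(g^lam mod n^2) equals lam.
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))


def enc(n, x, rng):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return pow(n + 1, x % n, n * n) * pow(r, n, n * n) % (n * n)


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def make_key_holder(n, sk, rng):
    # The data owner: decrypts two blinded values and returns E(product).
    def multiply_blinded(c1, c2):
        return enc(n, dec(n, sk, c1) * dec(n, sk, c2), rng)
    return multiply_blinded


def secure_multiply(n, ca, cb, holder, rng):
    # Server side: from E(a), E(b) obtain E(a*b) without learning a or b.
    n2 = n * n
    ra, rb = rng.randrange(1, n), rng.randrange(1, n)
    blinded_a = ca * enc(n, ra, rng) % n2  # E(a + ra)
    blinded_b = cb * enc(n, rb, rng) % n2  # E(b + rb)
    ch = holder(blinded_a, blinded_b)      # E((a + ra)(b + rb))
    ch = ch * pow(ca, n - rb, n2) % n2     # subtract a * rb
    ch = ch * pow(cb, n - ra, n2) % n2     # subtract b * ra
    return ch * enc(n, -(ra * rb), rng) % n2  # subtract ra * rb


def secure_squared_distance(n, cx, cmu, holder, rng):
    # E(sum_j (x_j - mu_j)^2) from the component-wise ciphertexts.
    n2 = n * n
    total = enc(n, 0, rng)
    for cxj, cmj in zip(cx, cmu):
        diff = cxj * pow(cmj, n - 1, n2) % n2  # E(x_j - mu_j)
        total = total * secure_multiply(n, diff, diff, holder, rng) % n2
    return total
```

The server only ever sees ciphertexts, and the key holder only sees values masked by `ra` and `rb`; the result is correct as long as the true squared distance is smaller than the modulus.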
Then C classifies all data points. Specifically, by comparing the distances [Figure BDA0001264720220000072] and [Figure BDA0001264720220000073], C assigns $x_i$ and $y_i$ to the nearest class. C and Alice execute the secure comparison protocol [Figure BDA0001264720220000074], and C and Bob execute [Figure BDA0001264720220000075]. All ciphertexts are classified into the corresponding classes [Figure BDA0001264720220000076] and [Figure BDA0001264720220000077]. Each [Figure BDA0001264720220000078] stores the data points of Alice (P1) assigned to class c, and each [Figure BDA0001264720220000079] stores the data points of Bob assigned to class c. The calculation formulas are as follows:
Figure BDA00012647202200000710
Figure BDA00012647202200000711
The specific processing procedure of the secure comparison protocol is as follows:
Figure BDA00012647202200000712
Figure BDA0001264720220000081
Figure BDA0001264720220000082
Step S4: C, Alice and Bob jointly recompute the k centroid points via the secure circuit protocol. Because the data in $CL_1$ and $CL_2$ are encrypted under the two participants' different public keys, C cannot calculate the new centroid points directly. C therefore sends $CL_1$ and $CL_2$ to Alice and Bob respectively for decryption, yielding $L_1$ and $L_2$. The calculation formulas are as follows:
Figure BDA0001264720220000083
Figure BDA0001264720220000084
C, Alice and Bob then execute the SC (secure circuit) protocol to calculate [Figure BDA0001264720220000091], where [Figure BDA0001264720220000092] are the ciphertext data of Alice and Bob respectively. In this way each component $\mu_{cj}$ of the new centroid points is calculated. The SC protocol ensures that Alice and Bob both obtain all the new centroid points.
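As a plaintext reference for what this step has to produce, the sketch below computes a new centroid of one class from each party's local component sums and member counts. It assumes the new centroid is the mean over both parties' members of the class, as in the traditional algorithm; the secure circuit protocol itself, which hides the individual inputs, is not modeled, and the function names are chosen for this example.

```python
def class_sums(points, dim):
    # Local contribution of one party: component-wise sums and member count.
    return [sum(p[j] for p in points) for j in range(dim)], len(points)


def joint_centroid(alice_points, bob_points, dim):
    # New centroid component mu_cj = (sum over both parties) / (total count).
    sum_a, n_a = class_sums(alice_points, dim)
    sum_b, n_b = class_sums(bob_points, dim)
    total = n_a + n_b
    if total == 0:
        raise ValueError("class has no members")
    return [(sum_a[j] + sum_b[j]) / total for j in range(dim)]
```

The division by the combined count is the operation that cannot be done directly on additively homomorphic ciphertexts, which is why a separate protocol is used for it.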
The specific processing procedure of the secure circuit protocol is as follows:
Figure BDA0001264720220000093
Figure BDA0001264720220000101
Step S5: Alice calculates the distance between each new centroid point and the previous centroid point through the secure comparison protocol. If the distance is smaller than the threshold, Alice and Bob request C to send the classified data to Alice and Bob respectively. Otherwise, Alice and Bob encrypt the new centroid points with their respective public keys and upload them to C for the next iteration.
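The stopping rule of step S5 can be sketched in plaintext as follows. In the method itself the comparison is carried out through the secure comparison protocol; the function name `has_converged` and the choice to require every centroid to move less than the threshold are assumptions of this example.

```python
import math


def has_converged(old_centroids, new_centroids, threshold):
    # True when every centroid moved by less than the threshold.
    return all(
        math.dist(old, new) < threshold
        for old, new in zip(old_centroids, new_centroids)
    )
```

When the check fails, the new centroid points replace the old ones and another round of classification and recalculation is run.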
As shown in FIG. 2, the invention also provides a system for implementing the above method. The system of this embodiment includes a server C, a first client $P_1$ used by data owner A, and a second client $P_2$ used by data owner B. The first client $P_1$ and the second client $P_2$ are used for encrypting their respective data and uploading the ciphertext to the server; for each randomly selecting k centroid points, encrypting them and uploading them to the server; for recalculating the k new centroid points jointly with the server after the server has classified the data; and for determining the distance between each new centroid point and the original centroid point. If the distance is smaller than a threshold, classification ends and the clients request the server to send the classified data to the first client $P_1$ and the second client $P_2$ respectively; otherwise, the centroid points are uploaded again. The server is used for receiving the data uploaded by the first client $P_1$ and the second client $P_2$, calculating the Euclidean distance from each data point to each centroid point, classifying the data points according to the calculated distances, and then recalculating the k new centroid points jointly with the first client $P_1$ and the second client $P_2$.
The server C is a cloud server, which encrypts the data uploaded by data owners A and B and stores it in the cloud file system. This supports outsourcing of data storage, so the method can be executed on larger data sets. It also supports outsourcing of data computation: most of the computation is handed to the cloud platform, whose computing power greatly improves execution efficiency while correctness is preserved.
The beneficial effects of the invention are analyzed below.
1. Comparison scheme selected by the invention
The framework used by the invention was first proposed in the paper "Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks"; in the comparison below, the method of that paper is referred to as the previous scheme. A clustering algorithm built on the same framework is more directly comparable than algorithms built on other frameworks, so the invention is compared mainly with this previous scheme. To ensure that the experimental comparison is reliable, both schemes are run in the same experimental environment. The evaluation criteria for the two methods are described below, followed by a comparative analysis of the experimental results.
2. Evaluation criteria
The time consumption of the method of the invention is divided into three parts: client-side time consumption, communication consumption, and server-side time consumption. Client-side and server-side time consumption each comprise the time consumed in the initialization phase and in the protocol execution phase. Because the methods used in this application differ from those of the previous scheme, the two can only be compared at a macroscopic level. The comparison covers two aspects: a theoretical complexity analysis, including time complexity, space complexity and communication complexity, and a comparison of the results measured in experiments. Since the number of iterations affects the overall result of an experiment, this example takes one iteration as the unit of comparison and compares the following aspects:
(1) The theoretical time, space and communication complexity of the two schemes.
(2) The data encryption time of the two schemes.
(3) The time consumed by the server and the client in one iteration of the two schemes.
3. Analysis of Experimental results
Theoretically, the scheme of the invention has lower time complexity, space complexity and communication complexity than the previous scheme. The results of the two schemes are analyzed below on the basis of experimental data.
The encryption time of the invention is slightly less than that of the previous scheme, but the difference is small. The experiments indicate that, because the Liu encryption scheme uses only linear operations, most of the encryption time is consumed by Paillier encryption.
Table 1 Encryption time consumption of the existing scheme
Figure BDA0001264720220000111
Table 2 Encryption time consumption of the invention
Figure BDA0001264720220000112
The invention then makes statistics and comparisons of the time consumed in one iteration. Theoretically, the improvement of powerful computing capacity of the cloud platform introduced by the invention is better than the operation efficiency of the previous scheme. Because the cloud platform of the present invention is composed of 30 PCs and one server, task division, task scheduling and data recovery are required for each machine during the task processing, and these operations also consume part of the time. As more data points are available, the time for one iteration is longer, and the proportion of time consumed by task division and the like is lower. In the safety circuit protocol, the generation of the circuit needs to consume a longer time, but the circuit only needs to be generated once in the first iteration, so that theoretically, when the data point scale is smaller, the efficiency of one iteration of the previous scheme is higher than that of the scheme in the invention, when the data point scale is higher than a certain threshold value, the efficiency of one iteration of the scheme in the invention is higher than that of the previous scheme, and the efficiency advantage of the scheme in the invention is more and more obvious as the data scale is larger and larger. The experimental result well demonstrates our idea, and the experimental result shows that the threshold of the data point scale is about 5000 data points, when the data scale is more than 7000, the one-time iteration of the scheme of the invention consumes less time, and when the data scale is less than 5000, the one-time iteration of the scheme in the previous scheme consumes less time. The time-consuming ratio of one iteration of both schemes is shown in table 3.
Table 3 Comparison of the time consumed by one iteration
Figure BDA0001264720220000121
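The break-even argument above can be illustrated with a minimal cost model. This is only a sketch: the function names and every constant (a fixed overhead of 6000 ms for scheduling and circuit generation, per-point costs of 2 ms and 1 ms) are hypothetical values chosen for illustration, not measurements from the experiments.

```python
# Hypothetical linear cost model for one iteration.
# All constants are illustrative assumptions, not measured values.

def previous_scheme_ms(n_points, per_point_ms=2):
    """Previous scheme: no fixed overhead, higher cost per data point."""
    return per_point_ms * n_points

def cloud_scheme_ms(n_points, fixed_overhead_ms=6000, per_point_ms=1):
    """Cloud scheme: fixed overhead (task division, scheduling, circuit
    generation) plus a lower cost per data point."""
    return fixed_overhead_ms + per_point_ms * n_points

def break_even_points(fixed_overhead_ms=6000, prev_per_point_ms=2,
                      cloud_per_point_ms=1):
    """Data scale at which both models take the same time."""
    return fixed_overhead_ms / (prev_per_point_ms - cloud_per_point_ms)

def overhead_share(n_points, fixed_overhead_ms=6000, per_point_ms=1):
    """Fraction of the cloud scheme's iteration time spent on fixed overhead."""
    total = cloud_scheme_ms(n_points, fixed_overhead_ms, per_point_ms)
    return fixed_overhead_ms / total

if __name__ == "__main__":
    for n in (3000, 6000, 9000):
        print(n, previous_scheme_ms(n), cloud_scheme_ms(n),
              round(overhead_share(n), 2))
```

Under these assumed constants the two models cross at a single data scale; below it the scheme without fixed overhead is faster, above it the cloud scheme is faster, and the share of time spent on fixed overhead shrinks as the data scale grows.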
In one iteration, the present invention is concerned not only with the time consumed by the iteration, but also hopes that the server C can take on more of the work in each iteration, that is, account for a higher share of the consumed time. In other words, while keeping the time of one iteration small, the ratio of the time consumed by the server C to the time of one iteration should be as large as possible, so that the clients compute less and the scheme becomes more efficient as the data scale increases. The client mainly performs encryption and decryption, and the number of client encryptions and decryptions is substantially the same in both schemes. However, in the previous scheme the ciphertext distance calculation and ciphertext distance comparison use the improved Liu encryption, all of whose operations are linear, whereas the scheme of the present invention uses the Paillier encryption algorithm, whose encryption and decryption require exponentiation and modular operations on a group. The client's time consumption in the present invention is therefore expected to be larger than in the previous scheme, and the experimental results are consistent with this expectation. The time consumed by each participant in one iteration is shown in Table 4 for the previous scheme and in Table 5 for the present invention.
Table 4 Time consumed by each participant in one iteration of the previous scheme
Figure BDA0001264720220000131
Table 5 Time consumed by each participant in one iteration of the present invention
Figure BDA0001264720220000132
As can be seen in FIG. 3 and FIG. 4, in both schemes the time consumed by the server and by the client increases with the number of data points. In the previous scheme, as the data scale increases, the time consumed by the server rises markedly, while the time consumed by the client rises only slightly. The main reason is that the computing power of the server is limited and its computation on the data is relatively complex. As the data scale increases, the server inevitably needs more and more time to process the data, so its consumed time rises markedly and its share of the total time also rises. Although the client also has more data to process as the data scale increases, most client operations are linear calculations compared with those of the server, so the increase in client time is not obvious and the client's share of the total time falls. In the present invention, the server runs on a cloud platform consisting of 30 PCs and 1 server, which guarantees its computing capacity. As can be seen from FIG. 4, the time consumed by the server increases with the data scale, but without an obvious upward trend. The time consumed by the client grows more and more with the data scale, mainly because the decryption performed by the client is an exponentiation on the group, which requires more computation than a linear operation. Therefore, as the data scale increases, the server's share of the consumed time decreases and the client's share increases in the present invention. The server and client shares of the consumed time in the previous scheme are shown in FIG. 5, and those in the present invention are shown in FIG. 6.
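To make concrete why Paillier operations cost more than linear ones, the following is a toy sketch of textbook Paillier encryption. It is an illustration only: it uses the common simplification g = n + 1, tiny hard-coded primes (61 and 53) that offer no security, and function names that are assumptions of this sketch rather than part of the scheme described here. Every encryption and decryption is a modular exponentiation modulo n², while adding two plaintexts corresponds to multiplying their ciphertexts.

```python
import math
import random

def keygen(p=61, q=53):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, lam, mu

def encrypt(n, m):
    """Encrypt 0 <= m < n as c = (n + 1)^m * r^n mod n^2 with random r."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, lam, mu, c):
    """Decrypt with m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Ciphertext product corresponds to plaintext sum (mod n)."""
    return (c1 * c2) % (n * n)

def scale_encrypted(n, c, k):
    """Ciphertext power corresponds to plaintext multiplied by the constant k."""
    return pow(c, k, n * n)

if __name__ == "__main__":
    n, lam, mu = keygen()
    c1, c2 = encrypt(n, 123), encrypt(n, 456)
    print(decrypt(n, lam, mu, add_encrypted(n, c1, c2)))
    print(decrypt(n, lam, mu, scale_encrypted(n, c1, 5)))
```

With realistic key sizes, each `pow` call above operates on numbers thousands of bits long, which is the source of the client-side cost discussed in the text.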
Finally, the invention measures experimentally the time taken in one iteration by the K-means clustering algorithm with privacy protection and by the classical K-means algorithm. The results show that the time overhead introduced by encryption is relatively large. However, as the data scale increases, the ratio of the time of one iteration of the present invention to that of classical K-means becomes smaller and smaller. The time consumed by the present invention and by the classical K-means algorithm in one iteration is shown in Table 6, and the time ratio is shown in FIG. 7.
Table 6 Time consumed by one iteration of the present invention and of the classical K-means algorithm
Figure BDA0001264720220000141
The invention selects K-means, a typical algorithm in data mining, and performs mining on the horizontally partitioned joint data set of the two parties, while supporting both storage outsourcing and computation outsourcing to the cloud platform. The beneficial effects of the invention mainly include the following aspects:
(1) By analyzing the state of research on privacy-preserving data mining at home and abroad, the advantages and disadvantages of the existing techniques are clearly identified. Although schemes based on data perturbation execute more efficiently, they damage the original data set and therefore inevitably affect the mining result, whereas encryption-based schemes can guarantee the correctness of the mining result;
(2) The scheme of the invention supports data storage outsourcing. Compared with an ordinary PC (personal computer), the cloud platform has a larger storage capacity, so the scheme of the invention can be executed on larger data sets;
(3) The scheme of the invention supports data computation outsourcing. The cloud platform is a distributed computing framework that can integrate many resources into a cluster, which greatly improves the computing capacity of the system. The scheme outsources most of the computation to the cloud platform and, by relying on its strong computing capacity, greatly improves execution efficiency while ensuring correctness;
(4) The time complexity, space complexity, communication complexity and security of the algorithm are analyzed theoretically, and its correctness and efficiency are verified experimentally. The K-means clustering algorithm with privacy protection not only achieves secure computation under the semi-honest model, but also, in the centroid recalculation stage, supports secure computation when at most one of the three participants is malicious.
The above-described embodiments are intended to be illustrative, and not restrictive, of the invention, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A K-means clustering method with privacy protection, characterized by comprising the following steps:
S1: the data owners A and B encrypt their respective data and then upload the ciphertext to the server;
S2: the data owners A and B each randomly select k centroid points, encrypt them, and upload them to the server;
S3: the server calculates the Euclidean distance from each ciphertext data point to the centroid points through a secure distance calculation protocol, and classifies the data points by comparing the calculated Euclidean distances through a secure comparison protocol;
S4: the server, the data owner A and the data owner B recalculate the new k centroid points through a secure circuit protocol;
S5: the data owner A or B judges the distance between the new centroid points and the original centroid points in the ciphertext data through a secure comparison protocol; if the distance is smaller than a threshold value, the classification is finished, and the data owners A and B request the server to send the classified data to the data owner A and the data owner B respectively; if the distance is not smaller than the threshold value, the method returns to step S2 for the next iteration.
2. The K-means clustering method with privacy protection as claimed in claim 1, wherein: in step S1, the server is a cloud server, and the cloud server stores the encrypted data uploaded by the data owners a and B in the cloud file system.
3. The K-means clustering method with privacy protection as claimed in claim 2, wherein: in step S2, the selection of the centroid points includes selection of the number of centroid points and of their values, and specifically includes the following steps:
S21: the data owner A and the data owner B each randomly select k centroid points;
S22: each iterates on its own data set according to the traditional K-means clustering algorithm and classifies the data;
S23: the distance from each data point to its corresponding centroid point is calculated, and the sum S of the distances of all the data points is calculated;
S24: when the sums S corresponding to k-1, k and k+1 centroid points do not differ greatly, k is taken as the number of centroid points;
S25: the data owners A and B each calculate the average of the values of their respective centroid points, and this average is the value of the k centroid points.
4. The K-means clustering method with privacy protection as claimed in claim 3, wherein: the calculation method of step S3 includes the following steps:
S31: the server calculates the ciphertext distance between each ciphertext record of the data owner A and the ciphertext centroid point uploaded by the data owner A, and calculates the ciphertext distance between each ciphertext record of the data owner B and the ciphertext centroid point uploaded by the data owner B;
S32: the server and the data owner A jointly calculate the ciphertext distance between each data point and the centroid point of the data owner A by using the secure distance calculation protocol; the server and the data owner B jointly calculate the ciphertext distance between each data point and the centroid point of the data owner B by using the secure distance calculation protocol;
S33: the server classifies the data of the data owners A and B into the closest class according to the set of ciphertext distances obtained in step S32, and stores the data of each owner separately within the same class.
5. The K-means clustering method with privacy protection as claimed in claim 4, wherein: the processing method of step S4 includes the following steps:
S41: the server sends the data points separately stored in the same class to the corresponding data owners A and B respectively;
S42: the data owners A and B decrypt the data points;
S43: the server and the data owners A and B compute a new centroid point in this class using the secure circuit protocol.
6. A system for implementing the K-means clustering method with privacy protection of any one of claims 1-5, characterized by comprising a database, a first client used by a data owner A and a second client used by a data owner B, wherein the first client and the second client are used for encrypting their respective data and uploading the ciphertext to a server, each randomly selecting k centroid points, encrypting them and uploading them to the server, and, after the server has performed classification, recalculating the new k centroid points together with the server and judging the distance between the new centroid points and the original centroid points; if the distance is smaller than a threshold value, the classification is finished and the server is requested to send the classified data to the first client and the second client respectively; otherwise, the centroid points are re-selected; the server is used for receiving the data uploaded by the first client and the second client, calculating the Euclidean distances from the data points to the centroid points, classifying the data points according to the calculated Euclidean distances, and then recalculating the new k centroid points together with the first client and the second client.
7. The system of claim 6, wherein: the server is a cloud server, and the cloud server stores the encrypted data uploaded by the data owners A and B in a cloud file system.
8. The system of claim 7, wherein: the selection of the centroid points by the first client and the second client comprises selection of the number of centroid points and of their values, and specifically comprises the following modules:
a centroid point selection module: used for randomly selecting k centroid points;
a classification module: used for iterating on the respective data sets according to the traditional K-means clustering algorithm and classifying the data;
a secure distance calculation module: used for calculating the distance from each data point to its corresponding centroid point through the secure distance calculation protocol and calculating the sum S of the distances of all the data points;
a centroid point number selection module: used for determining that k is the number of centroid points when the sums S corresponding to k-1, k and k+1 centroid points do not differ greatly;
a centroid point value selection module: used for calculating an average value from the values of the respective centroid points, this average being the value of the k centroid points.
9. The system of claim 8, wherein: the server includes:
a first ciphertext distance calculation module: used for calculating the ciphertext distance between each ciphertext record of the first client and the ciphertext centroid point uploaded by the first client, and for calculating the ciphertext distance between each ciphertext record of the second client and the ciphertext centroid point uploaded by the second client;
a second ciphertext distance calculation module: used for calculating, together with the first client, the ciphertext distance between each data point and the centroid point of the first client, and for calculating, together with the second client, the ciphertext distance between each data point and the centroid point of the second client;
a classification module: used for dividing the data of the first client and the second client into the closest classes according to the set of ciphertext distances calculated by the second ciphertext distance calculation module, and storing the data of each client separately within the same class.
10. The system of claim 9, wherein: the server further comprises a sending module: used for sending the data points separately stored in the same class to the corresponding first client and second client respectively;
a secure centroid point calculation module: used for calculating, together with the first client and the second client, a new centroid point in the same class through the secure circuit protocol.
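For orientation, the control flow of the claimed iteration (distance calculation and classification, centroid recalculation, threshold test) can be sketched on plaintext data. This is a hedged illustration only: it omits all encryption and replaces the secure distance, secure comparison and secure circuit protocols with ordinary arithmetic, and the function names, the default threshold and the iteration cap are assumptions of this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Plaintext stand-in for S3: label each point with its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def recompute(points, labels, centroids):
    """Plaintext stand-in for S4: each new centroid is the mean of its class."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(tuple(old))  # keep the centroid of an empty class
            continue
        new_centroids.append(tuple(sum(p[d] for p in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids

def kmeans(points_a, points_b, centroids, threshold=1e-6, max_iter=100):
    """Iterate until no centroid moves by threshold or more (stand-in for S5)."""
    # Horizontally partitioned data: both owners hold records with the same attributes.
    points = list(points_a) + list(points_b)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = recompute(points, labels, centroids)
        shift = max(sq_dist(c, n) ** 0.5 for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, labels

if __name__ == "__main__":
    owner_a = [(0, 0), (0, 2)]
    owner_b = [(10, 0), (10, 2)]
    print(kmeans(owner_a, owner_b, [(0, 0), (10, 0)]))
```

In the claimed method each of these plaintext steps is carried out on ciphertext between the server and the two data owners; the sketch only shows where the threshold test sits in the loop.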
CN201710224275.7A 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function Active CN107145791B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710224275.7A CN107145791B (en) 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function
PCT/CN2017/117943 WO2018184407A1 (en) 2017-04-07 2017-12-22 K-means clustering method and system having privacy protection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710224275.7A CN107145791B (en) 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function

Publications (2)

Publication Number Publication Date
CN107145791A CN107145791A (en) 2017-09-08
CN107145791B true CN107145791B (en) 2020-07-10

Family

ID=59775048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710224275.7A Active CN107145791B (en) 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function

Country Status (2)

Country Link
CN (1) CN107145791B (en)
WO (1) WO2018184407A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145791B (en) * 2017-04-07 2020-07-10 哈尔滨工业大学深圳研究生院 K-means clustering method and system with privacy protection function
CN107707494B (en) * 2017-10-10 2020-02-11 苏州大学 Optical fiber nonlinear equalization method for 64-QAM coherent optical communication system
CN107784663B (en) * 2017-11-14 2020-10-20 哈尔滨工业大学深圳研究生院 Depth information-based related filtering tracking method and device
CN109214205B (en) * 2018-08-01 2021-07-02 安徽师范大学 K-anonymity-based position and data privacy protection method in crowd-sourcing perception
CN109615021B (en) * 2018-12-20 2022-09-27 暨南大学 Privacy information protection method based on k-means clustering
CN110162999B (en) * 2019-05-08 2022-06-07 湖北工业大学 Income distribution difference kini coefficient measurement method based on privacy protection
CN110163292A (en) * 2019-05-28 2019-08-23 电子科技大学 Secret protection k-means clustering method based on vector homomorphic cryptography
CN110610196B (en) * 2019-08-14 2023-04-28 平安科技(深圳)有限公司 Desensitization method, system, computer device and computer readable storage medium
US11663521B2 (en) * 2019-11-06 2023-05-30 Visa International Service Association Two-server privacy-preserving clustering
CN111444545B (en) * 2020-06-12 2020-09-04 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
CN112487481B (en) * 2020-12-09 2022-06-10 重庆邮电大学 Verifiable multi-party k-means federal learning method with privacy protection
CN112508203B (en) * 2021-02-08 2021-06-15 同盾控股有限公司 Data clustering processing method, device, equipment and medium based on federal learning
CN113033915B (en) * 2021-04-16 2021-12-31 哈尔滨理工大学 Method and device for comparing shortest distance between car sharing user side and driver side
CN113438254B (en) * 2021-08-24 2021-11-05 北京金睛云华科技有限公司 Distributed classification method and system for ciphertext data in cloud environment
CN114154554A (en) * 2021-10-28 2022-03-08 上海海洋大学 Privacy protection outsourcing data KNN algorithm based on non-collusion double-cloud server
CN117688502B (en) * 2024-02-04 2024-04-30 山东大学 Safe outsourcing calculation method and system for detecting local abnormal factors

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138923A (en) * 2015-08-11 2015-12-09 苏州大学 Privacy protection time sequence similarity calculation method
CN105760780A (en) * 2016-02-29 2016-07-13 福建师范大学 Trajectory data privacy protection method based on road network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102970143B (en) * 2012-12-13 2015-04-22 中国科学技术大学苏州研究院 Method for securely computing index of sum of held data of both parties by adopting addition homomorphic encryption
US9710493B2 (en) * 2013-03-08 2017-07-18 Microsoft Technology Licensing, Llc Approximate K-means via cluster closures
CN107145791B (en) * 2017-04-07 2020-07-10 哈尔滨工业大学深圳研究生院 K-means clustering method and system with privacy protection function

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Outsourcing Two-party Privacy Preserving K-means Clustering Protocol in Wireless Sensor Networks"; Liu Xiaoyan et al.; IEEE Computer Society; 20151231 (No. 11); pp. 124-133 *
"Fast clustering algorithm with privacy protection"; Xue Anrong et al.; Systems Engineering and Electronics; 20091030 (No. 10); pp. 2521-2526 *

Also Published As

Publication number Publication date
CN107145791A (en) 2017-09-08
WO2018184407A1 (en) 2018-10-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant