CN107145791A

CN107145791A - Privacy-preserving K-means clustering method and system

Info

Publication number: CN107145791A
Application number: CN201710224275.7A
Authority: CN
Inventors: 王轩; 蒋琳; 李晔; 姚霖; 刘泽超; 靳亚宾; 梁玉冬; 刘猛; 漆舒汉
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2017-04-07
Filing date: 2017-04-07
Publication date: 2017-09-08
Anticipated expiration: 2037-04-07
Also published as: CN107145791B; WO2018184407A1

Abstract

The present invention provides a privacy-preserving K-means clustering method and system, and belongs to the technical field of data mining. The method comprises the following steps: data owners A and B encrypt their respective data and randomly selected centroids and upload them to a server; the server computes, over the ciphertext data, the Euclidean distance from each data point to the centroids by means of a secure multiplication protocol and a secure distance computation protocol, and classifies the data points; the server and data owners A and B jointly recompute the new centroids over the ciphertext data by means of a secure circuit protocol; data owner A or B determines the distance between the new centroids and the previous centroids by means of a secure comparison protocol; if the distance is below a threshold, classification ends and data owners A and B request the server to send the classified data to A and B respectively; otherwise, the new centroids are uploaded again and the next iteration is carried out. The invention guarantees the correctness of the data mining results while keeping the data private; it supports outsourcing of both data storage and data computation, and substantially improves execution efficiency while preserving correctness; and it supports secure computation when a majority of the three participants are malicious.

Description

A privacy-preserving K-means clustering method and system

Technical field

The present invention relates to the technical field of data mining, and in particular to a privacy-preserving K-means clustering method and to a system implementing the method.

Background art

K-means clustering is one of the classic and most widely used methods in data mining: it groups similar data items together by computing the distances between them. As informatization, digitization and networking accelerate and economic globalization becomes an irreversible trend, the data sources of clustering algorithms are increasingly diverse and data security is increasingly important. Because the data may come from multiple participants and may contain sensitive or personal information about them, sharing it among the participants would leave its privacy unprotected. Privacy-preserving joint data mining mines the participants' combined database and extracts useful information while protecting both the user data and the privacy of the results. How to design a privacy-preserving joint data mining algorithm is therefore a problem that needs to be solved.

The semi-honest model fits many practical scenarios; under this model, data privacy relies on every participant always following the protocol. However, to guarantee privacy, solutions under this model often incur such high computation and communication costs that they are infeasible in practice.

The traditional K-means clustering algorithm is a classic clustering algorithm based on Euclidean distance. It consists of three main steps: choosing the centroids, classifying the data points, and recomputing the new centroids. Assume the training samples are $\{x_i \in \mathbb{R}^l \mid 1 \le i \le m\}$, where $m$ is the number of samples and $l$ is their dimension. First, $k$ centroids are chosen at random, denoted $M = \{\mu_c \in \mathbb{R}^l \mid 1 \le c \le k\}$. Next, the distance from each data point $x_i$ to each centroid $\mu_c$ is computed, and $x_i$ is assigned to the class of its nearest centroid: $C_i := \arg\min_c \lVert x_i - \mu_c \rVert^2$. Finally, each centroid $\mu_c$ is recomputed as the mean of the points assigned to it:

$$\mu_c := \frac{\sum_{i=1}^{m} 1\{C_i = c\}\, x_i}{\sum_{i=1}^{m} 1\{C_i = c\}}$$

The traditional K-means clustering algorithm thus has three main steps: choosing the centroids, classifying the data points, and recomputing the centroids. During classification, the Euclidean distance from each data point to every centroid is computed first, and the point is then assigned to its nearest centroid. The squared Euclidean distance is used here because squaring leaves the ordering of two values unchanged while making them easier to compare. When the centroids are recomputed, the component-wise sums of the data points in each class are needed, and these points may come from different participants, so the computation may raise privacy concerns. In short, the computations of the traditional K-means clustering algorithm can leak private data.
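As a point of reference for the steps just described, the following is a minimal plaintext sketch of one classification step and one centroid recomputation. It is not the privacy-preserving protocol; the function names `assign` and `recompute` are illustrative assumptions, and keeping the old centroid for an empty class is a choice made here, not something the text specifies.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid,
    using the squared Euclidean distance."""
    labels = []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, mu)) for mu in centroids]
        labels.append(d2.index(min(d2)))
    return labels


def recompute(points, labels, centroids):
    """Return the mean of the points in each class as the new centroid."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == c]
        if not members:
            # Assumption: an empty class keeps its previous centroid.
            new_centroids.append(tuple(old))
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(x[j] for x in members) / len(members) for j in range(dim))
        )
    return new_centroids
```

The component-wise sums inside `recompute` are exactly the quantities that span both participants' data, which is why this step needs protection in the joint setting.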

Summary of the invention

To solve the problems in the prior art, the present invention provides a privacy-preserving K-means clustering method, and also provides a system implementing the method.

The method of the invention comprises the following steps:

S1: Data owners A and B encrypt their respective data and upload the ciphertext to a server;

S2: Data owners A and B each randomly select k centroids, encrypt them, and upload them to the server;

S3: The server computes the Euclidean distance from each ciphertext data point to the centroids using the secure distance computation protocol, and classifies the data points by the computed distances using the secure comparison protocol;

S4: The server and data owners A and B jointly recompute k new centroids using the secure circuit protocol;

S5: Data owner A or B determines, using the secure comparison protocol, the distance between the new centroids and the previous centroids over the ciphertext data. If the distance is below a threshold, classification ends, and data owners A and B request the server to send the classified data to A and B respectively; otherwise, the method returns to step S2 for the next iteration.

In a further improvement of the invention, in step S1 the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.

In a further improvement of the invention, in step S2 the selection of the centroids covers both their number and their values, and specifically comprises the following steps:

S21: Data owners A and B each randomly select k centroids;

S22: Each runs an iteration of the traditional K-means clustering algorithm on its own data set and classifies the data points;

S23: Each computes the distance from every data point to its corresponding centroid, and the sum S of the distances of all data points;

S24: When the sums S corresponding to k-1, k and k+1 centroids differ only slightly, k is taken as the number of centroids;

S25: Data owners A and B each compute the average of the values of their respective centroids, and this average gives the values of the k centroids.
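Steps S23 and S24 can be sketched in plaintext as follows. This is only an illustration under stated assumptions: `distance_sum`, `pick_k` and the relative tolerance `rel_tol` are hypothetical names and values, since the text says only that S should change "little" between k-1, k and k+1.

```python
import math


def distance_sum(points, centroids):
    """S23: sum over all points of the distance to the nearest centroid."""
    return sum(min(math.dist(x, mu) for mu in centroids) for x in points)


def pick_k(s_by_k, rel_tol=0.05):
    """S24: return the smallest k whose S is close to S at k-1 and k+1.

    s_by_k maps each candidate k to its distance sum S.
    rel_tol is a hypothetical tolerance, not a value from the text.
    Returns None when no candidate qualifies.
    """
    for k in sorted(s_by_k):
        if k - 1 in s_by_k and k + 1 in s_by_k:
            s = s_by_k[k]
            if (abs(s_by_k[k - 1] - s) <= rel_tol * s
                    and abs(s_by_k[k + 1] - s) <= rel_tol * s):
                return k
    return None
```

Each data owner would run this on its own data set; `math.dist` requires Python 3.8 or later.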

In a further improvement of the invention, step S3 comprises the following steps:

S31: The server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroids uploaded by A, and between each ciphertext record of data owner B and the ciphertext centroids uploaded by B;

S32: The server and data owner A jointly compute, using the secure distance computation protocol, the ciphertext distance between each of A's data points and the centroids; the server and data owner B likewise jointly compute the ciphertext distance between each of B's data points and the centroids;

S33: Based on the set of ciphertext distances obtained in step S32, the server assigns the data of data owners A and B to their nearest classes, storing A's and B's data separately within each class.

In a further improvement of the invention, step S4 comprises the following steps:

S41: The server sends the data points stored separately within each class to the corresponding data owners A and B;

S42: Data owners A and B decrypt them;

S43: The server and data owners A and B compute the new centroid of each class using the secure circuit protocol.

The present invention also provides a system implementing the method, comprising a database server, a first client used by data owner A and a second client used by data owner B. The first client and the second client are used to encrypt their respective data and upload the ciphertext to the server; to each randomly select k centroids, encrypt them and upload them to the server; to wait for the server to classify the data and then recompute k new centroids jointly with the server; and to determine the distance between the new centroids and the previous centroids. If the distance is below a threshold, classification ends and the clients request the server to send the classified data to the first client and the second client respectively; otherwise the centroids are uploaded again. The server is used to receive the data uploaded by the first client and the second client, compute the Euclidean distance from each data point to the centroids, classify the data points by the computed distances, and then recompute k new centroids jointly with the first client and the second client.

In a further improvement of the invention, the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.

In a further improvement of the invention, the centroid selection of the first client and the second client covers both the number and the values of the centroids, and specifically comprises the following modules:

Centroid selection module: randomly selects k centroids;

Classification module: runs an iteration of the traditional K-means clustering algorithm on the client's own data set and classifies the data points;

Secure distance computation module: computes, using the secure distance computation protocol, the distance from every data point to its corresponding centroid, and the sum S of the distances of all data points;

Centroid number selection module: determines that k is the number of centroids when the sums S corresponding to k-1, k and k+1 centroids differ only slightly;

Centroid value selection module: computes the average of the values of the client's centroids, this average giving the values of the k centroids.

In a further improvement of the invention, the server comprises:

First ciphertext distance computation module: computes the ciphertext distance between each ciphertext record of the first client and the ciphertext centroids it uploaded, and between each ciphertext record of data owner B and the ciphertext centroids it uploaded;

Second ciphertext distance computation module: computes, jointly with the first client, the ciphertext distance between each data point of the first client and the centroids, and, jointly with the second client, the ciphertext distance between each data point of the second client and the centroids;

Classification module: based on the set of ciphertext distances computed by the second ciphertext distance computation module, assigns the data of the first client and the second client to their nearest classes, storing them separately within each class.

In a further improvement of the invention, the server also comprises:

Sending module: sends the data points stored separately within each class to the corresponding first client and second client;

Secure centroid computation module: computes, together with the first client and the second client, the new centroid of each class using the secure circuit protocol.

Compared with the prior art, the beneficial effects of the invention are as follows. By using encryption, the application guarantees both the security of the data during mining and the correctness of the results. It supports data storage outsourcing and can therefore run on larger data sets. It supports data computation outsourcing, delegating most of the computation to the cloud platform, whose powerful computing capability substantially improves execution efficiency while preserving correctness. It not only achieves secure computation under the semi-honest model but also, in the centroid recomputation stage, supports secure computation when a majority of the three participants are malicious.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a schematic diagram of the structure of the system of the invention;

Fig. 3 is a schematic diagram of the time consumed by the server and the client in the previous scheme;

Fig. 4 is a schematic diagram of the time consumed by the server and the client in the invention;

Fig. 5 is a schematic diagram of the shares of time consumed by the server and the client in the previous scheme;

Fig. 6 is a schematic diagram of the shares of time consumed by the server and the client in the invention;

Fig. 7 shows the ratio of the time consumed by the invention to that consumed by the traditional K-means clustering algorithm.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

To address the performance problems of privacy-preserving data mining, the invention builds on an in-depth study of existing privacy-preserving data mining algorithms and proposes an efficient privacy-preserving K-means clustering algorithm on horizontally partitioned data sets. The scheme involves two data owners and a cloud platform, and supports storage outsourcing and computation outsourcing at the same time. Data is stored in the cloud in ciphertext form, and the cloud platform completes the K-means clustering task on the two parties' joint data set by interacting with the two data owners. The invention designs separate security protocols to solve three technical difficulties of privacy-preserving K-means clustering: a secure distance computation protocol for the ciphertext distance computation problem, a secure comparison protocol for the ciphertext comparison problem, and a secure circuit protocol for the ciphertext division problem. These protocols are then applied within the clustering framework to realise the privacy-preserving K-means clustering algorithm.

As shown in Fig. 1, the privacy-preserving K-means clustering method of the invention mainly comprises five steps, which are described in detail below:

Step S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server. In this example, data owner A is Alice, data owner B is Bob, and the server is C.

Alice and Bob encrypt their data D_x and D_y with their own public keys pk₁ and pk₂ respectively, obtaining the ciphertexts C_x and C_y, and then upload C_x and C_y to C. Every record in D_x and D_y has l dimensions, so encrypting a database means encrypting every dimension of every record: for a database of m records, where m is the number of records, each of the m × l values is encrypted individually. All of Alice's and Bob's data is stored in the cloud file system in ciphertext form.

Step S2: Alice and Bob each select k centroids, encrypt them with their respective public keys, and upload them to C.

In this example, the selection of the centroids is a very important step: it directly determines the number of iterations and hence the overall execution time of the system, so good centroids speed up convergence and improve execution efficiency. Centroid selection has two parts. The first is the number of centroids. Alice and Bob each pick a random value of k and k random centroids, and then run one iteration on their own data set. After classification, each computes the distance from every data point to its corresponding centroid and sums all these distances to obtain S. When the values of S corresponding to k-1, k and k+1 differ only slightly, k is taken as the number of centroids. Alice and Bob each find their own k, and the average of the two values is the final k. The second part is the values of the centroids. Alice randomly selects k centroids M = {μ_c | 1 ≤ c ≤ k}, where μ_c = {μ_cj | 1 ≤ j ≤ l}. Alice and Bob then encrypt the centroids with Alice's and Bob's public keys respectively and upload the ciphertext centroids to the cloud.

Step S3: Server C computes the Euclidean distance from each ciphertext data point to the centroids using the secure distance computation protocol, and then classifies the data points by the computed distances using the secure comparison protocol. Specifically:

C computes the ciphertext distance between each of Alice's ciphertext records and each of her ciphertext centroids, and between each of Bob's ciphertext records and each of his ciphertext centroids. C and Alice jointly run the SSED (secure distance computation) protocol to compute the ciphertext distance between each x_i and μ_c, and C and Bob jointly run the SSED protocol to compute the ciphertext distance between each y_i and μ_c. All ciphertext distances between the x_i and the μ_c are stored in one set, and all ciphertext distances between the y_i and the μ_c in another.

The homomorphic encryption used in this method is Paillier encryption, a partially homomorphic scheme that supports addition on ciphertexts. It is a probabilistic encryption scheme defined by a 4-tuple, Enc_pa = {KeyGen, Encrypt, Decrypt, Evaluate}. The Paillier procedures are as follows:

● KeyGen(1^k) → (pk, sk):

(1) Select two large primes $p$ and $q$ satisfying $\gcd(pq, (p-1)(q-1)) = 1$;

(2) Compute $N = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$;

(3) Randomly select an integer $g \in \mathbb{Z}^*_{N^2}$;

(4) Find $\mu$ such that $\mu = (L(g^{\lambda} \bmod N^2))^{-1} \bmod N$, where $L$ is the function $L(u) = (u-1)/N$. The public key is then $(N, g)$ and the private key is $(\lambda, \mu)$.

● Encrypt(x, r) → c:

Given a plaintext $x$, select a random number $r$; the ciphertext is $c = g^x r^N \bmod N^2$. Encryption is also written $E_{pk}(x) = c$.

● Decrypt(c) → x:

Decryption computes $x = L(c^{\lambda} \bmod N^2) \cdot \mu \bmod N$. $D_{sk}(c)$ denotes Decrypt(c).

● Evaluate:

$E_{pk}(x) \cdot E_{pk}(y) = E_{pk}(x+y)$ and $E_{pk}(x)^y = E_{pk}(xy)$, where $x$ and $y$ are two plaintexts.
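The four Paillier procedures can be illustrated with a small, insecure toy implementation. This is a sketch under stated assumptions: the primes used in the usage note are far too small for real use, decryption multiplies by μ as in the standard Paillier scheme, and the function names are illustrative rather than taken from the patent.

```python
import math
import random


def L(u, n):
    """The Paillier function L(u) = (u - 1) / N."""
    return (u - 1) // n


def keygen(p, q):
    """Return ((N, g), (lambda, mu)) for primes p and q."""
    n = p * q
    if math.gcd(n, (p - 1) * (q - 1)) != 1:
        raise ValueError("need gcd(pq, (p-1)(q-1)) = 1")
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    n2 = n * n
    while True:
        g = random.randrange(2, n2)
        if math.gcd(g, n2) != 1:
            continue
        u = L(pow(g, lam, n2), n)
        if math.gcd(u, n) == 1:        # mu exists only if u is invertible mod N
            return (n, g), (lam, pow(u, -1, n))


def encrypt(pk, x):
    """c = g^x * r^N mod N^2 for a fresh random r coprime to N."""
    n, g = pk
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, x, n2) * pow(r, n, n2) % n2


def decrypt(pk, sk, c):
    """x = L(c^lambda mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n
```

For example, with `pk, sk = keygen(1009, 1013)`, multiplying two ciphertexts modulo N² decrypts to the sum of the plaintexts, and raising a ciphertext to a plaintext power decrypts to their product, which is the Evaluate property. The modular inverse via `pow(u, -1, n)` requires Python 3.8 or later.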

The secure distance computation protocol of this example is built on a secure multiplication protocol, whose procedure is as follows:

Here, $Z_N$ denotes the positive integer space, meaning that $r_x$ and $r_y$ are positive integers.
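The protocol listing itself does not survive in this text. As a hedged illustration only, the sketch below shows the blinding-based secure multiplication construction commonly used with Paillier, which is assumed to be close to what is intended because it also relies on two random values r_x and r_y. The toy primes, the choice g = N + 1, and all function names are assumptions, and both parties run in a single process here.

```python
import math
import random

# Toy Paillier parameters. g = N + 1 is an assumed simplification;
# the text selects g at random.
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)   # with g = N + 1, L(g^LAM mod N^2) = LAM mod N


def enc(x):
    """Encrypt x under the data owner's public key."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + (x % N) * N) * pow(r, N, N2) % N2   # (N+1)^x = 1 + xN mod N^2


def dec(c):
    """Decrypt with the data owner's private key."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def key_holder_multiply(a_blind, b_blind):
    """Runs at the data owner: sees only blinded values x + rx and y + ry."""
    return enc(dec(a_blind) * dec(b_blind) % N)


def secure_multiply(ex, ey):
    """Runs at server C: turns E(x) and E(y) into E(x * y mod N)."""
    rx, ry = random.randrange(1, N), random.randrange(1, N)
    a_blind = ex * enc(rx) % N2                  # E(x + rx)
    b_blind = ey * enc(ry) % N2                  # E(y + ry)
    eh = key_holder_multiply(a_blind, b_blind)   # E((x + rx)(y + ry))
    # Remove the cross terms: xy = h - x*ry - y*rx - rx*ry (mod N).
    s = eh * pow(ex, N - ry, N2) % N2
    s = s * pow(ey, N - rx, N2) % N2
    return s * enc(N - rx * ry % N) % N2
```

Server C never decrypts anything, and the key holder only sees values masked by r_x and r_y. Negative numbers are represented modulo N, so squaring the encryption of N - 3 yields an encryption of 9.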

The procedure of the secure distance computation protocol of this example is as follows:
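The listing for this protocol is likewise absent from the text. Assuming the squared Euclidean distance described in the background section and the Paillier Evaluate properties, one plausible derivation, given as a sketch rather than the patent's own listing, is the following, where SM denotes the secure multiplication protocol and $x_{ij}$, $\mu_{cj}$ are the $j$-th components of $x_i$ and $\mu_c$:

```latex
\begin{aligned}
E(x_{ij} - \mu_{cj}) &= E(x_{ij}) \cdot E(\mu_{cj})^{N-1} \bmod N^2 \\
E\big((x_{ij} - \mu_{cj})^2\big) &= \mathrm{SM}\big(E(x_{ij} - \mu_{cj}),\; E(x_{ij} - \mu_{cj})\big) \\
E\big(\lVert x_i - \mu_c \rVert^2\big) &= \prod_{j=1}^{l} E\big((x_{ij} - \mu_{cj})^2\big) \bmod N^2
\end{aligned}
```

The first line uses $E(y)^{N-1} = E(-y)$, the second needs one interaction with the key holder per dimension, and the third uses only the additive homomorphism, so the server obtains the encrypted squared distance without seeing any plaintext.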

C then classifies all the data points, as follows:

By comparing the distances in the two sets of ciphertext distances, C assigns each x_i and y_i to its nearest class. C and Alice run the secure comparison protocol on Alice's distances, and C and Bob run it on Bob's distances; all ciphertexts are then placed in the corresponding classes in CL₁ and CL₂. Each class in CL₁ stores Alice's data points assigned to class c, and each class in CL₂ stores Bob's data points assigned to class c.

The procedure of the secure comparison protocol is as follows:

Step S4: C, Alice and Bob jointly recompute the k centroids using the secure circuit protocol. Because the data in CL₁ and CL₂ is encrypted under the two participants' different public keys, the new centroids cannot be computed directly. In this example, C first sends CL₁ and CL₂ to Alice and Bob respectively, who decrypt them to obtain L₁ and L₂.

C, Alice and Bob then run the SC (secure circuit) protocol on Alice's and Bob's data in each class to compute each component μ_cj of the new centroid. The SC protocol ensures that Alice and Bob obtain all the new centroids.
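The value the secure circuit produces for one class can be illustrated in plaintext. The sketch below assumes, consistently with the background section, that the new centroid is the mean over both owners' points in the class; in the protocol this division is evaluated inside the secure circuit so that neither owner sees the other's sums. The name `joint_centroid` is illustrative, and no circuit is implemented here.

```python
def joint_centroid(alice_points, bob_points):
    """Mean of all points that Alice and Bob hold in one class."""
    count = len(alice_points) + len(bob_points)
    if count == 0:
        raise ValueError("class has no points")
    dim = len((alice_points or bob_points)[0])
    sums = [0.0] * dim
    for point in list(alice_points) + list(bob_points):
        for j in range(dim):
            sums[j] += point[j]       # component-wise sum across both owners
    return tuple(s / count for s in sums)
```

Each component of the result corresponds to one μ_cj: the numerator combines Alice's and Bob's component sums and the denominator combines their point counts.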

The procedure of the secure circuit protocol is as follows:

Step S5: Alice uses the secure comparison protocol to compute the distance between the new centroids and the previous centroids. If the distance is below the threshold, Alice and Bob ask C to send the classified data to Alice and Bob respectively. Otherwise, Alice and Bob encrypt the new centroids with their respective public keys and upload them to C for the next iteration.
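The stopping rule of step S5 can be sketched in plaintext as follows. The text does not say how the distances of the k centroids are combined, so taking the largest shift is an assumption made here, as are the names `converged` and `threshold`; in the protocol the comparison itself is carried out with the secure comparison protocol.

```python
import math


def converged(old_centroids, new_centroids, threshold):
    """True when every centroid moved less than threshold.

    Assumption: the largest per-centroid Euclidean shift is compared
    with the threshold; the text does not fix the aggregation.
    """
    shift = max(math.dist(o, n) for o, n in zip(old_centroids, new_centroids))
    return shift < threshold
```

When `converged` returns False, the new centroids are encrypted and uploaded and another iteration begins.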

As shown in Fig. 2, the present invention also provides a system implementing the above method. The system of this example comprises a database server C, a first client P₁ used by data owner A and a second client P₂ used by data owner B. The first client P₁ and the second client P₂ are used to encrypt their respective data and upload the ciphertext to the server; to each randomly select k centroids, encrypt them and upload them to the server; to wait for the server to classify the data and then recompute k new centroids jointly with the server; and to determine the distance between the new centroids and the previous centroids. If the distance is below a threshold, classification ends and the clients request the server to send the classified data to the first client P₁ and the second client P₂ respectively; otherwise the centroids are uploaded again. The server is used to receive the data uploaded by the first client P₁ and the second client P₂, compute the Euclidean distance from each data point to the centroids, classify the data points by the computed distances, and then recompute k new centroids jointly with the first client P₁ and the second client P₂.

In this example, server C is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system. This supports data storage outsourcing, so the method can run on larger data sets. It also supports data computation outsourcing: most of the computation is delegated to the cloud platform, whose powerful computing capability substantially improves execution efficiency while preserving correctness.

Analysis of the beneficial effects of the invention:

1. Scheme selected for comparison

The framework used by the invention was first proposed in the paper "Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks"; in this comparison, the method of that paper is referred to as the previous scheme. Clustering algorithms under the same framework are more comparable than those under other frameworks, so the invention is mainly compared with the previous scheme. To ensure a reliable experimental comparison, both schemes are run in the same experimental environment. The evaluation criteria for the two methods are explained below, followed by a comparative analysis of the experimental results.

2. Evaluation criteria

The time consumed by the method of the invention falls into three parts: client time, communication time and server time, where client and server time each include the initialization phase and the protocol execution phase. Because the method used in this application differs from that of the previous scheme, the two can only be compared at a macroscopic level. The comparison covers two aspects: a theoretical complexity analysis, comprising time complexity, space complexity and communication complexity, and a comparison of experimental test results. Since different numbers of iterations would affect the overall experimental results, this example takes a single iteration as the basis and compares the following:

(1) The theoretical time complexity, space complexity and communication complexity of the two schemes.

(2) The data encryption time of the two schemes.

(3) The time consumed by the server and the client of the two schemes in one iteration.

3. Analysis of experimental results

In theory, the scheme of the invention is lower than the previous scheme in time complexity, space complexity and communication complexity. The experimental results of the two schemes are analysed below on the basis of the experimental data.

First, the encryption time of the two schemes is compared. The previous scheme uses two encryption methods: all plaintext data must be encrypted once with the improved Liu encryption scheme and once more with the Paillier encryption scheme. In the scheme of the invention, all plaintext data needs only one Paillier encryption, so in theory its encryption should take less time than that of the previous scheme. However, Paillier operates on a group and involves many exponentiations, whereas the improved Liu encryption scheme consists entirely of linear operations, so most of the encryption time is due to Paillier encryption. The encryption time of the invention is therefore only slightly less than that of the previous scheme, with no order-of-magnitude difference, and the experimental results confirm this conclusion. The encryption time of the previous scheme is shown in Table 1, and that of the invention in Table 2.

Table 1. Encryption time of the previous scheme

Table 2. Encryption time of the invention

Next, the time consumed in one iteration was measured and compared. In theory, the cloud platform introduced by the invention provides powerful computing capability, so the invention should run somewhat more efficiently than the previous scheme. However, because the cloud platform of the invention consists of 30 PCs and one server, task division, task scheduling and data recording must be carried out for every machine while a task is processed, and these operations also take time. When there are more data points, one iteration takes longer, and the share of time spent on operations such as task division becomes lower. In the secure circuit protocol of the invention, generating the circuit takes considerable time, but the circuit only needs to be generated once, in the first iteration. In theory, therefore, when the number of data points is small, one iteration of the previous scheme is more efficient than one iteration of the scheme of the invention; once the number of data points exceeds a certain threshold, one iteration of the scheme of the invention is more efficient than the previous scheme; and as the data scale grows, the efficiency advantage of the invention becomes increasingly evident. The experimental results confirm this view well and show that the threshold is about 5000 data points: when the data scale is greater than 7000, one iteration of the scheme of the invention takes less time, and when the data scale is less than 5000, one iteration of the previous scheme takes less time. The comparison of the time consumed by one iteration of the two schemes is shown in Table 3.

Table 3. Comparison of the time consumed in one iteration

Within one iteration, the invention is concerned not only with the time the iteration consumes, but also with having server C take on more of the work in each iteration, that is, a higher share of the time consumed. In other words, while the iteration time is kept low, the ratio of the server's time to the total time of the iteration should be larger, which reduces the client's computational load; the efficiency of such a scheme then also increases as the data scale grows. The client mainly performs encryption and decryption, and the number of client encryptions and decryptions is essentially the same in the two schemes. However, in the previous scheme both ciphertext distance computation and ciphertext distance comparison use the improved Liu encryption, all of whose operations are linear, whereas the scheme of the invention uses the Paillier encryption algorithm, whose encryption and decryption require exponentiation and modular arithmetic on a group. For a client with little computing power, the improved Liu encryption algorithm should take less time than the Paillier encryption used in the invention. In theory, therefore, on data sets of the same size, the client in the previous scheme consumes less time than the client in the scheme of the invention. As the data scale grows, the time consumed by one iteration of the scheme of the invention is relatively small, while the time consumed by the client is relatively large. Consequently, as the data scale grows, the client's share of the time in the scheme of the invention grows and the server's share shrinks.

The experimental data collected and analysed bear out this conjecture well. The time consumed by each participant in one iteration of the two schemes is shown in Tables 4 and 5. The time consumed by the server and the client of the previous scheme is shown in Fig. 3, and that of the invention in Fig. 4.

Table 4. Time consumed by each participant in one iteration of the previous scheme

Table 5. Time consumed by each participant in one iteration of the present application

Figs. 3 and 4 show how the time consumed by the server and the client of the two schemes grows with the number of data points. In the experiments on the previous scheme, the server's time rises markedly as the data scale grows, while the client's time rises only slightly. This is mainly because the server's computing power is limited and its computations on the data are relatively complex. As the data scale grows, the server needs more and more time to process the data, so its time increases markedly and its share of the total time grows too. The client also has more data to process as the data scale grows, but, unlike the server's, most of its operations are linear computations, so the increase in its time is not pronounced and its share of the total time falls. In the invention, the server runs on a cloud platform composed of 30 PCs and 1 server, so its computing power is assured. Fig. 4 shows that the server's time increases with the data scale but without an obvious upward trend, whereas the client's time keeps growing with the data scale, mainly because the decryption operations performed by the client are exponentiations on a group, which require more computation than linear operations. Hence, as the data scale grows, the server's share of the time in the invention falls and the client's share rises. The shares of time consumed by the server and the client are shown in Fig. 5 for the previous scheme and in Fig. 6 for the invention.

Finally, the present invention measures experimentally the time needed to process the data in one iteration by the K-means clustering algorithm with privacy protection and by the classical K-means algorithm. It can be seen that the time overhead introduced by encryption is relatively large. However, as the data scale grows, the ratio of the time consumed by one iteration of the present invention to the time consumed by classical K-means becomes smaller and smaller. The time consumed in one iteration by the present invention and by the classical K-means algorithm is shown in Table 6, and the time ratio is shown in Fig. 7.

Table 6: Time consumed in one iteration by the present invention and by the classical K-means algorithm

The present invention selects the K-means algorithm, which is a typical algorithm in data mining, and mines a joint data set that is horizontally partitioned between two parties, while supporting both storage outsourcing and computation outsourcing to a cloud platform. The beneficial effects of the present invention mainly include the following aspects:

(1) By analyzing the state of privacy-preserving data mining in China and abroad, the advantages and disadvantages of the commonly used existing techniques are well understood. Although schemes based on data perturbation execute efficiently, they destroy the original data set, so the data mining results are inevitably affected to some degree, whereas encryption-based schemes can well guarantee the correctness of the mining results. The present invention uses encryption and thus well guarantees the correctness of the data mining results;

(2) The scheme of the present invention supports data storage outsourcing. Compared with an ordinary PC, a cloud platform has greater storage capacity, which allows the scheme of the present invention to be executed on larger data sets;

(3) The scheme of the present invention supports data computation outsourcing. A cloud platform is a distributed computing framework that can integrate many resources into one cluster, thereby greatly improving the computing power of the system. The scheme of the present invention outsources most of the computation to the cloud platform and, by relying on its strong computing power, greatly improves execution efficiency while guaranteeing correctness;

(4) The time complexity, space complexity, communication complexity and security of the algorithm are analyzed theoretically, and the correctness and efficiency of the algorithm are verified experimentally. The K-means clustering algorithm with privacy protection proposed by the present invention not only achieves secure computation under the semi-honest model but, in the centroid recomputation stage, also supports secure computation when a majority of the three participants are malicious.

The embodiments described above are preferred embodiments of the present invention and do not limit its specific scope of implementation. The scope of the present invention includes but is not limited to these embodiments, and all equivalent changes made in accordance with the present invention fall within its scope of protection.

Claims

1. A K-means clustering method with privacy protection, characterized by comprising the following steps:

S3: The server computes the Euclidean distance from each ciphertext data point to the centroids by means of a secure distance computation protocol, and classifies the data points according to the computed Euclidean distances by means of a secure comparison protocol;

S5: Data owner A or B judges, by means of the secure comparison protocol, the distance between the new centroids and the former centroids in the ciphertext data; if the distance is less than a threshold, classification ends and data owners A and B request the server to send the classified data to data owners A and B respectively; otherwise, the method returns to step S2 and performs the next iteration.
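
As a non-limiting illustration, the iteration of steps S3 and S5 can be sketched in plaintext Python as follows. The sketch omits all encryption and all secure protocols, so it shows only the control flow: classify points by Euclidean distance, recompute the centroids, and stop once the centroids move less than a threshold. The function names, the use of the largest centroid displacement as the compared distance, and the `threshold` and `max_iter` defaults are assumptions made for this sketch and are not taken from the claims.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centroids):
    """Plaintext analogue of step S3: nearest-centroid index per point."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def recompute(points, labels, centroids):
    """Mean of each class; an empty class keeps its former centroid."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members)
                      for d in range(len(old)))
            )
        else:
            new_centroids.append(tuple(old))
    return new_centroids


def kmeans(points, centroids, threshold=1e-6, max_iter=100):
    """Iterate until the largest centroid displacement is below threshold."""
    labels = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = recompute(points, labels, centroids)
        # Plaintext analogue of step S5 (assumed: compare the largest shift).
        shift = max(euclidean(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, labels
```

In the claimed method the same comparisons are carried out on ciphertext, so this sketch is useful only as the classical baseline against which the encrypted iteration can be checked.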

2. The K-means clustering method with privacy protection according to claim 1, characterized in that: in step S1, the server is a cloud server, and the cloud server in turn stores the encrypted data uploaded by data owners A and B in a cloud file system.

3. The K-means clustering method with privacy protection according to claim 2, characterized in that: in step S2, the selection of the centroids includes selecting the number and the values of the centroids, and specifically comprises the following steps:

S21: Data owners A and B each randomly select k centroids;

S25: Data owners A and B each compute the average of the values of their respective centroids, and the average is taken as the value of the k centroids.

4. The K-means clustering method with privacy protection according to claim 3, characterized in that the computation of step S3 comprises the following steps:

S31: The server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroids uploaded by data owner A, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroids uploaded by data owner B;

S32: The server and data owner A jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of data owner A and the centroids; the server and data owner B jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of data owner B and the centroids;

S33: According to the set of ciphertext distances obtained in step S32, the server assigns the data of data owners A and B to the nearest class, and stores the data of the two owners separately within the same class.
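
The bookkeeping of step S33 can be illustrated, without any encryption, by the following Python sketch. It assigns each point to its nearest centroid while keeping the points of the two owners in separate lists inside every class. The function names, the dictionary layout and the use of a single shared centroid list are simplifying assumptions of this sketch; the claimed method operates on ciphertext distances obtained through the secure distance computation protocol.

```python
def squared_distance(p, q):
    """Squared Euclidean distance; sufficient for finding the nearest centroid."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify_by_owner(data_by_owner, centroids):
    """Assign every point to its nearest class, stored separately per owner.

    data_by_owner maps an owner label (for example "A" or "B") to a list
    of points. The result is one dictionary per class, mapping each owner
    label to the points of that owner that fell into the class.
    """
    classes = [{owner: [] for owner in data_by_owner} for _ in centroids]
    for owner, points in data_by_owner.items():
        for p in points:
            distances = [squared_distance(p, c) for c in centroids]
            nearest = distances.index(min(distances))
            classes[nearest][owner].append(p)
    return classes
```

Keeping the per-owner lists apart inside each class is what later allows the classified data to be returned to the owner that uploaded it.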

5. The K-means clustering method with privacy protection according to claim 4, characterized in that the processing of step S4 comprises the following steps:

S42: Data owners A and B perform decryption;

S43: The server and data owners A and B compute the new centroid in that class using a secure circuit protocol.
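
For a horizontally partitioned data set, the new centroid of a class equals the combined component-wise sum of both owners' points in that class divided by the combined count. The following plaintext Python sketch shows only this arithmetic. It does not implement the secure circuit protocol, and the function names and the choice to keep the former centroid for an empty class are assumptions of the sketch.

```python
def partial_stats(points, dim):
    """Component-wise sum and count of one owner's points in one class."""
    sums = [0.0] * dim
    for p in points:
        for d in range(dim):
            sums[d] += p[d]
    return sums, len(points)


def merge_centroid(stats_a, stats_b, old_centroid):
    """Combine the two owners' partial statistics into the new centroid.

    If neither owner has points in the class, the former centroid is kept
    (an assumption of this sketch).
    """
    sums_a, n_a = stats_a
    sums_b, n_b = stats_b
    total = n_a + n_b
    if total == 0:
        return tuple(old_centroid)
    return tuple((sa + sb) / total for sa, sb in zip(sums_a, sums_b))
```

For example, if owner A holds (0, 0) and (2, 2) and owner B holds (4, 4) in the same class, the merged centroid is (2.0, 2.0), the same value that would result from averaging the three points together.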

6. A system for implementing the K-means clustering method with privacy protection according to any one of claims 1-5, characterized by comprising a database, a first client used by data owner A and a second client used by data owner B, wherein the first client and the second client are each configured to encrypt their respective data and upload the ciphertext to the server, to randomly select k centroids, encrypt them and upload them to the server, to wait until the server has finished classification and then recompute k new centroids jointly with the server, and to judge the distance between the new centroids and the former centroids: if the distance is less than a threshold, classification ends and the server is requested to send the classified data to the first client and the second client respectively; otherwise, the centroids are uploaded again; and the server is configured to receive the data uploaded by the first client and the second client, compute the Euclidean distance from the data points to the centroids, classify the data points according to the computed Euclidean distances, and then recompute k new centroids jointly with the first client and the second client.

7. The system according to claim 6, characterized in that: the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.

8. The system according to claim 7, characterized in that: the centroid selection of the first client and the second client includes selecting the number and the values of the centroids, and specifically comprises the following modules:

Secure distance computation module: configured to compute, by means of the secure distance computation protocol, the distance from each data point to its corresponding centroid, and to compute the sum S of the distances of all data points;

Centroid value selection module: configured to compute the average of the values of the respective centroids, the average being taken as the value of the k centroids.

9. The system according to claim 8, characterized in that the server comprises:

First ciphertext distance computation module: configured to compute the ciphertext distance between each ciphertext record of the first client and the ciphertext centroids uploaded by the first client, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroids uploaded by data owner B;

Second ciphertext distance computation module: configured to compute, jointly with the first client, the ciphertext distance between each data point of the first client and the centroids, and to compute, jointly with the second client, the ciphertext distance between each data point of the second client and the centroids;

Classification module: configured to assign the data of the first client and the second client to the nearest class according to the set of ciphertext distances computed by the second ciphertext distance computation module, and to store the data of the two clients separately within the same class.

10. The system according to claim 9, characterized in that the server further comprises a sending module: configured to send the data points stored separately within the same class to the corresponding first client and second client respectively;

Secure centroid computation module: configured to compute, jointly with the first client and the second client, the new centroid in the same class by means of the secure circuit protocol.