CN107145791A - K-means clustering method and system with privacy protection - Google Patents

K-means clustering method and system with privacy protection

Info

Publication number
CN107145791A
Authority
CN
China
Prior art keywords
data
center
mass point
point
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710224275.7A
Other languages
Chinese (zh)
Other versions
CN107145791B (en
Inventor
王轩
蒋琳
李晔
姚霖
刘泽超
靳亚宾
梁玉冬
刘猛
漆舒汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201710224275.7A (CN107145791B)
Publication of CN107145791A
Priority to PCT/CN2017/117943 (WO2018184407A1)
Application granted
Publication of CN107145791B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

The present invention provides a K-means clustering method and system with privacy protection, belonging to the technical field of data mining. The method comprises the following steps: data owners A and B encrypt their respective data and their randomly selected centroids and upload them to a server; using a secure multiplication protocol and a secure distance computation protocol, the server computes over the ciphertext data the Euclidean distance from each data point to each centroid and classifies the data points; the server and data owners A and B jointly recompute new centroids over the ciphertext data through a secure circuit protocol; data owner A or B determines the distance between the new centroids and the previous centroids through a secure comparison protocol. If the distance is less than a threshold, classification ends and data owners A and B ask the server to send the classified data to A and B respectively; otherwise, the new centroids are uploaded again and the next iteration is carried out. The invention guarantees the correctness of the data mining result while keeping the data private; it supports outsourcing of both data storage and computation, and greatly improves execution efficiency while guaranteeing correctness; it also supports secure computation when a majority of the three participants are malicious parties.

Description

K-means clustering method and system with privacy protection
Technical field
The present invention relates to the technical field of data mining, and in particular to a K-means clustering method with privacy protection and to a system for implementing the method.
Background
K-means clustering is well known as one of the classical and most commonly used methods in data mining; by computing the distance between data items, it groups similar data items together. As informatization, digitization, and networking accelerate and economic globalization becomes an irreversible trend, the data sources of clustering algorithms are increasingly diverse and data security is increasingly important. Because the data may come from multiple participants and may contain sensitive or personal information about them, sharing the data among the participants would leave its privacy unprotected. Joint data mining with privacy protection can mine the joint database of multiple participants and extract useful information while protecting the privacy of both the user data and the mining results. How to design a joint data mining algorithm with privacy protection has therefore become a problem to be solved.
The semi-honest model fits many practical scenarios; under this model, data privacy is guaranteed by every participant always following the protocol. However, solutions under this model that guarantee data privacy usually incur high computation and communication costs and are therefore often infeasible in practice.
The conventional K-means clustering algorithm is a classical clustering algorithm based on Euclidean distance. It has three main steps: choosing the centroids, classifying the data points, and recomputing new centroids. Suppose the training samples are $\{x_i \in \mathbb{R}^l \mid 1 \le i \le m\}$, where m is the number of samples and l is the dimension of each sample. First, k centroids are selected at random, written $M = \{\mu_c \in \mathbb{R}^l \mid 1 \le c \le k\}$. Then the distance from each data point $x_i$ to each centroid $\mu_c$ is computed, and $x_i$ is assigned to the class of its nearest centroid: $c_i := \arg\min_c \lVert x_i - \mu_c \rVert^2$. Finally, each centroid $\mu_c$ is recomputed as the mean of the data points assigned to its class: $\mu_c := \frac{\sum_{i=1}^{m} 1\{c_i = c\}\, x_i}{\sum_{i=1}^{m} 1\{c_i = c\}}$.
It can be seen that the conventional K-means clustering algorithm consists of three steps: choosing the centroids, classifying the data points, and recomputing the centroids. During classification, the Euclidean distance from each data point to every centroid is computed first, and the distances are then compared to find the nearest centroid. The distance used here is the squared Euclidean distance, which makes two values easier to compare without changing their order. When recomputing the centroids, the component-wise sum of the data points in each class must be computed; these data points may come from different participants, so the computation may raise privacy concerns. In short, the computation in the conventional K-means clustering algorithm can leak private data.
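The three steps of the conventional algorithm described above can be sketched in plaintext as follows. This is a minimal illustration without any privacy protection; the function names, the fixed iteration count, and the handling of empty classes are assumptions made for the sketch and do not come from the patent.

```python
import random


def sq_dist(x, mu):
    """Squared Euclidean distance; preserves the ordering of true distances."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def assign(points, centroids):
    """Classification step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(x, centroids[c]))
            for x in points]


def recompute(points, labels, centroids):
    """Recompute each centroid as the component-wise mean of its class."""
    new = []
    for c, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == c]
        if not members:
            new.append(list(old))  # assumption: an empty class keeps its centroid
            continue
        new.append([sum(col) / len(members) for col in zip(*members)])
    return new


def kmeans(points, k, iterations=10, seed=0):
    """Choose k random centroids, then alternate classification and recomputation."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = assign(points, centroids)
    for _ in range(iterations):
        centroids = recompute(points, labels, centroids)
        labels = assign(points, centroids)
    return centroids, labels
```

In the joint setting, the rows of `points` would belong to different owners, which is where both the distance computation and the per-class sums expose private values.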
Summary of the invention
To solve the problems in the prior art, the present invention provides a K-means clustering method with privacy protection, and also provides a system for implementing the method.
The method of the invention comprises the following steps:
S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server;
S2: Data owners A and B each randomly select k centroids, encrypt them, and upload them to the server;
S3: The server computes the Euclidean distance from each ciphertext data point to each centroid through the secure distance computation protocol, and classifies the data points according to the computed distances through the secure comparison protocol;
S4: The server and data owners A and B jointly recompute k new centroids through the secure circuit protocol;
S5: Data owner A or B determines the distance between the new centroids and the previous centroids over the ciphertext data through the secure comparison protocol. If the distance is less than the threshold, classification ends and data owners A and B ask the server to send the classified data to A and B respectively; otherwise, the method returns to step S2 for the next iteration.
In a further improvement of the invention, in step S1 the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system.
In a further improvement of the invention, in step S2 the selection of the centroids covers both the number of centroids and their values, and specifically comprises the following steps:
S21: Data owners A and B each randomly select k centroids;
S22: Each iterates on its own data set according to the conventional K-means clustering algorithm and classifies the data;
S23: The distance from each data point to its corresponding centroid is computed, and the sum S of the distances over all data points is computed;
S24: When the sums S corresponding to k-1, k, and k+1 centroids differ only slightly, k is taken as the number of centroids;
S25: Data owners A and B each compute the average of the values of their respective centroids, and this average is taken as the value of the k centroids.
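The rule in step S24 can be sketched as a small selection function. The patent does not quantify when S "differs only slightly", so the relative tolerance `rel_tol` and the function name `pick_k` are assumptions made for illustration; the input is a mapping from each candidate k to the distance sum S obtained for it.

```python
def pick_k(total_dist_by_k, rel_tol=0.2):
    """Return the smallest k for which S(k-1), S(k), S(k+1) are close together.

    total_dist_by_k maps a candidate k to its distance sum S.
    rel_tol is a hypothetical threshold: the three sums count as "close"
    when their spread is at most rel_tol times the largest of them.
    """
    for k in sorted(total_dist_by_k):
        if k - 1 not in total_dist_by_k or k + 1 not in total_dist_by_k:
            continue  # need both neighbours to compare
        window = [total_dist_by_k[j] for j in (k - 1, k, k + 1)]
        spread = max(window) - min(window)
        if spread <= rel_tol * max(window):
            return k
    return None  # no candidate satisfied the tolerance
```

Each data owner would run this on its own data set; how the two local results are merged is described in the detailed embodiment.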
In a further improvement of the invention, the computation of step S3 comprises the following steps:
S31: The server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroids uploaded by A, and between each ciphertext record of data owner B and the ciphertext centroids uploaded by B;
S32: The server and data owner A jointly compute, through the secure distance computation protocol, the ciphertext distance between each data point of A and each centroid; the server and data owner B jointly compute, through the same protocol, the ciphertext distance between each data point of B and each centroid;
S33: According to the set of ciphertext distances obtained in step S32, the server assigns the data of data owners A and B to the nearest class, and stores the data of the two owners separately within each class.
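The bookkeeping of step S33 can be illustrated with a plaintext sketch in which the distances are ordinary numbers. In the patented method the distances are ciphertexts and the minimum is found through the secure comparison protocol, so this only shows the resulting data layout: one bucket per class, with the records of A and B kept apart. The function name and dictionary layout are assumptions.

```python
def classify(dist_a, dist_b, k):
    """Assign every record to its nearest class, keeping owners separate.

    dist_a[i][c] is the distance from A's i-th record to centroid c,
    dist_b[i][c] the same for B. Returns a list of k classes, each a
    dict with the record indices of A and of B stored separately.
    """
    classes = [{"A": [], "B": []} for _ in range(k)]
    for owner, dists in (("A", dist_a), ("B", dist_b)):
        for i, row in enumerate(dists):
            nearest = min(range(k), key=lambda c: row[c])
            classes[nearest][owner].append(i)
    return classes
```

Keeping the two owners' records apart inside each class matters later, because the two sets are encrypted under different public keys.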
In a further improvement of the invention, the processing of step S4 comprises the following steps:
S41: The server sends the data points stored separately within each class to the corresponding data owners A and B;
S42: Data owners A and B decrypt them;
S43: The server and data owners A and B compute the new centroid of that class through the secure circuit protocol.
The present invention also provides a system for implementing the method, comprising a database, a first client used by data owner A, and a second client used by data owner B. The first client and the second client are used to encrypt their respective data and upload the ciphertext to the server; to each randomly select k centroids, encrypt them, and upload them to the server; after the server has classified the data, to recompute k new centroids jointly with the server; and to determine the distance between the new centroids and the previous centroids. If the distance is less than the threshold, classification ends and the clients ask the server to send the classified data to the first client and the second client respectively; otherwise, the centroids are uploaded again. The server is used to receive the data uploaded by the first client and the second client, compute the Euclidean distance from each data point to each centroid, classify the data points according to the computed distances, and then recompute k new centroids jointly with the first client and the second client.
In a further improvement of the invention, the server is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system.
In a further improvement of the invention, the selection of the centroids by the first client and the second client covers both the number of centroids and their values, and is carried out by the following modules:
Centroid selection module: for randomly selecting k centroids;
Classification module: for iterating on the respective data set according to the conventional K-means clustering algorithm and classifying the data;
Secure distance computation module: for computing, through the secure distance computation protocol, the distance from each data point to its corresponding centroid, and computing the sum S of the distances over all data points;
Centroid number selection module: for determining that k is the number of centroids when the sums S corresponding to k-1, k, and k+1 centroids differ only slightly;
Centroid value selection module: for computing the average of the values of the respective centroids, this average being taken as the value of the k centroids.
In a further improvement of the invention, the server comprises:
First ciphertext distance computation module: for computing the ciphertext distance between each ciphertext record of the first client and the ciphertext centroids it uploaded, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroids it uploaded;
Second ciphertext distance computation module: for computing, jointly with the first client, the ciphertext distance between each data point of the first client and each centroid, and, jointly with the second client, the ciphertext distance between each data point of the second client and each centroid;
Classification module: for assigning the data of the first client and the second client to the nearest class according to the set of ciphertext distances computed by the second ciphertext distance computation module, and storing the data of the two clients separately within each class.
In a further improvement of the invention, the server also comprises:
Sending module: for sending the data points stored separately within each class to the corresponding first client and second client;
Secure centroid computation module: for computing, jointly with the first client and the second client, the new centroid of each class through the secure circuit protocol.
Compared with the prior art, the beneficial effects of the invention are as follows. By using encryption, the application ensures both the security of the data mining process and the correctness of the result. It supports outsourcing of data storage, so it can run on larger data sets. It supports outsourcing of computation: most of the computation is handed to the cloud platform, whose computing power greatly improves execution efficiency while correctness is preserved. It not only achieves secure computation under the semi-honest model, but also, in the centroid recomputation stage, supports secure computation when a majority of the three participants are malicious parties.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a schematic diagram of the system structure of the invention;
Fig. 3 is a schematic diagram of the time consumed by server and client in the conventional K-means clustering algorithm;
Fig. 4 is a schematic diagram of the time consumed by server and client in the invention;
Fig. 5 is a schematic diagram of the proportion of time consumed by server and client in the conventional K-means clustering algorithm;
Fig. 6 is a schematic diagram of the proportion of time consumed by server and client in the invention;
Fig. 7 shows the ratio of the time consumed by the invention to that of the conventional K-means clustering algorithm.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.
To address the performance problems of privacy-preserving data mining, the invention builds on an in-depth study of existing privacy-preserving data mining algorithms and proposes an efficient privacy-preserving K-means clustering algorithm on horizontally partitioned data sets. The scheme involves two data owners and a cloud platform and supports simultaneous outsourcing of storage and computation. The data is stored in the cloud in ciphertext form, and the cloud platform, by interacting with the two data owners, completes the K-means clustering task on the joint data set of the two parties. The invention designs separate security protocols for three technical difficulties in privacy-preserving K-means clustering: a secure distance computation protocol for computing distances over ciphertexts, a secure comparison protocol for comparing ciphertexts, and a secure circuit protocol for division over ciphertexts. These protocols are then applied within the clustering framework to realize a K-means clustering algorithm with privacy protection.
As shown in Fig. 1, the K-means clustering method with privacy protection of the invention comprises five main steps, described in detail below.
Step S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server. In this example, data owner A is Alice, data owner B is Bob, and the server is C.
Alice and Bob encrypt their data $D_x$ and $D_y$ with their own public keys $pk_1$ and $pk_2$, obtaining the ciphertexts $C_x$ and $C_y$, and then upload $C_x$ and $C_y$ to C. Every record in $D_x$ and $D_y$ is l-dimensional, so encrypting the database means encrypting every dimension of every record. All data of Alice and Bob is stored in the cloud file system in ciphertext form. Specifically: $C_x = \{E_{pk_1}(x_{ij}) \mid 1 \le i \le m,\ 1 \le j \le l\}$ and $C_y = \{E_{pk_2}(y_{ij}) \mid 1 \le i \le m,\ 1 \le j \le l\}$,
where m is the number of records.
Step S2: Alice and Bob select k centroids, encrypt them with their respective public keys, and upload them to C.
In this example, centroid selection is a very important step, because it directly determines the number of iterations and hence the overall execution time of the system; good centroids speed up convergence and improve execution efficiency. Centroid selection here covers the number of centroids and their values. For the number of centroids, Alice and Bob each pick a random k and k random centroids, then run one iteration on their own data sets. After classification, the distance from each data point to its corresponding centroid is computed, and the sum of all these distances is S. When the values of S for k-1, k, and k+1 differ only slightly, k is taken as the number of centroids. Alice and Bob each find their own k, and the average of the two k values is the final k. Alice randomly selects k centroids $M = \{\mu_c \mid 1 \le c \le k\}$, where $\mu_c = \{\mu_{cj} \mid 1 \le j \le l\}$. Alice and Bob encrypt the centroids with their respective public keys and upload them to the cloud; the ciphertexts of the centroids are $E_{pk_1}(\mu_c)$ and $E_{pk_2}(\mu_c)$.
Step S3: Server C computes the Euclidean distance from each ciphertext data point to each centroid through the secure distance computation protocol, and then classifies the data points according to the computed distances through the secure comparison protocol. Specifically:
C computes the ciphertext distance between each record $E_{pk_1}(x_i)$ and each centroid $E_{pk_1}(\mu_c)$, and between each record $E_{pk_2}(y_i)$ and each centroid $E_{pk_2}(\mu_c)$. C and Alice jointly run the SSED (secure distance computation) protocol to compute the ciphertext distance between each $x_i$ and $\mu_c$; C and Bob jointly run the SSED protocol to compute the ciphertext distance between each $y_i$ and $\mu_c$. The ciphertext distances between all $x_i$ and $\mu_c$ are stored in one distance set, and those between all $y_i$ and $\mu_c$ in another.
The homomorphic encryption used in this method is a partially homomorphic scheme that supports addition on ciphertexts, namely Paillier encryption. It is a probabilistic encryption scheme given by a 4-tuple, written $Enc_{pa} = \{KeyGen, Encrypt, Decrypt, Evaluate\}$. The Paillier scheme works as follows:
● $KeyGen(1^k) \to (pk, sk)$:
(1) Select two large primes p and q satisfying $\gcd(pq, (p-1)(q-1)) = 1$;
(2) Compute $N = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$;
(3) Randomly select an integer $g \in \mathbb{Z}^*_{N^2}$;
(4) Find $\mu$ such that $\mu = (L(g^{\lambda} \bmod N^2))^{-1} \bmod N$, where L is the function $L(u) = (u-1)/N$. The public key is then $(N, g)$ and the private key is $(\lambda, \mu)$.
● $Encrypt(x, r) \to c$:
Let the plaintext be x and select a random number r; the ciphertext is $c = g^x r^N \bmod N^2$. Encryption is also written $E_{pk}(x) = c$.
● $Decrypt(c) \to x$:
Decryption computes $x = L(c^{\lambda} \bmod N^2) \cdot \mu \bmod N$. $D_{sk}(c)$ denotes $Decrypt(c)$.
● $Evaluate$:
$E_{pk}(x) \cdot E_{pk}(y) = E_{pk}(x+y)$ and $E_{pk}(x)^y = E_{pk}(xy)$, where x and y are two plaintexts.
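The four operations above can be tried out with a toy implementation. The sketch below follows the textbook Paillier construction with deliberately tiny primes, so it is insecure and for illustration only; the choice g = N + 1 is a common simplification assumed here, not something the patent specifies.

```python
import math
import random


def L(u, n):
    """The function L(u) = (u - 1) / N used in key generation and decryption."""
    return (u - 1) // n


def keygen(p, q):
    """KeyGen: returns public key (N, g) and private key (lambda, mu)."""
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # assumed choice of g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)          # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(pk, x):
    """Encrypt: c = g^x * r^N mod N^2 with a fresh random r coprime to N."""
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, x % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    """Decrypt: x = L(c^lambda mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n


def add(pk, c1, c2):
    """Evaluate: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % (pk[0] ** 2)


def scale(pk, c, y):
    """Evaluate: raising a ciphertext to y multiplies the plaintext by y."""
    return pow(c, y, pk[0] ** 2)
```

For example, with `pk, sk = keygen(293, 433)`, decrypting `add(pk, encrypt(pk, 12), encrypt(pk, 30))` recovers the sum of the two plaintexts, which is the property the server relies on when it operates on uploaded ciphertexts.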
The secure distance computation protocol of this example is built on the secure multiplication protocol, whose processing is as follows:
Here, $Z_N$ is the space of positive integers, and $r_x$ and $r_y$ denote positive integers.
The secure distance computation protocol of this example proceeds as follows:
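As an illustration only, and not the patent's own protocol listing, the sketch below shows one widely used way to build secure multiplication and a squared-distance computation on top of Paillier: the server blinds its two ciphertexts with random $r_x$ and $r_y$, the key owner decrypts, multiplies, and re-encrypts, and the server strips the blinding terms homomorphically. All function names, the toy primes, and the choice g = N + 1 are assumptions; in a real deployment `owner_multiply` would run on the data owner's machine, so the server would never hold `sk`.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier keys with g = N + 1 (assumed); returns N and (lambda, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))


def enc(n, x):
    """E(x) = (1 + x*N) * r^N mod N^2, using (N+1)^x = 1 + x*N mod N^2."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (x % n) * n) * pow(r, n, n * n) % (n * n)


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def owner_multiply(n, sk, a, b):
    """Key owner: decrypt two blinded values, multiply, re-encrypt."""
    return enc(n, dec(n, sk, a) * dec(n, sk, b) % n)


def secure_multiply(n, sk, ex, ey):
    """Server: derive E(x*y) from E(x) and E(y) without learning x or y."""
    n2 = n * n
    rx, ry = random.randrange(1, n), random.randrange(1, n)
    a = ex * enc(n, rx) % n2                 # E(x + rx)
    b = ey * enc(n, ry) % n2                 # E(y + ry)
    h = owner_multiply(n, sk, a, b)          # E((x + rx)(y + ry)), computed by owner
    h = h * pow(ex, n - ry, n2) % n2         # subtract x * ry
    h = h * pow(ey, n - rx, n2) % n2         # subtract y * rx
    return h * enc(n, -rx * ry) % n2         # subtract rx * ry


def secure_sq_distance(n, sk, ex_vec, emu_vec):
    """E(sum_j (x_j - mu_j)^2) from component-wise encrypted vectors."""
    n2 = n * n
    total = enc(n, 0)
    for ex, emu in zip(ex_vec, emu_vec):
        diff = ex * pow(emu, n - 1, n2) % n2  # E(x_j - mu_j)
        total = total * secure_multiply(n, sk, diff, diff) % n2
    return total
```

The blinding is what keeps the key owner from learning the server's operands: it only ever sees $x + r_x$ and $y + r_y$, while the server only ever sees ciphertexts.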
C then classifies all data points, specifically:
By comparing the distances in the two ciphertext distance sets, $x_i$ and $y_i$ are assigned to their nearest classes. C and Alice run the secure comparison protocol, and C and Bob do the same; all ciphertexts are then placed into the corresponding class sets $CL_1$ and $CL_2$. Each class in $CL_1$ stores the data points of Alice ($P_1$) that were assigned to class c, and each class in $CL_2$ stores the data points of Bob that were assigned to class c. The calculation formula is:
The secure comparison protocol proceeds as follows:
Step S4: C, Alice, and Bob jointly recompute the k centroids through the secure circuit protocol. Because the data of the two participants in $CL_1$ and $CL_2$ is encrypted under different public keys, the new centroids cannot be computed directly. In this example, C first sends $CL_1$ and $CL_2$ to Alice and Bob respectively, who decrypt them to obtain $L_1$ and $L_2$. The calculation formula is:
Then C, Alice, and Bob run the SC (secure circuit) protocol to compute each component of the new centroids,
where the operands are the ciphertext data of Alice and Bob respectively.
This yields one component $\mu_{cj}$ of the new centroid. The SC protocol ensures that Alice and Bob obtain all the new centroids.
The secure circuit protocol proceeds as follows:
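The quantity that the secure circuit has to produce for each class can be illustrated in plaintext. The sketch below is not the secure circuit protocol itself; it only shows the arithmetic that such a protocol would evaluate without revealing either party's inputs, namely a division of the combined per-class sums by the combined count. The function names and the (sums, count) representation are assumptions.

```python
def local_aggregate(members, dim):
    """One owner's contribution for a class: component-wise sums and a count."""
    sums = [sum(p[j] for p in members) for j in range(dim)]
    return sums, len(members)


def joint_centroid(agg_a, agg_b):
    """New centroid of a class from the two owners' aggregates.

    Each component is (sum_A + sum_B) / (count_A + count_B), which is the
    division step that motivates a dedicated protocol over private inputs.
    """
    (sums_a, n_a), (sums_b, n_b) = agg_a, agg_b
    total = n_a + n_b
    if total == 0:
        raise ValueError("class has no members")
    return [(sa + sb) / total for sa, sb in zip(sums_a, sums_b)]
```

For instance, if Alice holds the points (1, 2) and (3, 4) in a class and Bob holds (5, 6), the joint centroid is the mean of all three points, yet neither party needs to see the other's individual records to state what must be computed.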
Step S5: Using the secure comparison protocol, Alice computes the distance between the new centroids and the previous centroids. If it is less than the threshold, Alice and Bob ask C to send the classified data to Alice and Bob respectively. Otherwise, Alice and Bob encrypt the new centroids with their respective public keys and upload them to C, and the next iteration is carried out.
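The stopping rule of step S5 can be sketched in plaintext as follows. The patent does not state how the distances of the k centroid pairs are combined, so taking the largest squared shift among the pairs is an assumption, as are the function name and the example threshold.

```python
def converged(old_centroids, new_centroids, threshold):
    """True when every centroid moved by less than the threshold.

    The shift of a centroid is measured as the squared Euclidean distance
    between its previous and new position; the largest shift is compared
    with the threshold (an assumed aggregation rule).
    """
    shift = max(
        sum((a - b) ** 2 for a, b in zip(old, new))
        for old, new in zip(old_centroids, new_centroids)
    )
    return shift < threshold
```

When this returns False, the owners would encrypt and upload the new centroids and the server would repeat the classification.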
As shown in Fig. 2, the present invention also provides a system for implementing the above method. The system of this example comprises a database C, a first client $P_1$ used by data owner A, and a second client $P_2$ used by data owner B. The first client $P_1$ and the second client $P_2$ are used to encrypt their respective data and upload the ciphertext to the server; to each randomly select k centroids, encrypt them, and upload them to the server; after the server has classified the data, to recompute k new centroids jointly with the server; and to determine the distance between the new centroids and the previous centroids. If the distance is less than the threshold, classification ends and the clients ask the server to send the classified data to the first client $P_1$ and the second client $P_2$ respectively; otherwise, the centroids are uploaded again. The server is used to receive the data uploaded by the first client $P_1$ and the second client $P_2$, compute the Euclidean distance from each data point to each centroid, classify the data points according to the computed distances, and then recompute k new centroids jointly with the first client $P_1$ and the second client $P_2$.
In this example, server C is a cloud server, which re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system. This supports outsourcing of data storage, so the method can run on larger data sets. It also supports outsourcing of computation: most of the computation is handed to the cloud platform, whose computing power greatly improves execution efficiency while correctness is preserved.
Analysis of the beneficial effects of the invention:
1. The scheme chosen for comparison
The framework used by the invention was first proposed in the paper "Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks". In this comparison, the method of that paper is referred to as the previous scheme. A clustering algorithm under the same framework is more comparable than clustering algorithms under other frameworks, so the invention is mainly compared with the previous scheme. To ensure a reliable experimental comparison, both schemes are run in the same experimental environment. The evaluation criteria of the two methods are explained below, followed by a comparative analysis of the experimental results.
2. Evaluation criteria
The time consumption of the method of the invention has three main parts: client time, communication, and server time. Client and server time each include the initialization phase and the protocol execution phase. Because the present application and the previous scheme use different methods, they can only be compared at a macro level. The comparison has two aspects: a theoretical complexity analysis, covering time, space, and communication complexity, and a comparison of experimental test results. Since different numbers of iterations affect the overall experimental result, this example takes one iteration as the unit and compares the following:
(1) The theoretical time complexity, space complexity, and communication complexity of the two schemes.
(2) The data encryption time of the two schemes.
(3) The time consumed by server and client in one iteration of the two schemes.
3. Analysis of experimental results
In theory, the scheme of the invention is lower than the previous scheme in time complexity, space complexity, and communication complexity. The experimental results of the two schemes are analyzed below on the basis of the experimental data.
The encryption times consumption of the two schemes compared first.The two kinds of cipher modes used before in scheme, it is all Clear data must all Liu encipherment schemes to be improved encryptions once, will also be by the encryption of Paillier encipherment schemes once. All clear datas only need to a Paillier encryption in the solution of the present invention, in theory the scheme in the present invention In encryption times should be faster than before encryption times consumption in scheme.Again because Paillier operation is on group, There are many index operations again, and improved Liu encipherment schemes are all linear operations, so most encryption times are consumed Because Paillier encryptions are caused.So, the encryption before the encryption times consumption in the present invention can be slightly less than in scheme Time loss, but the time do not have a difference of the order of magnitude, the result of experiment effectively demonstrate the conclusion.Scheme adds before As shown in table 1, encryption times consumption of the invention is as shown in table 2 for close time loss.
The existing scheme encryption times of table 1 are consumed
The encryption times of the present invention of table 2 are consumed
Then, the present invention was counted and contrasted to the time consumed in an iteration.In theory, this hair The bright cloud platform being introduced into improves powerful computing capability should be able to be more slightly better than the operational efficiency in scheme before.Because The cloud platform of the present invention is made up of 30 PCs and a server, is needed in the processing procedure of task to every machine The division of labor of carry out task, task scheduling and data record, these operations can also consume the time of part.When data point is more Wait, the time of an iteration can be longer, and the ratio that operation the consumed time such as task division of labor takes will be lower.This hair It is bright in safety circuit agreement, the generation of circuit is needed to expend the larger time, but circuit is only needed in the first iteration Generate once, so in theory when data point scale is less, the efficiency of an iteration of scheme can be higher than before Scheme in the present invention, when data point scale is higher than a certain threshold value, the efficiency of the solution of the present invention an iteration can be higher than it Efficiency in the case of front, as data scale is increasing, the odds for effectiveness of scheme can be more and more obvious in the present invention.Experiment The dry straight viewpoint for demonstrating us, while test result indicates that the threshold value of data point scale is about 5000 data Point, when data scale is more than 7000, the present invention program an iteration elapsed time is less, when data scale is less than 5000, Scheme an iteration elapsed time is less in scheme before.The contrast of two schemes an iteration elapsed time is as shown in table 3.
Table 3. Comparison of the time consumed in one iteration
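The threshold argument above can be written as a simple cost model. This is only an illustrative sketch under a linearity assumption that the source does not state. All symbols are hypothetical: n is the number of data points, c_0 is the fixed overhead of the present invention (task division, scheduling, and circuit generation), and a and b are the per-point costs of the present invention and of the prior scheme, with b > a.

```latex
T_{\text{inv}}(n) = c_0 + a\,n, \qquad
T_{\text{prior}}(n) = b\,n, \qquad
T_{\text{inv}}(n) < T_{\text{prior}}(n)
\iff n > n^{*} = \frac{c_0}{b - a}.
```

Under this model the fixed overhead dominates for small n and is amortized for large n. The experiments reported above place the break-even point n* at roughly 5000 data points.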
Within one iteration, the present invention is concerned not only with the time the iteration takes, but also with having server C undertake more of the work in each iteration, that is, with giving the server a higher share of the elapsed time. In other words, while keeping the time of one iteration small, the ratio of server time to total iteration time should be as large as possible, which reduces the amount of computation on the clients. As the data scale grows, the efficiency of such a scheme also becomes higher. The clients mainly perform encryption and decryption, and the number of client encryptions and decryptions is essentially the same in the two schemes. However, in the prior scheme the ciphertext distance computation and the ciphertext distance comparison use improved Liu encryption, all of whose operations are linear, whereas the scheme of the present invention uses the Paillier encryption algorithm, whose encryption and decryption require exponentiation and modular arithmetic over a group. For a client with limited computing capability, the improved Liu encryption algorithm should therefore take less time than the Paillier encryption used in the present invention. So in theory, on data sets of the same scale, the clients in the prior scheme consume less time than the clients in the scheme of the present invention. As the data scale increases, the time of one iteration in the scheme of the present invention is relatively small, while the time consumed by the clients is relatively large. Therefore, as the data scale grows, the clients' share of the elapsed time in the present invention becomes larger and larger, and the server's share correspondingly becomes smaller and smaller. The experimental data that were collected and analyzed confirm this expectation. The time consumed by each participant in one iteration of the two schemes is shown in Table 4 and Table 5. The server and client times of the prior scheme are shown in Fig. 3, and the server and client times of the present invention are shown in Fig. 4.
Table 4. Time consumed by each participant in one iteration of the prior scheme
Table 5. Time consumed by each participant in one iteration of the present application
Fig. 3 and Fig. 4 show how the server and client times of the two schemes change as the number of data points increases. In the experiments on the prior scheme, the server time rises markedly as the data scale increases, while the client time rises only slightly. The main reason is that the server's computing capability is limited and the computation on the data is relatively complex. As the data scale increases, the server needs more and more time to process the data, so its elapsed time grows noticeably and its share of the total time also increases. Although the clients also have more data to process, most client operations are linear computations, so the increase in client time is not pronounced and the clients' share of the total time decreases. In the present invention, the server runs on a cloud platform consisting of 30 PCs and one server, so its computing capability is assured. Fig. 4 shows that the server time increases with the data scale, but without an obvious upward trend. The client time, in contrast, keeps increasing with the data scale, mainly because the decryption performed by the clients involves exponentiation over a group, which requires more computation than linear operations. Therefore, as the data scale increases, the server's share of the elapsed time in the present invention decreases and the clients' share increases. The shares of server and client time in the prior scheme are shown in Fig. 5, and those of the present invention are shown in Fig. 6.
Finally, the time taken by one iteration of the K-means clustering algorithm with privacy protection and by one iteration of the classical K-means algorithm was measured experimentally. The results show that encryption adds a relatively large time cost. However, as the data scale increases, the ratio of the single-iteration time of the present invention to that of classical K-means becomes smaller and smaller. The single-iteration times of the present invention and of the classical K-means algorithm are shown in Table 6, and the time ratio is shown in Fig. 7.
Table 6. Time consumed in one iteration by the present invention and by the classical K-means algorithm
The present invention selects K-means, a typical algorithm in data mining, and mines a joint data set that is horizontally partitioned between two parties, while supporting both storage outsourcing and computation outsourcing to a cloud platform. The beneficial effects of the present invention are mainly the following:
(1) By analyzing the state of privacy-preserving data mining in China and abroad, the advantages and disadvantages of the commonly used techniques are well understood. Schemes based on data perturbation execute efficiently, but because they alter the original data set they inevitably affect the mining results, whereas encryption-based schemes can guarantee the correctness of the mining results. The present invention uses encryption and thereby ensures the correctness of the data mining results;
(2) The scheme of the present invention supports data storage outsourcing. A cloud platform has a larger storage capacity than an ordinary PC, which allows the scheme to be executed on larger data sets;
(3) The scheme of the present invention supports data computation outsourcing. A cloud platform is a distributed computing framework that can combine many resources into a cluster, greatly increasing the computing capability of the system. The scheme outsources most of the computation to the cloud platform and, by relying on its powerful computing capability, greatly improves execution efficiency while ensuring correctness;
(4) The time complexity, space complexity, communication complexity, and security of the algorithm are analyzed theoretically, and its correctness and efficiency are verified experimentally. The K-means clustering algorithm with privacy protection proposed by the present invention not only achieves secure computation under the semi-honest model, but also, in the centroid recomputation stage, supports secure computation against a malicious party by a majority of the three participants.
The embodiments described above are preferred embodiments of the present invention and do not limit its specific scope of practice. The scope of the present invention includes but is not limited to these embodiments, and all equivalent changes made according to the present invention fall within its scope of protection.

Claims (10)

1. A K-means clustering method with privacy protection, characterized by comprising the following steps:
S1: data owners A and B each encrypt their own data and upload the ciphertext to a server;
S2: data owners A and B each randomly select k centroids, encrypt them, and upload them to the server;
S3: the server computes the Euclidean distance from each ciphertext data point to the centroids by means of a secure distance computation protocol, and classifies the data points according to the computed Euclidean distances by means of a secure comparison protocol;
S4: the server and data owners A and B jointly recompute k new centroids by means of a secure circuit protocol;
S5: data owner A or B determines, on ciphertext data and by means of the secure comparison protocol, the distance between the new centroids and the previous centroids; if the distance is less than a threshold, classification ends and data owners A and B request the server to send the classified data to data owners A and B respectively; otherwise, the method returns to step S2 and performs the next iteration.
2. The K-means clustering method with privacy protection according to claim 1, characterized in that: in step S1, the server is a cloud server, and the cloud server re-encrypts the encrypted data uploaded by data owners A and B and stores it in a file system in the cloud.
3. The K-means clustering method with privacy protection according to claim 2, characterized in that: in step S2, the selection of the centroids includes selecting both the number of centroids and their values, and specifically comprises the following steps:
S21: data owners A and B each randomly select k centroids;
S22: data owners A and B each iterate on their own data set according to the traditional K-means clustering algorithm and classify the data;
S23: the distance from each data point to its corresponding centroid is computed, and the sum S of the distances of all data points is computed;
S24: when the sums S corresponding to k-1, k, and k+1 centroids differ little, k is taken as the number of centroids;
S25: data owners A and B each compute the average of the values of their respective centroids, and these averages are the values of the k centroids.
4. The K-means clustering method with privacy protection according to claim 3, characterized in that the computation in step S3 comprises the following steps:
S31: the server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroids uploaded by data owner A, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroids uploaded by data owner B;
S32: the server and data owner A jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of data owner A and the centroids; the server and data owner B jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of data owner B and the centroids;
S33: according to the set of ciphertext distances obtained in step S32, the server assigns the data of data owners A and B to the nearest class, and stores the data of the two owners separately within the same class.
5. The K-means clustering method with privacy protection according to claim 4, characterized in that the processing in step S4 comprises the following steps:
S41: the server sends the data points stored separately within the same class to the corresponding data owners A and B respectively;
S42: data owners A and B decrypt the data;
S43: the server and data owners A and B compute the new centroid of the class using the secure circuit protocol.
6. A system for implementing the K-means clustering method with privacy protection according to any one of claims 1 to 5, characterized by comprising a database, a first client used by data owner A, and a second client used by data owner B, wherein the first client and the second client are configured to encrypt their respective data and upload the ciphertext to a server; to each randomly select k centroids, encrypt them, and upload them to the server; to recompute k new centroids jointly with the server after the server has classified the data; and to determine the distance between the new centroids and the previous centroids, wherein if the distance is less than a threshold, classification ends and the server is requested to send the classified data to the first client and the second client respectively, and otherwise the centroids are uploaded again; and the server is configured to receive the data uploaded by the first client and the second client, compute the Euclidean distance from the data points to the centroids, classify the data points according to the computed Euclidean distances, and then recompute k new centroids jointly with the first client and the second client.
7. The system according to claim 6, characterized in that: the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in a file system in the cloud.
8. The system according to claim 7, characterized in that: the selection of the centroids by the first client and the second client includes selecting both the number of centroids and their values, and specifically involves the following modules:
a centroid selection module, for randomly selecting k centroids;
a classification module, for iterating on the respective data set according to the traditional K-means clustering algorithm and classifying the data;
a secure distance computation module, for computing the distance from each data point to its corresponding centroid by means of the secure distance computation protocol, and computing the sum S of the distances of all data points;
a centroid number selection module, for determining that k is the number of centroids when the sums S corresponding to k-1, k, and k+1 centroids differ little;
a centroid value selection module, for computing the average of the values of the respective centroids, the averages being the values of the k centroids.
9. The system according to claim 8, characterized in that the server comprises:
a first ciphertext distance computation module, for computing the ciphertext distance between each ciphertext record of the first client and the ciphertext centroids uploaded by the first client, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroids uploaded by data owner B;
a second ciphertext distance computation module, for computing, jointly with the first client, the ciphertext distance between each data point of the first client and the centroids, and for computing, jointly with the second client, the ciphertext distance between each data point of the second client and the centroids;
a classification module, for assigning the data of the first client and the second client to the nearest class according to the set of ciphertext distances computed by the second ciphertext distance computation module, and storing the data of the two clients separately within the same class.
10. The system according to claim 9, characterized in that the server further comprises a sending module, for sending the data points stored separately within the same class to the corresponding first client and second client respectively;
and a secure centroid computation module, for computing the new centroid of each class jointly with the first client and the second client by means of the secure circuit protocol.
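The control flow of steps S3 to S5 of claim 1 can be illustrated with a plaintext analogue. This sketch is an illustration under strong assumptions and is not the claimed method: it performs no encryption, it replaces the secure distance computation, secure comparison, and secure circuit protocols with ordinary arithmetic, and the names `assign`, `recompute`, and `kmeans`, as well as the default `threshold`, `max_iter`, and `seed` values, are hypothetical.

```python
import math
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """S3 analogue: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def recompute(points, labels, centroids):
    """S4 analogue: mean of each class; an empty class keeps its old centroid."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points_a, points_b, k, threshold=1e-6, max_iter=100, seed=0):
    """Cluster the horizontally partitioned joint data set of owners A and B."""
    points = list(points_a) + list(points_b)
    centroids = random.Random(seed).sample(points, k)  # S2 analogue
    labels = assign(points, centroids)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = recompute(points, labels, centroids)
        # S5 analogue: stop once no centroid moved by more than the threshold.
        shift = max(
            math.sqrt(squared_distance(c, n)) for c, n in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, labels
```

For example, `kmeans([(0, 0), (0, 1)], [(10, 10), (10, 11)], k=2)` groups the two points of owner A into one class and the two points of owner B into the other. In the claimed method the same assignment, recomputation, and convergence test are carried out on ciphertexts by the server and the two data owners.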
CN201710224275.7A 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function Active CN107145791B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710224275.7A CN107145791B (en) 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function
PCT/CN2017/117943 WO2018184407A1 (en) 2017-04-07 2017-12-22 K-means clustering method and system having privacy protection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710224275.7A CN107145791B (en) 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function

Publications (2)

Publication Number Publication Date
CN107145791A true CN107145791A (en) 2017-09-08
CN107145791B CN107145791B (en) 2020-07-10

Family

ID=59775048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710224275.7A Active CN107145791B (en) 2017-04-07 2017-04-07 K-means clustering method and system with privacy protection function

Country Status (2)

Country Link
CN (1) CN107145791B (en)
WO (1) WO2018184407A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107707494A (en) * 2017-10-10 2018-02-16 苏州大学 Nonlinear fiber equalization methods for 64 QAM coherent optical communication systems
CN107784663A (en) * 2017-11-14 2018-03-09 哈尔滨工业大学深圳研究生院 Correlation filtering tracking and device based on depth information
WO2018184407A1 (en) * 2017-04-07 2018-10-11 哈尔滨工业大学深圳研究生院 K-means clustering method and system having privacy protection
CN109214205A (en) * 2018-08-01 2019-01-15 安徽师范大学 Position and data-privacy guard method in a kind of intelligent perception based on k- anonymity
CN109615021A (en) * 2018-12-20 2019-04-12 暨南大学 A kind of method for protecting privacy based on k mean cluster
CN110163292A (en) * 2019-05-28 2019-08-23 电子科技大学 Secret protection k-means clustering method based on vector homomorphic cryptography
CN110162999A (en) * 2019-05-08 2019-08-23 湖北工业大学 A kind of income distribution difference Gini coefficient measure based on secret protection
CN111444545A (en) * 2020-06-12 2020-07-24 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
CN112487481A (en) * 2020-12-09 2021-03-12 重庆邮电大学 Verifiable multi-party k-means federal learning method with privacy protection
CN112508203A (en) * 2021-02-08 2021-03-16 同盾控股有限公司 Federated data clustering method and device, computer equipment and storage medium
CN113033915A (en) * 2021-04-16 2021-06-25 哈尔滨理工大学 Method and device for comparing shortest distance between car sharing user side and driver side
CN113438254A (en) * 2021-08-24 2021-09-24 北京金睛云华科技有限公司 Distributed classification method and system for ciphertext data in cloud environment
CN114730389A (en) * 2019-11-06 2022-07-08 维萨国际服务协会 Dual server privacy preserving clustering
CN116801380B (en) * 2023-03-23 2024-05-28 昆明理工大学 UWB indoor positioning method based on improved full centroid-Taylor

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610196B (en) * 2019-08-14 2023-04-28 平安科技(深圳)有限公司 Desensitization method, system, computer device and computer readable storage medium
CN114154554A (en) * 2021-10-28 2022-03-08 上海海洋大学 Privacy protection outsourcing data KNN algorithm based on non-collusion double-cloud server
CN117688502B (en) * 2024-02-04 2024-04-30 山东大学 Safe outsourcing calculation method and system for detecting local abnormal factors

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138923A (en) * 2015-08-11 2015-12-09 苏州大学 Privacy protection time sequence similarity calculation method
CN105760780A (en) * 2016-02-29 2016-07-13 福建师范大学 Trajectory data privacy protection method based on road network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102970143B (en) * 2012-12-13 2015-04-22 中国科学技术大学苏州研究院 Method for securely computing index of sum of held data of both parties by adopting addition homomorphic encryption
US9710493B2 (en) * 2013-03-08 2017-07-18 Microsoft Technology Licensing, Llc Approximate K-means via cluster closures
CN107145791B (en) * 2017-04-07 2020-07-10 哈尔滨工业大学深圳研究生院 K-means clustering method and system with privacy protection function

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU XIAOYAN等: "《Outsourcing Two-party Privacy Preserving K-means Clustering Protocol In Wireless Sensor Networks》", 《IEEE COMPUTER SOCIETY》 *
薛安荣等: "《隐私保护的快速聚类算法》", 《系统工程与电子技术》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018184407A1 (en) * 2017-04-07 2018-10-11 哈尔滨工业大学深圳研究生院 K-means clustering method and system having privacy protection
CN107707494B (en) * 2017-10-10 2020-02-11 苏州大学 Optical fiber nonlinear equalization method for 64-QAM coherent optical communication system
CN107707494A (en) * 2017-10-10 2018-02-16 苏州大学 Nonlinear fiber equalization methods for 64 QAM coherent optical communication systems
CN107784663A (en) * 2017-11-14 2018-03-09 哈尔滨工业大学深圳研究生院 Correlation filtering tracking and device based on depth information
CN107784663B (en) * 2017-11-14 2020-10-20 哈尔滨工业大学深圳研究生院 Depth information-based related filtering tracking method and device
CN109214205A (en) * 2018-08-01 2019-01-15 安徽师范大学 Position and data-privacy guard method in a kind of intelligent perception based on k- anonymity
CN109615021A (en) * 2018-12-20 2019-04-12 暨南大学 A kind of method for protecting privacy based on k mean cluster
CN109615021B (en) * 2018-12-20 2022-09-27 暨南大学 Privacy information protection method based on k-means clustering
CN110162999A (en) * 2019-05-08 2019-08-23 湖北工业大学 A kind of income distribution difference Gini coefficient measure based on secret protection
CN110162999B (en) * 2019-05-08 2022-06-07 湖北工业大学 Income distribution difference kini coefficient measurement method based on privacy protection
CN110163292A (en) * 2019-05-28 2019-08-23 电子科技大学 Secret protection k-means clustering method based on vector homomorphic cryptography
CN114730389B (en) * 2019-11-06 2023-07-07 维萨国际服务协会 System and method for privacy preserving unsupervised learning
CN114730389A (en) * 2019-11-06 2022-07-08 维萨国际服务协会 Dual server privacy preserving clustering
CN111444545A (en) * 2020-06-12 2020-07-24 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties
CN112487481A (en) * 2020-12-09 2021-03-12 重庆邮电大学 Verifiable multi-party k-means federal learning method with privacy protection
CN112487481B (en) * 2020-12-09 2022-06-10 重庆邮电大学 Verifiable multi-party k-means federal learning method with privacy protection
CN112508203B (en) * 2021-02-08 2021-06-15 同盾控股有限公司 Data clustering processing method, device, equipment and medium based on federal learning
CN112508203A (en) * 2021-02-08 2021-03-16 同盾控股有限公司 Federated data clustering method and device, computer equipment and storage medium
CN113033915B (en) * 2021-04-16 2021-12-31 哈尔滨理工大学 Method and device for comparing shortest distance between car sharing user side and driver side
CN113033915A (en) * 2021-04-16 2021-06-25 哈尔滨理工大学 Method and device for comparing shortest distance between car sharing user side and driver side
CN113438254B (en) * 2021-08-24 2021-11-05 北京金睛云华科技有限公司 Distributed classification method and system for ciphertext data in cloud environment
CN113438254A (en) * 2021-08-24 2021-09-24 北京金睛云华科技有限公司 Distributed classification method and system for ciphertext data in cloud environment
CN116801380B (en) * 2023-03-23 2024-05-28 昆明理工大学 UWB indoor positioning method based on improved full centroid-Taylor

Also Published As

Publication number Publication date
CN107145791B (en) 2020-07-10
WO2018184407A1 (en) 2018-10-11

Similar Documents

Publication Publication Date Title
CN107145791A (en) A kind of K means clustering methods and system with secret protection
Liu et al. An efficient privacy-preserving outsourced calculation toolkit with multiple keys
Wang An identity-based data aggregation protocol for the smart grid
CN105122721B (en) For managing the method and system for being directed to the trustship of encryption data and calculating safely
CN110536259A (en) A kind of lightweight secret protection data multilevel polymerization calculated based on mist
CN108737115B (en) Private attribute set intersection solving method with privacy protection
CN110011784A (en) Support the KNN classified service system and method for secret protection
CN107196926A (en) A kind of cloud outsourcing privacy set comparative approach and device
CN107145792A (en) Multi-user's secret protection data clustering method and system based on ciphertext data
Min et al. Novel multi-party quantum key agreement protocol with g-like states and bell states
CN106921493A (en) A kind of encryption method and system
CN105376057B (en) A kind of method of the extensive system of linear equations of cloud outsourcing solution
CN104967693A (en) Document similarity calculation method facing cloud storage based on fully homomorphic password technology
CN107864040A (en) A kind of intelligent grid big data information management system based on safe cloud computing
Wang et al. Lightweight certificate-based public/private auditing scheme based on bilinear pairing for cloud storage
CN110474770A (en) A kind of multi-party half quantum secret sharing method and system based on single photon
Agarkar et al. LRSPPP: lightweight R-LWE-based secure and privacy-preserving scheme for prosumer side network in smart grid
Hasan et al. Encryption as a service for smart grid advanced metering infrastructure
Fatahi et al. High-efficient arbitrated quantum signature scheme based on cluster states
CN109495244A (en) Anti- quantum calculation cryptographic key negotiation method based on pool of symmetric keys
Cheng et al. Batten down the hatches: Securing neighborhood area networks of smart grid in the quantum era
CN103763100A (en) Sum and product computing method for protecting data privacy security of arbitrary user group
Tallapally et al. Competent multi-level encryption methods for implementing cloud security
Li et al. Priexpress: Privacy-preserving express delivery with fine-grained attribute-based access control
Shi et al. Verifiable quantum key exchange with authentication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant