CN107145791A - A K-means clustering method and system with privacy protection
- Publication number
- CN107145791A (application CN201710224275.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- center
- mass point
- point
- server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- General Health & Medical Sciences (AREA)
- Bioethics (AREA)
- Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computing Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Storage Device Security (AREA)
Abstract
The present invention provides a K-means clustering method and system with privacy protection, and belongs to the technical field of data mining. The method comprises the following steps: data owners A and B each encrypt their own data and randomly selected centroids and upload them to a server; the server uses a secure multiplication protocol and a secure distance computation protocol to compute, on the ciphertext data, the Euclidean distance from each data point to each centroid, and assigns the data points to classes; the server and data owners A and B jointly recompute new centroids on the ciphertext data through a secure circuit protocol; data owner A or B uses a secure comparison protocol to evaluate the distance between the new centroids and the previous centroids; if the distance is below a threshold, classification ends and data owners A and B request the server to send the classified data to A and B respectively; otherwise the new centroids are uploaded again and the next iteration begins. The invention guarantees the correctness of the data mining result while keeping the data private; it supports outsourcing of both data storage and computation, and substantially improves execution efficiency while preserving correctness; it also supports secure computation in which most of the three participants may be malicious parties.
Description
Technical field
The present invention relates to the technical field of data mining, in particular to a K-means clustering method with privacy protection, and further to a system implementing the method.
Background
As is well known, K-means clustering is one of the classical and commonly used methods in data mining: it groups similar data items together by computing the distances between them. As informatization, digitization and networking accelerate and economic globalization becomes an irreversible trend, the data sources of clustering algorithms are increasingly diverse and data security is increasingly important. Since the data may come from multiple participants and may contain sensitive or personal information about them, sharing this information among the participants means that data privacy cannot be guaranteed. Privacy-preserving joint data mining can mine the joint database of multiple parties and extract useful information while protecting the privacy of both the user data and the mining results. How to design a privacy-preserving joint data mining algorithm has therefore become a problem that needs to be solved.
The semi-honest model fits real scenarios in many cases. Under this model, data privacy is guaranteed by every participant always following the protocol. However, to ensure data privacy, solutions under this model often incur high computation and communication costs and are therefore infeasible in practice.
The traditional K-means clustering algorithm is a classical clustering algorithm based on Euclidean distance. It consists of three main steps: choosing centroids, assigning data points to classes, and recomputing the centroids. Suppose the training samples are $\{x_i \in \mathbb{R}^l \mid 1 \le i \le m\}$, where m is the number of samples and l is the dimension of each sample. First, k centroids are chosen at random, written $M = \{\mu_c \in \mathbb{R}^l \mid 1 \le c \le k\}$. Next, the distance from each data point $x_i$ to each centroid $\mu_c$ is computed, and $x_i$ is assigned to the class of its nearest centroid: $c_i := \arg\min_c \lVert x_i - \mu_c \rVert^2$. Finally, each centroid $\mu_c$ is recomputed as the mean of the data points assigned to it: $\mu_c := \frac{\sum_{i=1}^{m} 1\{c_i = c\}\, x_i}{\sum_{i=1}^{m} 1\{c_i = c\}}$.
During assignment, the Euclidean distance from a data point to every centroid is computed first, and the nearest centroid is then found by comparison. The squared Euclidean distance is used here, because squaring does not change the order of two values and makes them easier to compare. Recomputing a centroid requires the component-wise sums of the data points in each class, and these data points may come from different participants, so privacy issues can arise during the computation. In short, the traditional K-means clustering algorithm can leak private data during its computation.
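The three steps of the traditional algorithm can be sketched in plain Python as follows. This is only an illustrative, non-private baseline: the function names, the `seed`, `max_iter` and `tol` parameters and the sample points are assumptions made for the example, not part of the patent.

```python
import random


def sq_dist(a, b):
    # Squared Euclidean distance; squaring preserves the ordering of distances.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k initial centroids at random from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[j] for m in members) / len(members) for j in range(dim)])
            else:
                new_centroids.append(centroids[c])  # keep an empty class in place
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # centroids stopped moving
            break
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, 2)
```

In this plaintext form every party would see every data point, which is exactly the leakage the invention sets out to avoid.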
Summary of the invention
To solve the problems of the prior art, the present invention provides a K-means clustering method with privacy protection, and also provides a system implementing the method.
The method of the invention comprises the following steps:
S1: Data owners A and B encrypt their respective data and upload the ciphertext to a server;
S2: Data owners A and B each randomly select k centroids, encrypt them and upload them to the server;
S3: The server computes the Euclidean distance from each ciphertext data point to each centroid through the secure distance computation protocol, and assigns the data points to classes according to the computed distances through the secure comparison protocol;
S4: The server and data owners A and B jointly recompute k new centroids through the secure circuit protocol;
S5: Data owner A or B uses the secure comparison protocol to evaluate, on the ciphertext data, the distance between the new centroids and the previous centroids; if it is below a threshold, classification ends and data owners A and B request the server to send the classified data to A and B respectively; otherwise the method returns to step S2 for the next iteration.
In a further improvement, in step S1 the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system.
In a further improvement, in step S2 the selection of centroids covers both the number of centroids and their values, and comprises the following steps:
S21: Data owners A and B each randomly select k centroids;
S22: Each iterates on its own data set according to the traditional K-means clustering algorithm and assigns the data points to classes;
S23: The distance from each data point to its corresponding centroid is computed, and the sum S of the distances of all data points is computed;
S24: When the sums S corresponding to k-1, k and k+1 centroids differ only slightly, k is taken as the number of centroids;
S25: Data owners A and B each compute the average of the values of their respective centroids, and that average is the value of the k centroids.
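The rule in S24 can be illustrated with a small sketch. It assumes each owner has already run K-means locally for several values of k and recorded the total distance S for each. The text does not say how small a change counts as "slight", so the relative tolerance `rel_tol`, the function names and the sample numbers are hypothetical; the rounding used to combine the two owners' values of k is likewise an assumption.

```python
def pick_k(total_dist, rel_tol=0.1):
    """Return the smallest k whose S differs little from S(k-1) and S(k+1).

    total_dist maps k -> S(k), the sum of point-to-centroid distances.
    """
    for k in sorted(total_dist):
        if k - 1 in total_dist and k + 1 in total_dist:
            s_prev, s_cur, s_next = total_dist[k - 1], total_dist[k], total_dist[k + 1]
            scale = max(s_prev, 1e-12)  # avoid division by zero
            if (abs(s_prev - s_cur) / scale <= rel_tol
                    and abs(s_cur - s_next) / scale <= rel_tol):
                return k
    return None  # no plateau found in the supplied range


def combine_k(k_a, k_b):
    # Assumption: the owners' two k values are averaged and rounded.
    return round((k_a + k_b) / 2)


# Hypothetical S values measured by one owner for k = 1..6.
s_values = {1: 100.0, 2: 40.0, 3: 12.0, 4: 11.5, 5: 11.2, 6: 11.0}
k_owner = pick_k(s_values)
```

With these numbers S drops sharply up to k = 3 and flattens afterwards, so the first k whose neighbours both differ by less than 10% is 4.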
In a further improvement, the computation in step S3 comprises the following steps:
S31: The server computes the ciphertext distance between every ciphertext record of data owner A and the ciphertext centroids A uploaded, and between every ciphertext record of data owner B and the ciphertext centroids B uploaded;
S32: The server and data owner A jointly compute, through the secure distance computation protocol, the ciphertext distance between each of A's data points and the centroids; the server and data owner B jointly compute, through the secure distance computation protocol, the ciphertext distance between each of B's data points and the centroids;
S33: According to the ciphertext distance sets obtained in step S32, the server assigns the data of data owners A and B to the nearest class, storing the two owners' data separately within the same class.
In a further improvement, the processing in step S4 comprises the following steps:
S41: The server sends the data points stored separately within each class to the corresponding data owners A and B;
S42: Data owners A and B decrypt them;
S43: The server and data owners A and B compute the new centroid of that class through the secure circuit protocol.
The invention also provides a system implementing the method, comprising a database server, a first client used by data owner A and a second client used by data owner B. The first client and the second client are used to encrypt their respective data and upload the ciphertext to the server; to each randomly select k centroids, encrypt them and upload them to the server; after the server has classified the data, to recompute k new centroids jointly with the server; and to evaluate the distance between the new centroids and the previous centroids: if it is below a threshold, classification ends and the server is requested to send the classified data to the first client and the second client respectively; otherwise the centroids are uploaded again. The server is used to receive the data uploaded by the first client and the second client, compute the Euclidean distance from data points to centroids, classify the data points according to the computed distances, and then recompute k new centroids jointly with the first client and the second client.
In a further improvement, the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system.
In a further improvement, centroid selection by the first client and the second client covers both the number of centroids and their values, and comprises the following modules:
Centroid selection module: randomly selects k centroids;
Classification module: iterates on the respective data set according to the traditional K-means clustering algorithm and classifies the data;
Secure distance computation module: computes, through the secure distance computation protocol, the distance from each data point to its corresponding centroid, and the sum S of the distances of all data points;
Centroid number selection module: determines that k is the number of centroids when the sums S corresponding to k-1, k and k+1 centroids differ only slightly;
Centroid value selection module: computes the average of the values of the respective centroids, that average being the value of the k centroids.
In a further improvement, the server comprises:
First ciphertext distance computation module: computes the ciphertext distance between every ciphertext record of the first client and the ciphertext centroids it uploaded, and between every ciphertext record of data owner B and the ciphertext centroids it uploaded;
Second ciphertext distance computation module: jointly with the first client, computes the ciphertext distance between each data point of the first client and the centroids; jointly with the second client, computes the ciphertext distance between each data point of the second client and the centroids;
Classification module: according to the ciphertext distance sets computed by the second ciphertext distance computation module, assigns the data of the first client and the second client to the nearest class, storing them separately within the same class.
In a further improvement, the server also comprises:
Sending module: sends the data points stored separately within the same class to the corresponding first client and second client;
Secure centroid computation module: computes the new centroid of each class together with the first client and the second client through the secure circuit protocol.
Compared with the prior art, the beneficial effects of the invention are as follows: by using encryption, the application ensures both the security of the data during mining and the correctness of the result; it supports data storage outsourcing and can run on larger data sets; it supports computation outsourcing, with most of the computation outsourced to the cloud platform, whose computing power substantially improves execution efficiency while preserving correctness; and it not only achieves secure computation under the semi-honest model but also, in the centroid recomputation phase, supports secure computation in which most of the three participants may be malicious.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a structural diagram of the system of the invention;
Fig. 3 is a diagram of the server and client time consumption of the traditional K-means clustering algorithm;
Fig. 4 is a diagram of the server and client time consumption of the invention;
Fig. 5 is a diagram of the server and client shares of time consumption for the traditional K-means clustering algorithm;
Fig. 6 is a diagram of the server and client shares of time consumption for the invention;
Fig. 7 shows the ratio of the time consumed by the invention to that of the traditional K-means clustering algorithm.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.
To address the performance problems of privacy-preserving data mining, the present invention studies existing privacy-preserving data mining algorithms in depth and proposes an efficient privacy-preserving K-means clustering algorithm on horizontally partitioned data sets. The scheme involves two data owners and a cloud platform, and supports storage outsourcing and computation outsourcing at the same time. Data are stored in the cloud in ciphertext form, and the cloud platform completes the K-means clustering task on the joint data set of the two parties by interacting with the two data owners. The invention designs separate security protocols for three technical difficulties of privacy-preserving K-means clustering: a secure distance computation protocol for computing distances on ciphertext, a secure comparison protocol for comparing ciphertexts, and a secure circuit protocol for division on ciphertext. These protocols are then applied within the clustering framework to realize a privacy-preserving K-means clustering algorithm.
As shown in Fig. 1, the privacy-preserving K-means clustering method of the invention comprises five main steps, described in detail below:
Step S1: Data owners A and B encrypt their respective data and upload the ciphertext to the server. In this example, data owner A is Alice, data owner B is Bob, and the server is C.
Alice and Bob encrypt their data $D_x$ and $D_y$ with their own public keys $pk_1$ and $pk_2$, obtaining ciphertexts $C_x$ and $C_y$, and upload $C_x$ and $C_y$ to C. Every record in $D_x$ and $D_y$ is l-dimensional, so encrypting a database means encrypting every dimension of every record. All of Alice's and Bob's data are stored in ciphertext form in the cloud file system. Specifically: [formula not reproduced], where m is the number of records.
Step S2: Alice and Bob select k centroids, encrypt them with their respective public keys and upload them to C.
In this example, centroid selection is a very important step: it directly determines the number of iterations and hence the overall execution time of the system, so good centroids speed up convergence and improve execution efficiency. Centroid selection here has two parts. The first is choosing the number of centroids: Alice and Bob each pick a random value of k and k random centroids, and then run one iteration on their own data sets. After classification, the distance from each data point to its corresponding centroid is computed, and the sum of all these distances is S. When the values of S corresponding to k-1, k and k+1 differ only slightly, k is taken as the number of centroids. Alice and Bob each find their own k in this way, and the average of the two values is the final k. Alice randomly selects k centroids $M = \{\mu_c \mid 1 \le c \le k\}$, where $\mu_c = \{\mu_{cj} \mid 1 \le j \le l\}$. Alice and Bob encrypt the centroids with Alice's and Bob's public keys respectively and upload them to the cloud; the ciphertexts of the centroids are [formulas not reproduced].
Step S3: Server C computes the Euclidean distance from ciphertext data points to centroids through the secure distance computation protocol, and then classifies the data points according to the computed distances through the secure comparison protocol. Specifically:
C computes the ciphertext distance between each encrypted record of Alice and each of her encrypted centroids, and between each encrypted record of Bob and each of his encrypted centroids. C and Alice jointly run the SSED (secure distance computation) protocol to compute the ciphertext distance between each $x_i$ and $\mu_c$, and C and Bob jointly run the SSED protocol to compute the ciphertext distance between each $y_i$ and $\mu_c$. All ciphertext distances between $x_i$ and $\mu_c$ are stored in one set, and all ciphertext distances between $y_i$ and $\mu_c$ in another.
The homomorphic encryption used in this method is a partially homomorphic scheme supporting addition on ciphertexts, namely Paillier encryption. It is a probabilistic encryption scheme defined by a 4-tuple, $Enc_{pa}$ = {KeyGen, Encrypt, Decrypt, Evaluate}.
Paillier encryption proceeds as follows:
● KeyGen($1^k$) → (pk, sk):
(1) Select two large primes p and q satisfying gcd(pq, (p-1)(q-1)) = 1;
(2) Compute N = pq and λ = lcm(p-1, q-1);
(3) Randomly select an integer $g \in \mathbb{Z}^*_{N^2}$;
(4) Find μ satisfying $\mu = (L(g^{\lambda} \bmod N^2))^{-1} \bmod N$, where L is the function L(u) = (u-1)/N. The public key is (N, g) and the private key is (λ, μ).
● Encrypt(x, r) → c:
Given a plaintext x, select a random number r; the ciphertext is $c = g^x r^N \bmod N^2$. Encryption is also written $E_{pk}(x) = c$.
● Decrypt(c) → x:
Decryption computes $x = L(c^{\lambda} \bmod N^2) \cdot \mu \bmod N$. $D_{sk}(c)$ denotes Decrypt(c).
● Evaluate:
$E_{pk}(x) E_{pk}(y) = E_{pk}(x+y)$ and $E_{pk}(x)^y = E_{pk}(xy)$, where x and y are two plaintexts.
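The four operations above can be exercised with a toy implementation. The sketch below follows the textbook Paillier scheme; it assumes the common choice g = N + 1 in place of a randomly drawn g, uses tiny primes purely for illustration, and all function names are made up for the example. It is not secure and is not the patent's implementation.

```python
import math
import secrets


def L(u, n):
    # L(u) = (u - 1) / N, an exact division for the values used here
    return (u - 1) // n


def keygen(p, q):
    """Return ((N, g), (lam, mu)) for two primes p and q."""
    n = p * q
    if math.gcd(n, (p - 1) * (q - 1)) != 1:
        raise ValueError("need gcd(pq, (p-1)(q-1)) = 1")
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # assumption: a standard valid choice instead of a random g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, x):
    n, g = pk
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:  # r must be invertible modulo N
        r = secrets.randbelow(n - 1) + 1
    return pow(g, x, n2) * pow(r, n, n2) % n2


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n


def add(pk, c1, c2):
    # E(x) * E(y) = E(x + y)
    return c1 * c2 % (pk[0] ** 2)


def mul_const(pk, c, y):
    # E(x)^y = E(x * y)
    return pow(c, y, pk[0] ** 2)


pk, sk = keygen(1009, 1013)  # tiny primes, for illustration only
cx, cy = encrypt(pk, 123), encrypt(pk, 456)
sum_plain = decrypt(pk, sk, add(pk, cx, cy))
prod_plain = decrypt(pk, sk, mul_const(pk, cx, 7))
```

Because encryption is randomized, two encryptions of the same plaintext differ, yet both decrypt to the same value; results are correct modulo N.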
The secure distance computation protocol of this example is built on a secure multiplication protocol, whose procedure is as follows: [protocol listing not reproduced], where $Z_n$ is the space of positive integers and $r_x$ and $r_y$ are positive integers.
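Since the protocol listing itself is not available here, the following sketch shows one commonly used way to build secure multiplication, and from it a squared distance, on top of Paillier: the server blinds its two ciphertexts with random values, the key holder multiplies the blinded plaintexts, and the server removes the blinding homomorphically. This is an assumed construction for illustration and may differ from the patent's protocol; the toy Paillier with g = N + 1, the tiny primes and every function name are hypothetical.

```python
import math
import secrets


def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # mu = lam^-1 mod n when g = n + 1


def enc(n, x):
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + (x % n) * n) * pow(r, n, n2) % n2  # (n+1)^x = 1 + x*n mod n^2


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def owner_multiply(n, sk, c1, c2):
    """Key holder: decrypt two blinded values, return E(product)."""
    return enc(n, dec(n, sk, c1) * dec(n, sk, c2))


def secure_multiply(n, ea, eb, ask_owner):
    """Server: obtain E(a*b) from E(a), E(b) without learning a or b."""
    n2 = n * n
    ra, rb = secrets.randbelow(n), secrets.randbelow(n)  # blinding values
    eh = ask_owner(ea * enc(n, ra) % n2, eb * enc(n, rb) % n2)  # E((a+ra)(b+rb))
    # a*b = (a+ra)(b+rb) - a*rb - b*ra - ra*rb, evaluated under encryption
    out = eh * pow(ea, n - rb, n2) % n2
    out = out * pow(eb, n - ra, n2) % n2
    return out * enc(n, -ra * rb) % n2


def secure_sq_distance(n, ex, emu, ask_owner):
    """Server: E(sum_j (x_j - mu_j)^2) from component-wise ciphertexts."""
    n2 = n * n
    total = enc(n, 0)
    for cx, cm in zip(ex, emu):
        diff = cx * pow(cm, n - 1, n2) % n2  # E(x_j - mu_j)
        total = total * secure_multiply(n, diff, diff, ask_owner) % n2
    return total


n, sk = keygen(1009, 1013)  # tiny primes, for illustration only
x, mu = [3, 7, 1], [1, 2, 5]
ex = [enc(n, v) for v in x]
emu = [enc(n, v) for v in mu]
ed = secure_sq_distance(n, ex, emu, lambda c1, c2: owner_multiply(n, sk, c1, c2))
dist = dec(n, sk, ed)
```

The key holder only ever sees blinded values, and the server only ever sees ciphertexts; the final decryption is included solely to check the result.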
The procedure of the secure distance computation protocol of this example is as follows: [protocol listing not reproduced]
C then classifies all data points, as follows:
By comparing the distances in the two ciphertext distance sets, $x_i$ and $y_i$ are assigned to their nearest classes. C and Alice run the secure comparison protocol, and C and Bob do the same; all ciphertexts are then placed into the corresponding class sets for Alice and Bob. Each class set of Alice ($P_1$) stores her data points assigned to class c, and each class set of Bob stores his data points assigned to class c. The calculation formula is: [formula not reproduced]
The procedure of the secure comparison protocol is as follows: [protocol listing not reproduced]
Step S4: C, Alice and Bob jointly recompute the k centroids through the secure circuit protocol. Because the data of the two participants in $CL_1$ and $CL_2$ are encrypted under different public keys, the new centroids cannot be computed directly. In this example, C first sends $CL_1$ and $CL_2$ to Alice and Bob respectively, who decrypt them to obtain $L_1$ and $L_2$. The calculation formula is: [formula not reproduced]
C, Alice and Bob then execute the SC (secure circuit) protocol to compute [formula not reproduced], where the operands are the ciphertext data of Alice and Bob respectively.
This yields one component $\mu_{cj}$ of a new centroid. The SC protocol ensures that Alice and Bob obtain all the new centroids.
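What the secure circuit has to evaluate can be shown in the clear: each component of a new centroid is the sum of that component over both owners' members of the class, divided by the total member count. The sketch below is a plaintext illustration only; in the protocol these values would never be combined openly, and the function names and sample data are assumptions.

```python
def class_sums(members, dim):
    """Component-wise sum and count of one owner's points in one class."""
    sums = [0.0] * dim
    for point in members:
        for j in range(dim):
            sums[j] += point[j]
    return sums, len(members)


def merged_centroid(sum_a, n_a, sum_b, n_b):
    """Centroid of a class from the two owners' partial sums and counts."""
    total = n_a + n_b
    if total == 0:
        return None  # empty class: no centroid can be derived
    return [(sa + sb) / total for sa, sb in zip(sum_a, sum_b)]


# Hypothetical members of the same class held by Alice and by Bob.
alice_members = [(1.0, 2.0), (3.0, 4.0)]
bob_members = [(5.0, 6.0)]
sum_a, n_a = class_sums(alice_members, 2)
sum_b, n_b = class_sums(bob_members, 2)
centroid = merged_centroid(sum_a, n_a, sum_b, n_b)
```

The division is the step that additive homomorphic encryption cannot perform directly, which is why a separate circuit-based protocol is used for it.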
The procedure of the secure circuit protocol is as follows: [protocol listing not reproduced]
Step S5: Alice uses the secure comparison protocol to compute the distance between the new centroids and the previous centroids. If it is below the threshold, Alice and Bob ask C to send the classified data to Alice and Bob respectively. Otherwise, Alice and Bob encrypt the new centroids with their respective public keys and upload them to C, and the next iteration begins.
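The stopping rule of step S5 can be sketched in plaintext as follows. The text does not specify whether the threshold applies to each centroid separately or to an aggregate, so the per-centroid check below is an assumption, as are the function name and the sample values; in the protocol the comparison itself would run under the secure comparison protocol.

```python
import math


def has_converged(old_centroids, new_centroids, threshold):
    # Assumption: every centroid must have moved less than the threshold.
    return all(math.dist(old, new) < threshold
               for old, new in zip(old_centroids, new_centroids))


old = [(0.0, 0.0), (5.0, 5.0)]
small_move = [(0.0, 0.01), (5.0, 5.0)]
large_move = [(1.0, 0.0), (5.0, 5.0)]
stop_now = has_converged(old, small_move, 0.1)
keep_going = not has_converged(old, large_move, 0.1)
```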
As shown in Fig. 2, the invention also provides a system implementing the above method. The system of this example comprises a database server C, a first client $P_1$ used by data owner A and a second client $P_2$ used by data owner B. The first client $P_1$ and the second client $P_2$ are used to encrypt their respective data and upload the ciphertext to the server; to each randomly select k centroids, encrypt them and upload them to the server; after the server has classified the data, to recompute k new centroids jointly with the server; and to evaluate the distance between the new centroids and the previous centroids: if it is below a threshold, classification ends and the server is requested to send the classified data to the first client $P_1$ and the second client $P_2$ respectively; otherwise the centroids are uploaded again. The server is used to receive the data uploaded by the first client $P_1$ and the second client $P_2$, compute the Euclidean distance from data points to centroids, classify the data points according to the computed distances, and then recompute k new centroids jointly with the first client $P_1$ and the second client $P_2$.
In this example server C is a cloud server. The cloud server re-encrypts the data uploaded by data owners A and B and stores it in the cloud file system, which supports data storage outsourcing and allows execution on larger data sets. Computation outsourcing is also supported: most of the computation is outsourced to the cloud platform, whose computing power substantially improves execution efficiency while preserving correctness.
Analysis of the beneficial effects of the invention:
1. Scheme selected for comparison
The framework used by the invention was first proposed in the paper "Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks". In this comparison, the method of that paper is referred to as the previous scheme. Clustering algorithms under the same framework are more comparable than those under other frameworks, so the invention is mainly compared with the previous scheme. To keep the experimental comparison reliable, both schemes are run in the same experimental environment. The evaluation criteria are explained below, followed by a comparative analysis of the experimental results.
2. Evaluation criteria
The time consumption of the method falls into three parts: client time, communication time and server time, where client and server time each include the initialization phase and the protocol execution phase. Because the methods used by the present application and the previous scheme differ, the two can only be compared at a macroscopic level. The comparison covers two aspects: theoretical complexity analysis, including time, space and communication complexity; and experimental test results. Since the number of iterations affects the overall result, this example takes a single iteration as the unit and compares the following:
(1) the theoretical time, space and communication complexity of the two schemes;
(2) the data encryption time of the two schemes;
(3) the time consumed by the server and the client of the two schemes in one iteration.
3. Analysis of experimental results
In theory, the scheme of the invention has lower time, space and communication complexity than the previous scheme. The experimental results of the two schemes are analyzed below.
Encryption time is compared first. The previous scheme uses two encryption methods: all plaintext data must be encrypted once with the improved Liu encryption scheme and once more with the Paillier scheme. In the scheme of the invention all plaintext data need only one Paillier encryption, so in theory its encryption should take less time than that of the previous scheme. However, Paillier operates over a group and involves many exponentiations, whereas the improved Liu scheme uses only linear operations, so most of the encryption time is due to Paillier. Consequently, the encryption time of the invention is slightly lower than that of the previous scheme, but not by an order of magnitude, and the experimental results confirm this. The encryption time of the previous scheme is shown in Table 1 and that of the invention in Table 2.
Table 1. Encryption time of the existing scheme
Table 2. Encryption time of the present invention
Next, the time consumed in one iteration was measured and compared. In theory, the powerful computing capability of the cloud platform introduced by the present invention should make it slightly more efficient than the previous scheme. However, the cloud platform of the present invention consists of 30 PCs and one server, so processing a task requires dividing the work among the machines, scheduling tasks, and recording data, and these operations also consume some time. When there are more data points, one iteration takes longer, and the share of time taken by operations such as task division becomes lower. In the secure circuit protocol of the present invention, generating the circuit takes considerable time, but the circuit only needs to be generated once, in the first iteration. In theory, therefore, when the number of data points is small, one iteration of the previous scheme is more efficient than one iteration of the scheme of the present invention; when the number of data points exceeds a certain threshold, one iteration of the scheme of the present invention is more efficient than one iteration of the previous scheme, and as the data scale grows, the efficiency advantage of the scheme of the present invention becomes more and more obvious. The experimental results confirm this view. They also show that the threshold is approximately 5000 data points: when the data scale exceeds 7000 points, one iteration of the scheme of the present invention takes less time, and when the data scale is below 5000 points, one iteration of the previous scheme takes less time. The comparison of the time consumed in one iteration by the two schemes is shown in Table 3.
Table 3: Comparison of time consumed in one iteration
Within one iteration, the present invention is concerned not only with the time consumed by the iteration itself, but also with having server C undertake more of the work in each iteration and account for a higher share of the elapsed time. That is, while keeping the time of one iteration small, the ratio of the server's elapsed time to the total time of the iteration should be as large as possible, so that the amount of computation performed by the clients is reduced. With such a design, the efficiency of the scheme also increases as the data scale grows. The clients mainly perform encryption and decryption, and the number of client encryptions and decryptions is essentially the same in the two schemes. However, in the previous scheme, the ciphertext distance computation and the ciphertext distance comparison use the improved Liu encryption, all of whose operations are linear, whereas the scheme of the present invention uses the Paillier encryption algorithm, whose encryption and decryption require exponentiation and modular arithmetic over a group. For a client with limited computing capability, the improved Liu encryption algorithm should consume less time than the Paillier encryption used in the present invention. In theory, therefore, on a dataset of the same scale, the client in the previous scheme consumes less time than the client in the scheme of the present invention. As the data scale increases, the time consumed by one iteration of the scheme of the present invention is relatively small, while the time consumed by the client is relatively large. Consequently, as the data scale grows, the client's share of the elapsed time in the scheme of the present invention grows, and the server's share correspondingly shrinks. Collection and analysis of the experimental data confirmed this conjecture. The time consumed by each participant in one iteration of the two schemes is shown in Table 4 and Table 5. The elapsed time of the server and the clients in the previous scheme is shown in Fig. 3, and the elapsed time of the server and the clients in the present invention is shown in Fig. 4.
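The cost difference between linear operations and Paillier can be illustrated with a toy implementation. The following is a minimal sketch, assuming the textbook Paillier construction with the common simplification g = n + 1 and two small hard-coded primes; the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`) are hypothetical and the parameters are far too small to be secure. It shows only that encryption and decryption each require modular exponentiation modulo n², and that ciphertexts can be added without decryption.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy Paillier key pair from two small primes (insecure, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                                          # assumed simplification
    mu = pow(lam, -1, n)                               # modular inverse of lam mod n
    return (n, g), (lam, mu)

def encrypt(pub, m, rng=random):
    """c = g^m * r^n mod n^2, with r drawn at random and coprime to n."""
    n, g = pub
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(pub, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    n, _ = pub
    return (c1 * c2) % (n * n)

if __name__ == "__main__":
    pub, priv = keygen()
    c1, c2 = encrypt(pub, 12), encrypt(pub, 30)
    print(decrypt(pub, priv, add_encrypted(pub, c1, c2)))
```

Each call to `encrypt` performs two modular exponentiations and each call to `decrypt` performs one, which is the kind of work the paragraph above contrasts with the linear operations of the previous scheme.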
Table 4: Time consumed by each participant in one iteration of the previous scheme
Table 5: Time consumed by each participant in one iteration of the present application
Two schemes server and client side elapsed time increases with data point one is can be seen that from Fig. 3 and Fig. 4
Individual trend.Before in the experimental program of scheme, with the increase of data scale, server consumption has obvious ascendant trend, and
The elapsed time of client also has less ascendant trend.The computing capability for being primarily due to server is limited, the calculating of data
Compare responsible.With the increase of data scale, server be necessarily required to the increasing time go handle these data, cause
Elapsed time showed increased, the time consuming occupation ratio of server can also increase.With the increase of data scale, although client
End needs data to be processed also to increased, compared to server, and the operation of client is all linear calculating mostly, so data
The time consuming increase that scale increase is brought is not obvious, then the time consuming occupation ratio of client can be reduced.The present invention
Middle server is run in the cloud platform for having 30 PCs and 1 server to constitute, so the computing capability of server is
It can ensure.According to Fig. 4 as can be seen that with the increase of data scale, server end elapsed time increased, not
There is obvious ascendant trend.And client elapsed time is increasing with the increase of data scale, client is primarily due to
The decryption oprerations done are the index operations on group, and compared to linear operation, the operation has bigger amount of calculation.Therefore,
With the increase of data scale, server elapsed time occupation ratio can have been reduced in the present invention, and client elapsed time is accounted for
Have than that can increased.Before in scheme server and client side's elapsed time occupation ratio as shown in figure 5, in the present invention service
Device and client elapsed time occupation ratio are as shown in Figure 6.
Finally, the time taken to process the data in one iteration was measured experimentally for the K-means clustering algorithm with privacy protection and for the classical K-means algorithm. The results show that the time overhead introduced by encryption is relatively large. However, as the data scale increases, the ratio of the time consumed by one iteration of the present invention to the time consumed by one iteration of classical K-means becomes smaller and smaller. The time consumed in one iteration by the present invention and by the classical K-means algorithm is shown in Table 6, and the ratio of the two times is shown in Fig. 7.
Table 6: Time consumed in one iteration by the present invention and by the classical K-means algorithm
The present invention selects the K-means algorithm, which is widely used in data mining, and mines a joint dataset that is horizontally partitioned between two parties, while supporting both storage outsourcing and computation outsourcing to a cloud platform. The beneficial effects of the present invention are mainly the following:
(1) By analyzing the state of privacy-preserving data mining in China and abroad, the advantages and disadvantages of the commonly used existing techniques are clearly understood. Schemes based on data perturbation execute efficiently, but because they alter the original dataset, they inevitably affect the data mining results to some degree, whereas schemes based on encryption can ensure the correctness of the mining results. The present invention uses encryption and thereby ensures the correctness of the data mining results;
(2) The scheme of the present invention supports data storage outsourcing. Compared with an ordinary PC, the cloud platform has greater storage capacity, which allows the scheme of the present invention to be executed on larger datasets;
(3) The scheme of the present invention supports data computation outsourcing. The cloud platform is a distributed computing framework that can consolidate many resources into a cluster, thereby greatly increasing the computing capability of the system. The scheme of the present invention outsources most of the computation to the cloud platform and, by means of the cloud platform's powerful computing capability, greatly improves execution efficiency while ensuring correctness;
(4) The time complexity, space complexity, communication complexity, and security of the algorithm are analyzed theoretically, and the correctness and efficiency of the algorithm are verified experimentally. The K-means clustering algorithm with privacy protection proposed by the present invention not only achieves secure computation under the semi-honest model but also, in the centroid recomputation stage, supports secure computation in which a majority of the three participants are malicious parties.
The embodiments described above are preferred embodiments of the present invention and do not limit its specific scope of implementation. The scope of the present invention includes, but is not limited to, these embodiments, and all equivalent changes made in accordance with the present invention fall within the scope of protection of the present invention.
Claims (10)
1. A K-means clustering method with privacy protection, characterized in that it comprises the following steps:
S1: data owners A and B each encrypt their own data and then upload the ciphertext to a server;
S2: data owners A and B each randomly select k centroids, encrypt them, and upload them to the server;
S3: the server computes the Euclidean distance from each ciphertext data point to the centroids by means of a secure distance computation protocol, and classifies the data points according to the computed Euclidean distances by means of a secure comparison protocol;
S4: the server and data owners A and B jointly recompute k new centroids by means of a secure circuit protocol;
S5: data owner A or B determines, by means of the secure comparison protocol, the distance between the new centroids and the original centroids in ciphertext; if the distance is less than a threshold, classification ends, and data owners A and B request the server to send the classified data to data owners A and B respectively; otherwise, the method returns to step S2 and performs the next iteration.
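For illustration only, the control flow of steps S1 to S5 can be sketched in plaintext Python. This sketch assumes no encryption at all: the secure distance computation, secure comparison, and secure circuit protocols are each replaced by their plaintext equivalents, and the function names (`nearest`, `recompute`, `kmeans_two_party`) and the `threshold` and `max_iter` parameters are hypothetical. It is not the claimed method, only the iteration structure it follows.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (plaintext Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def recompute(points, labels, old_centroids):
    """Mean of each class; an empty class keeps its previous centroid."""
    new = []
    for j, old in enumerate(old_centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new.append(old)
    return new

def kmeans_two_party(data_a, data_b, centroids, threshold=1e-6, max_iter=100):
    """Plaintext stand-in for S1-S5 on a horizontally partitioned dataset."""
    points = list(data_a) + list(data_b)   # stands in for the uploads of S1
    labels = []
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]        # S3: classify
        new_centroids = recompute(points, labels, centroids)    # S4: new centroids
        shift = max(math.dist(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:                                   # S5: convergence test
            break
    # Each owner receives only the labels of its own records.
    return centroids, labels[:len(data_a)], labels[len(data_a):]
```

In this sketch the convergence test uses the largest centroid shift; the claim only requires that the distance between new and original centroids fall below a threshold, so other aggregations would fit equally well.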
2. The K-means clustering method with privacy protection according to claim 1, characterized in that: in step S1, the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.
3. The K-means clustering method with privacy protection according to claim 2, characterized in that: in step S2, the selection of the centroids comprises selecting both the number and the values of the centroids, and specifically comprises the following steps:
S21: data owners A and B each randomly select k centroids;
S22: each data owner iterates on its own dataset according to the conventional K-means clustering algorithm and classifies the data;
S23: the distance from each data point to its corresponding centroid is computed, and the sum S of the distances over all data points is computed;
S24: when the sums S corresponding to k-1, k, and k+1 centroids differ little, k is taken as the number of centroids;
S25: data owners A and B each compute an average from the values of their respective centroids, and the averages are the values of the k centroids.
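Steps S23 and S24 can be sketched as follows. This is a minimal illustration, assuming that "differ little" is interpreted as a relative change below a tolerance `rel_tol`; that tolerance, its default value, and the function names `distance_sum` and `choose_k` are assumptions not found in the claim. Step S25 is not covered here.

```python
import math

def distance_sum(points, centroids, labels):
    """S23: sum S of the distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[label]) for p, label in zip(points, labels))

def choose_k(sum_by_k, rel_tol=0.1):
    """S24: smallest k whose sum S changes little relative to k-1 and k+1.

    `sum_by_k` maps a candidate k to the sum S obtained with k centroids.
    Returns None if no candidate satisfies the (assumed) tolerance test.
    """
    for k in sorted(sum_by_k):
        if k - 1 in sum_by_k and k + 1 in sum_by_k:
            s_prev, s_k, s_next = sum_by_k[k - 1], sum_by_k[k], sum_by_k[k + 1]
            small_before = abs(s_prev - s_k) <= rel_tol * s_prev
            small_after = abs(s_k - s_next) <= rel_tol * s_k
            if small_before and small_after:
                return k
    return None
```

With sums such as `{1: 100.0, 2: 40.0, 3: 12.0, 4: 11.5, 5: 11.2}`, the curve flattens after k = 3, and the tolerance test first passes at k = 4, whose neighbors on both sides differ from it by less than 10 percent.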
4. The K-means clustering method with privacy protection according to claim 3, characterized in that the computation of step S3 comprises the following steps:
S31: the server computes the ciphertext distance between each ciphertext record of data owner A and the ciphertext centroids uploaded by data owner A, and the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroids uploaded by data owner B;
S32: the server and data owner A jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of data owner A and the centroids; the server and data owner B jointly compute, using the secure distance computation protocol, the ciphertext distance between each data point of data owner B and the centroids;
S33: according to the set of ciphertext distances obtained in step S32, the server assigns the data of data owners A and B to the nearest class, and stores the data of the two owners separately within the same class.
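The bookkeeping of step S33, in which records of both owners fall into the same class but are stored separately, can be sketched in plaintext. The sketch assumes plaintext points and distances in place of ciphertext distances; the function name `classify` and the `(owner, point)` record layout are hypothetical.

```python
import math
from collections import defaultdict

def classify(records, centroids):
    """Assign each (owner, point) record to its nearest class.

    Returns {class_index: {owner: [points]}} so that, within one class,
    the records of each data owner are kept in separate lists.
    """
    classes = defaultdict(lambda: defaultdict(list))
    for owner, point in records:
        j = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
        classes[j][owner].append(point)
    return {j: dict(by_owner) for j, by_owner in classes.items()}
```

Keeping a per-owner list inside each class makes it straightforward to return to each owner only its own records, as the subsequent step requires.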
5. The K-means clustering method with privacy protection according to claim 4, characterized in that the processing of step S4 comprises the following steps:
S41: the server sends the data points stored separately within the same class to the corresponding data owners A and B respectively;
S42: data owners A and B decrypt the data;
S43: the server and data owners A and B compute the new centroid of the class using the secure circuit protocol.
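The arithmetic that the joint centroid computation has to produce can be written out in plaintext. This sketch assumes that each owner contributes a per-class coordinate sum and record count, and that the new centroid is the combined sum divided by the combined count; the secure circuit itself is not specified in this section and is not modeled. The names `local_summary` and `joint_centroid` are hypothetical.

```python
def local_summary(points, dim):
    """Per-owner coordinate sums and record count for one class."""
    sums = [0.0] * dim
    for p in points:
        for i, v in enumerate(p):
            sums[i] += v
    return sums, len(points)

def joint_centroid(summary_a, summary_b):
    """New centroid of a class from the two owners' (sums, count) summaries."""
    (sum_a, n_a), (sum_b, n_b) = summary_a, summary_b
    total = n_a + n_b
    if total == 0:
        raise ValueError("class has no members")
    return tuple((a + b) / total for a, b in zip(sum_a, sum_b))
```

For example, if owner A holds (0, 0) and (2, 2) in a class and owner B holds (4, 7), the summaries are ([2.0, 2.0], 2) and ([4.0, 7.0], 1), and the joint centroid is (2.0, 3.0).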
6. A system for implementing the K-means clustering method with privacy protection according to any one of claims 1 to 5, characterized in that it comprises a database, a first client used by data owner A, and a second client used by data owner B, wherein the first client and the second client are configured to encrypt their respective data and upload the ciphertext to the server, to each randomly select k centroids, encrypt them, and upload them to the server, to wait for the server to classify the data and then recompute k new centroids jointly with the server, and to determine the distance between the new centroids and the original centroids; if the distance is less than a threshold, classification ends and the clients request the server to send the classified data to the first client and the second client respectively; otherwise the centroids are uploaded again; the server is configured to receive the data uploaded by the first client and the second client, compute the Euclidean distance from each data point to the centroids, classify the data points according to the computed Euclidean distances, and then recompute k new centroids jointly with the first client and the second client.
7. The system according to claim 6, characterized in that: the server is a cloud server, and the cloud server re-encrypts the data uploaded by data owners A and B and stores it in a cloud file system.
8. The system according to claim 7, characterized in that: the centroid selection of the first client and the second client comprises selecting both the number and the values of the centroids, and specifically comprises the following modules:
a centroid selection module, configured to randomly select k centroids;
a classification module, configured to iterate on the respective dataset according to the conventional K-means clustering algorithm and classify the data;
a secure distance computation module, configured to compute, by means of the secure distance computation protocol, the distance from each data point to its corresponding centroid, and to compute the sum S of the distances over all data points;
a centroid number selection module, configured to determine that, when the sums S corresponding to k-1, k, and k+1 centroids differ little, k is the number of centroids;
a centroid value selection module, configured to compute an average from the values of the respective centroids, the averages being the values of the k centroids.
9. The system according to claim 8, characterized in that the server comprises:
a first ciphertext distance computation module, configured to compute the ciphertext distance between each ciphertext record of the first client and the ciphertext centroids uploaded by the first client, and to compute the ciphertext distance between each ciphertext record of data owner B and the ciphertext centroids uploaded by data owner B;
a second ciphertext distance computation module, configured to compute, jointly with the first client, the ciphertext distance between each data point of the first client and the centroids, the server and the second client jointly computing the ciphertext distance between each data point of the second client and the centroids;
a classification module, configured to assign the data of the first client and the second client to the nearest class according to the set of ciphertext distances computed by the second ciphertext distance computation module, and to store the data of the two clients separately within the same class.
10. The system according to claim 9, characterized in that the server further comprises: a sending module, configured to send the data points stored separately within the same class to the corresponding first client and second client respectively;
a secure centroid computation module, configured to compute the new centroid of the same class jointly with the first client and the second client by means of the secure circuit protocol.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710224275.7A CN107145791B (en) | 2017-04-07 | 2017-04-07 | K-means clustering method and system with privacy protection function |
PCT/CN2017/117943 WO2018184407A1 (en) | 2017-04-07 | 2017-12-22 | K-means clustering method and system having privacy protection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107145791A true CN107145791A (en) | 2017-09-08 |
CN107145791B CN107145791B (en) | 2020-07-10 |
Family
ID=59775048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710224275.7A Active CN107145791B (en) | 2017-04-07 | 2017-04-07 | K-means clustering method and system with privacy protection function |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107145791B (en) |
WO (1) | WO2018184407A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610196B (en) * | 2019-08-14 | 2023-04-28 | 平安科技(深圳)有限公司 | Desensitization method, system, computer device and computer readable storage medium |
CN114154554A (en) * | 2021-10-28 | 2022-03-08 | 上海海洋大学 | Privacy protection outsourcing data KNN algorithm based on non-collusion double-cloud server |
CN117633881A (en) * | 2023-11-27 | 2024-03-01 | 国能神皖合肥发电有限责任公司 | Power data optimization processing method |
CN117688502B (en) * | 2024-02-04 | 2024-04-30 | 山东大学 | Safe outsourcing calculation method and system for detecting local abnormal factors |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105138923A (en) * | 2015-08-11 | 2015-12-09 | 苏州大学 | Privacy protection time sequence similarity calculation method |
CN105760780A (en) * | 2016-02-29 | 2016-07-13 | 福建师范大学 | Trajectory data privacy protection method based on road network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102970143B (en) * | 2012-12-13 | 2015-04-22 | 中国科学技术大学苏州研究院 | Method for securely computing index of sum of held data of both parties by adopting addition homomorphic encryption |
US9710493B2 (en) * | 2013-03-08 | 2017-07-18 | Microsoft Technology Licensing, Llc | Approximate K-means via cluster closures |
CN107145791B (en) * | 2017-04-07 | 2020-07-10 | 哈尔滨工业大学深圳研究生院 | K-means clustering method and system with privacy protection function |
Non-Patent Citations (2)
Title |
---|
LIU XIAOYAN et al.: "Outsourcing Two-party Privacy Preserving K-means Clustering Protocol In Wireless Sensor Networks", IEEE COMPUTER SOCIETY * |
XUE ANRONG et al.: "Fast clustering algorithm with privacy protection", Systems Engineering and Electronics * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018184407A1 (en) * | 2017-04-07 | 2018-10-11 | 哈尔滨工业大学深圳研究生院 | K-means clustering method and system having privacy protection |
CN107707494B (en) * | 2017-10-10 | 2020-02-11 | 苏州大学 | Optical fiber nonlinear equalization method for 64-QAM coherent optical communication system |
CN107707494A (en) * | 2017-10-10 | 2018-02-16 | 苏州大学 | Nonlinear fiber equalization methods for 64 QAM coherent optical communication systems |
CN107784663A (en) * | 2017-11-14 | 2018-03-09 | 哈尔滨工业大学深圳研究生院 | Correlation filtering tracking and device based on depth information |
CN107784663B (en) * | 2017-11-14 | 2020-10-20 | 哈尔滨工业大学深圳研究生院 | Depth information-based related filtering tracking method and device |
CN109214205A (en) * | 2018-08-01 | 2019-01-15 | 安徽师范大学 | Position and data-privacy guard method in a kind of intelligent perception based on k- anonymity |
CN109615021A (en) * | 2018-12-20 | 2019-04-12 | 暨南大学 | A kind of method for protecting privacy based on k mean cluster |
CN109615021B (en) * | 2018-12-20 | 2022-09-27 | 暨南大学 | Privacy information protection method based on k-means clustering |
CN110162999A (en) * | 2019-05-08 | 2019-08-23 | 湖北工业大学 | A kind of income distribution difference Gini coefficient measure based on secret protection |
CN110162999B (en) * | 2019-05-08 | 2022-06-07 | 湖北工业大学 | Income distribution difference kini coefficient measurement method based on privacy protection |
CN110163292A (en) * | 2019-05-28 | 2019-08-23 | 电子科技大学 | Secret protection k-means clustering method based on vector homomorphic cryptography |
CN114730389B (en) * | 2019-11-06 | 2023-07-07 | 维萨国际服务协会 | System and method for privacy preserving unsupervised learning |
CN114730389A (en) * | 2019-11-06 | 2022-07-08 | 维萨国际服务协会 | Dual server privacy preserving clustering |
CN111444545A (en) * | 2020-06-12 | 2020-07-24 | 支付宝(杭州)信息技术有限公司 | Method and device for clustering private data of multiple parties |
CN112487481B (en) * | 2020-12-09 | 2022-06-10 | 重庆邮电大学 | Verifiable multi-party k-means federal learning method with privacy protection |
CN112487481A (en) * | 2020-12-09 | 2021-03-12 | 重庆邮电大学 | Verifiable multi-party k-means federal learning method with privacy protection |
CN112508203B (en) * | 2021-02-08 | 2021-06-15 | 同盾控股有限公司 | Data clustering processing method, device, equipment and medium based on federal learning |
CN112508203A (en) * | 2021-02-08 | 2021-03-16 | 同盾控股有限公司 | Federated data clustering method and device, computer equipment and storage medium |
CN113033915B (en) * | 2021-04-16 | 2021-12-31 | 哈尔滨理工大学 | Method and device for comparing shortest distance between car sharing user side and driver side |
CN113033915A (en) * | 2021-04-16 | 2021-06-25 | 哈尔滨理工大学 | Method and device for comparing shortest distance between car sharing user side and driver side |
CN113438254B (en) * | 2021-08-24 | 2021-11-05 | 北京金睛云华科技有限公司 | Distributed classification method and system for ciphertext data in cloud environment |
CN113438254A (en) * | 2021-08-24 | 2021-09-24 | 北京金睛云华科技有限公司 | Distributed classification method and system for ciphertext data in cloud environment |
CN116801380A (en) * | 2023-03-23 | 2023-09-22 | 昆明理工大学 | UWB indoor positioning method based on improved full centroid-Taylor |
CN116801380B (en) * | 2023-03-23 | 2024-05-28 | 昆明理工大学 | UWB indoor positioning method based on improved full centroid-Taylor |
Also Published As
Publication number | Publication date |
---|---|
CN107145791B (en) | 2020-07-10 |
WO2018184407A1 (en) | 2018-10-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |