CN107145792B - Multi-user privacy protection data clustering method and system based on ciphertext data - Google Patents
- Publication number
- CN107145792B, CN201710225047.1A
- Authority
- CN
- China
- Prior art keywords
- data
- server
- user
- clustering
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Bioethics (AREA)
- Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Storage Device Security (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a multi-user privacy-preserving data clustering method and system based on ciphertext data, and belongs to the technical field of data mining. The method comprises the following steps: two or more users each encrypt their data and send the encrypted data, the cluster center points, and the trapdoor information to a server; the server computes the distance between each ciphertext data point and each cluster center point and partitions the data into clusters; within each cluster, the server sums the data points of each user separately and sends each user the sum and the count of its data; each user re-encrypts the received sum and count and sends them to the server; the server computes the new cluster center points and sends them to each user; and all users jointly compute the average value of the data points in each cluster from the cluster center point through an outsourced privacy-preserving average computation protocol, then send the average value to the server for the next iteration. The invention greatly improves clustering efficiency, achieves secure computation under the semi-honest model, and can resist collusion attacks to a certain degree.
Description
Technical Field
The invention relates to the technical field of data mining, in particular to a multi-user privacy protection data clustering method based on ciphertext data, and further provides a system for realizing the multi-user privacy protection data clustering method based on the ciphertext data.
Background
Privacy-Preserving Data Mining (PPDM) addresses data mining that involves two or more parties who do not want their private data revealed during the computation. It allows mining over the joint data of two or more parties while ensuring that no party's private data is stolen by others.
PPDM techniques fall into two main groups: methods based on data perturbation and methods based on cryptography. Perturbation-based methods protect the source data by adding noise to it, at the cost of some loss of precision. Cryptography-based methods rely mainly on homomorphic encryption and secure multi-party computation; compared with data perturbation they alter the data little and give high precision, but their time complexity is often higher and their computational cost larger.
The early cryptography-based methods were mainly distributed computations without cloud participation. They protected data privacy mainly through Yao's secure circuit evaluation protocol or partially homomorphic encryption, but they were inefficient, placed a heavy computational load on every participant, and were hard to put into practice. Later, in 2012, Peter et al. proposed outsourced secure multi-party computation based on the BCP encryption method, making it possible to use the cloud to reduce the participants' computational load. In the same year, Asharov proposed a threshold homomorphic encryption method for multi-party computation, which further improved the efficiency of cloud computation; however, that method cannot protect user privacy, and a user's content is easily stolen by other users.
As for clustering, the classic method is the traditional K-means algorithm. In the first iteration, K points are randomly selected from the data as cluster center points; the Euclidean distance from every other point to each center is then computed, and each point is assigned to the center at the shortest distance. Once the partition is complete, each component of the points in each cluster is averaged to recompute the cluster centers, which ends the first iteration and starts the next. The loop continues until the cluster centers no longer change between iterations, at which point iteration stops and clustering is complete.
K-means is one of the simpler clustering algorithms: it groups the samples into K clusters according to a fixed rule. However, the traditional algorithm provides no user privacy protection, and data participants can easily obtain other users' data, so it is deficient in security.
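The plaintext K-means loop described above can be sketched as follows. This is a minimal illustration of the classical algorithm only, with no privacy protection; the function name `kmeans`, the `max_iter` cap, and the fixed `seed` are assumptions made for the example and do not come from the patent.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples (no encryption)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # K random points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance from p to every center
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d2.index(min(d2))].append(p)
        # New center = component-wise mean; an empty cluster keeps its old center
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[t]
            for t, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # centers unchanged: stop iterating
            break
        centers = new_centers
    return centers, clusters
```

Every participant in this sketch sees every data point in the clear, which is the security defect the invention sets out to remove.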
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a multi-user privacy protection data clustering method based on ciphertext data and a system for realizing the method.
The multi-user privacy-preserving data clustering method based on ciphertext data of the invention comprises the following steps:
S1: two or more users each encrypt their data and send the encrypted data, the cluster center points, and the trapdoor information to a server;
S2: the server computes the distance between each ciphertext data point and each cluster center point, and partitions the clusters according to the distance and the trapdoor information;
S3: within each cluster, the server sums the data points of each user separately and sends each user the sum and the count of its data;
S4: each user re-encrypts the received sum and count with the BCP encryption method and sends them to the server;
S5: the server computes the new cluster center points and sends them to each user;
S6: all users jointly compute the average value of the data points in each cluster from the cluster center point and send the average value to the server; execution then returns to step S1 until the average value is smaller than the threshold value, at which point classification ends and the server sends the classification result to each user according to the data source.
In a further improvement of the present invention, in step S1, the server is an outsourced server, and the user encrypts the data twice, using homomorphic encryption and BCP encryption respectively. The data set $D = \{d_1, d_2, \dots, d_n\}$ contains n data points; each data point $d_i = (x_{i,1}, \dots, x_{i,m})$ is an m-dimensional vector; and each component $x_{i,j}$ of each data point $d_i$ is encrypted twice and uploaded to the outsourcing server as $\mathrm{Enc}(x_{i,j}) = (c_{e(i,j)}, c_{p(i,j)})$, where $1 \le i \le n$, $1 \le j \le m$, $c_{e(i,j)}$ denotes the ciphertext encrypted with Liu's homomorphic encryption scheme, and $c_{p(i,j)}$ denotes the ciphertext encrypted with the BCP encryption scheme.
In a further improvement of the present invention, the processing method in step S2 includes:
S21: based on the ciphertext $c_{e(i,j)}$, the server computes the distance $ED^2(d_i, E_t)$ between the ciphertext data point $d_i$ and the t-th cluster center point $E_t$, where k is the number of cluster center points and $1 \le t \le k$;
S22: using the Trapdoor function in the trapdoor information provided by the user, the outsourcing server computes $ED^2(d_i, E_t) + \mathrm{Trapdoor}$, compares the distances from each data point to every cluster center, selects the nearest center, and assigns the point to the corresponding cluster.
In a further improvement of the present invention, in step S21, each data point $d_i = (x_{i,1}, \dots, x_{i,m})$ and each cluster center point $E_t = (e_{t,1}, \dots, e_{t,m})$ is an m-dimensional vector, the encrypted data of each data point is $(c_{e(i,1)}, \dots, c_{e(i,m)})$, and the formula for the distance $ED^2(d_i, E_t)$ is as follows:
In a further improvement of the present invention, in step S22, the Trapdoor function is used to generate an order-preserving encryption index with which the sizes of two data values can be compared.
In step S3, the server uses the ciphertext $c_{p(i,j)}$ to compute the number of data points in each cluster and the sum of their corresponding components, and sends the sums to each user Pi according to the data distribution.
In a further improvement of the present invention, in steps S4-S6, each cluster center is recalculated by summing the corresponding components of the discrete points in the cluster and taking the average. Assume there are n points and t users Pi. Each user's values are held in encrypted form as $c_{p_i}$, each of which is an m-dimensional vector. For each cluster, the cloud sums the corresponding components of the discrete points $c_{p_i}$ belonging to Pi and counts them. The sum is $X_i = (x_{i1}, x_{i2}, \dots, x_{im})$, and the number of points in the cluster belonging to Pi is $a_i$. The server sends the computed $X_i$ and $a_i$ to each user Pi; each user Pi encrypts them with the BCP encryption scheme and, together with the server, runs the OPPWAP protocol; the final result is the computed average.
In a further improvement of the present invention, the processing performed by the server and each user under the OPPWAP protocol comprises the following steps:
A1: the outsourcing server S initializes through Setup, generates the public parameter PP = (N, K, g), and sends PP to each user Pi;
a2: after each user Pi obtains the public parameters, the public key and the private key (pk) of the user are generated by a key generatori,ski) And the public key pkiSending the data to a server;
a3: the server combines all the public keys to calculate a unified public key and sends the unified public key Prod.pk to each user Pi;
a4: user Pi encrypts his data to obtain result (A)i,Bi) And (A)i′,Bi′);
A5: the user Pi generates two random numbers ρiAnd ρi' recalculating the encrypted data to obtain:
and sending the data to a server;
a6: after the server obtains the data, the data is calculated according to a formulaAnd will beAndreturning to each user Pi;
a8: server gets data XiAnd X'iThen, new data is calculatedAndthen, a random number tau is generated, and then K, K',sending to each user Pi;
a9: after each user Pi obtains the data, the calculation is finally carried outThereby obtaining the average value of the distance between the data point in each cluster and the cluster central point.
The invention also provides a system for implementing the method, comprising a server and two or more users. The users send the encrypted data, the cluster center points, and the trapdoor information to the server; re-encrypt the received data sum and count with the BCP encryption method and send them to the server; and compute the average value of the distance between the data points in each cluster and the cluster center point and send it to the server. The server computes the distance between the ciphertext data points and the cluster center points; partitions the clusters according to the distance and the trapdoor information; sums the data points of each user in each cluster separately and sends each user the sum and the count of its data; computes the new cluster center points and sends them to each user; and, after classification ends, sends the classification result to each user according to the data source.
The invention is further improved, and the server is an outsourced cloud server.
Compared with the prior art, the invention has the following beneficial effects: it adopts cryptographic techniques and improves efficiency by choosing a relatively efficient encryption algorithm and outsourcing the data; it combines an improved trapdoor encryption algorithm with privacy-preserving data mining, which further improves efficiency; and it achieves secure computation under the semi-honest model and can resist collusion attacks to a certain degree.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a one-time iteration structure and data processing flow according to the present invention;
FIG. 3 is a partial clustering of data;
FIG. 4 is a clustering result of the data cipher text of FIG. 3;
FIG. 5 shows the result of plaintext clustering in the data shown in FIG. 3;
FIG. 6 is a comparison of the time spent by the server and the user in one iteration;
FIG. 7 is a comparison histogram of data plaintext and ciphertext at the time of the last data and one iteration;
FIG. 8 is a histogram comparing the time taken on plaintext and ciphertext data over an iteration.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The method is aimed mainly at privacy-preserving clustering over data from multiple users or multiple data sources. A scheme is needed to ensure that the private data of the data owners is not leaked during the computation. At the same time, privacy protection imposes a large amount of computation on the data owners, so that computation needs to be outsourced to a server to reduce their load. The invention combines these two requirements: it joins a K-means clustering algorithm with multi-party data privacy protection to outsourced computation, achieving privacy protection through encryption and computation over ciphertext through secure multi-party computation. The data owners encrypt their data and upload it to the outsourcing server; the server computes on the ciphertext and returns the clustering result to the data owners. Most of the computation is handed to the outsourcing server and the data owners perform only a small amount, so clustering is achieved while the owners' private data is not revealed in the process. The invention overcomes two main difficulties: one is outsourcing the computation of a privacy-preserving K-means clustering algorithm; the other is the computational difficulty caused by the diversity of data distributions across multi-party data sets.
As shown in FIG. 1, the multi-user privacy-preserving data clustering method based on ciphertext data of the invention includes the following steps:
S1: two or more users each encrypt their data and send the encrypted data, the cluster center points, and the trapdoor information to a server;
S2: the server computes the distance between each ciphertext data point and each cluster center point, and partitions the clusters according to the distance and the trapdoor information;
S3: within each cluster, the server sums the data points of each user separately and sends each user the sum and the count of its data;
S4: each user re-encrypts the received sum and count with the BCP encryption method and sends them to the server;
S5: the server computes the new cluster center points and sends them to each user;
S6: all users jointly compute the average value of the data points in each cluster from the cluster center point and send the average value to the server; execution then returns to step S1 until the average value is unchanged, at which point classification ends and the server sends the classification result to each user according to the data source.
In step S1, the invention encrypts the data with two encryption schemes, Liu's homomorphic encryption and BCP encryption. The data set $D = \{d_1, d_2, \dots, d_n\}$ contains n data points; each data point $d_i = (x_{i,1}, \dots, x_{i,m})$ is an m-dimensional vector; and each component $x_{i,j}$ of each data point $d_i$ is encrypted twice and uploaded to the outsourcing server as $\mathrm{Enc}(x_{i,j}) = (c_{e(i,j)}, c_{p(i,j)})$, where $1 \le i \le n$, $1 \le j \le m$, $c_{e(i,j)}$ denotes the ciphertext encrypted with Liu's homomorphic encryption scheme, and $c_{p(i,j)}$ denotes the ciphertext encrypted with the BCP encryption scheme.
The server in this example is an outsourced cloud server, which performs most of the computation and thereby effectively improves clustering efficiency.
As shown in FIG. 2, one complete iteration of the invention proceeds as follows. After each user Pi has encrypted its data and uploaded it to the outsourcing server, the combined data set D is equivalent to a two-dimensional table; the encryption process encrypts every entry of D twice before it is uploaded to the cloud server. The outsourced cloud server computes the distance from each data point to each cluster center, receives the trapdoor information (Trapdoor) from the data owners, compares the distances, selects the nearest cluster center, and partitions the clusters. It then adds up each component of the data points in each cluster according to the partition and, following the data distribution, sends the sum and count of the points belonging to P1 to P1, those belonging to P2 to P2, and so on up to Pn. Each user Pi (1 <= i <= n) re-encrypts the information it received and sends it to the cloud, which computes the new cluster center points and sends them to each user; the users decrypt them and send the new cluster centers to the cloud to enter the next iteration.
From the user's perspective, in one iteration each user Pi provides its own trapdoor information (which depends on the data distribution) and waits for the server to send the sum and count of its data belonging to each cluster center. After Pi receives this data, it encrypts it with the BCP encryption scheme and then, together with the cloud, completes the recalculation of the cluster centers using the OPPWAP protocol. Because the data of the two or more users is horizontally distributed, each data point in the data set belongs to an individual user. In this example, when two or more parties recalculate the cluster centers, they negotiate a common set of values $r_{1v}, r_{2v}, \dots, r_{mv}$ with which to encrypt the cluster center values; the re-encrypted cluster centers are then returned to the cloud outsourcing server, which keeps the database computation consistent. Note that the BCP encryption scheme is not used when re-encrypting a new cluster center: the cluster center point only needs to be encrypted once, with Liu's encryption scheme.
Specifically, the processing method of step S2 includes:
S21: based on the ciphertext $c_{e(i,j)}$, the server computes the distance $ED^2(d_i, E_t)$ between the ciphertext data point $d_i$ and the t-th cluster center point $E_t$, where k is the number of cluster center points and $1 \le t \le k$;
Assume the data set $D = \{d_1, d_2, \dots, d_n\}$ contains n data points and that k cluster centers are set in advance; $d_i$ ($1 \le i \le n$) denotes the i-th data point and $E_t$ ($1 \le t \le k$) denotes the t-th cluster center. Each data point $d_i = (x_{i,1}, \dots, x_{i,m})$ and each cluster center point $E_t = (e_{t,1}, \dots, e_{t,m})$ is an m-dimensional vector. The Euclidean distance is calculated according to the formula $D(d_i, E_t) = \sqrt{\sum_{j=1}^{m} (x_{i,j} - e_{t,j})^2}$, where j denotes the j-th component.
However, since this example uses two encryption schemes, each $x_{i,j}$ is encrypted twice and uploaded as $\mathrm{Enc}(x_{i,j}) = (c_{e(i,j)}, c_{p(i,j)})$. $c_{e(i,j)}$ is used to compute the distance from the discrete points to the center points and to partition the clusters; $c_{p(i,j)}$ is used to recalculate the cluster centers. For convenience, in the distance computation and cluster partition this example writes $c_{i,j}$ in place of $c_{e(i,j)}$. Liu's homomorphic encryption scheme is used both for computing and for comparing the distances from data points to cluster centers. The encryption key in this algorithm is a list K(v); because only one $t_i$ in the key list is nonzero, the data owner only needs to upload the $c_i$ associated with $t_i \ne 0$ to the outsourcing server, which the invention assumes to be $c_1$.
When partitioning the clusters, the distance from every point to every center point must be computed. This example takes the distance from $d_i = (x_{i,1}, x_{i,2}, \dots, x_{i,m})$ to $E_t = (e_{t,1}, e_{t,2}, \dots, e_{t,m})$. Since only $c_1$ is uploaded, $c_{i,j}$ is used to represent $\mathrm{Enc}(K(v), x_{i,j}) = (c_{1(i,j)}, \dots, c_{v(i,j)})$, with $c_{i,j} = k_1 t_1 x_{i,j} + s_1 r_{v(i,j)} + k_1 (r_1 - r_{v-1})$. Similarly, $c'_{t,j}$ denotes the encrypted $e_{t,j}$, with $c'_{t,j} = k_1 t_1 e_{t,j} + s_1 r'_{v(t,j)} + k_1 (r_1 - r_{v-1})$. $ED^2(d_i, E_t)$ then denotes the distance from the encrypted data point $d_i$ to the cluster center $E_t$. The following formula gives the computation of the distance from a data point to a cluster center under ciphertext:
However, the distance computed under ciphertext cannot be used directly for distance comparison, because it consists of the original distance $D^2(d_i, E_t)$ with an added suffix associated with $r_v$. Therefore, to compare sizes under ciphertext, the data owner (the user) must provide trapdoor information that cancels the part affecting the distance comparison.
S22: using the Trapdoor function in the trapdoor information provided by the user, the outsourcing server computes $ED^2(d_i, E_t) + \mathrm{Trapdoor}$, compares the distances from each data point to every cluster center, selects the nearest center, and assigns the point to the corresponding cluster.
The trapdoor information in this example is a kind of order-preserving index (OPI), introduced by Liu in 2014 for outsourced encrypted computation. Given a key k and a plaintext x, OPI(k, x) yields an index for x. For two plaintext values $x_1$ and $x_2$ with $x_1 > x_2$, the order-preserving index guarantees $\mathrm{OPI}(k, x_1) > \mathrm{OPI}(k, x_2)$. The scheme cannot recover $x_1$ and $x_2$, but their sizes can be compared.
For example, the scale of a plaintext in decimal notation is given by the position of the rightmost digit after the decimal point: a plaintext number in the format XXX.XX has scale 2 and sensitivity $10^{-2}$, so a plaintext of scale s has sensitivity $10^{-s}$. The key k of the index used in this example is a pair of numbers (a, b) with a > 0. Let Sens denote the sensitivity of the plaintext; then $\mathrm{OPI}(k, x) = a \cdot x + r$, where r is uniformly distributed in the interval $[0, a \cdot Sens)$, so the size of r does not affect the comparison of x values. If $x_1 > x_2$, then $\mathrm{OPI}(k, x_1) - \mathrm{OPI}(k, x_2) = a (x_1 - x_2) + r_1 - r_2$. Since $a (x_1 - x_2) \ge a \cdot Sens > r_2 - r_1$, the difference is positive and $\mathrm{OPI}(k, x_1) > \mathrm{OPI}(k, x_2)$.
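The order-preserving index can be illustrated with a short sketch. It assumes only the form OPI(k, x) = a·x + r with r drawn from [0, a·Sens) as stated above; the function name `opi`, the use of exact fractions, and the way r is sampled are choices made for the example, and the second key component b is not used because the formula above does not involve it.

```python
import random
from fractions import Fraction

def opi(a, x, sens, rng):
    """Order-preserving index a*x + r, with r drawn from [0, a*sens)."""
    # randrange(10**6) / 10**6 lies in [0, 1), so r stays strictly below a*sens
    r = Fraction(rng.randrange(10**6), 10**6) * a * sens
    return a * x + r

rng = random.Random(0)
sens = Fraction(1, 100)                    # plaintexts of scale 2, e.g. 12.34
low = opi(17, Fraction(1234, 100), sens, rng)
high = opi(17, Fraction(1235, 100), sens, rng)
print(low < high)                          # the noise never reverses the order
```

Two plaintexts that differ by at least one sensitivity step are separated by at least a·Sens after scaling, which is more than the noise can bridge.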
Therefore, to compare ciphertext sizes in this example, the index to be constructed for order preservation has the form $a \cdot f(X) + g(X, R)$, where a denotes the encryption key, X is the set of plaintext data to be compared, R is a set of random numbers, and f and g are two functions defined in terms of addition and multiplication. The sensitivity of f(X) is defined as the minimum gap between $f(x_1)$ and $f(x_2)$. If the scale of $f(x_1)$ is $s_1$ and the scale of $f(x_2)$ is $s_2$, then the scale of $f(x_1) + f(x_2)$ is the larger of $s_1$ and $s_2$, and the scale of $f(x_1) \cdot f(x_2)$ is $s_1 + s_2$.
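The two scale rules can be checked with Python's `decimal` module, whose arithmetic tracks the number of digits after the decimal point in the same way. This is only an illustration of the rules; the helper name `scale` is an assumption made for the example.

```python
from decimal import Decimal

def scale(d):
    """Number of digits after the decimal point of a Decimal value."""
    return -d.as_tuple().exponent

x = Decimal("12.34")    # scale 2, sensitivity 10**-2
y = Decimal("5.678")    # scale 3, sensitivity 10**-3

print(scale(x + y))     # addition: the larger of the two scales
print(scale(x * y))     # multiplication: the sum of the two scales
```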
Assume the sensitivity of f(X) is Senf and the ciphertext has the form $a \cdot f(X) + g(X, R)$, which is to be converted into the form $\mathrm{OPI}(k, X) = a \cdot X + R$. For the outsourcing server to convert a ciphertext of the form $a \cdot f(X) + g(X, R)$ into an order-preserving index of the form $a \cdot f(X) + R$, the data owner needs to construct the trapdoor information $-g(X, R) + R$. First, $ED^2(d_i, E_t)$ must be written in the form $a \cdot f(X) + g(X, R)$; the specific calculation formula is as follows:
where a, f(X) and g(X, R) are respectively as follows:
$a = (k_1 t_1)^2$
Here, assume the scale of $D(d_i, E_t)$ is s; then the scale of $D^2(d_i, E_t)$ is $s + s = 2s$, so the sensitivity of $D^2(d_i, E_t)$ is $10^{-2s}$. Since the data owner must provide the trapdoor information (Trapdoor), different data distributions may lead to different forms of trapdoor information.
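The decomposition into a·f(X) + g(X, R) and the effect of the trapdoor can be sketched numerically. The sketch rests on assumptions that the surviving text does not confirm: that $ED^2$ is the sum of squared differences of the ciphertexts $c_{i,j}$ and $c'_{t,j}$, that the term $k_1 (r_1 - r_{v-1})$ is identical in both and cancels, and that f(X) is the plaintext squared distance. All key values and function names are hypothetical.

```python
import random
from fractions import Fraction

K1, T1, S1 = 7, 3, 5        # hypothetical key components k1, t1, s1
SHARED = 11                 # stands in for k1*(r1 - r_{v-1}); assumed equal everywhere
A = (K1 * T1) ** 2          # a = (k1*t1)^2
SENS_F = Fraction(1, 100)   # sensitivity of D^2 when plaintexts have scale 1

def enc(x, r):
    """Ciphertext of the form k1*t1*x + s1*r + k1*(r1 - r_{v-1})."""
    return K1 * T1 * x + S1 * r + SHARED

def enc_dist2(cs, cts):
    """Assumed ED^2: sum of squared ciphertext differences."""
    return sum((c - ct) ** 2 for c, ct in zip(cs, cts))

def noise(xs, es, rs, rts):
    """g(X, R): everything in ED^2 other than a * D^2."""
    total = 0
    for x, e, r, rt in zip(xs, es, rs, rts):
        dr = S1 * (r - rt)
        total += 2 * K1 * T1 * (x - e) * dr + dr * dr
    return total

def order_index(xs, es, rng):
    """Return (ED^2 + trapdoor, R); the first value equals a * D^2 + R."""
    rs = [rng.randrange(1, 1000) for _ in xs]
    rts = [rng.randrange(1, 1000) for _ in es]
    cs = [enc(x, r) for x, r in zip(xs, rs)]
    cts = [enc(e, rt) for e, rt in zip(es, rts)]
    big_r = Fraction(rng.randrange(10**6), 10**6) * A * SENS_F   # R in [0, a*Senf)
    trapdoor = -noise(xs, es, rs, rts) + big_r                   # -g(X, R) + R
    return enc_dist2(cs, cts) + trapdoor, big_r
```

Under these assumptions the server learns only a·D² + R for each point and center, which is enough to pick the nearest center but not to recover the plaintext distance exactly.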
When computing $D^2(d_i, E_t)$, the outsourcing server does not need to perform different computations for data from different data owners, but it does need to keep track of which user each record comes from. User trapdoor information (Trapdoor) is required to compute distances and partition the clusters; because the outsourcing server knows whether each $d_i$ comes from P1, P2, and so on, the data owner to whom the data point belongs is the one that provides the trapdoor information.
In this example, under a horizontal data distribution, assume $d_i$ comes from P1; the trapdoor information is then provided by P1. The format of the trapdoor information is $-g(X, R) + R$. The trapdoor function designed in this example consists of two parts:
$\mathrm{Trap}_{it}(d_i, E_t) + \mathrm{Trap}_t(E_t)$
The first part, $\mathrm{Trap}_{it}(d_i, E_t)$, is the trapdoor function of the distance from each data point to the cluster center point and can be calculated in advance by the data owner. The second part, $\mathrm{Trap}_t(E_t)$, is the trapdoor function of each cluster center point and changes as the cluster centers change in each iteration. The calculation formula is as follows:
where $NB_{tj}$ lies in the range $[0, (k_1 t_1)^2 \cdot sens]$ and corresponds to a part of the random number R in $-g(X, R) + R$; the result of adding these parts is the random number R, as shown in the following formula.
When recalculating the cluster centers, this example uses the ciphertext $c_{p(i,j)}$ produced by the second scheme, BCP encryption. The server uses $c_{p(i,j)}$ to compute the number of data points in each cluster and the sum of their corresponding components, and sends the sums to each user Pi according to the data distribution. To compute the average of the data points in each cluster jointly among all parties, this example designs the OPPWAP protocol (outsourced privacy-preserving average computation protocol), which reduces the task to the problem of computing this average under ciphertext.
Specifically, in steps S4-S6, each cluster center is recalculated by summing the corresponding components of the discrete points in the cluster and taking the average. Assume there are n points and t users Pi. Each user's values are held in encrypted form as $c_{p_i}$, each of which is an m-dimensional vector. For each cluster, the cloud sums the corresponding components of the discrete points $c_{p_i}$ belonging to Pi and counts them. The sum is $X_i = (x_{i1}, x_{i2}, \dots, x_{im})$, and the number of points in the cluster belonging to Pi is $a_i$. The server sends the computed $X_i$ and $a_i$ to each user Pi; each user Pi encrypts them with the BCP encryption scheme and, together with the server, runs the OPPWAP protocol; the final result is the computed average.
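The bookkeeping behind this step can be sketched in plaintext. The sketch shows only how per-user sums $X_i$ and counts $a_i$ for one cluster combine into the ordinary K-means mean (total sum divided by total count); in the method itself these quantities are handled under BCP encryption through the OPPWAP protocol, which is not modelled here. The function names and the data layout are assumptions made for the example.

```python
def per_user_sums(points, owners, labels, cluster):
    """For one cluster, return {user: X_i} and {user: a_i}."""
    sums, counts = {}, {}
    for p, owner, label in zip(points, owners, labels):
        if label != cluster:
            continue
        prev = sums.get(owner, [0] * len(p))
        sums[owner] = [a + b for a, b in zip(prev, p)]   # component-wise sum X_i
        counts[owner] = counts.get(owner, 0) + 1         # point count a_i
    return sums, counts

def combined_mean(sums, counts):
    """New cluster center: sum of all X_i divided by the sum of all a_i."""
    total = sum(counts.values())
    m = len(next(iter(sums.values())))
    return [sum(s[j] for s in sums.values()) / total for j in range(m)]

points = [(1, 2), (3, 4), (5, 6), (9, 9)]
owners = ["P1", "P2", "P1", "P2"]
labels = [0, 0, 0, 1]
X, a = per_user_sums(points, owners, labels, cluster=0)
print(X, a, combined_mean(X, a))
```

No single user's $X_i / a_i$ equals the cluster center unless that user owns every point in the cluster, which is why the users must combine their contributions.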
The specific implementation method comprises the following steps:
A1: the outsourcing server S initializes through Setup, generates the public parameters PP = (N, K, g), and sends PP to each user Pi;
A2: after obtaining the public parameters, each user Pi generates its public and private key pair $(pk_i, sk_i)$ with the key generator and sends the public key $pk_i$ to the server;
A3: the server combines all the public keys to calculate a unified public key Prod.pk and sends it to each user Pi;
A4: user Pi encrypts its data to obtain $(A_i, B_i)$ and $(A_i', B_i')$;
A5: user Pi generates two random numbers $\rho_i$ and $\rho_i'$ and recalculates the encrypted data to obtain:
and sends the data to the server;
A6: after obtaining the data, the server performs the calculation according to the formula and returns the results to each user Pi;
A8: after obtaining the data $X_i$ and $X_i'$, the server calculates new data, generates a random number $\tau$, and sends K, K′ and the related data to each user Pi;
A9: after obtaining the data, each user Pi performs the final calculation, thereby obtaining the average value of the distance between the data points in each cluster and the cluster center point.
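The key setup and encryption in steps A1-A4 can be illustrated with a toy sketch of BCP-style encryption. It is a sketch under stated assumptions, not the patent's implementation: it assumes the standard BCP form, in which the public key is $g^{sk} \bmod N^2$ and a ciphertext is $(A, B) = (g^r, pk^r(1 + mN)) \bmod N^2$, and it assumes that Prod.pk is the product of the individual public keys. The primes are tiny and insecure, and all function names are hypothetical. Summing the private keys at the end is done only to check correctness; in a real protocol no single party would hold that sum.

```python
import random


def keygen(N, g, rng):
    """Return (pk, sk) with pk = g^sk mod N^2."""
    sk = rng.randrange(1, N * N // 2)
    return pow(g, sk, N * N), sk


def encrypt(N, g, pk, m, rng):
    """BCP-style ciphertext (A, B) = (g^r, pk^r * (1 + m*N)) mod N^2."""
    n2 = N * N
    r = rng.randrange(1, N // 4)
    return pow(g, r, n2), (pow(pk, r, n2) * (1 + m * N)) % n2


def decrypt(N, sk, ct):
    """Recover m from (A, B): B / A^sk = 1 + m*N (mod N^2)."""
    A, B = ct
    n2 = N * N
    u = (B * pow(pow(A, sk, n2), -1, n2)) % n2
    return ((u - 1) // N) % N


def add(N, c1, c2):
    """Multiply ciphertexts component-wise to add the plaintexts."""
    n2 = N * N
    return (c1[0] * c2[0]) % n2, (c1[1] * c2[1]) % n2


# Toy parameters (insecure): N = p*q with safe primes 23 and 47, g = 2^2.
N, g = 23 * 47, 4
rng = random.Random(2017)          # not a cryptographic generator
pk1, sk1 = keygen(N, g, rng)       # user P1
pk2, sk2 = keygen(N, g, rng)       # user P2
prod_pk = (pk1 * pk2) % (N * N)    # assumed form of the unified key Prod.pk
c1 = encrypt(N, g, prod_pk, 12, rng)
c2 = encrypt(N, g, prod_pk, 30, rng)
total = decrypt(N, sk1 + sk2, add(N, c1, c2))
```

The additive property shown by `add` is what allows a server to sum encrypted components without seeing them.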
The invention adopts the classic K-means clustering algorithm. When the data come from two or more sources, it uses cryptographic techniques to protect the privacy of each data owner's personal data and uses secure multi-party computation to carry out the calculation securely. In addition, each iteration of the K-means algorithm must calculate the distance from every data point to every center point. This loop is time-consuming, so it is outsourced to the server to improve efficiency.
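For reference, one plaintext K-means iteration can be sketched as follows. This is the unencrypted baseline whose distance loop the scheme outsources, not the ciphertext protocol itself; the function names and sample points are hypothetical.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between point p and center c."""
    return sum((x - y) ** 2 for x, y in zip(p, c))


def assign(points, centers):
    """Index of the nearest center for each point (ties go to the lowest index)."""
    return [
        min(range(len(centers)), key=lambda t: squared_distance(p, centers[t]))
        for p in points
    ]


def update(points, labels, centers):
    """Recompute each center as the component-wise mean of its members."""
    new_centers = []
    for t, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == t]
        if not members:
            new_centers.append(list(old))  # keep an empty cluster's center
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers


# Hypothetical data: four 2-dimensional points and two initial centers.
points = [[1, 1], [2, 1], [9, 9], [10, 8]]
centers = [[0, 0], [10, 10]]
labels = assign(points, centers)
new_centers = update(points, labels, centers)
```

The `assign` step performs n × k distance computations per iteration, which is the part whose cost grows with the number of data points.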
The effects of the present invention are further illustrated below in conjunction with experimental data:
The experiment was carried out on a single machine with the following development environment:
(1) the operating system is Windows 7, the processor is an Intel(R) Core(TM) i5-4570 CPU running at 3.2 GHz, and the system memory is 8 GB;
(2) the BCP encryption key is 512 bits long; because the key is large, the computation stage runs slowly;
(3) the programming language is Java, the development environment is Eclipse, and the system database is MySQL.
The experimental data were downloaded from the public UCI data set repository. The original data are decimals; because BCP encryption is a group-based scheme that does not support decimal arithmetic, the data were converted to integers in preprocessing. The processed data set contains 10000 records with 7 attributes. Part of the data is shown in FIG. 3.
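The patent does not state how the decimals were converted to integers. One common approach, shown here purely as an assumption, is fixed-point scaling: every value is multiplied by the same power of ten so that distance comparisons are preserved. The function name `to_scaled_int` and the sample values are hypothetical.

```python
from decimal import Decimal


def to_scaled_int(value, digits):
    """Scale a decimal string by 10**digits and return an exact integer."""
    scaled = Decimal(value) * (10 ** digits)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {digits} decimal places")
    return int(scaled)


# Hypothetical record with at most two decimal places per attribute.
row = ["15.2", "14.0", "2.35"]
encoded = [to_scaled_int(v, 2) for v in row]
```

Parsing from strings with `Decimal` avoids binary floating-point rounding, so the integers match the written values exactly.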
As shown in FIG. 4 and FIG. 5, to verify the correctness of computation on ciphertext, the experiment also ran K-means clustering on the plaintext. The comparison shows that, for the same data, the ciphertext and plaintext clustering results are completely consistent, which verifies the correctness of the scheme.
FIG. 4 and FIG. 5 show truncated portions of the clustering results. The initial cluster centers are set to three points, {15,14,2,6,4,4,6}, {1,1,1,1,1, 1} and {15,13,2,6,4,4,6}, and the first data record is test data. With identical input data, the clustering results are completely consistent. The data encryption time is shown in Table 1; the two homomorphic encryptions are performed at the same time.
Table 1 shows that the encryption time is within an acceptable range and that the total Trapdoor computation time is slightly less than the encryption time. Computing the Trapdoor requires accumulating the components of each $d_i$, so it involves fewer encryption operations, although its formula is more complicated than encryption. As a result it takes slightly less time than encrypting $d_i$, but the difference is small.
TABLE 1 encryption time consumption
In one iteration, the data owner's time is spent mainly on computing the trapdoor information (Trapdoor) and on the OPPWAP protocol computation. Table 2 compares the results for different numbers of data points.
TABLE 2 one iteration elapsed time
FIG. 6 is a line graph comparing the time consumed in one iteration by the server and by one user. The abscissa represents the number of cluster center points, and the ordinate represents the clustering time in milliseconds (ms). The data owner consumes much less time than the server. Table 3 shows the time consumed by the users Alice and Bob in one iteration, including the time for OPPWAP and Trapdoor.
TABLE 3 time consumed by data owner in one iteration
Tables 2 and 3 show the time cost of one iteration. As the number of data points increases, the time cost at the server increases; the increase is not linear because it is related to the number of iterations. By contrast, the time cost for Alice and Bob is not related to the number of data points. Table 3 shows that the Trapdoor computation takes relatively little time and that most of the time is taken by the OPPWAP computation. The number of OPPWAP computations depends only on the number of clusters k and the dimension m of the data points, so within one iteration the OPPWAP time for Alice and Bob is essentially stable, because k and m do not change during the computation. Table 2 shows that when the number of data points increases, the server bears more of the computation cost, while the data owner's cost remains essentially stable and the Trapdoor computation time is far less than the server's time.
The number of iterations of K-means clustering is not controllable: it depends on the number of data points, the number of clusters, and the initial points selected each time. In practice, K (the number of clusters) is usually fixed; in this experiment it is set to 3. For this fixed value of K, Table 4 shows the approximate relationship between the number of data points and the number of iterations.
TABLE 4 reference of data points number and iteration number
Communication in the whole clustering process mainly consists of uploading the ciphertext data, running the OPPWAP protocol, and uploading the new cluster centers and the Trapdoor function. The cluster centers and the Trapdoor are uploaded together, so their communication cost is counted together. The overall communication cost is shown in Table 5. The figures for the OPPWAP protocol and for the cluster centers and Trapdoor function are given for one iteration: because the number of K-means iterations is not controllable, the cost of one iteration is more meaningful than the total communication cost after several iterations.
TABLE 5 communication consumption
Table 5 shows that the upload time of the data is similar to the upload time of the cluster center points; the upload times are basically consistent because the experiment uploads three-dimensional arrays in both cases. The total OPPWAP time is the sum of the OPPWAP times computed by Alice and Bob in one iteration in Table 2, because computation and communication proceed essentially at the same time. Because the state of the computer varies from run to run, each reported time is the average over several runs of the program.
Finally, this example experimentally compares the time efficiency of computation on ciphertext under the present scheme with that of computation on plaintext. The comparison has three parts: the time to upload plaintext data versus ciphertext data, the time consumed by one iteration on plaintext versus ciphertext, and the time consumed by the whole clustering process. To make the comparison clear, the experiment uses 2000 data points with 7 attribute values.
TABLE 6 comparison of computation times for plaintext and ciphertext
Table 6 compares plaintext and ciphertext computation in terms of data upload time, the time of one iteration, and the time of the whole clustering process. Because the data are uploaded as 7-dimensional arrays in the experiment, the upload time is essentially fixed at 79 ms. In one iteration, the ciphertext computation takes approximately twice as long as the plaintext computation because of the required OPPWAP computation and communication. For the whole clustering computation, the experiment reads the data from a file, which is fast, so the ciphertext computation exceeds the plaintext computation mainly by the encryption time. Reading from the database would take 2-3 s longer than reading from a file, so the file-based approach is used to keep the timing accurate. FIG. 7 and FIG. 8 show the difference between plaintext and ciphertext as histograms. The ordinate represents time, in milliseconds (ms) in FIG. 7 and in seconds (s) in FIG. 8.
Compared with privacy-preserving K-means outsourced computation, the outsourced computation in this experiment adds only the computation time of the OPPWAP protocol. Table 5 shows that OPPWAP takes about 412 ms in one iteration; because the number of iterations is not controllable, the experiment reports the time of one iteration. The time to transfer the Trapdoor function is basically the same whether there are multiple parties or a single party.
In conclusion, the invention combines privacy-preserving data mining with outsourced computation and analyzes it experimentally. The main achievements of the invention are as follows:
(1) The advantages and disadvantages of different privacy-preserving data mining techniques are analyzed and summarized. Data perturbation techniques tend to damage the data and are a compromise between privacy protection and mining precision. Cryptographic techniques do not affect the mining result, but encrypting the data brings a larger time cost. The invention selects cryptographic techniques and improves efficiency by choosing a relatively efficient encryption algorithm and by outsourcing the data;
(2) Traditional methods perform secure computation with the secure circuit evaluation method proposed by Yao, which encrypts 0/1 strings bit by bit and therefore has a very high time cost; privacy-preserving weighted averaging alone does not handle the comparison of data point distances well. The invention addresses these problems by combining an improved trapdoor encryption algorithm with privacy-preserving data mining, and efficiency is also improved;
(3) An outsourced computation protocol for the privacy-preserving K-means clustering algorithm is designed. The repeated computation of distances between points in K-means is outsourced to the server and is realized through secure multi-party computation built on two encryption techniques. The improved Liu encryption scheme is used to compare the distances from data points to cluster centers and to divide the clusters; BCP encryption is used to recalculate the cluster centers;
(4) Time complexity, space complexity and security analyses are carried out for the invention, followed by experimental verification. The invention achieves secure computation under the semi-honest model and can resist collusion attacks to a certain degree.
The above-described embodiments are intended to be illustrative, and not restrictive, of the invention, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (9)
1. The multi-user privacy protection data clustering method based on the ciphertext data is characterized by comprising the following steps of:
S1: more than two users each encrypt their data and send the encrypted data, the cluster center points and the trapdoor information to a server;
S2: the server calculates the distance between each ciphertext data point and each cluster center point, and divides the clusters according to the distances and the trapdoor information;
S3: for each cluster, the server adds up the data points of each user separately, and sends the sum and the number of the data points to the respective user;
S4: each user re-encrypts the received sum and number by the BCP encryption method and sends the result to the server;
S5: the server calculates new cluster center points and sends them to each user;
S6: all users jointly calculate, through the outsourced privacy-preserving average calculation protocol, the average distance of the data points in each cluster from the cluster center point, and then send the average to the server; execution returns to step S1 until the average is smaller than the threshold, whereupon the classification is finished and the server sends the classification result to each user according to the data source;
in step S1, the server is an outsourced server, and each user encrypts the data twice, once by homomorphic encryption and once by BCP encryption; the data set contains n data points, each data point is an m-dimensional vector, and each component of each data point is encrypted twice and uploaded to the outsourcing server, wherein one ciphertext is encrypted using the homomorphic encryption scheme and the other ciphertext is encrypted using the BCP encryption scheme.
2. The multi-user privacy preserving data clustering method according to claim 1, characterized in that: the processing method of step S2 includes:
S21: according to the ciphertext, the server computes the distance between each ciphertext data point and the t-th cluster center point, wherein k is the number of cluster center points;
4. The multi-user privacy preserving data clustering method according to claim 3, characterized in that: in step S22, the Trapdoor function is used to generate an order-preserving encryption index that can compare the sizes of two data items.
5. The multi-user privacy preserving data clustering method according to claim 2, characterized in that: in step S3, the server uses the ciphertext to count the data points in each cluster and to sum the corresponding components of those data points, and sends the sums to each user Pi according to the data distribution.
6. The multi-user privacy preserving data clustering method of claim 5, wherein: in steps S4-S6, each cluster center is recalculated by summing, component by component, the discrete points assigned to that center and then averaging; assuming that there are n points and t users, each user Pi holds encrypted values, each of which is an m-dimensional vector; for each cluster, the cloud sums the corresponding components of the points belonging to user Pi and counts those points, the number of points in the cluster belonging to Pi being $a_i$; the server sends the calculated $X_i$ and $a_i$ to the respective user Pi, and each user Pi encrypts them by the BCP encryption scheme and runs the OPPWAP protocol together with the server, the final result being the calculated average.
7. The multi-user privacy preserving data clustering method of claim 6, wherein: the process carried out by the server and the respective users Pi based on the OPPWAP protocol comprises the following steps:
A1: the outsourcing server S initializes through Setup, generates the public parameters PP = (N, K, g), and sends PP to each user Pi;
A2: after obtaining the public parameters, each user Pi generates its public key and private key through the key generator and sends the public key to the server;
A3: the server combines all the public keys to calculate a unified public key Prod.pk and sends it to each user Pi;
and sending the data to the server;
A6: after obtaining the data, the server performs the calculation according to the formula and returns the results to each user Pi;
A8: after obtaining the data, the server calculates new data, then generates a random number, and sends the results to each user Pi;
8. A system for implementing the multi-user privacy preserving data clustering method according to any one of claims 1 to 7, characterized in that: the system comprises a server and more than two users, wherein the users are used for sending the encrypted data, the cluster center points and the trapdoor information to the server, re-encrypting the data by the BCP encryption method according to the received sum and number of the data and sending the result to the server, and calculating the average distance of the data points in each cluster from the cluster center point and sending it to the server; the server is used for calculating the distance between the ciphertext data points and the cluster center points, dividing the clusters according to the distances and the trapdoor information, adding up the data points of different users in each cluster separately, sending the sum and the number of the data to the respective users, calculating new cluster center points and sending them to each user, and, after the classification is finished, sending the classification result to each user according to the data source.
9. The system of claim 8, wherein: the server is an outsourced cloud server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710225047.1A CN107145792B (en) | 2017-04-07 | 2017-04-07 | Multi-user privacy protection data clustering method and system based on ciphertext data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107145792A (en) | 2017-09-08 |
CN107145792B (en) | 2020-09-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||