CN107145792B

CN107145792B - Multi-user privacy protection data clustering method and system based on ciphertext data

Info

Publication number: CN107145792B
Application number: CN201710225047.1A
Authority: CN
Inventors: 王轩; 蒋琳; 李晔; 姚霖; 刘泽超; 刘猛; 漆舒汉
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2017-04-07
Filing date: 2017-04-07
Publication date: 2020-09-15
Anticipated expiration: 2037-04-07
Also published as: CN107145792A

Abstract

The invention provides a multi-user privacy-preserving data clustering method and system based on ciphertext data, and belongs to the technical field of data mining. The method comprises the following steps: two or more users each encrypt their data and send the encrypted data, the cluster center points, and the trapdoor information to a server; the server calculates the distance between each ciphertext data point and each cluster center point and partitions the data into clusters; within each cluster, the server sums the data points of each user separately and sends each user the sum and the count of its data points; each user re-encrypts the received sum and count and sends them to the server; the server calculates the new cluster center points and sends them to each user; and all users jointly calculate the average value of the data points in each cluster with respect to the cluster center point through an outsourced privacy-preserving average calculation protocol, and then send the average value to the server for the next iteration. The invention greatly improves clustering efficiency, achieves secure computation under the semi-honest model, and can resist collusion attacks to a certain degree.

Description

Multi-user privacy protection data clustering method and system based on ciphertext data

Technical Field

The invention relates to the technical field of data mining, in particular to a multi-user privacy protection data clustering method based on ciphertext data, and further provides a system for realizing the multi-user privacy protection data clustering method based on the ciphertext data.

Background

Privacy-preserving data mining (PPDM) addresses data mining tasks that involve two or more parties who do not want their private data to be revealed during the computation. It allows data mining to be carried out on the joint data of two or more parties while ensuring that no party's private data is obtained by the others.

Privacy-preserving data mining techniques fall mainly into methods based on data perturbation and methods based on cryptography. Perturbation-based techniques protect the source data by adding noise to it, at the cost of some loss of precision. Cryptographic techniques rely mainly on homomorphic encryption and secure multi-party computation; compared with perturbation, they alter the data little and preserve precision, but their time complexity and computational cost are usually higher.

Cryptography-based methods began with distributed computing approaches that did not involve the cloud. These mainly used Yao's secure circuit evaluation protocol or partially homomorphic encryption to protect data privacy, but they are inefficient, place a heavy computational burden on every participant, and are hard to put into practice. Later, in 2012, Peter et al. proposed outsourced secure multi-party computation based on the BCP encryption scheme, which allows the participants to reduce their computational load by using the cloud. In the same year, Asharov proposed a trapdoor homomorphic encryption method for multi-party computation that further improved the efficiency of cloud computing; however, that method cannot protect user privacy, and a user's content is easily obtained by other users.

As for clustering methods, the classical one is the traditional K-means algorithm. In the first iteration, K points are randomly selected from the data as cluster center points. The Euclidean distance from every other point to each cluster center is then calculated, and each point is assigned to the cluster whose center is nearest. Once the assignment is finished, the average of each component over the points in each cluster is calculated to obtain the new cluster centers, which completes the first iteration, and the next iteration begins. The iterations repeat until the cluster centers no longer change, at which point iteration stops and clustering is complete.
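The plaintext K-means loop described above can be sketched as follows. This is a minimal illustration, not the patented ciphertext method; the function names, the `max_iter` cap, and the seeded choice of initial centers are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of tuples; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # K random points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda t: squared_distance(p, centers[t]))
            clusters[nearest].append(p)
        # Update step: each center becomes the component-wise mean of its cluster.
        new_centers = []
        for t in range(k):
            if clusters[t]:
                size = len(clusters[t])
                new_centers.append(tuple(sum(col) / size for col in zip(*clusters[t])))
            else:
                new_centers.append(centers[t])   # keep the center of an empty cluster
        if new_centers == centers:               # centers unchanged: converged
            break
        centers = new_centers
    return centers, clusters
```

Every participant in this plaintext version sees all points, which is exactly the privacy problem the background section describes.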

K-means is one of the simpler clustering algorithms: it groups the samples into K clusters according to a fixed rule. However, the traditional clustering algorithm offers no protection of user privacy; participants can easily obtain the data of other users, so the method is deficient in security.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a multi-user privacy protection data clustering method based on ciphertext data and a system for realizing the method.

The invention relates to a multi-user privacy protection data clustering method based on ciphertext data, which comprises the following steps:

S1: two or more users each encrypt their data and send the encrypted data, the cluster center points, and the trapdoor information to a server;

S2: the server calculates the distance between each ciphertext data point and each cluster center point, and partitions the data into clusters according to the distances and the trapdoor information;

S3: within each cluster, the server sums the data points of each user separately and sends each user the sum and the count of its data points;

S4: each user re-encrypts the received sum and count with the BCP encryption scheme and sends them to the server;

S5: the server calculates the new cluster center points and sends them to each user;

S6: all users jointly calculate the average value of the data points in each cluster with respect to the cluster center point and send it to the server; step S1 is then executed again until the average value is smaller than the threshold, at which point the classification is finished and the server sends each user the classification result according to the data source.

In a further improvement of the present invention, in step S1 the server is an outsourced server, and each user encrypts its data twice, once with homomorphic encryption and once with BCP encryption. The data set $D = \{d_1, d_2, \ldots, d_n\}$ contains n data points, and each data point $d_i = (x_{i,1}, \ldots, x_{i,m})$ is an m-dimensional vector. Each component $x_{i,j}$ of each data point $d_i$ is encrypted twice and uploaded to the outsourced server as $Enc(x_{i,j}) = (c_{e(i,j)}, c_{p(i,j)})$, where $1 \le i \le n$, $1 \le j \le m$, $c_{e(i,j)}$ denotes the ciphertext produced by Liu's homomorphic encryption scheme, and $c_{p(i,j)}$ denotes the ciphertext produced by the BCP encryption scheme.

In a further improvement of the present invention, the processing method in step S2 includes:

S21: from the ciphertexts $c_{e(i,j)}$, the server computes the distance $ED^2(d_i, E_t)$ between the ciphertext data point $d_i$ and the t-th cluster center point $E_t$, where k is the number of cluster center points and $1 \le t \le k$;

S22: using the Trapdoor function in the trapdoor information provided by the user, the outsourced server computes $ED^2(d_i, E_t) + Trapdoor$, compares the distances from each data point to every cluster center, selects the closest center, and assigns the point to the corresponding cluster.

In a further refinement of the present invention, in step S21, each data point $d_i = (x_{i,1}, \ldots, x_{i,m})$ and each cluster center point $E_t = (e_{t,1}, \ldots, e_{t,m})$ are m-dimensional vectors, the encrypted data of each data point is $(c_{e(i,1)}, \ldots, c_{e(i,m)})$, and the distance $ED^2(d_i, E_t)$ is calculated as follows:

In a further improvement of the present invention, in step S22, the Trapdoor function is used to generate an order-preserving encrypted index with which the sizes of two data values can be compared.

In step S3, the server uses the ciphertexts $c_{p(i,j)}$ to calculate, for each cluster, the number of data points and the sum of their corresponding components, and sends the sums to the respective users Pi according to the data distribution.
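Summing under ciphertext relies on the additive homomorphism of BCP encryption. The sketch below is a toy, single-key illustration of that property as we understand the BCP scheme: a ciphertext is a pair $(A, B) = (g^r, h^r(1 + mN)) \bmod N^2$, and multiplying two ciphertexts component-wise adds their plaintexts modulo N. The tiny primes, the fixed generator `G`, and all function names are assumptions chosen for readability; they are far too small to be secure and the multi-key steps of the patent are not modeled.

```python
import random

# Toy parameters (assumption): tiny safe primes 11 = 2*5+1 and 23 = 2*11+1.
N = 11 * 23
N2 = N * N
G = pow(2, 2, N2)                      # a square in Z*_{N^2}, used as toy generator

def keygen(rng):
    """Return (public key h, secret key a) with h = G^a mod N^2."""
    a = rng.randrange(1, N2)
    return pow(G, a, N2), a

def encrypt(h, m, rng):
    """Encrypt an integer 0 <= m < N as (A, B) = (G^r, h^r * (1 + m*N)) mod N^2."""
    r = rng.randrange(1, N2)
    return pow(G, r, N2), (pow(h, r, N2) * (1 + m * N)) % N2

def decrypt(a, c):
    """Recover m from B / A^a = 1 + m*N (mod N^2)."""
    A, B = c
    u = (B * pow(pow(A, a, N2), -1, N2)) % N2
    return ((u - 1) // N) % N

def add(c1, c2):
    """Homomorphic addition: component-wise product of two ciphertexts."""
    return (c1[0] * c2[0]) % N2, (c1[1] * c2[1]) % N2
```

In this toy setting the server could call `add` repeatedly on one user's ciphertexts to obtain an encrypted component sum without ever seeing the plaintexts.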

In a further improvement of the present invention, in steps S4-S6, each cluster center is recalculated by summing, component by component, the discrete points assigned to that center and taking the average. Assume there are n points and t users, and that each user Pi holds its own values together with their encrypted values $c_{p_i}$, each of which is an m-dimensional vector. For each cluster, the cloud sums the corresponding components of the points $c_{p_i}$ belonging to user Pi and counts them. The sum is $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ and the number of points in the cluster that belong to Pi is $a_i$. The server sends the calculated $X_i$ and $a_i$ to the respective user Pi; each user Pi encrypts them with the BCP encryption scheme and then, together with the server, computes the average with the OPPWAP protocol.

The final result is the calculated average.
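In plaintext terms, the quantity the parties end up with can be sketched as below: per-cluster, per-user sums $X_i$ and counts $a_i$ are combined into one average. This is only an illustration of the arithmetic, under the assumption that the new center is the total of the $X_i$ divided by the total of the $a_i$; the function names and the data layout are hypothetical, and none of the encryption or the OPPWAP message flow is modeled.

```python
def partial_sums(points, assignments, owners, k, users):
    """For every (cluster, user) pair, sum the user's points and count them."""
    m = len(points[0])
    sums = {(t, u): [0] * m for t in range(k) for u in users}
    counts = {(t, u): 0 for t in range(k) for u in users}
    for p, t, u in zip(points, assignments, owners):
        counts[(t, u)] += 1
        for j in range(m):
            sums[(t, u)][j] += p[j]
    return sums, counts

def new_center(sums, counts, t, users):
    """Combine the users' sums X_i and counts a_i into the mean of cluster t."""
    total = sum(counts[(t, u)] for u in users)
    if total == 0:
        return None                      # empty cluster: no average defined
    m = len(sums[(t, users[0])])
    return [sum(sums[(t, u)][j] for u in users) / total for j in range(m)]
```

For example, if P1 contributes a sum of (6, 8) over 2 points and P2 a sum of (3, 4) over 1 point, the combined center is (9/3, 12/3) = (3, 4).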

In a further improvement of the invention, the processing carried out by the server and each user under the OPPWAP protocol comprises the following steps:

A1: the outsourced server S initializes via Setup, generates the public parameters PP = (N, K, g), and sends PP to each user Pi;

A2: after obtaining the public parameters, each user Pi generates its public and private keys $(pk_i, sk_i)$ with the key generator and sends the public key $pk_i$ to the server;

A3: the server combines all the public keys to compute a unified public key Prod.pk and sends it to each user Pi;

A4: user Pi encrypts its data to obtain $(A_i, B_i)$ and $(A_i', B_i')$;

A5: user Pi generates two random numbers $\rho_i$ and $\rho_i'$ and recomputes the encrypted data to obtain:

and sends the data to the server;

A6: after obtaining the data, the server computes according to the formula

and returns

and

to each user Pi;

A7: user Pi computes

and

and sends them to the server;

A8: after obtaining the data $X_i$ and $X'_i$, the server computes the new data

and

then generates a random number $\tau$ and sends K, K',

to each user Pi;

A9: after obtaining the data, each user Pi performs the final calculation

thereby obtaining the average value of the distance between the data points in each cluster and the cluster center point.

The invention also provides a system for implementing the method, comprising a server and two or more users. The users are configured to send the encrypted data, the cluster center points, and the trapdoor information to the server; to re-encrypt the received data sums and counts with the BCP encryption scheme and send them to the server; and to calculate the average value of the distance between the data points in each cluster and the cluster center point and send it to the server. The server is configured to calculate the distance between the ciphertext data points and the cluster center points; to partition the data into clusters according to the distances and the trapdoor information; to sum the data points of each user separately within each cluster and send each user the sum and count of its data; to calculate the new cluster center points and send them to each user; and, after the classification is finished, to send each user the classification result according to the data source.

In a further improvement of the invention, the server is an outsourced cloud server.

Compared with the prior art, the invention has the following beneficial effects: it adopts cryptographic techniques and improves efficiency by choosing a relatively efficient encryption algorithm and outsourcing the data; it combines an improved trapdoor encryption algorithm with privacy-preserving data mining, which improves efficiency; and it achieves secure computation under the semi-honest model and can resist collusion attacks to a certain degree.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of a one-time iteration structure and data processing flow according to the present invention;

FIG. 3 is a partial clustering of data;

FIG. 4 is a clustering result of the data cipher text of FIG. 3;

FIG. 5 shows the result of plaintext clustering in the data shown in FIG. 3;

FIG. 6 is a comparison of the time spent by the server and the user in one iteration;

FIG. 7 is a comparison histogram of data plaintext and ciphertext at the time of the last data and one iteration;

FIG. 8 is a histogram comparing the time taken on plaintext and ciphertext data over one iteration.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples.

The method is aimed at privacy-preserving data clustering over multiple users or multiple data sources. To ensure that the private data of the data owners is not leaked during the computation, a scheme is needed to protect it. At the same time, privacy protection imposes a heavy computational load on the data owners, so the computation needs to be outsourced to a server to reduce that load. The invention addresses both requirements by combining a K-means clustering algorithm that preserves the privacy of multi-party data with outsourced computation: privacy is protected through encryption, and computation over ciphertext is achieved through secure multi-party computation. The data owners encrypt their data and upload it to the outsourced server; the server computes over the ciphertext and returns the clustering result to the data owners. Most of the computation is handed to the outsourced server and the data owners perform only a small amount, so clustering is achieved while the data owners' private data is not revealed in the process. The invention has to overcome two main difficulties: one is realizing outsourced computation of a privacy-preserving K-means clustering algorithm; the other is the computational difficulty caused by the diversity of data distributions across multi-party data sets.

As shown in FIG. 1, the multi-user privacy-preserving data clustering method based on ciphertext data of the present invention includes the following steps: two or more users each encrypt their data and send the encrypted data, the cluster center points, and the trapdoor information to a server;

S6: all users jointly calculate the average value of the data points in each cluster with respect to the cluster center point and send it to the server; step S1 is then executed again until the average value no longer changes, at which point the classification is finished and the server sends each user the classification result according to the data source.

In step S1, the present invention encrypts the data with two encryption schemes, Liu's homomorphic encryption and BCP encryption. The data set $D = \{d_1, d_2, \ldots, d_n\}$ contains n data points, and each data point $d_i = (x_{i,1}, \ldots, x_{i,m})$ is an m-dimensional vector. Each component $x_{i,j}$ of each data point $d_i$ is encrypted twice and uploaded to the outsourced server as $Enc(x_{i,j}) = (c_{e(i,j)}, c_{p(i,j)})$, where $1 \le i \le n$, $1 \le j \le m$, $c_{e(i,j)}$ denotes the ciphertext produced by Liu's homomorphic encryption scheme, and $c_{p(i,j)}$ denotes the ciphertext produced by the BCP encryption scheme.

The server in this example is an outsourced cloud server; most of the computation is carried out on it, which effectively improves clustering efficiency.

As shown in FIG. 2, one complete iteration of the invention proceeds as follows. Each user Pi encrypts its data and uploads it to the outsourced server; the combined data set D is equivalent to a two-dimensional table, and the encryption process encrypts every value in D twice before it is uploaded to the cloud server. The outsourced cloud server computes the distance from each data point to each cluster center, receives the trapdoor information (Trapdoor) from the data owners, compares the distances, selects the nearest cluster center, and partitions the data into clusters. Then, according to the resulting clusters and the data distribution, it sums each component of the data points in each cluster and sends the sum and count of the data belonging to P1 to P1, those belonging to P2 to P2, and so on up to Pn. Each user Pi (1 <= i <= n) re-encrypts the information it receives and sends it to the cloud, which computes the new cluster center points and sends the result to each user; the users decrypt it and send the new cluster centers to the cloud to start the next iteration.

From a user's perspective, one iteration proceeds as follows. Each user Pi provides its own trapdoor information (which depends on the data distribution) and waits for the server to send the sum and count of its data belonging to each cluster center. After receiving them, Pi encrypts them with the BCP encryption scheme and, together with the cloud, completes the recalculation of the cluster centers using the OPPWAP protocol. Because the data of the two or more users are horizontally distributed, each data point in the data set belongs to one of the users. In this example, when two or more parties recalculate the cluster centers, they therefore negotiate a common set of values $r_{1v}, r_{2v}, \ldots, r_{mv}$ with which the cluster centers are encrypted; the re-encrypted cluster centers are returned to the outsourced cloud server, which ensures that the computation over the database is consistent. It should be emphasized that the BCP encryption scheme is not used when re-encrypting a new cluster center; that is, a cluster center point only needs to be encrypted once, with Liu's encryption scheme.

Specifically, the processing method of step S2 includes:

Assume that the data set $D = \{d_1, d_2, \ldots, d_n\}$ contains n data points and that k cluster centers are set in advance; $d_i$ ($1 \le i \le n$) denotes the i-th data point and $E_t$ ($1 \le t \le k$) denotes the t-th cluster center. Each data point $d_i = (x_{i,1}, \ldots, x_{i,m})$ and each cluster center point $E_t = (e_{t,1}, \ldots, e_{t,m})$ are m-dimensional vectors. The Euclidean distance is calculated according to the formula

$$D(d_i, E_t) = \sqrt{\sum_{j=1}^{m} (x_{i,j} - e_{t,j})^2}$$

where j denotes the j-th component of the vector.

However, since this example uses two encryption operations, each $x_{i,j}$ is encrypted twice and uploaded as $Enc(x_{i,j}) = (c_{e(i,j)}, c_{p(i,j)})$. The ciphertext $c_{e(i,j)}$ is used to calculate the distance between the discrete points and the center points and to partition the clusters; $c_{p(i,j)}$ is used to recalculate the cluster centers. For convenience, when describing the distance calculation and cluster partitioning, this example writes $c_{i,j}$ in place of $c_{e(i,j)}$. Liu's homomorphic encryption scheme is used both for calculating and for comparing the distances from the data points to the cluster centers. The encryption key in this algorithm is a list K(v). Because only one $t_i$ in the key list is nonzero, the data owner only needs to upload to the outsourced server the $c_i$ whose $t_i \ne 0$; the invention assumes that this is $c_1$.

During cluster partitioning, the distance from each point to each center point must be calculated. Take the distance from $d_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m})$ to $E_t = (e_{t,1}, e_{t,2}, \ldots, e_{t,m})$ as an example. Since only $c_1$ is uploaded, $c_{i,j}$ is used to represent $Enc(K(v), x_{i,j}) = (c_{1(i,j)}, \ldots, c_{v(i,j)})$, with $c_{i,j} = k_1 t_1 x_{i,j} + s_1 r_{v(i,j)} + k_1 (r_1 - r_{v-1})$. Similarly, $c'_{t,j}$ denotes the encrypted $e_{t,j}$, with $c'_{t,j} = k_1 t_1 e_{t,j} + s_1 r'_{v(t,j)} + k_1 (r_1 - r_{v-1})$. $ED^2(d_i, E_t)$ then denotes the distance from the encrypted data point $d_i$ to the cluster center $E_t$. The following formula gives the calculation of the distance from a data point to a cluster center under ciphertext:

However, the distance calculated under ciphertext cannot be used directly for distance comparison, because it consists of the original distance $D^2(d_i, E_t)$ plus additional terms associated with $r_v$. Therefore, to compare sizes under ciphertext, the data owner, i.e. the user, must provide trapdoor information that cancels the part affecting the distance comparison.

The trapdoor information in this example is a kind of order-preserving index (OPI), introduced by Liu in 2014 for outsourced encrypted computation. Given a key k and a plaintext x, OPI(k, x) yields an index for x. For two plaintext values $x_1$ and $x_2$ with $x_1 > x_2$, the order-preserving index guarantees $OPI(k, x_1) > OPI(k, x_2)$. The index does not allow $x_1$ and $x_2$ to be recovered, but their sizes can be compared.

The scale of a plaintext is the number of digits to the right of its decimal point. For example, a plaintext number in XXX.XX format has scale 2 and sensitivity $10^{-2}$; in general, a plaintext with scale s has sensitivity $10^{-s}$. The index key k used in this example is a pair of numbers (a, b) with a > 0. Let Sens denote the sensitivity of the plaintext; then $OPI(k, x) = a \cdot x + r$, where r is uniformly distributed over the interval $[0, a \cdot Sens)$, so the magnitude of r does not affect the comparison of x values. If $x_1 > x_2$, then $OPI(k, x_1) - OPI(k, x_2) = a(x_1 - x_2) + r_1 - r_2$. Since $a(x_1 - x_2) \ge a \cdot Sens > r_2 - r_1$, it follows that $OPI(k, x_1) > OPI(k, x_2)$.
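The order-preserving property can be checked with a small sketch. It assumes exact rational arithmetic (`Fraction`) so that the comparison is not disturbed by floating-point rounding, and it draws r from a fine grid inside $[0, a \cdot Sens)$ rather than from a truly continuous uniform distribution; the function name `opi` and its parameters are illustrative, and the key component b is not used because the formula in the text does not use it.

```python
import random
from fractions import Fraction

def opi(a, x, sens, rng):
    """Order-preserving index a*x + r, with r taken from [0, a*sens)."""
    if a <= 0:
        raise ValueError("the key component a must be positive")
    # Fraction in [0, 1) scaled to [0, a*sens); a grid approximation of 'uniform'.
    r = Fraction(rng.randrange(10**6), 10**6) * a * sens
    return a * x + r
```

Two plaintexts that differ by at least Sens map to indexes in the same order, while the random term hides the exact value of x from anyone who only sees the index.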

Therefore, to compare ciphertext sizes in this example, an order-preserving index of the form $a \cdot f(X) + g(X, R)$ has to be constructed, where a denotes the encryption key, X is the set of plaintext data to be compared, R is the set of random numbers, and f and g are two functions defined in terms of addition and multiplication. The sensitivity of f(X) is defined as the minimum gap between $f(x_1)$ and $f(x_2)$. Suppose the scale of $f(x_1)$ is $s_1$ and the scale of $f(x_2)$ is $s_2$; then the scale of $f(x_1) + f(x_2)$ is the larger of $s_1$ and $s_2$, and the scale of $f(x_1) \cdot f(x_2)$ is $s_1 + s_2$.
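The two scale rules can be observed directly with Python's `decimal` module, whose arithmetic keeps track of the number of decimal places. The helper names `scale` and `sensitivity` are illustrative, and the sketch assumes finite values well within the default precision.

```python
from decimal import Decimal

def scale(x: Decimal) -> int:
    """Number of digits after the decimal point of a finite Decimal."""
    return max(0, -x.as_tuple().exponent)

def sensitivity(x: Decimal) -> Decimal:
    """Smallest step representable at the scale of x, i.e. 10^(-scale)."""
    return Decimal(10) ** -scale(x)

u = Decimal("1.25")   # scale 2
v = Decimal("0.5")    # scale 1
total = u + v         # scale max(2, 1) = 2
product = u * v       # scale 2 + 1 = 3
```

This is why squaring a distance of scale s, as done later for $D^2$, doubles the scale and squares the sensitivity.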

Assume that the sensitivity of f(X) is Senf and that the ciphertext has the form $a \cdot f(X) + g(X, R)$, which is to be converted to the form $OPI(k, X) = a \cdot X + R$. For the outsourced server to convert a ciphertext of the form $a \cdot f(X) + g(X, R)$ into an order-preserving index of the form $a \cdot f(X) + R$, the data owner needs to construct the trapdoor information $-g(X, R) + R$. First, $ED^2(d_i, E_t)$ must be written in the form $a \cdot f(X) + g(X, R)$; the specific calculation formula is as follows:

wherein a, f (X) and g (X, R) are respectively as follows:

a＝(k₁*t₁)²

Here, assume that the scale of $D(d_i, E_t)$ is s; then the scale of $D^2(d_i, E_t)$ is $s + s = 2s$, so the sensitivity of $D^2(d_i, E_t)$ is $10^{-2s}$. Since the data owner must provide the trapdoor information (Trapdoor), different data distributions may lead to different forms of trapdoor information.
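How a trapdoor cancels the noise in a ciphertext distance can be sketched with a deliberately simplified cipher. The sketch assumes integer data (so the sensitivity of $D^2$ is 1) and collapses the random terms of Liu's scheme into a single noise value per component, $c = k_1 t_1 \cdot x + noise$; `toy_encrypt`, `trapdoor`, and the other names are hypothetical stand-ins, not the scheme used in the patent.

```python
import random

def toy_encrypt(kt, x, rng):
    """Simplified stand-in cipher: returns (k1*t1*x + noise, noise)."""
    noise = rng.randrange(1, 1000)
    return kt * x + noise, noise

def cipher_sq_distance(cd, ce):
    """Sum of squared ciphertext differences, i.e. a*D^2 + g(X, R)."""
    return sum((c - c2) ** 2 for c, c2 in zip(cd, ce))

def trapdoor(kt, d, e, noise_d, noise_e, rng):
    """-g(X, R) + r, computed by the owner who knows plaintexts and noise."""
    g = sum(2 * kt * (x - y) * (n - n2) + (n - n2) ** 2
            for x, y, n, n2 in zip(d, e, noise_d, noise_e))
    r = rng.randrange(kt ** 2)      # r in [0, a * Sens) with a = (k1*t1)^2, Sens = 1
    return -g + r

def order_preserving_index(kt, d, e, rng):
    """What the server obtains after adding the trapdoor: a*D^2(d, e) + r."""
    enc_d = [toy_encrypt(kt, x, rng) for x in d]
    enc_e = [toy_encrypt(kt, y, rng) for y in e]
    cd, nd = zip(*enc_d)
    ce, ne = zip(*enc_e)
    return cipher_sq_distance(cd, ce) + trapdoor(kt, d, e, nd, ne, rng)
```

Because the result equals $a \cdot D^2 + r$ with $0 \le r < a$, a point with a strictly smaller plaintext distance always receives a strictly smaller index, so the server can pick the nearest center without learning $D^2$ itself.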

When calculating $D^2(d_i, E_t)$, the outsourced server does not need to perform different calculations for data originating from different data owners, but it does need to keep track of which user each record comes from. The user's trapdoor information (Trapdoor) is required to compare distances and partition the clusters; because the outsourced server knows whether each $d_i$ comes from P1, P2, and so on, the data owner to whom the data point belongs is the one that provides the trapdoor information.

In this example, under a horizontal data distribution, assume that $d_i$ comes from P1; the trapdoor information is then provided by P1. The format of the trapdoor information is $-g(X, R) + R$. The trapdoor function designed in this example consists of two parts:

Trap_it(d_i,E_t)+Trap_t(E_t)

The first of these two parts, $Trap_{it}(d_i, E_t)$, is the trapdoor function of the distance from each data point to the cluster center point, and can be calculated in advance by the data owner. The second, $Trap_t(E_t)$, is the trapdoor function of each cluster center point and changes in each iteration as the cluster centers change. The calculation formula is as follows:

where $NB_{tj}$ lies in the range $[0, (k_1 t_1)^2 \cdot sens]$, and the sum of these random numbers corresponds to the random part R in $-g(X, R) + R$, as shown in the following formula.

where $-g(X, R)$ corresponds to

and the random number

When recalculating the cluster centers, this example uses the ciphertexts $c_{p(i,j)}$ produced by the second scheme, BCP encryption. The server uses $c_{p(i,j)}$ to calculate, for each cluster, the number of data points and the sum of their corresponding components, and sends the sums to the respective users Pi according to the data distribution. To calculate the average of the data points of each cluster across all parties, this example designs the OPPWAP protocol (an outsourced privacy-preserving average calculation protocol), which reduces the task to the problem of computing an average under ciphertext.

Specifically, in steps S4-S6, each cluster center is recalculated by summing, component by component, the discrete points assigned to that center and taking the average. Assume there are n points and t users, and that each user Pi holds its own values together with their encrypted values. The sum is $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ and the number of points in the cluster that belong to Pi is $a_i$. The server sends the calculated $X_i$ and $a_i$ to the respective user Pi; each user Pi encrypts them with the BCP encryption scheme and then, together with the server, computes the average with the OPPWAP protocol.

The final result is the calculated average.

The specific implementation method comprises the following steps:

A1: the outsourced server S initializes via Setup, generates the public parameters PP = (N, K, g), and sends PP to each user Pi;

and sends the data to the server;

and returns

and

to each user Pi;

A7: user Pi computes

and

and sends them to the server;

A8: after obtaining the data $X_i$ and $X'_i$, the server computes the new data

and

then generates a random number $\tau$ and sends K, K',

to each user Pi;

A9: after obtaining the data, each user Pi performs the final calculation.

The invention takes K-means, a typical clustering algorithm, and, for the case of two or more data sources, protects the private data of each data owner with cryptographic techniques and carries out the computation securely with secure multi-party computation. In addition, each iteration of K-means must calculate the distance from every data point to every center point, and this repeated computation is time-consuming; outsourcing it to the server improves efficiency.

The effects of the present invention are further illustrated below in conjunction with experimental data:

The experiment was carried out on a single machine, and the system development environment was as follows:

(1) the operating system is Windows 7, the processor is an Intel(R) Core(TM) i5-4570 CPU running at 3.2 GHz, and the system memory is 8 GB;

(2) the encryption key for BCP encryption is 512 bits; because the key is large, the running stage is slow;

(3) the programming language is Java, the development environment is Eclipse, and the system database is MySQL.

The experimental data were downloaded from the public UCI data set collection. The original data are decimals; because BCP encryption is a group-based scheme that does not support decimal arithmetic, the data were subsequently converted to integers. The processed data set contains 10000 records with 7 attributes. Part of the data is shown in FIG. 3.
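The text does not say how the decimals were turned into integers. One common approach, shown here purely as an assumption, is fixed-point scaling: multiply every value by $10^{s}$ for a chosen scale s so that the group-based scheme only ever sees integers, and divide again after decryption. The helper names are illustrative.

```python
from decimal import Decimal

def to_integer(value: str, scale: int) -> int:
    """Shift the decimal point right by `scale` digits and return an exact integer."""
    scaled = Decimal(value).scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} decimal places")
    return int(scaled)

def to_decimal(n: int, scale: int) -> Decimal:
    """Inverse of to_integer: shift the decimal point back to the left."""
    return Decimal(n).scaleb(-scale)
```

For example, with scale 2 the value 15.25 becomes the integer 1525 before encryption, and a decrypted 1525 maps back to 15.25.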

As shown in FIG. 4 and FIG. 5, to verify the correctness of the computation under ciphertext, the experiment also ran K-means clustering on the plaintext. The comparison shows that the ciphertext and plaintext clustering results on the same data are completely consistent, which verifies the correctness of the theory presented here.

FIGS. 3 and 4 show excerpts of the clustering results. The initial cluster centers are set to three points, {15,14,2,6,4,4,6}, {1,1,1,1,1, 1}, and {15,13,2,6,4,4,6}, where the first record is test data. It can be seen that, for identical data, the clustering results are completely consistent. The data encryption time is shown in Table 1; the encryption process applies the two homomorphic encryption schemes at the same time.

Table 1 shows that the time spent on encryption is within an acceptable range and that the total time spent computing the Trapdoor is slightly less than the encryption time. Computing the Trapdoor requires accumulating every component of each $d_i$: it involves fewer encryption operations, but its formula is more complicated than encryption, so it takes slightly less time than encrypting $d_i$, though the difference is small.

TABLE 1 encryption time consumption

The time spent by the data owner in one iteration mainly includes the calculation of Trapdoor information (Trapdoor), and the time spent by the OPPWAP protocol calculation. The results of comparison in the case where the number of data points was different are shown in Table 2.

TABLE 2 one iteration elapsed time

FIG. 6 is a line graph comparing the time consumed in one iteration by the server and by one user. The abscissa represents the number of center points of the cluster, and the ordinate represents the clustering time in milliseconds (ms). It can be seen that the data owner consumes far less time than the server. Table 3 shows the time consumed by the users Alice and Bob in one iteration, including the time consumed by OPPWAP and Trapdoor.

TABLE 3 time consumed by data owner in one iteration

Tables 2 and 3 show the time cost of one iteration. As the number of data points increases, the time cost at the server end increases, and the increase is not linear because it is related to the number of iterations. The time cost for Alice and Bob, however, is not related to the number of data points. Table 3 shows that the time cost of Trapdoor is relatively small and that most of the time is taken up by the OPPWAP calculation. Because the number of OPPWAP calculations depends only on the number k of clusters and the dimension m of the data points, and k and m do not change during the calculation, the time consumed by the OPPWAP calculation for Alice and Bob in one iteration is basically stable. As can be seen from Table 2, when the number of data points increases, the server side bears more of the calculation cost, while the cost borne by the data owner remains substantially stable, and the time cost of the Trapdoor calculation is far less than that of the server.
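To illustrate why the averaging workload does not grow with the number of data points, the following Python sketch computes, in plaintext, the quantity that the weighted-average step produces: each new center component is the sum of the per-user sums X_i divided by the sum of the per-user counts a_i. The function name, the data layout, and the sample values are assumptions for illustration, and no encryption is performed here. The loop performs one average per cluster and per dimension, assumed here to correspond to one protocol invocation each, so k * m in total regardless of how many points were summed.

```python
from fractions import Fraction


def new_centers(per_user_sums, per_user_counts):
    """per_user_sums[i][t] is user i's m-dimensional component sum X_i for cluster t;
    per_user_counts[i][t] is the number a_i of user i's points in cluster t."""
    k = len(per_user_sums[0])
    m = len(per_user_sums[0][0])
    centers = []
    averages_computed = 0
    for t in range(k):
        total_count = sum(counts[t] for counts in per_user_counts)
        center = []
        for j in range(m):
            total_sum = sum(sums[t][j] for sums in per_user_sums)
            # An empty cluster has no average; mark it with None.
            center.append(Fraction(total_sum, total_count) if total_count else None)
            averages_computed += 1
        centers.append(center)
    return centers, averages_computed


# Hypothetical values for two users (Alice, Bob), k = 2 clusters, m = 2 dimensions.
alice_sums, alice_counts = [[2, 3], [10, 10]], [2, 1]
bob_sums, bob_counts = [[4, 5], [11, 10]], [1, 1]
print(new_centers([alice_sums, bob_sums], [alice_counts, bob_counts]))
```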

The number of iterations of the K-means clustering calculation is not controllable: it depends on the number of data points, the number of clusters, and the initial points selected each time. When the K-means algorithm is used, k (the number of clusters) is usually a fixed number; in this experiment k is set to 3, and the approximate relationship between the number of data points and the number of iterations is shown in Table 4.

TABLE 4 number of data points versus number of iterations (for reference)

The communication consumption of the whole clustering mainly comprises the uploading of the ciphertext data, the OPPWAP protocol, and the uploading of the new cluster centers and the Trapdoor function; the cluster centers and the Trapdoor are uploaded together, so their communication cost is calculated together. The overall communication cost is shown in Table 5. Note that the costs of the OPPWAP protocol and of the cluster centers and Trapdoor function are given for one iteration: because the number of K-means iterations is not controllable, the cost of a single iteration is more meaningful than the total communication consumption over multiple iterations.

TABLE 5 communication consumption

From Table 5, the upload time of the data is similar to the upload time of the cluster center points; the upload times are basically consistent because three-dimensional arrays are uploaded in the experiment. Because calculation and communication are carried out essentially at the same time, the time of the whole OPPWAP is the sum of the times Alice and Bob each spend calculating OPPWAP in one iteration in Table 2. Because the process start-up of the computer is not deterministic from run to run, the reported times are averages over multiple program runs.

Finally, this example experimentally compares the time efficiency of calculation on ciphertext under the present solution with that of calculation on plaintext. The comparison is divided into three parts: the time to upload plaintext data versus ciphertext data, the time consumed by one iteration on plaintext versus ciphertext, and the time consumed by the whole clustering process. To make the comparison clear, 2000 data points with 7 attribute values were used.

TABLE 6 comparison of computation times for plaintext and ciphertext

Table 6 shows the overall comparison of plaintext and ciphertext calculation in terms of data upload time, one iteration, and the entire clustering. Because the data are uploaded as 7-dimensional arrays in the experiment, the array upload time is basically fixed at 79 ms. In one iteration, the ciphertext calculation requires approximately twice as much time as the plaintext calculation because of the required OPPWAP calculation and communication consumption. In the whole clustering calculation, the experiment reads the data from a document, so data reading is fast, and the ciphertext calculation takes longer than the plaintext calculation mainly because of the added encryption time. Reading from a database would take 2-3 s longer than reading the document; the document form was adopted so that the timing is more accurate. FIGS. 7 and 8 show the difference between plaintext and ciphertext as histograms. The ordinate represents time, in milliseconds (ms) in FIG. 7 and in seconds (s) in FIG. 8.

Compared with privacy-preserving K-means outsourcing calculation, the outsourcing calculation experiment of the invention adds only the calculation time of the OPPWAP protocol. From Table 5, the OPPWAP time consumption in one iteration is about 412 ms; since the number of iterations is not controllable, the experiment reports the time of one iteration. The time consumed in transferring the Trapdoor function is basically the same for multiple parties as for a single party.

In conclusion, the invention combines privacy-preserving data mining with outsourced calculation and performs experimental analysis. The main achievements of the invention are as follows:

(1) The advantages and disadvantages of different technologies for privacy-preserving data mining are analyzed and summarized. Data perturbation techniques tend to damage the data and are a compromise between privacy protection and data mining precision. Cryptographic techniques do not affect the data mining result, but data encryption brings a larger time cost. The invention selects the cryptographic approach and improves efficiency by choosing a relatively efficient encryption algorithm and a data outsourcing mode;

(2) The traditional method performs secure computation with the secure circuit evaluation method proposed by Yao, which is realized by bit-wise encryption of 0/1 strings, so the time cost is very high; and the privacy-preserving weighted averaging problem alone does not adequately address the comparison of data point distances. The invention better solves these problems by combining an improved trapdoor encryption algorithm with privacy-preserving data mining, and the efficiency is also improved;

(3) An outsourcing calculation protocol for a privacy-preserving K-means clustering algorithm is designed: the iterative calculation of the distance between two points in the K-means algorithm is outsourced to a server and is realized through secure multi-party computation built on two encryption technologies. The improved Liu encryption scheme is used for comparing the distance from a data point to a cluster center and dividing clusters; BCP encryption is used for recalculation of the cluster centers;

(4) Time complexity, space complexity, and security analyses are carried out for the invention, followed by experimental verification. The invention realizes secure computation under the semi-honest model and can resist collusion attacks to a certain degree.
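The role of BCP encryption in recalculating cluster centers rests on its additive homomorphism: multiplying two ciphertexts componentwise yields a ciphertext of the sum of the plaintexts, which is what allows a server to add data points without decrypting them. The following Python sketch shows this property for a textbook form of BCP with a single key pair. The toy primes, the base g = 2, and all function names are assumptions for illustration; the parameters are far too small to be secure, and the multi-user unified-key mechanism of the invention is not modeled. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import secrets

# Toy modulus built from two small primes; real deployments use primes of
# cryptographic size. All concrete values here are illustrative assumptions.
P, Q = 1019, 1187
N = P * Q
N2 = N * N
G = 2  # any base coprime to N is enough for this correctness demonstration


def keygen():
    """Return (private key a, public key h = g^a mod N^2)."""
    a = secrets.randbelow(N - 1) + 1
    return a, pow(G, a, N2)


def encrypt(pk, m):
    """Encrypt an integer 0 <= m < N as the pair (g^r, h^r * (1 + m*N)) mod N^2."""
    r = secrets.randbelow(N - 1) + 1
    return pow(G, r, N2), (pow(pk, r, N2) * (1 + m * N)) % N2


def add(c1, c2):
    """Multiply ciphertexts componentwise; the plaintexts add modulo N."""
    return (c1[0] * c2[0]) % N2, (c1[1] * c2[1]) % N2


def decrypt(sk, c):
    """Strip the mask g^(r*a), leaving 1 + m*N, and recover m."""
    a_part, b_part = c
    unmasked = (b_part * pow(pow(a_part, sk, N2), -1, N2)) % N2
    return (unmasked - 1) // N


sk, pk = keygen()
total = add(encrypt(pk, 15), encrypt(pk, 14))
print(decrypt(sk, total))
```

The sum is recovered correctly only while it stays below N, which is one reason integer-valued inputs of bounded size are convenient for this kind of scheme.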

The above-described embodiments are intended to be illustrative, and not restrictive, of the invention, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. The multi-user privacy protection data clustering method based on the ciphertext data is characterized by comprising the following steps of:

S6: all users jointly calculate the average value of the data points in each cluster from the cluster center point through the outsourced privacy-preserving average calculation protocol and then send the average value to the server, and execution returns to step S1 until the average value is smaller than the threshold value, whereupon the classification is finished and the server sends the classification result to each user according to the data source;

in step S1, the server is an outsourced server, and the user encrypts the data twice through homomorphic encryption and BCP encryption respectively to obtain a data set

Containing n data points, each data point

M denotes that each data point is an m-dimensional vector, and each data point is a vector of m dimensions

Component (b) of

Will be encrypted and uploaded to the outsourcing server twice

wherein

，

，

representing the ciphertext encrypted using a homomorphic encryption scheme,

representing the ciphertext encrypted using the BCP encryption scheme.

2. The multi-user privacy preserving data clustering method according to claim 1, characterized in that: the processing method of step S2 includes:

S21: the server, according to the ciphertext

Computing ciphertext data points

And the t-th cluster center point

Is a distance of

Wherein k is the number of the clustering central points,

；

S22: according to the Trapdoor function in the trapdoor information provided by the user, the outsourcing server calculates

and compares the distance from each data point to each cluster center, selects the closest center, and assigns the point to the corresponding cluster.

3. The multi-user privacy preserving data clustering method according to claim 2, characterized in that: in step S21, each data point

And each cluster center point

Are all m-dimensional vectors, and the encrypted data for each data point is

Said distance

The calculation formula of (2) is as follows:

。

4. The multi-user privacy preserving data clustering method according to claim 3, characterized in that: in step S22, the Trapdoor function is used to generate an order-preserving encryption index with which the magnitudes of two data values can be compared.

5. The multi-user privacy preserving data clustering method according to claim 2, characterized in that: in step S3, the server side uses the ciphertext

to calculate the number of data points in each cluster and the sum of the corresponding components of those data points, and sends the summed results to each user Pi according to the data distribution.

6. The multi-user privacy preserving data clustering method of claim 5, wherein: in steps S4-S6, the recalculation of each cluster center consists of summing the corresponding components of the discrete points belonging to that center in each divided cluster and then averaging; assuming that there are n points and t users, each user Pi,

is the value of Pi, and is,

each user having an encrypted value of

Each of which is

Is a m-dimensional vector, and the cloud calculates the discrete point of each clustering center Pi

The values of the corresponding components are summed and the number is calculated, then

The result of the addition is

and the number of points belonging to Pi in the cluster is a_i; the server sends the calculated X_i and a_i to each user Pi, and each user Pi encrypts them with the BCP encryption scheme and, together with the server, calculates through the OPPWAP protocol

The final result is the calculated average.

7. The multi-user privacy preserving data clustering method of claim 6, wherein: the process carried out by the server and the respective users Pi based on the OPPWAP protocol comprises the following steps:

a1: the outsourcing server S initializes through Setup and generates a common parameter PP = (N, K, g), and sends the common parameter PP to each user Pi;

a2: after each user Pi obtains the public parameters, the public key and the private key of the user Pi are generated through the key generator

And will public key

Sending the data to a server;

a3: the server combines all the public keys to calculate a unified public key Prod.pk and sends the unified public key Prod.pk to each user Pi;

a4: user Pi encrypts its data to obtain result

And

；

a5: user Pi generates two random numbers

And

and recalculating the encrypted data to obtain:

，

，

,

and sending the data to a server;

,

,

,

And will be

And

returning to each user Pi;

a7: each user Pi calculates

And

and sending to the server;

a8: server get data

And

then, new data is calculated

And

then generates a random number

Then will be

Sending to each user Pi;

a9: after each user Pi obtains the data, the calculation is finally carried out

so as to obtain the average value of the data points in each cluster from the cluster center point.

8. A system for implementing the multi-user privacy preserving data clustering method according to any one of claims 1 to 7, characterized in that: the system comprises a server and more than two users, wherein the users are used for sending the encrypted data, the cluster center points, and the trapdoor information to the server, re-encrypting the received data sums and counts by the BCP encryption method and sending them to the server, calculating the average value of the data points in each cluster from the cluster center point, and then sending the average value to the server; the server is used for calculating the distance between the ciphertext data points and the cluster center points, dividing the clusters according to the distance and the trapdoor information, adding the data points of different users in each cluster respectively, sending the data sums and counts to the respective users, calculating new cluster center points and sending them to each user, and, after the classification is finished, sending the classification result to each user according to the data source.

9. The system of claim 8, wherein: the server is an outsourced cloud server.