A k-means clustering method for differential privacy protection
Technical field
The present invention relates to privacy protection and clustering methods, and in particular to a k-means clustering method for differential privacy protection, and belongs to the field of information security technology.
Background art
With the rapid development of cloud computing and big data, data mining technology has made significant progress in both research and application. As one of the important methods of data mining, clustering algorithms can uncover implicit, previously unknown knowledge and rules, and have important potential value for operational decision making over large volumes of related data. At the same time, however, the disclosure of sensitive information in bulk data exposes users to threats and losses that are hard to estimate. How to protect data privacy during cluster analysis has therefore become a hot issue in the fields of data mining and data privacy protection. With the proposal and development of privacy protection techniques, differential privacy has become a popular privacy protection technology. Differential privacy is realized through a noise mechanism: random noise is added to the output result to protect data security. The larger the added noise, the safer the data, but the lower its utility, and vice versa.
As one of the most common clustering methods, the k-means algorithm is simple to implement and clusters quickly. However, traditional differentially private k-means algorithms (such as differentially private k-means and differentially private k-means++) are sensitive to the choice of initial center points, and the selection of the number of clusters k is somewhat blind, which reduces the utility of the clustering result.
Summary of the invention
To address the shortcomings of the background art described above, the present invention proposes a k-means clustering method for differential privacy protection, in which appropriate random noise following a specific distribution is added in the iterative process of the k-means clustering algorithm, so that the clustering result is distorted to a certain extent, achieving privacy protection while preserving the utility of the data. The method is simple, easy to operate, and places no restrictions on the size or attributes of the data set.
The method of the present invention takes the result of running the k-means++ algorithm on the data set as its input, and then obtains an optimized set of initial cluster centers by alternating a series of non-local "jumps" with runs of the traditional k-means algorithm; using this center set, it then carries out the iterative clustering process with noise added to the center points. The differential privacy technique used in the present invention defines a strict adversary model, gives a rigorous mathematical proof and a quantitative expression of the privacy risk, and at the same time achieves a good balance between the utility of the k-means clustering result and the level of privacy protection.
The k-means clustering method for differential privacy protection of the present invention comprises the following steps:
Step 1: Preprocess the sample data;
Step 2: Let C denote the set of cluster centers after clustering, and let φ(C, X) denote the error sum of squares of the given sample data set X under the center set C, where x denotes a data point in X and c denotes a center in C:

φ(C, X) = ∑_{x∈X} min_{c∈C} ||x − c||²  (2)

Let retry denote the number of retries, retry_max the maximum number of retries, φ_best the best (smallest) error sum of squares found so far, and C_best the corresponding center set. Run the k-means++ algorithm on the data set X and store the resulting error sum of squares φ(C, X) in φ_best and the resulting cluster center set C in C_best; set retry_max = m, m ∈ {0, 1, 2, …}, and initialize retry = 0;
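For illustration only (not part of the claimed method), the error sum of squares of formula (2) can be sketched in Python with NumPy; the function name `sse` and the toy data are assumptions:

```python
import numpy as np

def sse(C, X):
    """Error sum of squares phi(C, X): for each sample x in X, the squared
    Euclidean distance to its nearest center in C, summed over all samples."""
    # Pairwise squared distances, shape (n_samples, n_centers)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 0.0]])
print(sse(C, X))  # nearest-center squared distances: 0 + 1 + 0, prints 1.0
```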
Step 3: While retry ≤ retry_max: let λ denote the most "useless" center, and let C_i denote the centroid of cluster i, where

C_i = (1/|X_i|) ∑_{x∈X_i} x,

X_i being the set of points assigned to cluster i. Let μ denote the center whose cluster has the largest intra-cluster sum of squared distances, C_μ the centroid of cluster μ, and d_μ the average distance of cluster μ, where

d_μ = (1/|X_μ|) ∑_{x∈X_μ} ||x − C_μ||.

Let o denote a small random offset, u a random vector on the d-dimensional unit hypersphere, and ∈ the offset scale factor, where o = ∈·d_μ·u; then set λ = μ + o and μ = μ − o;
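A minimal sketch of one non-local "jump" in Python. Interpreting the most "useless" center as the one whose cluster contributes least to the error sum of squares is an assumption of this illustration, as are all names:

```python
import numpy as np

def nonlocal_jump(C, X, eps=0.01, rng=None):
    """One non-local 'jump': move the most 'useless' center next to the center
    of the most spread-out cluster, so that cluster is split in two."""
    rng = np.random.default_rng() if rng is None else rng
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    per_cluster_sse = np.array([d2[labels == j, j].sum() for j in range(len(C))])
    lam = per_cluster_sse.argmin()   # most "useless" center (assumed reading)
    mu = per_cluster_sse.argmax()    # cluster with largest intra-cluster SSE
    members = X[labels == mu]
    d_mu = np.linalg.norm(members - C[mu], axis=1).mean()  # average distance
    u = rng.normal(size=C.shape[1])
    u /= np.linalg.norm(u)           # random unit vector on the d-sphere
    o = eps * d_mu * u               # small random offset o = eps * d_mu * u
    C = C.copy()
    C[lam] = C[mu] + o               # lambda = mu + o
    C[mu] = C[mu] - o                # mu = mu - o
    return C
```

With offset scale 0.01 the two new centers straddle the old centroid of the widest cluster at a distance of 0.01·d_μ on each side.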
Step 4: Using the center set C obtained in step 3 as the initial centers, run the traditional k-means algorithm and compare φ(C, X) with φ_best. If φ(C, X) is less than φ_best, set φ_best = φ(C, X), C_best = C and retry = 0; otherwise exit the current inner loop, set retry = retry + 1, and reset C = C_best;
Step 5: Repeat steps 3 and 4 until retry exceeds the given maximum number of retries retry_max, then return the optimal center set C_best;
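The alternation of steps 3-5 can be sketched as a retry loop; `kmeans_fn`, `sse_fn` and `jump_fn` are hypothetical callables standing in for the traditional k-means run, formula (2), and the jump of step 3:

```python
import numpy as np

def refine_centers(X, C0, kmeans_fn, sse_fn, jump_fn, retry_max=5):
    """Steps 3-5: alternate non-local jumps with standard k-means runs, keep
    the best center set seen, stop after retry_max failed attempts in a row."""
    C_best, phi_best, retry = C0, sse_fn(C0, X), 0
    while retry <= retry_max:
        C = jump_fn(C_best, X)   # step 3: non-local "jump" from the best centers
        C = kmeans_fn(X, C)      # step 4: run plain k-means from jumped centers
        phi = sse_fn(C, X)
        if phi < phi_best:       # improvement: keep it, reset the retry count
            phi_best, C_best, retry = phi, C, 0
        else:                    # no improvement: count a failed retry
            retry += 1
    return C_best, phi_best      # step 5: best centers found
```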
Step 6: Traverse every point in the data set X, compute its distance to each of the centers, assign it to the nearest center, and thereby partition X into k clusters;
Step 7: Set the random noise to be added:
The random noise is Laplace noise, i.e. the noise follows the Laplace distribution Lap(b), where b = Δf/ε, Δf is the global sensitivity, and ε is the privacy budget. The Laplace distribution with location parameter 0 and scale parameter b, written Lap(b), has probability density function

p(η) = (1/(2b)) exp(−|η|/b),

where η denotes the random variable;
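Sampling the Lap(b) noise can be sketched with NumPy's built-in Laplace sampler; the function name and parameters are illustrative assumptions:

```python
import numpy as np

def laplace_noise(delta_f, epsilon, size=None, rng=None):
    """Draw Lap(b) noise with scale b = delta_f / epsilon; the density is
    p(eta) = exp(-|eta| / b) / (2 * b), with location parameter 0."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.laplace(loc=0.0, scale=delta_f / epsilon, size=size)
```

For a Laplace variable with scale b, the mean is 0 and the mean absolute value is b, which can be checked empirically.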
Step 8: Recompute, for each cluster, the sum of its data points and the number of its points; add the noise Lap(b) to obtain sum′ = sum + Lap(b) and num′ = num + Lap(b); finally update the centroid of the cluster to sum′/num′;
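Step 8 can be sketched as follows (names are assumptions; with a tiny scale b the noisy centroids approach the exact ones):

```python
import numpy as np

def noisy_update(X, labels, k, b, rng=None):
    """Step 8: for each cluster, add Lap(b) noise to the coordinate sums and
    to the point count, then update the centroid to sum' / num'."""
    rng = np.random.default_rng() if rng is None else rng
    centers = np.empty((k, X.shape[1]))
    for j in range(k):
        pts = X[labels == j]
        noisy_sum = pts.sum(axis=0) + rng.laplace(0.0, b, size=X.shape[1])  # sum'
        noisy_num = pts.shape[0] + rng.laplace(0.0, b)                      # num'
        centers[j] = noisy_sum / noisy_num                                  # sum'/num'
    return centers
```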
Step 9: Repeat steps 7 and 8 until the error sum of squares converges or the number of iterations reaches its upper limit; the smaller the error sum of squares, the more compact and well-separated the clusters.
In step 1, the data are preprocessed as follows:
Let the sample data set be X, with sample space dimension d and number of samples n. Determine the proportional relationships among the attributes of the samples. Based on the maximum value Max and the minimum value Min of the raw data, standardize the data by normalization: each record of the data is a d-dimensional vector, and every dimension must be scaled into the space [0, 1]^d, as shown in formula (1):

y(l) = (f(l) − Min) / (Max − Min)  (1)

where Min and Max denote respectively the minimum and maximum values of dimension l, f(l) is the data of dimension l, and y(l) is the scaled data of dimension l.
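Formula (1) is ordinary min-max normalization and can be sketched as follows (the per-dimension Min/Max convention follows the text; the function name is an assumption):

```python
import numpy as np

def minmax_scale(X):
    """Formula (1): per dimension l, y(l) = (f(l) - Min) / (Max - Min),
    mapping every dimension of the data into [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)  # per-dimension Min and Max
    return (X - mn) / (mx - mn)
```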
In step 3, the offset scale factor ∈ is taken as 0.01.
In step 6, dist(x, y) denotes the distance between points x and y, x_i denotes the value of the i-th dimension of point x, y_i the value of the i-th dimension of point y, and dim the dimension of the points. The distance between two points is computed with the Euclidean distance formula, as shown in formula (3):

dist(x, y) = sqrt( ∑_{i=1}^{dim} (x_i − y_i)² )  (3)
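Formula (3) can be sketched directly:

```python
import math

def dist(x, y):
    """Formula (3): Euclidean distance between two points of dimension dim."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```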
In step 7, different data sets require different numbers of iterations of the clustering algorithm to reach the convergence condition:
(a) If the number of iterations N is fixed, the privacy budget consumed by each iteration is ε/N, and noise of size Lap((d+1)N/ε) can be added each time to obtain ε-differential privacy;
(b) If the number of iterations N is unknown, the value of the privacy budget ε must be adjusted continually during the iterations. Early iterations influence the clustering result more than later ones, so the privacy budget allocated per iteration is gradually decreased during clustering: the budget of the i-th allocation is

ε_i = ε/2^i  (4)

so the first allocation is ε/2 with noise of size Lap(2(d+1)/ε), and each later iteration consumes half the budget of the previous one, until the last iteration completes.
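The halving schedule of case (b) can be sketched as follows; spending the whole remainder on the last iteration, so that the allocations sum exactly to ε, is an assumption of this sketch:

```python
def budget_schedule(epsilon, iterations):
    """Case (b): the first allocation is epsilon/2, each later iteration
    consumes half of the previous budget, and the last iteration takes the
    remainder so the total spent equals epsilon (assumed convention)."""
    budgets, remaining = [], epsilon
    for t in range(iterations):
        eps_t = remaining if t == iterations - 1 else remaining / 2.0
        budgets.append(eps_t)
        remaining -= eps_t
    return budgets
```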
The present invention has the following beneficial effects:
To ensure the security of the k-means clustering algorithm, appropriate noise is added to the center points of the k-means algorithm, a clustering algorithm based on differential privacy is designed, and the algorithm is proved to satisfy the differential privacy condition. Compared with existing differentially private k-means algorithms, the method of the invention takes the result of the k-means++ algorithm as input and then alternates a series of non-local "jumps" with runs of the traditional k-means algorithm, which improves the selection of initial centers. This effectively avoids the blindness of choosing k and the sensitivity to initial points, and reduces the number of iterations, thereby improving the utility of the clustering while protecting privacy.
Description of the drawings
Fig. 1 is a schematic diagram of the experimental data used to test the performance of the differentially private k-means clustering algorithms provided by the invention;
Fig. 2 is the workflow diagram of the k-means clustering method for differential privacy protection provided by the invention.
Detailed description of the embodiments
The implementation of the technical solution of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that these examples are only illustrative of the invention and are not intended to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims.
The k-means clustering method for differential privacy of the present invention takes the result of the k-means++ algorithm as input, then improves the selection of initial centers by alternating a series of non-local "jumps" with runs of the traditional k-means algorithm, and uses the Laplace mechanism of differential privacy to add, in the iterative process of the k-means clustering algorithm, appropriate random noise following a specific distribution, so that the clustering result is distorted to a certain extent, achieving privacy protection while preserving the utility of the data. The method is simple and easy to operate, is theoretically proved to satisfy the ε-differential privacy condition, effectively avoids the blindness of choosing k and the sensitivity to initial points, and reduces the number of iterations, thereby improving the utility of the clustering while protecting privacy. It is applicable to data publication and privacy protection for data sets of different scales and dimensions.
Referring to Fig. 2, a specific embodiment is as follows:
Step 1: A sample data set housec8.txt is collected, storing the three color values of house colors; the number of samples is 34112 and the number of attributes is 3, giving the sample set X = {x1, x2, …, x34112}. Every dimension of the data is scaled to the interval [0, 1] with formula (1). Twenty rows of data from the scaled data set are taken, as follows:

x1 = [0. 0.08130081 0.00473934]
x50 = [0.00961538 0.0203252 0.00947867]
x102 = [0.02403846 0.01626016 0.03317536]
x155 = [0.02403846 0.0203252 0.01895735]
x250 = [0.03365385 0.06910569 0.00473934]
x350 = [0.03365385 0.11788618 0.01895735]
x1000 = [0.04326923 0.03658537 0.07109005]
x3020 = [0.04326923 0.10569106 0.02369668]
x5030 = [0.04807692 0.03658537 0.02843602]
x6000 = [0.05288462 0.06097561 0.04265403]
x9843 = [0.05288462 0.08130081 0.04265403]
x10345 = [0.05288462 0.09349593 0.01895735]
x18546 = [0.05769231 0.01219512 0.04265403]
x20345 = [0.05769231 0.05284553 0.04739336]
x24675 = [0.05769231 0.06097561 0.07582938]
x26546 = [0.05769231 0.06910569 0.02843602]
x29654 = [0.0625 0.06097561 0.10900474]
x30000 = [0.0625 0.06910569 0.07582938]
Step 2: The k-means++ algorithm is run on data set X, giving φ_best = 0.024421323538 and

C_best = [[0.33290311 0.25585707 0.15738572]
          [0.61027347 0.44192056 0.28916641]
          [0.70476998 0.79096867 0.75084066]]

Step 3: Starting from the C_best obtained in step 2, a better set of initial centers is selected according to steps 3-5 of the technical solution; the result is

[[0.59984837 0.42572074 0.28687944]
 [0.70476998 0.79096867 0.75084066]
 [0.59510802 0.42289839 0.28504289]]
Step 4: Set the random noise to be added. The total differential privacy budget is taken as ε ∈ [0.1, 1]. Since the number of attributes of the experimental data set is d = 3 and the number of iterations is unknown, it follows from formula (4) that the budget of the first allocation in the iterative process is ε/2 with noise size Lap(8/ε), the budget of the second allocation is ε/4 with noise size Lap(16/ε), and each later iteration consumes half the budget of the previous one, until the last iteration completes.
Step 5: Using the noise magnitude set in step 4, noise is added to the sum of the data points and to the point count of each cluster, and the centroids are updated. In the first iteration, the sums of the 3 attributes of the points in the 3 clusters are computed, and the numbers of points in the 3 clusters are

Num = [23101 3600 7411]

From step 4, the noise added in the first iteration is Lap(8/ε), so the new centroids sum′/num′ after the first update are

[[0.6381381 0.53917325 0.41654519]
 [0.24548196 0.1825142 0.10771185]
 [0.39675444 0.31281772 0.20345833]]

The concrete results of the subsequent iterations are not detailed here; the experiment finally converges after the 40th iteration, and the center set of the final differentially private clustering result is

[[0.5893953 0.4049542 0.27335089]
 [0.59260889 0.40021986 0.27148743]
 [0.70471501 0.790935 0.75081869]]
Step 6: Assess the clustering performance. Since reference classes are provided with the selected data set, F-measure is used to assess clustering performance. The range of the F-measure is [0, 1], and a larger value means the algorithm has better clustering utility.
Here the differentially private clustering algorithm proposed by the present invention is compared with the DPk-means and DPk-means++ differentially private clustering algorithms. For each value of ε, each of the three differentially private clustering algorithms is run 50 times on the data set, and the average of the corresponding F-measure results is taken, as shown in Fig. 1 (where the red line is the result of the algorithm provided by the invention).
As can be seen from the figure, at the same privacy level ε, the F-measure of the differentially private clustering algorithm proposed by the invention is considerably higher than that of the other two algorithms, which shows that at the same privacy protection level the invention obtains higher clustering utility; moreover, the larger the privacy budget, the higher the clustering utility, but the lower the privacy level.
In conclusion the present invention proposes a kind of k-means clustering methods towards difference privacy, the program is with k-
Then the result of means++ algorithms passes through alternately a series of non local " jumps " k- traditional with execution as input value
Means algorithms improve the selection of initial center point, and utilize difference secret protection Laplace mechanism, in k means clustering algorithms
Iterative process in increase and meet the random noise appropriate of specific distribution so that cluster result is distorted to a certain extent, is reached
To the purpose of secret protection, while it ensure that the availability of data.
The above are only some embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.