A k-means clustering method for differential privacy protection
Technical field
The present invention relates to privacy protection and clustering methods, and in particular to a k-means clustering method for differential privacy protection, and belongs to the field of information security technology.
Background art
With the rapid development of cloud computing and big data, data mining technology has made significant progress in both research and application. As one of the important methods of data mining, clustering algorithms can uncover implicit, previously unknown knowledge and rules, and have important potential value for operational decision making over large volumes of related data. At the same time, however, the disclosure of sensitive information in bulk data exposes users to threats and losses that are hard to estimate. How to protect data privacy during cluster analysis has therefore become a hot issue in the fields of data mining and data privacy protection. With the proposal and development of privacy protection techniques, differential privacy has become a popular privacy protection technology. Differential privacy is realized through a noise mechanism: random noise is added to the output result to protect data security. The larger the added noise, the safer the data, but the lower its utility, and vice versa.
As one of the most common clustering methods, the k-means algorithm is simple to implement and clusters quickly. However, traditional differentially private k-means algorithms (such as differentially private k-means and differentially private k-means++) are sensitive to the choice of initial center points, and the selection of the number of clusters k is somewhat blind, which reduces the utility of the clustering result.
Summary of the invention
To address the shortcomings of the background art described above, the present invention proposes a k-means clustering method for differential privacy protection, in which appropriate random noise following a specific distribution is added in the iterative process of the k-means clustering algorithm, so that the clustering result is distorted to a certain extent, achieving privacy protection while preserving the utility of the data. The method is simple, easy to operate, and places no restrictions on the size or attributes of the data set.
The method of the present invention takes the result of running the k-means++ algorithm on the data set as its input, and then obtains an optimized set of initial cluster centers by alternating a series of non-local "jumps" with runs of the traditional k-means algorithm; using this center set, it then carries out the iterative clustering process with noise added to the center points. The differential privacy technique used in the present invention defines a strict adversary model, gives a rigorous mathematical proof and a quantitative expression of the privacy risk, and at the same time achieves a good balance between the utility of the k-means clustering result and the level of privacy protection.
The k-means clustering method for differential privacy protection of the present invention comprises the following steps:
Step 1: Preprocess the sample data;
Step 2: Let C denote the set of cluster centers after clustering, and let φ(C, X) denote the error sum of squares of the given sample data set X under the center set C, where x denotes a data point in X and c denotes a center in C:

φ(C, X) = ∑_{x∈X} min_{c∈C} ||x − c||²  (2)

Let retry denote the number of retries, retry_max the maximum number of retries, φ_best the best (smallest) error sum of squares found so far, and C_best the corresponding center set. Run the k-means++ algorithm on the data set X and store the resulting error sum of squares φ(C, X) in φ_best and the resulting cluster center set C in C_best; set retry_max = m, m ∈ {0, 1, 2, …}, and initialize retry = 0;
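For illustration only (not part of the claimed method), the error sum of squares of formula (2) can be sketched in Python with NumPy; the function name `sse` and the toy data are assumptions:

```python
import numpy as np

def sse(C, X):
    """Error sum of squares phi(C, X): for each sample x in X, the squared
    Euclidean distance to its nearest center in C, summed over all samples."""
    # Pairwise squared distances, shape (n_samples, n_centers)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 0.0]])
print(sse(C, X))  # nearest-center squared distances: 0 + 1 + 0, prints 1.0
```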
Step 3: While retry ≤ retry_max: let λ denote the most "useless" center, and let C_i denote the centroid of cluster i, where

C_i = (1/|X_i|) ∑_{x∈X_i} x,

X_i being the set of points assigned to cluster i. Let μ denote the center whose cluster has the largest intra-cluster sum of squared distances, C_μ the centroid of cluster μ, and d_μ the average distance of cluster μ, where

d_μ = (1/|X_μ|) ∑_{x∈X_μ} ||x − C_μ||.

Let o denote a small random offset, u a random vector on the d-dimensional unit hypersphere, and ∈ the offset scale factor, where o = ∈·d_μ·u; then set λ = μ + o and μ = μ − o;
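A minimal sketch of one non-local "jump" in Python. Interpreting the most "useless" center as the one whose cluster contributes least to the error sum of squares is an assumption of this illustration, as are all names:

```python
import numpy as np

def nonlocal_jump(C, X, eps=0.01, rng=None):
    """One non-local 'jump': move the most 'useless' center next to the center
    of the most spread-out cluster, so that cluster is split in two."""
    rng = np.random.default_rng() if rng is None else rng
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    per_cluster_sse = np.array([d2[labels == j, j].sum() for j in range(len(C))])
    lam = per_cluster_sse.argmin()   # most "useless" center (assumed reading)
    mu = per_cluster_sse.argmax()    # cluster with largest intra-cluster SSE
    members = X[labels == mu]
    d_mu = np.linalg.norm(members - C[mu], axis=1).mean()  # average distance
    u = rng.normal(size=C.shape[1])
    u /= np.linalg.norm(u)           # random unit vector on the d-sphere
    o = eps * d_mu * u               # small random offset o = eps * d_mu * u
    C = C.copy()
    C[lam] = C[mu] + o               # lambda = mu + o
    C[mu] = C[mu] - o                # mu = mu - o
    return C
```

With offset scale 0.01 the two new centers straddle the old centroid of the widest cluster at a distance of 0.01·d_μ on each side.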
Step 4: Using the center set C obtained in step 3 as the initial centers, run the traditional k-means algorithm and compare φ(C, X) with φ_best. If φ(C, X) is less than φ_best, set φ_best = φ(C, X), C_best = C and retry = 0; otherwise exit the current inner loop, set retry = retry + 1, and reset C = C_best;
Step 5: Repeat steps 3 and 4 until retry exceeds the given maximum number of retries retry_max, then return the optimal center set C_best;
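The alternation of steps 3-5 can be sketched as a retry loop; `kmeans_fn`, `sse_fn` and `jump_fn` are hypothetical callables standing in for the traditional k-means run, formula (2), and the jump of step 3:

```python
import numpy as np

def refine_centers(X, C0, kmeans_fn, sse_fn, jump_fn, retry_max=5):
    """Steps 3-5: alternate non-local jumps with standard k-means runs, keep
    the best center set seen, stop after retry_max failed attempts in a row."""
    C_best, phi_best, retry = C0, sse_fn(C0, X), 0
    while retry <= retry_max:
        C = jump_fn(C_best, X)   # step 3: non-local "jump" from the best centers
        C = kmeans_fn(X, C)      # step 4: run plain k-means from jumped centers
        phi = sse_fn(C, X)
        if phi < phi_best:       # improvement: keep it, reset the retry count
            phi_best, C_best, retry = phi, C, 0
        else:                    # no improvement: count a failed retry
            retry += 1
    return C_best, phi_best      # step 5: best centers found
```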
Step 6: Traverse every point in the data set X, compute its distance to each of the centers, assign it to the nearest center, and thereby partition X into k clusters;
Step 7: Set the random noise to be added:
The random noise is Laplace noise, i.e. the noise follows the Laplace distribution Lap(b), where b = Δf/ε, Δf is the global sensitivity, and ε is the privacy budget. The Laplace distribution with location parameter 0 and scale parameter b, written Lap(b), has probability density function

p(η) = (1/(2b)) exp(−|η|/b),

where η denotes the random variable;
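Sampling the Lap(b) noise can be sketched with NumPy's built-in Laplace sampler; the function name and parameters are illustrative assumptions:

```python
import numpy as np

def laplace_noise(delta_f, epsilon, size=None, rng=None):
    """Draw Lap(b) noise with scale b = delta_f / epsilon; the density is
    p(eta) = exp(-|eta| / b) / (2 * b), with location parameter 0."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.laplace(loc=0.0, scale=delta_f / epsilon, size=size)
```

For a Laplace variable with scale b, the mean is 0 and the mean absolute value is b, which can be checked empirically.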
Step 8: Recompute, for each cluster, the sum of its data points and the number of its points; add the noise Lap(b) to obtain sum′ = sum + Lap(b) and num′ = num + Lap(b); finally update the centroid of the cluster to sum′/num′;
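Step 8 can be sketched as follows (names are assumptions; with a tiny scale b the noisy centroids approach the exact ones):

```python
import numpy as np

def noisy_update(X, labels, k, b, rng=None):
    """Step 8: for each cluster, add Lap(b) noise to the coordinate sums and
    to the point count, then update the centroid to sum' / num'."""
    rng = np.random.default_rng() if rng is None else rng
    centers = np.empty((k, X.shape[1]))
    for j in range(k):
        pts = X[labels == j]
        noisy_sum = pts.sum(axis=0) + rng.laplace(0.0, b, size=X.shape[1])  # sum'
        noisy_num = pts.shape[0] + rng.laplace(0.0, b)                      # num'
        centers[j] = noisy_sum / noisy_num                                  # sum'/num'
    return centers
```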
Step 9: Repeat steps 7 and 8 until the error sum of squares converges or the number of iterations reaches its upper limit; the smaller the error sum of squares, the more compact and well-separated the clusters.
In step 1, the data are preprocessed as follows:
Let the sample data set be X, with sample space dimension d and number of samples n. Determine the proportional relationships among the attributes of the samples. Based on the maximum value Max and the minimum value Min of the raw data, standardize the data by normalization: each record of the data is a d-dimensional vector, and every dimension must be scaled into the space [0, 1]^d, as shown in formula (1):

y(l) = (f(l) − Min) / (Max − Min)  (1)

where Min and Max denote respectively the minimum and maximum values of dimension l, f(l) is the data of dimension l, and y(l) is the scaled data of dimension l.
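Formula (1) is ordinary min-max normalization and can be sketched as follows (the per-dimension Min/Max convention follows the text; the function name is an assumption):

```python
import numpy as np

def minmax_scale(X):
    """Formula (1): per dimension l, y(l) = (f(l) - Min) / (Max - Min),
    mapping every dimension of the data into [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)  # per-dimension Min and Max
    return (X - mn) / (mx - mn)
```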
In step 3, the offset scale factor ∈ is taken as 0.01.
In step 6, dist(x, y) denotes the distance between points x and y, x_i denotes the value of the i-th dimension of point x, y_i the value of the i-th dimension of point y, and dim the dimension of the points. The distance between two points is computed with the Euclidean distance formula, as shown in formula (3):

dist(x, y) = sqrt( ∑_{i=1}^{dim} (x_i − y_i)² )  (3)
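Formula (3) can be sketched directly:

```python
import math

def dist(x, y):
    """Formula (3): Euclidean distance between two points of dimension dim."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```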
In step 7, different data sets require different numbers of iterations of the clustering algorithm to reach the convergence condition:
(a) If the number of iterations N is fixed, the privacy budget consumed by each iteration is ε/N, and noise of size Lap((d+1)N/ε) can be added each time to obtain ε-differential privacy;
(b) If the number of iterations N is unknown, the value of the privacy budget ε must be adjusted continually during the iterations. Early iterations influence the clustering result more than later ones, so the privacy budget allocated per iteration is gradually decreased during clustering: the budget of the i-th allocation is

ε_i = ε/2^i  (4)

so the first allocation is ε/2 with noise of size Lap(2(d+1)/ε), and each later iteration consumes half the budget of the previous one, until the last iteration completes.
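The halving schedule of case (b) can be sketched as follows; spending the whole remainder on the last iteration, so that the allocations sum exactly to ε, is an assumption of this sketch:

```python
def budget_schedule(epsilon, iterations):
    """Case (b): the first allocation is epsilon/2, each later iteration
    consumes half of the previous budget, and the last iteration takes the
    remainder so the total spent equals epsilon (assumed convention)."""
    budgets, remaining = [], epsilon
    for t in range(iterations):
        eps_t = remaining if t == iterations - 1 else remaining / 2.0
        budgets.append(eps_t)
        remaining -= eps_t
    return budgets
```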
The present invention has the following beneficial effects:
To ensure the security of the k-means clustering algorithm, appropriate noise is added to the center points of the k-means algorithm, a clustering algorithm based on differential privacy is designed, and the algorithm is proved to satisfy the differential privacy condition. Compared with existing differentially private k-means algorithms, the method of the invention takes the result of the k-means++ algorithm as input and then alternates a series of non-local "jumps" with runs of the traditional k-means algorithm, which improves the selection of initial centers. This effectively avoids the blindness of choosing k and the sensitivity to initial points, and reduces the number of iterations, thereby improving the utility of the clustering while protecting privacy.
Description of the drawings
Fig. 1 is a schematic diagram of the experimental data used to test the performance of the differentially private k-means clustering algorithms provided by the invention;
Fig. 2 is the workflow diagram of the k-means clustering method for differential privacy protection provided by the invention.
Detailed description of the embodiments
The implementation of the technical solution of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that these examples are only illustrative of the invention and are not intended to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims.
The k-means clustering method for differential privacy of the present invention takes the result of the k-means++ algorithm as input, then improves the selection of initial centers by alternating a series of non-local "jumps" with runs of the traditional k-means algorithm, and uses the Laplace mechanism of differential privacy to add, in the iterative process of the k-means clustering algorithm, appropriate random noise following a specific distribution, so that the clustering result is distorted to a certain extent, achieving privacy protection while preserving the utility of the data. The method is simple and easy to operate, is theoretically proved to satisfy the ε-differential privacy condition, effectively avoids the blindness of choosing k and the sensitivity to initial points, and reduces the number of iterations, thereby improving the utility of the clustering while protecting privacy. It is applicable to data publication and privacy protection for data sets of different scales and dimensions.
Referring to Fig. 2, a specific embodiment is as follows:
Step 1: A sample data set housec8.txt is collected, storing the three color values of house colors; the number of samples is 34112 and the number of attributes is 3, giving the sample set X = {x1, x2, …, x34112}. Every dimension of the data is scaled to the interval [0, 1] with formula (1). Twenty rows of data from the scaled data set are taken, as follows:

x1 = [0. 0.08130081 0.00473934]
x50 = [0.00961538 0.0203252 0.00947867]
x102 = [0.02403846 0.01626016 0.03317536]
x155 = [0.02403846 0.0203252 0.01895735]
x250 = [0.03365385 0.06910569 0.00473934]
x350 = [0.03365385 0.11788618 0.01895735]
x1000 = [0.04326923 0.03658537 0.07109005]
x3020 = [0.04326923 0.10569106 0.02369668]
x5030 = [0.04807692 0.03658537 0.02843602]
x6000 = [0.05288462 0.06097561 0.04265403]
x9843 = [0.05288462 0.08130081 0.04265403]
x10345 = [0.05288462 0.09349593 0.01895735]
x18546 = [0.05769231 0.01219512 0.04265403]
x20345 = [0.05769231 0.05284553 0.04739336]
x24675 = [0.05769231 0.06097561 0.07582938]
x26546 = [0.05769231 0.06910569 0.02843602]
x29654 = [0.0625 0.06097561 0.10900474]
x30000 = [0.0625 0.06910569 0.07582938]
Step 2: The k-means++ algorithm is run on data set X, giving φ_best = 0.024421323538 and

C_best = [[0.33290311 0.25585707 0.15738572]
          [0.61027347 0.44192056 0.28916641]
          [0.70476998 0.79096867 0.75084066]]

Step 3: Starting from the C_best obtained in step 2, a better set of initial centers is selected according to steps 3-5 of the technical solution; the result is

[[0.59984837 0.42572074 0.28687944]
 [0.70476998 0.79096867 0.75084066]
 [0.59510802 0.42289839 0.28504289]]
Step 4: Set the random noise to be added. The total differential privacy budget is taken as ε ∈ [0.1, 1]. Since the number of attributes of the experimental data set is d = 3 and the number of iterations is unknown, it follows from formula (4) that the budget of the first allocation in the iterative process is ε/2 with noise size Lap(8/ε), the budget of the second allocation is ε/4 with noise size Lap(16/ε), and each later iteration consumes half the budget of the previous one, until the last iteration completes.
Step 5: Using the noise magnitude set in step 4, noise is added to the sum of the data points and to the point count of each cluster, and the centroids are updated. In the first iteration, the sums of the 3 attributes of the points in the 3 clusters are computed, and the numbers of points in the 3 clusters are

Num = [23101 3600 7411]

From step 4, the noise added in the first iteration is Lap(8/ε), so the new centroids sum′/num′ after the first update are

[[0.6381381 0.53917325 0.41654519]
 [0.24548196 0.1825142 0.10771185]
 [0.39675444 0.31281772 0.20345833]]

The concrete results of the subsequent iterations are not detailed here; the experiment finally converges after the 40th iteration, and the center set of the final differentially private clustering result is

[[0.5893953 0.4049542 0.27335089]
 [0.59260889 0.40021986 0.27148743]
 [0.70471501 0.790935 0.75081869]]
Step 6: Assess the clustering performance. Since reference classes are provided with the selected data set, F-measure is used to assess clustering performance. The range of the F-measure is [0, 1], and a larger value means the algorithm has better clustering utility.
Here the differentially private clustering algorithm proposed by the present invention is compared with the DPk-means and DPk-means++ differentially private clustering algorithms. For each value of ε, each of the three differentially private clustering algorithms is run 50 times on the data set, and the average of the corresponding F-measure results is taken, as shown in Fig. 1 (where the red line is the result of the algorithm provided by the invention).
As can be seen from the figure, at the same privacy level ε, the F-measure of the differentially private clustering algorithm proposed by the invention is considerably higher than that of the other two algorithms, which shows that at the same privacy protection level the invention obtains higher clustering utility; moreover, the larger the privacy budget, the higher the clustering utility, but the lower the privacy level.
In conclusion the present invention proposes a kind of k-means clustering methods towards difference privacy, the program is with k-
Then the result of means++ algorithms passes through alternately a series of non local " jumps " k- traditional with execution as input value
Means algorithms improve the selection of initial center point, and utilize difference secret protection Laplace mechanism, in k means clustering algorithms
Iterative process in increase and meet the random noise appropriate of specific distribution so that cluster result is distorted to a certain extent, is reached
To the purpose of secret protection, while it ensure that the availability of data.
The above are only some embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.