CN107330458A

CN107330458A - Fuzzy C-means clustering method with minimum-variance optimized initial cluster centers

Info

Publication number: CN107330458A
Application number: CN201710503214.4A
Authority: CN
Inventors: 李学刚; 狄岚; 李斌; 李通明
Original assignee: Changzhou College of Information Technology CCIT
Current assignee: Changzhou College of Information Technology CCIT
Priority date: 2017-06-27
Filing date: 2017-06-27
Publication date: 2017-11-07

Abstract

The invention discloses a fuzzy C-means clustering method with minimum-variance optimized initial cluster centers, belonging to the technical field of data mining and pattern recognition. The method comprises the following steps: inputting a data set and clustering according to the distance relations between sample points; clustering the target data set with the clustering method to obtain cluster labels; and evaluating the cluster labels obtained after clustering against the original labels according to evaluation indexes. The invention aims to solve the problem that the clustering result of fuzzy C-means (FCM) is strongly affected by its initial cluster centers, so that an optimal solution cannot be guaranteed. The initial cluster centers are selected before the FCM algorithm is run: the variance of each sample is used as heuristic information and, using the neighborhood radius of the samples, the K minimum-variance sample points located in different regions are chosen as the initial cluster centers. The algorithm requires no parameters to be set.

Description

Fuzzy C-means clustering method with minimum-variance optimized initial cluster centers

Technical field

The present invention relates to a clustering method for data sets, and more particularly to a fuzzy C-means clustering method with minimum-variance optimized initial cluster centers, belonging to the technical field of data mining and pattern recognition.

Background art

Traditional FCM algorithms select the cluster centers at random, which easily makes the clustering result unstable and may even cause the cluster centers to converge to a local extremum. To solve these problems, the initial cluster centers can be optimized by minimum variance according to the compactness of the sample distribution. The initialization algorithm computes the variance of each sample from the spatial distribution of the samples to obtain compactness information, and selects the minimum-variance sample point, together with the sample points within a certain range of it, as an initial cluster center, yielding an improved fuzzy clustering algorithm.

FCM uses an iterative descent algorithm, which is a local search procedure and is sensitive to the initial cluster centers, so the final result is not necessarily the globally optimal partition. If good cluster centers can be chosen and the samples are assigned to the initial cluster centers by the nearest-neighbor rule to produce an initial clustering, the clustering result can reach the global optimum. Therefore, based on the principle that the sample at the center of each cluster has minimum variance, an FCM clustering algorithm whose initial cluster centers are optimized by the minimum variance of the samples is proposed.

Summary of the invention

The main object of the present invention is to provide a fuzzy C-means clustering method with minimum-variance optimized initial cluster centers, which solves the problem that an optimal solution cannot be obtained because the initial cluster centers are uncertain.

The object of the present invention can be achieved by the following technical solution:

A fuzzy C-means clustering method with minimum-variance optimized initial cluster centers, comprising the following steps:

Step S1: Input a data set and cluster according to the distance relations between sample points;

Step S2: Cluster the target data set with the clustering method to obtain cluster labels;

Step S3: Evaluate the cluster labels obtained after clustering against the original labels according to evaluation indexes.

Further, in step S1, synthetic (artificially simulated) data sets and UCI data sets are used as the input data sets, and the number of cluster classes is determined by the synthetic data set or UCI data set used.

Further, in step S2, the target data set is clustered and cluster labels are set for the target data set and pixels. Setting the cluster labels includes:

Step S21: Set labels according to the actual positions of the samples in the target data set, and set the number of labels for the synthetic data sets and the UCI data sets;

Step S22: Apply the FCM algorithm to the data set composed of the labeled data to obtain the membership matrix U and the cluster centers V after clustering.

Further, step S22 specifically comprises the following steps:

Step S221: First determine the number of cluster classes c;

Step S222: Set the maximum number of iterations Maxt and the maximum error threshold ε;

Step S223: Set the membership matrix U and the cluster centers V obtained by FCM clustering as the initial memberships and cluster centers of the FCM algorithm, and set the initial iteration count t = 1;

Step S224: Update the membership matrix and the cluster center matrix with the iterative optimization formulas.

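Further, in step S224, the iterative optimization formulas are:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c}\left(\frac{d_{ij}(x_j, v_i)}{d_{kj}(x_j, v_k)}\right)^{2/(m-1)}}$$

$$v_i = \frac{\sum_{j=1}^{n}(u_{ij})^{m}\, x_j}{\sum_{j=1}^{n}(u_{ij})^{m}}$$

where U = (u_ij) is the membership matrix, d is the distance between a sample and a cluster center, v is a cluster center, m is the fuzzy exponent, and x is a sample;

The method terminates when t reaches the maximum number of iterations Maxt or when ||U^(t+1) − U^(t)||_Frobenius < ε; the U and V obtained at that point are the optimal solution of the method.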
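The two update formulas of step S224 can be illustrated with a minimal NumPy sketch. The function names `update_membership` and `update_centers`, the Euclidean distance, and the small constant guarding against a zero distance are assumptions made for illustration; they are not specified in the text.

```python
import numpy as np

def update_membership(X, V, m=2.0, tiny=1e-12):
    """Membership update; returns U of shape (c, n), u_ij for cluster i, sample j.

    X: (n, dim) samples, V: (c, dim) cluster centers, m > 1: fuzzy exponent.
    Euclidean distance is an assumption of this sketch.
    """
    # d[i, j] = distance between center v_i and sample x_j
    d = np.linalg.norm(V[:, None, :] - X[None, :, :], axis=2)
    d = np.maximum(d, tiny)  # guard against division by zero (assumption)
    # ratio[i, k, j] = d[i, j] / d[k, j]
    ratio = d[:, None, :] / d[None, :, :]
    return 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=1)

def update_centers(X, U, m=2.0):
    """Center update: v_i is the mean of the samples weighted by u_ij^m."""
    Um = U ** m
    return (Um @ X) / Um.sum(axis=1, keepdims=True)

def membership_change(U_new, U_old):
    """Frobenius norm ||U^(t+1) - U^(t)|| used in the termination test."""
    return np.linalg.norm(U_new - U_old)
```

Each column of U sums to 1, so every sample distributes a total membership of 1 over the c clusters.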

Further, obtaining the cluster centers V comprises the following steps:

Step S2231: Calculate the variance of each sample x_i in the sample set, find the minimum-variance sample in data set W, and set it as the initial cluster center v₁ of the first cluster; calculate r_m, half of the root-mean-square distance of the data set samples, and let:

c = 1,

W = W − W₁;

Step S2232: If c < K, let c = c + 1, find the minimum-variance sample in data set W, set it as the initial cluster center v_c of the c-th cluster, and let:

$$W_c = \{\, x_j \mid d(x_j, x_i^{1}) < \mathit{cmean},\ j = 1, 2, \ldots, n,\ x_j \notin W_r,\ r = 1, 2, \ldots, c-1 \,\},$$

W = W − W_c,

Otherwise, the K initial cluster centers V⁰ = [v₁, v₂, …, v_K] have been found.
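The selection of initial centers in steps S2231 and S2232 can be sketched as follows. The text does not give the formula for the variance of a sample or for the radius, so this sketch assumes that the variance of x_i is the variance of its distances to all samples, and that the neighborhood radius is half of the root-mean-square pairwise distance; the function name `min_variance_init` is hypothetical.

```python
import numpy as np

def min_variance_init(X, K):
    """Pick up to K initial centers: repeatedly take the minimum-variance
    sample left in W, then remove its neighborhood W_c from W."""
    n = X.shape[0]
    # Pairwise Euclidean distances, shape (n, n)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Assumed definition: variance of the distances from x_i to all samples
    var = D.var(axis=1)
    # Assumed radius: half of the RMS distance over all distinct pairs
    iu = np.triu_indices(n, k=1)
    r = 0.5 * np.sqrt(np.mean(D[iu] ** 2))

    remaining = np.ones(n, dtype=bool)  # membership mask of the working set W
    chosen = []
    while len(chosen) < K and remaining.any():
        idx = np.flatnonzero(remaining)
        best = idx[np.argmin(var[idx])]  # minimum-variance sample in W
        chosen.append(best)
        remaining &= D[best] >= r        # W = W - W_c
    # Fewer than K centers are returned if W runs out of samples first
    return X[chosen]
```

Removing the whole neighborhood after each pick is what keeps the chosen samples in different regions of the data.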

Further, the FCM algorithm comprises the following steps:

Step S2233: Set the fuzzy exponent m (1 ≤ m); take the K initial cluster centers V⁰ = [v₁, v₂, …, v_K] initialized in step S2231; set the convergence precision ε > 0; set the maximum number of iterations t_max; let the iteration count k = 0;

Step S2234: Calculate U^(k+1);

Step S2235: Calculate V^(k+1);

Step S2236: If ||V^(k) − V^(k+1)|| ≤ ε, stop iterating; otherwise, let k = k + 1 and return to step S2234;

Step S2237: When the algorithm terminates, the resulting memberships U and cluster centers V are the optimal clustering solution.
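The loop of steps S2233 to S2237 can be sketched as below, starting from given initial centers and stopping on the shift of the centers. The function name `fcm_from_centers`, the default parameter values, the Euclidean distance, and the conversion of memberships to hard labels by taking the largest membership are assumptions of this sketch; the membership update is written in an algebraically equivalent normalized form.

```python
import numpy as np

def fcm_from_centers(X, V0, m=2.0, eps=1e-5, t_max=100):
    """Run FCM from initial centers V0; assumes m > 1 and t_max >= 1."""
    V = np.asarray(V0, dtype=float)
    p = 2.0 / (m - 1.0)
    for _ in range(t_max):
        # Distances d[i, j] between center v_i and sample x_j
        d = np.linalg.norm(V[:, None, :] - X[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # guard against zero distance (assumption)
        # U^(k+1): u_ij = d_ij^(-p) / sum_k d_kj^(-p), same as the ratio form
        inv = d ** (-p)
        U = inv / inv.sum(axis=0, keepdims=True)
        # V^(k+1): means of the samples weighted by u_ij^m
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
        shift = np.linalg.norm(V - V_new)
        V = V_new
        if shift <= eps:  # ||V^(k) - V^(k+1)|| <= eps
            break
    labels = U.argmax(axis=0)  # hard cluster labels (assumption)
    return U, V, labels
```

The hard labels returned here are the kind of cluster labels that the evaluation of step S3 compares with the original labels.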

Further, the labels obtained after clustering are evaluated against the original labels according to evaluation indexes; the performance evaluation indexes include the NMI index and the Rand index (RandIndex).

Further, the NMI evaluation index is:

$$NMI = \frac{\sum_{i=1}^{c}\sum_{j=1}^{c} N_{i,j}\,\log\frac{N \times N_{i,j}}{N_i \times N_j}}{\sqrt{\left(\sum_{i=1}^{c} N_i \log\frac{N_i}{N}\right) \times \left(\sum_{j=1}^{c} N_j \log\frac{N_j}{N}\right)}}$$

where: N_{i,j} represents the agreement between the i-th cluster and class j;

N represents the sample size;

N_i represents the number of samples in the i-th cluster;

N_j represents the number of samples in the j-th class.
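As an illustration of the NMI index, the following sketch computes it from two label vectors. It assumes that N_{i,j} is the number of samples shared by cluster i and class j (the contingency count); the function name `nmi` and the convention of returning 0 when either labeling has a single group are assumptions of this sketch.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information from a contingency table."""
    t = np.asarray(labels_true).ravel()
    p = np.asarray(labels_pred).ravel()
    N = t.size
    classes, ti = np.unique(t, return_inverse=True)
    clusters, pi = np.unique(p, return_inverse=True)
    # C[i, j] = number of samples in cluster i and class j (assumed N_ij)
    C = np.zeros((clusters.size, classes.size))
    np.add.at(C, (pi, ti), 1)
    Ni = C.sum(axis=1)  # cluster sizes
    Nj = C.sum(axis=0)  # class sizes
    nz = C > 0          # empty cells contribute 0 to the numerator
    num = np.sum(C[nz] * np.log(N * C[nz] / np.outer(Ni, Nj)[nz]))
    den = np.sqrt(np.sum(Ni * np.log(Ni / N)) * np.sum(Nj * np.log(Nj / N)))
    return num / den if den > 0 else 0.0
```

The value does not depend on how the clusters are numbered, so a clustering that matches the original classes under any relabeling scores 1.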

Further, the RandIndex evaluation index is:

where: f₀₀ represents the number of pairs of data points that have different class labels and belong to different clusters;

f₁₁ represents the number of pairs of data points that have the same class label and belong to the same cluster;

N represents the sample size.
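The RandIndex formula itself does not appear in the text above. A minimal sketch, assuming the standard Rand index, that is, (f₀₀ + f₁₁) divided by the number of sample pairs N(N − 1)/2, is given below; the function name `rand_index` is hypothetical.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Standard Rand index (assumed definition): (f00 + f11) / (N choose 2)."""
    n = len(labels_true)
    f00 = 0  # pairs with different class labels placed in different clusters
    f11 = 0  # pairs with the same class label placed in the same cluster
    for a, b in combinations(range(n), 2):
        same_class = labels_true[a] == labels_true[b]
        same_cluster = labels_pred[a] == labels_pred[b]
        if same_class and same_cluster:
            f11 += 1
        elif not same_class and not same_cluster:
            f00 += 1
    return (f00 + f11) / (n * (n - 1) / 2)
```

The pairwise loop is quadratic in the number of samples, which is adequate for small benchmark data sets.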

Advantageous effects of the present invention: the fuzzy C-means clustering method with minimum-variance optimized initial cluster centers provided by the present invention aims to solve the problem that the clustering result of fuzzy C-means is strongly affected by its initial cluster centers, so that an optimal solution cannot be guaranteed. The present invention selects the initial cluster centers before the FCM algorithm is run and thereby proposes a new C-means clustering method with minimum-variance optimized initial cluster centers. The FCM initial cluster centers are chosen using the variance of each sample as heuristic information: using the neighborhood radius of the samples, the K minimum-variance sample points located in different regions are chosen as the initial cluster centers. The algorithm requires no parameters to be set.

Brief description of the drawings

Fig. 1 is a schematic flow chart of a preferred embodiment of the fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to the present invention.

Embodiment

To make the technical solution of the present invention clearer to those skilled in the art, the present invention is described in further detail below with reference to an embodiment and the accompanying drawing, but the implementation of the present invention is not limited thereto.

As shown in Fig. 1, the fuzzy C-means clustering method with minimum-variance optimized initial cluster centers provided by this embodiment comprises the following steps:

Further, in this embodiment, in step S1, synthetic (artificially simulated) data sets and UCI data sets are used as the input data sets, and the number of cluster classes is determined by the synthetic data set or UCI data set used.

Further, in this embodiment, in step S2, the target data set is clustered and cluster labels are set for the target data set and pixels. Setting the cluster labels includes:

Further, in this embodiment, step S22 specifically comprises the following steps:

Step S221: First determine the number of cluster classes c;

Step S222: Set the maximum number of iterations Maxt and the maximum error threshold ε;

Further, in this embodiment, in step S224, the iterative optimization formula is:

Further, in this embodiment, obtaining the cluster centers V comprises the following steps:

c = 1,

W = W − W₁;

W = W − W_c,

Further, in this embodiment, the FCM algorithm comprises the following steps:

Step S2234: Calculate U^(k+1);

Step S2235: Calculate V^(k+1);

Further, in this embodiment, the NMI evaluation index is:

N represents the sample size;

N_i represents the number of samples in the i-th cluster;

N_j represents the number of samples in the j-th class.

Further, in this embodiment, the RandIndex evaluation index is:

N represents the sample size.

Further, in this embodiment, to verify the validity of the C-means clustering method with minimum-variance optimized initial cluster centers proposed in this embodiment, the experiments are divided into three parts, using noise-free simulated data sets, data sets with noise and outliers, and real UCI data sets, respectively. The results of the method of the present invention (the kernel possibilistic C-means clustering method with large center margin) are compared with those of the fuzzy c-means clustering algorithm, the possibilistic c-means (PCM) clustering algorithm, the kernel-based fuzzy C-means clustering algorithm, and the kernel-based possibilistic C-means clustering algorithm. The comparison indicates that the present invention improves both the clustering of data sets with fuzzy boundaries and the robustness to noise.

In summary, the fuzzy C-means clustering method with minimum-variance optimized initial cluster centers provided by this embodiment aims to solve the problem that the clustering result of fuzzy C-means is strongly affected by its initial cluster centers, so that an optimal solution cannot be guaranteed. The present invention selects the initial cluster centers before the FCM algorithm is run and thereby proposes a new C-means clustering method with minimum-variance optimized initial cluster centers. The FCM initial cluster centers are chosen using the variance of each sample as heuristic information: using the neighborhood radius of the samples, the K minimum-variance sample points located in different regions are chosen as the initial cluster centers. The algorithm requires no parameters to be set.

The above is only a further embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the scope disclosed by the present invention and according to the technical solution and concept of the present invention, falls within the scope of protection of the present invention.

Claims

1. A fuzzy C-means clustering method with minimum-variance optimized initial cluster centers, characterized by comprising the following steps:

2. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 1, characterized in that, in step S1, synthetic (artificially simulated) data sets and UCI data sets are used as the input data sets, and the number of cluster classes is determined by the synthetic data set or UCI data set used.

3. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 2, characterized in that, in step S2, the target data set is clustered and cluster labels are set for the target data set and pixels, and setting the cluster labels includes:

Step S21: Set labels according to the actual positions of the samples in the target data set, and set the number of labels for the synthetic data sets and the UCI data sets;

Step S22: Apply the FCM algorithm to the data set composed of the labeled data to obtain the membership matrix U and the cluster centers V after clustering.

4. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 3, characterized in that step S22 specifically comprises the following steps:

Step S221: First determine the number of cluster classes c;

Step S222: Set the maximum number of iterations Maxt and the maximum error threshold ε;

Step S223: Set the membership matrix U and the cluster centers V obtained by FCM clustering as the initial memberships and cluster centers of the FCM algorithm, and set the initial iteration count t = 1;

5. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 4, characterized in that, in step S224, the iterative optimization formulas are:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c}\left(\frac{d_{ij}(x_j, v_i)}{d_{kj}(x_j, v_k)}\right)^{2/(m-1)}}$$

$$v_i = \frac{\sum_{j=1}^{n}(u_{ij})^{m}\, x_j}{\sum_{j=1}^{n}(u_{ij})^{m}}$$

The method terminates when t reaches the maximum number of iterations Maxt or when ||U^(t+1) − U^(t)||_Frobenius < ε; the U and V obtained at that point are the optimal solution of the method.

6. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 4, characterized in that obtaining the cluster centers V comprises the following steps:

Step S2231: Calculate the variance of each sample x_i in the sample set, find the minimum-variance sample in data set W, and set it as the initial cluster center v₁ of the first cluster; calculate r_m, half of the root-mean-square distance of the data set samples, and let:

c = 1,

W = W − W₁;

Step S2232: If c < K, let c = c + 1, find the minimum-variance sample in data set W, set it as the initial cluster center v_c of the c-th cluster, and let:

$$W_c = \{\, x_j \mid d(x_j, x_i^{1}) < \mathit{cmean},\ j = 1, 2, \ldots, n,\ x_j \notin W_r,\ r = 1, 2, \ldots, c-1 \,\},$$

W = W − W_c,

7. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 6, characterized in that the FCM algorithm comprises the following steps:

Step S2233: Set the fuzzy exponent m (1 ≤ m); take the K initial cluster centers V⁰ = [v₁, v₂, …, v_K] initialized in step S2231; set the convergence precision ε > 0; set the maximum number of iterations t_max; let the iteration count k = 0;

Step S2234: Calculate U^(k+1);

$$u_{ij} = \frac{1}{\sum_{k=1}^{c}\left(\frac{d_{ij}(x_j, v_i)}{d_{kj}(x_j, v_k)}\right)^{2/(m-1)}}$$

Step S2235: Calculate V^(k+1);

8. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 7, characterized in that the labels obtained after clustering are evaluated against the original labels according to evaluation indexes, the performance evaluation indexes including the NMI index and the Rand index (RandIndex).

9. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 8, characterized in that the NMI evaluation index is:

$$NMI = \frac{\sum_{i=1}^{c}\sum_{j=1}^{c} N_{i,j}\,\log\frac{N \times N_{i,j}}{N_i \times N_j}}{\sqrt{\left(\sum_{i=1}^{c} N_i \log\frac{N_i}{N}\right) \times \left(\sum_{j=1}^{c} N_j \log\frac{N_j}{N}\right)}}$$

N represents the sample size;

N_i represents the number of samples in the i-th cluster;

N_j represents the number of samples in the j-th class.

10. The fuzzy C-means clustering method with minimum-variance optimized initial cluster centers according to claim 8, characterized in that the RandIndex evaluation index is:

N represents the sample size.