CN103714154A - Method for determining optimum cluster number - Google Patents

Method for determining optimum cluster number

Info

Publication number
CN103714154A
Authority
CN
China
Prior art keywords
cluster, formula, data, sigma, class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310739837.3A
Other languages
Chinese (zh)
Inventor
周红芳
王啸
赵雪涵
段文聪
郭杰
张国荣
王心怡
何馨依
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN201310739837.3A
Publication of CN103714154A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a method for determining the optimum cluster number. The clustering quality of a data set is evaluated with a validity index Q(C); the cluster number at which Q(C) attains its minimum is the optimum cluster number. The method introduces a new similarity measure and, combined with hierarchical clustering, generates all candidate cluster partitions bottom-up, computing the validity index of each. From these values a cluster-quality curve over the different partitions is built, and the partition at the curve's extreme point is the optimum cluster partition. Repeated clustering of a large data set is thereby avoided, and the method does not depend on any specific clustering algorithm. Experimental results and theoretical analysis both show that the method performs well, is feasible, and greatly improves computational efficiency.

Description

Method for determining the optimum cluster number
Technical field
The invention belongs to the field of data mining technology and relates to a method for determining the optimum cluster number.
Background technology
Most existing approaches determine the optimum cluster number by iterative trial and error: a specific clustering algorithm is run on the given data set with different parameters (usually the cluster number k) to produce different partitions, a validity index is computed for each partition, and the cluster number whose index value satisfies a predetermined condition is taken as the optimum cluster number. In practice this trial-and-error process has several shortcomings. First, choosing the value of k is difficult for users who lack extensive experience in cluster analysis, which calls for a more principled way of finding a reasonable k. Second, although many cluster-validity indexes have been proposed, the main representatives being the V_xie index and the V_wsj index, each is built on a particular clustering algorithm, which greatly limits such methods in practical applications. Moreover, on large, high-dimensional data sets their computational efficiency is poor.
Summary of the invention
The object of the invention is to provide a method for determining the optimum cluster number that avoids the prior art's repeated clustering of large data sets and achieves higher computational efficiency.
The technical scheme of the invention is a method for determining the optimum cluster number in which a validity index Q(C) evaluates the clustering quality of a data set; the cluster number at which the cluster validity index Q(C) attains its minimum is the optimum cluster number.
Features of the invention are as follows.
Determination of the validity index: first compute the within-class compactness and the between-class separation, then express the validity index as a linear combination of the two. Specifically:
1. Suppose that for a data set DB one of its cluster partitions is C_k = {C_1, C_2, ..., C_k}. The within-class compactness of C_k, denoted Scat(C_k), is obtained from the sum of squared distances between every pair of data objects in the same class:

Scat(C_k) = \sum_{i=1}^{k} \sum_{X,Y \in C_i} ||X - Y||^2    (1)

Meanwhile, the between-class separation Sep(C_k) of C_k is obtained from the sum of squared distances between every pair of data objects in different classes:

Sep(C_k) = \sum_{i=1}^{k} \left( \sum_{j=1, j \neq i}^{k} \frac{1}{|C_i| \cdot |C_j|} \sum_{X \in C_i, Y \in C_j} ||X - Y||^2 \right)    (2)

In formulas (1) and (2), X and Y denote two data objects, and k denotes the number of clusters into which the data set DB is partitioned;
2. Substituting the Euclidean distance formula into formulas (1) and (2) and simplifying gives:

Scat(C_k) = 2 \sum_{i=1}^{k} \left( |C_i| \, SS_i - LS_i^2 \right)    (3)

Sep(C_k) = 2 \left( (k-1) \sum_{i=1}^{k} \frac{SS_i}{|C_i|} - \left( \sum_{i=1}^{k} \frac{LS_i}{|C_i|} \right)^2 + \sum_{i=1}^{k} \frac{LS_i^2}{|C_i|^2} \right)    (4)

where LS_i = \sum_{x_j \in C_i} x_j is the linear sum of the data objects in cluster C_i and SS_i = \sum_{x_j \in C_i} x_j^2 is their squared sum, k denotes the cluster number, x_j denotes a data object in cluster C_i, and |C_i| denotes the number of data objects in C_i;
3. Formulas (3) and (4) are combined linearly:

Q(C_k) = Scat(C_k) + \beta \cdot Sep(C_k)    (5)

where \beta is a combination parameter that balances the differing ranges of Scat(C_k) and Sep(C_k). Treating the cluster partition C of the data set DB as a variable with domain {C_1, C_2, ..., C_n}, the value of \beta is taken to be 1;
4. On a given data set DB, Scat(C_k) and Sep(C_k) have the same range. In the initial state, namely when the cluster number k equals n, formula (1) gives Scat(C_n) = 0, and we set:

Sep(C_n) = 2 \left( n \sum_{x \in DB} x^2 - \left( \sum_{x \in DB} x \right)^2 \right) = M    (6)

Since Scat(C_k) is a monotonically increasing function and Sep(C_k) a monotonically decreasing one, when the cluster number k is 1 we obtain Sep(C_1) = 0 and Scat(C_1) = M. The adopted validity index Q(C_k) can therefore be expressed as:

Q(C_k) = \frac{1}{M} \left( Scat(C_k) + Sep(C_k) \right)    (7)
The optimum cluster number is determined by applying an MDL-based pruning algorithm to eliminate the influence of noise points and outliers on the clustering result, finally obtaining the optimum cluster number. The MDL procedure is:

k_opt = \beta(C^*)    (9)

In formula (9), C^* denotes the cluster partition at which the cluster quality Q(C) attains its minimum, and k_opt denotes the resulting optimum cluster number.
The noise points and outliers are removed by post-processing the result with a pruning method based on MDL (minimum description length), as follows:

Let C^* = {C_1^*, C_2^*, ..., C_k^*}, and let |C_i^*| denote the number of data objects contained in C_i^*. First, sorting the clusters by |C_i^*| in descending order yields a new sequence C_1, C_2, ..., C_k. This sequence is then split at C_m (1 < m < k) into two parts, S_L(m) = {C_1, C_2, ..., C_m} and S_R(m) = {C_{m+1}, C_{m+2}, ..., C_k}, and the code length CL(m) of the data is computed. CL(m) is defined as:

CL(m) = \log_2(\mu_{S_L(m)}) + \sum_{1 \le j < m} \log_2\left( \big| |C_j| - \mu_{S_L(m)} \big| \right) + \log_2(\mu_{S_R(m)}) + \sum_{m+1 \le j \le k} \log_2\left( \big| |C_j| - \mu_{S_R(m)} \big| \right)    (10)

where \mu_{S_L(m)} and \mu_{S_R(m)} denote the average number of data objects per cluster in S_L(m) and S_R(m) respectively. The first and third terms of formula (10) are the average code lengths of the two sequences split at m; the remaining two terms measure the difference between each |C_j| and the average number of data objects. The data objects in S_R(m) are identified as noise points and rejected, and the optimum cluster number of the data set is finally obtained as k_opt = m.
The data set DB above comprises synthetic data sets and standard data sets.
The specific implementation process is as follows:
1. Compute the similarity of every pair of points in the data set DB, store the values in an array D, and sort the values in D in descending order.
2. For the current element of D, first check whether its two data objects have already been assigned to classes. If neither has, merge the two objects into a new class; if one of them already belongs to some class, merge the other object into that class; if they belong to two different classes, merge those two classes into one; if they already belong to the same class, abandon this merge. After each merge, compute the current value of the cluster validity index Q(C) by formula (7) and save it, together with the current cluster partition, in an array A; the cluster number of the data set becomes k = k - 1. Then take the next element of D and continue judging and computing until the cluster number of the data set reaches 1.
3. By formula (8), find the minimum cluster index value in array A and its corresponding cluster partition. Apply the procedure of formula (9) to this partition to reject the classes identified as being formed by noise points and outliers, finally obtaining the optimum cluster number k_opt.
The similarity measure is as follows: in a given d-dimensional data set DB, the similarity of any two data objects x_i and x_j is defined as:

s(x_i, x_j) = \sum_{k=1}^{d} \frac{1}{1 + |x_{ik} - x_{jk}|}    (11)

where x_{ik} and x_{jk} denote the k-th attributes of the two data objects. Their similarity coefficient is the sum of the similarity coefficients over the d attributes, and the similarity coefficient of each pair of attributes is the reciprocal of that pair's distance plus 1. Formula (11) maps the per-attribute similarity coefficients of two data objects into the interval (0, 1), under the assumption that every attribute of a data object has equal influence, which reduces the effect that differing attributes have on the similarity judgement. By formula (11), the larger the value of s(x_i, x_j), the more similar the data objects x_i and x_j are.
The invention has the following beneficial effects:
1. The invention proposes a new data similarity measure. Combined with hierarchical clustering, all possible cluster partitions are generated bottom-up and the validity index of each is computed; from these values a cluster-quality curve over the different partitions is built, and the partition at the curve's extreme point is the optimum cluster partition. This avoids repeatedly clustering a large data set, and the invention does not depend on any specific clustering algorithm. Experimental results and theoretical analysis both show that the invention has good performance and feasibility, and at the same time greatly improves computational efficiency.
2. The invention identifies the correct cluster number and obtains the optimum cluster number of a data set more accurately.
3. The invention correctly obtains the optimum cluster number of a data set and, compared with other algorithms, achieves a higher accuracy rate and better performance.
4. The invention judges similarity over data objects as a whole, so its efficiency improves markedly as the dimensionality of the data objects grows, giving both higher accuracy and better time efficiency. The higher the dimensionality of the data objects, the greater the efficiency advantage, which shows up as shorter running times than the other algorithms.
Brief description of the drawings
Fig. 1 shows the clustering result of the method of the invention for determining the optimum cluster number on data set DB1;
Fig. 2 shows the clustering result of the method on data set DB2;
Fig. 3 shows the clustering result of the method on data set DB3;
Fig. 4 shows the clustering result of the method on data set DB4;
Fig. 5 shows the clustering result of the method on data set DB5.
Detailed description of the embodiments
The invention is described in detail below in conjunction with the drawings and specific embodiments.
The invention evaluates the clustering quality of a data set with a validity index Q(C). This index measures cluster quality mainly through the within-class compactness of the data objects and the between-class separation between them. The relevant concepts are introduced below.
1. Validity index
Suppose that for a data set DB one of its cluster partitions is C_k = {C_1, C_2, ..., C_k}. The within-class compactness of C_k, denoted Scat(C_k), is obtained from the sum of squared distances between every pair of data objects in the same class:

Scat(C_k) = \sum_{i=1}^{k} \sum_{X,Y \in C_i} ||X - Y||^2    (1)

Meanwhile, the between-class separation Sep(C_k) of C_k is obtained from the sum of squared distances between every pair of data objects in different classes:

Sep(C_k) = \sum_{i=1}^{k} \left( \sum_{j=1, j \neq i}^{k} \frac{1}{|C_i| \cdot |C_j|} \sum_{X \in C_i, Y \in C_j} ||X - Y||^2 \right)    (2)

In formulas (1) and (2), X and Y denote two data objects, and k denotes the number of clusters into which the data set DB is partitioned. Substituting the Euclidean distance formula and simplifying, Scat(C_k) and Sep(C_k) can be further transformed into formulas (3) and (4):

Scat(C_k) = 2 \sum_{i=1}^{k} \left( |C_i| \, SS_i - LS_i^2 \right)    (3)

Sep(C_k) = 2 \left( (k-1) \sum_{i=1}^{k} \frac{SS_i}{|C_i|} - \left( \sum_{i=1}^{k} \frac{LS_i}{|C_i|} \right)^2 + \sum_{i=1}^{k} \frac{LS_i^2}{|C_i|^2} \right)    (4)

where LS_i = \sum_{x_j \in C_i} x_j is the linear sum of the data objects in cluster C_i and SS_i = \sum_{x_j \in C_i} x_j^2 is their squared sum, k denotes the cluster number, x_j denotes a data object in cluster C_i, and |C_i| denotes the number of data objects in C_i.
Analysis of formulas (3) and (4) shows that the smaller the value of Scat(C_k), the more compact the data objects within each class are, and the larger the value of Sep(C_k), the better the separation between classes. To better balance the effects of Scat(C_k) and Sep(C_k), formulas (3) and (4) are combined linearly, as in formula (5):

Q(C_k) = Scat(C_k) + \beta \cdot Sep(C_k)    (5)

where \beta is a combination parameter that balances the differing ranges of Scat(C_k) and Sep(C_k). Treating the cluster partition C of the data set DB as a variable, its domain is {C_1, C_2, ..., C_n}; from the relevant theory, the value of \beta is taken here to be 1.
On a given data set DB, Scat(C_k) and Sep(C_k) have the same range. In the initial state, namely when the cluster number k equals n, formula (1) gives Scat(C_n) = 0, and we set:

Sep(C_n) = 2 \left( n \sum_{x \in DB} x^2 - \left( \sum_{x \in DB} x \right)^2 \right) = M    (6)

Since Scat(C_k) is a monotonically increasing function and Sep(C_k) a monotonically decreasing one, when the cluster number k is 1 we obtain Sep(C_1) = 0 and Scat(C_1) = M. The adopted validity index Q(C_k) can therefore be expressed as:

Q(C_k) = \frac{1}{M} \left( Scat(C_k) + Sep(C_k) \right)    (7)
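As a concrete illustration, the index can be sketched in Python through the LS_i/SS_i form of formulas (3), (4) and (7). This is a minimal sketch assuming one-dimensional numeric data; the function names (`scat`, `sep`, `q_index`) and the explicit `m_norm` argument carrying the constant M of formula (6) are ours, not the patent's.

```python
def scat(clusters):
    # Formula (3): Scat(C_k) = 2 * sum_i (|C_i| * SS_i - LS_i^2), where
    # LS_i is the linear sum and SS_i the squared sum of the objects in C_i.
    total = 0.0
    for c in clusters:
        ls = sum(c)
        ss = sum(x * x for x in c)
        total += len(c) * ss - ls * ls
    return 2.0 * total

def sep(clusters):
    # Formula (4): Sep(C_k) = 2 * ((k-1) * sum_i SS_i/|C_i|
    #              - (sum_i LS_i/|C_i|)^2 + sum_i LS_i^2/|C_i|^2).
    k = len(clusters)
    t1 = sum(sum(x * x for x in c) / len(c) for c in clusters)
    t2 = sum(sum(c) / len(c) for c in clusters)
    t3 = sum((sum(c) / len(c)) ** 2 for c in clusters)
    return 2.0 * ((k - 1) * t1 - t2 * t2 + t3)

def q_index(clusters, m_norm):
    # Formula (7): Q(C_k) = (Scat + Sep) / M, with beta = 1 as in formula (5).
    return (scat(clusters) + sep(clusters)) / m_norm
```

For instance, on the two-point set {0, 1} partitioned into singletons, formula (6) gives M = 2, Scat is 0, Sep is 2, and the index evaluates to 1.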
2. Determination of the optimum cluster number
At each step of the clustering process, the value of the validity index Q(C) of the current partition is computed and saved until the whole clustering process ends; the optimum cluster partition is then found from the index values. The optimum partition corresponds to the equilibrium point between within-class compactness and between-class separation, which numerically is where Q(C) attains its minimum. It follows that the smaller the validity index value, the better the clustering; the invention therefore holds that the cluster number at which the cluster validity index Q(C) is minimal is the optimum cluster number, and uses formula (8) to compute it:

C^* = \arg\min_{C_k \in \{C_1, C_2, ..., C_n\}} Q(C_k)    (8)

Because data sets often contain noise points and outliers, which strongly affect the clustering result, the cluster number given by formula (8) alone is not necessarily optimal. In view of this, the invention applies an MDL-based pruning algorithm to eliminate the influence of noise points and outliers; the resulting cluster partition can then be considered optimal. Formula (9) expresses the MDL procedure:

k_opt = \beta(C^*)    (9)

In formulas (8) and (9), C^* denotes the cluster partition at which the cluster quality Q(C) attains its minimum, and k_opt denotes the resulting optimum cluster number.
3. Elimination of noise points and outliers
Because data sets contain noise points and outliers that affect the clustering result, the invention holds that the cluster number obtained from the validity index alone cannot be assumed to be the optimum cluster number k^*. A pruning method based on MDL (minimum description length) is therefore applied to the result, as follows:

Let C^* = {C_1^*, C_2^*, ..., C_k^*}, and let |C_i^*| denote the number of data objects contained in C_i^*. First, sorting the clusters by |C_i^*| in descending order yields a new sequence C_1, C_2, ..., C_k. This sequence is then split at C_m (1 < m < k) into two parts, S_L(m) = {C_1, C_2, ..., C_m} and S_R(m) = {C_{m+1}, C_{m+2}, ..., C_k}, and the code length CL(m) of the data is computed. CL(m) is defined as:

CL(m) = \log_2(\mu_{S_L(m)}) + \sum_{1 \le j < m} \log_2\left( \big| |C_j| - \mu_{S_L(m)} \big| \right) + \log_2(\mu_{S_R(m)}) + \sum_{m+1 \le j \le k} \log_2\left( \big| |C_j| - \mu_{S_R(m)} \big| \right)    (10)

where \mu_{S_L(m)} and \mu_{S_R(m)} denote the average number of data objects per cluster in S_L(m) and S_R(m) respectively. The first and third terms of formula (10) are the average code lengths of the two sequences split at m; the remaining two terms measure the difference between each |C_j| and the average number of data objects.
Classes containing few data objects are generally regarded as classes formed by noise points and outliers. Accordingly, the data objects in S_R(m) are identified as noise points and rejected, and the optimum cluster number of the data set is finally obtained as k_opt = m.
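The MDL split can be sketched as follows. This is a minimal Python sketch under assumptions of ours: the split position m is chosen to minimise CL(m) over 1 <= m < k (the patent defines CL(m) but does not state the selection rule explicitly), the left-hand sum runs over j < m exactly as formula (10) is written, and log2 of a zero argument is skipped, a practical convention not spelled out in the patent. The names `code_length` and `prune` are illustrative.

```python
import math

def code_length(sizes, m):
    # Formula (10): split the descending size sequence |C_1| >= ... >= |C_k|
    # at C_m into S_L = sizes[:m] and S_R = sizes[m:], then sum the average
    # code lengths of the two halves plus the per-cluster deviation terms.
    left, right = sizes[:m], sizes[m:]
    mu_l = sum(left) / len(left)       # average object count in S_L(m)
    mu_r = sum(right) / len(right)     # average object count in S_R(m)

    def lg(x):
        # Skip zero arguments: log2(0) is undefined (our convention).
        return math.log2(x) if x > 0 else 0.0

    return (lg(mu_l) + sum(lg(abs(s - mu_l)) for s in sizes[:m - 1])
            + lg(mu_r) + sum(lg(abs(s - mu_r)) for s in sizes[m:]))

def prune(sizes):
    # Choose the split m with the smallest code length; the clusters after
    # position m are rejected as noise, so k_opt = m.
    k = len(sizes)
    return min(range(1, k), key=lambda m: code_length(sizes, m))
```

On a size sequence such as [100, 90, 85, 3, 2], the minimum code length falls at m = 3, so the two small trailing classes are rejected as noise and k_opt = 3.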
The invention decides whether two objects are merged according to the similarity of the data objects. Because most data sets are multidimensional, traditional similarity measures are not always suitable, so choosing an appropriate similarity measure is very important; the invention proposes a new one.
Traditional similarity measures regard each data object as a point in a multidimensional space and judge similarity by the distance between points: the greater the distance, the lower the similarity between the data objects. The resulting clusters are then spheroids of similar volume, which greatly limits the scope of application. On this basis, the invention proposes a new similarity measure over the data set, described as follows: in a given d-dimensional data set DB, the similarity of any two data objects x_i and x_j is defined as:

s(x_i, x_j) = \sum_{k=1}^{d} \frac{1}{1 + |x_{ik} - x_{jk}|}    (11)

where x_{ik} and x_{jk} denote the k-th attributes of the two data objects. Their similarity coefficient is the sum of the similarity coefficients over the d attributes, and the similarity coefficient of each pair of attributes is the reciprocal of that pair's distance plus 1. Formula (11) maps the per-attribute similarity coefficients of two data objects into the interval (0, 1), under the assumption that every attribute of a data object has equal influence, which reduces the effect that differing attributes have on the similarity judgement. By formula (11), the larger the value of s(x_i, x_j), the more similar the data objects x_i and x_j are.
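Formula (11) transcribes directly into code, assuming the data objects are given as equal-length numeric sequences; the function name `similarity` is ours.

```python
def similarity(x, y):
    # Formula (11): per-attribute coefficient 1 / (1 + |x_ik - x_jk|),
    # summed over the d attributes of the two data objects.
    return sum(1.0 / (1.0 + abs(a - b)) for a, b in zip(x, y))
```

Identical objects attain the maximum value d (one per attribute), and the value decays toward 0 as the objects move apart in any dimension.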
Combining hierarchical clustering, the invention determines the optimum cluster number as follows. At initialization, the cluster number k of the data set equals the total number n of data objects. The similarity s(i, j) of every pair of data objects is computed by formula (11) and stored in an array D, whose elements are sorted in descending order. For the current element of D, first check whether its two data objects have already been assigned to classes: if neither has, merge the two objects into a new class; if one of them already belongs to some class, merge the other object into that class; if they belong to two different classes, merge those two classes into one; if they already belong to the same class, abandon this merge. After each merge, compute the current value of the cluster validity index Q(C) by formula (7) and save it, together with the current cluster partition, in an array A, the cluster number of the data set becoming k = k - 1; then take the next element of D and continue until the cluster number of the data set reaches 1. Finally, by formula (8), find the minimum cluster index value in array A and its corresponding cluster partition, apply the procedure of formula (9) to reject the classes identified as being formed by noise points and outliers, and obtain the optimum cluster number k_opt.
The specific implementation process by which the invention determines the optimum cluster number for a data set DB is as follows:
1. Compute the similarity of every pair of points in the data set DB by formula (11), store the values in an array D, and sort the values in D in descending order.
2. For the current element of D, first check whether its two data objects have already been assigned to classes. If neither has, merge the two objects into a new class; if one of them already belongs to some class, merge the other object into that class; if they belong to two different classes, merge those two classes into one; if they already belong to the same class, abandon this merge. After each merge, compute the current value of the cluster validity index Q(C) by formula (7) and save it, together with the current cluster partition, in an array A; the cluster number of the data set becomes k = k - 1. Then take the next element of D and continue judging and computing until the cluster number of the data set reaches 1.
3. By formula (8), find the minimum cluster index value in array A and its corresponding cluster partition. Apply the procedure of formula (9) to this partition to reject the classes identified as being formed by noise points and outliers, finally obtaining the optimum cluster number k_opt.
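The merge loop above can be sketched end to end for one-dimensional data. Everything here is an illustrative reconstruction, not the patent's code: class membership is tracked with a union-find structure, Q is evaluated through formulas (3) and (4), and the constant 1/M factor of formula (7) is omitted since it does not change which partition minimises Q. The MDL rejection step of formula (9) is left out for brevity, and the name `optimal_partition` is ours.

```python
from itertools import combinations

def optimal_partition(data):
    n = len(data)
    parent = list(range(n))                       # union-find over data objects

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    # All pairs sorted by the formula-(11) similarity (d = 1), descending.
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: 1.0 / (1.0 + abs(data[p[0]] - data[p[1]])),
                   reverse=True)

    def q_value():
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(data[i])
        cs = list(groups.values())
        k = len(cs)
        ls = [sum(c) for c in cs]                 # linear sums LS_i
        ss = [sum(x * x for x in c) for c in cs]  # squared sums SS_i
        sz = [len(c) for c in cs]
        scat = 2.0 * sum(m * s - l * l for m, s, l in zip(sz, ss, ls))   # formula (3)
        sep = 2.0 * ((k - 1) * sum(s / m for s, m in zip(ss, sz))
                     - sum(l / m for l, m in zip(ls, sz)) ** 2
                     + sum((l / m) ** 2 for l, m in zip(ls, sz)))        # formula (4)
        return scat + sep, cs                     # beta = 1; 1/M omitted

    best_q, best_cs = None, None
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                              # same class already: abandon merge
        parent[ri] = rj                           # merge the two classes
        q, cs = q_value()
        if best_q is None or q < best_q:
            best_q, best_cs = q, [sorted(c) for c in cs]
    return best_cs                                # partition minimising Q, formula (8)
```

On two well-separated groups of three points each, the minimum of Q along the merge path is attained at the natural two-cluster partition.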
In the worst case, the space complexity of the invention is O(n^2). At the start, the similarity of every pair of data objects is computed and the values are sorted in descending order by quicksort, with an average time complexity of O(n^2). During the merging of data objects, the total number of merges is n, computing the sequence of validity index values Q(C_k) takes O(nk), and the MDL pruning method takes O(k^2), where k is the number of classes after clustering completes. The overall time complexity of the invention is therefore O(n^2).
To demonstrate that the invention determines the optimum cluster number of a data set with high accuracy, the FCM-based validity indexes V_xie and V_wsj were chosen as comparison objects, with the fuzziness factor m of the FCM clustering algorithm set to 2 in the experiments. The COPS algorithm was also selected for comparison. The optimum cluster numbers obtained by the different algorithms are shown in Table 1.
Table 1. Optimum cluster numbers obtained by the different algorithms
In this experiment, the range of the cluster number c in the FCM algorithm was set to [1, 12], which quickly improves the efficiency of the FCM algorithm. To improve the accuracy of the experimental results, each algorithm was run repeatedly on each of the 5 data sets, and the cluster number occurring most often was taken as the final cluster number, avoiding erroneous results caused by other factors. Table 1 shows that the invention correctly obtains the optimum cluster number of each data set and, compared with the other algorithms, achieves a higher accuracy rate and better performance.
In the experiments, each algorithm was executed repeatedly and the running time of each run was recorded. The mean of these times was taken as the time that algorithm needs to process the data set, as shown in Table 2.
Table 2. Running times of the different algorithms
Table 2 shows that on data set DB1, whose structure is simple and whose distribution is concentrated, the running times of the four algorithms are nearly the same, because the validity indexes V_wsj and V_xie are also efficient when processing simple data sets with few data objects. As the size and dimensionality of the data sets grow, the invention and COPS perform better; and when the dimensionality of the data objects is especially high, the invention is the most efficient, running faster than the other algorithms. The COPS algorithm first determines the similar data objects in each single dimension and then those in the higher-dimensional space, which requires more computation, whereas the invention judges similarity over the data objects as a whole by the new similarity criterion, so its efficiency improves markedly as the dimensionality of the data objects increases. In short, the invention has higher accuracy and time efficiency.

Claims (7)

1. A method for determining an optimum cluster number, characterized in that a validity index Q(C) evaluates the clustering quality of a data set, and the cluster number at which the cluster validity index Q(C) attains its minimum is the optimum cluster number.
2. the method for definite best cluster numbers as claimed in claim 1, is characterized in that: being defined as of described Validity Index, and degree of separation between compactness and class in compute classes first, then represent Validity Index according to both a linear combination; Specifically comprise:
1) suppose for cube DB, one of them clustering is C k={ C 1, C 2..., C k, and cluster C now kclass in compactness be to obtain by calculating the quadratic sum of distance between any two data objects in same class, with Scat (C k) represent,
Scat ( C k ) = &Sigma; i = 1 k &Sigma; X , Y &Element; C i | | X - Y | | 2 - - - ( 1 )
Meanwhile, cluster C kclass between degree of separation Sep (C k) by calculating, the quadratic sum of distance between any two data objects in inhomogeneity obtains,
Sep ( C k ) = &Sigma; i = 1 k ( &Sigma; j = 1 , j &NotEqual; i k 1 | C i | &CenterDot; | C j | &Sigma; X &Element; C i , Y &Element; C j | | X - Y | | 2 ) - - - ( 2 )
In formula (1) and formula (2), X, Y represents two data objects, k represents the cluster number that data set DB is divided into;
2) Euclidean distance formula is brought into formula (1) and formula (2), then do conversion obtain:
Scat ( C k ) = 2 &Sigma; i = 1 k ( | C i | SS i - LS i 2 ) - - - ( 3 )
Sep ( C k ) 2 ( ( k - 1 ) &Sigma; i = 1 k SS i | C i | - ( &Sigma; i = 1 k LS i | C i | ) 2 + &Sigma; i = 1 k LS i 2 | C i | 2 ) - - - ( 4 )
Wherein,
Figure FDA0000447410400000021
k represents cluster number, x jrepresent cluster C iin a data object, | C i| represent cluster C ithe number of middle data object;
3) formula (3) and formula (4) are carried out to linear combination, obtain formula (5),
Q(C k)=Scat(C k)+β.Sep(C k) (5)
Wherein, β is combination parameter, for balance Scat (C k) and Sep (C k) difference in span; At this, regard the clustering C of data set DB as a variable, obtain its field of definition for { C 1, C 2...., C n, in the value of this β, be 1;
4) in given data set DB, Scat (C k) and Sep (C k) there is identical codomain scope; In original state, namely when cluster numbers k is n, from its formula (1), Scat (C now n) value is 0, and now establishes:
Sep ( C n ) = 2 ( n . &Sigma; x &Element; DB x 2 - ( &Sigma; x &Element; DB x ) 2 ) = M - - - ( 6 )
Due to Scat (C k) be monotonically increasing function, and Sep (C k) be monotonic decreasing function, can obtain when cluster numbers k is 1 Sep (C 1)=0, Scat (C 1)=M; So Validity Index Q (C adopting k) form can be expressed as:
Q ( C k ) = 1 M ( Scat ( C k ) + Sep ( C k ) ) - - - ( 7 )
3. the method for definite best cluster numbers as claimed in claim 1, is characterized in that: definite method of described best cluster numbers is that employing is eliminated noise spot and the impact of isolated point on cluster result based on MDL beta pruning algorithm, finally obtains best cluster numbers; The processing procedure of MDL algorithm is:
k opt=β(C *) (9)
In formula (9), C *the value that represents cluster quality Q (C) reaches hour corresponding clustering, k optrepresent the best cluster numbers obtaining.
4. the method for definite best cluster numbers as claimed in claim 3, it is characterized in that: the removing method of described noise spot and isolated point is, the pruning method of employing based on MDL (minimal description length) processed result, and concrete disposal route is as follows:
Order C * = { C 1 * , C 2 * , . . . . . . C k * } ,
Figure FDA0000447410400000032
for the number of the data object comprising; First according to
Figure FDA0000447410400000034
sequence from big to small generates a new sequence C 1, C 2... ..C k, then by this sequence with C m(1<m<k) for boundary is divided into two parts, that is: S l(m)={ C 1, C 2... .C mand S r(m)={ C m+1, C m+2... .C k, trying to achieve the code length CL (m) of data, CL (m) is defined as: CL ( m ) = log 2 ( &mu; S L ( m ) ) + &Sigma; 1 &le; j < m log 2 ( | | C j | - &mu; S L ( m ) | ) + log 2 ( &mu; S R ( m ) ) + &Sigma; m + 1 &le; j &le; k log 2 | | C j | - &mu; S R ( m ) | - - - ( 10 ) Wherein,
Figure FDA0000447410400000036
in formula (10) the 1st and the 3rd represents respectively to take the average code length of two sequences that m is boundary, and all the other two is to weigh | C j| and the difference between average data number of objects; S r(m) data object in is identified as noise spot and rejects, and has finally obtained the best cluster numbers k of data set optfor m.
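The pruning step above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patent's implementation: it treats a zero size deviation as contributing code length 0 (the patent does not say how log2(0) is handled), and all names are mine.

```python
import math

def mdl_prune(cluster_sizes):
    """Sketch of the MDL pruning of claim 4 (formula (10)).

    `cluster_sizes` holds |C*_i| for the selected clustering C*;
    at least 3 clusters are needed so a boundary 1 < m < k exists.
    Returns k_opt = m; clusters beyond the boundary count as noise.
    """
    sizes = sorted(cluster_sizes, reverse=True)  # new sequence C_1..C_k
    k = len(sizes)

    def log2_or_zero(v):
        # Assumption: a zero deviation contributes code length 0.
        return math.log2(v) if v > 0 else 0.0

    def code_length(m):
        left, right = sizes[:m], sizes[m:]
        mu_l = sum(left) / len(left)    # average size over S_L(m)
        mu_r = sum(right) / len(right)  # average size over S_R(m)
        return (log2_or_zero(mu_l)
                + sum(log2_or_zero(abs(c - mu_l)) for c in left)
                + log2_or_zero(mu_r)
                + sum(log2_or_zero(abs(c - mu_r)) for c in right))

    # Pick the boundary m (1 < m < k) that minimises CL(m).
    return min(range(2, k), key=code_length)
```

Intuitively, CL(m) is smallest when each side of the boundary is internally uniform in size, so the minimiser separates the "real" clusters from the small noise clusters.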
5. the method for definite best cluster numbers as claimed in claim 2, is characterized in that: described data set DB comprises artificial synthetic data set and standard data set.
6. the method for the definite best cluster numbers as described in claim 1-5 any one, is characterized in that: specific implementation process is as follows:
1) similarity of any two points in computational data collection DB, deposits in array D, and the numerical value in array D is sorted according to order from big to small;
2) to the currentElement in array D, first judge whether these two data objects have been integrated in class, if do not had, just these two data object mergings are become to a class, if one of them data object has been integrated in some classes, another object is also merged in that class, if they are integrated into respectively two different classes, two classes at its place are merged into a class, when if they have belonged to same class, abandon this time merging, now, according to formula (7), calculate the value of Cluster Validity Index Q (C) now, together with clustering now, be kept in array A, the cluster number k=k-1 of data set now, then get the next element in D, continue judgement and calculate, until the cluster number of data set is to finish for 1 o'clock,
3) according to formula (8), obtain cluster desired value and corresponding clustering minimum in array A; To selected min cluster desired value and corresponding clustering, by the process of formula (9), the class that is wherein identified as noise spot and isolated point and forms is carried out to " rejecting ", finally obtain best cluster numbers k opt.
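The four merge cases in step 2) all reduce to a union operation on a disjoint-set structure, which is how the step can be sketched in Python. This is an illustrative sketch (names and data layout are mine, not the patent's):

```python
class DisjointSet:
    """Union-find used to track which class each data object belongs to."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False      # already in the same class: abandon the merge
        self.parent[rj] = ri
        return True

def merge_schedule(n, pairs):
    """Step 2) of claim 6: walk similarity pairs in descending order,
    merging classes; returns the cluster count after each merge.
    `pairs` is the pre-sorted array D as (similarity, i, j) tuples."""
    ds, k, counts = DisjointSet(n), n, []
    for _, i, j in pairs:
        if ds.union(i, j):    # covers all four cases in the claim
            k -= 1            # every successful merge removes one class
            counts.append(k)
        if k == 1:
            break
    return counts
```

A real implementation would also evaluate Q(C) via formula (7) after each successful merge and record it in array A; that bookkeeping is omitted here to keep the merge logic visible.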
7. the method for definite best cluster numbers as claimed in claim 2 as claimed in claim 6, is characterized in that: the measure of described similarity is, in given d dimension data collection DB, and any two data object x iand x jsimilarity formula may be defined as:
s ( x i , x j ) = &Sigma; k = 1 d 1 1 + | x ik - x jk | - - - ( 11 )
Wherein, x ikwith x jkrepresent two different data objects in k dimension, their similarity coefficient equals similarity coefficient sum between d attribute, the distance that similarity coefficient between every pair of attribute equals every pair of attribute adds 1 inverse, formula (11) is that the similarity coefficient of each attribute between two data objects is mapped to (0,1) in interval, each properties affect at this tentation data object is identical, so just can reduce different attribute in data object and analog result be judged to the impact bringing, by formula (11), can be obtained, as s (x i, x j) value larger, data object x iand x jmore similar.
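Formula (11) is a one-liner; the sketch below is illustrative (the function name is mine):

```python
def similarity(x, y):
    """Formula (11): sum over the d dimensions of 1 / (1 + |x_k - y_k|).

    Each per-attribute term lies in (0, 1]; a larger s means the two
    data objects are more similar.
    """
    if len(x) != len(y):
        raise ValueError("objects must have the same dimension d")
    return sum(1.0 / (1.0 + abs(a - b)) for a, b in zip(x, y))
```

For identical objects every term is 1, so s equals d; the value decays toward 0 as the per-attribute distances grow, which matches the claim's reading that larger s means more similar.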
CN201310739837.3A 2013-12-26 2013-12-26 Method for determining optimum cluster number Pending CN103714154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310739837.3A CN103714154A (en) 2013-12-26 2013-12-26 Method for determining optimum cluster number

Publications (1)

Publication Number Publication Date
CN103714154A true CN103714154A (en) 2014-04-09

Family

ID=50407129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310739837.3A Pending CN103714154A (en) 2013-12-26 2013-12-26 Method for determining optimum cluster number

Country Status (1)

Country Link
CN (1) CN103714154A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574005A (en) * 2014-10-10 2016-05-11 富士通株式会社 Device and method for clustering source data containing a plurality of documents
CN105956628A (en) * 2016-05-13 2016-09-21 北京京东尚科信息技术有限公司 Data classification method and device for data classification
CN105956628B (en) * 2016-05-13 2021-01-26 北京京东尚科信息技术有限公司 Data classification method and device for data classification
CN108696521A (en) * 2018-05-11 2018-10-23 雷恩友力数据科技南京有限公司 A kind of cyberspace intrusion detection method
CN109147877A (en) * 2018-09-27 2019-01-04 大连大学 A method of ethane molecule energy is calculated by deep learning
CN110390470A (en) * 2019-07-01 2019-10-29 北京工业大学 Climate region of building method and apparatus
CN110895333A (en) * 2019-12-05 2020-03-20 电子科技大学 Rapid 77G vehicle-mounted radar data clustering method based on Doppler frequency
CN110895333B (en) * 2019-12-05 2022-06-03 电子科技大学 Rapid 77G vehicle-mounted radar data clustering method based on Doppler frequency
CN112783883A (en) * 2021-01-22 2021-05-11 广东电网有限责任公司东莞供电局 Power data standardized cleaning method and device under multi-source data access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140409