CN103714154A - Method for determining optimum cluster number - Google Patents

Method for determining optimum cluster number

Info

Publication number
CN103714154A
Authority
CN
China
Prior art keywords
cluster, formula, data, sigma, class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310739837.3A
Other languages
Chinese (zh)
Inventor
周红芳
王啸
赵雪涵
段文聪
郭杰
张国荣
王心怡
何馨依
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN201310739837.3A
Publication of CN103714154A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a method for determining the optimum cluster number. The clustering quality of a data set is evaluated with a validity index Q(C); the cluster number at which Q(C) attains its minimum is the optimum cluster number. The method introduces a new similarity measure and, combined with hierarchical clustering, generates all candidate cluster partitions bottom-up, computing the validity index of each. From these values a cluster-quality curve over the different partitions is built, and the partition at the curve's extreme point is the optimum cluster partition. Repeated clustering of a large data set is thereby avoided, and the method does not depend on any specific clustering algorithm. Experimental results and theoretical analysis both show that the method performs well, is feasible, and greatly improves computational efficiency.

Description

Method for determining the optimum cluster number
Technical field
The invention belongs to the field of data mining technology and relates to a method for determining the optimum cluster number.
Background technology
Most existing approaches determine the optimum cluster number by iterative trial and error: a specific clustering algorithm is run on the given data set with different parameters (usually the cluster number k) to produce different partitions, a validity index is computed for each partition, and the cluster number whose index value satisfies a predetermined condition is taken as the optimum cluster number. In practice this trial-and-error process has several shortcomings. First, choosing the value of k is difficult for users who lack extensive experience in cluster analysis, which calls for a more principled way of finding a reasonable k. Second, although many cluster-validity indexes have been proposed, the main representatives being the V_xie index and the V_wsj index, each is built on a particular clustering algorithm, which greatly limits such methods in practical applications. Moreover, on large, high-dimensional data sets their computational efficiency is poor.
Summary of the invention
The object of the invention is to provide a method for determining the optimum cluster number that avoids the prior art's repeated clustering of large data sets and achieves higher computational efficiency.
The technical scheme of the invention is a method for determining the optimum cluster number in which a validity index Q(C) evaluates the clustering quality of a data set; the cluster number at which the cluster validity index Q(C) attains its minimum is the optimum cluster number.
Features of the invention are as follows.
Determination of the validity index: first compute the within-class compactness and the between-class separation, then express the validity index as a linear combination of the two. Specifically:
1. Suppose that for a data set DB one of its cluster partitions is C_k = {C_1, C_2, ..., C_k}. The within-class compactness of C_k, denoted Scat(C_k), is obtained from the sum of squared distances between every pair of data objects in the same class:

Scat(C_k) = \sum_{i=1}^{k} \sum_{X,Y \in C_i} ||X - Y||^2    (1)

Meanwhile, the between-class separation Sep(C_k) of C_k is obtained from the sum of squared distances between every pair of data objects in different classes:

Sep(C_k) = \sum_{i=1}^{k} \left( \sum_{j=1, j \neq i}^{k} \frac{1}{|C_i| \cdot |C_j|} \sum_{X \in C_i, Y \in C_j} ||X - Y||^2 \right)    (2)

In formulas (1) and (2), X and Y denote two data objects, and k denotes the number of clusters into which the data set DB is partitioned;
2. Substituting the Euclidean distance formula into formulas (1) and (2) and simplifying gives:

Scat(C_k) = 2 \sum_{i=1}^{k} \left( |C_i| \, SS_i - LS_i^2 \right)    (3)

Sep(C_k) = 2 \left( (k-1) \sum_{i=1}^{k} \frac{SS_i}{|C_i|} - \left( \sum_{i=1}^{k} \frac{LS_i}{|C_i|} \right)^2 + \sum_{i=1}^{k} \frac{LS_i^2}{|C_i|^2} \right)    (4)

where LS_i = \sum_{x_j \in C_i} x_j is the linear sum of the data objects in cluster C_i and SS_i = \sum_{x_j \in C_i} x_j^2 is their squared sum, k denotes the cluster number, x_j denotes a data object in cluster C_i, and |C_i| denotes the number of data objects in C_i;
3. Formulas (3) and (4) are combined linearly:

Q(C_k) = Scat(C_k) + \beta \cdot Sep(C_k)    (5)

where \beta is a combination parameter that balances the differing ranges of Scat(C_k) and Sep(C_k). Treating the cluster partition C of the data set DB as a variable with domain {C_1, C_2, ..., C_n}, the value of \beta is taken to be 1;
4. On a given data set DB, Scat(C_k) and Sep(C_k) have the same range. In the initial state, namely when the cluster number k equals n, formula (1) gives Scat(C_n) = 0, and we set:

Sep(C_n) = 2 \left( n \sum_{x \in DB} x^2 - \left( \sum_{x \in DB} x \right)^2 \right) = M    (6)

Since Scat(C_k) is a monotonically increasing function and Sep(C_k) a monotonically decreasing one, when the cluster number k is 1 we obtain Sep(C_1) = 0 and Scat(C_1) = M. The adopted validity index Q(C_k) can therefore be expressed as:

Q(C_k) = \frac{1}{M} \left( Scat(C_k) + Sep(C_k) \right)    (7)
The optimum cluster number is determined by applying an MDL-based pruning algorithm to eliminate the influence of noise points and outliers on the clustering result, finally obtaining the optimum cluster number. The MDL procedure is:

k_opt = \beta(C^*)    (9)

In formula (9), C^* denotes the cluster partition at which the cluster quality Q(C) attains its minimum, and k_opt denotes the resulting optimum cluster number.
The noise points and outliers are removed by post-processing the result with a pruning method based on MDL (minimum description length), as follows:

Let C^* = {C_1^*, C_2^*, ..., C_k^*}, and let |C_i^*| denote the number of data objects contained in C_i^*. First, sorting the clusters by |C_i^*| in descending order yields a new sequence C_1, C_2, ..., C_k. This sequence is then split at C_m (1 < m < k) into two parts, S_L(m) = {C_1, C_2, ..., C_m} and S_R(m) = {C_{m+1}, C_{m+2}, ..., C_k}, and the code length CL(m) of the data is computed. CL(m) is defined as:

CL(m) = \log_2(\mu_{S_L(m)}) + \sum_{1 \le j < m} \log_2\left( \big| |C_j| - \mu_{S_L(m)} \big| \right) + \log_2(\mu_{S_R(m)}) + \sum_{m+1 \le j \le k} \log_2\left( \big| |C_j| - \mu_{S_R(m)} \big| \right)    (10)

where \mu_{S_L(m)} and \mu_{S_R(m)} denote the average number of data objects per cluster in S_L(m) and S_R(m) respectively. The first and third terms of formula (10) are the average code lengths of the two sequences split at m; the remaining two terms measure the difference between each |C_j| and the average number of data objects. The data objects in S_R(m) are identified as noise points and rejected, and the optimum cluster number of the data set is finally obtained as k_opt = m.
The data set DB above comprises synthetic data sets and standard data sets.
The specific implementation process is as follows:
1. Compute the similarity of every pair of points in the data set DB, store the values in an array D, and sort the values in D in descending order.
2. For the current element of D, first check whether its two data objects have already been assigned to classes. If neither has, merge the two objects into a new class; if one of them already belongs to some class, merge the other object into that class; if they belong to two different classes, merge those two classes into one; if they already belong to the same class, abandon this merge. After each merge, compute the current value of the cluster validity index Q(C) by formula (7) and save it, together with the current cluster partition, in an array A; the cluster number of the data set becomes k = k - 1. Then take the next element of D and continue judging and computing until the cluster number of the data set reaches 1.
3. By formula (8), find the minimum cluster index value in array A and its corresponding cluster partition. Apply the procedure of formula (9) to this partition to reject the classes identified as being formed by noise points and outliers, finally obtaining the optimum cluster number k_opt.
The similarity measure is as follows: in a given d-dimensional data set DB, the similarity of any two data objects x_i and x_j is defined as:

s(x_i, x_j) = \sum_{k=1}^{d} \frac{1}{1 + |x_{ik} - x_{jk}|}    (11)

where x_{ik} and x_{jk} denote the k-th attributes of the two data objects. Their similarity coefficient is the sum of the similarity coefficients over the d attributes, and the similarity coefficient of each pair of attributes is the reciprocal of that pair's distance plus 1. Formula (11) maps the per-attribute similarity coefficients of two data objects into the interval (0, 1), under the assumption that every attribute of a data object has equal influence, which reduces the effect that differing attributes have on the similarity judgement. By formula (11), the larger the value of s(x_i, x_j), the more similar the data objects x_i and x_j are.
The invention has the following beneficial effects:
1. The invention proposes a new data similarity measure. Combined with hierarchical clustering, all possible cluster partitions are generated bottom-up and the validity index of each is computed; from these values a cluster-quality curve over the different partitions is built, and the partition at the curve's extreme point is the optimum cluster partition. This avoids repeatedly clustering a large data set, and the invention does not depend on any specific clustering algorithm. Experimental results and theoretical analysis both show that the invention has good performance and feasibility, and at the same time greatly improves computational efficiency.
2. The invention identifies the correct cluster number and obtains the optimum cluster number of a data set more accurately.
3. The invention correctly obtains the optimum cluster number of a data set and, compared with other algorithms, achieves a higher accuracy rate and better performance.
4. The invention judges similarity over data objects as a whole, so its efficiency improves markedly as the dimensionality of the data objects grows, giving both higher accuracy and better time efficiency. The higher the dimensionality of the data objects, the greater the efficiency advantage, which shows up as shorter running times than the other algorithms.
Brief description of the drawings
Fig. 1 shows the clustering result of the method of the invention for determining the optimum cluster number on data set DB1;
Fig. 2 shows the clustering result of the method on data set DB2;
Fig. 3 shows the clustering result of the method on data set DB3;
Fig. 4 shows the clustering result of the method on data set DB4;
Fig. 5 shows the clustering result of the method on data set DB5.
Detailed description of the embodiments
The invention is described in detail below in conjunction with the drawings and specific embodiments.
The invention evaluates the clustering quality of a data set with a validity index Q(C). This index measures cluster quality mainly through the within-class compactness of the data objects and the between-class separation between them. The relevant concepts are introduced below.
1. Validity index
Suppose that for a data set DB one of its cluster partitions is C_k = {C_1, C_2, ..., C_k}. The within-class compactness of C_k, denoted Scat(C_k), is obtained from the sum of squared distances between every pair of data objects in the same class:

Scat(C_k) = \sum_{i=1}^{k} \sum_{X,Y \in C_i} ||X - Y||^2    (1)

Meanwhile, the between-class separation Sep(C_k) of C_k is obtained from the sum of squared distances between every pair of data objects in different classes:

Sep(C_k) = \sum_{i=1}^{k} \left( \sum_{j=1, j \neq i}^{k} \frac{1}{|C_i| \cdot |C_j|} \sum_{X \in C_i, Y \in C_j} ||X - Y||^2 \right)    (2)

In formulas (1) and (2), X and Y denote two data objects, and k denotes the number of clusters into which the data set DB is partitioned. Substituting the Euclidean distance formula and simplifying, Scat(C_k) and Sep(C_k) can be further transformed into formulas (3) and (4):

Scat(C_k) = 2 \sum_{i=1}^{k} \left( |C_i| \, SS_i - LS_i^2 \right)    (3)

Sep(C_k) = 2 \left( (k-1) \sum_{i=1}^{k} \frac{SS_i}{|C_i|} - \left( \sum_{i=1}^{k} \frac{LS_i}{|C_i|} \right)^2 + \sum_{i=1}^{k} \frac{LS_i^2}{|C_i|^2} \right)    (4)

where LS_i = \sum_{x_j \in C_i} x_j is the linear sum of the data objects in cluster C_i and SS_i = \sum_{x_j \in C_i} x_j^2 is their squared sum, k denotes the cluster number, x_j denotes a data object in cluster C_i, and |C_i| denotes the number of data objects in C_i.
Analysis of formulas (3) and (4) shows that the smaller the value of Scat(C_k), the more compact the data objects within each class are, and the larger the value of Sep(C_k), the better the separation between classes. To better balance the effects of Scat(C_k) and Sep(C_k), formulas (3) and (4) are combined linearly, as in formula (5):

Q(C_k) = Scat(C_k) + \beta \cdot Sep(C_k)    (5)

where \beta is a combination parameter that balances the differing ranges of Scat(C_k) and Sep(C_k). Treating the cluster partition C of the data set DB as a variable, its domain is {C_1, C_2, ..., C_n}; from the relevant theory, the value of \beta is taken here to be 1.
On a given data set DB, Scat(C_k) and Sep(C_k) have the same range. In the initial state, namely when the cluster number k equals n, formula (1) gives Scat(C_n) = 0, and we set:

Sep(C_n) = 2 \left( n \sum_{x \in DB} x^2 - \left( \sum_{x \in DB} x \right)^2 \right) = M    (6)

Since Scat(C_k) is a monotonically increasing function and Sep(C_k) a monotonically decreasing one, when the cluster number k is 1 we obtain Sep(C_1) = 0 and Scat(C_1) = M. The adopted validity index Q(C_k) can therefore be expressed as:

Q(C_k) = \frac{1}{M} \left( Scat(C_k) + Sep(C_k) \right)    (7)
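As a concrete illustration, the index can be sketched in Python through the LS_i/SS_i form of formulas (3), (4) and (7). This is a minimal sketch assuming one-dimensional numeric data; the function names (`scat`, `sep`, `q_index`) and the explicit `m_norm` argument carrying the constant M of formula (6) are ours, not the patent's.

```python
def scat(clusters):
    # Formula (3): Scat(C_k) = 2 * sum_i (|C_i| * SS_i - LS_i^2), where
    # LS_i is the linear sum and SS_i the squared sum of the objects in C_i.
    total = 0.0
    for c in clusters:
        ls = sum(c)
        ss = sum(x * x for x in c)
        total += len(c) * ss - ls * ls
    return 2.0 * total

def sep(clusters):
    # Formula (4): Sep(C_k) = 2 * ((k-1) * sum_i SS_i/|C_i|
    #              - (sum_i LS_i/|C_i|)^2 + sum_i LS_i^2/|C_i|^2).
    k = len(clusters)
    t1 = sum(sum(x * x for x in c) / len(c) for c in clusters)
    t2 = sum(sum(c) / len(c) for c in clusters)
    t3 = sum((sum(c) / len(c)) ** 2 for c in clusters)
    return 2.0 * ((k - 1) * t1 - t2 * t2 + t3)

def q_index(clusters, m_norm):
    # Formula (7): Q(C_k) = (Scat + Sep) / M, with beta = 1 as in formula (5).
    return (scat(clusters) + sep(clusters)) / m_norm
```

For instance, on the two-point set {0, 1} partitioned into singletons, formula (6) gives M = 2, Scat is 0, Sep is 2, and the index evaluates to 1.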
2. Determination of the optimum cluster number
At each step of the clustering process, the value of the validity index Q(C) of the current partition is computed and saved until the whole clustering process ends; the optimum cluster partition is then found from the index values. The optimum partition corresponds to the equilibrium point between within-class compactness and between-class separation, which numerically is where Q(C) attains its minimum. It follows that the smaller the validity index value, the better the clustering; the invention therefore holds that the cluster number at which the cluster validity index Q(C) is minimal is the optimum cluster number, and uses formula (8) to compute it:

C^* = \arg\min_{C_k \in \{C_1, C_2, ..., C_n\}} Q(C_k)    (8)

Because data sets often contain noise points and outliers, which strongly affect the clustering result, the cluster number given by formula (8) alone is not necessarily optimal. In view of this, the invention applies an MDL-based pruning algorithm to eliminate the influence of noise points and outliers; the resulting cluster partition can then be considered optimal. Formula (9) expresses the MDL procedure:

k_opt = \beta(C^*)    (9)

In formulas (8) and (9), C^* denotes the cluster partition at which the cluster quality Q(C) attains its minimum, and k_opt denotes the resulting optimum cluster number.
3. Elimination of noise points and outliers
Because data sets contain noise points and outliers that affect the clustering result, the invention holds that the cluster number obtained from the validity index alone cannot be assumed to be the optimum cluster number k^*. A pruning method based on MDL (minimum description length) is therefore applied to the result, as follows:

Let C^* = {C_1^*, C_2^*, ..., C_k^*}, and let |C_i^*| denote the number of data objects contained in C_i^*. First, sorting the clusters by |C_i^*| in descending order yields a new sequence C_1, C_2, ..., C_k. This sequence is then split at C_m (1 < m < k) into two parts, S_L(m) = {C_1, C_2, ..., C_m} and S_R(m) = {C_{m+1}, C_{m+2}, ..., C_k}, and the code length CL(m) of the data is computed. CL(m) is defined as:

CL(m) = \log_2(\mu_{S_L(m)}) + \sum_{1 \le j < m} \log_2\left( \big| |C_j| - \mu_{S_L(m)} \big| \right) + \log_2(\mu_{S_R(m)}) + \sum_{m+1 \le j \le k} \log_2\left( \big| |C_j| - \mu_{S_R(m)} \big| \right)    (10)

where \mu_{S_L(m)} and \mu_{S_R(m)} denote the average number of data objects per cluster in S_L(m) and S_R(m) respectively. The first and third terms of formula (10) are the average code lengths of the two sequences split at m; the remaining two terms measure the difference between each |C_j| and the average number of data objects.
Classes containing few data objects are generally regarded as classes formed by noise points and outliers. Accordingly, the data objects in S_R(m) are identified as noise points and rejected, and the optimum cluster number of the data set is finally obtained as k_opt = m.
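The MDL split can be sketched as follows. This is a minimal Python sketch under assumptions of ours: the split position m is chosen to minimise CL(m) over 1 <= m < k (the patent defines CL(m) but does not state the selection rule explicitly), the left-hand sum runs over j < m exactly as formula (10) is written, and log2 of a zero argument is skipped, a practical convention not spelled out in the patent. The names `code_length` and `prune` are illustrative.

```python
import math

def code_length(sizes, m):
    # Formula (10): split the descending size sequence |C_1| >= ... >= |C_k|
    # at C_m into S_L = sizes[:m] and S_R = sizes[m:], then sum the average
    # code lengths of the two halves plus the per-cluster deviation terms.
    left, right = sizes[:m], sizes[m:]
    mu_l = sum(left) / len(left)       # average object count in S_L(m)
    mu_r = sum(right) / len(right)     # average object count in S_R(m)

    def lg(x):
        # Skip zero arguments: log2(0) is undefined (our convention).
        return math.log2(x) if x > 0 else 0.0

    return (lg(mu_l) + sum(lg(abs(s - mu_l)) for s in sizes[:m - 1])
            + lg(mu_r) + sum(lg(abs(s - mu_r)) for s in sizes[m:]))

def prune(sizes):
    # Choose the split m with the smallest code length; the clusters after
    # position m are rejected as noise, so k_opt = m.
    k = len(sizes)
    return min(range(1, k), key=lambda m: code_length(sizes, m))
```

On a size sequence such as [100, 90, 85, 3, 2], the minimum code length falls at m = 3, so the two small trailing classes are rejected as noise and k_opt = 3.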
The invention decides whether two objects are merged according to the similarity of the data objects. Because most data sets are multidimensional, traditional similarity measures are not always suitable, so choosing an appropriate similarity measure is very important; the invention proposes a new one.
Traditional similarity measures regard each data object as a point in a multidimensional space and judge similarity by the distance between points: the greater the distance, the lower the similarity between the data objects. The resulting clusters are then spheroids of similar volume, which greatly limits the scope of application. On this basis, the invention proposes a new similarity measure over the data set, described as follows: in a given d-dimensional data set DB, the similarity of any two data objects x_i and x_j is defined as:

s(x_i, x_j) = \sum_{k=1}^{d} \frac{1}{1 + |x_{ik} - x_{jk}|}    (11)

where x_{ik} and x_{jk} denote the k-th attributes of the two data objects. Their similarity coefficient is the sum of the similarity coefficients over the d attributes, and the similarity coefficient of each pair of attributes is the reciprocal of that pair's distance plus 1. Formula (11) maps the per-attribute similarity coefficients of two data objects into the interval (0, 1), under the assumption that every attribute of a data object has equal influence, which reduces the effect that differing attributes have on the similarity judgement. By formula (11), the larger the value of s(x_i, x_j), the more similar the data objects x_i and x_j are.
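Formula (11) transcribes directly into code, assuming the data objects are given as equal-length numeric sequences; the function name `similarity` is ours.

```python
def similarity(x, y):
    # Formula (11): per-attribute coefficient 1 / (1 + |x_ik - x_jk|),
    # summed over the d attributes of the two data objects.
    return sum(1.0 / (1.0 + abs(a - b)) for a, b in zip(x, y))
```

Identical objects attain the maximum value d (one per attribute), and the value decays toward 0 as the objects move apart in any dimension.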
Combining hierarchical clustering, the invention determines the optimum cluster number as follows. At initialization, the cluster number k of the data set equals the total number n of data objects. The similarity s(i, j) of every pair of data objects is computed by formula (11) and stored in an array D, whose elements are sorted in descending order. For the current element of D, first check whether its two data objects have already been assigned to classes: if neither has, merge the two objects into a new class; if one of them already belongs to some class, merge the other object into that class; if they belong to two different classes, merge those two classes into one; if they already belong to the same class, abandon this merge. After each merge, compute the current value of the cluster validity index Q(C) by formula (7) and save it, together with the current cluster partition, in an array A, the cluster number of the data set becoming k = k - 1; then take the next element of D and continue until the cluster number of the data set reaches 1. Finally, by formula (8), find the minimum cluster index value in array A and its corresponding cluster partition, apply the procedure of formula (9) to reject the classes identified as being formed by noise points and outliers, and obtain the optimum cluster number k_opt.
The specific implementation process by which the invention determines the optimum cluster number for a data set DB is as follows:
1. Compute the similarity of every pair of points in the data set DB by formula (11), store the values in an array D, and sort the values in D in descending order.
2. For the current element of D, first check whether its two data objects have already been assigned to classes. If neither has, merge the two objects into a new class; if one of them already belongs to some class, merge the other object into that class; if they belong to two different classes, merge those two classes into one; if they already belong to the same class, abandon this merge. After each merge, compute the current value of the cluster validity index Q(C) by formula (7) and save it, together with the current cluster partition, in an array A; the cluster number of the data set becomes k = k - 1. Then take the next element of D and continue judging and computing until the cluster number of the data set reaches 1.
3. By formula (8), find the minimum cluster index value in array A and its corresponding cluster partition. Apply the procedure of formula (9) to this partition to reject the classes identified as being formed by noise points and outliers, finally obtaining the optimum cluster number k_opt.
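The merge loop above can be sketched end to end for one-dimensional data. Everything here is an illustrative reconstruction, not the patent's code: class membership is tracked with a union-find structure, Q is evaluated through formulas (3) and (4), and the constant 1/M factor of formula (7) is omitted since it does not change which partition minimises Q. The MDL rejection step of formula (9) is left out for brevity, and the name `optimal_partition` is ours.

```python
from itertools import combinations

def optimal_partition(data):
    n = len(data)
    parent = list(range(n))                       # union-find over data objects

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    # All pairs sorted by the formula-(11) similarity (d = 1), descending.
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: 1.0 / (1.0 + abs(data[p[0]] - data[p[1]])),
                   reverse=True)

    def q_value():
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(data[i])
        cs = list(groups.values())
        k = len(cs)
        ls = [sum(c) for c in cs]                 # linear sums LS_i
        ss = [sum(x * x for x in c) for c in cs]  # squared sums SS_i
        sz = [len(c) for c in cs]
        scat = 2.0 * sum(m * s - l * l for m, s, l in zip(sz, ss, ls))   # formula (3)
        sep = 2.0 * ((k - 1) * sum(s / m for s, m in zip(ss, sz))
                     - sum(l / m for l, m in zip(ls, sz)) ** 2
                     + sum((l / m) ** 2 for l, m in zip(ls, sz)))        # formula (4)
        return scat + sep, cs                     # beta = 1; 1/M omitted

    best_q, best_cs = None, None
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                              # same class already: abandon merge
        parent[ri] = rj                           # merge the two classes
        q, cs = q_value()
        if best_q is None or q < best_q:
            best_q, best_cs = q, [sorted(c) for c in cs]
    return best_cs                                # partition minimising Q, formula (8)
```

On two well-separated groups of three points each, the minimum of Q along the merge path is attained at the natural two-cluster partition.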
In the worst case, the space complexity of the invention is O(n^2). At the start, the similarity of every pair of data objects is computed and the values are sorted in descending order by quicksort, with an average time complexity of O(n^2). During the merging of data objects, the total number of merges is n, computing the sequence of validity index values Q(C_k) takes O(nk), and the MDL pruning method takes O(k^2), where k is the number of classes after clustering completes. The overall time complexity of the invention is therefore O(n^2).
To demonstrate that the invention determines the optimum cluster number of a data set with high accuracy, the FCM-based validity indexes V_xie and V_wsj were chosen as comparison objects, with the fuzziness factor m of the FCM clustering algorithm set to 2 in the experiments. The COPS algorithm was also selected for comparison. The optimum cluster numbers obtained by the different algorithms are shown in Table 1.
Table 1. Optimum cluster numbers obtained by the different algorithms
In this experiment, the range of the cluster number c in the FCM algorithm was set to [1, 12], which quickly improves the efficiency of the FCM algorithm. To improve the accuracy of the experimental results, each algorithm was run repeatedly on each of the 5 data sets, and the cluster number occurring most often was taken as the final cluster number, avoiding erroneous results caused by other factors. Table 1 shows that the invention correctly obtains the optimum cluster number of each data set and, compared with the other algorithms, achieves a higher accuracy rate and better performance.
In the experiments, each algorithm was executed repeatedly and the running time of each run was recorded. The mean of these times was taken as the time that algorithm needs to process the data set, as shown in Table 2.
Table 2. Running times of the different algorithms
Table 2 shows that on data set DB1, whose structure is simple and whose distribution is concentrated, the running times of the four algorithms are nearly the same, because the validity indexes V_wsj and V_xie are also efficient when processing simple data sets with few data objects. As the size and dimensionality of the data sets grow, the invention and COPS perform better; and when the dimensionality of the data objects is especially high, the invention is the most efficient, running faster than the other algorithms. The COPS algorithm first determines the similar data objects in each single dimension and then those in the higher-dimensional space, which requires more computation, whereas the invention judges similarity over the data objects as a whole by the new similarity criterion, so its efficiency improves markedly as the dimensionality of the data objects increases. In short, the invention has higher accuracy and time efficiency.

Claims (7)

1. A method for determining an optimum cluster number, characterized in that a validity index Q(C) evaluates the clustering quality of a data set, and the cluster number at which the cluster validity index Q(C) attains its minimum is the optimum cluster number.
2. the method for definite best cluster numbers as claimed in claim 1, is characterized in that: being defined as of described Validity Index, and degree of separation between compactness and class in compute classes first, then represent Validity Index according to both a linear combination; Specifically comprise:
1) suppose for cube DB, one of them clustering is C k={ C 1, C 2..., C k, and cluster C now kclass in compactness be to obtain by calculating the quadratic sum of distance between any two data objects in same class, with Scat (C k) represent,
Scat ( C k ) = &Sigma; i = 1 k &Sigma; X , Y &Element; C i | | X - Y | | 2 - - - ( 1 )
Meanwhile, cluster C kclass between degree of separation Sep (C k) by calculating, the quadratic sum of distance between any two data objects in inhomogeneity obtains,
Sep ( C k ) = &Sigma; i = 1 k ( &Sigma; j = 1 , j &NotEqual; i k 1 | C i | &CenterDot; | C j | &Sigma; X &Element; C i , Y &Element; C j | | X - Y | | 2 ) - - - ( 2 )
In formula (1) and formula (2), X, Y represents two data objects, k represents the cluster number that data set DB is divided into;
2) Euclidean distance formula is brought into formula (1) and formula (2), then do conversion obtain:
Scat ( C k ) = 2 &Sigma; i = 1 k ( | C i | SS i - LS i 2 ) - - - ( 3 )
Sep ( C k ) 2 ( ( k - 1 ) &Sigma; i = 1 k SS i | C i | - ( &Sigma; i = 1 k LS i | C i | ) 2 + &Sigma; i = 1 k LS i 2 | C i | 2 ) - - - ( 4 )
Wherein,
Figure FDA0000447410400000021
k represents cluster number, x jrepresent cluster C iin a data object, | C i| represent cluster C ithe number of middle data object;
3) formula (3) and formula (4) are carried out to linear combination, obtain formula (5),
Q(C k)=Scat(C k)+β.Sep(C k) (5)
Wherein, β is combination parameter, for balance Scat (C k) and Sep (C k) difference in span; At this, regard the clustering C of data set DB as a variable, obtain its field of definition for { C 1, C 2...., C n, in the value of this β, be 1;
4) in given data set DB, Scat (C k) and Sep (C k) there is identical codomain scope; In original state, namely when cluster numbers k is n, from its formula (1), Scat (C now n) value is 0, and now establishes:
Sep ( C n ) = 2 ( n . &Sigma; x &Element; DB x 2 - ( &Sigma; x &Element; DB x ) 2 ) = M - - - ( 6 )
Due to Scat (C k) be monotonically increasing function, and Sep (C k) be monotonic decreasing function, can obtain when cluster numbers k is 1 Sep (C 1)=0, Scat (C 1)=M; So Validity Index Q (C adopting k) form can be expressed as:
Q ( C k ) = 1 M ( Scat ( C k ) + Sep ( C k ) ) - - - ( 7 )
3. the method for definite best cluster numbers as claimed in claim 1, is characterized in that: definite method of described best cluster numbers is that employing is eliminated noise spot and the impact of isolated point on cluster result based on MDL beta pruning algorithm, finally obtains best cluster numbers; The processing procedure of MDL algorithm is:
k opt=β(C *) (9)
In formula (9), C *the value that represents cluster quality Q (C) reaches hour corresponding clustering, k optrepresent the best cluster numbers obtaining.
4. the method for definite best cluster numbers as claimed in claim 3, it is characterized in that: the removing method of described noise spot and isolated point is, the pruning method of employing based on MDL (minimal description length) processed result, and concrete disposal route is as follows:
Order C * = { C 1 * , C 2 * , . . . . . . C k * } ,
Figure FDA0000447410400000032
for the number of the data object comprising; First according to
Figure FDA0000447410400000034
sequence from big to small generates a new sequence C 1, C 2... ..C k, then by this sequence with C m(1<m<k) for boundary is divided into two parts, that is: S l(m)={ C 1, C 2... .C mand S r(m)={ C m+1, C m+2... .C k, trying to achieve the code length CL (m) of data, CL (m) is defined as: CL ( m ) = log 2 ( &mu; S L ( m ) ) + &Sigma; 1 &le; j < m log 2 ( | | C j | - &mu; S L ( m ) | ) + log 2 ( &mu; S R ( m ) ) + &Sigma; m + 1 &le; j &le; k log 2 | | C j | - &mu; S R ( m ) | - - - ( 10 ) Wherein,
Figure FDA0000447410400000036
in formula (10) the 1st and the 3rd represents respectively to take the average code length of two sequences that m is boundary, and all the other two is to weigh | C j| and the difference between average data number of objects; S r(m) data object in is identified as noise spot and rejects, and has finally obtained the best cluster numbers k of data set optfor m.
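The pruning step above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patent's implementation: it treats a zero size deviation as contributing code length 0 (the patent does not say how log2(0) is handled), and all names are mine.

```python
import math

def mdl_prune(cluster_sizes):
    """Sketch of the MDL pruning of claim 4 (formula (10)).

    `cluster_sizes` holds |C*_i| for the selected clustering C*;
    at least 3 clusters are needed so a boundary 1 < m < k exists.
    Returns k_opt = m; clusters beyond the boundary count as noise.
    """
    sizes = sorted(cluster_sizes, reverse=True)  # new sequence C_1..C_k
    k = len(sizes)

    def log2_or_zero(v):
        # Assumption: a zero deviation contributes code length 0.
        return math.log2(v) if v > 0 else 0.0

    def code_length(m):
        left, right = sizes[:m], sizes[m:]
        mu_l = sum(left) / len(left)    # average size over S_L(m)
        mu_r = sum(right) / len(right)  # average size over S_R(m)
        return (log2_or_zero(mu_l)
                + sum(log2_or_zero(abs(c - mu_l)) for c in left)
                + log2_or_zero(mu_r)
                + sum(log2_or_zero(abs(c - mu_r)) for c in right))

    # Pick the boundary m (1 < m < k) that minimises CL(m).
    return min(range(2, k), key=code_length)
```

Intuitively, CL(m) is smallest when each side of the boundary is internally uniform in size, so the minimiser separates the "real" clusters from the small noise clusters.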
5. the method for definite best cluster numbers as claimed in claim 2, is characterized in that: described data set DB comprises artificial synthetic data set and standard data set.
6. the method for the definite best cluster numbers as described in claim 1-5 any one, is characterized in that: specific implementation process is as follows:
1) similarity of any two points in computational data collection DB, deposits in array D, and the numerical value in array D is sorted according to order from big to small;
2) to the currentElement in array D, first judge whether these two data objects have been integrated in class, if do not had, just these two data object mergings are become to a class, if one of them data object has been integrated in some classes, another object is also merged in that class, if they are integrated into respectively two different classes, two classes at its place are merged into a class, when if they have belonged to same class, abandon this time merging, now, according to formula (7), calculate the value of Cluster Validity Index Q (C) now, together with clustering now, be kept in array A, the cluster number k=k-1 of data set now, then get the next element in D, continue judgement and calculate, until the cluster number of data set is to finish for 1 o'clock,
3) according to formula (8), obtain cluster desired value and corresponding clustering minimum in array A; To selected min cluster desired value and corresponding clustering, by the process of formula (9), the class that is wherein identified as noise spot and isolated point and forms is carried out to " rejecting ", finally obtain best cluster numbers k opt.
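The four merge cases in step 2) all reduce to a union operation on a disjoint-set structure, which is how the step can be sketched in Python. This is an illustrative sketch (names and data layout are mine, not the patent's):

```python
class DisjointSet:
    """Union-find used to track which class each data object belongs to."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False      # already in the same class: abandon the merge
        self.parent[rj] = ri
        return True

def merge_schedule(n, pairs):
    """Step 2) of claim 6: walk similarity pairs in descending order,
    merging classes; returns the cluster count after each merge.
    `pairs` is the pre-sorted array D as (similarity, i, j) tuples."""
    ds, k, counts = DisjointSet(n), n, []
    for _, i, j in pairs:
        if ds.union(i, j):    # covers all four cases in the claim
            k -= 1            # every successful merge removes one class
            counts.append(k)
        if k == 1:
            break
    return counts
```

A real implementation would also evaluate Q(C) via formula (7) after each successful merge and record it in array A; that bookkeeping is omitted here to keep the merge logic visible.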
7. the method for definite best cluster numbers as claimed in claim 2 as claimed in claim 6, is characterized in that: the measure of described similarity is, in given d dimension data collection DB, and any two data object x iand x jsimilarity formula may be defined as:
s ( x i , x j ) = &Sigma; k = 1 d 1 1 + | x ik - x jk | - - - ( 11 )
Wherein, x ikwith x jkrepresent two different data objects in k dimension, their similarity coefficient equals similarity coefficient sum between d attribute, the distance that similarity coefficient between every pair of attribute equals every pair of attribute adds 1 inverse, formula (11) is that the similarity coefficient of each attribute between two data objects is mapped to (0,1) in interval, each properties affect at this tentation data object is identical, so just can reduce different attribute in data object and analog result be judged to the impact bringing, by formula (11), can be obtained, as s (x i, x j) value larger, data object x iand x jmore similar.
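Formula (11) is a one-liner; the sketch below is illustrative (the function name is mine):

```python
def similarity(x, y):
    """Formula (11): sum over the d dimensions of 1 / (1 + |x_k - y_k|).

    Each per-attribute term lies in (0, 1]; a larger s means the two
    data objects are more similar.
    """
    if len(x) != len(y):
        raise ValueError("objects must have the same dimension d")
    return sum(1.0 / (1.0 + abs(a - b)) for a, b in zip(x, y))
```

For identical objects every term is 1, so s equals d; the value decays toward 0 as the per-attribute distances grow, which matches the claim's reading that larger s means more similar.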
CN201310739837.3A 2013-12-26 2013-12-26 Method for determining optimum cluster number Pending CN103714154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310739837.3A CN103714154A (en) 2013-12-26 2013-12-26 Method for determining optimum cluster number

Publications (1)

Publication Number Publication Date
CN103714154A true CN103714154A (en) 2014-04-09

Family

ID=50407129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310739837.3A Pending CN103714154A (en) 2013-12-26 2013-12-26 Method for determining optimum cluster number

Country Status (1)

Country Link
CN (1) CN103714154A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574005A (en) * 2014-10-10 2016-05-11 富士通株式会社 Device and method for clustering source data containing a plurality of documents
CN105956628A (en) * 2016-05-13 2016-09-21 北京京东尚科信息技术有限公司 Data classification method and device for data classification
CN105956628B (en) * 2016-05-13 2021-01-26 北京京东尚科信息技术有限公司 Data classification method and device for data classification
CN108696521A (en) * 2018-05-11 2018-10-23 雷恩友力数据科技南京有限公司 A kind of cyberspace intrusion detection method
CN109147877A (en) * 2018-09-27 2019-01-04 大连大学 A method of ethane molecule energy is calculated by deep learning
CN110390470A (en) * 2019-07-01 2019-10-29 北京工业大学 Climate region of building method and apparatus
CN110895333A (en) * 2019-12-05 2020-03-20 电子科技大学 Rapid 77G vehicle-mounted radar data clustering method based on Doppler frequency
CN110895333B (en) * 2019-12-05 2022-06-03 电子科技大学 Rapid 77G vehicle-mounted radar data clustering method based on Doppler frequency
CN112783883A (en) * 2021-01-22 2021-05-11 广东电网有限责任公司东莞供电局 Power data standardized cleaning method and device under multi-source data access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140409