CN104199853A - Clustering method - Google Patents

Clustering method

Info

Publication number
CN104199853A
CN104199853A (application CN201410394502.7A)
Authority
CN
China
Prior art keywords
cluster
text
class
clustering method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410394502.7A
Other languages
Chinese (zh)
Inventor
侯荣涛
王琴
周彬
路郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN201410394502.7A priority Critical patent/CN104199853A/en
Publication of CN104199853A publication Critical patent/CN104199853A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention discloses a clustering method. First, a density-based pre-classification step produces high-density core classes and a class hierarchy tree that captures the structure of the dataset. K-MEANS clustering is then run using the highly representative subclass centers from the hierarchy tree as initial centers, yielding fine-grained clusters. Finally, the fine clusters are merged according to the class attributes recorded in the hierarchy tree, giving an accurate and stable clustering result. Addressing the sensitivity of K-MEANS to its initial cluster centers, the method provides a stable algorithm based on fine clusters: it can partition the convex classes in a dataset and also finds an optimal partition for irregularly shaped classes.

Description

A clustering method
Technical field
The present invention relates to a clustering method, in particular to a novel K-MEANS clustering method, and belongs to the field of data mining.
Background art
With the growth of the Internet, data are shared and accumulated on a massive scale, and the problem of data overload combined with knowledge scarcity is increasingly acute. Ever-expanding data become a "data tomb" if left unused; if they can be fully accessed and mined, the latent information they contain will create great value. The task of data mining is to discover knowledge in mass data. It mainly targets structured data, yet in practice a large share of data is stored in databases as text, which makes text mining an important branch of data mining.
Clustering is a key technique in data mining. Its task is to group texts with similar subject matter into the same class while keeping texts with different content apart. K-MEANS is one of the most classical clustering algorithms; being simple, fast, and easy to implement, it is the most commonly used algorithm in text mining. However, K-MEANS is overly sensitive to the choice of initial cluster centers, and its execution efficiency can fall short of requirements. Text mining applications call for an unsupervised text clustering method that can deliver high-precision results stably, so traditional clustering methods need further improvement before they can be applied well to text mining.
Summary of the invention
The technical problem to be solved by this invention is to provide a clustering method that remedies the traditional K-MEANS algorithm's excessive sensitivity to initial cluster centers and therefore clusters more accurately.
To solve the above technical problem, the present invention adopts the following technical solution:
A clustering method comprising the following steps:
Step 1: Run the density-based OPTICS clustering method on the dataset to perform a preliminary clustering and obtain a reachability plot.
Step 2: Extract all clusters contained in the reachability plot from Step 1 and sort them in descending order of the number of data objects they contain. Take the dataset as the root node of a hierarchy tree, then insert the sorted clusters one by one by breadth-first traversal to build the tree, defining a node that can contain a cluster as that cluster's parent node, and a node that cannot contain it as its sibling node. Every node of the tree other than the root is one of the clusters, and each subtree of the hierarchy tree is assigned a distinct id.
Step 3: Use the number of leaf nodes of the hierarchy tree from Step 2 as the initial number of categories for K-MEANS clustering, and use the mean of the data objects contained in each leaf node as the initial cluster center of the corresponding category; the id assigned in Step 2 is the initial id of each leaf node's initial cluster center and of the data objects it contains. Run K-MEANS on the dataset; after each iteration, the new cluster center keeps the id of the center it replaces, and every data object assigned to a center takes that center's id. This yields K-MEANS clusters labeled with ids.
Step 4: Merge the id-labeled K-MEANS clusters from Step 3: clusters with the same id are merged into the same class, giving the final clustering result for the dataset.
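The four steps above can be sketched end to end. The sketch below is an illustration only, not the patented implementation: it assumes scikit-learn's OPTICS (with `cluster_method='xi'`) to obtain both the reachability ordering and a cluster hierarchy, uses Euclidean geometry rather than the modified cosine distance introduced later, and represents hierarchy-tree nodes as index intervals over the OPTICS ordering.

```python
import numpy as np
from sklearn.cluster import OPTICS, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Step 1: preliminary density-based clustering. With cluster_method='xi',
# scikit-learn also extracts a cluster hierarchy from the reachability plot.
optics = OPTICS(min_samples=10, cluster_method='xi').fit(X)
hier = [tuple(r) for r in optics.cluster_hierarchy_]  # (start, end) in ordering_
order = optics.ordering_

def contains(a, b):
    """Does interval a strictly contain interval b?"""
    return a != b and a[0] <= b[0] and b[1] <= a[1]

# Step 2: leaves of the hierarchy (clusters containing no smaller cluster),
# each tagged with its top-level ancestor, which plays the role of a subtree id.
leaves = [c for c in hier if not any(contains(c, d) for d in hier)]

def subtree_id(c):
    anc = [d for d in hier if contains(d, c) or d == c]
    return max(anc, key=lambda d: d[1] - d[0])     # outermost container

ids = [subtree_id(c) for c in leaves]

# Step 3: fine K-MEANS seeded with one center per leaf (mean of its members).
centers = np.array([X[order[s:e + 1]].mean(axis=0) for s, e in leaves])
km = KMeans(n_clusters=len(leaves), init=centers, n_init=1).fit(X)

# Step 4: merge fine clusters whose seeds share a subtree id.
unique_ids = sorted(set(ids))
final = np.array([unique_ids.index(ids[l]) for l in km.labels_])
print(len(leaves), "fine clusters ->", len(unique_ids), "merged classes")
```

The interval-containment test stands in for the "node can contain this cluster" check of Step 2; on real text data the distance of formula (1) below would replace the Euclidean metric.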
Preferably, the distance formula in the density-based OPTICS clustering method is Distance(x_i, x_j) = 1 / (cos(x_i, x_j) + 0.001), where Distance(x_i, x_j) is the distance between any two data objects x_i and x_j, cos(x_i, x_j) is their cosine similarity, x_i and x_j denote the i-th and j-th text objects, and i ≠ j.
A text clustering method comprising the following steps:
Step 1: Choose at least two text categories arbitrarily, and select at least one text object from each category to form a text dataset.
Step 2: Apply the clustering method of claim 1 to the text dataset to obtain the text clustering result.
Preferably, the distance formula between text objects in the K-MEANS clustering is Distance(x_i, x_j) = 1 / (cos(x_i, x_j) + 0.001), where Distance(x_i, x_j) is the distance between any two text objects x_i and x_j, cos(x_i, x_j) is their cosine similarity, x_i and x_j are text objects, and i ≠ j.
Preferably, the convergence criterion of the K-MEANS clustering is the squared-error criterion E = Σ_{i=1}^{k} Σ_{x ∈ C_i} dis(x, m_i), where x is a vector in class i of the dataset, m_i is the centroid of class i, C_i is the cluster, k is the number of classes, and dis(x, m_i) is the distance from vector x to centroid m_i.
Compared with the prior art, the above technical solution of the present invention has the following technical effects:
1. The clustering method of the present invention can partition the convex classes in a dataset and also finds an optimal partition for irregularly shaped classes.
2. The clustering method of the present invention overcomes the traditional K-MEANS algorithm's excessive sensitivity to initial cluster centers, yielding a stable algorithm based on fine clusters with better clustering performance.
3. The algorithm of the clustering method of the present invention is concise, easy to understand, and easy to implement.
Brief description of the drawings
Fig. 1(a) and (b) show clustering results of the traditional K-MEANS algorithm.
Fig. 2(a), (b), and (c) show clustering results of the present invention.
Fig. 3(a) and (b) show OPTICS clustering results (reachability plots) of the present invention.
Fig. 4 shows the structure of the hierarchy tree of the present invention.
Fig. 5 is the flow chart of the clustering method of the present invention.
Fig. 6 compares the clustering accuracy of the algorithm of the present invention with that of the traditional K-MEANS algorithm.
Embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, where identical or similar reference labels throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
Besides being concise, easy to understand, and easy to implement, the traditional K-MEANS algorithm performs well on sparse matrices. However, the algorithm requires the number of clusters K to be specified manually before it runs. For an unknown dataset, the true class distribution cannot be predicted before clustering, so determining the correct K depends on the user's practical experience and carries great uncertainty. Second, because the initial cluster centers are generated at random, the algorithm easily falls into a local optimum and its results vary widely. Finally, K-MEANS is limited to finding convex clusters: it partitions the classes with hyperplanes formed between the cluster centers, so the resulting classes are convex. In practice, classes are not necessarily convex, and for irregularly shaped classes K-MEANS cannot obtain the optimal partition. Fig. 1(a) and (b) show K-MEANS results on irregularly shaped clusters; the hollow dots are the data distribution and the solid black dots are the cluster centers at K-MEANS convergence. Under the rule that each data object is assigned to the class of its nearest center, the partition boundary is the bisector of the two cluster centers (the dotted line in the figure). Because the clusters in the dataset are not fully convex, some data points are grouped into the wrong class, and the dataset cannot be partitioned optimally.
Because the true number of classes K cannot be predicted and random initial centers lack any basis, the chance that the K-MEANS algorithm falls into a local optimum grows, and the number of iterations grows with it. The convexity of the resulting clusters also prevents objects in irregularly shaped clusters from being assigned to the correct class. To address these shortcomings, the present invention proposes fine clustering: the real clusters are partitioned more finely, and the fine clusters are then merged to obtain higher clustering precision. The main idea of K-MEANS clustering with fine clusters is shown in Fig. 2. If the value of K is increased to 4, the cluster centers after K-MEANS clustering are as shown in Fig. 2(a), and the bisectors formed by these centers partition the dataset more finely, as shown in Fig. 2(b). If the two fine clusters on the left of Fig. 2(b) are merged into one cluster and the two on the right into another, the dataset is partitioned more accurately, as shown in Fig. 2(c).
Introducing fine clusters thus overcomes, to some extent, the K-MEANS algorithm's restriction to convex cluster structures and clusters more accurately. The crux is how to obtain and merge the fine clusters, which requires the hierarchy between sub-clusters and their parent clusters. The present invention introduces density-based OPTICS clustering; for this clustering method see Hou Rongtao et al., "Application of the OPTICS algorithm in lightning nowcasting", Journal of Computer Applications (计算机应用), 2014, No. 01. The algorithm does not produce clusters explicitly; instead, based on the concepts of core distance and reachability distance, it scans the dataset and produces a reachability plot that indicates how the objects of the dataset are distributed. The reachability plot reflects the internal structure of the dataset, and the final clusters are extracted from the valley regions of the plot. Fig. 3(a) and (b) show the reachability plots that OPTICS generates in the cluster analysis of a text dataset. As the figures show, because the algorithm always advances toward high-density regions, objects in sparse regions end up in the slowly rising tails after the valley regions of the plot, and these parts cannot be classified correctly. In real text clustering applications in particular, text clusters are usually not compactly distributed, so the slowly rising regions of the reachability plot become very wide and the objects in them cannot be sorted correctly.
Although OPTICS has the advantage of identifying the number of clusters automatically, using OPTICS alone for text clustering cannot classify all text objects correctly. The reachability plot reflects the density structure of the whole dataset; the reachability plots for the dataset of Fig. 1 after OPTICS clustering are shown in Fig. 3(a) and (b), where cluster C1 contains sub-clusters a and b, and C2 contains sub-clusters c and d. When fine-cluster K-MEANS clustering is performed, this hierarchy effectively guides both the choice of initial centers for the fine clusters and their merging. Therefore, exploiting the cluster hierarchy contained in the OPTICS reachability plot provides an initialization basis for the K-MEANS algorithm, and combined with the global partitioning of K-MEANS it yields better clustering performance.
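As an aside on reachability plots (illustrative code, not part of the patent): scikit-learn's OPTICS exposes the reachability distances and the visit ordering directly, and the slowly rising tails described above show up as growing reachability values after each valley.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two dense blobs plus uniform noise: the noise produces the slowly
# rising, hard-to-classify tail regions described in the text.
rng = np.random.default_rng(0)
blobs, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
noise = rng.uniform(-10, 10, size=(40, 2))
X = np.vstack([blobs, noise])

optics = OPTICS(min_samples=10).fit(X)
# Reachability values in visit order: low runs are valleys (dense clusters),
# high or infinite values mark sparse points and cluster boundaries.
reach = optics.reachability_[optics.ordering_]
finite = reach[np.isfinite(reach)]
print("median reachability:", round(float(np.median(finite)), 3))
```

Plotting `reach` as a bar chart reproduces the kind of reachability plot shown in Fig. 3.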
On this data model, the present invention measures the similarity between two text objects with cosine similarity. By the properties of cosine similarity, the more similar two objects are, the larger the cosine similarity value. Because clustering proceeds by dissimilarity, grouping objects that lie close together (small distance values) into one class, the cosine formula is modified as in formula (1) to represent the distance between text objects.
Distance(x_i, x_j) = 1 / (cos(x_i, x_j) + 0.001)    (1)
where Distance(x_i, x_j) is the distance, i.e. the dissimilarity, between any two text objects x_i and x_j; cos(x_i, x_j) is the cosine similarity of the two text objects; x_i and x_j are text objects; and i ≠ j.
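Formula (1) is straightforward to implement. A minimal sketch, assuming the text objects are already vectorized (e.g. as TF-IDF vectors) into NumPy arrays:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def distance(x, y):
    """Modified cosine distance of formula (1): small for similar texts."""
    return 1.0 / (cosine(x, y) + 0.001)

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
print(distance(a, b))  # identical vectors: 1/1.001 ≈ 0.999
print(distance(a, c))  # orthogonal vectors: 1/0.001 = 1000.0
```

With non-negative term-frequency vectors the cosine lies in [0, 1], so the 0.001 term both keeps the distance finite for orthogonal vectors and keeps it positive.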
The OPTICS algorithm is used to cluster the dataset preliminarily, producing a reachability plot that reflects the internal structure of the dataset. An existing method can extract all possible clusters from the steep-descent and steep-ascent regions of the reachability plot; see Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander, "OPTICS: Ordering Points To Identify the Clustering Structure", ACM SIGMOD Record, 1999. These clusters include sub-clusters and the parent clusters that contain them; a hierarchy tree of the containment structure is built from the cluster boundaries in the reachability plot, with the structure shown in Fig. 4.
The hierarchy tree is constructed from all the possible clusters as follows. First, all clusters obtained from the reachability plot are sorted in descending order of the number of objects they contain. Then the sorted clusters are taken out one by one and added to the hierarchy tree: a breadth-first traversal searches the tree for a node that can contain each cluster; if a node can contain the cluster, the cluster becomes a child of that node, and if it cannot, the cluster becomes a sibling of that node. Starting from the second cluster, this traversal is repeated for each cluster until the last one; when all clusters have been added to the hierarchy tree, its construction is complete.
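A minimal sketch of this construction. It assumes each cluster is represented as an inclusive index interval over the reachability ordering, so "node A can contain cluster B" becomes interval inclusion; siblings arise implicitly as other children of the same parent.

```python
from collections import deque

class Node:
    def __init__(self, start, end):
        self.start, self.end = start, end   # interval over the OPTICS ordering
        self.children = []

    def contains(self, other):
        return self.start <= other.start and other.end <= self.end

def build_hierarchy(clusters, n_objects):
    """clusters: (start, end) intervals; returns the root (whole dataset)."""
    root = Node(0, n_objects - 1)
    # Sort descending by size, so parents are inserted before their children.
    for s, e in sorted(clusters, key=lambda c: c[1] - c[0], reverse=True):
        node, parent, queue = Node(s, e), root, deque([root])
        while queue:                        # breadth-first search for a container
            cand = queue.popleft()
            if cand.contains(node):
                parent = cand               # remember the deepest container seen
                queue.extend(cand.children)
        parent.children.append(node)
    return root

def tree_leaves(node):
    if not node.children:
        return [node]
    return sum((tree_leaves(c) for c in node.children), [])

root = build_hierarchy([(0, 9), (0, 4), (5, 9), (10, 19)], 20)
print([(l.start, l.end) for l in tree_leaves(root)])   # [(0, 4), (5, 9), (10, 19)]
```

The leaves of the tree are exactly the fine clusters whose means seed the K-MEANS step, and each leaf's subtree (its topmost non-root ancestor) determines the merge id.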
The center of the cluster represented by each leaf node of the hierarchy tree is computed, and these centers are the input to the K-MEANS algorithm. Iteration starts from these initial center points; during iteration, the same distance formula Distance(x_i, x_j) as in OPTICS measures the distance between text objects, and the centroid serves as the cluster center. The convergence criterion is the squared-error criterion, shown in formula (2).
E = Σ_{i=1}^{k} Σ_{x ∈ C_i} dis(x, m_i)    (2)
where x is a vector in class i of the dataset, m_i is the mean (centroid) of class i, C_i is the cluster, k is the number of classes, and dis(x, m_i) is the distance from vector x to the centroid. During iteration the squared error E keeps decreasing; when E no longer changes, iteration stops, all text objects have been assigned to fine clusters, and fine clusters carrying the same class label are merged. At this point the clustering of the whole dataset is complete. The flow of the fine-cluster K-MEANS clustering algorithm is shown in Fig. 5.
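The labeled fine-cluster K-MEANS loop can be sketched as follows (a plain NumPy illustration with Euclidean distance for readability; the patent uses the modified cosine distance of formula (1)). Each seed center carries a subtree id, and after convergence the fine clusters sharing an id are merged.

```python
import numpy as np

def fine_kmeans(X, seeds, ids, n_iter=100):
    """seeds: initial centers, one per hierarchy-tree leaf; ids: subtree id
    of each seed. Returns the merged class id of every point in X."""
    centers = seeds.copy()
    for _ in range(n_iter):
        # Assign each point to its nearest center; the center's id travels
        # with it, so fine-cluster membership implies a subtree id.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):   # E no longer changes
            break
        centers = new_centers
    return np.array([ids[j] for j in labels])   # merge fine clusters by id

# Four tight blobs; the two left blobs share id 0, the two right share id 1.
rng = np.random.default_rng(1)
blobs = [rng.normal(c, 0.1, size=(30, 2)) for c in [(0, 0), (0, 2), (5, 0), (5, 2)]]
X = np.vstack(blobs)
seeds = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 0.0], [5.0, 2.0]])
final = fine_kmeans(X, seeds, ids=[0, 0, 1, 1])
print(sorted(set(final)))   # [0, 1]
```

Four fine clusters collapse to two merged classes, mirroring the Fig. 2(b) to Fig. 2(c) merge.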
The quality of the clustering results is evaluated below. Methods for checking cluster quality include precision, recall, and F-Measure. Suppose there are K document classes and cluster C has topic T. Let C_T be the number of objects in C whose topic class is T, C_F the number in C whose topic class is not T, C_OT the number of objects with topic T in other clusters, and C_OF the number of objects with topics other than T in other clusters. The correspondence between the variables and the classes is shown in Table 1.
Table 1 Relation between variable names and classes

Classification    Relevant to T    Irrelevant to T
Topic-T class C   C_T              C_F
Other clusters    C_OT             C_OF
Precision (sometimes called accuracy) is, after all text objects have been classified, the ratio of the number of objects whose assigned class agrees with the true class to the total number of objects in the class, as shown in formula (3).
P(C, T) = C_T / (C_T + C_F)    (3)
Recall is the ratio of the number of documents in cluster C relevant to the topic to the total number of documents assigned to that topic by the manual classification, as shown in formula (4).
R(C, T) = C_T / (C_T + C_OT)    (4)
F-Measure is an aggregate measure built on precision and recall; the F-Measure value of class i is defined as shown in formula (5).
F(i) = 2PR / (P + R)    (5)
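Formulas (3)-(5) translate directly into code; the counts below are illustrative, not taken from the experiments.

```python
def precision(c_t, c_f):
    return c_t / (c_t + c_f)                 # formula (3)

def recall(c_t, c_ot):
    return c_t / (c_t + c_ot)                # formula (4)

def f_measure(p, r):
    return 2 * p * r / (p + r)               # formula (5)

# Example: cluster C holds 180 topic-T documents and 20 others,
# while 20 more topic-T documents landed in other clusters.
p = precision(180, 20)    # 0.9
r = recall(180, 20)       # 0.9
print(round(f_measure(p, r), 3))   # 0.9
```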
To test the clustering effect, the clustering method of the present invention was run on the categorized corpus produced by the Sogou Lab: four classes, automobile, finance, IT, and health, with 200 documents drawn arbitrarily from each class, 800 documents in total. After repeated experiments, the OPTICS parameters were set to MinPts = 18 and ε = 20. The present invention is compared with the traditional K-MEANS algorithm; for consistency, the traditional K-MEANS also uses cosine similarity and the squared-error convergence criterion, and its K is set to 4. In the experiments, the method of the present invention and the traditional K-MEANS algorithm were each run 10 times, and the classification accuracies, collected statistically, are shown in Fig. 6.
As Fig. 6 shows, even when the traditional K-MEANS algorithm is given the exact cluster number K, the random choice of initial seeds still makes its precision unstable (fluctuation range 0.7064-0.8953), with large swings. Because the present invention combines the OPTICS algorithm with the K-MEANS algorithm, the initial class number K is determined automatically for K-MEANS and high-quality initial cluster centers are provided, so the clustering results are comparatively stable (fluctuation range 0.9114-0.9207); the introduction of fine clustering also contributes to the improvement in accuracy. The recall and F-Measure values computed for each experiment show the same fluctuation trend as the precision; the statistics are listed in Table 2. The experimental results indicate that the average performance of the algorithm of the present invention is about 7% higher than that of the traditional K-MEANS algorithm.
Table 2 Comparison of mean clustering quality between the algorithm of the present invention and the traditional K-MEANS algorithm

Algorithm                            Precision   Recall   F-Measure
Algorithm of the present invention   0.918       0.915    0.916
Traditional K-MEANS                  0.851       0.845    0.852
The above embodiments serve only to illustrate the technical idea of the present invention and do not limit its scope of protection. Any change made on the basis of the technical solution, in accordance with the technical idea proposed by the present invention, falls within the scope of protection of the present invention.

Claims (5)

1. A clustering method, characterized by comprising the following steps:
Step 1: Run the density-based OPTICS clustering method on the dataset to perform a preliminary clustering and obtain a reachability plot.
Step 2: Extract all clusters contained in the reachability plot of Step 1 and sort them in descending order of the number of data objects they contain. Take the dataset as the root node of a hierarchy tree, then insert the sorted clusters one by one by breadth-first traversal to build the tree, defining a node that can contain a cluster as that cluster's parent node and a node that cannot contain it as its sibling node; every node of the tree other than the root is one of the clusters, and each subtree of the hierarchy tree is assigned a distinct id.
Step 3: Use the number of leaf nodes of the hierarchy tree of Step 2 as the initial number of categories for K-MEANS clustering, and the mean of the data objects contained in each leaf node as the initial cluster center of the corresponding category; the id assigned in Step 2 is the initial id of each leaf node's initial cluster center and of the data objects it contains. Perform K-MEANS clustering on the dataset, keeping, at each iteration, the id of the new cluster center identical to the id of the center before the iteration and the ids of the data objects grouped with a new cluster center consistent with that center's id, thereby obtaining K-MEANS clusters labeled with ids.
Step 4: Merge the id-labeled K-MEANS clusters of Step 3: clusters with the same id are merged into the same class, obtaining the final clustering result of the dataset.
2. The clustering method of claim 1, characterized in that the distance formula in the density-based OPTICS clustering method is Distance(x_i, x_j) = 1 / (cos(x_i, x_j) + 0.001), where Distance(x_i, x_j) is the distance between any two data objects x_i and x_j, cos(x_i, x_j) is their cosine similarity, x_i and x_j denote the i-th and j-th text objects, and i ≠ j.
3. A text clustering method, characterized by comprising the following steps:
Step 1: Choose at least two text categories arbitrarily, and select at least one text object from each category to form a text dataset.
Step 2: Apply the clustering method of claim 1 to the text dataset to obtain the text clustering result.
4. The text clustering method of claim 3, characterized in that the distance formula between text objects in the K-MEANS clustering is Distance(x_i, x_j) = 1 / (cos(x_i, x_j) + 0.001), where Distance(x_i, x_j) is the distance between any two text objects x_i and x_j, cos(x_i, x_j) is their cosine similarity, x_i and x_j denote the i-th and j-th text objects, and i ≠ j.
5. The text clustering method of claim 3, characterized in that the convergence criterion of the K-MEANS clustering is the squared-error criterion E = Σ_{i=1}^{k} Σ_{x ∈ C_i} dis(x, m_i), where x is a vector in class i of the dataset, m_i is the centroid of class i, C_i is the cluster, k is the number of classes, and dis(x, m_i) is the distance from vector x to centroid m_i.
CN201410394502.7A 2014-08-12 2014-08-12 Clustering method Pending CN104199853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410394502.7A CN104199853A (en) 2014-08-12 2014-08-12 Clustering method


Publications (1)

Publication Number Publication Date
CN104199853A true CN104199853A (en) 2014-12-10

Family

ID=52085146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410394502.7A Pending CN104199853A (en) 2014-08-12 2014-08-12 Clustering method

Country Status (1)

Country Link
CN (1) CN104199853A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820775A (en) * 2015-04-17 2015-08-05 南京大学 Discovery method of core drug of traditional Chinese medicine prescription
CN106776600A (en) * 2015-11-19 2017-05-31 北京国双科技有限公司 The method and device of text cluster
CN107239434A (en) * 2015-11-19 2017-10-10 英特尔公司 Technology for the automatic rearrangement of sparse matrix
CN107480708A (en) * 2017-07-31 2017-12-15 微梦创科网络科技(中国)有限公司 The clustering method and system of a kind of complex model
CN107644233A (en) * 2017-10-11 2018-01-30 上海电力学院 FILTERSIM analogy methods based on Cluster Classification
CN108369638A (en) * 2015-12-16 2018-08-03 三星电子株式会社 The image management based on event carried out using cluster
CN109033084A (en) * 2018-07-26 2018-12-18 国信优易数据有限公司 A kind of semantic hierarchies tree constructing method and device
CN109685092A (en) * 2018-08-21 2019-04-26 中国平安人寿保险股份有限公司 Clustering method, equipment, storage medium and device based on big data
CN110597719A (en) * 2019-09-05 2019-12-20 腾讯科技(深圳)有限公司 Image clustering method, device and medium for adaptation test
CN112465034A (en) * 2020-11-30 2021-03-09 中国长江电力股份有限公司 Method and system for establishing T-S fuzzy model based on hydraulic generator
CN113570004A (en) * 2021-09-24 2021-10-29 西南交通大学 Riding hot spot area prediction method, device, equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SANDER J., ET AL.: "Automatic Extraction of Clusters from Hierarchical Clustering Representations", PAKDD 2003: Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining *
DANG QIUYUE: "Automatic cluster identification method based on OPTICS reachability plots", Journal of Computer Applications (计算机应用) *
HUANG ZHIHONG: "Research on the K-means algorithm based on hierarchical clustering", Computer Development & Applications (电脑开发与应用) *


Similar Documents

Publication Publication Date Title
CN104199853A (en) Clustering method
Huang et al. Revealing density-based clustering structure from the core-connected tree of a network
Popat et al. Hierarchical document clustering based on cosine similarity measure
CN103927302A (en) Text classification method and system
Rani A Survey on STING and CLIQUE Grid Based Clustering Methods.
CN109345007A (en) A kind of Favorable Reservoir development area prediction technique based on XGBoost feature selecting
Wu et al. $ K $-Ary Tree Hashing for Fast Graph Classification
CN102360436B (en) Identification method for on-line handwritten Tibetan characters based on components
Suyal et al. Text clustering algorithms: a review
CN104699666B (en) Based on neighbour&#39;s propagation model from the method for library catalogue learning hierarchical structure
CN105956012A (en) Database mode abstract method based on graphical partition strategy
Sundari et al. A study of various text mining techniques
Al-Mukhtar et al. Greedy modularity graph clustering for community detection of large co-authorship network
Li Glowworm swarm optimization algorithm-and K-prototypes algorithm-based metadata tree clustering
CN106294652A (en) Web page information search method
CN106168982A (en) Data retrieval method for particular topic
Yao et al. Applying an improved DBSCAN clustering algorithm to network intrusion detection
Mahajan et al. Various approaches of community detection in complex networks: a glance
Chauhan Clustering Techniques: A Comprehensive Study of Various Clustering Techniques.
Vadgasiya et al. An enhanced algorithm for improved cluster generation to remove outlier’s ratio for large datasets in data mining
Manikandan et al. The study on clustering analysis on data mining
Kaur Dhillon et al. A Study on Clustering Based Methods.
Hamsathvani et al. Survey on Infrequent Weighted Item set Mining Using FP Growth
Feng et al. Research on Faceted Search Method for Water Data Catalogue Service
Ingale et al. Review of algorithms for clustering random data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141210