CN104376057A

CN104376057A - Self-adaptation clustering method based on maximum distance, minimum distance and K-means

Info

Publication number: CN104376057A
Application number: CN201410621601.4A
Authority: CN
Inventors: 成卫青; 卢艳红; 仲伟伟
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2014-11-06
Filing date: 2014-11-06
Publication date: 2015-02-25

Abstract

The invention discloses an adaptive clustering method based on maximum-minimum distance and K-means. The method addresses two problems of the traditional K-means clustering algorithm: it is sensitive to the initial centers, and the number of clusters must be fixed in advance. The method is strategy-based. The initial centers, and the initial centers of each newly added cluster, are not chosen at random but computed: the two points that are nearest to the two farthest-apart points of the data set (or subset) serve as the initial centers. This effectively avoids selecting isolated points as initial centers, reduces the number of iterations in the clustering process, and yields better clustering results. Whether to add a cluster and whether to stop clustering are decided from the SSE (sum of squared errors) of each cluster and from the trend of the total SSE, so the number of clusters is determined adaptively. The method is particularly suitable for applications in which the number of clusters is difficult to determine.

Description

An adaptive clustering method based on maximum-minimum distance and K-means

Technical field

The present invention relates to an adaptive clustering method based on maximum-minimum distance and the K-means clustering algorithm, and belongs to the technical field of data mining.

Background art

Cluster analysis divides data into meaningful or useful groups (clusters). Its goal is that the objects within a group are similar to one another and different from the objects in other groups. The greater the similarity within a group and the greater the difference between groups, the better the clustering. In a sense, cluster analysis is only a starting point for solving other problems. In information retrieval, the World Wide Web contains hundreds of millions of web pages, and a search engine query may return thousands of pages. Clustering can be used to divide the search results into several clusters, each capturing a particular aspect of the query. Each category (cluster) can in turn be divided into subcategories (subclusters), producing a hierarchical structure that helps the user explore the query results further. In climate research, cluster analysis has been used to find patterns in the atmospheric pressure of polar regions and ocean areas that have a significant impact on land climate. In psychology and medicine, clustering has been used to identify different types of depression, and cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease. Therefore, whether for understanding or for utility, cluster analysis plays an important role in many fields, including statistics, pattern recognition, information retrieval, machine learning, and data mining.

In December 2006 the IEEE International Conference on Data Mining (ICDM), an authoritative international academic conference, selected the top ten classic algorithms in data mining, and the K-means clustering algorithm is one of them. K-means is comparatively simple. First, K initial centers are selected, where K is a user-specified parameter, namely the number of clusters. Each point is assigned to its nearest center, and the set of points assigned to one center forms a cluster. Then the centroid of each cluster is computed and taken as the center of that cluster. The assignment and center-update steps are repeated until the clusters no longer change or, equivalently, until the centers no longer change. However, selecting the initial centers at random can trap the clustering in a local optimum, so that the best clustering result may not be obtained. Choosing suitable initial centers both reduces the number of iterations and improves the clustering quality, whereas random selection may pick isolated points as initial centers, leading to too many iterations or to unreasonable clustering results. K-means is sensitive not only to the initial centers; the choice of the number of clusters is also a key factor affecting the clustering result. The present invention addresses both problems.

Summary of the invention

The object of the invention is to provide an adaptive clustering method based on maximum-minimum distance and the K-means clustering algorithm. The method addresses the sensitivity of the traditional K-means clustering algorithm to the initial centers and the need to determine the number of clusters in advance. It effectively avoids choosing isolated points as initial centers, reduces the number of iterations of the clustering process, and obtains good clustering results.

The technical solution adopted by the present invention is as follows. The present invention is a strategy-based method. K-means is a prototype-based, partitional clustering technique that is widely used because of its simple algorithm, fast clustering speed, and stable clustering results. The basic K-means algorithm nevertheless has some problems: it has difficulty handling non-spherical clusters and clusters of different sizes, and it is affected by noise and outliers. The clustering result is also strongly affected by the number of cluster centers and by the choice of initial centers.

To address these shortcomings of the K-means clustering algorithm, the present invention proposes an adaptive clustering method based on maximum-minimum distance and the K-means algorithm, which uses the appearance of an inflection point in the SSE (sum of squared errors) of the whole data set as the termination condition for clustering. The initial centers are not chosen at random but computed, which more effectively avoids choosing isolated points as initial centers, reduces the number of iterations of the clustering process, and obtains good clustering results. In addition, whether to add a cluster and whether to stop clustering are decided from the SSE of each cluster and from the trend of the total SSE, so the number of clusters is determined adaptively. The method is particularly suitable for applications in which the number of clusters is difficult to determine.

The traditional K-means clustering algorithm partitions n data points into K clusters so that the sum of squared distances from each data point to its cluster center is minimized. The algorithm proceeds as follows:

(1) Randomly select K data points as the initial centers.

(2) Assign each data point to its nearest center, forming K clusters.

(3) Compute the centroid of each cluster and take it as the center of that cluster.

(4) Repeat steps (2) and (3) until the centers no longer change.
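The four steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code from the patent: the function name `kmeans`, the `seed` and `max_iter` parameters, the handling of empty clusters, and the sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    # Step (1): pick K distinct data points at random as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step (2): assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step (3): move each center to the centroid of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(v) / len(cluster) for v in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # assumption: an empty cluster keeps its center
        # Step (4): stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (1, 1), (0, 1), (9, 9), (10, 10), (9, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

On this toy data the two groups are far enough apart that any random start converges to the same split; on less separated data the result depends on the random initial centers, which is the sensitivity the invention targets.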

The following definitions and formulas are used in the present invention:

(1) The similarity between data points can be determined by computing the pairwise distances between them. Euclidean distance is the best-known distance measure. In n-dimensional Euclidean space each point is an n-dimensional real vector, and the Euclidean distance between two points x and y is defined as:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)

(2) A clustering objective function is used to measure clustering quality. The present invention uses the sum of squared errors as the objective function, and the SSE of the whole data set is defined as:

SSE = \sum_{i=1}^{K} \sum_{x \in S_i} d(c_i, x)^2 \qquad (2)

where c_i is the center of the i-th cluster S_i.
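Formulas (1) and (2) translate directly into code. The sketch below is illustrative only: the function names `euclidean`, `centroid`, and `total_sse` and the sample clusters are assumptions, and each center c_i is taken to be the centroid of its cluster, as in the K-means update step.

```python
import math


def euclidean(x, y):
    """Formula (1): Euclidean distance between two n-dimensional points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def centroid(cluster):
    """Mean of the points in a cluster, used here as its center c_i."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def total_sse(clusters):
    """Formula (2): squared distances to the cluster center, summed over all clusters."""
    return sum(
        euclidean(centroid(s), x) ** 2
        for s in clusters
        for x in s
    )


clusters = [[(0, 0), (2, 0)], [(5, 5), (5, 7), (5, 9)]]
print(euclidean((0, 0), (3, 4)))  # → 5.0
print(total_sse(clusters))        # → 10.0 (2.0 from the first cluster, 8.0 from the second)
```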

Method flow:

The present invention proposes an adaptive clustering method based on maximum-minimum distance and the K-means clustering algorithm. Clustering a data set S = {x_1, x_2, ..., x_n} with the method comprises the following steps:

Step 1: Compute the Euclidean distance between every pair of data points in S.

Step 2: Find the two points x_i and x_j in S that are farthest apart, then find the point x_p nearest to x_i and the point x_q nearest to x_j.

Step 3: Take x_p and x_q as the initial cluster centers, so that the set of cluster centers is C^(0) = {x_p, x_q}; also set t = 1 and SSE^(0) = ∞.

Step 4: Apply the K-means clustering algorithm to partition S and update each cluster center, obtaining the new set of cluster centers C^(t) and |C^(t)| clusters; at this point [formula missing in source].

Step 5: For each cluster, compute the sum of squared distances from its points to its center (written SSE_k in the example of the embodiment), k = 1, 2, ..., |C^(t)|, and add these up to obtain the total sum of squared errors SSE^(t). If [inequality missing in source], where δ is a threshold, go to Step 9; otherwise continue.

Step 6: Select the cluster with the largest [measure missing in source; the example of the embodiment compares SSE_k/|S_k|], denote it S_max and its center c_max, and remove the center of this cluster from C^(t), i.e., C^(t) = C^(t) - {c_max}.

Step 7: Find the two points x_i and x_j in the data subset S_max that are farthest apart, then find the point x_p nearest to x_i and the point x_q nearest to x_j.

Step 8: Add x_p and x_q to C^(t), i.e., C^(t) = C^(t) ∪ {x_p, x_q}, then set t = t + 1 and go to Step 4.

Step 9: Take the last clustering result as the final clustering result, i.e., the final set of cluster centers is C = C^(t-1).
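Steps 1 to 9 can be sketched as a loop around K-means. The sketch below is a hedged illustration, not the patented procedure as such: the inequality of Step 5 is not available in this text, so the stopping rule used here (stop when the relative drop in total SSE is below `delta`) is an assumption; the split criterion SSE_k/|S_k| is read off the worked example in the embodiment; and Step 9 is read literally as returning the result of the previous round, C^(t-1). All function names are hypothetical.

```python
import math
from itertools import combinations


def seeds(points):
    """Steps 2 and 7: nearest neighbors of the two farthest-apart points."""
    x_i, x_j = max(combinations(points, 2), key=lambda pq: math.dist(*pq))
    x_p = min((p for p in points if p != x_i), key=lambda p: math.dist(p, x_i))
    x_q = min((p for p in points if p != x_j), key=lambda p: math.dist(p, x_j))
    return [x_p, x_q]


def kmeans(points, centers, max_iter=100):
    """Step 4: K-means started from the given centers."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[k].append(p)
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters


def adaptive_kmeans(points, delta=0.3):
    centers = seeds(points)                        # Steps 1-3
    prev_result, prev_sse = None, math.inf         # SSE^(0) = infinity
    while True:
        centers, clusters = kmeans(points, centers)
        sse_k = [sum(math.dist(c, x) ** 2 for x in s)   # Step 5: per-cluster SSE
                 for c, s in zip(centers, clusters)]
        sse = sum(sse_k)
        # ASSUMED stopping rule: relative drop in total SSE below delta.
        if prev_result is not None and prev_sse - sse < delta * prev_sse:
            return prev_result                     # Step 9, read as C = C^(t-1)
        # Step 6 (criterion taken from the worked example): largest SSE per point.
        splittable = [k for k, s in enumerate(clusters) if len(set(s)) > 1]
        if not splittable:
            return centers, clusters               # nothing left to split
        worst = max(splittable, key=lambda k: sse_k[k] / len(clusters[k]))
        new_seeds = seeds(clusters[worst])         # Step 7
        prev_result, prev_sse = (centers, clusters), sse
        centers = [c for k, c in enumerate(centers) if k != worst] + new_seeds  # Step 8


data = [(0, 0), (1, 1), (2, 2), (4, 4), (5, 5),
        (2, 1), (5, 4), (3, 6), (7, 4), (8, 5)]
centers, clusters = adaptive_kmeans(data)
print(len(centers), sorted(len(s) for s in clusters))
```

The data are the ten points of the embodiment. Under the assumed rule the loop stops when splitting into a fourth cluster lowers the total SSE by less than 30 percent, and returns the three-cluster result; a different stopping inequality could end at a different number of clusters.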

Steps 2 and 3 first find the two points x_i and x_j that are farthest apart in the data set S, which avoids the problem of K-means initial centers being chosen too close together. They then find the point x_p nearest to x_i and the point x_q nearest to x_j; the nearest neighbors are used instead of the two farthest points themselves in case those two points are isolated points. The points x_p and x_q then serve as the initial centers for the first K-means run. Steps 5 and 6 decide, from the SSE of each cluster and from the trend of the total SSE, whether to add a cluster and whether to stop clustering, so the number of clusters is determined adaptively; this is particularly suitable for applications in which the number of clusters is difficult to determine. Steps 6 to 8 split a cluster and use the maximum-minimum distance method of the invention to select the initial centers of the new clusters.
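The effect of taking nearest neighbors rather than the farthest points themselves can be seen on a small data set with one outlier. The function name `max_min_seeds` and the sample points are assumptions made for this illustration; the search over all pairs is a brute-force choice, not something the patent prescribes.

```python
import math
from itertools import combinations


def max_min_seeds(points):
    """Return (x_p, x_q): the nearest neighbors of the two farthest-apart points."""
    # Farthest pair x_i, x_j, found by brute force over all pairs.
    x_i, x_j = max(combinations(points, 2), key=lambda pair: math.dist(*pair))
    # Nearest neighbor of each endpoint, excluding the endpoint itself.
    x_p = min((p for p in points if p != x_i), key=lambda p: math.dist(p, x_i))
    x_q = min((p for p in points if p != x_j), key=lambda p: math.dist(p, x_j))
    return x_p, x_q


# Two groups plus an isolated point at (30, 30).
points = [(0, 0), (1, 0), (0, 2), (10, 10), (12, 10), (10, 11), (30, 30)]
print(max_min_seeds(points))  # → ((1, 0), (12, 10))
```

The farthest pair here is (0, 0) and the outlier (30, 30), yet neither seed is the outlier: the second seed falls back to (12, 10), a point inside the nearer group.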

Beneficial effects:

1. The initial centers, and the initial centers of newly added clusters, are not chosen at random but computed. This effectively avoids choosing isolated points as initial centers, reduces the number of iterations of the clustering process, and obtains good clustering results.

2. Whether to add a cluster and whether to stop clustering are decided from the SSE of each cluster and from the trend of the total SSE, so the number of clusters is determined adaptively. The method is particularly suitable for applications in which the number of clusters is difficult to determine.

3. The present invention is applicable to the technical field of data mining.

Description of the drawings

Fig. 1 is a flowchart of the method of the present invention.

Embodiment

The invention is described in further detail below with reference to the accompanying drawing.

For ease of description, a brief example is given below:

Given data set: x_1 = (0,0), x_2 = (1,1), x_3 = (2,2), x_4 = (4,4), x_5 = (5,5), x_6 = (2,1), x_7 = (5,4), x_8 = (3,6), x_9 = (7,4), x_10 = (8,5); threshold σ = 0.3.

As shown in Fig. 1, the invention provides an adaptive clustering method based on maximum-minimum distance and K-means, which comprises the following steps:

(1) Using formula (1), compute the pairwise distances between the data points and select the two points that are farthest apart. For the given data set, x_1 = (0,0) and x_10 = (8,5) are the two farthest-apart points.

(2) The point x_2 = (1,1) is the point nearest to x_1 = (0,0), and the point x_9 = (7,4) is the point nearest to x_10 = (8,5).

(3) Let the initial centers c_1 and c_2 store the maximum-minimum distance points x_2 = (1,1) and x_9 = (7,4), respectively.

(4) Run the K-means clustering algorithm from the initial centers c_1 and c_2 to obtain two clusters S_1 and S_2, with S_1 = {x_1, x_2, x_3, x_6} and S_2 = {x_4, x_5, x_7, x_8, x_9, x_10}. Compute SSE_1 and SSE_2, and SSE^(1) = SSE_1 + SSE_2.

(5) Because SSE_1/|S_1| < SSE_2/|S_2|, treat the data in cluster S_2 as a new data set. The maximum-minimum distance points computed for it are x_5 = (5,5) and x_9 = (7,4), which are stored in c_21 and c_22.

(6) Run the K-means clustering algorithm from the new initial centers C = {c_1, c_21, c_22} to obtain three clusters S_21, S_22, and S_23, with S_21 = {x_1, x_2, x_3, x_6}, S_22 = {x_4, x_5, x_7, x_8}, and S_23 = {x_9, x_10}. Compute SSE_21, SSE_22, and SSE_23, and SSE^(2) = SSE_21 + SSE_22 + SSE_23. Because [inequality missing in source], clustering terminates.
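The SSE values that the example refers to but does not print can be recomputed from the clusters listed in steps (4) and (6). The sketch below assumes that each cluster center is the centroid of its listed members, which is what K-means yields once it has converged; the helper names `centroid`, `sse`, and `pick` are hypothetical.

```python
def centroid(cluster):
    """Mean of the points in a cluster."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from the cluster's points to its centroid."""
    c = centroid(cluster)
    return sum((a - b) ** 2 for p in cluster for a, b in zip(p, c))


x = {1: (0, 0), 2: (1, 1), 3: (2, 2), 4: (4, 4), 5: (5, 5),
     6: (2, 1), 7: (5, 4), 8: (3, 6), 9: (7, 4), 10: (8, 5)}


def pick(*ids):
    """Collect the data points with the given indices."""
    return [x[i] for i in ids]


# Step (4): two clusters.
S1, S2 = pick(1, 2, 3, 6), pick(4, 5, 7, 8, 9, 10)
# Step (6): three clusters.
S21, S22, S23 = pick(1, 2, 3, 6), pick(4, 5, 7, 8), pick(9, 10)

print(round(sse(S1) / len(S1), 4), round(sse(S2) / len(S2), 4))  # → 1.1875 3.4444
sse_round1 = sse(S1) + sse(S2)
sse_round2 = sse(S21) + sse(S22) + sse(S23)
print(round(sse_round1, 4), round(sse_round2, 4))                # → 25.4167 11.25
```

The first line confirms the comparison SSE_1/|S_1| < SSE_2/|S_2| used in step (5); the second shows that SSE^(2) is a little under half of SSE^(1).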

Claims

1. An adaptive clustering method based on maximum-minimum distance and the K-means clustering algorithm, characterized in that clustering a data set S = {x_1, x_2, ..., x_n} with the method comprises the following steps:

Step 5: For each cluster, compute the sum of squared distances from its points to its center, and add these up to obtain the total sum of squared errors SSE^(t). If [inequality missing in source], where δ is a threshold, go to Step 9; otherwise continue;

2. The adaptive clustering method based on maximum-minimum distance and the K-means clustering algorithm according to claim 1, characterized in that the initial centers, and the initial centers of newly added clusters, are not chosen at random but computed.

3. The adaptive clustering method based on maximum-minimum distance and the K-means clustering algorithm according to claim 1, characterized in that the method is applied in the technical field of data mining.