CN108549913A - Improvement K-means clustering algorithms based on density radius - Google Patents

Improvement K-means clustering algorithms based on density radius

Info

Publication number
CN108549913A
Authority
CN
China
Prior art keywords
barycenter
sample
data set
classification
bic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810354305.0A
Other languages
Chinese (zh)
Inventor
万思思
刘丹
王永松
伍功宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kang Qiao Electronic LLC
University of Electronic Science and Technology of China
Original Assignee
Chengdu Kang Qiao Electronic LLC
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kang Qiao Electronic LLC, University of Electronic Science and Technology of China filed Critical Chengdu Kang Qiao Electronic LLC
Priority to CN201810354305.0A priority Critical patent/CN108549913A/en
Publication of CN108549913A publication Critical patent/CN108549913A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of clustering algorithms and discloses an improved K-means clustering algorithm based on density radius, solving the problems of existing K-means clustering algorithms: convergence to local optima, sensitivity to noise and outliers, and inaccurate selection of the k value. The present invention first sorts all sample points by density radius, selects the sample point with the largest density radius count as an initial value, and repeats this step until all initial points and the number of categories k have been selected; the clustering operation then begins. From the class centroids obtained after clustering, the two closest centroids are selected; the two classes they belong to are taken out and treated as a two-cluster model, and its Bayesian (BIC) score is computed; the two classes are then merged into a single class and the BIC score after merging is computed; whether the two classes should be merged is decided by comparing the scores. These steps are repeated until no further merging is needed. The present invention is suitable for clustering large data sets.

Description

Improved K-means clustering algorithm based on density radius
Technical field
The present invention relates to the field of clustering algorithms, and more particularly to an improved K-means clustering algorithm based on density radius.
Background technology
Clustering divides physical or abstract objects into several clusters according to the degree of similarity between the objects, so that data within the same cluster are highly similar while data in different clusters have low similarity. Clustering is an unsupervised learning method: it classifies unlabeled data without any prior information. The K-means algorithm is the most commonly used partitioning algorithm in cluster analysis. It partitions the data according to some similarity measure, keeping each data point as close as possible to the centroid of the cluster it belongs to, and is widely used because it is simple and efficient. At the same time it has some defects: the number of clusters (the k value) must be preset, and an inaccurate choice of k may lead to inaccurate classification; the initial values are chosen randomly, which easily leads to local optima; and it is rather sensitive to noise and outliers, which also affects the final clustering result.
Invention content
The technical problem to be solved by the present invention is to provide an improved K-means clustering algorithm based on density radius that overcomes the local optima of existing K-means clustering algorithms, their sensitivity to noise and outliers, and the inaccurate selection of the k value.
To solve the above problems, the technical solution adopted by the present invention is as follows. The improved K-means clustering algorithm based on density radius includes the following steps:
A. Calculate the pairwise distances between all sample points in the sample data set T;
B. Specify a density radius d; according to d and the pairwise distances, find, for each sample point, all sample points within its density radius d;
C. Sort the sample points in the sample data set by the number of sample points within each point's density radius d, to obtain the sorted data set T';
D. Define an empty set S; put the first sample point of T' into S, and delete from T' the first sample point together with all sample points within its density radius d;
E. Repeat step D until the set T' = ∅. When T' = ∅, the number of samples in set S is the candidate k value for the K-means clustering algorithm, and the values in S are the candidate initial values;
F. Treat S as the centroid set, each initial value being the centroid of a different class; compute the distance from every sample point in T to each class centroid in the centroid set, and label each sample point in T with the class of the centroid nearest to it;
G. Recompute the new centroid of each class from all sample points in that class, thereby updating the centroid set;
H. Judge whether the error-sum-of-squares criterion function between the centroids in the updated centroid set and the sample points in T has converged; if it has converged and the centroid set did not change in the update, go directly to step I; otherwise repeat steps F and G until the criterion function converges and the centroid set no longer changes, then go to step I;
I. Compute the pairwise distances between all centroids in the centroid set, select the two closest centroids, and take out the two classes to which they belong;
J. Judge whether the two classes taken out in step I need to be merged. If not, the algorithm terminates. If so, merge the two classes, compute the centroid of the merged class, delete the two centroids selected in step I from the centroid set, add the centroid of the merged class to the centroid set, and jump back to step I.
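Steps A through E amount to a greedy, density-ordered selection of the initial centroids. The sketch below illustrates the idea in Python; the function name, the toy data, and the stable-sort tie-breaking are illustrative choices, not fixed by the patent:

```python
import math

def density_radius_init(T, d):
    """Greedy initial-centroid selection per steps A-E: sort points by the
    number of neighbours within density radius d, then repeatedly keep the
    densest remaining point and delete its d-neighbourhood until none remain."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # steps A-B: neighbour count within radius d for every point
    counts = [sum(1 for q in T if q is not p and dist(p, q) < d) for p in T]
    # step C: the sorted data set T' (descending neighbour count)
    remaining = [p for _, p in sorted(zip(counts, T), key=lambda cp: -cp[0])]
    S = []                                  # step D: candidate initial values
    while remaining:                        # step E: repeat until T' is empty
        p = remaining[0]
        S.append(p)
        remaining = [q for q in remaining if dist(p, q) >= d]
    return S                                # len(S) is the candidate k value
```

Run on two well-separated pairs of points with d = 1, it keeps one representative per pair, so the candidate k is 2.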
Further, the method by which step J judges whether two classes need to be merged includes:
Treating the two classes to be judged as the two classes of a two-cluster model, and computing the BIC value of the two-cluster model, denoted BIC score2;
Treating the two classes to be judged as a single whole class, and computing the BIC value of that whole class, denoted BIC score1;
If |BIC score1| >= |BIC score2|, the two classes to be judged need to be merged; if |BIC score1| < |BIC score2|, the two classes to be judged do not need to be merged.
Further, the calculation formula of the BIC value is:
BIC = -2 × ln(L) + ln(s) × t
Wherein, s denotes the number of sample points in the data set; L denotes the likelihood function; t denotes the number of features: t = 2 when computing the BIC value of the two-cluster model, and t = 1 when computing the BIC value of the whole class.
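The formula and the merge rule can be written down directly. In the sketch below the likelihood values are taken as inputs, since the patent does not specify how ln(L) is computed; the function names are illustrative:

```python
import math

def bic(L, s, t):
    """BIC = -2*ln(L) + ln(s)*t, with L the likelihood, s the number of
    sample points and t the number of features (t = 2 for the two-cluster
    model, t = 1 for the merged whole class)."""
    return -2.0 * math.log(L) + math.log(s) * t

def should_merge(L_merged, L_split, s):
    """The patent's rule: merge when |BIC score1| (merged, t = 1) is at
    least |BIC score2| (two-cluster split, t = 2)."""
    return abs(bic(L_merged, s, 1)) >= abs(bic(L_split, s, 2))
```

For example, with s = e the penalty term ln(s) × t reduces to t, so bic(1.0, math.e, 2) is exactly 2.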
Further, before step A the method also includes: removing noise and outliers from the sample data set. After step J the method further includes: computing the distance between each outlier and the class centroids, and labeling each outlier with the class of its nearest centroid.
Further, the LOF (local outlier factor) method is used to remove noise and outliers from the sample data set.
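The patent only names the lof method; a compact version of the standard Local Outlier Factor (k-distance, local reachability density, then the ratio of densities) might look as follows, with k and the cut-off threshold as illustrative defaults:

```python
import math

def lof_scores(X, k=2):
    """Minimal Local Outlier Factor sketch; the patent only names the
    method, so the details here are the textbook defaults."""
    n = len(X)
    def dist(i, j):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
    D = [[dist(i, j) for j in range(n)] for i in range(n)]
    # k nearest neighbours and k-distance of every point
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])[:k]
           for i in range(n)]
    kdist = [D[i][knn[i][-1]] for i in range(n)]
    # local reachability density: inverse mean reachability distance
    lrd = [1.0 / (sum(max(kdist[j], D[i][j]) for j in knn[i]) / k)
           for i in range(n)]
    # LOF: average ratio of the neighbours' lrd to the point's own lrd
    return [sum(lrd[j] for j in knn[i]) / k / lrd[i] for i in range(n)]

def remove_outliers(X, k=2, threshold=1.5):
    scores = lof_scores(X, k)
    kept = [p for p, s in zip(X, scores) if s <= threshold]
    removed = [p for p, s in zip(X, scores) if s > threshold]
    return kept, removed
```

Points inside a cluster score near 1; a far-away point scores well above the threshold and is removed.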
Further, after noise and outliers are removed from the sample data set, and before step A, the method also includes normalizing the sample data set; the sample coordinates after normalization are x_{i,j} ∈ [0, 1], computed as
x_{i,j} = x_{i,j} / max_{1≤p≤m} x_{p,j}
Wherein, m denotes the number of sample points and v denotes the dimension.
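The normalization is a per-dimension max scaling; a direct sketch (assuming non-negative coordinates, so the result lands in [0, 1]):

```python
def normalize(T):
    """Per-dimension max scaling: every coordinate is divided by the
    maximum of its dimension over all sample points."""
    v = len(T[0])
    maxima = [max(row[j] for row in T) for j in range(v)]
    return [[row[j] / maxima[j] for j in range(v)] for row in T]
```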
Further, when steps A, F and I calculate distances, the Euclidean distance formula is used:
d(n_i, n_j) = sqrt(Σ_{l=1}^{v} (x_{i,l} - x_{j,l})²)
Further, the formula by which step G calculates the centroid coordinates is:
Z_i = (1/|C_i|) Σ_{n_j ∈ C_i} n_j
Wherein, Z_i denotes the coordinates of the i-th centroid and C_i denotes the set of sample points in the i-th class.
Further, in step H, the formula of the error-sum-of-squares criterion function between the centroids in the centroid set and the sample points in the sample data set T is:
J = Σ_{j=1}^{k} Σ_{n_i ∈ C_j} ||n_i - Z_j||²
Wherein, k is the number of centroids in the centroid set, C_j is the set of sample points in the j-th class, and Z_j denotes the j-th centroid.
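The criterion function is the usual within-cluster sum of squared errors; as a short sketch (labels[i] gives the class index of sample T[i], a representation not fixed by the patent):

```python
def sse(T, centroids, labels):
    """Error-sum-of-squares criterion J = Σ_j Σ_{n_i in C_j} ||n_i - Z_j||²."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[labels[i]]))
               for i, p in enumerate(T))
```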
The beneficial effects of the invention are as follows. The present invention chooses the initial values for clustering by density radius: the larger the density of a sample point, the more likely it is to be a class centroid. All sample points are therefore first sorted by density radius; the sample point with the largest density is chosen as an initial value and all sample points within its radius are deleted; these steps are repeated until all initial points and the number of categories k have been selected, after which the clustering operation is carried out. Since the initial values found in this way cover all points of higher density, the resulting k value is greater than or equal to the true k value, and k must be determined further. From the class centroids obtained after clustering, the two closest centroids are selected; the two classes they belong to are taken out and treated as a two-cluster model, and its Bayesian (BIC) score is computed; the two classes are then merged into a single class and the BIC score after merging is computed; whether the two classes should be merged is decided by comparing the scores. These steps are repeated until no merging is needed. Because the method first covers all points of higher density, it effectively avoids the local optima found in existing clustering algorithms.
Description of the drawings
Fig. 1 is the flow chart of processing the initial data set in the present invention;
Fig. 2 is the flow chart of preliminarily determining the k value and initial values according to density radius;
Fig. 3 is the flow chart of preliminary clustering;
Fig. 4 is the flow chart of optimizing the classification (k value);
Fig. 5 is the overall flow chart of the present invention.
Specific implementation mode
The present invention is mainly intended to overcome the shortcomings of some existing clustering algorithms. It proposes a new clustering algorithm that solves the local optima of existing clustering algorithms, their sensitivity to noise and outliers, and the inaccurate selection of the k value. The invention mainly comprises the following steps: processing the initial data set; preliminarily determining the number of classes k and the initial values according to density radius; clustering; and determining the number of centroids. The technical solution adopted by the present invention is described in detail below in conjunction with Figs. 1-4.
One. Processing the initial data set.
Since the initial data set may contain noise and outliers, and these points strongly influence the selection of the initial values and the k value, the outliers should be removed before clustering; the remaining sample points are clustered, and finally the outliers are assigned to the appropriate classes according to the clustering result. Outlier removal is performed with the LOF algorithm, which judges whether a point is an outlier by comparing its density with that of its neighborhood points: the lower a point's density, the more likely it is to be an outlier. The density is computed from the point's k-distance neighborhood, i.e. the k-distance of the point p and the set of all points within that k-distance. The closer the neighbors, the higher the density; the farther the neighbors, the lower the density. Points of low density are then identified and removed as outliers.
Then, to restrict all data to a fixed range and simplify subsequent calculations, the remaining sample points are normalized so that the coordinates of all sample points in every dimension lie within [0, 1].
After normalization the pairwise distances between sample points are computed, and for each sample point the number of points within a given distance, the density radius, is counted. The sample points are then sorted in descending order of this count. The processing flow of the initial data set in the present invention is shown in Fig. 1.
Two. Preliminarily determining the number of classes k and the initial values according to density radius.
The basic idea of the present invention is to select initial points according to density, i.e. the number of points within the density radius of each point computed in the previous step: the larger the density, the more likely the point is a class centroid. Following this idea, with the sample points already sorted as above, the point ranked first (the point with the largest density) is set as one of the initial values, and all points within its density radius are deleted. After deletion the operation is repeated: the sample point with the largest density is chosen from the remaining points and all points within its density radius are deleted, until the data set is empty. The points chosen in this way are preliminarily set as initial points, and their number is set as the candidate number of classes k. The flow of preliminarily determining the k value and initial values is shown in Fig. 2.
Three. Clustering and determining the number of centroids.
The k points computed in the above steps cover all points of higher density, i.e. a sample set that should belong to one class may have been split into several classes. The k value found is therefore greater than or equal to the true k value (the number of categories), and the range of k must be narrowed.
Preliminary clustering is first carried out on the sample points with the k value and initial values obtained in the previous step, and new class centroids are found after the preliminary clustering. The flow of preliminary clustering, shown in Fig. 3, is: (1) compute the distance from every sample point in the sample data set T to each class centroid in the centroid set; (2) label each sample point in T with the class of its nearest centroid; (3) recompute the new centroid of each class from all sample points in that class, thereby updating the centroid set; (4) judge whether the error-sum-of-squares criterion function between the updated centroids and the sample points in T has converged; if it has converged and the centroid set did not change in the update, the preliminary clustering ends; otherwise steps (2) and (3) are repeated until the criterion function converges and the centroid set no longer changes.
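The preliminary-clustering flow of Fig. 3 is a plain Lloyd iteration; a minimal sketch (the names and the max_iter safeguard are illustrative):

```python
def preliminary_cluster(T, centroids, max_iter=100):
    """Lloyd iteration for steps (1)-(4) of the preliminary clustering:
    assign each point to its nearest centroid, recompute the centroids,
    and stop once the centroid set no longer changes."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    Z = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # steps (1)-(2): label every point with its nearest centroid
        labels = [min(range(len(Z)), key=lambda j: dist2(p, Z[j])) for p in T]
        # step (3): recompute each class centroid from its members
        newZ = []
        for j in range(len(Z)):
            members = [p for p, l in zip(T, labels) if l == j]
            newZ.append([sum(c) / len(members) for c in zip(*members)]
                        if members else Z[j])
        if newZ == Z:            # step (4): centroid set unchanged, converged
            break
        Z = newZ
    return Z, labels
```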
If a sample set belonging to one class has been split into several classes, their centroids will also be close to one another, so the classification must be optimized. As shown in Fig. 4, the present invention first computes the distances between all centroids, then takes out the sample sets of the two closest centroids, treats them as a two-cluster model, and computes its BIC score. The two classes are then merged into one class and its BIC score is computed. According to the two BIC scores, it is judged whether the two classes should be merged; if they are merged the above steps are repeated, otherwise the current class division is the optimal division. Since the initial value points cover all points of higher density and are then gradually optimized, this method avoids the local optima found in some existing clustering methods.
Finally, the previously removed outliers are added back: the distance between each outlier and the class centroids is computed, and the outlier is labeled with the class of its nearest centroid.
Combining the above, the overall flow chart shown in Fig. 5 is obtained.
Embodiment
The embodiment provides an improved K-means clustering algorithm based on density radius, comprising the following steps:
1. Data set preparation: suppose the data set contains m sample points, each of dimension v, where v ∈ Z*. The data set is denoted T = {n_1, n_2, …, n_m}, where n_i denotes a sample point and m the number of sample points; the coordinates of sample point n_i are denoted (x_{i,1}, x_{i,2}, …, x_{i,v}), where v denotes the dimension;
2. Data preprocessing: noise and outliers are removed with the LOF method;
3. Data normalization: every coordinate of each sample point is divided by the maximum value of that coordinate's dimension over all sample points, as shown in formula (1), so that the normalized sample coordinates satisfy x_{i,j} ∈ [0, 1]:
x_{i,j} = x_{i,j} / max_{1≤p≤m} x_{p,j}   (1)
4. After normalization, the Euclidean distances between all pairs of sample points are computed, where the distance d(n_i, n_j) between the i-th sample point n_i and the j-th sample point n_j is calculated as shown in formula (2):
d(n_i, n_j) = sqrt(Σ_{l=1}^{v} (x_{i,l} - x_{j,l})²)   (2)
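Formula (2) in code form, as an illustrative sketch returning the full pairwise-distance matrix used by the following steps:

```python
import math

def pairwise_distances(T):
    """Euclidean distances of formula (2):
    d(n_i, n_j) = sqrt(sum_l (x_{i,l} - x_{j,l})^2)."""
    n = len(T)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(T[i], T[j])))
             for j in range(n)] for i in range(n)]
```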
5. A density radius d is specified. According to the density radius d and the pairwise distances between sample points, all sample points within the density radius d of each sample point are found, including their values and their number; the number of all points within the density radius of sample point n_i is recorded;
6. The sample points in the sample data set are sorted from high to low by the number of sample points within each point's density radius d, obtaining the sorted data set T', denoted T' = {n'_1, n'_2, …, n'_m}, which contains the same points as T;
7. An empty set S is defined to store the candidate initial values. The first element n'_i of the data set T' (n'_1 on the first execution) is put into S, and the element n'_i together with all sample points within its density radius d, i.e. all samples n'_j, j ∈ [2, m], whose distance to n'_i is less than d, are deleted from T';
8. The first element n'_i of the reduced data set T' is chosen again, and step 7 is repeated in this way until the set T' = ∅. When T' = ∅, the number of samples in set S is the candidate k value for the K-means clustering algorithm, and the values in S are the candidate initial values; the set is denoted S = {s_1, s_2, …, s_{m'}}, where m' ≤ m;
9. The set S = {s_1, s_2, …, s_{m'}} is treated as the centroid set, each initial value being the centroid of a different class. Using the Euclidean distance formula, the distance from every sample point n_i in the sample data set T to each class centroid in the centroid set is computed, and each sample point in T is labeled with the class of the centroid nearest to it;
10. The new centroid of each class is recomputed from all sample points in that class, updating the centroid set S. The formula for the centroid coordinates is:
Z_i = (1/|C_i|) Σ_{n_j ∈ C_i} n_j
Wherein, Z_i denotes the coordinates of the i-th centroid and C_i denotes the set of sample points in the i-th class;
11. It is judged whether the error-sum-of-squares criterion function between the centroids in the updated centroid set S and the sample points n_i in the sample data set T has converged. If it has converged and the centroid set S did not change in the update, go directly to step 12; otherwise steps 9 and 10 are repeated until the criterion function converges and the centroid set S no longer changes, then go to step 12.
Wherein, the formula of the error-sum-of-squares criterion function between the centroids in the centroid set and the sample points in the sample data set T is:
J = Σ_{j=1}^{k} Σ_{n_i ∈ C_j} ||n_i - Z_j||²
In the formula, k is the number of centroids in the centroid set, C_j is the set of sample points in the j-th class, and Z_j denotes the j-th centroid.
12. Using the Euclidean distance formula, the distances between all pairs of centroids in the centroid set are computed, the two closest centroids are selected, and the two classes to which the two closest centroids belong are taken out;
13. It is judged whether the two classes taken out in step 12 need to be merged. If not, the number of elements in the centroid set S is the number of classes, the values in S are the class centroids, and the procedure continues to step 14. If they need to be merged, the two classes taken out in step 12 are merged, the centroid of the merged class is computed, the two centroids selected in step 12 are deleted from the centroid set S, the centroid of the merged class is added to S, and the procedure jumps back to step 12;
In this step, the method of judging whether the two classes taken out in step 12 should be merged is as follows:
i. The two classes to be judged are treated as the two classes of a two-cluster model, and the BIC value (Bayesian score) of the two-cluster model is computed, denoted BIC score2;
ii. The two classes to be judged are treated as a single whole class, and the BIC value of that whole class is computed, denoted BIC score1;
When computing the BIC values in steps i and ii, the formula is BIC = -2 × ln(L) + ln(s) × t, where s denotes the number of sample points in the data set, L denotes the likelihood function, and t denotes the number of features: t = 2 when computing the BIC value of the two-cluster model, and t = 1 when computing the BIC value of the whole class;
iii. If |BIC score1| >= |BIC score2|, the two classes taken out in step 12 need to be merged; if |BIC score1| < |BIC score2|, the two classes taken out in step 12 do not need to be merged;
14. Finally, the previously removed outliers are added back: the Euclidean distance between each outlier and the class centroids is computed, and each outlier is labeled with the class of its nearest centroid.
The above describes the basic principles and main features of the present invention. The description in the specification only illustrates the principles of the invention; various changes and improvements may be made to the invention without departing from its spirit and scope, and all such changes and improvements fall within the protection scope of the claimed invention.

Claims (9)

1. An improved K-means clustering algorithm based on density radius, which is characterized by comprising the following steps:
A. calculating the pairwise distances between all sample points in a sample data set T;
B. specifying a density radius d, and finding, according to d and the pairwise distances between sample points, all sample points within the density radius d of each sample point;
C. sorting the sample points in the sample data set by the number of sample points within each point's density radius d, to obtain the sorted data set T';
D. defining an empty set S, putting the first sample point of T' into S, and deleting from T' the first sample point together with all sample points within its density radius d;
E. repeating step D until the set T' = ∅; when T' = ∅, the number of samples in set S is the candidate k value for the K-means clustering algorithm, and the values in S are the candidate initial values;
F. treating S as the centroid set, each initial value being the centroid of a different class, calculating the distance from every sample point in T to each class centroid in the centroid set, and labeling each sample point in T with the class of the centroid nearest to it;
G. recomputing the new centroid of each class from all sample points in that class, thereby updating the centroid set;
H. judging whether the error-sum-of-squares criterion function between the centroids in the updated centroid set and the sample points in T has converged; if it has converged and the centroid set did not change in the update, going directly to step I; otherwise repeating steps F and G until the criterion function converges and the centroid set no longer changes, then going to step I;
I. calculating the pairwise distances between all centroids in the centroid set, selecting the two closest centroids, and taking out the two classes to which they belong;
J. judging whether the two classes taken out in step I need to be merged; if not, the algorithm terminates; if so, merging the two classes taken out in step I, calculating the centroid of the merged class, deleting the two centroids selected in step I from the centroid set, adding the centroid of the merged class to the centroid set, and jumping back to step I.
2. The improved K-means clustering algorithm based on density radius as claimed in claim 1, which is characterized in that the method by which step J judges whether two classes need to be merged includes:
treating the two classes to be judged as the two classes of a two-cluster model, and computing the BIC value of the two-cluster model, denoted BIC score2;
treating the two classes to be judged as a single whole class, and computing the BIC value of that whole class, denoted BIC score1;
if |BIC score1| >= |BIC score2|, the two classes to be judged need to be merged; if |BIC score1| < |BIC score2|, the two classes to be judged do not need to be merged.
3. The improved K-means clustering algorithm based on density radius as claimed in claim 2, which is characterized in that the calculation formula of the BIC value is:
BIC = -2 × ln(L) + ln(s) × t
wherein s denotes the number of sample points in the data set; L denotes the likelihood function; t denotes the number of features, t = 2 when computing the BIC value of the two-cluster model, and t = 1 when computing the BIC value of the whole class.
4. The improved K-means clustering algorithm based on density radius as claimed in claim 1 or 2, which is characterized in that before step A the method further includes: removing noise and outliers from the sample data set; and after step J the method further includes: calculating the distance between each outlier and the class centroids, and labeling each outlier with the class of its nearest centroid.
5. The improved K-means clustering algorithm based on density radius as claimed in claim 4, which is characterized in that the LOF method is used to remove noise and outliers from the sample data set.
6. The improved K-means clustering algorithm based on density radius as claimed in claim 4, which is characterized in that after noise and outliers are removed from the sample data set, and before step A, the method further includes normalizing the sample data set, the sample coordinates after normalization being x_{i,j} ∈ [0, 1], computed as
x_{i,j} = x_{i,j} / max_{1≤p≤m} x_{p,j}
wherein m denotes the number of sample points and v denotes the dimension.
7. The improved K-means clustering algorithm based on density radius as claimed in claim 6, which is characterized in that when steps A, F and I calculate distances, the Euclidean distance formula is used:
d(n_i, n_j) = sqrt(Σ_{l=1}^{v} (x_{i,l} - x_{j,l})²)
8. The improved K-means clustering algorithm based on density radius as claimed in claim 6, which is characterized in that the formula by which step G calculates the centroid coordinates is:
Z_i = (1/|C_i|) Σ_{n_j ∈ C_i} n_j
wherein Z_i denotes the coordinates of the i-th centroid and C_i denotes the set of sample points in the i-th class.
9. The improved K-means clustering algorithm based on density radius as claimed in claim 6, which is characterized in that in step H the formula of the error-sum-of-squares criterion function between the centroids in the centroid set and the sample points in the sample data set T is:
J = Σ_{j=1}^{k} Σ_{n_i ∈ C_j} ||n_i - Z_j||²
wherein k is the number of centroids in the centroid set, C_j is the set of sample points in the j-th class, and Z_j denotes the j-th centroid.
CN201810354305.0A 2018-04-19 2018-04-19 Improvement K-means clustering algorithms based on density radius Pending CN108549913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810354305.0A CN108549913A (en) 2018-04-19 2018-04-19 Improvement K-means clustering algorithms based on density radius

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810354305.0A CN108549913A (en) 2018-04-19 2018-04-19 Improvement K-means clustering algorithms based on density radius

Publications (1)

Publication Number Publication Date
CN108549913A true CN108549913A (en) 2018-09-18

Family

ID=63515608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810354305.0A Pending CN108549913A (en) 2018-04-19 2018-04-19 Improvement K-means clustering algorithms based on density radius

Country Status (1)

Country Link
CN (1) CN108549913A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382765A (en) * 2018-12-29 2020-07-07 中国移动通信集团四川有限公司 Complaint hot spot region clustering method, device, equipment and medium
CN111382765B (en) * 2018-12-29 2023-07-04 中国移动通信集团四川有限公司 Complaint hot spot area clustering method, device, equipment and medium
CN111966951A (en) * 2020-07-06 2020-11-20 东南数字经济发展研究院 User group hierarchy dividing method based on social e-commerce transaction data

Similar Documents

Publication Publication Date Title
CN109273096B (en) Medicine risk grading evaluation method based on machine learning
CN108846259A (en) A kind of gene sorting method and system based on cluster and random forests algorithm
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN110222744A (en) A kind of Naive Bayes Classification Model improved method based on attribute weight
CN108280472A (en) A kind of density peak clustering method optimized based on local density and cluster centre
CN113344019A (en) K-means algorithm for improving decision value selection initial clustering center
CN110569982A (en) Active sampling method based on meta-learning
CN110837884B (en) Effective mixed characteristic selection method based on improved binary krill swarm algorithm and information gain algorithm
CN111104398B (en) Detection method and elimination method for intelligent ship approximate repeated record
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
WO2018006631A1 (en) User level automatic segmentation method and system
CN108549913A (en) Improvement K-means clustering algorithms based on density radius
CN114119966A (en) Small sample target detection method based on multi-view learning and meta-learning
CN110399493B (en) Author disambiguation method based on incremental learning
CN110909785B (en) Multitask Triplet loss function learning method based on semantic hierarchy
CN111339385A (en) CART-based public opinion type identification method and system, storage medium and electronic equipment
CN111797267A (en) Medical image retrieval method and system, electronic device and storage medium
CN114882531A (en) Cross-domain pedestrian re-identification method based on deep learning
CN117407732A (en) Unconventional reservoir gas well yield prediction method based on antagonistic neural network
CN105760471B (en) Based on the two class text classification methods for combining convex linear perceptron
CN117195027A (en) Cluster weighted clustering integration method based on member selection
CN116662832A (en) Training sample selection method based on clustering and active learning
CN113780378B (en) Disease high risk crowd prediction device
CN107423319B (en) Junk web page detection method
CN111753084B (en) Short text feature extraction and classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180918)