CN106934415A

CN106934415A - A K-means initial cluster center selection method based on Delaunay triangulation

Info

Publication number: CN106934415A
Application number: CN201710090315.3A
Authority: CN
Inventors: 马燕; 杨杰; 韦高洁; 张相芬; 李顺宝; 张玉萍
Original assignee: Shanghai Normal University
Current assignee: Shanghai Normal University; University of Shanghai for Science and Technology
Priority date: 2017-02-20
Filing date: 2017-02-20
Publication date: 2017-07-07

Abstract

The invention discloses a K-means initial cluster center selection method based on Delaunay triangulation. The data set to be clustered is represented as a Delaunay triangulation, and a representative point is computed for each triangle. The mixed distance between two representative points is defined as the product of the sum of their densities and their Euclidean distance. The 1st initial cluster center is selected from all representative points and added to the initial cluster center set C; the 2nd initial cluster center is then selected and added to C. Next, for each remaining representative point, the mixed distance to every initial cluster center in C is computed and the minimum is kept; among all these minimum mixed distances, the representative point corresponding to the largest one is selected and added to C. Qualifying representative points continue to be selected in this way and added to C until the number of elements in C equals K.

Description

A K-means initial cluster center selection method based on Delaunay triangulation

Technical field

The present invention relates to the field of computer classification, and in particular to a K-means initial cluster center selection method based on Delaunay triangulation.

Background

Clustering is an unsupervised data analysis method that, in the absence of prior knowledge, groups samples sensibly according to their own characteristics; it is widely used in data mining. The guiding principle of clustering is that data within the same group should be as similar as possible, while data in different groups should be as dissimilar as possible. In other words, the greater the similarity within groups and the smaller the similarity between groups, the better the classification. Clustering algorithms can be divided into partition-based, density-based, hierarchical, grid-based, and model-based types. As a partition-based clustering algorithm, K-means is widely used because it is simple and efficient.

The basic steps of the K-means clustering algorithm are as follows:

Step 1: Randomly select K data objects from a data set containing n data objects as the initial cluster centers, where K (K ≥ 2) is the predetermined number of clusters.

Step 2: Assign each data object in the data set to the nearest class according to the minimum-distance principle.

Step 3: Compute the mean of the data objects in each cluster as the new cluster center.

Step 4: Repeat steps 2 and 3 until the cluster centers no longer change.
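The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `kmeans`, the `max_iter` safeguard, and the rule of keeping the old center for an empty cluster are assumptions made here, not details given in the text.

```python
import math
import random

def kmeans(points, centers, max_iter=100):
    """Plain K-means iteration: assign to the nearest center, then recompute means."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every data object to its nearest cluster center.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Step 3: the mean of each cluster becomes the new cluster center.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
random.seed(0)
initial = random.sample(points, 2)  # Step 1: K = 2 random initial centers
centers, labels = kmeans(points, initial)
```

Because step 1 is random, different seeds can start the iteration from different centers, which is the instability discussed in the following paragraphs.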

The K-means clustering algorithm is fast and simple, but because the initial cluster centers are chosen at random, the method has the following problems: 1) if the initial cluster center of one class comes from another class, the clustering result tends to fall into a local optimum and cannot reach the global optimum; 2) the clustering result depends on the choice of initial cluster centers, which makes the result unstable; 3) initial cluster centers that are too close together lead to incorrect clustering results.

To overcome these drawbacks, many researchers have proposed improved methods. The CCIA algorithm is based on the principle of data compression: it runs K-means on each attribute of the data to obtain many data patterns and finally merges them. The algorithm performs well overall, but its complexity grows with the dimensionality of the data objects. Another method, based on the kd-tree, uses the density of a bounding box in place of the density of each data point. This method has the following shortcomings: first, the substitution cannot accurately express the density distribution of the data points; second, if all data points in a bounding box have equal values for some attribute, the density of that bounding box is infinite and the result is meaningless. There is also the K-means++ algorithm, which takes the distances between data points into account but has the following shortcomings: first, the first initial center is chosen at random, which makes the final result unstable; second, no density is defined for the data points, so the clustering result is easily affected by outliers.

Therefore, those skilled in the art are working to develop a K-means initial cluster center selection method based on Delaunay triangulation that overcomes the drawback of randomly selecting initial cluster centers in the traditional K-means method, improves clustering accuracy, and avoids the influence of outliers.

Summary of the invention

In view of the above defects of the prior art, the technical problem to be solved by the present invention is to overcome the drawback of randomly selecting initial cluster centers in the traditional K-means method, improve clustering accuracy, and avoid the influence of outliers.

To achieve the above object, the present invention provides a K-means initial cluster center selection method based on Delaunay triangulation, comprising the following steps:

Step 1: represent the data set to be clustered as a Delaunay triangulation, so that each data point in the data set corresponds one-to-one with a node in the Delaunay triangulation;

Step 2: compute the mean of the three vertices of each triangle in the Delaunay triangulation, and take the mean as the representative point of the triangle;

Step 3: compute the reciprocal of the area of the triangle in which each representative point lies, and take it as the density of the representative point;

Step 4: compute the sum of the densities of two representative points and the Euclidean distance between them, and take the product of the two as the mixed distance between the two representative points;

Step 5: among all representative points, select the one with the highest density as the 1st initial cluster center, and add it to the initial cluster center set C;

Step 6: select the representative point with the largest mixed distance to the 1st initial cluster center as the 2nd initial cluster center, and add it to the initial cluster center set C;

Step 7: for each remaining representative point, compute the mixed distance to every initial cluster center in C and keep the minimum; then, among all these minimum mixed distances, select the representative point corresponding to the largest one and add it to C; repeat this selection until the number of elements in C equals K.

Further, the specific method of step 1 includes:

Let the data set to be clustered be X = {x₁, x₂, ..., x_n}, containing n data objects. A Delaunay triangulation G = (V, E) is built for X, where V = {v₁, v₂, ..., v_n} is the set of nodes of G and E is the set of edges of G. Each data object x_i ∈ X corresponds one-to-one with a node v_i ∈ V, so the number of nodes in G equals the number of data objects in X, and the distance between two nodes of G equals the Euclidean distance between the corresponding data objects, i.e. d(v_i, v_j) = d(x_i, x_j).
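The text does not say how the Delaunay triangulation is constructed. As a minimal sketch for small 2-dimensional data sets, the brute-force check below keeps a triple of points as a triangle when no other point lies strictly inside its circumcircle (the defining empty-circumcircle property). The function names and the tolerance `1e-12` are assumptions made for this illustration; for data sets of realistic size a library routine such as `scipy.spatial.Delaunay` would normally be used instead.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None for collinear points."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2

def delaunay_triangles(points):
    """Brute force: keep (i, j, k) if no other point is strictly inside its circumcircle."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        cc = circumcircle(points[i], points[j], points[k])
        if cc is None:
            continue  # degenerate (collinear) triple
        ux, uy, r2 = cc
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-12
               for m, (px, py) in enumerate(points) if m not in (i, j, k)):
            triangles.append((i, j, k))
    return triangles

# Three corner points and one interior point: the interior point splits the
# outer triangle into three Delaunay triangles.
tris = delaunay_triangles([(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (1.0, 1.0)])
```

Each index in a returned triple refers back to a data object, which mirrors the one-to-one correspondence between nodes v_i and data objects x_i.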

Further, the specific method of step 2 includes:

Let the three vertices of a triangle T in the triangulation G be v_i, v_j, v_k, corresponding one-to-one with the three data objects x_i, x_j, x_k of the data set to be clustered, where x_i = (x_i1, x_i2, …, x_id), x_j = (x_j1, x_j2, …, x_jd), x_k = (x_k1, x_k2, …, x_kd), and d is the attribute dimension of the data objects. The mean of the three vertices is

$$r = \frac{x_i + x_j + x_k}{3} = \left(\frac{x_{i1}+x_{j1}+x_{k1}}{3},\ \ldots,\ \frac{x_{id}+x_{jd}+x_{kd}}{3}\right),$$

and this mean is taken as the representative point of triangle T.
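As a small illustration, the representative point is simply the coordinate-wise mean (centroid) of the three vertices. The helper name `representative_point` is hypothetical; the sample coordinates are those of x₂₈, x₂₆, x₂₅ from the embodiment.

```python
def representative_point(xi, xj, xk):
    """Coordinate-wise mean of three d-dimensional vertices."""
    return tuple((a + b + c) / 3.0 for a, b, c in zip(xi, xj, xk))

# Triangle t1 of the embodiment: x28, x26, x25.
r1 = representative_point((1.91, 0.1), (2.03, 0.03), (1.86, 0.19))
r1_rounded = tuple(round(v, 2) for v in r1)  # matches r1(1.93, 0.11) in the embodiment
```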

Further, the specific method of step 3 includes:

Let the three vertices of the triangle T containing the representative point r be v_i, v_j, v_k, corresponding one-to-one with the three data objects x_i, x_j, x_k of the data set to be clustered, where x_i = (x_i1, x_i2, …, x_id), x_j = (x_j1, x_j2, …, x_jd), x_k = (x_k1, x_k2, …, x_kd), and d is the attribute dimension of the data objects. The three side lengths of T are then

$$a = \sqrt{\sum_{l=1}^{d}(x_{il}-x_{jl})^2},\qquad b = \sqrt{\sum_{l=1}^{d}(x_{il}-x_{kl})^2},\qquad c = \sqrt{\sum_{l=1}^{d}(x_{jl}-x_{kl})^2}.$$

The semi-perimeter of the triangle is $p = \frac{a+b+c}{2}$, so the area of T is $S = \sqrt{p(p-a)(p-b)(p-c)}$. Finally, the reciprocal of the area, $\rho = \frac{1}{S}$, is taken as the density of the representative point r.
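The density computation can be sketched as follows, using the semi-perimeter form of the area (Heron's formula). The function name `triangle_density` and the choice to raise an error for a zero-area triangle are assumptions for this sketch; the text does not discuss degenerate triangles.

```python
import math

def triangle_density(xi, xj, xk):
    """Reciprocal of the triangle area, with the area computed from the side lengths."""
    a = math.dist(xi, xj)
    b = math.dist(xi, xk)
    c = math.dist(xj, xk)
    p = (a + b + c) / 2.0                      # semi-perimeter
    area = math.sqrt(max(p * (p - a) * (p - b) * (p - c), 0.0))
    if area == 0.0:
        raise ValueError("degenerate triangle has no finite density")
    return 1.0 / area

# A 3-4-5 right triangle has area 6, so its density is 1/6.
rho = triangle_density((0.0, 0.0), (3.0, 0.0), (0.0, 4.0))
```

Small triangles, which occur where data points are packed closely, receive large densities, while the large triangles that reach out to isolated points receive small ones.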

Further, the specific method of step 4 includes:

Let the densities of two representative points r₁, r₂ be ρ₁ and ρ₂, and let the Euclidean distance between r₁ and r₂ be d₁₂. The mixed distance between r₁ and r₂ is then h = (ρ₁ + ρ₂) × d₁₂.
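The mixed distance translates directly into code. The helper name `mixed_distance` and its argument order are hypothetical choices for this sketch.

```python
import math

def mixed_distance(r1, rho1, r2, rho2):
    """h = (rho1 + rho2) * Euclidean distance between r1 and r2."""
    return (rho1 + rho2) * math.dist(r1, r2)

h = mixed_distance((0.0, 0.0), 2.0, (3.0, 4.0), 1.0)  # (2 + 1) * 5
```

Because the densities multiply the distance, two representative points score highly only when they are both far apart and located in dense regions.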

Further, the specific method of step 5 includes:

Let the set of all representative points be R = {r₁, r₂, ..., r_t}, where t is the number of triangles in the constructed Delaunay triangulation. First, the densities of all representative points are computed as in step 3. Then the representative point with the highest density is selected from R as the 1st initial cluster center c₁ and added to the initial cluster center set C, i.e. C = {c₁}. This point is then removed from R and the remaining representative points are renumbered, giving R = {r₁, r₂, ..., r_{t-1}}.

Further, the specific method of step 6 includes:

The mixed distance between each representative point in R = {r₁, r₂, ..., r_{t-1}} and the first initial cluster center is computed. The representative point with the largest mixed distance is taken as the 2nd initial cluster center c₂ and added to the initial cluster center set C, i.e. C = {c₁, c₂}. This point is then removed from R and the remaining representative points are renumbered, giving R = {r₁, r₂, ..., r_{t-2}}.
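Steps 5 and 6 can be sketched together. In this sketch the representative points are identified by list index rather than being removed from R and renumbered; that bookkeeping choice, the function names, and the toy data are assumptions for illustration only.

```python
import math

def mixed_distance(p, rho_p, q, rho_q):
    return (rho_p + rho_q) * math.dist(p, q)

def first_two_centers(reps, rho):
    """Return the indices of c1 (highest density) and c2 (largest mixed distance to c1)."""
    first = max(range(len(reps)), key=lambda i: rho[i])
    rest = [i for i in range(len(reps)) if i != first]
    second = max(rest, key=lambda i: mixed_distance(reps[i], rho[i],
                                                    reps[first], rho[first]))
    return first, second

# Toy data: index 3 is the farthest point but has a very low density.
reps = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
rho = [5.0, 4.0, 3.0, 0.1]
c1, c2 = first_two_centers(reps, rho)
```

In the toy data the point at (12, 0) is farther from c₁ than the point at (10, 0), yet its low density gives it a smaller mixed distance (5.1 × 12 = 61.2 versus 8 × 10 = 80), so the denser point is chosen as c₂.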

Further, the specific method of step 7 includes:

Step 71: take r₁ from the remaining representative point set R, compute its mixed distance to every initial cluster center in the initial cluster center set C, and select the minimum of these mixed distances, denoted h_1min;

Step 72: take r₂ from R, compute its mixed distance to every initial cluster center in C, and select the minimum of these mixed distances, denoted h_2min; continue in this way until the last representative point r_{t-2} has been taken from R, its mixed distance to every initial cluster center in C has been computed, and the minimum has been selected, denoted h_(t-2)min;

Step 73: among all the minimum mixed distances h_1min, h_2min, ..., h_(t-2)min, select the representative point corresponding to the largest one and add it to the initial cluster center set C; repeat this selection of qualifying representative points until the number of elements in C equals K.
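Steps 71 to 73 amount to a "largest of the minimum distances" selection, which can be sketched as below. The function name `extend_centers`, the index-based bookkeeping, and the toy data are assumptions for this sketch rather than details taken from the text.

```python
import math

def mixed_distance(p, rho_p, q, rho_q):
    return (rho_p + rho_q) * math.dist(p, q)

def extend_centers(reps, rho, chosen, k):
    """Grow the list of chosen center indices to k by max-of-min mixed distance."""
    chosen = list(chosen)
    remaining = [i for i in range(len(reps)) if i not in chosen]
    while len(chosen) < k and remaining:
        # Steps 71-72: minimum mixed distance from each remaining point to C.
        h_min = {i: min(mixed_distance(reps[i], rho[i], reps[c], rho[c])
                        for c in chosen)
                 for i in remaining}
        # Step 73: the point with the largest minimum joins C.
        best = max(remaining, key=lambda i: h_min[i])
        chosen.append(best)
        remaining.remove(best)
    return chosen

reps = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0), (0.5, 0.0), (9.5, 0.0)]
rho = [4.0, 4.0, 4.0, 4.0, 4.0]
centers = extend_centers(reps, rho, chosen=[0, 1], k=3)
```

With equal densities the rule reduces to picking the point farthest from its nearest existing center, here the point at (5, 8).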

Technical effect

The method overcomes the drawback of randomly selecting initial cluster centers in the traditional K-means method, improves clustering accuracy, and avoids the influence of outliers.

The concept, specific structure, and resulting technical effects of the present invention are further described below with reference to the accompanying drawings, so that the objects, features, and effects of the invention can be fully understood.

Brief description of the drawings

Fig. 1 is a flow chart of a K-means initial cluster center selection method based on Delaunay triangulation according to a preferred embodiment of the present invention.

Fig. 2 is a schematic diagram of the 60 data objects of the preferred embodiment in a planar rectangular coordinate system.

Fig. 3 is a schematic diagram of the triangulation formed by the 108 triangles of the preferred embodiment in a planar rectangular coordinate system.

Fig. 4 is a schematic diagram of the 108 representative points of the preferred embodiment in a planar rectangular coordinate system.

Fig. 5 is a schematic diagram of the initial cluster centers of the preferred embodiment in a planar rectangular coordinate system.

Fig. 6 is a schematic diagram of the initial cluster centers selected by the method of the preferred embodiment on a data set with outliers.

Detailed description

To improve the accuracy of K-means clustering, the initial cluster centers can be selected with the method proposed by the present invention.

A specific implementation follows the procedure below:

Given a data set X = {x₁, x₂, ..., x_n} containing n data objects, a Delaunay triangulation is built for X and the set of representative points R = {r₁, r₂, ..., r_t} is computed, where t is the number of triangles in the constructed Delaunay triangulation. The density of each representative point is computed, and the product of the sum of the densities of two representative points and their Euclidean distance is taken as the mixed distance between them. Then, among all representative points, the one with the highest density is selected as the 1st initial cluster center and added to the initial cluster center set C; next, the representative point with the largest mixed distance to the first initial cluster center is selected as the 2nd initial cluster center and added to C. After that, for each remaining representative point, the mixed distance to every initial cluster center in C is computed and the minimum is kept; among all these minimum mixed distances, the representative point corresponding to the largest one is selected and added to C. This selection is repeated until the number of elements in C equals K.

As shown in Fig. 1, the K-means initial cluster center selection method based on Delaunay triangulation of a preferred embodiment of the present invention is implemented in the following steps:

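Step 1: represent the data set to be clustered as a Delaunay triangulation, so that each data point in the data set corresponds one-to-one with a node in the Delaunay triangulation. The specific operations are:

Let the data set to be clustered be X = {x₁, x₂, ..., x_n}, containing n data objects. A Delaunay triangulation G = (V, E) is built for X, where V = {v₁, v₂, ..., v_n} is the set of nodes of G and E is the set of edges of G. Each data object x_i ∈ X corresponds one-to-one with a node v_i ∈ V, and the distance between two nodes of G equals the Euclidean distance between the corresponding data objects, i.e. d(v_i, v_j) = d(x_i, x_j).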

Step 2: compute the mean of the three vertices of each triangle in the Delaunay triangulation, and take the mean as the representative point of the triangle. The specific operations are:

Let the three vertices of a triangle T in the triangulation G be v_i, v_j, v_k, corresponding one-to-one with the three data objects x_i, x_j, x_k of the data set to be clustered, where x_i = (x_i1, x_i2, ..., x_id), x_j = (x_j1, x_j2, ..., x_jd), x_k = (x_k1, x_k2, ..., x_kd), and d is the attribute dimension of the data objects. The mean of the three vertices is

$$r = \frac{x_i + x_j + x_k}{3},$$

and this mean is taken as the representative point of triangle T.

Step 3: compute the reciprocal of the area of the triangle in which each representative point lies, and take it as the density of the representative point. The specific operations are:

Let the three vertices of the triangle T containing the representative point r be v_i, v_j, v_k, corresponding one-to-one with the three data objects x_i, x_j, x_k of the data set to be clustered, where x_i = (x_i1, x_i2, ..., x_id), x_j = (x_j1, x_j2, ..., x_jd), x_k = (x_k1, x_k2, ..., x_kd), and d is the attribute dimension of the data objects. The three side lengths of T are then

$$a = \sqrt{\sum_{l=1}^{d}(x_{il}-x_{jl})^2},\qquad b = \sqrt{\sum_{l=1}^{d}(x_{il}-x_{kl})^2},\qquad c = \sqrt{\sum_{l=1}^{d}(x_{jl}-x_{kl})^2}.$$

With the semi-perimeter $p = \frac{a+b+c}{2}$, the area of T is $S = \sqrt{p(p-a)(p-b)(p-c)}$, and the reciprocal $\rho = \frac{1}{S}$ is taken as the density of the representative point r.

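Step 4: compute the sum of the densities of two representative points and the Euclidean distance between them, and take the product of the two as the mixed distance between the two representative points. The specific operations are:

Let the densities of two representative points r₁, r₂ be ρ₁ and ρ₂, and let the Euclidean distance between them be d₁₂. The mixed distance between r₁ and r₂ is then h = (ρ₁ + ρ₂) × d₁₂.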

Step 5: among all representative points, select the one with the highest density as the 1st initial cluster center, and add it to the initial cluster center set C. The specific operations are:

Let the set of all representative points be R = {r₁, r₂, …, r_t}, where t is the number of triangles in the constructed Delaunay triangulation. First, the densities of all representative points are computed as in step 3. Then the representative point with the highest density is selected from R as the 1st initial cluster center c₁ and added to the initial cluster center set C, i.e. C = {c₁}. This point is then removed from R and the remaining representative points are renumbered, giving R = {r₁, r₂, …, r_{t-1}}.

Step 6: select the representative point with the largest mixed distance to the 1st initial cluster center as the 2nd initial cluster center, and add it to the initial cluster center set C. The specific operations are:

The mixed distance between each representative point in R = {r₁, r₂, …, r_{t-1}} and the first initial cluster center is computed. The representative point with the largest mixed distance is taken as the 2nd initial cluster center c₂ and added to the initial cluster center set C, i.e. C = {c₁, c₂}. This point is then removed from R and the remaining representative points are renumbered, giving R = {r₁, r₂, …, r_{t-2}}.

Step 7: for each remaining representative point, compute the mixed distance to every initial cluster center in C and keep the minimum; then, among all these minimum mixed distances, select the representative point corresponding to the largest one and add it to C; repeat this selection until the number of elements in C equals K. The specific operations are:

For each representative point r_m remaining in R, the mixed distance to every initial cluster center in C is computed and the minimum is denoted h_m,min. Among all the minimum mixed distances, the representative point corresponding to the largest one is selected and added to C. This is repeated until the number of elements in C equals K.

Embodiment

(1) A data set X = {x₁, x₂, ..., x₆₀} containing 60 data objects is generated manually. The number of classes of the data set is K = 3, and each data object has 2 attributes. The attributes of all data objects are listed below:

x₁(-0.15,0.77),x₂(-0.04,0.3),x₃(0.47,-0.1),x₄(-0.27,0.14),x₅(0.25,- 0.33),x₆(0.1,0.59),x₇(-0.07,0.3),x₈(-0.33,0.25),x₉(-0.09,-0.28),x₁₀(-0.03,0.1), x₁₁(-0.46,-0.18),x₁₂(0.06,-0.1),x₁₃(-0.26,-0.18),x₁₄(-0.03,-0.32),x₁₅(0.11,- 0.29),x₁₆(-0.29,-0.07),x₁₇(-0.09,-0.54),x₁₈(0.11,0.19),x₁₉(-0.58,-0.04),x₂₀ (0.33,0.22),x₂₁(2.09,-0.16),x₂₂(2.16,-0.06),x₂₃(1.53,0.01),x₂₄(1.68,-0.02),x₂₅ (1.86,0.19),x₂₆(2.03,0.03),x₂₇(2.36,0.57),x₂₈(1.91,0.1),x₂₉(2.4,0.57),x₃₀(2.15,- 0.23),x₃₁(2.37,0.17),x₃₂(2.04,-0.08),x₃₃(1.79,0.19),x₃₄(1.53,0.19),x₃₅(2.05,- 0.69),x₃₆(2.26,-0.42),x₃₇(1.91,-0.46),x₃₈(1.83,0.13),x₃₉(1.9,0.46),x₄₀(1.65,- 0.1),x₄₁(1.26,1.8),x₄₂(1.17,2.57),x₄₃(0.67,1.66),x₄₄(1.13,2.06),x₄₅(0.76,1.52), x₄₆(1.48,1.77),x₄₇(0.99,1.81),x₄₈(1.52,2.13),x₄₉(0.87,2.3),x₅₀(1.19,2.1),x₅₁ (0.98,1.88),x₅₂(0.36,2.26),x₅₃(0.69,2.25),x₅₄(1.19,2.04),x₅₅(0.98,2.18),x₅₆ (0.65,2.13),x₅₇(0.8,1.69),x₅₈(1.08,2.24),x₅₉(0.69,1.79),x₆₀(1.31,1.81)；

For ease of presentation, each data object is treated as a point in a planar rectangular coordinate system, with its 2 attributes as the 2 coordinates of the point. Fig. 2 shows the 60 data objects as points in the planar rectangular coordinate system.

(2) A Delaunay triangulation G = (V, E) is built for the data set X = {x₁, x₂, ..., x₆₀}, as shown in Fig. 3. All Delaunay triangles constructed for X are listed below:

t₁(v₂₈,v₂₆,v₂₅),t₂(v₁₂,v₁₃,v₉),t₃(v₈,v₇,v₁),t₄(v₁₂,v₁₆,v₁₃),t₅(v₄,v₁₆,v₁₀),t₆ (v₁₈,v₂,v₁₀),t₇(v₁₂,v₁₀,v₁₆),t₈(v₇,v₈,v₄),t₉(v₂,v₆,v₇),t₁₀(v₆,v₁,v₇),t₁₁(v₄₃,v₅₉,v₅₂),t₁₂ (v₁₂,v₅,v₃),t₁₃(v₃₉,v₃₄,v₃₃),t₁₄(v₁₉,v₁₁,v₁₆),t₁₅(v₅₅,v₅₈,v₄₉),t₁₆(v₅₃,v₅₂,v₅₆),t₁₇(v₂₀,v₃, v₃₄),t₁₈(v₅₄,v₆₀,v₄₈),t₁₉(v₃,v₂₃,v₃₄),t₂₀(v₂₆,v₃₉,v₂₅),t₂₁(v₃₄,v₄₅,v₂₀),t₂₂(v₃₃,v₃₈,v₂₅), t₂₃(v₂₄,v₄₀,v₃₂),t₂₄(v₃₅,v₅,v₁₇),t₂₅(v₃₉,v₂₆,v₃₁),t₂₆(v₃₂,v₂₈,v₂₄),t₂₇(v₁₃,v₁₆,v₁₁),t₂₈ (v₁₇,v₁₃,v₁₁),t₂₉(v₉,v₁₄,v₁₂),t₃₀(v₁₇,v₉,v₁₃),t₃₁(v₁₄,v₁₅,v₁₂),t₃₂(v₁₇,v₁₄,v₉),t₃₃(v₁₇,v₁₅, v₁₄),t₃₄(v₁₇,v₅,v₁₅),t₃₅(v₃,v₃₇,v₄₀),t₃₆(v₅,v₁₂,v₁₅),t₃₇(v₁₈,v₁₀,v₁₂),t₃₈(v₁₈,v₂₀,v₆),t₃₉ (v₁₈,v₁₂,v₂₀),t₄₀(v₁₂,v₃,v₂₀),t₄₁(v₅,v₃₇,v₃),t₄₂(v₁,v₁₉,v₈),t₄₃(v₁₆,v₄,v₁₉),t₄₄(v₆,v₄₅, v₁),t₄₅(v₇,v₄,v₁₀),t₄₆(v₈,v₁₉,v₄),t₄₇(v₅₂,v₁,v₄₃),t₄₈(v₅₂,v₁₉,v₁),t₄₉(v₁₀,v₂,v₇),t₅₀(v₁₈, v₆,v₂),t₅₁(v₅₅,v₄₉,v₅₆),t₅₂(v₂₀,v₄₅,v₆),t₅₃(v₅₉,v₅₆,v₅₂),t₅₄(v₄₅,v₅₇,v₄₃),t₅₅(v₄₄,v₅₄,v₅₀), t₅₆(v₄₉,v₅₃,v₅₆),t₅₇(v₄₉,v₄₂,v₅₃),t₅₈(v₄₂,v₄₉,v₅₈),t₅₉(v₄₂,v₅₂,v₅₃),t₆₀(v₅₉,v₅₁,v₅₆),t₆₁ (v₅₉,v₅₇,v₅₁),t₆₂(v₄₇,v₄₅,v₄₁),t₆₃(v₄₃,v₁,v₄₅),t₆₄(v₄₇,v₅₇,v₄₅),t₆₅(v₅₉,v₄₃,v₅₇),t₆₆(v₅₄, v₅₁,v₄₁),t₆₇(v₅₅,v₅₆,v₅₁),t₆₈(v₅₁,v₄₇,v₄₁),t₆₉(v₅₁,v₅₇,v₄₇),t₇₀(v₅₅,v₄₄,v₅₈),t₇₁(v₅₁,v₄₄, v₅₅),t₇₂(v₅₁,v₅₄,v₄₄),t₇₃(v₅₀,v₄₈,v₅₈),t₇₄(v₄₆,v₄₈,v₆₀),t₇₅(v₄₄,v₅₀,v₅₈),t₇₆(v₅₄,v₄₁,v₆₀), t₇₇(v₄₅,v₄₆,v₄₁),t₇₈(v₄₁,v₄₆,v₆₀),t₇₉(v₅₀,v₅₄,v₄₈),t₈₀(v₅₈,v₄₈,v₄₂),t₈₁(v₂₂,v₂₆,v₃₂),t₈₂ (v₄₆,v₃₉,v₂₇),t₈₃(v₄₆,v₄₅,v₃₉),t₈₄(v₄₆,v₂₇,v₂₉),t₈₅(v₄₆,v₂₉,v₄₈),t₈₆(v₃₁,v₂₆,v₂₂),t₈₇(v₃₈, v₂₈,v₂₅),t₈₈(v₃₈,v₃₃,v₂₄),t₈₉(v₃₁,v₂₂,v₃₀),t₉₀(v₃₁,v₂₇,v₃₉),t₉₁(v₂₇,v₃₁,v₂₉),t₉₂(v₃₀,v₃₇, v₃₆),t₉₃(v₂₄,v₃₄,v₂₃),t₉₄(v₃₉,v₄₅,v₃₄),t₉₅(v₃₈,v₂₄,v₂₈),t₉₆(v₂₃,v₃,v₄₀),t₉₇(v₃₉,v₃₃,v₂₅), t₉₈(v₃₄,v₂₄,v₃₃),t₉₉(v₃₂,v₄₀,v₃₇),t₁₀₀(v₂₄,v₂₃,v₄₀),t₁₀₁(v₂₁,v₃₇,v₃₀),t₁₀₂(v₂₂,v₃₂,v₂₁),t₁₀₃ (v₂₆,v₂₈,v₃₂),t₁₀₄(v₃₇,v₃₅,v₃₆),t₁₀₅(v₃₇,v₅,v₃₅),t₁₀₆(v₂₂,v₂₁,v₃₀),t₁₀₇(v₃₂,v₃₇,v₂₁),t₁₀₈ (v₃₀,v₃₆,v₃₁)；

The Delaunay triangulation contains 108 triangles in total. For example, triangle t₁ consists of the three nodes v₂₈, v₂₆, v₂₅. Because each data object x_i ∈ X corresponds one-to-one with a node v_i ∈ V of the triangulation G, t₁ can be regarded as consisting of the three data objects x₂₈, x₂₆, x₂₅, whose 2-dimensional attributes are (1.91, 0.1), (2.03, 0.03) and (1.86, 0.19) respectively. For convenience below, the nodes of all the Delaunay triangles above are rewritten as data objects:

t₁(x₂₈,x₂₆,x₂₅),t₂(x₁₂,x₁₃,x₉),t₃(x₈,x₇,x₁),t₄(x₁₂,x₁₆,x₁₃),t₅(x₄,x₁₆,x₁₀),t₆ (x₁₈,x₂,x₁₀),t₇(x₁₂,x₁₀,x₁₆),t₈(x₇,x₈,x₄),t₉(x₂,x₆,x₇),t₁₀(x₆,x₁,x₇),t₁₁(x₄₃,x₅₉,x₅₂),t₁₂ (x₁₂,x₅,x₃),t₁₃(x₃₉,x₃₄,x₃₃),t₁₄(x₁₉,x₁₁,x₁₆),t₁₅(x₅₅,x₅₈,x₄₉),t₁₆(x₅₃,x₅₂,x₅₆),t₁₇(x₂₀,x₃, x₃₄),t₁₈(x₅₄,x₆₀,x₄₈),t₁₉(x₃,x₂₃,x₃₄),t₂₀(x₂₆,x₃₉,x₂₅),t₂₁(x₃₄,x₄₅,x₂₀),t₂₂(x₃₃,x₃₈,x₂₅), t₂₃(x₂₄,x₄₀,x₃₂),t₂₄(x₃₅,x₅,x₁₇),t₂₅(x₃₉,x₂₆,x₃₁),t₂₆(x₃₂,x₂₈,x₂₄),t₂₇(x₁₃,x₁₆,x₁₁),t₂₈ (x₁₇,x₁₃,x₁₁),t₂₉(x₉,x₁₄,x₁₂),t₃₀(x₁₇,x₉,x₁₃),t₃₁(x₁₄,x₁₅,x₁₂),t₃₂(x₁₇,x₁₄,x₉),t₃₃(x₁₇,x₁₅, x₁₄),t₃₄(x₁₇,x₅,x₁₅),t₃₅(x₃,x₃₇,x₄₀),t₃₆(x₅,x₁₂,x₁₅),t₃₇(x₁₈,x₁₀,x₁₂),t₃₈(x₁₈,x₂₀,x₆),t₃₉ (x₁₈,x₁₂,x₂₀),t₄₀(x₁₂,x₃,x₂₀),t₄₁(x₅,x₃₇,x₃),t₄₂(x₁,x₁₉,x₈),t₄₃(x₁₆,x₄,x₁₉),t₄₄(x₆,x₄₅, x₁),t₄₅(x₇,x₄,x₁₀),t₄₆(x₈,x₁₉,x₄),t₄₇(x₅₂,x₁,x₄₃),t₄₈(x₅₂,x₁₉,x₁),t₄₉(x₁₀,x₂,x₇),t₅₀(x₁₈, x₆,x₂),t₅₁(x₅₅,x₄₉,x₅₆),t₅₂(x₂₀,x₄₅,x₆),t₅₃(x₅₉,x₅₆,x₅₂),t₅₄(x₄₅,x₅₇,x₄₃),t₅₅(x₄₄,x₅₄,x₅₀), t₅₆(x₄₉,x₅₃,x₅₆),t₅₇(x₄₉,x₄₂,x₅₃),t₅₈(x₄₂,x₄₉,x₅₈),t₅₉(x₄₂,x₅₂,x₅₃),t₆₀(x₅₉,x₅₁,x₅₆),t₆₁ (x₅₉,x₅₇,x₅₁),t₆₂(x₄₇,x₄₅,x₄₁),t₆₃(x₄₃,x₁,x₄₅),t₆₄(x₄₇,x₅₇,x₄₅),t₆₅(x₅₉,x₄₃,x₅₇),t₆₆(x₅₄, x₅₁,x₄₁),t₆₇(x₅₅,x₅₆,x₅₁),t₆₈(x₅₁,x₄₇,x₄₁),t₆₉(x₅₁,x₅₇,x₄₇),t₇₀(x₅₅,x₄₄,x₅₈),t₇₁(x₅₁,x₄₄, x₅₅),t₇₂(x₅₁,x₅₄,x₄₄),t₇₃(x₅₀,x₄₈,x₅₈),t₇₄(x₄₆,x₄₈,x₆₀),t₇₅(x₄₄,x₅₀,x₅₈),t₇₆(x₅₄,x₄₁,x₆₀), t₇₇(x₄₅,x₄₆,x₄₁),t₇₈(x₄₁,x₄₆,x₆₀),t₇₉(x₅₀,x₅₄,x₄₈),t₈₀(x₅₈,x₄₈,x₄₂),t₈₁(x₂₂,x₂₆,x₃₂),t₈₂ (x₄₆,x₃₉,x₂₇),t₈₃(x₄₆,x₄₅,x₃₉),t₈₄(x₄₆,x₂₇,x₂₉),t₈₅(x₄₆,x₂₉,x₄₈),t₈₆(x₃₁,x₂₆,x₂₂),t₈₇(x₃₈, x₂₈,x₂₅),t₈₈(x₃₈,x₃₃,x₂₄),t₈₉(x₃₁,x₂₂,x₃₀),t₉₀(x₃₁,x₂₇,x₃₉),t₉₁(x₂₇,x₃₁,x₂₉),t₉₂(x₃₀,x₃₇, x₃₆),t₉₃(x₂₄,x₃₄,x₂₃),t₉₄(x₃₉,x₄₅,x₃₄),t₉₅(x₃₈,x₂₄,x₂₈),t₉₆(x₂₃,x₃,x₄₀),t₉₇(x₃₉,x₃₃,x₂₅), t₉₈(x₃₄,x₂₄,x₃₃),t₉₉(x₃₂,x₄₀,x₃₇),t₁₀₀(x₂₄,x₂₃,x₄₀),t₁₀₁(x₂₁,x₃₇,x₃₀),t₁₀₂(x₂₂,x₃₂,x₂₁),t₁₀₃ (x₂₆,x₂₈,x₃₂),t₁₀₄(x₃₇,x₃₅,x₃₆),t₁₀₅(x₃₇,x₅,x₃₅),t₁₀₆(x₂₂,x₂₁,x₃₀),t₁₀₇(x₃₂,x₃₇,x₂₁),t₁₀₈ (x₃₀,x₃₆,x₃₁)。

(3) The mean of the three vertices of each of the 108 triangles t₁-t₁₀₈ is computed and taken as the representative point of that triangle, as shown in Fig. 4. For example, triangle t₁ consists of the three data objects x₂₈, x₂₆, x₂₅, whose 2-dimensional attributes are (1.91, 0.1), (2.03, 0.03) and (1.86, 0.19) respectively, so the representative point of t₁ is

$$r_1 = \left(\frac{1.91+2.03+1.86}{3},\ \frac{0.1+0.03+0.19}{3}\right) \approx (1.93,\ 0.11).$$

The representative points of t₁-t₁₀₈ are listed below:

r₁(1.93,0.11),r₂(-0.10,-0.19),r₃(-0.18,0.44),r₄(-0.16,-0.12),r₅(-0.20, 0.06),r₆(0.01,0.20),r₇(-0.09,-0.02),r₈(-0.22,0.23),r₉(0.00,0.40),r₁₀(-0.04, 0.55),r₁₁(0.57,1.90),r₁₂(0.26,-0.18),r₁₃(1.74,0.28),r₁₄(-0.44,-0.10),r₁₅(0.98, 2.24),r₁₆(0.57,2.21),r₁₇(0.78,0.10),r₁₈(1.34,1.99),r₁₉(1.18,0.03),r₂₀(1.93, 0.23),r₂₁(0.87,0.64),r₂₂(1.83,0.17),r₂₃(1.79,-0.07),r₂₄(0.74,-0.52),r₂₅(2.10, 0.22),r₂₆(1.88,0.00),r₂₇(-0.34,-0.14),r₂₈(-0.27,-0.30),r₂₉(-0.02,-0.23),r₃₀(- 0.15,-0.33),r₃₁(0.05,-0.24),r₃₂(-0.07,-0.38),r₃₃(0.00,-0.38),r₃₄(0.09,-0.39),r₃₅ (1.34,-0.22),r₃₆(0.14,-0.24),r₃₇(0.05,0.06),r₃₈(0.18,0.33),r₃₉(0.17,0.10),r₄₀ (0.29,0.01),r₄₁(0.88,-0.30),r₄₂(-0.35,0.33),r₄₃(-0.38,0.01),r₄₄(0.24,0.96),r₄₅(- 0.12,0.18),r₄₆(-0.39,0.12),r₄₇(0.29,1.56),r₄₈(-0.12,1.00),r₄₉(-0.05,0.23),r₅₀ (0.06,0.36),r₅₁(0.83,2.20),r₅₂(0.40,0.78),r₅₃(0.57,2.06),r₅₄(0.74,1.62),r₅₅ (1.17,2.07),r₅₆(0.74,2.23),r₅₇(0.91,2.37),r₅₈(1.04,2.37),r₅₉(0.74,2.36),r₆₀ (0.77,1.93),r₆₁(0.82,1.79),r₆₂(1.00,1.71),r₆₃(0.43,1.32),r₆₄(0.85,1.67),r₆₅ (0.72,1.71),r₆₆(1.14,1.91),r₆₇(0.87,2.06),r₆₈(1.08,1.83),r₆₉(0.92,1.79),r₇₀ (1.06,2.16),r₇₁(1.03,2.04),r₇₂(1.10,1.99),r₇₃(1.26,2.16),r₇₄(1.44,1.90),r₇₅ (1.13,2.13),r₇₆(1.25,1.88),r₇₇(1.17,1.70),r₇₈(1.35,1.79),r₇₉(1.30,2.09),r₈₀ (1.26,2.31),r₈₁(2.08,-0.04),r₈₂(1.91,0.93),r₈₃(1.38,1.25),r₈₄(2.08,0.97),r₈₅ (1.80,1.49),r₈₆(2.19,0.05),r₈₇(1.87,0.14),r₈₈(1.77,0.10),r₈₉(2.23,-0.04),r₉₀ (2.21,0.40),r₉₁(2.38,0.44),r₉₂(2.11,-0.37),r₉₃(1.58,0.06),r₉₄(1.40,0.72),r₉₅ (1.81,0.07),r₉₆(1.22,-0.06),r₉₇(1.85,0.28),r₉₈(1.67,0.12),r₉₉(1.87,-0.21),r₁₀₀ (1.62,-0.04),r₁₀₁(2.05,-0.28),r₁₀₂(2.10,-0.10),r₁₀₃(1.99,0.02),r₁₀₄(2.07,-0.52), r₁₀₅(1.40,-0.49),r₁₀₆(2.13,-0.15),r₁₀₇(2.01,-0.23),r₁₀₈(2.26,-0.16)。

(4) The reciprocal of the area of the triangle containing each of the 108 representative points r₁-r₁₀₈ is computed and taken as the density of that representative point. For example, r₁(1.93, 0.11) lies in triangle t₁, which consists of the three data objects x₂₈(1.91, 0.1), x₂₆(2.03, 0.03), x₂₅(1.86, 0.19). First, the length of the side formed by x₂₈ and x₂₆ is computed:

$$a = \sqrt{(1.91-2.03)^2 + (0.1-0.03)^2} \approx 0.139.$$

Then the length of the side formed by x₂₈ and x₂₅ is computed:

$$b = \sqrt{(1.91-1.86)^2 + (0.1-0.19)^2} \approx 0.103,$$

followed by the length of the side formed by x₂₆ and x₂₅:

$$c = \sqrt{(2.03-1.86)^2 + (0.03-0.19)^2} \approx 0.233.$$

Next, the semi-perimeter of t₁ is $p = \frac{a+b+c}{2}$, so the area of t₁ is

$$S = \sqrt{p(p-a)(p-b)(p-c)} \approx 0.0036.$$

Finally, the reciprocal of the area, $\rho_1 = \frac{1}{S} \approx 277.78$, is taken as the density of the representative point r₁. The densities of the representative points r₁-r₁₀₈ computed in this way are as follows:

ρ₁=277.78, ρ₂=43.86, ρ₃=15.85, ρ₄=53.19, ρ₅=39.06, ρ₆=69.44, ρ₇=29.76, ρ₈=63.29, ρ₉=232.56, ρ₁₀=19.38, ρ₁₁=38.31, ρ₁₂=21.23, ρ₁₃=28.49, ρ₁₄=54.05, ρ₁₅= 107.53,ρ₁₆=50., ρ₁₇=5.26, ρ₁₈=23.04, ρ₁₉=10.47, ρ₂₀=38.17, ρ₂₁=1.27, ρ₂₂=476.19, ρ₂₃=65.36, ρ₂₄=4., ρ₂₅=12.17, ρ₂₆=35.09, ρ₂₇=90.91, ρ₂₈=27.78, ρ₂₉=119.05, ρ₃₀= 45.25,ρ₃₁=70.92, ρ₃₂=128.21, ρ₃₃=68.97, ρ₃₄=46.51, ρ₃₅=4.71, ρ₃₆=81.3, ρ₃₇=55.56, ρ₃₈=22.68, ρ₃₉=32.15, ρ₄₀=15.24, ρ₄₁=4.87, ρ₄₂=25.64, ρ₄₃=32.47, ρ₄₄=5.69, ρ₄₅= 43.1,ρ₄₆=44.44, ρ₄₇=2.6, ρ₄₈=8.7, ρ₄₉=333.33, ρ₅₀=33.9, ρ₅₁=44.44, ρ₅₂=4.36, ρ₅₃= 21.41,ρ₅₄=96.15, ρ₅₅=555.56, ρ₅₆=102.04, ρ₅₇=59.52, ρ₅₈=26.74, ρ₅₉=18.12, ρ₆₀= 19.57,ρ₆₁=51.28, ρ₆₂=24.81, ρ₆₃=10.27, ρ₆₄=72.99, ρ₆₅=123.46, ρ₆₆=32.47, ρ₆₇= 20.2,ρ₆₈=106.38, ρ₆₉=136.99, ρ₇₀=95.24, ρ₇₁=44.44, ρ₇₂=144.93, ρ₇₃=40.32, ρ₇₄= 31.85,ρ₇₅=156.25, ρ₇₆=156.25, ρ₇₇=26.11, ρ₇₈=526.32, ρ₇₉=101.01, ρ₈₀=12.89, ρ₈₁= 149.25,ρ₈₂=3.08, ρ₈₃=1.91, ρ₈₄=41.67, ρ₈₅=5.27, ρ₈₆=40.98, ρ₈₇=344.83, ρ₈₈= 133.33,ρ₈₉=59.88, ρ₉₀=10.8, ρ₉₁=125, ρ₉₂=28.25, ρ₉₃=74.07, ρ₉₄=2.86, ρ₉₅=120.48, ρ₉₆=15.38, ρ₉₇=106.38, ρ₉₈=36.63, ρ₉₉=13.74, ρ₁₀₀=156.25, ρ₁₀₁=65.36, ρ₁₀₂=188.68, ρ₁₀₃=158.73, ρ₁₀₄=23.2, ρ₁₀₅=5.5, ρ₁₀₆=181.82, ρ₁₀₇=68.03, ρ₁₀₈=23.31.

(5) Let the set of all representative points be R = {r₁, r₂, ..., r₁₀₈}, where 108 is the number of triangles in the constructed Delaunay triangulation. The representative point with the highest density is selected from R as the 1st initial cluster center c₁. Here r₅₅ has the largest density, ρ₅₅ = 555.56, so r₅₅(1.17, 2.07) is added to the initial cluster center set C as the 1st initial cluster center c₁, marked with "★" in Fig. 5; that is, C = {c₁} = {(1.17, 2.07)}. The point r₅₅ is then removed from R, giving R = {r₁, r₂, ..., r₅₄, r₅₆, r₅₇, ..., r₁₀₈}.

(6) The mixed distance between each representative point in R = {r₁, r₂, ..., r₅₄, r₅₆, r₅₇, ..., r₁₀₈} and the first initial cluster center is computed. For example, to compute the mixed distance between r₁ and r₅₅: first, the sum of their densities is ρ₁ + ρ₅₅ = 277.78 + 555.56 = 833.34. Next, from r₁(1.93, 0.11) and r₅₅(1.17, 2.07), their Euclidean distance is

$$d_{1(55)} = \sqrt{(1.93-1.17)^2 + (0.11-2.07)^2} \approx 2.10.$$

The mixed distance between r₁ and r₅₅ is then h₁₍₅₅₎ = (ρ₁ + ρ₅₅) × d₁₍₅₅₎ = 833.34 × 2.10 = 1750.01. Computing the mixed distance to the first initial cluster center r₅₅ in the same way for every representative point in R gives the following results:

r₁: 1750.01, r₂: 1552.5, r₃: 1211.39, r₄: 1558.4, r₅: 1444.93, r₆: 1375, r₇: 1428.18, r₈: 1429.54, r₉: 1607.76, r₁₀: 1115.38, r₁₁: 368.2, r₁₂: 1401.6, r₁₃: 1098.01, r₁₄: 1645.95, r₁₅: 165.77, r₁₆: 375.45, r₁₇: 1127.25, r₁₈: 109.93, r₁₉: 1154.7, r₂₀: 1181.52, r₂₁: 812.97, r₂₂: 2073.82, r₂₃: 1384.65, r₂₄: 1471.64, r₂₅: 1175.2, r₂₆: 1293.52, r₂₇: 1732.54, r₂₈: 1615.85, r₂₉: 1747.24, r₃₀: 1646.22, r₃₁: 1610.05, r₃₂: 1880.37, r₃₃: 1698.72, r₃₄: 1619.57, r₃₅: 1288.62, r₃₆: 1611.26, r₃₇: 1405.58, r₃₈: 1156.48, r₃₉: 1298.84, r₄₀: 1278.59, r₄₁: 1339.43, r₄₂: 1342.57, r₄₃: 1517.12, r₄₄: 813.81, r₄₅: 1370.93, r₄₆: 1500, r₄₇: 569.32, r₄₈: 947.96, r₄₉: 1964.45, r₅₀: 1202.5, r₅₁: 216, r₅₂: 839.88, r₅₃: 346.18, r₅₄: 404.06, r₅₆: 302.5, r₅₇: 246.03, r₅₈: 192.16, r₅₉: 298.31, r₆₀: 241.55, r₆₁: 273.08, r₆₂: 232.15, r₆₃: 594.12, r₆₄: 320.56, r₆₅: 393.83, r₆₆: 94.08, r₆₇: 172.73, r₆₈: 172.1, r₆₉: 263.17, r₇₀: 91.11, r₇₁: 84, r₇₂: 77.05, r₇₃: 77.46, r₇₄: 187.97, r₇₅: 49.83, r₇₆: 149.48, r₇₇: 215.22, r₇₈: 357.02, r₇₉: 85.35, r₈₀: 147.8, r₈₁: 1621.06, r₈₂: 759.75, r₈₃: 473.85, r₈₄: 854.04, r₈₅: 482.31, r₈₆: 1348.18, r₈₇: 1845.8, r₈₈: 1419.11, r₈₉: 1452.44, r₉₀: 1115.73, r₉₁: 1381.54, r₉₂: 1523.74, r₉₃: 1290.74, r₉₄: 765.04, r₉₅: 1419.68, r₉₆: 1216.1, r₉₇: 1264.31, r₉₈: 1190.3, r₉₉: 1360.63, r₁₀₀: 1537.51, r₁₀₁: 1558.51, r₁₀₂: 1756.41, r₁₀₃: 1578.58, r₁₀₄: 1585.8, r₁₀₅: 1441.92, r₁₀₆: 1784.46, r₁₀₇: 1527.8, r₁₀₈: 1435.6.

Of all these, the mixed distance between r₂₂ and r₅₅, 2073.82, is the largest. Therefore r₂₂(1.83, 0.17) is taken as the 2nd initial cluster center c₂, marked with "▲" in Fig. 5, and added to the initial cluster center set C, i.e. C = {c₁, c₂} = {(1.17, 2.07), (1.83, 0.17)}. The point r₂₂ is then removed from R, giving R = {r₁, r₂, …, r₂₁, r₂₃, …, r₅₄, r₅₆, …, r₁₀₈}.
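The two mixed distances singled out above can be checked with a few lines of Python. Rounding the Euclidean distance to two decimals before multiplying is an assumption inferred from the worked example (833.34 × 2.10); the text does not state a rounding rule, and the helper name `mixed_distance_2dp` is hypothetical.

```python
import math

def mixed_distance_2dp(p, rho_p, q, rho_q):
    """Mixed distance with the Euclidean distance rounded to two decimals first."""
    d = round(math.dist(p, q), 2)
    return (rho_p + rho_q) * d

# Representative points and densities taken from the embodiment.
r55, rho55 = (1.17, 2.07), 555.56
r1, rho1 = (1.93, 0.11), 277.78
r22, rho22 = (1.83, 0.17), 476.19

h_1_55 = mixed_distance_2dp(r1, rho1, r55, rho55)     # approx. 1750.01
h_22_55 = mixed_distance_2dp(r22, rho22, r55, rho55)  # approx. 2073.82
```

Without the rounding step the products come out slightly different in the last digits, so the listed values are best read as two-decimal approximations.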

(7) For each representative point in the set R, calculate its mixing distances to c₁ and c₂ in the initial cluster center set, and select the minimum mixing distance. For example, take r₁ from the set R and calculate its mixing distances to c₁(1.17, 2.07) and c₂(1.83, 0.17), which are 1750.01 (as obtained in step (6)) and 90.48 respectively; the minimum mixing distance is 90.48. In the same way, the mixing distances between every representative point in R and c₁ and c₂ are calculated and the minimum is selected for each point, giving the following results:

r₁:90.48, r₂:1019.3, r₃:998.84, r₄:1064.05, r₅:1045.96, r₆:993.05, r₇:976.48, r₈:1105.93, r₉:1304.1, r₁₀:946.54, r₁₁:368.2, r₁₂:800.85, r₁₃:70.66, r₁₄:1214.25, r₁₅:165.77, r₁₆:375.45, r₁₇:505.52, r₁₈:109.93, r₁₉:321.2, r₂₀:61.72, r₂₁:510.88, r₂₃:129.97, r₂₄:619.45, r₂₅:131.86, r₂₆:92.03, r₂₇:1241.95, r₂₈:1083.54, r₂₉:1125, r₃₀:1063.74, r₃₁:1001.21, r₃₂:1196.71, r₃₃:1041.26, r₃₄:956.54, r₃₅:302.97, r₃₆:970.03, r₃₇:946.52, r₃₈:828.12, r₃₉:843.84, r₄₀:761.72, r₄₁:509.92, r₄₂:1099.01, r₄₃:1129.23, r₄₄:813.81, r₄₅:1012.62, r₄₆:1155.8, r₄₇:569.32, r₄₈:947.96, r₄₉:1521.9, r₅₀:907.96, r₅₁:216, r₅₂:744.85, r₅₃:346.18, r₅₄:404.06, r₅₆:302.5, r₅₇:246.03, r₅₈:192.16, r₅₉:298.31, r₆₀:241.55, r₆₁:273.08, r₆₂:232.15, r₆₃:594.12, r₆₄:320.56, r₆₅:393.83, r₆₆:94.08, r₆₇:172.73, r₆₈:172.1, r₆₉:263.17, r₇₀:91.11, r₇₁:84.00, r₇₂:77.05, r₇₃:77.46, r₇₄:187.97, r₇₅:49.83, r₇₆:149.48, r₇₇:215.22, r₇₈:357.02, r₇₉:85.35, r₈₀:147.8, r₈₁:206.4, r₈₂:364.25, r₈₃:473.85, r₈₄:435, r₈₅:482.31, r₈₆:196.52, r₈₇:41.05, r₈₈:54.86, r₈₉:241.23, r₉₀:214.28, r₉₁:366.73, r₉₂:307.71, r₉₃:148.57, r₉₄:335.34, r₉₅:59.67, r₉₆:319.52, r₉₇:64.08, r₉₈:87.18, r₉₉:186.17, r₁₀₀:189.73, r₁₀₁:270.78, r₁₀₂:252.65, r₁₀₃:139.68, r₁₀₄:364.55, r₁₀₅:380.54, r₁₀₆:289.52, r₁₀₇:239.46, r₁₀₈:269.73.

The maximum is then selected from the above minimum distances: 1521.9 is the maximum, and its corresponding representative point is r₄₉. r₄₉(-0.05, 0.23) is added to the initial cluster center set C as the 3rd initial cluster center c₃, shown as "●" in Fig. 5, i.e. C = {c₁, c₂, c₃} = {(1.17, 2.07), (1.83, 0.17), (-0.05, 0.23)}. At this point the number of elements in the initial cluster center set C equals K, and the algorithm terminates.
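A single pass of step (7) amounts to a max-min selection: for every remaining representative point take the smallest mixing distance to the centers chosen so far, then pick the point for which that smallest distance is largest. The following Python sketch illustrates one such pass. The function names and the toy coordinates and densities are hypothetical and are not taken from the embodiment.

```python
import math

def mixing_distance(p, q, rho_p, rho_q):
    # h = (sum of the two densities) * Euclidean distance
    return (rho_p + rho_q) * math.dist(p, q)

def next_center(points, densities, remaining, centers):
    """Return the index in `remaining` whose minimum mixing distance
    to the already chosen centers is the largest."""
    def min_h(i):
        return min(
            mixing_distance(points[i], points[c], densities[i], densities[c])
            for c in centers
        )
    return max(remaining, key=min_h)

# Hypothetical toy data: indices 0 and 4 are already chosen as centers.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
densities = [10.0, 8.0, 5.0, 6.0, 4.0]
centers = [0, 4]
remaining = [1, 2, 3]

chosen = next_center(points, densities, remaining, centers)
print(chosen)  # 3 (minimum distances: index 1 -> 1.8, index 2 -> 9.0, index 3 -> 10.0)
```

Taking the minimum first keeps a candidate that lies close to any existing center from being selected, however far it is from the others.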

To verify the effectiveness of the method proposed by the present invention, its clustering performance on real data sets is given below:

Four data sets, Wine, Soybean-small, Iris, and Haberman, were selected from the UCI data sets; Table 1 lists the relevant information for these 4 data sets:

Table 1: Information on the 4 data sets

The initial cluster center selection method proposed by the present invention and the method of randomly selecting initial cluster centers were each applied to the K-means clustering algorithm, and the clustering results were analyzed and evaluated using classification accuracy (AC) as the evaluation index. The accuracy of the random method is the average over 10 random clustering runs. The results are shown in Table 2. Classification accuracy (AC) is defined as follows:

AC = \frac{1}{N} \sum_{i = 1}^{K} a_{i},

where K is the number of classes in the data set, N is the total number of data objects in the data set, and a_i is the number of data objects correctly assigned to the i-th class.
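As a minimal sketch of this evaluation index, assuming AC is the fraction of correctly assigned objects, that is, the sum of a_i over the K classes divided by N, as the symbol definitions above imply. The function name and the per-class counts are hypothetical illustration values, not results from Table 2.

```python
def classification_accuracy(correct_per_class, n_total):
    """AC = (a_1 + a_2 + ... + a_K) / N."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return sum(correct_per_class) / n_total

# Hypothetical counts for a 3-class data set with 150 objects
a = [50, 45, 40]
ac = classification_accuracy(a, 150)
print(ac)  # 0.9
```

For the random method, the reported value would be the mean of this quantity over the 10 runs.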

Table 2: AC values of the 4 data sets under the two initial cluster center methods

As Table 2 shows, when the Delaunay-triangulation-based initial cluster center selection method proposed by the present invention is applied to the K-means clustering algorithm, its classification accuracy (AC) is clearly higher than that of the random method.

Selecting initial cluster centers with the method of the present invention also avoids the influence of outliers. Starting from the artificially generated data set X = {x₁, x₂, ..., x₆₀} of 60 data objects used in the preceding embodiment, 3 further data objects that are outliers are added, with attributes (-0.5, 2.5), (1.0, 0.8), and (2.25, 2.5), shown as "▲" in Fig. 6. The method of the present invention is then used to select initial cluster centers for this data set, yielding 3 initial cluster centers, (1.17, 2.07), (1.83, 0.17), and (-0.05, 0.23), shown as "★" in Fig. 6. This result is identical to the selection result in the embodiment and is not affected by the outliers. If, by contrast, the initial cluster centers are chosen by the random method, an outlier may well be chosen as an initial cluster center. Because the clustering result of the K-means algorithm depends on the choice of initial cluster centers, feeding an outlier to the K-means algorithm as an initial cluster center can lead to an incorrect clustering result.

The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning, or limited experimentation on the basis of the prior art in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims

1. A K-means initial cluster center selection method based on a Delaunay triangulation network, characterized in that it comprises the following steps:

Step 1: representing the data set to be clustered as a Delaunay triangulation network, such that each data point in the data set to be clustered corresponds one-to-one with a node in the Delaunay triangulation network;

Step 2: calculating the mean of the three vertices of each triangle in the Delaunay triangulation network, and taking the mean as the representative point of the triangle;

Step 3: calculating the reciprocal of the area of the triangle in which each representative point lies, and taking that reciprocal as the density of the representative point;

Step 4: calculating the sum of the densities of two representative points and the Euclidean distance between them, and taking the product of the two as the mixing distance between the two representative points;

Step 5: selecting, from all the representative points, the representative point with the maximum density as the 1st initial cluster center, and adding it to the initial cluster center set C;

Step 6: selecting the representative point whose mixing distance to the 1st initial cluster center is the largest as the 2nd initial cluster center, and adding it to the initial cluster center set C;

Step 7: for each remaining representative point, calculating its mixing distance to each initial cluster center in the initial cluster center set C and selecting the minimum mixing distance; then selecting, among all the minimum mixing distances, the representative point corresponding to the maximum, and adding that representative point to the initial cluster center set C; and repeatedly selecting qualifying representative points in this way and adding them to the initial cluster center set C until the number of elements in the initial cluster center set C equals K.

2. The K-means initial cluster center selection method based on a Delaunay triangulation network as claimed in claim 1, characterized in that the specific method of step 1 comprises:

The data set to be clustered is denoted X = {x₁, x₂, ..., x_n} and contains n data objects; a Delaunay triangulation network G = (V, E) is built for the data set X, such that each data object x_i ∈ X corresponds one-to-one with a node v_i ∈ V of the triangulation network G, and the distance between two nodes in the triangulation network G equals the Euclidean distance between their corresponding data objects, i.e. d(v_i, v_j) = d(x_i, x_j).

3. The K-means initial cluster center selection method based on a Delaunay triangulation network as claimed in claim 1, characterized in that the specific method of step 2 comprises:

The three vertices of a triangle T in the triangulation network G are denoted v_i, v_j, v_k; the three vertices correspond one-to-one with the three data objects x_i, x_j, x_k in the data set to be clustered, where x_i = (x_{i1}, x_{i2}, …, x_{id}), x_j = (x_{j1}, x_{j2}, …, x_{jd}), x_k = (x_{k1}, x_{k2}, …, x_{kd}); the mean of the three vertices is calculated as

r = \left( \frac{x_{i 1} + x_{j 1} + x_{k 1}}{3}, \frac{x_{i 2} + x_{j 2} + x_{k 2}}{3}, ..., \frac{x_{i d} + x_{j d} + x_{k d}}{3} \right),

and the mean is taken as the representative point of the triangle T.

4. The K-means initial cluster center selection method based on a Delaunay triangulation network as claimed in claim 1, characterized in that the specific method of step 3 comprises:

The three vertices of the triangle T in which the representative point r lies are denoted v_i, v_j, v_k; the three vertices correspond one-to-one with the three data objects x_i, x_j, x_k in the data set to be clustered, where x_i = (x_{i1}, x_{i2}, …, x_{id}), x_j = (x_{j1}, x_{j2}, …, x_{jd}), x_k = (x_{k1}, x_{k2}, …, x_{kd}); the three side lengths of the triangle T are then:

a = \sqrt{{(x_{i 1} - x_{j 1})}^{2} + {(x_{i 2} - x_{j 2})}^{2} + ... + {(x_{i d} - x_{j d})}^{2}},

b = \sqrt{{(x_{i 1} - x_{k 1})}^{2} + {(x_{i 2} - x_{k 2})}^{2} + ... + {(x_{i d} - x_{k d})}^{2}},

c = \sqrt{{(x_{j 1} - x_{k 1})}^{2} + {(x_{j 2} - x_{k 2})}^{2} + ... + {(x_{j d} - x_{k d})}^{2}},

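The semi-perimeter of the triangle is calculated as

p = \frac{a + b + c}{2},

the area S of the triangle T is then obtained as

S = \sqrt{p (p - a) (p - b) (p - c)},

and finally the reciprocal of the area S, i.e.

\rho = \frac{1}{S},

is taken as the density of the representative point r.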
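The representative point and density of a single triangle can be sketched in Python as follows, assuming the area is obtained from the semi-perimeter by Heron's formula and the density is the reciprocal of that area. The function names, the variable `p` for the semi-perimeter, and the example triangle are hypothetical.

```python
import math

def representative_point(xi, xj, xk):
    """Component-wise mean of the three vertices (works for any dimension d)."""
    return tuple((a + b + c) / 3 for a, b, c in zip(xi, xj, xk))

def triangle_density(xi, xj, xk):
    """Reciprocal of the triangle area, with the area from Heron's formula."""
    a = math.dist(xi, xj)
    b = math.dist(xi, xk)
    c = math.dist(xj, xk)
    p = (a + b + c) / 2  # semi-perimeter
    area = math.sqrt(p * (p - a) * (p - b) * (p - c))
    return 1.0 / area  # raises ZeroDivisionError for a degenerate triangle

# Hypothetical right triangle with legs 3 and 4 (area 6)
tri = ((0.0, 0.0), (3.0, 0.0), (0.0, 4.0))
r = representative_point(*tri)
rho = triangle_density(*tri)
print(r, rho)
```

Because the density is the reciprocal of the area, small triangles, which occur where data points are tightly packed, receive high densities. Nearly collinear vertices would need separate handling, since the computed area can then be zero or, through rounding, the square-root argument can become slightly negative.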

5. The K-means initial cluster center selection method based on a Delaunay triangulation network as claimed in claim 1, characterized in that the specific method of step 4 comprises:

The densities of two representative points r₁ and r₂ are denoted ρ₁ and ρ₂ respectively, and the Euclidean distance between the representative point r₁ and the representative point r₂ is d₁₂; the mixing distance between the representative point r₁ and the representative point r₂ is then h = (ρ₁ + ρ₂) × d₁₂.

6. The K-means initial cluster center selection method based on a Delaunay triangulation network as claimed in claim 1, characterized in that the specific method of step 5 comprises:

The set of all representative points is denoted R = {r₁, r₂, ..., r_t}; first, the densities of all representative points are calculated as in step 3; then the representative point with the maximum density is selected from the set R as the 1st initial cluster center c₁ and added to the initial cluster center set C, i.e. C = {c₁}; the representative point with the maximum density is then removed from the set R and the representative point set is re-indexed, giving R = {r₁, r₂, ..., r_{t-1}}.

7. The K-means initial cluster center selection method based on a Delaunay triangulation network as claimed in claim 1, characterized in that the specific method of step 6 comprises:

The mixing distance between each representative point in the set R = {r₁, r₂, ..., r_{t-1}} and the first initial cluster center is calculated; the representative point with the largest mixing distance is taken as the 2nd initial cluster center c₂ and added to the initial cluster center set C, i.e. C = {c₁, c₂}; that representative point is then removed from the set R and the representative point set is re-indexed, giving R = {r₁, r₂, ..., r_{t-2}}.

8. The K-means initial cluster center selection method based on a Delaunay triangulation network as claimed in claim 1, characterized in that the specific method of step 7 comprises:

Step 71: selecting r₁ from the set R of remaining representative points, calculating its mixing distance to each initial cluster center in the initial cluster center set C, and selecting the minimum among all these mixing distances, denoted h_{1min};

Step 72: selecting r₂ from R, calculating its mixing distance to each initial cluster center in the initial cluster center set C, and selecting the minimum among all these mixing distances, denoted h_{2min}; and continuing in this way until the last representative point r_{t-2} is selected from R, its mixing distance to each initial cluster center in the initial cluster center set C is calculated, and the minimum among all these mixing distances is selected, denoted h_{(t-2)min};

Step 73: selecting, among all the minimum mixing distances h_{1min}, h_{2min}, ..., h_{(t-2)min}, the representative point corresponding to the maximum mixing distance, and adding that representative point to the initial cluster center set C; and repeatedly selecting qualifying representative points in this way and adding them to the initial cluster center set C until the number of elements in the initial cluster center set C equals K.
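Steps 5 to 7 can be sketched together in Python, assuming the representative points and their densities have already been computed from the triangulation. The sketch treats step 6 as the first iteration of the step 7 loop, since with a single center the minimum mixing distance is simply the distance to that center. The function names and the toy data are hypothetical and do not reproduce the embodiment.

```python
import math

def mixing_distance(p, q, rho_p, rho_q):
    # h = (sum of the two densities) * Euclidean distance
    return (rho_p + rho_q) * math.dist(p, q)

def select_initial_centers(reps, dens, k):
    """Choose k initial cluster centers from the representative points."""
    if not 1 <= k <= len(reps):
        raise ValueError("k must be between 1 and the number of representative points")
    remaining = list(range(len(reps)))

    # Step 5: the densest representative point is the 1st center.
    first = max(remaining, key=lambda i: dens[i])
    centers = [first]
    remaining.remove(first)

    # Steps 6 and 7: repeatedly add the point whose minimum mixing
    # distance to the current centers is the largest.
    while len(centers) < k:
        def min_h(i):
            return min(
                mixing_distance(reps[i], reps[c], dens[i], dens[c])
                for c in centers
            )
        nxt = max(remaining, key=min_h)
        centers.append(nxt)
        remaining.remove(nxt)

    return [reps[i] for i in centers]

# Hypothetical toy data: three well-separated pairs of representative points
reps = [(0.0, 0.0), (0.1, 0.1), (5.0, 0.0), (5.1, 0.1), (0.0, 5.0), (0.1, 5.1)]
dens = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0]
centers = select_initial_centers(reps, dens, 3)
print(centers)  # [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
```

The returned coordinates would then be passed to a K-means implementation as its initial cluster centers. Ties are resolved here in favor of the lowest index, a choice the claims do not specify.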