CN111062418A - Non-parametric clustering algorithm and system based on minimum spanning tree


Info

Publication number: CN111062418A
Application number: CN201911168955.7A
Authority: CN (China)
Prior art keywords: clustering, MST, dimensional, algorithm, edge
Inventors: 吴怀岗, 陈靖飒, 窦万峰, 程开丰
Assignee (current and original): Nanjing Normal University
Application filed by: Nanjing Normal University
Priority/filing date: 2019-11-25
Publication date: 2020-04-24
Other languages: Chinese (zh)
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering


Abstract

The invention provides a parameterization-free clustering algorithm and system based on a minimum spanning tree. The data set to be clustered is first abstracted into a weighted complete graph WCG, in which points represent vectors and weighted edges represent the similarity relations between data; the WCG is then transformed into a fully connected minimum spanning tree MST; next, the one-dimensional weight space of the MST edge set is clustered with the K-means algorithm using k = 2 to obtain a pruning threshold; finally, the MST is pruned and noise-filtered, and the resulting connected components are the clusters. The algorithm converts the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", thereby making up for the deficiencies of the K-means algorithm, truly realizing non-parametric clustering, reducing clustering time while improving clustering efficiency, and eliminating the algorithm's dependence on empirical parameters.

Description

Non-parametric clustering algorithm and system based on minimum spanning tree
Technical Field
The invention relates to a parameterization-free clustering algorithm based on a minimum spanning tree, and belongs to the field of clustering algorithms.
Background
The clustering algorithm is a very effective unsupervised machine learning algorithm and an important branch of the data mining field. Conventional clustering algorithms can be roughly divided into partition clustering, hierarchical clustering, density clustering, grid clustering, model clustering, and so on. The K-means algorithm, as a partition clustering algorithm, has many advantages: its principle is simple and easy to describe, it is time-efficient, and it is suitable for processing large-scale data, so it is widely applied in many fields. However, the algorithm has an obvious defect: the accuracy and computational complexity of the clustering depend heavily on the choice of the initial cluster number k and the initial cluster centers. In a large number of practical application scenarios, the data set is not only large in scale but also in a process of constant dynamic change, so the cluster number and cluster centers are often difficult to predict and determine in advance.
To solve the above problems, scholars have successively proposed improvements to K-means that reduce the algorithm's dependence on empirical parameters to some extent, but none of them is completely parameter-free. Building on this work, the invention provides MNC (MST based Non-parametric Clustering), a non-parametric clustering algorithm based on the minimum spanning tree, which reduces clustering time, improves clustering efficiency, and eliminates the algorithm's dependence on empirical parameters.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to overcome the defect that existing unsupervised clustering algorithms depend heavily on the selection of empirical parameters (the initial cluster number k and the initial cluster centers), and provides a novel parameterization-free clustering algorithm that depends only on the data set itself and needs no empirical parameters, namely the parameterization-free clustering algorithm based on a minimum spanning tree, thereby improving clustering efficiency and reducing clustering time. A further object is to provide a system for carrying out the above method.
The technical scheme is as follows: a parametrization-free clustering algorithm based on a minimum spanning tree comprises the following steps:
Step 1: abstracting the data set to be clustered into a weighted complete graph WCG (Weighted Complete Graph), in which points represent vectors and weighted edges represent the similarity between data;
Step 2: converting the WCG into a fully connected minimum spanning tree MST (Minimum Spanning Tree);
Step 3: clustering the one-dimensional weight space of the MST edge set with the K-means algorithm using k = 2 to obtain a pruning threshold;
Step 4: pruning the MST; the resulting connected components are the initial clusters;
Step 5: performing outlier filtering on the initial clusters to obtain the final clustering result.
In a further embodiment, step 1 further comprises:
regarding each data sample in S_N as a vector in an N-dimensional Euclidean space, and measuring the similarity between data samples by the Euclidean distance between the vectors:
D_ed(x_i, x_j) = sqrt( Σ_{k=1}^{N} (x_ik - x_jk)² )
where x_i = (x_i1, x_i2, ..., x_iN) denotes the given i-th sample, x_j = (x_j1, x_j2, ..., x_jN) denotes the given j-th sample, and D_ed(x_i, x_j) denotes the Euclidean distance between samples x_i and x_j;
the original data sample set is thus converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set
W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }
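For concreteness, the following is a minimal Python sketch of this first stage (the function name weighted_complete_graph is illustrative, not taken from the patent): it builds the edge-weight matrix of the WCG from a sample matrix.

```python
import numpy as np

def weighted_complete_graph(X):
    """Stage 1 sketch: abstract a data set into the WCG = <V, E, W>.

    X is an (n, N) array of n samples viewed as vectors of an
    N-dimensional Euclidean space. Since E = V x V, the WCG is fully
    described by the (n, n) weight matrix D with
    D[i, j] = w_ij = D_ed(x_i, x_j), the Euclidean distance.
    """
    diff = X[:, None, :] - X[None, :, :]   # pairwise coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))
```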
In a further embodiment, step 2 further comprises:
taking the weighted complete graph WCG = <V, E, W> of step 1 as input and generating the minimum spanning tree MST with the Prim algorithm from graph theory: randomly select a point v from V and build two point sets P and Q, representing the points already in the MST and the points not yet in the MST, with P = {v} and Q = V - P. At each iteration the point of Q closest to P is selected and added to P, until Q is empty, at which time all points are in the MST.
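A minimal sketch of this Prim construction over the dense weight matrix produced by the previous sketch (O(n²) time; prim_mst is again an illustrative name):

```python
import numpy as np

def prim_mst(D):
    """Stage 2 sketch: Prim's algorithm on the dense weight matrix D.

    P is the set of points already in the MST, Q the rest; at each
    iteration the point of Q closest to P is moved into P. Returns the
    MST as a list of n - 1 edges (i, j, w_ij).
    """
    n = D.shape[0]
    in_p = np.zeros(n, dtype=bool)   # membership in P
    in_p[0] = True                   # P = {v}, here v = point 0
    best = D[0].copy()               # distance from each point of Q to P
    parent = np.zeros(n, dtype=int)  # nearest neighbour inside P
    edges = []
    for _ in range(n - 1):
        q = int(np.argmin(np.where(in_p, np.inf, best)))  # closest to P
        edges.append((int(parent[q]), q, float(best[q])))
        in_p[q] = True
        closer = ~in_p & (D[q] < best)                    # relax via q
        best[closer] = D[q][closer]
        parent[closer] = q
    return edges
```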
In a further embodiment, step 3 further comprises:
extracting the one-dimensional weight space S_1 of the MST edge set and applying K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2).
Let the optimal clustering function (Optimal Clustering) of S_N be OC(S_N, CN), where CN is the number of clusters. Define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>
where the edge sets E_Intra and E_Inter satisfy the following constraints:
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
According to Everitt's definition of a cluster, "the distance between any two points of the same cluster is smaller than the distance between any two points of different clusters", we can obtain:
max(E_Intra) < min(E_Inter)
That is, viewed from the final clustering result, the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters.
Since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }
From the two equations above we can derive: the edge set of MST(S_N) can be divided by the optimal clustering OC(S_N, M) of S_N into two disjoint subsets E_Intra and E_Inter, and the distance between E_Intra and E_Inter is sufficiently large. The projection of MST(S_N) onto OC(S_N, M) is therefore equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)
Thus the original, higher-complexity "N-dimensional space clustering OC(S_N, M) problem with an indefinite number of classes" is converted into the lower-complexity "one-dimensional space clustering OC(S_1, 2) problem with 2 classes". There are various ways to realize OC(S_1, 2); however, because OC(S_1, 2) has a fixed class number of 2, its distance computation is a simple subtraction, and its initial centers can simply be taken as the left and right endpoints of S_1, the deficiencies of the K-means algorithm are avoided, so the K-means algorithm is adopted to realize OC(S_1, 2).
In a further embodiment, step 4 further comprises:
pruning the MST with the threshold PT obtained in step 3, i.e. disconnecting all edges of the MST whose weight exceeds PT; after pruning, the originally fully connected MST becomes a forest, and the connected components of the forest F are the preliminary clustering result.
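A minimal sketch of the pruning stage. Since PT = min(W_2) is itself the smallest inter-cluster weight, this sketch reads the text's "greater than PT" inclusively and cuts edges whose weight reaches PT; the connected components of the resulting forest are recovered with a small union-find (both choices are assumptions of the sketch, not mandated by the patent):

```python
def prune(mst_edges, pt):
    """Stage 4 sketch: cut long MST edges and split into components.

    Edges with weight >= pt (i.e. the W_2 class, since pt = min(W_2))
    are disconnected; the surviving edges define the forest F, whose
    connected components are the preliminary clusters.
    """
    n = 1 + max(max(i, j) for i, j, _ in mst_edges)
    parent = list(range(n))                       # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]         # path halving
            a = parent[a]
        return a

    for i, j, w in mst_edges:
        if w < pt:                                # keep only short edges
            parent[find(i)] = find(j)

    components = {}
    for v in range(n):
        components.setdefault(find(v), []).append(v)
    return list(components.values())
```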
In a further embodiment, step 5 further comprises:
detecting connected components with fewer than 3 points, regarding them as outliers, and removing them with the outlier filtering function Filter; the filtered connected components are the final clustering result of the MNC algorithm. The connected components of the forest F obtained by pruning can serve as a preliminary clustering result, but real data sets often contain a lot of useless noise data, and without filtering the accuracy of the clustering result would decrease. In the clustering result this noise appears as outliers: points of low spatial density that lie relatively far from the normal data. To detect these outliers in F, it suffices to check the number of points of each connected component; the algorithm regards a connected component with fewer than 3 points as an outlier.
A parameterization-free clustering system based on a minimum spanning tree mainly comprises the following modules:
a first module for abstracting a data set to be clustered into a weighted complete graph WCG;
a second module for converting the WCG into a fully connected minimum spanning tree MST;
a third module for clustering the one-dimensional weight space of the MST edge set to obtain a pruning threshold;
a fourth module for pruning the MST, the resulting connected components being the initial clusters;
and a fifth module for performing outlier filtering on the initial clusters to obtain the final clustering result.
In a further embodiment, the first module is further used to regard each data sample in S_N as a vector in an N-dimensional Euclidean space and to measure the similarity between data samples by the Euclidean distance between the vectors:
D_ed(x_i, x_j) = sqrt( Σ_{k=1}^{N} (x_ik - x_jk)² )
where x_i = (x_i1, x_i2, ..., x_iN) denotes the given i-th sample, x_j = (x_j1, x_j2, ..., x_jN) denotes the given j-th sample, and D_ed(x_i, x_j) denotes the Euclidean distance between samples x_i and x_j;
the original data sample set is thus converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set
W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }
The second module further generates the minimum spanning tree MST with the Prim algorithm from graph theory: two point sets P and Q are constructed, representing the points already in the MST and the points not yet in the MST, and at each iteration the point of Q closest to P is selected and added to P, until Q is empty;
the fourth module further prunes the MST with the threshold PT obtained by the third module, i.e. disconnects all edges of the MST whose weight exceeds PT; after pruning, the originally fully connected MST becomes a forest, and the connected components of the forest F are the preliminary clustering result;
the fifth module further detects connected components with fewer than 3 points, regards them as outliers, and removes them with the outlier filtering function Filter; the filtered connected components are the final clustering result of the MNC algorithm.
In a further embodiment, the third module is further used to extract the one-dimensional weight space S_1 of the MST edge set and apply K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2):
Let the optimal clustering function of S_N be OC(S_N, CN), where CN is the number of clusters; define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>
where the edge sets E_Intra and E_Inter satisfy the following constraints:
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
and the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters:
max(E_Intra) < min(E_Inter)
Since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }
the projection of MST(S_N) onto OC(S_N, M) is equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)
where OC(S_N, M) denotes the N-dimensional space clustering problem with an indefinite number of classes and OC(S_1, 2) denotes the one-dimensional space clustering problem with 2 classes.
Beneficial effects:
First, the novel and efficient parameterization-free clustering algorithm of the invention depends only on the data set itself and needs no empirical parameters. The algorithm converts the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", thereby making up for the deficiencies of the K-means algorithm and truly realizing non-parametric clustering.
Second, the novel and efficient non-parametric clustering algorithm improves clustering efficiency and greatly reduces clustering time. Each iteration of the K-means algorithm must recompute the Euclidean distances between all data objects and the new centers of all clusters, and each such distance computation involves not only simple operations such as addition and subtraction but also complex squaring and square-root operations. In the MNC algorithm, the Euclidean distances between data objects are computed only once, in the first stage of the algorithm, and the two subsequent stages involve only simple addition, subtraction, and comparison operations; compared with the traditional K-means algorithm, the MNC algorithm therefore has lower computational complexity and consumes less time.
In conclusion, the invention can effectively solve the problem of dependence of traditional clustering algorithms such as K-means on empirical parameters, and has good practical application value.
Drawings
FIG. 1 is a block flow diagram of a novel efficient non-parameterized clustering algorithm of the present invention.
FIG. 2 is a flow chart of the present invention for generating a weighted complete graph.
FIG. 3 is a flow chart of generating a minimum spanning tree according to the present invention.
FIG. 4 is a flow chart of the present invention for generating pruning thresholds.
Fig. 5 is a comparison of clustering results of the conventional K-means algorithm, the MSTCluster algorithm based on the minimum spanning tree, and the MNC algorithm herein.
Fig. 6 shows the running time statistics of the three clustering algorithms under the same data set.
Detailed Description
The applicant believes that the traditional K-means algorithm as a partitional clustering algorithm has the following defects: the accuracy and computational complexity of clustering is heavily dependent on the initial cluster number k and the choice of initial cluster center parameters. In a large number of practical application scenarios, the data set is not only large in scale, but also is constantly in a dynamic change process, so that the clustering number and the clustering center are often difficult to predict and determine in advance.
Therefore, the invention provides MNC (MST based Non-parametric Clustering), a minimum-spanning-tree-based non-parametric clustering algorithm that can reduce clustering time, improve clustering efficiency, and eliminate the algorithm's dependence on empirical parameters.
The algorithm of the present invention is further described in detail below with reference to the accompanying drawings:
the symbolic meanings used by the algorithm are summarized as follows: sNIs an N (N is more than or equal to 2) dimensional data set to be clustered;
Figure BDA0002288205580000071
is a sample vector
Figure BDA0002288205580000072
And
Figure BDA0002288205580000073
the Euclidean distance between them; WCG (S)N) Is SNCorresponding weighted complete graph; e.g. of the typei,jIs WCG (S)N) Middle sample
Figure BDA0002288205580000074
And
Figure BDA0002288205580000075
the edges between can be expressed as binary groups
Figure BDA0002288205580000076
wi,jIs an edge ei,jWeight value of (2), i.e.
Figure BDA0002288205580000077
And
Figure BDA0002288205580000078
european distance between
Figure BDA0002288205580000079
MST(SN) Is WCG (S)N) The corresponding minimum spanning tree can be specifically expressed as a set of edges { ei,j};OC(SNCN) is SNAnd (4) corresponding to the optimal clustering function, wherein CN is the number of the clustered clusters.
FIG. 1 is a block flow diagram of a novel efficient non-parameterized clustering algorithm of the present invention. As shown in fig. 1, the specific idea of the algorithm is:
1. generating weighted complete graph WCG
FIG. 2 is a flow chart of generating the weighted complete graph according to the present invention. First, each data sample in S_N is regarded as a vector in an N-dimensional Euclidean space, and the similarity between data samples is measured by the Euclidean distance between the vectors; the original data sample set can then be converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }.
2. Generating a minimum spanning tree MST
FIG. 3 is a flow chart of generating the minimum spanning tree according to the present invention. The minimum spanning tree is generated with the Prim algorithm from graph theory: two point sets P and Q are constructed, representing the points already in the MST and the points not yet in it, and at each iteration the point of Q closest to P is selected and added to P, until Q is empty (all points are then in the MST).
3. Generating pruning threshold values
FIG. 4 is a flow chart of generating the pruning threshold according to the present invention. Suppose the optimal clustering function (optimal clustering) of S_N is OC(S_N, CN), where CN is the number of clusters. Define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>    (1)
where the edge sets E_Intra and E_Inter satisfy the constraints
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
According to Everitt's definition of a cluster, "the distance between any two points of the same cluster is smaller than the distance between any two points of different clusters", we obtain
max(E_Intra) < min(E_Inter)    (2)
That is, viewed from the final clustering result, the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters.
Since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }    (3)
According to formula (1), the edge set of MST(S_N) can be divided by the optimal clustering OC(S_N, M) of S_N into two disjoint subsets E_Intra and E_Inter, and the distance between E_Intra and E_Inter is sufficiently large. Combined with (2), the projection of MST(S_N) onto OC(S_N, M) is equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)    (4)
According to formula (4), the original, higher-complexity "N-dimensional space clustering OC(S_N, M) problem with an indefinite number of classes" is converted into the lower-complexity "one-dimensional space clustering OC(S_1, 2) problem with 2 classes". There are various ways to realize OC(S_1, 2); however, because OC(S_1, 2) has a fixed class number of 2, its distance computation is a simple subtraction, and its initial centers can simply be taken as the left and right endpoints of S_1, the deficiencies of the K-means algorithm are avoided, so the K-means algorithm is adopted to realize OC(S_1, 2). On the basis of this conclusion, the pruning-threshold generation process of the scheme is: extract the one-dimensional weight space S_1 of the MST edge set and apply K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the threshold is then PT = min(W_2).
4. Pruning and splitting
Prune the MST with the threshold PT obtained in the previous step, i.e. disconnect all edges of the MST whose weight exceeds PT. After pruning, the originally fully connected MST becomes a forest.
Output: forest F
5. Outlier filtering
The connected components of the forest F obtained by pruning can serve as a preliminary clustering result, but real data sets often contain a lot of useless noise data, and without filtering the accuracy of the clustering result would decrease. In the clustering result this noise appears as outliers: points of low spatial density that lie relatively far from the normal data. To detect these outliers in F, it suffices to check the number of points of each connected component; the algorithm regards a connected component with fewer than 3 points as an outlier. The filtered connected components are the final clustering result of the MNC algorithm.
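Tying the five stages together, here is a minimal end-to-end run using the hypothetical sketches introduced above on a small synthetic data set (the expected outcome under these assumptions: the two dense groups survive and the isolated point is filtered out as noise):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)),   # dense cluster 1
               rng.normal(5.0, 0.3, (40, 2)),   # dense cluster 2
               [[10.0, 10.0]]])                 # one isolated noise point

D      = weighted_complete_graph(X)   # stage 1: WCG weight matrix
mst    = prim_mst(D)                  # stage 2: minimum spanning tree
pt     = pruning_threshold(mst)       # stage 3: PT via 1-D 2-means
forest = prune(mst, pt)               # stage 4: pruning and splitting
result = filter_outliers(forest)      # stage 5: outlier filtering
print(len(result))                    # expected: 2 clusters
```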
Example one
In order to evaluate the beneficial effects of the invention, three different algorithms were compared experimentally: the traditional K-means algorithm, the minimum-spanning-tree-based MSTCluster algorithm, and the MNC algorithm herein; their clustering results on two-dimensional random data sets of different shapes are shown in Fig. 5. The traditional K-means is a parameterized algorithm: besides the data set to be clustered, its input needs the additional parameters k (the number of clusters) and the initial cluster centers. Compared with traditional K-means, the minimum-spanning-tree-based MSTCluster belongs to the non-parametric clustering algorithms, since it needs neither the cluster number k nor initial cluster centers; however, it still needs one input parameter, an adjustment factor used to determine the pruning threshold, so it is not completely parameter-free. The MNC algorithm herein is completely parameter-free and needs no input parameters at all.
From the results it can be seen that:
1. clustering effect and recognizable data shape
While the traditional K-means and the minimum-spanning-tree-based MSTCluster can only identify convex data clusters, the MNC algorithm of the invention identifies not only conventional convex data clusters but also ring-shaped and other non-convex data clusters. The reason is that, unlike the direct clustering of traditional K-means, MSTCluster and MNC both adopt an indirect approach of first merging and then dividing: the whole data set is first connected into one large class by the minimum spanning tree, then the overall properties of the data set are analyzed, and the division is carried out top-down under their guidance. Because the overall nature of the data set is fully utilized, the final result is more accurate than that of traditional K-means. Comparing the clustering results of MNC and MSTCluster shows that MNC is superior to MSTCluster: MSTCluster distinguished neither of the two ring-shaped clusters of the two-dimensional random data set, whereas MNC distinguished them successfully.
2. Algorithm runtime
The running-time statistics of the three clustering algorithms on the same data sets are shown in Fig. 6, with time in seconds. Since K-means uses repeated iterative experiments, its running time is taken as the average over multiple runs. As the data in Fig. 6 show, K-means takes the longest time, MSTCluster is second, and MNC is the shortest. The reason is that each iteration of the K-means algorithm must recompute the Euclidean distances between all data objects and the new centers of all clusters, and each such distance computation involves not only simple operations such as addition and subtraction but also complex squaring and square-root operations; in MSTCluster and MNC, the Euclidean distances between data objects are computed only once, in the first stage of the algorithm, and the two subsequent stages involve only simple addition, subtraction, and comparison operations, so their computational complexity is lower than that of traditional K-means. Comparing MSTCluster with MNC, although both are based on the minimum spanning tree in the early stage, the difference is that the generation of the pruning threshold in MSTCluster depends on an empirical parameter (the adjustment factor), whereas the pruning threshold of MNC is generated automatically from the minimum spanning tree, avoiding a search of the parameter space for the optimum; the computational complexity of MNC is therefore lower than that of MSTCluster.
In conclusion, compared with the traditional K-means and the minimum-spanning-tree-based MSTCluster, the MNC algorithm provided by the invention greatly reduces running time while improving the clustering effect.
The present invention is not limited to the above embodiments, and any easily conceivable changes or substitutions based on the mechanism of the present invention should fall within the scope of the present invention.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A parametrization-free clustering algorithm based on a minimum spanning tree is characterized by comprising the following steps:
step 1, abstracting a data set to be clustered into a weighted complete graph WCG, wherein points represent vectors, and weighted edges represent similarity relations among data;
step 2, converting the WCG into a fully-communicated minimum spanning tree MST;
step 3, clustering the one-dimensional weight space of the MST edge set by using a K-means algorithm with K being 2 to obtain a pruning threshold value;
step 4, pruning the MST, the resulting connected components being the initial clusters;
and step 5, performing outlier filtering on the initial clusters to obtain the final clustering result.
2. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 1 further comprises:
regarding each data sample in S_N as a vector in an N-dimensional Euclidean space, and measuring the similarity between data samples by the Euclidean distance between the vectors:
D_ed(x_i, x_j) = sqrt( Σ_{k=1}^{N} (x_ik - x_jk)² )
where x_i = (x_i1, x_i2, ..., x_iN) denotes the given i-th sample, x_j = (x_j1, x_j2, ..., x_jN) denotes the given j-th sample, and D_ed(x_i, x_j) denotes the Euclidean distance between samples x_i and x_j;
the original data sample set is converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set
W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }
3. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 2 further comprises:
and (3) generating a Minimum Spanning Tree (MST) by using a Prim algorithm in a graph theory, namely constructing two point sets P and Q respectively representing the point sets in the MST and the points not in the MST, selecting the point closest to the P from the Q to add the P in each iteration until the Q is empty, wherein all the points are in the MST.
4. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 3 further comprises:
extracting the one-dimensional weight space S_1 of the MST edge set and applying K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2).
5. The minimum spanning tree-based parameterization-free clustering algorithm according to claim 4, wherein:
let SNHas an optimal clustering function of OC (S)NCN), wherein CN is the number of clusters after clustering; define MST (S)N) To OC (S)NM) of a projection function P whose output is to two disjoint sets of edges EIntraAnd EInterThe edge sets between the points in the final clustering result cluster and the edge sets between the clusters are respectively as follows:
P(OC(SN,M),MST(SN))=<EIntra,EInter>
wherein, the edge set EIntraAnd EInterThe following constraints are satisfied:
Figure FDA0002288205570000021
wherein MST (S)N) The longest edge of the inner point of the middle connection cluster is shorter than the shortest edge between the connection clusters:
max(EIntra)<min(EInter)
due to MST (S)N) The weight space of the edge set can form a one-dimensional data space S1Where the coordinates of each data sample are MST (S)N) The weight value of the middle edge is:
Figure FDA0002288205570000022
MST(SN) Projection onto OC (S)NM) is equivalent to a one-dimensional edge set weight data set S1The two classification processes of (1), namely:
Figure FDA0002288205570000023
in the formula, OC (S)NM) N-dimensional spatial clustering problem, OC (S), representing an indefinite number of classes1And 2) represents a one-dimensional spatial clustering problem with a class number of 2.
6. The minimum spanning tree based parameterization-free clustering algorithm according to claim 3, wherein the step 4 further comprises:
pruning the MST with the threshold PT obtained in step 3, i.e. disconnecting all edges of the MST whose weight exceeds PT; after pruning, the originally fully connected MST becomes a forest, and the connected components of the forest F are the preliminary clustering result.
7. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 5 further comprises:
and detecting a connected component with the number less than 3, regarding the connected component as an outlier, and filtering by using an outlier filtering function Filter, wherein the filtered connected component is a final clustering result of the MNC algorithm.
8. A parameterization-free clustering system based on a minimum spanning tree is characterized by comprising the following modules:
a first module for abstracting a data set to be clustered into a weighted complete graph WCG;
a second module for converting the WCG into a fully connected minimum spanning tree MST;
a third module for clustering the one-dimensional weight space of the MST edge set to obtain a pruning threshold;
a fourth module for pruning the MST, the resulting connected components being the initial clusters;
and a fifth module for performing outlier filtering on the initial clusters to obtain the final clustering result.
9. The minimum spanning tree based parameterless clustering system of claim 8, wherein:
the first module is further for coupling SNEach data sample in (2) is considered as a vector of an N-dimensional euclidean space, and the similarity between data samples is measured by the euclidean distance between vectors:
Figure FDA0002288205570000031
in the formula xiDenotes the given i-th sample, xjRepresenting the given jth sample of the sequence,
Figure FDA0002288205570000032
Figure FDA0002288205570000033
Ded(xi,xj) Represents a sample xiAnd xjThe Euclidean distance of (c);
the original data sample set is converted into the weighted complete graph model WCG ═<V,E,W>Wherein V, E and W represent the set of points, the set of edges, and the set of weights in the WCG, respectively; wherein, the point set V is SNSet of weights, E ═ V × V
Figure FDA0002288205570000034
The second module further generates a Minimum Spanning Tree (MST) by using a Prim algorithm in a graph theory, namely two point sets P and Q are constructed and respectively represent the point sets in the MST and the point sets not in the MST, and a point closest to P is selected from Q to be added into P during each iteration until Q is empty;
the fourth module further prunes the MST by using the threshold PT obtained by the third module, namely all edges with the weight larger than the PT in the MST are disconnected; after pruning, the original fully-connected MST can become a forest, and the connected component in the forest F is a preliminary clustering result;
the fifth module is further used for detecting a connected component with the number less than 3, regarding the connected component as an outlier, and filtering by using an outlier filtering function Filter, wherein the filtered connected component is a final clustering result of the MNC algorithm.
10. The system of claim 9, wherein the third module is further used to extract the one-dimensional weight space S_1 of the MST edge set and apply K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2):
let the optimal clustering function of S_N be OC(S_N, CN), where CN is the number of clusters; define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>
where the edge sets E_Intra and E_Inter satisfy the following constraints:
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
and the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters:
max(E_Intra) < min(E_Inter)
since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }
the projection of MST(S_N) onto OC(S_N, M) is equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)
where OC(S_N, M) denotes the N-dimensional space clustering problem with an indefinite number of classes and OC(S_1, 2) denotes the one-dimensional space clustering problem with 2 classes.
CN201911168955.7A 2019-11-25 2019-11-25 Non-parametric clustering algorithm and system based on minimum spanning tree Pending CN111062418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911168955.7A CN111062418A (en) 2019-11-25 2019-11-25 Non-parametric clustering algorithm and system based on minimum spanning tree


Publications (1)

Publication Number Publication Date
CN111062418A true CN111062418A (en) 2020-04-24

Family

ID=70298280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911168955.7A Pending CN111062418A (en) 2019-11-25 2019-11-25 Non-parametric clustering algorithm and system based on minimum spanning tree

Country Status (1)

Country Link
CN (1) CN111062418A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948732A (en) * 2021-01-14 2021-06-11 西安交通大学 Outlier detection method based on normalized minimum spanning tree clustering
CN112948732B (en) * 2021-01-14 2023-08-22 西安交通大学 Outlier detection method based on normalized minimum spanning tree clustering


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination