CN111062418A - Non-parametric clustering algorithm and system based on minimum spanning tree


Info

Publication number: CN111062418A
Application number: CN201911168955.7A
Authority: CN (China)
Prior art keywords: clustering, MST, dimensional, algorithm, edge
Inventors: 吴怀岗, 陈靖飒, 窦万峰, 程开丰
Assignee (current and original): Nanjing Normal University
Application filed by: Nanjing Normal University
Priority/filing date: 2019-11-25
Publication date: 2020-04-24
Other languages: Chinese (zh)
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering


Abstract

The invention provides a parameterization-free clustering algorithm and system based on a minimum spanning tree. The data set to be clustered is first abstracted into a weighted complete graph WCG, in which points represent vectors and weighted edges represent the similarity relations between data; the WCG is then transformed into a fully connected minimum spanning tree MST; next, the one-dimensional weight space of the MST edge set is clustered with the K-means algorithm using k = 2 to obtain a pruning threshold; finally, the MST is pruned and noise-filtered, and the resulting connected components are the clusters. The algorithm converts the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", thereby making up for the deficiencies of the K-means algorithm, truly realizing non-parametric clustering, reducing clustering time while improving clustering efficiency, and eliminating the algorithm's dependence on empirical parameters.

Description

Non-parametric clustering algorithm and system based on minimum spanning tree
Technical Field
The invention relates to a parameterization-free clustering algorithm based on a minimum spanning tree, and belongs to the field of clustering algorithms.
Background
The clustering algorithm is a very effective unsupervised machine learning algorithm and an important branch of the data mining field. Conventional clustering algorithms can be roughly divided into partition clustering, hierarchical clustering, density clustering, grid clustering, model clustering, and so on. The K-means algorithm, as a partition clustering algorithm, has many advantages: its principle is simple and easy to describe, it is time-efficient, and it is suitable for processing large-scale data, so it is widely applied in many fields. However, the algorithm has an obvious defect: the accuracy and computational complexity of the clustering depend heavily on the choice of the initial cluster number k and the initial cluster centers. In a large number of practical application scenarios, the data set is not only large in scale but also in a process of constant dynamic change, so the cluster number and cluster centers are often difficult to predict and determine in advance.
To solve the above problems, scholars have successively proposed improvements to K-means that reduce the algorithm's dependence on empirical parameters to some extent, but none of them is completely parameter-free. Building on this work, the invention provides MNC (MST based Non-parametric Clustering), a non-parametric clustering algorithm based on the minimum spanning tree, which reduces clustering time, improves clustering efficiency, and eliminates the algorithm's dependence on empirical parameters.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to overcome the defect that existing unsupervised clustering algorithms depend heavily on the selection of empirical parameters (the initial cluster number k and the initial cluster centers), and provides a novel parameterization-free clustering algorithm that depends only on the data set itself and needs no empirical parameters, namely the parameterization-free clustering algorithm based on a minimum spanning tree, thereby improving clustering efficiency and reducing clustering time. A further object is to provide a system for carrying out the above method.
The technical scheme is as follows: a parametrization-free clustering algorithm based on a minimum spanning tree comprises the following steps:
Step 1: abstracting the data set to be clustered into a weighted complete graph WCG (Weighted Complete Graph), in which points represent vectors and weighted edges represent the similarity between data;
Step 2: converting the WCG into a fully connected minimum spanning tree MST (Minimum Spanning Tree);
Step 3: clustering the one-dimensional weight space of the MST edge set with the K-means algorithm using k = 2 to obtain a pruning threshold;
Step 4: pruning the MST; the resulting connected components are the initial clusters;
Step 5: performing outlier filtering on the initial clusters to obtain the final clustering result.
In a further embodiment, step 1 further comprises:
regarding each data sample in S_N as a vector in an N-dimensional Euclidean space, and measuring the similarity between data samples by the Euclidean distance between the vectors:
D_ed(x_i, x_j) = sqrt( Σ_{k=1}^{N} (x_ik - x_jk)² )
where x_i = (x_i1, x_i2, ..., x_iN) denotes the given i-th sample, x_j = (x_j1, x_j2, ..., x_jN) denotes the given j-th sample, and D_ed(x_i, x_j) denotes the Euclidean distance between samples x_i and x_j;
the original data sample set is thus converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set
W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }
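For concreteness, the following is a minimal Python sketch of this first stage (the function name weighted_complete_graph is illustrative, not taken from the patent): it builds the edge-weight matrix of the WCG from a sample matrix.

```python
import numpy as np

def weighted_complete_graph(X):
    """Stage 1 sketch: abstract a data set into the WCG = <V, E, W>.

    X is an (n, N) array of n samples viewed as vectors of an
    N-dimensional Euclidean space. Since E = V x V, the WCG is fully
    described by the (n, n) weight matrix D with
    D[i, j] = w_ij = D_ed(x_i, x_j), the Euclidean distance.
    """
    diff = X[:, None, :] - X[None, :, :]   # pairwise coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))
```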
In a further embodiment, step 2 further comprises:
taking the weighted complete graph WCG = <V, E, W> of step 1 as input and generating the minimum spanning tree MST with the Prim algorithm from graph theory: randomly select a point v from V and build two point sets P and Q, representing the points already in the MST and the points not yet in the MST, with P = {v} and Q = V - P. At each iteration the point of Q closest to P is selected and added to P, until Q is empty, at which time all points are in the MST.
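A minimal sketch of this Prim construction over the dense weight matrix produced by the previous sketch (O(n²) time; prim_mst is again an illustrative name):

```python
import numpy as np

def prim_mst(D):
    """Stage 2 sketch: Prim's algorithm on the dense weight matrix D.

    P is the set of points already in the MST, Q the rest; at each
    iteration the point of Q closest to P is moved into P. Returns the
    MST as a list of n - 1 edges (i, j, w_ij).
    """
    n = D.shape[0]
    in_p = np.zeros(n, dtype=bool)   # membership in P
    in_p[0] = True                   # P = {v}, here v = point 0
    best = D[0].copy()               # distance from each point of Q to P
    parent = np.zeros(n, dtype=int)  # nearest neighbour inside P
    edges = []
    for _ in range(n - 1):
        q = int(np.argmin(np.where(in_p, np.inf, best)))  # closest to P
        edges.append((int(parent[q]), q, float(best[q])))
        in_p[q] = True
        closer = ~in_p & (D[q] < best)                    # relax via q
        best[closer] = D[q][closer]
        parent[closer] = q
    return edges
```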
In a further embodiment, step 3 further comprises:
extracting the one-dimensional weight space S_1 of the MST edge set and applying K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2).
Let the optimal clustering function (Optimal Clustering) of S_N be OC(S_N, CN), where CN is the number of clusters. Define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>
where the edge sets E_Intra and E_Inter satisfy the following constraints:
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
According to Everitt's definition of a cluster, "the distance between any two points of the same cluster is smaller than the distance between any two points of different clusters", we can obtain:
max(E_Intra) < min(E_Inter)
That is, viewed from the final clustering result, the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters.
Since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }
From the two equations above we can derive: the edge set of MST(S_N) can be divided by the optimal clustering OC(S_N, M) of S_N into two disjoint subsets E_Intra and E_Inter, and the distance between E_Intra and E_Inter is sufficiently large. The projection of MST(S_N) onto OC(S_N, M) is therefore equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)
Thus the original, higher-complexity "N-dimensional space clustering OC(S_N, M) problem with an indefinite number of classes" is converted into the lower-complexity "one-dimensional space clustering OC(S_1, 2) problem with 2 classes". There are various ways to realize OC(S_1, 2); however, because OC(S_1, 2) has a fixed class number of 2, its distance computation is a simple subtraction, and its initial centers can simply be taken as the left and right endpoints of S_1, the deficiencies of the K-means algorithm are avoided, so the K-means algorithm is adopted to realize OC(S_1, 2).
In a further embodiment, step 4 further comprises:
pruning the MST with the threshold PT obtained in step 3, i.e. disconnecting all edges of the MST whose weight exceeds PT; after pruning, the originally fully connected MST becomes a forest, and the connected components of the forest F are the preliminary clustering result.
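A minimal sketch of the pruning stage. Since PT = min(W_2) is itself the smallest inter-cluster weight, this sketch reads the text's "greater than PT" inclusively and cuts edges whose weight reaches PT; the connected components of the resulting forest are recovered with a small union-find (both choices are assumptions of the sketch, not mandated by the patent):

```python
def prune(mst_edges, pt):
    """Stage 4 sketch: cut long MST edges and split into components.

    Edges with weight >= pt (i.e. the W_2 class, since pt = min(W_2))
    are disconnected; the surviving edges define the forest F, whose
    connected components are the preliminary clusters.
    """
    n = 1 + max(max(i, j) for i, j, _ in mst_edges)
    parent = list(range(n))                       # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]         # path halving
            a = parent[a]
        return a

    for i, j, w in mst_edges:
        if w < pt:                                # keep only short edges
            parent[find(i)] = find(j)

    components = {}
    for v in range(n):
        components.setdefault(find(v), []).append(v)
    return list(components.values())
```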
In a further embodiment, step 5 further comprises:
detecting connected components with fewer than 3 points, regarding them as outliers, and removing them with the outlier filtering function Filter; the filtered connected components are the final clustering result of the MNC algorithm. The connected components of the forest F obtained by pruning can serve as a preliminary clustering result, but real data sets often contain a lot of useless noise data, and without filtering the accuracy of the clustering result would decrease. In the clustering result this noise appears as outliers: points of low spatial density that lie relatively far from the normal data. To detect these outliers in F, it suffices to check the number of points of each connected component; the algorithm regards a connected component with fewer than 3 points as an outlier.
A parameterization-free clustering system based on a minimum spanning tree mainly comprises the following modules:
a first module for abstracting a data set to be clustered into a weighted complete graph WCG;
a second module for converting the WCG into a fully connected minimum spanning tree MST;
a third module for clustering the one-dimensional weight space of the MST edge set to obtain a pruning threshold;
a fourth module for pruning the MST, the resulting connected components being the initial clusters;
and a fifth module for performing outlier filtering on the initial clusters to obtain the final clustering result.
In a further embodiment, the first module is further used to regard each data sample in S_N as a vector in an N-dimensional Euclidean space and to measure the similarity between data samples by the Euclidean distance between the vectors:
D_ed(x_i, x_j) = sqrt( Σ_{k=1}^{N} (x_ik - x_jk)² )
where x_i = (x_i1, x_i2, ..., x_iN) denotes the given i-th sample, x_j = (x_j1, x_j2, ..., x_jN) denotes the given j-th sample, and D_ed(x_i, x_j) denotes the Euclidean distance between samples x_i and x_j;
the original data sample set is thus converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set
W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }
The second module further generates the minimum spanning tree MST with the Prim algorithm from graph theory: two point sets P and Q are constructed, representing the points already in the MST and the points not yet in the MST, and at each iteration the point of Q closest to P is selected and added to P, until Q is empty;
the fourth module further prunes the MST with the threshold PT obtained by the third module, i.e. disconnects all edges of the MST whose weight exceeds PT; after pruning, the originally fully connected MST becomes a forest, and the connected components of the forest F are the preliminary clustering result;
the fifth module further detects connected components with fewer than 3 points, regards them as outliers, and removes them with the outlier filtering function Filter; the filtered connected components are the final clustering result of the MNC algorithm.
In a further embodiment, the third module is further used to extract the one-dimensional weight space S_1 of the MST edge set and apply K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2):
Let the optimal clustering function of S_N be OC(S_N, CN), where CN is the number of clusters; define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>
where the edge sets E_Intra and E_Inter satisfy the following constraints:
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
and the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters:
max(E_Intra) < min(E_Inter)
Since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }
the projection of MST(S_N) onto OC(S_N, M) is equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)
where OC(S_N, M) denotes the N-dimensional space clustering problem with an indefinite number of classes and OC(S_1, 2) denotes the one-dimensional space clustering problem with 2 classes.
Beneficial effects:
First, the novel and efficient parameterization-free clustering algorithm of the invention depends only on the data set itself and needs no empirical parameters. The algorithm converts the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", thereby making up for the deficiencies of the K-means algorithm and truly realizing non-parametric clustering.
Second, the novel and efficient non-parametric clustering algorithm improves clustering efficiency and greatly reduces clustering time. Each iteration of the K-means algorithm must recompute the Euclidean distances between all data objects and the new centers of all clusters, and each such distance computation involves not only simple operations such as addition and subtraction but also complex squaring and square-root operations. In the MNC algorithm, the Euclidean distances between data objects are computed only once, in the first stage of the algorithm, and the two subsequent stages involve only simple addition, subtraction, and comparison operations; compared with the traditional K-means algorithm, the MNC algorithm therefore has lower computational complexity and consumes less time.
In conclusion, the invention can effectively solve the problem of dependence of traditional clustering algorithms such as K-means on empirical parameters, and has good practical application value.
Drawings
FIG. 1 is a block flow diagram of a novel efficient non-parameterized clustering algorithm of the present invention.
FIG. 2 is a flow chart of the present invention for generating a weighted complete graph.
FIG. 3 is a flow chart of generating a minimum spanning tree according to the present invention.
FIG. 4 is a flow chart of the present invention for generating pruning thresholds.
Fig. 5 is a comparison of clustering results of the conventional K-means algorithm, the MSTCluster algorithm based on the minimum spanning tree, and the MNC algorithm herein.
Fig. 6 shows the running time statistics of the three clustering algorithms under the same data set.
Detailed Description
The applicant believes that the traditional K-means algorithm as a partitional clustering algorithm has the following defects: the accuracy and computational complexity of clustering is heavily dependent on the initial cluster number k and the choice of initial cluster center parameters. In a large number of practical application scenarios, the data set is not only large in scale, but also is constantly in a dynamic change process, so that the clustering number and the clustering center are often difficult to predict and determine in advance.
Therefore, the invention provides MNC (MST based Non-parametric Clustering), a minimum-spanning-tree-based non-parametric clustering algorithm that can reduce clustering time, improve clustering efficiency, and eliminate the algorithm's dependence on empirical parameters.
The algorithm of the present invention is further described in detail below with reference to the accompanying drawings:
the symbolic meanings used by the algorithm are summarized as follows: sNIs an N (N is more than or equal to 2) dimensional data set to be clustered;
Figure BDA0002288205580000071
is a sample vector
Figure BDA0002288205580000072
And
Figure BDA0002288205580000073
the Euclidean distance between them; WCG (S)N) Is SNCorresponding weighted complete graph; e.g. of the typei,jIs WCG (S)N) Middle sample
Figure BDA0002288205580000074
And
Figure BDA0002288205580000075
the edges between can be expressed as binary groups
Figure BDA0002288205580000076
wi,jIs an edge ei,jWeight value of (2), i.e.
Figure BDA0002288205580000077
And
Figure BDA0002288205580000078
european distance between
Figure BDA0002288205580000079
MST(SN) Is WCG (S)N) The corresponding minimum spanning tree can be specifically expressed as a set of edges { ei,j};OC(SNCN) is SNAnd (4) corresponding to the optimal clustering function, wherein CN is the number of the clustered clusters.
FIG. 1 is a block flow diagram of a novel efficient non-parameterized clustering algorithm of the present invention. As shown in fig. 1, the specific idea of the algorithm is:
1. generating weighted complete graph WCG
FIG. 2 is a flow chart of generating the weighted complete graph according to the present invention. First, each data sample in S_N is regarded as a vector in an N-dimensional Euclidean space, and the similarity between data samples is measured by the Euclidean distance between the vectors; the original data sample set can then be converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }.
2. Generating a minimum spanning tree MST
FIG. 3 is a flow chart of generating the minimum spanning tree according to the present invention. The minimum spanning tree is generated with the Prim algorithm from graph theory: two point sets P and Q are constructed, representing the points already in the MST and the points not yet in it, and at each iteration the point of Q closest to P is selected and added to P, until Q is empty (all points are then in the MST).
3. Generating pruning threshold values
FIG. 4 is a flow chart of generating the pruning threshold according to the present invention. Suppose the optimal clustering function (optimal clustering) of S_N is OC(S_N, CN), where CN is the number of clusters. Define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>    (1)
where the edge sets E_Intra and E_Inter satisfy the constraints
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
According to Everitt's definition of a cluster, "the distance between any two points of the same cluster is smaller than the distance between any two points of different clusters", we obtain
max(E_Intra) < min(E_Inter)    (2)
That is, viewed from the final clustering result, the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters.
Since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }    (3)
According to formula (1), the edge set of MST(S_N) can be divided by the optimal clustering OC(S_N, M) of S_N into two disjoint subsets E_Intra and E_Inter, and the distance between E_Intra and E_Inter is sufficiently large. Combined with (2), the projection of MST(S_N) onto OC(S_N, M) is equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)    (4)
According to formula (4), the original, higher-complexity "N-dimensional space clustering OC(S_N, M) problem with an indefinite number of classes" is converted into the lower-complexity "one-dimensional space clustering OC(S_1, 2) problem with 2 classes". There are various ways to realize OC(S_1, 2); however, because OC(S_1, 2) has a fixed class number of 2, its distance computation is a simple subtraction, and its initial centers can simply be taken as the left and right endpoints of S_1, the deficiencies of the K-means algorithm are avoided, so the K-means algorithm is adopted to realize OC(S_1, 2). On the basis of this conclusion, the pruning-threshold generation process of the scheme is: extract the one-dimensional weight space S_1 of the MST edge set and apply K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the threshold is then PT = min(W_2).
4. Pruning and splitting
Prune the MST with the threshold PT obtained in the previous step, i.e. disconnect all edges of the MST whose weight exceeds PT. After pruning, the originally fully connected MST becomes a forest.
Output: forest F
5. Outlier filtering
The connected components of the forest F obtained by pruning can serve as a preliminary clustering result, but real data sets often contain a lot of useless noise data, and without filtering the accuracy of the clustering result would decrease. In the clustering result this noise appears as outliers: points of low spatial density that lie relatively far from the normal data. To detect these outliers in F, it suffices to check the number of points of each connected component; the algorithm regards a connected component with fewer than 3 points as an outlier. The filtered connected components are the final clustering result of the MNC algorithm.
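Tying the five stages together, here is a minimal end-to-end run using the hypothetical sketches introduced above on a small synthetic data set (the expected outcome under these assumptions: the two dense groups survive and the isolated point is filtered out as noise):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)),   # dense cluster 1
               rng.normal(5.0, 0.3, (40, 2)),   # dense cluster 2
               [[10.0, 10.0]]])                 # one isolated noise point

D      = weighted_complete_graph(X)   # stage 1: WCG weight matrix
mst    = prim_mst(D)                  # stage 2: minimum spanning tree
pt     = pruning_threshold(mst)       # stage 3: PT via 1-D 2-means
forest = prune(mst, pt)               # stage 4: pruning and splitting
result = filter_outliers(forest)      # stage 5: outlier filtering
print(len(result))                    # expected: 2 clusters
```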
Example one
In order to evaluate the beneficial effects of the invention, three different algorithms were compared experimentally: the traditional K-means algorithm, the minimum-spanning-tree-based MSTCluster algorithm, and the MNC algorithm herein; their clustering results on two-dimensional random data sets of different shapes are shown in Fig. 5. The traditional K-means is a parameterized algorithm: besides the data set to be clustered, its input needs the additional parameters k (the number of clusters) and the initial cluster centers. Compared with traditional K-means, the minimum-spanning-tree-based MSTCluster belongs to the non-parametric clustering algorithms, since it needs neither the cluster number k nor initial cluster centers; however, it still needs one input parameter, an adjustment factor used to determine the pruning threshold, so it is not completely parameter-free. The MNC algorithm herein is completely parameter-free and needs no input parameters at all.
From the results it can be seen that:
1. clustering effect and recognizable data shape
While the traditional K-means and the minimum-spanning-tree-based MSTCluster can only identify convex data clusters, the MNC algorithm of the invention identifies not only conventional convex data clusters but also ring-shaped and other non-convex data clusters. The reason is that, unlike the direct clustering of traditional K-means, MSTCluster and MNC both adopt an indirect approach of first merging and then dividing: the whole data set is first connected into one large class by the minimum spanning tree, then the overall properties of the data set are analyzed, and the division is carried out top-down under their guidance. Because the overall nature of the data set is fully utilized, the final result is more accurate than that of traditional K-means. Comparing the clustering results of MNC and MSTCluster shows that MNC is superior to MSTCluster: MSTCluster distinguished neither of the two ring-shaped clusters of the two-dimensional random data set, whereas MNC distinguished them successfully.
2. Algorithm runtime
The running-time statistics of the three clustering algorithms on the same data sets are shown in Fig. 6, with time in seconds. Since K-means uses repeated iterative experiments, its running time is taken as the average over multiple runs. As the data in Fig. 6 show, K-means takes the longest time, MSTCluster is second, and MNC is the shortest. The reason is that each iteration of the K-means algorithm must recompute the Euclidean distances between all data objects and the new centers of all clusters, and each such distance computation involves not only simple operations such as addition and subtraction but also complex squaring and square-root operations; in MSTCluster and MNC, the Euclidean distances between data objects are computed only once, in the first stage of the algorithm, and the two subsequent stages involve only simple addition, subtraction, and comparison operations, so their computational complexity is lower than that of traditional K-means. Comparing MSTCluster with MNC, although both are based on the minimum spanning tree in the early stage, the difference is that the generation of the pruning threshold in MSTCluster depends on an empirical parameter (the adjustment factor), whereas the pruning threshold of MNC is generated automatically from the minimum spanning tree, avoiding a search of the parameter space for the optimum; the computational complexity of MNC is therefore lower than that of MSTCluster.
In conclusion, compared with the traditional K-means and the minimum-spanning-tree-based MSTCluster, the MNC algorithm provided by the invention greatly reduces running time while improving the clustering effect.
The present invention is not limited to the above embodiments, and any easily conceivable changes or substitutions based on the mechanism of the present invention should fall within the scope of the present invention.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A parametrization-free clustering algorithm based on a minimum spanning tree is characterized by comprising the following steps:
step 1, abstracting a data set to be clustered into a weighted complete graph WCG, wherein points represent vectors, and weighted edges represent similarity relations among data;
step 2, converting the WCG into a fully-communicated minimum spanning tree MST;
step 3, clustering the one-dimensional weight space of the MST edge set by using a K-means algorithm with K being 2 to obtain a pruning threshold value;
step 4, pruning the MST, the resulting connected components being the initial clusters;
and step 5, performing outlier filtering on the initial clusters to obtain the final clustering result.
2. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 1 further comprises:
regarding each data sample in S_N as a vector in an N-dimensional Euclidean space, and measuring the similarity between data samples by the Euclidean distance between the vectors:
D_ed(x_i, x_j) = sqrt( Σ_{k=1}^{N} (x_ik - x_jk)² )
where x_i = (x_i1, x_i2, ..., x_iN) denotes the given i-th sample, x_j = (x_j1, x_j2, ..., x_jN) denotes the given j-th sample, and D_ed(x_i, x_j) denotes the Euclidean distance between samples x_i and x_j;
the original data sample set is converted into the weighted complete graph model WCG = <V, E, W>, where V, E, and W denote the point set, edge set, and weight set of the WCG, respectively; the point set V = S_N, the edge set E = V × V, and the weight set
W = { w_ij | w_ij = D_ed(x_i, x_j), e_ij ∈ E }
3. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 2 further comprises:
and (3) generating a Minimum Spanning Tree (MST) by using a Prim algorithm in a graph theory, namely constructing two point sets P and Q respectively representing the point sets in the MST and the points not in the MST, selecting the point closest to the P from the Q to add the P in each iteration until the Q is empty, wherein all the points are in the MST.
4. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 3 further comprises:
extracting the one-dimensional weight space S_1 of the MST edge set and applying K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2).
5. The minimum spanning tree-based parameterization-free clustering algorithm according to claim 4, wherein:
let SNHas an optimal clustering function of OC (S)NCN), wherein CN is the number of clusters after clustering; define MST (S)N) To OC (S)NM) of a projection function P whose output is to two disjoint sets of edges EIntraAnd EInterThe edge sets between the points in the final clustering result cluster and the edge sets between the clusters are respectively as follows:
P(OC(SN,M),MST(SN))=<EIntra,EInter>
wherein, the edge set EIntraAnd EInterThe following constraints are satisfied:
Figure FDA0002288205570000021
wherein MST (S)N) The longest edge of the inner point of the middle connection cluster is shorter than the shortest edge between the connection clusters:
max(EIntra)<min(EInter)
due to MST (S)N) The weight space of the edge set can form a one-dimensional data space S1Where the coordinates of each data sample are MST (S)N) The weight value of the middle edge is:
Figure FDA0002288205570000022
MST(SN) Projection onto OC (S)NM) is equivalent to a one-dimensional edge set weight data set S1The two classification processes of (1), namely:
Figure FDA0002288205570000023
in the formula, OC (S)NM) N-dimensional spatial clustering problem, OC (S), representing an indefinite number of classes1And 2) represents a one-dimensional spatial clustering problem with a class number of 2.
6. The minimum spanning tree based parameterization-free clustering algorithm according to claim 3, wherein the step 4 further comprises:
pruning the MST with the threshold PT obtained in step 3, i.e. disconnecting all edges of the MST whose weight exceeds PT; after pruning, the originally fully connected MST becomes a forest, and the connected components of the forest F are the preliminary clustering result.
7. The minimum spanning tree based parameterization-free clustering algorithm according to claim 1, wherein the step 5 further comprises:
and detecting a connected component with the number less than 3, regarding the connected component as an outlier, and filtering by using an outlier filtering function Filter, wherein the filtered connected component is a final clustering result of the MNC algorithm.
8. A parameterization-free clustering system based on a minimum spanning tree is characterized by comprising the following modules:
a first module for abstracting a data set to be clustered into a weighted complete graph WCG;
a second module for converting the WCG into a fully connected minimum spanning tree MST;
a third module for clustering the one-dimensional weight space of the MST edge set to obtain a pruning threshold;
a fourth module for pruning the MST, the resulting connected components being the initial clusters;
and a fifth module for performing outlier filtering on the initial clusters to obtain the final clustering result.
9. The minimum spanning tree based parameterless clustering system of claim 8, wherein:
the first module is further for coupling SNEach data sample in (2) is considered as a vector of an N-dimensional euclidean space, and the similarity between data samples is measured by the euclidean distance between vectors:
Figure FDA0002288205570000031
in the formula xiDenotes the given i-th sample, xjRepresenting the given jth sample of the sequence,
Figure FDA0002288205570000032
Figure FDA0002288205570000033
Ded(xi,xj) Represents a sample xiAnd xjThe Euclidean distance of (c);
the original data sample set is converted into the weighted complete graph model WCG ═<V,E,W>Wherein V, E and W represent the set of points, the set of edges, and the set of weights in the WCG, respectively; wherein, the point set V is SNSet of weights, E ═ V × V
Figure FDA0002288205570000034
The second module further generates a Minimum Spanning Tree (MST) by using a Prim algorithm in a graph theory, namely two point sets P and Q are constructed and respectively represent the point sets in the MST and the point sets not in the MST, and a point closest to P is selected from Q to be added into P during each iteration until Q is empty;
the fourth module further prunes the MST by using the threshold PT obtained by the third module, namely all edges with the weight larger than the PT in the MST are disconnected; after pruning, the original fully-connected MST can become a forest, and the connected component in the forest F is a preliminary clustering result;
the fifth module is further used for detecting a connected component with the number less than 3, regarding the connected component as an outlier, and filtering by using an outlier filtering function Filter, wherein the filtered connected component is a final clustering result of the MNC algorithm.
10. The system of claim 9, wherein the third module is further used to extract the one-dimensional weight space S_1 of the MST edge set and apply K-means with k = 2 to S_1, thereby converting the original, higher-complexity "N-dimensional space clustering problem with an indefinite number of classes" into the lower-complexity "one-dimensional space clustering problem with 2 classes", namely {W_1, W_2} = kmeans(S_1, 2), where max(W_1) < min(W_2); the pruning threshold is then PT = min(W_2):
let the optimal clustering function of S_N be OC(S_N, CN), where CN is the number of clusters; define a projection function P from MST(S_N) to OC(S_N, M), whose output is two disjoint edge sets E_Intra and E_Inter: the set of edges between points inside the clusters of the final clustering result, and the set of edges between clusters, respectively:
P(OC(S_N, M), MST(S_N)) = <E_Intra, E_Inter>
where the edge sets E_Intra and E_Inter satisfy the following constraints:
E_Intra ∪ E_Inter = MST(S_N), E_Intra ∩ E_Inter = ∅
and the longest edge of MST(S_N) connecting points within a cluster is shorter than the shortest edge connecting different clusters:
max(E_Intra) < min(E_Inter)
since the weight space of the edge set of MST(S_N) forms a one-dimensional data space S_1, in which the coordinate of each data sample is the weight of an edge of MST(S_N):
S_1 = { w_ij | e_ij ∈ MST(S_N) }
the projection of MST(S_N) onto OC(S_N, M) is equivalent to a two-class clustering of the one-dimensional edge-weight data set S_1, namely:
P(OC(S_N, M), MST(S_N)) ⇔ OC(S_1, 2)
where OC(S_N, M) denotes the N-dimensional space clustering problem with an indefinite number of classes and OC(S_1, 2) denotes the one-dimensional space clustering problem with 2 classes.
CN201911168955.7A 2019-11-25 2019-11-25 Non-parametric clustering algorithm and system based on minimum spanning tree Pending CN111062418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911168955.7A CN111062418A (en) 2019-11-25 2019-11-25 Non-parametric clustering algorithm and system based on minimum spanning tree


Publications (1)

Publication Number Publication Date
CN111062418A true CN111062418A (en) 2020-04-24

Family

ID=70298280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911168955.7A Pending CN111062418A (en) 2019-11-25 2019-11-25 Non-parametric clustering algorithm and system based on minimum spanning tree

Country Status (1)

Country Link
CN (1) CN111062418A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948732A (en) * 2021-01-14 2021-06-11 西安交通大学 Outlier detection method based on normalized minimum spanning tree clustering
CN112948732B (en) * 2021-01-14 2023-08-22 西安交通大学 Outlier detection method based on normalized minimum spanning tree clustering


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination