CN115293290A - Hierarchical clustering algorithm for automatically identifying cluster number - Google Patents

Hierarchical clustering algorithm for automatically identifying cluster number

Info

Publication number
CN115293290A
Authority
CN
China
Prior art keywords
clustering
data set
data
data points
neighbor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211039756.8A
Other languages
Chinese (zh)
Inventor
龙建武 (Long Jianwu)
王强 (Wang Qiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology
Priority to CN202211039756.8A
Publication of CN115293290A
Legal status: Pending



Abstract

The invention provides a hierarchical clustering algorithm that automatically identifies the cluster number, comprising the following steps: the density of each data point is calculated from μ, the maximum reverse-neighbor count over all data points when the natural-neighbor search stops, together with the sum of the Euclidean distances to the point's μ nearest neighbors, and a dynamic noise recognizer whose noise proportion is controlled by a manually input parameter identifies and removes noise points to obtain a denoised data set; exploiting the idea that closer points are more likely to be assigned to the same cluster, a partitional clustering method based on iteratively merging the graph of k-nearest-neighbor directed edges is designed for clustering, the relation between the cluster number and the iteration count is recorded, the optimal cluster number is judged from the maximum of the iteration count, and the optimal clustering result of the denoised data set is obtained; the noise data are then divided into clusters to give the final clustering result. The method thus constructs a k-nearest-neighbor directed graph on the denoised data set, clusters during the iterative graph-merging process, and judges the optimal cluster number of the data set from the maximum of the iteration count.

Description

Hierarchical clustering algorithm for automatically identifying cluster number
Technical Field
The invention relates to the technical field of data clustering, in particular to a hierarchical clustering algorithm for automatically identifying cluster numbers.
Background
Clustering is the most common unsupervised method in machine learning. Its goal is to divide the data into different clusters according to their similarity, so that data in the same cluster are as similar as possible and data in different clusters differ as much as possible. Clustering is widely applied in data mining, machine learning, document clustering, image segmentation, pattern recognition and other fields. According to how the clustering is implemented, the main approaches are the following: 1) Partition-based clustering: cluster centers are initialized from a given cluster number, points are divided into clusters according to their distance to the centers, and the centers are updated repeatedly until the clustering result reaches a stable state. 2) Graph-based clustering: an undirected graph is first constructed, and clusters are then divided with strategies such as normalized cut. 3) Density-based clustering: the density information of the data points is taken into account during clustering, which makes the method insensitive to noisy data and gives it better robustness. 4) Hierarchical clustering: a tree is built over the data points, and clusters are divided during tree construction with an agglomerative or divisive strategy.
k-means is the most typical partition-based clustering algorithm: it initializes cluster centers according to a given cluster number and, over multiple iterations, assigns each point to the cluster of its corresponding center. The k-means algorithm is conceptually simple and easy to implement and is one of the most widely used partition-based clustering algorithms, but it has two shortcomings: 1) the cluster number of the data set must be given manually; 2) the sample mean is used as the cluster center, so the algorithm cannot fit non-convex data sets and is relatively sensitive to noisy data. To overcome these shortcomings, researchers have proposed many improved algorithms, such as the k-means++ algorithm and other variants.
In addition, graph-based clustering methods such as the NCut algorithm first construct an undirected graph over the data set and then cluster the eigenvectors obtained by eigendecomposition of the Laplacian matrix corresponding to the undirected graph to obtain the final clustering result. Although graph-based clustering has good algorithmic robustness, most graph-based clustering algorithms still cannot automatically identify the cluster number.
Density-based clustering algorithms are widely used in cluster analysis because of their robustness and insensitivity to noisy data. The most typical density-based method is the DBSCAN algorithm, which regards the data set as dense regions separated by points in sparse regions and clusters by setting two parameters, the neighborhood radius epsilon (eps) and the density threshold MinPts. Although DBSCAN does not require the cluster number to be given manually, tuning its parameters is often cumbersome. In addition, in 2014 Rodriguez et al. proposed the density peak clustering (DPC) algorithm, which initializes the points with the highest density as cluster centers and then assigns each remaining point to the category of the nearest cluster center with a higher density; it also requires the cluster number to be given manually and cannot be applied to non-convex data sets.
A hierarchical clustering algorithm builds a tree over the data points and divides clusters during tree construction; the division strategies fall into two types, agglomerative hierarchical clustering and divisive hierarchical clustering. Hierarchical clustering can provide more clustering results than partition-based clustering, but selecting, from the many results, the one corresponding to the optimal cluster number is a very challenging problem; research on clustering algorithms that can automatically identify the cluster number is therefore of great significance.
Currently, the commonly used clustering evaluation indexes are mainly divided into external and internal evaluation indexes. External indexes usually require external information, such as the true cluster labels, to evaluate the clustering effect, whereas internal indexes evaluate the clusters using only the internal information already present in the data set and are usually used to identify the optimal cluster number of the data set.
Davies et al. proposed the Davies-Bouldin (DB) index, which jointly considers within-cluster sample similarity and between-cluster sample difference by measuring the mean, over all clusters, of each cluster's maximum similarity to another cluster. The smaller the DB value, the tighter each cluster is internally and the farther apart different clusters are, i.e., the smaller the within-cluster distances and the larger the between-cluster distances, and hence the better the clustering effect.
Rousseeuw et al. proposed the silhouette coefficient (Sil) index, which determines the cluster number from how similar a point is to its own cluster compared with the other clusters. First, a silhouette coefficient is computed for each point, and then the mean of the silhouette coefficients of all sample points in the sample space is taken as the final silhouette coefficient. For a single point, the mean dissimilarity between the point and the other points of its cluster measures within-cluster cohesion, and the minimum, over the other clusters, of the point's mean dissimilarity to each of those clusters measures the degree of separation. The Sil index takes values in [-1, 1], and the larger the Sil value, the better the clustering effect. The Sil index focuses on the clustering validity of individual samples and may therefore ignore features common to all samples.
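For illustration only (this sketch is not part of the patent text), the per-point silhouette computation described above can be written directly from its definition; the function name silhouette_scores, the brute-force distance matrix and the NumPy environment are assumptions made for clarity:

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-point silhouette coefficients: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
    where a(i) is the mean distance to the other points of i's own cluster
    (cohesion) and b(i) is the smallest mean distance to any other cluster
    (separation)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():                    # singleton cluster: coefficient taken as 0
            continue
        a = d[i, same].mean()                 # within-cluster cohesion
        b = min(d[i, labels == c].mean()      # separation from the nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# The Sil index of a clustering is then the mean over all sample points:
# sil_index = silhouette_scores(X, labels).mean()
```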
To speed up the computation and improve efficiency, Chen et al. used the idea of natural neighbors to find local density core points and constructed the LCCV index for judging the cluster number. The LCCV index computes within-cluster compactness from the mean distance between a local core point and the other local core points of the same cluster, and measures between-cluster separation by the minimum of the mean distances between a cluster's local core points and the local core points of the other clusters. Since the LCCV index is an improvement on the Sil index, it shares the same drawback.
In view of the problems of the above methods, it is important to design a clustering algorithm that can identify the optimal cluster number of data sets of arbitrary shape that contain noise points.
Disclosure of Invention
The invention provides a hierarchical clustering algorithm for automatically identifying the cluster number, aiming to solve the following technical problems of existing hierarchical clustering algorithms: the optimal cluster number is determined by computing the value of an internal evaluation index for every clustering result generated during clustering, which involves a large amount of computation; and most internal evaluation indexes only consider the within-cluster compactness and between-cluster separation corresponding to different cluster numbers and cannot reflect the global similarity between data points, so the optimal cluster number cannot be determined accurately.
In order to solve the technical problem, the invention adopts the following technical scheme:
a hierarchical clustering algorithm for automatically identifying cluster numbers comprises the following steps:
s1, calculating the density of data points by utilizing the sum of the maximum value mu of the reverse neighbor number of all data points in an original data set when natural neighbor search is stopped and the Euclidean distance of the neighbor of the data points mu, designing a dynamic noise recognizer for controlling the noise proportion by manually inputting parameters aiming at the data set with the calculated density to recognize noise points, and denoising to obtain a denoised ideal data set;
s2, designing a method for dividing and clustering based on the k neighbor directed edge iterative graph merging process to perform clustering by utilizing the idea that points with closer distances are easier to divide into the same cluster, recording the relation between cluster numbers and iteration times, judging the optimal cluster number of the k value by utilizing the maximum value of the iteration times, taking the k value as all integers in the range of [3,9] in order to avoid the excessive sensitivity of an algorithm to the k value due to the complex diversity of a data set, taking the mode of the iteration times maximum value corresponding to the cluster numbers obtained by different k values as the optimal cluster number of the data set, and obtaining the optimal clustering result of the denoised data set;
s3, synchronizing the clustering labels of the denoised data into the original data, and defining the following noisy point division rules on the basis of the optimal clustering result of the denoised data set: and dividing the noise point into the category of the non-noise point data which is closest to the noise point and has a density higher than that of the noise point to obtain a final clustering result.
Further, the step S1 specifically includes the following steps:
s11, inputting an original data set, and defining the original data set as X;
s12, calculating the density of the data points according to the maximum value mu of the reverse neighbor number of all the data points in the original data set when the natural neighbor search is stopped and the sum of the Euclidean distances of the neighbors of the data points mu, wherein the specific calculation mode is defined as follows:
ρ(p) = μ / Σ_{j ∈ NN_μ(p)} dist(p, j)
μ = max(nb_λ(X))
λ = min{ k | Σ_{i=1}^{n} f(x_i) = n, or Σ_{i=1}^{n} f(x_i) remains unchanged over two consecutive iterations }
where μ represents the maximum reverse-neighbor count over all data points in the original data set when the natural-neighbor iteration stops, NN_μ(p) represents the μ nearest neighbors of data point p, dist(p, j) represents the Euclidean distance between data points p and j, and nb_λ(X) represents the reverse-neighbor counts of all data points in X when the natural-neighbor search reaches a stable state, λ being the natural characteristic value at which the search stops; the search increases k gradually from k = 1 until either every data point has a reverse neighbor or the set of data points possessing reverse neighbors remains unchanged over two consecutive iterations, and then stops. nb_k(x_i) represents the number of reverse neighbors of data point x_i at the k-th iteration of the natural-neighbor search, and the function f(x) indicates whether a data point possesses reverse neighbors, defined as follows:
f(x) = 1 if nb_k(x) > 0, and f(x) = 0 otherwise
s13, aiming at the data set with the calculated density, designing a dynamic noise recognizer as follows:
τ(α) = mean(ρ(X)) - Φ^{-1}(1 - α) * σ(ρ(X))
where τ(α) represents the noise threshold, mean(ρ(X)) represents the mean density of the data points, Φ^{-1}(·) represents the quantile function of the standard normal distribution, α represents the noise parameter with value range [0, 1), and σ(·) represents the standard deviation function; the more noise points the data set contains, the larger the required noise threshold τ(α), and hence the larger the noise parameter α that needs to be set.
Further, the step S2 specifically includes the following steps:
s21, on the basis of denoising of a data set, dividing and clustering the denoised data, recording the denoised data set as D, and constructing a k neighbor directed graph G = (V, E) for the data set D by adopting a KNN algorithm, wherein V represents a set of retained data points after denoising, E represents a set of k neighbor directed edges, and the weight of the k neighbor directed graph is measured by adopting an Euclidean distance;
s22, initializing n data points of the denoised data set D into n clusters, and simultaneously initializing a clustering number of cluNum = n and an iteration number iter of 0;
s23, sorting the directed edges in the E from small to large, traversing the sorted directed edges, and adding 1 to the iteration times iter if data points connected by the traversed directed edges belong to the same cluster; if the data points connected by the traversed directed edges belong to different clusters, adding 1 to the iteration number iter, recording the iteration number of the current clustering number, then combining the two clusters into one cluster, and simultaneously subtracting 1 from the clustering number cluNum;
s24, merging and clustering the data points by using the previous step, stopping the algorithm until n data points are merged into a cluster or the traversal of the directed edge is completed, recording the change condition of the clustering number along with the iteration number, and selecting the clustering number corresponding to the maximum value of the iteration number as the optimal clustering number of the value k;
s25, defining the value of k as all integers in the range of [3,9], taking the mode of the clustering number corresponding to the maximum value of the iteration times obtained by taking different values of k as the final optimal clustering number of the data set, and taking the corresponding clustering result as the final clustering result of the denoised data set.
Compared with the prior art, the hierarchical clustering algorithm for automatically identifying the clustering number provided by the invention has the following advantages:
1. Existing methods for judging the cluster number generally construct an internal evaluation index from within-cluster compactness and between-cluster separation, judging the optimal cluster number while obtaining the clustering result. In contrast, the invention does not use within-cluster compactness and between-cluster separation to identify the cluster number, so these quantities need not be computed, which effectively reduces the amount of computation.
2. The original hierarchical clustering approach computes the similarity between two clusters from the between-cluster distance or the distance between the two cluster centroids and merges clusters on that basis, which cannot reflect the global similarity between data points. The invention merges clusters mainly by considering the distances between data points, so it can fit data sets of arbitrary shape and achieve a better clustering effect.
3. After constructing the k-nearest-neighbor directed graph, the method divides clusters during the iterative merging of the graph along the directed edges ordered from small to large weight. Building the directed graph with k nearest neighbors, rather than with full connections, better reflects the internal structure of the clusters, and so avoids the problem that excessive connections between data points of different clusters in a fully connected graph prevent the optimal cluster number from being judged accurately and degrade the final clustering effect.
4. When judging the optimal cluster number, to reduce the influence of k on the clustering effect, k is defined to take all integer values in the range [3, 9] and the mode of the cluster numbers corresponding to the maximum iteration counts obtained for the different k values is taken as the optimal cluster number. Choosing k within an appropriate range effectively avoids the problems that too small a k leaves the data within a cluster disconnected while too large a k over-connects the data between clusters, either of which prevents the optimal cluster number from being judged accurately; a better effect is thus obtained when clustering the data set.
Drawings
FIG. 1 is a schematic flow chart of a hierarchical clustering algorithm for automatically identifying cluster numbers according to the present invention.
FIG. 2 shows the result of setting a suitable noise parameter for the input data set, calculating the noise threshold with the noise recognizer, and then denoising. (FIG. 2(a) is the raw data set; FIG. 2(b) is the denoised data set, in which the red dots represent the data points remaining after denoising and the gray dots represent the noise points; the noise parameter is set to 0.13.)
FIG. 3(a) shows how the cluster number obtained during the merging of the k-nearest-neighbor (k = 5) directed graph constructed by the proposed algorithm on the denoised data set of FIG. 2(b) changes with the iteration count; FIG. 3(b) shows the cluster numbers corresponding to the maximum iteration counts obtained for the different values of k when k takes all integer values in the range [3, 9].
FIG. 4(a) is the final clustering result obtained by clustering the denoised data set of FIG. 2(b), and FIG. 4(b) is the corresponding clustering result for the data set of FIG. 2(a).
FIG. 5 compares the optimal cluster numbers and clustering results obtained on three synthetic data sets by the DB index, the Sil index and the LCCV index applied to a hierarchical clustering algorithm, and by the algorithm of the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further explained below with reference to the drawings.
Referring to fig. 1 to 5, the present invention provides a hierarchical clustering algorithm for automatically identifying cluster numbers, which includes the following steps:
s1, calculating the density of data points by utilizing the sum of the maximum value mu of the reverse neighbor number of all data points in an original data set when natural neighbor search is stopped and the Euclidean distance of the neighbor of the data points mu, designing a dynamic noise recognizer for controlling the noise proportion by manually inputting parameters aiming at the data set with the calculated density to recognize noise points, and denoising to obtain a denoised ideal data set;
s2, designing a method for dividing and clustering based on the k neighbor directed edge iterative graph merging process to perform clustering by utilizing the idea that points with closer distances are easier to divide into the same cluster, recording the relation between cluster numbers and iteration times, judging the optimal cluster number of the k value by utilizing the maximum value of the iteration times, taking the k value as all integers in the range of [3,9] in order to avoid the excessive sensitivity of an algorithm to the k value due to the complex diversity of a data set, taking the mode of the iteration times maximum value corresponding to the cluster numbers obtained by different k values as the optimal cluster number of the data set, and obtaining the optimal clustering result of the denoised data set;
s3, regarding the noise point division, the noise point is divided into the categories to which the noise points with the closest distances belong by only considering the distance between the noise point and the noise points in the traditional division mode, but for some data sets with uneven densities, the noise point division result is possibly inaccurate by only considering the distance information, and in order to further improve the accuracy of the noise point division, the density information is considered in the noise point division; specifically, synchronizing the clustering label of the denoised data into the original data, and defining the following noisy point partition rule on the basis of the optimal clustering result of the denoised data set: and dividing the noise point into the category of the non-noise point data which is closest to the noise point and has a density higher than that of the noise point to obtain a final clustering result.
As a specific embodiment, the step S1 specifically includes the following steps:
s11, inputting an original data set, and defining the original data set as X;
s12, calculating the density of the data points according to the maximum value mu of the reverse neighbor number of all the data points in the original data set when the natural neighbor search is stopped and the sum of Euclidean distances of the neighbors of the data points mu, wherein the specific calculation mode is defined as follows:
ρ(p) = μ / Σ_{j ∈ NN_μ(p)} dist(p, j)    formula (1)
μ = max(nb_λ(X))    formula (2)
λ = min{ k | Σ_{i=1}^{n} f(x_i) = n, or Σ_{i=1}^{n} f(x_i) remains unchanged over two consecutive iterations }    formula (3)
Wherein μ represents the maximum reverse-neighbor count over all data points in the original data set when the natural-neighbor iteration stops, NN_μ(p) represents the μ nearest neighbors of data point p, dist(p, j) represents the Euclidean distance between data points p and j, and nb_λ(X) represents the reverse-neighbor counts of all data points in X when the natural-neighbor search reaches a stable state, λ being the natural characteristic value at which the search stops; the search increases k gradually from k = 1 until either every data point has a reverse neighbor or the set of data points possessing reverse neighbors remains unchanged over two consecutive iterations, and then stops. nb_k(x_i) represents the number of reverse neighbors of data point x_i at the k-th iteration of the natural-neighbor search, and the function f(x) indicates whether a data point possesses reverse neighbors, defined as follows:
f(x) = 1 if nb_k(x) > 0, and f(x) = 0 otherwise    formula (4)
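As an aid to understanding only (not part of the patent text), the natural-neighbor search of step S12 can be sketched as follows; the function name natural_neighbor_search, the brute-force neighbor search and the exact form of the stopping test are assumptions made for illustration:

```python
import numpy as np

def natural_neighbor_search(X):
    """Grow k from 1 and accumulate reverse-neighbour counts nb_k(x_i) until every
    point has a reverse neighbour, or the number of points possessing reverse
    neighbours stays unchanged over two consecutive rounds (cf. formulas (2)-(4)).
    Returns (lambda, nb_lambda, mu)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)          # neighbours of each point, nearest first
    nb = np.zeros(n, dtype=int)            # reverse-neighbour counts
    prev_have, k = -1, 0
    while k < n - 1:
        k += 1
        for i in range(n):                 # the k-th nearest neighbour of i gains a reverse neighbour
            nb[order[i, k - 1]] += 1
        have = int((nb > 0).sum())         # number of points with reverse neighbours (sum of f)
        if have == n or have == prev_have: # stopping rule of the natural-neighbour search
            break
        prev_have = have
    lam = k                                # natural characteristic value lambda
    mu = int(nb.max())                     # formula (2): mu = max(nb_lambda(X))
    return lam, nb, mu
```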
s13, aiming at the data set with the calculated density, designing a dynamic noise recognizer as follows:
τ(α) = mean(ρ(X)) - Φ^{-1}(1 - α) * σ(ρ(X))    formula (5)
Where τ(α) represents the noise threshold, mean(ρ(X)) represents the mean density of the data points, Φ^{-1}(·) represents the quantile function of the standard normal distribution, α represents the noise parameter with value range [0, 1), and σ(·) represents the standard deviation function; the more noise points the data set contains, the larger the required noise threshold τ(α), and hence the larger the noise parameter α that needs to be set. The noise recognizer denoises the data once an appropriate noise parameter is set.
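A minimal sketch of the density of formula (1) and the dynamic noise recognizer of formula (5) might then look as follows (not part of the patent text); it assumes SciPy's norm.ppf for the quantile function Φ^{-1}, reuses the natural_neighbor_search sketch above, and the reciprocal-of-distance-sum form of the density is the reconstruction given for formula (1) rather than a verbatim quotation of the patent:

```python
import numpy as np
from scipy.stats import norm

def densities(X, mu):
    """Density of each point: mu divided by the sum of the Euclidean distances
    to its mu nearest neighbours (reconstructed formula (1))."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nearest = np.sort(d, axis=1)[:, :mu]          # distances to the mu nearest neighbours
    return mu / nearest.sum(axis=1)

def noise_threshold(rho, alpha):
    """Dynamic noise recogniser of formula (5): tau(alpha) = mean(rho) - Phi^{-1}(1 - alpha) * std(rho)."""
    return rho.mean() - norm.ppf(1.0 - alpha) * rho.std()

# Usage sketch: points whose density falls below tau(alpha) are treated as noise.
# lam, nb, mu = natural_neighbor_search(X)
# rho = densities(X, mu)
# noise_mask = rho < noise_threshold(rho, alpha=0.13)
```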
As a specific embodiment, the step S2 specifically includes the following steps:
s21, on the basis of denoising of a data set, dividing and clustering the denoised data, recording the denoised data set as D, and constructing a k neighbor directed graph G = (V, E) for the data set D by adopting the conventional KNN algorithm, wherein V represents a set of retained data points after denoising, E represents a set of k neighbor directed edges, and the weight of the Euclidean distance measuring edges is adopted when the k neighbor directed graph is constructed;
s22, initializing n data points of the denoised data set D into n clusters, and simultaneously initializing a clustering number of cluNum = n and an iteration number iter of 0;
s23, sorting the directed edges in the E from small to large, traversing the sorted directed edges, and adding 1 to the iteration times iter if data points connected by the traversed directed edges belong to the same cluster; if the traversed data points connected by the directed edges belong to different clusters, adding 1 to the iteration number iter, recording the iteration number of the current clustering number, then combining the two clusters into one cluster, and simultaneously subtracting 1 from the clustering number cluNum;
s24, merging and clustering the data points by using the previous step, stopping the algorithm until n data points are merged into a cluster or traversing the directed edges is completed, recording the change condition of the clustering number along with the iteration number, and selecting the clustering number corresponding to the maximum value of the iteration number as the optimal clustering number of the value k;
s25, selecting a k value has certain influence on a clustering result, and in order to solve the problem, the value of k is defined in a range, specifically, the value of k is defined as all integers in the range of [3,9], the mode of the iteration number maximum value obtained by different values of k corresponding to the clustering number is used as the final optimal clustering number of the data set, and the corresponding clustering result is used as the final clustering result of the denoised data set.
In order to better understand the hierarchical clustering algorithm for automatically identifying the cluster number provided by the present invention, the technical solution is explained in detail below with reference to a specific implementation:
(1) Input the data set X and, using the idea of natural neighbors, obtain the natural characteristic value λ and the reverse-neighbor counts nb_λ(X) of all data points when the natural-neighbor search stops; compute the density of the data points with formula (1) from μ, the maximum reverse-neighbor count over all data points, together with the sum of the Euclidean distances from each point to its μ nearest neighbors; then set the noise parameter α, obtain the corresponding noise threshold τ(α) from formula (5), and judge points whose density is less than τ(α) to be noise points. FIG. 2(b) shows the data obtained by denoising the data set of FIG. 2(a), which contains 513 data points in total, with the noise parameter α set to 0.13 and the noise points identified by formula (5). The result shows that, by setting a suitable noise parameter α, the denoised data set is close to an ideal data set.
(2) For clustering, closer points are more easily assigned to the same cluster and farther points to different clusters. Clustering is performed with the method proposed in step S2: first, a k-nearest-neighbor directed graph G = (V, E) is constructed on the denoised data set with the KNN algorithm, where V is the set of denoised data points and E is the set of constructed k-nearest-neighbor directed edges. The weight of each constructed directed edge is measured by Euclidean distance, and the directed edges are sorted from small to large by weight.
Traversing the sorted directed edges from small to large weight, clusters are divided with the method proposed in step S23, giving the clustering results of the denoised data set for the different cluster numbers. Traditional hierarchical clustering obtains a large number of clustering results through an agglomerative or divisive tree-based strategy, but its clustering idea cannot accurately pick, from the many results, the one corresponding to the optimal cluster number. The invention instead divides clusters during the iterative merging of the graph built from k-nearest-neighbor directed edges and adopts the strategy of taking the cluster number corresponding to the maximum iteration count as the optimal cluster number, for the following reasons:
First, when the k-nearest-neighbor directed graph is merged iteratively along the directed edges from small to large, the merging of smaller clusters before the optimal cluster number is reached causes the cluster number to drop sharply, so the iteration count associated with each of those cluster numbers is small; once the optimal cluster number is reached, a stable state is attained and the cluster number no longer changes over a long stretch of the continuing iterative graph-merging process.
Second, when continued merging pushes the cluster number below the optimal one, clusters that do not belong together are merged; because the graph is built in a k-nearest-neighbor manner, directed edges connecting data points that do not belong to the same cluster are few or absent, and the algorithm stops once all data points have been merged into a single cluster or all directed edges have been traversed, so the iteration counts associated with the cluster numbers beyond the optimal one are likewise relatively small.
Because of these two situations, the iteration counts of the cluster numbers before and after the stable state are both smaller than the iteration count of the cluster number in the stable state, so the cluster number corresponding to the maximum iteration count is selected as the optimal cluster number.
For the denoised data set of FIG. 2(b), a directed graph is constructed in the k-nearest-neighbor (k = 5) manner described above; after the directed edges are sorted from small to large, the cluster number obtained while traversing them during the iterative graph merging changes with the iteration count as shown in FIG. 3(a). The cluster number corresponding to the maximum iteration count obtained during this merging process is 5, so the optimal cluster number of the data set is 5, and the corresponding clustering result is the final clustering result of the denoised data set.
The choice of k has a great influence on the judgment of the cluster number: when k is too small, the interior of a cluster is not fully connected, so the algorithm cannot identify the internal structure of the cluster and the clustering effect suffers; when k is too large, data points that do not belong to the same cluster become over-connected, and the optimal cluster number cannot be judged accurately. To address this, the invention performs the directed-edge iterative graph-merging clustering several times with different values of k, i.e. k is changed from a single fixed value to a range of values; in the experiments k takes all integer values in the range [3, 9], the mode of the cluster numbers corresponding to the maximum iteration counts obtained with the different k values is taken as the optimal cluster number, and the corresponding result is taken as the final clustering result. For the denoised data set of FIG. 2(b), the cluster numbers corresponding to the maximum iteration counts obtained with k taking all integer values in the range [3, 9] are shown in FIG. 3(b). The optimal cluster number of the denoised data set of FIG. 2(b) over the range [3, 9] is 5, the clustering result with 5 clusters is the optimal clustering result, and the final clustering result of the denoised data set of FIG. 2(b) is shown in FIG. 4(a).
(3) After the optimal cluster number and its corresponding clustering result have been obtained for the data set of FIG. 2 by the above process, the noise points removed in the first step are divided into clusters. For the data set of FIG. 2(a), the final clustering result obtained by the invention is shown in FIG. 4(b).
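A minimal sketch of this noise-division rule (not part of the patent text) might look as follows, assuming the arrays produced by the earlier sketches: the densities rho, a boolean noise_mask, and labels_clean holding the cluster labels of the retained (non-noise) points in order; the fallback to the nearest retained point when no retained point has a higher density is an added assumption, not something specified in the patent:

```python
import numpy as np

def assign_noise(X, rho, noise_mask, labels_clean):
    """Assign each noise point to the cluster of the nearest non-noise point
    whose density is higher than the noise point's own density (rule of S3)."""
    X = np.asarray(X, dtype=float)
    labels = np.full(len(X), -1, dtype=int)
    clean_idx = np.flatnonzero(~noise_mask)
    labels[clean_idx] = labels_clean                  # labels of the denoised data set
    for i in np.flatnonzero(noise_mask):
        candidates = clean_idx[rho[clean_idx] > rho[i]]
        if len(candidates) == 0:                      # fallback: nearest retained point (assumption)
            candidates = clean_idx
        d = np.linalg.norm(X[candidates] - X[i], axis=1)
        labels[i] = labels[candidates[np.argmin(d)]]
    return labels
```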
(4) FIGS. 5(a), (b), (c) and (d) respectively show the clustering results obtained on three synthetic data sets by the DB index, the Sil index and the LCCV index applied to a hierarchical clustering algorithm, and by the algorithm of the present invention. The comparison shows that the DB index, the Sil index and the LCCV index applied to a hierarchical clustering algorithm give poor clustering effects and cannot fit the three synthetic data sets well.
(5) Table 2 shows the results obtained after clustering the five real data sets of Table 1 with the DB index, the Sil index and the LCCV index applied to a hierarchical clustering algorithm, and with the algorithm of the present invention, the clustering effect being evaluated with the clustering evaluation indexes ACC, NMI and ARI. Compared with the comparison algorithms, the algorithm of the invention accurately identifies the optimal cluster number of all five real data sets; its clustering effect on the four real data sets Iris, Cancer, Seeds and Leuk is better than that of the comparison algorithms, and its result on the Heart data set is close to the best result obtained by the comparison algorithms. In conclusion, the proposed algorithm can identify the optimal cluster number of real data sets and obtain a better clustering effect.
TABLE 1 basic information of five real data sets
[Table 1 is provided only as an image in the published document and is not reproduced here.]
Table 2 results of clustering evaluation index
[Table 2 is provided only as an image in the published document and is not reproduced here.]
Compared with the prior art, the hierarchical clustering algorithm for automatically identifying the clustering number provided by the invention has the following advantages:
1. Existing methods for judging the cluster number generally construct an internal evaluation index from within-cluster compactness and between-cluster separation, judging the optimal cluster number while obtaining the clustering result. In contrast, the invention does not use within-cluster compactness and between-cluster separation to identify the cluster number, so these quantities need not be computed, which effectively reduces the amount of computation.
2. The original hierarchical clustering approach computes the similarity between two clusters from the between-cluster distance or the distance between the two cluster centroids and merges clusters on that basis, which cannot reflect the global similarity between data points. The invention merges clusters mainly by considering the distances between data points, so it can fit data sets of arbitrary shape and achieve a better clustering effect.
3. After constructing the k-nearest-neighbor directed graph, the method divides clusters during the iterative merging of the graph along the directed edges ordered from small to large weight. Building the directed graph with k nearest neighbors, rather than with full connections, better reflects the internal structure of the clusters, and so avoids the problem that excessive connections between data points of different clusters in a fully connected graph prevent the optimal cluster number from being judged accurately and degrade the final clustering effect.
4. When judging the optimal cluster number, to reduce the influence of k on the clustering effect, k is defined to take all integer values in the range [3, 9] and the mode of the cluster numbers corresponding to the maximum iteration counts obtained for the different k values is taken as the optimal cluster number. Choosing k within an appropriate range effectively avoids the problems that too small a k leaves the data within a cluster disconnected while too large a k over-connects the data between clusters, either of which prevents the optimal cluster number from being judged accurately; a better effect is thus obtained when clustering the data set.
Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions should be covered by the claims of the present invention.

Claims (3)

1. A hierarchical clustering algorithm for automatically identifying cluster numbers is characterized by comprising the following steps:
s1, calculating the density of data points by using the sum of the maximum value mu of the reverse neighbor number of all the data points in an original data set when natural neighbor search is stopped and the Euclidean distance of the neighbor of the data point mu, designing a dynamic noise recognizer for controlling the noise ratio by manually inputting parameters aiming at the data set with the calculated density to recognize noise points, and denoising to obtain a denoised ideal data set;
s2, designing a method for dividing and clustering based on the k neighbor directed edge iterative graph merging process to perform clustering by utilizing the idea that points with closer distances are easier to divide into the same cluster, recording the relation between cluster numbers and iteration times, judging the optimal cluster number of the k value by utilizing the maximum value of the iteration times, taking the k value as all integers in the range of [3,9] in order to avoid the excessive sensitivity of an algorithm to the k value due to the complex diversity of a data set, taking the mode of the iteration times maximum value corresponding to the cluster numbers obtained by different k values as the optimal cluster number of the data set, and obtaining the optimal clustering result of the denoised data set;
s3, synchronizing the clustering label of the denoised data into the original data, and defining the following noise point division rule on the basis of the optimal clustering result of the denoised data set: and dividing the noise point into the category of the non-noise point data which is closest to the noise point and has a density higher than that of the noise point to obtain a final clustering result.
2. The hierarchical clustering algorithm for automatically identifying cluster numbers according to claim 1, wherein the step S1 specifically comprises the following steps:
s11, inputting an original data set, and defining the original data set as X;
s12, calculating the density of the data points according to the maximum value mu of the reverse neighbor number of all the data points in the original data set when the natural neighbor search is stopped and the sum of Euclidean distances of the neighbors of the data points mu, wherein the specific calculation mode is defined as follows:
ρ(p) = μ / Σ_{j ∈ NN_μ(p)} dist(p, j)
μ = max(nb_λ(X))
λ = min{ k | Σ_{i=1}^{n} f(x_i) = n, or Σ_{i=1}^{n} f(x_i) remains unchanged over two consecutive iterations }
wherein μ represents the maximum reverse-neighbor count over all data points in the original data set when the natural-neighbor iteration stops, NN_μ(p) represents the μ nearest neighbors of data point p, dist(p, j) represents the Euclidean distance between data points p and j, and nb_λ(X) represents the reverse-neighbor counts of all data points in X when the natural-neighbor search reaches a stable state, λ being the natural characteristic value at which the search stops; the search increases k gradually from k = 1 until either every data point has a reverse neighbor or the set of data points possessing reverse neighbors remains unchanged over two consecutive iterations, and then stops. nb_k(x_i) represents the number of reverse neighbors of data point x_i at the k-th iteration of the natural-neighbor search, and the function f(x) indicates whether a data point possesses reverse neighbors, defined as follows:
f(x) = 1 if nb_k(x) > 0, and f(x) = 0 otherwise
s13, aiming at the data set with the calculated density, designing a dynamic noise recognizer as follows:
τ(α) = mean(ρ(X)) - Φ^{-1}(1 - α) * σ(ρ(X))
where τ(α) represents the noise threshold, mean(ρ(X)) represents the mean density of the data points, Φ^{-1}(·) represents the quantile function of the standard normal distribution, α represents the noise parameter with value range [0, 1), and σ(·) represents the standard deviation function; the more noise points the data set contains, the larger the required noise threshold τ(α), and hence the larger the noise parameter α that needs to be set.
3. The hierarchical clustering algorithm for automatically identifying cluster numbers according to claim 1, wherein the step S2 specifically comprises the following steps:
s21, on the basis of denoising the data set, dividing and clustering the denoised data, recording the denoised data set as D, and constructing a k neighbor directed graph G = (V, E) for the data set D by adopting a KNN algorithm, wherein V represents a set of denoised reserved data points, E represents a set of k neighbor directed edges, and the weight of the k neighbor directed graph is measured by adopting an Euclidean distance;
s22, initializing n data points of the denoised data set D into n clusters, and simultaneously initializing a clustering number of cluNum = n and an iteration number iter of 0;
s23, sequencing the directed edges in the E from small to large, traversing the sequenced directed edges, and if data points connected by the traversed directed edges belong to the same cluster, adding 1 to the iteration number iter; if the data points connected by the traversed directed edges belong to different clusters, adding 1 to the iteration number iter, recording the iteration number of the current clustering number, then combining the two clusters into one cluster, and simultaneously subtracting 1 from the clustering number cluNum;
s24, merging and clustering the data points by using the previous step, stopping the algorithm until n data points are merged into a cluster or the traversal of the directed edge is completed, recording the change condition of the clustering number along with the iteration number, and selecting the clustering number corresponding to the maximum value of the iteration number as the optimal clustering number of the value k;
s25, defining the value of k as all integers in the range of [3,9], taking the mode of the clustering number corresponding to the maximum value of the iteration times obtained by different values of k as the final optimal clustering number of the data set, and taking the corresponding clustering result as the final clustering result of the denoised data set.
CN202211039756.8A 2022-08-29 2022-08-29 Hierarchical clustering algorithm for automatically identifying cluster number Pending CN115293290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211039756.8A CN115293290A (en) 2022-08-29 2022-08-29 Hierarchical clustering algorithm for automatically identifying cluster number

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211039756.8A CN115293290A (en) 2022-08-29 2022-08-29 Hierarchical clustering algorithm for automatically identifying cluster number

Publications (1)

Publication Number Publication Date
CN115293290A true CN115293290A (en) 2022-11-04

Family

ID=83832879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211039756.8A Pending CN115293290A (en) 2022-08-29 2022-08-29 Hierarchical clustering algorithm for automatically identifying cluster number

Country Status (1)

Country Link
CN (1) CN115293290A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115527610A (en) * 2022-11-09 2022-12-27 上海交通大学 Cluster analysis method of unicellular omics data
CN115527610B (en) * 2022-11-09 2023-11-24 上海交通大学 Cluster analysis method for single-cell histology data
CN116166960A (en) * 2023-02-07 2023-05-26 河南大学 Big data characteristic cleaning method and system for neural network training
CN116166960B (en) * 2023-02-07 2023-09-29 山东经鼎智能科技有限公司 Big data characteristic cleaning method and system for neural network training
CN117423338A (en) * 2023-12-18 2024-01-19 卓世未来(天津)科技有限公司 Digital human interaction dialogue method and system
CN117423338B (en) * 2023-12-18 2024-03-08 卓世未来(天津)科技有限公司 Digital human interaction dialogue method and system

Similar Documents

Publication Publication Date Title
CN115293290A (en) Hierarchical clustering algorithm for automatically identifying cluster number
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN106570178B (en) High-dimensional text data feature selection method based on graph clustering
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN110866997A (en) Novel method for constructing running condition of electric automobile
CN106845536B (en) Parallel clustering method based on image scaling
CN108416381B (en) Multi-density clustering method for three-dimensional point set
CN111626321B (en) Image data clustering method and device
CN115496138A (en) Self-adaptive density peak value clustering method based on natural neighbors
CN109389172B (en) Radio signal data clustering method based on non-parameter grid
CN108388769B (en) Protein functional module identification method based on edge-driven label propagation algorithm
CN110781943A (en) Clustering method based on adjacent grid search
Masciari et al. A new, fast and accurate algorithm for hierarchical clustering on euclidean distances
CN113052268A (en) Attribute reduction algorithm based on uncertainty measurement under interval set data type
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN111860359A (en) Point cloud classification method based on improved random forest algorithm
CN114638301A (en) Density peak value clustering algorithm based on density similarity
CN115438727A (en) Time sequence Gaussian segmentation method based on improved image group algorithm
CN115510959A (en) Density peak value clustering method based on natural nearest neighbor and multi-cluster combination
CN115017988A (en) Competitive clustering method for state anomaly diagnosis
Zhang et al. A new outlier detection algorithm based on fast density peak clustering outlier factor.
CN114611596A (en) Self-adaptive density peak value clustering method
CN114117141A (en) Self-adaptive density clustering method, storage medium and system
CN113205124A (en) Clustering method, system and storage medium under high-dimensional real scene based on density peak value
CN112951438A (en) Outlier detection method based on noise threshold distance measurement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination