CN112163623B - Fast clustering method based on density subgraph estimation, computer equipment and storage medium

Info

Publication number: CN112163623B
Application number: CN202011060417.9A
Authority: CN
Inventors: 杨易扬; 郑喜臣; 任成森; 巩志国; 蔡瑞初; 郝志峰; 陈炳丰
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2022-03-04
Anticipated expiration: 2040-09-30
Also published as: CN112163623A

Abstract

The invention relates to the technical field of machine learning and provides a fast clustering method based on density subgraph estimation, a computer device and a storage medium. It aims to overcome three defects of clustering in the prior art: the centroid of a cluster cannot be determined, the computational cost is high, and over-segmentation occurs. The fast clustering method based on density subgraph estimation comprises the following steps: obtaining samples and preprocessing them to form a data set; estimating the density value of each sample in the data set and constructing a density subgraph set; finding the highest-density point of each density subgraph in the density subgraph set as the representative point of that subgraph, the samples corresponding to the representative points forming a candidate set; calculating the importance value of each sample in the candidate set; sorting the candidate set in descending order of importance value and selecting the first K samples as the centroids of the K clusters; and classifying the non-centroid samples in the candidate set and outputting the clustering result.

Description

Fast clustering method based on density subgraph estimation, computer equipment and storage medium

Technical Field

The invention relates to the technical field of machine learning, in particular to a fast clustering method based on density subgraph estimation, computer equipment and a storage medium.

Background

Density-based clustering is a classic research direction in data mining. It is popular in both academia and industry because it can find clusters of arbitrary shape in a data set. It mainly connects the samples of the data set by density, using kernel density estimation (KDE).

Among conventional density-based clustering methods, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) divides clusters mainly by judging whether sample points are density-reachable from one another. However, DBSCAN cannot determine the centroid of a cluster, which is usually the representative point of the cluster. To address this problem, MeanShift moves each sample point step by step, guided by kernel density estimation, until it reaches a local density maximum, which is the centroid of the cluster. Data samples that move to the same centroid are grouped into the same cluster. However, this method continuously generates new sample points while moving, which results in poor scalability and very high computational cost. QuickShift instead moves each sample point to the nearest sample point of higher density within a sphere of radius τ around it. No new points need to be generated during the move; only density connections between existing samples are required, so the computational cost is greatly reduced.

On the other hand, methods that search for the highest-density point, including MeanShift and QuickShift, suffer from over-segmentation because they consider only the highest-density point and ignore the local density structure of the other points. In addition, density-based clustering methods cannot return a user-specified number K of clusters. The DPC algorithm, published in the journal Science in 2014, determines the centroids of the K clusters from two quantities, the density and the distance of the data set, but it needs to compute the distance between every pair of points in the data set to obtain a distance matrix, which makes the complexity of the algorithm excessively high. K-Mode and LK-Mode have similar problems.

Disclosure of Invention

The invention provides a fast clustering method based on density subgraph estimation, computer equipment and a storage medium, aiming at overcoming the defects that the centroid of a cluster cannot be determined, the calculation cost is high and over-segmentation occurs in the clustering process in the prior art.

In order to solve the technical problems, the technical scheme of the invention is as follows:

A fast clustering method based on density subgraph estimation comprises the following steps:

S1: obtaining samples and preprocessing them to form a data set;

S2: estimating the density value of each sample in the data set and constructing a density subgraph set;

S3: finding the highest-density point of each density subgraph in the density subgraph set as the representative point of that subgraph, the samples corresponding to the representative points forming a candidate set;

S4: calculating the importance value of each sample in the candidate set;

S5: sorting the candidate set in descending order of importance value and selecting the first K samples as the centroids of the K clusters;

S6: classifying the non-centroid samples in the candidate set and outputting the clustering result.

Preferably, in step S1, the samples include, but are not limited to, picture samples and real-world data set samples.

Preferably, in step S1, preprocessing a picture sample specifically includes converting the picture into a 5-dimensional array, wherein each point in the array consists of the position coordinates of the corresponding pixel in the picture sample and its RGB channel values.
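The picture preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that "5-dimensional array" means one 5-dimensional point `[row, col, R, G, B]` per pixel, that the picture is already loaded as an H x W x 3 NumPy array, and the function name `image_to_points` is hypothetical. No scaling of coordinates against colour values is applied, since the text does not specify any.

```python
import numpy as np

def image_to_points(image: np.ndarray) -> np.ndarray:
    """Flatten an H x W x 3 RGB image into an (H*W, 5) array of [row, col, R, G, B]."""
    height, width, channels = image.shape
    if channels != 3:
        raise ValueError("expected an RGB image with 3 channels")
    # Pixel coordinates in row-major order, matching image.reshape(-1, 3) below.
    rows, cols = np.mgrid[0:height, 0:width]
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1)
    colors = image.reshape(-1, 3)
    return np.hstack([coords, colors]).astype(np.float64)

# Tiny 2 x 2 example picture (values chosen only for illustration).
demo_image = np.array([[[255, 0, 0], [0, 255, 0]],
                       [[0, 0, 255], [10, 20, 30]]], dtype=np.uint8)
demo_points = image_to_points(demo_image)
```

Each row of `demo_points` is then one sample of the data set used in step S2.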

Preferably, step S2 specifically includes:

S21: performing density value estimation on each sample in the data set to obtain the density value of each sample;

S22: comparing the density value of each sample with a preset density threshold: if the density value of the current sample is greater than the preset density threshold, executing step S23; otherwise, the current sample is not added to any density subgraph, and the comparison of this step is applied to the next sample;

S23: judging whether the current sample is connected to any subset in the density subgraph set: if so, adding the current sample to the connected subset, then judging whether that subset intersects any other subset in the density subgraph set and, if it does, merging the intersecting subsets; if not, creating a new subset and adding the current sample to it; then applying the density comparison of step S22 to the next sample, until all samples have been processed and the density subgraph set is obtained.

Preferably, in step S21, the density value of each sample in the data set is estimated by the k-NN density estimation method; the calculation formula is as follows:

$$f_k(x) = \frac{k}{n \cdot v_d \cdot r_k(x)^d}$$

wherein $f_k(x)$ represents the density value of the sample $x$; $k$ is the number of neighbors, $n$ is the total number of samples in the data set, $v_d$ is the volume of the unit sphere in $d$-dimensional space, and $r_k(x)$ represents the distance from the sample point $x$ to its $k$-th neighbor.
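The k-NN density estimate can be sketched numerically with the standard form $f_k(x) = k / (n \cdot v_d \cdot r_k(x)^d)$, which uses exactly the symbols defined above. The sketch below is illustrative only: the function names are hypothetical, it assumes that a sample is not counted as its own neighbor, and it uses a brute-force O(n²) distance matrix for brevity, whereas a practical implementation would more likely use a spatial index.

```python
import math
import numpy as np

def unit_ball_volume(d: int) -> float:
    """Volume v_d of the unit sphere in d-dimensional space."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def knn_density(X: np.ndarray, k: int) -> np.ndarray:
    """Return f_k(x) = k / (n * v_d * r_k(x)^d) for every row x of X."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    # Brute-force pairwise Euclidean distances (illustration only).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting, column 0 is the sample itself (distance 0),
    # so column k holds the distance to the k-th neighbor.
    r_k = np.sort(dists, axis=1)[:, k]
    return k / (n * unit_ball_volume(d) * r_k ** d)
```

Note that duplicated samples give $r_k(x) = 0$ and therefore an infinite estimate; how such ties should be handled is not stated in the text.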

Preferably, in step S4, calculating the importance value of each sample in the candidate set specifically includes:

S41: for the samples in the candidate set, calculating the weight value $w_i$ of each sample;

S42: multiplying the density value $f_k(x_i)$ of each sample in the candidate set by its weight value $w_i$ to obtain the importance value of the corresponding sample.

Preferably, the weight value $w_i$ of each sample in the candidate set is calculated as follows:

judging whether the candidate set $X_H$ contains a point $x_h$ whose density value is higher than that of the current sample $x_i$: if so, the weight value $w_i$ is the Euclidean distance from $x_i$ to the nearest such point:

$$w_i = \min_{x_h \in X_H,\; f_k(x_h) > f_k(x_i)} \lVert x_i - x_h \rVert_2$$

otherwise, the weight value $w_i$ is calculated as follows:

Preferably, in step S6, classifying the non-centroid samples in the candidate set specifically includes: for each non-centroid sample in the candidate set, assigning the current sample to its nearest point of higher density value, according to the nearest-higher-density principle, until it converges to one of the centroids; for each sample outside the candidate set, judging whether it lies in a density subgraph: if so, assigning it to the cluster containing the representative point of that density subgraph; otherwise, assigning it directly to the cluster containing its nearest point of higher density value, according to the nearest-higher-density principle; and, after all samples in the data set have been classified, outputting the clustering result.

The invention also provides a computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the fast clustering method based on density subgraph estimation provided by any of the above technical solutions.

The invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the density subgraph estimation-based fast clustering method provided by any of the above technical solutions are implemented.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects: density values are estimated for the samples, a density subgraph set is constructed from the density values and a preset threshold, and the point with the highest density value is taken as the centroid of a cluster; for a given number K of clusters, the required clustering result can be returned with a small amount of computation; and by constructing density subgraphs and taking the local density structure into account, over-segmentation during clustering can be effectively avoided.

Drawings

FIG. 1 is a flow chart of the fast clustering method based on density subgraph estimation according to the present invention.

FIG. 2 is a flow chart of constructing a density subgraph set according to the present invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

It will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

This embodiment provides a fast clustering method based on density subgraph estimation; FIG. 1 shows the flow chart of the method in this embodiment.

The fast clustering method based on density subgraph estimation provided by the embodiment specifically comprises the following steps:

S1: obtaining samples and preprocessing them to form a data set.

The data set in this embodiment includes two types of samples: real-world data set samples and picture samples. Real-world data set samples are downloaded directly from a website such as the UCI website (a machine learning database provided by the University of California, Irvine); picture samples are first obtained from the network as pictures to be segmented and are then preprocessed.

Preprocessing a picture sample specifically includes converting the picture into a 5-dimensional array, wherein each point in the array consists of the position coordinates of the corresponding pixel in the picture sample and its RGB channel values.

S2: estimating the density value of each sample in the data set and constructing a density subgraph set. The specific steps are as follows:

In this step, the density value of each sample in the data set is estimated by the k-NN density estimation method; the calculation formula is as follows:

$$f_k(x) = \frac{k}{n \cdot v_d \cdot r_k(x)^d}$$

In this step, a preset density threshold mainly decides whether the current sample is added to the density subgraph set; samples admitted to the density subgraphs are grouped into connected components through the connectivity between each sample and the existing subsets, and each connected component is added to the corresponding subset to obtain the density subgraph set.

Fig. 2 is a flowchart of constructing a density subgraph set in this embodiment.
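The construction of the density subgraph set (threshold test, connection test, merging of subsets) can be sketched with a union-find structure. The text does not define when a sample is "connected" to a subset, so this sketch makes a loud assumption: a sample is connected to a subset if it lies within a Euclidean distance `radius` of any member of that subset. The parameter `radius` and the function name `build_density_subgraphs` are hypothetical.

```python
import numpy as np

def build_density_subgraphs(X, density, threshold, radius):
    """Group samples with density > threshold into connected subsets.

    ASSUMPTION (not from the source): two samples are connected when their
    Euclidean distance is at most `radius`.
    Returns a list of subsets, each a list of sample indices.
    """
    X = np.asarray(X, dtype=float)
    parent = list(range(len(X)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    accepted = []
    for i in range(len(X)):
        if density[i] <= threshold:
            continue  # threshold test failed: sample joins no subgraph
        for j in accepted:
            if np.linalg.norm(X[i] - X[j]) <= radius:
                # Join the subset of j; if i touches several subsets,
                # successive unions merge them into one.
                parent[find(i)] = find(j)
        accepted.append(i)  # with no connection, i stays a new singleton subset

    groups = {}
    for i in accepted:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

The pairwise scan over accepted samples is quadratic and is kept only for readability; the text does not say how the connection test is accelerated.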

S3: finding the highest-density point of each density subgraph in the density subgraph set as the representative point of that subgraph, the samples corresponding to the representative points forming a candidate set.
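Selecting the representative points amounts to taking the densest member of each subset. A minimal sketch, assuming the subsets are lists of sample indices and `density` holds one density value per sample (the function name is hypothetical):

```python
import numpy as np

def representative_points(subgraphs, density):
    """Return, for each density subgraph, the index of its highest-density sample."""
    density = np.asarray(density, dtype=float)
    # max() keeps the first member on ties; the source does not specify tie handling.
    return [max(members, key=lambda i: density[i]) for members in subgraphs]
```

The returned indices form the candidate set used in step S4.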

S4: calculating the importance value of each sample in the candidate set.

The specific steps of calculating the importance value of each sample in the candidate set include:

S41: for the samples in the candidate set, calculating the weight value $w_i$ of each sample, wherein the weight value $w_i$ of each sample in the candidate set is calculated as follows:

judging whether the candidate set $X_H$ contains a point $x_h$ whose density value is higher than that of the current sample $x_i$: if so, the weight value $w_i$ is the Euclidean distance from $x_i$ to the nearest such point:

$$w_i = \min_{x_h \in X_H,\; f_k(x_h) > f_k(x_i)} \lVert x_i - x_h \rVert_2$$

otherwise, the weight value $w_i$ is calculated as follows:

In this step, the weight value of the sample is calculated first: the nearest point of higher density to the current sample is found in the candidate set, and the Euclidean distance between the two is used as the weight value. The weight value is then multiplied by the density value of the sample to obtain the importance value.
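The importance computation can be sketched as below, together with the top-K selection that follows it. The weight for a candidate that has a denser neighbor in the candidate set follows the description above (Euclidean distance to the nearest denser candidate). The formula for the densest candidate is not reproduced in this text, so the sketch assumes the convention of the DPC algorithm mentioned in the Background, namely the maximum distance to any other candidate; this is an assumption, not the patent's stated formula. The function names are hypothetical.

```python
import numpy as np

def importance_values(candidates, cand_density):
    """Importance = density * weight for each candidate point."""
    C = np.asarray(candidates, dtype=float)
    f = np.asarray(cand_density, dtype=float)
    dists = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    weights = np.empty(len(C))
    for i in range(len(C)):
        higher = f > f[i]
        if higher.any():
            # Distance to the nearest candidate of strictly higher density.
            weights[i] = dists[i, higher].min()
        else:
            # ASSUMPTION (formula missing in the source): DPC-style
            # maximum distance to any other candidate.
            weights[i] = dists[i].max()
    return f * weights

def top_k_centroids(importance, k):
    """Positions of the k largest importance values, in descending order."""
    return np.argsort(-np.asarray(importance, dtype=float), kind="stable")[:k]
```

For example, with candidates `[[0, 0], [3, 4], [0, 1]]` and densities `[10, 8, 2]`, the weights are 5, 5 and 1, so the importance values are 50, 40 and 2 and the first two candidates would be chosen for K = 2.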

S5: sorting the candidate set in descending order of importance value and selecting the first K samples as the centroids of the K clusters.

S6: classifying the non-centroid samples in the candidate set and outputting the clustering result. The specific steps of classifying the non-centroid samples in the candidate set include:

for each non-centroid sample in the candidate set, assigning the current sample to its nearest point of higher density value, according to the nearest-higher-density principle, until it converges to one of the centroids;

for each sample outside the candidate set, judging whether it lies in a density subgraph: if so, assigning it to the cluster containing the representative point of that density subgraph; otherwise, assigning it directly to the cluster containing its nearest point of higher density value, according to the nearest-higher-density principle;

after all samples in the data set have been classified, outputting the clustering result.
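The three classification rules of step S6 can be sketched as one labelling pass. This is an illustrative reading of the text, not the patent's code: samples are visited from densest to least dense so that the nearest denser point already carries a label, and a fallback to the nearest centroid is assumed for the case where no denser labelled point exists, which the text does not cover. All names are hypothetical; `reps` and `centroids` hold sample indices, with `reps[g]` the representative of `subgraphs[g]`.

```python
import numpy as np

def assign_clusters(X, density, subgraphs, reps, centroids):
    """Return one cluster label per sample (label c belongs to centroids[c])."""
    X = np.asarray(X, dtype=float)
    density = np.asarray(density, dtype=float)
    labels = np.full(len(X), -1)
    for cluster_id, idx in enumerate(centroids):
        labels[idx] = cluster_id

    def label_of_nearest_higher(i, pool):
        # Nearest already-labelled point in `pool` with strictly higher density.
        higher = np.asarray([j for j in pool
                             if density[j] > density[i] and labels[j] >= 0], dtype=int)
        if higher.size == 0:
            # ASSUMPTION (not in the source): fall back to the nearest centroid.
            higher = np.asarray(centroids, dtype=int)
        nearest = higher[np.argmin(np.linalg.norm(X[higher] - X[i], axis=1))]
        return labels[nearest]

    # 1) Non-centroid candidates follow the nearest denser candidate.
    for i in sorted(reps, key=lambda j: -density[j]):
        if labels[i] < 0:
            labels[i] = label_of_nearest_higher(i, reps)

    # 2) Samples inside a density subgraph take their representative's cluster.
    for members, rep in zip(subgraphs, reps):
        labels[members] = labels[rep]

    # 3) Remaining samples follow their nearest denser sample.
    for i in np.argsort(-density, kind="stable"):
        if labels[i] < 0:
            labels[i] = label_of_nearest_higher(i, range(len(X)))
    return labels
```

Because each non-centroid candidate inherits the label of a denser candidate that was labelled earlier, the chain of assignments ends at one of the K centroids, which matches the "converges to a certain centroid" wording above.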

The fast clustering method based on density subgraph estimation provided by this embodiment first quickly constructs a density subgraph set according to the local density structure, then selects a centroid in each density subgraph to form the candidate set, and finally returns a clustering result for the given number of clusters, thereby achieving fast clustering. By estimating the density values of the samples, constructing the density subgraph set from the density values and a preset threshold, and determining the centroids of the clusters from the points of highest density value, the method overcomes the inability of the DBSCAN algorithm to determine cluster centroids, and in a specific implementation it can return the required clustering result for a given number K of clusters. In addition, constructing density subgraphs and taking the local density structure into account effectively avoids over-segmentation during clustering, and the computational cost of the method provided by this embodiment is significantly lower than that of other density clustering algorithms.

Further, the present embodiment also provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the above-mentioned fast clustering method based on density subgraph estimation when executing the computer program.

Further, the present embodiment also provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned fast clustering method based on density subgraph estimation.

The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent.

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A fast clustering method based on density subgraph estimation is characterized by comprising the following steps:

S1: obtaining samples, wherein the samples comprise picture samples and real world data set samples; preprocessing the samples to form a data set; the specific steps of preprocessing the picture sample comprise: converting the picture sample into a 5-dimensional array, wherein each element in the array consists of the position coordinate of a corresponding pixel point in the picture sample and a corresponding RGB channel value;

S4: calculating an importance value for each sample in the candidate set;

2. The fast clustering method based on density subgraph estimation according to claim 1 is characterized in that: in the step S2, the specific steps are as follows:

3. The fast clustering method based on density subgraph estimation according to claim 2 is characterized in that: in the step S21, the density value of each sample in the data set is estimated by using a k-NN density estimation method; the calculation formula is as follows:

$$f_k(x) = \frac{k}{n \cdot v_d \cdot r_k(x)^d}$$

wherein $f_k(x)$ represents the density value of the sample $x$; $k$ is the number of neighbors, $n$ is the total number of samples in the data set, $v_d$ is the volume of the unit sphere in $d$-dimensional space, and $r_k(x)$ represents the distance from the sample $x$ to its $k$-th neighbor.

4. The fast clustering method based on density subgraph estimation according to claim 3 is characterized in that: in the step S4, the specific step of calculating the importance value of each sample in the candidate set includes:

5. The fast clustering method based on density subgraph estimation according to claim 4 is characterized in that: the weight value $w_i$ of each sample in the candidate set is calculated as follows:

judging whether the candidate set $X_H$ contains a sample $x_h$ whose density value is higher than that of the current sample $x_i$: if so, the weight value $w_i$ is calculated as follows:

otherwise, the weight value $w_i$ is calculated as follows:

6. The fast clustering method based on density subgraph estimation according to claim 1 is characterized in that: in the step S6, the specific step of classifying the non-centroid samples in the candidate set includes:

for each sample outside the candidate set, judging whether it lies in a density subgraph: if so, assigning it to the cluster containing the representative point of that density subgraph; otherwise, assigning it to the cluster containing its nearest point of higher density value, according to the nearest-higher-density principle;

7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor when executing the computer program implements the steps of the density subgraph estimation based fast clustering method according to any one of claims 1 to 6.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the density subgraph estimation based fast clustering method according to any one of claims 1 to 6.