CN112163623B - Fast clustering method based on density subgraph estimation, computer equipment and storage medium - Google Patents
Fast clustering method based on density subgraph estimation, computer equipment and storage medium
- Publication number
- CN112163623B
- Authority
- CN
- China
- Prior art keywords
- density
- sample
- subgraph
- samples
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of machine learning and provides a fast clustering method based on density subgraph estimation, computer equipment and a storage medium, in order to overcome the defects of the prior art that the centroid of a cluster cannot be determined, the computational cost is high and over-segmentation occurs during clustering. The fast clustering method based on density subgraph estimation comprises the following steps: obtaining samples and preprocessing them to form a data set; estimating the density value of each sample in the data set and constructing a density subgraph set; finding the highest-density point of each density subgraph in the density subgraph set as the representative point of that subgraph, the samples corresponding to the representative points forming a candidate set; calculating the importance value of each sample in the candidate set; sorting the candidate set in descending order of importance value and selecting the first K samples as the centroids of the K clusters; and classifying the non-centroid samples in the candidate set and outputting the clustering result.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a fast clustering method based on density subgraph estimation, computer equipment and a storage medium.
Background
Density-based clustering is a classic research direction in data mining. It is popular in both academia and industry because it can find clusters of arbitrary shape in a data set; it mainly connects the samples of the data set by density, using kernel density estimation (KDE).
Among conventional density-based clustering methods, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) divides clusters mainly by judging whether sample points are density-reachable from one another. However, DBSCAN cannot determine the centroid of a cluster, which is usually the representative point of the cluster. To address this problem, MeanShift moves each sample point step by step, guided by kernel density estimation, until it reaches a local density maximum, which is taken as the centroid of the cluster. Data samples that move to the same centroid are grouped into the same cluster. However, this method continuously generates new sample points while moving, which results in poor scalability and very high computational cost. QuickShift instead moves each sample point to the nearest sample point of higher density within a ball of radius τ around it. No new points need to be generated during the move; only density connections between existing samples are required, which greatly reduces the computational cost.
On the other hand, methods that search for the highest-density points, including MeanShift and QuickShift, suffer from over-segmentation because they attend only to the highest-density points and ignore the local density structure around other points. In addition to the over-segmentation problem, density-based clustering methods cannot return a specified number K of clusters as required by the user. The DPC algorithm, published in the journal Science in 2014, determines the centroids of K clusters from two quantities, the density and the distance of the data set, but it must compute the distance between every pair of points in the data set to obtain a distance matrix, which makes the complexity of the algorithm excessively high. The K-Mode and LK-Mode methods have similar problems.
Disclosure of Invention
The invention provides a fast clustering method based on density subgraph estimation, computer equipment and a storage medium, aiming at overcoming the defects that the centroid of a cluster cannot be determined, the calculation cost is high and over-segmentation occurs in the clustering process in the prior art.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A fast clustering method based on density subgraph estimation comprises the following steps:
S1: obtaining samples and preprocessing them to form a data set;
S2: estimating the density value of each sample in the data set and constructing a density subgraph set;
S3: finding the highest-density point of each density subgraph in the density subgraph set as the representative point of that subgraph, the samples corresponding to the representative points forming a candidate set;
S4: calculating the importance value of each sample in the candidate set;
S5: sorting the candidate set in descending order of importance value and selecting the first K samples as the centroids of the K clusters;
S6: classifying the non-centroid samples in the candidate set and outputting the clustering result.
Preferably, in step S1, the samples include, but are not limited to, picture samples and real-world data set samples.
Preferably, in step S1, preprocessing a picture sample specifically comprises: converting the picture into a 5-dimensional array, wherein each point in the array consists of the position coordinates of the corresponding pixel in the picture sample and its RGB channel values.
Preferably, step S2 specifically comprises the following steps:
S21: estimating the density value of each sample in the data set;
S22: comparing the density value of each sample with a preset density threshold: if the density value of the current sample is greater than the preset density threshold, executing step S23; otherwise, the current sample is not added to any density subgraph, and the next sample is examined in this step;
S23: judging whether the current sample is connected to any subset in the density subgraph set: if so, adding the current sample to the connected subset, further judging whether that subset intersects any other subset in the density subgraph set, and if so, merging the intersecting subsets; if not, creating a new subset and adding the current sample to it; then applying the density judgment of step S22 to the next sample, until all samples have been examined and the density subgraph set is obtained.
Preferably, in step S21, the density value of each sample in the data set is estimated by the k-NN density estimation method; the calculation formula is as follows:
$$f_k(x) = \frac{k}{n \cdot v_d \cdot r_k(x)^d}$$
wherein $f_k(x)$ represents the density value of the sample $x$; $k$ is the number of neighbors, $n$ is the total number of samples in the data set, $v_d$ is the volume of the unit sphere in $d$-dimensional space, and $r_k(x)$ represents the distance from the sample point $x$ to its $k$-th nearest neighbor.
Preferably, in step S4, calculating the importance value of each sample in the candidate set specifically comprises:
S41: for the samples in the candidate set, calculating the weight value $w_i$ of each sample;
S42: multiplying the density value $f_k(x_i)$ of each sample in the candidate set by its weight value $w_i$ to obtain the importance value of that sample.
Preferably, the weight value $w_i$ of each sample in the candidate set is calculated as follows:
determining whether the candidate set $X_H$ contains a point $x_h$ whose density value is higher than that of the current sample $x_i$: if so, the weight value $w_i$ is the Euclidean distance from $x_i$ to the nearest such point:
$$w_i = \min_{x_h \in X_H,\; f_k(x_h) > f_k(x_i)} \lVert x_i - x_h \rVert$$
otherwise, the weight value $w_i$ is calculated as follows:
[formula not reproduced in the source text]
Preferably, in step S6, classifying the non-centroid samples in the candidate set specifically comprises: for each non-centroid sample in the candidate set, assigning the current sample to its nearest point of higher density according to the nearest-higher-density principle, until it converges to one of the centroids; for each sample outside the candidate set, judging whether it belongs to a density subgraph, and if so, assigning it to the cluster of the representative point of that density subgraph; otherwise, assigning it directly to the cluster of its nearest point of higher density according to the nearest-higher-density principle; and, after all samples in the data set have been classified, outputting the clustering result.
The invention also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the density subgraph estimation-based rapid clustering method provided by any technical scheme when executing the computer program.
The invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the density subgraph estimation-based fast clustering method provided by any of the above technical solutions are implemented.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the method estimates the density value of each sample, constructs a density subgraph set from the density values and a preset threshold, and identifies the centroids of the clusters among the points of highest density; for a given number K of clusters it returns the required clustering result with a small amount of computation; and, by constructing density subgraphs and taking the local density structure into account, it effectively avoids over-segmentation in clustering.
Drawings
FIG. 1 is a flow chart of the fast clustering method based on density subgraph estimation according to the present invention.
FIG. 2 is a flow chart of constructing a density subgraph set according to the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a fast clustering method based on density subgraph estimation; FIG. 1 is a flow chart of the method.
The fast clustering method based on density subgraph estimation provided by the embodiment specifically comprises the following steps:
S1: obtaining samples and preprocessing them to form a data set.
The data set in this embodiment includes two types of samples: real-world data set samples and picture samples. Real-world data set samples are downloaded directly from a website such as the UCI repository (a machine learning database provided by the University of California, Irvine); picture samples are pictures to be segmented that are first obtained from the network and then preprocessed.
Preprocessing a picture sample specifically comprises: converting the picture into a 5-dimensional array, wherein each point in the array consists of the position coordinates of the corresponding pixel in the picture sample and its RGB channel values.
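The picture preprocessing described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `image_to_samples` and the nested-list image layout are assumptions made for the example, and the coordinate order (row, column) is one possible choice that the text does not fix.

```python
def image_to_samples(image):
    """Flatten an H x W grid of (R, G, B) pixels into rows of [row, col, R, G, B]."""
    samples = []
    for row_index, row in enumerate(image):
        for col_index, (red, green, blue) in enumerate(row):
            # Each sample combines the pixel position with its three colour channels.
            samples.append([row_index, col_index, red, green, blue])
    return samples


# A hypothetical 2 x 2 picture given as nested lists of RGB tuples.
tiny_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
samples = image_to_samples(tiny_image)
```

Each pixel becomes one 5-dimensional sample, so a picture with H x W pixels yields H x W samples for the later density estimation.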
S2: estimating the density value of each sample in the data set and constructing a density subgraph set. The specific steps are as follows:
S21: estimating the density value of each sample in the data set;
S22: comparing the density value of each sample with a preset density threshold: if the density value of the current sample is greater than the preset density threshold, executing step S23; otherwise, the current sample is not added to any density subgraph, and the next sample is examined in this step;
S23: judging whether the current sample is connected to any subset in the density subgraph set: if so, adding the current sample to the connected subset, further judging whether that subset intersects any other subset in the density subgraph set, and if so, merging the intersecting subsets; if not, creating a new subset and adding the current sample to it; then applying the density judgment of step S22 to the next sample, until all samples have been examined and the density subgraph set is obtained.
In this step, the density value of each sample in the data set is estimated by the k-NN density estimation method; the calculation formula is as follows:
$$f_k(x) = \frac{k}{n \cdot v_d \cdot r_k(x)^d}$$
wherein $f_k(x)$ represents the density value of the sample $x$; $k$ is the number of neighbors, $n$ is the total number of samples in the data set, $v_d$ is the volume of the unit sphere in $d$-dimensional space, and $r_k(x)$ represents the distance from the sample point $x$ to its $k$-th nearest neighbor.
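As an illustration of step S21, the following Python sketch computes a k-NN density value for every sample by brute force. It assumes the standard k-NN estimator $f_k(x) = k / (n \cdot v_d \cdot r_k(x)^d)$, which matches the symbols defined above. The function names `unit_ball_volume` and `knn_density` are hypothetical, and two further assumptions are made: a sample is not counted among its own neighbors, and no two samples coincide (so that $r_k(x) > 0$).

```python
import math


def unit_ball_volume(d):
    """Volume v_d of the unit ball in d-dimensional Euclidean space."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def knn_density(data, k):
    """Return k / (n * v_d * r_k(x)**d) for every sample x in data."""
    n = len(data)
    d = len(data[0])
    v_d = unit_ball_volume(d)
    densities = []
    for i, x in enumerate(data):
        # Sorted distances to all other samples (the sample itself is excluded).
        dists = sorted(math.dist(x, y) for j, y in enumerate(data) if j != i)
        r_k = dists[k - 1]  # distance to the k-th nearest neighbor
        densities.append(k / (n * v_d * r_k ** d))
    return densities


# Hypothetical 1-D data: three close points and one isolated point.
example_densities = knn_density([[0.0], [1.0], [2.0], [10.0]], k=1)
```

The brute-force neighbor search costs O(n^2) distance computations; a practical implementation would likely use a spatial index, which is outside the scope of this sketch.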
In this step, a preset density threshold mainly determines whether the current sample is added to the density subgraph set; for the samples that pass the threshold, connected components are found through the connectivity between each sample and the existing subsets, and each connected component is added to the corresponding subset to obtain the density subgraph set.
Fig. 2 is a flowchart of constructing a density subgraph set in this embodiment.
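Steps S22 and S23 can be sketched with a union-find structure, as below. The text does not state how connectivity between a sample and a subset is tested, so this sketch assumes that two retained samples are connected when their Euclidean distance is at most a radius `eps`; both `eps` and the function name `build_density_subgraphs` are hypothetical.

```python
import math


def build_density_subgraphs(data, densities, threshold, eps):
    """Group samples whose density exceeds threshold into connected subsets.

    Assumption: two retained samples are connected when their Euclidean
    distance is at most eps (the source does not define connectivity).
    """
    parent = {}

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, x in enumerate(data):
        if densities[i] <= threshold:
            continue  # S22: low-density samples join no subgraph
        parent[i] = i  # S23: start as a new subset ...
        for j in list(parent):
            if j != i and math.dist(x, data[j]) <= eps:
                parent[find(j)] = find(i)  # ... and merge every subset it touches

    subsets = {}
    for i in parent:
        subsets.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in subsets.values())
```

Because a new sample is merged with every subset it is connected to, two previously separate subsets are joined as soon as one sample links them, which corresponds to the merge described in S23.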
S3: finding the highest-density point of each density subgraph in the density subgraph set as the representative point of that subgraph, the samples corresponding to the representative points forming a candidate set.
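Step S3 amounts to an arg-max over each subgraph. A minimal Python sketch, assuming subgraphs are given as lists of sample indices and densities as a list aligned with the data set (the name `representative_points` is hypothetical):

```python
def representative_points(subgraphs, densities):
    """Return, for each subgraph, the index of its highest-density sample."""
    return [max(members, key=lambda i: densities[i]) for members in subgraphs]


# Hypothetical input: two subgraphs over five samples.
candidates = representative_points([[0, 1, 2], [3, 4]], [0.2, 0.9, 0.4, 0.1, 0.3])
```

The returned indices form the candidate set; its size equals the number of density subgraphs, which is usually far smaller than the number of samples.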
S4: the importance value of each sample in the candidate set is calculated.
Calculating the importance value of each sample in the candidate set specifically comprises:
S41: for the samples in the candidate set, calculating the weight value $w_i$ of each sample, wherein the weight value $w_i$ of each sample in the candidate set is calculated as follows:
determining whether the candidate set $X_H$ contains a point $x_h$ whose density value is higher than that of the current sample $x_i$: if so, the weight value $w_i$ is the Euclidean distance from $x_i$ to the nearest such point:
$$w_i = \min_{x_h \in X_H,\; f_k(x_h) > f_k(x_i)} \lVert x_i - x_h \rVert$$
otherwise, the weight value $w_i$ is calculated as follows:
[formula not reproduced in the source text]
S42: multiplying the density value $f_k(x_i)$ of each sample in the candidate set by its weight value $w_i$ to obtain the importance value of that sample.
In this step, the weight value of the sample is calculated first: the nearest point of higher density to the current sample is found in the candidate set, and the Euclidean distance between the two is used as the weight value; the weight value is then multiplied by the density value of the sample to obtain the importance value.
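The importance computation of step S4 can be sketched as follows. The weight for a candidate that has a higher-density candidate follows the description above (distance to the nearest higher-density candidate). The weight for the candidate with no higher-density point is not recoverable from this text; the sketch assumes the convention used by density-peak clustering, namely the largest distance from that candidate to any other candidate. That choice, and the name `importance_values`, are assumptions.

```python
import math


def importance_values(candidates, densities):
    """Return density * weight for each candidate point.

    candidates: coordinate lists of the candidate set
    densities:  density values aligned with candidates
    """
    scores = []
    for i, x_i in enumerate(candidates):
        # Distances to every candidate of strictly higher density.
        higher = [
            math.dist(x_i, x_h)
            for h, x_h in enumerate(candidates)
            if densities[h] > densities[i]
        ]
        if higher:
            weight = min(higher)
        else:
            # ASSUMPTION: no higher-density candidate exists, so use the
            # largest distance to any candidate (density-peak convention).
            weight = max(math.dist(x_i, x_j) for x_j in candidates)
        scores.append(densities[i] * weight)
    return scores


# Hypothetical 1-D candidate set with densities 3, 2 and 1.
scores = importance_values([[0.0], [1.0], [10.0]], [3.0, 2.0, 1.0])
```

A candidate scores highly only when it is both dense and far from any denser candidate, which is why a moderately dense but isolated point can outrank a denser point that sits next to an even denser one.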
S5: sorting the candidate set in descending order of importance value and selecting the first K samples as the centroids of the K clusters.
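Step S5 is a sort followed by a truncation. A minimal sketch, assuming the importance values are held in a list aligned with the candidate set (the name `select_centroids` is hypothetical):

```python
def select_centroids(scores, k):
    """Return the positions of the k candidates with the largest importance."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical importance values for three candidates, with K = 2.
top_two = select_centroids([30.0, 2.0, 9.0], 2)
```

Since only the candidate set is sorted, the cost of this step depends on the number of density subgraphs rather than on the number of samples.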
S6: classifying the non-centroid samples in the candidate set and outputting the clustering result. Classifying the non-centroid samples in the candidate set specifically comprises:
for each non-centroid sample in the candidate set, assigning the current sample to its nearest point of higher density according to the nearest-higher-density principle, until it converges to one of the centroids;
for each sample outside the candidate set, judging whether it belongs to a density subgraph, and if so, assigning it to the cluster of the representative point of that density subgraph; otherwise, assigning it directly to the cluster of its nearest point of higher density according to the nearest-higher-density principle;
and, after all samples in the data set have been classified, outputting the clustering result.
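The three-part assignment of step S6 can be sketched as below. The function name `assign_clusters` and its argument layout are hypothetical. One case is not covered by the text: a non-centroid sample that has no point of higher density to move to. The sketch assumes such a sample joins the nearest centroid, and it assumes at least one centroid is given.

```python
import math


def assign_clusters(data, densities, subgraphs, representatives, centroids):
    """Return one centroid index per sample (a sketch of step S6)."""
    labels = {c: c for c in centroids}  # every centroid labels itself

    def hop_target(i, pool):
        # Nearest member of pool with strictly higher density than sample i.
        higher = [j for j in pool if densities[j] > densities[i]]
        if higher:
            return min(higher, key=lambda j: math.dist(data[i], data[j]))
        # ASSUMPTION: with no higher-density point, use the nearest centroid.
        return min(centroids, key=lambda c: math.dist(data[i], data[c]))

    # Visit samples from dense to sparse so every hop target is already labelled.
    order = sorted(range(len(data)), key=lambda i: densities[i], reverse=True)

    # 1) Non-centroid candidates follow the nearest higher-density candidate.
    candidate_set = set(representatives)
    for i in order:
        if i in candidate_set and i not in labels:
            labels[i] = labels[hop_target(i, representatives)]

    # 2) Samples inside a density subgraph inherit their representative's cluster.
    for members, rep in zip(subgraphs, representatives):
        for i in members:
            labels.setdefault(i, labels[rep])

    # 3) Remaining low-density samples follow the nearest higher-density sample.
    for i in order:
        if i not in labels:
            labels[i] = labels[hop_target(i, range(len(data)))]

    return [labels[i] for i in range(len(data))]
```

Processing samples in descending density order means each sample needs a single hop to an already labelled point, rather than a repeated walk along the chain of higher-density neighbors.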
The fast clustering method based on density subgraph estimation provided by this embodiment first quickly constructs a density subgraph set according to the local density structure, then selects the highest-density point of each density subgraph to form the candidate set, and finally returns a clustering result for the given number of clusters, thereby achieving fast clustering. The method estimates the density value of each sample, constructs the density subgraph set from the density values and a preset threshold, and identifies the centroids of the clusters among the points of highest density, which effectively remedies the inability of the DBSCAN algorithm to determine cluster centroids; in a specific implementation, the required clustering result can be returned for a given number K of clusters. In addition, constructing density subgraphs and taking the local density structure into account effectively avoids over-segmentation in clustering, and the computational cost of the method provided by this embodiment is significantly lower than that of other density clustering algorithms.
Further, the present embodiment also provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the above-mentioned fast clustering method based on density subgraph estimation when executing the computer program.
Further, the present embodiment also provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned fast clustering method based on density subgraph estimation.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (8)
1. A fast clustering method based on density subgraph estimation, characterized by comprising the following steps:
S1: obtaining samples, wherein the samples comprise picture samples and real-world data set samples; preprocessing the samples to form a data set; wherein preprocessing a picture sample specifically comprises: converting the picture sample into a 5-dimensional array, wherein each element in the array consists of the position coordinates of the corresponding pixel in the picture sample and its RGB channel values;
S2: estimating the density value of each sample in the data set and constructing a density subgraph set;
S3: finding the highest-density point of each density subgraph in the density subgraph set as the representative point of that subgraph, the samples corresponding to the representative points forming a candidate set;
S4: calculating the importance value of each sample in the candidate set;
S5: sorting the candidate set in descending order of importance value and selecting the first K samples as the centroids of the K clusters;
S6: classifying the non-centroid samples in the candidate set and outputting the clustering result.
2. The fast clustering method based on density subgraph estimation according to claim 1, characterized in that step S2 specifically comprises the following steps:
S21: estimating the density value of each sample in the data set;
S22: comparing the density value of each sample with a preset density threshold: if the density value of the current sample is greater than the preset density threshold, executing step S23; otherwise, the current sample is not added to any density subgraph, and the next sample is examined in this step;
S23: judging whether the current sample is connected to any subset in the density subgraph set: if so, adding the current sample to the connected subset, further judging whether that subset intersects any other subset in the density subgraph set, and if so, merging the intersecting subsets; if not, creating a new subset and adding the current sample to it; then applying the density judgment of step S22 to the next sample, until all samples have been examined and the density subgraph set is obtained.
3. The fast clustering method based on density subgraph estimation according to claim 2, characterized in that, in step S21, the density value of each sample in the data set is estimated by the k-NN density estimation method; the calculation formula is as follows:
$$f_k(x) = \frac{k}{n \cdot v_d \cdot r_k(x)^d}$$
wherein $f_k(x)$ represents the density value of the sample $x$; $k$ is the number of neighbors, $n$ is the total number of samples in the data set, $v_d$ is the volume of the unit sphere in $d$-dimensional space, and $r_k(x)$ represents the distance from the sample $x$ to its $k$-th nearest neighbor.
4. The fast clustering method based on density subgraph estimation according to claim 3, characterized in that, in step S4, calculating the importance value of each sample in the candidate set specifically comprises:
S41: for the samples in the candidate set, calculating the weight value $w_i$ of each sample;
S42: multiplying the density value $f_k(x_i)$ of each sample in the candidate set by its weight value $w_i$ to obtain the importance value of that sample.
5. The fast clustering method based on density subgraph estimation according to claim 4, characterized in that the weight value $w_i$ of each sample in the candidate set is calculated as follows:
determining whether the candidate set $X_H$ contains a sample $x_h$ whose density value is higher than that of the current sample $x_i$: if so, the weight value $w_i$ is the Euclidean distance from $x_i$ to the nearest such sample:
$$w_i = \min_{x_h \in X_H,\; f_k(x_h) > f_k(x_i)} \lVert x_i - x_h \rVert$$
otherwise, the weight value $w_i$ is calculated as follows:
[formula not reproduced in the source text]
6. The fast clustering method based on density subgraph estimation according to claim 1, characterized in that, in step S6, classifying the non-centroid samples in the candidate set specifically comprises:
for each non-centroid sample in the candidate set, assigning the current sample to its nearest point of higher density according to the nearest-higher-density principle, until it converges to one of the centroids;
for each sample outside the candidate set, judging whether it belongs to a density subgraph, and if so, assigning it to the cluster of the representative point of that density subgraph; otherwise, assigning it to the cluster of its nearest point of higher density according to the nearest-higher-density principle;
and, after all samples in the data set have been classified, outputting the clustering result.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor when executing the computer program implements the steps of the density subgraph estimation based fast clustering method according to any one of claims 1 to 6.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the density subgraph estimation based fast clustering method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011060417.9A CN112163623B (en) | 2020-09-30 | 2020-09-30 | Fast clustering method based on density subgraph estimation, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112163623A CN112163623A (en) | 2021-01-01 |
CN112163623B CN112163623B (en) | 2022-03-04 |
Family
ID=73862200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011060417.9A Active CN112163623B (en) | 2020-09-30 | 2020-09-30 | Fast clustering method based on density subgraph estimation, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112163623B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115563522B (en) * | 2022-12-02 | 2023-04-07 | 湖南工商大学 | Traffic data clustering method, device, equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7412429B1 (en) * | 2007-11-15 | 2008-08-12 | International Business Machines Corporation | Method for data classification by kernel density shape interpolation of clusters |
CN105930862A (en) * | 2016-04-13 | 2016-09-07 | 江南大学 | Density peak clustering algorithm based on density adaptive distance |
CN108280472A (en) * | 2018-01-18 | 2018-07-13 | 安徽师范大学 | A kind of density peak clustering method optimized based on local density and cluster centre |
CN109409400A (en) * | 2018-08-28 | 2019-03-01 | 西安电子科技大学 | Merge density peaks clustering method, image segmentation system based on k nearest neighbor and multiclass |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9727532B2 (en) * | 2008-04-25 | 2017-08-08 | Xerox Corporation | Clustering using non-negative matrix factorization on sparse graphs |
US8971665B2 (en) * | 2012-07-31 | 2015-03-03 | Hewlett-Packard Development Company, L.P. | Hierarchical cluster determination based on subgraph density |
Non-Patent Citations (2)
Title |
---|
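Overlapping community detection algorithm based on density peaks and community belonging degree; Guo Kun et al.; Journal of Chinese Computer Systems; 2019-05-14 (No. 05); pp. 217-226 * |
Selective K-means clustering ensemble algorithm based on random sampling; Wang Lijuan et al.; Journal of Computer Applications; 2013-07-01 (No. 07); pp. 183-186 * |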
Also Published As
Publication number | Publication date |
---|---|
CN112163623A (en) | 2021-01-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||