CN109711439A - Density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search - Google Patents
Density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search
- Publication number
- CN109711439A (application number CN201811515205.8A)
- Authority
- CN
- China
- Prior art keywords
- circle
- distance
- density
- data
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate neighbor search. A density-adaptive distance is first defined from the Euclidean distance and the statistical properties of the data set itself; the Group algorithm then traverses all samples, forming circular regions of different sizes that do not intersect. The density and distance of each circle are computed according to a newly defined density measure. After the cluster-center circles are identified with a decision graph, each remaining circle is assigned to the nearest circle whose density is higher than its own, completing the clustering. The method completes clustering quickly without degrading accuracy, shows a clear advantage on large-scale data, and better meets the demands of practical engineering applications.
Description
Technical Field
The invention relates to the field of density clustering, and in particular to a density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate neighbor search.
Background
Clustering is an important component of data mining technology and refers to the process of dividing a collection of physical or abstract objects into classes composed of similar objects. In plain terms, clustering divides the target objects into a number of clusters so that objects within the same cluster are highly similar while objects in different clusters are dissimilar. Cluster analysis is a common data analysis tool with wide application prospects in pattern recognition, image processing, machine learning, web search, marketing, and other fields. Traditional clustering methods fall mainly into the following categories: partition-based clustering, hierarchical clustering, density-based clustering, grid-based clustering, and graph-based clustering. The K-means algorithm is a classical partition-based algorithm that improves clustering quality through repeated iteration. Because it is very sensitive to the initial cluster centers, a poor choice of initial centers easily traps the result in a local optimum and makes the clustering unstable, and K-means is also unsuitable for clusters of arbitrary shape. The density-based DBSCAN algorithm and the graph-based SC (spectral clustering) algorithm can handle data sets of arbitrary shape but depend too heavily on parameter settings, while grid-based algorithms such as STING and CLIQUE tend to reduce clustering accuracy.
In 2014, Rodriguez et al. proposed in the journal Science an algorithm that can handle data sets of arbitrary shape: Clustering by fast search and find of density peaks (the DPC algorithm). The algorithm assumes that a cluster center has a higher local density ρ and a relatively large distance δ from other data points of higher local density. Compared with traditional clustering algorithms, the density peak clustering algorithm achieves a good clustering effect, but usually at the cost of a much longer running time.
Disclosure of Invention
To overcome the drawback that the existing DPC algorithm consumes a large amount of time when processing large-scale tourist profile data, the invention provides a density peak clustering method that uses the Group algorithm to accelerate neighbor search. First, a density-adaptive distance is defined from the Euclidean distance and the statistical characteristics of the data set, so that the spatial distribution structure of the data is better described; second, the Group algorithm is combined with the DPC algorithm and a new way of defining density is proposed. Experiments on UCI real data sets show that the new algorithm preserves the clustering effect while greatly reducing the time spent on clustering.
In order to solve the technical problems, the invention adopts the following technical scheme:
a density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate neighbor search comprises the following steps:
step 1, inputting a data set $X=\{x_1,x_2,\dots,x_n\}\in\mathbb{R}^D$, wherein $x_i$ represents a sample point in the data set, $D$ represents the sample dimension, and $n$ represents the number of samples;
step 2, determining an Eps radius parameter, wherein the process is as follows:
2.1 first calculate the Euclidean distance between sample points $x_i$ and $x_j$, $\mathrm{dist}(x_i,x_j)=\|x_i-x_j\|_2=\sqrt{\sum_{k=1}^{D}(x_{ik}-x_{jk})^2}$, obtaining the distance distribution matrix $\mathrm{DIST}_{n\times n}$:
$\mathrm{DIST}_{n\times n}=\{\mathrm{dist}(x_i,x_j),\ 1\le i\le n,\ 1\le j\le n\}$ (1)
wherein $x_i$ represents the $i$-th sample point; the values in each row of $\mathrm{DIST}_{n\times n}$ are sorted from small to large, and $\mathrm{DIST}_{n\times m}$ records the $m$ smallest distance values of each data point after sorting, with $m=[0.01n]$; when the data set contains fewer than 200 samples, $m=2$;
2.2 in $\mathrm{DIST}_{n\times m}$, the distances of each sample point, i.e. each row vector, should carry their own weight $\alpha_i$ in the distance measure; the choice of $\alpha_i$ is related to the statistical characteristics of the data itself: the variance describes how far a random variable deviates from its mean, and the larger the variance, the larger the fluctuation;
2.3 first calculate the standard deviation of each row of $\mathrm{DIST}_{n\times m}$, $\sigma_i=\sqrt{\tfrac{1}{m}\sum_{j=1}^{m}\big(\mathrm{DIST}_{ij}-\overline{\mathrm{DIST}_i}\big)^2}$, wherein $\overline{\mathrm{DIST}_i}$ is the mean of the $i$-th row of $\mathrm{DIST}_{n\times m}$;
2.4 the larger the standard deviation, the smaller the corresponding weight should be, so the weight $\alpha_i$ is defined to decrease as $\sigma_i$ increases; the weights are then normalized, $\hat{\alpha}_i=\alpha_i/\sum_{j=1}^{n}\alpha_j$, thereby giving Eps;
the operation is divided into two stages: a first stage of running a Group algorithm on the whole data set to obtain a series of circular areas; a second stage of clustering the obtained circular regions using a DPC-based algorithm;
step 3, running a Group algorithm on the whole data set to obtain a series of circular areas;
after the Group algorithm has traversed the whole data set, a series of circles of different sizes is formed; each data point is assigned in turn to one circle $s$, all circles together form the set $S$, and $s_m$ denotes the center of each circle;
step 4, clustering the obtained circular regions using a DPC-based (density peak clustering) algorithm, wherein a cluster center has a higher density $\rho$ and a larger distance $\delta$ from circles of higher density; the process is as follows:
4.1 redefine the density $\rho$: after the Group algorithm has traversed the whole data set, the resulting circular regions may border one another but do not intersect, and every data point has been assigned to exactly one circular region; the density $\rho_i$ of each circular region is therefore defined in terms of $num_i$ and $r_i$,
wherein $num_i$ and $r_i$ are the number of data points in the $i$-th circular region and the radius of that region, respectively;
4.2 take the shortest distance from circle $i$ to a circle $j$ of higher density as the distance value of the samples in that circle, denoted $\delta_i$ and defined as $\delta_i=\min_{j:\rho_j>\rho_i}\mathrm{dist}(s_i,s_j)$; for the circle with the globally highest density, $\delta_i=\max_{j\neq i}\delta_j$;
step 5, after the Group algorithm has represented all sample points by a number of circular regions, calculate the density $\rho_i$ and distance $\delta_i$ of every circle; circles with both a high density $\rho$ and a large distance $\delta$ are taken as cluster centers and selected through the decision graph; after the cluster-center circles are found, each cluster-center circle is first given a distinct class label, and a density-based assignment is then adopted;
step 6, each non-center circle follows the nearest circle whose density is higher than its own until all non-center circles have been assigned; after the assignment is completed, each non-center circle simply takes the class label of the circle it was assigned to, and the clustering is finished.
The beneficial effects of the invention are as follows:
(1) Compared with the plain Euclidean distance, the statistical characteristics of the data are incorporated, so the spatial distribution structure of the data is described more faithfully, differences between data sets with different distributions are effectively revealed, and clustering performance is further improved.
(2) After the Group algorithm has traversed the whole data set, the invention treats each circular region, rather than each individual sample point, as the processing unit, which greatly shortens the time required for clustering and gives the original density peak algorithm wider applicability.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention;
FIG. 2 is a flow chart of a preliminary phase of the Group algorithm;
FIG. 3 is a flow chart of a further stage of the Group algorithm;
FIG. 4 is a result of running a Group algorithm on an Agg dataset;
FIG. 5 is the final clustering result using the algorithm of the present invention on an Agg data set.
Detailed Description
For the purpose of illustrating the objects, technical solutions and advantages of the present invention, the present invention will be described in further detail below with reference to specific embodiments and accompanying drawings.
Referring to fig. 1 to 5, a density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate neighbor search includes the following steps:
step 1, inputting a data set $X=\{x_1,x_2,\dots,x_n\}\in\mathbb{R}^D$, wherein $x_i$ represents a sample point in the data set, $D$ represents the sample dimension, and $n$ represents the number of samples;
step 2, determining an Eps (radius) parameter, and the process is as follows:
2.1 first calculate the Euclidean distance between sample points $x_i$ and $x_j$, $\mathrm{dist}(x_i,x_j)=\|x_i-x_j\|_2=\sqrt{\sum_{k=1}^{D}(x_{ik}-x_{jk})^2}$, obtaining the distance distribution matrix $\mathrm{DIST}_{n\times n}$:
$\mathrm{DIST}_{n\times n}=\{\mathrm{dist}(x_i,x_j),\ 1\le i\le n,\ 1\le j\le n\}$ (1)
wherein $x_i$ represents the $i$-th sample point; the values in each row of $\mathrm{DIST}_{n\times n}$ are sorted from small to large, and $\mathrm{DIST}_{n\times m}$ records the $m$ smallest distance values of each data point after sorting, with $m=[0.01n]$; when the data set contains fewer than 200 samples, $m=2$;
2.2 the idea of the density-adaptive distance is as follows: in $\mathrm{DIST}_{n\times m}$, the distances of each sample point, i.e. each row vector, should carry their own weight $\alpha_i$ in the distance measure; the choice of $\alpha_i$ is related to the statistical characteristics of the data itself: the variance describes how far a random variable deviates from its mean, and the larger the variance, the larger the fluctuation;
2.3 first calculate the standard deviation of each row of $\mathrm{DIST}_{n\times m}$, $\sigma_i=\sqrt{\tfrac{1}{m}\sum_{j=1}^{m}\big(\mathrm{DIST}_{ij}-\overline{\mathrm{DIST}_i}\big)^2}$, wherein $\overline{\mathrm{DIST}_i}$ is the mean of the $i$-th row of $\mathrm{DIST}_{n\times m}$;
2.4 the larger the standard deviation, the smaller the corresponding weight should be, so the weight $\alpha_i$ is defined to decrease as $\sigma_i$ increases; the weights are then normalized, $\hat{\alpha}_i=\alpha_i/\sum_{j=1}^{n}\alpha_j$, thereby giving Eps;
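As an illustration only (not part of the original disclosure), the following Python sketch computes the density-adaptive Eps of step 2. The exact weight and Eps formulas appear as images in the published text and are not reproduced here, so the sketch assumes $\alpha_i=1/\sigma_i$ and takes Eps as the weighted average of each point's mean distance to its $m$ nearest neighbors; the function name adaptive_eps is likewise hypothetical.

```python
import numpy as np

def adaptive_eps(X, m=None):
    """Sketch of step 2: density-adaptive Eps from each point's m nearest-neighbor distances.
    Assumptions (not from the patent text): alpha_i = 1/sigma_i and
    Eps = sum_i alpha_hat_i * mean(DIST_i)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if m is None:
        m = 2 if n < 200 else max(2, int(0.01 * n))
    # full Euclidean distance matrix DIST_{n x n}
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # sort each row ascending and keep the m smallest non-zero distances -> DIST_{n x m}
    dist_m = np.sort(dist, axis=1)[:, 1:m + 1]
    sigma = dist_m.std(axis=1) + 1e-12      # per-row standard deviation
    alpha = 1.0 / sigma                     # larger spread -> smaller weight (assumed form)
    alpha /= alpha.sum()                    # normalize the weights
    return float(np.sum(alpha * dist_m.mean(axis=1)))
```

With X an n-by-D NumPy array, eps = adaptive_eps(X) gives the radius used by the Group algorithm in step 3.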
the Group algorithm is combined with the DPC algorithm, and the proposed new algorithm operates in two stages: in the first stage, the Group algorithm is run on the whole data set to obtain a series of circular areas; in the second stage, the obtained circular regions are clustered using a DPC-based algorithm;
step 3, running the Group algorithm on the whole data set to obtain a series of circular areas;
3.1, traversing the whole data set, and performing preliminary division on the sample points, as shown in fig. 2, the specific implementation process is as follows:
input the data set $X=\{x_1,x_2,\dots,x_n\}\in\mathbb{R}^D$. For each sample point, a suitable existing circle is searched for: first the Euclidean distance from the data sample point $x$ to a circle center $s_m$ is calculated and compared with Eps;
if the distance from the sample point to some circle center satisfies $\|s_m-x\|\le \mathrm{Eps}$, a circle is drawn with $s_m$ as the center and the farthest distance from $s_m$ to its assigned points (including $x$) as the radius; if $S$ is empty, or there is no $s_m$ in $S$ such that $\|s_m-x\|\le 2\,\mathrm{Eps}$, then $x$ is defined as a new circle center $s_m$; if neither of the first two conditions is met, the sample point $x$ is marked as an unprocessed point;
3.2 further dividing the unprocessed sample points in step 3.1, as shown in fig. 3, the specific implementation process is as follows:
for an unprocessed sample point $x$, if $\|s_m-x\|\le \mathrm{Eps}$ for some existing center $s_m$, a circle is drawn with $s_m$ as the center and the distance from $s_m$ to $x$ as the radius; otherwise $x$ is defined as a new circle center $s_m$;
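A minimal Python sketch of the two-pass Group algorithm described in steps 3.1 and 3.2 follows. Where the text is ambiguous (for example, exactly how the radius is updated when a point joins an existing circle), the choices made in the code are assumptions rather than the patented procedure, and the function name group_circles is hypothetical.

```python
import numpy as np

def group_circles(X, eps):
    """Sketch of step 3 (Group algorithm): assign every point to a circle.
    Returns circle centers (array), radii (array) and member index lists."""
    X = np.asarray(X, dtype=float)
    centers, radii, members, unprocessed = [], [], [], []

    def nearest_center(x):
        if not centers:
            return -1, np.inf
        d = np.linalg.norm(np.asarray(centers) - x, axis=1)
        k = int(d.argmin())
        return k, d[k]

    # pass 1: preliminary division
    for i, x in enumerate(X):
        k, d = nearest_center(x)
        if k >= 0 and d <= eps:            # join the existing circle, enlarging the radius if needed
            members[k].append(i)
            radii[k] = max(radii[k], d)
        elif k < 0 or d > 2 * eps:         # no circle center within 2*Eps -> x becomes a new center
            centers.append(x); radii.append(0.0); members.append([i])
        else:
            unprocessed.append(i)          # Eps < d <= 2*Eps: decide in pass 2

    # pass 2: further division of the unprocessed points
    for i in unprocessed:
        k, d = nearest_center(X[i])
        if d <= eps:
            members[k].append(i)
            radii[k] = max(radii[k], d)
        else:
            centers.append(X[i]); radii.append(0.0); members.append([i])

    return np.asarray(centers), np.asarray(radii), members
```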
step 4, clustering the obtained circular areas using a DPC-based (density peak clustering) algorithm;
the DPC algorithm is based on the following assumption: a cluster center has a higher density $\rho$ and a larger distance $\delta$ from points of higher density;
4.1 redefine the density $\rho$: after the Group algorithm has traversed the whole data set, the resulting circular regions may border one another but do not intersect, and every data point has been assigned to exactly one circular region; the density $\rho_i$ of each circular region is therefore defined in terms of $num_i$ and $r_i$,
wherein $num_i$ and $r_i$ are the number of data points in the $i$-th circular region and the radius of that region, respectively;
4.2 take the shortest distance from circle $i$ to a circle $j$ of higher density as the distance value of the samples in that circle, denoted $\delta_i$ and defined as $\delta_i=\min_{j:\rho_j>\rho_i}\mathrm{dist}(s_i,s_j)$; for the circle with the globally highest density, $\delta_i=\max_{j\neq i}\delta_j$;
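The following sketch computes $\rho_i$ and $\delta_i$ for the circles produced above. The density formula itself is shown as an image in the published text, so $\rho_i = num_i / r_i$ (point count over radius) is assumed here, and distances between circles are taken as center-to-center Euclidean distances; the function name circle_rho_delta is hypothetical.

```python
import numpy as np

def circle_rho_delta(centers, radii, members):
    """Sketch of step 4: density rho and distance delta for each circle.
    centers: (n_circles, D) array, radii: (n_circles,) array, members: index lists."""
    num = np.array([len(m) for m in members], dtype=float)
    rho = num / (radii + 1e-12)                 # assumed density: points per unit radius
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    delta = np.zeros(len(rho))
    order = np.argsort(-rho)                    # circles from highest to lowest density
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = d[i, order[:rank]].min()     # shortest distance to any denser circle
    delta[order[0]] = delta.max()               # densest circle: delta = max of the others
    return rho, delta
```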
step 5, after the Group algorithm has represented all sample points by a number of circular regions, calculate the density $\rho_i$ and distance $\delta_i$ of every circle; circles with both a high density $\rho$ and a large distance $\delta$ are taken as cluster centers (cluster-center circles) and selected through the decision graph; after the cluster-center circles are found, each cluster-center circle is first given a distinct class label, and a density-based assignment is then adopted;
step 6, each non-center circle follows the nearest circle whose density is higher than its own until all non-center circles have been assigned; after the assignment is completed, each non-center circle simply takes the class label of the circle it was assigned to, and the clustering is finished.
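Steps 5 and 6 can be sketched as follows. The original selects the cluster-center circles by hand from the decision graph; picking the $k$ circles with the largest $\gamma_i=\rho_i\delta_i$ is used below only as an automatic stand-in and is an assumption, not the patented selection rule; the function name cluster_circles is hypothetical.

```python
import numpy as np

def cluster_circles(rho, delta, centers, k):
    """Sketch of steps 5-6: pick k cluster-center circles and propagate class labels."""
    gamma = rho * delta
    center_ids = np.argsort(-gamma)[:k]
    labels = -np.ones(len(rho), dtype=int)
    labels[center_ids] = np.arange(k)           # give each cluster-center circle its own class mark
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    for i in np.argsort(-rho):                  # descending density: denser circles are labeled first
        if labels[i] == -1:
            denser = np.where(rho > rho[i])[0]
            if denser.size == 0:                # densest circle not picked as a center (edge case)
                labels[i] = labels[center_ids[0]]
            else:                               # follow the nearest circle of higher density
                labels[i] = labels[denser[d[i, denser].argmin()]]
    return labels
```

Under these assumptions the whole pipeline reads eps = adaptive_eps(X); centers, radii, members = group_circles(X, eps); rho, delta = circle_rho_delta(centers, radii, members); labels = cluster_circles(rho, delta, centers, k); each original sample then inherits the label of the circle the Group algorithm placed it in.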
The effects of the present invention can be further illustrated by the following simulation experiments.
1) Simulation conditions
The experiments were run on Windows 10 with Matlab R2014a (64-bit) as the simulation software, an Intel(R) Core(TM) i5 processor, and 4.00 GB of installed memory.
Table 1 shows UCI real data:
TABLE 1
2) Simulation result
The proposed algorithm and the DPC method were compared on UCI real data sets. To further verify the performance of the algorithm on real data, experiments were carried out on the 5 common UCI data sets in Table 1. The clustering results are evaluated with the common ACC and F-measure indices, whose values lie in [0,1]; the larger the value, the better the clustering effect. The speed of the algorithm is evaluated by the clustering execution time t (in seconds); the shorter the time, the better.
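For reference, the ACC index is commonly computed by matching predicted cluster labels to ground-truth classes one-to-one; the sketch below uses SciPy's Hungarian assignment, an implementation choice rather than anything specified in the patent, and the function name clustering_acc is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """Clustering accuracy (ACC): best one-to-one match between predicted cluster
    labels and ground-truth classes, found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    w = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-w)        # maximize the total matched count
    return w[row, col].sum() / len(y_true)
```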
TABLE 2
As can be seen from Table 2, the method of the present invention yields better results than the DPC algorithm, and its execution time is greatly reduced, especially when the data size is large. Compared with the DPC algorithm, the method is therefore better suited to processing large-scale data and has greater practical engineering value.
Details not described in this specification are well known to those skilled in the art.
Claims (1)
1. A density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate neighbor search, characterized by comprising the following steps:
step 1, inputting a data set $X=\{x_1,x_2,\dots,x_n\}\in\mathbb{R}^D$, wherein $x_i$ represents a sample point in the data set, $D$ represents the sample dimension, and $n$ represents the number of samples;
step 2, determining an Eps radius parameter, wherein the process is as follows:
2.1 first calculate the Euclidean distance between sample points $x_i$ and $x_j$, $\mathrm{dist}(x_i,x_j)=\|x_i-x_j\|_2=\sqrt{\sum_{k=1}^{D}(x_{ik}-x_{jk})^2}$, obtaining the distance distribution matrix $\mathrm{DIST}_{n\times n}$:
$\mathrm{DIST}_{n\times n}=\{\mathrm{dist}(x_i,x_j),\ 1\le i\le n,\ 1\le j\le n\}$ (1)
wherein $x_i$ represents the $i$-th sample point; the values in each row of $\mathrm{DIST}_{n\times n}$ are sorted from small to large, and $\mathrm{DIST}_{n\times m}$ records the $m$ smallest distance values of each data point after sorting, with $m=[0.01n]$; when the data set contains fewer than 200 samples, $m=2$;
2.2 in $\mathrm{DIST}_{n\times m}$, the distances of each sample point, i.e. each row vector, should carry their own weight $\alpha_i$ in the distance measure; the choice of $\alpha_i$ is related to the statistical characteristics of the data itself: the variance describes how far a random variable deviates from its mean, and the larger the variance, the larger the fluctuation;
2.3 first calculate the standard deviation of each row of $\mathrm{DIST}_{n\times m}$, $\sigma_i=\sqrt{\tfrac{1}{m}\sum_{j=1}^{m}\big(\mathrm{DIST}_{ij}-\overline{\mathrm{DIST}_i}\big)^2}$, wherein $\overline{\mathrm{DIST}_i}$ is the mean of the $i$-th row of $\mathrm{DIST}_{n\times m}$;
2.4 the larger the standard deviation, the smaller the corresponding weight should be, so the weight $\alpha_i$ is defined to decrease as $\sigma_i$ increases; the weights are then normalized, $\hat{\alpha}_i=\alpha_i/\sum_{j=1}^{n}\alpha_j$, thereby giving Eps;
the operation is divided into two stages: a first stage of running a Group algorithm on the whole data set to obtain a series of circular areas; a second stage of clustering the obtained circular regions using a DPC-based algorithm;
step 3, running a Group algorithm on the whole data set to obtain a series of circular areas;
after the Group algorithm has traversed the whole data set, a series of circles of different sizes is formed; each data point is assigned in turn to one circle $s$, all circles together form the set $S$, and $s_m$ denotes the center of each circle;
step 4, clustering the obtained circular regions using a DPC-based (density peak clustering) algorithm, wherein a cluster center has a higher density $\rho$ and a larger distance $\delta$ from circles of higher density; the process is as follows:
4.1 redefine the density $\rho$: after the Group algorithm has traversed the whole data set, the resulting circular regions may border one another but do not intersect, and every data point has been assigned to exactly one circular region; the density $\rho_i$ of each circular region is therefore defined in terms of $num_i$ and $r_i$,
wherein $num_i$ and $r_i$ are the number of data points in the $i$-th circular region and the radius of that region, respectively;
4.2 take the shortest distance from circle $i$ to a circle $j$ of higher density as the distance value of the samples in that circle, denoted $\delta_i$ and defined as $\delta_i=\min_{j:\rho_j>\rho_i}\mathrm{dist}(s_i,s_j)$; for the circle with the globally highest density, $\delta_i=\max_{j\neq i}\delta_j$;
step 5, after the Group algorithm has represented all sample points by a number of circular regions, calculate the density $\rho_i$ and distance $\delta_i$ of every circle; circles with both a high density $\rho$ and a large distance $\delta$ are taken as cluster centers and selected through the decision graph; after the cluster-center circles are found, each cluster-center circle is first given a distinct class label, and a density-based assignment is then adopted;
step 6, each non-center circle follows the nearest circle whose density is higher than its own until all non-center circles have been assigned; after the assignment is completed, each non-center circle simply takes the class label of the circle it was assigned to, and the clustering is finished.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811515205.8A CN109711439A (en) | 2018-12-12 | 2018-12-12 | A kind of extensive tourist's representation data clustering method in density peak accelerating neighbor seaching using Group algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811515205.8A CN109711439A (en) | 2018-12-12 | 2018-12-12 | A kind of extensive tourist's representation data clustering method in density peak accelerating neighbor seaching using Group algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109711439A true CN109711439A (en) | 2019-05-03 |
Family
ID=66255617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811515205.8A Pending CN109711439A (en) | 2018-12-12 | 2018-12-12 | A kind of extensive tourist's representation data clustering method in density peak accelerating neighbor seaching using Group algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109711439A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112070387A (en) * | 2020-09-04 | 2020-12-11 | 北京交通大学 | Multipath component clustering performance evaluation method in complex propagation environment |
CN112070387B (en) * | 2020-09-04 | 2023-09-26 | 北京交通大学 | Method for evaluating multipath component clustering performance of complex propagation environment |
CN113743457A (en) * | 2021-07-29 | 2021-12-03 | 暨南大学 | Quantum density peak value clustering method based on quantum Grover search technology |
CN113743457B (en) * | 2021-07-29 | 2023-07-28 | 暨南大学 | Quantum density peak clustering method based on quantum Grover search technology |
CN116796214A (en) * | 2023-06-07 | 2023-09-22 | 南京北极光生物科技有限公司 | Data clustering method based on differential features |
CN116796214B (en) * | 2023-06-07 | 2024-01-30 | 南京北极光生物科技有限公司 | Data clustering method based on differential features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105930862A (en) | Density peak clustering algorithm based on density adaptive distance | |
CN109711439A (en) | A kind of extensive tourist's representation data clustering method in density peak accelerating neighbor seaching using Group algorithm | |
CN113344019A (en) | K-means algorithm for improving decision value selection initial clustering center | |
CN109508752A (en) | A kind of quick self-adapted neighbour's clustering method based on structuring anchor figure | |
CN110991518B (en) | Two-stage feature selection method and system based on evolutionary multitasking | |
Yi et al. | An improved initialization center algorithm for K-means clustering | |
CN106845536B (en) | Parallel clustering method based on image scaling | |
Cheng et al. | A local cores-based hierarchical clustering algorithm for data sets with complex structures | |
CN113850281A (en) | Data processing method and device based on MEANSHIFT optimization | |
CN116503676B (en) | Picture classification method and system based on knowledge distillation small sample increment learning | |
CN115496138A (en) | Self-adaptive density peak value clustering method based on natural neighbors | |
CN114386466B (en) | Parallel hybrid clustering method for candidate signal mining in pulsar search | |
CN113435108A (en) | Battlefield target grouping method based on improved whale optimization algorithm | |
Kumar et al. | Automatic clustering and feature selection using gravitational search algorithm and its application to microarray data analysis | |
CN112232383A (en) | Integrated clustering method based on super-cluster weighting | |
CN110781943A (en) | Clustering method based on adjacent grid search | |
CN114638301A (en) | Density peak value clustering algorithm based on density similarity | |
CN108614889B (en) | Moving object continuous k nearest neighbor query method and system based on Gaussian mixture model | |
CN116578893A (en) | Clustering integration system and method for self-adaptive density peak value | |
CN110309424A (en) | A kind of socialization recommended method based on Rough clustering | |
CN103336781B (en) | A kind of medical image clustering method | |
CN114328922B (en) | Selective text clustering integration method based on spectrogram theory | |
CN110837853A (en) | Rapid classification model construction method | |
CN115965318A (en) | Logistics center site selection method based on variable center evolution clustering | |
CN113837293A (en) | mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190503 |