CN109711439A - Density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search - Google Patents

Info

Publication number
CN109711439A
Authority
CN
China
Prior art keywords
circle
distance
density
data
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811515205.8A
Other languages
Chinese (zh)
Inventor
李胜
洪彩霞
何熊熊
常丽萍
杨建军
管俊轶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201811515205.8A priority Critical patent/CN109711439A/en
Publication of CN109711439A publication Critical patent/CN109711439A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate neighbor search. First, a density-adaptive distance is defined from the Euclidean distance and the statistical properties of the data set itself. All samples are then traversed with the Group algorithm, forming circular regions of different sizes that do not intersect. The density and distance of each circle are calculated under a newly defined density. After the cluster-center circles are identified on a decision graph, each remaining circle is assigned to its nearest circle of higher density, which completes the clustering. The method completes clustering quickly without reducing clustering accuracy, has a clear advantage on large-scale data, and better meets the needs of practical engineering applications.

Description

Density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search
Technical Field
The invention relates to the field of density clustering, and in particular to a density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate clustering.
Background
Clustering is an important component of data mining: it is the process of dividing a collection of physical or abstract objects into classes of similar objects. Put plainly, clustering divides the target objects into several clusters so that objects in the same cluster are highly similar and objects in different clusters are not. Cluster analysis is a common data analysis tool with broad application prospects in pattern recognition, image processing, machine learning, web search, marketing and other fields. Traditional clustering methods mainly comprise partition-based, hierarchical, density-based, grid-based and graph-based clustering. The K-means algorithm is a classical partition-based algorithm that improves clustering quality over multiple iterations. Because it is very sensitive to the initial cluster centers, a poor initial choice easily traps the result in a local optimum and makes the clustering result unstable. K-means is also unsuitable for clusters of arbitrary shape. The density-based DBSCAN algorithm and the graph-based spectral clustering (SC) algorithm handle data sets of arbitrary shape but depend too heavily on parameter settings. Grid-based clustering algorithms such as STING and CLIQUE tend to lose clustering accuracy when processing data.
In 2014, Rodriguez et al. proposed in the journal Science an algorithm that can process data sets of arbitrary shape: Clustering by fast search and find of density peaks (DPC for short). The algorithm assumes that a cluster center has a higher density ρ and a relatively large distance δ from other data points of higher local density. Compared with traditional clustering algorithms, density peak clustering achieves a good clustering effect, but usually at the cost of a longer running time.
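As a point of reference, the two DPC quantities can be sketched in a few lines of Python. This is a minimal illustration of the cutoff-kernel form of the published DPC algorithm, not code from the patent; the function name `dpc_rho_delta` and the cutoff parameter `d_c` are names chosen here for illustration.

```python
import math

def dpc_rho_delta(points, d_c):
    """Return (rho, delta) for every point, using the cutoff-kernel density."""
    n = len(points)
    # Full pairwise distance matrix: this O(n^2) step is what makes DPC slow.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho_i: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour gets its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

The full pairwise distance matrix is the part whose cost grows quadratically with the number of samples, which is the running-time cost the paragraph above refers to.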
Disclosure of Invention
To overcome the large amount of time that the existing DPC algorithm consumes when processing large-scale tourist profile data, the invention provides a density peak clustering method for large-scale tourist profile data that uses the Group algorithm to accelerate neighbor search. First, a density-adaptive distance is defined from the Euclidean distance and the statistical characteristics of the data set, so that the spatial distribution structure of the data is described better. Second, the Group algorithm is combined with the DPC algorithm and a new way of defining density is proposed. Experiments on real UCI data sets show that the new algorithm preserves the clustering effect while greatly reducing the time spent on clustering.
To solve the above technical problem, the invention adopts the following technical solution:
A density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search, comprising the following steps:
Step 1: input a data set $X=\{x_1,x_2,\dots,x_n\}\in R^D$, where $x$ denotes a sample point in the data set, $D$ the sample dimension, and $n$ the number of samples;
Step 2: determine the Eps radius parameter, as follows:
2.1 First calculate the Euclidean distance between sample points $x_i$ and $x_j$, $dist(x_i,x_j)=\sqrt{\sum_{d=1}^{D}(x_{id}-x_{jd})^2}$, to obtain the distance distribution matrix $DIST_{n\times n}$:
$DIST_{n\times n}=\{dist(x_i,x_j),\ 1\le i\le n,\ 1\le j\le n\}$ (1)
where $x_i$ denotes the $i$-th sample point. Sort the values in each row of $DIST_{n\times n}$ in ascending order and let $DIST_{n\times m}$ hold the $m$ nearest distance values of each of the $n$ data points after sorting, with $m=[0.01n]$; when the data set has fewer than 200 samples, $m=2$;
2.2 Each row vector of $DIST_{n\times m}$, that is, the distances of each sample point, should have its own weight $\alpha$, i.e.:
[formula missing in source]
where the choice of $\alpha_i$ depends on the statistical characteristics of the data itself. The variance describes how far a random variable deviates from its mean: the larger the variance, the larger the fluctuation;
2.3 First calculate the standard deviation $\sigma_i$ of each row of $DIST_{n\times m}$, i.e.:
$\sigma_i=\sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(DIST_{ij}-\overline{DIST_i}\right)^2}$
where $\overline{DIST_i}$ is the mean of row $i$ of $DIST_{n\times m}$;
2.4 The larger the standard deviation, the smaller the corresponding weight should be, so the weight is defined as:
[formula missing in source]
The weights are then normalized: [formula missing in source], which gives Eps;
The operation is divided into two stages: in the first stage the Group algorithm is run on the whole data set to obtain a series of circular regions; in the second stage the resulting circular regions are clustered with a DPC-based algorithm;
Step 3: run the Group algorithm on the whole data set to obtain a series of circular regions;
After the Group algorithm has traversed the whole data set, a series of circles of different sizes has been formed, each denoted $s$; every data point is assigned in turn to one of the circles $s$, all circles $s$ form the set $S$, and $s_m$ denotes the center of a circle;
Step 4: cluster the resulting circular regions with a DPC (density peak clustering)-based algorithm, in which a cluster center has a higher density ρ and a larger distance δ from points of higher density, as follows:
4.1 Redefine the density ρ. After the Group algorithm has traversed the whole data set, the resulting circular regions may overlap geometrically, but they share no data points: each data point in an overlap belongs to only one of the intersecting circular regions. The density of each circular region is therefore defined as:
[formula missing in source]
where $num_i$ and $r_i$ are the number of data points in the $i$-th circular region and the radius of that region, respectively;
4.2 The shortest distance from circle $i$ to a circle $j$ of higher density is taken as the distance value of the sample points in the circle and denoted $\delta_i$, defined as $\delta_i=\min_{j:\rho_j>\rho_i} d_{ij}$, where $d_{ij}$ is the distance between circles $i$ and $j$; for the circle with the globally highest density, $\delta_j=\max_{i\neq j}\delta_i$;
Step 5: once the Group algorithm has represented all sample points by a number of circular regions, calculate the density $\rho_i$ and distance $\delta_i$ of each circle. Circles with both a high density ρ and a large distance δ are taken as cluster centers and are selected on a decision graph. After the cluster-center circles are found, each is first given a distinct class label, and the remaining circles are then divided by density;
Step 6: each non-center circle follows its nearest circle of higher density until all non-center circles have been assigned. Each non-center circle then simply takes the class label of the circle it follows, which completes the clustering.
The beneficial effects of the invention are as follows:
(1) Compared with the plain Euclidean distance, the distance used here incorporates the statistical characteristics of the data, so it better describes the spatial distribution structure of the data and effectively captures the differences between data sets with different distributions, which further improves clustering performance.
(2) After the Group algorithm has traversed the whole data set, the invention processes circular regions rather than individual sample points, which greatly shortens the time required for clustering and widens the applicability of the original density algorithm.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention;
FIG. 2 is a flow chart of a preliminary phase of the Group algorithm;
FIG. 3 is a flow chart of a further stage of the Group algorithm;
FIG. 4 is a result of running a Group algorithm on an Agg dataset;
FIG. 5 is the final clustering result using the algorithm of the present invention on an Agg data set.
Detailed Description
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Referring to FIGS. 1 to 5, a density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search comprises the following steps:
Step 1: input a data set $X=\{x_1,x_2,\dots,x_n\}\in R^D$, where $x$ denotes a sample point in the data set, $D$ the sample dimension, and $n$ the number of samples;
Step 2: determine the Eps (radius) parameter, as follows:
2.1 First calculate the Euclidean distance between sample points $x_i$ and $x_j$, $dist(x_i,x_j)=\sqrt{\sum_{d=1}^{D}(x_{id}-x_{jd})^2}$, to obtain the distance distribution matrix $DIST_{n\times n}$:
$DIST_{n\times n}=\{dist(x_i,x_j),\ 1\le i\le n,\ 1\le j\le n\}$ (1)
where $x_i$ denotes the $i$-th sample point. Sort the values in each row of $DIST_{n\times n}$ in ascending order and let $DIST_{n\times m}$ hold the $m$ nearest distance values of each of the $n$ data points after sorting, with $m=[0.01n]$; when the data set has fewer than 200 samples, $m=2$;
2.2 The idea of the density-adaptive distance is mainly as follows: each row vector of $DIST_{n\times m}$, that is, the distances of each sample point, should have its own weight $\alpha$, i.e.:
[formula missing in source]
where the choice of $\alpha_i$ depends on the statistical characteristics of the data itself. The variance describes how far a random variable deviates from its mean: the larger the variance, the larger the fluctuation;
2.3 First calculate the standard deviation $\sigma_i$ of each row of $DIST_{n\times m}$, i.e.:
$\sigma_i=\sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(DIST_{ij}-\overline{DIST_i}\right)^2}$
where $\overline{DIST_i}$ is the mean of row $i$ of $DIST_{n\times m}$;
2.4 The larger the standard deviation, the smaller the corresponding weight should be, so the weight is defined as:
[formula missing in source]
The weights are then normalized: [formula missing in source], which gives Eps;
The Group algorithm is combined with the DPC algorithm, and the proposed algorithm operates in two stages: in the first stage the Group algorithm is run on the whole data set to obtain a series of circular regions; in the second stage the resulting circular regions are clustered with a DPC-based algorithm;
Step 3: run the Group algorithm on the whole data set to obtain a series of circular regions;
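Steps 2.1 to 2.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the bracket in m = [0.01n] is read as rounding down, the zero self-distance is left out of each row, and the standard deviation divides by the row length. The weight and Eps formulas of steps 2.2 and 2.4 are not computed here because they are not reproduced in this text. The function name `nearest_distance_stats` is hypothetical.

```python
import math

def nearest_distance_stats(points):
    """Rows of DIST_{n x m} plus the mean and standard deviation of each row.

    Requires at least two points.
    """
    n = len(points)
    # ASSUMPTION: [0.01 n] means rounding down; m = 2 for fewer than 200 samples.
    m = 2 if n < 200 else n // 100
    rows, means, stds = [], [], []
    for i, p in enumerate(points):
        # ASSUMPTION: the zero self-distance is excluded before sorting.
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:m]
        mean = sum(d) / len(d)
        # ASSUMPTION: population standard deviation (divide by the row length).
        std = math.sqrt(sum((v - mean) ** 2 for v in d) / len(d))
        rows.append(d)
        means.append(mean)
        stds.append(std)
    return rows, means, stds
```

A row with a small standard deviation comes from a point whose nearest neighbors are evenly spaced, which is the kind of row step 2.4 says should receive a larger weight.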
3.1 Traverse the whole data set and divide the sample points preliminarily, as shown in FIG. 2. The specific procedure is as follows:
Input the data set $X=\{x_1,x_2,\dots,x_n\}\in R^D$. For each sample point, search for a suitable existing circle: first calculate the Euclidean distance between the sample point $x$ and a circle center $s_m$ and compare it with Eps;
If the distance from the sample point to the circle center is at most Eps, i.e. $\|s_m-x\|\le Eps$, draw a circle with $s_m$ as its center and the farthest distance from $s_m$ to such a point $x$ as its radius; if [formula missing in source], or no $s_m$ in $S$ satisfies $\|s_m-x\|\le 2Eps$, define $x$ as a new circle center $s_m$; if neither of these two conditions is met, mark the sample point $x$ as unprocessed;
3.2 Further divide the sample points left unprocessed in step 3.1, as shown in FIG. 3. The specific procedure is as follows:
For an unprocessed sample point $x$, if $\|s_m-x\|\le Eps$, draw a circle with $s_m$ as its center and the distance from $s_m$ to $x$ as its radius; otherwise define $x$ as a new circle center $s_m$;
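The two passes of step 3 can be sketched as below. This is a minimal sketch, not the patented implementation. Two points are assumptions: a point that lies within Eps of several centers joins the nearest one, and the first point seen (when no circle exists yet) opens the first circle. The function name `group_circles` and its return layout are hypothetical.

```python
import math

def group_circles(points, eps):
    """Two-pass grouping into circles; returns (centers, radii, members)."""
    centers, radii, members = [], [], []

    def nearest(x):
        # Index of and distance to the closest existing center, or (None, inf).
        best, best_d = None, math.inf
        for k, c in enumerate(centers):
            d = math.dist(c, x)
            if d < best_d:
                best, best_d = k, d
        return best, best_d

    def join(k, idx, d):
        members[k].append(idx)
        radii[k] = max(radii[k], d)  # radius = farthest member from the center

    def open_circle(idx):
        centers.append(points[idx])
        radii.append(0.0)
        members.append([idx])

    pending = []
    for idx, x in enumerate(points):      # step 3.1: preliminary division
        k, d = nearest(x)
        if k is not None and d <= eps:
            join(k, idx, d)
        elif k is None or d > 2 * eps:    # ASSUMPTION: empty S opens a circle
            open_circle(idx)
        else:
            pending.append(idx)           # Eps < d <= 2*Eps: decide later
    for idx in pending:                   # step 3.2: unprocessed points
        k, d = nearest(points[idx])
        if d <= eps:
            join(k, idx, d)
        else:
            open_circle(idx)
    return centers, radii, members
```

Deferring the points that fall between Eps and 2·Eps lets them join a circle that is only created later in the first pass, instead of opening a new circle right next to an existing one.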
Step 4: cluster the resulting circular regions with a DPC (density peak clustering)-based algorithm;
The DPC algorithm rests on the following assumption: a cluster center has a higher density ρ and a larger distance δ from points of higher density;
4.1 Redefine the density ρ. After the Group algorithm has traversed the whole data set, the resulting circular regions may overlap geometrically, but they share no data points: each data point in an overlap belongs to only one of the intersecting circular regions. The density of each circular region is therefore defined as:
[formula missing in source]
where $num_i$ and $r_i$ are the number of data points in the $i$-th circular region and the radius of that region, respectively;
4.2 The shortest distance from circle $i$ to a circle $j$ of higher density is taken as the distance value of the sample points in the circle and denoted $\delta_i$, defined as $\delta_i=\min_{j:\rho_j>\rho_i} d_{ij}$, where $d_{ij}$ is the distance between circles $i$ and $j$; for the circle with the globally highest density, $\delta_j=\max_{i\neq j}\delta_i$;
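Step 4.2 can be illustrated with the short sketch below. Because the density formula of step 4.1 is not reproduced in this text, the densities are taken as an input rather than computed. Measuring the distance between two circles as the distance between their centers is an assumption, and the function name `circle_delta` is hypothetical.

```python
import math

def circle_delta(centers, rho):
    """delta for every circle, given circle centers and precomputed densities."""
    n = len(centers)
    delta = [None] * n
    for i in range(n):
        # ASSUMPTION: circle-to-circle distance = distance between centers.
        higher = [math.dist(centers[i], centers[j])
                  for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta[i] = min(higher)
    known = [d for d in delta if d is not None]
    top = max(known) if known else 0.0
    # Densest circle(s): delta_j = max over the other circles' delta values.
    return [top if d is None else d for d in delta]
```

Since δ is now computed between circles rather than between individual samples, the quadratic cost applies to the number of circles, which is where the time saving over point-wise DPC comes from.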
Step 5: once the Group algorithm has represented all sample points by a number of circular regions, calculate the density $\rho_i$ and distance $\delta_i$ of each circle. Circles with both a high density ρ and a large distance δ are taken as cluster centers (cluster-center circles) and are selected on a decision graph. After the cluster-center circles are found, each is first given a distinct class label, and the remaining circles are then divided by density;
Step 6: each non-center circle follows its nearest circle of higher density until all non-center circles have been assigned. Each non-center circle then simply takes the class label of the circle it follows, which completes the clustering.
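Steps 5 and 6 can be sketched as follows, assuming the cluster-center circles have already been picked from the decision graph and are passed in as indices. The sketch also assumes that the densest circle is among the chosen centers and again uses center-to-center distance; `assign_labels` and `center_ids` are hypothetical names.

```python
import math

def assign_labels(centers, rho, center_ids):
    """Class label for every circle; center_ids are the chosen center circles."""
    # Each cluster-center circle gets its own class label.
    label = {c: k for k, c in enumerate(center_ids)}
    # Visit circles from densest to sparsest, so the circle being followed
    # has always been labelled already.
    for i in sorted(range(len(centers)), key=lambda i: -rho[i]):
        if i in label:
            continue
        # ASSUMPTION: the densest circle is in center_ids, so this is non-empty.
        denser = [j for j in range(len(centers)) if rho[j] > rho[i]]
        parent = min(denser, key=lambda j: math.dist(centers[i], centers[j]))
        label[i] = label[parent]
    return [label[i] for i in range(len(centers))]
```

Every sample point then inherits the label of the circle it was assigned to in step 3, so no per-point neighbor search is needed in this stage.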
The effects of the present invention can be further illustrated by the following simulation experiments.
1) Simulation conditions
The experiments were run on Windows 10 with the simulation software Matlab R2014a (64-bit), an Intel(R) Core(TM) i5 processor, and 4.00 GB of installed memory.
Table 1 shows UCI real data:
TABLE 1
2) Simulation result
The algorithm of the invention and the DPC method are compared experimentally on real UCI data sets. To further verify the performance of the algorithm on real data, the 5 commonly used UCI data sets in Table 1 are used, and the clustering results are evaluated with the common ACC and F-measure indexes. Both indexes take values in [0,1], and a larger value indicates a better clustering result. The speed of the algorithm is evaluated by the clustering execution time (t/s); the shorter the time, the better.
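The text does not spell out how ACC is computed. The sketch below uses the definition commonly applied to clustering results, which is an assumption here: the fraction of samples labelled correctly under the best one-to-one mapping between predicted clusters and true classes. The brute-force search over mappings is only practical for a handful of clusters, and the function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy in [0, 1] between true classes and predicted clusters."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so every predicted cluster can be mapped to something,
    # even when there are more clusters than true classes.
    k = max(len(true_ids), len(pred_ids))
    targets = true_ids + [None] * (k - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(true_labels, pred_labels)
                   if mapping[p] == t)
        best = max(best, hits)
    return best / len(true_labels)
```

For more than a few clusters, an assignment solver would replace the permutation loop; the brute-force form is kept here only because it is easy to check by hand.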
TABLE 2
As Table 2 shows, the method of the invention gives better results than the DPC algorithm. The execution time is greatly reduced, especially when the data size is large. Compared with the DPC algorithm, the method is better suited to processing large-scale data and has greater value in practical engineering applications.
Details not described in this specification are well known to those skilled in the art.

Claims (1)

1. A density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search, characterized by comprising the following steps:
Step 1: input a data set $X=\{x_1,x_2,\dots,x_n\}\in R^D$, where $x$ denotes a sample point in the data set, $D$ the sample dimension, and $n$ the number of samples;
Step 2: determine the Eps radius parameter, as follows:
2.1 First calculate the Euclidean distance between sample points $x_i$ and $x_j$, $dist(x_i,x_j)=\sqrt{\sum_{d=1}^{D}(x_{id}-x_{jd})^2}$, to obtain the distance distribution matrix $DIST_{n\times n}$:
$DIST_{n\times n}=\{dist(x_i,x_j),\ 1\le i\le n,\ 1\le j\le n\}$ (1)
where $x_i$ denotes the $i$-th sample point. Sort the values in each row of $DIST_{n\times n}$ in ascending order and let $DIST_{n\times m}$ hold the $m$ nearest distance values of each of the $n$ data points after sorting, with $m=[0.01n]$; when the data set has fewer than 200 samples, $m=2$;
2.2 Each row vector of $DIST_{n\times m}$, that is, the distances of each sample point, should have its own weight $\alpha$, i.e.:
[formula missing in source]
where the choice of $\alpha_i$ depends on the statistical characteristics of the data itself. The variance describes how far a random variable deviates from its mean: the larger the variance, the larger the fluctuation;
2.3 First calculate the standard deviation $\sigma_i$ of each row of $DIST_{n\times m}$, i.e.:
$\sigma_i=\sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(DIST_{ij}-\overline{DIST_i}\right)^2}$
where $\overline{DIST_i}$ is the mean of row $i$ of $DIST_{n\times m}$;
2.4 The larger the standard deviation, the smaller the corresponding weight should be, so the weight is defined as:
[formula missing in source]
The weights are then normalized: [formula missing in source], which gives Eps;
The operation is divided into two stages: in the first stage the Group algorithm is run on the whole data set to obtain a series of circular regions; in the second stage the resulting circular regions are clustered with a DPC-based algorithm;
Step 3: run the Group algorithm on the whole data set to obtain a series of circular regions;
After the Group algorithm has traversed the whole data set, a series of circles of different sizes has been formed, each denoted $s$; every data point is assigned in turn to one of the circles $s$, all circles $s$ form the set $S$, and $s_m$ denotes the center of a circle;
Step 4: cluster the resulting circular regions with a DPC (density peak clustering)-based algorithm, in which a cluster center has a higher density ρ and a larger distance δ from points of higher density, as follows:
4.1 Redefine the density ρ. After the Group algorithm has traversed the whole data set, the resulting circular regions may overlap geometrically, but they share no data points: each data point in an overlap belongs to only one of the intersecting circular regions. The density of each circular region is therefore defined as:
[formula missing in source]
where $num_i$ and $r_i$ are the number of data points in the $i$-th circular region and the radius of that region, respectively;
4.2 The shortest distance from circle $i$ to a circle $j$ of higher density is taken as the distance value of the sample points in the circle and denoted $\delta_i$, defined as $\delta_i=\min_{j:\rho_j>\rho_i} d_{ij}$, where $d_{ij}$ is the distance between circles $i$ and $j$; for the circle with the globally highest density, $\delta_j=\max_{i\neq j}\delta_i$;
Step 5: once the Group algorithm has represented all sample points by a number of circular regions, calculate the density $\rho_i$ and distance $\delta_i$ of each circle. Circles with both a high density ρ and a large distance δ are taken as cluster centers and are selected on a decision graph. After the cluster-center circles are found, each is first given a distinct class label, and the remaining circles are then divided by density;
Step 6: each non-center circle follows its nearest circle of higher density until all non-center circles have been assigned. Each non-center circle then simply takes the class label of the circle it follows, which completes the clustering.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811515205.8A CN109711439A (en) 2018-12-12 2018-12-12 Density peak clustering method for large-scale tourist profile data using the Group algorithm to accelerate neighbor search

Publications (1)

Publication Number Publication Date
CN109711439A true CN109711439A (en) 2019-05-03

Family

ID=66255617

Country Status (1)

Country Link
CN (1) CN109711439A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070387A (en) * 2020-09-04 2020-12-11 北京交通大学 Multipath component clustering performance evaluation method in complex propagation environment
CN112070387B (en) * 2020-09-04 2023-09-26 北京交通大学 Method for evaluating multipath component clustering performance of complex propagation environment
CN113743457A (en) * 2021-07-29 2021-12-03 暨南大学 Quantum density peak value clustering method based on quantum Grover search technology
CN113743457B (en) * 2021-07-29 2023-07-28 暨南大学 Quantum density peak clustering method based on quantum Grover search technology
CN116796214A (en) * 2023-06-07 2023-09-22 南京北极光生物科技有限公司 Data clustering method based on differential features
CN116796214B (en) * 2023-06-07 2024-01-30 南京北极光生物科技有限公司 Data clustering method based on differential features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190503