CN109711439A

CN109711439A - Density peak clustering method for large-scale tourist profile data that accelerates neighbor search using the Group algorithm

Info

Publication number: CN109711439A
Application number: CN201811515205.8A
Authority: CN
Inventors: 李胜; 洪彩霞; 何熊熊; 常丽萍; 杨建军; 管俊轶
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2019-05-03

Abstract

A density peak clustering method for large-scale tourist profile data that accelerates neighbor search using the Group algorithm. First, a density-adaptive distance is defined based on the Euclidean distance and the statistical properties of the data set itself. Then all samples are traversed using the Group algorithm, forming circular regions of different sizes that do not intersect. The density and distance of each circle are then calculated according to a newly defined density. After the cluster-center circles are found using a decision graph, each remaining circle is assigned to its nearest circle of higher density, which completes the clustering. The method completes clustering quickly without degrading clustering accuracy, has a clear advantage when handling large-scale data, and better meets the needs of practical engineering applications.

Description

Density peak clustering method for large-scale tourist profile data that accelerates neighbor search using the Group algorithm

Technical Field

The invention relates to the field of density clustering, and in particular to a density peak clustering method for large-scale tourist profile data that accelerates clustering using the Group algorithm.

Background

Clustering is an important component of data mining technology. It refers to the process of dividing a collection of physical or abstract objects into classes composed of similar objects. Informally, clustering divides the target objects into several clusters so that objects in the same cluster are highly similar and objects in different clusters are dissimilar. Cluster analysis is a common data analysis tool with wide application prospects in pattern recognition, image processing, machine learning, web search, marketing and other fields. Traditional clustering methods mainly fall into the following categories: partition-based, hierarchical, density-based, grid-based and graph-based clustering. The K-means algorithm is a classical partition-based algorithm that improves clustering quality through multiple iterations. Because it is very sensitive to the initial cluster centers, a poor choice of initial centers easily leads to a local optimum and an unstable clustering result. The K-means algorithm is also unsuitable for clusters of arbitrary shape. The density-based DBSCAN algorithm and the graph-based SC (spectral clustering) algorithm can handle data sets of any shape, but they depend too heavily on parameter settings. Grid-based clustering algorithms such as STING and CLIQUE tend to reduce clustering accuracy when processing data.

In 2014, Rodriguez et al. proposed in the journal Science an algorithm that can process data sets of any shape: clustering by fast search and find of density peaks (DPC for short). The algorithm assumes that a cluster center has a higher density ρ and a relatively large distance δ from other data points of higher local density. Compared with traditional clustering algorithms, the density peak clustering algorithm achieves a good clustering effect, but usually at the cost of a longer running time.
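The two quantities that DPC relies on can be illustrated with a minimal sketch. It assumes the cutoff-kernel form of the density (the number of other points closer than a cutoff distance); the function name `dpc_rho_delta`, the parameter `d_c` and the sample points are illustrative choices, not taken from this document. The all-pairs distance table it builds is the source of the long running time mentioned above.

```python
import math

def dpc_rho_delta(points, d_c):
    """Return (rho, delta) per point; d_c is an assumed cutoff distance."""
    n = len(points)
    # All pairwise Euclidean distances: O(n^2) time and memory.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho_i: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour takes its largest distance instead.
        # Ties in rho are not broken in this sketch.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Two small, well separated groups (made-up coordinates).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
rho, delta = dpc_rho_delta(points, d_c=0.2)
```

Points that combine a high `rho` with a high `delta` are the candidates for cluster centers on the decision graph.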

Disclosure of Invention

To overcome the drawback that the existing DPC algorithm consumes a large amount of time when processing large-scale tourist profile data, the invention provides a density peak clustering method for large-scale tourist profile data that accelerates neighbor search using the Group algorithm. First, a density-adaptive distance is defined based on the Euclidean distance and the statistical characteristics of the data set, so that the spatial distribution structure of the data is better described. Second, the Group algorithm is combined with the DPC algorithm and a new way of defining density is proposed. Experiments on real UCI data sets show that the new algorithm preserves the clustering effect while greatly reducing the time spent on clustering.

In order to solve the technical problems, the invention adopts the following technical scheme:

A density peak clustering method for large-scale tourist profile data that accelerates neighbor search using the Group algorithm comprises the following steps:

step 1, inputting a data set X = {x₁,x₂,…,x_n}∈R^D, where x denotes a sample point in the data set, D denotes the sample dimension, and n denotes the number of samples;

step 2, determining an Eps radius parameter, wherein the process is as follows:

2.1 first, calculate the Euclidean distance between sample points x_i and x_j, dist(x_i,x_j) = ||x_i − x_j||, to obtain the distance matrix DIST_n×n:

DIST_n×n＝{dist(x_i,x_j),1≤i≤n,1≤j≤n} (1)

where x_i denotes the i-th sample point. The values in each row of DIST_n×n are sorted in ascending order, and DIST_n×m denotes the matrix that keeps, for each of the n data points, its m nearest distance values after sorting, with m = [0.01n]; when the data set contains fewer than 200 samples, m = 2;

2.2 in DIST_n×m, the distances of each sample point, i.e., the data of each row vector, should carry their own weight α, i.e.:

where the choice of α_i depends on the statistical characteristics of the data itself. The variance describes how far a random variable deviates from its mean; the larger the variance, the larger the fluctuation;

2.3 first, calculate the standard deviation σ of each row of DIST_n×m, i.e.:

where the mean used in the formula is the mean of row i of DIST_n×m;

2.4 the larger the standard deviation, the smaller the corresponding weight should be, so the weight is defined as:

normalizing the weights: thereby giving Eps;

the operation is divided into two stages: a first stage of running a Group algorithm on the whole data set to obtain a series of circular areas; a second stage of clustering the obtained circular regions using a DPC-based algorithm;

step 3, running a Group algorithm on the whole data set to obtain a series of circular areas;

after the Group algorithm has traversed the whole data set, a series of circles of different sizes is formed, each denoted s; every data point is assigned in turn to one of the circles s, all circles s form the set S, and s_m denotes the center of a circle;

step 4, clustering the obtained circular regions using a DPC-based algorithm, in which a cluster center has a higher density ρ and a larger distance δ from points of higher density; the process is as follows:

4.1 redefining the density ρ: after the Group algorithm has traversed the whole data set, the resulting circular regions may overlap geometrically, but they share no data points, because each data point lying in an overlap is assigned to one of the intersecting circular regions; the density of each circular region is therefore defined as:

where num_i and r_i are, respectively, the number of data points in the i-th circular region and the radius of that region;

4.2 the shortest distance from circle i to a circle j of higher density is taken as the distance value of the sample points in the circle, denoted δ_i and defined as: for the data point with the highest global density, δ_j = max_i≠j δ_i;

step 5, after the Group algorithm has represented all sample points by a number of circular regions, the density ρ_i and the distance δ_i of each circle are calculated; circles with both a higher density ρ and a larger distance δ are taken as cluster centers, which are selected through a decision graph; once the cluster-center circles are found, each of them is first given a distinct class label, and a density-based assignment is then applied;

step 6, each non-center circle follows its nearest circle of higher density until all non-center circles have been assigned; once the assignment is complete, each non-center circle simply takes the class label of the circle it follows, and the clustering is finished.

The beneficial effects of the invention are as follows:

(1) compared with the plain Euclidean distance, the density-adaptive distance incorporates the statistical characteristics of the data, so it better describes the spatial distribution structure of the data and effectively captures the differences between data sets with different distributions, which further improves clustering performance.

(2) after the Group algorithm has traversed the whole data set, the invention treats each circular region as a sample point instead of processing individual sample points, which greatly shortens the time required for clustering and gives the original density algorithm wider applicability.

Drawings

FIG. 1 is an overall flow diagram of the method of the present invention;

FIG. 2 is a flow chart of a preliminary phase of the Group algorithm;

FIG. 3 is a flow chart of a further stage of the Group algorithm;

FIG. 4 is a result of running a Group algorithm on an Agg dataset;

FIG. 5 is the final clustering result using the algorithm of the present invention on an Agg data set.

Detailed Description

For the purpose of illustrating the objects, technical solutions and advantages of the present invention, the present invention will be described in further detail below with reference to specific embodiments and accompanying drawings.

Referring to FIG. 1 to FIG. 5, a density peak clustering method for large-scale tourist profile data that accelerates neighbor search using the Group algorithm includes the following steps:

step 1, inputting a data set X = {x₁,x₂,…,x_n}∈R^D, where x denotes a sample point in the data set, D denotes the sample dimension, and n denotes the number of samples;

step 2, determining an Eps (radius) parameter, and the process is as follows:

DIST_n×n＝{dist(x_i,x_j),1≤i≤n,1≤j≤n} (1)

where x_i denotes the i-th sample point. The values in each row of DIST_n×n are sorted in ascending order, and DIST_n×m denotes the matrix that keeps, for each of the n data points, its m nearest distance values after sorting, with m = [0.01n]; when the data set contains fewer than 200 samples, m = 2;

2.2 the idea of the density-adaptive distance is mainly as follows: in DIST_n×m, the distances of each sample point, i.e., the data of each row vector, should carry their own weight α, i.e.:

where the choice of α_i depends on the statistical properties of the data itself. The variance describes how far a random variable deviates from its mean; the larger the variance, the larger the fluctuation;

2.3 first, calculate the standard deviation σ of each row of DIST_n×m, i.e.:

where the mean used in the formula is the mean of row i of DIST_n×m;

normalizing the weights: thereby giving Eps;
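The parts of step 2 that are fully stated in the text (the sorted nearest-distance rows and their per-row mean and standard deviation) can be sketched as follows. This is only an illustration under stated assumptions: the bracket in m = [0.01n] is read as rounding down, the zero self-distance is left out of each row, and the population form of the standard deviation is used. The weight and Eps formulas are not reproduced in this text, so the sketch stops before them. The names `knn_distance_rows` and `row_mean_std` are hypothetical.

```python
import math

def knn_distance_rows(points):
    """For each point, return its m smallest distances in ascending order."""
    n = len(points)
    # m = [0.01 n], read here as floor (assumption); m = 2 below 200 samples.
    m = 2 if n < 200 else n // 100
    rows = []
    for i, p in enumerate(points):
        # Self-distance is excluded (assumption); the rest is sorted ascending.
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        rows.append(d[:m])
    return rows

def row_mean_std(row):
    """Mean and standard deviation of one row (population form, an assumption)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return mean, math.sqrt(var)

# Made-up one-dimensional example written as 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
rows = knn_distance_rows(points)
stats = [row_mean_std(r) for r in rows]
```

A row with a large standard deviation belongs to a point whose nearest neighbours lie at very uneven distances, which is the case in which the text asks for a smaller weight.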

the Group algorithm is combined with the DPC algorithm, and the proposed new algorithm operates in two stages: a first stage of running the Group algorithm on the whole data set to obtain a series of circular regions; a second stage of clustering the obtained circular regions using a DPC-based algorithm;

step 3, running the Group algorithm on the whole data set to obtain a series of circular regions;

3.1, traversing the whole data set, and performing preliminary division on the sample points, as shown in fig. 2, the specific implementation process is as follows:

input the data set X = {x₁,x₂,…,x_n}∈R^d. For each sample point, a suitable existing circle is searched for: first, the Euclidean distance from the sample point x to each circle center s_m is calculated and compared with Eps;

if the distance from a given sample point x to a circle center is less than or equal to Eps, i.e. ||s_m − x|| ≤ Eps, a circle is drawn with s_m as its center and the farthest distance from s_m to such points x as its radius; if S is empty, or no s_m exists in S such that ||s_m − x|| ≤ 2Eps, x is defined as a new circle center s_m; if neither of these two conditions is met, the sample point x is marked as an unprocessed point;

3.2 further dividing the unprocessed sample points in step 3.1, as shown in fig. 3, the specific implementation process is as follows:

for an unprocessed sample point x, if ||s_m − x|| ≤ Eps, a circle is drawn with s_m as its center and the distance from s_m to x as its radius; otherwise, x is defined as a new circle center s_m;
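The two passes of step 3 can be sketched as below. This is a minimal illustration rather than the patented implementation: it assumes that a point within Eps of several centers joins the nearest one, and that a circle's radius is the largest distance from its center to any of its members. The function name `group_circles` and the sample values are hypothetical.

```python
import math

def group_circles(points, eps):
    """Two-pass grouping sketch; returns (centers, radii, members)."""
    centers, radii, members = [], [], []

    def nearest_center(x):
        # Index and distance of the closest existing center, (None, inf) if none.
        best, best_d = None, math.inf
        for k, c in enumerate(centers):
            d = math.dist(c, x)
            if d < best_d:
                best, best_d = k, d
        return best, best_d

    def join(k, idx, d):
        members[k].append(idx)
        radii[k] = max(radii[k], d)  # radius grows to the farthest member

    def open_circle(idx):
        centers.append(points[idx])
        radii.append(0.0)
        members.append([idx])

    pending = []
    for idx, x in enumerate(points):      # step 3.1: preliminary division
        k, d = nearest_center(x)
        if k is not None and d <= eps:
            join(k, idx, d)
        elif k is None or d > 2 * eps:    # S empty, or no center within 2*Eps
            open_circle(idx)
        else:
            pending.append(idx)           # between Eps and 2*Eps: unprocessed
    for idx in pending:                   # step 3.2: unprocessed points
        k, d = nearest_center(points[idx])
        if d <= eps:
            join(k, idx, d)
        else:
            open_circle(idx)
    return centers, radii, members

# Made-up one-dimensional data, given as 1-tuples.
points = [(0.0,), (0.5,), (1.5,), (3.5,), (2.0,)]
centers, radii, members = group_circles(points, eps=1.0)
```

Each point is compared only with the current circle centers instead of with every other point, which is where the saving over an all-pairs neighbor search comes from.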

step 4, clustering the obtained circular regions using a DPC-based algorithm;

the DPC algorithm is based on the following assumptions: the clustering center has a higher density ρ and a larger distance δ from a higher density point;

step 5, after the Group algorithm has represented all sample points by a number of circular regions, the density ρ_i and the distance δ_i of each circle are calculated; circles with both a higher density ρ and a larger distance δ are taken as cluster centers (cluster-center circles), which are selected through a decision graph; once the cluster-center circles are found, each of them is first given a distinct class label, and a density-based assignment is then applied;
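The density-based assignment that follows the choice of cluster-center circles (each remaining circle follows its nearest circle of higher density, as stated in the disclosure) can be sketched as follows. The sketch takes the circle densities `rho` and the chosen center circles `peak_ids` as given inputs, because the density formula is not reproduced in this text and the centers are picked from the decision graph. Measuring the distance between circles from center to center is an assumption, and `assign_circles` is a hypothetical name.

```python
import math

def assign_circles(centers, rho, peak_ids):
    """Label every circle; non-center circles inherit the label of their
    nearest circle of higher density (center-to-center distance, assumed)."""
    order = sorted(range(len(centers)), key=lambda i: -rho[i])  # densest first
    labels = {p: lab for lab, p in enumerate(peak_ids)}
    if order[0] not in labels:
        raise ValueError("the densest circle must be one of the chosen centers")
    for pos, i in enumerate(order):
        if i in labels:
            continue
        # Circles earlier in the order are at least as dense and already labeled.
        denser = order[:pos]
        parent = min(denser, key=lambda j: math.dist(centers[i], centers[j]))
        labels[i] = labels[parent]
    return [labels[i] for i in range(len(centers))]

# Made-up circle centers and densities; circles 0 and 3 are the chosen centers.
circle_centers = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
circle_rho = [5.0, 3.0, 1.0, 4.0, 2.0]
circle_labels = assign_circles(circle_centers, circle_rho, peak_ids=[0, 3])
```

Processing circles in order of decreasing density guarantees that the circle being followed already carries a label, so a single pass suffices. Every sample point then takes the label of the circle it belongs to.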

The effects of the present invention can be further illustrated by the following simulation experiments.

1) Simulation conditions

The experiments were run on Windows 10 with Matlab R2014a (64-bit) as the simulation software; the processor is an Intel(R) Core(TM) i5 and the installed memory is 4.00 GB.

Table 1 shows UCI real data:

TABLE 1

2) Simulation result

The algorithm of the invention and the DPC method were compared experimentally on real UCI data sets. To further verify the performance of the algorithm on real data, experiments were carried out on the 5 common UCI data sets listed in Table 1, and the clustering results were evaluated with the commonly used ACC and F-measure indexes. Both indexes take values in [0,1], and a larger value indicates a better clustering effect. The speed of the algorithm is evaluated by the clustering execution time (t/s); the shorter the time, the better.
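The text does not spell out how ACC is computed. A common reading, assumed here, is the fraction of samples labeled correctly under the best one-to-one mapping between cluster labels and true classes. The sketch below finds that mapping by brute force, which is practical only for a handful of clusters; the name `clustering_acc` and the label lists are illustrative.

```python
from itertools import permutations

def clustering_acc(true_labels, pred_labels):
    """Clustering accuracy under the best one-to-one label mapping (assumed definition)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad with None so surplus clusters can be mapped to "no class".
    slots = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(slots, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, p in zip(true_labels, pred_labels) if mapping[p] == t)
        best = max(best, hits)
    return best / len(true_labels)

# Made-up labels: the cluster ids are swapped and one sample is misplaced.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 0, 0, 0, 0]
acc = clustering_acc(true_labels, pred_labels)
```

The mapping step matters because cluster ids produced by a clustering run are arbitrary, so a direct comparison with the class labels would understate the agreement.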

TABLE 2

As can be seen from Table 2, the method of the invention achieves better results than the DPC algorithm. The execution time is greatly reduced, especially when the data size is large. Compared with the DPC algorithm, the method is better suited to processing large-scale data and has greater value in practical engineering applications.

Details not described in this specification are well known to those skilled in the art.

Claims

1. A density peak clustering method for large-scale tourist profile data that accelerates neighbor search using the Group algorithm, characterized by comprising the following steps:

step 1, inputting a data set X = {x₁,x₂,…,x_n}∈R^D, where x denotes a sample point in the data set, D denotes the sample dimension, and n denotes the number of samples;

step 2, determining an Eps radius parameter, wherein the process is as follows:

DIST_n×n＝{dist(x_i，x_j)，1≤i≤n，1≤j≤n} (1)

where x_i denotes the i-th sample point. The values in each row of DIST_n×n are sorted in ascending order, and DIST_n×m denotes the matrix that keeps, for each of the n data points, its m nearest distance values after sorting, with m = [0.01n]; when the data set contains fewer than 200 samples, m = 2;

2.3 first, calculate the standard deviation σ of each row of DIST_n×m, i.e.:

where the mean used in the formula is the mean of row i of DIST_n×m;

normalizing the weights: thereby giving Eps;

where num_i and r_i are, respectively, the number of data points in the i-th circular region and the radius of that region;

step 5, after the Group algorithm has represented all sample points by a number of circular regions, the density ρ_i and the distance δ_i of each circle are calculated; circles with both a higher density ρ and a larger distance δ are taken as cluster centers, which are selected through a decision graph; once the cluster-center circles are found, each of them is first given a distinct class label, and a density-based assignment is then applied;