CN110728293B - Hierarchical clustering method for tourist destination data - Google Patents

Hierarchical clustering method for tourist destination data

Info

Publication number
CN110728293B
Authority
CN
China
Prior art keywords
cluster
data
clustering
weight
points
Prior art date
Legal status
Active
Application number
CN201910812062.5A
Other languages
Chinese (zh)
Other versions
CN110728293A (en)
Inventor
何熊熊
袁志琴
庄华亮
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201910812062.5A
Publication of CN110728293A
Application granted
Publication of CN110728293B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hierarchical clustering method for tourist destination data, based on region growing and competition and oriented to data spaces of variable-scale density. Unlike conventional methods, it adopts the idea of hierarchical clustering and divides the clustering process into three levels. The first-level clustering partitions the objects into a number of subclasses based on Euclidean distance, using a distance threshold R1, which simplifies the algorithm and reduces complexity. The second level then grows spatial data regions: each cluster center obtained in the first level serves as a growth seed and grows under a growth criterion until a stop condition is reached, solving the problem of clustering data whose density varies in scale. Finally, the weights between cluster centers are calculated based on the competition idea and the density-similarity principle, and the clusters are merged under suitable rules, solving the problem of clustering non-convex data. Compared with other clustering algorithms, the disclosed method improves clustering accuracy while reducing complexity, shows clear advantages in processing massive data, and better meets the requirements of practical engineering applications.

Description

Hierarchical clustering method for tourist destination data
Technical Field
The invention relates to the field of hierarchical clustering, and in particular to a region-growing and competition-based method for improving the clustering of variable-scale density data.
Background
Data mining is a hot research topic in the fields of artificial intelligence and databases, and cluster analysis is an important branch of data mining that is widely applied as a data-analysis tool in many fields. Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. Clustering originates in taxonomy but differs from classification: the classes into which clustering partitions the data are unknown in advance, so clustering is unsupervised. Clustering algorithms are broadly divided into (1) partition-based methods, such as the K-means algorithm; (2) hierarchy-based methods, such as the BIRCH and CURE algorithms; (3) density-based methods, such as the DBSCAN algorithm; (4) grid-based methods; and (5) neural networks and various other clustering methods. Among them, the K-means algorithm is one of the most classical clustering algorithms. As the most widely applied partition-based clustering algorithm, K-means is simple to implement but has three shortcomings: (1) the user must specify the number of clusters k in advance; (2) K-means is not suited to finding non-convex clusters; (3) K-means is very sensitive to noise and outlier data. DBSCAN decides whether to establish a new cluster with an object as its core by checking whether the density of the object's ε-neighborhood is high enough, i.e., whether the number of data points within distance ε exceeds a set threshold, and then merges density-reachable clusters, so that clusters of arbitrary shape can be found in a spatial database with noise; however, DBSCAN is sensitive to its two hard-to-determine parameters, ε and the threshold, and its computational complexity is relatively high.
Disclosure of Invention
Traditional clustering algorithms mostly assume a single scale of spatial density, but real data are often non-convex, with density that varies across scales. Traditional clustering algorithms show various defects on such data; distance-based algorithms such as K-means in particular become more parameter-sensitive and less accurate. Addressing the limitation that multi-scale data are mostly confined in space, the invention provides a new distance-based multi-level clustering algorithm built on multi-level analysis according to actual needs, and solves the clustering of multi-scale density data by multi-level fast non-convex clustering. Being distance-based, the algorithm correspondingly simplifies its complexity and avoids computing densities; it uses seed-region growing in a multi-stage aggregation manner and fuses the results reasonably to complete the clustering of the data. The invention reduces complexity while simplifying the algorithm, benefits the clustering of massive data, and is suitable for the analysis of tourist destination data.
In order to solve the technical problems, the invention adopts the following technical scheme:

A hierarchical clustering method for tourist destination data, based on region growing and competition and oriented to data spaces of variable-scale density, comprises the following steps:

A first stage: the cluster centers are updated by drawing circles with the distance threshold R1, as follows:

Step 1.1: input an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P. Randomly take the i-th data object x_i from X and store it in the set C = {} as the first cluster center point; then randomly take the j-th data object x_j from X and calculate the Euclidean distance d(x_i, x_j) between x_i and x_j by equation (1). If d(x_i, x_j) is less than R1 (R1 is 10% of the spatial size of the data set), the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C; if d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center, so that C = {x_i, x_j};

d(x_i, x_j) = ||x_i - x_j|| = ( Σ_{p=1}^{P} (x_{i,p} - x_{j,p})^2 )^{1/2}   (1)

S = (1 - β)·x_i + β·x_j   (2)

where S in equation (2) is the updated cluster center and β is the weight coefficient (equation (2) appears only as an image in the source; the β-weighted combination shown here is a reconstruction);

Step 1.2: from the data set X (excluding x_i, x_j), randomly take the m-th data object x_m and calculate the Euclidean distance set {d(x_m, C_1), ..., d(x_m, C_n)}, where n is the number of points in the set C; determine the point C_i of the cluster center set closest to x_m, and use the points x_m, C_i to update the cluster center by the rule of step 1.1;

Step 1.3: repeat steps 1.1 and 1.2 until all points in the data X have been traversed, obtaining the updated cluster center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, and the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}};

A second stage: region growing is carried out as follows:

Step 2.1: determine the seed sequence: first traverse all points in the cluster center set C and count the number of points n_i corresponding to the i-th cluster, i = 1, 2, ..., w. If n_i < minC, delete the corresponding cluster center point C_i from C and the corresponding cluster center point set C_i{...} from M, and store the remaining cluster center points in the set D as the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w;

Step 2.2: define the growth criterion and determine the growth stop condition: take the first cluster center C_1 in the seed sequence B and draw a circle with R1 as the radius. Count the number of points n_1 inside the circle; if n_1 > minC, continue to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, judge whether the points entering the circle Q_B1 belong to D, and if so set i = i + 1 and continue to grow;

ΔR = (e^{sm(x)} / 10) · i^2 · 0.03   (3)

where sm(x) is the average distance between the data in the x-th cluster of the set M; the points entering the circle are stored in the corresponding cluster set, giving the updated M;

Step 2.3: the points obtained after each cluster center's region has grown are not treated as growth objects in later passes; the other cluster center points in C are then traversed by the method of step 2.2 to obtain each cluster center point and the data of its corresponding cluster;

A third stage: calculate the relation weights between all cluster centers based on the competition idea, and merge the clusters under suitable rules.

After the data set X has undergone the second-level clustering, suppose that when all cluster centers compete for a data point X_i, the two winners are the cluster centers C_a and C_b, and take d = d(X_i, C_b) / d(X_i, C_a) (this quantity appears only as an image in the source; the ratio of the two winners' distances is shown as a reconstruction). When d lies within a certain range, the cluster M_x containing C_a and the cluster M_y containing C_b are considered to have a relation weight, whose increase criterion is: denote by w_x^y the relation weight between the two small clusters, calculated as in equation (4):

w_x^y ← w_x^y + 1   (4)

where in equation (4) x = min(x, y) and y = max(x, y) (equation (4) appears only as an image in the source; a unit increment per co-win is shown as a reconstruction);

Step 3.1: first, for the data set X = {X_1, ..., X_i, ..., X_N}, traverse in sequence starting from the first data point X_1 and, for each data point, find the two winners C_a and C_b among all cluster centers competing for it. Then judge, by the existence criterion of the relation weight, whether the clusters M_x and M_y corresponding to the two winners have a weight; if a weight exists, increase the relation weight of those clusters according to equation (4) and then traverse the next data point; if no relation weight exists, traverse the next data point directly, until all data have been traversed once in sequence;

After the calculation of the relation weights is completed, the relation weights form the set {w_x^y}, where the subscript x takes values from 1 up to M and the superscript y takes values from x up to M;

Step 3.2: calculate the density similarity between clusters: for the cluster set M obtained by the second-stage clustering, first calculate the intra-cluster density ρ_i of each cluster:

ρ_i = n_i / S_i   (5)

where n_i is the number of points contained in the i-th cluster and S_i is the area size of the i-th cluster, giving ρ = {ρ_1, ..., ρ_i, ..., ρ_d}; then calculate the density difference Sim_x^y between the x-th cluster and the y-th cluster, namely:

Sim_x^y = |ρ_x - ρ_y|   (6)

where the subscript x takes values from 1 up to d and the superscript y takes values from x up to d;

Step 3.3: when the relation weight w_x^y satisfies the link-threshold condition and the density difference Sim_x^y satisfies the density-similarity condition, the clusters M_x and M_y may be combined;

Assume the finally formed cluster set is M_k, where each value of the subscript k corresponds to an independent cluster; initialize the subscript of the finally formed cluster set M_k to k = 1 and the subscript x of the relation weight w_x^y to x = 1;

the subscript x of the relation weight w_x^y runs from 1 up to M, and the superscript y of w_x^y runs from x up to M; entries with x = y are skipped. A relation weight w_x^y that does not satisfy both the link-threshold condition and the density-similarity condition leaves its small clusters unprocessed; for a relation weight w_x^y that satisfies both conditions, if M_x ⊆ M_k or M_y ⊆ M_k, then M_x and M_y are merged into M_k at the same time, M_k = M_k ∪ M_x ∪ M_y; otherwise k = k + 1, and M_x and M_y are merged into a new cluster M_k, M_k = M_x ∪ M_y; elements present in both M_x and M_y are merged into the same item;

Step 3.4: the finally formed cluster set is M_k, k = 1, 2, ..., K; clustering ends.
The region growing of the invention is a process that gradually aggregates a datum or a sub-data-set region into a complete, independent connected region according to a predefined growth rule. For a target region of interest R in the spatial data, with z the seed point found in advance on region R, the data in a certain neighborhood of z that meet the similarity criterion are gradually merged with the seed point z into a seed group for the next stage of growth, and this cyclic growth continues until the growth stop condition is met, completing the process by which the region of interest grows from one seed point into an independent connected region. The similarity criterion can be the distance between data, the density, or other related attributes. A region growing algorithm is therefore generally implemented in three steps: (1) determine the growing seed points; (2) specify the growth criterion; (3) determine the growth stop condition.
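These three steps admit a compact illustration. Below is a minimal Python sketch of generic region growing under the assumptions just stated (a predefined growth criterion and a stop condition); the names grow_region, is_similar, and neighbors are illustrative, not taken from the patent:

```python
from collections import deque

def grow_region(seed, candidates, is_similar, neighbors):
    """Generic region growing: start from one seed point and keep absorbing
    neighboring candidates that satisfy the similarity criterion."""
    region = {seed}
    frontier = deque([seed])            # step (1): the growing seed point
    while frontier:                     # step (3): stop when nothing new grows
        current = frontier.popleft()
        for point in neighbors(current, candidates):
            # step (2): the growth criterion (distance, density, ...)
            if point not in region and is_similar(current, point):
                region.add(point)       # merge into the seed group
                frontier.append(point)  # and grow from it in the next stage
    return region

# toy usage: grow over integers on a line, absorbing steps of size 1
same = lambda a, b: abs(a - b) <= 1
near = lambda c, cand: [p for p in cand if abs(p - c) <= 1]
print(sorted(grow_region(0, set(range(-3, 4)) | {10}, same, near)))
# prints [-3, -2, -1, 0, 1, 2, 3]; the isolated point 10 is never reached
```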
The invention adopts the idea of hierarchical clustering and divides the clustering process into three levels. The first-level clustering divides the objects into a number of subclasses based on the distance threshold R1; the second level grows these subclasses by region growing, performing the second-level clustering on the not-yet-clustered data; finally, the weights between all cluster centers are calculated based on the competition idea and the density-similarity principle, and the clusters are merged under suitable rules.
The beneficial effects of the invention are as follows:
(1) The first-level distance-based clustering simplifies the algorithm and reduces its complexity.
(2) The second level solves the problem of variable-scale density data by means of seed-region growing.
(3) The third-level merging part provides both a relation-weight threshold and a density-similarity threshold, so that the merging of small clusters is more reasonable and doubly safeguarded; this effectively solves the non-convex clustering problem and improves merging accuracy.
(4) Through the reasonable design and fusion of the three-level algorithm, the overall algorithm avoids multi-layer iteration and greatly reduces algorithmic complexity.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention;
FIG. 2 is a flow chart of the first level clustering of the algorithm of the present invention;
FIG. 3 is a flow chart of the second level clustering of the algorithm of the present invention;
FIG. 4 is a flow chart of the third-level clustering of the algorithm of the present invention;
FIG. 5 is the final clustering result of the algorithm of the present invention run on an occlusion data set;
FIG. 6 is the final clustering result of the algorithm of the present invention run on the non-uniform-density data set new.
Detailed Description
For the purpose of illustrating the objects, technical solutions and advantages of the present invention, the present invention will be described in further detail below with reference to specific embodiments and accompanying drawings.
Referring to FIG. 1 to FIG. 6, a hierarchical clustering method based on region growing and competition for a data space of variable-scale density includes the following steps:

A first stage: the cluster centers are updated by drawing circles with the distance threshold R1, as follows:

Step 1.1: input an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P, where x denotes a sample point of the data set, P the sample dimension, and N the number of samples. Randomly take the i-th data object x_i from X and store it in the set C = {} as the first cluster center point; then randomly take the j-th data object x_j from X and calculate the Euclidean distance d(x_i, x_j) between x_i and x_j by equation (1). If d(x_i, x_j) is less than R1 (R1 is 10% of the spatial size of the data set), the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C. If d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center, so that C = {x_i, x_j}.

d(x_i, x_j) = ||x_i - x_j|| = ( Σ_{p=1}^{P} (x_{i,p} - x_{j,p})^2 )^{1/2}   (1)

S = (1 - β)·x_i + β·x_j   (2)

In equation (2), S is the updated cluster center and β is the weight coefficient (β = 1/16); equation (2) appears only as an image in the source, and the β-weighted combination shown here is a reconstruction.

Step 1.2: from the data set X (excluding x_i, x_j), randomly take the m-th data object x_m and calculate the Euclidean distance set {d(x_m, C_1), ..., d(x_m, C_n)}, where n is the number of points in the set C; determine the point C_i of the cluster center set closest to x_m, and use the points x_m, C_i to update the cluster center by the rule of step 1.1.

Step 1.3: repeat steps 1.1 and 1.2 until all points in the data X have been traversed, obtaining the updated cluster center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, and the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}}.
A second stage: region growing is carried out as follows:

Step 2.1: determine the seed sequence: first traverse all points in the cluster center set C and count the number of points n_i corresponding to the i-th cluster, i = 1, 2, ..., w. If n_i < minC (minC is 5% of all samples), no cluster is formed: delete the corresponding cluster center point C_i from C and the corresponding cluster center point set C_i{...} from M, and store the remaining cluster center points in the set D as the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w.

Step 2.2: define the growth criterion and determine the growth stop condition: take the first cluster center C_1 in the seed sequence B and draw a circle with R1 as the radius. Count the number of points n_1 inside the circle; if n_1 > minC, continue to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, judge whether the points entering the circle Q_B1 belong to D, and if so set i = i + 1 and continue to grow.

ΔR = (e^{sm(x)} / 10) · i^2 · 0.03   (3)

sm(x) is the average distance between the data in the x-th cluster of the set M; the points entering the circle are stored in the corresponding cluster set to obtain the updated M.

Step 2.3: the points obtained after each cluster center's region has grown are not treated as growth objects in later passes; the other cluster center points in C are then traversed by the method of step 2.2 to obtain each cluster center point and the data of its corresponding cluster.
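The growth around a single seed can be sketched as follows; this is a minimal Python sketch assuming Euclidean distances and reading equation (3) as ΔR = (e^{sm(x)}/10)·i²·0.03, with sm(x) passed in as a parameter; the names and the exact bookkeeping are illustrative:

```python
import numpy as np

def grow_from_seed(center, points, R1, min_c, sm_x):
    """Step 2.2 around one seed: draw a circle of radius R1, then enlarge it
    by delta_R per iteration, absorbing the points that enter the circle;
    growth stops once no more than min_c new points fall inside."""
    points = np.asarray(points)
    taken = np.zeros(len(points), dtype=bool)  # grown points: not re-processed
    radius, i = R1, 1
    while True:
        dist = np.linalg.norm(points - center, axis=1)
        entering = (dist <= radius) & ~taken
        if entering.sum() <= min_c:            # growth stop condition
            break
        taken |= entering                      # store them in the cluster set
        # equation (3): delta_R = (e^{sm(x)} / 10) * i^2 * 0.03
        radius += (np.exp(sm_x) / 10.0) * (i ** 2) * 0.03
        i += 1
    return np.flatnonzero(taken)               # indices absorbed by this seed

# toy usage: a dense blob around the seed plus a far-away second blob
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.2, (200, 2)), rng.normal(5, 0.2, (50, 2))])
grown = grow_from_seed(np.zeros(2), pts, R1=0.3, min_c=5, sm_x=0.3)
print(len(grown), "points grown around the seed")
```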
A third stage: calculate the relation weights between all cluster centers based on the competition idea, and merge the clusters under suitable rules.

After the data set X has undergone the second-level clustering, suppose that when all cluster centers compete for a data point X_i, the two winners are the cluster centers C_a and C_b, and take d = d(X_i, C_b) / d(X_i, C_a) (this quantity appears only as an image in the source; the ratio of the two winners' distances is shown as a reconstruction). When d lies within a certain range, the cluster M_x containing C_a and the cluster M_y containing C_b are considered to have a relation weight. When d <= 2.5 the algorithm shows better clustering quality, so d <= 2.5 is taken as the existence criterion of the relation weight. Increase criterion of the relation weight: denote by w_x^y the relation weight between the two small clusters, calculated as in equation (4):

w_x^y ← w_x^y + 1   (4)

where in equation (4) x = min(x, y) and y = max(x, y); equation (4) appears only as an image in the source, and the unit increment per co-win shown here is a reconstruction.

Step 3.1: first, for the data set X = {X_1, ..., X_i, ..., X_N}, traverse in sequence starting from the first data point X_1 and, for each data point, find the two winners C_a and C_b among all cluster centers competing for it. Then judge, by the existence criterion of the relation weight, whether the clusters M_x and M_y corresponding to the two winners have a weight: if a weight exists, increase the relation weight of those clusters according to equation (4) and then traverse the next data point; if no relation weight exists, traverse the next data point directly, until all data have been traversed once in sequence.

After the calculation of the relation weights is completed, the relation weights form the set {w_x^y}, where the subscript x takes values from 1 up to M and the superscript y takes values from x up to M.
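A sketch of this competition step follows. Both the quantity d and the update of equation (4) exist only as images in the source, so the sketch assumes d is the ratio of the second-winner distance to the first-winner distance and that each co-win adds 1 to the pair's relation weight:

```python
import numpy as np
from collections import defaultdict

def relation_weights(X, centers, d_max=2.5):
    """Step 3.1: all cluster centers compete for every data point; the two
    nearest centers win. When d = dist(second) / dist(first) <= d_max, the
    winners' clusters gain relation weight (assumed: +1 per co-win)."""
    C = np.asarray(centers)
    w = defaultdict(float)                  # w[(x, y)] with x < y
    for point in np.asarray(X):
        dist = np.linalg.norm(C - point, axis=1)
        a, b = np.argsort(dist)[:2]         # first and second winner
        if dist[a] > 0 and dist[b] / dist[a] <= d_max:
            key = (min(a, b), max(a, b))    # x = min(x, y), y = max(x, y)
            w[key] += 1.0                   # assumed form of equation (4)
    return dict(w)
```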
Step 3.2: calculating the density similarity between each cluster by first clustering the second-level clustersM, calculating the intra-cluster density ρ of each clusteri
ρi=ni/Si (5)
niIs the number of points included in the ith cluster, SiIs the area size of the ith cluster. ρ ═ ρ1,...,ρi,...,ρdAnd calculating a density difference between the x-th cluster and the y-th cluster
Figure BDA00021853481400000814
Namely:
Figure BDA00021853481400000815
subscript x takes on values from 1 up to d, and superscript y takes on values from x up to d.
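Equations (5) and (6) translate directly into code. In the sketch below, the "area size" S_i is approximated by the area of the bounding circle around the cluster mean, since the source does not spell out how S_i is measured; that approximation is an assumption:

```python
import numpy as np

def density_similarity(clusters):
    """Equation (5): rho_i = n_i / S_i; equation (6): Sim = |rho_x - rho_y|.
    S_i is approximated by the bounding-circle area of each cluster."""
    rho = []
    for pts in clusters:                       # each cluster: (n_i, 2) array
        pts = np.asarray(pts, dtype=float)
        r = np.linalg.norm(pts - pts.mean(axis=0), axis=1).max()
        s_i = np.pi * max(r, 1e-9) ** 2        # S_i: area size of the cluster
        rho.append(len(pts) / s_i)             # equation (5)
    rho = np.asarray(rho)
    sim = np.abs(rho[:, None] - rho[None, :])  # equation (6) for every pair
    return rho, sim
```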
Step 3.3: when in use
Figure BDA0002185348140000091
And is
Figure BDA0002185348140000092
In the middle of the time, cluster
Figure BDA0002185348140000093
Hezhou cluster
Figure BDA0002185348140000094
May be combined. (experiments have found that a reasonable sum of the number of all data in two small clusters with link thresholds of about 40% to 50% of the weight of the relevant system is good.Sim represents the difference between the two densities, i.e., smaller is better, and a number less than 1.5 is used.)
Assume that the final cluster set formed is MkWherein each value of the subscript k corresponds to an independent cluster, and a finally formed cluster set M is subjected tokThe subscript of (a) is initialized to k 1,
Figure BDA0002185348140000095
relationship weight
Figure BDA0002185348140000096
The subscript x is initialized to x ═ 1.
Relationship weight
Figure BDA0002185348140000097
Starting from 1 up to M, the subscript x of (1) weights the relationship
Figure BDA0002185348140000098
The superscript y takes values from x up to M when
Figure BDA0002185348140000099
When x is equal to y, let
Figure BDA00021853481400000910
Satisfy the requirement of
Figure BDA00021853481400000911
Relationship weight
Figure BDA00021853481400000912
Not satisfying the condition
Figure BDA00021853481400000913
And is
Figure BDA00021853481400000914
In time, the small clusters are not processed; relationship weight
Figure BDA00021853481400000915
Satisfy the requirement of
Figure BDA00021853481400000916
And is
Figure BDA00021853481400000917
When it is in condition, if
Figure BDA00021853481400000918
Or
Figure BDA00021853481400000919
Then
Figure BDA00021853481400000920
And
Figure BDA00021853481400000921
are simultaneously merged into MkIn (1),
Figure BDA00021853481400000922
otherwise k is k +1, simultaneously
Figure BDA00021853481400000923
And
Figure BDA00021853481400000924
merge into a new cluster MkIn (1),
Figure BDA00021853481400000925
wherein
Figure BDA00021853481400000926
And
Figure BDA00021853481400000927
the same elements present in (a) are combined into the same item;
step 3.4: the cluster center set finally formed is Mk,k=1,2...K。
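The merging rule of steps 3.3 and 3.4 can be sketched with a small union-find, so that clusters sharing elements collapse into the same final cluster M_k. The link threshold lam approximates the 40% to 50% figure given above; the function and variable names are illustrative:

```python
def merge_clusters(members, weights, rho, lam=0.45, sim_max=1.5):
    """Steps 3.3 and 3.4: merge clusters x and y when their relation weight
    reaches lam * (n_x + n_y) and their density difference is below sim_max."""
    parent = list(range(len(members)))      # one union-find root per cluster

    def find(i):                            # path-halving find
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (x, y), w in sorted(weights.items()):
        heavy = w >= lam * (len(members[x]) + len(members[y]))
        similar = abs(rho[x] - rho[y]) < sim_max
        if heavy and similar:
            rx, ry = find(x), find(y)
            if rx != ry:
                parent[ry] = rx             # M_x and M_y join the same M_k
    final = {}                              # root -> merged member set
    for i, m in enumerate(members):
        final.setdefault(find(i), set()).update(m)
    return [sorted(s) for s in final.values()]   # M_k, k = 1, ..., K
```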
The effects of the present invention can be further illustrated by the following simulation experiments.
1) Simulation conditions
The operating system used for the experiments is Windows 10, the simulation software is Matlab R2018b (64-bit), the processor is an Intel(R) Core(TM) i7, and the installed memory is 8.00 GB.
Table 1 lists part of the UCI real data used (the table is present only as an image in the source):

TABLE 1
2) Simulation result
The algorithm of the invention is compared with the DBSCAN algorithm and the K-means algorithm on a UCI data set with scale transformation and on a group of artificial data sets with scale transformation (new). To further verify the performance of the algorithm on real data sets, experiments are carried out with the 4 data sets of Table 1, and the common ACC and F-measure indexes are adopted to evaluate the clustering results; both indexes take values in [0, 1], and larger values indicate a better clustering effect.
TABLE 2 (the table is present only as an image in the source)
As can be seen from Table 2, the method of the present invention achieves better results than the conventional DBSCAN and K-means algorithms, and its complexity is lower than that of DBSCAN in terms of running time, especially when the amount of data is large; it therefore has good practical engineering application value.
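Since the evaluation relies on the ACC index, here is a small sketch of how the common ACC measure can be computed: true and predicted labels are matched by the Hungarian algorithm (SciPy's linear_sum_assignment) so that the best one-to-one mapping of cluster labels to classes is scored. This reflects the usual definition of clustering accuracy; the patent does not prescribe a particular implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """ACC in [0, 1]: fraction of points correctly labeled under the best
    one-to-one matching between predicted clusters and true classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = np.sum((y_pred == c) & (y_true == t))
    rows, cols = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

print(clustering_acc([0, 0, 1, 1, 2], [1, 1, 0, 0, 2]))  # 1.0
```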
Details not described in this specification are well known to those skilled in the art.

Claims (1)

1. A method for hierarchical clustering of tourist destination data, the method comprising the steps of:

a first stage: updating the cluster centers by drawing circles with the distance threshold R1, as follows:

step 1.1: inputting an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P; randomly taking the i-th data object x_i from X and storing it in the set C = {} as the first cluster center point; then randomly taking the j-th data object x_j from X and calculating the Euclidean distance d(x_i, x_j) between x_i and x_j by equation (1); if d(x_i, x_j) is less than R1, R1 being 10% of the spatial size of the data set, the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C; if d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center, so that C = {x_i, x_j};

d(x_i, x_j) = ||x_i - x_j|| = ( Σ_{p=1}^{P} (x_{i,p} - x_{j,p})^2 )^{1/2}   (1)

S = (1 - β)·x_i + β·x_j   (2)

where S in equation (2) is the updated cluster center and β is the weight coefficient (equation (2) appears only as an image in the source; the β-weighted combination shown here is a reconstruction);

step 1.2: from the data set X, never including x_i, x_j, randomly taking the m-th data object x_m and calculating the Euclidean distance set {d(x_m, C_1), ..., d(x_m, C_n)}, where n is the number of points in the set C; determining the point C_i of the cluster center set closest to x_m, and using the points x_m, C_i to update the cluster center by the rule of step 1.1;

step 1.3: repeating steps 1.1 and 1.2 until all points of the data X have been traversed, obtaining the updated cluster center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, with the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}};

a second stage: carrying out region growing as follows:

step 2.1: determining the seed sequence: first traversing all points in the cluster center set C and counting the number of points n_i corresponding to the i-th cluster, i = 1, 2, ..., w; if n_i < minC, deleting the corresponding cluster center point C_i from C and the corresponding cluster center point set C_i{...} from M, and storing the remaining cluster center points in the set D as the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w;

step 2.2: defining the growth criterion and determining the growth stop condition: taking the first cluster center C_1 in the seed sequence B, drawing a circle with R1 as the radius, and counting the number of points n_1 inside the circle; if n_1 > minC, continuing to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, judging whether the points entering the circle Q_B1 belong to D, and if so setting i = i + 1 and continuing to grow;

ΔR = (e^{sm(x)} / 10) · i^2 · 0.03   (3)

where sm(x) is the average distance between the data in the x-th cluster of the set M; the points entering the circle are stored in the corresponding cluster set to obtain the updated M;

step 2.3: the points obtained after each cluster center's region has grown are not treated as growth objects in later passes; the other cluster center points in C are then traversed by the method of step 2.2 to obtain each cluster center point and the data of its corresponding cluster;

a third stage: calculating the relation weights and the density similarity between all cluster centers based on the competition idea, and merging the clusters under suitable rules;

after the data set X has undergone the second-level clustering, supposing that when all cluster centers compete for a data point X_i the two winners are the cluster centers C_a and C_b, and taking d = d(X_i, C_b) / d(X_i, C_a); when d lies within a certain range, the cluster M_x containing C_a and the cluster M_y containing C_b are considered to have a relation weight; the increase criterion of the relation weight is: denoting by w_x^y the relation weight between the two small clusters, calculated as in equation (4):

w_x^y ← w_x^y + 1   (4)

where in equation (4) x = min(x, y) and y = max(x, y) (the quantity d and equation (4) appear only as images in the source; the forms shown here are reconstructions);

step 3.1: first, for the data set X = {X_1, ..., X_i, ..., X_N}, traversing in sequence from the first data point X_1 and, for each data point, finding the two winners C_a and C_b among all cluster centers competing for it; then judging, by the existence criterion of the relation weight, whether the clusters M_x and M_y corresponding to the two winners have a weight; if a weight exists, increasing the relation weight of those clusters according to equation (4) and then traversing the next data point; if no relation weight exists, traversing the next data point directly, until all data have been traversed once in sequence;

after the calculation of the relation weights is completed, the relation weights form the set {w_x^y}, where the subscript x takes values from 1 up to M and the superscript y takes values from x up to M;

step 3.2: calculating the density similarity between clusters: for the cluster set M obtained by the second-stage clustering, first calculating the intra-cluster density ρ_i of each cluster:

ρ_i = n_i / S_i   (5)

where n_i is the number of points contained in the i-th cluster and S_i is the area size of the i-th cluster, giving ρ = {ρ_1, ..., ρ_i, ..., ρ_d}; then calculating the density difference Sim_x^y between the x-th cluster and the y-th cluster, namely:

Sim_x^y = |ρ_x - ρ_y|   (6)

where the subscript x takes values from 1 up to d and the superscript y takes values from x up to d;

step 3.3: when the relation weight w_x^y satisfies the link-threshold condition and the density difference Sim_x^y satisfies the density-similarity condition, the clusters M_x and M_y may be combined;

assuming that the finally formed cluster set is M_k, where each value of the subscript k corresponds to an independent cluster, initializing the subscript of the finally formed cluster set M_k to k = 1 and the subscript x of the relation weight w_x^y to x = 1;

the subscript x of the relation weight w_x^y runs from 1 up to M, and the superscript y of w_x^y runs from x up to M, entries with x = y being skipped; a relation weight w_x^y that does not satisfy both the link-threshold condition and the density-similarity condition leaves its small clusters unprocessed; for a relation weight w_x^y that satisfies both conditions, if M_x ⊆ M_k or M_y ⊆ M_k, then M_x and M_y are merged into M_k at the same time, M_k = M_k ∪ M_x ∪ M_y; otherwise k = k + 1, and M_x and M_y are merged into a new cluster M_k, M_k = M_x ∪ M_y; elements present in both M_x and M_y are merged into the same item;

step 3.4: the finally formed cluster set is M_k, k = 1, 2, ..., K.
CN201910812062.5A 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data Active CN110728293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910812062.5A CN110728293B (en) 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910812062.5A CN110728293B (en) 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data

Publications (2)

Publication Number Publication Date
CN110728293A CN110728293A (en) 2020-01-24
CN110728293B (en) 2021-10-29

Family

ID=69218832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910812062.5A Active CN110728293B (en) 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data

Country Status (1)

Country Link
CN (1) CN110728293B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2002259250A1 (en) * 2001-05-18 2002-12-03 Biowulf Technologies, Llc Model selection for cluster data analysis
US8031914B2 (en) * 2006-10-11 2011-10-04 Hewlett-Packard Development Company, L.P. Face-based image clustering
CN105550744A (en) * 2015-12-06 2016-05-04 北京工业大学 Nerve network clustering method based on iteration
CN106776849B (en) * 2016-11-28 2020-01-10 西安交通大学 Method for quickly searching scenic spots by using pictures and tour guide system

Also Published As

Publication number Publication date
CN110728293A (en) 2020-01-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant