CN110728293B - Hierarchical clustering method for tourist destination data - Google Patents

Hierarchical clustering method for tourist destination data

Info

Publication number
CN110728293B
Authority
CN
China
Prior art keywords
cluster
data
clustering
weight
points
Prior art date
Legal status
Active
Application number
CN201910812062.5A
Other languages
Chinese (zh)
Other versions
CN110728293A (en)
Inventor
何熊熊
袁志琴
庄华亮
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201910812062.5A
Publication of CN110728293A
Application granted
Publication of CN110728293B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hierarchical clustering method for tourist destination data, based on region growing and competition and oriented to data spaces of variable-scale density. Unlike conventional methods, it adopts the idea of hierarchical clustering and divides the clustering process into three levels. The first-level clustering partitions the objects into a number of subclasses based on Euclidean distance, using a distance threshold R1, which simplifies the algorithm and reduces complexity. The second level then grows spatial data regions: each cluster center obtained in the first level serves as a growth seed and grows under a growth criterion until a stop condition is reached, solving the problem of clustering data whose density varies in scale. Finally, the weights between cluster centers are calculated based on the competition idea and the density-similarity principle, and the clusters are merged under suitable rules, solving the problem of clustering non-convex data. Compared with other clustering algorithms, the disclosed method improves clustering accuracy while reducing complexity, shows clear advantages in processing massive data, and better meets the requirements of practical engineering applications.

Description

Hierarchical clustering method for tourist destination data
Technical Field
The invention relates to the field of hierarchical clustering, and in particular to a region-growing and competition-based method for improving the clustering of variable-scale density data.
Background
Data mining is a hot research topic in the fields of artificial intelligence and databases, and cluster analysis is an important branch of data mining that is widely applied as a data-analysis tool in many fields. Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. Clustering originates in taxonomy but differs from classification: the classes into which clustering partitions the data are unknown in advance, so clustering is unsupervised. Clustering algorithms are broadly divided into (1) partition-based methods, such as the K-means algorithm; (2) hierarchy-based methods, such as the BIRCH and CURE algorithms; (3) density-based methods, such as the DBSCAN algorithm; (4) grid-based methods; and (5) neural networks and various other clustering methods. Among them, the K-means algorithm is one of the most classical clustering algorithms. As the most widely applied partition-based clustering algorithm, K-means is simple to implement but has three shortcomings: (1) the user must specify the number of clusters k in advance; (2) K-means is not suited to finding non-convex clusters; (3) K-means is very sensitive to noise and outlier data. DBSCAN decides whether to establish a new cluster with an object as its core by checking whether the density of the object's ε-neighborhood is high enough, i.e., whether the number of data points within distance ε exceeds a set threshold, and then merges density-reachable clusters, so that clusters of arbitrary shape can be found in a spatial database with noise; however, DBSCAN is sensitive to its two hard-to-determine parameters, ε and the threshold, and its computational complexity is relatively high.
Disclosure of Invention
Traditional clustering algorithms mostly assume a single scale of spatial density, but real data are often non-convex, with density that varies across scales. Traditional clustering algorithms show various defects on such data; distance-based algorithms such as K-means in particular become more parameter-sensitive and less accurate. Addressing the limitation that multi-scale data are mostly confined in space, the invention provides a new distance-based multi-level clustering algorithm built on multi-level analysis according to actual needs, and solves the clustering of multi-scale density data by multi-level fast non-convex clustering. Being distance-based, the algorithm correspondingly simplifies its complexity and avoids computing densities; it uses seed-region growing in a multi-stage aggregation manner and fuses the results reasonably to complete the clustering of the data. The invention reduces complexity while simplifying the algorithm, benefits the clustering of massive data, and is suitable for the analysis of tourist destination data.
In order to solve the technical problems, the invention adopts the following technical scheme:

A hierarchical clustering method for tourist destination data, based on region growing and competition and oriented to data spaces of variable-scale density, comprises the following steps:

A first stage: the cluster centers are updated by drawing circles with the distance threshold R1, as follows:

Step 1.1: input an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P. Randomly take the i-th data object x_i from X and store it in the set C = {} as the first cluster center point; then randomly take the j-th data object x_j from X and calculate the Euclidean distance d(x_i, x_j) between x_i and x_j by equation (1). If d(x_i, x_j) is less than R1 (R1 is 10% of the spatial size of the data set), the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C; if d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center, so that C = {x_i, x_j};

d(x_i, x_j) = ||x_i - x_j|| = ( Σ_{p=1}^{P} (x_{i,p} - x_{j,p})^2 )^{1/2}   (1)

S = (1 - β)·x_i + β·x_j   (2)

where S in equation (2) is the updated cluster center and β is the weight coefficient (equation (2) appears only as an image in the source; the β-weighted combination shown here is a reconstruction);

Step 1.2: from the data set X (excluding x_i, x_j), randomly take the m-th data object x_m and calculate the Euclidean distance set {d(x_m, C_1), ..., d(x_m, C_n)}, where n is the number of points in the set C; determine the point C_i of the cluster center set closest to x_m, and use the points x_m, C_i to update the cluster center by the rule of step 1.1;

Step 1.3: repeat steps 1.1 and 1.2 until all points in the data X have been traversed, obtaining the updated cluster center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, and the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}};

A second stage: region growing is carried out as follows:

Step 2.1: determine the seed sequence: first traverse all points in the cluster center set C and count the number of points n_i corresponding to the i-th cluster, i = 1, 2, ..., w. If n_i < minC, delete the corresponding cluster center point C_i from C and the corresponding cluster center point set C_i{...} from M, and store the remaining cluster center points in the set D as the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w;

Step 2.2: define the growth criterion and determine the growth stop condition: take the first cluster center C_1 in the seed sequence B and draw a circle with R1 as the radius. Count the number of points n_1 inside the circle; if n_1 > minC, continue to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, judge whether the points entering the circle Q_B1 belong to D, and if so set i = i + 1 and continue to grow;

ΔR = (e^{sm(x)} / 10) · i^2 · 0.03   (3)

where sm(x) is the average distance between the data in the x-th cluster of the set M; the points entering the circle are stored in the corresponding cluster set, giving the updated M;

Step 2.3: the points obtained after each cluster center's region has grown are not treated as growth objects in later passes; the other cluster center points in C are then traversed by the method of step 2.2 to obtain each cluster center point and the data of its corresponding cluster;

A third stage: calculate the relation weights between all cluster centers based on the competition idea, and merge the clusters under suitable rules.

After the data set X has undergone the second-level clustering, suppose that when all cluster centers compete for a data point X_i, the two winners are the cluster centers C_a and C_b, and take d = d(X_i, C_b) / d(X_i, C_a) (this quantity appears only as an image in the source; the ratio of the two winners' distances is shown as a reconstruction). When d lies within a certain range, the cluster M_x containing C_a and the cluster M_y containing C_b are considered to have a relation weight, whose increase criterion is: denote by w_x^y the relation weight between the two small clusters, calculated as in equation (4):

w_x^y ← w_x^y + 1   (4)

where in equation (4) x = min(x, y) and y = max(x, y) (equation (4) appears only as an image in the source; a unit increment per co-win is shown as a reconstruction);

Step 3.1: first, for the data set X = {X_1, ..., X_i, ..., X_N}, traverse in sequence starting from the first data point X_1 and, for each data point, find the two winners C_a and C_b among all cluster centers competing for it. Then judge, by the existence criterion of the relation weight, whether the clusters M_x and M_y corresponding to the two winners have a weight; if a weight exists, increase the relation weight of those clusters according to equation (4) and then traverse the next data point; if no relation weight exists, traverse the next data point directly, until all data have been traversed once in sequence;

After the calculation of the relation weights is completed, the relation weights form the set {w_x^y}, where the subscript x takes values from 1 up to M and the superscript y takes values from x up to M;

Step 3.2: calculate the density similarity between clusters: for the cluster set M obtained by the second-stage clustering, first calculate the intra-cluster density ρ_i of each cluster:

ρ_i = n_i / S_i   (5)

where n_i is the number of points contained in the i-th cluster and S_i is the area size of the i-th cluster, giving ρ = {ρ_1, ..., ρ_i, ..., ρ_d}; then calculate the density difference Sim_x^y between the x-th cluster and the y-th cluster, namely:

Sim_x^y = |ρ_x - ρ_y|   (6)

where the subscript x takes values from 1 up to d and the superscript y takes values from x up to d;

Step 3.3: when the relation weight w_x^y satisfies the link-threshold condition and the density difference Sim_x^y satisfies the density-similarity condition, the clusters M_x and M_y may be combined;

Assume the finally formed cluster set is M_k, where each value of the subscript k corresponds to an independent cluster; initialize the subscript of the finally formed cluster set M_k to k = 1 and the subscript x of the relation weight w_x^y to x = 1;

the subscript x of the relation weight w_x^y runs from 1 up to M, and the superscript y of w_x^y runs from x up to M; entries with x = y are skipped. A relation weight w_x^y that does not satisfy both the link-threshold condition and the density-similarity condition leaves its small clusters unprocessed; for a relation weight w_x^y that satisfies both conditions, if M_x ⊆ M_k or M_y ⊆ M_k, then M_x and M_y are merged into M_k at the same time, M_k = M_k ∪ M_x ∪ M_y; otherwise k = k + 1, and M_x and M_y are merged into a new cluster M_k, M_k = M_x ∪ M_y; elements present in both M_x and M_y are merged into the same item;

Step 3.4: the finally formed cluster set is M_k, k = 1, 2, ..., K; clustering ends.
The region growing of the invention is a process that gradually aggregates a datum or a sub-data-set region into a complete, independent connected region according to a predefined growth rule. For a target region of interest R in the spatial data, with z the seed point found in advance on region R, the data in a certain neighborhood of z that meet the similarity criterion are gradually merged with the seed point z into a seed group for the next stage of growth, and this cyclic growth continues until the growth stop condition is met, completing the process by which the region of interest grows from one seed point into an independent connected region. The similarity criterion can be the distance between data, the density, or other related attributes. A region growing algorithm is therefore generally implemented in three steps: (1) determine the growing seed points; (2) specify the growth criterion; (3) determine the growth stop condition.
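These three steps admit a compact illustration. Below is a minimal Python sketch of generic region growing under the assumptions just stated (a predefined growth criterion and a stop condition); the names grow_region, is_similar, and neighbors are illustrative, not taken from the patent:

```python
from collections import deque

def grow_region(seed, candidates, is_similar, neighbors):
    """Generic region growing: start from one seed point and keep absorbing
    neighboring candidates that satisfy the similarity criterion."""
    region = {seed}
    frontier = deque([seed])            # step (1): the growing seed point
    while frontier:                     # step (3): stop when nothing new grows
        current = frontier.popleft()
        for point in neighbors(current, candidates):
            # step (2): the growth criterion (distance, density, ...)
            if point not in region and is_similar(current, point):
                region.add(point)       # merge into the seed group
                frontier.append(point)  # and grow from it in the next stage
    return region

# toy usage: grow over integers on a line, absorbing steps of size 1
same = lambda a, b: abs(a - b) <= 1
near = lambda c, cand: [p for p in cand if abs(p - c) <= 1]
print(sorted(grow_region(0, set(range(-3, 4)) | {10}, same, near)))
# prints [-3, -2, -1, 0, 1, 2, 3]; the isolated point 10 is never reached
```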
The invention adopts the idea of hierarchical clustering and divides the clustering process into three levels. The first-level clustering divides the objects into a number of subclasses based on the distance threshold R1; the second level grows these subclasses by region growing, performing the second-level clustering on the not-yet-clustered data; finally, the weights between all cluster centers are calculated based on the competition idea and the density-similarity principle, and the clusters are merged under suitable rules.
The beneficial effects of the invention are as follows:
(1) The first-level distance-based clustering simplifies the algorithm and reduces its complexity.
(2) The second level solves the problem of variable-scale density data by means of seed-region growing.
(3) The third-level merging part provides both a relation-weight threshold and a density-similarity threshold, so that the merging of small clusters is more reasonable and doubly safeguarded; this effectively solves the non-convex clustering problem and improves merging accuracy.
(4) Through the reasonable design and fusion of the three-level algorithm, the overall algorithm avoids multi-layer iteration and greatly reduces algorithmic complexity.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention;
FIG. 2 is a flow chart of the first level clustering of the algorithm of the present invention;
FIG. 3 is a flow chart of the second level clustering of the algorithm of the present invention;
FIG. 4 is a flow chart of the third-level clustering of the algorithm of the present invention;
FIG. 5 is the final clustering result of the algorithm of the present invention run on an occlusion data set;
FIG. 6 is the final clustering result of the algorithm of the present invention run on the non-uniform-density data set new.
Detailed Description
For the purpose of illustrating the objects, technical solutions and advantages of the present invention, the present invention will be described in further detail below with reference to specific embodiments and accompanying drawings.
Referring to FIG. 1 to FIG. 6, a hierarchical clustering method based on region growing and competition for a data space of variable-scale density includes the following steps:

A first stage: the cluster centers are updated by drawing circles with the distance threshold R1, as follows:

Step 1.1: input an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P, where x denotes a sample point of the data set, P the sample dimension, and N the number of samples. Randomly take the i-th data object x_i from X and store it in the set C = {} as the first cluster center point; then randomly take the j-th data object x_j from X and calculate the Euclidean distance d(x_i, x_j) between x_i and x_j by equation (1). If d(x_i, x_j) is less than R1 (R1 is 10% of the spatial size of the data set), the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C. If d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center, so that C = {x_i, x_j}.

d(x_i, x_j) = ||x_i - x_j|| = ( Σ_{p=1}^{P} (x_{i,p} - x_{j,p})^2 )^{1/2}   (1)

S = (1 - β)·x_i + β·x_j   (2)

In equation (2), S is the updated cluster center and β is the weight coefficient (β = 1/16); equation (2) appears only as an image in the source, and the β-weighted combination shown here is a reconstruction.

Step 1.2: from the data set X (excluding x_i, x_j), randomly take the m-th data object x_m and calculate the Euclidean distance set {d(x_m, C_1), ..., d(x_m, C_n)}, where n is the number of points in the set C; determine the point C_i of the cluster center set closest to x_m, and use the points x_m, C_i to update the cluster center by the rule of step 1.1.

Step 1.3: repeat steps 1.1 and 1.2 until all points in the data X have been traversed, obtaining the updated cluster center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, and the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}}.
A second stage: region growing is carried out as follows:

Step 2.1: determine the seed sequence: first traverse all points in the cluster center set C and count the number of points n_i corresponding to the i-th cluster, i = 1, 2, ..., w. If n_i < minC (minC is 5% of all samples), no cluster is formed: delete the corresponding cluster center point C_i from C and the corresponding cluster center point set C_i{...} from M, and store the remaining cluster center points in the set D as the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w.

Step 2.2: define the growth criterion and determine the growth stop condition: take the first cluster center C_1 in the seed sequence B and draw a circle with R1 as the radius. Count the number of points n_1 inside the circle; if n_1 > minC, continue to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, judge whether the points entering the circle Q_B1 belong to D, and if so set i = i + 1 and continue to grow.

ΔR = (e^{sm(x)} / 10) · i^2 · 0.03   (3)

sm(x) is the average distance between the data in the x-th cluster of the set M; the points entering the circle are stored in the corresponding cluster set to obtain the updated M.

Step 2.3: the points obtained after each cluster center's region has grown are not treated as growth objects in later passes; the other cluster center points in C are then traversed by the method of step 2.2 to obtain each cluster center point and the data of its corresponding cluster.
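The growth around a single seed can be sketched as follows; this is a minimal Python sketch assuming Euclidean distances and reading equation (3) as ΔR = (e^{sm(x)}/10)·i²·0.03, with sm(x) passed in as a parameter; the names and the exact bookkeeping are illustrative:

```python
import numpy as np

def grow_from_seed(center, points, R1, min_c, sm_x):
    """Step 2.2 around one seed: draw a circle of radius R1, then enlarge it
    by delta_R per iteration, absorbing the points that enter the circle;
    growth stops once no more than min_c new points fall inside."""
    points = np.asarray(points)
    taken = np.zeros(len(points), dtype=bool)  # grown points: not re-processed
    radius, i = R1, 1
    while True:
        dist = np.linalg.norm(points - center, axis=1)
        entering = (dist <= radius) & ~taken
        if entering.sum() <= min_c:            # growth stop condition
            break
        taken |= entering                      # store them in the cluster set
        # equation (3): delta_R = (e^{sm(x)} / 10) * i^2 * 0.03
        radius += (np.exp(sm_x) / 10.0) * (i ** 2) * 0.03
        i += 1
    return np.flatnonzero(taken)               # indices absorbed by this seed

# toy usage: a dense blob around the seed plus a far-away second blob
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.2, (200, 2)), rng.normal(5, 0.2, (50, 2))])
grown = grow_from_seed(np.zeros(2), pts, R1=0.3, min_c=5, sm_x=0.3)
print(len(grown), "points grown around the seed")
```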
A third stage: calculate the relation weights between all cluster centers based on the competition idea, and merge the clusters under suitable rules.

After the data set X has undergone the second-level clustering, suppose that when all cluster centers compete for a data point X_i, the two winners are the cluster centers C_a and C_b, and take d = d(X_i, C_b) / d(X_i, C_a) (this quantity appears only as an image in the source; the ratio of the two winners' distances is shown as a reconstruction). When d lies within a certain range, the cluster M_x containing C_a and the cluster M_y containing C_b are considered to have a relation weight. When d <= 2.5 the algorithm shows better clustering quality, so d <= 2.5 is taken as the existence criterion of the relation weight. Increase criterion of the relation weight: denote by w_x^y the relation weight between the two small clusters, calculated as in equation (4):

w_x^y ← w_x^y + 1   (4)

where in equation (4) x = min(x, y) and y = max(x, y); equation (4) appears only as an image in the source, and the unit increment per co-win shown here is a reconstruction.

Step 3.1: first, for the data set X = {X_1, ..., X_i, ..., X_N}, traverse in sequence starting from the first data point X_1 and, for each data point, find the two winners C_a and C_b among all cluster centers competing for it. Then judge, by the existence criterion of the relation weight, whether the clusters M_x and M_y corresponding to the two winners have a weight: if a weight exists, increase the relation weight of those clusters according to equation (4) and then traverse the next data point; if no relation weight exists, traverse the next data point directly, until all data have been traversed once in sequence.

After the calculation of the relation weights is completed, the relation weights form the set {w_x^y}, where the subscript x takes values from 1 up to M and the superscript y takes values from x up to M.
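A sketch of this competition step follows. Both the quantity d and the update of equation (4) exist only as images in the source, so the sketch assumes d is the ratio of the second-winner distance to the first-winner distance and that each co-win adds 1 to the pair's relation weight:

```python
import numpy as np
from collections import defaultdict

def relation_weights(X, centers, d_max=2.5):
    """Step 3.1: all cluster centers compete for every data point; the two
    nearest centers win. When d = dist(second) / dist(first) <= d_max, the
    winners' clusters gain relation weight (assumed: +1 per co-win)."""
    C = np.asarray(centers)
    w = defaultdict(float)                  # w[(x, y)] with x < y
    for point in np.asarray(X):
        dist = np.linalg.norm(C - point, axis=1)
        a, b = np.argsort(dist)[:2]         # first and second winner
        if dist[a] > 0 and dist[b] / dist[a] <= d_max:
            key = (min(a, b), max(a, b))    # x = min(x, y), y = max(x, y)
            w[key] += 1.0                   # assumed form of equation (4)
    return dict(w)
```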
Step 3.2: calculating the density similarity between each cluster by first clustering the second-level clustersM, calculating the intra-cluster density ρ of each clusteri
ρi=ni/Si (5)
niIs the number of points included in the ith cluster, SiIs the area size of the ith cluster. ρ ═ ρ1,...,ρi,...,ρdAnd calculating a density difference between the x-th cluster and the y-th cluster
Figure BDA00021853481400000814
Namely:
Figure BDA00021853481400000815
subscript x takes on values from 1 up to d, and superscript y takes on values from x up to d.
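Equations (5) and (6) translate directly into code. In the sketch below, the "area size" S_i is approximated by the area of the bounding circle around the cluster mean, since the source does not spell out how S_i is measured; that approximation is an assumption:

```python
import numpy as np

def density_similarity(clusters):
    """Equation (5): rho_i = n_i / S_i; equation (6): Sim = |rho_x - rho_y|.
    S_i is approximated by the bounding-circle area of each cluster."""
    rho = []
    for pts in clusters:                       # each cluster: (n_i, 2) array
        pts = np.asarray(pts, dtype=float)
        r = np.linalg.norm(pts - pts.mean(axis=0), axis=1).max()
        s_i = np.pi * max(r, 1e-9) ** 2        # S_i: area size of the cluster
        rho.append(len(pts) / s_i)             # equation (5)
    rho = np.asarray(rho)
    sim = np.abs(rho[:, None] - rho[None, :])  # equation (6) for every pair
    return rho, sim
```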
Step 3.3: when in use
Figure BDA0002185348140000091
And is
Figure BDA0002185348140000092
In the middle of the time, cluster
Figure BDA0002185348140000093
Hezhou cluster
Figure BDA0002185348140000094
May be combined. (experiments have found that a reasonable sum of the number of all data in two small clusters with link thresholds of about 40% to 50% of the weight of the relevant system is good.Sim represents the difference between the two densities, i.e., smaller is better, and a number less than 1.5 is used.)
Assume that the final cluster set formed is MkWherein each value of the subscript k corresponds to an independent cluster, and a finally formed cluster set M is subjected tokThe subscript of (a) is initialized to k 1,
Figure BDA0002185348140000095
relationship weight
Figure BDA0002185348140000096
The subscript x is initialized to x ═ 1.
Relationship weight
Figure BDA0002185348140000097
Starting from 1 up to M, the subscript x of (1) weights the relationship
Figure BDA0002185348140000098
The superscript y takes values from x up to M when
Figure BDA0002185348140000099
When x is equal to y, let
Figure BDA00021853481400000910
Satisfy the requirement of
Figure BDA00021853481400000911
Relationship weight
Figure BDA00021853481400000912
Not satisfying the condition
Figure BDA00021853481400000913
And is
Figure BDA00021853481400000914
In time, the small clusters are not processed; relationship weight
Figure BDA00021853481400000915
Satisfy the requirement of
Figure BDA00021853481400000916
And is
Figure BDA00021853481400000917
When it is in condition, if
Figure BDA00021853481400000918
Or
Figure BDA00021853481400000919
Then
Figure BDA00021853481400000920
And
Figure BDA00021853481400000921
are simultaneously merged into MkIn (1),
Figure BDA00021853481400000922
otherwise k is k +1, simultaneously
Figure BDA00021853481400000923
And
Figure BDA00021853481400000924
merge into a new cluster MkIn (1),
Figure BDA00021853481400000925
wherein
Figure BDA00021853481400000926
And
Figure BDA00021853481400000927
the same elements present in (a) are combined into the same item;
step 3.4: the cluster center set finally formed is Mk,k=1,2...K。
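The merging rule of steps 3.3 and 3.4 can be sketched with a small union-find, so that clusters sharing elements collapse into the same final cluster M_k. The link threshold lam approximates the 40% to 50% figure given above; the function and variable names are illustrative:

```python
def merge_clusters(members, weights, rho, lam=0.45, sim_max=1.5):
    """Steps 3.3 and 3.4: merge clusters x and y when their relation weight
    reaches lam * (n_x + n_y) and their density difference is below sim_max."""
    parent = list(range(len(members)))      # one union-find root per cluster

    def find(i):                            # path-halving find
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (x, y), w in sorted(weights.items()):
        heavy = w >= lam * (len(members[x]) + len(members[y]))
        similar = abs(rho[x] - rho[y]) < sim_max
        if heavy and similar:
            rx, ry = find(x), find(y)
            if rx != ry:
                parent[ry] = rx             # M_x and M_y join the same M_k
    final = {}                              # root -> merged member set
    for i, m in enumerate(members):
        final.setdefault(find(i), set()).update(m)
    return [sorted(s) for s in final.values()]   # M_k, k = 1, ..., K
```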
The effects of the present invention can be further illustrated by the following simulation experiments.
1) Simulation conditions
The operating system used for the experiments is Windows 10, the simulation software is Matlab R2018b (64-bit), the processor is an Intel(R) Core(TM) i7, and the installed memory is 8.00 GB.
Table 1 lists part of the UCI real data used (the table is present only as an image in the source):

TABLE 1
2) Simulation result
The algorithm of the invention is compared with the DBSCAN algorithm and the K-means algorithm on a UCI data set with scale transformation and on a group of artificial data sets with scale transformation (new). To further verify the performance of the algorithm on real data sets, experiments are carried out with the 4 data sets of Table 1, and the common ACC and F-measure indexes are adopted to evaluate the clustering results; both indexes take values in [0, 1], and larger values indicate a better clustering effect.
TABLE 2 (the table is present only as an image in the source)
As can be seen from Table 2, the method of the present invention achieves better results than the conventional DBSCAN and K-means algorithms, and its complexity is lower than that of DBSCAN in terms of running time, especially when the amount of data is large; it therefore has good practical engineering application value.
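Since the evaluation relies on the ACC index, here is a small sketch of how the common ACC measure can be computed: true and predicted labels are matched by the Hungarian algorithm (SciPy's linear_sum_assignment) so that the best one-to-one mapping of cluster labels to classes is scored. This reflects the usual definition of clustering accuracy; the patent does not prescribe a particular implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """ACC in [0, 1]: fraction of points correctly labeled under the best
    one-to-one matching between predicted clusters and true classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = np.sum((y_pred == c) & (y_true == t))
    rows, cols = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

print(clustering_acc([0, 0, 1, 1, 2], [1, 1, 0, 0, 2]))  # 1.0
```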
Details not described in this specification are well known to those skilled in the art.

Claims (1)

1. A method for hierarchical clustering of tourist destination data, the method comprising the steps of:

a first stage: updating the cluster centers by drawing circles with the distance threshold R1, as follows:

step 1.1: inputting an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P; randomly taking the i-th data object x_i from X and storing it in the set C = {} as the first cluster center point; then randomly taking the j-th data object x_j from X and calculating the Euclidean distance d(x_i, x_j) between x_i and x_j by equation (1); if d(x_i, x_j) is less than R1, R1 being 10% of the spatial size of the data set, the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C; if d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center, so that C = {x_i, x_j};

d(x_i, x_j) = ||x_i - x_j|| = ( Σ_{p=1}^{P} (x_{i,p} - x_{j,p})^2 )^{1/2}   (1)

S = (1 - β)·x_i + β·x_j   (2)

where S in equation (2) is the updated cluster center and β is the weight coefficient (equation (2) appears only as an image in the source; the β-weighted combination shown here is a reconstruction);

step 1.2: from the data set X, never including x_i, x_j, randomly taking the m-th data object x_m and calculating the Euclidean distance set {d(x_m, C_1), ..., d(x_m, C_n)}, where n is the number of points in the set C; determining the point C_i of the cluster center set closest to x_m, and using the points x_m, C_i to update the cluster center by the rule of step 1.1;

step 1.3: repeating steps 1.1 and 1.2 until all points of the data X have been traversed, obtaining the updated cluster center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, with the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}};

a second stage: carrying out region growing as follows:

step 2.1: determining the seed sequence: first traversing all points in the cluster center set C and counting the number of points n_i corresponding to the i-th cluster, i = 1, 2, ..., w; if n_i < minC, deleting the corresponding cluster center point C_i from C and the corresponding cluster center point set C_i{...} from M, and storing the remaining cluster center points in the set D as the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w;

step 2.2: defining the growth criterion and determining the growth stop condition: taking the first cluster center C_1 in the seed sequence B, drawing a circle with R1 as the radius, and counting the number of points n_1 inside the circle; if n_1 > minC, continuing to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, judging whether the points entering the circle Q_B1 belong to D, and if so setting i = i + 1 and continuing to grow;

ΔR = (e^{sm(x)} / 10) · i^2 · 0.03   (3)

where sm(x) is the average distance between the data in the x-th cluster of the set M; the points entering the circle are stored in the corresponding cluster set to obtain the updated M;

step 2.3: the points obtained after each cluster center's region has grown are not treated as growth objects in later passes; the other cluster center points in C are then traversed by the method of step 2.2 to obtain each cluster center point and the data of its corresponding cluster;

a third stage: calculating the relation weights and the density similarity between all cluster centers based on the competition idea, and merging the clusters under suitable rules;

after the data set X has undergone the second-level clustering, supposing that when all cluster centers compete for a data point X_i the two winners are the cluster centers C_a and C_b, and taking d = d(X_i, C_b) / d(X_i, C_a); when d lies within a certain range, the cluster M_x containing C_a and the cluster M_y containing C_b are considered to have a relation weight; the increase criterion of the relation weight is: denoting by w_x^y the relation weight between the two small clusters, calculated as in equation (4):

w_x^y ← w_x^y + 1   (4)

where in equation (4) x = min(x, y) and y = max(x, y) (the quantity d and equation (4) appear only as images in the source; the forms shown here are reconstructions);

step 3.1: first, for the data set X = {X_1, ..., X_i, ..., X_N}, traversing in sequence from the first data point X_1 and, for each data point, finding the two winners C_a and C_b among all cluster centers competing for it; then judging, by the existence criterion of the relation weight, whether the clusters M_x and M_y corresponding to the two winners have a weight; if a weight exists, increasing the relation weight of those clusters according to equation (4) and then traversing the next data point; if no relation weight exists, traversing the next data point directly, until all data have been traversed once in sequence;

after the calculation of the relation weights is completed, the relation weights form the set {w_x^y}, where the subscript x takes values from 1 up to M and the superscript y takes values from x up to M;

step 3.2: calculating the density similarity between clusters: for the cluster set M obtained by the second-stage clustering, first calculating the intra-cluster density ρ_i of each cluster:

ρ_i = n_i / S_i   (5)

where n_i is the number of points contained in the i-th cluster and S_i is the area size of the i-th cluster, giving ρ = {ρ_1, ..., ρ_i, ..., ρ_d}; then calculating the density difference Sim_x^y between the x-th cluster and the y-th cluster, namely:

Sim_x^y = |ρ_x - ρ_y|   (6)

where the subscript x takes values from 1 up to d and the superscript y takes values from x up to d;

step 3.3: when the relation weight w_x^y satisfies the link-threshold condition and the density difference Sim_x^y satisfies the density-similarity condition, the clusters M_x and M_y may be combined;

assuming that the finally formed cluster set is M_k, where each value of the subscript k corresponds to an independent cluster, initializing the subscript of the finally formed cluster set M_k to k = 1 and the subscript x of the relation weight w_x^y to x = 1;

the subscript x of the relation weight w_x^y runs from 1 up to M, and the superscript y of w_x^y runs from x up to M, entries with x = y being skipped; a relation weight w_x^y that does not satisfy both the link-threshold condition and the density-similarity condition leaves its small clusters unprocessed; for a relation weight w_x^y that satisfies both conditions, if M_x ⊆ M_k or M_y ⊆ M_k, then M_x and M_y are merged into M_k at the same time, M_k = M_k ∪ M_x ∪ M_y; otherwise k = k + 1, and M_x and M_y are merged into a new cluster M_k, M_k = M_x ∪ M_y; elements present in both M_x and M_y are merged into the same item;

step 3.4: the finally formed cluster set is M_k, k = 1, 2, ..., K.
CN201910812062.5A 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data Active CN110728293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910812062.5A CN110728293B (en) 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910812062.5A CN110728293B (en) 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data

Publications (2)

Publication Number Publication Date
CN110728293A CN110728293A (en) 2020-01-24
CN110728293B (en) 2021-10-29

Family

ID=69218832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910812062.5A Active CN110728293B (en) 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data

Country Status (1)

Country Link
CN (1) CN110728293B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2002259250A1 (en) * 2001-05-18 2002-12-03 Biowulf Technologies, Llc Model selection for cluster data analysis
US8031914B2 (en) * 2006-10-11 2011-10-04 Hewlett-Packard Development Company, L.P. Face-based image clustering
CN105550744A (en) * 2015-12-06 2016-05-04 北京工业大学 Nerve network clustering method based on iteration
CN106776849B (en) * 2016-11-28 2020-01-10 西安交通大学 Method for quickly searching scenic spots by using pictures and tour guide system

Also Published As

Publication number Publication date
CN110728293A (en) 2020-01-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant