CN104217015B - Hierarchical clustering method based on mutual shared nearest neighbors - Google Patents

Hierarchical clustering method based on mutual shared nearest neighbors

Publication number: CN104217015B
Application number: CN201410488243.4A
Authority: CN (China)
Prior art keywords: mutual nearest neighbor, sub-cluster, point, matrix, data
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other versions: CN104217015A (Chinese)
Inventors: 周红芳, 王心怡, 刘园, 郭杰, 段文聪, 何馨依, 刘杰, 李锦
Current assignee: Xian University of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Xian University of Technology
Application filed by Xian University of Technology, priority to CN201410488243.4A
Publication of application: CN104217015A
Application granted; publication of grant: CN104217015B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification


Abstract

The invention discloses a hierarchical clustering method based on mutual shared nearest neighbors. First, the nearest-neighbor matrices T1 and T2 of the whole data set D are computed; the mutual nearest-neighbor ranking matrix M is then derived from T1 and T2; local densities are computed from M to obtain a set of sub-clusters; finally, the similarities between sub-clusters are computed and the sub-clusters are agglomerated to produce the final partition. The hierarchical clustering method based on mutual shared nearest neighbors of the invention solves the problem of existing clustering methods based on k-nearest-neighbor graphs, in which points misassigned during graph sparsification and partitioning, while the sub-cluster set is generated, propagate their errors and lower the clustering accuracy.

Description

Hierarchical clustering method based on mutual shared nearest neighbors
Technical field
The invention belongs to the field of data mining within computer science and technology, and relates to a hierarchical clustering method based on mutual shared nearest neighbors.
Background technology
Cluster analysis is an important research topic in data mining. Clustering techniques have been widely applied in fields such as telecommunications, retail, biology, and marketing. Clustering is a form of unsupervised classification: it groups the data points of a data set according to the features of the objects themselves, ensuring that the similarity within a cluster is as large as possible and the dissimilarity between clusters is as large as possible. Existing clustering algorithms are generally divided into: (1) partition-based algorithms, represented by K-means, fuzzy K-means, and K-medoids; (2) hierarchy-based algorithms, represented by QROCK, CURE, and BIRCH; (3) density-based algorithms, represented by DBSCAN and OPTICS; (4) other types of clustering algorithms, such as subspace-based or model-based algorithms.
When clustering algorithms based on k-nearest-neighbor graphs, such as the Chameleon algorithm, produce the sub-cluster set during graph sparsification and partitioning, all or most of the points contained in a sub-cluster belong to the same real cluster. However, the erroneous points that do get included cause the agglomerative hierarchical clustering of the next stage to mix in these mistakes, leading to larger deviations. The Jarvis-Patrick algorithm, based on SNN (shared nearest neighbor) similarity, tends to split a real cluster and to merge clusters that should stay separate. What these two classes of algorithms have in common is that they construct a k-nearest-neighbor similarity graph, or rely on the shared nearest neighbors of the k neighbors; while sparsifying the similarity graph or the k-nearest-neighbor graph, data points may be assigned incorrectly, and the errors are amplified during agglomeration.
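For orientation, the shared-nearest-neighbor (SNN) similarity underlying the Jarvis-Patrick algorithm counts how many neighbors two points share in their k-nearest-neighbor lists. A minimal sketch; the brute-force distance computation, the toy points, and k = 2 are illustrative assumptions rather than details from the patent:

```python
import numpy as np

def knn_lists(X, k):
    """Index lists of the k nearest neighbors of every row of X (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def snn_similarity(nn, i, j):
    """SNN similarity: how many neighbors the k-NN lists of i and j share."""
    return len(set(nn[i]) & set(nn[j]))

# Two tight groups; points inside a group share neighbors across their k-NN lists.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
nn = knn_lists(X, k=2)
print(snn_similarity(nn, 0, 1))   # 1: both lists contain point 2
```

Sparsifying a graph on such counts is exactly where, as noted above, a point on a cluster border can be cut off and the error then amplified during agglomeration.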
Summary of the invention
It is an object of the invention to provide a hierarchical clustering method based on mutual shared nearest neighbors, solving the problem of existing k-nearest-neighbor-graph clustering, in which points misassigned during sparsification and graph partitioning, while the sub-cluster set is generated, propagate their errors and cause low clustering accuracy.
The technical solution adopted by the invention is a hierarchical clustering method based on mutual shared nearest neighbors. Let the data set to be processed be D, the number of clusters be K, the first nearest-neighbor parameter be K1, and the second nearest-neighbor parameter be K2. The method is implemented according to the following steps:
Step 1: compute the nearest-neighbor matrices of data set D for the parameters K1 and K2 respectively, obtaining nearest-neighbor matrix T1 and nearest-neighbor matrix T2;
Step 2: for each data point i in D, look up, for every neighbor point listed in row i of T2, that neighbor's own nearest-neighbor row T1'; if data point i is contained in T1', keep the neighbor in row i of T2, otherwise delete it, obtaining the mutual nearest-neighbor ranking row Mi of data point i; traversing all data points in D gives the mutual nearest-neighbor ranking matrix M;
Step 3: compute the local density Di of every data point i in D from its ranking row Mi, and sort the data points in descending order of Di;
Step 4: take the first K × 10 data points after sorting as sub-cluster center points, and form a sub-cluster from each center point together with the mutual nearest neighbors in its ranking row; each data point not yet divided is assigned to the sub-cluster that appears first among its mutual nearest neighbors, yielding a number of sub-clusters;
Step 5: compute the pairwise similarity of the sub-clusters obtained in step 4, and merge the pair of sub-clusters with the largest similarity;
Step 6: if the number of sub-clusters after merging is greater than K, return to step 5; if it equals K, go to step 7;
Step 7: assign each still-unassigned data point i in D to the sub-cluster nearest to it, obtaining the final partition, the partition result being K sub-clusters.
The invention is further characterized as follows.
The local density Di in step 3 is calculated according to the formula:
Di = count(Mi), 0 < i ≤ n   (1)
where Mi is the ranking row of the i-th data point in the mutual nearest-neighbor ranking matrix M.
In step 5 the pairwise similarity of sub-clusters is calculated as follows.
Given sub-clusters Ci and Cj, 0 < i, j ≤ n, and the mutual nearest-neighbor ranking matrix M, the similarity between the two sub-clusters is:
Similarity(Ci, Cj) = NumNeighborCi(Cj) / CountNeighbor(Ci) + NumNeighborCj(Ci) / CountNeighbor(Cj)   (2)
where NumNeighborCi(Cj) is the number of times points belonging to sub-cluster Cj appear among the mutual nearest neighbors, taken from M, of the points in sub-cluster Ci;
NumNeighborCj(Ci) is the number of times points belonging to sub-cluster Ci appear among the mutual nearest neighbors of the points in sub-cluster Cj;
CountNeighbor(Ci) is the number of distinct sub-clusters to which the mutual nearest neighbors of Ci's points belong;
CountNeighbor(Cj) is the number of distinct sub-clusters to which the mutual nearest neighbors of Cj's points belong.
Assigning an undivided data point to the sub-cluster that appears first among its mutual nearest neighbors, as in step 4, means: if the ranking row Mi of the data point contains a sub-cluster center point, data point i is assigned to that center's sub-cluster; if the ranking row of data point i contains several sub-cluster center points, the point is assigned to the sub-cluster of the center point ranked foremost.
The sub-cluster nearest to an unassigned data point in step 7 is the one, among the K sub-clusters obtained in step 6, with the minimum Euclidean distance to the unassigned data point in data set D.
The beneficial effects of the invention are as follows:
1. Good clustering quality. On the synthetic data sets DB1, DB2, DB3 and the UCI standard data sets Iris, Wine, Soybean, and Unbalanced, the invention shows a clear advantage, obtaining clustering results with higher total purity and lower information entropy.
2. High clustering accuracy. The similarity function of the invention postpones the merging of clusters that would otherwise be merged wrongly to a later moment of the next-step merging, which effectively prevents errors from accumulating and amplifying step by step and yields better clustering accuracy.
Brief description of the drawings
Fig. 1 is a flow chart of the hierarchical clustering method based on mutual shared nearest neighbors of the invention;
Fig. 2 shows the initial state of a data set during clustering by the method of the invention;
Fig. 3 shows the candidate center points formed by selecting the data points with the largest local density values in the data set;
Fig. 4 is the synthetic data set DB1 used in the experiments of the invention;
Fig. 5 is the synthetic data set DB2 used in the experiments;
Fig. 6 is the synthetic data set DB3 used in the experiments;
Fig. 7 is the synthetic data set DB4 used in the experiments;
Fig. 8 shows the clustering result of the method of the invention on synthetic data set DB1;
Fig. 9 shows the clustering result of the method on synthetic data set DB2;
Fig. 10 shows the clustering result of the method on synthetic data set DB3;
Fig. 11 shows the clustering result of the method on synthetic data set DB4.
Embodiment
The invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The definitions used in the invention are as follows:
Definition 1 (mutual nearest-neighbor ranking matrix): the matrix whose rows are built from each data point together with the data points that are its mutual nearest neighbors.
Definition 2 (local density): an expression of how dense the local region around a data point is within the whole data set; its value equals the number of mutual nearest neighbors of the data point.
Definition 3 (purity): the purity of a class cluster Ci and the total purity of a clustering result are
pi = max over j of (mij / mi)
purity = Σ (i = 1 .. K) (mi / m) · pi   (3)
where pi is the purity of class cluster Ci;
mi is the number of data points in class cluster Ci;
mij is the number of points in Ci that belong to class cluster Cj;
m is the total number of data points in the data set;
K is the number of class clusters in data set D.
Definition 4 (entropy of a clustering result):
First, compute the probability that a data point in class cluster Ci belongs to class cluster Cj:
pij = mij / mi   (4)
where mi is the number of data points in class cluster Ci;
mij is the number of points in Ci that belong to class cluster Cj.
Then compute the entropy of each class cluster Ci:
ei = -Σ (j = 1 .. L) pij · log2(pij)   (5)
where L is the number of class clusters.
Finally, compute the entropy of the whole clustering result:
e = Σ (i = 1 .. K) (mi / m) · ei   (6)
where K is the number of clusters, mi is the number of data points in class cluster Ci, and m is the total number of data points in the data set.
Definition 5 (F-measure): the combination of precision and recall.
The precision of class cluster Ci with respect to class cluster Cj:
precision(i, j) = pij   (7)
The recall of class cluster Ci with respect to class cluster Cj:
recall(i, j) = mij / mj   (8)
Thus the F-measure of class cluster Ci with respect to class cluster Cj is:
F(i, j) = 2 · precision(i, j) · recall(i, j) / (precision(i, j) + recall(i, j))   (9)
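The three evaluation functions of Definitions 3 to 5 can be sketched from a contingency matrix whose entry m[i][j] counts the points of produced cluster Ci that belong to true class Cj; the matrix values here are an illustrative assumption:

```python
import numpy as np

def total_purity(m):
    """purity = sum_i (m_i/m) * p_i with p_i = max_j m_ij/m_i, per Definition 3."""
    return float(m.max(axis=1).sum() / m.sum())

def total_entropy(m):
    """e = sum_i (m_i/m) * e_i with e_i = -sum_j p_ij*log2(p_ij), per Definition 4."""
    p = m / m.sum(axis=1, keepdims=True)
    e_i = -np.sum(np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0), axis=1)
    return float(np.sum(m.sum(axis=1) / m.sum() * e_i))

def f_measure(m, i, j):
    """F(i,j) = 2*prec*rec/(prec+rec), prec = m_ij/m_i, rec = m_ij/m_j, per Definition 5."""
    prec = m[i, j] / m[i].sum()
    rec = m[i, j] / m[:, j].sum()
    return 0.0 if prec + rec == 0 else float(2 * prec * rec / (prec + rec))

# Illustrative contingency matrix: rows = produced clusters, columns = true classes.
m = np.array([[9.0, 1.0], [2.0, 8.0]])
print(total_purity(m))       # (9 + 8) / 20 = 0.85
print(round(total_entropy(m), 4))
print(f_measure(m, 0, 0))    # 2 * 0.9 * (9/11) / (0.9 + 9/11) = 6/7
```

Higher purity and lower entropy both indicate that each produced cluster is dominated by a single true class.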
The invention provides a hierarchical clustering method based on mutual shared nearest neighbors. Let the data set to be processed be D, the number of clusters be K, the first nearest-neighbor parameter be K1, and the second nearest-neighbor parameter be K2. As shown in Fig. 1, the method is implemented according to the following steps:
Step 1: compute the nearest-neighbor matrices of data set D for the parameters K1 and K2 respectively, obtaining nearest-neighbor matrix T1 and nearest-neighbor matrix T2;
Step 2: for each data point i in D, look up, for every neighbor point listed in row i of T2, that neighbor's own nearest-neighbor row T1'; if data point i is contained in T1', keep the neighbor in row i of T2, otherwise delete it, obtaining the mutual nearest-neighbor ranking row Mi of data point i; traversing all data points in D gives the mutual nearest-neighbor ranking matrix M;
Step 3: compute the local density Di of every data point i in D from its ranking row Mi, and sort the data points in descending order of Di; the local density Di is calculated according to the formula
Di = count(Mi), 0 < i ≤ n   (1)
where Mi is the ranking row of the i-th data point in the mutual nearest-neighbor ranking matrix M;
Step 4: take the first K × 10 data points after sorting as sub-cluster center points, and form a sub-cluster from each center point together with the mutual nearest neighbors in its ranking row; each data point not yet divided is assigned to the sub-cluster that appears first among its mutual nearest neighbors, yielding a number of sub-clusters. Assigning a point to the sub-cluster that appears first means: if the ranking row Mi of the data point contains a sub-cluster center point, data point i is assigned to that center's sub-cluster; if the ranking row of data point i contains several sub-cluster center points, the point is assigned to the sub-cluster of the center point ranked foremost;
Step 5: compute the pairwise similarity of the sub-clusters obtained in step 4, and merge the pair of sub-clusters with the largest similarity. The pairwise similarity of sub-clusters is calculated as follows. Given sub-clusters Ci and Cj, 0 < i, j ≤ n, and the mutual nearest-neighbor ranking matrix M, the similarity between the two sub-clusters is
Similarity(Ci, Cj) = NumNeighborCi(Cj) / CountNeighbor(Ci) + NumNeighborCj(Ci) / CountNeighbor(Cj)   (2)
where NumNeighborCi(Cj) is the number of times points belonging to sub-cluster Cj appear among the mutual nearest neighbors, taken from M, of the points in sub-cluster Ci;
NumNeighborCj(Ci) is the number of times points belonging to sub-cluster Ci appear among the mutual nearest neighbors of the points in sub-cluster Cj;
CountNeighbor(Ci) is the number of distinct sub-clusters to which the mutual nearest neighbors of Ci's points belong;
CountNeighbor(Cj) is the number of distinct sub-clusters to which the mutual nearest neighbors of Cj's points belong;
Step 6: if the number of sub-clusters after merging is greater than K, return to step 5; if it equals K, go to step 7;
Step 7: assign each still-unassigned data point i in D to the sub-cluster nearest to it, obtaining the final partition, the partition result being K class clusters; the sub-cluster nearest to an unassigned data point is the one, among the K sub-clusters obtained in step 6, with the minimum Euclidean distance to the unassigned point in data set D.
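Read this way, steps 1 to 7 can be sketched end to end as follows. This is a minimal interpretation under stated assumptions, not the patented implementation: distances are computed by brute force, K × 2 candidate centers are used (as in the worked example below rather than the K × 10 of step 4), each point is kept in only one initial sub-cluster, the nearest sub-cluster of step 7 is taken as the one containing the closest member point, and the toy blob data are illustrative.

```python
import numpy as np

def mutual_nn(X, k1, k2):
    """Steps 1-2: mutual nearest-neighbor lists built from T1 (k1-NN) and T2 (k2-NN)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)
    T1, T2 = order[:, :k1], order[:, :k2]
    return [[j for j in T2[i] if i in T1[j]] for i in range(len(X))]

def similarity(a, b, clusters, M, label):
    """Eq. (2): directed neighbor shares normalized by adjacent sub-cluster counts."""
    def num(src, dst):    # NumNeighbor_src(dst)
        return sum(1 for p in clusters[src] for q in M[p] if label.get(q) == dst)
    def adj(src):         # CountNeighbor(src): distinct adjacent sub-clusters
        return len({label[q] for p in clusters[src] for q in M[p]
                    if q in label and label[q] != src})
    s_ab = num(a, b) / adj(a) if adj(a) else 0.0
    s_ba = num(b, a) / adj(b) if adj(b) else 0.0
    return s_ab + s_ba

def msn_cluster(X, K, k1, k2):
    n = len(X)
    M = mutual_nn(X, k1, k2)
    density = [len(m) for m in M]                              # step 3: D_i = count(M_i)
    centers = sorted(range(n), key=lambda i: -density[i])[:K * 2]
    label = {c: c for c in centers}                            # step 4: seed centers
    for c in centers:
        for j in M[c]:
            label.setdefault(j, c)                             # center's mutual NNs join it
    for i in range(n):                                         # first center in M[i] wins
        if i not in label:
            for j in M[i]:
                if j in label and label[j] == j:
                    label[i] = j
                    break
    clusters = {c: {i for i, l in label.items() if l == c} for c in centers}
    while len(clusters) > K:                                   # steps 5-6: agglomerate
        keys = sorted(clusters)
        a, b = max(((x, y) for x in keys for y in keys if x < y),
                   key=lambda p: similarity(p[0], p[1], clusters, M, label))
        clusters[a] |= clusters.pop(b)
        for i in clusters[a]:
            label[i] = a
    for i in range(n):                                         # step 7: leftovers by distance
        if i not in label:
            c0 = min(clusters, key=lambda c: min(np.linalg.norm(X[i] - X[j])
                                                 for j in clusters[c]))
            clusters[c0].add(i)
            label[i] = c0
    return clusters

rng = np.random.default_rng(0)
blobs = [rng.normal(c, 0.3, size=(15, 2)) for c in [(0, 0), (10, 0), (0, 10)]]
X = np.vstack(blobs)
clusters = msn_cluster(X, K=3, k1=3, k2=10)
print(len(clusters), sum(len(s) for s in clusters.values()))   # 3 45
```

On well-separated data the mutual-nearest-neighbor filter keeps every sub-cluster inside one true cluster, so the merge loop mostly sees nonzero similarities only between sub-clusters of the same true cluster.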
Embodiment:
The hierarchical clustering method based on mutual shared nearest neighbors of the invention consists of three key steps: computing the nearest-neighbor matrices, partitioning the matrix, and hierarchical clustering. First the nearest-neighbor matrices T1 and T2 of the whole data set D are computed (the parameters k1 and k2 are input parameters, k2 > k1); the mutual nearest-neighbor ranking matrix M is then computed from T1 and T2; local densities are computed from M to obtain the sub-cluster set; finally the similarities between sub-clusters are computed and the sub-clusters are agglomerated into K class clusters.
First, the mutual nearest-neighbor ranking matrix is calculated, as follows.
Assume the k1 nearest-neighbor matrix of data set D is T1 = [tij], with k1 an input parameter, 0 < i ≤ n, 0 < j ≤ k1, and the k2 nearest-neighbor matrix of D is T2 = [t'ij], with k2 an input parameter, 0 < i ≤ n, 0 < j ≤ k2, k1 < k2. Then the mutual nearest-neighbor ranking matrix is M = [xij], 0 < i ≤ n, 0 < j ≤ k2, where the entry t'ij of T2 is kept only if point i appears among the first k1 columns (i.e. in T1) of the row of the neighbor t'ij, and is deleted otherwise. By controlling the sizes of the parameters k1 and k2 (decreasing k1, increasing k2), larger and denser sub-clusters can be obtained.
Taking the data set X in Fig. 2 as an example, the nearest-neighbor matrices T1 and T2 of X are computed first, as shown in Table 1. Then, based on T2, the nearest neighbors of each data point xi are filtered: if a point among xi's neighbors contains xi in its own T1 neighbors, the neighbor is kept, otherwise it is deleted. For example, the k2 nearest neighbors (k2 = 10) of point 0 are {4, 2, 1, 3, 5, 6, 9, 11, 10, 7}, and the T1 nearest neighbors (k1 = 3) of point 4 are {2, 0, 3}, which contain point 0, so point 4 is kept. Querying each neighbor in turn in the same way finally gives the mutual nearest-neighbor ranking of point 0 as {4, 2}. Traversing all data points yields the final mutual nearest-neighbor ranking matrix, as shown in Table 2.
Table 1: K nearest neighbors (K = 10) of the data set X in Fig. 2
Data point  K nearest neighbors
0 4,2,1,3,5,6,9,11,10,7
1 3,5,4,0,2,6,9,10,7,14
2 4,0,3,1,5,11,6,15,14,9
3 4,1,2,0,5,11,6,14,15,10
4 2,0,3,1,5,6,11,14,9,15
5 6,1,9,10,3,7,8,14,4,0
6 9,7,5,8,10,1,14,3,0,4
7 8,9,6,10,5,1,14,3,0,4
8 9,7,10,6,5,14,1,3,12,15
9 8,10,6,7,5,14,1,3,15,12
10 9,8,6,7,5,14,1,12,15,3
11 13,15,14,12,3,2,4,1,5,0
12 15,14,13,11,10,9,5,8,3,6
13 15,11,12,14,3,10,5,4,2,1
14 15,12,10,5,9,13,11,8,6,3
15 13,12,14,11,10,3,5,1,9,4
Table 2: mutual nearest-neighbor ranking matrix
Data point  Mutual nearest-neighbor list
0 4,2
1 3,5,0
2 4,0,3
3 4,1,2
4 2,0,3,1
5 6,1
6 9,7,5,10
7 8,6
8 9,7,10
9 8,10,6,7,5
10 9,8,14
11 13
12 15,14,13
13 15,11,12
14 15,12,11
15 13,12,14,11
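The filtering that turns Table 1 into Table 2 can be reproduced directly from the rows listed above; T2 holds the full 10-neighbor rows and T1 their first k1 = 3 entries:

```python
# k2 = 10 nearest-neighbor lists of data set X from Table 1.
T2 = {
    0: [4, 2, 1, 3, 5, 6, 9, 11, 10, 7],
    1: [3, 5, 4, 0, 2, 6, 9, 10, 7, 14],
    2: [4, 0, 3, 1, 5, 11, 6, 15, 14, 9],
    3: [4, 1, 2, 0, 5, 11, 6, 14, 15, 10],
    4: [2, 0, 3, 1, 5, 6, 11, 14, 9, 15],
    5: [6, 1, 9, 10, 3, 7, 8, 14, 4, 0],
    6: [9, 7, 5, 8, 10, 1, 14, 3, 0, 4],
    7: [8, 9, 6, 10, 5, 1, 14, 3, 0, 4],
    8: [9, 7, 10, 6, 5, 14, 1, 3, 12, 15],
    9: [8, 10, 6, 7, 5, 14, 1, 3, 15, 12],
    10: [9, 8, 6, 7, 5, 14, 1, 12, 15, 3],
    11: [13, 15, 14, 12, 3, 2, 4, 1, 5, 0],
    12: [15, 14, 13, 11, 10, 9, 5, 8, 3, 6],
    13: [15, 11, 12, 14, 3, 10, 5, 4, 2, 1],
    14: [15, 12, 10, 5, 9, 13, 11, 8, 6, 3],
    15: [13, 12, 14, 11, 10, 3, 5, 1, 9, 4],
}
k1 = 3
T1 = {i: row[:k1] for i, row in T2.items()}   # T1 = first k1 neighbors of each point

# Keep neighbor j of point i only if i is in turn among the first k1 neighbors of j.
M = {i: [j for j in row if i in T1[j]] for i, row in T2.items()}
print(M[0])   # [4, 2], as in Table 2
print(M[9])   # [8, 10, 6, 7, 5], as in Table 2
```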
Second, local densities are computed from the ranking matrix to obtain the sub-cluster set.
Given the mutual nearest-neighbor ranking matrix M, with Mi the ranking row of the i-th data point, the local density is
Di = count(Mi), 0 < i ≤ n   (1)
That is, the more mutual nearest neighbors a data point has, the denser the local region around it is within the whole data set; therefore, when the local density of a data point is very large, the point can be regarded as the center of its neighborhood.
For example, the local density and ranking of each point can be computed from Table 2, with the result shown in Table 3. Here K × 2 points are chosen as candidate cluster centers (K is the number of clusters of the data set). Moreover, as can be seen from Fig. 2, the density of the class composed of points 11-15 is far smaller than that of the two classes in the upper part of the figure, yet its center point 15 is still obtained. This is because the ranking matrix of mutual nearest neighbors does not rely directly on a distance function as its measure; instead it uses the mutual distance-ranking relation as the basis for computing local density, and the local density maxima obtained in this way make it possible to handle clusters of different densities.
Table 3: local densities in descending order
Data point Local density
9 5
4 4
6 4
15 4
1 3
2 3
3 3
8 3
10 3
12 3
13 3
0 2
5 2
7 2
14 2
11 1
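The densities of Table 3 follow directly from the row lengths of Table 2 (ties broken by point index, matching the order in the table):

```python
# Mutual nearest-neighbor lists from Table 2.
M = {
    0: [4, 2], 1: [3, 5, 0], 2: [4, 0, 3], 3: [4, 1, 2],
    4: [2, 0, 3, 1], 5: [6, 1], 6: [9, 7, 5, 10], 7: [8, 6],
    8: [9, 7, 10], 9: [8, 10, 6, 7, 5], 10: [9, 8, 14], 11: [13],
    12: [15, 14, 13], 13: [15, 11, 12], 14: [15, 12, 11], 15: [13, 12, 14, 11],
}

# Local density D_i = count(M_i), then descending sort as in Table 3.
density = {i: len(m) for i, m in M.items()}
ranking = sorted(density, key=lambda i: (-density[i], i))
print([(i, density[i]) for i in ranking[:6]])
# [(9, 5), (4, 4), (6, 4), (15, 4), (1, 3), (2, 3)]
```

With K = 3 the first K × 2 = 6 points of this ranking, {9, 4, 6, 15, 1, 2}, are exactly the candidate centers that label the sub-clusters of Table 4.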
Finally, clustering proceeds according to the class-cluster similarity.
Given sub-clusters Ci and Cj, 0 < i, j ≤ n, and the mutual nearest-neighbor ranking matrix M, the sub-cluster similarity is:
Similarity(Ci, Cj) = NumNeighborCi(Cj) / CountNeighbor(Ci) + NumNeighborCj(Ci) / CountNeighbor(Cj)   (2)
where NumNeighborCi(Cj) is the number of times points belonging to sub-cluster Cj appear among the mutual nearest neighbors, taken from M, of the points in sub-cluster Ci;
NumNeighborCj(Ci) is the number of times points belonging to sub-cluster Ci appear among the mutual nearest neighbors of the points in sub-cluster Cj;
CountNeighbor(Ci) is the number of distinct sub-clusters to which the mutual nearest neighbors of Ci's points belong;
CountNeighbor(Cj) is the number of distinct sub-clusters to which the mutual nearest neighbors of Cj's points belong.
The similarity of class Ci and class Cj thus consists of two parts: one is the similarity of sub-cluster Ci to sub-cluster Cj, the other the similarity of Cj to Ci; that is, the similarity between them is not symmetric. By analogy with interpersonal relations: person A's best friend may be person B, and B alone, while B may have many close friends of whom A is only one; so if friendship were measured numerically, the friendship degrees between the two would not be equal. Accordingly, the following strategy is adopted for merging sub-clusters: merge the pair for which each sub-cluster has the fewest adjacent sub-clusters while the number of shared nearest-neighbor points between the pair is largest.
According to the mutual nearest-neighbor ranking matrix, all data points are assigned to the nearest candidate-center sub-clusters, giving the result shown in Fig. 3 and Table 4. Next, according to the ranking matrix, the number of times each sub-cluster is adjacent to the other sub-clusters is counted, giving the result shown in Table 5. The pairwise similarities of the six sub-clusters are then computed. For example, cluster 1 (clusters are labeled by their center points) has two adjacent clusters, cluster 4 and cluster 9, so similarity(C1 → C4) = 2/2; likewise similarity(C4 → C1) = 2/3 is computed; adding the two gives the similarity of cluster 1 and cluster 4: similarity(C1, C4) = similarity(C1 → C4) + similarity(C4 → C1) ≈ 1.667.
Table 4: sub-clusters obtained after all points are assigned to the candidate-center sub-clusters
Cluster label  Data points in cluster
1 1
2 2
4 4,2,0,3,1
6 6
9 9,8,10,6,7,5
15 15,13,12,14,11
Table 5: adjacency counts of each sub-cluster with the other sub-clusters
Cluster label  Adjacent clusters and occurrence counts
1 4=2,9=1
2 4=3
4 1=2,2=3,9=1
6 9=3
9 1=1,6=4
15 9=1
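The worked similarity value can be checked against the adjacency counts of Table 5; the dictionary encoding of the table is the only assumption here:

```python
from fractions import Fraction

# Adjacency counts from Table 5: adjacent[c][d] = how many times points of
# sub-cluster c have a mutual nearest neighbor belonging to sub-cluster d.
adjacent = {
    1: {4: 2, 9: 1},
    2: {4: 3},
    4: {1: 2, 2: 3, 9: 1},
    6: {9: 3},
    9: {1: 1, 6: 4},
    15: {9: 1},
}

def similarity(cx, cy):
    """Eq. (2): directed shares normalized by the number of adjacent sub-clusters."""
    def directed(a, b):
        n = adjacent[a].get(b, 0)        # NumNeighbor_a(b)
        count = len(adjacent[a])         # CountNeighbor(a)
        return Fraction(n, count) if count else Fraction(0)
    return directed(cx, cy) + directed(cy, cx)

print(float(similarity(1, 4)))   # 2/2 + 2/3 ≈ 1.667
```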
Performance evaluation of the clustering method of the invention:
To verify the effectiveness of the clustering method of the invention, two algorithms were selected for comparison: the graph-based Chameleon method and the Jarvis-Patrick (JP) method. The Chameleon method has a strong ability to discover clusters of arbitrary size and shape; the JP method is good at finding tight clusters of strongly related objects. What both methods have in common with the hierarchical clustering method based on mutual shared nearest neighbors of the invention is that they also need to compute the K nearest neighbors, after which each method obtains its final result by its own computation.
The invention uses four synthetic data sets and six UCI standard data sets to test algorithm performance. The distributions of the four synthetic data sets DB1, DB2, DB3, and DB4 are shown in Fig. 4, Fig. 5, Fig. 6, and Fig. 7 respectively. The six UCI standard data sets are: cpu-with-vendor, glass, iris, soybean, unbalanced, and wine. The attributes of the four synthetic data sets and six UCI standard data sets are listed in Table 6 and Table 7.
Table 6: attributes of the synthetic data sets
Table 7: attributes of the UCI data sets
Comparison of experimental results:
The total purity of the clustering result, the precision-recall composite function F-measure, and the entropy of the clustering result are used here as three evaluation functions to assess the validity of the clustering results; their definitions are given in Definition 3, Definition 4, and Definition 5 above.
Table 8 compares the experimental results of the hierarchical clustering method based on mutual shared nearest neighbors of the invention with those of the Chameleon and JP methods. From Fig. 8, Fig. 9, Fig. 10, Fig. 11, and Table 8 it can be seen that the method of the invention has a clear advantage on the synthetic data sets DB1, DB2, DB3 and on the UCI standard data sets Iris, Wine, Soybean, and Unbalanced. The external cluster-validity indices show that the Chameleon and JP methods obtain very poor results on some UCI data sets; the reason is that those data sets mix in variables with categorical attributes, and with the chosen algorithm parameter settings the clustering results become very poor. On some smaller data sets, such as cpu-with-vendor and Glass, the similarity function of the invention postpones the merging of clusters that would otherwise be merged wrongly to a later moment of the next-step merging, which effectively prevents errors from accumulating and amplifying step by step.
Table 8: comparison of the experimental results of the three methods

Claims (4)

1. A hierarchical clustering method based on mutual shared nearest neighbors, characterized in that the data set to be processed is set to D, the number of clusters is K, the first nearest-neighbor parameter is K1, the second nearest-neighbor parameter is K2, K1 < K2, and the method is implemented according to the following steps:
Step 1: compute the nearest-neighbor matrices of data set D for the parameters K1 and K2 respectively, obtaining nearest-neighbor matrix T1 and nearest-neighbor matrix T2;
Step 2: for each data point i in D, look up, for every neighbor point listed in row i of T2, that neighbor's own nearest-neighbor row T1'; if data point i is contained in T1', keep the neighbor in row i of T2, otherwise delete it, obtaining the mutual nearest-neighbor ranking row Mi of data point i, where Mi refers to the row built from data point i and the data points that are its mutual nearest neighbors; traversing all data points in D gives the mutual nearest-neighbor ranking matrix M;
Step 3: compute the local density Di of every data point i in D from its ranking row Mi, where Di expresses how dense the local region around data point i is within the whole data set, and sort the data points in descending order of Di;
wherein the local density Di is calculated according to the formula
Di = count(Mi), 0 < i ≤ n   (1)
with Mi the ranking row of the i-th data point in the mutual nearest-neighbor ranking matrix M;
Step 4: take the first K × 10 data points after sorting as sub-cluster center points, and form a sub-cluster from each center point together with the group of data points contained in its mutual nearest-neighbor ranking row; each data point not yet divided is assigned to the sub-cluster that appears first among its mutual nearest neighbors, yielding a number of sub-clusters;
Step 5: compute the pairwise similarity of the sub-clusters obtained in step 4, and merge the pair of sub-clusters with the largest similarity;
Step 6: if the number of sub-clusters after merging is greater than K, return to step 5; if it equals K, go to step 7;
Step 7: assign each still-unassigned data point i in D to the sub-cluster nearest to it, obtaining the final partition, the partition result being K class clusters.
2. it is according to claim 1 based on the hierarchy clustering method for sharing arest neighbors each other, it is characterised in that in step 5 The similarity of submanifold between any two is calculated in accordance with the following methods:
Provided with submanifold Cx, submanifold Cy, 0<X, y≤z, arest neighbors ranking matrix M, then:The similarity of submanifold between any two is:
Similarity(Cx, Cy) = NumNeighbor_Cx(Cy) / CountNeighbor(Cx) + NumNeighbor_Cx(Cx) / CountNeighbor(Cy)    (2)
Wherein, NumNeighbor_Cx(Cy) is taken over all nearest neighbors, in the nearest-neighbor ranking matrix M, of the points of subcluster Cx: it is the number of times points belonging to subcluster Cy occur in the nearest-neighbor ranking matrices of those neighbor points;
NumNeighbor_Cx(Cx) is likewise taken over all nearest neighbors in M of the points of subcluster Cx: it is the number of times points belonging to subcluster Cx itself occur in the nearest-neighbor ranking matrices of those neighbor points;
CountNeighbor(Cx) is the number of distinct subclusters to which the nearest neighbors in M of the points of subcluster Cx belong;
CountNeighbor(Cy) is the number of distinct subclusters to which the nearest neighbors in M of the points of subcluster Cy belong.
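Formula (2) can be transcribed directly into code. The sketch below follows the claim text verbatim, including the second numerator NumNeighbor_Cx(Cx), which makes the measure asymmetric in Cx and Cy. `M` (the nearest-neighbor ranking matrix as lists of point ids) and `labels` (mapping each point to its current subcluster) are assumed representations, not part of the claim.

```python
def similarity(Cx, Cy, M, labels):
    """Formula (2) of claim 2, taken literally.  Cx, Cy: sets of point ids;
    M: nearest-neighbor ranking matrix; labels[p]: current subcluster of p."""
    def neighbor_pool(C):
        # all nearest neighbors (in M) of the points of subcluster C
        return {j for p in C for j in M[p]}
    def num_neighbor(C_from, C_target):
        # occurrences of C_target's points inside the neighbor lists
        # of C_from's neighbor pool
        return sum(1 for q in neighbor_pool(C_from) for r in M[q] if r in C_target)
    def count_neighbor(C):
        # number of distinct subclusters touched by C's neighbor pool
        return len({labels[q] for q in neighbor_pool(C)})
    return (num_neighbor(Cx, Cy) / count_neighbor(Cx)
            + num_neighbor(Cx, Cx) / count_neighbor(Cy))
```

A symmetric variant (using NumNeighbor_Cy(Cx) in the second term) would be the more conventional shared-nearest-neighbor similarity; the asymmetric form above is what the claim text states.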
3. The hierarchical clustering method based on mutually shared nearest neighbors according to claim 1, characterized in that, in step 4, assigning an unassigned data point to the subcluster that appears first in its nearest neighbors means: if the nearest-neighbor ranking matrix of the data point contains a subcluster center, the data point is assigned to that center's subcluster; if the nearest-neighbor ranking matrix of the data point contains several subcluster centers, the data point is assigned to the subcluster of the center ranked foremost.
4. The hierarchical clustering method based on mutually shared nearest neighbors according to claim 1, characterized in that, in step 7, the subcluster nearest to an unassigned data point is the one, among the K subclusters obtained in step 6, at minimum Euclidean distance from that data point.
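A sketch of the step-7 assignment follows. The claim fixes only that the Euclidean distance between point and subcluster be minimal; it does not say whether that means distance to a subcluster's center or to its closest member. The distance-to-closest-member (single-link) reading used here is one assumption among several possible ones.

```python
import numpy as np

def assign_remaining(D, labels, K):
    """Step 7 sketch: attach each unassigned point (label -1) to the
    subcluster at minimum Euclidean distance, measured here as the
    distance to the closest member of each subcluster (an assumption)."""
    for i, lab in enumerate(labels):
        if lab != -1:
            continue
        best, best_d = None, np.inf
        for c in range(K):
            members = [j for j, l in enumerate(labels) if l == c]
            d = min(np.linalg.norm(D[i] - D[j]) for j in members)
            if d < best_d:
                best, best_d = c, d
        labels[i] = best
    return labels

# the point at x=5 is marginally closer to the left pair than to the right point
D = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0]])
print(assign_remaining(D, [0, 0, 1, -1], 2))   # → [0, 0, 1, 0]
```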
CN201410488243.4A 2014-09-22 2014-09-22 Hierarchical clustering method based on mutually shared nearest neighbors Expired - Fee Related CN104217015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410488243.4A CN104217015B (en) 2014-09-22 2014-09-22 Hierarchical clustering method based on mutually shared nearest neighbors

Publications (2)

Publication Number Publication Date
CN104217015A CN104217015A (en) 2014-12-17
CN104217015B true CN104217015B (en) 2017-11-03

Family

ID=52098505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410488243.4A Expired - Fee Related CN104217015B (en) 2014-09-22 2014-09-22 Hierarchical clustering method based on mutually shared nearest neighbors

Country Status (1)

Country Link
CN (1) CN104217015B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682052B (en) * 2015-11-11 2021-11-12 恩智浦美国有限公司 Data aggregation using mapping and merging
CN105787520B (en) * 2016-03-25 2019-09-20 中国农业大学 A kind of algorithm of discovery cluster and outlier based on naturally shared nearest-neighbors search
CN106570178B (en) * 2016-11-10 2020-09-29 重庆邮电大学 High-dimensional text data feature selection method based on graph clustering
CN108337226A (en) * 2017-12-19 2018-07-27 中国科学院声学研究所 The detection method and embedded intelligent terminal of embedded intelligent terminal abnormal data
CN108510615A (en) * 2018-04-02 2018-09-07 深圳智达机械技术有限公司 A kind of control system of semiconductor manufacturing facility and technique
CN108596737A (en) * 2018-05-07 2018-09-28 山东师范大学 Non-cluster Centroid distribution method based on e-commerce comment data and device
CN108932528B (en) * 2018-06-08 2021-08-31 哈尔滨工程大学 Similarity measurement and truncation method in chameleon algorithm
CN108765954B (en) * 2018-06-13 2022-05-24 上海应用技术大学 Road traffic safety condition monitoring method based on SNN density ST-OPTIC improved clustering algorithm
CN109871768B (en) * 2019-01-18 2022-04-29 西北工业大学 Hyperspectral optimal waveband selection method based on shared nearest neighbor
CN113850995B (en) * 2021-09-14 2022-12-27 华设设计集团股份有限公司 Event detection method, device and system based on tunnel radar vision data fusion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7478090B2 (en) * 2005-01-14 2009-01-13 Saffron Technology, Inc. Methods, systems and computer program products for analogy detection among entities using reciprocal similarity measures
CN101963995A (en) * 2010-10-25 2011-02-02 哈尔滨工程大学 Image marking method based on characteristic scene

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100191731A1 (en) * 2009-01-23 2010-07-29 Vasile Rus Methods and systems for automatic clustering of defect reports

Also Published As

Publication number Publication date
CN104217015A (en) 2014-12-17

Similar Documents

Publication Publication Date Title
CN104217015B (en) Hierarchical clustering method based on mutually shared nearest neighbors
CN110532436B (en) Cross-social network user identity recognition method based on community structure
CN104462184B (en) A kind of large-scale data abnormality recognition method based on two-way sampling combination
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN106960390A (en) Overlapping community division method based on convergence degree
CN108319987A (en) A kind of filtering based on support vector machines-packaged type combined flow feature selection approach
CN103888541A (en) Method and system for discovering cells fused with topology potential and spectral clustering
CN107368540A (en) The film that multi-model based on user&#39;s self-similarity is combined recommends method
CN109271427A (en) A kind of clustering method based on neighbour&#39;s density and manifold distance
CN114091603A (en) Spatial transcriptome cell clustering and analyzing method
CN107944487B (en) Crop breeding variety recommendation method based on mixed collaborative filtering algorithm
CN107180079A (en) The image search method of index is combined with Hash based on convolutional neural networks and tree
CN111340069A (en) Incomplete data fine modeling and missing value filling method based on alternate learning
CN111581532A (en) Social network friend-making recommendation method and system based on random block
CN104715034A (en) Weighed graph overlapping community discovery method based on central persons
Carbonera et al. An entropy-based subspace clustering algorithm for categorical data
CN103164487B (en) A kind of data clustering method based on density and geological information
Amelio et al. A new evolutionary-based clustering framework for image databases
CN108763283A (en) A kind of unbalanced dataset oversampler method
Peng et al. DE-MC: A membrane clustering algorithm based on differential evolution mechanism
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN107169522A (en) A kind of improvement Fuzzy C means clustering algorithm based on rough set and particle cluster algorithm
CN106354886A (en) Method for screening nearest neighbor by using potential neighbor relation graph in recommendation system
CN106203469A (en) A kind of figure sorting technique based on orderly pattern
CN111444454B (en) Dynamic community division method based on spectrum method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171103

Termination date: 20200922