CN104217015B - Hierarchical clustering method based on mutually shared nearest neighbors - Google Patents
Hierarchical clustering method based on mutually shared nearest neighbors
- Publication number: CN104217015B (application CN201410488243.4A)
- Authority: CN (China)
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/30—Information retrieval of unstructured textual data
          - G06F16/35—Clustering; Classification
        - G06F16/20—Information retrieval of structured data, e.g. relational data
          - G06F16/28—Databases characterised by their database models, e.g. relational or object models
            - G06F16/284—Relational databases
            - G06F16/285—Clustering or classification
Abstract
The invention discloses a hierarchical clustering method based on mutually shared nearest neighbors. First, the nearest-neighbor matrices T1 and T2 of the whole data set D are computed; the mutual nearest-neighbor matrix M is then derived from T1 and T2; local densities are computed from M to obtain a set of sub-clusters; finally, the similarities between sub-clusters are computed and the sub-clusters are agglomerated to obtain the final partition. The method solves the problem of existing clustering based on k-nearest-neighbor graphs, in which points misassigned during sparsification and graph partitioning mislead the generation of the sub-cluster set and lower the clustering accuracy.
Description
Technical field
The invention belongs to the data mining field of computer science and technology, and relates to a hierarchical clustering method based on mutually shared nearest neighbors.
Background technology
Clustering analysis is an important research topic in data mining. Clustering techniques have been widely applied in fields such as telecommunications, retail, biology, and marketing. Clustering is a form of unsupervised classification: it groups the objects of a data set according to the features of the objects themselves, ensuring that similarity within a cluster is as high as possible and dissimilarity between clusters is as high as possible. Existing clustering algorithms fall roughly into four groups: 1. partition-based algorithms, represented by K-means, Fuzzy K-means, and K-medoids; 2. hierarchical algorithms, represented by QROCK, CURE, and BIRCH; 3. density-based algorithms, represented by DBSCAN and OPTICS; 4. other kinds of algorithms, such as subspace-based or model-based clustering.
When a clustering algorithm based on k-nearest-neighbor graphs, such as the Chameleon algorithm, produces its sub-cluster set during sparsification and graph partitioning, all or most of the points contained in a sub-cluster belong to the same real cluster. However, the erroneous points that remain can contaminate the agglomerative hierarchical clustering of the next stage, causing larger deviations. The Jarvis-Patrick algorithm, based on SNN similarity, may split a real cluster or merge clusters that should stay separate. What these two classes of algorithms have in common is that they construct a similarity graph from the k nearest neighbors, or from shared nearest neighbors based on the k neighborhoods; while sparsifying the similarity graph or the k-nearest-neighbor graph, data points may be assigned incorrectly, and these errors are amplified during agglomeration.
The content of the invention
It is an object of the invention to provide a hierarchical clustering method based on mutually shared nearest neighbors, solving the problem that, in existing clustering based on k-nearest-neighbor graphs, points misassigned during sparsification and graph partitioning mislead the generation of the sub-cluster set and lower the clustering accuracy.
The technical solution adopted by the invention is a hierarchical clustering method based on mutually shared nearest neighbors. Let the data set to be processed be D, the number of clusters be K, nearest-neighbor value one be K1, and nearest-neighbor value two be K2. The method is specifically implemented in the following steps:
Step 1, compute the nearest-neighbor matrices of data set D using nearest-neighbor value one K1 and nearest-neighbor value two K2 respectively, obtaining nearest-neighbor matrix T1 and nearest-neighbor matrix T2.
Step 2, for each data point i in D, look up in turn the nearest-neighbor list T1' of each neighbor point in data point i's row of matrix T2. If T1' contains data point i, retain that neighbor in T2; otherwise delete it. This yields the mutual nearest-neighbor list Mi of data point i. Traversing all data points in D gives the mutual nearest-neighbor matrix M.
Step 3, compute from M the local density Di of each data point i in D, and sort the data points by Di in descending order.
Step 4, take the first K×10 data points after sorting as sub-cluster center points, and form each sub-cluster from a center point together with the nearest-neighbor points in that center's mutual nearest-neighbor list. Each as-yet-unassigned data point is assigned to the sub-cluster that occurs first in that point's nearest neighbors, yielding a number of sub-clusters.
Step 5, compute the pairwise similarity of the sub-clusters obtained in step 4, and merge the pair of sub-clusters with maximum similarity.
Step 6, if the number of sub-clusters after merging is greater than K, perform step 5 again; if it equals K, perform step 7.
Step 7, assign each still-unassigned data point i in D to the sub-cluster nearest to it, obtaining the final partition; the partition result is K sub-clusters.
The features of the invention also reside in the following.
The local density Di in step 3 is calculated by the formula
Di = count(Mi), i = 1, ..., n   (1)
where Mi is the mutual nearest-neighbor list of the i-th data point in M.
The pairwise similarity of sub-clusters in step 5 is calculated as follows. Given sub-clusters Ci and Cj, 0 < i, j ≤ n, and the mutual nearest-neighbor matrix M, the similarity of the pair is
Similarity(C_i, C_j) = NumNeighbor_{C_i}(C_j) / CountNeighbor(C_i) + NumNeighbor_{C_j}(C_i) / CountNeighbor(C_j)   (2)
where NumNeighbor_{C_i}(C_j) is the number of times points belonging to sub-cluster Cj occur among the nearest neighbors of all the nearest neighbors, in M, of the points of Ci;
NumNeighbor_{C_j}(C_i) is the number of times points belonging to sub-cluster Ci occur among the nearest neighbors of all the nearest neighbors, in M, of the points of Cj;
CountNeighbor(C_i) is the number of distinct sub-clusters to which the nearest neighbors, in M, of all points of Ci belong;
CountNeighbor(C_j) is the number of distinct sub-clusters to which the nearest neighbors, in M, of all points of Cj belong.
Assigning an undivided data point in step 4 to the sub-cluster that occurs first in the point's nearest neighbors means: if the mutual nearest-neighbor list Mi of the data point contains a sub-cluster center point, data point i is assigned to that sub-cluster; if the mutual nearest-neighbor list of data point i contains several sub-cluster center points, data point i is assigned to the sub-cluster of the center point ranked earliest in the list.
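As an illustration outside the patent text, the assignment rule above can be sketched in Python; the function name and the example values (the center set and the mutual nearest-neighbor list of point 0) are assumptions taken from the worked example later in the description.

```python
# Hypothetical sketch of the step-4 assignment rule: an undivided point is
# assigned to the sub-cluster whose center point appears earliest in the
# point's mutual nearest-neighbor list.
def assign_point(mutual_nn, centers):
    """Return the first sub-cluster center occurring in mutual_nn, or None."""
    for neighbor in mutual_nn:   # lists are ordered by neighbor rank
        if neighbor in centers:
            return neighbor
    return None

# Values assumed from the worked example (Tables 2 and 4):
centers = {9, 4, 6, 15, 1, 2}
print(assign_point([4, 2], centers))   # point 0 goes to the sub-cluster of center 4
```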
The sub-cluster nearest to an unassigned data point in step 7 is the one, among the K sub-clusters obtained in step 6, at minimum Euclidean distance from the unassigned data point in D.
The beneficial effects of the invention are:
1. Good clustering quality. On the synthetic data sets DB1, DB2, DB3 and the UCI standard data sets Iris, Wine, Soybean, and Unbalanced, the invention shows a clear advantage, obtaining clustering results with higher total purity and lower entropy.
2. High clustering accuracy. The similarity function of the invention postpones the merging of wrongly merged clusters to a later stage of the agglomeration, which effectively prevents errors from accumulating and amplifying step by step and yields better clustering accuracy.
Brief description of the drawings
Fig. 1 is a flow chart of the hierarchical clustering method based on mutually shared nearest neighbors of the invention;
Fig. 2 shows the initial state of the data set in the clustering process of the method;
Fig. 3 shows the candidate center points chosen by the method, formed by the data points with the largest local density values;
Fig. 4 is the synthetic data set DB1 used in the experiments of the invention;
Fig. 5 is the synthetic data set DB2 used in the experiments;
Fig. 6 is the synthetic data set DB3 used in the experiments;
Fig. 7 is the synthetic data set DB4 used in the experiments;
Fig. 8 shows the clustering result of the method on synthetic data set DB1;
Fig. 9 shows the clustering result of the method on synthetic data set DB2;
Fig. 10 shows the clustering result of the method on synthetic data set DB3;
Fig. 11 shows the clustering result of the method on synthetic data set DB4.
Embodiment
The invention is described in detail below with reference to the accompanying drawings and the detailed embodiment.
The definitions used in the invention are as follows:
Definition 1: the mutual nearest-neighbor matrix is the matrix whose rows are built from each data point together with the nearest-neighbor data points that are mutually nearest neighbors with it.
Definition 2: the local density expresses how dense the local region around a data point is within the whole data set; its value equals the number of the point's mutual nearest neighbors.
Definition 3: the purity of a cluster Ci is
p_i = max_j(m_ij) / m_i
and the total purity of a clustering result is
Purity = Σ_{i=1}^{K} (m_i / m) · p_i   (3)
where p_i is the purity of cluster Ci; m_i is the number of data points in cluster Ci; m_ij is the number of points of Ci belonging to class Cj; m is the total number of data points; K is the number of clusters in data set D.
Definition 4: the entropy of a clustering result. First, compute the probability that a data point of cluster Ci belongs to class Cj:
p_ij = m_ij / m_i   (4)
where m_i is the number of data points in cluster Ci and m_ij is the number of points of Ci belonging to class Cj. Then compute the entropy of each cluster Ci:
e_i = -Σ_{j=1}^{L} p_ij · log2(p_ij)   (5)
where L is the number of classes. Finally, the entropy of the clustering result is
e = Σ_{i=1}^{K} (m_i / m) · e_i   (6)
where K is the number of clusters, m_i is the number of data points in cluster Ci, and m is the total number of data points in the data set.
Definition 5: the F-measure is the combination of precision and recall. The precision of cluster Ci with respect to class Cj is
precision(i, j) = p_ij   (7)
the recall of cluster Ci with respect to class Cj is
recall(i, j) = m_ij / m_j   (8)
where m_j is the number of data points in class Cj. Thus the F-measure of cluster Ci with respect to class Cj is
F(i, j) = 2 · precision(i, j) · recall(i, j) / (precision(i, j) + recall(i, j))   (9)
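As an illustration outside the patent text, the three evaluation measures of Definitions 3-5 can be sketched in Python; the standard purity, entropy, and F-measure formulas are assumed (the equation images of the original are not reproduced here), and `clusters` is an assumed representation mapping each cluster label to the list of true class labels of its points.

```python
import math

def total_purity(clusters):
    """Definition 3: sum over clusters of (majority-class count) / (total points)."""
    m = sum(len(pts) for pts in clusters.values())
    return sum(max(pts.count(c) for c in set(pts)) for pts in clusters.values()) / m

def entropy(clusters):
    """Definition 4: weighted average of per-cluster entropies -sum p_ij log2 p_ij."""
    m = sum(len(pts) for pts in clusters.values())
    total = 0.0
    for pts in clusters.values():
        probs = [pts.count(c) / len(pts) for c in set(pts)]
        total += (len(pts) / m) * -sum(p * math.log2(p) for p in probs)
    return total

def f_measure(pts, cls, class_size):
    """Definition 5: F-measure of one cluster with respect to one class."""
    mij = pts.count(cls)
    if mij == 0:
        return 0.0
    precision, recall = mij / len(pts), mij / class_size
    return 2 * precision * recall / (precision + recall)

example = {"A": ["x", "x", "y"], "B": ["y", "y", "y"]}
print(total_purity(example))   # (2 + 3) / 6
```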
The invention provides a hierarchical clustering method based on mutually shared nearest neighbors. Let the data set to be processed be D, the number of clusters be K, nearest-neighbor value one be K1, and nearest-neighbor value two be K2. As shown in Fig. 1, the method is specifically implemented in the following steps:
Step 1, compute the nearest-neighbor matrices of data set D using nearest-neighbor value one K1 and nearest-neighbor value two K2 respectively, obtaining nearest-neighbor matrix T1 and nearest-neighbor matrix T2.
Step 2, for each data point i in D, look up in turn the nearest-neighbor list T1' of each neighbor point in data point i's row of matrix T2. If T1' contains data point i, retain that neighbor in T2; otherwise delete it, obtaining the mutual nearest-neighbor list Mi of data point i. Traversing all data points in D gives the mutual nearest-neighbor matrix M.
Step 3, compute from M the local density Di of each data point i in D, and sort the data points by Di in descending order. The local density Di is calculated by the formula
Di = count(Mi), i = 1, ..., n   (1)
where Mi is the mutual nearest-neighbor list of the i-th data point in M.
Step 4, take the first K×10 data points after sorting as sub-cluster center points, and form each sub-cluster from a center point together with the nearest-neighbor points in that center's mutual nearest-neighbor list. Each as-yet-unassigned data point is assigned to the sub-cluster that occurs first in that point's nearest neighbors, yielding a number of sub-clusters. Specifically: if the mutual nearest-neighbor list Mi of the data point contains a sub-cluster center point, data point i is assigned to that sub-cluster; if the mutual nearest-neighbor list of data point i contains several sub-cluster center points, data point i is assigned to the sub-cluster of the center point ranked earliest in the list.
Step 5, compute the pairwise similarity of the sub-clusters obtained in step 4, and merge the pair of sub-clusters with maximum similarity. The pairwise similarity is calculated as follows. Given sub-clusters Ci and Cj, 0 < i, j ≤ n, and the mutual nearest-neighbor matrix M, the similarity of the pair is
Similarity(C_i, C_j) = NumNeighbor_{C_i}(C_j) / CountNeighbor(C_i) + NumNeighbor_{C_j}(C_i) / CountNeighbor(C_j)   (2)
where NumNeighbor_{C_i}(C_j) is the number of times points belonging to sub-cluster Cj occur among the nearest neighbors of all the nearest neighbors, in M, of the points of Ci;
NumNeighbor_{C_j}(C_i) is the number of times points belonging to sub-cluster Ci occur among the nearest neighbors of all the nearest neighbors, in M, of the points of Cj;
CountNeighbor(C_i) is the number of distinct sub-clusters to which the nearest neighbors, in M, of all points of Ci belong;
CountNeighbor(C_j) is the number of distinct sub-clusters to which the nearest neighbors, in M, of all points of Cj belong.
Step 6, if the number of sub-clusters after merging is greater than K, perform step 5 again; if it equals K, perform step 7.
Step 7, assign each still-unassigned data point i in D to the sub-cluster nearest to it, obtaining the final partition of K clusters. The sub-cluster nearest to an unassigned data point is the one, among the K sub-clusters obtained in step 6, at minimum Euclidean distance from the unassigned data point in D.
Embodiment example:
The hierarchical clustering method based on mutually shared nearest neighbors consists of three key steps: computing the nearest-neighborhood matrices, partitioning the matrix, and hierarchical clustering. First, the nearest-neighbor matrices T1 and T2 of the whole data set D are computed (the parameters k1 and k2 are input parameters, with k2 > k1); the mutual nearest-neighbor matrix M is computed from T1 and T2; local densities are computed from M to obtain the sub-cluster set; finally, the similarities between sub-clusters are computed and the sub-clusters are agglomerated into K clusters.
First, the nearest-neighbor matrices are computed as follows. Suppose the k1-nearest-neighbor matrix of D is T1 = [t_ij], where k1 is an input parameter, 0 < i ≤ n, 0 < j ≤ k1; and the k2-nearest-neighbor matrix of D is T2 = [s_ij], where k2 is an input parameter, 0 < i ≤ n, 0 < j ≤ k2, and k1 < k2. Then the mutual nearest-neighbor matrix is M = [x_ij], 0 < i ≤ n, 0 < j ≤ k2, where a neighbor s_ij of point i is kept in row i of M only if point i appears among the first k1 neighbors of s_ij, that is, in row s_ij of T1. By controlling the sizes of the parameters k1 and k2 (reducing k1, increasing k2), larger and denser sub-clusters can be obtained.
Taking the data set X in Fig. 2 as an example, the nearest-neighbor matrices T1 and T2 of X are computed first, as shown in Table 1. Then, based on T2, the nearest neighbors of each data point xi are filtered: if a point among the nearest neighbors contains xi in its T1 nearest neighbors, that neighbor point is retained; otherwise it is deleted. For example, the k2 nearest neighbors (k2 = 10) of point 0 are {4, 2, 1, 3, 5, 6, 9, 11, 10, 7}, and the T1 nearest neighbors (k1 = 3) of point 4 are {2, 0, 3}; since they contain point 0, point 4 is retained. Querying the neighbors one by one in this way finally gives the mutual nearest-neighbor list {4, 2} of point 0. Traversing all data points yields the final mutual nearest-neighbor matrix, shown in Table 2.
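The filtering just described can be sketched in Python (an illustration, not part of the patent text); the k2 = 10 neighbor lists below are copied from Table 1, and T1 is simply the first k1 = 3 columns of T2.

```python
# Mutual nearest-neighbor filtering (steps 1-2): keep neighbor j of point i
# only if i appears among j's k1 nearest neighbors.
T2 = {
    0: [4, 2, 1, 3, 5, 6, 9, 11, 10, 7],
    1: [3, 5, 4, 0, 2, 6, 9, 10, 7, 14],
    2: [4, 0, 3, 1, 5, 11, 6, 15, 14, 9],
    3: [4, 1, 2, 0, 5, 11, 6, 14, 15, 10],
    4: [2, 0, 3, 1, 5, 6, 11, 14, 9, 15],
    5: [6, 1, 9, 10, 3, 7, 8, 14, 4, 0],
    6: [9, 7, 5, 8, 10, 1, 14, 3, 0, 4],
    7: [8, 9, 6, 10, 5, 1, 14, 3, 0, 4],
    8: [9, 7, 10, 6, 5, 14, 1, 3, 12, 15],
    9: [8, 10, 6, 7, 5, 14, 1, 3, 15, 12],
    10: [9, 8, 6, 7, 5, 14, 1, 12, 15, 3],
    11: [13, 15, 14, 12, 3, 2, 4, 1, 5, 0],
    12: [15, 14, 13, 11, 10, 9, 5, 8, 3, 6],
    13: [15, 11, 12, 14, 3, 10, 5, 4, 2, 1],
    14: [15, 12, 10, 5, 9, 13, 11, 8, 6, 3],
    15: [13, 12, 14, 11, 10, 3, 5, 1, 9, 4],
}
k1 = 3
T1 = {i: nbrs[:k1] for i, nbrs in T2.items()}   # first k1 columns of T2

M = {i: [j for j in nbrs if i in T1[j]] for i, nbrs in T2.items()}
print(M[0])   # -> [4, 2], as in Table 2
```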
Table 1. The K nearest neighbors (K = 10) of data set X in Fig. 2
Data point | K nearest neighbors |
0 | 4,2,1,3,5,6,9,11,10,7 |
1 | 3,5,4,0,2,6,9,10,7,14 |
2 | 4,0,3,1,5,11,6,15,14,9 |
3 | 4,1,2,0,5,11,6,14,15,10 |
4 | 2,0,3,1,5,6,11,14,9,15 |
5 | 6,1,9,10,3,7,8,14,4,0 |
6 | 9,7,5,8,10,1,14,3,0,4 |
7 | 8,9,6,10,5,1,14,3,0,4 |
8 | 9,7,10,6,5,14,1,3,12,15 |
9 | 8,10,6,7,5,14,1,3,15,12 |
10 | 9,8,6,7,5,14,1,12,15,3 |
11 | 13,15,14,12,3,2,4,1,5,0 |
12 | 15,14,13,11,10,9,5,8,3,6 |
13 | 15,11,12,14,3,10,5,4,2,1 |
14 | 15,12,10,5,9,13,11,8,6,3 |
15 | 13,12,14,11,10,3,5,1,9,4 |
Table 2. The mutual nearest-neighbor matrix
Data point | Mutual nearest-neighbor list |
0 | 4,2 |
1 | 3,5,0 |
2 | 4,0,3 |
3 | 4,1,2 |
4 | 2,0,3,1 |
5 | 6,1 |
6 | 9,7,5,10 |
7 | 8,6 |
8 | 9,7,10 |
9 | 8,10,6,7,5 |
10 | 9,8,14 |
11 | 13 |
12 | 15,14,13 |
13 | 15,11,12 |
14 | 15,12,11 |
15 | 13,12,14,11 |
Secondly, the local density is computed from the mutual nearest-neighbor matrix to obtain the sub-cluster set.
Given the mutual nearest-neighbor matrix M, where Mi denotes the mutual nearest-neighbor list of the i-th data point, the local density is
Di = count(Mi), i = 1, ..., n   (1)
That is, the more mutual nearest neighbors a data point has, the denser the local region around that point is within the whole data set. Therefore, when the local density of a data point is very large, the point is regarded as the center of its neighborhood.
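As a sketch outside the patent text, formula (1) and the candidate-center selection can be reproduced from the mutual nearest-neighbor lists of Table 2; K = 3 is an assumption read from the example data set of Fig. 2.

```python
# Local density per formula (1) is the length of each point's mutual
# nearest-neighbor list; candidate centers are the K*2 densest points.
M = {
    0: [4, 2], 1: [3, 5, 0], 2: [4, 0, 3], 3: [4, 1, 2], 4: [2, 0, 3, 1],
    5: [6, 1], 6: [9, 7, 5, 10], 7: [8, 6], 8: [9, 7, 10],
    9: [8, 10, 6, 7, 5], 10: [9, 8, 14], 11: [13], 12: [15, 14, 13],
    13: [15, 11, 12], 14: [15, 12, 11], 15: [13, 12, 14, 11],
}
density = {i: len(neighbors) for i, neighbors in M.items()}   # formula (1)
ranked = sorted(M, key=density.get, reverse=True)             # descending density
K = 3
centers = ranked[:K * 2]
print(centers)   # -> [9, 4, 6, 15, 1, 2], matching Table 3 and Table 4
```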
For example, the local density and ranking of each point can be computed from Table 2, with the result shown in Table 3. Here, K*2 points are chosen as candidate cluster centers (K is the number of clusters of the data set). Moreover, as can be seen from Fig. 2, the density of the class composed of points 11-15 is far smaller than that of the two classes above it, yet its center point 15 is still obtained. This is because the mutual nearest-neighbor matrix does not rely directly on a distance function as its measure, but uses the mutual ranking relation as the basis for computing the local density; the local density maxima thus obtained enable the method to handle clusters of different densities.
Table 3. Ranking of the local densities
Data point | Local density |
9 | 5 |
4 | 4 |
6 | 4 |
15 | 4 |
1 | 3 |
2 | 3 |
3 | 3 |
8 | 3 |
10 | 3 |
12 | 3 |
13 | 3 |
0 | 2 |
5 | 2 |
7 | 2 |
14 | 2 |
11 | 1 |
Finally, clustering proceeds according to the cluster similarity.
Given sub-clusters Ci and Cj, 0 < i, j ≤ n, and the mutual nearest-neighbor matrix M, the sub-cluster similarity is
Similarity(C_i, C_j) = NumNeighbor_{C_i}(C_j) / CountNeighbor(C_i) + NumNeighbor_{C_j}(C_i) / CountNeighbor(C_j)   (2)
where NumNeighbor_{C_i}(C_j) is the number of times points belonging to sub-cluster Cj occur among the nearest neighbors of all the nearest neighbors, in M, of the points of Ci;
NumNeighbor_{C_j}(C_i) is the number of times points belonging to sub-cluster Ci occur among the nearest neighbors of all the nearest neighbors, in M, of the points of Cj;
CountNeighbor(C_i) is the number of distinct sub-clusters to which the nearest neighbors, in M, of all points of Ci belong;
CountNeighbor(C_j) is the number of distinct sub-clusters to which the nearest neighbors, in M, of all points of Cj belong.
The similarity of class Ci and class Cj thus consists of two parts: one part is the similarity of sub-cluster Ci to sub-cluster Cj, the other the similarity of Cj to Ci; that is, the similarity between them is not symmetric. By analogy with human relationships: person A's best friend is person B, and there is only one such friend, whereas B may have many close friends, of whom A is just one; if friendship were measured numerically, the friendship degrees between them would not be equal. Accordingly, for merging sub-clusters we adopt the following strategy: merge the pair for which the number of sub-clusters adjacent to each sub-cluster is smallest and the number of nearest-neighbor points shared between the pair is largest.
As shown in Fig. 3, all data points are then assigned to the nearest candidate-center sub-clusters according to the mutual nearest-neighbor matrix, with the result shown in Table 4. Next, according to the mutual nearest-neighbor matrix, the number of times each sub-cluster adjoins the other sub-clusters is counted, with the result shown in Table 5. The pairwise similarities of the 6 sub-clusters are then computed. For example, cluster 1 has two adjacent clusters (each cluster is denoted by the number of its center point), namely cluster 4 and cluster 9; this gives similarity(C1 → C4) = 2/2 and, likewise, similarity(C4 → C1) = 2/3; adding the two gives the similarity of cluster 1 and cluster 4:
similarity(C1, C4) = similarity(C1 → C4) + similarity(C4 → C1) ≈ 1.667
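The worked example can be checked with a short Python sketch (an illustration, not part of the patent text); the adjacency counts below are those of Table 5, with cluster labels denoting center points.

```python
# Sub-cluster similarity of formula (2): NumNeighbor_Ca(Cb) is the adjacency
# count of b in a's row, and CountNeighbor(Ca) is the number of distinct
# sub-clusters adjacent to a.
adjacency = {
    1: {4: 2, 9: 1},
    2: {4: 3},
    4: {1: 2, 2: 3, 9: 1},
    6: {9: 3},
    9: {1: 1, 6: 4},
    15: {9: 1},
}

def one_way(a, b):
    # NumNeighbor_Ca(Cb) / CountNeighbor(Ca)
    return adjacency[a].get(b, 0) / len(adjacency[a])

def similarity(a, b):
    return one_way(a, b) + one_way(b, a)

print(round(similarity(1, 4), 3))   # -> 1.667, as in the description
```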
Table 4. Sub-clusters obtained after all points are assigned to the candidate-center sub-clusters
Cluster label | Data points in cluster |
1 | 1 |
2 | 2 |
4 | 4,2,0,3,1 |
6 | 6 |
9 | 9,8,10,6,7,5 |
15 | 15,13,12,14,11 |
Table 5. Sub-clusters and the counts of their adjacent sub-clusters
Cluster label | Adjacent clusters and occurrence counts |
1 | 4=2,9=1 |
2 | 4=3 |
4 | 1=2,2=3,9=1 |
6 | 9=3 |
9 | 1=1,6=4 |
15 | 9=1 |
Performance evaluation of the clustering method of the invention:
To verify the validity of the clustering method, two algorithms are selected for comparison: the graph-based Chameleon method and the Jarvis-Patrick (JP) method. The Chameleon method has a strong ability to discover clusters of arbitrary size and shape. The JP method is good at finding compact clusters of strongly correlated objects. What both methods have in common with the hierarchical clustering method of the invention is that they all need to compute the K nearest neighbors, after which each method computes its final result in its own way.
The invention uses four synthetic data sets and six UCI standard data sets to test algorithm performance. The four synthetic data sets DB1, DB2, DB3, and DB4 have the data distributions shown in Fig. 4, Fig. 5, Fig. 6, and Fig. 7 respectively. The six UCI standard data sets are: cpu-with-vendor, glass, iris, soybean, unbalanced, and wine. The attributes of the four synthetic data sets and the six UCI standard data sets are shown in Table 6 and Table 7.
Table 6. Attributes of the synthetic data sets
Table 7. Attributes of the UCI data sets
Comparison of experimental results:
The total purity of the clustering result (Purity), the composite function of precision and recall (F-measure), and the entropy of the clustering result (Entropy) are used here to evaluate the validity of the clustering results; the three evaluation functions are as specified in Definitions 3, 4, and 5 above.
Table 8 gives the experimental results comparing the hierarchical clustering method of the invention with the Chameleon and JP methods. From Fig. 8, Fig. 9, Fig. 10, Fig. 11, and Table 8 it can be seen that the method of the invention shows a clear advantage on the synthetic data sets DB1, DB2, DB3 and the UCI standard data sets Iris, Wine, Soybean, and Unbalanced. The external cluster-validity indices show that the Chameleon and JP methods perform very badly on some UCI data sets; the reason is that these data sets mix in variables with categorical attributes, which, together with the algorithm parameter settings, makes their clustering results extremely poor. For some smaller data sets, such as Cpu-with-vendor and Glass, the similarity function of the invention postpones the merging of wrongly merged clusters to a later stage of the agglomeration, which effectively prevents errors from accumulating and amplifying step by step.
Table 8. Comparison of the experimental results of the three methods
Claims (4)
1. A hierarchical clustering method based on mutually shared nearest neighbors, characterised in that the data set to be processed is set to D, the number of clusters is K, nearest-neighbor value one is K1, nearest-neighbor value two is K2, with K1 < K2, and the method is specifically implemented in the following steps:
Step 1, computing the nearest-neighbor matrices of data set D using nearest-neighbor value one K1 and nearest-neighbor value two K2 respectively, obtaining nearest-neighbor matrix T1 and nearest-neighbor matrix T2;
Step 2, for each data point i in D, looking up in turn the nearest-neighbor list T1' of each neighbor point in data point i's row of matrix T2; if T1' contains data point i, retaining that neighbor in T2, otherwise deleting it, which yields the mutual nearest-neighbor list Mi of data point i, where Mi is the row built from data point i and the nearest-neighbor data points that are mutually nearest neighbors with it; traversing all data points in D to obtain the mutual nearest-neighbor matrix M;
Step 3, computing from M the local density Di of each data point i in D, where Di expresses how dense the local region around data point i is within the whole data set, and sorting the data points by Di in descending order;
wherein the local density Di is calculated by the formula
Di = count(Mi), i = 1, ..., n   (1)
where Mi is the mutual nearest-neighbor list of the i-th data point in M;
Step 4, taking the first K×10 data points after sorting as sub-cluster center points, and forming each sub-cluster from a center point together with the group of data points contained in that center's mutual nearest-neighbor list; assigning each as-yet-unassigned data point to the sub-cluster that occurs first in that point's nearest neighbors, obtaining a number of sub-clusters;
Step 5, computing the pairwise similarity of the sub-clusters obtained in step 4, and merging the pair of sub-clusters with maximum similarity;
Step 6, if the number of sub-clusters after merging is greater than K, performing step 5 again; if it equals K, performing step 7;
Step 7, assigning each still-unassigned data point i in D to the sub-cluster nearest to it, obtaining the final partition, the partition result being K clusters.
2. it is according to claim 1 based on the hierarchy clustering method for sharing arest neighbors each other, it is characterised in that in step 5
The similarity of submanifold between any two is calculated in accordance with the following methods:
Provided with submanifold Cx, submanifold Cy, 0<X, y≤z, arest neighbors ranking matrix M, then:The similarity of submanifold between any two is:
Similarity(Cx, Cy) = NumNeighbor_Cx(Cy) / CountNeighbor(Cx) + NumNeighbor_Cx(Cx) / CountNeighbor(Cy)    (2)
where NumNeighbor_Cx(Cy) is the number of times points belonging to subcluster Cy occur in the nearest-neighbor ranking rows of the nearest neighbors (taken from matrix M) of the points in subcluster Cx; NumNeighbor_Cx(Cx) is the number of times points belonging to subcluster Cx occur in the nearest-neighbor ranking rows of those same nearest neighbors;
CountNeighbor(Cx) is the number of distinct subclusters to which the nearest neighbors (in matrix M) of the points in subcluster Cx belong; CountNeighbor(Cy) is the number of distinct subclusters to which the nearest neighbors (in matrix M) of the points in subcluster Cy belong.
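Equation (2) together with the NumNeighbor/CountNeighbor definitions above can be sketched as follows. The representation is assumed for illustration: `M[p]` is point p's rank-ordered nearest-neighbor row, `label[p]` is the subcluster p currently belongs to, and the set-based counting is an interpretation of the claim wording.

```python
def snn_similarity(Cx, Cy, M, label):
    """Equation (2), sketched.  M maps each point to its rank-ordered
    nearest-neighbor list; label maps each point to its subcluster id."""
    def neighbors_of(C):
        # All nearest neighbors, in M, of the points of subcluster C.
        return [q for p in C for q in M[p]]

    def num_neighbor(C_base, C_target):
        # NumNeighbor_{C_base}(C_target): occurrences of points of
        # C_target in the neighbor rows of C_base's nearest neighbors.
        tgt = set(C_target)
        return sum(1 for n in neighbors_of(C_base) for q in M[n] if q in tgt)

    def count_neighbor(C):
        # CountNeighbor(C): number of distinct subclusters that the
        # nearest neighbors of C's points belong to.
        return len({label[q] for q in neighbors_of(C)})

    return (num_neighbor(Cx, Cy) / count_neighbor(Cx)
            + num_neighbor(Cx, Cx) / count_neighbor(Cy))
```

For two well-separated pairs that are each other's nearest neighbors, the cross term is 0 and the self term dominates, so the similarity stays low between genuinely distinct subclusters.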
3. The hierarchical clustering method based on mutually shared nearest neighbors according to claim 1, characterized in that in step 4, assigning a data point not yet divided to the subcluster that occurs first among that data point's nearest neighbors means: if the data point's nearest-neighbor ranking row contains a subcluster center, the data point is assigned to that subcluster; if the data point's nearest-neighbor ranking row contains several subcluster centers, the data point is assigned to the subcluster of the highest-ranked (first-occurring) center.
4. The hierarchical clustering method based on mutually shared nearest neighbors according to claim 1, characterized in that in step 7 the subcluster nearest to an unassigned data point is, among the K subclusters obtained in step 6, the subcluster at minimum Euclidean distance from that unassigned data point in data set D.
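The nearest-subcluster rule of claim 4 can be sketched as follows. Distance to a subcluster is taken here as distance to its centroid; that is an assumption, since the claim only states "minimum Euclidean distance" without fixing a representative point.

```python
import numpy as np

def nearest_subcluster(point, subclusters, data):
    """Claim 4 (sketch): among the K subclusters, pick the one at
    minimum Euclidean distance from the unassigned point.
    `data` is an (n, d) array; `subclusters` is a list of index lists;
    distance to a subcluster is measured to its centroid (assumed)."""
    centroids = [data[idx].mean(axis=0) for idx in subclusters]
    dists = [np.linalg.norm(data[point] - c) for c in centroids]
    return int(np.argmin(dists))
```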
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410488243.4A CN104217015B (en) | 2014-09-22 | 2014-09-22 | Hierarchical clustering method based on mutually shared nearest neighbors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410488243.4A CN104217015B (en) | 2014-09-22 | 2014-09-22 | Hierarchical clustering method based on mutually shared nearest neighbors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104217015A CN104217015A (en) | 2014-12-17 |
CN104217015B true CN104217015B (en) | 2017-11-03 |
Family
ID=52098505
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410488243.4A Expired - Fee Related CN104217015B (en) | 2014-09-22 | 2014-09-22 | Hierarchical clustering method based on mutually shared nearest neighbors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104217015B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682052B (en) * | 2015-11-11 | 2021-11-12 | 恩智浦美国有限公司 | Data aggregation using mapping and merging |
CN105787520B (en) * | 2016-03-25 | 2019-09-20 | 中国农业大学 | A kind of algorithm of discovery cluster and outlier based on naturally shared nearest-neighbors search |
CN106570178B (en) * | 2016-11-10 | 2020-09-29 | 重庆邮电大学 | High-dimensional text data feature selection method based on graph clustering |
CN108337226A (en) * | 2017-12-19 | 2018-07-27 | 中国科学院声学研究所 | The detection method and embedded intelligent terminal of embedded intelligent terminal abnormal data |
CN108510615A (en) * | 2018-04-02 | 2018-09-07 | 深圳智达机械技术有限公司 | A kind of control system of semiconductor manufacturing facility and technique |
CN108596737A (en) * | 2018-05-07 | 2018-09-28 | 山东师范大学 | Non-cluster Centroid distribution method based on e-commerce comment data and device |
CN108932528B (en) * | 2018-06-08 | 2021-08-31 | 哈尔滨工程大学 | Similarity measurement and truncation method in chameleon algorithm |
CN108765954B (en) * | 2018-06-13 | 2022-05-24 | 上海应用技术大学 | Road traffic safety condition monitoring method based on SNN density ST-OPTIC improved clustering algorithm |
CN109871768B (en) * | 2019-01-18 | 2022-04-29 | 西北工业大学 | Hyperspectral optimal waveband selection method based on shared nearest neighbor |
CN113850995B (en) * | 2021-09-14 | 2022-12-27 | 华设设计集团股份有限公司 | Event detection method, device and system based on tunnel radar vision data fusion |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7478090B2 (en) * | 2005-01-14 | 2009-01-13 | Saffron Technology, Inc. | Methods, systems and computer program products for analogy detection among entities using reciprocal similarity measures |
CN101963995A (en) * | 2010-10-25 | 2011-02-02 | 哈尔滨工程大学 | Image marking method based on characteristic scene |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100191731A1 (en) * | 2009-01-23 | 2010-07-29 | Vasile Rus | Methods and systems for automatic clustering of defect reports |
- 2014-09-22: CN CN201410488243.4A, granted as patent CN104217015B (en), not active (Expired - Fee Related)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7478090B2 (en) * | 2005-01-14 | 2009-01-13 | Saffron Technology, Inc. | Methods, systems and computer program products for analogy detection among entities using reciprocal similarity measures |
CN101963995A (en) * | 2010-10-25 | 2011-02-02 | 哈尔滨工程大学 | Image marking method based on characteristic scene |
Also Published As
Publication number | Publication date |
---|---|
CN104217015A (en) | 2014-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104217015B (en) | Hierarchical clustering method based on mutually shared nearest neighbors | |
CN110532436B (en) | Cross-social network user identity recognition method based on community structure | |
CN104462184B (en) | A kind of large-scale data abnormality recognition method based on two-way sampling combination | |
CN105930862A (en) | Density peak clustering algorithm based on density adaptive distance | |
CN106960390A (en) | Overlapping community division method based on convergence degree | |
CN108319987A (en) | A kind of filtering based on support vector machines-packaged type combined flow feature selection approach | |
CN103888541A (en) | Method and system for discovering cells fused with topology potential and spectral clustering | |
CN107368540A (en) | The film that multi-model based on user's self-similarity is combined recommends method | |
CN109271427A (en) | A kind of clustering method based on neighbour's density and manifold distance | |
CN114091603A (en) | Spatial transcriptome cell clustering and analyzing method | |
CN107944487B (en) | Crop breeding variety recommendation method based on mixed collaborative filtering algorithm | |
CN107180079A (en) | The image search method of index is combined with Hash based on convolutional neural networks and tree | |
CN111340069A (en) | Incomplete data fine modeling and missing value filling method based on alternate learning | |
CN111581532A (en) | Social network friend-making recommendation method and system based on random block | |
CN104715034A (en) | Weighed graph overlapping community discovery method based on central persons | |
Carbonera et al. | An entropy-based subspace clustering algorithm for categorical data | |
CN103164487B (en) | A kind of data clustering method based on density and geological information | |
Amelio et al. | A new evolutionary-based clustering framework for image databases | |
CN108763283A (en) | A kind of unbalanced dataset oversampler method | |
Peng et al. | DE-MC: A membrane clustering algorithm based on differential evolution mechanism | |
CN111914930A (en) | Density peak value clustering method based on self-adaptive micro-cluster fusion | |
CN107169522A (en) | A kind of improvement Fuzzy C means clustering algorithm based on rough set and particle cluster algorithm | |
CN106354886A (en) | Method for screening nearest neighbor by using potential neighbor relation graph in recommendation system | |
CN106203469A (en) | A kind of figure sorting technique based on orderly pattern | |
CN111444454B (en) | Dynamic community division method based on spectrum method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20171103 Termination date: 20200922 |