CN108764307A - Density peaks clustering method based on natural nearest neighbor optimization - Google Patents

Density peaks clustering method based on natural nearest neighbor optimization

Info

Publication number
CN108764307A
CN108764307A (application CN201810463136.4A)
Authority
CN
China
Prior art keywords
cluster
density
neighbours
neighbors
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810463136.4A
Other languages
Chinese (zh)
Inventor
钱雪忠
金辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN201810463136.4A priority Critical patent/CN108764307A/en
Publication of CN108764307A publication Critical patent/CN108764307A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Abstract

The present invention relates to a density peaks clustering method optimized by natural nearest neighbors (the TNDP method). The local density of each data point is computed from its natural nearest neighbors, so no parameter is required and the parameter-sensitivity problem is avoided. Because the natural nearest neighbors accurately reflect the attribute characteristics of a data point, the local density computed in this way accurately represents how dense each data point is, which improves the clustering result. Since the natural nearest neighbors do not include noise points and outliers, the influence of noise points and outliers on the clustering result is also reduced.

Description

Density peaks clustering method based on natural nearest neighbor optimization
Technical field
The present invention relates to clustering methods, and more particularly to a density peaks clustering method based on natural nearest neighbor optimization.
Background art
In the development of clustering analysis, a series of clustering algorithms such as K-means, DBSCAN, FCM and AP have been proposed in succession. In 2014, Science published the paper "Clustering by fast search and find of density peaks", which proposes a clustering algorithm (DPC) that quickly searches for and finds density peaks. The algorithm can automatically provide the cluster centers of a data set, imposes no harsh requirement on the shape of the data, and can therefore cluster data sets of arbitrary shape efficiently. Its core idea is that a cluster center satisfies two basic requirements: 1) its own density is large, i.e., larger than the density of the neighboring points around it; and 2) its "distance" to any data point of even higher density is relatively large. However, the shortcomings and difficulties of the DPC algorithm should not be underestimated: 1) the cutoff distance is a parameter that must be set whenever the algorithm is applied; it is always set by hand, and manual setting involves a certain randomness and human factors that affect the clustering quality; 2) the analysis of high-dimensional data has always been a weakness of DPC, because the structure of high-dimensional data is sparse and spatially complex, so the traditional Euclidean distance cannot reflect the similarity between data objects accurately and reasonably, which causes the algorithm to fail; 3) although DPC claims to determine the clustering result automatically, in practice the cluster centers must be selected manually, so the clustering result cannot be produced automatically.
To address the deficiencies of the DPC clustering algorithm, Zhang Wenkai combined it with the CHAMELEON algorithm and proposed E_CFSFDP, which solves the problem that CFSFDP cannot handle a cluster containing more than one density peak; however, the performance of that algorithm still needs to be improved and its ability to handle high-dimensional data remains to be strengthened. Liu Y proposed KNN-DPC, an algorithm that quickly searches for density peaks and efficiently assigns samples based on k-nearest neighbors; it alleviates the sensitivity of CFSFDP's clustering results to the cutoff distance dc and the chained assignment errors caused by one-step assignment, but its clustering results are sensitive to the choice of the neighbor number K. Rashid Mehmood proposed the Fuzzy-CFSFDP algorithm, which applies fuzzy rules to the selection of cluster centers in CFSFDP and improves both the selection of cluster centers and the accuracy of the clustering results, but it is somewhat inadequate when handling complex data.
The traditional techniques have the following technical problem:
existing density-based clustering algorithms suffer from parameter sensitivity and from poor clustering performance on non-spherical data and complex manifold data.
Summary of the invention
In view of the above technical problems, it is necessary to provide a density peaks clustering method optimized by natural nearest neighbors that avoids the parameter-sensitivity problem and improves the clustering result.
A density peaks clustering method optimized by natural nearest neighbors, comprising:
finding all density peaks in a data set;
randomly visiting one density peak, and assigning the density peak and the sparse neighbors of the density peak to the same cluster;
arbitrarily picking a point in the cluster, and assigning the sparse neighbors of that point to the cluster, until the sparse neighbors of all points in the cluster have been assigned to the cluster;
repeating the steps "randomly visiting one density peak, and assigning the density peak and the sparse neighbors of the density peak to the same cluster" and "arbitrarily picking a point in the cluster, and assigning the sparse neighbors of that point to the cluster, until the sparse neighbors of all points in the cluster have been assigned to the cluster", until all density peaks have been visited;
according to the inter-cluster similarity between the clusters formed by all the density peaks through the above steps, merging the clusters with high similarity.
In another embodiment, among the clusters formed by all the density peaks through the above steps, any cluster whose number of data points is smaller than the minimum natural neighbor number is removed from the clustering result, and the data points in such clusters are marked as noise points, yielding the final clustering result; the minimum natural neighbor number is the minimum of the natural nearest neighbor counts of all data points in the cluster.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of any one of the above methods.
In the above density peaks clustering method optimized by natural nearest neighbors (the TNDP method), the local density of each data point is computed from its natural nearest neighbors, so no parameter is required and the parameter-sensitivity problem is avoided. Because the natural nearest neighbors accurately reflect the attribute characteristics of a data point, the local density computed in this way accurately represents how dense each data point is, which improves the clustering result. Since the natural nearest neighbors do not include noise points and outliers, the influence of noise points and outliers on the clustering result is also reduced.
Description of the drawings
Fig. 1 is a flowchart of a density peaks clustering method optimized by natural nearest neighbors provided by an embodiment of the present application.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, a density peaks clustering method optimized by natural nearest neighbors comprises:
finding all density peaks in a data set;
randomly visiting one density peak, and assigning the density peak and the sparse neighbors of the density peak to the same cluster;
arbitrarily picking a point in the cluster, and assigning the sparse neighbors of that point to the cluster, until the sparse neighbors of all points in the cluster have been assigned to the cluster;
repeating the steps "randomly visiting one density peak, and assigning the density peak and the sparse neighbors of the density peak to the same cluster" and "arbitrarily picking a point in the cluster, and assigning the sparse neighbors of that point to the cluster, until the sparse neighbors of all points in the cluster have been assigned to the cluster", until all density peaks have been visited;
according to the inter-cluster similarity between the clusters formed by all the density peaks through the above steps, merging the clusters with high similarity.
In another embodiment, among the clusters formed by all the density peaks through the above steps, any cluster whose number of data points is smaller than the minimum natural neighbor number is removed from the clustering result, and the data points in such clusters are marked as noise points, yielding the final clustering result; the minimum natural neighbor number is the minimum of the natural nearest neighbor counts of all data points in the cluster.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of any one of the above methods.
In the above density peaks clustering method optimized by natural nearest neighbors (the TNDP method), the local density of each data point is computed from its natural nearest neighbors, so no parameter is required and the parameter-sensitivity problem is avoided. Because the natural nearest neighbors accurately reflect the attribute characteristics of a data point, the local density computed in this way accurately represents how dense each data point is, which improves the clustering result. Since the natural nearest neighbors do not include noise points and outliers, the influence of noise points and outliers on the clustering result is also reduced.
First, the local density of each data point is determined according to the concept of natural nearest neighbors; then the cluster centers are determined from the facts that a density peak has the highest local density in its region and that density peaks are separated by sparse regions; finally, a new inter-cluster similarity concept is proposed to solve the complex-manifold problem.
Natural nearest neighbors (Natural Nearest Neighbor, TN) are a new nearest-neighbor concept. They form a scale-free nearest-neighbor relation, which is also their biggest difference from the K-nearest neighbors and the ε-nearest neighbors. The basic idea of natural nearest neighbors is that data points in dense regions of a data set have more neighbors, data points in sparse regions have fewer neighbors, and the most outlying data points have only a few neighbors or none at all. A characteristic of natural nearest neighbors is that the computation requires no parameter: each data point obtains its exact neighbors from the attribute characteristics of the data set itself, and the number of neighbors differs with the local density of the data. Since noise points and outliers have no neighbors, normal points will not take noise points or outliers as neighbors.
Definition 1, natural nearest neighbor (Natural Nearest Neighbor, TN): based on the natural nearest neighbor search algorithm (the TN-Searching algorithm), if point X belongs to the neighbors of point Y and point Y belongs to the neighbors of point X, then X and Y are natural nearest neighbors of each other.
Definition 2, natural eigenvalue (supk): according to the TN-Searching algorithm, each point has a different number of neighbors; for any point i, its neighbor count is nb(i). TN-Searching also yields an average number of neighbors, referred to as supk, the natural eigenvalue. The formula for computing supk is as follows:
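A plausible form of this equation, offered only as an assumption consistent with the description of supk as the average neighbor count over the n points of the data set (possibly rounded to an integer), is:

supk = \frac{1}{n} \sum_{i=1}^{n} nb(i)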
Definition 3, R-neighborhood (R-neighbor): findKNN(xi, r) denotes the KNN search function, which returns the r-th neighbor of xi; KNNr(xi) is a subset of X and is defined as follows:
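The defining equation is not spelled out here; a plausible reconstruction, assumed from Step 1 of the TN-Searching flow below (the r-th neighbor found in each round is merged into the R-neighborhood), is:

KNN_r(x_i) = \bigcup_{j=1}^{r} \{\, findKNN(x_i, j) \,\}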
Definition 4, density of a data point (Den(Pi)): the density defined on the basis of natural nearest neighbors is as follows:
Here nb(i) is the natural nearest neighbor count of each point obtained by the TN-Searching algorithm, N(i, nb(i)) denotes the nb(i) natural nearest neighbors of point i, and dist(i, j) is the distance between data points i and j.
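Using these symbols, one plausible reconstruction of the density, given only as an assumption, is the neighbor count divided by the total distance from i to its natural nearest neighbors, so that points with many close neighbors receive a high density:

Den(i) = \frac{nb(i)}{\sum_{j \in N(i, nb(i))} dist(i, j)}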
Definition 5, representative point (Exemplar): the representative point of data point q is the highest-density point among q and its natural nearest neighbors, i.e.:
Exemplar(q) = \arg\max_{p \in NN(q) \cup \{q\}} Den(p)
Definition 6, density peak (DensityPeak): if data point p satisfies the following condition, p is called a density peak:
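The condition can plausibly be read, as an assumption consistent with Definition 5, as requiring that p has no denser point among its natural nearest neighbors, i.e. that p is its own representative point:

Den(p) \ge Den(q) \;\; \forall\, q \in NN(p), \quad \text{equivalently} \;\; Exemplar(p) = p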
Definition 7, inter-cluster similarity (Similarity Between Clusters):
|Ci ∩ Cj| refers to the common portion of clusters Ci and Cj, and supk is the natural eigenvalue; the value of Sim(Ci, Cj) is not less than 0. If two adjacent initial clusters are separated by a sparse region, the similarity between them will be very small and they remain two separate clusters. Conversely, if the two adjacent initial clusters are connected by a dense region, the similarity between them will be very large, and the two clusters will be merged into one cluster.
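Given the quantities just described, a plausible form of the similarity, normalizing the common portion of the two clusters by the natural eigenvalue and stated here only as an assumption, is:

Sim(C_i, C_j) = \frac{|C_i \cap C_j|}{supk}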
Definition 8, sparse and dense neighbors (Sparse and Dense Neighbor): if the density of data point q is less than the density of data point p and q is a natural nearest neighbor of p, then q is called a sparse neighbor of p; conversely, if the density of data point q is greater than or equal to the density of data point p and q is a natural nearest neighbor of p, then q is called a dense neighbor of p. They are defined as follows:
SN(p) = { q | Den(q) < Den(p) && q ∈ NN(p) }
DN(p) = { q | Den(q) ≥ Den(p) && q ∈ NN(p) }
The main flow of the TN-Searching algorithm used in the present invention is:
Step 1: input the data set X and let r = 1; for each point xi in the data set, find the r-th neighbor knnr(xi) of xi by K-d tree search, and merge knnr(xi) into the R-neighborhood KNNr(xi) of xi;
Step 2: if xi lies in the R-neighborhood KNNr(knnr(xi)) of its r-th neighbor knnr(xi), and xi and knnr(xi) are not yet natural nearest neighbors of each other, then define xi and knnr(xi) as natural nearest neighbors of each other;
Step 3: let r = r + 1 and repeat Steps 1 and 2; if the number of points whose natural nearest neighbor count is 0 no longer changes, jump to Step 4, otherwise repeat Step 3;
Step 4: output the natural eigenvalue r, the natural nearest neighbor count of each data point, and the natural nearest neighbor set of each point.
One possible specific implementation of TN-Searching is as follows:
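The sketch below is one possible Python rendering of Steps 1 to 4; the function name tn_searching and the use of scipy's cKDTree for the K-d tree search are choices made here for illustration, not taken from the patent.

import numpy as np
from scipy.spatial import cKDTree

def tn_searching(X):
    """Return (r, nb, neighbors): the natural eigenvalue r, the natural nearest
    neighbor count nb(i) of every point, and each point's natural neighbor set."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    nb = np.zeros(n, dtype=int)                  # natural nearest neighbor counts nb(i)
    neighbors = [set() for _ in range(n)]        # natural nearest neighbor sets NN(i)
    if n < 2:
        return 1, nb, neighbors
    tree = cKDTree(X)                            # K-d tree used for the r-th neighbor search
    r_neighborhood = [set() for _ in range(n)]   # R-neighborhoods KNNr(i)
    prev_zero, r = -1, 0
    while True:
        r += 1
        # Step 1: the r-th neighbor of every point (column 0 of the query is the point itself).
        _, idx = tree.query(X, k=r + 1)
        knn_r = idx[:, r]
        for i in range(n):
            r_neighborhood[i].add(int(knn_r[i]))
        # Step 2: xi and knnr(xi) become natural nearest neighbors of each other
        # when xi also lies in the R-neighborhood of knnr(xi).
        for i in range(n):
            j = int(knn_r[i])
            if i in r_neighborhood[j] and j not in neighbors[i]:
                neighbors[i].add(j)
                neighbors[j].add(i)
                nb[i] += 1
                nb[j] += 1
        # Step 3: stop once the number of points with no natural neighbor stops changing.
        zero = int(np.sum(nb == 0))
        if zero == prev_zero or zero == 0 or r >= n - 1:
            break
        prev_zero = zero
    # Step 4: natural eigenvalue r, neighbor counts, neighbor sets.
    return r, nb, neighbors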
One possible specific implementation of the density peaks clustering method optimized by natural nearest neighbors according to the present invention is as follows:
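The following Python sketch walks through the seven steps of the embodiment described below, building on the tn_searching sketch above. The density, density-peak, similarity and merging rules follow the hedged reconstructions given with Definitions 2 to 8, and sim_threshold is an illustrative parameter, so the sketch approximates the flow rather than reproducing the patented code.

import numpy as np
from scipy.spatial.distance import cdist

def tndp(X, sim_threshold=1.0):
    """Approximate TNDP clustering; returns one label per point, with -1 marking noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    _, nb, neighbors = tn_searching(X)           # Step 1: natural nearest neighbors
    supk = max(int(round(nb.mean())), 1)         # natural eigenvalue (Definition 2, reconstructed)
    dist = cdist(X, X)

    # Step 1 (cont.): density of every point from its natural neighbors (Definition 4, reconstructed).
    den = np.zeros(n)
    for i in range(n):
        if neighbors[i]:
            den[i] = nb[i] / sum(dist[i, j] for j in neighbors[i])

    # Step 2: sparse neighbors (Definition 8) and density peaks (Definition 6, reconstructed).
    sparse = [{j for j in neighbors[i] if den[j] < den[i]} for i in range(n)]
    peaks = [i for i in range(n)
             if neighbors[i] and all(den[j] <= den[i] for j in neighbors[i])]

    # Steps 3-5: grow one initial cluster per density peak by repeatedly absorbing
    # the sparse neighbors of points already assigned to the cluster.
    label = -np.ones(n, dtype=int)
    for c, p in enumerate(peaks):
        if label[p] != -1:
            continue
        label[p] = c
        queue = [p]
        while queue:
            q = queue.pop()
            for s in sparse[q]:
                if label[s] == -1:
                    label[s] = c
                    queue.append(s)

    # Step 6: merge initial clusters with high inter-cluster similarity. As a stand-in for
    # Definition 7, clusters are merged when enough natural-neighbor links cross between
    # them relative to supk.
    clusters = {c: set(np.flatnonzero(label == c)) for c in set(label.tolist()) if c != -1}
    merged = True
    while merged:
        merged = False
        ids = sorted(clusters)
        for a in ids:
            for b in ids:
                if a >= b or a not in clusters or b not in clusters:
                    continue
                links = sum(1 for i in clusters[a] for j in neighbors[i] if j in clusters[b])
                if links / supk >= sim_threshold:
                    clusters[a] |= clusters[b]
                    del clusters[b]
                    merged = True

    # Step 7: clusters smaller than their minimum natural neighbor count become noise.
    result = -np.ones(n, dtype=int)
    next_id = 0
    for members in clusters.values():
        min_nb = min(nb[i] for i in members)
        if len(members) >= max(min_nb, 1):
            for i in members:
                result[i] = next_id
            next_id += 1
    return result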
The experimental comparison shows that, in terms of accuracy, the TNDP method is substantially better than the DPC, DBSCAN and KMEANS algorithms; in terms of the F-measure, except that DBSCAN is better than TNDP on the wpbc data set, the TNDP method is also clearly better than DPC, DBSCAN and KMEANS on the other data sets; and the TNDP method clusters out the correct number of classes on these data sets. Taking these three aspects together, the TNDP method clearly performs best.
A concrete application embodiment is described below; a brief usage sketch follows the steps:
Step 1: TNDP uses the TN-Searching algorithm to obtain the natural nearest neighbors of each data point in the data set X, and then computes the density of each data point;
Step 2: the representative point and the sparse neighbors of each data point are found according to Definition 5 and Definition 8;
Step 3: all density peaks are found, one density peak is visited at random, and it is assigned, together with its sparse neighbors, to the same cluster;
Step 4: a point is picked arbitrarily in this cluster, and this point and its sparse neighbors are assigned to the same cluster, until all points of this cluster have been visited;
Step 5: a density peak that has not yet been visited is found and the above steps are repeated, until all density peaks have been visited;
Step 6: once the initial clusters have been formed, the initial clusters with high similarity are merged according to the similarity relations between them;
Step 7: any cluster whose number of data points is smaller than the minimum natural neighbor number is removed from the clustering result, and the data points in such clusters are marked as noise points, giving the final clustering result.
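As a quick illustration of how these seven steps fit together, a hypothetical call to the tndp sketch given earlier might look as follows (the synthetic data set and the threshold value are arbitrary choices, not taken from the patent):

from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
labels = tndp(X, sim_threshold=1.0)
print("clusters found:", len(set(labels.tolist()) - {-1}),
      "| noise points:", int((labels == -1).sum()))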
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the present invention, and these all belong to the protection scope of the present invention. Therefore, the protection scope of the patent of the present invention shall be determined by the appended claims.

Claims (4)

1. A density peaks clustering method optimized by natural nearest neighbors, characterized by comprising:
finding all density peaks in the data set;
randomly visiting one density peak, and assigning the density peak and the sparse neighbors of the density peak to the same cluster;
arbitrarily picking a point in the cluster, and assigning the sparse neighbors of that point to the cluster, until the sparse neighbors of all points in the cluster have been assigned to the cluster;
repeating the steps "randomly visiting one density peak, and assigning the density peak and the sparse neighbors of the density peak to the same cluster" and "arbitrarily picking a point in the cluster, and assigning the sparse neighbors of that point to the cluster, until the sparse neighbors of all points in the cluster have been assigned to the cluster", until all density peaks have been visited;
according to the inter-cluster similarity between the clusters formed by all the density peaks through the above steps, merging the clusters with high similarity.
2. The density peaks clustering method optimized by natural nearest neighbors according to claim 1, characterized by further comprising: among the clusters formed by all the density peaks through the above steps, removing from the clustering result any cluster whose number of data points is smaller than the minimum natural neighbor number, and marking the data points in such clusters as noise points, to obtain the final clustering result, wherein the minimum natural neighbor number is the minimum of the natural nearest neighbor counts of all data points in the cluster.
3. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1-2 when executing the program.
4. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1-2.
CN201810463136.4A 2018-05-15 2018-05-15 Density peaks clustering method based on natural nearest neighbor optimization Pending CN108764307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810463136.4A CN108764307A (en) 2018-05-15 2018-05-15 Density peaks clustering method based on natural nearest neighbor optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810463136.4A CN108764307A (en) 2018-05-15 2018-05-15 Density peaks clustering method based on natural nearest neighbor optimization

Publications (1)

Publication Number Publication Date
CN108764307A true CN108764307A (en) 2018-11-06

Family

ID=64007783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810463136.4A Pending CN108764307A (en) 2018-05-15 2018-05-15 Density peaks clustering method based on natural nearest neighbor optimization

Country Status (1)

Country Link
CN (1) CN108764307A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046665A (en) * 2019-04-17 2019-07-23 成都信息工程大学 Based on isolated two abnormal classification point detecting method of forest, information data processing terminal
TWI717259B (en) * 2019-04-17 2021-01-21 南韓商韓領有限公司 Computer-implemented system and method for batch picking optimization
TWI750947B (en) * 2019-04-17 2021-12-21 南韓商韓領有限公司 Computer-implemented system and method for batch picking optimization
CN111260503A (en) * 2020-01-13 2020-06-09 浙江大学 Wind turbine generator power curve outlier detection method based on cluster center optimization
CN111260503B (en) * 2020-01-13 2023-10-27 浙江大学 Wind turbine generator power curve outlier detection method based on cluster center optimization
CN116756526A (en) * 2023-08-17 2023-09-15 北京英沣特能源技术有限公司 Full life cycle performance detection and analysis system of energy storage equipment
CN116756526B (en) * 2023-08-17 2023-10-13 北京英沣特能源技术有限公司 Full life cycle performance detection and analysis system of energy storage equipment

Similar Documents

Publication Publication Date Title
Lengyel et al. Silhouette width using generalized mean—A flexible method for assessing clustering efficiency
US11132388B2 (en) Efficient spatial queries in large data tables
CN108764307A (en) Density peaks clustering method based on natural nearest neighbor optimization
CN109034562B (en) Social network node importance evaluation method and system
CN111259933B (en) High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN103888541A (en) Method and system for discovering cells fused with topology potential and spectral clustering
CN106796589A (en) The indexing means and system of spatial data object
CN106228554A (en) Fuzzy coarse central coal dust image partition methods based on many attribute reductions
CN106951526B (en) Entity set extension method and device
CN108829804A (en) Based on the high dimensional data similarity join querying method and device apart from partition tree
CN106934410A (en) The sorting technique and system of data
CN108549696B (en) Time series data similarity query method based on memory calculation
CN110135180A (en) Meet the degree distribution histogram dissemination method of node difference privacy
CN108304404B (en) Data frequency estimation method based on improved Sketch structure
Bruzzese et al. DESPOTA: DEndrogram slicing through a permutation test approach
Yin et al. Finding the informative and concise set through approximate skyline queries
CN110781943A (en) Clustering method based on adjacent grid search
CN110580252A (en) Space object indexing and query method under multi-objective optimization
KR101994871B1 (en) Apparatus for generating index to multi dimensional data
CN104794237B (en) web information processing method and device
CN109245948B (en) Security-aware virtual network mapping method and device
CN115205699B (en) Map image spot clustering fusion processing method based on CFSFDP improved algorithm
Zhang et al. A novel method for detecting outlying subspaces in high-dimensional databases using genetic algorithm
CN107562872A (en) Metric space data similarity search method and device based on SQL
Mouratidis et al. Medoid queries in large spatial databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181106