CN107563443A

CN107563443A - Adaptive semi-supervised density clustering method and system

Info

Publication number: CN107563443A
Application number: CN201710789195.6A
Authority: CN
Inventors: 杨云; 李宗泽
Original assignee: Yunnan University YNU
Current assignee: Yunnan University YNU
Priority date: 2017-09-05
Filing date: 2017-09-05
Publication date: 2018-01-09

Abstract

The invention belongs to the technical field of data processing and discloses an adaptive semi-supervised density clustering method and system. Density parameters are first extracted automatically from labeled and unlabeled data; an initial cluster analysis is then carried out on the target data set with the multiple density parameters to obtain local clustering results; finally, the local clustering results are integrated to obtain the final global clustering result. The proposed algorithm does not require the number of clusters to be set: during cluster analysis, the number of clusters is determined automatically from the density information of the data set. The algorithm automatically extracts multiple groups of density parameters from labeled and unlabeled data and uses them to perform density-based cluster analysis on the target data set, which yields good clustering results with strong adaptivity and noise immunity.

Description

Adaptive semi-supervised density clustering method and system

Technical field

The invention belongs to the technical field of data processing, and in particular relates to an adaptive semi-supervised density clustering method and system.

Background technology

Clustering is arguably one of the most important data mining tasks; it aims to discover the cluster structure in the target data. A cluster consists of instances that are "similar" to one another, while "dissimilar" instances belong to other clusters. Clustering algorithms can be categorized in many ways, depending on the perspective and the criteria used. One taxonomy, however, is widely accepted. It divides clustering algorithms into hierarchical clustering, partitioning-based clustering, density-based clustering, and model-based clustering. Recently, many researchers have tried to extract constraint information from supervision information and to use it to guide the clustering process, in order to improve the efficiency and accuracy of the clustering result. This has produced a new branch of clustering: semi-supervised clustering. Semi-supervised clustering algorithms are in turn usually divided into two classes: distance-based and constraint-based semi-supervised clustering.

In distance-based methods, the adapted distance measure is parameterized, and its parameters are learned from supervision information given in advance in the form of constraints, such as Must-Link and Cannot-Link constraints. A Must-Link constraint means that the two instances must be assigned to the same cluster; conversely, a Cannot-Link constraint means that the two instances must be assigned to different clusters. The adapted distance is usually implemented as a transformation matrix that can be looked up, so that every pair of instances has a corresponding distance value. When two instances are related by Must-Link, the distance between them should be shortened; when they are related by Cannot-Link, it should be lengthened. However, this way of realizing the constraints does not strictly guide the clustering process, and the clustering result sometimes contradicts the constraints. For example, a pair of instances may be related by Must-Link and yet remain far enough apart to be placed in different clusters. A literature survey shows that most distance-adaptation methods were proposed for classification or semi-supervised clustering tasks.

Constraint-based methods modify an existing clustering algorithm so that class labels or constraints supplied by the user guide the clustering toward a better result. Concretely, the clustering algorithm can be modified in different ways. The Constrained COBWEB algorithm embeds the constraints into the incremental partitioning process by optimizing the clustering objective. The Constrained K-means algorithm incorporates constraint information, specifically instance-level constraints, into the traditional K-means algorithm. The Seeded-K-means algorithm uses the constraints to set the seed points instead of selecting them at random as the conventional method does. In that method, the initial clusters satisfy the transitive closure of the constraints and the centers of these clusters are taken as the seed points; after initialization, the cluster structure is updated iteratively whether or not constraint information is available. The semi-supervised hierarchical clustering algorithm introduces constraint information into an agglomerative clustering process based on an ultrametric distance matrix. The C-DBSCAN algorithm partitions clusters using constraints on data instances; it is based on DBSCAN and is robust on data of irregular shape.

Many clustering algorithms have been adapted for semi-supervised learning, but most of them are partitioning-based or hierarchical algorithms; density-based ones are almost absent. In fact, when the data to be clustered contain clusters of different sizes and shapes, a density-based clustering algorithm is an ideal choice. Unlike a partitioning-based algorithm, it does not have to search for an optimal partition of the global data space; DBSCAN, for example, only needs to be optimal locally. Furthermore, a density-based semi-supervised algorithm can apply both Must-Link and Cannot-Link constraints to data instances that lie close to one another. Compared with constraint-based partitioning algorithms, this is a natural advantage, because many such algorithms end up violating Cannot-Link constraints in their converged state.

In summary, the problems of the prior art are:

Existing clustering methods have poor adaptivity and noise immunity and cannot identify complex cluster structures in the target data set.

The specific shortcomings of existing semi-supervised clustering algorithms are as follows:

Choosing the number of clusters: most algorithms require the user to fix the number of clusters in advance. In real data, however, the number of clusters is unknown, and a suitable value is usually found only through repeated experiments.

Single parameter setting: an algorithm is usually run with a single set of parameters, and no single set applies to the complex cluster structures exhibited by all kinds of data sets. One set of parameters typically suits only one kind of situation.

Summary of the invention

In view of the problems of the prior art, the invention provides an adaptive semi-supervised density clustering method and system.

The invention is realized as follows. An adaptive semi-supervised density clustering method comprises: first, automatically extracting density parameters from labeled and unlabeled data;

then performing an initial cluster analysis on the target data set with the multiple density parameters to obtain local clustering results;

and finally integrating the local clustering results to obtain the final global clustering result.

Further, the cluster structure in the target data set includes clusters of different sizes, clusters of different shapes, and clusters of different densities.

Further, the multiple density parameter learning includes:

Given the unlabeled data X_U and the labeled data X_L, build the data set X' = X_L^j ∪ X_U, which contains all data with class label j and all unlabeled data;

The specific steps are:

Step 1: randomly select a labeled data point x₁ ∈ X_L^j as the initial point, find its nearest point p₁ ∈ X' in X', and take the distance between the two points as r, r = ‖x₁ − p₁‖.

Build a data set D from these two points and remove D from X', X' = X' − D. If X' contains a point p₂ whose distance to some point in D is less than or equal to r, move p₂ into D;

Step 2: repeat Step 1 until no new data point is added to D, then check whether D completely contains X_L^j. If it does, then Eps_j = r for class label j; if it does not, increase r appropriately;

Step 3: repeat Step 2 until D satisfies the condition, which gives the radius Eps_j.

Further, after Eps_j is obtained, MinPts is initialized to 2, and DBSCAN is run on the union of the unlabeled data and all data with class label j, with density parameters (Eps_j, MinPts), giving a clustering result P_j. P_j is then checked: if all data points of X_L^j fall into the same cluster, MinPts is increased by 1 and the data set is clustered again with DBSCAN. The clustering and checking steps are repeated until the points of X_L^j are no longer all in the same cluster; then, for class label j, MinPts_j = MinPts − 1.

Further, the integration of the local clustering results includes:

After Eps_j and MinPts_j have been obtained for each class label, compute the set of densities {density_j = MinPts_j/Eps_j} and the set of local clustering results {P_j} corresponding to the class labels;

Then sort the densities in descending order and find the highest density, density_o, whose class label is o. In the clustering result P_o, every unlabeled point that falls into the cluster labeled o is assigned class label o, according to the cluster assumption of semi-supervised learning theory. The second-highest density is then processed in the same way, and so on for all remaining densities. Any data point still without a class label is regarded as a noise point.

Another object of the invention is to provide an adaptive semi-supervised density clustering system.

The advantages and positive effects of the invention are:

The invention discloses an adaptive semi-supervised clustering algorithm based on the multiple density information in the cluster structure of the data. The algorithm has strong adaptivity and noise immunity and can identify complex cluster structures in the target data set, including clusters of different sizes, shapes, and densities. For regions of different density, the algorithm automatically extracts density parameters from labeled and unlabeled data, then performs an initial cluster analysis on the target data set with the multiple density parameters to obtain local clustering results, and finally integrates the local clustering results to obtain the final global clustering result.

In the adaptive semi-supervised density clustering method provided by the invention, when the data to be clustered contain clusters of different sizes and shapes, a density-based clustering algorithm is an ideal choice. Unlike a partitioning-based algorithm, it does not have to search for an optimal partition of the global data space; DBSCAN, for example, only needs to be optimal locally. Furthermore, a density-based semi-supervised algorithm can apply both Must-Link and Cannot-Link constraints to data instances that lie close to one another. Compared with constraint-based partitioning algorithms, the invention has a natural advantage, because many such algorithms end up violating Cannot-Link constraints in their converged state.

Further advantages of the invention are:

Choosing the number of clusters: most algorithms require the user to fix the number of clusters in advance. In real data, however, the number of clusters is unknown, and a suitable value is usually found only through repeated experiments. The proposed algorithm does not require the number of clusters to be set; during cluster analysis it is determined automatically from the density information of the data set;

Single parameter setting: an algorithm is usually run with a single set of parameters, and no single set applies to the complex cluster structures exhibited by all kinds of data sets. One set of parameters typically suits only one kind of situation. The proposed algorithm automatically extracts multiple groups of density parameters from labeled and unlabeled data and uses them to perform density-based cluster analysis on the target data set, which yields good clustering results with strong adaptivity and noise immunity.

The invention does not require the user to input parameters (such as the number of clusters); the density-related parameters are obtained automatically by the algorithm through exploring the internal structure of the data;

By integrating regional clustering results, the invention can identify data of different sizes, shapes, and densities, and it is insensitive to noisy data;

The invention fully satisfies the input constraints. Moreover, many density-based semi-supervised clustering algorithms produce a large number of clusters consisting of a single data point or very few data points; the algorithm of the invention significantly reduces this side effect.

Brief description of the drawings

Fig. 1 is a flow chart of the adaptive semi-supervised density clustering method provided by an embodiment of the invention.

Fig. 2 is an example of the process of determining Eps provided by an embodiment of the invention.

In the figure: (a) Given the unlabeled data X_U and the labeled data X_L, the data set X' = X_L^j ∪ X_U can be built, which contains all data with class label j and all unlabeled data. First, a labeled data point x₁ ∈ X_L^j is selected at random as the initial point; then its nearest point p₁ ∈ X' in X' is found, and the distance between the two points is taken as r, r = ‖x₁ − p₁‖. A data set D is built from these two points and removed from X', X' = X' − D. If X' contains a point p₂ whose distance to some point in D is less than or equal to r, p₂ is moved into D.

(b) The previous step is repeated until no new data point is added to D; it can then be checked whether D completely contains X_L^j. If it does, then Eps_j = r for class label j; if it does not, r is increased appropriately.

(c) The above steps are repeated until D satisfies the condition, which determines the radius Eps_j.

Fig. 3 is an example of integrating the local clustering results provided by an embodiment of the invention.

In the figure: (a)-(f) show the distribution of clusters while the local clustering results are being integrated.

Fig. 4 shows the performance of each algorithm, provided by an embodiment of the invention, on a data set with multiple densities and different shapes.

In the figure: (a) three different clusters comprise 191 data point instances in total: a spherical cluster of 60 instances marked with circles, a crescent cluster of 60 instances marked with diamonds, and another crescent cluster of 60 instances marked with triangles; in addition, 11 noise points are marked with crosses. (b) shows which data in the data space carry class labels. (c)-(g) show how the three clusters are separated from the multi-density, multi-shape data set. (h) shows the result with the highest accuracy, 94.3%.

Fig. 5 shows the performance of each algorithm, provided by an embodiment of the invention, on a data set with unbalanced clusters and manifold structure.

In the figure: (a) the data set comprises 790 data point instances in total. The spherical cluster at the center of the data space consists of 395 instances marked with circles; the ring-shaped cluster consists of 363 instances marked with diamonds; the cluster in the upper left corner consists of 3 instances marked with asterisks; the cluster in the upper right corner consists of 3 instances marked with triangles; and the clusters in the lower left and lower right corners consist of 3 instances each, drawn with their own markers. The data set also contains 20 noise points, located between the ring-shaped cluster and the four surrounding clusters. (b) shows which data in the data space carry class labels; it can also be regarded as the input data set. (c)-(h) show the performance on this data set, identifying the unbalanced clusters and manifold structure inherent in it.

Embodiment

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here only explain the invention and are not intended to limit it.

The application principle of the invention is described below with reference to the accompanying drawings.

As shown in Fig. 1, the adaptive semi-supervised density clustering method provided by an embodiment of the invention includes:

S101: first, automatically extract density parameters from labeled and unlabeled data;

S102: then perform an initial cluster analysis on the target data set with the multiple density parameters to obtain local clustering results;

S103: finally, integrate the local clustering results to obtain the final global clustering result.

I. The application principle of the invention is further described below with reference to a specific embodiment.

The invention proposes an adaptive semi-supervised density clustering method consisting of two parts: multiple density parameter learning and local clustering result integration.

1) Multiple density parameter learning:

For density-based clustering algorithms, different density parameters (the minimum number of points MinPts and the radius Eps) lead to different clustering results. The invention assumes that the data sets corresponding to different class labels differ in size, shape, and density; it is therefore desirable to find a separate set of density parameters for each class label.

Given the unlabeled data X_U and the labeled data X_L, the data set X' = X_L^j ∪ X_U can be built, which contains all data with class label j and all unlabeled data. First, a labeled data point x₁ ∈ X_L^j is selected at random as the initial point; then its nearest point p₁ ∈ X' in X' is found, and the distance between the two points is taken as r, r = ‖x₁ − p₁‖. A data set D is built from these two points and removed from X', X' = X' − D. If X' contains a point p₂ whose distance to some point in D is less than or equal to r, p₂ is moved into D.

The previous step is repeated until no new data point is added to D; it can then be checked whether D completely contains X_L^j. If it does, then Eps_j = r for class label j; if it does not, r is increased appropriately.

The above steps are repeated until D satisfies the condition. Fig. 2 illustrates the process of determining the radius Eps_j.
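The radius-learning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `learn_eps`, the tuple representation of points, and the Euclidean distance are choices made here. The text only says that r is increased "appropriately"; the sketch assumes r is raised to the smallest gap between D and the remaining points, which is the smallest increase that lets D grow again.

```python
import math

def learn_eps(labeled_j, unlabeled, start=0):
    """Grow D from one labeled seed until it holds every point of class j.

    labeled_j and unlabeled are lists of coordinate tuples; returns the radius r.
    """
    pool = list(labeled_j) + list(unlabeled)        # X' = X_L^j united with X_U
    seed = labeled_j[start]                         # x1 (the text picks it at random)
    rest = [p for p in pool if p != seed]
    r = min(math.dist(seed, p) for p in rest)       # distance from x1 to its nearest point p1
    D = [seed]
    while True:
        grew = True
        while grew:                                 # absorb points within r of any member of D
            grew = False
            for p in rest[:]:
                if any(math.dist(p, q) <= r for q in D):
                    D.append(p)
                    rest.remove(p)
                    grew = True
        if all(x in D for x in labeled_j):          # D completely contains X_L^j
            return r
        # Assumption: "increase r appropriately" = smallest gap between D and the rest.
        r = min(math.dist(p, q) for p in rest for q in D)

print(learn_eps([(0, 0), (3, 0)], [(1, 0), (2, 0), (10, 0)]))
```

With a different growth rule for r (for example a fixed multiplicative step) the returned radius would generally be larger, so the choice affects how tight Eps_j is.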

After Eps_j is obtained, MinPts can be initialized to 2, and DBSCAN is run on the union of the unlabeled data and all data with class label j, with density parameters (Eps_j, MinPts). This gives a clustering result P_j. P_j is then checked: if all data points of X_L^j fall into the same cluster, MinPts is increased by 1 and the data set is clustered again with DBSCAN. These steps are repeated until the points of X_L^j are no longer all in the same cluster; then, for class label j, MinPts_j = MinPts − 1.
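The MinPts search can be sketched as below. To stay self-contained, the sketch includes a small pure-Python DBSCAN; `dbscan` and `learn_min_pts` are hypothetical names, and counting a point as its own neighbour is an assumption (the text does not say how the neighbourhood is counted). A library implementation of DBSCAN could be substituted for the helper.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise."""
    n = len(points)
    # Neighbourhoods include the point itself (assumption).
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    labels = [None] * n
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < min_pts:
            labels[i] = -1                      # provisional noise, may become a border point
            continue
        labels[i] = cid
        queue = list(neigh[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid                 # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(neigh[j]) >= min_pts:        # only core points expand the cluster
                queue.extend(neigh[j])
        cid += 1
    return labels

def learn_min_pts(labeled_j, unlabeled, eps):
    """Raise MinPts from 2 until the class-j points stop sharing one cluster."""
    pool = list(labeled_j) + list(unlabeled)
    k = len(labeled_j)
    min_pts = 2
    while min_pts <= len(pool) + 1:             # guard; beyond len(pool) everything is noise
        ids = set(dbscan(pool, eps, min_pts)[:k])
        if not (len(ids) == 1 and -1 not in ids):
            break
        min_pts += 1
    return min_pts - 1                          # MinPts_j = MinPts - 1

print(learn_min_pts([(0, 0), (3, 0)], [(1, 0), (2, 0), (10, 0)], 1.0))
```

The returned value is the largest MinPts for which all points labeled j still fall into a single cluster at radius Eps_j.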

2) Local clustering result integration:

Once Eps_j and MinPts_j have been obtained for each class label, the set of densities {density_j = MinPts_j/Eps_j} and the set of local clustering results {P_j} corresponding to the class labels can be computed. The densities are then sorted in descending order, and the highest density, density_o, is found; its class label is o. In the clustering result P_o, if unlabeled points fall into the cluster labeled o, they can be assigned class label o according to the cluster assumption of semi-supervised learning theory (points in the same cluster are very likely to share the same class label). The second-highest density is then processed in the same way, and so on for all remaining densities. After these steps, any data point still without a class label is regarded as a noise point.

In essence, the integration step assigns a class label to each data point according to the earlier partitioning of the data. When a data point could receive several class labels, the algorithm always gives it the label with the highest density. In other words, where two clusters overlap, the algorithm tends to assign the overlapping part to the denser cluster, which means the boundary between clusters is placed in a low-density region. This design matches the low-density separation assumption of semi-supervised learning (the decision boundary should lie in a low-density region as far as possible). To illustrate the integration step more vividly, a group of artificial two-dimensional data with different density characteristics is used as a sample, and the process is shown in Fig. 3. Panels (a)-(f) show the distribution of clusters while the local clustering results are being integrated; the clusters are marked with circles, triangles, and diamonds respectively.
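The integration step can be illustrated with the following sketch. The data layout is an assumption made here for brevity: `params` maps each class label to its (Eps_j, MinPts_j), `members` maps each class label to the indices of the points that share a cluster with that label's labeled points in P_j, and `known` holds the labels given as input. The function name `integrate` is hypothetical.

```python
def integrate(params, members, n_points, known):
    """Merge local results; denser labels are applied first, None marks noise."""
    final = [known.get(i) for i in range(n_points)]          # keep the input labels
    density = {j: min_pts / eps for j, (eps, min_pts) in params.items()}
    for j in sorted(density, key=density.get, reverse=True):  # descending density
        for i in members[j]:
            if final[i] is None:                              # never overwrite a denser label
                final[i] = j
    return final

params = {"a": (1.0, 3), "b": (0.5, 4)}          # densities: a = 3.0, b = 8.0
members = {"a": {0, 1, 2, 3}, "b": {3, 4, 5}}    # point 3 lies in both local clusters
known = {0: "a", 5: "b"}
print(integrate(params, members, 7, known))
```

In this toy input, point 3 belongs to both local clusters and ends up with label "b", the denser one, while point 6 belongs to neither and is left as noise.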

II. The application principle of the invention is further described below with reference to a specific embodiment.

The embodiment of the invention provides an adaptive semi-supervised density clustering method consisting of two parts, multiple density parameter learning and local clustering result integration; the specific flow is as follows:

III. The application principle of the invention is further described below with reference to its positive effects.

The invention is compared comprehensively with representative semi-supervised algorithms, including C-DBSCAN, Constrained Clustering via Spectral Regularization (CCSR), Constrained K-means, Constrained Evidential Clustering (CEVCLUS), and Semi-Bayesian. All data in the experimental data sets originally carry class labels. Before each experiment, the class labels of a randomly chosen 10% of the data are retained and the labels of the remaining 90% are removed, so that those data become unlabeled. Although the labeled data are chosen at random, each class must keep at least two labeled data points. The constraint information is obtained only from the data that retain their class labels.
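The label-retention protocol described above could be reproduced along the following lines. This is only a sketch: `keep_labels` is a hypothetical helper, and the order of operations (first guarantee two labels per class, then fill up to the 10% budget at random) is an assumption, since the text does not specify how the two requirements are combined.

```python
import random

def keep_labels(y, fraction=0.1, min_per_class=2, seed=0):
    """Return a copy of y in which only the retained labels survive (others are None)."""
    rng = random.Random(seed)
    n_keep = max(1, round(fraction * len(y)))
    by_class = {}
    for i, c in enumerate(y):
        by_class.setdefault(c, []).append(i)
    kept = set()
    for idx in by_class.values():                 # guarantee at least two labels per class
        kept.update(rng.sample(idx, min(min_per_class, len(idx))))
    others = [i for i in range(len(y)) if i not in kept]
    rng.shuffle(others)
    kept.update(others[:max(0, n_keep - len(kept))])   # fill the rest of the budget at random
    return [c if i in kept else None for i, c in enumerate(y)]

masked = keep_labels([0] * 50 + [1] * 50)
print(sum(v is not None for v in masked))
```

When a data set has many small classes, the per-class guarantee can push the number of retained labels above the nominal fraction.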

First, the performance of each algorithm on the first data set (clusters with multiple densities and different shapes) is observed. As shown in Fig. 4(a), three different clusters comprise 191 data point instances in total: a spherical cluster of 60 instances marked with circles, a crescent cluster of 60 instances marked with diamonds, and another crescent cluster of 60 instances marked with triangles; in addition, 11 noise points are marked with crosses. Fig. 4(b) shows which data in the data space carry class labels. As shown in Fig. 4(c)-(h), the algorithm proposed by the invention (Fig. 4(h)) achieves the highest accuracy, 94.3%. It not only separates the three clusters accurately from the multi-density, multi-shape data set, but also accounts for the influence of the added noise data.

Next, the performance of each algorithm on the second data set (unbalanced clusters with manifold structure) is observed. As shown in Fig. 5(a), the data set comprises 790 data point instances in total. The spherical cluster at the center of the data space consists of 395 instances marked with circles; the ring-shaped cluster consists of 363 instances marked with diamonds; the cluster in the upper left corner consists of 3 instances marked with asterisks; the cluster in the upper right corner consists of 3 instances marked with triangles; and the clusters in the lower left and lower right corners consist of 3 instances each, drawn with their own markers. The data set also contains 20 noise points, located between the ring-shaped cluster and the four surrounding clusters. Fig. 5(b) shows which data in the data space carry class labels; it can also be regarded as the input data set. Fig. 5(c)-(h) show the performance on this data set: compared with similar algorithms, the algorithm proposed by the invention identifies the unbalanced clusters and manifold structure inherent in the data set well and achieves an accuracy of 98.6%.

Artificial two-dimensional data were tested with the different algorithms. The experimental results demonstrate the advantages of the proposed algorithm and show the importance of multiple density parameter learning when designing density-based clustering algorithms.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within its scope of protection.

Claims

1. An adaptive semi-supervised density clustering method, characterized in that the method comprises: first, automatically extracting density parameters from labeled and unlabeled data;

then performing an initial cluster analysis on the target data set with the multiple density parameters to obtain local clustering results;

and finally integrating the local clustering results to obtain the final global clustering result.
2. The adaptive semi-supervised density clustering method of claim 1, characterized in that the cluster structure in the target data set includes clusters of different sizes, clusters of different shapes, and clusters of different densities.
3. The adaptive semi-supervised density clustering method of claim 1, characterized in that the multiple density parameter learning includes:

Given the unlabeled data X_U and the labeled data X_L, building the data set X' = X_L^j ∪ X_U, which contains all data with class label j and all unlabeled data;

The specific steps are:

Step 1: randomly selecting a labeled data point x₁ ∈ X_L^j as the initial point, finding its nearest point p₁ ∈ X' in X', and taking the distance between the two points as r, r = ‖x₁ − p₁‖;

building a data set D from these two points and removing D from X', X' = X' − D; if X' contains a point p₂ whose distance to some point in D is less than or equal to r, moving p₂ into D;

Step 2: repeating Step 1 until no new data point is added to D, then checking whether D completely contains X_L^j; if it does, then Eps_j = r for class label j; if it does not, increasing r appropriately;

Step 3: repeating Step 2 until D satisfies the condition, which gives the radius Eps_j.
4. The adaptive semi-supervised density clustering method of claim 3, characterized in that, after Eps_j is obtained, MinPts is initialized to 2, and DBSCAN is run on the union of the unlabeled data and all data with class label j, with density parameters (Eps_j, MinPts), giving a clustering result P_j; P_j is checked, and if all data points of X_L^j fall into the same cluster, MinPts is increased by 1 and the data set is clustered again with DBSCAN; the clustering and checking steps are repeated until the points of X_L^j are no longer all in the same cluster; for class label j, MinPts_j = MinPts − 1.
5. The adaptive semi-supervised density clustering method of claim 1, characterized in that the integration of the local clustering results includes:

After Eps_j and MinPts_j have been obtained for each class label, computing the set of densities {density_j = MinPts_j/Eps_j} and the set of local clustering results {P_j} corresponding to the class labels;

then sorting the densities in descending order and finding the highest density, density_o, whose class label is o; in the clustering result P_o, assigning class label o to every unlabeled point that falls into the cluster labeled o, according to the cluster assumption of semi-supervised learning theory; processing the second-highest density in the same way, and so on for all remaining densities; and regarding any data point still without a class label as a noise point.
6. An adaptive semi-supervised density clustering system implementing the adaptive semi-supervised density clustering method of claim 1.