CN109241991A - A data clustering ensemble method based on an information entropy weight incremental learning strategy - Google Patents

A data clustering ensemble method based on an information entropy weight incremental learning strategy

Info

Publication number
CN109241991A
CN109241991A (application CN201810810646.4A)
Authority
CN
China
Prior art keywords
cluster
clusters
weight
clustering
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810810646.4A
Other languages
Chinese (zh)
Inventor
徐健锋
薛国泽
刘斓
梁伟
严方圆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN201810810646.4A
Publication of CN109241991A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data clustering ensemble method based on an information entropy weight incremental learning strategy, comprising the following steps: (1) generating the clustering ensemble members; (2) processing the base clusterings with a class-cluster weighting algorithm based on the incremental learning idea; (3) performing the clustering ensemble on the final base clustering set; (4) ending. The present invention improves the interference resistance and robustness of the clustering ensemble and the quality of the final ensemble result.

Description

A data clustering ensemble method based on an information entropy weight incremental learning strategy
Technical field
The invention belongs to the field of data mining and machine learning, and relates to clustering methods for research objects or data.
Background technique
Clustering ensemble is an effective clustering technique, and the integration strategy based on cluster-weight information entropy is an effective novel ensemble method within it. However, this method suffers from the sensitivity of the clustering result to the quality of the base clusterings.
The main procedure of the integration strategy based on cluster-weight information entropy is as follows: for a particular data set, a common clustering method is applied M times to produce M base clusterings, and an information entropy method is used to determine the weight of each base clustering and of each of its class clusters. A weighted distance matrix over the elements of the data set is then constructed from the base clusterings and their weights. Finally, based on the weighted distance matrix, traditional hierarchical clustering or another classical clustering method is applied, continuously merging groups of high-weight elements until the final clustering is obtained.
The NMI index is commonly used to judge the quality of a clustering result; NMI lies between 0 and 1, and a larger value indicates a better clustering. Choosing a clustering result π_G as the reference standard, the NMI of a test clustering π' is computed (in its standard form) as:

$$\mathrm{NMI}(\pi',\pi_G)=\frac{\displaystyle\sum_{i=1}^{n'}\sum_{j=1}^{n_G} n_{ij}\log\frac{n\,n_{ij}}{n'_i\,n^G_j}}{\sqrt{\Big(\displaystyle\sum_{i=1}^{n'} n'_i\log\frac{n'_i}{n}\Big)\Big(\displaystyle\sum_{j=1}^{n_G} n^G_j\log\frac{n^G_j}{n}\Big)}}$$

where n is the total number of elements, n' is the number of class clusters in π', n_G is the number of class clusters in π_G, n'_i is the number of elements of the i-th class cluster in π', n_j^G is the number of elements of the j-th class cluster in π_G, and n_ij is the number of elements common to the i-th class cluster of π' and the j-th class cluster of π_G.
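For illustration only, a minimal NumPy sketch of this formula; the function name and the label-array interface are assumptions, not part of the patent:

```python
import numpy as np

def nmi(labels_test, labels_ref):
    """NMI between a test partition pi' and a reference partition pi_G.
    Both arguments are integer cluster labels per element; result in [0, 1]."""
    labels_test = np.asarray(labels_test)
    labels_ref = np.asarray(labels_ref)
    n = labels_test.size
    # contingency table: n_ij = elements shared by cluster i of pi' and j of pi_G
    _, ti = np.unique(labels_test, return_inverse=True)
    _, ri = np.unique(labels_ref, return_inverse=True)
    cont = np.zeros((ti.max() + 1, ri.max() + 1))
    np.add.at(cont, (ti, ri), 1)
    ni = cont.sum(axis=1)                 # n'_i
    nj = cont.sum(axis=0)                 # n_j^G
    nz = cont > 0                         # skip empty cells (0 log 0 = 0)
    mi = (cont[nz] / n * np.log(n * cont[nz] / np.outer(ni, nj)[nz])).sum()
    h_test = -(ni / n * np.log(ni / n)).sum()
    h_ref = -(nj / n * np.log(nj / n)).sum()
    return mi / np.sqrt(h_test * h_ref) if h_test > 0 and h_ref > 0 else 0.0
```

For reference, scikit-learn's sklearn.metrics.normalized_mutual_info_score computes the same quantity; pass average_method='geometric' to match the square-root normalization above.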
Summary of the invention
Base clusterings are the foundation of clustering ensemble techniques, and their quality is closely related to the quality of the final clustering result. Based on an information entropy weight incremental learning strategy, the present invention proposes a new method for improving the quality of the base clusterings in clustering ensemble techniques based on cluster-weight information entropy, thereby effectively improving the quality of the clustering ensemble result.
The present invention is achieved by the following technical solutions.
The data clustering ensemble method based on an information entropy weight incremental learning strategy according to the present invention is characterized by operating on the member clusterings, comprising the following steps:
(1) preliminary generation of the clustering ensemble members;
(2) processing of the base clustering members by a class-cluster weighting algorithm based on incremental learning;
(3) clustering ensemble based on the final base clustering set;
(4) end.
Step (1) of the present invention, the preliminary generation of clustering ensemble members, proceeds as follows (a code sketch, under stated assumptions, follows the list):
1) apply a data clustering algorithm to the data set D = {d_1, d_2, ..., d_x};
2) set the clustering-count control parameter m to an initial value of 1;
3) judge whether the control parameter m is less than or equal to the number of candidate clusterings M; if so, execute step 4), otherwise go to step 6);
4) perform the m-th clustering, denoted π_m = {C_1^m, C_2^m, ..., C_{N_m}^m}, where the cardinality of the set π_m, |π_m|, is denoted N_m;
5) execute m = m + 1 and go to step 3);
6) generate the candidate clustering set Π = {π_1, π_2, ..., π_M} and attach an "uncertain" label to every candidate clustering in Π;
7) end.
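A minimal sketch of step (1), assuming scikit-learn's KMeans as the data clustering algorithm; the function name and seeding scheme are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_base_clusterings(D, M, n_clusters, seed=0):
    """Step (1) sketch: run K-means M times to produce M candidate
    base clusterings; every candidate starts out labeled 'uncertain'."""
    rng = np.random.RandomState(seed)
    candidates = []
    for m in range(M):
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=rng.randint(2**31 - 1))
        candidates.append(km.fit_predict(D))   # pi_m as a label array
    return candidates                          # Pi = {pi_1, ..., pi_M}
```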
Step (2) of the present invention, processing the base clustering members with the class-cluster weighting algorithm based on incremental learning, proceeds as follows (a code sketch follows the list):
1) set the control parameter r to an initial value of 1, and set the iteration limit k;
2) Π={ π is calculated12,...,πMIn indicate any one class clusters of uncertain labeled clustersπm∈ Π, relative to the uncertain information entropy of clusters all in Π, its calculation formula is:Wherein 1≤m≤M, 1≤n≤NM1≤μ≤M, 1≤ j≤NM,| * | it is the element number of set *;
3) compute the average of Ψ(C_n^m) over the class clusters of π_m obtained in step 2):

$$\Psi(\pi_m)=\frac{1}{N_m}\sum_{n=1}^{N_m}\Psi(C_n^m),\qquad N_m=|\pi_m|;$$

4) compute the standard deviation of the Ψ(C_n^m) of the candidate clustering π_m obtained in step 2):

$$\sigma(\pi_m)=\sqrt{\frac{1}{N_m}\sum_{n=1}^{N_m}\big(\Psi(C_n^m)-\Psi(\pi_m)\big)^2};$$
5) use the formula W(*) = e^{-*} to normalize the average Ψ(π_m) and standard deviation σ(π_m) of the class-cluster uncertainty entropy of each clustering into its two weights W(Ψ(π_m)) and W(σ(π_m)), so that both weights take values in (0, 1];
6) set a threshold α for the weight W(Ψ(π_m)) and a threshold β for the weight W(σ(π_m));
7) compare W(Ψ(π_m)) with α and W(σ(π_m)) with β for every candidate clustering computed in step 5);
If a clustering π_m ∈ Π satisfies (W(Ψ(π_m)) > α) ∧ (W(σ(π_m)) < β), retain that clustering and change its uncertain label to a determined label; otherwise delete it. If r equals k, denote the number of retained clusterings carrying the determined label as M and jump to step 9);
8) denote the number of retained candidate clusterings from step 7) as t. If t equals the required number of clusterings M, go to step 9); otherwise generate M − t further candidate clusterings with a traditional data clustering method, combine them with the retained determined candidates to form M new candidate clusterings, relabel all M newest candidates as uncertain, and denote them Π = {π_1, π_2, ..., π_M}. Add 1 to the control parameter r and return to step 2);
9) obtain the final base clustering set Π = {π_1, π_2, ..., π_M}.
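A sketch of step (2) under the entropy reconstruction above; the helper names and the regenerate callback are assumptions, and the loop mirrors steps 1)-9):

```python
import numpy as np

def cluster_entropy(labels_m, all_labels):
    """Psi(C_n^m): uncertainty entropy of each class cluster in pi_m
    relative to every clustering in the current pool Pi."""
    ent = []
    for c in np.unique(labels_m):
        members = labels_m == c
        h = 0.0
        for labels_mu in all_labels:
            sub = labels_mu[members]
            for c2 in np.unique(sub):
                p = np.mean(sub == c2)        # |C ∩ C'| / |C|
                h -= p * np.log(p)
        ent.append(h)
    return np.array(ent)

def select_base_clusterings(M, k_max, alpha, beta, regenerate):
    """Step (2) sketch: incremental filtering of candidate base clusterings.
    `regenerate(count)` must return `count` fresh candidate label arrays."""
    candidates = regenerate(M)                  # all start 'uncertain'
    determined = []
    for r in range(k_max):
        determined = []
        for labels in candidates:
            psi = cluster_entropy(labels, candidates)
            # W(*) = exp(-*) maps the entropy mean / std into (0, 1]
            if np.exp(-psi.mean()) > alpha and np.exp(-psi.std()) < beta:
                determined.append(labels)       # 'uncertain' -> 'determined'
        if len(determined) == M:                # all candidates retained
            break
        # top the pool back up to M; the new pool is relabeled 'uncertain'
        candidates = determined + regenerate(M - len(determined))
    return determined                           # final base clustering set
```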
The clustering ensemble based on the final base clustering set described in step (3) of the present invention proceeds as follows (a code sketch follows the list):
1) based on the weight information of the class clusters in the base clusterings obtained above, for any two elements of the data set D, count over all base clusterings their occurrences in the same class cluster, each multiplied by the weight of that class cluster, and take the sum as the weighted integrated distance between the two elements:

$$Dis(d_i, d_j) = \sum_{m=1}^{M} w_i^m\,\Phi_{ij}^m$$

where d_i ∈ D, d_j ∈ D and d_i ≠ d_j;
C_n^m is the class cluster containing d_i in clustering π_m, written d_i ∈ C_n^m, C_n^m ∈ π_m, n ∈ [1, N_m];
w_i^m is the weight of the class cluster C_n^m containing d_i (under the entropy weighting above, w_i^m = e^{-Ψ(C_n^m)});
Φ_ij^m = 1 if, when d_i ∈ C_n^m, d_j also belongs to the class cluster C_n^m of clustering π_m;
Φ_ij^m = 0 if, when d_i ∈ C_n^m, d_j does not belong to the class cluster C_n^m of clustering π_m.
2) the data set D={ d obtained based on step 1)1,d2,…dxIn cum rights between any two elements integrate distance Dis(di,dj) as the clustering distance between element in hierarchy clustering method.Hierarchical clustering is implemented to data set D, and is obtained last Clustering ensemble output.
Compared with the classical clustering ensemble method based on cluster-weight information entropy, the present invention proposes a novel clustering ensemble method that optimizes the base clusterings through information entropy weight incremental learning. The invention uses information entropy to compute the degree of association between the class clusters of the base clusterings and the stability of each base clustering itself, and derives effective clustering weights from them. By introducing an incremental learning strategy, low-weight base clusterings of poor stability are deleted, effectively reducing the noise that unstable base clusterings inject into the clustering result; combining the traditional co-association matrix construction, a weighted co-association matrix is built after the base clustering optimization, which effectively measures the degree of correlation between any two elements and yields a more accurate clustering ensemble result.
Detailed description of the invention
Fig. 1 is the basic flow chart of the invention.
Fig. 2 is the flow chart of base clustering selection by the incremental learning idea.
Fig. 3 is the flow chart of the class-cluster weighting clustering ensemble method based on the incremental learning idea.
Specific embodiment
The present invention is further described by the following embodiment.
The data clustering ensemble method based on the information entropy weight incremental learning strategy described in this embodiment consists of the following steps (an end-to-end sketch follows the list):
(1) using the Iris data set from UCI (https://archive.ics.uci.edu/ml/datasets/Iris/), generate 20 candidate clusterings (all labeled as uncertain) with the K-means clustering method, each partitioning the data into 5 class clusters;
(2) compute, by the information entropy method, the weight of each class cluster of each uncertain candidate clustering relative to the remaining 19 clusterings.
(3) compute the average and the standard deviation of the class-cluster weights of each candidate clustering as that clustering's two weight indices.
(4) set the average-value threshold α = 0.6 and the standard-deviation threshold β = 0.3. When the average of a candidate clustering's class-cluster weights is greater than the threshold 0.6 and its standard deviation is less than the threshold 0.3, retain that clustering and mark it as determined (i.e. both weight indices of the candidate satisfy the retention condition); if the condition is not met, delete the clustering.
(5) set the iteration limit to 10. After the first round, 5 determined clusterings are retained and 15 uncertain clusterings are eliminated. Following the incremental learning idea, 15 new candidate clusterings are generated with the K-means method, the new set of 20 candidates is collectively relabeled as uncertain, and the procedure returns to step (2) for the second round. After the second round 9 determined clusterings are retained and 11 are eliminated; 11 new candidates are generated in the same way for the third round. After the third round 13 determined clusterings are retained and 7 are eliminated; 7 new candidates are generated for the fourth round. After the fourth round 16 determined clusterings are retained and 4 are eliminated; 4 new candidates are generated for the fifth round. After the fifth round 19 determined clusterings are retained and 1 is eliminated; 1 new candidate is generated for the sixth round. After the sixth round the number of determined clusterings is 20, and the loop ends. These 20 determined clusterings form the final base clustering set.
(6) based on the class-cluster weight information of each determined clustering obtained above, count for any two elements of the data set their co-occurrences in the same class cluster of each clustering, multiplied by the weight of that class cluster, as the weighted integrated distance between the two elements.
(7) based on the weighted integrated distances between any two elements of the target data set, complete the final clustering with the classical hierarchical clustering method.
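A hedged end-to-end sketch of this embodiment, tying together the illustrative helpers sketched in the earlier sections (generate_base_clusterings, select_base_clusterings, cluster_entropy, weighted_coassociation, ensemble_cluster — all assumptions, not part of the patent):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
rng = np.random.RandomState(0)

# step (1): 20 candidate K-means clusterings with 5 class clusters each
regen = lambda cnt: generate_base_clusterings(
    X, cnt, n_clusters=5, seed=rng.randint(10**6))

# steps (2)-(5): alpha = 0.6, beta = 0.3, at most 10 rounds
base = select_base_clusterings(M=20, k_max=10, alpha=0.6, beta=0.3,
                               regenerate=regen)

# steps (6)-(7): weighted co-association matrix + hierarchical ensemble
ents = [cluster_entropy(lbl, base) for lbl in base]
S = weighted_coassociation(base, ents)
final_labels = ensemble_cluster(S, n_final=3)   # Iris has 3 true classes
```

Comparing final_labels against the true Iris species with the nmi helper above would reproduce the kind of evaluation reported in the conclusion.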
Conclusion: compared with the classical clustering ensemble method based on cluster-weight information entropy, the accuracy of the clustering result on the Iris data set using the method of the present invention is substantially improved (over 50 comparative experiments, the classical clustering ensemble algorithm achieved an average NMI of 0.51, while the method of the present invention averaged an NMI of 0.72). The present invention focuses on the degree of association between class clusters and the stability of the base clusterings themselves. By introducing the incremental learning strategy, base clusterings of poor stability are deleted, effectively reducing their noise influence on the clustering result; considering the association between class clusters, the correlation is computed through information entropy and converted into weights, and a weighted co-association matrix is constructed following the traditional co-association matrix, effectively measuring the degree of correlation between any two elements and yielding a more accurate clustering result.

Claims (4)

1. A data clustering ensemble method based on an information entropy weight incremental learning strategy, characterized by comprising the following steps:
(1) preliminary generation of the clustering ensemble members;
(2) processing of the base clustering members by a class-cluster weighting algorithm based on incremental learning;
(3) clustering ensemble based on the final base clustering set;
(4) end.
2. The data clustering ensemble method based on an information entropy weight incremental learning strategy according to claim 1, characterized in that step (1), the preliminary generation of clustering ensemble members, comprises:
1) applying a data clustering algorithm to the data set D = {d_1, d_2, ..., d_x};
2) setting the clustering-count control parameter m to an initial value of 1;
3) judging whether the control parameter m is less than or equal to the number of candidate clusterings M; if so, executing step 4), otherwise going to step 6);
4) performing the m-th clustering, denoted π_m = {C_1^m, C_2^m, ..., C_{N_m}^m}, where the cardinality of the set π_m, |π_m|, is denoted N_m;
5) executing m = m + 1 and going to step 3);
6) generating the candidate clustering set Π = {π_1, π_2, ..., π_M} and attaching an "uncertain" label to every candidate clustering in Π;
7) end.
3. The data clustering ensemble method based on an information entropy weight incremental learning strategy according to claim 1, characterized in that step (2), processing the base clustering members with the class-cluster weighting algorithm based on incremental learning, comprises:
1) setting the control parameter r to an initial value of 1, and setting the iteration limit k;
2) for each class cluster C_n^m of any uncertain-labeled clustering π_m ∈ Π = {π_1, π_2, ..., π_M}, computing its uncertainty information entropy relative to all clusterings in Π:

$$\Psi(C_n^m) = -\sum_{\mu=1}^{M}\sum_{j=1}^{N_\mu} \frac{|C_n^m \cap C_j^\mu|}{|C_n^m|}\,\log\frac{|C_n^m \cap C_j^\mu|}{|C_n^m|}$$

where 1 ≤ m ≤ M, 1 ≤ n ≤ N_m, 1 ≤ μ ≤ M, 1 ≤ j ≤ N_μ, and |*| is the number of elements of set *;
3) computing the average of Ψ(C_n^m) over the class clusters of π_m obtained in step 2):

$$\Psi(\pi_m)=\frac{1}{N_m}\sum_{n=1}^{N_m}\Psi(C_n^m),\qquad N_m=|\pi_m|;$$

4) computing the standard deviation of the Ψ(C_n^m) of the candidate clustering π_m obtained in step 2):

$$\sigma(\pi_m)=\sqrt{\frac{1}{N_m}\sum_{n=1}^{N_m}\big(\Psi(C_n^m)-\Psi(\pi_m)\big)^2};$$

5) using the formula W(*) = e^{-*} to normalize the average Ψ(π_m) and standard deviation σ(π_m) of the class-cluster uncertainty entropy of each clustering into its two weights W(Ψ(π_m)) and W(σ(π_m)), so that both weights take values in (0, 1];
6) setting a threshold α for the weight W(Ψ(π_m)) and a threshold β for the weight W(σ(π_m));
7) comparing W(Ψ(π_m)) with α and W(σ(π_m)) with β for every candidate clustering computed in step 5);
if a clustering π_m ∈ Π satisfies (W(Ψ(π_m)) > α) ∧ (W(σ(π_m)) < β), retaining that clustering and changing its uncertain label to a determined label, otherwise deleting it; if r equals k, denoting the number of retained clusterings carrying the determined label as M and jumping to step 9);
8) denoting the number of retained candidate clusterings from step 7) as t; if t equals the required number of clusterings M, going to step 9); otherwise generating M − t further candidate clusterings with a traditional data clustering method, combining them with the retained determined candidates to form M new candidate clusterings, relabeling all M newest candidates as uncertain, denoting them Π = {π_1, π_2, ..., π_M}, adding 1 to the control parameter r, and returning to step 2);
9) obtaining the final base clustering set Π = {π_1, π_2, ..., π_M}.
4. The data clustering ensemble method based on an information entropy weight incremental learning strategy according to claim 1, characterized in that step (3), the clustering ensemble based on the final base clustering set, comprises:
1) based on the weight information of the class clusters in the base clusterings obtained above, counting for any two elements of the data set D their occurrences in the same class cluster of each clustering, each multiplied by the weight of that class cluster, and taking the sum as the weighted integrated distance between the two elements:

$$Dis(d_i, d_j) = \sum_{m=1}^{M} w_i^m\,\Phi_{ij}^m$$

where d_i ∈ D, d_j ∈ D and d_i ≠ d_j;
C_n^m is the class cluster containing d_i in clustering π_m, written d_i ∈ C_n^m, C_n^m ∈ π_m, n ∈ [1, N_m];
w_i^m is the weight of the class cluster C_n^m containing d_i;
Φ_ij^m = 1 if, when d_i ∈ C_n^m, d_j also belongs to the class cluster C_n^m of clustering π_m;
Φ_ij^m = 0 if, when d_i ∈ C_n^m, d_j does not belong to the class cluster C_n^m of clustering π_m;
2) using the weighted integrated distances Dis(d_i, d_j) between any two elements of the data set D = {d_1, d_2, ..., d_x} obtained in step 1) as the inter-element clustering distance of a hierarchical clustering method, applying hierarchical clustering to the data set D, and obtaining the final clustering ensemble output.
CN201810810646.4A 2018-07-23 2018-07-23 A data clustering ensemble method based on an information entropy weight incremental learning strategy Pending CN109241991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810810646.4A CN109241991A (en) 2018-07-23 2018-07-23 A data clustering ensemble method based on an information entropy weight incremental learning strategy


Publications (1)

Publication Number Publication Date
CN109241991A true CN109241991A (en) 2019-01-18

Family

ID=65072916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810810646.4A Pending CN109241991A (en) 2018-07-23 2018-07-23 A data clustering ensemble method based on an information entropy weight incremental learning strategy

Country Status (1)

Country Link
CN (1) CN109241991A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096805A (en) * 2016-05-10 2016-11-09 华北电力大学 A kind of residential electricity consumption load classification method based on entropy assessment feature selection
CN107480694A (en) * 2017-07-06 2017-12-15 重庆邮电大学 Three clustering methods are integrated using the weighting selection evaluated twice based on Spark platforms

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619107A (en) * 2019-08-07 2019-12-27 南昌大学 Lstm and Gcforest algorithm mixed reinforcement learning distribution network transformer load prediction method
CN110619107B (en) * 2019-08-07 2022-04-12 南昌大学 Lstm and Gcforest algorithm mixed reinforcement learning distribution network transformer load prediction method

Similar Documents

Publication Publication Date Title
Elbatta et al. A dynamic method for discovering density varied clusters
Xie et al. A new internal index based on density core for clustering validation
CN106067034B (en) Power distribution network load curve clustering method based on high-dimensional matrix characteristic root
CN105354588A (en) Decision tree construction method
CN109669990A (en) A kind of innovatory algorithm carrying out Outliers mining to density irregular data based on DBSCAN
CN103699678A (en) Hierarchical clustering method and system based on multistage layered sampling
Cheng et al. Searching dimension incomplete databases
Jiang et al. Classification methods of remote sensing image based on decision tree technologies
Indira et al. Performance analysis of genetic algorithm for mining association rules
CN108154185A (en) A kind of k-means clustering methods of secret protection
Qin et al. Associative classifier for uncertain data
CN109241991A (en) A kind of data clusters integrated approach based on comentropy weight incremental learning strategy
CN116186757A (en) Method for publishing condition feature selection differential privacy data with enhanced utility
Chaturvedi et al. An improvement in K-mean clustering algorithm using better time and accuracy
CN105631465A (en) Density peak-based high-efficiency hierarchical clustering method
CN109190659A (en) A kind of data integration clustering method based on three decision strategies of comentropy weight
CN110287992A (en) Agricultural features information extracting method based on big data
CN110533111A (en) A kind of adaptive K mean cluster method based on local density Yu ball Hash
CN109241992A (en) A kind of data clusters integrated approach based on two decision optimizations of comentropy weight
CN108319658A (en) A kind of improvement Apriori algorithm based on desert steppe
Yong et al. Short-term building load forecasting based on similar day selection and LSTM network
CN108717551A (en) A kind of fuzzy hierarchy clustering method based on maximum membership degree
CN113780864B (en) Key ecological hydrological index identification method for influencing spawning of four major Chinese carps
CN109522750A (en) A kind of new k anonymity realization method and system
CN108681576A (en) A kind of data digging method based on Quality of Safflower decision tree

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190118