CN109241991A - A data clustering ensemble method based on an information-entropy-weight incremental learning strategy - Google Patents
A data clustering ensemble method based on an information-entropy-weight incremental learning strategy
- Publication number
- CN109241991A (application number CN201810810646.4A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- clusters
- weight
- clustering
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
A data clustering ensemble method based on an information-entropy-weight incremental learning strategy, comprising the following steps: (1) generation of the clustering ensemble members; (2) processing of the base clusterings by a partial-weight algorithm based on the incremental learning idea; (3) clustering ensemble based on the final base clustering set; (4) end. The invention improves the noise resistance and robustness of the clustering ensemble and the quality of its final result.
Description
Technical field
The invention belongs to the field of data mining and machine learning, and relates to clustering methods for research objects or data.
Background art
The clustering ensemble method is an effective clustering technique, and the ensemble strategy based on cluster-weight information entropy is an effective novel clustering ensemble method within it. However, this method suffers from sensitivity of the final clustering result to the quality of the base clusterings.
The main idea of the ensemble strategy based on cluster-weight information entropy is as follows: for a given data set, a common clustering method is applied M times to generate M base clusterings, and the information-entropy method determines the weight of each base clustering and of each of its class clusters. A weighted distance matrix over the elements of the data set is then constructed from these base clusterings and their weights. Based on the weighted distance matrix, traditional hierarchical clustering or another classical clustering method repeatedly aggregates groups of high-weight elements until the final clustering is obtained.
The NMI index is commonly used to judge the quality of a clustering result; its value lies between 0 and 1, and a larger value indicates a better clustering. Choosing a clustering result π_G as the judgment criterion, the index of a test clustering π' is computed as

NMI(π', π_G) = (Σ_{i=1}^{n'} Σ_{j=1}^{n_G} n_{ij} log(n·n_{ij} / (n'_i · n_j^G))) / √((Σ_{i=1}^{n'} n'_i log(n'_i/n)) · (Σ_{j=1}^{n_G} n_j^G log(n_j^G/n)))

where n is the total number of elements, n' is the number of class clusters in π', n_G is the number of class clusters in π_G, n'_i is the number of elements of the i-th class cluster in π', n_j^G is the number of elements of the j-th class cluster in π_G, and n_{ij} is the number of elements common to the i-th class cluster in π' and the j-th class cluster in π_G.
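Assuming the standard (Strehl–Ghosh style) definition of NMI above, the index can be computed directly with scikit-learn; the label arrays below are illustrative examples, not data from the patent.

```python
# Computing the NMI index between a test clustering pi_prime and a
# reference clustering pi_G, both given as flat label arrays.
from sklearn.metrics import normalized_mutual_info_score

pi_G = [0, 0, 1, 1, 2, 2]        # reference partition (judgment criterion)
pi_prime = [0, 0, 1, 1, 2, 2]    # a test clustering to evaluate

score = normalized_mutual_info_score(pi_G, pi_prime)
# identical partitions attain the maximum value 1.0
```

NMI is invariant to a relabelling of the clusters, which is why it suits comparing clusterings produced by independent runs.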
Summary of the invention
The base clusterings are the foundation of clustering ensemble techniques, and their quality is closely related to the quality of the final clustering result. Based on an information-entropy-weight incremental learning strategy, the invention proposes a new method for improving the quality of the base clustering results in clustering ensemble techniques based on cluster-weight information entropy, thereby effectively improving the quality of the clustering ensemble result.
The present invention is achieved by the following technical solutions.
The data clustering ensemble method based on an information-entropy-weight incremental learning strategy of the present invention is characterized by operating on the generated member clusterings and comprises the following steps:
(1) preliminary generation of the clustering ensemble members;
(2) processing of the base clustering members by a partial-weight algorithm based on the incremental learning idea;
(3) clustering ensemble based on the final base clustering set;
(4) end.
The preliminary generation of the clustering ensemble members in step (1) of the present invention comprises the following steps:
1) applying a data clustering algorithm to the data set D = {d_1, d_2, ..., d_x};
2) setting the clustering-count control parameter m to the initial value 1;
3) judging whether the control parameter m is less than or equal to the number of candidate clusterings M; if so, executing step 4), otherwise going to step 6);
4) obtaining the m-th clustering, denoted π_m = {C_1^m, C_2^m, ..., C_{N_m}^m}, where the cardinality |π_m| of the set π_m is denoted N_m;
5) executing m = m + 1 and returning to step 3);
6) generating the candidate clustering set Π = {π_1, π_2, ..., π_M} and attaching the label "uncertain" to all candidate clustering members;
7) end.
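The member-generation loop of step (1) can be sketched as follows; K-means as the base method, the member count M = 5, and the dictionary field names are illustrative choices of ours, not fixed by the patent.

```python
# Sketch of step (1): generate M candidate base clusterings of a data set D
# with a common clustering method (K-means here) and tag each one with the
# "uncertain" label. Synthetic data stands in for D.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
D = rng.normal(size=(60, 2))     # data set D = {d_1, ..., d_x}, x = 60
M = 5                            # number of candidate clusterings

candidates = []
for m in range(M):
    labels = KMeans(n_clusters=3, n_init=10, random_state=m).fit_predict(D)
    candidates.append({"labels": labels, "certain": False})  # "uncertain" label
```

Varying `random_state` per run is one simple way to obtain diverse members; the patent leaves the source of diversity to the chosen clustering algorithm.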
The processing of the base clustering members by the partial-weight algorithm based on the incremental learning idea in step (2) of the present invention comprises the following steps:
1) setting the control parameter r to the initial value 1 and setting the loop limit k;
2) for any class cluster C_n^m of a clustering π_m ∈ Π = {π_1, π_2, ..., π_M} carrying the label "uncertain", computing its uncertainty information entropy relative to all clusterings in Π:

H(C_n^m) = −(1/M) Σ_{μ=1}^{M} Σ_{j=1}^{N_μ} p_{n,j}^{m,μ} log p_{n,j}^{m,μ},  with p_{n,j}^{m,μ} = |C_n^m ∩ C_j^μ| / |C_n^m|,

where 1 ≤ m ≤ M, 1 ≤ n ≤ N_m, 1 ≤ μ ≤ M, 1 ≤ j ≤ N_μ, terms with p_{n,j}^{m,μ} = 0 are taken as 0, and |*| is the number of elements of the set *;
3) computing the average of the entropies obtained in step 2) over the class clusters of π_m:

Ψ(π_m) = (1/N_m) Σ_{n=1}^{N_m} H(C_n^m),  where N_m = |π_m|;

4) computing the standard deviation of these entropies for the candidate clustering π_m obtained in step 2):

σ(π_m) = √((1/N_m) Σ_{n=1}^{N_m} (H(C_n^m) − Ψ(π_m))²);

5) using the formula W(*) = e^{−*} to normalize the average Ψ(π_m) and standard deviation σ(π_m) of the class-cluster uncertainty entropies of each clustering into the two weights W(Ψ(π_m)) and W(σ(π_m)) of that clustering, so that both weights take values in (0, 1];
6) setting a threshold α for the weight W(Ψ(π_m)) and a threshold β for the weight W(σ(π_m));
7) comparing the W(Ψ(π_m)) of every candidate clustering computed in step 5) with α, and the W(σ(π_m)) with β;
if a clustering π_m ∈ Π satisfies (W(Ψ(π_m)) > α) ∧ (W(σ(π_m)) < β), retaining the clustering and changing its label from "uncertain" to "determined"; otherwise deleting the clustering; if r equals k, denoting the number of all retained clusterings possessing the "determined" label as M and jumping to step 9);
8) denoting the number of candidate clusterings retained in step 7) as t; if t equals the required number of clusterings M, going to step 9); otherwise generating M − t new candidate clusterings with a traditional data clustering method, forming, together with the retained determined candidate clusterings, M new candidate clusterings, labelling all of the newest M candidate clusterings as uncertain and denoting them Π = {π_1, π_2, ..., π_M}, adding 1 to the control parameter r, and returning to step 2);
9) obtaining the final base clustering set Π = {π_1, π_2, ..., π_M}.
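A hedged sketch of the step (2) selection loop, under our reading of the reconstructed formulas: per-class-cluster uncertainty entropy, the weights W(Ψ) and W(σ) via W(x) = e^(−x), the claim-style retention condition, and incremental regeneration of deleted clusterings. All helper names, and the choice of K-means as the regenerating method, are assumptions.

```python
# Sketch of step (2): entropy-weight scoring plus incremental regeneration.
import numpy as np
from sklearn.cluster import KMeans

def cluster_entropy(members, all_labelings):
    """Uncertainty entropy of one class cluster w.r.t. every clustering."""
    h = 0.0
    for lab in all_labelings:
        for j in np.unique(lab):
            p = np.mean(lab[members] == j)   # |C_n^m ∩ C_j^mu| / |C_n^m|
            if p > 0:
                h -= p * np.log(p)
    return h / len(all_labelings)

def select_base_clusterings(D, M, k, alpha=0.6, beta=0.3, n_clusters=3):
    labelings = [KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=m).fit_predict(D) for m in range(M)]
    for r in range(k):                       # loop limit k
        kept = []
        for lab in labelings:
            ents = [cluster_entropy(np.flatnonzero(lab == c), labelings)
                    for c in np.unique(lab)]
            w_mean = np.exp(-np.mean(ents))  # W(Ψ(π_m))
            w_std = np.exp(-np.std(ents))    # W(σ(π_m))
            if w_mean > alpha and w_std < beta:   # retention condition
                kept.append(lab)             # mark as "determined"
        if len(kept) == M:                   # every slot is determined
            return kept
        # incremental step: regenerate the deleted clusterings and retry
        fresh = [KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=M + M * r + i).fit_predict(D)
                 for i in range(M - len(kept))]
        labelings = kept + fresh
    return labelings                         # loop limit reached
```

In the patent the retained "determined" members are not re-scored in later rounds; this sketch re-evaluates everything for brevity.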
The clustering ensemble based on the final base clustering set in step (3) of the present invention comprises the following steps:
1) based on the weight information of the class clusters in the base clusterings obtained above, computing for any two elements of the data set D the number of times they appear in the same class cluster in each clustering, multiplied by the weight of that class cluster, as the weighted ensemble distance between the two elements:

Dis(d_i, d_j) = Σ_{m=1}^{M} w_i^m · Φ_{ij}^m,

where d_i ∈ D, d_j ∈ D and d_i ≠ d_j;
C_n^m is the class cluster containing d_i in the clustering π_m, written d_i ∈ C_n^m, C_n^m ∈ π_m, n ∈ [1, N_m];
w_i^m is the weight of the class cluster C_n^m;
Φ_{ij}^m = 1 if, when d_i ∈ C_n^m, d_j also belongs to the class cluster C_n^m of the clustering π_m;
Φ_{ij}^m = 0 if, when d_i ∈ C_n^m, d_j does not belong to the class cluster C_n^m of the clustering π_m;
2) using the weighted ensemble distance Dis(d_i, d_j) between any two elements of the data set D = {d_1, d_2, ..., d_x} obtained in step 1) as the clustering distance between elements in a hierarchical clustering method, applying hierarchical clustering to the data set D, and obtaining the final clustering ensemble output.
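The step (3) construction can be illustrated as follows; for brevity this sketch uses a uniform weight 1/M for every class cluster in place of the entropy weights, and converts the resulting weighted co-association similarity into a distance for classical hierarchical clustering. Function and variable names are ours, not the patent's.

```python
# Sketch of step (3): weighted co-association matrix -> distance ->
# hierarchical clustering -> final ensemble labels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ensemble_labels(labelings, n_final):
    labelings = np.asarray(labelings)        # shape (M, x)
    M, x = labelings.shape
    co = np.zeros((x, x))
    for lab in labelings:
        # Phi_ij^m is 1 when d_i and d_j share a class cluster; weight 1/M
        co += (lab[:, None] == lab[None, :]).astype(float) / M
    dist = 1.0 - co                          # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_final, criterion="maxclust")

labs = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
out = ensemble_labels(labs, 2)
# elements 0 and 1 end up together, and 2 and 3 together
```

Note the label values themselves differ across members (the third member swaps 0 and 1); the co-association construction only cares about co-membership, which is exactly why it is a natural consensus device.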
Compared with the classical clustering ensemble method based on cluster-weight information entropy, the invention proposes a novel clustering ensemble method that optimizes the base clusterings through information-entropy-weight incremental learning. The invention uses information entropy to compute the degree of association between the class clusters of the base clusterings and the stability of each base clustering itself, and bases effective clustering weights on them. By introducing the incremental learning strategy, low-weight base clusterings with poor stability are deleted, which effectively reduces the noise influence of base clustering stability on the clustering result; after base clustering optimization, a weighted co-association matrix is constructed in combination with the traditional co-association matrix construction, which effectively measures the degree of association between any two elements and thereby yields a more accurate clustering ensemble result.
Description of the drawings
Fig. 1 is the basic flow chart of the invention.
Fig. 2 is the flow chart of base clustering selection by the incremental learning idea.
Fig. 3 is the flow chart of the partial-weight clustering ensemble method based on the incremental learning idea.
Specific embodiment
The present invention is further described by the following embodiment.
The data clustering ensemble method based on the information-entropy-weight incremental learning strategy described in this embodiment comprises the following steps:
(1) using the iris data set of UCI (https://archive.ics.uci.edu/ml/datasets/iris/), generating 20 candidate clusterings with the K-means clustering method (all labeled as uncertain clusterings), each dividing the data into 5 class clusters;
(2) computing, by the information-entropy method, the weight of each class cluster of each uncertain candidate clustering relative to the remaining 19 clusterings;
(3) computing the average and standard deviation of the class-cluster weights of each candidate clustering as its two weight indices;
(4) setting the average threshold α = 0.6 and the standard-deviation threshold β = 0.3; when the average of the class-cluster weights of a candidate clustering is greater than the threshold 0.6 and their standard deviation is less than the threshold 0.3, retaining the clustering and marking it as determined (i.e. both weight indices of the candidate clustering satisfy the retention condition); deleting the clustering if the above condition is not met;
(5) setting the loop limit to 10. After the first loop, 5 determined clusterings are retained and 15 uncertain clusterings are eliminated. Following the incremental learning idea, 15 new candidate clusterings are generated with the K-means clustering method, the resulting 20 candidate clusterings are collectively labeled as uncertain, and the procedure returns to step (2) for the second loop. After the second loop, 9 determined clusterings are retained and 11 uncertain clusterings are eliminated; 11 new candidate clusterings are generated in the same way for the third loop. After the third loop, 13 determined clusterings are retained and 7 eliminated; 7 new candidate clusterings are generated for the fourth loop. After the fourth loop, 16 determined clusterings are retained and 4 eliminated; 4 new candidate clusterings are generated for the fifth loop. After the fifth loop, 19 determined clusterings are retained and 1 eliminated; 1 new candidate clustering is generated for the sixth loop. After the sixth loop, 20 determined clusterings are retained and the loop ends. These 20 determined clusterings form the final base clustering set;
(6) based on the weight information of the class clusters in each determined clustering obtained above, computing for any two elements of the data set the number of times they appear in the same class cluster in each clustering, multiplied by the weight of that class cluster, as the weighted ensemble distance between the two elements;
(7) completing the final clustering with a classical hierarchical clustering method based on the weighted ensemble distance between any two elements of the target data set.
Conclusion: compared with the classical clustering ensemble method based on cluster-weight information entropy, the accuracy of the clustering result obtained on the iris plant data set with the method of the present invention is considerably higher (over 50 comparative experiments, the average NMI index of the classical clustering ensemble algorithm is 0.51, while the average NMI index of the method of the present invention is 0.72). The present invention focuses on the degree of association between class clusters and the stability of each base clustering itself. By introducing the incremental learning strategy, base clusterings with poor stability are deleted, effectively reducing the noise influence of base clustering stability on the clustering result; considering the association between class clusters, the association is computed by information entropy and converted into weights, and a weighted co-association matrix is built in combination with the traditional co-association matrix construction, effectively measuring the degree of association between any two elements and yielding a more accurate clustering result.
Claims (4)
1. A data clustering ensemble method based on an information-entropy-weight incremental learning strategy, characterized by comprising the following steps:
(1) preliminary generation of the clustering ensemble members;
(2) processing of the base clustering members by a partial-weight algorithm based on the incremental learning idea;
(3) clustering ensemble based on the final base clustering set;
(4) end.
2. The data clustering ensemble method based on an information-entropy-weight incremental learning strategy according to claim 1, characterized in that the preliminary generation of the clustering ensemble members in step (1) comprises:
1) applying a data clustering algorithm to the data set D = {d_1, d_2, ..., d_x};
2) setting the clustering-count control parameter m to the initial value 1;
3) judging whether the control parameter m is less than or equal to the number of candidate clusterings M; if so, executing step 4), otherwise going to step 6);
4) obtaining the m-th clustering, denoted π_m = {C_1^m, C_2^m, ..., C_{N_m}^m}, where the cardinality |π_m| of the set π_m is denoted N_m;
5) executing m = m + 1 and returning to step 3);
6) generating the candidate clustering set Π = {π_1, π_2, ..., π_M} and attaching the label "uncertain" to all candidate clustering members;
7) end.
3. The data clustering ensemble method based on an information-entropy-weight incremental learning strategy according to claim 1, characterized in that the processing of the base clustering members by the partial-weight algorithm based on the incremental learning idea in step (2) comprises:
1) setting the control parameter r to the initial value 1 and setting the loop limit k;
2) for any class cluster C_n^m of a clustering π_m ∈ Π = {π_1, π_2, ..., π_M} carrying the label "uncertain", computing its uncertainty information entropy relative to all clusterings in Π:

H(C_n^m) = −(1/M) Σ_{μ=1}^{M} Σ_{j=1}^{N_μ} p_{n,j}^{m,μ} log p_{n,j}^{m,μ},  with p_{n,j}^{m,μ} = |C_n^m ∩ C_j^μ| / |C_n^m|,

where 1 ≤ m ≤ M, 1 ≤ n ≤ N_m, 1 ≤ μ ≤ M, 1 ≤ j ≤ N_μ, terms with p_{n,j}^{m,μ} = 0 are taken as 0, and |*| is the number of elements of the set *;
3) computing the average of the entropies obtained in step 2) over the class clusters of π_m:

Ψ(π_m) = (1/N_m) Σ_{n=1}^{N_m} H(C_n^m),  where N_m = |π_m|;

4) computing the standard deviation of these entropies for the candidate clustering π_m obtained in step 2):

σ(π_m) = √((1/N_m) Σ_{n=1}^{N_m} (H(C_n^m) − Ψ(π_m))²);

5) using the formula W(*) = e^{−*} to normalize the average Ψ(π_m) and standard deviation σ(π_m) of the class-cluster uncertainty entropies of each clustering into the two weights W(Ψ(π_m)) and W(σ(π_m)) of that clustering, so that both weights take values in (0, 1];
6) setting a threshold α for the weight W(Ψ(π_m)) and a threshold β for the weight W(σ(π_m));
7) comparing the W(Ψ(π_m)) of every candidate clustering computed in step 5) with α, and the W(σ(π_m)) with β;
if a clustering π_m ∈ Π satisfies (W(Ψ(π_m)) > α) ∧ (W(σ(π_m)) < β), retaining the clustering and changing its label from "uncertain" to "determined"; otherwise deleting the clustering; if r equals k, denoting the number of all retained clusterings possessing the "determined" label as M and jumping to step 9);
8) denoting the number of candidate clusterings retained in step 7) as t; if t equals the required number of clusterings M, going to step 9); otherwise generating M − t new candidate clusterings with a traditional data clustering method, forming, together with the retained determined candidate clusterings, M new candidate clusterings, labelling all of the newest M candidate clusterings as uncertain and denoting them Π = {π_1, π_2, ..., π_M}, adding 1 to the control parameter r, and returning to step 2);
9) obtaining the final base clustering set Π = {π_1, π_2, ..., π_M}.
4. The data clustering ensemble method based on an information-entropy-weight incremental learning strategy according to claim 1, characterized in that the clustering ensemble based on the final base clustering set in step (3) comprises:
1) based on the weight information of the class clusters in the base clusterings obtained above, computing for any two elements of the data set D the number of times they appear in the same class cluster in each clustering, multiplied by the weight of that class cluster, as the weighted ensemble distance between the two elements:

Dis(d_i, d_j) = Σ_{m=1}^{M} w_i^m · Φ_{ij}^m,

where d_i ∈ D, d_j ∈ D and d_i ≠ d_j;
C_n^m is the class cluster containing d_i in the clustering π_m, written d_i ∈ C_n^m, C_n^m ∈ π_m, n ∈ [1, N_m];
w_i^m is the weight of the class cluster C_n^m;
Φ_{ij}^m = 1 if, when d_i ∈ C_n^m, d_j also belongs to the class cluster C_n^m of the clustering π_m;
Φ_{ij}^m = 0 if, when d_i ∈ C_n^m, d_j does not belong to the class cluster C_n^m of the clustering π_m;
2) using the weighted ensemble distance Dis(d_i, d_j) between any two elements of the data set D = {d_1, d_2, ..., d_x} obtained in step 1) as the clustering distance between elements in a hierarchical clustering method, applying hierarchical clustering to the data set D, and obtaining the final clustering ensemble output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810810646.4A CN109241991A (en) | 2018-07-23 | 2018-07-23 | A kind of data clusters integrated approach based on comentropy weight incremental learning strategy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109241991A true CN109241991A (en) | 2019-01-18 |
Family
ID=65072916
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810810646.4A Pending CN109241991A (en) | 2018-07-23 | 2018-07-23 | A kind of data clusters integrated approach based on comentropy weight incremental learning strategy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241991A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106096805A (en) * | 2016-05-10 | 2016-11-09 | 华北电力大学 | A kind of residential electricity consumption load classification method based on entropy assessment feature selection |
CN107480694A (en) * | 2017-07-06 | 2017-12-15 | 重庆邮电大学 | Three clustering methods are integrated using the weighting selection evaluated twice based on Spark platforms |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110619107A (en) * | 2019-08-07 | 2019-12-27 | 南昌大学 | Lstm and Gcforest algorithm mixed reinforcement learning distribution network transformer load prediction method |
CN110619107B (en) * | 2019-08-07 | 2022-04-12 | 南昌大学 | Lstm and Gcforest algorithm mixed reinforcement learning distribution network transformer load prediction method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | Application publication date: 20190118
| SE01 | Entry into force of request for substantive examination |
| WD01 | Invention patent application deemed withdrawn after publication |