CN109241991A - A data clustering ensemble method based on an information entropy weight incremental learning strategy - Google Patents

A data clustering ensemble method based on an information entropy weight incremental learning strategy

Info

Publication number
CN109241991A
CN109241991A (application CN201810810646.4A)
Authority
CN
China
Prior art keywords
cluster
clusters
weight
clustering
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810810646.4A
Other languages
Chinese (zh)
Inventor
徐健锋
薛国泽
刘斓
梁伟
严方圆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN201810810646.4A
Publication of CN109241991A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data clustering ensemble method based on an information entropy weight incremental learning strategy, comprising the following steps: (1) generating the clustering ensemble members; (2) processing the base clusterings with a class-cluster weighting algorithm based on the incremental learning idea; (3) performing the clustering ensemble on the final base clustering set; (4) ending. The present invention improves the interference resistance and robustness of the clustering ensemble and the quality of the final ensemble result.

Description

A data clustering ensemble method based on an information entropy weight incremental learning strategy
Technical field
The invention belongs to the field of data mining and machine learning, and relates to clustering methods for research objects or data.
Background technique
Clustering ensemble is an effective clustering technique, and the integration strategy based on cluster-weight information entropy is an effective novel ensemble method within it. However, this method suffers from the sensitivity of the clustering result to the quality of the base clusterings.
The main procedure of the integration strategy based on cluster-weight information entropy is as follows: for a particular data set, a common clustering method is applied M times to produce M base clusterings, and an information entropy method is used to determine the weight of each base clustering and of each of its class clusters. A weighted distance matrix over the elements of the data set is then constructed from the base clusterings and their weights. Finally, based on the weighted distance matrix, traditional hierarchical clustering or another classical clustering method is applied, continuously merging groups of high-weight elements until the final clustering is obtained.
The NMI index is commonly used to judge the quality of a clustering result; NMI lies between 0 and 1, and a larger value indicates a better clustering. Choosing a clustering result π_G as the reference standard, the NMI of a test clustering π' is computed (in its standard form) as:

$$\mathrm{NMI}(\pi',\pi_G)=\frac{\displaystyle\sum_{i=1}^{n'}\sum_{j=1}^{n_G} n_{ij}\log\frac{n\,n_{ij}}{n'_i\,n^G_j}}{\sqrt{\Big(\displaystyle\sum_{i=1}^{n'} n'_i\log\frac{n'_i}{n}\Big)\Big(\displaystyle\sum_{j=1}^{n_G} n^G_j\log\frac{n^G_j}{n}\Big)}}$$

where n is the total number of elements, n' is the number of class clusters in π', n_G is the number of class clusters in π_G, n'_i is the number of elements of the i-th class cluster in π', n_j^G is the number of elements of the j-th class cluster in π_G, and n_ij is the number of elements common to the i-th class cluster of π' and the j-th class cluster of π_G.
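For illustration only, a minimal NumPy sketch of this formula; the function name and the label-array interface are assumptions, not part of the patent:

```python
import numpy as np

def nmi(labels_test, labels_ref):
    """NMI between a test partition pi' and a reference partition pi_G.
    Both arguments are integer cluster labels per element; result in [0, 1]."""
    labels_test = np.asarray(labels_test)
    labels_ref = np.asarray(labels_ref)
    n = labels_test.size
    # contingency table: n_ij = elements shared by cluster i of pi' and j of pi_G
    _, ti = np.unique(labels_test, return_inverse=True)
    _, ri = np.unique(labels_ref, return_inverse=True)
    cont = np.zeros((ti.max() + 1, ri.max() + 1))
    np.add.at(cont, (ti, ri), 1)
    ni = cont.sum(axis=1)                 # n'_i
    nj = cont.sum(axis=0)                 # n_j^G
    nz = cont > 0                         # skip empty cells (0 log 0 = 0)
    mi = (cont[nz] / n * np.log(n * cont[nz] / np.outer(ni, nj)[nz])).sum()
    h_test = -(ni / n * np.log(ni / n)).sum()
    h_ref = -(nj / n * np.log(nj / n)).sum()
    return mi / np.sqrt(h_test * h_ref) if h_test > 0 and h_ref > 0 else 0.0
```

For reference, scikit-learn's sklearn.metrics.normalized_mutual_info_score computes the same quantity; pass average_method='geometric' to match the square-root normalization above.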
Summary of the invention
Base clusterings are the foundation of clustering ensemble techniques, and their quality is closely related to the quality of the final clustering result. Based on an information entropy weight incremental learning strategy, the present invention proposes a new method for improving the quality of the base clusterings in clustering ensemble techniques based on cluster-weight information entropy, thereby effectively improving the quality of the clustering ensemble result.
The present invention is achieved by the following technical solutions.
The data clustering ensemble method based on an information entropy weight incremental learning strategy according to the present invention is characterized by operating on the member clusterings, comprising the following steps:
(1) preliminary generation of the clustering ensemble members;
(2) processing of the base clustering members by a class-cluster weighting algorithm based on incremental learning;
(3) clustering ensemble based on the final base clustering set;
(4) end.
Step (1) of the present invention, the preliminary generation of clustering ensemble members, proceeds as follows (a code sketch, under stated assumptions, follows the list):
1) apply a data clustering algorithm to the data set D = {d_1, d_2, ..., d_x};
2) set the clustering-count control parameter m to an initial value of 1;
3) judge whether the control parameter m is less than or equal to the number of candidate clusterings M; if so, execute step 4), otherwise go to step 6);
4) perform the m-th clustering, denoted π_m = {C_1^m, C_2^m, ..., C_{N_m}^m}, where the cardinality of the set π_m, |π_m|, is denoted N_m;
5) execute m = m + 1 and go to step 3);
6) generate the candidate clustering set Π = {π_1, π_2, ..., π_M} and attach an "uncertain" label to every candidate clustering in Π;
7) end.
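A minimal sketch of step (1), assuming scikit-learn's KMeans as the data clustering algorithm; the function name and seeding scheme are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_base_clusterings(D, M, n_clusters, seed=0):
    """Step (1) sketch: run K-means M times to produce M candidate
    base clusterings; every candidate starts out labeled 'uncertain'."""
    rng = np.random.RandomState(seed)
    candidates = []
    for m in range(M):
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=rng.randint(2**31 - 1))
        candidates.append(km.fit_predict(D))   # pi_m as a label array
    return candidates                          # Pi = {pi_1, ..., pi_M}
```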
Step (2) of the present invention, processing the base clustering members with the class-cluster weighting algorithm based on incremental learning, proceeds as follows (a code sketch follows the list):
1) set the control parameter r to an initial value of 1, and set the iteration limit k;
2) Π={ π is calculated12,...,πMIn indicate any one class clusters of uncertain labeled clustersπm∈ Π, relative to the uncertain information entropy of clusters all in Π, its calculation formula is:Wherein 1≤m≤M, 1≤n≤NM1≤μ≤M, 1≤ j≤NM,| * | it is the element number of set *;
3) compute the average of Ψ(C_n^m) over the class clusters of π_m obtained in step 2):

$$\Psi(\pi_m)=\frac{1}{N_m}\sum_{n=1}^{N_m}\Psi(C_n^m),\qquad N_m=|\pi_m|;$$

4) compute the standard deviation of the Ψ(C_n^m) of the candidate clustering π_m obtained in step 2):

$$\sigma(\pi_m)=\sqrt{\frac{1}{N_m}\sum_{n=1}^{N_m}\big(\Psi(C_n^m)-\Psi(\pi_m)\big)^2};$$
5) use the formula W(*) = e^{-*} to normalize the average Ψ(π_m) and standard deviation σ(π_m) of the class-cluster uncertainty entropy of each clustering into its two weights W(Ψ(π_m)) and W(σ(π_m)), so that both weights take values in (0, 1];
6) set a threshold α for the weight W(Ψ(π_m)) and a threshold β for the weight W(σ(π_m));
7) compare W(Ψ(π_m)) with α and W(σ(π_m)) with β for every candidate clustering computed in step 5);
If a clustering π_m ∈ Π satisfies (W(Ψ(π_m)) > α) ∧ (W(σ(π_m)) < β), retain that clustering and change its uncertain label to a determined label; otherwise delete it. If r equals k, denote the number of retained clusterings carrying the determined label as M and jump to step 9);
8) denote the number of retained candidate clusterings from step 7) as t. If t equals the required number of clusterings M, go to step 9); otherwise generate M − t further candidate clusterings with a traditional data clustering method, combine them with the retained determined candidates to form M new candidate clusterings, relabel all M newest candidates as uncertain, and denote them Π = {π_1, π_2, ..., π_M}. Add 1 to the control parameter r and return to step 2);
9) obtain the final base clustering set Π = {π_1, π_2, ..., π_M}.
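A sketch of step (2) under the entropy reconstruction above; the helper names and the regenerate callback are assumptions, and the loop mirrors steps 1)-9):

```python
import numpy as np

def cluster_entropy(labels_m, all_labels):
    """Psi(C_n^m): uncertainty entropy of each class cluster in pi_m
    relative to every clustering in the current pool Pi."""
    ent = []
    for c in np.unique(labels_m):
        members = labels_m == c
        h = 0.0
        for labels_mu in all_labels:
            sub = labels_mu[members]
            for c2 in np.unique(sub):
                p = np.mean(sub == c2)        # |C ∩ C'| / |C|
                h -= p * np.log(p)
        ent.append(h)
    return np.array(ent)

def select_base_clusterings(M, k_max, alpha, beta, regenerate):
    """Step (2) sketch: incremental filtering of candidate base clusterings.
    `regenerate(count)` must return `count` fresh candidate label arrays."""
    candidates = regenerate(M)                  # all start 'uncertain'
    determined = []
    for r in range(k_max):
        determined = []
        for labels in candidates:
            psi = cluster_entropy(labels, candidates)
            # W(*) = exp(-*) maps the entropy mean / std into (0, 1]
            if np.exp(-psi.mean()) > alpha and np.exp(-psi.std()) < beta:
                determined.append(labels)       # 'uncertain' -> 'determined'
        if len(determined) == M:                # all candidates retained
            break
        # top the pool back up to M; the new pool is relabeled 'uncertain'
        candidates = determined + regenerate(M - len(determined))
    return determined                           # final base clustering set
```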
The clustering ensemble based on the final base clustering set described in step (3) of the present invention proceeds as follows (a code sketch follows the list):
1) based on the weight information of the class clusters in the base clusterings obtained above, for any two elements of the data set D, count over all base clusterings their occurrences in the same class cluster, each multiplied by the weight of that class cluster, and take the sum as the weighted integrated distance between the two elements:

$$Dis(d_i, d_j) = \sum_{m=1}^{M} w_i^m\,\Phi_{ij}^m$$

where d_i ∈ D, d_j ∈ D and d_i ≠ d_j;
C_n^m is the class cluster containing d_i in clustering π_m, written d_i ∈ C_n^m, C_n^m ∈ π_m, n ∈ [1, N_m];
w_i^m is the weight of the class cluster C_n^m containing d_i (under the entropy weighting above, w_i^m = e^{-Ψ(C_n^m)});
Φ_ij^m = 1 if, when d_i ∈ C_n^m, d_j also belongs to the class cluster C_n^m of clustering π_m;
Φ_ij^m = 0 if, when d_i ∈ C_n^m, d_j does not belong to the class cluster C_n^m of clustering π_m.
2) the data set D={ d obtained based on step 1)1,d2,…dxIn cum rights between any two elements integrate distance Dis(di,dj) as the clustering distance between element in hierarchy clustering method.Hierarchical clustering is implemented to data set D, and is obtained last Clustering ensemble output.
Compared with the classical clustering ensemble method based on cluster-weight information entropy, the present invention proposes a novel clustering ensemble method that optimizes the base clusterings through information entropy weight incremental learning. The invention uses information entropy to compute the degree of association between the class clusters of the base clusterings and the stability of each base clustering itself, and derives effective clustering weights from them. By introducing an incremental learning strategy, low-weight base clusterings of poor stability are deleted, effectively reducing the noise that unstable base clusterings inject into the clustering result; combining the traditional co-association matrix construction, a weighted co-association matrix is built after the base clustering optimization, which effectively measures the degree of correlation between any two elements and yields a more accurate clustering ensemble result.
Detailed description of the invention
Fig. 1 is the basic flow chart of the invention.
Fig. 2 is the flow chart of base clustering selection by the incremental learning idea.
Fig. 3 is the flow chart of the class-cluster weighting clustering ensemble method based on the incremental learning idea.
Specific embodiment
The present invention is further described by the following embodiment.
The data clustering ensemble method based on the information entropy weight incremental learning strategy described in this embodiment consists of the following steps (an end-to-end sketch follows the list):
(1) using the Iris data set from UCI (https://archive.ics.uci.edu/ml/datasets/Iris/), generate 20 candidate clusterings (all labeled as uncertain) with the K-means clustering method, each partitioning the data into 5 class clusters;
(2) compute, by the information entropy method, the weight of each class cluster of each uncertain candidate clustering relative to the remaining 19 clusterings.
(3) compute the average and the standard deviation of the class-cluster weights of each candidate clustering as that clustering's two weight indices.
(4) set the average-value threshold α = 0.6 and the standard-deviation threshold β = 0.3. When the average of a candidate clustering's class-cluster weights is greater than the threshold 0.6 and its standard deviation is less than the threshold 0.3, retain that clustering and mark it as determined (i.e. both weight indices of the candidate satisfy the retention condition); if the condition is not met, delete the clustering.
(5) set the iteration limit to 10. After the first round, 5 determined clusterings are retained and 15 uncertain clusterings are eliminated. Following the incremental learning idea, 15 new candidate clusterings are generated with the K-means method, the new set of 20 candidates is collectively relabeled as uncertain, and the procedure returns to step (2) for the second round. After the second round 9 determined clusterings are retained and 11 are eliminated; 11 new candidates are generated in the same way for the third round. After the third round 13 determined clusterings are retained and 7 are eliminated; 7 new candidates are generated for the fourth round. After the fourth round 16 determined clusterings are retained and 4 are eliminated; 4 new candidates are generated for the fifth round. After the fifth round 19 determined clusterings are retained and 1 is eliminated; 1 new candidate is generated for the sixth round. After the sixth round the number of determined clusterings is 20, and the loop ends. These 20 determined clusterings form the final base clustering set.
(6) based on the class-cluster weight information of each determined clustering obtained above, count for any two elements of the data set their co-occurrences in the same class cluster of each clustering, multiplied by the weight of that class cluster, as the weighted integrated distance between the two elements.
(7) based on the weighted integrated distances between any two elements of the target data set, complete the final clustering with the classical hierarchical clustering method.
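A hedged end-to-end sketch of this embodiment, tying together the illustrative helpers sketched in the earlier sections (generate_base_clusterings, select_base_clusterings, cluster_entropy, weighted_coassociation, ensemble_cluster — all assumptions, not part of the patent):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
rng = np.random.RandomState(0)

# step (1): 20 candidate K-means clusterings with 5 class clusters each
regen = lambda cnt: generate_base_clusterings(
    X, cnt, n_clusters=5, seed=rng.randint(10**6))

# steps (2)-(5): alpha = 0.6, beta = 0.3, at most 10 rounds
base = select_base_clusterings(M=20, k_max=10, alpha=0.6, beta=0.3,
                               regenerate=regen)

# steps (6)-(7): weighted co-association matrix + hierarchical ensemble
ents = [cluster_entropy(lbl, base) for lbl in base]
S = weighted_coassociation(base, ents)
final_labels = ensemble_cluster(S, n_final=3)   # Iris has 3 true classes
```

Comparing final_labels against the true Iris species with the nmi helper above would reproduce the kind of evaluation reported in the conclusion.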
Conclusion: compared with the classical clustering ensemble method based on cluster-weight information entropy, the accuracy of the clustering result on the Iris data set using the method of the present invention is substantially improved (over 50 comparative experiments, the classical clustering ensemble algorithm achieved an average NMI of 0.51, while the method of the present invention averaged an NMI of 0.72). The present invention focuses on the degree of association between class clusters and the stability of the base clusterings themselves. By introducing the incremental learning strategy, base clusterings of poor stability are deleted, effectively reducing their noise influence on the clustering result; considering the association between class clusters, the correlation is computed through information entropy and converted into weights, and a weighted co-association matrix is constructed following the traditional co-association matrix, effectively measuring the degree of correlation between any two elements and yielding a more accurate clustering result.

Claims (4)

1. A data clustering ensemble method based on an information entropy weight incremental learning strategy, characterized by comprising the following steps:
(1) preliminary generation of the clustering ensemble members;
(2) processing of the base clustering members by a class-cluster weighting algorithm based on incremental learning;
(3) clustering ensemble based on the final base clustering set;
(4) end.
2. The data clustering ensemble method based on an information entropy weight incremental learning strategy according to claim 1, characterized in that step (1), the preliminary generation of clustering ensemble members, comprises:
1) applying a data clustering algorithm to the data set D = {d_1, d_2, ..., d_x};
2) setting the clustering-count control parameter m to an initial value of 1;
3) judging whether the control parameter m is less than or equal to the number of candidate clusterings M; if so, executing step 4), otherwise going to step 6);
4) performing the m-th clustering, denoted π_m = {C_1^m, C_2^m, ..., C_{N_m}^m}, where the cardinality of the set π_m, |π_m|, is denoted N_m;
5) executing m = m + 1 and going to step 3);
6) generating the candidate clustering set Π = {π_1, π_2, ..., π_M} and attaching an "uncertain" label to every candidate clustering in Π;
7) end.
3. The data clustering ensemble method based on an information entropy weight incremental learning strategy according to claim 1, characterized in that step (2), processing the base clustering members with the class-cluster weighting algorithm based on incremental learning, comprises:
1) setting the control parameter r to an initial value of 1, and setting the iteration limit k;
2) for each class cluster C_n^m of any uncertain-labeled clustering π_m ∈ Π = {π_1, π_2, ..., π_M}, computing its uncertainty information entropy relative to all clusterings in Π:

$$\Psi(C_n^m) = -\sum_{\mu=1}^{M}\sum_{j=1}^{N_\mu} \frac{|C_n^m \cap C_j^\mu|}{|C_n^m|}\,\log\frac{|C_n^m \cap C_j^\mu|}{|C_n^m|}$$

where 1 ≤ m ≤ M, 1 ≤ n ≤ N_m, 1 ≤ μ ≤ M, 1 ≤ j ≤ N_μ, and |*| is the number of elements of set *;
3) computing the average of Ψ(C_n^m) over the class clusters of π_m obtained in step 2):

$$\Psi(\pi_m)=\frac{1}{N_m}\sum_{n=1}^{N_m}\Psi(C_n^m),\qquad N_m=|\pi_m|;$$

4) computing the standard deviation of the Ψ(C_n^m) of the candidate clustering π_m obtained in step 2):

$$\sigma(\pi_m)=\sqrt{\frac{1}{N_m}\sum_{n=1}^{N_m}\big(\Psi(C_n^m)-\Psi(\pi_m)\big)^2};$$

5) using the formula W(*) = e^{-*} to normalize the average Ψ(π_m) and standard deviation σ(π_m) of the class-cluster uncertainty entropy of each clustering into its two weights W(Ψ(π_m)) and W(σ(π_m)), so that both weights take values in (0, 1];
6) setting a threshold α for the weight W(Ψ(π_m)) and a threshold β for the weight W(σ(π_m));
7) comparing W(Ψ(π_m)) with α and W(σ(π_m)) with β for every candidate clustering computed in step 5);
if a clustering π_m ∈ Π satisfies (W(Ψ(π_m)) > α) ∧ (W(σ(π_m)) < β), retaining that clustering and changing its uncertain label to a determined label, otherwise deleting it; if r equals k, denoting the number of retained clusterings carrying the determined label as M and jumping to step 9);
8) denoting the number of retained candidate clusterings from step 7) as t; if t equals the required number of clusterings M, going to step 9); otherwise generating M − t further candidate clusterings with a traditional data clustering method, combining them with the retained determined candidates to form M new candidate clusterings, relabeling all M newest candidates as uncertain, denoting them Π = {π_1, π_2, ..., π_M}, adding 1 to the control parameter r, and returning to step 2);
9) obtaining the final base clustering set Π = {π_1, π_2, ..., π_M}.
4. The data clustering ensemble method based on an information entropy weight incremental learning strategy according to claim 1, characterized in that step (3), the clustering ensemble based on the final base clustering set, comprises:
1) based on the weight information of the class clusters in the base clusterings obtained above, counting for any two elements of the data set D their occurrences in the same class cluster of each clustering, each multiplied by the weight of that class cluster, and taking the sum as the weighted integrated distance between the two elements:

$$Dis(d_i, d_j) = \sum_{m=1}^{M} w_i^m\,\Phi_{ij}^m$$

where d_i ∈ D, d_j ∈ D and d_i ≠ d_j;
C_n^m is the class cluster containing d_i in clustering π_m, written d_i ∈ C_n^m, C_n^m ∈ π_m, n ∈ [1, N_m];
w_i^m is the weight of the class cluster C_n^m containing d_i;
Φ_ij^m = 1 if, when d_i ∈ C_n^m, d_j also belongs to the class cluster C_n^m of clustering π_m;
Φ_ij^m = 0 if, when d_i ∈ C_n^m, d_j does not belong to the class cluster C_n^m of clustering π_m;
2) using the weighted integrated distances Dis(d_i, d_j) between any two elements of the data set D = {d_1, d_2, ..., d_x} obtained in step 1) as the inter-element clustering distance of a hierarchical clustering method, applying hierarchical clustering to the data set D, and obtaining the final clustering ensemble output.
CN201810810646.4A 2018-07-23 2018-07-23 A data clustering ensemble method based on an information entropy weight incremental learning strategy Pending CN109241991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810810646.4A CN109241991A (en) 2018-07-23 2018-07-23 A data clustering ensemble method based on an information entropy weight incremental learning strategy


Publications (1)

Publication Number Publication Date
CN109241991A true CN109241991A (en) 2019-01-18

Family

ID=65072916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810810646.4A Pending CN109241991A (en) 2018-07-23 2018-07-23 A data clustering ensemble method based on an information entropy weight incremental learning strategy

Country Status (1)

Country Link
CN (1) CN109241991A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096805A (en) * 2016-05-10 2016-11-09 华北电力大学 A kind of residential electricity consumption load classification method based on entropy assessment feature selection
CN107480694A (en) * 2017-07-06 2017-12-15 重庆邮电大学 Three clustering methods are integrated using the weighting selection evaluated twice based on Spark platforms

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619107A (en) * 2019-08-07 2019-12-27 南昌大学 Lstm and Gcforest algorithm mixed reinforcement learning distribution network transformer load prediction method
CN110619107B (en) * 2019-08-07 2022-04-12 南昌大学 Lstm and Gcforest algorithm mixed reinforcement learning distribution network transformer load prediction method

Similar Documents

Publication Publication Date Title
Elbatta et al. A dynamic method for discovering density varied clusters
Xie et al. A new internal index based on density core for clustering validation
CN106067034B (en) Power distribution network load curve clustering method based on high-dimensional matrix characteristic root
CN105354588A (en) Decision tree construction method
CN109669990A (en) A kind of innovatory algorithm carrying out Outliers mining to density irregular data based on DBSCAN
CN103699678A (en) Hierarchical clustering method and system based on multistage layered sampling
Cheng et al. Searching dimension incomplete databases
Jiang et al. Classification methods of remote sensing image based on decision tree technologies
Indira et al. Performance analysis of genetic algorithm for mining association rules
CN108154185A (en) A kind of k-means clustering methods of secret protection
Qin et al. Associative classifier for uncertain data
CN109241991A (en) A kind of data clusters integrated approach based on comentropy weight incremental learning strategy
CN116186757A (en) Method for publishing condition feature selection differential privacy data with enhanced utility
Chaturvedi et al. An improvement in K-mean clustering algorithm using better time and accuracy
CN105631465A (en) Density peak-based high-efficiency hierarchical clustering method
CN109190659A (en) A kind of data integration clustering method based on three decision strategies of comentropy weight
CN110287992A (en) Agricultural features information extracting method based on big data
CN110533111A (en) A kind of adaptive K mean cluster method based on local density Yu ball Hash
CN109241992A (en) A kind of data clusters integrated approach based on two decision optimizations of comentropy weight
CN108319658A (en) A kind of improvement Apriori algorithm based on desert steppe
Yong et al. Short-term building load forecasting based on similar day selection and LSTM network
CN108717551A (en) A kind of fuzzy hierarchy clustering method based on maximum membership degree
CN113780864B (en) Key ecological hydrological index identification method for influencing spawning of four major Chinese carps
CN109522750A (en) A kind of new k anonymity realization method and system
CN108681576A (en) A kind of data digging method based on Quality of Safflower decision tree

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190118