CN109829494A - A clustering ensemble method based on weighted similarity measurement - Google Patents

A clustering ensemble method based on weighted similarity measurement

Info

Publication number
CN109829494A
CN109829494A (application CN201910079817.5A)
Authority
CN
China
Prior art keywords
cluster
data
data set
sample
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910079817.5A
Other languages
Chinese (zh)
Inventor
白亮 (Bai Liang)
杜航原 (Du Hangyuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN201910079817.5A priority Critical patent/CN109829494A/en
Publication of CN109829494A publication Critical patent/CN109829494A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of clustering ensemble analysis, and in particular to a clustering ensemble method based on weighted similarity measurement. Similarity measurement is weighted according to the quality of each cluster member, reinforcing the positive influence of high-quality cluster members during integration while suppressing the adverse interference of low-quality cluster members, so as to obtain a clustering ensemble result with greater accuracy and robustness. The method first calculates, for every pair of samples in the data set, the consistency of their description of the symbol-space data within each cluster member; it then calculates the consistency with which each cluster member describes the feature-space data and uses this to compute an integration weight for each cluster member; on this basis it calculates the weighted similarity between any two samples in the data set and constructs the weighted similarity matrix of the data set, thereby converting the clustering ensemble task into a graph minimum-cut problem; the problem is solved with spectral clustering to obtain the clustering ensemble result, which is finally output.

Description

A clustering ensemble method based on weighted similarity measurement
Technical field
The present invention relates to the field of clustering ensemble analysis, and in particular to a clustering ensemble method based on weighted similarity measurement.
Background art
Clustering analysis is an important and active research area in data mining. As an unsupervised learning method, clustering is essentially a density estimation problem: the data to be clustered carry no class labels in advance and can be regarded as generated by a mixture model. Its main idea is to split the data into several classes or clusters (groups) so that the similarity of data objects within a cluster is maximized while the similarity of data objects between clusters is minimized. In recent years large-scale data sets have emerged in every field, posing new challenges to clustering analysis. Faced with large-scale data, traditional clustering algorithms are no longer as "handy" as they are for small and medium-scale data; problems such as processing difficulty, long processing time, hard-to-determine parameters, low efficiency and poor clustering quality are common. Clustering ensembles grew up against this background: they seek a combination of multiple clusterings in order to obtain a better clustering. Clustering ensembles show better average performance across different fields and data sets, can find solutions that no single clustering algorithm can obtain, are less sensitive to noise, outliers and sampling variation, and can also estimate the uncertainty of the clusters from the distribution of the cluster collective. Clustering ensemble algorithms mainly have two problems to solve: the first is how to generate different clusterings to form a cluster collective, and the second is how to obtain a unified clustering result from this cluster collective. Current research on clustering ensembles, both at home and abroad, focuses on the second problem, i.e. how to obtain a unified clustering result from the cluster collective.
The patent "A sampling-based clustering ensemble method using local and global information" (publication No. CN105844303A) discloses a sampling-based clustering ensemble method using local and global information. The target data set is first sampled to generate a mixed learning sample; clustering analysis is carried out in this learning-sample space to generate a clustering partition; the quality of the partition is then evaluated and the weight vector of the target data set is updated according to the evaluation result. Repeating the above steps for several rounds produces multiple clustering partitions. The multiple partitions are then merged into a new feature representation, a traditional clustering algorithm is applied to this representation, and an integrated clustering result is generated. That invention gives ensemble learning stronger noise resistance and a greater ability to handle problem data, and the new features can effectively and comprehensively characterize both global and local clustering structure information, so that the ensemble learning algorithm performs well on data sets with different characteristics. The patent "Clustering ensemble method based on a hybrid clustering ensemble selection strategy" (publication No. CN107169511A) converts the clustering ensemble selection problem into a feature selection problem, generates base clustering results with greater diversity from multiple angles, uses a feature selection algorithm for optimization to avoid human factors and redundancy, and organically combines each subset of clustering results by considering local and global weights to improve clustering accuracy. The steps of that method include: inputting the test data-set sample matrix X; clustering the sample matrix X to generate a set of base clustering results; transforming the set of base clustering results into a new feature space in which each base clustering result serves as one feature; applying feature selection to the set to obtain a subset of clustering results; assigning a weight function to the subset to obtain the final subset of clustering results; and integrating the final subset to obtain the final clustering result.
When a unified integrated result is generated from a set of cluster members, a common approach is to measure the similarity between samples by the frequency with which they appear in the same cluster across the different cluster members, construct the similarity matrix of the data set, and then partition the data set with a minimum-cut method to obtain a unified integrated result. However, because the quality of the cluster members in the collective is uneven, their influences on the final clustering ensemble result are also not the same; ignoring these influences and considering only sample similarity may reduce the validity of the clustering ensemble result. For this reason, the present invention proposes a clustering ensemble method based on weighted similarity measurement which, when computing sample similarity, reinforces the positive influence of higher-quality cluster members in the cluster collective on the ensemble result while limiting the adverse interference of lower-quality members, so that the clustering ensemble result has greater accuracy and robustness.
Summary of the invention
The technical problem to be solved by the present invention is to design a clustering ensemble method in which similarity measurement is weighted according to the quality of the cluster members, reinforcing the positive influence of high-quality cluster members during integration while suppressing the adverse interference of low-quality cluster members, so as to obtain a clustering ensemble result with greater accuracy and robustness. The method first calculates, for every pair of samples in the data set, the consistency of their description of the symbol-space data within each cluster member; it then calculates the consistency with which each cluster member describes the feature-space data and uses this to compute an integration weight for each cluster member; on this basis it calculates the weighted similarity between any two samples in the data set and constructs the weighted similarity matrix of the data set, thereby converting the clustering ensemble task into a graph minimum-cut problem; the problem is solved with spectral clustering to obtain the clustering ensemble result, which is finally output.
The method proposed by the present invention can be used for clustering tasks on all kinds of numerical data sets. For example, it can be used to identify parallel patterns in gene expression data, together with gene sets and sample sets sharing the same biological meaning, so that genes related in function and transcriptional regulation can be found on the basis of the clustering analysis; it can also be used to discover the connection characteristics of structure and function between nodes in complex network data sets and to partition community structure, and thus to understand the function of a complex network, to seek the rules hidden in the network, and to predict its behavior; it can further be used to process complex image data sets, dividing an image into several non-overlapping regions according to visual features, salient targets, background scenes and the like, so as to realize image segmentation.
The technical scheme adopted by the invention is: a clustering ensemble method based on weighted similarity measurement. For a data set X = {x_1, x_2, ..., x_N} of N samples in feature space, the i-th sample is denoted x_i. C = {C_1, C_2, ..., C_T} denotes the set of cluster members generated on the data set X, where T is the number of cluster members in C, C_t denotes the t-th cluster member in C, C_{t,k} is the k-th cluster in C_t, and S_t is the number of clusters in C_t. A clustering is regarded as a symbolic representation of the data set, so each cluster member in the cluster collective corresponds to a cluster symbol vector in symbol space; the set formed by the T cluster symbol vectors is denoted L = {l_1, l_2, ..., l_T}, where l_t denotes the cluster symbol vector of the t-th cluster member C_t and l_{t,k} denotes the label of the k-th cluster in C_t. C* = {C*_1, ..., C*_{S*}} denotes the clustering ensemble result, where C*_s denotes the s-th cluster in C* and S* denotes the number of clusters in C*. The content of the present invention is the process of generating the ensemble result C* from C = {C_1, ..., C_T}, comprising the following steps:
S10. Perform data standardization on the data set: use a Gaussian kernel function to map the data set X = {x_1, x_2, ..., x_N} in feature space, so that the standardized data set Ψ = {ψ_1, ψ_2, ..., ψ_N} obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set;
S20. Calculate, for any two samples in the data set, the consistency of their description of the symbol-space data within each cluster member: first, calculate the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X; then calculate the conditional information entropy of L with respect to the clusters to which the two samples belong in some cluster member, which expresses the uncertainty of describing the symbol-space data with these two clusters; then take the difference of these two conditional information entropies of L as the consistency of the two samples' description of the symbol-space data in this cluster member; proceed in the same way to calculate the consistency of every pair of samples in every cluster member;
S30. Calculate the consistency with which each cluster member describes the feature-space data: first, calculate the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X; then calculate the conditional information entropy of Ψ with respect to some cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; take the difference of these two conditional information entropies of Ψ as that cluster member's consistency in describing the feature-space data, and proceed in the same way to calculate the consistency of every cluster member;
S40. According to the consistency with which each cluster member describes the feature-space data, calculate an integration weight for each cluster member, which controls that member's influence on the final clustering ensemble result;
S50. Using the pairwise symbol-space consistency obtained in step S20 and the integration weights of the cluster members obtained in step S40, calculate the weighted similarity between any two samples in the data set;
S60. Convert the clustering ensemble task into a graph minimum-cut problem, i.e. minimize the total weighted similarity between all pairs of objects that do not lie in the same cluster of the final clustering ensemble result;
S70. Solve the graph minimum-cut problem obtained from the clustering ensemble task with spectral clustering, obtaining the clustering ensemble result C*;
S80. Output the clustering ensemble result C*.
The Gaussian kernel function used in step S10 of this method is as shown in formula (1):
where the value of the parameter α is set to the standard deviation of ||x_i − x_o||², x_o is the o-th sample in the data set X (i ≠ o), and ψ_o is the o-th sample in the standardized data set Ψ.
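Formula (1) is reproduced only as an image in the source, so the following Python sketch assumes a standard Gaussian (RBF) kernel exp(−‖x_i − x_o‖²/(2α²)) and represents each standardized sample ψ_i by its vector of kernel values against all samples; the function name `gaussian_standardize` and this particular reading of the mapping are illustrative assumptions, not the patent's exact formula.

```python
import numpy as np

def gaussian_standardize(X):
    """Sketch of step S10: map X (N x d) through a Gaussian kernel.

    Assumption: formula (1) is not reproduced in the source, so a standard
    RBF kernel is used; alpha is set to the standard deviation of the
    pairwise squared distances ||x_i - x_o||^2, as stated in the text.
    """
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)                       # pairwise squared distances
    alpha = d2[~np.eye(len(X), dtype=bool)].std()  # std of off-diagonal distances
    Psi = np.exp(-d2 / (2.0 * alpha ** 2))         # row i is taken as psi_i
    return Psi
```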
Step S20 described in this method includes:
S21. Calculate with formula (2) the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X:
where H(l_t | X) is the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to the data set X, and can be calculated by formula (3):
where P(l_{t,k} | X) denotes the conditional probability of the cluster symbol vector l_t with respect to the data set X, which can be calculated by formula (4):
where x_i(l_t) denotes the value of sample x_i on the t-th cluster symbol vector, i.e. the cluster label of sample x_i in the t-th cluster member;
S22. For any two samples x_i and x_j in the data set X, consider the clusters to which they belong in the t-th cluster member C_t; calculate with formula (5) the conditional information entropy of the cluster symbol vector set L with respect to these two clusters, which expresses the uncertainty of describing the symbol-space data with these two clusters:
where the two clusters to which x_i and x_j belong form a set, and the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to this set can be calculated by formula (6):
where the conditional probability of the cluster symbol vector l_t with respect to this set can be calculated by formula (7):
where x_d(l_t) denotes the value of sample x_d on the t-th cluster symbol vector, i.e. the cluster label of sample x_d in the t-th cluster member;
S23. Take the difference between the conditional information entropy of the cluster symbol vector set L with respect to the data set X and its conditional information entropy with respect to the two clusters as the consistency of samples x_i and x_j in cluster member C_t in describing the symbol-space data, as shown in formula (8):
S24. Using the method of steps S21–S23, calculate for every pair of samples in the data set X their consistency in describing the symbol-space data in each cluster member.
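Formulas (2)–(8) appear only as images in the source; the sketch below therefore assumes that the conditional probabilities are estimated as cluster-size fractions and that the consistency of a pair is the drop in Shannon entropy when attention is restricted to the two clusters containing the pair. The function names are illustrative only.

```python
import numpy as np

def _entropy(counts):
    """Shannon entropy of a discrete distribution given by counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def symbol_space_consistency(labels, i, j):
    """Sketch of S21-S23 for one cluster member with label vector `labels`.

    Assumed reading: H(l_t | X) is the entropy of the cluster sizes over the
    whole data set, the second entropy is restricted to the union of the two
    clusters containing x_i and x_j, and the consistency is their difference.
    """
    labels = np.asarray(labels)
    h_full = _entropy(np.unique(labels, return_counts=True)[1])
    mask = np.isin(labels, [labels[i], labels[j]])
    h_pair = _entropy(np.unique(labels[mask], return_counts=True)[1])
    return h_full - h_pair

def all_pairs_consistency(members):
    """Step S24: consistency of every sample pair in every cluster member."""
    members = [np.asarray(m) for m in members]
    n = len(members[0])
    out = np.zeros((len(members), n, n))
    for t, labels in enumerate(members):
        for i in range(n):
            for j in range(n):
                out[t, i, j] = symbol_space_consistency(labels, i, j)
    return out
```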
Step S30 described in this method includes:
S31. Calculate with formula (9) the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X:
where H(Ψ | X) is the conditional information entropy of the standardized data set Ψ with respect to the data set X, and the variance of the standardized data set Ψ is calculated by formula (10):
where μ_Ψ is the expectation of the standardized data set Ψ, satisfying formula (11):
S32. Calculate the conditional information entropy of the standardized data set Ψ with respect to each cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; the conditional information entropy of Ψ with respect to the t-th cluster member C_t can be calculated by formula (12):
where H(Ψ | C_t) is the conditional information entropy of the standardized data set Ψ with respect to the t-th cluster member C_t, and the variance of the samples in C_t is calculated by formula (13):
where the expectation of the samples in C_t satisfies formula (14):
where ψ_e is the e-th sample in the standardized data set Ψ, x_e is the e-th sample of the data set X, x_f is the f-th sample of the data set X (e ≠ f), and x_g and x_h are any two different samples in the data set X;
S33. Take the difference of the above two conditional information entropies of the standardized data set Ψ as the cluster member's consistency in describing the feature-space data; the consistency measure of the t-th cluster member C_t on Ψ is calculated by formula (15):
I(Ψ | C_t) = H(Ψ | X) − H(Ψ | C_t)    (15)
where I(Ψ | C_t) denotes the consistency measure of C_t on Ψ;
S34. Using the method of steps S31–S33, calculate one by one each cluster member's consistency in describing the feature-space data.
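Formulas (9)–(14) are likewise reproduced only as images; the sketch below assumes the Gaussian differential entropy 0.5·ln(2πeσ²), with the overall variance of Ψ for H(Ψ|X) and the size-weighted within-cluster variance for H(Ψ|C_t), and applies formula (15) to obtain I(Ψ|C_t). These variance choices are assumptions, not the patent's exact formulas.

```python
import numpy as np

def _gaussian_entropy(var):
    """Differential entropy of a Gaussian with variance `var` (assumed form)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * max(var, 1e-12))

def feature_space_consistency(Psi, labels):
    """Sketch of S31-S33: consistency I(Psi | C_t) of one cluster member."""
    Psi = np.asarray(Psi, dtype=float)
    labels = np.asarray(labels)
    h_x = _gaussian_entropy(Psi.var())            # H(Psi | X) from total variance
    n = len(Psi)
    var_within = sum(len(Psi[labels == k]) / n * Psi[labels == k].var()
                     for k in np.unique(labels))  # pooled within-cluster variance
    h_ct = _gaussian_entropy(var_within)          # H(Psi | C_t)
    return h_x - h_ct                             # formula (15)
```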
In step S40 of this method, the integration weight of each cluster member is calculated from its consistency in describing the feature-space data, as shown in formula (16):
where ω_t denotes the clustering ensemble weight of cluster member C_t.
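Formula (16) is not reproduced in the source; a natural reading, used in the sketch below, is to normalize each member's feature-space consistency by the sum over all members so that the weights sum to one. This is an assumption, not the exact formula.

```python
import numpy as np

def ensemble_weights(consistencies):
    """Sketch of step S40: integration weight of each cluster member."""
    c = np.clip(np.asarray(consistencies, dtype=float), 0.0, None)
    return c / c.sum()  # omega_t proportional to I(Psi | C_t), assumed form
```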
In step S50 of this method, the weighted similarity between two samples in the data set X is calculated as shown in formula (17):
where sim(x_i, x_j) denotes the weighted similarity between samples x_i and x_j; the weighted similarity of any two samples in the data set X is calculated in this way.
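Formula (17) is also only an image in the source; the sketch below takes the weighted similarity sim(x_i, x_j) to be the sum over cluster members of the integration weight ω_t times the pair's symbol-space consistency in that member, which matches the verbal description of step S50 but is still an assumed form.

```python
import numpy as np

def weighted_similarity(pair_consistency, weights):
    """Sketch of step S50: weighted similarity of every pair of samples.

    pair_consistency : array (T, N, N) from step S20.
    weights          : array (T,) of integration weights from step S40.
    """
    return np.einsum('t,tij->ij',
                     np.asarray(weights, dtype=float),
                     np.asarray(pair_consistency, dtype=float))
```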
Step S60 described in this method includes:
S61. Construct the weighted similarity matrix Θ = [θ(x_p, x_q)]_{N×N} of the data set X, where the matrix element θ(x_p, x_q) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(x_i, x_j);
S62. Convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
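Formulas (18) and (19) are reproduced only as images; the sketch below assumes a Gaussian transform of the weighted similarities for θ(x_p, x_q), with γ set to the standard deviation of sim as stated, and evaluates the cut objective as the total affinity between samples assigned to different clusters. Both forms are assumptions chosen to match the verbal description.

```python
import numpy as np

def affinity_matrix(sim, gamma=None):
    """Sketch of S61: weighted similarity matrix Theta of the data set.

    Assumed form: theta(x_p, x_q) = exp(-(s_max - sim(x_p, x_q))**2 / gamma**2),
    so the most similar pairs receive the largest affinity.
    """
    sim = np.asarray(sim, dtype=float)
    if gamma is None:
        gamma = sim.std()              # as stated in the text
    theta = np.exp(-((sim.max() - sim) ** 2) / (gamma ** 2))
    np.fill_diagonal(theta, 0.0)
    return theta

def cut_value(theta, assignment):
    """Sketch of the formula (19) objective: total affinity cut by a partition."""
    assignment = np.asarray(assignment)
    across = assignment[:, None] != assignment[None, :]
    return 0.5 * float(theta[across].sum())
```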
Step S70 described in this method includes:
S71. Construct an N-dimensional diagonal matrix from the column sums of the weighted similarity matrix Θ, denote it D, and define the matrix L = D − Θ;
S72. Find the S* smallest eigenvalues of the matrix L, sorted in ascending order, and their corresponding eigenvectors;
S73. Arrange the S* eigenvectors side by side to form an N × S* matrix, treat each row as a vector in an S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which a row belongs in the clustering result is the cluster to which the corresponding sample of the data set X belongs.
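Steps S71–S73 are specified closely enough to sketch directly; the version below uses NumPy for the eigendecomposition and scikit-learn's KMeans for the row clustering (the use of scikit-learn is an implementation choice, not part of the patent).

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_ensemble(theta, n_clusters):
    """Sketch of step S70 (S71-S73): solve the graph minimum-cut problem.

    theta      : weighted similarity matrix Theta (N x N).
    n_clusters : S*, the number of clusters in the ensemble result.
    """
    theta = np.asarray(theta, dtype=float)
    D = np.diag(theta.sum(axis=0))          # S71: diagonal matrix of column sums
    L = D - theta                           # S71: L = D - Theta
    eigvals, eigvecs = np.linalg.eigh(L)    # S72: eigenvalues in ascending order
    embedding = eigvecs[:, np.argsort(eigvals)[:n_clusters]]  # S73: N x S* matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(embedding)        # row's cluster = sample's cluster
```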
Aimed at the problem that the cluster members in a cluster collective are of uneven quality and exert an adverse influence on the clustering ensemble, the present invention proposes a clustering ensemble method based on weighted similarity measurement. It first calculates, for every pair of samples in the data set, the consistency of their description of the symbol-space data within each cluster member; it then calculates the consistency with which each cluster member describes the feature-space data and uses it to compute an integration weight for each cluster member; on this basis it calculates the weighted similarity of any two samples in the data set, then constructs the weighted similarity matrix of the data set, thereby converting the clustering ensemble task into a graph minimum-cut problem, solves it with spectral clustering to obtain the clustering ensemble result, and finally outputs the result. The main quantities of the invention include: the cluster symbol vector set; the conditional information entropy of the cluster symbol vector set with respect to the data set; the conditional information entropy of the cluster symbol vector set with respect to the clusters to which two samples belong in some cluster member; the consistency of two samples in some cluster member in describing the symbol-space data; the conditional information entropy of the standardized data set with respect to the original data set; the conditional information entropy of the standardized data set with respect to some cluster member; the consistency of a cluster member in describing the feature-space data; the integration weight of a cluster member; the weighted similarity between two samples; and the weighted similarity matrix. The cluster symbol vector set is the set of cluster symbol vectors that describe the data set symbolically in symbol space. The conditional information entropy of the cluster symbol vector set with respect to the data set expresses the uncertainty of describing the symbol-space data with the data set X. The conditional information entropy of the cluster symbol vector set with respect to the clusters to which two samples belong in some cluster member expresses the uncertainty of describing the symbol-space data with these two clusters. The consistency of two samples in some cluster member in describing the symbol-space data is the difference between the conditional information entropy of the cluster symbol vector set with respect to the data set and its conditional information entropy with respect to the clusters to which the two samples belong in that cluster member. The conditional information entropy of the standardized data set with respect to the original data set expresses the uncertainty of describing the feature-space data with the data set. The conditional information entropy of the standardized data set with respect to some cluster member expresses that cluster member's uncertainty in describing the feature-space data. The consistency of a cluster member in describing the feature-space data is the difference between the conditional information entropy of the standardized data set with respect to the original data set and its conditional information entropy with respect to that cluster member. The integration weight of a cluster member controls that member's influence on the final clustering ensemble result. The weighted similarity between two samples describes the similarity of the cluster assignments of the two samples across the series of cluster members. The weighted similarity matrix describes the similarity of cluster assignments between all samples in the data set.
The beneficial effects of the present invention are: information entropy is used to measure consistency in data description and thereby to compute the clustering ensemble weight of each cluster member; weighted similarity expresses the similarity of two samples' cluster assignments across the series of cluster members; the clustering ensemble task is converted into a graph minimum-cut problem and solved by spectral clustering. Because the method fully considers the influence of cluster members of different quality when constructing the similarity matrix of the data set, the clustering ensemble result finally obtained has greater accuracy and robustness.
Description of the drawings
Fig. 1 is a structure diagram of the computer-implemented system of the clustering ensemble method based on weighted similarity measurement of the present invention;
Fig. 2 is a flow chart of the clustering ensemble method based on weighted similarity measurement of the present invention.
Specific embodiment
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The clustering ensemble method based on weighted similarity measurement of the present invention is implemented by a computer program; Fig. 1 shows the structure of the computer-implemented system. In the following, the technical solution proposed by the present invention is used to process remote-sensing image data: the remote-sensing images are classified automatically so as to realize the recognition of ground-object targets. The input data is an image data set consisting of pixels; pixels with similar spectral features are grouped into one class by the clustering ensemble method and identified as the same ground-object target, and the various ground-object targets identified are finally output. The specific implementation process is shown in Fig. 2. The spectral features of each pixel in the remote-sensing image are taken as one sample, and the N samples constitute the image data set X = {x_1, x_2, ..., x_N}; the i-th sample in feature space is denoted x_i. C = {C_1, C_2, ..., C_T} denotes the set of cluster members generated on the data set X, i.e. a series of partitions of the pixels, where T is the number of cluster members in C, C_t denotes the t-th cluster member, i.e. the t-th partition of the image data set, C_{t,k} is the k-th cluster in C_t, i.e. the k-th class of that partition, and S_t is the number of clusters in C_t. A clustering is regarded as a symbolic representation of the data set, so each cluster member in the cluster collective corresponds to a cluster symbol vector in symbol space; the set formed by the T cluster symbol vectors is denoted L = {l_1, ..., l_T}, where l_t denotes the cluster symbol vector of the t-th cluster member C_t and l_{t,k} denotes the label of the k-th cluster in C_t. C* denotes the clustering ensemble result, i.e. the final integrated partition, where C*_s denotes the s-th cluster in C* and S* denotes the number of clusters in C*. The present embodiment is the process of generating the consistent ground-object target recognition result C* from a series of pixel partitions C = {C_1, ..., C_T} of the image data set X, comprising the following key steps:
Step 1. Perform data standardization on the data set: use the Gaussian kernel function shown in formula (1) to map the data set X = {x_1, x_2, ..., x_N} in feature space, so that the data set Ψ = {ψ_1, ψ_2, ..., ψ_N} obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set;
where the value of the parameter α is set to the standard deviation of ||x_i − x_o||², x_o is the o-th sample in the data set X (i ≠ o), and ψ_o is the o-th sample in the standardized data set Ψ.
Step 2. Calculate, for any two samples in the data set, the consistency of their description of the symbol-space data within each cluster member: first, calculate the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X; then calculate the conditional information entropy of L with respect to the clusters to which the two samples belong in some cluster member, which expresses the uncertainty of describing the symbol-space data with these two clusters; then take the difference of these two conditional information entropies of L as the consistency of the two samples' description of the symbol-space data in this cluster member; proceed in the same way for every pair of samples in every cluster member, comprising the following steps:
S21. Calculate with formula (2) the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X:
where H(l_t | X) is the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to the data set X, and can be calculated by formula (3):
where P(l_{t,k} | X) denotes the conditional probability of the cluster symbol vector l_t with respect to the data set X, which can be calculated by formula (4):
where x_i(l_t) denotes the value of sample x_i on the t-th cluster symbol vector, i.e. the cluster label of sample x_i in the t-th cluster member;
S22. For any two samples x_i and x_j in the data set X, consider the clusters to which they belong in the t-th cluster member C_t; calculate with formula (5) the conditional information entropy of the cluster symbol vector set L with respect to these two clusters, which expresses the uncertainty of describing the symbol-space data with these two clusters:
where the two clusters to which x_i and x_j belong form a set, and the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to this set can be calculated by formula (6):
where the conditional probability of the cluster symbol vector l_t with respect to this set can be calculated by formula (7):
where x_d(l_t) denotes the value of sample x_d on the t-th cluster symbol vector, i.e. the cluster label of sample x_d in the t-th cluster member;
S23. Take the difference between the conditional information entropy of the cluster symbol vector set L with respect to the data set X and its conditional information entropy with respect to the two clusters as the consistency of samples x_i and x_j in cluster member C_t in describing the symbol-space data, as shown in formula (8):
S24. Using the method of steps S21–S23, calculate for every pair of samples in the data set X their consistency in describing the symbol-space data in each cluster member.
Step 3. Calculate the consistency with which each cluster member describes the feature-space data: first, calculate the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X; then calculate the conditional information entropy of Ψ with respect to some cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; take the difference of these two conditional information entropies of Ψ as that cluster member's consistency in describing the feature-space data, and proceed in the same way for every cluster member, comprising the following steps:
S31. Calculate with formula (9) the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X:
where H(Ψ | X) is the conditional information entropy of the standardized data set Ψ with respect to the data set X, and the variance of the standardized data set Ψ is calculated by formula (10):
where μ_Ψ is the expectation of the standardized data set Ψ, satisfying formula (11):
S32. Calculate the conditional information entropy of the standardized data set Ψ with respect to each cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; the conditional information entropy of Ψ with respect to the t-th cluster member C_t can be calculated by formula (12):
where H(Ψ | C_t) is the conditional information entropy of the standardized data set Ψ with respect to the t-th cluster member C_t, and the variance of the samples in C_t is calculated by formula (13):
where the expectation of the samples in C_t satisfies formula (14):
where ψ_e is the e-th sample in the standardized data set Ψ, x_e is the e-th sample of the data set X, x_f is the f-th sample of the data set X (e ≠ f), and x_g and x_h are any two different samples in the data set X;
S33. Take the difference of the above two conditional information entropies of the standardized data set Ψ as the cluster member's consistency in describing the feature-space data; the consistency measure of the t-th cluster member C_t on Ψ is calculated by formula (15):
I(Ψ | C_t) = H(Ψ | X) − H(Ψ | C_t)    (15)
where I(Ψ | C_t) denotes the consistency measure of C_t on Ψ;
S34. Using the method of steps S31–S33, calculate one by one each cluster member's consistency in describing the feature-space data.
Step 4. According to the consistency with which each cluster member describes the feature-space data, calculate the integration weight of each cluster member, which controls that member's influence on the final clustering ensemble result; the clustering ensemble weight ω_t of cluster member C_t is calculated as shown in formula (16):
Step 5. Using the pairwise symbol-space consistency obtained in step S20 and the integration weights of the cluster members obtained in step S40, calculate the weighted similarity between any two samples in the data set; the weighted similarity sim(x_i, x_j) between x_i and x_j is calculated as shown in formula (17):
Step 6. Convert the clustering ensemble task into a graph minimum-cut problem, i.e. minimize the total weighted similarity between all pairs of objects that do not lie in the same cluster of the final clustering ensemble result, comprising the following steps:
S61. Construct the weighted similarity matrix Θ = [θ(x_p, x_q)]_{N×N} of the data set X, where the matrix element θ(x_p, x_q) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(x_i, x_j);
S62. Convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
Step 7. Solve the graph minimum-cut problem obtained from the clustering ensemble task with spectral clustering to obtain the clustering ensemble result C*, comprising the following steps:
S71. Construct an N-dimensional diagonal matrix from the column sums of the weighted similarity matrix Θ, denote it D, and define the matrix L = D − Θ;
S72. Find the S* smallest eigenvalues of the matrix L, sorted in ascending order, and their corresponding eigenvectors;
S73. Arrange the S* eigenvectors side by side to form an N × S* matrix, treat each row as a vector in an S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which a row belongs in the clustering result is the cluster to which the corresponding sample of the data set X belongs.
Step 8. Output the clustering ensemble result C*, i.e. the ground-object target recognition result; each cluster in the result represents one identified ground-object target.
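Read together, the sketches given with steps S10–S70 above can be chained into one pipeline for the remote-sensing embodiment; the data and the three cluster members below are random stand-ins, and all function names are the illustrative ones introduced earlier rather than part of the patent.

```python
import numpy as np

# Hypothetical end-to-end run on stand-in data: 100 "pixels" with 8 spectral
# features each, and three pre-computed cluster members (label vectors).
X = np.random.rand(100, 8)
members = [np.random.randint(0, 4, size=100) for _ in range(3)]

Psi = gaussian_standardize(X)                                     # step 1
pair_cons = all_pairs_consistency(members)                        # step 2
feat_cons = [feature_space_consistency(Psi, m) for m in members]  # step 3
weights = ensemble_weights(feat_cons)                             # step 4
sim = weighted_similarity(pair_cons, weights)                     # step 5
theta = affinity_matrix(sim)                                      # step 6
labels = spectral_ensemble(theta, n_clusters=4)                   # step 7
print(labels)                                  # step 8: each cluster is one target
```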

Claims (8)

1. A clustering ensemble method based on weighted similarity measurement, collecting sample data, wherein for a data set X = {x_1, x_2, ..., x_N} of N samples in feature space, the i-th sample in the data set X is denoted x_i; C = {C_1, C_2, ..., C_T} denotes the set of cluster members generated on the data set X, where T is the number of cluster members in C, C_t denotes the t-th cluster member in C, C_{t,k} is the k-th cluster in C_t, and S_t is the number of clusters in C_t; a clustering is regarded as a symbolic representation of the data set, so each cluster member in the cluster collective corresponds to a cluster symbol vector in symbol space, and the set formed by the T cluster symbol vectors is denoted L = {l_1, l_2, ..., l_T}, where l_t denotes the cluster symbol vector of the t-th cluster member C_t and l_{t,k} denotes the label of the k-th cluster in C_t; C* = {C*_1, ..., C*_{S*}} denotes the clustering ensemble result, where C*_s denotes the s-th cluster in C* and S* denotes the number of clusters in C*; the process of generating the clustering ensemble result C* from C = {C_1, ..., C_T} comprises the following steps:
S10. Perform data standardization on the data set X: use a Gaussian kernel function to map the data set X = {x_1, x_2, ..., x_N}, so that the standardized data set Ψ = {ψ_1, ψ_2, ..., ψ_N} obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set;
S20. Calculate, for any two samples in the data set X, the consistency of their description of the symbol-space data within each cluster member: first, calculate the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X; then calculate the conditional information entropy of L with respect to the clusters to which the two samples belong in some cluster member, which expresses the uncertainty of describing the symbol-space data with these two clusters; then take the difference of these two conditional information entropies of L as the consistency of the two samples' description of the symbol-space data in this cluster member; proceed in the same way to calculate the consistency of every pair of samples in every cluster member;
S30. Calculate the consistency with which each cluster member describes the feature-space data: first, calculate the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X; then calculate the conditional information entropy of the standardized data set Ψ with respect to some cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; take the difference of these two conditional information entropies of Ψ as that cluster member's consistency in describing the feature-space data, and proceed in the same way to calculate the consistency of every cluster member;
S40. According to the consistency with which each cluster member describes the feature-space data, calculate an integration weight for each cluster member, which controls that member's influence on the final clustering ensemble result;
S50. Using the pairwise symbol-space consistency obtained in step S20 and the integration weights of the cluster members obtained in step S40, calculate the weighted similarity between any two samples in the data set;
S60. Convert the clustering ensemble task into a graph minimum-cut problem, i.e. minimize the total weighted similarity between all pairs of objects that do not lie in the same cluster of the final clustering ensemble result;
S70. Solve the graph minimum-cut problem obtained from the clustering ensemble task with spectral clustering, obtaining the clustering ensemble result C*;
S80. Output the clustering ensemble result C*.
2. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that the Gaussian kernel function of step S10 is as shown in formula (1):
where the value of the parameter α is set to the standard deviation of ||x_i − x_o||², x_o is the o-th sample in the data set X (i ≠ o), and ψ_o is the o-th sample in the standardized data set Ψ.
3. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S20 of the method comprises:
S21. Calculate with formula (2) the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X:
where H(l_t | X) is the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to the data set X, and can be calculated by formula (3):
where P(l_{t,k} | X) denotes the conditional probability of the cluster symbol vector l_t with respect to the data set X, which can be calculated by formula (4):
where x_i(l_t) denotes the value of sample x_i on the t-th cluster symbol vector, i.e. the cluster label of sample x_i in the t-th cluster member;
S22. For any two samples x_i and x_j in the data set X, consider the clusters to which they belong in the t-th cluster member C_t; calculate with formula (5) the conditional information entropy of the cluster symbol vector set L with respect to these two clusters, which expresses the uncertainty of describing the symbol-space data with these two clusters:
where the two clusters to which x_i and x_j belong form a set, and the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to this set can be calculated by formula (6):
where the conditional probability of the cluster symbol vector l_t with respect to this set can be calculated by formula (7):
where x_d(l_t) denotes the value of sample x_d on the t-th cluster symbol vector, i.e. the cluster label of sample x_d in the t-th cluster member;
S23. Take the difference between the conditional information entropy of the cluster symbol vector set L with respect to the data set X and its conditional information entropy with respect to the two clusters as the consistency of samples x_i and x_j in cluster member C_t in describing the symbol-space data, as shown in formula (8):
S24. Using the method of steps S21–S23, calculate for every pair of samples in the data set X their consistency in describing the symbol-space data in each cluster member.
4. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S30 of the method comprises:
S31. Calculate with formula (9) the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X:
where H(Ψ | X) is the conditional information entropy of the standardized data set Ψ with respect to the data set X, and the variance of the standardized data set Ψ is calculated by formula (10):
where μ_Ψ is the expectation of the standardized data set Ψ, satisfying formula (11):
S32. Calculate the conditional information entropy of the standardized data set Ψ with respect to each cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; the conditional information entropy of Ψ with respect to the t-th cluster member C_t can be calculated by formula (12):
where H(Ψ | C_t) is the conditional information entropy of the standardized data set Ψ with respect to the t-th cluster member C_t, and the variance of the samples in C_t is calculated by formula (13):
where the expectation of the samples in C_t satisfies formula (14):
where ψ_e is the e-th sample in the standardized data set Ψ, x_e is the e-th sample of the data set X, x_f is the f-th sample of the data set X (e ≠ f), and x_g and x_h are any two different samples in the data set X;
S33. Take the difference of the above two conditional information entropies of the standardized data set Ψ as the cluster member's consistency in describing the feature-space data; the consistency measure of the t-th cluster member C_t on Ψ is calculated by formula (15):
I(Ψ | C_t) = H(Ψ | X) − H(Ψ | C_t)    (15)
where I(Ψ | C_t) denotes the consistency measure of C_t on Ψ;
S34. Using the method of steps S31–S33, calculate one by one each cluster member's consistency in describing the feature-space data.
5. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that in step S40 of the method the integration weight of each cluster member is calculated from its consistency in describing the feature-space data, as shown in formula (16):
where ω_t denotes the clustering ensemble weight of cluster member C_t.
6. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that in step S50 of the method the weighted similarity between two samples in the data set X is calculated as shown in formula (17):
where sim(x_i, x_j) denotes the weighted similarity between samples x_i and x_j; the weighted similarity of any two samples in the data set X is calculated in this way.
7. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S60 of the method comprises:
S61. Construct the weighted similarity matrix Θ = [θ(x_p, x_q)]_{N×N} of the data set X, where the matrix element θ(x_p, x_q) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(x_i, x_j);
S62. Convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
8. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S70 of the method comprises:
S71. Construct an N-dimensional diagonal matrix from the column sums of the weighted similarity matrix Θ, denote it D, and define the matrix L = D − Θ;
S72. Find the S* smallest eigenvalues of the matrix L, sorted in ascending order, and their corresponding eigenvectors;
S73. Arrange the S* eigenvectors side by side to form an N × S* matrix, treat each row as a vector in an S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which a row belongs in the clustering result is the cluster to which the corresponding sample of the data set X belongs.
CN201910079817.5A 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement Pending CN109829494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910079817.5A CN109829494A (en) 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910079817.5A CN109829494A (en) 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement

Publications (1)

Publication Number Publication Date
CN109829494A true CN109829494A (en) 2019-05-31

Family

ID=66862670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910079817.5A Pending CN109829494A (en) 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement

Country Status (1)

Country Link
CN (1) CN109829494A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276188A (en) * 2020-01-19 2020-06-12 西安理工大学 Short-time-sequence gene expression data clustering method based on angle characteristics
CN111464529A (en) * 2020-03-31 2020-07-28 山西大学 Network intrusion detection method and system based on cluster integration
CN111598120A (en) * 2020-03-31 2020-08-28 宁波吉利汽车研究开发有限公司 Data labeling method, equipment and device
CN111726765A (en) * 2020-05-29 2020-09-29 山西大学 WIFI indoor positioning method and system for large-scale complex scene
CN111899115A (en) * 2020-05-30 2020-11-06 中国兵器科学研究院 Method, device and storage medium for determining community structure in social network
CN117828380A (en) * 2024-03-05 2024-04-05 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device
CN117828380B (en) * 2024-03-05 2024-06-04 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957063A (en) * 2016-04-22 2016-09-21 北京理工大学 CT image liver segmentation method and system based on multi-scale weighting similarity measure
CN109214427A (en) * 2018-08-13 2019-01-15 山西大学 A kind of Nearest Neighbor with Weighted Voting clustering ensemble method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957063A (en) * 2016-04-22 2016-09-21 北京理工大学 CT image liver segmentation method and system based on multi-scale weighting similarity measure
CN109214427A (en) * 2018-08-13 2019-01-15 山西大学 A kind of Nearest Neighbor with Weighted Voting clustering ensemble method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bai Liang et al.: "A weighted clustering ensemble algorithm based on the fusion of clustering criteria", Journal of Shanxi University (Natural Science Edition) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276188A (en) * 2020-01-19 2020-06-12 西安理工大学 Short-time-sequence gene expression data clustering method based on angle characteristics
CN111276188B (en) * 2020-01-19 2023-03-24 西安理工大学 Short-time-sequence gene expression data clustering method based on angle characteristics
CN111464529A (en) * 2020-03-31 2020-07-28 山西大学 Network intrusion detection method and system based on cluster integration
CN111598120A (en) * 2020-03-31 2020-08-28 宁波吉利汽车研究开发有限公司 Data labeling method, equipment and device
CN111726765A (en) * 2020-05-29 2020-09-29 山西大学 WIFI indoor positioning method and system for large-scale complex scene
CN111899115A (en) * 2020-05-30 2020-11-06 中国兵器科学研究院 Method, device and storage medium for determining community structure in social network
CN117828380A (en) * 2024-03-05 2024-04-05 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device
CN117828380B (en) * 2024-03-05 2024-06-04 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device

Similar Documents

Publication Publication Date Title
CN109829494A (en) A kind of clustering ensemble method based on weighting similarity measurement
McIver et al. Estimating pixel-scale land cover classification confidence using nonparametric machine learning methods
Erisoglu et al. A new algorithm for initial cluster centers in k-means algorithm
CN103020978B (en) SAR (synthetic aperture radar) image change detection method combining multi-threshold segmentation with fuzzy clustering
CN103648106B (en) WiFi indoor positioning method of semi-supervised manifold learning based on category matching
CN114564982B (en) Automatic identification method for radar signal modulation type
CN107239795A (en) SAR image change detecting system and method based on sparse self-encoding encoder and convolutional neural networks
Yang et al. A feature-metric-based affinity propagation technique for feature selection in hyperspectral image classification
CN108664986B (en) Based on lpNorm regularized multi-task learning image classification method and system
CN109490838A (en) A kind of Recognition Method of Radar Emitters of data base-oriented incompleteness
CN108492298A (en) Based on the multispectral image change detecting method for generating confrontation network
CN108154094A (en) The non-supervisory band selection method of high spectrum image divided based on subinterval
CN106529563A (en) High-spectral band selection method based on double-graph sparse non-negative matrix factorization
CN111062428A (en) Hyperspectral image clustering method, system and equipment
CN108564083A (en) A kind of method for detecting change of remote sensing image and device
CN105205807B (en) Method for detecting change of remote sensing image based on sparse automatic coding machine
CN109034238A (en) A kind of clustering method based on comentropy
Lu et al. Multiple-kernel combination fuzzy clustering for community detection
CN106599927B (en) The Target cluster dividing method divided based on Fuzzy ART
CN115186012A (en) Power consumption data detection method, device, equipment and storage medium
CN111680579A (en) Remote sensing image classification method for adaptive weight multi-view metric learning
Upadhyay et al. A brief review of fuzzy soft classification and assessment of accuracy methods for identification of single land cover
CN109214427A (en) A kind of Nearest Neighbor with Weighted Voting clustering ensemble method
Förster et al. Significance analysis of different types of ancillary geodata utilized in a multisource classification process for forest identification in Germany
CN114648683A (en) Neural network performance improving method and device based on uncertainty analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190531

WD01 Invention patent application deemed withdrawn after publication