CN109829494A - A clustering ensemble method based on weighted similarity measurement - Google Patents

A clustering ensemble method based on weighted similarity measurement

Info

Publication number
CN109829494A
CN109829494A (application CN201910079817.5A)
Authority
CN
China
Prior art keywords
cluster
data
data set
sample
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910079817.5A
Other languages
Chinese (zh)
Inventor
白亮 (Bai Liang)
杜航原 (Du Hangyuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN201910079817.5A priority Critical patent/CN109829494A/en
Publication of CN109829494A publication Critical patent/CN109829494A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of clustering ensemble analysis, and in particular to a clustering ensemble method based on weighted similarity measurement. Similarity measurement is weighted according to the quality of each cluster member, reinforcing the positive influence of high-quality cluster members during integration while suppressing the adverse interference of low-quality cluster members, so as to obtain a clustering ensemble result with greater accuracy and robustness. The method first calculates, for every pair of samples in the data set, the consistency of their description of the symbol-space data within each cluster member; it then calculates the consistency with which each cluster member describes the feature-space data and uses this to compute an integration weight for each cluster member; on this basis it calculates the weighted similarity between any two samples in the data set and constructs the weighted similarity matrix of the data set, thereby converting the clustering ensemble task into a graph minimum-cut problem; the problem is solved with spectral clustering to obtain the clustering ensemble result, which is finally output.

Description

A clustering ensemble method based on weighted similarity measurement
Technical field
The present invention relates to the field of clustering ensemble analysis, and in particular to a clustering ensemble method based on weighted similarity measurement.
Background art
Clustering analysis is an important and active research area in data mining. As an unsupervised learning method, clustering is essentially a density estimation problem: the data to be clustered carry no class labels in advance and can be regarded as generated by a mixture model. Its main idea is to split the data into several classes or clusters (groups) so that the similarity of data objects within a cluster is maximized while the similarity of data objects between clusters is minimized. In recent years large-scale data sets have emerged in every field, posing new challenges to clustering analysis. Faced with large-scale data, traditional clustering algorithms are no longer as "handy" as they are for small and medium-scale data; problems such as processing difficulty, long processing time, hard-to-determine parameters, low efficiency and poor clustering quality are common. Clustering ensembles grew up against this background: they seek a combination of multiple clusterings in order to obtain a better clustering. Clustering ensembles show better average performance across different fields and data sets, can find solutions that no single clustering algorithm can obtain, are less sensitive to noise, outliers and sampling variation, and can also estimate the uncertainty of the clusters from the distribution of the cluster collective. Clustering ensemble algorithms mainly have two problems to solve: the first is how to generate different clusterings to form a cluster collective, and the second is how to obtain a unified clustering result from this cluster collective. Current research on clustering ensembles, both at home and abroad, focuses on the second problem, i.e. how to obtain a unified clustering result from the cluster collective.
The patent "A sampling-based clustering ensemble method using local and global information" (publication No. CN105844303A) discloses a sampling-based clustering ensemble method using local and global information. The target data set is first sampled to generate a mixed learning sample; clustering analysis is carried out in this learning-sample space to generate a clustering partition; the quality of the partition is then evaluated and the weight vector of the target data set is updated according to the evaluation result. Repeating the above steps for several rounds produces multiple clustering partitions. The multiple partitions are then merged into a new feature representation, a traditional clustering algorithm is applied to this representation, and an integrated clustering result is generated. That invention gives ensemble learning stronger noise resistance and a greater ability to handle problem data, and the new features can effectively and comprehensively characterize both global and local clustering structure information, so that the ensemble learning algorithm performs well on data sets with different characteristics. The patent "Clustering ensemble method based on a hybrid clustering ensemble selection strategy" (publication No. CN107169511A) converts the clustering ensemble selection problem into a feature selection problem, generates base clustering results with greater diversity from multiple angles, uses a feature selection algorithm for optimization to avoid human factors and redundancy, and organically combines each subset of clustering results by considering local and global weights to improve clustering accuracy. The steps of that method include: inputting the test data-set sample matrix X; clustering the sample matrix X to generate a set of base clustering results; transforming the set of base clustering results into a new feature space in which each base clustering result serves as one feature; applying feature selection to the set to obtain a subset of clustering results; assigning a weight function to the subset to obtain the final subset of clustering results; and integrating the final subset to obtain the final clustering result.
When a unified integrated result is generated from a set of cluster members, a common approach is to measure the similarity between samples by the frequency with which they appear in the same cluster across the different cluster members, construct the similarity matrix of the data set, and then partition the data set with a minimum-cut method to obtain a unified integrated result. However, because the quality of the cluster members in the collective is uneven, their influences on the final clustering ensemble result are also not the same; ignoring these influences and considering only sample similarity may reduce the validity of the clustering ensemble result. For this reason, the present invention proposes a clustering ensemble method based on weighted similarity measurement which, when computing sample similarity, reinforces the positive influence of higher-quality cluster members in the cluster collective on the ensemble result while limiting the adverse interference of lower-quality members, so that the clustering ensemble result has greater accuracy and robustness.
Summary of the invention
The technical problem to be solved by the present invention is to design a clustering ensemble method in which similarity measurement is weighted according to the quality of the cluster members, reinforcing the positive influence of high-quality cluster members during integration while suppressing the adverse interference of low-quality cluster members, so as to obtain a clustering ensemble result with greater accuracy and robustness. The method first calculates, for every pair of samples in the data set, the consistency of their description of the symbol-space data within each cluster member; it then calculates the consistency with which each cluster member describes the feature-space data and uses this to compute an integration weight for each cluster member; on this basis it calculates the weighted similarity between any two samples in the data set and constructs the weighted similarity matrix of the data set, thereby converting the clustering ensemble task into a graph minimum-cut problem; the problem is solved with spectral clustering to obtain the clustering ensemble result, which is finally output.
The method proposed by the present invention can be used for clustering tasks on all kinds of numerical data sets. For example, it can be used to identify parallel patterns in gene expression data, together with gene sets and sample sets sharing the same biological meaning, so that genes related in function and transcriptional regulation can be found on the basis of the clustering analysis; it can also be used to discover the connection characteristics of structure and function between nodes in complex network data sets and to partition community structure, and thus to understand the function of a complex network, to seek the rules hidden in the network, and to predict its behavior; it can further be used to process complex image data sets, dividing an image into several non-overlapping regions according to visual features, salient targets, background scenes and the like, so as to realize image segmentation.
The technical scheme adopted by the invention is: a clustering ensemble method based on weighted similarity measurement. For a data set X = {x_1, x_2, ..., x_N} of N samples in feature space, the i-th sample is denoted x_i. C = {C_1, C_2, ..., C_T} denotes the set of cluster members generated on the data set X, where T is the number of cluster members in C, C_t denotes the t-th cluster member in C, C_{t,k} is the k-th cluster in C_t, and S_t is the number of clusters in C_t. A clustering is regarded as a symbolic representation of the data set, so each cluster member in the cluster collective corresponds to a cluster symbol vector in symbol space; the set formed by the T cluster symbol vectors is denoted L = {l_1, l_2, ..., l_T}, where l_t denotes the cluster symbol vector of the t-th cluster member C_t and l_{t,k} denotes the label of the k-th cluster in C_t. C* = {C*_1, ..., C*_{S*}} denotes the clustering ensemble result, where C*_s denotes the s-th cluster in C* and S* denotes the number of clusters in C*. The content of the present invention is the process of generating the ensemble result C* from C = {C_1, ..., C_T}, comprising the following steps:
S10. Perform data standardization on the data set: use a Gaussian kernel function to map the data set X = {x_1, x_2, ..., x_N} in feature space, so that the standardized data set Ψ = {ψ_1, ψ_2, ..., ψ_N} obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set;
S20. Calculate, for any two samples in the data set, the consistency of their description of the symbol-space data within each cluster member: first, calculate the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X; then calculate the conditional information entropy of L with respect to the clusters to which the two samples belong in some cluster member, which expresses the uncertainty of describing the symbol-space data with these two clusters; then take the difference of these two conditional information entropies of L as the consistency of the two samples' description of the symbol-space data in this cluster member; proceed in the same way to calculate the consistency of every pair of samples in every cluster member;
S30. Calculate the consistency with which each cluster member describes the feature-space data: first, calculate the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X; then calculate the conditional information entropy of Ψ with respect to some cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; take the difference of these two conditional information entropies of Ψ as that cluster member's consistency in describing the feature-space data, and proceed in the same way to calculate the consistency of every cluster member;
S40. According to the consistency with which each cluster member describes the feature-space data, calculate an integration weight for each cluster member, which controls that member's influence on the final clustering ensemble result;
S50. Using the pairwise symbol-space consistency obtained in step S20 and the integration weights of the cluster members obtained in step S40, calculate the weighted similarity between any two samples in the data set;
S60. Convert the clustering ensemble task into a graph minimum-cut problem, i.e. minimize the total weighted similarity between all pairs of objects that do not lie in the same cluster of the final clustering ensemble result;
S70. Solve the graph minimum-cut problem obtained from the clustering ensemble task with spectral clustering, obtaining the clustering ensemble result C*;
S80. Output the clustering ensemble result C*.
The Gaussian kernel function used in step S10 of this method is as shown in formula (1):
where the value of the parameter α is set to the standard deviation of ||x_i − x_o||², x_o is the o-th sample in the data set X (i ≠ o), and ψ_o is the o-th sample in the standardized data set Ψ.
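Formula (1) is reproduced only as an image in the source, so the following Python sketch assumes a standard Gaussian (RBF) kernel exp(−‖x_i − x_o‖²/(2α²)) and represents each standardized sample ψ_i by its vector of kernel values against all samples; the function name `gaussian_standardize` and this particular reading of the mapping are illustrative assumptions, not the patent's exact formula.

```python
import numpy as np

def gaussian_standardize(X):
    """Sketch of step S10: map X (N x d) through a Gaussian kernel.

    Assumption: formula (1) is not reproduced in the source, so a standard
    RBF kernel is used; alpha is set to the standard deviation of the
    pairwise squared distances ||x_i - x_o||^2, as stated in the text.
    """
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)                       # pairwise squared distances
    alpha = d2[~np.eye(len(X), dtype=bool)].std()  # std of off-diagonal distances
    Psi = np.exp(-d2 / (2.0 * alpha ** 2))         # row i is taken as psi_i
    return Psi
```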
Step S20 described in this method includes:
S21. Calculate with formula (2) the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X:
where H(l_t | X) is the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to the data set X, and can be calculated by formula (3):
where P(l_{t,k} | X) denotes the conditional probability of the cluster symbol vector l_t with respect to the data set X, which can be calculated by formula (4):
where x_i(l_t) denotes the value of sample x_i on the t-th cluster symbol vector, i.e. the cluster label of sample x_i in the t-th cluster member;
S22. For any two samples x_i and x_j in the data set X, consider the clusters to which they belong in the t-th cluster member C_t; calculate with formula (5) the conditional information entropy of the cluster symbol vector set L with respect to these two clusters, which expresses the uncertainty of describing the symbol-space data with these two clusters:
where the two clusters to which x_i and x_j belong form a set, and the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to this set can be calculated by formula (6):
where the conditional probability of the cluster symbol vector l_t with respect to this set can be calculated by formula (7):
where x_d(l_t) denotes the value of sample x_d on the t-th cluster symbol vector, i.e. the cluster label of sample x_d in the t-th cluster member;
S23. Take the difference between the conditional information entropy of the cluster symbol vector set L with respect to the data set X and its conditional information entropy with respect to the two clusters as the consistency of samples x_i and x_j in cluster member C_t in describing the symbol-space data, as shown in formula (8):
S24. Using the method of steps S21–S23, calculate for every pair of samples in the data set X their consistency in describing the symbol-space data in each cluster member.
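Formulas (2)–(8) appear only as images in the source; the sketch below therefore assumes that the conditional probabilities are estimated as cluster-size fractions and that the consistency of a pair is the drop in Shannon entropy when attention is restricted to the two clusters containing the pair. The function names are illustrative only.

```python
import numpy as np

def _entropy(counts):
    """Shannon entropy of a discrete distribution given by counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def symbol_space_consistency(labels, i, j):
    """Sketch of S21-S23 for one cluster member with label vector `labels`.

    Assumed reading: H(l_t | X) is the entropy of the cluster sizes over the
    whole data set, the second entropy is restricted to the union of the two
    clusters containing x_i and x_j, and the consistency is their difference.
    """
    labels = np.asarray(labels)
    h_full = _entropy(np.unique(labels, return_counts=True)[1])
    mask = np.isin(labels, [labels[i], labels[j]])
    h_pair = _entropy(np.unique(labels[mask], return_counts=True)[1])
    return h_full - h_pair

def all_pairs_consistency(members):
    """Step S24: consistency of every sample pair in every cluster member."""
    members = [np.asarray(m) for m in members]
    n = len(members[0])
    out = np.zeros((len(members), n, n))
    for t, labels in enumerate(members):
        for i in range(n):
            for j in range(n):
                out[t, i, j] = symbol_space_consistency(labels, i, j)
    return out
```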
Step S30 described in this method includes:
S31. Calculate with formula (9) the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X:
where H(Ψ | X) is the conditional information entropy of the standardized data set Ψ with respect to the data set X, and the variance of the standardized data set Ψ is calculated by formula (10):
where μ_Ψ is the expectation of the standardized data set Ψ, satisfying formula (11):
S32. Calculate the conditional information entropy of the standardized data set Ψ with respect to each cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; the conditional information entropy of Ψ with respect to the t-th cluster member C_t can be calculated by formula (12):
where H(Ψ | C_t) is the conditional information entropy of the standardized data set Ψ with respect to the t-th cluster member C_t, and the variance of the samples in C_t is calculated by formula (13):
where the expectation of the samples in C_t satisfies formula (14):
where ψ_e is the e-th sample in the standardized data set Ψ, x_e is the e-th sample of the data set X, x_f is the f-th sample of the data set X (e ≠ f), and x_g and x_h are any two different samples in the data set X;
S33. Take the difference of the above two conditional information entropies of the standardized data set Ψ as the cluster member's consistency in describing the feature-space data; the consistency measure of the t-th cluster member C_t on Ψ is calculated by formula (15):
I(Ψ | C_t) = H(Ψ | X) − H(Ψ | C_t)    (15)
where I(Ψ | C_t) denotes the consistency measure of C_t on Ψ;
S34. Using the method of steps S31–S33, calculate one by one each cluster member's consistency in describing the feature-space data.
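Formulas (9)–(14) are likewise reproduced only as images; the sketch below assumes the Gaussian differential entropy 0.5·ln(2πeσ²), with the overall variance of Ψ for H(Ψ|X) and the size-weighted within-cluster variance for H(Ψ|C_t), and applies formula (15) to obtain I(Ψ|C_t). These variance choices are assumptions, not the patent's exact formulas.

```python
import numpy as np

def _gaussian_entropy(var):
    """Differential entropy of a Gaussian with variance `var` (assumed form)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * max(var, 1e-12))

def feature_space_consistency(Psi, labels):
    """Sketch of S31-S33: consistency I(Psi | C_t) of one cluster member."""
    Psi = np.asarray(Psi, dtype=float)
    labels = np.asarray(labels)
    h_x = _gaussian_entropy(Psi.var())            # H(Psi | X) from total variance
    n = len(Psi)
    var_within = sum(len(Psi[labels == k]) / n * Psi[labels == k].var()
                     for k in np.unique(labels))  # pooled within-cluster variance
    h_ct = _gaussian_entropy(var_within)          # H(Psi | C_t)
    return h_x - h_ct                             # formula (15)
```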
In step S40 of this method, the integration weight of each cluster member is calculated from its consistency in describing the feature-space data, as shown in formula (16):
where ω_t denotes the clustering ensemble weight of cluster member C_t.
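Formula (16) is not reproduced in the source; a natural reading, used in the sketch below, is to normalize each member's feature-space consistency by the sum over all members so that the weights sum to one. This is an assumption, not the exact formula.

```python
import numpy as np

def ensemble_weights(consistencies):
    """Sketch of step S40: integration weight of each cluster member."""
    c = np.clip(np.asarray(consistencies, dtype=float), 0.0, None)
    return c / c.sum()  # omega_t proportional to I(Psi | C_t), assumed form
```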
In step S50 of this method, the weighted similarity between two samples in the data set X is calculated as shown in formula (17):
where sim(x_i, x_j) denotes the weighted similarity between samples x_i and x_j; the weighted similarity of any two samples in the data set X is calculated in this way.
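Formula (17) is also only an image in the source; the sketch below takes the weighted similarity sim(x_i, x_j) to be the sum over cluster members of the integration weight ω_t times the pair's symbol-space consistency in that member, which matches the verbal description of step S50 but is still an assumed form.

```python
import numpy as np

def weighted_similarity(pair_consistency, weights):
    """Sketch of step S50: weighted similarity of every pair of samples.

    pair_consistency : array (T, N, N) from step S20.
    weights          : array (T,) of integration weights from step S40.
    """
    return np.einsum('t,tij->ij',
                     np.asarray(weights, dtype=float),
                     np.asarray(pair_consistency, dtype=float))
```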
Step S60 described in this method includes:
S61. Construct the weighted similarity matrix Θ = [θ(x_p, x_q)]_{N×N} of the data set X, where the matrix element θ(x_p, x_q) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(x_i, x_j);
S62. Convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
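Formulas (18) and (19) are reproduced only as images; the sketch below assumes a Gaussian transform of the weighted similarities for θ(x_p, x_q), with γ set to the standard deviation of sim as stated, and evaluates the cut objective as the total affinity between samples assigned to different clusters. Both forms are assumptions chosen to match the verbal description.

```python
import numpy as np

def affinity_matrix(sim, gamma=None):
    """Sketch of S61: weighted similarity matrix Theta of the data set.

    Assumed form: theta(x_p, x_q) = exp(-(s_max - sim(x_p, x_q))**2 / gamma**2),
    so the most similar pairs receive the largest affinity.
    """
    sim = np.asarray(sim, dtype=float)
    if gamma is None:
        gamma = sim.std()              # as stated in the text
    theta = np.exp(-((sim.max() - sim) ** 2) / (gamma ** 2))
    np.fill_diagonal(theta, 0.0)
    return theta

def cut_value(theta, assignment):
    """Sketch of the formula (19) objective: total affinity cut by a partition."""
    assignment = np.asarray(assignment)
    across = assignment[:, None] != assignment[None, :]
    return 0.5 * float(theta[across].sum())
```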
Step S70 described in this method includes:
S71. Construct an N-dimensional diagonal matrix from the column sums of the weighted similarity matrix Θ, denote it D, and define the matrix L = D − Θ;
S72. Find the S* smallest eigenvalues of the matrix L, sorted in ascending order, and their corresponding eigenvectors;
S73. Arrange the S* eigenvectors side by side to form an N × S* matrix, treat each row as a vector in an S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which a row belongs in the clustering result is the cluster to which the corresponding sample of the data set X belongs.
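Steps S71–S73 are specified closely enough to sketch directly; the version below uses NumPy for the eigendecomposition and scikit-learn's KMeans for the row clustering (the use of scikit-learn is an implementation choice, not part of the patent).

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_ensemble(theta, n_clusters):
    """Sketch of step S70 (S71-S73): solve the graph minimum-cut problem.

    theta      : weighted similarity matrix Theta (N x N).
    n_clusters : S*, the number of clusters in the ensemble result.
    """
    theta = np.asarray(theta, dtype=float)
    D = np.diag(theta.sum(axis=0))          # S71: diagonal matrix of column sums
    L = D - theta                           # S71: L = D - Theta
    eigvals, eigvecs = np.linalg.eigh(L)    # S72: eigenvalues in ascending order
    embedding = eigvecs[:, np.argsort(eigvals)[:n_clusters]]  # S73: N x S* matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(embedding)        # row's cluster = sample's cluster
```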
Aimed at the problem that the cluster members in a cluster collective are of uneven quality and exert an adverse influence on the clustering ensemble, the present invention proposes a clustering ensemble method based on weighted similarity measurement. It first calculates, for every pair of samples in the data set, the consistency of their description of the symbol-space data within each cluster member; it then calculates the consistency with which each cluster member describes the feature-space data and uses it to compute an integration weight for each cluster member; on this basis it calculates the weighted similarity of any two samples in the data set, then constructs the weighted similarity matrix of the data set, thereby converting the clustering ensemble task into a graph minimum-cut problem, solves it with spectral clustering to obtain the clustering ensemble result, and finally outputs the result. The main quantities of the invention include: the cluster symbol vector set; the conditional information entropy of the cluster symbol vector set with respect to the data set; the conditional information entropy of the cluster symbol vector set with respect to the clusters to which two samples belong in some cluster member; the consistency of two samples in some cluster member in describing the symbol-space data; the conditional information entropy of the standardized data set with respect to the original data set; the conditional information entropy of the standardized data set with respect to some cluster member; the consistency of a cluster member in describing the feature-space data; the integration weight of a cluster member; the weighted similarity between two samples; and the weighted similarity matrix. The cluster symbol vector set is the set of cluster symbol vectors that describe the data set symbolically in symbol space. The conditional information entropy of the cluster symbol vector set with respect to the data set expresses the uncertainty of describing the symbol-space data with the data set X. The conditional information entropy of the cluster symbol vector set with respect to the clusters to which two samples belong in some cluster member expresses the uncertainty of describing the symbol-space data with these two clusters. The consistency of two samples in some cluster member in describing the symbol-space data is the difference between the conditional information entropy of the cluster symbol vector set with respect to the data set and its conditional information entropy with respect to the clusters to which the two samples belong in that cluster member. The conditional information entropy of the standardized data set with respect to the original data set expresses the uncertainty of describing the feature-space data with the data set. The conditional information entropy of the standardized data set with respect to some cluster member expresses that cluster member's uncertainty in describing the feature-space data. The consistency of a cluster member in describing the feature-space data is the difference between the conditional information entropy of the standardized data set with respect to the original data set and its conditional information entropy with respect to that cluster member. The integration weight of a cluster member controls that member's influence on the final clustering ensemble result. The weighted similarity between two samples describes the similarity of the cluster assignments of the two samples across the series of cluster members. The weighted similarity matrix describes the similarity of cluster assignments between all samples in the data set.
The beneficial effects of the present invention are: information entropy is used to measure consistency in data description and thereby to compute the clustering ensemble weight of each cluster member; weighted similarity expresses the similarity of two samples' cluster assignments across the series of cluster members; the clustering ensemble task is converted into a graph minimum-cut problem and solved by spectral clustering. Because the method fully considers the influence of cluster members of different quality when constructing the similarity matrix of the data set, the clustering ensemble result finally obtained has greater accuracy and robustness.
Description of the drawings
Fig. 1 is a structure diagram of the computer-implemented system of the clustering ensemble method based on weighted similarity measurement of the present invention;
Fig. 2 is a flow chart of the clustering ensemble method based on weighted similarity measurement of the present invention.
Specific embodiment
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The clustering ensemble method based on weighted similarity measurement of the present invention is implemented by a computer program; Fig. 1 shows the structure of the computer-implemented system. In the following, the technical solution proposed by the present invention is used to process remote-sensing image data: the remote-sensing images are classified automatically so as to realize the recognition of ground-object targets. The input data is an image data set consisting of pixels; pixels with similar spectral features are grouped into one class by the clustering ensemble method and identified as the same ground-object target, and the various ground-object targets identified are finally output. The specific implementation process is shown in Fig. 2. The spectral features of each pixel in the remote-sensing image are taken as one sample, and the N samples constitute the image data set X = {x_1, x_2, ..., x_N}; the i-th sample in feature space is denoted x_i. C = {C_1, C_2, ..., C_T} denotes the set of cluster members generated on the data set X, i.e. a series of partitions of the pixels, where T is the number of cluster members in C, C_t denotes the t-th cluster member, i.e. the t-th partition of the image data set, C_{t,k} is the k-th cluster in C_t, i.e. the k-th class of that partition, and S_t is the number of clusters in C_t. A clustering is regarded as a symbolic representation of the data set, so each cluster member in the cluster collective corresponds to a cluster symbol vector in symbol space; the set formed by the T cluster symbol vectors is denoted L = {l_1, ..., l_T}, where l_t denotes the cluster symbol vector of the t-th cluster member C_t and l_{t,k} denotes the label of the k-th cluster in C_t. C* denotes the clustering ensemble result, i.e. the final integrated partition, where C*_s denotes the s-th cluster in C* and S* denotes the number of clusters in C*. The present embodiment is the process of generating the consistent ground-object target recognition result C* from a series of pixel partitions C = {C_1, ..., C_T} of the image data set X, comprising the following key steps:
Step 1. Perform data standardization on the data set: use the Gaussian kernel function shown in formula (1) to map the data set X = {x_1, x_2, ..., x_N} in feature space, so that the data set Ψ = {ψ_1, ψ_2, ..., ψ_N} obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set;
where the value of the parameter α is set to the standard deviation of ||x_i − x_o||², x_o is the o-th sample in the data set X (i ≠ o), and ψ_o is the o-th sample in the standardized data set Ψ.
Step 2. Calculate, for any two samples in the data set, the consistency of their description of the symbol-space data within each cluster member: first, calculate the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X; then calculate the conditional information entropy of L with respect to the clusters to which the two samples belong in some cluster member, which expresses the uncertainty of describing the symbol-space data with these two clusters; then take the difference of these two conditional information entropies of L as the consistency of the two samples' description of the symbol-space data in this cluster member; proceed in the same way for every pair of samples in every cluster member, comprising the following steps:
S21. Calculate with formula (2) the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X:
where H(l_t | X) is the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to the data set X, and can be calculated by formula (3):
where P(l_{t,k} | X) denotes the conditional probability of the cluster symbol vector l_t with respect to the data set X, which can be calculated by formula (4):
where x_i(l_t) denotes the value of sample x_i on the t-th cluster symbol vector, i.e. the cluster label of sample x_i in the t-th cluster member;
S22. For any two samples x_i and x_j in the data set X, consider the clusters to which they belong in the t-th cluster member C_t; calculate with formula (5) the conditional information entropy of the cluster symbol vector set L with respect to these two clusters, which expresses the uncertainty of describing the symbol-space data with these two clusters:
where the two clusters to which x_i and x_j belong form a set, and the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to this set can be calculated by formula (6):
where the conditional probability of the cluster symbol vector l_t with respect to this set can be calculated by formula (7):
where x_d(l_t) denotes the value of sample x_d on the t-th cluster symbol vector, i.e. the cluster label of sample x_d in the t-th cluster member;
S23. Take the difference between the conditional information entropy of the cluster symbol vector set L with respect to the data set X and its conditional information entropy with respect to the two clusters as the consistency of samples x_i and x_j in cluster member C_t in describing the symbol-space data, as shown in formula (8):
S24. Using the method of steps S21–S23, calculate for every pair of samples in the data set X their consistency in describing the symbol-space data in each cluster member.
Step 3. Calculate the consistency with which each cluster member describes the feature-space data: first, calculate the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X; then calculate the conditional information entropy of Ψ with respect to some cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; take the difference of these two conditional information entropies of Ψ as that cluster member's consistency in describing the feature-space data, and proceed in the same way for every cluster member, comprising the following steps:
S31. Calculate with formula (9) the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X:
where H(Ψ | X) is the conditional information entropy of the standardized data set Ψ with respect to the data set X, and the variance of the standardized data set Ψ is calculated by formula (10):
where μ_Ψ is the expectation of the standardized data set Ψ, satisfying formula (11):
S32. Calculate the conditional information entropy of the standardized data set Ψ with respect to each cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; the conditional information entropy of Ψ with respect to the t-th cluster member C_t can be calculated by formula (12):
where H(Ψ | C_t) is the conditional information entropy of the standardized data set Ψ with respect to the t-th cluster member C_t, and the variance of the samples in C_t is calculated by formula (13):
where the expectation of the samples in C_t satisfies formula (14):
where ψ_e is the e-th sample in the standardized data set Ψ, x_e is the e-th sample of the data set X, x_f is the f-th sample of the data set X (e ≠ f), and x_g and x_h are any two different samples in the data set X;
S33. Take the difference of the above two conditional information entropies of the standardized data set Ψ as the cluster member's consistency in describing the feature-space data; the consistency measure of the t-th cluster member C_t on Ψ is calculated by formula (15):
I(Ψ | C_t) = H(Ψ | X) − H(Ψ | C_t)    (15)
where I(Ψ | C_t) denotes the consistency measure of C_t on Ψ;
S34. Using the method of steps S31–S33, calculate one by one each cluster member's consistency in describing the feature-space data.
Step 4. According to the consistency with which each cluster member describes the feature-space data, calculate the integration weight of each cluster member, which controls that member's influence on the final clustering ensemble result; the clustering ensemble weight ω_t of cluster member C_t is calculated as shown in formula (16):
Step 5. Using the pairwise symbol-space consistency obtained in step S20 and the integration weights of the cluster members obtained in step S40, calculate the weighted similarity between any two samples in the data set; the weighted similarity sim(x_i, x_j) between x_i and x_j is calculated as shown in formula (17):
Step 6. Convert the clustering ensemble task into a graph minimum-cut problem, i.e. minimize the total weighted similarity between all pairs of objects that do not lie in the same cluster of the final clustering ensemble result, comprising the following steps:
S61. Construct the weighted similarity matrix Θ = [θ(x_p, x_q)]_{N×N} of the data set X, where the matrix element θ(x_p, x_q) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(x_i, x_j);
S62. Convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
Step 7. Solve the graph minimum-cut problem obtained from the clustering ensemble task with spectral clustering to obtain the clustering ensemble result C*, comprising the following steps:
S71. Construct an N-dimensional diagonal matrix from the column sums of the weighted similarity matrix Θ, denote it D, and define the matrix L = D − Θ;
S72. Find the S* smallest eigenvalues of the matrix L, sorted in ascending order, and their corresponding eigenvectors;
S73. Arrange the S* eigenvectors side by side to form an N × S* matrix, treat each row as a vector in an S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which a row belongs in the clustering result is the cluster to which the corresponding sample of the data set X belongs.
Step 8. Output the clustering ensemble result C*, i.e. the ground-object target recognition result; each cluster in the result represents one identified ground-object target.
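Read together, the sketches given with steps S10–S70 above can be chained into one pipeline for the remote-sensing embodiment; the data and the three cluster members below are random stand-ins, and all function names are the illustrative ones introduced earlier rather than part of the patent.

```python
import numpy as np

# Hypothetical end-to-end run on stand-in data: 100 "pixels" with 8 spectral
# features each, and three pre-computed cluster members (label vectors).
X = np.random.rand(100, 8)
members = [np.random.randint(0, 4, size=100) for _ in range(3)]

Psi = gaussian_standardize(X)                                     # step 1
pair_cons = all_pairs_consistency(members)                        # step 2
feat_cons = [feature_space_consistency(Psi, m) for m in members]  # step 3
weights = ensemble_weights(feat_cons)                             # step 4
sim = weighted_similarity(pair_cons, weights)                     # step 5
theta = affinity_matrix(sim)                                      # step 6
labels = spectral_ensemble(theta, n_clusters=4)                   # step 7
print(labels)                                  # step 8: each cluster is one target
```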

Claims (8)

1. A clustering ensemble method based on weighted similarity measurement, collecting sample data, wherein for a data set X = {x_1, x_2, ..., x_N} of N samples in feature space, the i-th sample in the data set X is denoted x_i; C = {C_1, C_2, ..., C_T} denotes the set of cluster members generated on the data set X, where T is the number of cluster members in C, C_t denotes the t-th cluster member in C, C_{t,k} is the k-th cluster in C_t, and S_t is the number of clusters in C_t; a clustering is regarded as a symbolic representation of the data set, so each cluster member in the cluster collective corresponds to a cluster symbol vector in symbol space, and the set formed by the T cluster symbol vectors is denoted L = {l_1, l_2, ..., l_T}, where l_t denotes the cluster symbol vector of the t-th cluster member C_t and l_{t,k} denotes the label of the k-th cluster in C_t; C* = {C*_1, ..., C*_{S*}} denotes the clustering ensemble result, where C*_s denotes the s-th cluster in C* and S* denotes the number of clusters in C*; the process of generating the clustering ensemble result C* from C = {C_1, ..., C_T} comprises the following steps:
S10. Perform data standardization on the data set X: use a Gaussian kernel function to map the data set X = {x_1, x_2, ..., x_N}, so that the standardized data set Ψ = {ψ_1, ψ_2, ..., ψ_N} obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set;
S20. Calculate, for any two samples in the data set X, the consistency of their description of the symbol-space data within each cluster member: first, calculate the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X; then calculate the conditional information entropy of L with respect to the clusters to which the two samples belong in some cluster member, which expresses the uncertainty of describing the symbol-space data with these two clusters; then take the difference of these two conditional information entropies of L as the consistency of the two samples' description of the symbol-space data in this cluster member; proceed in the same way to calculate the consistency of every pair of samples in every cluster member;
S30. Calculate the consistency with which each cluster member describes the feature-space data: first, calculate the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X; then calculate the conditional information entropy of the standardized data set Ψ with respect to some cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; take the difference of these two conditional information entropies of Ψ as that cluster member's consistency in describing the feature-space data, and proceed in the same way to calculate the consistency of every cluster member;
S40. According to the consistency with which each cluster member describes the feature-space data, calculate an integration weight for each cluster member, which controls that member's influence on the final clustering ensemble result;
S50. Using the pairwise symbol-space consistency obtained in step S20 and the integration weights of the cluster members obtained in step S40, calculate the weighted similarity between any two samples in the data set;
S60. Convert the clustering ensemble task into a graph minimum-cut problem, i.e. minimize the total weighted similarity between all pairs of objects that do not lie in the same cluster of the final clustering ensemble result;
S70. Solve the graph minimum-cut problem obtained from the clustering ensemble task with spectral clustering, obtaining the clustering ensemble result C*;
S80. Output the clustering ensemble result C*.
2. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that the Gaussian kernel function of step S10 is as shown in formula (1):
where the value of the parameter α is set to the standard deviation of ||x_i − x_o||², x_o is the o-th sample in the data set X (i ≠ o), and ψ_o is the o-th sample in the standardized data set Ψ.
3. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S20 of the method comprises:
S21. Calculate with formula (2) the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data with the data set X:
where H(l_t | X) is the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to the data set X, and can be calculated by formula (3):
where P(l_{t,k} | X) denotes the conditional probability of the cluster symbol vector l_t with respect to the data set X, which can be calculated by formula (4):
where x_i(l_t) denotes the value of sample x_i on the t-th cluster symbol vector, i.e. the cluster label of sample x_i in the t-th cluster member;
S22. For any two samples x_i and x_j in the data set X, consider the clusters to which they belong in the t-th cluster member C_t; calculate with formula (5) the conditional information entropy of the cluster symbol vector set L with respect to these two clusters, which expresses the uncertainty of describing the symbol-space data with these two clusters:
where the two clusters to which x_i and x_j belong form a set, and the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to this set can be calculated by formula (6):
where the conditional probability of the cluster symbol vector l_t with respect to this set can be calculated by formula (7):
where x_d(l_t) denotes the value of sample x_d on the t-th cluster symbol vector, i.e. the cluster label of sample x_d in the t-th cluster member;
S23. Take the difference between the conditional information entropy of the cluster symbol vector set L with respect to the data set X and its conditional information entropy with respect to the two clusters as the consistency of samples x_i and x_j in cluster member C_t in describing the symbol-space data, as shown in formula (8):
S24. Using the method of steps S21–S23, calculate for every pair of samples in the data set X their consistency in describing the symbol-space data in each cluster member.
4. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S30 of the method comprises:
S31. Calculate with formula (9) the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data with the data set X:
where H(Ψ | X) is the conditional information entropy of the standardized data set Ψ with respect to the data set X, and the variance of the standardized data set Ψ is calculated by formula (10):
where μ_Ψ is the expectation of the standardized data set Ψ, satisfying formula (11):
S32. Calculate the conditional information entropy of the standardized data set Ψ with respect to each cluster member, which expresses that cluster member's uncertainty in describing the feature-space data; the conditional information entropy of Ψ with respect to the t-th cluster member C_t can be calculated by formula (12):
where H(Ψ | C_t) is the conditional information entropy of the standardized data set Ψ with respect to the t-th cluster member C_t, and the variance of the samples in C_t is calculated by formula (13):
where the expectation of the samples in C_t satisfies formula (14):
where ψ_e is the e-th sample in the standardized data set Ψ, x_e is the e-th sample of the data set X, x_f is the f-th sample of the data set X (e ≠ f), and x_g and x_h are any two different samples in the data set X;
S33. Take the difference of the above two conditional information entropies of the standardized data set Ψ as the cluster member's consistency in describing the feature-space data; the consistency measure of the t-th cluster member C_t on Ψ is calculated by formula (15):
I(Ψ | C_t) = H(Ψ | X) − H(Ψ | C_t)    (15)
where I(Ψ | C_t) denotes the consistency measure of C_t on Ψ;
S34. Using the method of steps S31–S33, calculate one by one each cluster member's consistency in describing the feature-space data.
5. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that in step S40 of the method the integration weight of each cluster member is calculated from its consistency in describing the feature-space data, as shown in formula (16):
where ω_t denotes the clustering ensemble weight of cluster member C_t.
6. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that in step S50 of the method the weighted similarity between two samples in the data set X is calculated as shown in formula (17):
where sim(x_i, x_j) denotes the weighted similarity between samples x_i and x_j; the weighted similarity of any two samples in the data set X is calculated in this way.
7. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S60 of the method comprises:
S61. Construct the weighted similarity matrix Θ = [θ(x_p, x_q)]_{N×N} of the data set X, where the matrix element θ(x_p, x_q) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(x_i, x_j);
S62. Convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
8. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S70 of the method comprises:
S71. Construct an N-dimensional diagonal matrix from the column sums of the weighted similarity matrix Θ, denote it D, and define the matrix L = D − Θ;
S72. Find the S* smallest eigenvalues of the matrix L, sorted in ascending order, and their corresponding eigenvectors;
S73. Arrange the S* eigenvectors side by side to form an N × S* matrix, treat each row as a vector in an S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which a row belongs in the clustering result is the cluster to which the corresponding sample of the data set X belongs.
CN201910079817.5A 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement Pending CN109829494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910079817.5A CN109829494A (en) 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910079817.5A CN109829494A (en) 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement

Publications (1)

Publication Number Publication Date
CN109829494A true CN109829494A (en) 2019-05-31

Family

ID=66862670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910079817.5A Pending CN109829494A (en) 2019-01-28 2019-01-28 A kind of clustering ensemble method based on weighting similarity measurement

Country Status (1)

Country Link
CN (1) CN109829494A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276188A (en) * 2020-01-19 2020-06-12 西安理工大学 Short-time-sequence gene expression data clustering method based on angle characteristics
CN111464529A (en) * 2020-03-31 2020-07-28 山西大学 Network intrusion detection method and system based on cluster integration
CN111598120A (en) * 2020-03-31 2020-08-28 宁波吉利汽车研究开发有限公司 Data labeling method, equipment and device
CN111726765A (en) * 2020-05-29 2020-09-29 山西大学 WIFI indoor positioning method and system for large-scale complex scene
CN111899115A (en) * 2020-05-30 2020-11-06 中国兵器科学研究院 Method, device and storage medium for determining community structure in social network
CN117828380A (en) * 2024-03-05 2024-04-05 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device
CN117828380B (en) * 2024-03-05 2024-06-04 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957063A (en) * 2016-04-22 2016-09-21 北京理工大学 CT image liver segmentation method and system based on multi-scale weighting similarity measure
CN109214427A (en) * 2018-08-13 2019-01-15 山西大学 A kind of Nearest Neighbor with Weighted Voting clustering ensemble method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957063A (en) * 2016-04-22 2016-09-21 北京理工大学 CT image liver segmentation method and system based on multi-scale weighting similarity measure
CN109214427A (en) * 2018-08-13 2019-01-15 山西大学 A kind of Nearest Neighbor with Weighted Voting clustering ensemble method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bai Liang et al.: "A weighted clustering ensemble algorithm based on the fusion of clustering criteria", Journal of Shanxi University (Natural Science Edition) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276188A (en) * 2020-01-19 2020-06-12 西安理工大学 Short-time-sequence gene expression data clustering method based on angle characteristics
CN111276188B (en) * 2020-01-19 2023-03-24 西安理工大学 Short-time-sequence gene expression data clustering method based on angle characteristics
CN111464529A (en) * 2020-03-31 2020-07-28 山西大学 Network intrusion detection method and system based on cluster integration
CN111598120A (en) * 2020-03-31 2020-08-28 宁波吉利汽车研究开发有限公司 Data labeling method, equipment and device
CN111726765A (en) * 2020-05-29 2020-09-29 山西大学 WIFI indoor positioning method and system for large-scale complex scene
CN111899115A (en) * 2020-05-30 2020-11-06 中国兵器科学研究院 Method, device and storage medium for determining community structure in social network
CN117828380A (en) * 2024-03-05 2024-04-05 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device
CN117828380B (en) * 2024-03-05 2024-06-04 厦门爱逸零食研究所有限公司 Intelligent sterilization detection method and device

Similar Documents

Publication Publication Date Title
CN109829494A (en) A kind of clustering ensemble method based on weighting similarity measurement
McIver et al. Estimating pixel-scale land cover classification confidence using nonparametric machine learning methods
Erisoglu et al. A new algorithm for initial cluster centers in k-means algorithm
CN103020978B (en) SAR (synthetic aperture radar) image change detection method combining multi-threshold segmentation with fuzzy clustering
CN103648106B (en) WiFi indoor positioning method of semi-supervised manifold learning based on category matching
CN114564982B (en) Automatic identification method for radar signal modulation type
CN107239795A (en) SAR image change detecting system and method based on sparse self-encoding encoder and convolutional neural networks
Yang et al. A feature-metric-based affinity propagation technique for feature selection in hyperspectral image classification
CN108664986B (en) Based on lpNorm regularized multi-task learning image classification method and system
CN109490838A (en) A kind of Recognition Method of Radar Emitters of data base-oriented incompleteness
CN108492298A (en) Based on the multispectral image change detecting method for generating confrontation network
CN108154094A (en) The non-supervisory band selection method of high spectrum image divided based on subinterval
CN106529563A (en) High-spectral band selection method based on double-graph sparse non-negative matrix factorization
CN111062428A (en) Hyperspectral image clustering method, system and equipment
CN108564083A (en) A kind of method for detecting change of remote sensing image and device
CN105205807B (en) Method for detecting change of remote sensing image based on sparse automatic coding machine
CN109034238A (en) A kind of clustering method based on comentropy
Lu et al. Multiple-kernel combination fuzzy clustering for community detection
CN106599927B (en) The Target cluster dividing method divided based on Fuzzy ART
CN115186012A (en) Power consumption data detection method, device, equipment and storage medium
CN111680579A (en) Remote sensing image classification method for adaptive weight multi-view metric learning
Upadhyay et al. A brief review of fuzzy soft classification and assessment of accuracy methods for identification of single land cover
CN109214427A (en) A kind of Nearest Neighbor with Weighted Voting clustering ensemble method
Förster et al. Significance analysis of different types of ancillary geodata utilized in a multisource classification process for forest identification in Germany
CN114648683A (en) Neural network performance improving method and device based on uncertainty analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190531

WD01 Invention patent application deemed withdrawn after publication