CN109829494A - A clustering ensemble method based on a weighted similarity measure - Google Patents
- Publication number
- CN109829494A (application CN201910079817.5A / CN201910079817A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- data
- data set
- sample
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of clustering ensemble analysis, and in particular to a clustering ensemble method based on a weighted similarity measure. The similarity measure is weighted according to the quality of each cluster member, reinforcing the positive influence of high-quality cluster members during integration while suppressing the adverse interference of low-quality members, so as to obtain a more accurate and robust clustering ensemble result. The method first computes, for every pair of samples in the data set, their consistency in describing the symbol-space data within each cluster member; it then computes each cluster member's consistency in describing the feature-space data and derives from it each member's ensemble weight. On this basis the weighted similarity of every pair of samples is computed, the weighted similarity matrix of the data set is constructed, and the clustering ensemble task is converted into a graph minimum-cut problem, which is solved by spectral clustering to obtain and output the clustering ensemble result.
Description
Technical field
The present invention relates to the field of clustering ensemble analysis, and in particular to a clustering ensemble method based on a weighted similarity measure.
Background art
Clustering analysis is an important and active research area in data mining. As an unsupervised learning method, clustering is essentially a density-estimation problem: the data to be clustered carry no predefined class labels and can be regarded as generated by a mixture model. The main idea is to partition the data into several classes or clusters (groups) so that the similarity of data objects within a cluster is maximized while the similarity of data objects between clusters is minimized. In recent years, large-scale data sets have emerged in many fields, posing new challenges to clustering analysis. Faced with large-scale data, traditional clustering algorithms are no longer as effective as they are on small and medium-scale data, and commonly suffer from difficult processing, long running times, hard-to-determine parameters, low efficiency, and poor clustering quality. Clustering ensembles arose against this background: they seek a combination of multiple clustering solutions in order to obtain a better clustering. Clustering ensembles show better average performance across different fields and data sets, can find solutions that no single clustering algorithm can reach, are less sensitive to noise, outliers, and sampling variation, and can estimate the uncertainty of the clustering from the distribution of the cluster ensemble. Clustering ensemble algorithms face two main problems: first, how to generate different clusterings to form a cluster ensemble; second, how to obtain a single unified clustering result from that ensemble. Current research at home and abroad focuses on the second problem, i.e., how to derive a unified clustering result from the cluster ensemble.
The patent "A sampling-based clustering ensemble method using local and global information" (Publication No. CN105844303A) discloses a sampling-based clustering ensemble method that uses local and global information. The target data set is first sampled to generate a learning sample, and clustering analysis is carried out in this learning-sample space to produce a cluster partition; the partition is then evaluated for quality, and the weight vector of the target data set is updated according to the evaluation result. These steps are repeated over multiple rounds to generate multiple cluster partitions. The partitions are then fused into a new feature representation, to which a traditional clustering algorithm is applied to produce the integrated clustering result. That invention gives the ensemble learning strong noise resistance as well as a high capability for handling problematic data, and the new features can effectively and comprehensively characterize both global and local clustering structure, so that the ensemble learning algorithm performs well on data sets of different characteristics. The patent "Clustering ensemble method based on a hybrid clustering ensemble selection strategy" (Publication No. CN107169511A) converts the clustering ensemble selection problem into a feature-selection problem: base clustering results are generated from multiple perspectives for greater diversity, a feature-selection algorithm is used for optimization to avoid human factors and redundancy, and local and global weights are considered to organically combine the subsets of clustering results, improving clustering accuracy. The steps of that method include: inputting the test data set sample matrix X; clustering X to generate a set of base clustering results; transforming the set of base clustering results into a new feature space, with each clustering result in the set serving as one feature of the new space; performing clustering ensemble selection on the features using a feature-selection method to obtain a subset of clustering results; assigning a weight function to the subset to obtain the final subset of clustering results; and integrating the final subset to obtain the final clustering result.
In generating a unified integrated result from a set of cluster members, a common approach is to measure the similarity between samples by the frequency with which they appear in the same cluster across different cluster members, construct the similarity matrix of the data set, and then apply a minimum-cut method to partition the data set and obtain the unified integrated result. However, because the quality of the cluster members in the ensemble is uneven, their influences on the final clustering ensemble result also differ; ignoring these influences and considering sample similarity alone may reduce the validity of the clustering ensemble result. To this end, the present invention proposes a clustering ensemble method based on a weighted similarity measure which, when computing sample similarity, reinforces the positive influence of better-quality cluster members on the ensemble result while limiting the adverse interference of poorer-quality members, making the clustering ensemble result more accurate and robust.
Summary of the invention
The technical problem to be solved by the present invention is to design a clustering ensemble method in which the similarity measure is weighted according to the quality of each cluster member, reinforcing the positive influence of high-quality cluster members during integration while suppressing the adverse interference of low-quality members, so as to obtain a more accurate and robust clustering ensemble result. The method first computes, for every pair of samples in the data set, their consistency in describing the symbol-space data within each cluster member; it then computes each cluster member's consistency in describing the feature-space data and derives from it each member's ensemble weight. On this basis the weighted similarity of every pair of samples is computed and the weighted similarity matrix of the data set is constructed, converting the clustering ensemble task into a graph minimum-cut problem, which is solved by spectral clustering to obtain and output the clustering ensemble result.
The method proposed by the present invention can be used for clustering tasks on all kinds of numeric data sets. For example, it can be used to identify parallel patterns in gene-expression data and to find gene sets and sample sets with the same biological meaning, and then, on the basis of the clustering analysis, to discover related genes and analyze gene function and transcriptional regulation. It can also be used to discover the association of structure and function among the nodes of complex-network data sets and to partition community structure, and thereby to understand the function of complex networks, uncover the rules hidden in them, and predict their behavior. It can further be used to process complex image data sets, partitioning an image into several non-overlapping regions according to visual features such as salient targets and background scenes, thereby realizing image segmentation.
The technical scheme adopted by the invention is a clustering ensemble method based on a weighted similarity measure. For a data set X = {x_1, ..., x_N} of N samples in feature space, the i-th sample is denoted x_i. C = {C_1, ..., C_T} denotes the set of cluster members generated on the data set X, where T is the number of cluster members in C, C_t denotes the t-th cluster member in C, C_{t,k} is the k-th cluster in C_t, and S_t is the number of clusters in C_t. A clustering is regarded as a symbolic representation of the data set, so each cluster member in the ensemble corresponds to a cluster symbol vector in symbol space; the set formed by the T cluster symbol vectors is denoted L = {l_1, ..., l_T}, where l_t denotes the cluster symbol vector of the t-th cluster member C_t, and l_{t,k} denotes the label of the k-th cluster in C_t. C* denotes the clustering ensemble result, where C_{*,s} is the s-th cluster in C* and S* is the number of clusters in C*. The present invention generates the ensemble result C* from C and L through the following steps:
S10: Standardize the data set: map the data set X in feature space with the Gaussian kernel function, so that the standardized data set Ψ obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set;
S20: Compute, for every pair of samples in the data set, their consistency in describing the symbol-space data within each cluster member. First, compute the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data using X. Then, compute the conditional information entropy of L with respect to the two clusters to which the two samples belong in a given cluster member, which expresses the uncertainty of describing the symbol-space data using those two clusters. The difference between these two conditional information entropies is taken as the two samples' consistency in describing the symbol-space data within that cluster member; proceeding in this way, compute the consistency of every pair of samples within each cluster member;
S30: Compute each cluster member's consistency in describing the feature-space data. First, compute the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data using X. Then, compute the conditional information entropy of Ψ with respect to a given cluster member, which expresses the uncertainty of that member's description of the feature-space data. The difference between these two conditional information entropies is taken as that member's consistency in describing the feature-space data; proceeding in this way, compute the consistency of each cluster member;
S40: From each cluster member's consistency in describing the feature-space data, compute the member's ensemble weight, which controls that member's influence on the final clustering ensemble result;
S50: Using the pairwise symbol-space consistencies obtained in step S20 and the ensemble weights obtained in step S40, compute the weighted similarity between every pair of samples in the data set;
S60: Convert the clustering ensemble task into a graph minimum-cut problem, i.e., minimize the weighted similarity between all pairs of objects that do not fall in the same cluster of the final ensemble result;
S70: Solve the resulting graph minimum-cut problem by spectral clustering to obtain the clustering ensemble result C*;
S80: Output the clustering ensemble result C*.
The Gaussian kernel function in step S10 is given by formula (1), where the value of the parameter α is set to the standard deviation of ||x_i - x_o||², x_o is the o-th sample in the data set X (i ≠ o), and ψ_o is the o-th sample in the standardized data set Ψ.
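Since the body of formula (1) is not reproduced in the extracted text, the mapping of step S10 can only be sketched under an assumed, conventional Gaussian-kernel form; the function below is an illustrative approximation, not the patented formula:

```python
import numpy as np

def gaussian_standardize(x):
    """Sketch of step S10. Formula (1) is not reproduced in the text, so a
    conventional Gaussian kernel form is assumed here: psi_i is the mean over
    o of exp(-||x_i - x_o||^2 / (2 * alpha^2)), with alpha set (as stated) to
    the standard deviation of the pairwise squared distances (i != o)."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # ||x_i - x_o||^2
    off = sq[~np.eye(len(x), dtype=bool)]                 # exclude i == o terms
    alpha = off.std() if off.std() > 0 else 1.0
    k = np.exp(-sq / (2 * alpha ** 2))
    return k.mean(axis=1)                                 # one psi_i per sample
```
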
Step S20 of this method comprises:
S21: Compute, by formula (2), the conditional information entropy of the cluster symbol vector set L with respect to the data set X, which expresses the uncertainty of describing the symbol-space data using X. Here H(l_t | X) is the conditional information entropy of the cluster symbol vector l_t of the t-th cluster member C_t with respect to X, computed by formula (3), in which P(l_{t,k} | X) denotes the conditional probability of l_t with respect to X, computed by formula (4); x_i(l_t) denotes the value of sample x_i on the t-th cluster symbol vector, i.e., the cluster label of x_i in the t-th cluster member;
S22: For any two samples x_i and x_j in the data set X, belonging to two clusters of the t-th cluster member C_t, compute by formula (5) the conditional information entropy of L with respect to these two clusters, which expresses the uncertainty of describing the symbol-space data using them. The conditional information entropy of the cluster symbol vector l_t of C_t with respect to the set formed by the two clusters is computed by formula (6), and the corresponding conditional probability by formula (7); x_d(l_t) denotes the value of sample x_d on the t-th cluster symbol vector, i.e., the cluster label of x_d in the t-th cluster member;
S23: Take the difference between the conditional information entropy of L with respect to X and that with respect to the two clusters as the consistency of x_i and x_j in describing the symbol-space data within cluster member C_t, as shown in formula (8);
S24: Using the method of steps S21-S23, compute the symbol-space consistency of every pair of samples in X within each cluster member.
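Formulas (2)-(8) are likewise absent from the extracted text. A minimal sketch of the S21-S23 computation for one cluster member, under the assumption that the conditional information entropies are estimated from the cluster-size distributions, is:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (base 2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pairwise_consistency(labels_t, i, j):
    """Hedged sketch of steps S21-S23 for one cluster member. Formulas (2)-(8)
    are not reproduced in the text, so plausible estimates are assumed:
    H(l_t|X) is the entropy of the member's cluster-size distribution, and
    H(l_t|{C_i, C_j}) the entropy restricted to the two clusters containing
    samples i and j. Their difference is the pair's consistency (formula (8))."""
    n = len(labels_t)
    sizes = {}
    for lab in labels_t:                       # cluster sizes in this member
        sizes[lab] = sizes.get(lab, 0) + 1
    h_full = entropy([s / n for s in sizes.values()])
    pair = {labels_t[i], labels_t[j]}          # the cluster(s) of the two samples
    m = sum(sizes[lab] for lab in pair)
    h_pair = entropy([sizes[lab] / m for lab in pair])
    return h_full - h_pair
```

With four samples split into two equal clusters, a same-cluster pair scores higher consistency than a split pair, matching the intent of step S23.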
Step S30 of this method comprises:
S31: Compute, by formula (9), the conditional information entropy of the standardized data set Ψ with respect to the data set X, which expresses the uncertainty of describing the feature-space data using X. Here H(Ψ | X) is the conditional information entropy of Ψ with respect to X, and the variance of Ψ is computed by formula (10), where μ_Ψ, the expectation of Ψ, satisfies formula (11);
S32: Compute the conditional information entropy of Ψ with respect to each cluster member, which describes that member's description of the feature-space data; the conditional information entropy of Ψ with respect to the t-th cluster member C_t is computed by formula (12), where H(Ψ | C_t) is that conditional information entropy and the variance of the samples in C_t is computed by formula (13), in which the expectation of the samples in C_t satisfies formula (14). In these formulas, ψ_e is the e-th sample of the standardized data set Ψ, x_e is the e-th sample of X, x_f is the f-th sample of X (e ≠ f), and x_g and x_h are any two distinct samples of X;
S33: Take the difference between the two conditional information entropies above as the member's consistency in describing the feature-space data; the consistency measure of the t-th cluster member C_t on Ψ is computed by formula (15):
I(Ψ | C_t) = H(Ψ | X) - H(Ψ | C_t) (15)
where I(Ψ | C_t) denotes the consistency measure of C_t on Ψ;
S34: Using the method of steps S31-S33, compute each cluster member's feature-space consistency in turn.
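With formula (15) given explicitly but formulas (9)-(14) absent from the extracted text, a hedged sketch of steps S31-S33, assuming Gaussian differential entropies computed from the variances, is:

```python
import math

def gaussian_entropy(var):
    """Differential entropy of a 1-D Gaussian with variance var."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def member_consistency(psi, clusters):
    """Sketch of steps S31-S33. Formulas (9)-(14) are not reproduced in the
    text; here H(.) is assumed to be the Gaussian differential entropy
    computed from the variance, and H(Psi|Ct) the size-weighted average of
    the per-cluster entropies, so that I(Psi|Ct) = H(Psi|X) - H(Psi|Ct)
    as in formula (15)."""
    n = len(psi)
    mu = sum(psi) / n
    h_all = gaussian_entropy(sum((v - mu) ** 2 for v in psi) / n)
    h_member = 0.0
    for members in clusters:              # each cluster: a list of sample indices
        vals = [psi[i] for i in members]
        m = sum(vals) / len(vals)
        var = sum((v - m) ** 2 for v in vals) / len(vals)
        h_member += len(vals) / n * gaussian_entropy(max(var, 1e-12))
    return h_all - h_member
```

Under this reading, a member whose clusters are tight (low within-cluster variance) describes the feature-space data more consistently than one whose clusters mix distant samples.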
In step S40, each cluster member's ensemble weight is computed from its feature-space consistency as shown in formula (16), where ω_t denotes the clustering ensemble weight of cluster member C_t.
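Formula (16) is not reproduced in the extracted text; a simple normalisation that gives each member a weight proportional to its feature-space consistency, one plausible reading of step S40, can be sketched as:

```python
def ensemble_weights(consistencies):
    """Hedged sketch of step S40: formula (16) is not reproduced in the
    text, so a simple normalisation is assumed, giving each member a weight
    proportional to its (shifted) feature-space consistency; weights sum to 1."""
    lo = min(consistencies)
    shifted = [c - lo for c in consistencies]     # guard against negative values
    total = sum(shifted)
    if total == 0:                                # all members equally consistent
        return [1.0 / len(consistencies)] * len(consistencies)
    return [s / total for s in shifted]
```
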
In step S50, the weighted similarity of two samples in the data set X is computed as shown in formula (17), where sim(x_i, x_j) denotes the weighted similarity between samples x_i and x_j; the weighted similarity of every pair of samples in X is computed in this way.
Step S60 of this method comprises:
S61: Construct the weighted similarity matrix Θ = [θ(x_p, x_q)]_{N×N} of the data set X; the matrix element θ(x_p, x_q) is computed as shown in formula (18), where the value of the parameter γ is the standard deviation of sim(x_i, x_j);
S62: Convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function shown in formula (19).
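Formula (18) is also absent from the extracted text; assuming a Gaussian-style transform with γ set, as stated, to the standard deviation of the sim values, the matrix construction of step S61 can be sketched as:

```python
import numpy as np

def similarity_matrix(sim):
    """Hedged sketch of step S61: formula (18) is not reproduced in the
    text, so a Gaussian-style transform of the pairwise weighted similarity
    is assumed, with gamma set (as stated) to the standard deviation of the
    sim values: theta(p, q) = exp(-(1 - sim(p, q))**2 / gamma**2)."""
    sim = np.asarray(sim, dtype=float)
    off = sim[~np.eye(len(sim), dtype=bool)]  # gamma from off-diagonal entries
    gamma = off.std() if off.std() > 0 else 1.0
    return np.exp(-(1.0 - sim) ** 2 / gamma ** 2)
```
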
Step S70 of this method comprises:
S71: Construct an N-dimensional diagonal matrix, denoted D, from the column sums of the weighted similarity matrix Θ, and define the matrix L = D - Θ;
S72: Find the S* smallest eigenvalues of the matrix L, in ascending order, and their corresponding eigenvectors;
S73: Arrange the S* eigenvectors side by side to form an N × S* matrix, regard each row as a vector in S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which each row belongs in the clustering result is the cluster to which the corresponding sample of the data set X belongs.
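Steps S71-S73 are fully specified, so they can be sketched directly; only the K-means initialisation, which the text does not specify, is an assumption here:

```python
import numpy as np

def _kmeans(points, k, iters=100):
    """Minimal K-means used for step S73 (deterministic maximin
    initialisation; the initialisation scheme is not specified in the text)."""
    centers = [points[0]]
    for _ in range(k - 1):                    # add the point farthest from
        d = np.min([((points - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(points[d.argmax()])    # the centers chosen so far
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        new = np.array([points[labels == c].mean(axis=0) if (labels == c).any()
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def spectral_partition(theta, n_clusters):
    """Steps S71-S73: spectral solution of the graph minimum-cut problem."""
    # S71: diagonal matrix D of the column sums of theta; L = D - theta
    lap = np.diag(theta.sum(axis=0)) - theta
    # S72: eigenvectors of the n_clusters smallest eigenvalues
    # (eigh returns eigenvalues of a symmetric matrix in ascending order)
    _, vecs = np.linalg.eigh(lap)
    # S73: each row of the N x S* eigenvector matrix is a point in
    # S*-dimensional space; cluster the rows with K-means
    return _kmeans(vecs[:, :n_clusters], n_clusters)
```

On a block-structured similarity matrix this recovers the two blocks as the two clusters, which is exactly the minimum-cut behaviour the objective of formula (19) asks for.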
Aiming at the problem that the uneven quality of the cluster members in a cluster ensemble adversely affects the clustering ensemble, the present invention proposes a clustering ensemble method based on a weighted similarity measure. The method first computes, for every pair of samples in the data set, their consistency in describing the symbol-space data within each cluster member; then computes each cluster member's consistency in describing the feature-space data and derives from it each member's ensemble weight; on this basis computes the weighted similarity of every pair of samples, constructs the weighted similarity matrix of the data set, converts the clustering ensemble task into a graph minimum-cut problem, and solves it by spectral clustering to obtain and output the clustering ensemble result. The main quantities of the invention include: the cluster symbol vector set; the conditional information entropy of the cluster symbol vector set with respect to the data set; the conditional information entropy of the cluster symbol vector set with respect to the two clusters containing two samples in a given cluster member; the consistency of two samples in describing the symbol-space data within a cluster member; the conditional information entropy of the standardized data set with respect to the original data set; the conditional information entropy of the standardized data set with respect to a cluster member; the consistency of a cluster member in describing the feature-space data; the ensemble weight of a cluster member; the weighted similarity between two samples; and the weighted similarity matrix. Among these, the cluster symbol vector set is the set of cluster symbol vectors that describe the data set symbolically in symbol space. The conditional information entropy of the cluster symbol vector set with respect to the data set expresses the uncertainty of describing the symbol-space data using the data set X; its conditional information entropy with respect to the two clusters containing two samples in a given cluster member expresses the uncertainty of describing the symbol-space data using those two clusters; and the consistency of two samples within a cluster member is the difference between these two conditional information entropies. The conditional information entropy of the standardized data set with respect to the original data set expresses the uncertainty of describing the feature-space data using the data set; its conditional information entropy with respect to a cluster member expresses the uncertainty of that member's description of the feature-space data; and the consistency of a cluster member in describing the feature-space data is the difference between these two conditional information entropies. The ensemble weight of a cluster member controls that member's influence on the final clustering ensemble result; the weighted similarity between two samples describes the similarity of the cluster assignments of the two samples across the series of cluster members; and the weighted similarity matrix describes the similarity of the cluster assignments of all pairs of samples in the data set.
The beneficial effects of the present invention are: information entropy is used to measure the consistency of data descriptions and thereby to compute the clustering ensemble weight of each cluster member; weighted similarity expresses the similarity of the cluster assignments of two samples across the series of cluster members; and the clustering ensemble task is converted into a graph minimum-cut problem solved by spectral clustering. The method fully considers the influence of cluster members of different quality on the ensemble result when constructing the data set's similarity matrix, so that the final clustering ensemble result is more accurate and robust.
Description of the drawings
Fig. 1 is a structural diagram of the computer-implemented system of the clustering ensemble method based on a weighted similarity measure according to the present invention;
Fig. 2 is a flow chart of the clustering ensemble method based on a weighted similarity measure according to the present invention.
Specific embodiment
The preferred embodiments are described in detail below with reference to the accompanying drawings.
The clustering ensemble method based on a weighted similarity measure of the present invention is implemented by a computer program; Fig. 1 shows the structure of the computer-implemented system. Below, the proposed technical solution is applied to remote-sensing image data: the remote-sensing images are classified automatically to realize the identification of ground-object targets. The input data is the image data set formed by the pixels; the clustering ensemble method groups pixels with similar spectral features into one class, identified as the same ground-object target, and finally outputs the various identified ground-object targets. The specific implementation process is shown in Fig. 2. The spectral features of each pixel in the remote-sensing image serve as one sample, and the N samples form the image data set X; the i-th sample in feature space is denoted x_i. C denotes the set of cluster members generated on the data set X, i.e., a series of pixel partition results, where T is the number of cluster members in C and C_t denotes the t-th cluster member, i.e., the t-th partition result of the image data set; C_{t,k} is the k-th cluster in C_t, i.e., the k-th class of that partition, and S_t is the number of clusters in C_t. A clustering is regarded as a symbolic representation of the data set, so each cluster member in the ensemble corresponds to a cluster symbol vector in symbol space; the set formed by the T cluster symbol vectors is denoted L, where l_t denotes the cluster symbol vector of the t-th cluster member C_t and l_{t,k} denotes the label of the k-th cluster in C_t. C* denotes the clustering ensemble result, i.e., the final integrated segmentation result, where C_{*,s} is the s-th cluster in C* and S* is the number of clusters in C*. This embodiment generates a consistent ground-object-target recognition result C* from a series of pixel partition results of the image data set X through the following key steps:
Step 1: Standardize the data set: map the data set X in feature space with the Gaussian kernel function shown in formula (1), so that the standardized data set Ψ obtained after the mapping follows a Gaussian distribution, where ψ_i denotes the i-th sample in the standardized data set, the value of the parameter α is set to the standard deviation of ||x_i - x_o||², x_o is the o-th sample in X (i ≠ o), and ψ_o is the o-th sample in Ψ.
Step 2: Compute, for every pair of samples in the data set, their consistency in describing the symbol-space data within each cluster member: first compute the conditional information entropy of the cluster symbol vector set L with respect to the data set X, expressing the uncertainty of describing the symbol-space data using X; then compute the conditional information entropy of L with respect to the two clusters containing the two samples in a given cluster member, expressing the uncertainty of describing the symbol-space data using those two clusters; take the difference of these two conditional information entropies as the two samples' consistency in that cluster member, and proceed likewise for every pair of samples in each cluster member. The computation follows steps S21-S24 exactly as described above.
Step 3: Compute each cluster member's consistency in describing the feature-space data: first compute the conditional information entropy of the standardized data set Ψ with respect to the data set X, expressing the uncertainty of describing the feature-space data using X; then compute the conditional information entropy of Ψ with respect to a given cluster member, expressing the uncertainty of that member's description of the feature-space data; take the difference of these two conditional information entropies as the member's consistency, and proceed likewise for each cluster member. The computation follows steps S31-S34 exactly as described above.
Step 4: according to the consistency of each cluster member's description of the feature-space data, calculate each cluster member's ensemble weight, which controls that member's influence on the final clustering ensemble result; the ensemble weight ωt of cluster member Ct is calculated as shown in formula (16):
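Formula (16) is likewise not reproduced in this text. A natural instantiation consistent with the surrounding description — higher consistency yields a larger weight, and the T weights sum to one — is plain normalization; the function below is that assumed sketch, not the patent's exact formula.

```python
import numpy as np

def ensemble_weights(consistencies):
    """Turn per-member consistency scores I(Psi|Ct) into ensemble weights.

    Assumed instantiation of formula (16): clip negative scores and
    normalize so that the T weights sum to one.
    """
    c = np.asarray(consistencies, dtype=float)
    c = np.clip(c, 0.0, None)                    # guard against negative scores
    total = c.sum()
    if total == 0.0:
        return np.full_like(c, 1.0 / len(c))     # fall back to uniform weights
    return c / total
```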
Step 5: calculate the weighted similarity between any two samples in the data set from the consistency with which any two samples describe the symbol-space data in each cluster member, obtained in step S20, and the ensemble weight of each cluster member, obtained in step S40; the weighted similarity sim(xi, xj) between xi and xj is calculated as shown in formula (17):
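Formula (17) is also absent from this text. One plausible form consistent with the description — combine the per-member pairwise consistencies of step S20 using the ensemble weights of step S40 — is a weighted sum over the T cluster members, sketched below under that assumption.

```python
import numpy as np

def weighted_similarity(pairwise_consistency, weights):
    """Combine per-member pairwise consistencies into sim(xi, xj).

    pairwise_consistency : (T, N, N) array; entry [t, i, j] is the consistency
                           of samples xi, xj in cluster member Ct (step S20)
    weights              : (T,) ensemble weights from step S40

    Assumed form of formula (17): a weights-contracted sum over the T members.
    """
    w = np.asarray(weights, dtype=float)
    # Contract the member axis of the (T, N, N) stack with the weight vector
    return np.tensordot(w, np.asarray(pairwise_consistency), axes=1)  # (N, N)
```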
Step 6: convert the clustering ensemble task into a graph minimum-cut problem, i.e., make the weighted similarity between all pairs of objects that are not in the same cluster in the final clustering ensemble result as small as possible, comprising the following steps:
S61: construct the weighted similarity matrix Θ = [θ(xp,xq)]N×N of data set X, where the matrix element θ(xp,xq) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(xi, xj);
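Formula (18) is not reproduced in this text either; the sketch below assumes a simple exponential mapping of the weighted similarity with bandwidth γ, the stated standard deviation of sim(xi, xj), and zeroes the diagonal so the graph has no self-loops. Both choices are assumptions of this illustration.

```python
import numpy as np

def affinity_matrix(sim):
    """Weighted similarity matrix Theta of formula (18) (assumed form).

    The exact expression for theta(xp, xq) is not reproduced in this text;
    this sketch applies an exponential kernel with bandwidth gamma set to the
    standard deviation of sim(xi, xj), matching the stated choice of gamma.
    """
    sim = np.asarray(sim, dtype=float)
    gamma = sim.std()
    if gamma == 0.0:
        return np.ones_like(sim)
    theta = np.exp(sim / gamma)   # assumed monotone mapping of similarity
    np.fill_diagonal(theta, 0.0)  # no self-affinity on the graph
    return theta
```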
S62: convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
Step 7: solve the graph minimum-cut problem obtained from the conversion of the clustering ensemble task by means of spectral clustering, obtaining the clustering ensemble result C*, comprising the following steps:
S71: construct an N-dimensional diagonal matrix, denoted D, from the column sums of the weighted similarity matrix Θ, and define the matrix L = D − Θ;
S72: find the first S* eigenvalues of the matrix L in ascending order and their corresponding eigenvectors;
S73: arrange the S* eigenvectors side by side to form an N × S* matrix, regard each row as a vector in the S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which each row belongs in the clustering result is the cluster to which the corresponding sample of data set X belongs.
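Steps S71–S73 are fully specified above, so they can be sketched directly: `numpy.linalg.eigh` returns the eigenvalues of the symmetric matrix L in ascending order, giving the S* smallest eigenvectors, and a minimal Lloyd iteration with farthest-point initialization stands in for an off-the-shelf K-means (e.g., scikit-learn's `KMeans`).

```python
import numpy as np

def spectral_ensemble(theta, n_clusters, n_iter=100):
    """Steps S71-S73: spectral solution of the graph min-cut relaxation.

    theta      : (N, N) symmetric weighted similarity matrix Theta
    n_clusters : S*, the number of clusters in the ensemble result C*
    """
    theta = np.asarray(theta, dtype=float)
    d = np.diag(theta.sum(axis=0))     # S71: diagonal matrix D of column sums
    lap = d - theta                    # S71: matrix L = D - Theta
    _, vecs = np.linalg.eigh(lap)      # S72: eigh sorts eigenvalues ascending
    emb = vecs[:, :n_clusters]         # S73: N x S* matrix, rows are S*-dim points
    # S73: K-means on the rows (minimal Lloyd iteration, farthest-point init)
    centers = emb[[0]]
    for _ in range(1, n_clusters):
        gap = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, emb[gap.argmax()]])
    for _ in range(n_iter):
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        new = np.array([emb[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(n_clusters)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

On a near-block-diagonal affinity matrix the Fiedler eigenvector separates the two dense blocks, so the row clustering recovers them.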
Step 8: output the clustering ensemble result C*, i.e., the ground-object target recognition result, in which each cluster represents one identified ground-object target.
Claims (8)
1. A clustering ensemble method based on weighted similarity measurement, wherein sample data are collected in a feature space to form a data set X containing N samples, the i-th sample in data set X being denoted xi; C denotes the set formed by a series of cluster members generated on data set X, where T denotes the number of cluster members in C, Ct denotes the t-th cluster member in C, Ct,k is the k-th cluster in Ct, and St denotes the number of clusters in Ct; a clustering result is regarded as a symbolic representation of the data set, so each cluster member in the cluster set corresponds to a cluster symbol vector in symbol space, and the set formed by the T cluster symbol vectors is denoted L, where lt denotes the cluster symbol vector of the t-th cluster member Ct and lt,k denotes the label of the k-th cluster in Ct; the clustering ensemble result is denoted C*, where C*,s denotes the s-th cluster in C* and S* denotes the number of clusters in C*; the process of generating the clustering ensemble result C* comprises the following steps:
S10: standardize data set X and map it with a Gaussian kernel function, so that the standardized data set Ψ obtained after the mapping follows a Gaussian distribution, where ψi denotes the i-th sample in the standardized data set;
S20: calculate the consistency with which any two samples in data set X describe the symbol-space data in each cluster member: first, calculate the conditional information entropy of the cluster symbol vector set L with respect to data set X, representing the uncertainty of describing the symbol-space data by means of data set X; then, calculate the conditional information entropy of the cluster symbol vector set L with respect to the two clusters to which two samples belong in some cluster member, representing the uncertainty of describing the symbol-space data by means of these two clusters; next, take the difference between the two conditional information entropies of the cluster symbol vector set L as the consistency with which the two samples describe the symbol-space data in this cluster member, and in the same way calculate the consistency with which any two samples describe the symbol-space data in each cluster member;
S30: calculate the consistency with which each cluster member describes the feature-space data: first, calculate the conditional information entropy of the standardized data set Ψ with respect to data set X, representing the uncertainty of describing the feature-space data by means of data set X; then, calculate the conditional information entropy of the standardized data set Ψ with respect to some cluster member, representing the uncertainty of that cluster member's description of the feature-space data; take the difference between the two conditional information entropies of the standardized data set Ψ as the consistency with which the cluster member describes the feature-space data, and in the same way calculate the consistency of each cluster member's description of the feature-space data;
S40: according to the consistency of each cluster member's description of the feature-space data, calculate each cluster member's ensemble weight, which controls that member's influence on the final clustering ensemble result;
S50: calculate the weighted similarity between any two samples in the data set from the consistency with which any two samples describe the symbol-space data in each cluster member, obtained in step S20, and the ensemble weight of each cluster member, obtained in step S40;
S60: convert the clustering ensemble task into a graph minimum-cut problem, i.e., make the weighted similarity between all pairs of objects that are not in the same cluster in the final clustering ensemble result as small as possible;
S70: solve the graph minimum-cut problem obtained from the conversion of the clustering ensemble task by means of spectral clustering, obtaining the clustering ensemble result C*;
S80: output the clustering ensemble result C*.
2. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that the Gaussian kernel function of step S10 is as shown in formula (1):
where the value of the parameter α is set to the standard deviation of ||xi-xo||², xo is the o-th sample in data set X (i ≠ o), and ψo is the o-th sample in the standardized data set Ψ.
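Formula (1) appears only as an image in the published application. The sketch below assumes the usual radial-basis form κ(xi, xo) = exp(−||xi − xo||²/(2α²)), with α set to the standard deviation of the squared distances as claim 2 states, and treats the i-th row of kernel values as the mapped sample ψi; the exact mapping used by the patent may differ.

```python
import numpy as np

def gaussian_kernel_map(x):
    """Step S10 / formula (1): Gaussian-kernel mapping of data set X (sketch).

    Assumed RBF form with alpha = std of the squared pairwise distances
    ||xi - xo||^2 (excluding i == o), as stated in claim 2.
    """
    x = np.asarray(x, dtype=float)
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # ||xi - xo||^2
    off = sq[~np.eye(len(x), dtype=bool)]                 # exclude i == o
    alpha = off.std()
    if alpha == 0.0:
        alpha = 1.0
    return np.exp(-sq / (2.0 * alpha ** 2))
```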
3. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S20 comprises:
S21: calculate, by formula (2), the conditional information entropy of the cluster symbol vector set L with respect to data set X, representing the uncertainty of describing the symbol-space data by means of data set X:
where H(lt|X) is the conditional information entropy of the cluster symbol vector lt of the t-th cluster member Ct with respect to data set X, which can be calculated by formula (3):
where P(lt,k|X) denotes the conditional probability of the cluster symbol vector lt with respect to data set X, which can be calculated by formula (4):
where xi(lt) denotes the value of sample xi on the t-th cluster symbol vector, i.e., the cluster label corresponding to sample xi in the t-th cluster member;
S22: for any two samples xi and xj in data set X, belonging respectively to two clusters of the t-th cluster member Ct, calculate by formula (5) the conditional information entropy of the cluster symbol vector set L with respect to these two clusters, representing the uncertainty of describing the symbol-space data by means of the two clusters:
where the two clusters form a set, and the conditional information entropy of the cluster symbol vector lt of the t-th cluster member Ct with respect to this set can be calculated by formula (6):
where the conditional probability of the cluster symbol vector lt with respect to this set can be calculated by formula (7):
where xd(lt) denotes the value of sample xd on the t-th cluster symbol vector, i.e., the cluster label corresponding to sample xd in the t-th cluster member;
S23: take the difference between the conditional information entropy of the cluster symbol vector set L with respect to data set X and that with respect to the set of the two clusters as the consistency with which samples xi and xj describe the symbol-space data in cluster member Ct, as shown in formula (8):
S24: using the method of steps S21–S23, calculate the consistency with which every pair of samples in data set X describes the symbol-space data in each cluster member.
4. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S30 comprises:
S31: calculate, by formula (9), the conditional information entropy of the standardized data set Ψ with respect to data set X, representing the uncertainty of describing the feature-space data by means of data set X:
where H(Ψ|X) is the conditional information entropy of the standardized data set Ψ with respect to data set X, and the variance of the standardized data set Ψ is calculated by formula (10):
where μΨ is the expectation of the standardized data set Ψ, satisfying formula (11):
S32: calculate the conditional information entropy of the standardized data set Ψ with respect to each cluster member, representing the uncertainty of each cluster member's description of the feature-space data, where the conditional information entropy of Ψ with respect to the t-th cluster member Ct can be calculated by formula (12):
where H(Ψ|Ct) is the conditional information entropy of the standardized data set Ψ with respect to the t-th cluster member Ct, and the variance of the samples in Ct is calculated by formula (13):
where the expectation of the samples in Ct satisfies formula (14):
In the formulas, ψe is the e-th sample in the standardized data set Ψ, xe is the e-th sample of data set X, xf is the f-th sample of data set X (e ≠ f), and xg and xh are any two distinct samples in data set X;
S33: take the difference between the two conditional information entropies of the standardized data set Ψ above as the consistency with which the cluster member describes the feature-space data, where the consistency measure of the t-th cluster member Ct on Ψ is calculated by formula (15):
I(Ψ|Ct)=H (Ψ | X)-H (Ψ | Ct) (15)
where I(Ψ|Ct) denotes the consistency measure of Ct on Ψ;
S34: using the method of steps S31–S33, calculate the consistency of each cluster member's description of the feature-space data one by one.
5. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that, in step S40, the method of calculating each cluster member's ensemble weight from the consistency of each cluster member's description of the feature-space data is as shown in formula (16):
where ωt denotes the clustering ensemble weight of cluster member Ct.
6. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that the weighted similarity between two samples in data set X calculated in step S50 is as shown in formula (17):
where sim(xi, xj) denotes the weighted similarity between samples xi and xj; the weighted similarity between any two samples in data set X is calculated in this way.
7. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S60 comprises:
S61: construct the weighted similarity matrix Θ = [θ(xp,xq)]N×N of data set X, where the matrix element θ(xp,xq) is calculated as shown in formula (18):
where the value of the parameter γ is the standard deviation of sim(xi, xj);
S62: convert the clustering ensemble task into a graph minimum-cut problem, constructing the objective function as shown in formula (19):
8. The clustering ensemble method based on weighted similarity measurement according to claim 1, characterized in that step S70 comprises:
S71: construct an N-dimensional diagonal matrix, denoted D, from the column sums of the weighted similarity matrix Θ, and define the matrix L = D − Θ;
S72: find the first S* eigenvalues of the matrix L in ascending order and their corresponding eigenvectors;
S73: arrange the S* eigenvectors side by side to form an N × S* matrix, regard each row as a vector in the S*-dimensional space, and cluster the rows with the K-means algorithm; the cluster to which each row belongs in the clustering result is the cluster to which the corresponding sample of data set X belongs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910079817.5A CN109829494A (en) | 2019-01-28 | 2019-01-28 | A kind of clustering ensemble method based on weighting similarity measurement |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109829494A true CN109829494A (en) | 2019-05-31 |
Family
ID=66862670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910079817.5A Pending CN109829494A (en) | 2019-01-28 | 2019-01-28 | A kind of clustering ensemble method based on weighting similarity measurement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109829494A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105957063A (en) * | 2016-04-22 | 2016-09-21 | 北京理工大学 | CT image liver segmentation method and system based on multi-scale weighting similarity measure |
CN109214427A (en) * | 2018-08-13 | 2019-01-15 | 山西大学 | A kind of Nearest Neighbor with Weighted Voting clustering ensemble method |
Non-Patent Citations (1)
Title |
---|
Bai Liang et al.: "Weighted clustering ensemble algorithm based on clustering criterion fusion", Journal of Shanxi University (Natural Science Edition) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111276188A (en) * | 2020-01-19 | 2020-06-12 | 西安理工大学 | Short-time-sequence gene expression data clustering method based on angle characteristics |
CN111276188B (en) * | 2020-01-19 | 2023-03-24 | 西安理工大学 | Short-time-sequence gene expression data clustering method based on angle characteristics |
CN111464529A (en) * | 2020-03-31 | 2020-07-28 | 山西大学 | Network intrusion detection method and system based on cluster integration |
CN111598120A (en) * | 2020-03-31 | 2020-08-28 | 宁波吉利汽车研究开发有限公司 | Data labeling method, equipment and device |
CN111726765A (en) * | 2020-05-29 | 2020-09-29 | 山西大学 | WIFI indoor positioning method and system for large-scale complex scene |
CN111899115A (en) * | 2020-05-30 | 2020-11-06 | 中国兵器科学研究院 | Method, device and storage medium for determining community structure in social network |
CN117828380A (en) * | 2024-03-05 | 2024-04-05 | 厦门爱逸零食研究所有限公司 | Intelligent sterilization detection method and device |
CN117828380B (en) * | 2024-03-05 | 2024-06-04 | 厦门爱逸零食研究所有限公司 | Intelligent sterilization detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109829494A (en) | A kind of clustering ensemble method based on weighting similarity measurement | |
McIver et al. | Estimating pixel-scale land cover classification confidence using nonparametric machine learning methods | |
Erisoglu et al. | A new algorithm for initial cluster centers in k-means algorithm | |
CN103020978B (en) | SAR (synthetic aperture radar) image change detection method combining multi-threshold segmentation with fuzzy clustering | |
CN103648106B (en) | WiFi indoor positioning method of semi-supervised manifold learning based on category matching | |
CN114564982B (en) | Automatic identification method for radar signal modulation type | |
CN107239795A (en) | SAR image change detecting system and method based on sparse self-encoding encoder and convolutional neural networks | |
Yang et al. | A feature-metric-based affinity propagation technique for feature selection in hyperspectral image classification | |
CN108664986B (en) | Based on lpNorm regularized multi-task learning image classification method and system | |
CN109490838A (en) | A kind of Recognition Method of Radar Emitters of data base-oriented incompleteness | |
CN108492298A (en) | Based on the multispectral image change detecting method for generating confrontation network | |
CN108154094A (en) | The non-supervisory band selection method of high spectrum image divided based on subinterval | |
CN106529563A (en) | High-spectral band selection method based on double-graph sparse non-negative matrix factorization | |
CN111062428A (en) | Hyperspectral image clustering method, system and equipment | |
CN108564083A (en) | A kind of method for detecting change of remote sensing image and device | |
CN105205807B (en) | Method for detecting change of remote sensing image based on sparse automatic coding machine | |
CN109034238A (en) | A kind of clustering method based on comentropy | |
Lu et al. | Multiple-kernel combination fuzzy clustering for community detection | |
CN106599927B (en) | The Target cluster dividing method divided based on Fuzzy ART | |
CN115186012A (en) | Power consumption data detection method, device, equipment and storage medium | |
CN111680579A (en) | Remote sensing image classification method for adaptive weight multi-view metric learning | |
Upadhyay et al. | A brief review of fuzzy soft classification and assessment of accuracy methods for identification of single land cover | |
CN109214427A (en) | A kind of Nearest Neighbor with Weighted Voting clustering ensemble method | |
Förster et al. | Significance analysis of different types of ancillary geodata utilized in a multisource classification process for forest identification in Germany | |
CN114648683A (en) | Neural network performance improving method and device based on uncertainty analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190531 |