CN106446947A - High-dimension data soft and hard clustering integration method based on random subspace - Google Patents
- Publication number
- CN106446947A CN106446947A CN201610843524.6A CN201610843524A CN106446947A CN 106446947 A CN106446947 A CN 106446947A CN 201610843524 A CN201610843524 A CN 201610843524A CN 106446947 A CN106446947 A CN 106446947A
- Authority
- CN
- China
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses a soft and hard clustering ensemble method for high-dimensional data based on random subspaces. The method comprises the following steps: (1) inputting a high-dimensional data set; (2) normalizing the data; (3) generating random subspaces; (4) performing kmeans and fuzzy cmeans clustering; (5) generating a fusion matrix; (6) determining the optimal number of clusters using cluster validity indices; (7) constructing a decision attribute set; (8) applying improved rough-set attribute reduction to obtain a reduced fusion matrix; (9) partitioning with a consensus function; (10) computing the clustering purity. Random subspaces overcome the difficulty of processing high-dimensional data; combining soft and hard clustering, and fully exploiting the original data and intermediate results to remove redundant attributes from those intermediate results, improves clustering accuracy while also accelerating clustering. The method thereby solves the inability of the prior art to fully exploit clustering information and to remove redundant information.
Description
Technical field
The present invention relates to the field of machine learning, and more particularly to a soft and hard clustering ensemble method for high-dimensional data based on random subspaces.
Background technology
Different data sources and different clustering algorithms yield different clustering results. Using a clustering ensemble framework to unify such results into a single partition has proved remarkably effective, and has attracted growing attention and research in academia. Clustering ensemble methods have been applied successfully in data mining, for example to mining noisy data, heterogeneous-source data, data distributions, classification data and time series, and have also found good applications in bioinformatics, information retrieval, decision making and image processing. Yu et al. have proposed several clustering ensemble frameworks, such as a hybrid clustering framework based on triple spectral clustering, and knowledge discovery of cancer mechanisms from gene expression data based on clustering ensembles. Carpineto et al. proposed a clustering ensemble framework based on probabilistic indexing and applied it in title retrieval. In many applications, clustering ensemble methods achieve better accuracy, robustness and stability than any single clustering algorithm.
Current clustering ensemble methods fall into two broad classes: hard clustering ensembles and soft clustering ensembles. A hard clustering ensemble integrates the results of hard clustering algorithms. Much research has examined how different consensus functions can yield results with higher robustness and stability, for example consensus functions based on similarity matrices, graph cuts, weight-based partitioning and association partitioning. Other work uses different techniques to diversify the base clustering results, since greater diversity helps the consensus function produce a more effective final result; examples include random resampling, random projection and random initialization. Some studies incorporate prior knowledge into the ensemble framework, and others incorporate semi-supervised methods, adapting different hard clustering algorithms to different data sets, but they do not consider combination with soft clustering. Soft clustering uses fuzzy clustering methods, and many soft clustering ensemble frameworks also exist; for example, Yu et al. proposed a tumour-data cluster analysis based on a soft clustering ensemble framework. Other studies incorporate further fuzzy theory into soft clustering ensembles, such as fuzzy graph theory, fuzzy similarity relations, and fuzzy consensus functions based on sites and voting mechanisms. Mirzaei et al. proposed a hierarchical clustering ensemble framework based on fuzzy similarity relations, and further research adds rough sets and granular computing to clustering ensemble frameworks. Avogadri et al. designed a fuzzy clustering ensemble framework based on random projection for analysing DNA microarray data. In summary, current frameworks consider how best to add fuzzy clustering to the ensemble, but rarely consider introducing soft and hard clustering into the clustering ensemble framework simultaneously.
Current clustering ensembles also have certain limitations. First, most clustering ensemble frameworks have no good method for processing high-dimensional data sets. Second, traditional clustering ensemble frameworks analyse either hard clustering or soft clustering alone, without considering combining the two within one ensemble framework. Third, although some ensemble methods treat clustering results as new attributes for a further round of ensemble clustering, they do not consider that this newly constructed attribute set may contain redundant or noisy attributes, and at present no method eliminates such redundant attributes of the new attribute set within the ensemble framework.
Content of the invention
To overcome the shortcomings and deficiencies of the prior art, the present invention provides a soft and hard clustering ensemble method for high-dimensional data based on random subspaces, which solves the three limitations identified above. Given an input high-dimensional data set, the method ultimately exploits the available information more fully, and achieves better clustering accuracy, than a traditional single clustering algorithm or existing ensemble frameworks.
To solve the above technical problem, the present invention provides the following technical scheme: a soft and hard clustering ensemble method for high-dimensional data based on random subspaces, comprising the following steps:
S1, inputting a high-dimensional data set and normalizing it;
S2, generating random subspaces from the normalized high-dimensional data set;
S3, clustering each subspace to obtain clustering result matrices;
S4, merging the clustering result matrices to generate a fusion matrix;
S5, deriving the optimal number of clusters from the fusion matrix using cluster validity indices;
S6, constructing a decision attribute set using the fusion matrix and the optimal number of clusters as parameters;
S7, taking the fusion matrix as the condition attribute set and, according to the decision attribute set, applying improved rough-set attribute reduction to the fusion matrix to obtain a reduced fusion matrix;
S8, clustering with the reduced fusion matrix and the true number of clusters as parameters to obtain a clustering result matrix, and determining the final clustering result from it;
S9, computing the purity of the final clustering result against the true clustering result.
Further, in step S1 the normalization is specifically:
obtain the maximum V(d)max and minimum V(d)min of the d-th attribute column, and convert each value of column d by the following formula:

v'(d,i) = (v(d,i) − V(d)min) / (V(d)max − V(d)min)

where v(d,i) is the i-th value of column d, v'(d,i) is the updated value, i ∈ {1, 2, ..., n}, d ∈ {1, 2, ..., D}, n is the number of samples and D is the sample dimension.
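For illustration, the column-wise min-max normalization of step S1 can be sketched as follows; this is a minimal sketch, not part of the patent, and the function name, the NumPy dependency and the guard for constant columns are assumptions of the sketch.

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max normalization of an n x D data matrix.

    Each value v in column d is mapped to
    (v - V(d)min) / (V(d)max - V(d)min), so every attribute column
    lies in [0, 1]. A constant column is mapped to 0 to avoid
    division by zero (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = col_max - col_min
    span[span == 0] = 1.0  # guard constant columns
    return (X - col_min) / span
```

For example, the matrix [[1, 10], [3, 20], [5, 30]] normalizes column-wise to [[0, 0], [0.5, 0.5], [1, 1]].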
Further, step S2 is specifically:
S21, obtaining the sample dimension of the high-dimensional data set and deriving the sample dimension of a subspace;
S22, randomly choosing attribute columns of the high-dimensional data set without repetition to construct a subspace, stopping when the subspace sample dimension is reached. Construction of a subspace is specifically: after obtaining the sample dimension D of the high-dimensional data set, set the subspace interval and use a random function to choose ⌈r×D⌉ distinct positive integers in the interval 0~D, where ⌈r×D⌉ denotes the smallest integer not less than r×D and r ∈ (0, 1); sort these ⌈r×D⌉ positive integers in ascending order; map the sorted sequence to attribute column numbers of the high-dimensional data set and extract those columns to construct one new subspace;
S23, repeating steps S21–S22 until S subspaces have been produced.
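Steps S21–S23 can be sketched as a short routine; the function name, the seed parameter and the returned (indices, submatrix) pairs are conveniences of this sketch, not specified by the patent.

```python
import numpy as np

def random_subspaces(X, r, S, seed=0):
    """Draw S random subspaces from an n x D data matrix X.

    Each subspace keeps ceil(r * D) distinct attribute columns,
    chosen without replacement and sorted ascending, mirroring
    steps S21-S22. Returns a list of (column_indices, submatrix).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    D = X.shape[1]
    d_sub = int(np.ceil(r * D))  # subspace sample dimension
    subspaces = []
    for _ in range(S):
        cols = np.sort(rng.choice(D, size=d_sub, replace=False))
        subspaces.append((cols, X[:, cols]))
    return subspaces
```

With r = 0.5 and D = 4 each subspace keeps ⌈0.5×4⌉ = 2 columns.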
Further, step S3 is specifically:
S31, performing kmeans clustering on a subspace: randomly select a positive integer in 2~kmax as the number of clusters k, where kmax is a positive integer greater than 2, randomly initialize the cluster centres, and obtain a kmeans clustering result matrix. The steps of the kmeans clustering algorithm are:
a) randomly select a positive integer in 2~kmax as the number of clusters k;
b) randomly select k samples from the high-dimensional data set as cluster centres, each cluster centre representing one class;
c) compute the distance from every sample to the k cluster centres using the Euclidean distance formula; if the dimension of the data set is D, the distance ρ(A, C) between a sample point A = (a[1], a[2], ..., a[D]) and a centre point C = (c[1], c[2], ..., c[D]) is defined by the formula:

ρ(A, C) = √( Σ_{d=1}^{D} (a[d] − c[d])² )

The final class of each sample is the class of its nearest cluster centre. This yields an n × k matrix W, where n is the number of samples and k the number of cluster centres; the entry w_ih indicates whether x_i belongs to the h-th class c_h: if it does, w_ih is 1, otherwise w_ih is 0. That is, the update formula of W is:

w_ih = 1 if h = arg min_{1≤j≤k} ρ(x_i, c_j), and w_ih = 0 otherwise;

d) compute the mean of all samples in each class as the new cluster centre, by the formula:

c_h = (1 / n_h) Σ_{x_i ∈ c_h} x_i

where C = {c_1, c_2, ..., c_h, ...}, ρ(x_i, c_h)² denotes the squared distance from x_i to the h-th class c_h, and n_h is the number of samples belonging to the h-th class c_h. Compare the change from the previous cluster centres with a tolerance: if the change is within the tolerance, go to step f); otherwise go to step e);
e) if the number of iterations has reached the maximum (100 by default), go to step f); otherwise repeat steps c) and d);
f) output the final clustering result.
An objective function evaluating the distance between sample points and cluster centres is constructed; on the premise of minimizing this objective, the cluster centres and sample memberships are updated iteratively until the decrease of the objective falls within a given range, or the maximum number of iterations is reached, whereupon the clustering result is output.
The objective Ω1(C*, W*) of the kmeans clustering algorithm is:

Ω1(C*, W*) = arg min_{(C,W)} φ1(C, W)

where C* and W* are the optimal solution minimizing the objective, and the target formula minimized at each iteration is:

φ1(C, W) = Σ_{i=1}^{n} Σ_{h=1}^{k} w_ih · ρ(x_i, c_h)²
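Steps a)–f) of the kmeans algorithm can be sketched as below. This is a minimal sketch under the stated assumptions (a fixed k passed in rather than drawn at random, a seed parameter, and a Frobenius-norm convergence test), not the patent's implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal kmeans matching steps a)-f): random samples as initial
    centres, Euclidean nearest-centre assignment, mean update, and a
    stop on small centre change or after max_iter iterations.
    Returns (centres, binary n x k assignment matrix W)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    C = X[rng.choice(n, size=k, replace=False)]  # k random samples as centres
    for _ in range(max_iter):
        # distance of every sample to every centre (Euclidean)
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = dist.argmin(axis=1)             # nearest-centre assignment
        new_C = np.array([X[labels == h].mean(axis=0) if np.any(labels == h)
                          else C[h] for h in range(k)])
        if np.linalg.norm(new_C - C) < tol:      # centre change within tolerance
            C = new_C
            break
        C = new_C
    W = np.zeros((n, k), dtype=int)
    W[np.arange(n), labels] = 1                  # w_ih = 1 iff x_i in class h
    return C, W
```

On two well-separated groups of points, the returned W assigns each group to one class.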
S32, performing fuzzy cmeans clustering on the same subspace and with the same number of clusters as in step S31, obtaining a fuzzy cmeans clustering result matrix. The fuzzy cmeans clustering algorithm is specifically:
a) use the same number of clusters k as selected for the kmeans clustering, and the same subspace;
b) randomly select k samples from the high-dimensional data set as cluster centres, each cluster centre representing one class;
c) assign to each sample a degree of membership with respect to each cluster centre according to the membership function; the membership represents the distance between the sample point and the cluster centre, and the membership matrix is computed as:

f_ij = 1 / Σ_{l=1}^{k} ( ||x_i − c_j|| / ||x_i − c_l|| )^{2/(β−1)}

where F = {f_ij} is the set of memberships, i ∈ {1, ..., n}, j ∈ {1, ..., k}, n is the number of samples, k is the number of clusters, and β is a real number greater than 1;
d) recompute and determine the new cluster centres from the memberships, by the formula:

c_j = Σ_{i=1}^{n} f_ij^β · x_i / Σ_{i=1}^{n} f_ij^β

where C = {c_1, c_2, ..., c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j. Compare the change from the previous cluster centres with a tolerance: if the change is within the tolerance, go to step f); otherwise go to step e);
e) if the number of iterations has reached the maximum (100 by default), go to step f); otherwise repeat steps c) and d);
f) output the final clustering result.
As before, an objective function evaluating the distance between sample points and cluster centres is constructed; on the premise of minimizing this objective, the cluster centres and memberships are updated iteratively until the decrease of the objective falls within a given range, or the maximum number of iterations is reached, whereupon the clustering result is output.
The objective Ω2(C*, F*) of the fuzzy cmeans clustering algorithm is:

Ω2(C*, F*) = arg min_{(C,F)} φ2(C, F)

where C* and F* are the optimal solution minimizing the objective.
S33, repeating steps S31 and S32 until all S subspaces have undergone kmeans and fuzzy cmeans clustering, yielding S kmeans clustering result matrices and S fuzzy cmeans clustering result matrices.
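The two update rules of the fuzzy cmeans step can be sketched as below. The function names, the default fuzzifier β = 2 and the small epsilon guarding division by zero are assumptions of this sketch, not taken from the patent.

```python
import numpy as np

def fcm_memberships(X, C, beta=2.0, eps=1e-9):
    """One membership update of fuzzy cmeans: f_ij is inversely
    related to the distance from sample i to centre j, with fuzzifier
    beta > 1. Rows of the returned n x k matrix sum to 1."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + eps
    inv = dist ** (-2.0 / (beta - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm_centres(X, F, beta=2.0):
    """Centre update: c_j = sum_i f_ij^beta x_i / sum_i f_ij^beta."""
    Fb = np.asarray(F, dtype=float) ** beta
    return (Fb.T @ np.asarray(X, dtype=float)) / Fb.sum(axis=0)[:, None]
```

When the centres coincide with the samples, each sample's membership in its own centre approaches 1 and the centre update leaves the centres essentially unchanged.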
Further, step S4 is specifically:
S41, converting the S kmeans clustering result matrices into binary clustering result matrices of size k × n, where k is the number of clusters and n the number of samples; the S fuzzy cmeans clustering result matrices are left unchanged, their size also being k × n. A kmeans clustering result matrix is converted into a binary clustering matrix as follows: first, build the k-th order identity matrix H; then, if the kmeans clustering result matrix is H_k (in particular, H_k is a column vector of class labels), the binary clustering result matrix is H_b = H(i, :) for i ∈ H_k, where H(i, j) is the entry in row i and column j of H, and H(i, :) is the i-th row of H;
S42, merging the S kmeans binary clustering result matrices and the S fuzzy cmeans clustering result matrices into a fusion matrix of size 2K × n, where K is the sum of the numbers of clusters of the S subspaces and n the number of samples. The merging step is: let the S binary clustering matrices be H_b1, H_b2, ..., H_bS and the S fuzzy cmeans clustering result matrices be F_1, F_2, ..., F_S; extend and merge them row-wise in the form [H_b1, H_b2, ..., H_bS, F_1, F_2, ..., F_S], giving a matrix of size 2K × n.
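Steps S41–S42 can be sketched as below; the zero-based labels, function names and the (labels, memberships, cluster-counts) argument convention are assumptions of this sketch.

```python
import numpy as np

def to_binary(labels, k):
    """Convert a kmeans label vector (values in 0..k-1) into a k x n
    binary clustering matrix by indexing rows of the k x k identity
    matrix, as in step S41."""
    labels = np.asarray(labels)
    return np.eye(k)[labels].T          # k x n, a single 1 per column

def fuse(hard_labels, soft_matrices, ks):
    """Stack S binary kmeans matrices and S fuzzy cmeans membership
    matrices row-wise into one fusion matrix of size 2K x n, where K
    is the sum of the per-subspace cluster counts (step S42)."""
    blocks = [to_binary(h, k) for h, k in zip(hard_labels, ks)]
    blocks += [np.asarray(f, dtype=float) for f in soft_matrices]
    return np.vstack(blocks)
```

For S = 1, k = 2 and n = 3 the fusion matrix has size 2K × n = 4 × 3.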
Further, step S5 is specifically:
S51, from the fusion matrix, computing the partition coefficient index for each number of clusters from 2 to kmax and taking the optimal number of clusters as k1. Specifically, substitute into the formula

V_pc(k) = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} f_ij²

where f_ij is the membership of sample point i in cluster centre j, k is the number of clusters and n the number of samples; the value V_pc(k) is obtained in turn for each positive integer k in 2~kmax;
S52, from the fusion matrix, computing the separation index for each number of clusters from 2 to kmax and taking the optimal number of clusters as k2; in the separation index formula, x_i is the i-th sample, c_h is the h-th cluster centre and n is the number of samples, and the index value is obtained in turn for each k;
S53, from the fusion matrix, computing the alternative Dunn index for each number of clusters from 2 to kmax and taking the optimal number of clusters as k3;
S54, sorting k1, k2 and k3 and choosing the median as the optimal number of clusters kfix. Specifically:
sort the partition coefficient values (k taking the positive integers in 2~kmax) and record the k corresponding to the maximum as k1;
sort the separation index values (k taking the positive integers in 2~kmax) and record the k corresponding to the minimum as k2;
sort the alternative Dunn index values (k taking the positive integers in 2~kmax) and record the k corresponding to the maximum as k3;
sort k1, k2 and k3 and choose the median as the optimal number of clusters kfix.
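The partition coefficient of step S51 and the median combination of step S54 can be sketched as follows; only the partition coefficient is shown here, since the patent's separation-index and alternative-Dunn formulas are given only by name, and the function names are assumptions of the sketch.

```python
import numpy as np

def partition_coefficient(F):
    """Partition coefficient V_pc = (1/n) * sum_ij f_ij^2 for an
    n x k membership matrix F; higher values indicate a crisper
    partition, so k1 is the k that maximizes it (step S51)."""
    F = np.asarray(F, dtype=float)
    return (F ** 2).sum() / F.shape[0]

def pick_k_by_median(k1, k2, k3):
    """Combine the three index-optimal cluster counts by taking
    their median, as in step S54."""
    return sorted([k1, k2, k3])[1]
```

A fully crisp membership matrix gives V_pc = 1, the maximum possible value, while the maximally fuzzy matrix with k = 2 gives 0.5.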
Further, step S6 is specifically:
S61, performing fuzzy cmeans clustering with the fusion matrix and the optimal number of clusters kfix as parameters, obtaining a clustering result matrix giving the membership of every sample in every class;
S62, determining the final class of each sample from the clustering result matrix: if the memberships are unequal, take the class corresponding to the sample's largest membership; if several memberships are equal, choose one of the classes with equal membership at random. The resulting column vector of class labels serves as the decision attribute set for the next step.
Further, step S7 is specifically:
S71, taking the fusion matrix as the condition attribute set and computing the degree of dependence of the condition attribute set on the decision attribute set, specifically:
S711, finding the equivalence relation using the formula

Γ(M) = { (x, x') | ψ(x, m) = ψ(x', m) for every m ∈ M }

which divides all samples of the high-dimensional data set into finitely many equivalence sets according to the condition attribute set, where M is the condition attribute set and ψ(x, m) is the value of sample x under attribute m;
S712, computing the positive region δ_M(L) = ∪_{1≤i≤n} p(M, L_i), where p(M, L_i) = { x ∈ X | [x]_M ⊆ L_i }, M is the condition attribute set, n is the number of equivalence sets from the first step, L_i is the i-th equivalence set, [x]_M = { x' ∈ X | (x, x') ∈ Γ(M) }, and X is the fusion matrix;
S713, computing the degree of dependence ζ_M(L) of the condition attribute set on the decision attribute set using the formula

ζ_M(L) = |δ_M(L)| / |X|

where M is the condition attribute set, L is the decision attribute set, X is the fusion matrix, |X| is the cardinality of X, ζ_M(L) is the degree of dependence of the condition attribute set on the decision attribute set, and δ_M(L) is the positive region of the condition attribute set M with respect to the decision attribute set L;
S72, computing the core attribute set: after removing the largest subset of condition attributes whose removal leaves the degree of dependence on the decision attribute set unchanged, the remaining condition attribute set is the core attribute set;
S73, initializing an attribute set A as the empty set, with a ranging over the difference between the condition attribute set and A;
S74, traversing all attribute columns a: merge the attribute column into A to form A', and test whether the degree of dependence on the decision attribute set differs between A' and A; if it changes, merge a into A, otherwise do not; stop when every a has been traversed;
S75, checking whether the attribute set A equals the core attribute set: if so, A is the reduced fusion matrix after reduction; if not, remove the redundant attributes to obtain the reduced fusion matrix.
Further, step S8 is specifically:
S81, performing fuzzy cmeans clustering with the reduced fusion matrix and the true number of clusters as parameters, obtaining a clustering result matrix giving the membership of every sample in every class;
S82, determining the final class of each sample from the clustering result matrix: if the memberships are unequal, take the class corresponding to the sample's largest membership; if several memberships are equal, choose one of the classes with equal membership at random. The resulting column vector of class labels is the final clustering result.
Further, in step S9 the purity PU is computed as:

PU = (1/n) Σ_{j=1}^{k2} max_{1≤i≤k1} | L'_j ∩ L_i |

where the true clustering result is L = {L_1, L_2, ..., L_k1}, the final clustering result is L' = {L'_1, L'_2, ..., L'_k2}, k1 is the number of clusters of the true result and k2 is the number of clusters of the final result.
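The purity computation of step S9 can be sketched as follows; the representation of the two partitions as integer label vectors and the function name are assumptions of this sketch.

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Cluster purity: each final cluster is matched to the true
    class it overlaps most, and PU is the fraction of samples
    covered by these best overlaps."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        total += np.bincount(members).max()  # size of best overlap
    return total / len(true_labels)
```

A perfect clustering gives PU = 1; one misassigned sample out of four gives PU = 0.75.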
With the above technical scheme, the present invention has at least the following beneficial effects:
(1) using random subspaces enables the framework to process high-dimensional data sets; this not only yields good diversity across subspaces but, more importantly, also speeds up clustering within each subspace;
(2) the invention fuses soft and hard clustering in one ensemble framework, effectively combining them to increase diversity in the ensemble and uniting their respective advantages;
(3) the invention analyses the different clustering results as a new attribute set, making full use of the intermediate-result data and making the analysed information more accurate;
(4) the invention introduces a combination of cluster validity indices, making the prediction of the number of clusters used in the improved rough set more accurate, and further improving the removal of redundant attributes by the subsequent improved rough-set attribute reduction;
(5) the invention removes the redundant attributes of the new attribute set using improved rough-set attribute reduction, effectively avoiding the loss of accuracy that redundancy brings.
Description of the drawings
Fig. 1 is a flow chart of the steps of a soft and hard clustering ensemble method for high-dimensional data based on random subspaces according to the present invention;
Fig. 2 is a table comparing the clustering accuracy of the method of the present invention with that of traditional single clustering algorithms.
Specific embodiment
It should be noted that, where no conflict arises, the embodiments in the application and the features in the embodiments may be combined with one another. The application is described in further detail below with reference to the accompanying drawings and specific embodiments.
The steps of the present invention are further described below with reference to Fig. 1.
Step 1, input the high-dimensional data set: a high-dimensional data set to be clustered is input, whose row vectors correspond to the sample dimension and whose column vectors correspond to the attribute dimension;
Step 2, data normalization: the maximum V(d)_max and minimum V(d)_min of the d-th attribute column are first obtained, and the attribute values of the d-th column are transformed as follows:
v_i^d' = (v_i^d − V(d)_min) / (V(d)_max − V(d)_min)
wherein v_i^d is the i-th datum of the d-th column, v_i^d' is the value after updating, i ∈ {1, 2, …, n}, d ∈ {1, 2, …, D}, n is the number of samples and D is the sample dimension. The sample dimension refers to the number of attributes of a sample; for example, a sample with 3 attributes (blood group, height and body weight) has sample dimension 3;
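As an illustrative sketch only (not part of the claimed method), the min-max normalization of step 2 can be written in Python with NumPy; mapping constant columns to 0 is an added assumption to avoid division by zero:

```python
import numpy as np

def normalize_minmax(X):
    """Min-max normalize every attribute column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)   # V(d)_min of each column d
    col_max = X.max(axis=0)   # V(d)_max of each column d
    # constant columns would give a zero denominator; map them to 0 instead
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [2.0, 20.0]])
Xn = normalize_minmax(X)  # every column now spans [0, 1]
```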
Step 3, produce the random subspaces: first, the sample dimension of the high-dimensional data set is obtained as D; the sample dimension of a subspace is then ⌈r × D⌉, where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1);
Next, ⌈r × D⌉ positive integers are chosen without repetition from the interval 0~D with a random function; the ⌈r × D⌉ positive integers are sorted from small to large, and the sorted sequence of ⌈r × D⌉ positive integers is mapped to the attribute column numbers of the high-dimensional data set; the attribute columns are read from the high-dimensional data set at random and without repetition, and the subspace is constructed, stopping when its sample dimension reaches ⌈r × D⌉;
According to the above two steps, subspaces are produced in a loop, stopping once S subspaces have been produced;
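A minimal sketch of step 3 in Python (illustrative only; the function name and the use of NumPy's random generator are assumptions): each subspace keeps ⌈r × D⌉ distinct attribute columns, sorted ascending.

```python
import numpy as np

def random_subspaces(X, S, r, seed=None):
    """Produce S random attribute subspaces of the n x D data set X.

    Each subspace keeps ceil(r * D) distinct attribute columns, chosen
    without repetition and sorted from small to large, as in step 3.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    d_sub = int(np.ceil(r * D))  # smallest integer not less than r * D
    subspaces = []
    for _ in range(S):
        cols = np.sort(rng.choice(D, size=d_sub, replace=False))
        subspaces.append((cols, X[:, cols]))
    return subspaces

X = np.arange(20.0).reshape(4, 5)           # 4 samples, D = 5 attributes
subs = random_subspaces(X, S=3, r=0.5, seed=0)
```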
Step 4, kmeans and fuzzy cmeans clustering:
First, one of the subspaces is taken and kmeans clustering is performed on it. A positive integer in 2~kmax is selected at random as the clusters number k, and the cluster centres are initialised at random: k samples are selected at random from the high-dimensional data set as cluster centres, each cluster centre representing one class. The distances of all samples to the k initial cluster centres are computed with the Euclidean distance formula, and the class finally assigned to each sample is the class of the nearest cluster centre corresponding to that sample. The mean of all samples within each class is computed as the new cluster centre, and it is checked whether the change from the former cluster centre lies within the restriction range; if not, iteration continues: samples and cluster centres are re-partitioned by distance, producing k new cluster centres after each partition, until the change between the new and the previous cluster centres lies within the restriction range or the maximum number of iterations has been reached, whereupon the final clustering result is output. The steps of the kmeans clustering algorithm are:
a) randomly select a positive integer in 2~kmax as the clusters number k;
b) randomly select k samples from the high-dimensional data set as cluster centres, each cluster centre representing one class;
c) compute the distances of all samples to the k cluster centres with the Euclidean distance formula; if the dimension of the data set is D, the distance ρ(A, C) between a sample point A = (a[1], a[2], …, a[D]) and a centre point C = (c[1], c[2], …, c[D]) is defined by the formula:
ρ(A, C) = √( Σ_{d=1}^{D} (a[d] − c[d])² )
The class finally assigned to each sample is the class of the nearest cluster centre corresponding to that sample. An n × k matrix W is thus obtained, where n is the number of samples and k the number of cluster centres. The entry w_ih of the matrix indicates whether x_i belongs to the h-th class c_h: if it does, w_ih is 1; if it does not, w_ih is 0.
That is, the update formula of W is:
w_ih = 1 if c_h is the nearest cluster centre to x_i, and w_ih = 0 otherwise;
d) compute the mean of all samples within each class as the new cluster centre; the formula is:
c_h = (1 / n_h) Σ_{x_i ∈ c_h} x_i
where C = {c_1, c_2, …, c_h} and n_h is the number of samples belonging to the h-th class c_h;
compare whether the change from the former cluster centre lies within the restriction range: if it does, go to step f); if not, go to step e);
e) if the number of iterations has reached the maximum number of iterations (100 by default), go to step f); if not, continue repeating steps c) and d);
f) output the final clustering result;
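Steps a)-f) above can be sketched as plain NumPy code (illustrative only; the tolerance and seeding are assumptions not fixed by the patent):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    """k-means as in steps a)-f): random sample centres, Euclidean
    assignment, per-class mean update; stop when the centres move less
    than tol or max_iter (default 100) is reached."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(max_iter):
        # rho(A, C): Euclidean distance of every sample to every centre
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)       # nearest centre wins
        new_centres = np.array([
            X[labels == h].mean(axis=0) if np.any(labels == h) else centres[h]
            for h in range(k)])
        moved = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if moved < tol:                    # change within restriction range
            break
    return labels, centres

# two well-separated groups of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centres = kmeans(X, k=2, seed=1)
```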
Next, fuzzy cmeans clustering is performed on the same subspace with the same clusters number k selected for kmeans. k samples are selected at random from the high-dimensional data set as cluster centres, each cluster centre representing one class; each sample is assigned a degree of membership with respect to each cluster centre according to the membership function, the degree of membership reflecting the distance between the sample point and the cluster centre. An objective function is constructed to evaluate the distances between sample points and cluster centres, and the cluster centres and the sample memberships are updated iteratively under the premise of minimising the objective function, terminating when the decrease of the objective function lies within a given increment range or the maximum number of iterations is reached; the clustering result is output, giving the fuzzy cmeans cluster result matrix.
Specifically, the fuzzy cmeans clustering algorithm is:
a) the clusters number is the same clusters number k selected when kmeans clustering was carried out on the same subspace;
b) randomly select k samples from the high-dimensional data set as cluster centres, each cluster centre representing one class;
c) assign to each sample a degree of membership with respect to each cluster centre according to the membership function, the degree of membership reflecting the distance between the sample point and the cluster centre; the membership matrix is calculated as follows:
f_ij = 1 / Σ_{h=1}^{k} ( ||x_i − c_j|| / ||x_i − c_h|| )^{2/(β−1)}
wherein F is the set of memberships, F = {f_ij}, i ∈ {1, …, n}, j ∈ {1, …, k}, n is the number of samples, k is the clusters number, and β is a real number greater than 1;
d) according to the memberships, recalculate and determine the new cluster centres; the formula is:
c_j = Σ_{i=1}^{n} f_ij^β x_i / Σ_{i=1}^{n} f_ij^β
wherein C = {c_1, c_2, …, c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j;
compare whether the change from the former cluster centre lies within the restriction range: if it does, go to step f); if not, go to step e);
e) if the number of iterations has reached the maximum number of iterations (100 by default), go to step f); if not, continue repeating steps c) and d);
f) output the final clustering result;
After the above subspace is completed, the next subspace is obtained and the same operations are carried out, until all S subspaces have undergone kmeans and fuzzy cmeans clustering, respectively yielding S kmeans cluster result matrices and S fuzzy cmeans cluster result matrices;
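A compact sketch of the fuzzy cmeans update loop (illustrative; β = 2 and the tolerance are assumed defaults). The membership and centre updates follow the standard fuzzy c-means formulas referenced above:

```python
import numpy as np

def fuzzy_cmeans(X, k, beta=2.0, max_iter=100, tol=1e-6, seed=None):
    """Fuzzy c-means: memberships F (n x k, rows sum to 1) and weighted
    centres are updated in turn until the centres stabilise."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)            # guard against zero distance
        inv = dist ** (-2.0 / (beta - 1.0))
        F = inv / inv.sum(axis=1, keepdims=True)
        Fb = F ** beta
        new_centres = (Fb.T @ X) / Fb.sum(axis=0)[:, None]
        moved = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if moved < tol:
            break
    return F, centres

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
F, centres = fuzzy_cmeans(X, k=2, seed=1)
```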
Step 5, generate the fusion matrix:
First, the S kmeans cluster result matrices are converted into binary cluster result matrices. According to the clusters number k of the respective subspace, the k-order identity matrix H is built (1 on the diagonal, 0 elsewhere).
Let the kmeans cluster result matrix be H_k; then the binary cluster result matrix is H_b = H(i,:) for i ∈ H_k, where H(i, j) is the entry in row i, column j of the matrix H and H(i,:) is its i-th row; in particular, the cluster result matrix H_k is a column vector. The corresponding binary cluster result matrix of size k × n is thus obtained (k is the clusters number, n the number of samples);
Next, the S fuzzy cmeans cluster result matrices are not converted; their size is likewise k × n. The S kmeans binary cluster result matrices and S fuzzy cmeans cluster result matrices are then merged. Let the S binary cluster matrices be H_b1, H_b2, …, H_bs, and the S fuzzy cmeans cluster result matrices F_1, F_2, …, F_s; the S kmeans binary cluster result matrices and S fuzzy cmeans cluster result matrices are extended and merged row-wise in turn, in the form [H_b1, H_b2, …, H_bs, F_1, F_2, …, F_s], giving the fusion matrix of size 2K × n (K is the sum of the clusters numbers of the S subspaces, n the number of samples);
Step 6, obtain the optimum clusters number with cluster validity indices:
First, according to the fusion matrix, the partition coefficient index is calculated for each clusters number from 2 to kmax, giving the optimum clusters number k1; the partition coefficient index is calculated according to the following formula:
PC(k) = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} f_ij²
wherein f_ij is the membership of sample point i in cluster centre j, k is the clusters number and n the number of samples; the PC values are obtained in turn, k taking the positive integers in the interval 2~kmax;
Next, according to the fusion matrix, the separation index is calculated for each clusters number from 2 to kmax, giving the optimum clusters number k2; the separation index is calculated from the memberships f_ij, the samples x_i and the cluster centres c_h, where k is the clusters number and n the number of samples; the separation values are obtained in turn, k taking the positive integers in the interval 2~kmax;
Next, according to the fusion matrix, the alternative Dunn index is calculated for each clusters number from 2 to kmax, giving the optimum clusters number k3; the alternative Dunn index is likewise calculated from the samples x_i and the cluster centres c_h, its values being obtained in turn for k taking the positive integers in the interval 2~kmax;
Then the PC values are sorted and the k corresponding to the maximum is denoted k1; the separation values are sorted and the k corresponding to the minimum is denoted k2; the alternative Dunn values are sorted and the k corresponding to the maximum is denoted k3. Finally, k1, k2 and k3 are sorted and the middle value is selected as the optimum clusters number kfix.
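For illustration, the partition coefficient index and the median-of-three combination can be sketched as follows (the separation and alternative Dunn indices would be computed analogously from their own formulas; only PC is shown here):

```python
import statistics
import numpy as np

def partition_coefficient(F):
    """PC(k) = (1/n) * sum_i sum_j f_ij**2; larger means a crisper,
    better-separated fuzzy partition."""
    n = F.shape[0]
    return float((F ** 2).sum() / n)

def combine_optima(k1, k2, k3):
    """Step 6 picks the middle of the three per-index optimal
    clusters numbers as k_fix."""
    return statistics.median([k1, k2, k3])

crisp = np.eye(3)[[0, 1, 2, 0]]       # fully crisp memberships, n = 4
fuzzy = np.full((4, 3), 1.0 / 3.0)    # maximally fuzzy memberships
```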
Step 7, construct the decision attribute set:
First, with the fusion matrix and the optimum clusters number kfix as parameters, the fuzzy cmeans clustering algorithm is applied: k samples are selected at random from the high-dimensional data set as cluster centres, each cluster centre representing one class; each sample is assigned a degree of membership with respect to each cluster centre according to the membership function, the degree of membership reflecting the distance between the sample point and the cluster centre; an objective function is constructed to evaluate the distances between sample points and cluster centres; the cluster centres and the sample memberships are updated iteratively under the premise of minimising the objective function, until the decrease of the objective function lies within a given increment range or the maximum number of iterations is reached; the clustering result is output, giving the fuzzy cmeans cluster result matrix. This yields the membership probability matrix of each sample with respect to the different classes, of size n × kfix, where n is the number of samples; each probability value in the matrix lies in the interval 0~1;
Next, according to the cluster result matrix, if the memberships are unequal, the final class of each sample is determined by the class corresponding to its maximum membership; if memberships are equal, one class is chosen at random among the classes with equal membership values. The resulting column vector of class labels serves as the decision attribute set of the next step; the decision attribute set has size n × 1, where n is the number of samples, and each value corresponds to the class of a sample.
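Step 7's final assignment (maximum membership, ties broken at random) can be sketched as:

```python
import numpy as np

def decision_attribute_set(F, seed=None):
    """Reduce the n x k_fix membership matrix to the n x 1 decision
    attribute set: each sample gets the class of its largest membership;
    if several classes tie, one of them is chosen at random."""
    rng = np.random.default_rng(seed)
    decisions = np.empty(F.shape[0], dtype=int)
    for i, row in enumerate(F):
        winners = np.flatnonzero(row == row.max())
        decisions[i] = rng.choice(winners)
    return decisions

F = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.5, 0.5, 0.0]])   # last row ties between classes 0 and 1
d = decision_attribute_set(F, seed=0)
```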
Step 8, improved rough set attribute reduction, obtaining the reduced fusion matrix:
First, the fusion matrix is taken as the condition attribute set, and the degree of dependence of the decision attribute set obtained in step 7 on the condition attribute set is calculated according to the following formula:
ζ_M(L) = |δ_M(L)| / |X|
wherein M is the condition attribute set, L is the decision attribute set, X is the fusion matrix, |X| denotes the cardinality of X, ζ_M(L) is the degree of dependence of the decision attribute set on the condition attribute set, and δ_M(L) is the positive region of the condition attribute set M with respect to the decision attribute set L; the positive region of the condition attribute set with respect to the decision attribute set is calculated as follows:
δ_M(L) = ∪_{i=1}^{n} { x ∈ X | [x]_M ⊆ L_i }
wherein M is the condition attribute set, n is the number of equivalence classes of Γ(M), L_i is the i-th equivalence class of the decision, and [x]_M = {x' ∈ X | (x, x') ∈ Γ(M)}, X being the fusion matrix; the equivalence relation is calculated as follows:
Γ(M) = { (x, x') | ∀m ∈ M, ψ(x, m) = ψ(x', m) }
wherein M is the condition attribute set and ψ(x, m) is the value of sample x under attribute m. The degree of dependence of the decision attribute set on the condition attribute set is thus obtained;
Next, each attribute column of the condition attribute set is traversed; after one of the attributes is removed, the degree of dependence of the decision attribute set on the remaining condition attribute set is calculated by the above procedure; if the degree of dependence is unchanged, the attribute is removed, otherwise it is retained, until all attribute columns have been traversed; the final remaining condition attribute set serves as the core attribute set;
Then, an attribute set A is initialised as the empty set, and a ranges over the difference set of the condition attribute set and the attribute set A. All attribute columns of a are traversed and read; the result of merging an attribute column into the attribute set A is A'; if A' changes the degree of dependence on the decision attribute set compared with the former attribute set A, then a is merged into A, otherwise it is not, stopping once a has been fully traversed;
Finally, the attribute columns of the attribute set A are compared with the core attributes; if the included attribute columns are equal, the reduced A is the reduced fusion matrix; if they are unequal, the redundant attributes between the attribute set A and the core attributes are removed, giving the reduced fusion matrix.
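The dependence degree used in step 8 can be sketched with hashable condition rows (an illustrative simplification: real-valued fusion-matrix columns would first have to be discretised, a detail handled inside the improved reduction):

```python
from collections import defaultdict

def dependency_degree(condition_rows, decisions):
    """zeta_M(L) = |POS_M(L)| / |X|: samples with identical condition
    values form an equivalence class; a class belongs to the positive
    region iff all of its members carry the same decision label."""
    classes = defaultdict(list)
    for row, label in zip(condition_rows, decisions):
        classes[tuple(row)].append(label)
    positive = sum(len(labels) for labels in classes.values()
                   if len(set(labels)) == 1)
    return positive / len(decisions)

rows = [(0, 1), (0, 1), (1, 0), (1, 1)]
labels = ['a', 'b', 'c', 'c']   # the (0, 1) class is inconsistent
deg = dependency_degree(rows, labels)
```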
Step 9, consistency function partition:
First, with the reduced fusion matrix and the true clusters number ktrue as parameters, the fuzzy cmeans clustering algorithm is applied: k samples are selected at random from the high-dimensional data set as cluster centres, each cluster centre representing one class; each sample is assigned a degree of membership with respect to each cluster centre according to the membership function, the degree of membership reflecting the distance between the sample point and the cluster centre; an objective function is constructed to evaluate the distances between sample points and cluster centres; the cluster centres and the sample memberships are updated iteratively under the premise of minimising the objective function, until the decrease of the objective function lies within a given increment range or the maximum number of iterations is reached; the clustering result is output, giving the fuzzy cmeans cluster result matrix. This yields the membership probability matrix of each sample with respect to the different classes, of size n × ktrue, where n is the number of samples and each probability value lies in the interval 0~1. The true clusters number is the number of clusters that can be obtained for the data set from prior knowledge before clustering; for example, the clusters number of a cancer data set can be determined to be two classes, one class suffering from cancer and the other not suffering from cancer;
Next, according to the cluster result matrix, if the memberships are unequal, the final class of each sample is determined by the class corresponding to its maximum membership; if memberships are equal, one class is chosen at random among the classes with equal membership values. The resulting column vector of class labels serves as the final clustering result; the final clustering result has size n × 1, where n is the number of samples, and each value corresponds to the class of a sample;
Step 10, the purity rate of the clustering result obtained with the method of the invention is calculated against the true clustering result, where k1 denotes the clusters number of the true result and k2 the clusters number of the method of the invention. The true clustering result refers to labels already provided with label information (such as suffering from cancer and not suffering from cancer); labelled information is used so that how closely the result predicted by the ensemble method approaches the true labelling can be evaluated, and the evaluation criterion is the purity rate. The purity rate is 1 if the clustering result of the ensemble method is completely consistent with the true clustering result, 0 if it is completely inconsistent, and between 0 and 1 otherwise; the closer to 1, the better the performance of the ensemble method.
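One common way to realise the purity rate of step 10 (shown as an assumption, since the patent's exact formula image is not reproduced here) matches each predicted cluster with its majority true class:

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Fraction of samples that fall in the majority true class of their
    predicted cluster: 1 for perfect agreement, low values for little."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    matched = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        matched += np.bincount(members).max()
    return matched / len(true_labels)

p_same = purity([0, 0, 1, 1], [1, 1, 0, 0])   # identical up to relabelling
p_mix = purity([0, 1, 0, 1], [0, 0, 1, 1])    # no agreement beyond chance
```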
Although embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that various equivalent changes, modifications, substitutions and variations may be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims and their equivalents.
Claims (10)
1. A high-dimensional data soft and hard clustering ensemble method based on random subspace, characterised by comprising the steps of:
S1, inputting a high-dimensional data set and normalizing it;
S2, producing random subspaces from the normalized high-dimensional data set;
S3, clustering the subspaces to obtain cluster result matrices;
S4, merging the cluster result matrices to generate a fusion matrix;
S5, according to the fusion matrix, obtaining the optimum clusters number with cluster validity indices;
S6, constructing a decision attribute set with the fusion matrix and the optimum clusters number as parameters;
S7, taking the fusion matrix as a condition attribute set and, according to the decision attribute set, applying improved rough set attribute reduction to the fusion matrix to obtain a reduced fusion matrix;
S8, clustering with the reduced fusion matrix and the true clusters number as parameters to obtain a cluster result matrix, and determining the final clustering result according to the cluster result matrix;
S9, calculating the purity rate of the final clustering result against the true clustering result.
2. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterised in that in step S1 the normalization process is specifically:
obtaining the maximum V(d)_max and minimum V(d)_min of the d-th attribute column, and transforming the data values of the d-th column as follows:
v_i^d' = (v_i^d − V(d)_min) / (V(d)_max − V(d)_min)
wherein v_i^d is the i-th datum of the d-th column, v_i^d' is the value after updating, i ∈ {1, 2, …, n}, d ∈ {1, 2, …, D}, n is the number of samples and D is the sample dimension.
3. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterised in that step S2 is specifically:
S21, obtaining the sample dimension of the high-dimensional data set and producing the sample dimension of the subspaces;
S22, choosing attribute columns of the high-dimensional data set at random and without repetition and constructing a subspace, stopping when the subspace sample dimension is reached; wherein constructing the subspace is specifically:
after the sample dimension of the high-dimensional data set is obtained as D, the subspace interval is set, and ⌈r × D⌉ non-repeating positive integers are chosen from the interval 0~D with a random function, where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1); the ⌈r × D⌉ positive integers are sorted from small to large; the sorted sequence of ⌈r × D⌉ positive integers is mapped to the attribute column numbers of the high-dimensional data set, which are extracted to construct a new subspace;
S23, repeating steps S21-S22 until S subspaces have been produced.
4. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterised in that step S3 is specifically:
S31, performing kmeans clustering on a subspace: a positive integer in 2~kmax is selected at random as the clusters number k and the cluster centres are initialised at random, giving the kmeans cluster result matrix, kmax being a positive integer greater than 2; wherein the steps of the kmeans clustering algorithm are:
a) randomly select a positive integer in 2~kmax as the clusters number k;
b) randomly select k samples from the high-dimensional data set as cluster centres, each cluster centre representing one class;
c) compute the distances of all samples to the k cluster centres with the Euclidean distance formula; if the dimension of the data set is D, the distance ρ(A, C) between a sample point A = (a[1], a[2], …, a[D]) and a centre point C = (c[1], c[2], …, c[D]) is defined by the formula:
ρ(A, C) = √( Σ_{d=1}^{D} (a[d] − c[d])² )
the class finally assigned to each sample is the class of the nearest cluster centre corresponding to that sample; an n × k matrix W is thus obtained, where n is the number of samples and k the number of cluster centres; the entry w_ih of the matrix indicates whether x_i belongs to the h-th class c_h: if it does, w_ih is 1; if it does not, w_ih is 0; that is, the update formula of W is:
w_ih = 1 if c_h is the nearest cluster centre to x_i, and w_ih = 0 otherwise;
d) compute the mean of all samples within each class as the new cluster centre; the formula is:
c_h = (1 / n_h) Σ_{x_i ∈ c_h} x_i
where C = {c_1, c_2, …, c_h} and n_h is the number of samples belonging to the h-th class c_h;
compare whether the change from the former cluster centre lies within the restriction range: if it does, go to step f); if not, go to step e);
e) if the number of iterations has reached the maximum number of iterations (100 by default), go to step f); if not, continue repeating steps c) and d);
f) output the final clustering result;
wherein an objective function is constructed to evaluate the distances between sample points and cluster centres; the cluster centres and the sample assignments are updated iteratively under the premise of minimising the objective function, until the decrease of the objective function lies within a given increment range or the maximum number of iterations is reached, and the clustering result is output;
the theoretical objective function Ω1(C*, W*) of the kmeans clustering algorithm is:
Ω1(C*, W*) = argmin_(C,W) φ1(C, W)
wherein C* and W* denote the optimal solution minimising the objective function, and the target formula of each iteration is:
φ1(C, W) = Σ_{i=1}^{n} Σ_{h=1}^{k} w_ih · ρ(x_i, c_h)²
S32, performing fuzzy cmeans clustering with the same subspace and the same clusters number as step S31, giving the fuzzy cmeans cluster result matrix;
wherein the fuzzy cmeans clustering algorithm is specifically:
a) the clusters number is the same clusters number k selected when kmeans clustering was carried out on the same subspace;
b) randomly select k samples from the high-dimensional data set as cluster centres, each cluster centre representing one class;
c) assign to each sample a degree of membership with respect to each cluster centre according to the membership function, the degree of membership reflecting the distance between the sample point and the cluster centre; the membership matrix is calculated as follows:
f_ij = 1 / Σ_{h=1}^{k} ( ||x_i − c_j|| / ||x_i − c_h|| )^{2/(β−1)}
wherein F is the set of memberships, F = {f_ij}, i ∈ {1, …, n}, j ∈ {1, …, k}, n is the number of samples, k is the clusters number, and β is a real number greater than 1;
d) according to the memberships, recalculate and determine the new cluster centres; the formula is:
c_j = Σ_{i=1}^{n} f_ij^β x_i / Σ_{i=1}^{n} f_ij^β
wherein C = {c_1, c_2, …, c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j; compare whether the change from the former cluster centre lies within the restriction range: if it does, go to step f); if not, go to step e);
e) if the number of iterations has reached the maximum number of iterations (100 by default), go to step f); if not, continue repeating steps c) and d);
f) output the final clustering result;
wherein an objective function is constructed to evaluate the distances between sample points and cluster centres; the cluster centres and the sample memberships are updated iteratively under the premise of minimising the objective function, until the decrease of the objective function lies within a given increment range or the maximum number of iterations is reached, and the clustering result is output;
the objective function Ω2(C*, F*) of the fuzzy cmeans clustering algorithm is:
Ω2(C*, F*) = argmin_(C,F) φ2(C, F)
wherein C* and F* denote the optimal solution minimising the objective function, with φ2(C, F) = Σ_{i=1}^{n} Σ_{j=1}^{k} f_ij^β ||x_i − c_j||².
S33, repeating steps S31 and S32 until all S subspaces have undergone kmeans and fuzzy cmeans clustering, respectively giving S kmeans cluster result matrices and S fuzzy cmeans cluster result matrices.
5. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 4, characterised in that step S4 is specifically:
S41, converting the S kmeans cluster result matrices into binary cluster result matrices of size k × n, k being the clusters number and n the number of samples; the S fuzzy cmeans cluster result matrices are not converted, their size likewise being k × n; wherein the step of converting a kmeans cluster result matrix into a binary cluster matrix is:
first, build the k-order identity matrix H;
then, let the kmeans cluster result matrix be H_k; the binary cluster result matrix is H_b = H(i,:) for i ∈ H_k, where H(i, j) is the entry in row i, column j of the matrix H and H(i,:) is its i-th row; in particular, the cluster result matrix H_k is a column vector;
S42, merging the S kmeans binary cluster result matrices and the S fuzzy cmeans cluster result matrices into the fusion matrix of size 2K × n, K being the sum of the clusters numbers of the S subspaces and n the number of samples; wherein the merging step is:
let the S binary cluster matrices be H_b1, H_b2, …, H_bs, and the S fuzzy cmeans cluster result matrices F_1, F_2, …, F_s; the S kmeans binary cluster result matrices and S fuzzy cmeans cluster result matrices are extended and merged row-wise in turn, in the form [H_b1, H_b2, …, H_bs, F_1, F_2, …, F_s], of size 2K × n, K being the sum of the clusters numbers of the S subspaces and n the number of samples.
6. the high dimensional data soft or hard clustering ensemble method based on stochastic subspace according to claim 1, it is characterised in that
Step S5 is specially:
S51, according to fusion matrix, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number 
be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~k, respectively calculate clusters number be 2~kmaxWhen partition coefficient index,
Optimum cluster number is obtained for k1, specially:Substitute into formula
Wherein, fijIt is degree of membership of the sample point i in cluster centre j, k is number of clusters number, the n for sample, draws successivelyValue, k takes 2~kmaxInterval positive integer;
S52, according to the fusion matrix, respectively calculate the separation index when the clusters number is 2~kmax, and obtain the optimum cluster number k2, specifically: substitute into the formula
wherein xi is the i-th sample, ch is the h-th cluster centre, and n is the number of samples; the index values are drawn successively;
S53, according to the fusion matrix, respectively calculate the alternative Dunn index when the clusters number is 2~kmax, and obtain the optimum cluster number k3, specifically: substitute into the formula
S54, sort k1, k2 and k3 and choose the median as the optimum cluster number kfix, specifically:
sort the values of the first validity index (k taking positive integers in the interval 2~kmax) and take the k corresponding to the maximum, denoted k1;
sort the separation index values (k taking positive integers in the interval 2~kmax) and take the k corresponding to the minimum, denoted k2;
sort the alternative Dunn index values (k taking positive integers in the interval 2~kmax) and take the k corresponding to the maximum, denoted k3;
sort k1, k2 and k3 and choose the median as the optimum cluster number kfix.
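The median-of-three selection in S52~S54 can be sketched as follows (a minimal illustration, assuming three precomputed validity-index curves over k = 2~kmax; the function and variable names are hypothetical):

```python
import numpy as np

def choose_k_fix(index1, index2, index3, k_min=2):
    """Pick the optimum cluster number k_fix as the median of k1, k2, k3,
    where k1 and k3 maximise their indices and k2 minimises the
    separation index (steps S52-S54)."""
    ks = np.arange(k_min, k_min + len(index1))  # candidate cluster numbers
    k1 = ks[np.argmax(index1)]   # k giving the maximum of the first index
    k2 = ks[np.argmin(index2)]   # k giving the minimum separation index
    k3 = ks[np.argmax(index3)]   # k giving the maximum alternative Dunn index
    return int(np.median([k1, k2, k3]))
```

The median guards against any single validity index favouring an outlying cluster number.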
7. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterised in that step S6 specifically comprises:
S61, using the fusion matrix and the optimum cluster number kfix as parameters, performing fuzzy c-means clustering to obtain the cluster result matrix of each sample with respect to each class;
S62, according to the cluster result matrix: if the membership degrees of a sample are unequal, its final class is the class corresponding to its maximum membership degree; if several membership degrees are equal, one class is chosen at random among the classes with equal membership values; the resulting column class vector is used as the decision attribute set of the next step.
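The hardening rule of step S62 (maximum membership with random tie-break) can be sketched as follows (an illustrative sketch; the name `harden` is hypothetical):

```python
import numpy as np

def harden(u, rng=None):
    """Turn a fuzzy membership matrix u (n_samples x n_classes) into hard
    labels: each sample gets the class of its maximum membership; when
    several memberships tie, one tied class is chosen at random (S62)."""
    rng = np.random.default_rng(0) if rng is None else rng
    labels = np.empty(u.shape[0], dtype=int)
    for i, row in enumerate(np.asarray(u)):
        tied = np.flatnonzero(row == row.max())   # classes with maximal membership
        labels[i] = tied[0] if tied.size == 1 else rng.choice(tied)
    return labels
```

The same routine serves step S82, which applies the identical rule to the final clustering.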
8. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterised in that step S7 specifically comprises:
S71, taking the fusion matrix as the conditional attribute set and calculating the degree of dependence of the conditional attribute set on the decision attribute set, specifically:
S711, finding the equivalence relation, using the formula:
all samples in the high-dimensional data set are divided into a finite number of equivalence sets according to the conditional attribute set, wherein M is the conditional attribute set and ψ(x, m) is the value of sample x under attribute m;
S712, calculating the positive region δM(L): δM(L) = ∪1≤i≤n p(M, Li), wherein M is the conditional attribute set, n is the number of equivalence sets from the previous step, Li is the i-th equivalence set, [x]M = {x′ ∈ X | (x, x′) ∈ Γ(M)}, and X is the fusion matrix;
S713, calculating the degree of dependence ζM(L) of the conditional attribute set on the decision attribute set, using the following formula:
wherein M is the conditional attribute set, L is the decision attribute set, X is the fusion matrix, |X| is the cardinality of X, ζM(L) is the degree of dependence of the conditional attribute set on the decision attribute set, and δM(L) is the positive region of the conditional attribute set M with respect to the decision attribute set L;
S72, calculating the core attribute set: if, after a maximal sub-conditional attribute set is removed, the degree of dependence of the remaining conditional attribute set on the decision attribute set is unchanged, then that remaining conditional attribute set is the core attribute set;
S73, initialising the attribute set A as the empty set, with a taken from the difference between the conditional attribute set and the attribute set A;
S74, traversing all attribute columns of a: merge an attribute column into A to form A′, and judge whether the degree of dependence on the decision attribute set changes between A′ and A; if it changes, merge a into A, otherwise do not; stop when a has been fully traversed;
S75, judging whether the attribute set A equals the core attribute set: if equal, A is the simplified fusion matrix after reduction; if unequal, removing the redundant attributes yields the simplified fusion matrix.
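Steps S71~S75 can be sketched as a rough-set dependency computation followed by a greedy forward pass (a simplified illustration on a discrete attribute table; the exact formulas for p(M, Li) and ζM(L) are given as images in the source, so the standard positive-region definitions are assumed here, and all function names are hypothetical):

```python
from collections import defaultdict

def partition(rows, attrs):
    """Equivalence classes of sample indices under the attribute columns
    in `attrs` (the indiscernibility relation Γ(M) of step S711)."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return [set(b) for b in blocks.values()]

def dependency(rows, attrs, decision):
    """ζ_M(L) = |POS_M(L)| / |X|: the fraction of samples whose
    M-equivalence class lies entirely inside one decision class
    (steps S712-S713 under the standard definitions)."""
    decision_blocks = partition([(d,) for d in decision], [0])
    pos = set()
    for block in partition(rows, attrs):
        if any(block <= L for L in decision_blocks):
            pos |= block            # block is in the positive region
    return len(pos) / len(rows)

def reduce_attributes(rows, all_attrs, decision):
    """Greedy forward reduction (S73-S74): start from an empty set A and
    keep an attribute only if adding it raises the dependency degree."""
    A = []
    for a in all_attrs:
        if dependency(rows, A + [a], decision) > dependency(rows, A, decision):
            A.append(a)             # a is not redundant
    return A
```

In the patent's setting, `rows` would be the (discretised) fusion matrix and `decision` the column class vector from step S62.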
9. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterised in that step S8 specifically comprises:
S81, using the simplified fusion matrix and the true clusters number as parameters, performing fuzzy c-means clustering to obtain the cluster result matrix of each sample with respect to each class;
S82, according to the cluster result matrix: if the membership degrees of a sample are unequal, its final class is the class corresponding to its maximum membership degree; if several membership degrees are equal, one class is chosen at random among the classes with equal membership values; the resulting column class vector is used as the final cluster result.
10. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterised in that step S9 calculates the purification rate PU as:
wherein the real cluster result is
the final cluster result is
k1 is the clusters number of the real result, and k2 is the clusters number of the final cluster result.
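The purification rate of step S9 can be sketched as follows; the exact PU formula in the source is an image, so the standard cluster-purity definition is assumed (the function name is hypothetical):

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Purity PU: each final cluster is credited with its overlap with the
    best-matching real cluster; the credits are summed and divided by n."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]   # true labels inside cluster c
        total += np.bincount(members).max()       # best overlap for cluster c
    return total / true_labels.size
```

A PU of 1.0 means every final cluster is contained in a single real cluster, regardless of label permutation.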
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610843524.6A CN106446947A (en) | 2016-09-22 | 2016-09-22 | High-dimension data soft and hard clustering integration method based on random subspace |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106446947A true CN106446947A (en) | 2017-02-22 |
Family
ID=58166005
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610843524.6A Pending CN106446947A (en) | 2016-09-22 | 2016-09-22 | High-dimension data soft and hard clustering integration method based on random subspace |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106446947A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984551A (en) * | 2017-05-31 | 2018-12-11 | 广州智慧城市发展研究院 | A kind of recommended method and system based on the multi-class soft cluster of joint |
CN109242030A (en) * | 2018-09-21 | 2019-01-18 | 京东方科技集团股份有限公司 | Draw single generation method and device, electronic equipment, computer readable storage medium |
CN110929777A (en) * | 2019-11-18 | 2020-03-27 | 济南大学 | Data kernel clustering method based on transfer learning |
CN113159155A (en) * | 2021-04-15 | 2021-07-23 | 华南农业大学 | Crime risk early warning mixed attribute data processing method, medium and equipment |
CN113159155B (en) * | 2021-04-15 | 2024-01-23 | 华南农业大学 | Mixed attribute data processing method, medium and equipment for crime risk early warning |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170222 |