CN106446947A - High-dimension data soft and hard clustering integration method based on random subspace


Info

Publication number
CN106446947A
CN106446947A (application CN201610843524.6A)
Authority
CN
China
Prior art keywords
cluster
clusters number
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610843524.6A
Other languages
Chinese (zh)
Inventor
余志文
陈洁彦
马帅
韩国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201610843524.6A priority Critical patent/CN106446947A/en
Publication of CN106446947A publication Critical patent/CN106446947A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a soft and hard clustering ensemble method for high-dimensional data based on random subspaces. The method comprises the following steps: (1) inputting a high-dimensional data set; (2) normalizing the data; (3) generating random subspaces; (4) performing kmeans and fuzzy cmeans clustering; (5) generating a fusion matrix; (6) obtaining the optimum number of clusters with cluster validity indices; (7) constructing a decision attribute set; (8) applying improved rough set attribute reduction to obtain a simplified fusion matrix; (9) partitioning with a consensus function; (10) computing the clustering purity. The method uses random subspaces to overcome the difficulty of processing high-dimensional data, combines soft and hard clustering, and makes full use of the original data and of intermediate results to remove redundant attributes from those intermediate results, which improves clustering accuracy while also accelerating clustering. It thereby addresses the inability of the prior art to fully exploit clustering information and to remove redundant information.

Description

High-dimensional data soft and hard clustering ensemble method based on random subspace
Technical field
The present invention relates to the field of machine learning, and more particularly to a soft and hard clustering ensemble method for high-dimensional data based on random subspaces.
Background art
Applying different clustering algorithms to different data sources yields different clustering results. Unifying such results into a single result through a clustering ensemble framework has proven remarkably effective, and has attracted growing attention and research from academia. Clustering ensemble methods have been successfully applied to data mining tasks such as noise data mining, heterogeneous source data mining, data distribution mining, classification data mining, and time series mining, and have found good applications in areas such as bioinformatics, information retrieval, decision making, and image processing. At present, Yu et al. have proposed several clustering ensemble frameworks, such as a hybrid clustering framework based on triple spectral clustering, and knowledge discovery of cancer mechanisms from gene expression data based on clustering ensembles. Carpineto et al. proposed a clustering ensemble framework based on probabilistic indexing and applied it to the field of title search. In many applications, clustering ensemble methods achieve better accuracy, robustness, and stability of the clustering result than single clustering algorithms.
Current clustering ensemble methods fall into two broad classes: hard clustering ensembles and soft clustering ensembles. Hard clustering ensemble methods integrate the results of hard clustering algorithms. There is much existing research on how different consensus functions can yield results of higher robustness and stability, for example consensus functions based on similarity matrices, graph cuts, weight-based partitioning, and association partitioning. There is also work that uses different techniques to produce diverse clustering results, since greater diversity helps the consensus function reach a more effective final result; examples include random resampling, random projection, and random initialization. Some studies incorporate prior knowledge into the ensemble framework; others bring semi-supervised methods into it and adapt different hard clustering algorithms to different data sets, but they do not consider combination with soft clustering. Soft clustering uses fuzzy clustering methods, and many soft clustering ensemble frameworks already exist; for example, Yu et al. proposed a tumor data cluster analysis based on a soft clustering ensemble framework. Other fuzzy theories have also been brought into soft clustering ensemble frameworks, such as fuzzy graph theory, fuzzy similarity relations, and fuzzy consensus functions based on positions and voting mechanisms. Mirzaei et al. proposed a hierarchical clustering ensemble framework based on fuzzy similarity relations. Further research adds rough sets and granular computing to clustering ensemble frameworks. Avogadri et al. designed a fuzzy clustering ensemble framework based on random projection to analyze DNA microarray data. In summary, current frameworks consider how best to add fuzzy clustering to the ensemble framework, but rarely consider introducing soft clustering and hard clustering into the ensemble framework simultaneously.
Current clustering ensembles also have certain limitations. First, most clustering ensemble frameworks lack a good way to handle high-dimensional data sets. Second, traditional clustering ensemble frameworks analyze only hard clustering or only soft clustering, without considering combining the two within one ensemble framework. Third, although some clustering ensemble methods treat clustering results as new attributes for ensemble analysis, they do not consider that this newly constructed attribute set may contain redundant or noisy attributes, and no existing ensemble framework provides a method to eliminate the redundant attributes of these new attribute sets.
Content of the invention
To overcome the shortcomings and deficiencies of the prior art, the present invention provides a soft and hard clustering ensemble method for high-dimensional data based on random subspaces, which solves the three limitations described above. Given an input high-dimensional data set, it ultimately makes fuller use of the information and achieves better clustering accuracy than traditional single clustering algorithms or current ensemble frameworks.
To solve the above technical problems, the present invention provides the following technical scheme. A soft and hard clustering ensemble method for high-dimensional data based on random subspaces comprises the following steps:
S1, inputting a high-dimensional data set and normalizing it;
S2, generating random subspaces from the normalized high-dimensional data set;
S3, clustering each subspace to obtain clustering result matrices;
S4, merging the clustering result matrices to generate a fusion matrix;
S5, deriving the optimum number of clusters from the fusion matrix using cluster validity indices;
S6, constructing a decision attribute set using the fusion matrix and the optimum number of clusters as parameters;
S7, taking the fusion matrix as the condition attribute set and, according to the decision attribute set, applying improved rough set attribute reduction to the fusion matrix to obtain a simplified fusion matrix;
S8, clustering with the simplified fusion matrix and the true number of clusters as parameters to obtain a clustering result matrix, and determining the final clustering result from it;
S9, computing the purity of the final clustering result against the true clustering result.
Further, the normalization in step S1 is specifically as follows:
Obtain the maximum V(d)_max and minimum V(d)_min of the d-th attribute column, and convert each data value of column d by the formula:

\hat{x}_i^d = \frac{x_i^d - V(d)_{min}}{V(d)_{max} - V(d)_{min}}

where x_i^d is the i-th data value of column d, \hat{x}_i^d is the updated value, i ∈ {1, 2, ..., n}, d ∈ {1, 2, ..., D}, n is the number of samples, and D is the sample dimension.
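For illustration, a minimal NumPy sketch of this column-wise min-max normalization (the function and variable names are ours, not the patent's):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every attribute column d by (x - V(d)_min) / (V(d)_max - V(d)_min)."""
    v_min = X.min(axis=0)
    v_max = X.max(axis=0)
    span = np.where(v_max > v_min, v_max - v_min, 1.0)  # guard constant columns
    return (X - v_min) / span

X = np.array([[1.0, 200.0], [3.0, 400.0], [2.0, 300.0]])
print(min_max_normalize(X))  # every column now lies in [0, 1]
```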
Further, step S2 is specifically:
S21, obtaining the sample dimension of the high-dimensional data set and generating the sample dimension of the subspace;
S22, randomly selecting attribute columns of the high-dimensional data set without repetition to construct a subspace, stopping when the subspace sample dimension is reached; the construction of a subspace is specifically:
after obtaining the sample dimension D of the high-dimensional data set, set the subspace interval and use a random function to choose ⌈r × D⌉ distinct positive integers in the interval (0, D], where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1); sort these ⌈r × D⌉ positive integers in ascending order; map the sorted sequence to attribute column numbers of the high-dimensional data set and extract those columns to construct one new subspace;
S23, repeating steps S21-S22 until S subspaces have been produced.
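A minimal sketch of steps S21-S23, assuming a NumPy array X of shape n × D; the names are illustrative only:

```python
import math
import numpy as np

def random_subspaces(X, S, r, seed=None):
    """Draw S random subspaces: each keeps ceil(r * D) distinct attribute
    columns, chosen without repetition and sorted ascending (steps S21-S22)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    d_sub = math.ceil(r * D)  # smallest integer not less than r * D
    return [X[:, np.sort(rng.choice(D, size=d_sub, replace=False))]
            for _ in range(S)]
```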
Further, step S3 is specifically:
S31, performing kmeans clustering on a subspace: randomly choose a positive integer in 2~k_max as the number of clusters k, randomly initialize the cluster centers, and obtain a kmeans clustering result matrix, k_max being a positive integer greater than 2. The steps of the kmeans clustering algorithm are:
a) randomly choose a positive integer in 2~k_max as the number of clusters k;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) compute the distance from every sample to the k cluster centers with the Euclidean distance formula; if the dimension of the data set is D, the distance ρ(A, C) between a sample point A = (a[1], a[2], ..., a[D]) and a center point C = (c[1], c[2], ..., c[D]) is defined as:

\rho(A, C) = \sqrt{\sum_{d=1}^{D} (a[d] - c[d])^2}

The final class of each sample is the class of its nearest cluster center. This yields an n × k matrix W, where n is the number of samples and k the number of cluster centers; the entry w_ih indicates whether x_i belongs to the h-th class c_h: w_ih is 1 if it does and 0 if it does not. That is, the update formula of W is:

w_{ih} = \begin{cases} 1, & h = \arg\min_{h'} \rho(x_i, c_{h'}) \\ 0, & \text{otherwise} \end{cases}

d) compute the mean of all samples in each class as the new cluster center:

c_h = \frac{1}{n_h} \sum_{x_i \in c_h} x_i

where C = {c_1, c_2, ..., c_k}, ρ(x_i, c_h)^2 denotes the squared distance from x_i to the h-th class c_h, and n_h is the number of samples belonging to the h-th class c_h. Compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result.
An objective function is constructed to evaluate the distance between sample points and cluster centers; the cluster centers and sample assignments are updated iteratively on the premise of minimizing this objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached, at which point the clustering result is output.
In theory the objective Ω_1(C*, W*) of the kmeans clustering algorithm is:

\Omega_1(C^*, W^*) = \arg\min_{(C, W)} \varphi_1(C, W)

where C* and W* are the optimal solution minimizing the objective, and the target formula computed at each iteration is:

\varphi_1(C, W) = \sum_{h=1}^{k} \sum_{i=1}^{n} w_{ih} \, \rho(x_i, c_h)^2
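A compact sketch of the kmeans procedure of step S31 (random sample initialization, nearest-center assignment, mean update, center-shift stopping rule); this is a generic implementation under the stated defaults, not the patent's own code:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    """Hard clustering: returns a length-n label vector and the k centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # rho(x_i, c_h): Euclidean distance of every sample to every center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)           # w_ih = 1 for the nearest center
        new_centers = centers.copy()
        for h in range(k):
            members = X[labels == h]
            if len(members):                   # keep old center if a class empties
                new_centers[h] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:                        # change within the constraint range
            break
    return labels, centers
```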
S32, performing fuzzy cmeans clustering with the same subspace and the same number of clusters as in step S31, obtaining a fuzzy cmeans clustering result matrix;
the fuzzy cmeans clustering algorithm is specifically:
a) use the same subspace and the same number of clusters k selected when performing the kmeans clustering;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) assign each sample a membership degree with respect to each cluster center according to the membership function; the membership degree reflects the distance between the sample point and the cluster center, and the membership matrix is computed as:

f_{ij} = \left( \sum_{h=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_h \rVert} \right)^{\frac{2}{\beta - 1}} \right)^{-1}

where F is the set of membership degrees, F = {f_ij}, i ∈ {1, ..., n}, j ∈ {1, ..., k}, n is the number of samples, k is the number of clusters, and β is a real number greater than 1;
d) recompute and determine the new cluster centers according to the membership degrees:

c_j = \frac{\sum_{i=1}^{n} f_{ij}^{\beta} x_i}{\sum_{i=1}^{n} f_{ij}^{\beta}}

where C = {c_1, c_2, ..., c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j. Compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result.
An objective function is constructed to evaluate the distance between sample points and cluster centers; the cluster centers and sample memberships are updated iteratively on the premise of minimizing this objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached, at which point the clustering result is output.
The objective Ω_2(C*, F*) of the fuzzy cmeans clustering algorithm is:

\Omega_2(C^*, F^*) = \arg\min_{(C, F)} \varphi_2(C, F)

where C* and F* are the optimal solution minimizing the objective, with \varphi_2(C, F) = \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^{\beta} \lVert x_i - c_j \rVert^2.
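A matching sketch of the fuzzy cmeans iteration of step S32, using the membership and center formulas above (again a generic implementation, with β = 2 as an assumed default):

```python
import numpy as np

def fuzzy_cmeans(X, k, beta=2.0, max_iter=100, tol=1e-6, seed=None):
    """Soft clustering: returns the n x k membership matrix F and the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    F = np.zeros((len(X), k))
    for _ in range(max_iter):
        dist = np.fmax(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2),
                       1e-12)                  # avoid division by zero
        inv = dist ** (-2.0 / (beta - 1.0))
        F = inv / inv.sum(axis=1, keepdims=True)   # f_ij, each row sums to 1
        Fb = F ** beta
        new_centers = (Fb.T @ X) / Fb.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return F, centers
```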
S33, repeating steps S31 and S32 until all S subspaces have undergone kmeans and fuzzy cmeans clustering, obtaining S kmeans clustering result matrices and S fuzzy cmeans clustering result matrices respectively.
Further, step S4 is specifically:
S41, converting the S kmeans clustering result matrices into binary clustering result matrices of size k × n, where k is the number of clusters and n the number of samples; the S fuzzy cmeans clustering result matrices are not converted, their size also being k × n. The step of converting a kmeans clustering result matrix into a binary clustering matrix is:
first, build the k-order identity matrix H;
then, if the kmeans clustering result matrix is H_k, the binary clustering result matrix is H_b = H(i, :), where H(i, j) is the entry in row i and column j of matrix H, H(i, :) is the i-th row of H, and i ∈ H_k; in particular, the clustering result matrix H_k is a column vector;
S42, merging the S kmeans binary clustering result matrices and the S fuzzy cmeans clustering result matrices to obtain the fusion matrix of size 2K × n, where K is the sum of the cluster numbers of the S subspaces and n is the number of samples. The merging step is:
let the S binary clustering matrices be H_b1, H_b2, ..., H_bs and the S fuzzy cmeans clustering result matrices be F_1, F_2, ..., F_s; extend and merge them row-wise in turn in the form [H_b1, H_b2, ..., H_bs, F_1, F_2, ..., F_s], whose size is 2K × n, K being the sum of the cluster numbers of the S subspaces and n the number of samples.
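A sketch of steps S41-S42, assuming 0-indexed label vectors from kmeans and n × k membership matrices from fuzzy cmeans; the block layout follows the [H_b1, ..., H_bs, F_1, ..., F_s] form above:

```python
import numpy as np

def fusion_matrix(kmeans_labels_list, fcm_membership_list):
    """Stack one k x n binary block per hard result and one k x n fuzzy block
    per soft result into the 2K x n fusion matrix."""
    blocks = []
    for labels in kmeans_labels_list:
        k = labels.max() + 1
        blocks.append(np.eye(k)[labels].T)     # rows of the k-order identity
    for F in fcm_membership_list:
        blocks.append(F.T)                     # memberships used as-is
    return np.vstack(blocks)
```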
Further, step S5 is specifically:
S51, according to the fusion matrix, compute the partition coefficient index for each number of clusters from 2 to k_max, and take the optimum number as k_1, specifically by substituting into the formula:

V_{PC}(k) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^2

where f_ij is the membership of sample point i to cluster center j, k is the number of clusters, and n the number of samples; obtain V_PC(k) in turn for each positive integer k in the interval 2~k_max;
S52, according to the fusion matrix, compute the separation index for each number of clusters from 2 to k_max, and take the optimum number as k_2, specifically by substituting into the formula:

V_{S}(k) = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} f_{ij}^2 \lVert x_i - c_j \rVert^2}{n \cdot \min_{j \neq h} \lVert c_j - c_h \rVert^2}

where f_ij is the membership of sample point i to cluster center j, x_i is the i-th sample, c_h is the h-th cluster center, and n is the number of samples; obtain V_S(k) in turn for each k;
S53, according to the fusion matrix, compute the alternative Dunn index for each number of clusters from 2 to k_max by substituting into the corresponding formula, and take the optimum number as k_3;
S54, sort k_1, k_2 and k_3 and choose the median as the optimum number of clusters k_fix. Specifically:
sort the V_PC(k) values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the maximum as k_1;
sort the V_S(k) values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the minimum as k_2;
sort the alternative Dunn index values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the maximum as k_3;
sort k_1, k_2 and k_3 and choose the median as the optimum number of clusters k_fix.
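The sketch below scores candidate cluster numbers and takes the median vote of steps S51-S54. The partition coefficient follows the formula above; the separation index is written in its common Xie-Beni form, which matches the quantities named in step S52 but is our reading of the original equation; the alternative Dunn index vote is left as a caller-supplied input:

```python
import numpy as np

def partition_coefficient(F):
    """V_PC(k) = (1/n) * sum_ij f_ij^2; larger is better."""
    return (F ** 2).sum() / F.shape[0]

def separation_index(F, X, centers):
    """Membership-weighted within-cluster scatter over the minimal center gap;
    smaller is better."""
    d2 = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
    cd2 = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2) ** 2
    np.fill_diagonal(cd2, np.inf)
    return ((F ** 2) * d2).sum() / (len(X) * cd2.min())

def optimal_k(X, k_max, cluster_fn, k3):
    """k1 maximizes V_PC, k2 minimizes the separation index, k3 is the
    alternative-Dunn vote supplied by the caller; return the median (S54)."""
    runs = {k: cluster_fn(X, k) for k in range(2, k_max + 1)}  # k -> (F, centers)
    k1 = max(runs, key=lambda k: partition_coefficient(runs[k][0]))
    k2 = min(runs, key=lambda k: separation_index(runs[k][0], X, runs[k][1]))
    return sorted([k1, k2, k3])[1]
```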
Further, step S6 is specifically:
S61, using the fusion matrix and the optimum number of clusters k_fix as parameters, perform fuzzy cmeans clustering to obtain the clustering result matrix giving each sample's membership to every class;
S62, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values; the resulting column of class labels serves as the decision attribute set for the next step.
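Step S62 is a simple argmax with random tie-breaking; a minimal sketch (the names are ours):

```python
import numpy as np

def harden(F, seed=None):
    """Turn an n x k membership matrix into the n x 1 decision attribute set:
    each sample takes its maximum-membership class, ties broken at random."""
    rng = np.random.default_rng(seed)
    labels = np.empty(F.shape[0], dtype=int)
    for i, row in enumerate(F):
        best = np.flatnonzero(row == row.max())
        labels[i] = best[0] if len(best) == 1 else rng.choice(best)
    return labels
```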
Further, step S7 is specifically:
S71, taking the fusion matrix as the condition attribute set, compute the dependence degree of the condition attribute set on the decision attribute set, specifically:
S711, find the equivalence relation using the formula:

\Gamma(M) = \{ (x, x') \in X \times X \mid \forall m \in M, \ \psi(x, m) = \psi(x', m) \}

which divides all samples of the high-dimensional data set into finitely many equivalence classes according to the condition attribute set, where M is the condition attribute set and ψ(x, m) is the value of sample x under attribute m;
S712, compute the positive region:

\delta_M(L) = \bigcup_{1 \le i \le n} p(M, L_i), \quad p(M, L_i) = \{ x \in X \mid [x]_M \subseteq L_i \}

where M is the condition attribute set, n is the number of equivalence classes, L_i is the i-th equivalence class of the decision attribute set L, [x]_M = { x' ∈ X | (x, x') ∈ Γ(M) }, and X is the fusion matrix;
S713, compute the dependence degree ζ_M(L) of the condition attribute set on the decision attribute set using the formula:

\zeta_M(L) = \frac{|\delta_M(L)|}{|X|}

where M is the condition attribute set, L is the decision attribute set, X is the fusion matrix, |X| is the cardinality of X, ζ_M(L) is the dependence degree of the condition attribute set on the decision attribute set, and δ_M(L) is the positive region of the condition attribute set M with respect to the decision attribute set L;
S72, compute the core attribute set: traverse each attribute column of the condition attribute set; after removing one attribute, recompute the dependence degree of the remaining condition attribute set on the decision attribute set; if the dependence degree is unchanged, the attribute is removed, otherwise it is kept; the final remaining condition attribute set is the core attribute set;
S73, initialize the attribute set A as the empty set, with a belonging to the difference set of the condition attribute set and the attribute set A;
S74, traverse all attribute columns a: merging an attribute column into A gives the attribute set A'; if the dependence degree on the decision attribute set differs between A' and A, merge a into A, otherwise do not; stop when all a have been traversed;
S75, judge whether the attribute set A equals the core attribute set: if equal, A is the simplified fusion matrix after reduction; if not, remove the redundant attributes to obtain the simplified fusion matrix.
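A sketch of the dependence degree of steps S711-S713, assuming discrete (or pre-discretized) attribute values, since equivalence classes over raw real-valued memberships would be degenerate; how the patent's improved reduction discretizes is not specified here:

```python
from collections import defaultdict

def dependency_degree(cond_rows, decision):
    """cond_rows: one attribute-value tuple per sample; decision: one decision
    label per sample. Returns |positive region| / |X|: the fraction of samples
    whose equivalence class [x]_M lies wholly inside one decision class."""
    groups = defaultdict(list)
    for i, row in enumerate(cond_rows):
        groups[tuple(row)].append(i)           # [x]_M: identical on every attribute
    pos = sum(len(idx) for idx in groups.values()
              if len({decision[i] for i in idx}) == 1)
    return pos / len(decision)
```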
Further, step S8 is specifically:
S81, using the simplified fusion matrix and the true number of clusters as parameters, perform fuzzy cmeans clustering to obtain the clustering result matrix giving each sample's membership to every class;
S82, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values; the resulting column of class labels is the final clustering result.
Further, in step S9 the purity PU is computed as:

PU = \frac{1}{n} \sum_{q=1}^{k_2} \max_{1 \le p \le k_1} |C_q \cap L_p|

where the true clustering result is L = {L_1, L_2, ..., L_{k_1}}, the final clustering result is C = {C_1, C_2, ..., C_{k_2}}, k_1 is the number of clusters of the true result, and k_2 is the number of clusters of the final result.
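A minimal sketch of this purity computation, assuming integer labels starting at 0:

```python
import numpy as np

def purity(pred, truth):
    """Sum, over predicted clusters, the size of the largest overlap with a
    true class, divided by n; 1 means the two partitions agree completely."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = sum(np.bincount(truth[pred == q]).max() for q in np.unique(pred))
    return total / len(truth)

print(purity([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, relabeled
```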
With the above technical scheme, the present invention has at least the following advantages:
(1) the random subspace method enables the framework to handle high-dimensional data sets; it not only yields good diversity across subspaces but, more importantly, also speeds up clustering within each subspace;
(2) the invention fuses soft clustering and hard clustering in one ensemble framework, using their effective combination to improve diversity within the ensemble and uniting their respective advantages;
(3) the invention analyzes the different clustering results as a new attribute set, making full use of the intermediate-result data and making the analyzed information more accurate;
(4) the invention introduces a combination of cluster validity indices, making the prediction of the number of clusters used in the improved rough set more accurate and improving the removal of redundant attributes in the subsequent improved rough set attribute reduction;
(5) the invention removes the redundant attributes of the new attribute set with the improved rough set attribute reduction method, effectively avoiding the loss of accuracy that redundancy brings.
Description of the drawings
Fig. 1 is a flow chart of the steps of the high-dimensional data soft and hard clustering ensemble method based on random subspaces of the present invention;
Fig. 2 is a table comparing the clustering accuracy of the method of the present invention with traditional single clustering algorithms.
Specific embodiment
It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another. The application is described in further detail below with reference to the drawings and specific embodiments.
The steps of the present invention are further described below with reference to Fig. 1.
Step 1, input the high-dimensional data set: input a high-dimensional data set to be clustered, whose row vectors correspond to the sample dimension and whose column vectors correspond to the attribute dimension;
Step 2, data normalization: first obtain the maximum V(d)_max and minimum V(d)_min of the d-th attribute column, then convert the attribute values of column d by:

\hat{x}_i^d = \frac{x_i^d - V(d)_{min}}{V(d)_{max} - V(d)_{min}}

where x_i^d is the i-th data value of column d, \hat{x}_i^d is the updated value, i ∈ {1, 2, ..., n}, d ∈ {1, 2, ..., D}, n is the number of samples, and D is the sample dimension. The sample dimension refers to the number of attributes of a sample; for example, a sample with 3 attributes, say blood type, height, and weight, has sample dimension 3;
Step 3, generate random subspaces: first, obtain the sample dimension D of the high-dimensional data set; the sample dimension of a subspace is then ⌈r × D⌉, where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1);
next, use a random function to choose ⌈r × D⌉ distinct positive integers in the interval (0, D], sort them in ascending order, map the sorted sequence to attribute column numbers of the high-dimensional data set, and read those attribute columns of the high-dimensional data set without repetition to construct a subspace, stopping when the subspace reaches sample dimension ⌈r × D⌉;
according to the above two steps, subspaces are generated in a loop, which stops once S subspaces have been produced;
Step 4, kmeans and fuzzy cmeans clustering:
First, obtain one of the subspaces and perform kmeans clustering on it: randomly choose a positive integer in 2~k_max as the number of clusters k and randomly initialize the cluster centers by selecting k samples from the high-dimensional data set, each representing one class; compute the distance from every sample to the k initial cluster centers with the Euclidean distance formula; the final class of each sample is the class of its nearest cluster center. Compute the mean of all samples in each class as the new cluster center and compare the change from the previous centers with the constraint range; if not within the constraint range, continue iterating: re-partition the samples among the cluster centers by distance and produce k new cluster centers after each partition, until the change of the new cluster centers from the previous ones is within the constraint range or the maximum number of iterations has been reached, then output the final clustering result. The steps of the kmeans clustering algorithm are:
a) randomly choose a positive integer in 2~k_max as the number of clusters k;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) compute the distance from every sample to the k cluster centers with the Euclidean distance formula; if the dimension of the data set is D, the distance ρ(A, C) between a sample point A = (a[1], a[2], ..., a[D]) and a center point C = (c[1], c[2], ..., c[D]) is defined as:

\rho(A, C) = \sqrt{\sum_{d=1}^{D} (a[d] - c[d])^2}

The final class of each sample is the class of its nearest cluster center. This yields an n × k matrix W, where n is the number of samples and k the number of cluster centers; the entry w_ih indicates whether x_i belongs to the h-th class c_h: w_ih is 1 if it does and 0 if it does not. That is, the update formula of W is:

w_{ih} = \begin{cases} 1, & h = \arg\min_{h'} \rho(x_i, c_{h'}) \\ 0, & \text{otherwise} \end{cases}

d) compute the mean of all samples in each class as the new cluster center:

c_h = \frac{1}{n_h} \sum_{x_i \in c_h} x_i

where C = {c_1, c_2, ..., c_k}, ρ(x_i, c_h)^2 denotes the squared distance from x_i to the h-th class c_h, and n_h is the number of samples belonging to the h-th class c_h; compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 iterations by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result;
Next, select the same number of clusters k chosen for the kmeans clustering and the same subspace, randomly select k samples from the high-dimensional data set as cluster centers, each representing one class; assign each sample a membership degree with respect to each cluster center according to the membership function, the membership degree reflecting the distance between the sample point and the cluster center; construct an objective function evaluating the distance between sample points and cluster centers, and iteratively update the cluster centers and sample memberships on the premise of minimizing the objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached; output the clustering result to obtain the fuzzy cmeans clustering result matrix.
The fuzzy cmeans clustering algorithm is specifically:
a) use the same subspace and the same number of clusters k selected when performing the kmeans clustering;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) assign each sample a membership degree with respect to each cluster center according to the membership function; the membership degree reflects the distance between the sample point and the cluster center, and the membership matrix is computed as:

f_{ij} = \left( \sum_{h=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_h \rVert} \right)^{\frac{2}{\beta - 1}} \right)^{-1}

where F is the set of membership degrees, F = {f_ij}, i ∈ {1, ..., n}, j ∈ {1, ..., k}, n is the number of samples, k is the number of clusters, and β is a real number greater than 1;
d) recompute and determine the new cluster centers according to the membership degrees:

c_j = \frac{\sum_{i=1}^{n} f_{ij}^{\beta} x_i}{\sum_{i=1}^{n} f_{ij}^{\beta}}

where C = {c_1, c_2, ..., c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j;
compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 iterations by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result;
After completing one subspace, continue with the next subspace and perform the same operations, until all S subspaces have undergone kmeans and fuzzy cmeans clustering, obtaining S kmeans clustering result matrices and S fuzzy cmeans clustering result matrices respectively;
Step 5, generate the fusion matrix:
First, convert the S kmeans clustering result matrices into binary clustering result matrices: according to the cluster number k of the respective subspace, build the k-order identity matrix H;
let the kmeans clustering result matrix be H_k; then the binary clustering result matrix is H_b = H(i, :), where H(i, j) is the entry in row i and column j of matrix H, H(i, :) is the i-th row of H, and i ∈ H_k; in particular, the clustering result matrix H_k is a column vector. This yields the corresponding binary clustering result matrix of size k × n (k being the number of clusters, n the number of samples);
Next, the S fuzzy cmeans clustering result matrices are not converted, their size also being k × n. Then merge the S kmeans binary clustering result matrices and the S fuzzy cmeans clustering result matrices: let the S binary clustering matrices be H_b1, H_b2, ..., H_bs and the S fuzzy cmeans clustering result matrices be F_1, F_2, ..., F_s; extend and merge them row-wise in turn in the form [H_b1, H_b2, ..., H_bs, F_1, F_2, ..., F_s], obtaining the fusion matrix of size 2K × n (K being the sum of the cluster numbers of the S subspaces, n the number of samples);
Step 6, derive the optimum number of clusters with cluster validity indices:
First, according to the fusion matrix, compute the partition coefficient index for each number of clusters from 2 to k_max and take the optimum number as k_1; the partition coefficient index is computed by the formula:

V_{PC}(k) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^2

where f_ij is the membership of sample point i to cluster center j, k is the number of clusters, and n the number of samples; obtain V_PC(k) in turn for each positive integer k in the interval 2~k_max;
Then, according to the fusion matrix, compute the separation index for each number of clusters from 2 to k_max and take the optimum number as k_2; the separation index is computed by the formula:

V_{S}(k) = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} f_{ij}^2 \lVert x_i - c_j \rVert^2}{n \cdot \min_{j \neq h} \lVert c_j - c_h \rVert^2}

where f_ij is the membership of sample point i to cluster center j, x_i is the i-th sample, c_h is the h-th cluster center, k is the number of clusters, and n the number of samples; obtain V_S(k) in turn for each k;
Then, according to the fusion matrix, compute the alternative Dunn index for each number of clusters from 2 to k_max by substituting into the corresponding formula and take the optimum number as k_3, where k is the number of clusters, n the number of samples, x_i the i-th sample, and c_h the h-th cluster center;
Then, sort the V_PC(k) values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the maximum as k_1; sort the V_S(k) values and record the k corresponding to the minimum as k_2; sort the alternative Dunn index values and record the k corresponding to the maximum as k_3; sort k_1, k_2 and k_3 and choose the median as the optimum number of clusters k_fix.
Step 7, construct the decision attribute set:
First, using the fusion matrix and the optimum number of clusters k_fix as parameters, run the fuzzy cmeans clustering algorithm: randomly select k samples from the high-dimensional data set as cluster centers, each representing one class; assign each sample a membership degree with respect to each cluster center according to the membership function, the membership degree reflecting the distance between the sample point and the cluster center; construct an objective function evaluating the distance between sample points and cluster centers; iteratively update the cluster centers and sample memberships on the premise of minimizing the objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached; output the clustering result to obtain the fuzzy cmeans clustering result matrix. This gives each sample's membership probability matrix over the classes, of size n × k_fix, where n is the number of samples and every probability value lies in the interval 0~1;
Next, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values. The resulting column of class labels serves as the decision attribute set for the next step; its size is n × 1, where n is the number of samples and each value is the class the sample belongs to.
Step 8, improved rough set attribute reduction to obtain the simplified fusion matrix:
First, take the fusion matrix as the condition attribute set; the dependence degree of this condition attribute set on the decision attribute set obtained in step 7 is computed by the formula:

\zeta_M(L) = \frac{|\delta_M(L)|}{|X|}

where M is the condition attribute set, L is the decision attribute set, X is the fusion matrix, |X| is the cardinality of X, ζ_M(L) is the dependence degree of the condition attribute set on the decision attribute set, and δ_M(L) is the positive region of the condition attribute set M with respect to the decision attribute set L. The positive region of the condition attribute set in the decision attribute set is computed as:

\delta_M(L) = \bigcup_{1 \le i \le n} p(M, L_i), \quad p(M, L_i) = \{ x \in X \mid [x]_M \subseteq L_i \}

where M is the condition attribute set, n is the number of equivalence classes, L_i is the i-th equivalence class of the decision attribute set L, [x]_M = { x' ∈ X | (x, x') ∈ Γ(M) }, and X is the fusion matrix. The equivalence classes are computed from:

\Gamma(M) = \{ (x, x') \in X \times X \mid \forall m \in M, \ \psi(x, m) = \psi(x', m) \}

where M is the condition attribute set and ψ(x, m) is the value of sample x under attribute m. This gives the dependence degree of the condition attribute set on the decision attribute set;
Next, traverse each attribute column of the condition attribute set: after removing one attribute, compute the dependence degree of the remaining condition attribute set on the decision attribute set by the above procedure; if the dependence degree is unchanged, remove that attribute, otherwise keep it; after all attribute columns have been traversed, the final remaining condition attribute set is the core attribute set;
Then, initialize the attribute set A as the empty set, with a belonging to the difference set of the condition attribute set and A; traverse all attribute columns a: merging an attribute column into the attribute set A gives A'; if A' changes the dependence degree on the decision attribute set compared with the former attribute set A, merge a into A, otherwise do not; stop when all a have been traversed;
Finally, compare the attribute columns of the attribute set A with the core attributes: if the contained attribute columns are equal, A is the simplified fusion matrix after reduction; if unequal, remove the attributes that are redundant between A and the core attributes to obtain the simplified fusion matrix.
Step 9, consensus function partitioning:
First, using the simplified fusion matrix and the true number of clusters k_true as parameters, run the fuzzy cmeans clustering algorithm: randomly select k samples from the high-dimensional data set as cluster centers, each representing one class; assign each sample a membership degree with respect to each cluster center according to the membership function, the membership degree reflecting the distance between the sample point and the cluster center; construct an objective function evaluating the distance between sample points and cluster centers; iteratively update the cluster centers and sample memberships on the premise of minimizing the objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached; output the clustering result to obtain the fuzzy cmeans clustering result matrix. This gives each sample's membership probability matrix over the classes, of size n × k_true, where n is the number of samples and every probability value lies in the interval 0~1. The true number of clusters is the number of clusters of the data set obtainable from prior knowledge at clustering time; for example, the number of clusters of a cancer data set can be determined to be two classes, one class having cancer and the other not having cancer;
Next, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values. The resulting column of class labels is the final clustering result, of size n × 1, where n is the number of samples and each value is the class the sample belongs to;
Step 10, compute the purity of the clustering result obtained by the method of the invention against the real clustering result, according to the formula:

PU = \frac{1}{n} \sum_{q=1}^{k_2} \max_{1 \le p \le k_1} |C_q \cap L_p|

where the real clustering result is L = {L_1, L_2, ..., L_{k_1}}, the clustering result obtained by the method of the invention is C = {C_1, C_2, ..., C_{k_2}}, k_1 is the number of clusters of the real result, and k_2 is the number of clusters of the method of the invention. The true clustering result refers to labels that already carry label information (for example, having cancer versus not having cancer); labeled information is used so that the results predicted by the ensemble method can be evaluated by how closely they approach the true labels, the criterion of that evaluation being the purity. If the clustering result of the ensemble method is completely consistent with the true clustering result, the purity is 1; if it is completely inconsistent, the purity is 0; if the results partially agree, the purity lies between 0 and 1. Thus the closer to 1, the better the performance of the ensemble method.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various equivalent changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims and their equivalents.

Claims (10)

1. A soft and hard clustering ensemble method for high-dimensional data based on random subspaces, characterized by comprising the following steps:
S1, inputting a high-dimensional data set and normalizing it;
S2, generating random subspaces from the normalized high-dimensional data set;
S3, clustering each subspace to obtain clustering result matrices;
S4, merging the clustering result matrices to generate a fusion matrix;
S5, deriving the optimum number of clusters from the fusion matrix using cluster validity indices;
S6, constructing a decision attribute set using the fusion matrix and the optimum number of clusters as parameters;
S7, taking the fusion matrix as the condition attribute set and, according to the decision attribute set, applying improved rough set attribute reduction to the fusion matrix to obtain a simplified fusion matrix;
S8, clustering with the simplified fusion matrix and the true number of clusters as parameters to obtain a clustering result matrix, and determining the final clustering result from it;
S9, computing the purity of the final clustering result against the true clustering result.
2. The high-dimensional data soft and hard clustering ensemble method based on random subspaces according to claim 1, characterized in that the normalization in step S1 is specifically:
obtain the maximum V(d)_max and minimum V(d)_min of the d-th attribute column, and convert each data value of column d by the formula:

\hat{x}_i^d = \frac{x_i^d - V(d)_{min}}{V(d)_{max} - V(d)_{min}}

where x_i^d is the i-th data value of column d, \hat{x}_i^d is the updated value, i ∈ {1, 2, ..., n}, d ∈ {1, 2, ..., D}, n is the number of samples, and D is the sample dimension.
3. The high-dimensional data soft and hard clustering ensemble method based on random subspaces according to claim 1, characterized in that step S2 is specifically:
S21, obtaining the sample dimension of the high-dimensional data set and generating the sample dimension of the subspace;
S22, randomly selecting attribute columns of the high-dimensional data set without repetition to construct a subspace, stopping when the subspace sample dimension is reached; the construction of a subspace is specifically:
after obtaining the sample dimension D of the high-dimensional data set, set the subspace interval and use a random function to choose ⌈r × D⌉ distinct positive integers in the interval (0, D], where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1); sort these ⌈r × D⌉ positive integers in ascending order; map the sorted sequence to attribute column numbers of the high-dimensional data set and extract those columns to construct one new subspace;
S23, repeating steps S21-S22 until S subspaces have been produced.
4. the high dimensional data soft or hard clustering ensemble method based on stochastic subspace according to claim 1, it is characterised in that Step S3 is specially:
S31, subspace is carried out kmeans cluster, randomly choose 2~kmaxIn a positive integer as clusters number k, cluster Center random initializtion, obtains kmeans cluster result matrix, kmaxFor a positive integer more than 2;Wherein
The step of kmeans clustering algorithm is,
A) 2~k is randomly choosedmaxIn a positive integer as clusters number k;
B) concentrate k sample of random selection as cluster centre from high dimensional data, each cluster centre represents a class;
C) all samples are calculated to the distance of k cluster centre using Euclidean distance formula, if the dimension of data set is D, sample The distance between point A=(a [1], a [2] ..., a [D]) and central point C=(c [1], c [2] ..., c [D]) ρ (A, C) is defined as Equation below:
Classification belonging to each sample is final is the classification corresponding to the corresponding nearest cluster centre of the sample;Then, must To the matrix W of a n × k, it is cluster centre number that n is number of samples, k;W in matrixihRepresent:Judge xiWhether is belonged to H class chIf belonging to, wihFor 1, no be not belonging to, then wih, it is 0;
That is the renewal computing formula of W is as follows:
D) meansigma methodss of all samples of each apoplexy due to endogenous wind are calculated, and used as new cluster centre, computing formula is as follows:
C={ c1,c2,……,ch},(xi,ch)2Represent xiTo h-th class chDistance;nhFor belonging to h-th class chSample Number;
The difference for relatively changing with former cluster centre if in restriction range, enters f) step whether in restriction range, if Not in restriction range, then enter e) step;
If e) iterationses reach maximum iteration time, acquiescence maximum iteration time is 100, then to enter f) step, if not reaching Maximum iteration time, then continue to repeat c), d) step;
F) final output cluster result;
Wherein, the distance that an object function evaluates sample point and cluster centre is constructed;Before to the minimization of object function Put the degree of membership that iteration updates cluster centre and sample point;Until object function reduce degree given incremental range it Interior, or reach maximum iteration time and terminate, export cluster result;
The object function Ω of kmeans clustering algorithm in theory1(C*,W*) as follows:
Ω1(C*,W*)=argmin(C,W)φ1(C,W)
Wherein, C*And W*Represented is the optimal solution for minimizing object function, and the target formula for iterating to calculate each time is as follows:
S32, fuzzy cmeans cluster is carried out using step S31 identical subspace and identical clusters number, obtain fuzzy Cmeans cluster result matrix;
Wherein, fuzzy cmeans clustering algorithm is specially:
A) to carry out selected clusters number k during kmeans cluster identical for clusters number and same sub-spaces;
B) concentrate k sample of random selection as cluster centre from high dimensional data, each cluster centre represents a class;
c) each sample is assigned a degree of membership with respect to each cluster center according to the membership function, the membership reflecting the distance between the sample point and the cluster center; the membership matrix is calculated as:

f_{ij} = \left[ \sum_{h=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_h \rVert} \right)^{\frac{2}{\beta - 1}} \right]^{-1}

where F is the set of memberships, F = {f_ij}, i ∈ {1, …, n}, j ∈ {1, …, k}, n is the number of samples, k is the number of clusters, and β is a real number greater than 1;
d) according to the memberships, the new cluster centers are recalculated with the formula:

c_j = \frac{\sum_{i=1}^{n} f_{ij}^{\beta}\, x_i}{\sum_{i=1}^{n} f_{ij}^{\beta}}

where C = {c_1, c_2, …, c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j; whether the change relative to the former cluster centers is within the tolerance range is then compared: if it is, step f) is entered; if it is not, step e) is entered;
e) if the number of iterations has reached the maximum number of iterations, 100 by default, step f) is entered; if the maximum number of iterations has not been reached, steps c) and d) are repeated;
f) the final clustering result is output;
in other words, an objective function evaluating the distances between sample points and cluster centers is constructed; under the premise of minimizing this objective function, the cluster centers and the sample-point memberships are updated iteratively; the iteration terminates and the clustering result is output once the decrease of the objective function lies within a given tolerance or the maximum number of iterations is reached;
the objective function Ω_2(C*, F*) of the fuzzy c-means clustering algorithm is:

\Omega_2(C^*, F^*) = \arg\min_{(C, F)} \varphi_2(C, F), \qquad \varphi_2(C, F) = \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^{\beta} \lVert x_i - c_j \rVert^2

where C* and F* denote the optimal solution minimizing the objective function.
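Likewise, a hedged sketch of the fuzzy c-means iteration of step S32, implementing the membership and center update formulas above; β = 2 is an assumed default, since the claim only requires β > 1:

```python
import numpy as np

def fuzzy_cmeans(X, k, beta=2.0, max_iter=100, tol=1e-6, rng=None):
    """Fuzzy c-means as in step S32: membership update from distance ratios,
    membership-weighted center update, same stopping rules as k-means."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(max_iter):
        # pairwise distances; +1e-12 guards against division by zero
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # f_ij = 1 / sum_h (||x_i - c_j|| / ||x_i - c_h||)^(2 / (beta - 1))
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (beta - 1.0))
        F = 1.0 / ratio.sum(axis=2)                          # n x k membership matrix
        Fb = F ** beta
        # c_j = sum_i f_ij^beta x_i / sum_i f_ij^beta
        new_centers = (Fb.T @ X) / Fb.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return F, centers
```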
S33, steps S31 and S32 are repeated until k-means and fuzzy c-means clustering have been performed on all S subspaces, yielding S k-means clustering result matrices and S fuzzy c-means clustering result matrices.
5. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 4, characterized in that step S4 specifically comprises:
S41, the S k-means clustering result matrices are converted into binary clustering result matrices of size k × n, k being the cluster number and n the number of samples; the S fuzzy c-means clustering result matrices, also of size k × n, are left unchanged; wherein the step of converting a k-means clustering result matrix into a binary clustering matrix is:
first, the k × k identity matrix H is built;
then, if the k-means clustering result matrix is H_k, the binary clustering result matrix H_b is formed from the rows H(i, :) for i ∈ H_k, i.e. for each class label i in H_k the i-th row of the identity matrix H is taken, where H(i, j) is the entry in row i, column j of H and H(i, :) is the i-th row of H; in particular, the clustering result matrix H_k is a column vector of class labels;
S42, the S k-means binary clustering result matrices and the S fuzzy c-means clustering result matrices are merged to obtain the fusion matrix, of size 2K × n, where K is the sum of the cluster numbers of the S subspaces and n is the number of samples; wherein the merging step is:
let the S binary clustering matrices be H_b1, H_b2, …, H_bS and the S fuzzy c-means clustering result matrices be F_1, F_2, …, F_S; the S k-means binary clustering result matrices and the S fuzzy c-means clustering result matrices are stacked row-wise in turn, in the form [H_b1, H_b2, …, H_bS, F_1, F_2, …, F_S], giving a matrix of size 2K × n, K being the sum of the cluster numbers of the S subspaces and n the number of samples.
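A sketch of the fusion step of S41–S42, assuming each subspace contributes a label vector from k-means and a k × n membership matrix from fuzzy c-means (the kmeans and fuzzy_cmeans sketches above return n × k matrices, which would be transposed first); the identity-row construction mirrors the binary conversion above:

```python
import numpy as np

def fusion_matrix(kmeans_labels, fcm_memberships):
    """Build the 2K x n fusion matrix of step S4: each k-means label vector is
    expanded to a k x n binary indicator block via rows of the k x k identity,
    and the k x n fuzzy membership matrices are stacked below unchanged."""
    blocks = []
    for labels in kmeans_labels:          # one length-n label vector per subspace
        k = int(labels.max()) + 1
        Hb = np.eye(k)[labels].T          # identity rows -> k x n binary block
        blocks.append(Hb)
    blocks.extend(fcm_memberships)        # k x n membership matrices, as-is
    return np.vstack(blocks)              # 2K x n, K = sum of cluster numbers
```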
6. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S5 specifically comprises:
S51, using the fusion matrix, the partition coefficient index is calculated for each cluster number from 2 to k_max, and the optimal cluster number k_1 is obtained; specifically, substitute into the formula:

V_{PC}(k) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^2

where f_ij is the membership of sample point i to cluster center j, k is the number of clusters and n the number of samples; V_PC(k) is evaluated in turn for every positive integer k in the interval 2~k_max;
S52, using the fusion matrix, the separation index is calculated for each cluster number from 2 to k_max, and the optimal cluster number k_2 is obtained; specifically, substitute into the formula:

V_{S}(k) = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} f_{ij}^2 \lVert x_i - c_j \rVert^2}{n \cdot \min_{j \neq h} \lVert c_j - c_h \rVert^2}

where x_i is the i-th sample, c_h is the h-th cluster center and n is the number of samples; V_S(k) is evaluated in turn for every positive integer k in the interval 2~k_max;
S53, using the fusion matrix, the alternative Dunn index is calculated for each cluster number from 2 to k_max, and the optimal cluster number k_3 is obtained by substituting into the corresponding index formula, the index being a ratio of between-cluster separation to within-cluster scatter for which larger values indicate a better partition; the index value ADI(k) is evaluated in turn for every positive integer k in the interval 2~k_max;
S54, k_1, k_2 and k_3 are sorted and the median is chosen as the optimal cluster number k_fix; specifically:
the V_PC(k) values (k taking the positive integers in the interval 2~k_max) are sorted, and the k corresponding to the maximum is recorded as k_1;
the V_S(k) values (k taking the positive integers in the interval 2~k_max) are sorted, and the k corresponding to the minimum is recorded as k_2;
the ADI(k) values (k taking the positive integers in the interval 2~k_max) are sorted, and the k corresponding to the maximum is recorded as k_3;
k_1, k_2 and k_3 are sorted, and the median is chosen as the optimal cluster number k_fix.
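A sketch of the index-based selection of step S5, shown for the partition coefficient only; the separation and alternative Dunn indices would follow the same pattern, with the minimum and the maximum taken respectively, and the median of the three winning cluster numbers is returned as in S54. fuzzy_cmeans refers to the sketch given earlier, and fusion_T is assumed to be the fusion matrix transposed to samples × attributes:

```python
import numpy as np

def partition_coefficient(F):
    """V_PC = (1/n) * sum_i sum_j f_ij^2 for an n x k membership matrix F."""
    return (F ** 2).sum() / F.shape[0]

def select_k1(fusion_T, k_max):
    """Step S51: evaluate V_PC for each k in 2..k_max on the fusion matrix
    and keep the k with the largest value."""
    scores = {}
    for k in range(2, k_max + 1):
        F, _ = fuzzy_cmeans(fusion_T, k)   # fuzzy_cmeans from the sketch above
        scores[k] = partition_coefficient(F)
    return max(scores, key=scores.get)

def median_of_three(k1, k2, k3):
    """Step S54: sort the three candidate cluster numbers, take the middle one."""
    return sorted((k1, k2, k3))[1]
```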
7. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S6 specifically comprises:
S61, fuzzy c-means clustering is performed with the fusion matrix and the optimal cluster number k_fix as parameters, yielding a clustering result matrix of the memberships of each sample to the different classes;
S62, according to the clustering result matrix, if the memberships of a sample are unequal, its final class is the class corresponding to its maximum membership; if several memberships are equal, one class is chosen at random among the classes with equal membership values; the resulting column vector of class labels serves as the decision attribute set for the next step.
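A small sketch of the hardening rule of step S62, with random tie-breaking as claimed; F is an n × k membership matrix and the names are hypothetical:

```python
import numpy as np

def harden(F, rng=None):
    """Step S62: assign each sample to the class with the largest membership,
    breaking exact ties uniformly at random."""
    rng = np.random.default_rng(rng)
    labels = np.empty(F.shape[0], dtype=int)
    for i, row in enumerate(F):
        best = np.flatnonzero(row == row.max())   # all classes tied at the maximum
        labels[i] = rng.choice(best)              # random choice among the ties
    return labels
```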
8. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S7 specifically comprises:
S71, the fusion matrix is taken as the conditional attribute set, and the degree of dependence of the conditional attribute set on the decision attribute set is calculated, specifically:
S711, the equivalence relation is found, using the formula:

\Gamma(M) = \{ (x, x') \mid \forall m \in M : \psi(x, m) = \psi(x', m) \}

by which all samples in the high-dimensional data set are partitioned into finitely many equivalence sets according to the conditional attribute set, where M is the conditional attribute set and ψ(x, m) is the value of sample x under attribute m;
S712, the positive region δ_M(L) is calculated:

\delta_M(L) = \bigcup_{1 \le i \le n} p(M, L_i), \qquad p(M, L_i) = \{ x \in X \mid [x]_M \subseteq L_i \}

where M is the conditional attribute set, n is the number of equivalence sets from the previous step, L_i is the i-th equivalence set, [x]_M = { x' ∈ X | (x, x') ∈ Γ(M) }, and X is the fusion matrix;
S713, the degree of dependence ζ_M(L) of the conditional attribute set on the decision attribute set is calculated with the following formula:

\zeta_M(L) = \frac{|\delta_M(L)|}{|X|}

where M is the conditional attribute set, L is the decision attribute set, X is the fusion matrix, |X| denotes the cardinality of X, ζ_M(L) is the degree of dependence of the conditional attribute set on the decision attribute set, and δ_M(L) is the positive region of the conditional attribute set M with respect to the decision attribute set L;
S72, the core attribute set is calculated: if, after a maximal proper subset of the conditional attribute set is removed, the degree of dependence of the remaining conditional attribute set on the decision attribute set is unchanged, then that remaining conditional attribute set is the core attribute set;
S73, the attribute set A is initialized as the empty set, and a denotes an attribute in the difference set between the conditional attribute set and the attribute set A;
S74, all attribute columns a are traversed: each attribute column is tentatively merged into the attribute set A so that it becomes the attribute set A'; whether the degree of dependence on the decision attribute set changes between A' and A is judged, and if it changes, a is merged into A; otherwise it is not merged; the traversal stops once every a has been traversed;
S75, whether the attribute set A equals the core attribute set is judged: if they are equal, A gives the simplified fusion matrix after reduction; if they are not equal, the redundant attributes are removed to obtain the simplified fusion matrix.
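A hedged sketch of the dependence-degree computation underlying S711–S713 and S74, assuming the attribute columns have already been discretized so that exact equivalence classes exist (the claim does not fix a discretization; all names are illustrative):

```python
from collections import defaultdict

def dependence_degree(table, attrs, decision):
    """Degree of dependence of the attribute subset `attrs` on the decision
    labels: |positive region| / |X|, as in S713. `table[i]` holds the i-th
    sample's attribute values and `decision[i]` its decision class."""
    groups = defaultdict(list)                 # equivalence classes under attrs
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)     # psi(x, m) for every m in attrs
        groups[key].append(i)
    positive = 0
    for members in groups.values():
        if len({decision[i] for i in members}) == 1:   # [x]_M within one class,
            positive += len(members)                   # so it joins the positive region
    return positive / len(table)
```

A forward-selection reduct in the spirit of S73–S74 would then add an attribute a to A only when dependence_degree(table, A + [a], decision) differs from dependence_degree(table, A, decision).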
9. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S8 specifically comprises:
S81, fuzzy c-means clustering is performed with the simplified fusion matrix and the true cluster number as parameters, yielding a clustering result matrix of the memberships of each sample to the different classes;
S82, according to the clustering result matrix, if the memberships of a sample are unequal, its final class is the class corresponding to its maximum membership; if several memberships are equal, one class is chosen at random among the classes with equal membership values; the resulting column vector of class labels serves as the final clustering result.
10. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S9 calculates the purity rate PU as:

PU = \frac{1}{n} \sum_{j=1}^{k_2} \max_{1 \le i \le k_1} |C_j \cap T_i|

where the true clustering result is T = {T_1, T_2, …, T_{k_1}}, the final clustering result is C = {C_1, C_2, …, C_{k_2}}, k_1 is the cluster number of the true result and k_2 is the cluster number of the final result.
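A sketch of the purity computation of step S9 under the reconstruction above, assuming non-negative integer labels for both the true and the final clustering:

```python
import numpy as np

def purity(true_labels, pred_labels):
    """PU = (1/n) * sum over final clusters of the size of the best-matching
    true-class overlap, where n is the number of samples."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]   # true classes inside cluster c
        total += np.bincount(members).max()       # best overlap |C_j ∩ T_i|
    return total / len(true_labels)
```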
CN201610843524.6A 2016-09-22 2016-09-22 High-dimension data soft and hard clustering integration method based on random subspace Pending CN106446947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610843524.6A CN106446947A (en) 2016-09-22 2016-09-22 High-dimension data soft and hard clustering integration method based on random subspace

Publications (1)

Publication Number Publication Date
CN106446947A true CN106446947A (en) 2017-02-22

Family

ID=58166005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610843524.6A Pending CN106446947A (en) 2016-09-22 2016-09-22 High-dimension data soft and hard clustering integration method based on random subspace

Country Status (1)

Country Link
CN (1) CN106446947A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984551A (en) * 2017-05-31 2018-12-11 广州智慧城市发展研究院 A kind of recommended method and system based on the multi-class soft cluster of joint
CN109242030A (en) * 2018-09-21 2019-01-18 京东方科技集团股份有限公司 Draw single generation method and device, electronic equipment, computer readable storage medium
CN110929777A (en) * 2019-11-18 2020-03-27 济南大学 Data kernel clustering method based on transfer learning
CN113159155A (en) * 2021-04-15 2021-07-23 华南农业大学 Crime risk early warning mixed attribute data processing method, medium and equipment
CN113159155B (en) * 2021-04-15 2024-01-23 华南农业大学 Mixed attribute data processing method, medium and equipment for crime risk early warning

Similar Documents

Publication Publication Date Title
Huang et al. Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis
Kang et al. A weight-incorporated similarity-based clustering ensemble method based on swarm intelligence
CN106096727A (en) A kind of network model based on machine learning building method and device
CN106446947A (en) High-dimension data soft and hard clustering integration method based on random subspace
CN103488662A (en) Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
CN106845536B (en) Parallel clustering method based on image scaling
CN103208027A (en) Method for genetic algorithm with local modularity for community detecting
CN109669990A (en) A kind of innovatory algorithm carrying out Outliers mining to density irregular data based on DBSCAN
CN105046323B (en) Regularization-based RBF network multi-label classification method
Coelho et al. Multi-objective design of hierarchical consensus functions for clustering ensembles via genetic programming
Shang et al. Multi-objective clustering technique based on k-nodes update policy and similarity matrix for mining communities in social networks
Diallo et al. Auto-attention mechanism for multi-view deep embedding clustering
Li et al. A hybrid coevolutionary algorithm for designing fuzzy classifiers
Bourqui et al. How to draw clustered weighted graphs using a multilevel force-directed graph drawing algorithm
Hao et al. Ensemble clustering with attentional representation
CN105159918A (en) Trust correlation based microblog network community discovery method
Xia et al. GRRS: Accurate and efficient neighborhood rough set for feature selection
CN111814979A (en) Fuzzy set automatic partitioning method based on dynamic programming
Chen et al. An active learning algorithm based on Shannon entropy for constraint-based clustering
Ding et al. Density peaks clustering algorithm based on improved similarity and allocation strategy
Parvin et al. A metric to evaluate a cluster by eliminating effect of complement cluster
Luo et al. A reduced mixed representation based multi-objective evolutionary algorithm for large-scale overlapping community detection
Kong et al. Intelligent Data Analysis and its challenges in big data environment
Du et al. Cluster ensembles via weighted graph regularized nonnegative matrix factorization
Deng et al. Enhanced multiview fuzzy clustering using double visible-hidden view cooperation and network LASSO constraint

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170222)