CN106446947A - High-dimension data soft and hard clustering integration method based on random subspace


Info

Publication number
CN106446947A
CN106446947A (application CN201610843524.6A)
Authority
CN
China
Prior art keywords
cluster
clusters number
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610843524.6A
Other languages
Chinese (zh)
Inventor
余志文
陈洁彦
马帅
韩国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201610843524.6A priority Critical patent/CN106446947A/en
Publication of CN106446947A publication Critical patent/CN106446947A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a soft and hard clustering ensemble method for high-dimensional data based on random subspaces. The method comprises the following steps: (1) inputting a high-dimensional data set; (2) normalizing the data; (3) generating random subspaces; (4) performing kmeans and fuzzy cmeans clustering; (5) generating a fusion matrix; (6) obtaining the optimum number of clusters with cluster validity indices; (7) constructing a decision attribute set; (8) applying improved rough set attribute reduction to obtain a simplified fusion matrix; (9) partitioning with a consensus function; (10) computing the clustering purity. The method uses random subspaces to overcome the difficulty of processing high-dimensional data, combines soft and hard clustering, and makes full use of the original data and of intermediate results to remove redundant attributes from those intermediate results, which improves clustering accuracy while also accelerating clustering. It thereby addresses the inability of the prior art to fully exploit clustering information and to remove redundant information.

Description

High-dimensional data soft and hard clustering ensemble method based on random subspace
Technical field
The present invention relates to the field of machine learning, and more particularly to a soft and hard clustering ensemble method for high-dimensional data based on random subspaces.
Background art
Applying different clustering algorithms to different data sources yields different clustering results. Unifying such results into a single result through a clustering ensemble framework has proven remarkably effective, and has attracted growing attention and research from academia. Clustering ensemble methods have been successfully applied to data mining tasks such as noise data mining, heterogeneous source data mining, data distribution mining, classification data mining, and time series mining, and have found good applications in areas such as bioinformatics, information retrieval, decision making, and image processing. At present, Yu et al. have proposed several clustering ensemble frameworks, such as a hybrid clustering framework based on triple spectral clustering, and knowledge discovery of cancer mechanisms from gene expression data based on clustering ensembles. Carpineto et al. proposed a clustering ensemble framework based on probabilistic indexing and applied it to the field of title search. In many applications, clustering ensemble methods achieve better accuracy, robustness, and stability of the clustering result than single clustering algorithms.
Current clustering ensemble methods fall into two broad classes: hard clustering ensembles and soft clustering ensembles. Hard clustering ensemble methods integrate the results of hard clustering algorithms. There is much existing research on how different consensus functions can yield results of higher robustness and stability, for example consensus functions based on similarity matrices, graph cuts, weight-based partitioning, and association partitioning. There is also work that uses different techniques to produce diverse clustering results, since greater diversity helps the consensus function reach a more effective final result; examples include random resampling, random projection, and random initialization. Some studies incorporate prior knowledge into the ensemble framework; others bring semi-supervised methods into it and adapt different hard clustering algorithms to different data sets, but they do not consider combination with soft clustering. Soft clustering uses fuzzy clustering methods, and many soft clustering ensemble frameworks already exist; for example, Yu et al. proposed a tumor data cluster analysis based on a soft clustering ensemble framework. Other fuzzy theories have also been brought into soft clustering ensemble frameworks, such as fuzzy graph theory, fuzzy similarity relations, and fuzzy consensus functions based on positions and voting mechanisms. Mirzaei et al. proposed a hierarchical clustering ensemble framework based on fuzzy similarity relations. Further research adds rough sets and granular computing to clustering ensemble frameworks. Avogadri et al. designed a fuzzy clustering ensemble framework based on random projection to analyze DNA microarray data. In summary, current frameworks consider how best to add fuzzy clustering to the ensemble framework, but rarely consider introducing soft clustering and hard clustering into the ensemble framework simultaneously.
Current clustering ensembles also have certain limitations. First, most clustering ensemble frameworks lack a good way to handle high-dimensional data sets. Second, traditional clustering ensemble frameworks analyze only hard clustering or only soft clustering, without considering combining the two within one ensemble framework. Third, although some clustering ensemble methods treat clustering results as new attributes for ensemble analysis, they do not consider that this newly constructed attribute set may contain redundant or noisy attributes, and no existing ensemble framework provides a method to eliminate the redundant attributes of these new attribute sets.
Content of the invention
To overcome the shortcomings and deficiencies of the prior art, the present invention provides a soft and hard clustering ensemble method for high-dimensional data based on random subspaces, which solves the three limitations described above. Given an input high-dimensional data set, it ultimately makes fuller use of the information and achieves better clustering accuracy than traditional single clustering algorithms or current ensemble frameworks.
To solve the above technical problems, the present invention provides the following technical scheme. A soft and hard clustering ensemble method for high-dimensional data based on random subspaces comprises the following steps:
S1, inputting a high-dimensional data set and normalizing it;
S2, generating random subspaces from the normalized high-dimensional data set;
S3, clustering each subspace to obtain clustering result matrices;
S4, merging the clustering result matrices to generate a fusion matrix;
S5, deriving the optimum number of clusters from the fusion matrix using cluster validity indices;
S6, constructing a decision attribute set using the fusion matrix and the optimum number of clusters as parameters;
S7, taking the fusion matrix as the condition attribute set and, according to the decision attribute set, applying improved rough set attribute reduction to the fusion matrix to obtain a simplified fusion matrix;
S8, clustering with the simplified fusion matrix and the true number of clusters as parameters to obtain a clustering result matrix, and determining the final clustering result from it;
S9, computing the purity of the final clustering result against the true clustering result.
Further, the normalization in step S1 is specifically as follows:
Obtain the maximum V(d)_max and minimum V(d)_min of the d-th attribute column, and convert each data value of column d by the formula:

\hat{x}_i^d = \frac{x_i^d - V(d)_{min}}{V(d)_{max} - V(d)_{min}}

where x_i^d is the i-th data value of column d, \hat{x}_i^d is the updated value, i ∈ {1, 2, ..., n}, d ∈ {1, 2, ..., D}, n is the number of samples, and D is the sample dimension.
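For illustration, a minimal NumPy sketch of this column-wise min-max normalization (the function and variable names are ours, not the patent's):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every attribute column d by (x - V(d)_min) / (V(d)_max - V(d)_min)."""
    v_min = X.min(axis=0)
    v_max = X.max(axis=0)
    span = np.where(v_max > v_min, v_max - v_min, 1.0)  # guard constant columns
    return (X - v_min) / span

X = np.array([[1.0, 200.0], [3.0, 400.0], [2.0, 300.0]])
print(min_max_normalize(X))  # every column now lies in [0, 1]
```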
Further, step S2 is specifically:
S21, obtaining the sample dimension of the high-dimensional data set and generating the sample dimension of the subspace;
S22, randomly selecting attribute columns of the high-dimensional data set without repetition to construct a subspace, stopping when the subspace sample dimension is reached; the construction of a subspace is specifically:
after obtaining the sample dimension D of the high-dimensional data set, set the subspace interval and use a random function to choose ⌈r × D⌉ distinct positive integers in the interval (0, D], where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1); sort these ⌈r × D⌉ positive integers in ascending order; map the sorted sequence to attribute column numbers of the high-dimensional data set and extract those columns to construct one new subspace;
S23, repeating steps S21-S22 until S subspaces have been produced.
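A minimal sketch of steps S21-S23, assuming a NumPy array X of shape n × D; the names are illustrative only:

```python
import math
import numpy as np

def random_subspaces(X, S, r, seed=None):
    """Draw S random subspaces: each keeps ceil(r * D) distinct attribute
    columns, chosen without repetition and sorted ascending (steps S21-S22)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    d_sub = math.ceil(r * D)  # smallest integer not less than r * D
    return [X[:, np.sort(rng.choice(D, size=d_sub, replace=False))]
            for _ in range(S)]
```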
Further, step S3 is specifically:
S31, performing kmeans clustering on a subspace: randomly choose a positive integer in 2~k_max as the number of clusters k, randomly initialize the cluster centers, and obtain a kmeans clustering result matrix, k_max being a positive integer greater than 2. The steps of the kmeans clustering algorithm are:
a) randomly choose a positive integer in 2~k_max as the number of clusters k;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) compute the distance from every sample to the k cluster centers with the Euclidean distance formula; if the dimension of the data set is D, the distance ρ(A, C) between a sample point A = (a[1], a[2], ..., a[D]) and a center point C = (c[1], c[2], ..., c[D]) is defined as:

\rho(A, C) = \sqrt{\sum_{d=1}^{D} (a[d] - c[d])^2}

The final class of each sample is the class of its nearest cluster center. This yields an n × k matrix W, where n is the number of samples and k the number of cluster centers; the entry w_ih indicates whether x_i belongs to the h-th class c_h: w_ih is 1 if it does and 0 if it does not. That is, the update formula of W is:

w_{ih} = \begin{cases} 1, & h = \arg\min_{h'} \rho(x_i, c_{h'}) \\ 0, & \text{otherwise} \end{cases}

d) compute the mean of all samples in each class as the new cluster center:

c_h = \frac{1}{n_h} \sum_{x_i \in c_h} x_i

where C = {c_1, c_2, ..., c_k}, ρ(x_i, c_h)^2 denotes the squared distance from x_i to the h-th class c_h, and n_h is the number of samples belonging to the h-th class c_h. Compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result.
An objective function is constructed to evaluate the distance between sample points and cluster centers; the cluster centers and sample assignments are updated iteratively on the premise of minimizing this objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached, at which point the clustering result is output.
In theory the objective Ω_1(C*, W*) of the kmeans clustering algorithm is:

\Omega_1(C^*, W^*) = \arg\min_{(C, W)} \varphi_1(C, W)

where C* and W* are the optimal solution minimizing the objective, and the target formula computed at each iteration is:

\varphi_1(C, W) = \sum_{h=1}^{k} \sum_{i=1}^{n} w_{ih} \, \rho(x_i, c_h)^2
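A compact sketch of the kmeans procedure of step S31 (random sample initialization, nearest-center assignment, mean update, center-shift stopping rule); this is a generic implementation under the stated defaults, not the patent's own code:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    """Hard clustering: returns a length-n label vector and the k centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # rho(x_i, c_h): Euclidean distance of every sample to every center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)           # w_ih = 1 for the nearest center
        new_centers = centers.copy()
        for h in range(k):
            members = X[labels == h]
            if len(members):                   # keep old center if a class empties
                new_centers[h] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:                        # change within the constraint range
            break
    return labels, centers
```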
S32, performing fuzzy cmeans clustering with the same subspace and the same number of clusters as in step S31, obtaining a fuzzy cmeans clustering result matrix;
the fuzzy cmeans clustering algorithm is specifically:
a) use the same subspace and the same number of clusters k selected when performing the kmeans clustering;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) assign each sample a membership degree with respect to each cluster center according to the membership function; the membership degree reflects the distance between the sample point and the cluster center, and the membership matrix is computed as:

f_{ij} = \left( \sum_{h=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_h \rVert} \right)^{\frac{2}{\beta - 1}} \right)^{-1}

where F is the set of membership degrees, F = {f_ij}, i ∈ {1, ..., n}, j ∈ {1, ..., k}, n is the number of samples, k is the number of clusters, and β is a real number greater than 1;
d) recompute and determine the new cluster centers according to the membership degrees:

c_j = \frac{\sum_{i=1}^{n} f_{ij}^{\beta} x_i}{\sum_{i=1}^{n} f_{ij}^{\beta}}

where C = {c_1, c_2, ..., c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j. Compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result.
An objective function is constructed to evaluate the distance between sample points and cluster centers; the cluster centers and sample memberships are updated iteratively on the premise of minimizing this objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached, at which point the clustering result is output.
The objective Ω_2(C*, F*) of the fuzzy cmeans clustering algorithm is:

\Omega_2(C^*, F^*) = \arg\min_{(C, F)} \varphi_2(C, F)

where C* and F* are the optimal solution minimizing the objective, with \varphi_2(C, F) = \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^{\beta} \lVert x_i - c_j \rVert^2.
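A matching sketch of the fuzzy cmeans iteration of step S32, using the membership and center formulas above (again a generic implementation, with β = 2 as an assumed default):

```python
import numpy as np

def fuzzy_cmeans(X, k, beta=2.0, max_iter=100, tol=1e-6, seed=None):
    """Soft clustering: returns the n x k membership matrix F and the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    F = np.zeros((len(X), k))
    for _ in range(max_iter):
        dist = np.fmax(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2),
                       1e-12)                  # avoid division by zero
        inv = dist ** (-2.0 / (beta - 1.0))
        F = inv / inv.sum(axis=1, keepdims=True)   # f_ij, each row sums to 1
        Fb = F ** beta
        new_centers = (Fb.T @ X) / Fb.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return F, centers
```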
S33, repeating steps S31 and S32 until all S subspaces have undergone kmeans and fuzzy cmeans clustering, obtaining S kmeans clustering result matrices and S fuzzy cmeans clustering result matrices respectively.
Further, step S4 is specifically:
S41, converting the S kmeans clustering result matrices into binary clustering result matrices of size k × n, where k is the number of clusters and n the number of samples; the S fuzzy cmeans clustering result matrices are not converted, their size also being k × n. The step of converting a kmeans clustering result matrix into a binary clustering matrix is:
first, build the k-order identity matrix H;
then, if the kmeans clustering result matrix is H_k, the binary clustering result matrix is H_b = H(i, :), where H(i, j) is the entry in row i and column j of matrix H, H(i, :) is the i-th row of H, and i ∈ H_k; in particular, the clustering result matrix H_k is a column vector;
S42, merging the S kmeans binary clustering result matrices and the S fuzzy cmeans clustering result matrices to obtain the fusion matrix of size 2K × n, where K is the sum of the cluster numbers of the S subspaces and n is the number of samples. The merging step is:
let the S binary clustering matrices be H_b1, H_b2, ..., H_bs and the S fuzzy cmeans clustering result matrices be F_1, F_2, ..., F_s; extend and merge them row-wise in turn in the form [H_b1, H_b2, ..., H_bs, F_1, F_2, ..., F_s], whose size is 2K × n, K being the sum of the cluster numbers of the S subspaces and n the number of samples.
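A sketch of steps S41-S42, assuming 0-indexed label vectors from kmeans and n × k membership matrices from fuzzy cmeans; the block layout follows the [H_b1, ..., H_bs, F_1, ..., F_s] form above:

```python
import numpy as np

def fusion_matrix(kmeans_labels_list, fcm_membership_list):
    """Stack one k x n binary block per hard result and one k x n fuzzy block
    per soft result into the 2K x n fusion matrix."""
    blocks = []
    for labels in kmeans_labels_list:
        k = labels.max() + 1
        blocks.append(np.eye(k)[labels].T)     # rows of the k-order identity
    for F in fcm_membership_list:
        blocks.append(F.T)                     # memberships used as-is
    return np.vstack(blocks)
```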
Further, step S5 is specifically:
S51, according to the fusion matrix, compute the partition coefficient index for each number of clusters from 2 to k_max, and take the optimum number as k_1, specifically by substituting into the formula:

V_{PC}(k) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^2

where f_ij is the membership of sample point i to cluster center j, k is the number of clusters, and n the number of samples; obtain V_PC(k) in turn for each positive integer k in the interval 2~k_max;
S52, according to the fusion matrix, compute the separation index for each number of clusters from 2 to k_max, and take the optimum number as k_2, specifically by substituting into the formula:

V_{S}(k) = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} f_{ij}^2 \lVert x_i - c_j \rVert^2}{n \cdot \min_{j \neq h} \lVert c_j - c_h \rVert^2}

where f_ij is the membership of sample point i to cluster center j, x_i is the i-th sample, c_h is the h-th cluster center, and n is the number of samples; obtain V_S(k) in turn for each k;
S53, according to the fusion matrix, compute the alternative Dunn index for each number of clusters from 2 to k_max by substituting into the corresponding formula, and take the optimum number as k_3;
S54, sort k_1, k_2 and k_3 and choose the median as the optimum number of clusters k_fix. Specifically:
sort the V_PC(k) values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the maximum as k_1;
sort the V_S(k) values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the minimum as k_2;
sort the alternative Dunn index values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the maximum as k_3;
sort k_1, k_2 and k_3 and choose the median as the optimum number of clusters k_fix.
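The sketch below scores candidate cluster numbers and takes the median vote of steps S51-S54. The partition coefficient follows the formula above; the separation index is written in its common Xie-Beni form, which matches the quantities named in step S52 but is our reading of the original equation; the alternative Dunn index vote is left as a caller-supplied input:

```python
import numpy as np

def partition_coefficient(F):
    """V_PC(k) = (1/n) * sum_ij f_ij^2; larger is better."""
    return (F ** 2).sum() / F.shape[0]

def separation_index(F, X, centers):
    """Membership-weighted within-cluster scatter over the minimal center gap;
    smaller is better."""
    d2 = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
    cd2 = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2) ** 2
    np.fill_diagonal(cd2, np.inf)
    return ((F ** 2) * d2).sum() / (len(X) * cd2.min())

def optimal_k(X, k_max, cluster_fn, k3):
    """k1 maximizes V_PC, k2 minimizes the separation index, k3 is the
    alternative-Dunn vote supplied by the caller; return the median (S54)."""
    runs = {k: cluster_fn(X, k) for k in range(2, k_max + 1)}  # k -> (F, centers)
    k1 = max(runs, key=lambda k: partition_coefficient(runs[k][0]))
    k2 = min(runs, key=lambda k: separation_index(runs[k][0], X, runs[k][1]))
    return sorted([k1, k2, k3])[1]
```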
Further, step S6 is specifically:
S61, using the fusion matrix and the optimum number of clusters k_fix as parameters, perform fuzzy cmeans clustering to obtain the clustering result matrix giving each sample's membership to every class;
S62, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values; the resulting column of class labels serves as the decision attribute set for the next step.
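Step S62 is a simple argmax with random tie-breaking; a minimal sketch (the names are ours):

```python
import numpy as np

def harden(F, seed=None):
    """Turn an n x k membership matrix into the n x 1 decision attribute set:
    each sample takes its maximum-membership class, ties broken at random."""
    rng = np.random.default_rng(seed)
    labels = np.empty(F.shape[0], dtype=int)
    for i, row in enumerate(F):
        best = np.flatnonzero(row == row.max())
        labels[i] = best[0] if len(best) == 1 else rng.choice(best)
    return labels
```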
Further, step S7 is specifically:
S71, taking the fusion matrix as the condition attribute set, compute the dependence degree of the condition attribute set on the decision attribute set, specifically:
S711, find the equivalence relation using the formula:

\Gamma(M) = \{ (x, x') \in X \times X \mid \forall m \in M, \ \psi(x, m) = \psi(x', m) \}

which divides all samples of the high-dimensional data set into finitely many equivalence classes according to the condition attribute set, where M is the condition attribute set and ψ(x, m) is the value of sample x under attribute m;
S712, compute the positive region:

\delta_M(L) = \bigcup_{1 \le i \le n} p(M, L_i), \quad p(M, L_i) = \{ x \in X \mid [x]_M \subseteq L_i \}

where M is the condition attribute set, n is the number of equivalence classes, L_i is the i-th equivalence class of the decision attribute set L, [x]_M = { x' ∈ X | (x, x') ∈ Γ(M) }, and X is the fusion matrix;
S713, compute the dependence degree ζ_M(L) of the condition attribute set on the decision attribute set using the formula:

\zeta_M(L) = \frac{|\delta_M(L)|}{|X|}

where M is the condition attribute set, L is the decision attribute set, X is the fusion matrix, |X| is the cardinality of X, ζ_M(L) is the dependence degree of the condition attribute set on the decision attribute set, and δ_M(L) is the positive region of the condition attribute set M with respect to the decision attribute set L;
S72, compute the core attribute set: traverse each attribute column of the condition attribute set; after removing one attribute, recompute the dependence degree of the remaining condition attribute set on the decision attribute set; if the dependence degree is unchanged, the attribute is removed, otherwise it is kept; the final remaining condition attribute set is the core attribute set;
S73, initialize the attribute set A as the empty set, with a belonging to the difference set of the condition attribute set and the attribute set A;
S74, traverse all attribute columns a: merging an attribute column into A gives the attribute set A'; if the dependence degree on the decision attribute set differs between A' and A, merge a into A, otherwise do not; stop when all a have been traversed;
S75, judge whether the attribute set A equals the core attribute set: if equal, A is the simplified fusion matrix after reduction; if not, remove the redundant attributes to obtain the simplified fusion matrix.
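A sketch of the dependence degree of steps S711-S713, assuming discrete (or pre-discretized) attribute values, since equivalence classes over raw real-valued memberships would be degenerate; how the patent's improved reduction discretizes is not specified here:

```python
from collections import defaultdict

def dependency_degree(cond_rows, decision):
    """cond_rows: one attribute-value tuple per sample; decision: one decision
    label per sample. Returns |positive region| / |X|: the fraction of samples
    whose equivalence class [x]_M lies wholly inside one decision class."""
    groups = defaultdict(list)
    for i, row in enumerate(cond_rows):
        groups[tuple(row)].append(i)           # [x]_M: identical on every attribute
    pos = sum(len(idx) for idx in groups.values()
              if len({decision[i] for i in idx}) == 1)
    return pos / len(decision)
```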
Further, step S8 is specifically:
S81, using the simplified fusion matrix and the true number of clusters as parameters, perform fuzzy cmeans clustering to obtain the clustering result matrix giving each sample's membership to every class;
S82, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values; the resulting column of class labels is the final clustering result.
Further, in step S9 the purity PU is computed as:

PU = \frac{1}{n} \sum_{q=1}^{k_2} \max_{1 \le p \le k_1} |C_q \cap L_p|

where the true clustering result is L = {L_1, L_2, ..., L_{k_1}}, the final clustering result is C = {C_1, C_2, ..., C_{k_2}}, k_1 is the number of clusters of the true result, and k_2 is the number of clusters of the final result.
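A minimal sketch of this purity computation, assuming integer labels starting at 0:

```python
import numpy as np

def purity(pred, truth):
    """Sum, over predicted clusters, the size of the largest overlap with a
    true class, divided by n; 1 means the two partitions agree completely."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = sum(np.bincount(truth[pred == q]).max() for q in np.unique(pred))
    return total / len(truth)

print(purity([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, relabeled
```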
With the above technical scheme, the present invention has at least the following advantages:
(1) the random subspace method enables the framework to handle high-dimensional data sets; it not only yields good diversity across subspaces but, more importantly, also speeds up clustering within each subspace;
(2) the invention fuses soft clustering and hard clustering in one ensemble framework, using their effective combination to improve diversity within the ensemble and uniting their respective advantages;
(3) the invention analyzes the different clustering results as a new attribute set, making full use of the intermediate-result data and making the analyzed information more accurate;
(4) the invention introduces a combination of cluster validity indices, making the prediction of the number of clusters used in the improved rough set more accurate and improving the removal of redundant attributes in the subsequent improved rough set attribute reduction;
(5) the invention removes the redundant attributes of the new attribute set with the improved rough set attribute reduction method, effectively avoiding the loss of accuracy that redundancy brings.
Description of the drawings
Fig. 1 is a flow chart of the steps of the high-dimensional data soft and hard clustering ensemble method based on random subspaces of the present invention;
Fig. 2 is a table comparing the clustering accuracy of the method of the present invention with traditional single clustering algorithms.
Specific embodiment
It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another. The application is described in further detail below with reference to the drawings and specific embodiments.
The steps of the present invention are further described below with reference to Fig. 1.
Step 1, input the high-dimensional data set: input a high-dimensional data set to be clustered, whose row vectors correspond to the sample dimension and whose column vectors correspond to the attribute dimension;
Step 2, data normalization: first obtain the maximum V(d)_max and minimum V(d)_min of the d-th attribute column, then convert the attribute values of column d by:

\hat{x}_i^d = \frac{x_i^d - V(d)_{min}}{V(d)_{max} - V(d)_{min}}

where x_i^d is the i-th data value of column d, \hat{x}_i^d is the updated value, i ∈ {1, 2, ..., n}, d ∈ {1, 2, ..., D}, n is the number of samples, and D is the sample dimension. The sample dimension refers to the number of attributes of a sample; for example, a sample with 3 attributes, say blood type, height, and weight, has sample dimension 3;
Step 3, generate random subspaces: first, obtain the sample dimension D of the high-dimensional data set; the sample dimension of a subspace is then ⌈r × D⌉, where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1);
next, use a random function to choose ⌈r × D⌉ distinct positive integers in the interval (0, D], sort them in ascending order, map the sorted sequence to attribute column numbers of the high-dimensional data set, and read those attribute columns of the high-dimensional data set without repetition to construct a subspace, stopping when the subspace reaches sample dimension ⌈r × D⌉;
according to the above two steps, subspaces are generated in a loop, which stops once S subspaces have been produced;
Step 4, kmeans and fuzzy cmeans clustering:
First, obtain one of the subspaces and perform kmeans clustering on it: randomly choose a positive integer in 2~k_max as the number of clusters k and randomly initialize the cluster centers by selecting k samples from the high-dimensional data set, each representing one class; compute the distance from every sample to the k initial cluster centers with the Euclidean distance formula; the final class of each sample is the class of its nearest cluster center. Compute the mean of all samples in each class as the new cluster center and compare the change from the previous centers with the constraint range; if not within the constraint range, continue iterating: re-partition the samples among the cluster centers by distance and produce k new cluster centers after each partition, until the change of the new cluster centers from the previous ones is within the constraint range or the maximum number of iterations has been reached, then output the final clustering result. The steps of the kmeans clustering algorithm are:
a) randomly choose a positive integer in 2~k_max as the number of clusters k;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) compute the distance from every sample to the k cluster centers with the Euclidean distance formula; if the dimension of the data set is D, the distance ρ(A, C) between a sample point A = (a[1], a[2], ..., a[D]) and a center point C = (c[1], c[2], ..., c[D]) is defined as:

\rho(A, C) = \sqrt{\sum_{d=1}^{D} (a[d] - c[d])^2}

The final class of each sample is the class of its nearest cluster center. This yields an n × k matrix W, where n is the number of samples and k the number of cluster centers; the entry w_ih indicates whether x_i belongs to the h-th class c_h: w_ih is 1 if it does and 0 if it does not. That is, the update formula of W is:

w_{ih} = \begin{cases} 1, & h = \arg\min_{h'} \rho(x_i, c_{h'}) \\ 0, & \text{otherwise} \end{cases}

d) compute the mean of all samples in each class as the new cluster center:

c_h = \frac{1}{n_h} \sum_{x_i \in c_h} x_i

where C = {c_1, c_2, ..., c_k}, ρ(x_i, c_h)^2 denotes the squared distance from x_i to the h-th class c_h, and n_h is the number of samples belonging to the h-th class c_h; compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 iterations by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result;
Next, select the same number of clusters k chosen for the kmeans clustering and the same subspace, randomly select k samples from the high-dimensional data set as cluster centers, each representing one class; assign each sample a membership degree with respect to each cluster center according to the membership function, the membership degree reflecting the distance between the sample point and the cluster center; construct an objective function evaluating the distance between sample points and cluster centers, and iteratively update the cluster centers and sample memberships on the premise of minimizing the objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached; output the clustering result to obtain the fuzzy cmeans clustering result matrix.
The fuzzy cmeans clustering algorithm is specifically:
a) use the same subspace and the same number of clusters k selected when performing the kmeans clustering;
b) randomly select k samples from the high-dimensional data set as cluster centers, each cluster center representing one class;
c) assign each sample a membership degree with respect to each cluster center according to the membership function; the membership degree reflects the distance between the sample point and the cluster center, and the membership matrix is computed as:

f_{ij} = \left( \sum_{h=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_h \rVert} \right)^{\frac{2}{\beta - 1}} \right)^{-1}

where F is the set of membership degrees, F = {f_ij}, i ∈ {1, ..., n}, j ∈ {1, ..., k}, n is the number of samples, k is the number of clusters, and β is a real number greater than 1;
d) recompute and determine the new cluster centers according to the membership degrees:

c_j = \frac{\sum_{i=1}^{n} f_{ij}^{\beta} x_i}{\sum_{i=1}^{n} f_{ij}^{\beta}}

where C = {c_1, c_2, ..., c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j;
compare the change from the previous cluster centers with the constraint range: if within the constraint range, go to step f); otherwise go to step e);
e) if the number of iterations reaches the maximum (100 iterations by default), go to step f); otherwise continue repeating steps c) and d);
f) output the final clustering result;
After completing one subspace, continue with the next subspace and perform the same operations, until all S subspaces have undergone kmeans and fuzzy cmeans clustering, obtaining S kmeans clustering result matrices and S fuzzy cmeans clustering result matrices respectively;
Step 5, generate the fusion matrix:
First, convert the S kmeans clustering result matrices into binary clustering result matrices: according to the cluster number k of the respective subspace, build the k-order identity matrix H;
let the kmeans clustering result matrix be H_k; then the binary clustering result matrix is H_b = H(i, :), where H(i, j) is the entry in row i and column j of matrix H, H(i, :) is the i-th row of H, and i ∈ H_k; in particular, the clustering result matrix H_k is a column vector. This yields the corresponding binary clustering result matrix of size k × n (k being the number of clusters, n the number of samples);
Next, the S fuzzy cmeans clustering result matrices are not converted, their size also being k × n. Then merge the S kmeans binary clustering result matrices and the S fuzzy cmeans clustering result matrices: let the S binary clustering matrices be H_b1, H_b2, ..., H_bs and the S fuzzy cmeans clustering result matrices be F_1, F_2, ..., F_s; extend and merge them row-wise in turn in the form [H_b1, H_b2, ..., H_bs, F_1, F_2, ..., F_s], obtaining the fusion matrix of size 2K × n (K being the sum of the cluster numbers of the S subspaces, n the number of samples);
Step 6, derive the optimum number of clusters with cluster validity indices:
First, according to the fusion matrix, compute the partition coefficient index for each number of clusters from 2 to k_max and take the optimum number as k_1; the partition coefficient index is computed by the formula:

V_{PC}(k) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^2

where f_ij is the membership of sample point i to cluster center j, k is the number of clusters, and n the number of samples; obtain V_PC(k) in turn for each positive integer k in the interval 2~k_max;
Then, according to the fusion matrix, compute the separation index for each number of clusters from 2 to k_max and take the optimum number as k_2; the separation index is computed by the formula:

V_{S}(k) = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} f_{ij}^2 \lVert x_i - c_j \rVert^2}{n \cdot \min_{j \neq h} \lVert c_j - c_h \rVert^2}

where f_ij is the membership of sample point i to cluster center j, x_i is the i-th sample, c_h is the h-th cluster center, k is the number of clusters, and n the number of samples; obtain V_S(k) in turn for each k;
Then, according to the fusion matrix, compute the alternative Dunn index for each number of clusters from 2 to k_max by substituting into the corresponding formula and take the optimum number as k_3, where k is the number of clusters, n the number of samples, x_i the i-th sample, and c_h the h-th cluster center;
Then, sort the V_PC(k) values (k ranging over the positive integers in 2~k_max) and record the k corresponding to the maximum as k_1; sort the V_S(k) values and record the k corresponding to the minimum as k_2; sort the alternative Dunn index values and record the k corresponding to the maximum as k_3; sort k_1, k_2 and k_3 and choose the median as the optimum number of clusters k_fix.
Step 7, construct the decision attribute set:
First, using the fusion matrix and the optimum number of clusters k_fix as parameters, run the fuzzy cmeans clustering algorithm: randomly select k samples from the high-dimensional data set as cluster centers, each representing one class; assign each sample a membership degree with respect to each cluster center according to the membership function, the membership degree reflecting the distance between the sample point and the cluster center; construct an objective function evaluating the distance between sample points and cluster centers; iteratively update the cluster centers and sample memberships on the premise of minimizing the objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached; output the clustering result to obtain the fuzzy cmeans clustering result matrix. This gives each sample's membership probability matrix over the classes, of size n × k_fix, where n is the number of samples and every probability value lies in the interval 0~1;
Next, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values. The resulting column of class labels serves as the decision attribute set for the next step; its size is n × 1, where n is the number of samples and each value is the class the sample belongs to.
Step 8, improved rough set attribute reduction to obtain the simplified fusion matrix:
First, take the fusion matrix as the condition attribute set; the dependence degree of this condition attribute set on the decision attribute set obtained in step 7 is computed by the formula:

\zeta_M(L) = \frac{|\delta_M(L)|}{|X|}

where M is the condition attribute set, L is the decision attribute set, X is the fusion matrix, |X| is the cardinality of X, ζ_M(L) is the dependence degree of the condition attribute set on the decision attribute set, and δ_M(L) is the positive region of the condition attribute set M with respect to the decision attribute set L. The positive region of the condition attribute set in the decision attribute set is computed as:

\delta_M(L) = \bigcup_{1 \le i \le n} p(M, L_i), \quad p(M, L_i) = \{ x \in X \mid [x]_M \subseteq L_i \}

where M is the condition attribute set, n is the number of equivalence classes, L_i is the i-th equivalence class of the decision attribute set L, [x]_M = { x' ∈ X | (x, x') ∈ Γ(M) }, and X is the fusion matrix. The equivalence classes are computed from:

\Gamma(M) = \{ (x, x') \in X \times X \mid \forall m \in M, \ \psi(x, m) = \psi(x', m) \}

where M is the condition attribute set and ψ(x, m) is the value of sample x under attribute m. This gives the dependence degree of the condition attribute set on the decision attribute set;
Next, traverse each attribute column of the condition attribute set: after removing one attribute, compute the dependence degree of the remaining condition attribute set on the decision attribute set by the above procedure; if the dependence degree is unchanged, remove that attribute, otherwise keep it; after all attribute columns have been traversed, the final remaining condition attribute set is the core attribute set;
Then, initialize the attribute set A as the empty set, with a belonging to the difference set of the condition attribute set and A; traverse all attribute columns a: merging an attribute column into the attribute set A gives A'; if A' changes the dependence degree on the decision attribute set compared with the former attribute set A, merge a into A, otherwise do not; stop when all a have been traversed;
Finally, compare the attribute columns of the attribute set A with the core attributes: if the contained attribute columns are equal, A is the simplified fusion matrix after reduction; if unequal, remove the attributes that are redundant between A and the core attributes to obtain the simplified fusion matrix.
Step 9, consensus function partitioning:
First, using the simplified fusion matrix and the true number of clusters k_true as parameters, run the fuzzy cmeans clustering algorithm: randomly select k samples from the high-dimensional data set as cluster centers, each representing one class; assign each sample a membership degree with respect to each cluster center according to the membership function, the membership degree reflecting the distance between the sample point and the cluster center; construct an objective function evaluating the distance between sample points and cluster centers; iteratively update the cluster centers and sample memberships on the premise of minimizing the objective, until the decrease of the objective is within a given incremental range or the maximum number of iterations is reached; output the clustering result to obtain the fuzzy cmeans clustering result matrix. This gives each sample's membership probability matrix over the classes, of size n × k_true, where n is the number of samples and every probability value lies in the interval 0~1. The true number of clusters is the number of clusters of the data set obtainable from prior knowledge at clustering time; for example, the number of clusters of a cancer data set can be determined to be two classes, one class having cancer and the other not having cancer;
Next, according to the clustering result matrix, if a sample's memberships are unequal, determine its final class as the class with the maximum membership; if memberships are equal, randomly choose one class among those with equal membership values. The resulting column of class labels is the final clustering result, of size n × 1, where n is the number of samples and each value is the class the sample belongs to;
Step 10, compute the purity of the clustering result obtained by the method of the invention against the real clustering result, according to the formula:

PU = \frac{1}{n} \sum_{q=1}^{k_2} \max_{1 \le p \le k_1} |C_q \cap L_p|

where the real clustering result is L = {L_1, L_2, ..., L_{k_1}}, the clustering result obtained by the method of the invention is C = {C_1, C_2, ..., C_{k_2}}, k_1 is the number of clusters of the real result, and k_2 is the number of clusters of the method of the invention. The true clustering result refers to labels that already carry label information (for example, having cancer versus not having cancer); labeled information is used so that the results predicted by the ensemble method can be evaluated by how closely they approach the true labels, the criterion of that evaluation being the purity. If the clustering result of the ensemble method is completely consistent with the true clustering result, the purity is 1; if it is completely inconsistent, the purity is 0; if the results partially agree, the purity lies between 0 and 1. Thus the closer to 1, the better the performance of the ensemble method.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various equivalent changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims and their equivalents.

Claims (10)

1. A soft and hard clustering ensemble method for high-dimensional data based on random subspaces, characterized by comprising the following steps:
S1, inputting a high-dimensional data set and normalizing it;
S2, generating random subspaces from the normalized high-dimensional data set;
S3, clustering each subspace to obtain clustering result matrices;
S4, merging the clustering result matrices to generate a fusion matrix;
S5, deriving the optimum number of clusters from the fusion matrix using cluster validity indices;
S6, constructing a decision attribute set using the fusion matrix and the optimum number of clusters as parameters;
S7, taking the fusion matrix as the condition attribute set and, according to the decision attribute set, applying improved rough set attribute reduction to the fusion matrix to obtain a simplified fusion matrix;
S8, clustering with the simplified fusion matrix and the true number of clusters as parameters to obtain a clustering result matrix, and determining the final clustering result from it;
S9, computing the purity of the final clustering result against the true clustering result.
2. The high-dimensional data soft and hard clustering ensemble method based on random subspaces according to claim 1, characterized in that the normalization in step S1 is specifically:
obtain the maximum V(d)_max and minimum V(d)_min of the d-th attribute column, and convert each data value of column d by the formula:

\hat{x}_i^d = \frac{x_i^d - V(d)_{min}}{V(d)_{max} - V(d)_{min}}

where x_i^d is the i-th data value of column d, \hat{x}_i^d is the updated value, i ∈ {1, 2, ..., n}, d ∈ {1, 2, ..., D}, n is the number of samples, and D is the sample dimension.
3. The high-dimensional data soft and hard clustering ensemble method based on random subspaces according to claim 1, characterized in that step S2 is specifically:
S21, obtaining the sample dimension of the high-dimensional data set and generating the sample dimension of the subspace;
S22, randomly selecting attribute columns of the high-dimensional data set without repetition to construct a subspace, stopping when the subspace sample dimension is reached; the construction of a subspace is specifically:
after obtaining the sample dimension D of the high-dimensional data set, set the subspace interval and use a random function to choose ⌈r × D⌉ distinct positive integers in the interval (0, D], where ⌈r × D⌉ denotes the smallest integer not less than r × D and r ∈ (0, 1); sort these ⌈r × D⌉ positive integers in ascending order; map the sorted sequence to attribute column numbers of the high-dimensional data set and extract those columns to construct one new subspace;
S23, repeating steps S21-S22 until S subspaces have been produced.
4. the high dimensional data soft or hard clustering ensemble method based on stochastic subspace according to claim 1, it is characterised in that Step S3 is specially:
S31, subspace is carried out kmeans cluster, randomly choose 2~kmaxIn a positive integer as clusters number k, cluster Center random initializtion, obtains kmeans cluster result matrix, kmaxFor a positive integer more than 2;Wherein
The step of kmeans clustering algorithm is,
A) 2~k is randomly choosedmaxIn a positive integer as clusters number k;
B) concentrate k sample of random selection as cluster centre from high dimensional data, each cluster centre represents a class;
C) all samples are calculated to the distance of k cluster centre using Euclidean distance formula, if the dimension of data set is D, sample The distance between point A=(a [1], a [2] ..., a [D]) and central point C=(c [1], c [2] ..., c [D]) ρ (A, C) is defined as Equation below:
Classification belonging to each sample is final is the classification corresponding to the corresponding nearest cluster centre of the sample;Then, must To the matrix W of a n × k, it is cluster centre number that n is number of samples, k;W in matrixihRepresent:Judge xiWhether is belonged to H class chIf belonging to, wihFor 1, no be not belonging to, then wih, it is 0;
That is the renewal computing formula of W is as follows:
D) meansigma methodss of all samples of each apoplexy due to endogenous wind are calculated, and used as new cluster centre, computing formula is as follows:
C={ c1,c2,……,ch},(xi,ch)2Represent xiTo h-th class chDistance;nhFor belonging to h-th class chSample Number;
The difference for relatively changing with former cluster centre if in restriction range, enters f) step whether in restriction range, if Not in restriction range, then enter e) step;
If e) iterationses reach maximum iteration time, acquiescence maximum iteration time is 100, then to enter f) step, if not reaching Maximum iteration time, then continue to repeat c), d) step;
F) final output cluster result;
Wherein, the distance that an object function evaluates sample point and cluster centre is constructed;Before to the minimization of object function Put the degree of membership that iteration updates cluster centre and sample point;Until object function reduce degree given incremental range it Interior, or reach maximum iteration time and terminate, export cluster result;
The object function Ω of kmeans clustering algorithm in theory1(C*,W*) as follows:
Ω1(C*,W*)=argmin(C,W)φ1(C,W)
Wherein, C*And W*Represented is the optimal solution for minimizing object function, and the target formula for iterating to calculate each time is as follows:
S32, fuzzy cmeans cluster is carried out using step S31 identical subspace and identical clusters number, obtain fuzzy Cmeans cluster result matrix;
Wherein, fuzzy cmeans clustering algorithm is specially:
A) to carry out selected clusters number k during kmeans cluster identical for clusters number and same sub-spaces;
B) concentrate k sample of random selection as cluster centre from high dimensional data, each cluster centre represents a class;
c) each sample is assigned a degree of membership with respect to each cluster center according to the membership function, the membership reflecting the distance between the sample point and the cluster center; the membership matrix is calculated as:

f_{ij} = \left[ \sum_{h=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_h \rVert} \right)^{\frac{2}{\beta - 1}} \right]^{-1}

where F is the set of memberships, F = {f_ij}, i ∈ {1, …, n}, j ∈ {1, …, k}, n is the number of samples, k is the number of clusters, and β is a real number greater than 1;
d) according to the memberships, the new cluster centers are recalculated with the formula:

c_j = \frac{\sum_{i=1}^{n} f_{ij}^{\beta}\, x_i}{\sum_{i=1}^{n} f_{ij}^{\beta}}

where C = {c_1, c_2, …, c_k} and ||x_i − c_j|| denotes the distance from x_i to the j-th class c_j; whether the change relative to the former cluster centers is within the tolerance range is then compared: if it is, step f) is entered; if it is not, step e) is entered;
e) if the number of iterations has reached the maximum number of iterations, 100 by default, step f) is entered; if the maximum number of iterations has not been reached, steps c) and d) are repeated;
f) the final clustering result is output;
in other words, an objective function evaluating the distances between sample points and cluster centers is constructed; under the premise of minimizing this objective function, the cluster centers and the sample-point memberships are updated iteratively; the iteration terminates and the clustering result is output once the decrease of the objective function lies within a given tolerance or the maximum number of iterations is reached;
the objective function Ω_2(C*, F*) of the fuzzy c-means clustering algorithm is:

\Omega_2(C^*, F^*) = \arg\min_{(C, F)} \varphi_2(C, F), \qquad \varphi_2(C, F) = \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^{\beta} \lVert x_i - c_j \rVert^2

where C* and F* denote the optimal solution minimizing the objective function.
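Likewise, a hedged sketch of the fuzzy c-means iteration of step S32, implementing the membership and center update formulas above; β = 2 is an assumed default, since the claim only requires β > 1:

```python
import numpy as np

def fuzzy_cmeans(X, k, beta=2.0, max_iter=100, tol=1e-6, rng=None):
    """Fuzzy c-means as in step S32: membership update from distance ratios,
    membership-weighted center update, same stopping rules as k-means."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(max_iter):
        # pairwise distances; +1e-12 guards against division by zero
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # f_ij = 1 / sum_h (||x_i - c_j|| / ||x_i - c_h||)^(2 / (beta - 1))
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (beta - 1.0))
        F = 1.0 / ratio.sum(axis=2)                          # n x k membership matrix
        Fb = F ** beta
        # c_j = sum_i f_ij^beta x_i / sum_i f_ij^beta
        new_centers = (Fb.T @ X) / Fb.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return F, centers
```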
S33, steps S31 and S32 are repeated until k-means and fuzzy c-means clustering have been performed on all S subspaces, yielding S k-means clustering result matrices and S fuzzy c-means clustering result matrices.
5. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 4, characterized in that step S4 specifically comprises:
S41, the S k-means clustering result matrices are converted into binary clustering result matrices of size k × n, k being the cluster number and n the number of samples; the S fuzzy c-means clustering result matrices, also of size k × n, are left unchanged; wherein the step of converting a k-means clustering result matrix into a binary clustering matrix is:
first, the k × k identity matrix H is built;
then, if the k-means clustering result matrix is H_k, the binary clustering result matrix H_b is formed from the rows H(i, :) for i ∈ H_k, i.e. for each class label i in H_k the i-th row of the identity matrix H is taken, where H(i, j) is the entry in row i, column j of H and H(i, :) is the i-th row of H; in particular, the clustering result matrix H_k is a column vector of class labels;
S42, the S k-means binary clustering result matrices and the S fuzzy c-means clustering result matrices are merged to obtain the fusion matrix, of size 2K × n, where K is the sum of the cluster numbers of the S subspaces and n is the number of samples; wherein the merging step is:
let the S binary clustering matrices be H_b1, H_b2, …, H_bS and the S fuzzy c-means clustering result matrices be F_1, F_2, …, F_S; the S k-means binary clustering result matrices and the S fuzzy c-means clustering result matrices are stacked row-wise in turn, in the form [H_b1, H_b2, …, H_bS, F_1, F_2, …, F_S], giving a matrix of size 2K × n, K being the sum of the cluster numbers of the S subspaces and n the number of samples.
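A sketch of the fusion step of S41–S42, assuming each subspace contributes a label vector from k-means and a k × n membership matrix from fuzzy c-means (the kmeans and fuzzy_cmeans sketches above return n × k matrices, which would be transposed first); the identity-row construction mirrors the binary conversion above:

```python
import numpy as np

def fusion_matrix(kmeans_labels, fcm_memberships):
    """Build the 2K x n fusion matrix of step S4: each k-means label vector is
    expanded to a k x n binary indicator block via rows of the k x k identity,
    and the k x n fuzzy membership matrices are stacked below unchanged."""
    blocks = []
    for labels in kmeans_labels:          # one length-n label vector per subspace
        k = int(labels.max()) + 1
        Hb = np.eye(k)[labels].T          # identity rows -> k x n binary block
        blocks.append(Hb)
    blocks.extend(fcm_memberships)        # k x n membership matrices, as-is
    return np.vstack(blocks)              # 2K x n, K = sum of cluster numbers
```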
6. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S5 specifically comprises:
S51, using the fusion matrix, the partition coefficient index is calculated for each cluster number from 2 to k_max, and the optimal cluster number k_1 is obtained; specifically, substitute into the formula:

V_{PC}(k) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} f_{ij}^2

where f_ij is the membership of sample point i to cluster center j, k is the number of clusters and n the number of samples; V_PC(k) is evaluated in turn for every positive integer k in the interval 2~k_max;
S52, using the fusion matrix, the separation index is calculated for each cluster number from 2 to k_max, and the optimal cluster number k_2 is obtained; specifically, substitute into the formula:

V_{S}(k) = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} f_{ij}^2 \lVert x_i - c_j \rVert^2}{n \cdot \min_{j \neq h} \lVert c_j - c_h \rVert^2}

where x_i is the i-th sample, c_h is the h-th cluster center and n is the number of samples; V_S(k) is evaluated in turn for every positive integer k in the interval 2~k_max;
S53, using the fusion matrix, the alternative Dunn index is calculated for each cluster number from 2 to k_max, and the optimal cluster number k_3 is obtained by substituting into the corresponding index formula, the index being a ratio of between-cluster separation to within-cluster scatter for which larger values indicate a better partition; the index value ADI(k) is evaluated in turn for every positive integer k in the interval 2~k_max;
S54, k_1, k_2 and k_3 are sorted and the median is chosen as the optimal cluster number k_fix; specifically:
the V_PC(k) values (k taking the positive integers in the interval 2~k_max) are sorted, and the k corresponding to the maximum is recorded as k_1;
the V_S(k) values (k taking the positive integers in the interval 2~k_max) are sorted, and the k corresponding to the minimum is recorded as k_2;
the ADI(k) values (k taking the positive integers in the interval 2~k_max) are sorted, and the k corresponding to the maximum is recorded as k_3;
k_1, k_2 and k_3 are sorted, and the median is chosen as the optimal cluster number k_fix.
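A sketch of the index-based selection of step S5, shown for the partition coefficient only; the separation and alternative Dunn indices would follow the same pattern, with the minimum and the maximum taken respectively, and the median of the three winning cluster numbers is returned as in S54. fuzzy_cmeans refers to the sketch given earlier, and fusion_T is assumed to be the fusion matrix transposed to samples × attributes:

```python
import numpy as np

def partition_coefficient(F):
    """V_PC = (1/n) * sum_i sum_j f_ij^2 for an n x k membership matrix F."""
    return (F ** 2).sum() / F.shape[0]

def select_k1(fusion_T, k_max):
    """Step S51: evaluate V_PC for each k in 2..k_max on the fusion matrix
    and keep the k with the largest value."""
    scores = {}
    for k in range(2, k_max + 1):
        F, _ = fuzzy_cmeans(fusion_T, k)   # fuzzy_cmeans from the sketch above
        scores[k] = partition_coefficient(F)
    return max(scores, key=scores.get)

def median_of_three(k1, k2, k3):
    """Step S54: sort the three candidate cluster numbers, take the middle one."""
    return sorted((k1, k2, k3))[1]
```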
7. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S6 specifically comprises:
S61, fuzzy c-means clustering is performed with the fusion matrix and the optimal cluster number k_fix as parameters, yielding a clustering result matrix of the memberships of each sample to the different classes;
S62, according to the clustering result matrix, if the memberships of a sample are unequal, its final class is the class corresponding to its maximum membership; if several memberships are equal, one class is chosen at random among the classes with equal membership values; the resulting column vector of class labels serves as the decision attribute set for the next step.
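A small sketch of the hardening rule of step S62, with random tie-breaking as claimed; F is an n × k membership matrix and the names are hypothetical:

```python
import numpy as np

def harden(F, rng=None):
    """Step S62: assign each sample to the class with the largest membership,
    breaking exact ties uniformly at random."""
    rng = np.random.default_rng(rng)
    labels = np.empty(F.shape[0], dtype=int)
    for i, row in enumerate(F):
        best = np.flatnonzero(row == row.max())   # all classes tied at the maximum
        labels[i] = rng.choice(best)              # random choice among the ties
    return labels
```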
8. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S7 specifically comprises:
S71, the fusion matrix is taken as the conditional attribute set, and the degree of dependence of the conditional attribute set on the decision attribute set is calculated, specifically:
S711, the equivalence relation is found, using the formula:

\Gamma(M) = \{ (x, x') \mid \forall m \in M : \psi(x, m) = \psi(x', m) \}

by which all samples in the high-dimensional data set are partitioned into finitely many equivalence sets according to the conditional attribute set, where M is the conditional attribute set and ψ(x, m) is the value of sample x under attribute m;
S712, the positive region δ_M(L) is calculated:

\delta_M(L) = \bigcup_{1 \le i \le n} p(M, L_i), \qquad p(M, L_i) = \{ x \in X \mid [x]_M \subseteq L_i \}

where M is the conditional attribute set, n is the number of equivalence sets from the previous step, L_i is the i-th equivalence set, [x]_M = { x' ∈ X | (x, x') ∈ Γ(M) }, and X is the fusion matrix;
S713, the degree of dependence ζ_M(L) of the conditional attribute set on the decision attribute set is calculated with the following formula:

\zeta_M(L) = \frac{|\delta_M(L)|}{|X|}

where M is the conditional attribute set, L is the decision attribute set, X is the fusion matrix, |X| denotes the cardinality of X, ζ_M(L) is the degree of dependence of the conditional attribute set on the decision attribute set, and δ_M(L) is the positive region of the conditional attribute set M with respect to the decision attribute set L;
S72, the core attribute set is calculated: if, after a maximal proper subset of the conditional attribute set is removed, the degree of dependence of the remaining conditional attribute set on the decision attribute set is unchanged, then that remaining conditional attribute set is the core attribute set;
S73, the attribute set A is initialized as the empty set, and a denotes an attribute in the difference set between the conditional attribute set and the attribute set A;
S74, all attribute columns a are traversed: each attribute column is tentatively merged into the attribute set A so that it becomes the attribute set A'; whether the degree of dependence on the decision attribute set changes between A' and A is judged, and if it changes, a is merged into A; otherwise it is not merged; the traversal stops once every a has been traversed;
S75, whether the attribute set A equals the core attribute set is judged: if they are equal, A gives the simplified fusion matrix after reduction; if they are not equal, the redundant attributes are removed to obtain the simplified fusion matrix.
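A hedged sketch of the dependence-degree computation underlying S711–S713 and S74, assuming the attribute columns have already been discretized so that exact equivalence classes exist (the claim does not fix a discretization; all names are illustrative):

```python
from collections import defaultdict

def dependence_degree(table, attrs, decision):
    """Degree of dependence of the attribute subset `attrs` on the decision
    labels: |positive region| / |X|, as in S713. `table[i]` holds the i-th
    sample's attribute values and `decision[i]` its decision class."""
    groups = defaultdict(list)                 # equivalence classes under attrs
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)     # psi(x, m) for every m in attrs
        groups[key].append(i)
    positive = 0
    for members in groups.values():
        if len({decision[i] for i in members}) == 1:   # [x]_M within one class,
            positive += len(members)                   # so it joins the positive region
    return positive / len(table)
```

A forward-selection reduct in the spirit of S73–S74 would then add an attribute a to A only when dependence_degree(table, A + [a], decision) differs from dependence_degree(table, A, decision).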
9. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S8 specifically comprises:
S81, fuzzy c-means clustering is performed with the simplified fusion matrix and the true cluster number as parameters, yielding a clustering result matrix of the memberships of each sample to the different classes;
S82, according to the clustering result matrix, if the memberships of a sample are unequal, its final class is the class corresponding to its maximum membership; if several memberships are equal, one class is chosen at random among the classes with equal membership values; the resulting column vector of class labels serves as the final clustering result.
10. The high-dimensional data soft and hard clustering ensemble method based on random subspace according to claim 1, characterized in that step S9 calculates the purity rate PU as:

PU = \frac{1}{n} \sum_{j=1}^{k_2} \max_{1 \le i \le k_1} |C_j \cap T_i|

where the true clustering result is T = {T_1, T_2, …, T_{k_1}}, the final clustering result is C = {C_1, C_2, …, C_{k_2}}, k_1 is the cluster number of the true result and k_2 is the cluster number of the final result.
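A sketch of the purity computation of step S9 under the reconstruction above, assuming non-negative integer labels for both the true and the final clustering:

```python
import numpy as np

def purity(true_labels, pred_labels):
    """PU = (1/n) * sum over final clusters of the size of the best-matching
    true-class overlap, where n is the number of samples."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]   # true classes inside cluster c
        total += np.bincount(members).max()       # best overlap |C_j ∩ T_i|
    return total / len(true_labels)
```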
CN201610843524.6A 2016-09-22 2016-09-22 High-dimension data soft and hard clustering integration method based on random subspace Pending CN106446947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610843524.6A CN106446947A (en) 2016-09-22 2016-09-22 High-dimension data soft and hard clustering integration method based on random subspace

Publications (1)

Publication Number Publication Date
CN106446947A true CN106446947A (en) 2017-02-22

Family

ID=58166005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610843524.6A Pending CN106446947A (en) 2016-09-22 2016-09-22 High-dimension data soft and hard clustering integration method based on random subspace

Country Status (1)

Country Link
CN (1) CN106446947A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984551A (en) * 2017-05-31 2018-12-11 广州智慧城市发展研究院 A kind of recommended method and system based on the multi-class soft cluster of joint
CN109242030A (en) * 2018-09-21 2019-01-18 京东方科技集团股份有限公司 Draw single generation method and device, electronic equipment, computer readable storage medium
CN110929777A (en) * 2019-11-18 2020-03-27 济南大学 Data kernel clustering method based on transfer learning
CN113159155A (en) * 2021-04-15 2021-07-23 华南农业大学 Crime risk early warning mixed attribute data processing method, medium and equipment
CN113159155B (en) * 2021-04-15 2024-01-23 华南农业大学 Mixed attribute data processing method, medium and equipment for crime risk early warning

Similar Documents

Publication Publication Date Title
Huang et al. Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis
Kang et al. A weight-incorporated similarity-based clustering ensemble method based on swarm intelligence
CN106096727A (en) A kind of network model based on machine learning building method and device
CN106446947A (en) High-dimension data soft and hard clustering integration method based on random subspace
CN103488662A (en) Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
CN106845536B (en) Parallel clustering method based on image scaling
CN103208027A (en) Method for genetic algorithm with local modularity for community detecting
CN109669990A (en) A kind of innovatory algorithm carrying out Outliers mining to density irregular data based on DBSCAN
CN105046323B (en) Regularization-based RBF network multi-label classification method
Coelho et al. Multi-objective design of hierarchical consensus functions for clustering ensembles via genetic programming
Shang et al. Multi-objective clustering technique based on k-nodes update policy and similarity matrix for mining communities in social networks
Diallo et al. Auto-attention mechanism for multi-view deep embedding clustering
Li et al. A hybrid coevolutionary algorithm for designing fuzzy classifiers
Bourqui et al. How to draw clustered weighted graphs using a multilevel force-directed graph drawing algorithm
Hao et al. Ensemble clustering with attentional representation
CN105159918A (en) Trust correlation based microblog network community discovery method
Xia et al. GRRS: Accurate and efficient neighborhood rough set for feature selection
CN111814979A (en) Fuzzy set automatic partitioning method based on dynamic programming
Chen et al. An active learning algorithm based on Shannon entropy for constraint-based clustering
Ding et al. Density peaks clustering algorithm based on improved similarity and allocation strategy
Parvin et al. A metric to evaluate a cluster by eliminating effect of complement cluster
Luo et al. A reduced mixed representation based multi-objective evolutionary algorithm for large-scale overlapping community detection
Kong et al. Intelligent Data Analysis and its challenges in big data environment
Du et al. Cluster ensembles via weighted graph regularized nonnegative matrix factorization
Deng et al. Enhanced multiview fuzzy clustering using double visible-hidden view cooperation and network LASSO constraint

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170222)