CN103226595A - Clustering method for high dimensional data based on Bayes mixed common factor analyzer - Google Patents

Clustering method for high dimensional data based on Bayes mixed common factor analyzer

Info

Publication number
CN103226595A
CN103226595A
Authority
CN
China
Prior art keywords
formula
high dimensional
dimensional data
value
bayes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101334151A
Other languages
Chinese (zh)
Other versions
CN103226595B (en)
Inventor
魏昕
李宗辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tian Gu Information Technology Co ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN201310133415.1A priority Critical patent/CN103226595B/en
Publication of CN103226595A publication Critical patent/CN103226595A/en
Application granted granted Critical
Publication of CN103226595B publication Critical patent/CN103226595B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers. The method comprises the following steps: first, a Bayesian mixture of common factor analyzers model is built for the high-dimensional data to be clustered; second, the posterior distributions of the random variables of the model are inferred, and the statistics related to these random variables are obtained; finally, the class to which each high-dimensional datum belongs is obtained by decision, completing the clustering process. The Bayesian mixture of common factor analyzers model built according to the invention is highly flexible; because the inference procedure is based on the Bayesian criterion, overfitting and the curse of dimensionality are effectively avoided; the method automatically adjusts the optimal structure of the model according to the high-dimensional data, so that the optimal number of classes is determined automatically and clustering is completed smoothly while dimensionality is reduced, achieving excellent clustering performance and computational efficiency.

Description

Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers
Technical field
The present invention relates to a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers, and belongs to the technical field of high-dimensional data processing and its applications.
 
Background technology
With the continuous development of data acquisition and storage technology, high-dimensional and ultra-high-dimensional data keep emerging. Examples include the facial images with tens of thousands of dimensions that commonly occur in content-based image retrieval and document retrieval, the feature vectors with hundreds of thousands of dimensions that inevitably appear in web text, speech and audio signal processing, and the high-dimensional gene expression data used for cluster analysis of biological tissues in bioinformatics. Obviously, the higher the dimension (the more attributes an object has), the more comprehensively the object can be described and the better it can be distinguished. However, when the number of data samples is small, an excessively high dimension inevitably poses a severe challenge to data processing; the "curse of dimensionality" is a very stubborn problem. In addition, an excessively high dimension also brings a heavy computational burden, makes the related problems hard to understand and express, and makes visualization impossible. Therefore, how to analyze and process high-dimensional data accurately and efficiently has become a challenging problem in the related technical fields and in practical applications.
For most observed or collected high-dimensional data, the main information lies in a low-dimensional space. How to effectively capture the useful information of high-dimensional data in a low-dimensional space, and thereby design a corresponding dimensionality reduction algorithm, is therefore not only of important academic significance but also of great application value. The mixture of factor analyzers (MFA) is a statistical analysis tool that models the internal dependence among the dimensional components of high-dimensional observed data and thereby performs dimensionality reduction; MFA is widely used in fields such as image and video processing and biological information processing. However, MFA-based methods for processing high-dimensional data, especially for clustering, still have limitations. First, in MFA every mixture component has its own factor loading matrix, so the total number of model parameters is large, and existing MFA performs model inference and parameter estimation under the maximum-likelihood criterion; overfitting therefore occurs easily when the number of high-dimensional data samples is small. Second, and most importantly, in most data clustering applications the number of classes is not known in advance; if it is set too high or too low, the accuracy of the final clustering result suffers, and for high-dimensional data this problem becomes even harder. How to adaptively determine the optimal number of classes from the high-dimensional data while reducing its dimension, and thereby obtain good clustering performance, is a difficult and crucial issue in high-dimensional data clustering techniques and methods. The present invention overcomes these defects of the prior art and proposes a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers.
 
Summary of the invention
The present invention proposes a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers, comprising the following steps:
(1) Let the high-dimensional data set to be clustered contain N data points, each of dimension p. A Bayesian mixture of common factor analyzers (BMCFA) model is built to represent the distribution of this data set; that is, the BMCFA is a mixture model with a given number of components. Each high-dimensional datum is expressed, with the mixing probability of its component, as the common factor loading matrix acting on a low-dimensional factor plus an error term (formula 1). The factor is associated with the datum and with its mixture component and lies in a lower-dimensional space of dimension q (q < p); q is chosen according to p in the particular problem by traversing all candidate integers in a given range, running the clustering once for each candidate q, and keeping as the final q the value giving the best performance. The factor loading matrix is shared by all components; the error term follows a Gaussian distribution whose covariance is a diagonal matrix; the mixing probabilities are non-negative and sum to one.
(2) According to the high-dimensional data to be processed, the BMCFA model built in step (1) is inferred on the basis of the Bayesian criterion. After this inference process, the posterior expectation of the indicator variable corresponding to each high-dimensional datum is obtained; its entries give, for each component of the mixture model, the probability that the current datum was generated by that component.
(3) Decision: the index of the maximal entry of this posterior expectation is taken as the class to which the datum is finally assigned (formula 2). In this way the clustering result of all data in the high-dimensional data set is obtained.
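For illustration only, the generative structure described in step (1) (a factor loading matrix shared by all components, applied to a component-specific low-dimensional factor, plus diagonal Gaussian noise) can be sketched in a few lines of Python. All names, dimensions and default values below are illustrative assumptions rather than the notation of the original formulas, which are not reproduced here.

```python
import numpy as np

def sample_bmcfa(N=200, p=50, q=5, K=3, seed=0):
    """Minimal generative sketch of a mixture of common factor analyzers:
    every component shares one p x q loading matrix W; each datum is
    W @ x plus diagonal Gaussian noise, where x is drawn from the
    Gaussian of its mixture component."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, q))                     # common factor loading matrix (shared)
    pi = rng.dirichlet(np.ones(K))                      # mixing weights, non-negative, sum to 1
    mu = rng.standard_normal((K, q))                    # component means of the factors
    Sigma = np.stack([np.eye(q) for _ in range(K)])     # component covariances of the factors
    psi = 0.1 * np.ones(p)                              # diagonal error variances
    z = rng.choice(K, size=N, p=pi)                     # component indicators
    X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # low-dimensional factors
    Y = X @ W.T + rng.standard_normal((N, p)) * np.sqrt(psi)             # observed high-dimensional data
    return Y, z, X

Y, z_true, X = sample_bmcfa()
print(Y.shape, X.shape)   # (200, 50) (200, 5)
```

A sample drawn this way has the mixture-of-common-factor-analyzers form described by (formula 1), with every component sharing the same loading matrix.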
In the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers, when the BMCFA model is built in step (1), the conditional likelihoods and prior distributions of the variables are specified as follows:
(1-1) An indicator variable is set up in one-to-one correspondence with each datum in the data set. Each indicator is a vector with as many entries as mixture components, exactly one entry being 1 and the rest 0; when a given entry equals 1 (all other entries being 0), the datum is generated by the corresponding component. The conditional distribution of the indicator set given the mixing weights is then given by (formula 3).
(1-2) A Gaussian distribution with component-specific mean and covariance matrix defines the distribution of the factor; the conditional distribution of the set of factors given the indicators, the component means and the component covariances is given by (formula 4).
(1-3) According to (formula 1), the conditional distribution of the high-dimensional data set given the factors, the indicators and the remaining variables is given by (formula 5).
(1-4) The distribution of the factor loading matrix is set as the product of the distributions of its row vectors; each row vector follows a Gaussian distribution (formula 6) parameterized by a diagonal matrix whose diagonal entries follow Gamma distributions (formula 7) with given hyperparameters.
(1-5) The prior distribution of each component's mean and covariance parameters is set to a joint Gaussian-Wishart distribution (formula 8) with given hyperparameters.
(1-6) The prior distribution of the mixing weights is set to a Dirichlet distribution (formula 9) with a given hyperparameter.
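For illustration only, the hyperparameters introduced in (1-4)-(1-6) (the Gamma hyperparameters of the loading-row precisions in (formula 7), the Gaussian-Wishart hyperparameters in (formula 8), and the Dirichlet hyperparameter in (formula 9)) can be gathered in a small configuration object. The field names and default values in the sketch below are assumptions, not values specified by the original text.

```python
from dataclasses import dataclass

@dataclass
class BMCFAPriors:
    """Hyperparameters of the BMCFA priors in (1-1)-(1-6).
    Field names and default values are illustrative assumptions."""
    K: int = 8                     # number of mixture components
    q: int = 6                     # factor dimension
    p: int = 50                    # observed data dimension
    gamma_a: float = 1e-3          # Gamma shape for the loading-row precisions (formula 7)
    gamma_b: float = 1e-3          # Gamma rate for the loading-row precisions (formula 7)
    gw_beta: float = 1e-3          # Gaussian-Wishart scaling hyperparameter (formula 8)
    gw_dof: float = 7.0            # Wishart degrees of freedom (must be >= q)
    dirichlet_alpha: float = 1e-3  # Dirichlet concentration for the mixing weights (formula 9)

priors = BMCFAPriors(K=8, q=6, p=50)
print(priors)
```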
In the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers, the inference of the BMCFA model in step (2) proceeds as follows:
(2-1) Set the number of mixture components; this value is determined from the number of classes of the high-dimensional data set to be clustered. If the number of classes C is already known before clustering begins, the number of components is set equal to C; if the number of classes is unknown, it is set to any positive integer within a given range.
(2-2) Randomly generate N integers uniformly distributed over the interval of component indices and count the frequency with which each integer occurs on that interval; that is, if a given index is generated a certain number of times, its empirical probability is that count divided by N. For each datum, the initial distribution of the corresponding hidden indicator variable and its expectation are set accordingly (formula 10).
(2-3) Set the values of the hyperparameters of the Gamma, Gaussian-Wishart and Dirichlet priors and of the diagonal error covariance matrix. For all components, the hyperparameters are initialized to small values, a positive number smaller than 0.1 being used for the scalar hyperparameters and the identity matrix for the matrix-valued ones; in the first round of the iterative update of the factor posterior, the related statistics are initialized accordingly. In addition, an initial value of the factor loading matrix is generated: each element of this matrix is drawn from the standard normal distribution, and the statistics related to the loading matrix are initialized from this draw. The counting variable of the number of iterations of the inference process is set to its starting value and the iteration begins.
(2-4) Update the posterior distribution of the factors (formula 11); the update formulas of the corresponding parameters are (formula 12) and (formula 13). In (formula 13), one term involves the individual dimensional components of the datum and the corresponding diagonal entries of the inverse of the diagonal error covariance matrix. The statistics related to the factors are then updated accordingly (formula 14).
(2-5) Update the posterior distribution of the precision parameters of the loading-matrix prior (formula 15); the update formulas of its hyperparameters are (formula 16), in which one term is an element of a vector of accumulated statistics. The statistics related to these precisions are updated accordingly (formula 17).
(2-6) Update the posterior distribution of the factor loading matrix (formula 18); the update formulas of its hyperparameters are (formula 19) and (formula 20). The statistics related to the row vectors of the loading matrix are updated accordingly (formula 21).
(2-7) Update the posterior distribution of the mixing weights (formula 22); the update formula of its hyperparameter is (formula 23). The statistics related to the mixing probabilities are updated accordingly (formula 24), in which the standard digamma function appears.
(2-8) Update the posterior distribution of the component means and covariances (formula 25); the update formulas of its hyperparameters are (formulas 26 to 29). The statistics related to the component means and covariances are then updated accordingly (formulas 30 and 31).
(2-9) Update the posterior distribution of the indicator variables (formula 32), using (formula 33) and (formula 34); in (formula 31) and (formula 34), tr(·) denotes the trace of a matrix. The statistics related to the indicators are updated accordingly (formula 35).
(2-10) Update the diagonal error covariance matrix; its diagonal elements are given by (formula 36).
(2-11) Compute the likelihood value after the current iteration (formula 37).
(2-12) Compute the difference between the likelihood values after the current and the previous iteration. If this difference is below a threshold, the inference process of the BMCFA model ends; otherwise, return to step (2-4), increase the iteration counter by 1 and carry out the next iteration. The threshold takes a value within a small prescribed range. Note that when the first iteration ends, only the likelihood value needs to be computed and the iteration counter increased by 1; no comparison with the threshold is performed, and the next iteration is entered directly.
Beneficial effect:
1. the Bayes who is adopted among the present invention mixes the common factor analyzer and has very strong dirigibility, can regulate the optimum structure of model according to given high dimensional data automatically, thereby determine suitable blending constituent number automatically, promptly, optimum classification number, thereby in dimensionality reduction, finish cluster smoothly, obtained the better cluster performance.
2. the Bayes who is adopted among the present invention reasoning learning process of mixing the common factor analyzer is based on bayesian criterion, has solved in existing model and the learning process thereof the problem based on the over-fitting high dimensional data that maximum-likelihood criterion occurred.
3. the Bayes who is adopted among the present invention mixes that all the components has public factor loading matrix in the common factor analyzer, and the factor has the mixture model structure, compare with traditional MFA, the complexity of structure of models and parameter all reduces greatly, thereby can represent and handle high dimensional data better.
Description of drawings
Fig. 1 is the flow chart of the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers according to the present invention.
Fig. 2 is a comparison diagram of the ERR performance after clustering gene expression data with the BMCFA method according to the present invention and with the MFA and MCUFSA methods.
Fig. 3 is a comparison diagram of the ARI performance after clustering gene expression data with the BMCFA method according to the present invention and with the MFA and MCUFSA methods.
Embodiment
To better explain the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers (BMCFA) according to the present invention, it is applied to the clustering of high-dimensional gene expression data in bioinformatics. The data to be clustered are the 248 preprocessed tissue samples provided by Yeoh et al., each sample having dimension 50 (E. J. Yeoh et al., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell, vol. 1, no. 2, pp. 133-143, 2002), i.e. N = 248 and p = 50.
There are 6 classes in this application; the class names and the numbers of samples in each class are: MLL (20 samples), T-ALL (43 samples), Hyperdip (64 samples), TEL-AML1 (79 samples), E2A-PBX1 (27 samples) and BCR-ABL (15 samples). It is assumed that the number and composition of the classes are unknown before clustering; after clustering is finished, the clustering result is compared with the above ground truth to assess the accuracy and validity of the method according to the present invention.
The process of clustering these data with the BMCFA-based clustering method for high-dimensional data is as follows:
Step 1: Build the BMCFA model and use it to represent the distribution of the data set. Specifically, the BMCFA is a mixture model with a given number of components; each high-dimensional datum is expressed, with the mixing probability of its component, as the common factor loading matrix acting on a low-dimensional factor plus an error term (formula 1), where the error term follows a Gaussian distribution whose covariance is a diagonal matrix and the mixing probabilities are non-negative and sum to one. The conditional likelihoods and prior distributions of the variables in (formula 1) are specified as follows:
(1-1) An indicator variable is set up in one-to-one correspondence with each datum. Each indicator is a vector with as many entries as mixture components, exactly one entry being 1 and the rest 0; when a given entry equals 1 (all other entries being 0), the datum is generated by the corresponding component. The conditional distribution of the indicator set given the mixing weights is then given by (formula 2).
(1-2) In (formula 1), the factor is the low-dimensional variable associated with the datum and its mixture component; its dimension q (q < p) is chosen according to p in the particular problem. Here the same data are clustered six times, and in the successive runs q is set to 3, 4, 5, 6, 7 and 8. A Gaussian distribution with component-specific mean and covariance matrix defines the prior of the factor; the conditional distribution of the set of factors given the indicators, the component means and the component covariances is given by (formula 3).
(1-3) According to (formula 1), the conditional distribution of the high-dimensional data set given the factors, the indicators and the remaining variables is given by (formula 4).
(1-4) The factor loading matrix has size p × q; its distribution is set as the product of the distributions of its row vectors, each row vector following a Gaussian distribution (formula 5) parameterized by a diagonal matrix whose diagonal entries follow Gamma distributions (formula 6) with given hyperparameters.
(1-5) The prior distribution of each component's mean and covariance parameters is set to a joint Gaussian-Wishart distribution (formula 7) with given hyperparameters.
(1-6) The prior distribution of the mixing weights is set to a Dirichlet distribution (formula 8) with a given hyperparameter, the Gamma function appearing in its normalizing constant.
 
Step 2: According to the high-dimensional data to be processed, the BMCFA model built in step 1 is inferred on the basis of the Bayesian criterion. The detailed process is as follows:
(2-1) Set the number of mixture components; this value is determined from the number of classes of the data set to be clustered. In this example the number of classes is assumed to be unknown before clustering begins, so the number of components is set here to a chosen positive integer.
(2-2) Randomly generate N integers uniformly distributed over the interval of component indices and count the frequency with which each integer occurs on that interval; the empirical probability of each index is its count divided by N. For each datum, the initial distribution of the corresponding hidden indicator variable and its expectation are set accordingly (formula 9).
(2-3) Set the values of the hyperparameters of the Gamma, Gaussian-Wishart and Dirichlet priors and of the diagonal error covariance matrix. For all components, the hyperparameters are initialized to small values, an arbitrary positive number smaller than 0.1 being used for the scalar hyperparameters and the identity matrix for the matrix-valued ones; in the first round of the iterative update of the factor posterior, the related statistics are initialized accordingly. In addition, an initial value of the factor loading matrix is generated: each element of this matrix is drawn from the standard normal distribution, and the statistics related to the loading matrix are initialized from this draw. The counting variable of the number of iterations of the inference process is set to its starting value and the iteration begins.
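For illustration only, the random initialization of the indicator expectations in step (2-2) can be sketched as below. Initializing every datum's responsibilities with the empirical frequencies of the drawn indices is an assumption, since the exact initial expectation is given by the omitted formula; the value of K and the 0-based indexing are likewise illustrative.

```python
import numpy as np

def init_responsibilities(N, K, seed=0):
    """Sketch of step (2-2): draw N component indices uniformly from a set of
    K indices, estimate the empirical probability of each index, and use these
    probabilities to initialize the indicator (responsibility) expectations."""
    rng = np.random.default_rng(seed)
    draws = rng.integers(low=0, high=K, size=N)   # N uniform indices (0-based here)
    counts = np.bincount(draws, minlength=K)
    probs = counts / N                            # empirical probability of each index
    R = np.tile(probs, (N, 1))                    # N x K initial responsibilities
    return R

R0 = init_responsibilities(N=248, K=8)
print(R0.shape, R0.sum(axis=1)[:3])   # (248, 8), rows sum to 1
```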
(2-4) Update the posterior distribution of the factors (formula 10); the update formulas of the corresponding parameters are (formula 11) and (formula 12). In (formula 12), one term involves the individual dimensional components of the datum and the corresponding diagonal entries of the inverse of the diagonal error covariance matrix. The statistics related to the factors are then updated accordingly (formula 13).
(2-5) Update the posterior distribution of the precision parameters of the loading-matrix prior (formula 14); the update formulas of its hyperparameters are (formula 15), in which one term is an element of a vector of accumulated statistics. The statistics related to these precisions are updated accordingly (formula 16).
(2-6) Update the posterior distribution of the factor loading matrix (formula 17); the update formulas of its hyperparameters are (formula 18) and (formula 19). The statistics related to the row vectors of the loading matrix are updated accordingly (formula 20).
(2-7) Update the posterior distribution of the mixing weights (formula 21); the update formula of its hyperparameter is (formula 22). The statistics related to the mixing probabilities are updated accordingly (formula 23), in which the standard digamma function appears.
(2-8) Update the posterior distribution of the component means and covariances (formula 24); the update formulas of its hyperparameters are (formulas 25 to 28). The statistics related to the component means and covariances are then updated accordingly (formulas 29 and 30).
(2-9) Update the posterior distribution of the indicator variables (formula 31), using (formula 32) and (formula 33); in (formula 30) and (formula 33), tr(·) denotes the trace of a matrix. The statistics related to the indicators are updated accordingly (formula 34).
(2-10) Update the diagonal error covariance matrix; its diagonal elements are given by (formula 35).
(2-11) Compute the likelihood value after the current iteration (formula 36).
(2-12) Compute the difference between the likelihood values after the current and the previous iteration. If this difference is below the threshold, the inference process of the BMCFA model ends; otherwise, return to step (2-4), increase the iteration counter by 1 and carry out the next iteration. The threshold is set here to a fixed small value. Note that when the first iteration ends, only the likelihood value needs to be computed and the iteration counter increased by 1; no comparison with the threshold is performed, and the next iteration is entered directly.
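For illustration only, the stopping rule of steps (2-11)-(2-12), including the exemption of the first iteration, can be sketched as follows; the likelihood values in the example are made-up stand-ins for the omitted likelihood formula.

```python
def should_stop(likelihood_history, tol):
    """Stopping rule of step (2-12): stop when the absolute difference between
    the likelihood values of two consecutive iterations falls below the
    threshold. After the very first iteration there is only one value, so no
    comparison is made and iteration simply continues."""
    if len(likelihood_history) < 2:
        return False
    return abs(likelihood_history[-1] - likelihood_history[-2]) < tol

# illustrative use with made-up likelihood values
history = []
for L_t in [-5200.0, -4900.0, -4870.0, -4869.99999]:
    history.append(L_t)
    if should_stop(history, tol=1e-4):
        break
print(len(history))   # stops at the 4th value, where the change is below 1e-4
```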
Step 3: Decision. For each high-dimensional datum, the index of the maximal entry of its posterior indicator expectation is taken as the class to which the datum is finally assigned (formula 37). In this way the clustering result of the whole high-dimensional data set is obtained.
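For illustration only, with the posterior indicator expectations stored as an N x K array of responsibilities, the decision rule of this step is a single argmax per row, as sketched below (the array layout is an assumption).

```python
import numpy as np

def assign_clusters(R):
    """Step 3 decision rule: for each datum, the cluster label is the index of
    the maximal entry of its posterior indicator expectation (row of the N x K
    responsibility array)."""
    return np.argmax(R, axis=1)

# tiny example with three data points and K = 3 components
R = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.3, 0.4, 0.3]])
print(assign_clusters(R))   # [0 2 1]
```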
Performance evaluation:
The result obtained with the clustering method according to the present invention is compared with the correct class memberships, so that the validity and accuracy of the method can be estimated and measured. Two evaluation indices are adopted: the error rate (ERR), which measures the clustering error rate, and the adjusted Rand index (ARI), which measures the clustering purity. Both ERR and ARI take values between 0 and 1; for ERR, a smaller value indicates better clustering performance, while for ARI, a larger value indicates better clustering performance. Fig. 2 shows the ERR performance after clustering this high-dimensional gene expression data with the BMCFA method according to the present invention and with two other methods, MFA and mixtures of common uncorrelated factors with spherical-error analyzers (MCUFSA). Fig. 3 shows the corresponding ARI performance for BMCFA, MFA and MCUFSA. First, for MFA and MCUFSA a model selection criterion (such as the Bayesian Information Criterion) has to be adopted to determine the optimal number of classes, whereas BMCFA needs no model selection criterion, which greatly reduces the computational cost and running time of the clustering process. If the number of classes determined automatically, or obtained with the model selection criterion, is not equal to 6 after clustering finishes, ERR cannot be computed and the result is listed as "NA" in Fig. 2. Second, it can be seen that when q = 6~8 BMCFA not only obtains the correct number of classes but also has the smallest ERR and the largest ARI among the three methods; the BMCFA-based clustering method therefore achieves the best clustering performance and can process high-dimensional data accurately and effectively.
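For illustration only, both evaluation indices can be computed with standard libraries: adjusted_rand_score in scikit-learn implements ARI, and ERR can be computed after matching predicted clusters to true classes. The Hungarian-algorithm matching used below is an assumption, since the text only states that ERR is reported as "NA" when the number of clusters differs from the number of classes; the labels in the example are made up and are not the Yeoh et al. data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def error_rate(y_true, y_pred):
    """Clustering error rate (ERR): fraction of samples misassigned after the
    predicted clusters are matched to the true classes (Hungarian matching is
    an assumption). Returns None ("NA") when the cluster and class counts differ."""
    if len(np.unique(y_pred)) != len(np.unique(y_true)):
        return None
    cm = confusion_matrix(y_true, y_pred)
    row, col = linear_sum_assignment(-cm)          # matching that maximizes agreement
    return 1.0 - cm[row, col].sum() / cm.sum()

# illustrative labels only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])              # a relabelled but perfect clustering
print(error_rate(y_true, y_pred))                  # 0.0
print(adjusted_rand_score(y_true, y_pred))         # 1.0
```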

Claims (3)

1. A clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers, characterized by comprising the following steps:
(1) Let the high-dimensional data set to be clustered contain N data points, each of dimension p. A Bayesian mixture of common factor analyzers (BMCFA) model is built to represent the distribution of this data set; that is, the BMCFA is a mixture model with a given number of components. Each high-dimensional datum is expressed, with the mixing probability of its component, as the common factor loading matrix acting on a low-dimensional factor plus an error term (formula 1). The factor is associated with the datum and with its mixture component and lies in a lower-dimensional space of dimension q (q < p); q is chosen according to p in the particular problem by traversing all candidate integers in a given range, running the clustering once for each candidate q, and keeping as the final q the value giving the best performance. The factor loading matrix is shared by the components; the error term follows a Gaussian distribution whose covariance is a diagonal matrix; the mixing probabilities are non-negative and sum to one.
(2) According to the high-dimensional data to be processed, the BMCFA model built in step (1) is inferred on the basis of the Bayesian criterion; after this inference process, the posterior expectation of the indicator variable corresponding to each high-dimensional datum is obtained, whose entries give, for each component of the mixture model, the probability that the current datum was generated by that component;
(3) Decision: the index of the maximal entry of this posterior expectation is taken as the class to which the datum is finally assigned (formula 2); in this way the clustering result of all data in the high-dimensional data set is obtained.
2. The clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers according to claim 1, characterized in that, when the Bayesian mixture of common factor analyzers (BMCFA) model is built in step (1), the conditional likelihoods and prior distributions of the variables are specified as follows:
(1-1) An indicator variable is set up in one-to-one correspondence with each datum in the data set; each indicator is a vector with as many entries as mixture components, exactly one entry being 1 and the rest 0; when a given entry equals 1 (all other entries being 0), the datum is generated by the corresponding component; the conditional distribution of the indicator set given the mixing weights is given by (formula 3);
(1-2) A Gaussian distribution with component-specific mean and covariance matrix defines the distribution of the factor; the conditional distribution of the set of factors given the indicators, the component means and the component covariances is given by (formula 4);
(1-3) According to (formula 1), the conditional distribution of the high-dimensional data set given the factors, the indicators and the remaining variables is given by (formula 5);
(1-4) The distribution of the factor loading matrix is set as the product of the distributions of its row vectors; each row vector follows a Gaussian distribution (formula 6) parameterized by a diagonal matrix whose diagonal entries follow Gamma distributions (formula 7) with given hyperparameters;
(1-5) The prior distribution of each component's mean and covariance parameters is set to a joint Gaussian-Wishart distribution (formula 8) with given hyperparameters;
(1-6) The prior distribution of the mixing weights is set to a Dirichlet distribution (formula 9) with a given hyperparameter.
3. The clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers according to claim 1, characterized in that the inference of the Bayesian mixture of common factor analyzers (BMCFA) model in step (2) proceeds as follows:
(2-1) Set the number of mixture components; this value is determined from the number of classes of the high-dimensional data set to be clustered; if the number of classes C is already known before clustering begins, the number of components is set equal to C; if the number of classes is unknown, it is set to any positive integer within a given range;
(2-2) Randomly generate N integers uniformly distributed over the interval of component indices and count the frequency with which each integer occurs on that interval, the empirical probability of each index being its count divided by N; for each datum, the initial distribution of the corresponding hidden indicator variable and its expectation are set accordingly (formula 10);
(2-3) Set the values of the hyperparameters of the Gamma, Gaussian-Wishart and Dirichlet priors and of the diagonal error covariance matrix; for all components, the hyperparameters are initialized to small values, a positive number smaller than 0.1 being used for the scalar hyperparameters and the identity matrix for the matrix-valued ones; in the first round of the iterative update of the factor posterior, the related statistics are initialized accordingly; in addition, an initial value of the factor loading matrix is generated, each element of this matrix being drawn from the standard normal distribution, and the statistics related to the loading matrix are initialized from this draw; the counting variable of the number of iterations of the inference process is set to its starting value and the iteration begins;
(2-4) Update the posterior distribution of the factors (formula 11); the update formulas of the corresponding parameters are (formula 12) and (formula 13), where (formula 13) involves the individual dimensional components of the datum and the corresponding diagonal entries of the inverse of the diagonal error covariance matrix; the statistics related to the factors are updated accordingly (formula 14);
(2-5) Update the posterior distribution of the precision parameters of the loading-matrix prior (formula 15); the update formulas of its hyperparameters are (formula 16), in which one term is an element of a vector of accumulated statistics; the statistics related to these precisions are updated accordingly (formula 17);
(2-6) Update the posterior distribution of the factor loading matrix (formula 18); the update formulas of its hyperparameters are (formula 19) and (formula 20); the statistics related to the row vectors of the loading matrix are updated accordingly (formula 21);
(2-7) Update the posterior distribution of the mixing weights (formula 22); the update formula of its hyperparameter is (formula 23); the statistics related to the mixing probabilities are updated accordingly (formula 24), in which the standard digamma function appears;
(2-8) Update the posterior distribution of the component means and covariances (formula 25); the update formulas of its hyperparameters are (formulas 26 to 29); the statistics related to the component means and covariances are updated accordingly (formulas 30 and 31);
(2-9) Update the posterior distribution of the indicator variables (formula 32), using (formula 33) and (formula 34); in (formula 31) and (formula 34), tr(·) denotes the trace of a matrix; the statistics related to the indicators are updated accordingly (formula 35);
(2-10) Update the diagonal error covariance matrix; its diagonal elements are given by (formula 36);
(2-11) Compute the likelihood value after the current iteration (formula 37);
(2-12) Compute the difference between the likelihood values after the current and the previous iteration; if this difference is below a threshold, the inference process of the BMCFA model ends; otherwise, return to step (2-4), increase the iteration counter by 1 and carry out the next iteration; the threshold takes a value within a small prescribed range; note that when the first iteration ends, only the likelihood value needs to be computed and the iteration counter increased by 1, no comparison with the threshold being performed, and the next iteration is entered directly.
CN201310133415.1A 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers Expired - Fee Related CN103226595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310133415.1A CN103226595B (en) 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310133415.1A CN103226595B (en) 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers

Publications (2)

Publication Number Publication Date
CN103226595A true CN103226595A (en) 2013-07-31
CN103226595B CN103226595B (en) 2016-06-15

Family

ID=48837040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310133415.1A Expired - Fee Related CN103226595B (en) 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers

Country Status (1)

Country Link
CN (1) CN103226595B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455842A (en) * 2013-09-04 2013-12-18 福州大学 Credibility measuring method combining Bayesian algorithm and MapReduce
CN104994170A (en) * 2015-07-15 2015-10-21 南京邮电大学 Distributed clustering method based on mixed factor analysis model in sensor network
CN105320727A (en) * 2014-06-16 2016-02-10 三菱电机株式会社 Method for detecting anomalies in real time series
CN106776641A (en) * 2015-11-24 2017-05-31 华为技术有限公司 A kind of data processing method and device
CN107292323A (en) * 2016-03-31 2017-10-24 日本电气株式会社 Method and apparatus for training mixed model
CN109951327A (en) * 2019-03-05 2019-06-28 南京信息职业技术学院 A kind of network failure data synthesis method based on Bayesian mixture models
CN111612101A (en) * 2020-06-04 2020-09-01 华侨大学 Gene expression data clustering method, device and equipment of nonparametric Watton mixed model
CN111612102A (en) * 2020-06-05 2020-09-01 华侨大学 Satellite image data clustering method, device and equipment based on local feature selection
CN114462548B (en) * 2022-02-23 2023-07-18 曲阜师范大学 Method for improving accuracy of single-cell deep clustering algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411610A (en) * 2011-10-12 2012-04-11 浙江大学 Semi-supervised dimensionality reduction method for high dimensional data clustering
US8363961B1 (en) * 2008-10-14 2013-01-29 Adobe Systems Incorporated Clustering techniques for large, high-dimensionality data sets

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8363961B1 (en) * 2008-10-14 2013-01-29 Adobe Systems Incorporated Clustering techniques for large, high-dimensionality data sets
CN102411610A (en) * 2011-10-12 2012-04-11 浙江大学 Semi-supervised dimensionality reduction method for high dimensional data clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
J. Baek: "Mixtures of Factor Analyzers with Common Factor Loadings: Applications to the Clustering and Visualization of High-Dimensional Data", IEEE Transactions on Pattern Analysis and Machine Intelligence *
Xin Wei: "Bayesian mixtures of common factor analyzers: Model, variational inference, and applications", Signal Processing *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455842B (en) * 2013-09-04 2015-06-03 福州大学 Credibility measuring method combining Bayesian algorithm and MapReduce
CN103455842A (en) * 2013-09-04 2013-12-18 福州大学 Credibility measuring method combining Bayesian algorithm and MapReduce
CN105320727A (en) * 2014-06-16 2016-02-10 三菱电机株式会社 Method for detecting anomalies in real time series
CN105320727B (en) * 2014-06-16 2020-03-17 三菱电机株式会社 Method for detecting anomalies in real-time sequences
CN104994170B (en) * 2015-07-15 2018-06-05 南京邮电大学 Distributed clustering method based on hybrid cytokine analysis model in sensor network
CN104994170A (en) * 2015-07-15 2015-10-21 南京邮电大学 Distributed clustering method based on mixed factor analysis model in sensor network
CN106776641A (en) * 2015-11-24 2017-05-31 华为技术有限公司 A kind of data processing method and device
CN106776641B (en) * 2015-11-24 2020-09-08 华为技术有限公司 Data processing method and device
CN107292323A (en) * 2016-03-31 2017-10-24 日本电气株式会社 Method and apparatus for training mixed model
CN107292323B (en) * 2016-03-31 2023-09-19 日本电气株式会社 Method and apparatus for training a hybrid model
CN109951327A (en) * 2019-03-05 2019-06-28 南京信息职业技术学院 A kind of network failure data synthesis method based on Bayesian mixture models
CN111612101A (en) * 2020-06-04 2020-09-01 华侨大学 Gene expression data clustering method, device and equipment of nonparametric Watton mixed model
CN111612101B (en) * 2020-06-04 2023-02-07 华侨大学 Gene expression data clustering method, device and equipment of nonparametric Watson mixed model
CN111612102A (en) * 2020-06-05 2020-09-01 华侨大学 Satellite image data clustering method, device and equipment based on local feature selection
CN111612102B (en) * 2020-06-05 2023-02-07 华侨大学 Satellite image data clustering method, device and equipment based on local feature selection
CN114462548B (en) * 2022-02-23 2023-07-18 曲阜师范大学 Method for improving accuracy of single-cell deep clustering algorithm

Also Published As

Publication number Publication date
CN103226595B (en) 2016-06-15

Similar Documents

Publication Publication Date Title
CN103226595A (en) Clustering method for high dimensional data based on Bayes mixed common factor analyzer
Li et al. A method of two-stage clustering learning based on improved DBSCAN and density peak algorithm
Raman et al. The Bayesian group-lasso for analyzing contingency tables
Seo et al. Root selection in normal mixture models
Gao et al. James–Stein shrinkage to improve k-means cluster analysis
CN107203785A (en) Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm
Reiter et al. Clustering of cell populations in flow cytometry data using a combination of Gaussian mixtures
CN111222847A (en) Open-source community developer recommendation method based on deep learning and unsupervised clustering
CN105447844A (en) New method for characteristic selection of complex multivariable data
Zhang et al. Ascnet: Adaptive-scale convolutional neural networks for multi-scale feature learning
Jin et al. Inter-and intra-uncertainty based feature aggregation model for semi-supervised histopathology image segmentation
CN106951918B (en) Single-particle image clustering method for analysis of cryoelectron microscope
Aslam et al. Vrl-iqa: Visual representation learning for image quality assessment
Wang Mixture of multivariate t nonlinear mixed models for multiple longitudinal data with heterogeneity and missing values
Yin et al. A two-stage variable selection strategy for supersaturated designs with multiple responses
CN111898666A (en) Random forest algorithm and module population combined data variable selection method
Athanasiadis et al. Segmentation of complementary DNA microarray images by wavelet-based Markov random field model
Wang et al. Spline estimator for ultra-high dimensional partially linear varying coefficient models
Wang et al. scBKAP: a clustering model for single-cell RNA-Seq data based on bisecting K-means
CN111046248A (en) Two-class hierarchical graph sampling method based on approximation degree distribution
CN109614587A (en) A kind of intelligence relationship among persons method for analyzing and modeling, terminal device and storage medium
CN106156856A (en) The method and apparatus selected for mixed model
Teimouri Finite mixture of skewed sub-Gaussian stable distributions
Wu Gaussian Process and Functional Data Methods for Mortality Modelling
Qu et al. An integration convolutional neural network for nuclei instance segmentation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130731

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: NANJING University OF POSTS AND TELECOMMUNICATIONS

Contract record no.: 2016320000218

Denomination of invention: Clustering method for high dimensional data based on Bayes mixed common factor analyzer

Granted publication date: 20160615

License type: Common License

Record date: 20161118

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract
EC01 Cancellation of recordation of patent licensing contract

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: NANJING University OF POSTS AND TELECOMMUNICATIONS

Contract record no.: 2016320000218

Date of cancellation: 20180116

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20201204

Address after: No. 20 Beijing Road, Gulou District, Nanjing, Jiangsu Province, 210024

Patentee after: STATE GRID JIANGSU ELECTRIC POWER Co.,Ltd. INFORMATION & TELECOMMUNICATION BRANCH

Address before: Room 214, building D5, No. 9, Kechuang Avenue, Zhongshan Science and Technology Park, Jiangbei new district, Nanjing, Jiangsu Province

Patentee before: Nanjing Tian Gu Information Technology Co.,Ltd.

Effective date of registration: 20201204

Address after: Room 214, building D5, No. 9, Kechuang Avenue, Zhongshan Science and Technology Park, Jiangbei new district, Nanjing, Jiangsu Province

Patentee after: Nanjing Tian Gu Information Technology Co.,Ltd.

Address before: No. 66 Xinmofan Road, Gulou District, Nanjing, Jiangsu, 210003

Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160615