CN103226595A - Clustering method for high dimensional data based on Bayes mixed common factor analyzer - Google Patents

Clustering method for high dimensional data based on Bayes mixed common factor analyzer

Info

Publication number
CN103226595A
CN103226595A
Authority
CN
China
Prior art keywords
formula
high dimensional
dimensional data
value
bayes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101334151A
Other languages
Chinese (zh)
Other versions
CN103226595B (en)
Inventor
魏昕
李宗辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tian Gu Information Technology Co ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN201310133415.1A priority Critical patent/CN103226595B/en
Publication of CN103226595A publication Critical patent/CN103226595A/en
Application granted granted Critical
Publication of CN103226595B publication Critical patent/CN103226595B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers. The method comprises the following steps: first, a Bayesian mixture of common factor analyzers model is built for the high-dimensional data to be clustered; second, the posterior distributions of the random variables of the model are inferred, and the statistics related to these random variables are obtained; finally, the class to which each high-dimensional datum belongs is obtained by decision, completing the clustering process. The Bayesian mixture of common factor analyzers model built according to the invention is highly flexible; because the inference procedure is based on the Bayesian criterion, overfitting and the curse of dimensionality are effectively avoided; the method automatically adjusts the optimal structure of the model according to the high-dimensional data, so that the optimal number of classes is determined automatically and clustering is completed smoothly while dimensionality is reduced, achieving excellent clustering performance and computational efficiency.

Description

Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers
Technical field
The present invention relates to a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers, and belongs to the technical field of high-dimensional data processing and its applications.
 
Background technology
With the continuous development of data acquisition and storage technology, high-dimensional and ultra-high-dimensional data keep emerging. Examples include the facial images with tens of thousands of dimensions that commonly occur in content-based image retrieval and document retrieval, the feature vectors with hundreds of thousands of dimensions that inevitably appear in web text, speech and audio signal processing, and the high-dimensional gene expression data used for cluster analysis of biological tissues in bioinformatics. Obviously, the higher the dimension (the more attributes an object has), the more comprehensively the object can be described and the better it can be distinguished. However, when the number of data samples is small, an excessively high dimension inevitably poses a severe challenge to data processing; the "curse of dimensionality" is a very stubborn problem. In addition, an excessively high dimension also brings a heavy computational burden, makes the related problems hard to understand and express, and makes visualization impossible. Therefore, how to analyze and process high-dimensional data accurately and efficiently has become a challenging problem in the related technical fields and in practical applications.
For most observed or collected high-dimensional data, the main information lies in a low-dimensional space. How to effectively capture the useful information of high-dimensional data in a low-dimensional space, and thereby design a corresponding dimensionality reduction algorithm, is therefore not only of important academic significance but also of great application value. The mixture of factor analyzers (MFA) is a statistical analysis tool that models the internal dependence among the dimensional components of high-dimensional observed data and thereby performs dimensionality reduction; MFA is widely used in fields such as image and video processing and biological information processing. However, MFA-based methods for processing high-dimensional data, especially for clustering, still have limitations. First, in MFA every mixture component has its own factor loading matrix, so the total number of model parameters is large, and existing MFA performs model inference and parameter estimation under the maximum-likelihood criterion; overfitting therefore occurs easily when the number of high-dimensional data samples is small. Second, and most importantly, in most data clustering applications the number of classes is not known in advance; if it is set too high or too low, the accuracy of the final clustering result suffers, and for high-dimensional data this problem becomes even harder. How to adaptively determine the optimal number of classes from the high-dimensional data while reducing its dimension, and thereby obtain good clustering performance, is a difficult and crucial issue in high-dimensional data clustering techniques and methods. The present invention overcomes these defects of the prior art and proposes a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers.
 
Summary of the invention
The present invention proposes a clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers, comprising the following steps:
(1) Let the high-dimensional data set to be clustered contain N data points, each of dimension p. A Bayesian mixture of common factor analyzers (BMCFA) model is built to represent the distribution of this data set; that is, the BMCFA is a mixture model with a given number of components. Each high-dimensional datum is expressed, with the mixing probability of its component, as the common factor loading matrix acting on a low-dimensional factor plus an error term (formula 1). The factor is associated with the datum and with its mixture component and lies in a lower-dimensional space of dimension q (q < p); q is chosen according to p in the particular problem by traversing all candidate integers in a given range, running the clustering once for each candidate q, and keeping as the final q the value giving the best performance. The factor loading matrix is shared by all components; the error term follows a Gaussian distribution whose covariance is a diagonal matrix; the mixing probabilities are non-negative and sum to one.
(2) According to the high-dimensional data to be processed, the BMCFA model built in step (1) is inferred on the basis of the Bayesian criterion. After this inference process, the posterior expectation of the indicator variable corresponding to each high-dimensional datum is obtained; its entries give, for each component of the mixture model, the probability that the current datum was generated by that component.
(3) Decision: the index of the maximal entry of this posterior expectation is taken as the class to which the datum is finally assigned (formula 2). In this way the clustering result of all data in the high-dimensional data set is obtained.
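For illustration only, the generative structure described in step (1) (a factor loading matrix shared by all components, applied to a component-specific low-dimensional factor, plus diagonal Gaussian noise) can be sketched in a few lines of Python. All names, dimensions and default values below are illustrative assumptions rather than the notation of the original formulas, which are not reproduced here.

```python
import numpy as np

def sample_bmcfa(N=200, p=50, q=5, K=3, seed=0):
    """Minimal generative sketch of a mixture of common factor analyzers:
    every component shares one p x q loading matrix W; each datum is
    W @ x plus diagonal Gaussian noise, where x is drawn from the
    Gaussian of its mixture component."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, q))                     # common factor loading matrix (shared)
    pi = rng.dirichlet(np.ones(K))                      # mixing weights, non-negative, sum to 1
    mu = rng.standard_normal((K, q))                    # component means of the factors
    Sigma = np.stack([np.eye(q) for _ in range(K)])     # component covariances of the factors
    psi = 0.1 * np.ones(p)                              # diagonal error variances
    z = rng.choice(K, size=N, p=pi)                     # component indicators
    X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # low-dimensional factors
    Y = X @ W.T + rng.standard_normal((N, p)) * np.sqrt(psi)             # observed high-dimensional data
    return Y, z, X

Y, z_true, X = sample_bmcfa()
print(Y.shape, X.shape)   # (200, 50) (200, 5)
```

A sample drawn this way has the mixture-of-common-factor-analyzers form described by (formula 1), with every component sharing the same loading matrix.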
In the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers, when the BMCFA model is built in step (1), the conditional likelihoods and prior distributions of the variables are specified as follows:
(1-1) An indicator variable is set up in one-to-one correspondence with each datum in the data set. Each indicator is a vector with as many entries as mixture components, exactly one entry being 1 and the rest 0; when a given entry equals 1 (all other entries being 0), the datum is generated by the corresponding component. The conditional distribution of the indicator set given the mixing weights is then given by (formula 3).
(1-2) A Gaussian distribution with component-specific mean and covariance matrix defines the distribution of the factor; the conditional distribution of the set of factors given the indicators, the component means and the component covariances is given by (formula 4).
(1-3) According to (formula 1), the conditional distribution of the high-dimensional data set given the factors, the indicators and the remaining variables is given by (formula 5).
(1-4) The distribution of the factor loading matrix is set as the product of the distributions of its row vectors; each row vector follows a Gaussian distribution (formula 6) parameterized by a diagonal matrix whose diagonal entries follow Gamma distributions (formula 7) with given hyperparameters.
(1-5) The prior distribution of each component's mean and covariance parameters is set to a joint Gaussian-Wishart distribution (formula 8) with given hyperparameters.
(1-6) The prior distribution of the mixing weights is set to a Dirichlet distribution (formula 9) with a given hyperparameter.
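For illustration only, the hyperparameters introduced in (1-4)-(1-6) (the Gamma hyperparameters of the loading-row precisions in (formula 7), the Gaussian-Wishart hyperparameters in (formula 8), and the Dirichlet hyperparameter in (formula 9)) can be gathered in a small configuration object. The field names and default values in the sketch below are assumptions, not values specified by the original text.

```python
from dataclasses import dataclass

@dataclass
class BMCFAPriors:
    """Hyperparameters of the BMCFA priors in (1-1)-(1-6).
    Field names and default values are illustrative assumptions."""
    K: int = 8                     # number of mixture components
    q: int = 6                     # factor dimension
    p: int = 50                    # observed data dimension
    gamma_a: float = 1e-3          # Gamma shape for the loading-row precisions (formula 7)
    gamma_b: float = 1e-3          # Gamma rate for the loading-row precisions (formula 7)
    gw_beta: float = 1e-3          # Gaussian-Wishart scaling hyperparameter (formula 8)
    gw_dof: float = 7.0            # Wishart degrees of freedom (must be >= q)
    dirichlet_alpha: float = 1e-3  # Dirichlet concentration for the mixing weights (formula 9)

priors = BMCFAPriors(K=8, q=6, p=50)
print(priors)
```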
In the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers, the inference of the BMCFA model in step (2) proceeds as follows:
(2-1) Set the number of mixture components; this value is determined from the number of classes of the high-dimensional data set to be clustered. If the number of classes C is already known before clustering begins, the number of components is set equal to C; if the number of classes is unknown, it is set to any positive integer within a given range.
(2-2) Randomly generate N integers uniformly distributed over the interval of component indices and count the frequency with which each integer occurs on that interval; that is, if a given index is generated a certain number of times, its empirical probability is that count divided by N. For each datum, the initial distribution of the corresponding hidden indicator variable and its expectation are set accordingly (formula 10).
(2-3) Set the values of the hyperparameters of the Gamma, Gaussian-Wishart and Dirichlet priors and of the diagonal error covariance matrix. For all components, the hyperparameters are initialized to small values, a positive number smaller than 0.1 being used for the scalar hyperparameters and the identity matrix for the matrix-valued ones; in the first round of the iterative update of the factor posterior, the related statistics are initialized accordingly. In addition, an initial value of the factor loading matrix is generated: each element of this matrix is drawn from the standard normal distribution, and the statistics related to the loading matrix are initialized from this draw. The counting variable of the number of iterations of the inference process is set to its starting value and the iteration begins.
(2-4) Update the posterior distribution of the factors (formula 11); the update formulas of the corresponding parameters are (formula 12) and (formula 13). In (formula 13), one term involves the individual dimensional components of the datum and the corresponding diagonal entries of the inverse of the diagonal error covariance matrix. The statistics related to the factors are then updated accordingly (formula 14).
(2-5) Update the posterior distribution of the precision parameters of the loading-matrix prior (formula 15); the update formulas of its hyperparameters are (formula 16), in which one term is an element of a vector of accumulated statistics. The statistics related to these precisions are updated accordingly (formula 17).
(2-6) Update the posterior distribution of the factor loading matrix (formula 18); the update formulas of its hyperparameters are (formula 19) and (formula 20). The statistics related to the row vectors of the loading matrix are updated accordingly (formula 21).
(2-7) Update the posterior distribution of the mixing weights (formula 22); the update formula of its hyperparameter is (formula 23). The statistics related to the mixing probabilities are updated accordingly (formula 24), in which the standard digamma function appears.
(2-8) Update the posterior distribution of the component means and covariances (formula 25); the update formulas of its hyperparameters are (formulas 26 to 29). The statistics related to the component means and covariances are then updated accordingly (formulas 30 and 31).
(2-9) Update the posterior distribution of the indicator variables (formula 32), using (formula 33) and (formula 34); in (formula 31) and (formula 34), tr(·) denotes the trace of a matrix. The statistics related to the indicators are updated accordingly (formula 35).
(2-10) Update the diagonal error covariance matrix; its diagonal elements are given by (formula 36).
(2-11) Compute the likelihood value after the current iteration (formula 37).
(2-12) Compute the difference between the likelihood values after the current and the previous iteration. If this difference is below a threshold, the inference process of the BMCFA model ends; otherwise, return to step (2-4), increase the iteration counter by 1 and carry out the next iteration. The threshold takes a value within a small prescribed range. Note that when the first iteration ends, only the likelihood value needs to be computed and the iteration counter increased by 1; no comparison with the threshold is performed, and the next iteration is entered directly.
Beneficial effect:
1. the Bayes who is adopted among the present invention mixes the common factor analyzer and has very strong dirigibility, can regulate the optimum structure of model according to given high dimensional data automatically, thereby determine suitable blending constituent number automatically, promptly, optimum classification number, thereby in dimensionality reduction, finish cluster smoothly, obtained the better cluster performance.
2. the Bayes who is adopted among the present invention reasoning learning process of mixing the common factor analyzer is based on bayesian criterion, has solved in existing model and the learning process thereof the problem based on the over-fitting high dimensional data that maximum-likelihood criterion occurred.
3. the Bayes who is adopted among the present invention mixes that all the components has public factor loading matrix in the common factor analyzer, and the factor has the mixture model structure, compare with traditional MFA, the complexity of structure of models and parameter all reduces greatly, thereby can represent and handle high dimensional data better.
Description of drawings
Fig. 1 is the flow chart of the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers according to the present invention.
Fig. 2 is a comparison diagram of the ERR performance after clustering gene expression data with the BMCFA method according to the present invention and with the MFA and MCUFSA methods.
Fig. 3 is a comparison diagram of the ARI performance after clustering gene expression data with the BMCFA method according to the present invention and with the MFA and MCUFSA methods.
Embodiment
To better explain the clustering method for high-dimensional data based on the Bayesian mixture of common factor analyzers (BMCFA) according to the present invention, it is applied to the clustering of high-dimensional gene expression data in bioinformatics. The data to be clustered are the 248 preprocessed tissue samples provided by Yeoh et al., each sample having dimension 50 (E. J. Yeoh et al., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell, vol. 1, no. 2, pp. 133-143, 2002), i.e. N = 248 and p = 50.
There are 6 classes in this application; the class names and the numbers of samples in each class are: MLL (20 samples), T-ALL (43 samples), Hyperdip (64 samples), TEL-AML1 (79 samples), E2A-PBX1 (27 samples) and BCR-ABL (15 samples). It is assumed that the number and composition of the classes are unknown before clustering; after clustering is finished, the clustering result is compared with the above ground truth to assess the accuracy and validity of the method according to the present invention.
The process of clustering these data with the BMCFA-based clustering method for high-dimensional data is as follows:
Step 1: Build the BMCFA model and use it to represent the distribution of the data set. Specifically, the BMCFA is a mixture model with a given number of components; each high-dimensional datum is expressed, with the mixing probability of its component, as the common factor loading matrix acting on a low-dimensional factor plus an error term (formula 1), where the error term follows a Gaussian distribution whose covariance is a diagonal matrix and the mixing probabilities are non-negative and sum to one. The conditional likelihoods and prior distributions of the variables in (formula 1) are specified as follows:
(1-1) An indicator variable is set up in one-to-one correspondence with each datum. Each indicator is a vector with as many entries as mixture components, exactly one entry being 1 and the rest 0; when a given entry equals 1 (all other entries being 0), the datum is generated by the corresponding component. The conditional distribution of the indicator set given the mixing weights is then given by (formula 2).
(1-2) In (formula 1), the factor is the low-dimensional variable associated with the datum and its mixture component; its dimension q (q < p) is chosen according to p in the particular problem. Here the same data are clustered six times, and in the successive runs q is set to 3, 4, 5, 6, 7 and 8. A Gaussian distribution with component-specific mean and covariance matrix defines the prior of the factor; the conditional distribution of the set of factors given the indicators, the component means and the component covariances is given by (formula 3).
(1-3) According to (formula 1), the conditional distribution of the high-dimensional data set given the factors, the indicators and the remaining variables is given by (formula 4).
(1-4) The factor loading matrix has size p × q; its distribution is set as the product of the distributions of its row vectors, each row vector following a Gaussian distribution (formula 5) parameterized by a diagonal matrix whose diagonal entries follow Gamma distributions (formula 6) with given hyperparameters.
(1-5) The prior distribution of each component's mean and covariance parameters is set to a joint Gaussian-Wishart distribution (formula 7) with given hyperparameters.
(1-6) The prior distribution of the mixing weights is set to a Dirichlet distribution (formula 8) with a given hyperparameter, the Gamma function appearing in its normalizing constant.
 
Step 2: According to the high-dimensional data to be processed, the BMCFA model built in step 1 is inferred on the basis of the Bayesian criterion. The detailed process is as follows:
(2-1) Set the number of mixture components; this value is determined from the number of classes of the data set to be clustered. In this example the number of classes is assumed to be unknown before clustering begins, so the number of components is set here to a chosen positive integer.
(2-2) Randomly generate N integers uniformly distributed over the interval of component indices and count the frequency with which each integer occurs on that interval; the empirical probability of each index is its count divided by N. For each datum, the initial distribution of the corresponding hidden indicator variable and its expectation are set accordingly (formula 9).
(2-3) Set the values of the hyperparameters of the Gamma, Gaussian-Wishart and Dirichlet priors and of the diagonal error covariance matrix. For all components, the hyperparameters are initialized to small values, an arbitrary positive number smaller than 0.1 being used for the scalar hyperparameters and the identity matrix for the matrix-valued ones; in the first round of the iterative update of the factor posterior, the related statistics are initialized accordingly. In addition, an initial value of the factor loading matrix is generated: each element of this matrix is drawn from the standard normal distribution, and the statistics related to the loading matrix are initialized from this draw. The counting variable of the number of iterations of the inference process is set to its starting value and the iteration begins.
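For illustration only, the random initialization of the indicator expectations in step (2-2) can be sketched as below. Initializing every datum's responsibilities with the empirical frequencies of the drawn indices is an assumption, since the exact initial expectation is given by the omitted formula; the value of K and the 0-based indexing are likewise illustrative.

```python
import numpy as np

def init_responsibilities(N, K, seed=0):
    """Sketch of step (2-2): draw N component indices uniformly from a set of
    K indices, estimate the empirical probability of each index, and use these
    probabilities to initialize the indicator (responsibility) expectations."""
    rng = np.random.default_rng(seed)
    draws = rng.integers(low=0, high=K, size=N)   # N uniform indices (0-based here)
    counts = np.bincount(draws, minlength=K)
    probs = counts / N                            # empirical probability of each index
    R = np.tile(probs, (N, 1))                    # N x K initial responsibilities
    return R

R0 = init_responsibilities(N=248, K=8)
print(R0.shape, R0.sum(axis=1)[:3])   # (248, 8), rows sum to 1
```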
(2-4) Update the posterior distribution of the factors (formula 10); the update formulas of the corresponding parameters are (formula 11) and (formula 12). In (formula 12), one term involves the individual dimensional components of the datum and the corresponding diagonal entries of the inverse of the diagonal error covariance matrix. The statistics related to the factors are then updated accordingly (formula 13).
(2-5) Update the posterior distribution of the precision parameters of the loading-matrix prior (formula 14); the update formulas of its hyperparameters are (formula 15), in which one term is an element of a vector of accumulated statistics. The statistics related to these precisions are updated accordingly (formula 16).
(2-6) Update the posterior distribution of the factor loading matrix (formula 17); the update formulas of its hyperparameters are (formula 18) and (formula 19). The statistics related to the row vectors of the loading matrix are updated accordingly (formula 20).
(2-7) Update the posterior distribution of the mixing weights (formula 21); the update formula of its hyperparameter is (formula 22). The statistics related to the mixing probabilities are updated accordingly (formula 23), in which the standard digamma function appears.
(2-8) Update the posterior distribution of the component means and covariances (formula 24); the update formulas of its hyperparameters are (formulas 25 to 28). The statistics related to the component means and covariances are then updated accordingly (formulas 29 and 30).
(2-9) Update the posterior distribution of the indicator variables (formula 31), using (formula 32) and (formula 33); in (formula 30) and (formula 33), tr(·) denotes the trace of a matrix. The statistics related to the indicators are updated accordingly (formula 34).
(2-10) Update the diagonal error covariance matrix; its diagonal elements are given by (formula 35).
(2-11) Compute the likelihood value after the current iteration (formula 36).
(2-12) Compute the difference between the likelihood values after the current and the previous iteration. If this difference is below the threshold, the inference process of the BMCFA model ends; otherwise, return to step (2-4), increase the iteration counter by 1 and carry out the next iteration. The threshold is set here to a fixed small value. Note that when the first iteration ends, only the likelihood value needs to be computed and the iteration counter increased by 1; no comparison with the threshold is performed, and the next iteration is entered directly.
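For illustration only, the stopping rule of steps (2-11)-(2-12), including the exemption of the first iteration, can be sketched as follows; the likelihood values in the example are made-up stand-ins for the omitted likelihood formula.

```python
def should_stop(likelihood_history, tol):
    """Stopping rule of step (2-12): stop when the absolute difference between
    the likelihood values of two consecutive iterations falls below the
    threshold. After the very first iteration there is only one value, so no
    comparison is made and iteration simply continues."""
    if len(likelihood_history) < 2:
        return False
    return abs(likelihood_history[-1] - likelihood_history[-2]) < tol

# illustrative use with made-up likelihood values
history = []
for L_t in [-5200.0, -4900.0, -4870.0, -4869.99999]:
    history.append(L_t)
    if should_stop(history, tol=1e-4):
        break
print(len(history))   # stops at the 4th value, where the change is below 1e-4
```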
Step 3: Decision. For each high-dimensional datum, the index of the maximal entry of its posterior indicator expectation is taken as the class to which the datum is finally assigned (formula 37). In this way the clustering result of the whole high-dimensional data set is obtained.
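For illustration only, with the posterior indicator expectations stored as an N x K array of responsibilities, the decision rule of this step is a single argmax per row, as sketched below (the array layout is an assumption).

```python
import numpy as np

def assign_clusters(R):
    """Step 3 decision rule: for each datum, the cluster label is the index of
    the maximal entry of its posterior indicator expectation (row of the N x K
    responsibility array)."""
    return np.argmax(R, axis=1)

# tiny example with three data points and K = 3 components
R = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.3, 0.4, 0.3]])
print(assign_clusters(R))   # [0 2 1]
```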
Performance evaluation:
The result obtained with the clustering method according to the present invention is compared with the correct class memberships, so that the validity and accuracy of the method can be estimated and measured. Two evaluation indices are adopted: the error rate (ERR), which measures the clustering error rate, and the adjusted Rand index (ARI), which measures the clustering purity. Both ERR and ARI take values between 0 and 1; for ERR, a smaller value indicates better clustering performance, while for ARI, a larger value indicates better clustering performance. Fig. 2 shows the ERR performance after clustering this high-dimensional gene expression data with the BMCFA method according to the present invention and with two other methods, MFA and mixtures of common uncorrelated factors with spherical-error analyzers (MCUFSA). Fig. 3 shows the corresponding ARI performance for BMCFA, MFA and MCUFSA. First, for MFA and MCUFSA a model selection criterion (such as the Bayesian Information Criterion) has to be adopted to determine the optimal number of classes, whereas BMCFA needs no model selection criterion, which greatly reduces the computational cost and running time of the clustering process. If the number of classes determined automatically, or obtained with the model selection criterion, is not equal to 6 after clustering finishes, ERR cannot be computed and the result is listed as "NA" in Fig. 2. Second, it can be seen that when q = 6~8 BMCFA not only obtains the correct number of classes but also has the smallest ERR and the largest ARI among the three methods; the BMCFA-based clustering method therefore achieves the best clustering performance and can process high-dimensional data accurately and effectively.
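For illustration only, both evaluation indices can be computed with standard libraries: adjusted_rand_score in scikit-learn implements ARI, and ERR can be computed after matching predicted clusters to true classes. The Hungarian-algorithm matching used below is an assumption, since the text only states that ERR is reported as "NA" when the number of clusters differs from the number of classes; the labels in the example are made up and are not the Yeoh et al. data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def error_rate(y_true, y_pred):
    """Clustering error rate (ERR): fraction of samples misassigned after the
    predicted clusters are matched to the true classes (Hungarian matching is
    an assumption). Returns None ("NA") when the cluster and class counts differ."""
    if len(np.unique(y_pred)) != len(np.unique(y_true)):
        return None
    cm = confusion_matrix(y_true, y_pred)
    row, col = linear_sum_assignment(-cm)          # matching that maximizes agreement
    return 1.0 - cm[row, col].sum() / cm.sum()

# illustrative labels only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])              # a relabelled but perfect clustering
print(error_rate(y_true, y_pred))                  # 0.0
print(adjusted_rand_score(y_true, y_pred))         # 1.0
```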

Claims (3)

1. A clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers, characterized by comprising the following steps:
(1) Let the high-dimensional data set to be clustered contain N data points, each of dimension p. A Bayesian mixture of common factor analyzers (BMCFA) model is built to represent the distribution of this data set; that is, the BMCFA is a mixture model with a given number of components. Each high-dimensional datum is expressed, with the mixing probability of its component, as the common factor loading matrix acting on a low-dimensional factor plus an error term (formula 1). The factor is associated with the datum and with its mixture component and lies in a lower-dimensional space of dimension q (q < p); q is chosen according to p in the particular problem by traversing all candidate integers in a given range, running the clustering once for each candidate q, and keeping as the final q the value giving the best performance. The factor loading matrix is shared by the components; the error term follows a Gaussian distribution whose covariance is a diagonal matrix; the mixing probabilities are non-negative and sum to one.
(2) According to the high-dimensional data to be processed, the BMCFA model built in step (1) is inferred on the basis of the Bayesian criterion; after this inference process, the posterior expectation of the indicator variable corresponding to each high-dimensional datum is obtained, whose entries give, for each component of the mixture model, the probability that the current datum was generated by that component;
(3) Decision: the index of the maximal entry of this posterior expectation is taken as the class to which the datum is finally assigned (formula 2); in this way the clustering result of all data in the high-dimensional data set is obtained.
2. The clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers according to claim 1, characterized in that, when the Bayesian mixture of common factor analyzers (BMCFA) model is built in step (1), the conditional likelihoods and prior distributions of the variables are specified as follows:
(1-1) An indicator variable is set up in one-to-one correspondence with each datum in the data set; each indicator is a vector with as many entries as mixture components, exactly one entry being 1 and the rest 0; when a given entry equals 1 (all other entries being 0), the datum is generated by the corresponding component; the conditional distribution of the indicator set given the mixing weights is given by (formula 3);
(1-2) A Gaussian distribution with component-specific mean and covariance matrix defines the distribution of the factor; the conditional distribution of the set of factors given the indicators, the component means and the component covariances is given by (formula 4);
(1-3) According to (formula 1), the conditional distribution of the high-dimensional data set given the factors, the indicators and the remaining variables is given by (formula 5);
(1-4) The distribution of the factor loading matrix is set as the product of the distributions of its row vectors; each row vector follows a Gaussian distribution (formula 6) parameterized by a diagonal matrix whose diagonal entries follow Gamma distributions (formula 7) with given hyperparameters;
(1-5) The prior distribution of each component's mean and covariance parameters is set to a joint Gaussian-Wishart distribution (formula 8) with given hyperparameters;
(1-6) The prior distribution of the mixing weights is set to a Dirichlet distribution (formula 9) with a given hyperparameter.
3. The clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers according to claim 1, characterized in that the inference of the Bayesian mixture of common factor analyzers (BMCFA) model in step (2) proceeds as follows:
(2-1) Set the number of mixture components; this value is determined from the number of classes of the high-dimensional data set to be clustered; if the number of classes C is already known before clustering begins, the number of components is set equal to C; if the number of classes is unknown, it is set to any positive integer within a given range;
(2-2) Randomly generate N integers uniformly distributed over the interval of component indices and count the frequency with which each integer occurs on that interval, the empirical probability of each index being its count divided by N; for each datum, the initial distribution of the corresponding hidden indicator variable and its expectation are set accordingly (formula 10);
(2-3) Set the values of the hyperparameters of the Gamma, Gaussian-Wishart and Dirichlet priors and of the diagonal error covariance matrix; for all components, the hyperparameters are initialized to small values, a positive number smaller than 0.1 being used for the scalar hyperparameters and the identity matrix for the matrix-valued ones; in the first round of the iterative update of the factor posterior, the related statistics are initialized accordingly; in addition, an initial value of the factor loading matrix is generated, each element of this matrix being drawn from the standard normal distribution, and the statistics related to the loading matrix are initialized from this draw; the counting variable of the number of iterations of the inference process is set to its starting value and the iteration begins;
(2-4) Update the posterior distribution of the factors (formula 11); the update formulas of the corresponding parameters are (formula 12) and (formula 13), where (formula 13) involves the individual dimensional components of the datum and the corresponding diagonal entries of the inverse of the diagonal error covariance matrix; the statistics related to the factors are updated accordingly (formula 14);
(2-5) Update the posterior distribution of the precision parameters of the loading-matrix prior (formula 15); the update formulas of its hyperparameters are (formula 16), in which one term is an element of a vector of accumulated statistics; the statistics related to these precisions are updated accordingly (formula 17);
(2-6) Update the posterior distribution of the factor loading matrix (formula 18); the update formulas of its hyperparameters are (formula 19) and (formula 20); the statistics related to the row vectors of the loading matrix are updated accordingly (formula 21);
(2-7) Update the posterior distribution of the mixing weights (formula 22); the update formula of its hyperparameter is (formula 23); the statistics related to the mixing probabilities are updated accordingly (formula 24), in which the standard digamma function appears;
(2-8) Update the posterior distribution of the component means and covariances (formula 25); the update formulas of its hyperparameters are (formulas 26 to 29); the statistics related to the component means and covariances are updated accordingly (formulas 30 and 31);
(2-9) Update the posterior distribution of the indicator variables (formula 32), using (formula 33) and (formula 34); in (formula 31) and (formula 34), tr(·) denotes the trace of a matrix; the statistics related to the indicators are updated accordingly (formula 35);
(2-10) Update the diagonal error covariance matrix; its diagonal elements are given by (formula 36);
(2-11) Compute the likelihood value after the current iteration (formula 37);
(2-12) Compute the difference between the likelihood values after the current and the previous iteration; if this difference is below a threshold, the inference process of the BMCFA model ends; otherwise, return to step (2-4), increase the iteration counter by 1 and carry out the next iteration; the threshold takes a value within a small prescribed range; note that when the first iteration ends, only the likelihood value needs to be computed and the iteration counter increased by 1, no comparison with the threshold being performed, and the next iteration is entered directly.
CN201310133415.1A 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers Expired - Fee Related CN103226595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310133415.1A CN103226595B (en) 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310133415.1A CN103226595B (en) 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers

Publications (2)

Publication Number Publication Date
CN103226595A true CN103226595A (en) 2013-07-31
CN103226595B CN103226595B (en) 2016-06-15

Family

ID=48837040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310133415.1A Expired - Fee Related CN103226595B (en) 2013-04-17 2013-04-17 Clustering method for high-dimensional data based on a Bayesian mixture of common factor analyzers

Country Status (1)

Country Link
CN (1) CN103226595B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455842A (en) * 2013-09-04 2013-12-18 福州大学 Credibility measuring method combining Bayesian algorithm and MapReduce
CN104994170A (en) * 2015-07-15 2015-10-21 南京邮电大学 Distributed clustering method based on mixed factor analysis model in sensor network
CN105320727A (en) * 2014-06-16 2016-02-10 三菱电机株式会社 Method for detecting anomalies in real time series
CN106776641A (en) * 2015-11-24 2017-05-31 华为技术有限公司 A kind of data processing method and device
CN107292323A (en) * 2016-03-31 2017-10-24 日本电气株式会社 Method and apparatus for training mixed model
CN109951327A (en) * 2019-03-05 2019-06-28 南京信息职业技术学院 A kind of network failure data synthesis method based on Bayesian mixture models
CN111612101A (en) * 2020-06-04 2020-09-01 华侨大学 Gene expression data clustering method, device and equipment of nonparametric Watton mixed model
CN111612102A (en) * 2020-06-05 2020-09-01 华侨大学 Satellite image data clustering method, device and equipment based on local feature selection
CN114462548B (en) * 2022-02-23 2023-07-18 曲阜师范大学 Method for improving accuracy of single-cell deep clustering algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411610A (en) * 2011-10-12 2012-04-11 浙江大学 Semi-supervised dimensionality reduction method for high dimensional data clustering
US8363961B1 (en) * 2008-10-14 2013-01-29 Adobe Systems Incorporated Clustering techniques for large, high-dimensionality data sets

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8363961B1 (en) * 2008-10-14 2013-01-29 Adobe Systems Incorporated Clustering techniques for large, high-dimensionality data sets
CN102411610A (en) * 2011-10-12 2012-04-11 浙江大学 Semi-supervised dimensionality reduction method for high dimensional data clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
J. Baek: "Mixtures of Factor Analyzers with Common Factor Loadings: Applications to the Clustering and Visualization of High-Dimensional Data", IEEE Transactions on Pattern Analysis and Machine Intelligence *
Xin Wei: "Bayesian mixtures of common factor analyzers: Model, variational inference, and applications", Signal Processing *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455842B (en) * 2013-09-04 2015-06-03 福州大学 Credibility measuring method combining Bayesian algorithm and MapReduce
CN103455842A (en) * 2013-09-04 2013-12-18 福州大学 Credibility measuring method combining Bayesian algorithm and MapReduce
CN105320727A (en) * 2014-06-16 2016-02-10 三菱电机株式会社 Method for detecting anomalies in real time series
CN105320727B (en) * 2014-06-16 2020-03-17 三菱电机株式会社 Method for detecting anomalies in real-time sequences
CN104994170B (en) * 2015-07-15 2018-06-05 南京邮电大学 Distributed clustering method based on hybrid cytokine analysis model in sensor network
CN104994170A (en) * 2015-07-15 2015-10-21 南京邮电大学 Distributed clustering method based on mixed factor analysis model in sensor network
CN106776641A (en) * 2015-11-24 2017-05-31 华为技术有限公司 A kind of data processing method and device
CN106776641B (en) * 2015-11-24 2020-09-08 华为技术有限公司 Data processing method and device
CN107292323A (en) * 2016-03-31 2017-10-24 日本电气株式会社 Method and apparatus for training mixed model
CN107292323B (en) * 2016-03-31 2023-09-19 日本电气株式会社 Method and apparatus for training a hybrid model
CN109951327A (en) * 2019-03-05 2019-06-28 南京信息职业技术学院 A kind of network failure data synthesis method based on Bayesian mixture models
CN111612101A (en) * 2020-06-04 2020-09-01 华侨大学 Gene expression data clustering method, device and equipment of nonparametric Watton mixed model
CN111612101B (en) * 2020-06-04 2023-02-07 华侨大学 Gene expression data clustering method, device and equipment of nonparametric Watson mixed model
CN111612102A (en) * 2020-06-05 2020-09-01 华侨大学 Satellite image data clustering method, device and equipment based on local feature selection
CN111612102B (en) * 2020-06-05 2023-02-07 华侨大学 Satellite image data clustering method, device and equipment based on local feature selection
CN114462548B (en) * 2022-02-23 2023-07-18 曲阜师范大学 Method for improving accuracy of single-cell deep clustering algorithm

Also Published As

Publication number Publication date
CN103226595B (en) 2016-06-15

Similar Documents

Publication Publication Date Title
CN103226595A (en) Clustering method for high dimensional data based on Bayes mixed common factor analyzer
Li et al. A method of two-stage clustering learning based on improved DBSCAN and density peak algorithm
Raman et al. The Bayesian group-lasso for analyzing contingency tables
Seo et al. Root selection in normal mixture models
Gao et al. James–Stein shrinkage to improve k-means cluster analysis
CN107203785A (en) Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm
Reiter et al. Clustering of cell populations in flow cytometry data using a combination of Gaussian mixtures
CN111222847A (en) Open-source community developer recommendation method based on deep learning and unsupervised clustering
CN105447844A (en) New method for characteristic selection of complex multivariable data
Zhang et al. Ascnet: Adaptive-scale convolutional neural networks for multi-scale feature learning
Jin et al. Inter-and intra-uncertainty based feature aggregation model for semi-supervised histopathology image segmentation
CN106951918B (en) Single-particle image clustering method for analysis of cryoelectron microscope
Aslam et al. Vrl-iqa: Visual representation learning for image quality assessment
Wang Mixture of multivariate t nonlinear mixed models for multiple longitudinal data with heterogeneity and missing values
Yin et al. A two-stage variable selection strategy for supersaturated designs with multiple responses
CN111898666A (en) Random forest algorithm and module population combined data variable selection method
Athanasiadis et al. Segmentation of complementary DNA microarray images by wavelet-based Markov random field model
Wang et al. Spline estimator for ultra-high dimensional partially linear varying coefficient models
Wang et al. scBKAP: a clustering model for single-cell RNA-Seq data based on bisecting K-means
CN111046248A (en) Two-class hierarchical graph sampling method based on approximation degree distribution
CN109614587A (en) A kind of intelligence relationship among persons method for analyzing and modeling, terminal device and storage medium
CN106156856A (en) The method and apparatus selected for mixed model
Teimouri Finite mixture of skewed sub-Gaussian stable distributions
Wu Gaussian Process and Functional Data Methods for Mortality Modelling
Qu et al. An integration convolutional neural network for nuclei instance segmentation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130731

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: NANJING University OF POSTS AND TELECOMMUNICATIONS

Contract record no.: 2016320000218

Denomination of invention: Clustering method for high dimensional data based on Bayes mixed common factor analyzer

Granted publication date: 20160615

License type: Common License

Record date: 20161118

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract
EC01 Cancellation of recordation of patent licensing contract

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: NANJING University OF POSTS AND TELECOMMUNICATIONS

Contract record no.: 2016320000218

Date of cancellation: 20180116

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20201204

Address after: No. 20 Beijing Road, Gulou District, Nanjing, Jiangsu Province, 210024

Patentee after: STATE GRID JIANGSU ELECTRIC POWER Co.,Ltd. INFORMATION & TELECOMMUNICATION BRANCH

Address before: Room 214, building D5, No. 9, Kechuang Avenue, Zhongshan Science and Technology Park, Jiangbei new district, Nanjing, Jiangsu Province

Patentee before: Nanjing Tian Gu Information Technology Co.,Ltd.

Effective date of registration: 20201204

Address after: Room 214, building D5, No. 9, Kechuang Avenue, Zhongshan Science and Technology Park, Jiangbei new district, Nanjing, Jiangsu Province

Patentee after: Nanjing Tian Gu Information Technology Co.,Ltd.

Address before: No. 66 Xinmofan Road, Gulou District, Nanjing, Jiangsu, 210003

Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160615