WO2010064414A1 - Gene clustering program, gene clustering method, and gene cluster analyzing device - Google Patents
- Publication number
- WO2010064414A1 (PCT/JP2009/006521)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gene
- matrix
- clustering
- algorithm
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Definitions
- the present invention relates to a gene clustering program, a gene clustering method, and a gene cluster analysis apparatus. More specifically, the present invention relates to a gene clustering program or the like that classifies each gene into a predetermined cluster based on the similarity of the expression level change with time indicated by each gene.
- the intracellular signal network is constructed by a multi-layered network structure that changes dynamically.
- a “bow-tie signal network” has been proposed as one basic network structure constituting an intracellular signal network (Non-patent Documents 1 and 2).
- The bow-tie signal network (hereinafter simply referred to as the “bow-tie network”) is a network structure shaped like a bow tie, in which a core molecule that functions as a “classifier” regulating the cellular response to stimuli is assumed to exist at the knot of the bow tie. That is, in a bow-tie network, a wide variety of inputs from intracellular and intercellular signaling are aggregated into a core molecule located at the knot. When the intracellular concentration of the core molecule changes according to the input, a predetermined gene group (cluster) located downstream of the signal is activated according to that concentration, and a specific output appears.
- In addition to signal transduction between immune cells, bow-tie networks have also been reported in metabolic signaling (Non-patent Document 1), Toll-like receptor signaling (Non-patent Document 2), and epidermal growth factor signaling (Non-patent Document 3). It is becoming clear that the bow-tie network is an excellent network structure that is robust yet flexible enough to evolve (Non-patent Documents 4 and 5).
- genes located downstream of the signal are clustered as a single gene group (cluster) based on a predetermined core molecule concentration.
- To analyze a bow-tie network, the cluster to which each gene belongs must be identified from measurement data of changes over time in gene expression, localization, and activity. Good geometric tools are therefore needed to grasp the overall network configuration and predict the relationships within it.
- Tools based on the K-means method (Non-patent Document 6), hierarchical clustering (Non-patent Document 7), and self-organizing maps (Non-patent Document 8) have been developed.
- each of these tools that perform computation processing in only one process has its own drawbacks.
- individual data elements are hierarchized by overlapping clusters, so that only a phylogenetic tree with no flexibility can be created.
- Genes are clustered based on one-to-one similarity, so the genes that end up in one cluster may have no biological relevance to one another.
- Non-Patent Document 9 is the most useful tool available at present, obtained by horizontally integrating these conventional tools. However, it cannot be said that it has sufficient ability to accurately cluster each gene from the data of gene expression over time and elucidate the bowtie network.
- the main object of the present invention is to provide a gene clustering tool capable of performing gene clustering based on gene expression data over time with high accuracy without performing a priori data prediction.
- The present invention provides a gene clustering program that performs at least: (1) a step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time; (2) a step of calculating, from the calculated feature amount, the eigenvectors of a similarity matrix M for all combinations of genes; (3) a step of converting the similarity matrix M into a Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and (4) a step of clustering the data based on the Boolean matrix N.
- the feature amount is calculated from the data by linear regression analysis or wavelet transform.
- the eigenvector is calculated from the feature amount by a kernel method or cosine similarity.
- the similarity matrix M is converted to a Boolean matrix N by an FSNN (Filter by Symmetric Nearest Neighbors) algorithm.
- the matrix is standardized by one of graph Laplacian, Markov chain, DSA (Doubly-Stochastic Approximation) algorithm, or DSS (Doubly-Stochastic Scaling) algorithm.
- This gene clustering program performs soft clustering in the step (4) using an EM (Expectation Maximization) algorithm and a CP (Complete Positive factorization) algorithm.
- the present invention provides a recording medium on which the gene clustering program is recorded so as to be readable by a computer.
- The present invention also provides a gene clustering method that performs at least: (1) a step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time; (2) a step of calculating, from the calculated feature amount, the eigenvectors of the similarity matrix M for all combinations of genes; (3) a step of converting the similarity matrix M into the Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and (4) a step of clustering the data based on the Boolean matrix N.
- The present invention further provides a gene cluster analysis device comprising at least: (1) means for calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time; (2) means for calculating, from the calculated feature amount, the eigenvectors of the similarity matrix M for all combinations of genes; (3) means for converting the similarity matrix M into the Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and (4) means for clustering the data based on the Boolean matrix N.
- The present invention provides a gene clustering tool capable of performing gene clustering based on gene expression data over time with high accuracy, without performing a priori data prediction.
- The gene clustering program includes: (1) a step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time; (2) a step of calculating the eigenvectors of the similarity matrix M from the calculated feature amount for all combinations of genes; (3) a step of converting the similarity matrix M into a Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and (4) a step of clustering the data based on the Boolean matrix N.
- these steps will be described in order.
- This step corresponds to step (1) above, the “step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time” (see S1 in FIG. 1).
- FIG. 2 shows an example of data showing changes in gene expression over time.
- the data obtained by measuring the expression levels of three genes a, b, and c at four time points 1, 2, 3, and 4 are shown.
- Linear regression analysis is a simple method for comparing fluctuation curves indicating changes in expression level.
- The wavelet transform can aggregate all the time-dependent information of the fluctuation curve. Therefore, with the wavelet transform, even a gene for which expression data was obtained at only one time point, which conventional analysis methods would exclude as incomplete measurement data, can be included in the analysis.
- Fig. 3 shows a conceptual diagram of data processing by wavelet transform.
- gene expression change data over time (in this case, data that changes over time as 9, 7, 3, and 5) is processed as a histogram instead of a fluctuation curve.
- This data is expressed as averages [9, 7, 3, 5] in 4 dimensions, as averages [8, 4] and coefficients [1, -1] in 2 dimensions, and as average [6] and coefficient [2] in 1 dimension (see FIG. 3B). Therefore, this data is processed as [6 (basis), 2, 1, -1 (coefficients)] by the one-dimensional wavelet transform (see FIG. 3C).
- In the wavelet transform, processing the gene expression change data over time as a histogram in this way allows optimal fitting with significantly fewer coefficients than processing it as a fluctuation curve.
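The worked example above can be reproduced with a few lines of code. The following is a minimal sketch, assuming an unnormalized Haar-style decomposition (pairwise averages and half-differences) and an input whose length is a power of two; the function name `haar_transform` is hypothetical and not taken from the application.

```python
def haar_transform(values):
    """Unnormalized Haar decomposition (assumed form, for illustration).

    Returns [overall average, coarsest coefficient, ..., finest coefficients].
    The input length is assumed to be a power of two.
    """
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        level = [(a - b) / 2 for a, b in pairs]      # half-differences at this level
        averages = [(a + b) / 2 for a, b in pairs]   # pairwise averages
        details = level + details                    # coarser levels go in front
    return averages + details


print(haar_transform([9, 7, 3, 5]))  # [6.0, 2.0, 1.0, -1.0]
```

The output matches the vector [6, 2, 1, -1] given in the text: the first entry is the basis (overall average) and the remaining entries are the coefficients from coarse to fine.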
- FIG. 4 shows a conceptual diagram for making a histogram of data showing changes in gene expression over time as shown in FIG.
- the change in the expression level indicated by a solid line or a dotted line can be converted into a histogram as shown in FIG.
- FIG. 5 is a conceptual diagram showing changes in the data dimension before and after this step.
- Next, for all combinations of genes, the eigenvectors of the similarity matrix M (a positive semi-definite matrix) are calculated from the calculated feature amounts using the kernel (heat kernel) method or cosine similarity.
- the similarity matrix M is simply referred to as “matrix M”.
- This step corresponds to step (2) of the gene clustering program according to the present invention, the “step of calculating eigenvectors of similarity matrix M from calculated feature quantities for all gene combinations” (see S2 in FIG. 1).
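As an illustration of how the similarity matrix M might be built from per-gene feature vectors, here is a minimal sketch. The cosine similarity is the standard definition; the heat kernel is written in the common form exp(-||u - v||^2 / sigma), which is an assumption and not necessarily the exact form used in this application. All function names and the `sigma` parameter are hypothetical. The eigen-decomposition itself would normally be delegated to a linear algebra library and is not shown.

```python
import math


def cosine_similarity(u, v):
    """Standard cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def heat_kernel(u, v, sigma=1.0):
    """Assumed heat-kernel form exp(-||u - v||^2 / sigma); sigma is hypothetical."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / sigma)


def similarity_matrix(features, sim=cosine_similarity):
    """Entry (i, j) holds the similarity between gene i and gene j."""
    n = len(features)
    return [[sim(features[i], features[j]) for j in range(n)] for i in range(n)]


# Three genes described by wavelet-style feature vectors (illustrative values).
features = [[6, 2, 1, -1], [6, 2, 1, -1], [-6, -2, -1, 1]]
M = similarity_matrix(features)
```

With these inputs, genes 0 and 1 have identical profiles and gene 2 has the mirrored profile, so the cosine entries come out close to 1 and -1 respectively.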
- Standardization
- (3-1) Conversion by FSNN
- This matrix M is processed by the FSNN (Filter by Symmetric Nearest Neighbors) algorithm, which is similar to the LLE (Local Linear Embedding) algorithm. This yields a Boolean similarity matrix that is easier to process than the result of the LLE algorithm.
- For the FSNN algorithm and the LLE algorithm, see “A simple locally adaptive nearest neighbor rule with application to pollution forecasting. International Journal on Pattern Recognition and Artificial Intelligence, 17: 1-14, 2003” and “Nonlinear dimensionality reduction by locally linear embedding. Science, 290: 2323-2326, 2000”, respectively.
- This step corresponds to step (3) of the gene clustering program according to the present invention, the “step of converting the similarity matrix M into the Boolean matrix N while maintaining the eigenvalues of the eigenvectors” (see S3 in FIG. 1).
- FIG. 6 shows a conceptual diagram of the calculated symmetric nearest neighbor of gene i.
- the Boolean matrix N is simply referred to as “matrix N”.
- 1 is assigned to mij when the genes i and j are symmetric nearest neighbors, and 0 is assigned when they are not symmetric nearest neighbors.
- the symmetric matrix M can be normalized to the matrix N that may be asymmetric while maintaining the eigenvalues.
- FIG. 7 shows a conceptual diagram of the conversion process from the similarity matrix M to the Boolean matrix N.
- the LLE algorithm reconstructs the neighborhood of each gene under restricted conditions based on the standard q nearest neighbor method. That is, the LLE algorithm has a restriction that the finally obtained matrix must be symmetric. On the other hand, since the FSNN algorithm does not have such a restriction, simple and high-speed processing can be performed.
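The text does not spell out the rule that decides when two genes are “symmetric nearest neighbors”. The sketch below shows one plausible reading only, namely that genes i and j qualify when each is among the other's q most similar genes. This reading, the function name, and the choice of a zero diagonal are all assumptions; under this reading the resulting matrix is symmetric, so the rule actually used in the application may differ.

```python
def symmetric_nearest_neighbors(M, q):
    """Boolean matrix with 1 where i and j are mutual q-nearest neighbors.

    Assumed reading of the FSNN filter, for illustration only.
    M is a similarity matrix (larger value = more similar).
    """
    n = len(M)
    neighbors = []
    for i in range(n):
        # Rank the other genes by decreasing similarity to gene i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: M[i][j], reverse=True)
        neighbors.append(set(others[:q]))
    return [[1 if i != j and j in neighbors[i] and i in neighbors[j] else 0
             for j in range(n)]
            for i in range(n)]


M = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
N = symmetric_nearest_neighbors(M, q=1)
```

In this toy matrix, genes 0 and 1 pick each other, as do genes 2 and 3, so only those pairs receive a 1.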
- the matrix N is defined by the following equation (3).
- For the graph Laplacian, see “On spectral Clustering: Analysis and an Algorithm. Neural Information Processing Systems, 2001”.
- the matrix N is defined by the following equation (4).
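As a minimal sketch of the graph Laplacian and Markov chain standardizations, the code below assumes their textbook forms: the symmetric form D^(-1/2) N D^(-1/2) used in the cited spectral clustering paper, and the row-stochastic form D^(-1) N for the Markov chain, where D is the diagonal matrix of row sums. These are stand-ins for illustration, not the exact equations of this application, and the function names are hypothetical.

```python
import math


def degrees(N):
    """Row sums of N (the diagonal of the degree matrix D)."""
    return [sum(row) for row in N]


def normalized_affinity(N):
    """Symmetric normalization D^(-1/2) N D^(-1/2) (assumed textbook form)."""
    d = degrees(N)
    n = len(N)
    return [[N[i][j] / math.sqrt(d[i] * d[j]) if d[i] and d[j] else 0.0
             for j in range(n)]
            for i in range(n)]


def markov_transition(N):
    """Row normalization D^(-1) N: every non-empty row sums to 1."""
    d = degrees(N)
    return [[x / d[i] if d[i] else 0.0 for x in row] for i, row in enumerate(N)]


N = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
L_sym = normalized_affinity(N)
P = markov_transition(N)
```

The symmetric form keeps the matrix symmetric when N is symmetric, whereas the Markov form generally does not; which property matters depends on the clustering step that follows.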
- the matrix N is defined by the solution to the following equation (5).
- the problem of standardization using this DSA algorithm is that the deviation in the matrix N is larger than the deviation of the eigenvector in the matrix M.
- the gene clustering program according to the present invention preferably performs standardization using the DSS algorithm prior to standardization using the DSA algorithm. Thereby, the shift of deviation can be suppressed and the manifold can be held.
- FIG. 8 shows a conceptual diagram of a matrix ⁇ M (B) based on new gene coordinates and a Boolean matrix N (A) obtained by the DSS algorithm.
- the equation (7) is changed to obtain the minimum value (the left parameter is on the convexity of Bregman divergence), and ⁇ satisfying the following equation (8) is obtained.
- ⁇ can be expressed by the following equation (9), and an approximate value of ⁇ can be obtained with arbitrary accuracy by a simple binary search. According to the binary search, the amount of data processing can be remarkably suppressed, so that quick processing is possible.
- the DSA algorithm described above obtains an intermediate stage matrix M between the initial manifold and the clustering prediction described below.
- the DSS algorithm does not change the geometric display, the number of iterations of the DSA algorithm can be remarkably reduced by performing the standardization using the DSS algorithm prior to the standardization using the DSA algorithm.
- Clustering
- (4-1) Soft Clustering
- The EM (Expectation Maximization) algorithm and the CP (Complete Positive factorization) algorithm can be used as algorithms for soft membership clustering. The EM algorithm optimizes the parameter density, and the CP algorithm helps minimize information loss.
- This step is a step corresponding to step (4) “step of clustering each data based on Boolean matrix N” of the gene clustering program according to the present invention (see S4 in FIG. 1).
- the cluster number k is fixed and the solution of the following equation (10) is obtained.
- the matrix G gives the probability that each gene is a soft membership of cluster k.
- one gene can be classified into two or more clusters. This soft membership can be converted as a hard membership in which one gene belongs to only one cluster.
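The conversion from soft to hard membership can be sketched as follows, assuming that G is given as one row of cluster probabilities per gene and that a hard label is simply the most probable cluster. The function names and the `threshold` parameter are hypothetical.

```python
def soft_memberships(G, threshold):
    """Clusters whose probability reaches the threshold, per gene (may be several)."""
    return [[k for k, p in enumerate(row) if p >= threshold] for row in G]


def harden(G):
    """Assign each gene to its single most probable cluster (ties: lowest index)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in G]


# Rows are genes, columns are clusters (illustrative probabilities).
G = [[0.7, 0.3],
     [0.4, 0.6],
     [0.5, 0.5]]
soft = soft_memberships(G, threshold=0.4)
hard = harden(G)
```

Here the second and third genes belong to both clusters under the soft view, while the hard view keeps exactly one cluster for each gene.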
- Hard clustering: The Bregman k-means (BkM) method, the hierarchical clustering (HC) algorithm, and the AP (Affinity Propagation) algorithm are used as algorithms for hard membership clustering.
- BkM method is a generalization of the K-means method and can be applied to membership of exponential distribution families.
- HC is an integrated clustering algorithm.
- The AP algorithm does not require the number of clusters to be set in advance. Based on the similarity matrix, it converges to clusters using two quantities: the probability that each node selects a central node candidate (exemplar) in the space (responsibility: winning probability), and the probability that a central node makes another node belong to its own group (availability: attribute probability).
- Hereinafter, “probability” may be referred to as “probability message” (see “Clustering by Passing Messages Between Data Points. Science 315: 972-976, 2007”).
- the center of the cluster is selected at random and a prototype is created.
- the deflection distribution existing at the center x obtained with the probability p (x) is used.
- This probability p (x) is expressed by the following equations (11) and (12).
- the prototype means a state where each gene has the most meaningful coordinate axis because of the geometric structure.
- This equation (12) represents the minimum divergence between one of the genes selected as the center C of the cluster and the gene X.
- The generator of the Bregman divergence can be arbitrary. In particular, when the squared Euclidean distance is selected as the Bregman divergence, the procedure reduces to the Arthur-Vassilvitskii initialization.
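The seeding procedure can be sketched as follows, assuming that each further center is drawn with probability proportional to the minimum divergence between a gene and the centers chosen so far. With the squared Euclidean distance this is the Arthur-Vassilvitskii (k-means++) seeding; the proportionality rule is an assumption about the application's probability p(x), and the function names and the fixed random seed are hypothetical.

```python
import random


def squared_euclidean(x, c):
    """Squared Euclidean distance, one example of a Bregman divergence."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def seed_centers(points, k, divergence=squared_euclidean, rng=None):
    """Pick k initial centers; later centers favor points far from chosen ones."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: divergence to its closest already-chosen center.
        weights = [min(divergence(x, c) for c in centers) for x in points]
        if sum(weights) == 0:
            centers.append(rng.choice(points))  # all points coincide with centers
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers = seed_centers(points, k=2)
```

Passing a different `divergence` function changes the seeding distribution without changing the rest of the procedure.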
- the process by BkM is executed, and the process of assigning each gene to the center of the cluster and calculating a new center is repeated.
- the calculation of the new center may follow the normal k-Means algorithm (the arithmetic mean of cluster members is the new center), but the reallocation of each gene is performed according to a more general rule. That is, it is performed assuming that the gene X is related to a set center c that is a solution of C that satisfies the following equation (13).
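The iteration described above can be sketched as follows, assuming that each gene is reassigned to the center with the smallest divergence and that each new center is the arithmetic mean of its members. The squared Euclidean distance is used here only as an example divergence; the function names and the iteration cap are hypothetical.

```python
def squared_euclidean(x, c):
    """Example Bregman divergence; any other divergence(x, c) can be passed in."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def bregman_kmeans(points, centers, divergence=squared_euclidean, iterations=20):
    """Alternate assignment and mean update until the centers stop changing."""
    labels = []
    for _ in range(iterations):
        # Assignment: each gene goes to the center with the smallest divergence.
        labels = [min(range(len(centers)), key=lambda c: divergence(x, centers[c]))
                  for x in points]
        # Update: the arithmetic mean of the cluster members is the new center.
        new_centers = []
        for c in range(len(centers)):
            members = [x for x, label in zip(points, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


points = [[0, 0], [0, 2], [10, 10], [10, 12]]
labels, centers = bregman_kmeans(points, centers=[[0, 0], [10, 10]])
```

Only the assignment step depends on the chosen divergence, which is what makes the method a generalization of ordinary k-means.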
- the HC algorithm is executed using both the increment in the sum of squares of distance (Ward linkage distance) and the shortest distance (Single linkage distance).
- the algorithm input is a similarity matrix M between genes.
- the similarity matrix M may include a negative number.
- the input of row i and column j of matrix M is defined by the following equation (14).
- a gene group showing the average profile of each cluster is defined as a model.
- the AP algorithm is executed by repeating the exchange of attribute probability messages and winning probability messages between genes.
- the attribute probability message sent from gene i to gene j as a model candidate can be defined by the following equation (15).
- the attribute probability message is set to 0, and when the winning probability message is updated, the attribute probability message is updated according to the following two rules.
- each message is suppressed so that the new value is equal to ⁇ times the previous value plus (1- ⁇ ) times the updated value.
- the algorithm ends when the model no longer changes during m iterations.
- Cluster calculation is performed based on aij + rij of each gene pair.
- Each cluster corresponds to a model gene and a gene group using the gene as a model.
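The message exchange can be sketched as follows, assuming the standard update rules from the cited affinity propagation paper rather than the exact equations of this application. In this sketch, `damping` plays the role of the suppression factor, `stable_iters` the role of m, and the diagonal of S holds each gene's preference for becoming a model (exemplar). All names and default values are hypothetical.

```python
def affinity_propagation(S, damping=0.5, max_iter=200, stable_iters=15):
    """Standard affinity propagation on a similarity matrix S (n >= 2 assumed)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities (winning probability)
    A = [[0.0] * n for _ in range(n)]  # availabilities (attribute probability)
    exemplars, unchanged = None, 0
    for _ in range(max_iter):
        # Responsibility: how well k suits i compared with i's best alternative.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Availability: accumulated support for k from the other genes.
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(positive) - positive[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
        current = [k for k in range(n) if A[k][k] + R[k][k] > 0]
        unchanged = unchanged + 1 if current == exemplars else 0
        exemplars = current
        if exemplars and unchanged >= stable_iters:
            break  # the set of models did not change for stable_iters iterations
    if not exemplars:
        return [], []
    # Each gene joins the model with the largest a + r.
    labels = [max(exemplars, key=lambda k: A[i][k] + R[i][k]) for i in range(n)]
    return exemplars, labels


# Toy 1-D data: similarity is the negative squared distance, preference is -10.
xs = [0, 1, 2, 10, 11, 12]
S = [[-(a - b) ** 2 for b in xs] for a in xs]
for i in range(len(xs)):
    S[i][i] = -10
exemplars, labels = affinity_propagation(S)
```

A lower preference value on the diagonal tends to produce fewer clusters, and a larger damping value slows the updates but reduces oscillation.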
- The program can be provided via a communication medium such as a network or a satellite, in addition to a recording medium such as a magnetic disk, a CD-ROM, or a solid-state memory.
- the gene cluster analysis apparatus includes (1) means for calculating a feature amount reflecting similarity between each data from data indicating changes in expression amount of the gene over time, and (2) the calculated feature. Means for calculating eigenvectors of similarity matrix M for all gene combinations from the quantity; (3) means for converting similarity matrix M into Boolean matrix N while maintaining eigenvalues of eigenvectors; (4 And at least means for clustering each data based on the Boolean matrix N.
- This gene cluster analysis apparatus can be configured by installing the above gene clustering program in a normal computer.
- FIG. 10 is a block diagram showing a configuration example of the gene cluster analysis apparatus according to the present invention.
- the internal bus 10 is configured by, for example, PCI (Peripheral Component Interconnect) or a local bus, and connects the CPU 11, the ROM 12, the RAM 13, and the interface 14 to each other. Each unit exchanges data via the internal bus 10.
- the CPU 11 executes processing according to the gene clustering program stored in the ROM 12.
- the RAM 13 appropriately stores data, programs, and the like necessary for the CPU 11 to execute various processes.
- a keyboard 15 and a mouse 16 are connected to the interface 14, and the user can set parameters and the like using these.
- the interface 14 outputs the operation signals output from these to the CPU 11.
- a monitor 17 and a hard disk 18 are connected to the interface 14.
- the monitor 17 is controlled by the CPU 11 and displays a predetermined image.
- the CPU 11 can record or read data or programs on the hard disk 18 via the interface 14.
- The gene clustering method according to the present invention can be implemented by the above-described gene clustering program or gene cluster analysis apparatus, and performs at least: (1) a step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time; (2) a step of calculating, from the calculated feature amount, the eigenvectors of the similarity matrix M for all combinations of genes; (3) a step of converting the similarity matrix M into the Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and (4) a step of clustering the data based on the Boolean matrix N.
- By using multiple optimized algorithms, multi-dimensional data can be converted into low-dimensional manifolds while maintaining the biological information about the similarity of gene expression patterns. Therefore, it is possible to quickly perform advanced clustering analysis and detect an accurate clustering structure without performing a priori data prediction.
- the gene clustering program according to the present invention can perform gene clustering based on gene expression data over time with high accuracy without performing a priori data prediction. It can be used effectively for analysis.
Landscapes
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Genetics & Genomics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Epidemiology (AREA)
- Software Systems (AREA)
- Public Health (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
Description
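In step (1), this gene clustering program calculates the feature amount from the data by linear regression analysis or wavelet transform.
In step (2), the eigenvectors are calculated from the feature amount by a kernel method or cosine similarity.
In step (3), the similarity matrix M is converted into the Boolean matrix N by the FSNN (Filter by Symmetric Nearest Neighbors) algorithm.
Further, in step (3), after processing by the FSNN algorithm, the matrix is standardized by any one of the graph Laplacian, a Markov chain, the DSA (Doubly-Stochastic Approximation) algorithm, or the DSS (Doubly-Stochastic Scaling) algorithm.
In step (4), this gene clustering program performs soft clustering using the EM (Expectation Maximization) algorithm and the CP (Complete Positive factorization) algorithm.
Further, in step (4), after soft clustering, hard clustering is performed by the BAV (Bregman-Arthur-Vassilvitskii initialization) algorithm.
The present invention also provides a gene clustering method that performs at least: (1) a step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time; (2) a step of calculating, from the calculated feature amount, the eigenvectors of the similarity matrix M for all combinations of genes; (3) a step of converting the similarity matrix M into the Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and (4) a step of clustering the data based on the Boolean matrix N.
The present invention further provides a gene cluster analysis device comprising at least: (1) means for calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time; (2) means for calculating, from the calculated feature amount, the eigenvectors of the similarity matrix M for all combinations of genes; (3) means for converting the similarity matrix M into the Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and (4) means for clustering the data based on the Boolean matrix N.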
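This step corresponds to step (1) above, the “step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time” (see S1 in FIG. 1).
Next, for all combinations of genes, the eigenvectors of the similarity matrix M (a positive semi-definite matrix) are calculated from the calculated feature amounts by the kernel (heat kernel) method or cosine similarity. Hereinafter, the similarity matrix M is simply referred to as “matrix M”.
Let the two genes be i and j (i and j are integers of 1 or more). The entry in row i, column j of the matrix M obtained by the kernel method is defined by the following equation (1). This entry indicates the similarity between gene i and gene j.
For the matrix M obtained by the cosine similarity method, the entry is defined by the following equation (2).
(3-1) Conversion by FSNN
This matrix M is processed by the FSNN (Filter by Symmetric Nearest Neighbors) algorithm, which is similar to the LLE (Local Linear Embedding) algorithm. This yields a Boolean similarity matrix that is easier to process than the result of the LLE algorithm. For the FSNN algorithm and the LLE algorithm, see “A simple locally adaptive nearest neighbor rule with application to pollution forecasting. International Journal on Pattern Recognition and Artificial Intelligence, 17: 1-14, 2003” and “Nonlinear dimensionality reduction by locally linear embedding. Science, 290: 2323-2326, 2000”, respectively.
Following the standardization by the FSNN algorithm, the perturbation of the eigenvalues can be reduced by further performing standardization by any one of the graph Laplacian, a Markov chain, the DSA (Doubly-Stochastic Approximation) algorithm, or the DSS (Doubly-Stochastic Scaling) algorithm. Of these, the DSS algorithm is a novel algorithm newly created by the present inventors for the gene clustering program according to the present invention.
For the graph Laplacian, see “On spectral Clustering: Analysis and an Algorithm. Neural Information Processing Systems, 2001”.
For the Markov chain, see “Soft Membership for spectral clustering, with application to permeable language distinction. Pattern Recognition, 2008”.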
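(4-1) Soft clustering
The EM (Expectation Maximization) algorithm and the CP (Complete Positive factorization) algorithm can be used as algorithms for soft membership clustering. The EM algorithm optimizes the parameter density, and the CP algorithm helps minimize information loss.
The Bregman k-means (BkM) method, the hierarchical clustering (HC) algorithm, and the AP (Affinity Propagation) algorithm are used as algorithms for hard membership clustering. The BkM method generalizes the K-means method so that it can be applied to memberships of exponential family distributions. HC is an agglomerative clustering algorithm. The AP algorithm does not require the number of clusters to be set in advance. Based on the similarity matrix, it converges to clusters using the probability that each node selects a central node candidate (exemplar) in the space (responsibility: winning probability) and the probability that a central node makes another node belong to its own group (availability: attribute probability). Hereinafter, “probability” may be referred to as “probability message” (see “Clustering by Passing Messages Between Data Points. Science 315: 972-976, 2007”).
In the gene cluster analysis apparatus 1, the internal bus 10 is configured by, for example, PCI (Peripheral Component Interconnect) or a local bus, and connects the CPU 11, the ROM 12, the RAM 13, and the interface 14 to each other. Each unit exchanges data via the internal bus 10. The CPU 11 executes processing according to the gene clustering program stored in the ROM 12. The RAM 13 stores, as appropriate, the data and programs that the CPU 11 needs to execute various processes. A keyboard 15 and a mouse 16 are connected to the interface 14, and the user can use them to set parameters and the like. The interface 14 outputs the operation signals from these devices to the CPU 11. A monitor 17 and a hard disk 18 are also connected to the interface 14. The monitor 17 is controlled by the CPU 11 and displays a predetermined image. The CPU 11 can record data or programs on, or read them from, the hard disk 18 via the interface 14.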
Claims (10)
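- A gene clustering program that performs at least:
(1) a step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time;
(2) a step of calculating, for all combinations of genes, the eigenvectors of a similarity matrix M from the calculated feature amount;
(3) a step of converting the similarity matrix M into a Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and
(4) a step of clustering the data based on the Boolean matrix N.
- The gene clustering program according to claim 1, wherein in step (1) the feature amount is calculated from the data by linear regression analysis or wavelet transform.
- The gene clustering program according to claim 2, wherein in step (2) the eigenvectors are calculated from the feature amount by a kernel method or cosine similarity.
- The gene clustering program according to claim 3, wherein in step (3) the similarity matrix M is converted into the Boolean matrix N by the FSNN (Filter by Symmetric Nearest Neighbors) algorithm.
- The gene clustering program according to claim 4, wherein in step (3), after processing by the FSNN algorithm, the matrix is further standardized by any one of the graph Laplacian, a Markov chain, the DSA (Doubly-Stochastic Approximation) algorithm, or the DSS (Doubly-Stochastic Scaling) algorithm.
- The gene clustering program according to claim 5, wherein in step (4) soft clustering is performed by the EM (Expectation Maximization) algorithm and the CP (Complete Positive factorization) algorithm.
- The gene clustering program according to claim 6, wherein in step (4), after soft clustering, hard clustering is further performed by the BAV (Bregman-Arthur-Vassilvitskii initialization) algorithm.
- A recording medium on which the gene clustering program according to claim 1 is recorded so as to be readable by a computer.
- A gene clustering method that performs at least:
(1) a step of calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time;
(2) a step of calculating, from the calculated feature amount, the eigenvectors of a similarity matrix M for all combinations of genes;
(3) a step of converting the similarity matrix M into a Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and
(4) a step of clustering the data based on the Boolean matrix N.
- A gene cluster analysis device comprising at least:
(1) means for calculating a feature amount reflecting the similarity between data items from data indicating changes in gene expression level over time;
(2) means for calculating, from the calculated feature amount, the eigenvectors of a similarity matrix M for all combinations of genes;
(3) means for converting the similarity matrix M into a Boolean matrix N while maintaining the eigenvalues of the eigenvectors; and
(4) means for clustering the data based on the Boolean matrix N.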
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009801473989A CN102227731A (zh) | 2008-12-02 | 2009-12-01 | 基因聚类程序、基因聚类方法及基因聚类分析装置 |
US13/131,137 US20110246080A1 (en) | 2008-12-02 | 2009-12-01 | Gene clustering program, gene clustering method, and gene cluster analyzing device |
EP09830184.9A EP2354988B1 (en) | 2008-12-02 | 2009-12-01 | Gene clustering program, gene clustering method, and gene cluster analyzing device |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008307328 | 2008-12-02 | ||
JP2008-307328 | 2008-12-02 | ||
JP2009265433A JP2010157214A (ja) | 2008-12-02 | 2009-11-20 | 遺伝子クラスタリングプログラム、遺伝子クラスタリング方法及び遺伝子クラスター解析装置 |
JP2009-265433 | 2009-11-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010064414A1 true WO2010064414A1 (ja) | 2010-06-10 |
Family
ID=42233072
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/006521 WO2010064414A1 (ja) | 2008-12-02 | 2009-12-01 | 遺伝子クラスタリングプログラム、遺伝子クラスタリング方法及び遺伝子クラスター解析装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20110246080A1 (ja) |
EP (1) | EP2354988B1 (ja) |
JP (1) | JP2010157214A (ja) |
CN (1) | CN102227731A (ja) |
WO (1) | WO2010064414A1 (ja) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018150878A1 (ja) | 2017-02-14 | 2018-08-23 | 富士フイルム株式会社 | 生体物質解析方法および装置並びにプログラム |
CN109934245A (zh) * | 2018-11-03 | 2019-06-25 | 同济大学 | 一种基于聚类的转辙机故障识别方法 |
CN112307906A (zh) * | 2020-10-14 | 2021-02-02 | 北方工业大学 | 一种近邻传播聚类下储能电池故障分类特征筛选降维方法 |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
BR112013018139A8 (pt) * | 2011-01-19 | 2018-02-06 | Koninklijke Philips Electronics Nv | Método para processar dados genômicos de um indivíduo, uso de informação de sequência genômica, opcionalmente na combinação com informação de expressão de gene, apoio à decisão clínica e sistema de armazenamento e sistema |
CN102916426B (zh) * | 2012-09-20 | 2015-01-21 | 中国电力科学研究院 | 一种基于数据聚类的小干扰稳定机组分群方法及其系统 |
GB201320858D0 (en) * | 2013-11-26 | 2014-01-08 | Univ Manchester | Method of determining an islanding solution for an electrical power system |
US20170249422A1 (en) * | 2014-10-17 | 2017-08-31 | Koninklijke Philips N.V. | Cross platform transformation of gene expression data |
US10832799B2 (en) * | 2015-08-17 | 2020-11-10 | Koninklijke Philips N.V. | Multi-level architecture of pattern recognition in biological data |
CN105205345A (zh) * | 2015-08-28 | 2015-12-30 | 北京工业大学 | 基于ShRec3D和转换参数优化的染色体3D结构建模方法 |
CN105335626B (zh) * | 2015-10-26 | 2018-03-16 | 河南师范大学 | 一种基于网络分析的群lasso特征分群方法 |
US11205103B2 (en) | 2016-12-09 | 2021-12-21 | The Research Foundation for the State University | Semisupervised autoencoder for sentiment analysis |
CN108171012B (zh) * | 2018-01-17 | 2020-09-22 | 河南师范大学 | 一种基因分类方法与装置 |
WO2020071500A1 (ja) * | 2018-10-03 | 2020-04-09 | 富士フイルム株式会社 | 細胞情報処理方法 |
CN110222745B (zh) * | 2019-05-24 | 2021-04-30 | 中南大学 | 一种基于相似性学习及其增强的细胞类型鉴定方法 |
CN110414231B (zh) * | 2019-06-25 | 2020-09-29 | 中国人民解放军战略支援部队信息工程大学 | 基于马尔科夫模型的内存中软件基因动态提取方法 |
CN110910952B (zh) * | 2019-11-21 | 2023-05-12 | 衡阳师范学院 | 一种利用化学反应策略预测基本蛋白质方法 |
JP7378001B1 (ja) | 2023-03-09 | 2023-11-10 | 株式会社 日立産業制御ソリューションズ | マッピング装置、マッピング方法及びマッピングプログラム |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6470277B1 (en) * | 1999-07-30 | 2002-10-22 | Agy Therapeutics, Inc. | Techniques for facilitating identification of candidate genes |
AU2001231064A1 (en) * | 2000-01-21 | 2001-07-31 | Lion Bioscience Ag | Data analysis software |
EP1448796A4 (en) * | 2001-11-05 | 2008-04-02 | California Inst Of Techn | NON-METRIC TOOL FOR PREDICTING GENETIC RELATIONS FROM EXPRESSION DATA |
US20030104394A1 (en) * | 2001-12-03 | 2003-06-05 | Xudong Dai | Method and system for gene expression profiling analysis utilizing frequency domain transformation |
US20050027460A1 (en) * | 2003-07-29 | 2005-02-03 | Kelkar Bhooshan Prafulla | Method, program product and apparatus for discovering functionally similar gene expression profiles |
US20050209785A1 (en) * | 2004-02-27 | 2005-09-22 | Wells Martin D | Systems and methods for disease diagnosis |
US8065089B1 (en) * | 2004-03-30 | 2011-11-22 | University Of North Carolina At Charlotte | Methods and systems for analysis of dynamic biological pathways |
WO2008127199A1 (en) * | 2007-04-13 | 2008-10-23 | Agency For Science, Technology And Research | Methods of controlling tumorigenesis and diagnosing the risk thereof |
2009
- 2009-11-20 JP application JP2009265433A, published as JP2010157214A (ja), status: Pending
- 2009-12-01 EP application EP09830184.9A, published as EP2354988B1 (en), status: Not in force
- 2009-12-01 CN application CN2009801473989A, published as CN102227731A (zh), status: Pending
- 2009-12-01 US application US13/131,137, published as US20110246080A1 (en), status: Abandoned
- 2009-12-01 WO application PCT/JP2009/006521, published as WO2010064414A1 (ja), status: Application Filing
Non-Patent Citations (21)
Title |
---|
"A comprehensive map of the toll-like receptor signaling network", MOLECULAR SYSTEMS BIOLOGY, vol. 2, 2006, pages 0015 |
"A comprehensive pathway map of epidermal growth factor receptor signaling", MOLECULAR SYSTEMS BIOLOGY, vol. 1, 2005 |
"A simple locally adaptive nearest neighbor rule with application to pollution forecasting", INTERNATIONAL JOURNAL ON PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, vol. 17, 2003, pages 1 - 14 |
"Biological robustness", NATURE REVIEWS GENETICS, vol. 5, no. 11, 2004, pages 826 - 37 |
"Bow ties, metabolism and disease", TRENDS IN BIOTECHNOLOGY, vol. 22, no. 9, 2004, pages 446 - 50 |
"Cluster analysis and display of genome-wide expression patterns", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 95, no. 25, 1998, pages 14863 - 14868 |
"Clustering by Passing Messages Between Data Points", SCIENCE, vol. 315, 2007, pages 972 - 976 |
"Doubly Stochastic Normalization for Spectral Clustering", NEURAL INFORMATION PROCESSING SYSTEMS, 2006 |
"GenePattern 2.0", NATURE GENETICS, vol. 38, 2006, pages 500 - 501 |
"Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 96, no. 6, 1999, pages 2907 - 2912 |
"k-Means++: the advantages of careful seeding", ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2007 |
"Nonlinear dimensionality reduction by locally linear embedding", SCIENCE, vol. 290, 2000, pages 2323 - 2326 |
"On Spectral Clustering: Analysis and an Algorithm", NEURAL INFORMATION PROCESSING SYSTEMS, 2001 |
"Soft Membership for spectral clustering, with application to permeable language distinction", PATTERN RECOGNITION, 2008 |
"Systematic determination of genetic network architecture", NATURE GENETICS, vol. 22, no. 3, 1999, pages 281 - 285 |
"The Edinburgh human metabolic network reconstruction and its functional analysis", MOLECULAR SYSTEMS BIOLOGY, vol. 3, 2007, pages 135 |
HIROAKI KITANO, SYSTEM BIOLOGY, 1 July 2001 (2001-07-01), pages 72 - 90, XP008150657 * |
KIKUYA KATO: "Idenshi Hatsugen Profile no Data Kaiseki", PROTEIN, NUCLEIC ACID AND ENZYME, vol. 48, no. 16, 14 December 2004 (2004-12-14), pages 2300 - 2309, XP008149723 * |
NOBUHIRO TSUBOI: "Digital Shingo Shori ni Motozuku Idenshi Clustering", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 41, no. SIG7, 15 November 2000 (2000-11-15), pages 49 - 56, XP008147979 * |
See also references of EP2354988A4 |
TAKASHI ITO: "Kobo Genome no Taikeiteki Kino Kaiseki: Hatsugen Seigyo Network, Tanpakushitsu Kan Sogo Sayo Network", GENOME SCIENCE NO ARATANARU CHOSEN, vol. 46, no. 16, 10 December 2001 (2001-12-10), pages 2407 - 2413, XP008147925 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
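WO2018150878A1 (ja) | 2017-02-14 | 2018-08-23 | 富士フイルム株式会社 | Biological substance analysis method, device, and program |
CN109934245A (zh) * | 2018-11-03 | 2019-06-25 | 同济大学 | Clustering-based switch machine fault identification method |
CN109934245B (zh) * | 2018-11-03 | 2023-01-17 | 同济大学 | Clustering-based switch machine fault identification method |
CN112307906A (zh) * | 2020-10-14 | 2021-02-02 | 北方工业大学 | Feature screening and dimensionality reduction method for energy storage battery fault classification under affinity propagation clustering |
CN112307906B (zh) * | 2020-10-14 | 2023-07-04 | 北方工业大学 | Feature screening and dimensionality reduction method for energy storage battery fault classification under affinity propagation clustering |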
Also Published As
Publication number | Publication date |
---|---|
CN102227731A (zh) | 2011-10-26 |
JP2010157214A (ja) | 2010-07-15 |
EP2354988A4 (en) | 2015-07-01 |
EP2354988B1 (en) | 2016-11-30 |
EP2354988A1 (en) | 2011-08-10 |
US20110246080A1 (en) | 2011-10-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2010064414A1 (ja) | Gene clustering program, gene clustering method, and gene cluster analysis device | |
Li et al. | DeepAtom: A framework for protein-ligand binding affinity prediction | |
Wang et al. | Clustering aggregation by probability accumulation | |
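CN111785329B (zh) | Single-cell RNA sequencing clustering method based on adversarial autoencoder | |
CN106991296B (zh) | Ensemble classification method based on randomized greedy feature selection | |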
Yu et al. | Self-paced learning for k-means clustering algorithm | |
WO2023217290A1 (zh) | Gene phenotype prediction based on graph neural network | |
Ding et al. | Tensor sliced inverse regression | |
Bryan | Problems in gene clustering based on gene expression data | |
Yu et al. | Predicting protein complex in protein interaction network-a supervised learning based method | |
Seo et al. | Self-organizing maps and clustering methods for matrix data | |
JP2022548960A (ja) | Single-cell RNA-seq data processing | |
US20070239415A2 (en) | General graphical gaussian modeling method and apparatus therefore | |
Chen et al. | Adaptive and structured graph learning for semi-supervised clustering | |
Ding et al. | Dance: A deep learning library and benchmark for single-cell analysis | |
McLachlan et al. | The EM algorithm | |
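CN117423391A (zh) | Method, system, and device for establishing a gene regulatory network database | |
CN113516019A (zh) | Hyperspectral image unmixing method, apparatus, and electronic device | |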
Belanche et al. | Kernel functions for categorical variables with application to problems in the life sciences | |
Lu et al. | Soft-orthogonal constrained dual-stream encoder with self-supervised clustering network for brain functional connectivity data | |
Bichat et al. | Hierarchical correction of p-values via an ultrametric tree running Ornstein-Uhlenbeck process | |
JP2002175305A (ja) | Graphical modeling method for inferring gene networks and apparatus therefor | |
Kim et al. | Difference-based clustering of short time-course microarray data with replicates | |
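CN116844649B (zh) | Interpretable cell data analysis method based on gene selection | |
CN108596444B (zh) | Method and device for sampling users of large-scale social networks based on a diversification strategy |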
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | WWE | WIPO information: entry into national phase | Ref document number: 200980147398.9; Country of ref document: CN |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 09830184; Country of ref document: EP; Kind code of ref document: A1 |
| | REEP | Request for entry into the European phase | Ref document number: 2009830184; Country of ref document: EP |
| | WWE | WIPO information: entry into national phase | Ref document number: 2009830184; Country of ref document: EP |
| | WWE | WIPO information: entry into national phase | Ref document number: 13131137; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |