JP2006277611A

JP2006277611A - Multiple sample gene expression data analysis system, method, program and recording medium

Info

Publication number: JP2006277611A
Application number: JP2005099284A
Authority: JP
Inventors: Masaki Ando; 正貴安東; Akira Saito; 彰斎藤; Shigeru Otaki; 慈大瀧; Kenichi Sato; 健一佐藤; Keiko Otani; 敬子大谷
Original assignee: Hiroshima University NUC; NEC Corp; Japan Biological Informatics Consortium
Current assignee: Hiroshima University NUC; NEC Corp; Japan Biological Informatics Consortium
Priority date: 2005-03-30
Filing date: 2005-03-30
Publication date: 2006-10-12

Abstract

PROBLEM TO BE SOLVED: To provide an analysis technique and an analysis system that analyze gene expression data among multiple samples without using rank data and without correcting the distribution of gene expression data among the samples.

SOLUTION: The multiple-sample gene expression data analysis system comprises an input device 1 for inputting gene expression data of multiple samples, a data analysis device 2, and an output device 3 for outputting an analysis result. The data analysis device 2 comprises: an order statistic calculation means 21 for calculating the order statistics of the gene expression data set of one sample from among the N gene expression data sets supplied by the input device 1; a rank data generation means 22 for ranking the gene expression data within each of the (N-1) samples and generating (N-1) rank data sets; a data conversion means 23 for converting the gene expression data sets using the order statistics and the (N-1) rank data sets; and a gene expression estimation means 24 for estimating the gene expression status using the converted (N-1) gene expression data sets.

COPYRIGHT: (C) 2007, JPO & INPIT

Description

The present invention relates to an analysis method, an analysis system, and a recording medium for gene expression data of a plurality of samples, and in particular to an analysis system, method, program, and recording medium that analyze gene expression data without correcting the distribution of gene expression data among samples.

In recent years, with the development of oligo-type and Stanford-type DNA microarrays and macroarrays, comprehensive gene analysis has been widely studied and put into practical use. For example, DNA microarrays for gene expression analysis, DNA chips for SNP typing, and CGH arrays are widely used in genome research.

For example, in research on disease-related genes, microarrays are used to analyze gene expression comprehensively in samples from a disease group and a healthy group; genes related to the disease are sought by examining whether their expression levels differ between the groups. In research on drug responsiveness, a drug is administered to cultured cells or the like at varying doses, and the genes whose expression levels change with the dose are examined.

In research using such microarrays, information on several thousand to several tens of thousands of genes is obtained from a single sample. The large amount of genetic information obtained (gene expression and the like) is then used to analyze, by statistical analysis or data mining techniques, the relationship between genes and diseases, drug responsiveness, side effects, and so on.

In microarray experiments the experimental process is very complex, and various experimental errors and systematic distortions arise at each stage of the experiment. For this reason, several methods for correcting experimental errors have been proposed. For example, Patent Document 1 and Non-Patent Document 1 below correct microarray experimental errors using order statistics or a nonparametric smoothing method. Furthermore, Patent Document 2 below quantitatively analyzes whether expression differs between samples by estimating the gene expression status with a mathematical model and obtaining the posterior probability of the expression state.

Analysis methods for gene expression data obtained from multiple samples include, for example, statistically testing whether expression levels differ between groups using a t-test, U test, or the like, and examining by what factor expression levels increase or decrease between groups. However, as described above, experimental errors and systematic distortions arise between samples in microarray experiments, so the distribution of gene expression data differs from sample to sample. Therefore, before gene expression data can be compared between samples, the distribution of the gene expression data must be corrected between those samples.

On the other hand, Patent Document 3 below, for example, assigns ranks within each sample according to the expression intensity of each gene and performs a nonparametric test based on the rank data. The analysis thereby normalizes the expression intensity values and reduces the influence of experimental error.

Patent Document 1: JP 2004-325419 A
Patent Document 2: JP 2005-38279 A
Patent Document 3: JP 2003-21634 A
Non-Patent Document 1: Yang et al., Nucleic Acids Research, 2002, Vol. 30, No. 4

The first problem with the prior art is that the bias in the distribution of gene expression data must be corrected for every comparison between samples, because the distribution of gene expression data differs from sample to sample.

The second problem is that Patent Document 3 above uses rank information without considering the absolute values of the gene expression data. Microarray data index the expression state of a gene as the amount of mRNA produced, but the absolute values are expected to have a different range of variation for each gene, so applying a common scale to them is not valid. Replacing values with ranks presupposes that all the values being compared share a common scale.

An object of the present invention is to provide an analysis system and an analysis method that, in the analysis of gene expression data among a plurality of samples, analyze the absolute values of the gene expression data without using rank data and without correcting the distribution of gene expression data between samples.

Referring to FIG. 1, the gene expression data analysis system according to the present invention includes an input device 1 for inputting gene expression data of a plurality of samples (gene expression data sets), a data analysis device 2 that operates under program control, and an output device 3. The data analysis device 2 has: order statistic calculation means 21 for calculating the order statistics of the gene expression data set of one sample from among the N gene expression data sets supplied by the input device 1; rank data generation means 22 for ranking the gene expression data within each of the (N-1) samples and generating rank data sets; data conversion means 23 for converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets; and gene expression estimation means 24 for estimating the gene expression status using the converted gene expression data sets. The analysis result is output to the output device 3.

The gene expression data analysis method according to the present invention includes a step of inputting gene expression data of a plurality of samples (gene expression data sets), a step of analyzing the gene expression data, and a step of outputting the analysis result. The data analysis step comprises: a step of calculating the order statistics of the gene expression data set of one sample from among the N gene expression data sets input in the input step; a step of ranking the gene expression data within each of the (N-1) samples and generating (N-1) rank data sets; a step of converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets; and a step of estimating the gene expression status using the converted (N-1) gene expression data sets.

According to the present invention, gene expression data are replaced using the order statistics of gene expression data, so the data can be analyzed without correcting the bias in their distribution for every comparison between samples. As a result, the computation time previously spent on correcting distributions in gene expression data analysis can be reduced, and throughput can be improved.

Furthermore, according to the present invention, the order statistics of the gene expression data are used instead of the ranks of the gene expression data, so the analysis works with the absolute amounts of gene expression. As a result, genes whose expression intensities vary over different ranges can be analyzed with each gene's own range of variation taken into account.

A first embodiment of the present invention will now be described in detail with reference to the drawings.

Referring to FIG. 1, the first embodiment of the present invention includes an input device 1 for inputting gene expression intensity data sets, a data analysis device 2 that operates under program control, and an output device 3 such as a display device or a printing device. The data analysis device 2 comprises order statistic calculation means 21, rank data generation means 22, data conversion means 23, and gene expression estimation means 24.

The order statistic calculation means 21 calculates the order statistics of the gene expression data set of one sample supplied by the input device 1 and sends the calculated order statistics to the data conversion means 23.

The rank data generation means 22 ranks the gene expression levels within each of the gene expression data sets of the (N-1) samples supplied by the input device 1, generates (N-1) rank data sets, and sends the generated rank data sets to the data conversion means 23.

The data conversion means 23 converts the gene expression data sets using the order statistics sent from the order statistic calculation means 21 and the (N-1) rank data sets supplied by the rank data generation means 22, and sends the converted gene expression data sets to the gene expression estimation means 24.

The gene expression estimation means 24 corresponds to the data analysis device of Patent Document 1 above (FIG. 2 of that document), which estimates the gene expression status using a mathematical model. It estimates the gene expression status and sends the posterior probability of the expression state of each gene to the output device 3.

Next, the operation of the present embodiment will be described in detail with reference to FIGS. 1 and 2.

Let the N gene expression data sets be

$$C_1, C_2, \ldots, C_N,$$

where

$$C_k = (x_{k1}, x_{k2}, \ldots, x_{kp}), \qquad k = 1, \ldots, N,$$

and $x_{ki}$ is the expression level of the i-th gene in the k-th gene expression data set.

From the N gene expression data sets input through the input device 1, one gene expression data set $C_k$ (for example, the sample of interest) is sent to the order statistic calculation means 21. For the received gene expression data set $C_k$ with its p expression levels, the order statistic calculation means 21 calculates the order statistics

$$x_{k(1)}, x_{k(2)}, \ldots, x_{k(p)}$$

(step B2 in FIG. 2), where

$$x_{k(1)} \le x_{k(2)} \le \cdots \le x_{k(p)}.$$
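As a minimal sketch of step B2, the order statistics of one data set are simply its expression values sorted in ascending order. The function name `order_statistics` and the toy values are illustrative assumptions and do not come from the patent.

```python
def order_statistics(expression):
    """Return the expression values of one sample in ascending order.

    Element m-1 of the result is the m-th order statistic.
    (Hypothetical helper, not named in the patent.)
    """
    return sorted(expression)


# Toy reference data set C_k with p = 5 genes (made-up values).
c_k = [7.2, 1.5, 3.3, 9.8, 4.1]
stats = order_statistics(c_k)
print(stats)  # [1.5, 3.3, 4.1, 7.2, 9.8]
```

Sorting leaves `c_k` itself untouched, so the reference sample keeps its original gene order for the later comparison steps.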

The rank data generation means 22 ranks the gene expression levels within each of the (N-1) gene expression data sets other than the data set $C_k$ used for the order statistics,

$$C_j \qquad (j = 1, \ldots, N;\ j \ne k),$$

and generates the rank data sets

$$R_j \qquad (j = 1, \ldots, N;\ j \ne k)$$

(step B3), where

$$R_j = (r_{j1}, r_{j2}, \ldots, r_{jp})$$

and $r_{ji}$ is the rank of the expression level of the i-th gene in the j-th gene expression data set.
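Step B3 can be sketched as follows. The patent text does not say whether rank 1 goes to the smallest or the largest value, nor how ties are handled; this sketch assumes ascending ranks (rank 1 for the smallest value) and breaks ties by gene position. `rank_data` is a hypothetical name.

```python
def rank_data(expression):
    """Rank the expression values of one sample, 1 = smallest, p = largest.

    Ties are broken by gene position (an assumption; the source is silent).
    """
    # Gene indices ordered by increasing expression; sorted() is stable.
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    ranks = [0] * len(expression)
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks


# Toy data set C_j with p = 5 genes (made-up values).
c_j = [2.0, 8.5, 0.4, 5.1, 3.0]
print(rank_data(c_j))  # [2, 5, 1, 4, 3]
```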

The data conversion means 23 uses the (N-1) rank data sets

$$R_j \qquad (j \ne k)$$

and the order statistics

$$x_{k(1)}, x_{k(2)}, \ldots, x_{k(p)}$$

to convert the (N-1) gene expression data sets according to Equation 5 below, generating the (N-1) converted gene expression data sets

$$\tilde{C}_j \qquad (j \ne k),$$

where

$$\tilde{C}_j = (\tilde{x}_{j1}, \tilde{x}_{j2}, \ldots, \tilde{x}_{jp})$$

and $\tilde{x}_{ji}$ is the value obtained by replacing the expression level of the i-th gene in the j-th gene expression data set with an order statistic of the k-th gene expression data set:

$$\tilde{x}_{ji} = x_{k(r_{ji})} \tag{5}$$

Here $r_{ji}$ is the rank of the i-th gene in the j-th data set, and $x_{k(m)}$ denotes the m-th order statistic of $C_k$. In other words, the data conversion means 23 replaces the values in the (N-1) gene expression data sets with gene expression data of the data set $C_k$ according to the ranks of the gene expression levels (step B4).
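The conversion of step B4 can be illustrated with a short sketch: each value in a non-reference sample is replaced by the reference value that holds the same rank in C_k. The helper names, the ascending rank convention, and the toy data are assumptions made for illustration only.

```python
def rank_data(expression):
    """Ascending ranks (1 = smallest); ties broken by position (assumption)."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    ranks = [0] * len(expression)
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks


def convert(reference, other):
    """Replace each value of `other` by the reference order statistic
    whose index equals the rank of that value within `other`."""
    if len(reference) != len(other):
        raise ValueError("both data sets must cover the same genes")
    stats = sorted(reference)            # order statistics of C_k
    return [stats[r - 1] for r in rank_data(other)]


c_k = [7.2, 1.5, 3.3, 9.8, 4.1]   # reference sample (made-up values)
c_j = [2.0, 8.5, 0.4, 5.1, 3.0]   # sample to be converted (made-up values)
print(convert(c_k, c_j))          # [3.3, 9.8, 1.5, 7.2, 4.1]
```

The converted sample keeps the gene ordering of `c_j` but takes all of its values from `c_k`, which is why no separate distribution correction is needed afterwards.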

The gene expression estimation means 24 calculates the posterior probabilities of the gene expression state using $C_k$ and the (N-1) gene expression data sets $\tilde{C}_j$ converted by the data conversion means 23. The gene expression status is estimated for every pairing of a converted data set $\tilde{C}_j$ with $C_k$, that is, (N-1) times.

Let the posterior probabilities of the expression state of each gene in sample k, obtained by estimating the gene expression status between $\tilde{C}_j$ and $C_k$, be

$$\pi_j = (\pi_{j1}, \pi_{j2}, \ldots, \pi_{jp}).$$

The posterior probabilities of the expression state are obtained in the same way for all pairings (step B5).

Next, the gene expression estimation means 24 uses the p × (N-1) posterior probabilities of the expression state for sample k to obtain the average posterior probability for every gene, for example by Equation 6 below (step B6).

$$\bar{\pi}_i = \operatorname{mean}\bigl(\pi_{ji} : j \ne k\bigr), \qquad i = 1, \ldots, p \tag{6}$$

Here $\pi_{ji}$ is the posterior probability for gene i obtained from the comparison with the j-th data set, and $\operatorname{mean}(x)$ denotes an averaging operation on a vector x, such as the arithmetic mean or the logit mean.
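Step B6 can be sketched as a per-gene average over the (N-1) comparisons. The patent names the arithmetic mean and the logit mean without defining the latter; the sketch assumes that "logit mean" means averaging on the logit scale and mapping the result back to a probability. The function names, the clipping constant `eps`, and the toy probabilities are hypothetical.

```python
import math


def arithmetic_mean(probs):
    return sum(probs) / len(probs)


def logit_mean(probs, eps=1e-12):
    """Average on the logit scale, then map back to a probability.

    Clipping by eps avoids log(0); this detail is an assumption.
    """
    logits = []
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)
        logits.append(math.log(p / (1.0 - p)))
    mean_logit = sum(logits) / len(logits)
    return 1.0 / (1.0 + math.exp(-mean_logit))


def average_posteriors(posteriors, mean=arithmetic_mean):
    """posteriors[j][i] is the posterior for gene i from comparison j.

    Returns one averaged posterior per gene.
    """
    n_genes = len(posteriors[0])
    return [mean([row[i] for row in posteriors]) for i in range(n_genes)]


# Two comparisons (N - 1 = 2), three genes (made-up probabilities).
posteriors = [
    [0.9, 0.2, 0.5],
    [0.7, 0.8, 0.5],
]
print(average_posteriors(posteriors))
print(average_posteriors(posteriors, mean=logit_mean))
```

The two averages can differ: the logit mean weights probabilities near 0 or 1 more heavily than the arithmetic mean does.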

Next, the effect of the present embodiment will be described. In the present embodiment, the data in the remaining (N-1) gene expression data sets are replaced, according to their rank information, with the order statistics of the gene expression data set of one specific sample. As a result, the distribution of gene expression data becomes identical across the samples, and the gene expression data can be analyzed without correcting their distribution for every comparison between samples. Furthermore, by repeating the comparison with multiple samples, the posterior probability of the expression state of each gene in sample k can be estimated accurately.
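The claim that the distributions become equal can be checked numerically. The sketch below assumes ascending ranks and uses synthetic log-normal data generated with the standard library; none of the numbers come from the patent.

```python
import random


def convert(reference, other):
    """Give each gene in `other` the reference value of the same rank."""
    stats = sorted(reference)
    order = sorted(range(len(other)), key=lambda i: other[i])
    converted = [0.0] * len(other)
    for position, gene in enumerate(order):
        converted[gene] = stats[position]
    return converted


random.seed(0)
# Synthetic reference sample with 1000 genes.
reference = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]
# Another sample with a deliberately different scale and offset.
other = [3.0 * random.lognormvariate(0.5, 1.5) + 2.0 for _ in range(1000)]

converted = convert(reference, other)
# After conversion both samples contain exactly the same set of values.
same_distribution = sorted(converted) == sorted(reference)
print(same_distribution)  # True
```

Because the converted sample is a permutation of the reference values, the equality holds exactly, whatever the original distribution of `other` was.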

Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

The second embodiment of the present invention uses gene expression data to visualize, as a graph, how expression levels vary between samples. Referring to FIG. 3, the configuration of the second embodiment differs from that of the first embodiment shown in FIG. 1 in that expression variation graphing means 25 replaces the gene expression estimation means 24.

Next, the operation of the present embodiment will be described in detail with reference to FIGS. 3 and 4.

In the operation of the present embodiment, steps C1 to C4 in FIG. 4 are the same as steps B1 to B4 in the flowchart of the first embodiment shown in FIG. 2.

After step C4, the expression variation graphing means 25 uses the (N-1) converted gene expression data sets and C_k to generate a graph of how the expression level of gene i varies from sample to sample, and sends the graph to the output device 3 (step C5). The output device 3 displays or otherwise outputs the expression variation graph (step C6).

Next, the effect of the present embodiment will be described. In the present embodiment, changes in the expression level of gene i are displayed as a graph using the order statistics of the gene expression data. The absolute values of the gene expression data can thus be graphed without any correction between samples.

When changes in gene expression levels are examined while the dose of a drug given to a cell line is varied, or when time-series changes in expression levels are examined, the remaining gene expression data are replaced, according to their rank information, with the order statistics of a reference gene expression data set (for example, the gene expression data before administration or at time 0). The replaced gene expression data are then used to graph the change in the expression level of each gene as the dose or time varies. The effect in this example is that changes in gene expression levels can be examined without correcting the distribution of expression levels in each sample and without using rank data.
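The dose or time-course use case can be sketched as follows: the baseline sample serves as the reference, every later sample is converted against it, and the values of one gene are collected into a series that a plotting tool of choice could draw. The function names and the toy data are assumptions; the patent does not prescribe a data layout or a plotting method.

```python
def convert(reference, other):
    """Give each gene in `other` the reference value of the same rank."""
    stats = sorted(reference)
    order = sorted(range(len(other)), key=lambda i: other[i])
    converted = [0.0] * len(other)
    for position, gene in enumerate(order):
        converted[gene] = stats[position]
    return converted


def expression_series(baseline, later_samples, gene):
    """Expression of one gene at baseline and in each converted later sample."""
    series = [baseline[gene]]
    for sample in later_samples:
        series.append(convert(baseline, sample)[gene])
    return series


baseline = [1.0, 2.0, 4.0, 8.0]        # time 0 / before administration (made up)
later = [
    [10.0, 30.0, 20.0, 40.0],          # first dose or time point (made up)
    [50.0, 20.0, 30.0, 10.0],          # second dose or time point (made up)
]
print(expression_series(baseline, later, gene=0))  # [1.0, 1.0, 8.0]
```

Gene 0 is the lowest-expressed gene at the first time point and the highest-expressed gene at the second, so its series moves from the smallest to the largest baseline value.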

Next, a third embodiment of the present invention will be described in detail with reference to the drawings. In this embodiment, a statistical test is performed using gene data. Referring to FIG. 5, the configuration of the third embodiment differs from that of the first embodiment shown in FIG. 1 in that statistical test means 26 replaces the gene expression estimation means 24.

Next, the operation of the present embodiment will be described in detail with reference to FIGS. 5 and 6.

In the operation of the present embodiment, steps D1 to D4 in FIG. 6 are the same as steps B1 to B4 in the flowchart of the first embodiment shown in FIG. 2.

After step D4, the statistical test means 26 applies a statistical test such as a t-test or U test to the (N-1) converted gene expression data sets and C_k, extracts the genes whose expression levels differ significantly between groups, and sends them to the output device 3 (step D5).
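A rough sketch of step D5 under stated assumptions: every sample is converted against the reference C_k, the samples are split into two groups, and a per-gene Welch t statistic is computed with the standard library. The patent names only "t-test or U test"; the Welch variant, the group assignment, and all data below are illustrative choices, and no p-value or significance threshold is computed here.

```python
import math
import statistics


def convert(reference, other):
    """Give each gene in `other` the reference value of the same rank."""
    stats = sorted(reference)
    order = sorted(range(len(other)), key=lambda i: other[i])
    converted = [0.0] * len(other)
    for position, gene in enumerate(order):
        converted[gene] = stats[position]
    return converted


def welch_t(a, b):
    """Welch's t statistic for two independent groups.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


reference = [1.0, 2.0, 3.0, 4.0]  # C_k, assumed to belong to group A
group_a = [
    reference,
    convert(reference, [5.0, 9.0, 14.0, 20.0]),
    convert(reference, [2.0, 1.0, 3.0, 4.0]),
]
group_b = [
    convert(reference, [30.0, 10.0, 20.0, 40.0]),
    convert(reference, [9.0, 3.0, 5.0, 8.0]),
    convert(reference, [7.0, 2.0, 1.0, 9.0]),
]

# One t statistic per gene, comparing group A with group B.
t_values = [
    welch_t([s[i] for s in group_a], [s[i] for s in group_b])
    for i in range(len(reference))
]
for gene, t in enumerate(t_values):
    print(f"gene {gene}: t = {t:.3f}")
```

In practice the t statistics would be turned into p-values and corrected for multiple testing across genes; those steps are outside what the source describes.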

Next, the effect of the present embodiment will be described. In the present embodiment, the statistical test is performed using the order statistics of gene expression data. A statistical test can thus be carried out without correcting the distribution of gene expression data between samples.

Next, a fourth embodiment of the present invention will be described in detail with reference to the drawings.

Referring to FIG. 7, the fourth embodiment of the present invention includes an input device, a data analysis device, and an output device, as in the first embodiment, and further includes a recording medium 4 on which a computer program for causing a computer to execute the data analysis method described above is recorded. The recording medium 4 may be either portable or fixed, and may be a magnetic disk, a semiconductor memory, a CD-ROM, or another recording medium. The computer program for executing the data analysis method may also be stored in a storage device of a computer connected to a network and transferred to other computers over the network. The medium that provides the computer program can be distributed as a computer-readable medium in various forms and is not limited to any particular type of medium.

The data analysis program is read from the recording medium 4 into the data analysis device 5 and controls its operation, so that the data analysis device 5 performs, on the data file input from the input device 1, the same processing as the data analysis device 2 in the first embodiment.

A specific example of the present invention will now be described with reference to results obtained on real data. This example corresponds to the first embodiment described above.

The microarray data used here were obtained with an oligo-type microarray, and one gene expression data set consists of 19,958 gene expression values. For simplicity, this example uses two gene expression data sets, referred to as sample A and sample B.

FIG. 8 shows a plot of the gene expression data of sample A and sample B. The horizontal axis of FIG. 8 represents the sum of the logarithms of the gene expression intensities in the two samples, and the vertical axis represents the difference between those logarithms (this is called an SD plot). The curve in the SD plot is a nonparametric smoothing curve, given for example by Equation 7 below.

Here, u represents the sum of the logarithms of the gene expression intensities in the two samples (the horizontal axis of the SD plot), and v represents the difference between the logarithms of the gene expression intensities in the two samples (the vertical axis of the SD plot).
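The SD-plot coordinates can be computed directly from the definitions of u and v. The source does not state the base of the logarithm or the sign convention of the difference; the sketch assumes natural logarithms and sample A minus sample B. `sd_coordinates` and the intensities are hypothetical.

```python
import math


def sd_coordinates(sample_a, sample_b):
    """Return one (u, v) pair per gene for an SD plot.

    u is the sum and v the difference of the log intensities.
    Natural logarithms and the order A minus B are assumptions.
    """
    points = []
    for a, b in zip(sample_a, sample_b):
        log_a, log_b = math.log(a), math.log(b)
        points.append((log_a + log_b, log_a - log_b))
    return points


# Made-up intensities for three genes.
sample_a = [100.0, 250.0, 40.0]
sample_b = [100.0, 125.0, 80.0]
for u, v in sd_coordinates(sample_a, sample_b):
    print(f"u = {u:.3f}, v = {v:.3f}")
```

A gene with equal intensity in both samples lands on v = 0, so a systematic bias between samples shows up as a point cloud that drifts away from the horizontal axis.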

The shape of the scatter plot and the nonparametric smoothing curve show that the distribution of gene expression data in the two samples is biased. Normally, such a bias in the distribution of gene expression data has to be corrected using the nonparametric smoothing curve given by Equation 7 above.

FIG. 9 shows the SD plot obtained after the gene expression data of sample B were replaced with the order statistics of the gene expression data of sample A. The curve in the SD plot is the nonparametric smoothing curve given by Equation 7 above. Compared with FIG. 8, the bias in the distribution of gene expression data has disappeared.

FIG. 10 shows the result of estimating the gene expression status by the method of Patent Document 2 above after the distributions of the gene expression data of sample A and sample B were corrected with a nonparametric smoothing curve.

The horizontal axis represents the sum of the logarithms of the gene expression intensities of sample A and sample B, and the vertical axis represents their difference. The five plot symbols (1) to (5) are shown at the bottom of FIG. 10. Each symbol marks the points whose probability that the expression states of sample A and sample B are heterogeneous falls in the following range:
(1) white circle: 0 to 0.2
(2) cross (×): 0.2 to 0.4
(3) triangle: 0.4 to 0.6
(4) square: 0.6 to 0.8
(5) black circle: 0.8 to 1.0
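The mapping from a heterogeneity probability to one of the five plot symbols can be sketched with a bisection over the bin edges. The source ranges share their endpoints (for example 0.2 belongs to both (1) and (2)); the sketch assumes that a boundary value falls into the upper bin. `marker_class` is a hypothetical name.

```python
import bisect

# Upper edges of classes (1) to (4); class (5) covers the rest up to 1.0.
EDGES = [0.2, 0.4, 0.6, 0.8]
MARKERS = {
    1: "white circle",
    2: "cross",
    3: "triangle",
    4: "square",
    5: "black circle",
}


def marker_class(prob):
    """Return the plot class 1 to 5 for a probability in [0, 1].

    Boundary values go to the upper class (an assumption).
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return bisect.bisect_right(EDGES, prob) + 1


for p in (0.05, 0.2, 0.5, 0.79, 1.0):
    print(p, MARKERS[marker_class(p)])
```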

FIG. 11 shows the result of estimating the gene expression status for the data of FIG. 9. The plot symbols in FIG. 11 have the same meaning as those in FIG. 10.

Comparing FIG. 10 with FIG. 11 shows that the gene expression state estimated from the gene expression data corrected with the nonparametric smoothing curve and the gene expression state estimated from the gene expression data replaced with order statistics are almost the same.

FIG. 1 is a block diagram showing the configuration of the analysis system according to the first embodiment of the present invention.
FIG. 2 is a processing flow diagram of the analysis system of FIG. 1.
FIG. 3 is a block diagram showing the configuration of the analysis system according to the second embodiment of the present invention.
FIG. 4 is a processing flow diagram of the analysis system of FIG. 3.
FIG. 5 is a block diagram showing the configuration of the analysis system according to the third embodiment of the present invention.
FIG. 6 is a processing flow diagram of the analysis system of FIG. 5.
FIG. 7 is a block diagram showing the configuration of the analysis system according to the fourth embodiment of the present invention.
FIG. 8 is an SD plot of sample A and sample B.
FIG. 9 is an SD plot in which the gene expression data of sample B have been replaced with the order statistics of sample A.
FIG. 10 is a diagram showing the estimated gene expression status after correction by nonparametric smoothing.
FIG. 11 is a diagram showing the estimated gene expression status after replacement with the order statistics of sample A.

Explanation of symbols

1 Input device
2 Data analysis device
3 Output device
4 Recording medium
5 Data analysis device
21 Order statistic calculation means
22 Rank data generation means
23 Data conversion means
24 Gene expression estimation means
25 Expression variation graphing means
26 Statistical test means

Claims

An analysis system for gene expression data of a plurality of samples, comprising an input device that inputs gene expression data of a plurality of samples (gene expression data sets), a data analysis device that operates under program control, and an output device that outputs analysis results,
wherein the data analysis device comprises: order statistic calculation means for calculating the order statistics of the gene expression data set of one sample among N gene expression data sets supplied from the input device; rank data generation means for assigning ranks to the gene expression data of each of (N-1) samples and generating (N-1) rank data sets; data conversion means for converting the gene expression data sets using the calculated order statistics and the generated (N-1) rank data sets; and gene expression estimation means for estimating the gene expression status using the converted (N-1) gene expression data sets.

The analysis system for gene expression data of a plurality of samples according to claim 1, wherein, where the k-th of the N gene expression data sets holds the expression level of the i-th gene for each gene i, the order statistic calculation means calculates the order statistics (the expression levels arranged in order of magnitude) of any one gene expression data set C_k among the N gene expression data sets input from the input device.

The analysis system for gene expression data of a plurality of samples according to claim 2, wherein the rank data generation means assigns ranks to the gene expression levels within each of the (N-1) gene expression data sets other than the gene expression data set C_k used for the order statistics, and generates (N-1) rank data sets, the j-th of which holds the rank of the expression level of the i-th gene in the j-th gene expression data set.

The analysis system for gene expression data of a plurality of samples according to claim 3, wherein the data conversion means converts the (N-1) gene expression data sets by Equation 1, using the (N-1) rank data sets and the order statistics, Equation 1 being expressed in terms of the m-th order statistic of C_k, and generates (N-1) converted gene expression data sets, the j-th of which holds the value obtained by replacing the expression level of the i-th gene in the j-th gene expression data set with an order statistic of the k-th gene expression data set.
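The rank-based conversion recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy expression values, and the rule that ties are broken by gene index are all assumptions made for illustration, and the claim's own conversion equation is not reproduced here.

```python
from typing import List, Sequence


def order_statistics(reference: Sequence[float]) -> List[float]:
    """Expression levels of the reference sample (the role of C_k), sorted ascending."""
    return sorted(reference)


def ranks(sample: Sequence[float]) -> List[int]:
    """1-based rank of each gene's expression level within one sample.

    Assumption: ties are broken by gene index (Python's sort is stable).
    """
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    result = [0] * len(sample)
    for rank, gene in enumerate(order, start=1):
        result[gene] = rank
    return result


def convert(sample: Sequence[float], reference_stats: Sequence[float]) -> List[float]:
    """Replace each expression level with the reference order statistic of the same rank."""
    if len(sample) != len(reference_stats):
        raise ValueError("sample and reference must cover the same genes")
    return [reference_stats[r - 1] for r in ranks(sample)]


# Hypothetical toy data: one reference sample and N-1 = 2 other samples, 4 genes each.
reference = [5.0, 1.0, 3.0, 9.0]
others = [[20.0, 40.0, 10.0, 30.0], [0.2, 0.1, 0.4, 0.3]]

stats = order_statistics(reference)
converted = [convert(sample, stats) for sample in others]
print(converted)
```

After conversion every sample takes exactly the values of the reference sample, so the samples share one distribution without a separate correction step; only the gene-to-rank assignment differs between samples.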

The analysis system for gene expression data of a plurality of samples according to claim 4, wherein the gene expression estimation means calculates the posterior probability of the gene expression state using C_k and the (N-1) gene expression data sets converted by the data conversion means.

The analysis system for gene expression data of a plurality of samples according to claim 5, wherein the estimation of the gene expression status is performed for all combinations (N-1 times) of a converted gene expression data set and C_k.

The analysis system for gene expression data of a plurality of samples according to claim 6, wherein the gene expression estimation means uses the p × (N-1) posterior probabilities of the expression states obtained for sample k to calculate the average of the posterior probabilities for all genes, the average being computed by an averaging operation on a vector x.
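The averaging step in this claim can be sketched as below. The sketch assumes that the p × (N-1) posterior probabilities are arranged as one row per gene and one column per comparison with C_k, and that the vector being averaged is a gene's row; the claim's own equation is not reproduced in the text, so this layout, the function name, and the numbers are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def mean_posterior(posteriors: Sequence[Sequence[float]]) -> List[float]:
    """Average each gene's posterior probabilities over the N-1 comparisons.

    posteriors[i][j] is the posterior probability of the expression state of
    gene i obtained from the comparison of converted data set j with C_k.
    """
    return [fmean(row) for row in posteriors]


# Hypothetical values for p = 3 genes and N-1 = 2 comparisons.
posteriors = [
    [1.0, 0.5],
    [0.25, 0.75],
    [0.0, 0.5],
]
averages = mean_posterior(posteriors)
print(averages)  # [0.75, 0.5, 0.25]
```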

An analysis system for gene expression data of a plurality of samples, comprising an input device that inputs gene expression data of a plurality of samples (gene expression data sets), a data analysis device that operates under program control, and an output device that outputs analysis results,
wherein the data analysis device comprises: order statistic calculation means for calculating the order statistics of the gene expression data set of one sample among N gene expression data sets supplied from the input device; rank data generation means for assigning ranks to the gene expression data of each of (N-1) samples; data conversion means for converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets; and expression variation graphing means for generating, using the converted (N-1) gene expression data sets, a graph of the variation in expression of gene j as the sample is changed.

The analysis system for gene expression data of a plurality of samples according to claim 8, wherein the expression variation graphing means generates, using the (N-1) converted gene expression data sets and C_k, a graph of the variation in the expression level of gene j as the sample is changed, and sends the graph image data to the output device.

An analysis system for gene expression data of a plurality of samples, comprising an input device that inputs gene expression data of a plurality of samples (gene expression data sets), a data analysis device that operates under program control, and an output device that outputs analysis results,
wherein the data analysis device comprises: order statistic calculation means for calculating the order statistics of the gene expression data set of one sample among N gene expression data sets supplied from the input device; rank data generation means for assigning ranks to the gene expression data of each of (N-1) samples; data conversion means for converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets; and statistical test means for extracting, from the converted (N-1) gene expression data sets and C_k by a predetermined statistical test method, genes whose expression levels differ significantly between groups, and for sending the extracted data to the output device.

The analysis system for gene expression data of a plurality of samples according to claim 10, wherein the statistical test method is a t-test or a U test.
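The two tests named in this claim can be illustrated for a single gene with the sketch below. It computes only the test statistics; turning them into p-values requires the corresponding reference distributions, which a statistics library would normally supply. The choice of Welch's unequal-variance form of the t statistic, the grouping of samples, and the expression values are assumptions for illustration, not details taken from the patent.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two groups (assumed variant; needs >= 2 values per group)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """Mann-Whitney U statistic for group a: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Hypothetical converted expression levels of one gene in two groups of samples.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]

t_stat = welch_t(group_a, group_b)
u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(t_stat, u_a, u_b)
```

The U test depends only on the ordering of the values, so it fits naturally with rank-replaced data; the t statistic uses the replaced values themselves.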

A method for analyzing gene expression data of a plurality of samples, comprising a step of inputting gene expression data of a plurality of samples (gene expression data sets), a step of analyzing the gene expression data, and a step of outputting the analysis result,
wherein the data analysis step includes:
a step of calculating the order statistics of the gene expression data set of one sample among the N gene expression data sets input in the input step; a step of assigning ranks to the gene expression data of each of (N-1) samples and generating (N-1) rank data sets; a step of converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets; and a step of estimating the gene expression status using the converted (N-1) gene expression data sets.

The method for analyzing gene expression data of a plurality of samples according to claim 12, wherein, where the k-th of the N gene expression data sets holds the expression level of the i-th gene for each gene i, the order statistic calculation step includes a step of calculating the order statistics (the expression levels arranged in order of magnitude) of any one gene expression data set C_k among the N gene expression data sets input in the input step.

The method for analyzing gene expression data of a plurality of samples according to claim 13, wherein the rank data generation step includes a step of assigning ranks to the gene expression levels within each of the (N-1) gene expression data sets other than the gene expression data set C_k used for the order statistics, and generating (N-1) rank data sets, the j-th of which holds the rank of the expression level of the i-th gene in the j-th gene expression data set.

The method for analyzing gene expression data of a plurality of samples according to claim 14, wherein the data conversion step includes a step of converting the (N-1) gene expression data sets by Equation (3), using the (N-1) rank data sets and the order statistics, Equation (3) being expressed in terms of the m-th order statistic of C_k, and generating (N-1) converted gene expression data sets, the j-th of which holds the value obtained by replacing the expression level of the i-th gene in the j-th gene expression data set with an order statistic of the k-th gene expression data set.

The method for analyzing gene expression data of a plurality of samples according to claim 15, wherein the gene expression estimation step includes a step of calculating the posterior probability of the gene expression state using C_k and the (N-1) gene expression data sets converted in the data conversion step.

The method for analyzing gene expression data of a plurality of samples according to claim 16, wherein the estimation of the gene expression status is performed for all combinations (N-1 times) of a converted gene expression data set and C_k.

The method for analyzing gene expression data of a plurality of samples according to claim 17, wherein the gene expression estimation step further includes a step of calculating the average of the posterior probabilities for all genes using the p × (N-1) posterior probabilities of the expression states obtained for sample k.

A method for analyzing gene expression data of a plurality of samples, comprising a step of inputting gene expression data of a plurality of samples (gene expression data sets), a step of analyzing the gene expression data, and a step of outputting the analysis result,
wherein the data analysis step includes:
a step of calculating the order statistics of the gene expression data set of one sample among the N gene expression data sets that were input;
a step of assigning ranks to the gene expression data of each of (N-1) samples and generating (N-1) rank data sets;
a step of converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets; and
a step of generating, using the converted (N-1) gene expression data sets, a graph of the variation in expression of gene j as the sample is changed.

The method for analyzing gene expression data of a plurality of samples according to claim 19, wherein the expression variation graphing step generates, using the (N-1) converted gene expression data sets and C_k, a graph of the variation in the expression level of gene j as the sample is changed, and sends the graph image data to the output device.

A method for analyzing gene expression data of a plurality of samples, comprising a step of inputting gene expression data of a plurality of samples (gene expression data sets), a step of analyzing the gene expression data, and a step of outputting the analysis result,
wherein the data analysis step includes:
a step of calculating the order statistics of the gene expression data set of one sample among the N gene expression data sets that were input;
a step of assigning ranks to the gene expression data of each of (N-1) samples and generating (N-1) rank data sets;
a step of converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets; and
a statistical test step of extracting, from the converted (N-1) gene expression data sets and C_k by a predetermined statistical test method, genes whose expression levels differ significantly between groups, and sending the extracted data to the output device.

The analysis method for gene expression data of a plurality of samples according to claim 21, wherein the statistical test method is a t-test or a U test.

A gene expression data analysis program for causing a computer to execute:
a step of inputting gene expression data of a plurality of samples (gene expression data sets);
a step of calculating the order statistics of the gene expression data set of one sample among the N gene expression data sets input in the input step;
a step of assigning ranks to the gene expression data of each of (N-1) samples and generating (N-1) rank data sets;
a step of converting the gene expression data sets using the calculated order statistics and the (N-1) rank data sets;
a step of estimating the gene expression status using the converted (N-1) gene expression data sets; and
a step of outputting the analysis result.

A computer-readable information recording medium (including a compact disk, a flexible disk, a hard disk, a magneto-optical disk, a digital video disk, a magnetic tape, or a semiconductor memory) on which the program according to claim 23 is recorded.