JPWO2003070938A1

JPWO2003070938A1 - GENE EXPRESSION INFORMATION ANALYSIS DEVICE, GENE EXPRESSION INFORMATION ANALYSIS METHOD, PROGRAM, AND RECORDING MEDIUM

Info

Publication number: JPWO2003070938A1
Application number: JP2003569831A
Authority: JP
Inventors: 信和小野; 高原　義之; 義之高原; 慶偉張; 田中　博; 博田中
Original assignee: Ajinomoto Co Inc
Current assignee: Ajinomoto Co Inc
Priority date: 2002-02-21
Filing date: 2003-02-21
Publication date: 2005-06-09
Anticipated expiration: 2023-02-21
Also published as: JP4438414B2; AU2003211240A1; WO2003070938A1

Abstract

本発明は、ＤＮＡマイクロアレイ、および、ＤＮＡチップ実験で測定した比較群と対照群の遺伝子発現量の蛍光測定値を新規の数学モデルに基づき補正する。そして、本発明は、補正したスキャッタープロットを基に、遺伝子の蛍光強度と比例する軸をＸ軸とする新たなＸ−Ｙ軸系を構築する。そして、本発明は、次に、Ｘ軸に一定の遺伝子数で構成されるウィンドウを設定し、各ウィンドウ内でスチューデントのｔ−分布に基づいた任意危険率の信頼限界点を求める。そして、本発明は、ついで、Ｘ軸方向に一定遺伝子ずつウィンドウを移動させ、各信頼限界点を求める。そして、本発明は、求めた複数の信頼限界点を平滑化（スプライン曲線）により補完し、発現変動信頼曲線とする。そして、本発明は、得られた発現変動信頼曲線の外側に位置する遺伝子を変動遺伝子として抽出する。The present invention corrects the fluorescence measurement value of the gene expression level of the comparison group and the control group measured by the DNA microarray and the DNA chip experiment based on a novel mathematical model. Then, the present invention constructs a new XY axis system having the X axis as the axis proportional to the fluorescence intensity of the gene, based on the corrected scatter plot. Then, in the present invention, a window composed of a certain number of genes is set on the X-axis, and a confidence limit point of an arbitrary risk rate based on a student's t-distribution is obtained in each window. Then, in the present invention, each confidence limit point is obtained by moving the window by a certain gene in the X-axis direction. In the present invention, the obtained plurality of confidence limit points are complemented by smoothing (spline curve) to obtain an expression fluctuation confidence curve. In the present invention, genes located outside the obtained expression variation confidence curve are extracted as variation genes.

Description

技術分野
本発明は、遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体に関し、特に、ＤＮＡマイクロアレイやＤＮＡチップなどの測定値データのバックグラウンド補正を行い、発現量が変化した遺伝子を統計的に高い信頼度で抽出することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体に関する。
背景技術
分子生物学の研究、新薬の研究開発、臨床診断などにおいて、メッセンジャーＲＮＡの発現量が変化した遺伝子を探索すること、および、その遺伝子を同定することは非常に重要である。そこで、現在ＲＮＡレベルでの発現変化を調べる方法として、メッセンジャーＲＮＡから逆転写酵素を用いて逆転写したｃＤＮＡ断片をスライドガラス上に高密度に固定化したＤＮＡマイクロアレイ、および、微細加工技術を用いて多種類のオリゴヌクレオチドを基板上に合成したアフィメトリクス社（会社名）のＤＮＡチップ（商品名）が注目を集め、利用されている。
これらのＤＮＡマイクロアレイやＤＮＡチップを用いた発現遺伝子解析法は、数百から数万遺伝子に対して一度に網羅的に発現量が変動した遺伝子を同定するのに有効であり、現在、一般的に測定値の補正方法は、バックグラウンド補正工程、および、ノーマライズ工程とよばれる大きく二つの工程を含んでいる。
バックグラウンド補正工程では、単純に個々の測定値からブランクのスポットの平均バックグラウンド値、あるいは、各スポットの周囲の領域のバックグラウンド値を、スポットの蛍光強度測定値から引くことによってバックグラウンドの補正を行なう方法が主に用いられている。
一方、ノーマライズ工程は、最小自乗法やＬｏｗｅｓｓ平滑化（近傍領域に対応してバンド幅を用いた局所二次推定量）などで求めたノンパラメトリック回帰直線を蛍光強度散布図（スキャッタープロット）のＹ＝Ｘ直線に変換する係数で全ての遺伝子の測定値を補正する手法を用いている。
しかしながら、ＤＮＡマイクロアレイやＤＮＡチップを用いた発現遺伝子解析法は、信頼度の高い測定値の解析手法が確立されていないという問題点を有していた。以下この問題点について具体的に説明する。
まず、従来の補正法は、測定装置、標本間の誤差、および、蛍光標識効率などの違いにより容易に影響を受けるという問題点を有している。また、ノーマライズ工程においては、最小自乗法は厳密には回帰直線が２本引けてしまい、一方、Ｌｏｗｅｓｓ平滑化（ＤｕｄｏｉｔＳ，ＹａｎｇＹＨ，ＣａｌｌｏｗＭＪ，ＳｐｅｅｄＴＰ（２０００）ＳｔａｔｉｓｔｉｃａｌｍｅｔｈｏｄｓｆｏｒｉｄｅｎｔｉｆｙｉｎｇｄｉｆｆｅｒｅｎｔｉａｌｌｙｅｘｐｒｅｓｓｅｄｇｅｎｅｓｉｎｒｅｐｌｉｃａｔｅｄｃＤＮＡｍｉｃｒｏａｒｒａｙｅｘｐｅｒｉｍｅｎｔｓ．Ｔｅｃｈｎｉｃａｌｒｅｐｏｒｔ，ＤｅＰａｒｔｍｅｎｔｏｆＳｔａｔｉｓｔｉｃｓ，ＵＣ−Ｂｅｒｋｅｌｅｙ．ｈｔｔｐ：／／ｗｗｗ．ｓｔａｔ．ｂｅｒｋｅｌｅｙ．ｅｄｕ／ｕｓｅｒｓ／ｔｅｒｒｙ／ｚａｒｒａｙ／Ｈｔｍｌ／ｐａｐｅｒｓｉｎｄｅｘ．ｈｔｍｌ等）は経験則に基づく正規化処理であり根拠のないものに過ぎないという問題点を有していた。
さらに、発現量が変動した遺伝子の抽出法においては、従来の基準では任意の倍率以上の補正蛍光強度比を示した遺伝子を発現に差がある遺伝子として抽出しており、その基準となる倍率は、無根拠に２倍、３倍などに設定されていた（ＣｈｅｎＹ，ＤｏｕｇｈｅｒｔｙＥＲ，ＢｉｔｔｎｅｒＭＬ（１９９７）Ｒａｔｉｏ−ｂａｓｅｄｄｅｃｉｓｉｏｎｓａｎｄｔｈｅｑｕａｎｔｉｔａｔｉｖｅａｎａｌｙｓｉｓｏｆｃＤＮＡｍｉｃｒｏａｒｒａｙｉｍａｇｅｓ．ＪＢｉｏｍｅｄＯｐｔ２：３６４−３７４、ＳｕｓａｎＧ．Ｈｉｌｓｅｎｂｅｃｋ，ｅｔｃ．（１９９９）Ｓｔａｔｉｓｔｉｃａｌａｎａｌｙｓｉｓｏｆａｒｒａｙｅｘｐｒｅｓｓｉｏｎｄａｔａａｓａｐｐｌｉｅｄｔｏｔｈｅｐｒｏｂｌｅｍｏｆｔａｍｏｘｉｆｅｎｒｅｓｉｓｔａｎｃｅ．ＪｏｕｒｎａｌｏｆｔｈｅＮａｔｉｏｎａｌＣａｎｃｅｒＩｎｓｔｉｔｕｔｅ，Ｖｏｌ．９１，Ｎｏ．５等）という問題点を有していた。
一方、誤差モデルや遺伝子発現の確率分布を仮定して、最適化により遺伝子の検出を行なう手法（ＣｈｅｎＹ，ＤｏｕｇｈｅｒｔｙＥＲ，ＢｉｔｔｎｅｒＭＬ（１９９７）Ｒａｔｉｏ−ｂａｓｅｄｄｅｃｉｓｉｏｎｓａｎｄｔｈｅｑｕａｎｔｉｔａｔｉｖｅａｎａｌｙｓｉｓｏｆｃＤＮＡｍｉｃｒｏａｒｒａｙｉｍａｇｅｓ．ＪＢｉｏｍｅｄＯｐｔ２：３６４−３７４、ＮｅｗｔｏｎＭＡ，ＫｅｎｄｚｉｏｒｓｋｉＣＭ，ＲｉｃｈｍｏｎｄＣＳ，ＢｌａｔｔｎｅｒＦＲ，ＴｓｕｉＫＷ（２００１）Ｏｎｄｉｆｆｅｒｅｎｔｉａｌｖａｒｉａｂｉｌｉｔｙｏｆｅｘｐｒｅｓｓｉｏｎｒａｔｉｏｓ：Ｉｍｐｒｏｖｉｎｇｓｔａｔｉｓｔｉｃａｌｉｎｆｅｒｅｎｃｅａｂｏｕｔｇｅｎｅｅｘｐｒｅｓｓｉｏｎｃｈａｎｇｅｓｆｒｏｍｍｉｃｒｏａｒｒａｙｄａｔａ．ＪＣｏｍｐＢｉｏｌ８：３７−５２．等）もいくつか開発されているが、これらの手法は、安定性と再現性に乏しく、必ずしも実用レベルまで達していないという問題点を有していた。また、理想的な検出信頼度を得るために、実験を何回繰り返せばいいという実験の指針となる統計表も存在しないため、実験の繰り返し回数と検出感度と検出信頼度の関係は明らかにされていない。
従って、本発明は、ＤＮＡマイクロアレイ、および、ＤＮＡチップを用いた発現遺伝子解析法において、遺伝子の発現量を確実に比較するための一般式を提供し、実際のデータ分布に合わせた頑健（ロバスト）な信頼度の高い発現変動遺伝子の抽出法を提供することを目的としている。
発明の開示
本発明にかかる遺伝子発現情報解析装置は、２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成するバックグラウンド補正手段と、上記バックグラウンド補正手段によりバックグラウンド補正された上記輝度データの対数をＸ−Ｙ軸にとり蛍光強度散布図を作成し、各遺伝子のスポットについて蛍光強度平衡軸に対するバイアスを求め、上記輝度データから当該バイアスを除去することにより上記蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するバイアス補正手段と、上記バイアス補正手段により構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出する遺伝子検出手段とを備えたことを特徴とする。
この装置によれば、ＤＮＡマイクロアレイやＤＮＡチップなどにより２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成する。ここで、個々のスポットの蛍光強度測定値からブランクのスポットの蛍光強度測定値の平均をバックグラウンド値として用いてもよく、あるいは、各スポットの周囲の領域のブランクの蛍光強度測定値の平均値をバックグラウンド値として用いてもよい。また、これ以外のいかなる方法によりバックグラウンド補正を行ってもよい。
また、本装置によれば、バックグラウンド補正された輝度データの対数（自然対数または２の対数等）をＸ−Ｙ軸にとり蛍光強度散布図（スキャッタープロット）を作成し、各遺伝子のスポットについて同じ蛍光強度を示す蛍光強度平衡軸（すなわち、各遺伝子のスポットについて、２つの条件で発現量が同等である遺伝子集団より得られた漸近線）に対するバイアスを求め、輝度データから当該バイアスを除去することにより蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、より多くのバイアスを含む蛍光成分の判定を行い、このバイアスを除去した上で蛍光強度平衡軸と発現量の倍数軸とを２軸とする新しい直行軸系を構築することができるようになる。
また、本装置によれば、構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出するので、従来の遺伝子検出法に比べて、測定装置、標本間の誤差、および、蛍光標識効率などの違いの影響を受けずに正確に発現量が変動した遺伝子を検出することができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記バイアス補正手段は、発現量が多い遺伝子集団の対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求める第一主成分作成手段と、上記第一主成分作成手段により求めた上記漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算する座標回転手段と、上記座標回転手段による座標軸回転後の上記発現量が少ない遺伝子集団の座標を用いて、上記蛍光強度平衡軸の傾きを計算し、計算された傾きに基づいて２つの条件の上記輝度データのうちどちらに上記バイアスが多く含まれているかを判定するバイアス判定手段と、上記バイアス判定手段にて上記バイアスが多く含まれていると判定された条件の上記輝度データから上記バイアスを差し引くことにより上記蛍光強度平衡軸と上記発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築する補正プロット生成手段とをさらに備えたことを特徴とする。
これはバイアス補正手段の一例を一層具体的に示すものである。この装置によれば、ＤＮＡ濃度希釈系列の品質管理用のコントロール遺伝子サンプル（例えば外部遺伝子λＤＮＡサンプル、あるいは発現量がほとんど変わらないリボソームなどのＨｏｕｓｅ−ｋｅｅｐｉｎｇ遺伝子サンプル）を目的遺伝子サンプルと同時に測定し、蛍光強度データの積の一番小さい遺伝子から順に一つずつコントロール遺伝子を除き、残りすべてのコントロール遺伝子サンプルのデータから遺伝子の発現量とＤＮＡ量の検量線をそれぞれ作成し、データの相関係数を計算し、順番に計算される上記の相関係数が最初に強い相関が認められる基準（例えば０．８以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値１とし、二つの条件における蛍光強度データの積が閾値１を上回るすべての遺伝子サンプルの集団を発現量が多い遺伝子集団とし、上記発現量が順番に計算される相関係数度が最初に弱い相関が認められる基準（例えば０．５以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値２とし（ただし、閾値２＜閾値１）、二つの条件における蛍光強度データの積が閾値２を下回るすべての遺伝子サンプルの集団を発現量が少ない遺伝子集団とし、発現量が多い遺伝子集団の蛍光強度対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求め、求めた漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算し、座標軸回転後の発現量が少ない遺伝子集団の座標を用いて、蛍光強度平衡軸の傾きを計算し、計算された傾き（例えば、正、負、ゼロ等）に基づいて２つの条件の輝度データのうちどちらにバイアスが多く含まれているかを判定し、バイアスが多く含まれていると判定された条件の輝度データからバイアスを差し引くこと（例えば、一定のバイアスをもつ遺伝子集団について座標を回転させる等）により蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、実測値のバイアスを効率的に除去し、かつ、データの性質を明白に表現できる蛍光強度散布図を作成することができるようになる。
なお、本装置は軸回転後にバイアスの大小を判定するものに限定されず、例えば、軸回転の前にも高発現漸近線と低発現漸近線の傾きを比較することにより、バイアスの大小を判定してもよい。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記主成分分析は、分散・共分散行列を用いて行うことを特徴とする。
これは主成分分析の一例を一層具体的に示すものである。この装置によれば、主成分分析は、分散・共分散行列を用いて行うので、従来から発現遺伝子解析に用いられている相関行列を用いた主成分分析法と比較して正規化を要しないため、効率的に主成分分析を行うことができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記遺伝子検出手段は、上記蛍光強度平衡軸方向に予め定めた区間内のウィンドウを設定するウィンドウ設定手段と、上記ウィンドウ設定手段により設定された各ウィンドウ内において信頼限界点を決定する信頼限界点決定手段と、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動するウィンドウ移動手段と、上記ウィンドウ移動手段により移動した各ウィンドウについて上記信頼限界点決定手段により各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成手段と、上記信頼境界線作成手段により作成された上記信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出する変動遺伝子抽出手段とをさらに備えたことを特徴とする。
これは遺伝子検出手段の一例を一層具体的に示すものである。この装置によれば、予め定めた区間内のウィンドウを設定し、設定された各ウィンドウ内において遺伝子の輝度データの平均値、標準偏差、Ｐ値（例えば、ｐ＝０．０５）、重心などのうち少なくとも一つを用いて信頼限界点を決定する。そして、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動し、移動した各ウィンドウについて各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成手段と、上記信頼境界線作成手段により作成された上記信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出するので、安定性、再現性、および、信頼度の高い発現遺伝子抽出を行うことができるようになる。
また、これにより、誤差の範囲が異なる実験データであっても、その誤差に応じて、発現量変動倍率の閾値が決められるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記信頼限界点決定手段は、シミュレーションにより得られた重複データの検定統計表に基づき、ｔ−分布を用いて上記信頼限界点を決定することを特徴とする。
これは信頼限界点決定の一例を一層具体的に示すものである。この装置によれば、シミュレーションにより得られた重複データの検定統計表に基づき、ｔ−分布を用いて信頼限界点を決定するので、従来手法と比較して正確かつ効率的に信頼限界点を求めることができるようになる。また、この重複データの検定表によると実験設計の段階で必要となる重複実験の数を求められる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記信頼境界線作成手段は、上記複数の信頼限界点に基づいてスプライン曲線を作成することにより平滑化を行い上記信頼境界線を作成することを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。この装置によれば、複数の信頼限界点に基づいてスプライン曲線を作成することにより平滑化を行い信頼境界線を作成するので、効率的に信頼限界点を補完して信頼曲線を作成することができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記信頼境界線作成手段は、蛍光強度の高い領域については、最後の上記ウィンドウで求めた信頼限界点の水平延長線を用いて上記信頼限界線を作成することを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。この装置によれば、蛍光強度の高い領域については、最後のウィンドウ（最も右側にあるウィンドウ）で求めた信頼限界点のＸ軸に対する水平延長線を用いて信頼限界線を作成するので、傾きが少なくどちらに収束するか判断不能の場合であっても、適切な信頼限界線を作成することができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記信頼境界線作成手段は、蛍光強度の低い領域については、各ウィンドウで求めた信頼限界点から最小二乗法により求めた漸近線の補外を上記信頼限界線として用いることを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。この装置によれば、蛍光強度の低い領域については、例えば、最初から数十程度の各ウィンドウで求めた信頼限界点から最小二乗法により求めた漸近線の補外を上記信頼限界線として用いるので、蛍光強度が低い遺伝子のスポットについても的確に検出することができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、利用者にウィンドウ内の遺伝子数を入力させる遺伝子数入力手段をさらに備え、上記ウィンドウ設定手段は、上記遺伝子数入力手段により入力された上記遺伝子数の上記遺伝子が含まれる上記区間内で上記ウィンドウを設定することを特徴とする。
これはウィンドウ設定の一例を一層具体的に示すものである。この装置によれば、利用者にウィンドウ内の遺伝子数を入力させ、入力された遺伝子数の遺伝子が含まれる区間内でウィンドウを設定するので、実験毎に利用者が設定する遺伝子数を変動させることができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、利用者に信頼限界値を入力させる信頼限界値入力手段をさらに備え、上記信頼限界点決定手段は、上記ウィンドウ内において上記信頼限界値入力手段により入力された上記信頼限界値に基づいて上記信頼限界点を決定することを特徴とする。
これは信頼限界点決定の一例を一層具体的に示すものである。この装置によれば、利用者に信頼限界値を入力させ、ウィンドウ内において入力された信頼限界値に基づいて信頼限界点を決定するので、実験毎に利用者が設定する信頼限界値を変動させることができ、各実験の誤差を適切な範囲に収めることができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、利用者に、上記変動しない遺伝子の分布の形、上記変動遺伝子の分布の形、上記変動遺伝子の検出基準、実験の繰り返し数、および、シミュレーション回数のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させるシミュレーション条件設定手段と、上記シミュレーション条件設定手段にて設定された上記シミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、上記遺伝子検出手段を実行し、上記発現遺伝子を検出するシミュレーションを複数回実行し、上記検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、上記シミュレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成するシミュレーション実行手段と、上記シミュレーション条件毎に、上記シミュレーション実行手段によるシミュレーション結果を出力するシミュレーション結果出力手段とをさらに備えたことを特徴とする。
この装置によれば、利用者に、変動しない遺伝子の分布の形（例えば、分布の標準偏差（例えば、発現が変わらない遺伝子の分布を標準正規分布として標準偏差σ＝１、中心μ＝０としたときに、標準偏差σの幅を０．１から１．５の範囲で設定する））、上記変動遺伝子の分布の形（例えば、中心（例えば、当該条件のときに、中心μの幅を０．４から３の範囲で設定する））、上記変動遺伝子の検出基準（例えば、全体数からみた検出された遺伝子の割合を、２／３、２／４、３／４、３／６、４／６などで設定する）、実験の繰り返し数、および、シミュレーション回数（例えば、３回から１０回の範囲で設定する）のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させ、設定されたシミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、遺伝子検出を実行し、発現遺伝子を検出するシミュレーションを複数回実行し、上記検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、シミュレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成し、シミュレーション条件毎に、シミュレーション実行によるシミュレーション結果を出力するので、様々な条件におけるシミュレーション結果を組み合わせることにより上記の組み合わせによる検出力と検出信頼度を知ることができる。すなわち、同じ条件の対照実験を繰り返して行い、得られたそれぞれ異なったデータセットに対して変動遺伝子の検出を行い、あらかじめ決めた回数以上検出される遺伝子のみを選択することにより、期待通りの信頼度あるいは検出力で変動遺伝子を検出できるようになる。
また、これにより、発現量が変わらない遺伝子が変動遺伝子として検出されたエラー（第一種のエラー）や、変動遺伝子が発現が変わらない遺伝子として検出されたエラー（第二種のエラー）を算出して比較することにより、シミュレーションのデータから上記の手法による変動遺伝子を検出する検出力と信頼度を把握でき、実際の実験データに対して、期待される検出力と信頼度を得るために、実験の繰り返し数と変動遺伝子の検出基準、および信頼限界点の組み合わせを設定することができる。
また、これにより、何回実験を行えば、正確な実験データを取ることができるかを予測することが可能になり、実験効率を著しく向上させることができるようになる。
つぎの発明にかかる遺伝子発現情報解析装置は、上記に記載の遺伝子発現情報解析装置において、上記遺伝子検出手段は、各スポットの偏差値を計算する偏差値計算手段をさらに備えたことを特徴とする。
これは遺伝子検出の一例を一層具体的に示すものである。この装置によれば、各スポットの偏差値を計算するので、このように計算された各スポットの偏差値を変動比率（倍率）の代わりに用いることで、スライド間の誤差の差異に影響されない解析が可能になる。
また、これにより、本装置により計算される偏差値を、クラスター解析に代表される多変量解析において変動比率の対数や正規化した変動比率の変わりに用いることができ、発現量の大小による誤差の影響の違いに左右されない解析が可能になる。
また、本発明は遺伝子発現情報解析方法に関するものであり、本発明にかかる遺伝子発現情報解析方法は、２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成するバックグラウンド補正ステップと、上記バックグラウンド補正ステップによりバックグラウンド補正された上記輝度データの対数をＸ−Ｙ軸にとり蛍光強度散布図を作成し、各遺伝子のスポットについて蛍光強度平衡軸に対するバイアスを求め、上記輝度データから当該バイアスを除去することにより上記蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するバイアス補正ステップと、上記バイアス補正ステップにより構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出する遺伝子検出ステップとを含むことを特徴とする。
この方法によれば、ＤＮＡマイクロアレイやＤＮＡチップなどにより２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成する。ここで、個々のスポットの蛍光強度測定値からブランクのスポットの蛍光強度測定値の平均をバックグラウンド値として用いてもよく、あるいは、各スポットの周囲の領域のブランクの蛍光強度測定値の平均値をバックグラウンド値として用いてもよい。また、これ以外のいかなる方法によりバックグラウンド補正を行ってもよい。
また、本方法によれば、バックグラウンド補正された輝度データの対数（自然対数または２の対数等）をＸ−Ｙ軸にとり蛍光強度散布図（スキャッタープロット）を作成し、各遺伝子のスポットについて同じ蛍光強度を示す蛍光強度平衡軸に対するバイアスを求め、輝度データから当該バイアスを除去することにより蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、より多くのバイアスを含む蛍光成分の判定を行い、このバイアスを除去した上で蛍光強度平衡軸と発現量の倍数軸とを２軸とする新しい直行軸系を構築することができるようになる。
また、本方法によれば、構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出するので、従来の遺伝子検出法に比べて、測定方法、標本間の誤差、および、蛍光標識効率などの違いの影響を受けずに正確に発現量が変動した遺伝子を検出することができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記バイアス補正ステップは、発現量が多い遺伝子集団の対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求める第一主成分作成ステップと、上記第一主成分作成ステップにより求めた上記漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算する座標回転ステップと、上記座標回転ステップによる座標軸回転後の上記発現量が少ない遺伝子集団の座標を用いて、上記蛍光強度平衡軸の傾きを計算し、計算された傾きに基づいて２つの条件の上記輝度データのうちどちらに上記バイアスが多く含まれているかを判定するバイアス判定ステップと、上記バイアス判定ステップにて上記バイアスが多く含まれていると判定された条件の上記輝度データから上記バイアスを差し引くことにより上記蛍光強度平衡軸と上記発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築する補正プロット生成ステップとをさらに含むことを特徴とする。
これはバイアス補正ステップの一例を一層具体的に示すものである。この方法によれば、ＤＮＡ濃度希釈系列の品質管理用のコントロール遺伝子サンプル（例えば外部遺伝子λＤＮＡサンプル、あるいは発現量がほとんど変わらないリボソームなどのＨｏｕｓｅ−ｋｅｅｐｉｎｇ遺伝子サンプル）を目的遺伝子サンプルと同時に測定し、蛍光強度データの積の一番小さい遺伝子から順に一つずつコントロール遺伝子を除き、残りすべてのコントロール遺伝子サンプルのデータから遺伝子の発現量とＤＮＡ量の検量線をそれぞれ作成し、データの相関係数を計算し、順番に計算される上記の相関係数が最初に強い相関が認められる基準（例えば０．８以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値１とし、二つの条件における蛍光強度データの積が閾値１を上回るすべての遺伝子サンプルの集団を発現量が多い遺伝子集団とし、上記発現量が順番に計算される相関係数度が最初に弱い相関が認められる基準（例えば０．５以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値２とし（ただし、閾値２＜閾値１）、二つの条件における蛍光強度データの積が閾値２を下回るすべての遺伝子サンプルの集団を発現量が少ない遺伝子集団とし、発現量が多い遺伝子集団の蛍光強度対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求め、求めた漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算し、座標軸回転後の発現量が少ない遺伝子集団の座標を用いて、蛍光強度平衡軸の傾きを計算し、計算された傾き（例えば、正、負、ゼロ等）に基づいて２つの条件の輝度データのうちどちらにバイアスが多く含まれているかを判定し、バイアスが多く含まれていると判定された条件の輝度データからバイアスを差し引くこと（例えば、一定のバイアスをもつ遺伝子集団について座標を回転させる等）により蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、実測値のバイアスを効率的に除去し、かつ、データの性質を明白に表現できる蛍光強度散布図を作成することができるようになる。
なお、本方法は軸回転後にバイアスの大小を判定するものに限定されず、例えば、軸回転の前にも高発現漸近線と低発現漸近線の傾きを比較することにより、バイアスの大小を判定してもよい。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記主成分分析は、分散・共分散行列を用いて行うことを特徴とする。
これは主成分分析の一例を一層具体的に示すものである。この方法によれば、主成分分析は、分散・共分散行列を用いて行うので、従来から発現遺伝子解析に用いられている相関行列を用いた主成分分析法と比較して正規化を要しないため、効率的に主成分分析を行うことができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記遺伝子検出ステップは、上記蛍光強度平衡軸方向に予め定めた区間内のウィンドウを設定するウィンドウ設定ステップと、上記ウィンドウ設定ステップにより設定された各ウィンドウ内において信頼限界点を決定する信頼限界点決定ステップと、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動するウィンドウ移動ステップと、上記ウィンドウ移動ステップにより移動した各ウィンドウについて上記信頼限界点決定ステップにより各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成ステップと、上記信頼境界線作成ステップにより作成された上記信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出する変動遺伝子抽出ステップとをさらに含むことを特徴とする。
これは遺伝子検出ステップの一例を一層具体的に示すものである。この方法によれば、予め定めた区間内のウィンドウを設定し、設定された各ウィンドウ内において遺伝子の輝度データの平均値、標準偏差、Ｐ値（例えば、ｐ＝０．０５）、重心などのうち少なくとも一つを用いて信頼限界点を決定する。そして、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動し、移動した各ウィンドウについて各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成ステップと、上記信頼境界線作成ステップにより作成された上記信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出するので、安定性、再現性、および、信頼度の高い発現遺伝子抽出を行うことができるようになる。
また、これにより、誤差の範囲が異なる実験データであっても、その誤差に応じて、発現量変動倍率の閾値が決められるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記信頼限界点決定ステップは、シミュレーションにより得られた重複データの検定統計表に基づき、ｔ−分布を用いて上記信頼限界点を決定することを特徴とする。
これは信頼限界点決定の一例を一層具体的に示すものである。この方法によれば、シミュレーションにより得られた重複データの検定統計表に基づき、ｔ−分布を用いて信頼限界点を決定するので、従来手法と比較して正確かつ効率的に信頼限界点を求めることができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記信頼境界線作成ステップは、上記複数の信頼限界点に基づいてスプライン曲線を作成することにより平滑化を行い上記信頼境界線を作成することを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。この方法によれば、複数の信頼限界点に基づいてスプライン曲線を作成することにより平滑化を行い信頼境界線を作成するので、効率的に信頼限界点を補完して信頼曲線を作成することができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記信頼境界線作成ステップは、蛍光強度の高い領域については、最後の上記ウィンドウで求めた信頼限界点の水平延長線を用いて上記信頼限界線を作成することを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。この方法によれば、蛍光強度の高い領域については、最後のウィンドウ（最も右側にあるウィンドウ）で求めた信頼限界点のＸ軸に対する水平延長線を用いて信頼限界線を作成するので、傾きが少なくどちらに収束するか判断不能の場合であっても、適切な信頼限界線を作成することができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記信頼境界線作成ステップは、蛍光強度の低い領域については、各ウィンドウで求めた信頼限界点から最小二乗法により求めた漸近線の補外を上記信頼限界線として用いることを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。この方法によれば、蛍光強度の低い領域については、例えば、最初から数十程度の各ウィンドウで求めた信頼限界点から最小二乗法により求めた漸近線の補外を上記信頼限界線として用いるので、蛍光強度が低い遺伝子のスポットについても的確に検出することができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、利用者にウィンドウ内の遺伝子数を入力させる遺伝子数入力ステップをさらに含み、上記ウィンドウ設定ステップは、上記遺伝子数入力ステップにより入力された上記遺伝子数の上記遺伝子が含まれる上記区間内で上記ウィンドウを設定することを特徴とする。
これはウィンドウ設定の一例を一層具体的に示すものである。この方法によれば、利用者にウィンドウ内の遺伝子数を入力させ、入力された遺伝子数の遺伝子が含まれる区間内でウィンドウを設定するので、実験毎に利用者が設定する遺伝子数を変動させることができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、利用者に信頼限界値を入力させる信頼限界値入力ステップをさらに含み、上記信頼限界点決定ステップは、上記ウィンドウ内において上記信頼限界値入力ステップにより入力された上記信頼限界値に基づいて上記信頼限界点を決定することを特徴とする。
これは信頼限界点決定の一例を一層具体的に示すものである。この方法によれば、利用者に信頼限界値を入力させ、ウィンドウ内において入力された信頼限界値に基づいて信頼限界点を決定するので、実験毎に利用者が設定する信頼限界値を変動させることができ、各実験の誤差を適切な範囲に収めることができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、利用者に、上記変動しない遺伝子の分布の形、上記変動遺伝子の分布の形、上記変動遺伝子の検出基準、実験の繰り返し数、および、シミュレーション回数のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させるシミュレーション条件設定ステップと、上記シミュレーション条件設定ステップにて設定された上記シミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、上記遺伝子検出手段を実行し、上記発現遺伝子を検出するシミュレーションを複数回実行し、上記検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、上記シミュレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成するシミュレーション実行ステップと、上記シミュレーション条件毎に、上記シミュレーション実行ステップによるシミュレーション結果を出力するシミュレーション結果出力ステップとをさらに含むことを特徴とする。
この方法によれば、利用者に、変動しない遺伝子の分布の形（例えば、分布の標準偏差（例えば、発現が変わらない遺伝子の分布を標準正規分布として標準偏差σ＝１、中心μ＝０としたときに、標準偏差σの幅を０．１から１．５の範囲で設定する））、上記変動遺伝子の分布の形（例えば、中心（例えば、当該条件のときに、中心μの幅を０．４から３の範囲で設定する））、上記変動遺伝子の検出基準（例えば、全体数からみた検出された遺伝子の割合を、２／３、２／４、３／４、３／６、４／６などで設定する）、実験の繰り返し数、および、シミュレーション回数（例えば、３回から１０回の範囲で設定する）のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させ、設定されたシミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、遺伝子検出を実行し、発現遺伝子を検出するシミュレーションを複数回実行し、上記検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、シミュレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成し、シミュレーション条件毎に、シミュレーション実行によるシミュレーション結果を出力するので、様々な条件におけるシミュレーション結果を組み合わせることにより上記の組み合わせによる検出力と検出信頼度を知ることができる。すなわち、同じ条件の対照実験を繰り返して行い、得られたそれぞれ異なったデータセットに対して変動遺伝子の検出を行い、あらかじめ決めた回数以上検出される遺伝子のみを選択することにより、期待通りの信頼度あるいは検出力で変動遺伝子を検出できるようになる。
また、これにより、発現量が変わらない遺伝子が変動遺伝子として検出されたエラー（第一種のエラー）や、変動遺伝子が発現が変わらない遺伝子として検出されたエラー（第二種のエラー）を算出して比較することにより、シミュレーションのデータから上記の手法による変動遺伝子を検出する検出力と信頼度を把握でき、実際の実験データに対して、期待される検出力と信頼度を得るために、実験の繰り返し数と変動遺伝子の検出基準、および信頼限界点の組み合わせを設定することができる。
また、これにより、何回実験を行えば、正確な実験データを取ることができるかを予測することが可能になり、実験効率を著しく向上させることができるようになる。
つぎの発明にかかる遺伝子発現情報解析方法は、上記に記載の遺伝子発現情報解析方法において、上記遺伝子検出ステップは、各スポットの偏差値を計算する偏差値計算ステップをさらに含むことを特徴とする。
これは遺伝子検出の一例を一層具体的に示すものである。この方法によれば、各スポットの偏差値を計算するので、このように計算された各スポットの偏差値を変動比率（倍率）の代わりに用いることで、スライド間の誤差の差異に影響されない解析が可能になる。
また、これにより、本方法により計算される偏差値を、クラスター解析に代表される多変量解析において変動比率の対数や正規化した変動比率の変わりに用いることができ、発現量の大小による誤差の影響の違いに左右されない解析が可能になる。
また、本発明はプログラムに関するものであり、本発明にかかるプログラムは、２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成するバックグラウンド補正ステップと、上記バックグラウンド補正ステップによりバックグラウンド補正された上記輝度データの対数をＸ−Ｙ軸にとり蛍光強度散布図を作成し、各遺伝子のスポットについて蛍光強度平衡軸に対するバイアスを求め、上記輝度データから当該バイアスを除去することにより上記蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するバイアス補正ステップと、上記バイアス補正ステップにより構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出する遺伝子検出ステップとを含む遺伝子発現情報解析方法をコンピュータに実行させることを特徴とする。
このプログラムによれば、ＤＮＡマイクロアレイやＤＮＡチップなどにより２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成する。ここで、個々のスポットの蛍光強度測定値からブランクのスポットの蛍光強度測定値の平均をバックグラウンド値として用いてもよく、あるいは、各スポットの周囲の領域のブランクの蛍光強度測定値の平均値をバックグラウンド値として用いてもよい。また、これ以外のいかなるプログラムによりバックグラウンド補正を行ってもよい。
また、本プログラムによれば、バックグラウンド補正された輝度データの対数（自然対数または２の対数等）をＸ−Ｙ軸にとり蛍光強度散布図（スキャッタープロット）を作成し、各遺伝子のスポットについて同じ蛍光強度を示す蛍光強度平衡軸に対するバイアスを求め、輝度データから当該バイアスを除去することにより蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、より多くのバイアスを含む蛍光成分の判定を行い、このバイアスを除去した上で蛍光強度平衡軸と発現量の倍数軸とを２軸とする新しい直行軸系を構築することができるようになる。
また、本プログラムによれば、構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出するので、従来の遺伝子検出法に比べて、測定プログラム、標本間の誤差、および、蛍光標識効率などの違いの影響を受けずに正確に発現量が変動した遺伝子を検出することができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記バイアス補正ステップは、発現量が多い遺伝子集団の対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求める第一主成分作成ステップと、上記第一主成分作成ステップにより求めた上記漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算する座標回転ステップと、上記座標回転ステップによる座標軸回転後の上記発現量が少ない遺伝子集団の座標を用いて、上記蛍光強度平衡軸の傾きを計算し、計算された傾きに基づいて２つの条件の上記輝度データのうちどちらに上記バイアスが多く含まれているかを判定するバイアス判定ステップと、上記バイアス判定ステップにて上記バイアスが多く含まれていると判定された条件の上記輝度データから上記バイアスを差し引くことにより上記蛍光強度平衡軸と上記発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築する補正プロット生成ステップとをさらに含むことを特徴とする。
これはバイアス補正ステップの一例を一層具体的に示すものである。このプログラムによれば、ＤＮＡ濃度希釈系列の品質管理用のコントロール遺伝子サンプル（例えば外部遺伝子λＤＮＡサンプル、あるいは発現量がほとんど変わらないリボソームなどのＨｏｕｓｅ−ｋｅｅｐｉｎｇ遺伝子サンプル）を目的遺伝子サンプルと同時に測定し、蛍光強度データの積の一番小さい遺伝子から順に一つずつコントロール遺伝子を除き、残りすべてのコントロール遺伝子サンプルのデータから遺伝子の発現量とＤＮＡ量の検量線をそれぞれ作成し、データの相関係数を計算し、順番に計算される上記の相関係数が最初に強い相関が認められる基準（例えば０．８以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値１とし、二つの条件における蛍光強度データの積が閾値１を上回るすべての遺伝子サンプルの集団を発現量が多い遺伝子集団とし、上記発現量が順番に計算される相関係数度が最初に弱い相関が認められる基準（例えば０．５以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値２とし（ただし、閾値２＜閾値１）、二つの条件における蛍光強度データの積が閾値２を下回るすべての遺伝子サンプルの集団を発現量が少ない遺伝子集団とし、発現量が多い遺伝子集団の蛍光強度対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求め、求めた漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算し、座標軸回転後の発現量が少ない遺伝子集団の座標を用いて、蛍光強度平衡軸の傾きを計算し、計算された傾き（例えば、正、負、ゼロ等）に基づいて２つの条件の輝度データのうちどちらにバイアスが多く含まれているかを判定し、バイアスが多く含まれていると判定された条件の輝度データからバイアスを差し引くこと（例えば、一定のバイアスをもつ遺伝子集団について座標を回転させる等）により蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、実測値のバイアスを効率的に除去し、かつ、データの性質を明白に表現できる蛍光強度散布図を作成することができるようになる。
なお、本プログラムは軸回転後にバイアスの大小を判定するものに限定されず、例えば、軸回転の前にも高発現漸近線と低発現漸近線の傾きを比較することにより、バイアスの大小を判定してもよい。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記主成分分析は、分散・共分散行列を用いて行うことを特徴とする。
これは主成分分析の一例を一層具体的に示すものである。このプログラムによれば、主成分分析は、分散・共分散行列を用いて行うので、従来から発現遺伝子解析に用いられている相関行列を用いた主成分分析法と比較して正規化を要しないため、効率的に主成分分析を行うことができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記遺伝子検出ステップは、上記蛍光強度平衡軸方向に予め定めた区間内のウィンドウを設定するウィンドウ設定ステップと、上記ウィンドウ設定ステップにより設定された各ウィンドウ内において信頼限界点を決定する信頼限界点決定ステップと、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動するウィンドウ移動ステップと、上記ウィンドウ移動ステップにより移動した各ウィンドウについて上記信頼限界点決定ステップにより各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成ステップと、上記信頼境界線作成ステップにより作成された上記信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出する変動遺伝子抽出ステップとをさらに含むことを特徴とする。
これは遺伝子検出ステップの一例を一層具体的に示すものである。このプログラムによれば、予め定めた区間内のウィンドウを設定し、設定された各ウィンドウ内において遺伝子の輝度データの平均値、標準偏差、Ｐ値（例えば、ｐ＝０．０５）、重心などのうち少なくとも一つを用いて信頼限界点を決定する。そして、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動し、移動した各ウィンドウについて各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成ステップと、上記信頼境界線作成ステップにより作成された上記信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出するので、安定性、再現性、および、信頼度の高い発現遺伝子抽出を行うことができるようになる。
また、これにより、誤差の範囲が異なる実験データであっても、その誤差に応じて、発現量変動倍率の閾値が決められるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記信頼限界点決定ステップは、シミュレーションにより得られた重複データの検定統計表に基づき、ｔ−分布を用いて上記信頼限界点を決定することを特徴とする。
これは信頼限界点決定の一例を一層具体的に示すものである。このプログラムによれば、シミュレーションにより得られた重複データの検定統計表に基づき、ｔ−分布を用いて信頼限界点を決定するので、従来手法と比較して正確かつ効率的に信頼限界点を求めることができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記信頼境界線作成ステップは、上記複数の信頼限界点に基づいてスプライン曲線を作成することにより平滑化を行い上記信頼境界線を作成することを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。このプログラムによれば、複数の信頼限界点に基づいてスプライン曲線を作成することにより平滑化を行い信頼境界線を作成するので、効率的に信頼限界点を補完して信頼曲線を作成することができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記信頼境界線作成ステップは、蛍光強度の高い領域については、最後の上記ウィンドウで求めた信頼限界点の水平延長線を用いて上記信頼限界線を作成することを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。このプログラムによれば、蛍光強度の高い領域については、最後のウィンドウ（最も右側にあるウィンドウ）で求めた信頼限界点のＸ軸に対する水平延長線を用いて信頼限界線を作成するので、傾きが少なくどちらに収束するか判断不能の場合であっても、適切な信頼限界線を作成することができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記信頼境界線作成ステップは、蛍光強度の低い領域については、各ウィンドウで求めた信頼限界点から最小二乗法により求めた漸近線の補外を上記信頼限界線として用いることを特徴とする。
これは信頼境界線作成の一例を一層具体的に示すものである。このプログラムによれば、蛍光強度の低い領域については、例えば、最初から数十程度の各ウィンドウで求めた信頼限界点から最小二乗法により求めた漸近線の補外を上記信頼限界線として用いるので、蛍光強度が低い遺伝子のスポットについても的確に検出することができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、利用者にウィンドウ内の遺伝子数を入力させる遺伝子数入力ステップをさらに含み、上記ウィンドウ設定ステップは、上記遺伝子数入力ステップにより入力された上記遺伝子数の上記遺伝子が含まれる上記区間内で上記ウィンドウを設定することを特徴とする。
これはウィンドウ設定の一例を一層具体的に示すものである。このプログラムによれば、利用者にウィンドウ内の遺伝子数を入力させ、入力された遺伝子数の遺伝子が含まれる区間内でウィンドウを設定するので、実験毎に利用者が設定する遺伝子数を変動させることができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、利用者に信頼限界値を入力させる信頼限界値入力ステップをさらに含み、上記信頼限界点決定ステップは、上記ウィンドウ内において上記信頼限界値入力ステップにより入力された上記信頼限界値に基づいて上記信頼限界点を決定することを特徴とする。
これは信頼限界点決定の一例を一層具体的に示すものである。このプログラムによれば、利用者に信頼限界値を入力させ、ウィンドウ内において入力された信頼限界値に基づいて信頼限界点を決定するので、実験毎に利用者が設定する信頼限界値を変動させることができ、各実験の誤差を適切な範囲に収めることができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、利用者に、上記変動しない遺伝子の分布の形、上記変動遺伝子の分布の形、上記変動遺伝子の検出基準、実験の繰り返し数、および、シミュレーション回数のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させるシミュレーション条件設定ステップと、上記シミュレーション条件設定ステップにて設定された上記シミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、上記遺伝子検出手段を実行し、上記発現遺伝子を検出するシミュレーションを複数回実行し、上記検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、上記シミーレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成するシミュレーション実行ステップと、上記シミュレーション条件毎に、上記シミュレーション実行ステップによるシミュレーション結果を出力するシミュレーション結果出力ステップとをさらに含むことを特徴とする。
このプログラムによれば、利用者に、変動しない遺伝子の分布の形（例えば、分布の標準偏差（例えば、発現が変わらない遺伝子の分布を標準正規分布として標準偏差σ＝１、中心μ＝０としたときに、標準偏差σの幅を０．１から１．５の範囲で設定する））、上記変動遺伝子の分布の形（例えば、中心（例えば、当該条件のときに、中心μの幅を０．４から３の範囲で設定する））、上記変動遺伝子の検出基準（例えば、全体数からみた検出された遺伝子の割合を、２／３、２／４、３／４、３／６、４／６などで設定する）、実験の繰り返し数、および、シミュレーション回数（例えば、３回から１０回の範囲で設定する）のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させ、設定されたシミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、遺伝子検出を実行し、発現遺伝子を検出するシミュレーションを複数回実行し、上記検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、シミュレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成し、シミュレーション条件毎に、シミュレーション実行によるシミュレーション結果を出力するので、様々な条件におけるシミュレーション結果を組み合わせることにより上記の組み合わせによる検出力と検出信頼度を知ることができる。すなわち、同じ条件の対照実験を繰り返して行い、得られたそれぞれ異なったデータセットに対して変動遺伝子の検出を行い、あらかじめ決めた回数以上検出される遺伝子のみを選択することにより、期待通りの信頼度あるいは検出力で変動遺伝子を検出できるようになる。
また、これにより、発現量が変わらない遺伝子が変動遺伝子として検出されたエラー（第一種のエラー）や、変動遺伝子が発現が変わらない遺伝子として検出されたエラー（第二種のエラー）を算出して比較することにより、シミュレーションのデータから上記の手法による変動遺伝子を検出する検出力と信頼度を把握でき、実際の実験データに対して、期待される検出力と信頼度を得るために、実験の繰り返し数と変動遺伝子の検出基準、および信頼限界点の組み合わせを設定することができる。
また、これにより、何回実験を行えば、正確な実験データを取ることができるかを予測することが可能になり、実験効率を著しく向上させることができるようになる。
つぎの発明にかかるプログラムは、上記に記載のプログラムにおいて、上記遺伝子検出ステップは、各スポットの偏差値を計算する偏差値計算ステップをさらに含むことを特徴とする。
これは遺伝子検出の一例を一層具体的に示すものである。このプログラムによれば、各スポットの偏差値を計算するので、このように計算された各スポットの偏差値を変動比率（倍率）の代わりに用いることで、スライド間の誤差の差異に影響されない解析が可能になる。
また、これにより、本プログラムにより計算される偏差値を、クラスター解析に代表される多変量解析において変動比率の対数や正規化した変動比率の変わりに用いることができ、発現量の大小による誤差の影響の違いに左右されない解析が可能になる。
また、本発明は記録媒体に関するものであり、本発明にかかる記録媒体は、上記に記載されたプログラムを記録したことを特徴とする。
この記録媒体によれば、当該記録媒体に記録されたプログラムをコンピュータに読み取らせて実行することによって、上記に記載されたプログラムをコンピュータを利用して実現することができ、これら各方法と同様の効果を得ることができる。
発明を実施するための最良の形態
以下に、本発明にかかる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体の実施の形態を図面に基づいて詳細に説明する。尚、この実施の形態によりこの発明が限定されるものではない。
［本装置の概要］
以下、本装置の基本概念を説明し、その後、本発明の各実施例における本装置の構成、処理等について詳細に説明する。
［本装置の基本概念］
以下、図１〜図６および図１１〜図１４を用いて本発明の基本概念について説明する。
１．対照蛍光測定値の２段階データ補正
ＤＮＡマイクロアレイ、または、ＤＮＡチップを用いた発現遺伝子の測定では、各遺伝子の発現量は、各遺伝子に対応する蛍光測定値の輝度に反映され、各遺伝子の発現量比は、対照蛍光測定値との比率として観測される。しかし、ＤＮＡマイクロアレイやＤＮＡチップの誤差、蛍光標識反応の誤差、測定誤差、蛍光物質のモル蛍光係数の違いなどにより、蛍光測定値の比率そのままでは正確に発現量の比を反映しない。そこで、本発明では、これらの誤差を処理するため以下の処理を行う。
（１）バックグラウンド補正
第一段階のデータ補正として、バックグラウンド補正を行なう。まず、遺伝子ｉの二つの条件で測定された輝度を（ａ_ｉ，ｂ_ｉ）とし、各遺伝子の輝度からバックグラウンド（ＢＫＧａ_ｉ，ＢＫＧｂ_ｉ）を差し引く。この修正結果（ａ_ｉ−ＢＫＧａ_ｉ，ｂ_ｉ−ＢＫＧｂ_ｉ）を（Ａ_ｉ，Ｂ_ｉ）とする。
（２）バイアス補正
次に、第二段階のデータ補正として、以下の手順によりバイアスの補正を行なう。まず、本発明のバイアス補正の概要を説明する。ＤＮＡ濃度希釈系列の品質管理用のコントロール遺伝子サンプル（例えば外部遺伝子λＤＮＡサンプル、あるいは発現量がほとんど変わらないリボソームなどのＨｏｕｓｅ−ｋｅｅｐｉｎｇ遺伝子サンプル）を目的遺伝子サンプルと同時に測定し、蛍光強度データの積の一番小さい遺伝子から順に一つずつコントロール遺伝子を除き、残りすべてのコントロール遺伝子サンプルのデータから遺伝子の発現量とＤＮＡ量の検量線をそれぞれ作成し、データの相関係数を計算し、順番に計算される上記の相関係数が最初に強い相関が認められる基準（例えば０．８以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値１とし、二つの条件における蛍光強度データの積が閾値１を上回るすべての遺伝子サンプルの集団を発現量が多い遺伝子集団とし、上記発現量が順番に計算される相関係数度が最初に弱い相関が認められる基準（例えば０．５以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値２とし（ただし、閾値２＜閾値１）、二つの条件における蛍光強度データの積が閾値２を下回るすべての遺伝子サンプルの集団を発現量が少ない遺伝子集団とし、発現量が多い遺伝子集団の蛍光強度対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求め、求めた漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算し、座標軸回転後の発現量が少ない遺伝子集団の座標を用いて、蛍光強度平衡軸の傾きを計算し、計算された傾き（例えば、正、負、ゼロ等）に基づいて２つの条件の輝度データのうちどちらにバイアスが多く含まれているかを判定し、バイアスが多く含まれていると判定された条件の輝度データからバイアスを差し引くこと（例えば、一定のバイアスをもつ遺伝子集団について座標を回転させる等）により蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、実測値のバイアスを効率的に除去し、かつ、データの性質を明白に表現できる蛍光強度散布図を作成することができるようになる。
以下にバイアス補正手順の一例を詳細に説明する。
ｉ）対照蛍光測定値の一般関係式
本発明によるバイアスｋの補正は、蛍光測定値ＡとＢの関係を表す一般式（１）あるいは（１’）に基づく。

ここで、ａ，ｂ，ｋは未知のパラメータ定数である。ＡとＢのうち、より多くのバイアスを含む方から、平均バイアスｋを差し引く。すなわち、ＡのバックグラウンドのノイズがＢより大きく、多くのバイアスを含んでいる場合には、式１を用いることになり、一方、ＢのバックグラウンドのノイズがＡより大きく、多くのバイアスを含んでいる場合には、式１’を用いることになる。ａ、ｂ、および、ｋは（Ｌｏｇ_２Ａ−Ｌｏｇ_２Ｂ）の直交軸系の蛍光測定値のプロット図から推測する。
ｉｉ）分散・共分散行列を用いた主成分分析による蛍光強度平衡軸の抽出
発現量が同じであれば、ＤＮＡマイクロアレイやＤＮＡチップの対照実験の蛍光測定値は、理論的には（Ｌｏｇ_２Ａ−Ｌｏｇ_２Ｂ）直交軸系の蛍光強度散布図上において１：１を示す直線Ｌｏｇ_２Ａ＝Ｌｏｇ_２Ｂ上に位置するはずである。しかし、蛍光物質の性質の違い、実験条件の違い等の原因で、同じ蛍光強度を示す蛍光強度平衡軸（すなわち、各遺伝子のスポットについて、２つの条件で発現量が同等である遺伝子集団より得られた漸近線）がＬｏｇ_２Ａ＝Ｌｏｇ_２Ｂに従わないことがある。この場合、調べる遺伝子数は標本として十分（例えば、千以上）であり、また、発現量が変化する遺伝子である変動遺伝子の数は全体数に対して低い割合であることを前提として、蛍光強度平衡軸は（Ｌｏｇ_２Ａ_ｉ，Ｌｏｇ_２Ｂ_ｉ）集団の漸近線であると仮定する。
ここで、Ａ_ｉとＢ_ｉがｋよりはるかに大きい値の場合、つまりバイアスの影響が少なく無視できる場合、式１と式１’は

に近似できる。このとき、傾きａと切片ｂを求めるために、分散・共分散行列を用いた主成分分析を行なう。尚、分散・共分散行列を用いた主成分分析は、従来から遺伝子の解析で使われている相関行列を用いた主成分分析法と異なり、正規化を要しない。
ここで、図１は分散・共分散行列を用いた主成分分析の概念を示す図である。Ｌｏｇ_２Ａをｘに、Ｌｏｇ_２Ｂをｙに簡略化すると、漸近線を表す式２は、

となる。
従って、各点（ｘ_ｉ，ｙ_ｉ）から漸近線までの距離ｄ_ｉは、

により求められる。
また、全ての点から漸近線までの距離Ｄは、

となる。
ここで、距離Ｄが最小となる場合に、分布図上で最も適切となる漸近線のパラメータａとｂが決められる。
距離Ｄが最小の場合には、

の二つの条件を満たす。
また、式６より、

また、式７より、

となる。ただし、式９で、ａは二つの解のうち、ゼロより大きいものとする。また、Ｓ_ｘはｘ_ｉの分散、Ｓ_ｙはｙ_ｉの分散、Ｓ_ｘｙはｘ_ｉとｙ_ｉの共分散を意味する。実際の補正では、ａとｂは積Ａ_ｉＢ_ｉの上位遺伝子集団（Ｌｏｇ_２Ａ，Ｌｏｇ_２Ｂ）を用いる。簡単な計算法としては全遺伝子の積Ａ_ｉＢ_ｉの上位（例として７０％）の遺伝子集団を用いて求める。正確求めるには、ＤＮＡ濃度希釈系列の品質管理用のコントロール遺伝子サンプル（例えば外部遺伝子λＤＮＡサンプル、あるいは発現量がほとんど変わらないリボソームなどのＨｏｕｓｅ−ｋｅｅｐｉｎｇ遺伝子サンプル）を目的遺伝子サンプルと同時に測定し、蛍光強度データの積の一番小さい遺伝子から順に一つずつコントロール遺伝子を除き、残りすべてのコントロール遺伝子サンプルのデータから遺伝子の発現量とＤＮＡ量の検量線をそれぞれ作成し、データの相関係数を計算し、順番に計算される上記の相関係数が最初に強い相関が認められる基準（例えば０．８以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値１とし、二つの条件における蛍光強度データの積が閾値１を上回るすべての遺伝子サンプルの集団を発現量が多い遺伝子集団とする。
ｉｉｉ）バイアスの修正
（Ｌｏｇ_２Ａ−Ｌｏｇ_２Ｂ）の直交軸系では、Ａはバックグラウンドのノイズが大きく、Ｂより多くのバイアスを含んでいる場合、漸近線とＬｏｇ_２Ａ軸との交わる点の座標は（Ａ_ｃ，０）とすると、式１より

となり、

となる。
また、Ｂはバックグラウンドのノイズが大きく、Ａより多くのバイアスを含んでいる場合、漸近線とＬｏｇ_２Ｂ軸との交わる点の座標は（０，Ｂ_ｃ）とすると、式１’より、

となり、

となる。
ここで、ａ，ｂはすでに求められているため、Ａ_ｃとＢ_ｃはそれぞれ（Ｌｏｇ_２Ａ−Ｌｏｇ_２Ｂ）の直交軸系の積Ａ_ｉＢ_ｉの下位遺伝子集団（Ｌｏｇ_２Ａ，Ｌｏｇ_２Ｂ）から求められた漸近線とＬｏｇ_２Ａ軸、あるいは、Ｌｏｇ_２Ｂ軸の交差点の値として求められる。蛍光測定値の小さい遺伝子は、誤差の強い影響を受けるため、積Ａ_ｉＢ_ｉの下位遺伝子集団（Ｌｏｇ_２Ａ，Ｌｏｇ_２Ｂ）の漸近線の計算に使われる遺伝子は，簡単な計算法は全遺伝子の積Ａ_ｉＢ_ｉの下位（例として１０％）を用いる。正確に求めるには、ＤＮＡ濃度希釈系列の品質管理用のコントロール遺伝子サンプル（例えば外部遺伝子λＤＮＡサンプル、あるいは発現量がほとんど変わらないリボソームなどのＨｏｕｓｅ−ｋｅｅｐｉｎｇ遺伝子サンプル）を目的遺伝子サンプルと同時に測定し、蛍光強度データの積の一番小さい遺伝子から順に一つずつコントロール遺伝子を除き、残りすべてのコントロール遺伝子サンプルのデータから遺伝子の発現量とＤＮＡ量の検量線をそれぞれ作成し、データの相関係数を計算し、順番に計算される上記の相関係数が最初に弱い相関が認められる基準（例えば０．５以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値２とし（ただし、閾値２＜閾値１）、二つの条件における蛍光強度データの積が閾値２を下回るすべての遺伝子サンプルの集団を発現量が少ない遺伝子集団とする。
また、測定値（Ａ_ｉ）と（Ｂ_ｉ）とのどちらがより大きいバイアスを含むことを判断するには、漸近線がＬｏｇ_２Ａ軸とＬｏｇ_２Ｂ軸のどちらかに交差することにより判断できる。このとき、

あるいは、

となる。
ｉｖ）バイアスの判定
図２は新しい座標系での漸近線を求める処理の概念を示す図である。
最小二乗法により、積Ａ_ｉＢ_ｉの下位遺伝子集団の漸近線として、

が求められる。
ただし、最小二乗法の独立変数と従属変数を決めるには、まず（Ｌｏｇ_２Ａ−Ｌｏｇ_２Ｂ）軸系は積Ａ_ｉＢ_ｉの上位遺伝子集団から求めた漸近線を新たなＸ軸とする軸系に回転する必要がある。よって、（Ｌｏｇ_２Ａ_ｉ，Ｌｏｇ_２Ｂ_ｉ）の新しい座標（Ｌｏｇ_２Ａ_ｉ’，Ｌｏｇ_２Ｂ_ｉ’）は、

より求められる。
また、傾きα＝ｔａｎθから、

が求められる。
次に、新しい座標系でＡ_ｉＢ_ｉの下位遺伝子集団（Ａ_ｉ’，Ｂ_ｉ’）の漸近線を最小二乗法で求める。ここで漸近線を

とする。ｍが負数の場合、ＢがＡより多くのバイアスを含むと判定する。一方、ｍが正数の場合、ＡがＢより多くのバイアスを含んでいると判定する。
ｖ）バイアスの計算
式１７で示す漸近線において、ｍが負数の場合、（Ｌｏｇ_２Ａ，Ｌｏｇ_２Ｂ）軸系において、積Ａ_ｉＢ_ｉの下位遺伝子集団（Ａ_ｉ’，Ｂ_ｉ’）のデータを用いてＬｏｇ_２Ｂ軸との切片は、最小二乗法（Ｌｏｇ_２Ａは独立変数、Ｌｏｇ_２Ｂは従属変数）より求められる。

とする場合、

となる。
一方、ｍが正数の場合、（Ｌｏｇ_２Ａ，Ｌｏｇ_２Ｂ）軸系において、積Ａ_ｉＢ_ｉの下位遺伝子集団（Ａ_ｉ’，Ｂ_ｉ’）のデータを用いてＬｏｇ_２Ａ軸との切片は、最小二乗法より（Ｌｏｇ_２Ｂは独立変数、Ｌｏｇ_２Ａは従属変数）で求められる。

とする場合、

となる。
第二段間のデータ補正は、対照測定値の片方のデータ全体に対して、式１１、あるいは、式１３で得られたバイアスを差し引くことで行われる。
以上の補正により、新たなデータプロット図（Ｌｏｇ_２Ａ_ｉ，Ｌｏｇ_２（Ｂ_ｉ−ｋ））、あるいは、（Ｌｏｇ_２（Ａ_ｉ−ｋ），Ｌｏｇ_２Ｂ_ｉ）を用いて（以下、「補正プロット（Ｌｏｇ_２Ａ_ｉ，Ｌｏｇ_２Ｂ_ｉ）」という）、次段階の分析に進む。従って、式１、あるいは、式１’は、

として表現できる。
２．多重検定による発現量が変化した遺伝子の頑健（ロバスト）検出法
本方法において、補正されたデータは発現量が変わる遺伝子集団と発現量が変わらない遺伝子集団との混合分布で構成されていると仮定する。まず、データ対ごとに、蛍光強度平衡軸方向に一定区間内のウィンドウを設定し、各ウィンドウ内でスチューデントのｔ−分布に基づいた任意危険率の信頼限界点を求める。続いて、蛍光強度平衡軸（Ｘ軸）方向に一定遺伝子ずつウィンドウを移動させ、各信頼限界点を求める。求めた複数の信頼限界点を平滑化（スプライン）により補完し、信頼境界線（信頼曲線）とする。
この結果より、信頼境界線の外側に位置する遺伝子を発現量が変わった遺伝子として選択する。さらに高い抽出信頼性を得るため、繰り返し実験による多数決の比率を基準にして、確実に発現量の変わった遺伝子を選択する。次に、抽出の第一種のエラーを減らすため、マルチテストで、決められた回数以上発現量が変化したとして抽出された場合にのみ、遺伝子の発現量が変化したと認める。
（１）蛍光強度平衡軸と発現量の比によるデータ分布の再構築
図３は、分布図の再構築を概念的に示す図である。図３に示すように、各補正プロット（Ｌｏｇ_２Ａ_ｉ，Ｌｏｇ_２Ｂ_ｉ）から蛍光強度平衡軸Ｌｏｇ_２Ｂ_ｉ＝κＬｏｇ_２Ａ_ｉ＋Ｉまでの垂直の距離は、発現量の比に比例すると考えられる。また、蛍光強度平衡軸上、右へ移動する程、蛍光強度が比例して高くなるのは明らかである。従って、各遺伝子から蛍光強度平衡軸までの距離を計算してＹ軸の値とし、蛍光強度平衡軸をＸ軸にした蛍光強度散布図はデータの性質を明白に表現できる。ここで各遺伝子のＹ軸の値ｄ_２（発現量の倍率）は、式４により計算する。
また、各遺伝子のＸ軸の値ｄ_１（蛍光強度）は、蛍光測定値ＡとＢの集団は様々な誤差を含んでいるにもかかわらず、全体的にＡとＢの関係を示す式２２に従う。そして、再構築した蛍光強度散布図は、（蛍光強度−発現量の変化率）のＸ−Ｙ軸を持つ。
（２）発現量が変わる遺伝子集団と、発現量が変わらない遺伝子集団との混合分布モデルの多重検定
図４から図６は、発現倍率の混合正規分布モデルを示す図である。
実際のデータの分布は、発現量が変わった遺伝子（変動遺伝子）の集団と、発現量が変わらない遺伝子（非変動遺伝子）の集団の混合分布であると考えることができる。本方法の混合分布モデルは、図４に示すように、発現量の変化率を表すＹ軸において、ゼロを中心とした非変動遺伝子の集団分布と、それぞれ発現比が上昇、および、下降したある一点を中心とした変動遺伝子の集団分布からなっていると仮定している。ここでは説明の便宜上、正規分布のみを示すが、本発明は正規分布の場合に限定されず、全ての分布のデータに適用することができる。
ここで、図５に示すように、変動遺伝子の全体に対する割合がそれほど大きくない場合（例えば、変動遺伝子の集団が全体数の１０％を占める場合など）には、その混合分布は図６のように、正規分布に近似する。従って、一定の信頼限界値であるＰ値（Ｐ−ｖａｌｕｅ）を条件にした混合分布の蛍光倍率データに対してｔ分布に基づき、変動遺伝子を抽出できる。
本方法は、実際のデータの分散と中心の計算に基づいて発現量変動倍率の閾値を決めているため、本方法は頑健（ロバスト）であるという特徴を持っている。すなわち、本方法は誤差の範囲が異なる実験データでもその誤差に応じて、発現量変動倍率の閾値が決められる。また、本方法のもう一つの特徴として、同じ条件の対照実験で得られた異なるデータセットに対して、数回の検出を行ない、あらかじめ決めた回数以上検出される遺伝子のみを選択することにより、高い信頼度で変動遺伝子を検出できることが挙げられる。
さらに、非変動遺伝子、および、変動遺伝子集団の混合分布を、六つのパラメータ（全遺伝子数、発現が変動する遺伝子の割合、遺伝子分布の標準偏差（幅）、発現が変動する遺伝子の分布の中心、検出基準（検出数／全体数）、および、各データセット（ウィンドウ）内の信頼限界値（Ｐ−ｖａｌｕｅ））を変えてシミュレーションすることにより、第一種の検出エラーと第二種の検出エラーを計算できる。その結果は、実験のガイドラインとすることができる。ここで、「第一種の検出エラー」は、変わらないものが変わるものとして検出された偽陽性エラーをいい、「第二種の検出エラー」は、変わるものが変わらないものとして検出された偽陰性エラーをいう。
（３）移動ウィンドウ法を用いたデータに合わせた発現変動信頼曲線の作成
蛍光強度の小さい遺伝子ほど、その発現変化量の値がバックグラウンドなどの誤差に強い影響を受ける。例えば、対照実験で各蛍光値に除去不可能の誤差α_Ａとα_Ｂが存在するとすれば、ある遺伝子ｉの発現変化量は、蛍光倍率＝（Ａ_ｉ−α_Ａｉ）／（Ｂ_ｉ−α_Ｂｉ）として現れる。従って、Ａ_ｉ＞＞α_Ａｉ、そして、Ｂ_ｉ＞＞α_Ｂｉの場合、蛍光倍率はＡ_ｉ／Ｂ_ｉとして近似できるが、Ａ_ｉとα_Ａｉ、そして、Ｂ_ｉとα_Ｂｉの値が近い場合は、その誤差による影響は無視できない。よって、実際にｔ分布に基づき、遺伝子を選択する場合、バックグラウンドなどの誤差により異なる程度をもって影響を受ける遺伝子集団が混在することを考えると、蛍光強度に応じて異なる集団のｔ値を決定するべきである。
ここで、図１１〜図１４は、信頼曲線の作成を概念的に示した図である。まず、図１１に示すように、本装置は、一定遺伝子数で構成されたウィンドウ内の遺伝子の発現量の倍率分布に対して分散と中心を計算して、倍率変化のｔ値を決める（倍率座標軸の値に相当する）。尚、この発現変化の信頼限界点の蛍光強度平衡軸上の値はウィンドウ内部の全ての遺伝子の蛍光強度平衡軸値ｍｅｄｉａｎ値を用いることとする。
次に、本装置は、図１２に示すように、ウィンドウ内の蛍光強度平衡軸上下において、発現変化の信頼限界点の座標をそれぞれ決めた後、蛍光強度平衡軸が増加する方向に一定遺伝子分ウィンドウを移動させる。以降、この操作を繰り返す。
本装置は、全ての発現変化の信頼限界点の計算を行なった後、発現変化の信頼限界点を３次スプライン曲線によって、信頼限界点同士をつなぎ発現変化境界線である発現変動信頼曲線を作成する。ここで、両端のウィンドウにおいて、３次スプライン曲線による補完ができない蛍光強度の領域では、図１３で示すように、蛍光強度の高いところ（点線で示す）では最後のウィンドウで求めた発現変化の信頼限界点の水平延長線を用い、また蛍光強度の低いところ（点線で示す）では一番左から続いた数十個のウィンドウの境界点から最小二乗法により求めた漸近線の補外を発現変動信頼曲線（補外発現変化境界線）とする。
ついで、図１４に示すように、蛍光強度平衡軸上下の発現変動信頼曲線で挟んだ領域より外れた遺伝子を、発現量が変化した遺伝子、つまり、発現量が上昇、あるいは、下降したものとして抽出する。最終的な遺伝子の抽出は、前述した多重検定（２−（２））により行なう。
［装置構成］
次に、遺伝子発現情報解析装置の構成について以下に図２２〜図２５を参照して説明する。図２２は、本発明が適用される本装置の構成の一例を示すブロック図であり、該構成のうち本発明に関係する部分のみを概念的に示している。
図２２において遺伝子発現情報解析装置１００は、概略的に、遺伝子発現情報解析装置１００の全体を統括的に制御するＣＰＵ等の制御部１０２、通信回線等に接続されるルータ等の通信装置（図示せず）に接続される通信制御インターフェース部１０４、入力装置１１２や出力装置１１４に接続される入出力制御インターフェース部１０８、および、各種のデータベースやテーブルなどを格納する記憶部１０６を備えて構成されており、これら各部は任意の通信路を介して通信可能に接続されている。さらに、この遺伝子発現情報解析装置１００は、ルータ等の通信装置および専用線等の有線または無線の通信回線を介して、ネットワークに通信可能に接続されてもよい。
記憶部１０６に格納される各種のデータベースやテーブル（測定輝度データ１０６ａおよびシミュレーション結果データ１０６ｂ）は、固定ディスク装置等のストレージ手段であり、各種処理に用いる各種のプログラムやテーブルやファイルやデータベースやウェブページ用ファイル等を格納する。
これら記憶部１０６の各構成要素のうち、測定輝度データ１０６ａは、ＤＮＡチップやＤＮＡマイクロアレイなどにより実験された遺伝子の発現量を示す各スポットの測定輝度データを各実験毎に格納した測定輝度データ格納手段である。また、シミュレーション結果データ１０６ｂは、本装置によるシミュレーション結果データを格納したシミュレーション結果データ格納手段である。
また、図２２において、通信制御インターフェース部１０４は、遺伝子発現情報解析装置１００とネットワーク（またはルータ等の通信装置）との間における通信制御を行う。すなわち、通信制御インターフェース部１０４は、他の端末と通信回線を介してデータを通信する機能を有する。
また、図２２において、入出力制御インターフェース部１０８は、入力装置１１２や出力装置１１４の制御を行う。ここで、出力装置１１４としては、モニタ（家庭用テレビを含む）の他、スピーカを用いることができる（なお、以下においては出力装置１１４をモニタとして記載する場合がある）。また、入力装置１１２としては、キーボード、マウス、および、マイク等を用いることができる。また、モニタも、マウスと協働してポインティングデバイス機能を実現する。
また、図２２において、制御部１０２は、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）等の制御プログラム、各種の処理手順等を規定したプログラム、および所要データを格納するための内部メモリを有し、これらのプログラム等により、種々の処理を実行するための情報処理を行う。制御部１０２は、機能概念的に、バックグラウンド補正部１０２ａ、バイアス補正部１０２ｂ、遺伝子検出部１０２ｃ、および、シミュレーション部１０２ｄを備えて構成されている。
このうち、バックグラウンド補正部１０２ａは、２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成するバックグラウンド補正手段である。
また、バイアス補正部１０２ｂは、バックグラウンド補正手段によりバックグラウンド補正された輝度データの対数をＸ−Ｙ軸にとり蛍光強度散布図を作成し、各遺伝子のスポットについて蛍光強度平衡軸に対するバイアスを求め、輝度データから当該バイアスを除去することにより蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するバイアス補正手段である。
ここで、図２３は、バイアス補正部１０２ｂの構成の一例を示すブロック図であり、該構成のうち本発明に関係する部分のみを概念的に示している。図２３に示すように、バイアス補正部１０２ｂは、機能概念的に、第一主成分作成部１０２ｅ、座標回転部１０２ｆ、バイアス判定部１０２ｇ、および、補正プロット生成部１０２ｈを備えて構成されている。
図２３において、第一主成分作成部１０２ｅは、発現量が多い遺伝子集団の対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求める第一主成分作成手段である。
また、座標回転部１０２ｆは、第一主成分作成手段により求めた漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算する座標回転手段である。
また、バイアス判定部１０２ｇは、座標回転手段による座標軸回転後の発現量が少ない遺伝子集団の座標を用いて、蛍光強度平衡軸の傾きを計算し、計算された傾きに基づいて２つの条件の輝度データのうちどちらにバイアスが多く含まれているかを判定するバイアス判定手段である。
また、補正プロット生成部１０２ｈは、バイアス判定手段にてバイアスが多く含まれていると判定された条件の輝度データからバイアスを差し引くことにより蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築する補正プロット生成手段である。
再び図２２に戻り、遺伝子検出部１０２ｃは、バイアス補正手段により構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出する遺伝子検出手段である。
ここで、図２４は、遺伝子検出部１０２ｃの構成の一例を示すブロック図であり、該構成のうち本発明に関係する部分のみを概念的に示している。図２４に示すように、遺伝子検出部１０２ｃは、機能概念的に、ウィンドウ設定部１０２ｉ、信頼限界点決定部１０２ｊ、ウィンドウ移動部１０２ｋ、信頼境界線作成部１０２ｍ、変動遺伝子抽出部１０２ｎ、遺伝子数入力部１０２ｐ、信頼限界値入力部１０２ｑ、および、偏差値処理部１０２ｕを備えて構成されている。
図２４において、ウィンドウ設定部１０２ｉは、蛍光強度平衡軸方向に予め定めた区間内のウィンドウを設定するウィンドウ設定手段である。
また、信頼限界点決定部１０２ｊは、ウィンドウ設定手段により設定された各ウィンドウ内において信頼限界点を決定する信頼限界点決定手段である。
また、ウィンドウ移動部１０２ｋは、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動するウィンドウ移動手段である。
また、信頼境界線作成部１０２ｍは、ウィンドウ移動手段により移動した各ウィンドウについて信頼限界点決定手段により各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成手段である。
また、変動遺伝子抽出部１０２ｎは、信頼境界線作成手段により作成された信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出する変動遺伝子抽出手段である。
また、遺伝子数入力部１０２ｐは、利用者にウィンドウ内の遺伝子数を入力させる遺伝子数入力手段である。
また、信頼限界値入力部１０２ｑは、利用者に信頼限界値を入力させる信頼限界値入力手段である。
また、偏差値処理部１０２ｕは、各スポットの偏差値を計算する偏差値計算手段である。
再び図２２に戻り、シミュレーション部１０２ｄは、予め定めた条件に従って、複数回のシミュレーションを実行してシミュレーション結果を条件毎に出力するシミュレーション手段である。
ここで、図２５は、シミュレーション部１０２ｄの構成の一例を示すブロック図であり、該構成のうち本発明に関係する部分のみを概念的に示している。図２５に示すように、シミュレーション部１０２ｄは、機能概念的に、シミュレーション条件設定部１０２ｒ、シミュレーション実行部１０２ｓ、および、シミュレーション結果出力部１０２ｔを備えて構成されている。
図２５において、シミュレーション条件設定部１０２ｒは、利用者に、遺伝子の分布の標準偏差、変動遺伝子の分布の中心、変動遺伝子の検出基準、および、シミュレーション回数のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させるシミュレーション条件設定手段である。
また、シミュレーション実行部１０２ｓは、シミュレーション条件設定手段にて設定されたシミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、遺伝子検出手段を実行し、発現遺伝子を検出するシミュレーションを複数回実行し、検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、シミュレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成するシミュレーション実行手段である。
また、シミュレーション結果出力部１０２ｔは、シミュレーション条件毎に、シミュレーション実行手段によるシミュレーション結果を出力するシミュレーション結果出力手段である。
なお、これら各部によって行なわれる処理の詳細については、後述する。
［本装置の処理］
次に、このように構成された本実施の形態における本装置の本実施形態の処理の一例について、以下に図７〜図１０、図１５〜図２８を参照して詳細に説明する。
［本装置のメイン処理］
まず、本装置のメイン処理について図１５を参照して説明する。図１５は本実施形態の本装置のメイン処理の一例を示すフローチャートである。
まず、遺伝子発現情報解析装置１００は、バックグラウンド補正部１０２ａの処理により、図１６を用いて後述するバックグラウンド補正処理を実行する（ステップＳ−１）。すなわち、バックグラウンド補正部１０２ａは、ＤＮＡマイクロアレイやＤＮＡチップなどにより２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成する。
ついで、遺伝子発現情報解析装置１００は、バイアス補正部１０２ｂの処理により、図１７を用いて後述するバイアス補正処理を実行する（ステップＳ−２）。すなわち、バイアス補正部１０２ｂは、バックグラウンド補正された輝度データの対数（自然対数または２の対数等）をＸ−Ｙ軸にとり蛍光強度散布図（スキャッタープロット）を作成し、各遺伝子のスポットについて同じ蛍光強度を示す蛍光強度平衡軸に対するバイアスを求め、輝度データから当該バイアスを除去することにより蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築する。
ついで、遺伝子発現情報解析装置１００は、遺伝子検出部１０２ｃの処理により、図１８および図２０を用いて後述する移動ウィンドウによる遺伝子検出処理を実行する（ステップＳ−３）。すなわち、遺伝子検出部１０２ｃは、構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出する。
ついで、遺伝子発現情報解析装置１００は、シミュレーション部１０２ｄの処理により、図１９および図２１等を用いて後述するシミュレーション処理を実行する（ステップＳ−４）。すなわち、シミュレーション部１０２ｄは、予め定めた条件に従って、複数回のシミュレーションを実行してシミュレーション結果を条件毎に出力する。
これにて、本装置のメイン処理が終了する。
［バックグラウンド補正処理］
次に、バックグラウンド補正処理の詳細について図１６を用いて説明する。図１６は本実施形態の本装置のバイアス補正処理の一例を示すフローチャートである。
まず、遺伝子発現情報解析装置１００は、バックグラウンド補正部１０２ａの処理により、遺伝子の二つの条件で測定された輝度から、平均あるいは局部のバックグラウンド値を求め（ステップＳＡ−１）、このバックグラウンド値を測定値から除去し、この修正の結果をＡ群、および、Ｂ群とする（ステップＳＡ−２）。
すなわち、バックグラウンド補正部１０２ａは、個々のスポットの蛍光強度測定値からブランクのスポットの平均バックグラウンド値、あるいは、各スポットの周囲の領域のバックグラウンド値を、各スポットの蛍光強度測定値から引くことにより、バックグラウンド補正を行う。これにてバックグラウンド補正処理を終了する。
［バイアス補正処理］
次に、バイアス補正処理の詳細について、図１７を参照して説明する。図１７は本実施形態の本装置のバイアス補正処理の一例を示すフローチャートである。まず、バイアス補正部１０２ｂは、第一主成分作成部１０２ｅの処理により、Ａ群、および、Ｂ群に対し、２を底にした対数を計算し、Ｌｏｇ_２Ａ，Ｌｏｇ_２ＢをＸ，Ｙ軸とした直交軸系にスキャッタープロットする（ステップＳＢ−１）。
次に、バイアス補正部１０２ｂは、第一主成分作成部１０２ｅの処理により、積ＡＢの上位遺伝子集団（例えば、上位７０％までの遺伝子集団）の対数値を用いて、分散・共分散行列を用いた主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求める（ステップＳＢ−２）。
ついで、バイアス補正部１０２ｂは、座標回転部１０２ｆの処理により、求めた漸近線とＬｏｇ_２Ａ軸の角度をθとし、積ＡＢの下位に属する遺伝子集団（例えば、下位１０％に含まれる遺伝子の集団など）のＬｏｇ_２Ａ−Ｌｏｇ_２Ｂ軸系における座標を右にθ角度回転した座標を計算する（ステップＳＢ−３）。
ついで、バイアス補正部１０２ｂは、バイアス判定部１０２ｇの処理により、座標軸回転後の積ＡＢの下位遺伝子集団の座標を用いて、漸近線の傾きを計算する（ステップＳＢ−４）。
ついで、バイアス補正部１０２ｂは、バイアス判定部１０２ｇの処理により、漸近線の傾きが、正数か否か判定する（ステップＳＢ−５）。正数の場合、バイアス判定部１０２ｇは、Ａのデータはより多くのバイアスを含んでいると判定する。従って、バイアス補正部１０２ｂは、バイアス判定部１０２ｇの処理により、Ｌｏｇ_２Ａ−Ｌｏｇ_２Ｂ軸系にある積ＡＢの下位遺伝子集団（例えば、下位１０％に含まれる遺伝子の集団など）の座標を用い、Ｌｏｇ_２Ｂ軸のデータを独立変数として、Ｌｏｇ_２Ａのデータを従属変数として用いた最小二乗法により、下位遺伝子集団の漸近線とＬｏｇ_２Ａ軸との交差点（Ａ_ｃ，０）の値Ａ_ｃを求める（ステップＳＢ−６）。
ついで、バイアス補正部１０２ｂは、補正プロット生成部１０２ｈの処理により、バイアスを求め、対照測定値のデータからバイアスを差し引く（ステップＳＢ−７）。
一方、ステップＳＢ−５において、漸近線の傾きが正数でない場合、バイアス判定部１０２ｇは、ゼロであるか否か判定する（ステップＳＢ−８）。ゼロの場合、バイアス補正処理を終了する。
また、ステップＳＢ−８において、漸近線の傾きがゼロでない場合、バイアス判定部１０２ｇは、Ｂのデータがより多くのバイアスを含んでいると判定する。従って、バイアス補正部１０２ｂは、補正プロット生成部１０２ｈの処理により、Ｌｏｇ_２Ａ−Ｌｏｇ_２Ｂ軸系にある積ＡＢの下位遺伝子集団（例えば、下位１０％に含まれる遺伝子の集団など）の座標を用い、Ｌｏｇ_２Ａ軸のデータを独立変数として、Ｌｏｇ_２Ｂのデータを従属変数として用いた最小二乗法により、下位遺伝子集団の漸近線とＬｏｇ_２Ｂ軸との交差点（０，Ｂ_ｃ）の値Ｂ_ｃを求め（ステップＳＢ−９）、上述したステップＳＢ−７の処理を行なう。
次に、バイアス補正部１０２ｂは、補正プロット生成部１０２ｈの処理により、バイアスを差し引いたデータを用いて、直交軸系Ｌｏｇ_２（Ａ−ｋ）−Ｌｏｇ_２Ｂ軸系あるいはＬｏｇ_２Ａ−Ｌｏｇ_２（Ｂ−ｋ）軸系を構築する（ステップＳＢ−１０）。これにてバイアス補正処理を終了する。
［遺伝子検出処理］
次に、遺伝子検出処理の詳細について、図１８を参照して説明する。図１８は本実施形態の本装置の遺伝子検出処理の一例を示すフローチャートである。
まず、遺伝子発現情報解析装置１００の遺伝子検出部１０２ｃは、ウィンドウ設定部１０２ｉの処理により、利用者に対して、図１１を用いて上述したウィンドウ内の遺伝子数、および、信頼限界値である信頼度（Ｐ_ｅ値）を設定させるための遺伝子抽出条件設定画面を出力装置１１４に出力する（ステップＳＣ−１）。
ここで、図２０は、ウィンドウ設定部１０２ｉの処理により、出力装置１１４に出力される遺伝子抽出条件設定画面の一例を示す図である。図２０に示すように、遺伝子抽出条件設定画面は、ウィンドウ内遺伝子数の入力領域ＭＡ−１、信頼限界値である信頼度（Ｐ_ｅ値）の入力領域ＭＡ−２、設定終了ボタンＭＡ−３等を含んで構成される。
ここで、利用者が、図２０に示す遺伝子抽出条件設定画面を見ながら入力装置１１２を用いて、入力領域ＭＡ−１、ＭＡ−２の各項目の入力を完了した後、設定終了ボタンＭＡ−３を選択すると、遺伝子数入力部１０２ｐおよび信頼限界値入力部１０２ｑは、遺伝子抽出条件設定画面で設定された情報に基づいて、図１１に示すウィンドウ内の遺伝子が設定値となるようにウィンドウの大きさを調整する。
再び図１８に戻り、遺伝子検出部１０２ｃは、信頼限界点決定部１０２ｊの処理により、Ｘ軸の最左端から、ウィンドウ内の各点のＹ軸（変化倍率）の値を用いて、分散と中心を計算し、信頼限界点である発現量変化が増加の境界値ｙ_{ｌｉｍｉｔ＋}、減少の倍率の境界値ｙ_{ｌｉｍｉｔ}、および、Ｘ軸の重心を求める（ステップＳＣ−２）。
ついで、遺伝子検出部１０２ｃは、ウィンドウ移動部１０２ｋの処理により、Ｘ軸の蛍光強度が増す方向にウィンドウを一定遺伝子分移動させ、信頼限界点決定部１０２ｊの処理により、新たなウィンドウでの信頼限界点となる発現量変化倍率の境界値ｙ_{ｌｉｍｉｔ＋}とｙ_{ｌｉｍｉｔ}、および、Ｘ軸の重心を求める（ステップＳＣ−３）。
ついで、遺伝子検出部１０２ｃは、この処理をウィンドウがＸ軸の最右端になるまで繰り返す（ステップＳＣ−４）。
ついで、遺伝子検出部１０２ｃは、信頼境界線作成部１０２ｍの処理により、全てのウィンドウの発現変化の信頼限界点である発現量変化倍率境界点を３次スプライン曲線によりつなぎ、発現変動信頼曲線である発現倍率の増加境界線、および、減少境界線を決める（ステップＳＣ−５）。
ついで、遺伝子検出部１０２ｃは、変動遺伝子抽出部１０２ｎの処理により、発現変動信頼曲線である発現倍率の増加境界線、および、減少境界線で挟んだ領域より外れた遺伝子（変動遺伝子）を抽出することにより、多重検定により発現量が変化した遺伝子を頑健（ロバスト）に検出することができる（ステップＳＣ−６）。
また、本発明は、各スポットの偏差値を計算することにより、遺伝子検出効率の向上を行ってもよい。以下に、本実施形態の本装置の偏差値を用いた遺伝子検出処理の詳細について、図２６および図２７を参照して説明する。図２６は本実施形態の本装置の偏差値を用いた遺伝子検出処理の一例を示すフローチャートである。
まず、利用者がウィンドウ内の遺伝子数および信頼度（Ｐｅ値）を設定した後（ステップＳＥ−１）、偏差値処理部１０２ｕは、上述したように蛍光強度平衡軸方向に一定数の遺伝子を含むウィンドウを設定し、各ウィンドウ内全遺伝子の発現量の変化率を表すＹ軸の値を用いて、平均値、標準偏差値を求める。次に、偏差値処理部１０２ｕは、全遺伝子のＸ軸の値を用いて重心（蛍光強度の中間値に相当する）を求める（ステップＳＥ−２）。
続いて、偏差値処理部１０２ｕは、Ｘ軸方向に一定遺伝子ずつウィンドウを移動させ、最右端のウィンドウまで同様の処理を繰り返す（ステップＳＥ−３）。
ついで、偏差値処理部１０２ｕは、求めた複数の（蛍光強度の中間値、平均値）のデータセットを一連の（ｘ，ｙ）のデータとして平滑化により補完し（例えば、３次スプライン曲線を作成）、図２７に示す平均値の平滑線とする。また、偏差値処理部１０２ｕは、同様に複数の（蛍光強度の中間値、標準偏差値）のデータセットを平滑化により補完し（例えば、３次スプライン曲線を作成）、図２７に示す標準偏差値の平滑線とする（ステップＳＥ−４）。
ついで、偏差値処理部１０２ｕは、各遺伝子の蛍光強度平衡軸の値（Ｘ軸の値）より、それに対応する平均値の平滑線上のＹ値、そして標準偏差値の平滑線上のＹ値を用いて、以下の数式により偏差値を計算する（ステップＳＥ−５）。
偏差値＝（遺伝子のｙ値−平滑線から得られた平均値）／
平滑線から得られた標準偏差値σ
このように計算された各スポットの偏差値を変動比率（倍率）の代わりに用いることにより、スライド間の誤差の差異に影響されない解析が可能になる。すなわち、従来各マイクロアレイなどの物理的な誤差、各チップごとに検出する際の人為的な誤差が一定ではないため、チップ間等の比較を行うことが困難であったが、偏差値を用いることによりチップ間等の比較が容易になる。
また、従来から遺伝子発現パターンの分類や共発現遺伝子の抽出のために階層的クラスタリング（一次元、二次元）、Ｋ−Ｍｅａｎｓ法、自己組織化マップ法などを用いたクラスター解析に代表される多変量解析が行われている。例えば、変動比率の対数を用いるものとして、ＭＢＥｉｓｅｎ，ＰＴＳｐｅｌｌｍａｎ，ＰＯＢｒｏｗｎ，ＤＢｏｔｓｔｅｉｎ（１９９８），”Ｃｌｕｓｔｅｒａｎａｌｙｓｉｓａｎｄｄｉｓｐｌａｙｏｆｇｅｎｏｍｅ−ｗｉｄｅｅｘｐｒｅｓｓｉｏｎｐａｔｔｅｒｎｓ”，ＰｒｏｃｅｅｄｉｎｇｓｏｆｔｈｅＮａｔｉｏｎａｌＡｃａｄｅｍｙｏｆＳｃｉｅｎｃｅｓ，９５（２５）：１４８６３−１４８６８が公知である。また、正規化した変動比率を用いるものとして、ＴＲＧｏｌｕｂ，ＤＫＳｌｏｎｉｍ，ＰＴａｍａｙｏ，ＣＨｕａｒｄ，ＭＣａａｓｅｎｂｅｅｋ，ＪＰＭｅｓｉｒｏｖ，ＨＣｏｌｌｅｒ，ＭＬＬｏｈ，ＪＲＤｏｗｎｉｎｇ，ＭＡＣａｌｉｇｉｕｒｉ，ＣＤＢｌｏｏｍｆｉｅｌｄ，ＥＳＬａｎｄｅｒ（１９９９）、”Ｍｏｌｅｃｕｌａｒｃｌａｓｓｉｆｉｃａｔｉｏｎｏｆｃａｎｃｅｒ：ｃｌａｓｓｄｉｓｃｏｖｅｒｙａｎｄｃｌａｓｓｐｒｅｄｉｃｔｉｏｎｂｙｇｅｎｅｅｘｐｒｅｓｓｉｏｎｍｏｎｉｔｏｒｉｎｇ”，Ｓｃｉｅｎｃｅ，２８６：５３１−５３７が公知である。ここで、本方法により計算される偏差値を、クラスター解析に代表される多変量解析において変動比率の対数や正規化した変動比率の代わりに用いることにより、発現量の大小による誤差の影響の違いに左右されない解析が可能になる。
これにて遺伝子検出処理が終了する。
［シミュレーション処理］
次に、本発明のシミュレーション処理の詳細について、図１９および図２１を参照して説明する。図１９は本実施形態の本装置の遺伝子検出処理の一例を示すフローチャートである。
まず、遺伝子発現情報解析装置１００のシミュレーション部１０２ｄは、シミュレーション条件設定部１０２ｒの処理により、利用者に対して、シミュレーションの各種の条件パラメータ（例えば、遺伝子分布の標準偏差（幅）、発現が変動する遺伝子の分布の中心、検出基準（検出数／全体数）、および、シミュレーション回数）を設定させるためのシミュレーション条件設定画面を出力装置１１４に出力する（ステップＳＤ−１）。
ここで、図２１は、シミュレーション条件設定部１０２ｒの処理により、出力装置１１４に出力されるシミュレーション条件設定画面の一例を示す図である。図２１に示すように、シミュレーション条件設定画面は、遺伝子分布の標準偏差の入力領域ＭＢ−１、遺伝子分布の中心の入力領域ＭＢ−２、検出基準の入力領域ＭＢ−３、シミュレーション回数の入力領域ＭＢ−４、設定終了ボタンＭＢ−５を含んで構成される。
なお、遺伝子の分布の標準偏差は、例えば、発現が変わらない遺伝子の分布を標準正規分布として標準偏差σ＝１、中心μ＝０としたときに、標準偏差σの幅を０．１から１．５の範囲で設定してもよい。また、変動遺伝子の分布の中心は、例えば、当該条件のときに、中心μの幅を０．４から３の範囲で設定してもよい。また、変動遺伝子の検出基準は、例えば、全体数からみた検出された遺伝子の割合を、２／３、２／４、３／４、３／６、４／６などで設定してもよい。また、シミュレーション回数は、例えば、３回から１０回の範囲で設定してもよい。
ここで、利用者が、シミュレーション条件設定画面を見ながら入力装置１１２を用いて、入力領域ＭＢ−１〜入力領域ＭＢ−４の各項目の入力を完了した後、設定終了ボタンＭＢ−５を選択すると、シミュレーション部１０２ｄは、シミュレーション実行部１０２ｓの処理により、シミュレーション条件設定画面で設定された情報に基づいて、上述したバックグラウンド補正処理、バイアス補正処理、および、遺伝子検出処理を繰り返して実行して、遺伝子検出処理により抽出した発現量が変わる遺伝子集団（変動遺伝子集団）、および、発現量が変動しなかった遺伝子集団（非変動遺伝子集団）の混合分布のシミュレーション処理を行う（ステップＳＤ−２）。
ついで、シミュレーション部１０２ｄは、シミュレーション結果出力部１０２ｔの処理により、図７から図１０に示すシミュレーション結果画面用データを出力装置１１４に出力する（ステップＳＤ−３）。
ここで、図７から図１０は、シミュレーションによる第一種の検出エラー（偽陽性）の計算結果の一例を示した図である。混合分布は、上述した六つのシミュレーション条件で設定したパラメータ（全遺伝子数、発現が変動する遺伝子の割合、遺伝子分布の標準偏差（幅）、発現が変動する遺伝子の分布の中心、検出基準（検出数／全体数）、および、各データセット（ウィンドウ）内の信頼限界値（Ｐ−ｖａｌｕｅ））に依存する。
図７は、発現が変わる遺伝子集団（変動遺伝子集団）の中心μ’を±σ、標準偏差を１に設定し、検出基準を３回のうち２回が検出されるとき（検出基準＝２／３）、第一種の検出エラーの計算結果をグラフ出力した図である。
一方、図８は、発現が変わる遺伝子集団（変動遺伝子集団）の中心μ’を±σ、標準偏差を１に設定し、検出基準を４回のうち３回が検出されたら、発現が変わったとする場合を示した図である。
これら２つの図の比較により、α（第一種の検出エラー）は各ウィンドウ内で検出するＰ_ｅ値に大きく依存することがわかる。尚、図の横軸は発現が変動した遺伝子集団が全遺伝子を占める割合を表している。
すなわち、発現が変わらない遺伝子の分布を標準正規分布（標準偏差σ＝１、中心μ＝０）として発生し、一方、発現が変わる遺伝子の分布は標準正規分布の左か右に５０％の確率で発生する。
ただし、混合分布は合計六つのパラメータ（すべての遺伝子の数、発現が変わる遺伝子が全体に占める割合、発現が変わる遺伝子の分布の標準偏差と中心、そして検出の基準および各データセット内の信頼限界）に依存する。
また、多重検定の第一種のエラーα、すなわち、発現が変わらない遺伝子が変わる遺伝子として検出されたエラーのみを示し、またすべての結果はパラメータを固定した後、十回の計算結果の平均を表している。また、発現が変わる遺伝子集団の中心μ’＝±σ、標準偏差＝１のとき、（ａ）検出基準：３回のうち２回が検出されたら、発現が変わったとする場合（図７の場合）と、（ｂ）検出基準：４回のうち３回が検出されたら、発現が変わったとする場合（図８の場合）との比較により、αは各ウィンドウ内で検出するＰｅ値に大きく依存することがわかる。
さらに、図９、および、図１０は、発現が変わる遺伝子集団（変動遺伝子集団）の標準偏差を１とした場合を示した図である。図９では、Ｐ_ｅ＝０．１５となり、図１０では、Ｐ_ｅ＝０．２５となる。従って９５％の信頼度を得るためには、検定基準を３回中２回とする場合は、データセット内の信頼限界Ｐ_ｅを０．１５以下に設定すればよく、一方、検定基準を４回中３回とする場合は、データセット内の信頼限界Ｐ_ｅを０．２５以下に設定すればよいことがわかる。尚、図の横軸は、発現が変わる遺伝子集団の中心と発現が変わらない遺伝子集団の標準偏差とを積算した数値を表し、図中の「ＴＮｕｍ」は全遺伝子数、「ｄｉｆ＿ｘ％」は発現が変わる遺伝子集団が占める割合、そして、「２／３」、および、「３／４」は検出基準を意味する。
これにて、シミュレーション処理を終了する。
［他の実施の形態］
さて、これまで本発明の実施の形態について説明したが、本発明は、上述した実施の形態以外にも、上記特許請求の範囲に記載した技術的思想の範囲内において種々の異なる実施の形態にて実施されてよいものである。
また、実施形態において説明した各処理のうち、自動的に行なわれるものとして説明した処理の全部、または、一部を手動的に行なうこともでき、あるいは、手動的に行なわれるものとして説明した処理の全部または一部を公知の方法で自動的に行なうこともできる。
この他、上記文書中や図面中で示した処理手順、制御手順、具体的名称、シミュレーション条件等のパラメータを含む情報、画面例については、特記する場合を除いて任意に変更することができる。
また、シミュレーション部１０２ｄは、ガンマ分布などの他の分布と混合分布シミュレーションをすることにより、上述した信頼度（Ｐ_ｅ値）、第一種、第二種の検出エラー等を求めてもよい。上述した実施形態においては、変動しない遺伝子の分布と変動する遺伝子の分布が正規分布となる場合を一例として説明したが、例えば、変動する遺伝子の分布は正規分布以外の分布（たとえばガンマ分布）で発生させてもよく、本発明をあらゆる分布をとる遺伝子集団に適用することが可能である。
また、上述した本装置のバイアス判定部１０２ｇによるバイアス判定処理は、軸回転後にバイアスの大小を判定するものに限定されず、例えば、図２８に示すように、軸回転の前に高発現漸近線の傾きａと低発現漸近線の傾きｂとを比較することにより、バイアスの大小を判定してもよい。また、本処理において座標の回転は必要条件ではない。
また、遺伝子発現情報解析装置１００に関して、図示の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。
例えば、遺伝子発現情報解析装置１００の各部が備える処理機能、特に制御部１０２にて行なわれる各処理機能については、その全部、または、任意の一部を、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、および、当該ＣＰＵにて解釈実行されるプログラムにて実現することができ、あるいは、ワイヤードロジックによるハードウェアとして実現することも可能である。尚、プログラムは、後述する記録媒体に記録されており、必要に応じて遺伝子発現情報解析装置１００に機械的に読み取られる。
また、遺伝子発現情報解析装置１００は、既知のパーソナルコンピュータ、ワークステーション等の情報処理端末等のコンピュータ（情報処理装置）にプリンタ、モニタ、イメージスキャナ等の周辺装置を接続し、該情報処理装置に本発明の方法を実現させるソフトウェア（プログラム、データ等を含む）を実装することにより実現してもよい。
さらに、遺伝子発現情報解析装置１００の分散・統合の具体的形態は図示のものに限られず、その全部、または、一部を、各種の負荷等に応じた任意の単位で、機能的または物理的に分散・統合して構成することができる。例えば、各データベースを独立したデータベース装置として独立に構成してもよく、また、処理の一部をＣＧＩ（ＣｏｍｍｏｎＧａｔｅｗａｙＩｎｔｅｒｆａｃｅ）を用いて実現してもよい。
また、本発明にかかるプログラムを、コンピュータ読み取り可能な記録媒体に格納することもできる。ここで、この「記録媒体」は、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＥＰＲＯＭ、ＥＥＰＲＯＭ、ＣＤ−ＲＯＭ、ＭＯ、ＤＶＤ等の任意の「可搬用の物理媒体」や、各種コンピュータシステムに内蔵されるＲＯＭ、ＲＡＭ、ＨＤ等の任意の「固定用の物理媒体」、あるいは、ＬＡＮ、ＷＡＮ、インターネットに代表されるネットワークを介してプログラムを送信する場合の通信回線や搬送波のように、短期にプログラムを保持する「通信媒体」を含むものとする。
また、「プログラム」は、任意の言語や記述方法にて記述されたデータ処理方法であり、ソースコードやバイナリコード等の形式を問わない。尚、「プログラム」は必ずしも単一的に構成されるものに限られず、複数のモジュールやライブラリーとして分散構成されるものや、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）に代表される別個のプログラムと協働してその機能を達成するものをも含む。尚、実施の形態に示した各装置において記録媒体を読み取るための具体的な構成、読み取り手順、あるいは、読み取り後のインストール手順等については、周知の構成や手順を用いることができる。
また、遺伝子発現情報解析装置１００がスタンドアローンの形態で処理を行う場合を一例に説明したが、遺伝子発現情報解析装置１００とは別筺体で構成されるクライアント端末からネットワークを介して送信される要求に応じて処理を行い、その処理結果を当該クライアント端末に返却するように構成してもよい。
ここで、ネットワークは、遺伝子発現情報解析装置１００と外部のクライアント装置とを相互に接続する機能を有し、例えば、インターネットや、イントラネットや、ＬＡＮ（有線／無線の双方を含む）や、ＶＡＮや、パソコン通信網や、公衆電話網（アナログ／デジタルの双方を含む）や、専用回線網（アナログ／デジタルの双方を含む）や、ＣＡＴＶ網や、ＩＭＴ２０００方式、ＧＳＭ方式、または、ＰＤＣ／ＰＤＣ−Ｐ方式等の携帯回線交換網／携帯パケット交換網、無線呼出網、Ｂｌｕｅｔｏｏｔｈ等の局所無線網、ＰＨＳ網、ＣＳ、ＢＳまたはＩＳＤＢ等の衛星通信網等のうちいずれかを含んでもよい。すなわち、本装置は、有線・無線を問わず任意のネットワークを介して、各種データを送受信することができる。
以上詳細に説明したように、本発明によれば、ＤＮＡマイクロアレイやＤＮＡチップなどにより２つの条件で同一の遺伝子の発現量を示す蛍光強度を測定した各スポットの測定輝度データからバックグラウンド値を除去することによりバックグラウンド補正された輝度データを作成することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、バックグラウンド補正された輝度データの対数（自然対数または２の対数等）をＸ−Ｙ軸にとり蛍光強度散布図（スキャッタープロット）を作成し、各遺伝子のスポットについて同じ蛍光強度を示す蛍光強度平衡軸に対するバイアスを求め、輝度データから当該バイアスを除去することにより蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、より多くのバイアスを含む蛍光成分の判定を行い、このバイアスを除去した上で蛍光強度平衡軸と発現量の倍数軸とを２軸とする新しい直行軸系を構築することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、構築された新たなＸ−Ｙ軸系の蛍光強度散布図に基づいて発現量が変動した変動遺伝子を検出するので、従来の遺伝子検出法に比べて、測定装置、標本間の誤差、および、蛍光標識効率などの違いの影響を受けずに正確に発現量が変動した遺伝子を検出することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、ＤＮＡ濃度希釈系列の品質管理用のコントロール遺伝子サンプル（例えば外部遺伝子λＤＮＡサンプル、あるいは発現量がほとんど変わらないリボソームなどのＨｏｕｓｅ−ｋｅｅｐｉｎｇ遺伝子サンプル）を目的遺伝子サンプルと同時に測定し、蛍光強度データの積の一番小さい遺伝子から順に一つずつコントロール遺伝子を除き、残りすべてのコントロール遺伝子サンプルのデータから遺伝子の発現量とＤＮＡ量の検量線をそれぞれ作成し、データの相関係数を計算し、順番に計算される上記の相関係数が最初に強い相関が認められる基準（例えば０．８以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値１とし、二つの条件における蛍光強度データの積が閾値１を上回るすべての遺伝子サンプルの集団を発現量が多い遺伝子集団とし、上記発現量が順番に計算される相関係数度が最初に弱い相関が認められる基準（例えば０．５以上）を満たした場合のコントロールサンプルの二つの条件における蛍光強度データの積を閾値２とし（ただし、閾値２＜閾値１）、二つの条件における蛍光強度データの積が閾値２を下回るすべての遺伝子サンプルの集団を発現量が少ない遺伝子集団とし、発現量が多い遺伝子集団の蛍光強度対数値を用いて主成分分析を実行し、第一主成分となる漸近線の傾きと切片を求め、求めた漸近線とＸ軸との角度をθとし、発現量が少ない遺伝子集団のＸ−Ｙ軸系における座標を右にθ角度回転した座標を計算し、座標軸回転後の発現量が少ない遺伝子集団の座標を用いて、蛍光強度平衡軸の傾きを計算し、計算された傾き（例えば、正、負、ゼロ等）に基づいて２つの条件の輝度データのうちどちらにバイアスが多く含まれているかを判定し、バイアスが多く含まれていると判定された条件の輝度データからバイアスを差し引くこと（例えば、一定のバイアスをもつ遺伝子集団について座標を回転させる等）により蛍光強度平衡軸と発現量の倍率軸を２軸とする新たなＸ−Ｙ軸系の蛍光強度散布図を構築するので、実測値のバイアスを効率的に除去し、かつ、データの性質を明白に表現できる蛍光強度散布図を作成することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、主成分分析は、分散・共分散行列を用いて行うので、従来から発現遺伝子解析に用いられている相関行列を用いた主成分分析法と比較して正規化を要しないため、効率的に主成分分析を行うことができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、予め定めた区間内のウィンドウを設定し、設定された各ウィンドウ内において遺伝子の輝度データの平均値、標準偏差、Ｐ値（例えば、９５％値）、重心などのうち少なくとも一つを用いて信頼限界点を決定する。そして、蛍光強度平衡軸方向に一定遺伝子ずつウィンドウを移動し、移動した各ウィンドウについて各信頼限界点を求め、求めた複数の信頼限界点に基づいて信頼境界線を作成する信頼境界線作成手段と、上記信頼境界線作成手段により作成された上記信頼境界線の外側に位置する遺伝子を発現量が変動した変動遺伝子として抽出するので、安定性、再現性、および、信頼度の高い発現遺伝子抽出を行うことができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、誤差の範囲が異なる実験データであっても、その誤差に応じて、発現量変動倍率の閾値が決められる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、シミュレーションにより得られた重複データの検定統計表に基づき、ｔ−分布を用いて信頼限界点を決定するので、従来手法と比較して正確かつ効率的に信頼限界点を求めることができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、複数の信頼限界点に基づいてスプライン曲線を作成することにより平滑化を行い信頼境界線を作成するので、効率的に信頼限界点を補完して信頼曲線を作成することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、蛍光強度の高い領域については、最後のウィンドウ（最も右側にあるウィンドウ）で求めた信頼限界点のＸ軸に対する水平延長線を用いて信頼限界線を作成するので、傾きが少なくどちらに収束するか判断不能の場合であっても、適切な信頼限界線を作成することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、蛍光強度の低い領域については、例えば、最初から数十程度の各ウィンドウで求めた信頼限界点から最小二乗法により求めた漸近線の補外を上記信頼限界線として用いるので、蛍光強度が低い遺伝子のスポットについても的確に検出することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、利用者にウィンドウ内の遺伝子数を入力させ、入力された遺伝子数の遺伝子が含まれる区間内でウィンドウを設定するので、実験毎に利用者が設定する遺伝子数を変動させることができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、利用者に信頼限界値を入力させ、ウィンドウ内において入力された信頼限界値に基づいて信頼限界点を決定するので、実験毎に利用者が設定する信頼限界値を変動させることができ、各実験の誤差を適切な範囲に収めることができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、利用者に、変動しない遺伝子の分布の形（例えば、分布の標準偏差（例えば、発現が変わらない遺伝子の分布を標準正規分布として標準偏差σ＝１、中心μ＝０としたときに、標準偏差σの幅を０．１から１．５の範囲で設定する））、上記変動遺伝子の分布の形（例えば、中心（例えば、当該条件のときに、中心μの幅を０．４から３の範囲で設定する））、上記変動遺伝子の検出基準（例えば、全体数からみた検出された遺伝子の割合を、２／３、２／４、３／４、３／６、４／６などで設定する）、実験の繰り返し数、および、シミュレーション回数（例えば、３回から１０回の範囲で設定する）のうち少なくとも一つに関する情報を含むシミュレーション条件を入力させ、設定されたシミュレーション条件に従って、同一の遺伝子群に対して同じ分布から繰り返して生成し、遺伝子検出を実行し、発現遺伝子を検出するシミュレーションを複数回実行し、上記検出手段による結果の偽陽性率と偽陰性率を計算し、実験の繰り返し数、シミュレーション条件、および検出感度と検出信頼度との関係を計算し、発現量が変わる遺伝子の検定統計表を作成し、シミュレーション条件毎に、シミュレーション実行によるシミュレーション結果を出力するので、様々な条件におけるシミュレーション結果を組み合わせることにより上記の組み合わせによる検出力と検出信頼度を知ることができる。すなわち、同じ条件の対照実験を繰り返して行い、得られたそれぞれ異なったデータセットに対して変動遺伝子の検出を行い、あらかじめ決めた回数以上検出される遺伝子のみを選択することにより、期待通りの信頼度あるいは検出力で変動遺伝子を検出できる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、発現量が変わらない遺伝子が変動遺伝子として検出されたエラー（第一種のエラー）や、変動遺伝子が発現が変わらない遺伝子として検出されたエラー（第二種のエラー）を算出して比較することにより、シミュレーションのデータから上記の手法による変動遺伝子を検出する検出力と信頼度を把握でき、実際の実験データに対して、期待される検出力と信頼度を得るために、実験の繰り返し数と変動遺伝子の検出基準、および信頼限界点の組み合わせを設定することができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、シミュレーションにより得られた重複データの検定統計表に基づき、何回実験を行えば、正確な実験データを取ることができるかを予測することが可能になり、実験効率を著しく向上させることができる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
また、本発明によれば、各スポットの偏差値を計算するので、このように計算された各スポットの偏差値を変動比率（倍率）の代わりに用いることで、スライド間の誤差の差異に影響されない解析が可能になる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
さらに、本発明によれば、本装置により計算される偏差値を、クラスター解析に代表される多変量解析において変動比率の対数や正規化した変動比率の変わりに用いることができ、発現量の大小による誤差の影響の違いに左右されない解析が可能になる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体を提供することができる。
産業上の利用可能性
以上のように、本発明にかかる遺伝子発現情報解析装置、遺伝子発現情報解析方法、プログラム、および、記録媒体は、ＤＮＡマイクロアレイやＤＮＡチップなどの測定値データの解析を行うバイオインフォマティクス分野において極めて有用である。
本発明は、産業上多くの分野、特に医薬品、食品、化粧品、医療、遺伝子発現解析等の分野で広く実施することができ、極めて有用である。
【図面の簡単な説明】
第１図は、本発明による分散・共分散行列を用いた主成分分析の概念を示す図であり、第２図は、本発明による新しい座標系での漸近線を求める処理の概念を示す図であり、第３図は、本発明による分布図の再構築を概念的に示す図であり、第４図は、本発明による発現倍率の混合正規分布モデルを示す図であり、第５図は、本発明による発現倍率の混合正規分布モデルを示す図であり、第６図は、本発明による発現倍率の混合正規分布モデルを示す図であり、第７図は、本発明によるシミュレーションによる第一種の検出エラーの計算結果の一例を示した図であり、第８図は、本発明によるシミュレーションによる第一種の検出エラーの計算結果の一例を示した図であり、第９図は、本発明によるシミュレーションによる第一種の検出エラーの計算結果の一例を示した図であり、第１０図は、本発明によるシミュレーションによる第一種の検出エラーの計算結果の一例を示した図であり、第１１図は、本発明による発現変動信頼曲線の作成を概念的に示した図であり、第１２図は、本発明による発現変動信頼曲線の作成を概念的に示した図であり、第１３図は、本発明による発現変動信頼曲線の作成を概念的に示した図であり、第１４図は、本発明による発現変動信頼曲線の作成を概念的に示した図であり、第１５図は、本実施形態の本装置のメイン処理を示すフローチャートであり、第１６図は、本実施形態の本装置のバックグラウンド補正処理の一例を示すフローチャートであり、第１７図は、本実施形態の本装置のバイアス補正処理の一例を示すフローチャートであり、第１８図は、本実施形態の本装置の遺伝子検出処理の一例を示すフローチャートであり、第１９図は、本実施形態の本装置のシミュレーション処理の一例を示すフローチャートであり、第２０図は、ウィンドウ設定部１０２ｉの処理により、出力装置１１４に出力される遺伝子抽出条件設定画面の一例を示す図であり、第２１図は、シミュレーション条件設定部１０２ｒの処理により、出力装置１１４に出力されるシミュレーション条件設定画面の一例を示す図であり、第２２図は、本発明が適用される本装置の構成の一例を示すブロック図であり、第２３図は、バイアス補正部１０２ｂの構成の一例を示すブロック図であり、第２４図は、遺伝子検出部１０２ｃの構成の一例を示すブロック図であり、第２５図は、シミュレーション部１０２ｄの構成の一例を示すブロック図であり、第２６図は、本実施形態の本装置の偏差値を用いた遺伝子検出処理の一例を示すフローチャートであり、第２７図は、本実施形態の本装置の偏差値の計算を示す概念図であり、第２８図は、本実施形態の本装置のバイアス判定処理の一例を示す概念図である。 Technical field
The present invention relates to a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium, and in particular, performs background correction of measured value data such as a DNA microarray and a DNA chip to detect a gene whose expression level has changed. The present invention relates to a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can be extracted with high statistical reliability.
Background art
In molecular biology research, new drug research and development, clinical diagnosis, etc., it is very important to search for a gene whose messenger RNA expression level has changed and to identify the gene. Therefore, as a method for examining expression changes at the RNA level, a DNA microarray in which cDNA fragments reversely transcribed from messenger RNA using reverse transcriptase are immobilized on a slide glass at high density, and a fine processing technique are used. DNA chips (trade names) manufactured by Affymetrix Co. (company name), which synthesizes various types of oligonucleotides on a substrate, have attracted attention and are used.
Expression gene analysis methods using these DNA microarrays and DNA chips are effective to identify genes whose expression levels have been exhaustively varied at once from hundreds to tens of thousands of genes. The correction method of the measurement value includes two main steps called a background correction step and a normalization step.
The background correction process corrects the background by simply subtracting the average background value of the blank spot from the individual measurements or the background value of the area surrounding each spot from the fluorescence intensity measurement of the spot. The method of performing is mainly used.
On the other hand, the normalization process is a fluorescence intensity scatter plot (scatter plot) of non-parametric regression lines obtained by the least squares method or Lowess smoothing (local quadratic estimator using bandwidth corresponding to neighboring regions). A method of correcting measured values of all genes with a coefficient for converting to a Y = X straight line is used.
However, the expression gene analysis method using a DNA microarray or a DNA chip has a problem that an analysis method for a highly reliable measurement value has not been established. This problem will be specifically described below.
First, the conventional correction method has a problem that it is easily influenced by differences in measurement apparatus, error between samples, fluorescence labeling efficiency, and the like. Also, in the normalization process, the least square method strictly draws two regression lines, while Lowess smoothing (Dudoit S, Yang YH, Callow MJ, Speed TP (2000) Statistical methods for identificating diffractive differentials based on in replicate cDNA microarray experiments.Technical report, DePartment of Statistics, UC-Berkeley.http: /www.stat.berkelley. We had a problem that it is not only those without a process basis.
Furthermore, in the method of extracting genes whose expression levels fluctuated, genes showing a corrected fluorescence intensity ratio of an arbitrary magnification or higher in the conventional standard are extracted as genes having a difference in expression, and the standard magnification is (Chen Y, Dougerty ER, Bittner ML (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray. BioJ. G. Hilsenbeck, etc. (1999) Statistical analysis of array expression data as applied to the probe. m of tamoxifen resistance. Journal of the National Cancer Institute, Vol. 91, No. 5, etc.).
On the other hand, a method of detecting a gene by optimization assuming an error model and a probability distribution of gene expression (Chen Y, Dougherty ER, Bittner ML (1997) Ratio-based decisions and the quantitative analysis of cDNA cDNA. Biomed Opt 2: 364-374, Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001) On differential variability extensibility ratios. on compres- s from microarray data. J Comp Biol 8: 37-52., etc.), but these methods are not stable and reproducible. Had. In addition, there is no statistical table that guides how many times the experiment should be repeated in order to obtain ideal detection reliability, so the relationship between the number of repetitions of the experiment, detection sensitivity, and detection reliability is clarified. Not.
Therefore, the present invention provides a general formula for reliably comparing gene expression levels in a method for analyzing gene expression using a DNA microarray and a DNA chip, and is robust to the actual data distribution (robust). An object of the present invention is to provide a highly reliable method for extracting an expression variable gene.
Disclosure of the invention
The gene expression information analyzing apparatus according to the present invention is a luminance data that is background-corrected by removing the background value from the measured luminance data of each spot where the fluorescence intensity indicating the expression level of the same gene is measured under two conditions. A fluorescence intensity scatter diagram with the logarithm of the luminance data corrected by the background correction means and the logarithm of the luminance data taken by the background correction means as an XY axis, and a bias with respect to the fluorescence intensity equilibrium axis for each gene spot A bias correction means for constructing a new XY axis system fluorescence intensity scatter diagram having two axes of the fluorescence intensity balance axis and the magnification axis of the expression level by removing the bias from the luminance data; Based on the fluorescence intensity scatter diagram of the new XY axis system constructed by the bias correction means, the expression level is Characterized in that a gene detection means for detecting a motion by variation gene.
According to this apparatus, the background correction was performed by removing the background value from the measured luminance data of each spot where the fluorescence intensity indicating the expression level of the same gene was measured under two conditions using a DNA microarray or DNA chip. Create brightness data. Here, the average of the fluorescence intensity measurement values of the blank spots from the fluorescence intensity measurement values of the individual spots may be used as the background value, or the average value of the fluorescence intensity measurement values of the blank in the area around each spot may be used. May be used as the background value. Further, the background correction may be performed by any other method.
In addition, according to this apparatus, the logarithm of the luminance data corrected for the background (natural logarithm or logarithm of 2, etc.) is taken on the XY axis, and a fluorescence intensity scatter diagram (scatter plot) is created. Find the bias for the fluorescence intensity equilibrium axis showing the same fluorescence intensity for each gene (that is, asymptotic lines obtained from gene groups with the same expression level under the two conditions for each gene spot), and remove the bias from the luminance data As a result, a new fluorescence intensity scatter diagram of the X-Y axis system with the fluorescence intensity balance axis and the expression magnification magnification axis as two axes is constructed. It is possible to construct a new orthogonal axis system in which the fluorescence intensity balance axis and the multiple axis of the expression level are two axes.
In addition, according to the present apparatus, since a fluctuating gene whose expression level fluctuates based on the constructed fluorescence intensity scatter diagram of the new XY axis system is detected, the measurement apparatus, compared with the conventional gene detection method, It becomes possible to accurately detect genes whose expression levels fluctuate without being affected by differences between samples and differences in fluorescence labeling efficiency.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, wherein the bias correction means performs principal component analysis using logarithmic values of gene populations with high expression levels. The first principal component creation means for obtaining the slope and intercept of the asymptote as the principal component, and the angle between the asymptote and the X axis obtained by the first principal component creation means is θ, Using the coordinate rotation means for calculating coordinates obtained by rotating the coordinates in the XY axis system to the right by θ angle, and the coordinates of the gene group with a small expression level after the coordinate axis rotation by the coordinate rotation means, the fluorescence intensity balance axis The bias determination means for determining which of the two luminance data of the two conditions contains the bias based on the calculated inclination, and the bias determination means By subtracting the bias from the luminance data determined to contain a large amount of bias, the fluorescence intensity of a new XY axis system having the fluorescence intensity balance axis and the magnification axis of the expression level as two axes. And a correction plot generation means for constructing a scatter diagram.
This more specifically shows an example of the bias correction means. According to this apparatus, a control gene sample for quality control of a DNA concentration dilution series (for example, an external gene λ DNA sample or a House-keeping gene sample such as a ribosome whose expression level hardly changes) is measured simultaneously with a target gene sample, Remove the control genes one by one from the gene with the smallest product of fluorescence intensity data, create a calibration curve for the gene expression level and DNA level from the data for all remaining control gene samples, and calculate the correlation coefficient of the data. Threshold value 1 is the product of the fluorescence intensity data in the two conditions of the control sample when the above correlation coefficients calculated in order satisfy the criterion (for example, 0.8 or more) where a strong correlation is first recognized. , Product of fluorescence intensity data in two conditions exceeds threshold 1 When all gene sample groups are gene groups with a high expression level, the expression level is calculated in order, and the correlation coefficient is calculated in order. The product of the fluorescence intensity data under the two conditions of the control sample is defined as threshold value 2 (threshold value 2 <threshold value 1), and the expression level of all gene sample populations where the product of the fluorescence intensity data under the two conditions falls below threshold value 2 The principal component analysis is performed using the fluorescence intensity logarithm values of the gene population with a small gene population and a high expression level, and the slope and intercept of the asymptote as the first principal component are obtained. Fluorescence intensity balance is calculated using the coordinates of the gene group with a small angle of expression and the coordinates of the gene group with a small expression level rotated by the θ angle to the right. axis The slope is calculated, and based on the calculated slope (for example, positive, negative, zero, etc.), it is determined which of the two conditions of luminance data contains more bias, and more bias is included. By subtracting the bias from the luminance data of the determined condition (for example, rotating the coordinates for a gene group having a constant bias), a new X− with the fluorescence intensity balance axis and the expression magnification magnification axis as two axes Since the fluorescence intensity scatter diagram of the Y axis system is constructed, it becomes possible to create a fluorescence intensity scatter diagram that can effectively remove the bias of the actual measurement value and clearly express the properties of the data.
Note that this device is not limited to determining the magnitude of the bias after the shaft rotation. For example, the magnitude of the bias is determined by comparing the slopes of the high expression asymptote and the low expression asymptote before the shaft rotation. May be.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, wherein the principal component analysis is performed using a variance / covariance matrix.
This more specifically shows an example of principal component analysis. According to this apparatus, since principal component analysis is performed using a variance / covariance matrix, normalization is not required as compared with a principal component analysis method using a correlation matrix that has been used for gene expression analysis. Therefore, the principal component analysis can be performed efficiently.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, wherein the gene detection means includes window setting means for setting a window in a predetermined section in the fluorescence intensity balance axis direction. , A confidence limit point determining means for determining a confidence limit point within each window set by the window setting means, a window moving means for moving the window by a fixed gene in the fluorescence intensity balance axis direction, and a movement by the window moving means For each window, each confidence limit point is obtained by the confidence limit point determining means, and a confidence boundary line creating means for creating a trust boundary line based on the plurality of obtained confidence limit points, and the trust boundary line creating means are created. In addition, genes located outside the trust boundary are extracted as variable genes whose expression levels fluctuated. Characterized in that it further includes a dynamic gene extraction means.
This shows one example of the gene detection means more specifically. According to this apparatus, a window in a predetermined section is set, and the average value, standard deviation, P value (for example, p = 0.05), center of gravity, etc. of gene luminance data in each set window are set. A confidence limit is determined using at least one of them. And a confidence boundary line creating means for moving the window by a certain gene in the direction of the fluorescence intensity balance axis, obtaining each confidence limit point for each moved window, and creating a confidence boundary line based on the plurality of confidence limit points obtained; Since the gene located outside the trust boundary created by the trust boundary creating means is extracted as a fluctuating gene whose expression level has fluctuated, it is possible to extract expressed genes with high stability, reproducibility, and high reliability. Will be able to do.
As a result, even if the experimental data has a different error range, the threshold for the expression level variation magnification can be determined according to the error.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, wherein the confidence limit point determination means uses a t-distribution based on a test statistical table of duplicate data obtained by simulation. And determining the confidence limit point.
This more specifically shows an example of determination of the confidence limit point. According to this apparatus, since the confidence limit point is determined using the t-distribution based on the duplicate statistical test statistical table obtained by simulation, the confidence limit point is obtained more accurately and efficiently than in the conventional method. Will be able to. In addition, according to the duplicate data test table, the number of duplicate experiments required at the experimental design stage can be obtained.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, wherein the confidence boundary line creating means performs smoothing by creating a spline curve based on the plurality of confidence limit points. And creating the trust boundary.
This more specifically shows an example of creation of the trust boundary line. According to this apparatus, since a spline curve is created based on a plurality of confidence limit points and smoothed and a confidence boundary line is created, a confidence curve can be efficiently created by complementing the confidence limit points. become able to.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, wherein the trust boundary creating means is configured to provide a confidence limit point obtained in the last window for a region having a high fluorescence intensity. The reliability limit line is created using a horizontal extension line.
This more specifically shows an example of creation of the trust boundary line. According to this apparatus, for the region with high fluorescence intensity, the confidence limit line is created using the horizontal extension line with respect to the X axis of the confidence limit point obtained in the last window (the rightmost window). Even if it is difficult to determine which one will converge, an appropriate confidence limit line can be created.
The gene expression information analyzing apparatus according to the next invention is the gene expression information analyzing apparatus described above, wherein the confidence boundary line creating means is configured to perform a minimum two-value search from a confidence limit point obtained in each window for a region having a low fluorescence intensity. Asymptotic line extrapolation obtained by multiplication is used as the reliability limit line.
This more specifically shows an example of creation of the trust boundary line. According to this apparatus, for an area with low fluorescence intensity, for example, an extrapolation of an asymptotic line obtained by the least square method from a confidence limit point obtained in each of several tens of windows from the beginning is used as the confidence limit line. Thus, it becomes possible to accurately detect a spot of a gene having a low fluorescence intensity.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, further comprising gene number input means for allowing a user to input the number of genes in the window, wherein the window setting means comprises the gene The window is set in the section including the genes of the number of genes input by the number input means.
This more specifically shows an example of the window setting. According to this apparatus, the user is allowed to input the number of genes in the window, and the window is set in the section including the genes of the input number of genes. Therefore, the number of genes set by the user is changed for each experiment. Will be able to.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, further comprising a confidence limit value input means for allowing a user to input a confidence limit value, wherein the confidence limit point determination means includes the above The reliability limit point is determined based on the reliability limit value input by the reliability limit value input means in the window.
This more specifically shows an example of determination of the confidence limit point. According to this apparatus, since the user inputs the confidence limit value and determines the confidence limit point based on the confidence limit value input in the window, the confidence limit value set by the user is varied for each experiment. And the error of each experiment can be kept within an appropriate range.
The gene expression information analysis apparatus according to the next invention is the gene expression information analysis apparatus described above, in which the user is notified of the shape of the non-variable gene distribution, the shape of the variable gene distribution, and the detection criteria for the variable gene. The simulation condition setting means for inputting a simulation condition including information on at least one of the number of repetitions of the experiment and the number of simulations, and the same gene group according to the simulation condition set by the simulation condition setting means It is generated repeatedly from the same distribution, the above-mentioned gene detection means is executed, the simulation for detecting the above-mentioned expressed gene is executed a plurality of times, the false positive rate and false negative rate of the result by the above detection means are calculated, Number of repetitions, simulation conditions, detection sensitivity and detection reliability And a simulation execution means for generating a test statistical table of genes whose expression level changes, and a simulation result output means for outputting a simulation result by the simulation execution means for each simulation condition. It is characterized by.
According to this apparatus, the user is given a form of gene distribution that does not vary (for example, a standard deviation of the distribution (for example, a distribution of genes whose expression does not change is a standard normal distribution with standard deviation σ = 1, center μ = 0) The standard deviation σ is set in the range of 0.1 to 1.5)), the distribution gene distribution shape (for example, the center (for example, the width of the center μ under the above conditions) 0.4 to 3))), and the detection criteria for the above-mentioned variable genes (for example, the ratio of the genes detected from the total number is 2/3, 2/4, 3/4, 3/6, 4), the simulation condition including information on at least one of the number of repetitions of the experiment and the number of simulations (for example, set in the range of 3 to 10 times) is input and set. According to simulation conditions Repeatedly generate from the same distribution for the same gene group, execute gene detection, execute the simulation to detect the expressed gene multiple times, calculate the false positive rate and false negative rate of the result by the above detection means, Calculate the number of experiment repetitions, simulation conditions, and the relationship between detection sensitivity and detection reliability, create a test statistical table for genes whose expression level changes, and output simulation results by simulation execution for each simulation condition. By combining the simulation results under various conditions, it is possible to know the detection power and the detection reliability by the above combination. In other words, repeat control experiments under the same conditions, detect fluctuating genes for the different data sets obtained, and select only those genes that are detected more than a predetermined number of times. Fluctuating genes can be detected with degree or power.
This also calculates the error when the gene whose expression level does not change is detected as a fluctuating gene (type 1 error) and the error when the gene that changes expression is detected as a gene whose expression does not change (second type error). In comparison, the power and reliability of detecting the fluctuating gene by the above method can be grasped from the simulation data, and in order to obtain the expected power and reliability of the actual experimental data, It is possible to set a combination of the number of repetitions of experiments, the detection criterion of the fluctuating gene, and the confidence limit point.
This also makes it possible to predict how many times an experiment can be performed to obtain accurate experimental data, thereby significantly improving the experimental efficiency.
The gene expression information analysis apparatus according to the next invention is characterized in that, in the gene expression information analysis apparatus described above, the gene detection means further includes a deviation value calculation means for calculating a deviation value of each spot. .
This more specifically shows an example of gene detection. According to this device, since the deviation value of each spot is calculated, an analysis that is not affected by the difference in error between slides can be performed by using the deviation value of each spot calculated in this way instead of the fluctuation ratio (magnification). Is possible.
In addition, this allows deviation values calculated by this device to be used in place of the logarithm of the fluctuation ratio or the normalized fluctuation ratio in multivariate analysis typified by cluster analysis. Analysis that is not affected by the difference in influence is possible.
The present invention also relates to a method for analyzing gene expression information. The method for analyzing gene expression information according to the present invention is based on the measured luminance data of each spot obtained by measuring the fluorescence intensity indicating the expression level of the same gene under two conditions. A fluorescence intensity scatter diagram in which a background correction step for creating background-corrected luminance data by removing the background value and the logarithm of the luminance data background-corrected by the background correction step are taken on the XY axis. To obtain a bias for the fluorescence intensity balance axis for each gene spot, and remove the bias from the luminance data to create a new XY with the fluorescence intensity balance axis and the expression level magnification axis as two axes. A bias correction step for constructing a fluorescence intensity scatter diagram of the axis system and the bias correction step Wherein the expression level based on the new X-Y axis system fluorescence intensity scatter diagram constructed by-up comprises a gene detection step of detecting a variation genes varied.
According to this method, the background value was corrected by removing the background value from the measured luminance data of each spot where the fluorescence intensity indicating the expression level of the same gene was measured under two conditions using a DNA microarray or a DNA chip. Create brightness data. Here, the average of the fluorescence intensity measurement values of the blank spots from the fluorescence intensity measurement values of the individual spots may be used as the background value, or the average value of the fluorescence intensity measurement values of the blank in the area around each spot may be used. May be used as the background value. Further, the background correction may be performed by any other method.
In addition, according to this method, the logarithm of the luminance data (background logarithm or logarithm of 2 etc.) corrected for the background is plotted on the XY axis, and a fluorescence intensity scatter plot (scatter plot) is created. A new X-Y axis fluorescence intensity with two axes of the fluorescence intensity balance axis and the expression level magnification axis is obtained by obtaining a bias with respect to the fluorescence intensity balance axis showing the same fluorescence intensity and removing the bias from the luminance data. Since the scatter diagram is constructed, the fluorescent component including more bias is judged, and after removing this bias, a new orthogonal axis system is constructed with the fluorescence intensity balance axis and the multiple axis of the expression level as two axes. Will be able to.
In addition, according to the present method, since a fluctuating gene whose expression level fluctuates is detected based on the constructed fluorescence intensity scatter diagram of the new XY axis system, the measurement method, compared with the conventional gene detection method, It becomes possible to accurately detect genes whose expression levels fluctuate without being affected by differences between samples and differences in fluorescence labeling efficiency.
The gene expression information analysis method according to the next invention is the gene expression information analysis method described above, wherein the bias correction step performs a principal component analysis using a logarithmic value of a gene population with a high expression level. The first principal component creation step for obtaining the slope and intercept of the asymptote as the principal component, and the angle between the asymptote and the X axis obtained in the first principal component creation step is θ, A coordinate rotation step for calculating coordinates obtained by rotating the coordinates in the XY axis system to the right by θ angle, and the coordinates of the gene group with a small amount of expression after rotating the coordinate axes in the coordinate rotation step, And a bias determination step for determining which of the two luminance data of the two conditions contains the bias based on the calculated tilt, and By subtracting the bias from the luminance data determined to contain a large amount of the bias in the bias determination step, a new X− with the fluorescence intensity balance axis and the expression level magnification axis as two axes is obtained. And a correction plot generation step of constructing a fluorescence intensity scatter diagram of the Y-axis system.
This more specifically shows an example of the bias correction step. According to this method, a control gene sample for quality control of a DNA concentration dilution series (for example, an external gene λ DNA sample, or a House-keeping gene sample such as a ribosome whose expression level hardly changes) is measured simultaneously with the target gene sample, Remove the control genes one by one from the gene with the smallest product of fluorescence intensity data, create a calibration curve for the gene expression level and DNA level from the data for all remaining control gene samples, and calculate the correlation coefficient of the data. Threshold value 1 is the product of the fluorescence intensity data in the two conditions of the control sample when the above correlation coefficients calculated in order satisfy the criterion (for example, 0.8 or more) where a strong correlation is first recognized. , Product of fluorescence intensity data in two conditions exceeds threshold 1 When all gene sample groups are gene groups with a high expression level, the expression level is calculated in order, and the correlation coefficient is calculated in order. The product of the fluorescence intensity data under the two conditions of the control sample is defined as threshold value 2 (threshold value 2 <threshold value 1), and the expression level of all gene sample populations where the product of the fluorescence intensity data under the two conditions falls below threshold value 2 The principal component analysis is performed using the fluorescence intensity logarithm values of the gene population with a small gene population and a high expression level, and the slope and intercept of the asymptote as the first principal component are obtained. Fluorescence intensity balance is calculated using the coordinates of the gene group with a small angle of expression and the coordinates of the gene group with a small expression level rotated by the θ angle to the right. axis The slope is calculated, and based on the calculated slope (for example, positive, negative, zero, etc.), it is determined which of the two conditions of luminance data contains more bias, and more bias is included. By subtracting the bias from the luminance data of the determined condition (for example, rotating the coordinates for a gene group having a constant bias), a new X− with the fluorescence intensity balance axis and the expression magnification magnification axis as two axes Since the fluorescence intensity scatter diagram of the Y axis system is constructed, it becomes possible to create a fluorescence intensity scatter diagram that can effectively remove the bias of the actual measurement value and clearly express the properties of the data.
Note that this method is not limited to determining the magnitude of the bias after the shaft rotation. For example, the magnitude of the bias is determined by comparing the slopes of the high expression asymptote and the low expression asymptote before the shaft rotation. May be.
The gene expression information analysis method according to the next invention is characterized in that in the gene expression information analysis method described above, the principal component analysis is performed using a variance / covariance matrix.
This more specifically shows an example of principal component analysis. According to this method, since the principal component analysis is performed using a variance / covariance matrix, normalization is not required as compared with the principal component analysis method using a correlation matrix conventionally used for expression gene analysis. Therefore, the principal component analysis can be performed efficiently.
The gene expression information analysis method according to the next invention is the gene expression information analysis method described above, wherein the gene detection step includes a window setting step for setting a window in a predetermined section in the fluorescence intensity balance axis direction; A confidence limit point determining step for determining a confidence limit point within each window set by the window setting step, a window moving step for moving the window by a certain gene in the direction of the fluorescence intensity balance axis, and a movement by the window moving step For each window, each confidence limit point is obtained by the confidence limit point determination step, and a confidence boundary line creation step for creating a confidence boundary line based on the plurality of confidence limit points obtained, and the trust boundary line creation step are created. Expresses genes located outside the trust boundary There, further comprising a fluctuation gene extraction step of extracting as variable genes varied.
This more specifically shows an example of the gene detection step. According to this method, a window in a predetermined section is set, and the average value, standard deviation, P value (for example, p = 0.05), center of gravity, etc. of gene luminance data are set in each set window. A confidence limit is determined using at least one of them. And a confidence boundary line creating step of moving the window by a certain gene in the direction of the fluorescence intensity equilibrium axis, obtaining each confidence limit point for each moved window, and creating a confidence boundary line based on the plurality of confidence limit points obtained; Because the gene located outside the trust boundary created by the trust boundary creating step is extracted as a fluctuating gene whose expression level fluctuated, the expression gene extraction with high stability, reproducibility, and high reliability can be performed. Will be able to do.
As a result, even if the experimental data has a different error range, the threshold for the expression level variation magnification can be determined according to the error.
The gene expression information analysis method according to the next invention is the gene expression information analysis method described above, wherein the confidence limit point determination step uses a t-distribution based on a test statistical table of duplicate data obtained by simulation. And determining the confidence limit point.
This more specifically shows an example of determination of the confidence limit point. According to this method, since the confidence limit point is determined using the t-distribution based on the duplicate statistical test statistical table obtained by simulation, the confidence limit point is obtained more accurately and efficiently than in the conventional method. Will be able to.
The gene expression information analysis method according to the next invention is the gene expression information analysis method described above, wherein the confidence boundary line creating step performs smoothing by creating a spline curve based on the plurality of confidence limit points. And creating the trust boundary.
This more specifically shows an example of creation of the trust boundary line. According to this method, since a spline curve is created based on a plurality of confidence limit points and smoothed and a confidence boundary line is created, a confidence curve can be efficiently created by complementing the confidence limit points. become able to.
In the gene expression information analysis method according to the next invention, in the gene expression information analysis method described above, the trust boundary line creation step is performed for the region of high fluorescence intensity with the confidence limit point obtained in the last window. The reliability limit line is created using a horizontal extension line.
This more specifically shows an example of creation of the trust boundary line. According to this method, in the region with high fluorescence intensity, the confidence limit line is created using the horizontal extension line with respect to the X axis of the confidence limit point obtained in the last window (the window on the rightmost side). Even if it is difficult to determine which one will converge, an appropriate confidence limit line can be created.
In the gene expression information analysis method according to the next invention, in the gene expression information analysis method described above, the trust boundary line creation step is performed at least two times from the confidence limit point obtained in each window for a region having a low fluorescence intensity. Asymptotic line extrapolation obtained by multiplication is used as the reliability limit line.
This more specifically shows an example of creation of the trust boundary line. According to this method, for an area with low fluorescence intensity, for example, an extrapolation of an asymptotic line obtained by the least square method from the confidence limit point obtained in each tens of windows from the beginning is used as the confidence limit line. Thus, it becomes possible to accurately detect a spot of a gene having a low fluorescence intensity.
The gene expression information analysis method according to the next invention further includes a gene number input step for allowing a user to input the number of genes in the window in the gene expression information analysis method described above, wherein the window setting step includes the gene The window is set in the section including the genes of the number of genes input in the number input step.
This more specifically shows an example of the window setting. According to this method, the user is allowed to input the number of genes in the window, and the window is set within a section including the genes of the input number of genes. Therefore, the number of genes set by the user is changed for each experiment. Will be able to.
The gene expression information analysis method according to the next invention further includes a confidence limit value input step for allowing a user to input a confidence limit value in the gene expression information analysis method described above, wherein the confidence limit point determination step includes The reliability limit point is determined based on the reliability limit value input by the reliability limit value input step in the window.
This more specifically shows an example of determination of the confidence limit point. According to this method, the user inputs the confidence limit value, and the confidence limit point is determined based on the confidence limit value input in the window. Therefore, the confidence limit value set by the user is changed for each experiment. And the error of each experiment can be kept within an appropriate range.
The gene expression information analysis method according to the next invention is the gene expression information analysis method described above, wherein the user is given a form of distribution of the non-variable genes, a form of distribution of the fluctuating genes, and a detection criterion for the fluctuating genes. In accordance with the simulation conditions set in the simulation condition setting step and the simulation condition setting step for inputting a simulation condition including information on at least one of the repetition number of the experiment and the number of simulations, the same gene group It is generated repeatedly from the same distribution, the above-mentioned gene detection means is executed, the simulation for detecting the above-mentioned expressed gene is executed a plurality of times, the false positive rate and false negative rate of the result by the above detection means are calculated, The number of repetitions, the above simulation conditions, and detection sensitivity A simulation execution step of calculating a relationship with the output reliability and creating a test statistical table of genes whose expression level changes, and a simulation result output step of outputting a simulation result by the simulation execution step for each simulation condition It is characterized by including.
According to this method, the user is given a form of gene distribution that does not vary (for example, a standard deviation of the distribution (for example, a distribution of genes whose expression does not change is a standard normal distribution with standard deviation σ = 1, center μ = 0) The standard deviation σ is set in the range of 0.1 to 1.5)), the distribution gene distribution shape (for example, the center (for example, the width of the center μ under the above conditions) 0.4 to 3))), and the detection criteria for the above-mentioned variable genes (for example, the ratio of the genes detected from the total number is 2/3, 2/4, 3/4, 3/6, 4), the simulation condition including information on at least one of the number of repetitions of the experiment and the number of simulations (for example, set in the range of 3 to 10 times) is input and set. According to simulation conditions Repeatedly generate from the same distribution for the same gene group, execute gene detection, execute the simulation to detect the expressed gene multiple times, calculate the false positive rate and false negative rate of the result by the above detection means, Calculate the number of experiment repetitions, simulation conditions, and the relationship between detection sensitivity and detection reliability, create a test statistical table for genes whose expression level changes, and output simulation results by simulation execution for each simulation condition. By combining the simulation results under various conditions, it is possible to know the detection power and the detection reliability by the above combination. In other words, repeat control experiments under the same conditions, detect fluctuating genes for the different data sets obtained, and select only those genes that are detected more than a predetermined number of times. Fluctuating genes can be detected with degree or power.
This also calculates the error when the gene whose expression level does not change is detected as a fluctuating gene (type 1 error) and the error when the gene that changes expression is detected as a gene whose expression does not change (second type error). In comparison, the power and reliability of detecting the fluctuating gene by the above method can be grasped from the simulation data, and in order to obtain the expected power and reliability of the actual experimental data, It is possible to set a combination of the number of repetitions of experiments, the detection criterion of the fluctuating gene, and the confidence limit point.
This also makes it possible to predict how many times an experiment can be performed to obtain accurate experimental data, thereby significantly improving the experimental efficiency.
In the gene expression information analysis method according to the next invention, in the gene expression information analysis method described above, the gene detection step further includes a deviation value calculation step of calculating a deviation value of each spot.
This more specifically shows an example of gene detection. According to this method, the deviation value of each spot is calculated. Therefore, by using the deviation value of each spot calculated in this way instead of the fluctuation ratio (magnification), the analysis is not affected by the difference in error between slides. Is possible.
In addition, this allows the deviation value calculated by this method to be used in place of the logarithm of the fluctuation ratio or the normalized fluctuation ratio in multivariate analysis typified by cluster analysis. Analysis that is not affected by the difference in influence is possible.
In addition, the present invention relates to a program, and the program according to the present invention removes the background value from the measured luminance data of each spot in which the fluorescence intensity indicating the expression level of the same gene is measured under two conditions. Create a fluorescence intensity scatter plot with a background correction step for creating background-corrected luminance data and the logarithm of the luminance data corrected by the background correction step on the XY axis, and spot of each gene A new XY axis system fluorescence intensity scatter diagram with two axes of the fluorescence intensity balance axis and the expression level magnification axis is obtained by obtaining a bias with respect to the fluorescence intensity balance axis and removing the bias from the luminance data. Bias correction step to be constructed and a new one constructed by the above bias correction step Expression levels based on the fluorescence intensity scatter plot of X-Y axis system is characterized in that to perform the genetic expression information analysis method comprising the gene detection step of detecting a variation gene fluctuated computer such.
According to this program, the background correction was performed by removing the background value from the measured luminance data of each spot where the fluorescence intensity indicating the expression level of the same gene was measured under two conditions using a DNA microarray or DNA chip. Create brightness data. Here, the average of the fluorescence intensity measurement values of the blank spots from the fluorescence intensity measurement values of the individual spots may be used as the background value, or the average value of the fluorescence intensity measurement values of the blank in the area around each spot may be used. May be used as the background value. Further, the background correction may be performed by any other program.
In addition, according to this program, the logarithm of the luminance data with the background correction (natural logarithm or logarithm of 2 etc.) is taken on the XY axis, and a fluorescence intensity scatter diagram (scatter plot) is created. A new X-Y axis fluorescence intensity with two axes of the fluorescence intensity balance axis and the expression level magnification axis is obtained by obtaining a bias with respect to the fluorescence intensity balance axis showing the same fluorescence intensity and removing the bias from the luminance data. Since the scatter diagram is constructed, the fluorescent component including more bias is judged, and after removing this bias, a new orthogonal axis system is constructed with the fluorescence intensity balance axis and the multiple axis of the expression level as two axes. Will be able to.
Further, according to this program, since a fluctuating gene whose expression level fluctuates based on the constructed fluorescence intensity scatter diagram of the new XY axis system is detected, the measurement program, It becomes possible to accurately detect genes whose expression levels fluctuate without being affected by differences between samples and differences in fluorescence labeling efficiency.
The program according to the next invention is the program described above, wherein the bias correction step performs principal component analysis using logarithmic values of gene populations with high expression levels, and the slope of an asymptote that becomes the first principal component. The first principal component creation step for obtaining the intercept and the angle between the asymptotic line obtained in the first principal component creation step and the X axis is θ, and the coordinates in the XY axis system of the gene population with a low expression level are Calculate the slope of the fluorescence intensity balance axis using the coordinate rotation step for calculating coordinates rotated to the right by the θ angle and the coordinates of the gene population with a small amount of expression after rotating the coordinate axes in the coordinate rotation step. A bias determination step for determining which of the luminance data of the two conditions contains a large amount of the bias based on the measured slope, and the bias determination step. By subtracting the bias from the luminance data under the condition determined to contain a large amount of the bias, a new X-Y axis fluorescence having the fluorescence intensity balance axis and the magnification axis of the expression level as two axes. And a correction plot generation step of constructing an intensity scatter diagram.
This more specifically shows an example of the bias correction step. According to this program, a control gene sample for quality control of a DNA concentration dilution series (for example, an external gene λ DNA sample or a House-keeping gene sample such as a ribosome whose expression level hardly changes) is measured simultaneously with a target gene sample, Remove the control genes one by one from the gene with the smallest product of fluorescence intensity data, create a calibration curve for the gene expression level and DNA level from the data for all remaining control gene samples, and calculate the correlation coefficient of the data. Threshold value 1 is the product of the fluorescence intensity data in the two conditions of the control sample when the above correlation coefficients calculated in order satisfy the criterion (for example, 0.8 or more) where a strong correlation is first recognized. , The product of the fluorescence intensity data under the two conditions is the threshold value 1 A group of all gene samples that turn around is a gene group with a large amount of expression, and the expression coefficient is calculated in order when the correlation coefficient degree satisfies the criterion (for example, 0.5 or more) that a weak correlation is first observed. The product of the fluorescence intensity data under the two conditions of the control sample is defined as threshold value 2 (threshold value 2 <threshold value 1), and the expression level of all gene sample populations where the product of the fluorescence intensity data under the two conditions falls below threshold value 2 The principal component analysis is performed using the fluorescence intensity logarithm values of the gene population with a small gene population and a high expression level, and the slope and intercept of the asymptote as the first principal component are obtained. Fluorescence intensity is calculated using the coordinates of the gene population with a low expression level after rotating the coordinate axis, with the angle θ being the angle and the coordinates in the XY axis system of the gene population with a low expression level rotated to the right by the θ angle Calculates the slope of the equilibrium axis, determines which of the two conditions of luminance data contains more bias based on the calculated slope (eg, positive, negative, zero, etc.), and contains more bias Subtracting the bias from the luminance data determined to be determined (for example, rotating the coordinates for a group of genes with a constant bias), a new axis with the fluorescence intensity balance axis and the expression level magnification axis as two axes Since the X-Y axis system fluorescence intensity scatter diagram is constructed, it is possible to efficiently remove the bias of the actual measurement value and create a fluorescence intensity scatter diagram that can clearly express the properties of the data. .
Note that this program is not limited to determining the magnitude of the bias after the axis rotation. For example, the magnitude of the bias is determined by comparing the slopes of the high expression asymptote and the low expression asymptote before the axis rotation. May be.
The program according to the next invention is the program described above, wherein the principal component analysis is performed using a variance / covariance matrix.
This more specifically shows an example of principal component analysis. According to this program, principal component analysis is performed using a variance / covariance matrix, so normalization is not required compared to the principal component analysis method using a correlation matrix conventionally used for expression gene analysis. Therefore, the principal component analysis can be performed efficiently.
The program according to the next invention is the above-described program, wherein the gene detection step is set by a window setting step for setting a window in a predetermined section in the fluorescence intensity balance axis direction, and the window setting step. A confidence limit point determining step for determining a confidence limit point in each window, a window moving step for moving the window by a fixed gene in the direction of the fluorescence intensity balance axis, and the confidence limit point for each window moved by the window moving step. In the determination step, each confidence limit point is obtained, and a trust boundary line creation step for creating a trust boundary line based on the plurality of trust limit points obtained, and outside the trust boundary line created by the trust boundary line creation step. The gene that is located Characterized in that it further comprises a regulated genes extracting Te.
This more specifically shows an example of the gene detection step. According to this program, a window within a predetermined section is set, and the average value, standard deviation, P value (for example, p = 0.05), center of gravity, etc. of gene luminance data within each set window are set. A confidence limit is determined using at least one of them. And a confidence boundary line creating step of moving the window by a certain gene in the direction of the fluorescence intensity equilibrium axis, obtaining each confidence limit point for each moved window, and creating a confidence boundary line based on the plurality of confidence limit points obtained; Because the gene located outside the trust boundary created by the trust boundary creating step is extracted as a fluctuating gene whose expression level fluctuated, the expression gene extraction with high stability, reproducibility, and high reliability can be performed. Will be able to do.
As a result, even if the experimental data has a different error range, the threshold for the expression level variation magnification can be determined according to the error.
The program according to the next invention is the program described above, wherein the confidence limit point determining step determines the confidence limit point using a t-distribution based on a test statistical table of duplicate data obtained by simulation. It is characterized by that.
This more specifically shows an example of determination of the confidence limit point. According to this program, since a confidence limit point is determined using a t-distribution based on a test statistical table of duplicate data obtained by simulation, the confidence limit point is obtained more accurately and efficiently than in the conventional method. Will be able to.
The program according to the next invention is the above-described program, wherein the trusted boundary creating step creates a trusted boundary by performing smoothing by creating a spline curve based on the plurality of confidence limit points. It is characterized by that.
This more specifically shows an example of creation of the trust boundary line. According to this program, smoothing is performed by creating a spline curve based on a plurality of confidence limit points and a confidence boundary line is created. Therefore, a confidence curve can be efficiently created by complementing the confidence limit points. become able to.
The program according to the next invention is the above-described program, wherein the trust boundary creating step uses the horizontal extension line of the confidence limit point obtained in the last window for the region with high fluorescence intensity. It is characterized by creating a limit line.
This more specifically shows an example of creation of the trust boundary line. According to this program, for the region with high fluorescence intensity, the confidence limit line is created using the horizontal extension line with respect to the X axis of the confidence limit point obtained in the last window (the window on the rightmost side). Even if it is difficult to determine which one will converge, an appropriate confidence limit line can be created.
In the program according to the next invention, in the above-described program, the step of creating a confidence boundary line includes a step for compensating an asymptotic line obtained by a least square method from a confidence limit point obtained in each window for a region having a low fluorescence intensity. The outside is used as the reliability limit line.
This more specifically shows an example of creation of the trust boundary line. According to this program, for regions with low fluorescence intensity, for example, the extrapolation of asymptotic lines obtained by the least square method from the confidence limit points obtained in each of several tens of windows from the beginning is used as the confidence limit line. Thus, it becomes possible to accurately detect a spot of a gene having a low fluorescence intensity.
The program according to the next invention further includes a gene number input step for allowing a user to input the number of genes in the window in the program described above, wherein the window setting step includes the above-described number of genes input step. The window is set within the section including the number of genes.
This more specifically shows an example of the window setting. According to this program, the user inputs the number of genes in the window, and the window is set in the section including the genes of the input number of genes. Therefore, the number of genes set by the user is changed for each experiment. Will be able to.
The program according to the next invention further includes a confidence limit value input step for allowing a user to input a confidence limit value in the program described above, wherein the confidence limit point determination step includes inputting the confidence limit value within the window. The reliability limit point is determined based on the reliability limit value input in the step.
This more specifically shows an example of determination of the confidence limit point. According to this program, the user inputs the confidence limit value, and the confidence limit point is determined based on the confidence limit value input in the window. Therefore, the confidence limit value set by the user is changed for each experiment. And the error of each experiment can be kept within an appropriate range.
The program according to the next invention is the program described above, in which the user is given a distribution form of the non-variable gene, a distribution form of the fluctuating gene, a detection criterion for the fluctuating gene, a repetition number of experiments, and In accordance with the simulation condition setting step for inputting simulation conditions including information on at least one of the simulation times, and the simulation conditions set in the simulation condition setting step, the same gene group is repeatedly generated from the same distribution. And executing the gene detection means, executing a simulation for detecting the expressed gene a plurality of times, calculating the false positive rate and false negative rate of the result by the detection means, the number of repetitions of the experiment, the simulation conditions, And the relationship between detection sensitivity and detection reliability A simulation execution step of creating a test statistic tables of genes whose expression level varies, for each of the simulation conditions, and further comprising a simulation result output step of outputting a simulation result by the simulation execution step.
According to this program, the user is given a form of gene distribution that does not fluctuate (for example, the standard deviation of the distribution (for example, the distribution of genes whose expression does not change is standard normal distribution with standard deviation σ = 1, center μ = 0 The standard deviation σ is set in the range of 0.1 to 1.5)), the distribution gene distribution shape (for example, the center (for example, the width of the center μ under the above conditions) 0.4 to 3))), and the detection criteria for the above-mentioned variable genes (for example, the ratio of the genes detected from the total number is 2/3, 2/4, 3/4, 3/6, 4), the simulation condition including information on at least one of the number of repetitions of the experiment and the number of simulations (for example, set in the range of 3 to 10 times) is input and set. Simulation conditions Thus, the same gene group is repeatedly generated from the same distribution, the gene detection is performed, the simulation for detecting the expressed gene is performed a plurality of times, and the false positive rate and false negative rate of the result by the above detection means are calculated. Calculates the number of repeated experiments, simulation conditions, and the relationship between detection sensitivity and detection reliability, creates a test statistical table for genes whose expression level changes, and outputs simulation results from simulation execution for each simulation condition Therefore, it is possible to know the detection power and the detection reliability by the above combination by combining the simulation results under various conditions. In other words, repeat control experiments under the same conditions, detect fluctuating genes for the different data sets obtained, and select only those genes that are detected more than a predetermined number of times. Fluctuating genes can be detected with degree or power.
This also calculates the error when the gene whose expression level does not change is detected as a fluctuating gene (type 1 error) and the error when the gene that changes expression is detected as a gene whose expression does not change (second type error). In comparison, the power and reliability of detecting the fluctuating gene by the above method can be grasped from the simulation data, and in order to obtain the expected power and reliability of the actual experimental data, It is possible to set a combination of the number of repetitions of experiments, the detection criterion of the fluctuating gene, and the confidence limit point.
This also makes it possible to predict how many times an experiment can be performed to obtain accurate experimental data, thereby significantly improving the experimental efficiency.
The program according to the next invention is characterized in that, in the program described above, the gene detecting step further includes a deviation value calculating step of calculating a deviation value of each spot.
This more specifically shows an example of gene detection. According to this program, the deviation value of each spot is calculated. By using the deviation value of each spot calculated in this way instead of the fluctuation ratio (magnification), the analysis is not affected by the difference in error between slides. Is possible.
In addition, this allows deviation values calculated by this program to be used instead of the logarithm of the fluctuation ratio or the normalized fluctuation ratio in multivariate analysis typified by cluster analysis. Analysis that is not affected by the difference in influence is possible.
The present invention also relates to a recording medium, and the recording medium according to the present invention is characterized by recording the program described above.
According to this recording medium, the program described above can be realized by using a computer by causing the computer to read and execute the program recorded on the recording medium. An effect can be obtained.
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of a gene expression information analyzing apparatus, a gene expression information analyzing method, a program, and a recording medium according to the present invention will be described in detail with reference to the drawings. The present invention is not limited to the embodiments.
[Outline of this device]
Hereinafter, the basic concept of the apparatus will be described, and then the configuration, processing, and the like of the apparatus in each embodiment of the present invention will be described in detail.
[Basic concept of this device]
Hereinafter, the basic concept of the present invention will be described with reference to FIGS. 1 to 6 and FIGS. 11 to 14.
1. Two-step data correction for control fluorescence measurements
In the measurement of expressed genes using a DNA microarray or DNA chip, the expression level of each gene is reflected in the brightness of the fluorescence measurement value corresponding to each gene, and the expression level ratio of each gene is the same as the control fluorescence measurement value. Observed as a ratio. However, due to errors in DNA microarrays and DNA chips, errors in fluorescence labeling reactions, measurement errors, differences in the molar fluorescence coefficient of fluorescent materials, the ratio of the expression level does not accurately reflect the ratio of the fluorescence measurement values. Therefore, in the present invention, the following processing is performed in order to process these errors.
(1) Background correction
Background correction is performed as the first stage of data correction. First, the luminance measured under the two conditions for gene i is (a_i, B_i) And the background (BKGa) from the brightness of each gene_i, BKGb_i) Is subtracted. This correction result (a_i-BKGa_i, B_i-BKGb_i) (A_i, B_i).
(2) Bias correction
Next, as a second-stage data correction, the bias is corrected by the following procedure. First, the outline of the bias correction of the present invention will be described. A control gene sample for quality control of a DNA concentration dilution series (for example, an external gene λ DNA sample, or a House-keeping gene sample such as a ribosome whose expression level hardly changes) is measured simultaneously with the target gene sample, and the product of the fluorescence intensity data Remove the control genes one by one from the smallest gene in order, create a standard curve for the gene expression level and DNA level from the data of all remaining control gene samples, calculate the correlation coefficient of the data, and calculate in order When the above correlation coefficient satisfies a criterion (for example, 0.8 or more) in which a strong correlation is first recognized, the product of the fluorescence intensity data in the two conditions of the control sample is set as the threshold value 1, and the fluorescence in the two conditions All gene samples whose product of intensity data exceeds the threshold 1 The control group in the case where the gene group with a high expression level is satisfied, and the correlation coefficient degree in which the expression level is calculated in order satisfies a criterion (for example, 0.5 or more) where a weak correlation is first observed, is satisfied. The product of fluorescence intensity data under one condition is threshold 2 (threshold 2 <threshold 1), and the population of all gene samples for which the product of fluorescence intensity data under two conditions is less than threshold 2 is a gene group with low expression level. The principal component analysis is performed using the fluorescence intensity logarithm value of the gene population with a large amount of expression, the slope and intercept of the asymptote as the first principal component is obtained, and the angle between the obtained asymptote and the X axis is θ. Calculate the coordinates of the gene group with low expression level by rotating the coordinates in the XY axis system to the right by θ angle, and use the coordinates of the gene group with low expression level after rotation of the coordinate axis to determine the slope of the fluorescence intensity balance axis. Calculate and total Based on the determined slope (for example, positive, negative, zero, etc.), it is determined which of the two conditions of luminance data contains a large amount of bias, and the condition determined to include a large amount of bias. Fluorescence intensity of a new X-Y axis system with the fluorescence intensity balance axis and the expression magnification magnification axis as two axes by subtracting the bias from the luminance data (for example, rotating the coordinates for a gene group having a constant bias). Since the scatter diagram is constructed, it becomes possible to efficiently remove the bias of the actual measurement value and to create a fluorescence intensity scatter diagram that can clearly express the nature of the data.
Hereinafter, an example of the bias correction procedure will be described in detail.
i) General relational expression of control fluorescence measurement
The correction of the bias k according to the present invention is based on the general formula (1) or (1 ') representing the relationship between the fluorescence measurement values A and B.

Here, a, b, and k are unknown parameter constants. The average bias k is subtracted from the one of A and B that includes more biases. That is, if the background noise of A is greater than B and includes many biases, then Equation 1 is used, while the background noise of B is greater than A and includes many biases. In the case where it is, Equation 1 ′ is used. a, b, and k are (Log₂A-Log₂Inferred from the plot of fluorescence measurement values in the orthogonal axis system of B).
ii) Extraction of fluorescence intensity equilibrium axis by principal component analysis using variance / covariance matrix
If the expression level is the same, the fluorescence measurement value of the control experiment of DNA microarray or DNA chip is theoretically (Log₂A-Log₂B) Straight line Log indicating 1: 1 on the fluorescence intensity scatter diagram of the orthogonal axis system₂A = Log₂Should be located on B. However, due to differences in the properties of fluorescent substances, differences in experimental conditions, etc., the fluorescence intensity equilibrium axis showing the same fluorescence intensity (that is, for each gene spot, obtained from a gene population with the same expression level under the two conditions) Logged asymptote) is Log₂A = Log₂B may not be followed. In this case, assuming that the number of genes to be examined is sufficient as a sample (for example, 1,000 or more), and that the number of variable genes, which are genes whose expression level changes, is a low percentage of the total number, the fluorescence intensity The equilibrium axis is (Log₂A_i, Log₂B_i) Assuming a population asymptote.
Where A_iAnd B_iIs a value much larger than k, that is, when the influence of bias is small and can be ignored, Equations 1 and 1 '

Can be approximated. At this time, principal component analysis using a variance / covariance matrix is performed in order to obtain the slope a and the intercept b. The principal component analysis using the variance / covariance matrix does not require normalization, unlike the principal component analysis method using the correlation matrix conventionally used in gene analysis.
Here, FIG. 1 is a diagram showing the concept of principal component analysis using a variance / covariance matrix. Log₂A to x, Log₂When B is simplified to y, Equation 2 representing an asymptote is

It becomes.
Therefore, each point (x_i, Y_i) To asymptote_iIs

It is calculated by.
Also, the distance D from all points to the asymptote is

It becomes.
Here, when the distance D is minimum, asymptotic line parameters a and b that are most appropriate on the distribution map are determined.
When the distance D is the smallest,

The following two conditions are satisfied.
From Equation 6,

From Equation 7,

It becomes. In Equation 9, a is greater than zero of the two solutions. S_xIs x_iVariance, S_yIs y_iVariance, S_xyIs x_iAnd y_iMeans covariance of. In the actual correction, a and b are the product A_iB_iUpper gene population (Log₂A, Log₂B) is used. A simple calculation is the product A of all genes_iB_iIt is calculated | required using the gene group of the upper rank (for example, 70%). To determine the accuracy, a control gene sample for quality control of a DNA concentration dilution series (for example, an external gene λ DNA sample or a House-keeping gene sample such as a ribosome whose expression level hardly changes) is measured simultaneously with the target gene sample, and fluorescence is measured. The control genes are removed one by one from the gene with the smallest product of intensity data, and a calibration curve for the gene expression level and DNA level is created from the data of all remaining control gene samples, and the correlation coefficient of the data is calculated. Then, the product of the fluorescence intensity data in the two conditions of the control sample when the correlation coefficient calculated in order satisfies a criterion (for example, 0.8 or more) where a strong correlation is first recognized is set as a threshold value 1, The product of fluorescence intensity data under two conditions exceeds the threshold value 1. Expression of a population of gene sample of Te is a large gene cluster.
iii) Correction of bias
(Log₂A-Log₂In the orthogonal axis system of B), when A has a large background noise and includes more bias than B, the asymptotic line and Log₂The coordinates of the point that intersects the A axis is (A_c, 0)

And

It becomes.
If B has a large background noise and includes more bias than A, the asymptotic line and Log₂The coordinates of the point where the B axis intersects are (0, B_c), From Equation 1 '

And

It becomes.
Here, since a and b have already been obtained, A_cAnd B_cRespectively (Log₂A-Log₂B) Product A of orthogonal axis system_iB_iLower gene population (Log₂A, Log₂Asymptote and Log obtained from B)₂A axis or Log₂It is obtained as the value of the intersection of the B axis. Genes with small fluorescence measurements are strongly affected by errors, so the product A_iB_iLower gene population (Log₂A, Log₂The gene used to calculate the asymptote of B) is a simple calculation method that uses the product A of all genes._iB_iThe lower order (10% as an example) is used. In order to obtain accurately, a control gene sample for quality control of a DNA concentration dilution series (for example, an external gene λ DNA sample, or a House-keeping gene sample such as a ribosome whose expression level hardly changes) is measured simultaneously with a target gene sample, Remove the control genes one by one from the gene with the smallest product of fluorescence intensity data, create a calibration curve for the gene expression level and DNA level from the data for all remaining control gene samples, and calculate the correlation coefficient of the data. Threshold value 2 is the product of the fluorescence intensity data in the two conditions of the control sample when the above correlation coefficients calculated in order satisfy the criterion (for example, 0.5 or more) at which weak correlation is first observed. (Threshold 2 <threshold 1), fluorescence intensity under two conditions Over expression of the population of all genes samples product is below the threshold value 2 of data is less gene cluster.
The measured value (A_i) And (B_i) And the asymptote is Log₂A axis and Log₂This can be determined by crossing either of the B axes. At this time,

Or

It becomes.
iv) Bias determination
FIG. 2 is a diagram showing a concept of processing for obtaining an asymptote in a new coordinate system.
Product A by least squares_iB_iAs an asymptote of the subgene population of

Is required.
However, to determine the independent and dependent variables of the least squares method, first (Log₂A-Log₂B) Shaft system is product A_iB_iIt is necessary to rotate to an axis system in which the asymptotic line obtained from the upper gene group of is a new X axis. Therefore, (Log₂A_i, Log₂B_i) New coordinates (Log)₂A_i', Log₂B_i’)

More demanded.
Further, from the inclination α = tan θ,

Is required.
Next, in the new coordinate system A_iB_iSubgene group (A_i', B_iThe asymptote of ′) is obtained by the method of least squares. Where the asymptote is

And If m is negative, it is determined that B includes more bias than A. On the other hand, when m is a positive number, it is determined that A includes more bias than B.
v) Bias calculation
When m is a negative number on the asymptotic line shown in Equation 17, (Log₂A, Log₂B) In the shaft system, product A_iB_iSubgene group (A_i', B_i')₂The intercept with the B axis is the least squares method (Log₂A is an independent variable, Log₂B is obtained from the dependent variable).

If

It becomes.
On the other hand, if m is a positive number, (Log₂A, Log₂B) In the shaft system, product A_iB_iSubgene group (A_i', B_i')₂The intercept with the A axis is calculated by the least square method (Log₂B is an independent variable, Log₂A is a dependent variable).

If

It becomes.
The data correction between the second stage is performed by subtracting the bias obtained in Expression 11 or Expression 13 from the entire data of one of the control measurement values.
With the above correction, a new data plot (Log₂A_i, Log₂(B_i-K)) or (Log₂(A_i-K), Log₂B_i) (Hereinafter “correction plot (Log₂A_i, Log₂B_i) ") And proceed to the next analysis. Therefore, Formula 1 or Formula 1 'is

Can be expressed as
2. Robust detection of genes with altered expression levels by multiple testing
In this method, it is assumed that the corrected data is composed of a mixed distribution of a gene population whose expression level changes and a gene population whose expression level does not change. First, for each data pair, a window within a certain interval is set in the direction of the fluorescence intensity balance axis, and a confidence limit point of an arbitrary risk factor based on a student's t-distribution is obtained within each window. Subsequently, the window is moved by a certain gene in the direction of the fluorescence intensity balance axis (X axis), and each confidence limit point is obtained. The obtained plurality of confidence limit points are complemented by smoothing (spline) to obtain a confidence boundary line (confidence curve).
From this result, a gene located outside the trust boundary is selected as a gene whose expression level has changed. In order to obtain higher extraction reliability, genes whose expression level has been changed reliably are selected on the basis of the ratio of majority vote by repeated experiments. Next, in order to reduce the first type of error in extraction, it is recognized that the expression level of the gene has changed only when the expression level has been extracted by a multi-test as if the expression level has changed over a predetermined number of times.
(1) Reconstruction of data distribution by ratio of fluorescence intensity balance axis and expression level
FIG. 3 is a diagram conceptually showing the reconstruction of the distribution map. As shown in FIG. 3, each correction plot (Log₂A_i, Log₂B_i) To fluorescence intensity equilibrium axis Log₂B_i= ΚLog₂A_iThe vertical distance to + I is considered to be proportional to the ratio of the expression level. It is also clear that the fluorescence intensity increases proportionally as it moves to the right on the fluorescence intensity balance axis. Therefore, a fluorescence intensity scatter diagram in which the distance from each gene to the fluorescence intensity balance axis is calculated as the value of the Y axis and the fluorescence intensity balance axis is the X axis can clearly express the nature of the data. Where Y value of each gene is d₂(Magnification rate of expression level) is calculated by Equation 4.
Also, the X-axis value d of each gene₁(Fluorescence intensity) follows the formula 22 that generally shows the relationship between A and B, even though the group of fluorescence measurements A and B includes various errors. The reconstructed fluorescence intensity scatter diagram has an XY axis of (fluorescence intensity-change rate of expression level).
(2) Multiple tests of mixed distribution models of gene populations whose expression levels change and gene populations whose expression levels do not change
4 to 6 are diagrams showing mixed normal distribution models of expression magnification.
The actual data distribution can be considered to be a mixed distribution of a group of genes (variable genes) whose expression levels have changed and a group of genes (non-variable genes) whose expression levels do not change. In the mixed distribution model of this method, as shown in FIG. 4, the non-variable gene population distribution centered on zero and the expression ratio are increased and decreased, respectively, on the Y axis representing the rate of change in the expression level. It is assumed that it consists of a population distribution of fluctuating genes centered on one point. Here, for convenience of explanation, only the normal distribution is shown, but the present invention is not limited to the case of the normal distribution, and can be applied to data of all distributions.
Here, as shown in FIG. 5, when the ratio of the variable genes to the whole is not so large (for example, when the group of variable genes occupies 10% of the total number), the mixture distribution is as shown in FIG. To approximate a normal distribution. Therefore, a fluctuating gene can be extracted based on the t distribution with respect to the fluorescence magnification data of the mixed distribution under the condition of the P value (P-value) which is a certain confidence limit value.
Since this method determines the threshold value of the expression level variation magnification based on the actual data distribution and the calculation of the center, this method has a feature that it is robust. That is, in this method, even for experimental data having different error ranges, the threshold for the expression level variation magnification is determined according to the error. In addition, as another feature of this method, by performing several detections on different data sets obtained in a control experiment under the same conditions, by selecting only genes that are detected more than a predetermined number of times, The variable gene can be detected with high reliability.
Furthermore, the mixed distribution of non-variable genes and variable gene populations is divided into six parameters (total number of genes, percentage of genes whose expression varies, standard deviation (width) of gene distribution, center of distribution of genes whose expression varies) By changing the detection criteria (number of detections / total number) and the confidence limit value (P-value) in each data set (window), the first type detection error and the second type detection are performed. The error can be calculated. The result can be an experimental guideline. Here, the “first type detection error” refers to a false positive error that is detected as something that does not change, and the “second type detection error” refers to a false that is detected as something that does not change. A negative error.
(3) Creation of expression variation confidence curve according to data using moving window method
A gene with a lower fluorescence intensity is strongly affected by errors such as background in the value of expression change. For example, an error α that cannot be removed for each fluorescence value in a control experiment._AAnd α_BIs present, the expression change amount of a gene i is expressed by the fluorescence magnification = (A_i-Α_Ai) / (B_i-Α_Bi). Therefore, A_i>> α_AiAnd B_i>> α_Bi, The fluorescence magnification is A_i/ B_iCan be approximated as A_iAnd α_AiAnd B_iAnd α_BiIf the values of are close, the effect of the error cannot be ignored. Therefore, when genes are actually selected based on the t distribution, considering that there are mixed gene groups affected by different degrees due to errors such as background, t values of different groups are determined according to the fluorescence intensity. Should.
Here, FIGS. 11 to 14 are diagrams conceptually showing the creation of the confidence curve. First, as shown in FIG. 11, the present apparatus calculates the variance and the center for the magnification distribution of the expression level of the gene within a window configured with a certain number of genes, and determines the t value of the magnification change (magnification Corresponds to the value of the coordinate axis). Note that the fluorescence intensity balance axis median value of all genes in the window is used as the value on the fluorescence intensity balance axis of the confidence limit point of the expression change.
Next, as shown in FIG. 12, the present apparatus determines the coordinates of the confidence limit points for the expression change above and below the fluorescence intensity balance axis in the window, and then increases the fluorescence intensity balance axis in a direction in which the fluorescence intensity balance axis increases. Move the window. Thereafter, this operation is repeated.
After calculating the confidence limit points of all expression changes, this device connects the confidence limit points of the expression change confidence points with a cubic spline curve and creates an expression variation confidence curve that is an expression change boundary line. To do. Here, in the fluorescence intensity region that cannot be complemented by the cubic spline curve in the windows at both ends, as shown in FIG. 13, as shown in FIG. Using the horizontal extension line of the limit point, and where the fluorescence intensity is low (indicated by the dotted line), the asymptotic line extrapolation calculated by the least square method from the boundary point of several tens of windows that continued from the left A confidence curve (extrapolated change boundary line) is used.
Next, as shown in FIG. 14, genes outside the region sandwiched between the expression fluctuation confidence curves above and below the fluorescence intensity equilibrium axis are extracted as genes whose expression level has changed, that is, those whose expression level has increased or decreased. To do. The final gene extraction is performed by the above-described multiple test (2- (2)).
[Device configuration]
Next, the configuration of the gene expression information analysis apparatus will be described below with reference to FIGS. FIG. 22 is a block diagram showing an example of the configuration of the apparatus to which the present invention is applied, and conceptually shows only the part related to the present invention in the configuration.
In FIG. 22, a gene expression information analyzing apparatus 100 is schematically shown as a control unit 102 such as a CPU that controls the entire gene expression information analyzing apparatus 100 as a whole, a communication device such as a router connected to a communication line, etc. A communication control interface unit 104 connected to the input device 112 and the output device 114, and a storage unit 106 for storing various databases and tables. These units are communicably connected via an arbitrary communication path. Furthermore, the gene expression information analyzing apparatus 100 may be communicably connected to a network via a communication device such as a router and a wired or wireless communication line such as a dedicated line.
Various databases and tables (measured luminance data 106a and simulation result data 106b) stored in the storage unit 106 are storage means such as a fixed disk device, and various programs, tables, files, databases, and webs used for various processes. Stores file for page etc.
Among the constituent elements of the storage unit 106, the measured luminance data 106a is measured luminance data storage in which measured luminance data of each spot indicating the expression level of a gene experimented with a DNA chip, a DNA microarray, or the like is stored for each experiment. Means. The simulation result data 106b is a simulation result data storage unit that stores simulation result data obtained by this apparatus.
In FIG. 22, the communication control interface unit 104 performs communication control between the gene expression information analyzing apparatus 100 and a network (or a communication apparatus such as a router). That is, the communication control interface unit 104 has a function of communicating data with other terminals via a communication line.
In FIG. 22, the input / output control interface unit 108 controls the input device 112 and the output device 114. Here, as the output device 114, in addition to a monitor (including a home TV), a speaker can be used (hereinafter, the output device 114 may be described as a monitor). As the input device 112, a keyboard, a mouse, a microphone, or the like can be used. The monitor also realizes a pointing device function in cooperation with the mouse.
In FIG. 22, the control unit 102 has a control program such as an OS (Operating System), a program defining various processing procedures, and an internal memory for storing required data. Information processing for executing various processes is performed. The control unit 102 includes a background correction unit 102a, a bias correction unit 102b, a gene detection unit 102c, and a simulation unit 102d in terms of functional concept.
Among these, the background correction | amendment part 102a removes a background value from the measurement brightness | luminance data of each spot which measured the fluorescence intensity which shows the expression level of the same gene on two conditions, and carries out the brightness | luminance data by which background correction | amendment was carried out. This is a background correction means to be created.
Further, the bias correction unit 102b creates a fluorescence intensity scatter diagram by taking the logarithm of the luminance data subjected to background correction by the background correction unit as an XY axis, and obtains a bias with respect to the fluorescence intensity balance axis for each gene spot, This is bias correction means for constructing a new XY axis system fluorescence intensity scatter diagram with the fluorescence intensity balance axis and the expression amount magnification axis as two axes by removing the bias from the luminance data.
Here, FIG. 23 is a block diagram showing an example of the configuration of the bias correction unit 102b, and conceptually shows only the portion related to the present invention. As shown in FIG. 23, the bias correction unit 102b is configured from a functional concept to include a first principal component creation unit 102e, a coordinate rotation unit 102f, a bias determination unit 102g, and a correction plot generation unit 102h. .
In FIG. 23, the first principal component creation unit 102e performs principal component analysis using the logarithmic value of the gene population with a high expression level, and obtains the slope and intercept of the asymptote that becomes the first principal component. It is a creation means.
Further, the coordinate rotation unit 102f sets the angle between the asymptotic line obtained by the first principal component creation means and the X axis to θ, and rotates the coordinates in the XY axis system of the gene population with a small amount of expression to the right by the θ angle. Coordinate rotation means for calculating coordinates.
In addition, the bias determination unit 102g calculates the inclination of the fluorescence intensity balance axis using the coordinates of the gene group whose expression level is small after the coordinate axis rotation by the coordinate rotation means, and based on the calculated inclination, the brightness of two conditions Bias determination means for determining which of the data contains a large amount of bias.
In addition, the correction plot generation unit 102h sets the fluorescence intensity balance axis and the magnification axis of the expression level as two axes by subtracting the bias from the luminance data determined to include a large amount of bias by the bias determination unit. It is a correction plot generating means for constructing a new XY axis system fluorescence intensity scatter diagram.
Returning to FIG. 22 again, the gene detection unit 102c is a gene detection unit that detects a fluctuating gene whose expression level fluctuates based on the fluorescence intensity scatter diagram of the new XY axis system constructed by the bias correction unit.
Here, FIG. 24 is a block diagram showing an example of the configuration of the gene detection unit 102c, and conceptually shows only the portion of the configuration related to the present invention. As shown in FIG. 24, the gene detection unit 102c is functionally conceptually configured to have a window setting unit 102i, a confidence limit point determination unit 102j, a window moving unit 102k, a confidence boundary line creation unit 102m, a variable gene extraction unit 102n, the number of genes. An input unit 102p, a confidence limit value input unit 102q, and a deviation value processing unit 102u are provided.
In FIG. 24, a window setting unit 102i is window setting means for setting a window in a predetermined section in the fluorescence intensity balance axis direction.
The confidence limit point determination unit 102j is a confidence limit point determination unit that determines a confidence limit point in each window set by the window setting unit.
The window moving unit 102k is a window moving means for moving the window by a certain gene in the fluorescence intensity balance axis direction.
Further, the trust boundary creating unit 102m obtains each trust limit point by the trust limit point determining means for each window moved by the window moving means, and creates a trust boundary line based on the obtained plurality of trust limit points. It is a line creation means.
The variable gene extraction unit 102n is a variable gene extraction unit that extracts a gene located outside the trust boundary created by the trust boundary creation unit as a variable gene whose expression level has changed.
The number-of-genes input unit 102p is a number-of-genes input unit that allows the user to input the number of genes in the window.
Further, the confidence limit value input unit 102q is a confidence limit value input unit that allows a user to input a confidence limit value.
The deviation value processing unit 102u is deviation value calculation means for calculating the deviation value of each spot.
Referring back to FIG. 22 again, the simulation unit 102d is a simulation unit that executes a plurality of simulations according to a predetermined condition and outputs a simulation result for each condition.
Here, FIG. 25 is a block diagram showing an example of the configuration of the simulation unit 102d, and conceptually shows only the portion related to the present invention in the configuration. As shown in FIG. 25, the simulation unit 102d includes a simulation condition setting unit 102r, a simulation execution unit 102s, and a simulation result output unit 102t in terms of functions.
In FIG. 25, the simulation condition setting unit 102r gives the user a simulation condition including information on at least one of the standard deviation of the gene distribution, the center of the distribution of the variable gene, the detection criterion of the variable gene, and the number of simulations. Is a simulation condition setting means for inputting.
In addition, the simulation execution unit 102 s repeatedly generates the same gene group from the same distribution according to the simulation condition set by the simulation condition setting unit, executes the gene detection unit, and executes a simulation for detecting the expressed gene. Execute multiple times, calculate false positive rate and false negative rate of the results by detection means, calculate the number of experiment repetitions, simulation conditions, and relationship between detection sensitivity and detection reliability, and test for genes whose expression level changes It is a simulation execution means for creating a statistical table.
The simulation result output unit 102t is a simulation result output unit that outputs a simulation result by the simulation execution unit for each simulation condition.
Details of processing performed by each of these units will be described later.
[Processing of this device]
Next, an example of the process of the present embodiment of the present embodiment configured as described above will be described in detail with reference to FIGS. 7 to 10 and FIGS.
[Main processing of this device]
First, the main processing of this apparatus will be described with reference to FIG. FIG. 15 is a flowchart showing an example of main processing of the apparatus of the present embodiment.
First, the gene expression information analysis apparatus 100 performs a background correction process, which will be described later with reference to FIG. 16, by the process of the background correction unit 102a (step S-1). That is, the background correction unit 102a removes the background value from the measured luminance data of each spot obtained by measuring the fluorescence intensity indicating the expression level of the same gene under two conditions using a DNA microarray or a DNA chip. Create corrected luminance data.
Next, the gene expression information analysis apparatus 100 executes a bias correction process, which will be described later with reference to FIG. 17, by the process of the bias correction unit 102b (step S-2). That is, the bias correction unit 102b creates a fluorescence intensity scatter diagram (scatter plot) by taking the logarithm (natural logarithm or logarithm of 2) of luminance data subjected to background correction on the XY axis, and spot of each gene. A new X-Y axis fluorescence intensity with two axes of the fluorescence intensity balance axis and the expression level magnification axis is obtained by obtaining a bias with respect to the fluorescence intensity balance axis showing the same fluorescence intensity and removing the bias from the luminance data. Build a scatter plot.
Next, the gene expression information analyzing apparatus 100 executes a gene detection process using a moving window, which will be described later with reference to FIGS. 18 and 20, by the process of the gene detection unit 102c (step S-3). That is, the gene detection unit 102c detects a fluctuating gene whose expression level fluctuates based on the constructed fluorescence intensity scatter diagram of the new XY axis system.
Next, the gene expression information analyzing apparatus 100 executes a simulation process to be described later with reference to FIG. 19 and FIG. 21 by the process of the simulation unit 102d (step S-4). That is, the simulation unit 102d executes a plurality of simulations according to a predetermined condition and outputs a simulation result for each condition.
This completes the main process of this apparatus.
[Background correction processing]
Next, details of the background correction processing will be described with reference to FIG. FIG. 16 is a flowchart showing an example of bias correction processing of the apparatus of the present embodiment.
First, the gene expression information analysis apparatus 100 obtains an average or local background value from the luminance measured under the two conditions of the gene by processing of the background correction unit 102a (step SA-1), and this background. The value is removed from the measured value, and the result of this correction is taken as group A and group B (step SA-2).
That is, the background correction unit 102a subtracts the average background value of the blank spot from the fluorescence intensity measurement value of each spot or the background value of the area around each spot from the fluorescence intensity measurement value of each spot. As a result, background correction is performed. This ends the background correction process.
[Bias correction processing]
Next, details of the bias correction processing will be described with reference to FIG. FIG. 17 is a flowchart showing an example of bias correction processing of the apparatus of the present embodiment. First, the bias correction unit 102b calculates the logarithm with 2 as the base for the A group and the B group by the processing of the first principal component creation unit 102e, and Log₂A, Log₂A scatter plot is performed in an orthogonal axis system with B as the X and Y axes (step SB-1).
Next, the bias correction unit 102b calculates the variance / covariance matrix by using the logarithmic value of the upper gene group of the product AB (for example, the gene group up to the upper 70%) by the processing of the first principal component creation unit 102e. The used principal component analysis is executed to determine the slope and intercept of the asymptote that becomes the first principal component (step SB-2).
Next, the bias correction unit 102b obtains the asymptotic line and the log obtained by the processing of the coordinate rotation unit 102f.₂Log of a gene group belonging to the lower part of the product AB (for example, a group of genes included in the lower 10%) where the angle of the A axis is θ₂A-Log₂A coordinate obtained by rotating the coordinate in the B-axis system by θ angle to the right is calculated (step SB-3).
Next, the bias correction unit 102b calculates the slope of the asymptote using the coordinates of the lower gene group of the product AB after the rotation of the coordinate axis by the processing of the bias determination unit 102g (step SB-4).
Next, the bias correction unit 102b determines whether or not the slope of the asymptote is a positive number by the processing of the bias determination unit 102g (step SB-5). In the case of a positive number, the bias determination unit 102g determines that the A data includes more bias. Therefore, the bias correction unit 102b performs log processing by the bias determination unit 102g.₂A-Log₂Log using the coordinates of the lower gene group of the product AB in the B axis system (for example, the group of genes included in the lower 10%)₂Log data with the B axis as an independent variable₂Asymptotic lines and Logs of lower gene populations by the least square method using the data of A as the dependent variable₂Intersection with A axis (A_c, 0) value A_cIs obtained (step SB-6).
Next, the bias correction unit 102b obtains the bias by the processing of the correction plot generation unit 102h, and subtracts the bias from the control measurement value data (step SB-7).
On the other hand, if the slope of the asymptote is not a positive number in step SB-5, the bias determination unit 102g determines whether it is zero (step SB-8). If zero, the bias correction process is terminated.
In step SB-8, if the slope of the asymptote is not zero, the bias determination unit 102g determines that the B data includes more bias. Therefore, the bias correction unit 102b performs log processing by the correction plot generation unit 102h.₂A-Log₂Log using the coordinates of the lower gene group of the product AB in the B axis system (for example, the group of genes included in the lower 10%)₂Log data with A-axis data as independent variable₂Asymptotic lines and log of subgene populations by least squares method using B data as dependent variable₂Intersection with B axis (0, B_c) Value B_cIs obtained (step SB-9), and the process of step SB-7 described above is performed.
Next, the bias correction unit 102b uses the data obtained by subtracting the bias by the processing of the correction plot generation unit 102h, and uses the orthogonal axis system Log.₂(Ak) -Log₂B axis system or Log₂A-Log₂(Bk) An axis system is constructed (step SB-10). This completes the bias correction process.
[Gene detection process]
Next, details of the gene detection process will be described with reference to FIG. FIG. 18 is a flowchart showing an example of gene detection processing of the apparatus of the present embodiment.
First, the gene detection unit 102c of the gene expression information analysis apparatus 100 performs a process of the window setting unit 102i to give the user the number of genes in the window described above with reference to FIG. Degree (P_eA gene extraction condition setting screen for setting (value) is output to the output device 114 (step SC-1).
Here, FIG. 20 is a diagram illustrating an example of a gene extraction condition setting screen output to the output device 114 by the processing of the window setting unit 102i. As shown in FIG. 20, the gene extraction condition setting screen has an input area MA-1 for the number of genes in the window, and a confidence level (P_eValue) input area MA-2, setting end button MA-3, and the like.
Here, after the user completes the input of each item in the input areas MA-1 and MA-2 using the input device 112 while viewing the gene extraction condition setting screen shown in FIG. 20, the setting end button MA- When 3 is selected, the number-of-genes input unit 102p and the confidence limit value input unit 102q are based on the information set on the gene extraction condition setting screen so that the genes in the window shown in FIG. Adjust the size.
Returning to FIG. 18 again, the gene detection unit 102c uses the value of the Y axis (magnification of change) of each point in the window from the leftmost end of the X axis by the process of the confidence limit point determination unit 102j to , And the boundary value y of the increase in the expression level, which is the confidence limit point, is increased_{limit +}, Boundary value y of reduction ratio_limitAnd the center of gravity of the X axis is obtained (step SC-2).
Next, the gene detection unit 102c moves the window by a certain amount in the direction in which the X-axis fluorescence intensity increases by the processing of the window moving unit 102k, and the reliability limit in a new window is processed by the reliability limit point determination unit 102j. Boundary value y of the expression level change magnification that becomes a point_{limit +}And y_limitAnd the center of gravity of the X axis is obtained (step SC-3).
Next, the gene detection unit 102c repeats this process until the window reaches the rightmost end of the X axis (step SC-4).
Next, the gene detection unit 102c connects the expression level change magnification boundary points, which are the reliability limit points of the expression change of all windows, with the cubic spline curve by the processing of the confidence boundary line creation unit 102m, and is an expression variation confidence curve. An increase boundary line and a decrease boundary line of the expression magnification are determined (step SC-5).
Next, the gene detection unit 102c extracts genes (variable genes) that are outside the region sandwiched between the increase boundary line and the decrease boundary line that are expression variation confidence curves by the processing of the variable gene extraction unit 102n. This makes it possible to robustly detect a gene whose expression level has been changed by multiple testing (step SC-6).
In the present invention, gene detection efficiency may be improved by calculating a deviation value of each spot. Details of the gene detection process using the deviation value of the apparatus of the present embodiment will be described below with reference to FIGS. 26 and 27. FIG. 26 is a flowchart showing an example of gene detection processing using the deviation value of the apparatus of the present embodiment.
First, after the user sets the number of genes in the window and the reliability (Pe value) (step SE-1), the deviation value processing unit 102u selects a certain number of genes in the fluorescence intensity balance axis direction as described above. A window is set, and an average value and a standard deviation value are obtained using a Y-axis value representing a change rate of the expression level of all genes in each window. Next, the deviation value processing unit 102u obtains the center of gravity (corresponding to the intermediate value of fluorescence intensity) using the X-axis values of all genes (step SE-2).
Subsequently, the deviation value processing unit 102u moves the window by a certain amount in the X-axis direction, and repeats the same processing up to the rightmost window (step SE-3).
Next, the deviation value processing unit 102u complements the obtained data set (intermediate value of fluorescence intensity, average value) as a series of (x, y) data by smoothing (for example, a cubic spline curve). (Created), and the smoothing line of the average value shown in FIG. Similarly, the deviation value processing unit 102u complements a plurality of data sets (intermediate value of fluorescence intensity, standard deviation value) by smoothing (for example, creates a cubic spline curve), and the standard deviation shown in FIG. A smooth line of values is set (step SE-4).
Next, the deviation value processing unit 102u uses the Y value on the smooth line of the average value and the Y value on the smooth line of the standard deviation value from the value of the fluorescence intensity balance axis (X axis value) of each gene. Then, the deviation value is calculated by the following formula (step SE-5).
Deviation value = (y value of gene-average value obtained from smooth line) /
Standard deviation value σ obtained from smooth line
By using the deviation value of each spot calculated in this way instead of the fluctuation ratio (magnification), an analysis that is not affected by the difference in error between slides becomes possible. In other words, it has been difficult to compare between chips because the physical error of each microarray, etc. and the human error when detecting each chip are not constant. This facilitates comparison between chips.
Conventionally, clustering using hierarchical clustering (one-dimensional and two-dimensional), K-Means method, self-organizing map method, etc. for classification of gene expression patterns and extraction of co-expressed genes. A random analysis is being performed. For example, using the logarithm of the variation ratio, MB Eisen, PT Spellman, PO Brown, D Botstein (1998), “Cluster analysis and display of gene fate a cations”, Proceedings of c ): 14863-14868 is known. In addition, TR Gorub, DK Slonim, P Tamay, C Hard, M Caasenbek, JP Mesirov, H Coller, ML Loh, JR Downing, MA Caliguri, CD Blo 99 "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science, 286: 531-537. Here, by using the deviation value calculated by this method in place of the logarithm of the fluctuation ratio and the normalized fluctuation ratio in the multivariate analysis typified by cluster analysis, the difference in the influence of error due to the magnitude of the expression level Analysis that is not affected by
This completes the gene detection process.
[Simulation process]
Next, details of the simulation processing of the present invention will be described with reference to FIGS. 19 and 21. FIG. FIG. 19 is a flowchart showing an example of gene detection processing of the apparatus of the present embodiment.
First, the simulation unit 102d of the gene expression information analyzing apparatus 100 changes various condition parameters (for example, standard deviation (width) of gene distribution, expression) of the simulation for the user by the processing of the simulation condition setting unit 102r. A simulation condition setting screen for setting the distribution center of the gene to be detected, the detection criterion (number of detections / total number), and the number of simulations) is output to the output device 114 (step SD-1).
Here, FIG. 21 is a diagram illustrating an example of a simulation condition setting screen output to the output device 114 by the processing of the simulation condition setting unit 102r. As shown in FIG. 21, the simulation condition setting screen includes an input area MB-1 for standard deviation of gene distribution, an input area MB-2 for the center of gene distribution, an input area MB-3 for detection criteria, and an input area for the number of simulations. It includes MB-4 and a setting end button MB-5.
Note that the standard deviation of gene distribution is, for example, when the distribution of genes whose expression does not change is standard normal distribution with standard deviation σ = 1 and center μ = 0, the range of standard deviation σ is 0.1 to 1. .5 may be set. Further, for example, the center of the distribution of the fluctuating gene may be set in the range of 0.4 to 3 with the width of the center μ under the above conditions. In addition, as a detection criterion for the fluctuating gene, for example, the ratio of genes detected from the total number may be set to 2/3, 2/4, 3/4, 3/6, 4/6, and the like. The number of simulations may be set in the range of 3 to 10 times, for example.
Here, the user selects the setting end button MB-5 after completing the input of each item of the input area MB-1 to the input area MB-4 using the input device 112 while looking at the simulation condition setting screen. Then, the simulation unit 102d repeatedly executes the background correction process, the bias correction process, and the gene detection process described above based on the information set on the simulation condition setting screen by the process of the simulation execution unit 102s. The simulation processing of the mixed distribution of the gene population (variable gene population) in which the expression level extracted by the gene detection process is changed and the gene population (non-variable gene population) in which the expression level has not changed is performed (step SD-2) .
Next, the simulation unit 102d outputs the simulation result screen data shown in FIGS. 7 to 10 to the output device 114 by the processing of the simulation result output unit 102t (step SD-3).
Here, FIG. 7 to FIG. 10 are diagrams showing an example of the calculation result of the first type detection error (false positive) by simulation. The mixed distribution is the parameters set under the six simulation conditions described above (total number of genes, percentage of genes whose expression varies, standard deviation (width) of gene distribution, center of distribution of genes whose expression varies, detection criteria (detection) Number / total number) and the confidence limit value (P-value) in each data set (window).
FIG. 7 shows the case where the center μ ′ of the gene population whose expression changes (variable gene population) is set to ± σ, the standard deviation is set to 1, and the detection criterion is detected twice out of three (detection criterion = 2 / 3) A graph showing the calculation result of the first type detection error.
On the other hand, FIG. 8 shows that the expression changes when the center μ ′ of the gene population whose expression changes (variable gene population) is set to ± σ, the standard deviation is set to 1, and 3 out of 4 detection criteria are detected. It is the figure which showed the case where it does.
By comparing these two figures, α (the first type of detection error) is detected in each window._eIt can be seen that it depends greatly on the value. In addition, the horizontal axis of the figure represents the ratio of the gene group whose expression is changed to the entire gene.
That is, the distribution of genes whose expression does not change is generated as a standard normal distribution (standard deviation σ = 1, center μ = 0), while the distribution of genes whose expression changes is 50% probability to the left or right of the standard normal distribution Occurs.
However, the mixed distribution is a total of six parameters (number of all genes, percentage of genes that change expression, standard deviation and center of distribution of genes that change expression, detection criteria and confidence limits within each dataset) ).
In addition, only the first type of error α in the multiple test α, that is, the error detected as a gene that changes the gene whose expression does not change is shown, and all the results are obtained by fixing the parameters and then averaging the 10 calculation results. Represents. Further, when the center μ ′ = ± σ of the gene population whose expression changes and the standard deviation = 1, (a) Detection standard: When 2 out of 3 detections are detected, the expression changes (in the case of FIG. 7) ) And (b) detection criterion: α is greatly dependent on the Pe value detected in each window by comparison with the case where the expression changes when 3 out of 4 are detected (in the case of FIG. 8) I understand that
Further, FIG. 9 and FIG. 10 are diagrams showing a case where the standard deviation of the gene population whose expression changes (variable gene population) is 1. In FIG._e= 0.15, and in FIG._e= 0.25. Therefore, in order to obtain 95% confidence level, the confidence limit P in the data set is used when the test criterion is 2 out of 3 times._eMay be set to 0.15 or less, while if the test criterion is 3 out of 4 times, the confidence limit P in the data set_eIt can be seen that it is sufficient to set the value to 0.25 or less. The horizontal axis of the figure represents the numerical value obtained by integrating the center of the gene group whose expression changes and the standard deviation of the gene group whose expression does not change. In the figure, “TNum” is the total number of genes, and “dif_x%” is the expression. The proportion of gene populations that change, and “2/3” and “3/4” mean detection criteria.
This completes the simulation process.
[Other embodiments]
Although the embodiments of the present invention have been described so far, the present invention can be applied to various different embodiments in addition to the above-described embodiments within the scope of the technical idea described in the claims. May be implemented.
In addition, among the processes described in the embodiment, all or a part of the processes described as being automatically performed can be manually performed, or the processes described as being manually performed All or a part of the above can be automatically performed by a known method.
In addition, the processing procedures, control procedures, specific names, information including parameters such as simulation conditions, and screen examples shown in the drawings and drawings can be arbitrarily changed unless otherwise specified.
In addition, the simulation unit 102d performs a mixed distribution simulation with other distributions such as a gamma distribution, so that the reliability (P_eValue), first type, second type detection error, and the like. In the embodiment described above, the case where the distribution of non-variable genes and the distribution of fluctuating genes are normal distributions has been described as an example. For example, the distribution of fluctuating genes is a distribution other than the normal distribution (for example, a gamma distribution). The present invention can be applied to a gene population having any distribution.
Further, the bias determination processing by the bias determination unit 102g of the above-described apparatus is not limited to the determination of the magnitude of the bias after the shaft rotation. For example, as shown in FIG. 28, the highly expressed asymptote before the shaft rotation. The magnitude of the bias may be determined by comparing the slope a with the slope b of the low expression asymptote. In this process, the rotation of coordinates is not a necessary condition.
Moreover, regarding the gene expression information analysis apparatus 100, each illustrated component is functionally conceptual and does not necessarily need to be physically configured as illustrated.
For example, the processing functions provided in each unit of the gene expression information analyzing apparatus 100, particularly the processing functions performed by the control unit 102, are all or any part of the processing function, a CPU (Central Processing Unit), and It can be realized by a program interpreted and executed by the CPU, or can be realized as hardware by wired logic. The program is recorded on a recording medium to be described later, and is mechanically read by the gene expression information analyzing apparatus 100 as necessary.
The gene expression information analyzing apparatus 100 connects peripheral devices such as a printer, a monitor, and an image scanner to a computer (information processing apparatus) such as an information processing terminal such as a known personal computer or workstation, and the information processing apparatus. You may implement | achieve by mounting the software (a program, data, etc. are included) which implement | achieve the method of this invention.
Furthermore, the specific form of the distribution / integration of the gene expression information analysis apparatus 100 is not limited to that shown in the figure, and all or a part of the gene expression information analysis apparatus 100 may be functional or physical in arbitrary units according to various loads. Can be distributed and integrated. For example, each database may be independently configured as an independent database device, and a part of the processing may be realized by using CGI (Common Gateway Interface).
The program according to the present invention can also be stored in a computer-readable recording medium. Here, the “recording medium” is built in an arbitrary “portable physical medium” such as a flexible disk, a magneto-optical disk, a ROM, an EPROM, an EEPROM, a CD-ROM, an MO, and a DVD, or in various computer systems. A program can be executed in a short period of time such as a communication line or a carrier wave when a program is transmitted via any “fixed physical medium” such as ROM, RAM, HD, or a network represented by LAN, WAN, or the Internet. It shall include the “communication medium” to be retained.
The “program” is a data processing method described in an arbitrary language or description method, and may be in any form such as source code or binary code. Note that the “program” is not necessarily limited to a single configuration, but is configured to be distributed as a plurality of modules and libraries, or in cooperation with a separate program represented by an OS (Operating System). Includes those that achieve that function. It should be noted that well-known configurations and procedures can be used for a specific configuration for reading a recording medium, a reading procedure, an installation procedure after reading, and the like in each device described in the embodiment.
Moreover, although the case where the gene expression information analysis apparatus 100 performs processing in a stand-alone form has been described as an example, a request transmitted via a network from a client terminal configured separately from the gene expression information analysis apparatus 100 The processing may be performed in accordance with the processing result, and the processing result may be returned to the client terminal.
Here, the network has a function of mutually connecting the gene expression information analyzing apparatus 100 and an external client apparatus. For example, the Internet, an intranet, a LAN (including both wired / wireless), VAN, PC communication network, public telephone network (including both analog / digital), dedicated line network (including both analog / digital), CATV network, IMT2000 system, GSM system, or PDC / PDC- It may include any one of a mobile line switching network / portable packet switching network such as P system, a wireless paging network, a local wireless network such as Bluetooth, a satellite communication network such as PHS network, CS, BS, or ISDB. That is, the present apparatus can transmit and receive various data via an arbitrary network regardless of wired or wireless.
As described above in detail, according to the present invention, the background value is removed from the measured luminance data of each spot where the fluorescence intensity indicating the expression level of the same gene is measured under two conditions using a DNA microarray or a DNA chip. By doing so, it is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can create luminance data with background correction.
In addition, according to the present invention, the logarithm of the luminance data with the background correction (natural logarithm or logarithm of 2 or the like) is taken on the XY axis, and a fluorescence intensity scatter diagram (scatter plot) is created. A new X-Y axis fluorescence intensity with two axes of the fluorescence intensity balance axis and the expression level magnification axis is obtained by obtaining a bias with respect to the fluorescence intensity balance axis showing the same fluorescence intensity and removing the bias from the luminance data. Since the scatter diagram is constructed, the fluorescent component including more bias is judged, and after removing this bias, a new orthogonal axis system is constructed with the fluorescence intensity balance axis and the multiple axis of the expression level as two axes. A gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can be provided are provided.
Further, according to the present invention, since a fluctuating gene whose expression level fluctuates based on the constructed fluorescence intensity scatter diagram of the new XY axis system is detected, the measurement apparatus, compared with the conventional gene detection method, Gene expression information analysis apparatus, gene expression information analysis method, program, and recording that can accurately detect genes whose expression levels fluctuate without being affected by differences between specimens and differences in fluorescence labeling efficiency A medium can be provided.
In addition, according to the present invention, a control gene sample for quality control of a DNA concentration dilution series (for example, an external gene λDNA sample or a House-keeping gene sample such as a ribosome whose expression level hardly changes) is measured simultaneously with the target gene sample. Then, remove the control genes one by one from the gene with the smallest product of fluorescence intensity data, create a gene expression level and a DNA level calibration curve from the data of all remaining control gene samples, and correlate the data. Calculate the number, and the threshold value is the product of the fluorescence intensity data in the two conditions of the control sample when the above correlation coefficients calculated in order meet the criteria (for example, 0.8 or higher) where strong correlation is first observed. 1 and the product of fluorescence intensity data under the two conditions exceeds the threshold 1 If all the gene sample populations are gene groups with high expression levels, the expression level is calculated in order, and the correlation coefficient is calculated in order. The product of the fluorescence intensity data under the two conditions of the control sample is defined as threshold value 2 (where threshold value 2 <threshold value 1), and the expression level of the population of all gene samples for which the product of the fluorescence intensity data under the two conditions falls below threshold value 2 The principal component analysis is performed using the fluorescence intensity logarithm values of the gene population with a small gene population and a high expression level, and the slope and intercept of the asymptote as the first principal component are obtained. The angle is θ, the coordinates in the XY axis system of the gene population with a small expression level are calculated by rotating the coordinates to the right by the θ angle, and the fluorescence intensity level is calculated using the coordinates of the gene population with the low expression level after the coordinate axis rotation. Calculate the tilt of the axis, determine which of the two conditions of luminance data contains more bias based on the calculated tilt (eg, positive, negative, zero, etc.) and include more bias Subtracting the bias from the luminance data of the condition determined to be (for example, rotating the coordinates for a gene group having a constant bias), a new axis with the fluorescence intensity balance axis and the expression level magnification axis as two axes Generating X-Y axis fluorescence intensity scatter plots, gene expression information analysis that can efficiently remove the bias of measured values and create fluorescence intensity scatter plots that can clearly express the nature of the data An apparatus, a gene expression information analysis method, a program, and a recording medium can be provided.
In addition, according to the present invention, since the principal component analysis is performed using a variance / covariance matrix, normalization is performed in comparison with the principal component analysis method using a correlation matrix conventionally used for expression gene analysis. Therefore, it is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can efficiently perform principal component analysis.
Further, according to the present invention, a window in a predetermined section is set, and the average value, standard deviation, P value (for example, 95% value), center of gravity, etc. of gene luminance data are set in each set window. A confidence limit is determined using at least one of them. And a confidence boundary line creating means for moving the window by a certain gene in the direction of the fluorescence intensity balance axis, obtaining each confidence limit point for each moved window, and creating a confidence boundary line based on the plurality of confidence limit points obtained; Since the gene located outside the trust boundary created by the trust boundary creating means is extracted as a fluctuating gene whose expression level has fluctuated, it is possible to extract expressed genes with high stability, reproducibility, and high reliability. A gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can be performed can be provided.
Further, according to the present invention, a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a program that can determine a threshold value of the expression level variation magnification according to the error even in experimental data having different error ranges. A recording medium can be provided.
Further, according to the present invention, since the confidence limit point is determined using the t-distribution based on the test table of duplicate data obtained by simulation, the confidence limit point is accurately and efficiently compared with the conventional method. Gene expression information analyzing apparatus, gene expression information analyzing method, program, and recording medium can be provided.
Further, according to the present invention, since a spline curve is created based on a plurality of confidence limit points to perform smoothing and a confidence boundary line is created, the confidence limit points are efficiently complemented to create a confidence curve. A gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can be provided are provided.
In addition, according to the present invention, for a region with high fluorescence intensity, a confidence limit line is created using a horizontal extension line with respect to the X axis of the confidence limit point obtained in the last window (the rightmost window). To provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium capable of creating an appropriate confidence limit line even when the slope is small and it cannot be determined which one will converge Can do.
Further, according to the present invention, for the region having a low fluorescence intensity, for example, an extrapolation of an asymptotic line obtained by a least square method from a confidence limit point obtained in each of several tens of windows from the beginning is used as the confidence limit line. Therefore, it is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can accurately detect a spot of a gene having low fluorescence intensity.
In addition, according to the present invention, the user is allowed to input the number of genes in the window, and the window is set within the section including the genes of the input number of genes. Therefore, the number of genes set by the user for each experiment is determined. A gene expression information analyzing apparatus, a gene expression information analyzing method, a program, and a recording medium that can be varied can be provided.
Further, according to the present invention, since the user inputs the confidence limit value and determines the confidence limit point based on the confidence limit value input in the window, the confidence limit value set by the user for each experiment is determined. It is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can be varied and can accommodate errors in each experiment within an appropriate range.
In addition, according to the present invention, the user is given a form of gene distribution that does not vary (for example, a standard deviation of the distribution (for example, a distribution of genes whose expression does not change is a standard normal distribution with a standard deviation σ = 1, a center μ = 0, the width of the standard deviation σ is set in the range of 0.1 to 1.5)), the distribution gene distribution shape (for example, the center (for example, the center μ Width is set in the range of 0.4 to 3)), and the detection criteria for the above-mentioned variable genes (for example, the ratio of the detected genes from the total number is 2/3, 2/4, 3/4, 3 / (6, 4/6, etc.), the number of repetitions of the experiment, and the number of simulations (for example, set in the range of 3 to 10 times). According to the simulated conditions The same gene group is repeatedly generated from the same distribution, gene detection is performed, and the simulation for detecting the expressed gene is executed multiple times, and the false positive rate and false negative rate of the result by the above detection means are calculated. Calculate the number of experiment repetitions, simulation conditions, and the relationship between detection sensitivity and detection reliability, create a test statistical table for genes whose expression level changes, and output simulation results from simulation execution for each simulation condition Therefore, it is possible to know the detection power and the detection reliability by the above combination by combining the simulation results under various conditions. In other words, repeat control experiments under the same conditions, detect fluctuating genes for the different data sets obtained, and select only genes that are detected more than a predetermined number of times. It is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that can detect a fluctuating gene with a degree or a detection power.
Further, according to the present invention, an error in which a gene whose expression level does not change is detected as a fluctuating gene (first type error), or an error in which a fluctuating gene is detected as a gene whose expression does not change (second type error). ) Is calculated and compared, the power and reliability to detect the fluctuating gene by the above method can be grasped from the simulation data, and the expected power and reliability can be obtained from the actual experimental data. Therefore, it is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium capable of setting a combination of the number of repetitions of an experiment, a detection criterion for a fluctuating gene, and a confidence limit point. .
Further, according to the present invention, it is possible to predict how many times an experiment can be performed based on the test statistical table of duplicate data obtained by simulation, so that the experimental efficiency can be obtained. Gene expression information analysis apparatus, gene expression information analysis method, program, and recording medium can be provided.
In addition, according to the present invention, since the deviation value of each spot is calculated, the deviation value of each spot calculated in this way is used instead of the fluctuation ratio (magnification), thereby affecting the difference in error between slides. It is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that enable analysis that is not performed.
Furthermore, according to the present invention, the deviation value calculated by the present apparatus can be used instead of the logarithm of the fluctuation ratio or the normalized fluctuation ratio in the multivariate analysis typified by cluster analysis. It is possible to provide a gene expression information analysis apparatus, a gene expression information analysis method, a program, and a recording medium that enable analysis that is not influenced by the difference in the influence of errors.
Industrial applicability
As described above, the gene expression information analysis apparatus, the gene expression information analysis method, the program, and the recording medium according to the present invention are extremely useful in the bioinformatics field for analyzing measurement value data such as a DNA microarray and a DNA chip. is there.
The present invention can be widely implemented in many industrial fields, particularly in fields such as pharmaceuticals, foods, cosmetics, medicine, and gene expression analysis, and is extremely useful.
[Brief description of the drawings]
FIG. 1 is a diagram showing a concept of principal component analysis using a variance / covariance matrix according to the present invention, and FIG. 2 is a diagram showing a concept of processing for obtaining an asymptote in a new coordinate system according to the present invention. FIG. 3 is a diagram conceptually showing the reconstruction of the distribution map according to the present invention, FIG. 4 is a diagram showing a mixed normal distribution model of expression magnification according to the present invention, and FIG. FIG. 6 is a diagram showing a mixed normal distribution model of expression magnification according to the present invention, FIG. 6 is a diagram showing a mixed normal distribution model of expression magnification according to the present invention, and FIG. FIG. 8 is a diagram showing an example of the calculation result of the first type of detection error, and FIG. 8 is a diagram showing an example of the calculation result of the first type of detection error by the simulation according to the present invention. First-type detection error by simulation according to the invention FIG. 10 is a diagram showing an example of the calculation result of the first type of detection error by simulation according to the present invention, and FIG. 11 is an expression fluctuation according to the present invention. FIG. 12 is a diagram conceptually showing the creation of a confidence curve, FIG. 12 is a diagram conceptually showing the creation of an expression variation confidence curve according to the present invention, and FIG. 13 is an expression variation confidence curve according to the present invention. FIG. 14 is a diagram conceptually showing the creation of an expression variation confidence curve according to the present invention, and FIG. 15 is the main process of the apparatus of the present embodiment. FIG. 16 is a flowchart showing an example of the background correction process of the apparatus of the present embodiment, and FIG. 17 is a flowchart of an example of the bias correction process of the apparatus of the embodiment. And the second FIG. 8 is a flowchart showing an example of gene detection processing of the apparatus of the present embodiment, FIG. 19 is a flowchart of an example of simulation processing of the apparatus of the embodiment, and FIG. FIG. 21 is a diagram illustrating an example of a gene extraction condition setting screen output to the output device 114 by the processing of the setting unit 102i. FIG. 21 illustrates simulation conditions output to the output device 114 by the processing of the simulation condition setting unit 102r. FIG. 22 is a block diagram showing an example of the configuration of the apparatus to which the present invention is applied, and FIG. 23 is a block diagram showing an example of the configuration of the bias correction unit 102b. FIG. 24 is a block diagram showing an example of the configuration of the gene detection unit 102c, and FIG. 25 shows the simulation unit 102d. FIG. 26 is a flowchart showing an example of gene detection processing using the deviation value of the apparatus of the present embodiment, and FIG. 27 is the apparatus of the present embodiment. FIG. 28 is a conceptual diagram showing an example of bias determination processing of the apparatus of the present embodiment.

Claims

Background correction means for creating background-corrected luminance data by removing the background value from the measured luminance data of each spot where the fluorescence intensity indicating the expression level of the same gene under two conditions is measured;
The logarithm of the luminance data corrected by the background correction means is taken on the XY axis to create a fluorescence intensity scatter diagram, and a bias with respect to the fluorescence intensity balance axis is obtained for each gene spot. A bias correcting means for constructing a fluorescence intensity scatter diagram of a new XY axis system having two axes of the fluorescence intensity balance axis and the expression amount magnification axis by removing
A gene detection means for detecting a fluctuating gene whose expression level fluctuates based on a fluorescence intensity scatter diagram of a new XY axis system constructed by the bias correction means;
A gene expression information analyzing apparatus comprising:

The bias correcting means is
A first principal component creation means for performing principal component analysis using logarithmic values of gene populations with a high expression level to obtain the slope and intercept of an asymptote as the first principal component;
Coordinate rotation for calculating coordinates obtained by rotating the coordinates in the XY axis system of the gene group with a small amount of expression to the right by θ angles, where θ is the angle between the asymptote obtained by the first principal component creation means and the X axis Means,
Using the coordinates of the gene population with a low expression level after the coordinate axis rotation by the coordinate rotation means, the inclination of the fluorescence intensity balance axis is calculated, and which of the luminance data of the two conditions is calculated based on the calculated inclination Bias determining means for determining whether a large amount of the bias is included in
By subtracting the bias from the luminance data under the condition that the bias determination means determines that a large amount of the bias is included, a new X with the fluorescence intensity balance axis and the expression magnification magnification axis as two axes is obtained. Correction plot generating means for constructing a fluorescence intensity scatter diagram of the Y-axis system,
The gene expression information analysis apparatus according to claim 1, further comprising:

The principal component analysis is performed using a variance / covariance matrix,
The gene expression information analyzing apparatus according to claim 2, wherein:

The gene detection means includes
Window setting means for setting a window within a predetermined section in the fluorescence intensity balance axis direction;
Confidence limit point determining means for determining a confidence limit point in each window set by the window setting means;
A window moving means for moving the window by a certain gene in the direction of the fluorescence intensity equilibrium axis;
For each window moved by the window moving means, each confidence limit point is obtained by the confidence limit point determining means, and a trust boundary line creating means for creating a trust boundary line based on the obtained plurality of confidence limit points;
Fluctuating gene extraction means for extracting a gene located outside the trust boundary created by the confidence boundary creation means as a fluctuation gene whose expression level has fluctuated,
The gene expression information analysis apparatus according to any one of claims 1 to 3, further comprising:

The confidence limit point determining means determines the confidence limit point using a t-distribution based on a test statistical table of duplicate data obtained by simulation,
The gene expression information analyzing apparatus according to claim 4, wherein:

The trusted boundary creating means creates the trusted boundary by performing smoothing by creating a spline curve based on the plurality of confidence limit points,
The gene expression information analysis apparatus according to claim 4 or 5, wherein

The confidence boundary line creating means, for a region having a high fluorescence intensity, creating the confidence limit line using a horizontal extension line of the confidence limit point obtained in the last window,
The gene expression information analysis apparatus according to any one of claims 4 to 6, wherein:

The confidence boundary line creating means uses, for the low fluorescence intensity region, asymptotic line extrapolation obtained by the least square method from the confidence limit point obtained in the window as the confidence limit line,
The gene expression information analysis apparatus according to any one of claims 4 to 7, wherein:

A gene number input means for allowing the user to input the number of genes in the window,
The window setting means sets the window within the section including the genes of the number of genes input by the number of genes input means;
The gene expression information analysis apparatus according to any one of claims 4 to 8, wherein:

A confidence limit value input means for allowing the user to input a confidence limit value;
Further comprising
The confidence limit point determining means determines the confidence limit point based on the confidence limit value input by the confidence limit value input means in the window;
The gene expression information analysis apparatus according to any one of claims 4 to 9, wherein:

Enter the simulation conditions including information on at least one of the distribution form of the non-fluctuating gene, the distribution form of the fluctuating gene, the detection criteria for the fluctuating gene, the number of repetitions of the experiment, and the number of simulations. Simulation condition setting means,
According to the simulation condition set in the simulation condition setting step, the same gene group is repeatedly generated from the same distribution, the gene detection unit is executed, and the simulation for detecting the expressed gene is executed a plurality of times. Calculate the false positive rate and false negative rate of the results by the above detection means, calculate the number of repetitions of the experiment, the above simulation conditions, and the relationship between detection sensitivity and detection reliability, and test statistical table of genes whose expression level changes A simulation execution means for creating
Simulation result output means for outputting a simulation result by the simulation execution means for each simulation condition;
The gene expression information analysis apparatus according to any one of claims 1 to 10, further comprising:

The gene detection means includes
Deviation value calculation means for calculating the deviation value of each spot,
The gene expression information analysis apparatus according to any one of claims 1 to 11, further comprising:

A background correction step of creating background-corrected luminance data by removing the background value from the measured luminance data of each spot where fluorescence intensity indicating the expression level of the same gene under two conditions is measured;
The logarithm of the brightness data corrected in the background correction step is taken on the XY axis to create a fluorescence intensity scatter diagram, and a bias with respect to the fluorescence intensity balance axis is obtained for each gene spot. A bias correction step of constructing a fluorescence intensity scatter diagram of a new XY axis system having two axes of the fluorescence intensity balance axis and the expression magnification magnification axis by removing
A gene detection step for detecting a fluctuating gene whose expression level fluctuates based on a fluorescence intensity scatter diagram of a new XY axis system constructed by the bias correction step;
A gene expression information analysis method comprising:

The bias correction step is
Performing a principal component analysis using logarithmic values of a gene population with a high expression level, and obtaining a slope and an intercept of an asymptote as the first principal component;
Coordinate rotation for calculating coordinates obtained by rotating the coordinates in the XY axis system of the gene group with a small amount of expression to the right by θ angles, where θ is the angle between the asymptotic line obtained in the first principal component creation step and the X axis Steps,
Using the coordinates of the gene group with a small amount of expression after the coordinate axis rotation in the coordinate rotation step, the inclination of the fluorescence intensity balance axis is calculated, and which of the luminance data of the two conditions is calculated based on the calculated inclination. A bias determination step for determining whether a large amount of the bias is included in
By subtracting the bias from the luminance data determined to contain a large amount of the bias in the bias determination step, a new X with the fluorescence intensity balance axis and the expression magnification magnification axis as two axes is obtained. A correction plot generation step for constructing a fluorescence intensity scatter diagram of the Y-axis system;
The gene expression information analysis method according to claim 13, further comprising:

The principal component analysis is performed using a variance / covariance matrix,
The gene expression information analysis method according to claim 14, wherein:

The gene detection step includes
A window setting step for setting a window in a predetermined section in the fluorescence intensity balance axis direction;
A confidence limit point determining step for determining a confidence limit point in each window set by the window setting step;
A window moving step for moving the window by a certain gene in the direction of the fluorescence intensity equilibrium axis;
For each window moved by the window moving step, each confidence limit point is determined by the confidence limit point determination step, and a confidence boundary line is created based on the determined plurality of confidence limit points;
A variable gene extraction step for extracting a gene located outside the trust boundary created by the trust boundary creation step as a variable gene whose expression level has changed,
The gene expression information analysis method according to any one of claims 13 to 15, further comprising:

The confidence limit point determining step determines the confidence limit point using a t-distribution based on a test statistical table of duplicate data obtained by simulation,
The gene expression information analysis method according to claim 16, wherein:

The trust boundary line creating step creates a trust boundary line by smoothing by creating a spline curve based on the plurality of confidence limit points,
The gene expression information analysis method according to claim 16 or 17, wherein

The confidence boundary line creating step creates the confidence limit line using a horizontal extension line of the confidence limit point obtained in the last window for a region having a high fluorescence intensity,
The gene expression information analysis method according to any one of claims 16 to 18, wherein:

For the region having a low fluorescence intensity, the confidence boundary line creation step uses, as the confidence limit line, an asymptote extrapolation obtained by the least square method from the confidence limit point obtained in the window,
The gene expression information analysis method according to any one of claims 16 to 19, wherein the gene expression information is analyzed.

Gene number input step that allows the user to input the number of genes in the window,
Further including
The window setting step sets the window in the section including the genes of the number of genes input by the number of genes input step;
The gene expression information analysis method according to any one of claims 16 to 20, wherein:

Confidence limit value input step that allows the user to input the confidence limit value,
Further including
The confidence limit point determination step determines the confidence limit point based on the confidence limit value input by the confidence limit value input step in the window;
The gene expression information analysis method according to any one of claims 16 to 21, wherein:

Enter the simulation conditions including information on at least one of the distribution form of the non-fluctuating gene, the distribution form of the fluctuating gene, the detection criteria for the fluctuating gene, the number of repetitions of the experiment, and the number of simulations. Simulation condition setting step,
According to the simulation condition set in the simulation condition setting step, the same gene group is repeatedly generated from the same distribution, the gene detection unit is executed, and the simulation for detecting the expressed gene is executed a plurality of times. Calculate the false positive rate and false negative rate of the results by the above detection means, calculate the number of repetitions of the experiment, the above simulation conditions, and the relationship between detection sensitivity and detection reliability, and test statistical table of genes whose expression level changes Simulation execution step to create
A simulation result output step for outputting a simulation result by the simulation execution step for each simulation condition,
The gene expression information analysis method according to any one of claims 13 to 22, further comprising:

The gene detection step includes
Deviation value calculation step for calculating the deviation value of each spot,
The gene expression information analysis method according to any one of claims 13 to 23, further comprising:

A background correction step of creating background-corrected luminance data by removing the background value from the measured luminance data of each spot where fluorescence intensity indicating the expression level of the same gene under two conditions is measured;
The logarithm of the brightness data corrected in the background correction step is taken on the XY axis to create a fluorescence intensity scatter diagram, and a bias with respect to the fluorescence intensity balance axis is obtained for each gene spot. A bias correction step of constructing a fluorescence intensity scatter diagram of a new XY axis system having two axes of the fluorescence intensity balance axis and the expression magnification magnification axis by removing
A gene detection step for detecting a fluctuating gene whose expression level fluctuates based on a fluorescence intensity scatter diagram of a new XY axis system constructed by the bias correction step;
A computer program for causing a computer to execute a gene expression information analysis method including:

The bias correction step is
Performing a principal component analysis using logarithmic values of a gene population with a high expression level, and obtaining a slope and an intercept of an asymptote as the first principal component;
Coordinate rotation for calculating coordinates obtained by rotating the coordinates in the XY axis system of the gene group with a small amount of expression to the right by θ angles, where θ is the angle between the asymptotic line obtained in the first principal component creation step and the X axis Steps,
Using the coordinates of the gene group with a small amount of expression after the coordinate axis rotation in the coordinate rotation step, the inclination of the fluorescence intensity balance axis is calculated, and which of the luminance data of the two conditions is calculated based on the calculated inclination. A bias determination step for determining whether a large amount of the bias is included in
By subtracting the bias from the luminance data determined to contain a large amount of the bias in the bias determination step, a new X with the fluorescence intensity balance axis and the expression magnification magnification axis as two axes is obtained. A correction plot generation step for constructing a fluorescence intensity scatter diagram of the Y-axis system;
The program according to claim 25, further comprising:

The principal component analysis is performed using a variance / covariance matrix,
27. The program according to claim 26, characterized by the above.

The gene detection step includes
A window setting step for setting a window in a predetermined section in the fluorescence intensity balance axis direction;
A confidence limit point determining step for determining a confidence limit point in each window set by the window setting step;
A window moving step for moving the window by a certain gene in the direction of the fluorescence intensity equilibrium axis;
For each window moved by the window moving step, each confidence limit point is determined by the confidence limit point determination step, and a confidence boundary line is created based on the determined plurality of confidence limit points;
A variable gene extraction step for extracting a gene located outside the trust boundary created by the trust boundary creation step as a variable gene whose expression level has changed,
The program according to any one of claims 25 to 27, further comprising:

The confidence limit point determining step determines the confidence limit point using a t-distribution based on a test statistical table of duplicate data obtained by simulation,
The program according to claim 28, characterized in that

The trust boundary line creating step creates a trust boundary line by smoothing by creating a spline curve based on the plurality of confidence limit points,
30. The program according to claim 28 or 29.

The confidence boundary line creating step creates the confidence limit line using a horizontal extension line of the confidence limit point obtained in the last window for a region having a high fluorescence intensity,
The program according to any one of claims 28 to 30, characterized in that:

For the region having a low fluorescence intensity, the confidence boundary line creation step uses, as the confidence limit line, an asymptote extrapolation obtained by the least square method from the confidence limit point obtained in the window,
32. The program according to any one of claims 28 to 31, wherein:

Gene number input step that allows the user to input the number of genes in the window,
Further including
The window setting step sets the window in the section including the genes of the number of genes input by the number of genes input step;
33. The program according to any one of claims 28 to 32, wherein:

Confidence limit value input step that allows the user to input the confidence limit value,
Further including
The confidence limit point determination step determines the confidence limit point based on the confidence limit value input by the confidence limit value input step in the window;
34. The program according to any one of claims 28 to 33, wherein:

Enter the simulation conditions including information on at least one of the distribution form of the non-fluctuating gene, the distribution form of the fluctuating gene, the detection criteria for the fluctuating gene, the number of repetitions of the experiment, and the number of simulations. Simulation condition setting step,
According to the simulation condition set in the simulation condition setting step, the same gene group is repeatedly generated from the same distribution, the gene detection unit is executed, and the simulation for detecting the expressed gene is executed a plurality of times. Calculate the false positive rate and false negative rate of the results by the above detection means, calculate the number of repetitions of the experiment, the above simulation conditions, and the relationship between detection sensitivity and detection reliability, and test statistical table of genes whose expression level changes Simulation execution step to create
A simulation result output step for outputting a simulation result by the simulation execution step for each simulation condition,
35. The program according to any one of claims 25 to 34, further comprising:

The gene detection step includes
Deviation value calculation step for calculating the deviation value of each spot,
The program according to any one of claims 25 to 35, further comprising:

A computer-readable recording medium on which the program according to any one of claims 25 to 36 is recorded.