JP2004030093A - Method for analyzing gene expression data

Info

Publication number: JP2004030093A
Application number: JP2002183810A
Authority: JP
Inventors: Hiroyuki Tomita; 富田　裕之; Hideyuki Maki; 牧　秀行; Toyohisa Morita; 森田　豊久; Sabau Sorin; ソリン　サバウ; Koji Tanigawa; 谷川　浩司
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-06-25
Filing date: 2002-06-25
Publication date: 2004-01-29

Abstract

PROBLEM TO BE SOLVED: To provide a method and apparatus for knowledge search based on gene expression data (also called a gene expression profile) obtained with a DNA microarray or the like.
SOLUTION: The knowledge search is performed through: a step of receiving the gene expression data; a step of receiving class information; a step of extracting a gene group related to class classification by using a data mining technique; a step of annotating the gene group; a step of extracting a common rule of the gene group related to the class classification based on the gene annotation; and a step of performing data mining using constraint conditions based on the common rule.
COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、ＤＮＡマイクロアレイ等を用いた遺伝子発現データ（遺伝子発現プロファイルとも言う）に基づく、知識探索の方法および装置に関する。
【０００２】
【従来の技術】
遺伝子発現プロファイルを解析することで遺伝子機能を明らかにし、創薬、薬理学、毒性学、診断に供する知見を得るための研究がなされている。例えば、相関分析（Ｃｏｒｒｅｌａｔｉｏｎ　Ａｎａｌｙｓｉｓ）、主因子分析（Ｐｒｉｎｃｉｐａｌ　Ｃｏｍｐｏｎｅｎｔ　Ａｎａｌｙｓｉｓ）、分散分析（Ａｎａｌｙｓｉｓ　ｏｆ　Ｖａｒｉａｎｃｅ）などの統計解析、ｋ平均クラスタリング（ｋ−ｍｅａｎ　Ｃｌｕｓｔｅｒｉｎｇ）、階層クラスタリング（Ｈｉｅｒａｒｃｈｉｃａｌ　Ｃｌｕｓｔｅｒｉｎｇ）、自己組織化マップ（Ｓｅｌｆ−ｏｒｇａｎｉｚｉｎｇ　Ｍａｐ）などのクラスタリング、最短近傍法（Ｎｅａｒｅｓｔ　Ｎｅｉｇｈｂｏｒ）、判別分析（Ｄｉｓｃｒｉｍｉｎａｎｔ　Ａｎａｌｙｓｉｓ）、サポートベクターマシン（Ｓｕｐｐｏｒｔ　Ｖｅｃｔｏｒ　Ｍａｃｈｉｎｅ）、ニューラルネットワーク（Ｎｅｕｒａｌ　Ｎｅｔｗｏｒｋ）、遺伝的アルゴリズム（Ｇｅｎｅｔｉｃ　Ａｌｇｏｒｉｓｍ）などの分類アルゴリズムをＤＮＡチップデータの解析に適用した例がある。Ｌａｕｒａ　Ｊ．　ｖａｎ’ｔ　Ｖｅｅｒら、Ｇｅｎｅ　ｅｘｐｒｅｓｓｉｏｎ　ｐｒｏｆｉｌｉｎｇ　ｐｒｅｄｉｃｔｓ　ｃｌｉｎｉｃａｌ　ｏｕｔｃｏｍｅ　ｏｆ　ｂｒｅａｓｔ　ｃａｎｃｅｒ．　Ｎａｔｕｒｅ　４１５，　ｐｐ．５３０−５３６　（２００２）や、Ｓｃｏｔｔ　Ｌ．　Ｐｏｍｅｒｏｙら、Ｐｒｅｄｉｃｔｉｏｎ　ｏｆ　ｃｅｎｔｒａｌ　ｎｅｒｖｏｕｓ　ｓｙｓｔｅｍｅｍｂｒｙｏｎａｌ　ｔｕｍｏｒ　ｏｕｔｃｏｍｅ　ｂａｓｅｄ　ｏｎ　ｇｅｎｅ　ｅｘｐｒｅｓｓｉｏｎ．　Ｎａｔｕｒｅ　４１５，　ｐｐ．４３６−４４２　（２００２）を参照。
【０００３】
【発明が解決しようとする課題】
しかしながら、多数の遺伝子の発現を同時に解析して知識を獲得する手段については、いまだに確立されたとはいえない状況である。特に知識獲得については、解析者の知識に依存することが大きいので、解析者によって得られる知識に差があること、またＤＮＡチップから得られるデータ量が膨大であり、そもそも人間の解析能力を超えているという問題があった。
【０００４】
本発明の目的は、遺伝子発現プロファイルのセットを解析するための方法および装置を提供することである。更に詳しくは、遺伝子発現プロファイルのセットから、創薬、薬理学、毒性学、診断に有用な遺伝子群の抽出およびそれら有用遺伝子群に共通する規則を抽出する技術を提供することを目的とする。
【０００５】
【課題を解決するための手段】
遺伝子発現プロファイルのセットから有用な遺伝子群の抽出およびそれら有用遺伝子群に共通する規則を抽出するために、（１）遺伝子発現データを受け取る工程、（２）クラス情報を受け取る工程、（３）データマイニング手法を用いてクラス分類に関連する遺伝子群を抽出する工程、（４）前記遺伝子群にアノテーションを行う工程、（５）遺伝子アノテーションに基づき前記クラス分類に関連する遺伝子群の共通規則を抽出する工程、（６）前記共通規則に基づく拘束条件をもちいたデータマイニングを行う工程を行う。また前記（３）から（６）の工程を反復することが重要である。従来の情報システムは工程１から４までを行っていた。クラス分類に関連する遺伝子群とそのアノテーションについて、例えば一覧表、あるいはグラフィカルユーザーインターフェースを用いて、解析者に提示するに留めていた。その後の知識獲得の工程は、解析者の知識と直感に頼っていた。しかし、解析者ごとの知識レベルが異なること、そもそもＤＮＡチップ遺伝子発現プロファイルデータから得られる情報が膨大であるので、解析に多大の時間がかかることが問題となっている。そのため知識獲得においても、解析者を支援する情報システムが必要とされている。
【０００６】
遺伝子発現解析を除く分野でのデータマイニングは、一般にインスタンス数＞＞アトリビュート数の関係にある。例えばデパートの顧客データを例にすると、インスタンス数（顧客数）イコール数千から数万、アトリビュート数（性別、年齢、年収など）イコールたかだか百個であり、上記の関係が成り立つ。しかしＤＮＡチップでは反対に、インスタンス数（サンプル数：数個から数十）に対し、アトリビュート数（遺伝子数：数千から数万）が桁違いに大きいことが特徴である。この種の問題では、「高度なマイニング手法を用いることなく、既存のサンプルに対してある程度正確な予測ができる」一方で「新規なサンプルに対しては予測に失敗することが多い」ことが知られている。この理由は、例えば１０個の遺伝子で説明できなければ、１００個、２００個と遺伝子を増やすことでいつかは説明ができるためである。また実験誤差をあらかじめ除くことが難しいのでアトリビュート値の信頼性が低いこと、そして誤り値（Ｉｎａｃｃｕｒａｔｅ　ｖａｌｕｅｓ）が少しでも含まれているとマイニング結果が大きく変わってしまう（ロバストでない）ことも知られている。このため、適切なデータを選択すること（Ｄａｔａ　Ｓｅｌｅｃｔｉｏｎ）、および解析に悪影響を与える誤り値を除くこと（Ｄａｔａ　Ｃｌｅａｎｓｉｎｇ）などの前処理が重要となる。現在、Ｄａｔａ　Ｓｅｌｅｃｔｉｏｎ、Ｄａｔａ　Ｃｌｅａｎｓｉｎｇの確立した方法はないが、後のＫｎｏｗｌｅｄｇｅ　Ｄｉｓｃｏｖｅｒｙの成否を決定的に左右すると考えられる。すなわちＤＮＡチップデータ等の遺伝子発現データの性質は、（１）アトリビュート数＞＞インスタンス数、（２）アトリビュート値の信頼性が低い、（３）誤り値に対しロバストでないので、Ｄａｔａ　Ｓｅｌｅｃｔｉｏｎ、Ｄａｔａ　Ｃｌｅａｎｓｉｎｇは極めて重要であるという３つの性質である。以後、本願明細書の記載では、Ｄａｔａ　Ｓｅｌｅｃｔｉｏｎ、Ｄａｔａ　Ｃｌｅａｎｓｉｎｇをまとめてフィルタリングと呼ぶ。
【０００７】
遺伝子発現データとは、例えばＤＮＡチップ法（ＤＮＡ　Ｃｈｉｐ）、ディファレンシャルディスプレイ法（Ｄｉｆｆｅｒｅｎｔｉａｌ　Ｄｉｓｐｌａｙ）、定量的ＰＣＲ法（Ｑｕａｎｔｉｔａｔｉｖｅ　ＰＣＲ）、ＳＡＧＥ（Ｓｅｒｉａｌ　Ａｎａｌｙｓｉｓ　ｏｆ　Ｇｅｎｅ　Ｅｘｐｒｅｓｓｉｏｎ）法、プロテインチップ法（Ｐｒｏｔｅｉｎ　Ｃｈｉｐ）などの複数遺伝子もしくは蛋白質の発現変化を測定する方法により得られた複数遺伝子（あるいは蛋白質）に関する発現量、もしくは発現量同士の比率のことである。
【０００８】
クラス情報とは、例えばＤＮＡチップ法等で測定された対象を分類するための情報である。例えば測定対象サンプルが被検査者の血液である場合、その血液が患者由来もしくは健常人由来のいずれであるかを、患者由来であれば１、健常人由来であれば０と定義する。また病理知見等に基づき、がんなどの疾患の悪性度を０，１，２，３などと定義できる。培養細胞であれば、薬物投与前の細胞由来サンプルを０、薬物投与６時間後のサンプルを１、薬物投与１２時間後のサンプルを２、薬物投与２４時間後のサンプルを３などと定義できる。クラスが同一であれば、測定対象となる個人、個体が異なっても、同一の実験条件や性質、表現型（フェノタイプ）を有するものとする。
【０００９】
データマイニングとは、データベースから、興味ある規則性や因果関係を計算機で自動的に抽出する技術のことである。例えば決定木（Ｄｅｃｉｓｉｏｎ　Ｔｒｅｅ）、ナイブベイズ（Ｎａｉｖｅ　Ｂａｙｅｓ）、フルベイジアン（Ｆｕｌｌｙ　Ｂａｙｅｓｉａｎ）、相関ルール（Ａｓｓｏｃｉａｔｉｏｎ　ｒｕｌｅ）、特徴ルール（Ｃｈａｒａｃｔｅｒｉｓｔｉｃ　ｒｕｌｅ）、ＥＭクラスタリング（ＥＭ　Ｃｌｕｓｔｅｒｉｎｇ）、最短近傍法（Ｎｅａｒｅｓｔ　Ｎｅｉｇｈｂｏｒ）、判別分析（Ｄｉｓｃｒｉｍｉｎａｎｔ　Ａｎａｌｙｓｉｓ），サポートベクターマシン（Ｓｕｐｐｏｒｔ　Ｖｅｃｔｏｒ　Ｍａｃｈｉｎｅ）、遺伝的アルゴリズム（Ｇｅｎｅｔｉｃ　Ａｌｇｏｒｉｓｍ）、線形回帰（ＬｉｎｅａｒＲｅｇｒｅｓｓｉｏｎ）のことである。もしくは上記方法を使用する上で、バギング（Ｂａｇｇｉｎｇ）、ブースティング（Ｂｏｏｓｔｉｎｇ）、スタッキング（Ｓｔａｃｋｉｎｇ）などの手法を併用することもある。
【００１０】
アノテーションとは注釈付けのことである。例えば塩基配列、遺伝子機能情報、疾患関連情報、公共データベースの該当ＩＤ、他種遺伝子間（例えばヒトとマウス）のホモログ情報、遺伝子ネットワーク情報、パスウェイ情報等である。
【００１１】
正規化（ノーマリゼーションとも言う）とは、実験を行った時間、場所、作業者が異なることから実験のバックグランドや、ノイズの程度が異なるので、その実験ごとのバックグランドやノイズの程度を揃える操作のことである。ＤＮＡチップであれば、チップごとに画像輝度（蛍光強度）が異なることがあるので、まずバックグランド輝度、例えば、本来プローブが存在しない部位の輝度を取得して、そのバックグランド輝度の平均値や中央値を、プローブ蛍光強度から差し引くことなどの操作のことである。
【００１２】
フィルタリングとは、前記のように適切なデータを選択し、解析に悪影響を与える誤り値を除く操作のことである。例えばしきい値を定める方法がある。具体的には、信号強度が小さい（例えば５００以下の）場合、しきい値イコール５００とし、５００以下の信号強度を５００にするかもしくはゼロにする操作のことである。
【００１３】
クロスバリデーション（Ｃｒｏｓｓ−Ｖａｌｉｄａｔｉｏｎ）とは、例えばテンフォールドクロスバリデーション（ｔｅｎｆｏｌｄ　ｃｒｏｓｓ−ｖａｌｉｄａｔｉｏｎ）、もしくはリーブワンアウトクロスバリデーション（Ｌｅａｖｅ−ｏｎｅ−ｏｕｔ　ｃｒｏｓｓ−ｖａｌｉｄａｔｉｏｎ）のことである。テンフォールドクロスバリデーションとは、データセットをランダムに１０個に等分割し、１０分の９個のデータでトレーニングし、残りの１０分の１個のデータでテストする合計１０回の試行の結果から、正解率（もしくはエラー率）を算出する方法である。リーブワンアウトクロスバリデーションとは、ｎ個のデータセットのうち、ｎ−１個のデータでトレーニングし、残りの１個のデータでテストする合計ｎ回の試行の結果から、正解率（もしくはエラー率）を算出する方法である。限られた数のデータセットから、各マイニング方法の正解率（もしくはエラー率）を比較し、どのマイニング方法が優れているのかを定量的に評価する方法である。
【００１４】
図１は、本発明のソフトウェアを実行するために利用され得るコンピュータシステムと全体の工程を表したインターフェースの一例を示す。まず遺伝子発現データ、クラス情報が格納されているファイル入力のためにファイル形式やファイル名を入力する。続いて、正規化やフィルタリング方法の選択をし、データマイニングを行う。データマイニングにより抽出された遺伝子群にアノテーションをほどこす際に、どのアノテーションを行うかを選択する。またデータマイニングにより抽出された遺伝子群に基づいてクロスバリデーションを計算し、正解率がしきい値（α；図１では０．９５）以上ならば、そのまま終了し、しきい値以下であれば、アノテーション結果を反映させた拘束条件のもとで次回のマイニングを行う。このマイニングは、正解率がしきい値を超えるまで自動的に反復して行える。
【００１５】
図２は、ＤＮＡチップの一般的な構造を示した図である。図１６にＤＮＡチップを用いた測定法のフローチャートを示す。まず支持体２４にＤＮＡプローブ２２を固定化する。続いて、測定対象サンプルから抽出した遺伝子断片を蛍光標識などで標識する。この蛍光標識された遺伝子２３を、ＤＮＡプローブ２２とハイブリダイズさせる。その後、蛍光標識由来の蛍光を検出器２１で検出する。この検出の結果、各ＤＮＡプローブ２２にハイブリダイズした蛍光標識された遺伝子２３の量が得られる。これを発現分布という。
【００１６】
図３は、遺伝子アノテーションを行うに際して考慮すべき、遺伝子全体像（ゲノム：Ｇｅｎｏｍｅ）、転写産物全体像（トランスクリプトーム：Ｔｒａｎｓｃｒｉｐｔｏｍｅ）、蛋白質全体像（プロテオーム：Ｐｒｏｔｅｏｍｅ）における遺伝子同士、転写産物同士、蛋白質同士、あるいは遺伝子と転写産物、転写産物と蛋白質、遺伝子と蛋白質の相互関係の例である。なお、オーム（ｏｍｅ）は全体あるいは全体像を意味する接尾語であり、遺伝子（Ｇｅｎｅ）の全体をゲノム（Ｇｅｎｏｍｅ）、転写産物（Ｔｒａｎｓｃｒｉｐｔ）の全体をトランスクリプトーム（Ｔｒａｎｓｃｒｉｐｔｏｍｅ）、蛋白質（Ｐｒｏｔｅｉｎ）の全体をプロテオーム（Ｐｒｏｔｅｏｍｅ）と呼ぶ。本願明細書の以下の記述では、遺伝子、転写産物、蛋白質を遺伝子等と呼ぶ。図３における白丸が個々の遺伝子等を指し、白丸と白丸とをつないだ線は、実験等により既知となっている相互作用、因果関係である。図３（Ａ）は遺伝子等に相互作用がない状態であり、独立と呼ばれる。但し将来、実験が行われることによってなんらかの相互作用、因果関係が発見される可能性はある。図３（Ｂ）は遺伝子等が１対１で相互作用する関係であり、二項関係と呼ばれる。例えば図４に示すように細胞表面に存在する受容体（レセプター）とその受容体に結合する結合体（リガンド）の関係が二項関係の一例である。図４（Ａ）はインターロイキン２（ＩＬ２）蛋白質とインターロイキン２受容体アルファ（ＩＬ２ＲＡ）、インターロイキン２（ＩＬ２）蛋白質とインターロイキン２受容体ベータ（ＩＬ２ＲＢ）、インターロイキン２（ＩＬ２）蛋白質とインターロイキン２受容体ガンマ（ＩＬ２ＲＧ）、図４（Ｂ）は形質転換成長因子ベータ１（ＴＧＦＢ１）と形質転換成長因子ベータ受容体１（ＴＧＦＢＲ１）、形質転換成長因子ベータ１（ＴＧＦＢ１）と形質転換成長因子ベータ受容体２（ＴＧＦＢＲ２）、形質転換成長因子ベータ１（ＴＧＦＢ１）と形質転換成長因子ベータ受容体３（ＴＧＦＢＲ３）、図４（Ｃ）はエリスロポエチン（ＥＰＯ）蛋白質とエリスロポエチン受容体（ＥＰＯＲ）の間のリガンド−レセプター関係である。またＤＮＡとＤＮＡ結合蛋白質も二項関係の一例である。ＤＮＡ結合蛋白質の例として転写因子、修復遺伝子などがある。図３（Ｃ）は道筋にそって遺伝子等同士が相互作用する関係でありパスウェイと呼ばれる。パスウェイ中には分岐は存在するものの、上流から下流へと一方向に相互作用が行われることが特徴である。パスウェイ上流とパスウェイ下流の遺伝子等の間には、因果関係が存在するともいえる。パスウェイの例として、図５に示すＭＡＰキナーゼ（Ｍｉｔｏｇｅｎ　Ａｃｔｉｖａｔｅｄ　Ｐｒｏｔｅｉｎ　Ｋｉｎａｓｅ）パスウェイのように、細胞表面の受容体とリガンドが結合したことを起点として、細胞表面から細胞内の細胞核にいたるまで情報を伝達するパスウェイが知られている。個々の丸印は遺伝子を、丸印と丸印をつなぐ矢印は、その矢印の方向に情報伝達が行われることを意味する。例えば図５のＭｏｓ遺伝子からＭＥＫ遺伝子に、ＭＥＫ遺伝子からＥＲＫ遺伝子に情報が伝達される。パスウェイ情報としては、例えばパスウェイデータベース　（ｈｔｔｐ：／／ｗｗｗ．ｂｉｏｃａｒｔａ．ｃｏｍ／）を参照。
【００１７】
図３（Ｄ）はＤＮＡ塩基配列上の遺伝子の相互配置関係であり、本願明細書ではゲノムと呼ぶ。ヒト、マウス、ラット等の高等動物では染色体上に、酵母や細菌などでは環状ＤＮＡ上に遺伝子が配置されている。図６にゲノムの例を示す。ヒト第１３番染色体上の１３ｑ１２から１３ｑ１３領域の一部では、ＬＯＣ２２２４２８遺伝子からＬＯＣ１６０９７９遺伝子までが、図６のような位置に順番に存在している。ある疾患は染色体の一部の領域が欠失したり、増幅したりする結果、近傍にある遺伝子群が同時に欠失もしくは増幅することが原因で発症する。そのよう遺伝子増幅、遺伝子欠失が原因となる疾患の原因遺伝子を探索するには、ゲノムの情報は有用である。図３（Ｅ）は遺伝子等同士が階層構造にある場合で、階層構造の例として図７のオントロジー、図８の酵素（ＥＣ：Ｅｎｚｙｍｅ　Ｃｏｍｍｉｓｓｉｏｎ）、図９のスーパーファミリーの関係がある。図７はＡＤＰＲＴと呼ばれる遺伝子のオントロジーである。オントロジーとは遺伝子配列や蛋白質配列解析等に基づき、遺伝子の機能を定義した辞書である。オントロジーの詳細については、Ｔｈｅ　ｇｅｎｅ　ｏｎｔｏｌｏｇｙ　ｃｏｎｓｏｒｔｉｕｍ、Ｇｅｎｅ　ｏｎｔｏｌｏｇｙ：　ｔｏｏｌ　ｆｏｒ　ｔｈｅ　ｕｎｉｆｉｃａｔｉｏｎ　ｏｆ　ｂｉｏｌｏｇｙ．　Ｎａｔｕｒｅ　Ｇｅｎｅｔｉｃｓ　２５，　ｐｐ．２５−２９　（２０００）　を参照。遺伝子オントロジーによるとＡＤＰＲＴ遺伝子は、ＤＮＡ修復（ＤＮＡ　ｒｅｐａｉｒ）、ＡＤＰ−リボシル化（ＡＤＰ−ｒｉｂｏｓｙｌａｔｉｏｎ）の機能を有することが分かる。ＤＮＡ修復はＤＮＡ代謝（ＤＮＡ　ｍｅｔａｂｏｌｉｓｍ）の一つであり、ＤＮＡ代謝は核酸等代謝（ｎｕｃｌｅｏｂａｓｅ，　ｎｕｃｌｅｏｓｉｄｅ，　ｎｕｃｌｅｏｔｉｄｅ　ａｎｄ　ｎｕｃｌｅｉｃ　ａｃｉｄ　ｍｅｔａｂｏｌｉｓｍ）の一つである。
【００１８】
図７のオントロジー右端括弧中の数字は登録遺伝子数を示す。３８６個の遺伝子がＤＮＡ修復機能を有する遺伝子として現在登録されており、ＤＮＡ修復を含むＤＮＡ代謝には１１３８個の遺伝子が登録されている。遺伝子オントロジーでは汎用的な大分類から詳細な小分類へと階層構造を形成していることが、登録遺伝子数からも見て取れる。図８に階層構造の二つ目の例である酵素の例を示す。酵素はその構造や機能に基づいたＥＣ番号（Ｅｎｚｙｍｅ　Ｃｏｍｍｉｓｓｉｏｎ）によって分類されている。ＥＣ１からＥＣ６まであり、ＥＣ１はオキシドレダクターゼ（Ｏｘｉｄｏｒｅｄｕｃｔａｓｅｓ）、ＥＣ２はトランスフェラーゼ（Ｔｒａｎｓｆｅｒａｓｅｓ）、ＥＣ３はハイドロラーゼ（Ｈｙｄｒｏｌａｓｅｓ）、ＥＣ４はリアーゼ（Ｌｙａｓｅｓ）、ＥＣ５はイソメラーゼ（Ｉｓｏｍｅｒａｓｅｓ）、ＥＣ６はリガーゼ（Ｌｉｇａｓｅｓ）である。ＥＣ１から６は更に階層構造に従った分類がなされている。図８ではＥＣ６の例をしめすが、ＥＣ６はＥＣ６．１からＥＣ６．５に分類される。ＥＣ６．３は更にＥＣ６．３．１からＥＣ６．３．５までに分類される。ＥＣ６．３．３の場合、実際の酵素はＥＣ６．３．３．１からＥＣ６．３．３．３である。
【００１９】
図９に階層構造の三つ目の例であるスーパーファミリーの例を示す。スーパーファミリーとは塩基配列の解析から得られるモチーフ、ドメイン構造から、類似の蛋白質立体構造や機能を有することが予想される一連の遺伝子群のことである。なおモチーフとは構造やパターンの要素のことであるが、ここでは各種のタンパク質のアミノ酸配列中に認められる一定の構造を指す。モチーフは互いに機能が異なる幅広いタンパク質に共通して見られる構造である。なおタンパク質にはドメインと呼ばれる構造があるが、タンパク質のドメインはモチーフの種々の組み合わせでできている。一般にモチーフは、タンパク質のドメインよりは小さい構造単位である。モチーフには、例えばヘリックス−ターン−ヘリックスやジンクフィンガーと呼ばれるＤＮＡ結合構造モチーフなどがある。図９は薬物代謝酵素ＣＹＰ遺伝子群の例を示している。ＣＹＰ遺伝子群は全体で約５０個ほど存在するが、それらの遺伝子群は、構造や機能ごとにＣＹＰ１Ａ１，ＣＹＰ１Ａ２などのグループに分類されている。
【００２０】
図３（Ｆ）は遺伝子等同士がネットワーク関係にある場合で、ネットワークの例として図１０の文献情報、図１１の蛋白質相互作用、図１２の代謝経路の関係がある。図１０にネットワークの一つ目の例である文献情報に基づく相互関係について示す。この方法は、二つの遺伝子名が同一の文献データベースの同一文章内に存在する数が多いほど両者の相互関係が強いとするものであり、例えばＰｕｂＧｅｎｅデータベースを用いて相互関係スコアを得ることができる。Ｔｏｒ−Ｋｒｉｓｔｉａｎ　Ｊｅｎｓｓｅｎら、Ａ　ｌｉｔｅｒａｔｕｒｅ　ｎｅｔｗｏｒｋ　ｏｆ　ｈｕｍａｎ　ｇｅｎｅｓ　ｆｏｒ　ｈｉｇｈ−ｔｈｒｏｕｇｈｐｕｔ　ａｎａｌｙｓｉｓ　ｏｆ　ｇｅｎｅ　ｅｘｐｒｅｓｓｉｏｎ，　Ｎａｔｕｒｅ　Ｇｅｎｅｔｉｃｓ，　ｖｏｌ．２８，　ｐｐ２１−２８参照。図１０の各丸印は遺伝子を、丸印同士をつなぐ線は、相互関係があることを示している。線に並んで記載された数字は相互関係スコアを表す。図１０では相互関係スコアとは、線でつながれた二つの遺伝子が医学文献データベースＭＥＤＬＩＮＥの同一アブストラクト文中に存在した件数を示している。図１０の中央にある遺伝子ＡＤＰＲＴと相互関係が強いものは、ＴＰ５３，ＣＦＴＲ，ＥＥＦ２，ＦＲＡ１Ｈ，ＳＰ１，ＤＡＦの６遺伝子であり、相互関係スコアは６遺伝子とも１である。このＰｕｂＧｅｎｅでは文献データベースとして米国ＮＣＢＩのＭＥＤＬＩＮＥやＯＭＩＭを用いているが、その他の文献データベースでもかまわない。図１１にネットワークの２つ目の例である蛋白質相互作用をしめす。米国カリフォルニア大学ロサンゼルス校のＤＩＰ（Ｄａｔａｂａｓｅ　ｏｆ　Ｉｎｔｅｒａｃｔｉｎｇ　Ｐｒｏｔｅｉｎｓ）などのような、タンパク質相互作用データベースを用いて蛋白質相互作用を調べることができる。ＤＩＰについてはＩ　．Ｘｅｎａｒｉｏｓら、ＤＩＰ：　ｔｈｅ　ｄａｔａｂａｓｅ　ｏｆ　ｉｎｔｅｒａｃｔｉｎｇ　ｐｒｏｔｅｉｎｓ，　Ｎｕｃｌｅｉｃ　Ａｃｉｄ　Ｒｅｓｅａｒｃｈ，　ｖｏｌ．２８，　２８９−２９１，　２０００参照。タンパク質相互作用データベースにおいても、相互作用するタンパク質同士が線で結合されている。相互作用の強さは、例えば解離定数（Ｄｅｓｓｏｃｉａｔｉｏｎ　Ｃｏｎｓｔａｎｔ）が実験的に求められていれば、分子同士の結合力が分かるので、結合力の強い方をより相互作用が大きいと見なす。また１回の実験で確認された相互作用より、２回以上の複数回の実験で確認された相互作用を、より相互作用が強いと見なしてもよい。なおＤＩＰ以外のタンパク質相互作用データベースを用いてもかまわない。図１１にネットワークの３つ目の例である代謝経路をしめす。代謝経路の詳細はＫＥＧＧ（ｈｔｔｐ：／／ｗｗｗ．ｋｅｇｇ．ｋｙｏｔｏ−ｕ．ａｄ．ｊｐ）などの代謝経路データベースを参照。図３（Ｃ）のパスウェイとこの代謝経路との違いは、補酵素（ｃｏ−ｅｎｚｙｍｅ）と呼ばれる反応を媒介する複数の蛋白質が代謝反応では関与するため、単純に上流から下流という一方向の関係にとどまらない複雑な構造をとる点である。
【００２１】
図３から図１２はゲノム、トランスクリプトーム、プロテオームといった遺伝子情報における相互関係を示している。しかし遺伝子情報は生物情報の一部である。図１３に示すように、生物情報には、酵素間相互作用（酵素全体像：エンザイモーム：Ｅｎｚｙｍｏｍｅ，代謝全体像：メタボローム：Ｍｅｔａｂｏｌｏｍｅ）、相互作用（相互作用全体像：インタラクトーム：Ｉｎｔｅｒａｃｔｏｍｅ）、時間的空間的局在（局在全体像：ローカリゾーム：Ｌｏｃａｌｉｚｏｍｅ）、表現型（表現型全体像：フェノーム：Ｐｈｅｎｏｍｅ）にいたる階層構造がある。Ｍ．　Ｖｉｄａｌ、Ａ　ｂｉｏｌｏｇｉｃａｌ　ａｔｌａｓ　ｏｆ　ｆｕｎｃｔｉｏｎａｌ　ｍａｐｓ．　Ｃｅｌｌ　１０４，　ｐｐ．３３３−３３９（２００１）を参照。本願明細書の解析法は、ゲノム、トランスクリプトーム、プロテオーム内の相互作用に留まらず、図１３に示す生物情報全般における相互作用についても、適用することができる。
【００２２】
図１４と図１５に本発明を実現するためのコンピュータ化された方法を示す模式図を示す。図１４は、クロスバリデーション値と既定値とを比較することで繰り返し回数を決定する、本発明を実現するためのコンピュータ化された方法を示す模式図である。
【００２３】
図１５は、全てのクロスバリデーション値を保存して比較する、本発明を実現するためのコンピュータ化された方法を示す模式図である。図１４と図１５において、遺伝子発現データを受け取る工程、クラス情報を受け取る工程、データマイニング手法を用いてクラス分類に関連する遺伝子群を抽出する工程、前記遺伝子群に注釈付け（アノテーション）を行う工程、遺伝子アノテーションに基づき前記クラス分類に関連する遺伝子群の共通規則を抽出する工程、前記共通規則に基づく拘束条件をもちいたデータマイニングを行う工程は共通である。図１４と図１５では遺伝子発現データを受け取った後に、正規化とフィルタリング処理を行ってもよいし行わなくてもよいが、特にフィルタリングはＤａｔａ　Ｓｅｌｅｃｔｉｏｎ、Ｄａｔａ　Ｃｌｅａｎｓｉｎｇというデータマイニング結果を左右する工程なので行うことが望ましい。
【００２４】
ＤＮＡチップのデータ解析において有用なフィルタリングの例を表１に示す。フィルタリング法の１例であるしきい値法は、前述のように信号強度が小さい（例えば５００以下の）場合、しきい値イコール５００とし、５００以下の信号強度を５００にするかもしくはゼロにする操作のことである。しきい値の決定には経験に基づく方法、ノンパラメトリック検定、例えばＷｉｌｃｏｘｏｎ符号付順位検定に基づく方法などがある。
【００２５】
【表１】

フィルタリング法の１例である相関係数法は、まず（１）実データの全遺伝子に対し、複数サンプルから２個のサンプルを無作為に抽出して、相関係数を計算する。（２）続いて人工的に作成したランダムデータに対し、前記と同様に相関係数を計算する。相関係数にはピアソンの相関係数を用いる。（３）ランダムデータでの相関係数の確率分布と、実データでの相関係数との確率分布を比較する。（４）ランダムデータの分布から見て有意に大きい相関もしくは逆相関を有する遺伝子群のみをデータマイニングの入力として用いる方法である。
【００２６】
フィルタリング法の１例である分散分析法は標準的なＡＮＯＶＡ（Ａｎａｌｙｓｉｓ　ｏｆ　Ｖａｒｉａｎｃｅ）と似ている。但し標準的なＡＮＯＶＡは、データ同士が互いに独立であることを仮定して、Ｆ分布を判定に用いる。実際の遺伝子では、遺伝子データ同士が何らかの相関があることが十分予想されるので、通常のＡＮＯＶＡを用いることは誤りである。そこで、Ｆ分布に相当する分布を実データからブートストラップ法などを用いて模擬的に作成する方法が取られることが多い。
【００２７】
フィルタリング法の１例であるベイズ推定法について説明する。以下Ｃｙ３とＣｙ５という二色の蛍光色素で、２種類のサンプルを染色して同時にハイブリダイゼーションするＤＮＡチップ実験を想定する。予め、同一ＲＮＡを二つに分けて、片方をＣｙ３、もう片方をＣｙ５で標識し、同一チップ上で競合ハイブリさせる実験（チップ枚数は５、１０枚程度必要）を行うことを考える。この実験のデータは、Ｃｙ５／Ｃｙ３の平均値＝ｃは１．０であるが、分散σ^２については未知である正規母集団Ｎ（ｃ、σ^２）から大きさｎの無作為標本｛Ｙ１，Ｙ２，・・・，Ｙｉ，・・・，Ｙｎ｝を抽出し、その観測値ｙ＝（ｙ１，　ｙ２，　・・・，ｙｉ，・・・，ｙｎ）を得たことに相当する。チップ間差、色素間差、ハンドリングの個人差等の誤差要因は、実験間で相関がないと考えられるため｛Ｙ１，Ｙ２，・・・，Ｙｉ，・・・，Ｙｎ｝は相互に独立で同一の分布に従う（ｉｎｄｅｐｅｎｄｅｎｔｌｙ　ａｎｄ　ｉｄｅｎｔｉｃａｌｌｙ　ｄｉｓｔｒｉｂｕｔｅｄ；　ｉ．ｉ．ｄ．）と言える。
【００２８】
【式１】

本ベイズ推定の最終目的は、Ｙｉ　のσ^２を推定することにある。本推定により母集団Ｎ（１、σ^２）のσ^２を推定すれば、式１より、Ｙｉのσ^２も同一である。ベイズ推定ではσ^２は分布を持つとするので、あくまでσ^２の推定値が得られる。推定値としては、平均、モード（度数が最大となる点）や、信頼区間として最高密度信頼区間（Ｈｉｇｈｅｓｔ　Ｄｅｎｓｉｔｙ　Ｒｅｇｉｏｎ；ＨＤＲ）が得られる。９０％最高密度区間は、いわば未知母数σ^２の９０％区間のうちで最も短く、また事後分布のピーク値（事後モード）を必ず含み、かつ区間の両端における事後密度が等しくなるものである。
式１が成り立つとき、｛Ｙ１，Ｙ２，・・・，Ｙｉ，・・・，Ｙｎ｝の同時確率密度分布（ａ１　＜　Ｙ１　≦　ｂ１，　ａ２　＜　Ｙ２　≦　ｂ２，・・・、ａｎ　＜　Ｙｎ　≦　ｂｎが同時に満たされる確率密度分布）ｐ（ｙ｜ｃ，　σ^２）は、
【００２９】
【式２】

のように正規分布を掛け合わせた形式で書ける。但しΠは積記号である。従って、観測値ベクトルｙ＝（ｙ１，　ｙ２，　・・・，ｙｉ，・・・，ｙｎ）が与えられたときの尤度関数ｌ（σ^２｜　ｙ）は、
【００３０】
【式３】

【式４】

となる。ただし、ｓ^２は母平均ｃを中心とした観測値分布であり、下記の式で表される。
【００３１】
【式５】

ここで、分散の事前分布ｐ（σ^２）として、無情報事前分布（ｎｏｎｉｎｆｏｒｍａｔｉｖｅ　ｐｒｉｏｒ　ｄｉｓｔｒｉｂｕｔｉｏｎ）を仮定する。この仮定は母数に関して無知であるとすることで、事前分布についての恣意性を出来る限り排除し、事後分布はできるだけデータによって支配されるようにする点で妥当である。無情報事前分布として、局所一様事前分布を用いるのが一般的である。局所一様分布とは、未知母数を二乗しようが、三乗しようが、対数値をとろうが、事前情報の漠然性を表すために少なくとも局所的には一様に分布するような分布のことである。具体的にはフィッシャー情報量の平方根に比例するように定めればよいことが分かっている。事前分布として局所一様分布を用いるならば、
【００３２】
【式６】

とすれば良い。つまりσ^２の事前分布ｐ（σ^２）は、σ^−２すなわち定数とする。次に事後分布を考える。ベイズの定理より
【００３３】
【式７】

が成立するので、
【００３４】
【式８】

より、σ^２の事後分布ｐ（σ^２｜ｙ）は、χ^−２（ｎ、ｎｓ^２）と等しい分布になる。なおχ^−２（ν、λ）とは、尺度母数λをもつ自由度νの逆カイ二乗分布と呼ばれる分布である。χ^−２（ν、λ）の平均はλ／（ν−２）、モード（度数が最大となる点）はλ／（ν＋２）となることが分かっているので、σ^２の点推定値として、
【００３５】
【式９】

【式１０】

を考えることができる。また式８より事後的に、
【００３６】
【式１１】

の関係が得られる。式１１のχ^２（ｎ）とは、自由度ｎのカイ二乗分布である。ここでは、ｎｓ^２が固定値（観測値）、σ^２が確率変数になっている。式１１と数表を用いることで、ＨＤＲを求めることができる。式９，１０，１１をもちいることで、２種類のサンプル間の遺伝子発現比率データが得られたとき、その比率が１．０と比較してどの程度、統計的有意に異なるかを知ることができる。１．０より有意に異なる比率を有する遺伝子群のみをデータマイニングの入力として使用すれば良い。例えば５回の実験で、ｙ_１＝１．４、ｙ_２＝０．８９、ｙ_３＝１．２４、ｙ_４＝０．９１、ｙ_５＝１．０４が得られたとすると、ｓ^２＝０．０４７８８である。平均値を基準とした点推定（式９）より、Ｙｉ〜Ｎ（１，　０．０７９８）となる。またモードを基準とした点推定より（式１０）より、Ｙｉ〜Ｎ（１，　０．０３４２）となる。式１１より、σ^２〜０．２３９４χ^−２（５）、となることから、数表を用いることでσ^２の９０％ＨＤＲは０．０１９−０．１７７となる。なお式１から式１１までに示したベイズ推定法は統計手法の一つの方法であるので、ベイズ推定法以外の方法、例えば、ネイマン・ピアソン流の推定法を用いて同様の推定を行ってもよい。
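The worked example above can be checked numerically. The following is a minimal Python sketch, assuming that the scale parameter is n·s² with s² measured about the known mean c = 1.0, and using the mean λ/(ν−2) and mode λ/(ν+2) of the inverse chi-square distribution as stated in the text. The function name `variance_point_estimates` is hypothetical. The 90% HDR is not computed here because the text obtains it from chi-square tables.

```python
def variance_point_estimates(ratios, center=1.0):
    """Point estimates of sigma^2 under the inverse chi-square posterior."""
    n = len(ratios)
    if n <= 2:
        raise ValueError("the posterior mean needs more than two observations")
    # Scale parameter n * s^2: sum of squared deviations about the known mean c.
    scale = sum((y - center) ** 2 for y in ratios)
    return {
        "s2": scale / n,                     # observed spread about c
        "posterior_mean": scale / (n - 2),   # lambda / (nu - 2)
        "posterior_mode": scale / (n + 2),   # lambda / (nu + 2)
    }


# The five Cy5/Cy3 ratios quoted in the text.
estimates = variance_point_estimates([1.4, 0.89, 1.24, 0.91, 1.04])
for name, value in estimates.items():
    print(f"{name} = {value:.5f}")
```

To the printed precision, the three values agree with the figures quoted in the text (0.04788, 0.0798, and 0.0342).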
【００３７】
発現データを読み込んだ後、正規化、フィルタリングを行い、またクラス情報を読み取ることで、データマイニングに入力するためのデータ形式である発現マトリクスを作成することができる。発現マトリクス形式を図１７に示す。遺伝子アノテーションを行（Ｒｏｗ）、サンプルアノテーションを列（Ｃｏｌｕｍｎ）、対応する発現レベル（例えば前述のＣｙ５とＣｙ３の比率：Ｃｙ５／Ｃｙ３）をマトリクスにした構造である。この発現マトリクス形式のサンプルアノテーション部位にクラス情報を付加することで、データマイニングに適した構造となる。この発現マトリクスは例えばＣＳＶ形式やタブ区切り形式などの形式でファイルに保存することができる。
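As an illustration of the expression matrix described above, here is a small Python sketch that serializes genes as rows, samples as columns, and the class labels as an extra row in the sample annotation part. The exact file layout is an assumption, since the text only names CSV and tab-delimited formats. The function name and the example values are invented for illustration.

```python
import csv
import io


def write_expression_matrix(gene_ids, sample_ids, class_labels, levels):
    """Serialize an expression matrix (genes x samples) plus class labels as CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["gene"] + list(sample_ids))     # sample annotation
    writer.writerow(["class"] + list(class_labels))  # class information per sample
    for gene, row in zip(gene_ids, levels):
        writer.writerow([gene] + list(row))          # gene annotation + expression levels
    return buffer.getvalue()


# Hypothetical Cy5/Cy3 ratios for two genes measured in two samples.
matrix_text = write_expression_matrix(
    gene_ids=["geneA", "geneB"],
    sample_ids=["s1", "s2"],
    class_labels=[0, 1],
    levels=[[1.2, 0.8], [0.5, 2.0]],
)
print(matrix_text)
```

Passing `delimiter="\t"` to `csv.writer` would give the tab-delimited variant mentioned in the text.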
【００３８】
次のデータマイニング工程は、１回目であれば第１次データマイニング、以後マイニングを反復して行った（イタレーションした）場合その回数がｎ回目であればｎ次データマイニング工程と呼ばれる。チップデータに適したマイニング方法を４つ表２に示した。いずれも教師付学習法（スーパーバイズドメソッド）において代表的な方法である。最短近傍法、判別分析についてはＳｐｌｕｓ等の統計計算パッケージ（Ｓｐｌｕｓのｃｌａｓｓパッケージ）を用いて実行することができる。またサポートベクターマシンについてもフリーソフトでＳｐｌｕｓのクローンであるＲ言語のｅ１０７１パッケージを用いて実行することができる。次に特徴ルール法について説明する。
【００３９】
【表２】

特徴ルール法は特開平８−７７０１０に、データ分析方法として開示されている。特徴ルール法では、複数の属性項目からなるサンプルの集合を対象データとする。全てのサンプルは互いに同一の属性項目を持つ。それぞれの属性項目が取り得る属性値はサンプル数と比較して少数の離散値であることが要求される。典型的には、３通り程度の記号値である。元の分析対象データが実数値データである場合、適当な境界で値の範囲を区切り、「大」「中」「小」といった記号値に置き換える等の方法で離散化する。
【００４０】
特徴ルール法を実施する際には、属性項目の１個を選び、「結論項目」とする。また、結論項目の取り得る属性値のうちの１個を選び、「結論項目値」とする。また、その他の複数の属性項目を選び、「条件項目」とする。
【００４１】
特徴ルール法では、「ＩＦ（条件部）ＴＨＥＮ（結論部）」という形式のＩＦ−ＴＨＥＮルールを生成する。ルールの条件部は、条件項目とその属性値の組、すなわち述語であり、複数の述語が同時に条件部に現れることを許すが、典型的には３個程度に制限する。また、結論部は、結論項目と結論項目値からなる述語である。これにより、結論部の述語はただ１つに決定され、一方、条件部は様々な述語の組み合わせを取り得る。したがって、生成し得るＩＦ−ＴＨＥＮルールの数は、一般に多数になる。これら多数のＩＦ−ＴＨＥＮルールの中から、対象データの特徴をよく表している比較的少数のルールを探索することが特徴ルール法の目的である。各ＩＦ−ＴＨＥＮルールが対象データの特徴をどの程度よく表しているかを評価するために、以下の評価尺度を用いる。条件部をＡ（複数の述語の組み合わせを含む）、結論部をＢとする、「ＩＦ　Ａ　ＴＨＥＮ　Ｂ」というルールの評価尺度μ（Ａ→Ｂ）を次式のように定義する。
μ（Ａ→Ｂ）＝Ｐ（Ａ）＾β　×　ｌｏｇ［Ｐ（Ｂ｜Ａ）／Ｐ（Ｂ）］
ここで、Ｐ（Ａ）＾β　は　Ｐ（Ａ）のβ乗を意味する。Ｐ（Ａ）は対象データの中で条件部Ａが満足される確率、すなわち、対象データ全体の中で、Ａという条件を満たすサンプルの割合を表す。同様に、Ｐ（Ｂ）は結論部Ｂが満足される確率、Ｐ（Ｂ｜Ａ）は、Ａを満たすという条件の下で結論部Ｂが満足される条件付確率を表す。βは使用者が指定するパラメータで、０以上、１以下の実数値である。この評価尺度によって与えられる評価値が大きいルールほど、対象データの特徴をよく表していると見なす。また、上記の評価尺度の定義式におけるＰ（Ａ）をカバー率、Ｐ（Ｂ｜Ａ）をヒット率と呼び、これらは、取り出されたルールを使用者が解釈する際の手がかりとして用いられることがある。
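The evaluation measure μ(A→B) = P(A)^β × log[P(B|A)/P(B)] defined above can be sketched in Python as follows. The base of the logarithm is not stated in the text, so the natural logarithm is an assumption here. The function name `evaluate_rule` and the toy samples are hypothetical.

```python
import math


def evaluate_rule(samples, condition, conclusion, beta=0.5):
    """Return (score, coverage, hit_rate) for the rule IF condition THEN conclusion."""
    total = len(samples)
    covered = [s for s in samples if condition(s)]
    if total == 0 or not covered:
        return None  # P(A) = 0: the rule never fires on this data
    coverage = len(covered) / total                                      # P(A)
    prior = sum(1 for s in samples if conclusion(s)) / total             # P(B)
    hit_rate = sum(1 for s in covered if conclusion(s)) / len(covered)   # P(B|A)
    if hit_rate == 0.0:
        return (-math.inf, coverage, hit_rate)  # log(0) treated as minus infinity
    return (coverage ** beta * math.log(hit_rate / prior), coverage, hit_rate)


# Toy data: (discretized expression level of one gene, class label).
samples = [
    ("large", 0), ("large", 0), ("medium", 0), ("small", 1),
    ("medium", 1), ("small", 1), ("medium", 0), ("small", 1),
]
score, coverage, hit_rate = evaluate_rule(
    samples,
    condition=lambda s: s[0] == "large",
    conclusion=lambda s: s[1] == 0,
)
print(score, coverage, hit_rate)
```

A larger β rewards rules that cover many samples, while β close to 0 lets the hit-rate term dominate.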
【００４２】
生成され得る多数のＩＦ−ＴＨＥＮルールの中から、評価値の大きい、比較的少数のルールを取り出すアルゴリズムはいくつか考えられるが、「総当たり法」は、もっとも単純な方法の１つである。これは、取り出すルール数（例えば、１０）をあらかじめ定めておき、そして、条件部に同時に現れる述語数の上限（例えば、３）を定め、その範囲内で可能な全てのＩＦ−ＴＨＥＮルールを生成、評価し、その中で評価値の大きい上位のルールを、あらかじめ定めた数だけ取り出すというものである。
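The brute-force search described above might look like the following Python sketch. It assumes samples are dictionaries of discrete attribute values, uses the natural logarithm in the evaluation measure, and skips rules that never fire or never hit. All of these choices, and every name in the code, are illustrative assumptions and not the implementation of the cited method.

```python
import itertools
import math


def brute_force_rules(samples, condition_items, conclusion_item, conclusion_value,
                      max_predicates=3, top_n=10, beta=0.5):
    """Enumerate every IF-THEN rule within the limits and keep the best-scoring ones."""
    total = len(samples)
    prior = sum(s[conclusion_item] == conclusion_value for s in samples) / total
    if prior == 0:
        raise ValueError("no sample satisfies the conclusion")
    predicates = sorted({(item, s[item]) for s in samples for item in condition_items})
    scored = []
    for size in range(1, max_predicates + 1):
        for combo in itertools.combinations(predicates, size):
            if len({item for item, _ in combo}) < size:
                continue  # a rule may use each attribute at most once
            covered = [s for s in samples if all(s[i] == v for i, v in combo)]
            hits = sum(s[conclusion_item] == conclusion_value for s in covered)
            if hits == 0:
                continue  # rule never fires or never hits
            coverage = len(covered) / total
            score = coverage ** beta * math.log((hits / len(covered)) / prior)
            scored.append((score, combo))
    # Stable sort: rules with equal scores keep their enumeration order.
    scored.sort(key=lambda entry: -entry[0])
    return scored[:top_n]


toy_samples = [
    {"geneA": "high", "geneB": "low", "class": 0},
    {"geneA": "high", "geneB": "high", "class": 0},
    {"geneA": "low", "geneB": "low", "class": 1},
    {"geneA": "low", "geneB": "high", "class": 1},
]
rules = brute_force_rules(toy_samples, ["geneA", "geneB"], "class", 0, max_predicates=2)
best_score, best_condition = rules[0]
print(best_condition, best_score)
```

Because the sort is stable, rules with identical scores stay in enumeration order, so their relative order carries no analytical meaning.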
【００４３】
特徴ルール法を、クラス分類に関する遺伝子群の抽出に用いるには、対象データにおいて、クラス分類に関する属性をルールの結論項目とし、遺伝子アノテーションに該当する属性を条件項目とする。これにより、クラス分類に関して重要な遺伝子を条件部に持つＩＦ−ＴＨＥＮルールを得る。
【００４４】
データマイニングの結果、クラス分類において重要な（クラス分類を行う規則に関連する）遺伝子群を抽出するが、一般に正解率が高い順（エラー率の低い順）から、１つ以上で１０個から２００個、好ましくは２０個から５０個の遺伝子を選択する。この重要遺伝子群抽出には前述の特徴ルール法が最も優れている。
【００４５】
特徴ルール法以外の例として、例えばＢＳＳ／ＷＳＳ比を用いる方法がある。なおＢＳＳはＢｅｔｗｅｅｎ−ｇｒｏｕｐ　ｓｕｍ　ｏｆ　ｓｑｕａｒｅｓ、ＷＳＳはＷｉｔｈｉｎ−ｇｒｏｕｐ　ｓｕｍ　ｏｆ　ｓｑｕａｒｅｓの略であり、遺伝子ｊのＢＳＳ／ＷＳＳ比は下記の式で定義される。
【００４６】
【式１２】

但し、ｍ（ｘ．ｊ）は全検体における遺伝子ｊの平均発現量、ｍ（ｘｋｊ）は、クラスｋに属する検体における遺伝子ｊの平均発現量、ｘｉｊは遺伝子発現データマトリクス、Ｉ（ｙｉ　＝　ｋ）は、ｙｉ　＝　ｋのとき１、それ以外は０となる関数である。式１２で表されるＢＳＳ／ＷＳＳ比が大きいほど、クラス内部の相違と比較してクラス間の相違が大きいので、クラシフィケーションの際にはＢＳＳ／ＷＳＳ比が大きい遺伝子、例えば上位２０から５０個を用いることが望ましい。
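Equation 12 is not reproduced in this text. The sketch below therefore assumes the usual form implied by the definitions above: the sum over samples of the squared deviation of each sample's class mean from the overall mean, divided by the sum over samples of the squared deviation of each value from its class mean. The function names and the toy data are hypothetical.

```python
import math


def bss_wss_ratio(values, labels):
    """Between-group over within-group sum of squares for one gene."""
    overall_mean = sum(values) / len(values)
    bss = 0.0
    wss = 0.0
    for k in sorted(set(labels)):
        members = [v for v, y in zip(values, labels) if y == k]
        class_mean = sum(members) / len(members)
        bss += len(members) * (class_mean - overall_mean) ** 2
        wss += sum((v - class_mean) ** 2 for v in members)
    return math.inf if wss == 0 else bss / wss


def rank_genes(expression, labels, top_n):
    """Return the names of the top_n genes by BSS/WSS ratio (largest first)."""
    ratios = {gene: bss_wss_ratio(values, labels) for gene, values in expression.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]


labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene1": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # separates the two classes well
    "gene2": [1.0, 8.0, 3.0, 7.0, 2.0, 9.0],  # barely separates them
}
ratio1 = bss_wss_ratio(expression["gene1"], labels)
ratio2 = bss_wss_ratio(expression["gene2"], labels)
print(ratio1, ratio2, rank_genes(expression, labels, top_n=1))
```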
【００４７】
次にデータマイニングにより抽出されたクラス分類において重要な遺伝子群についてアノテーション付けを行う。アノテーションには図３から図１３までの様々な相互作用のうちいずれか１つ以上を行う。表３には例として遺伝子オントロジーを用いてアノテーションを行った例を示す。データマイニングの結果抽出された遺伝子群がＰｒｏｂｅ１からＰｒｏｂｅ５であった場合、もともとＤＮＡチップ設計時に既知である遺伝子クラスター番号（Ｕｎｉｇｅｎｅ番号）、塩基配列番号（Ｇｅｎｂａｎｋ番号）、遺伝子名に加え、遺伝子アノテーション工程において、染色体番号や遺伝子オントロジーを公共データベース等から検索し、表３を作成することができる。
【００４８】
【表３】

続いて遺伝子アノテーションから共通性や規則性を抽出する工程を行う。表３の遺伝子群はあるクラス分類において重要であることがデータマイニングにより分かった遺伝子群であるとする。この表３の遺伝子群から共通した性質や特徴を抽出することが、遺伝子アノテーションから共通規則を抽出する工程の目的である。例えば、表３の遺伝子オントロジーに対して、複数の遺伝子において共通に見られるオントロジーがないかを検索する。するとＰｒｏｂｅ１とＰｒｏｂｅ４で、ｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ　ｐｒｏｔｅｉｎが共通して見られる。そこで表３からは、蛋白質として膜（ｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ）に存在する遺伝子群がクラス分類に関わっているという規則が潜んでいる可能性を見出すことができる。次のデータマイニングでは、例えば、ＤＮＡチップ上に搭載されているプローブに対し、ｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ　ｐｒｏｔｅｉｎ（ＧＯ番号：００１６０２１）に対応するプローブという拘束条件を設けて、それに該当する遺伝子群のみでデータマイニング・クロスバリデーションすることで、ｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ　ｐｒｏｔｅｉｎがどの程度、クラス分類に寄与するかを定量的に把握することができる。またｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ　ｐｒｏｔｅｉｎ（ＧＯ番号：００１６０２１）の上階層は、ｍｅｍｂｒａｎｅ　（ＧＯ番号：００１６０２０）である。そこで別のデータマイニングでは、ＤＮＡチップ上に搭載されているプローブに対し、ｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ　ｐｒｏｔｅｉｎ（ＧＯ番号：００１６０２１）もしくはｍｅｍｂｒａｎｅ　（ＧＯ番号：００１６０２０）に対応するプローブという拘束条件を設けて、それに該当する遺伝子群のみでデータマイニング・クロスバリデーションしてもよい。仮に“ｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ　ｐｒｏｔｅｉｎを含む”を拘束条件としてマイニングを行った場合と“ｍｅｍｂｒａｎｅを含む”を拘束条件としてマイニングを行った場合とでクロスバリデーション結果（正解率もしくはエラー率）を比較した場合に、前者の正解率が後者より良ければクラス分類には、ｍｅｍｂｒａｎｅでなく、ｉｎｔｅｇｒａｌ　ｍｅｍｂｒａｎｅ　ｐｒｏｔｅｉｎが重要であることが分かる。このように遺伝子アノテーションから共通規則を抽出する工程とは、前記のオントロジーのように、図３から図１３に示した相互関係を、表３のようなクラス分類の際に重要な遺伝子リストから検索する工程である。規則性抽出において、表４から表６に示したルールに従うことでより一般性の高い規則を見出すことができる。この見出された規則を拘束条件として、この拘束条件のもとにデータマイニングを行う工程が、次の、「前記共通規則に基づく拘束条件をもちいたデータマイニングを行う工程」である。この工程では、ｎ次マイニングより得られた重要遺伝子群とそのアノテーションを用いて（ｎ＋１）次マイニングを行う。このようにマイニングを繰り返す意義は、ルール生成で得られた遺伝子群に、アノテーションによって情報を付加し、これらの情報を各遺伝子の属性と考え、共通する特徴をさらに解析する。更にこの解析サイクルを繰り返すことで共通する特徴を徹底して探索することができることである。なお、規則の発明に際し、リガンド−レセプタ関係についてはＴ．Ｇ．Ｇｒａｅｂｅｒ　ａｎｄ　Ｄ．　Ｅｉｓｅｎｂｅｒｇ、Ｂｉｏｉｎｆｏｒｍａｔｉｃ　ｉｄｅｎｔｉｆｉｃａｔｉｏｎ　ｏｆ　ｐｏｔｅｎｔｉａｌ　ａｕｔｏｃｒｉｎｅ　ｓｉｇｎａｌｉｎｇ　ｌｏｏｐｓ　ｉｎ　ｃａｎｃｅｒｓ　ｆｒｏｍ　ｇｅｎｅ　ｅｘｐｒｅｓｓｉｏｎ　ｐｒｏｆｉｌｅｓ．　Ｎａｔｕｒｅ　Ｇｅｎｅｔｉｃｓ　２９，　ｐｐ．２９５−３００　（２００１）　を、蛋白質相互作用関係についてはＨ．　Ｇｅら、Ｃｏｒｒｅｌａｔｉｏｎ　ｂｅｔｗｅｅｎ　ｔｒａｎｓｃｒｉｐｔｏｍｅ　ａｎｄ　ｉｎｔｅｒａｃｔｏｍｅ　ｍａｐｐｉｎｇ　ｄａｔａ　ｆｒｏｍＳａｃｃｈａｒｏｍｙｃｅｓ　ｃｅｒｅｖｉｓｉａｅ．　Ｎａｔｕｒｅ　Ｇｅｎｅｔｉｃｓ　２９，　ｐｐ．４８２−４８６　（２００１）をそれぞれ参考にした。但し、前述の２件の公知例はどちらも、最初の１回目のマイニングに際し入力する発現マトリクスを改変する方法を提示しているのみである。本願明細書に記載されたように前回のマイニング結果を元に、更にマイニングを連続的に反復実行することはない。これでは一般性・頑強性（ロバストネス）に優れた知識獲得を行うことは困難である。本発明は、一般性・頑強性に優れた知識獲得を自動で行える点で公知技術とは根本的に異なっている。
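The ontology-based constraint described above can be sketched as a simple probe filter in Python. The two GO identifiers and their parent-child relation come from the text. The per-probe annotations are invented for illustration, except that Probe1 and Probe4 share the integral membrane protein term as stated above. All function names are hypothetical.

```python
# Upper level of each term, as stated in the text:
# integral membrane protein (GO:0016021) sits under membrane (GO:0016020).
PARENT_OF = {"GO:0016021": "GO:0016020"}


def with_upper_levels(term, parent_of):
    """Return the term followed by every upper-level term reachable from it."""
    chain = [term]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def probes_matching(probe_terms, allowed_terms):
    """Keep the probes annotated with at least one of the allowed terms."""
    allowed = set(allowed_terms)
    return [probe for probe, terms in probe_terms.items() if allowed & set(terms)]


# Hypothetical annotations for the probes on the chip.
probe_terms = {
    "Probe1": ["GO:0016021"],
    "Probe2": ["GO:0005634"],
    "Probe3": ["GO:0016020"],
    "Probe4": ["GO:0016021"],
}

narrow = probes_matching(probe_terms, ["GO:0016021"])
broad = probes_matching(probe_terms, with_upper_levels("GO:0016021", PARENT_OF))
print(narrow, broad)
```

Running the next round of mining and cross-validation once on `narrow` and once on `broad` gives the comparison described in the text between the specific term and its upper level.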
【００４９】
【表４】

【表５】

【表６】

表４から表６について説明する。表４から表６において遺伝子間の相互関係が二項関係でありかつリガンド−レセプターの関係であった場合、クラス分類関連抽出遺伝子がリガンド遺伝子であれば、アノテーションから抽出された共通性や規則性に基づく拘束条件をレセプター遺伝子もしくはリガンド遺伝子とレセプター遺伝子の組を含む条件とし、クラス分類関連抽出遺伝子がレセプター遺伝子であれば、前記拘束条件をリガンド遺伝子もしくはリガンド遺伝子とレセプター遺伝子の組を含む条件とする。
【００５０】
表４から表６において遺伝子間の相互関係がパスウェイであった場合、クラス分類関連抽出遺伝子がパスウェイＰＡ上にあれば、アノテーションから抽出された共通性や規則性に基づく拘束条件をパスウェイＰＡ上の上流遺伝子、下流遺伝子、パスウェイＰＡと相関するパスウェイＰＢ上の遺伝子、もしくは前記いずれかの遺伝子の組を含む条件とする。
【００５１】
表４から表６において遺伝子間の相互関係がゲノムであった場合、クラス分類関連抽出遺伝子が染色体ＣＡ上にあれば、アノテーションから抽出された共通性や規則性に基づく拘束条件を染色体ＣＡ上の隣接遺伝子、もしくは前記の抽出された遺伝子と前記の隣接遺伝子との組を含む条件とする。
【００５２】
表４から表６において遺伝子間の相互関係が階層構造でかつオントロジーであった場合、クラス分類関連抽出遺伝子のオントロジーがＯＡであれば、アノテーションから抽出された共通性や規則性に基づく拘束条件をオントロジーＯＡの上階層のオントロジーを有する遺伝子、もしくは前記の抽出された遺伝子と前記の上階層のオントロジーを有する遺伝子との組を含む条件とする。
【００５３】
表４から表６において遺伝子間の相互関係が階層構造でかつ酵素（ＥＣ）であった場合、クラス分類関連抽出遺伝子のＥＣ番号がＥＣＡであれば、アノテーションから抽出された共通性や規則性に基づく拘束条件を酵素ＥＣＡの上階層に属する遺伝子、酵素ＥＣＡと同一グループに属する遺伝子もしくは前記いずれかの遺伝子の組を含む条件とする。
【００５４】
表４から表６において遺伝子間の相互関係が階層構造でかつスーパーファミリーであった場合、クラス分類関連抽出遺伝子のスーパーファミリーがＳＦＡであれば、アノテーションから抽出された共通性や規則性に基づく拘束条件を同一スーパーファミリーＳＦＡに属する遺伝子、もしくは前記の抽出された遺伝子と前記の同一スーパーファミリーの属する遺伝子との組を含む条件とする。
【００５５】
表４から表６において遺伝子間の相互関係がネットワークでかつ文献情報であった場合、アノテーションから抽出された共通性や規則性に基づく拘束条件とは、文献情報により、クラス分類関連抽出遺伝子との関連が予想される遺伝子、もしくはクラス分類関連抽出遺伝子と前記の文献情報から関連が予想される遺伝子との組を含む条件とする。
【００５６】
表４から表６において遺伝子間の相互関係がネットワークでかつ蛋白質相互作用であった場合、アノテーションから抽出された共通性や規則性に基づく拘束条件とは、蛋白質相互作用により、クラス分類関連抽出遺伝子との関連が予想される遺伝子、もしくはクラス分類関連抽出遺伝子と前記の蛋白質相互作用から関連が予想される遺伝子との組を含む条件とする。
【００５７】
表４から表６において遺伝子間の相互関係がネットワークでかつ代謝経路情報であった場合、アノテーションから抽出された共通性や規則性に基づく拘束条件とは、代謝経路情報により、クラス分類関連抽出遺伝子との関連が予想される遺伝子、もしくはクラス分類関連抽出遺伝子と前記の代謝経路情報から関連が予想される遺伝子との組を含む条件とする。
本発明において、図１４、図１５、表４から６に開示した方法を用いて反復マイニングを行うことで、より一般性や頑強性（ロバストネス）が高いメタ知識（ｍｅｔａ−ｋｎｏｗｌｅｄｇｅ）を自動で得ることができる。
【００５８】
【発明の実施の形態】
公開データである急性白血病のデータセット（Ｇｏｌｕｂら１９９９年）を使用し、具体的な発明の実施形態を示す。データファイルはＭＩＴのサイトからダウンロードした（ｈｔｔｐ：／／ｗｗｗ．ｇｅｎｏｍｅ．ｗｉ．ｍｉｔ．ｅｄｕ／ＭＰＲ）。このデータセットは、２種類の急性白血病：ＡＬＬ（ａｃｕｔｅ　ｌｙｍｐｈｏｂｌａｓｔｉｃ　ｌｅｕｋｅｍｉａ）、ＡＭＬ（ａｃｕｔｅ　ｍｙｅｌｏｉｄ　ｌｅｕｋｅｍｉａ）患者から血液を採取し、Ａｆｆｙｍｅｔｒｉｘヒトチップ（６８１７遺伝子）を用いて発現分布を測定したものである。被験者数は、合計７２人（内、３８人がＢ−ｃｅｌｌ　ＡＬＬ、９人がＴ−ｃｅｌｌ　ＡＬＬ、２５人がＡＭＬ）である。このデータセット（７２×６８１７）を用いて、発現データのみから２種類の白血病をクラシフィケーション（分類）することができるかを示す。以下、２種類の白血病（ＡＬＬ，　ＡＭＬ）をそれぞれ、クラス０，１と定義する。すなわち、ＡＬＬ＝クラス０、ＡＭＬ＝クラス１とする。なお急性白血病のデータセット（Ｇｏｌｕｂら１９９９年）は、７２検体データを３８のトレーニングセット（ｄａｔａ＿ｓｅｔ＿ＡＬＬ＿ＡＭＬ＿ｔｒａｉｎ．ｔｘｔ）と、３４のテストセット（ｄａｔａ＿ｓｅｔ＿ＡＬＬ＿ＡＭＬ＿ｉｎｄｅｐｅｎｄｅｎｔ．ｔｘｔ）に分けて提供されている。そのため、ここでは３８検体（トレーニングセット）を用いて学習をおこない、その結果を用いて３４検体（テストセット）のクラシフィケーションを試みた。この分類の結果と、予め分かっている各検体のクラスとを比較することで正解率とエラー率を求めることができる。
【００５９】
以後特徴ルール法を用いてデータマイニングした例を示す。６８１８属性、７２サンプルのデータセットを対象データとした。このうちの６８１７属性は、ＤＮＡチップで測定した遺伝子１つ１つの発現量に対応し、残りの１属性は、白血病の種類（ＡＬＬ、ＡＭＬ）に対応している。遺伝子発現量に対応する属性値は実数値データであるため、それぞれの属性項目ごとに「大」「中」「小」の３カテゴリに離散化した。この時、７２個のサンプルが各カテゴリにほぼ均等に振り分けられるようにそれぞれの属性項目のカテゴリの境界を設定した。また、白血病の種類に対応する属性値はもともと２種類の離散値（クラス０、クラス１）なので、そのまま用いた。
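The discretization step described above, in which each gene's real-valued expression levels are split into three categories of roughly equal size, can be sketched in Python as follows. Splitting by rank, so that tied values may fall into different categories, is an assumption of this sketch. The function name and the category labels are illustrative.

```python
def discretize_equal_frequency(values, categories=("small", "medium", "large")):
    """Assign each value to a category so that the categories hold nearly equal counts."""
    count = len(values)
    # Indices of the samples ordered from the lowest to the highest value.
    order = sorted(range(count), key=lambda i: values[i])
    assigned = [None] * count
    for rank, index in enumerate(order):
        assigned[index] = categories[rank * len(categories) // count]
    return assigned


# Hypothetical expression levels of one gene in six samples.
levels = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0]
discretized = discretize_equal_frequency(levels)
print(discretized)
```

The category boundaries are computed separately for each gene, which matches the per-attribute boundaries described in the text.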
【００６０】
遺伝子に対応する６８１７属性をＩＦ−ＴＨＥＮルールの条件項目、白血病の種類に対応する属性を結論項目、「クラス０」を結論項目値とし、また、条件部に現れる述語数の上限は１として、評価値の上位２０個のルールを取り出した結果を表７に示す。表７の各行が１つのＩＦ−ＴＨＥＮルールに対応しており、評価値の大きい順に上から並んでいる。表７の第１列は、ＩＦ−ＴＨＥＮルールの条件部に対応し、第２列、第３列、第４列はそれぞれ、ルールの評価値、ヒット率、カバー率に対応する。なお、ルールの結論部は全てのルールで「白血病の種類　＝　クラス０」で同一なので、表中では省略した。例えば、第１行は「ＩＦ　Ｕ０７１３９＿ａｔ　＝　大　ＴＨＥＮ　白血病の種類　＝　クラス０」というルールであり、「Ｕ０７１３９＿ａｔ」という遺伝子の発現量が「大」であるならば、白血病の種類はクラス０（ＡＬＬ）となる傾向が大きいということを意味している。
【００６１】
ところで、第１、第２のルールの評価値はともに　０．１４８　で同じ、第３のルール以降は全て評価値　０．１４２　で同じである。このような評価値の同じルール同士の並び順には分析上の意味はなく、データファイルの中の属性項目の並び順に依存している。
【００６２】
【表７】

【発明の効果】
本発明により、遺伝子発現データに基づく、知識探索を行うことができる。遺伝子等同士の相互作用アノテーションに基づく繰り返し探索を行うことで、より一般性や頑強性（ロバストネス）が高いメタ知識（ｍｅｔａ−ｋｎｏｗｌｅｄｇｅ）を自動で得ることができる。
【図面の簡単な説明】
【図１】本発明の実施形態のソフトウェアを実行するために利用されるコンピュータシステムとその操作画面例。
【図２】ＤＮＡチップの概要図。
【図３】ゲノム、トランスクリプトーム、プロテオームにおける相互作用の例。
【図４】二項関係の一例であるリガンド−レセプタ関係の例。
【図５】パスウェイ関係の例。
【図６】ゲノム関係の例。
【図７】階層構造の一例である遺伝子オントロジー関係の例。
【図８】階層構造の一例である酵素（ＥＣ：Ｅｎｚｙｍｅ　Ｃｏｍｍｉｓｓｉｏｎ）関係の例。
【図９】階層構造の一例であるスーパーファミリー関係の例。
【図１０】ネットワークの一例である文献情報の例。
【図１１】ネットワークの一例である遺伝子相互作用の例。
【図１２】ネットワークの一例である代謝経路の例。
【図１３】生物情報の階層構造の例。
【図１４】クロスバリデーション値と既定値とを比較することで繰り返し回数を決定する、本発明を実現するためのコンピュータ化された方法を示す模式図である。
【図１５】全てのクロスバリデーション値を保存して比較する、本発明を実現するためのコンピュータ化された方法を示す模式図である。
【図１６】ＤＮＡチップを用いた測定法のフローチャートの一例である。
【図１７】発現マトリクスの模式図である。
【符号の説明】
１．コンピュータシステム、２．ＣＰＵ、３．入力装置（マウス）、４．入力ファイル名・形式選択、５．フィルタリング機能オンオフボタン、６．フィルタリングアルゴリズム選択、７．マイニング手法選択、８．アノテーション内容、９．アノテーション内容選択、１０．イタレーション条件選択、２１．蛍光検出器、２２．ＤＮＡプローブ、２３．蛍光標識された遺伝子、２４．支持体、４１．リガンド、４２．レセプター、５１．遺伝子、５２．遺伝子間パスウェイ、６１．染色体マップ、６２．遺伝子マップ、６３．遺伝子名、７１．遺伝子オントロジー（ＤＮＡ　ｒｅｐａｉｒ）、７２．遺伝子オントロジー（ｐｒｏｔｅｉｎ　ａｍｉｎｏ　ａｃｉｄ　ＡＤＰ−ｒｉｂｏｓｙｌａｔｉｏｎ）、８１．酵素（ＥＣ）、９１．遺伝子スーパーファミリー、１０１．遺伝子、１０２．相互関係スコア、１１１．蛋白質、１１２．蛋白質間相互作用、１２１．酵素、１２２．代謝パスウェイ、１２３．酵素反応生成物、１７１．サンプルアノテーション、１７２．遺伝子アノテーション、１７３．発現レベル。
[0001]
[Technical field of the invention]
The present invention relates to a method and an apparatus for searching for knowledge based on gene expression data (also called gene expression profile) using a DNA microarray or the like.
[0002]
[Prior art]
Research has been conducted to clarify gene functions by analyzing gene expression profiles and to obtain findings useful for drug discovery, pharmacology, toxicology, and diagnosis. For example, DNA chip data have been analyzed with statistical methods such as correlation analysis, principal component analysis, and analysis of variance; with clustering methods such as k-means clustering, hierarchical clustering, and self-organizing maps; and with classification algorithms such as the nearest neighbor method, discriminant analysis, support vector machines, neural networks, and genetic algorithms. See Laura J. van't Veer et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, pp. 530-536 (2002), and Scott L. Pomeroy et al., Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, pp. 436-442 (2002).
[0003]
[Problems to be solved by the invention]
However, a means of acquiring knowledge by simultaneously analyzing the expression of a large number of genes has not yet been established. Knowledge acquisition in particular depends heavily on the analyst's own knowledge, so the knowledge obtained differs from analyst to analyst. In addition, the amount of data obtained from a DNA chip is enormous and exceeds human analytical capacity in the first place.
[0004]
It is an object of the present invention to provide a method and apparatus for analyzing a set of gene expression profiles. More specifically, an object of the present invention is to provide a technique for extracting a group of genes useful for drug discovery, pharmacology, toxicology, and diagnosis from a set of gene expression profiles, and for extracting a rule common to the group of useful genes.
[0005]
[Means for Solving the Problems]
To extract useful gene groups from a set of gene expression profiles and to extract rules common to those gene groups, the following steps are performed: (1) receiving gene expression data; (2) receiving class information; (3) extracting a group of genes related to the class classification using a data mining technique; (4) annotating the gene group; (5) extracting a common rule of the gene group related to the class classification based on the gene annotation; and (6) performing data mining using a constraint condition based on the common rule. It is important to repeat steps (3) to (6). Conventional information systems performed only steps (1) to (4): they merely presented the genes related to the class classification and their annotations to the analyst, for example as a list or through a graphical user interface. The subsequent knowledge acquisition step relied on the analyst's knowledge and intuition. However, the level of knowledge differs from analyst to analyst, and the amount of information obtained from DNA chip gene expression profile data is so large that the analysis takes a great deal of time. An information system that assists the analyst is therefore also needed for knowledge acquisition.
[0006]
Data mining in fields other than gene expression analysis generally deals with data in which the number of instances >> the number of attributes. Taking the customer data of a department store as an example, the number of instances (customers) is in the thousands to tens of thousands, while the number of attributes (sex, age, annual income, etc.) is at most about one hundred, so the above relationship holds. For DNA chips, in contrast, the number of attributes (genes: thousands to tens of thousands) is orders of magnitude larger than the number of instances (samples: a few to a few dozen). For this type of problem it is known that "fairly accurate predictions can be made for existing samples without using advanced mining techniques," whereas "prediction often fails for new samples." The reason is that if the data cannot be explained with, say, 10 genes, they can eventually be explained by increasing the number of genes to 100 or 200. It is also known that the reliability of attribute values is low because experimental error is difficult to remove in advance, and that even a few inaccurate values can change the mining result considerably (the result is not robust). Preprocessing such as selecting appropriate data (data selection) and removing inaccurate values that adversely affect the analysis (data cleansing) is therefore important. At present there is no established method for data selection or data cleansing, but these steps are considered to decisively affect the success or failure of the subsequent knowledge discovery. In short, gene expression data such as DNA chip data have three properties: (1) the number of attributes >> the number of instances; (2) the reliability of attribute values is low; and (3) because the analysis is not robust to inaccurate values, data selection and data cleansing are extremely important. Hereinafter, in this specification, data selection and data cleansing are collectively referred to as filtering.
[0007]
Gene expression data are the expression levels of a plurality of genes (or proteins), or the ratios between such expression levels, obtained by a method that measures expression changes of multiple genes or proteins, such as the DNA chip method, the differential display method, the quantitative PCR method, the SAGE (Serial Analysis of Gene Expression) method, or the protein chip method.
[0008]
Class information is information for classifying the objects measured by, for example, the DNA chip method. For example, when the sample to be measured is a subject's blood, whether the blood comes from a patient or from a healthy person is encoded as 1 for a patient and 0 for a healthy person. Likewise, the malignancy grade of a disease such as cancer can be defined as 0, 1, 2, 3, and so on, based on pathological findings. For cultured cells, a sample taken before drug administration can be defined as 0, a sample 6 hours after administration as 1, a sample 12 hours after administration as 2, a sample 24 hours after administration as 3, and so on. Samples of the same class are assumed to share the same experimental conditions, properties, and phenotype, even when they come from different persons or individual organisms.
[0009]
Data mining is a technique for automatically extracting interesting regularities and causal relationships from a database by computer. Examples include decision trees, naive Bayes, fully Bayesian methods, association rules, characteristic rules, EM clustering, the nearest neighbor method, discriminant analysis, support vector machines, genetic algorithms, and linear regression. Techniques such as bagging, boosting, and stacking may also be used in combination with these methods.
[0010]
Annotation means attaching explanatory notes to a gene. Examples include base sequences, gene function information, disease-related information, corresponding IDs in public databases, homolog information between genes of different species (for example, human and mouse), gene network information, and pathway information.
[0011]
Normalization is the operation of equalizing the background and noise levels across experiments, which differ because the experiments were performed at different times, in different places, or by different operators. For a DNA chip, the image brightness (fluorescence intensity) may differ from chip to chip. Normalization therefore involves, for example, first obtaining the background brightness, such as the brightness of regions where no probe is present, and then subtracting the mean or median of that background brightness from the probe fluorescence intensity.
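A minimal Python sketch of the background subtraction just described, assuming one background estimate per chip. Clamping negative results to zero is an assumption of this sketch and is not stated in the text. The function name and the intensities are hypothetical.

```python
from statistics import mean, median


def subtract_background(probe_intensities, background_intensities, use_median=True):
    """Subtract a chip-wide background estimate from every probe intensity."""
    if use_median:
        estimate = median(background_intensities)
    else:
        estimate = mean(background_intensities)
    # Clamp at zero so probes dimmer than the background do not become negative.
    return [max(value - estimate, 0.0) for value in probe_intensities]


probes = [820.0, 1500.0, 90.0]
background = [100.0, 120.0, 110.0]  # brightness measured where no probe is present
corrected = subtract_background(probes, background)
print(corrected)
```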
[0012]
Filtering is an operation of selecting appropriate data and removing error values that adversely affect the analysis. One example is the threshold method. Specifically, when small signal intensities (for example, 500 or less) are to be excluded, the threshold is set to 500, and every signal intensity of 500 or less is replaced with 500 or with zero.
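A minimal sketch of the threshold method, assuming the threshold of 500 used in the example above. The function name `apply_threshold` and the flag `floor_to_zero` are hypothetical names introduced for illustration.

```python
def apply_threshold(signals, threshold=500.0, floor_to_zero=False):
    """Replace every signal at or below the threshold.

    Weak signals are replaced with the threshold itself, or with zero
    when floor_to_zero is True; stronger signals are left unchanged.
    """
    replacement = 0.0 if floor_to_zero else threshold
    return [replacement if s <= threshold else s for s in signals]

# Hypothetical signal intensities.
signals = [120.0, 500.0, 501.0, 2300.0]
clamped = apply_threshold(signals)
zeroed = apply_threshold(signals, floor_to_zero=True)
```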
[0013]
Cross-validation is, for example, tenfold cross-validation or leave-one-out cross-validation. In tenfold cross-validation, the data set is randomly divided into 10 equal parts, a model is trained on 9/10 of the data and tested on the remaining 1/10, and the correct answer rate (or error rate) is calculated from the results of the 10 trials in total. In leave-one-out cross-validation, a model is trained on n-1 of the n data items and tested on the remaining one, and the correct answer rate (or error rate) is calculated from the results of the n trials in total. Cross-validation thus makes it possible to compare the correct answer rate (or error rate) of each mining method on a limited number of data sets and to evaluate quantitatively which mining method is superior.
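The two cross-validation schemes can be sketched with one routine, since leave-one-out is the special case in which the number of folds equals the number of samples. The sketch below uses a simple nearest neighbor classifier as the mining method; the function names, the toy data, and the fixed random seed are assumptions made for illustration.

```python
import random

def nearest_neighbor_predict(train_x, train_y, sample):
    """Return the class of the training sample closest to `sample`."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: squared_distance(train_x[i], sample))
    return train_y[best]

def cross_validate(xs, ys, n_folds, seed=0):
    """Correct answer rate over n_folds folds (n_folds == len(xs) is leave-one-out)."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)          # random division into folds
    folds = [order[k::n_folds] for k in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            if nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]:
                correct += 1
    return correct / len(xs)

# Toy data: two well-separated classes, two genes per sample.
xs = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [2.1, 1.9], [1.8, 2.2], [2.0, 2.0]]
ys = [0, 0, 0, 1, 1, 1]
loo_rate = cross_validate(xs, ys, n_folds=len(xs))   # leave-one-out
three_fold_rate = cross_validate(xs, ys, n_folds=3)
```

Running the same routine with different classifiers in place of `nearest_neighbor_predict` gives the per-method correct answer rates that the comparison above calls for.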
[0014]
FIG. 1 shows an example of a computer system that can be used to execute the software of the present invention, together with an interface showing the entire process. First, a file format and a file name are entered in order to read the file in which the gene expression data and class information are stored. Next, normalization and filtering methods are selected, and data mining is performed. When the group of genes extracted by data mining is annotated, the user selects which annotation is to be performed. Cross-validation is then carried out on the gene group extracted by the data mining, and if the correct answer rate is equal to or greater than a threshold value (α; 0.95 in FIG. 1), the process is terminated. Otherwise, the next mining is performed under constraint conditions that reflect the annotation result. This mining can be repeated automatically until the correct answer rate reaches the threshold.
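The overall control flow can be sketched as a loop that stops once the correct answer rate reaches α. In this sketch the four processing steps are passed in as functions; all names (`iterative_mining`, `mine`, `annotate`, `extract_rule`, `cross_validate`), the `max_rounds` safety limit, and the stub values are hypothetical and stand in for the actual processing.

```python
def iterative_mining(mine, annotate, extract_rule, cross_validate,
                     alpha=0.95, max_rounds=10):
    """Repeat mining under annotation-derived constraints until the rate reaches alpha."""
    constraint = None            # the first mining round is unconstrained
    history = []
    for round_no in range(1, max_rounds + 1):
        genes = mine(constraint)
        rate = cross_validate(genes)
        history.append((round_no, constraint, rate))
        if rate >= alpha:
            break                # correct answer rate has reached the threshold
        # Derive the constraint for the next round from the annotation.
        constraint = extract_rule(annotate(genes))
    return history

# Stub steps with made-up correct answer rates, only to exercise the loop.
rates = iter([0.80, 0.90, 0.97])
history = iterative_mining(
    mine=lambda c: ["g1", "g2"] if c is None else ["g1"],
    annotate=lambda genes: {g: "membrane" for g in genes},
    extract_rule=lambda annotation: "membrane",
    cross_validate=lambda genes: next(rates),
)
```

Keeping the full `history` rather than only the last result corresponds to storing all cross-validation values so that rounds can be compared afterwards.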
[0015]
FIG. 2 is a diagram showing a general structure of a DNA chip. FIG. 16 shows a flowchart of a measurement method using a DNA chip. First, the DNA probe 22 is immobilized on the support 24. Subsequently, the gene fragment extracted from the sample to be measured is labeled with a fluorescent label or the like. The fluorescently labeled gene 23 is hybridized with the DNA probe 22. After that, the fluorescence derived from the fluorescent label is detected by the detector 21. As a result of this detection, the amount of the fluorescently labeled gene 23 hybridized to each DNA probe 22 is obtained. This is called expression distribution.
[0016]
FIG. 3 shows examples of the interrelationships to be considered when performing gene annotation: relationships among genes (genome: Genome), among transcripts (transcriptome: Transcriptome), and among proteins (proteome: Proteome), as well as relationships between genes and transcripts, between transcripts and proteins, and between genes and proteins. The suffix "-ome" means the whole or the overall picture. The whole set of genes (Gene) is called the genome (Genome), the whole set of transcripts (Transcript) the transcriptome, and the whole set of proteins (Protein) the proteome. In the following description, genes, transcripts, and proteins are collectively referred to as genes and the like. In FIG. 3, white circles indicate individual genes and the like, and lines connecting white circles indicate interactions or causal relationships known from experiments and the like. FIG. 3A shows a state in which there is no interaction between genes and the like, which is called independent. However, some interaction or causal relationship may be discovered through future experiments. FIG. 3B shows a relationship in which genes and the like interact one-to-one, which is called a binary relationship. For example, as shown in FIG. 4, the relationship between a receptor on the cell surface and a ligand that binds to the receptor is a binary relationship. FIG. 4(A) shows the ligand-receptor relationships between interleukin 2 (IL2) protein and interleukin 2 receptor alpha (IL2RA), between IL2 and interleukin 2 receptor beta (IL2RB), and between IL2 and interleukin 2 receptor gamma (IL2RG). FIG. 4(B) shows those between transforming growth factor beta 1 (TGFB1) and transforming growth factor beta receptor 1 (TGFBR1), between TGFB1 and transforming growth factor beta receptor 2 (TGFBR2), and between TGFB1 and transforming growth factor beta receptor 3 (TGFBR3). FIG. 4(C) shows that between erythropoietin (EPO) protein and erythropoietin receptor (EPOR). DNA and a DNA-binding protein are another example of a binary relationship. Examples of DNA-binding proteins include transcription factors and repair genes. FIG. 3C shows a relationship in which genes and the like interact along a path, which is called a pathway. A pathway may branch, but it is characterized in that interaction proceeds in one direction, from upstream to downstream. In other words, a causal relationship exists between upstream genes and downstream genes. A known example is the MAP kinase (Mitogen Activated Protein Kinase) pathway shown in FIG. 5, in which information is transmitted from the cell surface to the cell nucleus, starting from the binding of a ligand to a receptor on the cell surface. Each circle represents a gene, and an arrow connecting circles means that information is transmitted in the direction of the arrow. For example, in FIG. 5, information is transmitted from the Mos gene to the MEK gene and from the MEK gene to the ERK gene. For pathway information, refer to, for example, a pathway database (http://www.biocarta.com/).
[0017]
FIG. 3(D) shows the arrangement of genes relative to one another on a DNA base sequence, which is referred to as a genome in the present specification. In higher animals such as humans, mice, and rats, genes are arranged on chromosomes, and in yeasts and bacteria, genes are arranged on circular DNA. FIG. 6 shows an example of the genome. In part of the 13q12 to 13q13 region of human chromosome 13, the genes from LOC222428 to LOC160979 are located in sequence at the positions shown in FIG. 6. Certain diseases are caused by deletion or amplification of a chromosomal region, which results in the simultaneous deletion or amplification of neighboring genes. Genomic information is useful for searching for the causative gene of a disease caused by such gene amplification or deletion. FIG. 3E shows a case where genes and the like have a hierarchical structure. Examples of the hierarchical structure include the ontology in FIG. 7, the enzyme classification (EC: Enzyme Commission) in FIG. 8, and the superfamily in FIG. 9. FIG. 7 is the ontology of a gene called ADPRT. An ontology is a dictionary that defines the functions of genes based on analysis of gene sequences and protein sequences. For details of the ontology, see The Gene Ontology Consortium, Gene ontology: tool for the unification of biology. Nature Genetics 25, pp. 25-29 (2000). According to the gene ontology, the ADPRT gene has the functions of DNA repair and ADP-ribosylation. DNA repair is a kind of DNA metabolism, and DNA metabolism is a kind of nucleic acid metabolism (nucleobase, nucleoside, nucleotide and nucleic acid metabolism).
[0018]
The numbers in parentheses at the right end of the ontology in FIG. 7 indicate the number of registered genes. Currently, 386 genes are registered as having a DNA repair function, and 1138 genes are registered for DNA metabolism, which includes DNA repair. The numbers of registered genes show that the gene ontology forms a hierarchy from broad, general categories down to detailed subcategories. FIG. 8 shows an example of enzymes, the second example of a hierarchical structure. Enzymes are classified by EC (Enzyme Commission) numbers based on their structures and functions. There are six top-level classes, EC1 to EC6: EC1 is oxidoreductases (Oxidoreductases), EC2 is transferases (Transferases), EC3 is hydrolases (Hydrolases), EC4 is lyases (Lyases), EC5 is isomerases (Isomerases), and EC6 is ligases (Ligases). EC1 to EC6 are each further classified hierarchically. FIG. 8 shows the example of EC6, which is divided into EC6.1 to EC6.5. EC6.3 is further divided into EC6.3.1 to EC6.3.5. Under EC6.3.3, the actual enzymes are EC6.3.3.1 to EC6.3.3.3.
[0019]
FIG. 9 shows an example of a superfamily, the third example of a hierarchical structure. A superfamily is a group of genes whose proteins are expected to have similar three-dimensional structures and functions, judged from motifs and domain structures obtained by nucleotide sequence analysis. A motif is a structural element or pattern, and here refers to a particular structure found in the amino acid sequences of various proteins. Motifs are found in common across a wide range of proteins with different functions. Proteins have structures called domains, and protein domains are made up of various combinations of motifs. In general, a motif is a structural unit smaller than a protein domain. Examples of motifs include DNA-binding structural motifs such as the helix-turn-helix and the zinc finger. FIG. 9 shows the example of the CYP gene group of drug-metabolizing enzymes. There are about 50 CYP genes in total, and they are classified by structure and function into groups such as CYP1A1 and CYP1A2.
[0020]
FIG. 3F shows a case where genes and the like are in a network relationship. Examples of networks include the relationship based on literature information in FIG. 10, the protein interaction in FIG. 11, and the metabolic pathway in FIG. 12. FIG. 10 shows interrelationships based on literature information, the first example of a network. This method assumes that the more often two gene names appear together in the same document of a literature database, the stronger the relationship between them. For example, a correlation score can be obtained using the PubGene database. See Tor-Kristian Jenssen et al., A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 28, pp. 21-28 (2001). Each circle in FIG. 10 indicates a gene, and a line connecting circles indicates a relationship. The number beside each line is the correlation score. In FIG. 10, the correlation score is the number of abstracts in the medical literature database MEDLINE in which the two genes connected by the line appear together. The genes strongly correlated with the gene ADPRT at the center of FIG. 10 are TP53, CFTR, EEF2, FRA1H, SP1, and DAF, and the correlation score is 1 for all six genes. PubGene uses MEDLINE or OMIM of NCBI in the United States as its literature database, but other literature databases may be used. FIG. 11 shows protein interactions, the second example of a network. Protein interactions can be examined using a protein interaction database such as DIP (Database of Interacting Proteins) of the University of California, Los Angeles, USA. For DIP, see I. Xenarios et al., DIP: the database of interacting proteins. Nucleic Acids Research 28, pp. 289-291 (2000). In a protein interaction database as well, interacting proteins are connected by lines. 
If the dissociation constant (Dissociation Constant) has been determined experimentally, for example, the strength of the interaction can be determined because the binding force between the molecules is known. The stronger the binding force, the stronger the interaction. An interaction confirmed in two or more experiments may also be regarded as stronger than an interaction confirmed in only one experiment. A protein interaction database other than DIP may also be used. FIG. 12 shows a metabolic pathway, the third example of a network. For details of metabolic pathways, refer to a metabolic pathway database such as KEGG (http://www.kegg.kyoto-u.ad.jp). The metabolic pathway differs from the pathway of FIG. 3(C) in that multiple proteins that mediate reactions, called coenzymes, are involved in a metabolic reaction, so the structure is not limited to a simple one-way relationship from upstream to downstream but becomes more complicated.
[0021]
FIG. 3 to FIG. 12 show interrelationships in genetic information such as the genome, transcriptome, and proteome. However, genetic information is only part of biological information. As shown in FIG. 13, biological information has a hierarchical structure that extends from enzymes and metabolism (metabolome: Metabolome), interactions (interactome: Interactome), and temporal and spatial localization (localizome: Localizome) to phenotypes (phenome: Phenome). See M. Vidal, A biological atlas of functional maps. Cell 104, pp. 333-339 (2001). The analysis method described in the present specification can be applied not only to interactions in the genome, transcriptome, and proteome but also to interactions in the biological information in general shown in FIG. 13.
[0022]
FIGS. 14 and 15 are schematic diagrams showing a computerized method for realizing the present invention. FIG. 14 is a schematic diagram showing a computerized method for implementing the present invention, in which the number of repetitions is determined by comparing a cross-validation value with a predetermined value.
[0023]
FIG. 15 is a schematic diagram of a computerized method for implementing the present invention in which all cross-validation values are stored and compared. FIGS. 14 and 15 have the following steps in common: a step of receiving gene expression data, a step of receiving class information, a step of extracting a gene group related to class classification using a data mining method, a step of annotating the gene group, a step of extracting a common rule of the gene group related to the class classification based on the gene annotation, and a step of performing data mining using constraint conditions based on the common rule. In FIGS. 14 and 15, normalization and filtering may or may not be performed after the gene expression data is received. Filtering in particular is desirable, because data selection (Data Selection) and data cleansing (Data Cleansing) are steps that influence the data mining result.
[0024]
Table 1 shows examples of filtering that are useful in analyzing DNA chip data. As described above, the threshold method, one example of a filtering method, sets the threshold to 500 when small signal intensities (for example, 500 or less) are to be excluded, and replaces every signal intensity of 500 or less with 500 or with zero. The threshold is determined empirically or by a method based on a non-parametric test, for example the Wilcoxon signed rank test.
[0025]
[Table 1]

In the correlation coefficient method, another example of a filtering method, (1) two samples are first drawn at random from the plurality of samples, and a correlation coefficient is calculated for all genes of the actual data. (2) A correlation coefficient is then calculated in the same way for artificially created random data. Pearson's correlation coefficient is used as the correlation coefficient. (3) The probability distribution of the correlation coefficient in the random data is compared with that in the actual data. (4) Only genes showing a correlation or inverse correlation that is significantly large relative to the distribution of the random data are used as data mining input.
[0026]
The analysis of variance method, another example of a filtering method, is similar to standard ANOVA (Analysis of Variance). However, standard ANOVA uses the F distribution for its test on the assumption that the data are mutually independent. With actual genes, some correlation between gene data is to be expected, so applying ordinary ANOVA as is would be incorrect. For this reason, a distribution corresponding to the F distribution is often simulated from the actual data using the bootstrap method or the like.
[0027]
A Bayesian estimation method, another example of a filtering method, will now be described. The following assumes a DNA chip experiment in which two samples are labeled with the two fluorescent dyes Cy3 and Cy5 and hybridized simultaneously. Consider an experiment in which the same RNA is split into two beforehand, one half is labeled with Cy3 and the other with Cy5, and the two are competitively hybridized on the same chip (about 5 to 10 chips are required). In the data of this experiment, the mean c of the ratio Cy5/Cy3 is 1.0, but the variance σ² is unknown. From this normal population N(c, σ²), a random sample {Y1, Y2, ..., Yi, ..., Yn} of size n is drawn, and its observed values are y = (y1, y2, ..., yi, ..., yn). Error factors such as differences between chips, differences between dyes, and individual differences are considered to be uncorrelated between experiments, so {Y1, Y2, ..., Yi, ..., Yn} can be regarded as independent and identically distributed.
[0028]
(Equation 1)

The final purpose of this Bayesian estimation is to estimate σ². This estimation gives σ² of the population N(1, σ²), and from Equation 1 the σ² of every Yi is the same. In Bayesian estimation σ² itself has a distribution, so the distribution of σ² is obtained. As estimates, the mean and the mode (the point at which the frequency is maximum) are obtained as point estimates, and the highest density region (Highest Density Region; HDR) is obtained as the confidence interval. The 90% highest density region is, so to speak, the shortest of the 90% intervals of the unknown parameter σ²; it always includes the peak of the posterior distribution (the posterior mode) and has the same posterior density at both ends of the interval.
When Equation 1 holds, the joint probability density of {Y1, Y2, ..., Yi, ..., Yn} (the probability density that a1 < Y1 ≦ b1, a2 < Y2 ≦ b2, ..., an < Yn ≦ bn are satisfied simultaneously), p(y | c, σ²),
[0029]
[Equation 2]

can be written as a product of normal densities, where Π is the product symbol. Therefore, the likelihood function l(σ² | y) given the observed value vector y = (y1, y2, ..., yi, ..., yn) is
[0030]
[Equation 3]

(Equation 4)

Here, s² is the variance of the observed values about the population mean c, and is given by the following equation.
[0031]
(Equation 5)
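The equation images are not reproduced in this text. A reconstruction of Equations 1 to 5 that is consistent with the definitions in the surrounding paragraphs and with the worked example in paragraph [0036] (s² = 0.04788 for the five ratios given there) could read as follows. The notation is an assumption and is not the original typesetting.

```latex
\begin{align}
Y_i &\sim N(c, \sigma^2) \quad \text{i.i.d.}, \quad i = 1, \dots, n \tag{1}\\
p(y \mid c, \sigma^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y_i - c)^2}{2\sigma^2}\right) \tag{2}\\
l(\sigma^2 \mid y) &\propto (\sigma^2)^{-n/2}
  \exp\!\left(-\frac{\sum_{i=1}^{n}(y_i - c)^2}{2\sigma^2}\right) \tag{3}\\
&= (\sigma^2)^{-n/2} \exp\!\left(-\frac{n s^2}{2\sigma^2}\right) \tag{4}\\
s^2 &= \frac{1}{n} \sum_{i=1}^{n} (y_i - c)^2 \tag{5}
\end{align}
```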

Here, the prior distribution p(σ²) is assumed to be a noninformative prior distribution. This assumption is reasonable in that it expresses as much ignorance about the parameter as possible, removes arbitrariness from the prior distribution as far as possible, and lets the posterior distribution be governed by the data as far as possible. In general, a locally uniform prior distribution is used as the noninformative prior distribution. A locally uniform distribution expresses the vagueness of the prior information by being uniform at least locally, regardless of whether the unknown parameter is squared, cubed, or transformed to its logarithm. Specifically, it is known that the prior may be chosen to be proportional to the square root of the Fisher information. If a locally uniform distribution is used as the prior distribution,
[0032]
(Equation 6)

may be used. That is, the prior distribution p(σ²) of σ² is proportional to σ⁻², which is to say that the prior on log σ is constant. Next, consider the posterior distribution. From Bayes' theorem,
[0033]
[Equation 7]

holds, so
[0034]
(Equation 8)

and the posterior distribution p(σ² | y) of σ² is χ⁻²(n, ns²). Here χ⁻²(ν, λ) denotes the inverse chi-square distribution with ν degrees of freedom and scale parameter λ. Since the mean of χ⁻²(ν, λ) is known to be λ/(ν−2) and the mode (the point at which the frequency is maximum) to be λ/(ν+2), as point estimates of σ²,
[0035]
[Equation 9]

(Equation 10)

can be considered. In addition, from Equation 8,
[0036]
[Equation 11]

is obtained. In Equation 11, χ²(n) is the chi-square distribution with n degrees of freedom. Here, ns² is a fixed value (the observed value) and σ² is the random variable. The HDR can be obtained using Equation 11 and a numerical table. Using Equations 9, 10, and 11, when gene expression ratio data between two samples are obtained, it is possible to know how far a ratio must deviate from 1.0 to be statistically significant. Only genes whose ratio differs significantly from 1.0 need be used as input for data mining. For example, if five experiments give y₁ = 1.4, y₂ = 0.89, y₃ = 1.24, y₄ = 0.91, and y₅ = 1.04, then s² = 0.04788. The point estimate based on the mean (Equation 9) gives Yi ~ N(1, 0.0798), and the point estimate based on the mode (Equation 10) gives Yi ~ N(1, 0.0342). From Equation 11, σ² ~ 0.2394 χ⁻²(5), so the 90% HDR of σ² is 0.019 to 0.177. The Bayesian estimation method shown in Equations 1 to 11 is only one statistical method, and a similar estimation may be performed using a method other than Bayesian estimation, for example the Neyman-Pearson style of estimation.
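The point estimates in the worked example can be checked with a few lines of code. The sketch below assumes that s² is the mean squared deviation of the ratios from c = 1.0, which reproduces the stated value 0.04788, and uses the mean λ/(ν−2) and mode λ/(ν+2) of the inverse chi-square distribution with ν = n and λ = ns². The function names are hypothetical, and the HDR is not computed here.

```python
def observed_variance(ratios, center=1.0):
    """s^2: mean squared deviation of the observed ratios from the known mean c."""
    return sum((y - center) ** 2 for y in ratios) / len(ratios)

def posterior_point_estimates(ratios, center=1.0):
    """Point estimates of sigma^2 from the posterior chi^-2(n, n*s^2); requires n > 2."""
    n = len(ratios)
    scale = n * observed_variance(ratios, center)   # lambda = n * s^2
    return {
        "s2": scale / n,
        "scale": scale,
        "mean": scale / (n - 2),   # lambda / (nu - 2)
        "mode": scale / (n + 2),   # lambda / (nu + 2)
    }

# The five Cy5/Cy3 ratios from the example above.
estimates = posterior_point_estimates([1.4, 0.89, 1.24, 0.91, 1.04])
```

With these inputs the sketch gives s² ≈ 0.04788, ns² ≈ 0.2394, a posterior mean of about 0.0798, and a posterior mode of about 0.0342, matching the values in the text.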
[0037]
After the expression data is read, normalized, and filtered, the class information is read, and an expression matrix, the data format used as input to data mining, can be created. The expression matrix format is shown in FIG. It has a structure in which gene annotations form the rows (Row), sample annotations form the columns (Column), and the corresponding expression levels (for example, the Cy5-to-Cy3 ratio described above: Cy5/Cy3) form the matrix. Adding class information to the sample annotation part of the expression matrix format makes the structure suitable for data mining. The expression matrix can be saved to a file in a format such as CSV or tab-delimited text.
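One possible CSV layout for such an expression matrix is sketched below: a header row of sample names, a second row carrying the class of each sample, and then one row per gene. The layout, the function name `write_expression_matrix`, and the sample values are assumptions made for illustration; the text above does not fix the exact file layout.

```python
import csv
import io

def write_expression_matrix(genes, samples, classes, values):
    """Serialize an expression matrix as CSV text.

    Rows are genes, columns are samples, and the second row holds the
    class label attached to each sample annotation.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["gene"] + samples)
    writer.writerow(["class"] + classes)
    for gene, row in zip(genes, values):
        writer.writerow([gene] + row)
    return buffer.getvalue()

# Hypothetical Cy5/Cy3 ratios for two probes and three samples.
text = write_expression_matrix(
    genes=["Probe1", "Probe2"],
    samples=["S1", "S2", "S3"],
    classes=[0, 0, 1],
    values=[[1.4, 0.89, 1.24], [0.91, 1.04, 2.10]],
)
lines = text.splitlines()
```

Passing `delimiter="\t"` to `csv.writer` would give the tab-delimited variant.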
[0038]
The data mining step that follows is called the first data mining step when it is performed for the first time, and the n-th data mining step when mining has been repeated (iterated) n times. Table 2 shows four mining methods suitable for chip data. All are typical supervised learning methods (supervised method). The nearest neighbor method and discriminant analysis can be executed using a statistical computing package such as S-PLUS (the class package of S-PLUS). The support vector machine can also be executed with free software, using the e1071 package of the R language, which is a clone of S-PLUS. Next, the feature rule method is described.
[0039]
[Table 2]

The feature rule method is disclosed in JP-A-8-77010 as a data analysis method. In the feature rule method, a set of samples including a plurality of attribute items is used as target data. All samples have the same attribute items. The attribute values that each attribute item can take are required to be a small number of discrete values compared to the number of samples. Typically, there are about three types of symbol values. When the original data to be analyzed is real-valued data, the range of values is divided at an appropriate boundary and discretized by a method such as replacement with symbol values such as "large", "medium", and "small".
[0040]
When implementing the feature rule method, one of the attribute items is selected and set as a “conclusion item”. Further, one of the attribute values that can be taken by the conclusion item is selected and set as “conclusion item value”. Further, a plurality of other attribute items are selected and set as “condition items”.
[0041]
In the feature rule method, an IF-THEN rule having a format of “IF (condition part) THEN (conclusion part)” is generated. The condition part of the rule is a set of a condition item and its attribute value, that is, a predicate, and allows a plurality of predicates to appear in the condition part at the same time, but typically, it is limited to about three. The conclusion part is a predicate composed of a conclusion item and a conclusion item value. As a result, only one predicate is determined in the conclusion part, while the condition part can take various combinations of predicates. Therefore, the number of IF-THEN rules that can be generated generally becomes large. The purpose of the feature rule method is to search for a relatively small number of rules that well represent the characteristics of the target data from among these many IF-THEN rules. The following evaluation scale is used to evaluate how well each IF-THEN rule represents the feature of the target data. An evaluation scale μ (A → B) of a rule “IF A THEN B” where the condition part is A (including a combination of a plurality of predicates) and the conclusion part is B is defined as follows.
μ(A → B) = P(A)^β × log[P(B | A) / P(B)]
Here, P(A)^β means P(A) raised to the power of β. P(A) is the probability that the condition part A is satisfied in the target data, that is, the proportion of samples in the entire target data that satisfy condition A. Similarly, P(B) is the probability that the conclusion part B is satisfied, and P(B | A) is the conditional probability that B is satisfied given that A is satisfied. β is a parameter specified by the user and is a real value between 0 and 1 inclusive. A rule with a larger value on this evaluation scale is considered to represent the characteristics of the target data better. P(A) in the above definition is called the cover rate and P(B | A) the hit rate, and these are sometimes used as clues when a user interprets the extracted rules.
[0042]
Several algorithms are possible for extracting a relatively small number of rules with large evaluation values from the large number of IF-THEN rules that can be generated; one of the simplest is the “brute force method”. In this method, the number of rules to be extracted (for example, 10) and the upper limit on the number of predicates appearing simultaneously in the condition part (for example, 3) are determined in advance, all possible IF-THEN rules within that range are generated, and the predetermined number of rules with the largest evaluation values are taken out.
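The evaluation scale and the brute force search can be sketched as follows. This is an illustrative reading of the description above, not the implementation of JP-A-8-77010; the function names, the default β = 0.5, the handling of undefined scores, and the toy samples are assumptions.

```python
import math
from itertools import combinations

def rule_score(samples, condition, conclusion, beta=0.5):
    """mu(A -> B) = P(A)**beta * log(P(B|A) / P(B)); None when undefined."""
    item, value = conclusion
    covered = [s for s in samples if all(s[k] == v for k, v in condition)]
    p_b = sum(s[item] == value for s in samples) / len(samples)
    if not covered or p_b == 0:
        return None
    p_a = len(covered) / len(samples)                                # cover rate
    p_b_given_a = sum(s[item] == value for s in covered) / len(covered)  # hit rate
    if p_b_given_a == 0:
        return None   # log(0) is undefined; the rule never supports B
    return p_a ** beta * math.log(p_b_given_a / p_b)

def brute_force_rules(samples, condition_items, conclusion,
                      max_predicates=3, top_n=10, beta=0.5):
    """Generate every rule up to max_predicates predicates and keep the top_n."""
    predicates = sorted({(k, s[k]) for s in samples for k in condition_items})
    scored = []
    for size in range(1, max_predicates + 1):
        for condition in combinations(predicates, size):
            if len({k for k, _ in condition}) < size:
                continue   # skip conditions that test the same item twice
            score = rule_score(samples, condition, conclusion, beta)
            if score is not None:
                scored.append((score, condition))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Toy discretized data: geneA determines the class, geneB does not.
samples = [
    {"geneA": "large", "geneB": "small", "class": 1},
    {"geneA": "large", "geneB": "large", "class": 1},
    {"geneA": "small", "geneB": "small", "class": 0},
    {"geneA": "small", "geneB": "large", "class": 0},
]
rules = brute_force_rules(samples, ["geneA", "geneB"], ("class", 1),
                          max_predicates=2, top_n=3)
```

In this toy case the best rule is IF geneA = large THEN class = 1. The rules that add a geneB predicate have the same hit rate but a smaller cover rate, so the P(A)^β factor ranks them lower.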
[0043]
In order to use the feature rule method for extracting a group of genes related to class classification, in the target data, an attribute related to the class classification is used as a conclusion item of the rule, and an attribute corresponding to the gene annotation is used as a condition item. As a result, an IF-THEN rule having an important gene for the class classification as a conditional part is obtained.
[0044]
As a result of the data mining, genes that are important for the class classification (that is, involved in the rules that perform the classification) are extracted. In general, genes are extracted in descending order of correct answer rate (ascending order of error rate), from one up to about 10 to 200, preferably 20 to 50. The feature rule method described above is the best suited to extracting important gene groups.
[0045]
An example other than the feature rule method is a method using the BSS/WSS ratio. BSS stands for between-group sum of squares and WSS for within-group sum of squares, and the BSS/WSS ratio of gene j is defined by the following equation.
[0046]
(Equation 12)

Here, m(xj) is the mean expression level of gene j over all samples, m(xkj) is the mean expression level of gene j over the samples belonging to class k, xij is the element of the gene expression data matrix for sample i and gene j, and I(yi = k) is a function that is 1 when yi = k and 0 otherwise. The larger the BSS/WSS ratio given by Equation 12, the larger the difference between classes relative to the difference within a class, so for class classification it is desirable to use genes with a large BSS/WSS ratio, for example the top 20 to 50 genes.
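Gene selection by BSS/WSS ratio can be sketched as follows. Because the image of Equation 12 is not reproduced in this text, the sketch assumes the usual form implied by the symbol definitions above: BSS sums (m(xkj) − m(xj))² and WSS sums (xij − m(xkj))² over all samples i, with each sample contributing under its own class k. The function names and toy values are hypothetical.

```python
def bss_wss_ratio(expression, labels):
    """BSS/WSS ratio of one gene.

    expression[i] is the expression level of the gene in sample i,
    labels[i] is the class of sample i. WSS must be non-zero.
    """
    overall_mean = sum(expression) / len(expression)
    class_means = {}
    for k in set(labels):
        members = [x for x, y in zip(expression, labels) if y == k]
        class_means[k] = sum(members) / len(members)
    bss = sum((class_means[y] - overall_mean) ** 2 for y in labels)
    wss = sum((x - class_means[y]) ** 2 for x, y in zip(expression, labels))
    return bss / wss

def top_genes(matrix, labels, n_top):
    """Names of the n_top genes with the largest BSS/WSS ratio."""
    ratios = {gene: bss_wss_ratio(values, labels) for gene, values in matrix.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:n_top]

# Toy data: geneX separates the two classes, geneY does not.
labels = [0, 0, 1, 1]
matrix = {
    "geneX": [1.0, 1.2, 3.0, 3.2],
    "geneY": [1.0, 3.0, 1.2, 3.2],
}
ratio_x = bss_wss_ratio(matrix["geneX"], labels)
ratio_y = bss_wss_ratio(matrix["geneY"], labels)
selected = top_genes(matrix, labels, n_top=1)
```

For geneX the class means are far apart relative to the spread inside each class, so its ratio is large, while geneY shows the opposite pattern.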
[0047]
Next, the group of genes important for the class classification, extracted by data mining, is annotated. One or more of the various interrelationships shown in FIGS. 3 to 13 are used for the annotation. Table 3 shows an example of annotation using the gene ontology. When the gene group extracted by the data mining is Probe1 to Probe5, the gene annotation step can retrieve chromosome numbers and gene ontology terms from public databases and the like, in addition to the gene cluster number (UniGene number), base sequence number (GenBank number), and gene name that were already known when the DNA chip was designed, and Table 3 can thus be created.
[0048]
[Table 3]

Next, a step of extracting commonality and regularity from the gene annotation is performed. Suppose that the gene group in Table 3 was found by data mining to be important for a certain class classification. The purpose of the step of extracting common rules from gene annotations is to extract the properties and characteristics shared by the gene group in Table 3. For example, the gene ontology column of Table 3 is searched for an ontology term shared by several genes. Here, integral membrane protein is found in both Probe1 and Probe4. Table 3 therefore suggests the possibility of a hidden rule that genes whose proteins reside in the membrane (integral membrane) are involved in the class classification. In the next data mining, for example, the probes mounted on the DNA chip are restricted by the constraint condition that a probe must correspond to integral membrane protein (GO number: 0016021), and data mining is performed using only the gene group that satisfies the constraint. By performing cross-validation, it is possible to understand quantitatively how much integral membrane protein contributes to the class classification. The layer above integral membrane protein (GO number: 0016021) is membrane (GO number: 0016020). In another data mining, therefore, the probes mounted on the DNA chip may be restricted by the constraint condition that a probe must correspond to integral membrane protein (GO number: 0016021) or membrane (GO number: 0016020), and data mining and cross-validation may be performed using only the gene group that satisfies the condition. 
The cross-validation results (correct answer rate or error rate) are then compared between mining constrained to "integral membrane protein" and mining constrained to "including membrane". If the correct answer rate of the former is better than that of the latter, it follows that integral membrane protein, rather than membrane, is important for the class classification. The step of extracting common rules from the gene annotation is thus a step of searching a list of genes important for class classification, such as Table 3, for the interrelationships shown in FIGS. 3 to 13, as in the ontology example above. In regularity extraction, rules of higher generality can be found by following the rules shown in Tables 4 to 6. The step of performing data mining with the discovered rule as a constraint is the subsequent "step of performing data mining using constraint conditions based on the common rule". In this step, the (n+1)-th mining is performed using the important gene group obtained by the n-th mining and its annotation. The significance of repeating mining in this way is that annotation adds information to the gene group obtained by rule generation, this information is treated as attributes of each gene, and common features are analyzed further. By repeating this analysis cycle, common features can be searched for thoroughly. Regarding the use of such regularities, the ligand-receptor relationship is described in T. G. Graeber and D. Eisenberg, Bioinformatics identification of potential autocrine signaling loops in cancers from gene expression profiles. Nature Genetics 29, pp. 295-300 (2001); see also H. Ge et al., Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genetics 29, pp. 482-486 (2001). 
However, both of the above two known examples only show a method of modifying the expression matrix used as input to the first mining. Unlike the method described in this specification, they do not perform mining continuously and repeatedly based on the previous mining result. With such approaches it is difficult to acquire knowledge with high generality and robustness. The present invention differs fundamentally from the prior art in that knowledge with high generality and robustness can be acquired automatically.
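The ontology example above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the probe-to-ontology table is hypothetical (Table 3 is not reproduced here), and only the two GO numbers and the Probe1/Probe4 overlap come from the text. The function names `common_terms` and `probes_matching` are assumptions introduced for the sketch.

```python
from collections import Counter

# Hypothetical annotation table: probe -> set of GO ids.
# Only GO:0016021 / GO:0016020 and the Probe1/Probe4 overlap come from the text.
PROBE_GO = {
    "Probe1": {"GO:0016021"},
    "Probe2": {"GO:0005634"},
    "Probe3": {"GO:0016020"},
    "Probe4": {"GO:0016021"},
    "Probe5": {"GO:0005737"},
}

# Child -> parent edge quoted in the text:
# integral membrane protein (0016021) -> membrane (0016020).
GO_PARENT = {"GO:0016021": "GO:0016020"}


def common_terms(extracted, probe_go, min_count=2):
    """Return GO ids annotated on at least min_count of the extracted probes."""
    counts = Counter(go for p in extracted for go in probe_go.get(p, ()))
    return sorted(go for go, n in counts.items() if n >= min_count)


def probes_matching(allowed_terms, probe_go):
    """Return every probe on the chip annotated with any allowed term."""
    allowed = set(allowed_terms)
    return sorted(p for p, gos in probe_go.items() if gos & allowed)


# Probes assumed to have been found important by the first mining.
extracted = ["Probe1", "Probe2", "Probe4"]
shared = common_terms(extracted, PROBE_GO)

# Narrow constraint: the shared term only.
narrow = probes_matching(shared, PROBE_GO)

# Broad constraint: the shared term or its upper-hierarchy term.
parents = [GO_PARENT[t] for t in shared if t in GO_PARENT]
broad = probes_matching(shared + parents, PROBE_GO)
```

Each constrained probe list would then be passed to a separate mining and cross-validation run, and the two correct answer rates compared as described above.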
[0049]
[Table 4]

[Table 5]

[Table 6]

Tables 4 to 6 will be described. In Tables 4 to 6, when the interrelationship between genes is a binary relationship, namely a ligand-receptor relationship, and the class-classification-related extracted gene is a ligand gene, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including the receptor gene, or a set of the ligand gene and the receptor gene. If the class-classification-related extracted gene is a receptor gene, the constraint condition is a condition including the ligand gene, or a set of the ligand gene and the receptor gene.
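The ligand-receptor rule can be expressed as a small lookup. The sketch below assumes a hypothetical list of ligand-receptor pairs with placeholder gene names; the function name and the `include_pair` flag are illustrative choices, not part of the specification.

```python
# Hypothetical ligand -> receptor pairs; the gene names are placeholders.
LIGAND_RECEPTOR = [("LIG_A", "REC_A"), ("LIG_B", "REC_B")]


def ligand_receptor_constraint(extracted_gene, pairs, include_pair=True):
    """Return the set of genes allowed by the ligand-receptor constraint.

    include_pair=False keeps only the partner gene;
    include_pair=True keeps the ligand-receptor pair as a set.
    """
    allowed = set()
    for ligand, receptor in pairs:
        if extracted_gene == ligand:
            allowed |= {ligand, receptor} if include_pair else {receptor}
        elif extracted_gene == receptor:
            allowed |= {ligand, receptor} if include_pair else {ligand}
    return allowed


partner_only = ligand_receptor_constraint("LIG_A", LIGAND_RECEPTOR, include_pair=False)
as_pair = ligand_receptor_constraint("REC_B", LIGAND_RECEPTOR)
```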
[0050]
In Tables 4 to 6, when the interrelationship between genes is a pathway and the class-classification-related extracted gene is on pathway PA, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including an upstream gene on pathway PA, a downstream gene on pathway PA, a gene on pathway PB correlated with pathway PA, or a set of any of these genes.
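One way to picture the pathway rule is to treat each pathway as an ordered list of genes. This is a simplifying assumption made only for illustration (real pathways branch); the pathway contents, the `CORRELATED` map, and the function name are hypothetical.

```python
# Hypothetical linear pathways, listed from upstream to downstream.
PATHWAYS = {"PA": ["G1", "G2", "G3", "G4"], "PB": ["G7", "G8"]}

# Hypothetical map of pathways regarded as correlated with each other.
CORRELATED = {"PA": ["PB"]}


def pathway_constraint(gene, pathway_id, pathways, correlated):
    """Split candidate genes into upstream, downstream, and correlated-pathway groups."""
    order = pathways[pathway_id]
    i = order.index(gene)
    return {
        "upstream": order[:i],
        "downstream": order[i + 1:],
        "correlated": [
            g for other in correlated.get(pathway_id, []) for g in pathways[other]
        ],
    }


groups = pathway_constraint("G2", "PA", PATHWAYS, CORRELATED)
```

Any one of the three groups, or a union of them, could serve as the constraint condition for the next mining.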
[0051]
In Tables 4 to 6, when the interrelationship between genes is a genome relationship and the class-classification-related extracted gene is on chromosome CA, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including a neighboring gene on chromosome CA, or a set of the extracted gene and the neighboring gene.
[0052]
In Tables 4 to 6, when the interrelationship between genes is a hierarchical structure, namely an ontology, and the ontology of the class-classification-related extracted gene is OA, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including a gene having an ontology in the upper hierarchy of ontology OA, or a set of the extracted gene and a gene having the upper-hierarchy ontology.
[0053]
In Tables 4 to 6, when the interrelationship between genes is a hierarchical structure, namely an enzyme (EC) classification, and the EC number of the class-classification-related extracted gene is ECA, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including a gene belonging to the upper hierarchy of enzyme ECA, a gene belonging to the same group as enzyme ECA, or a set of any of these genes.
[0054]
In Tables 4 to 6, when the interrelationship between genes is a hierarchical structure, namely a superfamily, and the superfamily of the class-classification-related extracted gene is SFA, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including a gene belonging to the same superfamily SFA, or a set of the extracted gene and a gene belonging to the same superfamily.
[0055]
In Tables 4 to 6, when the interrelationship between genes is a network, namely document information, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including a gene expected from the document information to be related to the class-classification-related extracted gene, or a set of the extracted gene and the gene expected from the document information to be related to it.
[0056]
In Tables 4 to 6, when the interrelationship between genes is a network, namely protein interaction, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including a gene predicted from the protein interaction to be related to the class-classification-related extracted gene, or a set of the extracted gene and the gene predicted from the protein interaction to be related to it.
[0057]
In Tables 4 to 6, when the interrelationship between genes is a network, namely metabolic pathway information, the constraint condition based on the commonality and regularity extracted from the annotation is a condition including a gene predicted from the metabolic pathway information to be related to the class-classification-related extracted gene, or a set of the extracted gene and the gene predicted from the metabolic pathway information to be related to it.
In the present invention, by performing iterative mining using the methods disclosed in FIGS. 14 and 15 and Tables 4 to 6, meta-knowledge with higher generality and robustness can be obtained automatically.
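The iterative cycle can be outlined as a loop that stops once the cross-validation accuracy exceeds a predetermined value. The sketch below is an assumption-laden outline, not the disclosed implementation: the four callables (`mine`, `annotate`, `extract_rule`, `cross_validate`) are hypothetical stand-ins supplied by the caller, the `max_rounds` guard is added only so the loop always terminates, and the toy data exist only to make the example runnable.

```python
def iterative_mining(all_genes, mine, annotate, extract_rule, cross_validate,
                     threshold=0.9, max_rounds=10):
    """Repeat mine -> annotate -> extract rule -> constrained mining.

    Stops when the cross-validation accuracy exceeds threshold or when
    max_rounds is reached. Every round is recorded in the returned history,
    so all cross-validation values can also be compared afterwards.
    """
    candidates = list(all_genes)
    history = []
    for round_no in range(1, max_rounds + 1):
        selected = mine(candidates)              # n-th mining
        accuracy = cross_validate(selected)
        history.append((round_no, selected, accuracy))
        if accuracy > threshold:
            break
        rule = extract_rule(annotate(selected))  # common rule from annotations
        constrained = [g for g in all_genes if rule(g)]
        if not constrained:                      # rule excludes every gene
            break
        candidates = constrained                 # input to the (n+1)-th mining
    return history


# Toy stand-ins (hypothetical) so that the sketch runs end to end.
ANNOTATION = {"g1": "membrane", "g2": "nucleus", "g3": "membrane", "g4": "cytoplasm"}


def toy_mine(candidates):
    return candidates[:3]


def toy_annotate(genes):
    return [ANNOTATION[g] for g in genes]


def toy_extract_rule(terms):
    top = max(sorted(set(terms)), key=terms.count)  # most frequent annotation
    return lambda gene: ANNOTATION[gene] == top


def toy_cross_validate(genes):
    return sum(ANNOTATION[g] == "membrane" for g in genes) / len(genes)


history = iterative_mining(["g1", "g2", "g3", "g4"], toy_mine, toy_annotate,
                           toy_extract_rule, toy_cross_validate)
```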
[0058]
BEST MODE FOR CARRYING OUT THE INVENTION
A specific embodiment of the present invention will be described using a publicly available acute leukemia data set (Golub et al., 1999). The data file was downloaded from the MIT site (http://www.genome.wi.mit.edu/MPR). This data set was obtained by collecting blood from patients with two types of acute leukemia, ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia), and measuring expression profiles using an Affymetrix human chip (6817 genes). The total number of subjects was 72 (38 B-cell ALL, 9 T-cell ALL, and 25 AML). Using this data set (72 × 6817), it is examined whether the two types of leukemia can be classified from expression data alone. Hereinafter, the two types of leukemia (ALL, AML) are defined as classes 0 and 1, respectively. That is, ALL = class 0 and AML = class 1. The acute leukemia data set (Golub et al., 1999) is provided with the 72 samples divided into a training set of 38 samples (data_set_ALL_AML_train.txt) and a test set of 34 samples (data_set_ALL_AML_independent.txt). Therefore, learning was performed here using the 38 training samples, and the results were used to classify the 34 test samples. By comparing the result of this classification with the known class of each sample, the correct answer rate and the error rate can be obtained.
[0059]
Hereinafter, an example of data mining using the feature rule method will be described. A data set of 6818 attributes and 72 samples was used as the target data. Of these, 6817 attributes correspond to the expression levels of the individual genes measured by the DNA chip, and the remaining attribute corresponds to the type of leukemia (ALL, AML). Since the attribute values corresponding to gene expression levels are real-valued, each attribute item was discretized into three categories: “large”, “medium”, and “small”. The category boundaries of each attribute item were set so that the 72 samples were distributed almost equally among the categories. Since the attribute corresponding to the type of leukemia originally takes two discrete values (class 0 and class 1), it was used as is.
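The equal-frequency discretization described above can be sketched as a rank-based split. This is one possible reading of "almost equally distributed", not necessarily the procedure actually used; the function name is an assumption, and ties are broken by sample order, so equal values may fall into different categories.

```python
def discretize_equal_frequency(values, labels=("small", "medium", "large")):
    """Assign each value to one of len(labels) categories of nearly equal size by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # sample indices, ascending value
    result = [None] * n
    for rank, i in enumerate(order):
        result[i] = labels[rank * len(labels) // n]
    return result


# Small worked example with six expression values for one gene.
levels = discretize_equal_frequency([5.0, 1.0, 3.0, 9.0, 7.0, 2.0])
# levels == ["medium", "small", "medium", "large", "large", "small"]
```

With 72 samples, this split places 24 samples in each category.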
[0060]
Table 7 shows the top 20 rules by evaluation value, extracted with the 6817 attributes corresponding to genes as the condition items of the IF-THEN rule, the attribute corresponding to the type of leukemia as the conclusion item, “class 0” as the conclusion item value, and the upper limit on the number of predicates in the condition part set to 1. Each row in Table 7 corresponds to one IF-THEN rule, and the rows are arranged from the top in descending order of evaluation value. The first column of Table 7 corresponds to the condition part of the IF-THEN rule, and the second, third, and fourth columns correspond to the rule evaluation value, hit rate, and cover rate, respectively. The conclusion part is “leukemia type = class 0” for every rule and is therefore omitted from the table. For example, the first row represents the rule “IF U07139_at = large THEN leukemia type = class 0”, which means that when the expression level of the gene “U07139_at” is “large”, the leukemia type tends strongly to be class 0 (ALL).
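For a rule with a single predicate, the hit rate and cover rate can be computed as below. The definitions used here are assumptions: hit rate is taken as the fraction of samples satisfying the condition that also satisfy the conclusion, and cover rate as the fraction of samples satisfying the conclusion that also satisfy the condition. The evaluation value is not computed because its formula is not given in this passage. The sample data are invented for illustration.

```python
def rule_stats(condition_values, class_labels, condition_value, conclusion_class):
    """Hit rate and cover rate of 'IF attribute = condition_value THEN class = conclusion_class'."""
    cond = [v == condition_value for v in condition_values]
    concl = [c == conclusion_class for c in class_labels]
    both = sum(a and b for a, b in zip(cond, concl))
    n_cond, n_concl = sum(cond), sum(concl)
    hit_rate = both / n_cond if n_cond else 0.0      # assumed definition
    cover_rate = both / n_concl if n_concl else 0.0  # assumed definition
    return hit_rate, cover_rate


# Invented discretized expression levels of one gene and the sample classes.
gene_levels = ["large", "large", "medium", "small", "large", "small"]
classes = [0, 0, 0, 1, 1, 1]

hit_rate, cover_rate = rule_stats(gene_levels, classes, "large", 0)
```

In this invented example, three samples are "large", three samples are class 0, and two samples are both, so both rates equal 2/3.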
[0061]
Incidentally, the first and second rules have the same evaluation value, 0.148, and the third and subsequent rules all have the same evaluation value, 0.142. The order of rules with the same evaluation value has no analytical meaning and depends on the order of the attribute items in the data file.
[0062]
[Table 7]

[Effect of the invention]
According to the present invention, knowledge search can be performed based on gene expression data. By performing iterative search based on annotations such as the interactions between genes, meta-knowledge with higher generality and robustness can be obtained automatically.
[Brief description of the drawings]
FIG. 1 shows a computer system used to execute software according to an embodiment of the present invention and an example of an operation screen thereof.
FIG. 2 is a schematic diagram of a DNA chip.
FIG. 3 shows examples of interactions in the genome, transcriptome, and proteome.
FIG. 4 is an example of a ligand-receptor relationship that is an example of a binary relationship.
FIG. 5 is an example of a pathway relationship.
FIG. 6 shows an example of a genome relationship.
FIG. 7 is an example of a gene ontology relationship as an example of a hierarchical structure.
FIG. 8 shows an example of an enzyme (EC: Enzyme Commission) relationship which is an example of a hierarchical structure.
FIG. 9 illustrates an example of a superfamily relationship that is an example of a hierarchical structure.
FIG. 10 is an example of document information as an example of a network.
FIG. 11 is an example of a gene interaction that is an example of a network.
FIG. 12 is an example of a metabolic pathway which is an example of a network.
FIG. 13 shows an example of a hierarchical structure of biological information.
FIG. 14 is a schematic diagram showing a computerized method for implementing the present invention, in which the number of repetitions is determined by comparing a cross-validation value with a predetermined value.
FIG. 15 is a schematic diagram illustrating a computerized method for implementing the present invention for storing and comparing all cross-validation values.
FIG. 16 is an example of a flowchart of a measurement method using a DNA chip.
FIG. 17 is a schematic diagram of an expression matrix.
[Explanation of symbols]
1. Computer system, 2. CPU, 3. Input device (mouse), 4. Input file name/format selection, 5. On/off button for filtering function, 6. Filtering algorithm selection, 7. Mining method selection, 8. Annotation content, 9. Annotation content selection, 10. Iteration condition selection, 21. Fluorescence detector, 22. DNA probe, 23. Fluorescently labeled gene, 24. Support, 41. Ligand, 42. Receptor, 51. Gene, 52. Intergenic pathway, 61. Chromosome map, 62. Gene map, 63. Gene name, 71. Gene ontology (DNA repair), 72. Gene ontology (protein amino acid ADP-ribosylation), 81. Enzyme (EC), 91. Gene superfamily, 101. Gene, 102. Correlation score, 111. Protein, 112. Protein-protein interaction, 121. Enzyme, 122. Metabolic pathway, 123. Enzyme reaction product, 171. Sample annotation, 172. Gene annotation, 173. Expression level.

Claims

A gene expression data analysis method, comprising: a step of receiving gene expression data; a step of receiving class information; a step of extracting a group of genes related to class classification using a data mining method; a step of annotating the group of genes; a step of extracting a common rule of the group of genes related to the class classification based on the gene annotation; and a step of performing data mining using a constraint condition based on the common rule.

The gene expression data analysis method according to claim 1, further comprising, after the step of receiving the gene expression data and before the step of extracting the group of genes related to the class classification using the data mining method, a step of normalizing the gene expression data and a step of filtering the data.

The gene expression data analysis method according to claim 1 or 2, further comprising a step of comparing the correct answer rate or error rate of the class classification using the data mining method by performing cross-validation using the gene group extracted in the step of extracting the group of genes related to the class classification.

The gene expression data analysis method according to claim 1, wherein the step of extracting the group of genes related to the class classification using the data mining method, the step of annotating the group of genes, the step of extracting the common rule of the group of genes related to the class classification based on the gene annotation, and the step of performing data mining using the constraint condition based on the common rule are repeated a plurality of times.

The gene expression data analysis method according to claim 3, wherein the step of extracting the group of genes related to the class classification using the data mining method, the step of annotating the group of genes, the step of extracting the common rule of the group of genes related to the class classification based on the gene annotation, and the step of performing data mining using the constraint condition based on the common rule are repeated until the correct answer rate of the class classification using the data mining method, obtained by the cross-validation, exceeds a predetermined threshold.

The gene expression data analysis method according to any one of claims 1 to 5, wherein the step of extracting the group of genes related to the class classification using the data mining method is a step of extracting, from the gene expression data, the top 10 to 200 genes ranked by the ratio of the between-group sum of squares to the within-group sum of squares (BSS/WSS ratio) or by the accuracy rate obtained with the feature rule method.

The gene expression data analysis method according to any one of claims 1 to 5, wherein the step of extracting the common rule of the group of genes related to the class classification based on the gene annotation includes a step of searching, for each gene extracted in the step of extracting the group of genes related to the class classification, for a common rule belonging to any one or more of the following interrelationships: a binary relation, a pathway, a genome, a hierarchical structure, and a network.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a binary relation that is a ligand-receptor relationship; if the extracted gene is a ligand gene, the constraint condition based on the common rule of the group of genes related to the class classification is a condition including the receptor gene, or a set of the ligand gene and the receptor gene; and if the extracted gene is a receptor gene, the constraint condition is a condition including the ligand gene, or a set of the ligand gene and the receptor gene.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a pathway, the extracted gene is on a pathway PA, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including an upstream gene on the pathway PA, a downstream gene on the pathway PA, a gene on a pathway PB correlated with the pathway PA, or a set of any of these genes.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a genome relationship, the extracted gene is on a chromosome CA, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including a neighboring gene on the chromosome CA, or a set of the extracted gene and the neighboring gene.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a hierarchical structure that is an ontology, the ontology of the extracted gene is OA, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including a gene having an ontology in the upper hierarchy of the ontology OA, or a set of the extracted gene and a gene having the upper-hierarchy ontology.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a hierarchical structure that is an enzyme (EC) classification, the EC number (Enzyme Commission number) of the extracted gene is ECA, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including a gene belonging to the upper hierarchy of the enzyme ECA, a gene belonging to the same group as the enzyme ECA, or a set of any of these genes.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a hierarchical structure that is a superfamily, the superfamily of the extracted gene is SFA, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including a gene belonging to the same superfamily SFA, or a set of the extracted gene and a gene belonging to the same superfamily.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a network that is document information, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including a gene expected from the document information to be related to the extracted gene, or a set of the extracted gene and the gene expected from the document information to be related to it.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a network that is protein interaction, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including a gene predicted from the protein interaction to be related to the extracted gene, or a set of the extracted gene and the gene predicted from the protein interaction to be related to it.

The gene expression data analysis method according to claim 7, wherein the interrelationship is a network that is metabolic pathway information, and the constraint condition based on the common rule of the group of genes related to the class classification is a condition including a gene predicted from the metabolic pathway information to be related to the extracted gene, or a set of the extracted gene and the gene predicted from the metabolic pathway information to be related to it.

The gene expression data analysis method according to claim 1, wherein the step of extracting the common rule of the group of genes related to the class classification based on the gene annotation includes a step of searching for a common rule in interactions within any one or more of the overall picture of genes (genome), of transcripts (transcriptome), of proteins (proteome), of enzymes (enzyme), of metabolism (metabolome), of interactions (interactome), of spatiotemporal localization (localisome), and of phenotypes (phenome).