JP2004187562A

JP2004187562A - DNA microarray data analyzing method, DNA microarray data analyzer, program, and recording medium

Info

Publication number: JP2004187562A
Application number: JP2002358594A
Authority: JP
Inventors: Satoru Ito; 哲伊藤
Original assignee: JGS KK
Current assignee: JGS KK
Priority date: 2002-12-10
Filing date: 2002-12-10
Publication date: 2004-07-08

Abstract

PROBLEM TO BE SOLVED: To provide a DNA microarray data analyzing method that can analyze large volumes of gene data at high speed and with high accuracy by applying the MT system method, together with a DNA microarray data analyzer, a program, and a recording medium.
SOLUTION: The DNA microarray data analyzing method uses the MT system method to narrow down the genes to be analyzed on the basis of DNA microarray measurement data. When the MT system method is applied to the analysis of DNA microarray measurement data, the test data of the DNA microarray and the specimen data serve as input data, and the predicted state of the specimen, or the change predicted for that state in the future, is the analysis result. The genes to be analyzed are narrowed down while the accuracy is checked from the outset, in the course of grouping the patients.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
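【Technical Field of the Invention】
The present invention relates to a DNA microarray data analyzing method, a DNA microarray data analyzer, a program, and a recording medium, and in particular to ones that can analyze large volumes of gene data at high speed and with high accuracy by applying the Mahalanobis-Taguchi System (hereinafter "the MT system method" or simply "the MT system").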
【０００２】
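【Prior Art】
DNA microarrays are a new technology developed over the past several years. Various approaches to analyzing their data have been reported (see, e.g., Non-Patent Documents 1, 2, 4 to 7, and 8 to 12), but none has yet become established. Conventional methods of analyzing DNA microarray data are described below.
【０００３】
A DNA microarray can measure the expression of hundreds, thousands, or tens of thousands of genes at once. However, this sheer number of measurements, which should be its strength, also makes the data hard to analyze. Moreover, when genes whose functions are not yet established are mounted on a DNA microarray, analysis of the resulting data becomes "discrimination without an external criterion" (described later), so verifying the analysis method itself becomes a problem. It is common knowledge among researchers that acquiring data with a DNA microarray is easy and that the real problem is the analysis.
【０００４】
The core of DNA microarray data analysis is how to classify the large number of mounted genes on the basis of their data. This classification divides broadly into (1) determining categories (discovering groups) and (2) assigning to categories. The former arises, for example, when genes are to be classified by their expression patterns over time: one must first decide what expression patterns exist and how many kinds there are (group discovery). In this case there is no external criterion for judging whether the determined categories are correct, so this is so-called "unsupervised" classification. The latter (assignment to categories) means, for example, deciding which of the categories determined in the former step an unknown gene falls into. Here one can verify whether or not the gene meets the criteria of that category, so this is so-called "supervised" classification.
【０００５】
Category determination includes hierarchical clustering, self-organizing maps (SOM), gene shaving, and the like, described later; category assignment includes support vector machines (SVM), discriminant analysis, and the like.
【０００６】
Conventional methods of analyzing gene expression data from DNA microarrays were developed as expression-pattern analyses alongside the development of DNA microarrays themselves. That is, the approach developed furthest at the outset was gene expression analysis that used microarray data for cDNA (complementary DNA) of yeast and similar organisms, with the aim of comprehensively classifying all genes on the array by the characteristics of each gene's expression pattern (see, e.g., Non-Patent Document 1).
【０００７】
Non-Patent Document 1 uses "hierarchical clustering" (which falls under "category determination" above) to divide all the data obtained with the microarray into classes and to infer the functions of the unknown genes from the classes that contain them. Specifically, for each gene a vector is defined whose components characterize the expression state under various treatment conditions; the similarity between the genes to be compared is defined by the Pearson correlation coefficient and used as the input (distance measure) for cluster analysis; and the hierarchical clustering that performs the subsequent grouping uses the average-linkage method (see, e.g., Non-Patent Document 3). The document further considered that the function of an unknown gene grouped by this method can be inferred from the known genes in the same group.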
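The similarity measure described for Non-Patent Document 1 can be illustrated with a minimal sketch. The function names and the three expression profiles below are hypothetical and chosen only for illustration; they are not taken from the cited paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_linkage(cluster_a, cluster_b):
    """Average-linkage similarity: mean pairwise correlation between two clusters."""
    pairs = [(p, q) for p in cluster_a for q in cluster_b]
    return sum(pearson(p, q) for p, q in pairs) / len(pairs)

# Hypothetical profiles of three genes under four treatment conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]   # rises together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises

print(pearson(gene_a, gene_b))                        # close to +1
print(pearson(gene_a, gene_c))                        # close to -1
print(average_linkage([gene_a], [gene_b, gene_c]))    # near 0: mixed cluster
```

In a full hierarchical clustering the two most similar clusters would be merged repeatedly using this average-linkage score; only the scoring step is shown here.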
【０００８】
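Which technique is used to analyze gene expression data from DNA microarrays depends on the purpose of the analysis, but even for the same purpose, different papers use different techniques, as noted above. A number of established statistical analysis and pattern recognition methods already existed, but the results differ depending on which of them are combined and how, and on what DNA-microarray-specific refinements are added; a variety of analyses are therefore being tried.
【０００９】
Recently, analysis techniques tied to clinical data (classification of cancers, classification of drug sensitivity) have also been reported as DNA microarray analysis methods (see, e.g., Non-Patent Document 2). These include the comprehensive cDNA analysis above as one part and are, as a whole, more complex analysis methods.
【００１０】
In Non-Patent Document 2, groups of genes are discovered while the dimensionality is reduced, and the patients are then classified. Specifically, dimensionality reduction and grouping are performed by principal component analysis (PCA) and a neural network, narrowing the 6567 genes on the microarray down to 96. The data for those 96 genes are then quantified for each patient and the patients are grouped by hierarchical clustering. The document reports that this made it possible to classify accurately, on the basis of gene expression, cancers that are difficult to classify clinically. The report suggested that DNA microarray analysis could remedy the current situation, in which cases that show no clear difference under conventional clinical classification receive the same treatment although they should be treated differently, and could bring great benefit to cancer treatment in general.
【００１１】
As in Non-Patent Document 2, in clinical data analysis with DNA microarrays the genes to be used for the purpose of the analysis are generally not determined in advance. The usual procedure is first to analyze the expression patterns of hundreds or thousands of genes in order to narrow down which gene data to use (classification and dimensionality reduction), and then to analyze the gene expression data of the individual patients. Because this narrowing-down of genes is data classification without an external criterion (so-called "unsupervised"), how to verify the result becomes a problem. Ultimately the result can be verified to some extent by the outcome of grouping the patients with the selected gene items, but performing the narrowing-down in parallel with that verification is impractical with ordinary analysis methods because it takes an enormous amount of time.
【００１２】
Although there are no reports on clinical data, application of the self-organizing map (SOM) (see, e.g., Non-Patent Document 4), which uses a layered neural network, has been proposed as an "unsupervised learning" analysis of microarrays (see, e.g., Non-Patent Document 5). It was proposed as an improvement on hierarchical clustering, which has drawbacks such as a computational cost that grows with the number of genes and a dendrogram topology that changes easily depending on the given data set.
【００１３】
In addition, attempts to apply "gene shaving" (see, e.g., Non-Patent Document 6), correspondence analysis (CA) (see, e.g., Non-Patent Document 7), and the like have been reported. Whichever is used, however, because it is "unsupervised learning", either the validity of the analysis result cannot be verified, or, even where verification against clinical data is possible, the verification itself is impractical and no judgment can be made until the final result is obtained.
【００１４】
On the other hand, the above method of analyzing the relevant gene data of the patients and dividing them into groups after the genes have been narrowed down corresponds to "supervised learning", in which an external criterion is available in advance and samples are classified into target categories on that basis. It is applied not only to classifying patients but also to classifying individual genes by function and the like. For example, hierarchical clustering used as "supervised learning" (see, e.g., Non-Patent Document 2) and application of the SVM (support vector machine) have been proposed, and analysis by neural networks is also used (see, e.g., Non-Patent Document 8).
【００１５】
Thus, various methods are currently being tried for analyzing DNA microarray data. In particular, how to narrow down genes without external reference data is becoming an even greater issue as attempts to apply DNA microarrays to clinical diagnosis increase.
【００１６】
Reports to date on microarray analysis in clinical diagnosis are listed below. The main ones, besides Non-Patent Document 2 above, are Non-Patent Documents 9 to 12.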
【００１７】
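Non-Patent Document 9 reports on predicting the recurrence of cancer (NSCLC). It first characterizes 2899 genes using the Cox proportional hazards model and then performs hierarchical clustering using those data and the 2899 genes.
【００１８】
Non-Patent Document 10 reports on predicting the efficacy of drug therapy in breast cancer. Using basically the same technique as Non-Patent Document 2, it selects 231 genes from 25000 and performs hierarchical clustering based on each patient's data.
【００１９】
Non-Patent Document 11 reports on predicting the efficacy of drug therapy in esophageal cancer. It statistically computes the U value of the Mann-Whitney test for the data of each of 9216 genes, selects 52 genes, and computes a DRS (drug response score) from the expression states of those genes.
【００２０】
In addition, Non-Patent Document 12 reports predicting the cure rate (survival rate) in DLBCL (diffuse large B-cell lymphoma).
【００２１】
Meanwhile, the MT system method is a pattern recognition method that developed mainly within quality engineering and is used, for example, to analyze data from individual pieces of equipment for equipment monitoring and preventive maintenance (see, e.g., Patent Document 1). There are also reports such as a method of predicting health-examination outcomes by the MT system method (see, e.g., Non-Patent Document 13), but no report applies the MT system method to gene data analysis in DNA microarray analysis.
【００２２】
The reason is as follows. Dr. Genichi Taguchi, who proposed the MT system method, has repeatedly stated at academic meetings and elsewhere that the number of reference data (in the embodiment of the present invention, the number of individual samples) must exceed the number of items (in the embodiment, the number of genes mounted on the DNA microarray plus the number of other clinical data such as blood tests). Because this was established as common knowledge among those skilled in the art, the method was considered unsuitable for analyzing DNA microarrays, whose chief feature is the large number of genes mounted.
【００２３】
A number of reference data at least twice the number of items is generally recommended. This principle is widely known in analyses using the MT system method. For example, in Non-Patent Document 14, which had 26 items, the number of items was deliberately reduced to 22 because "the minimum number of data for which the Mahalanobis distance can be calculated, 52, could not be obtained".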
【００２４】
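【Patent Document 1】
JP 2000-259222 A
【Non-Patent Document 1】
Eisen et al., "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci., 1998, 95, p. 14863-14868
【Non-Patent Document 2】
J. Khan et al., Nature Medicine, 2001, Vol. 7, No. 6, p. 673-679
【Non-Patent Document 3】
Sokal et al., Univ. Kans. Sci. Bull., 1958, 38, p. 1409-1438
【Non-Patent Document 4】
Kohonen T., Proc. IEEE 78, 1991, p. 1464-1480
【Non-Patent Document 5】
Tamayo et al., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation", Proc. Natl. Acad. Sci., 1999, 96, p. 2907-2912
【Non-Patent Document 6】
Hastie et al., "Gene shaving as a method for identifying distinct sets of genes with similar expression patterns", Genome Biology, 2000, 1(2), p. research0003.1-0003.21
【Non-Patent Document 7】
Fellenberg et al., "Correspondence analysis applied to microarray data", Proc. Natl. Acad. Sci., 2001, 98, p. 10781-10786
【Non-Patent Document 8】
Eisaku Maeda (前田英作), "痛快！サポートベクトルマシン" (an introduction to support vector machines), Joho Shori (情報処理), 2001, Vol. 42, No. 7, p. 676-683
【Non-Patent Document 9】
D. A. Wigle et al., Cancer Research, 2002, 62, p. 3005-3008
【Non-Patent Document 10】
L. J. van 't Veer et al., Nature, 2002, Vol. 415, p. 530-536
【Non-Patent Document 11】
C. Kihara et al., Cancer Research, 2001, 61, p. 6474-6479
【Non-Patent Document 12】
M. A. Shipp et al., Nature Medicine, 2002, Vol. 8, No. 1, p. 68-74
【Non-Patent Document 13】
中島 et al., "ＭＴシステム法による健康診断の予測的評価と効率化" (Predictive evaluation and streamlining of health examinations by the MT system method), 日本公衛誌 (Japanese Journal of Public Health), Vol. 46, No. 5, 1999
【Non-Patent Document 14】
"ＭＴＳを活用した企業の利益予測" (Forecasting corporate profit using MTS), 品質工学 (Quality Engineering), Vol. 10, No. 3, p. 96-102, June 2002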
【００２５】
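【Problems to Be Solved by the Invention】
However, conventional gene analysis techniques for DNA microarrays have the problem that analyzing large volumes of gene data, ranging from hundreds to tens of thousands of items, takes a great deal of time.
【００２６】
Specifically, conventional techniques using neural networks that require training data, for example, must express the attributes of all the training data (data whose outcome is known, e.g. clinically judged drug sensitivity) numerically, so analyzing large volumes of gene data takes a great deal of time.
【００２７】
Likewise, conventional techniques using neural network training, calculation of multiple regression coefficients, and the like arrive at an optimal network or set of parameters by repeating similar operations many times, so analyzing large volumes of gene data takes a great deal of time.
【００２８】
In conventional techniques using neural networks and the like, the computed result also varies with the content of the training data and the software used, and the network is not reproducible when it is rebuilt.
【００２９】
In conventional techniques using neural networks and the like, the computation is treated as a black box and the result obtained is not transparent; that is, the validity of the computation that produced the result cannot be verified.
【００３０】
Conventional clustering techniques divide the given data set into groups and thereby express the differences among specimens or genes only as relative differences within that set, so the evaluation results are unstable.
【００３１】
With conventional "unsupervised" techniques for grouping genes, the result is difficult to verify, so trial and error has to be repeated.
【００３２】
Conventional techniques define no procedure for mechanically verifying which items, among the gene expression data and specimen information, are effective for discrimination, so the optimal items cannot be selected efficiently.
【００３３】
Thus, conventional systems and the like currently have numerous problems and, as a result, are inconvenient and inefficient to use for both users and administrators.
【００３４】
As a technique used for predicting and discriminating various phenomena, there is the "MT (Mahalanobis-Taguchi) system method", a technique proposed by Dr. Genichi Taguchi for measuring all kinds of phenomena using the Mahalanobis distance (see, e.g., Genichi Taguchi, "ＭＴシステムにおける技術開発" (Technology Development in the MT System), Japanese Standards Association, 2002).
【００３５】
The principle of the MT system method consists of (1) selecting, from a group of data whose outcome for the phenomenon to be measured is known, a further group of data with little variation in both the data and the outcome, and building a database from this group; and (2) using this database as a "ruler" to analyze unknown phenomena in terms of the Mahalanobis distance.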
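For orientation, the quantity that this "ruler" measures can be written out. The source gives no formula at this point, so the notation below is an assumption introduced here for illustration; it follows the scaled form of the squared Mahalanobis distance that is commonly used with the MT system.

```latex
z_j = \frac{x_j - m_j}{\sigma_j} \quad (j = 1, \dots, k), \qquad
D^2 = \frac{1}{k}\, \mathbf{z}^{\mathsf{T}} R^{-1} \mathbf{z}
```

Here $x_j$ is the value of item $j$ for the sample being evaluated, $m_j$ and $\sigma_j$ are the mean and standard deviation of item $j$ over the reference data, $R$ is the correlation matrix of the reference data, and $k$ is the number of items. With this scaling, the average of $D^2$ over the reference data themselves is approximately 1, so a sample with $D^2$ much larger than 1 lies far from the reference group.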
【００３６】
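As noted above, it is known to use the MT system method for analyzing data from individual pieces of equipment for equipment monitoring and preventive maintenance (see, e.g., Patent Document 1) and for predicting health-examination outcomes (see, e.g., Non-Patent Document 13), but no report applies the MT system method to gene data analysis in DNA microarray analysis.
【００３７】
The present invention has been made in view of the above problems, and its object is to provide a DNA microarray data analyzing method, a DNA microarray data analyzer, a program, and a recording medium that can analyze large volumes of gene data at high speed and with high accuracy by applying the MT system method.
【Means for Solving the Problems】
【００３８】
To achieve this object, the DNA microarray data analyzing method according to claim 1 is characterized by analyzing the measurement data of a DNA microarray using the Mahalanobis-Taguchi System.
【００３９】
According to this method, the measurement data of a DNA microarray are analyzed using the Mahalanobis-Taguchi System, so large volumes of gene data can be analyzed at high speed and with high accuracy by applying the MT system method.
【００４０】
That is, conventional DNA microarray data analyzing methods using neural networks that require training data must express the attributes of all the training data (data whose outcome is known, e.g. clinically judged drug sensitivity) numerically. The MT system method, by contrast, can start the analysis with loosely defined information as reference data, such as whether a drug was effective or not, or whether a sample belongs to a particular group or not, so preparing the reference data (training data) becomes very simple.
【００４１】
Conventional DNA microarray data analyzing methods using neural network training, calculation of multiple regression coefficients, and the like arrive at an optimal network or set of parameters by repeating similar operations many times. The MT system method can generate the database (reference data) in a single calculation and process it at high speed, so the computation time can be made very short.
【００４２】
In conventional DNA microarray data analyzing methods using neural networks and the like, the computation is treated as a black box and the process by which the result is obtained is not transparent (that is, the validity of the computation cannot be verified). In the MT system method, although several calculation methods have been proposed, all of them are published formulas, and it is clear that they are formulas based on the correlations among all the reference data, so the transparency of the computation is assured.
【００４３】
In conventional DNA microarray data analyzing methods using neural networks and the like, the computed result varies with the content of the training data and the software used, and the network is not reproducible when it is rebuilt. The MT system method computes with published formulas, so the same data always yield the same result and reproducibility can be made very high.
【００４４】
DNA microarray data analyzing methods using general clustering techniques divide the given data set into groups and thereby express the differences among specimens or genes only as relative differences within that set. In the MT system method, once the unit space database has been generated, a sample can be expressed in absolute terms as its distance from the origin of this reference, and data outside the analyzed data set can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so the evaluation results are stable every time.
【００４５】
Furthermore, the MT system method defines a procedure (item selection) for mechanically verifying which items (gene expression data or specimen information) are effective for discrimination. Using it, the optimal items can be selected efficiently without being swayed by biased human opinion.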
【００４６】
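The DNA microarray data analyzing method according to claim 2 is characterized by classifying the genes to be analyzed from the measurement data of a DNA microarray using the Mahalanobis-Taguchi System.
【００４７】
According to this method, the genes to be analyzed are classified from the measurement data of a DNA microarray using the Mahalanobis-Taguchi System, so large volumes of gene data can be classified at high speed and with high accuracy by applying the MT system method.
【００４８】
The DNA microarray data analyzing method according to claim 3 is characterized by selecting the genes to be analyzed from the DNA microarray measurement data of samples using the Mahalanobis-Taguchi System, and classifying clinical samples on the basis of the selected gene data using the Mahalanobis-Taguchi System.
【００４９】
According to this method, the genes to be analyzed are selected from the DNA microarray measurement data of samples using the Mahalanobis-Taguchi System, and clinical samples are classified on the basis of the selected gene data using the Mahalanobis-Taguchi System, so large volumes of microarray measurement data can be classified at high speed and with high accuracy by applying the MT system method.
【００５０】
The DNA microarray data analyzing method according to claim 4 is characterized by including: a data group selection step of selecting a data group by collecting, from the measurement data of a DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis; an analysis data item determination step of determining analysis data items, which are the items that serve as input data when discrimination is performed by the MT system method, that is, the items analyzed in the MT system method; a unit space database creation step of creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected in the data group selection step; a Mahalanobis distance calculation step of calculating the Mahalanobis distances of the reference data selected in the data group selection step and, having collected comparison data that clearly do not belong to the reference data, the Mahalanobis distances of those comparison data; a threshold setting step of setting, on the basis of the Mahalanobis distances of the reference data and the comparison data calculated in the Mahalanobis distance calculation step, one or more thresholds that mark the boundary between the reference data group and the comparison data group; and a classification step of calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing discriminant analysis of the unknown data by judging that distance against the threshold or thresholds.
【００５１】
According to this method, a reference data group is selected by collecting, from the measurement data of a DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis; analysis data items, which are the items that serve as input data when discrimination is performed by the MT system method, that is, the items analyzed in the MT system method, are determined; a unit space database representing the Mahalanobis space of the reference data included in the selected reference data group is created; the Mahalanobis distances of the selected reference data and of comparison data that clearly do not belong to the reference data are calculated; one or more thresholds marking the boundary between the reference data group and the comparison data group are set on the basis of the calculated Mahalanobis distances; and the Mahalanobis distance of unknown data is calculated using the unit space database and the selected items and judged against the threshold or thresholds to classify the unknown data. Large volumes of gene data can therefore be analyzed at high speed and with high accuracy by applying the MT system method.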
【００５２】
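That is, the MT system method can start the analysis with loosely defined information as reference data, such as whether a drug was effective or not, or whether a sample belongs to a particular group or not, so preparing the reference data (training data) becomes very simple.
【００５３】
The MT system method can generate the database (reference data) in a single calculation and process it at high speed, so the computation time can be made very short.
【００５４】
Although several calculation methods have been proposed for the MT system method, all of them are published formulas, and it is clear that they are formulas based on the correlations among all the reference data, so the transparency of the computation is assured.
【００５５】
The MT system method computes with published formulas, so the same data always yield the same result and reproducibility can be made very high.
【００５６】
In the MT system method, once the unit space database has been generated, a sample can be expressed in absolute terms as its distance from the origin of this reference, and data outside the analyzed data set can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so the evaluation results are stable every time. Moreover, when a causal relationship is confirmed between the phenomenon exhibited by the object and the Mahalanobis distance, setting multiple thresholds makes it possible to classify multiple states simply by calculating the Mahalanobis distance.
When a proportional relationship holds between the state of the object and the Mahalanobis distance, the state of the object itself can also be expressed by the Mahalanobis distance.
【００５７】
Furthermore, the MT system method defines a procedure (item selection) for mechanically verifying which items (gene expression data or specimen information) are effective for discrimination. Using it, the optimal items can be selected efficiently without being swayed by biased human opinion.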
【００５８】
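The DNA microarray data analyzing method according to claim 5 is the method according to claim 4, further including a discrimination accuracy verification step of separately preparing data that should belong to the reference data but are not in the reference data group, creating a verification data group in which these are mixed with data that do not belong to the reference data, calculating the Mahalanobis distances for the verification data group, and verifying, using the threshold set in the threshold setting step, whether the reference data group and the other populations can be correctly discriminated.
【００５９】
This shows one example of the discrimination accuracy verification step more specifically. According to this method, data that should belong to the reference data but are not in the reference data group are separately prepared, a verification data group in which these are mixed with data that do not belong to the reference data is created, the Mahalanobis distances are calculated for the verification data group, and whether the reference data group and the other populations can be correctly discriminated is verified using the set threshold, so the discrimination work can be performed more efficiently.
【００６０】
The DNA microarray data analyzing method according to claim 6 is the method according to claim 5, further including a reexamination step of reexamining, when the verification step judges that sufficient accuracy cannot be obtained, the reference data group selected in the data group selection step and/or the analysis data items determined in the analysis data item determination step.
【００６１】
This shows one example of the reexamination step more specifically. According to this method, when it is judged that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reexamined, so the discrimination work can be performed with higher accuracy and efficiency.
【００６２】
The DNA microarray data analyzing method according to claim 7 is the method according to claim 6, in which the reexamination step further includes a reference data reexamination step of verifying the degree of variation of the reference data using the standard deviation of each item in the reference data group, the frequency of occurrence of data values, and the like, and using the reference data whose variation is small.
【００６３】
This shows one example of the reference data reexamination step more specifically. According to this method, the degree of variation of the reference data is verified using the standard deviation of each item in the reference data group, the frequency of occurrence of data values, and the like, and the reference data whose variation is small are used; by reexamining the reference data in this way, the discrimination work can be performed with higher accuracy and efficiency.
【００６４】
The DNA microarray data analyzing method according to claim 8 is the method according to claim 6 or 7, in which the reexamination step further includes an analysis data item reexamination step of verifying, over multiple combinations laid out with an orthogonal array, which spots of the DNA microarray to take measurement data from, evaluating the combinations by SN ratio, and then using as analysis data items those items whose use raises the SN ratio.
【００６５】
This shows one example of the analysis data item reexamination step more specifically. According to this method, which spots of the DNA microarray to take measurement data from is verified over multiple combinations laid out with an orthogonal array, the combinations are evaluated by SN ratio, and the items whose use raises the SN ratio are used as analysis data items; by reexamining the analysis data items in this way, the discrimination work can be performed with higher accuracy and efficiency.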
【００６６】
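The present invention also relates to a DNA microarray data analyzer. The DNA microarray data analyzer according to claim 9 is characterized by comprising: data group selection means for selecting a data group by collecting, from the measurement data of a DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis; analysis data item determination means for determining analysis data items, which are the items that serve as input data when discrimination is performed by the MT system method, that is, the items analyzed in the MT system method; unit space database creation means for creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected by the data group selection means; Mahalanobis distance calculation means for calculating the Mahalanobis distances of the reference data selected by the data group selection means and, having collected comparison data that clearly do not belong to the reference data, the Mahalanobis distances of those comparison data; threshold setting means for setting, on the basis of the Mahalanobis distances of the reference data and the comparison data calculated by the Mahalanobis distance calculation means, one or more thresholds that mark the boundary between the reference data group and the comparison data group; and classification means for calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing discriminant analysis of the unknown data by judging that distance against the threshold or thresholds.
【００６７】
According to this analyzer, a reference data group is selected by collecting, from the measurement data of a DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis; analysis data items, which are the items that serve as input data when discrimination is performed by the MT system method, that is, the items analyzed in the MT system method, are determined; a unit space database representing the Mahalanobis space of the reference data included in the selected reference data group is created; the Mahalanobis distances of the selected reference data and of comparison data that clearly do not belong to the reference data are calculated; one or more thresholds marking the boundary between the reference data group and the comparison data group are set on the basis of the calculated Mahalanobis distances; and the Mahalanobis distance of unknown data is calculated using the unit space database and the selected items and judged against the threshold or thresholds to classify the unknown data. Large volumes of gene data can therefore be analyzed at high speed and with high accuracy by applying the MT system method.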
【００６８】
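That is, the MT system method can start the analysis with loosely defined information as reference data, such as whether a drug was effective or not, or whether a sample belongs to a particular group or not, so preparing the reference data (training data) becomes very simple.
【００６９】
The MT system method can generate the database (reference data) in a single calculation and process it at high speed, so the computation time can be made very short.
【００７０】
Although several calculation methods have been proposed for the MT system method, all of them are published formulas, and it is clear that they are formulas based on the correlations among all the reference data, so the transparency of the computation is assured.
【００７１】
The MT system method computes with published formulas, so the same data always yield the same result and reproducibility can be made very high.
【００７２】
In the MT system method, once the unit space database has been generated, a sample can be expressed in absolute terms as its distance from the origin of this reference, and data outside the analyzed data set can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so the evaluation results are stable every time. Moreover, when a causal relationship is confirmed between the phenomenon exhibited by the object and the Mahalanobis distance, setting multiple thresholds makes it possible to classify multiple states simply by calculating the Mahalanobis distance.
When a proportional relationship holds between the state of the object and the Mahalanobis distance, the state of the object itself can also be expressed by the Mahalanobis distance.
【００７３】
Furthermore, the MT system method defines a procedure (item selection) for mechanically verifying which items (gene expression data or specimen information) are effective for discrimination. Using it, the optimal items can be selected efficiently without being swayed by biased human opinion.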
【００７４】
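The DNA microarray data analyzer according to claim 10 is the analyzer according to claim 9, further comprising discrimination accuracy verification means for separately preparing data that should belong to the reference data but are not in the reference data group, creating a verification data group in which these are mixed with data that do not belong to the reference data, calculating the Mahalanobis distances for the verification data group, and verifying, using the threshold set by the threshold setting means, whether the reference data group and the other populations can be correctly discriminated.
【００７５】
This shows one example of the discrimination accuracy verification means more specifically. According to this analyzer, data that should belong to the reference data but are not in the reference data group are separately prepared, a verification data group in which these are mixed with data that do not belong to the reference data is created, the Mahalanobis distances are calculated for the verification data group, and whether the reference data group and the other populations can be correctly discriminated is verified using the set threshold, so the discrimination work can be performed more efficiently.
【００７６】
The DNA microarray data analyzer according to claim 11 is the analyzer according to claim 10, further comprising reexamination means for reexamining, when the verification means judges that sufficient accuracy cannot be obtained, the reference data group selected by the data group selection means and/or the analysis data items determined by the analysis data item determination means.
【００７７】
This shows one example of the reexamination means more specifically. According to this analyzer, when it is judged that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reexamined, so the discrimination work can be performed with higher accuracy and efficiency.
【００７８】
The DNA microarray data analyzer according to claim 12 is the analyzer according to claim 11, in which the reexamination means further comprises reference data reexamination means for verifying the degree of variation of the reference data using the standard deviation of each item in the reference data group, the frequency of occurrence of data values, and the like, and using the reference data whose variation is small.
【００７９】
This shows one example of the reference data reexamination means more specifically. According to this analyzer, the degree of variation of the reference data is verified using the standard deviation of each item in the reference data group, the frequency of occurrence of data values, and the like, and the reference data whose variation is small are used; by reexamining the reference data in this way, the discrimination work can be performed with higher accuracy and efficiency.
【００８０】
The DNA microarray data analyzer according to claim 13 is the analyzer according to claim 11 or 12, in which the reexamination means further comprises analysis data item reexamination means for verifying, over multiple combinations laid out with an orthogonal array, which spots of the DNA microarray to take measurement data from, evaluating the combinations by SN ratio, and then using as analysis data items those items whose use raises the SN ratio.
【００８１】
This shows one example of the analysis data item reexamination means more specifically. According to this analyzer, which spots of the DNA microarray to take measurement data from is verified over multiple combinations laid out with an orthogonal array, the combinations are evaluated by SN ratio, and the items whose use raises the SN ratio are used as analysis data items; by reexamining the analysis data items in this way, the discrimination work can be performed with higher accuracy and efficiency.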
【００８２】
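The present invention also relates to a program that causes a computer to execute the DNA microarray data analyzing method. The program according to claim 14 is characterized by causing a computer to execute a DNA microarray data analyzing method including: a data group selection step of selecting a data group by collecting, from the measurement data of a DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis; an analysis data item determination step of determining analysis data items, which are the items that serve as input data when discrimination is performed by the MT system method, that is, the items analyzed in the MT system method; a unit space database creation step of creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected in the data group selection step; a Mahalanobis distance calculation step of calculating the Mahalanobis distances of the reference data selected in the data group selection step and, having collected comparison data that clearly do not belong to the reference data, the Mahalanobis distances of those comparison data; a threshold setting step of setting, on the basis of the Mahalanobis distances of the reference data and the comparison data calculated in the Mahalanobis distance calculation step, one or more thresholds that mark the boundary between the reference data group and the comparison data group; and a classification step of calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing discriminant analysis of the unknown data by judging that distance against the threshold or thresholds.
【００８３】
According to this program, a reference data group is selected by collecting, from the measurement data of a DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis; analysis data items, which are the items that serve as input data when discrimination is performed by the MT system method, that is, the items analyzed in the MT system method, are determined; a unit space database representing the Mahalanobis space of the reference data included in the selected reference data group is created; the Mahalanobis distances of the selected reference data and of comparison data that clearly do not belong to the reference data are calculated; one or more thresholds marking the boundary between the reference data group and the comparison data group are set on the basis of the calculated Mahalanobis distances; and the Mahalanobis distance of unknown data is calculated using the unit space database and the selected items and judged against the threshold or thresholds to classify the unknown data. Large volumes of gene data can therefore be analyzed at high speed and with high accuracy by applying the MT system method.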
【００８４】
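That is, the MT system method can start the analysis with loosely defined information as reference data, such as whether a drug was effective or not, or whether a sample belongs to a particular group or not, so preparing the reference data (training data) becomes very simple.
【００８５】
The MT system method can generate the database (reference data) in a single calculation and process it at high speed, so the computation time can be made very short.
【００８６】
Although several calculation methods have been proposed for the MT system method, all of them are published formulas, and it is clear that they are formulas based on the correlations among all the reference data, so the transparency of the computation is assured.
【００８７】
The MT system method computes with published formulas, so the same data always yield the same result and reproducibility can be made very high.
【００８８】
In the MT system method, once the unit space database has been generated, a sample can be expressed in absolute terms as its distance from the origin of this reference, and data outside the analyzed data set can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so the evaluation results are stable every time. Moreover, when a causal relationship is confirmed between the phenomenon exhibited by the object and the Mahalanobis distance, setting multiple thresholds makes it possible to classify multiple states simply by calculating the Mahalanobis distance.
When a proportional relationship holds between the state of the object and the Mahalanobis distance, the state of the object itself can also be expressed by the Mahalanobis distance.
【００８９】
Furthermore, the MT system method defines a procedure (item selection) for mechanically verifying which items (gene expression data or specimen information) are effective for discrimination. Using it, the optimal items can be selected efficiently without being swayed by biased human opinion.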
【００９０】
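The program according to claim 15 is the program according to claim 14, further including a discrimination accuracy verification step of separately preparing data that should belong to the reference data but are not in the reference data group, creating a verification data group in which these are mixed with data that do not belong to the reference data, calculating the Mahalanobis distances for the verification data group, and verifying, using the threshold set in the threshold setting step, whether the reference data group and the other populations can be correctly discriminated.
【００９１】
This shows one example of the discrimination accuracy verification step more specifically. According to this program, data that should belong to the reference data but are not in the reference data group are separately prepared, a verification data group in which these are mixed with data that do not belong to the reference data is created, the Mahalanobis distances are calculated for the verification data group, and whether the reference data group and the other populations can be correctly discriminated is verified using the set threshold, so the discrimination work can be performed more efficiently.
【００９２】
The program according to claim 16 is the program according to claim 15, further including a reexamination step of reexamining, when the verification step judges that sufficient accuracy cannot be obtained, the reference data group selected in the data group selection step and/or the analysis data items determined in the analysis data item determination step.
【００９３】
This shows one example of the reexamination step more specifically. According to this program, when it is judged that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reexamined, so the discrimination work can be performed with higher accuracy and efficiency.
【００９４】
The program according to claim 17 is the program according to claim 16, in which the reexamination step further includes a reference data reexamination step of verifying the degree of variation of the reference data using the standard deviation of each item in the reference data group, the frequency of occurrence of data values, and the like, and using the reference data whose variation is small.
【００９５】
This shows one example of the reference data reexamination step more specifically. According to this program, the degree of variation of the reference data is verified using the standard deviation of each item in the reference data group, the frequency of occurrence of data values, and the like, and the reference data whose variation is small are used; by reexamining the reference data in this way, the discrimination work can be performed with higher accuracy and efficiency.
【００９６】
The program according to claim 18 is the program according to claim 16 or 17, in which the reexamination step further includes an analysis data item reexamination step of verifying, over multiple combinations laid out with an orthogonal array, which spots of the DNA microarray to take measurement data from, evaluating the combinations by SN ratio, and then using as analysis data items those items whose use raises the SN ratio.
【００９７】
This shows one example of the analysis data item reexamination step more specifically. According to this program, which spots of the DNA microarray to take measurement data from is verified over multiple combinations laid out with an orthogonal array, the combinations are evaluated by SN ratio, and the items whose use raises the SN ratio are used as analysis data items; by reexamining the analysis data items in this way, the discrimination work can be performed with higher accuracy and efficiency.
【００９８】
The present invention also relates to a recording medium. The recording medium according to claim 19 is characterized by having recorded on it the program according to any one of claims 14 to 18.
【００９９】
According to this recording medium, by having a computer read and execute the program recorded on it, the program according to any one of claims 14 to 18 can be realized using a computer, and the same effects as those of the respective methods can be obtained.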
【０１００】
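【Embodiments of the Invention】
Embodiments of the DNA microarray data analyzing method, DNA microarray data analyzer, program, and recording medium according to the present invention are described in detail below with reference to the drawings. The invention is not limited by these embodiments.
In particular, the following embodiments describe an example in which the invention is applied to predicting the results of DNA microarray experiments on interferon (IFN) sensitivity, but the invention is not limited to this case and can be applied in the same way to the analysis of other DNA microarray data.
【０１０１】
[Overview of the Invention]
An overview of the invention is given below, followed by a detailed description of its configuration, processing, and so on. Fig. 1 is a diagram showing the basic principle of the invention.
In outline, the invention has the following basic features. The MT system method is a technique proposed by Dr. Genichi Taguchi for measuring all kinds of phenomena using the Mahalanobis distance. Its principle is to select, from a group of data whose outcome for the phenomenon to be measured is known, a further group of data with little variation in both the data and the outcome (the reference data group), and to build from it a database representing the correlations among all the input data. Using this database as a "ruler", unknown phenomena are analyzed in terms of the Mahalanobis distance, which makes prediction and discrimination of various phenomena practicable.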
【０１０２】
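When the MT system method is applied to the analysis of DNA microarray measurement data, the test result data of the DNA microarray plus the specimen data and the like are the input data, and the predicted state of the specimen, or the change predicted for it in the future, is the analysis result. In the course of grouping the samples, the genes to be analyzed are narrowed down (the search for effective genes) while the accuracy is checked from the outset.
【０１０３】
Conventional DNA microarray analysis generally relies on two broad kinds of technique. One is quantification of the analysis results (for example, techniques based on neural networks, multiple regression analysis, or fuzzy theory); the other is grouping on the basis of the quantified data.
【０１０４】
Conventionally, DNA microarray analysis has processed data by using these two techniques in two stages. For example, the quantification of the similarity between genes performed in Non-Patent Document 1 corresponds to the quantification referred to here, and it serves as the input to the subsequent clustering (grouping). As a result, analyzing large volumes of data, for example hundreds to tens of thousands of items, took a great deal of computation time, and the accuracy of the analysis was not very high. The narrowing-down of genes in Non-Patent Document 2 is not a two-stage process of this kind, but because it performs grouping by neural network an enormous number of times, it can be expected to take several days of ordinary computer calculation.
【０１０５】
As to how the quantification and grouping above relate to the category determination and category assignment described under "Prior Art": quantification and grouping can be said to take place within each of category determination and category assignment. However, as with the neural network of Non-Patent Document 2, not every method divides into stages; some combine both in a single process.
【０１０６】
In contrast, the MT system creates a database and uses it in the subsequent calculations. When the MT system is used for DNA microarray analysis, the genes are narrowed down while the accuracy is checked from the outset in the course of grouping the samples, so the two operations above can be performed together with high accuracy and at high speed. This makes it possible to classify specimens by gene expression state, analyze the roles of the genes themselves, and so on, efficiently.
【０１０７】
That is, using the MT system method for DNA microarray analysis makes it possible to express the position of each specimen by its Mahalanobis distance from the reference data group, so that quantification of the analysis results and clustering can be performed together and the analysis becomes more efficient. As a result, the genes and specimen information that appear to be effective can be found efficiently.
【０１０８】
Furthermore, by building a database of the correlations among all the input data, phenomena that cannot be detected from single data items alone can be captured.
【０１０９】
The inventor thus found that the MT system method described above can be applied to DNA microarray analysis. The inventor carried out the analysis using commercially available software with an MT system data analysis function (for example, "PRAT" (product name) from Probe (company name)). Several commercial MT system software products exist, and any of them can be used.
【０１１０】
In practice, however, there were problems to overcome in performing DNA microarray analysis with commercial software. It is established common knowledge among those skilled in the art that the MT system method requires more reference data than items. In the present invention the number of reference data corresponds to the number of individual samples, and the number of items to the number of genes mounted on the DNA microarray plus the number of other clinical data such as blood tests. In analysis with DNA microarrays, whose chief feature is the large number of genes mounted, satisfying that condition is nearly impossible.
【０１１１】
Moreover, when DNA microarray analysis was actually performed with commercial software, problems specific to DNA microarray analysis made it difficult to apply the MT system method: for example, many genes have extremely low expression levels, and the CV (coefficient of variation) between microarrays is unstable.
【０１１２】
The inventor solved the former problem by finding that an effective unit space can be created even from far fewer reference data than items if the more highly characteristic items are selected in advance. For the latter problem, the inventor devised a way of trying combinations of items laid out with an orthogonal array and comparing the calculated results by S/N ratio in order to select the effective items.
【０１１３】
Furthermore, the MT system method is said to require two kinds of data when the unit space is created: the reference data group from which the unit space is built, and an abnormal data group that can be clearly identified as "abnormal" relative to it (see "ＭＴシステムにおける技術開発", Japanese Standards Association, pp. 428-429, "Setting the measurer's normal data and abnormal data"). In clinical practice, however, quite a few samples are ambiguous and it is unclear which data group they belong to, so the judgment is often difficult. The inventor therefore first set the unit space using only the reference data and then verified it with comparison data for evaluation, which allows the analysis to proceed while tolerating samples for which the judgment has been deferred.
【０１１４】
For example, when this was applied to predicting the results of DNA microarray experiments on interferon (IFN) sensitivity, the results of 18 of 88 samples were selected as reference data using the measures above and the remaining 70 samples were used for verification. It was found that the patients' IFN sensitivity could be predicted with high frequency (80 to 90%). These measures and other improvements are described in detail later.
【０１１５】
Thus, simply applying the MT system method as it stands would not have been practically effective; it was the ingenuity of the inventors that made application of the MT system method to DNA microarray analysis practically effective.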
【０１１６】
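[System Configuration]
First, the configuration of the system is described. Fig. 2 is a block diagram showing an example of the configuration of the system to which the invention is applied; it shows conceptually only those parts of the configuration that relate to the invention. In outline, the system comprises a DNA microarray data analyzer 100 and an external system 200, which provides external databases of sequence information and the like and external programs such as homology search, connected communicably via a network 300.
【０１１７】
In Fig. 2, the network 300 has the function of interconnecting the DNA microarray data analyzer 100 and the external system 200 and is, for example, the Internet.
【０１１８】
In Fig. 2, the external system 200 is interconnected with the DNA microarray data analyzer 100 via the network 300 and has the function of providing users with external databases of sequence information and the like and with websites that run external programs such as homology search and motif search.
【０１１９】
The external system 200 may be configured as a web server, an ASP server, or the like, and its hardware may consist of a commercially available information processing device such as a workstation or personal computer and its peripherals. Each function of the external system 200 is realized by the CPU, disk device, memory device, input device, output device, communication control device, and so on in its hardware configuration and by the programs that control them.
【０１２０】
In Fig. 2, the DNA microarray data analyzer 100 comprises, in outline, a control unit 102 such as a CPU that controls the whole analyzer; a communication control interface unit 104 connected to a communication device (not shown) such as a router connected to a communication line; an input/output control interface unit 108 connected to an input device 112 and an output device 114; and a storage unit 106 that stores various databases and tables. These units are communicably connected via arbitrary communication paths. The DNA microarray data analyzer 100 is also communicably connected to the network 300 via a communication device such as a router and a wired or wireless communication line such as a dedicated line.
【０１２１】
The various databases and tables stored in the storage unit 106 (DNA microarray measurement data file 106a to verification data group file 106g) are held on storage means such as a fixed disk device, which stores the programs, tables, files, databases, web page files, and so on used in the various kinds of processing.
【０１２２】
Of these components of the storage unit 106, the DNA microarray measurement data file 106a is a file storing the measurement data of DNA microarrays (for example, test result data such as the expression data of each gene, and specimen data). Fig. 5 shows an example of the information stored in the DNA microarray measurement data file 106a.
【０１２３】
As shown in Fig. 5, the information stored in the DNA microarray measurement data file 106a associates with one another a sample ID that uniquely identifies each sample, test result data such as the expression data of each gene, specimen data, and so on. The specimen data include items of clinical information (for example, liver function data from blood tests).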
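One way to picture a record in this file is the following sketch. The class name, field names, and sample values are hypothetical; the source specifies only that a sample ID, test result data, and specimen data are stored in association with one another.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SampleRecord:
    """One entry of the measurement data file (hypothetical layout)."""
    sample_id: str                       # unique identifier of the sample
    expression: Dict[str, float]         # test result data: gene name -> expression value
    specimen: Dict[str, float] = field(default_factory=dict)  # e.g. blood-test values

    def vector(self, item_names: List[str]) -> List[float]:
        """Return the values of the requested items in a fixed order."""
        merged = {**self.expression, **self.specimen}
        return [merged[name] for name in item_names]

# Hypothetical record: two gene spots plus one liver-function value.
record = SampleRecord("S001", {"gene_0001": 0.82, "gene_0002": 1.47}, {"ALT": 54.0})
print(record.vector(["gene_0002", "ALT"]))
```

Keeping gene expression values and specimen data side by side in one record makes it straightforward to mix both kinds of item in the input vector used for the distance calculation.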
【０１２４】
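The reference data group file 106b is reference data group storage means that stores information on the reference data group.
【０１２５】
The comparison data group file 106c is comparison data group storage means that stores information on the comparison data group.
【０１２６】
The analysis data items 106d are analysis data item storage means that stores information on the analysis data items.
【０１２７】
The unit space database 106e is unit space database storage means that stores information on the database representing the unit space (Mahalanobis space), which is obtained by computing correlation coefficients, standard deviations, and so on from the reference data included in the selected reference data group.
【０１２８】
The Mahalanobis distance file 106f is Mahalanobis distance data storage means that stores information on the Mahalanobis distance of each reference datum and each comparison datum.
【０１２９】
The verification data group file 106g is verification data group storage means that stores information on the verification data group.
【０１３０】
In Fig. 2, the communication control interface unit 104 controls communication between the DNA microarray data analyzer 100 and the network 300 (or a communication device such as a router). That is, the communication control interface unit 104 has the function of exchanging data with other terminals via communication lines.
【０１３１】
In Fig. 2, the input/output control interface unit 108 controls the input device 112 and the output device 114. A monitor (including a home television set) or a speaker can be used as the output device 114 (below, the output device 114 is sometimes described as a monitor). A keyboard, a mouse, a microphone, and the like can be used as the input device 112. The monitor also works together with the mouse to provide a pointing-device function.
【０１３２】
In Fig. 2, the control unit 102 has internal memory for storing control programs such as an OS (operating system), programs that define the various processing procedures, and the required data, and performs information processing to execute the various processes by means of these programs. Functionally and conceptually, the control unit 102 comprises an analysis purpose determination unit 102a, a data group selection unit 102b, an analysis data item determination unit 102c, a unit space database creation unit 102d, a Mahalanobis distance calculation unit 102e, a threshold setting unit 102f, a discrimination accuracy verification unit 102g, a reexamination unit 102h, and a finalization unit 102i.
【０１３３】
Of these, the analysis purpose determination unit 102a is analysis purpose determination means that determines the item to be discriminated from the DNA microarray data when the measurement data of the DNA microarray are analyzed by the MT system method.
【０１３４】
The data group selection unit 102b is data group selection means that selects a reference data group by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis, and selects a comparison data group by collecting comparison data that clearly do not belong to the reference data.
【０１３５】
The analysis data item determination unit 102c is analysis data item determination means that determines the analysis data items, which are the items that serve as input data when discrimination is performed by the MT system method, that is, the items analyzed in the MT system method.
【０１３６】
The unit space database creation unit 102d is unit space database creation means that creates a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected by the data group selection unit 102b.
【０１３７】
The Mahalanobis distance calculation unit 102e is Mahalanobis distance calculation means that calculates the Mahalanobis distances for the reference data and the comparison data selected by the data group selection unit 102b.
【０１３８】
The threshold setting unit 102f is threshold setting means that sets, on the basis of the Mahalanobis distances of the reference data and the comparison data calculated by the Mahalanobis distance calculation unit 102e, the threshold that marks the boundary between the reference data group and the comparison data group.
【０１３９】
The discrimination accuracy verification unit 102g is discrimination accuracy verification means that separately prepares data that should belong to the reference data but are not in the reference data group, creates a verification data group in which these are mixed with data that do not belong to the reference data, calculates the Mahalanobis distances for the verification data group, and verifies, using the threshold set by the threshold setting unit 102f, whether the reference data group and the other populations can be correctly discriminated.
【０１４０】
The reexamination unit 102h is reexamination means that reexamines, when the discrimination accuracy verification unit 102g judges that sufficient accuracy cannot be obtained, the reference data group selected by the data group selection unit 102b and/or the analysis data items determined by the analysis data item determination unit 102c. Fig. 6 is a block diagram showing an example of the configuration of the reexamination unit 102h. As shown in Fig. 6, the reexamination unit 102h comprises a reference data reexamination unit 102j and an analysis data item reexamination unit 102k.
【０１４１】
The reference data reexamination unit 102j is reference data reexamination means that verifies the degree of variation of the reference data using the standard deviation of each item in the reference data group, the frequency of occurrence of data values, and the like, and causes the reference data whose variation is small to be used.
【０１４２】
The analysis data item reexamination unit 102k is analysis data item reexamination means that verifies, over multiple combinations laid out with an orthogonal array, which spots of the DNA microarray to take measurement data from, evaluates the combinations by SN ratio, and causes the items whose use raises the SN ratio to be used as analysis data items.
It also includes the function of examining items by other statistical methods, either alone or in combination with the above method.
【０１４３】
Returning to Fig. 2, the finalization unit 102i is finalization means that fixes the unit space database and the threshold at the stage when the verification by the discrimination accuracy verification unit 102g shows that sufficient accuracy can be secured.
【０１４４】
The processing performed by each of these units is described in detail later.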
【０１４５】
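[System Processing]
Next, an example of the processing of the system of this embodiment configured as above is described in detail below with reference to Fig. 3 and other figures.
【０１４６】
[DNA Microarray Analysis Processing by the MT System]
First, the DNA microarray analysis processing by the MT system is described in detail with reference to Fig. 3 and other figures. Fig. 3 is a flowchart showing an example of the DNA microarray analysis processing by the MT system in the system of this embodiment.
【０１４７】
(1) Determining the analysis purpose (defining the objective)
First, by the processing of the analysis purpose determination unit 102a, the DNA microarray data analyzer 100 determines the item to be discriminated from the DNA microarray data (the discrimination item) when the measurement data of the DNA microarray are analyzed by the MT system method, and thereby makes the purpose of the analysis explicit (step SA-1).
【０１４８】
For example, when discriminating whether interferon administration to a hepatitis C patient is effective or ineffective, the discrimination item is the efficacy of interferon administration.
【０１４９】
(2) Selecting the reference data group and the comparison data group (selection of reference data and comparison data)
Because the MT system builds a database from a reference population and uses it, a group corresponding to the discrimination item that is the purpose of the analysis must be collected and used as the source data of the database (the reference data group). In addition to the reference data group, a data group that differs from the reference must be collected for comparison and selected as the comparison data group. The accuracy of the analysis is judged by whether these groups can be discriminated correctly.
【０１５０】
To this end, by the processing of the data group selection unit 102b, the DNA microarray data analyzer 100 accesses the DNA microarray measurement data file 106a, selects a reference data group by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item that is the purpose of the analysis, and selects a comparison data group by collecting comparison data that clearly do not belong to the reference data (step SA-2).
【０１５１】
The population chosen as reference data should preferably have as little variation as possible, and it is preferable to choose a population from which a large number of samples can be obtained.
【０１５２】
For example, when the efficacy of interferon administration is to be analyzed and, among the held data whose administration outcome is known, the specimens for which administration was effective are numerous, the effective specimens are selected as reference data and form the reference data group. If the ineffective specimens are more numerous, the reverse applies. For the comparison data group, a population that clearly does not belong to the reference data group is chosen. In this example, a "reference datum" is an individual patient specimen, and a "reference data group" is a grouped set of patients (the group with IFN sensitivity or the group without it).
【０１５３】
The data group selection unit 102b then stores the selected reference data group in a predetermined storage area of the reference data group file 106b and the selected comparison data group in a predetermined storage area of the comparison data group file 106c.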
【０１５４】
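(3) Determining the analysis data items (determination of input items)
Next, by the processing of the analysis data item determination unit 102c, the DNA microarray data analyzer 100 accesses the DNA microarray measurement data file 106a through the comparison data group file 106c and determines the "analysis data items", that is, the items (also called features or parameters) that serve as "input data" when discrimination is performed by the MT system, in other words the items analyzed in the MT system (step SA-3).
【０１５５】
Here, the item that is the result of the analysis must not be included among the analysis data items. For example, when the efficacy of interferon administration is to be analyzed, the measurement results of the DNA microarray (the data of each gene) and items such as the age and sex of the specimen donor are the input data (analysis data items).
【０１５６】
The analysis data item determination unit 102c then stores the determined analysis data items in a predetermined storage area of the analysis data items 106d.
【０１５７】
(4) Creating the unit space database
Next, by the processing of the unit space database creation unit 102d, the DNA microarray data analyzer 100 accesses the reference data group file 106b and the like, computes correlation coefficients, standard deviations, and so on from each reference datum included in the selected reference data group, and creates a database representing their unit space (Mahalanobis space) (step SA-4).
【０１５８】
In the MT system there are several variants of this calculation, for example the method using the inverse matrix, Schmidt orthogonal expansion, and the method using cofactors; any of them may be used in the present invention.
【０１５９】
As an example, calculation using the inverse matrix proceeds as follows. From the reference data gathered from the group for which interferon administration is known to have been effective, the mean and standard deviation of every item are computed; each datum is normalized; and from the normalized data the correlation matrix and its inverse are computed. Of these, the means, the standard deviations, and the inverse matrix constitute the unit space database (see, e.g., "Procedure 3" on p. 5 or "7.3.1 Mahalanobis distance" on p. 71 of the reference "ＭＴシステムにおける技術開発" cited above).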
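The inverse-matrix procedure just described can be sketched as follows. This is a minimal illustration, not the calculation performed by any particular commercial product; the function names are hypothetical, and the use of the population standard deviation (dividing by the number of samples) is an assumption made here.

```python
import math

def invert(m):
    """Gauss-Jordan inverse with partial pivoting; raises if the matrix is singular."""
    k = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(m)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def build_unit_space(reference):
    """reference: list of samples, each a list of k item values.

    Returns (means, stds, inverse correlation matrix), i.e. the three
    parts that make up the unit space database in the inverse-matrix method.
    """
    n, k = len(reference), len(reference[0])
    means = [sum(row[j] for row in reference) / n for j in range(k)]
    # Assumption: population standard deviation (divide by n).
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in reference) / n)
            for j in range(k)]
    # Normalize each datum, then form the correlation matrix.
    z = [[(row[j] - means[j]) / stds[j] for j in range(k)] for row in reference]
    corr = [[sum(z[i][a] * z[i][b] for i in range(n)) / n for b in range(k)]
            for a in range(k)]
    return means, stds, invert(corr)

# Hypothetical reference group: four samples, two items.
means, stds, inv_corr = build_unit_space([[1, 2], [2, 1], [3, 4], [4, 3]])
print(means, stds, inv_corr)
```

The `ValueError` branch is where the sample-count condition discussed earlier shows up in practice: with fewer reference samples than items the correlation matrix cannot be inverted, which is why the items have to be narrowed down first.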
【０１６０】
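The unit space database creation unit 102d then stores the created unit space database in a predetermined storage area of the unit space database 106e.
【０１６１】
(5) Calculating the Mahalanobis distances of the reference data and the comparison data
Next, by the processing of the Mahalanobis distance calculation unit 102e, the DNA microarray data analyzer 100 accesses the unit space database 106e and the like and, with respect to the unit space database created in step SA-4, calculates the Mahalanobis distance of each reference datum and each comparison datum selected in step SA-2 (step SA-5). This processing yields a distance for each individual patient, and this distance is the measurement result of the MT system method. There are several ways of calculating the Mahalanobis distance, and the present invention may use any of them.
【０１６２】
As an example, calculation using the inverse matrix proceeds as follows. The Mahalanobis distance is calculated from the unit space database created in (4) and the values of each datum. This value indicates the efficacy of interferon administration (see, e.g., "Procedure 3" on p. 5 or "7.3.1 Mahalanobis distance" on p. 71 of the reference "ＭＴシステムにおける技術開発" cited above).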
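A minimal sketch of this distance calculation is shown below. The function name and the two-item unit space are hypothetical; dividing by the number of items follows the scaling commonly used with the MT system and is an assumption here, since the source defers the formula to its reference.

```python
def scaled_mahalanobis_d2(sample, means, stds, inv_corr):
    """Squared Mahalanobis distance of one sample from the unit space, divided by k."""
    k = len(sample)
    # Normalize the sample with the means and standard deviations of the reference data.
    z = [(x - m) / s for x, m, s in zip(sample, means, stds)]
    # Quadratic form z^T R^-1 z, then scale by the number of items.
    return sum(z[a] * inv_corr[a][b] * z[b] for a in range(k) for b in range(k)) / k

# Hypothetical unit space for two items (means, standard deviations, inverse correlation matrix).
means = [2.5, 2.5]
stds = [1.25 ** 0.5, 1.25 ** 0.5]
inv_corr = [[1.5625, -0.9375], [-0.9375, 1.5625]]

d2_reference_like = scaled_mahalanobis_d2([1, 2], means, stds, inv_corr)
d2_far = scaled_mahalanobis_d2([4, 1], means, stds, inv_corr)
print(d2_reference_like, d2_far)
```

The second sample is no further from the means than the first on either item taken alone; its larger distance comes from breaking the positive correlation between the two items, which is the kind of effect that single-item comparisons miss.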
【０１６３】
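The Mahalanobis distance calculation unit 102e then stores each calculated Mahalanobis distance in a predetermined storage area of the Mahalanobis distance file 106f.
【０１６４】
(6) Setting the threshold
Next, by the processing of the threshold setting unit 102f, the DNA microarray data analyzer 100 accesses the Mahalanobis distance file 106f and the like and, on the basis of the Mahalanobis distances of the reference data and the comparison data calculated in step SA-5, sets the value (threshold) that marks the boundary between the reference data group and the comparison data group (step SA-6).
【０１６５】
Specifically, as shown in Fig. 4, the distribution may be represented visually in a graph such as a histogram and examined, or the distribution of the distances may be analyzed mathematically. Fig. 4 shows an example in which the reference data group and the comparison data group are displayed in a distance histogram and a threshold is set.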
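As one possible form of the "mathematical" route, the threshold can be chosen to minimize the number of misclassified samples. This rule, the function name, and the distance values are assumptions made for illustration; the source does not prescribe a specific rule.

```python
def choose_threshold(ref_d2, cmp_d2):
    """Return (threshold, errors) minimizing misclassifications.

    A sample is treated as reference-like when its distance is <= threshold.
    Candidate thresholds are the midpoints between adjacent observed distances.
    """
    candidates = sorted(set(ref_d2) | set(cmp_d2))
    best_t, best_err = None, None
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        # Reference samples above t and comparison samples at or below t are errors.
        err = sum(d > t for d in ref_d2) + sum(d <= t for d in cmp_d2)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Hypothetical distances of reference and comparison samples.
reference_distances = [0.8, 1.0, 1.2, 1.5]
comparison_distances = [3.0, 4.5, 6.0]
threshold, errors = choose_threshold(reference_distances, comparison_distances)
print(threshold, errors)
```

When the two distributions overlap, the same search returns the cut with the fewest errors, which corresponds roughly to reading the crossing point off the two histograms.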
【０１６６】
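One threshold or several may be set. In the MT system method, when a causal relationship is confirmed between the phenomenon exhibited by the object and the Mahalanobis distance, setting multiple thresholds makes it possible to classify multiple states simply by calculating the Mahalanobis distance.
【０１６７】
As a concrete example, when a proportional relationship holds between the Mahalanobis distance and the effect of interferon administration, the Mahalanobis distance itself can be taken as the predicted effect of interferon administration, and nine thresholds that divide the predicted values into ten groups can be set to give a ten-grade index of the predicted effect of interferon administration.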
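The ten-grade index can be sketched with nine sorted thresholds and a binary search. The evenly spaced threshold values below are hypothetical placeholders; in practice they would be derived from the observed distances.

```python
from bisect import bisect_right

def grade(d2, thresholds):
    """Map a distance to a grade from 1 to len(thresholds) + 1.

    thresholds must be sorted in ascending order; a distance exactly equal
    to a threshold falls into the higher grade.
    """
    return bisect_right(thresholds, d2) + 1

# Hypothetical nine thresholds giving ten grades.
thresholds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

print(grade(0.5, thresholds))   # lowest grade
print(grade(4.2, thresholds))   # a middle grade
print(grade(12.0, thresholds))  # highest grade
```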
【０１６８】
また、基準データ群以外の集団が複数存在する場合は、マハラノビス距離の大きさで集団を分割し複数集団を判別させる場合もある。例えば、右図では基準データ群と比較データ群が交差している。基準データをインターフェロン投与が有効な集団、比較データを無効な集団とすると、双方をより高い確率で判別させたい場合、この交差点のマハラノビス距離を閾値として設定する事が考えられる。
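[0169]
(7) Verification of discrimination accuracy using a verification data group
Next, through the processing of the judgment accuracy verification unit 102g, the DNA microarray data analysis device 100 accesses the DNA microarray measurement data file 106a and related files, separately prepares data that should belong to the reference data but are not in the reference data group, mixes them with data that do not belong to the reference data to create a verification data group, calculates the Mahalanobis distances for this verification data group, and verifies whether the threshold set by the threshold setting unit 102f correctly discriminates the reference data group from the other populations (step SA-7).
[0170]
That is, the judgment accuracy verification unit 102g accesses the DNA microarray measurement data file 106a and related files, separately prepares data that should belong to the reference data but are outside the reference data group (patient data with the outcome of IFN administration withheld), mixes them with data that do not belong to the reference data, and uses the result as the verification data group.
[0171]
The judgment accuracy verification unit 102g then stores the verification data group in a predetermined storage area of the verification data group file 106g.
[0172]
The judgment accuracy verification unit 102g then calculates the Mahalanobis distances for this verification data group by the same method as in step SA-5 and verifies whether the threshold obtained in step SA-6 correctly discriminates the reference data group from the other populations.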
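As a simple illustration of this verification step, the sketch below computes the fraction of verification samples that a threshold classifies correctly. The function name, the distances, and the labels are hypothetical; a sample is judged to belong to the reference group when its distance does not exceed the threshold, which is an assumed convention.

```python
def discrimination_rate(distances, is_reference, threshold):
    """Fraction of verification samples whose group is judged correctly."""
    correct = sum((d <= threshold) == flag
                  for d, flag in zip(distances, is_reference))
    return correct / len(distances)


# Hypothetical verification data: distances and the withheld true labels.
distances = [0.9, 1.3, 2.0, 3.2, 5.1]
is_reference = [True, True, True, False, False]
rate = discrimination_rate(distances, is_reference, threshold=1.8)
```

Here the third sample belongs to the reference group but lies beyond the threshold, so four of the five samples are judged correctly.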
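[0173]
Accuracy can be verified by the discrimination rate, but in the MT system method it is more common to express it as an SN ratio (signal-to-noise ratio), that is, to verify accuracy by the ratio between the distance obtained from the input data and the expected measurement result.
[0174]
For example, the SN ratio is used to verify that data groups outside the reference data for which interferon administration is known to be effective fall within the threshold and that the other populations fall outside it, and to confirm the linear relationship between the efficacy of interferon administration and the calculated Mahalanobis distance. For item selection using orthogonal arrays and the SN ratio, see the reference (for example, "2.5 Item selection" on page 19 or "6.6 Parameter design of the MT system" on page 62 of the reference cited above, "Technological Development in MT System").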
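The text does not say which SN ratio formula is used. As one possibility, the sketch below computes the larger-the-better SN ratio, η = −10 log10[(1/n) Σ 1/D_i²], which appears in the MT system literature for samples that should lie far from the unit space. The choice of this formula, the function name, and the input values are assumptions; a dynamic SN ratio would be needed to assess the linear relationship mentioned above.

```python
import math


def larger_the_better_sn(squared_distances):
    """Larger-the-better SN ratio in decibels for non-reference samples."""
    n = len(squared_distances)
    mean_inverse = sum(1.0 / d for d in squared_distances) / n
    return -10.0 * math.log10(mean_inverse)


# Hypothetical squared distances of samples known to lie outside the reference group.
well_separated = larger_the_better_sn([100.0, 100.0])
poorly_separated = larger_the_better_sn([4.0, 4.0])
```

A higher value means the non-reference samples sit farther from the unit space, so comparing the value across candidate item sets indicates which set separates the groups better.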
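[0175]
(8) Reexamination
If step SA-7 determines that sufficient accuracy has not been obtained, the items selected as analysis data items in step SA-3 or the reference data group selected in step SA-2 may not have been examined adequately.
[0176]
Therefore, through the processing of the reexamination unit 102h, the DNA microarray data analysis device 100 accesses the DNA microarray measurement data file 106a through the analysis data items 106d and, when the judgment accuracy verification unit 102g has determined that sufficient accuracy cannot be obtained, reexamines the reference data group selected by the data group selection unit 102b and/or the analysis data items determined by the analysis data item determination unit 102c (step SA-8).
[0177]
For the reference data group, the reexamination unit 102h, through the processing of the reference data reexamination unit 102j, examines the degree of variation in the reference data using the standard deviation of each item in the reference data group, the frequency of data occurrence, and the like.
[0178]
There are several methods for reselecting items. Examples include consulting the opinions of experts on the data being handled, selecting data with a large difference in means between the reference data group and the other populations, and, as advocated in the MT system method, using an orthogonal array to form multiple combinations of randomly selected items and verifying the effectiveness of each item by the SN ratio.
[0179]
For example, through the processing of the analysis data item reexamination unit 102k, the reexamination unit 102h uses an orthogonal array to test multiple combinations of which spots on the DNA microarray (gene expression states) to use as measurement results, evaluates them by the SN ratio, and adopts as items those whose use raises the SN ratio.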
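Item selection with an orthogonal array can be sketched as follows, using the standard two-level L8 array for up to seven items. Level 1 means the item is used in that run and level 2 means it is left out. The SN ratio of each run is taken as given; in practice each run would rebuild the unit space with the items at level 1 and recompute the distances. The function names and the SN values are hypothetical.

```python
# Standard L8 two-level orthogonal array: 8 runs x 7 columns (items).
L8 = [
    (1, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 2, 2, 2, 2),
    (1, 2, 2, 1, 1, 2, 2),
    (1, 2, 2, 2, 2, 1, 1),
    (2, 1, 2, 1, 2, 1, 2),
    (2, 1, 2, 2, 1, 2, 1),
    (2, 2, 1, 1, 2, 2, 1),
    (2, 2, 1, 2, 1, 1, 2),
]


def item_gains(sn_by_run, array=L8):
    """Gain per item: mean SN ratio with the item used minus mean without it."""
    gains = []
    for item in range(len(array[0])):
        used = [sn for run, sn in zip(array, sn_by_run) if run[item] == 1]
        unused = [sn for run, sn in zip(array, sn_by_run) if run[item] == 2]
        gains.append(sum(used) / len(used) - sum(unused) / len(unused))
    return gains


def select_items(sn_by_run, array=L8):
    """Indices of the items whose use raises the SN ratio."""
    return [i for i, g in enumerate(item_gains(sn_by_run, array)) if g > 0]


# Hypothetical SN ratios (dB) measured for the eight runs.
sn_by_run = [9.0, 7.0, 10.0, 8.0, 6.0, 4.0, 7.0, 5.0]
selected = select_items(sn_by_run)
```

With these values, items 0 and 3 raise the SN ratio, item 1 lowers it, and the rest have no effect, so only items 0 and 3 are kept. Because the array is balanced, eight runs are enough to estimate the effect of each of the seven items, instead of testing all 128 combinations.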
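[0180]
(9) Finalization
When the verification in step SA-7 indicates that sufficient accuracy can be secured, the DNA microarray data analysis device 100 finalizes the database (unit space) and the threshold through the processing of the finalization unit 102i (step SA-9).
[0181]
For example, they are finalized once verification shows that the input data allow the efficacy of interferon administration to be diagnosed adequately from a medical standpoint.
[0182]
(10) Start of classification
Once the database (unit space) has been finalized, it is used to calculate the Mahalanobis distance of unknown data. The nature of the data is thereby expressed as a distance, which is used for diagnosing and predicting various data (step SA-10).
[0183]
For example, when the efficacy of interferon administration is examined with a DNA microarray and the MT system method, the DNA microarray expression data plus sample information serve as input data, the Mahalanobis distance is calculated with the unit space database created in advance, and the predetermined Mahalanobis distance threshold is used to diagnose the efficacy of interferon administration and to predict the degree of response.
This completes the DNA microarray analysis processing by the MT system.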
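[0184]
[Other Embodiments]
Embodiments of the present invention have been described above. However, the present invention may be implemented in various embodiments other than those described, within the scope of the technical idea set forth in the claims.
[0185]
For example, although the case where the DNA microarray data analysis device 100 performs processing in a stand-alone form has been described, the device may be configured to perform processing in response to a request from a client terminal housed separately from the DNA microarray data analysis device 100 and to return the processing result to that client terminal.
[0186]
Of the processes described in the embodiment, all or part of those described as being performed automatically can be performed manually, and all or part of those described as being performed manually can be performed automatically by known methods.
In addition, the processing procedures, control procedures, specific names, information including parameters such as various registered data and search conditions, screen examples, and database configurations shown in the above description and drawings may be changed arbitrarily unless otherwise noted.
[0187]
With regard to the DNA microarray data analysis device 100, the illustrated components are functional and conceptual and need not be physically configured as illustrated.
[0188]
For example, all or any part of the processing functions of each unit or device of the DNA microarray data analysis device 100, in particular those performed by the control unit 102, can be realized by a CPU (Central Processing Unit) and a program interpreted and executed by that CPU, or as hardware using wired logic. The program is recorded on a recording medium described later and is mechanically read by the DNA microarray data analysis device 100 as necessary.
[0189]
That is, a computer program for giving instructions to the CPU in cooperation with the OS (Operating System) and performing various processes is recorded in the storage unit 106, such as a ROM or HD. This computer program is executed by being loaded into RAM or the like and, in cooperation with the CPU, constitutes the control unit 102. The computer program may also be recorded on an application program server connected to the DNA microarray data analysis device 100 via an arbitrary network 300, and all or part of it can be downloaded as necessary.
[0190]
The program according to the present invention can also be stored on a computer-readable recording medium. Here, "recording medium" includes any "portable physical medium" such as a flexible disk, magneto-optical disk, ROM, EPROM, EEPROM, CD-ROM, MO, or DVD; any "fixed physical medium" such as a ROM, RAM, or HD built into various computer systems; and a "communication medium" that holds the program for a short period, such as a communication line or carrier wave used when the program is transmitted over a network typified by a LAN, WAN, or the Internet.
[0191]
A "program" is a data processing method described in any language or notation, regardless of format such as source code or binary code. A "program" is not necessarily limited to a single unit; it includes programs distributed as multiple modules or libraries and programs that achieve their function in cooperation with a separate program typified by an OS (Operating System). Well-known configurations and procedures can be used for the specific configuration for reading the recording medium in each device shown in the embodiment, for the reading procedure, and for the installation procedure after reading.
[0192]
The various databases and the like stored in the storage unit 106 (the DNA microarray measurement data file 106a through the verification data group file 106g) are storage means such as memory devices (RAM, ROM, and the like), fixed disk devices (hard disks and the like), flexible disks, and optical disks. They store the various programs, tables, files, databases, web page files, and the like used for various processing and for providing websites.
[0193]
The DNA microarray data analysis device 100 may also be realized by connecting peripheral devices such as a printer, monitor, or image scanner to an information processing device such as a known personal computer or workstation and installing on that information processing device software (including programs, data, and the like) that implements the method of the present invention.
[0194]
Furthermore, the specific form of distribution and integration of the DNA microarray data analysis device 100 and the like is not limited to that shown in the specification and drawings. All or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads and the like (for example, grid computing). For example, each database may be configured independently as a separate database device, and part of the processing may be realized using CGI (Common Gateway Interface).
[0195]
The network 300 has the function of interconnecting the DNA microarray data analysis device 100 and the external system 200. It may include, for example, any of the Internet, an intranet, a LAN (wired or wireless), a VAN, a PC communication network, a public telephone network (analog or digital), a leased line network (analog or digital), a CATV network, a mobile circuit-switched or packet-switched network such as IMT-2000, GSM, or PDC/PDC-P, a radio paging network, a local wireless network such as Bluetooth, a PHS network, and a satellite communication network such as CS, BS, or ISDB. That is, this system can transmit and receive various data over any network, wired or wireless.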
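[0196]
[Effects of the Invention]
As described in detail above, according to the present invention, DNA microarray measurement data are analyzed using the Mahalanobis-Taguchi system. Applying the MT system method thus makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium capable of analyzing a large amount of gene data at high speed and with high accuracy.
[0197]
That is, conventional DNA microarray data analysis methods using a neural network that requires teacher data need the attributes of all teacher data (data whose determination result is known, for example clinically judged drug sensitivity) to be expressed numerically. The MT system method, by contrast, can begin analysis using loosely defined content, such as whether a drug is effective or whether a sample belongs to a particular group, as reference data, so preparing the reference data (teacher data) becomes very simple.
[0198]
Conventional DNA microarray data analysis methods using neural network learning, calculation of multiple regression coefficients, and the like form an optimal neural circuit or set of parameters by repeating similar operations many times. The MT system method can generate the database (reference data) in a single calculation and process it at high speed, so computation time can be made very short.
[0199]
In conventional DNA microarray data analysis methods using neural networks and the like, the computation is treated as a black box and the calculation process behind the results is unclear (that is, its validity cannot be verified). Although the MT system method offers several calculation techniques, all are published formulas and all are clearly based on the correlations among all the reference data, so the transparency of the computation can be ensured.
[0200]
In conventional DNA microarray data analysis methods using neural networks and the like, the calculated result varies with the content of the teacher data and the software used, and there is no reproducibility when the neural circuit is rebuilt. The MT system method calculates with published formulas, so the same data always give the same result and reproducibility is very high.
[0201]
DNA microarray data analysis methods using general clustering techniques divide a given data group into groups and thus express the differences between samples or genes as relative differences within the group. In the MT system method, once the unit space database has been generated, differences can be expressed as absolute values, namely as distances from the origin of this reference. Data outside the analyzed data group can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so evaluation results are stable every time.
[0202]
Furthermore, the MT system method defines a procedure (item selection) for mechanically verifying which items (gene expression data or sample information) are effective for discrimination. Using it, the optimal items can be selected efficiently without being bound by biased human opinion.
[0203]
According to the present invention, genes to be analyzed are classified from DNA microarray measurement data using the Mahalanobis-Taguchi system. Applying the MT system method thus makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium capable of classifying a large amount of gene data at high speed and with high accuracy.
[0204]
According to the present invention, genes to be analyzed are selected from the DNA microarray measurement data of samples using the Mahalanobis-Taguchi system, and clinical samples are classified based on the selected gene data using the Mahalanobis-Taguchi system. Applying the MT system method thus makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium capable of classifying a large amount of microarray measurement data at high speed and with high accuracy.
[0205]
According to the present invention, a reference data group is selected by collecting, from the DNA microarray measurement data, reference data corresponding to the discrimination item that is the purpose of analysis; analysis data items, which are the items that serve as input data for discrimination by the MT system method and are analyzed in the MT system method, are determined; a unit space database representing the Mahalanobis space of the reference data in the selected reference data group is created; the Mahalanobis distances of the selected reference data and of comparison data that clearly do not belong to the reference data are calculated; one or more thresholds marking the boundary between the reference data group and the comparison data group are set based on those distances; the Mahalanobis distance of unknown data is calculated using the unit space database and the selected items; and the unknown data are classified by judging that distance against the threshold. Applying the MT system method thus makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium capable of analyzing a large amount of gene data at high speed and with high accuracy.
[0206]
According to the present invention, data that should belong to the reference data are prepared separately from the reference data group and mixed with data that do not belong to the reference data to create a verification data group, the Mahalanobis distances of the verification data group are calculated, and it is verified whether the set threshold correctly discriminates the reference data group from the other populations. This makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium with which discrimination can be performed even more efficiently.
[0207]
According to the present invention, when it is determined that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reexamined. This makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium with which discrimination can be performed more accurately and efficiently.
[0208]
According to the present invention, the degree of variation in the reference data is examined using the standard deviation of each item in the reference data group, the frequency of data occurrence, and the like, and reference data with small variation are used. Reexamining the reference data in this way makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium with which discrimination can be performed more accurately and efficiently.
[0209]
Furthermore, according to the present invention, the choice of which spots on the DNA microarray to use as measurement data is tested in multiple combinations using an orthogonal array and evaluated by the SN ratio, and those spots whose use as analysis data items raises the SN ratio are adopted as analysis data items. Reexamining the analysis data items in this way makes it possible to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium with which discrimination can be performed more accurately and efficiently.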
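[Brief Description of the Drawings]
[FIG. 1] A diagram showing the basic principle of the present invention.
[FIG. 2] A block diagram showing an example of the configuration of the system to which the present invention is applied.
[FIG. 3] A flowchart showing an example of the DNA microarray analysis processing by the MT system in the system of this embodiment.
[FIG. 4] A diagram showing an example in which the reference data group and the comparison data group are displayed in a distance histogram and a threshold is set.
[FIG. 5] A diagram showing an example of user information stored in the DNA microarray measurement data file 106a.
[FIG. 6] A block diagram showing an example of the configuration of the reexamination unit 102h.
[Description of Reference Signs]
100 DNA microarray data analysis device
102 Control unit
102a Analysis purpose determination unit
102b Data group selection unit
102c Analysis data item determination unit
102d Unit space database creation unit
102e Mahalanobis distance calculation unit
102f Threshold setting unit
102g Judgment accuracy verification unit
102h Reexamination unit
102i Finalization unit
102j Reference data reexamination unit
102k Analysis data item reexamination unit
104 Communication control interface unit
106 Storage unit
106a DNA microarray measurement data file
106b Reference data group file
106c Comparison data group file
106d Analysis data items
106e Unit space database
106f Mahalanobis distance file
106g Verification data group file
108 Input/output control interface unit
112 Input device
114 Output device
200 External system
300 Network
[0001]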
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium, and in particular to a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium that can analyze a large amount of gene data at high speed and with high accuracy by applying the Mahalanobis-Taguchi system (hereinafter referred to as the "MT system method" or simply the "MT system").
[0002]
[Prior art]
The DNA microarray is a new technology developed over the past several years. Various attempts at analyzing its data have been reported (see, for example, Non-Patent Documents 1, 2, 4 to 7, and 8 to 12), but no method has yet been established. Conventional DNA microarray data analysis methods are described below.
[0003]
A DNA microarray can measure the expression of hundreds, thousands, or tens of thousands of genes at once. However, the sheer number of measurements, which should be an advantage, also makes data analysis difficult. In addition, when genes whose functions have not been determined are mounted on a DNA microarray, the analysis of the resulting data becomes "discrimination without external criteria", described later, so verifying the analysis method itself becomes a problem. It is common knowledge among researchers that taking data with a DNA microarray is easy and that the problem lies in the analysis.
[0004]
The core of DNA microarray data analysis is how to classify the large number of mounted genes based on the data. This classification divides broadly into (1) category determination (group discovery) and (2) category assignment. The former applies, for example, when genes are to be classified by their expression patterns over time and it must first be decided what expression patterns exist and how many types there are (group discovery). In this case there is no external criterion for judging whether the determined categories are correct, so this is a so-called "unsupervised" classification. The latter (category assignment) means, for example, deciding which of the categories determined in the former step an unknown gene belongs to. In this case it can be verified whether the gene meets the criteria of the category, so this is a so-called "supervised" classification.
[0005]
Methods for category determination include hierarchical clustering, self-organizing maps (SOM), and gene shaving, which are described below. Methods for category assignment include support vector machines (SVM) and discriminant analysis.
[0006]
Methods for analyzing gene expression data from DNA microarrays developed, alongside the microarrays themselves, as analyses of expression patterns. The earliest and most developed was gene expression analysis aimed at comprehensively classifying all genes on the array by the characteristics of each gene's expression pattern, using cDNA (complementary DNA) microarray data from yeast and other organisms (for example, see Non-Patent Document 1).
[0007]
Non-Patent Document 1 uses "hierarchical clustering" (which falls under the "category determination" above) to classify all the data obtained from the microarray and to infer the functions of unknown genes in the data from their classes. Specifically, for each gene a vector characterizing its expression state under various treatment conditions is set up, and the similarity between the genes being compared is defined by Pearson's correlation coefficient and used as the input value (distance measure) for cluster analysis. The hierarchical clustering that performs the grouping in the next step is characterized by its use of the "average-linkage method" (for example, see Non-Patent Document 3). The document further proposes inferring the function of an unknown gene grouped by this method from the known genes in the same group.
[0008]
Which method is used to analyze gene expression data from a DNA microarray depends on the purpose of the analysis, but even for the same purpose, different papers use different methods, as described above. Several established statistical analysis and pattern recognition methods exist, but the results differ depending on which of them are combined and how, and on what techniques specific to DNA microarrays are added, so a variety of analyses have been attempted.
[0009]
As a DNA microarray analysis method, an analysis method (classification of cancer or classification of drug sensitivity) linked to clinical data has recently been reported (for example, see Non-Patent Document 2). This includes comprehensive analysis of the above-mentioned cDNA as a part thereof, and is a more complicated analysis method as a whole.
[0010]
In Non-Patent Document 2, groups of genes are found while the dimensions are reduced, and the patients are then classified. That is, dimension reduction and grouping are performed by principal component analysis (PCA) and neural networks, narrowing the 6567 genes on the microarray down to 96. The data for these 96 genes are then quantified for each patient and the patients are grouped by hierarchical clustering. The report shows that cancers that were clinically difficult to classify could be classified accurately on the basis of gene expression. It thus suggested that DNA microarray analysis could bring great benefits by improving the current situation, in which the same treatment is given because conventional clinical classification shows no clear difference, even where different treatments should be given.
[0011]
As in Non-Patent Document 2, in clinical data analysis using DNA microarrays it is generally not determined in advance which genes suit the purpose of the analysis. The usual approach is first to analyze the expression patterns of hundreds or thousands of genes and narrow down which gene data to use (classification and dimension reduction), and then to analyze the gene expression data of individual patients. Because this gene narrowing is data classification without external criteria (so-called "unsupervised"), verifying its result is a problem. Ultimately the result can be verified to some extent from the outcome of grouping patients with the narrowed-down gene items, but performing the narrowing in parallel with such verification would take an enormous amount of time with general analysis methods and is not realistic.
[0012]
Although there are no reports on clinical data, the application of a self-organizing map (SOM) using a layered neural network (for example, see Non-Patent Document 4) has been proposed as an "unsupervised learning" analysis of microarrays (for example, see Non-Patent Document 5). It was proposed as an improvement on the hierarchical clustering method described above, which has disadvantages such as a computational load that grows as the number of genes increases and a tree topology that tends to change with the data set given.
[0013]
Attempts to apply, for example, "gene shaving" (for example, see Non-Patent Document 6) and correspondence analysis (CA) (for example, see Non-Patent Document 7) have also been reported. Whichever method is used, however, it is "unsupervised learning", so the validity of the analysis result cannot be verified; even where it can be verified against clinical data, the verification itself is impractical, and the result cannot be judged until the final outcome is obtained.
[0014]
On the other hand, the analysis of each patient's gene data after gene narrowing and grouping, described above, corresponds to "supervised learning", in which external criteria exist in advance and data are classified into target categories according to them. This applies not only to the classification of patients but also to the classification of individual genes by function and the like. For example, adopting "supervised learning" in the hierarchical clustering described above (for example, see Non-Patent Document 2), applying an SVM (Support Vector Machine), and using neural network analysis have been proposed (for example, see Non-Patent Document 8).
[0015]
As described above, various methods are currently being tested as methods for analyzing data of the DNA microarray. In particular, how to narrow down genes without the above-mentioned external reference data is becoming an even bigger problem with the recent increase in attempts to apply DNA microarrays to clinical diagnosis.
[0016]
The following is a report of microarray analysis in clinical diagnosis. As main ones, there are Non-Patent Documents 9 to 12 in addition to Non-Patent Document 2 described above.
[0017]
Non-Patent Document 9 reports on predicting the recurrence of cancer (NSCLC). There, 2899 genes are first characterized using a method called the "Cox proportional hazards model", and the resulting data for those 2899 genes are then used for hierarchical clustering.
[0018]
Non-Patent Document 10 reports on the prediction of the efficacy of drug therapy in breast cancer. Here, 231 genes were selected from 25,000 genes using basically the same method as Non-Patent Document 2 described above. Then, hierarchical clustering is performed based on the data of each patient.
[0019]
Non-Patent Document 11 reports on the prediction of the efficacy of drug therapy in esophageal cancer. There, the U value of the Mann-Whitney test is calculated for each of the 9216 gene data, 52 genes are selected statistically, and a DRS (drug response score) is calculated from the expression status of those genes.
[0020]
Furthermore, Non-Patent Document 12 reports that a cure rate (survival rate) in DLBCL (Diffuse large B-cell lymphoma) was predicted.
[0021]
On the other hand, the MT system method is a pattern recognition method developed mainly for quality engineering and is used, for example, to analyze data from individual devices for device monitoring and preventive maintenance (for example, see Patent Document 1). Although there are reports such as a method for predicting health examination results by the MT system method (for example, see Non-Patent Document 13), there are no reports of applying the MT system method to gene data analysis in DNA microarray analysis.
[0022]
The reason is that Dr. Genichi Taguchi, who proposed the MT system method, has repeatedly stated at academic meetings and elsewhere that the number of reference data (in the embodiment of the present invention, the number of individual samples) must exceed the number of items (in the embodiment, the number of genes on the DNA microarray plus the number of clinical data items such as blood tests). Because this had become established as common technical knowledge among those skilled in the art, the method was considered unsuitable for analyzing DNA microarrays, whose chief characteristic is the large number of genes.
[0023]
The number of reference data is generally recommended to be at least twice the number of items. For example, Non-Patent Document 14 states that, with 26 items, "52 data, the minimum number for which the Mahalanobis distance can be calculated, could not be obtained," and the number of items was therefore deliberately reduced to 22. Such practice is widely known in analyses using the MT system method.
[0024]
[Patent Document 1]
JP-A-2000-259222
[Non-Patent Document 1]
Eisen, et al., "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci., 1998, 95, p. 14863-14868
[Non-Patent Document 2]
J. Khan, et al., "Nature Medicine", 2001, Vol. 7, Num. 6, p. 673-679
[Non-Patent Document 3]
Sokal, et al., "Univ. Kans. Sci. Bull.", 1958, 38, p. 1409-1438
[Non-Patent Document 4]
Kohonen, T., "Proc. IEEE 78", 1991, p. 1464-1480
[Non-Patent Document 5]
Tamayo, et al., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation", Proc. Natl. Acad. Sci., 1999, 96, p. 2907-2912
[Non-Patent Document 6]
Hastie, et al., "Gene shaving as a method for identifying distinct sets of genes with similar expression patterns", Genome Biology, 2000, 1 (2), p. research0003.1-0003.21
[Non-Patent Document 7]
Fellenberg, et al., "Correspondence analysis applied to microarray data", Proc. Natl. Acad. Sci., 2001, 98, p. 10781-10786
[Non-Patent Document 8]
Maeda Eisaku, "Pleasant! Support Vector Machine", Information Processing, 2001, Vol. 42, No. 7, p. 676-683
[Non-Patent Document 9]
D. A. Wigle, et al., "Cancer Research", 2002, 62, p. 3005-3008
[Non-Patent Document 10]
L. J. van 't Veer, et al., "Nature", 2002, Vol. 415, p. 530-536
[Non-Patent Document 11]
C. Kihara, et al., "Cancer Research", 2001, 61, p. 6474-6479
[Non-Patent Document 12]
M. A. Shipp, et al., "Nature Medicine", 2002, Vol. 8, Num. 1, p. 68-74
[Non-Patent Document 13]
Nakajima et al., "Predictive Evaluation and Efficiency of Health Examination Using the MT System Method", Nihon Koei Magazine, Vol. 46, No. 5, 1999
[Non-Patent Document 14]
"Profit Forecast Using MTS", "Quality Engineering", Vol. 10, No. 3, pp. 96-102, June 2002
[0025]
[Problems to be solved by the invention]
However, the conventional gene analysis method using a DNA microarray has a problem that it takes a lot of time to analyze a large amount of gene data covering several hundreds to tens of thousands of items.
[0026]
Specifically, for example, the conventional method using a neural network that requires teacher data needs the attributes of all teacher data (data whose determination result is known; for example, clinically judged drug sensitivity) to be expressed numerically, so analyzing a large amount of gene data takes a great deal of time.
[0027]
Likewise, conventional methods using neural network learning, calculation of multiple regression coefficients, and the like repeat the same operation many times to form an optimal neural circuit or set of parameters, which takes a great deal of time.
[0028]
Further, in conventional methods using a neural network or the like, the calculated result varies with the content of the teacher data and the software used, and there is no reproducibility when the neural circuit is rebuilt.
[0029]
In addition, in conventional methods using a neural network or the like, the computation is treated as a black box and the calculation process behind the results is unclear; that is, the validity of the calculation process cannot be verified.
[0030]
In addition, conventional clustering methods express the differences between samples or genes as relative differences within a given data group by dividing it into groups, so the evaluation results are unstable.
[0031]
Further, in conventional methods that group genes in an "unsupervised" manner, the results are difficult to verify, so trial and error has to be repeated.
[0032]
In addition, conventional methods define no procedure for mechanically verifying which items among the gene expression data and sample information are effective for discrimination, so the optimal items cannot be selected efficiently.
[0033]
As described above, conventional systems have a number of problems; as a result, they are inconvenient for both users and administrators, and their utilization efficiency is low.
[0034]
One technique used for predicting and discriminating various phenomena is the "MT (Mahalanobis-Taguchi) system method" proposed by Dr. Genichi Taguchi, which measures any phenomenon using the Mahalanobis distance (for example, Genichi Taguchi, "Technological Development in MT System" (Japan Standards Association, 2002)).
[0035]
The principle of the MT system method is as follows: (1) for the phenomenon to be measured, a data group with little variation in data and results is selected from data groups whose results are known, and a database is created from it; (2) with this database as the "yardstick", unknown phenomena are analyzed as Mahalanobis distances.
[0036]
Here, as described above, the MT system method is used for data analysis of each device for device monitoring and preventive maintenance (for example, see Patent Literature 1), and prediction of a health examination by the MT system method is performed. Although a method (for example, see Non-patent Document 13) is known, there is no report on applying the MT system method to gene data analysis in DNA microarray analysis.
[0037]
The present invention has been made in view of the above problems, and its object is to provide a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium capable of analyzing a large amount of gene data at high speed and with high accuracy by applying the MT system method.
[Means for Solving the Problems]
[0038]
In order to achieve such an object, the DNA microarray data analysis method according to claim 1 is characterized by analyzing measurement data of a DNA microarray using a Mahalanobis-Taguchi system.
[0039]
According to this method, since the measurement data of the DNA microarray is analyzed using the Mahalanobis Taguchi system, a large amount of genetic data can be analyzed at high speed and with high accuracy by applying the MT system method.
[0040]
That is, conventional DNA microarray data analysis methods using a neural network that requires teacher data need the attributes of all teacher data (data whose determination result is known, such as clinically judged drug sensitivity) to be expressed numerically. The MT system method, by contrast, can begin analysis using loosely defined content, such as whether a drug is effective or whether a sample belongs to a specific group, as reference data, so preparing the reference data (teacher data) becomes very simple.
[0041]
Conventional DNA microarray data analysis methods using neural network learning, calculation of multiple regression coefficients, and the like repeat the same operation many times to form an optimal neural circuit or set of parameters. The MT system method can generate the database (reference data) in a single calculation and process it at high speed, so computation time can be made very short.
[0042]
Further, in conventional DNA microarray data analysis methods using a neural network or the like, the computation is treated as a black box and the calculation process behind the results is unclear (that is, its validity cannot be verified). Although the MT system method offers several calculation techniques, all are published formulas and all are clearly based on the correlations among all the reference data, so the transparency of the computation can be ensured.
[0043]
In conventional DNA microarray data analysis methods using a neural network or the like, the calculated result varies with the content of the teacher data and the software used, and there is no reproducibility when the neural circuit is rebuilt. The MT system method calculates with published formulas, so the same data always give the same result and reproducibility is very high.
[0044]
Further, DNA microarray data analysis methods using general clustering techniques divide a given data group into groups and thus express the differences between samples or genes as relative differences within the group. In the MT system method, once the unit space database has been generated, differences can be expressed as absolute values, namely as distances from the origin of this reference. Data outside the analyzed data group can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so evaluation results are stable every time.
[0045]
Furthermore, the MT system method defines a procedure (item selection) for mechanically verifying which items (gene expression data or sample information) are effective for discrimination. Using it, the optimal items can be selected efficiently without being bound by biased human opinion.
[0046]
Further, the DNA microarray data analysis method according to claim 2 is characterized in that genes to be analyzed are classified from the measurement data of the DNA microarray using a Mahalanobis-Taguchi system.
[0047]
According to this method, genes to be analyzed are classified from the DNA microarray measurement data using the Mahalanobis-Taguchi system, so that a large amount of gene data can be classified at high speed and with high accuracy by applying the MT system method.
[0048]
The DNA microarray data analysis method according to claim 3 is characterized by selecting genes to be analyzed from the DNA microarray measurement data of samples using the Mahalanobis-Taguchi system, and classifying clinical samples based on the selected gene data using the Mahalanobis-Taguchi system.
[0049]
According to this method, a gene to be analyzed is selected from the measurement data of the DNA microarray of the sample using the Mahalanobis-Taguchi system, and the clinical sample is classified based on the selected gene data using the Mahalanobis-Taguchi system. By applying the MT system method, a large amount of microarray measurement data can be classified at high speed and with high accuracy.
[0050]
In addition, the DNA microarray data analysis method according to claim 4 includes a data group selection step of selecting a data group by collecting reference data according to a discrimination item to be analyzed from measurement data of the DNA microarray, An analysis data item determination step for determining an analysis data item which is an item to be input data when discriminated by the MT system method and which is an item analyzed in the MT system method, and which is selected in the data group selection step. A unit space database creation step of creating a unit space database indicating a Mahalanobis space of the reference data included in the reference data group, a Mahalanobis distance of the reference data selected in the data group selection step, and the reference data Mahalanobis distance by collecting comparison data that does not clearly belong to Based on the Mahalanobis distance calculation step of calculating the Mahalanobis distance between the reference data and the comparison data calculated in the Mahalanobis distance calculation step, a threshold that is a boundary between the reference data group and the comparison data group A threshold setting step of setting one or more, and a Mahalanobis distance of unknown data is calculated by using the unit space database and the selected item, and discrimination analysis of unknown data is performed by determining the Mahalanobis distance by the threshold. Is performed.
[0051]
According to this method, a reference data group is selected by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item to be analyzed; the analysis data items, that is, the items that serve as input data for discrimination by the MT system method and that are analyzed in the MT system method, are determined; a unit space database representing the Mahalanobis space of the reference data included in the selected reference data group is created; the Mahalanobis distances of the selected reference data and of comparison data that clearly does not belong to the reference data are calculated; one or more thresholds that serve as boundaries between the reference data group and the comparison data group are set based on the calculated Mahalanobis distances; and the Mahalanobis distance of unknown data is calculated using the unit space database and the selected items and judged against the thresholds to classify the unknown data. By applying the MT system method in this way, a large amount of gene data can be analyzed quickly and accurately.
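The flow described above (unit space creation, Mahalanobis distance calculation, threshold judgment) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used by the inventor: the function names are hypothetical, the distance is the squared Mahalanobis distance divided by the number of items (a scaling commonly used in the MT system literature), and the threshold is assumed to have been chosen beforehand.

```python
import numpy as np

def build_unit_space(reference):
    """Summarize a reference data group (rows = specimens, columns = items).

    Hypothetical helper: stores the per-item mean, the per-item sample
    standard deviation, and the inverse of the correlation matrix.
    """
    mean = reference.mean(axis=0)
    std = reference.std(axis=0, ddof=1)
    corr = np.corrcoef(reference, rowvar=False)
    return {"mean": mean, "std": std, "inv_corr": np.linalg.inv(corr)}

def mahalanobis_distance(unit_space, data):
    """Scaled squared Mahalanobis distance (D^2 / k) of each row of `data`."""
    z = (np.atleast_2d(data) - unit_space["mean"]) / unit_space["std"]
    k = z.shape[1]
    # Row-wise quadratic form z_i^T R^{-1} z_i, divided by the item count.
    return np.einsum("ij,jk,ik->i", z, unit_space["inv_corr"], z) / k

def classify(unit_space, data, threshold):
    """True where a row is judged to belong to the reference group."""
    return mahalanobis_distance(unit_space, data) <= threshold
```

With this scaling, the distances of the reference data themselves average (n - 1) / n for n reference specimens, so values far above 1 indicate data lying away from the unit space.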
[0052]
That is, in the MT system method, the analysis can be started from a loosely defined question, such as whether or not a medicine is effective or which specific group to select as the reference data, so setting up the analysis becomes very simple.
[0053]
Further, in the MT system method, a database (reference data) can be generated by one calculation and can be processed at a high speed, so that the operation time can be extremely shortened.
[0054]
Although several calculation methods have been proposed for the MT system method, all of them use publicly disclosed formulas, and it is clear that the formulas focus on the correlations among all the reference data, so the transparency of the calculation can be ensured.
[0055]
Further, in the MT system method, since the calculation is performed using a publicly available calculation formula, the same result can be always obtained by using the same data, and the reproducibility can be extremely improved.
[0056]
Also, in the MT system method, once a unit space database has been generated, each data point can be expressed as an absolute value, namely its distance from the origin of this reference. Data outside the analyzed data group can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so the evaluation result is stable every time. Also, in the MT system method, when a causal relationship is confirmed between the phenomenon exhibited by the target and the Mahalanobis distance, setting a plurality of thresholds makes it possible to classify a plurality of states simply by calculating the Mahalanobis distance.
Further, in the MT system method, when a certain proportional relationship is established between the target state and the Mahalanobis distance, the target state itself can be expressed by the Mahalanobis distance.
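As a small illustration of classifying a plurality of states with a plurality of thresholds, the following sketch maps a Mahalanobis distance to the index of the band it falls into. The function name and the threshold values in the usage note are hypothetical and are not taken from the embodiment.

```python
from bisect import bisect_right

def classify_state(distance, thresholds):
    """Return the index of the band that `distance` falls into.

    `thresholds` must be sorted in ascending order. Band 0 is the one
    closest to the unit space; a distance exactly equal to a threshold is
    assigned to the band above that threshold.
    """
    return bisect_right(thresholds, distance)
```

For example, with the hypothetical thresholds `[2.0, 10.0]`, a distance of 0.9 falls into band 0, a distance of 5.0 into band 1, and a distance of 25.0 into band 2.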
[0057]
Furthermore, in the MT system method, a procedure for mechanically verifying which items (gene expression data and specimen information) are effective for discrimination is defined (item selection). The most appropriate items can be selected efficiently without being bound by biased opinions.
[0058]
The DNA microarray data analysis method according to claim 5 is the DNA microarray data analysis method according to claim 4, further including a discrimination accuracy verification step of separately preparing, in addition to the reference data group, data that should belong to the reference data, creating a verification data group in which that data is mixed with data that does not belong to the reference data, calculating the Mahalanobis distance for the verification data group, and verifying whether the reference data group can be correctly distinguished from the other groups using the thresholds set in the threshold setting step.
[0059]
This shows one example of the discrimination accuracy verification step more specifically. According to this method, data that should belong to the reference data is prepared separately from the reference data group, a verification data group is created by mixing it with data that does not belong to the reference data, the Mahalanobis distance is calculated for the verification data group, and it is verified whether the reference data group can be correctly distinguished from the other groups using the set thresholds. The discrimination work can therefore be performed more efficiently.
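The verification described above can be sketched as a simple agreement count between the threshold judgment and the known membership of each verification specimen. This is an illustrative sketch only; the function name, the single-threshold rule, and the sample values in the usage note are assumptions.

```python
def verify_discrimination(distances, belongs_to_reference, threshold):
    """Fraction of verification data whose threshold judgment is correct.

    distances            -- Mahalanobis distance of each verification specimen
    belongs_to_reference -- known answer for each specimen (True / False)
    threshold            -- boundary set in the threshold setting step
    """
    correct = 0
    for distance, is_reference in zip(distances, belongs_to_reference):
        judged_reference = distance <= threshold
        if judged_reference == is_reference:
            correct += 1
    return correct / len(distances)
```

With the hypothetical distances `[0.8, 5.0, 6.0]`, known memberships `[True, True, False]`, and a threshold of 4.0, two of the three specimens are judged correctly, so the function returns 2/3. A low value would be the trigger for reviewing the reference data group or the analysis data items.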
[0060]
Further, the DNA microarray data analysis method according to claim 6 is the DNA microarray data analysis method according to claim 5, further including a review step of reviewing the reference data group selected in the data group selection step and/or the analysis data items determined in the analysis data item determination step when it is determined in the verification step that sufficient accuracy cannot be obtained.
[0061]
This shows one example of the review step more specifically. According to this method, when it is determined that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reviewed, so the discrimination work can be performed more accurately and efficiently.
[0062]
The DNA microarray data analysis method according to claim 7 is the DNA microarray data analysis method according to claim 6, wherein the review step further includes a reference data review step of verifying the degree of variation of the reference data using, for example, the standard deviation of each item in the reference data group and the frequency of data appearance, and using reference data having a small degree of variation.
[0063]
This shows one example of the reference data review step more specifically. According to this method, the degree of variation of the reference data is verified using, for example, the standard deviation of each item in the reference data group and the frequency of data appearance, and reference data having a small degree of variation is used. By reviewing the reference data in this way, the discrimination work can be performed more accurately and efficiently.
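One way such a review of the reference data could look is sketched below. The text does not specify the exact criterion, so this sketch assumes a simple rule: a specimen is dropped from the reference data group when too many of its items lie more than a chosen number of standard deviations from the item mean. The function name and both parameters are hypothetical.

```python
from statistics import mean, stdev

def review_reference_data(samples, z_limit=2.0, max_outlying_items=0):
    """Keep only reference specimens whose items show little variation.

    samples            -- list of rows, one row of item values per specimen
    z_limit            -- assumed limit, in standard deviations, per item
    max_outlying_items -- assumed number of outlying items tolerated per row
    """
    columns = list(zip(*samples))
    centers = [mean(column) for column in columns]
    spreads = [stdev(column) for column in columns]
    kept = []
    for row in samples:
        outlying = sum(
            1
            for value, center, spread in zip(row, centers, spreads)
            if spread > 0 and abs(value - center) > z_limit * spread
        )
        if outlying <= max_outlying_items:
            kept.append(row)
    return kept
```

After specimens are removed, the unit space would be created again from the remaining reference data, since the means and standard deviations change.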
[0064]
The DNA microarray data analysis method according to claim 8 is the DNA microarray data analysis method according to claim 6 or 7, wherein the review step further includes an analysis data item review step of verifying, in a plurality of combinations using an orthogonal table, which spots' measurement data on the DNA microarray is to be used, evaluating the combinations by the S/N ratio, and then adopting as the analysis data items those items that give a higher S/N ratio when used as analysis data items.
[0065]
This shows one example of the analysis data item review step more specifically. According to this method, which spots' measurement data on the DNA microarray is to be used is verified in a plurality of combinations using an orthogonal table and evaluated by the S/N ratio, and the items that give a higher S/N ratio when used as analysis data items are adopted as the analysis data items. By reviewing the analysis data items in this way, the discrimination work can be performed more accurately and efficiently.
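The item review with an orthogonal table can be sketched as follows. This is a hedged illustration, not the procedure of the embodiment: it assumes seven candidate items assigned to the standard two-level L8 orthogonal array (level 1 = use the item, level 2 = leave it out), and it assumes the larger-the-better S/N ratio computed from the scaled Mahalanobis distances of the comparison data, a form commonly used in the MT system literature. The function names are hypothetical.

```python
import numpy as np

# Standard L8(2^7) two-level orthogonal array.
# Level 1 = the item is used, level 2 = the item is left out.
L8 = np.array([
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 2, 2, 2, 2],
    [1, 2, 2, 1, 1, 2, 2],
    [1, 2, 2, 2, 2, 1, 1],
    [2, 1, 2, 1, 2, 1, 2],
    [2, 1, 2, 2, 1, 2, 1],
    [2, 2, 1, 1, 2, 2, 1],
    [2, 2, 1, 2, 1, 1, 2],
])

def scaled_md(reference, data):
    """Scaled squared Mahalanobis distance (D^2 / k) of each row of `data`."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0, ddof=1)
    z = (data - mean) / std
    inv_corr = np.linalg.inv(np.corrcoef(reference, rowvar=False))
    return np.einsum("ij,jk,ik->i", z, inv_corr, z) / z.shape[1]

def item_gains(reference, comparison):
    """Gain in S/N ratio (dB) from using each of the seven items."""
    sn = []
    for row in L8:
        used = row == 1
        d2 = scaled_md(reference[:, used], comparison[:, used])
        # Larger-the-better S/N ratio over the comparison data (assumed form).
        sn.append(-10.0 * np.log10(np.mean(1.0 / d2)))
    sn = np.array(sn)
    # Average S/N with the item used minus average S/N with it left out.
    return np.array(
        [sn[L8[:, j] == 1].mean() - sn[L8[:, j] == 2].mean() for j in range(7)]
    )
```

Items with a positive gain separate the comparison data from the unit space more clearly when they are used, so they are the candidates to keep as analysis data items. For more than seven items, a larger two-level orthogonal array would be needed.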
[0066]
The present invention also relates to a DNA microarray data analyzer. The DNA microarray data analyzer according to claim 9 includes: data group selection means for selecting a reference data group by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item to be analyzed; analysis data item determination means for determining the analysis data items, that is, the items that serve as input data for discrimination by the MT system method and that are analyzed in the MT system method; unit space database creation means for creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected by the data group selection means; Mahalanobis distance calculation means for calculating the Mahalanobis distance of the reference data selected by the data group selection means and, after collecting comparison data that clearly does not belong to the reference data, the Mahalanobis distance of that comparison data; threshold setting means for setting one or more thresholds that serve as boundaries between the reference data group and the comparison data group, based on the Mahalanobis distances of the reference data and the comparison data calculated by the Mahalanobis distance calculation means; and classification means for calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing discriminant analysis of the unknown data by judging that distance against the thresholds.
[0067]
According to this apparatus, a reference data group is selected by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item to be analyzed; the analysis data items, that is, the items that serve as input data for discrimination by the MT system method and that are analyzed in the MT system method, are determined; a unit space database representing the Mahalanobis space of the reference data included in the selected reference data group is created; the Mahalanobis distances of the selected reference data and of comparison data that clearly does not belong to the reference data are calculated; one or more thresholds that serve as boundaries between the reference data group and the comparison data group are set based on the calculated Mahalanobis distances; and the Mahalanobis distance of unknown data is calculated using the unit space database and the selected items and judged against the thresholds to classify the unknown data. By applying the MT system method in this way, a large amount of gene data can be analyzed quickly and accurately.
[0068]
That is, in the MT system method, the analysis can be started from a loosely defined question, such as whether or not a medicine is effective or which specific group to select as the reference data, so setting up the analysis becomes very simple.
[0069]
Further, in the MT system method, a database (reference data) can be generated by one calculation and can be processed at a high speed, so that the operation time can be extremely shortened.
[0070]
Although several calculation methods have been proposed for the MT system method, all of them use publicly disclosed formulas, and it is clear that the formulas focus on the correlations among all the reference data, so the transparency of the calculation can be ensured.
[0071]
Further, in the MT system method, since the calculation is performed using a publicly available calculation formula, the same result can be always obtained by using the same data, and the reproducibility can be extremely improved.
[0072]
Also, in the MT system method, once a unit space database has been generated, each data point can be expressed as an absolute value, namely its distance from the origin of this reference. Data outside the analyzed data group can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so the evaluation result is stable every time. Also, in the MT system method, when a causal relationship is confirmed between the phenomenon exhibited by the target and the Mahalanobis distance, setting a plurality of thresholds makes it possible to classify a plurality of states simply by calculating the Mahalanobis distance.
Further, in the MT system method, when a certain proportional relationship is established between the target state and the Mahalanobis distance, the target state itself can be expressed by the Mahalanobis distance.
[0073]
Furthermore, in the MT system method, a procedure for mechanically verifying which items (gene expression data and specimen information) are effective for discrimination is defined (item selection). The most appropriate items can be selected efficiently without being bound by biased opinions.
[0074]
The DNA microarray data analyzer according to claim 10 is the DNA microarray data analyzer according to claim 9, further including discrimination accuracy verification means for separately preparing, in addition to the reference data group, data that should belong to the reference data, creating a verification data group in which that data is mixed with data that does not belong to the reference data, calculating the Mahalanobis distance for the verification data group, and verifying whether the reference data group can be correctly distinguished from the other groups using the thresholds set by the threshold setting means.
[0075]
This shows one example of the discrimination accuracy verification means more specifically. According to this apparatus, data that should belong to the reference data is prepared separately from the reference data group, a verification data group is created by mixing it with data that does not belong to the reference data, the Mahalanobis distance is calculated for the verification data group, and it is verified whether the reference data group can be correctly distinguished from the other groups using the set thresholds. The discrimination work can therefore be performed more efficiently.
[0076]
Further, the DNA microarray data analyzer according to claim 11 is the DNA microarray data analyzer according to claim 10, further including review means for reviewing the reference data group selected by the data group selection means and/or the analysis data items determined by the analysis data item determination means when the verification means determines that sufficient accuracy cannot be obtained.
[0077]
This shows one example of the review means more specifically. According to this apparatus, when it is determined that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reviewed, so the discrimination work can be performed more accurately and efficiently.
[0078]
The DNA microarray data analyzer according to claim 12 is the DNA microarray data analyzer according to claim 11, wherein the review means further includes reference data review means for verifying the degree of variation of the reference data using, for example, the standard deviation of each item in the reference data group and the frequency of data appearance, and using reference data having a small degree of variation.
[0079]
This shows one example of the reference data review means more specifically. According to this apparatus, the degree of variation of the reference data is verified using, for example, the standard deviation of each item in the reference data group and the frequency of data appearance, and reference data having a small degree of variation is used. By reviewing the reference data in this way, the discrimination work can be performed more accurately and efficiently.
[0080]
The DNA microarray data analyzer according to claim 13 is the DNA microarray data analyzer according to claim 11 or 12, wherein the review means further includes analysis data item review means for verifying, in a plurality of combinations using an orthogonal table, which spots' measurement data on the DNA microarray is to be used, evaluating the combinations by the S/N ratio, and then adopting as the analysis data items those items that give a higher S/N ratio when used as analysis data items.
[0081]
This shows one example of the analysis data item review means more specifically. According to this apparatus, which spots' measurement data on the DNA microarray is to be used is verified in a plurality of combinations using an orthogonal table and evaluated by the S/N ratio, and the items that give a higher S/N ratio when used as analysis data items are adopted as the analysis data items. By reviewing the analysis data items in this way, the discrimination work can be performed more accurately and efficiently.
[0082]
The present invention also relates to a program for causing a computer to execute a DNA microarray data analysis method. The program according to claim 14 causes a computer to execute a DNA microarray data analysis method including: a data group selection step of selecting a reference data group by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item to be analyzed; an analysis data item determination step of determining the analysis data items, that is, the items that serve as input data for discrimination by the MT system method and that are analyzed in the MT system method; a unit space database creation step of creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected in the data group selection step; a Mahalanobis distance calculation step of calculating the Mahalanobis distance of the reference data selected in the data group selection step and, after collecting comparison data that clearly does not belong to the reference data, the Mahalanobis distance of that comparison data; a threshold setting step of setting one or more thresholds that serve as boundaries between the reference data group and the comparison data group, based on the Mahalanobis distances of the reference data and the comparison data calculated in the Mahalanobis distance calculation step; and a classification step of calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing discriminant analysis of the unknown data by judging that distance against the thresholds.
[0083]
According to this program, a reference data group is selected by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item to be analyzed; the analysis data items, that is, the items that serve as input data for discrimination by the MT system method and that are analyzed in the MT system method, are determined; a unit space database representing the Mahalanobis space of the reference data included in the selected reference data group is created; the Mahalanobis distances of the selected reference data and of comparison data that clearly does not belong to the reference data are calculated; one or more thresholds that serve as boundaries between the reference data group and the comparison data group are set based on the calculated Mahalanobis distances; and the Mahalanobis distance of unknown data is calculated using the unit space database and the selected items and judged against the thresholds to classify the unknown data. By applying the MT system method in this way, a large amount of gene data can be analyzed quickly and accurately.
[0084]
That is, in the MT system method, the analysis can be started from a loosely defined question, such as whether or not a medicine is effective or which specific group to select as the reference data, so setting up the analysis becomes very simple.
[0085]
Further, in the MT system method, a database (reference data) can be generated by one calculation and can be processed at a high speed, so that the operation time can be extremely shortened.
[0086]
Although several calculation methods have been proposed for the MT system method, all of them use publicly disclosed formulas, and it is clear that the formulas focus on the correlations among all the reference data, so the transparency of the calculation can be ensured.
[0087]
Further, in the MT system method, since the calculation is performed using a publicly available calculation formula, the same result can be always obtained by using the same data, and the reproducibility can be extremely improved.
[0088]
Also, in the MT system method, once a unit space database has been generated, each data point can be expressed as an absolute value, namely its distance from the origin of this reference. Data outside the analyzed data group can also be evaluated simply by calculating their Mahalanobis distance from the unit space database, so the evaluation result is stable every time. Also, in the MT system method, when a causal relationship is confirmed between the phenomenon exhibited by the target and the Mahalanobis distance, setting a plurality of thresholds makes it possible to classify a plurality of states simply by calculating the Mahalanobis distance.
Further, in the MT system method, when a certain proportional relationship is established between the target state and the Mahalanobis distance, the target state itself can be expressed by the Mahalanobis distance.
[0089]
Furthermore, in the MT system method, a procedure for mechanically verifying which items (gene expression data and specimen information) are effective for discrimination is defined (item selection). The most appropriate items can be selected efficiently without being bound by biased opinions.
[0090]
The program according to claim 15 is the program according to claim 14, further including a discrimination accuracy verification step of separately preparing, in addition to the reference data group, data that should belong to the reference data, creating a verification data group in which that data is mixed with data that does not belong to the reference data, calculating the Mahalanobis distance for the verification data group, and verifying whether the reference data group can be correctly distinguished from the other groups using the thresholds set in the threshold setting step.
[0091]
This shows one example of the discrimination accuracy verification step more specifically. According to this program, data that should belong to the reference data is prepared separately from the reference data group, a verification data group is created by mixing it with data that does not belong to the reference data, the Mahalanobis distance is calculated for the verification data group, and it is verified whether the reference data group can be correctly distinguished from the other groups using the set thresholds. The discrimination work can therefore be performed more efficiently.
[0092]
The program according to claim 16 is the program according to claim 15, further including a review step of reviewing the reference data group selected in the data group selection step and/or the analysis data items determined in the analysis data item determination step when it is determined in the verification step that sufficient accuracy cannot be obtained.
[0093]
This shows one example of the review step more specifically. According to this program, when it is determined that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reviewed, so the discrimination work can be performed more accurately and efficiently.
[0094]
The program according to claim 17 is the program according to claim 16, wherein the review step further includes a reference data review step of verifying the degree of variation of the reference data using, for example, the standard deviation of each item in the reference data group and the frequency of data appearance, and using reference data having a small degree of variation.
[0095]
This shows one example of the reference data review step more specifically. According to this program, the degree of variation of the reference data is verified using, for example, the standard deviation of each item in the reference data group and the frequency of data appearance, and reference data having a small degree of variation is used. By reviewing the reference data in this way, the discrimination work can be performed more accurately and efficiently.
[0096]
The program according to claim 18 is the program according to claim 16 or 17, wherein the review step further includes an analysis data item review step of verifying, in a plurality of combinations using an orthogonal table, which spots' measurement data on the DNA microarray is to be used, evaluating the combinations by the S/N ratio, and then adopting as the analysis data items those items that give a higher S/N ratio when used as analysis data items.
[0097]
This shows one example of the analysis data item review step more specifically. According to this program, which spots' measurement data on the DNA microarray is to be used is verified in a plurality of combinations using an orthogonal table and evaluated by the S/N ratio, and the items that give a higher S/N ratio when used as analysis data items are adopted as the analysis data items. By reviewing the analysis data items in this way, the discrimination work can be performed more accurately and efficiently.
[0098]
The present invention also relates to a recording medium, wherein a recording medium according to claim 19 records the program according to any one of claims 14 to 18.
[0099]
According to this recording medium, the program recorded on the recording medium is read and executed by a computer, whereby the program according to any one of claims 14 to 18 can be realized using the computer, and the same effects as those of the respective methods can be obtained.
[0100]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of a DNA microarray data analysis method, a DNA microarray data analysis device, a program, and a recording medium according to the present invention will be described in detail with reference to the drawings. It should be noted that the present invention is not limited by the embodiment.
In particular, the following embodiment describes an example in which the present invention is applied to the prediction of DNA microarray experiment results regarding interferon (INF) sensitivity. However, the present invention is not limited to this case and can be applied in the same way to other cases.
[0101]
[Summary of the present invention]
Hereinafter, the outline of the present invention will be described, and then the configuration, processing, and the like of the present invention will be described in detail. FIG. 1 is a principle configuration diagram showing the basic principle of the present invention.
The present invention generally has the following basic features. The MT system method, proposed by Dr. Genichi Taguchi, is a technique for measuring a wide range of phenomena using the Mahalanobis distance. The principle of the MT system is as follows. For the phenomenon to be measured, a data group with little variation in data and results (the reference data group) is selected from among data groups whose results are known, and a database representing the correlations among all the input data is created from it. This database is then used as a "measuring scale", and unknown phenomena are analyzed as Mahalanobis distances, which makes it possible to predict and discriminate various phenomena.
[0102]
When the MT system method is applied to the analysis of DNA microarray measurement data, the DNA microarray test result data together with specimen data and the like serve as the input data, and the predicted state of the specimen, or a change in that state predicted for the future, is the analysis result. Then, in the process of grouping the specimens, the genes to be analyzed are narrowed down (a search for effective genes) while the accuracy is checked from the beginning.
[0103]
Here, conventional DNA microarray analysis generally uses two types of techniques. One is a technique for quantifying (converting into numerical values) the analysis results (for example, techniques based on neural networks, multiple regression analysis, fuzzy theory, and the like), and the other is a technique for grouping based on the quantified data.
[0104]
Conventionally, in DNA microarray analysis, data analysis has been performed in two stages using these two techniques. For example, the quantification of similarity between genes performed in Non-Patent Document 1 corresponds to the quantification referred to here, and its output is used as the input for the subsequent clustering (grouping). Consequently, analyzing a large amount of data covering, for example, hundreds to tens of thousands of items requires a great deal of computation time, and the analysis accuracy is not very high. Further, although the narrowing down of genes in Non-Patent Document 2 is not a two-stage process of this kind, grouping by a neural network is repeated an enormous number of times, so the calculation is expected to take several days on an ordinary computer.
[0105]
Here, regarding the relationship between the quantification and grouping described above and the category determination and assignment to categories described in the "prior art" section, it can be said that quantification and grouping are performed within each of category determination and assignment to categories. However, as with the neural network of Non-Patent Document 2, not every technique is divided into these two stages, and some combine both in a single process.
[0106]
In contrast, in the MT system, a database is created and used for the subsequent calculations. By using the MT system for DNA microarray analysis, genes can therefore be narrowed down while the accuracy is checked from the beginning, in the process of grouping the specimens. This makes it possible to perform the above two kinds of arithmetic processing with high accuracy and at high speed. As a result, the classification of specimens based on the gene expression state and the analysis of the role of the genes themselves can be performed efficiently.
[0107]
In other words, by using the MT system method for DNA microarray analysis, the position of each specimen can be represented by its Mahalanobis distance from the reference data group, and the efficiency of the analysis can be improved. As a result, genes or specimen information considered to be effective can be searched for efficiently.
[0108]
Further, by making the correlation of all the input data into a database, it becomes possible to catch a phenomenon that cannot be detected with only single data.
[0109]
Thus, the present inventor conceived of applying the above-described MT system method to DNA microarray analysis. The present inventor therefore performed analysis by the MT system method using commercially available software having an MT system data analysis function (for example, "PRAT" (product name) from Probe (company name)). Several commercial software products for the MT system method are available, and any of them can be used.
[0110]
However, there are problems to be overcome in order to actually perform DNA microarray analysis using commercially available software. That is, in the MT system method, it is common general knowledge among those skilled in the art that the number of reference data must exceed the number of items. In the present invention, the number of reference data corresponds to the number of individual specimens, and the number of items corresponds to the number of genes mounted on the DNA microarray plus the number of clinical data items such as blood tests. In DNA microarray analysis, it is almost impossible to satisfy this condition.
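The reason behind this condition can be illustrated numerically: with n reference specimens, the correlation matrix of k items has rank at most n - 1, so when the specimens do not outnumber the items the matrix is singular and cannot be inverted to compute a Mahalanobis distance. The following sketch uses random numbers rather than real measurement data, and the function name and sizes are hypothetical.

```python
import numpy as np

def correlation_rank(n_samples, n_items, seed=0):
    """Rank of the item correlation matrix built from random reference data."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_samples, n_items))
    corr = np.corrcoef(data, rowvar=False)  # n_items x n_items matrix
    return int(np.linalg.matrix_rank(corr))
```

For example, `correlation_rank(18, 100)` cannot exceed 17 even though the matrix is 100 x 100, whereas `correlation_rank(200, 10)` gives the full rank of 10.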
[0111]
Further, when DNA microarray analysis was actually performed using commercially available software, it was found that many genes had extremely low expression levels and that the CV (coefficient of variation) between microarrays was unstable. Because of these problems inherent in DNA microarray analysis, it was difficult to apply the MT system method.
[0112]
Therefore, for the former problem, the present inventor found that by selecting in advance items with more pronounced characteristics, an effective unit space can be created even from a number of reference data far smaller than the number of items, and thereby solved the problem. For the latter problem, combinations of items were tried using an orthogonal table, and the calculation results were compared by S/N ratio to select effective items.
[0113]
Furthermore, in the MT system method, creating a unit space requires preparing two types of data: a reference data group that is the source of the unit space, and an abnormal data group that can be clearly identified as "abnormal" (see "Technological Development in MT Systems," published by the Japan Standards Association, pages 428-429, "Setting Normal and Abnormal Data for Measurers"). In clinical practice, however, the data are ambiguous, specimens for which it is unclear which data group they belong to are included, and judgment is often difficult. The present inventor therefore devised an approach in which the unit space is first created from the reference data alone and is then evaluated against the comparison data, so that the analysis can proceed while allowing for specimens whose determination is suspended.
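One possible reading of "allowing for specimens whose determination is suspended" is a judgment with two thresholds and an undecided band between them. The text does not state that the embodiment works this way, so the rule, the function name, the labels, and the values in the usage note are all assumptions made for illustration.

```python
def judge(distance, lower, upper):
    """Three-way judgment of a Mahalanobis distance (assumed rule).

    At or below `lower` the specimen is treated as belonging to the
    reference group, at or above `upper` as not belonging to it, and
    anything in between is left undecided.
    """
    if distance <= lower:
        return "reference"
    if distance >= upper:
        return "non-reference"
    return "suspended"
```

With the hypothetical thresholds `lower=2.0` and `upper=6.0`, a distance of 1.1 is judged "reference", 9.3 is judged "non-reference", and 3.5 is left "suspended" rather than being forced into either group.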
[0114]
For example, when the above measures were applied to predicting interferon (IFN) sensitivity from DNA microarray experiment results, 18 of 88 samples were selected as reference data and the remaining 70 samples were used for verification. As a result, it was found that the IFN sensitivity of a patient can be predicted with high probability (80 to 90%). The details of these measures and other improvements will be described later.
[0115]
As described above, simply applying the MT system method does not yield a practically effective analysis; it is the measures devised by the present inventor that have made the application of the MT system method to DNA microarray analysis practically effective.
[0116]
[System configuration]
First, the configuration of the present system will be described. FIG. 2 is a block diagram showing an example of the configuration of the present system to which the present invention is applied, and conceptually shows only those parts of the configuration related to the present invention. The present system is roughly configured by communicably connecting, via a network 300, a DNA microarray data analyzer 100 and an external system 200 that provides external databases on sequence information and the like and external programs such as a homology search.
[0117]
In FIG. 2, a network 300 has a function of interconnecting the DNA microarray data analyzer 100 and the external system 200, and is, for example, the Internet.
[0118]
In FIG. 2, the external system 200 is mutually connected to the DNA microarray data analyzer 100 via the network 300, and has a function of providing the user with a website that offers external databases on sequence information and the like and executes external programs such as a homology search or a motif search.
[0119]
Here, the external system 200 may be configured as a WEB server, an ASP server, or the like, and its hardware may be configured by an information processing device such as a commercially available workstation or personal computer and its accompanying devices. Each function of the external system 200 is realized by a CPU, a disk device, a memory device, an input device, an output device, a communication control device, and the like in the hardware configuration of the external system 200, together with the programs that control them.
[0120]
In FIG. 2, the DNA microarray data analyzer 100 schematically includes a control unit 102 such as a CPU that integrally controls the entire DNA microarray data analyzer 100, a communication control interface unit 104 connected to a communication device (not shown) such as a router connected to a communication line or the like, an input / output control interface unit 108 connected to the input device 112 and the output device 114, and a storage unit 106 for storing various databases and tables. These units are communicably connected via an arbitrary communication path. Further, the DNA microarray data analyzer 100 is communicably connected to the network 300 via a communication device such as a router and a wired or wireless communication line such as a dedicated line.
[0121]
The various databases and tables (DNA microarray measurement data file 106a to verification data group file 106g) are held in the storage unit 106, which is a storage means such as a fixed disk device and stores the various programs, tables, files, databases, web page files, and the like used for various processes.
[0122]
Among the constituent elements of the storage unit 106, the DNA microarray measurement data file 106a is a file storing DNA microarray measurement data (for example, test result data such as expression data of each gene and specimen data). FIG. 5 is a diagram showing an example of the user information stored in the DNA microarray measurement data file 106a.
[0123]
As shown in FIG. 5, the information stored in the DNA microarray measurement data file 106a is configured by associating with one another a sample ID for uniquely identifying each sample, test result data such as expression data of each gene, and sample data. Here, the sample data is configured to include items of medical information (for example, liver function data from a blood test).
[0124]
The reference data group file 106b is a reference data group storage unit that stores information on the reference data group and the like.
[0125]
The comparison data group file 106c is a comparison data group storage unit that stores information on the comparison data group and the like.
[0126]
The analysis data item 106d is an analysis data item storage unit that stores information on the analysis data item and the like.
[0127]
Further, the unit space database 106e is a unit space storage means that stores information on the database representing the unit space (Mahalanobis space), which is obtained by calculating correlation coefficients, standard deviations, and the like from each reference datum included in the selected reference data group.
[0128]
The Mahalanobis distance file 106f is a Mahalanobis distance data storage unit that stores information on the Mahalanobis distance for each of the reference data and the comparison data.
[0129]
The verification data group file 106g is a verification data group storage unit that stores information on the verification data group and the like.
[0130]
In FIG. 2, the communication control interface unit 104 controls communication between the DNA microarray data analysis device 100 and the network 300 (or a communication device such as a router). That is, the communication control interface unit 104 has a function of communicating data with another terminal via a communication line.
[0131]
In FIG. 2, the input / output control interface unit 108 controls the input device 112 and the output device 114. Here, as the output device 114, a speaker can be used in addition to a monitor (including a home television); in the following, the output device 114 may be described as a monitor. As the input device 112, a keyboard, a mouse, a microphone, and the like can be used. The monitor also realizes a pointing device function in cooperation with the mouse.
[0132]
In FIG. 2, the control unit 102 has an internal memory for storing a control program such as an OS (Operating System), programs defining various processing procedures and the like, and required data, and performs information processing for executing various processes by means of these programs and the like. Functionally and conceptually, the control unit 102 comprises an analysis purpose determination unit 102a, a data group selection unit 102b, an analysis data item determination unit 102c, a unit space database creation unit 102d, a Mahalanobis distance calculation unit 102e, a threshold setting unit 102f, a judgment accuracy verification unit 102g, a review unit 102h, and a determination unit 102i.
[0133]
The analysis purpose determination unit 102a is an analysis purpose determination unit that determines an item to be determined based on the data of the DNA microarray when analyzing the measurement data of the DNA microarray by the MT system method.
[0134]
Further, the data group selection unit 102b is a data group selecting means that selects a reference data group by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item to be analyzed, and that selects a comparison data group by collecting comparison data that clearly does not belong to the reference data.
[0135]
The analysis data item determination unit 102c is an analysis data item determining means that determines the analysis data items, that is, the items that serve as input data when a determination is made by the MT system method and that are analyzed by the MT system method.
[0136]
The unit space database creation unit 102d is a unit space database creation unit that creates a unit space database indicating a Mahalanobis space of the reference data included in the reference data group selected by the data group selection unit 102b.
[0137]
The Mahalanobis distance calculation unit 102e is a Mahalanobis distance calculation unit that calculates a Mahalanobis distance between the reference data selected by the data group selection unit 102b and the comparison data.
[0138]
The threshold setting unit 102f is a threshold setting means that sets a threshold serving as the boundary between the reference data group and the comparison data group, based on the Mahalanobis distances of the reference data and the comparison data calculated by the Mahalanobis distance calculation unit 102e.
[0139]
In addition, the judgment accuracy verification unit 102g is a judgment accuracy verifying means that separately prepares data that should belong to the reference data, apart from the reference data group, creates a verification data group in which this data is mixed with data that does not belong to the reference data, calculates the Mahalanobis distance for the verification data group, and verifies whether the reference data group and the other groups can be correctly discriminated using the threshold set by the threshold setting unit 102f.
[0140]
The review unit 102h is a reviewing means that, when the judgment accuracy verification unit 102g determines that sufficient accuracy cannot be obtained, reviews the reference data group selected by the data group selection unit 102b and / or the analysis data items determined by the analysis data item determination unit 102c. Here, FIG. 6 is a block diagram illustrating an example of the configuration of the review unit 102h. As illustrated in FIG. 6, the review unit 102h includes a reference data review unit 102j and an analysis data item review unit 102k.
[0141]
Here, the reference data reviewing unit 102j is a reference data reviewing means that verifies the degree of variation of the reference data using the standard deviation of each item in the reference data group and the frequency of appearance of the data, and adopts reference data with a small degree of variation.
[0142]
The analysis data item reviewing unit 102k is an analysis data item reviewing means that verifies, in a plurality of combinations defined by an orthogonal table, which spots of the DNA microarray to use as measurement data, evaluates each combination by the S / N ratio, and adopts as analysis data items those items whose use raises the S / N ratio.
This means also includes a function of examining items using another statistical method, either alone or in combination with the above method.
[0143]
Returning to FIG. 2, the determination unit 102i is a determining means that finalizes the unit space database and the threshold when the verification by the judgment accuracy verification unit 102g shows that sufficient accuracy can be ensured.
[0144]
The details of the processing performed by these units will be described later.
[0145]
[System processing]
Next, an example of the processing of the present system configured as described above according to the present embodiment will be described in detail below with reference to FIG.
[0146]
[DNA microarray analysis processing by MT system]
First, details of the DNA microarray analysis processing by the MT system will be described with reference to FIG. FIG. 3 is a flowchart illustrating an example of a DNA microarray analysis process by the MT system of the present system in the present embodiment.
[0147]
(1) Analysis purpose determination (purpose definition)
First, by the processing of the analysis purpose determination unit 102a, the DNA microarray data analyzer 100 determines the item to be discriminated from the DNA microarray data (the discrimination item) when the measurement data of the DNA microarray is analyzed by the MT system method, thereby clarifying the purpose of the analysis (step SA-1).
[0148]
For example, when the aim is to determine whether interferon administration is effective or ineffective for a patient with hepatitis C, the discrimination item is the effectiveness of interferon administration.
[0149]
(2) Selection of reference data group + comparison data group (reference data, comparison data selection)
In the MT system, in order to use a reference group as a database, a group corresponding to a discrimination item to be analyzed must be collected and used as original data (reference data group) of the database. Further, in addition to the reference data group, it is necessary to collect a data group different from the reference for comparison and select this as a comparison data group. The analysis accuracy is determined based on whether these groups can be accurately determined.
[0150]
Therefore, by the processing of the data group selection unit 102b, the DNA microarray data analyzer 100 accesses the DNA microarray measurement data file 106a, selects a reference data group by collecting, from the DNA microarray measurement data, reference data corresponding to the discrimination item to be analyzed, and selects a comparison data group by collecting comparison data that clearly does not belong to the reference data (step SA-2).
[0151]
Here, it is desirable that the population selected as the reference data have as little variation as possible and be one for which a large number of samples can be secured.
[0152]
For example, when analyzing the efficacy of interferon administration, if the administration results are known for the data held and most of the specimens are ones for which administration was effective, the effective specimens are selected as the reference data and form the reference data group. The reverse applies if ineffective specimens are more numerous. For the comparison data group, a group that clearly does not belong to the reference data group is selected. That is, in this example, the “reference data” is an individual patient sample, and the “reference data group” is a group of patients (each group with and without IFN sensitivity).
[0153]
Then, the data group selection unit 102b stores the selected reference data group in a predetermined storage area of the reference data group file 106b, and stores the selected comparison data group in a predetermined storage area of the comparison data group file 106c.
[0154]
(3) Analysis data item determination (input item determination)
Then, by the processing of the analysis data item determination unit 102c, the DNA microarray data analysis device 100 accesses the DNA microarray measurement data file 106a to the comparison data group file 106c and determines the items (also referred to as features or parameters) that serve as “input data” when the MT system makes a determination, that is, the “analysis data items” analyzed in the MT system (step SA-3).
[0155]
At this time, items that are themselves the result of the analysis (the outcome to be determined) must not be included in the analysis data items. For example, when the effectiveness of interferon administration is to be analyzed, the measurement results of the DNA microarray (the data for each gene) and items such as the age and gender of the sample are the input data (analysis data items).
[0156]
Then, the analysis data item determination unit 102c stores the determined analysis data items in a predetermined storage area of the analysis data item 106d.
[0157]
(4) Unit space database creation
Then, by the processing of the unit space database creating unit 102d, the DNA microarray data analyzer 100 accesses the reference data group file 106b and the like, obtains correlation coefficients, standard deviations, and the like from each reference datum included in the selected reference data group, and creates a database representing the resulting unit space (Mahalanobis space) (step SA-4).
[0158]
In the MT system, there are several types of calculation methods, for example, a method using an inverse matrix, an orthogonal expansion by Schmidt, a method using a cofactor, and the like. In the present invention, any method may be used.
[0159]
As an example, a case where calculation is performed using an inverse matrix is shown below. The mean and standard deviation of every item are calculated from the reference data, that is, the data compiled from the group for which interferon administration was known to be effective; each datum is normalized; and the correlation matrix and its inverse matrix are calculated from the normalized data. Of these, the mean values, the standard deviation values, and the inverse matrix form the unit space database (see “Procedure 3” on page 5, “7.3.1 Mahalanobis Distance” on page 5 of the above-mentioned reference, “Technical Development in MT System”).
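The inverse-matrix procedure described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present system: it assumes NumPy, uses the population standard deviation, and the names build_unit_space, reference, and unit_space, as well as the random stand-in data, are hypothetical.

```python
import numpy as np


def build_unit_space(reference):
    """Build a unit space from reference data of shape (n_samples, k_items)."""
    reference = np.asarray(reference, dtype=float)
    n_samples = reference.shape[0]
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)        # population standard deviation (assumption)
    z = (reference - mean) / std       # normalize every item
    corr = z.T @ z / n_samples         # correlation matrix of the items
    # corr is singular when n_samples <= k_items or when items are collinear.
    inv_corr = np.linalg.inv(corr)
    return {"mean": mean, "std": std, "inv_corr": inv_corr}


# Hypothetical stand-in data: 18 reference samples, 3 pre-selected items.
rng = np.random.default_rng(0)
reference = rng.normal(loc=5.0, scale=2.0, size=(18, 3))
unit_space = build_unit_space(reference)
print(unit_space["inv_corr"].shape)
```

Only the mean values, standard deviations, and inverse matrix are returned, mirroring the three components that the paragraph above names as the unit space database.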
[0160]
Then, the unit space database creating unit 102d stores the created unit space database in a predetermined storage area of the unit space database 106e.
[0161]
(5) Calculation of Mahalanobis distance between reference data and comparison data
Then, by the processing of the Mahalanobis distance calculation unit 102e, the DNA microarray data analyzing apparatus 100 accesses the unit space database 106e and the like, and calculates, with respect to the unit space database created in step SA-4, the Mahalanobis distance of each of the reference data and comparison data selected in step SA-2 (step SA-5). That is, this processing yields a distance for each patient, and this distance is the measurement result of the MT system method. There are several methods for calculating the Mahalanobis distance, and the present invention may use any of them.
[0162]
As an example, a case where calculation is performed using an inverse matrix will be described below. The Mahalanobis distance is calculated from the unit space database created in (4) and each data value. This value indicates the effectiveness of interferon administration (see “Procedure 3” on page 5, “7.3.1 Mahalanobis Distance” on page 5 of the above-mentioned reference, “Technical Development in MT System”).
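A minimal sketch of the distance calculation with an inverse matrix is given below, assuming NumPy. The division by the number of items k is the scaling commonly used in the MT literature (it makes the average squared distance of the reference data equal to 1) and is an assumption here, as are the function name mahalanobis_d2 and the random stand-in data.

```python
import numpy as np


def mahalanobis_d2(samples, mean, std, inv_corr):
    """Squared Mahalanobis distance divided by the number of items, per sample."""
    z = (np.asarray(samples, dtype=float) - mean) / std
    n_items = z.shape[1]
    # Row-wise quadratic form z R^-1 z^T.
    return np.einsum("ij,jk,ik->i", z, inv_corr, z) / n_items


# Hypothetical stand-in data: 40 reference and 10 comparison samples, 4 items.
rng = np.random.default_rng(0)
reference = rng.normal(size=(40, 4))
comparison = rng.normal(loc=3.0, size=(10, 4))

# Unit space from the reference data only.
mean = reference.mean(axis=0)
std = reference.std(axis=0)
z_ref = (reference - mean) / std
inv_corr = np.linalg.inv(z_ref.T @ z_ref / len(reference))

d2_ref = mahalanobis_d2(reference, mean, std, inv_corr)
d2_cmp = mahalanobis_d2(comparison, mean, std, inv_corr)
print(d2_ref.mean(), d2_cmp.mean())
```

With this scaling the reference samples average a distance of 1, so comparison samples can be read directly as multiples of the typical reference distance.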
[0163]
Then, the Mahalanobis distance calculation unit 102e stores each calculated Mahalanobis distance in a predetermined storage area of the Mahalanobis distance file 106f.
[0164]
(6) Threshold setting
Then, by the processing of the threshold setting unit 102f, the DNA microarray data analyzer 100 accesses the Mahalanobis distance file 106f and the like, and sets a value (threshold) serving as the boundary between the reference data group and the comparison data group, based on the Mahalanobis distances of the reference data and the comparison data calculated in step SA-5 (step SA-6).
[0165]
Specifically, the threshold may be set by representing the distribution of the distances visually with a graph such as a histogram, as shown in FIG. 4, or by examining the distribution of the distances through mathematical analysis. Here, FIG. 4 is a diagram showing an example in which the reference data group and the comparison data group are displayed as distance histograms and a threshold is set.
[0166]
Here, one or more thresholds may be set. In the MT system method, when a causal relationship is confirmed between the phenomenon indicated by the target and the Mahalanobis distance, setting a plurality of thresholds makes it possible to classify a plurality of states simply by calculating the Mahalanobis distance.
[0167]
As a specific example, when a proportional relationship holds between the Mahalanobis distance and the effect of interferon administration, the Mahalanobis distance itself may be used as the predicted value of the interferon administration effect, and nine thresholds may be provided to divide the predicted value into 10 groups, giving a 10-step index for predicting the effect of interferon administration.
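The 10-step evaluation with nine thresholds can be illustrated with a short sketch. The evenly spaced threshold values, the function name grade, and the sample distances are hypothetical; how the thresholds are actually chosen, and whether a higher grade means a stronger or weaker effect, depends on the relationship confirmed for the data at hand.

```python
from bisect import bisect_right

# Hypothetical thresholds: nine cut points on the Mahalanobis distance.
thresholds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]


def grade(distance):
    """Return a grade from 1 to 10 for one Mahalanobis distance."""
    # A distance equal to a threshold falls into the higher grade.
    return bisect_right(thresholds, distance) + 1


distances = [0.4, 1.0, 2.5, 8.9, 9.0, 20.0]
grades = [grade(d) for d in distances]
print(grades)  # [1, 2, 3, 9, 10, 10]
```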
[0168]
Further, when there are a plurality of groups other than the reference data group, the groups may be divided by the magnitude of the Mahalanobis distance to determine the plurality of groups. For example, in the right figure, the reference data group and the comparison data group intersect. Assuming that the reference data is a group in which the interferon administration is effective and the comparison data is an ineffective group, if it is desired to discriminate both with a higher probability, the Mahalanobis distance of this intersection may be set as a threshold.
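One way to perform the "mathematical analysis" variant of threshold setting is sketched below. Instead of reading the intersection point off the histograms, this sketch substitutes a simple criterion: it scans candidate cut points and keeps the one that misclassifies the fewest samples. The criterion, the function name choose_threshold, and the distances are assumptions for illustration only.

```python
def choose_threshold(ref_distances, cmp_distances):
    """Pick the cut point that misclassifies the fewest samples."""
    values = sorted(list(ref_distances) + list(cmp_distances))
    # Candidate cut points: midpoints between neighbouring distances.
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]

    def errors(t):
        missed_ref = sum(d > t for d in ref_distances)    # reference judged as other
        missed_cmp = sum(d <= t for d in cmp_distances)   # comparison judged as reference
        return missed_ref + missed_cmp

    return min(candidates, key=errors)


# Hypothetical distances with a small overlap between the two groups.
ref_distances = [0.5, 0.8, 1.0, 1.5, 2.0, 2.25]
cmp_distances = [1.75, 3.0, 4.0, 6.0]
threshold = choose_threshold(ref_distances, cmp_distances)
print(threshold)  # 2.625
```

Weighting the two kinds of error differently would shift the threshold toward one group, which may be preferable when one misjudgment is clinically more costly than the other.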
[0169]
(7) Verification of discrimination accuracy using verification data group
Then, by the processing of the judgment accuracy verification unit 102g, the DNA microarray data analyzer 100 accesses the DNA microarray measurement data file 106a and the like, separately prepares data that should belong to the reference data, apart from the reference data group, creates a verification data group in which this data is mixed with data that does not belong to the reference data, calculates the Mahalanobis distance for the verification data group, and verifies whether the reference data group and the other groups can be correctly discriminated using the threshold set by the threshold setting unit 102f (step SA-7).
[0170]
That is, the judgment accuracy verification unit 102g accesses the DNA microarray measurement data file 106a and the like, separately prepares data that should belong to the reference data (patient data for which the result of IFN administration is known), apart from the reference data group, mixes it with data that does not belong to the reference data, and uses the mixture as the verification data group.
[0171]
Then, the judgment accuracy verification unit 102g stores the verification data group in a predetermined storage area of the verification data group file 106g.
[0172]
Then, the judgment accuracy verification unit 102g calculates the Mahalanobis distance for the verification data group in the same manner as in step SA-5, and then uses the threshold value obtained in step SA-6 to divide the reference data group and the other groups. Verify that it can be determined correctly.
[0173]
Here, one verification method is to use the discrimination rate. In the MT system method, however, it is common to verify accuracy by expressing the relationship between the distance obtained from the input data and the expected measurement result as an SN ratio (signal-to-noise ratio).
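The simpler of the two verification methods, the discrimination rate, can be sketched as follows. The function name discrimination_rate, the distances, the labels, and the threshold value are hypothetical; the sketch does not cover the SN-ratio evaluation.

```python
def discrimination_rate(distances, is_reference, threshold):
    """Fraction of verification samples placed on the correct side of the threshold."""
    correct = sum(
        (d <= threshold) == expected
        for d, expected in zip(distances, is_reference)
    )
    return correct / len(distances)


# Hypothetical verification data group: four samples that should belong to the
# reference data and four that should not.
distances = [0.7, 1.2, 2.9, 0.9, 5.1, 3.3, 1.8, 7.4]
is_reference = [True, True, True, True, False, False, False, False]
rate = discrimination_rate(distances, is_reference, threshold=2.5)
print(rate)  # 0.75
```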
[0174]
For example, it is verified that a data group known to respond to interferon administration, other than the reference data, falls within the threshold and that the other populations fall outside the threshold, and the linear relationship between the effectiveness of interferon administration and the calculated Mahalanobis distance is verified by the SN ratio. For item selection using the orthogonal table and the SN ratio, see the above-mentioned reference (“2.5 Item Selection” on page 19 or “6.6 MT System Parameter Design” on page 62 of “Technical Development in MT System”).
[0175]
(8) Review
If it is determined in step SA-7 that sufficient accuracy cannot be obtained, the examination of the items selected as analysis data items in step SA-3, or of the reference data group selected in step SA-2, may have been insufficient.
[0176]
Therefore, when the judgment accuracy verification unit 102g determines that sufficient accuracy cannot be obtained, the DNA microarray data analyzer 100, by the processing of the review unit 102h, accesses the DNA microarray measurement data file 106a to the analysis data item 106d and reviews the reference data group selected by the data group selection unit 102b and / or the analysis data items determined by the analysis data item determination unit 102c (step SA-8).
[0177]
Here, by the processing of the reference data reviewing unit 102j, the review unit 102h reviews the reference data group on the basis of the degree of variation of the reference data, using the standard deviation of each item in the reference data group and the frequency of appearance of the data.
[0178]
There are several methods for reselecting items. Examples include referring to the opinions of experts on the data being handled, selecting items whose mean differs greatly between the reference data group and the other groups, and the method proposed in the MT system method, in which an orthogonal table is used to form a plurality of combinations of items and the validity of each item is verified by the SN ratio.
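The second method, selecting items whose mean differs greatly between the groups, can be sketched as below. Scoring each item by the difference of means divided by the reference standard deviation is one possible choice and is an assumption, as are the function name rank_items_by_mean_difference and the synthetic data, in which items 5 and 42 are made informative on purpose.

```python
import random
from statistics import mean, pstdev


def rank_items_by_mean_difference(reference, comparison, n_keep):
    """Return indices of the n_keep items whose group means differ most."""
    n_items = len(reference[0])
    scores = []
    for j in range(n_items):
        ref_col = [row[j] for row in reference]
        cmp_col = [row[j] for row in comparison]
        # Difference of means in units of the reference standard deviation.
        scores.append(abs(mean(cmp_col) - mean(ref_col)) / pstdev(ref_col))
    ranked = sorted(range(n_items), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]


# Hypothetical stand-in data: 18 samples per group, 200 items.
rng = random.Random(0)
reference = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(18)]
comparison = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(18)]
for row in comparison:
    row[5] += 8.0
    row[42] += 8.0

selected = rank_items_by_mean_difference(reference, comparison, n_keep=2)
print(sorted(selected))  # [5, 42]
```

Reducing 200 items to a handful in this way also brings the number of items below the number of reference samples, which the inverse-matrix calculation requires.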
[0179]
For example, by the processing of the analysis data item reviewing unit 102k, the review unit 102h verifies, in a plurality of combinations defined by an orthogonal table, which spots (gene expression states) of the DNA microarray to use in the analysis, evaluates each combination by the SN ratio, and adopts as items those whose use raises the SN ratio.
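A small sketch of item selection with an orthogonal table and the SN ratio follows, assuming NumPy. Several choices here are assumptions rather than details given in this description: a two-level L8 table for exactly seven items, the larger-the-better SN ratio computed from the distances of the abnormal (comparison) samples, and the synthetic data in which only the first two items carry information. All function and variable names are hypothetical.

```python
import numpy as np


def l8_array():
    """Two-level L8 orthogonal table (8 runs x 7 columns); 1 = use item, 0 = omit."""
    runs = np.arange(8)
    a, b, c = (runs >> 2) & 1, (runs >> 1) & 1, runs & 1
    omit = np.stack([a, b, a ^ b, c, a ^ c, b ^ c, a ^ b ^ c], axis=1)
    return 1 - omit  # run 0 uses all seven items


def scaled_d2(reference, samples):
    """Mahalanobis D^2 / k of samples against a unit space built from reference."""
    mean, std = reference.mean(axis=0), reference.std(axis=0)
    z_ref = (reference - mean) / std
    z = (samples - mean) / std
    inv_corr = np.linalg.inv(z_ref.T @ z_ref / len(reference))
    return np.einsum("ij,jk,ik->i", z, inv_corr, z) / z.shape[1]


def item_gains(reference, abnormal):
    """SN-ratio gain per item: mean SN over runs using it minus runs omitting it."""
    if reference.shape[1] != 7:
        raise ValueError("this sketch handles exactly seven items")
    design = l8_array()
    # Larger-the-better SN ratio of the abnormal distances (assumed formula).
    eta = np.array([
        -10.0 * np.log10(np.mean(1.0 / scaled_d2(reference[:, use == 1],
                                                 abnormal[:, use == 1])))
        for use in design
    ])
    return np.array([
        eta[design[:, j] == 1].mean() - eta[design[:, j] == 0].mean()
        for j in range(7)
    ])


# Hypothetical stand-in data: only items 0 and 1 separate the two groups.
rng = np.random.default_rng(0)
reference = rng.normal(size=(30, 7))
abnormal = rng.normal(size=(12, 7))
abnormal[:, :2] += 5.0

gains = item_gains(reference, abnormal)
keep = np.flatnonzero(gains > 0)  # items whose use raises the SN ratio
print(np.round(gains, 2), keep)
```

With thousands of spots, a larger orthogonal table or a prior reduction of the item count would be needed; the eight-run table is used here only to keep the example readable.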
[0180]
(9) Confirm
When the verification in step SA-7 shows that sufficient accuracy can be ensured, the DNA microarray data analysis device 100 finalizes the database (unit space) and the threshold by the processing of the determination unit 102i (step SA-9).
[0181]
For example, the database and the threshold are finalized at the stage where the verification shows that the effectiveness of interferon administration can be diagnosed from the input data with an accuracy that is sufficient from a medical standpoint.
[0182]
(10) Start classification
After the database (unit space) is finalized, it is used to calculate the Mahalanobis distance for unknown data, so that the properties of the data are expressed as a distance and used for diagnosis and prediction on various data (step SA-10).
[0183]
For example, when the effectiveness of interferon administration is to be evaluated with a DNA microarray and the MT system method, the Mahalanobis distance is calculated against the unit space database created in advance, using the expression data of the DNA microarray together with specimen information as input data, and the effectiveness of interferon administration is diagnosed and predicted using the predetermined threshold of the Mahalanobis distance.
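The classification step reuses the finalized unit space and threshold without recomputing them, which the following sketch illustrates in plain Python. The two-item unit space, its numerical values, the threshold, the sample values, and the names UNIT_SPACE, THRESHOLD, scaled_d2, and classify are all hypothetical.

```python
# Hypothetical finalized unit space for two items (values are illustrative only).
UNIT_SPACE = {
    "mean": [5.0, 120.0],
    "std": [2.0, 15.0],
    # Inverse of the correlation matrix [[1.0, 0.6], [0.6, 1.0]].
    "inv_corr": [[1.5625, -0.9375], [-0.9375, 1.5625]],
}
THRESHOLD = 4.0  # hypothetical boundary fixed in the threshold-setting step


def scaled_d2(sample, unit_space):
    """Squared Mahalanobis distance of one sample, divided by the number of items."""
    z = [(x - m) / s
         for x, m, s in zip(sample, unit_space["mean"], unit_space["std"])]
    k = len(z)
    quad = sum(z[i] * unit_space["inv_corr"][i][j] * z[j]
               for i in range(k) for j in range(k))
    return quad / k


def classify(sample, unit_space, threshold):
    """Assign an unknown sample to the reference group or to the other group."""
    if scaled_d2(sample, unit_space) <= threshold:
        return "reference group"
    return "other group"


print(classify([6.0, 127.5], UNIT_SPACE, THRESHOLD))  # reference group
print(classify([12.0, 90.0], UNIT_SPACE, THRESHOLD))  # other group
```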
Thus, the DNA microarray analysis processing by the MT system is completed.
[0184]
[Other embodiments]
Although the embodiments of the present invention have been described above, the present invention is not limited to these embodiments and may be implemented in various different embodiments within the scope of the technical idea described in the claims.
[0185]
For example, the case where the DNA microarray data analysis apparatus 100 performs the processing in a stand-alone form has been described as an example, but the processing may instead be performed in response to a request from a client terminal housed separately from the DNA microarray data analysis apparatus 100, with the processing result returned to the client terminal.
[0186]
Further, of the processes described in the embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method.
In addition, the processing procedures, control procedures, specific names, information including parameters such as various registration data and search conditions, screen examples, and database configurations shown in the above description and drawings can be changed arbitrarily unless otherwise noted.
[0187]
Further, regarding the DNA microarray data analyzer 100, the components shown in the drawings are functionally conceptual, and need not necessarily be physically configured as shown in the drawings.
[0188]
For example, all or any part of the processing functions provided in each unit or each device of the DNA microarray data analysis device 100, particularly each processing function performed by the control unit 102, can be realized by a CPU (Central Processing Unit) and a program interpreted and executed by the CPU, or can be realized as hardware by wired logic. The program is recorded on a recording medium described later, and is mechanically read by the DNA microarray data analyzer 100 as needed.
[0189]
That is, a computer program for giving instructions to the CPU in cooperation with an OS (Operating System) and performing various processes is recorded in the storage unit 106 such as a ROM or an HD. This computer program is executed by being loaded into a RAM or the like, and configures the control unit 102 in cooperation with the CPU. Further, this computer program may be recorded in an application program server connected to the DNA microarray data analyzer 100 via an arbitrary network 300, and all or part of it can be downloaded as necessary.
[0190]
Further, the program according to the present invention can be stored in a computer-readable recording medium. Here, the “recording medium” includes any “portable physical medium” such as a flexible disk, a magneto-optical disk, a ROM, an EPROM, an EEPROM, a CD-ROM, an MO, or a DVD; any “fixed physical medium” such as a ROM, a RAM, or an HD built into various computer systems; and a “communication medium” that holds the program for a short term, such as a communication line or a carrier wave used when the program is transmitted via a network typified by a LAN, a WAN, or the Internet.
[0191]
The “program” is a data processing method described in an arbitrary language or description method, and may be in any format such as source code or binary code. The “program” is not necessarily limited to a single program; it also includes programs distributed as a plurality of modules or libraries, and programs that achieve their functions in cooperation with a separate program typified by an OS (Operating System). Note that known configurations and procedures can be used for the specific configuration for reading the recording medium, the reading procedure, the installation procedure after reading, and the like in each apparatus described in the embodiments.
[0192]
The various databases and the like (DNA microarray measurement data file 106a to verification data group file 106g) stored in the storage unit 106 are held in storage means such as memory devices including RAM and ROM, fixed disk devices such as hard disks, flexible disks, and optical disks, which store the various programs, tables, files, databases, web page files, and the like used for various processes and for providing a website.
[0193]
Further, the DNA microarray data analyzer 100 may be realized by connecting peripheral devices such as a printer, a monitor, or an image scanner to an information processing device such as a known personal computer, workstation, or other information processing terminal, and installing on that information processing device software (including programs, data, and the like) that implements the method of the present invention.
[0194]
Further, the specific form of distribution and integration of the DNA microarray data analyzer 100 and the like is not limited to that shown in the description and the drawings; all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads and the like (for example, grid computing). For example, each database may be configured as an independent database device, or part of the processing may be realized using a CGI (Common Gateway Interface).
[0195]
The network 300 has a function of interconnecting the DNA microarray data analyzer 100 and the external system 200, and may include any of, for example, the Internet, an intranet, a LAN (including both wired and wireless), a VAN, a personal computer communication network, a public telephone network (including both analog and digital), a leased line network (including both analog and digital), a CATV network, a network of the IMT2000 system, GSM system, PDC / PDC-P system, or the like, a local radio network such as Bluetooth, a PHS network, and a satellite communication network such as CS, BS, or ISDB. That is, the present system can transmit and receive various data via any network, whether wired or wireless.
[0196]
[Effects of the invention]
As described above in detail, according to the present invention, the measurement data of a DNA microarray is analyzed using the Mahalanobis-Taguchi system. The present invention can therefore provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can analyze a large amount of gene data at high speed and with high accuracy by applying the MT system method.
[0197]
That is, conventional DNA microarray data analysis methods using a neural network require teacher data, and the features of all the teacher data (data for which the determination result is clear, such as clinically determined drug sensitivity) must be quantified. In the MT system method, by contrast, analysis can start from loosely defined criteria, such as whether or not a drug was effective or whether or not a specimen belongs to a specific group, so preparing the reference data (teacher data) becomes very simple.
[0198]
In conventional DNA microarray data analysis methods using neural network learning, multiple regression coefficient calculation, and the like, the same operation is repeated many times to bring the neural circuit or the parameters to an optimal state. In the MT system method, the database (reference data) is generated in a single calculation and can be processed at high speed, so the operation time can be greatly shortened.
[0199]
Further, in conventional DNA microarray data analysis methods using a neural network or the like, the calculation process is treated as a black box, and how the result was obtained is not clear (that is, the validity of the calculation process cannot be verified). Although several calculation methods have been proposed for the MT system method, all of them use published formulas, and it is clear that each formula focuses on the correlations among all the reference data, so the validity of the calculation process can be guaranteed.
[0200]
In conventional DNA microarray data analysis methods using a neural network or the like, the calculated result fluctuates depending on the contents of the teacher data and the software used, and the result is not reproduced when the neural circuit is rebuilt. In the MT system method, the calculation uses published formulas, so the same data always yields the same result, and reproducibility is greatly improved.
[0201]
Further, DNA microarray data analysis methods using general clustering techniques divide a given data group into groups, and so express differences between specimens or genes only relative to that group. In the MT system method, once a unit space database has been generated, a difference can be expressed as an absolute value, namely the distance from the origin of that reference. Data outside the analyzed data group can also be evaluated simply by calculating its Mahalanobis distance from the unit space database, so the evaluation result is stable every time.
[0202]
Furthermore, the MT system method defines a procedure (item selection) for mechanically verifying which items (gene expression data and specimen information) are effective for discrimination, so the most appropriate items can be selected efficiently without being constrained by preconceptions.
[0203]
Further, according to the present invention, the genes to be analyzed are classified from the measurement data of a DNA microarray using the Mahalanobis-Taguchi system. It is therefore possible to provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can classify a large amount of gene data at high speed and with high accuracy by applying the MT system method.
[0204]
According to the present invention, the genes to be analyzed are selected from the DNA microarray measurement data of specimens using the Mahalanobis-Taguchi system, and clinical specimens are classified based on the data of the selected genes using the Mahalanobis-Taguchi system. It is therefore possible to provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can classify a large amount of microarray measurement data at high speed and with high accuracy by applying the MT system method.
[0205]
Further, according to the present invention, a reference data group is selected by collecting, from the measurement data of the DNA microarray, reference data corresponding to the discrimination item to be analyzed; the analysis data items, which are the items used as input data when discriminating by the MT system method, are determined; and a unit space database representing the Mahalanobis space of the reference data included in the selected reference data group is created. The Mahalanobis distances of the reference data and of comparison data that clearly does not belong to the reference data are calculated, and one or more thresholds serving as boundaries between the reference data group and the comparison data group are set based on those distances. The Mahalanobis distance of unknown data is then calculated using the unit space database and the selected items, and the unknown data is classified by judging that distance against the threshold. It is therefore possible to provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can analyze a large amount of gene data at high speed and with high accuracy by applying the MT system method.
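The sequence in this paragraph (unit space, Mahalanobis distance, threshold, classification) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not taken from the specification: the function names, the tiny two-item data set, and in particular the midpoint rule used to place the threshold. The distance is computed in the form usual for the MT system, the squared Mahalanobis distance divided by the number of items, so that the reference data average 1.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular; drop or merge items")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def build_unit_space(reference_rows):
    """Unit space: per-item mean, standard deviation, and inverse correlation matrix.
    Assumes every item varies within the reference group (no zero standard deviation)."""
    n, k = len(reference_rows), len(reference_rows[0])
    means = [sum(r[j] for r in reference_rows) / n for j in range(k)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in reference_rows) / n)
            for j in range(k)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in reference_rows]
    corr = [[sum(row[a] * row[b] for row in z) / n for b in range(k)] for a in range(k)]
    return {"means": means, "stds": stds, "inv_corr": invert(corr)}

def mahalanobis_distance(unit_space, row):
    """Squared Mahalanobis distance divided by the number of items."""
    k = len(row)
    z = [(row[j] - unit_space["means"][j]) / unit_space["stds"][j] for j in range(k)]
    inv = unit_space["inv_corr"]
    return sum(z[a] * inv[a][b] * z[b] for a in range(k) for b in range(k)) / k

def classify(unit_space, threshold, row):
    """Judge unknown data by comparing its distance with the threshold."""
    return "reference" if mahalanobis_distance(unit_space, row) <= threshold else "comparison"

# Hypothetical expression values for two items (spots) per specimen.
reference = [[1.0, 2.1], [1.2, 2.3], [0.9, 1.9], [1.1, 2.0], [1.3, 2.4], [1.0, 2.2]]
comparison = [[2.5, 0.5], [3.0, 1.0]]

unit_space = build_unit_space(reference)
ref_d = [mahalanobis_distance(unit_space, r) for r in reference]
cmp_d = [mahalanobis_distance(unit_space, r) for r in comparison]

# Assumed rule: midpoint between the farthest reference and the nearest comparison point.
threshold = (max(ref_d) + min(cmp_d)) / 2
```

With these definitions, `classify(unit_space, threshold, row)` labels a new specimen without rebuilding anything, which is the single-calculation property the description attributes to the unit space database.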
[0206]
Further, according to the present invention, data that should belong to the reference data is prepared separately from the reference data group, a verification data group is created by mixing it with data that does not belong to the reference data, the Mahalanobis distance is calculated for the verification data group, and it is verified whether the reference data group and the other groups can be correctly distinguished using the set threshold. It is therefore possible to provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can perform the discrimination work more efficiently.
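As a rough illustration of this verification, the sketch below takes Mahalanobis distances that have already been calculated for a verification data group, together with the known membership of each specimen, and counts how often the threshold judges correctly. The function name, the sample distances, the threshold value, and the 0.9 accuracy target are all hypothetical.

```python
def verify_threshold(verification, threshold):
    """verification: list of (mahalanobis_distance, belongs_to_reference) pairs.
    Returns the fraction judged correctly and the misjudged entries."""
    correct = 0
    misjudged = []
    for distance, belongs in verification:
        predicted = distance <= threshold  # at or below the threshold -> reference group
        if predicted == belongs:
            correct += 1
        else:
            misjudged.append((distance, belongs))
    return correct / len(verification), misjudged

# Hypothetical verification data group: distances with their known membership.
verification_group = [(0.8, True), (1.4, True), (3.9, True), (12.0, False), (45.0, False)]

accuracy, misjudged = verify_threshold(verification_group, threshold=3.0)

# Assumed acceptance criterion; falling below it would trigger a review of the
# reference data group and/or the analysis data items.
needs_review = accuracy < 0.9
```

Returning the misjudged entries as well as the accuracy makes it easier to see whether the reference data or the item set is the likelier cause.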
[0207]
Further, according to the present invention, when it is determined that sufficient accuracy cannot be obtained, the selected reference data group and/or the determined analysis data items are reviewed. It is therefore possible to provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can perform the discrimination work more accurately and efficiently.
[0208]
Further, according to the present invention, the degree of variation of the reference data is verified using the standard deviation and the data appearance frequency of each item in the reference data group, and reference data with a small degree of variation is used. By reviewing the reference data in this way, it is possible to provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can perform the discrimination work more accurately and efficiently.
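One way to picture this review of the reference data is sketched below. It covers only the standard deviation part (the data appearance frequency is not modeled), and the function name, the sample values, and the cutoff of two standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def review_reference_data(rows, max_sigma=2.0):
    """Split reference specimens into those kept and those dropped.
    A specimen is dropped when any item lies more than max_sigma standard
    deviations from that item's mean over the reference data group."""
    columns = list(zip(*rows))
    centers = [mean(c) for c in columns]
    spreads = [pstdev(c) for c in columns]
    kept, dropped = [], []
    for row in rows:
        outlying = any(
            s > 0 and abs(v - m) > max_sigma * s
            for v, m, s in zip(row, centers, spreads)
        )
        (dropped if outlying else kept).append(row)
    return kept, dropped

# Hypothetical reference data group: two items per specimen, last specimen is atypical.
candidate_reference = [
    [1.0, 2.1], [1.2, 2.3], [0.9, 1.9], [1.1, 2.0],
    [1.3, 2.4], [1.0, 2.2], [4.0, 2.1],
]

kept, dropped = review_reference_data(candidate_reference)
```

The unit space would then be rebuilt from `kept` only, so that a few widely scattered specimens do not inflate the reference variation.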
[0209]
Furthermore, according to the present invention, which spots on the DNA microarray to use as measurement data is verified over a plurality of combinations using an orthogonal table and evaluated by the S/N ratio, and the items that increase the S/N ratio when used as analysis data items are adopted as the analysis data items. By reviewing the analysis data items in this way, it is possible to provide a DNA microarray data analysis method, a DNA microarray data analyzer, a program, and a recording medium that can perform the discrimination work more accurately and efficiently.
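The item review by orthogonal table and S/N ratio can be sketched as follows. This is an illustration under stated assumptions, not the procedure as specified: it uses the L4 two-level orthogonal array for three items (level 1 means the item is used, level 2 means it is not), the larger-the-better S/N ratio in the form commonly given for the MT system, and invented comparison-data distances for each run. An item is kept when the average S/N ratio over the runs that use it exceeds the average over the runs that do not.

```python
import math

# L4 two-level orthogonal array for three items: 1 = item used, 2 = item not used.
L4 = [
    (1, 1, 1),
    (1, 2, 2),
    (2, 1, 2),
    (2, 2, 1),
]

def sn_ratio_larger_the_better(distances):
    """eta = -10 * log10((1/t) * sum(1 / D_j)), where D_j are the squared,
    item-count-scaled Mahalanobis distances of the comparison data in one run."""
    t = len(distances)
    return -10.0 * math.log10(sum(1.0 / d for d in distances) / t)

def item_gains(array, sn_by_run):
    """Gain of each item: mean S/N over runs using it minus mean S/N over runs not using it."""
    gains = []
    for item in range(len(array[0])):
        used = [sn for run, sn in zip(array, sn_by_run) if run[item] == 1]
        unused = [sn for run, sn in zip(array, sn_by_run) if run[item] == 2]
        gains.append(sum(used) / len(used) - sum(unused) / len(unused))
    return gains

# Hypothetical comparison-data distances obtained with the item subset of each run.
distances_by_run = [
    [40.0, 50.0],  # run 1: items 0, 1, 2
    [20.0, 25.0],  # run 2: item 0 only
    [10.0, 12.5],  # run 3: item 1 only
    [2.0, 2.5],    # run 4: item 2 only
]

sn_by_run = [sn_ratio_larger_the_better(d) for d in distances_by_run]
gains = item_gains(L4, sn_by_run)
selected_items = [i for i, g in enumerate(gains) if g > 0]
```

In practice each run's distances would come from rebuilding the unit space with only the items that run uses; larger arrays handle more items with the same gain calculation.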
[Brief description of the drawings]
FIG. 1 is a principle configuration diagram showing a basic principle of the present invention.
FIG. 2 is a block diagram illustrating an example of a configuration of the present system to which the present invention is applied.
FIG. 3 is a flowchart illustrating an example of a DNA microarray analysis process by the MT system of the present system in the present embodiment.
FIG. 4 is a diagram illustrating an example of a case where a reference data group and a comparison data group are displayed in a distance histogram and a threshold is set.
FIG. 5 is a diagram showing an example of user information stored in a DNA microarray measurement data file 106a.
FIG. 6 is a block diagram illustrating an example of a configuration of a review unit 102h.
[Explanation of symbols]
100 DNA microarray data analyzer
102 Control unit
102a Analysis purpose determination unit
102b Data group selector
102c Analysis data item determination unit
102d Unit space database creation unit
102e Mahalanobis distance calculator
102f Threshold setting unit
102g Judgment accuracy verification unit
102h Review unit
102i Confirmation unit
102j Reference data review unit
102k Analysis data item review unit
104 Communication control interface unit
106 Storage unit
106a DNA microarray measurement data file
106b Reference data group file
106c Comparison data group file
106d Analysis data item
106e Unit space database
106f Mahalanobis distance file
106g Verification data group file
108 I/O control interface
112 Input device
114 Output device
200 External system
300 Network

Claims

A DNA microarray data analysis method, comprising analyzing measurement data of a DNA microarray using a Mahalanobis-Taguchi system.

A DNA microarray data analysis method, comprising classifying genes to be analyzed from measurement data of a DNA microarray using a Mahalanobis-Taguchi system.

A DNA microarray data analysis method, characterized in that genes to be analyzed are selected from DNA microarray measurement data of specimens using a Mahalanobis-Taguchi system, and clinical specimens are classified based on the data of the selected genes using the Mahalanobis-Taguchi system.

A DNA microarray data analysis method, comprising:
a data group selection step of selecting a reference data group by collecting, from measurement data of a DNA microarray, reference data corresponding to the discrimination item to be analyzed;
an analysis data item determination step of determining analysis data items, which are the items used as input data when discrimination is performed by the Mahalanobis-Taguchi system;
a unit space database creation step of creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected in the data group selection step;
a Mahalanobis distance calculation step of calculating the Mahalanobis distances of the reference data selected in the data group selection step and of collected comparison data that clearly does not belong to the reference data;
a threshold setting step of setting one or more thresholds serving as boundaries between the reference data group and the comparison data group, based on the Mahalanobis distances of the reference data and of the comparison data calculated in the Mahalanobis distance calculation step; and
a classification step of calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing a discriminant analysis of the unknown data by judging that Mahalanobis distance against the threshold.

The DNA microarray data analysis method according to claim 4, further comprising:
a determination accuracy verification step of separately preparing, apart from the reference data group, data that should belong to the reference data, creating a verification data group in which that data is mixed with data that does not belong to the reference data, calculating the Mahalanobis distance for the verification data group, and verifying whether the reference data group and the other groups can be correctly discriminated using the threshold set in the threshold setting step.

The DNA microarray data analysis method according to claim 5, further comprising:
a review step of reviewing the reference data group selected in the data group selection step and/or the analysis data items determined in the analysis data item determination step when it is determined in the verification step that sufficient accuracy cannot be obtained.

The DNA microarray data analysis method according to claim 6, wherein the review step further comprises:
a reference data review step of verifying the degree of variation of the reference data using the standard deviation or the data appearance frequency of each item in the reference data group, and using reference data with a small degree of variation.

The DNA microarray data analysis method according to claim 6, wherein the review step further comprises:
an analysis data item review step of verifying, over a plurality of combinations using an orthogonal table, which spots of the DNA microarray to use as measurement data, evaluating the combinations by the S/N ratio, and adopting as the analysis data items those items that increase the S/N ratio when used as analysis data items.

A DNA microarray data analyzer, comprising:
data group selecting means for selecting a reference data group by collecting, from measurement data of a DNA microarray, reference data corresponding to the discrimination item to be analyzed;
analysis data item determination means for determining analysis data items, which are the items used as input data when discrimination is performed by the Mahalanobis-Taguchi system;
unit space database creating means for creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected by the data group selecting means;
Mahalanobis distance calculating means for calculating the Mahalanobis distances of the reference data selected by the data group selecting means and of collected comparison data that clearly does not belong to the reference data;
threshold setting means for setting one or more thresholds serving as boundaries between the reference data group and the comparison data group, based on the Mahalanobis distances of the reference data and of the comparison data calculated by the Mahalanobis distance calculating means; and
classification means for calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing a discriminant analysis of the unknown data by judging that Mahalanobis distance against the threshold.

The DNA microarray data analyzer according to claim 9, further comprising:
determination accuracy verification means for separately preparing, apart from the reference data group, data that should belong to the reference data, creating a verification data group in which that data is mixed with data that does not belong to the reference data, calculating the Mahalanobis distance for the verification data group, and verifying whether the reference data group and the other groups can be correctly discriminated using the threshold set by the threshold setting means.

The DNA microarray data analyzer according to claim 10, further comprising:
review means for reviewing the reference data group selected by the data group selecting means and/or the analysis data items determined by the analysis data item determination means when the verification means determines that sufficient accuracy cannot be obtained.

The DNA microarray data analyzer according to claim 11, wherein the review means further comprises:
reference data review means for verifying the degree of variation of the reference data using the standard deviation and the data appearance frequency of each item in the reference data group, and using reference data with a small degree of variation.

The DNA microarray data analyzer according to claim 11 or 12, wherein the review means further comprises:
analysis data item review means for verifying, over a plurality of combinations using an orthogonal table, which spots of the DNA microarray to use as measurement data, evaluating the combinations by the S/N ratio, and adopting as the analysis data items those items that increase the S/N ratio when used as analysis data items.

A program for causing a computer to execute a DNA microarray data analysis method comprising:
a data group selection step of selecting a reference data group by collecting, from measurement data of a DNA microarray, reference data corresponding to the discrimination item to be analyzed;
an analysis data item determination step of determining analysis data items, which are the items used as input data when discrimination is performed by the Mahalanobis-Taguchi system;
a unit space database creation step of creating a unit space database representing the Mahalanobis space of the reference data included in the reference data group selected in the data group selection step;
a Mahalanobis distance calculation step of calculating the Mahalanobis distances of the reference data selected in the data group selection step and of collected comparison data that clearly does not belong to the reference data;
a threshold setting step of setting one or more thresholds serving as boundaries between the reference data group and the comparison data group, based on the Mahalanobis distances of the reference data and of the comparison data calculated in the Mahalanobis distance calculation step; and
a classification step of calculating the Mahalanobis distance of unknown data using the unit space database and the selected items, and performing a discriminant analysis of the unknown data by judging that Mahalanobis distance against the threshold.

The program according to claim 14, further comprising:
a determination accuracy verification step of separately preparing, apart from the reference data group, data that should belong to the reference data, creating a verification data group in which that data is mixed with data that does not belong to the reference data, calculating the Mahalanobis distance for the verification data group, and verifying whether the reference data group and the other groups can be correctly discriminated using the threshold set in the threshold setting step.

The program according to claim 15, further comprising:
a review step of reviewing the reference data group selected in the data group selection step and/or the analysis data items determined in the analysis data item determination step when it is determined in the verification step that sufficient accuracy cannot be obtained.

The program according to claim 16, wherein the review step further comprises:
a reference data review step of verifying the degree of variation of the reference data using the standard deviation or the data appearance frequency of each item in the reference data group, and using reference data with a small degree of variation.

The program according to claim 16, wherein the review step further comprises:
an analysis data item review step of verifying, over a plurality of combinations using an orthogonal table, which spots of the DNA microarray to use as measurement data, evaluating the combinations by the S/N ratio, and adopting as the analysis data items those items that increase the S/N ratio when used as analysis data items.

A computer-readable recording medium having recorded thereon the program according to any one of claims 14 to 18.