JP2004126857A

JP2004126857A - Default data estimating device and method and default data estimating program and recording medium with its program recorded thereon

Info

Publication number: JP2004126857A
Application number: JP2002288607A
Authority: JP
Inventors: Makoto Ishii; 石井　信; Shigemasa Oba; 大羽　成征
Original assignee: Nara Institute of Science and Technology NUC
Current assignee: Nara Institute of Science and Technology NUC
Priority date: 2002-10-01
Filing date: 2002-10-01
Publication date: 2004-04-22

Abstract

PROBLEM TO BE SOLVED: To provide a missing data estimating device, a missing data estimating method, a missing data estimating program, and a recording medium on which the program is recorded, each capable of estimating missing data with high accuracy by simple processing.

SOLUTION: A CPU 3 acquires an expression level matrix in which the logarithmic values of the expression levels of gene expression profile data are arranged, and estimates the parameters of a probabilistic principal component analysis model, in which each gene vector of the expression level matrix is interpreted as an independent sample, by repeating variational Bayesian estimation under an ARD prior probability distribution, thereby estimating the missing values of the gene expression profile data.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、データの一部が欠落した不完全データについて欠落データを推定する欠落データ推定装置、欠落データ推定方法、欠落データ推定プログラム及び同プログラムを記録した記録媒体に関するものである。
【０００２】
【従来の技術】
データの一部が欠落した不完全データとしては、バイオインフォマティクス分野における欠測値を含んだ遺伝子発現プロファイルデータがある。この遺伝子発現プロファイルデータでは、例えば、ガラス基板のカスレ等の物理的な理由によるものばかりでなく、実験上のノイズが大きすぎること等により欠測が発生する。そのため、ＤＮＡマイクロアレイの実際のデータでは、発現量行列の数パーセント程度の要素が欠測となる場合が多い。
【０００３】
上記のような欠測値を含む遺伝子発現データを使用しないこととすると、欠測のないサンプルの数及び遺伝子の数が激減して不経済となる。このため、欠測値を０で補填する方法や行あるいは列平均値で補填する方法が取られる。また、他の方法としては、発現量行列の欠測値をＫ−最近傍法（ＫＮＮ：Ｋ−ｎｅａｒｅｓｔ　ｎｅｉｇｈｂｏｒｓ）に基づいて予測して補填する手法や、ＳＶＤ法（ｓｉｎｇｕｌａｒ　ｖａｌｕｅ　ｄｅｃｏｍｐｏｓｉｔｉｏｎ）に基づいて予測して補填する方法が提案されている（例えば、非特許文献１参照）。
【０００４】
【非特許文献１】
トロヤンスカヤ（Ｔｒｏｙａｎｓｋａｙａ，Ｏ．）他、「ＤＮＡマイクロアレイの欠測値推定方法」（Ｍｉｓｓｉｎｇ　ｖａｌｕｅ　ｅｓｔｉｍａｔｉｏｎ　ｍｅｔｈｏｄｓ　ｆｏｒ　ＤＮＡ　ｍｉｃｒｏａｒｒａｙｓ）、バイオインフォマティクス（Ｂｉｏｉｎｆｏｒｍａｔｉｃｓ）、英国、オックスフォード大学プレス（Ｏｘｆｏｒｄ　Ｕｎｉｖｅｒｓｉｔｙ　Ｐｒｅｓｓ）、２００１年、１７（６）、ｐ．５２０−ｐ．５２５
【０００５】
【発明が解決しようとする課題】
しかしながら、欠測値を０で補填する方法や行あるいは列平均値で補填する方法では、その補填の仕方によってクラス発見やクラス予測における解析結果が大きく変化してしまう。ここで、クラスとは、例えば、ＤＮＡマイクロアレイで調べようとする条件において特徴的な発現パターンを示す遺伝子集団である。また、Ｋ−最近傍法に基づいて予測し補填する手法やＳＶＤ法に基づいて予測して補填する方法では、欠測値を高精度に補填することができなかった。
【０００６】
本発明の目的は、簡便な処理により欠落データを高精度に推定することができる欠落データ推定装置、欠落データ推定方法、欠落データ推定プログラム及び同プログラムを記録した記録媒体を提供することである。
【０００７】
【課題を解決するための手段】
本発明に係る欠落データ推定装置は、データの一部が欠落した不完全データの欠落データを推定する欠落データ推定装置であって、不完全データを取得する取得手段と、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定する推定手段とを備えるものである。
【０００８】
本発明に係る欠落データ推定装置においては、不完全データを取得し、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定しているので、簡便な処理により欠落データを高精度に推定することができる。
【０００９】
推定手段は、ベイズ推定を用いて不完全データの分布推定を行うことが好ましい。この場合、一つのクラスタを仮定することによって近似を含まない厳密なベイズ推定を用いて不完全データの分布推定を行うことができるので、欠落データをより高精度に推定することができる。
【００１０】
推定手段は、欠落データを隠れ変数として扱う変分法的ベイズ推定を用いた主成分分析による回帰処理により欠落データを推定することが好ましい。この場合、欠落データを隠れ変数として扱う変分法的ベイズ推定を用いた主成分分析による回帰処理によりデータに含まれるノイズを除去しつつ、ベイズ予測分布に基づいて欠落データを推定しているので、欠落データをより高精度に推定することができる。
【００１１】
推定手段は、ＡＲＤ事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより上記の回帰処理を行うことが好ましい。この場合、ＡＲＤ事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより上記の回帰処理を行っているので、予め最大限として設定された主成分次元が分布推定手続きの中で自然に次元選択され、主軸ベクトルを自動的に最適化することができ、人手によるパラメータ探索が不要となる。
【００１２】
不完全データは、欠測データを含む遺伝子発現プロファイルデータを含み、取得手段は、遺伝子発現プロファイルデータの発現量の対数値を並べた発現量行列を取得し、推定手段は、発現量行列において各遺伝子ベクトルが独立サンプルであると解釈した確率的主成分分析モデルのパラメータを、ＡＲＤ事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより推定することが好ましい。
【００１３】
この場合、遺伝子発現プロファイルデータの発現量の対数値を並べた発現量行列を取得し、この発現量行列において各遺伝子ベクトルが独立サンプルであると解釈した確率的主成分分析モデルのパラメータを、ＡＲＤ事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより推定しているので、適切な主成分次元の選択とベイズ予測分布に基づいて、遺伝子発現プロファイルデータにおける欠測データを高精度に推定することができる。
【００１４】
本発明に係る欠落データ推定方法は、コンピュータを用いて、データの一部が欠落した不完全データの欠落データを推定する欠落データ推定方法であって、コンピュータが、不完全データを取得するステップと、コンピュータが、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定するステップとを含むものである。
【００１５】
本発明に係る欠落データ推定方法においては、不完全データを取得し、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定しているので、簡便な処理により欠落データを高精度に推定することができる。
【００１６】
本発明に係る欠落データ推定プログラムは、データの一部が欠落した不完全データの欠落データを推定するための欠落データ推定プログラムであって、不完全データを取得する取得手段と、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定する推定手段としてコンピュータを機能させるものである。
【００１７】
本発明に係る欠落データ推定プログラムによれば、不完全データを取得し、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定しているので、簡便な処理により欠落データを高精度に推定することができる。
【００１８】
本発明に係る記録媒体は、データの一部が欠落した不完全データの欠落データを推定するための欠落データ推定プログラムを記録したコンピュータ読み取り可能な記録媒体であって、不完全データを取得する取得手段と、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定する推定手段としてコンピュータを機能させることを特徴とする欠落データ推定プログラムを記録したものである。
【００１９】
本発明に係る記録媒体に記録された欠落データ推定プログラムによれば、不完全データを取得し、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定しているので、簡便な処理により欠落データを高精度に推定することができる。
【００２０】
【発明の実施の形態】
以下、本発明の一実施の形態による欠落データ推定装置について図面を参照しながら説明する。図１は、本発明の一実施の形態による欠落データ推定装置の構成を示すブロック図である。
【００２１】
図１に示す欠落データ推定装置は、通常のコンピュータ等から構成され、入力装置１、ＲＯＭ（リードオンリメモリ）２、ＣＰＵ（中央演算処理装置）３、ＲＡＭ（ランダムアクセスメモリ）４、外部記憶装置５、表示装置６及び記録媒体駆動装置７を備える。各ブロックは内部のバスに接続され、このバスを介して種々のデータ等が入出力され、ＣＰＵ３の制御の下、種々の処理が実行される。
【００２２】
入力装置１は、キーボード、マウス等から構成され、操作者が種々のデータ及び操作指令等を入力するために使用される。例えば、入力装置１は、操作者が入力した不完全データである遺伝子発現プロファイルデータを取得し、ＣＰＵ３の制御の下、ＲＡＭ４又は外部記憶装置５に出力する。
【００２３】
ＲＯＭ２には、ＢＩＯＳ（Ｂａｓｉｃ　Ｉｎｐｕｔ／Ｏｕｔｐｕｔ　Ｓｙｓｔｅｍ）等のシステムプログラム等が記憶される。外部記憶装置５は、ハードディスクドライブ等から構成され、外部記憶装置５には所定のＯＳ（Ｏｐｅｒａｔｉｎｇ　Ｓｙｓｔｅｍ）及び後述する欠落データ推定プログラム等が記憶される。ＣＰＵ３は、外部記憶装置５から欠落データ推定プログラム等を読み出し、後述する欠落データ推定処理等を実行し、各ブロックの動作を制御する。ＲＡＭ４は、ＣＰＵ３の作業領域等として用いられる。
【００２４】
ＣＰＵ３は、一つのクラスタを仮定して入力装置１から入力される不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定する。このとき、ＣＰＵ３は、ベイズ推定を用いて不完全データの分布推定を行うことが好ましく、欠落データを隠れ変数として扱う変分法的ベイズ推定を用いた主成分分析による回帰処理により欠落データを推定することがより好ましく、ＡＲＤ（ａｕｔｏｍａｔｉｃ　ｒｅｌｅｖａｎｃｅ　ｄｅｔｅｒｍｉｎａｔｉｏｎ）事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより上記の回帰処理を行うことがさらに好ましい。
【００２５】
また、不完全データが欠測データを含む遺伝子発現プロファイルデータの場合、ＣＰＵ３は、操作者が入力装置１を用いて入力した遺伝子発現プロファイルデータの発現量の対数値を並べた発現量行列を取得し、この発現量行列において各遺伝子ベクトルが独立サンプルであると解釈した確率的主成分分析モデルのパラメータを、ＡＲＤ事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより推定する。
【００２６】
表示装置６は、液晶表示装置等から構成され、ＣＰＵ３の制御の下に種々の操作画面及び推定結果画面等を表示する。また、必要に応じて推定結果等を印字する印刷装置を付加してもよい。
【００２７】
記録媒体駆動装置７は、ＣＤ−ＲＯＭドライブ、フロッピィーディスクドライブ等から構成される。なお、欠落データ推定プログラムを、ＣＤ−ＲＯＭ、フロッピィーディスク等のコンピュータ読み取り可能な記録媒体８に記録し、記録媒体駆動装置７により記録媒体８から欠落データ推定プログラムを読み出して外部記憶装置５にインストールして実行するようにしてもよい。また、図１に示す欠落データ推定装置が通信装置等を備え、欠落データ推定プログラムが所定のネットワークを介して図１に示す欠落データ推定装置に接続された他のコンピュータ等に記憶されている場合、当該コンピュータからネットワークを介して欠落データ推定プログラムをダウンロードして実行するようにしてもよい。
【００２８】
また、ＤＮＡマイクロアレイ検査装置等のデータ測定装置により得られた遺伝子発現プロファイルデータ等の不完全データが所定の記録媒体に記録されている場合、記録媒体駆動装置７を用いて不完全データを取得するようにしてもよいし、ＤＮＡマイクロアレイ検査装置等のデータ測定装置と図１に示す欠落データ推定装置とが所定の通信規格に適合したインターフェースボード等から構成される通信装置等を介して通信可能に接続されている場合、データ測定装置から通信装置を介して遺伝子発現プロファイルデータ等の不完全データを取得するようにしてもよい。
【００２９】
本実施の形態では、入力装置１及びＣＰＵ３が取得手段に相当し、ＣＰＵ３等が推定手段に相当する。
【００３０】
次に、不完全データが遺伝子発現プロファイルデータの場合において、上記のように構成された欠落データ推定装置を用いて遺伝子発現プロファイルデータの欠測値を推定する欠落データ推定処理の推定原理について説明する。図２乃至図４は、図１に示す欠落データ推定装置による欠落データ推定処理の推定原理を説明するための模式図である。
【００３１】
上記の欠落データ推定装置を用いて遺伝子発現プロファイルデータの欠測データを推定する欠落データ推定処理は、以下に説明するＢＰＣＡ（Ｂａｙｅｓｉａｎ　ｐｒｉｎｃｉｐａｌ　ｃｏｍｐｏｎｅｎｔ　ａｎａｌｙｓｉｓ）法に基づくものであり、主成分分析、回帰を用いた欠測値の推定及びベイズ推定の３つの要素から構成される。
【００３２】
ＤＮＡマイクロアレイで得られる遺伝子発現プロファイルデータは、（Ｎ×Ｄ）行列Ｘの形で得られ、この行列Ｘを発現量行列と呼び、ここで、Ｎは遺伝子の種類の個数であり、Ｄはサンプル数である。行列Ｘの第（ｉ，ｊ）成分ｘ_ｉｊは、第ｉ遺伝子の第ｊサンプルにおける発現量を表す数値であり、例えば、ｃＤＮＡアレイによるデータの場合、対象試料の発現量比の対数が用いられ、第ｉ横ベクトル及び第ｊ縦ベクトルは、それぞれ第ｉ遺伝子及び第ｊサンプルの発現ベクトルである。
【００３３】
まず、主成分分析では、遺伝子発現ベクトルの共分散行列の次元を縮約し、各遺伝子ベクトルを共分散行列の主軸（固有ベクトル）によって表現する。図２に示すように、各遺伝子発現ベクトルＯＶは、線形係数ＬＣと主軸成分ＰＡとの各積の和にノイズ成分ＮＯを加算したものとして、主軸成分ＰＡの線形結合により表され、線形係数ＬＣは因子スコアと呼ばれる。
【００３４】
次に、回帰を用いた欠測値の推定において、遺伝子発現ベクトルは観測部分及び欠測部分からなり、欠測値の推定は観測部分及び他の発現ベクトルを用いて欠測部分を推定することにより行われ、欠測部分は主軸成分及び因子スコアを用いて推定される。ここで、図３に示すように、観測部分ＯＰから因子スコアＬＣが推定され、推定された因子スコアＬＣと主軸成分の一部ＰＰから欠測部分ＭＰが推定される。このようにして、因子スコアが観測部分から計算され、この因子スコアを用いて欠測部分を算出する処理は回帰と呼ばれる。
【００３５】
次に、ＢＰＣＡ法は、確率的主成分分析及びそのベイズ推定に基づくものであり、ベイズ推定に従って未知変数の事後分布が求められる。ここで、事後分布はデータの観測部分が与えられた後における事後知見を表し、欠測部分の事後分布が得られると、この事後分布に関する期待値として欠測値が推定され、事後分布に関する期待値はベイズ予測分布に基づくものである。未知変数の事前知見を表す事前分布として、ＡＲＤ事前分布が使用される。ＡＲＤ事前分布を用いることで、図４に示すように、適切な主軸成分ＲＡ及び不適切な主軸成分ＩＡに対してベイズ推定に基づく主軸長ＢＬが図示のようになり、推定されたノイズ成分ＥＮも図示のようになり、不適切な主軸分布ＩＡの推定に関わる寄与分が殆ど０となる。このＡＲＤ事前分布を用いることにより、遺伝子発現行列に含まれるノイズの影響を過度に受けることなく、不要な（不適切な）主軸成分が自動的に縮約される。
【００３６】
次に、上記の欠落データ推定処理について詳細に説明する。図５は、図１に示す欠落データ推定装置の欠落データ推定処理を説明するためのフローチャートである。なお、図５に示す欠落データ推定処理は、外部記憶装置５に記憶されている欠落データ推定プログラム等をＣＰＵ３により実行することにより行われる。
【００３７】
まず、操作者が入力装置１を用いて不完全データとして欠測データを含む遺伝子発現プロファイルデータの発現量の対数値を並べた発現量行列を入力すると、ステップＳ１において、ＣＰＵ３は、発現量行列をＲＡＭ４又は外部記憶装置５に記憶させて発現量行列を取得する。
【００３８】
次に、ステップＳ２において、ＣＰＵ３は、事前分布のハイパーパラメータγμ_０、γτ_０、γα_０、τ_０、α_０ｊとして適切な初期値を設定する。例えば、初期値として、γμ_０、γτ_０、γα_０を適当な小さい正値に設定し、ガンマ分布の中心値として、τ_０、α_０ｊを適当な小さい正値に設定する。ここで、γμ_０、γτ_０及びγα_０は、それぞれハイパーパラメータμ、τ及びαの事前分布の信頼度であり、τ_０は、発現ベクトルの逆分散の事前分布における平均値であり、α_０ｊは、発現量行列のｊ番目の主軸成分の自乗長さの事前分布における平均値である。
【００３９】
次に、ステップＳ３において、ＣＰＵ３は、欠測値を初期推定値で補完する。例えば、初期推定値として、０により補完する。
【００４０】
次に、ステップＳ４において、ＣＰＵ３は、事後分布のハイパーパラメータμ、Ｗ、Δ、τ、αの初期値として適切な値を設定する。ここで、μは、観測変数ｙの平均であり、Ｗは、主成分分析による主軸行列（Ｗ＝（ｃｏｖ（Ｙ））^１／２、Ｙは発現量行列）であり、Δは、単位行列（Δ＝Ｉ_ｑ）であり、τは、主成分分析によるノイズ逆分散（τ＝１／（Ｔｒ（ｃｏｖ（Ｙ））−Σλ、λは主成分分析による固有値）であり、αは、後述する式（５ｅ）が成り立つように設定され、発現量行列の主軸成分の自乗長さに対するハイパーパラメータである。なお、ｃｏｖ（　）は、行列の分散共分散行列であり、Ｔｒ（　）は、行列の対数和である。
【００４１】
次に、ステップＳ５において、ＣＰＵ３は、以下に説明するＥＭアルゴリズムによる処理を実行する。図６は、図５に示すＥＭアルゴリズム処理を説明するためのフローチャートである。
【００４２】
まず、ステップＳ１１において、ＣＰＵ３は、現在のハイパーパラメータを用いて、下式に従って主成分変数ｘの事後分布の推定を行う。
【００４３】
【数１】

【００４４】
ここで、ｘは、図２における因子スコアＬＣであり、ｙ_Ｏ及びｙ_ｈは、それぞれ観測変数ｙの観測部分及び欠測部分（欠測変数）であり、／ｘは、因子スコアｘの期待値であり、Ｑ（ｘ｜ｙ_Ｏ）は、観測部分ｙ_Ｏを用いた因子スコアの事後分布であり、Ｎ（／ｘ，Ｃ^−１）は、中心／ｘ，共分散Ｃの多次元正規分布であり、τは、ノイズ逆分散であり、Ｗ’の（’）は行列あるいはベクトル転置を表すプライム記号であり、ｄは、観測変数ｙの次元数であり、Ｃ_Ｏは、計算のための補助行列であり、Ｃ_Ｏ ^−１は、行列Ｃ_Ｏの逆行列を表し、Ｗ_Ｏ及びＷ_ｈは、それぞれ主軸行列において観測部分及び欠測部分に対応する部分であり、μ_Ｏ及びμ_ｈは、それぞれ発現ベクトルの平均値において観測部分及び欠測部分に対応する部分である。
【００４５】
次に、ステップＳ１２において、ＣＰＵ３は、現在のハイパーパラメータと因子スコアｘの期待値／ｘとを用いて、下式に従って欠測変数ｙ_ｈの事後分布の推定を行う。この中で、欠測変数ｙ_ｈの期待値／ｙ_ｈを算出する。
【００４６】
【数２】

【００４７】
次に、ステップＳ１３において、ＣＰＵ３は、補完された観測変数／ｙと主成分変数の事後分布（／ｘ及びＣ^−１）を用いて、下式に従って十分統計量を計算する。
【００４８】
【数３】

【００４９】
ここで、ｍは、前ステップにおいて欠測予測を行った後の観測変数のデータ平均であり、ｔはデータインデクスであり、ｙ（ｔ）は、ｙ_Ｏ（ｔ）と／ｙ_ｈ（ｔ）を並べたベクトル、すなわち前ステップにおいて欠測予測を行った後の観測変数であり、Ｔは、回帰計算に用いる補助行列であり、ＴｒＳは、前ステップにおいて欠測予測を行った後の観測変数ｙ（ｔ）の分散和のデータ平均であり、ｄ_ｈは、欠測部分の次元数である。
【００５０】
次に、ステップＳ１４において、ＣＰＵ３は、十分統計量と古いハイパーパラメータを用いて、下式に従ってハイパーパラメータを更新し、その後、図５に示すステップＳ６へ戻る。
【００５１】
【数４】

【００５２】
ここで、Ｄｉａｇは、ベクトルを対角成分とする対角行列であり、α_ｊは、発現量行列のｊ番目の主軸成分の自乗長さの事後分布における平均値である。
【００５３】
再び、図５を参照して、次に、ステップＳ６において、ＣＰＵ３は、τの増加が十分に小さくなって収束したか否かを判断し、収束していない場合はステップＳ５の処理を繰り返し、収束している場合はステップＳ７へ移行する。
【００５４】
収束している場合、ステップＳ７において、ステップＳ５において求められた／ｙを欠測補完を行ったデータとして決定し、欠落データ推定処理を終了する。なお、この実施形態において、ハイパーパラメータαを導入することで発現量行列の主軸成分の自乗長さの事後分布を制御しているが、これをＡＲＤ事前分布と呼ぶ。また、式（１）〜式（５ｅ）の繰り返しの手続きをＥＭアルゴリズムと呼ぶ。
【００５５】
図７及び図８は、図１に示す欠落データ推定装置を用いた欠落データ推定処理による欠落データの推定結果を示す図である。なお、図７及び図８では、図１に示す欠落データ推定装置を用いた欠落データ推定処理すなわちＢＰＣＡ法による欠落データの推定結果をＢＰＣＡで示し、従来のＫ−最近傍法による推定結果をＫＮＮで示し、従来のＳＶＤ法による推定結果をＳＶＤで示し、また、データＡはイースト細胞周期に関するｃＤＮＡマイクロアレイデータ中のアルファ係数に関わるものであり、データＥはイースト細胞周期に関するｃＤＮＡマイクロアレイデータ中のエルトリエーションに関わるものであり、データＣはイースト細胞周期に関するｃＤＮＡマイクロアレイデータ中のｃｄｃ１５及びｃｄｃ２８に関わるものであり、データＩは大腸ガンに関するｃＤＮＡマイクロアレイデータである。
【００５６】
また、図７及び図８の縦軸は、ＮＲＭＳＥ（正規化された平均自乗誤差の平方根）であり、これは真値が存在する部分を人工的に欠測とみなした後で欠落データ推定処理を行うことで評価することができる。図７の横軸は、ＢＰＣＡ法及びＳＶＤ法の場合は主軸数Ｋであり、Ｋ−最近傍法の場合は近傍数Ｋであり、図８では主軸数Ｋ及び近傍数Ｋの最適値を使用しており、図８の横軸は、上記のｃＤＮＡマイクロアレイデータの種類である。
【００５７】
図７に示すように、Ｋが大きい場合、本発明によるＢＰＣＡ法では、従来のＳＶＤ法及びＫ−最近傍法よりデータＡ＋Ｅ及びデータＩに対してＮＲＭＳＥが小さくなっており、高精度に欠測値を推定することができた。また、図８に示すように、本発明によるＢＰＣＡ法では、従来のＳＶＤ法及びＫ−最近傍法よりデータＡ、データＥ、データＡ＋Ｅ、データＡ＋Ｅ＋Ｃ及びデータＩに対してＮＲＭＳＥが小さくなっており、高精度に欠測値を推定することができた。
【００５８】
さらに、図７で示すように、従来のＳＶＤ法及びＫ−最近傍法では最適なＫの値が存在するにも関わらず、ＢＰＣＡ法ではＫの値は大きいほど良い。また、図７の下図に示すように、Ｋの値が大きすぎた場合でも、精度は悪くならない。これは、ＡＲＤ事前確率分布を用いることで、不必要な主成分次元を自動的に縮約しているためである。したがって、Ｋの値としては最大限にとっておけば充分であり、次元選択は必要に応じて自動的になされる。
【００５９】
上記のように、本実施の形態では、遺伝子発現プロファイルデータの発現量の対数値を並べた発現量行列を取得し、この発現量行列において各遺伝子ベクトルが独立サンプルであると解釈した確率的主成分分析モデルのパラメータを、ＡＲＤ事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより推定しているので、遺伝子発現プロファイルデータにおける欠測値を高精度に推定することができる。また、ＡＲＤ事前確率分布のもとで変分法的ベイズ推定を繰り返すことにより主成分分析の回帰処理を行っているので、最大限にとった主成分次元が自然に次元選択され、主軸ベクトルを自動的に最適化することができ、従来のＳＶＤ法、Ｋ−最近傍法などの各種手法では必要であった人手によるパラメータ探索が不要となる。
【００６０】
なお、上記の説明では、不完全データとして遺伝子発現プロファイルデータを用いたが、本発明が提供可能な不完全データは上記の例に限定されず、欠落データを含むものであれば、文字データなどの画像データ、音声データ、生体データ等の他の不完全データにも同様に適用可能である。
【００６１】
【発明の効果】
本発明によれば、不完全データを取得し、一つのクラスタを仮定した不完全データの分布推定を行い、推定されたデータ分布を基に欠落データを推定しているので、簡便な処理により欠落データを高精度に推定することができる。
【図面の簡単な説明】
【図１】本発明の一実施の形態による欠落データ推定装置の構成を示すブロック図である。
【図２】図１に示す欠落データ推定装置による欠落データ推定処理の推定原理を説明するための第１の模式図である。
【図３】図１に示す欠落データ推定装置による欠落データ推定処理の推定原理を説明するための第２の模式図である。
【図４】図１に示す欠落データ推定装置による欠落データ推定処理の推定原理を説明するための第３の模式図である。
【図５】図１に示す欠落データ推定装置の欠落データ推定処理を説明するためのフローチャートである。
【図６】図５に示すＥＭアルゴリズム処理を説明するためのフローチャートである。
【図７】図１に示す欠落データ推定装置を用いた欠落データ推定処理による欠落データの第１の推定結果を示す図である。
【図８】図１に示す欠落データ推定装置を用いた欠落データ推定処理による欠落データの第２の推定結果を示す図である。
【符号の説明】
１　入力装置
２　ＲＯＭ
３　ＣＰＵ
４　ＲＡＭ
５　外部記憶装置
６　表示装置
７　記録媒体駆動装置
８　記録媒体

[0001]
[Technical field of the invention]
The present invention relates to a missing data estimating apparatus for estimating missing data for incomplete data in which a part of data is missing, a missing data estimating method, a missing data estimating program, and a recording medium on which the program is recorded.
[0002]
[Prior art]
Incomplete data in which part of the data is missing includes, for example, gene expression profile data containing missing values in the field of bioinformatics. In gene expression profile data, missing values arise not only for physical reasons, such as scratches on the glass substrate, but also because the experimental noise is too large. As a result, in actual DNA microarray data, several percent of the elements of the expression level matrix are often missing.
[0003]
If gene expression data containing such missing values were simply discarded, the number of samples and genes without missing values would drop sharply, which is uneconomical. For this reason, missing values are filled in with 0 or with the row or column average. As other approaches, methods have been proposed that predict and fill in the missing values of the expression level matrix based on the K-nearest neighbors (KNN) method or on the SVD (singular value decomposition) method (see, for example, Non-Patent Document 1).
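As a point of reference for the conventional fill-in methods mentioned above, the following is a minimal sketch of zero filling and row-average filling. It assumes that missing entries are encoded as NaN in a NumPy array; the function name `fill_with_row_mean` and the toy matrix are illustrative and do not come from the patent.

```python
import numpy as np

def fill_with_row_mean(X):
    """Replace each NaN with the mean of the observed entries in the same row."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    row_means = np.nanmean(X, axis=1)      # per-row mean over observed entries only
    rows = np.nonzero(missing)[0]          # row index of every missing entry (row-major order)
    filled[missing] = row_means[rows]
    return filled

# Toy expression matrix (rows = genes, columns = samples); NaN marks a missing value.
X = np.array([[1.0, np.nan, 3.0],
              [0.5, 0.5, np.nan]])

filled = fill_with_row_mean(X)             # row-average filling
zero_filled = np.nan_to_num(X, nan=0.0)    # zero filling
```

A row whose entries are all missing has no observed mean, so a real implementation would need a fallback for that case.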
[0004]
[Non-patent document 1]
Troyanskaya, O., et al., "Missing value estimation methods for DNA microarrays", Bioinformatics, Oxford University Press, UK, 2001, 17(6), pp. 520-525
[0005]
[Problems to be solved by the invention]
However, in the method of filling missing values with 0 or the method of filling with row or column average values, analysis results in class discovery and class prediction vary greatly depending on the method of filling. Here, the class is, for example, a gene group showing a characteristic expression pattern under the conditions to be examined by a DNA microarray. In addition, the method of predicting and compensating based on the K-nearest neighbor method or the method of predicting and compensating based on the SVD method cannot compensate for missing values with high accuracy.
[0006]
An object of the present invention is to provide a missing data estimating device, a missing data estimating method, a missing data estimating program, and a recording medium on which the program is recorded, each capable of estimating missing data with high accuracy by simple processing.
[0007]
[Means for Solving the Problems]
The missing data estimating device according to the present invention is a device for estimating missing data of incomplete data in which part of the data is missing, and comprises acquiring means for acquiring the incomplete data, and estimating means for estimating the distribution of the incomplete data assuming a single cluster and estimating the missing data based on the estimated data distribution.
[0008]
In the missing data estimating device according to the present invention, the incomplete data is acquired, the distribution of the incomplete data is estimated assuming a single cluster, and the missing data is estimated based on the estimated data distribution, so that the missing data can be estimated with high accuracy by simple processing.
[0009]
It is preferable that the estimation unit performs distribution estimation of incomplete data using Bayes estimation. In this case, by assuming one cluster, the distribution of incomplete data can be estimated using strict Bayesian estimation that does not include approximation, so that missing data can be estimated with higher accuracy.
[0010]
The estimating means preferably estimates the missing data by regression processing based on principal component analysis using variational Bayesian estimation that treats the missing data as hidden variables. In this case, the missing data is estimated based on the Bayesian predictive distribution while noise contained in the data is removed by this regression processing, so that the missing data can be estimated with higher accuracy.
[0011]
The estimating means preferably performs the above regression processing by repeating variational Bayesian estimation under an ARD prior probability distribution. In this case, the principal component dimension, set in advance to its maximum, is selected naturally within the distribution estimation procedure, and the principal axis vectors can be optimized automatically, so that a manual parameter search becomes unnecessary.
[0012]
Preferably, the incomplete data includes gene expression profile data containing missing data, the acquiring means acquires an expression level matrix in which the logarithmic values of the expression levels of the gene expression profile data are arranged, and the estimating means estimates the parameters of a probabilistic principal component analysis model, in which each gene vector in the expression level matrix is interpreted as an independent sample, by repeating variational Bayesian estimation under an ARD prior probability distribution.
[0013]
In this case, an expression level matrix in which the logarithmic values of the expression levels of the gene expression profile data are arranged is acquired, and the parameters of the probabilistic principal component analysis model, in which each gene vector in the expression level matrix is interpreted as an independent sample, are estimated by repeating variational Bayesian estimation under the ARD prior probability distribution. The missing data in the gene expression profile data can therefore be estimated with high accuracy based on the selection of an appropriate principal component dimension and on the Bayesian predictive distribution.
[0014]
The missing data estimating method according to the present invention is a method for estimating, using a computer, missing data of incomplete data in which part of the data is missing, and includes a step in which the computer acquires the incomplete data, and a step in which the computer estimates the distribution of the incomplete data assuming a single cluster and estimates the missing data based on the estimated data distribution.
[0015]
In the missing data estimating method according to the present invention, the incomplete data is acquired, the distribution of the incomplete data is estimated assuming a single cluster, and the missing data is estimated based on the estimated data distribution, so that the missing data can be estimated with high accuracy by simple processing.
[0016]
The missing data estimating program according to the present invention is a program for estimating missing data of incomplete data in which part of the data is missing, and causes a computer to function as acquiring means for acquiring the incomplete data, and as estimating means for estimating the distribution of the incomplete data assuming a single cluster and estimating the missing data based on the estimated data distribution.
[0017]
According to the missing data estimating program according to the present invention, the incomplete data is acquired, the distribution of the incomplete data is estimated assuming a single cluster, and the missing data is estimated based on the estimated data distribution, so that the missing data can be estimated with high accuracy by simple processing.
[0018]
The recording medium according to the present invention is a computer-readable recording medium on which is recorded a missing data estimating program for estimating missing data of incomplete data in which part of the data is missing, the program causing a computer to function as acquiring means for acquiring the incomplete data, and as estimating means for estimating the distribution of the incomplete data assuming a single cluster and estimating the missing data based on the estimated data distribution.
[0019]
According to the missing data estimating program recorded on the recording medium according to the present invention, the incomplete data is acquired, the distribution of the incomplete data is estimated assuming a single cluster, and the missing data is estimated based on the estimated data distribution, so that the missing data can be estimated with high accuracy by simple processing.
[0020]
[Embodiments of the invention]
Hereinafter, a missing data estimation device according to an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram illustrating a configuration of a missing data estimation device according to an embodiment of the present invention.
[0021]
The missing data estimation device shown in FIG. 1 is composed of a normal computer or the like, and has an input device 1, a ROM (read only memory) 2, a CPU (central processing unit) 3, a RAM (random access memory) 4, an external storage device 5, a display device 6 and a recording medium driving device 7. Each block is connected to an internal bus, and various data and the like are input / output via the bus, and various processes are executed under the control of the CPU 3.
[0022]
The input device 1 includes a keyboard, a mouse, and the like, and is used by an operator to input various data, operation commands, and the like. For example, the input device 1 acquires gene expression profile data, which is incomplete data input by the operator, and outputs it to the RAM 4 or the external storage device 5 under the control of the CPU 3.
[0023]
The ROM 2 stores system programs such as a BIOS (Basic Input/Output System). The external storage device 5 includes a hard disk drive or the like, and stores a predetermined OS (Operating System), the missing data estimation program described later, and the like. The CPU 3 reads the missing data estimation program and the like from the external storage device 5, executes the missing data estimation processing and the like described later, and controls the operation of each block. The RAM 4 is used as a work area of the CPU 3 and the like.
[0024]
The CPU 3 estimates the distribution of the incomplete data input from the input device 1 assuming a single cluster, and estimates the missing data based on the estimated data distribution. In doing so, the CPU 3 preferably estimates the distribution of the incomplete data using Bayesian estimation; more preferably estimates the missing data by regression processing based on principal component analysis using variational Bayesian estimation that treats the missing data as hidden variables; and still more preferably performs this regression processing by repeating variational Bayesian estimation under an ARD (automatic relevance determination) prior probability distribution.
[0025]
When the incomplete data is gene expression profile data including missing data, the CPU 3 obtains an expression amount matrix in which logarithmic values of expression amounts of the gene expression profile data input by the operator using the input device 1 are arranged. Then, the parameters of the probabilistic principal component analysis model in which each gene vector is interpreted as an independent sample in the expression amount matrix are estimated by repeating the variational Bayesian estimation under the ARD prior probability distribution.
[0026]
The display device 6 is composed of a liquid crystal display device or the like, and displays various operation screens, estimation result screens, and the like under the control of the CPU 3. Further, a printing device for printing the estimation result or the like may be added as necessary.
[0027]
The recording medium driving device 7 includes a CD-ROM drive, a floppy disk drive, and the like. The missing data estimation program may be recorded on a computer-readable recording medium 8 such as a CD-ROM or a floppy disk, read from the recording medium 8 by the recording medium driving device 7, installed in the external storage device 5, and executed. Further, when the missing data estimation device shown in FIG. 1 includes a communication device or the like and the missing data estimation program is stored in another computer or the like connected to the missing data estimation device shown in FIG. 1 via a predetermined network, the missing data estimation program may be downloaded from that computer via the network and executed.
[0028]
When incomplete data such as gene expression profile data obtained by a data measuring device such as a DNA microarray inspection device is recorded on a predetermined recording medium, the recording medium driving device 7 is used to acquire the incomplete data. Alternatively, the data measurement device such as a DNA microarray inspection device and the missing data estimation device shown in FIG. 1 can communicate with each other via a communication device or the like including an interface board or the like that conforms to a predetermined communication standard. When connected, incomplete data such as gene expression profile data may be obtained from the data measurement device via the communication device.
[0029]
In the present embodiment, the input device 1 and the CPU 3 correspond to an obtaining unit, and the CPU 3 and the like correspond to an estimating unit.
[0030]
Next, the estimation principle of the missing data estimation processing, in which the missing data estimation device configured as described above estimates missing values of gene expression profile data, will be described for the case where the incomplete data is gene expression profile data. FIGS. 2 to 4 are schematic diagrams for explaining the estimation principle of the missing data estimation processing by the missing data estimation device shown in FIG. 1.
[0031]
The missing data estimation processing for estimating missing data of gene expression profile data using the above missing data estimation device is based on the BPCA (Bayesian principal component analysis) method described below, which consists of three elements: principal component analysis, estimation of missing values by regression, and Bayesian estimation.
[0032]
Gene expression profile data obtained with a DNA microarray is given in the form of an (N × D) matrix X, which is called the expression level matrix, where N is the number of genes and D is the number of samples. The (i, j)-th element x_ij of the matrix X is a numerical value representing the expression level of the i-th gene in the j-th sample; for example, in the case of cDNA array data, the logarithm of the expression ratio of the target sample is used. The i-th row vector and the j-th column vector are the expression vectors of the i-th gene and the j-th sample, respectively.
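The layout of the expression level matrix can be illustrated with a small sketch. It assumes that missing measurements are encoded as NaN and that base-2 logarithms are used; the text only says that a logarithm is taken, so the base and all numerical values here are illustrative assumptions.

```python
import numpy as np

# Hypothetical expression ratios for N = 3 genes (rows) and D = 4 samples (columns).
ratios = np.array([[2.0,    1.0, np.nan, 0.5],
                   [1.0,    4.0, 0.25,   1.0],
                   [np.nan, 1.0, 2.0,    8.0]])

X = np.log2(ratios)                 # expression level matrix; NaN stays NaN
N, D = X.shape
observed = ~np.isnan(X)             # True where x_ij was measured

gene_1 = X[0, :]                    # expression vector of the 1st gene (row vector)
sample_2 = X[:, 1]                  # expression vector of the 2nd sample (column vector)
missing_fraction = 1.0 - observed.mean()
```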
[0033]
First, in principal component analysis, the dimension of the covariance matrix of the gene expression vectors is reduced, and each gene vector is represented by the principal axes (eigenvectors) of the covariance matrix. As shown in FIG. 2, each gene expression vector OV is expressed as a linear combination of the principal axis components PA, namely as the sum of the products of the linear coefficients LC and the principal axis components PA, plus a noise component NO. The linear coefficients LC are called factor scores.
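The decomposition described here, a sum of factor scores times principal axes plus a residual, can be sketched for complete data as follows. This is ordinary principal component analysis by eigendecomposition of the sample covariance, given only as an illustration; the toy data, the choice K = 2, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy complete data: 200 gene vectors (rows) in D = 5 dimensions.
Y = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

mu = Y.mean(axis=0)
cov = np.cov(Y, rowvar=False)               # D x D covariance of the gene vectors
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # indices sorted by decreasing eigenvalue

K = 2                                       # number of principal axes kept
axes = eigvecs[:, order[:K]]                # D x K principal axes (unit eigenvectors)
scores = (Y - mu) @ axes                    # N x K factor scores (linear coefficients)
approx = mu + scores @ axes.T               # linear combination of the principal axes
noise = Y - approx                          # residual treated as the noise component
```

By construction, each row of `Y` equals the corresponding row of `approx` plus the row of `noise`, which mirrors the relation shown in FIG. 2.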
[0034]
Next, in the estimation of missing values using regression, the gene expression vector consists of the observed part and the missing part, and the estimation of the missing value is to estimate the missing part using the observed part and other expression vectors. The missing part is estimated using the principal axis component and the factor score. Here, as shown in FIG. 3, the factor score LC is estimated from the observed portion OP, and the missing portion MP is estimated from the estimated factor score LC and a part PP of the main axis component. In this way, the factor score is calculated from the observed part, and the process of calculating the missing part using this factor score is called regression.
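The regression step can be sketched as follows, assuming the mean vector and the principal axis matrix are already known. Ordinary least squares is used here to obtain the factor scores from the observed part; this is a stand-in chosen for illustration, whereas the method described in this document obtains them by Bayesian estimation. The function name `regress_missing` and the toy numbers are hypothetical.

```python
import numpy as np

def regress_missing(y, mu, W):
    """Estimate the NaN entries of one expression vector y from its observed entries."""
    obs = ~np.isnan(y)
    W_o, W_h = W[obs], W[~obs]              # rows for the observed / missing parts
    # Factor scores from the observed part (least squares, an illustrative choice).
    x, *_ = np.linalg.lstsq(W_o, y[obs] - mu[obs], rcond=None)
    y_filled = y.copy()
    y_filled[~obs] = W_h @ x + mu[~obs]     # missing part from scores and axes
    return x, y_filled

# Toy model with D = 4 dimensions and 2 principal axes.
W = np.array([[1.0,  0.0],
              [0.0,  1.0],
              [1.0,  1.0],
              [1.0, -1.0]])
mu = np.full(4, 0.5)
x_true = np.array([2.0, -1.0])
y_full = W @ x_true + mu                    # noise-free vector: [2.5, -0.5, 1.5, 3.5]

y = y_full.copy()
y[3] = np.nan                               # pretend the 4th entry is missing
x_hat, y_hat = regress_missing(y, mu, W)
```

Because the toy vector is noise-free and three observed entries determine two scores, the missing entry is recovered exactly in this example; with noisy data the estimate is only approximate.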
[0035]
Next, the BPCA method is based on probabilistic principal component analysis and its Bayesian estimation, and the posterior distributions of the unknown variables are obtained by Bayesian estimation. Here, a posterior distribution represents the knowledge available after the observed part of the data has been given. Once the posterior distribution of the missing part is obtained, the missing value is estimated as the expected value with respect to this posterior distribution; this expected value is based on the Bayesian predictive distribution. An ARD prior distribution is used as the prior distribution representing prior knowledge of the unknown variables. With the ARD prior distribution, as shown in FIG. 4, the principal axis lengths BL obtained by Bayesian estimation for the appropriate principal axis components RA and the inappropriate principal axis components IA become as illustrated, as does the estimated noise component EN, and the contribution of the inappropriate principal axis components IA to the estimation becomes almost zero. By using this ARD prior distribution, unnecessary (inappropriate) principal axis components are automatically reduced without being excessively affected by the noise contained in the gene expression matrix.
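For orientation, a commonly used textbook form of an ARD prior over the principal axis vectors is shown below. It is an illustrative assumption and not the prior written out in this document's equations, whose parameterization of α (described in terms of the squared length of each principal axis component) may differ. Here w_j denotes the j-th principal axis vector, K the maximum number of axes, D the data dimension, and α_j a precision.

```latex
% Generic ARD prior: one precision hyperparameter \alpha_j per principal axis w_j
p(W \mid \alpha) = \prod_{j=1}^{K} \mathcal{N}\!\left(w_j \mid 0,\ \alpha_j^{-1} I_D\right),
\qquad
\alpha_j \to \infty \;\Longrightarrow\; w_j \to 0
```

In this form, an axis that the data do not support acquires a large α_j during estimation, and its vector is driven toward zero, which corresponds to the automatic reduction of inappropriate axes described above.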
[0036]
Next, the missing data estimation processing will be described in detail. FIG. 5 is a flowchart for explaining the missing data estimation processing of the missing data estimation device shown in FIG. 1. The missing data estimation processing shown in FIG. 5 is performed by the CPU 3 executing the missing data estimation program and the like stored in the external storage device 5.
[0037]
First, when the operator uses the input device 1 to input, as incomplete data, an expression level matrix in which the logarithmic values of the expression levels of gene expression profile data including missing data are arranged, in step S1 the CPU 3 stores the expression level matrix in the RAM 4 or the external storage device 5 and thereby acquires the expression level matrix.
[0038]
Next, in step S2, the CPU 3 sets appropriate initial values for the hyperparameters γμ_0, γτ_0, γα_0, τ_0, and α_0j of the prior distributions. For example, γμ_0, γτ_0, and γα_0 are set to suitably small positive values, and τ_0 and α_0j, which are the center values of gamma distributions, are also set to suitably small positive values. Here, γμ_0, γτ_0, and γα_0 are the confidence levels of the prior distributions of the hyperparameters μ, τ, and α, respectively; τ_0 is the mean, in the prior distribution, of the inverse variance of the expression vector; and α_0j is the mean, in the prior distribution, of the squared length of the j-th principal axis component of the expression level matrix.
[0039]
Next, in step S3, the CPU 3 complements the missing value with the initial estimated value. For example, it is complemented by 0 as an initial estimated value.
[0040]
Next, in step S4, the CPU 3 sets appropriate values as the initial values of the hyperparameters μ, W, Δ, τ, and α of the posterior distributions. Here, μ is the mean of the observation variable y; W is the principal axis matrix obtained by principal component analysis (W = (cov(Y))^(1/2), where Y is the expression level matrix); Δ is the identity matrix (Δ = I_q); τ is the inverse noise variance obtained by principal component analysis (τ = 1/(Tr(cov(Y)) - Σλ), where λ denotes the eigenvalues obtained by principal component analysis); and α, a hyperparameter for the squared length of the principal axis components of the expression level matrix, is set so that equation (5e) described later holds. Note that cov( ) denotes the variance-covariance matrix of a matrix, and Tr( ) denotes the trace of a matrix, that is, the sum of its diagonal elements.
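The initialization in step S4 can be sketched as follows. The sketch reads W = (cov(Y))^(1/2) as the leading q eigenvectors scaled by the square roots of their eigenvalues, and takes Σλ over those same q eigenvalues; both readings are assumptions, as are the function name `init_from_pca` and the toy data. The initialization of α is omitted because equation (5e) is not reproduced here.

```python
import numpy as np

def init_from_pca(Y, q):
    """Initial mu, W, Delta, tau from PCA of Y (N x D, missing values already filled).
    Requires q < D so that some residual variance remains for tau."""
    mu = Y.mean(axis=0)
    cov = np.cov(Y, rowvar=False)                # D x D variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:q]          # indices of the q largest eigenvalues
    lam = eigvals[top]
    W = eigvecs[:, top] * np.sqrt(lam)           # D x q principal axis matrix
    tau = 1.0 / (np.trace(cov) - lam.sum())      # inverse of the residual (noise) variance
    Delta = np.eye(q)                            # identity matrix I_q
    return mu, W, Delta, tau

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 4))                     # toy data with D = 4
mu, W, Delta, tau = init_from_pca(Y, q=2)
```

With this scaling, the squared length of each column of `W` equals the corresponding eigenvalue, so `W @ W.T` is the rank-q part of the covariance matrix.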
[0041]
Next, in step S5, the CPU 3 executes processing based on the EM algorithm described below. FIG. 6 is a flowchart for explaining the EM algorithm processing shown in FIG. 5.
[0042]
First, in step S11, the CPU 3 estimates the posterior distribution of the principal component variable x in accordance with the following equation, using the current hyperparameters.
[0043]
(Equation 1)

[0044]
Here, x is the factor score (LC in the figure); y_O and y_h are the observed part and the missing part (missing variable) of the observed variable y, respectively; /x is the expected value of the factor score x; Q(x | y_O) is the posterior distribution of the factor score given the observed part y_O; N(/x, C^-1) is a multidimensional normal distribution with center /x and covariance C^-1; τ is the noise inverse variance; the prime (′) in W′ denotes the transpose of a matrix or vector; d is the dimension of the observation variable y; C_O is an auxiliary matrix for the calculation and C_O^-1 is the inverse matrix of C_O; W_O and W_h are the parts of the principal axis matrix corresponding to the observed part and the missing part, respectively; and μ_O and μ_h are the parts of the mean of the expression vector corresponding to the observed part and the missing part, respectively.
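As an illustration of the kind of calculation performed in step S11, the following sketch computes the posterior of the factor score from the observed part only, using the textbook conditional of a probabilistic principal component analysis model: C = I_q + τ W_O′ W_O and /x = τ C^-1 W_O′ (y_O - μ_O). This form is an assumption; it is not a transcription of equation (1), and the variational treatment in the embodiment may include further terms (for example, terms involving Δ). The function name `factor_posterior` is hypothetical.

```python
import numpy as np

def factor_posterior(y, observed, W, mu, tau):
    """Posterior mean and precision of the factor score x given the observed part of y.

    y: (d,) vector; entries at unobserved positions are ignored.
    observed: (d,) boolean mask, True where y is observed.
    W: (d, q) principal axis matrix, mu: (d,) mean, tau: noise inverse variance.
    """
    observed = np.asarray(observed, dtype=bool)
    W_o = W[observed]                                  # rows of W for the observed part
    C = np.eye(W.shape[1]) + tau * W_o.T @ W_o         # posterior precision of x
    rhs = W_o.T @ (y[observed] - mu[observed])
    x_bar = tau * np.linalg.solve(C, rhs)              # posterior mean /x
    return x_bar, C

# Small check with d = 3, q = 1: the second entry is missing.
W = np.ones((3, 1))
mu = np.zeros(3)
y = np.array([2.0, np.nan, 4.0])
x_bar, C = factor_posterior(y, [True, False, True], W, mu, tau=1.0)
```

Solving the linear system with `np.linalg.solve` avoids forming C^-1 explicitly.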
[0045]
Next, in step S12, the CPU 3 estimates the posterior distribution of the missing variable y_h in accordance with the following equation, using the current hyperparameters and the expected value /x of the factor score x. In doing so, the expected value /y_h of the missing variable y_h is calculated.
[0046]
(Equation 2)
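A minimal sketch of the regression performed in step S12, assuming the usual linear model in which the expected value of the missing part is /y_h = W_h /x + μ_h. This form is an assumption and is not a transcription of equation (2); the function name `fill_missing` is hypothetical.

```python
import numpy as np

def fill_missing(y, observed, W, mu, x_bar):
    """Replace the missing part of y by its expected value W_h x_bar + mu_h."""
    observed = np.asarray(observed, dtype=bool)
    hidden = ~observed                              # positions of the missing part
    y_filled = np.array(y, dtype=float)             # copy; observed values are kept
    y_filled[hidden] = W[hidden] @ x_bar + mu[hidden]
    return y_filled

# Small check with d = 3, q = 1: the second entry is missing.
W = np.array([[1.0], [2.0], [1.0]])
mu = np.array([0.0, 1.0, 0.0])
y = np.array([2.0, np.nan, 4.0])
y_filled = fill_missing(y, [True, False, True], W, mu, x_bar=np.array([2.0]))
```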

[0047]
Next, in step S13, the CPU 3 calculates the sufficient statistics in accordance with the following equation, using the posterior distribution (/x and C^-1).
[0048]
(Equation 3)

[0049]
Here, m is the data mean of the observed variable after the missing-value prediction in the previous step; t is the data index; y(t) is the vector formed from y_O(t) and /y_h(t), that is, the observed variable after the missing-value prediction in the previous step; T is an auxiliary matrix used for the regression calculation; TrS is the data mean of the sum of variances of the observed variable y(t) after the missing-value prediction in the previous step; and d_h is the number of dimensions of the missing part.
[0050]
Next, in step S14, the CPU 3 updates the hyperparameters in accordance with the following equations, using the sufficient statistics and the old hyperparameters, and then returns to step S6 shown in FIG. 5.
[0051]
(Equation 4)

[0052]
Here, Diag is a diagonal matrix having a vector as its diagonal components, and α_j is the mean in the posterior distribution of the squared length of the j-th principal axis component of the expression level matrix.
[0053]
Referring again to FIG. 5, next, in step S6, the CPU 3 determines whether the increase in τ has become sufficiently small, that is, whether the estimation has converged. If it has not converged, the processing in step S5 is repeated. If it has converged, the process proceeds to step S7.
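The loop formed by steps S5 and S6 can be sketched as follows. The update itself is passed in as a function, since equations (1) to (5e) are not reproduced here; the names `run_until_converged` and `em_step`, the relative tolerance and the iteration cap are all hypothetical choices. The text tests the increase in τ, and this sketch uses the absolute relative change as one possible reading.

```python
def run_until_converged(em_step, state, tol=1e-4, max_iter=200):
    """Repeat em_step until the relative change in tau falls below tol.

    em_step: function mapping a state dict to a new state dict (step S5).
    state: dict holding at least the key 'tau' (noise inverse variance).
    Returns the final state and the number of iterations performed.
    """
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_state = em_step(state)
        change = abs(new_state["tau"] - state["tau"]) / abs(state["tau"])
        state = new_state
        if change < tol:          # step S6: the change in tau is small enough
            break
    return state, iterations

# Toy update used only to exercise the loop: tau moves halfway toward 10.
def toy_step(state):
    return {"tau": state["tau"] + 0.5 * (10.0 - state["tau"])}

final_state, n_iter = run_until_converged(toy_step, {"tau": 1.0})
```

The iteration cap is a safeguard added for the sketch; the description only specifies the convergence test on τ.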
[0054]
When convergence has been reached, in step S7, /y obtained in step S5 is adopted as the data in which the missing values have been filled in, and the missing data estimation processing ends. In this embodiment, the posterior distribution of the squared length of the principal axis components of the expression level matrix is controlled by introducing the hyperparameter α; this is called an ARD prior distribution. The procedure of repeating equations (1) to (5e) is called an EM algorithm.
[0055]
FIGS. 7 and 8 are diagrams showing the results of estimating missing data by the missing data estimation processing using the missing data estimation device shown in FIG. 1. In FIGS. 7 and 8, the result of the missing data estimation processing using the missing data estimation device shown in FIG. 1, that is, the estimation result by the BPCA method, is indicated by BPCA; the estimation result by the conventional K-nearest neighbor method is indicated by KNN; and the estimation result by the conventional SVD method is indicated by SVD. Data A relates to the alpha coefficient in the cDNA microarray data on the yeast cell cycle, data E relates to elt in the cDNA microarray data on the yeast cell cycle, data C relates to cdc15 and cdc28 in the cDNA microarray data on the yeast cell cycle, and data I relates to cDNA microarray data on colon cancer.
[0056]
The vertical axis in FIGS. 7 and 8 is the NRMSE (the square root of the normalized mean squared error), which is evaluated by artificially treating entries whose true values are known as missing and then performing the missing data estimation processing on them. The horizontal axis in FIG. 7 is the number of principal axes K for the BPCA method and the SVD method, and the number of neighbors K for the K-nearest neighbor method. FIG. 8 uses the optimum values of the number of principal axes K and the number of neighbors K, and its horizontal axis is the type of cDNA microarray data described above.
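The evaluation measure can be sketched as follows. The text does not state the normalizing quantity; this sketch assumes that the mean squared error over the artificially hidden entries is divided by the variance of their true values. The function name `nrmse` is hypothetical.

```python
import numpy as np

def nrmse(y_true, y_est):
    """Square root of the mean squared error normalized by the variance of the true values.

    Assumption: normalization by the variance of the true values of the hidden entries.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    return np.sqrt(np.mean((y_est - y_true) ** 2) / np.var(y_true))

# True values of entries that were artificially hidden, and two sets of estimates.
truth = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nrmse(truth, truth)                           # exact recovery
mean_only = nrmse(truth, np.full(4, truth.mean()))      # filling with the mean
```

Under this normalization, exact recovery gives 0 and filling every hidden entry with the mean of the true values gives 1, so values below 1 indicate an estimate better than mean filling.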
[0057]
As shown in FIG. 7, when K is large, the BPCA method according to the present invention gives a smaller NRMSE for data A+E and data I than the conventional SVD method and K-nearest neighbor method, so the missing values could be estimated with high accuracy. As shown in FIG. 8, the BPCA method according to the present invention gives a smaller NRMSE for data A, data E, data A+E, data A+E+C and data I than the conventional SVD method and K-nearest neighbor method, so the missing values could be estimated with high accuracy.
[0058]
Furthermore, as shown in FIG. 7, in the BPCA method a larger value of K gives a better result, whereas the conventional SVD method and K-nearest neighbor method each have an optimal value of K. Also, as shown in the lower diagram of FIG. 7, the accuracy does not deteriorate even if the value of K is too large. This is because unnecessary principal component dimensions are automatically reduced by using the ARD prior probability distribution. Therefore, it is sufficient to set K to its maximum value, and the necessary dimension selection is made automatically.
[0059]
As described above, in the present embodiment, an expression level matrix in which the logarithmic values of the expression levels of the gene expression profile data are arranged is acquired, and the parameters of the probabilistic principal component analysis model, in which each gene vector in the expression level matrix is interpreted as an independent sample, are estimated by repeating variational Bayes estimation under the ARD prior probability distribution. The missing values in the gene expression profile data can therefore be estimated with high accuracy. In addition, since the regression processing of the principal component analysis is performed by repeating variational Bayes estimation under the ARD prior probability distribution, the principal component dimension is selected naturally and the principal axis vectors are optimized automatically, so that the manual parameter search required by conventional methods such as the SVD method and the K-nearest neighbor method becomes unnecessary.
[0060]
In the above description, gene expression profile data is used as the incomplete data. However, the incomplete data to which the present invention can be applied is not limited to this example, and the invention can be applied in the same way to other incomplete data such as image data, audio data and biometric data.
[0061]
[Effect of the invention]
According to the present invention, the incomplete data is acquired, the distribution of the incomplete data is estimated assuming one cluster, and the missing data is estimated based on the estimated data distribution, so that the missing data can be estimated with high accuracy.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a missing data estimation device according to an embodiment of the present invention.
FIG. 2 is a first schematic diagram for explaining an estimation principle of a missing data estimation process by the missing data estimation device shown in FIG. 1;
FIG. 3 is a second schematic diagram for explaining the estimation principle of the missing data estimation processing by the missing data estimation device shown in FIG. 1;
FIG. 4 is a third schematic diagram for explaining an estimation principle of the missing data estimation processing by the missing data estimation device shown in FIG. 1;
FIG. 5 is a flowchart illustrating a missing data estimation process of the missing data estimation device shown in FIG. 1;
FIG. 6 is a flowchart for explaining an EM algorithm process shown in FIG. 5;
FIG. 7 is a diagram showing a first estimation result of missing data by a missing data estimation process using the missing data estimation device shown in FIG. 1;
FIG. 8 is a diagram showing a second estimation result of missing data by a missing data estimation process using the missing data estimation device shown in FIG. 1;
[Explanation of symbols]
1 Input device
2 ROM
3 CPU
4 RAM
5 External storage device
6 Display device
7 Recording medium drive
8 Recording medium

Claims

1. A missing data estimating device that estimates missing data in incomplete data in which part of the data is missing, comprising:
acquisition means for acquiring the incomplete data; and
estimating means for estimating the distribution of the incomplete data assuming one cluster and estimating the missing data based on the estimated data distribution.

2. The missing data estimating device according to claim 1, wherein the estimating means estimates the distribution of the incomplete data using Bayesian estimation.

3. The missing data estimating device according to claim 2, wherein the estimating means estimates the missing data by regression processing based on principal component analysis using variational Bayes estimation that treats the missing data as hidden variables.

4. The missing data estimating device according to claim 3, wherein the estimating means performs the regression processing by repeating variational Bayes estimation under an ARD prior probability distribution.

5. The missing data estimating device according to claim 4, wherein:
the incomplete data includes gene expression profile data including missing data;
the acquisition means acquires an expression level matrix in which logarithmic values of the expression levels of the gene expression profile data are arranged; and
the estimating means estimates the parameters of a probabilistic principal component analysis model, in which each gene vector in the expression level matrix is interpreted as an independent sample, by repeating variational Bayes estimation under an ARD prior probability distribution.

6. A missing data estimation method for estimating, using a computer, missing data in incomplete data in which part of the data is missing, comprising:
a step in which the computer acquires the incomplete data; and
a step in which the computer estimates the distribution of the incomplete data assuming one cluster and estimates the missing data based on the estimated data distribution.

7. A missing data estimation program for estimating missing data in incomplete data in which part of the data is missing, the program causing a computer to function as:
acquisition means for acquiring the incomplete data; and
estimating means for estimating the distribution of the incomplete data assuming one cluster and estimating the missing data based on the estimated data distribution.

8. A computer-readable recording medium on which is recorded a missing data estimation program for estimating missing data in incomplete data in which part of the data is missing, the program causing a computer to function as:
acquisition means for acquiring the incomplete data; and
estimating means for estimating the distribution of the incomplete data assuming one cluster and estimating the missing data based on the estimated data distribution.