JP2004078371A

JP2004078371A - Data processor, data processing method and computer program

Info

Publication number: JP2004078371A
Application number: JP2002235352A
Authority: JP
Inventors: Hirotaka Higuchi; 樋口　裕高; Yoko Azuma; 東　陽子; Toshihiko Morimoto; 森本　俊彦; Arata Sato; 佐藤　新; Ei Sakano; 坂野　鋭; Tsutomu Matsunaga; 松永　務; Masaaki Muramatsu; 村松　正明; Keisuke Ishii; 石井　敬介
Original assignee: NTT Data Corp; Hubit Genomix Inc
Current assignee: Hubit Genomix Inc; NTT Data Group Corp
Priority date: 2002-08-13
Filing date: 2002-08-13
Publication date: 2004-03-11
Anticipated expiration: 2022-08-13
Also published as: JP4307807B2

Abstract

PROBLEM TO BE SOLVED: To provide a mechanism for properly clustering data.
SOLUTION: A plot processing unit 104 of an SNP data processor 1 plots SNP fluorescence emission data on coordinates; an angle information processing unit 107 obtains the straight line connecting each plotted SNP data point with a reference point and calculates the angle of that line with respect to a predetermined reference line; and a clustering processing unit 108 clusters the SNP fluorescence emission data based on the angle information.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、データをクラスタ分析するための技術であって、特に、ＳＮＰ（Ｓｉｎｇｌｅ　Ｎｕｃｌｅｏｔｉｄｅ　Ｐｏｌｙｍｏｒｐｈｉｓｍ：一塩基多型）データのように所定の基準点から放射状に分布するデータのクラスタリングに好適な技術に関する。
【０００２】
【従来の技術】
現在、バイオインフォマティックスの分野では、遺伝子の塩基配列の違いに基づく個体差の研究がされている。
この個体差は、遺伝子の塩基配列の約０．１％の配列の違いから生じていると考えられており、この違いを遺伝的個性（多型）という。そして、この多型の中でもたった一つの塩基だけが違っているものをＳＮＰ（「一塩基多型」）という。
このＳＮＰは、数百〜千塩基に一箇所くらいの割合で存在していると推測され、ゲノム中には３００万〜１０００万箇所のＳＮＰが存在すると考えられている。
そして、このＳＮＰを特定し、臨床データや環境等の要因と照らし合わせて解析することで、特定の一塩基型をもった人にはある薬が効く効かないといったことや、病気の予防、診断、治療に役立てることが可能となる。
【０００３】
このように、ＳＮＰを用いた解析を実現するためには、ＳＮＰがゲノム上のどの場所に、どのような頻度で存在しているかを分析する必要がある。
このうちＳＮＰの頻度解析を行う場合、ＴａｑＭａｎ法やＩｎｖａｄｅｒ法などが用いられている。これらの方法は、２種類の特殊な試薬によりＳＮＰの蛍光発光量を測り、この蛍光発光量データに基づいて２種類のホモ接合体（例えば、塩基がＡかＧの場合、ＡＡとＧＧ）と、１種類のヘテロ接合体（例えば、ＡＧ）の２ないし３の分類に分けて、それぞれの分類ごとの出現頻度を求める必要がある。
そして、ＳＮＰの頻度解析を行うため、従来は各試薬の発光量に基づいて、各データを２次元座標上にプロットし、プロットした座標上の点を人が目視により、２〜３にクラスタリングを行っていた。
【０００４】
【発明が解決しようとする課題】
しかし、従来のやり方では、２次元座標上にプロットしたデータを人が目視でクラスタリングしていたため、クラスタリングの判断基準が一定でなく、クラスタリングにバラツキが生じてしまうばかりか、人が目視で行うためクラスタリングに膨大な時間がかかってしまうし、人為的なミスも発生するなどの問題があった。
【０００５】
このようなことから、クラスタリングの自動化が必要とされるため、Ｋ−ｍｅａｎｓ法などの一般的なクラスタリング手法を用いることが考えられる。
しかし、ＳＮＰデータは、２次元座標上にプロットするとある中心点を中心として放射状に分布するという特性を有しているため、Ｋ−ｍｅａｎｓ法などの一般的なクラスタリング手法を用いて発光量に基づくクラスタリングを行ったのでは、単純に距離が近いデータ同士をクラスタリングしてしまうため、本来別のクラスタとなるホモとヘテロとに跨ったクラスタが形成されるなど正しい分類ができず、自動化が困難であるという問題があった。
【０００６】
本発明は、上記課題及び問題点を解決するためになされたものであって、データを適切にクラスタリングすることができ、特にＳＮＰデータのように所定の基準点を中心に放射状に分布するデータを適切にクラスタリングできる仕組みを提供することを課題とする。
【０００７】
【課題を解決するための手段】
上述の課題を解決するため、本発明の一の観点にかかるデータ処理装置は、複数のデータをクラスタリングするための装置であって、上記各データを２次元座標上にプロットするプロット処理手段と、上記プロットされた個々のデータと所定の基準点とを結ぶ直線を求め、この直線と所定の基準線との角度を求める角度情報処理手段と、上記角度情報に基づいて各データをクラスタリングするクラスタリング処理手段とを有することを特徴とするデータ処理装置。
【０００８】
また、上記データは、所定の２つの試薬による遺伝子のＳＮＰの蛍光発光量データであり、上記プロット手段は、各試薬による蛍光発光量をそれぞれ２次元座標の軸として、上記蛍光発光量データを上記２次元座標上にプロットするようにしてもよい。
【０００９】
また、座標上に所定の仮定点をとり、当該仮定点と上記座標上の各データとを結ぶ直線と上記仮定点を通る所定の線との角度を求め、この角度情報に基づいて上記各データのクラスタリングを行い、求められた各クラスタ中心を通る各クラスタの主成分直線の交点から上記基準点を決定する基準点処理手段を更に有するようにしてもよい。
【００１０】
また、上記クラスタ分析を行う前に、上記基準点から所定の距離に存在する点を抽出し、抽出した点を上記クラスタ分析の対象から除外させる手段をさらに有するようにしてもよい。
【００１１】
また、上記クラスタ処理手段は、複数パターンのクラスタリングを行い、上記クラスタ処理手段が行った複数パターンのクラスタリング結果を選択可能に並列表示する表示処理手段を更に有するようにしてもよい。
また、上記データは、所定の２つの試薬による遺伝子のＳＮＰの蛍光発光量データであり、上記プロット手段は、各試薬による蛍光発光量をそれぞれ２次元座標の軸として、上記蛍光発光量データを上記２次元座標上にプロットするようにしてもよい。
【００１２】
本発明の一の観点にかかるデータ処理方法は、コンピュータにより、複数のデータを所定のクラスに分類するための方法であって、上記各データを座標上にプロットする処理と、上記プロットされた個々のデータと所定の基準点とを結ぶ直線を求め、この直線と所定の基準線との角度を求める処理と、上記角度情報に基づいて各データをクラスタリングする処理とからなることを特徴とする。
【００１３】
本発明の一の観点にかかるコンピュータプログラムは、コンピュータに対して、複数のデータを所定のクラスに分類するためのコンピュータプログラムであって、コンピュータに対して、上記各データを座標上にプロットする処理と、上記プロットされた個々のデータと所定の基準点とを結ぶ直線を求め、この直線と所定の基準線との角度を求める処理と、上記角度情報に基づいて各データをクラスタリングする処理とを実行させることを特徴とする。
【００１４】
【発明の実施の形態】
以下、図面を参照して本発明にかかるデータ処理装置及びコンピュータプログラムを、ＳＮＰデータ処理システムに適用した実施形態について説明する。
図１に本実施形態にかかるＳＮＰデータ処理システムの一例を示す。
図１に示すように、本システムは、本発明に係るデータ処理装置を構成するＳＮＰデータ処理装置１及び結果確認処理装置２と、前処理装置３、修正処理装置４、マージ処理装置５、ＳＮＰ管理装置６から構成されている。
なお、これらの各装置は、ＬＡＮ（Ｌｏｃａｌ　Ａｒｅａ　Ｎｅｔｗｏｒｋ）などにより相互に接続可能に構成してもよいし、ＣＤ−ＲＯＭ、ＦＤ、ＭＯなどの所定の媒体を介してデータをやり取りできるように構成してもよい。
【００１５】
前処理装置３は、ＴａｑＭａｎ法等による処理を行うコンピュータである。
前処理装置３は、例えば、２種類の特殊な試薬を用いて、１対の染色体上のＳＮＰの蛍光発光量を測定し、各ＳＮＰの蛍光発光量データを生成する処理を行う。
なお、この蛍光発光量データとしては、色とその発光強度から構成される。色としては、例えば、遺伝子がＡＡのホモであれば２つの試薬のうちの一つが発光してある色となり、ＧＧのホモであれば別の試薬が発光して別の色となり、ＡＧのヘテロであればそれぞれの試薬が発光するためその中間色の色となっている。
【００１６】
修正処理装置４は、結果確認処理装置２によりリジェクトされたデータの修正処理を行うためのコンピュータである。
この修正処理としては、例えば、修正処理装置４が、リジェクトされたデータのファイル（リジェクトファイル）を参照して、リジェクトされたデータの基となる蛍光発光量データを前処理装置３から取得し、これを２次元座標上に表示することで、操作者が目視でデータの分布を確認して、手動でクラスタリングできるようにする処理を行う。
【００１７】
マージ処理装置５は、結果確認装置２により確認処理されたデータ、修正処理装置４により修正された修正後データ、及び修正不可能なリジェクトデータをマージして、管理用のＳＮＰデータを作成する処理を行うコンピュータである。
ＳＮＰ管理装置６は、作成された管理用のＳＮＰデータを記憶し、これをデータベース化するなどして管理するコンピュータである。
【００１８】
ＳＮＰデータ処理装置１は、ＳＮＰデータをクラスタリングするための装置である。このＳＮＰデータ処理装置１は、コンピュータにより構成され、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）と、ＣＰＵが実行するコンピュータプログラム、このコンピュータプログラム及び所定のデータなどを格納するＲＡＭ、ＲＯＭなどの内部メモリ及びハードディスクドライブなどの外部記憶装置などにより、図２に示す機能ブロックを構成することができる。
図２に示した機能ブッロクは、アッセイデータ記憶部１０１、コールデータ記憶部１０２、設定処理部１０３、プロット処理部１０４、基準点処理部１０５、ラベリング領域処理部１０６、角度情報処理部１０７、クラスタリング処理部１０８、適合度処理部１０９から構成されている。
【００１９】
アッセイデータ記憶部１０１は、各ＳＮＰデータの蛍光発光量データを記憶する記憶部である。
このアッセイデータ記憶部１０１には、例えば、各ＳＮＰデータのファイル名、ＳＮＰの蛍光発光量データなどが記憶できるようになっている。
【００２０】
コールデータ記憶部１０２は、ＳＮＰデータ処理装置１により自動的にクラスタリング処理を行った結果のデータが記憶できるようになっている。
このコールデータ記憶部１０２には、例えば、各ＳＮＰデータと、各ＳＮＰデータのクラスタ情報などが記憶できるようになっている。
【００２１】
設定処理部１０３は、初期ファイルの設定処理などを行うことができる。
プロット処理部１０４は、アッセイデータ記憶部１０１に記憶されている各ＳＮＰの蛍光発光量データに基づいて、座標上でこれをプロットする処理を行う。この処理は、例えば、プロット処理部１０４が、一の試薬の発光量をＸ軸、他方の試薬の発光量をＹ軸とした２次元座標上に、各ＳＮＰデータをプロットすることにより処理できる。
【００２２】
基準点処理部１０５は、各ＳＮＰデータの座標上での角度を測る際の基準となる基準点を求める処理を行う。
この処理としては、例えば、基準点処理部１０５が、座標上の原点を仮の基準点（仮定点）として、この原点から座標上の各ＳＮＰデータを結ぶ直線と、Ｘ軸とのなす角度に基づいてＳＮＰデータを複数のクラスタに分類し、各クラスタ中心の主成分直線を求め、この主成分直線の交点を次の仮定点として、再度同じ処理を繰り返すことにより、当該仮定点が一定となって収束した点を基準点とすることができる。
なお、基準点の決定としては仮定点が収束した点を求めるのではなく、基準点処理部１０５が、上述の処理を予め決められた所定回数繰り返して求めた点を基準点として決定してもよい。あるいは、収束しなった場合は、基準点処理部１０５が最初の基準点（原点等）を基準点として決定してもよい。
【００２３】
ラベリング領域処理部１０６は、クラスタ分析を行う対象となるラベリング領域と、クラスタ分析の対象外とする非ラベリング領域とを区分けする処理を行う。
この処理としては、例えば、ラベリング領域処理部１０６が、ＳＮＰデータの最大値と最小値の中間点（ｘ＿ｍｅｄｉａｎ　ｙ＿ｍｅｄｉａｎ）をとり、この中間点からの距離が閾値（Ｓ）よりも大きい場合にラベリング領域、小さい場合に非ラベリング領域とすることにより行うことができる。
なお、ラベリングを行う際の閾値（Ｓ）は、操作の都度、操作者が決定してもよいし、また予めデフォルトで設定しておいてもよい。
【００２４】
角度情報処理部１０７は、座標上の基準点から各ＳＮＰデータを結ぶ直線と、所定の基準線との角度を求める処理を行う。
基準線は角度を測定する際の基準となる線であり、例えば、Ｘ軸又はＹ軸を基準としてもよいし、また他の軸をとってこれを基準線としてもよく任意である。
【００２５】
クラスタリング処理部１０８は、ＳＮＰデータの角度情報に基づいて、ＳＮＰデータをクラスタリングする処理を行う。
このクラスタリング処理としては、例えば、一次元の角度情報を基にｋ−ｍｅａｎｓ法など既存のクラスタリングアルゴリズムを用いて行うことができる。
なお、クラスタリングを行う場合には、例えば、クラスタリング処理部１０８が、２クラスタにクラスタリングする場合と、３クラスタにクラスタリングする場合との２つのパターンの処理をそれぞれ並行して行うことができる。
【００２６】
適合度処理部１０９は、クラスリング処理部１０８がクラスタリングした結果がＳＮＰデータのクラスタリングとして適合しているか否かを表す適合度を算定する処理を行う。
この処理としては、例えば、適合度処理部１０９が、各クラスタ間の距離をα、各クラスタ内でのデータ分散をβとし、このα／βの比率（Ｆ比）の大きさが大きいほどより適切なクラスタリングができているとして適合度を判定することができる。また、クラスタリングの別の適合度としては、例えば、適合度処理部１０９が、対立遺伝子の頻度がＨａｒｄｙ　Ｗｅｉｎｂｅｒｇ平衡の法則にしたがった分布となっているか否かにより適合度を判定することができる。なお、Ｈａｒｄｙ　Ｗｅｉｎｂｅｒｇ平衡の法則とは、例えば、対立遺伝子ＭとＮの頻度をそれぞれｐとｑ（但し、ｐ＋ｑ＝１）とすると、遺伝子型頻度は構成する対立遺伝子頻度の積で表され、ＭＭの頻度はｐ^２、ＮＮの頻度はｑ^２、ＭＮの頻度は２ｐｑの比率になるという法則をいう。
【００２７】
結果確認処理装置２は、ＳＮＰデータ処理装置１がクラスタリングした処理結果を操作者に確認させるための処理を行う装置である。
この結果確認処理装置２は、図３に示すように、ディスプレイ２０及びキーボード、マウスなどの入力装置３０が接続されている。この結果確認処理装置２は、コンピュータにより構成され、ＣＰＵと、ＣＰＵが実行するコンピュータプログラム、このコンピュータプログラム及び所定のデータなどを格納するＲＡＭ、ＲＯＭなどの内部メモリ及びハードディスクドライブなどの外部記憶装置などにより、図３に示す機能ブロックを構成することができる。
図３に示した機能ブロックは、確認データ記憶部２０１、データ入出力処理部２０２、表示制御部２０３から構成されている。
【００２８】
確認データ記憶部２０１は、操作者によりクラスタリング結果の確認が完了したデータを記憶するための記憶部である。この確認データ記憶部２０１には、ＳＮＰデータと、そのクラスタリングされたクラスタ情報などが記憶できるようになっている。
【００２９】
データ入出力処理部２０２は、ＳＮＰデータ処理装置１が生成したＳＮＰのクラスタリングデータの入力を受付ける処理や、確認が完了したデータをマージ処理装置５に出力する処理を行うことができる。
【００３０】
表示制御部２０３は、クラスタリングされた結果データをディスプレイ２０に表示させ、操作者に確認を要求する処理を行う。
また、表示制御部２０３は、ＳＮＰデータ処理装置１が２ラスタと３クラスタの２つのパターンの分類を行った場合には、これら２つのパターンの結果をディスプレイ２上に並列して表示させ、操作者に結果データの確認、選択をさせることができる。
【００３１】
次に、本発明にかかるデータ処理方法の一実施形態について図面を用いて説明する。
まず、図１を参照してシステム全体の処理の流れを説明する。
図１において、前処理装置３がＴａｑＭａｎ法等を用いてＳＮＰの蛍光発光量データを取得する（Ｓ１）。
ＳＮＰデータ処理装置１は、前処理装置３からＳＮＰの蛍光発光量データを取得し、ＳＮＰデータのクラスタリングを行い、ＳＮＰデータにそのクラスタを表すラベル付けを行う（Ｓ２）。
【００３２】
クラスタリングが完了すると、結果確認処理装置２が、ＳＮＰデータ処理装置１により生成されたクラスタリング結果データに基づいて、この結果を２次元座標上に表示し操作者に確認を要求する（Ｓ３）。
確認の結果、操作者により適正なクラスタリング行われていないと判定された場合には、当該リジェクトデータ名をリジェクトファイルに書き出し、修正処理装置４に提供する（Ｓ４）。
【００３３】
修正処理装置４では、リジェクトデータ名に基づき、ＳＮＰの蛍光発光量データを前処理装置３から取得し、これを所定のディスプレイ上に表示して、操作者が目視によりクラスタリングの修正を行えるようにする（Ｓ５）。
【００３４】
マージ処理装置５は、Ｓ３の処理で操作者の確認により適正なクラスタリングと判定されたデータ、Ｓ５の処理により修正がされたデータ、及び修正ができないと判定されたリジェクトデータを取得し、これらのデータをマージして管理データを作成する（Ｓ６）。
【００３５】
マージ処理が完了すると、マージ後の管理データがＳＮＰ管理装置６に提供され、このＳＮＰ管理装置６が管理データをデータベース化するなどして管理し（Ｓ７）、処理を終了する。
【００３６】
次に、図４を参照して、ＳＮＰデータ処理装置１がクラスタリングを行う際の詳細な処理について説明する。
図４において、まず設定処理部１０３が操作者により入力された初期設定ファイルを参照し、クラスタリングを行うための所定の定数などの初期設定を行う（Ｓ１０１）。
この初期設定ファイルには、例えば、出力ファイル名、入力ファイル名、処理対象となるファイルの数を表す設定ファイル数、１つのファイル中のレコード数、ラベリング処理を行う際のラベリング領域の閾値（Ｓ）、ｔａｎｇｅｎｔの連続領域にデータを置くための回転角度などのデータから構成されている。
【００３７】
設定処理部１０３は、コールデータ記憶部１０２に記憶されている出力ファイルをオープンし（Ｓ１０２）、読み込んだＳＮＰデータのファイル数が今回処理対象となっている設定ファイル数に達しているか判別する（Ｓ１０３）。
判別の結果、設定ファイル数に達している場合には、今回の処理の対象となった全てのＳＮＰデータのファイルについて処理が完了したものとして、処理を終了する。
【００３８】
また、判別の結果設定ファイル数に達していない場合には、プロット処理部１０４が、アッセイデータ記憶部１０１に記憶されている入力ファイルを開き、ＳＮＰの蛍光発光量データを読み込んで、各試薬の発光色の強度に基づいて２次元座標上にプロットする（Ｓ１０４）。
この２次元座標は、１つの軸（Ｘ軸）に一つの試薬の発光量をとり、他方の軸（Ｙ軸）にもう一方の試薬の発光量をとったものである。
【００３９】
プロットが完了すると、基準点処理部１０５が、各ＳＮＰデータを極座標に変換し、この極座標の角度情報を用いて１次元のクラスタリングを行う（Ｓ１０５）。
各データの極座標は、例えば、基準点処理部１０５が座標上の原点を極として、各データとを結ぶ直線を求め、この直線とＸ軸とがなす角度及び距離に基づいて求めることができる。また、クラスタリングは、例えば、基準点処理部１０５が、Ｋ−ｍｅａｎｓ法などの既知のクラスタリング手法を用いることにより行うことができる。
【００４０】
クラスタリングが完了すると、基準点処理部１０５は、所定の主成分分析アルゴリズムにより各クラスタの主成分分析を行ってクラスタごとにその中心を通る主成分直線をもとめ、これらの主成分直線の交点を決定する（Ｓ１０６）。
そして、基準点処理部１０５は、この交点を仮の基準点（仮定点）として再度Ｓ１０５、Ｓ１０６の処理を所定回数（ｎ回）行い、求めた交点１から交点ｎが一定の位置に収束しているか否か判別する（Ｓ１０７）。
判別の結果、交点が収束していない場合には、上述のＳ１０５の処理に戻って再度処理を繰り返す。
また、Ｓ１０７の処理で交点が一定の位置に収束したと判別された場合には、当該交点ｎを基準点として設定する（Ｓ１０８）。
【００４１】
なお、交点が収束しているか否かを判定する代わりに、予め交点の算定を繰り返す回数を設定しておき、基準点処理部１０５がこの設定された回数までＳ１０５の処理に戻って処理を繰り返した結果により基準点を決定してもよい。
【００４２】
基準点設定が完了すると、ラベリング領域処理部１０６は、ＳＮＰデータを２次元座標上プロットした際の中央点の座標（ｘ＿ｍｅｄｉａｎ，　ｙ＿ｍｅｄｉａｎ）を取得する（Ｓ１０９）。
この中央点（Ｍ）は、図６に示すようように、ＳＮＰ蛍光発光量データ中の最大値Ｍａｘと最小値Ｍｉｎを求め、この中間（ｘ＿ｍｅｄｉａｎ、ｙ＿ｍｅｄｉａｎ）を中央点とすることができる。
【００４３】
ラベリング領域処理部１０６が、読み込みレコード数が設定されているレコード数に達したか否か、即ち全てのレコードについて処理を行ったかを判別する（Ｓ１１０）。
判別の結果、設定数に達していない場合には、ラベリング領域処理部１０６は、一のＳＮＰ蛍光発光量データの座標（ｘ，ｙ）の値が、（ｘ／ｘ＿ｍｅｄｉａｎ）^２＋（ｙ／ｙ＿ｍｅｄｉａｎ）^２＜閾値（Ｓ）となっているか否か、即ち中央点からの距離が所定の閾値Ｓ内にあるか否か判別する（Ｓ１１１）。
判別の結果、閾値よりも小さい座標のＳＮＰデータについては、当該ＳＮＰデータを非ラベリング領域として設定し、クラスタリングを行う対象データから除外するフラグを設定する（Ｓ１１２）。
また、判別の結果、閾値よりも大きい座標のＳＮＰデータは、ラベリング領域に属する旨のフラグを設定する（Ｓ１１３）。
このラベリング処理の一例を図６に示す。図６に円弧状の線で示すようにその内側の領域が非ラベリング領域となり、円弧状の線よりも外側の領域がラベリング領域となる。
【００４４】
ラベリング領域の判定処理を行いＳ１１０の判別の結果、読み込みレコード数が設定数に達した場合には、角度情報処理部１０７は、ラベリング領域のＳＮＰ蛍光発光量データを対象として、基準点を極とした極座標に変換する（Ｓ１１４）。
この処理としては、例えば、図７に示すように、基準点（図示の例では原点）を極として、この基準点と各データとを結ぶ直線と基準線（例えば、Ｘ軸）との角度を測定し、これに基づいてＳＮＰ蛍光発光量データを極座標へ変換することができる。
【００４５】
引き続き、図５において、極座標への変換が完了すると、クラスタリング処理部１０８は、各ＳＮＰデータの極座標の角度情報に基づいて、所定のクラスタリングのアルゴリズムにより１次元のクラスタリングを行う（Ｓ１１５）。
この際クラスタリング処理部１０８は、２クラスタにクラスタリングする場合と、３クラスタにクラスタリングする場合の２つのケースにつてそれぞれクラスタリングを行うようにする。なお、このクラスタリング処理は、例えば、ｋ−ｍｅａｎｓ法などの既存のクラスタリングアルゴリズムを用いることができる。
なお、角度に基づいてヒストグラムを作成した例を図８に示す。図８では横軸に角度（ラジアン）、縦軸にデータ数を表している。
【００４６】
クラスタリングが完了すると、クラスタリング処理部１０８は、極座標とクラスタリングした２次元座標をコールデータ記憶部１０２に記憶する（Ｓ１１６）。
【００４７】
適合度処理部１０９は、２クラスタの場合と、３クラスタの場合のそれぞれについて座標上のクラスタのＦ比を算出する（Ｓ１１７）。
また、適合度処理部１０９は、２クラスタの場合と、３クラスタの場合のそれぞれについて対立遺伝子の比率等を算出し、この比率がＨａｒｄｙ　Ｗｅｉｎｂｅｒｇ平衡の法則に適合する度合いを算出する（Ｓ１１８）。
【００４８】
そして、適合度処理部１０９は、各クラスタのＦ比及びＨａｒｄｙ　Ｗｅｉｎｂｅｒｇ平衡の法則に適合する度合いから、適正なクラスタリングが行われているか、或いは適正なクラスタリングを行えないリジェクトデータかを判別する（Ｓ１１９）。
この判別処理は、Ｆ比及びＨａｒｄｙ　Ｗｅｉｎｂｅｒｇ平衡の法則の適合度合の閾値を予め定めておき、この閾値よりもこれらの算出結果が下回っている場合にリジェクトデータと判別することにより処理できる。
【００４９】
判別の結果リジェクトデータと判別した場合には、適合度処理部１０９は、データ名や検体のＩＤなどを記述したリジェクトファイルを作成する（Ｓ１２０）。
【００５０】
また、判別結果、適正なクラスタリングであると判別された場合、又はＳ１２０の処理が完了した場合、ラベル、Ｆ比、Ｈａｒｄｙ　Ｗｅｉｎｂｅｒｇ平衡の法則の適合度合いを出力ファイルとしてコールデータ記憶部１０２に記憶して（Ｓ１２１）、上述のＳ１０３の処理に戻り、設定ファイル数になるまで処理を繰り返して処理を終了する。
【００５１】
次に、結果確認装置２により作成したクラスタリングデータを操作者が確認する際の処理について、図９を参照して説明する。
図９において、結果確認処理装置２が、ＳＮＰデータ処理装置１から所定のネットワークなどを介して、２つのパターンのクラスタリング結果データが提供されると、データ入出力処理部２０２が提供されたデータを受付ける（Ｓ２０１）。
表示制御部２０３は、出力ファイルの各クラスタのＦ比、Ｈａｒｄｙ　Ｗｅｉｎｂｅｒｇ平衡の法則の適合度データに基づいて、クラスタリングが行われたデータか、クラスタリングが行われなかったリジェクトデータかを判別する（Ｓ２０２）。
【００５２】
判別の結果、適正なクラスタリングが行われたデータである場合には、表示制御部２０３は、図１０に示すように、２次元のグラフとして、適合度の高いクラスタリング結果をアクティブ状態（図示の例では左側がアクティブ状態）とし、適合度の低いクラスタリング結果はシェードを掛けた状態でディスプレイ２に並列して表示する（Ｓ２０３）。
なお、この際、各クラスタに属するデータをそれぞれ異なる色で表してもよい。また、ラベリングデータは丸形、非ラベリングデータは四角形で表示するなどしてもよい。
【００５３】
この状態で操作者は２つのクラスタリング結果を見比べてクラスタリング結果が正しいか目視で判断する。
そして、表示制御部２０３は、操作者が、選択されたクラスタリング結果（図１０の例では３クラスタ側）が正しいクラスタリング結果であると判断したか、又は選択されなかったクラスタリング結果（図１０の例では２クラスタ側）の方が正しいと判断したか、或いはいずれも適切なクラスタリング結果でないと判断したかのいずれの判断結果のとなったかを判別する（Ｓ２１４）。
【００５４】
操作者がアクティブとされているクラスタリング結果が正しいと判断し、図１０中の「ＮＥＸＴ」のラジオボタンを指示した場合には、表示制御部２０３は選択されたクラスタリング結果を適正なクラスタリング結果として確認データ記憶部２０１に記憶し（Ｓ２０５）、処理を終了する。
なお、次に処理すべきデータがある場合には、Ｓ２０１の処理に戻って処理を繰り返す。
【００５５】
また、操作者が、シェードが掛けられている方のクラスタリング結果が正しいと判断し、当該グラフ又はラジオボタン（図１０の例では「２ｃｌｕｓｔｅｒ」のラジオボタン）を指示すると、表示制御部２０３は選択されたクラスタリング結果をアクティブ状態の表示に切り替えて表示し（Ｓ２０６）、操作者がＮＥＸＴボタンを指示することにより確認データ記憶部２０１にクラスタリングデータを記憶して処理を終了する。
また、操作者がいずれのクラスタリング結果も適切でないと判断した場合には、後述のＳ２０８の処理に移る。
【００５６】
また、Ｓ２０２の処理で、リジェクトデータであると判別された場合には、表示制御部２０３は、「ＲＥＪＥＣＴＥＤ」の文字をアクティブ状態として表示すると共に、いずれのクラスタリング結果もシェードを掛けて表示する（Ｓ２０７）。
この状態で操作者が目視で確認を行い修正可能か否か判断し、表示制御部２０３は、操作者がクラスタリングの修正を行う旨の指示を行ったか否か判別する（Ｓ２０８）。
操作者がマニュアルでデータを修正する場合には、図１０中の「Ｍａｎｕａｌ　ｃａｌｌ」のラジオボタンを指示することにより、データを修正処理装置４に提供し（Ｓ２０９）、操作者が手動でクラスタリング結果の修正ができるようにし処理を終了する。
また、操作者が、元となるデータ自体が適切でない場合など修正が不可能なデータであると判断し、図１０中の「ｕｎａｂｌｅ」のラジオボタンを指示した場合には、表示制御部２０３は当該データをリジェクトデータとして確認データ記憶部２０１に記憶して（Ｓ２１０）、処理を終了する。
【００５７】
このように本実施形態によれば、プロット処理部１０４によりＳＮＰ蛍光発光量データを座標上にプロットし、角度情報処理部１０７により、プロットされた個々のＳＮＰデータと基準点とを結ぶ直線を求め、この直線と所定の基準線との角度を求め、クラスタリング処理部１０８により角度情報に基づいてＳＮＰ蛍光発光量データをクラスタリングするようにしたことから、基準点を中心に放射状に分布するＳＮＰデータを適切にクラスタリングすることができる。
これにより、ＳＮＰデータを自動的にクラスタリングできるため、人手により行う場合に比べて、画一的でかつミスのないクラスタリングを行うことができ、クラスタリングを行う人の作業量を減少させることもできる。
【００５８】
また、基準点処理部１０５により、座標上に所定の仮定点をとり、当該仮定点と座標上の各ＳＮＰ蛍光発光量データとを結ぶ直線と所定の基準線との角度を求め、この角度情報に基づいて各ＳＮＰ蛍光発光量データのクラスタリングを行い、求められた各クラスタの重心を通る直線の交点を基準点として決定するようにしたことから、例えば、ＳＮＰデータが座標の原点から離れた位置に分布している場合など、座標の原点を基準にした場合では適切なクラスタリングができない場合であっても、適切な基準点を設定してからＳＮＰ蛍光発光量データのクラスタリングを行うことができる。
【００５９】
また、ラベリング領域処理部１０６により、クラスタ分析を行う前に、基準点から所定の距離に存在する点を抽出し、抽出した点を上記クラスタ分析の対象から除外させるようにしたことから、基準点に近く、クラスタリングに誤差を生じやすい非ラベリング領域のデータを予め除外してからクラスタリングが行えるようになり、より適切なクラスタリングを行うことができる。
【００６０】
また、クラスタリング処理部１０８が、２クラスタと３クラスタの２つのパターンのクラスタリングを行い、表示制御部２０３が、これら２つのパターンのクラスタリング結果を選択可能に並列表示するようにしたことから、操作者は２つのクラスタを見比べて、各クラスタリング結果を比較したうえで適切なクラスタを選択することができる。
【００６１】
なお、上述の実施形態では、ＳＮＰデータ処理装置１、結果確認処理装置２、前処理装置３、修正処理装置４、マージ処理装置５、ＳＮＰ管理装置６をそれぞれ別の装置として構成した例について説明したが、これのうちのいずれか又は全ての機能を１つの装置で実現してもよく任意である。
【００６２】
また、上述の実施形態では、ＳＮＰ蛍光発光量データのクラスタリング処理について説明したが、本発明はこれに限定されるものではなく、所定の基準点から放射状に分布するデータであれば適用可能である。
【００６３】
本実施形態のＳＮＰデータ処理装置１又は結果確認処理装置２用のコンピュータプログラムを、コンピュータ読み取り可能な媒体（ＦＤ、ＣＤ−ＲＯＭ等）に格納して配布してもよいし、搬送波に重畳し、通信ネットワークを介して配信することも可能である。
なお、ＳＮＰデータ処理装置１又は結果確認処理装置２の機能をＯＳ（Ｏｐｅｒａｔｉｎｇ　Ｓｙｓｔｅｍ）が分担又はＯＳとアプリケーションプログラムの共同により実現する場合等には、ＯＳ以外の部分のみをコンピュータプログラムとして、またこのコンピュータプログラムをコンピュータ読み取り可能な媒体に格納したり、このコンピュータプログラムを配信等してもよい。
【００６４】
【発明の効果】
本発明によれば、データを適切にクラスタリングすることができ、特にＳＮＰデータのように所定の基準点を中心に放射状に分布するデータを適切にクラスタリングできる。
【図面の簡単な説明】
【図１】本発明にかかるデータ処理装置を用いたシステムの一実施形態の概略及び処理の流れを示した図。
【図２】本実施形態にかかるＳＮＰデータ処理装置の機能ブロック図。
【図３】本実施形態にかかる確認処理装置の機能ブロック図。
【図４】本実施形態にかかるＳＮＰデータ処理装置の処理の流れを示した処理フロー。
【図５】本実施形態にかかるＳＮＰデータ処理装置の処理の流れを示した続きの処理フロー。
【図６】本実施形態にかかるラベリング処理の概念を示した図。
【図７】本実施形態にかかる座標上の各データの角度情報を取得する処理の概念を示した図。
【図８】本実施形態にかかる角度情報に基づくヒストグラムの例を示した図。
【図９】本実施形態にかかる確認処理装置の処理の流れを示した図。
【図１０】本実施形態にかかる確認処理装置の画面の一例を示した図。
【符号の説明】
１　　　ＳＮＰデータ処理装置
２　　　結果確認処理装置
１０４　プロット処理部
１０５　基準点処理部
１０６　ラベリング領域処理部
１０７　角度情報処理部
１０８　クラスタリング処理部
２０３　表示制御部
[0001]
[Technical field of the invention]
The present invention relates to a technique for cluster analysis of data, and more particularly to a technique suitable for clustering data that is distributed radially from a predetermined reference point, such as SNP (Single Nucleotide Polymorphism) data.
[0002]
[Prior art]
At present, in the field of bioinformatics, individual differences arising from differences in the base sequences of genes are being studied.
These individual differences are thought to arise from differences in about 0.1% of the base sequence of the genes, and such a difference is called genetic individuality (polymorphism). Among these polymorphisms, one in which only a single base differs is called an SNP (single nucleotide polymorphism).
SNPs are estimated to occur at a rate of about one per several hundred to one thousand bases, and 3 million to 10 million SNPs are thought to exist in the genome.
By identifying these SNPs and analyzing them against factors such as clinical data and the environment, it becomes possible to determine, for example, whether a given drug is effective for a person with a particular single-nucleotide type, and to support the prevention, diagnosis, and treatment of disease.
[0003]
As described above, in order to realize the analysis using the SNP, it is necessary to analyze at what location on the genome and at what frequency the SNP exists.
The TaqMan method, the Invader method, and the like are used for SNP frequency analysis. In these methods, the fluorescence emission of each SNP is measured using two special reagents, and based on this fluorescence emission data the samples must be divided into two or three classes, namely two types of homozygote (for example, AA and GG when the base is A or G) and one type of heterozygote (for example, AG), and the frequency of occurrence must be determined for each class.
Conventionally, to perform SNP frequency analysis, each data point was plotted on two-dimensional coordinates based on the emission of each reagent, and a person visually clustered the plotted points into two or three clusters.
[0004]
[Problems to be solved by the invention]
However, in the conventional approach, a person visually clusters the data plotted on the two-dimensional coordinates, so the clustering criteria are not consistent and the clustering results vary. In addition, because the work is done by eye, clustering takes an enormous amount of time and human errors occur.
[0005]
For these reasons clustering needs to be automated, and using a general clustering method such as the K-means method is conceivable.
However, SNP data, when plotted on two-dimensional coordinates, is distributed radially around a certain center point. If a general clustering method such as the K-means method is applied to the emission values, data points that are simply close to one another are grouped together. As a result, correct classification is not possible (for example, a cluster may form that straddles homozygous and heterozygous data, which should belong to different clusters), and automation has been difficult.
[0006]
The present invention has been made to solve the above problems, and its object is to provide a mechanism that can cluster data appropriately, and in particular can appropriately cluster data that is distributed radially around a predetermined reference point, such as SNP data.
[0007]
[Means for Solving the Problems]
In order to solve the above problems, a data processing device according to one aspect of the present invention is a device for clustering a plurality of data items, comprising: plot processing means for plotting each data item on two-dimensional coordinates; angle information processing means for obtaining a straight line connecting each plotted data item with a predetermined reference point and obtaining the angle between that straight line and a predetermined reference line; and clustering processing means for clustering the data items based on the angle information.
[0008]
The data may be fluorescence emission data of gene SNPs obtained with two predetermined reagents, and the plotting means may plot the fluorescence emission data on the two-dimensional coordinates using the fluorescence emission of each reagent as one axis of the two-dimensional coordinates.
[0009]
The device may further include reference point processing means that takes a predetermined assumed point on the coordinates, obtains the angle between a straight line connecting the assumed point with each data item on the coordinates and a predetermined line passing through the assumed point, clusters the data items based on this angle information, and determines the reference point from the intersection of the principal component lines of the clusters, each of which passes through the center of its cluster.
[0010]
In addition, before performing the cluster analysis, a means for extracting a point existing at a predetermined distance from the reference point and excluding the extracted point from the target of the cluster analysis may be further provided.
[0011]
The clustering processing means may perform clustering in a plurality of patterns, and the device may further include display processing means for displaying the clustering results of the plurality of patterns side by side so that one can be selected.
The data may be fluorescence emission data of gene SNPs obtained with two predetermined reagents, and the plotting means may plot the fluorescence emission data on the two-dimensional coordinates using the fluorescence emission of each reagent as one axis of the two-dimensional coordinates.
[0012]
A data processing method according to one aspect of the present invention is a method for classifying a plurality of data items into predetermined classes by a computer, comprising: a process of plotting each data item on coordinates; a process of obtaining a straight line connecting each plotted data item with a predetermined reference point and obtaining the angle between that straight line and a predetermined reference line; and a process of clustering the data items based on the angle information.
[0013]
A computer program according to one aspect of the present invention is a computer program for classifying a plurality of data items into predetermined classes, and causes a computer to execute: a process of plotting each data item on coordinates; a process of obtaining a straight line connecting each plotted data item with a predetermined reference point and obtaining the angle between that straight line and a predetermined reference line; and a process of clustering the data items based on the angle information.
[0014]
[Embodiments of the invention]
Hereinafter, an embodiment in which a data processing device and a computer program according to the present invention are applied to an SNP data processing system will be described with reference to the drawings.
FIG. 1 shows an example of the SNP data processing system according to the present embodiment.
As shown in FIG. 1, the present system comprises an SNP data processing device 1 and a result confirmation processing device 2, which constitute the data processing device according to the present invention, together with a preprocessing device 3, a correction processing device 4, a merge processing device 5, and an SNP management device 6.
These devices may be connected to one another by a LAN (Local Area Network) or the like, or may be configured to exchange data via a predetermined medium such as a CD-ROM, FD, or MO.
[0015]
The preprocessing device 3 is a computer that performs processing by the TaqMan method or the like.
The preprocessing device 3, for example, measures the fluorescence emission of SNPs on a pair of chromosomes using two special reagents and generates fluorescence emission data for each SNP.
The fluorescence emission data consists of a color and its emission intensity. As for the color, for example, if the gene is homozygous AA, one of the two reagents emits light and produces one color; if it is homozygous GG, the other reagent emits light and produces a different color; and if it is heterozygous AG, both reagents emit light and the result is an intermediate color.
[0016]
The correction processing device 4 is a computer for performing a correction process on the data rejected by the result confirmation processing device 2.
As the correction processing, for example, the correction processing device 4 refers to the rejected data file (reject file), acquires the fluorescence emission amount data that is the basis of the rejected data from the preprocessing device 3, By displaying this on two-dimensional coordinates, a process is performed so that the operator can visually confirm the distribution of data and perform manual clustering.
[0017]
The merge processing device 5 is a computer that merges the data confirmed by the result confirmation processing device 2, the corrected data produced by the correction processing device 4, and the reject data that could not be corrected, and creates SNP data for management.
The SNP management device 6 is a computer that stores the created management SNP data and manages the SNP data by making it into a database.
[0018]
The SNP data processing device 1 is a device for clustering SNP data. The SNP data processing device 1 is configured by a computer, and includes a CPU (Central Processing Unit), a computer program executed by the CPU, an internal memory such as a RAM and a ROM for storing the computer program and predetermined data, and a hard disk drive. The functional block shown in FIG. 2 can be configured by the external storage device or the like.
The functional blocks shown in FIG. 2 comprise an assay data storage unit 101, a call data storage unit 102, a setting processing unit 103, a plot processing unit 104, a reference point processing unit 105, a labeling area processing unit 106, an angle information processing unit 107, a clustering processing unit 108, and a fitness processing unit 109.
[0019]
The assay data storage unit 101 is a storage unit that stores fluorescence emission amount data of each SNP data.
The assay data storage unit 101 can store, for example, the file name of each SNP data, the fluorescence emission amount data of the SNP, and the like.
[0020]
The call data storage unit 102 can store data obtained as a result of performing the clustering process automatically by the SNP data processing device 1.
The call data storage unit 102 can store, for example, each SNP data and cluster information of each SNP data.
[0021]
The setting processing unit 103 can perform setting processing of an initial file and the like.
The plot processing unit 104 plots the fluorescence emission data of each SNP stored in the assay data storage unit 101 on coordinates. For example, the plot processing unit 104 plots each SNP data point on two-dimensional coordinates in which the emission of one reagent is taken on the X axis and the emission of the other reagent on the Y axis.
[0022]
The reference point processing unit 105 obtains the reference point used when measuring the angle of each SNP data point on the coordinates.
For example, the reference point processing unit 105 takes the origin of the coordinates as a provisional reference point (assumed point), classifies the SNP data into a plurality of clusters based on the angle between the X axis and the straight line connecting the origin with each SNP data point, obtains the principal component line through the center of each cluster, takes the intersection of these principal component lines as the next assumed point, and repeats the same processing. The point at which the assumed point stops changing and converges can then be used as the reference point.
Instead of finding the point at which the assumed point converges, the reference point processing unit 105 may take as the reference point the point obtained by repeating the above processing a predetermined number of times. Alternatively, if the assumed point does not converge, the reference point processing unit 105 may use the initial reference point (such as the origin) as the reference point.
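The iterative reference point search described in this paragraph can be sketched as follows. This is only an illustration under several assumptions that the description does not fix: the 1-D k-means initialisation (evenly spaced quantiles), the closed-form principal axis of a 2x2 covariance, the use of a least-squares "nearest point to all lines" when three lines do not meet in exactly one point, and all function names (`kmeans_1d`, `principal_line`, `nearest_point_to_lines`, `estimate_reference_point`) are hypothetical.

```python
import math

def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd k-means on scalars; quantile initialisation is an assumption."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels

def principal_line(points):
    """Centroid and unit direction of the first principal component of 2-D points."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # major axis of a 2x2 covariance
    return (cx, cy), (math.cos(theta), math.sin(theta))

def nearest_point_to_lines(lines):
    """Least-squares point closest to all lines (the exact intersection for two lines)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (cx, cy), (dx, dy) in lines:
        nx, ny = -dy, dx                 # unit normal of the line
        proj = nx * cx + ny * cy         # signed offset of the line from the origin
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * proj
        b2 += ny * proj
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:                 # lines are (nearly) parallel
        return None
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det)

def estimate_reference_point(points, k=3, start=(0.0, 0.0), max_rounds=20, tol=1e-6):
    """Repeat: angles about the assumed point -> clusters -> principal lines -> intersection."""
    current = start
    for _ in range(max_rounds):
        angles = [math.atan2(y - current[1], x - current[0]) for x, y in points]
        labels = kmeans_1d(angles, k)
        clusters = [[p for p, lab in zip(points, labels) if lab == j] for j in range(k)]
        lines = [principal_line(c) for c in clusters if len(c) >= 2]
        nxt = nearest_point_to_lines(lines) if len(lines) >= 2 else None
        if nxt is None:
            return start                 # fall back to the initial assumed point
        if math.hypot(nxt[0] - current[0], nxt[1] - current[1]) < tol:
            return nxt                   # assumed point has converged
        current = nxt
    return start                         # no convergence: keep the initial point
```

Returning `start` when the loop does not converge mirrors the fallback mentioned in the text; returning the last estimate after a fixed number of rounds would be the other variant the text allows.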
[0023]
The labeling area processing unit 106 performs processing for separating a labeling area to be subjected to cluster analysis and a non-labeling area to be excluded from cluster analysis.
For example, the labeling area processing unit 106 takes the midpoint (x_median, y_median) between the maximum and minimum values of the SNP data, and treats data whose distance from this midpoint is greater than a threshold (S) as belonging to the labeling area and data whose distance is smaller as belonging to the non-labeling area.
The threshold value (S) for performing labeling may be determined by the operator each time an operation is performed, or may be set in advance by default.
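A minimal sketch of the labeling-area split follows. It uses the normalised test that paragraph 0043 of the Japanese description states for step S111, (x / x_median)^2 + (y / y_median)^2 < S, to mark a point as non-labeling. The function names `midpoint` and `split_labeling_area` and the sample values are hypothetical, and the sketch assumes neither midpoint coordinate is zero.

```python
def midpoint(points):
    """Midpoint between the minimum and maximum value on each axis."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return ((max(xs) + min(xs)) / 2.0, (max(ys) + min(ys)) / 2.0)

def split_labeling_area(points, threshold):
    """Return (labeling, non_labeling) lists according to the S111-style test."""
    x_mid, y_mid = midpoint(points)
    labeling, non_labeling = [], []
    for x, y in points:
        score = (x / x_mid) ** 2 + (y / y_mid) ** 2
        if score < threshold:
            non_labeling.append((x, y))   # excluded from clustering
        else:
            labeling.append((x, y))       # passed on to the angle-based clustering
    return labeling, non_labeling
```

With this test, weak signals near the origin are excluded, which matches the stated aim of dropping points whose angle is unreliable.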
[0024]
The angle information processing unit 107 obtains the angle between a predetermined reference line and the straight line connecting the reference point on the coordinates with each SNP data point.
The reference line is a line serving as a reference when measuring the angle. For example, the X-axis or the Y-axis may be used as a reference, or another axis may be used as the reference line, which is optional.
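As an illustration of the angle calculation, the sketch below takes the X axis as the reference line and measures the angle with `math.atan2`. The function names `angle_to_reference_line` and `to_polar` are hypothetical, and choosing the X axis is only one of the options the text permits.

```python
import math

def angle_to_reference_line(point, reference=(0.0, 0.0)):
    """Angle in radians between the X axis and the line from `reference` to `point`."""
    dx = point[0] - reference[0]
    dy = point[1] - reference[1]
    # atan2 keeps the correct quadrant and handles dx == 0 without dividing.
    return math.atan2(dy, dx)

def to_polar(points, reference=(0.0, 0.0)):
    """Convert (x, y) points to (radius, angle) pairs about `reference`."""
    return [
        (math.hypot(x - reference[0], y - reference[1]),
         angle_to_reference_line((x, y), reference))
        for x, y in points
    ]
```

Only the angle component is used for clustering; the radius is kept here because the description also speaks of converting the data to polar coordinates.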
[0025]
The clustering processing unit 108 performs a process of clustering the SNP data based on the angle information of the SNP data.
This clustering process can be performed using an existing clustering algorithm such as the k-means method based on one-dimensional angle information.
When clustering is performed, for example, the clustering processing unit 108 can run two patterns of processing in parallel: clustering into two clusters and clustering into three clusters.
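The two-pattern clustering on one-dimensional angle data might look like the sketch below. It is a plain Lloyd-style k-means written with the standard library only. The quantile-based initialisation, the function names `kmeans_1d` and `cluster_two_and_three`, and the sample angles are assumptions for illustration, since the description only names k-means as one usable algorithm.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd k-means on scalar values; returns (labels, centers)."""
    ordered = sorted(values)
    n = len(ordered)
    # Initial centres at evenly spaced quantiles (an assumption, not from the source).
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels, centers

def cluster_two_and_three(angles):
    """Run both patterns so that the results can be compared afterwards."""
    return {k: kmeans_1d(angles, k) for k in (2, 3)}
```

Keeping both results lets a later step score each pattern and present the better one first, as the confirmation screen in the description does.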
[0026]
The fitness processing unit 109 calculates a fitness value indicating whether the result of the clustering performed by the clustering processing unit 108 is appropriate as a clustering of SNP data.
For example, the fitness processing unit 109 takes the distance between clusters as α and the variance of the data within each cluster as β, and judges that the larger the ratio α/β (the F ratio), the more appropriate the clustering. As another measure of fitness, the fitness processing unit 109 can judge fitness by whether the allele frequencies follow the distribution given by the Hardy-Weinberg equilibrium law. The Hardy-Weinberg equilibrium law states that, for example, when the frequencies of alleles M and N are p and q respectively (where p + q = 1), each genotype frequency is given by the product of the frequencies of its constituent alleles: the frequency of MM is p², the frequency of NN is q², and the frequency of MN is 2pq.
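The two fitness measures can be sketched as follows. The description does not give exact formulas, so both choices here are assumptions: the F ratio is computed as the usual one-way ANOVA statistic (between-cluster mean square over within-cluster mean square), and agreement with Hardy-Weinberg equilibrium is scored with a chi-square statistic against the expected counts n·p², 2n·pq and n·q². The function names are hypothetical.

```python
def f_ratio(values, labels):
    """Between-cluster mean square divided by within-cluster mean square.

    Assumes at least two clusters and more values than clusters.
    """
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {lab: sum(g) / len(g) for lab, g in groups.items()}
    between = sum(len(g) * (means[lab] - grand_mean) ** 2
                  for lab, g in groups.items()) / (k - 1)
    within = sum((v - means[lab]) ** 2
                 for lab, g in groups.items() for v in g) / (n - k)
    return between / within if within > 0 else float("inf")

def hardy_weinberg_expected(n_mm, n_mn, n_nn):
    """Expected genotype counts (MM, MN, NN) under Hardy-Weinberg equilibrium."""
    total = n_mm + n_mn + n_nn
    p = (2 * n_mm + n_mn) / (2 * total)   # frequency of allele M
    q = 1.0 - p                           # frequency of allele N
    return (total * p * p, total * 2 * p * q, total * q * q)

def hardy_weinberg_chi_square(n_mm, n_mn, n_nn):
    """Chi-square distance between observed and expected genotype counts."""
    observed = (n_mm, n_mn, n_nn)
    expected = hardy_weinberg_expected(n_mm, n_mn, n_nn)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A larger F ratio and a smaller chi-square value both point to a more plausible clustering; the reject thresholds are left to configuration, as in the description.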
[0027]
The result confirmation processing device 2 is a device that performs a process for causing an operator to confirm the processing result obtained by the clustering performed by the SNP data processing device 1.
As shown in FIG. 3, the result confirmation processing device 2 is connected with a display 20 and an input device 30 such as a keyboard and a mouse. The result confirmation processing device 2 is constituted by a computer, and includes a CPU, a computer program executed by the CPU, an internal memory such as a RAM and a ROM for storing the computer program and predetermined data, and an external storage device such as a hard disk drive. Thus, the functional blocks shown in FIG. 3 can be configured.
The functional block shown in FIG. 3 includes a confirmation data storage unit 201, a data input / output processing unit 202, and a display control unit 203.
[0028]
The confirmation data storage unit 201 is a storage unit for storing data for which confirmation of the clustering result has been completed by the operator. The confirmation data storage unit 201 can store the SNP data and cluster information obtained by clustering the SNP data.
[0029]
The data input / output processing unit 202 can perform a process of receiving input of clustering data of the SNP generated by the SNP data processing device 1 and a process of outputting confirmed data to the merge processing device 5.
[0030]
The display control unit 203 displays the clustered result data on the display 20 and performs a process of requesting confirmation from the operator.
When the SNP data processing device 1 has performed clustering in two patterns, two clusters and three clusters, the display control unit 203 displays the results of both patterns side by side on the display 20 so that the operator can check and select the result data.
[0031]
Next, an embodiment of a data processing method according to the present invention will be described with reference to the drawings.
First, the flow of processing of the entire system will be described with reference to FIG.
In FIG. 1, the preprocessing device 3 acquires the fluorescence emission amount data of the SNP using the TaqMan method or the like (S1).
The SNP data processing device 1 acquires the SNP fluorescence emission amount data from the preprocessing device 3, clusters the SNP data, and attaches to the SNP data a label indicating the cluster (S2).
[0032]
When the clustering is completed, the result confirmation processing device 2 displays the result on the two-dimensional coordinates based on the clustering result data generated by the SNP data processing device 1 and requests the operator to confirm (S3).
If, as a result of the confirmation, the operator determines that proper clustering has not been performed, the name of the rejected data is written to a reject file and provided to the correction processing device 4 (S4).
[0033]
The correction processing device 4 acquires the fluorescence emission amount data of the SNP from the preprocessing device 3 based on the reject data name, displays the acquired data on a predetermined display, and allows the operator to correct the clustering visually (S5).
[0034]
The merge processing device 5 acquires the data judged to be appropriately clustered through the operator's confirmation in S3, the data corrected in S5, and the reject data judged to be uncorrectable, and creates management data by merging these data (S6).
[0035]
When the merge process is completed, the merged management data is provided to the SNP management device 6, and the SNP management device 6 manages the management data by creating a database or the like (S7), and ends the process.
[0036]
Next, a detailed process when the SNP data processing device 1 performs clustering will be described with reference to FIG.
In FIG. 4, first, the setting processing unit 103 refers to the initialization file input by the operator and initializes the predetermined constants and the like used for clustering (S101).
The initialization file contains data such as an output file name, an input file name, the number of set files (the number of files to be processed), the number of records in one file, the threshold value (S) of the labeling area, and a rotation angle for placing the data in a continuous region of the tangent.
[0037]
The setting processing unit 103 opens the output file stored in the call data storage unit 102 (S102), and determines whether the number of SNP data files read so far has reached the number of set files to be processed this time (S103).
As a result of the determination, if the number of set files has been reached, it is assumed that the processing has been completed for all the SNP data files targeted for the current processing, and the processing ends.
[0038]
If the determination shows that the number of set files has not been reached, the plot processing unit 104 opens the input file stored in the assay data storage unit 101, reads the fluorescence emission amount data of the SNP, and plots it on two-dimensional coordinates based on the intensity of each emission color (S104).
The two-dimensional coordinates are obtained by taking the light emission amount of one reagent on one axis (X axis) and the light emission amount of the other reagent on the other axis (Y axis).
[0039]
When the plot is completed, the reference point processing unit 105 converts each SNP data into polar coordinates, and performs one-dimensional clustering using the angle information of the polar coordinates (S105).
For example, the reference point processing unit 105 takes the origin of the coordinates as the pole, obtains the straight line connecting the pole with each data item, and derives the polar coordinates of that item from the angle between this straight line and the X axis and from the distance to the pole. The clustering can be performed, for example, by the reference point processing unit 105 using a known clustering method such as the k-means method.
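The conversion to polar coordinates and the one-dimensional clustering on the angle can be sketched as follows. This is an illustrative sketch only: `to_polar` and `kmeans_1d` are hypothetical names, and the quantile-based choice of initial centers is an assumption, since the text names only "a known clustering method such as the k-means method".

```python
import math

def to_polar(points, pole=(0.0, 0.0)):
    """Return (angle, distance) for each (x, y) point, measured from the pole.

    The angle is taken against the X axis, in radians.
    """
    px, py = pole
    return [(math.atan2(y - py, x - px), math.hypot(x - px, y - py)) for x, y in points]

def kmeans_1d(values, k, iterations=50):
    """Plain k-means on scalar values; returns (labels, centers)."""
    ordered = sorted(values)
    # Assumed initialization: evenly spaced quantiles of the sorted values
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest center for each value
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: mean of each cluster (an empty cluster keeps its center)
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

Only the angle component is passed to the clustering, for example `kmeans_1d([a for a, _ in to_polar(points)], 3)`; the distance component is not used at this step.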
[0040]
When the clustering is completed, the reference point processing unit 105 performs a principal component analysis of each cluster using a predetermined principal component analysis algorithm, finds the principal component straight line passing through the center of each cluster, and obtains the intersection of these principal component straight lines (S106).
Then, using this intersection as a temporary reference point (hypothetical point), the reference point processing unit 105 performs the processing of S105 and S106 again a predetermined number of times (n times), and determines whether the intersections obtained, from intersection 1 to intersection n, have converged to a fixed position (S107).
If the determination shows that the intersection has not converged, the process returns to S105 and is repeated.
If it is determined in S107 that the intersection has converged to a fixed position, intersection n is set as the reference point (S108).
[0041]
Instead of determining whether the intersection has converged, the number of repetitions of the intersection calculation may be set in advance; in that case, the reference point processing unit 105 returns to S105 and repeats the processing until the set number is reached, and the reference point is determined from the result.
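The intersection step (S106) can be sketched as follows. The function names are hypothetical. The principal axis uses the closed form for a 2x2 covariance matrix, and, as an assumption for the three-cluster case where three lines need not meet in one point, the "intersection" is taken as the least-squares point closest to all lines; for two non-parallel lines this is their exact intersection.

```python
import math

def principal_line(points):
    """Centroid and unit direction of the first principal component of 2-D points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Orientation of the major axis of a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))

def nearest_point_to_lines(lines):
    """Point minimizing the summed squared distance to all lines.

    Each line is ((cx, cy), (dx, dy)) with a unit direction vector.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (cx, cy), (dx, dy) in lines:
        # Projector onto the direction normal to the line: I - d d^T
        m11, m12, m22 = 1 - dx * dx, -dx * dy, 1 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * cx + m12 * cy
        b2 += m12 * cx + m22 * cy
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("principal lines are (nearly) parallel")
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det
```

Repeating S105 and S106 then amounts to re-clustering the angles measured from the returned point and recomputing it, stopping either when successive points move less than a chosen tolerance or after a fixed number of rounds.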
[0042]
When the reference point setting is completed, the labeling area processing unit 106 acquires the coordinates (x_median, y_median) of the center point when the SNP data is plotted on the two-dimensional coordinates (S109).
As shown in FIG. 6, the center point (M) can be obtained by finding the maximum value Max and the minimum value Min of the SNP fluorescence emission amount data and taking the midpoint between them (x_median, y_median) as the center point.
[0043]
The labeling area processing unit 106 determines whether the number of read records has reached the set number of records, that is, whether the processing has been performed for all records (S110).
If the determination shows that the set number has not been reached, the labeling area processing unit 106 determines whether the coordinates (x, y) of one SNP fluorescence emission amount data item satisfy (x / x_median)² + (y / y_median)² < threshold value (S), that is, whether the distance from the center point is within the predetermined threshold value S (S111).
SNP data whose coordinates give a value smaller than the threshold value is assigned to the non-labeling area, and a flag is set to exclude it from the data to be clustered (S112).
SNP data whose coordinates give a value larger than the threshold value is flagged as belonging to the labeling area (S113).
FIG. 6 shows an example of this labeling process. The area inside the arc-shaped line in FIG. 6 is the non-labeling area, and the area outside the arc-shaped line is the labeling area.
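The center point calculation (S109) and the labeling area test (S111 to S113) can be sketched together as follows. The function names are hypothetical, and the sketch assumes that x_median and y_median are non-zero and that a value exactly equal to the threshold is treated as belonging to the labeling area, a case the text does not address.

```python
def center_point(points):
    """Midpoint between the minimum and maximum on each axis (x_median, y_median)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) + min(xs)) / 2, (max(ys) + min(ys)) / 2

def split_labeling_area(points, threshold):
    """Split points into (labeling, non_labeling) lists using threshold S."""
    x_median, y_median = center_point(points)
    labeling, non_labeling = [], []
    for x, y in points:
        score = (x / x_median) ** 2 + (y / y_median) ** 2
        if score < threshold:
            non_labeling.append((x, y))  # excluded from clustering
        else:
            labeling.append((x, y))      # passed on to the angle-based clustering
    return labeling, non_labeling
```

Dividing each coordinate by the midpoint of its axis normalizes the two reagent scales, so a single threshold S describes an elliptical boundary in the original coordinates.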
[0044]
When the labeling area determination is finished and the determination in S110 shows that the number of read records has reached the set number, the angle information processing unit 107 converts the SNP fluorescence emission amount data of the labeling area into polar coordinates with the reference point as the pole (S114).
In this process, as shown in FIG. 7, the reference point (the origin in the illustrated example) is taken as the pole, the angle between a reference line (for example, the X axis) and the straight line connecting the reference point with each data item is obtained, and the SNP fluorescence emission amount data is converted into polar coordinates based on the obtained values.
[0045]
Subsequently, in FIG. 5, when the conversion into the polar coordinates is completed, the clustering processing unit 108 performs one-dimensional clustering by a predetermined clustering algorithm based on the polar coordinate angle information of each SNP data (S115).
At this time, the clustering processing unit 108 performs clustering in two cases, that is, clustering into two clusters and clustering into three clusters. Note that this clustering process can use an existing clustering algorithm such as the k-means method.
FIG. 8 shows an example in which a histogram is created based on angles. In FIG. 8, the horizontal axis represents the angle (radian), and the vertical axis represents the number of data.
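An angle histogram of the kind shown in FIG. 8 can be built as in the following sketch. The function name, the number of bins, and the default range of 0 to π/2 radians are assumptions made for illustration; the text specifies only that the horizontal axis is the angle in radians and the vertical axis is the number of data.

```python
import math

def angle_histogram(angles, bins=18, lo=0.0, hi=math.pi / 2):
    """Count angles (radians) falling into equal-width bins between lo and hi."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for a in angles:
        if lo <= a <= hi:
            # The upper edge is folded into the last bin
            index = min(int((a - lo) / width), bins - 1)
            counts[index] += 1
    return counts
```

Well-separated genotype clusters appear in such a histogram as two or three distinct peaks, which is what makes one-dimensional clustering on the angle workable.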
[0046]
When the clustering is completed, the clustering processing unit 108 stores the polar coordinates and the clustered two-dimensional coordinates in the call data storage unit 102 (S116).
[0047]
The fitness processing unit 109 calculates the F ratio of the cluster on the coordinates for each of the two clusters and the three clusters (S117).
The fitness processing unit 109 also calculates the allele ratio and the like for each of the two-cluster and three-cluster results, and calculates the degree to which this ratio conforms to the Hardy-Weinberg equilibrium law (S118).
[0048]
Then, based on the F ratio of each cluster and the degree of conformity to the Hardy-Weinberg equilibrium law, the fitness processing unit 109 determines whether proper clustering has been performed or whether the data is reject data for which proper clustering cannot be performed (S119).
This determination can be made by setting threshold values in advance for the F ratio and for the degree of conformity to the Hardy-Weinberg equilibrium law, and judging the data to be reject data when the calculated results fall below these threshold values.
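The threshold-based reject decision can be sketched as follows. The class and function names are hypothetical, the conformity score is assumed to be scaled so that larger is better, and rejecting when either value falls below its threshold is one possible reading; the text does not state whether both values or only one must fall below.

```python
from dataclasses import dataclass

@dataclass
class ClusteringResult:
    n_clusters: int
    f_ratio: float
    hwe_conformity: float  # assumed score: larger means closer to Hardy-Weinberg proportions

def is_reject(result, f_threshold, hwe_threshold):
    """Reject when either fitness value is below its threshold (assumed rule)."""
    return result.f_ratio < f_threshold or result.hwe_conformity < hwe_threshold

def accepted_results(results, f_threshold, hwe_threshold):
    """Return the results that pass; an empty list means the data is reject data."""
    return [r for r in results if not is_reject(r, f_threshold, hwe_threshold)]
```

Applied to the two-cluster and three-cluster results of one SNP, an empty return value corresponds to the case in which a reject file entry is created.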
[0049]
When the data is determined to be reject data, the fitness processing unit 109 creates a reject file in which the data name, sample ID, and the like are recorded (S120).
[0050]
If the clustering is determined to be appropriate, or when the processing of S120 is completed, the label, the F ratio, and the degree of conformity to the Hardy-Weinberg equilibrium law are stored in the call data storage unit 102 as an output file (S121). The process then returns to S103 and is repeated until the number of set files is reached, whereupon the process ends.
[0051]
Next, the process by which the operator checks the clustering data on the result confirmation processing device 2 will be described with reference to FIG.
In FIG. 9, when two patterns of clustering result data are provided to the result confirmation processing device 2 from the SNP data processing device 1 via a predetermined network or the like, the data input / output processing unit 202 accepts the provided data (S201).
Based on the F ratio of each cluster and the Hardy-Weinberg conformity data in the output file, the display control unit 203 determines whether the data has been clustered or is reject data that could not be clustered (S202).
[0052]
If the determination shows that the data has been properly clustered, the display control unit 203 displays the clustering result with the higher fitness in an active state as a two-dimensional graph as shown in the figure (in the illustrated example, the left side is in the active state), and displays the clustering result with the lower fitness alongside it on the display 20 in a shaded state (S203).
At this time, the data belonging to each cluster may be represented by different colors. Further, the labeling data may be displayed in a round shape, and the non-labeling data may be displayed in a square shape.
[0053]
In this state, the operator compares the two clustering results and visually determines whether the clustering result is correct.
Then, the display control unit 203 determines whether the operator has judged the selected clustering result (the three clusters in the example of FIG. 10) to be correct, has judged the unselected clustering result (the two clusters in the example of FIG. 10) to be the more correct one, or has judged neither to be an appropriate clustering result (S214).
[0054]
When the operator judges that the active clustering result is correct and selects the “NEXT” radio button in FIG. 10, the display control unit 203 stores the selected clustering result in the confirmation data storage unit 201 as an appropriate clustering result (S205), and the process ends.
If there is data to be processed next, the process returns to the process of S201 and repeats the process.
[0055]
When the operator judges that the shaded clustering result is correct and selects its graph or radio button (the “2 cluster” radio button in the example of FIG. 10), the display control unit 203 switches the selected clustering result to the active state (S206). When the operator then selects the NEXT button, the clustering data is stored in the confirmation data storage unit 201, and the process ends.
If the operator determines that none of the clustering results is appropriate, the process proceeds to S208 described below.
[0056]
If it is determined in S202 that the data is reject data, the display control unit 203 displays the text “REJECTED” in an active state and displays both clustering results shaded (S207).
In this state, the operator visually checks and determines whether or not the correction can be made, and the display control unit 203 determines whether or not the operator has given an instruction to correct the clustering (S208).
When the operator chooses to correct the data manually by selecting the “Manual call” radio button in FIG. 10, the data is provided to the correction processing device 4 (S209), the clustering result is entered manually by the operator, and the process ends.
When the operator judges that the original data itself is unsuitable and selects the “unable” radio button in FIG. 10, the display control unit 203 stores the data as reject data in the confirmation data storage unit 201 (S210), and the process ends.
[0057]
As described above, according to the present embodiment, the plot processing unit 104 plots the SNP fluorescence emission amount data on coordinates, and the angle information processing unit 107 obtains a straight line connecting each plotted SNP data item with the reference point. Because the angle between this straight line and a predetermined reference line is determined and the clustering processing unit 108 clusters the SNP fluorescence emission amount data based on this angle information, SNP data distributed radially around the reference point can be properly clustered.
As a result, the SNP data can be clustered automatically, so that uniform and error-free clustering can be performed and the workload of the person performing the clustering is reduced compared with manual clustering.
[0058]
Further, the reference point processing unit 105 takes a predetermined hypothetical point on the coordinates, obtains the angle between a predetermined reference line and the straight line connecting the hypothetical point with each SNP fluorescence emission amount data item on the coordinates, clusters the SNP fluorescence emission amount data based on this angle information, and determines as the reference point the intersection of the straight lines passing through the centers of gravity of the resulting clusters. Therefore, even when clustering cannot be performed properly with the origin of the coordinates as the basis, for example when the SNP data is distributed at positions away from the origin, the SNP fluorescence emission amount data can be clustered after an appropriate reference point has been set.
[0059]
Further, before the cluster analysis is performed, the labeling area processing unit 106 extracts points located within a predetermined distance of the reference point and excludes the extracted points from the cluster analysis. Clustering can thus be performed after data in the non-labeling area, where clustering errors are likely, has been excluded in advance, so that more appropriate clustering can be performed.
[0060]
Also, the clustering processing unit 108 performs clustering in two patterns, two clusters and three clusters, and the display control unit 203 displays the clustering results of both patterns side by side in a selectable manner, so that the operator can compare the clustering results and select the appropriate one.
[0061]
In the above-described embodiment, an example has been described in which the SNP data processing device 1, the result confirmation processing device 2, the preprocessing device 3, the correction processing device 4, the merge processing device 5, and the SNP management device 6 are configured as separate devices. However, any or all of these functions may be implemented by a single device, as desired.
[0062]
Further, in the above-described embodiment, the clustering processing of the SNP fluorescence emission amount data has been described. However, the present invention is not limited to this, and can be applied to any data distributed radially from a predetermined reference point.
[0063]
The computer program for the SNP data processing device 1 or the result confirmation processing device 2 of the present embodiment may be stored in a computer-readable medium (FD, CD-ROM, etc.) and distributed, or may be superimposed on a carrier wave and distributed via a communication network.
When the functions of the SNP data processing device 1 or the result confirmation processing device 2 are partly handled by the OS (Operating System), or are realized jointly by the OS and an application program, only the portion other than the OS may be stored in a computer-readable medium as a computer program, or that computer program may be distributed.
[0064]
[Effect of the invention]
According to the present invention, data can be appropriately clustered, and in particular, data distributed radially around a predetermined reference point, such as SNP data, can be appropriately clustered.
[Brief description of the drawings]
FIG. 1 is a diagram showing an outline of an embodiment of a system using a data processing apparatus according to the present invention and a flow of processing.
FIG. 2 is a functional block diagram of the SNP data processing device according to the embodiment;
FIG. 3 is a functional block diagram of a confirmation processing device according to the embodiment;
FIG. 4 is a processing flow showing a processing flow of the SNP data processing device according to the embodiment;
FIG. 5 is a subsequent processing flow showing the processing flow of the SNP data processing device according to the embodiment;
FIG. 6 is an exemplary view showing the concept of a labeling process according to the embodiment;
FIG. 7 is an exemplary view showing a concept of a process of acquiring angle information of each data on coordinates according to the embodiment;
FIG. 8 is a view showing an example of a histogram based on angle information according to the embodiment;
FIG. 9 is an exemplary view showing the flow of the process of the confirmation processing device according to the embodiment;
FIG. 10 is an exemplary view showing an example of a screen of the confirmation processing device according to the embodiment.
[Explanation of symbols]
1 SNP data processing device
2 Result confirmation processing device
104 Plot processing unit
105 Reference point processing unit
106 Labeling area processing unit
107 Angle Information Processing Unit
108 Clustering processing unit
203 Display control unit

Claims

An apparatus for clustering a plurality of data,
Plot processing means for plotting each of the data on two-dimensional coordinates;
Angle information processing means for obtaining a straight line connecting the individual data plotted and a predetermined reference point, and obtaining an angle between the straight line and a predetermined reference line,
Clustering processing means for clustering each data based on the angle information;
A data processing device comprising:

The data is data distributed radially from the reference point when plotted on two-dimensional coordinates.
The data processing device according to claim 1.

Further comprising reference point processing means for taking a predetermined hypothetical point on the coordinates, obtaining an angle between a straight line connecting the hypothetical point with each data on the coordinates and a predetermined line passing through the hypothetical point, clustering the data based on the angle information, and determining the reference point from the intersection of the principal component straight lines passing through the centers of the obtained clusters.
The data processing device according to claim 1.

Further comprising means for extracting, before the cluster analysis is performed, a point present at a predetermined distance from the reference point, and excluding the extracted point from the target of the cluster analysis.
The data processing device according to claim 1.

The clustering processing means performs clustering in a plurality of patterns,
Further comprising display processing means for displaying the clustering results of the plurality of patterns performed by the clustering processing means side by side in a selectable manner;
The data processing device according to claim 1.

The data is fluorescence emission amount data of the SNP of the gene by two predetermined reagents,
The plot processing means plots the fluorescence emission amount data on the two-dimensional coordinates, using the fluorescence emission amount of each reagent as a respective axis of the two-dimensional coordinates.
The data processing device according to claim 1.

A method for classifying a plurality of data into a predetermined class by a computer,
A process of plotting each of the data on coordinates,
A process of obtaining a straight line connecting the individual data plotted and a predetermined reference point, and obtaining an angle between the straight line and a predetermined reference line;
A process of clustering each data based on the angle information;
A data processing method comprising:

A computer program for causing a computer to classify a plurality of data into a predetermined class,
The computer program causing the computer to execute:
A process of plotting each of the data on coordinates,
A process of obtaining a straight line connecting the individual data plotted and a predetermined reference point, and obtaining an angle between the straight line and a predetermined reference line;
A process of clustering each data based on the angle information.