JP3723767B2 - Biological sequence information processing method and apparatus

Info

Publication number: JP3723767B2
Application number: JP2001377632A
Authority: JP
Inventors: 宏樹荒川; 浩輔 ▲たか▼木
Original assignee: 株式会社バイオマティクス
Priority date: 2001-12-11
Filing date: 2001-12-11
Publication date: 2005-12-07
Anticipated expiration: 2021-12-11
Also published as: AU2002366918A1; WO2003054744A1; JP2003216615A

Description

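[0001]
[Technical Field of the Invention]
The present invention relates to a method and apparatus for processing biological sequence information, such as base sequences and amino acid sequences, for analysis, and in particular to speeding up that processing.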
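[0002]
[Prior Art]
In molecular biology, information processing technology is increasingly useful for analyzing DNA, genes, proteins, and the like. In this field, information processing technology is used to analyze sequence information. Technology of this kind is called bioinformatics.
[0003]
For example, SNPs (pronounced "snips"; single nucleotide polymorphisms) analysis examines a large number of nearly identical base sequences to find the base sequences that differ locally.
[0004]
As another example, a homology search determines whether, and in what way, multiple pieces of sequence information resemble one another. Known homology search methods include the BLAST method and the FASTA method.
[0005]
The BLAST method searches for regions that match well locally without inserting gaps. Such a region is called a high-scoring segment. The high-scoring segment is then extended forward and backward.
[0006]
The FASTA method finds portions where the sequences match over a long stretch. For this purpose, dot matrix information, in which the matching elements of multiple sequences are plotted, has conventionally been used. Alignment by dynamic programming is then performed around the matching portion.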
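[0007]
[Problems to Be Solved by the Invention]
Sequence analysis requires processing large amounts of information at high speed, because very long sequences and large numbers of sequences are processed. Conventionally, however, the heavy information processing of sequence analysis has relied solely on the processing capacity of large computers, and techniques for high-speed processing of sequence information have not been sufficiently established. As sequence analysis research advances and molecular biology comes into practical use in drug discovery, medicine, and similar settings, faster sequence information processing is expected to become more important. There is also a need to process large amounts of sequence information at high speed on relatively small computers, on the order of a personal computer, rather than on large computers.
[0008]
The present invention has been made in view of the above problems, and its object is to provide a method and apparatus that speed up the processing of sequence information.
[0009]
One object of the invention is to provide a method and apparatus capable of comparing multiple pieces of sequence information at high speed, as is done in SNPs analysis.
[0010]
One object of the invention is to provide a method and apparatus capable of searching for a specific sequence within sequence information at high speed, as is done in BLAST analysis.
[0011]
One object of the invention is to provide a method and apparatus capable of searching at high speed for contiguous matching portions of multiple pieces of sequence information, as is done in FASTA analysis.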
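[0012]
[Means for Solving the Problems]
(1) To achieve the above objects, the sequence information processing method of the present invention stores sequence information, for use as target data, in a storage processing device having a parallel matching function, causes the storage processing device to match query data against the target data by parallel processing, and obtains information indicating matches between the query data and the target data, thereby obtaining sequence analysis information. Using the parallel matching function makes it possible to compare large amounts of data at high speed in sequence information processing, and thus to speed up sequence analysis.
[0013]
Preferably, the storage processing device having the parallel matching function is a CAM. CAMs have conventionally been used as components of Internet routers. The invention focuses on the fact that the parallel matching function of a CAM is suited to processing sequence information, and has the CAM perform the comparison of large amounts of data. As a result, the part of sequence analysis that accounts for much of the workload is greatly accelerated by the CAM, and sequence analysis can be sped up.
[0014]
In addition, CAMs are widely used as components for Internet routers and are relatively inexpensive and easy to obtain. CAMs are also advantageous in that they connect easily to computers such as ordinary personal computers. Thus, by noting that the characteristics of CAMs, which are widespread as router components, also suit sequence information processing, and by building a sequence information processing apparatus with a CAM, the invention offers not only high speed but also the advantage that such an apparatus can be provided easily and at low cost.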
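[0015]
(2) In one aspect that is particularly central to the invention, multiple pieces of sequence information are stored, for use as target data, in a storage processing device having a parallel matching function so that each sequence runs in the direction crossing the matching direction and the sequences are lined up along the matching direction. The invention then uses, as target data, the data of the multiple sequences that lie side by side along the matching direction, uses as query data the data corresponding to a same-code string, in which one and the same code (such as a character representing a sequence element) is repeated, and causes the storage processing device to match the query data against the target data by parallel processing.
[0016]
In this way, the invention uses the storage processing device in a distinctive manner, storing sequence information oriented in the direction crossing the matching direction. Each piece of target data therefore consists of data from multiple sequences lined up in the matching direction. Data corresponding to a same-code string is used as the query data. Parallel matching of this target data against the query data determines at high speed whether the multiple sequences match.
[0017]
The invention is particularly advantageous when, as with a CAM, the width of the storage processing device in the matching direction is narrower than the sequence length. Sequences actually processed are often long, so this situation arises frequently. Because the invention stores sequence information in the direction crossing the matching direction, even very long sequences can be accommodated in the storage processing device. By using query data corresponding to same-code strings, the agreement of the sequences stored in the crossing direction can be determined, and this is done at high speed by parallel matching. The invention thus uses a storage processing device with a parallel matching function to speed up sequence analysis effectively.
[0018]
Preferably, this aspect obtains, through the above processing, information used for SNPs analysis. SNPs analysis requires rapid processing of many sequences. In particular, genome-based drug discovery and tailor-made medicine are expected to come into practical use, making SNPs analysis of many samples necessary. It is desirable to be able to perform SNPs analysis at high speed without a large computer. The invention can meet such needs.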
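[0019]
(3) In one aspect of the invention, biological sequence information is stored, for use as target data, in a storage processing device having a parallel matching function, oriented in the matching direction. The invention further uses the sequence information to be matched as query data and causes the storage processing device to match the query data against the target data by parallel processing. In this aspect, unlike the aspect above, the sequence information is stored oriented in the matching direction. The advantage of changing the storage direction, described for the above aspect, is therefore not obtained. Even so, this aspect still gains the speed-up from parallel processing using the parallel matching function. More detailed aspects of the invention follow.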
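[0020]
(4) In one aspect of the invention, multiple pieces of biological sequence information, such as base sequences or amino acid sequences, are stored, for use as target data, in a storage processing device having a parallel matching function, oriented in the matching direction. A reference sequence is then used as query data, and the storage processing device matches the query data against the target data by parallel processing. Typically, a reference sequence consisting of a partial sequence is used to find local matches, as in a BLAST search. According to the invention, whether each of multiple sequences contains the reference sequence is determined at high speed using the parallel matching function.
[0021]
Preferably, the invention performs matching with a match-target portion, whose length corresponds to the reference sequence, and a remaining match-excluded portion, and performs matching multiple times with the match-excluded portion at different positions. By varying the match-excluded portion, a match can be detected properly regardless of which part of the stored sequence the reference sequence matches. The matching part can also be located.
[0022]
Preferably, the invention divides a continuous sequence into multiple segments, stores the segments in the storage processing device having the parallel matching function so that they are lined up in the direction crossing the matching direction, and determines by parallel processing whether part of each segment matches the reference sequence.
[0023]
The invention is particularly advantageous when, as with a CAM, the storage processing device is narrow in the matching direction and long in the crossing direction. Even when the width in the matching direction is narrow, dividing the sequence allows a long sequence to be stored by exploiting the length in the crossing direction. That length can also be used to store a large number of sequences at once and process them in parallel.
[0024]
Furthermore, the sequence division of this aspect is advantageous for faster computation. Division shortens the sequence length in the matching direction, which reduces the amount of computation. When the multiple kinds of match-excluded portions described above are set, a shorter sequence length in the matching direction means less computation. Thus, when the storage processing device is narrow in the matching direction and long in the crossing direction, the invention does not treat this as an obstacle; rather, it reduces computation through sequence division and parallel processing, enabling even faster sequence analysis.
[0025]
Preferably, this aspect obtains, through the above processing, information used for homology analysis such as the BLAST method. The speed-up is considered particularly useful when, for example, a BLAST search is run against the large number of sequences in a database.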
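[0026]
(5) In one aspect of the invention, the same sequence information is stored, for use as target data, in a storage processing device having a parallel matching function, shifted little by little and oriented in the matching direction. The sequence information is shifted by a predetermined number of characters at a time, usually one character. The invention then uses another piece of sequence information to be compared as query data, uses the same sequence information stored with successive shifts as target data, and causes the storage processing device to match the query data against the target data by parallel processing.
[0027]
According to the invention, portions where multiple pieces of sequence information match contiguously are found at high speed using parallel processing. The longest matching portion can also be found, and the position of a contiguous matching portion can be identified. Through the distinctive use of a storage processing device with a parallel matching function, storing the sequence with successive shifts, information on contiguous matching portions similar to that obtained with a dot matrix in, for example, a FASTA search can be obtained.
[0028]
Preferably, this aspect obtains, through the above processing, information used for homology analysis such as the FASTA method. The speed-up is considered particularly useful when, for example, a FASTA search is run against the large number of sequences in a database.
[0029]
The invention is not limited to the method aspects above. Another aspect of the invention is, for example, a sequence information processing apparatus. The apparatus may form a system accessed via a network. The apparatus and the above method may be implemented by multiple distributed computers. Other aspects of the invention include, for example, a program that causes a computer to carry out the above processing method, and a medium on which such a program is recorded.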
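[0030]
[Embodiments of the Invention]
Preferred embodiments of the invention (hereinafter, embodiments) are described below with reference to the drawings.
[0031]
In this embodiment, base sequences, one form of sequence information, are processed. The embodiment is, however, equally applicable to any other biological sequence information, such as amino acid sequences.
[0032]
Fig. 1 shows the hardware configuration of the biological sequence information processing apparatus of this embodiment. The sequence information processing apparatus 10 includes a CPU 12, a ROM 14, a RAM 16, a CAM 18, a hard disk 20, an input device 22, and an output device 24.
[0033]
The hard disk 20 stores a program that implements the functions of the sequence information processing apparatus 10. The CPU 12 executes this program. The hard disk 20 also stores the sequence information to be analyzed, and the CPU 12 obtains sequence information from the hard disk 20. Sequence information may also be obtained from other components. For example, it may be obtained from a recording medium such as a CD-ROM or DVD via a recording medium mount (not shown). It may also be obtained via a communication device, which may obtain sequence information from a network such as the Internet.
[0034]
The input device 22 is a keyboard, pointing device, or the like. The user operates the input device 22 to enter various instructions and to enter information requested by the sequence information processing apparatus 10. The output device 24 is a display, printer, or the like, and presents the analysis results. The display also shows guidance screens for the user, such as screens needed to operate the input device 22. When sequence information is obtained through a communication device as described above, it is preferable to have the communication device also serve as an output device and to output analysis results and other information through it.
[0035]
As is clear from the above, the sequence information processing apparatus 10 has the functions of an ordinary personal computer. It differs from an ordinary personal computer in that it includes the CAM 18. The apparatus 10 makes effective use of the CAM 18 to analyze sequence information at high speed.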
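[0036]
A CAM (Content Addressable Memory) is a typical and preferred form of the storage processing device with a parallel matching function of the invention. A CAM is also called an associative memory. A CAM is a data storage device in which storage locations are identified by their information content rather than by name, address, or relative position, which enables fast data retrieval. CAMs are normally used in Internet routers.
[0037]
Fig. 2 shows the function of a CAM in an Internet router. The CAM stores a routing table. The routing table associates multiple IP addresses with router names: each IP address is associated with the name of the router to which data bearing that IP address should be forwarded. When an IP address is input as query data, the CAM searches for the stored IP address that matches the query data. This search is performed by parallel processing. The CAM then outputs the router name associated with the matching IP address.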
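The router lookup described above can be sketched in software as follows. This is only a behavioral model under stated assumptions: the function name `cam_lookup`, the table contents, and the router names are hypothetical, and the comparison that a hardware CAM performs on all entries at once is emulated here by an ordinary loop.

```python
def cam_lookup(entries, key):
    """Model one CAM search: every stored word is compared with the key.

    A hardware CAM performs all comparisons simultaneously; the list
    comprehension below reproduces only the result, not the parallelism.
    """
    return [payload for word, payload in entries if word == key]


# Hypothetical routing table: (IP address, router name) pairs.
routing_table = [
    ("192.0.2.1", "router-A"),
    ("198.51.100.7", "router-B"),
    ("203.0.113.9", "router-C"),
]

print(cam_lookup(routing_table, "198.51.100.7"))  # ['router-B']
```

The later sketches reuse the same idea: stored words play the role of target data, the key plays the role of query data, and a match yields the stored payload (here a router name, in the sequence processing simply "1").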
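[0038]
Thus, a CAM can match query data against target data by parallel processing and output the result. The invention calls this the parallel matching function. Sequence processing, meanwhile, involves many data comparisons on sequences, and the CAM's function suits this kind of processing. The invention focuses on this point and has the CAM perform the many comparisons of sequence data. As a result, the part of sequence analysis that accounts for much of the workload is greatly accelerated by the CAM, and sequence analysis can be sped up.
[0039]
Furthermore, CAMs are widely used as components for Internet routers and are relatively inexpensive. CAMs are also advantageous in that they connect easily to ordinary personal computers. Thus, by noting that the characteristics of CAMs, normally used in routers, suit sequence information processing, and by building a sequence information processing apparatus with a CAM, the invention offers not only high speed but also the advantage that such an apparatus can be provided easily and at low cost.
[0040]
A CAM is a typical and preferred form of storage processing device with a parallel matching function. Other storage processing devices with a parallel matching function may be used, with a similar speed-up. Compared with the conventional approach of comparing large amounts of data in software on RAM, a large speed-up is possible.
[0041]
Fig. 3 is a functional block diagram of the sequence information processing apparatus 10. The functions of the sequence processing control unit 30 are implemented by the CPU 12 of Fig. 1 executing the program. The sequence processing control unit 30 has a sequence information acquisition unit 32, a target data input unit 34, a query data input unit 36, a match result acquisition unit 38, a match result processing unit 40, and an analysis information output unit 42.
[0042]
The sequence information acquisition unit 32 obtains the sequence information to be analyzed, from the hard disk 20 or elsewhere as described above. The target data input unit 34 loads the target data (referenced data) into the CAM 18, where it is stored. The query data input unit 36 inputs the query data (reference data) to the CAM 18. The CAM 18 matches the query data against the target data and outputs the match result, which the match result acquisition unit 38 obtains. The match result processing unit 40 performs processing for sequence analysis, such as various determinations based on the match results. The analysis information output unit 42 performs processing to output the sequence analysis information obtained by the match result processing unit 40.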
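[0043]
The various kinds of sequence information processing performed by the sequence information processing apparatus 10 are described below.
[0044]
(1-1) SNPs analysis
Fig. 4 shows the SNPs analysis of this embodiment. Gene sequences are said to differ between individuals at an average rate of about one base per 1000 bases. SNPs analysis compares multiple sample sequences and detects the presence of such differing sequences.
[0045]
Referring to Fig. 4, the CAM 18 stores multiple sequences sent from the sequence processing control unit 30. In the usual use of a CAM, as in an Internet router, the data to be matched is stored along the CAM's matching direction (Fig. 2). In the invention, as illustrated, each sequence is stored oriented in the direction crossing the matching direction, with the sequences lined up along the matching direction.
[0046]
The order of storage is arbitrary. Data may be stored sequentially in the matching direction: the first character of the first sequence, then the first character of the second sequence, and so on. Alternatively, data may be stored sequentially in the crossing direction: the first character of the first sequence, then its second character, and so on. All that matters is that each sequence ends up oriented in the crossing direction, as shown in Fig. 4.
[0047]
Currently available CAMs come in several matching-direction widths, such as 144-bit and 288-bit types. The number of sequences that can be processed at once is limited by the CAM width. With an ordinary CAM, for example the 144-bit type above, about 100 sequences can be input.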
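[0048]
Next, same-character strings are input as query data. As is well known, bases are represented by four letters: A (adenine), T (thymine), G (guanine), and C (cytosine). First, the query data (AAAA...) is input.
[0049]
The CAM 18 matches the query data against each piece of target data. Each piece of target data is a character string lined up in the matching direction. As noted above, in this embodiment the sequences are input oriented in the direction crossing the matching direction. Each piece of target data is therefore a row made up of one character from each sequence.
[0050]
When the query data and a piece of target data match completely, the CAM 18 outputs information indicating a match. In this embodiment, "1" is output: the CAM 18 stores "1" in the position that corresponds to the router name in an Internet routing table, and this "1" is output. If there is no match, "0" is output.
[0051]
Here the query data is "AAA...", so "1" is output when all the characters of the target data are "A". The same processing is performed in turn for the other letters "T", "G", and "C".
[0052]
The right-hand part of Fig. 4 shows the match results. When a piece of target data consists of a single repeated character, "1" is output as the result for one of the query strings (A, T, G, or C). When it contains differing characters, "0" is output for every query string. This means that the input sequences include a sequence that differs due to polymorphism. In this way the presence of a differing sequence is detected.
[0053]
To determine whether a differing sequence exists, a logical (bit) operation on the match results is preferably used. The operation here is 1*0=1, 0*1=1, 1*1=1, 0*0=0, that is, a logical OR. It is applied, for each row, to the four match results: two results are combined, then a third is combined with that, then the fourth. As shown on the left of Fig. 4, the result is "1" when the target data consists of a single repeated character and "0" when it contains differing characters. In this way the presence of a differing sequence is determined, and the position of the differing base is identified as well.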
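The same-character matching and OR step described above can be illustrated with a small software model. It is a sketch only: the function names and the sample sequences are made up for illustration, the CAM's parallel comparison is emulated by a loop, and characters are used directly instead of a bit encoding.

```python
BASES = "ATGC"


def rows_of(sequences):
    """Transpose the sequences: row j holds character j of every sequence,
    which is how the target data lies along the matching direction."""
    return ["".join(chars) for chars in zip(*sequences)]


def search_same_char(rows, base):
    """One CAM search with the query base*width: 1 per fully matching row."""
    key = base * len(rows[0])
    return [1 if row == key else 0 for row in rows]


def uniform_rows(sequences):
    """OR the four search results row by row.

    1 means every sequence has the same base at that position;
    0 marks a position where at least one sequence differs.
    """
    rows = rows_of(sequences)
    combined = [0] * len(rows)
    for base in BASES:
        hits = search_same_char(rows, base)
        combined = [c | h for c, h in zip(combined, hits)]
    return combined


samples = ["ATGCA", "ATGCA", "ATTCA", "ATGCA"]  # third sample differs at index 2
print(uniform_rows(samples))  # [1, 1, 0, 1, 1]
```

Only four searches are issued regardless of the sequence length, which is the point the text makes about the amount of computation.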
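[0054]
Next, the process of identifying which of the base sequences differs from the others is described. This process sets Don't Care bits (hereinafter, DC bits) in the target data.
[0055]
A DC bit allows a match search on partial data, ignoring the data at a specific position within the target data. In this embodiment, a DC bit is the position of a character to be ignored, or the ignored character itself. Ignoring it excludes part of the target data from matching.
[0056]
Fig. 5 shows the DC bit setting patterns. The sequence at the position where a DC bit is set is excluded from matching. As illustrated, the DC bit is shifted one position at a time and the processing of Fig. 4 is repeated. When the DC bit is set at the position of a sequence that differs from the others, the processing of Fig. 4 finds that all the remaining sequences match completely; that is, the results of the logical operation on the left are all "1". The sequence at the DC bit position at that time is identified as the differing sequence.
[0057]
For example, in Fig. 4 the third sequence differs from the others. In this case, as the arrow in Fig. 5 indicates, all sequences match completely when the third bit is set as the DC bit. This shows that the third sequence differs from the rest.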
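The DC bit sweep can be modeled as below. This is a hedged sketch: `agreement_flags` collapses the four same-character searches and their OR into a single test per row, the DC bit is modeled by simply skipping one sequence, and all names and sample data are hypothetical.

```python
def agreement_flags(sequences, dc_index=None):
    """Per-row agreement with the sequence at dc_index ignored (DC bit).

    Returns 1 for a row whose remaining characters are all identical.
    This equals the OR of the four same-character searches for that row.
    """
    flags = []
    for row in zip(*sequences):
        kept = [ch for i, ch in enumerate(row) if i != dc_index]
        flags.append(1 if len(set(kept)) == 1 else 0)
    return flags


def find_differing_sequence(sequences):
    """Move the DC bit across the sequences until every row agrees."""
    if all(agreement_flags(sequences)):
        return None  # no differing sequence at all
    for dc_index in range(len(sequences)):
        if all(agreement_flags(sequences, dc_index)):
            return dc_index
    return None  # more than one sequence differs


samples = ["ATGCA", "ATGCA", "ATTCA", "ATGCA"]
print(find_differing_sequence(samples))  # 2, i.e. the third sequence
```

Returning as soon as a pattern gives a complete match corresponds to ending the sweep early rather than running every DC bit pattern.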
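[0058]
Fig. 6 is a flowchart of the SNPs analysis described above. First, the target data input unit 34 of the sequence processing control unit 30 loads the sequence information obtained by the sequence information acquisition unit 32 into the CAM 18 (S10). The sequence information is stored oriented in the direction crossing the matching direction. Next, the query data input unit 36 inputs the query data (S12). A same-character string such as AAA... is input.
[0059]
The CAM matches the query data against the target data and outputs the result: "1" if the query data and the target data match completely, "0" otherwise. The match result acquisition unit 38 obtains the match result (S14).
[0060]
Next, the sequence processing control unit 30 determines whether matching has been performed for all characters (A, T, G, C) (S16). If not, the process returns to S12 and is repeated with the same-character string for the next character as query data.
[0061]
If S16 is YES, the process advances to S18, where the match result processing unit 40 performs the SNPs determination. Here the operation shown on the right of Fig. 4 is performed, and the presence of a differing sequence and the position of the differing base are identified.
[0062]
Next, it is determined whether a differing sequence exists (S20); if NO, the process ends. If YES, the process advances to S22 to identify which sequence differs.
[0063]
In S22, a DC bit is set. First, the first bit of the target data is set as the DC bit (top row of Fig. 5). The processing of S24, S26, and S28 may be the same as S12, S14, and S16 above: a same-character string is input to the CAM 18 as query data (S24), the match result is obtained from the CAM 18 (S26), and matching is performed for all characters (S28).
[0064]
Next, it is determined whether S22 to S28 have been performed for all the DC bit patterns shown in Fig. 5 (S30). If NO, the process returns to S22 and the DC bit setting is changed. The DC bit position is shifted one at a time. In this way, match results are obtained with the DC bit at each position, that is, with each sequence excluded from matching in turn.
[0065]
If S30 is YES, the process advances to S32, where the match result processing unit 40 identifies the differing sequence. The DC bit position at which a complete match was obtained indicates the differing sequence.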
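[0066]
For clarity, the above description used the letters A, T, G, and C normally used to represent base sequences. Other codes may of course be used within the scope of the invention, as long as they represent elements such as bases.
[0067]
In actual computer processing, characters should be represented with a small amount of data rather than handled as characters. Since there are four bases, every base can be represented with as little as 2 bits. In that case, on the CAM 18 of Fig. 4, 2 bits of data per character are lined up in the crossing direction. Considered at the bit level, two rows in the matching direction together represent target data such as AAA.... In the matching process, these two rows may be processed together. Alternatively, matching may be performed one row at a time and the results processed further. The latter is also included in the invention's processing that uses data corresponding to a same-code string as query data.
[0068]
As for DC bits, the description above set a DC bit on a character. In actual processing, when for example the four bases are represented with 2 bits, one DC bit (*) in the description naturally corresponds to 2 bits on the computer.
[0069]
Also, for clarity the invention was described using rectangular diagrams as in Figs. 2 and 4, following the usual way of depicting CAMs. The physical location of data in an actual CAM is of course not limited to that of Fig. 4 and the like. The same applies to the other embodiments.
[0070]
In the processing of Fig. 6, matching is performed for all DC bit patterns in order to find the differing sequence. However, processing may end as soon as the differing sequence is found, before all patterns have been used. In that case, whether the differing sequence has been found is checked after each pattern is processed.
[0071]
In the processing above, matching with a DC bit set covered the same range as the initial matching. Alternatively, matching with DC bits may target a narrower range, for example only the positions where a differing base exists. Those positions can be identified by the processing of Fig. 4 (operation result 0).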
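[0072]
(1-2) Deletion/insertion detection
Next, deletion/insertion detection using the sequence analysis technique of this embodiment is described. As is well known, a deletion means that, when multiple sequences are compared, one sequence lacks a base. An insertion means that, when multiple sequences are compared, one sequence has a base that the others do not.
[0073]
Fig. 7 shows the processing of this embodiment. It is broadly the same as the SNPs analysis of Fig. 4, but the evaluation of the match results differs.
[0074]
In Fig. 7, the sequences to be compared are stored in the CAM 18 oriented in the direction crossing the matching direction and lined up along the matching direction. Each piece of target data is therefore a row made up of one character from each sequence. The query data is a same-character string such as AAA..., and it is compared with each piece of target data. The CAM 18 is programmed to output "1" when the query data and the target data match and "0" when they do not.
[0075]
In the example of Fig. 7, the third sequence has a deletion at row n. In this case, for row n-1 and the rows before it, "1" is output in one of the four character matches. For row n and the rows after it, "0" is output.
[0076]
Thus, with the matching of this embodiment, the position of the deletion forms a boundary between a portion where the query data and target data match contiguously and an adjacent portion where they contiguously fail to match. A similar result is obtained when there is an insertion.
[0077]
Therefore, according to this embodiment, when such a result is obtained, that is, when a contiguously matching portion is adjacent to a contiguously non-matching portion, it can be concluded that a deletion or insertion exists.
[0078]
The determination regarding deletion or insertion is preferably made with the logical operation shown on the left of Fig. 7. As in Fig. 4, the operation is 1*0=1, 0*1=1, 1*1=1, 0*0=0. When there is a deletion or insertion, the result of the operation is ...111000... as illustrated: a contiguously matching portion adjacent to a contiguously non-matching portion. The deletion or insertion lies at this boundary.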
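The "...111000..." boundary test can be sketched as follows. The names and sample sequences are illustrative assumptions, and the sketch takes the text's idealized picture at face value: it expects no chance agreement after the boundary, which real data would not guarantee.

```python
def agreement_flags(sequences):
    """1 for each row whose characters are all identical, else 0.

    Stands in for OR-ing the four same-character CAM searches.
    Rows beyond the shortest sequence are not compared.
    """
    return [1 if len(set(row)) == 1 else 0 for row in zip(*sequences)]


def indel_boundary(flags):
    """Return the index where a run of 1s gives way to 0s only, else None."""
    if 0 not in flags:
        return None
    start = flags.index(0)
    if start > 0 and not any(flags[start:]):
        return start
    return None  # e.g. an isolated 0 between 1s, as with a substitution


# Third sample lacks the "C" at index 3 of the other two.
deleted = ["ATGCAGTC", "ATGCAGTC", "ATGAGTC"]
flags = agreement_flags(deleted)
print(flags)                  # [1, 1, 1, 0, 0, 0, 0]
print(indel_boundary(flags))  # 3
```

An isolated 0 surrounded by 1s returns `None` here, which matches the pattern the text later associates with a substitution rather than a deletion or insertion.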
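[0079]
Which sequence has the deletion or insertion can be detected using DC bits. The processing with DC bits may be the same as in SNPs analysis. Setting a DC bit excludes one sequence from matching. If setting the DC bit at some position changes the result of the logical operation and extends the contiguously matching portion, the sequence corresponding to that DC bit position has the deletion or insertion.
[0080]
In the example of Fig. 7, when the third bit of the target data is set as the DC bit, the result of the logical operation changes and the contiguously matching portion extends beyond row n. This shows that the third sequence has a deletion or insertion.
[0081]
It is also possible to determine whether a deletion or an insertion is present. To do so, the sequence that has the deletion or insertion is shifted on the CAM 18 by one character in the direction crossing the matching direction. The matching and logical operation described above are then performed.
[0082]
Suppose the sequence is shifted by one character downward in Fig. 7. With a deletion, "1" is then output as the match result from row n+1 onward, and the operation result is also 1. For row n and the rows before it, the result is reversed and "0" is obtained. If instead there is an insertion, this result is not obtained: even in the shifted state, "1" is not output as a match result, and the operation results remain a run of 0s. The match results in the shifted state thus reveal whether a deletion or an insertion is present.
[0083]
Conversely, the sequence may be shifted upward in Fig. 7. In that case, if there is an insertion, the operation result changes: 1s continue from row n onward, and 0s continue for row n-1 and the rows before it. With a deletion, 0s continue from row n onward as well. This difference in results shows whether a deletion or an insertion occurred.
[0084]
In the two shift operations above, the whole sequence was shifted. However, only part of the sequence may be shifted: just the location of the deletion or insertion and the part of the sequence after it.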
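The shift test can be sketched as below, using the partial shift mentioned above (only the part from the boundary onward is moved). Everything here is an illustrative assumption: the function names, the gap symbol, the sample data, and the reading of "downward" as "toward later rows". As in the text's idealized case, a single one-character deletion or insertion is assumed.

```python
def agreement_flags(sequences):
    """1 for each row whose characters are all identical, else 0."""
    return [1 if len(set(row)) == 1 else 0 for row in zip(*sequences)]


def classify_indel(sequences, odd, n, gap="-"):
    """Decide whether sequence `odd` has a deletion or an insertion at row n.

    Shifting the tail one row toward later positions realigns a deletion;
    shifting it one row toward earlier positions realigns an insertion.
    """
    s = sequences[odd]
    others = [x for i, x in enumerate(sequences) if i != odd]

    shifted_down = s[:n] + gap + s[n:]          # open a one-character gap
    if all(agreement_flags(others + [shifted_down])[n + 1:]):
        return "deletion"

    shifted_up = s[:n] + s[n + 1:]              # drop the extra character
    if all(agreement_flags(others + [shifted_up])[n:]):
        return "insertion"

    return "undetermined"


with_deletion = ["ATGCAGTC", "ATGCAGTC", "ATGAGTC"]
with_insertion = ["ATGCAGTC", "ATGCAGTC", "ATGTCAGTC"]
print(classify_indel(with_deletion, odd=2, n=3))   # deletion
print(classify_indel(with_insertion, odd=2, n=3))  # insertion
```

The arguments `odd` and `n` would come from the DC bit sweep and the boundary test, respectively.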
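[0085]
The processing above detected a one-character deletion or insertion. Deletions or insertions of two or more characters can be detected in the same way by shifting the sequence in the crossing direction by that number of characters. For example, to test for a two-character deletion, the sequence is shifted by two characters in the crossing direction.
[0086]
Fig. 8 is a flowchart of the deletion/insertion detection described above. The basic processing is the same as the SNPs analysis of Fig. 6, so the description is abbreviated where appropriate. The target data input unit 34 loads the sequence information into the CAM 18 (S40). The sequence information is stored oriented in the direction crossing the matching direction. The query data input unit 36 then inputs query data corresponding to a same-character string (S42). The match result acquisition unit 38 obtains the match result from the CAM 18 (S44). The sequence processing control unit 30 then determines whether matching has been performed for all characters (A, T, G, C) (S46). If not, the process returns to S42.
[0087]
If S46 is YES, the process advances to S48, where the match result processing unit 40 determines whether a deletion or insertion exists. As described with reference to Fig. 7, the match result processing unit 40 determines that a deletion or insertion exists when a portion where the query data and target data match contiguously is adjacent to a portion where they contiguously fail to match. When there is no deletion or insertion, the determination at S50 is NO and the process ends.
[0088]
When there is a deletion or insertion, the process advances to S52 to identify the sequence that has it. In S52, a DC bit is set. The processing of S54, S56, and S58 may be the same as S42, S44, and S46 above. It is then determined whether S52 to S58 have been performed for all DC bit patterns (S60). If NO, the process returns to S52 and the DC bit setting is changed. If S60 is YES, the process advances to S62, where the sequence with the deletion or insertion is identified.
[0089]
As noted for Fig. 6, matching need not be performed for all DC bit patterns. Whether the sequence with the deletion or insertion has been found may be judged from the match result for each pattern, and the identification may end as soon as it is found.
[0090]
Next, the process advances to S64 to determine whether a deletion or an insertion is present. In S64, the sequence with the deletion or insertion is shifted on the CAM 18 in the direction crossing the matching direction. The matching of S66 to S70 is then performed; S66, S68, and S70 may be the same as S42, S44, and S46, respectively. Based on this match result, it is determined as described above whether a deletion or an insertion is present (S72).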
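[0091]
(1-3) Substitution detection
Fig. 9 is a flowchart of substitution detection using the sequence analysis technique of this embodiment. This processing is basically the same as for SNPs. SNPs analysis is, after all, a search for single-base substitutions across multiple sequences, so substitutions can be detected by applying the processing described for SNPs. When there is a substitution, as shown in Fig. 4, the query data and target data match contiguously, then there is a portion where they do not match, and then they match contiguously again. When such a match result is obtained, the presence of a substitution and its position are identified. In this way the invention can detect substitutions.
[0092]
Fig. 9 is basically almost the same as Fig. 6, so its description is omitted. For substitution detection, however, it is preferable to identify in advance any samples whose sequence length differs, for example because of a deletion, and to exclude them from the data. In S80, therefore, multiple sequences of the same length are loaded into the CAM 18 for use as target data.
[0093]
The sequence analysis processing of the invention, which makes effective use of a CAM, has been described above for SNPs analysis and mutation (deletion, insertion, and substitution) detection. A CAM is usually relatively narrow in the matching direction; for example, 144 bits and 288 bits are the usual CAM widths. A relatively long piece of sequence information such as a gene does not fit in such a narrow width. The invention therefore stores sequence information in the CAM oriented in the direction crossing the matching direction. The length in this crossing direction is very large even in ordinary CAMs, which makes it possible to accommodate long sequences in the CAM. Furthermore, using query data corresponding to same-character strings makes sequence comparison by the CAM possible. In this way the invention puts the CAM's fast parallel matching function to use in sequence analysis and makes faster sequence analysis possible.
[0094]
The sequence processing of the invention may be applied to sequence analyses other than SNPs analysis and mutation detection, to the extent feasible within the scope of the invention.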
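[0095]
The amount of computation in the processing of the invention is now compared with that of conventional sequence processing, using a simplified example. Bases are represented by four kinds of characters. When sequences of n characters are compared, the amount of computation in conventional processing is roughly 4 to the power n. As the number of characters n grows, the amount of computation increases greatly.
[0096]
In the invention, by contrast, the parallel matching function of the storage processing device (including a CAM) is used appropriately, and query data corresponding to same-character strings is input to the storage processing device. Four pieces of query data, one for each of the four kinds of characters, are input in turn. The amount of computation in the invention therefore corresponds to four match operations. Even when the number of characters n grows, the amount of computation does not increase much. The invention can thus greatly reduce the amount of computation compared with conventional processing.
[0097]
As already stated, the invention is applicable not only to base sequences but equally to other biological sequence information such as amino acid sequences. The advantage of the invention becomes more pronounced as the number of kinds of sequence elements (in general, kinds of characters) increases. This advantage is explained in detail below.
[0098]
Consider the simplified example again. Bases are represented by 4 kinds of characters, and natural amino acids by 20 kinds. When sequences of n characters are compared by conventional processing, the amount of computation is 4 to the power n for base sequences and 20 to the power n for amino acid sequences. The amount of computation for amino acid sequences is therefore "5 to the power n" times that for base sequences. In conventional processing, then, the amount of computation increases greatly as the number of kinds of sequence elements increases.
[0099]
Because the invention uses the parallel matching function of the storage processing device (including a CAM), in the above example the amount of computation for amino acid sequences is only 5 times (20 ÷ 4) that for base sequences.
[0100]
That is, in the invention, data corresponding to same-character strings is input to the storage processing device as query data. For bases, four pieces of query data are input, one for each of the 4 kinds of characters. For amino acids, 20 pieces of query data are input, one for each of the 20 kinds of characters. The amount of computation therefore increases only fivefold. With respect to growth in computation with the number of kinds of sequence elements, the increase is clearly smaller for the invention than for conventional processing.
[0101]
The above example is simplified and does not express the exact amount of computation. Even so, as the example makes clear, the amount of computation in the invention is far smaller than in conventional processing. The invention can therefore speed up conventional sequence processing to advantage.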
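The comparison above can be written out under the text's own simplified cost model. The symbols are introduced here only for illustration: $k$ is the number of kinds of sequence elements, $n$ the sequence length, $C_{\mathrm{conv}}(k,n)=k^{n}$ the conventional cost as the text models it, and $C_{\mathrm{CAM}}(k)=k$ the number of same-character match operations.

```latex
\[
\frac{C_{\mathrm{conv}}(20,n)}{C_{\mathrm{conv}}(4,n)}
  = \frac{20^{n}}{4^{n}} = 5^{n},
\qquad
\frac{C_{\mathrm{CAM}}(20)}{C_{\mathrm{CAM}}(4)}
  = \frac{20}{4} = 5 .
\]
```

For example, with $n = 10$ the conventional ratio in this model is $5^{10} = 9\,765\,625$, while the ratio for the same-character matching stays at 5.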
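[0102]
(2) BLAST search
Next, another embodiment of the invention is described. In the embodiment above, sequence information was stored in the CAM oriented in the direction crossing the CAM's matching direction. In this embodiment, sequence information is stored oriented in the matching direction. Sequence information is, however, often longer than the width of the CAM in the matching direction. In such cases, this embodiment divides the sequence into multiple segments and stores the sequence information across multiple rows of the CAM. This allows long sequences to be processed with a CAM.
[0103]
In this embodiment, the sequence processing of the invention is applied to a BLAST search. A BLAST search is one kind of homology search. It searches for regions that match well locally without inserting gaps. Such a region is called a high-scoring segment, and the high-scoring segment is then extended forward and backward. In this embodiment, the invention is applied to the part of the BLAST search that looks for high-scoring segments.
[0104]
Fig. 10 shows an example of two sequences to be compared in a homology search. The full length of each sequence is considerable and exceeds the width of the CAM in the matching direction.
[0105]
Fig. 11 shows the sequences stored in the CAM 18. Each sequence is divided into multiple segments, and each segment is stored in one row of the CAM. Since there are four bases, each base is represented with 2 bits. In the example of Fig. 11, one segment contains 60 bases, so one segment is 120 bits long. The sequences can therefore be stored as shown in Fig. 11 by using, for example, a CAM with a width of 144 bits.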
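The 2-bit representation and the width check can be illustrated as follows. The particular bit assignment (A=00, T=01, G=10, C=11) and the function names are assumptions made for this sketch; the text only states that 2 bits per base suffice and that 60 bases occupy 120 bits.

```python
# Assumed 2-bit code for each base; the text does not fix an assignment.
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}


def pack(segment):
    """Pack a base string into an integer, 2 bits per base, first base highest."""
    word = 0
    for base in segment:
        word = (word << 2) | CODE[base]
    return word


def fits(segment_len, cam_width_bits=144):
    """True if a segment of this many bases fits in one CAM row."""
    return 2 * segment_len <= cam_width_bits


print(format(pack("ATGC"), "08b"))  # 00011011
print(fits(60))                     # True: 120 bits in a 144-bit row
```

Under this encoding a 144-bit row could hold at most 72 bases, so the 60-base segments of the example leave some width unused.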
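[0106]
In a BLAST search, a reference sequence consisting of a partial sequence is used when looking for high-scoring segments. The reference sequence is relatively short and consists, for example, of 9 characters as illustrated. The query asks whether the sample sequence contains a partial sequence that matches the reference sequence. In this embodiment, this processing is performed with the CAM.
[0107]
That is, as shown in Fig. 11, the reference sequence is input to the CAM 18 as query data. The CAM 18 compares the query data with the target data in each row by parallel processing. The CAM 18 outputs "1" when the query data and the target data match and "0" when they do not. This match result shows whether each searched sequence contains the reference sequence.
[0108]
The matching that uses the reference sequence as query data is performed with DC bits, based on the characteristics of the CAM 18.
[0109]
Referring to Fig. 12, in this embodiment DC bits (*) are applied to the target data as illustrated. That is, DC bits are applied to everything except a portion whose length corresponds to the reference sequence. The portion with DC bits is excluded from matching, and the portion without DC bits is matched.
[0110]
The position of the DC bits is shifted step by step. In other words, the portion without DC bits (the match-target portion) is shifted one character at a time. In this way, matching is performed multiple times with the match-excluded portion at different positions, so that a match can be detected regardless of which part of the target data matches the reference sequence. The location that matches the reference sequence can also be identified.
[0111]
The results of the multiple matches with shifted DC bits are preferably processed with a logical operation.
[0112]
Refer to the upper part of Fig. 13. In this embodiment, as described above, matching is performed multiple times using each DC bit pattern. Each match outputs 1 or 0 from the CAM 18: 1 when the query data and the target data match, 0 when they do not.
[0113]
A logical operation is applied to the match results of all patterns. The operation is 1*0=1, 0*1=1, 1*1=1, 0*0=0. Two match results are combined, the next match result is combined with that, and this is repeated. If the final result is 1, a complete match was obtained with one of the patterns. Otherwise the final result is 0. A result of 1 therefore shows that the target data contains the reference sequence.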
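The sliding DC bit window and the OR over all patterns can be modeled as below. This is a sketch under stated assumptions: names and sample rows are hypothetical, characters stand in for the 2-bit codes, and the per-row loop stands in for the CAM comparing every row at once.

```python
def window_hits(rows, ref, offset):
    """One CAM search: DC bits everywhere except [offset, offset + len(ref))."""
    end = offset + len(ref)
    return [1 if row[offset:end] == ref else 0 for row in rows]


def rows_containing(rows, ref):
    """OR the results over every DC bit pattern.

    Returns one flag per row: 1 if the row contains the reference sequence.
    """
    width = len(rows[0])
    combined = [0] * len(rows)
    for offset in range(width - len(ref) + 1):
        hits = window_hits(rows, ref, offset)
        combined = [c | h for c, h in zip(combined, hits)]
    return combined


segments = ["ATGCATGCAT", "GGGTACGTTT", "CCATGACCGA"]
print(rows_containing(segments, "TACG"))  # [0, 1, 0]
```

The number of searches is `width - len(ref) + 1`, independent of how many rows are stored, which is the basis of the computation estimate given later in the text. Recording the offset of a hit would also give the match position.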
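[0114]
The lower part of Fig. 13 shows preferred processing for determining whether multiple reference sequences are contained in a sample sequence.
[0115]
Suppose there are three reference sequences, A, B, and C. For each reference sequence, the processing in the upper part of Fig. 13 yields information on whether each row of the CAM 18 has a partial sequence that matches the reference sequence: "1" if there is a matching part, "0" if not. These per-row results are then combined by a logical operation; in Fig. 13, the operation proceeds vertically. As above, the operation is 1*0=1, 0*1=1, 1*1=1, 0*0=0. The result is therefore 1 if any one row contains the reference sequence. If the results for all reference sequences are 1, that is, if 1s line up as illustrated, all reference sequences are contained in the sample sequence. When a result of 0 is obtained, the corresponding reference sequence is not contained.
[0116]
The advantage of this processing is as follows. In the example of Fig. 13 there are relatively few reference sequences. In a BLAST search, however, many more reference sequences may be used. A large number of match results must then be held during the series of operations, and the amount of data held tends to grow. The processing described above deals effectively with this growth in data volume.
[0117]
By making effective use of parallel processing, the invention can speed up the search for reference sequences. The amounts of computation in ordinary processing and in the processing of the invention are compared roughly below.
[0118]
Consider a BLAST search using a database that holds a large number of gene sequences, from tens of thousands to hundreds of thousands. Let Nc be the number of gene sequences in the database, Lc the number of bases in one sequence, and Rl the number of bases in the reference sequence. The amount of computation in conventional processing is then Nc*(Lc-Rl).
[0119]
In the invention, letting Cc be the data length of each segment and Rl the number of bases in the reference sequence, the amount of computation is Cc-Rl. This expression does not contain the overall sequence length Lc, because in the invention the segments obtained by dividing the gene sequences are what is searched. The expression also does not contain the number of sequences Nc, for the following reason. A CAM is normally used as a component of an Internet router and is built to store a large number of IP addresses in a form that can be searched in parallel. A CAM is therefore relatively narrow in the matching direction but very long in the crossing direction. By exploiting this, tens of thousands of gene sequences or more can be stored side by side in the crossing direction and processed in parallel at the same time. The number of gene sequences Nc therefore does not appear in the expression for the invention.
[0120]
As stated above, the amount of computation is roughly Nc*(Lc-Rl) for conventional processing and Cc-Rl for the processing of the invention. The number of gene sequences Nc is usually in the tens of thousands to hundreds of thousands. The number of bases per sequence Lc is about 1000 to 10000. The number of bases in the reference sequence Rl is about 20. The segment data length Cc is taken to be about 100 (60 in the example of Fig. 11). Comparing the two under these conditions, the amount of computation in the invention is roughly, for example, about one ten-thousandth.
[0121]
In this way the invention makes faster sequence searching possible. As is clear from the above, the invention makes effective use of the characteristics of a CAM: the length in the direction crossing the matching direction is used to store a large number of genes as target data at the same time, and the short width in the matching direction is not treated as a disadvantage but is instead turned into reduced computation by processing multiple segments in parallel. This is what makes the large speed-up described above possible.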
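[0122]
Fig. 14 is a flowchart of the BLAST search processing described above. First, the target data input unit 34 of the sequence processing control unit 30 loads the sequence information obtained by the sequence information acquisition unit 32 into the CAM 18 (S110). As described above, the sequence information is divided into multiple segments, and each segment is stored in one row of the CAM. DC bits are set (S112), and the query data input unit 36 inputs the query data (S114). Here the first DC bit pattern is set first, and the query data is the reference sequence. The CAM 18 matches the query data against the target data in each row and outputs the result: "1" if the query data and the target data match completely, "0" otherwise. The match result acquisition unit 38 obtains the match result (S116).
[0123]
Next, the sequence processing control unit 30 determines whether matching has been performed for all DC bit patterns (S118). If NO, the process returns to S112 and the DC bit pattern is changed. In this embodiment, as described above, the position of the DC bits is shifted step by step.
[0124]
If S118 is YES, the process advances to S120, where the sequence processing control unit 30 determines whether matching has been completed for all reference sequences, for example whether reference sequences A, B, and C of Fig. 13 have all been processed. If S120 is NO, the process returns to S112 and matching is performed with the next reference sequence. If S120 is YES, the process advances to S122. In S122, as described with reference to Fig. 13, the match result processing unit 40 performs logical operations on the match results and determines whether each reference sequence is contained in the sample sequence and whether all reference sequences are contained in it. This processing is performed for each of multiple sample sequences.
[0125]
Preferably, the sequence information processing apparatus 10 of this embodiment is configured to use the results of the reference sequence queries for the subsequent processing, that is, the rest of the BLAST search. This remaining processing may be performed by another apparatus.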
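[0126]
As described above, this embodiment divides one sequence into multiple segments. At a division point, therefore, the reference sequence (meaning a partial sequence that matches the reference sequence) may straddle more than one segment. Such a reference sequence is preferably detected by the following processing.
[0127]
Refer to Fig. 15. In this embodiment, matching is performed using the end portions of the reference sequence as query data. It is determined whether the rear part of the reference sequence matches the front part of a segment serving as target data. As illustrated, matching is performed using the last 1 character of the reference sequence, the last 2 characters, and so on up to the last i-1 characters, where i is the number of characters in the reference sequence. DC bits are set on the portions not being matched, as in the processing described above. In practice, it suffices to add DC bit patterns: in Fig. 12, DC bit patterns are also set for the cases where the reference sequence protrudes beyond the front of the target data. The matching process described above can then be applied as is.
[0128]
Similarly, it is determined whether the front part of the reference sequence matches the rear part of a segment serving as target data.
[0129]
Suppose the above processing finds the rear part of the reference sequence at the front of row n+1 and the front part of the reference sequence at the rear of row n. It is then determined whether joining the two parts yields the reference sequence. Here it may be determined whether the combined number of characters of the two parts equals the number of characters in the reference sequence. When the reference sequence is obtained, it is determined that the sample sequence contains a partial sequence identical to the reference sequence.
[0130]
In an actual program, this determination is preferably made as follows, again using a logical operation. When matching with part of the reference sequence finds that the target data and the query data match, the CAM 18 outputs "1"; otherwise it outputs "0". The following two match results are combined by a logical operation.
[0131]
(1) The result of matching the rear part of row n against the first k characters of the reference sequence
(2) The result of matching the front part of row n+1 against the last i-k characters of the reference sequence
The logical operation is 1*1=1, 1*0=0, 0*1=0, 0*0=0, that is, a logical AND. If the result is 1, the sample sequence contains a partial sequence identical to the reference sequence. If the result is 0, it does not. This processing is performed for 1≦k≦i-1. In this way a reference sequence that straddles two segments is detected.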
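The AND of the two partial matches can be sketched as follows. The function name and sample segments are illustrative assumptions; `str.endswith` and `str.startswith` stand in for the CAM searches that use DC bits to mask everything but the tail of row n or the head of row n+1.

```python
def straddle_split(left, right, ref):
    """Return k if ref[:k] ends `left` and ref[k:] starts `right`, else None.

    left  : segment stored in row n
    right : segment stored in row n+1
    k runs over 1 <= k <= i-1, where i = len(ref).
    """
    i = len(ref)
    for k in range(1, i):
        tail_hit = 1 if left.endswith(ref[:k]) else 0     # result (1)
        head_hit = 1 if right.startswith(ref[k:]) else 0  # result (2)
        if tail_hit & head_hit:                           # logical AND
            return k
    return None


row_n = "ATGCATGCTA"
row_n1 = "CGTTTTAAAA"
print(straddle_split(row_n, row_n1, "TACG"))  # 2: "TA" + "CG"
```

Because the two partial queries for a given k have lengths k and i-k, their lengths automatically add up to the reference length, which is the consistency check the text mentions.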
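[0132]
Fig. 16 shows another way of detecting a reference sequence at a division point. In this processing, adjacent segments are made to overlap partially. The number of overlapping characters is i-1, where i is the number of characters in the reference sequence. If the matching described above is performed in this state, reference sequences at division points are detected without omission.
[0133]
In the processing of Fig. 16, the number of overlapping characters needs to change with the length of the reference sequence. This can be handled with DC bits: to avoid excessive overlap, DC bits are set on the excess part. For example, suppose a 20-character reference sequence A and a 15-character reference sequence B are used. The number of overlapping characters is set to a suitable value, for example 30. When reference sequence A is used, DC bits are set on 11 characters at the rear of the target data. When reference sequence B is used, DC bits are set on 16 characters at the rear of the target data. In this way processing suited to the length of the reference sequence is achieved.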
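The overlap scheme and the DC bit arithmetic above can be checked with a short sketch. The function names and the sample sequence are assumptions for illustration; the split routine overlaps neighbors by exactly i-1 characters, and `excess_dc` reproduces the 11 and 16 character figures from the example.

```python
def split_with_overlap(seq, width, ref_len):
    """Split seq into rows of `width` characters overlapping by ref_len - 1.

    Every window of ref_len characters then lies wholly inside some row.
    Assumes width >= ref_len. The last row may be shorter than width.
    """
    step = width - (ref_len - 1)
    return [seq[s:s + width] for s in range(0, len(seq) - ref_len + 1, step)]


def excess_dc(overlap, ref_len):
    """Characters to mask with DC bits when a fixed overlap is larger
    than the ref_len - 1 characters actually needed."""
    return overlap - (ref_len - 1)


seq = "ATGCATGCTACGTTTTAAAA"
rows = split_with_overlap(seq, width=10, ref_len=4)
print(rows)               # ['ATGCATGCTA', 'CTACGTTTTA', 'TTAAAA']
print(excess_dc(30, 20))  # 11
print(excess_dc(30, 15))  # 16
```

In this example "TACG" straddles the boundary of a plain 10-character split but falls entirely inside the second overlapping row.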
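[0134]
However, since it requires no such adjustment, the processing of Fig. 15 is considered more advantageous than that of Fig. 16.
[0135]
The sequence processing of this embodiment has been described above. In this embodiment the invention was applied to a BLAST search. The invention may be applied to other sequence analyses, for example consensus sequence search, gene mapping, and SNPs sequence detection. The processing of the above embodiment is of course modified to suit each analysis. For example, in the case of SNPs, DC bits need not be set.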
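[0136]
(3) FASTA search
Next, another embodiment of the invention is described. In this embodiment too, as in the embodiment above, sequence information is stored oriented in the CAM's matching direction. This embodiment uses the characteristics of the CAM to find, by parallel processing, the portions where multiple sequences match contiguously. Detection of such contiguous matching portions is suited to a FASTA search.
[0137]
First, a conventional FASTA search is outlined.
[0138]
Figs. 17 and 18 show dot matrix images. A dot matrix image is used in a conventional FASTA search to find the contiguous matching portions of multiple sequences. Fig. 17 is a conceptual diagram, and Fig. 18 is an example of an actual dot matrix image.
[0139]
In a dot matrix image, two sequences are placed at right angles. A dot is plotted wherever the characters (elements) of the two sequences match. Where dots continue in the 45-degree direction, the characters of the sequences match contiguously. This feature is used to find the longest contiguously matching portion. Alignment by dynamic programming is then performed around the matching portion.
[0140]
This embodiment uses a CAM to obtain the same kind of information as is obtained with the dot matrix image described above.
[0141]
Fig. 19 shows the processing of this embodiment. To simplify the description, the sequences are not divided here. In practice, however, as described later, it is preferable to divide the sequences into multiple segments in view of the narrow width of the CAM.
[0142]
In the example of Fig. 19, two sequences are compared: sequence 1 and sequence 2. Sequence 1 is stored in the CAM 18 as target data. Sequence 2 is input to the CAM 18 as query data.
[0143]
As illustrated, sequence 1 is stored in multiple rows of the CAM 18; that is, the same sequence is stored in multiple rows. From row to row, however, sequence 1 is shifted in the matching direction, one character at a time.
[0144]
With sequence 1 stored in this way, sequence 2 is input as query data. The CAM 18 compares the query data with the target data in each row. "1" is output when they match and "0" when they do not.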
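[0145]
The processing above detects the case where the entire sequence matches. Contiguous matching portions of various lengths are detected as follows.
[0146]
Fig. 20 shows the processing for detecting contiguous matching portions of various lengths. As illustrated, DC bits (*) are used to create match-excluded portions.
[0147]
In the top row, no DC bit is set. In the second row, one DC bit is set at the rear end of the target data. In the third row, one DC bit is set at the front end of the target data. Matching with the second and third patterns detects whether there is a contiguous matching portion one character shorter than the sequence length.
[0148]
Similarly, to detect a contiguous matching portion n characters shorter than the sequence length, n DC bits are set. As shown in Fig. 20, the n DC bits are distributed between the two ends of the sequence, and every such distribution is used as a DC bit setting pattern.
[0149]
In this way, by partially excluding the sequence from matching, this embodiment detects contiguous matching portions of various lengths. The portion where the sequences match contiguously for the longest stretch can also be found.
[0150]
To find the longest matching portion, it is not necessary to run the detection for contiguous matching portions of every length. The DC bits are changed step by step, and the match length being tested is shortened step by step until the longest matching portion is found. Here the DC bit patterns of Fig. 20 are used in order from top to bottom. When the query data and the target data match, the longest matching stretch has been found and processing ends. Processing of this kind is also preferable.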
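The shifted storage of sequence 1 combined with the top-to-bottom DC bit patterns can be modeled as below. This is a sketch under several assumptions: both sequences have the same length (one undivided CAM row), empty cells are filled with a pad symbol that never matches, the names are hypothetical, and the inner loop over rows stands in for a single parallel CAM search.

```python
PAD = "."  # assumed filler for cells left empty by the shift


def shifted_rows(seq):
    """Store the same sequence once per shift d; row d holds seq moved by d."""
    w = len(seq)
    return {
        d: "".join(seq[j - d] if 0 <= j - d < w else PAD for j in range(w))
        for d in range(-(w - 1), w)
    }


def longest_run(seq1, seq2):
    """Try DC bit patterns from fewest masked characters to most.

    Returns (length, shift, start) of the first hit, which is the
    longest contiguous match, or (0, None, None) if nothing matches.
    """
    w = len(seq2)
    rows = shifted_rows(seq1)
    for n in range(w):                  # n masked characters in total
        for front in range(n + 1):      # split of the mask between the ends
            lo, hi = front, w - (n - front)
            for d, row in rows.items():  # one parallel search in hardware
                if row[lo:hi] == seq2[lo:hi]:
                    return (hi - lo, d, lo)
    return (0, None, None)


print(longest_run("GGATGCAA", "ATGCATTT"))  # (5, -2, 0): "ATGCA"
```

Each shift d corresponds to one diagonal of the dot matrix, so the first hit while shortening the tested length plays the role of the longest diagonal run.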
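[0151]
Fig. 21 shows the processing when the sequences are divided. Sequence 1 is divided into segments shorter than the width of the CAM 18 and stored across multiple rows of the CAM 18. The same sequence is stored in multiple regions of the CAM 18, shifted one character at a time. The maximum shift is set to (segment length - 1), because shifting any further would duplicate target data that is already stored.
[0152]
Sequence 2 is divided in the same way as sequence 1. Each segment is then input in turn to the CAM 18 as query data, so the CAM 18 performs matching with each segment of sequence 2. The processing for one segment may be that described with reference to Figs. 19 and 20.
[0153]
As the X marks in Fig. 21 indicate, shifting the sequence leaves cells with no character data in the CAM rows. These cells are suitably excluded from processing. Entire rows containing X marks may be deleted. Even then, only the corner regions of Fig. 18 are excluded from the search, which is not considered a problem.
[0154]
The divided processing may also be carried out segment by segment. That is, the first segments of sequences 1 and 2 are selected first. The segment of sequence 1 is placed in the CAM 18 as in Fig. 19, and the processing described for Fig. 19 is performed using the segment of sequence 2. Next, the second segments of sequences 1 and 2 are selected and the same processing is performed. This also gives the same result.
[0155]
A contiguous matching portion may straddle multiple segments. This is handled as follows.
[0156]
Referring to Fig. 22, when contiguous matching portions exist at the rear of row n and at the front of row n+1, they are joined. It is then determined whether the joined portion is the longest contiguous matching portion of sequence 1. To do this more accurately, it is preferable to treat a segment as a candidate for joining even when only one character at its end matches. Although not illustrated, a contiguous matching portion may also straddle three or more segments. In that case all of those segments are joined: the contiguous matching portions on the two sides (shorter than the segment length; there may be one or none) are joined with the contiguous matching portions between them (equal in length to the segment; one or more).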
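[0157]
Fig. 23 is a flowchart of the processing of this embodiment described above. First, the target data input unit 34 of the sequence processing control unit 30 loads the sequence information obtained by the sequence information acquisition unit 32 into the CAM 18 (S130). As shown in Fig. 21, the sequence information is divided into multiple segments when it is loaded, and the same sequence is loaded with successive small shifts. Next, the query data input unit 36 inputs the query data (S132). The query data is a segment of sequence 2. The CAM 18 then matches the query data against the target data. Matching is first performed without setting DC bits. "1" is output if the query data and the target data match completely, "0" otherwise. The match result acquisition unit 38 obtains the match result (S134).
[0158]
Next, the sequence processing control unit 30 determines whether matching has been completed for all lengths (S136). If NO, the length is changed (S138) and the process returns to S132. In S136, it is determined whether all the DC bit patterns of Fig. 20 have been processed. When they have not, the next pattern is selected in S138. If S136 is YES, the process advances to S140.
[0159]
As already stated, in this embodiment the contiguous match determination need not be performed for all lengths. In that case, the length being tested is shortened one character at a time; that is, the DC bit patterns of Fig. 20 are used in order from the top. The process advances to S140 as soon as target data matching the query data is obtained.
[0160]
In S140, the sequence processing control unit 30 determines whether all segments of sequence 2 have been processed. If S140 is NO, the process returns to S132 and the next segment is processed. If S140 is YES, the process advances to S142, where the match result processing unit 40 identifies the longest matching portion (the portion where the sequences match for the longest stretch) from the match results so far. Preferably, the sequence information processing apparatus 10 of this embodiment is configured to use the identified longest matching portion for the subsequent processing, that is, the rest of the FASTA search. This remaining processing may be performed by another apparatus.
[0161]
As described above, according to this embodiment, contiguous matching portions of sequences, including the longest matching portion, can be detected with a CAM in the same way as with a dot matrix. The search is fast because it uses the CAM's parallel processing function.
[0162]
In this embodiment two sequences were compared. Three or more sequences may, however, be compared within the scope of the invention. In that case, the multiple sequences are preferably lined up in the direction crossing the CAM's matching direction. For each sequence, as shown in Fig. 21, the same sequence is stored at multiple locations with successive small shifts. The sequence used as query data (sequence 2 in Fig. 21) is then input. This allows the query sequence to be compared with multiple sequences at the same time.
[0163]
In this embodiment, the information processing of the invention was applied to FASTA analysis. The invention may also be applied to other sequence analyses, and it can be applied to advantage whenever contiguous matching portions of sequences are to be found.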
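[0164]
Various preferred embodiments of the invention have been described above. These embodiments can of course be modified within the scope of the invention. For example, the embodiments processed base sequences, but, as already stated, other sequences such as amino acid sequences may be processed within the scope of the invention. The sequence information processing apparatus of the invention may also form a system accessed via a network.
[0165]
[Effects of the Invention]
(1) As described above, the invention stores sequence information, for use as target data, in a storage processing device having a parallel matching function, causes the storage processing device to match query data against the target data by parallel processing, and obtains information indicating matches between the query data and the target data, thereby obtaining sequence analysis information. Using the parallel matching function makes it possible to compare large amounts of data at high speed in sequence information processing, and thus to speed up sequence analysis.
[0166]
Preferably, the storage processing device having the parallel matching function is a CAM. CAMs have conventionally been used as components of Internet routers. The invention focuses on the fact that the parallel matching function of a CAM is suited to processing sequence information, and has the CAM perform the comparison of large amounts of data. As a result, the part of sequence analysis that accounts for much of the workload is greatly accelerated by the CAM, and sequence analysis can be sped up.
[0167]
In addition, CAMs are widely used as components for Internet routers and are relatively inexpensive and easy to obtain. CAMs are also advantageous in that they connect easily to computers such as ordinary personal computers. Thus, by noting that the characteristics of CAMs, which are widespread as router components, also suit sequence information processing, and by building a sequence information processing apparatus with a CAM, the invention offers not only high speed but also the advantage that such an apparatus can be provided easily and at low cost.
[0168]
The storage processing device with a parallel matching function of the invention is not limited to a CAM. An ordinary CAM is configured to compare one piece of query data against all stored target data simultaneously, and the embodiments above mainly used this mode. However, a CAM or other storage processing device may be configured to use multiple pieces of query data simultaneously, with different query data matched against different target data. This configuration contributes to further speed-up by allowing multiple pieces of query data to be processed at once. It is advantageous, for example, when multiple segments are used as query data in the BLAST search of the embodiment above.
[0169]
The storage processing device with a parallel matching function of the invention (including a CAM) may be part of a processor. Using such a processor and having its storage processing section perform the processing of the invention also falls within the scope of the invention. A processor of this kind may be provided with some or all of the processing functions of the invention for using the storage processing device, as described with reference to Fig. 3. In that case the processor constitutes (at least part of) the sequence information processing apparatus of the invention.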
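[0170]
(2) In one aspect that is particularly central to the invention, multiple pieces of sequence information are stored, for use as target data, in a storage processing device having a parallel matching function so that each sequence runs in the direction crossing the matching direction and the sequences are lined up along the matching direction. The invention then uses, as target data, the data of the multiple sequences that lie side by side along the matching direction, uses as query data the data corresponding to a same-code string, in which one and the same code (such as a character representing a sequence element) is repeated, and causes the storage processing device to match the query data against the target data by parallel processing.
[0171]
In this way, the invention uses the storage processing device in a distinctive manner, storing sequence information oriented in the direction crossing the matching direction. Each piece of target data therefore consists of data from multiple sequences lined up in the matching direction. Data corresponding to a same-code string is used as the query data. Parallel matching of this target data against the query data determines at high speed whether the multiple sequences match.
[0172]
As to the amount of computation, because the invention uses the parallel processing function appropriately and uses query data corresponding to same-character strings, the amount of computation is greatly reduced compared with conventional processing. In a simplified example assuming bases, when sequences of n characters are processed, the amount of computation in conventional processing is "4 to the power n", where 4 is the number of kinds of bases. In the invention, matching is performed with each of four same-character strings. The amount of computation therefore corresponds to four match operations and is far smaller than in conventional processing. The larger the number of characters n, the larger the difference.
[0173]
Furthermore, the advantage of the invention is pronounced when there are many kinds of sequence elements. In the above example, assuming natural amino acids, the amount of computation in conventional processing is "20 to the power n", where 20 is the number of kinds of amino acids. Compared with the base example ("4 to the power n"), the amount of computation is "5 to the power n" times as large. In the invention, when amino acids are assumed, the number of pieces of query data corresponding to same-character strings is 20. Compared with the base example, the amount of computation is only 5 times (20 ÷ 4) as large. With respect to growth in computation with the number of kinds of sequence elements, the increase is clearly smaller for the invention than for conventional processing. In this respect too, the invention can speed up conventional sequence processing to advantage.
[0174]
As described using the example of a CAM, the invention is particularly advantageous when the width of the storage processing device in the matching direction is narrower than the sequence length. Sequences actually processed are often long, so this situation arises frequently. Because the invention stores sequence information in the direction crossing the matching direction, even very long sequences can be accommodated in the storage processing device. By using query data corresponding to same-code strings, the agreement of the sequences stored in the crossing direction can be determined, and this is done at high speed by parallel matching. The invention thus uses a storage processing device with a parallel matching function to speed up sequence analysis effectively.
[0175]
Preferably, for each of the multiple kinds of codes that make up the sequence information, the invention performs matching using data corresponding to a same-code string as query data, and processes the results of the multiple matches to obtain information on agreement among the multiple pieces of sequence information. For base sequences, for example, each of the codes A, G, T, and C is matched, as described in the embodiment above. More preferably, processing with a logical operation is performed, as described in the embodiment above. According to the invention, matching is performed with multiple kinds of same-code strings, and it is determined whether the target data matched the query data for any one of them. Whether the sequences agree can therefore be determined by the same processing without regard to which code appears at each position in the sequence, which simplifies the processing.
[0176]
Preferably, the invention performs matching with part of the multiple pieces of sequence information excluded from matching. This makes it possible to identify sequence information that does not agree with the other sequence information.
[0177]
Preferably, when the query data and the target data do not match, the invention determines that there is a sequence that differs from the others due to polymorphism. This enables polymorphism analysis such as SNPs analysis. More preferably, the invention performs matching with part of the multiple pieces of sequence information excluded from matching. This makes it possible, in polymorphism analysis such as SNPs analysis, to identify the sequence that differs from the others.
[0178]
Preferably, when a portion where the query data and target data match contiguously is adjacent to a portion where they contiguously fail to match, the invention determines that there is a deletion or insertion at the boundary between those portions. In this way the invention can detect deletions or insertions. More preferably, the invention performs matching with part of the multiple pieces of sequence information excluded from matching. This makes it possible to identify the sequence information that has the deletion or insertion.
[0179]
Also preferably, the invention stores the sequence information that has the deletion or insertion shifted in the direction crossing the matching direction, and performs matching. This makes it possible to determine whether a deletion or an insertion is present, because, as described with the embodiment above, the match results after shifting differ characteristically between a deletion and an insertion.
[0180]
Within the scope of the invention, the invention may be applied to detect only one of deletion and insertion. That is, only one of deletion and insertion may be detected by the sequence information processing.
[0181]
Preferably, when the query data and target data match contiguously, then there is a portion where they do not match, and then they match contiguously again, the invention determines that there is a substitution in the non-matching portion. In this way the invention can detect substitutions. Preferably, only sequences of the same length are compared, which gives accurate results. More preferably, the invention performs matching with part of the multiple pieces of sequence information excluded from matching. This makes it possible to identify the sequence information that has the substitution.
[0182]
Preferably, in this aspect, that is, the aspect in which sequences are stored in the crossing direction, the storage processing device having the parallel matching function is a CAM. As already stated, a CAM has characteristics suited to sequence information processing in that it has a parallel matching function, and it can speed up sequence analysis. Although CAMs have not previously been used for sequence information processing, they are widespread as Internet router components and are inexpensive. Using a CAM therefore enables fast sequence analysis at low cost. Furthermore, although an ordinary CAM is relatively narrow in the matching direction, the invention makes matching of long sequences possible by making the storage direction of the sequences cross the matching direction and by using data corresponding to same-code strings as query data. Moreover, the CAM's parallel matching function is put to use, enabling fast analysis.
[0183]
Preferably, this aspect, that is, the aspect in which sequences are stored in the crossing direction, obtains, through the above processing, information used for SNPs analysis. SNPs analysis requires rapid processing of many sequences. In particular, genome-based drug discovery and tailor-made medicine are expected to come into practical use, making SNPs analysis of many samples necessary. It is desirable to be able to perform SNPs analysis at high speed without a large computer. The invention can meet such needs.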
【０１８４】
（３）本発明の一態様は、並列照合機能をもつ記憶処理装置に、生物学的な配列情報を、被照合データとして用いるために、照合方向を向けて記憶させる。さらに本発明は、照合対象の配列情報を照合データとして用いて、照合データと被照合データを並列処理にて記憶処理装置に照合させる。この態様では、上述の態様と異なり、配列情報が照合方向を向けて記憶される。したがって、上述の態様に関して説明したような、記憶の方向を異ならせることによる利点は得られない。しかし、本態様でも、並列照合機能を利用した並列処理による高速化という利点が得られる。以下は、本発明のさらに詳細な態様である。
【０１８５】
（４）本発明の一態様は、並列照合機能をもつ記憶処理装置に、塩基配列、アミノ酸配列等の生物学的な複数の配列情報を、被照合データとして用いるために、照合方向を向けて記憶させる。そして本発明は、参照配列を照合データとして用いて、照合データと被照合データを並列処理にて記憶処理装置に照合させる。典型的には、部分配列からなる参照配列を用いて、ブラスト検索で行われるような、局所的一致箇所が求められる。本発明によれば、並列照合機能を利用して、複数の配列の各々が参照配列を含むか否かが、高速に求められる。
【０１８６】
好ましくは、本発明は、参照配列に相当する長さをもつ照合対象部分と残りの照合除外部分とを設定して照合処理を行い、照合除外部分の位置を異ならせた複数回の照合処理を行う。上述の実施形態では、ＤＣｂｉｔを用いて、照合除外部分が設定された。本発明によれば、照合除外部分を異ならせて照合処理を行うことで、参照配列が、被照合データたる配列のどの部分と一致する場合でも、その一致を適切に検出できる。また、一致する部分の特定も可能となる。
【０１８７】
好ましくは、本発明は、一連の配列を複数の分割配列情報に分けて、複数の分割配列情報を、照合方向と交差する方向に並ぶように、並列照合機能をもつ記憶処理装置に記憶させて、各分割配列情報の一部が参照配列と一致するか否かを並列処理により求める。
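[0188]
As explained using the CAM example, the present invention is particularly advantageous when the storage processing device is narrow in the collation direction and long in the crossing direction. According to the present invention, even when the width in the collation direction is narrow, dividing the sequence allows a long sequence to be stored by exploiting the length in the crossing direction. The length in the crossing direction can also be used to store a large number of sequences at once and process them in parallel.
[0189]
Furthermore, the sequence division of this aspect is advantageous for speeding up computation. Division shortens the sequence length in the collation direction, which reduces the amount of computation. When the plural kinds of collation-excluded portions described above are set, that is, when plural DCbit patterns are used as in the embodiment above, a shorter sequence length in the collation direction means less computation. Therefore, when the storage processing device is narrow in the collation direction and long in the crossing direction, the present invention does not treat this as an obstacle; rather, it reduces the amount of computation through sequence division and parallel processing, enabling still faster sequence analysis.
[0190]
Within the scope of the present invention, consecutive divided sequences need not be placed next to each other on the storage processing device. They may be located apart from each other.
[0191]
Preferably, the present invention processes the collation results of the plural pieces of divided sequence information to determine whether the sequence information contains the reference sequence. Typically, a logical operation such as that described in the embodiment above is performed here. This allows the presence or absence of the reference sequence to be determined by simple processing.
[0192]
Preferably, the present invention performs collation using an end portion of the reference sequence as collation data, thereby detecting a reference sequence that straddles adjacent pieces of divided sequence information. As a result, even when a partial sequence matching the reference sequence straddles plural divided sequences, that is, plural columns on the storage processing device, such a partial sequence can be detected. Its position can also be determined.
[0193]
Preferably, the present invention makes adjacent pieces of divided sequence information partially overlap. This processing, too, allows a reference sequence located at a division point to be detected without omission.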
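A minimal Python sketch of the overlap idea follows. It assumes that an overlap of one character less than the reference length is used, which is enough for every occurrence of the reference to fall entirely inside some chunk; the function names and the use of Python's substring test in place of the CAM's masked collation are illustrative choices, not details from the patent.

```python
def split_with_overlap(sequence, width, overlap):
    """Split `sequence` into chunks of at most `width` characters.

    Each chunk shares its first `overlap` characters with the end of the
    previous chunk (overlap=0 gives a plain, non-overlapping split).
    """
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    step = width - overlap
    last_start = max(len(sequence) - overlap, 1)
    return [sequence[i:i + width] for i in range(0, last_start, step)]


def contains_reference(sequence, reference, width):
    """True if `reference` occurs in `sequence`, checked chunk by chunk."""
    chunks = split_with_overlap(sequence, width, overlap=len(reference) - 1)
    # `in` stands in for the per-chunk collation a CAM would do in parallel.
    return any(reference in chunk for chunk in chunks)
```

With `"AAAACGTTTT"` split into width-5 chunks and no overlap, the reference `"CGT"` is cut in two at the division point and no single chunk contains it; with an overlap of two characters the middle chunk `"ACGTT"` contains it.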
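[0194]
Preferably, in this aspect, that is, the aspect in which sequences are stored along the collation direction, the storage processing device having a parallel collation function is a CAM. As already described, the CAM has a parallel collation function, which makes it well suited to processing sequence information and allows sequence analysis to be accelerated. Moreover, although CAMs have not previously been used for sequence information processing, they are widespread as Internet router components and are inexpensive. Using a CAM therefore enables fast sequence analysis at low cost. Furthermore, although an ordinary CAM is relatively narrow in the collation direction, the present invention allows long sequences to be stored in the CAM by storing them in divided form. The length of the CAM can also be used to store a large number of sequences and process them simultaneously. In addition, shortening the sequence length in the collation direction through division substantially reduces the amount of computation, allowing still greater speed. In this way, the present invention can suitably speed up sequence analysis by exploiting the characteristics of the CAM.
[0195]
Preferably, in this aspect, that is, the aspect in which sequences are stored along the collation direction, information used for homology analysis such as the BLAST method is obtained through the processing described above. For example, when a BLAST search is performed over a large number of sequences in a database, the speed-up of the present invention is considered particularly useful.
[0196]
(5) In one aspect of the present invention, the same sequence information is stored, for use as collated data, in a storage processing device having a parallel collation function, shifted little by little and oriented along the collation direction. The sequence information is shifted by a predetermined number of characters at a time, usually one character. The present invention then uses another piece of sequence information to be compared as collation data, uses the copies of the same sequence information stored with successive shifts as collated data, and causes the storage processing device to collate the collation data against the collated data by parallel processing. According to the present invention, portions where plural pieces of sequence information match consecutively are found at high speed using parallel processing. The longest matching portion can be found, and the positions of consecutive matching portions can also be determined. By using a storage processing device having a parallel collation function in the characteristic way of storing a sequence with successive small shifts, information on consecutive matching portions can be obtained that is similar to what is obtained with a dot matrix in, for example, a FASTA search.
[0197]
Preferably, the present invention finds partial matches between sequences by performing partial collation of the sequence information. Preferably, the present invention sets a collation-excluded portion in order to perform the partial collation. In the embodiment described above, the collation-excluded portion was suitably set by setting DCbits, based on the characteristics of the CAM. More preferably, the present invention searches for matching portions of plural lengths by performing the partial collation with plural kinds of partial collation patterns. Plural kinds of partial collation patterns are illustrated in FIG. 20. According to the present invention, information on consecutive matching portions of various lengths is obtained, and the longest matching portion is also properly detected.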
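[0198]
Preferably, the present invention stores the same sequence information in different regions of the storage processing device having a parallel processing function, shifted little by little. The shifted copies of the same sequence are thereby processed in parallel, and search results are obtained at high speed.
[0199]
Preferably, the present invention divides a continuous sequence into plural pieces of divided sequence information and stores them in the storage processing device having a parallel collation function so that they are lined up in the direction crossing the collation direction. Even when the storage processing device is narrow in the collation direction, a long sequence can be stored in it and analyzed by parallel processing.
[0200]
Within the scope of the present invention, consecutive divided sequences need not be placed next to each other on the storage processing device. They may be located apart from each other.
[0201]
Preferably, the present invention finds a portion where sequences match across adjacent pieces of divided sequence information as a consecutively matching portion. According to the present invention, even when a consecutive matching portion straddles plural divided sequences, that is, plural columns of the storage processing device, such a consecutive matching portion can be detected.
[0202]
Preferably, in this aspect, the storage processing device having a parallel collation function is a CAM. As already described, the CAM has a parallel collation function, which makes it well suited to processing sequence information and allows sequence analysis to be accelerated. Moreover, although CAMs have not previously been used for sequence information processing, they are widespread as Internet router components and are inexpensive. Using a CAM therefore enables fast sequence analysis at low cost. Furthermore, although an ordinary CAM is relatively narrow in the collation direction, the present invention allows long sequences to be stored in the CAM by storing them in divided form. The length of the CAM can also be used to store a large number of sequences and process them simultaneously. In this way, the present invention can suitably speed up sequence analysis by exploiting the characteristics of the CAM.
[0203]
Preferably, in this aspect, information used for homology analysis such as the FASTA method is obtained through the processing described above. For example, when a FASTA search is performed over a large number of sequences in a database, the speed-up of the present invention is considered particularly useful.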
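[Brief description of the drawings]
[FIG. 1] A diagram showing the hardware configuration of the biological sequence information processing apparatus in a preferred embodiment of the present invention.
[FIG. 2] A diagram showing the ordinary function of a CAM when used in an Internet router.
[FIG. 3] A functional block diagram of the biological sequence information processing apparatus of the embodiment.
[FIG. 4] A diagram showing SNP analysis processing by the apparatus of FIG. 3.
[FIG. 5] A diagram showing DCbit setting patterns in SNP analysis.
[FIG. 6] A flowchart corresponding to the processing of FIG. 4.
[FIG. 7] A diagram showing deletion or insertion detection processing by the apparatus of FIG. 3.
[FIG. 8] A flowchart corresponding to the processing of FIG. 7.
[FIG. 9] A flowchart of substitution detection processing by the apparatus of FIG. 3.
[FIG. 10] A diagram showing an example of sequences subject to a BLAST search by the apparatus of FIG. 3.
[FIG. 11] A diagram showing BLAST search processing by the apparatus of FIG. 3, with the sequences of FIG. 10 stored in the CAM.
[FIG. 12] A diagram showing the DCbits set in the processing of FIG. 11.
[FIG. 13] A diagram showing the logical operation that processes the collation results of FIG. 11 to determine whether the reference sequence is present.
[FIG. 14] A flowchart corresponding to the processing of FIG. 11.
[FIG. 15] A diagram showing processing that finds a reference sequence straddling plural divided sequences by using an end portion of the reference sequence as collation data.
[FIG. 16] A diagram showing a form in which adjacent divided sequences overlap so that a reference sequence at a division point can be detected.
[FIG. 17] A diagram conceptually showing the dot matrix used in a FASTA search.
[FIG. 18] A diagram showing an example of a dot matrix actually used in a FASTA search.
[FIG. 19] A diagram showing FASTA search processing by the apparatus of FIG. 3.
[FIG. 20] A diagram showing processing that detects consecutive matching portions of various lengths in the processing of FIG. 19, illustrating various DCbit setting patterns.
[FIG. 21] A diagram showing the processing of FIG. 19 when the sequence is divided into plural parts.
[FIG. 22] A diagram showing, for the processing of FIG. 21, detection of consecutive matching portions that straddle plural divided sequences.
[FIG. 23] A flowchart corresponding to the processing of FIG. 21.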
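[Explanation of reference numerals]
10 Sequence information processing apparatus
12 CPU
18 CAM
20 Hard disk
30 Sequence processing control unit
32 Sequence information acquisition unit
34 Collated data input unit
36 Collation data input unit
38 Collation result acquisition unit
40 Collation result processing unit
42 Analysis information output unit
[0001]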
BACKGROUND OF THE INVENTION
The present invention relates to a method and apparatus for processing biological sequence information such as a base sequence and an amino acid sequence for analysis, and in particular, to speeding up the processing.
[0002]
[Prior art]
In the field of molecular biology, the usefulness of information processing technology for analysis of DNA, genes, proteins, etc. is increasing. In this field, information processing technology is used to analyze sequence information. This type of technology is called bioinformatics.
[0003]
For example, in SNPs (snips, single nucleotide polymorphism) analysis, a large number of substantially identical base sequences are analyzed to obtain base sequences having locally different portions.
[0004]
As another example, a homology search determines whether, and in what way, plural pieces of sequence information resemble one another. Known homology search methods include, for example, the BLAST method and the FASTA method.
[0005]
The BLAST method searches, without inserting gaps, for sites that match well locally. Such a site is called a high-scoring fragment. The high-scoring fragment is then extended in both directions.
[0006]
In the FASTA method, portions where sequences match over a long stretch are found. For this processing, dot matrix information, obtained by plotting the matching elements of plural pieces of sequence information, has conventionally been used. Alignment by dynamic programming is then performed around the matching portions.
[0007]
[Problems to be solved by the invention]
Sequence analysis requires large amounts of information to be processed at high speed, because very long sequences are processed and because many sequences are processed. Conventionally, however, this large-scale processing has relied exclusively on the high processing power of large computers, and techniques for processing sequence information at high speed have not been sufficiently established. As sequence analysis research advances and molecular biology is put to practical use in fields such as drug discovery and medicine, the importance of speeding up sequence information processing is expected to grow. There is also a need to process large amounts of sequence information at high speed on a relatively small computer, such as a personal computer, rather than on a large computer.
[0008]
The present invention has been made in view of the above problems, and its object is to provide a method and apparatus for speeding up the processing of sequence information.
[0009]
One object of the present invention is to provide a method and apparatus capable of comparing plural pieces of sequence information at high speed, as in SNP analysis.
[0010]
One object of the present invention is to provide a method and apparatus capable of searching sequence information for a specific sequence at high speed, as in BLAST analysis.
[0011]
One object of the present invention is to provide a method and apparatus capable of searching for consecutively matching portions of plural pieces of sequence information at high speed, as in FASTA analysis.
[0012]
[Means for Solving the Problems]
(1) To achieve the above object, the sequence information processing method of the present invention stores sequence information in a storage processing device having a parallel collation function, for use as collated data, causes the storage processing device to collate collation data against the collated data by parallel processing, and obtains information indicating matches between the collation data and the collated data, thereby obtaining sequence analysis information. Using the parallel collation function allows the large number of data comparisons involved in processing sequence information to be performed at high speed, which speeds up sequence analysis.
[0013]
Preferably, the storage processing device having a parallel collation function is a CAM. CAMs have conventionally been used as components of Internet routers. The present invention focuses on the fact that the parallel collation function of a CAM is well suited to processing sequence information, and has the CAM perform the comparison of large amounts of data. As a result, the part of sequence analysis processing that accounts for a large share of the work is greatly accelerated by the CAM, which speeds up sequence analysis.
[0014]
Further, CAMs are widely used as Internet router components and can easily be obtained at relatively low cost. A CAM is also advantageous in that it can easily be connected to a computer such as an ordinary personal computer. The present invention thus focuses on the fact that the characteristics of the CAM, which is widespread as a router component, are also suited to processing sequence information. By building the sequence information processing apparatus around a CAM, the present invention offers, in addition to the advantage of high speed, the advantage that the apparatus can be provided easily and at low cost.
[0015]
(2) According to one particularly central aspect of the present invention, plural pieces of sequence information are stored in a storage processing device having a parallel collation function, for use as collated data, so that the sequences are lined up in the collation direction while each sequence is oriented in a direction crossing the collation direction. The present invention then uses, as collated data, the data of the plural sequences lying side by side in the collation direction; uses, as collation data, data corresponding to a same-code string, in which identical codes (such as the characters representing sequence elements) are lined up; and causes the storage processing device to collate the collation data against the collated data by parallel processing.
[0016]
As described above, the present invention uses the storage processing device in a characteristic way: sequence information is stored oriented in a direction crossing the collation direction. The collated data therefore consists of data from plural sequences lined up in the collation direction. Data corresponding to a same-code string is used as the collation data. Parallel collation of the collated data against the collation data determines at high speed whether the plural sequences match.
[0017]
The present invention is particularly advantageous when the width of the storage processing device in the collation direction is narrow relative to the length of the sequences, as is the case with a CAM. This situation is common because the sequences actually processed are often long. According to the present invention, because sequence information is stored in a direction crossing the collation direction of the storage processing device, a long sequence can be accommodated in the storage processing device. By using collation data corresponding to a same-code string, whether the sequences stored in the crossing direction match is determined, and this processing is performed at high speed by parallel collation. In this way, the present invention can suitably speed up sequence analysis by using a storage processing device having a parallel collation function.
[0018]
Preferably, this aspect obtains information used for SNP analysis by the above-described processing. In SNP analysis, it is required to process many sequences quickly. In particular, it is considered that genome drug discovery and customized medicine will be put into practical use in the future, and SNPs analysis of a large number of samples will be required. It is desirable that SNPs analysis can be performed at high speed without using a large computer. According to the present invention, it is possible to appropriately meet such needs.
[0019]
(3) In one aspect of the present invention, biological sequence information is stored, for use as collated data, in a storage processing device having a parallel collation function, oriented along the collation direction. The present invention then uses the sequence information to be collated as collation data and causes the storage processing device to collate the collation data against the collated data by parallel processing. In this aspect, unlike the aspect described above, the sequence information is stored along the collation direction. The advantage of changing the storage direction, as described for the above aspect, is therefore not obtained. Even so, this aspect still gains the advantage of acceleration through parallel processing using the parallel collation function. More detailed aspects of the invention follow.
[0020]
(4) In one aspect of the present invention, plural pieces of biological sequence information, such as base sequences or amino acid sequences, are stored, for use as collated data, in a storage processing device having a parallel collation function, oriented along the collation direction. The present invention then uses a reference sequence as collation data and causes the storage processing device to collate the collation data against the collated data by parallel processing. Typically, a reference sequence consisting of a partial sequence is used to find local matches, as is done in a BLAST search. According to the present invention, whether each of the plural sequences contains the reference sequence is determined at high speed using the parallel collation function.
[0021]
Preferably, the present invention performs the collation process by setting a collation target portion whose length corresponds to the reference sequence and a remaining collation-excluded portion, and performs the collation process multiple times with the collation-excluded portion at different positions. By performing the collation process with different collation-excluded portions, a match can be properly detected whichever part of the collated sequence the reference sequence matches. The matching portion can also be located.
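The repeated collation with a moving collation-excluded portion can be sketched in Python as follows. This is a software stand-in under stated assumptions: the function names are hypothetical, string slicing plays the role of the excluded portion, and the inner loop over entries represents what a parallel-collation device would do in a single pass.

```python
def masked_match(entry, reference, offset):
    """Compare only entry[offset : offset + len(reference)] with the reference.

    All other positions of the entry are treated as excluded from collation.
    """
    return entry[offset:offset + len(reference)] == reference


def search_reference(entries, reference):
    """Return {entry_index: [offsets]} for stored entries containing `reference`."""
    width = max(len(entry) for entry in entries)
    hits = {}
    # One collation pass per position of the collation target portion.
    for offset in range(width - len(reference) + 1):
        # A parallel-collation device would compare all entries at once here.
        for index, entry in enumerate(entries):
            if masked_match(entry, reference, offset):
                hits.setdefault(index, []).append(offset)
    return hits
```

The number of passes depends on the entry width rather than on the number of stored sequences, which is why shorter entries in the collation direction mean fewer passes.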
[0022]
Preferably, the present invention divides a continuous sequence into plural pieces of divided sequence information, stores them in the storage processing device having a parallel collation function so that they are lined up in the direction crossing the collation direction, and determines by parallel processing whether part of each piece of divided sequence information matches the reference sequence.
[0023]
The present invention is particularly advantageous when the storage processing device is narrow in the collation direction and long in the crossing direction, as is the case with a CAM. According to the present invention, even when the width in the collation direction is narrow, dividing the sequence allows a long sequence to be stored by exploiting the length in the crossing direction. The length in the crossing direction can also be used to store a large number of sequences at once and process them in parallel.
[0024]
Furthermore, the sequence division of this aspect is advantageous for speeding up computation. Division shortens the sequence length in the collation direction, which reduces the amount of computation. When the plural kinds of collation-excluded portions described above are set, a shorter sequence length in the collation direction means less computation. Therefore, when the storage processing device is narrow in the collation direction and long in the crossing direction, the present invention does not treat this as an obstacle; rather, it reduces the amount of computation through sequence division and parallel processing, enabling still faster sequence analysis.
[0025]
Preferably, in this aspect, information used for homology analysis such as the BLAST method is obtained through the processing described above. For example, when a BLAST search is performed over a large number of sequences in a database, the speed-up of the present invention is considered particularly useful.
[0026]
(5) In one aspect of the present invention, the same sequence information is stored, for use as collated data, in a storage processing device having a parallel collation function, shifted little by little and oriented along the collation direction. The sequence information is shifted by a predetermined number of characters at a time, usually one character. The present invention then uses another piece of sequence information to be compared as collation data, uses the copies of the same sequence information stored with successive shifts as collated data, and causes the storage processing device to collate the collation data against the collated data by parallel processing.
[0027]
According to the present invention, portions where plural pieces of sequence information match consecutively are found at high speed using parallel processing. The longest matching portion can be found, and the positions of consecutive matching portions can also be determined. By using a storage processing device having a parallel collation function in the characteristic way of storing a sequence with successive small shifts, information on consecutive matching portions can be obtained that is similar to what is obtained with a dot matrix in, for example, a FASTA search.
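As an illustration of why shifted copies give dot-matrix-like information, the Python sketch below stores one copy of a sequence per shift and compares each copy position by position with the query; each shift corresponds to one diagonal of a dot matrix. The function names, the `-` padding character (assumed not to occur in any sequence), and the sequential loops are assumptions of this sketch; on a parallel-collation device the copies would be compared simultaneously, with partial collation patterns replacing the inner character loop.

```python
def shifted_entries(sequence, width, pad="-"):
    """Return {shift: entry}, where entry[i] holds sequence[i + shift].

    Positions that fall outside the sequence are filled with `pad`.
    """
    entries = {}
    for shift in range(-(width - 1), len(sequence)):
        entries[shift] = "".join(
            sequence[i + shift] if 0 <= i + shift < len(sequence) else pad
            for i in range(width)
        )
    return entries


def longest_common_run(sequence, query):
    """Longest stretch where `query` agrees with some shifted copy of `sequence`.

    Returns (length, start_in_query, start_in_sequence).
    """
    best = (0, None, None)
    for shift, entry in shifted_entries(sequence, len(query)).items():
        run = 0
        for i, (stored, wanted) in enumerate(zip(entry, query)):
            run = run + 1 if stored == wanted else 0
            if run > best[0]:
                start = i - run + 1
                best = (run, start, start + shift)
    return best
```

Shifting in both directions is a choice made here so that every diagonal is covered; the text itself only says that the copies are shifted little by little.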
[0028]
Preferably, in this aspect, information used for homology analysis such as the FASTA method is obtained through the processing described above. For example, when a FASTA search is performed over a large number of sequences in a database, the speed-up of the present invention is considered particularly useful.
[0029]
The present invention is not limited to the method aspects described above. Another aspect of the present invention is, for example, a sequence information processing apparatus. This apparatus may constitute a system accessed via a network. The apparatus and the above method may be realized by plural computers arranged in a distributed manner. Other aspects of the present invention are, for example, a program that causes a computer to carry out the above processing method, and a medium on which such a program is recorded.
[0030]
DETAILED DESCRIPTION OF THE INVENTION
Preferred embodiments of the invention (hereinafter referred to as embodiments) are described below with reference to the drawings.
[0031]
In this embodiment, a base sequence that is one form of sequence information is processed. However, the present embodiment can be similarly applied to any other biological sequence information such as an amino acid sequence.
[0032]
FIG. 1 shows the hardware configuration of the biological sequence information processing apparatus of the present embodiment. The sequence information processing apparatus 10 includes a CPU 12, a ROM 14, a RAM 16, a CAM 18, a hard disk 20, an input device 22, and an output device 24.
[0033]
The hard disk 20 stores a program for realizing the functions of the sequence information processing apparatus 10. This program is executed by the CPU 12. The hard disk 20 also stores the sequence information to be analyzed, and the CPU 12 acquires the sequence information from the hard disk 20. The sequence information may instead be acquired by other means. For example, it may be acquired from a recording medium such as a CD-ROM or DVD via a recording medium mounting unit (not shown), or via a communication device. The communication device may acquire the sequence information from a network such as the Internet.
[0034]
The input device 22 is a keyboard, a pointing device, or the like. The user operates the input device 22 to enter various instructions and to enter information requested by the sequence information processing apparatus 10. The output device 24 is a display, a printer, or the like, and outputs the analysis result information. Guidance screens for the user, for example a screen needed for operating the input device 22, are also shown on the display. When the sequence information is acquired via a communication device as described above, it is preferable to have the communication device also function as an output device, so that information such as analysis results is output via the communication device.
[0035]
As is apparent from the above description, the sequence information processing apparatus 10 has the functions of an ordinary personal computer. It differs from an ordinary personal computer, however, in that it includes the CAM 18. The sequence information processing apparatus 10 analyzes sequence information at high speed by making suitable use of the CAM 18.
[0036]
A CAM (Content Addressable Memory) is a typical and preferred form of a storage processing apparatus having a parallel collation function of the present invention. CAM is also called associative memory. The CAM is a data storage device in which a storage location is identified not by name, address, or relative position but by information contents, and thus high-speed data retrieval can be performed. CAM is usually used in routers on the Internet.
[0037]
FIG. 2 shows the function of the CAM in the Internet router. The CAM stores a routing table. The routing table associates a plurality of IP addresses with router names. Each IP address is associated with the name of a router to which data with the IP address is transferred. When an IP address is input as verification data, the CAM searches for an IP address that matches the verification data. This search is performed in parallel processing. Then, the CAM outputs a router name associated with the IP address that matches the verification data.
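The routing-table lookup described above can be modeled with a toy Python class. The class name, method names, IP addresses, and router names below are all hypothetical, and the list comprehension only imitates in software what CAM hardware does by comparing every stored entry at the same time.

```python
class SimpleCAM:
    """Toy model of a content-addressable memory: lookup by content, not address."""

    def __init__(self):
        self._entries = []  # list of (stored_key, associated_value) pairs

    def store(self, key, value):
        """Store collated data (`key`) together with its associated output."""
        self._entries.append((key, value))

    def search(self, key):
        """Return the values of all entries whose stored key equals `key`.

        Real CAM hardware performs these comparisons in parallel.
        """
        return [value for stored, value in self._entries if stored == key]


cam = SimpleCAM()
cam.store("192.168.0.1", "router-A")
cam.store("10.0.0.7", "router-B")
```

Searching with `"10.0.0.7"` as collation data returns the router name stored alongside that address, and an address that is not stored returns nothing.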
[0038]
In this way, the CAM can collate collation data against collated data by parallel processing and output the collation result. In the present invention, this function is called the parallel collation function. Sequence processing, for its part, involves a large number of data comparisons between sequences, and the function of the CAM is well suited to this kind of processing. The present invention focuses on this point and has the CAM perform the many data comparisons between sequences. As a result, the part of sequence analysis processing that accounts for a large share of the work is greatly accelerated by the CAM, which speeds up sequence analysis.
[0039]
Furthermore, the CAM is widely used as an Internet router component and can be obtained at relatively low cost. The CAM is also advantageous in that it can easily be connected to an ordinary personal computer. The present invention thus focuses on the fact that the characteristics of the CAM ordinarily used in routers are suited to processing sequence information, and builds the sequence information processing apparatus around the CAM. In addition to high speed, this offers the advantage that the sequence information processing apparatus can be provided easily and at low cost.
[0040]
The CAM is a typical and preferable form of a storage processing device having a parallel collation function. Other storage processing devices having a parallel collation function may be applied, and the same speeding up is possible. Compared to the conventional case where a large amount of data is compared with software on a RAM, the speed can be greatly increased.
[0041]
FIG. 3 is a functional block diagram of the sequence information processing apparatus 10. The various functions of the sequence processing control unit 30 are realized by the CPU 12 of FIG. 1 executing a program. The sequence processing control unit 30 includes a sequence information acquisition unit 32, a collated data input unit 34, a collation data input unit 36, a collation result acquisition unit 38, a collation result processing unit 40, and an analysis information output unit 42.
[0042]
The sequence information acquisition unit 32 acquires the sequence information to be analyzed. As described above, the sequence information is acquired from the hard disk 20 or the like. The collated data input unit 34 inputs the collated data (the data that is referred to) to the CAM 18, where it is stored. The collation data input unit 36 inputs the collation data (the reference data) to the CAM 18. The CAM 18 collates the collation data against the collated data and outputs a collation result, which is acquired by the collation result acquisition unit 38. The collation result processing unit 40 performs processing for sequence analysis, such as various determinations, based on the collation result. The analysis information output unit 42 performs processing for outputting the sequence analysis information obtained by the collation result processing unit 40.
[0043]
The various kinds of sequence information processing performed by the sequence information processing apparatus 10 are described below.
[0044]
(1-1) SNP analysis
FIG. 4 shows the SNP analysis of this embodiment. Gene sequences are said to differ between individuals at an average rate of about one base per 1000 bases. SNP analysis compares plural sample sequences to detect the presence of such differing sequences.
[0045]
Referring to FIG. 4, the CAM 18 stores plural sequences sent from the sequence processing control unit 30. In the ordinary use of a CAM, for example in an Internet router, the collated data is stored along the collation direction of the CAM (FIG. 2). In the present invention, by contrast, as shown in the figure, the sequences are stored lined up in the collation direction, with each sequence oriented in the direction crossing the collation direction.
[0046]
The order of storage is arbitrary. Data may be stored sequentially in the collation direction: the first character of the first sequence, then the first character of the second sequence, and so on. Alternatively, data may be stored sequentially in the crossing direction: the first character of the first sequence, then its second character, and so on. All that matters is that, as a result, each sequence is oriented in the crossing direction as shown in FIG. 4.
[0047]
Currently available CAMs come in several widths in the collation direction, such as a 144-bit type and a 288-bit type. The number of sequences that can be processed at one time is limited by the width of the CAM. With an ordinary CAM, for example the 144-bit type mentioned above, about 100 sequences can be input.
[0048]
Next, data consisting of a string of identical characters is input as collation data. As is well known, bases are represented by the four letters A (adenine), T (thymine), G (guanine), and C (cytosine). First, the collation data (AAAAA...) is input.
[0049]
The CAM 18 collates the collation data against each piece of collated data. Each piece of collated data is a character string lined up in the collation direction. As described above, in the present embodiment the sequences are input oriented in the direction crossing the collation direction. Each piece of collated data therefore consists of one character from each sequence.
[0050]
The CAM 18 outputs information indicating coincidence when the collation data and the data to be collated completely match. In this embodiment, “1” is output. In this embodiment, the CAM 18 stores “1” at a position corresponding to the router name in the Internet routing table, and this “1” is output. If they do not match, “0” is output.
[0051]
Here, since the collation data is “AAA...”, “1” is output when all the characters of the collated data are “A”. The same processing is then performed in turn for the other characters “T”, “G”, and “C”.
[0052]
The right part of FIG. 4 shows the collation results. When the collated data consists only of identical characters, “1” is output for one of the collation data (A, T, G, or C). When the collated data contains differing characters, however, “0” is output for all of the collation data. This means that the input sequences include a sequence that differs because of a polymorphism. In this way, the presence or absence of a differing sequence is detected.
[0053]
To determine the presence or absence of a differing sequence, a logical operation (bit operation) on the collation results is suitably performed. The logical operation here is defined by 1 * 0 = 1, 0 * 1 = 1, 1 * 1 = 1, and 0 * 0 = 0, that is, a logical OR. For each column, this operation is applied across the four collation results: two collation results are combined, the next result is combined with that, and then the last. As shown on the left side of FIG. 4, the result is “1” when the collated data consists only of identical characters and “0” when the collated data contains differing characters. In this way, the presence or absence of a differing sequence is determined, and the position of the differing base is also identified.
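The collation with same-character strings and the OR of the four results can be sketched in Python as follows. The function names and the string-based storage are assumptions of this sketch; each call to `collate` stands for one parallel collation pass over all stored entries.

```python
BASES = "ATGC"


def to_cam_entries(sequences):
    """Store sequences crosswise: entry i holds character i of every sequence."""
    return ["".join(column) for column in zip(*sequences)]


def collate(entries, collation_data):
    """1 where an entry equals the collation data exactly, else 0."""
    return [int(entry == collation_data) for entry in entries]


def uniform_positions(sequences):
    """OR the collation results for AAA..., TTT..., GGG..., CCC...

    The result is 1 at positions where all sequences agree and 0 where
    at least one sequence differs.
    """
    entries = to_cam_entries(sequences)
    count = len(sequences)
    result = [0] * len(entries)
    for base in BASES:
        hits = collate(entries, base * count)
        result = [r | h for r, h in zip(result, hits)]
    return result
```

For four sample sequences in which only the third has `T` instead of `G` at the third position, the combined result is 0 at that position and 1 everywhere else, which locates the differing base.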
[0054]
Next, processing for specifying which of a plurality of base sequences is different from other sequences will be described. In this process, Don't Care bit (hereinafter referred to as DC bit) is set in the data to be verified.
[0055]
Here, a DCbit makes it possible to perform a match search using partial data in which the data at specific positions of the collated data is ignored. In the present embodiment, a DCbit denotes the position of a character to be ignored, or the ignored character itself. Ignoring it excludes part of the collated data from collation.
[0056]
FIG. 5 shows DCbit setting patterns. The sequence at the position where the DCbit is set is excluded from comparison. As shown in the figure, the DCbit is shifted step by step, and the processing of FIG. 4 described above is performed each time. When the DCbit is set at the position of the sequence that differs from the others, the processing of FIG. 4 finds that all remaining sequences match completely. That is, the results of the left logical operation are all “1”. The sequence at the position where the DCbit is then set is identified as the differing sequence.
[0057]
For example, in FIG. 4, the third sequence differs from the other sequences. In this case, as indicated by the arrow in FIG. 5, all sequences match completely when the DCbit is set at the third position. This shows that the third sequence differs from the others.
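The DCbit scan can be sketched in Python as follows. Dropping a sequence from the comparison stands in for setting the DCbit at its position; the function name and the use of a set to test whether a position is uniform are illustrative assumptions. Indices are 0-based, so the "third" sequence of the text corresponds to index 2.

```python
def find_differing_sequences(sequences):
    """Return indices of sequences whose exclusion makes every position uniform.

    This mirrors shifting the DCbit one position at a time and repeating
    the collation; it is meant to be run only after a difference has
    already been detected.
    """
    culprits = []
    for dc_position in range(len(sequences)):
        # The sequence under the DCbit is excluded from collation.
        remaining = [s for i, s in enumerate(sequences) if i != dc_position]
        if all(len(set(column)) == 1 for column in zip(*remaining)):
            culprits.append(dc_position)
    return culprits
```

If more than one sequence differs, no single DCbit position yields a complete match and the function returns an empty list.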
[0058]
FIG. 6 is a flowchart of the SNP analysis processing described above. First, the collated data input unit 34 of the sequence processing control unit 30 inputs the sequence information acquired by the sequence information acquisition unit 32 to the CAM 18 (S10). The sequence information is stored oriented in the direction crossing the collation direction. Next, collation data is input by the collation data input unit 36 (S12). A string of identical characters, such as AAA..., is input.
[0059]
In the CAM, collation data and collated data are collated, and the result is output. If the collation data and the data to be collated completely match, “1” is output. Otherwise, “0” is output. The verification result is acquired by the verification result acquisition unit 38 (S14).
[0060]
Next, the sequence processing control unit 30 determines whether the collation process has been performed for all characters (A, T, G, C) (S16). If not, the process returns to S12, and a string of the next character is used as collation data.
[0061]
If S16 is YES, the process proceeds to S18, and the collation result processing unit 40 performs the SNP determination. Here, the arithmetic processing shown on the right side of FIG. 4 is performed, and the presence or absence of a differing sequence and the position of the differing base are identified.
[0062]
Next, it is determined whether or not there is a differing sequence (S20). If NO, the process is terminated. If YES, the process proceeds to S22 and the subsequent steps, in which the differing sequence is identified.
[0063]
In S22, DCbit is set. First, the first bit of the collated data is set to DC bit (the uppermost stage in FIG. 5). The processes of S24, S26, and S28 may be the same as S12, S14, and S16 described above. That is, the same character string is input to the CAM 18 as collation data (S24), a collation result is acquired from the CAM 18 (S26), and collation processing for all characters is performed (S28).
[0064]
Next, it is determined whether or not the processing of S22 to S28 has been performed for all the DCbit patterns shown in FIG. 5 (S30). If NO, the process returns to S22 and the DCbit setting is changed. The position of the DCbit is shifted one by one. In this way, the collation result when the DCbit is set at a different position is obtained. That is, a matching result is obtained when sequences are excluded from the matching target one by one.
[0065]
If S30 is YES, the process proceeds to S32, and the collation result processing unit 40 identifies the differing sequence. The position of the DCbit at which a perfect match is obtained indicates the differing sequence.
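The flow of FIG. 6 can be sketched in software as follows. This is a minimal simulation under stated assumptions, not the CAM hardware itself: the CAM is modeled as a Python list of columns, the DCbit is modeled as the index of the sequence to skip, and `collate`, `uniform_columns`, and `find_differing_sequence` are hypothetical helper names introduced only for illustration.

```python
BASES = "ATGC"

def collate(columns, char, dc=None):
    """Simulate one collation: 1 for each column whose characters all equal
    `char`, ignoring the sequence at index `dc` (the DCbit position), else 0."""
    return [int(all(c == char for i, c in enumerate(col) if i != dc))
            for col in columns]

def uniform_columns(columns, dc=None):
    """OR together the four per-character collation results (S12 to S16)."""
    flags = [0] * len(columns)
    for char in BASES:
        flags = [f | r for f, r in zip(flags, collate(columns, char, dc))]
    return flags

def find_differing_sequence(seqs):
    """Return the index of the sequence that differs from the rest, or None."""
    columns = list(zip(*seqs))           # one column per base position
    if all(uniform_columns(columns)):    # S20: no differing sequence
        return None
    for dc in range(len(seqs)):          # S22 to S30: shift the DCbit
        if all(uniform_columns(columns, dc)):
            return dc                    # S32: perfect match with this DCbit
    return None

seqs = ["ATGCA", "ATGCA", "ATTCA", "ATGCA"]  # made-up data; third sequence differs
print(find_differing_sequence(seqs))
```

The sketch stops at the first DCbit pattern that yields a perfect match instead of trying every pattern.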
[0066]
In the above, in order to explain the present invention in an easy-to-understand manner, the letters A, T, G, and C, which are usually used to express a base sequence, are used. However, other codes may be used within the scope of the present invention as long as they represent elements such as bases.
[0067]
In actual computer processing, it is preferable not to handle characters as such but to represent each character with a small amount of data. Since there are four types of bases, every base can be represented with as little as 2 bits of data. In that case, on the CAM 18 in FIG. 4, 2 bits of data are arranged for each character in the intersecting direction. Viewed at the bit level, two columns in the collation direction together represent data to be collated such as AAA. In the present invention, these two columns of data may be processed together in the collation process. Alternatively, the collation process may be performed for each column and the results further processed. The latter data processing is also included in the processing of the present invention that uses data corresponding to a string of identical codes as collation data.
[0068]
Also for DCbit, in the above description, DCbit is set for a character. In actual processing, for example, when four types of bases are represented by 2 bits, it goes without saying that one DC bit (*) in the above description corresponds to 2 bits on the computer.
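As an illustration of the 2-bit representation and of a DCbit that spans 2 bits per character, the following is a minimal sketch. The particular code assignment (A=00, T=01, G=10, C=11) and the helper names `encode`, `dc_mask`, and `match_with_dc` are assumptions made for this example; the description does not specify them.

```python
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}  # assumed mapping

def encode(seq):
    """Pack a base string into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def dc_mask(position, length):
    """Mask covering the two bits of the character at `position` (0-based)."""
    return 0b11 << (2 * (length - 1 - position))

def match_with_dc(a, b, mask):
    """True if the two encoded words agree on every bit outside the mask."""
    return ((a ^ b) & ~mask) == 0

x, y = encode("ATGC"), encode("ATTC")
print(format(x, "08b"))
print(match_with_dc(x, y, 0), match_with_dc(x, y, dc_mask(2, 4)))
```

With no mask the two words differ; masking the third character (two bits) makes them match.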
[0069]
Further, in the above, in order to explain the present invention in an easy-to-understand manner, the present invention has been described based on a quadrangular diagram as shown in FIGS. However, the actual physical data position on the CAM is not limited to that shown in FIG. This is the same in other embodiments as well.
[0070]
Further, in the process of FIG. 6, the collation process is performed for all DCbit setting patterns in order to find the differing sequence. However, the processing may be terminated as soon as the differing sequence is found, before all the DCbit patterns have been used. In this case, each time one pattern is processed, it is determined whether the differing sequence has been found.
[0071]
In the above processing, the collation when the DCbit is set is the same as the first collation. On the other hand, collation using DCbit may be performed for a narrower portion. For example, collation may be performed for a position with a different base. A position where a different base is present can be identified by the processing of FIG. 4 (the calculation result is 0).
[0072]
(1-2) Deletion / insertion detection
Next, a deletion/insertion detection process using the sequence analysis technique of this embodiment will be described. As is well known, a deletion means that, when a plurality of sequences are compared, a base is missing from a certain sequence. An insertion means that, when a plurality of sequences are compared, a certain sequence has a base not found in the other sequences.
[0073]
FIG. 7 shows the processing of this embodiment. The processing in FIG. 7 is generally the same as the SNP analysis in FIG. However, the verification result determination process is different.
[0074]
That is, in FIG. 7, the plurality of sequences to be compared are stored in the CAM 18 side by side in the collation direction, each running in the direction intersecting the collation direction. Each item of data to be collated therefore consists of one character from each sequence. The collation data is a string of a single repeated character, such as AAA. The collation data is compared with each item of data to be collated. The CAM 18 is programmed to output “1” if the collation data and the data to be collated match, and “0” if they do not.
[0075]
In the example of FIG. 7, the third sequence has a deletion at column n. In this case, for column n−1 and the preceding columns, “1” is output in one of the four character collations. For column n and the subsequent columns, on the other hand, “0” is output.
[0076]
As described above, when the collation processing of the present embodiment is performed, a portion where the collation data and the data to be collated match continuously and a portion where they continuously fail to match are adjacent to each other, with the position of the deletion as the boundary. Similar results are obtained when there is an insertion.
[0077]
Therefore, according to the present embodiment, when the above result is obtained, that is, when a portion where the collation data and the data to be collated match continuously is adjacent to a portion where they continuously fail to match, it can be seen that there is a deletion or an insertion.
[0078]
The determination of a deletion or insertion is preferably performed using the logical operation shown on the left side of the figure. As in FIG. 4, this logical operation is 1 * 0 = 1, 0 * 1 = 1, 1 * 1 = 1, and 0 * 0 = 0, that is, a logical OR. When there is a deletion or insertion, as shown in the figure, the logical operation result is ...111000. That is, the continuously matching portion and the continuously non-matching portion of the collation data and the data to be collated are adjacent to each other. It can be seen that there is a deletion or insertion at this boundary.
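The ...111000 pattern can be reproduced with a small simulation. This is a sketch under stated assumptions: the sequences are made-up strings chosen so that no coincidental matches occur after the boundary, and `or_of_collations` and `indel_boundary` are hypothetical helper names.

```python
BASES = "ATGC"

def or_of_collations(seqs):
    """For each column, OR the results of collating with AAA..., TTT...,
    GGG..., and CCC... (1 when all sequences share the same base there)."""
    width = min(len(s) for s in seqs)
    flags = [0] * width
    for char in BASES:
        result = [int(all(s[j] == char for s in seqs)) for j in range(width)]
        flags = [f | r for f, r in zip(flags, result)]
    return flags

def indel_boundary(flags):
    """Return the first column of the non-matching run if the flags have the
    form 1...10...0, otherwise None."""
    if 0 not in flags:
        return None
    n = flags.index(0)
    if n > 0 and not any(flags[n:]):
        return n
    return None

ref = "ACGTACGTACGT"
deleted = ref[:5] + ref[6:]   # third sequence lacks the base at column 5
flags = or_of_collations([ref, ref, deleted])
print("".join(map(str, flags)), indel_boundary(flags))
```

In real data, columns after the boundary can match by chance, so a practical check would probably need to tolerate isolated 1s in the non-matching run.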
[0079]
Which sequence has the deletion or insertion can be detected using the DCbit. The processing using the DCbit may be the same as in the SNP analysis. Setting the DCbit excludes one sequence from the collation. If setting the DCbit at a certain position changes the logical operation result so that the continuously matching portion is extended, the sequence corresponding to that DCbit position has the deletion or insertion.
[0080]
That is, in the example of FIG. 7, when the third bit of the data to be verified is set to DC bit, the logical operation result is changed, and the continuous matching portion is extended beyond n columns. This shows that the third sequence has a deletion or insertion.
[0081]
It is also possible to determine whether the mutation is a deletion or an insertion. To make this determination, the sequence information having the deletion or insertion is shifted on the CAM 18 by one character in the direction intersecting the collation direction. Then, the above collation and logical operation are performed.
[0082]
Here, it is assumed that the sequence is shifted downward by one character in the figure. If there is a deletion, “1” is output as the collation result for column n+1 and the subsequent columns, and the calculation result is also “1”. For column n and the preceding columns, the result is reversed and “0” is obtained. If there is an insertion, the above result cannot be obtained. That is, “1” is not output as the collation result even in the shifted state, and the calculation result is also 0. In this way, whether there is a deletion or an insertion can be determined from the collation result in the shifted state.
[0083]
Contrary to the above processing, the sequence may be shifted upward in the figure. In this case, if there is an insertion, the calculation result changes: 1 continues for column n and the subsequent columns, and 0 continues for column n−1 and the preceding columns. In the case of a deletion, 0 continues for column n and the subsequent columns. This difference in the results indicates whether a deletion or an insertion has occurred.
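The two shift tests can be sketched as follows. The sketch assumes that "downward" means toward later columns and "upward" toward earlier columns, that an indel has already been detected in sequence `idx`, and that the example strings contain no coincidental matches; `or_of_collations` and `classify_indel` are hypothetical names, and `-` is an arbitrary padding symbol that never matches a base.

```python
BASES = "ATGC"

def or_of_collations(seqs):
    """1 for each column in which all sequences share the same base, else 0."""
    width = min(len(s) for s in seqs)
    return [int(any(all(s[j] == ch for s in seqs) for ch in BASES))
            for j in range(width)]

def classify_indel(seqs, idx, pad="-"):
    """Shift sequence `idx` by one character each way and see which shift
    restores continuous matching after the boundary column n."""
    n = or_of_collations(seqs).index(0)      # boundary column (indel assumed)
    down = list(seqs)
    down[idx] = pad + seqs[idx]              # shift toward later columns
    if all(or_of_collations(down)[n + 1:]):
        return "deletion"
    up = list(seqs)
    up[idx] = seqs[idx][1:]                  # shift toward earlier columns
    if all(or_of_collations(up)[n:]):
        return "insertion"
    return "undetermined"

ref = "ACGTACGTACGT"
deleted = ref[:5] + ref[6:]          # base at column 5 removed
inserted = ref[:5] + "T" + ref[5:]   # extra base at column 5
print(classify_indel([ref, ref, deleted], 2))
print(classify_indel([ref, ref, inserted], 2))
```

To test for an indel of two or more characters, the same idea would apply with a correspondingly larger shift.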
[0084]
In the two shift processes described above, the entire sequence is shifted. However, only part of the sequence may be shifted. For example, only the portion at and after the location of the deletion or insertion may be shifted.
[0085]
Further, in the above processing, a deletion or insertion of one character is detected. A deletion or insertion of two or more characters can be detected in the same way. It is only necessary to shift the sequence in the crossing direction by that number of characters. For example, to determine a deletion of two characters, the sequence is shifted in the crossing direction by two characters.
[0086]
FIG. 8 shows a flowchart of the above-described deletion/insertion detection processing. Since the basic process is the same as the SNP analysis of FIG. 6, the description will be simplified as appropriate. The collation data input unit 34 inputs the array information to the CAM 18 (S40). The array information is stored so that each sequence runs in the direction crossing the collation direction. Then, verification data corresponding to a string of a single repeated character is input by the verification data input unit 36 (S42). The verification result at the CAM 18 is acquired by the verification result acquisition unit 38 (S44). The array processing control unit 30 then determines whether the collation processing has been performed for all the characters (A, T, G, C) (S46). If not completed, the process returns to S42.
[0087]
If S46 is YES, the process proceeds to S48, and the collation result processing unit 40 determines whether there is a deletion or an insertion. Here, as described with reference to FIG. 7, the collation result processing unit 40 determines that there is a deletion or insertion when a portion where the collation data and the data to be collated match continuously is adjacent to a portion where they continuously fail to match. When there is no deletion or insertion, the determination in S50 is NO and the process ends.
[0088]
When there is a deletion or insertion, the process proceeds to S52, and a sequence having the deletion or insertion is specified. In S52, DCbit is set. The processes of S54, S56, and S58 may be the same as S42, S44, and S46 described above. Then, it is determined whether or not the processing of S52 to S58 has been performed for all the DCbit patterns (S60). If NO, the process returns to S52 and the DCbit setting is changed. If S60 is YES, the process proceeds to S62, and a sequence having a deletion or insertion is specified.
[0089]
Note that, as described with reference to FIG. 6, the collation process need not be performed for all DCbit patterns. That is, each time one pattern is processed, it may be determined from its collation result whether the sequence having the deletion or insertion has been found, and the identification process may be terminated as soon as it is found.
[0090]
Next, the process proceeds to S64, where it is determined whether the mutation is a deletion or an insertion. In S64, the sequence having the deletion or insertion is shifted on the CAM 18 in the direction crossing the collation direction. Then the collation processing of S66 to S70 is performed. The processes of S66, S68, and S70 may be the same as S42, S44, and S46, respectively. Based on the collation result, it is determined whether there is a deletion or an insertion, as described above (S72).
[0091]
(1-3) Substitution detection
FIG. 9 shows a flowchart of substitution detection processing using the sequence analysis technique of this embodiment. This process is basically the same as the SNP analysis. SNP analysis, by its nature, looks for the substitution of one base among a plurality of sequences. Therefore, a substitution can be detected by applying the processing described for SNPs. If there is a substitution, as shown in FIG. 4, there is a portion where the collation data and the data to be collated match continuously, then a portion where they do not match, and then again a portion where they match. When such a collation result is obtained, there is a substitution, and its position is identified. As described above, according to the present invention, a substitution can be detected.
[0092]
FIG. 9 is basically the same as FIG. 6, so its description is omitted. However, in the case of substitution detection, it is preferable to identify in advance samples whose sequence lengths differ, for example because of deletions, and to exclude them from the data. Therefore, in S80, a plurality of sequences having the same sequence length are input to the CAM 18 for use as the data to be verified.
[0093]
The sequence analysis process of the present invention, which makes effective use of the CAM, has been described above, taking SNP analysis and mutation (deletion, insertion, and substitution) detection as examples. A CAM usually has a relatively narrow width in the collation direction. For example, 144 bits and 288 bits are typical CAM widths. Such a narrow width cannot accommodate relatively long sequence information such as a gene. Therefore, in the present invention, the sequence information is stored in the CAM so that it runs in the direction crossing the collation direction. The length in the crossing direction is very long even in an ordinary CAM. This makes it possible to accommodate a long sequence in the CAM. Furthermore, by using collation data corresponding to a string of a single repeated character, sequence comparison by the CAM is realized. In this way, the present invention makes it possible to utilize the high-speed collation function provided by the parallel processing of the CAM for sequence analysis, and to speed up the sequence analysis.
[0094]
The sequence processing of the present invention may be applied to sequence analysis other than SNP analysis and mutation detection as long as it can be realized within the scope of the present invention.
[0095]
The calculation amount of the processing of the present invention is compared with that of conventional sequence processing using a simplified example. A base is represented by one of four types of letters. When n-letter sequences are compared, the amount of calculation in the conventional processing is roughly represented by 4 to the power of n. As the number of characters n increases, the amount of calculation increases significantly.
[0096]
On the other hand, in the present invention, the parallel collation function of the storage processing device (including CAM) is appropriately used, and collation data corresponding to the same character string is input to the storage processing device. Corresponding to the four types of characters, four collation data are sequentially input. Therefore, the calculation amount of the processing according to the present invention corresponds to four collations. When the number of characters n increases, the calculation amount does not increase so much. Therefore, the present invention can greatly reduce the amount of calculation compared with the conventional processing.
[0097]
Here, as described above, the present invention is not limited to the base sequence, and can be similarly applied to the processing of other biological sequence information such as an amino acid sequence. The advantages of the present invention are particularly prominent as the number of array element types (generally character types) increases. Hereinafter, this advantage will be described in detail.
[0098]
The simplified example above is used again. Bases are represented by 4 types of letters, and natural amino acids by 20 types of letters. When n-character sequences are compared by conventional processing, the amount of calculation for base sequence comparison is represented by 4 to the power of n, and that for amino acid sequence comparison by 20 to the power of n. Therefore, the calculation amount for the amino acid sequence is “5 to the power of n” times that for the base sequence. As described above, in the conventional processing, the amount of calculation greatly increases as the number of types of sequence elements increases.
[0099]
On the other hand, since the present invention uses the parallel collation function of the storage processing device (including CAM), in the above example the calculation amount for the amino acid sequence is only five times (20 ÷ 4) that for the base sequence.
[0100]
That is, in the present invention, data corresponding to the same character string is input to the storage processing device as collation data. In the case of a base, four collation data are input corresponding to four types of characters. In the case of amino acids, 20 collation data are input corresponding to 20 types of characters. Therefore, the calculation amount is only five times. Thus, regarding the increase in the amount of calculation according to the number of types of array elements, the degree of increase is clearly smaller in the present invention than in the conventional processing.
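The arithmetic of this simplified comparison can be checked with a few lines. The function names `conventional_cost` and `cam_cost` are hypothetical, and the formulas are only the rough estimates given in the description, not measured costs.

```python
def conventional_cost(alphabet_size, n):
    """Simplified conventional estimate from the text: alphabet_size ** n."""
    return alphabet_size ** n

def cam_cost(alphabet_size):
    """Simplified estimate for the CAM approach: one collation per character type."""
    return alphabet_size

n = 10  # example sequence length, chosen arbitrarily
print(conventional_cost(20, n) // conventional_cost(4, n))  # equals 5 ** n
print(cam_cost(20) // cam_cost(4))                           # equals 5
```

The first ratio grows with n, while the second stays fixed at 5 regardless of the sequence length.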
[0101]
The above example is simplified and does not represent the precise amount of computation. Nevertheless, as is clear from it, the calculation amount of the processing of the present invention is significantly smaller than that of the conventional processing. Therefore, the present invention can advantageously speed up sequence processing compared with the conventional processing.
[0102]
(2) Blast search
Next, another embodiment of the present invention will be described. In the above-described embodiment, the sequence information is stored in the CAM in the direction that intersects the collation direction of the CAM. In this embodiment, the sequence information is stored along the collation direction. However, the sequence information is often longer than the width of the CAM in the collation direction. In such a case, in this embodiment, the sequence is divided into a plurality of pieces, and the sequence information is stored using a plurality of columns of the CAM. As a result, the present invention makes it possible to process a long sequence with the CAM.
[0103]
In this embodiment, the sequence processing of the present invention is applied to a blast search. The blast search is one type of homology search. In the blast search, a region that matches well locally is searched for without inserting gaps. Such a region is called a high score fragment. The high score fragment is then extended forward and backward. In the present embodiment, the present invention is applied to the processing for searching for high score fragments within the series of blast search steps.
[0104]
FIG. 10 shows an example of two sequences to be compared in the homology search. The total length of the sequence is considerably long and exceeds the width of the verification direction of the CAM.
[0105]
FIG. 11 shows a state in which the sequences are stored in the CAM 18. Each sequence is divided into a plurality of divided sequences, and each divided sequence is stored in one column of the CAM. Since there are four types of bases, each base is expressed in 2 bits. In the example of FIG. 11, one divided sequence includes 60 bases, so the length of one divided sequence is 120 bits. Therefore, by using a CAM having a width of 144 bits, for example, the sequences can be stored as shown in the figure.
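The division into 60-base columns can be sketched as below. The helper names `split_sequence` and `bits_needed` and the repeated-`ACGT` test string are assumptions for illustration; only the figures of 60 bases, 2 bits per base, and a 144-bit width come from the description.

```python
def split_sequence(seq, bases_per_column=60):
    """Cut a sequence into consecutive divided sequences of at most
    `bases_per_column` bases each."""
    return [seq[i:i + bases_per_column]
            for i in range(0, len(seq), bases_per_column)]

def bits_needed(divided, bits_per_base=2):
    """Number of bits one divided sequence occupies in a CAM column."""
    return len(divided) * bits_per_base

seq = "ACGT" * 40                  # 160 bases standing in for a long sequence
columns = split_sequence(seq)
print([len(c) for c in columns])
print(bits_needed(columns[0]), bits_needed(columns[0]) <= 144)
```

A full 60-base column uses 120 bits, which leaves 24 bits spare in a 144-bit word.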
[0106]
In the blast search, a reference sequence consisting of a partial sequence is used when searching for a high score fragment. The reference sequence is relatively short and is composed of, for example, 9 characters as shown in the figure. An inquiry is made as to whether a partial sequence that matches the reference sequence is included in the sample sequence. In this embodiment, this process is performed using CAM.
[0107]
That is, as shown in FIG. 11, in this embodiment, a reference sequence is input to the CAM 18 as collation data. The CAM 18 compares the collation data with the collated data in each column by parallel processing. When the collation data matches the data to be collated, the CAM 18 outputs “1”, and when it does not coincide, the CAM 18 outputs “0”. From this collation result, it can be determined whether or not each sequence to be searched includes a reference sequence.
[0108]
The collation process using the reference sequence as the collation data is performed using DCbit based on the characteristics of the CAM 18.
[0109]
Referring to FIG. 12, in this embodiment, DCbit (*) as shown is given to the data to be verified. That is, DCbit is given to the remaining part excluding the part corresponding to the length of the reference sequence. The part to which DCbit is given is excluded from the target of collation. A portion to which no DCbit is given is a target of collation.
[0110]
The position of the DCbit is shifted sequentially. In other words, the part to which no DCbit is given (the part to be verified) is shifted one character at a time. In this way, according to the present invention, a plurality of collation processes with different positions of the excluded portion are performed, so that a match can be detected no matter which portion of the data to be collated matches the reference sequence. The location that matches the reference sequence can also be identified.
[0111]
A plurality of collation results when the DCbit is shifted are suitably processed using a logical operation.
[0112]
Reference is made to the upper part of FIG. In the present embodiment, as described above, collation is performed a plurality of times, once with each pattern of the DCbit setting. At each collation, 1 or 0 is output from the CAM 18: 1 when the collation data matches the data to be collated, and 0 when it does not.
[0113]
A logical operation is performed on the collation results of all the patterns. The logical operation is 1 * 0 = 1, 0 * 1 = 1, 1 * 1 = 1, 0 * 0 = 0. The operation is first applied to two collation results, and is then repeated each time a further collation result is added. If the final calculation result is 1, a perfect match was obtained by the collation using at least one of the patterns. Otherwise, the final calculation result is 0. Therefore, if the calculation result is 1, it is understood that the reference sequence is included in the data to be verified.
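The shifted-DCbit search and the OR over all patterns can be simulated as follows. This is a character-level sketch, not the bit-level hardware: `*` stands for a DCbit position, the mask is written into the key purely for convenience (where it physically resides is abstracted away), and `collate`, `dc_patterns`, and `contains_reference` are hypothetical names. The column contents are made up.

```python
def collate(word, key):
    """Ternary comparison: positions where the key holds '*' are ignored."""
    return int(len(word) == len(key) and
               all(k == "*" or k == w for w, k in zip(word, key)))

def dc_patterns(reference, width):
    """Yield one key per offset: the reference at that offset, DCbits elsewhere."""
    for offset in range(width - len(reference) + 1):
        yield "*" * offset + reference + "*" * (width - offset - len(reference))

def contains_reference(column, reference):
    """OR the collation results over all DCbit patterns."""
    result = 0
    for key in dc_patterns(reference, len(column)):
        result |= collate(column, key)
    return result

columns = ["ACGTACGTTT", "GGGGCCCCAA", "TTACGTTGCA"]
print([contains_reference(c, "CGTT") for c in columns])
```

In the CAM, each pattern would be collated against all columns at once; the Python loop over columns is only a serial stand-in for that parallel step.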
[0114]
The lower part of FIG. 13 shows a suitable process for determining whether or not a plurality of reference sequences are included in the sample sequence.
[0115]
Assume that there are three reference sequences, A, B, and C. For each reference sequence, information on whether each column of the CAM 18 has a partial sequence that matches the reference sequence is obtained by the upper processing of FIG. 13: “1” if there is a matching part, and “0” otherwise. The results of the columns are then subjected to a logical operation. That is, in FIG. 13, the calculation proceeds in the vertical direction. The operation is 1 * 0 = 1, 0 * 1 = 1, 1 * 1 = 1, 0 * 0 = 0, as described above. As a result, when any one column includes the reference sequence, the calculation result is 1. If the calculation results for all the reference sequences are 1, that is, if 1s are lined up as shown in the figure, all the reference sequences are included in the sample sequence. When 0 is obtained as the calculation result, the corresponding reference sequence is not included.
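The per-reference reduction over columns can be sketched like this. Python's substring test is used here as shorthand for the per-column result of the shifted-DCbit collations, which is an assumption of the sketch; `column_hits` and `references_present` are hypothetical names, and the column and reference strings are made up.

```python
def column_hits(columns, reference):
    """1 for each column containing the reference, else 0 (stand-in for the
    per-column result of the shifted-DCbit collations)."""
    return [int(reference in col) for col in columns]

def references_present(columns, references):
    """OR the per-column results for each reference sequence."""
    present = {}
    for name, ref in references.items():
        result = 0
        for hit in column_hits(columns, ref):
            result |= hit
        present[name] = result
    return present

columns = ["ACGTACGTTT", "GGGGCCCCAA", "TTACGTTGCA"]
references = {"A": "CGTT", "B": "CCCC", "C": "TGCA"}
present = references_present(columns, references)
print(present, all(present.values()))
```

Only one bit per reference sequence has to be kept after the reduction, rather than one result per column and pattern.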
[0116]
The advantages of the above processing will be described. In the example of FIG. 13, there are relatively few reference sequences. However, more reference sequences may be used in a blast search. In that case, a large number of collation results must be held in the middle of the series of processing, and the amount of data to be held tends to increase. With the above-described processing, the present invention can suitably cope with this increase in the amount of data.
[0117]
According to the present invention, it is possible to speed up the reference sequence search processing by suitably using parallel processing. In this regard, the calculation amount of the normal processing is roughly compared with the calculation amount of the processing of the present invention.
[0118]
Here, consider a case where a blast search is performed using a database storing a large number of gene sequences, such as tens of thousands to hundreds of thousands. When the number of gene sequences in the database is Nc, the number of bases of one sequence is Lc, and the number of bases of a reference sequence is Rl, the calculation amount of the conventional processing is represented by Nc * (Lc−Rl).
[0119]
On the other hand, in the present invention, when the data length of each divided sequence is Cc and the number of bases of the reference sequence is Rl, the calculation amount is represented by Cc-Rl. This expression does not include the data length Lc of the entire array. This is because, in the present invention, a divided sequence obtained by dividing a gene sequence is a search target. Further, the above formula does not include the number of arrays Nc. This is due to the following reason. The CAM is usually used as a part of an Internet router, and is configured to store a large number of IP addresses in a state where parallel search is possible. Accordingly, the CAM has a relatively short width in the collating direction, but is very long in the direction intersecting it. By utilizing this point, tens of thousands or more gene sequences can be stored in parallel in the crossing direction and processed simultaneously in parallel. Therefore, the calculation amount formula of the present invention does not include the number Nc of gene sequences.
[0120]
As described above, the calculation amount of the conventional process is generally represented by Nc * (Lc−Rl), and that of the process of the present invention by Cc−Rl. The number Nc of gene sequences is usually from tens of thousands to hundreds of thousands. The number of bases Lc of one sequence is about 1000 to 10,000. The number of bases Rl of the reference sequence is about 20. The data length Cc of a divided sequence is about 100 (60 in the example of FIG. 11). When the two are compared under these conditions, the calculation amount of the processing of the present invention is roughly, for example, about 1/10000 of that of the conventional processing.
[0121]
Thus, according to the present invention, the sequence search can be sped up. As is apparent from the above description, the present invention makes good use of the characteristics of the CAM. That is, a large number of genes are stored simultaneously as data to be verified, using the length in the direction crossing the verification direction. Furthermore, by processing the plurality of divided sequences in parallel, the amount of calculation is reduced without being disadvantaged by the short width in the collation direction. In this way, the above-described significant speedup can be achieved.
[0122]
FIG. 14 is a flowchart showing the blast search process described above. First, the collation data input unit 34 of the array processing control unit 30 inputs the array information acquired by the array information acquisition unit 32 to the CAM 18 (S110). The array information is divided into a plurality of divided array information as described above, and each divided array information is stored in one column of the CAM. DCbit is set (S112), and verification data is input by the verification data input unit 36 (S114). Here, first, the DCbit of the first pattern is set. The collation data is a reference sequence. The CAM 18 collates the collation data with the data to be collated in each column, and outputs the result. If the collation data and the data to be collated completely match, “1” is output. Otherwise, “0” is output. The verification result is acquired by the verification result acquisition unit 38 (S116).
[0123]
Next, the array processing control unit 30 determines whether or not collation has been performed for all the DCbit patterns (S118). If NO, the process returns to S112, and the DCbit pattern is changed. In this embodiment, as described above, the position of the DCbit is sequentially shifted.
[0124]
If S118 is YES, the process proceeds to S120, and the array processing control unit 30 determines whether or not the collation process has been completed for all the reference sequences. For example, it is determined whether or not all of the reference sequences A, B, and C in FIG. 13 have been processed. If S120 is NO, the process returns to S112 and collation is performed using the next reference sequence. If S120 is YES, the process proceeds to S122. In S122, as described with reference to FIG. 13, the collation result processing unit 40 performs a logical operation using the collation results and determines whether each reference sequence is included in the sample sequence and whether all the reference sequences are included in the sample sequence. The above process is performed for each of the plurality of sample sequences.
[0125]
Preferably, the sequence information processing apparatus 10 of the present embodiment is configured to perform subsequent processing, that is, the remaining processing of the blast search, using the query result of the reference sequence. This remaining processing may be performed by another apparatus.
[0126]
By the way, in this embodiment, as described above, one sequence is divided into a plurality of divided sequences. Therefore, a reference sequence (meaning a partial sequence that coincides with the reference sequence) may straddle a plurality of divided sequences at a division point. Such a reference sequence is suitably detected by the following process.
[0127]
Refer to FIG. In this embodiment, collation processing is performed using the end portion of the reference sequence as collation data. It is determined whether the rear part of the reference sequence matches the front part of the divided sequence that is the data to be verified. As shown in the figure, collation using the last one character of the reference sequence, collation using the last two characters, and so on up to collation using the last i−1 characters are performed, where i is the number of characters in the reference sequence. The DCbit is set in the portion other than the verification target in the same manner as in the processing described above. In actual processing, the number of DCbit patterns may be increased. That is, in FIG. 2, DCbit patterns may also be set for the cases where the reference sequence protrudes from the front portion of the data to be verified. Thereby, the above-described collation process can be applied as it is.
[0128]
Similarly, it is determined whether or not the front part of the reference sequence matches the rear part of the divided sequence that is the data to be verified.
[0129]
Suppose that the above processing finds the rear part of the reference sequence in the front part of the (n+1)th column, and the front part of the reference sequence in the rear part of the nth column. It is then determined whether the reference sequence is obtained when the two parts are joined. Here, it may be determined whether the total number of characters in the two parts matches the number of characters in the reference sequence. When the reference sequence is obtained, it is determined that the same partial sequence as the reference sequence is included in the sample sequence.
[0130]
In an actual program, this determination is suitably performed as follows. Again, logical operations are used. When the collation data, which is a part of the reference sequence, matches the collated data, the CAM 18 outputs "1"; otherwise the CAM 18 outputs "0". The following two collation results are subjected to a logical operation.
[0131]
(1) The result of collating the rear part of the n-th column with the first k characters of the reference sequence
(2) The result of collating the front part of the (n + 1)-th column with the last i-k characters of the reference sequence
The logical operation is a logical product: 1 * 1 = 1, 1 * 0 = 0, 0 * 1 = 0, 0 * 0 = 0. If the result is 1, the sample sequence contains a partial sequence identical to the reference sequence. If the result is 0, it does not. The above processing is performed over the range 1 ≦ k ≦ i−1. In this manner, a reference sequence that straddles two divided sequences is suitably detected.
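The boundary check above can be sketched in software as follows. This is only an illustrative emulation, not the apparatus itself: the helper names `cam_match` and `straddles` are hypothetical, `*` stands in for a DCbit, and the two columns are checked one after the other here, whereas a CAM would collate all stored columns at once.

```python
def cam_match(stored: str, pattern: str) -> int:
    """Emulate one CAM column: '*' in the pattern is a don't-care (DCbit)."""
    if len(stored) != len(pattern):
        return 0
    return int(all(p == "*" or p == s for s, p in zip(stored, pattern)))


def straddles(col_n: str, col_n1: str, ref: str) -> list[int]:
    """Return every k (1 <= k <= i-1) for which the first k characters of ref
    end column n and the last i-k characters of ref start column n+1."""
    i = len(ref)
    hits = []
    for k in range(1, i):
        # (1) rear part of column n against the first k characters of ref
        rear_pattern = "*" * (len(col_n) - k) + ref[:k]
        # (2) front part of column n+1 against the last i-k characters of ref
        front_pattern = ref[k:] + "*" * (len(col_n1) - (i - k))
        r1 = cam_match(col_n, rear_pattern)
        r2 = cam_match(col_n1, front_pattern)
        if r1 & r2:  # logical product of the two collation results
            hits.append(k)
    return hits


print(straddles("ACGTAC", "GTTTGA", "TACGT"))
```

In this example the reference "TACGT" is split as "TAC" at the end of the first column and "GT" at the start of the second, so only k = 3 is reported.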
[0132]
FIG. 16 shows another process for detecting a reference sequence at a division point. In this process, adjacent divided sequences partially overlap. The number of overlapping characters is i-1, where i is the number of characters in the reference sequence. If the collation process described above is performed in this state, a reference sequence at a division point is detected without omission.
[0133]
In the processing of FIG. 16, the number of characters in the overlapping portion must be changed according to the length of the reference sequence. This can be handled by using DCbits. That is, to avoid excessive overlap, DCbits are set in the excess portion. For example, assume that a 20-character reference sequence A and a 15-character reference sequence B are used. The number of characters in the overlapping portion is set to a suitable fixed value, for example 30 characters. When reference sequence A is used, the required overlap is 19 characters, so DCbits are set for the 11 characters at the rear of the collated data. When reference sequence B is used, the required overlap is 14 characters, so DCbits are set for the 16 characters at the rear of the collated data. In this way, processing suited to the length of the reference sequence is realized.
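The overlap arithmetic can be checked with a short sketch. The function names `split_with_overlap` and `excess_dc_bits` are hypothetical and introduced only for illustration; the sketch assumes a fixed stored overlap and a required overlap of i-1 characters, as stated in the text.

```python
def split_with_overlap(seq: str, width: int, overlap: int) -> list[str]:
    """Cut seq into pieces of at most `width` characters in which
    consecutive pieces share `overlap` characters."""
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    step = width - overlap
    pieces = []
    start = 0
    while True:
        pieces.append(seq[start:start + width])
        if start + width >= len(seq):
            break
        start += step
    return pieces


def excess_dc_bits(overlap: int, ref_len: int) -> int:
    """Number of rear characters to mask with DCbits so that the
    effective overlap becomes ref_len - 1."""
    needed = ref_len - 1
    if overlap < needed:
        raise ValueError("stored overlap is shorter than ref_len - 1")
    return overlap - needed


print(split_with_overlap("ABCDEFGHIJ", 4, 2))
print(excess_dc_bits(30, 20), excess_dc_bits(30, 15))
```

With a stored overlap of 30 characters, the second line of output reproduces the figures in the text: 11 masked characters for the 20-character reference and 16 for the 15-character reference.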
[0134]
However, the process shown in FIG. 15 is considered advantageous over the process shown in FIG.
[0135]
The sequence processing of this embodiment has been described above. In this embodiment, the present invention is applied to BLAST search. The present invention may also be applied to other sequence analyses, for example consensus sequence search, genetic mapping, and SNP detection. Needless to say, the processing of the above-described embodiment is modified to suit each analysis. For example, in the case of SNPs, DCbits need not be set.
[0136]
(3) FASTA search
Next, another embodiment of the present invention will be described. In this embodiment too, as in the embodiment described above, the sequence information is stored along the collation direction of the CAM. In this embodiment, portions where a plurality of sequences match continuously are obtained by parallel processing that exploits the characteristics of the CAM. This detection of continuous matching portions is suitable for FASTA search.
[0137]
First, a conventional FASTA search will be outlined.
[0138]
FIGS. 17 and 18 show dot matrix images. A dot matrix image is used in the conventional FASTA search to obtain continuous matching portions of a plurality of sequences. FIG. 17 is a conceptual diagram, and FIG. 18 is an example of an actual dot matrix image.
[0139]
In a dot matrix image, two sequences are arranged orthogonally. A dot is placed wherever the characters (elements) of the two sequences match. Where dots continue in the 45-degree direction, the characters of the two sequences match continuously in that portion. Using this feature, the longest continuously matching portion is obtained. Alignment by dynamic programming is then performed around the matching portion.
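For reference, the conventional dot matrix idea can be sketched in a few lines. This is a generic illustration and not the FASTA program itself; `dot_matrix` and `longest_diagonal_run` are hypothetical names, and the subsequent dynamic programming alignment is omitted.

```python
def dot_matrix(a: str, b: str) -> list[list[int]]:
    """1 where a[i] == b[j], 0 elsewhere (rows follow a, columns follow b)."""
    return [[int(x == y) for y in b] for x in a]


def longest_diagonal_run(a: str, b: str) -> tuple[int, int, int]:
    """Return (length, start_in_a, start_in_b) of the longest run of dots
    along a 45-degree diagonal, i.e. the longest continuous match."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a):
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b):
            if x == y:
                # extend the run that ended at the upper-left neighbour
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best[0]:
                    run = cur[j + 1]
                    best = (run, i - run + 1, j - run + 1)
        prev = cur
    return best


print(longest_diagonal_run("GATTACA", "CATTAGA"))
```

Here the two sequences share "ATTA" starting at index 1 in both, so the longest diagonal run has length 4.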
[0140]
In the present embodiment, the CAM is used to obtain the same information as would be obtained from the dot matrix image described above.
[0141]
FIG. 19 shows the processing of this embodiment. Here, the array is not divided for ease of explanation. However, in practice, as described later, it is preferable to divide the array into a plurality of parts in consideration of the narrow CAM width.
[0142]
In the example of FIG. 19, two sequences are compared, namely sequence 1 and sequence 2. Sequence 1 is stored in the CAM 18 as collated data. Sequence 2 is input to the CAM 18 as collation data.
[0143]
As shown, sequence 1 is stored in a plurality of columns of the CAM 18. That is, the same sequence is stored in a plurality of columns on the CAM 18. However, sequence 1 is shifted in the collation direction from column to column, one character at a time.
[0144]
With sequence 1 stored in this way, sequence 2 is input as collation data. The CAM 18 compares the collation data with the collated data in each column. When the two match, "1" is output; when they do not match, "0" is output.
[0145]
The above processing detects only the case where the sequences match over their entire length. Continuous matching portions of various lengths are detected as follows.
[0146]
FIG. 20 shows processing for detecting continuous matching portions of various lengths. As shown, DCbits (*) are used. A DCbit creates a portion that is excluded from collation.
[0147]
In the top row, no DCbit is set. In the second row, one DCbit is set at the rear end of the collated data. In the third row, one DCbit is set at the front end of the collated data. When collation is performed using the second and third patterns, the presence or absence of a continuous matching portion one character shorter than the sequence length is detected.
[0148]
Similarly, n DC bits are set in order to detect a continuous matching portion shorter by n characters than the length of the array. As shown in FIG. 20, n DC bits are distributed to both ends of the array. All combinations of distribution are used as a DCbit setting pattern.
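The DCbit setting patterns described above can be enumerated mechanically. The following is a minimal sketch under the assumption that a pattern is written as a string in which `*` marks a DCbit and `.` marks a compared position; `dc_patterns` is a hypothetical name.

```python
def dc_patterns(width: int, n: int) -> list[str]:
    """All masks of `width` positions in which n DCbits are distributed
    between the front end and the rear end of the sequence."""
    if not 0 <= n <= width:
        raise ValueError("n must satisfy 0 <= n <= width")
    return [
        "*" * front + "." * (width - n) + "*" * (n - front)
        for front in range(n + 1)
    ]


for n in range(3):
    print(n, dc_patterns(5, n))
```

For n DCbits there are n + 1 patterns, so testing every length down to a single character on a sequence of width w needs 1 + 2 + ... + w patterns in total.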
[0149]
In this way, according to the present embodiment, continuous matching portions of various lengths are detected by partially excluding the sequence from collation. The portion where the sequences match continuously over the greatest length can also be obtained.
[0150]
In the above processing, in order to find the longest matching portion, the processing for detecting continuous matching portions of all types of lengths does not have to be performed. The DCbit is sequentially changed, and the matching length of the detection target is sequentially shortened until the longest matching portion is found. Here, the DCbit pattern of FIG. 20 is used in order from the top to the bottom. Then, when the collation data and the data to be collated match, the longest sequence is found, and the processing is terminated. Such processing is also suitable.
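The early-terminating search described above can be emulated as follows. This is a sketch under stated assumptions rather than the apparatus itself: sequence 1 is stored in rows shifted one character at a time in both directions, unused positions hold a blank that never matches a sequence character, and the rows are scanned in a loop where a CAM would test them all in a single collation. The names `shifted_rows`, `row_matches`, and `longest_match` are hypothetical.

```python
BLANK = " "  # marks positions that hold no character data


def shifted_rows(seq1: str, width: int) -> dict[int, str]:
    """Store seq1 once per shift amount s, padded with BLANK to `width`."""
    rows = {}
    for s in range(-(len(seq1) - 1), width):
        rows[s] = "".join(
            seq1[j - s] if 0 <= j - s < len(seq1) else BLANK
            for j in range(width)
        )
    return rows


def row_matches(row: str, data: str, mask: str) -> bool:
    """Compare every position except those masked by a DCbit ('*')."""
    return all(m == "*" or r == d for r, d, m in zip(row, data, mask))


def longest_match(seq1: str, seq2: str) -> tuple[int, int, int]:
    """Return (length, start_in_seq1, start_in_seq2) of the longest
    continuous match, trying longer target lengths first."""
    width = len(seq2)
    rows = shifted_rows(seq1, width)
    for n in range(width):  # n DCbits -> target length width - n
        for front in range(n + 1):
            mask = "*" * front + "." * (width - n) + "*" * (n - front)
            for s, row in rows.items():  # parallel in a real CAM
                if row_matches(row, seq2, mask):
                    return width - n, front - s, front
    return 0, 0, 0


print(longest_match("GATTACA", "CATTAGA"))
```

Because the patterns are tried from the fewest DCbits to the most, the first hit is necessarily the longest continuous match, and the remaining patterns are never used.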
[0151]
FIG. 21 shows processing when the sequence is divided into a plurality of parts. Sequence 1 is divided into parts shorter than the width of the CAM 18 and stored in a plurality of columns of the CAM 18. The same sequence is stored in a plurality of areas on the CAM 18, shifted one character at a time. The maximum shift amount is set to (length of divided sequence - 1). This is because shifting any further would only duplicate collated data that is already stored.
[0152]
Array 2 is divided in the same manner as array 1. Each divided array is sequentially input to the CAM 18 as collation data. Therefore, the CAM 18 performs a collation process using each divided array of the array 2. The processing when using one divided array may be the processing described with reference to FIGS. 19 and 20.
[0153]
As shown by X in FIG. 21, when the arrangement is shifted, a portion having no character data is generated on the CAM column. This part is appropriately excluded from the processing target. The entire column with the X mark may be deleted. Even if this deletion is performed, it is considered that there is no problem because the corner area in FIG. 18 is only excluded from the search target.
[0154]
Further, regarding the dividing process, the process may be performed separately for each divided array. That is, first, the first divided array of the arrays 1 and 2 is selected. The divided array of the array 1 is arranged in the CAM 18 as shown in FIG. The process described with reference to FIG. 19 is performed using the divided array of array 2. Next, the second divided array of arrays 1 and 2 is selected, and the same processing is performed. Similar results can be obtained by such processing.
[0155]
By the way, the continuous matching portion of the array may straddle a plurality of divided arrays. This point is handled as follows.
[0156]
Referring to FIG. 22, when continuous matching portions exist in the rear part of the n-th column and the front part of the (n + 1)-th column, they are joined. It is then determined whether the joined portion is the longest continuous matching portion of sequence 1. To perform this process more accurately, a divided sequence is preferably treated as a candidate for joining even when only one character at its end matches. Although not shown in the figure, a continuous matching portion may straddle three or more divided sequences. In this case, all of those divided sequences are joined. That is, the continuous matching portions on both sides (each shorter than the divided sequence length, possibly 1 or 0 characters) and the continuous matching portions between them (one or more, each equal in length to the divided sequence) are joined.
[0157]
FIG. 23 is a flowchart showing the processing of this embodiment described above. First, the collated data input unit 34 of the array processing control unit 30 inputs the sequence information acquired by the array information acquisition unit 32 to the CAM 18 (S130). As shown in FIG. 21, the sequence information is divided into a plurality of pieces before being input. In addition, the same sequence is input repeatedly, shifted one character at a time. Next, collation data is input by the collation data input unit 36 (S132). The collation data is a divided sequence of sequence 2. The collation data and the collated data are then collated by the CAM 18. First, collation is performed without setting any DCbit. If the collation data and the collated data match completely, "1" is output. Otherwise, "0" is output. The collation result is acquired by the collation result acquisition unit 38 (S134).
[0158]
Next, the array processing control unit 30 determines whether or not the collation regarding the total length has been completed (S136). If NO, the length is changed (S138), and the process returns to S132. In S136, it is determined whether or not all the DCbit patterns in FIG. 20 have been processed. When all the patterns have not been processed, the next pattern is selected in S138. If S136 is YES, the process proceeds to S140.
[0159]
Note that, as already described, in this embodiment, it is not necessary to perform continuous matching determination for all lengths. In this case, the length of the detection target is sequentially reduced character by character. That is, the DCbit pattern of FIG. 20 is used in order from the top. When the collated data that matches the collation data is obtained, the process proceeds to S140.
[0160]
In S140, the array processing control unit 30 determines whether all the divided sequences of sequence 2 have been processed. If S140 is NO, the process returns to S132 and the next divided sequence is processed. If S140 is YES, the process proceeds to S142, where the collation result processing unit 40 identifies the longest matching portion (the portion where the sequences match over the greatest length) based on the collation results obtained so far. Preferably, the array information processing apparatus 10 of the present embodiment is configured to perform the subsequent processing, that is, the remaining processing of the FASTA search, using the identified longest matching portion. This remaining processing may instead be performed by another apparatus.
[0161]
As described above, according to the present embodiment, continuous matching portions of sequences, including the longest matching portion, can be detected using the CAM, just as when a dot matrix is used. The parallel processing function of the CAM makes a high-speed search possible.
[0162]
In this embodiment, two sequences are compared. However, more than two sequences may be compared within the scope of the present invention. In this case, preferably, a plurality of sequences are arranged in the direction crossing the collation direction of the CAM. For each sequence, as shown in FIG. 21, the same sequence is stored in a plurality of places, shifted one character at a time. Then the sequence used as collation data (sequence 2 in FIG. 21) is input. In this way, the sequence serving as collation data can be compared with a plurality of sequences simultaneously.
[0163]
In the present embodiment, the information processing of the present invention is applied to FASTA analysis. The present invention may also be applied to other sequence analyses. It can be applied advantageously wherever continuous matching portions of sequences are to be determined.
[0164]
The various preferred embodiments of the present invention have been described above. Of course, this embodiment can be modified within the scope of the present invention. For example, in this embodiment, the base sequence is processed. On the other hand, other sequences such as amino acids may be processed within the scope of the present invention, as already mentioned. Furthermore, the array information processing apparatus of the present invention may constitute a system accessed via a network.
[0165]
【Effects of the Invention】
(1) As described above, according to the present invention, sequence information is stored, for use as collated data, in a storage processing device having a parallel collation function; the storage processing device is made to collate collation data with the collated data by parallel processing; and sequence analysis information is obtained by obtaining information indicating matches between the collation data and the collated data. By using the parallel collation function, the large volume of data comparison involved in processing sequence information can be performed at high speed, and sequence analysis can be accelerated.
[0166]
Preferably, the storage processing device having a parallel collation function is a CAM. Conventionally, the CAM is used as a part of an Internet router. The present invention pays attention to the fact that the parallel collation function of the CAM is suitable for processing sequence information, and makes the CAM compare a large amount of data. As a result, the portion of the sequence analysis processing that occupies a large weight is greatly accelerated by the CAM, and the sequence analysis can be speeded up.
[0167]
Further, the CAM is widely used as a router component for the Internet and can be obtained easily and at relatively low cost. The CAM is also advantageous in that it can be connected easily to a computer such as an ordinary personal computer. The present invention thus focuses on the fact that the characteristics of the CAM, which is widely used as a router component, are also suitable for processing sequence information. By configuring the array information processing apparatus using the CAM, the present invention offers, in addition to the advantage of high speed, the advantage that the array information processing apparatus can be provided easily and at low cost.
[0168]
The storage processing device with a parallel collation function of the present invention is not limited to a CAM. An ordinary CAM is configured to compare one piece of collation data simultaneously with all stored collated data, and the embodiments described above mainly perform such processing. However, the CAM or other storage processing device may be configured to use a plurality of pieces of collation data simultaneously. Processing may also be performed in which the collated data to be compared differs for each piece of collation data. This configuration contributes to a further increase in speed by allowing a plurality of pieces of collation data to be processed simultaneously. For example, it is advantageous when a plurality of divided sequences are used as collation data in the BLAST search of the above-described embodiment.
[0169]
The storage processing device with a parallel collation function (including the CAM) of the present invention may be a part of a processor. It is within the scope of the present invention to use this processor to cause the storage processing unit to perform the processing of the present invention. This type of processor may be provided with a part or all of the processing functions of the present invention for using the storage processing device as described with reference to FIG. In this case, the processor constitutes the array information processing apparatus (at least a part) of the present invention.
[0170]
(2) According to one particularly central aspect of the present invention, a plurality of pieces of sequence information are stored, for use as collated data, in a storage processing device having a parallel collation function, each oriented in a direction crossing the collation direction so that the pieces are lined up side by side in the collation direction. The present invention then uses, as the collated data, the data of the plurality of adjacent pieces of sequence information lined up in the collation direction, and uses, as the collation data, data corresponding to a same-code string in which identical codes, each being a character representing a sequence element, are lined up. The storage processing device is made to collate the collation data with the collated data by parallel processing.
[0171]
As described above, the present invention has a characteristic usage of the storage processing apparatus in which the array information is stored in a direction that intersects the collation direction. Therefore, the collated data is composed of a plurality of arrays of data arranged in the collating direction. Data corresponding to the same code string is used as the collation data. By parallel collation processing of the data to be collated and the collation data, it is determined at high speed whether or not a plurality of arrays match.
[0172]
Considering the amount of calculation, the parallel processing function is appropriately used in the present invention, and collation data corresponding to the same character string is used. Therefore, the amount of calculation of the processing of the present invention is significantly reduced as compared with the conventional processing. In a simplified example assuming a base, when a sequence of n characters is processed, the amount of calculation of the conventional processing is represented by “4 to the power of n”. “4” is the number of types of bases. On the other hand, in the present invention, collation is performed using each of the four identical character strings. Therefore, the calculation amount of the processing of the present invention corresponds to four collations, and is significantly smaller than the conventional processing. As the number of characters n increases, the difference in calculation amount increases.
[0173]
Further, the advantage of the present invention is remarkable when there are many kinds of sequence elements. In the above example, when natural amino acids are assumed, the amount of calculation of the conventional processing is represented by "20 to the power of n", where 20 is the number of amino acid types. Compared with the base example ("4 to the power of n"), the amount of calculation is "5 to the power of n" times larger. In the present invention, on the other hand, when amino acids are assumed, the number of pieces of collation data corresponding to same-code strings is 20. Compared with the base example, the amount of calculation is only 5 times larger (20 ÷ 4). Thus, the increase in the amount of calculation with the number of types of sequence elements is clearly smaller in the present invention than in the conventional processing. In this respect as well, the present invention can advantageously speed up sequence processing compared with the conventional processing.
[0174]
As described with reference to the CAM example, the present invention is particularly advantageous when the width of the storage processing device in the collation direction is narrower than the length of the sequence. This is often the case, because the sequences actually processed are often long. According to the present invention, since the sequence information is stored in a direction crossing the collation direction of the storage processing device, a long sequence can be accommodated in the storage processing device. By using collation data corresponding to same-code strings, it is determined whether the sequences stored in the crossing direction match. This processing is performed at high speed by parallel collation. In this way, according to the present invention, sequence analysis can be suitably accelerated by using a storage processing device having a parallel collation function.
[0175]
Preferably, the present invention performs collation, for each of the plural types of codes constituting the sequence information, using data corresponding to a same-code string as the collation data, and processes the plurality of collation results to obtain information on whether the plurality of sequences match. For example, in the case of base sequences, collation is performed for each of the codes A, G, T, and C, as described in the above embodiment. Further, preferably, processing using a logical operation is performed, as described in the above embodiment. According to the present invention, collation is performed using plural types of same-code strings, and it is determined whether the collated data matches the collation data for any one of the same-code strings. It can therefore be determined whether the sequences match by the same process regardless of which code is present at each position in the sequences, and the processing is simplified.
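The same-code-string collation can be illustrated with a small emulation. It assumes that each stored word holds the characters found at one position of every sequence, and that the results of the four collations are combined with a logical OR; the exact logical operation of the embodiment is not restated in this passage, so the OR is an assumption. `column_words` and `agreement` are hypothetical names.

```python
def column_words(seqs: list[str]) -> list[str]:
    """Word p holds the character at position p of every sequence,
    i.e. the sequences are stored crossing the collation direction."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("this sketch assumes sequences of equal length")
    return ["".join(chars) for chars in zip(*seqs)]


def agreement(seqs: list[str], alphabet: str = "AGTC") -> list[int]:
    """1 at positions where all sequences carry the same code, else 0."""
    words = column_words(seqs)
    result = [0] * len(words)
    for code in alphabet:
        key = code * len(seqs)                  # same-code string, e.g. "AAA"
        hits = [int(w == key) for w in words]   # one parallel collation
        result = [r | h for r, h in zip(result, hits)]
    return result


print(agreement(["ACGT", "ACGT", "ATGT"]))
```

Only four collations are needed for base sequences regardless of how many sequences are stored or how long they are; a 0 in the result marks a position where at least one sequence differs, as at the second position in this example.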
[0176]
Preferably, according to the present invention, a part of the plurality of pieces of sequence information is excluded from the subject of collation, and the collation process is performed. Thereby, the sequence information that does not match the other sequence information can be specified.
[0177]
Preferably, according to the present invention, when the collation data and the collated data do not match, it is determined that a sequence differing from the other sequences because of a polymorphism is present. Polymorphism analysis, such as SNP analysis, can thereby be performed. Furthermore, preferably, the collation process is performed with some of the plurality of pieces of sequence information excluded from collation. The sequence that differs from the other sequences can thereby be identified in polymorphism analysis such as SNP analysis.
[0178]
Preferably, according to the present invention, when a portion where the collation data and the collated data match continuously is adjacent to a portion where they continuously fail to match, it is determined that a deletion or an insertion exists at the boundary between these portions. Thus, according to the present invention, a deletion or an insertion can be detected. Furthermore, preferably, the collation process is performed with some of the plurality of pieces of sequence information excluded from collation. The sequence information having the deletion or insertion can thereby be identified.
[0179]
Further, preferably, the present invention performs the collation process with the sequence information having the deletion or insertion stored shifted in the direction crossing the collation direction. It can thereby be determined whether a deletion or an insertion is present. This is because, as described in the above embodiment, the collation result after shifting differs characteristically between the case of a deletion and the case of an insertion.
[0180]
Note that, within the scope of the present invention, only one of deletion and insertion may be detected. In other words, either a deletion or an insertion may be detected by the sequence information processing.
[0181]
Preferably, according to the present invention, when a portion where the collation data and the collated data do not match lies between portions where they match, it is determined that a substitution exists in the non-matching portion. Thus, according to the present invention, a substitution can be detected. Preferably, only sequences of the same length are compared, which gives accurate results. More preferably, the collation process is performed with some of the plurality of pieces of sequence information excluded from collation. The sequence information having the substitution can thereby be identified.
[0182]
Preferably, in the present aspect, that is, in the aspect in which the sequences are stored in the crossing direction, the storage processing device having a parallel collation function is a CAM. As described above, the CAM has characteristics suitable for processing sequence information in that it has a parallel collation function, and can accelerate sequence analysis. Further, although the CAM has not been used for sequence information processing so far, it is widely used as an Internet router component and is inexpensive. Using the CAM therefore enables high-speed sequence analysis at low cost. Furthermore, although an ordinary CAM is relatively narrow in the collation direction, according to the present invention long sequences can be collated by storing the sequences in the direction crossing the collation direction and using data corresponding to same-code strings as the collation data. In addition, the parallel collation function of the CAM is exploited, making high-speed analysis possible.
[0183]
Preferably, in this aspect, that is, in the aspect in which the sequence is stored in the crossing direction, information used for the SNP analysis is obtained by the above-described processing. In SNP analysis, it is required to process many sequences quickly. In particular, it is considered that genome drug discovery and customized medicine will be put into practical use in the future, and SNPs analysis of a large number of samples will be required. It is desirable that SNPs analysis can be performed at high speed without using a large computer. According to the present invention, it is possible to appropriately meet such needs.
[0184]
(3) According to one aspect of the present invention, biological sequence information is stored along the collation direction, for use as collated data, in a storage processing device having a parallel collation function. The sequence information to be compared is then used as collation data, and the storage processing device is made to collate the collation data with the collated data by parallel processing. In this aspect, unlike the aspect described above, the sequence information is stored along the collation direction. The advantage obtained by changing the storage direction, as described for the above aspect, is therefore not obtained. However, this aspect also provides the advantage of higher speed through parallel processing using the parallel collation function. More detailed aspects of the invention follow.
[0185]
(4) In one aspect of the present invention, a plurality of pieces of biological sequence information, such as base sequences and amino acid sequences, are stored along the collation direction, for use as collated data, in a storage processing device having a parallel collation function. The present invention uses a reference sequence as the collation data, and makes the storage processing device collate the collation data with the collated data by parallel processing. Typically, a reference sequence consisting of a partial sequence is used to find local matches such as those sought in a BLAST search. According to the present invention, whether each of a plurality of sequences contains the reference sequence is determined at high speed using the parallel collation function.
[0186]
Preferably, the present invention sets a collation target portion having a length corresponding to the reference sequence and a remaining collation exclusion portion, and performs a plurality of collation processes with the collation exclusion portion at different positions. In the above-described embodiment, the collation exclusion portion is set using DCbits. According to the present invention, by performing collation with the collation exclusion portion at different positions, a match can be detected appropriately wherever the reference sequence matches within the sequence serving as collated data. The matching position can also be identified.
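The moving collation exclusion portion can be sketched as follows. The sketch assumes that every stored row has the same width and that `*` plays the role of a DCbit; `cam_match` and `find_reference` are hypothetical names, and the inner loop over rows stands in for what a CAM would do in one parallel collation.

```python
def cam_match(stored: str, pattern: str) -> int:
    """Emulate one CAM row: '*' in the pattern is a don't-care (DCbit)."""
    if len(stored) != len(pattern):
        return 0
    return int(all(p == "*" or p == s for s, p in zip(stored, pattern)))


def find_reference(rows: list[str], ref: str) -> dict[int, list[int]]:
    """Return {row index: [offsets]} for every place where ref occurs."""
    hits: dict[int, list[int]] = {}
    if not rows or not ref:
        return hits
    width = len(rows[0])
    for offset in range(width - len(ref) + 1):
        # collation target = ref; everything else is excluded by DCbits
        pattern = "*" * offset + ref + "*" * (width - offset - len(ref))
        for idx, row in enumerate(rows):  # parallel in a real CAM
            if cam_match(row, pattern):
                hits.setdefault(idx, []).append(offset)
    return hits


print(find_reference(["ACGTAC", "TTACGT", "GGGGGG"], "ACG"))
```

The number of collations depends only on the row width and the reference length, not on the number of stored rows, which is where the parallel collation function pays off.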
[0187]
Preferably, the present invention divides a series of arrays into a plurality of pieces of divided array information, and stores the plurality of pieces of divided array information in a storage processing device having a parallel collation function so as to be arranged in a direction crossing the collation direction. Then, it is determined by parallel processing whether or not a part of each divided array information matches the reference array.
[0188]
As described with reference to the CAM example, the present invention is particularly advantageous when the width of the storage processing device in the verification direction is narrow and the length in the crossing direction is large. According to the present invention, even when the width in the collating direction is narrow, by dividing the array, it is possible to store a long array utilizing the length in the crossing direction. Using the length in the cross direction, a large number of sequences can be stored simultaneously and processed in parallel.
[0189]
Furthermore, the division of sequences in this aspect is advantageous for speeding up the calculation. Division shortens the sequence length in the collation direction, which reduces the amount of calculation. When the plural types of collation exclusion portions described above are set, that is, when a plurality of DCbit patterns are used as in the above-described embodiment, the amount of calculation is smaller when the sequence length in the collation direction is shorter. Therefore, according to the present invention, the fact that the storage processing device is narrow in the collation direction and long in the crossing direction is not an obstacle. Rather, sequence division and parallel processing reduce the amount of calculation, and sequence analysis can be further accelerated.
[0190]
Within the scope of the present invention, the continuous divided arrays need not be arranged side by side on the storage processing device. They can be separated.
[0191]
Preferably, the present invention processes a result of collating a plurality of pieces of divided array information to determine whether or not the array information includes a reference sequence. Here, typically, a logical operation as described in the above-described embodiment is performed. Thereby, it is determined by simple processing whether or not the reference sequence is included.
[0192]
Preferably, according to the present invention, a reference sequence straddling adjacent pieces of divided sequence information is detected by performing collation using an end portion of the reference sequence as collation data. Thereby, even when a partial sequence that matches the reference sequence spans a plurality of divided sequences, that is, even when it spans a plurality of columns on the storage processing device, the partial sequence can be detected. It is also possible to specify the position of such a partial sequence.
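The idea of collating with end portions of the reference sequence can be sketched as follows. This is a hypothetical Python emulation, not the processing of the embodiment: entries are modeled as strings, consecutive pieces are assumed to occupy consecutive list positions (the invention itself does not require this), and find_straddling is an illustrative name.

```python
from typing import List, Tuple


def find_straddling(entries: List[str], key: str) -> List[Tuple[int, int]]:
    """Return (entry index, offset) where key starts in one entry and ends in the next.

    For each way of cutting key into a head and a tail, the head is collated
    against the end of every entry and the tail against the start of every
    entry. A hit is reported when entry i ends with the head and entry i + 1
    starts with the tail.
    """
    hits = []
    for cut in range(1, len(key)):
        head, tail = key[:cut], key[cut:]
        ends_with_head = [entry.endswith(head) for entry in entries]
        starts_with_tail = [entry.startswith(tail) for entry in entries]
        for i in range(len(entries) - 1):
            if ends_with_head[i] and starts_with_tail[i + 1]:
                hits.append((i, len(entries[i]) - cut))
    return hits


entries = ["ACGTTGCA", "ACGGTTAA"]
straddling = find_straddling(entries, "CAAC")
```

Here the reference CAAC is found starting at offset 6 of the first entry and continuing into the second, which also gives the position of the straddling partial sequence.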
[0193]
Preferably, the present invention partially overlaps adjacent pieces of divided sequence information. This processing also makes it possible to detect, without omission, a reference sequence located at a division point.
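A minimal sketch of overlapped division is given below. It assumes, as an illustration only, that the overlap is chosen as the reference length minus one so that every occurrence of the reference falls entirely inside at least one piece; the function name split_with_overlap and this choice of overlap are assumptions, not taken from the embodiment.

```python
from typing import List


def split_with_overlap(seq: str, width: int, overlap: int) -> List[str]:
    """Divide seq into pieces of at most `width` characters.

    Each piece repeats the last `overlap` characters of the previous piece.
    """
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    step = width - overlap
    stop = max(len(seq) - overlap, 1)
    return [seq[i:i + width] for i in range(0, stop, step)]


seq = "ACGTTGCAACGGTTAA"
key = "CAAC"
plain = split_with_overlap(seq, 8, 0)
overlapped = split_with_overlap(seq, 8, len(key) - 1)
```

With no overlap the reference CAAC is cut by the division and appears in neither piece, whereas with an overlap of three characters it appears whole in the second piece, at the cost of storing some characters twice.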
[0194]
Preferably, in the present aspect, that is, in the aspect in which the arrangement is stored in the collation direction, the storage processing device having the parallel collation function is a CAM. As described above, the CAM has characteristics suitable for processing of sequence information in that it has a parallel matching function, and can speed up the sequence analysis. Further, CAM has not been used for array information processing so far, but is widely used as an Internet router component and is inexpensive. Therefore, using CAM enables high-speed sequence analysis at low cost. Furthermore, although a normal CAM has a relatively narrow width in the collation direction, according to the present invention, a long array can be stored in the CAM by dividing and storing the arrays. Using the length of the CAM, a large number of sequences can be stored and processed simultaneously. Furthermore, by shortening the array length in the collating direction by array division, the amount of calculation can be substantially reduced and further speedup can be achieved. In this way, according to the present invention, sequence analysis can be suitably speeded up using the characteristics of CAM.
[0195]
Preferably, in this aspect, that is, in the aspect in which the sequence is stored in the collation direction, the information used for homology analysis such as the BLAST method is obtained by the above-described processing. For example, when performing a BLAST search using a large number of sequences in a database, the speedup provided by the present invention is considered particularly useful.
[0196]
(5) According to one aspect of the present invention, the same sequence information is stored in a storage processing device having a parallel collation function, arranged along the collation direction and shifted little by little, for use as data to be collated. The sequence information is shifted by a predetermined number of characters, usually one character. The present invention then uses another piece of sequence information to be compared as collation data, uses the same sequence information stored with small shifts as data to be collated, and causes the storage processing device to collate the collation data against the data to be collated by parallel processing. According to the present invention, a portion where a plurality of pieces of sequence information match continuously is obtained at high speed using parallel processing. The longest matching portion can be obtained, and the position of the continuous matching portion can also be specified. By using a storage processing device with a parallel collation function and storing the sequence with small shifts, it is possible to obtain, for example, information on continuous matching portions similar to that obtained by using a dot matrix in a FASTA search.
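The relationship between shifted storage and the diagonals of a dot matrix can be sketched in Python as follows. This is a software emulation under stated assumptions: each shifted copy is modeled as a string padded with a filler character, the per-position comparison is done in a loop rather than by a CAM, and the names shifted_entries, longest_run and longest_common_stretch are hypothetical.

```python
from typing import Dict, Tuple

PAD = "-"  # filler for positions where the shifted copy has no character (assumption)


def shifted_entries(stored: str, width: int) -> Dict[int, str]:
    """Build one entry per shift; entry[j] holds stored[j - shift] or PAD."""
    entries = {}
    for shift in range(-(len(stored) - 1), width):
        cells = []
        for j in range(width):
            k = j - shift
            cells.append(stored[k] if 0 <= k < len(stored) else PAD)
        entries[shift] = "".join(cells)
    return entries


def longest_run(entry: str, query: str) -> Tuple[int, int]:
    """Return (length, start) of the longest run of equal characters."""
    best_len, best_start, run = 0, 0, 0
    for j, (a, b) in enumerate(zip(entry, query)):
        run = run + 1 if a == b and a != PAD else 0
        if run > best_len:
            best_len, best_start = run, j - run + 1
    return best_len, best_start


def longest_common_stretch(stored: str, query: str) -> Tuple[int, int, int]:
    """Return (length, start in query, start in stored) of the longest continuous match."""
    best = (0, 0, 0)
    for shift, entry in shifted_entries(stored, len(query)).items():
        length, start = longest_run(entry, query)
        if length > best[0]:
            best = (length, start, start - shift)
    return best


result = longest_common_stretch("TTACGTAC", "ACGTGG")
```

Each shifted entry corresponds to one diagonal of the dot matrix, so scanning all entries for the longest run gives both the length and the position of the continuous matching portion. In hardware the entries would be collated in parallel; the loops here only stand in for that.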
[0197]
Preferably, the present invention obtains partially matching portions of sequences by performing partial collation of sequence information. Preferably, the present invention sets a collation exclusion portion in order to perform partial collation of sequence information. In the above-described embodiment, the collation exclusion portion is suitably set by setting the DCbit based on the characteristics of the CAM. More preferably, in the present invention, partial collation of sequence information is performed using a plurality of types of partial matching patterns to search for matching portions of a plurality of lengths. Multiple types of partial matching patterns are illustrated in FIG. According to the present invention, information on continuously matching portions of various lengths can be obtained. The longest matching portion is also detected appropriately.
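Partial collation with several exclusion patterns can be sketched as below. The sketch assumes that a DCbit pattern can be modeled as a list of booleans marking the positions that take part in the comparison, and that all entries have the same length as the collation data; the names masked_match, window_masks and longest_masked_match are hypothetical and the search order (longest pattern first) is an illustrative choice.

```python
from typing import List, Tuple


def masked_match(entry: str, key: str, care: List[bool]) -> bool:
    """Compare entry and key only at positions where care is True."""
    return all(e == k for e, k, c in zip(entry, key, care) if c)


def window_masks(width: int, length: int) -> List[List[bool]]:
    """All patterns that compare `length` consecutive positions out of `width`."""
    return [
        [start <= j < start + length for j in range(width)]
        for start in range(width - length + 1)
    ]


def longest_masked_match(entries: List[str], key: str) -> Tuple[int, int, List[int]]:
    """Return (length, start, matching entry indices) for the longest matching window."""
    width = len(key)
    for length in range(width, 0, -1):  # try the longest patterns first
        for care in window_masks(width, length):
            matched = [i for i, e in enumerate(entries) if masked_match(e, key, care)]
            if matched:
                return length, care.index(True), matched
    return 0, 0, []


found = longest_masked_match(["ACGTAC", "TTGTGG", "CCCCCC"], "ACGTGG")
```

For a width of w the sketch uses up to w(w + 1)/2 window patterns, which is one way to see why a shorter length in the collation direction reduces the amount of calculation when several patterns are used.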
[0198]
Preferably, according to the present invention, the same sequence information is stored in different regions of a storage processing device having a parallel collation function while being shifted little by little. Thereby, the copies of the same sequence shifted little by little are processed in parallel, and a search result is obtained at high speed.
[0199]
Preferably, the present invention divides a series of arrays into a plurality of pieces of divided array information. According to the present invention, a plurality of pieces of divided array information are stored in a storage processing device having a parallel collation function so as to be arranged in a direction crossing the collation direction. Even when the collation direction of the storage processing device is narrow, a long array can be stored in the storage processing device, and array analysis can be performed by parallel processing.
[0200]
Within the scope of the present invention, the continuous divided arrays need not be arranged side by side on the storage processing device. They can be separated.
[0201]
Preferably, the present invention obtains a portion where the sequences match across adjacent divided array information as a portion where the sequences match continuously. According to the present invention, such a continuous matching portion can be detected even when the continuous matching portion extends over a plurality of divided arrays, that is, a plurality of columns of the storage processing device.
[0202]
Preferably, in this aspect, the storage processing device having a parallel collation function is a CAM. As described above, the CAM has characteristics suitable for processing of sequence information in that it has a parallel matching function, and can speed up the sequence analysis. Further, CAM has not been used for array information processing so far, but is widely used as an Internet router component and is inexpensive. Therefore, using CAM enables high-speed sequence analysis at low cost. Furthermore, although a normal CAM has a relatively narrow width in the collation direction, according to the present invention, a long array can be stored in the CAM by dividing and storing the arrays. In addition, by using the length of the CAM, a large amount of sequences can be stored and processed simultaneously. In this way, according to the present invention, sequence analysis can be suitably speeded up using the characteristics of CAM.
[0203]
Preferably, in this aspect, information used for homology analysis such as the FASTA method is obtained by the above-described processing. For example, when performing a FASTA search using a large number of sequences in a database, the speedup provided by the present invention is considered particularly useful.
[Brief description of the drawings]
FIG. 1 is a diagram showing a hardware configuration of a biological sequence information processing apparatus according to a preferred embodiment of the present invention.
FIG. 2 is a diagram illustrating a normal function of a CAM when used in an Internet router.
FIG. 3 is a functional block diagram of the biological sequence information processing apparatus according to the present embodiment.
FIG. 4 is a diagram showing SNPs analysis processing by the apparatus of FIG. 3.
FIG. 5 is a diagram showing a DCbit setting pattern in SNPs analysis.
FIG. 6 is a flowchart corresponding to the process of FIG.
FIG. 7 is a diagram showing deletion or insertion detection processing by the apparatus of FIG.
FIG. 8 is a flowchart corresponding to the process of FIG.
FIG. 9 is a flowchart of substitution detection processing by the apparatus of FIG. 3.
FIG. 10 is a diagram showing an example of a sequence to be subjected to BLAST search by the apparatus of FIG. 3.
FIG. 11 is a diagram showing BLAST search processing by the apparatus of FIG. 3, and is a diagram showing a state in which the sequence of FIG. 10 is stored in the CAM.
FIG. 12 is a diagram showing DCbits set in the process of FIG.
FIG. 13 is a diagram showing a logical operation process for processing the collation result of FIG. 11 to determine the presence or absence of a reference sequence.
FIG. 14 is a flowchart corresponding to the process of FIG.
FIG. 15 is a diagram illustrating a process for obtaining a reference array across a plurality of divided arrays by using an end portion of the reference array as collation data.
FIG. 16 is a diagram illustrating a form in which a reference arrangement at a division location can be detected by overlapping adjacent division arrangements;
FIG. 17 is a diagram conceptually showing a dot matrix used in a FASTA search.
FIG. 18 is a diagram illustrating an example of a dot matrix actually used in a FASTA search.
FIG. 19 is a diagram showing FASTA search processing by the apparatus of FIG. 3.
FIG. 20 is a diagram showing processing for detecting continuously matching portions of various lengths in the processing of FIG. 19, and is a diagram showing various setting patterns of DCbits.
FIG. 21 is a diagram showing a process when the array is divided into a plurality of parts in the process of FIG. 19;
FIG. 22 is a diagram illustrating a process for detecting a continuous matching portion across a plurality of divided arrays in the process of FIG.
FIG. 23 is a flowchart corresponding to the process of FIG.
[Explanation of symbols]
10 Sequence information processing device
12 CPU
18 CAM
20 hard disk
30 Sequence processing control unit
32 Sequence information acquisition unit
34 Data-to-be-collated input unit
36 Collation data input unit
38 Collation result acquisition unit
40 Collation result processing unit
42 Analysis information output unit

Claims

A biological sequence information processing method comprising:
a collated-data storage step of storing, in a storage processing device having a parallel collation function, a plurality of pieces of biological sequence information such as base sequences and amino acid sequences for use as data to be collated, each piece being oriented in a direction crossing the collation direction and the pieces being aligned side by side in the collation direction; and
a collation step of using data of the plurality of pieces of sequence information adjacent in the collation direction as the data to be collated, using as collation data the data corresponding to a same-code string in which identical codes, each being a character representing a sequence element, are arranged, and causing the storage processing device to collate the collation data against the data to be collated by parallel processing.

The biological sequence information processing method according to claim 1,
The storage processing device having the parallel collation function is a CAM (Content Addressable Memory), which compares one piece of collation data with a plurality of pieces of data to be collated by parallel processing and outputs information indicating whether the collation data matches each piece of data to be collated.

The biological sequence information processing method according to claim 1 ,
For each of a plurality of types of codes constituting the sequence information, collation is performed using data corresponding to the same-code string as collation data, and the plurality of collation results are processed to obtain information on matching among the plurality of pieces of sequence information.

The biological sequence information processing method according to claim 1 ,
A biological sequence information processing method characterized by identifying part of sequence information that does not match other sequence information by excluding a part of a plurality of sequence information from a target for verification and performing a verification process.

The biological sequence information processing method according to claim 1 ,
A biological sequence information processing method characterized by determining that there is a sequence different from another sequence due to a polymorphism when the verification data and the data to be verified do not match.

The biological sequence information processing method according to claim 5 ,
A biological sequence information processing method characterized by identifying a sequence different from other sequences by excluding a part of a plurality of sequence information from a verification target and performing a verification process.

The biological sequence information processing method according to claim 1 ,
When a portion where the collation data and the data to be collated match continuously is adjacent to a portion where they do not match, it is determined that a deletion or insertion exists at the boundary between the portions.

The biological sequence information processing method according to claim 7 ,
A biological sequence information processing method characterized in that a part of a plurality of sequence information is excluded from a verification target and a verification process is performed to identify sequence information having a deletion or insertion.

The biological sequence information processing method according to claim 8 ,
Whether there is a deletion or an insertion is determined by shifting the sequence information having the deletion or insertion in the direction crossing the collation direction, storing it, and performing collation processing.

The biological sequence information processing method according to claim 1 ,
When there is a portion where the collation data and the data to be collated match continuously, the collation data and the data to be collated then cease to match, and the collation data and the data to be collated again match continuously, it is determined that there is a substitution.

The biological sequence information processing method according to claim 10 ,
A biological sequence information processing method characterized by identifying part of sequence information with substitution by excluding a part of a plurality of sequence information from a verification target and performing a verification process.

In the biological sequence information processing method according to any one of claims 1 to 11 ,
A biological sequence information processing method characterized by obtaining information used for SNP analysis.

A biological sequence information processing apparatus for processing biological sequence information such as base sequence and amino acid sequence,
the apparatus comprising:
a storage processing device having a parallel collation function; means for acquiring sequence information to be analyzed;
means for storing data to be collated in the storage processing device; means for inputting collation data into the storage processing device and causing the storage processing device to collate the collation data against the data to be collated; means for acquiring a collation result from the storage processing device; and means for processing the acquired collation result,
wherein, for use as the data to be collated, a plurality of pieces of sequence information are stored in the storage processing device having the parallel collation function, each piece being oriented in a direction crossing the collation direction and the pieces being aligned side by side in the collation direction, and wherein data of the plurality of pieces of sequence information adjacent in the collation direction are used as the data to be collated, data corresponding to a same-code string in which identical codes, each being a character representing a sequence element, are arranged are used as the collation data, and the storage processing device is caused to collate the collation data against the data to be collated by parallel processing.

The biological sequence information processing apparatus according to claim 13 ,
The storage processing device having the parallel collation function is a CAM (Content Addressable Memory), which compares one piece of collation data with a plurality of pieces of data to be collated by parallel processing and outputs information indicating whether the collation data matches each piece of data to be collated.

The biological sequence information processing apparatus according to claim 13 ,
For each of a plurality of types of codes constituting the sequence information, collation is performed using data corresponding to the same code string as collation data, a plurality of collation results are processed, and information on matching of the plurality of sequence information is obtained. A biological sequence information processing apparatus characterized by being obtained.

The biological sequence information processing apparatus according to claim 13 ,
A biological sequence information processing apparatus that identifies sequence information that does not match other sequence information by excluding a part of a plurality of sequence information from a verification target and performing a verification process.

The biological sequence information processing apparatus according to claim 13 ,
A biological sequence information processing apparatus characterized by determining that there is a sequence different from another sequence due to a polymorphism when the verification data and the verification target data do not match.

The biological sequence information processing apparatus according to claim 17 ,
A biological sequence information processing apparatus characterized in that a part of a plurality of sequence information is excluded from verification targets and a verification process is performed to specify a sequence different from other sequences.

The biological sequence information processing apparatus according to claim 13 ,
When a portion where the collation data and the data to be collated match continuously is adjacent to a portion where they do not match, it is determined that a deletion or insertion exists at the boundary between the portions.

The biological sequence information processing apparatus according to claim 19 ,
A biological sequence information processing apparatus that identifies sequence information having a deletion or insertion by excluding a part of a plurality of sequence information from a verification target and performing a verification process.

The biological sequence information processing apparatus according to claim 20 ,
Whether there is a deletion or an insertion is determined by shifting the sequence information having the deletion or insertion in the direction crossing the collation direction, storing it, and performing collation processing.

The biological sequence information processing apparatus according to claim 13 ,
When there is a portion where the collation data and the data to be collated match continuously, the collation data and the data to be collated then cease to match, and the collation data and the data to be collated again match continuously, it is determined that there is a substitution.

The biological sequence information processing apparatus according to claim 22 ,
A biological sequence information processing apparatus which identifies sequence information with substitution by excluding a part of a plurality of sequence information from a verification target and performing a verification process.

The biological sequence information processing apparatus according to any one of claims 13 to 23 ,
A biological sequence information processing apparatus characterized by obtaining information used for SNP analysis.

A computer-executable program for causing a computer to process biological sequence information such as a base sequence and an amino acid sequence,
the program causing the computer to execute:
a collated-data storage step of storing, in a storage processing device having a parallel collation function, a plurality of pieces of sequence information for use as data to be collated, each piece being oriented in a direction crossing the collation direction and the pieces being aligned side by side in the collation direction; and
a collation step of using data of the plurality of pieces of sequence information adjacent in the collation direction as the data to be collated, using as collation data the data corresponding to a same-code string in which identical codes, each being a character representing a sequence element, are arranged, and causing the storage processing device to collate the collation data against the data to be collated by parallel processing.

The program according to claim 25 ,
The storage processing device having the parallel collation function is a CAM (Content Addressable Memory), and the program causes the CAM to compare one piece of collation data with a plurality of pieces of data to be collated by parallel processing and to output information indicating whether the collation data matches each piece of data to be collated.

The program according to claim 25 ,
For each of a plurality of types of codes constituting the sequence information, collation is performed using data corresponding to the same code string as collation data, a plurality of collation results are processed, and information on matching of the plurality of sequence information is obtained. A program for causing a computer to execute a process to obtain.

The program according to claim 25 ,
A program that causes the computer to execute a process of identifying sequence information that does not match other sequence information by excluding a part of a plurality of sequence information from a verification target and performing a verification process.

The program according to claim 25 ,
A program for causing the computer to execute a process of determining that there is a sequence different from another sequence due to a polymorphism when the verification data and the data to be verified do not match.

The program according to claim 29 ,
A program that causes a computer to execute a process of specifying a sequence different from another sequence by performing a verification process by excluding a part of a plurality of sequence information from a verification target.

The program according to claim 25 ,
A program that causes the computer to execute a process of determining, when a portion where the collation data and the data to be collated match continuously is adjacent to a portion where they do not match, that a deletion or insertion exists at the boundary between the portions.

The program according to claim 31 , wherein
A program that causes the computer to execute a process of identifying sequence information having a deletion or an insertion by excluding a part of a plurality of sequence information from a verification target and performing a verification process.

The program according to claim 32 ,
A program that causes the computer to execute a process of determining whether there is a deletion or an insertion by shifting the sequence information having the deletion or insertion in the direction crossing the collation direction, storing it, and performing collation processing.

The program according to claim 25 ,
A program that causes the computer to execute a process of determining that there is a substitution when there is a portion where the collation data and the data to be collated match continuously, the collation data and the data to be collated then cease to match, and the collation data and the data to be collated again match continuously.

The program according to claim 34 ,
A program that causes the computer to execute processing for specifying sequence information with substitution by excluding a part of a plurality of sequence information from a verification target and performing verification processing.

The program according to any one of claims 25 to 35 ,
A program for causing the computer to execute processing for obtaining information used for SNP analysis.

A computer-readable recording medium storing the program according to any one of claims 25 to 36 .