JP3723760B2 - Biological sequence information processing method and apparatus

Info

Publication number: JP3723760B2
Application number: JP2001341121A
Authority: JP
Inventors: 浩輔高木
Original assignee: 株式会社バイオマティクス
Priority date: 2001-11-06
Filing date: 2001-11-06
Publication date: 2005-12-07
Anticipated expiration: 2021-11-06
Also published as: JP2003141122A

Description

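[0001]
[Technical field of the invention]
The present invention relates to an apparatus that obtains information on homology by comparing a plurality of pieces of biological sequence information. Biological sequence information is typically the amino acid sequence of a protein or the base sequence of DNA. The invention is typically applied to processing that uses matrix information in which two sequences are arranged in the row direction and the column direction, respectively, and it speeds up this type of processing. The invention may also be applied to processing that compares three or more sequences.
[0002]
[Prior art]
In molecular biology, information processing technology is becoming increasingly useful for analyzing DNA, genes, proteins and the like. For homology search as well, various methods for obtaining highly reliable results with fast computation have been proposed and put into practical use.
[0003]
As is well known, homology search is a technique for comparing a plurality of sequences, such as amino acid sequences, to judge whether they are similar and to determine how they are similar. The comparison of two sequences is described here. Homology is known to be expressed using substitutions and gaps, which represent mutations between sequences. For proteins, a substitution means that different amino acids are present at corresponding positions of the two sequences. A gap means that an amino acid in one sequence is absent from the corresponding position of the other sequence; gaps arise from insertions and deletions of amino acids.
[0004]
Known information processing methods and algorithms for homology search include dynamic programming, BLAST and FASTA.
[0005]
Dynamic programming applies the principle of path finding to obtain the arrangement (alignment) of two sequences that minimizes the amount of mutation. Using a mutation cost for each pair of amino acids and a gap cost, the arrangement with the smallest cost is obtained.
[0006]
BLAST searches, without inserting gaps, for sites where two sequences match well locally (high-scoring fragments). Each fragment found is then extended forward and backward.
[0007]
FASTA uses a matrix in which two sequences are arranged in the row direction and the column direction, respectively. The matrix has elements that represent the positions where the two sequences match. In general a dot matrix is used, which is image information that represents each such element as a dot. For a protein, for example, when the i-th amino acid of one sequence matches the j-th amino acid of the other, the position at row i, column j is plotted. Locally matching portions (corresponding to the high-scoring fragments of BLAST) are then found from the dot matrix, and alignment by dynamic programming is performed on the region around each matching portion. Finally, only the long connected runs of points are extracted and displayed.
[0008]
These homology search methods are described, for example, in 「遺伝子とコンピュータ」 ("Genes and Computers", Akihiko Konagaya, Kyoritsu Shuppan Co., Ltd., pp. 67-79, 2000).
[0009]
Of these three methods, conventional dynamic programming is at a disadvantage in computation speed. BLAST is fast, but is said to be at a disadvantage in reliability, in that it may miss weak homology. FASTA is not as fast as BLAST but is faster than dynamic programming, and it is more reliable than BLAST.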
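[0010]
[Problems to be solved by the invention]
As described above, FASTA offers relatively high speed and reliability. However, as the volume of sequence data grows, still faster processing is constantly demanded, and reducing the amount of computation is considered an effective way to achieve it. Naturally, homology search with a small amount of computation must be made possible while sufficient reliability is maintained.
[0011]
The present invention has been made in view of the above problem, and its object is to provide a sequence information processing apparatus that enables fast homology search.
[0012]
[Means for Solving the Problems]
To achieve this object, the present invention uses matrix information of the kind used in the conventional FASTA method. However, as described below, the invention obtains homology information through new data processing that differs from FASTA. The invention is also not limited to two-dimensional processing that compares two sequences; it may be applied to processing that compares three or more sequences.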
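[0013]
One aspect of the present invention is a sequence information processing apparatus that obtains information on homology by comparing a plurality of pieces of biological sequence information such as amino acid sequences and DNA sequences. The apparatus includes a sequence information acquisition unit that receives two pieces of sequence information to be compared, a two-dimensional matrix image information generation unit, a first extraction processing unit, a second extraction processing unit, and an output processing unit that outputs the two-dimensional matrix image information processed by the second extraction processing unit.
[0014]
The two-dimensional matrix image information generation unit arranges the two pieces of sequence information to be compared in different directions and generates two-dimensional matrix image information. It compares the two sequences and, for every combination of positions in the two sequences, sets a first value as the matrix element when the two sequences match and a second value different from the first value when they do not. This yields two-dimensional matrix image information composed of a group of elements representing the positions where the two sequences match.
The first extraction processing unit determines, for the two-dimensional matrix image information, whether elements corresponding to matching positions of the two sequences continue in the diagonal direction for at least a predetermined number, and generates two-dimensional matrix image information in which the elements determined to be continuous are extracted.
[0015]
The second extraction processing unit uses a plurality of diagonally oriented parallelogram determination regions that are set side by side, adjacent to one another, on the two-dimensional matrix image information processed by the first extraction processing unit. For each determination region, it determines whether the sum of (a) the region width count, which is the length of the side of the parallelogram along the arrangement direction of one sequence, and (b) the number of empty tiers, that is, tiers within the determination region that contain no match element along that arrangement direction, is at most a predetermined threshold mutation count. It then generates two-dimensional matrix image information in which the elements of the determination regions judged to be at or below the threshold mutation count are extracted.
[0016]
In this way, a group of elements that extends for a long distance in the direction in which elements line up when matching positions continue, that is, a group of elements representing the homology of the compared sequences, is extracted. According to the invention, both the first and second extraction processing units are relatively simple and can be realized with little computation, so information representing homology can be obtained quickly. Moreover, because the above processing makes the judgment based on the number of mutations with simple operations, the amount of computation is reduced further and still higher speed becomes possible.
[0017]
In the present invention, diagonally oriented parallelogram regions are preferably set on the matrix information of the two sequences. These determination regions extend in the same direction as the element group representing homology, so the necessary information is obtained accurately. To implement the processing of the second extraction processing unit, the "number of positions where no element exists" may instead be compared with the "difference between the threshold mutation count and the region width count".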
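[0018]
In the above sequence information processing apparatus, the first extraction processing unit may determine whether three or more elements corresponding to matching positions of the two sequences continue in the diagonal direction in the two-dimensional matrix image information.
[0019]
In the above sequence information processing apparatus, the first extraction processing unit may be configured to determine, for at least one of the two arrangement directions, whether three elements corresponding to matching positions of the two sequences continue, and not to extract the central element of three elements determined to be continuous. This reduces the number of elements to be processed by the second extraction processing unit, so further speed-up is possible.
[0020]
The above sequence information processing apparatus may further include a homology sequence generation processing unit that generates sequence information expressing the homology indicated by the two-dimensional matrix image information processed by the second extraction processing unit. With this configuration, useful information expressing the homology of the plurality of sequences is obtained from the element group extracted by fast computation.
[0021]
In the above sequence information processing apparatus, the homology sequence generation processing unit may generate information that includes gaps and substitutions. With this configuration as well, useful information expressing the homology of the plurality of sequences is obtained from the element group extracted by fast computation.
[0022]
The above sequence information processing apparatus may further include an edge adjustment processing unit that deletes, from the two-dimensional matrix image information, elements corresponding to matching positions of the two sequences that remain in edge portions where no determination region is set. For example, when the matrix is divided into parallelogram regions, no region is set at the edges of the matrix, so unnecessary elements may remain there. The edge adjustment processing unit deletes such elements, so information representing homology is obtained more accurately.
[0023]
The above sequence information processing apparatus may further include a length comparison adjustment processing unit that, when a plurality of extracted elements remain along at least one of the two arrangement directions in the two-dimensional matrix image information processed by the second extraction processing unit, deletes unnecessary elements based on the length of the continuous portion that each remaining element forms with the surrounding extracted elements. This deletes unnecessary elements left by the second extraction processing unit, so information representing homology is obtained more accurately.
[0024]
In the above sequence information processing apparatus, the output processing unit may include a display processing unit that displays on a screen the two-dimensional matrix image information processed by the second extraction processing unit. With this configuration, the line drawn by the extracted element group is displayed on the screen. This display is useful as information that represents homology visually.
[0026]
A sequence information processing apparatus according to another aspect of the invention obtains information on homology by comparing two pieces of biological sequence information such as amino acid sequences and DNA sequences, and includes: a sequence information acquisition unit that acquires the two pieces of sequence information to be compared; a two-dimensional matrix image information generation unit that arranges the two pieces of sequence information in different directions, compares the two sequences and, for every combination of positions, sets a first value as the matrix element when the two elements are similar and a second value different from the first value when they are not, thereby obtaining two-dimensional matrix image information composed of a group of elements representing the similar positions of the two sequences; a first extraction processing unit that determines whether elements corresponding to similar positions of the two sequences continue in the diagonal direction for at least a predetermined number, and generates two-dimensional matrix image information in which the elements determined to be continuous are extracted; a second extraction processing unit that, using a plurality of diagonally oriented parallelogram determination regions set side by side on the two-dimensional matrix image information processed by the first extraction processing unit, determines for each determination region whether the sum of the region width count (the length of the side of the parallelogram along the arrangement direction of one sequence) and the number of empty tiers (tiers within the region that contain no match element along that direction) is at most a predetermined threshold mutation count, and generates two-dimensional matrix image information in which the elements of the regions judged to be at or below the threshold are extracted; and an output processing unit that outputs the two-dimensional matrix image information processed by the second extraction processing unit. This configuration uses two-dimensional matrix image information whose elements represent positions where, for example, amino acids are similar (including positions where they match). Taking similarity into account increases the reliability of the homology search. The amount of computation is larger than with a method that uses only elements corresponding to matching sites, but homology search can still be realized with less computation than conventional methods.
[0027]
The invention is not limited to the apparatus aspects described above. Other aspects include, for example, a program that causes a computer to function as the above apparatus, and a computer-readable medium on which such a program is recorded.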
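[0028]
[Embodiments of the invention]
A preferred embodiment of the present invention (hereinafter, "the embodiment") is described below.
[0029]
In this embodiment the invention is applied to information processing of amino acid sequences, one form of biological sequence. The invention may of course be applied to other sequence information, for example base sequences.
[0030]
In this embodiment the invention is applied to information processing that compares two sequences. This processing uses two-dimensional matrix information whose elements represent the positions where a first sequence and a second sequence, the sequences to be compared, match when they are arranged in different directions.
[0031]
In this embodiment the directions in which the two sequences are arranged are called the row direction and the column direction. The diagonal direction is the direction in which elements line up when the first and second sequences match consecutively. Because image information is used, the two sequences are arranged orthogonally, and the element spacing is the same in the row and column directions, the diagonal direction makes an angle of 45 degrees with the row and column directions.
[0032]
Fig. 1 shows the hardware configuration of a sequence information processing apparatus according to one form of the invention. The sequence information processing apparatus 1 includes a CPU 3, a ROM 5, a RAM 7, a hard disk 9, a media loading unit 11, a keyboard 13, a mouse 15, a display 17 and a communication device 19.
[0033]
The keyboard 13, mouse 15, display 17 and communication device 19 function as input/output devices. Other input/output devices may be provided as appropriate. The communication device 19 may communicate with nearby devices by infrared or similar means, or may communicate over a LAN, the Internet or the like. Several such kinds of communication device may be provided.
[0034]
A recording medium such as a flexible disk or a compact disc is loaded into the media loading unit 11. The media loading unit 11 can also be regarded as a device for input and output of information to and from the recording medium.
[0035]
The sequence information processing apparatus 1 may be a general-purpose computer. The apparatus is configured by installing a program that causes the computer to implement the information processing functions of this embodiment.
[0036]
Fig. 2 is a functional block diagram showing the configuration of the sequence information processing apparatus 1. Each component shown is realized by executing the above program. As shown, the apparatus includes a sequence information acquisition unit 20, a matrix information generation unit 22, a first extraction processing unit 24, a second extraction processing unit 26, an edge adjustment processing unit 28, a length comparison adjustment processing unit 30, a homology sequence generation processing unit 32 and an output processing unit 34. These components are described below with reference to the drawings.
[0037]
Fig. 3 shows an example of the amino acid sequence information to be compared, as acquired by the sequence information acquisition unit 20. The sequence information is read, for example, from a recording medium loaded in the media loading unit 11, in which case the media loading unit 11 functions as an input device for sequence information. The sequence information may also be acquired by other means such as the communication device, or read from the hard disk 9.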
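[0038]
The matrix information generation unit 22 generates matrix information from the two sequences acquired by the sequence information acquisition unit 20. It compares the i-th character (amino acid) of one sequence with the j-th character (amino acid) of the other sequence. When the two characters match, 1 is set as the matrix element (component); when they do not match, 0 is set. Performing this for every combination of characters of the two sequences yields the matrix information (DP matrix). The matrix information generation unit 22 further generates a binary image representing the matrix information, in which a point (dot) is plotted at each position corresponding to a matrix element of 1.
[0039]
Fig. 4 shows an image of matrix information obtained in this way. The example of Fig. 4 is a matrix obtained from the amino acid sequence of CRE-BP1 and the amino acid sequence of MUSMXBP.
[0040]
The matrix obtained in this way has "elements" that represent the positions where the two sequences match when one sequence is arranged in the row direction and the other in the column direction. In this embodiment an "element" is a point on the image.
[0041]
The matrix information is generated, for example, using a Java (registered trademark) program. The program provides a function for comparing two sequences. This function returns a matrix whose elements are 1 where the characters of the two sequences match and 0 otherwise. The same program plots a point at each image position corresponding to a 1. This processing can be made faster by using a hash table.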
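The matrix generation described above can be sketched as follows. This is a minimal illustration in Python rather than the Java program the text mentions, and the function name `dot_matrix` and its parameter names are assumptions made for this example. It uses a hash table (a dictionary from character to column positions) so that each row only visits the columns that actually match.

```python
from collections import defaultdict


def dot_matrix(vertical_seq, horizontal_seq):
    """Return a 0/1 matrix with m[i][j] == 1 when vertical_seq[i] == horizontal_seq[j].

    Rows follow the sequence laid out vertically, columns the one laid out
    horizontally (naming is an assumption of this sketch).
    """
    # Hash table: character -> list of column indices where it occurs.
    columns_of = defaultdict(list)
    for j, ch in enumerate(horizontal_seq):
        columns_of[ch].append(j)

    matrix = [[0] * len(horizontal_seq) for _ in vertical_seq]
    for i, ch in enumerate(vertical_seq):
        # Only matching columns are touched, instead of comparing every pair.
        for j in columns_of.get(ch, ()):
            matrix[i][j] = 1
    return matrix
```

Each 1 in the returned matrix corresponds to one dot of the binary image.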
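[0042]
In what follows, this embodiment obtains "meaningful information", that is, information representing homology, through image processing on the matrix of Fig. 4. Roughly speaking, the information sought is a group of points on a straight line, where the line (1) is as long as possible and (2) has a slope of minus 45 degrees (descending to the right), assuming the pixel spacing is the same vertically and horizontally.
[0043]
In Fig. 4 the diagonal lines representing homology already appear fairly clearly. The points on these lines are extracted by the subsequent processing. Even when the homology lines are not as clear as in Fig. 4, the subsequent processing extracts them.
[0044]
As stated, this embodiment performs image processing on the image of Fig. 4. In a preferred variation, however, the processing may instead proceed on the matrix information obtained in the preceding stage, that is, the matrix of 1 and 0 elements.
[0045]
The lines representing homology are extracted mainly by the first extraction processing unit 24 and the second extraction processing unit 26. Of these, the first extraction processing unit 24 extracts, from the matrix image of Fig. 4, the portions where the matching positions of the two sequences show a predetermined continuity. Where matching positions continue, the points in the image line up diagonally, so the first extraction processing unit 24 extracts points that have a predetermined continuity in the diagonal direction.
[0046]
Fig. 5 shows the filter used in the processing of the first extraction processing unit 24. When three points continue in the diagonal direction, the filter keeps those points; in this embodiment such points are the points having the predetermined continuity. In all other cases the point is deleted. As already stated, the diagonal direction is the direction that advances by the same number of pixels vertically and horizontally (45 degrees).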
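The effect of the Fig. 5 filter can be sketched as below. The figure itself is not reproduced in the text, so this is only one plausible reading: a point is kept when it belongs to a run of at least `min_run` consecutive points along a diagonal that descends to the right. The function name `keep_diagonal_runs` and the parameter `min_run` are assumptions of this sketch.

```python
def keep_diagonal_runs(matrix, min_run=3):
    """Keep only points that lie in a down-right diagonal run of >= min_run points."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[0] * cols for _ in range(rows)]

    # Every diagonal starts on the top row or on the left column.
    starts = [(0, c) for c in range(cols)] + [(r, 0) for r in range(1, rows)]
    for r, c in starts:
        run = []
        while True:
            inside = r < rows and c < cols
            if inside and matrix[r][c]:
                run.append((r, c))
            else:
                # The run ended (gap or image border): keep it if long enough.
                if len(run) >= min_run:
                    for rr, cc in run:
                        out[rr][cc] = 1
                run = []
            if not inside:
                break
            r += 1
            c += 1
    return out
```

Isolated points and diagonal pairs are dropped, which matches the rough clean-up role the text assigns to the first extraction step.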
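[0047]
Fig. 6 shows the image after extraction by the filter. Many points unrelated to homology have disappeared, but many unnecessary points still remain. The first extraction processing unit 24 thus performs a rough extraction as preprocessing for the second extraction processing unit 26 described next.
[0048]
Next, the extraction by the second extraction processing unit 26 is described.
[0049]
As shown in Fig. 7, the second extraction processing unit 26 sets a plurality of band-shaped determination regions on the matrix. Each determination region is a parallelogram. The determination regions are set by dividing the matrix into many parallelograms, so that they tile the matrix.
[0050]
Fig. 8 shows one determination region. As shown, in the example of this embodiment each determination region has a height (band height) of 10 pixels and a width (band width) of 5 pixels. The band width count (number of points) corresponds to the region width count of the invention. The slanted sides of the parallelogram are at 45 degrees. The determination region therefore extends in the diagonal direction of this embodiment, that is, the direction in which elements line up when the sequence information matches consecutively.
[0051]
The second extraction processing unit 26 performs extraction using the determination regions that have been set, one region at a time. Based on the distribution of points within a determination region, it extracts, region by region, the points of regions whose distribution meets a predetermined condition relating to sequence mutation. In this embodiment the predetermined distribution is one for which the number of mutations in the region is judged to be at most a predetermined threshold mutation count. All points in a region with such a distribution are extracted. When the number of mutations is judged to exceed the threshold mutation count, all points in the region are erased.
[0052]
As is well known, mutations include substitutions and gaps. A substitution means that different amino acids are present at corresponding positions of the two sequences. A gap means that one sequence has an amino acid that the other lacks; gaps arise from insertion or deletion of amino acids. The mutation count in this embodiment is the total number of substitutions and gaps.
[0053]
The following special judgment is made with respect to the threshold mutation count. A row in the region that contains no point is called an empty row in this embodiment. An empty row corresponds to what the invention calls a position within the determination region where no element exists along the arrangement direction of one sequence. When the sum of the number of empty rows and the band width count is at most the threshold mutation count, the second extraction processing unit 26 judges that the number of mutations in the region is at most the threshold mutation count and keeps the points in the region. In practice, the second extraction processing unit 26 only needs to determine whether the number of empty rows is at most the difference between the threshold mutation count and the band width count.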
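[0054]
The reason such simple processing suffices for the required judgment is explained next.
[0055]
Fig. 9 shows parts of two similar sequences. To simplify the explanation, the letters A to E are used in place of letters representing actual amino acids.
[0056]
Fig. 9(a) shows the case with no mutation; the points on the matrix line up diagonally. Fig. 9(b) shows the case with a substitution; as shown, a substitution produces an empty row (and, at the same time, an empty column). Fig. 9(c) shows the case where the sequence in the row direction (horizontal direction) has a gap; this produces an empty row on the matrix. Fig. 9(d) shows the case where the sequence in the column direction has a gap; this produces an empty column on the matrix and shifts the line by one pixel in the width direction. For Fig. 9(d), attention is paid mainly to the shift of the line.
[0057]
Assuming that a line representing homology passes through the determination region, the number of empty rows equals the sum of the number of substitutions (Fig. 9(b)) and the number of gaps in the row-direction sequence (Fig. 9(c)). A single gap in the column-direction sequence, on the other hand, shifts the line by one pixel in the row direction, as shown in Fig. 9(d). The "band width count" therefore corresponds to the maximum number of gaps that the column-direction sequence can have within the region. It follows that if the sum of the number of empty rows and the band width count is at most the threshold mutation count, that is, if the number of empty rows is at most 2 (= 7 (threshold mutation count) - 5 (band width count)), the total number of mutations can be said to be at most the threshold mutation count.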
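The per-region judgment can be sketched as a short function. The defaults follow the example values in the text (band height 10, band width 5, threshold mutation count 7); the function name `region_passes`, the `top`/`left` corner convention, and the assumption that each row of the parallelogram is shifted one column to the right of the row above are choices made for this sketch, not details confirmed by the figures.

```python
def region_passes(matrix, top, left, band_height=10, band_width=5, threshold=7):
    """Return True when (empty rows + band width) <= threshold for one region.

    The region is the parallelogram whose row k (0-based) covers columns
    left + k .. left + k + band_width - 1 of matrix row top + k.
    """
    empty_rows = 0
    for k in range(band_height):
        start = left + k  # 45-degree slant: one column further right per row
        if not any(matrix[top + k][start:start + band_width]):
            empty_rows += 1
    # Equivalent to: empty_rows <= threshold - band_width
    return empty_rows + band_width <= threshold
```

With the default values the test reduces to "at most 2 empty rows", as in the worked arithmetic above.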
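[0058]
In the example of Fig. 10, a line representing homology passes through the determination region. Here there is one substitution and one gap in the row-direction sequence. The number of empty rows is at most 2, so the points of this region are extracted.
[0059]
In the example of Fig. 11, no line representing homology passes through the determination region. The points in the region do not form a line; two sets of consecutive points merely happen to lie in the region. In this case the number of empty rows is large, so the points of the region are erased.
[0060]
The extraction processing described above has the following limitation.
[0061]
Refer to Fig. 12. In this example the region contains substitutions at two positions and one gap in the row-direction sequence, and no gap in the column-direction sequence. The number of empty rows is therefore 3, and the total number of mutations is also 3. Although the total number of mutations is at most 7, the region is excluded from extraction.
[0062]
This embodiment tolerates such situations. If a line representing homology passes through a region, a situation like that of Fig. 12 is unlikely to arise. Accordingly, when the number of empty rows is at most a predetermined number (when the sum of the number of empty rows and the band width count is at most the threshold mutation count), the number of mutations in the region is regarded as being at most the threshold mutation count.
[0063]
Likewise, when the region shape and threshold mutation count settings are changed, the opposite situation may conceivably arise, in which the points in a region are extracted even though the actual number of mutations is at or above the threshold mutation count; this embodiment tolerates that situation as well.
[0064]
Fig. 13 shows the matrix image after the extraction by the second extraction processing unit 26. In the example of this embodiment, as stated above, the band height is 10, the band width count is 5 and the threshold mutation count is 7. In Fig. 13, therefore, only the points of regions with at most 2 empty rows remain. This condition is relatively loose. Nevertheless, as Fig. 13 shows, most of the unnecessary points are effectively deleted. This shows that the extraction processing of this embodiment is very effective despite its simplicity.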
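[0065]
This concludes the description of the extraction by the second extraction processing unit 26. As Fig. 13 shows, unnecessary points still remain after the above extraction. They are deleted by the following processing, which is performed by the edge adjustment processing unit 28 and the length comparison adjustment processing unit 30.
[0066]
First, the processing of the edge adjustment processing unit 28 is described.
[0067]
Fig. 14 shows part of the edge of the matrix image. Only determination regions of complete shape are set on the matrix image; regions of partial shape are not set. As a result, no determination region is set at the edges of the image.
[0068]
Consider, for example, the top and bottom edges of the image. When the height of the image is not an integer multiple of the height of the determination region, no determination region is set on at least one of the top and bottom edges. As for the lateral edges, when parallelogram determination regions are laid side by side, the edge of the image can never be completely covered by determination regions, because, as is clear from Fig. 14, the end of a set of parallelograms traces a jagged line.
[0069]
Because no determination region is set at the edges, the processing of the second extraction processing unit 26 is not performed there either. As a result, as Fig. 13 shows, unnecessary points unrelated to homology remain at the edges. The edge adjustment processing unit 28 therefore deletes the points at the edges. In this embodiment, all points located where no determination region is set are deleted.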
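The tiling, the region-by-region extraction and the edge deletion can be sketched together. The exact placement of the parallelograms in Figs. 7 and 14 is not given in the text, so the layout used here (bands starting at row 0, regions starting at column 0 within each band, only complete parallelograms) is an assumption, as is the function name `extract_and_trim`. Because the output starts empty and only points inside accepted regions are copied, everything outside a complete region, including the jagged edges, is deleted as a side effect.

```python
def extract_and_trim(matrix, band_height=10, band_width=5, threshold=7):
    """Keep points of complete parallelogram regions that pass the empty-row test."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[0] * cols for _ in range(rows)]

    for top in range(0, rows - band_height + 1, band_height):
        left = 0
        # A complete region needs columns up to left + (band_height - 1) + band_width - 1.
        while left + (band_height - 1) + band_width <= cols:
            # Row k of the region starts k columns to the right (45 degrees).
            spans = [(top + k, left + k) for k in range(band_height)]
            empty_rows = sum(
                1 for r, c in spans if not any(matrix[r][c:c + band_width])
            )
            if empty_rows + band_width <= threshold:
                for r, c in spans:
                    out[r][c:c + band_width] = matrix[r][c:c + band_width]
            left += band_width
    return out
```

For example, with `band_height=3`, `band_width=2` and `threshold=2` on a 4 x 6 matrix, a clean three-point diagonal inside the first region survives while points on the bottom row and in the uncovered corners are removed.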
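[0070]
Fig. 15 shows the matrix image after the edge correction. The unnecessary points at the edges have disappeared.
[0071]
Next, the processing of the length comparison adjustment processing unit 30 is described. As explained below, when a plurality of points remain in one row of the matrix image, the unit deletes unnecessary points based on the length of the continuous portion that each point forms with the points in the preceding and following rows.
[0072]
Refer to Fig. 16. If only the points on the lines representing homology had been extracted, each row would contain exactly one point. When a row contains a plurality of points, therefore, only one of them should be kept as a point on the homology line. The point to keep is one that is part of a long diagonal line (run of points). In Fig. 16, point A should be kept and point B erased.
[0073]
The length comparison adjustment processing unit 30 therefore checks, row by row, whether there are a plurality of points. If so, it determines for each point the length of the line that the point forms together with points in the preceding and following rows. The point forming the longest line is kept and the others are deleted. In a situation such as that of Fig. 10, it is preferable to regard the line as continuous. The length comparison adjustment processing unit 30 therefore judges that a line is unbroken when points continue across at most a predetermined number (for example, 2) of mutations (substitutions and gaps; empty rows and empty columns in the image).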
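The row-by-row length comparison can be sketched as follows. This is a simplified version: it measures only strictly contiguous diagonal runs and does not bridge the small number of mutations that the text allows, and ties are resolved in favour of the leftmost point. Both simplifications, and the function names, are assumptions of this sketch.

```python
def diagonal_run_length(matrix, r, c):
    """Length of the contiguous down-right diagonal run that passes through (r, c)."""
    rows, cols = len(matrix), len(matrix[0])
    length = 1
    rr, cc = r - 1, c - 1
    while rr >= 0 and cc >= 0 and matrix[rr][cc]:
        length += 1
        rr -= 1
        cc -= 1
    rr, cc = r + 1, c + 1
    while rr < rows and cc < cols and matrix[rr][cc]:
        length += 1
        rr += 1
        cc += 1
    return length


def keep_longest_per_row(matrix):
    """In each row with several points, keep only the one on the longest run."""
    out = [row[:] for row in matrix]
    for r, row in enumerate(matrix):
        points = [c for c, value in enumerate(row) if value]
        if len(points) > 1:
            # Run lengths are measured on the input, so earlier deletions
            # do not influence later rows.
            best = max(points, key=lambda c: diagonal_run_length(matrix, r, c))
            for c in points:
                if c != best:
                    out[r][c] = 0
    return out
```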
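[0074]
Fig. 17 shows the matrix image after processing by the length comparison adjustment processing unit 30. The unnecessary points remaining in Fig. 15 have been erased. The homology lines are thus suitably extracted.
[0075]
In this embodiment, unnecessary points were deleted when a plurality of points remained in one row. In a variation, unnecessary points may be deleted by the same processing when a plurality of points remain in one column. That is, either of the two arrangement directions may be used for the correction, and the processing may also be performed for both directions. In general, when a plurality of points remain in one row, a plurality of points also remain in one column. Whether rows or columns are examined, therefore, nearly the same result is generally obtained, and the processing can be regarded as substantially the same.
[0076]
Next, the homology sequence generation processing unit 32 generates sequence information expressing the homology indicated by the lines of Fig. 17. The lines of Fig. 17 show which part of one sequence resembles which part of the other. The sequence portion corresponding to the projection of a line upward and the sequence portion corresponding to the projection of the same line to the left are similar. These similar portions are compared, and the substitutions and gaps are determined through this comparison. The two sequences are then laid out so that the correspondence of the similar portions, and the substitutions and gaps, can be seen. This processing can be performed automatically by a computer program.
[0077]
Regarding this processing, as is clear from Fig. 9, there is a substitution where an empty row and an empty column occur together in the matrix image (Fig. 9(b)), a gap in the row-direction sequence where only an empty row occurs (Fig. 9(c)), and a gap in the column-direction sequence where only an empty column occurs (Fig. 9(d)). These substitutions and gaps are determined, and information expressing them is obtained. It is considered desirable to check the mutation information obtained from the empty rows and columns against the two sequences at the corresponding positions, confirm whether it is appropriate, and make any necessary corrections; this processing too is preferably performed automatically by a computer.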
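The mapping from empty rows and columns to substitutions and gaps can be sketched as follows. The input is assumed to be the list of (row, column) points of a single homology line, strictly increasing in both coordinates; the function name `align_from_points` and the choice to place substitutions before gaps when both occur between two points are assumptions of this sketch.

```python
def align_from_points(vertical_seq, horizontal_seq, points):
    """Build a gapped alignment from the points of one homology line.

    Returns (horizontal_aligned, vertical_aligned); '-' marks a gap.
    """
    prev_r, prev_c = points[0]
    top = [horizontal_seq[prev_c]]
    bottom = [vertical_seq[prev_r]]
    for r, c in points[1:]:
        skipped_rows = r - prev_r - 1
        skipped_cols = c - prev_c - 1
        paired = min(skipped_rows, skipped_cols)
        # Empty row together with empty column: substitution.
        for k in range(1, paired + 1):
            top.append(horizontal_seq[prev_c + k])
            bottom.append(vertical_seq[prev_r + k])
        # Empty row only: gap in the horizontal (row-direction) sequence.
        for k in range(paired + 1, skipped_rows + 1):
            top.append("-")
            bottom.append(vertical_seq[prev_r + k])
        # Empty column only: gap in the vertical (column-direction) sequence.
        for k in range(paired + 1, skipped_cols + 1):
            top.append(horizontal_seq[prev_c + k])
            bottom.append("-")
        top.append(horizontal_seq[c])
        bottom.append(vertical_seq[r])
        prev_r, prev_c = r, c
    return "".join(top), "".join(bottom)
```

A substitution shows up as two different characters in the same output column, and a gap as "-", which is the presentation the text describes for Fig. 18.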
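[0078]
Fig. 18 shows the sequence information obtained by the above processing. A substitution is expressed by a difference between corresponding characters. A gap is shown as "-". There are two places in the figure where gaps continue for a long stretch. These groups of gaps arise because the line in Fig. 17 is split into three (the leftmost part is very short). That is, in Fig. 17 the sequence arranged horizontally has many amino acids in the portions where the line is interrupted. These amino acids are absent from the vertical sequence, so many consecutive gaps appear in the vertical sequence.
[0079]
The sequence information processing apparatus 1 further includes the output processing unit 34 (Fig. 2). The output processing unit 34 functions as a display processing unit and performs processing for displaying on the display the information obtained through the above processing.
[0080]
The output processing unit 34 displays the gapped sequence information of Fig. 18. It also displays the matrix image of Fig. 17. This display is useful as information that represents homology visually.
[0081]
The output processing unit 34 may display the matrix image of each stage of the above processing in accordance with user instructions, which are received through an input device. Viewing the display, the user can adjust parameters and the like; these adjustments are also received from the input device and are described further below.
[0082]
The output processing unit 34 may also output to devices other than the display. Information may be output to a printer, of course, or to the outside via the communication device. Information may also be stored on a recording medium such as a flexible disk.
[0083]
Fig. 19 shows an overview of the whole processing of this embodiment. When the sequence information acquisition unit 20 acquires the two pieces of sequence information to be compared (S10), the matrix information generation unit 22 generates matrix information from them (S12). The matrix information is converted into a binary image. The first extraction processing unit 24 performs extraction on the matrix image based on diagonal continuity (S14). The second extraction processing unit 26 then performs extraction based on the amount of mutation in each of the determination regions (S16). Next, the edge adjustment processing unit 28 deletes the points remaining at the edges of the matrix image (S18), and the length comparison adjustment processing unit 30 performs adjustment that keeps only one of the points remaining in each row (S20). This yields the lines representing homology. The homology sequence generation processing unit 32 generates sequence information (Fig. 18) expressing the homology indicated by the lines obtained (S22). The output processing unit 34 displays the matrix image and the sequence information in accordance with user instructions (S24).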
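[0084]
Fig. 20 shows a variation of the filter used in the processing of the first extraction processing unit 24. With the filter of Fig. 5 described earlier, points are kept when three or more points continue in the diagonal direction. With the filter of Fig. 20, only the pixel on the rear side in the diagonal direction, that is, the diagonally lower pixel, is examined. If there is a point diagonally below the point of interest, that point is kept. More points are kept with this filter.
[0085]
Instead of the above filter, a filter that determines whether there is a point at the diagonally upper pixel may be used. A filter that keeps the point of interest if there is a point on either the upper or lower side in the diagonal direction (that is, a filter that keeps points when two or more are connected diagonally) may also be applied.
[0086]
Fig. 21 shows another filter suited to the processing of the first extraction processing unit 24. With this filter, when three points line up in the column direction (vertically), the central point is deleted. The filter of Fig. 21 is used in combination with the filter of Fig. 5: the processing of the Fig. 5 filter is followed by that of the Fig. 21 filter. The filter of Fig. 21 may also be combined with the filter of Fig. 20. Using the filter of Fig. 21 deletes more points, which reduces the subsequent computation and so enables further speed-up.
[0087]
The significance of the filter of Fig. 21 is explained with reference to Figs. 22 to 24. As shown in Fig. 22, suppose the vertical sequence is AAAB and the horizontal sequence is Axxx, where A, B and x each denote one amino acid and x is an arbitrary amino acid. In such a case three vertically consecutive points occur, which is the situation addressed by the filter of Fig. 21.
[0088]
For the horizontal sequence of Fig. 22, consider the case where the second character is B (ABxx) and the case where the third character is B (AxBx). A sequence such as AxxxxB gives the same result in the following discussion. For AAxx, considering the string from the second character onward gives the same result as the following discussion.
[0089]
In Fig. 23 the horizontal sequence is ABxx, that is, the second character is B. The points line up as shown, and the shortest path in this range is line L. In this case, deleting the central point α of the three vertical points does not change the shortest path. The central point α may therefore be deleted.
[0090]
In Fig. 24 the horizontal sequence is AxBx, that is, the third character is B. The shortest path in this range is one of the three paths shown on the left of the figure. The right of the figure shows the corresponding representations using gaps. Every path has the same number of gaps. When the number of gaps is used as the score representing the amount of mutation, the three representations of Fig. 24 have the same score. Even if the central point α of the three vertical points is deleted, the shortest path remains one of these three and no other path results. The central point α may therefore be deleted.
[0091]
As shown above, deleting the central point α does not change the shortest path, so point α may be deleted. This embodiment therefore deletes point α with the filter of Fig. 21. This reduces the computation of the subsequent processing and speeds it up.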
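The Fig. 21 filter can be sketched as below for the column direction. Neighbours are tested against the unmodified input, so in a vertical run of four points both inner points are removed; that behaviour for longer runs, and the function name `drop_vertical_centres`, are assumptions of this sketch since the text only discusses runs of three.

```python
def drop_vertical_centres(matrix):
    """Delete the centre point of every three vertically consecutive points."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [row[:] for row in matrix]
    for r in range(1, rows - 1):
        for c in range(cols):
            # Test against the input so the result does not depend on scan order.
            if matrix[r - 1][c] and matrix[r][c] and matrix[r + 1][c]:
                out[r][c] = 0
    return out
```

For the example in the text (vertical AAAB against horizontal ABxx), the three points in the first column lose their centre point while the point for B in the second column is untouched.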
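[0092]
The filter of Fig. 21 may delete the central point of three consecutive points in the row direction instead of the column direction. It may also delete the central point both when three consecutive points occur in the column direction and when they occur in the row direction.
[0093]
Fig. 25 shows a preferred configuration example of the first extraction processing unit 24. In this form the first extraction processing unit 24 can handle a plurality of filter types. As shown in Fig. 25, it has a filter selection unit 40, an extraction determination unit 42 and an erasure processing unit 44.
[0094]
The filter selection unit 40 presents the selectable filter types on the display. When the user selects the desired filter with the keyboard and mouse, the selection is accepted, and the filter selection unit 40 sets the filter to be used for extraction accordingly.
[0095]
For example, the filters of Figs. 5, 20 and 21 are stored and selectable. From these, the filter selection unit 40 selects and sets one of: Fig. 5, Fig. 20, the combination of Figs. 5 and 21, or the combination of Figs. 20 and 21.
[0096]
The extraction determination unit 42 uses the set filter to determine whether each point in the matrix image should be extracted. Points to be extracted are kept; the other points are erased by the erasure processing unit 44.
[0097]
When using the apparatus of this embodiment, the user selects a filter before the homology search. If the user makes no selection, a default filter (for example, that of Fig. 5) is used. The user can examine the results and change the filter. In this way a suitable filter can be chosen so that appropriate results are obtained.
[0098]
In addition, the first extraction processing unit 24 may be configured so that parameters in a filter can be changed. For example, the number of consecutive points in the filter of Fig. 5 ("3" in Fig. 5) may be changed.
[0099]
Fig. 26 shows a preferred configuration example of the second extraction processing unit 26. In this form the second extraction processing unit 26 is configured so that the shape of the determination regions can be changed. As shown in Fig. 26, it has a band height setting unit 50, a band width setting unit 52, a threshold mutation count setting unit 54, an extraction determination unit 56 and an erasure processing unit 58.
[0100]
The band height setting unit 50, band width setting unit 52 and threshold mutation count setting unit 54 set the band height, band width and threshold mutation count parameters, respectively, according to the user's input. The extraction determination unit 56 performs its processing using the parameters set. Regions of the shape given by the parameters are set, and the number of empty rows in each region is determined. The extraction determination unit 56 then determines whether the "number of empty rows" is at most "threshold mutation count - band width count", that is, whether "number of empty rows + band width count" is at most the "threshold mutation count". Based on the result, points to be extracted are kept; the other points are erased by the erasure processing unit 58.
[0101]
When using the apparatus of this embodiment, the user enters parameters such as band height, band width and threshold mutation count before the homology search. The apparatus may be configured so that an upper limit on the number of empty rows is entered instead of the threshold mutation count. If the user makes no selection, default values are used, for example band height 10, band width 5 and threshold mutation count 7 as in the earlier example. The user can examine the results and change the parameters. In this way suitable parameters can be chosen so that appropriate results are obtained.
[0102]
The parameters may be changed automatically by the second extraction processing unit 26. When the series of processing does not yield suitable homology lines (when it does not yield a result in which each row and each column contains one point and those points trace long lines), the parameters are changed. The computation is repeated, changing the parameters, until suitable lines are obtained.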
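[0103]
Fig. 27 shows a configuration example of the homology sequence generation processing unit 32. It includes a corresponding portion identification unit 60, a sequence creation unit 62 and a correction processing unit 64. The corresponding portion identification unit 60 identifies the corresponding portions of the two sequences from the lines representing homology. As already described with reference to the lines of Fig. 17, the character string corresponding to the projection of a line upward (onto the horizontal axis) and the character string corresponding to the projection of the line sideways (onto the vertical axis) are obtained as corresponding portions. For these corresponding portions, the sequence creation unit 62 generates sequence information expressing the homology, as shown in Fig. 18.
[0104]
The correction processing unit 64 corrects the homology lines before the processing of the corresponding portion identification unit 60. The case of Fig. 12 is an example in which correction is desirable. In Fig. 12, as already explained, all points in the region are deleted even though the actual number of mutations is small. This embodiment tolerates such cases in exchange for prioritizing processing speed. When the situation of Fig. 12 occurs, however, the line representing homology is broken partway. The correction processing unit 64 repairs the broken portion. For example, it fills in the broken portion according to the user's input. The filling-in may also be performed automatically; for example, the portions of the two sequences corresponding to the determination region are compared and the run of points representing matching positions is regenerated. Conventional, well-known homology search techniques may be applied here. Because the determination region is relatively small, the computation needed for this is also relatively small.
[0105]
The correction processing unit 64 need not be provided in the homology sequence generation processing unit 32. Corrections may be made as appropriate when the homology lines are displayed. Beyond the correction described above, the apparatus may be configured to make suitable corrections to the matrix image or the sequences at any stage of the embodiment, in response to instructions entered by the user (or automatically), so that appropriate results are obtained.
[0106]
This embodiment can of course be modified within the scope of the invention. For example, "rows" and "columns" may be interchanged in the various processing of this embodiment. In the processing of the second extraction processing unit 26, for instance, the number of columns may be used instead of the number of rows. In that case the parallelogram of a determination region is made up of vertical lines and diagonal lines, and the length of the vertical line corresponds to the band width.
[0107]
This embodiment uses the expression "diagonal direction". This is a direction that advances in both the row direction and the column direction on the matrix; it is at 45 degrees in the matrix image of this embodiment, and it is diagonal with respect to the components of the matrix. As stated above, the diagonal direction is the direction in which points line up when the sequences match consecutively. Depending on how the image is rendered, this direction may not be diagonal on the screen. For example, if the whole matrix image is deformed into a parallelogram, the "diagonal direction" on the matrix can become vertical on the screen. Even then it is diagonal in terms of the matrix components, so it may be called the "diagonal direction" in this embodiment. Likewise, other configurations whose visual appearance can be changed by the rendering of the image are within the scope of the invention if they are within that scope when viewed on a matrix such as that of this embodiment.
[0108]
In this embodiment adjacent determination regions did not overlap, but within the scope of the invention adjacent determination regions may overlap.
[0109]
In this embodiment the matrix information had elements (component 1, and points in the image) representing matches between amino acids in the two sequences. In another embodiment, the matrix information is given elements representing that amino acids are similar (which includes matching). Substitution matrices (for example, BLOSUM62) are known as a representation of amino acid similarity. Such a matrix represents the mutation cost of each combination of amino acids. When this mutation cost is at or above a predetermined value, component 1 is given to the matrix information; otherwise component 0 is given. The matrix information obtained in this way, and the matrix image (binary image) obtained from it, are processed in the same way as in the embodiment described above.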
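The similarity-based variant can be sketched by thresholding a pairwise score. The score table below is a hypothetical toy table made up for illustration; it is not BLOSUM62, and the names `similarity_matrix`, `toy_score`, `TOY_SCORES` and `min_score` are likewise assumptions of this sketch.

```python
def similarity_matrix(vertical_seq, horizontal_seq, score, min_score):
    """Return a 0/1 matrix with 1 where score(a, b) >= min_score."""
    return [
        [1 if score(a, b) >= min_score else 0 for b in horizontal_seq]
        for a in vertical_seq
    ]


# Hypothetical scores for illustration only (NOT real BLOSUM62 values).
TOY_SCORES = {("I", "L"): 2, ("D", "E"): 2}


def toy_score(a, b):
    """Toy symmetric score: 4 for identical characters, table value or -1 otherwise."""
    if a == b:
        return 4
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1))
```

With `min_score=2`, the pair I/L produces a point even though the characters differ, so the resulting matrix has more points than the exact-match matrix, as the text notes.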
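[0110]
In this variation the number of points in the matrix image is larger than in the embodiment described above, so computation may be somewhat slower. Compared with conventional methods, however, computation is still much faster, and the advantages of the invention are obtained. Improved reliability from taking similarity into account can also be expected.
[0111]
In this embodiment, everything from acquisition of the sequence information through generation of the matrix, extraction of the homology lines and generation of the sequences expressing homology was performed as one continuous process. Alternatively, the sequence information processing apparatus 1 may acquire ready-made matrix information from outside. The apparatus also need not have the function of creating sequences expressing homology; an image such as that of Fig. 17 alone provides considerably useful visual information.
[0112]
Although amino acid sequences were processed in this embodiment, the invention is applicable to sequence information other than amino acids, typically base sequences.
[0113]
The sequence information processing apparatus 1 of this embodiment may be connected to a network such as the Internet. Sequence information is sent from a computer via the network together with a search request. The apparatus performs the homology processing on the sequence information and returns the result. Various conditions, for example parameters relating to the region shape, are also obtained via the network.
[0114]
A preferred embodiment of the invention has been described above. In this embodiment, as stated, the first extraction step performed by the first extraction processing unit 24 extracts diagonally continuous elements from the matrix information. The second extraction step performed by the second extraction processing unit 26 then uses a plurality of determination regions set on the matrix information, performs extraction for each region, and extracts the elements in regions with few mutations. In this way a group of elements extending diagonally for a long distance on the matrix is extracted. Both steps are relatively simple and need little computation, so information representing homology can be obtained quickly.
[0115]
The second extraction step sets band-shaped determination regions extending diagonally on the matrix information, which in this embodiment are parallelograms. That is, it uses determination regions that extend in the diagonal direction, the direction in which elements line up when matching positions continue. Because the determination regions extend in the same direction as the element group representing homology, the necessary information is obtained accurately.
[0116]
The second extraction step extracts the elements in a determination region when the sum of the region width count and the number of positions with no element in the width direction (in the processing above, the sum of the band width count and the number of empty rows) is at most the threshold mutation count. Because the judgment on the number of mutations is simple, little computation is needed and still higher speed is possible.
[0117]
The first extraction step uses the filter of Fig. 21 so as not to extract the central element of three elements that are consecutive in an arrangement direction of the sequences. This reduces the elements to be processed in the second extraction step, enabling further speed-up.
[0118]
The invention generates sequence information expressing the homology indicated by the extracted element group. Preferably, information including gaps and substitutions is generated. Useful information expressing the homology of the two sequences is obtained from the element group extracted by fast computation.
[0119]
In view of the difficulty of setting determination regions at the edges of the matrix, the invention deletes the elements remaining at the edges from the matrix information that has undergone the second extraction step. This effectively deletes unnecessary points, so information representing homology is obtained more accurately.
[0120]
When a plurality of elements remain along an arrangement direction of the sequences in the matrix information that has undergone the second extraction step, the invention deletes unnecessary elements based on the length of the continuous portion that each remaining element forms with surrounding elements. Elements forming shorter continuous portions are deleted, and the element forming the longest continuous portion is kept. Unnecessary elements left by the second extraction step are thereby deleted, so information representing homology is obtained more accurately.
[0121]
The invention displays on a screen the matrix information having the extracted element group, so that the line drawn by the extracted element group is displayed. This display is useful as information that represents homology visually.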
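[0122]
Two sequences were compared in this embodiment. Within the scope of the invention, however, three or more sequences may of course be compared.
[0123]
When n sequences are compared, for example, two-dimensional matrix information for two of the n sequences is used as the multi-dimensional element group information of the invention (information composed of a group of elements representing the matching positions of the plurality of pieces of sequence information to be compared when they are arranged in different directions). The two-dimensional processing described in the above embodiment is performed using this two-dimensional matrix information.
[0124]
Here one sequence may be set as a reference. Homology search is performed on the two-dimensional matrix information for each combination of the reference sequence with one of the remaining sequences. Homology search using two-dimensional matrix information may also be performed for various combinations of two sequences chosen from the n sequences. Such processing is a combination of the two-dimensional processing described above and is included in the invention.
[0125]
When n sequences are compared, n-dimensional information having elements representing the matching positions of the n sequences may also be used as the multi-dimensional element group information of the invention. For example, when n = 3, information composed of a group of elements representing the matching positions of three sequences arranged in different directions is used. In a more detailed example, the three sequences are arranged along orthogonal x, y and z axes, and in the space formed by these axes a point is set at each position corresponding to where the three sequences match. This yields three-dimensional information that extends the two-dimensional matrix information to three dimensions. Homology search is performed on this three-dimensional element group information in accordance with the above embodiment. A line (run of points) representing homology makes an angle of 45 degrees with the x, y and z axes. In this way, within the scope of the invention, the two-dimensional processing described above may be extended to n-dimensional processing; for each kind of processing constituting the invention, processing that applies the same principle as the two-dimensional processing to n dimensions may be performed, and configurations that perform such processing are also included in the invention.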
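For n = 3, the element group can be sketched as a set of index triples rather than a dense three-dimensional array. The function name `match_points_3d` and the sparse representation are assumptions of this sketch; the text only specifies that a point is set wherever the three sequences match.

```python
def match_points_3d(seq_x, seq_y, seq_z):
    """Return the set of (i, j, k) where seq_x[i] == seq_y[j] == seq_z[k]."""
    # Hash table: character -> positions along the z axis.
    z_positions = {}
    for k, ch in enumerate(seq_z):
        z_positions.setdefault(ch, []).append(k)

    points = set()
    for i, a in enumerate(seq_x):
        for j, b in enumerate(seq_y):
            if a == b:
                for k in z_positions.get(a, ()):
                    points.add((i, j, k))
    return points
```

A stretch that is conserved in all three sequences then appears as points whose three indices increase together, the three-dimensional counterpart of the diagonal line in the matrix image.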
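[0126]
As described above, three or more sequences may be compared within the scope of the invention. The same applies to the embodiment described earlier that uses matrix information with elements representing similar positions of a plurality of sequences.
[0127]
[Effects of the invention]
As explained above, the present invention can provide a sequence information processing apparatus that enables fast homology search.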
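[Brief description of the drawings]
[Fig. 1] Diagram showing the hardware configuration of the sequence information processing apparatus according to an embodiment of the invention.
[Fig. 2] Functional block diagram showing the software configuration of the sequence information processing apparatus.
[Fig. 3] Diagram showing an example of the amino acid sequence information to be compared, as acquired by the sequence information acquisition unit.
[Fig. 4] Diagram showing an image of the matrix information created from the two sequences of Fig. 3.
[Fig. 5] Diagram showing the filter used by the first extraction processing unit of Fig. 2.
[Fig. 6] Diagram showing the matrix image after processing by the first extraction processing unit.
[Fig. 7] Diagram showing the band-shaped determination regions set on the matrix image.
[Fig. 8] Diagram showing one of the determination regions of Fig. 7.
[Fig. 9] Diagram showing the effect of mutations on the image.
[Fig. 10] Diagram showing an example of a region that is extracted by the second extraction processing unit.
[Fig. 11] Diagram showing an example of a region that is not extracted by the second extraction processing unit.
[Fig. 12] Diagram showing an example of a region that, exceptionally, is not extracted by the second extraction processing unit.
[Fig. 13] Diagram showing the matrix image after processing by the second extraction processing unit.
[Fig. 14] Diagram showing the arrangement of determination regions at the edge of the matrix image.
[Fig. 15] Diagram showing the matrix image after the edge correction.
[Fig. 16] Diagram showing the adjustment processing performed when one row contains a plurality of points.
[Fig. 17] Diagram showing the matrix image after the processing of Fig. 16.
[Fig. 18] Diagram showing the sequence information expressing homology, created from the extraction result of Fig. 17.
[Fig. 19] Diagram showing an overview of the whole processing of the sequence information processing apparatus.
[Fig. 20] Diagram showing a variation of the filter used by the first extraction processing unit.
[Fig. 21] Diagram showing a preferred filter used additionally by the first extraction processing unit.
[Fig. 22] Diagram for explaining why the filter of Fig. 21 is applicable.
[Fig. 23] Diagram for explaining why the filter of Fig. 21 is applicable.
[Fig. 24] Diagram for explaining why the filter of Fig. 21 is applicable.
[Fig. 25] Diagram showing a preferred configuration example of the first extraction processing unit.
[Fig. 26] Diagram showing a preferred configuration example of the second extraction processing unit.
[Fig. 27] Diagram showing a preferred configuration example of the homology sequence generation processing unit.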
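[Explanation of reference numerals]
1 Sequence information processing apparatus
20 Sequence information acquisition unit
22 Matrix information generation unit
24 First extraction processing unit
26 Second extraction processing unit
28 Edge adjustment processing unit
30 Length comparison adjustment processing unit
32 Homology sequence generation processing unit
34 Output processing unit
[0001]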
BACKGROUND OF THE INVENTION
The present invention relates to an apparatus that obtains information on homology by comparing a plurality of pieces of biological sequence information. Biological sequence information is typically the amino acid sequence of a protein or the base sequence of DNA. The invention is typically applied to processing that uses matrix information in which two sequences are arranged in the row direction and the column direction, respectively, and it speeds up this type of processing. The invention may also be applied to processing that compares three or more sequences.
[0002]
[Prior art]
In molecular biology, information processing technology is becoming increasingly useful for analyzing DNA, genes, proteins and the like. For homology search as well, various methods for obtaining highly reliable results with fast computation have been proposed and put into practical use.
[0003]
As is well known, homology search is a technique for comparing a plurality of sequences, such as amino acid sequences, to judge whether they are similar and to determine how they are similar. The comparison of two sequences is described here. Homology is known to be expressed using substitutions and gaps, which represent mutations between sequences. For proteins, a substitution means that different amino acids are present at corresponding positions of the two sequences. A gap means that an amino acid in one sequence is absent from the corresponding position of the other sequence; gaps arise from insertions and deletions of amino acids.
[0004]
Known information processing methods and algorithms for homology search include dynamic programming, BLAST and FASTA.
[0005]
Dynamic programming applies the principle of path finding to obtain the arrangement (alignment) of two sequences that minimizes the amount of mutation. Using a mutation cost for each pair of amino acids and a gap cost, the arrangement with the smallest cost is obtained.
[0006]
BLAST searches, without inserting gaps, for sites where two sequences match well locally (high-scoring fragments). Each fragment found is then extended forward and backward.
[0007]
FASTA uses a matrix in which two sequences are arranged in the row direction and the column direction, respectively. The matrix has elements that represent the positions where the two sequences match. In general a dot matrix is used, which is image information that represents each such element as a dot. For a protein, for example, when the i-th amino acid of one sequence matches the j-th amino acid of the other, the position at row i, column j is plotted. Locally matching portions (corresponding to the high-scoring fragments of BLAST) are then found from the dot matrix, and alignment by dynamic programming is performed on the region around each matching portion. Finally, only the long connected runs of points are extracted and displayed.
[0008]
These homology search methods are described, for example, in "Genes and Computers" (Akihiko Konagaya, Kyoritsu Shuppan Co., Ltd., pp. 67-79, 2000).
[0009]
Of these three methods, conventional dynamic programming is at a disadvantage in computation speed. BLAST is fast, but is said to be at a disadvantage in reliability, in that it may miss weak homology. FASTA is not as fast as BLAST but is faster than dynamic programming, and it is more reliable than BLAST.
[0010]
[Problems to be solved by the invention]
As described above, FASTA offers relatively high speed and reliability. However, as the volume of sequence data grows, still faster processing is constantly demanded, and reducing the amount of computation is considered an effective way to achieve it. Naturally, homology search with a small amount of computation must be made possible while sufficient reliability is maintained.
[0011]
The present invention has been made in view of the above problem, and its object is to provide a sequence information processing apparatus that enables fast homology search.
[0012]
[Means for Solving the Problems]
To achieve this object, the present invention uses matrix information of the kind used in the conventional FASTA method. However, as described below, the invention obtains homology information through new data processing that differs from FASTA. The invention is also not limited to two-dimensional processing that compares two sequences; it may be applied to processing that compares three or more sequences.
[0013]
One aspect of the present invention is a sequence information processing apparatus that obtains information on homology by comparing a plurality of pieces of biological sequence information such as amino acid sequences and DNA sequences. The apparatus includes a sequence information acquisition unit that receives two pieces of sequence information to be compared, a two-dimensional matrix image information generation unit, a first extraction processing unit, a second extraction processing unit, and an output processing unit that outputs the two-dimensional matrix image information processed by the second extraction processing unit.
[0014]
The two-dimensional matrix image information generation unit arranges the two pieces of sequence information to be compared in different directions and generates two-dimensional matrix image information. It compares the two sequences and, for every combination of positions in the two sequences, sets a first value as the matrix element when the two sequences match and a second value different from the first value when they do not. This yields two-dimensional matrix image information composed of a group of elements representing the positions where the two sequences match.
The first extraction processing unit determines, for the two-dimensional matrix image information, whether elements corresponding to matching positions of the two sequences continue in the diagonal direction for at least a predetermined number, and generates two-dimensional matrix image information in which the elements determined to be continuous are extracted.
[0015]
The second extraction processing unit uses a plurality of diagonally oriented parallelogram determination regions that are set side by side, adjacent to one another, on the two-dimensional matrix image information processed by the first extraction processing unit. For each determination region, it determines whether the sum of (a) the region width count, which is the length of the side of the parallelogram along the arrangement direction of one sequence, and (b) the number of empty tiers, that is, tiers within the determination region that contain no match element along that arrangement direction, is at most a predetermined threshold mutation count. It then generates two-dimensional matrix image information in which the elements of the determination regions judged to be at or below the threshold mutation count are extracted.
[0016]
In this way, a group of elements that extends for a long distance in the direction in which elements line up when matching positions continue, that is, a group of elements representing the homology of the compared sequences, is extracted. According to the invention, both the first and second extraction processing units are relatively simple and can be realized with little computation, so information representing homology can be obtained quickly. Moreover, because the above processing makes the judgment based on the number of mutations with simple operations, the amount of computation is reduced further and still higher speed becomes possible.
[0017]
In the present invention, diagonally oriented parallelogram regions are preferably set on the matrix information of the two sequences. These determination regions extend in the same direction as the element group representing homology, so the necessary information is obtained accurately. To implement the processing of the second extraction processing unit, the "number of positions where no element exists" may instead be compared with the "difference between the threshold mutation count and the region width count".
[0018]
  In the above-described array information processing apparatus, the first extraction processing unit may determine whether or not three or more elements corresponding to the matching positions of the two arrays are consecutive in the oblique direction with respect to the two-dimensional matrix image information..
[0019]
  In the above-described array information processing apparatus, the first extraction processing unit may determine whether or not three elements corresponding to matching positions of the two arrays are continuous in at least one of the two arrangement directions and, when they are determined to be continuous, may leave the center element of the three elements unextracted. According to this configuration, since the number of elements to be processed by the second extraction processing unit is reduced, the speed can be further increased.
[0020]
  The above-described sequence information processing apparatus may further include a homology sequence generation processing unit that generates sequence information representing the homology indicated by the two-dimensional matrix image information that has been processed by the second extraction processing unit. According to this configuration, useful information expressing the homology of a plurality of sequences can be obtained using an element group extracted by high-speed calculation.
[0021]
  In the above-described sequence information processing apparatus, the homology sequence generation processing unit may generate information including gaps and substitutions. According to this configuration, useful information expressing the homology of a plurality of sequences can be obtained using an element group extracted by high-speed calculation.
[0022]
  The above-described array information processing apparatus may further include an edge adjustment processing unit that deletes an element corresponding to a matching portion of two arrays remaining in the edge where the determination region is not set in the two-dimensional matrix image information. For example, when a matrix is divided into parallelogram regions, no region is set at the edge of the matrix, and unnecessary elements may remain at the edge of the matrix. Since such an element is deleted by the edge adjustment processing unit having this configuration, information representing homology is obtained more accurately.
[0023]
  The above-described array information processing apparatus may further include a length comparison adjustment processing unit. When, in the two-dimensional matrix image information that has been processed by the second extraction processing unit, two or more extracted elements remain along at least one of the two arrangement directions, this unit deletes unnecessary elements based on the length of the continuous portion that each remaining element forms with the surrounding extracted elements. According to this configuration, since unnecessary elements left by the second extraction processing unit are deleted, information representing homology can be obtained more accurately.
[0024]
  In the above-described array information processing apparatus, the output processing unit may include a display processing unit that displays the two-dimensional matrix image information that has undergone processing by the second extraction processing unit on a screen. According to this configuration, a line drawn by the extracted element group is displayed on the screen. This screen display is usefully used as information that visually represents homology.
[0026]
  A sequence information processing apparatus according to another aspect of the present invention obtains information on homology by comparing two pieces of biological sequence information such as amino acid sequences and DNA base sequences. The apparatus comprises: a sequence information acquisition unit that acquires the two pieces of sequence information to be compared; a two-dimensional matrix image information generation unit that arranges the two sequences in different directions, compares them, and, for all combinations of elements of the two sequences, sets a first value as the matrix element when the two elements are similar and a second value different from the first value when they are not, thereby obtaining two-dimensional matrix image information composed of elements representing similar positions of the two sequences; a first extraction processing unit that determines, for the two-dimensional matrix image information, whether or not a predetermined number or more of elements corresponding to similar positions are continuous in the oblique direction, and generates two-dimensional matrix image information from which the elements determined to be continuous are extracted; a second extraction processing unit that uses a plurality of adjacent parallelogram determination areas set in the two-dimensional matrix image information processed by the first extraction processing unit and, as processing for each determination area, determines whether or not the sum of the area width number, which is the length of the side of the parallelogram along the arrangement direction of one sequence, and the number of empty stages in the determination area in which no element exists along that arrangement direction is equal to or less than a predetermined threshold mutation number, and generates two-dimensional matrix image information in which the elements of the determination areas determined to be at or below the threshold mutation number are extracted; and an output processing unit that outputs the two-dimensional matrix image information that has undergone the processing by the second extraction processing unit. In this configuration, two-dimensional matrix image information having elements that represent positions where, for example, amino acids are similar (including positions where they match) is used. Taking similar positions into account increases the reliability of the homology search. Although the amount of calculation increases compared with the method using only elements corresponding to matching positions, the homology search can still be realized with a small amount of calculation compared with conventional methods.
[0027]
  The present invention is not limited to the apparatus aspects described above. Other aspects of the present invention include, for example, a program for causing a computer to function as the above apparatus, and a computer-readable medium on which such a program is recorded.
[0028]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention (hereinafter referred to as embodiments) will be described.
[0029]
In the present embodiment, the present invention is applied to information processing of amino acid sequences that are one form of biological sequences. Of course, the present invention may be applied to information processing of other sequence information, for example, base sequences.
[0030]
In the present embodiment, the present invention is applied to information processing for comparing two sequences. This information processing uses two-dimensional matrix information having an element representing a matching portion of both arrays when the first array and the second array to be compared are arranged in different directions.
[0031]
In the present embodiment, the directions in which the two sequences are arranged are referred to as the row direction and the column direction. The diagonal (oblique) direction is the direction in which elements line up when the first sequence and the second sequence match continuously. In the present embodiment, image information is used, the first sequence and the second sequence are arranged orthogonally, and the spacing between elements is the same in the row direction and the column direction, so the diagonal direction forms an angle of 45 degrees with the row and column directions.
[0032]
FIG. 1 shows a hardware configuration of an array information processing apparatus according to an embodiment of the present invention. The array information processing apparatus 1 includes a CPU 3, a ROM 5, a RAM 7, a hard disk 9, a medium mounting unit 11, a keyboard 13, a mouse 15, a display 17, and a communication device 19.
[0033]
The keyboard 13, the mouse 15, the display 17, and the communication device 19 function as input/output devices. Other input/output devices may be provided as appropriate. The communication device 19 may be a device that communicates with a nearby device by infrared communication or the like, or a device that communicates over a network such as a LAN or the Internet. A plurality of these types of communication devices may be provided.
[0034]
The medium loading unit 11 is loaded with a recording medium such as a flexible disk or a compact disk. The medium loading unit 11 can also be regarded as an information input / output device for a recording medium.
[0035]
The array information processing apparatus 1 may be a general-purpose computer. The array information processing apparatus 1 is configured by installing a program that causes a computer to realize the information processing function of the present embodiment.
[0036]
FIG. 2 is a functional block diagram showing the configuration of the array information processing apparatus 1. Each illustrated component is realized by executing the program. As illustrated, the array information processing apparatus 1 includes an array information acquisition unit 20, a matrix information generation unit 22, a first extraction processing unit 24, a second extraction processing unit 26, an edge adjustment processing unit 28, a length comparison adjustment processing unit 30, a homology sequence generation processing unit 32, and an output processing unit 34. Hereinafter, each of these components will be described with reference to the drawings.
[0037]
FIG. 3 shows an example of the amino acid sequence information to be compared, as acquired by the sequence information acquisition unit 20. The sequence information is read from, for example, a recording medium loaded in the medium loading unit 11. In this case, the medium loading unit 11 functions as an input device for sequence information. The sequence information may also be acquired by other means such as a communication device, or read from the hard disk 9.
[0038]
The matrix information generation unit 22 generates matrix information from the two sequences acquired by the sequence information acquisition unit 20. The matrix information generation unit 22 compares the i-th character (amino acid) of one sequence with the j-th character (amino acid) of the other sequence. When the two characters match, 1 is set as the matrix element (component). When the two characters do not match, 0 is set as the matrix element. Matrix information (DP matrix) is obtained by performing this process on all combinations of characters in the two sequences. Further, the matrix information generation unit 22 generates a binary image representing the matrix information. In the image, dots (points) are placed at the positions corresponding to matrix elements equal to 1.
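The comparison described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment; the function name `match_matrix` and the choice of Python are assumptions made here for the example.

```python
def match_matrix(seq_a, seq_b):
    """Return m with m[i][j] = 1 when seq_a[i] == seq_b[j], else 0.

    seq_a is laid out along the column (vertical) direction and seq_b
    along the row (horizontal) direction; this layout is an assumption.
    """
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


# Render the matrix as a crude dot image: '*' marks a matching position.
for row in match_matrix("ABCDE", "ABXDE"):
    print("".join("*" if v else "." for v in row))
```

Every pair of characters is compared once, so the cost of this step grows with the product of the two sequence lengths.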
[0039]
FIG. 4 shows an image of matrix information obtained as described above. The example of FIG. 4 is a matrix obtained from the amino acid sequence of CRE-BP1 and the amino acid sequence of MUSMXBP.
[0040]
The matrix obtained in this way has “elements” that represent locations where both arrays match when one array is arranged in the row direction and the other array is arranged in the column direction. An “element” is a point on an image in the present embodiment.
[0041]
The matrix information is generated using, for example, a Java (registered trademark) program. This program provides a function for comparing two sequences. The function returns a matrix whose elements are 1 where the characters of the two sequences match and 0 otherwise. Further, the program places dots at the positions on the image corresponding to 1. The above processing can be made faster by using a hash table.
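The text does not say how the hash table is used. One plausible reading, shown here in Python rather than Java purely for brevity, is to index the positions of each character in one sequence so that only matching pairs are ever visited. The name `match_points` and the set-of-coordinates output are assumptions of this sketch.

```python
from collections import defaultdict


def match_points(seq_a, seq_b):
    """Return the set of (i, j) with seq_a[i] == seq_b[j].

    Assumed approach: hash each character of seq_b to its positions,
    then look up each character of seq_a instead of scanning seq_b.
    """
    positions = defaultdict(list)
    for j, ch in enumerate(seq_b):
        positions[ch].append(j)
    return {(i, j)
            for i, ch in enumerate(seq_a)
            for j in positions.get(ch, ())}


points = match_points("ABCDE", "ABXDE")
print(sorted(points))
```

The resulting set holds the same information as the 1 entries of the matrix, in a sparse form that is convenient for the filtering steps.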
[0042]
Hereinafter, in the present embodiment, “significant information”, that is, information representing homology, is obtained through image processing using the matrix of FIG. 4. The information to be obtained is, roughly speaking, a group of points on a straight line, where this straight line (1) is as long as possible and (2) has an inclination of minus 45 degrees (downward to the right), provided that the pixel spacing is the same in the vertical and horizontal directions.
[0043]
In FIG. 4, diagonal lines representing homology already appear relatively clearly. Points on this line are extracted in subsequent processing. Even if the line of homology is not as clear as in FIG. 4, the line is extracted by the subsequent processing.
[0044]
As described above, in the present embodiment, image processing is performed using the image of FIG. 4. However, in a preferred modification, the processing may be performed using the matrix information obtained in the previous stage, that is, matrix information having elements of 1 and 0.
[0045]
Lines representing homology are extracted mainly by the first extraction processing unit 24 and the second extraction processing unit 26. Of these, the first extraction processing unit 24 extracts, from the matrix image of FIG. 4, the portions in which the matching positions of the two sequences show a predetermined continuity. Since points in the image line up diagonally where matching positions are consecutive, the first extraction processing unit 24 extracts points having the predetermined continuity in the diagonal direction.
[0046]
FIG. 5 shows a filter used for the processing of the first extraction processing unit 24. When three points are consecutive in the oblique direction, this filter keeps those points; such points correspond to the points having the predetermined continuity in the present embodiment. All other points are deleted. As described above, the oblique direction is the 45-degree direction reached by advancing the same number of pixels vertically and horizontally.
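The effect of this filter can be sketched on a set of (row, column) points. The figure itself is not reproduced here, so the code below is only an assumed reading: a point is kept when it belongs to a diagonal run of at least three points. The name `diagonal_filter` and the parameter `min_run` are hypothetical.

```python
def diagonal_filter(points, min_run=3):
    """Keep only points lying on a down-right diagonal run of >= min_run points."""
    kept = set()
    for (i, j) in points:
        if (i - 1, j - 1) in points:
            continue  # not the upper-left end of a run; handled from its start
        run = []
        while (i, j) in points:  # walk down-right along the diagonal
            run.append((i, j))
            i, j = i + 1, j + 1
        if len(run) >= min_run:
            kept.update(run)
    return kept


pts = {(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (4, 0)}
print(sorted(diagonal_filter(pts)))
```

Each run is walked once from its upper-left end, so the work is proportional to the number of points rather than to the size of the image.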
[0047]
FIG. 6 shows the image after extraction by the filter. Many points not related to homology have disappeared. However, many unnecessary points still remain. Thus, the first extraction processing unit 24 performs rough extraction as preprocessing for the subsequent processing by the second extraction processing unit 26.
[0048]
Next, the extraction process by the second extraction unit 26 will be described.
[0049]
As shown in FIG. 7, the second extraction processing unit 26 sets a plurality of strip-shaped determination areas on the matrix. Each determination area is a parallelogram. The determination areas are set by dividing the matrix into a large number of parallelograms, so that they are laid out across the whole matrix.
[0050]
FIG. 8 shows one determination area. As illustrated, in the example of the present embodiment, the height (band height) of each determination region is 10 pixels, and the width (band width) is 5 pixels. The number of band widths (number of points) corresponds to the number of area widths of the present invention. The angle of the hypotenuse of the parallelogram is 45 degrees. Therefore, the determination region extends in the oblique direction of the present embodiment, that is, the direction in which elements are arranged when the arrangement information continuously matches.
[0051]
The second extraction processing unit 26 performs an extraction process using the set determination areas. The extraction process is performed for each determination area. Based on the distribution of points in the determination area, the second extraction processing unit 26 extracts points in the determination area having a predetermined distribution related to sequence variation in units of areas. In the present embodiment, the predetermined distribution is a distribution in which it is determined that the number of mutations in the region is equal to or less than a predetermined threshold mutation number. All points in the region having such a distribution are extracted. When it is determined that the number of mutations exceeds the threshold number, all points in the region are erased.
[0052]
Mutations include substitutions and gaps, as is well known. A substitution means that there are different amino acids at corresponding positions in the two sequences. A gap means that an amino acid in one sequence has no counterpart at the corresponding position in the other sequence. Gaps are caused by amino acid insertions or deletions. The number of mutations in this embodiment is the total number of substitutions and gaps.
[0053]
The following determination is made regarding the threshold mutation number. In the present embodiment, a row of the region in which no point exists is referred to as a blank line. A blank line corresponds to a location in the determination region where no element exists along the arrangement direction of one sequence in the present invention. When the sum of the number of blank lines and the band width number is equal to or less than the threshold mutation number, the second extraction processing unit 26 determines that the number of mutations in the region is equal to or less than the threshold mutation number, and leaves the points in the region. In actual processing, the second extraction processing unit 26 may instead determine whether the number of blank lines is equal to or less than the difference between the threshold mutation number and the band width number.
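The per-region determination can be sketched as below. The default values (height 10, width 5, threshold 7) are those of the embodiment; the exact placement of the parallelogram (horizontal top and bottom sides, left and right sides slanting down-right at 45 degrees from a given top-left cell) and the names `region_rows` and `keep_region` are assumptions of this sketch.

```python
def region_rows(top, left, height=10, width=5):
    """Yield the cells of each row of a parallelogram slanted 45 degrees down-right."""
    for r in range(height):
        yield [(top + r, left + r + c) for c in range(width)]


def keep_region(points, top, left, height=10, width=5, threshold=7):
    """Return True when (blank lines + band width number) <= threshold mutation number."""
    blank_lines = sum(
        1 for row in region_rows(top, left, height, width)
        if not any(cell in points for cell in row)
    )
    return blank_lines + width <= threshold


# A diagonal line through the region with two points missing (two blank lines).
line = {(r, r + 2) for r in range(10)} - {(3, 5), (6, 8)}
print(keep_region(line, top=0, left=0))
```

When `keep_region` returns False, all points inside the region would be erased; when it returns True, they are all retained.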
[0054]
The reason why necessary determination can be made by such simple processing will be described.
[0055]
FIG. 9 shows part of two similar sequences. Here, in order to simplify the description, letters A to E are used instead of letters representing actual amino acids.
[0056]
FIG. 9A shows a case where there is no mutation. In this case, the points on the matrix are arranged obliquely. FIG. 9B shows a case where there is a substitution. A substitution produces an empty row as shown (an empty column is produced at the same time). FIG. 9C shows a case where there is a gap in the sequence in the row direction (lateral direction). In this case, an empty row is generated on the matrix. FIG. 9D shows a case where there is a gap in the sequence in the column direction. In this case, an empty column is generated on the matrix, and the line is shifted by one pixel in the width direction. With regard to FIG. 9D, attention is paid mainly to this line shift.
[0057]
Assuming that the line representing the homology passes through the determination region, the number of blank lines equals the sum of the number of substitutions (FIG. 9B) and the number of gaps in the row-direction sequence (FIG. 9C). On the other hand, each gap in the column-direction sequence shifts the line by one pixel in the row direction, as shown in FIG. 9D. Therefore, the band width number corresponds to the maximum number of gaps that the column-direction sequence can have within the region. From the above, if the sum of the number of blank lines and the band width number is less than or equal to the threshold mutation number, that is, if the number of blank lines is 2 (= 7 (threshold mutation number) − 5 (band width number)) or less, the total number of mutations can be said to be at or below the threshold mutation number.
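The counting argument of this paragraph can be written compactly. The symbols are introduced here only for illustration and do not appear in the original: E is the number of blank lines in a determination region, W the band width number, T the threshold mutation number, S the number of substitutions, and G_r and G_c the numbers of gaps in the row-direction and column-direction sequences.

```latex
\begin{align}
E &= S + G_r, \qquad G_c \le W, \\
S + G_r + G_c &\le E + W, \\
E + W \le T &\iff E \le T - W = 7 - 5 = 2 .
\end{align}
```

Under the assumption that the homology line passes through the region, a region with at most two blank lines therefore contains at most T = 7 mutations for the values used in the embodiment.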
[0058]
In the example of FIG. 10, a line representing homology passes through the determination region. In this case, the number of substitutions is 1, and the number of gaps in the array in the row direction is 1. Since the number of blank lines is 2 or less, the points in this area are extracted.
[0059]
In the example of FIG. 11, the line representing the homology does not pass through the determination region. In this case, the points in the region do not form a line. It happens that there are only two sets of continuous points in the region. In this case, since the number of blank lines is large, the points in the area are deleted.
[0060]
By the way, the above extraction process has the following limitations.
[0061]
Refer to FIG. 12. In this example, there are two substitutions in the region. In addition, there is one gap in the row-direction sequence. On the other hand, there is no gap in the column-direction sequence. Therefore, the number of blank lines is 3, and the total number of mutations is 3. In this case, although the total number of mutations is 7 or less, the region is excluded from extraction because the number of blank lines exceeds 2.
[0062]
The present embodiment allows such a situation. If the line representing the homology passes through the region, the situation as shown in FIG. 12 is unlikely to occur. Therefore, if the number of blank lines is equal to or less than a predetermined number (if the sum of the number of blank lines and the number of bandwidths is equal to or less than the threshold mutation number), the number of mutations in the region is regarded as being equal to or less than the threshold mutation number.
[0063]
Similarly, when the region shape and the threshold mutation number are set differently, the opposite situation may arise, that is, a situation where the points in a region are extracted even though the actual number of mutations exceeds the threshold mutation number. The present embodiment permits such a situation as well.
[0064]
FIG. 13 shows a matrix image that has been subjected to extraction processing by the second extraction processing unit 26. In the example of this embodiment, the band height is 10 and the band width number is 5, as described above. The threshold mutation number is 7. Therefore, in FIG. 13, only the points in regions where the number of blank lines is 2 or less remain. This condition is relatively loose. Nevertheless, as shown in FIG. 13, most unnecessary points are effectively deleted. This indicates that the extraction process of the present embodiment is very effective even though it is a simple process.
[0065]
The extraction process by the second extraction processing unit 26 has been described above. As shown in FIG. 13, unnecessary points still remain after the above extraction process. This unnecessary point is deleted by the following process. This deletion processing is performed by the edge adjustment processing unit 28 and the length comparison adjustment processing unit 30.
[0066]
First, the processing of the edge adjustment processing unit 28 will be described.
[0067]
FIG. 14 shows part of the edge of the matrix image. Only determination areas with a complete shape are set on the matrix image. A partial determination area that would be cut off by the image boundary is not set. As a result, no determination area is set at the edge of the image.
[0068]
For example, consider the upper and lower edges of an image. When the height of the image is not an integral multiple of the height of the determination area, no determination area is set on at least one of the upper and lower edges. When attention is paid to the edge in the horizontal direction, the edge of the image cannot be completely covered with the determination area when the parallelogram determination areas are arranged. As is apparent from FIG. 14, the ends of the parallelogram set draw a jagged line.
[0069]
As described above, since the determination region is not set at the edge, the process of the second extraction processing unit 26 is not performed on the edge. As a result, as shown in FIG. 13, unnecessary points that are not related to homology remain at the edge. Therefore, the edge adjustment processing unit 28 deletes the edge point. In the present embodiment, all points at locations where no determination area is set are deleted.
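The edge adjustment can be sketched as a set intersection between the remaining points and the cells covered by complete determination areas. How the areas are laid out is not specified in detail, so the list of region origins passed in (`origins`) is a hypothetical input, as are the names `covered_cells` and `trim_edges`; the parallelogram orientation is the same assumption as before (sides slanting down-right at 45 degrees).

```python
def covered_cells(origins, height=10, width=5):
    """Return every cell inside the parallelograms whose top-left cells are in origins."""
    cells = set()
    for top, left in origins:
        for r in range(height):
            for c in range(width):
                cells.add((top + r, left + r + c))
    return cells


def trim_edges(points, origins, height=10, width=5):
    """Delete all points that lie where no determination area is set."""
    return points & covered_cells(origins, height, width)


# Tiny example with a single 2 x 2 parallelogram at the origin.
pts = {(0, 0), (1, 0), (1, 2)}
print(sorted(trim_edges(pts, [(0, 0)], height=2, width=2)))
```

In the example, the point (1, 0) falls in the triangular corner that the slanted area does not reach, which is the kind of leftover the edge adjustment removes.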
[0070]
FIG. 15 shows a matrix image with edge correction. Unnecessary points on the edge disappear.
[0071]
Next, the processing of the length comparison adjustment processing unit 30 will be described. As described below, when a plurality of points remain in one row of the matrix image, the length comparison adjustment processing unit 30 deletes unnecessary points based on the length of the continuous portion that each point forms with the points in the preceding and following rows.
[0072]
Refer to FIG. 16. If only the point group on the line representing the homology were extracted, there should be only one point in each row. Thus, when there are multiple points in a row, only one of them should be left as a point on the homology line. The point to be left is the one that is part of a long diagonal line (point sequence). In FIG. 16, point A should remain and point B should be erased.
[0073]
Therefore, the length comparison adjustment processing unit 30 checks, for each row, whether or not there are a plurality of points. When there are a plurality of points, the length of the line that each point forms together with the points in the preceding and succeeding rows is obtained. The point forming the longest line is left, and the other points are deleted. At this time, in a situation such as that shown in FIG. 10, it is preferable to regard the line as continuous. Therefore, the length comparison adjustment processing unit 30 determines that the line is not interrupted when points continue after a predetermined number (for example, two) or fewer mutations (substitutions and gaps, which appear as blank rows and blank columns in the image).
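A simplified sketch of this adjustment follows. It measures the line through each point only along the exact 45-degree diagonal and bridges up to `max_skip` consecutive missing points, which covers substitutions but not the one-pixel shifts caused by gaps; that restriction, the tie-breaking rule (leftmost point wins), and the names `line_length` and `keep_longest_per_row` are all assumptions made for illustration.

```python
def line_length(points, i, j, max_skip=2):
    """Count points on the diagonal through (i, j), bridging up to max_skip blanks."""
    total = 1
    for step in (1, -1):  # walk down-right, then up-left
        k, skipped = step, 0
        while skipped <= max_skip:
            if (i + k, j + k) in points:
                total += 1
                skipped = 0
            else:
                skipped += 1
            k += step
    return total


def keep_longest_per_row(points, max_skip=2):
    """In each row keep only the point whose diagonal line is longest."""
    by_row = {}
    for (i, j) in points:
        by_row.setdefault(i, []).append(j)
    kept = set()
    for i, cols in by_row.items():
        # sorted() makes ties deterministic: the leftmost candidate is kept
        best = max(sorted(cols), key=lambda j: line_length(points, i, j, max_skip))
        kept.add((i, best))
    return kept


pts = {(k, k) for k in range(6)} | {(2, 5), (3, 6)}
print(sorted(keep_longest_per_row(pts)))
```

Lengths are computed against the original point set before anything is deleted, so the outcome does not depend on the order in which rows are visited.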
[0074]
FIG. 17 shows a matrix image that has been processed by the length comparison adjustment processing unit 30. The unnecessary points remaining in FIG. 15 have been deleted. In this way, the line representing homology is suitably extracted.
[0075]
In the present embodiment, when a plurality of points remain in one row, a process of deleting unnecessary points is performed. In the modification, when a plurality of points remain in one column, unnecessary points may be deleted by the same process. That is, the correction process may be performed using either of the two arrangement directions. Further, the above-described processing may be performed for both arrangement directions. Generally, when a plurality of points remain in one row, a plurality of points remain in one column at the same time. Therefore, even if attention is paid to either the row or the column, generally the same result is obtained, and it can be said that the processing is substantially the same.
[0076]
Next, the homology sequence generation processing unit 32 generates sequence information representing the homology indicated by the line in FIG. 17. The lines in FIG. 17 indicate which part of one sequence is similar to which part of the other sequence. The sequence portion onto which a line projects upward is similar to the sequence portion onto which the same line projects to the left. These similar portions are compared, and through this comparison substitutions and gaps are determined. The two sequences are then laid out so that the correspondence between the similar portions, the substitutions, and the gaps can be read off. The above processing can be performed automatically by a computer program.
[0077]
Regarding this processing, as is apparent from FIG. 9, a substitution exists at a place where a blank row and a blank column occur simultaneously in the matrix image (FIG. 9B), a gap in the row-direction sequence exists at a place where only a blank row occurs (FIG. 9C), and a gap in the column-direction sequence exists at a place where only a blank column occurs (FIG. 9D). These substitutions and gaps are identified, and information representing them is generated. It is considered desirable to check the mutation information obtained from blank rows and blank columns against both sequences at the relevant location, to confirm that it is appropriate, and to make any necessary corrections. This is done automatically by the computer.
[0078]
FIG. 18 shows the sequence information obtained by the above processing. Substitutions appear as differing characters at corresponding positions. Gaps are indicated by “−”. In the figure, there are two places where a long run of gaps occurs. These gap runs occur because the line is divided into three in FIG. 17 (the leftmost part is extremely short). That is, in FIG. 17, the sequence arranged in the horizontal direction has a large number of amino acids in the portions where the line is interrupted. These amino acids are not present in the vertical sequence. Therefore, many gaps continue in the vertical sequence.
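Turning an extracted line into gapped sequence information can be sketched as follows. This is a minimal illustration under stated assumptions: the extracted points are strictly increasing in both row and column, unmatched stretches between consecutive points are paired up as substitutions with any excess written as gaps, and an ASCII hyphen stands in for the gap symbol. The function name `gapped_alignment` is hypothetical.

```python
def gapped_alignment(seq_a, seq_b, points):
    """Lay out seq_a (vertical) and seq_b (horizontal) along the extracted points.

    points: (i, j) pairs with seq_a[i] matched to seq_b[j], assumed to be
    strictly increasing in both i and j.
    """
    out_a, out_b = [], []
    prev_i, prev_j = -1, -1
    # The sentinel flushes whatever follows the last matched pair.
    for i, j in sorted(points) + [(len(seq_a), len(seq_b))]:
        seg_a = seq_a[prev_i + 1:i]  # unmatched characters of seq_a
        seg_b = seq_b[prev_j + 1:j]  # unmatched characters of seq_b
        n = max(len(seg_a), len(seg_b))
        out_a.append(seg_a.ljust(n, "-"))  # pad the shorter side with gaps
        out_b.append(seg_b.ljust(n, "-"))
        if i < len(seq_a):  # skip the sentinel
            out_a.append(seq_a[i])
            out_b.append(seq_b[j])
        prev_i, prev_j = i, j
    return "".join(out_a), "".join(out_b)


top, bottom = gapped_alignment("ABCDE", "ABDE", {(0, 0), (1, 1), (3, 2), (4, 3)})
print(top)
print(bottom)
```

A long interruption of the line, as between the segments of FIG. 17, simply shows up as a long unmatched stretch on one side and therefore as a long run of gaps on the other.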
[0079]
The array information processing apparatus 1 further includes an output processing unit 34 (FIG. 2). The output processing unit 34 functions as a display processing unit, and performs processing for displaying information obtained through the above-described processing on the display.
[0080]
The output processing unit 34 displays the gapped sequence information of FIG. 18. The output processing unit 34 also displays the matrix image of FIG. 17. This screen display is useful as information that visually represents homology.
[0081]
The output processing unit 34 may display the matrix image at each stage obtained in the above-described process in accordance with a user instruction. The user's instruction is accepted using the input device. The user can adjust parameters and the like by looking at the screen display. This adjustment is also accepted from the input device. This adjustment will be further described later.
[0082]
Further, the output processing unit 34 may perform output to a device other than the display. In addition to output to a printer, information may be output to the outside using a communication device. Information may also be stored in a recording medium such as a flexible disk.
[0083]
FIG. 19 shows the overall outline of the processing according to this embodiment. When the sequence information acquisition unit 20 acquires the two sequence information to be compared (S10), the matrix information generation unit 22 generates matrix information from the two sequence information (S12). Matrix information is converted into a binary image. The first extraction processing unit 24 uses the matrix image to perform extraction processing based on diagonal continuity (S14). Then, the second extraction processing unit 26 performs an extraction process based on the amount of mutation in each of the plurality of determination areas (S16). Further, the edge adjustment processing unit 28 deletes points remaining at the edge of the matrix image (S18). The length comparison / adjustment processing unit 30 performs adjustment processing that leaves only one of the plurality of points remaining in each row (S20). From the above, a line representing homology is obtained. The homology sequence generation processing unit 32 generates sequence information (FIG. 18) representing the homology indicated by the obtained line (S22). The output processing unit 34 displays the matrix image and the arrangement information according to the user's instruction (S24).
[0084]
FIG. 20 shows a modification of the filter used in the processing of the first extraction processing unit 24. With the filter of FIG. 5 described above, points are kept when three or more of them are consecutive in the oblique direction. With the filter of FIG. 20, only the pixel on the rear side in the diagonal direction, that is, the diagonally lower pixel, is examined. If there is a point diagonally below the point of interest, the point of interest is kept. More points remain when this filter is used.
[0085]
Instead of the above filter, a filter that determines whether or not there is a point in the diagonally upper pixel may be used. Alternatively, a filter that keeps the point of interest if there is a point on either the diagonally upper or the diagonally lower side (that is, a filter that keeps points when two or more are connected diagonally) may be applied.
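These neighbor-based variants can be sketched with a single function. The name `neighbor_filter` and its two flags are hypothetical; with the defaults it examines only the diagonally lower pixel, and with both flags set it keeps any point that has a diagonal neighbor on either side.

```python
def neighbor_filter(points, check_lower=True, check_upper=False):
    """Keep a point when a selected diagonal neighbor is also a point."""
    kept = set()
    for (i, j) in points:
        lower = check_lower and (i + 1, j + 1) in points  # diagonally lower pixel
        upper = check_upper and (i - 1, j - 1) in points  # diagonally upper pixel
        if lower or upper:
            kept.add((i, j))
    return kept


pts = {(0, 0), (1, 1), (3, 0)}
print(sorted(neighbor_filter(pts)))
print(sorted(neighbor_filter(pts, check_lower=True, check_upper=True)))
```

Note that the lower-only variant, as sketched, drops the last point of every diagonal run, since that point has no diagonally lower neighbor.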
[0086]
FIG. 21 shows another filter suitable for the processing of the first extraction processing unit 24. When this filter is used, the center point is deleted when three points are aligned in the column direction (vertical direction). The filter of FIG. 21 is used in combination with the filter of FIG. 5: following the filter processing of FIG. 5, the filter processing of FIG. 21 is performed. The filter of FIG. 21 may also be combined with the filter of FIG. 20. By using the filter of FIG. 21, more points are deleted, so the amount of subsequent calculation is reduced and further speedup is possible.
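The center-deleting filter can be sketched as below. The name `drop_vertical_centers` is hypothetical, and the handling of vertical runs longer than three (every interior point counts as a center and is removed) is an assumption, since the text only describes the three-point case.

```python
def drop_vertical_centers(points):
    """Delete each point that has points directly above and directly below it."""
    centers = {(i, j) for (i, j) in points
               if (i - 1, j) in points and (i + 1, j) in points}
    return points - centers


pts = {(0, 0), (1, 0), (2, 0), (5, 5)}
print(sorted(drop_vertical_centers(pts)))
```

The centers are collected from the original set before any deletion, so the result does not depend on iteration order.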
[0087]
The meaning of the filter of FIG. 21 will be described with reference to FIGS. 22 to 24. As shown in FIG. 22, assume that the vertical sequence is AAAB and the horizontal sequence is Axxx. A, B, and x each denote one amino acid, and x is any amino acid. In such a case, three points are continuous in the vertical direction, that is, a situation arises in which the filter of FIG. 21 is applied.
[0088]
Consider, for the horizontal sequence of FIG. 22, the case where the second character is B (ABxx) and the case where the third character is B (AxBx). Even with a sequence such as AxxxxB, the following discussion gives the same result. As for AAxx, the same result is obtained by considering the character string from the second character onward.
[0089]
In FIG. 23, the horizontal sequence is ABxx, that is, the second character is B. The points are arranged as shown in the figure, and the shortest path in this range is the line L. In this case, even if the central point α among the three points in the vertical direction is deleted, the shortest path does not change. Therefore, the central point α may be deleted.
[0090]
In FIG. 24, the horizontal arrangement is AxBx, that is, the third character is B. The shortest path in this range is one of the three paths shown on the left side in the figure. On the right side of the figure, a representation using gaps is shown. Every path has the same number of gaps. When the number of gaps is used as a score representing the amount of mutation, the three expressions in FIG. 22 have the same score. Even if the center point α of the three vertical points is deleted, the shortest path is one of the three, and it is not the other path. Therefore, the central point α may be deleted.
[0091]
As described above, even if the central point α is deleted, the shortest path does not change, so the point α may be deleted. Therefore, in this embodiment, the point α is deleted by the filter of FIG. 21. Thereby, the calculation amount of the subsequent processing is reduced, and the processing can be speeded up.
[0092]
Note that the filter of FIG. 21 may delete the center point of three consecutive points in the row direction instead of the column direction. Further, the filter of FIG. 21 may delete the center point both when three points are consecutive in the column direction and when three points are consecutive in the row direction.
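The center-point deletion described for FIG. 21 can be sketched as below. This is a minimal illustration under stated assumptions: the matrix is a list of lists of 0 and 1, and for runs longer than three points (a case the text does not specify) every point that has a point directly above and directly below it is deleted. The function names are hypothetical.

```python
def delete_vertical_centers(m):
    """Delete the center point wherever three points are consecutive
    in the column (vertical) direction. Decisions are made on the input
    matrix, so earlier deletions do not affect later ones."""
    out = [row[:] for row in m]
    for i in range(1, len(m) - 1):
        for j in range(len(m[0])):
            if m[i - 1][j] and m[i][j] and m[i + 1][j]:
                out[i][j] = 0
    return out


def delete_horizontal_centers(m):
    """Row-direction variant, obtained by transposing, filtering, and
    transposing back."""
    transposed = [list(col) for col in zip(*m)]
    filtered = delete_vertical_centers(transposed)
    return [list(col) for col in zip(*filtered)]
```

With the vertical sequence AAAB and the horizontal sequence ABxx, the first column holds three consecutive points, and only the middle one is removed.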
[0093]
FIG. 25 shows a preferred configuration example of the first extraction processing unit 24. In this form, the first extraction processing unit 24 can handle a plurality of types of filters. As illustrated in FIG. 25, the first extraction processing unit 24 includes a filter selection unit 40, an extraction determination unit 42, and an erasure processing unit 44.
[0094]
The filter selection unit 40 presents a plurality of types of filters that can be selected on the display. When the user selects a desired filter using the keyboard and mouse, the selection is accepted. The filter selection unit 40 sets a filter to be used for extraction according to this selection.
[0095]
For example, the filters of FIGS. 5, 20, and 21 are stored and can be selected. The filter selection unit 40 selects and sets one of the following: the filter of FIG. 5, the filter of FIG. 20, the combination of FIGS. 5 and 21, or the combination of FIGS. 20 and 21.
[0096]
The extraction determination unit 42 determines whether or not each point in the matrix image should be extracted using the set filter. The points to be extracted remain. Other points are erased by the erasure processing unit 44.
[0097]
When using the apparatus of this embodiment, the user selects a filter prior to the homology search. When the user does not select, a default filter (for example, FIG. 5) is used. The user can change the filter by examining the processing result. In this way, an appropriate filter can be selected so as to obtain an appropriate result.
[0098]
In addition, the first extraction processing unit 24 may be configured so that a parameter of a filter can be changed. For example, the number of consecutive points in the filter of FIG. 5 (“3” in FIG. 5) may be changed.
[0099]
FIG. 26 shows a preferred configuration example of the second extraction processing unit 26. In this form, the second extraction processing unit 26 is configured so that the shape of the determination region can be changed. As shown in FIG. 26, the second extraction processing unit 26 includes a band height setting unit 50, a band width setting unit 52, a threshold mutation number setting unit 54, an extraction determination unit 56, and an erasure processing unit 58.
[0100]
The band height setting unit 50, the band width setting unit 52, and the threshold mutation number setting unit 54 set the band height, the band width, and the threshold mutation number, respectively, according to input operations by the user. The extraction determination unit 56 performs processing using the set parameters. A region having a shape according to the parameters is set, and the number of blank lines in the region is obtained. Then, the extraction determination unit 56 determines whether the “number of blank lines” is equal to or less than “threshold mutation number − band width”, that is, whether “number of blank lines + band width” is equal to or less than the “threshold mutation number”. Based on the determination result, the points to be extracted remain. Other points are erased by the erasure processing unit 58.
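The determination above can be sketched in Python as follows. The geometry is an assumption made for illustration: the parallelogram is taken to have horizontal sides of length `band_width` and oblique sides that shift one column per row, so that row r of the region covers columns `left + r` to `left + r + band_width - 1`. The function names and the `top`/`left` parameters are hypothetical.

```python
def count_blank_rows(m, top, left, band_height, band_width):
    """Count the rows ("blank lines") of a parallelogram determination
    region that contain no point. Cells outside the matrix are ignored."""
    blank = 0
    for r in range(band_height):
        i = top + r
        row_has_point = any(
            m[i][j]
            for j in range(left + r, left + r + band_width)
            if 0 <= i < len(m) and 0 <= j < len(m[0])
        )
        if not row_has_point:
            blank += 1
    return blank


def region_is_extracted(blank_rows, band_width, threshold):
    """The region is kept when blank rows + band width <= threshold."""
    return blank_rows + band_width <= threshold
```

With a band width of 5 and a threshold mutation number of 7, a region is kept only when it has at most two blank lines.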
[0101]
When using the apparatus of this embodiment, the user inputs parameters such as the band height, the band width, and the threshold mutation number prior to the homology search. Instead of the threshold mutation number, an upper limit on the number of blank lines may be input. When the user does not input them, default values are used, for example, a band height of 10, a band width of 5, and a threshold mutation number of 7, as in the above example. The user can change the parameters after examining the processing results. In this manner, appropriate parameters can be selected so that appropriate results are obtained.
[0102]
The parameters may be changed automatically by the second extraction processing unit 26. If a series of processing does not produce an appropriate line of homology (an appropriate line being one where there is one point in each row and one point in each column, and those points draw a long line), the parameters are changed. The calculation is repeated while changing the parameters until a suitable line is obtained.
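The automatic parameter change can be pictured as a simple retry loop. The sketch below is only an illustration: `run_pipeline` stands for a hypothetical callable that performs the whole series of processing and returns the remaining points as (row, column) pairs, and `min_length` is an assumed stand-in for the "long line" criterion, which the text does not quantify.

```python
def is_appropriate_line(points, min_length):
    """True when no row and no column holds more than one point and
    the line contains at least min_length points."""
    rows = [i for i, _ in points]
    cols = [j for _, j in points]
    return (len(points) >= min_length
            and len(set(rows)) == len(rows)
            and len(set(cols)) == len(cols))


def search_with_retries(run_pipeline, parameter_sets, min_length):
    """Try each parameter set in turn until an appropriate line is found.
    Returns (params, points), or (None, []) if every set fails."""
    for params in parameter_sets:
        points = run_pipeline(**params)
        if is_appropriate_line(points, min_length):
            return params, points
    return None, []
```

The order of `parameter_sets` decides which parameters are tried first; how the apparatus would choose that order is not described in the text.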
[0103]
FIG. 27 shows a configuration example of the homology sequence generation processing unit 32. The homology sequence generation processing unit 32 includes a corresponding part specifying unit 60, a sequence creation unit 62, and a correction processing unit 64. The corresponding part specifying unit 60 specifies corresponding parts of the two sequences from the line representing the homology. As described above, referring to the line in FIG. 17, the character string corresponding to the projection of the line onto the upper side (horizontal axis) and the character string corresponding to the projection of the line onto the lateral side (vertical axis) are determined as the corresponding parts. For these corresponding parts, the sequence creation unit 62 generates sequence information expressing homology, as shown in FIG. 18.
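One possible way to turn a homology line into sequence information with gaps and substitutions is sketched below. This is an assumption-laden illustration rather than the method of the sequence creation unit 62: the line is given as a list of (row, column) points strictly increasing in both coordinates, unmatched residues between two points are paired off as substitutions first, and any surplus on one side is written against a gap character `-`. The function name is hypothetical.

```python
def alignment_from_line(vertical, horizontal, points):
    """Build two aligned strings (horizontal, vertical) covering the span
    from the first to the last point of the homology line."""
    top, bottom = [], []
    pi, pj = points[0]
    top.append(horizontal[pj])
    bottom.append(vertical[pi])
    for i, j in points[1:]:
        di, dj = i - pi - 1, j - pj - 1      # unmatched residues between points
        paired = min(di, dj)
        for k in range(1, paired + 1):       # substitutions
            top.append(horizontal[pj + k])
            bottom.append(vertical[pi + k])
        for k in range(paired + 1, dj + 1):  # surplus in horizontal: gap in vertical
            top.append(horizontal[pj + k])
            bottom.append("-")
        for k in range(paired + 1, di + 1):  # surplus in vertical: gap in horizontal
            top.append("-")
            bottom.append(vertical[pi + k])
        top.append(horizontal[j])
        bottom.append(vertical[i])
        pi, pj = i, j
    return "".join(top), "".join(bottom)
```

For the vertical sequence ACDEF and the horizontal sequence ACEF with the line (0,0), (1,1), (3,2), (4,3), the sketch yields `AC-EF` over `ACDEF`.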
[0104]
The correction processing unit 64 corrects the homology line prior to the processing of the corresponding part specifying unit 60. An example of a case where correction is desirable is the case of FIG. 12. In FIG. 12, as already described, all the points in the region are deleted even though the actual number of mutations is small. The present embodiment tolerates such a situation in exchange for processing speed. However, when the situation of FIG. 12 occurs, the line representing the homology is interrupted. The correction processing unit 64 performs processing for repairing the interrupted portion. For example, the correction processing unit 64 fills in the interrupted portion in accordance with an input operation by the user. This filling-in may also be performed automatically. For example, the portions of the two sequences corresponding to the determination area are compared, and the sequence of points representing the matching portions is reproduced. Here, a conventionally known homology search technique may be applied. Since the determination area is relatively small, the amount of calculation for the filling-in process can be relatively small.
[0105]
The correction processing unit 64 need not be provided in the homology sequence generation processing unit 32. Corrections may be made as appropriate when the line of homology is displayed. Further, the correction is not limited to the one described above: the apparatus may be configured to modify the matrix image or the sequence information at any stage of the above-described embodiment, in response to an instruction given by a user input operation (or automatically), so that an appropriate result is obtained.
[0106]
Of course, this embodiment can be modified within the scope of the present invention. For example, in various processes in the present embodiment, “row” and “column” may of course be interchanged. For example, in the process of the second extraction processing unit 26, the number of columns may be used instead of the number of rows. In this case, the parallelogram of the determination region is composed of a vertical line and an oblique line, and the length of the vertical line corresponds to the band width.
[0107]
In the present embodiment, the expression “oblique direction” is used. This is a direction that advances in both the row direction and the column direction on the matrix. In the matrix image of the present embodiment it is the 45-degree direction, and it is oblique when the matrix components are taken as the reference. As described above, the oblique direction is the direction in which dots are arranged when the sequences match continuously. Depending on how the image is represented, this direction may not be oblique on the screen. For example, when the entire matrix image is deformed into a parallelogram, the “oblique direction” on the matrix can be a vertical direction on the screen. Such a case is still an oblique direction in terms of the matrix components, and is therefore treated as the “oblique direction” of the present embodiment. Likewise, even if other configurations look different because of how the image is represented, they are included in the scope of the present invention as long as they fall within that scope when viewed on the matrix as in the present embodiment.
[0108]
In the present embodiment, adjacent determination areas do not overlap. However, adjacent determination areas may overlap within the scope of the present invention.
[0109]
In the present embodiment, the matrix information has elements (component 1, points in the image) representing amino acid matches between the two sequences. In another embodiment, the matrix information is given elements representing amino acid similarity (including matches). A substitution matrix (for example, BLOSUM62) is known as a measure of amino acid similarity. This substitution matrix represents the mutation cost of each amino acid combination. When the mutation cost is equal to or higher than a predetermined value, component 1 is given to the matrix information. In other cases, component 0 is given. The matrix information obtained in this way and the matrix image (binary image) obtained from it are processed in the same manner as in the above-described embodiment.
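The similarity-based matrix can be sketched as a thresholded lookup. The score table below is a toy table with made-up illustrative values, not an actual substitution matrix, and the function names, the `match_score`, and the `default` score are all assumptions for this example.

```python
# Hypothetical toy scores for a few amino acid pairs (NOT real matrix values).
TOY_SCORES = {("I", "L"): 2, ("I", "V"): 3, ("K", "R"): 2}


def pair_score(a, b, scores, match_score=4, default=-1):
    """Score for a pair of residues; identical residues get match_score,
    unknown pairs get the default score. Lookup is order-independent."""
    if a == b:
        return match_score
    return scores.get((a, b), scores.get((b, a), default))


def similarity_matrix(vertical, horizontal, scores, threshold):
    """Binary matrix: 1 where the pair score is at or above the threshold."""
    return [[1 if pair_score(a, b, scores) >= threshold else 0
             for b in horizontal] for a in vertical]
```

Raising the threshold to the match score reduces this to the exact-match matrix of the main embodiment; lowering it adds points for similar residues.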
[0110]
In the present modification, the number of points in the matrix image is increased as compared with the above-described embodiment, so that the calculation speed may be slightly slower. However, compared with the prior art, the calculation speed is greatly increased, and the advantages of the present invention can be obtained. In addition, improvement in reliability can be expected by considering similarity.
[0111]
In this embodiment, the process from acquisition of sequence information to generation of a matrix, extraction of lines of homology, and generation of sequences expressing the homology are performed consistently. On the other hand, the array information processing apparatus 1 may acquire already completed matrix information from the outside. The array information processing apparatus 1 may not have a function of creating an array that expresses homology. Considerable information can be obtained visually only with the image as shown in FIG.
[0112]
In this embodiment, the amino acid sequence is processed, but the present invention can also be applied to processing of sequence information other than amino acids. Typically, the present invention is applicable to base sequence processing.
[0113]
Further, the sequence information processing apparatus 1 of the present embodiment may be connected to a network such as the Internet. In that case, sequence information is sent together with a search request from a computer via the network. The sequence information processing apparatus 1 performs homology search processing on the sequence information and returns the processing result. Various conditions, for example, parameters related to the region shape, are also acquired through the network.
[0114]
The preferred embodiment of the present invention has been described above. In the present embodiment, as described above, the first extraction step by the first extraction processing unit 24 extracts elements that are continuous in an oblique direction from the matrix information. Next, the second extraction step by the second extraction processing unit 26 performs extraction processing for each determination region, using a plurality of determination regions set in the matrix information, and extracts the elements in determination regions with few mutations. In this way, an element group that is long in the diagonal direction on the matrix is extracted. Since both the first extraction step and the second extraction step are relatively simple and can be realized with a small amount of calculation, information representing homology can be obtained at high speed.
[0115]
In the second extraction step, a band-shaped determination region extending in an oblique direction on the matrix information, that is, a parallelogram region in this embodiment, is set. The determination region extends in the oblique direction, that is, in the direction in which elements are arranged when matching portions of the sequences are continuous. Since a determination region extending in the same direction as the element group representing the homology is set, necessary information can be accurately obtained.
[0116]
The second extraction step extracts the elements in a determination area when the sum of the area width number and the number of portions having no element in the width direction (the sum of the band width and the number of blank lines in the above processing) is equal to or less than the threshold mutation number. Since the determination regarding the number of mutations is simple, the amount of calculation is small, and therefore a further speedup is possible.
[0117]
In the first extraction step, the central element of three elements that are continuous in the arrangement direction of a sequence is not extracted, using the filter of FIG. 21. As a result, the number of elements to be processed in the second extraction step is reduced, so that a further increase in speed is possible.
[0118]
Further, the present invention generates sequence information expressing the homology indicated by the extracted element group. Preferably, information including gaps and substitutions is generated. Useful information expressing the homology of both sequences can be obtained by using elements extracted by high-speed calculation.
[0119]
In addition, considering that it is difficult to set a determination region at the edge of the matrix, the present invention deletes elements remaining at the edge from the matrix information that has undergone the second extraction step. Thereby, unnecessary points are effectively deleted, and information representing homology is obtained more accurately.
[0120]
In addition, when a plurality of elements remain in the arrangement direction of a sequence in the matrix information that has undergone the second extraction step, the present invention deletes unnecessary elements based on the length of the continuous part that each remaining element forms with surrounding elements. The elements that form shorter continuous parts are deleted, and the element that forms the longest continuous part is left. According to the present invention, since unnecessary elements remaining after the second extraction step are deleted, information representing homology can be obtained more accurately.
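The length comparison described above can be sketched as follows for the row direction. This is an illustration under assumptions: the "continuous part" is taken to be the run of strictly adjacent diagonal points through each candidate, run lengths are measured on the input matrix, and on a tie the leftmost candidate is kept. The function names are hypothetical.

```python
def diagonal_run_length(m, i, j):
    """Number of consecutive diagonal points in the run through (i, j)."""
    def on(r, c):
        return 0 <= r < len(m) and 0 <= c < len(m[0]) and m[r][c] == 1
    length, k = 1, 1
    while on(i - k, j - k):
        length, k = length + 1, k + 1
    k = 1
    while on(i + k, j + k):
        length, k = length + 1, k + 1
    return length


def keep_longest_per_row(m):
    """In every row holding several points, keep only the point whose
    diagonal run is longest and delete the others."""
    out = [row[:] for row in m]
    for i, row in enumerate(m):
        cols = [j for j, v in enumerate(row) if v]
        if len(cols) > 1:
            best = max(cols, key=lambda j: diagonal_run_length(m, i, j))
            for j in cols:
                if j != best:
                    out[i][j] = 0
    return out
```

The column-direction variant would apply the same function to the transposed matrix.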
[0121]
Further, the present invention displays matrix information having the extracted element group on the screen. Thereby, a line drawn by the extracted element group is displayed on the screen. This screen display is usefully used as information that visually represents homology.
[0122]
In this embodiment, two sequences were compared. However, it will be appreciated that more than two sequences may be compared within the scope of the present invention.
[0123]
When comparing n sequences, for example, two-dimensional matrix information for two of the n sequences is used as the multi-dimensional element group information of the present invention (information composed of a group of elements representing the matching points when a plurality of pieces of sequence information to be compared are arranged in different directions). Using the two-dimensional matrix information, the two-dimensional processing described in the above embodiment is performed.
[0124]
Here, one sequence may be set as a reference. A homology search is performed from two-dimensional matrix information based on the combination of the reference sequence and each of the remaining sequences. Further, a homology search using two-dimensional matrix information may be performed for various combinations of two sequences selected from the n sequences. Such processing is a combination of the two-dimensional processes described above and is included in the present invention.
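The two ways of choosing sequence pairs mentioned above can be enumerated as in the following sketch. The function names and the use of index pairs are assumptions for illustration; each returned pair would then be fed to the two-dimensional processing.

```python
from itertools import combinations


def reference_pairs(sequences, ref_index=0):
    """Pairs of (reference, other) indices, one per non-reference sequence."""
    return [(ref_index, k) for k in range(len(sequences)) if k != ref_index]


def all_pairs(sequences):
    """Every combination of two distinct sequence indices."""
    return list(combinations(range(len(sequences)), 2))
```

For n sequences the reference scheme yields n − 1 comparisons, while the all-pairs scheme yields n(n − 1)/2.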
[0125]
Further, when comparing n sequences, n-dimensional information having elements representing the matching portions of the n sequences may be used as the multi-dimensional element group information of the present invention. For example, when n = 3, information composed of an element group representing the matching points when three sequences are arranged in different directions is used. In a more detailed example, three sequences are arranged along the orthogonal x-axis, y-axis, and z-axis, respectively. In the space formed by the x-axis, y-axis, and z-axis, a point is set at each position where the three sequences match. As a result, three-dimensional information is obtained by extending the two-dimensional matrix information to three dimensions. For this three-dimensional element group information, homology search processing is performed according to the above-described embodiment. A line (point sequence) representing homology advances equally along the x-axis, the y-axis, and the z-axis, so that its projection onto each coordinate plane forms 45 degrees with the axes. Thus, within the scope of the present invention, the above-described two-dimensional processing may be extended to n-dimensional processing. The various processes constituting the present invention may apply the same principles as the two-dimensional processing to n dimensions, and a configuration for performing such processing is also included in the present invention.
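For n = 3, the element group and a diagonal-continuity check can be sketched as follows. This is only an illustration of extending the two-dimensional idea: points are stored as a list of (i, j, k) index triples rather than a dense three-dimensional array, and the filter shown is a three-dimensional analogue of examining the next point along the diagonal. The function names are hypothetical.

```python
def match_points_3d(a, b, c):
    """All (i, j, k) where a[i], b[j] and c[k] are the same character."""
    return [(i, j, k)
            for i, x in enumerate(a)
            for j, y in enumerate(b)
            for k, z in enumerate(c)
            if x == y == z]


def keep_with_diagonal_successor(points):
    """Keep a point only if the next point along the (1, 1, 1) direction
    is also present."""
    present = set(points)
    return [(i, j, k) for (i, j, k) in points
            if (i + 1, j + 1, k + 1) in present]
```

The brute-force enumeration costs time proportional to the product of the three sequence lengths, so it is suitable only for small examples.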
[0126]
As described above, three or more sequences may be compared within the scope of the present invention. The same applies to the above-described embodiment that uses matrix information having elements representing similar portions of a plurality of sequences.
[0127]
[Effect of the invention]
As explained above, according to the present invention, a sequence information processing apparatus that enables high-speed homology search can be provided.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a hardware configuration of a sequence information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing a software configuration of the sequence information processing apparatus.
FIG. 3 is a diagram illustrating an example of amino acid sequence information to be compared, acquired by the sequence information acquisition unit.
FIG. 4 is a diagram showing an image of matrix information created from the two sequences in FIG. 3.
FIG. 5 is a diagram illustrating a filter used in the first extraction processing unit of FIG. 2.
FIG. 6 is a diagram illustrating a matrix image after processing by the first extraction processing unit.
FIG. 7 is a diagram showing a plurality of strip-shaped determination areas set on a matrix image.
FIG. 8 is a diagram showing one of the plurality of determination areas shown in FIG. 7.
FIG. 9 is a diagram showing the effect of mutation on an image.
FIG. 10 is a diagram illustrating an example of a case where a determination area becomes an extraction target in the second extraction processing unit.
FIG. 11 is a diagram illustrating an example of a case where a determination area does not become an extraction target in the second extraction processing unit.
FIG. 12 is a diagram illustrating an example of a case where a determination area exceptionally does not become an extraction target in the second extraction processing unit.
FIG. 13 is a diagram showing a matrix image after processing by the second extraction processing unit.
FIG. 14 is a diagram illustrating an arrangement of determination areas at the edge of a matrix image.
FIG. 15 is a diagram illustrating a matrix image in which edge correction has been performed.
FIG. 16 is a diagram illustrating an adjustment process when there are a plurality of points in one row.
FIG. 17 is a diagram illustrating a matrix image on which the processing of FIG. 16 has been performed.
FIG. 18 is a diagram showing sequence information representing homology created from the extraction result of FIG. 17.
FIG. 19 is a diagram illustrating an overall outline of processing by the sequence information processing apparatus.
FIG. 20 is a diagram illustrating a modification of the filter used by the first extraction processing unit.
FIG. 21 is a diagram showing a suitable filter that is additionally used by the first extraction processing unit.
FIG. 22 is a diagram for explaining that the filter of FIG. 21 is applicable.
FIG. 23 is a diagram for explaining that the filter of FIG. 21 is applicable.
FIG. 24 is a diagram for explaining that the filter of FIG. 21 is applicable.
FIG. 25 is a diagram illustrating a preferred configuration example of the first extraction processing unit.
FIG. 26 is a diagram illustrating a preferred configuration example of the second extraction processing unit.
FIG. 27 is a diagram illustrating a preferred configuration example of the homology sequence generation processing unit.
[Explanation of symbols]
1 Sequence information processing apparatus
20 Sequence information acquisition unit
22 Matrix information generation unit
24 First extraction processing unit
26 Second extraction processing unit
28 Edge adjustment processing unit
30 Length comparison adjustment processing unit
32 Homology sequence generation processing unit
34 Output processing unit

Claims

A sequence information processing apparatus for obtaining information on homology by comparing two pieces of biological sequence information, such as amino acid sequences and DNA sequences, comprising:
a sequence information acquisition unit that receives two pieces of sequence information to be compared;
a two-dimensional matrix image information generation unit that generates two-dimensional matrix image information in which the two pieces of sequence information to be compared are arranged in different directions, the unit comparing the two sequences and, for all combinations of elements of the two sequences, setting a first value as a matrix element where the two sequences match and setting a second value different from the first value as a matrix element where the two sequences do not match, thereby obtaining two-dimensional matrix image information composed of an element group representing matching portions of the two sequences;
a first extraction processing unit that determines, with respect to the two-dimensional matrix image information, whether or not a predetermined number or more of elements corresponding to matching portions of the two sequences are continuous in an oblique direction, and generates two-dimensional matrix image information in which the elements determined to be continuous are extracted;
a second extraction processing unit that uses a plurality of parallelogram determination areas set adjacently in order along the oblique direction in the two-dimensional matrix image information processed by the first extraction processing unit and, as processing for each determination area, determines whether or not the sum of an area width number, which is the length of the side of the parallelogram of the determination area in the arrangement direction of one sequence, and the number of blank stages, which are those stages arranged in the determination area in which no element of a matching portion exists in that arrangement direction, is equal to or less than a predetermined threshold mutation number, and generates two-dimensional matrix image information in which the elements of the determination areas determined to be equal to or less than the threshold mutation number are extracted; and
an output processing unit that outputs the two-dimensional matrix image information processed by the second extraction processing unit.

The sequence information processing apparatus according to claim 1,
wherein the first extraction processing unit determines, with respect to the two-dimensional matrix image information, whether or not three or more elements corresponding to matching portions of the two sequences are continuous in the oblique direction.

The sequence information processing apparatus according to claim 1 or 2,
wherein the first extraction processing unit determines whether or not three elements corresponding to matching portions of the two sequences are continuous in at least one of the two arrangement directions, and does not extract the central element of the three elements determined to be continuous.

The sequence information processing apparatus according to any one of claims 1 to 3,
further comprising a homology sequence generation processing unit that generates sequence information representing the homology indicated by the two-dimensional matrix image information processed by the second extraction processing unit.

The sequence information processing apparatus according to claim 4,
wherein the homology sequence generation processing unit generates information including gaps and substitutions.

The sequence information processing apparatus according to any one of claims 1 to 5,
further comprising an edge adjustment processing unit that deletes elements corresponding to matching portions of the two sequences that remain at an edge of the two-dimensional matrix image information where no determination area is set.

The sequence information processing apparatus according to claim 1,
further comprising a length comparison adjustment processing unit that, when a plurality of extracted elements remain along at least one of the two arrangement directions in the two-dimensional matrix image information processed by the second extraction processing unit, deletes unnecessary elements based on the length of the continuous portion that each remaining element forms with surrounding extracted elements.

The sequence information processing apparatus according to any one of claims 1 to 7,
wherein the output processing unit includes a display processing unit that displays, on a screen, the two-dimensional matrix image information processed by the second extraction processing unit.

A sequence information processing apparatus for obtaining information on homology by comparing two pieces of biological sequence information, such as amino acid sequences and DNA sequences, comprising:
a sequence information acquisition unit that acquires two pieces of sequence information to be compared;
a two-dimensional matrix image information generation unit that generates two-dimensional matrix image information in which the two pieces of sequence information to be compared are arranged in different directions, the unit comparing the two sequences and, for all combinations of elements of the two sequences, setting a first value as a matrix element where the two elements are similar and setting a second value different from the first value as a matrix element where the two elements are not similar, thereby obtaining two-dimensional matrix image information composed of an element group representing similar portions of the two sequences;
a first extraction processing unit that determines, with respect to the two-dimensional matrix image information, whether or not a predetermined number or more of elements corresponding to similar portions of the two sequences are continuous in an oblique direction, and generates two-dimensional matrix image information in which the elements determined to be continuous are extracted;
a second extraction processing unit that uses a plurality of parallelogram determination areas set adjacently in order along the oblique direction in the two-dimensional matrix image information processed by the first extraction processing unit and, as processing for each determination area, determines whether or not the sum of an area width number, which is the length of the side of the parallelogram of the determination area in the arrangement direction of one sequence, and the number of blank stages, which are those stages arranged in the determination area in which no element of a matching portion exists in that arrangement direction, is equal to or less than a predetermined threshold mutation number, and generates two-dimensional matrix image information in which the elements of the determination areas determined to be equal to or less than the threshold mutation number are extracted; and
an output processing unit that outputs the two-dimensional matrix image information processed by the second extraction processing unit.