JP2012032975A

JP2012032975A - Optimal alignment computation device and program

Info

Publication number: JP2012032975A
Application number: JP2010171432A
Authority: JP
Inventors: Koichi Kimura; 宏一木村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-07-30
Filing date: 2010-07-30
Publication date: 2012-02-16
Anticipated expiration: 2030-07-30
Also published as: JP5528249B2

Abstract

PROBLEM TO BE SOLVED: To accelerate dynamic programming (DP) alignment processing by parallelization with one processor, utilizing bit parallelism of word operations in a computer, without imposing restrictions on the lead length.SOLUTION: A range to compute a DP matrix is restricted to a near-diagonal region with the width corresponding to the word length in a computer, and information inside this region is divided into and represented by bits aligned in the direction orthogonal to the diagonal to achieve bit parallelization of computation for information in these bits.

Description

本発明は、文字列、DNA塩基配列等の文字列データ（文字列データとして取り扱い可能な形式に加工された任意のデータを含む。）を、検索文字列と比較照合する処理を高速化するための技術に関する。 The present invention speeds up the process of comparing and collating character string data such as character strings and DNA base sequences (including arbitrary data processed into a form that can be handled as character string data) with search character strings. Related to technology.

近年、従来型のキャピラリー型ＤＮＡシーケンサとは異なる原理に基づき、一回のランで数千万本以上にも及ぶ大量の配列を読み取ることができる超並列ＤＮＡシーケンサが出現し、シーケンシングに要するコストを劇的に低減させた（非特許文献１）。そのため、大量の配列データを高速に解析する技術に対する需要が高まりつつある。中でも、ゲノム・マッピング解析は、最も基本的な配列データ解析の一つであり、シーケンサにより新たに読まれた塩基配列（リード）を、データベース中の参照ゲノム配列と比較して、ゲノム内の位置を特定し、さらに、そこに含まれる変異（塩基の置換・挿入・欠失）を同定する。 In recent years, a massively parallel DNA sequencer has emerged that can read tens of millions of sequences in a single run based on a principle different from that of conventional capillary DNA sequencers. Was drastically reduced (Non-patent Document 1). Therefore, there is an increasing demand for a technique for analyzing a large amount of array data at high speed. Among them, genome mapping analysis is one of the most basic sequence data analysis. The base sequence (read) newly read by the sequencer is compared with the reference genome sequence in the database, and the position in the genome is compared. And mutations (base substitution / insertion / deletion) contained therein are identified.

初期の超並列ＤＮＡシーケンサのリード長は30塩基程度と比較的短かったため、ハッシュ法やBurrows-Wheeler変換などを用いて高速・高精度なゲノム・マッピング解析がなされた（非特許文献２）。 Since the read length of the initial massively parallel DNA sequencer was relatively short, about 30 bases, high-speed and high-accuracy genome mapping analysis was performed using a hash method or Burrows-Wheeler transformation (Non-patent Document 2).

しかし、リード長が100塩基以上まで伸びてくると、従来型のＤＮＡシーケンサ向けのゲノム・マッピング解析で用いられてきたseed-and-extend strategyのような方法が必要となってくる（非特許文献３）。この方法では、第一ステップで、リードに含まれる短い配列（シード）がゲノム内に出現する位置を高速に検索し、第二ステップで、シードの出現位置周辺でリード全体がゲノム配列に類似しているか（即ち、シードとゲノムの対応関係を、多少の変異を許容しながら、リード全体にまで拡張できるか）を調べる。この第一ステップでは、シード長が短いため、ハッシュ法やBurrows-Wheeler変換などを用いた高速な検索方法が利用でき、これにより、長大なゲノム配列全体の中からリードのマッピング候補位置を高速に絞り込むことができる。第二ステップでは、第一ステップで絞り込まれた候補位置にあるゲノム部分配列（ターゲット配列）に対して、動的計画法（DP法）などを用いてリード（クエリー配列）とのアラインメントを行い、リード長が長い場合であっても、塩基変異を高精度に調べることができる（非特許文献８）。 However, when the read length increases to 100 bases or more, a method such as the seed-and-extend strategy that has been used in genome mapping analysis for conventional DNA sequencers becomes necessary (non-patent literature). 3). In this method, in the first step, the position where the short sequence (seed) contained in the read appears in the genome is searched at high speed, and in the second step, the entire read is similar to the genome sequence around the seed appearance position. (That is, whether the correspondence between the seed and the genome can be extended to the entire read while allowing some variation). In this first step, since the seed length is short, a high-speed search method using a hash method, Burrows-Wheeler transform, etc. can be used. You can narrow down. In the second step, the genome partial sequence (target sequence) at the candidate position narrowed down in the first step is aligned with the read (query sequence) using dynamic programming (DP method), etc. Even when the lead length is long, the base mutation can be examined with high accuracy (Non-patent Document 8).

しかし、一般に、動的計画法では、配列長の2乗に比例する大きな計算コストが必要となる。そこで、動的計画法によるアラインメント処理を高速化することは、ゲノム・マッピング解析処理全体を高速化する上で有益となる。 However, in general, dynamic programming requires a large calculation cost proportional to the square of the array length. Therefore, speeding up the alignment process by dynamic programming is beneficial for speeding up the entire genome mapping analysis process.

動的計画法によるアラインメント処理を高速化するために、複数のプロセッサを利用して処理を並列化する方法が知られている（非特許文献４）。このとき、プロセッサ間で受け渡されるデータの依存関係を考慮し、また、プロセッサ間通信コストを抑えるような負荷分散方式を採用する必要がある。それにより、多数のプロセッサを使えば、高い並列性が得られる。 In order to speed up alignment processing by dynamic programming, a method of parallelizing processing using a plurality of processors is known (Non-Patent Document 4). At this time, it is necessary to consider the dependency relationship of data passed between the processors and to adopt a load balancing method that suppresses the inter-processor communication cost. Therefore, if a large number of processors are used, high parallelism can be obtained.

ところで、現在主流のデジタル計算機では、8バイト整数（ワード）に対する演算が基本演算となっているが、これは、8バイト即ち64ビットの並列演算と見ることもできる。G. Myersは、この整数演算のビット並列性を利用して、動的計画法によるアラインメント処理を１プロセッサで並列化し、配列長に比例する時間で計算する方法を示した（非特許文献５）。 By the way, in the current mainstream digital computer, an operation on an 8-byte integer (word) is a basic operation, but this can also be regarded as an 8-byte or 64-bit parallel operation. G. Myers showed how to use the parallelism of integer programming to parallelize the alignment process by dynamic programming with one processor and calculate in time proportional to the array length (Non-patent Document 5). .

G. Myersの方法では、長さmのクエリー配列（リード）と長さnのターゲット配列（ゲノム部分配列）のアラインメントを求めるために、サイズm×nのDP行列の情報を長さmの列方向の情報のn個の並びに分解して表現し、新たに読み込んだターゲット配列の１文字とクエリー配列の全m文字との比較に関わる処理を、mビット並列に行い、行方向（ターゲット配列の方向）に1塩基ずつ移動しながらこの処理を繰り返す。 In G. Myers' method, in order to obtain an alignment between a query sequence (read) of length m and a target sequence (genome partial sequence) of length n, information on a DP matrix of size m × n is used as a column of length m. Processes related to the comparison of one character of the newly read target sequence and all the m characters of the query sequence are performed in m bits in parallel and expressed in the row direction (target sequence of the target sequence). Repeat this process while moving one base at a time).

式を用いて詳細に説明すると、次のようになる。１塩基の置換・挿入・欠失のコストを全て１として、DP行列の第 (i, j) 成分C[i, j] （クエリー配列 q = q1 q2 ... qm の長さ iの部分文字列q1 q2 ... qi とターゲット配列 t = t1 t2 ... tn の長さ jの部分文字列 t1 t2 ... tjとの最適なアラインメントのコスト）に対して、
等式C[i, j]−C[i, j−1] = 1が成立するか否かを表わすブール変数をPh[i, j]、
等式C[i, j]−C[i, j−1] = −1が成立するか否かを表わすブール変数をMh[i, j]、
等式C[i, j]−C[i−1, j] = 1が成立するか否かを表わすブール変数をPv[i, j]、
等式C[i, j]−C[i−1, j] = −1が成立するか否かを表わすブール変数をMv[i, j]、
クエリー配列のインデクス iの文字（i番目の文字）qiとターゲット配列のインデクスjの文字tjとが一致するか否かを表わすブール変数をEq(i, j)、
と表わし、これらの 1ビット（偽または真、即ち、0または1）で表わされるブール変数をi = 1, 2, ..., m について下位ビットより昇順に並べてできるワード変数を、
（数１)
Ph(j) = [ Ph[m, j], ・・・ , Ph[2, j], Ph[1, j] ]
Mh(j) = [ Mh[m, j], ・・・ , Mh[2, j], Mh[1, j] ]
Pv(j) = [ Pv[m, j], ・・・ , Pv[2, j], Pv[1, j] ]
Mv(j) = [ Mv[m, j], ・・・ , Mv[2, j], Mv[1, j] ]
Eq(j) = [ Eq[m, j], ・・・ , Eq[2, j], Eq[1, j] ]
と表わし、さらに記号の簡略化のためにjを一つの値に固定して、Ph(j), Mh(j), Pv(j), Mv(j), Eq(j) の値をそれぞれ、Ph, Mh, Pv, Mv, Eqと略記し、Xv、Xhを補助ワード変数として、
（数２)
Xh = ((Eq & Pv) + Pv) ^ Pv | Eq
Ph = Mv | ~ (Xh | Pv)
Mh = Pv & Xh
Ph <<= 1
Mh <<= 1
Xv = Eq | Mv
Pv = Mh | ~(Xv | Ph)
Mv = Ph & Xv
によりビット並列な整数演算を行い、Ph, Mh, Pv, Mvの値を上書きする。ここで、ワード変数a、bに対して、a|bはビットごとのOR演算、a&bはビットごとのAND演算、a ^ bはビットごとの排他的論理和演算、a + bは整数としての加算、~aはビットごとのNOT演算、a<<=1は1ビットの左シフト演算を表わす。これにより、Ph, Mh, Pv, Mvの値は、Ph(j+1), Mh(j+1), Pv(j+1), Mv(j+1) の値に更新される。これが、アラインメント処理をビット並列化する中核部分であり、これにより処理が高速化される。 This will be described in detail using equations. All the substitution, insertion, and deletion costs of one base are 1, and the DP matrix's (i, j) component C [i, j] (query sequence q = q1 q2 ... qm length i partial character For the optimal alignment cost of the sequence q1 q2 ... qi and the target sequence t = t1 t2 ... tn with substring t1 t2 ... tj of length j)
A Boolean variable indicating whether or not the equation C [i, j] −C [i, j−1] = 1 is satisfied, Ph [i, j],
A Boolean variable indicating whether or not the equation C [i, j] −C [i, j−1] = − 1 holds is represented by Mh [i, j],
A Boolean variable indicating whether or not the equation C [i, j] −C [i−1, j] = 1 is satisfied is Pv [i, j],
A Boolean variable indicating whether or not the equation C [i, j] −C [i−1, j] = − 1 holds is expressed as Mv [i, j],
Eq (i, j) is a Boolean variable that indicates whether the character (i-th character) qi of the index i of the query sequence matches the character tj of the index j of the target sequence
And a word variable that can be arranged in ascending order from the least significant bit for i = 1, 2, ..., m for these Boolean variables represented by 1 bit (false or true, ie 0 or 1),
(Equation 1)
Ph (j) = [Ph [m, j], ..., Ph [2, j], Ph [1, j]]
Mh (j) = [Mh [m, j], ..., Mh [2, j], Mh [1, j]]
Pv (j) = [Pv [m, j], ..., Pv [2, j], Pv [1, j]]
Mv (j) = [Mv [m, j], ..., Mv [2, j], Mv [1, j]]
Eq (j) = [Eq [m, j], ..., Eq [2, j], Eq [1, j]]
In order to simplify the symbol, j is fixed to one value, and the values of Ph (j), Mh (j), Pv (j), Mv (j), Eq (j) are respectively Abbreviated as Ph, Mh, Pv, Mv, Eq, Xv, Xh as auxiliary word variables,
(Equation 2)
Xh = ((Eq & Pv) + Pv) ^ Pv | Eq
Ph = Mv | ~ (Xh | Pv)
Mh = Pv & Xh
Ph << = 1
Mh << = 1
Xv = Eq | Mv
Pv = Mh | ~ (Xv | Ph)
Mv = Ph & Xv
Executes bitwise integer arithmetic and overwrites the values of Ph, Mh, Pv, and Mv. Here, for word variables a and b, a | b is a bitwise OR operation, a & b is a bitwise AND operation, a ^ b is a bitwise exclusive OR operation, and a + b is an integer Addition, ~ a represents a bitwise NOT operation, and a << = 1 represents a 1-bit left shift operation. Thereby, the values of Ph, Mh, Pv, and Mv are updated to the values of Ph (j + 1), Mh (j + 1), Pv (j + 1), and Mv (j + 1). This is the core part of aligning the alignment process bitwise, which speeds up the process.

しかし、この方法を直接的に用いることができるのは、リード長（クエリー配列長）m が計算機のワード長（64）以下の場合に限られる。この方法をリード長が計算機のワード長を越える場合に拡張するためには、リード（クエリー配列）をワード長以下の長さの複数の断片配列に分割する必要がある（非特許文献５、非特許文献６）。このとき、これらの断片配列に対する計算には互いに依存関係があり、並列化する際に制約が生じ、処理効率も低下する。 However, this method can be used directly only when the read length (query sequence length) m is less than or equal to the computer word length (64). In order to extend this method when the read length exceeds the word length of the computer, it is necessary to divide the read (query sequence) into a plurality of fragment sequences having a length equal to or less than the word length (Non-Patent Document 5, Non-Patent Document 5, Non-patent Document 5). Patent Document 6). At this time, the calculations for these fragment sequences are dependent on each other, and restrictions are imposed upon parallelization, resulting in a reduction in processing efficiency.

Rajendrani Mukhopadhyay. DNA sequencers: the next generation. Analytical Chemistry, Vol. 81, No. 5, March 1, 2009Rajendrani Mukhopadhyay.DNA sequencers: the next generation.Analytical Chemistry, Vol. 81, No. 5, March 1, 2009 Kouichi Kimura, Yutaka Suzuki, Sumio Sugano, Asako Koike. Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences. Journal of Computational Biology. November 2009, 16(11): 1601-1613.Kouichi Kimura, Yutaka Suzuki, Sumio Sugano, Asako Koike.Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences.Journal of Computational Biology.November 2009, 16 (11): 1601 -1613. Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005 May 1;21(9):1859-75.Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005 May 1; 21 (9): 1859-75. Martins WS, Del Cuvillo JB, Useche FJ, Theobald KB, Gao GR. A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison. Pac Symp Biocomput. 2001:311-22.Martins WS, Del Cuvillo JB, Useche FJ, Theobald KB, Gao GR.A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison.Pac Symp Biocomput. 2001: 311-22. G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395-415, 1999.G. Myers.A fast bit-vector algorithm for approximate string matching based on dynamic programming.Journal of the ACM, 46 (3): 395-415, 1999. HEIKKI HYYRO. BIT-PARALLEL COMPUTATION OF LOCAL SIMILARITY SCORE MATRICES WITH UNITARY WEIGHTS. International Journal of Foundations of Computer Science, 17(6): 1325-1344, 2006.HEIKKI HYYRO. BIT-PARALLEL COMPUTATION OF LOCAL SIMILARITY SCORE MATRICES WITH UNITARY WEIGHTS. International Journal of Foundations of Computer Science, 17 (6): 1325-1344, 2006. Kevin Judd McKernan, Heather E. Peckham, Gina L. Costa, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009 19: 1527-1541.Kevin Judd McKernan, Heather E. Peckham, Gina L. Costa, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding.Genome Res. 2009 19: 1527-1541 . Richard Durbin, Sean R. Eddy, Andrew Krogh, Graeme Mitchison, Biological Seuqnece Analysis, probabilistic models of proteins and nucleic acids, Cambridge University Press 1988（邦訳、阿久津達也、浅井潔、矢田哲士訳、バイオインフォマティクス―確率モデルによる遺伝子配列解析―、医学出版、2001）Richard Durbin, Sean R. Eddy, Andrew Krogh, Graeme Mitchison, Biological Seuqnece Analysis, probabilistic models of proteins and nucleic acids, Cambridge University Press 1988 Sequence analysis, medical publication, 2001)

本発明が解決しようとする課題は、リード長に対する制限を設けることなく、また、リード配列を分割するなどの処理効率を低下させる方法もとらず、計算機のワード演算のビット並列性を利用して、動的計画法によるアラインメント処理を1プロセッサで並列化し高速化する方法を提供することにある。 The problem to be solved by the present invention is to make use of the bit parallelism of the word operation of a computer without setting a restriction on the read length and without using a method of reducing the processing efficiency such as dividing the read array. Another object is to provide a method for parallelizing and speeding up alignment processing by dynamic programming with one processor.

本発明による方法では、長さmのクエリー配列（リード）と長さnのターゲット配列（ゲノム部分配列）のアラインメントを求めるために、サイズm×nのDP行列の計算を対角線（行のインデクスiと列のインデクスjが等しいところ）の周辺領域に制限し、その領域内の情報を対角線に直行する方向（反対角線方向）の情報のビットの並びに分解して表現し、計算機のワード演算のビット並列性を利用して、反対角線上でのクエリー配列とターゲット配列の文字比較に関わる処理を1プロセッサで並列化し、対角線方向に移動しながらこの処理を繰り返す。 In the method according to the present invention, in order to obtain an alignment between a query sequence (read) of length m and a target sequence (genome partial sequence) of length n, the computation of a DP matrix of size m × n is performed diagonally (row index i And the column index j are equal to each other), and the information in that area is expressed by disassembling the information in the direction perpendicular to the diagonal (opposite diagonal direction), and the bit of the word operation of the computer Using parallelism, the processing related to character comparison between the query sequence and the target sequence on the opposite corner is parallelized by one processor, and this processing is repeated while moving in the diagonal direction.

式を用いて詳細に説明すると、次のようになる。C[i, j], Ph[i, j], Mh[i, j], Pv[i, j], Mv[i, j], Eq[i, j] を上記の通りとして、これらを対角線周辺で反対角線方向に並べてできるワード変数を、
（数３）
Ph(k) = [ Ph[i, j] | i + j = k, −m ＜ i−j ≦ m ]
Mh(k) = [ Mh[i, j] | i + j = k, −m ＜ i−j ≦ m ]
Pv(k) = [ Pv[i, j] | i + j = k, −m ＜ i−j ≦ m ]
Mv(k) = [ Mv[i, j] | i + j = k, −m ＜ i−j ≦ m ]
Eq(k) = [ Eq[i, j] | i + j = k, −m ＜ i−j ≦ m ]
とする。ここで、ビット変数をワード変数に詰め込む順番は、jについて下位ビットから昇順とする。さらに、記号の簡略化のためにkを一つの値に固定して、Ph(k), Mh(k), Pv(k), Mv(k), Eq(k) の値をそれぞれ、Ph, Mh, Pv, Mv, Eqと略記し、Xv、Xhを補助ワード変数として、
kが偶数のときは、
（数４）
Xv = Eq | Mv; Pv' = Mh | ~(Xv | Ph); Mv' = Ph & Xv;
Xh = Eq | Mh; Ph' = Mv | ~(Xh | Pv); Mh' = Pv & Xh;
kが奇数のときは、
（数５）
Ph >>= 1; Mh >>= 1;
Xv = Eq | Mv; Pv' = Mh | ~(Xv | Ph); Mv' = Ph & Xv;
Xh = Eq | Mh; Ph' = Mv | ~(Xh | Pv); Mh' = Pv & Xh;
Pv' << = 1; Mv' <<= 1;
に従ってビット並列な整数演算を行い、Ph', Mh', Pv,' Mv'の値を計算する。これらは、Ph(k+1), Mh(k+1), Pv(k+1), Mv(k+1) の値を与える。これが、アラインメント処理をビット並列化する中核部分である。 This will be described in detail using equations. C [i, j], Ph [i, j], Mh [i, j], Pv [i, j], Mv [i, j], Eq [i, j] Word variables that can be arranged in the opposite corner direction around
(Equation 3)
Ph (k) = [Ph [i, j] | i + j = k, −m <i−j ≦ m]
Mh (k) = [Mh [i, j] | i + j = k, −m <i−j ≦ m]
Pv (k) = [Pv [i, j] | i + j = k, −m <i−j ≦ m]
Mv (k) = [Mv [i, j] | i + j = k, −m <i−j ≦ m]
Eq (k) = [Eq [i, j] | i + j = k, −m <i−j ≦ m]
And Here, the order in which the bit variables are packed into the word variables is ascending order from the lower bits for j. Furthermore, to simplify the symbol, k is fixed to one value, and the values of Ph (k), Mh (k), Pv (k), Mv (k), Eq (k) are set to Ph, Abbreviated as Mh, Pv, Mv, Eq, Xv, Xh as auxiliary word variables,
When k is an even number,
(Equation 4)
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
When k is odd,
(Equation 5)
Ph >> = 1; Mh >> = 1;
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
Pv '<< = 1; Mv'<< = 1;
According to the above, a bit-parallel integer operation is performed, and the values of Ph ', Mh', Pv, and 'Mv' are calculated. These give the values Ph (k + 1), Mh (k + 1), Pv (k + 1), Mv (k + 1). This is the core part that makes the alignment process bit parallel.

G. Myersの方法と同様、アラインメント処理が１プロセッサでビット並列化され高速化される。G. Myersの方法と異なるのは、動的計画法のビット並列計算を、DP行列の対角線周辺領域に限定して対角線方向に移動させるため、リード長に関する制限が無くなり、従って、リード長がワード長を越えた場合にリードを分割する必要がなくなる点である。また、類似性の高い配列の対する動的計画法の計算は、DP行列の対角線の周辺を行えば十分であるため、対角線から離れた領域での不要な計算を避けることができ、処理効率が向上する。また、行と列に関して対称な形で計算を行えるような新規なビット並列化方式であるため、（数４）及び（数５）において補助変数Xhの計算式はXvの計算式と対称な形に簡易化され、これはG. Myersによる（数２）における補助変数Xhのやや複雑な計算式よりも少ないマシン命令数で計算可能となる。 Similar to G. Myers' method, the alignment process is bit-parallelized and accelerated by one processor. The difference from G. Myers' method is that the bit parallel computation of dynamic programming is limited to the area around the diagonal of the DP matrix and moved in the diagonal direction, so there is no restriction on the read length, so the read length is equal to the word When the length is exceeded, there is no need to divide the lead. In addition, dynamic programming calculations for sequences with high similarity need only be performed around the diagonal of the DP matrix, so unnecessary calculations in areas away from the diagonal can be avoided, and processing efficiency is improved. improves. In addition, since this is a new bit parallelization method that can perform calculations in a symmetric manner with respect to rows and columns, in (Equation 4) and (Equation 5), the equation for the auxiliary variable Xh is symmetric with the equation for Xv. This can be calculated with a smaller number of machine instructions than the slightly complicated calculation formula of the auxiliary variable Xh in (Equation 2) by G. Myers.

本発明による最適アラインメント計算方法において、ビット並列演算の適用箇所を示した説明図である（実施例１）。In the optimal alignment calculation method by this invention, it is explanatory drawing which showed the application location of bit parallel calculation (Example 1). 次世代型DNAシーケンサの配列データと参照ゲノム配列データベースとの高速比較照合を行うシステムの構成例を示す図である（実施例１）。(Example 1) which is a figure which shows the structural example of the system which performs the high-speed comparison collation with the sequence data of a next-generation type DNA sequencer, and a reference genome sequence database. DP行列内のセル周辺における入力側と出力側のブール変数の位置関係を示す説明図である。It is explanatory drawing which shows the positional relationship of the Boolean variable of the input side and output side in the cell periphery in DP matrix. 和が k となる2重インデクス [i, j] とワード変数のインデクスとの対応関係を例示する説明図である。It is explanatory drawing which illustrates the correspondence of the double index [i, j] whose sum is k, and the index of a word variable. 一定の k に対するブール変数Ph, Mh, Pv, Mv のインデクスとワード変数のインデクスの対応関係を例示する説明図である。It is explanatory drawing which illustrates the correspondence of the index of the Boolean variable Ph, Mh, Pv, Mv and the index of a word variable with respect to fixed k. セル周辺のブール変数Ph, Mh, Pv, Mvのインデクスが、k が偶数か奇数かにより違いが生じることを示す説明図である。It is explanatory drawing which shows that a difference arises according to whether the index of the Boolean variables Ph, Mh, Pv, and Mv around a cell is k even or odd. ブール変数 Eq をビット並列に計算する方法を示す説明図である。It is explanatory drawing which shows the method of calculating the Boolean variable Eq in bit parallel. 対角線周辺領域の外部で事前に設定されるブール変数値を示す説明図である。It is explanatory drawing which shows the Boolean variable value preset in the exterior of a diagonal periphery area | region. DPコストの絶対値を求めるためのDPコストの差分値の累積法を示す説明図である。It is explanatory drawing which shows the accumulation method of the difference value of DP cost for calculating | requiring the absolute value of DP cost.

［実施例１］
以下、本発明の実施例を図面を用いて詳細に説明する。
本実施例では、リード（クエリー配列）と参照ゲノム配列データが与えられたとき、リードに類似した部分配列が存在するゲノム配列内の位置（マッピング位置）を求め、クエリー配列とマッピング位置周辺のゲノム部分配列（ターゲット配列）との最適アラインメントを求めるための方法を説明する。通常、DNAシーケンサから得られるリード配列は塩基配列であるため、以下、リード配列は塩基配列であると仮定する。特殊な場合として、２ベース・エンコーディング方式を採用したDNAシーケンサでは、リード配列は、隣り合う２塩基を4色に符号化したカラー配列となる（非特許文献７）。このような場合は、参照ゲノム配列をカラー配列に変換して、本実施例で述べる方法に類似の方法を適用することは、容易な類推により可能である。また、アミノ酸配列をクエリー配列として、参照タンパク配列データベースを検索する課題に対しても、容易な類推により、本実施例で述べる方法に類似の方法が適用可能である。また、文字列や画像を含む文書その他のデータベースを検索する課題に対しても、容易な類推により、本実施例で述べる方法に類似の方法が適用可能である。 [Example 1]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
In this example, when a read (query sequence) and reference genome sequence data are given, a position (mapping position) in the genome sequence where a partial sequence similar to the read exists is obtained, and the genome around the query sequence and the mapping position is obtained. A method for obtaining the optimum alignment with the partial sequence (target sequence) will be described. Usually, since the lead sequence obtained from a DNA sequencer is a base sequence, it is assumed hereinafter that the lead sequence is a base sequence. As a special case, in a DNA sequencer adopting the 2-base encoding method, the lead sequence is a color sequence in which two adjacent bases are encoded in four colors (Non-patent Document 7). In such a case, it is possible to convert the reference genome sequence into a color sequence and apply a method similar to the method described in this embodiment by simple analogy. In addition, a method similar to the method described in this embodiment can be applied to the problem of searching a reference protein sequence database using an amino acid sequence as a query sequence, by simple analogy. In addition, a method similar to the method described in the present embodiment can be applied to a task of searching a document or other database including a character string or an image by simple analogy.

図2は、本実施例のシステム構成例を示す図である。201は、このシステムが稼働している計算機であり、外部記憶装置202に格納された参照ゲノム配列データ203の情報を主記憶209内に読み込み、またＤＮＡシーケンサ204を用いてＤＮＡサンプル205を解析して得られた塩基配列データ（クエリー配列、リード配列）206を、入力装置207を介して主記憶209内に読み込み、また、ユーザにより指定される処理パラメータ（類似配列のアラインメント・コストの上限値M）を入力装置208を介して主記憶209内に読み込む。 FIG. 2 is a diagram illustrating a system configuration example of the present embodiment. Reference numeral 201 denotes a computer in which this system is operated, reads information of the reference genome sequence data 203 stored in the external storage device 202 into the main memory 209, and analyzes the DNA sample 205 using the DNA sequencer 204. The base sequence data (query sequence, read sequence) 206 obtained in this way is read into the main memory 209 via the input device 207, and processing parameters specified by the user (upper limit value M of alignment cost of similar sequences) ) Is read into the main memory 209 via the input device 208.

計算機201は、主記憶209 に蓄えられたリード配列と参照ゲノム配列に対して、シード検索処理210を行い、リードに含まれる短い配列（シード配列）がゲノム配列内に出現する位置を求める。シード配列としては、リード配列の中から長さ12〜20塩基程度の短い部分配列を切り出したものを使えばよく、また、シード配列の出現位置を求めるためには、Burrows-Wheeler変換など用いた公知の高速な検索方法を用いればよい（非特許文献２）。 The computer 201 performs a seed search process 210 on the read sequence and the reference genome sequence stored in the main memory 209 to obtain a position where a short sequence (seed sequence) included in the read appears in the genome sequence. As the seed sequence, it is sufficient to use a short partial sequence of about 12 to 20 bases in length from the lead sequence, and Burrows-Wheeler transformation or the like was used to determine the appearance position of the seed sequence. A known high-speed search method may be used (Non-Patent Document 2).

詳細比較処理部211は、シード配列がゲノム配列内に出現する位置の周辺で、最適アラインメント計算212を行い、そのアラインメントのコスト値がユーザにより指定された上限値M以下ならば、そこにリード配列に類似した配列があると判断し、出力処理部213にゲノム配列内の出現位置情報とアラインメント情報を伝える。出力処理部213は、出力装置214に、検索結果としてゲノム配列内の出現位置情報（ゲノム・マッピング位置情報）215と、最適アラインメント結果216を出力する。 The detailed comparison processing unit 211 performs an optimal alignment calculation 212 around the position where the seed sequence appears in the genome sequence. If the cost value of the alignment is equal to or less than the upper limit value M specified by the user, the lead sequence is included there. It is determined that there is a sequence similar to, and the output position information and alignment information in the genome sequence are transmitted to the output processing unit 213. The output processing unit 213 outputs the appearance position information (genome mapping position information) 215 in the genome sequence and the optimum alignment result 216 as search results to the output device 214.

アラインメントのコストは、アラインメントに内に含まれる１塩基の置換、挿入、欠失のコストを１として、その総和として計算する。クエリー配列のp塩基目の位置から始まる長さsのシード配列が、ゲノム配列中の座標区間 [x, x + s−1] に出現している場合、リード配列の出現位置は、座標区間 [x−p+1−M, x−p+m+M] に含まれる。ここで、mはクエリー配列長、Mはアラインメント・コストの上限値を表わす。リード配列の出現位置の範囲が±Mだけ広がるのは、シード配列の出現位置の前後に、最大M塩基の挿入・欠失が含まれる可能性があるからである。最適アラインメント計算212では、この座標区間 [x−p+1−M, x−p+m+M] のゲノム部分配列をターゲット配列として、クエリー配列とターゲット配列との間で、セミ・グローバルなアラインメントを行う。即ち、クエリー配列の全体と、ターゲット配列の部分配列との間に、挿入・欠失を許しながら対応関係を定める。その計算は、具体的には、DP行列を計算することによってなされる（非特許文献８）。 The cost of alignment is calculated as the sum of the costs of substitution, insertion, and deletion of one base included in the alignment as 1. If a seed sequence of length s starting from the position of the p base of the query sequence appears in the coordinate interval [x, x + s−1] in the genome sequence, the appearance position of the lead sequence is the coordinate interval [ x−p + 1−M, x−p + m + M]. Here, m represents the query sequence length, and M represents the upper limit of the alignment cost. The reason why the range of the appearance position of the lead sequence is expanded by ± M is that there is a possibility that insertion / deletion of a maximum of M bases is included before and after the appearance position of the seed sequence. In the optimal alignment calculation 212, a semi-global alignment is performed between the query sequence and the target sequence using the genome partial sequence in the coordinate interval [x−p + 1−M, x−p + m + M] as the target sequence. I do. That is, a correspondence relationship is determined between the entire query sequence and a partial sequence of the target sequence while allowing insertion / deletion. Specifically, the calculation is performed by calculating a DP matrix (Non-patent Document 8).

クエリー配列長をm、ターゲット配列長をnとする。１≦ i ≦m，１≦ j ≦nに対して、クエリー配列 q = q1 q2 ... qm の長さiの部分文字列 q1 q2 ... qi とターゲット配列 t = t1 t2 ... tn の長さjの部分文字列 t1 t2 ... tj との最適なアラインメントのコストを表わすDP行列の第 (i, j) 成分を C[i, j] として、G. Myersに従い（非特許文献５）、
等式C[i, j]−C[i, j−1] = 1が成立するか否かを表わすブール変数をPh[i, j]、
等式C[i, j]−C[i, j−1] = −1が成立するか否かを表わすブール変数をMh[i, j]、
等式C[i, j]−C[i−1, j] = 1が成立するか否かを表わすブール変数をPv[i, j]、
等式C[i, j]−C[i−1, j] = −1が成立するか否かを表わすブール変数をMv[i, j]、
クエリー配列のインデクスiの文字（i番目の文字）qi とターゲット配列のインデクスjの文字（j番目の文字） tj が一致するか否かを表わすブール変数をEq(i, j)、
と表わす。また、セミ・グローバルなアラインメントに対応する境界条件として、C[i, 0] = i, C[0, j] = 0と定める。任意の１≦ i ≦m，１≦ j ≦nに対して、
（数６）
Mh = Mh[i−1, j]; Ph = Ph[i−1, j];
Mv = Mh[i, j−1]; Pv = Pv[i, j−1];
Mh' = Mh[i, j]; Ph' = Ph[i, j];
Mv' = Mv[i, j]; Pv' = Pv[i, j];
Eq = Eq[i, j];
と略記すると、これらブール変数の間には、
（数７）
Xv = Eq or Mv; Pv' = Mh or not(Xv or Ph); Mv' = Ph and Xv;
Xh = Eq or Mh; Ph' = Mv or not(Xh or Pv); Mh' = Pv and Xh;
の関係式が成立する（非特許文献５）。ここで、XvとXhは補助のブール変数を表わす。この関係式を反復して利用することによりC[i, j]の値を全て計算することができ、DP行列の下辺でC[m, j]（j = 1, 2, ... , n）の最小値を与える点 (m, j) からトレースバックすることにより最適なアラインメントを求めることができる（非特許文献８）。 The query sequence length is m, and the target sequence length is n. For 1 ≤ i ≤ m, 1 ≤ j ≤ n, query string q = q1 q2 ... qm of length i partial string q1 q2 ... qi and target array t = t1 t2 ... tn According to G. Myers, the first (i, j) component of the DP matrix representing the optimal alignment cost with the substring t1 t2 ... tj of length j is C [i, j] (non-patent literature) 5),
A Boolean variable indicating whether or not the equation C [i, j] −C [i, j−1] = 1 is satisfied, Ph [i, j],
A Boolean variable indicating whether or not the equation C [i, j] −C [i, j−1] = − 1 holds is represented by Mh [i, j],
A Boolean variable indicating whether or not the equation C [i, j] −C [i−1, j] = 1 is satisfied is Pv [i, j],
A Boolean variable indicating whether or not the equation C [i, j] −C [i−1, j] = − 1 holds is expressed as Mv [i, j],
Eq (i, j) is a Boolean variable that indicates whether or not the character (i-th character) qi of the index i of the query sequence matches the character j of the index j (j-th character) tj of the target sequence.
It expresses. Further, C [i, 0] = i, C [0, j] = 0 are defined as boundary conditions corresponding to the semi-global alignment. For any 1 ≤ i ≤ m, 1 ≤ j ≤ n,
(Equation 6)
Mh = Mh [i−1, j]; Ph = Ph [i−1, j];
Mv = Mh [i, j−1]; Pv = Pv [i, j−1];
Mh '= Mh [i, j]; Ph' = Ph [i, j];
Mv '= Mv [i, j]; Pv' = Pv [i, j];
Eq = Eq [i, j];
For short, between these Boolean variables,
(Equation 7)
Xv = Eq or Mv; Pv '= Mh or not (Xv or Ph); Mv' = Ph and Xv;
Xh = Eq or Mh; Ph '= Mv or not (Xh or Pv); Mh' = Pv and Xh;
Is established (Non-patent Document 5). Here, Xv and Xh represent auxiliary Boolean variables. By using this relational expression repeatedly, all values of C [i, j] can be calculated, and C [m, j] (j = 1, 2, ..., n The optimum alignment can be obtained by tracing back from the point (m, j) that gives the minimum value of () (Non-patent Document 8).

ここで、複数組のPh, Mh, Pv, Mvから複数組のPh', Mh', Pv', Mv'を同時にビット並列に計算して、処理を高速化する方法を考える。図３に、これらのブール変数のDP行列内部における位置関係を示す。これらのブール変数は、DP行列内のセル109の周りにおいて、水平方向または垂直方向に隣接するDP行列の格子点301におけるコスト値 C[i, j] の差分に関する情報を表現している。（数６）より、入力側302のブール変数Ph, Mh, Pv, Mvの2重インデクスは [i−1, j] または [i, j−1] でその和は i + j−1 であり、また、出力側303のブール変数Ph', Mh', Pv', Mv'の2重インデクスは [i, j] でその和は i + j である。従って、2重インデクス [i, j] の和 k が小さいものから順に計算することによって、全てのブール変数Ph, Mh, Pv, Mv の値を計算することができる。さらに、クエリー長に関する制限を設けることなく、これらの計算をビット並列化できる方法があることを以下に説明する。 Here, a method is considered in which a plurality of sets of Ph ′, Mh ′, Pv ′, and Mv ′ are simultaneously calculated in bit parallel from a plurality of sets of Ph, Mh, Pv, and Mv to speed up the processing. FIG. 3 shows the positional relationship of these Boolean variables within the DP matrix. These Boolean variables represent information about the difference between the cost values C [i, j] at the lattice points 301 of the DP matrix adjacent in the horizontal direction or the vertical direction around the cell 109 in the DP matrix. From (Equation 6), the double index of the Boolean variables Ph, Mh, Pv, and Mv on the input side 302 is [i−1, j] or [i, j−1], and the sum is i + j−1. The double index of the Boolean variables Ph ′, Mh ′, Pv ′, Mv ′ on the output side 303 is [i, j] and the sum is i + j. Therefore, the values of all the Boolean variables Ph, Mh, Pv, and Mv can be calculated by calculating in order from the smallest sum k of the double indexes [i, j]. Further, it will be described below that there is a method in which these calculations can be performed in parallel without any restriction on the query length.

ところで、和が k に等しい2重インデクス [i, j] の個数は、 k と共に増大するが、類似配列の最適アラインメントの計算でトレースバックする際に参照すべきブール変数の2重インデクスは対角線 i = j の周辺の範囲に限られる。そこで、この範囲に制限し、（数７）の関係式をビット並列に計算する方法を考える。 By the way, the number of double indexes [i, j] whose sum is equal to k increases with k, but the double index of the Boolean variable to be referred to when tracing back in calculating the optimal alignment of similar sequences is the diagonal i = Limited to the range around j. Therefore, a method of calculating the relational expression of (Equation 7) in bit parallel is limited to this range.

計算機の基本演算に用いられる整数ワードのビット長をwとする。wは一般に２のべき乗であり、現在主流の計算機では、w = 64 となっている。また、半ワード長をw' = w／2とする。対角線の周辺の範囲として、−w < i−j≦w の範囲を考えると、和が k となる2重インデクス [i, j] は、
k が偶数 k = 2k' の場合、
（数８）
[i, j] = [k' + w', k'−w'], ..., [k'+1, k'−1], [k', k'], [k'−1, k'+1],
..., [k'−w' + 1, k' + w'−1]
k が奇数 k = 2k' + 1 の場合、
（数９）
[i, j] = [k' + w', k'−w' + 1], ..., [k'+1, k'], [k', k'+1],
..., [k'−w' + 1, k' + w']
となる。そこで、これらの2重インデクス（格子点）に、（1重）インデクスα= k'−i を対応させれば、kが偶数であるか奇数であるかに関わらず、これらは、
α = −w', −w' + 1, ..., −1, 0, 1, ..., w'−1
に対応する。そこで、wビットのワード変数の各ビットを、下位ビットより順にこれらでインデクス付けして対応させる。以下、説明を簡単にするために w = 4 として、図を用いてこれらの対応関係を具体的に説明する。 Let w be the bit length of the integer word used in the basic calculation of the computer. In general, w is a power of 2, and w = 64 in the current mainstream computer. The half-word length is w ′ = w / 2. Considering the range of −w <i−j ≦ w as the range around the diagonal, the double index [i, j] whose sum is k is
If k is an even number k = 2k '
(Equation 8)
[i, j] = [k '+ w', k'−w '], ..., [k' + 1, k'−1], [k ', k'], [k'−1, k '+ 1],
..., [k'−w '+ 1, k' + w'−1]
If k is odd k = 2k '+ 1,
(Equation 9)
[i, j] = [k '+ w', k'−w '+ 1], ..., [k' + 1, k '], [k', k '+ 1],
..., [k'−w '+ 1, k' + w ']
It becomes. Therefore, if the (single) index α = k′−i is made to correspond to these double indexes (grid points), regardless of whether k is an even number or an odd number,
α = −w ', −w' + 1, ..., −1, 0, 1, ..., w'−1
Corresponding to Therefore, each bit of the w-bit word variable is indexed with these in order from the lower order bit to correspond. In the following, for simplicity of explanation, w = 4 is used, and the corresponding relationship will be specifically described with reference to the drawings.

図４は、w = 4 の場合に、対角線の周辺範囲 −w < i−j ≦w にある和が k となる2重インデクス [i, j] と4ビットのワード変数のインデクスとの対応関係を例示したものである。101は、DP行列の2重インデクス [i, j] の全体を表わす正方格子である。102と103に表示したインデクスi、jは、各々、クエリー配列とターゲット配列のインデクス（文字位置）である。104で示した対角線 i = j の周辺領域−w < i−j ≦wは、105に示す2本の破線に挟まれた範囲106である。この範囲106の中で、2重インデクスの和がk = 5（奇数）となるものは、407に示す破線で囲まれた中にある4個の白丸（408）の位置にある。同様に、この範囲106の中で、2重インデクスの和がk = 6（偶数）となるものは、409に示す破線で囲まれた中にある4個の黒丸（410）の位置にある。k = 7（奇数）の場合も同様である。これらに対応するインデクスα = −2, −1, 0, 1 の値を傍らに付した。また、これらのインデクスに対応する4ビット整数のビット位置を411に示した。 Figure 4 shows the correspondence between the double index [i, j] and the index of the 4-bit word variable with the sum of k in the peripheral area of the diagonal line -w <i-j ≤ w when w = 4 Is illustrated. 101 is a square lattice that represents the entire double index [i, j] of the DP matrix. The indexes i and j displayed in 102 and 103 are the indexes (character positions) of the query sequence and the target sequence, respectively. The peripheral region −w <i−j ≦ w of the diagonal line i = j shown by 104 is a range 106 sandwiched between two broken lines shown by 105. Within this range 106, the one where the sum of the double indexes is k = 5 (odd number) is at the position of four white circles (408) in the area surrounded by the broken line 407. Similarly, in this range 106, the one where the sum of the double indexes is k = 6 (even number) is at the position of four black circles (410) in the area surrounded by the broken line 409. The same applies to k = 7 (odd number). The corresponding index α = −2, −1, 0, 1 is attached to the side. The bit positions of 4-bit integers corresponding to these indexes are shown in 411.

和が k となる2重インデクス [i, j] をもつブール変数Ph, Mh, Pv, Mv に対しても、同様にワード変数の（1重）インデクスαを対応させる。ブール変数Ph, Mhは水平方向に隣接する格子点のコストの差分値の情報を表わしているが、それらには左方の格子点のインデクスαを対応させる（図5の501参照）。即ち、Ph[i, j], Mh[i, j] には [i, j−1] に対応するインデクスαを対応させて、Ph[α], Mh[α] と表わす。同様に、ブール変数Pv, Mvは垂直方向に隣接する格子点のコストの差分値の情報を表わしているが、それらには下方の格子点のインデクスαを対応させる（図5の502参照）。即ち、Pv[i, j], Mv[i, j] には [i, j] に対応するインデクスαを対応させて、Pv[α], Mv[α] と表わす。図5に、w = 4 の場合の例を具体的に示す。h[α]はブール変数Ph[α]またはMh[α]を表わし、また、v[α]はブール変数Pv[α]またはMv[α]を表わす。 Similarly, the (single) index α of the word variable is made to correspond to the Boolean variables Ph, Mh, Pv, and Mv having the double index [i, j] whose sum is k. The Boolean variables Ph and Mh represent the information of the difference value of the cost of the grid points adjacent in the horizontal direction, and correspond to the index α of the left grid point (see 501 in FIG. 5). That is, Ph [i, j], Mh [i, j] is represented as Ph [α], Mh [α] by associating an index α corresponding to [i, j−1]. Similarly, the Boolean variables Pv and Mv represent the information of the difference value of the cost of the lattice points adjacent in the vertical direction, and correspond to the index α of the lower lattice point (see 502 in FIG. 5). That is, Pv [i, j] and Mv [i, j] are represented as Pv [α] and Mv [α] by associating an index α corresponding to [i, j]. FIG. 5 specifically shows an example in the case of w = 4. h [α] represents a Boolean variable Ph [α] or Mh [α], and v [α] represents a Boolean variable Pv [α] or Mv [α].

対角線周辺領域106内に、一定の k に対する Ph[α], Mh[α], Pv[α], Mv[α] は各々w個ずつあるので、対応するインデクスαに従いそれらのビット値503を1ワードに詰め込むことができ、そのようにして得られるワード変数504, 505をPh, Mh, Pv, Mv と表わす。図5はw = 4 の場合を示す。ここで、hはPhまたはMhを、vはPvまたはMvを表わす。ブール変数Eq = Eq[i, j]についても、同様に、2重インデクス [i, j] に対応するインデクスαを対応させて Eq[α] と表わし、それに従ってビット値をワード変数に詰め込んだものをEqと書く。 Since there are w Ph [α], Mh [α], Pv [α], and Mv [α] for a certain k in the diagonal peripheral area 106, the bit value 503 is set to 1 according to the corresponding index α. The word variables 504 and 505 that can be packed into a word and are thus obtained are denoted as Ph, Mh, Pv, and Mv. FIG. 5 shows the case where w = 4. Here, h represents Ph or Mh, and v represents Pv or Mv. Similarly, for the Boolean variable Eq = Eq [i, j], the index α corresponding to the double index [i, j] is represented as Eq [α], and the bit value is packed into the word variable accordingly. Write things as Eq.

これらのワード変数は、 k について昇順に計算することができる。図1は、w = 4 の場合に、その計算過程におけるワード変数のデータの依存関係と、その時に計算されるブール変数のDP行列内における位置を示したものである。対角線周辺領域106（−w < i−j ≦w）の中で、kが奇数のとき、ブール変数Ph[α], Mh[α], Pv[α], Mv[α]に対応する水平または垂直方向の隣接格子点を結ぶ辺を太い実線107で、また、kが偶数のとき、対応する同様の辺を太い点線108で表わした。また、辺で囲まれた各セル109には、ブール変数Eq[α]が対応している。これらのブール変数をワード変数に詰め込んだものがPh, Mh, Pv, Mv, Eqであり、これらは110に示すワード・データを構成する。これらは、以下に説明するビット並列演算111によりkについて昇順に計算される。 These word variables can be computed in ascending order for k. FIG. 1 shows the dependency of word variable data in the calculation process when w = 4 and the position of the Boolean variable calculated at that time in the DP matrix. In the diagonal surrounding area 106 (−w <i−j ≦ w), when k is an odd number, the horizontal corresponding to the Boolean variables Ph [α], Mh [α], Pv [α], Mv [α] A side connecting adjacent lattice points in the vertical direction is represented by a thick solid line 107, and when k is an even number, a corresponding similar side is represented by a thick dotted line 108. Further, each cell 109 surrounded by the edge corresponds to the Boolean variable Eq [α]. These Boolean variables packed into word variables are Ph, Mh, Pv, Mv, and Eq, and these constitute the word data shown at 110. These are calculated in ascending order for k by the bit parallel operation 111 described below.

ビット並列演算111を説明するために、図5に戻って、各セル109の周りで計算すべき式を確認する。但し、セルの左辺と上辺に対応するkの値が偶数か奇数かによって事情が異なる。先ず、kが偶数の場合は、図6左に示すように、kに対応する左辺と上辺の入力側ブール変数601はPv[α], Mv[α] , Ph[α], Mh[α] であり、また、k+1に対応する下辺と右辺の出力側ブール変数602はPh[α], Mh[α], Pv[α], Mv[α]であり、これらは全て同じインデクスαをもっている。従って、この場合は、ワード変数 Ph, Mh, Pv, Mv に対して、kに対応するものをPh, Mh, Pv, Mv、また、k+1に対応するものをPh', Mh', Pv', Mv'と書いて区別すれば、（数７）と同様の関係式、
（数４）
Xv = Eq | Mv; Pv' = Mh | ~(Xv | Ph); Mv' = Ph & Xv;
Xh = Eq | Mh; Ph' = Mv | ~(Xh | Pv); Mh' = Pv & Xh;
により、ビット並列に計算できる。ここで、ワード変数a、bに対して、a|bはビットごとのOR演算、a&bはビットごとのAND演算、~aはビットごとのNOT演算を表わす。一方、kが奇数の場合は、図6右に示すように、kに対応する左辺と上辺の入力側ブール変数603はPv[α], Mv[α] , Ph[α+1], Mh[α+1] であり、また、k+1に対応する下辺と右辺の出力側ブール変数604はPh[α], Mh[α], Pv[α+1], Mv[α+1] であり、インデクスαに+1のずれが生じている。そこで、この場合は、上記のワード変数 Ph, Mh, Pv, Mv 及び Ph', Mh', Pv', Mv' に対して、前後にビットシフト演算を行えば、（数７）と同様な関係式が適用可能となる。即ち、この場合は、
（数５）
Ph >>= 1; Mh >>= 1;
Xv = Eq | Mv; Pv' = Mh | ~(Xv | Ph); Mv' = Ph & Xv;
Xh = Eq | Mh; Ph' = Mv | ~(Xh | Pv); Mh' = Pv & Xh;
Pv' <<= 1; Mv' <<= 1;
により、ビット並列に計算できる。ここで、ワード変数aに対して、a<<=1 は1ビットの左シフト演算、a>>=1 は1ビットの右シフト演算を表わす。 In order to explain the bit parallel operation 111, returning to FIG. 5, the formula to be calculated around each cell 109 is confirmed. However, the situation differs depending on whether the value of k corresponding to the left side and the upper side of the cell is even or odd. First, when k is an even number, as shown on the left of FIG. 6, the input-side Boolean variables 601 on the left and upper sides corresponding to k are Pv [α], Mv [α], Ph [α], Mh [α] And the output Boolean variables 602 on the lower and right sides corresponding to k + 1 are Ph [α], Mh [α], Pv [α], Mv [α], all of which have the same index α. Yes. Therefore, in this case, for the word variables Ph, Mh, Pv, Mv, those corresponding to k are Ph, Mh, Pv, Mv, and those corresponding to k + 1 are Ph ', Mh', Pv. If it is distinguished by writing ', Mv', the same relational expression as (Equation 7),
(Equation 4)
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
Can be calculated in bit parallel. Here, for word variables a and b, a | b represents a bitwise OR operation, a & b represents a bitwise AND operation, and ~ a represents a bitwise NOT operation. On the other hand, when k is an odd number, as shown on the right side of FIG. 6, the input side Boolean variables 603 corresponding to k are Pv [α], Mv [α], Ph [α + 1], Mh [ α + 1], and the output side Boolean variables 604 corresponding to k + 1 are Ph [α], Mh [α], Pv [α + 1], Mv [α + 1] The index α is shifted by +1. Therefore, in this case, if a bit shift operation is performed before and after the above word variables Ph, Mh, Pv, Mv and Ph ', Mh', Pv ', Mv', the same relationship as in (Equation 7) is obtained. The formula becomes applicable. That is, in this case,
(Equation 5)
Ph >> = 1; Mh >> = 1;
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
Pv '<< = 1; Mv'<< = 1;
Can be calculated in bit parallel. Here, for the word variable a, a << = 1 represents a 1-bit left shift operation, and a >> = 1 represents a 1-bit right shift operation.

一方、ワード変数 Eq については、
k が偶数 k = 2k' の場合、
（数10）
Eq(2k') = [ Eq[−w'], ..., Eq[−1], Eq[0], Eq[1], ..., Eq[w'−1 ] ]
= [ Eq[k'−w' + 1, k' + w'−1], ...,
Eq[k'−1, k'+1], Eq[k', k'], Eq[k'+1, k'−1],
..., Eq[k' + w', k'−w'] ]
k が奇数 k = 2k' + 1 の場合、
（数11）
Eq(2k'+1) = [ Eq[−w'], ..., Eq[−1], Eq[0], Eq[1], ..., Eq[w'−1 ] ]
= [ Eq[k'−w' + 1, k' + w'], ...,
Eq[k', k'+1], Eq[k'+1, k'],
..., Eq[k' + w', k'−w' + 1] ]
となる。ここで、外側の[ , , ... , ] はビットを並べてできる整数値を表わし（右端が最下位ビット）、Eq[i, j] は、クエリー配列 q = q1 q2 ... qm のi番目の文字（塩基）qiとターゲット配列 t = t1 t2 ... tn のj番目の文字（塩基）tjが一致するか否かを表わすブール変数を表わす。4種類の塩基A, C, G, Tの2ビット符号化 Aa00, Ca01, Ga10, Ta11 における下位ビットをl、上位ビットをu と表わし、l(A) = l(G) = 0, l(G) = l(T) = 1, u(A) = u(C) = 0, u(G) = u(T) = 1 とすると、
（数12）
Eq[i, j] = not( ( l(qi) xor l(tj) ) or ( u(qi) xor u(tj) ))
が成り立つ。ここで、xorはブール変数の排他的論理和を表わす。そこで、ブール変数l(tj)、u(tj)を最下位ビットから昇順に並べてできるワード変数をTl, Tu、ブール変数l(qi)、u(qi)を最上位ビットから降順に並べてできるワード変数をQl, Quとすれば、
ワード変数Eqは、
（数13）
Eq = ~( ( Ql ^ Tl ) | ( Qu ^ Tu ))
のようにビット並列に計算できる。ここで、ワード変数a, bに対して、a ^ bはビットごとの排他的論理和演算を表わす。一方、Tl, Tu, Ql, Quは、k について帰納的にビットシフト演算を用いて計算できる。但し、ビットシフトによって空いたビットには、クエリー配列またはターゲット配列の対応する文字の符号化ビットをセットする。図7にw = 4 の場合のk = 4, 5, 6に対応するTl, Tu, Ql, Quの計算方法を示す。一般の場合に、kが増加するときのTl, Tu, Ql, Quの値の更新方法を具体的に式で表わすと、
kが奇数（k = 2k' + 1）のとき、
（数14）
Ql <<= 1; Ql |= l(qi);
Qu <<= 1; Qu |= u(qi);
kが偶数（k = 2k'）のとき、
（数15）
Tl >>= 1; Tl |= (l(tj) << (w−1));
Tu >>= 1; Tu |= (u(tj) << (w−1));
となる。ここで、qi, tjはクエリー配列とターゲット配列から新たに読み込まれる文字を表わし、i = w' + k' + 1、j = w' + k' である。また、ワード変数a, b に対して、a = a | b を a |= b と略記する。 On the other hand, for the word variable Eq,
If k is an even number k = 2k '
(Equation 10)
Eq (2k ') = [Eq [−w'], ..., Eq [−1], Eq [0], Eq [1], ..., Eq [w'−1]]
= [Eq [k'−w '+ 1, k' + w'−1], ...,
Eq [k'-1, k '+ 1], Eq [k', k '], Eq [k' + 1, k'-1],
..., Eq [k '+ w', k'−w ']]
If k is odd k = 2k '+ 1,
(Equation 11)
Eq (2k '+ 1) = [Eq [−w'], ..., Eq [−1], Eq [0], Eq [1], ..., Eq [w'−1]]
= [Eq [k'−w '+ 1, k' + w '], ...,
Eq [k ', k' + 1], Eq [k '+ 1, k'],
..., Eq [k '+ w', k'−w '+ 1]]
It becomes. Here, the outer [,, ...,] represents an integer value formed by arranging bits (the right end is the least significant bit), and Eq [i, j] is the i of the query array q = q1 q2 ... qm This represents a Boolean variable indicating whether or not the j th character (base) tj of the th character (base) qi and the target sequence t = t1 t2. The low-order bits of the four types of bases A, C, G, and T, Aa00, Ca01, Ga10, and Ta11 are represented as l and the high-order bits as u, and l (A) = l (G) = 0, l ( G) = l (T) = 1, u (A) = u (C) = 0, u (G) = u (T) = 1,
(Equation 12)
Eq [i, j] = not ((l (qi) xor l (tj)) or (u (qi) xor u (tj)))
Holds. Here, xor represents an exclusive OR of Boolean variables. Therefore, Tl, Tu, Boolean variables l (qi), u (qi) can be arranged in descending order from the most significant bit in Boolean variables l (tj), u (tj) in ascending order from the least significant bit. If the variables are Ql and Qu,
The word variable Eq is
(Equation 13)
Eq = ~ ((Ql ^ Tl) | (Qu ^ Tu))
It can be calculated in bit parallel like Here, for the word variables a and b, a ^ b represents a bitwise exclusive OR operation. On the other hand, Tl, Tu, Ql, and Qu can be calculated recursively for k using a bit shift operation. However, the encoded bit of the character corresponding to the query array or the target array is set to the bit vacated by the bit shift. FIG. 7 shows a calculation method of Tl, Tu, Ql, and Qu corresponding to k = 4, 5, and 6 when w = 4. In the general case, the method of updating the values of Tl, Tu, Ql, Qu when k increases is specifically expressed as an equation:
When k is odd (k = 2k '+ 1)
(Equation 14)
Ql << = 1; Ql | = l (qi);
Qu << = 1; Qu | = u (qi);
When k is an even number (k = 2k ')
(Equation 15)
Tl >> = 1; Tl | = (l (tj) <<(w−1));
Tu >> = 1; Tu | = (u (tj) <<(w−1));
It becomes. Here, qi and tj represent characters newly read from the query sequence and the target sequence, i = w ′ + k ′ + 1, j = w ′ + k ′. For word variables a and b, a = a | b is abbreviated as a | = b.

上述の計算において、対角線周辺領域106の境界105 （ i−j = ±w ）に内部から接する辺に対応するブール変数Pv[α], Mv[α] , Ph[α], Mh[α] を計算する際、この境界に外部から接する辺に対応するブール変数Pv[α], Mv[α] , Ph[α], Mh[α] の値を参照する必要が生じる。後者の値は計算されないので、境界条件として、それらに適切な値を事前に定めておく必要がある。そこで、対角線周辺領域106の外部では、クエリー配列とターゲット配列の文字比較結果は常に不一致と仮定する。これは、偽の類似配列が求められることを避けるための（悲観的）仮定である。このことは、対角線周辺領域106の外部では C[i, j]−C[i−1, j−1] = 1 が常に成り立つことを意味する。そのためには、図8に示すように、対角線周辺領域の外部では常に Pv[α] = 1, Mv[α] = Ph[α] = Mh[α] = Eq[α] = 0 が成り立つと仮定すればよい。そこで、これを境界条件とする。このとき、（数７）が成立することは、これらの値を代入して直接確かめることができる。 In the above calculation, the Boolean variables Pv [α], Mv [α], Ph [α], and Mh [α] corresponding to the side in contact with the boundary 105 (i−j = ± w) of the diagonal peripheral region 106 from the inside are When calculating, it is necessary to refer to the values of the Boolean variables Pv [α], Mv [α], Ph [α], and Mh [α] corresponding to the edges that are in contact with the boundary from the outside. Since the latter values are not calculated, it is necessary to determine appropriate values in advance as boundary conditions. Therefore, it is assumed that the character comparison result between the query sequence and the target sequence is always inconsistent outside the diagonal peripheral region 106. This is a (pessimistic) assumption to avoid seeking false similar sequences. This means that C [i, j] −C [i−1, j−1] = 1 always holds outside the diagonal peripheral region 106. For this purpose, as shown in Fig. 8, it is assumed that Pv [α] = 1, Mv [α] = Ph [α] = Mh [α] = Eq [α] = 0 always holds outside the diagonal region. do it. Therefore, this is set as a boundary condition. At this time, the fact that (Equation 7) holds can be confirmed directly by substituting these values.

また、k の値がワード長 w 以下のとき、ワード変数Ph, Mh, Pv, Mv は、DP行列外部の i < 0，j < 0 となる領域に対応するブール変数Pv[α], Mv[α] , Ph[α], Mh[α], Eq[α] を参照するが、これらに対しても同様に、初期条件としてPv[α] = 1, Mv[α] = Ph[α] = Mh[α] = Eq[α] = 0 が常に成り立つと仮定することにより、偽の類似配列が求められることを避けることができる。 When the value of k is less than or equal to the word length w, the word variables Ph, Mh, Pv, Mv are Boolean variables Pv [α], Mv [corresponding to the region where i <0, j <0 outside the DP matrix. α], Ph [α], Mh [α], Eq [α] are also referred to, but for these, Pv [α] = 1, Mv [α] = Ph [α] = By assuming that Mh [α] = Eq [α] = 0 always holds, it can be avoided that a false similar sequence is obtained.

最適アラインメントを求めるトレースバック処理を開始するためには、DP行列の下辺上のDPコスト値 C[m, j]（j = 1, 2, ... , n）の最小値を求める必要がある。類似配列に対しては、この最小値は対角線との交点の近くに現れる。そこで、最小値を探す範囲を、DP行列の下辺上で、対角線周辺領域106に含まれる範囲902に制限する。対角線周辺領域106の内部では、隣接する格子点間のDPコストの差分値を上述の方法で計算しているので、それらの値を図9内の矢印901の列で示すように、C[0, 0] = 0から出発して、先ず対角線に沿って加算し、次に、DP行列の下辺で、対角線との交点から左右両方向に加算することにより、対角線周辺領域106内に含まれるDP行列下辺902上のDPコスト値を計算し、その最小値を求めることができる。また、類似配列の最適アラインメントを求めるトレースバック処理で参照されるのは、対角線周辺領域内の隣接する格子点間のDPコストの差分値なので（非特許文献８）、上述の計算過程で各 k に対するPh, Mh, Pv, Mv の値を計算機主記憶内に保持しておき、トレースバック処理でこれらの値を参照できる。これらの計算は、配列長に比例する少ない計算コストで計算できる。 To start the traceback process to find the optimal alignment, it is necessary to find the minimum value of the DP cost value C [m, j] (j = 1, 2, ..., n) on the lower side of the DP matrix . For similar sequences, this minimum appears near the intersection with the diagonal. Therefore, the range for searching for the minimum value is limited to the range 902 included in the diagonal peripheral region 106 on the lower side of the DP matrix. Inside the diagonal peripheral region 106, the DP cost difference value between adjacent lattice points is calculated by the above-described method, and as shown by the column of the arrow 901 in FIG. , 0] = 0, first adding along the diagonal, and then adding in the left and right directions from the intersection with the diagonal at the lower side of the DP matrix, so that the DP matrix included in the diagonal peripheral region 106 The DP cost value on the lower side 902 can be calculated and its minimum value can be obtained. Further, the traceback processing for obtaining the optimal alignment of similar sequences refers to the difference value of the DP cost between adjacent lattice points in the diagonal peripheral region (Non-Patent Document 8). The values of Ph, Mh, Pv, and Mv are stored in the computer main memory, and these values can be referred to in the traceback process. These calculations can be performed with a small calculation cost proportional to the sequence length.

［実施例２］
実施例１では、DP行列の対角線周辺領域として −w < i−j≦w を満たす2重インデクス (i, j) の範囲を採用した。ここで、wは計算機のワード長を表わす。この場合、実施例１の方法で最適なアラインメントが求まるためには、挿入長、欠失長の累積値が w 未満でなければならない。従って、より多くの挿入・欠失を含むような最適なアラインメントを求める必要がある場合には、DP行列の対角線周辺領域の幅を広げる必要がある。本実施例では、DP行列の対角線周辺領域の幅を、実施例1の2倍に広げるために、実施例1の方法を修正して適用する方法を説明する。DP行列の対角線周辺領域の幅を3倍以上に広げる場合も、容易な類推により、同様な方法が可能である。 [Example 2]
In the first embodiment, the range of the double index (i, j) that satisfies −w <i−j ≦ w is adopted as the diagonal peripheral region of the DP matrix. Here, w represents the word length of the computer. In this case, in order to obtain an optimal alignment by the method of Example 1, the cumulative values of the insertion length and the deletion length must be less than w. Therefore, when it is necessary to obtain an optimal alignment that includes more insertions / deletions, it is necessary to increase the width of the region around the diagonal line of the DP matrix. In the present embodiment, a method will be described in which the method of the first embodiment is modified and applied in order to increase the width of the diagonal peripheral region of the DP matrix to twice that of the first embodiment. The same method can be used with a simple analogy when the width of the diagonal region of the DP matrix is increased by more than three times.

DP行列の対角線周辺領域の幅を、実施例1の2倍に広げると、−2 w < i−j ≦2 w を満たす2重インデクス (i, j) の範囲が計算の対象となる。そこで、この領域を対角線を挟む２つの部分領域、−2 w < i−j ≦0 と0 < i−j ≦2 w に分割する。それぞれの部分領域は、幅wであるため、その内部では実施例１と類似の方法によるビット並列なDP行列の計算が可能となる。但し、これら２つの部分領域が互いに接する（対角線に接する）位置では、互いに他の部分領域で計算されたブール変数Pv[α], Mv[α] , Ph[α], Mh[α], Eq[α] の値を境界条件として使用しなければならない。その他は、実施例1と全く同様に実施できる。 When the width of the diagonal peripheral region of the DP matrix is increased to twice that of the first embodiment, the range of the double index (i, j) satisfying −2 w <i−j ≦ 2 w becomes a calculation target. Therefore, this region is divided into two partial regions sandwiching the diagonal line, −2 w <i−j ≦ 0 and 0 <i−j ≦ 2 w. Since each partial region has a width w, a bit-parallel DP matrix can be calculated by a method similar to that in the first embodiment. However, at the position where these two partial areas touch each other (touch the diagonal), the Boolean variables Pv [α], Mv [α], Ph [α], Mh [α], Eq calculated in the other partial areas. The value of [α] must be used as the boundary condition. Others can be carried out in exactly the same manner as in the first embodiment.

101 DP行列の2重インデクス [i, j] の全体を表わす正方格子
102 クエリー配列のインデクス（文字位置）
103 ターゲット配列のインデクス（文字位置）
104 DP行列の対角線：i = j
105 DP行列内の対角線周辺領域の境界線：i−j = ±w
106 DP行列内の対角線周辺領域：−w < i−j ≦w
107 奇数の k に対応する水平または垂直方向の隣接格子点を結ぶ辺
108 偶数の k に対応する水平または垂直方向の隣接格子点を結ぶ辺
109 DP行列内の隣接格子点を結ぶ辺で囲まれたセル
110 ワード・データ：Ph, Mh, Pv, Mv, Eq
111 ワード・データに対するビット並列演算
201 計算機
202 外部記憶装置
203 ゲノム配列データベース
204 DNAシーケンサ
205 DNAサンプル
206 リード配列データ（クエリー配列データ）
207 データ入力装置
208 パラメータ入力装置
209 主記憶装置
210 シード検索処理部
211 詳細比較処理部
212 最適アラインメント計算処理部
213 出力処理部
214 出力装置
215 検索結果（ゲノム・マッピング位置情報）
216 最適アラインメント結果（塩基変異の情報）
301 DP行列の格子点
302 入力側ブール変数
303 出力側ブール変数
407 対角線周辺領域に含まれ、和がk = 5（奇数）となる2重インデクス [ i, j ] の位置（格子点）の集合
408 対角線周辺領域に含まれ、和がk = 5（奇数）となる2重インデクス [ i, j ] の位置（格子点）
409 対角線周辺領域に含まれ、和がk = 5（奇数）となる2重インデクス [ i, j ] の位置（格子点）の集合
410 対角線周辺領域に含まれ、和がk = 5（奇数）となる2重インデクス [ i, j ] の位置（格子点）
411 ワード変数のインデクス
501 水平方向の隣接辺に対応するブール変数への（１重）インデクスαの対応付け
502 垂直方向の隣接辺に対応するブール変数への（１重）インデクスαの対応付け
503 対角線周辺領域内で、一定のkに対応する（1重）インデクス付けされた（水平または垂直方向の隣接辺に対応する）ブール変数の集合
504 対角線周辺領域内で、一定のkに対応する（1重）インデクス付けされた（水平方向の隣接辺に対応する）ブール変数の集合を詰め込んだワード変数
505 対角線周辺領域内で、一定のkに対応する（1重）インデクス付けされた（垂直方向の隣接辺に対応する）ブール変数の集合を詰め込んだワード変数
601 kが偶数のときの入力側ブール変数
602 k + 1が奇数のときの出力側ブール変数
603 kが奇数のときの入力側ブール変数
604 k + 1が偶数のときの出力側ブール変数
901 DPコスト値を得るために累積されるDPコストの差分値
902 DP行列の下辺上で、対角線周辺領域106に含まれる範囲（DPコスト値の最小値を探す範囲） 101 Square lattice representing the entire double index [i, j] of DP matrix
102 Query sequence index (character position)
103 Target sequence index (character position)
104 Diagonal line of DP matrix: i = j
105 Boundary line around diagonal line in DP matrix: i−j = ± w
106 Diagonal area in DP matrix: −w <i−j ≦ w
107 edge connecting adjacent grid points in horizontal or vertical direction corresponding to odd k
108 edge connecting adjacent grid points in horizontal or vertical direction corresponding to even k
109 Cell surrounded by edges connecting adjacent grid points in DP matrix
110 Word data: Ph, Mh, Pv, Mv, Eq
111 Bit parallel operation on word data
201 calculator
202 External storage device
203 Genome sequence database
204 DNA sequencer
205 DNA samples
206 Read sequence data (query sequence data)
207 Data input device
208 Parameter input device
209 Main memory
210 Seed search processor
211 Detailed comparison processing section
212 Optimal alignment calculation processor
213 Output processor
214 Output device
215 Search results (genome mapping position information)
216 Optimal alignment results (base mutation information)
301 DP matrix grid points
302 Input Boolean variable
303 Boolean variable on output side
407 Set of positions (grid points) of double index [i, j] that are included in the area surrounding the diagonal and whose sum is k = 5 (odd number)
408 Position of the double index [i, j] that is included in the area surrounding the diagonal and the sum is k = 5 (odd number) (grid point)
409 Set of positions (grid points) of double index [i, j] that are included in the area surrounding the diagonal and whose sum is k = 5 (odd number)
410 Position (grid point) of double index [i, j] that is included in the area surrounding the diagonal and whose sum is k = 5 (odd number)
411 Word variable index
501 Correspondence of (single) index α to Boolean variables corresponding to adjacent edges in the horizontal direction
502 Correspondence of (single) index α to Boolean variables corresponding to adjacent edges in the vertical direction
503 A set of Boolean variables (corresponding to adjacent edges in the horizontal or vertical direction) indexed (single) corresponding to a constant k within the region surrounding the diagonal
504 Word variable packed with a set of Boolean variables (corresponding to the adjacent edges in the horizontal direction) indexed (corresponding to the adjacent edge in the horizontal direction) corresponding to a constant k in the area surrounding the diagonal line
505 Word variable packed with a set of Boolean variables (corresponding to adjacent edges in the vertical direction) indexed (corresponding to the adjacent edge in the vertical direction) corresponding to a constant k in the area surrounding the diagonal
Boolean variable on the input side when 601 k is an even number
Boolean variable on the output side when 602 k + 1 is odd
Boolean variable on the input side when 603 k is odd
Boolean variable on the output side when 604 k + 1 is even
901 DP cost difference value accumulated to obtain DP cost value
902 Range included in the area surrounding the diagonal line 106 on the lower side of the DP matrix (the range in which the minimum DP cost value is searched)

Claims

A storage device for storing a character string database;
An input device for entering a search string;
An input device for inputting an upper limit value of an alignment cost for determining that two character strings are similar;
A seed search processing unit that compares the search character string with the character string database and searches for an occurrence position in a character string database of a partial sequence shared by both (seed array);
For each occurrence position of the seed array in the character string database obtained by the seed search processing unit, a partial character string (target array) in the vicinity of the character string database and the entire search character string A detailed comparison processing unit that calculates an optimal alignment by comparison and determines whether or not a sequence similar to the entire search character string is included at the position by comparing the cost with the upper limit value;
Based on the determination result of the detailed comparison processing unit, an appearance position of the character string database where a sequence similar to the search character string appears, and an output device that outputs the optimal alignment,
The detailed comparison processing unit repeatedly performs a process (extension process) for sequentially expanding the optimal alignment for the partial sequence in order to calculate the optimal alignment for the two character string data of the search character string and the target sequence. The process based on dynamic programming,
In the decompression process, when a new character is read from each character string data and compared to decompress the optimum alignment, the absolute value of the difference in the index of the character to be compared (character position counted from the beginning of the character string) is By limiting the processing to those that are less than the word length of the computer, the number of decompression processes in which the sum of those indexes is constant is suppressed to less than the word length, and the bit parallelism of operations on the word of the computer is utilized. An optimal alignment calculation device characterized in that those decompression processes are performed in parallel by one processor.

An input device for inputting a search string and a target sequence;
An input device for inputting an upper limit value of an alignment cost for determining that two character strings are similar;
By comparing the target sequence with the entire search character string to calculate an optimal alignment and comparing the cost with the upper limit value, the target sequence includes a sequence similar to the entire search character string. A detailed comparison processing unit for determining whether or not
An output device that outputs the determination result of the detailed comparison processing unit and the optimum alignment;
The detailed comparison processing unit repeatedly performs a process (extension process) for sequentially expanding the optimal alignment for the partial sequence in order to calculate the optimal alignment for the two character string data of the search character string and the target sequence. The process based on dynamic programming,
In the decompression process, when a new character is read from each character string data and compared to decompress the optimum alignment, the absolute value of the difference in the index of the character to be compared (character position counted from the beginning of the character string) is By limiting the processing to those that are less than the word length of the computer, the number of decompression processes in which the sum of those indexes is constant is suppressed to less than the word length, and the bit parallelism of operations on the word of the computer is utilized. An optimal alignment calculation device characterized in that those decompression processes are performed in parallel by one processor.

In the optimal alignment calculation apparatus according to claim 1,
The cost of substitution, insertion and deletion of one character in the alignment is 1,
The optimal alignment cost between the partial character string from the beginning of the search character string to the index i and the partial character string from the beginning of the neighborhood character string to the index j is represented by C [i, j], and the equation C [i , j] −C [i, j−1] = 1, the Boolean variable is represented by Ph [i, j].
The Boolean variable Mh [i, j] represents whether the equation C [i, j] −C [i, j−1] = −1 holds,
The Boolean variable Pv [i, j] indicates whether or not the equation C [i, j] −C [i−1, j] = 1 holds.
The Boolean variable Mv [i, j] represents whether or not the equation C [i, j] −C [i−1, j] = − 1 holds.
The Boolean variable Eq (i, j) represents whether or not the character of index i of the search character string matches the character of index j of the neighboring character string,
The word length of the computer is represented by w,
For a pair of i and j whose absolute value of the difference between i and j is less than w and whose sum is k, the Boolean variable Ph [i, j] represented by 1 bit of 0 or 1 is the least significant bit for j A word that can be arranged in ascending order is represented as Ph (k), and the Boolean variables Mh [i, j], Pv [i, j], Mv [i, j], Eq [i, j] in the same manner. Are represented as Mh (k), Pv (k), Mv (k), Eq (k)
As a calculation for word variables a and b by the computer, a | b for bitwise OR operation, a & b for AND operation for each bit, ~ a for bitwise NOT operation, and a << 1 for left shift operation for 1 bit , 1-bit right shift operation is expressed as a >> = 1,
Hereinafter, for simplicity of symbols, k is fixed to one value, and the word variables Ph (k), Mh (k), Pv (k), Mv (k), Eq (k) are set to Ph, Mh. , Pv, Mv, Eq,
Xv and Xh as auxiliary word variables
When k is an even number
(Equation 1)
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
in accordance with,
When k is an odd number
(Equation 2)
Ph >> = 1; Mh >> = 1;
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
Pv '<< = 1; Mv'<< = 1;
And calculate the values of word variables Ph (k + 1), Mh (k + 1), Pv (k + 1), Mv (k + 1) by calculating Ph ', Mh', Pv ', Mv' An optimal alignment calculation device characterized in that the extension processing is calculated as described above.

In the optimal alignment calculation apparatus according to claim 2,
The cost of substitution, insertion and deletion of one character in the alignment is 1,
The optimal alignment cost between the partial character string from the beginning of the search character string to the index i and the partial character string from the beginning of the neighborhood character string to the index j is represented by C [i, j], and the equation C [i , j] −C [i, j−1] = 1, the Boolean variable is represented by Ph [i, j].
The Boolean variable Mh [i, j] represents whether the equation C [i, j] −C [i, j−1] = −1 holds,
The Boolean variable Pv [i, j] indicates whether or not the equation C [i, j] −C [i−1, j] = 1 holds.
The Boolean variable Mv [i, j] represents whether or not the equation C [i, j] −C [i−1, j] = − 1 holds.
The Boolean variable Eq (i, j) represents whether or not the character of index i of the search character string matches the character of index j of the neighboring character string,
The word length of the computer is represented by w,
For a pair of i and j whose absolute value of the difference between i and j is less than w and whose sum is k, the Boolean variable Ph [i, j] represented by 1 bit of 0 or 1 is the least significant bit for j A word that can be arranged in ascending order is represented as Ph (k), and the Boolean variables Mh [i, j], Pv [i, j], Mv [i, j], Eq [i, j] in the same manner. Are represented as Mh (k), Pv (k), Mv (k), Eq (k)
As a calculation for word variables a and b by the computer, a | b for bitwise OR operation, a & b for AND operation for each bit, ~ a for bitwise NOT operation, and a << 1 for left shift operation for 1 bit , 1-bit right shift operation is expressed as a >> = 1,
Hereinafter, for simplicity of symbols, k is fixed to one value, and the word variables Ph (k), Mh (k), Pv (k), Mv (k), Eq (k) are set to Ph, Mh. , Pv, Mv, Eq,
Xv and Xh as auxiliary word variables
When k is an even number
(Equation 1)
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
in accordance with,
When k is an odd number
(Equation 2)
Ph >> = 1; Mh >> = 1;
Xv = Eq | Mv; Pv '= Mh | ~ (Xv | Ph); Mv' = Ph &Xv;
Xh = Eq | Mh; Ph '= Mv | ~ (Xh | Pv); Mh' = Pv &Xh;
Pv '<< = 1; Mv'<< = 1;
And calculate the values of word variables Ph (k + 1), Mh (k + 1), Pv (k + 1), Mv (k + 1) by calculating Ph ', Mh', Pv ', Mv' An optimal alignment calculation device characterized in that the extension processing is calculated as described above.

On the computer,
Processing to accept input of search string and target sequence;
A process of accepting an input of an upper limit value of an alignment cost for determining that two character strings are similar;
By comparing the target sequence with the entire search character string to calculate an optimal alignment and comparing the cost with the upper limit value, the target sequence includes a sequence similar to the entire search character string. A detailed comparison process for determining whether or not
A program for executing the determination result of the detailed comparison process and a process of outputting the optimum alignment to an output device;
The detailed comparison process dynamically repeats a process (extension process) for sequentially expanding the optimal alignment for the partial sequence in order to calculate the optimal alignment for the two character string data of the search character string and the target sequence. Execute processing based on the programming method,
In the decompression process, when a new character is read from each character string data and compared to decompress the optimum alignment, the absolute value of the difference in the index of the character to be compared (character position counted from the beginning of the character string) is By limiting the processing to those that are less than the word length of the computer, the number of decompression processes in which the sum of those indexes is constant is suppressed to less than the word length, and the bit parallelism of operations on the word of the computer is utilized. A program characterized by performing the decompression processing in parallel by one processor.