JP7422367B2

JP7422367B2 - Approximate string matching method and computer program for realizing the method

Info

Publication number: JP7422367B2
Application number: JP2021194605A
Authority: JP
Inventors: 淳一郎牧野; 龍太郎姫野
Original assignee: 先端加速システムズ株式会社
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2024-01-26
Anticipated expiration: 2041-11-30
Also published as: JP2023080989A

Description

The present invention relates to approximate string matching, and in particular to an approximate string matching device and an approximate string matching method usable for human genome analysis, and to a computer program for implementing the method.

The human genome is the set of genetic information a person carries. The substance that carries it is DNA (deoxyribonucleic acid), a chain of about 3 billion base pairs. There are four bases: adenine (A), guanine (G), cytosine (C), and thymine (T). A person's genetic information is thus determined by the order (sequence) of these bases.

A device called a sequencer is used to read the human genome. A sequencer reads a sample human genome, fragments it into pieces no longer than a predetermined upper limit (on the order of several hundred base pairs), amplifies them, and outputs a huge data array made up of data pieces. Current sequencers can read one person's genome in about an hour.

The scattered data pieces read out by the sequencer are compared with the human genome reference sequence, which is defined as the standard human genome sequence, and are thereby reconstructed into a human genome sequence of the original length and analyzed. For example, the position of each data piece relative to the human genome reference sequence is determined (mapping), and the data piece is analyzed for the mutations it contains. Because data read from a sequencer normally contains errors, the errors are reduced by statistical processing over redundant data amounting to, for example, 30 times one person's genome sequence per sample. Analyzing a human genome sequence therefore requires an enormous amount of computation, and a high-performance computer such as a supercomputer, a cluster computer, or an FPGA (Field-Programmable Gate Array)-based computer is typically used.

A program tool such as BWA (Burrows-Wheeler Aligner) is used to map the data pieces. BWA consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM. Of these, BWA-MEM is widely used as a fast algorithm that handles indels (insertions and deletions). BWA-MEM builds an index using a suffix array and performs mapping for those parts of a query string, derived from a read data piece, that appear repeatedly in the human genome reference sequence (Non-Patent Document 1).

The data pieces are also matched against the partial sequences of the human genome reference sequence obtained by mapping, and analyzed for the mutations they contain. An alignment algorithm, for example, is used to match a partial sequence of the human genome reference sequence against a data piece. Following dynamic programming, the alignment algorithm identifies approximate strings while calculating a degree of variation (degree of similarity) for each element of an array called an alignment table (Non-Patent Document 2).

Non-Patent Document 1: Heng Li, “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM”, May 26, 2013, arXiv:1303.3997 (q-bio.GN)
Non-Patent Document 2: Masao Uchiyama et al., “High-speed scanning method of suffix array for full text search using approximate string matching”, September 15, 2002, Information Processing Society of Japan, Vol. 43 No. SIG 9 (TOD 15)

BWA-MEM, described above, is widely used as an algorithm that tolerates gaps and allows fast mapping, but the index it creates is very large, so it requires a large amount of memory. In addition, BWA-MEM needs a number of memory accesses proportional to the number of characters (that is, the number of bases) in the query string. Random memory accesses therefore occur frequently, and the processing speed of the processor ends up being limited by memory access time.

More specifically, to identify the positions at which a substring of a query string (sometimes called a "key") appears in the human genome reference sequence, BWA-MEM performs as many random accesses to main memory as there are characters in the query strings in total. This is because the amount of data used in human genome analysis is so large that the required data arrays cannot all be held at once in the faster cache memory. On current computers, the latency of one random access to main memory (the time from issuing the access until the data is actually obtained) is about 1 μs (0.000001 seconds) per bank. Therefore, when analyzing a human genome of 3 billion base pairs with a redundancy of 30, the total access time T over all characters is
T = 3,000,000,000 × 0.000001 × 30
  = 90,000 (seconds)
In other words, random access to main memory alone takes about 90,000 seconds (about 25 hours) in a human genome analysis. Even on a computer with a high-performance processor, memory access time is therefore the constraint, and there is a limit to how far mapping time can be shortened.
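The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and parameter names are illustrative, and the latency is given in whole microseconds so that the arithmetic stays exact.

```python
def total_access_seconds(bases: int, redundancy: int, latency_us: int) -> float:
    """Total random-access time in seconds, one access per character."""
    total_accesses = bases * redundancy          # characters to look up
    return total_accesses * latency_us / 1_000_000


seconds = total_access_seconds(bases=3_000_000_000, redundancy=30, latency_us=1)
hours = seconds / 3600
print(seconds, hours)
```

With the figures from the text (3 billion bases, redundancy 30, 1 μs per access) this gives 90,000 seconds, or 25 hours.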

As noted above, reading out one person's genome with a sequencer currently takes about an hour. When BWA-MEM is run on a high-performance computer, mapping completes within the sequencer's read time, so the two are roughly in balance.

Next-generation sequencers, however, are becoming cheaper and are said to be capable of even shorter read times, so shorter analysis times are wanted as well. One approach to shortening analysis time is to make computers still more powerful, but doing so is very costly, which is a high hurdle to practical use.

Furthermore, conventional alignment algorithms used to analyze mapped data pieces calculate the degree of variation for every element of the alignment table, so the amount of computation is large and the processing takes time.

An object of the present invention is therefore to provide a new technique that can analyze given query data quickly and/or efficiently using reference data.

More specifically, one object of the present invention is to provide an approximate string matching device that can perform approximate string matching between a given query string and a reference string quickly and/or efficiently, and an approximate string matching method using the device.

Another object of the present invention is to provide an approximate string matching device that can quickly and/or efficiently analyze data pieces read out by a sequencer using the human genome reference sequence, and an approximate string matching method using the device.

Another object of the present invention is to provide a technique for creating a hierarchical index, based on the reference string, that is suited to the approximate string matching.

To solve the above problems, the present invention comprises the following invention-specifying matters or technical features.

According to one aspect, the present invention is a computer program for causing a computing device to implement a method for searching a reference string for an approximate string based on a query string.
The method includes: creating a hierarchical index based on the reference string; mapping the query string to the reference string by referring to the hierarchical index, in order to identify substrings of the reference string that match at least part of the query string; and deriving the approximate string based on at least one substring identified by the mapping.
Here, creating the hierarchical index includes: cutting out first keys of a predetermined length from the reference string; creating a first key array in which each cut-out first key is assigned a hash value calculated from that key by a predetermined hash function; updating the created first key array; and outputting the updated first key array as the hierarchical index.
Updating the first key array includes: identifying, for each first key in the first key array, the number of occurrences of that first key in the reference string; and, according to the identified number of occurrences, creating a new first key by appending a first additional key to the first key and updating the first key array based on the new first key.

Creating the first key array may include sorting the first keys in the first key array according to their hash values.

Updating the first key array may include: determining whether the identified number of occurrences exceeds a predetermined tolerance value; when it is determined that the identified number of occurrences exceeds the predetermined tolerance value, creating the new first key by appending to the first key the first additional key, which consists of at least one character following that first key in the reference string; and identifying, for the new first key, its number of occurrences in the reference string.

Updating the first key array may further include sorting the new first keys in the first key array according to the first additional key.

Updating the first key array may include creating new first keys by successively appending new first additional keys to the current first key until it is determined that the identified number of occurrences does not exceed the predetermined tolerance value.

Outputting the key array as the hierarchical index may include outputting the current key array as the hierarchical index when it is determined that the identified number of occurrences does not exceed the predetermined tolerance value.
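The update loop described above (lengthen a key while it occurs more often than the tolerance value) can be sketched as below. Several things here are assumptions made for illustration: keys are stored as plain strings in a dictionary rather than as a sorted array of hash values, the additional key is always one character, and the entry layout (`positions`, `extend`) and the function name are hypothetical.

```python
from collections import defaultdict


def build_hierarchical_index(reference: str, key_len: int, tolerance: int) -> dict:
    """Map each key to its start positions, lengthening over-frequent keys."""
    index = {}
    pending = defaultdict(list)
    for pos in range(len(reference) - key_len + 1):
        pending[reference[pos:pos + key_len]].append(pos)

    while pending:
        longer = defaultdict(list)
        for key, positions in pending.items():
            if len(positions) <= tolerance:
                # Within tolerance: the key is final at this level.
                index[key] = {"positions": positions, "extend": False}
                continue
            leftover = []
            for pos in positions:
                end = pos + len(key) + 1      # append one following character
                if end <= len(reference):
                    longer[reference[pos:end]].append(pos)
                else:
                    leftover.append(pos)      # key ends at the end of the reference
            index[key] = {"positions": leftover, "extend": True}
        pending = longer
    return index


index = build_hierarchical_index("ACACACGT", key_len=2, tolerance=1)
print(index["AC"], index["ACG"])
```

In this toy reference the key `AC` occurs three times, so it is marked for extension and resolved through longer keys such as `ACG`, which occurs once.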

Performing the mapping may include: cutting out second keys of a predetermined length from the query string; creating a second key array in which each second key cut out from the query string is assigned a hash value calculated from that key by the predetermined hash function; and, for each second key, referring to the hierarchical index according to the hash value at a predetermined sampling interval and identifying the occurrence start position and number of occurrences of that second key.

Identifying the occurrence start position and the number of occurrences of the second key may include: determining whether the number of occurrences of the second key exceeds the predetermined tolerance value; when it is determined that the number of occurrences exceeds the predetermined tolerance value, creating a new second key by appending to the second key a second additional key consisting of at least one character following that second key in the query string; and, when it is determined that the number of occurrences does not exceed the predetermined tolerance value, outputting the identified current second key as a matching string together with the occurrence start position of that second key.

Identifying the occurrence start position and the number of occurrences of the second key may further include creating the new second key by successively appending new second additional keys to the current second key until it is determined that the identified number of occurrences of the second key does not exceed the predetermined tolerance value.

When it is determined that the number of occurrences of the second key exceeds the predetermined tolerance value, the predetermined sampling interval for that second key may be increased.
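The mapping steps above can be illustrated with a small lookup loop. This is a sketch under stated assumptions: the index is a hand-written dictionary keyed by key strings (standing in for hash-based lookup), each entry carries a hypothetical `extend` flag meaning "occurs more often than the tolerance value", and the optional enlargement of the sampling interval is not modeled.

```python
def map_query(query: str, index: dict, key_len: int, interval: int) -> list:
    """Return (query offset, matched key, reference positions) for each sampled key."""
    hits = []
    for offset in range(0, len(query) - key_len + 1, interval):
        end = offset + key_len
        while end <= len(query):
            entry = index.get(query[offset:end])
            if entry is None:             # key does not occur in the reference
                break
            if not entry["extend"]:       # occurrences within tolerance: report
                hits.append((offset, query[offset:end], entry["positions"]))
                break
            end += 1                      # too frequent: append next query character
    return hits


# Toy index for the reference "ACACACGT" (only the entries needed here).
toy_index = {
    "AC": {"positions": [], "extend": True},
    "ACA": {"positions": [], "extend": True},
    "ACG": {"positions": [4], "extend": False},
    "CG": {"positions": [5], "extend": False},
    "GT": {"positions": [6], "extend": False},
}
print(map_query("ACGT", toy_index, key_len=2, interval=2))
```

Here the first sampled key `AC` is too frequent, so it is lengthened to `ACG`, which resolves to a single reference position; the second sampled key `GT` resolves immediately.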

Deriving the approximate string may include: receiving a string pair, consisting of a string to be matched and a matching string, based on the matching string identified by the mapping; executing a predetermined alignment process to derive at least one approximate string based on the string pair; and outputting the derived at least one approximate string.

Executing the predetermined alignment process may include: creating a predetermined alignment table based on the string to be matched and the matching string; setting a calculation region of width m centered on the diagonal elements of the alignment table; calculating a degree of variation for each element in the set calculation region; determining a maximum degree of variation based on the calculated degrees of variation; and deriving the at least one approximate string based on the determined maximum degree of variation.

Executing the predetermined alignment process may also include: comparing the maximum degree of variation with a predetermined lower limit value to determine whether the maximum degree of variation exceeds the lower limit value; when it is determined that the maximum degree of variation does not exceed the lower limit value, widening the width m of the calculation region to set a new calculation region; and, when it is determined that the maximum degree of variation exceeds the lower limit value, deriving the at least one approximate string based on the element having the maximum degree of variation. The method may be configured to repeat widening the calculation region, setting a new calculation region, and calculating the degree of variation until it is determined that the maximum degree of variation exceeds the lower limit value.

The predetermined lower limit value may be the value the degree of variation would take if a given element row contained m consecutive gaps and everything else matched completely or substantially.
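The banded calculation and the widening loop can be sketched as follows. This is an illustration under assumptions, not the embodiment itself: the degree of variation is modeled as a Needleman-Wunsch score with match +1, mismatch -1, and gap -1; the band is the set of cells with |i - j| <= m; the score of the bottom-right cell stands in for the maximum degree of variation; the lower limit is computed as m gaps plus matches for the rest of the shorter string; and doubling m is just one possible widening policy. All function names are hypothetical.

```python
NEG_INF = float("-inf")


def banded_score(a: str, b: str, m: int, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only table cells with |i - j| <= m."""
    rows, cols = len(a), len(b)
    prev = [j * gap if j <= m else NEG_INF for j in range(cols + 1)]
    for i in range(1, rows + 1):
        cur = [NEG_INF] * (cols + 1)
        if i <= m:
            cur[0] = i * gap
        for j in range(max(1, i - m), min(cols, i + m) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[cols]       # -inf if the corner cell lies outside the band


def align_with_widening(a: str, b: str, m: int = 1, match=1, mismatch=-1, gap=-1):
    """Widen the band until the score exceeds the lower limit; return (score, m)."""
    m = max(1, m)
    longest = max(len(a), len(b))
    while True:
        score = banded_score(a, b, m, match, mismatch, gap)
        # Assumed lower limit: m consecutive gaps, everything else matching.
        lower_limit = match * (min(len(a), len(b)) - m) + gap * m
        if score > lower_limit or m >= longest:
            return score, m
        m *= 2              # widening policy chosen for this sketch


print(align_with_widening("ACGTACGT", "ACGACGT"))
print(align_with_widening("AAACGTACGT", "CGTACGT"))
```

For two strings that differ by a single deletion the narrowest band already suffices, whereas a three-base offset forces the band to be widened before the bottom-right cell becomes reachable. Restricting the calculation to the band is what avoids filling the whole alignment table.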

The method may further include: creating the string to be matched by appending to the matching string identified by the mapping a corresponding predetermined string from the reference string; and creating the matching string of the string pair by appending to the matching string identified by the mapping a corresponding predetermined string from the query string.

According to another aspect, the present invention is a computer program for causing a computing device to implement a method of creating a hierarchical index for searching a reference string based on a query string.
The method includes: cutting out first keys of a predetermined length from the reference string; creating a first key array in which each cut-out first key is assigned a hash value calculated from that key by a predetermined hash function; updating the created first key array; and outputting the updated first key array as the hierarchical index.
Here, updating the first key array includes: identifying, for each first key in the first key array, the occurrence start position and number of occurrences of that first key in the reference string; and, according to the identified occurrence start position and number of occurrences, creating a new first key by appending a first additional key to the first key and updating the first key array based on the new first key.

According to another aspect, the present invention is a computer program for causing a computing device to implement a method of mapping a query string to a reference string.
The method includes: reading a hierarchical index based on the reference string; cutting out keys having a predetermined key length from the query string to create a key array; creating a key array in which each key cut out from the query string is assigned a hash value calculated from that key by the predetermined hash function; for each key, referring to the hierarchical index according to the hash value at a predetermined sampling interval and identifying the occurrence start position and number of occurrences of the key; determining whether the identified number of occurrences exceeds a predetermined threshold; when it is determined that the identified number of occurrences exceeds a predetermined tolerance value, creating a new key by appending to the key an additional key consisting of at least one character following that key in the query string; and, when it is determined that the identified number of occurrences does not exceed the predetermined threshold, outputting the identified current key and its occurrence start position.
Identifying the occurrence start position and the number of occurrences of the key includes creating the new key by successively appending new additional keys to the current key until it is determined that the identified number of occurrences does not exceed the predetermined threshold.

Here, when it is determined that the identified number of occurrences exceeds the predetermined tolerance value, the predetermined sampling interval for the key may be increased.

According to another aspect, the present invention is a computer program for causing a computing device to implement a method of identifying, by a predetermined alignment process, the variation between a substring of a reference string and a query string.
The method includes: receiving a string pair, consisting of a string to be matched and a matching string, based on a matching string identified by mapping; executing a predetermined alignment process to derive at least one approximate string based on the string pair; and outputting the derived at least one approximate string.
Executing the predetermined alignment process includes: creating a predetermined alignment table based on the string to be matched and the matching string; setting a calculation region of width m centered on the diagonal elements of the alignment table; calculating a degree of variation for each element in the set calculation region; determining a maximum degree of variation based on the calculated degrees of variation; and deriving the at least one approximate string based on the determined maximum degree of variation.

Executing the predetermined alignment process may also include: comparing the maximum degree of variation with a predetermined lower limit value to determine whether the maximum degree of variation exceeds the lower limit value; when it is determined that the maximum degree of variation does not exceed the lower limit value, widening the width m of the calculation region to set a new calculation region; and, when it is determined that the maximum degree of variation exceeds the lower limit value, deriving the at least one approximate string based on the element having the maximum degree of variation.
Widening the calculation region, setting a new calculation region, and calculating the degree of variation may be repeated until it is determined that the maximum degree of variation exceeds the lower limit value.

Executing the predetermined alignment process may further include setting, as the predetermined lower limit value, the value the degree of variation would take if a given element row contained m consecutive gaps and everything else matched completely or substantially.

The present invention may also be embodied as a recording medium storing the computer program, or as a device comprising hardware and/or firmware configured to execute the method. An approximate string matching device is one form of the present invention.

In this specification and the like, a "means" or "unit" does not mean only a physical means; it also covers the case where the function of that means or unit is realized by software. The function of one means or unit may be realized by two or more physical means, and the functions of two or more means or units may be realized by one physical means. A "system" is a logical collection of multiple devices (or functional modules that each realize a specific function), regardless of whether the devices or functional modules are in a single housing.

According to the present invention, given query data can be analyzed quickly and/or efficiently using reference data. In particular, approximate string matching between a given query string and a reference string can be performed quickly and/or efficiently.

According to the present invention, data pieces read out by a sequencer can also be analyzed quickly and/or efficiently using the human genome reference sequence.

Furthermore, the present invention can provide a hierarchical index, based on the reference string, that is suited to approximate string matching.

Other technical features, objects, effects, and advantages of the present invention will become clear from the following embodiments, which are described with reference to the accompanying drawings. The effects described in this specification are examples only and are not limiting; other effects may also be obtained.

FIG. 1 is a block diagram showing an example of a schematic configuration of a computer system according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an example of an outline of approximate string matching processing performed by a computer system according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an example of index creation processing performed by a computer system according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of a reference string used in a computer system according to an embodiment of the present invention.
FIG. 4A is a diagram for explaining an example of a data array structure in the process of creating a hierarchical index based on the reference string shown in FIG. 4.
FIG. 4B is a diagram for explaining an example of a data array structure in the process of creating a hierarchical index based on the reference string shown in FIG. 4.
FIG. 4C is a diagram for explaining an example of a data array structure in the process of creating a hierarchical index based on the reference string shown in FIG. 4.
FIG. 5 is a diagram showing an example of a table structure that holds key start positions and occurrence counts in the process of creating a hierarchical index based on the reference string shown in FIG. 4.
FIG. 6 is a diagram showing an example of a reference string used in a computer system according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an example of processing in which a computer system according to an embodiment of the present invention searches a reference string based on a query string.
FIG. 8A is a diagram showing an example of an alignment table for explaining the Needleman-Wunsch algorithm.
FIG. 8B is a diagram showing an example of an alignment table for explaining the Needleman-Wunsch algorithm.
FIG. 8C is a diagram showing an example of an alignment table for explaining the Needleman-Wunsch algorithm.
FIG. 8D is a diagram showing an example of an alignment table for explaining the Needleman-Wunsch algorithm.
FIG. 9 is a flowchart illustrating an example of alignment processing using dynamic programming performed by a computer system according to an embodiment of the present invention.
FIG. 10A is a diagram showing an example of an alignment table for explaining alignment using dynamic programming according to an embodiment of the present invention.
FIG. 10B is a diagram showing an example of an alignment table for explaining alignment using dynamic programming according to an embodiment of the present invention.
FIG. 10C is a diagram showing an example of an alignment table for explaining alignment using dynamic programming according to an embodiment of the present invention.
FIG. 11A is a diagram showing an example of an approximate string obtained by alignment using dynamic programming according to an embodiment of the present invention.
FIG. 11B is a diagram showing an example of an approximate string obtained by alignment using dynamic programming according to an embodiment of the present invention.
FIG. 12 is a diagram showing an example of a hardware configuration of a computing device in a system according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described below are merely examples, however, and are not intended to exclude modifications or applications of techniques that are not explicitly described. The present invention can be implemented with various modifications (for example, by combining the embodiments) without departing from its spirit. In the following description of the drawings, identical or similar parts are denoted by identical or similar reference signs. The drawings are schematic and do not necessarily match actual dimensions, ratios, and the like. Dimensional relationships and ratios may also differ from one drawing to another.

FIG. 1 is a block diagram showing an example of a schematic configuration of a computer system according to an embodiment of the present invention. As shown in the figure, the computer system 1 includes, for example, a higher-level computer 10, a plurality of lower-level computers 20(1) to 20(n), and a database 30. The higher-level computer 10 and the lower-level computers 20(1) to 20(n) are communicably connected via a predetermined interface. In the present disclosure, the computer system 1 is configured as a centralized computer system in which the higher-level computer 10 exercises overall control over the lower-level computers 20(1) to 20(n); however, it is not limited to this and may instead be configured as, for example, a distributed computer system. In a distributed computer system, the computers can operate cooperatively through parallel distributed processing, although for a particular process only one representative computer may execute that process. The hardware configuration of the higher-level computer 10 and the lower-level computers 20 is illustrated in FIG. 12; because it is obvious to those skilled in the art, a detailed description is omitted. In the present disclosure, the lower-level computers 20 process tasks in parallel under the control of the higher-level computer 10. For example, each of the lower-level computers 20 performs analysis by comparing the individual query string it has been given with the reference string. In the following, the lower-level computers 20(1) to 20(n) may be referred to simply as "lower-level computer 20" unless they need to be distinguished from one another.

The database 30 stores various data, such as a reference string, under the control of the higher-level computer 10. As an example, the database 30 stores a human genome reference sequence. The human genome reference sequence is data representing the sequence of base pairs defined as the standard human genome sequence. The database 30 also stores an index created from the reference string. The index is a kind of data array structure used to search the reference string. In the figure, the database 30 is connected so that only the higher-level computer 10 can access it, but the configuration is not limited to this; the lower-level computers 20 may also be connected so that they can access it.

FIG. 2 is a flowchart illustrating an example of an outline of approximate string matching processing performed by a computer system according to an embodiment of the present invention. This processing is realized, for example, by the higher-level computer 10 and the lower-level computers 20 executing an approximate string matching program under the control of their processors and thereby cooperating with the other hardware components.

That is, as shown in the figure, the higher-level computer 10 first reads the reference string stored in the database 30 and creates an index based on it (S201). Specifically, the higher-level computer 10 creates an index having a predetermined data array structure by extending and/or sorting substrings of the reference string according to predetermined conditions and arranging them in an array. The data array structure is a hierarchical tree structure. In the present disclosure, such an index is referred to as a hierarchical index. The higher-level computer 10 stores the index created from the reference string in the database 30. The process of creating the hierarchical index from the reference string is described in detail later. If a hierarchical index based on the reference string has already been created and stored in the database 30, step S201 can be omitted.

Next, the higher-level computer 10 receives a query string from an external device (not shown) and performs mapping based on the received query string by referring to the hierarchical index (S202). In one example, the external device is a sequencer that reads the human genome, and the query string is a data fragment output by the sequencer, for example about 150 to 200 bases long. In the present disclosure, the lower-level computers 20 also perform mapping, under the control of the higher-level computer 10, by referring to the hierarchical index based on the query strings assigned to them.

Mapping is the process of identifying a substring of the reference string (a matching string) that matches at least part of the query string. In other words, mapping identifies the start position and length (number of characters) of the substring of the reference string that matches at least part of the query string. In the mapping according to the present disclosure, the hierarchical index is searched for each key in the query string, which identifies each key of the query string within the reference string together with its start position. To speed up processing, the created hierarchical index can be loaded into high-speed memory such as main memory or cache memory.

In addition, in the mapping according to the present disclosure, when the hierarchical index is searched, the sampling interval of key start positions in the query string is adjusted to be somewhat larger than in the prior art. For example, in BLAST (Basic Local Alignment Search Tool), an algorithm for sequence alignment of DNA base sequences and protein amino acid sequences, the interval between key start positions is set to 3 to 5 characters, whereas in the search according to the present disclosure it can be set to, for example, 10 characters or more. If the sampling interval of the key start positions is small, the number of lookups increases and efficiency drops. If the sampling interval is always large, the search becomes faster but the probability of overlooking a match rises. The present disclosure therefore lengthens the key and, when the key exceeds a certain length, widens the sampling interval of the start positions, which prevents key matches from being overlooked while improving search efficiency.
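
The effect of the sampling interval on the number of lookups can be illustrated with a minimal Python sketch. The function name `sample_keys`, the query string, the key length of 4, and the intervals used here are assumptions chosen for illustration; the disclosure itself only states that the interval may be, for example, 10 characters or more.

```python
def sample_keys(query: str, key_len: int, interval: int) -> list[tuple[int, str]]:
    """Return (start position, key) pairs taken from the query every `interval` characters."""
    last_start = len(query) - key_len
    return [(pos, query[pos:pos + key_len]) for pos in range(0, last_start + 1, interval)]


# Hypothetical query: a fragment of the example reference string.
QUERY = "GTATACCCTA"

dense = sample_keys(QUERY, key_len=4, interval=1)   # every start position
sparse = sample_keys(QUERY, key_len=4, interval=3)  # every third start position

print(len(dense))   # 7 index lookups
print(sparse)       # [(0, 'GTAT'), (3, 'TACC'), (6, 'CCTA')]
```

A wider interval means fewer index lookups per query, at the cost of fewer chances to hit a matching key, which is the trade-off described above.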

In the above example, the higher-level computer 10 and the lower-level computers 20 each perform mapping according to the query strings assigned to them, but the configuration is not limited to this. For example, only the higher-level computer 10 may search the hierarchical index based on the assigned query string.

Next, the higher-level computer 10 and the lower-level computers 20 each create a string pair for the approximate string matching described later (S203). A string pair consists of a string to be matched, taken from the reference string, and a comparison string, taken from the query string, each of which contains the matching string identified by mapping.

Specifically, the higher-level computer 10 and the lower-level computers 20 create the string to be matched by adding, to the beginning and to the end of the matching string identified by mapping, the corresponding strings of a predetermined length from the reference string. In other words, the string to be matched is the neighborhood of the matching string in the reference string, including the matching string itself. For example, when the reference string is "CCGATCTGTATACCCTACGA" and the matching string is "TACC", adding two characters before and two characters after it yields "TATACCCT" as the string to be matched. In human genome analysis, the number of bases added from the reference string to the beginning and to the end of the matching string can each be, for example, about 50.
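
The worked example above can be reproduced with a short Python sketch. The function name `string_to_be_matched` and the clamping at the ends of the reference string are assumptions made for illustration; the source does not say how a matching string near either end is handled.

```python
def string_to_be_matched(reference: str, start: int, length: int, flank: int) -> str:
    """Extend the matching string reference[start:start+length] by `flank` characters on each side."""
    lo = max(0, start - flank)                          # assumed: clamp at the beginning
    hi = min(len(reference), start + length + flank)    # assumed: clamp at the end
    return reference[lo:hi]


REFERENCE = "CCGATCTGTATACCCTACGA"
MATCH = "TACC"

start = REFERENCE.find(MATCH)  # 10: the only occurrence of "TACC"
print(string_to_be_matched(REFERENCE, start, len(MATCH), flank=2))  # TATACCCT
```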

The higher-level computer 10 also creates the comparison string by adding, to the end of the matching string, the corresponding string of a predetermined length from the query string. In human genome analysis, the length of the string added from the query string to the end of the matching string is, for example, about 50 bases.

In the present disclosure, strings of a predetermined length from the reference string are added to both the beginning and the end of the matching string, but this is not a limitation; for example, a string of a predetermined length may be added to only the beginning or only the end. For the query string, strings of a predetermined length may be added to both the beginning and the end of the matching string, or the matching string itself may be used as the comparison string without adding anything.

Next, the higher-level computer 10 and the lower-level computers 20 derive at least one approximate string from the string pair consisting of the string to be matched, which is based on the reference string, and the comparison string, which is based on the query string (S204). A predetermined alignment using dynamic programming is applied to derive the approximate string. Alignment is a technique that compares two sequences (strings) element by element while allowing substitutions, insertions, and deletions, and calculates a degree of variation (similarity) according to defined scores and penalties. The Smith-Waterman algorithm and the Needleman-Wunsch algorithm are known algorithms for alignment. Dynamic programming is a technique that calculates the optimal solution of the next stage from the optimal solution obtained at the current stage. That is, in calculating the degree of variation, a score is calculated for each element of an array-shaped alignment table according to dynamic programming, the element with the maximum score is determined, and at least one approximate string is derived from it.
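
As a point of reference, the textbook Needleman-Wunsch global alignment score can be sketched in Python as follows. The scoring values (+1 for a match, -1 for a mismatch, -1 for a gap) and the function name are assumptions for illustration. This is the standard algorithm named above, not the embodiment's own alignment procedure, and it returns only the final score rather than the aligned strings.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of a and b computed with a dynamic-programming table."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    # Each cell is the best of: substitute/match, delete from a, insert from b.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]


print(needleman_wunsch_score("TACC", "TACC"))  # 4 (four matches)
print(needleman_wunsch_score("TACC", "TAGC"))  # 2 (three matches, one substitution)
print(needleman_wunsch_score("TACC", "TAC"))   # 2 (three matches, one gap)
```

Each cell depends only on its upper, left, and upper-left neighbors, which is the "next stage from the current stage" property of dynamic programming described above.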

As described above, in the approximate string matching of this embodiment, a hierarchical index is first created from the reference string. The hierarchical index is then searched according to a given query string to identify a matching string (and its length). Finally, approximate string matching is performed on the string pair consisting of the string to be matched and the comparison string, both based on the identified matching string, and an approximate string is thereby derived.

FIG. 3 is a flowchart illustrating an example of index creation processing performed by a computer system according to an embodiment of the present invention. That is, FIG. 3 shows the details of the hierarchical index creation process (S201) shown in FIG. 2. FIG. 4 is a diagram showing an example of a reference string from which a hierarchical index is created, and FIGS. 4A to 4C are diagrams showing an example of the data array structure in the process of creating the hierarchical index from the reference string. FIG. 5 is a diagram showing an example of a table structure that holds the start position and occurrence count of each key in the process of creating the hierarchical index, and FIG. 6 is a diagram for explaining the hierarchical index.

First, as shown in FIG. 3, the higher-level computer 10 receives the reference string from which the hierarchical index is to be created (S301). For a human genome reference sequence, for example, the higher-level computer 10 accesses the database 30 and reads the stored human genome reference sequence. For ease of understanding, the following description uses as an example the reference string "CCGATCTGTATACCCTACGA", which is composed of the four characters "A", "C", "G", and "T" (see FIG. 4).

The higher-level computer 10 cuts substrings (that is, keys) out of the received reference string according to a predetermined key length and creates a key array (S302). For example, when the key length is 2, the key array is as shown in FIG. 4A(a). In the figure, the number at the left end is the array number. The key that starts at the 19th character "A", which is the end of the reference string, is omitted here to simplify the explanation.

Next, the higher-level computer 10 calculates a hash value for each key in the created key array using a predetermined hash function and assigns the value to that key (S303). The key array then becomes as shown in FIG. 4A(b). The hash function in the present disclosure is defined as a function that assigns the values 0 to 3 to the four characters "A", "C", "G", and "T", respectively, and outputs the value obtained by reading the key as a base-4 number, although it is not limited to this. For example, for the key "CG" at array number 1, the hash value is h(CG) = 1 × 4 + 2 = 6, and for the key "AC" at array number 11, the hash value is h(AC) = 0 × 4 + 1 = 1.
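
Steps S302 and S303 can be sketched in Python as follows. The names `CODE`, `extract_keys`, and `key_hash` are hypothetical; only the character-to-digit assignment and the base-4 reading come from the description above.

```python
# Digit assigned to each character, as described: A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def extract_keys(reference: str, key_len: int) -> list[str]:
    """Cut out every full-length substring of the reference (S302)."""
    return [reference[i:i + key_len] for i in range(len(reference) - key_len + 1)]


def key_hash(key: str) -> int:
    """Read the key as a base-4 number (S303)."""
    value = 0
    for ch in key:
        value = value * 4 + CODE[ch]
    return value


REFERENCE = "CCGATCTGTATACCCTACGA"
keys = extract_keys(REFERENCE, key_len=2)

print(keys[1], key_hash(keys[1]))    # CG 6
print(keys[11], key_hash(keys[11]))  # AC 1
```

Because the hash is just the base-4 value of the key, sorting by hash value is the same as sorting the keys alphabetically over the alphabet A < C < G < T.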

Next, the higher-level computer 10 sorts the keys in the key array according to their assigned hash values, for example in ascending order (S304). The key array then becomes as shown in FIG. 4A(c). In this example, each entry of the sorted key array can retain its array number from before the sort. For example, in FIG. 4A(c), the 0th key "AC" in the sorted key array retains its original array number 11, and the 1st key "AC" retains its original array number 16.
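
The sort in S304 can be sketched in Python as follows, keeping the original array number next to each key. All names are hypothetical. The sketch relies on Python's `sorted` being stable, so equal keys stay in the order in which they occur in the reference string.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def key_hash(key: str) -> int:
    """Base-4 value of the key, with A=0, C=1, G=2, T=3."""
    value = 0
    for ch in key:
        value = value * 4 + CODE[ch]
    return value


REFERENCE = "CCGATCTGTATACCCTACGA"
KEY_LEN = 2

# (key, original array number) pairs in reference order.
entries = [(REFERENCE[i:i + KEY_LEN], i) for i in range(len(REFERENCE) - KEY_LEN + 1)]

# Sort by hash value only; the original array number travels with the key.
sorted_entries = sorted(entries, key=lambda entry: key_hash(entry[0]))

print(sorted_entries[0])     # ('AC', 11)
print(sorted_entries[1])     # ('AC', 16)
print(sorted_entries[14:17]) # [('TA', 8), ('TA', 10), ('TA', 15)]
```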

In the above example, a calculated hash value is assigned to each extracted key and the keys are then sorted, but the procedure is not limited to this. For example, regardless of which keys are actually extracted from the reference string, a key array may be prepared in advance with hash values calculated for every combination of the characters that appear in the reference string, and the start position of each extracted key may then be assigned to the corresponding element. That is, for a reference string composed of the four characters "A", "C", "G", and "T" and a key length of, for example, 2 characters, a key array with 4^2 elements is created first and a hash value is assigned to each element. Each extracted key is then assigned, together with its start position in the reference string, to the corresponding element (the element for the same key) of the created key array, which yields a key array like the one shown in FIG. 4A(c).
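
This alternative can be sketched in Python as a table with 4^2 slots indexed directly by hash value. The names `buckets` and `flattened` are hypothetical, and storing a list of start positions per slot is an assumption about how "assigning" keys to elements might be realized.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def key_hash(key: str) -> int:
    """Base-4 value of the key, with A=0, C=1, G=2, T=3."""
    value = 0
    for ch in key:
        value = value * 4 + CODE[ch]
    return value


REFERENCE = "CCGATCTGTATACCCTACGA"
KEY_LEN = 2

# One slot per possible key: 4 ** 2 = 16 slots, indexed by hash value.
buckets: list[list[int]] = [[] for _ in range(4 ** KEY_LEN)]
for pos in range(len(REFERENCE) - KEY_LEN + 1):
    buckets[key_hash(REFERENCE[pos:pos + KEY_LEN])].append(pos)

# Reading the slots in hash order gives the same result as sorting, without a sort.
flattened = [(REFERENCE[pos:pos + KEY_LEN], pos) for slot in buckets for pos in slot]

print(len(buckets))              # 16
print(buckets[key_hash("AA")])   # [] ("AA" does not occur)
print(flattened[:2])             # [('AC', 11), ('AC', 16)]
```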

Next, the higher-level computer 10 identifies the start position and occurrence count of each key in the current key array (S305). FIG. 5(a) shows the start position and occurrence count of each key in the key array shown in FIG. 4A(c). For example, the key "AC" appears twice in the key array, beginning at start position 0 (array number 0). Likewise, the key "CC" appears three times, beginning at start position 4. Here, start positions and occurrence counts are shown for every key pattern formed from all combinations of the characters (that is, 16 patterns). A key that is not contained in the example key array, such as "AA", is shown with start position "-" and occurrence count 0.

Next, the higher-level computer 10 determines, for each key, whether its occurrence count exceeds a predetermined tolerance value (S306). In the present disclosure, the predetermined tolerance value is the number of times the same key is allowed to appear in the hierarchical index. In this example, the predetermined tolerance value is 1, which means that each key in the hierarchical index is unique. The larger the predetermined tolerance value, the more likely it is that duplicate keys remain in the hierarchical index, but the faster the hierarchical index can be created. When the higher-level computer 10 determines that there is a key whose occurrence count exceeds the predetermined tolerance value (Yes in S306), it adds an additional key to that key (S308).

An additional key is one or more characters that follow the key in the original string. In this example, the additional key is one character. The substring obtained by adding the additional key is treated as a new key. In the following, a new key to which an additional key has been added may be called an "extended key" to distinguish it from the original key, and an array of such keys may be called an extended key array. Adding additional keys allows keys that were previously identical to be distinguished from one another. FIG. 4B(a) shows an example of an extended key array consisting of extended keys in which one additional key has been added to each original key. For the extended key "GA" at array number 12, no character follows "A" in the original string, so "$", for example, is assigned as a terminator. The keys at array numbers 13, 17, and 18 each appear only once, so no additional key is added to them.

Next, as shown in FIG. 4B(b), the higher-level computer 10 further sorts the keys (that is, the extended keys) according to their additional keys, for example in ascending order (S308). In this sort, the sort order of the original keys takes precedence. In the figure, for example, the two "AT" keys at array numbers 2 and 3 have swapped places as a result of sorting by the additional key. The higher-level computer 10 then returns to step S306 and repeats the above processing until no key has an occurrence count that exceeds the predetermined tolerance value.

That is, the higher-level computer 10 identifies the start position and occurrence count of each key in the key array (S305), and when it determines that no key has an occurrence count exceeding the predetermined tolerance value (No in S306), the desired hierarchical index has been created and the processing ends.

FIG. 5 is a diagram for explaining the start positions and occurrence counts of keys in the process of creating the hierarchical index from the reference string shown in FIG. 4. For example, an additional key is added to each key whose occurrence count in FIG. 5(a) is 2 or more (FIG. 4B(a)), and the keys (extended keys) are sorted according to the additional keys (FIG. 4B(b)). As a result, the start position and occurrence count of each key in the current key array are identified as shown in FIG. 5(b). In the example shown in the figure, the extended keys "CGA" and "TAC" each have an occurrence count of 2. Additional keys are therefore added to each of these extended keys in the same way, and the keys are sorted again (FIGS. 4C(a) and (b)). For the key "CG" with original array number 17 (array number 8 after sorting by hash value), no character follows the additional key "A", so "$", for example, is assigned as a terminator. As a result, the start position and occurrence count of each key are identified as shown in FIG. 5(c). Every extended key now has an occurrence count of 1, so the index creation process ends.
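
The extend-and-sort loop (S305 to S308) can be sketched in Python as follows. All names are hypothetical. Two points are assumptions rather than statements from the source: every over-represented key is extended by exactly one character per round, and the terminator "$" sorts before "A" (which is how Python compares these characters).

```python
from collections import Counter

TERMINATOR = "$"  # assumed end marker, following the "$" example above


def key_at(reference: str, pos: int, length: int) -> str:
    """Key of `length` characters starting at pos; append "$" if it runs past the end."""
    key = reference[pos:pos + length]
    return key + TERMINATOR if pos + length > len(reference) else key


def build_hierarchical_index(reference: str, key_len: int = 2,
                             tolerance: int = 1) -> list[tuple[str, int]]:
    # Each entry is (start position in the reference, current key length).
    entries = [(pos, key_len) for pos in range(len(reference) - key_len + 1)]
    while True:
        counts = Counter(key_at(reference, p, n) for p, n in entries)
        if all(count <= tolerance for count in counts.values()):
            break  # "No" branch of S306: nothing exceeds the tolerance value
        # Extend every key that occurs too often by one additional character.
        entries = [
            (p, n + 1) if counts[key_at(reference, p, n)] > tolerance else (p, n)
            for p, n in entries
        ]
    # Alphabetical order keeps the original key order first, then the additional keys.
    return sorted((key_at(reference, p, n), p) for p, n in entries)


index = build_hierarchical_index("CCGATCTGTATACCCTACGA")
print([key for key, _ in index])
# ['ACC', 'ACG', 'ATA', 'ATC', 'CCC', 'CCG', 'CCT', 'CGA$', 'CGAT', 'CTA',
#  'CTG', 'GA$', 'GAT', 'GT', 'TACC', 'TACG', 'TAT', 'TC', 'TG']
```

With a tolerance value of 1 the loop runs two rounds for this reference string: the first extends the two-character keys that repeat, and the second extends "CGA" and "TAC", as in the walkthrough above.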

As an example, consider the key "TA". Because the key "TA" occupies array numbers 14 to 16 (see FIG. 4A(c)), its start position is 14 and its occurrence count is 3, as shown in FIG. 6(a). Next, additional keys are added to the key "TA" and the keys are sorted (see FIG. 4B(b)). As shown in FIG. 6(b), the extended key "TAC", which contains the additional key "C", then has start position 14 and occurrence count 2, while the extended key "TAT", which contains the additional key "T", has start position 16 and occurrence count 1. Because the key "TAC" has an occurrence count of 2, the additional keys "C" and "G" are further added to its two occurrences, so that, as shown in FIG. 6(c), the key "TACG" has an occurrence count of 1. In this way, the extended key array can be understood as a hierarchical tree structure.
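
The tree view of the "TA" example can be checked with a short Python sketch that looks up, for each prefix, where its run starts in the final sorted key array and how many entries it covers. The list `SORTED_KEYS` was obtained by applying the steps described above to the example reference string, with the assumption that "$" sorts before "A"; the function name `prefix_range` is hypothetical.

```python
from bisect import bisect_left

# Final extended key array for "CCGATCTGTATACCCTACGA" (array numbers 0 to 18).
SORTED_KEYS = [
    "ACC", "ACG", "ATA", "ATC", "CCC", "CCG", "CCT", "CGA$", "CGAT", "CTA",
    "CTG", "GA$", "GAT", "GT", "TACC", "TACG", "TAT", "TC", "TG",
]


def prefix_range(keys: list[str], prefix: str) -> tuple[int, int]:
    """Return (start position, occurrence count) of the entries that begin with prefix."""
    start = bisect_left(keys, prefix)  # first entry that is >= prefix
    end = start
    while end < len(keys) and keys[end].startswith(prefix):
        end += 1
    return start, end - start


for prefix in ("TA", "TAC", "TAT", "TACG"):
    print(prefix, prefix_range(SORTED_KEYS, prefix))
# TA (14, 3)
# TAC (14, 2)
# TAT (16, 1)
# TACG (15, 1)
```

Each level of the tree is a contiguous sub-range of its parent's range, which is why the entries for one key end up next to each other in the array.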

In this way, the higher-level computer 10 creates a hierarchical index based on, for example, the human genome reference sequence. Because such a hierarchical index is sorted by hash value, the part of the index's data array structure that relates to a particular key (substring) is laid out together in a particular address region of main memory (the data is made sequential), which greatly reduces the number of random accesses to main memory.

Although extremely short strings were used above for simplicity, when handling the human genome, for example, it is preferable to set the key length of the substring to about 10 to 20, the key length of the additional key to 2 to 8, and the predetermined tolerance to 5 to 40. It is more preferable to set the key length of the substring to about 10 to 15, the key length of the additional key to 2 to 4, and the predetermined tolerance to 10 to 20.

図７は、本発明の一実施形態に係るコンピュータシステムによるクエリ文字列に基づく参照文字列の探索処理の一例を説明するフローチャートである。すなわち、図７は、図２に示したクエリ文字列に基づく参照文字列の探索処理（Ｓ２０２）の詳細を示している。かかる探索処理により、クエリ文字列における所定のキーが参照文字列にマッピングされ、参照文字列における一致文字列及びその出現開始位置が同定される。なお、以下では、上位コンピュータ１０による探索処理が説明されるが、並列的に動作する下位コンピュータ２０による探索処理も同様である。 FIG. 7 is a flowchart illustrating an example of a reference string search process based on a query string by the computer system according to an embodiment of the present invention. That is, FIG. 7 shows details of the reference string search process (S202) based on the query string shown in FIG. 2. Through this search process, a predetermined key in the query string is mapped to the reference string, and a matching string and its appearance start position in the reference string are identified. Note that, although the search processing by the higher-level computer 10 will be described below, the same applies to the search processing by the lower-level computers 20 that operate in parallel.

同図に示すように、上位コンピュータ１０は、データベース３０から階層的インデックスを読み出して、メモリ上に展開し、記憶する（Ｓ７０１）。上位コンピュータ１０は、高速化の観点から、階層的インデックスを構成するデータをメインメモリ上に連続的に展開し、記憶する。ここで、連続的に展開とは、データが同一バンクにおける連続的なメモリアドレスに配置されることを含む。 As shown in the figure, the host computer 10 reads the hierarchical index from the database 30, develops it on the memory, and stores it (S701). From the viewpoint of speeding up, the host computer 10 continuously develops and stores data constituting the hierarchical index on the main memory. Here, "continuously expanding" includes arranging data at consecutive memory addresses in the same bank.

次に、上位コンピュータ１０は、参照文字列に対してマッピングを行うためのクエリ文字列を受信する（Ｓ７０２）。例えば、参照文字列がヒトゲノム参照配列であれば、クエリ文字列は、シークエンサから読み出される塩基対のデータ片である。以下では、理解容易のため、クエリ文字列は「ＴＡＣＣ」であるものとして説明する。 Next, the host computer 10 receives a query string for mapping the reference string (S702). For example, if the reference string is a human genome reference sequence, the query string is a piece of base pair data read from a sequencer. In the following description, for ease of understanding, it is assumed that the query string is "TACC".

次に、上位コンピュータ１０は、受信したクエリ文字列について、所定のキー長に従って各キーを連続的に切り出す（Ｓ７０３）。本例では、所定のキー長の初期値は「２」であるものとする。したがって、切り出される各キーは、「ＴＡ」、「ＡＣ」、「ＣＣ」、及び「Ｃ＄」となる。また、本例では、各キーのサンプリング間隔の初期値は１であるものとする。キーのサンプリング間隔とは、該キーに従って階層的インデックスを順に参照する開始位置の間隔である。なお、ヒトゲノムの解析であれば、サンプリング間隔の初期値は、例えば３～５程度であり得る。後述するように、サンプリング間隔は、キーの長さに応じて伸長される。 Next, the host computer 10 successively cuts out each key from the received query string according to a predetermined key length (S703). In this example, it is assumed that the initial value of the predetermined key length is "2". Therefore, the keys to be extracted are "TA", "AC", "CC", and "C$". Further, in this example, it is assumed that the initial value of the sampling interval for each key is 1. The sampling interval of a key is the interval of starting positions for sequentially referencing the hierarchical index according to the key. Note that in the case of human genome analysis, the initial value of the sampling interval may be, for example, about 3 to 5. As described below, the sampling interval is extended depending on the length of the key.
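The key cutting step (S703) can be sketched as follows. This is only an illustration: the function name `cut_keys` and its parameters are hypothetical, and the use of "$" to pad the final key is inferred from the key "C$" in the example above.

```python
def cut_keys(query, key_len=2, interval=1, pad="$"):
    """Cut keys of length key_len from query, one every `interval` characters."""
    # Pad the tail so that the last key still has key_len characters.
    padded = query + pad * (key_len - 1)
    return [padded[i:i + key_len] for i in range(0, len(query), interval)]


keys = cut_keys("TACC")                  # sampling interval 1
sparse = cut_keys("TACC", interval=2)    # sampling interval 2
```

With a sampling interval of 1 the query "TACC" yields "TA", "AC", "CC", and "C$"; a larger interval simply skips start positions, which is how extending the interval reduces the number of index lookups.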

Next, the host computer 10 calculates a hash value for each extracted key using a predetermined hash function (S704). As described above, the hash function is defined as a function that assigns the values "0" to "3" to the four characters "A", "C", "G", and "T", respectively, and outputs the value of the key expressed as a quaternary (base-4) number. For example, for the key "TA", the hash value is h(TA) = 3 × 4 + 0 = 12.
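The hash computation can be written in a few lines of Python. The names `CODE` and `quaternary_hash` are hypothetical, and the sketch does not handle the terminal character "$", whose numeric value is not given in this passage.

```python
# Character-to-digit assignment taken from the description: A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def quaternary_hash(key):
    """Interpret the key as a base-4 number, most significant character first."""
    value = 0
    for ch in key:
        value = value * 4 + CODE[ch]
    return value


h_ta = quaternary_hash("TA")  # 3 * 4 + 0
```

Because the digits follow the alphabetical order of the characters, sorting equal-length keys by this hash value gives the same order as sorting them as strings.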

次に、上位コンピュータ１０は、算出したハッシュ値に従って、階層的インデックスを参照し（Ｓ７０５）、参照文字列における各キーの出現開始位置及び出現回数を同定する（Ｓ７０６）。例えば、キー「ＴＡ」であれば、ハッシュ値に従って、階層的インデックスにおける出現開始位置は１４、出現回数は３であることが同定される（図６（ａ）参照）。 Next, the host computer 10 refers to the hierarchical index according to the calculated hash value (S705), and identifies the appearance start position and number of appearances of each key in the reference character string (S706). For example, for the key "TA", it is identified that the appearance start position in the hierarchical index is 14 and the number of appearances is 3 according to the hash value (see FIG. 6(a)).

Next, the host computer 10 determines, for each key, whether its number of appearances exceeds a predetermined tolerance (S707). As described above, the predetermined tolerance is 1 in this example. When the host computer 10 determines that the number of appearances of a key exceeds the predetermined tolerance (Yes in S707), it then determines whether the current key length exceeds a predetermined upper limit (S708). For human genome analysis, the predetermined upper limit may be, for example, about 20, but is not limited to this.

上位コンピュータ１０は、各キーについて、現在のキー長が所定の上限値を超えていないと判断する場合（Ｓ７０８のＮｏ）、該キーに対して追加キーを追加することにより、拡張キーを作成し（Ｓ７０９）、Ｓ７０６の処理に戻る。 If the host computer 10 determines that the current key length does not exceed the predetermined upper limit for each key (No in S708), it creates an extended key by adding an additional key to the key. (S709), the process returns to S706.

For example, the host computer 10 adds the additional key "C" to the key "TA" and checks the number of appearances of the result (see FIG. 6(b)), then adds the additional key "C" to that extended key and checks its number of appearances (see FIG. 6(c)), repeating this until the number of appearances becomes 1.

On the other hand, when the host computer 10 determines that the current key length of a key exceeds the predetermined upper limit (Yes in S708), it extends (increases) the sampling interval for that key (S710) and returns to the processing of S705. The sampling interval may be extended only a predetermined number of times (for example, once) per key.

When the host computer 10 determines that the number of appearances of the current key (that is, the extended key) does not exceed the predetermined tolerance (No in S707), it outputs the identified key in the reference string (the matching string) and its appearance start position (S711).

以上のようにして、上位コンピュータ１０は、階層的インデックスを用いて、参照文字列の中からクエリ文字列の少なくとも一部を探し出すことができる。とりわけ、本実施形態によれば、階層的インデックスの各キーはハッシュ値に従ってソートされているため、クエリ文字列に従った探索は、階層的インデックスにおける部分的・連続的なデータ配列構造に対して行われことになり、メインメモリに対するランダムアクセスの回数を大幅に減らすことができるようになる。 As described above, the host computer 10 can use the hierarchical index to search for at least a portion of the query string from among the reference strings. In particular, according to this embodiment, since each key in the hierarchical index is sorted according to the hash value, the search according to the query string can be performed for partial and continuous data array structures in the hierarchical index. This will greatly reduce the number of random accesses to main memory.
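The lookup loop of S705 to S709 can be sketched as below. This is an illustrative sketch, not the embodiment itself: the function `search`, its parameters, and the sample strings are hypothetical; a sorted list with binary search stands in for the hash-sorted hierarchical index; only the leading key of the query is followed; and the sampling-interval extension of S710 is omitted.

```python
from bisect import bisect_left


def search(reference, query, key_len=2, add_len=1, tolerance=1, max_len=20):
    """Extend the leading key of query until it occurs at most `tolerance` times."""
    padded = reference + "$" * max_len
    # Sorted (key, position) array: keys sharing a prefix form one contiguous run.
    entries = sorted((padded[i:i + max_len], i) for i in range(len(reference)))
    keys = [key for key, _ in entries]
    length = key_len
    while True:
        key = query[:length]
        start = bisect_left(keys, key)        # appearance start position in the array
        end = bisect_left(keys, key + "~")    # "~" sorts after "$", "A", "C", "G", "T"
        count = end - start                   # number of appearances
        if count <= tolerance or length >= max_len or length >= len(query):
            return key, sorted(entries[i][1] for i in range(start, end))
        length += add_len                     # append an additional key and look again


matched_key, positions = search("TACCTACGTAT", "TACC")
```

Because the array is sorted, all occurrences of a key and of its extensions lie in one contiguous run, so each refinement reads neighboring entries instead of scattered ones; this is the locality property the paragraph above attributes to the sorted index.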

また、階層的インデックスの探索において、現在のキー（拡張キー）の長さが一定長以上になった場合に出現開始位置のサンプリング間隔を伸長するので、探索の回数が効率的に削減され、探索の高速化を図ることができる。一方で、サンプリング間隔の伸長は、本来同定されるべきキーを見逃す確率が高くなる可能性があるが、キーの長さが一定長以上になった場合にサンプリング間隔を長くしているので、このような見逃しの発生を効果的に抑制している。 In addition, when searching a hierarchical index, if the length of the current key (extended key) exceeds a certain length, the sampling interval of the appearance start position is extended, so the number of searches is efficiently reduced, and the search The speed can be increased. On the other hand, extending the sampling interval may increase the probability of missing keys that should have been identified, but since the sampling interval is lengthened when the key length exceeds a certain value, this This effectively prevents such oversights from occurring.

Next, alignment using dynamic programming will be described. To estimate how far the query string (the pattern string, that is, the string being matched) deviates from the string near the identified appearance start position in the reference string (the target string, that is, the string being matched against), in other words the degree of variation, the host computer 10 executes alignment processing using dynamic programming.

Alignment is a technique that compares two sequences (strings) element by element while allowing substitutions, insertions, and deletions, and calculates a degree of variation (similarity) according to defined scores and penalties. For example, when string X (the target string based on the reference string) is compared with string Y (the pattern string based on the query string), string Y is judged not to match string X if some of its characters are substituted, inserted, or deleted. In other words, the relationship between the characters at each position of strings X and Y is one of three cases: the characters match (match), the characters differ (mismatch), or one of the characters is absent (gap). Here, the score is defined as, for example, "+2" for a match, "-1" for a mismatch, and "-2" for a gap. The elements (characters) of string X and string Y are then compared with one another, and the degree of variation of each element is calculated using these scores. In the present disclosure, the degree of variation is calculated with an improved alignment algorithm based on the Needleman-Wunsch algorithm, which is one example of global alignment. The basic idea of the Needleman-Wunsch algorithm is described first, followed by the improved alignment algorithm applied to the present invention.

For example, in comparing string X = x_1 x_2 ... x_i with string Y = y_1 y_2 ... y_j, the degree of variation F(i, j) between character x_i and character y_j is defined as follows.

\[ F(i,j) = \max \{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\} \quad \text{(Equation 1)} \]

Here, max is a function that outputs the largest of the given values, s is the score function (match: s = 2, mismatch: s = -1, gap: s = -2), and d is the gap penalty (d = 2).

In the following, for ease of understanding, alignment using dynamic programming is described for the target string (the string being matched against) "GCCTCGCT" and the pattern string (the string being matched) "GCCATTCA".

図８Ａは、本例における比較対象の文字列どうしを配列したアラインメント表である。表中、「φ」は空文字であり、Ｆ（０，０）＝０とする。また、式１において、変異度Ｆ（ｉ，ｊ）の引数が負の値になる場合、範囲外であるため、計算は省略される（出力をＮｕｌｌとする。）。アラインメント表は、ある種のデータ構造としてメモリ上に展開され、プロセッサの利用に供される。 FIG. 8A is an alignment table in which character strings to be compared in this example are arranged. In the table, "φ" is a blank character, and F(0,0)=0. Furthermore, in Equation 1, if the argument of the degree of variation F(i, j) is a negative value, it is outside the range, so the calculation is omitted (the output is set to Null). The alignment table is expanded on memory as a type of data structure and made available to the processor.

First, for j = 0, consider F(1,0). From Equation 1, the arguments of the max function are
F(0,-1) + s(x_1, y_0) = Null
F(0,0) - d = 0 - 2 = -2
F(1,-1) - d = Null
and therefore
F(1,0) = -2.

Next, F(2,0) is
F(2,0) = F(1,0) - d = -4.
In the same way,
F(n,0) = -nd
F(0,n) = -nd
so the alignment table becomes as shown in FIG. 8B.

Next, the calculation proceeds in the same way for j = 1. For F(1,1), the arguments of the max function are
F(0,0) + s(x_1, y_1) = 0 + 2 = 2
F(0,1) - d = -2 - 2 = -4
F(1,0) - d = -2 - 2 = -4
and therefore
F(1,1) = 2
so the alignment table becomes as shown in FIG. 8C.

以上の計算を同様に繰り返すことにより、図８Ｄに示すようなアラインメント表が作成されることになる。作成されたアラインメント表において、要素位置（６，７）の変異度Ｆ（６，７）が最大値「７」を有している。したがって、同図に示すように、要素位置（６，７）から要素位置（１，１）までバックトラックがなされる。表中、右下斜め方向への矢印は文字の一致を示し、右方向への矢印は欠損を示し、下方向への矢印は挿入を示している。これにより、近似文字列「ＧＣＣ＜Ａ＞Ｔ［Ｔ］Ｇ」が導出されることになる。ただし、記号「＜＞」は、文字間への挿入を表し、記号「［］」は置換を表すものとする。つまり、参照文字列「ＧＣＣＴＣＧ」であるところ、クエリ文字列は、参照文字列の「Ｃ」と「Ｔ」の間に「Ａ」が挿入され、参照文字列の「ＴＣＧ」の「Ｃ」が「Ｔ」に置換されていることがわかる。 By repeating the above calculation in the same way, an alignment table as shown in FIG. 8D is created. In the created alignment table, the degree of variation F(6,7) at element position (6,7) has a maximum value of "7". Therefore, as shown in the figure, backtracking is performed from element position (6, 7) to element position (1, 1). In the table, an arrow pointing diagonally to the lower right indicates a match of characters, an arrow pointing to the right indicates a deletion, and an arrow pointing downward indicates an insertion. As a result, the approximate character string "GCC<A>T[T]G" is derived. However, the symbol "< >" represents insertion between characters, and the symbol "[ ]" represents replacement. In other words, where the reference string is "GCCTCG", the query string has "A" inserted between "C" and "T" in the reference string, and "C" in the reference string "TCG" is It can be seen that it has been replaced with "T".
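The table filling and backtracking described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the procedure of the figures: the function names are hypothetical; the traceback uses the standard Needleman-Wunsch rule of stepping to the predecessor whose Equation 1 term produced the current value, which is swapped in here for the figure-based description; the starting cell is passed in explicitly instead of being taken from the maximum of the table; and treating a step that consumes only a character of y as an insertion is an assumed orientation.

```python
def nw_table(x, y, match=2, mismatch=-1, d=2):
    """Fill the full alignment table F according to Equation 1."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = -i * d                      # F(n, 0) = -nd
    for j in range(1, cols):
        F[0][j] = -j * d                      # F(0, n) = -nd
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    return F


def traceback(F, x, y, i, j, match=2, mismatch=-1, d=2):
    """Walk back from cell (i, j) to (0, 0) and return the annotated string.

    Notation follows the text: <c> insertion, [c] substitution, - deletion.
    """
    out = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out.append(y[j - 1] if s == match else "[" + y[j - 1] + "]")
                i, j = i - 1, j - 1
                continue
        if j > 0 and F[i][j] == F[i][j - 1] - d:
            out.append("<" + y[j - 1] + ">")  # character present only in y
            j -= 1
        else:
            out.append("-")                   # character present only in x
            i -= 1
    return "".join(reversed(out))


F = nw_table("ACT", "ACGT")
annotated = traceback(F, "ACT", "ACGT", 3, 4)
```

For the short hypothetical pair "ACT" and "ACGT", the traceback from the bottom-right cell yields "AC<G>T", that is, a "G" inserted between "C" and "T".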

以上のように、Ｎｅｅｄｌｅｍａｎ－Ｗｕｎｓｃｈアルゴリズムに従って、アラインメント表における変異度が最大値である要素位置を特定することにより、そこから近似文字列を導出することができる。しかしながら、Ｎｅｅｄｌｅｍａｎ－Ｗｕｎｓｃｈアルゴリズムではアラインメント表の全ての要素の変異度を算出するため、計算量が膨大となり、時間がかかっていた。そこで、本開示では、以下のような改良アラインメントアルゴリズムを提案し、これにより、計算量を削減し、処理の高速化を図っている。 As described above, by specifying the element position in the alignment table where the degree of variation is the maximum value according to the Needleman-Wunsch algorithm, an approximate character string can be derived therefrom. However, since the Needleman-Wunsch algorithm calculates the degree of variation of all elements in the alignment table, the amount of calculation becomes enormous and takes a long time. Therefore, in the present disclosure, the following improved alignment algorithm is proposed, thereby reducing the amount of calculation and speeding up the processing.

すなわち、本発明に適用される改良アラインメントアルゴリズムは、概略的には、アラインメント表における対角線上に位置する要素を中心とする所定の幅を有する計算領域を定め、該計算領域内の要素についてのみ変異度を算出することによりその最大値を決定し、該最大値が所定の条件を満たす場合に、該最大値に基づく要素から近似文字列を導出することを含む。最大値が所定の条件を満たさない場合には、計算領域が拡大され、同様に、変異度が算出されることによりその最大値を決定し、該最大値が所定の条件を満たすまで繰り返される。 That is, the improved alignment algorithm applied to the present invention generally defines a calculation region having a predetermined width centered on elements located on the diagonal line in the alignment table, and performs mutation only on the elements within the calculation region. The method includes determining the maximum value by calculating the degree, and deriving an approximate character string from elements based on the maximum value when the maximum value satisfies a predetermined condition. If the maximum value does not satisfy the predetermined condition, the calculation area is expanded, and the degree of variation is similarly calculated to determine the maximum value, and this is repeated until the maximum value satisfies the predetermined condition.

図９は、本発明の一実施形態に係るコンピュータシステムによる動的計画法を用いたアラインメント処理の一例を説明するフローチャートである。すなわち、図９は、図２に示したアラインメント処理の（Ｓ２０４）の詳細を示している。なお、以下では、上位コンピュータ１０による一のクエリ文字列に基づくアラインメント処理が説明されるが、並列的に動作する下位コンピュータ２０による他のクエリ文字列に基づくアラインメント処理も同様である。 FIG. 9 is a flowchart illustrating an example of alignment processing using dynamic programming by a computer system according to an embodiment of the present invention. That is, FIG. 9 shows details of (S204) of the alignment process shown in FIG. 2. Although alignment processing based on one query string by the higher-level computer 10 will be described below, alignment processing based on other query strings by the lower-level computers 20 operating in parallel is also the same.

As shown in the figure, the host computer 10 creates an alignment table based on a string pair consisting of the target string (the string being matched against) and the pattern string (the string being matched) (S901). As described above, the alignment table is laid out in memory as a kind of data structure.

次に、上位コンピュータ１０は、計算領域の幅ｍ（ただし、ｍは正数）を初期値に設定する（Ｓ９０２）。幅ｍの初期値は、例えば、マッピングにより得られた一致文字列の長さであり得るが、これに限られない。また、幅ｍは、アラインメント表の対角線上の要素位置を中心とすることから、奇数の値に設定されるが、これに限られるものではない。これにより、アラインメント表の対角線上の要素位置を中心とする幅ｍの計算領域が決定される。 Next, the host computer 10 sets the width m (where m is a positive number) of the calculation area to an initial value (S902). The initial value of the width m may be, for example, the length of the matching character string obtained by mapping, but is not limited thereto. Further, since the width m is centered at the element position on the diagonal line of the alignment table, it is set to an odd value, but it is not limited to this. As a result, a calculation area of width m centered on the element position on the diagonal of the alignment table is determined.

続いて、上位コンピュータ１０は、計算領域の境界を画定する各要素に所定のダミー値を設定する（Ｓ９０３）。ダミー値は、変異度Ｆの値として十分に小さい値が選択される。例えば、ダミー値は、初期値の幅ｍの例えば２～３倍程度の負の値であり、任意に設定することができる。 Subsequently, the host computer 10 sets a predetermined dummy value to each element defining the boundary of the calculation area (S903). As the dummy value, a value that is sufficiently small as the value of the degree of variation F is selected. For example, the dummy value is a negative value, for example, about 2 to 3 times the width m of the initial value, and can be set arbitrarily.

Next, the host computer 10 calculates the degree of variation F for each element in the calculation region according to Equation 1 (S904). The host computer 10 then determines the maximum degree of variation F_max, the largest value in the calculation region, and identifies the position of the corresponding element (S905). Note that more than one element may have the maximum degree of variation F_max.

The host computer 10 also calculates the lower limit F_Low of the degree of variation F in a row or column of the alignment table (S906). The lower limit F_Low is the value that the degree of variation F would take if, when the reference string and the query string are compared in that row or column, there were m consecutive gaps and the remaining parts matched completely or substantially. That is, the lower limit F_Low is calculated as
F_Low = (string length - m) × s ... Equation 2
where s = 2.

Next, the host computer 10 compares the maximum degree of variation F_max with the lower limit F_Low and determines whether F_max exceeds F_Low (S907). When the host computer 10 determines that F_max does not exceed F_Low (No in S907), it widens the width m by a predetermined amount δ (S908). For example, δ is set to m + 1 (provided that m does not exceed the number of characters in the string).

The host computer 10 performs the same processing on the calculation region of the widened width m, and repeats the processing until the maximum value F_max exceeds the lower limit F_Low.

When the host computer 10 determines that the maximum value F_max exceeds the lower limit F_Low (Yes in S907), it backtracks from the position of the element having that maximum value to determine the approximate string (S909). The host computer 10 then outputs the determined approximate string (S910).
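The flow of S902 to S908 can be sketched as a banded table fill wrapped in a widening loop. This is a sketch under stated assumptions, not the embodiment: the function names are hypothetical; negative infinity stands in for the dummy value set on the band boundary; the band is taken as the cells with |i - j| <= m // 2; and the length of the second string is used both as the "string length" in the lower-limit formula and as the cap on m.

```python
NEG = float("-inf")  # stand-in for the dummy value outside the band (assumption)


def banded_table(x, y, m, match=2, mismatch=-1, d=2):
    """Fill only the cells within a band of width m around the diagonal."""
    half = m // 2
    rows, cols = len(x) + 1, len(y) + 1
    F = [[NEG] * cols for _ in range(rows)]
    F[0][0] = 0
    for i in range(rows):
        for j in range(max(0, i - half), min(cols, i + half + 1)):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = F[i - 1][j - 1] + s
            if i > 0:
                best = max(best, F[i - 1][j] - d)
            if j > 0:
                best = max(best, F[i][j - 1] - d)
            F[i][j] = best
    return F


def align_with_widening(x, y, m=7, s=2):
    """Widen the band until F_max exceeds F_Low or the band reaches its cap."""
    limit = len(y)                       # string length and cap for m (assumption)
    while True:
        F = banded_table(x, y, m)
        f_max = max(max(row) for row in F)
        f_low = (limit - m) * s          # F_Low = (string length - m) * s
        if f_max > f_low or m >= limit:
            return m, f_max, F
        m = min(m + (m + 1), limit)      # widen by delta = m + 1


width, f_max, table = align_with_widening("ACGTACGT", "ACGTACGT", m=3)
```

For two identical 8-character strings the narrow band of width 3 already suffices, whereas for strings with no characters in common the loop keeps widening until the cap is reached. The design trades a possible recomputation for skipping most of the table when the strings are similar.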

As an example, the calculation of the degree of variation by the alignment method according to the present invention is described for the target string "GGGATCCGATAATCGGTCCCCTAGG" (24 characters) and the pattern string "GGGCATTCAACATAAGTCGGCCTG" (24 characters). Equation 1 is used to calculate the degree of variation F, as in the example above. The target string (the string being matched against) and the pattern string (the string being matched) need not have the same length; typically, the target string is longer than the pattern string.

まず、上位コンピュータ１０は、比較対象の文字列どうしを配列したアラインメント表を用意し、幅ｍの初期値を設定する。本例では、幅ｍ（０）の初期値は７であるものとする。また、上位コンピュータ１０は、幅ｍに従って規定される計算領域の境界部分の要素にダミー値を設定する。本例では、ダミー値は－２０であるものとする。図１０Ａは、ダミー値が設定されたアラインメント表を示している。表中、ハッチングが描かれている要素が幅ｍ（０）＝７での計算領域Ｒ（０）である。 First, the host computer 10 prepares an alignment table in which character strings to be compared are arranged, and sets an initial value of the width m. In this example, it is assumed that the initial value of the width m(0) is 7. Furthermore, the host computer 10 sets dummy values to elements at the boundary of the calculation area defined according to the width m. In this example, it is assumed that the dummy value is -20. FIG. 10A shows an alignment table in which dummy values are set. In the table, the hatched element is the calculation region R(0) with width m(0)=7.

The host computer 10 calculates the degree of variation F of each element within the calculation region R(0) according to Equation 1. FIG. 10B shows the state in which the degree of variation F has been calculated for each element in the calculation region. From the calculated values, the host computer 10 determines the maximum degree of variation F_max. In this example, F_max is 17. In the figure, the elements whose degree of variation equals F_max = 17 are hatched.

Next, the host computer 10 calculates the lower limit F_Low in a row or column of the alignment table. In this example, with m = 7, the lower limit F_Low is
F_Low = (24 - m) × s
= 17 × 2
= 34.

Next, the host computer 10 compares the maximum degree of variation F_max with the lower limit F_Low. Because F_max (17) does not exceed F_Low (34), it widens the width m by δ. In this example, the width is widened to m(1) = 15, and the calculation region of the widened width m(1) is denoted R(1).

The host computer 10 likewise calculates the degree of variation F of each element within the calculation region R(1) according to Equation 1. FIG. 10C shows the state in which the degree of variation F has been calculated for each element in the calculation region. In the figure, the region added to the calculation region R(1) by the widening is hatched. The maximum degree of variation F_max is now 19, and the lower limit F_Low is now 18 (= (24 - 15) × 2).

Because the host computer 10 therefore determines that the maximum degree of variation F_max exceeds the lower limit F_Low, it backtracks from the element with F_max = 19 and identifies the approximate string along the resulting path.

That is, in the example shown in FIG. 10C, the host computer 10 starts from the element (18, 22), whose degree of variation is F_max = 19, as the current element position, identifies the adjacent element in the direction of element (0, 0) with the largest degree of variation F, and moves to it. The element (17, 21), with F = 17, thus becomes the current element position. Repeating this transition until element (0, 0) is reached finally yields the approximate string. Note that backtracking does not necessarily yield a single path; it may yield several.

More specifically, in the alignment table shown in FIG. 10C, backtracking yields six paths, and the approximate strings along them are as follows.
(a) First path: GGG<C>A<T>TC<AA>C-ATAA<G>TCG[G]CC
(b) Second path: GGG<C>A<T>TC<A>[A][C]ATAA<G>TCG[G]CC
(c) Third path: GGG<C>AT[T]C<A>[A][C]ATAA<G>TCG[G]CC
(d) Fourth path: GGG<C>AT[T]C[A]<AC>ATAA<G>TCG[G]CC
(e) Fifth path: GGG<C>A<T>TC<AA>C-ATAA<G>TCG[G]CC
(f) Sixth path: GGG<C>AT<T>C<A>[A][C]ATAA<G>TCG[G]CC
Here, the symbol "<>" denotes an insertion between characters, the symbol "[]" denotes a character substitution, and the symbol "-" denotes a character deletion.
For ease of understanding, the approximate strings along each of the above paths are shown alongside the reference string in FIGS. 11A and 11B.

このように、改良されたアラインメントアルゴリズムでは、アラインメント表の全ての要素について変異度を算出するのではなく、所定の幅を有する計算領域を定め、必要に応じた範囲で計算領域を拡大させながら変異度を算出していくので、計算量を削減し、これにより、処理の高速化を図ることができるようになる。 In this way, the improved alignment algorithm does not calculate the degree of variation for all elements in the alignment table, but instead defines a calculation area with a predetermined width and calculates the variation while expanding the calculation area as necessary. Since the degree is calculated, the amount of calculation can be reduced, thereby speeding up the processing.

以上のように、本実施形態によれば、参照文字列に基づく階層的インデックスが作成された後、与えられたクエリ文字列に従って、該階層的インデックスを探索することにより一致文字列（及びその長さ）が同定され、同定された一致文字列に基づく被照合文字列と照合文字列とからなる文字列ペアに対して近似文字列照合がなされることにより、近似文字列が導出される。 As described above, according to the present embodiment, after a hierarchical index based on a reference string is created, matching strings (and their lengths) are searched according to a given query string. (a) is identified, and an approximate character string is derived by performing approximate character string matching on a character string pair consisting of a character string to be matched and a character string to be matched based on the identified matching character string.

上記各実施形態は、本発明を説明するための例示であり、本発明をこれらの実施形態にのみ限定する趣旨ではない。本発明は、その要旨を逸脱しない限り、さまざまな形態で実施することができる。 Each of the embodiments described above is an illustration for explaining the present invention, and the present invention is not intended to be limited only to these embodiments. The present invention can be implemented in various forms without departing from the gist thereof.

例えば、本明細書に開示される方法においては、その結果に矛盾が生じない限り、ステップ、動作又は機能を並行して又は異なる順に実施しても良い。説明されたステップ、動作及び機能は、単なる例として提供されており、ステップ、動作及び機能のうちのいくつかは、発明の要旨を逸脱しない範囲で、省略でき、また、互いに結合させることで一つのものとしてもよく、また、他のステップ、動作又は機能を追加してもよい。 For example, steps, acts, or functions in the methods disclosed herein may be performed in parallel or in a different order unless the results are inconsistent. The steps, acts, and functions described are provided by way of example only, and some of the steps, acts, and functions may be omitted or combined with each other without departing from the spirit of the invention. It is also possible to add other steps, actions, or functions.

また、本明細書では、さまざまな実施形態が開示されているが、一の実施形態における特定のフィーチャ（技術的事項）を、適宜改良しながら、他の実施形態に追加し、又は該他の実施形態における特定のフィーチャと置換することができ、そのような形態も本発明の要旨に含まれる。 Further, although various embodiments are disclosed in this specification, specific features (technical matters) in one embodiment may be added to other embodiments while improving them as appropriate, or Certain features in the embodiments may be replaced and such forms are within the scope of the invention.

1 ... Computer system
10 ... Host computer
20 ... Lower-level computer
30 ... Database

Claims

A computer program for causing a computing device to perform a method of searching a reference string for an approximate string based on a query string, wherein
the method includes:
creating a hierarchical index based on the reference string;
mapping the query string to the reference string with reference to the hierarchical index to identify a substring in the reference string that matches at least a portion of the query string; and
deriving, as the approximate string, a string in which a matched string and a matching string approximate each other, based on the matched string, which is a string in the reference string in the vicinity of and including the substring identified by the mapping, and the matching string, which is a string in the query string in the vicinity of and including the substring identified by the mapping,
wherein creating the hierarchical index includes:
cutting out first keys, each of a predetermined length, from the reference string;
creating a first key array in which a hash value, calculated from the first key by a predetermined hash function, is assigned to each of the cut-out first keys;
updating the created first key array; and
outputting the updated first key array as the hierarchical index,
and wherein updating the first key array includes:
identifying, for each of the first keys in the first key array, the number of occurrences of the first key in the reference string; and
updating the first key array by adding to the first key, according to the identified number of occurrences, a first additional key consisting of one or more characters that follow the first key in the reference string.
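
As a non-limiting illustration of the index-creation steps recited above, the following is a minimal Python sketch and not the claimed implementation. The function name `build_hierarchical_index` and the parameters `key_len`, `tolerance`, and `step` are hypothetical names introduced here. A Python dictionary stands in for the hash-sorted first key array, which is a simplification: keys that occur more often than `tolerance` are lengthened with the characters that follow them in the reference string until each key is sufficiently rare.

```python
from collections import defaultdict


def build_hierarchical_index(reference, key_len=4, tolerance=2, step=1):
    """Return {key: [start positions]} with over-frequent keys lengthened.

    Assumption: a dict replaces the hash-sorted key array of the claims.
    """
    # Level 0: every key of length key_len and its start positions.
    pending = defaultdict(list)
    for pos in range(len(reference) - key_len + 1):
        pending[reference[pos:pos + key_len]].append(pos)

    index = {}
    while pending:
        next_pending = defaultdict(list)
        for key, positions in pending.items():
            if len(positions) <= tolerance:
                # Rare enough: the key is final at its current length.
                index[key] = positions
                continue
            for pos in positions:
                end = pos + len(key) + step
                if end > len(reference):
                    # No characters left to append: keep the key as it is.
                    index.setdefault(key, []).append(pos)
                else:
                    # Append the characters that follow the key (additional key).
                    next_pending[reference[pos:end]].append(pos)
        pending = next_pending
    return index


if __name__ == "__main__":
    for key, positions in sorted(build_hierarchical_index("ACGTACGTACGA", tolerance=1).items()):
        print(key, positions)
```

Lengthening only the over-frequent keys keeps short keys for unique regions while repetitive regions receive longer, more selective keys, which is the property that makes the index hierarchical.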

Creating the first key array includes sorting each of the first keys in the first key array according to the hash value.
The computer program according to claim 1.

Updating the first key array includes:
determining whether the identified number of occurrences exceeds a predetermined tolerance value;
when it is determined that the identified number of occurrences exceeds the predetermined tolerance value, creating a new first key by adding to the first key the first additional key consisting of one or more characters that follow the first key in the reference string; and
identifying, for the new first key, the number of occurrences of the new first key in the reference string.
The computer program according to claim 1 or 2.

Updating the first key array further includes sorting the new first keys in the first key array according to the first additional keys.
The computer program according to claim 3.

Updating the first key array includes creating a new first key by sequentially adding a new first additional key to the current first key until it is determined that the identified number of occurrences does not exceed the predetermined tolerance value.
The computer program according to claim 3 or 4.

Outputting the first key array as the hierarchical index includes:
outputting the current first key array as the hierarchical index when it is determined that the identified number of occurrences does not exceed the predetermined tolerance value.
The computer program according to any one of claims 3 to 5.

Performing the mapping includes:
cutting out second keys, each of a predetermined length, from the query string;
creating a second key array in which a hash value, calculated from the second key by the predetermined hash function, is assigned to each of the second keys cut out from the query string; and
referring, for each of the second keys, to the hierarchical index at a predetermined sampling interval according to the hash value, and identifying the occurrence start position and the number of occurrences of the second key.
The computer program according to any one of claims 1 to 6.

Identifying the occurrence start position and the number of occurrences of the second key includes:
determining whether the number of occurrences of the second key exceeds the predetermined tolerance value;
when it is determined that the number of occurrences of the second key exceeds the predetermined tolerance value, creating a new second key by adding to the second key a second additional key consisting of one or more characters that follow the second key in the query string; and
when it is determined that the number of occurrences of the second key does not exceed the predetermined tolerance value, outputting, with the currently identified second key taken as the substring, the occurrence start position thereof.
The computer program according to claim 7.

Identifying the occurrence start position and the number of occurrences of the second key further includes creating the new second key by sequentially adding a new second additional key to the current second key until it is determined that the identified number of occurrences of the second key does not exceed the predetermined tolerance value.
The computer program according to claim 8.

Further including increasing the predetermined sampling interval of the second key when it is determined that the number of occurrences of the second key exceeds the predetermined tolerance value.
The computer program according to claim 8 or 9.
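
As a non-limiting illustration of the mapping steps recited above, the following is a minimal Python sketch under stated assumptions. The function name `map_query`, its parameters, and the small `index` dictionary in the usage lines are hypothetical. The index is assumed to be a dictionary from variable-length keys to their start positions in the reference string, so a key that is too frequent at its current length is simply absent and is lengthened with the following characters of the query string. Adjusting the sampling interval is not modelled here.

```python
def map_query(query, index, key_len=4, step=1, interval=2):
    """Return (query offset, matched key, reference positions) for each hit.

    Assumption: `index` maps variable-length keys to reference positions,
    so a missing short key is retried with additional characters appended.
    """
    hits = []
    # Sample second keys from the query at a fixed interval.
    for offset in range(0, len(query) - key_len + 1, interval):
        length = key_len
        while offset + length <= len(query):
            key = query[offset:offset + length]
            if key in index:
                hits.append((offset, key, index[key]))
                break
            # Not found at this length: append the following characters.
            length += step
    return hits


if __name__ == "__main__":
    # Hypothetical index for the reference string "ACGTACGTACGA".
    index = {"ACGA": [8], "TACGT": [3], "TACGA": [7]}
    print(map_query("TACGTAC", index))
```

In this usage, the four-character key "TACG" is absent from the index, so it is lengthened to "TACGT", which resolves to a single start position in the reference string.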

Deriving the approximate string includes:
creating a string pair consisting of the matched string and the matching string based on the substring identified by the mapping;
performing a predetermined alignment process to derive at least one approximate string based on the string pair; and
outputting the derived at least one approximate string.
The computer program according to any one of claims 8 to 10.

Executing the predetermined alignment process includes:
creating a predetermined alignment table based on the matched string and the matching string;
setting a calculation area of width m centered on the diagonal elements of the alignment table;
calculating the degree of variation for each element in the set calculation area;
determining a maximum degree of variation based on the calculated degrees of variation; and
deriving the at least one approximate string based on the determined maximum degree of variation.
The computer program according to claim 11.

Executing the predetermined alignment process includes:
comparing the maximum degree of variation with a predetermined lower limit value to determine whether the maximum degree of variation exceeds the predetermined lower limit value;
widening the width m of the calculation area to set a new calculation area when it is determined that the maximum degree of variation does not exceed the predetermined lower limit value;
deriving the at least one approximate string based on the element having the maximum degree of variation when it is determined that the maximum degree of variation exceeds the predetermined lower limit value; and
repeating the calculation of the degree of variation, widening the calculation area to set a new calculation area each time, until it is determined that the maximum degree of variation exceeds the predetermined lower limit value.
The computer program according to claim 12.

The predetermined lower limit value is the value of the degree of variation obtained on the assumption that a predetermined element sequence contains m consecutive gaps and that all other parts match.
The computer program according to claim 13.
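
As a non-limiting illustration of the band-limited alignment recited above, the following is a minimal Python sketch under stated assumptions. The "degree of variation" is treated here as an alignment score, which is an interpretation of the translated term. A global alignment with a linear gap penalty is used, the calculation area is taken as the cells with |i - j| <= m, and the score of the final cell is used in place of the maximum over the area. The function names, the scoring values, and the doubling of m are all hypothetical choices.

```python
def banded_score(a, b, m, match=1, mismatch=-1, gap=-2):
    """Global alignment score computed only on cells with |i - j| <= m."""
    neg = float("-inf")
    # Row 0: only the first m gap cells lie inside the band.
    prev = [j * gap if j <= m else neg for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [neg] * (len(b) + 1)
        for j in range(max(0, i - m), min(len(b), i + m) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[len(b)]


def align_with_growing_band(a, b, m=1, match=1, mismatch=-1, gap=-2):
    """Widen the band until the score exceeds the lower limit for width m."""
    m = max(1, m)
    longest = max(len(a), len(b))
    while True:
        score = banded_score(a, b, m, match, mismatch, gap)
        # Lower limit: m consecutive gaps, everything else matching.
        lower_limit = match * min(len(a), len(b)) + gap * m
        if score > lower_limit or m >= longest:
            return score, m
        m *= 2  # Set a new, wider calculation area and recalculate.


if __name__ == "__main__":
    print(align_with_growing_band("ACGTACGT", "ACGACGT"))
```

Under these assumptions, any alignment that leaves the band needs more than m gaps and therefore cannot score above the lower limit, so a banded score above that limit equals the score of the full table without computing every element.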

creating the matched string by adding, to the substring, a corresponding predetermined string in the reference string; and
creating the matching string by adding, to the substring, a corresponding predetermined string in the query string.
The computer program according to any one of claims 11 to 14.

A computer program for causing a computing device to perform a method of creating a hierarchical index for searching a reference string based on a query string, wherein
the method includes:
cutting out first keys, each of a predetermined length, from the reference string;
creating a first key array in which a hash value, calculated from the first key by a predetermined hash function, is assigned to each of the cut-out first keys;
updating the created first key array; and
outputting the updated first key array as the hierarchical index,
and wherein updating the first key array includes:
identifying, for each of the first keys in the first key array, the occurrence start position and the number of occurrences of the first key in the reference string; and
updating the first key array by adding to the first key, according to the identified occurrence start position and number of occurrences of the first key, a first additional key consisting of one or more characters that follow the first key in the reference string.

A computer program for causing a computing device to perform a method of mapping a query string to a reference string, wherein
the method includes:
reading a hierarchical index consisting of a first key array in which a hash value, calculated by a predetermined hash function from each first key cut out from the reference string, is assigned to that first key, and in which an additional key has been added to the first key according to the occurrence start position and number of occurrences of the first key in the reference string;
cutting out second keys, each of a predetermined key length, from the query string;
creating a second key array in which a hash value, calculated from the second key by a predetermined hash function, is assigned to each of the second keys cut out from the query string;
referring, for each of the second keys, to the hierarchical index at a predetermined sampling interval according to the hash value, and identifying, by comparison with the first keys in the first key array, the occurrence start position and number of occurrences in the reference string of the first key that matches the second key;
determining whether the identified number of occurrences exceeds a predetermined tolerance value;
when it is determined that the identified number of occurrences exceeds the predetermined tolerance value, creating a new second key by adding to the second key an additional key consisting of one or more characters that follow the second key in the query string; and
when it is determined that the identified number of occurrences does not exceed the predetermined tolerance value, outputting the currently identified second key and the occurrence start position in the reference string of the first key that matches it,
wherein identifying the occurrence start position and number of occurrences in the reference string of the first key that matches the second key includes creating the new second key by sequentially adding a new additional key to the current second key until it is determined that the identified number of occurrences does not exceed the predetermined tolerance value.

The method is configured to increase the predetermined sampling interval of the second key if it is determined that the identified number of occurrences exceeds a predetermined tolerance value.
The computer program according to claim 17.

A computer program for causing a computing device to execute a method of identifying, by a predetermined alignment process, variations between a substring in a reference string and a query string, wherein
the method includes:
mapping the query string to the reference string;
creating a string pair consisting of a matched string, which is a string in the reference string in the vicinity of and including the substring identified by the mapping, and a matching string, which is a string in the query string in the vicinity of and including the substring identified by the mapping;
performing the predetermined alignment process based on the string pair to derive, as at least one approximate string, a string in which the matched string and the matching string approximate each other; and
outputting the derived at least one approximate string,
wherein executing the predetermined alignment process includes:
creating a predetermined alignment table based on the matched string and the matching string;
setting a calculation area of width m centered on the diagonal elements of the alignment table;
calculating the degree of variation for each element in the set calculation area;
determining a maximum degree of variation based on the calculated degrees of variation; and
deriving the at least one approximate string based on the determined maximum degree of variation.

Executing the predetermined alignment process includes:
comparing the maximum degree of variation with a predetermined lower limit value to determine whether the maximum degree of variation exceeds the predetermined lower limit value;
widening the width m of the calculation area to set a new calculation area when it is determined that the maximum degree of variation does not exceed the predetermined lower limit value;
deriving the at least one approximate string based on the element having the maximum degree of variation when it is determined that the maximum degree of variation exceeds the predetermined lower limit value; and
repeating the calculation of the degree of variation, widening the calculation area to set a new calculation area each time, until it is determined that the maximum degree of variation exceeds the predetermined lower limit value.
The computer program according to claim 19.

Executing the predetermined alignment process further includes setting, as the predetermined lower limit value, the value of the degree of variation obtained on the assumption that a predetermined element sequence contains m consecutive gaps and that all other parts match.
The computer program according to claim 20.