JP6653628B2

JP6653628B2 - DNA sequence analyzer, DNA sequence analysis method, and DNA sequence analysis system

Info

Publication number: JP6653628B2
Application number: JP2016119723A
Authority: JP
Inventors: 安田　知弘; 知弘安田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2016-06-16
Filing date: 2016-06-16
Publication date: 2020-02-26
Anticipated expiration: 2036-06-16
Also published as: JP2017224191A

Description

The present invention relates to a DNA sequence analyzer, a DNA sequence analysis method, and a DNA sequence analysis system.

DNA sequencing with next-generation sequencers (Next Generation Sequencing, NGS) can determine a genome sequence at a dramatically lower cost than the conventional Sanger method. For example, the cost of determining a genome sequence, which was about 100 million US dollars in 2001, had fallen to 1,245 US dollars by 2015 with the advent of next-generation sequencers (see "DNA Sequencing Costs, http://www.genome.gov/sequencingcosts/"). Next-generation sequencers not only reduce cost but also produce an enormous amount of sequence data in a short time. For example, next-generation sequencers that can generate more than one trillion bases of sequence data in a single run have been commercialized. These technologies have made it possible to determine the individual genome sequences of a large number of subjects.

When a human genome sequence is analyzed with a next-generation sequencer, the sequences obtained by the sequencer (hereinafter "read sequences") must be compared with a reference genome sequence, which is a standard human genome sequence. The computation that compares a read sequence with the reference genome sequence, identifies the position in the reference genome sequence that corresponds to the read sequence, and reveals the differences from the reference genome sequence is called "mapping". Mapping reveals the sequences specific to a subject's genome and thus the characteristics of the subject's genes. The genetic information obtained in this way is extremely useful for predicting the subject's disease risk and drug response.

Because mapping is indispensable for processing sequence data from next-generation sequencers, the results of mapping to a genome are sometimes published when sequence data are released in the databases of public institutions. Meanwhile, the volume of sequence data handled by next-generation sequencers has become enormous in recent years, which poses a problem. Techniques have therefore emerged that compress read sequences by using the result of mapping them to a reference genome sequence and recording only the differences from the reference genome sequence (Fritz et al., Genome Res. 2011, 21(5):734-40).

In addition, because the number of read sequences to be mapped to a reference genome sequence is enormous, mapping is a major bottleneck in analyzing sequence data on a computer. Many techniques have therefore been developed to execute this analysis efficiently on a computer (for example, Non-Patent Document 1). These techniques compare a given read sequence directly with the reference genome sequence.

Li and Durbin, Bioinformatics 2009; 25(14): 1754-1760.

As described above, results of mapping read sequences to a reference genome sequence are available, but the reference genome sequence is not a single fixed sequence. In fact, the human genome contains regions that are difficult to sequence (for example, very long repeats). It is therefore not easy to provide a complete sequence as the reference genome sequence, and improvement of the reference genome sequence is still ongoing.

Also as described above, next-generation sequencers now make it possible to analyze the genome sequences of many individuals, and each individual's genome sequence differs from the reference genome sequence. Various reference genome sequences are therefore expected to be handled in the future. As a result, read sequences that have been mapped to one reference genome sequence are expected to need remapping to another reference genome sequence, and a mechanism for making that mapping efficient will also be required.

However, Non-Patent Document 1 provides no mechanism that, when read sequences have already been mapped to one reference genome sequence (hereinafter the "old genome sequence"), uses that mapping result to map them efficiently to another reference genome sequence (hereinafter the "new genome sequence").

Naturally, in a region where the old and new reference genome sequences match, the existing mapping result could be used as is as the mapping result for the new reference genome sequence. However, if the new genome sequence contains a newly inserted sequence identical to a sequence shared by the old and new reference genome sequences, it cannot be uniquely determined whether a read sequence should be mapped to the sequence that existed in the old genome sequence or to the newly inserted one. In other words, the original mapping result cannot be used as is.

The present invention therefore provides a technique for efficiently mapping an enormous number of read sequences that have been mapped to an old genome sequence to a new genome sequence that differs from the old genome sequence, by using the result of the mapping to the old genome sequence.

To solve the above problem, the present invention adopts, for example, the configurations described in the claims. This specification includes several means for solving the problem. One example is "a DNA sequence analyzer comprising: (1) a storage device that stores a first genome sequence called an old genome sequence, a second genome sequence called a new genome sequence, a large number of DNA sequences called read sequences, and a result of mapping the read sequences to the old genome sequence; (2) a new genome index construction unit that constructs an index for searching for an arbitrary character string present in the new genome sequence; (3) a query table update unit that constructs, based on the result of the mapping, a data structure called a query table that records combinations of mutations which are present in the read sequences relative to the old genome sequence and which lie near the processing position on the old genome sequence; and (4) a mapping destination search unit that compares, with the new genome sequence, a sequence constructed by applying a combination of mutations stored in the query table to the old genome sequence, and exhaustively outputs the locations where K or more bases (K being a natural number of 1 or more) of the constructed sequence exactly match the new genome sequence."

According to the present invention, an enormous number of read sequences mapped to an old genome sequence can be mapped efficiently to a new genome sequence that differs from the old genome sequence, by using the result of the mapping to the old genome sequence. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.

FIG. 1 is a diagram illustrating a conventional method of mapping read sequences.
FIG. 2 is a diagram illustrating an intermediate stage of the read sequence mapping method according to the embodiment.
FIG. 3 is a diagram illustrating the read sequence mapping method according to the embodiment.
FIG. 4 is a diagram showing a configuration example of the DNA sequence analyzer according to the embodiment.
FIG. 5 is a flowchart outlining the DNA sequence analysis processing according to the embodiment.
FIG. 6 is a diagram showing the sequence of the DNA sequence analysis processing according to the embodiment.
FIG. 7 is a flowchart showing an example of the processing performed by the new genome index construction unit.
FIG. 8 is a diagram showing an example of the query table.
FIG. 9 is a flowchart showing an example of the processing performed by the query table update unit.
FIG. 10 is a flowchart illustrating the update of existing query alleles (details of S210) performed by the query table update unit.
FIG. 11 is a diagram showing the query table of FIG. 8 after it has been updated.
FIG. 12 is a flowchart showing an example of the processing performed by the mapping destination search unit.
FIG. 13 is a diagram showing an example of the query table in a region with no nearby mutations.
FIG. 14 is a flowchart illustrating the processing performed by the mapping destination search unit when a matching region is lost (details of S306).
FIG. 15 is a flowchart illustrating the processing performed by the mapping destination search unit when a matching region is extended (details of S307).
FIG. 16 is a flowchart illustrating the processing (details of S309) performed by the processing unit used to search for a new matching region.
FIG. 17 is a diagram showing a visualization example of the query table.

Embodiments of the present invention are described below with reference to the drawings. The embodiments of the present invention are not limited to the examples described later, and various modifications are possible within the scope of its technical idea.

(1) Basic Concept
To efficiently remap the enormous number of read sequences that were mapped to the old genome sequence 108 onto the new genome sequence 114, it is preferable to avoid remapping each read sequence 107 individually as in the prior art (FIG. 1). That approach is inefficient because it does not use the information already known, and it requires a very long processing time.

In the embodiment described below, therefore, the sequence obtained by applying the mutations 401 contained in the read sequences 107 to the old genome sequence 108 (in the example of FIG. 2, the region of the old genome sequence 108 enclosed by the broken line) is first compared with the new genome sequence 114 to identify the corresponding positions in the old genome sequence 108 and the new genome sequence 114. In the example of FIG. 2, of the two similar regions, the one on the right in the new genome sequence 114 is more similar, so it becomes the mapping destination. Next, as shown in FIG. 3, the read sequences 107 that were mapped to the old genome sequence 108 are moved together to the corresponding position in the new genome sequence 114, which improves mapping efficiency.

(2) Terms
First, the terms used in this specification are defined.
- An "alphabet" is the set of characters that make up the character strings to be analyzed. For DNA sequences, the alphabet is {A, C, G, T}.
- A "character string" is a sequence obtained by concatenating characters of the alphabet. For example, "ATTG" is a character string.
- The length of a character string s is written |s|. For s = ATTG, |s| = 4.
- The i-th character of a character string s is written s[i]. For s = ATTG, s[1] = A. In this specification, the first character has index 1, not 0.
- A substring that starts at the beginning of a given character string is called a prefix. "ATTG", "ATT", "AT", and "A" are the prefixes of the character string "ATTG".
- A substring that ends at the end of a given character string is called a suffix. "ATTG", "TTG", "TG", and "G" are the suffixes of the character string "ATTG".
- When the suffixes of a given character string s are arranged in dictionary order, the integer array that records the position in s at which each suffix starts is called the suffix array (SA) (see "Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997"). For s = ATTG, the suffixes in dictionary order are "ATTG", "G", "TG", and "TTG", so the suffix array is SA = [1, 4, 3, 2]. The i-th element of the suffix array SA is written SA[i].
- The "LCP (longest common prefix)" of two character strings is the longest prefix they have in common.
- The "LCP array" is the integer array obtained by computing the LCP length for every pair of suffixes that are adjacent in the suffix array. For the suffix array example above, the LCP array is [0, 0, 1]. A known method (see "Kasai et al., Proc. CPM pp. 181-192, 2001") can be used to compute the LCP array.
- The "Burrows-Wheeler transform (BWT)" of a given character string s is a character string of the same length as s whose i-th character is s[SA[i] - 1] (see "Navarro & Mäkinen, ACM Computing Surveys 39(1): Article No. 2, 2007"). When SA[i] = 1, the i-th character of the BWT is "$". "$" is called the terminal symbol and is a new character that is not in the alphabet.
- The "FM-index" is a data structure that extends the BWT so that it can be used as a search index for finding any character string contained in the original character string (see "Ferragina & Manzini, Proc. FOCS pp. 390-398, 2000").
- A "bit vector" is a sequence of 0s and 1s. For example, "0100110" is a bit vector.
- The function rank(S, a, i) returns the number of times a occurs in the first i positions of a character string or bit vector S. For example, rank(ATTG, T, 3) = 2.
- The function select(S, a, i) returns the position of the i-th occurrence of a, counted from the beginning, in a character string or bit vector S. For example, select(ATTG, T, 2) = 3.
- A "dictionary" of a bit vector is a data structure for computing the rank and select functions on the bit vector quickly (see "Navarro & Mäkinen, ACM Computing Surveys 39(1): Article No. 2, 2007").
- A "mutation" is a difference in DNA sequence arising from individual differences or the like.
- An "allele" is one of the different DNA sequences produced by a mutation. For example, at the position of a given mutation in the genome sequence, the allele of the old genome sequence may be A while the allele of a read sequence is T.
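The definitions above can be checked on the running example s = ATTG with a minimal Python sketch. The function names are illustrative and not taken from the specification, and the naive sort-based construction is chosen only for readability, not for genome-scale input. Positions are 1-based, following the convention stated above.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in dictionary order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s, sa):
    """LCP length for each pair of suffixes adjacent in the suffix array."""
    def lcp(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n
    return [lcp(s[sa[i] - 1:], s[sa[i + 1] - 1:]) for i in range(len(sa) - 1)]


def bwt(s, sa):
    """i-th character is s[SA[i] - 1] (1-based), or '$' when SA[i] = 1."""
    return "".join("$" if p == 1 else s[p - 2] for p in sa)


def rank(S, a, i):
    """Number of occurrences of a among the first i symbols of S."""
    return S[:i].count(a)


def select(S, a, i):
    """1-based position of the i-th occurrence of a in S, or None if absent."""
    seen = 0
    for pos, ch in enumerate(S, start=1):
        if ch == a:
            seen += 1
            if seen == i:
                return pos
    return None


s = "ATTG"
sa = suffix_array(s)
print(sa, lcp_array(s, sa), bwt(s, sa))
print(rank(s, "T", 3), select(s, "T", 2))
```

A practical dictionary would answer rank and select from precomputed tables rather than by scanning, which is what the bit vector dictionary defined above is for.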

(3) Example 1
Next, a mechanism is described for efficiently mapping the read sequences 107 to the new genome sequence 114 by using the result of mapping to the old genome sequence 108.

(3-1) Preconditions of the Example
To simplify the description, Example 1 makes the following assumptions. First, the new genome sequence 114 is assumed to be a single character string. This assumption does not hold when multiple chromosomes are handled, but the restriction can be avoided by, for example, adding a new character "#" to the alphabet and concatenating the multiple sequences of the new genome sequence into a single character string with "#" between them. This is a common technique for handling multiple character strings with a suffix array.
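As a rough illustration of the "#" concatenation described above, the following sketch joins several sequences into one string and converts a 1-based position in the joined string back to a sequence index and an offset. The helper names `concatenate` and `locate` are hypothetical, and the specification does not describe how positions are translated back; that part is an assumption added for the example.

```python
from bisect import bisect_right


def concatenate(chromosomes, sep="#"):
    """Join sequences with a separator and record each 1-based start position."""
    text = sep.join(chromosomes)
    starts = []
    pos = 1
    for chrom in chromosomes:
        starts.append(pos)
        pos += len(chrom) + len(sep)
    return text, starts


def locate(starts, position):
    """Map a 1-based position in the joined string to (sequence index, 1-based offset).

    Assumes the position does not point at a separator character.
    """
    idx = bisect_right(starts, position) - 1
    return idx, position - starts[idx] + 1


text, starts = concatenate(["ACGT", "GG", "TTA"])
print(text, starts, locate(starts, 7))
```

Because "#" never occurs in a read, an exact match found in the joined string cannot span two of the original sequences.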

Mutations detected during processing are also assumed to be single-base mutations. A mutation spanning multiple bases can be processed as multiple consecutive single-base mutations.

In general, aligning a read sequence 107 to a reference genome sequence consists of two stages: computing the positions on the reference genome sequence to which the read sequence 107 could be mapped (hereinafter "seeds"), and then comparing the character string around each seed with the read sequence 107 while tolerating slight sequence differences to decide whether the read can be mapped there (hereinafter "extension").

The embodiment below provides a procedure for computing seeds in mapping. For the extension step, a known method such as the Smith-Waterman method (Smith & Waterman, Journal of Molecular Biology 147:195-197, 1981) can be used.
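For orientation only, the extension step mentioned above could score a read against the reference text around a seed with a Smith-Waterman local alignment. The sketch below computes just the best local score with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for the example, not values from the specification.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("AAAA", "CCCC"))
```

A mapper would compare this score with a threshold to decide whether the seed is accepted; the threshold is not specified in the text above.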

Two parameters are given before the analysis starts. The first parameter, L, is the length of the read sequences 107. Read sequences were around 30 bases long on early next-generation sequencers but can exceed 200 bases in recent years. L is set to match the read length of the data to be processed. The second parameter, K (a natural number of 1 or more), is the length of exact match required when searching for mapping destination candidates. For example, with L = 100 bases and up to 5% errors allowed, at least one exact match of 15 bases exists, so setting K = 15 and finding every exact match of K or more bases exhaustively enumerates the locations that can be mapped with up to 5% errors.
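The choice K = 15 can be checked with a pigeonhole argument: a read of length L with at most e errors, each affecting one base of the read, is cut into at most e + 1 error-free segments that together cover at least L - e bases, so the longest segment has at least ceil((L - e) / (e + 1)) bases. The sketch below computes that bound. The function name is hypothetical, and the argument is an illustration rather than text from the specification.

```python
def guaranteed_exact_match(read_length, max_errors):
    """Lower bound on the longest error-free stretch in a read with at most max_errors errors."""
    matched = read_length - max_errors   # bases not hit by an error
    segments = max_errors + 1            # errors split the read into this many pieces at most
    return -(-matched // segments)       # ceiling division


bound = guaranteed_exact_match(100, 5)
print(bound, bound >= 15)
```

For L = 100 and 5 errors the bound is 16, so requiring an exact match of K = 15 bases does not miss any location that is mappable within the 5% error limit.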

In the following description, the read sequences 107 are assumed to include the result of mapping the read sequences 107 to the old genome sequence 108. In this specification, the terms "sequence" and "character string" are used interchangeably, as are "base" and "character".

(3-2) System Configuration
FIG. 4 shows a configuration example of the DNA sequence analyzer 100 according to this embodiment and of a DNA sequence analysis system 115 built with the analyzer. The DNA sequence analyzer 100 comprises a CPU (Central Processing Unit) 101, a main storage device (memory) 102, and an auxiliary storage device 103, and uses removable media 104 as needed.

The main storage device 102 is a memory such as a RAM (Random Access Memory) that holds the various programs executed by the CPU 101 and the data needed to execute them. The main storage device 102 stores the read sequences (1) 107-1, the old genome sequence (1) 108-1, the new genome index (1) 109-1 built from the new genome, and the query table 110. Executing the programs stored in the main storage device 102 on the CPU 101 provides the functions of the new genome index construction unit 111, the query table update unit 112, and the mapping destination search unit 113, which are described later.

The auxiliary storage device 103 is a storage device such as an HDD (Hard Disk Drive) and records the read sequences (2) 107-2, the old genome sequence (2) 108-2, the new genome index (2) 109-2, and so on. The auxiliary storage device 103 may also record the new genome sequence (1) 114-1. The removable media 104 are storage media such as CDs and DVDs that can be attached to and detached from the DNA sequence analyzer 100, and record the read sequences (3) 107-3, the old genome sequence (3) 108-3, the new genome index (3) 109-3, the new genome sequence (2) 114-2, and so on.

A user interface unit 106 may be connected to the DNA sequence analyzer 100, and a storage device may be connected via a network 105. The network 105 may be a LAN (Local Area Network) or the Internet, and it need not be a wired connection; it may be wireless. The user interface unit 106 is an input/output device that provides a user interface and consists of, for example, a keyboard, a mouse, and a display. The user interface unit 106 may be integrated with the DNA sequence analyzer 100.

The storage device connected via the network 105 can record the read sequences (4) 107-4, the old genome sequence (4) 108-4, the new genome index (4) 109-4, and the new genome sequence (3) 114-3. In this embodiment, the configuration in which the user interface unit 106 and these storage devices are connected to the DNA sequence analyzer 100 is called the DNA sequence analysis system 115. This division is not fixed, however; for example, a configuration that executes part of the programs in cooperation with other computers connected via the network 105 may also be called a DNA sequence analysis system. Alternatively, only the user interface unit 106 may be located on the user side, with the user checking the results of processing executed on a server connected through the network. The distinction between the DNA sequence analyzer 100 and the DNA sequence analysis system 115 is one of convenience, and their configurations may be the same.

In FIG. 4, data of the same type carry the labels (1) to (4) according to where they are stored. Where the storage location is irrelevant to the description, they are referred to collectively as the read sequences 107, the old genome sequence 108, the new genome index 109, and the new genome sequence 114.

These data may be stored in the main storage device 102 as needed, for example when the CPU 101 reads or writes them, and may be copied from the main storage device 102 to other storage media or storage devices (including the storage device) when the DNA sequence analyzer 100 is powered off or the main storage device 102 runs out of free space. Before analysis starts, data may also be stored on the removable media 104 or an external storage device by another device (not shown) and then copied to the main storage device 102 or the auxiliary storage device 103 in the DNA sequence analyzer 100 when the analyzer starts up or when analysis starts.

(3-3) Processing Operation
The processing performed by the DNA sequence analyzer 100 is described below with reference to FIGS. 5 and 6. FIG. 5 outlines the processing up to the output of the mapping destination candidates of the read sequences 107 on the new genome sequence 114. FIG. 6 is a processing sequence that illustrates how the processing units cooperate. The individual processes are described with reference to these figures.

(3-3-1) Processing of the New Genome Index Construction Unit 111
The new genome index construction unit 111 is a processing unit that builds a search index (the new genome index 109) to facilitate searching for an arbitrary subsequence of the new genome sequence 114. The new genome index 109 consists of two pieces of data. One is an FM-index, a general-purpose index for string search. The other is a bit vector based on the LCP array, which is used to search exhaustively for matches between the old genome index and the new genome index 109.
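To make the role of the FM-index concrete, here is a minimal sketch of the standard backward search that an FM-index supports. It is a textbook illustration, not the specification's implementation. It appends "$" to the text before sorting (the usual convention), and it counts characters by scanning instead of using precomputed rank tables. The function names are hypothetical.

```python
def build_fm_index(text):
    """Naive FM-index parts: suffix array, BWT, and first-column offsets C."""
    t = text + "$"                        # '$' sorts before A, C, G, T in ASCII
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = "".join(t[i - 1] for i in sa)   # t[-1] is '$' when i == 0
    c, total = {}, 0
    for ch in sorted(set(t)):
        c[ch] = total                     # number of characters smaller than ch
        total += t.count(ch)
    return sa, bwt, c


def backward_search(pattern, sa, bwt, c):
    """Return the sorted 1-based positions where pattern occurs in the text."""
    lo, hi = 0, len(bwt)                  # half-open interval of suffix array rows
    for ch in reversed(pattern):
        if ch not in c:
            return []
        lo = c[ch] + bwt[:lo].count(ch)   # rank(bwt, ch, lo)
        hi = c[ch] + bwt[:hi].count(ch)   # rank(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(sa[i] + 1 for i in range(lo, hi))


sa, bwt, c = build_fm_index("ATTGAT")
print(backward_search("AT", sa, bwt, c))
```

The pattern is consumed from its last character to its first, and each step narrows the interval of suffix array rows whose suffixes start with the part of the pattern read so far.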

FIG. 7 shows the processing performed by the new genome index construction unit 111. First, the new genome index construction unit 111 builds an FM-index for the new genome sequence 114 (S101). Next, it computes the LCP array for the new genome sequence 114 (S102). It then builds a bit vector B_L of the same length as the LCP array computed in S102 (S103). The i-th value of B_L is "1" if the i-th entry of the LCP array is K or more, and "0" otherwise. Finally, it builds a dictionary for the bit vector B_L so that rank and select can be computed quickly (S104).
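Steps S102 to S104 can be sketched as follows, under the assumption that a naive construction is acceptable for illustration. The names `build_lcp_bitvector` and `BitDictionary` are hypothetical, and the dictionary below uses plain prefix sums rather than the succinct structures a genome-scale implementation would need.

```python
from itertools import accumulate


def build_lcp_bitvector(genome, k):
    """B_L: 1 where adjacent suffixes in the suffix array share at least k leading bases."""
    sa = sorted(range(len(genome)), key=lambda i: genome[i:])

    def lcp(i, j):
        n = 0
        while i + n < len(genome) and j + n < len(genome) and genome[i + n] == genome[j + n]:
            n += 1
        return n

    lcp_arr = [lcp(sa[r], sa[r + 1]) for r in range(len(sa) - 1)]
    return [1 if v >= k else 0 for v in lcp_arr]


class BitDictionary:
    """rank/select on a bit vector via precomputed tables (1-based positions)."""

    def __init__(self, bits):
        self.prefix = [0] + list(accumulate(bits))   # prefix[i] = ones in the first i bits
        self.ones = [i for i, b in enumerate(bits, start=1) if b]

    def rank1(self, i):
        return self.prefix[i]

    def select1(self, j):
        return self.ones[j - 1]


b_l = build_lcp_bitvector("ATTGATTG", 3)
d = BitDictionary(b_l)
print(b_l, d.rank1(7), d.select1(2))
```

A 1 at position i means that the i-th and (i+1)-th suffixes in dictionary order agree on at least K bases, so runs of 1s mark groups of positions in the new genome that share the same K-base string.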

(3-3-2) Processing of the Query Table Update Unit 112
Once the new genome index construction unit 111 has built the dictionary, the query table update unit 112 starts processing. The query table update unit 112 reads the old genome sequence 108 one base at a time and records mutation positions in the query table 110. Updating (preparing) the query table 110 in this way allows the multiple read sequences 107 mapped to the same location of the old genome sequence 108 to be mapped to the new genome sequence 114 in a single batch, which markedly improves mapping efficiency.

However, a read sequence 107 does not necessarily match the old genome sequence 108 exactly at the location where it is mapped; mutations may cause differences. This embodiment therefore searches not with the old genome sequence 108 itself at the mapped location, but with a sequence constructed by applying the mutation alleles contained in the read sequences 107 to the old genome sequence 108. To realize this, the query table update unit 112 records the mutations found in the read sequences 107 in the query table 110 and manages the mutation information to be reflected in the old genome sequence 108.

The query table update unit 112 is used to update the query table 110 both when the query table 110 is constructed (initial state) and during the search. FIG. 8 shows an example of the query table 110. Note that FIG. 8 is not the data of the query table 110 itself but a tabular rendering for explanation. The query table stores the combinations of alleles of the mutations located within L bases of the processing position on the old genome sequence 108. Such a combination is hereinafter called a query allele 1101. A sequence obtained by modifying the old genome sequence 108 to reflect a query allele 1101 is called a query sequence. For each query allele 1101, the table records the length of the corresponding query sequence (column 1102), the range of the new genome index in which that query sequence occurs (column 1103), and the allele at each mutation position (column 1104). For each mutation position, a list 1105 of the alleles appearing in the read sequences 107 may also be recorded.

FIG. 9 shows the processing executed by the query table update unit 112. The query table update unit 112 updates the query table 110 as needed so that it records the mutations of the read sequences 107 mapped near the processing position (on the old genome sequence 108) of the mapping destination search unit 113.

First, the query table update unit 112 removes mutations that are no longer nearby (S201 to S203). Specifically, in S201, it determines whether the query table 110 contains a mutation located L or more bases (L being a natural number of 1 or more) away from the position being processed. If so, it proceeds to S202 and deletes the column recording that mutation from the query table 110. It then proceeds to S203 and merges identical query alleles 1101: it compares the query alleles 1101 recorded in the query table 110 and, where several consist of the same combination of alleles and have the same length value, keeps only one of them and deletes the others.
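The pruning and merging of S201 to S203 can be sketched in Python as follows. The data layout is an assumption made for illustration: each row of the query table is a dictionary with hypothetical keys "alleles" (mutation position to allele), "length" and "range", and the mutation columns are a plain list of positions.

```python
def prune_and_merge(rows, variant_positions, current_pos, L):
    # S201-S202: keep only mutation columns closer than L bases to current_pos.
    kept = [p for p in variant_positions if abs(p - current_pos) < L]
    kept_set = set(kept)
    pruned = []
    for row in rows:
        alleles = {p: a for p, a in row["alleles"].items() if p in kept_set}
        pruned.append({**row, "alleles": alleles})

    # S203: rows with the same allele combination and the same length
    # are merged; only the first one is kept.
    seen = set()
    merged = []
    for row in pruned:
        key = (tuple(sorted(row["alleles"].items())), row["length"])
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged, kept
```

For example, with L = 100 and a current position of 1103, a column at position 1210 is dropped, and two rows that differed only at that position collapse into one.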

Next, the query table update unit 112 adds newly found mutations to the query table 110 (S204 to S205). Specifically, in S204, it adds a new column to the query table 110. In S205, it records the alleles of the old genome sequence 108 and of the read sequences 107, associating them with the number representing the position on the old genome sequence 108 at which a mutation was determined to exist.

When these steps are complete, the query table update unit 112 executes the process of adding a query allele (S208 to S209) if the query table 110 was empty before processing, and the process of updating the query alleles (S208, S210) if it was not. Specifically, in S208, it determines whether one or more query alleles 1101 already exist in the query table 110.

If none exists, in S209 the query table update unit 112 adds to the query table 110 a new query allele 1101 that carries the read sequence 107 allele at the found mutation. If there is no mutation at the position being processed, however, it adds a row with no corresponding mutation in place of a query allele, as illustrated in FIG. 13. In this case the "range of the new genome index" in the query table is set to the entire range ([1, length of the new genome sequence]) and the "length" is set to 1. If S208 determines that query alleles 1101 already exist, in S210 the query table update unit 112 executes the update process for each existing query allele 1101 (FIG. 10).

FIG. 10 shows the detailed operation of the update process executed in S210. First, the query table update unit 112 processes all existing query alleles 1101 in turn (S2101, S2109). Specifically, in S2101, it selects one unprocessed query allele 1101 from the query table 110. In S2109, it determines whether any unprocessed query allele 1101 remains and, if so, returns to S2101. Through this loop, the processing of S2102 to S2108 described below is repeated for every query allele 1101, one at a time.

Next, the query table update unit 112 processes all alleles of the newly added mutation in turn (S2102, S2105). Specifically, in S2102, it selects one unprocessed allele from the alleles of the new mutation. In S2105, it determines whether any allele of the new mutation remains unprocessed and, if so, returns to S2102. Through this loop, the processing of S2103 to S2104 described below is repeated for every allele of the new mutation, one at a time.

Next, the query table update unit 112 determines whether any read sequence 107 contains both the new mutation allele and the existing query allele 1101 (S2103). If such a read sequence exists, it adds to the query table 110 a new query allele 1101 that combines the new mutation allele with the existing query allele 1101 (S2104). If the result of S2103 is negative (that is, no such read sequence 107 exists), it skips S2104 and proceeds to S2105.

When the loop ends at S2105 (all alleles of the new mutation have been processed), the query table update unit 112 determines, for the existing query allele 1101 being processed, whether the addition in S2104 was not performed for any of the new mutation alleles (S2106). If the result is affirmative, it executes the match-region-loss process in order to output the match region found so far (S2107). If the result of S2106 is negative, or once S2107 has finished, it deletes the existing query allele 1101, which the preceding processing has made unnecessary (S2108).
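The nested loops of S2101 to S2109 can be sketched in Python as below. This is an illustration under stated assumptions: query alleles and reads are modeled as dictionaries from mutation position to allele, the function name is hypothetical, and S2106 is interpreted as "no allele of the new mutation could be combined with this query allele", which is one possible reading of the text.

```python
def extend_query_alleles(existing, new_pos, new_alleles, reads):
    def supported(combo):
        # S2103: does at least one read carry every allele in combo?
        return any(all(r.get(p) == a for p, a in combo.items()) for r in reads)

    updated, lost = [], []
    for combo in existing:                      # S2101 / S2109
        added = False
        for allele in new_alleles:              # S2102 / S2105
            candidate = {new_pos: allele, **combo}
            if supported(candidate):            # S2103
                updated.append(candidate)       # S2104
                added = True
        if not added:                           # S2106 (assumed interpretation)
            lost.append(combo)                  # S2107 would output its match region
        # S2108: the old combo itself is not carried over into `updated`.
    return updated, lost
```

The returned `lost` list stands in for the query alleles whose match regions would be output by the match-region-loss process.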

With reference to FIG. 11, the following describes how the update process of the query table update unit 112 changes the query table 110. FIG. 11 shows the query table 110 of FIG. 8 after it has been updated. FIG. 8 assumes the state after the 1104th base of the old genome sequence has been processed, and FIG. 11 the state after the 1103rd base has been processed. It is also assumed that L = 100 and K = 15.

In the query table 110 of FIG. 8, nothing lies L = 100 or more bases away from position 1103, so no column is deleted in S201 to S203. Next, the mutation found at position 1103 of the old genome sequence 108 is added as a new column, as in the query table 110 of FIG. 11 (S204 to S205). The query table update process (S208, S210) then updates the query alleles 1101 written in the query table 110 one at a time (S2101, S2109).

First, consider processing G, A, listed in the first row of the query alleles 1101 in the query table 110 of FIG. 8. In this example the new mutation found at position 1103 is T or C, so prepending these gives two candidate query alleles 1101, "T, G, A" and "C, G, A". If such a combination of alleles does not occur in any read sequence 107, there is no need to record it in the query table 110 and process it. The presence of a read sequence 107 containing each combination is therefore checked (S2103), and the combination is added to the query table 110 only if one exists (S2104). FIG. 11 assumes that read sequences 107 containing "T, G, A" and "C, G, A" both exist, so both have been added to the query table 110. The query allele 1101 consisting only of G, A, which was listed in the query table 110 of FIG. 8, is no longer needed and is deleted (S2108).

Next, consider processing C, A, listed in the second row of the query alleles 1101 in the query table 110 of FIG. 11. Combining it with the new mutation found at position 1103 gives two possibilities, "T, C, A" and "C, C, A". S2103 determines that neither occurs in the read sequences 107, however, so neither remains in FIG. 11. Instead, S2106 determines that no read sequence carries this query allele together with the new mutation, and the match-region-loss process (S2107) is executed.

From G, T, listed in the third row of the query alleles 1101 in the query table 110 of FIG. 11, only "C, G, T" occurs in the read sequences 107, and from C, T, listed in the fourth row, only "T, C, T"; these are therefore added to the query table 110 as shown in FIG. 11. The query table 110 of FIG. 11 also contains a query allele 1101 consisting only of C and one consisting only of T. These were added because S3074 (FIG. 15), described later, found that the prefix of length K = 15 matches a new location on the new genome sequence.

(3-3-3) Processing of the Mapping Destination Search Unit 113
The mapping destination search unit 113 uses the new genome index 109 to search for where in the new genome sequence 114 the strings obtained by applying the mutations listed in the query table 110 to the old genome sequence 108 occur. It processes the old genome sequence 108 one base at a time from the right end to the left end and, referring to the query table 110, extends each query sequence by one base for as long as a sequence identical to the query sequence exists in the new genome sequence 114.

FIG. 12 shows the processing executed by the mapping destination search unit 113. First, the mapping destination search unit 113 performs initialization (S301). Specifically, it initializes the variable i to an integer equal to the length of the old genome sequence 108. Next, it updates the query table 110 (S302). Specifically, it executes the processing of the query table update unit 112 for the i-th position of the old genome sequence 108.

For each query allele 1101 in the query table 110, the mapping destination search unit 113 extends the matching location in the new genome sequence 114 by one base (S303 to S308). Specifically, in S303, it selects one unprocessed query allele 1101 from those recorded in the query table 110 and generates a query sequence by using that query allele 1101 to modify the stretch of the old genome sequence 108 that starts at the i-th base and has the length recorded in the "length" column 1102 of the query table 110.

If there is no mutation near the position being processed, however, the query table 110 may contain no mutation, as in FIG. 13. In that case, the stretch of the old genome sequence 108 that starts at the i-th base and has the length recorded in column 1102 is used as the query sequence as is. In this specification, a row representing such a query sequence, which has no mutation and matches the old genome sequence 108, is handled in the same way as a query allele 1101.
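Generating a query sequence in S303 amounts to copying a window of the old genome and overwriting the mutated positions. The Python sketch below illustrates this under simplifying assumptions: positions are 1-based, every mutation is a single-base substitution, and the function name and the dictionary layout for the query allele are hypothetical.

```python
def build_query_sequence(old_genome, i, length, alleles):
    # Copy `length` bases of the old genome starting at 1-based position i.
    bases = list(old_genome[i - 1:i - 1 + length])
    # Overwrite each mutated position that falls inside the window.
    for pos, base in alleles.items():
        if i <= pos < i + length:
            bases[pos - i] = base
    return "".join(bases)
```

With an empty `alleles` dictionary the function returns the old genome window unchanged, which corresponds to the mutation-free row of FIG. 13.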

Next, in S304, the mapping destination search unit 113 takes the query sequence that contains the query allele 1101 and whose length is the value recorded in the "length" column 1102 plus 1, lets b be its first character and x the suffix starting at its second character, and computes the range of the new genome index 109 in which the string bx occurs. This range can be computed with LF-mapping (Navarro & Mäkinen, ACM Computing Surveys 39(1): Article No. 2, 2007) as follows.
beg(bx) = C[b] + rank(BWT, b, beg(x) - 1) + 1
end(bx) = C[b] + rank(BWT, b, end(x))

Here, C[b] is the total number of characters in the new genome sequence that are lexicographically smaller than b, and beg(x) and end(x) are the start and end positions recorded in the new genome index range (column 1103).
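The two LF-mapping formulas can be tried out on a small string with the Python sketch below. It is only an illustration: the BWT is built by naive suffix sorting, rank is computed by counting rather than with a succinct dictionary, a sentinel "$" that sorts before A, C, G and T is assumed to terminate the text, and all function names are hypothetical. Ranges are 1-based and inclusive, as in the formulas.

```python
def build_bwt(text):
    # text must end with a unique sentinel '$'.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix
    # (text[-1], the sentinel, for the suffix starting at 0).
    return "".join(text[i - 1] for i in sa)


def char_counts(text):
    # C[b]: number of characters in text lexicographically smaller than b.
    return {b: sum(1 for c in text if c < b) for b in set(text)}


def rank(bwt, b, i):
    # Occurrences of b among the first i characters of the BWT.
    return bwt[:i].count(b)


def lf_step(bwt, C, b, beg_x, end_x):
    # Extend the range of x to the range of bx.
    beg_bx = C[b] + rank(bwt, b, beg_x - 1) + 1
    end_bx = C[b] + rank(bwt, b, end_x)
    return beg_bx, end_bx


def backward_search(text, pattern):
    bwt, C = build_bwt(text), char_counts(text)
    beg, end = 1, len(text)          # range of the empty string
    for b in reversed(pattern):
        if b not in C:
            return None
        beg, end = lf_step(bwt, C, b, beg, end)
        if beg > end:                # bx does not occur (the S305 test)
            return None
    return beg, end
```

Each call to `lf_step` corresponds to one execution of S304; an empty result range corresponds to the case in which the match is lost.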

In the next step, S305, the mapping destination search unit 113 determines whether the computation resulted in beg(bx) > end(bx). If beg(bx) > end(bx), the properties of LF-mapping imply that the string bx does not occur in the new genome sequence 114, so the mapping destination search unit 113 executes the match-region-loss process (S306).

FIG. 14 shows the detailed operation of S306. In this process, the mapping destination search unit 113 identifies and outputs the necessary ones among the matches between the query sequence and the new genome sequence 114 that had been extended up to just before the match was lost. First, in S3061, it determines whether |x| ≥ K. If |x| < K, nothing is output. In S3062, it determines whether the range has already been output, for example while processing another query allele. If it has, the range is not output again. In S3063, it outputs, as the match region at i + 1, the part of the "range of the new genome index" that has not yet been output for the value i + 1.

Returning to FIG. 12: if beg(bx) ≤ end(bx) in S305, the properties of LF-mapping imply that the string bx occurs in the new genome sequence, so the mapping destination search unit 113 executes the match-region-extension process (S307).

FIG. 15 shows the detailed operation of S307. In this process, the mapping destination search unit 113 performs roughly three operations. The first updates the range in which the string bx, obtained by extending the query sequence x of each query allele recorded in the query table by one character to the left, occurs in the new genome (S3071). The second outputs match regions that have reached the read length L (S3072 to S3073). The third increases the number of matching locations by shortening the query sequence bx (S3074 to S3076): it computes new locations on the new genome sequence 114 that match the prefix of length K of bx. Matches shorter than K are not output and need no processing, and matches of length K + 1 or more can be handled by extending already found matches of length K or more.

In S3071, the mapping destination search unit 113 replaces the "range of the new genome index" of the query allele 1101 being processed with (beg(bx), end(bx)) and adds 1 to the "length" of the query allele 1101. In S3072, it determines whether the "length" has reached L or more.

If the "length" has reached L or more, the mapping destination search unit 113 proceeds to S3073. If any read sequence is mapped at position i of the old genome sequence, it outputs i together with the part of the replaced "range of the new genome index" that has not yet been output for the same value of i, and resets the "length" to L - 1. If the "length" is less than L, or after S3073 has been executed, it proceeds to S3074 and checks whether the prefix y of length K of bx occurs in the new genome sequence anywhere other than as a prefix of bx. To do so, it uses the bit vector B_L to compute the following two values, beg(y) and end(y).
beg(y) = select(B_L, 0, rank(B_L, 0, beg(bx) - 1)) + 1
end(y) = select(B_L, 0, rank(B_L, 0, end(bx) - 1) + 1)

It then determines whether beg(y) ≠ beg(bx) or end(y) ≠ end(bx) holds. If so, y is judged to occur in the new genome sequence at a location other than as a prefix of bx. In the formulas above, 1 is subtracted from beg(bx) and end(bx) because the beg(x)-th and end(x)-th elements of the BWT correspond to the (beg(x) - 1)-th and (end(x) - 1)-th elements of B_L.
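The idea behind S3074 is that the zeros of the thresholded LCP bit vector delimit groups of suffixes sharing their first K characters, so a range can be widened to its whole group by locating the neighboring zeros. The Python sketch below illustrates this with its own indexing convention, which is an assumption and differs from the 1-based formulas above: ranks are 0-based, bits[i] is 1 when the suffixes ranked i-1 and i share at least K characters, and bits[0] is 0. A sorted list of zero positions searched with bisect stands in for the rank/select dictionary, and the function name is hypothetical.

```python
from bisect import bisect_right


def widen_to_k_prefix(bits, lo, hi):
    # Positions of the 0 bits; each one starts a new group of suffixes
    # that share their first K characters.
    zeros = [i for i, b in enumerate(bits) if b == 0]

    # Last zero at or before lo: the first suffix of the group (rank + select).
    count = bisect_right(zeros, lo)
    new_lo = zeros[count - 1]

    # First zero after hi starts the next group; the group ends just before it.
    idx = bisect_right(zeros, hi)
    new_hi = zeros[idx] - 1 if idx < len(zeros) else len(bits) - 1
    return new_lo, new_hi
```

If the widened range differs from the input range, the length-K prefix occurs at additional locations, which corresponds to the condition tested in the text.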

In S3075, the mapping destination search unit 113 determines whether the condition of S3074 is satisfied. If it is, the unit proceeds to S3076 and adds a new query allele combination to the query table 110. This query allele copies all alleles of the preceding query allele that lie within K bases, with the "range on the new genome index" set to [beg(y), end(y)] and the "length" set to K. If there is no mutation within K bases, a row with no mutation, as in the example of FIG. 13, is added to the query table. If the condition of S3074 is not satisfied, the unit ends the processing of S307.

Returning to FIG. 12: when the above processing is complete, S308 is executed. In S308, the mapping destination search unit 113 determines whether any unprocessed query allele 1101 remains in the query table 110 and, if so, repeats the processing of S303 to S308 described above. If no unprocessed query allele 1101 remains, the mapping destination search unit 113 executes, as S309, a process that searches for new match regions that are not extensions of existing match regions.

FIG. 16 shows the detailed operation of S309. In this process, the mapping destination search unit 113 considers combinations of alleles that are not among the existing query alleles 1101 (S3091 to S3093, S3098), applies them to a K-base region of the old genome sequence (S3094), and searches the new genome sequence 114 for locations that match over K or more bases (S3095). If a matching location is found, the combination is added to the query table 110 (S3096 to S3097).

In S3091, the mapping destination search unit 113 enumerates the allele combinations of all mutations within K bases of the position on the old genome sequence 108 indicated by the variable i. If there is no mutation within K bases, the same processing is performed as when an allele that exactly matches the old genome sequence exists. In S3092, it selects one of the allele combinations enumerated in S3091. In S3093, it determines whether any read sequence contains the allele combination selected in S3092. If none does, it moves to S3098 and proceeds to the next allele combination.

If the result of S3093 is affirmative, the mapping destination search unit 113 proceeds to S3094 and generates a sequence by applying the selected allele combination to the K bases of the old genome sequence 108 that start at the position indicated by the variable i. In S3095, it uses the new genome index 109 to determine whether the sequence generated in S3094 occurs in the new genome sequence 114.

If the sequence is determined not to occur (negative result in S3096), the mapping destination search unit 113 moves to S3098 and proceeds to the next allele combination. If it is determined to occur (affirmative result in S3096), the unit moves to S3097 and, unless the query table already contains it, adds the allele combination processed since S3092 to the query table 110 as a new query allele 1101. If S3091 was processed on the basis that no mutation lies within K bases and an exactly matching allele exists, a row with no mutation, as illustrated in FIG. 13, is added to the query table 110. Then, in S3098, if any allele combination remains unprocessed, the unit returns to S3092 and processes another combination.
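The seed search of S3091 to S3098 can be sketched in Python as follows. The sketch rests on several assumptions that are not in the original: mutations are single-base substitutions held in a dictionary from 1-based position to candidate alleles, reads are dictionaries from position to observed allele, the window covers positions i to i + K - 1, and a plain substring test stands in for the lookup in the new genome index. All names are hypothetical.

```python
from itertools import product


def find_new_seeds(old_genome, new_genome, i, K, variants, reads, known):
    # S3091: mutations inside the K-base window starting at position i.
    window = {p: a for p, a in variants.items() if i <= p < i + K}
    positions = sorted(window)
    found = []
    for choice in product(*(window[p] for p in positions)):   # S3092 / S3098
        combo = dict(zip(positions, choice))
        # S3093: skip combinations that no read supports.
        if not any(all(r.get(p) == a for p, a in combo.items()) for r in reads):
            continue
        # S3094: apply the combination to the K-base window of the old genome.
        seed = list(old_genome[i - 1:i - 1 + K])
        for p, a in combo.items():
            seed[p - i] = a
        seed = "".join(seed)
        # S3095-S3097: keep it if it occurs in the new genome and is not yet known.
        if seed in new_genome and frozenset(combo.items()) not in known:
            found.append((combo, seed))
    return found
```

When the window contains no mutation, `product` yields a single empty combination, so the unmodified old genome window is tested, matching the mutation-free case described above.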

Returning to FIG. 12: when the processing described above is complete, the mapping destination search unit 113 moves one base along the old genome sequence 108, from the right end toward the left end (S310 to S311). Specifically, in S310, it subtracts 1 from the variable i. In S311, if the value of the variable i is 1 or greater, it returns to S302 and repeats the processing described above.

(3-4) Effects
By executing the processing described above, S3073 can exhaustively output the positions in the new genome sequence 114 at which any sequence of length up to L, obtained by applying to the old genome sequence 108 the mutations seen in the read sequences 107, occurs. According to this embodiment, therefore, the candidate positions in the new genome sequence 114 to which the read sequences 107 that were mapped to the same location of the old genome sequence 108 should be mapped can be computed efficiently in a single batch.

(4) Example 2
To evaluate the validity of a mapping result, it is desirable that the user be able to inspect the contents of the query table 110, which plays an important part in the DNA sequence analysis device 100. This embodiment therefore describes a mechanism that provides a graphical interface through which the user can check the contents of the query table 110.

FIG. 17 shows an example of an interface screen 1700. The interface screen 1700 is displayed on the display that constitutes the user interface unit 106. The interface screen 1700 has a display area 1701 that shows the contents of the query table 110. Below the display area 1701 is a display area 1710 for the query sequences corresponding to the query table 110.

In the display area 1710, straight lines 1703 representing query sequences are drawn parallel to a straight line (horizontal axis) 1702 representing the old genome sequence 108. On each straight line 1703, the allele 1705 of the query sequence is shown at each mutation position 1704 in the query table 110, together with the number 1706 of read sequences 107 that match the query sequence (for example, "3×" means three reads). The range of the old genome sequence 108 shown on the interface screen 1700 can be freely selected by the user through an interface such as the direction buttons 1707. The query table 110 records the alleles of mutations that occur in a nearby region (within a distance of less than L). That is, because the query table 110 is unaffected by regions L or more away, the query table for the region the user wants to examine can be reconstructed efficiently.
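As a rough text-mode analogue of the display area 1710, the following Python sketch draws one line per query sequence, places each allele at its mutation position, and appends the read count in the "3×" style. The function name `render_query_lines`, the input layout, and the use of "-" as filler are assumptions made for illustration only; the actual screen is graphical.

```python
def render_query_lines(start, end, queries):
    """Render query sequences for the old-genome range [start, end).

    queries -- list of (alleles, read_count) pairs, where alleles maps a
               mutation position on the old genome to the query allele
    """
    lines = []
    for alleles, read_count in queries:
        row = ["-"] * (end - start)          # line parallel to the genome axis
        for pos, allele in alleles.items():
            if start <= pos < end:           # only positions inside the view
                row[pos - start] = allele
        lines.append("".join(row) + f"  {read_count}×")
    return lines


if __name__ == "__main__":
    view = render_query_lines(
        10, 20,
        [({12: "A", 17: "T"}, 3), ({12: "G", 17: "T"}, 1)],
    )
    print("\n".join(view))
```

Shifting `start` and `end` corresponds to what the direction buttons 1707 do on the screen: mutations outside the selected range simply do not appear on the line.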

(5) Other Embodiments
The embodiments above are preferred examples and are not intended to limit the invention to a specific configuration. Various modifications are possible without departing from the gist of the above description. For example, an embodiment need not include all of the configurations described above. Part of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to that of a given embodiment. Furthermore, for part of the configuration of each embodiment, part of the configuration of another embodiment can be added, deleted, or substituted.

Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or wholly in hardware, for example by designing them as integrated circuits. They may also be realized in software, with a processor interpreting and executing programs that implement the respective functions. Information such as programs, tables, and files that implement each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a storage medium such as an IC card, an SD card, or a DVD. The control lines and information lines shown are those considered necessary for the explanation and do not necessarily represent all the control lines and information lines required in a product. In practice, almost all of the configurations may be considered to be interconnected.

100: DNA sequence analysis device
101: CPU
102: Main storage device
103: Auxiliary storage device
104: Removable media
105: Network
106: User interface unit
107: Read sequences and the results of mapping them to the old genome sequence
108: Old genome sequence
109: New genome index
110: Query table
111: New genome index construction unit
112: Query table update unit
113: Mapping destination search unit
114: New genome sequence
115: DNA sequence analysis system
401: Mutation in a read sequence
1101: Query allele stored in the query table
1102: Length of the query sequence corresponding to each query allele
1103: Range in the new genome index corresponding to each query allele
1104: Allele at each mutation position corresponding to each query allele
1105: List of the alleles found in the read sequences at each mutation position
1701: Display area showing the contents of the query table
1702: Straight line representing the old genome sequence
1703: Straight line representing a query sequence
1704: Position of each mutation in the query table
1705: Allele of the query sequence at that position
1706: Number of read sequences that match each query sequence
1707: Direction buttons for changing the displayed range

Claims

A DNA sequence analysis device comprising:
a storage device that stores a first genome sequence called an old genome sequence, a second genome sequence called a new genome sequence, a large number of DNA sequences called read sequences, and a result of mapping the read sequences to the old genome sequence;
a new genome index construction unit that constructs an index for searching for an arbitrary character string present in the new genome sequence;
a query table update unit that constructs, based on the result of the mapping, a data structure called a query table that records combinations of mutations in the read sequences, the mutations being mutations relative to the old genome sequence that are located near a processing position on the old genome sequence; and
a mapping destination search unit that compares a sequence constructed by applying a combination of mutations stored in the query table to the old genome sequence with the new genome sequence, and exhaustively outputs portions of the constructed sequence that exactly match the new genome sequence over K or more bases (K being a natural number of 1 or more).

The DNA sequence analysis device according to claim 1,
wherein the new genome index construction unit constructs, as the index,
an FM-index, and
a bit vector that, when the suffixes corresponding to adjacent ranks in the suffix array are compared, holds 1 at positions where the length of their longest matching prefix exceeds K and 0 at all other positions.

The DNA sequence analysis device according to claim 1,
wherein the query table update unit stores in the query table those combinations of alleles of mutations lying within L bases (L being a natural number of 1 or more) that are present in both the read sequences and the new genome sequence.

The DNA sequence analysis device according to claim 2,
wherein the mapping destination search unit uses the bit vector to search for positions on the new genome sequence at which a sequence obtained by reflecting the alleles of the read sequences in the old genome sequence exactly matches the new genome sequence over at least K bases.

A DNA sequence analysis system comprising:
a storage device that stores a first genome sequence called an old genome sequence, a second genome sequence called a new genome sequence, a large number of DNA sequences called read sequences, and a result of mapping the read sequences to the old genome sequence;
a new genome index construction unit that constructs an index for searching for an arbitrary character string present in the new genome sequence;
a query table update unit that constructs, based on the result of the mapping, a data structure called a query table that records combinations of mutations in the read sequences, the mutations being mutations relative to the old genome sequence that are located near a processing position on the old genome sequence;
a mapping destination search unit that compares a sequence constructed by applying a combination of mutations stored in the query table to the old genome sequence with the new genome sequence, and exhaustively outputs portions of the constructed sequence that exactly match the new genome sequence over K or more bases (K being a natural number of 1 or more); and
a user interface unit that displays a confirmation screen including a first area that displays the content and position of each mutation appearing in the read sequences in association with each other, and a second area that displays each combination of mutations, the length of the corresponding query sequence on the new genome sequence, and the range in which the corresponding query sequence is present in the index, in association with one another.

The DNA sequence analysis system according to claim 5,
wherein the user interface unit displays one or more straight lines corresponding to query sequences in parallel with a first axis corresponding to the old genome sequence, and displays on each straight line, for each mutation position shown in the query table, the content of the mutation in the query sequence.

A DNA sequence analysis method in which a DNA sequence analysis device executes:
a process of reading from a storage device, according to the processing being performed, part or all of a first genome sequence called an old genome sequence, a second genome sequence called a new genome sequence, a large number of DNA sequences called read sequences, and a result of mapping the read sequences to the old genome sequence;
a process of constructing an index for searching for an arbitrary character string present in the new genome sequence;
a process of constructing, based on the result of the mapping, a data structure called a query table that records combinations of mutations in the read sequences, the mutations being mutations relative to the old genome sequence that are located near a processing position on the old genome sequence; and
a process of comparing a sequence constructed by applying a combination of mutations stored in the query table to the old genome sequence with the new genome sequence, and exhaustively outputting portions of the constructed sequence that exactly match the new genome sequence over K or more bases (K being a natural number of 1 or more).

The DNA sequence analysis method according to claim 7,
wherein the DNA sequence analysis device further executes
a process of displaying a confirmation screen including a first area that displays the content and position of each mutation appearing in the read sequences in association with each other, and a second area that displays each combination of mutations, the length of the corresponding query sequence on the new genome sequence, and the range in which the corresponding query sequence is present in the index, in association with one another.

The DNA sequence analysis method according to claim 8,
wherein the DNA sequence analysis device further executes
a process of displaying one or more straight lines corresponding to query sequences in parallel with a first axis corresponding to the old genome sequence, and displaying on each straight line, for each mutation position shown in the query table, the content of the mutation in the query sequence.