JP2005218421A

JP2005218421A - Method for producing contig from dna fragmented sequential data by total genom shot gun method and recording medium

Info

Publication number: JP2005218421A
Application number: JP2004058153A
Authority: JP
Inventors: Masahiro Kasahara; 雅弘笠原; Shin Sasaki; 伸佐々木; Yukimasa Nagayasu; 佑希允永安
Original assignee: Individual
Current assignee: Individual
Priority date: 2004-02-03
Filing date: 2004-02-03
Publication date: 2005-08-18

Abstract

PROBLEM TO BE SOLVED: To produce practical contigs from fragment sequence data obtained by the whole-genome shotgun method, when sequencing a genome that contains repetitive sequences, by distinguishing correct merges, incorrect merges, and merges that inherently cannot be judged correct or incorrect, without using data on already known repetitive sequences.

SOLUTION: The method examines whether each fragment sequence is approximately contained in another fragment sequence. For each fragment sequence x that is not contained in any other fragment sequence, and for its complementary sequence, the method determines the fragment sequence (or part of one) that extends x in the 3' direction. The choice is made among the other fragment sequences, or their complementary sequences, that can extend x in the 3' direction when merged, based on a score calculated from the partial sequence each shares with x.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to the field of genome sequencing methods, and more particularly to a method and a recording medium for creating contigs from a data set containing DNA fragment sequences obtained by the whole-genome shotgun method.

The whole-genome shotgun method is widely used for genome sequencing. In this method, the genome is first physically broken into small fragments, the fragments are cloned at random, and the base sequence of each clone is determined. The fragment sequences obtained in this way contain base sequences from random positions in the genome. Next, sequences shared between fragment sequences are detected, fragment sequences that share a sequence are merged to form contigs, and the original genome sequence, or part of it, is reconstructed.

However, a genome sequence may contain many repetitive sequences (repeats), and contradictions can arise if fragment sequences that share a common sequence are merged indiscriminately. A method of merging fragment sequences that does not produce contradictions even when repeats are present in the genome was therefore needed.

In Non-Patent Document 1, Myers et al. propose a technique that uses a database of known repeats to mask the repeats in fragment sequences that are homologous to known repeats, and that avoids erroneous merges by not aligning the masked regions. They combine this with a method that detects the boundaries between repeat and non-repeat sequence (repeat boundaries) and prohibits merging fragment sequences across such a boundary, so that fragment sequences are merged without contradiction.

In Non-Patent Document 2, the group of Batzoglou et al. devised a method that, together with detection of repeat boundaries, preferentially merges mate pairs that overlap as pairs, thereby reducing erroneous merges caused by repeats.

In Non-Patent Document 3, the group of Wang et al. propose a method that counts the frequency of each k-mer (a base sequence of k characters) appearing in the fragment sequences and regards k-mers occurring at or above a certain frequency as derived from repeats. These k-mers are masked in the fragment sequences, and no alignment score is given to the masked regions, which reduces erroneous merges.

Non-Patent Document 1: Eugene W. Myers, Granger G. Sutton et al., A Whole Genome Assembly of Drosophila, Science (2000) (287), pp. 2196-2204
Non-Patent Document 2: Serafim Batzoglou et al., ARACHNE: A Whole-Genome Shotgun Assembler, Genome Research, January 2002 issue (1), pp. 177-189
Non-Patent Document 3: Jun Wang et al., A Sequence Assembler That Masks Exact Repeats Identified from the Shotgun Data, Genome Research, March 2002 issue (5), pp. 824-831

However, methods that mask repeats cannot be applied unless the repeats are already known. For an organism whose genome is being sequenced for the first time, the repeats present in that organism are unknown, and often no database of known repeats exists. Even for species with a maintained repeat database, the database is built from the results of sequencing only part of the genome, so it usually contains only a portion of the repeats rather than all of them, and repeats missing from the database can cause erroneous fragment merges.

Methods that mask repeats using the frequency of k-mers in the fragment sequences do not require a repeat database. However, as described in Non-Patent Document 3, although repeats that occur very many times in the genome can be masked with high probability, repeats that occur relatively few times are often misjudged, which can cause erroneous fragment merges.

Methods that detect repeat boundaries are also difficult to apply to genomes in which repeats are densely interspersed. Because repeat boundaries appear so frequently, almost no fragment sequences are merged, and the resulting number of contigs is so large as to be impractical. Even for repeats that occur at low frequency in the genome, such methods can suppress merges that are in fact possible and thus increase the number of contigs.

Myers et al. (Non-Patent Document 1) deal with this by first masking the densely interspersed repeats, so that the repeats left unmasked occur only at low frequency, and then applying repeat boundary detection. This approach, however, requires a repeat database.

The group of Batzoglou et al. (Non-Patent Document 2) reduces the influence of high-frequency repeats by preferentially merging mate pairs that overlap as pairs. However, as stated in Non-Patent Document 2, an erroneous merge may occur when either read of a mate pair contains no non-repeat sequence, so the effect is considered limited.

Higher organisms often have genomes containing high-frequency interspersed repeats. When sequencing the genomes of such organisms, it has therefore been desirable to create contigs without erroneous merges, without requiring a database of known repeats, even when densely interspersed repeats are present in the genome.

The problem is solved by the following method. Starting from a data set containing base sequence information of random DNA fragments obtained by the whole-genome shotgun method, each fragment sequence is examined to see whether it is approximately contained in another fragment sequence. For each fragment sequence x that is not contained in any other fragment sequence, and for its complementary sequence, the fragment sequence (or part of one) that extends x in the 3' direction is determined from among the other fragment sequences, or their complementary sequences, that can extend x in the 3' direction when merged. The determination is based on an alignment score calculated from the partial sequence shared with x.

More preferably, a score proportional to the length of the homologous portion is used in place of the alignment score.

More preferably, for each fragment sequence x, the scores are computed in order starting from the fragment sequences with the longest potential shared sequence length among those that extend x in the 3' direction, which improves processing speed.

More preferably, a computer-readable recording medium is used on which data is recorded that includes a hash table indexed jointly by a contiguous or non-contiguous partial base sequence appearing in a fragment sequence and by the position of that partial base sequence from the beginning of the fragment sequence, and that associates this index with a set of fragment sequences. This reduces the time needed to find the fragment sequences having a required potential shared sequence length.

Regardless of whether a database of known repeats exists, how many times the repeats occur in the genome, and how they are distributed in the genome, the present invention can classify merges into three types: a) erroneous merges caused by repeats, b) merges that are not erroneous, and c) merges that inherently cannot be determined either way from the fragment sequence information. This makes it possible to generate contigs with few errors.

An embodiment of the present invention is described below with reference to the accompanying drawings.

[Definition]
A "read" is a short sequence of DNA sequence information derived from a random position in the genome. A read sequence is oriented from its 5' end to its 3' end, and the 5' side is its beginning.

A "complementary read", or two reads being "complementary" to each other, means that the two reads are related as the two strands of a DNA double helix.

Every read used in this embodiment is given a unique "read number". It is desirable that complementary reads can each be uniquely identified in some way. As one such way, this embodiment assigns read numbers so that two reads are complementary to each other when their read numbers differ by the number of input reads.
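The numbering convention can be sketched as follows. This is only an illustrative sketch: the zero-based numbering and the helper names `reverse_complement`, `number_reads`, and `complement_id` are assumptions, not part of the source.

```python
# Hypothetical helpers illustrating the read-numbering convention:
# input read i gets number i, its complement gets number i + n.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def number_reads(reads: list[str]) -> list[str]:
    """Index i holds input read i; index i + n holds its complement."""
    return list(reads) + [reverse_complement(r) for r in reads]

def complement_id(read_id: int, n_input: int) -> int:
    """Read numbers of complementary reads differ by the input read count."""
    return read_id + n_input if read_id < n_input else read_id - n_input

all_reads = number_reads(["AACG", "TTT"])
print(all_reads[complement_id(0, 2)])
```

With this layout the complement of any read is found by a single addition or subtraction, with no lookup table.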

A contiguous or non-contiguous partial base sequence within a base sequence is called a "seed".

Two reads "overlap" when the two read sequences have subsequences that align with each other and the aligned subsequences terminate at an end of one of the reads. The aligned subsequences need not match exactly; whether they are considered aligned is decided by a dissimilarity parameter supplied for each process.

Let x and y be two reads. When x and y overlap, the following five states are possible.

As shown in FIG. 2, the 5' end of y lies inside (on the 3' side of) the 5' end of x, and the 3' end of y lies outside (on the 3' side of) the 3' end of x. In this state we say that "the overlap target of x (in the 3' direction) is y".

As shown in FIG. 3, the 5' end of y lies outside (on the 5' side of) the 5' end of x, and the 3' end of y lies inside (on the 5' side of) the 3' end of x. Viewed from the complementary reads x^c and y^c of x and y, this is the state in which the 5' end of y^c lies inside (on the 3' side of) the 5' end of x^c and the 3' end of y^c lies outside (on the 3' side of) the 3' end of x^c, so we say that "the overlap target of x^c is y^c".

As shown in FIG. 4, the 5' end of y lies inside (on the 3' side of) the 5' end of x, and the 3' end of y lies inside (on the 5' side of) the 3' end of x. The case in which only the 5' ends or only the 3' ends of the two reads coincide is also included in this state. In this state we say that "x contains y" or "y is contained in x", and we call x the "parent read" of y and y the "child read" of x. That is, a parent read contains its child reads.

As shown in FIG. 5, the 5' end of y lies outside (on the 5' side of) the 5' end of x, and the 3' end of y lies outside (on the 3' side of) the 3' end of x. The case in which only the 5' ends or only the 3' ends of the two reads coincide is also included in this state. In this state we say that "y contains x" or "x is contained in y", and we call y the "parent read" of x and x the "child read" of y.

As shown in FIG. 6, the 5' end of x coincides with the 5' end of y, and the 3' end of x also coincides with the 3' end of y. This state is also treated as a kind of "containment" as described above; the read with the smaller read number is the parent read and the read with the larger read number is the child read.

These containment relationships generally hold equally for the complementary reads: if "x contains y", then "x^c also contains y^c". Depending on how read numbers are assigned to complementary reads, this may fail when the reads coincide as in FIG. 6, but the numbering used in this embodiment is chosen so that the relationship holds.
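The five states can be told apart from the end coordinates alone. The sketch below is a simplification under stated assumptions: the alignment is taken to be ungapped, the two reads are assumed to overlap, and the function name `classify` and the coordinate convention (x occupies positions 0 to `x_len`, y starts at `y_start`) are hypothetical.

```python
def classify(x_id: int, x_len: int, y_id: int, y_start: int, y_len: int) -> str:
    """Classify how read y lies relative to read x.

    Coordinates are in x's frame: x spans [0, x_len), y spans
    [y_start, y_start + y_len). Assumes an ungapped alignment and
    that the two reads do overlap.
    """
    y_end = y_start + y_len
    if y_start == 0 and y_end == x_len:
        # FIG. 6: both ends coincide; the smaller read number is the parent.
        return "x contains y" if x_id < y_id else "y contains x"
    if y_start >= 0 and y_end <= x_len:
        return "x contains y"                      # FIG. 4
    if y_start <= 0 and y_end >= x_len:
        return "y contains x"                      # FIG. 5
    if y_start > 0:
        return "overlap target of x is y"          # FIG. 2
    return "overlap target of x^c is y^c"          # FIG. 3

print(classify(0, 100, 1, 40, 100))
```

The order of the tests matters: the coinciding case is checked first so that the tie is broken by read number before the two general containment cases are considered.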

[Building the hash table]
A read number is assigned to each read obtained by the whole-genome shotgun method. For each read, a hash table is built that is indexed jointly by a seed sequence in the read and the position of that seed counted from the beginning of the read, and that returns the set of read numbers of the reads holding that partial base sequence at that position. A simplified diagram for explanation is shown in FIG. 7. Positions are counted from the beginning (5' side) of the read. The seed length is preferably 6 to 14 bases. More preferably, the hash table is implemented using an array structure.
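A minimal sketch of such an index follows. It assumes contiguous seeds only and uses a Python dictionary in place of the array structure the text prefers; the name `build_seed_index` and the default `k=10` are illustrative choices, not taken from the source.

```python
from collections import defaultdict

def build_seed_index(reads: list[str], k: int = 10) -> dict:
    """Map (seed, position) to the set of read numbers that hold
    that contiguous k-base seed at that 0-based position."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[(seq[pos:pos + k], pos)].add(read_id)
    return index

index = build_seed_index(["ACGTACGTAC", "CGTACGTACG"], k=4)
print(sorted(index[("CGTA", 1)]))
```

Keying on the pair rather than on the seed alone is what lets a later lookup ask for reads that hold a seed at one specific offset.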

[Exclusion of high-frequency repeats]
For sequences that are expected to occur universally at high frequency in the genome, read numbers are not added to the hash table. In one embodiment, read numbers are excluded from the hash table for seeds that are simple repeats built from units of 1, 2, or 3 bases, such as "AAAAAAAAAA" or "ATATATATAT". Other implementations are also conceivable, such as actually measuring how often each partial base sequence appears in the fragment sequences and excluding read numbers from the hash table for partial base sequences that appear at or above a certain frequency.
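One way to recognise such simple-repeat seeds is to test whether the seed is periodic with a period of 1, 2, or 3 bases. This is a sketch under that assumption; the function name `is_simple_repeat` is hypothetical.

```python
def is_simple_repeat(seed: str, max_unit: int = 3) -> bool:
    """True if the seed is a repetition of a unit of 1..max_unit bases."""
    for unit in range(1, max_unit + 1):
        # The seed has period `unit` if every base equals the base
        # `unit` positions before it.
        if len(seed) >= 2 * unit and all(
            seed[i] == seed[i - unit] for i in range(unit, len(seed))
        ):
            return True
    return False

for s in ["AAAAAAAAAA", "ATATATATAT", "ACGTACGTAC"]:
    print(s, is_simple_repeat(s))
```

A seed flagged by this test would simply be skipped when read numbers are added to the hash table.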

First, one read to be examined is selected and called a. This selection is iterated over all reads. A variable w is initialized to 1, and the hash table is searched for candidate reads that are shifted by w bases in the 3' direction relative to a and can therefore be overlap targets of a. Specifically, for each seed located at or after base w+1 of a, the set of read numbers is retrieved for reads in which that seed appears at the position obtained by moving w bases in the 5' direction from its position in a.
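The lookup for a given shift w can be sketched as below. The index here is a plain dictionary keyed by (seed, 0-based position), contiguous seeds are assumed, and the names `build_index` and `candidates_at_shift` are hypothetical.

```python
def build_index(reads: list[str], k: int) -> dict:
    """(seed, 0-based position) -> set of read numbers."""
    index = {}
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index.setdefault((seq[pos:pos + k], pos), set()).add(read_id)
    return index

def candidates_at_shift(index: dict, a_id: int, a_seq: str, w: int, k: int) -> set:
    """Reads that hold a seed of `a` at a position w bases closer to
    their own 5' end, i.e. reads that may start w bases into `a`."""
    found = set()
    # 0-based position w is the (w+1)-th base of a.
    for pos in range(w, len(a_seq) - k + 1):
        found |= index.get((a_seq[pos:pos + k], pos - w), set())
    found.discard(a_id)
    return found

reads = ["AAACCCGGGTTT", "CCCGGGTTTACG"]
index = build_index(reads, k=4)
print(candidates_at_shift(index, 0, reads[0], w=3, k=4))
```

In this toy input the second read starts three bases into the first, so it is returned for w=3 and not for w=1.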

[Refinement of the alignment around seeds]
Let b be the read indicated by each read number obtained. As shown in FIG. 8, the alignment between a and b is refined in both the 5' and 3' directions, using the group of seeds shared between a and b as the core. This step is preferably performed by dynamic programming such as the Smith-Waterman method or its extensions, but it may also be performed by sparse dynamic programming or the like.

[Narrowing the processing with multiple seeds]
A simple way to carry out the above step is to use a single seed as the core of the alignment. With that approach, if the seed chosen as the core lies in a repeat region of the genome, processing is also performed on pairs of reads that do not actually have a significant alignment. To avoid this and make processing more efficient, several seeds can be grouped together instead of using a single seed. One way of grouping them relies on the observation that the difference between the positions at which a seed appears in the two reads approximates the shift of their overlap. As shown in FIG. 9, the seeds can therefore be grouped by the difference between their positions in the two reads.
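Grouping seeds by position difference can be sketched as a vote count per offset. This is an illustrative simplification with contiguous, exactly matching seeds; the function name `seed_offsets` is hypothetical.

```python
from collections import Counter

def seed_offsets(a_seq: str, b_seq: str, k: int) -> Counter:
    """Count shared k-base seeds per offset (position in a minus
    position in b). A well-supported offset approximates the shift
    of the overlap between the two reads."""
    positions_in_b = {}
    for j in range(len(b_seq) - k + 1):
        positions_in_b.setdefault(b_seq[j:j + k], []).append(j)
    votes = Counter()
    for i in range(len(a_seq) - k + 1):
        for j in positions_in_b.get(a_seq[i:i + k], []):
            votes[i - j] += 1
    return votes

votes = seed_offsets("AAACCCGGGTTT", "CCCGGGTTTACG", k=4)
print(votes.most_common(1))
```

A pair of reads supported by only one isolated seed, as can happen inside a repeat, receives a single vote and can be skipped before the more expensive alignment refinement.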

Because of how b is chosen in the steps above, in most cases the relationship between a and b is either "the overlap target of a is b" or "a contains b", but the two opposite relationships can also occur.

[Evaluation of overlap targets]
Suppose that the overlap target (in the 3' direction) of a read x is a read y. For a single x there may be several such reads y, or there may be none. A measure of "goodness as an overlap target" is defined and computed for each of these overlap targets. Possible measures include the shared sequence length of the overlap between x and y and the alignment score of x and y.

[Selection of the best overlap target]
When the above steps yield an overlap target b for a read a and b is the only overlap target of a found so far, b becomes the current best overlap target of a. If a best overlap target c already exists when b is obtained, the overlap between a and b is compared with the overlap between a and c according to the above measure, and the better of the two becomes the current best overlap target of a.

[Exclusion of contained reads]
When the alignment refinement shows that "read a contains read b" or "read b contains read a", information recording this is attached to the contained read. In subsequent steps no overlap target is computed for a contained read, and a contained read is not directly adopted as an overlap target.

[Potential shared sequence length]
As FIG. 10 shows, the above method uses the hash table to obtain the read numbers of reads that are expected, approximately, to have an alignment shifted by about w bases from the beginning of read a. Such an alignment is expected to have a shared sequence of length roughly "sequence length of a − w". This is called the "potential shared sequence length" of the two reads being aligned.

[Estimating the quality of the overlaps still obtainable]
Because w starts at 1, reads are processed in descending order of their potential shared sequence length with a. It can therefore be determined whether continuing the processing could still yield an overlap target better than the current best one. For example, when the shared sequence length of the alignment is used as the measure for comparing overlaps, the potential shared sequence length can be regarded as an approximation of the exact shared sequence length of the alignment. As processing continues, the potential shared sequence length between a and the reads being processed decreases monotonically, so at some point it can be concluded that no better overlap target than the current best can be obtained by continuing. Specifically, it is judged that no better overlap target exists when the value

potential shared sequence length × (1 + error tolerance rate) × m

falls below the score of the current best overlap target, where m is the alignment match score, or 1 when the shared sequence length is adopted as the measure. The error tolerance rate is given in advance as an estimate of the upper limit of the proportion of polymorphism expected in the target genome.

[End of the iteration]
When it is judged that continuing the processing would not yield an overlap target better than the current best one, the processing loop over w is terminated. This avoids unnecessary computation.

When no overlap target has been obtained yet, or when it is judged that continuing the processing may yield a better overlap target, w is increased and the above steps are repeated. The increment of w is preferably the length of the seeds used when building the hash table, but the increment may also be changed according to the alignment of the best overlap obtained so far.
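The loop over w with its stopping rule can be sketched as follows. This is a deliberately simplified sketch: candidates are found by brute force instead of the hash table, overlaps are scored by exact ungapped shared length instead of a refined alignment, and the names `ungapped_overlap` and `best_overlap_target` and the default `error_tolerance=0.05` are assumptions. The toy call uses `step=1` so that this exact-match lookup visits every shift, whereas the text prefers an increment equal to the seed length.

```python
def ungapped_overlap(a_seq: str, b_seq: str, w: int):
    """Shared length if b starts exactly w bases into a and runs past
    a's 3' end with no mismatches; otherwise None."""
    overlap = len(a_seq) - w
    if overlap <= 0 or len(b_seq) <= overlap:
        return None
    return overlap if a_seq[w:] == b_seq[:overlap] else None

def best_overlap_target(a_id, reads, error_tolerance=0.05,
                        match_score=1.0, step=1):
    a_seq = reads[a_id]
    best_id, best_score = None, 0.0
    w = 1
    while w < len(a_seq):
        # Upper bound on any score still obtainable at this shift.
        potential = len(a_seq) - w
        bound = potential * (1 + error_tolerance) * match_score
        if best_id is not None and bound < best_score:
            break  # no better overlap target can follow
        for b_id, b_seq in enumerate(reads):
            if b_id == a_id:
                continue
            length = ungapped_overlap(a_seq, b_seq, w)
            if length is not None and length * match_score > best_score:
                best_id, best_score = b_id, length * match_score
        w += step
    return best_id, best_score

reads = ["AAACCCGGGTTT", "CCCGGGTTTACG", "GGGTTTACGAAA"]
print(best_overlap_target(0, reads))
```

Because the potential shared sequence length only shrinks as w grows, the loop can stop as soon as the bound drops below the best score found, without examining the remaining shifts.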

If read a itself is determined to be a "contained read", there is no need to obtain its overlap target. The steps for a are ended immediately, and the steps are repeated with the next sequence to be examined as the new a.

[True parent reads and true child reads]
A read that has not been determined to be a "contained read" by the above steps is a read that has never been determined to be a "child read". Such a read is called a "true parent read". A read that has been determined to be a "child read" at least once is called a "true child read".

[Computation of containment relationships]
Next, for each read determined to be a true child read by the above steps, the true parent reads that contain it are computed.

This step performs almost the same processing as the step that searches for overlap targets, except that only true child reads are examined, and only true parent reads are taken as alignment partners from the set of read numbers obtained by searching the hash table.

In this step, the data structure representing the relationships between true parent reads and true child reads is computed by the same processing as in the preceding steps. However, the iteration does not end when one "true parent read that contains this read" is found; processing continues until all such true parent reads have been found.

As a result of this step, each true child read c holds information on one or more "true parent reads of c", and each true parent read p holds either the information that it "has no true child reads" or information on "one or more true child reads".

In this way, for each read that is a true parent read, its overlap target among the true parent reads is determined, if one exists.

[Handling overlaps onto true child reads]
As shown in FIG. 11, a true parent read may have no true parent read as an overlap target while having a true child read as an overlap target. In that case, as shown in FIG. 12, the parameters of the alignment refinement are relaxed, and overlaps are computed with true parent reads that align to the examined sequence starting partway along their own length. Among such alignments, the true parent read with the best overlap is chosen, and the portion of it that is shared with the examined read is taken as the overlap target sequence.

Through the above processing, one or zero overlap target sequences in the 3' direction are determined, as in FIG. 12, for every true parent read and its complementary sequence. The result can be viewed as a directed graph in which each sequence is a vertex and each relationship indicating an overlap target in the 3' direction is an edge.

[Integration of the graph over complementary sequences]
As shown in FIG. 1, integrating the information of each sequence with that of its complementary sequence yields a graph showing the overlap relationships of each base sequence in both the 3' and 5' directions. In this integration, when the overlap target in the 3' direction of a sequence j is a sequence k, and the overlap target in the 3' direction of the complementary sequence of k is the complementary sequence of j, the edges connecting j and k are merged into one.
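The merging of an edge with its mirror image on the complementary strand can be sketched by giving both a common canonical form. The sketch assumes zero-based read numbers in which read i and read i + n are complementary; the function name `integrate_edges` and the dictionary input format are hypothetical.

```python
def integrate_edges(successor: dict, n_input: int) -> set:
    """Merge each 3' overlap edge j -> k with its mirror image
    comp(k) -> comp(j), keeping one canonical copy of each pair."""
    def comp(i: int) -> int:
        return i + n_input if i < n_input else i - n_input

    edges = set()
    for j, k in successor.items():
        if k is None:
            continue  # j has no 3' overlap target
        # Both orientations map to the same canonical tuple.
        edges.add(min((j, k), (comp(k), comp(j))))
    return edges

# With n = 3: read 4 is the complement of 1, read 3 the complement of 0,
# so 4 -> 3 is the mirror image of 0 -> 1.
print(sorted(integrate_edges({0: 1, 4: 3, 1: 2, 2: None}, n_input=3)))
```

An edge that is found from only one strand is still kept; it simply has no mirror image to merge with.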

[Calculation of the number of references]
For the graph constructed by the above processing, the number of edges held by each vertex (each vertex corresponds to a base sequence) is counted separately for the 3' direction and the 5' direction. A vertex that has two or more edges in at least one of the two directions is called a "branch point".
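As a minimal sketch of this counting step, assuming each edge is stored as a pair of read ends of the form `(read, "3'")` or `(read, "5'")` (a hypothetical representation chosen for illustration), the edge counts and the branch points might be computed like this. The sample data loosely follows the long-repeat situation of FIG. 17, with a repeat read R entered from A and C and left towards B and D.

```python
from collections import Counter


def edge_counts(edges):
    """Count, for every read end, how many edges are attached to it."""
    counts = Counter()
    for edge in edges:
        for end in edge:
            counts[end] += 1
    return counts


def branch_points(edges):
    """Reads with two or more edges on at least one of their two ends."""
    return {read for (read, _side), n in edge_counts(edges).items() if n >= 2}


# Hypothetical example data: R is shared by two paths.
example_edges = [
    (("A", "3'"), ("R", "5'")),
    (("C", "3'"), ("R", "5'")),
    (("R", "3'"), ("B", "5'")),
    (("R", "3'"), ("D", "5'")),
]
counts = edge_counts(example_edges)
branches = branch_points(example_edges)
```

Here only R has two edges on one side, so it is the single branch point.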

[Merge examination in the 3' direction]
As shown in FIG. 13, in the integrated graph, an arbitrary vertex that has not yet been visited by the following steps and is not a branch point is chosen as a starting point. The edges in its 3' direction are followed, and the run of vertices that are not branch points is taken as one contiguous set.

[Merge examination in the 5' direction]
As shown in FIG. 14, the same processing is also performed in the 5' direction. Whether a branch point is merged into the set being built depends on the number of edges it has in its 5' and 3' directions: if the traversal reaches the branch point from a direction in which the branch point has two or more edges, the branch point is not included in the set.
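The two traversals can be illustrated with the sketch below. It is written under several assumptions that are not in the specification: the graph is given as two adjacency dictionaries `next3` and `prev5` (hypothetical names) listing the neighbours on each side of every read, all reads are on the same strand, and the same branch-point rule is applied in both directions.

```python
def walk(start, forward, backward, visited):
    """Follow single edges away from `start` on the `forward` side."""
    chain, node = [], start
    while len(forward[node]) == 1:
        nxt = forward[node][0]
        # Stop if already visited, or if the next read is entered through
        # a side that carries two or more edges (branch point excluded).
        if nxt in visited or len(backward[nxt]) >= 2:
            break
        visited.add(nxt)
        chain.append(nxt)
        node = nxt
    return chain


def unbranched_sets(next3, prev5):
    """Group reads into runs that contain no branching."""
    visited, result = set(), []
    for start in sorted(next3):
        is_branch = len(next3[start]) >= 2 or len(prev5[start]) >= 2
        if start in visited or is_branch:
            continue
        visited.add(start)
        right = walk(start, next3, prev5, visited)  # 3' direction
        left = walk(start, prev5, next3, visited)   # 5' direction
        result.append(left[::-1] + [start] + right)
    return result


# Hypothetical graph: X-A and C enter repeat read R; R leaves to B-Y and D.
next3 = {"X": ["A"], "A": ["R"], "C": ["R"], "R": ["B", "D"],
         "B": ["Y"], "D": [], "Y": []}
prev5 = {"X": [], "A": ["X"], "C": [], "R": ["A", "C"],
         "B": ["R"], "D": ["R"], "Y": ["B"]}
sets = unbranched_sets(next3, prev5)
```

In this example R has two edges on both sides, so it is left out of every set and the reads on either side of it stay in separate sets.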

[Restoration of true child reads]
Through the above steps, the true parent reads, or subsequences of them, are divided into sets, one for each unbranched part of the graph. For a short repeat sequence, a graph like that of FIG. 16 is generated, and the two groups of reads (upper and lower in the figure) are never mixed. For a long repeat sequence, a graph like that of FIG. 17 is generated, and the two groups of reads are connected through branch points. Where a branch point occurs, merging cannot be performed at that position. In the example of FIG. 17, for instance, the fragment sequences obtained by the shotgun method could be identical even if sequence A and sequence C were exchanged, so an ambiguity arises. Each set of sequences obtained by the above steps forms a chain linked by graph edges, that is, by overlap relations. A read that was judged in the earlier processing to be a true child read is added to such a set, as shown in FIG. 15, only if it has a single true parent read, or if all of its true parent reads belong to the same set and are adjacent to one another in it.
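The condition for adding a true child read back to a set might be checked as in the following sketch. The names `restore_children` and `parents_of` are hypothetical, "adjacent" is interpreted here as occupying consecutive positions in the set, and a child whose parent was left out of every set (for example a branch point) is simply skipped; these are assumptions for illustration only.

```python
def restore_children(sets, parents_of):
    """Return, for each set, the child reads that can be added to it."""
    where = {read: (s, i)
             for s, members in enumerate(sets)
             for i, read in enumerate(members)}
    children = [[] for _ in sets]
    for child, parents in sorted(parents_of.items()):
        # Every parent must have been placed in some set.
        if not parents or any(p not in where for p in parents):
            continue
        set_ids = {where[p][0] for p in parents}
        positions = sorted(where[p][1] for p in parents)
        first = positions[0]
        adjacent = positions == list(range(first, first + len(positions)))
        # A single parent trivially satisfies both conditions.
        if len(set_ids) == 1 and adjacent:
            children[set_ids.pop()].append(child)
    return children


example_sets = [["X", "A"], ["B", "Y"]]
parents_of = {
    "c1": ["X"],        # single parent: restored
    "c2": ["X", "A"],   # parents adjacent in one set: restored
    "c3": ["A", "B"],   # parents in different sets: not restored
    "c4": ["R"],        # parent is in no set: not restored
}
restored = restore_children(example_sets, parents_of)
```

Only c1 and c2 are restored, both into the first set.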

The sequences in each set obtained by the above steps can be ordered from the 5' end to the 3' end using the positional information they carry, and a contig sequence is generated by performing a multiple alignment of them.

FIG. 1 is a diagram showing a graph structure that represents the branching of reads.
FIG. 2 is a diagram showing a state in which read x overlaps onto read y.
FIG. 3 is a diagram showing a state in which read y overlaps onto read x.
FIG. 4 is a diagram showing a state in which read x contains read y.
FIG. 5 is a diagram showing a state in which read y contains read x.
FIG. 6 is a diagram showing a state in which read x is equal to read y.
FIG. 7 is a diagram showing a hash table indexed by seed sequence and position within the sequence.
FIG. 8 is a diagram showing refinement of the alignment of read a and read b using a core.
FIG. 9 is a diagram showing the classification of seed groups.
FIG. 10 is a diagram showing the potential shared sequence length.
FIG. 11 is a diagram showing an overlap onto a true child read and the detection of an overlap onto a true parent read.
FIG. 12 is a diagram showing a graph of the overlap relations among reads.
FIG. 13 is a diagram showing the processing from a starting point in the 3' direction.
FIG. 14 is a diagram showing the processing from a starting point in the 5' direction.
FIG. 15 is a diagram showing the restoration of true child reads.
FIG. 16 shows an example in which the correct overlap target across a short repeat sequence is determined.
FIG. 17 shows an example with a long repeat sequence, where an inherent ambiguity exists.

Claims

A method comprising: (a) examining, for a data set including base sequence information of random DNA fragments obtained by the whole genome shotgun method, whether each fragment sequence is approximately contained in another fragment sequence; and (b) for each fragment sequence x that is not contained in any other fragment sequence, and for its complementary sequence, determining the fragment sequence, or the portion thereof, that extends the fragment sequence x in the 3' direction, from among the other fragment sequences, or their complementary sequences, that can extend the fragment sequence x in the 3' direction when merged, based on an alignment score calculated from the partial sequence shared with the fragment sequence x.

The method according to claim 1, wherein a score proportional to the length of the homologous portion is used instead of the alignment score of (b).

The method according to claim 1 or 2, wherein, for each fragment sequence x, the scores are calculated in descending order of potential shared sequence length among the fragment sequences that extend the fragment sequence x in the 3' direction.

A computer-readable recording medium on which is recorded, together with a set of fragment sequences, a hash table that is indexed by both a contiguous or non-contiguous partial base sequence appearing in a fragment sequence and the position of that partial base sequence from the beginning of the fragment sequence.

A computer-readable recording medium recording a program for causing a computer to execute the method according to claim 1.