JP5863396B2 - DNA sequence decoding system, DNA sequence decoding method and program

Info

Publication number: JP5863396B2
Application number: JP2011242340A
Authority: JP
Inventors: 真希子吉田; 麻子小池; 宏一木村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2011-11-04
Filing date: 2011-11-04
Publication date: 2016-02-16
Anticipated expiration: 2031-11-04
Also published as: JP2013094149A

Description

The present invention relates to a technique for decoding DNA sequences with a massively parallel sequencer, and more particularly to a technique for decoding DNA sequences that are difficult to read.

Massively parallel sequencers in widespread use today (hereinafter also simply “sequencers”) place a large number of clusters of DNA fragments, each amplified from a single molecule, on a substrate and sequence a large volume of DNA in parallel. The sequencing method adds fluorescently labeled nucleotide probes one at a time to each DNA fragment, extending the complementary strand. Several fluorescent dyes are used, so that each DNA base is encoded by a dye color.

In each cycle of the extension reaction, the sequencer excites the fluorescent dyes and acquires an image of the substrate for each fluorescent color. It then measures, for each cluster of DNA fragments, the fluorescence intensity of each color in each cycle. For the DNA sequence of each cluster, the sequence decoding system determines the base at the corresponding position from the fluorescence intensities measured in each cycle.

Ideally, in each cycle of the extension reaction, each cluster of DNA fragments would show intensity in only the one fluorescent color that corresponds to the base at the current position and would not be detected in any other color. In practice, however, noise arises from factors such as extension lagging or running ahead within a cluster and fluorescence crosstalk, so a cluster may be detected in several colors. This degrades the accuracy with which the sequence decoding system reads DNA sequences. Moreover, the influence of noise grows as the extension reaction proceeds. Consequently, when noise is expected to have an effect, the readable sequence length must be limited.

One approach to sequence decoding that accounts for noise is to model extension lag and lead, fluorescence crosstalk, and similar effects parametrically and to estimate the fluorescent color in each cycle (Non-Patent Document 1). However, each noise factor depends in a complicated way on the cycle, the chemical reaction conditions, and so on, and is difficult to model completely. Another approach therefore applies machine learning, such as a support vector machine (SVM): using known DNA sequences, it directly learns the relationship between the four-color fluorescence intensities obtained from the sequencer in each cycle and the correct sequence, and estimates the fluorescent color in each cycle (Non-Patent Document 2).

Non-Patent Document 1: Kao et al., “BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing,” Genome Research, 19, 1884, 2009
Non-Patent Document 2: Kircher et al., “Improved base calling for the Illumina Genome Analyzer using machine learning strategies,” Genome Biology, 10, R83, 2009

In general, the accuracy with which DNA sequences are read is not uniform. It is known, for example, that read accuracy deteriorates markedly for certain sequences. In a sequencer based on chemical extension, the way noise arises is thought to depend strongly on the characteristics of the DNA sequence being read. For example, a tendency to form higher-order structures, as seen in (1) sequences with high GC content, (2) dinucleotide repeat sequences, and (3) palindromic sequences, is thought to strongly affect the chemical reaction during extension.

Conventional sequence decoding systems, however, apply a single model, built using a general reference genome as a control, to all DNA sequences. That is, conventional sequence decoding systems do not sufficiently account for noise properties that depend on differences in the characteristics of DNA sequences.

The present invention has been made in view of the above circumstances and provides a mechanism that can be expected to improve the accuracy of sequence analysis for DNA sequences that are difficult to read (hereinafter “difficult-to-read DNA sequences”).

The present invention decodes difficult-to-read DNA sequences on the basis of their sequence features. More specifically, it provides a technique that executes: a process of classifying difficult-to-read DNA sequences into feature groups according to their sequence features; a process of learning, for each feature group, a criterion for determining the fluorescent color in each cycle using known DNA sequence data; and a process of decoding an unknown experimental DNA sequence by applying the criterion learned for the feature group to which that sequence belongs.

According to the present invention, the decoding accuracy for difficult-to-read DNA sequences can be improved. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.

FIG. 1 is a diagram explaining the overall flow of processing in the sequence decoding system according to the embodiment.
FIG. 2 is a diagram showing a configuration example of the sequence decoding system according to the embodiment.
FIG. 3 is a diagram explaining a method for extracting difficult-to-read DNA sequences.
FIG. 4 is a diagram explaining the concept of feature classification of DNA sequences.
FIG. 5 is a diagram showing an overview of the learning process of the sequence decoding system.
FIG. 6 is a flowchart explaining the procedure of the learning process.
FIG. 7 is a diagram showing a configuration example of the fluorescent color criterion database.
FIG. 8 is a diagram showing a configuration example of the fluorescent color likelihood database.
FIG. 9 is a diagram showing an overview of the sequence estimation process of the sequence decoding system.
FIG. 10 is a flowchart explaining the procedure of the sequence estimation process.

Embodiments of the present invention are described in detail below with reference to the drawings. The present invention is not limited to the embodiments described below and encompasses various modifications. For example, a configuration may be added to an embodiment described below, and part of its configuration may be omitted. Part of the configuration of an embodiment may also be replaced with another configuration.

Some or all of the configurations, functions, processing units, processing means, and the like described below may be implemented in hardware, for example as an integrated circuit. They may also be implemented in software, with a processor interpreting and executing programs that realize the respective functions. Information such as the programs, tables, and files that implement each configuration can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a storage medium such as an IC card, an SD card, or a DVD.

In all the drawings used to describe the embodiments, members having the same function are denoted by the same or related reference numerals, and repeated description of them is omitted. In the following embodiments, the description of identical or similar parts is, as a rule, not repeated unless particularly necessary.

[Flow of overall processing]
FIG. 1 shows the overall flow of processing executed in the sequence decoding system according to the embodiment. The system has three stages: a search stage 110 for difficult-to-read DNA sequences, a learning stage 120 that uses known DNA sequence data 121, and an estimation stage 130 for unknown experimental DNA sequence data 131.

When unknown experimental DNA sequence data 131 is decoded, the DNA sequence is estimated from the fluorescence intensity sequence 132, which holds the fluorescence intensities acquired from the sequencer for every cycle of the extension reaction, and an estimated DNA sequence 134 and its reliability 135 are output as the estimation result 133. For example, if the sequencer supplies, for each cycle, the fluorescence intensities $(I_a, I_b, I_c, I_d)$ of the four colors (a, b, c, d) corresponding to the four bases, the DNA sequence is estimated from the fluorescence intensity sequence 132 for all $n$ cycles, that is, $(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,1}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,2}, \dots, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,n}$.

[Details of system configuration and processing operations]
FIG. 2 shows a configuration example of the sequence decoding system according to the embodiment. The system comprises an input/output device 210, an analysis device 220 that includes a difficult-to-read DNA sequence analysis unit 221, a learning unit 222, and an estimation unit 223, and a storage device 230. In the embodiment, the analysis device 220 implements the functions executed in each stage described below as processing functions of a program running on a computer.

[Search stage]
In the search stage 110 for difficult-to-read DNA sequences (FIG. 1), the difficult-to-read DNA sequence analysis unit 221 collects, as difficult-to-read DNA sequences, regions in which many errors are detected when a variety of real genomes are sequenced, and then classifies these sequences according to the sequence features they share.

FIG. 3 shows an example of a method for identifying difficult-to-read DNA sequences. First, the reads obtained by sequencing are mapped to a reference sequence. Each mapped read is then compared with the reference sequence, and a region is judged to be a difficult-to-read DNA sequence if, within a range of a given length along the read, the bases are not decoded (shown as base N in FIG. 3) or bases that differ from the reference are present at or above a given ratio.
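
The check described for FIG. 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the read is assumed to be already aligned to the reference at equal length, undecoded bases (N) and mismatches are both counted as errors, and a fixed window slides along the read. The function name and the parameters `window` and `min_error_ratio` are hypothetical; the source gives no concrete values.

```python
def find_difficult_regions(read, reference, window=10, min_error_ratio=0.3):
    """Return (start, end) windows of an aligned read judged difficult to read.

    A position counts as an error when the read base is 'N' (undecoded)
    or differs from the reference base. `window` and `min_error_ratio`
    are illustrative values, not taken from the source.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must be aligned to the same length")
    regions = []
    for start in range(len(read) - window + 1):
        end = start + window
        # 'N' never equals a reference base, so one inequality covers both cases.
        errors = sum(1 for r, g in zip(read[start:end], reference[start:end]) if r != g)
        if errors / window >= min_error_ratio:
            regions.append((start, end))
    return regions


regions = find_difficult_regions("ACGTNNGTAC", "ACGTACGTAC", window=4, min_error_ratio=0.5)
```

Overlapping windows are reported separately here; merging them into one contiguous region would be a natural refinement.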

FIG. 4 shows how DNA sequences judged to be difficult to read are classified (grouped) by the features they have. The classification uses information obtained by clustering analysis in which the four-color fluorescence intensity sequence obtained from the sequencer, $(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,1}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,2}, \dots, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,n}$, serves as the feature vector. Information obtained by clustering analysis in a feature space produced by a nonlinear transformation of this feature vector may also be used. In this specification, the sets of DNA sequences after classification are called feature group 1, feature group 2, and so on.
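
As a rough sketch of this grouping step, the per-cycle intensities can be flattened into one feature vector per read and clustered. The source does not name a clustering algorithm; plain k-means is used here purely as an assumption, seeded deterministically with the first `k` vectors. The function names are hypothetical, and the nonlinear transformation mentioned above is not shown.

```python
def flatten(intensities):
    """Concatenate per-cycle (I_a, I_b, I_c, I_d) tuples into one feature vector."""
    return [value for cycle in intensities for value in cycle]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


reads = [
    [(1.0, 0.0, 0.0, 0.0), (0.9, 0.1, 0.0, 0.0)],
    [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.8, 0.2)],
    [(0.9, 0.1, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)],
    [(0.0, 0.1, 0.9, 0.0), (0.0, 0.0, 1.0, 0.0)],
]
labels, centroids = kmeans([flatten(r) for r in reads], k=2)
```

Each resulting label plays the role of a feature group index; the centroids can be kept so that later reads can be assigned to the same groups.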

[Learning stage]
In the learning stage 120 (FIG. 1), the learning unit 222 detects, for the DNA sequences classified into each feature group, the tendency of the fluorescent colors that are frequently called in each cycle, and learns this tendency as a fluorescent color criterion 126 (FIG. 1) specific to that feature group.

FIG. 5 outlines the processing procedure executed by the learning unit 222 in the learning stage 120. As a precondition for learning, known DNA sequence data 121 is prepared for each feature group of difficult-to-read DNA sequences. The known DNA sequence data 121 consists of a four-color fluorescence intensity sequence 124 acquired from the sequencer for each cycle of the extension reaction, $(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,1}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,2}, \dots, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,n}$, and a correct fluorescent color sequence 125 determined from the correct DNA sequence. Using these data, the learning unit 222 performs the following learning process for each feature group. Part of the sequence data belonging to each feature group is used as training data 122, and the remainder is used as fluorescent color likelihood calculation data 123.

The learning unit 222 consists of a fluorescent color criterion learning unit 501, a fluorescent color determination unit 502, and a fluorescent color likelihood calculation unit 503. For each feature group, the fluorescent color criterion learning unit 501 refers to the four-color fluorescence intensity sequence 124 and the correct fluorescent color sequence 125 that make up the training data 122, learns the fluorescent color criterion 126 used to determine the fluorescent color in each cycle, and stores it in the fluorescent color criterion database 231. This learning process is described in detail later. The fluorescent color determination unit 502 and the fluorescent color likelihood calculation unit 503 use the fluorescent color likelihood calculation data 123 of each feature group to derive, for each cycle, a fluorescent color likelihood 127 (FIG. 1) for each of the four fluorescent colors, and store it in the fluorescent color likelihood database 232.

FIG. 6 shows an example of the learning procedure executed for one feature group in the learning stage 120 (FIG. 1). In step S1, the learning unit 222 acquires the training data 122 belonging to the feature group to be learned from a storage area (not shown).

In step S2, the fluorescent color criterion learning unit 501 of the learning unit 222 reads the four-color fluorescence intensity sequence 124 and the correct fluorescent color sequence 125 from the training data 122. It then learns the relationship between the four-color fluorescence intensities, over all cycles or over a subset of cycles, and the correct fluorescent color sequence, and derives the fluorescent color criterion 126 for determining the fluorescent color in each cycle. A support vector machine (SVM), for example, is used for learning. As an example, consider deriving a fluorescent color criterion 126 that determines the fluorescent color in cycle $i$ from the fluorescence intensities of cycle $i$ and its neighboring cycles $i-1$ and $i+1$. In this case, the fluorescent color criterion learning unit 501 feeds the fluorescence intensities of the training data 122, $(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,(i-1)}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,i}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,(i+1)}$, together with the correct fluorescent color $x_i$ into the SVM, and derives the fluorescent color criterion 126 for determining the fluorescent color in cycle $i$.
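
The three-cycle input described in step S2 can be illustrated by the feature construction below. This sketch only prepares the training inputs for one cycle; the SVM training itself is left to whatever library is used and is not shown. The function names are hypothetical, and zero-padding at the first and last cycle is an assumption, since the source does not say how read boundaries are handled.

```python
def window_features(intensities, i):
    """Concatenate the intensities of cycles i-1, i, i+1 into a 12-value vector.

    Cycles that fall outside the read are zero-padded (an assumption).
    """
    zero = (0.0, 0.0, 0.0, 0.0)
    features = []
    for j in (i - 1, i, i + 1):
        cycle = intensities[j] if 0 <= j < len(intensities) else zero
        features.extend(cycle)
    return features


def training_set_for_cycle(reads, i):
    """Collect inputs X and labels y for cycle i over all reads of one feature group.

    Each read is a pair (intensities, correct_colors).
    """
    X = [window_features(intensities, i) for intensities, _ in reads]
    y = [colors[i] for _, colors in reads]
    return X, y


group_reads = [
    ([(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)], "abc"),
    ([(1, 0, 0, 0), (0, 0, 1, 0), (0, 0, 1, 0)], "acc"),
]
X, y = training_set_for_cycle(group_reads, 1)
```

One such `(X, y)` pair per cycle and per feature group would then be passed to the classifier, yielding a separate criterion for each cycle.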

In step S3, the learning unit 222 stores the fluorescent color criterion 126 derived by the fluorescent color criterion learning unit 501 for each cycle of the extension reaction in the fluorescent color criterion database 231 of the storage device 230. FIG. 7 shows a configuration example of a database that stores, for each cycle of each feature group, the support vectors learned as the fluorescent color criterion 126. The number of stored support vectors is arbitrary, and one or more fluorescent color criteria 126 are stored per cycle.

In step S4, the learning unit 222 acquires the remainder of the known DNA sequence data 121 belonging to the feature group as the fluorescent color likelihood calculation data 123. This data is supplied to the fluorescent color determination unit 502 and the fluorescent color likelihood calculation unit 503.

In step S5, the learning unit 222 searches the fluorescent color criterion database 231 and retrieves the fluorescent color criterion 126 that was learned with DNA sequences having the relevant feature as training data.

In step S6, the fluorescent color determination unit 502 obtains, for each feature group, the four-color fluorescence intensity sequence 124 from the fluorescent color likelihood calculation data 123 and determines the fluorescent color in each cycle using the fluorescent color criterion 126. For example, when the fluorescent color criterion 126 for cycle $i$ has been learned from the fluorescence intensities of cycle $i$ and its neighboring cycles $i-1$ and $i+1$, the fluorescent color determination unit 502 takes $\{(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,(i-1)}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,i}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,(i+1)}\}$ from the fluorescence intensity sequence 124 as input, applies the fluorescent color criterion 126, and determines the fluorescent color in cycle $i$.

The fluorescent color likelihood calculation unit 503 then compares the fluorescent color sequence determined by the fluorescent color determination unit 502 with the correct fluorescent color sequence 125 and derives the four-color fluorescent color likelihood $P(x'_i \mid x_i)$ for each cycle. Here, $x_i$ is the correct fluorescent color in cycle $i$, and $x'_i$ is the determined fluorescent color in cycle $i$.
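
One straightforward way to obtain $P(x'_i \mid x_i)$ from such a comparison is to count, per cycle, how often each determined color occurs for each correct color. The sketch below does this; the function name is hypothetical, and the additive pseudocount (which avoids zero probabilities for color pairs never observed) is an assumption not found in the source.

```python
from collections import Counter

COLORS = "abcd"


def color_likelihoods(called_reads, correct_reads, pseudocount=1.0):
    """Estimate P(x'_i | x_i) per cycle from counts.

    Returns table[i][x][x_called], where x is the correct color and
    x_called the determined color in cycle i.
    """
    n_cycles = len(correct_reads[0])
    table = []
    for i in range(n_cycles):
        counts = Counter(
            (correct[i], called[i])
            for called, correct in zip(called_reads, correct_reads)
        )
        per_cycle = {}
        for x in COLORS:
            total = sum(counts[(x, xc)] for xc in COLORS) + pseudocount * len(COLORS)
            per_cycle[x] = {
                xc: (counts[(x, xc)] + pseudocount) / total for xc in COLORS
            }
        table.append(per_cycle)
    return table


table = color_likelihoods(["ab", "ab", "cb"], ["ab", "ab", "ab"])
```

In this toy input the correct color in cycle 0 is always `a`, determined as `a` twice and as `c` once, so with a pseudocount of 1 the estimate of $P(a \mid a)$ is $3/7$ and each row sums to 1.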

In step S7, the learning unit 222 stores the four-color fluorescent color likelihood 127 calculated by the fluorescent color likelihood calculation unit 503 for each cycle of the extension reaction in the fluorescent color likelihood database 232. FIG. 8 shows a configuration example of a database that stores the four-color fluorescent color likelihood $P(x'_i \mid x_i)$ for each cycle of each feature group.

[Estimation stage]
In the estimation stage 130 (FIG. 1), the estimation unit 223 uses the fluorescent color criterion database 231 and the fluorescent color likelihood database 232 to estimate the fluorescent color in each cycle of the unknown experimental DNA sequence data 131. FIG. 9 outlines the processing procedure executed by the estimation unit 223 in the estimation stage 130.

The unknown experimental DNA sequence data 131 on which estimation is based is given as the four-color fluorescence intensity sequence 132 acquired from the sequencer for each cycle of the extension reaction, $(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,1}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,2}, \dots, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,n}$. The estimation unit 223 outputs the estimated DNA sequence 134 and its reliability 135 as the estimation result 133.

The estimation unit 223 consists of a sequence feature discrimination unit 901, a fluorescent color determination unit 902, and a DNA sequence likelihood calculation unit 903. The sequence feature discrimination unit 901 identifies the features of the DNA sequence contained in the unknown experimental DNA sequence data 131 and determines which of the feature groups generated from the known DNA sequence data 121 it belongs to. Using the fluorescent color criterion 126 and the fluorescent color likelihood 127 of the feature group so identified, the fluorescent color determination unit 902 and the DNA sequence likelihood calculation unit 903 calculate the estimated DNA sequence 134 and the reliability 135 (that is, the DNA sequence likelihood) for the fluorescence intensity sequence 132 being estimated.

FIG. 10 shows an example of the estimation procedure executed in the estimation stage 130 (FIG. 1).
In step S11, the sequence feature discrimination unit 901 and the fluorescent color determination unit 902 acquire the unknown experimental DNA sequence data 131 from the sequencer. The unknown experimental DNA sequence data 131 may instead be acquired from a storage area (not shown).

In step S12, the sequence feature discrimination unit 901 performs clustering analysis using the four-color fluorescence intensity sequence 132 for all cycles obtained from the sequencer, $(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,1}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,2}, \dots, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,n}$, as the feature vector, or performs clustering analysis in a feature space obtained by a nonlinear transformation of this feature vector, and assigns the sequence to one of the clusters constructed by the difficult-to-read DNA sequence analysis unit 221 (FIG. 2) as described with reference to FIG. 4.
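
A simple way to picture step S12 is nearest-centroid assignment: the unknown read's flattened intensities are compared with one representative vector per existing cluster. The source does not specify how a new sequence is matched to a cluster, so this rule, the function name, and the centroid values below are all assumptions made for illustration.

```python
def assign_feature_group(intensities, centroids):
    """Return the index of the centroid closest to the read's feature vector.

    `centroids` holds one flattened vector per feature group, for example
    cluster means kept from the search stage (an assumption).
    """
    vector = [value for cycle in intensities for value in cycle]
    distances = [
        sum((a - b) ** 2 for a, b in zip(vector, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


group_centroids = [
    [0.95, 0.05, 0.0, 0.0, 0.95, 0.05, 0.0, 0.0],  # hypothetical group 0
    [0.0, 0.05, 0.95, 0.0, 0.0, 0.0, 0.9, 0.1],    # hypothetical group 1
]
unknown_read = [(0.1, 0.0, 0.8, 0.1), (0.0, 0.1, 0.9, 0.0)]
group = assign_feature_group(unknown_read, group_centroids)
```

The returned index then selects which stored criterion and likelihood table are looked up in the following steps.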

In step S13, the fluorescent color determination unit 902 searches the fluorescent color criterion database 231 and retrieves the fluorescent color criterion 126 for the identified feature group.

In step S14, the DNA sequence likelihood calculation unit 903 searches the fluorescent color likelihood database 232 and retrieves the four-color fluorescent color likelihood 127 for the identified feature group.

In step S15, the fluorescent color determination unit 902 determines the fluorescent color in each cycle on the basis of the fluorescent color criterion 126 specific to the feature group to which the four-color fluorescence intensity sequence being estimated belongs. For example, when the criterion for cycle $i$ has been learned from the fluorescence intensities of cycle $i$ and its neighboring cycles $i-1$ and $i+1$, the fluorescent color determination unit 902 applies the fluorescent color criterion 126 to $\{(I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,(i-1)}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,i}, (I_a, I_b, I_c, I_d)^{\mathrm{cycle}\,(i+1)}\}$ and determines the fluorescent color of cycle $i$.

The DNA sequence likelihood calculation unit 903 then calculates the DNA sequence likelihood $P(u_i \mid x')$ from the fluorescent color $x'_i$ determined (estimated) for each cycle by the fluorescent color determination unit 902 and the four-color fluorescent color likelihood $P(x'_i \mid x_i)$ for each cycle retrieved from the fluorescent color likelihood database 232 for the identified feature group.

After these steps, the estimation unit 223 outputs the estimated DNA sequence 134 and the reliability 135 as the estimation result 133 for the fluorescence intensity sequence being estimated.

The DNA sequence likelihood $P(u_i \mid x')$, the estimated DNA sequence $b_i$, and the reliability $R_i$ are given by the following equations, where $x'$ is the sequence of determined fluorescent colors and $u$ is the DNA sequence.
$$P(u_i \mid x') \propto \sum P(x' \mid u)\,P(u)$$
$$\sum P(x' \mid u)\,P(u) = \sum P(x' \mid x)\,P(u)$$
Here $u_i \in \{A, G, C, T\}$. When $P(u)$ is known, the known value is used as is; when $P(u)$ is unknown, $P(u) = 1/4$ is used. In addition,
$$b_i = \operatorname*{argmax}_{u_i} P(u_i \mid x'), \qquad R_i = -10 \log_{10} P(u_i = b_i \mid x').$$
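
For a single cycle, these formulas can be worked through numerically as in the sketch below. It rests on assumptions not stated in the source: cycles are treated independently, so that $P(u_i \mid x')$ is proportional to $P(x'_i \mid x_i)\,P(u_i)$; the proportionality is resolved by normalizing over the four bases; and the base-to-color mapping `BASE_TO_COLOR` and the function name are hypothetical.

```python
import math

# Hypothetical one-to-one mapping between bases and fluorescent colors.
BASE_TO_COLOR = {"A": "a", "C": "b", "G": "c", "T": "d"}


def call_base(called_color, likelihood, prior=None):
    """Return (b_i, R_i, posterior) for one cycle.

    likelihood[x][x_called] is P(x'_i | x_i) for this cycle and feature group.
    When no prior is given, P(u) = 1/4 is used, as in the text.
    """
    prior = prior or {base: 0.25 for base in BASE_TO_COLOR}
    unnormalized = {
        base: likelihood[color][called_color] * prior[base]
        for base, color in BASE_TO_COLOR.items()
    }
    total = sum(unnormalized.values())
    posterior = {base: value / total for base, value in unnormalized.items()}
    best = max(posterior, key=posterior.get)           # b_i = argmax P(u_i | x')
    reliability = -10 * math.log10(posterior[best])    # R_i as written in the text
    return best, reliability, posterior


cycle_likelihood = {
    x: {xc: (0.7 if xc == x else 0.1) for xc in "abcd"} for x in "abcd"
}
best, reliability, posterior = call_base("a", cycle_likelihood)
```

With these toy values the posterior for A is 0.7, so $b_i$ is A. Note that with $R_i$ defined as written, the value approaches 0 as the posterior of the called base approaches 1.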

[Summary]
As described above, the sequence decoding system according to the embodiment classifies difficult-to-read DNA sequences into feature groups and learns in advance the fluorescent color criterion 126 and the fluorescent color likelihood 127 specific to each feature group. When DNA sequence data is decoded, the system first determines which feature group the fluorescence intensity sequence 132 of the unknown experimental DNA sequence data 131 belongs to, then applies the fluorescent color criterion 126 and the fluorescent color likelihood 127 already learned for that feature group to estimate the DNA sequence being decoded. This approach can improve the decoding accuracy for difficult-to-read DNA sequences. Improved decoding accuracy in turn increases the amount of sequence information that can be obtained in a single sequencer run. As a result, the power to detect sequence variants and the accuracy of sequence analysis (for example, mapping and assembly) can be improved.

110 … Search stage for difficult-to-read DNA sequences
120 … Learning stage
121 … Known DNA sequence data
122 … Training data
123 … Data for fluorescence intensity calculation
124 … Fluorescence intensity sequence
125 … Correct fluorescent color sequence
126 … Fluorescent color criterion
127 … Fluorescent color likelihood
130 … Estimation stage
131 … Unknown experimental DNA sequence data
132 … Fluorescence intensity sequence
133 … Estimation result
134 … Estimated DNA sequence
135 … Reliability of the estimated DNA sequence
210 … Input/output device
220 … Analysis device
221 … Difficult-to-read DNA sequence analysis unit
222 … Learning unit
223 … Estimation unit
230 … Storage device
231 … Fluorescent color criterion database
232 … Fluorescent color likelihood database
501 … Fluorescent color criterion learning unit
502 … Fluorescent color determination unit
503 … Fluorescent color likelihood calculation unit
901 … Sequence feature discrimination unit
902 … Fluorescent color determination unit
903 … DNA sequence likelihood calculation unit

Claims

A DNA sequence decoding system for a massively parallel sequencer, comprising:
an analysis device having a function of classifying, for difficult-to-read DNA sequence data, fluorescence intensity sequence data acquired from the massively parallel sequencer into one or more feature groups according to features of the sequence; a function of learning, using fluorescence intensity sequence data of known DNA sequence data prepared for each classified feature group, a fluorescent color criterion for determining the fluorescent color, and of calculating a fluorescent color likelihood that gives its reliability; and a function of discriminating, for unknown experimental DNA sequence data, the corresponding feature group from among the one or more feature groups based on features of the sequence and, based on the discrimination result, calculating estimated DNA sequence data of the unknown experimental DNA sequence data and a fluorescent color likelihood that gives its reliability;
a storage device that stores the fluorescent color criterion and the fluorescent color likelihood learned for each feature group of the sequence; and
an input/output device that inputs the fluorescence intensity sequence data and outputs the estimated DNA sequence data and the fluorescent color likelihood that gives its reliability,
wherein the function of classifying into the feature groups
classifies the DNA sequence data into the one or more feature groups by applying, to the DNA sequence data determined to be the difficult-to-read DNA sequence data, a clustering analysis that uses the fluorescence intensity sequence data acquired from the massively parallel sequencer as a feature vector, or by applying a clustering analysis in a feature space obtained by nonlinearly transforming the feature vector.

The DNA sequence decoding system according to claim 1,
wherein the difficult-to-read DNA sequence data is extracted based on the mapping accuracy obtained when actual sequences of various genomes are mapped to a reference sequence.

A DNA sequence decoding method for a massively parallel sequencer, comprising:
a process of classifying, for difficult-to-read DNA sequence data, fluorescence intensity sequence data acquired from the massively parallel sequencer into one or more feature groups according to features of the sequence;
a process of learning, using fluorescence intensity sequence data of known DNA sequence data prepared for each classified feature group, a fluorescent color criterion for determining the fluorescent color, and of calculating a fluorescent color likelihood that gives its reliability; and
a process of discriminating, for unknown experimental DNA sequence data, the corresponding feature group from among the one or more feature groups based on features of the sequence and, based on the discrimination result, calculating estimated DNA sequence data of the unknown experimental DNA sequence data and a fluorescent color likelihood that gives its reliability,
wherein the process of classifying into the feature groups
classifies the DNA sequence data into the one or more feature groups by applying, to the DNA sequence data determined to be the difficult-to-read DNA sequence data, a clustering analysis that uses the fluorescence intensity sequence data acquired from the massively parallel sequencer as a feature vector, or by applying a clustering analysis in a feature space obtained by nonlinearly transforming the feature vector.

The DNA sequence decoding method according to claim 3,
wherein the difficult-to-read DNA sequence data is extracted based on the mapping accuracy obtained when actual sequences of various genomes are mapped to a reference sequence.

A program that causes a computer to execute:
a process of classifying, for difficult-to-read DNA sequence data, fluorescence intensity sequence data acquired from a massively parallel sequencer into one or more feature groups according to features of the sequence;
a process of learning, using fluorescence intensity sequence data of known DNA sequence data prepared for each classified feature group, a fluorescent color criterion for determining the fluorescent color, and of calculating a fluorescent color likelihood that gives its reliability; and
a process of discriminating, for unknown experimental DNA sequence data, the corresponding feature group from among the one or more feature groups based on features of the sequence and, based on the discrimination result, calculating estimated DNA sequence data of the unknown experimental DNA sequence data and a fluorescent color likelihood that gives its reliability,
wherein the process of classifying into the feature groups
classifies the DNA sequence data into the one or more feature groups by applying, to the DNA sequence data determined to be the difficult-to-read DNA sequence data, a clustering analysis that uses the fluorescence intensity sequence data acquired from the massively parallel sequencer as a feature vector, or by applying a clustering analysis in a feature space obtained by nonlinearly transforming the feature vector.