TW201243117A - Method and system of assembling DNA reads with paired-end sequencing - Google Patents

Method and system of assembling DNA reads with paired-end sequencing

Info

Publication number
TW201243117A
Authority
TW
Taiwan
Prior art keywords
sequence
read
reading
order
paired
Prior art date
Application number
TW100114888A
Other languages
Chinese (zh)
Inventor
Hsueh-Ting Chu
Cheng-Yan Kao
Li-Chen Chen
Original Assignee
Hsueh-Ting Chu
Cheng-Yan Kao
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hsueh-Ting Chu, Cheng-Yan Kao
Priority to TW100114888A
Publication of TW201243117A

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention discloses a method of assembling DNA reads from paired-end sequencing to determine target DNA sequences. The method comprises the steps of: inputting sequencing reads that include paired-end reads; indexing the head and tail subsequences of the reads; producing an initial fragment to be extended; designating as foreseeing reads those reads whose distance to their paired read lies within a specific range; adding a foreseeing read to extend the initial fragment whenever a candidate read paired with that foreseeing read has been aligned; and completing the assembly when no suitable reads remain for extension.

Description

VI. Description of the Invention

[Technical Field of the Invention]

The present invention relates to a DNA assembly method that uses paired-end sequencing reads, and in particular to a data analysis system and method applied to gene sequencing, belonging to the technical field of genetic engineering.

[Prior Art]

The genome (deoxyribonucleic acid, DNA) is the human genetic material and contains all gene sequences and non-gene sequences. A gene sequence is a segment of a DNA molecule that can express a physiological function. To study and analyze a genome, the genome must be sequenced to determine its gene sequences so that researchers can identify and analyze it.
To analyze a genome correctly, the prevailing professional view is that the genome must be fully sequenced: it is broken into small fragments, sequencing reads are generated, and the analysis proceeds from those reads. Once the physiological functions of the gene fragments have been analyzed, the sequencing reads are reassembled to build a genetic map.

Known gene sequencing methods all developed from the "whole-genome shotgun sequencing" invented in 1977 by Craig Venter, the "gene maniac". This chemical sequencing method, also called the "shotgun method", cuts the long DNA strand into segments, sequences each segment, and then uses a computer to splice the fragment sequences back into the sequence of the whole DNA. An example is the "nanofluidic" chip technology adopted by Radoje Drmanac and US Genomics, which can sequence about 200,000 base pairs per minute. However, reassembling the reads produced by shotgun sequencing is quite time-consuming, and repeated reads cannot be reassembled effectively.

To ease assembly and to address these problems of shotgun read reassembly, developers have created the well-known paired-end DNA sequencing technique. Many assembly techniques that use paired-end reads have also been developed. For example, US Publication No. 20090233291 (Paired end sequencing), a double-ended sequencing method published under the Patent Cooperation Treaty (publication number garbled in the source), and reference [9] are all well-known paired-end sequencing techniques. Although current sequencing technology can produce many usable paired-end reads, no technique that assembles the sequencing reads effectively has yet been developed.
In the commonly known paired-end DNA sequencing technique (shown in Figure 1), adapters A1 and A2 carrying sequencing primer sites are attached to both ends of the DNA sequence to be sequenced. When the bases are determined, base sequences are synthesized in turn using the target sequence Sf and its reverse complement SRC as templates, so the sequence information of a number of bases on each side of the target sequence Sf is obtained together. The short reads r1 and r2 produced by paired-end sequencing therefore come in pairs, as shown in Figure 2. Each read and the reverse complement of its paired read are sequence fragments located at the two ends of an insert to be sequenced. Because the exact length of the insert is unknown, the distance between the two paired reads is also unknown. The data produced by the sequencing run look like the following example (13210606 pairs of short sequences):

Sequence 1-1: TCCTGTATATTCTAAACTTAGAGATTGTTCAT
Sequence 2-1: CATAAACATCTTTATAAAATACTAATAGAAAG
Sequence 1-2: AAAGGAGAGAACGTCGTCGTTTTCGTCGAAGT
Sequence 2-2: ACAACCCTAACTCTTTTTTTTTTGGCTATTGT
...
Sequence 1-13210605: TCTTCCGCCGTCGCAACTTTACCCAACGCCGC
Sequence 2-13210605: ACCGCAAAAGCAAGATGATTCATTGTGTATCC

Sequence 1-13210606: CCTGGATCACAGCATCCACACGCACAAATATC
Sequence 2-13210606: CCAATGGATTCTTTCTTTACTAACAATATCGA

There are three main conventional de novo assembly methods for gene sequencing:

(1) De Bruijn graph
(2) Overlap-Layout-Consensus
(3) Greedy extension algorithm

Most of the new generation of short-read assembly programs use the De Bruijn graph algorithm, including Abyss [1], Velvet [2], Allpaths [3], Euler [4], and SOAPdenovo [5]. A De Bruijn graph consists of nodes and directed edges that connect the nodes. Each node represents a word of length K (called a k-mer), and the words of adjacent nodes overlap by K-1 bases. Representing all sequences as a De Bruijn graph is called a "roadmap". A de novo assembly algorithm based on the De Bruijn graph merges adjacent nodes into a larger node. If the merging leaves multiple nodes, an Eulerian path is sought in the De Bruijn graph to serve as the sequence that can finally be merged. In the conventional De Bruijn graph assembly method, the nodes of the graph represent k-mers, and a k-mer is only part of the bases of a sequencing read rather than the whole read.
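To make the point about k-mers concrete, the following is a minimal Python sketch, not taken from the patent, of how reads are cut into k-mers to form a De Bruijn graph. The function name `de_bruijn_edges` and the toy reads are assumptions chosen for illustration; the assemblers cited above use far more elaborate data structures.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Link each k-mer node to the k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # adjacent k-mers overlap by k-1 bases
    return dict(graph)

# Two hypothetical reads that share the k-mer GTAC.
reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_edges(reads, k=4)

for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

The node GTAC is produced by both reads, yet nothing in the graph records which read, or which read pair, contributed it.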
Therefore, paired-end information cannot be represented in the De Bruijn graph. Systems that use the De Bruijn graph assembly method consequently apply paired-end information only after the contigs have been constructed, in the scaffolding of the contigs. Scaffolding arranges the contigs into a longer sequence scaffold.

Paired-end sequencing can determine the base information on both sides of a sequence to be tested.
It can therefore be regarded as a technique that improves the accuracy and coverage of sequence assembly. However, as shown in Table 1 of Annex 2, the three main traditional assembly methods all use paired-end sequencing information only in scaffolding: contigs are generated first, and the paired-end information is then used to build the contigs into a scaffold. Because almost half of the sequence in the human genome is repetitive, assembling repetitive target sequences is difficult when processing sequencing data from the human genome or the genomes of other organisms.

Figure 5 illustrates the repeat problem. The target sequence in the example of Figure 5(a) consists of five subsequence fragments, ARBRC, where R is a fragment that occurs twice and A, B, and C are distinct fragments. As shown in Figure 5(b), when the subsequence AR is extended, reads spanning (R,B) and reads spanning (R,C) can both be attached to the right of AR. For example, AGATAACGGA is a read spanning (R,B) and GGGGAAAAAT is a read spanning (R,C). Attaching the read GGGGAAAAAT to the right of AR produces the erroneous subsequence ARC. Likewise, as shown in Figure 5(c), attaching the read AGATAACGGA to the right of the subsequence BR produces the erroneous subsequence BRB. In gene sequencing, assembling repeated sequences has long been a notorious problem.

In view of the repeat problem in the prior art, the present invention develops a new assembly framework called read foreseeing. Read foreseeing is a mechanism that, while a sequence is being assembled, records the paired-read information used by the sequence under assembly and then performs a foreseeable-read check.
This mechanism uses paired-end information to perform three operations: extension, bridging, and repeat sensing. The method newly created by the present invention is therefore called EBAR, short for Extension, Bridging And Repeat sensing. To clarify the technical terms first coined in this invention, such as read foreseeing, anomalous pairing, ambiguous pairing, and branching extension, a Chinese-English glossary of the terms related to the extension-bridging-repeat-sensing technique is given in a table in Annex 2.

[Summary of the Invention]

The object of the present invention is to provide a new method and system for assembling paired-end reads. The invention changes the traditional way in which paired-end information is used for scaffolding. The technique of the invention is called the read foreseeing method and system, and it makes full use of paired-end information to handle the repeat problem encountered when reads are spliced and assembled. The technical means for achieving this object is to construct a read foreseeing framework: for the reads used in extending a sequence, their paired reads that have not yet been attached are defined as foreseeable reads (Foreseeable Reads, FRs), and the subsequence is continued with those foreseeable reads.

To make the above object and other features of the invention clearer, some preferred embodiments are described in detail below with reference to the accompanying drawings. In these descriptions, to explain the principle concisely, different examples use different sequence lengths and different index key lengths.

[Embodiments]

I. Basic technical architecture of the invention

Figures 6 and 7 show a basic embodiment of the read foreseeing technical architecture of the invention.
Figure 6 is a schematic diagram of how paired reads are distributed in a sequence. Figure 7 shows how read foreseeing is applied to determine the correct position at which a read is attached. When the subsequence AR is being extended, r1, r2, and r3 are reads located on AR, so their paired reads are defined as the foreseeable reads (Foreseeable Reads, FRs) to be attached. As shown in Figure 2, the paired reads read out by the sequencing system are (r1, r2). During assembly, however, the pairs the invention actually uses are each read r1, r2 together with the reverse complement c2, c1 of its paired read, that is, (r1, c2) and (r2, c1) as illustrated in Figures 3 and 4. Figure 4(a) shows a pair of paired ends r1, c2 used by the invention, and Figure 4(b) shows the invention using that pair r1, c2 to join contigs C1 and C2 into a contig scaffold.

In the examples of Figures 5 to 7, reads r4 and r11 can both be attached to the right of the subsequence AR, but only r4 is a foreseeable read, so the invention continues AR with r4, as shown in Figure 7(a). Conversely, as shown in Figure 7(b), only when the subsequence ARBR is being extended is r11 found to be a foreseeable read, and it is then used to extend ARBR into the correct sequence ARBRC.

The read foreseeing architecture is feasible, but not every foreseeable read that is found can be used to produce a correct extended fragment. The architecture must solve two problems: (1) anomaly of pairing and (2) ambiguity of pairing.

The first problem to be solved when implementing the architecture is anomalous pairing. Figure 8 illustrates how anomalous pairing arises under the read foreseeing architecture.
When the repeat is short and the insert is long, a pair of reads may span two or more copies of the repeat, like the paired reads (r2, r5) in Figure 8. When the subsequence AR is being extended, r4, r5, and r6 are all foreseeable reads, so r5 could be attached to the right of AR and form an erroneous extension. A similar situation arises when ARBR is being extended: if r4 is still in the foreseeable read set, r4 could be attached to the right of ARBR and form an erroneous extension.

The inventors' solution is to check for a reasonable pairing distance. If the insert length is about 200 base pairs (bp), the distance between paired reads in the sequence can be expected to lie between 100 and 300 bp. In the example of Figure 8(a), the distances of the paired reads (r1,r4), (r2,r5), and (r3,r6) are 160 bp, 145 bp, and 180 bp respectively. When anomalous pairing occurs, the distance between the paired reads is abnormally high or abnormally low. In Figure 8(b), for example, when r5 is attached to the right of AR, the pairing distance of (r2,r5) is only 25 bp, so anomalous pairing can be inferred: although r5 is a foreseeable read, it should not be attached to the right of AR. Similarly, in Figure 8(c), if r4 is attached to the right of ARBR, the pairing distance of (r1,r4) is as long as 375 bp, so anomalous pairing can again be inferred: although r4 is a foreseeable read, it should not be attached to the right of ARBR.
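The pairing-distance test described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_reasonable_pairing`, the inclusive bounds, and the default range of 100 to 300 bp (taken from the 200 bp insert example) are assumptions, not details specified by the patent.

```python
def is_reasonable_pairing(distance_bp, min_bp=100, max_bp=300):
    """Return True if a pairing distance lies inside the expected range (inclusive)."""
    return min_bp <= distance_bp <= max_bp

# Pairing distances quoted in the Figure 8 discussion, in bp.
placements = {
    "(r1,r4) in Figure 8(a)": 160,
    "(r2,r5) in Figure 8(a)": 145,
    "(r3,r6) in Figure 8(a)": 180,
    "(r2,r5) with r5 placed right of AR": 25,
    "(r1,r4) with r4 placed right of ARBR": 375,
}

verdicts = {label: is_reasonable_pairing(d) for label, d in placements.items()}
for label, ok in verdicts.items():
    print(f"{label}: {'reasonable' if ok else 'anomalous'}")
```

With these numbers the three pairs of Figure 8(a) pass, while the 25 bp and 375 bp placements are rejected as anomalous.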
The second problem to be solved when implementing the read foreseeing architecture is ambiguity of pairing. Figure 9 illustrates how ambiguous pairing arises under the architecture. As shown in Figure 9(a), when the repeat is very long, the order of the fragments that precede and follow it cannot be determined; that is, it cannot be decided whether the subsequence ARB or ARC is correct. In Figure 9(b), when the subsequence AR is being extended, the candidate reads are all foreseeable reads, so all of them could legitimately follow AR, and it cannot be decided whether ARB or ARC is the correct assembly. The inventors call this a branching extension. Figure 10 illustrates a branching extension: when AR is being extended, ten possible reads x1, x2, x3, ..., x10 are found. In the example, x1 and x3 are foreseeable reads, but attaching both x1 and x3 would form two different extended fragments, ACTTAACGT and CCCTGGCCA. Therefore, when the system detects a branching extension, it must give up the attachment to avoid assembling a wrong sequence.

II. Specific embodiments of the invention

i. System of the invention

Figures 11 to 16 show the gene sequence assembly system 10 of the invention, which splices a set of paired-end short reads produced by a gene sequencing system into the target gene sequence. One embodiment comprises a data reader 11, an index database 12, a pairing database 13, a start fragment generator 14, an indexer 15, a foreseeable read checker 16, an extender 17, a bridger 18, a fault-tolerant read checker 19, a contig constructor 20, and an output interface 21. Each of these components is described below.
The data reader 11 (which may be an input interface) reads a plurality of reads 110 from a file stored in a database or memory to form a read set. The reads may be produced by a gene sequencing system and include paired reads, which are either paired-end reads or longer sequences simulated in a paired form (mate-pair). The data reader gives each input read a numbering datum, takes N bases from the head and N bases from the tail of each read as index key data 112, and stores the index key data 112 in the index database 12. The numbering data of the reads that are paired are stored in the pairing database 13. The data reader 11 includes a read usage record array 114, which records whether a read has already been placed in a sequence contig. For the reads attached to a sequence, their paired reads that have not yet been attached are defined as foreseeable reads, and these form a foreseeable read set.

The start fragment generator 14 generates at least one start sequence fragment 141, which the extender 17 and the bridger 18 subsequently use to extend or bridge to other reads.

The indexer 15 is used when the system extends a sequence fragment. It selects a sequence of length K in a detection window 150 as an index key 151 (as shown in Figure 12(a)), queries the index database 12 for the numbering data of the reads whose index key data 112 match the index key 151, converts the numbering data into the base strings of those reads, and supplies the base strings to the start fragment generator 14, the extender 17, and the foreseeable read checker 16 for index lookup of the reads r1, r2, r3, r4 that might be attached (as shown in Figure 12(b)).

The foreseeable read checker 16 dynamically records the foreseeable reads of the sequence currently being extended. Referring to Figures 11, 12(b), and 14, the start fragment generator 14 and the extender 17 submit the candidate reads 161, 162, 163 (r1, r2, r3, r4) they intend to attach, together with their positions, to the foreseeable read checker 16 to ask whether they are foreseeable reads (f1, f2, f3). The checker verifies that a candidate read 161, 162, 163 (r1, r2, r3, r4) is in the current foreseeable read set and also computes whether its distance from its paired read falls within a predetermined range, that is, whether the distance is reasonable.

The extender 17 and the bridger 18 are the components that actually search for foreseeable reads 162, 163 with which to extend the current sequence. Only when the extender 17 cannot find a valid foreseeable read 162 is the bridger 18 used to search for a foreseeable read 163 that can extend the current sequence.

Once the extender 17 and the bridger 18 have built an extended fragment of the current sequence, the fault-tolerant read checker 19 checks whether the selected reads contain sequencing errors and whether there is a branching extension. A branching extension is detected when attaching two different foreseeable reads to the sequence would form two different extended fragments; in that case the attachment must be abandoned to avoid assembling a wrong sequence.

The contig constructor 20 takes the selected reads 191 that were chosen by the extender 17 and the bridger 18 and passed the fault-tolerant read checker 19, and arranges them in overlapping fashion by position to construct a sequence contig 201. The sequence of this contig 201 is written to a file 211 through the output interface 21 for later use.

ii. Method of the invention

Referring to Figures 11 to 16, one embodiment of the sequencing read assembly method of the invention uses the system of the invention.
The method comprises the following steps.

Step S201: The data reader 11 reads a plurality of reads (the reads produced by a gene sequencing system) to form a read set, gives each input read a numbering datum, builds an index structure for each read and stores it in the index database 12, and builds the pairing information of each read and stores it in the pairing database 13.

Step S211: As shown in Figures 11 and 14, the start fragment generator 14 generates a start sequence fragment 141 for the extender 17 and the bridger 18 to use. In a preferred embodiment, the start fragment generator 14 first finds, in the read set read by the data reader 11, a pair of paired reads 113 not yet used in any other contig, and treats the two reads as start fragments SegA and SegB. The foreseeable read checker 16 then checks the read set for paired reads 161. When paired reads 161 (r1, r2) that can extend SegA and SegB respectively are found, they are taken as the initially selected reads 140 and used to extend SegA and SegB respectively. When a read in the read set is found that can be attached to both SegA and SegB at the same time, the left-right order of SegA and SegB is determined, and that read joins the two start fragments into one start sequence fragment 141.

Steps S221 and S222: These are left-right symmetric extension procedures. The extender 17 takes M bases at each of the two ends of the currently extended sequence, or of the start sequence fragment 141, as a detection window 142.
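As a rough illustration of the indexing in step S201 and the detection window in steps S221 and S222, the Python sketch below indexes the first N bases of every read and then scans the window at the right end of a fragment for reads that overlap it. It is a simplified sketch under stated assumptions: the names `build_head_index` and `right_candidates` are hypothetical, only read heads are indexed (the patent indexes both heads and tails), the key length K is taken equal to N, overlaps must match exactly, and the foreseeable-read check is left out.

```python
from collections import defaultdict

def build_head_index(reads, n):
    """Index every read by its first n bases (head key -> list of read numbers)."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        index[seq[:n]].append(read_id)
    return index

def right_candidates(fragment, reads, index, n, window_len):
    """Find reads that overlap the right end of `fragment` and extend past it.

    Returns (read number, overhanging bases) for each candidate read.
    """
    window_start = max(0, len(fragment) - window_len)
    found = []
    for pos in range(window_start, len(fragment) - n + 1):
        key = fragment[pos:pos + n]      # n-base key taken from the detection window
        overlap = fragment[pos:]         # bases the read must reproduce exactly
        for read_id in index.get(key, []):
            read = reads[read_id]
            if len(read) > len(overlap) and read.startswith(overlap):
                found.append((read_id, read[len(overlap):]))
    return found

# Hypothetical toy data.
reads = ["GTTCATAC", "TTCATGGA", "CCCCCCCC"]
index = build_head_index(reads, n=4)
candidates = right_candidates("AGATTGTTCAT", reads, index, n=4, window_len=8)
print(candidates)
```

In this toy case reads 0 and 1 both overlap the window and would extend the fragment by different bases, which is exactly the situation in which the method relies on the foreseeable-read check to choose between candidates.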
Figures 15 and 16 illustrate the following procedure using extension to the right as the example.
Steps S231 and S232: These are the extension steps of the extender 17, explained in conjunction with Figs. 11, 13 and 15. The extender 17 uses the aforementioned detection window 142 to find candidate reads that can be joined, and then queries the foreseeing read checker 16 to see whether any of the candidate reads r1, r2, r3, r4 belongs to the foreseeing reads 162 (f1, f2, f3 in Figure 15). If so, the candidate read that is also one of the foreseeing reads f1, f2, f3 is determined to be a foreseeing read 162 that can extend at least one side of the start sequence fragment 141, and that foreseeing read 162 is used to build a longer extended sequence fragment 171.
Steps S251 and S252: When the extender 17 finds no foreseeing read 162, the bridger 18 queries the foreseeing read checker 16 for a set of foreseeing reads 163. The N bases on one side of each of these foreseeing reads 163 are then used as an anchor sequence 164, and the detection window 142 is searched for a base string identical to the anchor sequence 164. The procedure is shown in Figure 16, in which an anchor sequence 164 of length K is sought in the detection window 142. Three cases can arise. In the first case, the anchor sequence 164 cannot be found in the detection window 142, and the foreseeing read 163 is determined not to lie at this position. In the second case, the anchor sequence 164 is found in the detection window 142, but after alignment at the position of the anchor sequence 164 the remaining bases of the read do not agree with the bases of the detection window 142, and the foreseeing read 163 is likewise determined not to lie at this position.
In the third case, the anchor sequence 164 is found in the detection window 142 and, after alignment, the remaining bases also agree with the bases of the detection window 142. The foreseeing read 163 is then determined to be extendable, and it is joined to the sequence to continue or build the extended sequence fragment 181.
Steps S261 and S262: The foreseeing reads 162 and 163 found by the extender 17 or the bridger 18 are used in turn to build the extended sequence fragment 181. The fault-tolerant read checker 19 then finds the reads that can be arranged on this extended sequence fragment 181, arranges them, and checks them for sequencing errors. Reads with an abnormal pairing distance, or whose bases disagree with those of the detection window 142, are screened out, and the reads that pass become the selected reads 191.
Steps S271 and S272: Among the selected reads screened in step S261, the fault-tolerant read checker 19 further checks whether there is a divergent extension. If there is none, all of the selected reads 191 are determined to be addable to a target sequence contig, and the pairing data 192 of the selected reads 191 is returned to the foreseeing read checker 16 to update the current set of foreseeing reads. At the same time, the pairing information of the paired reads that have been used is deleted from the pairing database 13.
Step S291: When no new read can be added on either side of the extended sequence fragment 181, the contig constructor 20 overlaps all the selected reads 191 into a sequence contig 201. The output interface 21 outputs the most certain base at each position of the sequence contig 201 to a file, which becomes the base sequence 211 of the final assembled target gene.
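The three cases distinguished by the bridger 18 in steps S251 and S252 can be sketched as below for extension to the right. This is a minimal illustration, not the patented implementation. It assumes exact base matching, that the anchor is the first N bases of the foreseeing read, and that the read must reach past the end of the current sequence; the function name `bridge_right` and its return convention are hypothetical.

```python
def bridge_right(sequence, read, m, n):
    """Try to attach a foreseeing read to the right end of the sequence.

    Returns (case, new_sequence), where case is 1, 2 or 3 as in the text.
    """
    if m <= 0 or n <= 0 or len(read) < n:
        raise ValueError("m and n must be positive and the read at least n bases long")
    window = sequence[-m:]   # detection window: last M bases of the sequence
    anchor = read[:n]        # anchor sequence: first N bases of the read (assumption)
    start = window.find(anchor)
    if start == -1:
        return 1, sequence   # case 1: anchor absent from the window
    while start != -1:
        overlap = window[start:]  # window bases from the anchor to the sequence end
        if read.startswith(overlap):
            # case 3: remaining bases agree, so append the part beyond the window
            return 3, sequence + read[len(overlap):]
        start = window.find(anchor, start + 1)  # try a later anchor occurrence
    return 2, sequence       # case 2: anchor found, but remaining bases disagree
```

With the sequence `ACGTTGCAAGGC`, M = 6 and N = 3, the read `AGGCTTA` falls under the third case and extends the sequence by `TTA`, while `AGGTTTA` falls under the second case and `TTTACG` under the first.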
Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit the invention. Modifications and refinements may be made without departing from the spirit and scope of the invention, and the scope of protection is therefore as defined by the appended claims.
Brief description of the drawings
Figure 1 is a schematic diagram of a conventional paired-end sequencing procedure;
Figure 2 is a schematic diagram of the sequencing reads generated by conventional paired-end sequencing;
Figure 3 is a schematic diagram of the positional relationship of conventional paired sequencing reads;
Figures 4(a) and (b) are schematic diagrams of conventional contig scaffolds;
Figures 5(a), (b) and (c) are schematic diagrams illustrating the conventional repeat sequence problem;
Figure 6 is a schematic diagram of an architecture using foreseeing reads;
Figures 7(a) and (b) are schematic diagrams of the handling of repeat sequences by the foreseeing reads of the invention;
Figures 8(a), (b) and (c) are schematic diagrams illustrating abnormal pairing and its solution in the invention;
Figures 9(a) and (b) are schematic diagrams illustrating ambiguous pairing in the invention;
Figure 10 is a schematic diagram of divergent extension and its solution in the invention;
Figure 12 is a schematic diagram of the reads of the invention;
Figure 13 is a flow chart of the paired-end read assembly of the invention;
Figures 14(a), (b), (c) and (d) are schematic diagrams illustrating the invention;
Figure 15 is a schematic diagram of the extender and the extension operation of the invention; and
Figure 16 is a schematic diagram of the bridger and the bridging operation of the invention.
Annex 1: References.
One is a comparison between the prior art and the technique of the present invention in their use of paired-end sequencing information, and Table 1 is a Chinese-English glossary of terms for the novel sequencing-read assembly technique of the present invention.

Sequencing-read assembly system 10
Reads 110, 113, 161, 162, 163, 191
Data reader 11
Pairing data 111
Index mapping data 112
Read usage record array 114
Index database 12
Pairing database 13
Start fragment generator 14
Start sequence fragment 141
Detection window 142, 150
Indexer 15
Index key 151
Foreseeing read checker 16
Anchor sequence 164
Extender 17
Extended sequence fragment 171, 181
Bridger 18
Fault-tolerant read checker 19
Pairing data 192
Contig constructor 20
Sequence contig 201
Output interface 21
Base sequence of the target gene 211

Claims (1)

VII. Scope of the patent application:

1. A sequence assembly method, which receives a set of reads produced by a sequencing system in order to extend a sequence, comprising:
reading in the read set, the reads including a plurality of pairs of paired reads;
building index data for the reads and pairing-number data for the paired reads;
using the reads to generate a start sequence fragment, to which candidate reads selected from the read set as able to be joined are appended to extend the sequence;
wherein a read that is the paired read of a candidate read and whose pairing distance is within a predetermined range is a foreseeing read; and
each time the start sequence fragment has been joined by one of the candidate reads, one of the foreseeing reads is then joined, to extend the sequence.

2. The method of claim 1, wherein the paired reads are one of the reads and the reverse complement of the read paired with it.

3. The method of claim 1, wherein a length of M bases on at least one side of the sequence fragment joined so far is taken as a detection window, the detection window is used to find the candidate read, and the candidate read is checked as to whether it is one of the foreseeing reads.

4. The gene sequence assembly system of claim 3, wherein N bases on one side of the candidate read are taken as an anchor sequence, and the detection window is used to search the sequence for a base string identical to the anchor sequence; when such a string is found and, after the candidate read is aligned with the detection window, the remaining bases of the candidate read also agree with the remaining bases of the detection window, the candidate read is joined to the sequence.

5. The method of claim 3 or 4, wherein the candidate reads joined to the sequence are arranged and checked for sequencing errors, and reads with an abnormal pairing distance or with bases inconsistent with those of the detection window are screened out.

6. The method of claim 3 or 4, wherein, when the paired read is joined to the sequence, a check is made for a divergent extension, and if one exists, joining is stopped.

7. The method of claim 1, wherein a pair of paired reads that has not yet been joined is found in the read set and taken as two start segments, the two start segments are each extended with other candidate reads, and a read that can be joined to both start segments at the same time is used to join the two start segments and thereby generate the start sequence fragment.

8. The method of claim 1, wherein a read usage record array is used to record whether the candidate read has been joined to the sequence.

9. The method of claim 1, wherein the pairing-number data of the reads that have been joined to the sequence is deleted.

10. The method of claim 1, wherein, when no new read can be appended on either side of the sequence, all the reads joined to the sequence are overlapped according to their positions into a sequence contig, and the sequence contig is output for use.
TW100114888A 2011-04-28 2011-04-28 Method and system of assembling DNA reads with paired-end sequencing TW201243117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW100114888A TW201243117A (en) 2011-04-28 2011-04-28 Method and system of assembling DNA reads with paired-end sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW100114888A TW201243117A (en) 2011-04-28 2011-04-28 Method and system of assembling DNA reads with paired-end sequencing

Publications (1)

Publication Number Publication Date
TW201243117A true TW201243117A (en) 2012-11-01

Family

ID=48093747

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100114888A TW201243117A (en) 2011-04-28 2011-04-28 Method and system of assembling DNA reads with paired-end sequencing

Country Status (1)

Country Link
TW (1) TW201243117A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015061103A1 (en) * 2013-10-21 2015-04-30 Seven Bridges Genomics Inc. Systems and methods for using paired-end data in directed acyclic structure
US9898575B2 (en) 2013-08-21 2018-02-20 Seven Bridges Genomics Inc. Methods and systems for aligning sequences
US11347704B2 (en) 2015-10-16 2022-05-31 Seven Bridges Genomics Inc. Biological graph or sequence serialization

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9898575B2 (en) 2013-08-21 2018-02-20 Seven Bridges Genomics Inc. Methods and systems for aligning sequences
WO2015061103A1 (en) * 2013-10-21 2015-04-30 Seven Bridges Genomics Inc. Systems and methods for using paired-end data in directed acyclic structure
CN105830078A (en) * 2013-10-21 2016-08-03 七桥基因公司 Systems and methods for using paired-end data in directed acyclic structure
US10055539B2 (en) 2013-10-21 2018-08-21 Seven Bridges Genomics Inc. Systems and methods for using paired-end data in directed acyclic structure
US10204207B2 (en) 2013-10-21 2019-02-12 Seven Bridges Genomics Inc. Systems and methods for transcriptome analysis
US11347704B2 (en) 2015-10-16 2022-05-31 Seven Bridges Genomics Inc. Biological graph or sequence serialization
