WO2015094854A1 - Iterative clustering of sequence reads for error correction - Google Patents

Iterative clustering of sequence reads for error correction Download PDF

Info

Publication number
WO2015094854A1
WO2015094854A1 PCT/US2014/069539 US2014069539W WO2015094854A1 WO 2015094854 A1 WO2015094854 A1 WO 2015094854A1 US 2014069539 W US2014069539 W US 2014069539W WO 2015094854 A1 WO2015094854 A1 WO 2015094854A1
Authority
WO
WIPO (PCT)
Prior art keywords
reads
sequence
cluster
clusters
probability
Prior art date
Application number
PCT/US2014/069539
Other languages
French (fr)
Inventor
Huei-Hun TSENG
Original Assignee
Pacific Biosciences Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pacific Biosciences Inc. filed Critical Pacific Biosciences Inc.
Priority to CN201480069926.4A priority Critical patent/CN105849555B/en
Priority to EP14871681.4A priority patent/EP3084426B1/en
Publication of WO2015094854A1 publication Critical patent/WO2015094854A1/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30Unsupervised data analysis

Definitions

  • a base-call As a true variant or, alternatively, a miscall (e.g., insertion, deletion, or mismatch error in the sequence read).
  • a miscall e.g., insertion, deletion, or mismatch error in the sequence read.
  • sequence reads have the base calls that differ between the homologous chromosomes, it is important to be able to determine whether the differing base calls are true variations between the homologues, or are merely sequencing errors.
  • a viral population in an individual can have many variations between individual viral genomes in the population, especially in highly mutable viruses such as HIV.
  • a transcriptome is a set of all RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNAs produced in one or a population of cells. Because the term includes all mRNA transcripts in the cell, the transcriptome reflects the genes that are being actively expressed at any given time.
  • commercially available genome aligners do not enable error correction of full-length long sequence reads in transcriptome sequencing.
  • reads produced on the PacBio® RS II instrument average about 5-6 kb, and reads as long as 20 kb are routinely generated. With such long-read capabilities, full-length mRNA transcripts can be sequenced, e.g., after conversion to cDNA. This can help the researcher identify splicing patterns that are difficult to reconstruct using short-read sequencing technologies.
  • publicly available sequence aligners e.g., GMAP, and functional annotation tools almost all require reads having near 100% accuracy.
  • the PacBio instrument generates reads from single template molecules that have an error profile that makes it difficult to directly apply these sequence alignment tools.
  • the exemplary embodiments are generally directed to processes for analyzing sequence data from mixed populations of nucleic acids, for assigning each sequence read to a particular origin, and for ultimately identifying one or more consensus sequences of one or more biomolecular target sequences from the sequence information.
  • the methods provided herein are applicable not only to sequence data having few errors, but also to sequence data having relatively high rates of insertions, deletions, and/or mismatch errors. Consequently, the invention is also directed to systems that carry out these processes.
  • the exemplary embodiments provide methods and systems for iterative clustering of sequence reads for error correction, which is performed by at least one software component executing on at least one processor.
  • such methods comprise receiving a set of sequence reads and associated quality values; grouping the sequence reads into a set of initial clusters based on sequence similarity; generating a cluster consensus for each of the initial clusters; iteratively improving the clustering based on the cluster consensus and the quality values associated with the sequence reads; and generating and outputting a final cluster consensus for each of the clusters.
  • iteratively improving the clustering further comprises: calculating a probability of each sequence read belonging to each cluster using the quality values; reassigning individual sequence reads from one cluster to another cluster having a highest calculated probability; and merging highly similar clusters.
  • the input sequence reads comprise full-length long reads of at least .5 kb in length through 1 , 2, 3, 4, 5, 7, or 10 kb in length
  • the final cluster consensus is generated using the cluster consensus and non-full-length reads, which are used to impart full coverage to the sequence data to provide a higher level of consensus.
  • Figure 1 is a diagram illustrating one embodiment of a computer system for implementing a process for using iterative clustering of sequencing reads for error correction of transcriptome sequencing data.
  • Figure 2 is a flow diagram illustrating certain aspects of a process for iterative clustering of sequence reads for error correcting by according to an exemplary embodiment.
  • Figure 3 is a diagram showing an example portion of two reads from the same isoform that have been aligned to create a pairwise alignment.
  • Figure 4 is a diagram illustrating exemplary similarity graphs.
  • Figure 5 is a diagram illustrating one implementation for distinguishing a true isoform difference from sequence errors between aligned reads.
  • Figure 6 is a diagram illustrating an example of sequence reads initially assigned to incorrect clusters, where sequence reads of the same fill pattern are from the same isoform.
  • Figure 7 is a diagram illustrating exemplary cluster consensus, C1 , C2, C3 and C4, generated for each of clusters, respectively.
  • Figure 8 is a diagram illustrating reassignment of a sequence read from one cluster to the cluster having the highest computed probability of membership.
  • Figure 9 is a diagram illustrating an example of creating a new cluster from an orphan.
  • Figure 10 is a diagram illustrating the merger of two clusters.
  • Error correction of long reads for transcriptome analysis is different from error correction for genome assembly. Both can be formulated as a clustering problem. In genome assembly, there are only as many "clusters" as there are chromosomes; each chromosome is largely different from each other. In comparison to the whole chromosome size, the shared repeat regions are very small, and as long as there is a continuous long read spanning the repeat, it is relatively easy to resolve its origin.
  • HGAP Hierarchical Genome Assembly Process
  • the "quasispecies problem” is a specific application of the general transcriptome clustering problem. Like transcriptome sequencing, the total number of clusters is unknown and one must iteratively "guess" both the clusters and the cluster consensus. The problem is easier with respect to the HIV genome, because the HIV genome is known and priors can be placed on the number of mutations expected. Further information on the quasispecies problem is provided in Zogardi, et al. (2010) “Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies," Nucl. Acids. Res. doi: 10.1093/nar/gkq655, which is incorporated herein by reference in its entirety for all purposes.
  • an algorithm is provided that addresses the problem of errors in transcriptome sequencing.
  • the algorithm of the exemplary embodiments utilizes cluster consensus.
  • the exemplary embodiments relate generally to generating consensus sequences from mixed populations. More specifically, the exemplary embodiments provide methods and systems for error correction of reads based on iterative clustering of isoforms using primarily long read data. The probability of each input sequence read belonging to each cluster is iteratively calculated and the sequences are then reassigned to clusters having a higher probability of membership. In addition, the process merges highly similar clusters. According to the exemplary embodiments, iterative isoform-level clustering removes transcript redundancy and improves transcriptome consensus accuracy, all without requiring a reference genome.
  • FIG. 1 is a diagram illustrating one embodiment of a computer system for implementing a process for iterative clustering of sequence reads for error correction.
  • the invention may be embodied in whole or in part as software recorded on fixed media.
  • the computer 100 may be any electronic device having at least one processor 102 (e.g., CPU and the like), a memory 103, input/output (I/O) 104, and a data repository 106.
  • the CPU 102, the memory 103, the I/O 104 and the data repository 106 may be connected via a system bus or buses, or alternatively using any type of communication connection.
  • the computer 100 may also include a network interface for wired and/or wireless communication.
  • computer 100 may comprise a personal computer (e.g., desktop, laptop, tablet etc.), a server, a client computer, or wearable device.
  • computer 100 may comprise any type of information appliance for interacting with a remote data application, and could include such devices as an internet-enabled television, cell phone, and the like.
  • the processor 102 controls operation of the computer 100 and may read information (e.g., instructions and/or data) from the memory 103 and/or a data repository 106 and execute the instructions accordingly to implement the exemplary embodiments.
  • information e.g., instructions and/or data
  • the term "processor 102" is intended to include one processor, multiple processors, or one or more processors with multiple cores.
  • the I/O 104 may include any type of input devices such as a keyboard, a mouse, a microphone, etc., and any type of output devices such as a monitor and a printer, for example.
  • the output devices may be coupled to a local client computer.
  • the memory 103 may comprise any type of static or dynamic memory, including flash memory, DRAM, SRAM, and the like.
  • the memory 103 may store programs and data including a sequence aligner/overlapper 1 10, a cluster consensus algorithm 1 1 1 , an Iterative Cluster Error correction (ICE) component 1 12, and a polisher component 1 14 (e.g., Quiver). These components/algorithms may be used in the process of transcriptome sequence assembly as described herein.
  • the data repository 106 may store several databases including one or more databases that store sequence reads 1 16, read quality values (hereinafter QVs) 1 18, maximal cliques 120, clusters 122, cluster consensus 124, probabilities 126, and final consensus sequences 128.
  • the sequence reads 1 16 comprise isoform sequence reads, which may include both full-length sequence reads (hereinafter, “full-length reads”) 1 16-1 and non-full-length sequence reads (hereinafter, "Non-full-length reads”) 1 16-2.
  • the clusters 122 may comprise isoform-level clusters.
  • the data repository 106 may reside within the computer 100. In another embodiment, the data repository 106 may be connected to the computer 100 via a network port or external drive.
  • the data repository 106 may comprise a separate server or any type of memory storage device (e.g., a disk-type optical or magnetic media, solid state dynamic or static memory, and the like).
  • the data repository 106 may optionally comprise multiple auxiliary memory devices, e.g., for separate storage of input sequences (e.g., sequence reads), sequence information, calculation results and/or other information. Computer 100 can thereafter use that information to direct server or client logic, as understood in the art, to embody aspects of the invention.
  • an operator may interact with the computer 100 via a user interface presented on a display screen (not shown) to specify the sequence reads 1 16 and other parameters required by the various software programs.
  • the programs in the memory 103 including the sequence aligner/overlapper 1 10, the cluster consensus algorithm 1 1 1 , the ICE component 1 12, and the polisher component 1 14, are executed by the processor 102 to implement the methods of the present invention.
  • the sequence aligner/overlapper 1 10 reads the selected sequence reads 1 16 from the data repository 106 and performs sequence alignment on the sequence reads 1 16 to identify regions of similarity that may be a consequence of structural or functional or other relationships between the sequence reads 1 16.
  • the full-length reads 1 16-1 are generally high accuracy reads, e.g., at least about 98% or 99% accurate, and may be raw reads from a sequencing technology that provides such high quality reads, or may be pre-assembled, high-quality reads constructed from sequencing read data of a lower quality, as described elsewhere herein.
  • Aligned sequences 1 17 are generated by the sequence aligner/overlaper 1 10 during the sequence alignment.
  • the sequence aligner/overlaper 1 10 is implemented in C, C++, Java, C#, F#, Python, Perl, Haskell, Scala, Lisp, a Python/C hybrid, and others known in the art.
  • the ICE component 1 12 generates clusters 122 of similar sequence reads by grouping the sequence reads 1 16 into a set of initial clusters based on similarity and maximal cliques 120.
  • the cluster consensus algorithm 1 1 1 generates the cluster consensus 124 for each of the clusters.
  • the ICE component 1 12 then iteratively improves the clustering through iterations of based on the cluster consensus 124 and the quality values associated with the sequence reads, which includes reassigning sequence reads 1 16 from one cluster to another based on computer probabilities 126, and merging substantially similar clusters.
  • the polisher component 1 14 may then generate the final cluster consensus 128 for each of the clusters 122 in accordance with exemplary embodiments, as explained further below.
  • the output of the processing may include is a list of final consensus sequences 128, each of which represents the "consensus" of a cluster.
  • each of the clusters 122 may represent a single, unique transcript.
  • the present invention may provide methods and systems for identifying a set of unique full-length transcripts from a mixed population using full-length reads 1 16-1 .
  • the results of the process may optionally further comprise quality information, technology information (e.g., peak characteristics, expected error rates), alternate (e.g., second or third best) consensus determination, confidence metrics, and the like.
  • the progress and/or result of this processing may be saved to the memory 103 and the data repository 106 and/or output through the I/O 104 for display on a display device and/or saved to an additional storage device (e.g., CD, DVD, Blu-ray, flash memory card, etc.), or printed.
  • an additional storage device e.g., CD, DVD, Blu-ray, flash memory card, etc.
  • Figure 2 is a flow diagram illustrating certain aspects of a process for iterative clustering of sequence reads for error correcting by according to an exemplary embodiment.
  • the process may be used to correct errors in long reads during transcriptome sequencing.
  • the process may be performed by a combination of the sequence aligner/overlapper 1 10, the cluster consensus algorithm 1 1 1 , the ICE component 1 12, and the polisher component 1 14 ( Figure 1 ), which although are shown as separate components, the functionality of each may be combined into a lesser or greater number of software algorithms/components.
  • the process may begin by receiving a set of sequence reads 1 16 and associated quality values 1 18 (block 200).
  • the sequence reads 1 16 preferably comprise, but are not limited to, a set full-length long reads 1 16-1 .
  • the quality values (QVs) 1 18 are estimations of per-position base call accuracy generated by a sequencing machine.
  • the iterative clustering error correction (ICE) component 1 12 groups the sequence reads into a set of initial clusters based on sequence similarity (block 202).
  • the cluster consensus algorithm 1 1 1 generates the cluster consensus 124 for each of the initial clusters (block 204).
  • the ICE component 1 12 iteratively improves clustering based on the cluster consensus and the quality values 1 18 associated with the sequence reads (block 206), as described further below.
  • the process further includes generating and outputting a final cluster consensus 128 for each of the clusters of (block 208).
  • the final cluster consensus 128 may comprise a list of final cluster consensus sequences, each of which represents the consensus sequence of a cluster (and therefore a transcript in one embodiment).
  • the final cluster consensus 128 may be generated once the iterative clustering process has completed.
  • the input comprises full-length reads 1 16-1
  • the final cluster consensus may be generated by inputting non-full-length reads 1 16-2 into a final polishing process, which then generates the final cluster consensus 128.
  • the final cluster consensus 128 may be saved e.g. in the memory 103 and/or the data repository 106, or sent to the I/O 104 for display on a monitor and/printing by a printer.
  • transcript isoform sequencing is to understand transcriptome complexity using accurate, unassembled, full-length long reads.
  • the full-length reads 1 16- 1 are captured and identified automatically by a sequencing machine, but the exemplary embodiments improve the accuracy through iterative clustering.
  • the input sequence reads 1 16 comprise the full-length long reads 1 16-1 of a transcript, for example.
  • the input sequence reads 1 16 may comprise the non-full-length reads 1 16-2.
  • the sequence reads 1 16 can optionally comprise redundant sequence information, e.g., where the same transcript was repeatedly sequenced to generate a long sequence read comprising multiple copies of the transcript.
  • the additional information associated with the sequence reads 1 16 may comprise features of underlying sequencing technology output (e.g., trace characteristics (integrated counts per peak, shape/height/width of peaks, distance to neighboring peaks, characteristics of neighboring peaks), signal-to-noise ratios, power-to-noise ratio, background metrics, signal strength, reaction kinetics, etc.), and the like.
  • the iterative clustering process comprises two phases.
  • the first phase includes grouping the sequence reads 1 16 into a set of initial clusters based on sequence similarity (block 202).
  • initial clustering is used to help determine which sequence reads 1 16 originate from the same transcript isoform.
  • the background idea for clustering is the observation of multiple sequence reads originating from multiple copies of the same isoform. For example, the following shows three copies of a transcript read originating the same isoform:
  • Figure 2 shows further details of the initial clustering process (block 202).
  • the initial clustering process may begin by the sequence aligner/overlapper 1 10 aligning the sequence reads 1 16 to create aligned reads.
  • sequence aligner/overlapper 1 10 aligning the sequence reads 1 16 to create aligned reads.
  • Many known sequence alignment processes may be used, such as for example, mapping single molecule sequencing reads 1 16 using a Basic Local Alignment with Successive Refinement (BLASR) algorithm, further described in U.S. Patent Publication No. 20120330566, and incorporated by reference in its entirety for all purposes.
  • BLASR Basic Local Alignment with Successive Refinement
  • Figure 3 is a diagram showing an example portion of two reads from the same isoform that have been aligned using the sequence aligner/overlapper 1 10, creating aligned reads 300.
  • the first aligned read shown as "Query” has a length of 1 ,675 bp
  • the second aligned read shown as “Target” has a length of 1 ,680 bp.
  • the alignment (“nMatch”) between the aligned reads 300 is 1 .6 kbp and the percent similarity ("%sim”) is 99.1677, which includes 2 insertions and 1 1 deletion indels (indicated by " * ").
  • the next step is to form isoform clusters.
  • this approach relies on an aligner and requires a good reference genome to begin with, limiting the approach to those applications having a pre-existing reference genome.
  • methods and systems are provided for identifying isoform clusters that do not use a reference genome, and are therefore suitable for use in applications in which no reference genome exists.
  • the aligned reads 300 are used to build a similarity graph (block 202-2).
  • the similarity graph is constructed such that each sequence read 1 16 is represented as a node in the graph, and alignments between the sequence reads 1 16 are represented as connecting edges between the nodes to indicate that the two sequence reads have an alignment hit (i.e., a sufficiently high percent of similarity).
  • FIG 4 is a diagram illustrating exemplary similarity graphs 400.
  • the algorithm for finding isoform clusters utilizes pairwise alignment in which each node 402 in a similarity graph 400 represents a read, and edges 404 connecting pairs of nodes represent the presence of pairwise alignment, such as that shown in Figure 3 where the query and target reads would be represented in the graph by a pair of nodes 402 and connected by an edge 404 due to their high percentage of similarity.
  • the similarity graph process results in multiple similarity graphs 400 being formed.
  • a clique refers to a graph comprising a set of nodes in which for every two nodes 402, there exists an edge connecting the two.
  • a maximal clique is a clique of the largest possible size that does not contain a node 402 that overlaps with another clique.
  • the maximal clique finding algorithm non-deterministically partitions the similarity graphs 400 into non-overlapping maximal cliques. There are many approaches to finding all maximal cliques.
  • a maximal clique finding algorithm may be run, such as a greedy randomized adaptive search procedure that iteratively constructs a randomized, greedy biased solution, which is then expanded to a local optimal solution. See for example, Abello et al., On maximum clique problems in very large graphs, AT&T labs Research Technical Report, 1998, which is hereby incorporated herein by reference in its entirety for all purposes.
  • Partitioning the similarity graphs 400 into non-overlapping maximal cliques requires comparing the sequence reads 1 16 to detect isoform alignment differences to determine if the sequence reads 1 16 belong in the same clique.
  • One method for detecting isoform alignment differences is to detect large gaps in alignment between two aligned reads. For example, if given two aligned reads, where one has a large insertion with respect to the other one, then it is very likely the insertion is an extra xenon, and therefore, an isoform difference can been detected.
  • detecting isoform alignment differences becomes more problematic as the gaps in alignment become smaller and smaller. For example, only a seven base insertion difference between two aligned reads may indicate a polymer stretch.
  • determining an isoform difference from a sequence error may be made by leveraging the fact that every base from the raw read sequences 1 16, including insertions, is associated with a quality values (QV) that estimates per-position accuracy and indicates the probability of whether each base is a substitution error, an insertion error, or deletion error.
  • QV quality values
  • Figure 5 is a diagram illustrating one implementation for distinguishing an isoform difference from sequence errors between aligned reads 300.
  • a difference array 500 may be used to keep track of positional differences between the two aligned reads 300. Rows for substitutions (S), insertions (I), and deletions (D) above and below the two aligned reads 300 having a "+"at each base position show where the associated QVs 1 18 indicate that it is sufficiently likely an error occurred.
  • Each position in the difference array 500 may include a value, e.g., 0 or 1 , where a 0 value indicates the differences between the two aligned reads 300 is caused by a sequencing error (rather than a true isoform difference), and a 1 value indicates the differences between the two aligned reads 300 cannot be explained by sequencing errors.
  • a 0 value indicates the differences between the two aligned reads 300 is caused by a sequencing error (rather than a true isoform difference)
  • a 1 value indicates the differences between the two aligned reads 300 cannot be explained by sequencing errors.
  • the threshold length T is set to 10 bases
  • the percentage threshold C is set to 50%.
  • the difference array 500 would be searched for a region longer than 10 bases in which more than 50% of the bases have a value of 1 . If no such region can be found, then the two aligned reads 300 may be considered to be from the same isoform. In the example shown in Figure 5, no such a region exists in the difference array 500 so the two aligned reads 300 are determined to be from the same isoform, and hence placed in the same clique. If, on the other hand, such a region is found, the two aligned reads 300 would be determined to be from different isoforms and hence not placed the same clique. For further information see Tseng & Tompa, Algorithms for Locating Extremely conserveed Elements in Multiple Sequence Alignments, BMC Bioinformatics (2009), which is incorporated by reference in its entirety for all purposes.
  • a clique by definition requires each node 402 to be interconnected.
  • the term “clique” will be replaced below with the broader term “cluster” because the edges 404 between the nodes are not required or used after the maximal clique finding process.
  • Figure 6 is a diagram illustrating an example of sequence reads initially assigned to incorrect clusters 122, where sequence reads/nodes of the same fill pattern are from the same isoform. As shown, sequence reads labeled 1 -3 have been incorrectly placed in a different cluster than sequence reads 4-5, all of which are from the same isoform. In addition, sequence reads 1 1 and 12 are incorrectly grouped with read 12; and read 6 has been incorrectly separately grouped from reads 7-9.
  • the processes performed after the initial clustering 202 are designed to resolve the ambiguities of the initial clusters 122.
  • the cluster consensus algorithm 1 1 1 After the initial clusters 122 are formed, the cluster consensus algorithm 1 1 1 generates the cluster consensus 124 for each of the initial clusters (block 204), where each cluster consensus 124 is used to represent the sequence of all the members of the cluster.
  • Cluster consensus generation is well-known in the art.
  • the cluster consensus algorithm 1 1 1 may be based on using directed acyclic graphs to encode multiple sequence alignment, e.g., the DAGCon (Directed Acyclic Graph Consensus) algorithm.
  • DAGCon Directed Acyclic Graph Consensus
  • DAGCon takes a group of paired alignments that are aligned against a reference or backbone/seed to which other reads are aligned (in de novo genome assembly, the longest sequence reads are used as the backbone/seed) to generate a directed acyclic graph, where each path through the graph represents one of the alignments. The graph is then simplified and a most likely path through the graph is determined, which is the consensus. See Chin et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods (2013), which is hereby incorporated by reference. [0071]
  • Figure 7 is a diagram illustrating exemplary cluster consensus, C1 , C2, C3 and C4, generated for each of clusters 122, respectively, where each of the input read sequences 1 16 belongs to exactly one cluster 122
  • the second phase of the iterative clustering for error correction (ICE) process is invoked.
  • the second phase of ICE begins by iteratively improving the clustering based on the cluster consensus 124 and the quality values 1 18 (block 206).
  • the read sequences 1 16 are automatically "reassigned" from one cluster to another or designated as "orphans" and used to generate new clusters if the sequence reads are determined to not belong to any of the existing clusters, and highly similar clusters are merged, as explained below.
  • FIG 2 shows further details of the process for iteratively improving the clustering (block 206).
  • a current sequence read does not align to any of the cluster consensus (C) with a sufficiently high percent of similarity using the process of detecting isoform alignment differences described above (i.e., no isoform hit)
  • the sequence read is ignored for having a bad probability.
  • a linear-time algorithm may be used to filter out alignments that have large indels. (See, e.g., Algorithms tor locating extremely conserved elements in multiple sequence alignments. Tseng & Tompa, BMC Bioinformatics, 2009).
  • the ICE component 1 12 calculates a probability of the current read belonging to each of the clusters given the cluster consensus and the QV for the current read:
  • Pr (Si I Cu QVs(Si)) (9match)count(match) (1 /3 Bsub)count(sub) (1 /3 Bins)count(ins) (1 /3 Bdel)count(del),
  • is the probability of a match of a substitution (sub), an insertion (ins), and a deletion (del), respectively.
  • the ICE component 1 12 determines that the probability of read 6 belonging to cluster consensus C3 is greater than the probability of read 6 (S6) belonging to cluster consensus C4:
  • cluster C3 contains isoforms from the same group as S6.
  • the output of the probability calculation is a list of probabilities calculated for each of the read sequences belonging to each of the clusters 122. In one embodiment, number of probabilities calculated the total number of nodes/sequence reads multiplied by the total number of clusters, with some probabilities having a value of "unknown”. [0078] Referring again to Figure 2, after the probabilities are calculated, the ICE component 1 12 reassigns individual sequence reads from one cluster to another cluster having a highest calculated probability (block 206-2).
  • Figure 8 is a diagram illustrating reassignment of a sequence read from one cluster to the cluster having the highest computed probability of membership. This example shows reassignment of read 6 from cluster C4 to C3.
  • the sequence reads may be considered orphans, and new clusters may then be formed from the orphans (block 206-3).
  • the new clusters may be formed from the orphans using the same procedure as in the initial phase described above.
  • Figure 9 is a diagram illustrating an example of creating a new cluster from an orphan. In this example, it is determined that read S12 does not have an isoform hit. Read S12 is then designated as an orphan and a new cluster C6 is created that contains read S12.
  • this problem may be solved by randomly generating a probability for each orphan node. That is, a random number generator may be used to generate a number between a predefined range, between 0 and 1 , for example (other ranges are possible). If the random probability is smaller than a predefined threshold probability, e.g., 0.30, then the orphan is reassigned to one of the clusters having a nonzero computed probability of membership for the orphan.
  • a predefined threshold probability e.g. 0.30
  • FIG 10 is a diagram illustrating the merger of two clusters.
  • former cluster C1 and C4 from Figure 9 are determined to be isoform hits and have a threshold percent of similarity greater than 99.5%. Consequently, in Figure 10 clusters C1 and C4 have been merged into a new cluster and a corresponding cluster consensus C7.
  • the ICE component 1 12 updates the cluster consensus for each of the changed clusters and updates the probability Pr (Si
  • the cluster consensus algorithm 1 1 1 may generate a cluster consensus 124 for the clusters every time a change happens.
  • a predefined threshold may be optionally used for limiting when the cluster consensus algorithm 1 1 1 is called based on cluster sizes i.e., if the cluster size is relatively large, calling the cluster consensus algorithm 1 1 1 may be skipped in block 206-5 if the number of nodes in a particular cluster is larger than the predefined threshold.
  • the cluster consensus algorithm step is parallelizable.
  • new additional sequence reads can be added to the existing cluster sets any time during the second phase by aligning the additional sequence reads against all existing consensus sequences.
  • a new sequence is assigned to an existing cluster C if cluster C has the highest posterior probability and the alignment is not rejected. Otherwise, the sequence read is considered an orphan and can form new clusters as in the initial phase described above.
  • the clusters can be improved no further via reassignment of the sequence reads and/or merging the clusters, the iterative clustering for error correction process (block 206) completes.
  • the polisher component 1 14 is called to polish the consensus results (block 208).
  • the polisher component 1 14 may be based on the Quiver algorithm, as described in U.S. 13/941 ,442, filed July 12, 2013, which is hereby incorporated by reference.
  • the ICE component 1 12 calls the cluster consensus algorithm 1 1 1 to generate the cluster consensus for each of the clusters. These cluster consensus are used as a "reference" to which the full-length reads 1 16-1 are aligned during the iterative clustering process in one embodiment.
  • the input to the polishing step may comprise the cluster consensus and the non-full-length reads 1 16-2, which are then aligned to each cluster consensus as a reference.
  • the non-full- length reads 1 16-2 are used to impart full coverage to the sequence data to provide a higher level of consensus using the same "isoform hit" criteria described above.
  • non-full-length reads 1 16-2 do not have to align exclusively and can belong to multiple clusters. Again, the linear-time algorithm is used to reject non-favorable alignments.
  • the polisher component 1 14 generates a final consensus sequence 128 ( Figure 1 ) for each of the clusters.
  • the output from the process may comprise a list of final cluster consensus 128, each of which represents the "consensus" sequence of a cluster.
  • each cluster may be used to represent a single, unique transcript.
  • the final cluster consensus 128 may be mapped to a genome, wherein redundancy is removed and the isoforms collapsed, thereby generating high-quality, full-length isoforms.
  • the iterative clustering of sequence reads for error correction process may have many applications.
  • ICE may be used for full-length cDNA sequencing, bioinformatics analysis, and biological applications.
  • Examples of full-length cDNA sequencing may include, but are not limited to, constructing cDNA libraries enriched in full-length transcripts; size selection using agarose gel or the BluePippinTM system; sequencing transcripts up to 10 kb in full-length; and single-molecule observation of each transcript.
  • bioinformatics analysis may include, but are not limited to, identifying putatively full-length transcripts; and detecting artificial chimeras.
  • examples of biological applications may include, but are not limited to, novel transcripts; alternative splicing; alternative polyadenylation; retained introns; fusion genes; and anti-sense transcription.
  • the system includes a computer-readable medium operatively coupled to the processor that stores instructions for execution by the processor.
  • the instructions may include one or more of the following: instructions for receiving input of sequence reads (and, optionally, reference sequence information), instructions for constructing pre-assembled reads, instructions for aligning sequence reads, instructions for generating string graphs, instructions for generating unitig graphs, instructions for identifying string bundles, instructions for determining primary contigs, instructions for determining associated contigs, instructions for correcting reads, instructions for generating consensus sequences, instructions for generating haplotype sequences, instructions that compute/store information related to various steps of the method (e.g., edges and nodes in a string graph, overlaps and branch points in a string graph, primary and associated contigs, and instructions that record the results of the method.
  • steps of the method e.g., edges and nodes in a string graph, overlaps and branch points in a string graph, primary and associated contigs, and instructions that record the results of the method.
  • the methods are computer-implemented methods.
  • the algorithm and/or results (e.g., consensus sequences generated) are stored on computer-readable medium, and/or displayed on a screen or on a paper print-out.
  • the results are further analyzed, e.g., to identify genetic variants, to identify one or more origins of the sequence information, to identify genomic regions conserved between individuals or species, to determine relatedness between two individuals, to provide an individual with a diagnosis or prognosis, or to provide a health care professional with information useful for determining an appropriate therapeutic strategy for a patient.
  • the computer-readable media may comprise any combination of a hard drive, auxiliary memory, external memory, server, database, portable memory device (CD-R, DVD, ZIP disk, flash memory cards, etc.), and the like.
  • the invention includes an article of manufacture for string graph assembly of polyploid genomes that includes a machine-readable medium containing one or more programs which when executed implement the steps of the invention as described herein.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Exemplary embodiments provide methods and systems for iterative clustering of sequence reads for error correction. Aspects of the exemplary embodiment include receiving a set of sequence reads and associated quality values; grouping the sequence reads into a set of initial clusters based on sequence similarity; generating a cluster consensus for each of the initial clusters; iteratively improving the clustering based on the cluster consensus and the quality values associated with the sequence reads; and generating and outputting a final cluster consensus for each of the clusters.

Description

ITERATIVE CLUSTERING OF SEQUENCE READS FOR ERROR CORRECTION
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of US Provisional Patent Application Serial No. 61 /917,777, filed December 18, 2013, entitled "Methods for Generating Consensus Sequences From Mixed Populations", and US Provisional Patent Application Serial No. 62/028,741 , filed July 24, 2014, entitled, "ITERATIVE CLUSTERING OF SEQUENCE READS FOR ERROR CORRRECTION", both assigned to the assignee of the present application, and incorporated herein by reference.
BACKGROUND OF THE INVENTION
[0002] Advances in biomolecule sequence determination, in particular with respect to nucleic acid and protein samples, has revolutionized the fields of cellular and molecular biology. Facilitated by the development of automated sequencing systems, it is now possible to sequence mixed populations of sample nucleic acids. However, the quality of the sequence information must be carefully monitored, and may be compromised by many factors related to the biomolecule itself or the sequencing system used, including the composition of the biomolecule (e.g., base composition of a nucleic acid molecule), experimental and systematic noise, variations in observed signal strength, and differences in reaction efficiencies. As such, processes must be implemented to analyze and improve the quality of the data from such sequencing technologies.
[0003] Besides affecting overall accuracy of sequence reads generated, these factors can complicate designation of a base-call as a true variant or, alternatively, a miscall (e.g., insertion, deletion, or mismatch error in the sequence read). For example, when sequence reads have the base calls that differ between the homologous chromosomes, it is important to be able to determine whether the differing base calls are true variations between the homologues, or are merely sequencing errors. Yet further, a viral population in an individual can have many variations between individual viral genomes in the population, especially in highly mutable viruses such as HIV. Being able to identify different sequencing reads that have different origins (e.g., different chromosome or genome origins) is key to being able to accurately characterize a mixed population of nucleic acids. For a theoretical sequencing platform that generates reads that are 100% accurate, the reads can simply be compared to one another with simple string matching algorithms. Any difference between the reads is indicative of a true variant, and therefore, a different origin. However, any real-world raw sequencing data is likely to contain errors, so a simple string matching algorithmic approach will not be sufficient. This is especially true when sequencing transcriptomes.
[0004] A transcriptome is a set of all RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNAs produced in one or a population of cells. Because the term includes all mRNA transcripts in the cell, the transcriptome reflects the genes that are being actively expressed at any given time. Currently, there are two general methods of inferring transcriptomes. One approach maps sequence reads onto a reference genome, either of the organism whose transcriptome is being studied or of a closely related species. The other approach is de novo transcriptome assembly that uses software to infer transcripts directly from short sequence reads. [0005] However, commercially available genome aligners do not enable error correction of full-length long sequence reads in transcriptome sequencing. For example, reads produced on the PacBio® RS II instrument average about 5-6 kb, and reads as long as 20 kb are routinely generated. With such long-read capabilities, full-length mRNA transcripts can be sequenced, e.g., after conversion to cDNA. This can help the researcher identify splicing patterns that are difficult to reconstruct using short-read sequencing technologies. However, publicly available sequence aligners, e.g., GMAP, and functional annotation tools almost all require reads having near 100% accuracy. The PacBio instrument generates reads from single template molecules that have an error profile that makes it difficult to directly apply these sequence alignment tools. However, in cases where the sequencing insert (transcript) is much shorter than the polymerase readlength, highly accurate consensus sequences can be generated: through redundant sequencing of single molecules, the long length of the cDNA templates combined with the processivity of the polymerase sequencing engine in the system can result in sufficient redundancy to achieve the required accuracy for these analysis tools. This however, is only applicable to shorter transcripts, and longer transcripts would still require additional processing before they achieve the level of accuracy amenable for biological analysis.
[0006] Currently, there are two published tools for error correcting long reads (e.g., PacBio® cDNA long reads) in transcriptome sequencing, PacBioToCA and LSC. Both tools use short reads (e.g., Illumina® short reads), and follow the general schema of: for each long read, align short reads to the long read as if it were a genomic "scaffold", and create the best consensus based on the short read alignment. This general schema has several disadvantages: (1 ) Since short reads are only 50-100 bp, they can map nonspecifically and introduce more errors; (2) All currently available short read technologies carry their own systematic errors that could bias the correction; (3) No advantage is taken of the fact that the same transcript is often represented by multiple long reads, which, in the case of long reads from Pacific Biosciences, do not have systematic biases; (4) Quality Values (QVs) from the long reads are not used; and (5) The schema requires two different sequencing systems.
[0007] What is needed is an algorithm that addresses the problem of errors in transcriptome sequencing, and preferably an algorithm designed to deal with the transcriptome de novo, i.e., without a reference genome.
BRIEF SUMMARY OF THE INVENTION
[0008] The exemplary embodiments are generally directed to processes for analyzing sequence data from mixed populations of nucleic acids, for assigning each sequence read to a particular origin, and for ultimately identifying one or more consensus sequences of one or more biomolecular target sequences from the sequence information. The methods provided herein are applicable not only to sequence data having few errors, but also to sequence data having relatively high rates of insertions, deletions, and/or mismatch errors. Consequently, the invention is also directed to systems that carry out these processes.
[0009] The invention and various specific aspects and embodiments will be better understood with reference to the following detailed descriptions and figures, in which the invention is described in terms of various specific aspects and embodiments. These are provided for purposes of clarity and should not be taken to limit the invention. The invention and aspects thereof may have applications to a variety of types of methods, devices, and systems not specifically disclosed herein. In certain aspects, the exemplary embodiments provide methods and systems for iterative clustering of sequence reads for error correction, which is performed by at least one software component executing on at least one processor. In certain embodiments, such methods comprise receiving a set of sequence reads and associated quality values; grouping the sequence reads into a set of initial clusters based on sequence similarity; generating a cluster consensus for each of the initial clusters; iteratively improving the clustering based on the cluster consensus and the quality values associated with the sequence reads; and generating and outputting a final cluster consensus for each of the clusters.
[0010] In yet another aspect, iteratively improving the clustering further comprises: calculating a probability of each sequence read belonging to each cluster using the quality values; reassigning individual sequence reads from one cluster to another cluster having a highest calculated probability; and merging highly similar clusters.
[0011] In one embodiment, the input sequence reads comprise full-length long reads of at least .5 kb in length through 1 , 2, 3, 4, 5, 7, or 10 kb in length, and the final cluster consensus is generated using the cluster consensus and non-full-length reads, which are used to impart full coverage to the sequence data to provide a higher level of consensus.
BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS [0012] Figure 1 is a diagram illustrating one embodiment of a computer system for implementing a process for using iterative clustering of sequencing reads for error correction of transcriptome sequencing data.
[0013] Figure 2 is a flow diagram illustrating certain aspects of a process for iterative clustering of sequence reads for error correcting by according to an exemplary embodiment.
[0014] Figure 3 is a diagram showing an example portion of two reads from the same isoform that have been aligned to create a pairwise alignment.
[0015] Figure 4 is a diagram illustrating exemplary similarity graphs.
[0016] Figure 5 is a diagram illustrating one implementation for distinguishing a true isoform difference from sequence errors between aligned reads.
[0017] Figure 6 is a diagram illustrating an example of sequence reads initially assigned to incorrect clusters, where sequence reads of the same fill pattern are from the same isoform.
[0018] Figure 7 is a diagram illustrating exemplary cluster consensus, C1 , C2, C3 and C4, generated for each of clusters, respectively.
[0019] Figure 8 is a diagram illustrating reassignment of a sequence read from one cluster to the cluster having the highest computed probability of membership.
[0020] Figure 9 is a diagram illustrating an example of creating a new cluster from an orphan.
[0021] Figure 10 is a diagram illustrating the merger of two clusters.
DETAILED DESCRIPTION OF THE INVENTION [0022] Various embodiments and components of the present invention employ signal and data analysis techniques that are familiar in a number of technical fields. For clarity of description, details of known analysis techniques are not provided herein. These techniques are discussed in a number of available reference works, such as: R. B. Ash. Real Analysis and Probability. Academic Press, New York, 1972; D. T. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. 2002; K. L. Chung. Markov Chains with Stationary Transition Probabilities, 1967; W. B. Davenport and W. L Root. An Introduction to the Theory of Random Signals and Noise. McGraw-Hill, New York, 1958; S. M. Kay, Fundamentals of Statistical Processing, Vols. 1 -2, (Hardcover - 1998); Monsoon H. Hayes, Statistical Digital Signal Processing and Modeling, 1996; Introduction to Statistical Signal Processing by R.M. Gray and L.D. Davisson; Modern Spectral Estimation: Theory and Application/Book and Disk (Prentice-Hall Signal Processing Series) by Steven M. Kay (Hardcover - Jan 1988); Modern Spectral Estimation: Theory and Application by Steven M. Kay (Paperback - Mar 1999); Spectral Analysis and Filter Theory in Applied Geophysics by Burkhard Buttkus (Hardcover - May 1 1 , 2000); Spectral Analysis for Physical Applications by Donald B. Percival and Andrew T. Walden (Paperback - Jun 25, 1993); Astronomical Image and Data Analysis (Astronomy and Astrophysics Library) by J. L. Starck and F. Murtagh (Hardcover - Sep 25, 2006); Spectral Techniques In Proteomics by Daniel S. Sem (Hardcover - Mar 30, 2007); Exploration and Analysis of DNA Microarray and Protein Array Data (Wiley Series in Probability and Statistics) by Dhammika Amaratunga and Javier Cabrera (Hardcover - Oct 21 , 2003).
[0023] Error correction of long reads for transcriptome analysis is different from error correction for genome assembly. Both can be formulated as a clustering problem. In genome assembly, there are only as many "clusters" as there are chromosomes; each chromosome is largely different from each other. In comparison to the whole chromosome size, the shared repeat regions are very small, and as long as there is a continuous long read spanning the repeat, it is relatively easy to resolve its origin.
[0024] In contrast, for transcriptome analysis there are as many clusters as there are transcripts. In eukaryotes, genes can have many different splice forms. In an extreme example, one isoform of a transcript has an extra 20 bp exon and the other isoform does not. For many biological problems, it is important to be able to tell the two isoforms apart. This level of detailed difference is rarely found on genomic scales, so currently available methods, e.g., Hierarchical Genome Assembly Process (HGAP), which generates high quality (>99.999% accurate) de novo assemblies cannot be directly applied to the transcriptome problem (HGAP is described in U.S. Patent Application 13/941 ,442, filed July 12, 2013).
[0025] The "quasispecies problem" is a specific application of the general transcriptome clustering problem. Like transcriptome sequencing, the total number of clusters is unknown and one must iteratively "guess" both the clusters and the cluster consensus. The problem is easier with respect to the HIV genome, because the HIV genome is known and priors can be placed on the number of mutations expected. Further information on the quasispecies problem is provided in Zogardi, et al. (2010) "Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies," Nucl. Acids. Res. doi: 10.1093/nar/gkq655, which is incorporated herein by reference in its entirety for all purposes. [0026] According to the exemplary embodiments, an algorithm is provided that addresses the problem of errors in transcriptome sequencing. However, instead of the HGAP concept of using a "seed read" for aligning shorter reads in order to generate highly accurate pre-assembled reads, the algorithm of the exemplary embodiments utilizes cluster consensus.
[0027] The exemplary embodiments relate generally to generating consensus sequences from mixed populations. More specifically, the exemplary embodiments provide methods and systems for error correction of reads based on iterative clustering of isoforms using primarily long read data. The probability of each input sequence read belonging to each cluster is iteratively calculated and the sequences are then reassigned to clusters having a higher probability of membership. In addition, the process merges highly similar clusters. According to the exemplary embodiments, iterative isoform-level clustering removes transcript redundancy and improves transcriptome consensus accuracy, all without requiring a reference genome.
[0028] Computer Implementation
[0029] Figure 1 is a diagram illustrating one embodiment of a computer system for implementing a process for iterative clustering of sequence reads for error correction. In specific embodiments, the invention may be embodied in whole or in part as software recorded on fixed media. The computer 100 may be any electronic device having at least one processor 102 (e.g., CPU and the like), a memory 103, input/output (I/O) 104, and a data repository 106. The CPU 102, the memory 103, the I/O 104 and the data repository 106 may be connected via a system bus or buses, or alternatively using any type of communication connection. Although not shown, the computer 100 may also include a network interface for wired and/or wireless communication. In one embodiment, computer 100 may comprise a personal computer (e.g., desktop, laptop, tablet etc.), a server, a client computer, or wearable device. In another embodiment the computer 100 may comprise any type of information appliance for interacting with a remote data application, and could include such devices as an internet-enabled television, cell phone, and the like.
[0030] The processor 102 controls operation of the computer 100 and may read information (e.g., instructions and/or data) from the memory 103 and/or a data repository 106 and execute the instructions accordingly to implement the exemplary embodiments. The term "processor 102" is intended to include one processor, multiple processors, or one or more processors with multiple cores.
[0031] The I/O 104 may include any type of input devices such as a keyboard, a mouse, a microphone, etc., and any type of output devices such as a monitor and a printer, for example. In an embodiment where the computer 100 comprises a server, the output devices may be coupled to a local client computer.
[0032] The memory 103 may comprise any type of static or dynamic memory, including flash memory, DRAM, SRAM, and the like. The memory 103 may store programs and data including a sequence aligner/overlapper 1 10, a cluster consensus algorithm 1 1 1 , an Iterative Cluster Error correction (ICE) component 1 12, and a polisher component 1 14 (e.g., Quiver). These components/algorithms may be used in the process of transcriptome sequence assembly as described herein.
[0033] The data repository 106 may store several databases including one or more databases that store sequence reads 1 16, read quality values (hereinafter QVs) 1 18, maximal cliques 120, clusters 122, cluster consensus 124, probabilities 126, and final consensus sequences 128. In the transcriptome sequencing embodiment, the sequence reads 1 16 comprise isoform sequence reads, which may include both full-length sequence reads (hereinafter, "full-length reads") 1 16-1 and non-full-length sequence reads (hereinafter, "Non-full-length reads") 1 16-2. Also in this embodiment, the clusters 122 may comprise isoform-level clusters.
[0034] In one embodiment, the data repository 106 may reside within the computer 100. In another embodiment, the data repository 106 may be connected to the computer 100 via a network port or external drive. The data repository 106 may comprise a separate server or any type of memory storage device (e.g., a disk-type optical or magnetic media, solid state dynamic or static memory, and the like). The data repository 106 may optionally comprise multiple auxiliary memory devices, e.g., for separate storage of input sequences (e.g., sequence reads), sequence information, calculation results and/or other information. Computer 100 can thereafter use that information to direct server or client logic, as understood in the art, to embody aspects of the invention.
[0035] In operation, an operator may interact with the computer 100 via a user interface presented on a display screen (not shown) to specify the sequence reads 1 16 and other parameters required by the various software programs. Once invoked, the programs in the memory 103 including the sequence aligner/overlapper 1 10, the cluster consensus algorithm 1 1 1 , the ICE component 1 12, and the polisher component 1 14, are executed by the processor 102 to implement the methods of the present invention.
[0036] The sequence aligner/overlapper 1 10 reads the selected sequence reads 1 16 from the data repository 106 and performs sequence alignment on the sequence reads 1 16 to identify regions of similarity that may be a consequence of structural or functional or other relationships between the sequence reads 1 16. In one embodiment, the full-length reads 1 16-1 are generally high accuracy reads, e.g., at least about 98% or 99% accurate, and may be raw reads from a sequencing technology that provides such high quality reads, or may be pre-assembled, high-quality reads constructed from sequencing read data of a lower quality, as described elsewhere herein. Aligned sequences 1 17 are generated by the sequence aligner/overlaper 1 10 during the sequence alignment. In certain embodiments, the sequence aligner/overlaper 1 10 is implemented in C, C++, Java, C#, F#, Python, Perl, Haskell, Scala, Lisp, a Python/C hybrid, and others known in the art.
[0037] The ICE component 1 12 generates clusters 122 of similar sequence reads by grouping the sequence reads 1 16 into a set of initial clusters based on similarity and maximal cliques 120. The cluster consensus algorithm 1 1 1 generates the cluster consensus 124 for each of the clusters. The ICE component 1 12 then iteratively improves the clustering through iterations of based on the cluster consensus 124 and the quality values associated with the sequence reads, which includes reassigning sequence reads 1 16 from one cluster to another based on computer probabilities 126, and merging substantially similar clusters. The polisher component 1 14 may then generate the final cluster consensus 128 for each of the clusters 122 in accordance with exemplary embodiments, as explained further below.
[0038] The output of the processing may include is a list of final consensus sequences 128, each of which represents the "consensus" of a cluster. In one embodiment, each of the clusters 122 may represent a single, unique transcript. Thus, in one embodiment, the present invention may provide methods and systems for identifying a set of unique full-length transcripts from a mixed population using full-length reads 1 16-1 . [0039] In one embodiment, the results of the process may optionally further comprise quality information, technology information (e.g., peak characteristics, expected error rates), alternate (e.g., second or third best) consensus determination, confidence metrics, and the like. During and after the process of creating the initial clusters, generating cluster consensus, iterative clustering, and generation of the final cluster consensus, the progress and/or result of this processing may be saved to the memory 103 and the data repository 106 and/or output through the I/O 104 for display on a display device and/or saved to an additional storage device (e.g., CD, DVD, Blu-ray, flash memory card, etc.), or printed.
[0040] Figure 2 is a flow diagram illustrating certain aspects of a process for iterative clustering of sequence reads for error correcting by according to an exemplary embodiment. In one embodiment, the process may be used to correct errors in long reads during transcriptome sequencing. The process may be performed by a combination of the sequence aligner/overlapper 1 10, the cluster consensus algorithm 1 1 1 , the ICE component 1 12, and the polisher component 1 14 (Figure 1 ), which although are shown as separate components, the functionality of each may be combined into a lesser or greater number of software algorithms/components.
[0041] The process may begin by receiving a set of sequence reads 1 16 and associated quality values 1 18 (block 200). The sequence reads 1 16 preferably comprise, but are not limited to, a set full-length long reads 1 16-1 . The quality values (QVs) 1 18 are estimations of per-position base call accuracy generated by a sequencing machine.
[0042] The iterative clustering error correction (ICE) component 1 12 groups the sequence reads into a set of initial clusters based on sequence similarity (block 202). The cluster consensus algorithm 1 1 1 generates the cluster consensus 124 for each of the initial clusters (block 204). The ICE component 1 12 iteratively improves clustering based on the cluster consensus and the quality values 1 18 associated with the sequence reads (block 206), as described further below.
[0043] In further embodiments, the process further includes generating and outputting a final cluster consensus 128 for each of the clusters of (block 208). In one embodiment, the final cluster consensus 128 may comprise a list of final cluster consensus sequences, each of which represents the consensus sequence of a cluster (and therefore a transcript in one embodiment). In one embodiment, the final cluster consensus 128 may be generated once the iterative clustering process has completed. In another embodiment where the input comprises full-length reads 1 16-1 , the final cluster consensus may be generated by inputting non-full-length reads 1 16-2 into a final polishing process, which then generates the final cluster consensus 128. As is well known in the art, the final cluster consensus 128 may be saved e.g. in the memory 103 and/or the data repository 106, or sent to the I/O 104 for display on a monitor and/printing by a printer.
[0044] The above steps are described in further detail below.
[0045] Sequence Reads
[0046] One goal of transcript isoform sequencing is to understand transcriptome complexity using accurate, unassembled, full-length long reads. The full-length reads 1 16- 1 are captured and identified automatically by a sequencing machine, but the exemplary embodiments improve the accuracy through iterative clustering. [0047] In the exemplary embodiment, the input sequence reads 1 16 comprise the full-length long reads 1 16-1 of a transcript, for example. However in another embodiment, the input sequence reads 1 16 may comprise the non-full-length reads 1 16-2. The sequence reads 1 16 can optionally comprise redundant sequence information, e.g., where the same transcript was repeatedly sequenced to generate a long sequence read comprising multiple copies of the transcript. Further, the additional information associated with the sequence reads 1 16 may comprise features of underlying sequencing technology output (e.g., trace characteristics (integrated counts per peak, shape/height/width of peaks, distance to neighboring peaks, characteristics of neighboring peaks), signal-to-noise ratios, power-to-noise ratio, background metrics, signal strength, reaction kinetics, etc.), and the like.
[0048] Initial Clustering
[0049] The iterative clustering process comprises two phases. The first phase includes grouping the sequence reads 1 16 into a set of initial clusters based on sequence similarity (block 202). In one embodiment, for example, initial clustering is used to help determine which sequence reads 1 16 originate from the same transcript isoform. The background idea for clustering is the observation of multiple sequence reads originating from multiple copies of the same isoform. For example, the following shows three copies of a transcript read originating the same isoform:
TGGGAGCCTATGCGACAATGAAACCTG ...
TGGAGCAATATGCGAACAATAAAACCTC...
TGGAGCATATGCGAACAATAAAACGGG ... where the bolded bases represent randomly distributed errors that are primarily indels (insertions or deletions). Clustering such reads originating from the same isoform may lead to higher accuracy consensus sequences.
[0050] Alignment
[0051] Figure 2 shows further details of the initial clustering process (block 202). In one embodiment, the initial clustering process may begin by the sequence aligner/overlapper 1 10 aligning the sequence reads 1 16 to create aligned reads. Many known sequence alignment processes may be used, such as for example, mapping single molecule sequencing reads 1 16 using a Basic Local Alignment with Successive Refinement (BLASR) algorithm, further described in U.S. Patent Publication No. 20120330566, and incorporated by reference in its entirety for all purposes.
[0052] Figure 3 is a diagram showing an example portion of two reads from the same isoform that have been aligned using the sequence aligner/overlapper 1 10, creating aligned reads 300. In this example, the first aligned read shown as "Query" has a length of 1 ,675 bp, and the second aligned read shown as "Target" has a length of 1 ,680 bp. The alignment ("nMatch") between the aligned reads 300 is 1 .6 kbp and the percent similarity ("%sim") is 99.1677, which includes 2 insertions and 1 1 deletion indels (indicated by "*").
[0053] After alignment, the next step is to form isoform clusters. One could use a reference genome and align the reads to the reference genome and determine that the reads at a particular locus represent an isoform. However, there are many alternative isoforms per locus that would remain unresolved. In addition, this approach relies on an aligner and requires a good reference genome to begin with, limiting the approach to those applications having a pre-existing reference genome.
[0054] According to one exemplary embodiment, methods and systems are provided for identifying isoform clusters that do not use a reference genome, and are therefore suitable for use in applications in which no reference genome exists.
[0055] Similarity Graph
[0056] Referring again to Figure 2, after alignment, the aligned reads 300 are used to build a similarity graph (block 202-2). The similarity graph is constructed such that each sequence read 1 16 is represented as a node in the graph, and alignments between the sequence reads 1 16 are represented as connecting edges between the nodes to indicate that the two sequence reads have an alignment hit (i.e., a sufficiently high percent of similarity).
[0057] Figure 4 is a diagram illustrating exemplary similarity graphs 400. The algorithm for finding isoform clusters utilizes pairwise alignment in which each node 402 in a similarity graph 400 represents a read, and edges 404 connecting pairs of nodes represent the presence of pairwise alignment, such as that shown in Figure 3 where the query and target reads would be represented in the graph by a pair of nodes 402 and connected by an edge 404 due to their high percentage of similarity.
[0058] Maximal Cliques
[0059] Typically, the similarity graph process results in multiple similarity graphs 400 being formed. Referring again to Figure 2, this is followed by finding all maximal cliques in the similarity graphs (block 202-3). A clique refers to a graph comprising a set of nodes in which for every two nodes 402, there exists an edge connecting the two. A maximal clique is a clique of the largest possible size that does not contain a node 402 that overlaps with another clique. The maximal clique finding algorithm non-deterministically partitions the similarity graphs 400 into non-overlapping maximal cliques. There are many approaches to finding all maximal cliques. In one embodiment, a maximal clique finding algorithm may be run, such as a greedy randomized adaptive search procedure that iteratively constructs a randomized, greedy biased solution, which is then expanded to a local optimal solution. See for example, Abello et al., On maximum clique problems in very large graphs, AT&T labs Research Technical Report, 1998, which is hereby incorporated herein by reference in its entirety for all purposes.
[0060] Partitioning the similarity graphs 400 into non-overlapping maximal cliques requires comparing the sequence reads 1 16 to detect isoform alignment differences to determine if the sequence reads 1 16 belong in the same clique. One method for detecting isoform alignment differences is to detect large gaps in alignment between two aligned reads. For example, if given two aligned reads, where one has a large insertion with respect to the other one, then it is very likely the insertion is an extra xenon, and therefore, an isoform difference can been detected. However, detecting isoform alignment differences becomes more problematic as the gaps in alignment become smaller and smaller. For example, only a seven base insertion difference between two aligned reads may indicate a polymer stretch. What needs to be determined is whether this is a true isoform difference or sequence error. [0061] According to one aspect of the exemplary embodiment, determining an isoform difference from a sequence error may be made by leveraging the fact that every base from the raw read sequences 1 16, including insertions, is associated with a quality values (QV) that estimates per-position accuracy and indicates the probability of whether each base is a substitution error, an insertion error, or deletion error.
[0062] Figure 5 is a diagram illustrating one implementation for distinguishing an isoform difference from sequence errors between aligned reads 300. In one embodiment, a difference array 500 may be used to keep track of positional differences between the two aligned reads 300. Rows for substitutions (S), insertions (I), and deletions (D) above and below the two aligned reads 300 having a "+"at each base position show where the associated QVs 1 18 indicate that it is sufficiently likely an error occurred. Each position in the difference array 500 may include a value, e.g., 0 or 1 , where a 0 value indicates the differences between the two aligned reads 300 is caused by a sequencing error (rather than a true isoform difference), and a 1 value indicates the differences between the two aligned reads 300 cannot be explained by sequencing errors.
[0063] It is then determined if there is any sufficiently large regions of 1 values in the difference array, i.e., looking for a region in the difference array from [I, J] having a range greater than or equal to a threshold length T, and the sum of the region of 1 values is greater than a percentage threshold C:
J - l≥T and sum (D [l:J]) > C * T
[0064] For example, assume that the threshold length T is set to 10 bases, and the percentage threshold C is set to 50%. The difference array 500 would be searched for a region longer than 10 bases in which more than 50% of the bases have a value of 1 . If no such region can be found, then the two aligned reads 300 may be considered to be from the same isoform. In the example shown in Figure 5, no such a region exists in the difference array 500 so the two aligned reads 300 are determined to be from the same isoform, and hence placed in the same clique. If, on the other hand, such a region is found, the two aligned reads 300 would be determined to be from different isoforms and hence not placed the same clique. For further information see Tseng & Tompa, Algorithms for Locating Extremely Conserved Elements in Multiple Sequence Alignments, BMC Bioinformatics (2009), which is incorporated by reference in its entirety for all purposes.
[0065] Note that a clique by definition requires each node 402 to be interconnected. According to the exemplary embodiments, after the maximal clique finding process, the term "clique" will be replaced below with the broader term "cluster" because the edges 404 between the nodes are not required or used after the maximal clique finding process.
[0066] After the alignment, building the similarity graphs and maximal clique finding processes (blocks 202-1 through 202-3), problems with how the sequence reads 1 16 are grouped may still typically remain. That is, there may be ambiguity in the first set of clusters that are formed. For example, in Figure 4, the maximal clique finding process may find ambiguity with respect to a pair of nodes/reads 402 as to which clique the nodes/reads 402 belong. Maximal clique finding is only an initial estimate of the membership of each clique. Therefore, at the end of phase 1 of the process, some sequence reads 1 16 may be assigned to incorrect clusters, and some sequence reads 1 16 that should belong together may be assigned to separate clusters.
[0067] Figure 6 is a diagram illustrating an example of sequence reads initially assigned to incorrect clusters 122, where sequence reads/nodes of the same fill pattern are from the same isoform. As shown, sequence reads labeled 1 -3 have been incorrectly placed in a different cluster than sequence reads 4-5, all of which are from the same isoform. In addition, sequence reads 1 1 and 12 are incorrectly grouped with read 12; and read 6 has been incorrectly separately grouped from reads 7-9.
[0068] Referring again to Figure 2, according to the exemplary embodiments, the processes performed after the initial clustering 202 are designed to resolve the ambiguities of the initial clusters 122.
[0069] Cluster Consensus
[0070] After the initial clusters 122 are formed, the cluster consensus algorithm 1 1 1 generates the cluster consensus 124 for each of the initial clusters (block 204), where each cluster consensus 124 is used to represent the sequence of all the members of the cluster. Cluster consensus generation is well-known in the art. For example, the cluster consensus algorithm 1 1 1 may be based on using directed acyclic graphs to encode multiple sequence alignment, e.g., the DAGCon (Directed Acyclic Graph Consensus) algorithm. Given a collection of aligned reads 300, DAGCon takes a group of paired alignments that are aligned against a reference or backbone/seed to which other reads are aligned (in de novo genome assembly, the longest sequence reads are used as the backbone/seed) to generate a directed acyclic graph, where each path through the graph represents one of the alignments. The graph is then simplified and a most likely path through the graph is determined, which is the consensus. See Chin et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods (2013), which is hereby incorporated by reference. [0071] Figure 7 is a diagram illustrating exemplary cluster consensus, C1 , C2, C3 and C4, generated for each of clusters 122, respectively, where each of the input read sequences 1 16 belongs to exactly one cluster 122|[DAI].
[0072] Referring again to Figure 2, after the cluster consensus generation (block 204), the second phase of the iterative clustering for error correction (ICE) process is invoked. The second phase of ICE begins by iteratively improving the clustering based on the cluster consensus 124 and the quality values 1 18 (block 206). During this process, the read sequences 1 16 are automatically "reassigned" from one cluster to another or designated as "orphans" and used to generate new clusters if the sequence reads are determined to not belong to any of the existing clusters, and highly similar clusters are merged, as explained below.
[0073] Figure 2 shows further details of the process for iteratively improving the clustering (block 206). The ICE component 1 12 may begin the iteration process by calculating a probability of each sequence read (S) belonging to each cluster (C) using the quality values (QVs) (block 206-1 ). This may be accomplished by aligning each of the sequence reads 1 16 in each of the clusters 122 to each of the cluster consensus C. More specifically, each read Si is aligned to each cluster consensus Cu, where "i" = 1 to the total number of sequence reads, and "u" = 1 to the total number of cluster consensus (in the example shown in Figure 7, i = 12 and u = 4).
[0074] If a current sequence read (S) does not align to any of the cluster consensus (C) with a sufficiently high percent of similarity using the process of detecting isoform alignment differences described above (i.e., no isoform hit), then the sequence read is ignored for having a bad probability. In one embodiment, a linear-time algorithm may be used to filter out alignments that have large indels. (See, e.g., Algorithms tor locating extremely conserved elements in multiple sequence alignments. Tseng & Tompa, BMC Bioinformatics, 2009).
[0075] If the current sequence read does align to one or more of the cluster consensus, the ICE component 1 12 calculates a probability of the current read belonging to each of the clusters given the cluster consensus and the QV for the current read:
Pr (Si I Cu, QVs(Si))
If the QVs are unavailable, then:
Pr (Si I Cu QVs(Si)) = (9match)count(match) (1 /3 Bsub)count(sub) (1 /3 Bins)count(ins) (1 /3 Bdel)count(del),
where Θ is the probability of a match of a substitution (sub), an insertion (ins), and a deletion (del), respectively.
[0076] Referring to Figure 7 as an example, when the probabilities are calculated for read 6, the ICE component 1 12 determines that the probability of read 6 belonging to cluster consensus C3 is greater than the probability of read 6 (S6) belonging to cluster consensus C4:
Pr (S6 I C3) > Pr (S6 | C4)
This is likely to happen because cluster C3 contains isoforms from the same group as S6.
[0077] The output of the probability calculation is a list of probabilities calculated for each of the read sequences belonging to each of the clusters 122. In one embodiment, number of probabilities calculated the total number of nodes/sequence reads multiplied by the total number of clusters, with some probabilities having a value of "unknown". [0078] Referring again to Figure 2, after the probabilities are calculated, the ICE component 1 12 reassigns individual sequence reads from one cluster to another cluster having a highest calculated probability (block 206-2).
[0079] Figure 8 is a diagram illustrating reassignment of a sequence read from one cluster to the cluster having the highest computed probability of membership. This example shows reassignment of read 6 from cluster C4 to C3.
[0080] It should be understood that if it is determined that the cluster having the highest calculated probability for the current sequence read is the cluster to which the sequence read is already a member, then no reassignment occurs.
[0081] Referring again to Figure 2, according to another aspect of the exemplary embodiment, if either no alignment exists (i.e., the probabilities are unknown) between any of sequence reads and the clusters) or if the linear-time algorithm rejects all alignments for any of the sequence reads, the sequence reads may be considered orphans, and new clusters may then be formed from the orphans (block 206-3). The new clusters may be formed from the orphans using the same procedure as in the initial phase described above.
[0082] Figure 9 is a diagram illustrating an example of creating a new cluster from an orphan. In this example, it is determined that read S12 does not have an isoform hit. Read S12 is then designated as an orphan and a new cluster C6 is created that contains read S12.
[0083] There is one minor problem with the above approach, which is when an orphan is assigned to a new cluster, e.g. C6, which only has one sequence read, the read is its own representative. Therefore, the computed probability of a cluster with one sequence read will always be 1 , which means no other cluster will have a higher computed probability for the read and the read will not be reassigned to another cluster. Clusters having only one sequence read may be called singletons and have a lack of members to create diversity, resulting in singleton clusters never having a better cluster to which the node can be reassigned even though one may exist.
[0084] According to one embodiment, this problem may be solved by randomly generating a probability for each orphan node. That is, a random number generator may be used to generate a number between a predefined range, between 0 and 1 , for example (other ranges are possible). If the random probability is smaller than a predefined threshold probability, e.g., 0.30, then the orphan is reassigned to one of the clusters having a nonzero computed probability of membership for the orphan.
[0085] Referring again to Figure 2, the clusters are processed to determine if substantially identical clusters exists, and if so, the clusters are merged into a new cluster (block 206-4). Substantially identical clusters may occur during processing due to approximate maximal clique finding and the iterative consensus calling process. In one embodiment, the determination of whether two clusters are substantially identical based on similarity of their cluster consensus sequence may be controlled through user-defined parameters, e.g., percent of similarity => 95%.
[0086] Figure 10 is a diagram illustrating the merger of two clusters. In this example, former cluster C1 and C4 from Figure 9 are determined to be isoform hits and have a threshold percent of similarity greater than 99.5%. Consequently, in Figure 10 clusters C1 and C4 have been merged into a new cluster and a corresponding cluster consensus C7. [0087] Referring again to Figure 2, every time the members of a cluster change, the corresponding cluster consensus may change, as well. Accordingly, the ICE component 1 12 updates the cluster consensus for each of the changed clusters and updates the probability Pr (Si | Cu, QVs(Si)) for all the sequence reads (block 206-5). This is accomplished by calling the cluster consensus algorithm 1 1 1 (block 204) via line 206-6, and thus creating the "iterative loop" of phase II of the iterative clustering for error correction process, the first step of which is recalculating the probabilities of each of the sequence reads (block 206-1 ).
[0088] In one embodiment, the cluster consensus algorithm 1 1 1 may generate a cluster consensus 124 for the clusters every time a change happens. However, in one embodiment a predefined threshold may be optionally used for limiting when the cluster consensus algorithm 1 1 1 is called based on cluster sizes i.e., if the cluster size is relatively large, calling the cluster consensus algorithm 1 1 1 may be skipped in block 206-5 if the number of nodes in a particular cluster is larger than the predefined threshold. In certain embodiments, the cluster consensus algorithm step is parallelizable.
[0089] In one embodiment, new additional sequence reads can be added to the existing cluster sets any time during the second phase by aligning the additional sequence reads against all existing consensus sequences. A new sequence is assigned to an existing cluster C if cluster C has the highest posterior probability and the alignment is not rejected. Otherwise, the sequence read is considered an orphan and can form new clusters as in the initial phase described above. [0090] Once the clusters can be improved no further via reassignment of the sequence reads and/or merging the clusters, the iterative clustering for error correction process (block 206) completes.
[0091] Once the iterative clustering for error correction process is complete, the polisher component 1 14 is called to polish the consensus results (block 208). In one embodiment, the polisher component 1 14 may be based on the Quiver algorithm, as described in U.S. 13/941 ,442, filed July 12, 2013, which is hereby incorporated by reference. As described above, the ICE component 1 12 calls the cluster consensus algorithm 1 1 1 to generate the cluster consensus for each of the clusters. These cluster consensus are used as a "reference" to which the full-length reads 1 16-1 are aligned during the iterative clustering process in one embodiment.
[0092] According to one aspect of exemplary embodiment, the input to the polishing step may comprise the cluster consensus and the non-full-length reads 1 16-2, which are then aligned to each cluster consensus as a reference. During polishing, the non-full- length reads 1 16-2 are used to impart full coverage to the sequence data to provide a higher level of consensus using the same "isoform hit" criteria described above. In one embodiment, unlike full-length input sequences, non-full-length reads 1 16-2 do not have to align exclusively and can belong to multiple clusters. Again, the linear-time algorithm is used to reject non-favorable alignments. Once the non-full-length reads 1 16-2 are aligned to the cluster consensus, the polisher component 1 14 generates a final consensus sequence 128 (Figure 1 ) for each of the clusters. The output from the process may comprise a list of final cluster consensus 128, each of which represents the "consensus" sequence of a cluster. In one embodiment, each cluster may be used to represent a single, unique transcript.
[0093] In a further embodiment, the final cluster consensus 128 may be mapped to a genome, wherein redundancy is removed and the isoforms collapsed, thereby generating high-quality, full-length isoforms.
[0094] According to the exemplary embodiments, the iterative clustering of sequence reads for error correction process may have many applications. For example, ICE may be used for full-length cDNA sequencing, bioinformatics analysis, and biological applications.
[0095] Examples of full-length cDNA sequencing may include, but are not limited to, constructing cDNA libraries enriched in full-length transcripts; size selection using agarose gel or the BluePippin™ system; sequencing transcripts up to 10 kb in full-length; and single-molecule observation of each transcript.
[0096] Besides isoform-level clustering to generate high-quality transcript consensus sequences, examples of bioinformatics analysis may include, but are not limited to, identifying putatively full-length transcripts; and detecting artificial chimeras.
[0097] Finally, examples of biological applications may include, but are not limited to, novel transcripts; alternative splicing; alternative polyadenylation; retained introns; fusion genes; and anti-sense transcription.
[0098] In some embodiments, the system includes a computer-readable medium operatively coupled to the processor that stores instructions for execution by the processor. The instructions may include one or more of the following: instructions for receiving input of sequence reads (and, optionally, reference sequence information), instructions for constructing pre-assembled reads, instructions for aligning sequence reads, instructions for generating string graphs, instructions for generating unitig graphs, instructions for identifying string bundles, instructions for determining primary contigs, instructions for determining associated contigs, instructions for correcting reads, instructions for generating consensus sequences, instructions for generating haplotype sequences, instructions that compute/store information related to various steps of the method (e.g., edges and nodes in a string graph, overlaps and branch points in a string graph, primary and associated contigs, and instructions that record the results of the method.
[0099] In certain aspects, the methods are computer-implemented methods. In certain aspects, the algorithm and/or results (e.g., consensus sequences generated) are stored on computer-readable medium, and/or displayed on a screen or on a paper print-out. In certain aspects, the results are further analyzed, e.g., to identify genetic variants, to identify one or more origins of the sequence information, to identify genomic regions conserved between individuals or species, to determine relatedness between two individuals, to provide an individual with a diagnosis or prognosis, or to provide a health care professional with information useful for determining an appropriate therapeutic strategy for a patient.
[00100] Furthermore, the functional aspects of the invention that are implemented on a computer or other logic processing systems or circuits, as will be understood to one of ordinary skill in the art, may be implemented or accomplished using any appropriate implementation environment or programming language, such as C, C++, Cobol, Pascal, Java, Java-script, HTML, XML, dHTML, assembly or machine code programming, RTL, etc.
[00101] In certain embodiments, the computer-readable media may comprise any combination of a hard drive, auxiliary memory, external memory, server, database, portable memory device (CD-R, DVD, ZIP disk, flash memory cards, etc.), and the like.
[00102] In some aspects, the invention includes an article of manufacture for string graph assembly of polyploid genomes that includes a machine-readable medium containing one or more programs which when executed implement the steps of the invention as described herein.
[00103] It is to be understood that the above description is intended to be illustrative and not restrictive. It readily should be apparent to one skilled in the art that various modifications may be made to the invention disclosed in this application without departing from the scope and spirit of the invention. The scope of the invention should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. Throughout the disclosure various references, patents, patent applications, and publications are cited. Unless otherwise indicated, each is hereby incorporated by reference in its entirety for all purposes. All publications mentioned herein are cited for the purpose of describing and disclosing reagents, methodologies and concepts that may be used in connection with the present invention. Nothing herein is to be construed as an admission that these references are prior art in relation to the inventions described herein.

Claims

What is claimed is:
1 . A method for iterative clustering of sequence reads for error correction, the method performed at least one software component executing on at least one processor, comprising:
receiving a set of sequence reads and associated quality values;
grouping the sequence reads into a set of initial clusters based on sequence similarity;
generating a cluster consensus for each of the initial clusters;
iteratively improving the clustering based on the cluster consensus and the quality values associated with the sequence reads; and
generating and outputting a final cluster consensus for each of the clusters.
2. The method of claim 1 , wherein iteratively improving the clustering further
comprises:
calculating a probability of each sequence read belonging to each cluster using the quality values;
reassigning individual sequence reads from one cluster to another cluster having a highest calculated probability; and
merging highly similar clusters.
3. The method of claim 2, wherein calculating a probability of each sequence read belonging to each cluster using the quality values, further comprises:
aligning each of the sequence reads in each of the clusters to each of the cluster consensus; responsive to a current sequence read not aligning to any of the cluster consensus with a sufficiently high percent of similarity, ignoring the sequence read for having a bad probability;
responsive to the current sequence read (S) aligning to one or more of the cluster consensus (C), calculating a probability of the current read belonging to each of the clusters given the cluster consensus and the quality values (QVs) for the current read:
Pr (Si I Cu, QVs(Si))
4. The method of claim 3, wherein responsive to the QVs being unavailable, then calculating:
Pr (Si I Cu QVs(Si)) = (9match)count(match) (1 /3 Bsub)count(sub) (1 /3 Bins)count(ins) (1 /3 Bdel)count(del),
where Θ is the probability of a match of a substitution (sub), an insertion (ins), and a deletion (del), respectively.
5. The method of claim 2, further comprising:
responsive to no alignment existing between any of the sequence reads and any of the clusters, considering the sequence reads orphans, and forming new clusters from the orphans.
6. The method of claim 5, further comprising:
responsive to a new cluster having only one sequence read, randomly generating a random probability for each orphan node; and
responsive to random probability being smaller than a predefined threshold probability, reassigning the orphan to one of the clusters having a non-zero computed probability of membership for the orphan.
7. The method of claim 1 , wherein the sequence reads received comprise full-length long reads, and wherein the generating and outputting the final cluster consensus further comprises:
inputting non-full-length reads into a final polishing processes that generates the final cluster consensus.
8. The method of claim 7, wherein the reads comprise full-length long reads range in length from 0.5 kb to 1 , 2, 3, 5, 10, 15, 20 kb.
9. The method of claim 1 , wherein grouping the sequence reads into a set of initial clusters based on sequence similarity further comprises:
aligning the sequence reads to create aligned reads;
building similarity graphs using the aligned reads; and
finding maximal cliques using the similarity graphs.
10. The method of claim 9, wherein finding maximal cliques comprises: non- deterministically partitioning the similarity graphs into non-overlapping maximal cliques.
1 1 . An executable software product stored on a computer-readable medium containing program instructions for iterative clustering of sequence reads for error correction, the program instructions executing on at least one processor, comprising:
receiving a set of sequence reads and associated quality values;
grouping the sequence reads into a set of initial clusters based on sequence similarity;
generating a cluster consensus for each of the initial clusters;
iteratively improving the clustering based on the cluster consensus and the quality values associated with the sequence reads; and generating and outputting a final cluster consensus for each of the clusters.
12. The executable software product of claim 1 1 , wherein iteratively improving the clustering further comprises:
calculating a probability of each sequence read belonging to each cluster using the quality values;
reassigning individual sequence reads from one cluster to another cluster having a highest calculated probability; and
merging highly similar clusters.
13. The executable software product of claim 12, wherein calculating a probability of each sequence read belonging to each cluster using the quality values, further comprises: aligning each of the sequence reads in each of the clusters to each of the cluster consensus;
responsive to a current sequence read not aligning to any of the cluster consensus with a sufficiently high percent of similarity, ignoring the sequence read for having a bad probability;
responsive to the current sequence read (S) aligning to one or more of the cluster consensus (C), calculating a probability of the current read belonging to each of the clusters given the cluster consensus and the quality values (QVs) for the current read:
Pr (Si I Cu, QVs(Si))
14. The executable software product of claim 13, wherein responsive to the QVs being unavailable, then calculating:
Pr (Si I Cu QVs(Si)) = (9match)count(match) (1 /3 Bsub)count(sub) (1 /3 Bins)count(ins) (1 /3 9del)count(del), where Θ is the probability of a match of a substitution (sub), an insertion (ins), and a deletion (del), respectively.
15. The executable software product of claim 12, further comprising:
responsive to no alignment existing between any of the sequence reads and any of the clusters, considering the sequence reads orphans, and forming new clusters from the orphans.
16. The executable software product of claim 15, further comprising:
responsive to a new cluster having only one sequence read, randomly generating a random probability for each orphan node; and
responsive to random probability being smaller than a predefined threshold probability, reassigning the orphan to one of the clusters having a non-zero computed probability of membership for the orphan.
17. The executable software product of claim 1 1 , wherein the sequence reads received comprise full-length long reads, and wherein the generating and outputting the final cluster consensus further comprises:
inputting non-full-length reads into a final polishing processes that generates the final cluster consensus.
18. The executable software product of claim 15, wherein the reads comprise full-length long reads range in length from 0.5 kb to 1 , 2, 3, 5, 10, 15, 20 kb.
19. The executable software product of claim 1 1 , wherein grouping the sequence reads into a set of initial clusters based on sequence similarity further comprises:
aligning the sequence reads to create aligned reads;
building similarity graphs using the aligned reads; and finding maximal cliques using the similarity graphs.
20. The executable software product of claim 19, wherein finding maximal cliques comprises: non-deterministically partitioning the similarity graphs into non-overlapping maximal cliques.
21 . A system for iterative clustering of sequence reads for error correction, comprising: a memory; and
a processor coupled to the memory configured to:
receiving a set of sequence reads and associated quality values; grouping the sequence reads into a set of initial clusters based on sequence similarity;
generating a cluster consensus for each of the initial clusters; iteratively improving the clustering based on the cluster consensus and the quality values associated with the sequence reads; and
generating and outputting a final cluster consensus for each of the clusters.
22. The system of claim 21 , wherein iteratively improving the clustering further comprises:
calculating a probability of each sequence read belonging to each cluster using the quality values;
reassigning individual sequence reads from one cluster to another cluster having a highest calculated probability; and
merging highly similar clusters.
23. The system of claim 22, wherein calculating a probability of each sequence read belonging to each cluster using the quality values, further comprises: aligning each of the sequence reads in each of the clusters to each of the cluster consensus;
responsive to a current sequence read not aligning to any of the cluster consensus with a sufficiently high percent of similarity, ignoring the sequence read for having a bad probability;
responsive to the current sequence read (S) aligning to one or more of the cluster consensus (C), calculating a probability of the current read belonging to each of the clusters given the cluster consensus and the quality values (QVs) for the current read:
Pr (Si I Cu, QVs(Si))
24. The system of claim 23, wherein responsive to the QVs being unavailable, then calculating:
Pr (Si I Cu QVs(Si)) = (9match)count(match) (1 /3 Bsub)count(sub) (1 /3 Bins)count(ins) (1 /3 Bdel)count(del),
where Θ is the probability of a match of a substitution (sub), an insertion (ins), and a deletion (del), respectively.
25. The system of claim 22, further comprising:
responsive to no alignment existing between any of the sequence reads and any of the clusters, considering the sequence reads orphans, and forming new clusters from the orphans.
26. The system of claim 25, further comprising:
responsive to a new cluster having only one sequence read, randomly generating a random probability for each orphan node; and responsive to random probability being smaller than a predefined threshold probability, reassigning the orphan to one of the clusters having a non-zero computed probability of membership for the orphan.
27. The system of claim 21 , wherein the sequence reads received comprise full-length long reads, and wherein the generating and outputting the final cluster consensus further comprises:
inputting non-full-length reads into a final polishing processes that generates the final cluster consensus.
28. The system of claim 25, wherein the reads comprise full-length long reads range in length from 0.5 kb to 1 , 2, 3, 5, 10, 15, 20 kb.
29. The system of claim 21 , wherein grouping the sequence reads into a set of initial clusters based on sequence similarity further comprises:
aligning the sequence reads to create aligned reads;
building similarity graphs using the aligned reads; and
finding maximal cliques using the similarity graphs.
30. The system of claim 29, wherein finding maximal cliques comprises: non- deterministically partitioning the similarity graphs into non-overlapping maximal cliques.
PCT/US2014/069539 2013-12-18 2014-12-10 Iterative clustering of sequence reads for error correction WO2015094854A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480069926.4A CN105849555B (en) 2013-12-18 2014-12-10 Sequence reads iteration for error correction clusters
EP14871681.4A EP3084426B1 (en) 2013-12-18 2014-12-10 Iterative clustering of sequence reads for error correction

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201361917777P 2013-12-18 2013-12-18
US61/917,777 2013-12-18
US201462028741P 2014-07-24 2014-07-24
US62/028,741 2014-07-24

Publications (1)

Publication Number Publication Date
WO2015094854A1 true WO2015094854A1 (en) 2015-06-25

Family

ID=53400323

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/069539 WO2015094854A1 (en) 2013-12-18 2014-12-10 Iterative clustering of sequence reads for error correction

Country Status (4)

Country Link
US (1) US20150178446A1 (en)
EP (1) EP3084426B1 (en)
CN (1) CN105849555B (en)
WO (1) WO2015094854A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016191380A1 (en) 2015-05-26 2016-12-01 Pacific Biosciences Of California, Inc. De novo diploid genome assembly and haplotype sequence reconstruction
US20180181707A1 (en) * 2016-11-10 2018-06-28 Life Technologies Corporation Methods, systems and computer readable media to correct base calls in repeat regions of nucleic acid sequence reads
CN110111843B (en) * 2018-01-05 2021-07-06 深圳华大基因科技服务有限公司 Method, apparatus and storage medium for clustering nucleic acid sequences
CN108830047A (en) * 2018-06-21 2018-11-16 河南理工大学 A kind of scaffolding method based on long reading and contig classification
US20230029058A1 (en) * 2021-07-26 2023-01-26 Microsoft Technology Licensing, Llc Computing system for news aggregation
CN115691700B (en) * 2022-11-09 2023-05-02 哈尔滨理工大学 High-entropy alloy hardness prediction method based on dual-granularity clustering integration algorithm of three consensus strategies
CN115985400B (en) * 2022-12-02 2024-03-15 江苏先声医疗器械有限公司 Method for reassigning metagenome multiple comparison sequences and application
CN117668576B (en) * 2023-11-22 2024-06-21 太极计算机股份有限公司 Logic processing method of hierarchical clustering consensus framework

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040171051A1 (en) * 2000-01-31 2004-09-02 Zymogenetics, Inc. Method and system for detecting near identities in large DNA databases
WO2009091798A1 (en) * 2008-01-16 2009-07-23 Helicos Biosciences Corporation Quantitative genetic analysis
US20120330566A1 (en) * 2010-02-24 2012-12-27 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
US20130137588A1 (en) * 2008-09-12 2013-05-30 University Of Washington Sequence tag directed subassembly of short sequencing reads into long sequencing reads
US20130196321A1 (en) * 2012-01-31 2013-08-01 Genomic Health, Inc. Gene Expression Profile Algorithm and Test for Determining Prognosis of Prostate Cancer
WO2013181170A1 (en) * 2012-05-31 2013-12-05 Board Of Regents, The University Of Texas System Method for accurate sequencing of dna

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL121806A0 (en) * 1997-09-21 1998-02-22 Compugen Ltd Method and apparatus for MRNA assembly
WO2003081471A1 (en) * 2002-02-18 2003-10-02 Celestar Lexico-Sciences, Inc. Apparatus for managing gene expression data
CN101278041A (en) * 2005-08-05 2008-10-01 密执安州大学 Genes from actinobacillus succinogenes 13oz (atcc 55618) for production of chemicals from the a. succinogenes C4-pathway
CN101457253B (en) * 2008-12-12 2011-08-31 深圳华大基因研究院 Sequencing sequence error correction method, system and device
US10777301B2 (en) * 2012-07-13 2020-09-15 Pacific Biosciences For California, Inc. Hierarchical genome assembly method using single long insert library
WO2016191380A1 (en) * 2015-05-26 2016-12-01 Pacific Biosciences Of California, Inc. De novo diploid genome assembly and haplotype sequence reconstruction

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040171051A1 (en) * 2000-01-31 2004-09-02 Zymogenetics, Inc. Method and system for detecting near identities in large DNA databases
WO2009091798A1 (en) * 2008-01-16 2009-07-23 Helicos Biosciences Corporation Quantitative genetic analysis
US20130137588A1 (en) * 2008-09-12 2013-05-30 University Of Washington Sequence tag directed subassembly of short sequencing reads into long sequencing reads
US20120330566A1 (en) * 2010-02-24 2012-12-27 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
US20130196321A1 (en) * 2012-01-31 2013-08-01 Genomic Health, Inc. Gene Expression Profile Algorithm and Test for Determining Prognosis of Prostate Cancer
WO2013181170A1 (en) * 2012-05-31 2013-12-05 Board Of Regents, The University Of Texas System Method for accurate sequencing of dna

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3084426A4

Also Published As

Publication number Publication date
EP3084426A1 (en) 2016-10-26
CN105849555A (en) 2016-08-10
EP3084426B1 (en) 2020-04-15
EP3084426A4 (en) 2017-08-16
US20150178446A1 (en) 2015-06-25
CN105849555B (en) 2019-05-14

Similar Documents

Publication Publication Date Title
EP3084426B1 (en) Iterative clustering of sequence reads for error correction
Alser et al. Technology dictates algorithms: recent developments in read alignment
US20240120021A1 (en) Methods and systems for large scale scaffolding of genome assemblies
Olson et al. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes
Meisner et al. Inferring population structure and admixture proportions in low-depth NGS data
US20200399719A1 (en) Systems and methods for analyzing viral nucleic acids
Shi et al. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons
US20140025312A1 (en) Hierarchical genome assembly method using single long insert library
WO2017120128A1 (en) Systems and methods for adaptive local alignment for graph genomes
CN107133493B (en) Method for assembling genome sequence, method for detecting structural variation and corresponding system
WO2013040583A2 (en) Determining variants in a genome of a heterogeneous sample
CN108595915B (en) Third-generation data correction method based on DNA variation detection
US9589102B2 (en) Base sequence cluster generating system, base sequence cluster generating method, program for performing cluster generating method, and computer readable recording medium on which program is recorded and system for providing base sequence information
US20150286775A1 (en) String graph assembly for polyploid genomes
Kremer et al. Approaches for in silico finishing of microbial genome sequences
US20180060484A1 (en) Extending assembly contigs by analyzing local assembly sub-graph topology and connections
US20180322242A1 (en) A System and Method for Compensating Noise in Sequence Data for Improved Accuracy and Sensitivity of DNA Testing
Luo et al. Computational approaches for transcriptome assembly based on sequencing technologies
CN114424288A (en) Method for determining a measure related to the probability that two mutated sequence reads are derived from the same sequence comprising a mutation
WO2016205767A1 (en) String graph assembly for polyploid genomes
Petri et al. isONform: reference-free transcriptome reconstruction from Oxford Nanopore data
WO2021086734A1 (en) Trace reconstruction of polymer sequences using quality scores
Tang et al. Integration of hybrid and self-correction method improves the quality of long-read sequencing data
Krishnan et al. Analysis of among-site variation in substitution patterns
Pan Optical Map-Based Genome Scaffolding

Legal Events

Date Code Title Description
REEP Request for entry into the european phase

Ref document number: 2014871681

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2014871681

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE