US20140296080A1 - Methods, Systems, and Computer Readable Media for Evaluating Variant Likelihood - Google Patents

Methods, Systems, and Computer Readable Media for Evaluating Variant Likelihood

Publication number
US20140296080A1
Authority
US
United States
Prior art keywords
sequencing
ensemble
likelihood
evaluating
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/200,942
Inventor
Earl Hubbell
Sowmi Utiramerur
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Life Technologies Corp
Original Assignee
Life Technologies Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Life Technologies Corp
Priority to US14/200,942
Assigned to Life Technologies Corporation: assignment of assignors' interest (see document for details). Assignors: Hubbell, Earl; Utiramerur, Sowmi
Publication of US20140296080A1
Priority to US15/974,976 (US11636919B2)
Priority to US18/130,134 (US20230360726A1)
Abandoned

Classifications

    • G06F19/22
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C12Q1/6874: Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • This application generally relates to methods, systems, and computer readable media for nucleic acid sequencing, and, more specifically, to methods, systems, and computer readable media for evaluating variant likelihood in nucleic acid sequencing.
  • Nucleic acid sequencing data may be obtained in various ways, including using next-generation sequencing systems such as, for example, the Ion PGM™ and Ion Proton™ systems implementing Ion Torrent™ sequencing technology; see, e.g., U.S. Pat. No. 7,948,015 and U.S. Pat. Appl. Publ. Nos. 2010/0137143, 2009/0026082, and 2010/0282617, which are all incorporated by reference herein in their entirety. There is a need for new methods, systems, and computer readable media that can better evaluate variant likelihood and reduce sequencing errors when analyzing data obtained using these or other sequencing systems/platforms.
  • FIG. 1 illustrates an exemplary system for evaluating variant likelihood.
  • FIG. 2 illustrates components of an exemplary apparatus for nucleic acid sequencing.
  • FIG. 3 illustrates an exemplary flow cell for nucleic acid sequencing.
  • FIG. 4 illustrates an exemplary process for label-free, pH-based sequencing.
  • FIG. 5 illustrates an exemplary computer system.
  • FIG. 6 illustrates an exemplary method for evaluating variant likelihood.
  • FIG. 7A illustrates an exemplary plot of sample frequency.
  • FIG. 7B illustrates an exemplary plot of responsibility for each read.
  • FIG. 8A illustrates exemplary bias terms for each strand for an ensemble of reads.
  • FIG. 8B illustrates exemplary variance components corresponding to homopolymers.
  • FIG. 9A illustrates predicted/measured data ratios for a sequence of nucleotide flows.
  • FIG. 9B illustrates residual values for the same data as in FIG. 9A.
  • FIG. 9C illustrates predicted/measured data ratios for a sequence of nucleotide flows.
  • FIG. 9D illustrates residual values for the same data as in FIG. 9C.
  • a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model
  • a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some template polynucleotide strands in at least one defined space, wherein a plurality of template polynucleotide strands, sequencing primers, and polymerase have been provided in a plurality of defined spaces disposed on a sensor array, and wherein the plurality of template polynucleotide strands, sequencing primers, and polymerase have been exposed to a series of flows of nucleotide species according to a predetermined order; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is
  • a system for evaluating variant likelihood in nucleic acid sequencing including: a plurality of template polynucleotide strands, sequencing primers, and polymerase provided in a plurality of defined spaces disposed on a sensor array; an apparatus configured to expose the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for evaluating variant likelihood, comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in
  • sequence variants including single nucleotide polymorphism (SNPs), insertions, and/or deletions
  • the particular approach/technology used to obtain sequencing data and the particular data analysis approach/methods used to analyze the sequencing data can both affect the accuracy of sequence variant identification.
  • the use of alignment methodologies developed and operating in base space can sometimes lead to insertions or deletions in the alignment of reads obtained using sequencing-by-synthesis technologies that operate at least to some extent in flow space.
  • methods, systems, and computer readable media for evaluating variant likelihood in nucleic acid sequencing are disclosed herein.
  • the various embodiments provide a more generalized method for inferring variant likelihood by considering properties of an ensemble of reads.
  • the ability to infer the confidence of a read without referring to the ensemble is high because bias is expected to be low and variation should be well-estimated.
  • bias and variance is critical to avoid false positives, and it may then be beneficial to consider the ensemble in addition to each read to evaluate confidence.
  • a fundamental problem in variant calling is to assign a posterior distribution to a variant frequency given an observed ensemble of reads.
  • the reads that strongly support a sequence provide information about the frequency of that sequence while those that weakly support the sequence provide little information.
  • the strength of support for a sequence is itself conditional on the ensemble of reads observed. In various embodiments, one may estimate both the error structure for a particular ensemble of reads and the plausible distribution of variants.
  • let $V_1$ and $V_2$ represent two possible sequences, and let $\pi$ represent the frequency of $V_1$ and $(1-\pi)$ represent the frequency of $V_2$.
  • let $R_1, R_2, \ldots, R_N$ represent N measured reads, and let each read have a measurement confidence value of being $V_1$ of $\delta_i$.
  • the likelihood $P(R_1, R_2, \ldots, R_N \mid \pi)$, which is a function of a product between the prior and the likelihood of each measured read under a given value of $\pi$, may be expressed as follows:
  • c is a normalization constant
  • $P(\pi)$ represents a prior probability of $\pi$
  • i is an integer between 1 and N identifying a sequencing read
  • N is an integer specifying a number of sequencing reads in the ensemble.
  • the distribution of the frequency $\pi$ and the confidence values $\delta_i$ may be inferred based on the data for downstream hypothesis evaluation.
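The expression referred to above is not reproduced in this text. As a minimal sketch, assuming each read contributes a two-component mixture term (frequency times confidence, plus one minus frequency times one minus confidence) and assuming a flat prior, the posterior over the frequency can be evaluated on a grid. The function name `frequency_posterior`, the grid resolution, and the example confidence values are illustrative assumptions, not part of the disclosure.

```python
import math

def frequency_posterior(deltas, grid_size=101, prior=None):
    """Grid posterior over the frequency of V1 (assumed mixture form)."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    if prior is None:
        prior = [1.0] * grid_size  # flat prior; an assumption
    log_post = []
    for pi, p in zip(grid, prior):
        ll = math.log(p) if p > 0 else float("-inf")
        for d in deltas:
            # assumed per-read term: pi * delta_i + (1 - pi) * (1 - delta_i)
            term = pi * d + (1.0 - pi) * (1.0 - d)
            ll += math.log(term) if term > 0 else float("-inf")
        log_post.append(ll)
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)  # normalization, playing the role of the constant c
    return grid, [w / total for w in weights]

# Hypothetical confidence values: four reads favor V1, three favor V2,
# and one read (0.5) supports neither.
deltas = [0.99, 0.98, 0.97, 0.02, 0.01, 0.03, 0.95, 0.5]
grid, post = frequency_posterior(deltas)
map_pi = grid[post.index(max(post))]
```

Under this assumed form, a read with confidence 0.5 contributes the same factor at every frequency, which matches the statement that weakly supporting reads provide little information.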
  • let $V_1$ and $V_2$ represent two possible sequences
  • let $\pi$ represent the frequency of $V_1$ and $(1-\pi)$ represent the frequency of $V_2$
  • let $(1-\tau)$ represent a frequency of an outlier event.
  • let $R_1, R_2, \ldots, R_N$ represent N measured reads, and let each read have a measurement confidence value of being $V_1$ of $\delta_i$.
  • c is a normalization constant
  • T represents a density of the outlier event
  • $P(\pi)$ represents a prior probability of $\pi$
  • i is an integer between 1 and N identifying a sequencing read
  • N is an integer specifying a number of sequencing reads in the ensemble.
  • the distribution of the frequencies $\pi$ and $(1-\tau)$ and the confidence values $\delta_i$ may be inferred based on the data for downstream hypothesis evaluation.
  • the frequency $(1-\tau)$ may be assumed to have an (improper) flat density across all reads.
  • the likelihood of observing the N reads depends on the measurement confidence values $\delta_i$ in addition to the frequency $\pi$ (and outlier frequency $\tau$ if included).
  • the measurement confidence values $\delta_i$ could be estimated using methods known in the art for evaluating confidence values, which typically would be read-specific and not specifically adapted to a particular underlying sequencing technology.
  • the confidence values $\delta_i$ may be derived or estimated using ensembles of reads along with predictive models related to the underlying sequencing technology.
  • let $m_{i1}, m_{i2}, \ldots, m_{ij}, \ldots, m_{iM}$ represent a vector of measured values for M nucleotide flows associated with an i-th read (e.g., a set of normalized, calibrated values observed for the i-th read over the M flows), and let $p_{i1}, p_{i2}, \ldots, p_{ij}, \ldots, p_{iM}$ represent a vector of predicted values for the i-th read over the M flows under a predictive model. Any suitable predictive model may be used, recognizing of course that more accurate models are likely to yield better results.
  • the model may be selected depending on the underlying sequencing technology and may, for example, be a model as described in Davey et al., U.S. Pat. Appl. Publ. No. 2012/0109598, published on May 3, 2012, and/or Sikora et al., U.S. Pat. Appl. Publ. No. 2013/0060482, published on Mar. 7, 2013, which are all incorporated by reference herein in their entirety.
  • the normal/Gaussian likelihood model may be too light-tailed and a t-distribution may better reflect the heavy-tailed phenomena.
  • $\Delta_{ij}^x$ may be calculated and $\sigma_{ij}^2$ may be estimated using any suitable method for fitting data to a distribution (or preferably as further discussed in this and/or the next sections).
  • $\Delta_{ij}^x$ and $\sigma_{ij}^2$ may be used to calculate various statistical measures that may be used to obtain estimates of the measurement confidence values $\delta_i$.
  • the statistical measures may be log-likelihood contributions.
  • the log-likelihood contributions under two sequence hypotheses may be used to estimate $\delta_i$ using the following expression:
  • the variation in measurements may be estimated by exploiting the information across the ensemble of reads to better estimate the confidence within each read.
  • each read may be assigned a division of responsibility $\rho_i$ for each sequence based on both the likelihood and the base rate to obtain an estimate of variation for a given division of responsibility.
  • the measurement confidence values $\delta_i$ in the likelihood of observing the N reads conditional on a given frequency may be measures of responsibilities $\rho_i$.
  • $\pi$ may be estimated using any suitable estimation method, and, given a suitable predictive model and some approximation characterizing the distribution of differences between measured and predicted values, $\Delta_{ij}^x$ may be calculated and $\sigma_{ij}^2$ may be estimated using any suitable method for fitting data to a distribution (or preferably as further discussed in this and/or the next section).
  • $\pi$, $\Delta_{ij}^x$, and $\sigma_{ij}^2$ may be used to calculate various statistical measures that may be used to obtain estimates of the measures of responsibility $\rho_i$.
  • the statistical measures may be log-likelihood contributions.
  • the log-likelihood contributions under two sequence hypotheses may be used to estimate $\rho_i$ using the following expression:
  • $$\rho_i = \frac{\pi}{\pi + (1 - \pi)\,\exp\left(LL_{yi} - LL_{xi}\right)}.$$
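The responsibility expression above can be evaluated directly once a log-likelihood is available for each of the two sequence hypotheses. The sketch below assumes independent Gaussian residuals per flow, which the text itself notes may be too light-tailed; `gaussian_loglik`, `responsibility`, the overflow guard, and all numbers are hypothetical.

```python
import math

def gaussian_loglik(measured, predicted, variances):
    """Log-likelihood of one read's flow values under one hypothesis
    (assumes independent normal residuals per flow)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (m - p) ** 2 / v
        for m, p, v in zip(measured, predicted, variances)
    )

def responsibility(pi, ll_x, ll_y):
    """Frequency-weighted share of a read attributed to hypothesis x:
    pi / (pi + (1 - pi) * exp(LL_y - LL_x))."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    diff = ll_y - ll_x
    if diff > 700.0:  # exp() would overflow; the share is effectively zero
        return 0.0
    return pi / (pi + (1.0 - pi) * math.exp(diff))

# Hypothetical read over four flows with predictions under two hypotheses.
measured = [1.02, 0.05, 1.96, 0.03]
pred_x = [1.0, 0.0, 2.0, 0.0]
pred_y = [1.0, 0.0, 1.0, 1.0]
variances = [0.01] * 4

ll_x = gaussian_loglik(measured, pred_x, variances)
ll_y = gaussian_loglik(measured, pred_y, variances)
rho = responsibility(0.5, ll_x, ll_y)
```

When the two log-likelihoods are equal, the result falls back to the frequency itself, which is the "base rate" behavior the text describes.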
  • the quantity determining the variance relevant to the sequence change of interest is then the expected value of the squared deviation across reads.
  • the variance $\hat{\sigma}^2$ may be estimated using the evaluated $\Delta_{ij}^x$ and estimated $\rho_i$, using the following expression:
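The expression itself is not reproduced in this text. One reading of "the expected value of the squared deviation across reads" is a responsibility-weighted mean of squared residuals under the two hypotheses. The sketch below shows that reading only; `pooled_variance`, its arguments, and the single shared variance are assumptions.

```python
def pooled_variance(residuals_x, residuals_y, responsibilities):
    """Responsibility-weighted mean squared residual over all reads and flows.

    residuals_x[i][j] and residuals_y[i][j] are measured-minus-predicted values
    for read i, flow j under hypotheses x and y (assumed definitions).
    """
    total = 0.0
    count = 0
    for res_x, res_y, rho in zip(residuals_x, residuals_y, responsibilities):
        for ex, ey in zip(res_x, res_y):
            # Each hypothesis contributes in proportion to its share of the read.
            total += rho * ex ** 2 + (1.0 - rho) * ey ** 2
            count += 1
    if count == 0:
        raise ValueError("no residuals supplied")
    return total / count

# Two hypothetical reads fully attributed to hypothesis x.
res_x = [[0.1, -0.1], [0.2, 0.0]]
res_y = [[1.1, -1.1], [1.2, 1.0]]
var_hat = pooled_variance(res_x, res_y, [1.0, 1.0])
```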
  • the components may be defined in various ways.
  • the components $\sigma_m^2$ may consist of a baseline component (the variance contributed to a flow by simply existing without any reference to incorporation), $\sigma_X^2$ (a term for incorporation components not explicitly modeled), and then all the other components.
  • the components $\sigma_m^2$ may correspond to each integer homopolymer length.
  • the components $\sigma_m^2$ may be defined according to the polymerase state variables (e.g., 10% of the variance is provided by a given base or homopolymer run in the hypothesis in a given flow) and fitted according to variation contributed by a genomic location.
  • initial estimates for the latent variables may be updated using an expectation-maximization (EM) methodology and a method-of-moments approximation.
  • the use of a multiplicative proportion of each squared residual to estimate the contribution towards the component ensures that estimates are always positive for variances.
  • the likelihood $P(R_1, R_2, \ldots, R_N \mid \pi)$ of observing the N reads may be maximized, repeatedly, by estimating (i) the frequency $\pi$ (and outlier frequency $\tau$ if included), (ii) either the measurement confidence values $\delta_i$ (which may be functions of $\Delta_{ij}^x$ and $\sigma_{ij}^2$ as described above) or the measures of responsibility $\rho_i$ (which may be functions of $\pi$, $\Delta_{ij}^x$, and $\sigma_{ij}^2$ as described above), and (iii) the variances $\sigma_{ij}^2$ (which may be expressed various ways in terms of latent components as described above) individually conditional on the values of the others and the observed data.
  • the likelihood may be first maximized based on $\pi$ while holding $\delta_i$ (or $\rho_i$) and $\sigma_{ij}^2$ fixed, then based on $\delta_i$ (or $\rho_i$) while holding $\pi$ and $\sigma_{ij}^2$ fixed, then based on $\sigma_{ij}^2$ while holding $\pi$ and $\delta_i$ (or $\rho_i$) fixed, and then iterating as needed.
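The alternating scheme described above (frequency, then per-read responsibilities, then variance, repeated) can be sketched as a small loop. This is an illustration under simplifying assumptions, not the disclosed implementation: Gaussian residuals, one variance shared by all flows, a flat prior on the frequency, and no outlier or bias terms. The function `fit_ensemble`, the starting values, and the data are hypothetical.

```python
import math

def fit_ensemble(measured, pred_x, pred_y, n_iter=50, tol=1e-9):
    """Alternate updates of frequency, responsibilities, and a shared variance."""
    n = len(measured)
    # Squared residual totals per read under each hypothesis.
    sq = []
    for m, px, py in zip(measured, pred_x, pred_y):
        sx = sum((a - b) ** 2 for a, b in zip(m, px))
        sy = sum((a - b) ** 2 for a, b in zip(m, py))
        sq.append((sx, sy))
    n_values = sum(len(m) for m in measured)
    rhos = [0.5] * n      # assumed starting responsibilities
    pi, var = 0.5, 1.0    # assumed starting frequency and variance
    for _ in range(n_iter):
        # (1) frequency with responsibilities and variance held fixed
        new_pi = min(max(sum(rhos) / n, 1e-6), 1.0 - 1e-6)
        # (2) responsibilities with frequency and variance held fixed
        rhos = []
        for sx, sy in sq:
            diff = min(-0.5 * (sy - sx) / var, 700.0)  # LL_y - LL_x, capped
            rhos.append(new_pi / (new_pi + (1.0 - new_pi) * math.exp(diff)))
        # (3) variance with frequency and responsibilities held fixed
        num = sum(r * sx + (1.0 - r) * sy for r, (sx, sy) in zip(rhos, sq))
        new_var = max(num / n_values, 1e-12)
        done = abs(new_pi - pi) < tol and abs(new_var - var) < tol
        pi, var = new_pi, new_var
        if done:
            break
    return pi, var, rhos

# Hypothetical ensemble: four reads resemble hypothesis x, two resemble y.
px, py = [1.0, 0.0, 2.0, 0.0], [1.0, 0.0, 1.0, 1.0]
measured = [
    [1.02, 0.05, 1.96, 0.03], [0.97, -0.02, 2.05, 0.01],
    [1.01, 0.03, 1.98, -0.04], [0.99, 0.00, 2.03, 0.02],
    [1.03, -0.01, 0.97, 1.04], [0.98, 0.02, 1.05, 0.96],
]
pi_hat, var_hat, rhos_hat = fit_ensemble(measured, [px] * 6, [py] * 6)
```

Updating one quantity at a time keeps each step simple; a fuller treatment would carry per-flow variance components and a prior on the frequency.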
  • the posterior distribution for the variant frequency conditional on the maximum a posteriori value of the variance may then be obtained.
  • Descriptions of methods for finding maximum likelihood or maximum a posteriori estimates of a plurality of parameters in a statistical model, including the expectation-maximization algorithm and other algorithms, may be found in various statistical papers, such as Dempster et al., “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society Series B, 39(1):1-38 (1977); and Roche, “EM algorithm and variants: an informal tutorial,” arXiv:1105.1476v2 [stat.CO] (2012) (available at http://arxiv.org/abs/1105.1476v2).
  • variant calling may be further improved by capturing systematic bias between the measurements and predicted values (e.g., systematic underestimating of predicted values that may lead to systematic undercalls) that may affect the ability to distinguish between sequence hypotheses.
  • the biases may be estimated in various ways.
  • the two biases may be inferred from the data together with (i) the frequency $\pi$ (and outlier frequency $\tau$ if included), (ii) either the measurement confidence values $\delta_i$ (which may be functions of $\Delta_{ij}^x$ and $\sigma_{ij}^2$ as described above) or the measures of responsibility $\rho_i$ (which may be functions of $\pi$, $\Delta_{ij}^x$, and $\sigma_{ij}^2$ as described above), and (iii) the variances $\sigma_{ij}^2$ (which may be expressed various ways in terms of latent components as described above) individually conditional on the values of the others and the observed data.
  • this may be done using any suitable statistical method for obtaining maximum likelihood and/or maximum a posteriori estimates of some or all of these parameters given the observed data and any underlying assumptions, including by iteratively maximizing over one variable while holding the others fixed.
  • a prior to the bias which may be a normal distribution with mean zero and variance $1/t^2$ (that is, $N(0, 1/t^2)$), where t is the approximate precision of the bias term, which may be learned within an experiment by exploiting the fact that most positions are reference.
  • the bias maximizing the likelihood conditional on the other latent variables may be determined using the following expression:
  • $$\beta_f = \left(\sum \frac{d_{ij}^2}{\sigma_{ij}^2}\right)^{-1} \left(\sum \frac{d_{ij}^2\,\bigl(\rho_i\,(m_{ij} - p_{ij}^x) + (1 - \rho_i)\,(m_{ij} - p_{ij}^y)\bigr)}{\sigma_{ij}^2}\right),$$
  • the bias on the reads mapping to the reverse strand may be estimated similarly, except that the sum is then taken over all reads on the reverse strand and all relevant flows.
  • a shrinkage term $1/t^2$ may be added to the denominator to shrink the bias estimate towards zero, thus incorporating the prior.
  • the shrinkage term essentially acts as a tuning parameter for the precision of the bias prior. If set high, the approach reduces to the case without bias being postulated, as the bias becomes zero, as in the case of independent reads.
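The bias estimate and its shrinkage term can be sketched as a weighted average of residuals over the reads of one strand. The per-flow weights `d` are taken as given inputs because their definition does not appear in this text. The function name `strand_bias`, the argument layout, and the example values are assumptions.

```python
def strand_bias(measured, pred_x, pred_y, rhos, d, variances, t=None):
    """Weighted bias estimate for the reads of one strand.

    All arguments except rhos and t are indexed [read][flow]. If t is given,
    1 / t**2 is added to the denominator to shrink the estimate towards zero.
    """
    num = 0.0
    den = 0.0
    for i, rho in enumerate(rhos):
        for j, m in enumerate(measured[i]):
            weight = d[i][j] ** 2 / variances[i][j]
            # Residual blended across the two hypotheses by responsibility.
            resid = rho * (m - pred_x[i][j]) + (1.0 - rho) * (m - pred_y[i][j])
            num += weight * resid
            den += weight
    if t is not None:
        den += 1.0 / t ** 2  # shrinkage term from the assumed normal prior
    return num / den if den > 0.0 else 0.0

# Two hypothetical forward-strand reads, each measured 0.1 above prediction.
measured = [[1.1, 0.1], [1.1, 0.1]]
pred_x = [[1.0, 0.0], [1.0, 0.0]]
pred_y = [[0.0, 1.0], [0.0, 1.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]

bias_plain = strand_bias(measured, pred_x, pred_y, [1.0, 1.0], ones, ones)
bias_shrunk = strand_bias(measured, pred_x, pred_y, [1.0, 1.0], ones, ones, t=0.5)
```

With `t=0.5` the shrinkage term equals the summed weights in this example, so the estimate is halved. Larger shrinkage terms push it further towards zero, consistent with the limiting case described above.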
  • FIG. 1 illustrates an exemplary system for evaluating variant likelihood.
  • the exemplary system includes an apparatus or sub-system for nucleic acid sequencing and/or analysis 11 , a computing server/node/device 12 including a variant calling engine 14 , and a display 16 , which may be internal and/or external.
  • the apparatus or sub-system for nucleic acid sequencing and/or analysis 11 may be any type of instrument that can generate nucleic acid sequence data from nucleic acid samples, which may include a nucleic acid sequencing instrument, a real-time/digital/quantitative PCR instrument, a microarray scanner, etc.
  • the computing server/node/device 12 may be a workstation, mainframe computer, distributed computing node (part of a “cloud computing” or distributed networking system), personal computer, mobile device, etc.
  • the computing server/node/device 12 may be configured to host a pre-variant calling processing engine 13 , which may be configured to include various signal/data processing modules that may be configured to receive signal/data from the apparatus or sub-system for nucleic acid sequencing and/or analysis 11 and perform various processing steps, such as conversion from flow space to base space, determination of base calls, determination of base call quality values, preparation of read data for use by a mapping module, and/or alignment and/or mapping of reads to a reference sequence or genome, which may be a whole/partial genome, whole/partial exome, etc.
  • the exemplary system may also include a client device terminal 17 , which may include a data analysis API and may be communicatively connected to the computing server/node/device 12 via a network connection 18 that may be a “hardwired” physical network connection (e.g., Internet, LAN, WAN, VPN, etc.) or a wireless network connection (e.g., Wi-Fi, WLAN, etc.).
  • the exemplary system may also include a post-variant calling processing engine 15 , which may be configured to include various signal/data processing modules that may be configured to apply post-processing to variant calls, which may include annotating various variant calls and/or features, converting data from flow space to base space, filtering of variants (e.g., based on a minimum score threshold, a minimum number of reads including the variant, a minimum frequency of reads including the variant, a minimum mapping quality, a strand probability, and region filtering, for example), and formatting the variant data for display or use by client device terminal 17 .
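The variant filtering step listed above can be illustrated with a small predicate over candidate calls. This is a sketch only: the `VariantCall` fields, the function `passes_filters`, and every threshold value are hypothetical, and strand probability and region filtering are omitted.

```python
from dataclasses import dataclass

@dataclass
class VariantCall:
    position: int
    score: float
    supporting_reads: int
    total_reads: int
    mapping_quality: float

def passes_filters(call, min_score=10.0, min_reads=5,
                   min_frequency=0.05, min_mapq=20.0):
    """Apply minimum score, read count, read frequency, and mapping quality."""
    if call.total_reads == 0:
        return False
    frequency = call.supporting_reads / call.total_reads
    return (
        call.score >= min_score
        and call.supporting_reads >= min_reads
        and frequency >= min_frequency
        and call.mapping_quality >= min_mapq
    )

good = VariantCall(position=1234, score=35.0, supporting_reads=40,
                   total_reads=100, mapping_quality=50.0)
weak = VariantCall(position=5678, score=35.0, supporting_reads=3,
                   total_reads=100, mapping_quality=50.0)
kept = [c for c in (good, weak) if passes_filters(c)]
```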
  • the apparatus or sub-system for nucleic acid sequencing and/or analysis 11 and the computing server/node/device 12 may be integrated into a single instrument or system comprising components present in a single enclosure 19 .
  • the client device terminal 17 may be configured to communicate information to and/or control the operation of the computing server/node/device 12 and its modules and/or operating parameters.
  • FIG. 2 illustrates components of an exemplary apparatus for nucleic acid sequencing.
  • the components include a flow cell and sensor array 100 , a reference electrode 108 , a plurality of reagents 114 , a valve block 116 , a wash solution 110 , a valve 112 , a fluidics controller 118 , lines 120 / 122 / 126 , passages 104 / 109 / 111 , a waste container 106 , an array controller 124 , and a user interface 128 .
  • the flow cell and sensor array 100 includes an inlet 102 , an outlet 103 , a microwell array 107 , and a flow chamber 105 defining a flow path of reagents over the microwell array 107 .
  • the reference electrode 108 may be of any suitable type or shape, including a concentric cylinder with a fluid passage or a wire inserted into a lumen of passage 111 .
  • the reagents 114 may be driven through the fluid pathways, valves, and flow cell by pumps, gas pressure, or other suitable methods, and may be discarded into the waste container 106 after exiting the flow cell and sensor array 100 .
  • the reagents 114 may, for example, contain dNTPs to be flowed through passages 130 and through the valve block 116 , which may control the flow of the reagents 114 to flow chamber 105 (also referred to herein as a reaction chamber) via passage 109 .
  • the system may include a reservoir 110 for containing a wash solution that may be used to wash away dNTPs, for example, that may have previously been flowed.
  • the microwell array 107 may include an array of defined spaces, such as microwells, for example, that is operationally associated with a sensor array so that, for example, each microwell has a sensor suitable for detecting an analyte or reaction property of interest.
  • the microwell array 107 may preferably be integrated with the sensor array as a single device or chip.
  • the array controller 124 may provide bias voltages and timing and control signals to the sensor, and collect and/or process output signals.
  • the user interface 128 may display information from the flow cell and sensor array 100 as well as instrument settings and controls, and allow a user to enter or set instrument settings and controls.
  • the valve 112 may be shut to prevent any wash solution 110 from flowing into passage 109 as the reagents are flowing. Although the flow of wash solution may be stopped, there may still be uninterrupted fluid and electrical communication between the reference electrode 108 , passage 109 , and the sensor array 107 .
  • the distance between the reference electrode 108 and the junction between passages 109 and 111 may be selected so that little or no amount of the reagents flowing in passage 109 and possibly diffusing into passage 111 reach the reference electrode 108 .
  • the fluidics controller 118 may be programmed to control driving forces for flowing reagents 114 and the operation of valve 112 and valve block 116 to deliver reagents to the flow cell and sensor array 100 according to a predetermined reagent flow ordering.
  • defined space generally refers to any space (which may be in one, two, or three dimensions) in which at least some of a molecule, fluid, and/or solid can be confined, retained and/or localized.
  • the space may be a predetermined area (which may be a flat area) or volume, and may be defined, for example, by a depression or a micro-machined well in or associated with a microwell plate, microtiter plate, microplate, or a chip, or by isolated hydrophobic areas on a generally hydrophobic surface.
  • defined spaces may be arranged as an array, which may be a substantially planar one-dimensional or two-dimensional arrangement of elements such as sensors or wells.
  • Defined spaces may be in electrical communication with at least one sensor to allow detection or measurement of one or more detectable or measurable parameters or characteristics.
  • the sensors may convert changes in the presence, concentration, or amounts of reaction by-products (or changes in ionic character of reactants) into an output signal, which may be registered electronically, for example, as a change in a voltage level or a current level which, in turn, may be processed to extract information or signal about a chemical reaction or desired association event, for example, a nucleotide incorporation event and/or a related ion concentration (e.g., a pH measurement).
  • the sensors may include at least one ion sensitive field effect transistor (“ISFET”) or chemically sensitive field effect transistor (“chemFET”).
  • FIG. 3 illustrates an exemplary flow cell for nucleic acid sequencing.
  • the flow cell 200 includes a microwell array 202 , a sensor array 205 , and a flow chamber 206 in which a reagent flow 208 may move across a surface of the microwell array 202 , over open ends of microwells in the microwell array 202 .
  • the flow of reagents (e.g., nucleotide species)
  • a microwell 201 in the microwell array 202 may have any suitable volume, shape, and aspect ratio.
  • a sensor 214 in the sensor array 205 may be an ISFET or a chemFET sensor with a floating gate 218 having a sensor plate 220 separated from the microwell interior by a passivation layer 216 , and may be predominantly responsive to (and generate an output signal related to) an amount of charge 224 present on the passivation layer 216 opposite of the sensor plate 220 .
  • Changes in the amount of charge 224 cause changes in the current between a source 221 and a drain 222 of the sensor 214 , which may be used directly to provide a current-based output signal or indirectly with additional circuitry to provide a voltage output signal.
  • Reactants, wash solutions, and other reagents may move into microwells primarily by diffusion 240 .
  • One or more analytical reactions to identify or determine characteristics or properties of an analyte of interest may be carried out in one or more microwells of the microwell array 202 . Such reactions may generate directly or indirectly by-products that affect the amount of charge 224 adjacent to the sensor plate 220 .
  • a reference electrode 204 may be fluidly connected to the flow chamber 206 via a flow passage 203 .
  • the microwell array 202 and the sensor array 205 may together form an integrated unit forming a bottom wall or floor of the flow cell 200 .
  • one or more copies of an analyte may be attached to a solid phase support 212 , which may include microparticles, nanoparticles, beads, gels, and may be solid and porous, for example.
  • the analyte may include one or more copies of a nucleic acid analyte obtained using any suitable technique.
  • FIG. 4 illustrates an exemplary process for label-free, pH-based sequencing.
  • a template 682 with sequence 685 and a primer binding site 681 are attached to a solid phase support 680 .
  • the template 682 may be attached as a clonal population to a solid support, such as a microparticle or bead, for example, and may be prepared as disclosed in Leamon et al., U.S. Pat. No. 7,323,305.
  • the template may be associated with a substrate surface or present in a liquid phase with or without being coupled to a support.
  • a primer 684 and DNA polymerase 686 are annealed to the template 682 so that the primer's 3′ end may be extended by a polymerase and that a polymerase is bound to such primer-template duplex (or in close proximity thereof) so that binding and/or extension may take place when dNTPs are added.
  • dNTP shown as dATP
  • the DNA polymerase 686 incorporates a nucleotide “A” (since “T” is the next nucleotide in the template 682 and is complementary to the flowed dATP nucleotide).
  • a wash is performed.
  • step 692 the next dNTP (shown as dCTP) is added, and the DNA polymerase 686 incorporates a nucleotide “C” (since “G” is the next nucleotide in the template 682 ). More details about pH-based nucleic acid sequencing may be found in U.S. Pat. No. 7,948,015 and U.S. Pat. Appl. Publ. Nos. 2010/0137143, 2009/0026082, and 2010/0282617.
  • the primer-template-polymerase complex may be subjected to a series of exposures of different nucleotides in a pre-determined sequence or ordering. If one or more nucleotides are incorporated, then the signal resulting from the incorporation reaction may be detected, and after repeated cycles of nucleotide addition, primer extension, and signal acquisition, the nucleotide sequence of the template strand may be determined. The output signals measured throughout this process depend on the number of nucleotide incorporations. Specifically, in each addition step, the polymerase extends the primer by incorporating added dNTP only if the next base in the template is complementary to the added dNTP.
  • a hydrogen ion is released, and collectively a population of released hydrogen ions changes the local pH of the reaction chamber.
  • the production of hydrogen ions may be monotonically related to the number of contiguous complementary bases (e.g., homopolymers) in the template.
  • Deliveries of nucleotides to a reaction vessel or chamber may be referred to as “flows” of nucleotide triphosphates (or dNTPs).
  • a flow of dATP will sometimes be referred to as “a flow of A” or “an A flow,” and a sequence of flows may be represented as a sequence of letters, such as “ATGT” indicating “a flow of dATP, followed by a flow of dTTP, followed by a flow of dGTP, followed by a flow of dTTP.”
  • the predetermined ordering may be based on a cyclical, repeating pattern consisting of consecutive repeats of a short pre-determined reagent flow ordering (e.g., consecutive repeats of pre-determined sequence of four nucleotide reagents such as, for example, “ACTG ACTG . . .
  • reagent flows may be based in whole or in part on some other pattern of reagent flows (such as, e.g., any of the various reagent flow orderings discussed herein and/or in Hubbell et al., U.S. Pat. Appl. Publ. No. 2012/0264621, published Oct. 18, 2012, which is incorporated by reference herein in its entirety), and may also be based on some combination thereof.
  • output signals due to nucleotide incorporation may be processed, given knowledge of what nucleotide species were flowed and in what order to obtain such signals, to make base calls for the flows and compile consecutive base calls associated with a sample nucleic acid template into a read.
  • a base call refers to a particular nucleotide identification (e.g., dATP (“A”), dCTP (“C”), dGTP (“G”), or dTTP (“T”)).
  • Base calling may include performing one or more signal normalizations, signal phase and signal decay (e.g., enzyme efficiency loss) estimations, and signal corrections, and may identify or estimate base calls for each flow for each defined space. Any suitable base calling method may be used.
  • base calling may be performed as described in Davey et al., U.S. Pat. Appl. Publ. No. 2012/0109598, published on May 3, 2012, and/or Sikora et al., U.S. Pat. Appl. Publ. No. 2013/0060482, published on Mar. 7, 2013.
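  • As a minimal sketch of the flow-space picture described above, the following assumes an idealized, noise-free, fully in-phase reaction in which each flow extends the primer through the entire matching homopolymer. The function names `flow_incorporations` and `call_bases` are hypothetical and are not taken from any cited base caller; real base calling also involves the normalization, phase, and decay corrections mentioned above.

```python
# Idealized flow-space model: each flow incorporates as many bases as match
# the flowed nucleotide at the current position of the growing strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flow_incorporations(template, flow_order, n_flows):
    """Return the number of bases incorporated at each of n_flows flows.

    `template` lists template bases in the order the polymerase reads them
    (an assumption of this sketch); `flow_order` is the repeating cycle of
    flowed nucleotides, e.g. "ACTG".
    """
    synthesized = [COMPLEMENT[base] for base in template]
    position = 0
    counts = []
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # Extend through the whole homopolymer if the flowed base matches.
        while position < len(synthesized) and synthesized[position] == nucleotide:
            incorporated += 1
            position += 1
        counts.append(incorporated)
    return counts


def call_bases(counts, flow_order):
    """Compile per-flow incorporation counts into a read."""
    return "".join(
        flow_order[flow % len(flow_order)] * n for flow, n in enumerate(counts)
    )


counts = flow_incorporations("TGGCA", "ACTG", 8)
read = call_bases(counts, "ACTG")
```

  • In this sketch a 2-mer produces a count of 2 in a single flow, which mirrors the statement that the hydrogen ion signal is monotonically related to homopolymer length.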
  • FIG. 5 illustrates an exemplary computer system.
  • the computer system 501 includes a bus 502 or other communication mechanism for communicating information, a processor 503 coupled to the bus 502 for processing information, and a memory 505 coupled to the bus 502 for dynamically and/or statically storing information.
  • the computer system 501 can also include one or more co-processors 504 coupled to the bus 502 , such as GPUs and/or FPGAs, for performing specialized processing tasks; a display 506 coupled to the bus 502 , such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user; an input device 507 coupled to the bus 502 , such as a keyboard including alphanumeric and other keys, for communicating information and command selections to the processor 503 ; a cursor control device 508 coupled to the bus 502 , such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to the processor 503 and for controlling cursor movement on display 506 ; and one or more storage devices 509 coupled to the bus 502 , such as a magnetic disk or an optical disk, for storing information and instructions.
  • the memory 505 may include a random access memory (RAM) or other dynamic storage device and/or a read only memory (ROM) or
  • one or more features of the teachings and/or embodiments described herein may be performed or implemented using appropriately configured and/or programmed hardware and/or software elements.
  • Examples of hardware elements may include processors, microprocessors, input(s) and/or output(s) (I/O) device(s) (or peripherals) that are communicatively coupled via a local interface circuit, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
  • the local interface may include, for example, one or more buses or other wired or wireless connections, controllers, buffers (caches), drivers, repeaters and receivers, etc., to allow appropriate communications between hardware components.
  • a processor is a hardware device for executing software, particularly software stored in memory.
  • the processor can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer, a semiconductor based microprocessor (e.g., in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions.
  • a processor can also represent a distributed processing architecture.
  • the I/O devices can include input devices, for example, a keyboard, a mouse, a scanner, a microphone, a touch screen, an interface for various medical devices and/or laboratory instruments, a bar code reader, a stylus, a laser reader, a radio-frequency device reader, etc.
  • the I/O devices also can include output devices, for example, a printer, a bar code printer, a display, etc.
  • the I/O devices further can include devices that communicate as both inputs and outputs, for example, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, etc.
  • Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
  • the software in memory may include one or more separate programs, which may include ordered listings of executable instructions for implementing logical functions.
  • the software in memory may include a system for identifying data streams in accordance with the present teachings and any suitable custom made or commercially available operating system (O/S), which may control the execution of other computer programs such as the system, and may provide scheduling, input-output control, file and data management, memory management, communication control, etc.
  • one or more features of teachings and/or embodiments described herein may be performed or implemented using appropriately configured and/or programmed non-transitory machine-readable medium or article that may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments.
  • a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, scientific or laboratory instrument, etc., and may be implemented using any suitable combination of hardware and/or software.
  • the machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, read-only memory compact disc (CD-ROM), recordable compact disc (CD-R), rewriteable compact disc (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disc (DVD), a tape, a cassette, etc., including any medium suitable for use in a computer.
  • Memory can include any one or a combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, EPROM, EEROM, Flash memory, hard drive, tape, CDROM, etc.). Moreover, memory can incorporate electronic, magnetic, optical, and/or other types of storage media. Memory can have a distributed architecture where various components are situated remote from one another, but are still accessed by the processor.
  • the instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, etc., implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
  • one or more features of the teachings and/or embodiments described herein may be performed or implemented at least partly using a distributed, clustered, remote, or cloud computing resource.
  • FIG. 6 illustrates an exemplary method for evaluating variant likelihood.
  • a user or component provides a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array.
  • a user or component exposes the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order.
  • a server or other computing means or resource obtains measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces.
  • the measured values may include voltage data indicative of hydrogen ion concentrations, which may be processed and analyzed to yield sequences of base calls for the reads, which may in turn be aligned and mapped.
  • the server or other computing means or resource evaluates a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands.
  • FIG. 7A illustrates an exemplary plot of sample frequency.
  • the y-axis shows the posterior density of the sample frequency $\pi$ conditional on the other values after the maximization algorithm has converged.
  • the x-axis shows the major frequency.
  • the density is derived from maximization of the ensemble likelihood of all reads as described herein by estimating the sample frequency $\pi$, the per-read confidence (or responsibility) values, the variances $\sigma_{ij}^2$, and the two biases. The density is scaled so that the maximum likelihood is 1.0 (which occurs at 100% frequency of the variant allele).
  • FIG. 7B illustrates an exemplary plot of responsibility for each read.
  • the y-axis shows the responsibility $\rho_i$ for each read, calculated at the maximum likelihood estimate for the sample frequency.
  • the x-axis shows integers corresponding to the reads.
  • Two data points are plotted per read—one is the responsibility for reference (triangle, near 100%), and one is the responsibility for variant allele (cross, near 0%). (Not shown is the outlier case, which makes these two responsibilities not quite add to 1.0.)
  • the strand for the read is indicated by gray (reverse) or black (forward).
  • FIG. 8A illustrates exemplary bias terms for each strand for an ensemble of reads.
  • the graph shows the computed bias terms for each strand for the ensemble of reads.
  • the y-axis shows the reverse bias.
  • the x-axis shows the forward bias. In this case, there is exactly one point, indicating that when the procedure has converged on this set of reads for this variant, the estimate of bias on the forward strand is approximately −0.5 and the estimate of bias on the reverse strand is close to 0.
  • This ensemble bias may then be applied per read per flow as discussed above.
  • FIG. 8B illustrates exemplary variance components corresponding to homopolymers.
  • the y-axis shows the standard deviation (square root of the variance) for the variance components fitted by the model and used to estimate the variance for each flow in each read by interpolation.
  • the x-axis shows intensities corresponding to homopolymers scaled so that a completely in-phase 1-mer should read 1.0. (When there is insufficient data, the graph shows the prior estimate for this estimate of variation.)
  • FIG. 9A illustrates predicted/measured data ratios for a sequence of nucleotide flows.
  • the y-axis shows the predicted/measured signal ratio for reads on the forward strand (in black).
  • the x-axis shows a series of nucleotide flows as discussed herein (e.g., a flow of dTTP, followed by a flow of dATP, followed by a flow of dCTP, etc.).
  • the y-axis values are normalized based on the first flows corresponding to a key (see T, C, and A at flows 1, 3, and 6 with the final G of the key combining with the G in the read to form a 2-mer G at flow 8) so that the key values are at 1.0.
  • FIG. 9B illustrates residual values for the same data as in FIG. 9A .
  • the y-axis shows the residuals, that is, the differences between measured and predicted values.
  • the x-axis shows a series of nucleotide flows as discussed herein.
  • the middle line, with a value of 0, indicates that the model applied to the reference sequence would predict the intensity correctly. Dots placed off the middle line indicate that the measurement-prediction difference is large in the original system.
  • the corrected predictions are indicated by the cyan line for reference (passing through the middle of the residuals); the dotted blue line shows the predictions for the variant allele (deleted G).
  • The forward (positive) strand has a bias of −0.58, which is nearly 60% of a 1-mer. It is therefore unsurprising that this strand is reporting deletion alleles.
  • FIG. 9C illustrates predicted/measured data ratios for a sequence of nucleotide flows.
  • the y-axis shows the predicted/measured signal ratio for reads overlapping the same sequence portion as in FIG. 9A but on the reverse strand (in red).
  • the x-axis shows a series of nucleotide flows as discussed herein (e.g., a flow of dATP, followed by a flow of dCTP, followed by a flow of dGTP, etc.).
  • the 2-mer C in flow 22 (corresponding to the G on the forward strand in flow 14) has a similar signal to the 2-mer G in flow 19 (corresponding to the C on the forward strand in flow 15).
  • Modified predictions for the reference sequence are shown in cyan triangles; predictions for the intensity that would be observed if the true underlying sequence for reads were a deleted G are shown as blue crosses.
  • FIG. 9D illustrates residual values for the same data as in FIG. 9C .
  • the y-axis shows the residuals, that is, the differences between measured and predicted values.
  • the x-axis shows a series of nucleotide flows as discussed herein.
  • the modified prediction line (blue) for reference lies along the horizontal axis, indicating that the predictions are on average centered around the reference value.
  • a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using the predictive model of nucleo
  • a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse
  • a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model
  • modifying the at least some model-predicted values may comprise applying a transformation including a product of (i) one of the first and second biases and (ii) a discriminant vector representing a difference between model-predicted values corresponding to different hypothesized sequences.
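  • The strand-specific modification described above can be sketched as follows. This is only an illustration: it assumes the transformation adds the product of the bias and the discriminant vector to the reference predictions, and the sign convention and the name `apply_strand_bias` are assumptions of the sketch rather than details given in the text.

```python
def apply_strand_bias(pred_ref, pred_var, is_forward, bias_forward, bias_reverse):
    """Shift reference predictions along the discriminant vector.

    pred_ref / pred_var: model-predicted per-flow values under the reference
    and variant hypotheses. The discriminant vector is their per-flow
    difference; the bias used depends on the strand of the read.
    """
    bias = bias_forward if is_forward else bias_reverse
    # Assumed form: modified = reference + bias * (variant - reference).
    return [ref + bias * (var - ref) for ref, var in zip(pred_ref, pred_var)]


# Reference predicts a 2-mer at the second flow; the variant (a deletion)
# predicts a 1-mer there. Only that flow differs between the hypotheses.
reference = [1.0, 2.0, 0.0]
variant = [1.0, 1.0, 0.0]
forward_modified = apply_strand_bias(reference, variant, True, 0.5, 0.0)
reverse_modified = apply_strand_bias(reference, variant, False, 0.5, 0.0)
```

  • Because the discriminant vector is zero wherever the two hypotheses agree, only flows that distinguish the hypotheses are shifted under this assumed form.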
  • evaluating the likelihood may further comprise assigning a first frequency to a variant sequence and a second frequency to a non-variant sequence, and calculating a likelihood of having observed the ensemble of sequencing reads conditioned on the first frequency as a function of a product of the likelihoods of having observed each of the sequencing reads given the first frequency.
  • the likelihood of having observed the ensemble of sequencing reads may be determined using an expression comprising:
  • $\pi$ represents the first frequency
  • $1 - \pi$ represents the second frequency
  • $P(\pi)$ represents a prior probability of $\pi$
  • $\varepsilon_i$ represents the measurement confidence value for sequencing read i
  • i is an integer between 1 and N identifying a sequencing read
  • N is an integer specifying a number of sequencing reads in the ensemble.
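  • To illustrate the idea of an ensemble likelihood formed as a product over reads, the sketch below assumes a standard two-component mixture in which each read contributes the frequency-weighted sum of its likelihoods under the variant hypothesis x and the non-variant hypothesis y, with a flat prior. This mixture form, the grid search, and the function names are assumptions of the sketch and may differ from the expression intended here.

```python
import math


def ensemble_log_likelihood(pi, ll_x, ll_y, log_prior=0.0):
    """Log of prior(pi) * prod_i [pi * L_x,i + (1 - pi) * L_y,i].

    ll_x[i] and ll_y[i] are per-read log-likelihoods under the variant (x)
    and non-variant (y) hypotheses. Requires 0 < pi < 1.
    """
    total = log_prior
    for lx, ly in zip(ll_x, ll_y):
        top = max(lx, ly)  # factor out the larger term for numerical stability
        total += top + math.log(
            pi * math.exp(lx - top) + (1.0 - pi) * math.exp(ly - top)
        )
    return total


def scaled_density(ll_x, ll_y, grid_size=99):
    """Evaluate the likelihood on an interior grid and scale the peak to 1.0."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    logs = [ensemble_log_likelihood(p, ll_x, ll_y) for p in grid]
    peak = max(logs)
    return grid, [math.exp(value - peak) for value in logs]


# Eight reads favor the variant, two favor the non-variant sequence.
ll_variant = [0.0] * 8 + [-10.0] * 2
ll_reference = [-10.0] * 8 + [0.0] * 2
grid, density = scaled_density(ll_variant, ll_reference)
best_frequency = grid[density.index(max(density))]
```

  • Working in log space and summing per-read terms is equivalent to taking the product of per-read likelihoods, while avoiding underflow for large ensembles.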
  • evaluating the likelihood may further comprise assigning a first frequency to a variant sequence, a second frequency to a non-variant sequence, and a third frequency to an outlier event.
  • the outlier event may have a flat density across all sequencing reads in the ensemble.
  • Evaluating the likelihood may further comprise calculating a likelihood of having observed the ensemble of sequencing reads conditioned on the third frequency as a function of a product of the likelihoods of having observed each of the sequencing reads given the third frequency.
  • the likelihood of having observed the ensemble of sequencing reads may be determined using an expression comprising
  • $\pi$ represents the first frequency
  • $1 - \pi$ represents the second frequency
  • $(1 - \omega)$ represents the third frequency
  • T represents a density of the outlier event
  • $P(\pi)$ represents a prior probability of $\pi$
  • $\varepsilon_i$ represents a measurement confidence value for sequencing read i
  • i is an integer between 1 and N identifying a sequencing read
  • N is an integer specifying a number of sequencing reads in the ensemble.
  • the measurement confidence values may be estimated using a function comprising a sum of log-likelihood of values measured for a given flow given a hypothesized sequence.
  • the measurement confidence values may further be estimated using a function comprising differences between the measured values and the model-predicted values.
  • the differences between measured and model-predicted values at each nucleotide flow may be assumed to follow independent normal distributions each having a mean and a variance.
  • the differences between measured and model-predicted values at each nucleotide flow may also be assumed to follow independent t-distributions.
  • the measurement confidence values may be estimated using an expression comprising
  • $$\varepsilon_i = \frac{1}{1 + \exp(LL_{yi} - LL_{xi})}$$
  • where $LL_{yi}$ and $LL_{xi}$ are log-likelihoods of values measured for a given sequencing read under hypothesized sequences y and x, respectively.
  • the log-likelihoods of values measured for a given sequencing read under hypothesized sequences y and x may be expressed as
  • $m_{ij}^{x}$ and $m_{ij}^{y}$ represent measured values for read i at flow j under hypothesized sequences x and y, respectively
  • $p_{ij}^{x}$ and $p_{ij}^{y}$ represent predicted values for read i at flow j under hypothesized sequences x and y, respectively
  • $\sigma_{ij}^{x}$ and $\sigma_{ij}^{y}$ are the standard deviations of independently distributed normal distributions for read i at flow j under hypothesized sequences x and y, respectively, where i is an integer identifying a sequencing read, and where M represents a number of flows.
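  • A minimal sketch of these quantities, assuming that the per-flow differences between measured and predicted values follow independent zero-mean normal distributions (one of the options described above), is given below. The function names are hypothetical, and the Gaussian log-likelihood is written out in its textbook form rather than copied from the expression referred to here.

```python
import math


def gaussian_log_likelihood(measured, predicted, sigma):
    """Sum over flows of log N(measured_j; predicted_j, sigma_j^2)."""
    total = 0.0
    for m, p, s in zip(measured, predicted, sigma):
        total += -0.5 * math.log(2.0 * math.pi * s * s) - (m - p) ** 2 / (2.0 * s * s)
    return total


def measurement_confidence(ll_x, ll_y):
    """Compute 1 / (1 + exp(LL_y - LL_x)) without overflowing exp()."""
    d = ll_y - ll_x
    if d >= 0:
        e = math.exp(-d)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(d))


# One read measured over three flows; hypothesis x predicts a 2-mer at the
# last flow, hypothesis y predicts a 1-mer there.
measured = [1.0, 0.1, 1.9]
sigma = [0.1, 0.1, 0.1]
ll_x = gaussian_log_likelihood(measured, [1.0, 0.0, 2.0], sigma)
ll_y = gaussian_log_likelihood(measured, [1.0, 0.0, 1.0], sigma)
confidence_x = measurement_confidence(ll_x, ll_y)
```

  • When the two hypotheses explain the read equally well the confidence is 0.5, and it approaches 1 as hypothesis x fits increasingly better than hypothesis y.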
  • the measurement confidence values may be estimated using an expression for responsibility comprising
  • $$\rho_i = \frac{\pi}{\pi + (1 - \pi) \exp(LL_{yi} - LL_{xi})}$$
  • $\pi$ represents a first frequency assigned to a variant sequence
  • $1 - \pi$ represents a second frequency assigned to a non-variant sequence
  • $\rho_i$ represents a measure of responsibility for each of the sequencing reads in the ensemble.
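  • The responsibility can be sketched as a frequency-weighted version of the measurement confidence: the frequency assigned to the variant sequence acts as a prior weight on hypothesis x. The function name `responsibility` and the clamp on the exponent are choices made for this sketch only.

```python
import math


def responsibility(pi, ll_x, ll_y):
    """Compute pi / (pi + (1 - pi) * exp(LL_y - LL_x)) for one read.

    pi is the frequency assigned to the variant sequence (hypothesis x);
    ll_x and ll_y are the read's log-likelihoods under x and y.
    """
    # Clamp the exponent so exp() cannot overflow for extreme differences.
    ratio = math.exp(min(ll_y - ll_x, 700.0))
    return pi / (pi + (1.0 - pi) * ratio)


equal_fit = responsibility(0.1, 0.0, 0.0)        # prior dominates
strong_variant = responsibility(0.1, 0.0, -40.0)  # data dominate
```

  • With a frequency of 0.5 the expression reduces to the measurement confidence, so the two quantities differ only in how strongly the assumed allele frequency is allowed to weigh on each read.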
  • the variance may then be estimated using an expression comprising
  • $m_{ij}^{x}$ and $m_{ij}^{y}$ represent measured values of read i at flow j under hypothesized sequences x and y, respectively
  • $p_{ij}^{x}$ and $p_{ij}^{y}$ represent predicted values for read i at flow j under hypothesized sequences x and y, respectively
  • M represents a number of flows and N is an integer specifying a number of sequencing reads
  • the variance may be estimated by decomposition of the variance in a flow and sequencing read into underlying latent components.
  • the method may further comprise updating the decomposition using an expression comprising
  • ⁇ kn 2 ⁇ ( ⁇ i * r ij 2 ) ⁇ ( ⁇ ijm )
  • ⁇ ij ⁇ ijm * ⁇ k , n - 1 2 ⁇ ( ⁇ ijm * ⁇ m , n - 1 2 )
  • each latent component may correspond to a homopolymer having an integer length.
  • the latent components may include a null variance component representing contribution to a flow regardless of any nucleotide incorporation, a residual variance component representing contribution for nucleotide incorporations not explicitly modeled, and one or more additional variance components.
  • the one or more additional variance components may comprise variance components associated with homopolymers having an integer length.
  • the latent components may be estimated using an EM methodology and a method of moments approximation.
  • evaluating the likelihood may comprise estimating (i) a first frequency $\pi$ assigned to a variant sequence, (ii) at least one of a measurement confidence value $\varepsilon_i$ for each of the sequencing reads in the ensemble and a measure of responsibility $\rho_i$ for each of the sequencing reads in the ensemble, and (iii) a variance $\sigma_{ij}^2$ for each of the flows and sequencing reads in the ensemble.
  • Evaluating the likelihood may further comprise estimating $\pi$, $\varepsilon_i$ (or $\rho_i$), and $\sigma_{ij}^2$ individually conditional on the values of the others.
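  • The idea of estimating each quantity conditional on the others can be illustrated with a simplified alternating scheme. The sketch below holds the per-read log-likelihoods (and hence the variances and biases) fixed and alternates only between the responsibilities and the variant frequency, updating the frequency as the mean responsibility. That update rule, the iteration count, and the function name are assumptions of this sketch and not a statement of the full procedure.

```python
import math


def estimate_frequency(ll_x, ll_y, pi=0.5, iterations=50):
    """Alternate between per-read responsibilities and the variant frequency.

    ll_x[i], ll_y[i]: log-likelihoods of read i under the variant (x) and
    non-variant (y) hypotheses, treated as fixed in this simplified sketch.
    """
    resp = []
    for _ in range(iterations):
        # Update responsibilities conditional on the current frequency.
        resp = [
            pi / (pi + (1.0 - pi) * math.exp(min(ly - lx, 700.0)))
            for lx, ly in zip(ll_x, ll_y)
        ]
        # Update the frequency conditional on the responsibilities,
        # keeping it strictly inside (0, 1).
        pi = min(max(sum(resp) / len(resp), 1e-6), 1.0 - 1e-6)
    return pi, resp


# Eight reads strongly support the variant, two support the reference.
ll_variant = [0.0] * 8 + [-20.0] * 2
ll_reference = [-20.0] * 8 + [0.0] * 2
frequency, responsibilities = estimate_frequency(ll_variant, ll_reference)
```

  • A fuller treatment would also re-estimate the variances and strand biases inside the loop, which changes the log-likelihoods between iterations.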
  • a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some template polynucleotide strands in at least one defined space, wherein a plurality of template polynucleotide strands, sequencing primers, and polymerase have been provided in a plurality of defined spaces disposed on a sensor array, and wherein the plurality of template polynucleotide strands, sequencing primers, and polymerase have been exposed to a series of flows of nucleotide species according to a predetermined order; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is
  • a system for evaluating variant likelihood in nucleic acid sequencing including: a plurality of template polynucleotide strands, sequencing primers, and polymerase provided in a plurality of defined spaces disposed on a sensor array; an apparatus configured to expose the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for evaluating variant likelihood, comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in

Abstract

A method for evaluating variant likelihood includes: providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: determining a measurement confidence value for each read in the ensemble of sequencing reads and modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Prov. Pat. Appl. No. 61/782,240, filed Mar. 14, 2013, which is incorporated by reference herein in its entirety.
  • FIELD
  • This application generally relates to methods, systems, and computer readable media for nucleic acid sequencing, and, more specifically, to methods, systems, and computer readable media for evaluating variant likelihood in nucleic acid sequencing.
  • BACKGROUND
  • Nucleic acid sequencing data may be obtained in various ways, including using next-generation sequencing systems such as, for example, the Ion PGM™ and Ion Proton™ systems implementing Ion Torrent™ sequencing technology; see, e.g., U.S. Pat. No. 7,948,015 and U.S. Pat. Appl. Publ. Nos. 2010/0137143, 2009/0026082, and 2010/0282617, which are all incorporated by reference herein in their entirety. There is a need for new methods, systems, and computer readable media that can better evaluate variant likelihood and reduce sequencing errors when analyzing data obtained using these or other sequencing systems/platforms.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more exemplary embodiments and serve to explain the principles of various exemplary embodiments. The drawings are exemplary and explanatory only and are not to be construed as limiting or restrictive in any way.
  • FIG. 1 illustrates an exemplary system for evaluating variant likelihood.
  • FIG. 2 illustrates components of an exemplary apparatus for nucleic acid sequencing.
  • FIG. 3 illustrates an exemplary flow cell for nucleic acid sequencing.
  • FIG. 4 illustrates an exemplary process for label-free, pH-based sequencing.
  • FIG. 5 illustrates an exemplary computer system.
  • FIG. 6 illustrates an exemplary method for evaluating variant likelihood.
  • FIG. 7A illustrates an exemplary plot of sample frequency.
  • FIG. 7B illustrates an exemplary plot of responsibility for each read.
  • FIG. 8A illustrates exemplary bias terms for each strand for an ensemble of reads.
  • FIG. 8B illustrates exemplary variance components corresponding to homopolymers.
  • FIG. 9A illustrates predicted/measured data ratios for a sequence of nucleotide flows.
  • FIG. 9B illustrates residual values for the same data as in FIG. 9A.
  • FIG. 9C illustrates predicted/measured data ratios for a sequence of nucleotide flows.
  • FIG. 9D illustrates residual values for the same data as in FIG. 9C.
  • SUMMARY
  • According to an exemplary embodiment, there is provided a method for evaluating variant likelihood in nucleic acid sequencing, comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
  • According to an exemplary embodiment, there is provided a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some template polynucleotide strands in at least one defined space, wherein a plurality of template polynucleotide strands, sequencing primers, and polymerase have been provided in a plurality of defined spaces disposed on a sensor array, and wherein the plurality of template polynucleotide strands, sequencing primers, and polymerase have been exposed to a series of flows of nucleotide species according to a predetermined order; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
  • According to an exemplary embodiment, there is provided a system for evaluating variant likelihood in nucleic acid sequencing, including: a plurality of template polynucleotide strands, sequencing primers, and polymerase provided in a plurality of defined spaces disposed on a sensor array; an apparatus configured to expose the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for evaluating variant likelihood, comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
  • Exemplary Embodiments
  • The following description and the various embodiments described herein are exemplary and explanatory only and are not to be construed as limiting or restrictive in any way. Other embodiments, features, objects, and advantages of the present teachings will be apparent from the description and accompanying drawings, and from the claims.
  • The identification of sequence variants, including single nucleotide polymorphisms (SNPs), insertions, and/or deletions, is an important application of next-generation sequencing technologies. The particular approach/technology used to obtain sequencing data and the particular data analysis approach/methods used to analyze the sequencing data can both affect the accuracy of sequence variant identification. For example, the use of alignment methodologies developed and operating in base space can sometimes lead to insertions or deletions in the alignment of reads obtained using sequencing-by-synthesis technologies that operate at least to some extent in flow space.
  • According to various exemplary embodiments, methods, systems, and computer readable media for evaluating variant likelihood in nucleic acid sequencing are disclosed herein. The various embodiments provide a more generalized method for inferring variant likelihood by considering properties of an ensemble of reads. In some cases, such as a SNP or simple indel affecting one or two flows, the ability to infer the confidence of a read without referring to the ensemble is high because bias is expected to be low and variation should be well-estimated. However, in other cases, such as some multi-nucleotide repeats or short indels in long homopolymer stretches, estimating bias and variance is critical to avoid false positives, and it may then be beneficial to consider the ensemble in addition to each read to evaluate confidence.
  • Variant Frequency Framework
  • A fundamental problem in variant calling is to assign a posterior distribution to a variant frequency given an observed ensemble of reads. The reads that strongly support a sequence provide information about the frequency of that sequence while those that weakly support the sequence provide little information. However, under reasonable assumptions about systematic error, the strength of support for a sequence is itself conditional on the ensemble of reads observed. In various embodiments, one may estimate both the error structure for a particular ensemble of reads and the plausible distribution of variants.
  • In an example, let V1 and V2 represent two possible sequences, and let π represent the frequency of V1 and (1−π) represent the frequency of V2. Further, let R1, R2, . . . , RN represent N measured reads, and let each read have a measurement confidence value of being V1 of εi. The likelihood of observing the N reads conditional on a given frequency, L(R1, R2, . . . , RN|π), which is a function of a product between the prior and the likelihood of each measured read under a given value of π, may be expressed as follows:
  • $$L(R_1, R_2, \ldots, R_N \mid \pi) = c \cdot P(\pi) \cdot \prod_{i=1}^{N} \left( \pi \, \varepsilon_i + (1 - \pi)(1 - \varepsilon_i) \right)$$
  • where: c is a normalization constant; P(π) represents a prior probability of π; i is an integer between 1 and N identifying a sequencing read; and N is an integer specifying a number of sequencing reads in the ensemble. In various embodiments, the distribution of the frequency π and the confidence values εi, which are unknown, may be inferred based on the data for downstream hypothesis evaluation.
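  • As an illustration only, the likelihood above can be evaluated on a grid of candidate frequencies to obtain a normalized posterior for π. The following is a minimal sketch, assuming a flat prior by default and hypothetical confidence values; the function names `log_likelihood` and `grid_posterior` are not part of the present teachings.

```python
import math

def log_likelihood(pi, eps, log_prior=lambda p: 0.0):
    """Log of P(pi) * prod_i (pi*eps_i + (1-pi)*(1-eps_i)), up to the constant c."""
    total = log_prior(pi)  # flat prior unless another log-prior is supplied
    for e in eps:
        total += math.log(pi * e + (1.0 - pi) * (1.0 - e))
    return total

def grid_posterior(eps, n_grid=101):
    """Normalized posterior over evenly spaced pi values strictly inside (0, 1)."""
    grid = [(k + 0.5) / n_grid for k in range(n_grid)]
    logs = [log_likelihood(p, eps) for p in grid]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return grid, [w / norm for w in weights]

# Hypothetical confidences: three reads favor V1, one read favors V2.
eps = [0.99, 0.95, 0.9, 0.05]
grid, post = grid_posterior(eps)
pi_map = grid[post.index(max(post))]  # grid point with the highest posterior mass
```

  • Working in log space keeps the product over many reads from underflowing; the normalization constant c is absorbed when the grid weights are normalized.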
  • In another example, let V1 and V2 represent two possible sequences, let π represent the frequency of V1 and (1−π) represent the frequency of V2, and let (1−τ) represent a frequency of an outlier event. (Because of the strong likelihood that there are outlier measurements in any collection of reads (e.g., due to misalignment), it can be useful to postulate a third possibility covering such outlier events.) Further, let R1, R2, . . . , RN represent N measured reads, and let each read have a measurement confidence value of being V1 of εi. The likelihood of observing the N reads conditional on given frequencies π and τ, L(R1, R2, . . . RN|π, τ), which is a function of a product between the prior and the likelihood of each measured read under given values of π and τ, may be expressed as follows:
  • $$L(R_1, R_2, \ldots, R_N \mid \pi, \tau) = c \cdot P(\pi) \cdot \prod_{i=1}^{N} \left( \tau \left( \pi \, \varepsilon_i + (1 - \pi)(1 - \varepsilon_i) \right) + (1 - \tau) \, T \right)$$
  • where: c is a normalization constant; T represents a density of the outlier event; P(π) represents a prior probability of π; i is an integer between 1 and N identifying a sequencing read; and N is an integer specifying a number of sequencing reads in the ensemble. In various embodiments, the distribution of the frequencies π and (1−π) and the confidence values εi, which are unknown, may be inferred based on the data for downstream hypothesis evaluation. In some embodiments, the frequency (1−τ) may be assumed to have an (improper) flat density across all reads.
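  • The outlier-aware likelihood may be sketched in the same way. The example below is an assumption-laden illustration: the outlier density T is passed as a single constant (a flat density), and the function name `log_likelihood_with_outliers` is hypothetical.

```python
import math

def log_likelihood_with_outliers(pi, tau, eps, outlier_density,
                                 log_prior=lambda p: 0.0):
    """Log of P(pi) * prod_i (tau*(pi*eps_i + (1-pi)*(1-eps_i)) + (1-tau)*T).

    tau is the frequency of non-outlier reads, so (1 - tau) is the outlier
    frequency; outlier_density plays the role of T (assumed constant here).
    """
    total = log_prior(pi)
    for e in eps:
        inlier = pi * e + (1.0 - pi) * (1.0 - e)
        total += math.log(tau * inlier + (1.0 - tau) * outlier_density)
    return total

# With tau = 1 the expression reduces to the two-sequence likelihood.
no_outliers = log_likelihood_with_outliers(0.5, 1.0, [0.9, 0.1], 0.01)
# With tau < 1 every read keeps a floor of (1 - tau) * T on its likelihood.
with_outliers = log_likelihood_with_outliers(0.5, 0.95, [0.9, 0.1], 0.01)
```

  • The floor contributed by the outlier term limits how strongly any single poorly fitting read (e.g., a misaligned read) can pull the frequency estimate.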
  • Confidence Values
  • In the above framework, the likelihood of observing the N reads depends on the measurement confidence values εi in addition to the frequency π (and outlier frequency τ if included). The measurement confidence values εi could be estimated using methods known in the art for evaluating confidence values, which typically would be read-specific and not specifically adapted to a particular underlying sequencing technology. Preferably, however, in an embodiment the confidence values εi may be derived or estimated using ensembles of reads along with predictive models related to the underlying sequencing technology.
  • In an example, let mi1, mi2, . . . , mij, . . . , miM represent a vector of measured values for M nucleotide flows associated with an i-th read (e.g., a set of normalized, calibrated values observed for the i-th read over the M flows), and let pi1, pi2, . . . , pij, . . . , piM represent a vector of predicted values for the i-th read over the M flows under a predictive model. Any suitable predictive model may be used, recognizing of course that more accurate models are likely to yield better results. The model may be selected depending on the underlying sequencing technology and may, for example, be a model as described in Davey et al., U.S. Pat. Appl. Publ. No. 2012/0109598, published on May 3, 2012, and/or Sikora et al., U.S. Pat. Appl. Publ. No. 2013/0060482, published on Mar. 7, 2013, both of which are incorporated by reference herein in their entirety.
  • Regardless of what predictive model is being used, there will be some variation between measured and predicted values. In some cases, it may be useful to make certain approximations characterizing the variation. In an example, one may make the approximation that the differences between the measured and predicted values are normally and independently distributed in each flow with a mean of zero and a variance given by $\sigma_{ij}^{2}$, so that $\delta_{ij}^{x} = m_{ij} - p_{ij}^{x} \sim N(0, \sigma_{ij}^{2})$ for the i-th read and the j-th flow (where the x superscript denotes an index of some hypothesized sequence $V_x$). In another example, which may be of particular use in cases including outliers that may result in heavy-tailed phenomena, the normal/Gaussian likelihood model may be too light-tailed and a t-distribution may better reflect the heavy-tailed phenomena.
  • In various embodiments, given measured sequencing data, a suitable predictive model, and some approximation characterizing the distribution of differences between measured and predicted values, $\delta_{ij}^{x}$ may be calculated and $\sigma_{ij}^{2}$ may be estimated using any suitable method for fitting data to a distribution (or preferably as further discussed in this and/or the next sections). In turn, $\delta_{ij}^{x}$ and $\sigma_{ij}^{2}$ may be used to calculate various statistical measures that may be used to obtain estimates of the measurement confidence values $\varepsilon_i$.
  • In various embodiments, the statistical measures may be log-likelihood contributions. For example, the log-likelihood contribution for the j-th flow conditional on $V_x$ (neglecting constants), $LL_{xj}$, may be expressed as $LL_{xj} = -\ln(\sigma_{ij}) - (\delta_{ij}^{x})^{2}/(2\sigma_{ij}^{2})$. The log-likelihood of the measured values for the i-th read under hypothesis x may then be expressed as $LL_{xi} = \sum_{j} LL_{xj}$, the sum of the contributions of all flows for the particular read. Finally, the log-likelihood contributions under two sequence hypotheses x and y may be used to estimate $\varepsilon_i$ using the following expression:
  • $$\varepsilon_i = \frac{\exp(LL_{xi})}{\exp(LL_{xi}) + \exp(LL_{yi})} = \frac{1}{1 + \exp(LL_{yi} - LL_{xi})}.$$
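  • The computation of a confidence value from two hypotheses may be sketched as follows. This is a minimal illustration, assuming the standard Gaussian log-density up to constants, $-\ln\sigma - \delta^{2}/(2\sigma^{2})$, and hypothetical measured and predicted values; `read_log_likelihood` and `confidence` are illustrative names only.

```python
import math

def read_log_likelihood(measured, predicted, sigma):
    """Sum over flows of -ln(sigma_j) - delta_j^2 / (2 * sigma_j^2)."""
    ll = 0.0
    for m, p, s in zip(measured, predicted, sigma):
        delta = m - p
        ll += -math.log(s) - delta * delta / (2.0 * s * s)
    return ll

def confidence(ll_x, ll_y):
    """eps_i = 1 / (1 + exp(LL_y - LL_x)), arranged to avoid overflow."""
    diff = ll_y - ll_x
    if diff > 0:
        z = math.exp(-diff)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical read over three flows; the hypotheses differ in the last flow.
measured = [1.02, 0.05, 1.90]
pred_x = [1.0, 0.0, 2.0]
pred_y = [1.0, 0.0, 1.0]
sigma = [0.1, 0.1, 0.1]
eps_i = confidence(read_log_likelihood(measured, pred_x, sigma),
                   read_log_likelihood(measured, pred_y, sigma))
```

  • Only flows at which the two predictions differ change the difference LLyi − LLxi, so flows with identical predictions cancel out of the confidence value.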
  • It should be noted that the above embodiments provide an underspecified parameter set as the variances within the observed reads are unknown. One could estimate the variances (e.g., using flows at which the predictions do not differ), which would effectively amount to treating each read independently. However, while this might be computationally convenient, such a simplification does not fit the data, which suggest for example that variances for flows with large measured values are larger than variances for flows with small measured values.
  • In various embodiments, the variation in measurements may be estimated by exploiting the information across the ensemble of reads to better estimate the confidence within each read. In an example, each read may be assigned a division of responsibility ρi for each sequence based on both the likelihood and the base rate to obtain an estimate of variation for a given division of responsibility. In an embodiment, the measurement confidence values εi in the likelihood of observing the N reads conditional on a given frequency may be measures of responsibilities ρi.
  • In various embodiments, given measured sequencing data, one may estimate $\pi$ using any suitable estimation method, and, given a suitable predictive model and some approximation characterizing the distribution of differences between measured and predicted values, $\delta_{ij}^{x}$ may be calculated and $\sigma_{ij}^{2}$ may be estimated using any suitable method for fitting data to a distribution (or preferably as further discussed in this and/or the next section). In turn, $\pi$, $\delta_{ij}^{x}$, and $\sigma_{ij}^{2}$ may be used to calculate various statistical measures that may be used to obtain estimates of the measures of responsibility $\rho_i$.
  • In various embodiments, the statistical measures may be log-likelihood contributions. For example, the log-likelihood contribution for the j-th flow conditional on $V_x$ (neglecting constants), $LL_{xj}$, may be expressed as $LL_{xj} = -\ln(\sigma_{ij}) - (\delta_{ij}^{x})^{2}/(2\sigma_{ij}^{2})$. The log-likelihood of the measured values for the i-th read under hypothesis x may then be expressed as $LL_{xi} = \sum_{j} LL_{xj}$, the sum of the contributions of all flows for the particular read. Finally, the log-likelihood contributions under two sequence hypotheses x and y may be used to estimate $\rho_i$ using the following expression:
  • $$\rho_i = \frac{\pi}{\pi + (1 - \pi) \exp(LL_{yi} - LL_{xi})}.$$
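  • The responsibility differs from the confidence value only in that it weights the two hypotheses by the base rate π. A minimal sketch follows; the function name `responsibility` and the handling of the boundary values π = 0 and π = 1 are assumptions made for illustration.

```python
import math

def responsibility(pi, ll_x, ll_y):
    """rho_i = pi / (pi + (1 - pi) * exp(LL_y - LL_x)), computed in log space."""
    if pi <= 0.0:
        return 0.0
    if pi >= 1.0:
        return 1.0
    a = math.log(pi) + ll_x        # log weight of hypothesis x
    b = math.log(1.0 - pi) + ll_y  # log weight of hypothesis y
    top = max(a, b)
    wa = math.exp(a - top)
    wb = math.exp(b - top)
    return wa / (wa + wb)

# Equal likelihoods: the responsibility falls back to the base rate.
rho_equal = responsibility(0.2, -1.0, -1.0)
# Equal base rates: the responsibility equals the confidence value.
rho_even = responsibility(0.5, 0.0, -math.log(3.0))
```

  • When a read is uninformative (equal log-likelihoods), its responsibility equals π, which is how a low base rate discounts weak evidence for a rare variant.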
  • Under the division of responsibility framework, the quantity determining the variance relevant to the sequence change of interest is then the expected value of the squared deviation across reads. In various embodiments, the variance $\hat{\sigma}^{2}$ may be estimated using the evaluated $\delta_{ij}^{x}$ and estimated $\rho_i$, using the following expression:
  • $$\hat{\sigma}^{2} = \frac{\sum_{i=1}^{N} \left( \rho_i \sum_{j=1}^{M} (\delta_{ij}^{x})^{2} + (1 - \rho_i) \sum_{j=1}^{M} (\delta_{ij}^{y})^{2} \right)}{N}.$$
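  • The responsibility-weighted variance estimate may be sketched directly from the expression above. The example assumes hypothetical residuals and responsibilities, and follows the expression in dividing by the number of reads N; the name `pooled_variance` is illustrative.

```python
def pooled_variance(rho, delta_x, delta_y):
    """Responsibility-weighted sum of squared residuals, divided by N reads.

    rho:      list of N responsibilities for hypothesis x
    delta_x:  per-read lists of residuals under hypothesis x
    delta_y:  per-read lists of residuals under hypothesis y
    """
    total = 0.0
    for r, dx, dy in zip(rho, delta_x, delta_y):
        total += r * sum(d * d for d in dx)
        total += (1.0 - r) * sum(d * d for d in dy)
    return total / len(rho)

# Two hypothetical reads: the first is assigned to x, the second to y.
rho = [1.0, 0.0]
delta_x = [[0.1, 0.2], [1.0, 1.0]]
delta_y = [[1.0, 1.0], [0.3, 0.0]]
sigma2_hat = pooled_variance(rho, delta_x, delta_y)
```

  • Because each read contributes mainly through the hypothesis it is responsible for, large residuals under the rejected hypothesis do not inflate the variance estimate.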
  • Estimation of Variance
  • Because the variance around predictions can vary significantly across different reads and flows, it is preferable to estimate $\sigma_{ij}^{2}$, the variance for the j-th flow in the i-th read. In an embodiment, this may be achieved by decomposing the variance in any flow and read in terms of some underlying latent components: $\sigma_{ij}^{2} = \sum_{m=1}^{K} (\varphi_{ijm} \cdot \sigma_{m}^{2})$, where $\varphi_{ijm}$ is a proportionality constant determining the amount of variation contributed by the m-th latent component.
  • The components may be defined in various ways. For example, the components $\sigma_{m}^{2}$ may consist of $\sigma_{\varphi}^{2}$ (the variance contributed to a flow by simply existing without any reference to incorporation), $\sigma_{X}^{2}$ (a term for incorporation components not explicitly modeled), and then all the other components. Alternatively, the components $\sigma_{m}^{2}$ may correspond to each integer homopolymer length. Alternatively, the components $\sigma_{m}^{2}$ may be defined according to the polymerase state variables (e.g., 10% of the variance is provided by a given base or homopolymer run in the hypothesis in a given flow) and fitted according to variation contributed by a genomic location.
  • The components may be approximated in various ways. For example, initial estimates for the latent variables may be updated using an expectation-maximization (EM) methodology and a method-of-moments approximation. Under the decomposition above, the estimate for the k-th component at iteration n, $\sigma_{k,n}^{2} = \left(\sum \omega_{ij} \cdot r_{ij}^{2}\right) / \left(\sum \varphi_{ijk}\right)$, where $\omega_{ij} = (\varphi_{ijk} \cdot \sigma_{k,n-1}^{2}) / \left(\sum_{m} \varphi_{ijm} \cdot \sigma_{m,n-1}^{2}\right)$ and $r_{ij}$ is the residual for the i-th read and the j-th flow, may be updated using the proportion of variance attributed to each read and flow. This can be done relatively quickly by iterating only over relevant flows for each component. Here, the use of a multiplicative proportion of each squared residual to estimate the contribution towards the component ensures that estimates are always positive for variances.
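  • One reading of this update may be sketched as follows: each squared residual is split among the K components in proportion to the share of the modeled variance that each component currently explains, and each component is then re-estimated from its share. The data layout, the name `update_components`, and the choice to leave a component unchanged when no observation loads on it are assumptions made for illustration.

```python
def update_components(residuals, phi, sigma2):
    """One method-of-moments style update of K latent variance components.

    residuals: residual r for each (read, flow) observation, flattened
    phi:       for each observation, a list of K proportionality weights
    sigma2:    current estimates of the K component variances
    """
    k_count = len(sigma2)
    numer = [0.0] * k_count
    denom = [0.0] * k_count
    for r, weights in zip(residuals, phi):
        modeled = sum(w * s for w, s in zip(weights, sigma2))
        if modeled <= 0.0:
            continue  # observation carries no modeled variance
        for k in range(k_count):
            share = weights[k] * sigma2[k] / modeled  # proportion owed to k
            numer[k] += share * r * r
            denom[k] += weights[k]
    return [numer[k] / denom[k] if denom[k] > 0.0 else sigma2[k]
            for k in range(k_count)]

# Hypothetical setup: a baseline component and a homopolymer-length component.
phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
start = [1.0, 1.0]
residuals = [0.2, -0.5, 0.7]
updated = update_components(residuals, phi, start)
```

  • Since every term in the numerator is a non-negative share of a squared residual, the updated components cannot become negative, which matches the positivity property noted above.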
  • Overall Likelihood
  • In various embodiments, putting all the above together, the likelihood $L(R_1, R_2, \ldots, R_N \mid \pi)$ of observing the N reads may be maximized, repeatedly, by estimating (i) the frequency $\pi$ (and outlier frequency $\tau$ if included), (ii) either the measurement confidence values $\varepsilon_i$ (which may be functions of $\delta_{ij}^{2}$ and $\sigma_{ij}^{2}$ as described above) or the measures of responsibility $\rho_i$ (which may be functions of $\pi$, $\delta_{ij}^{2}$, and $\sigma_{ij}^{2}$ as described above), and (iii) the variances $\sigma_{ij}^{2}$ (which may be expressed various ways in terms of latent components as described above) individually conditional on the values of the others and the observed data. This may be done using any suitable statistical method for obtaining maximum likelihood and/or maximum a posteriori estimates of some or all of these parameters given the observed data and any underlying assumptions, including for example estimating the variance iteratively using an expectation-maximization algorithm and a method-of-moments approximation. In an embodiment, after starting with initial estimates, the likelihood may be first maximized based on $\pi$ while holding $\varepsilon_i$ (or $\rho_i$) and $\sigma_{ij}^{2}$ fixed, then based on $\varepsilon_i$ (or $\rho_i$) while holding $\pi$ and $\sigma_{ij}^{2}$ fixed, then based on $\sigma_{ij}^{2}$ while holding $\pi$ and $\varepsilon_i$ (or $\rho_i$) fixed, and then iterating as needed. Assuming reasonable stability of estimating the variance, the posterior distribution for the variant frequency conditional on the maximum a posteriori value of the variance may then be obtained.
Descriptions of methods for finding maximum likelihood or maximum a posteriori estimates of a plurality of parameters in a statistical model, including the expectation-maximization algorithm and other algorithms, may be found in various statistical papers, such as Dempster et al., “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society Series B, 39(1):1-38 (1977); and Roche, “EM algorithm and variants: an informal tutorial,” arXiv:1105.1476v2 [stat.CO] (2012) (available at http://arxiv.org/abs/1105.1476v2).
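  • The alternating scheme may be illustrated with a deliberately simplified sketch. It assumes a single variance shared by all reads and flows (rather than latent components), a flat prior on π, no outlier term, and no bias terms; under those assumptions the updates coincide with a standard EM iteration for a two-component mixture. The name `fit_ensemble` and the example data are hypothetical.

```python
import math

def fit_ensemble(measured, pred_x, pred_y, n_iter=50):
    """Alternate updates of responsibilities, frequency, and a shared variance."""
    n = len(measured)
    n_flows = sum(len(m) for m in measured)
    pi, sigma2 = 0.5, 1.0
    rho = [0.5] * n
    for _ in range(n_iter):
        weighted_ss = 0.0
        for i in range(n):
            sq_x = sum((m - p) ** 2 for m, p in zip(measured[i], pred_x[i]))
            sq_y = sum((m - p) ** 2 for m, p in zip(measured[i], pred_y[i]))
            # Responsibilities given pi and sigma2 (shared sigma cancels ln terms).
            a = math.log(pi) - sq_x / (2.0 * sigma2)
            b = math.log(1.0 - pi) - sq_y / (2.0 * sigma2)
            top = max(a, b)
            wa, wb = math.exp(a - top), math.exp(b - top)
            rho[i] = wa / (wa + wb)
            weighted_ss += rho[i] * sq_x + (1.0 - rho[i]) * sq_y
        # Frequency given responsibilities, kept strictly inside (0, 1).
        pi = min(max(sum(rho) / n, 1e-6), 1.0 - 1e-6)
        # Shared per-flow variance given responsibilities (an assumption here).
        sigma2 = max(weighted_ss / n_flows, 1e-12)
    return pi, rho, sigma2

# Hypothetical ensemble: three reads resemble x, one read resembles y.
px, py = [1.0, 0.0, 2.0], [1.0, 0.0, 1.0]
reads = [[1.02, 0.03, 1.95], [0.97, -0.02, 2.04],
         [1.01, 0.01, 1.98], [0.99, 0.02, 1.03]]
pi_hat, rho_hat, sigma2_hat = fit_ensemble(reads, [px] * 4, [py] * 4)
```

  • Starting from a deliberately large variance, the first pass assigns soft responsibilities; as the variance estimate shrinks over iterations, the responsibilities sharpen and the frequency estimate settles.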
  • Systematic Biases
  • In various embodiments, variant calling may be further improved by capturing systematic bias between the measurements and predicted values (e.g., systematic underestimating of predicted values that may lead to systematic undercalls) that may affect the ability to distinguish between sequence hypotheses. In the present context, the relevant bias in measurements for purposes of assigning responsibility is a component lying along the discriminant vector d of length M for a given read R, having components $d_j = p_j^{x} - p_j^{y}$, which are the differences in measurements predicted under the two hypothesized sequences x and y.
  • In an example, a predicted value for each read may be modified by applying a transformation $q_j^{x} = p_j^{x} + \beta_f \cdot d_j$ for forward strands and a transformation $q_j^{x} = p_j^{x} + \beta_r \cdot d_j$ for reverse strands, where: $q_j^{x}$ represents modified predicted values for a given sequencing read at flow j under hypothesized sequence x; $\beta_f$ denotes the first bias for forward strands; $\beta_r$ denotes the second bias for reverse strands; $d_j = p_j^{x} - p_j^{y}$, where $p_j^{x}$ and $p_j^{y}$ represent predicted values for the given sequencing read at flow j under hypothesized sequences x and y, respectively; j is an integer between 1 and M; and M represents a number of flows.
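  • The strand-specific transformation may be sketched as follows, assuming hypothetical predicted values and bias magnitudes; the function name `apply_strand_bias` and its arguments are illustrative only.

```python
def apply_strand_bias(pred_x, pred_y, beta_forward, beta_reverse, is_forward):
    """Return q_j = p_j^x + beta * (p_j^x - p_j^y) for every flow j.

    beta is the forward-strand bias for forward reads and the
    reverse-strand bias otherwise.
    """
    beta = beta_forward if is_forward else beta_reverse
    return [px + beta * (px - py) for px, py in zip(pred_x, pred_y)]

# Hypothetical predictions that differ only in the third flow.
pred_x = [1.0, 0.0, 2.0]
pred_y = [1.0, 0.0, 1.0]
q_forward = apply_strand_bias(pred_x, pred_y, 0.10, -0.05, is_forward=True)
q_reverse = apply_strand_bias(pred_x, pred_y, 0.10, -0.05, is_forward=False)
```

  • Flows at which the two hypotheses predict the same value have a zero discriminant component and are left unchanged, so the bias only shifts predictions at flows that distinguish the hypotheses.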
  • The biases may be estimated in various ways. In an embodiment, the two biases may be inferred from the data together with (i) the frequency $\pi$ (and outlier frequency $\tau$ if included), (ii) either the measurement confidence values $\varepsilon_i$ (which may be functions of $\delta_{ij}^{2}$ and $\sigma_{ij}^{2}$ as described above) or the measures of responsibility $\rho_i$ (which may be functions of $\pi$, $\delta_{ij}^{2}$, and $\sigma_{ij}^{2}$ as described above), and (iii) the variances $\sigma_{ij}^{2}$ (which may be expressed various ways in terms of latent components as described above) individually conditional on the values of the others and the observed data. As above, this may be done using any suitable statistical method for obtaining maximum likelihood and/or maximum a posteriori estimates of some or all of these parameters given the observed data and any underlying assumptions, including by iteratively maximizing over one variable while holding the others fixed. In addition, one may assign a prior to the bias, which may be a normal distribution with mean zero and variance $1/t^{2}$ (that is, $N(0, 1/t^{2})$), where t is the approximate precision of the bias term, which may be learned within an experiment by exploiting the fact that most positions are reference. The bias maximizing the likelihood conditional on the other latent variables may be determined using the following expression:
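  • $$\beta_f = \left( \sum \frac{d_{ij}^{2}}{\sigma_{ij}^{2}} \right)^{-1} \cdot \left( \sum \frac{d_{ij}^{2} \cdot \left( \rho_i \cdot (m_{ij} - p_{ij}^{x}) + (1 - \rho_i) \cdot (m_{ij} - p_{ij}^{y}) \right)}{\sigma_{ij}^{2}} \right),$$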
  • where the sum is taken over all reads on the forward strand and all relevant flows. (This is the projection of the residuals onto the discriminant vector d, weighted by the responsibility ρi of each read for each sequence.) The bias on the reads mapping to the reverse strand may be estimated similarly, except that the sum is then taken over all reads on the reverse strand and all relevant flows. A shrinkage term 1/t2 may be added to the denominator to shrink the bias estimate towards zero, thus incorporating the prior. The shrinkage term essentially acts as a tuning parameter for the precision of the bias prior. If set high, the approach reduces to the case without bias being postulated, as the bias becomes zero, as in the case of independent reads.
  • FIG. 1 illustrates an exemplary system for evaluating variant likelihood. The exemplary system includes an apparatus or sub-system for nucleic acid sequencing and/or analysis 11, a computing server/node/device 12 including a variant calling engine 14, and a display 16, which may be internal and/or external. The apparatus or sub-system for nucleic acid sequencing and/or analysis 11 may be any type of instrument that can generate nucleic acid sequence data from nucleic acid samples, which may include a nucleic acid sequencing instrument, a real-time/digital/quantitative PCR instrument, a microarray scanner, etc. The computing server/node/device 12 may be a workstation, mainframe computer, distributed computing node (part of a “cloud computing” or distributed networking system), personal computer, mobile device, etc. The computing server/node/device 12 may be configured to host a pre-variant calling processing engine 13, which may be configured to include various signal/data processing modules that may be configured to receive signal/data from the apparatus or sub-system for nucleic acid sequencing and/or analysis 11 and perform various processing steps, such as conversion from flow space to base space, determination of base calls, determination of base call quality values, preparation of read data for use by a mapping module, and/or alignment and/or mapping of reads to a reference sequence or genome, which may be a whole/partial genome, whole/partial exome, etc. The exemplary system may also include a client device terminal 17, which may include a data analysis API and may be communicatively connected to the computing server/node/device 12 via a network connection 18 that may be a “hardwired” physical network connection (e.g., Internet, LAN, WAN, VPN, etc.) or a wireless network connection (e.g., Wi-Fi, WLAN, etc.). 
The exemplary system may also include a post-variant calling processing engine 15, which may be configured to include various signal/data processing modules that may be configured to apply post-processing to variant calls, which may include annotating various variant calls and/or features, converting data from flow space to base space, filtering of variants (e.g., based on a minimum score threshold, a minimum number of reads including the variant, a minimum frequency of reads including the variant, a minimum mapping quality, a strand probability, and region filtering, for example), and formatting the variant data for display or use by client device terminal 17. In an embodiment, the apparatus or sub-system for nucleic acid sequencing and/or analysis 11 and the computing server/node/device 12 may be integrated into a single instrument or system comprising components present in a single enclosure 19. The client device terminal 17 may be configured to communicate information to and/or control the operation of the computing server/node/device 12 and its modules and/or operating parameters.
  • Sequencing Instrumentation
  • FIG. 2 illustrates components of an exemplary apparatus for nucleic acid sequencing. The components include a flow cell and sensor array 100, a reference electrode 108, a plurality of reagents 114, a valve block 116, a wash solution 110, a valve 112, a fluidics controller 118, lines 120/122/126, passages 104/109/111, a waste container 106, an array controller 124, and a user interface 128. The flow cell and sensor array 100 includes an inlet 102, an outlet 103, a microwell array 107, and a flow chamber 105 defining a flow path of reagents over the microwell array 107. The reference electrode 108 may be of any suitable type or shape, including a concentric cylinder with a fluid passage or a wire inserted into a lumen of passage 111. The reagents 114 may be driven through the fluid pathways, valves, and flow cell by pumps, gas pressure, or other suitable methods, and may be discarded into the waste container 106 after exiting the flow cell and sensor array 100. The reagents 114 may, for example, contain dNTPs to be flowed through passages 130 and through the valve block 116, which may control the flow of the reagents 114 to flow chamber 105 (also referred to herein as a reaction chamber) via passage 109. The system may include a reservoir 110 for containing a wash solution that may be used to wash away dNTPs, for example, that may have previously been flowed. The microwell array 107 may include an array of defined spaces, such as microwells, for example, that is operationally associated with a sensor array so that, for example, each microwell has a sensor suitable for detecting an analyte or reaction property of interest. The microwell array 107 may preferably be integrated with the sensor array as a single device or chip. The array controller 124 may provide bias voltages and timing and control signals to the sensor, and collect and/or process output signals. 
The user interface 128 may display information from the flow cell and sensor array 100 as well as instrument settings and controls, and allow a user to enter or set instrument settings and controls. The valve 112 may be shut to prevent any wash solution 110 from flowing into passage 109 as the reagents are flowing. Although the flow of wash solution may be stopped, there may still be uninterrupted fluid and electrical communication between the reference electrode 108, passage 109, and the sensor array 107. The distance between the reference electrode 108 and the junction between passages 109 and 111 may be selected so that little or no amount of the reagents flowing in passage 109 and possibly diffusing into passage 111 reach the reference electrode 108. In various embodiments, the fluidics controller 118 may be programmed to control driving forces for flowing reagents 114 and the operation of valve 112 and valve block 116 to deliver reagents to the flow cell and sensor array 100 according to a predetermined reagent flow ordering.
  • In this application, “defined space” generally refers to any space (which may be in one, two, or three dimensions) in which at least some of a molecule, fluid, and/or solid can be confined, retained and/or localized. The space may be a predetermined area (which may be a flat area) or volume, and may be defined, for example, by a depression or a micro-machined well in or associated with a microwell plate, microtiter plate, microplate, or a chip, or by isolated hydrophilic areas on a generally hydrophobic surface. Defined spaces may be arranged as an array, which may be a substantially planar one-dimensional or two-dimensional arrangement of elements such as sensors or wells. Defined spaces, whether arranged as an array or in some other configuration, may be in electrical communication with at least one sensor to allow detection or measurement of one or more detectable or measurable parameters or characteristics. The sensors may convert changes in the presence, concentration, or amounts of reaction by-products (or changes in ionic character of reactants) into an output signal, which may be registered electronically, for example, as a change in a voltage level or a current level which, in turn, may be processed to extract information or signal about a chemical reaction or desired association event, for example, a nucleotide incorporation event and/or a related ion concentration (e.g., a pH measurement). The sensors may include at least one ion sensitive field effect transistor (“ISFET”) or chemically sensitive field effect transistor (“chemFET”).
  • FIG. 3 illustrates an exemplary flow cell for nucleic acid sequencing. The flow cell 200 includes a microwell array 202, a sensor array 205, and a flow chamber 206 in which a reagent flow 208 may move across a surface of the microwell array 202, over open ends of microwells in the microwell array 202. The flow of reagents (e.g., nucleotide species) can be provided in any suitable manner, including delivery by pipettes, or through tubes or passages connected to a flow chamber. A microwell 201 in the microwell array 202 may have any suitable volume, shape, and aspect ratio. A sensor 214 in the sensor array 205 may be an ISFET or a chemFET sensor with a floating gate 218 having a sensor plate 220 separated from the microwell interior by a passivation layer 216, and may be predominantly responsive to (and generate an output signal related to) an amount of charge 224 present on the passivation layer 216 opposite of the sensor plate 220. Changes in the amount of charge 224 cause changes in the current between a source 221 and a drain 222 of the sensor 214, which may be used directly to provide a current-based output signal or indirectly with additional circuitry to provide a voltage output signal. Reactants, wash solutions, and other reagents may move into microwells primarily by diffusion 240. One or more analytical reactions to identify or determine characteristics or properties of an analyte of interest may be carried out in one or more microwells of the microwell array 202. Such reactions may generate directly or indirectly by-products that affect the amount of charge 224 adjacent to the sensor plate 220. In an embodiment, a reference electrode 204 may be fluidly connected to the flow chamber 206 via a flow passage 203. In an embodiment, the microwell array 202 and the sensor array 205 may together form an integrated unit forming a bottom wall or floor of the flow cell 200. 
In an embodiment, one or more copies of an analyte may be attached to a solid phase support 212, which may include microparticles, nanoparticles, beads, gels, and may be solid and porous, for example. The analyte may include one or more copies of a nucleic acid analyte obtained using any suitable technique.
  • FIG. 4 illustrates an exemplary process for label-free, pH-based sequencing. A template 682 with sequence 685 and a primer binding site 681 are attached to a solid phase support 680. The template 682 may be attached as a clonal population to a solid support, such as a microparticle or bead, for example, and may be prepared as disclosed in Leamon et al., U.S. Pat. No. 7,323,305. In an embodiment, the template may be associated with a substrate surface or present in a liquid phase with or without being coupled to a support. A primer 684 and DNA polymerase 686 are annealed to the template 682 so that the primer's 3′ end may be extended by a polymerase and that a polymerase is bound to such primer-template duplex (or in close proximity thereof) so that binding and/or extension may take place when dNTPs are added. In step 688, dNTP (shown as dATP) is added, and the DNA polymerase 686 incorporates a nucleotide “A” (since “T” is the next nucleotide in the template 682 and is complementary to the flowed dATP nucleotide). In step 690, a wash is performed. In step 692, the next dNTP (shown as dCTP) is added, and the DNA polymerase 686 incorporates a nucleotide “C” (since “G” is the next nucleotide in the template 682). More details about pH-based nucleic acid sequencing may be found in U.S. Pat. No. 7,948,015 and U.S. Pat. Appl. Publ. Nos. 2010/0137143, 2009/0026082, and 2010/0282617.
  • In an embodiment, the primer-template-polymerase complex may be subjected to a series of exposures of different nucleotides in a pre-determined sequence or ordering. If one or more nucleotides are incorporated, then the signal resulting from the incorporation reaction may be detected, and after repeated cycles of nucleotide addition, primer extension, and signal acquisition, the nucleotide sequence of the template strand may be determined. The output signals measured throughout this process depend on the number of nucleotide incorporations. Specifically, in each addition step, the polymerase extends the primer by incorporating added dNTP only if the next base in the template is complementary to the added dNTP. With each incorporation, a hydrogen ion is released, and collectively a population of released hydrogen ions changes the local pH of the reaction chamber. The production of hydrogen ions may be monotonically related to the number of contiguous complementary bases (e.g., homopolymers) in the template. Deliveries of nucleotides to a reaction vessel or chamber may be referred to as “flows” of nucleotide triphosphates (or dNTPs). For convenience, a flow of dATP will sometimes be referred to as “a flow of A” or “an A flow,” and a sequence of flows may be represented as a sequence of letters, such as “ATGT” indicating “a flow of dATP, followed by a flow of dTTP, followed by a flow of dGTP, followed by a flow of dTTP.” The predetermined ordering may be based on a cyclical, repeating pattern consisting of consecutive repeats of a short pre-determined reagent flow ordering (e.g., consecutive repeats of a pre-determined sequence of four nucleotide reagents such as, for example, “ACTG ACTG . . . ”), may be based in whole or in part on some other pattern of reagent flows (such as, e.g., any of the various reagent flow orderings discussed herein and/or in Hubbell et al., U.S. Pat. Appl. Publ. No. 2012/0264621, published Oct. 18, 2012, which is incorporated by reference herein in its entirety), and may also be based on some combination thereof.
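  • As an illustration of how a flow ordering maps a sequence onto per-flow incorporation counts, the following minimal sketch computes an idealized flow-space representation. It assumes complete, in-phase extension with no noise, and the function name `flowgram` and its parameters are hypothetical names chosen for this example rather than part of the described system.

```python
def flowgram(extended_sequence: str, flow_order: str, n_flows: int) -> list[int]:
    """Idealized homopolymer count incorporated at each flow.

    extended_sequence is the strand being synthesized (the complement of the
    template), so a flow of base B incorporates every consecutive B that is
    next in extended_sequence. Phasing and signal decay are ignored.
    """
    counts = []
    pos = 0  # index of the next base still to be incorporated
    for flow in range(n_flows):
        base = flow_order[flow % len(flow_order)]  # cyclical repeat of the ordering
        run = 0
        while pos < len(extended_sequence) and extended_sequence[pos] == base:
            run += 1
            pos += 1
        counts.append(run)  # 0 means no incorporation at this flow
    return counts


if __name__ == "__main__":
    # Repeating "ACTG" ordering applied to the extended strand "TTAG".
    print(flowgram("TTAG", "ACTG", 8))
```

  • In this idealized picture the signal at a flow is proportional to the returned count, which is why a 2-mer is expected to read roughly twice as high as a 1-mer.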
  • In various embodiments, output signals due to nucleotide incorporation may be processed, given knowledge of what nucleotide species were flowed and in what order to obtain such signals, to make base calls for the flows and compile consecutive base calls associated with a sample nucleic acid template into a read. A base call refers to a particular nucleotide identification (e.g., dATP (“A”), dCTP (“C”), dGTP (“G”), or dTTP (“T”)). Base calling may include performing one or more signal normalizations, signal phase and signal decay (e.g., enzyme efficiency loss) estimations, and signal corrections, and may identify or estimate base calls for each flow for each defined space. Any suitable base calling method may be used. Preferably, base calling may be performed as described in Davey et al., U.S. Pat. Appl. Publ. No. 2012/0109598, published on May 3, 2012, and/or Sikora et al., U.S. Pat. Appl. Publ. No. 2013/0060482, published on Mar. 7, 2012.
  • FIG. 5 illustrates an exemplary computer system. The computer system 501 includes a bus 502 or other communication mechanism for communicating information, a processor 503 coupled to the bus 502 for processing information, and a memory 505 coupled to the bus 502 for dynamically and/or statically storing information. The computer system 501 can also include one or more co-processors 504 coupled to the bus 502, such as GPUs and/or FPGAs, for performing specialized processing tasks; a display 506 coupled to the bus 502, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user; an input device 507 coupled to the bus 502, such as a keyboard including alphanumeric and other keys, for communicating information and command selections to the processor 503; a cursor control device 508 coupled to the bus 502, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to the processor 503 and for controlling cursor movement on display 506; and one or more storage devices 509 coupled to the bus 502, such as a magnetic disk or an optical disk, for storing information and instructions. The memory 505 may include a random access memory (RAM) or other dynamic storage device and/or a read only memory (ROM) or other static storage device. Such an exemplary computer system with suitable software may be used to perform the embodiments described herein.
  • More generally, in various embodiments, one or more features of the teachings and/or embodiments described herein may be performed or implemented using appropriately configured and/or programmed hardware and/or software elements.
  • Examples of hardware elements may include processors, microprocessors, input(s) and/or output(s) (I/O) device(s) (or peripherals) that are communicatively coupled via a local interface circuit, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. The local interface may include, for example, one or more buses or other wired or wireless connections, controllers, buffers (caches), drivers, repeaters and receivers, etc., to allow appropriate communications between hardware components. A processor is a hardware device for executing software, particularly software stored in memory. The processor can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer, a semiconductor based microprocessor (e.g., in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions. A processor can also represent a distributed processing architecture. The I/O devices can include input devices, for example, a keyboard, a mouse, a scanner, a microphone, a touch screen, an interface for various medical devices and/or laboratory instruments, a bar code reader, a stylus, a laser reader, a radio-frequency device reader, etc. Furthermore, the I/O devices also can include output devices, for example, a printer, a bar code printer, a display, etc. Finally, the I/O devices further can include devices that communicate as both inputs and outputs, for example, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, etc.
  • Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Software in memory may include one or more separate programs, which may include ordered listings of executable instructions for implementing logical functions. The software in memory may include a system for identifying data streams in accordance with the present teachings and any suitable custom made or commercially available operating system (O/S), which may control the execution of other computer programs such as the system, and provide scheduling, input-output control, file and data management, memory management, communication control, etc.
  • According to various embodiments, one or more features of teachings and/or embodiments described herein may be performed or implemented using appropriately configured and/or programmed non-transitory machine-readable medium or article that may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, scientific or laboratory instrument, etc., and may be implemented using any suitable combination of hardware and/or software. The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, read-only memory compact disc (CD-ROM), recordable compact disc (CD-R), rewriteable compact disc (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disc (DVD), a tape, a cassette, etc., including any medium suitable for use in a computer. Memory can include any one or a combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, EPROM, EEROM, Flash memory, hard drive, tape, CDROM, etc.). Moreover, memory can incorporate electronic, magnetic, optical, and/or other types of storage media. Memory can have a distributed architecture where various components are situated remote from one another, but are still accessed by the processor. 
The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, etc., implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
  • According to various embodiments, one or more features of the teachings and/or embodiments described herein may be performed or implemented at least partly using a distributed, clustered, remote, or cloud computing resource.
  • FIG. 6 illustrates an exemplary method for evaluating variant likelihood. In step 601, a user or component provides a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array. In step 602, a user or component exposes the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order. In step 603, a server or other computing means or resource obtains measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces. The measured values may include voltage data indicative of hydrogen ion concentrations, which may be processed and analyzed to yield sequences of base calls for the reads, which may in turn be aligned and mapped. In step 604, the server or other computing means or resource evaluates a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands.
  • FIG. 7A illustrates an exemplary plot of sample frequency. The y-axis shows the posterior density of the sample frequency $\pi$ conditional on the other values after the maximization algorithm has converged. The x-axis shows the major frequency. The density is derived from maximization of the ensemble likelihood of all reads as described herein by estimating $\pi$, $\rho_i$, $\sigma_{ij}^2$, and the two biases. The density is scaled so that the maximum likelihood is 1.0 (which occurs at 100% frequency of the variant allele).
  • FIG. 7B illustrates an exemplary plot of responsibility for each read. The y-axis shows the responsibility ρi for each read, calculated at the maximum likelihood estimate for the sample frequency. The x-axis shows integers corresponding to the reads. Two data points are plotted per read: one is the responsibility for reference (triangle, near 100%), and one is the responsibility for variant allele (cross, near 0%). (Not shown is the outlier case, which makes these two responsibilities not quite add to 1.0.) The strand for the read is indicated by gray (reverse) or black (forward).
  • FIG. 8A illustrates exemplary bias terms for each strand for an ensemble of reads. The graph shows the computed bias terms for each strand for the ensemble of reads. The y-axis shows the reverse bias. The x-axis shows the forward bias. In this case, there is exactly one point, indicating that when the procedure has converged on this set of reads for this variant, the estimate of bias on the forward strand is approximately −0.5 and the estimate of bias on the reverse strand is close to 0. This ensemble bias may then be applied per read per flow as discussed above.
  • FIG. 8B illustrates exemplary variance components corresponding to homopolymers. The y-axis shows the standard deviation (square root of the variance) for the variance components fitted by the model and used to estimate the variance for each flow in each read by interpolation. The x-axis shows intensities corresponding to homopolymers scaled so that a completely in-phase 1-mer should read 1.0. (When there is insufficient data, the graph shows the prior estimate for this estimate of variation.)
  • FIG. 9A illustrates predicted/measured data ratios for a sequence of nucleotide flows. The y-axis shows the predicted/measured signal ratio for reads on the forward strand (in black). The x-axis shows a series of nucleotide flows as discussed herein (e.g., a flow of dTTP, followed by a flow of dATP, followed by a flow of dCTP, etc.). The y-axis values are normalized based on the first flows corresponding to a key (see T, C, and A at flows 1, 3, and 6 with the final G of the key combining with the G in the read to form a 2-mer G at flow 8) so that the key values are at 1.0. There is a 1-mer A at flow 13 and a 2-mer C at flow 15. At flow 14 there should be a 2-mer G, however, the value is about half-way between a 1-mer and a 2-mer. At such an early position in the read there should not be “phase” effects reducing the signal. Further, this can be observed across many reads at the same sequence position (and flow) and may thus not be due to noise in any one well. Modified predictions for the reference sequence are shown in cyan triangles; predictions for the intensity that would be observed if the true underlying sequence for reads were a deleted G are shown as blue crosses. As can be observed, the modified predictions fit the measurements quite well.
  • FIG. 9B illustrates residual values for the same data as in FIG. 9A. The y-axis shows the residuals δij representing differences between measured and predicted values. The x-axis shows a series of nucleotide flows as discussed herein. The middle line, with a value of 0, means that the model applied to the reference sequence would predict the intensity correctly. Dots placed off the middle line indicate the measurement-prediction difference is large in the original system. The corrected predictions are indicated by the cyan line for reference (passing through the middle of the residuals); the dotted blue line shows the predictions for the variant allele (deleted G). This strand has a bias on the positive strand of −0.58—nearly 60% of a 1-mer. It is therefore unsurprising that this strand is reporting deletion alleles.
  • FIG. 9C illustrates predicted/measured data ratios for a sequence of nucleotide flows. The y-axis shows the predicted/measured signal ratio for reads overlapping the same sequence portion as in FIG. 9A but on the reverse strand (in red). The x-axis shows a series of nucleotide flows as discussed herein (e.g., a flow of dATP, followed by a flow of dCTP, followed by a flow of dGTP, etc.). On this strand, the 2-mer C in flow 22 (corresponding to the G on the forward strand in flow 14) has a similar signal to the 2-mer G in flow 19 (corresponding to the C on the forward strand in flow 15). Modified predictions for the reference sequence are shown in cyan triangles; predictions for the intensity that would be observed if the true underlying sequence for reads were a deleted G are shown as blue crosses.
  • FIG. 9D illustrates residual values for the same data as in FIG. 9C. The y-axis shows the residuals δij representing differences between measured and predicted values. The x-axis shows a series of nucleotide flows as discussed herein. Here, the modified prediction line (blue) for reference lies along the horizontal axis, indicating that the predictions are on average centered around the reference value. Unsurprisingly, there are no deletion alleles originally called on this strand.
  • According to an exemplary embodiment, there is provided a method for evaluating variant likelihood in nucleic acid sequencing, comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species.
  • According to an exemplary embodiment, there is provided a method for evaluating variant likelihood in nucleic acid sequencing, comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands.
  • According to an exemplary embodiment, there is provided a method for evaluating variant likelihood in nucleic acid sequencing, comprising: (a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array; (b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; (c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
  • In such a method, modifying the at least some model-predicted values may comprise applying a transformation including a product of (i) one of the first and second biases and (ii) a discriminant vector representing a difference between model-predicted values corresponding to different hypothesized sequences.
  • In such a method, modifying the at least some model-predicted values may comprise applying a transformation $q_j^x = p_j^x + \beta_f \cdot d_j$ for forward strands and a transformation $q_j^x = p_j^x + \beta_r \cdot d_j$ for reverse strands, where: $q_j^x$ represents modified model-predicted values for a given sequencing read at flow $j$ under hypothesized sequence $x$; $\beta_f$ denotes the first bias for forward strands; $\beta_r$ denotes the second bias for reverse strands; $d_j = p_j^x - p_j^y$ represents a discriminant vector for the given sequencing read; $p_j^x$ and $p_j^y$ represent predicted values for the given sequencing read at flow $j$ under hypothesized sequences $x$ and $y$, respectively; $j$ is an integer between 1 and $M$; and $M$ represents a number of flows.
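  • The strand-specific transformation can be sketched in a few lines. This is an illustrative sketch only: the function name `apply_strand_bias` and the example numbers are assumptions made for demonstration, and predictions are represented as plain lists indexed by flow.

```python
def apply_strand_bias(p_x, p_y, beta_f, beta_r, is_forward):
    """Return modified predictions q_j = p_j^x + beta * d_j for one read.

    p_x, p_y: per-flow predicted values under hypothesized sequences x and y.
    beta_f, beta_r: bias terms for forward and reverse strands.
    is_forward: True if the read lies on the forward strand.
    """
    beta = beta_f if is_forward else beta_r
    # d_j = p_j^x - p_j^y is non-zero only at flows where the hypotheses differ,
    # so the bias moves the prediction only at discriminating flows.
    return [px + beta * (px - py) for px, py in zip(p_x, p_y)]


if __name__ == "__main__":
    p_x = [1.0, 2.0, 1.0]   # hypothetical predictions under sequence x
    p_y = [1.0, 1.0, 1.0]   # hypothetical predictions under sequence y
    print(apply_strand_bias(p_x, p_y, beta_f=-0.5, beta_r=0.0, is_forward=True))
    print(apply_strand_bias(p_x, p_y, beta_f=-0.5, beta_r=0.0, is_forward=False))
```

  • With a forward bias of −0.5 the prediction at the discriminating flow moves halfway from the x prediction toward the y prediction, while a reverse bias of 0 leaves the reverse-strand prediction unchanged.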
  • In such a method, evaluating the likelihood may further comprise assigning a first frequency to a variant sequence and a second frequency to a non-variant sequence, and calculating a likelihood of having observed the ensemble of sequencing reads conditioned on the first frequency as a function of a product of the likelihoods of having observed each of the sequencing reads given the first frequency. The likelihood of having observed the ensemble of sequencing reads may be determined using an expression comprising:
  • $$P(\pi) \cdot \prod_{i=1}^{N} \left( \pi \cdot \varepsilon_i + (1-\pi) \cdot (1-\varepsilon_i) \right)$$
  • where: π represents the first frequency; 1−π represents the second frequency; P(π) represents a prior probability of π; εi represents the measurement confidence value for sequencing read i; i is an integer between 1 and N identifying a sequencing read; and N is an integer specifying a number of sequencing reads in the ensemble.
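  • A minimal sketch of evaluating this ensemble likelihood over candidate frequencies follows. It works in log space to avoid underflow for large ensembles, assumes a flat prior unless one is supplied, and locates the maximum by a simple grid search; the function names, the grid, and the example confidence values are assumptions made for illustration.

```python
import math


def ensemble_log_likelihood(pi, eps, log_prior=None):
    """log of P(pi) * prod_i (pi*eps_i + (1-pi)*(1-eps_i)).

    eps: per-read measurement confidence values in (0, 1).
    log_prior: optional function returning log P(pi); flat prior if omitted.
    """
    total = log_prior(pi) if log_prior is not None else 0.0
    for e in eps:
        total += math.log(pi * e + (1.0 - pi) * (1.0 - e))
    return total


def best_frequency(eps, steps=100):
    """Grid search over interior frequencies 1/steps ... (steps-1)/steps."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda pi: ensemble_log_likelihood(pi, eps))


if __name__ == "__main__":
    eps = [0.9, 0.9, 0.1]  # hypothetical confidences for three reads
    print(ensemble_log_likelihood(0.5, eps))
    print(best_frequency(eps))
```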
  • In such a method, evaluating the likelihood may further comprise assigning a first frequency to a variant sequence, a second frequency to a non-variant sequence, and a third frequency to an outlier event. The outlier event may have a flat density across all sequencing reads in the ensemble. Evaluating the likelihood may further comprise calculating a likelihood of having observed the ensemble of sequencing reads conditioned on the third frequency as a function of a product of the likelihoods of having observed each of the sequencing reads given the third frequency. The likelihood of having observed the ensemble of sequencing reads may be determined using an expression comprising
  • $$P(\pi) \cdot \prod_{i=1}^{N} \left( \tau \cdot \left( \pi \cdot \varepsilon_i + (1-\pi) \cdot (1-\varepsilon_i) \right) + (1-\tau) \cdot T \right)$$
  • where: π represents the first frequency; 1−π represents the second frequency; (1−τ) represents the third frequency; T represents a density of the outlier event; P(π) represents a prior probability of π; εi represents a measurement confidence value for sequencing read i; i is an integer between 1 and N identifying a sequencing read; and N is an integer specifying a number of sequencing reads in the ensemble.
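  • The effect of the outlier term can be seen in a short sketch of the per-read mixture. The function names and the numeric values below are hypothetical, and the prior is taken to be flat for simplicity.

```python
import math


def read_likelihood_with_outlier(pi, eps_i, tau, outlier_density):
    """tau*(pi*eps_i + (1-pi)*(1-eps_i)) + (1-tau)*T for a single read."""
    two_allele = pi * eps_i + (1.0 - pi) * (1.0 - eps_i)
    return tau * two_allele + (1.0 - tau) * outlier_density


def ensemble_log_likelihood_with_outlier(pi, eps, tau, outlier_density):
    """Sum of per-read log-likelihoods (flat prior on pi assumed)."""
    return sum(
        math.log(read_likelihood_with_outlier(pi, e, tau, outlier_density))
        for e in eps
    )


if __name__ == "__main__":
    # With tau = 1 the outlier component vanishes and the two-allele form is recovered.
    print(read_likelihood_with_outlier(0.3, 0.8, tau=1.0, outlier_density=0.1))
    # With tau < 1 every read keeps a floor of (1 - tau) * T.
    print(read_likelihood_with_outlier(0.5, 0.2, tau=0.9, outlier_density=0.1))
```

  • Because each read retains a minimum likelihood of (1−τ)·T, a single read that fits neither hypothesis cannot drive the ensemble product toward zero.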
  • In such a method, the measurement confidence values may be estimated using a function comprising a sum of log-likelihood of values measured for a given flow given a hypothesized sequence. The measurement confidence values may further be estimated using a function comprising differences between the measured values and the model-predicted values. The differences between measured and model-predicted values at each nucleotide flow may be assumed to follow independent normal distributions each having a mean and a variance. The differences between measured and model-predicted values at each nucleotide flow may also be assumed to follow independent t-distributions.
  • In such a method, the measurement confidence values may be estimated using an expression comprising
  • $$\varepsilon_i = \frac{1}{1 + \exp(LL_{yi} - LL_{xi})}$$
  • where LLyi and LLxi are log-likelihoods of values measured for a given sequencing read under hypothesized sequences y and x, respectively. The log-likelihoods of values measured for a given sequencing read under hypothesized sequences y and x may be expressed as
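  • $$LL_{xi} = \sum_{j=1}^{M} \left( -\ln(\sigma_{ij}^{x}) + \frac{(\delta_{ij}^{x})^2}{2 \cdot (\sigma_{ij}^{x})^2} \right) \quad \text{and} \quad LL_{yi} = \sum_{j=1}^{M} \left( -\ln(\sigma_{ij}^{y}) + \frac{(\delta_{ij}^{y})^2}{2 \cdot (\sigma_{ij}^{y})^2} \right)$$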
  • where $\delta_{ij}^x = m_{ij}^x - p_{ij}^x$ and $\delta_{ij}^y = m_{ij}^y - p_{ij}^y$, where $m_{ij}^x$ and $m_{ij}^y$ represent measured values for read $i$ at flow $j$ under hypothesized sequences $x$ and $y$, respectively, where $p_{ij}^x$ and $p_{ij}^y$ represent predicted values for read $i$ at flow $j$ under hypothesized sequences $x$ and $y$, respectively, where $\sigma_{ij}^x$ and $\sigma_{ij}^y$ are the standard deviations of independently distributed normal distributions for read $i$ at flow $j$ under hypothesized sequences $x$ and $y$, respectively, where $i$ is an integer identifying a sequencing read, and where $M$ represents a number of flows.
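  • A minimal sketch of computing a measurement confidence value from per-flow residuals follows. It assumes the conventional Gaussian log-density, in which the squared-residual term lowers the log-likelihood so that a better-fitting hypothesis receives the larger value; that sign convention, the function names, and the example data are assumptions of this sketch.

```python
import math


def gaussian_log_likelihood(measured, predicted, sigma):
    """Sum over flows of -ln(sigma_j) - delta_j^2 / (2 * sigma_j^2).

    Constant terms (-0.5 * ln(2*pi) per flow) are omitted because they cancel
    in the difference LL_y - LL_x.
    """
    ll = 0.0
    for m, p, s in zip(measured, predicted, sigma):
        delta = m - p  # residual between measured and predicted value
        ll += -math.log(s) - (delta * delta) / (2.0 * s * s)
    return ll


def measurement_confidence(ll_x, ll_y):
    """eps = 1 / (1 + exp(LL_y - LL_x)), evaluated without overflow."""
    diff = ll_y - ll_x
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    measured = [1.0, 2.0]          # hypothetical normalized signals
    pred_x, pred_y = [1.0, 2.0], [1.0, 1.0]
    sigma = [0.2, 0.2]
    ll_x = gaussian_log_likelihood(measured, pred_x, sigma)
    ll_y = gaussian_log_likelihood(measured, pred_y, sigma)
    print(measurement_confidence(ll_x, ll_y))
```

  • Under this convention a value near 1 indicates that the read's measurements are much better explained by hypothesized sequence x than by y, and a value of 0.5 indicates that the read does not discriminate between them.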
  • In such a method, the measurement confidence values may be estimated using an expression for responsibility comprising
  • $$\rho_i = \frac{\pi}{\pi + (1-\pi) \cdot \exp(LL_{yi} - LL_{xi})}$$
  • where π represents a first frequency assigned to a variant sequence, 1−π represents a second frequency assigned to a non-variant sequence, and ρi represents a measure of responsibility for each of the sequencing reads in the ensemble. In the responsibility framework, the variance may then be estimated using an expression comprising
  • $$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} \left( \rho_i \sum_{j=1}^{M} (\delta_{ij}^{x})^2 + (1-\rho_i) \sum_{j=1}^{M} (\delta_{ij}^{y})^2 \right)}{N}$$
  • where $\delta_{ij}^x = m_{ij}^x - p_{ij}^x$ and $\delta_{ij}^y = m_{ij}^y - p_{ij}^y$, where $m_{ij}^x$ and $m_{ij}^y$ represent measured values of read $i$ at flow $j$ under hypothesized sequences $x$ and $y$, respectively, where $p_{ij}^x$ and $p_{ij}^y$ represent predicted values for read $i$ at flow $j$ under hypothesized sequences $x$ and $y$, respectively, where $M$ represents a number of flows and $N$ is an integer specifying a number of sequencing reads, and where
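  • $$\rho_i = \frac{\pi}{\pi + (1-\pi) \cdot \exp(LL_{yi} - LL_{xi})} \quad \text{and} \quad LL_{xi} = \sum_{j=1}^{M} \left( -\ln(\sigma_{ij}^{x}) + \frac{(\delta_{ij}^{x})^2}{2 \cdot (\sigma_{ij}^{x})^2} \right) \quad \text{and} \quad LL_{yi} = \sum_{j=1}^{M} \left( -\ln(\sigma_{ij}^{y}) + \frac{(\delta_{ij}^{y})^2}{2 \cdot (\sigma_{ij}^{y})^2} \right).$$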
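  • The responsibility and the responsibility-weighted variance estimate can be sketched as follows. The function names and example residuals are hypothetical, the log-likelihoods are taken as given inputs, and the frequency is assumed to lie strictly between 0 and 1.

```python
import math


def responsibility(pi, ll_x, ll_y):
    """rho = pi / (pi + (1 - pi) * exp(LL_y - LL_x)), for 0 < pi < 1."""
    diff = ll_y - ll_x
    if diff > 0:
        # Multiply numerator and denominator by exp(-diff) to avoid overflow.
        w = math.exp(-diff)
        return pi * w / (pi * w + (1.0 - pi))
    return pi / (pi + (1.0 - pi) * math.exp(diff))


def pooled_variance(rho, delta_x, delta_y):
    """Responsibility-weighted sum of squared residuals, divided by N reads.

    delta_x[i][j], delta_y[i][j]: residuals for read i at flow j under
    hypothesized sequences x and y.
    """
    n_reads = len(rho)
    total = 0.0
    for r, dx, dy in zip(rho, delta_x, delta_y):
        total += r * sum(d * d for d in dx) + (1.0 - r) * sum(d * d for d in dy)
    return total / n_reads


if __name__ == "__main__":
    print(responsibility(0.5, ll_x=-1.0, ll_y=-1.0))
    rho = [1.0, 0.0]  # first read attributed to x, second to y
    print(pooled_variance(rho, [[1.0, 1.0], [5.0, 5.0]], [[9.0, 9.0], [2.0, 0.0]]))
```

  • Weighting by responsibility means that each read contributes mainly the residuals of the hypothesis it most plausibly came from, so reads that fit x well do not inflate the variance through their large residuals under y.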
  • In such a method, the variance may be estimated by decomposition of the variance in a flow and sequencing read into underlying latent components. The decomposition may comprise an expression $\sigma_{ij}^2 = \sum_{m=1}^{K} (\phi_{ijm} \cdot \sigma_m^2)$ where $\sigma_m^2$ is a latent component and $\phi_{ijm}$ is a proportionality constant determining an amount of variation contributed by the $m$th latent component. The method may further comprise updating the decomposition using an expression comprising
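  • $$\sigma_{k,n}^2 = \frac{\sum (\omega_{ij} \cdot r_{ij}^2)}{\sum (\phi_{ijm})} \quad \text{where} \quad \omega_{ij} = \frac{\phi_{ijm} \cdot \sigma_{k,n-1}^2}{\sum (\phi_{ijm} \cdot \sigma_{m,n-1}^2)}$$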
  • for the k-th component using the proportion of variance attributed to each flow. In some cases, each latent component may correspond to a homopolymer having an integer length. In another case, the latent components may include a null variance component representing contribution to a flow regardless of any nucleotide incorporation, a residual variance component representing contribution for nucleotide incorporations not explicitly modeled, and one or more additional variance components. The one or more additional variance components may comprise variance components associated with homopolymers having an integer length. In some cases, the latent components may be estimated using an EM methodology and a method of moments approximation.
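  • The decomposition of a per-flow variance into latent components can be illustrated with a short sketch in which each latent component corresponds to a homopolymer of integer length. The use of linear interpolation between the two nearest integer lengths to obtain the proportionality constants is an assumption made for this example, as are the function names and the numeric component values.

```python
import math


def interpolation_weights(intensity, n_components):
    """Weights phi over components 0..n_components-1 for one flow.

    The predicted intensity (scaled so a 1-mer reads 1.0) is split linearly
    between the two nearest integer homopolymer lengths.
    """
    x = min(max(intensity, 0.0), float(n_components - 1))  # clamp to valid range
    lo = int(math.floor(x))
    hi = min(lo + 1, n_components - 1)
    frac = x - lo
    weights = [0.0] * n_components
    weights[lo] += 1.0 - frac
    weights[hi] += frac
    return weights


def flow_variance(weights, latent_variances):
    """sigma_ij^2 = sum_m phi_ijm * sigma_m^2."""
    return sum(w * s for w, s in zip(weights, latent_variances))


if __name__ == "__main__":
    latent = [0.01, 0.04, 0.09, 0.16]  # hypothetical variances for 0-mer..3-mer
    phi = interpolation_weights(1.5, len(latent))
    print(phi, flow_variance(phi, latent))
```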
  • In such a method, evaluating the likelihood may comprise estimating (i) a first frequency $\pi$ assigned to a variant sequence, (ii) at least one of a measurement confidence value $\varepsilon_i$ for each of the sequencing reads in the ensemble and a measure of responsibility $\rho_i$ for each of the sequencing reads in the ensemble, and (iii) a variance $\sigma_{ij}^2$ for each of the flows and sequencing reads in the ensemble. Evaluating the likelihood may further comprise estimating $\pi$, $\varepsilon_i$ (or $\rho_i$), and $\sigma_{ij}^2$ individually conditional on the values of the others.
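  • One way to estimate each quantity conditional on the others is an alternating loop, sketched below under simplifying assumptions that are not taken from the description: a single variance shared by all flows and reads, a flat prior, an update of the frequency as the mean responsibility, the conventional Gaussian log-density (negative quadratic term), and fixed residuals with no bias terms. All names and data are hypothetical.

```python
import math


def log_likelihood(residuals, sigma2):
    """Gaussian log-likelihood of one read's residuals under a shared variance."""
    sigma = math.sqrt(sigma2)
    return sum(-math.log(sigma) - d * d / (2.0 * sigma2) for d in residuals)


def responsibility(pi, ll_x, ll_y):
    """rho = pi / (pi + (1 - pi) * exp(LL_y - LL_x)), evaluated stably."""
    diff = ll_y - ll_x
    if diff > 0:
        w = math.exp(-diff)
        return pi * w / (pi * w + (1.0 - pi))
    return pi / (pi + (1.0 - pi) * math.exp(diff))


def fit_ensemble(delta_x, delta_y, n_iter=50):
    """Alternate between responsibilities, frequency, and variance."""
    n = len(delta_x)
    pi, sigma2 = 0.5, 1.0  # starting values (assumed)
    rho = [0.5] * n
    for _ in range(n_iter):
        # Responsibilities given the current frequency and variance.
        rho = [
            responsibility(pi, log_likelihood(dx, sigma2), log_likelihood(dy, sigma2))
            for dx, dy in zip(delta_x, delta_y)
        ]
        # Frequency given responsibilities, kept strictly inside (0, 1).
        pi = min(max(sum(rho) / n, 1e-6), 1.0 - 1e-6)
        # Variance given responsibilities, floored to avoid division by zero.
        sigma2 = sum(
            r * sum(d * d for d in dx) + (1.0 - r) * sum(d * d for d in dy)
            for r, dx, dy in zip(rho, delta_x, delta_y)
        ) / n
        sigma2 = max(sigma2, 1e-6)
    return pi, rho, sigma2


if __name__ == "__main__":
    # Three reads fit x closely; one read fits y closely.
    delta_x = [[0.1, 0.0], [0.1, 0.0], [0.1, 0.0], [1.0, 0.0]]
    delta_y = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.1, 0.0]]
    print(fit_ensemble(delta_x, delta_y))
```

  • In this toy ensemble the loop settles on a frequency close to 0.75 for sequence x, with each read assigned almost entirely to the hypothesis it fits.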
  • According to an exemplary embodiment, there is provided a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for evaluating variant likelihood in nucleic acid sequencing comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some template polynucleotide strands in at least one defined space, wherein a plurality of template polynucleotide strands, sequencing primers, and polymerase have been provided in a plurality of defined spaces disposed on a sensor array, and wherein the plurality of template polynucleotide strands, sequencing primers, and polymerase have been exposed to a series of flows of nucleotide species according to a predetermined order; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
  • According to an exemplary embodiment, there is provided a system for evaluating variant likelihood in nucleic acid sequencing, including: a plurality of template polynucleotide strands, sequencing primers, and polymerase provided in a plurality of defined spaces disposed on a sensor array; an apparatus configured to expose the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for evaluating variant likelihood, comprising: (a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and (b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising: (i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and (ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
  • Unless otherwise specifically designated herein, terms, techniques, and symbols of biochemistry, cell biology, genetics, molecular biology, nucleic acid chemistry, nucleic acid sequencing, and organic chemistry used herein follow those of standard treatises and texts in the relevant field.
  • Although the present description describes certain embodiments in detail, other embodiments are also possible and within the scope of the present invention. For example, those skilled in the art may appreciate from the present description that the present teachings may be implemented in a variety of forms, and that the various embodiments may be implemented alone or in combination. Variations and modifications will be apparent to those skilled in the art from consideration of the specification and figures and practice of the teachings described in the specification and figures, and the claims.

Claims (20)

1. A method for evaluating variant likelihood in nucleic acid sequencing, comprising:
(a) providing a plurality of template polynucleotide strands, sequencing primers, and polymerase in a plurality of defined spaces disposed on a sensor array;
(b) exposing the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order;
(c) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and
(d) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising:
(i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and
(ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
2. The method of claim 1, wherein modifying the at least some model-predicted values comprises applying a transformation including a product of (i) one of the first and second biases and (ii) a discriminant vector representing a difference between model-predicted values corresponding to different hypothesized sequences.
3. The method of claim 1, wherein evaluating the likelihood further comprises assigning a first frequency to a variant sequence and a second frequency to a non-variant sequence, and calculating a likelihood of having observed the ensemble of sequencing reads conditioned on the first frequency as a function of a product of the likelihoods of having observed each of the sequencing reads given the first frequency.
4. The method of claim 1, wherein evaluating the likelihood further comprises assigning a first frequency to a variant sequence, a second frequency to a non-variant sequence, and a third frequency to an outlier event.
5. The method of claim 4, wherein the outlier event has a flat density across all sequencing reads in the ensemble.
6. The method of claim 4, wherein evaluating the likelihood further comprises calculating a likelihood of having observed the ensemble of sequencing reads conditioned on the third frequency as a function of a product of the likelihoods of having observed each of the sequencing reads given the third frequency.
7. The method of claim 1, wherein the measurement confidence values are estimated using a function comprising a sum of log-likelihood of values measured for a given flow given a hypothesized sequence.
8. The method of claim 7, wherein the measurement confidence values are estimated using a function comprising differences between the measured values and the model-predicted values.
9. The method of claim 8, wherein the differences between measured and model-predicted values at each nucleotide flow are assumed to follow independent normal distributions each having a mean and a variance.
10. The method of claim 9, wherein the differences between measured and model-predicted values at each nucleotide flow are assumed to follow independent t-distributions.
11. The method of claim 1, wherein the measurement confidence values are estimated using an expression comprising
$$\varepsilon_i = \frac{1}{1 + \exp(LL_{yi} - LL_{xi})}$$
where $LL_{yi}$ and $LL_{xi}$ are log-likelihoods of values measured for a given sequencing read under hypothesized sequences y and x, respectively.
12. The method of claim 11, wherein the measurement confidence values are estimated using an expression for responsibility comprising
$$\rho_i = \frac{\pi}{\pi + (1-\pi)\,\exp(LL_{yi} - LL_{xi})}$$
where $\pi$ represents a first frequency assigned to a variant sequence, $1-\pi$ represents a second frequency assigned to a non-variant sequence, and $\rho_i$ represents a measure of responsibility for each of the sequencing reads in the ensemble.
13. The method of claim 11, where the variance is estimated by decomposition of the variance in a flow and sequencing read into underlying latent components.
14. The method of claim 13, wherein each latent component corresponds to a homopolymer having an integer length.
15. The method of claim 13, wherein the latent components include a null variance component representing contribution to a flow regardless of any nucleotide incorporation, a residual variance component representing contribution for nucleotide incorporations not explicitly modeled, and one or more additional variance components.
16. The method of claim 15, wherein the one or more additional variance components comprise variance components associated with homopolymers having an integer length.
17. The method of claim 13, wherein the latent components are estimated using an EM methodology and a method of moments approximation.
18. The method of claim 1, wherein evaluating the likelihood comprises estimating (i) a first frequency $\pi$ assigned to a variant sequence, (ii) at least one of a measurement confidence value $\varepsilon_i$ for each of the sequencing reads in the ensemble and a measure of responsibility $\rho_i$ for each of the sequencing reads in the ensemble, and (iii) a variance $\sigma_{ij}^{2}$ for each of the flows and sequencing reads in the ensemble.
19. A non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for evaluating variant likelihood in nucleic acid sequencing comprising:
(a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some template polynucleotide strands in at least one defined space, wherein a plurality of template polynucleotide strands, sequencing primers, and polymerase have been provided in a plurality of defined spaces disposed on a sensor array, and wherein the plurality of template polynucleotide strands, sequencing primers, and polymerase have been exposed to a series of flows of nucleotide species according to a predetermined order; and
(b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising:
(i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and
(ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
20. A system for evaluating variant likelihood in nucleic acid sequencing, including:
a plurality of template polynucleotide strands, sequencing primers, and polymerase provided in a plurality of defined spaces disposed on a sensor array;
an apparatus configured to expose the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order;
a machine-readable memory; and
a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for evaluating variant likelihood, comprising:
(a) obtaining measured values corresponding to an ensemble of sequencing reads for at least some of the template polynucleotide strands in at least one of the defined spaces; and
(b) evaluating a likelihood that a variant sequence is present given the measured values corresponding to the ensemble of sequencing reads, the evaluating comprising:
(i) determining a measurement confidence value for each read in the ensemble of sequencing reads, wherein the determining is based on variations between the measured values and model-predicted values for hypothesized sequences obtained using a predictive model of nucleotide incorporations responsive to flows of nucleotide species; and
(ii) modifying at least some model-predicted values using a first bias for forward strands and a second bias for reverse strands, wherein the modifying is based on variations between model-predicted values for different hypothesized sequences obtained using the predictive model of nucleotide incorporations responsive to flows of nucleotide species.
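
For illustration only, the strand-specific modification recited in claim 2 might be sketched as below. The sketch assumes that the transformation is additive, that is, the model-predicted values are shifted by the product of a strand-specific bias and the discriminant vector. The function names, bias values, and predicted values are hypothetical and do not come from the claims.

```python
def discriminant(pred_a, pred_b):
    """Per-flow difference between predictions for two hypothesized sequences."""
    return [a - b for a, b in zip(pred_a, pred_b)]


def apply_strand_bias(predicted, disc, is_forward, bias_forward, bias_reverse):
    """Shift predictions by (strand bias) * (discriminant vector)."""
    bias = bias_forward if is_forward else bias_reverse
    return [p + bias * d for p, d in zip(predicted, disc)]


# Hypothetical per-flow predictions for a variant and a reference sequence.
pred_variant = [1.0, 2.0, 0.0]
pred_reference = [1.0, 1.0, 1.0]
disc = discriminant(pred_variant, pred_reference)  # [0.0, 1.0, -1.0]

# A forward-strand read uses the first bias; a reverse-strand read the second.
forward_adjusted = apply_strand_bias(pred_reference, disc, True, 0.1, -0.2)
reverse_adjusted = apply_strand_bias(pred_reference, disc, False, 0.1, -0.2)
```

Flows at which the two hypothesized sequences predict the same value have a zero discriminant entry and are left unchanged by either bias.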
US14/200,942 (US20140296080A1): Methods, Systems, and Computer Readable Media for Evaluating Variant Likelihood. Priority date 2013-03-14; filing date 2014-03-07; status: Abandoned.

Priority Applications (3)

Application Number | Publication Number | Priority Date | Filing Date | Title
US14/200,942 | US20140296080A1 (en) | 2013-03-14 | 2014-03-07 | Methods, Systems, and Computer Readable Media for Evaluating Variant Likelihood
US15/974,976 | US11636919B2 (en) | 2013-03-14 | 2018-05-09 | Methods, systems, and computer readable media for evaluating variant likelihood
US18/130,134 | US20230360726A1 (en) | 2013-03-14 | 2023-04-03 | Methods, systems, and computer readable media for evaluating variant likelihood

Applications Claiming Priority (2)

Application Number | Publication Number | Priority Date | Filing Date | Title
US201361782240P | | 2013-03-14 | 2013-03-14 |
US14/200,942 | US20140296080A1 (en) | 2013-03-14 | 2014-03-07 | Methods, Systems, and Computer Readable Media for Evaluating Variant Likelihood

Related Child Applications (2)

Application Number | Relation | Publication Number | Priority Date | Filing Date | Title
US15/974,976 | Continuation | US11636919B2 (en) | 2013-03-14 | 2018-05-09 | Methods, systems, and computer readable media for evaluating variant likelihood
US15/974,976 | Division | US11636919B2 (en) | 2013-03-14 | 2018-05-09 | Methods, systems, and computer readable media for evaluating variant likelihood

Publications (1)

Publication Number | Publication Date
US20140296080A1 (en) | 2014-10-02

Family

ID=51621420

Family Applications (3)

Application Number | Status | Publication Number | Priority Date | Filing Date | Title
US14/200,942 | Abandoned | US20140296080A1 (en) | 2013-03-14 | 2014-03-07 | Methods, Systems, and Computer Readable Media for Evaluating Variant Likelihood
US15/974,976 | Active 2037-11-23 | US11636919B2 (en) | 2013-03-14 | 2018-05-09 | Methods, systems, and computer readable media for evaluating variant likelihood
US18/130,134 | Pending | US20230360726A1 (en) | 2013-03-14 | 2023-04-03 | Methods, systems, and computer readable media for evaluating variant likelihood

Country Status (1)

Country Link
US (3) US20140296080A1 (en)

Also Published As

Publication number Publication date
US11636919B2 (en) 2023-04-25
US20180330051A1 (en) 2018-11-15
US20230360726A1 (en) 2023-11-09

Similar Documents

Publication Publication Date Title
US20230360726A1 (en) Methods, systems, and computer readable media for evaluating variant likelihood
US20220367005A1 (en) Methods, systems, and computer readable media for improving base calling accuracy
US20230194464A1 (en) Methods, systems, and computer readable media for making base calls in nucleic acid sequencing
US20220283117A1 (en) Methods, systems, and computer readable media for nucleic acid sequencing
US20200082907A1 (en) Methods, systems, and computer readable media for making base calls in nucleic acid sequencing
CN108292326B (en) Integrated method and system for identifying functional patient-specific somatic aberrations
US20210102249A1 (en) Control nucleic acid sequences for use in sequencing-by-synthesis and methods for designing the same
US20230193379A1 (en) Calibration panels and methods for designing the same
US20110010103A1 (en) Rapid method of pattern recognition, machine learning, and automated genotype classification through correlation analysis of dynamic signals
JP2005531853A (en) System and method for SNP genotype clustering
US20190237163A1 (en) Methods for flow space quality score prediction by neural networks
US11232851B2 (en) System and method for modeling and subtracting background signals from a melt curve
US20090176232A1 (en) Assessment of reaction kinetics compatibility between polymerase chain reactions
EP2745108B1 (en) Methods, systems, and computer readable media for making base calls in nucleic acid sequencing
US11332781B2 (en) Fitting melting curve data to determine copy number variation
Guha et al. Bayesian hidden Markov modeling of array CGH data
Kreutz Statistical Approaches for Molecular and Systems Biology
Wu et al. A Bayesian Analysis of Copy Number Variations in Array Comparative Genomic Hybridization Data
Gupta et al. Bayesian integrated modeling of expression data: a case study on RhoG
Chen et al. edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets

Legal Events

Date Code Title Description
AS Assignment
Owner name: LIFE TECHNOLOGIES CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HUBBELL, EARL;UTIRAMERUR, SOWMI;REEL/FRAME:033072/0681
Effective date: 20140609

STCB Information on status: application discontinuation
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION